# Chapter 1: Table / SSTable & TableCache
Welcome to your LevelDB journey! This is the first chapter where we'll start exploring the fundamental building blocks of LevelDB.
Imagine you're building a system to store a massive amount of data, like user profiles or product information. You need a way to save this data permanently (so it doesn't disappear when the computer turns off) and retrieve it quickly. How does LevelDB handle this?
The core idea we'll explore in this chapter is how LevelDB stores the bulk of its data on disk in special files and how it accesses them efficiently.
## What's the Problem? Storing Lots of Data Permanently
Databases need to store key-value pairs (like `user_id` -> `user_data`) persistently. This means writing the data to disk. However, disks are much slower than computer memory (RAM). If we just wrote every tiny change directly to a file, it would be very slow. Also, how do we organize the data on disk so we can find a specific key quickly without reading *everything*?
LevelDB's solution involves files called **SSTables** (Sorted String Tables), often just called **Tables** in the code.
## SSTable: The Sorted, Immutable Book on the Shelf
Think of an SSTable as a **permanently bound book** in a library.
1. **Stores Key-Value Pairs:** Just like a dictionary or an encyclopedia volume, an SSTable contains data entries, specifically key-value pairs.
2. **Sorted:** The keys inside an SSTable file are always stored in sorted order (like words in a dictionary). This is crucial for finding data quickly later on. If you're looking for the key "zebra", you know you don't need to look in the "A" section.
3. **Immutable:** Once an SSTable file is written to disk, LevelDB **never changes it**. It's like a printed book you can't erase or rewrite a page. If you need to update or delete data, LevelDB writes *new* information in *newer* SSTables. (We'll see how this works in later chapters like [Compaction](08_compaction.md)). This immutability makes many things simpler and safer.
4. **It's a File:** At the end of the day, an SSTable is just a file on your computer's disk. LevelDB gives these files names like `000005.ldb` or `000010.sst`.
Here's how LevelDB determines the filename for an SSTable:
```c++
// --- File: filename.cc ---
// Creates a filename like "dbname/000005.ldb"
std::string TableFileName(const std::string& dbname, uint64_t number) {
  assert(number > 0);
  // Uses a helper to format the number with leading zeros
  // and adds the '.ldb' or '.sst' suffix.
  return MakeFileName(dbname, number, "ldb");  // or "sst"
}
```
This simple function takes the database name (e.g., `/path/to/my/db`) and a unique number and creates the actual filename used on disk. The `.ldb` or `.sst` extension helps identify it as a LevelDB table file.
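Under the hood, the helper pads the number to six digits. A sketch of what `MakeFileName` does (simplified from filename.cc):

```c++
// Produces something like "/path/to/my/db/000005.ldb"
static std::string MakeFileName(const std::string& dbname, uint64_t number,
                                const char* suffix) {
  char buf[100];
  std::snprintf(buf, sizeof(buf), "/%06llu.%s",
                static_cast<unsigned long long>(number), suffix);
  return dbname + buf;
}
```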
## Creating SSTables: `BuildTable`
How do these sorted, immutable files get created? This happens during processes like "flushing" data from memory or during "compaction" (which we'll cover in later chapters: [MemTable](02_memtable.md) and [Compaction](08_compaction.md)).
The function responsible for writing a new SSTable file is `BuildTable`. Think of `BuildTable` as the **printing press and binding machine** for our book analogy. It takes data (often from memory, represented by an `Iterator`) and writes it out to a new, sorted SSTable file on disk.
Let's look at a simplified view of `BuildTable`:
```c++
// --- File: builder.cc ---
// Builds an SSTable file from the key/value pairs provided by 'iter'.
Status BuildTable(const std::string& dbname, Env* env, const Options& options,
                  TableCache* table_cache, Iterator* iter, FileMetaData* meta) {
  Status s;
  // ... setup: determine filename, open the file for writing ...
  std::string fname = TableFileName(dbname, meta->number);
  WritableFile* file;
  s = env->NewWritableFile(fname, &file);
  // ... handle potential errors ...

  // TableBuilder does the heavy lifting of formatting the file
  TableBuilder* builder = new TableBuilder(options, file);

  // Find the first key to store as the smallest key in metadata
  iter->SeekToFirst();
  meta->smallest.DecodeFrom(iter->key());

  // Loop through all key-value pairs from the input iterator
  Slice key;
  for (; iter->Valid(); iter->Next()) {
    key = iter->key();
    // Add the key and value to the table being built
    builder->Add(key, iter->value());
  }

  // Store the last key as the largest key in metadata
  if (!key.empty()) {
    meta->largest.DecodeFrom(key);
  }

  // Finish writing the file (adds index blocks, etc.)
  s = builder->Finish();

  // ... more steps: update metadata, sync file to disk, close file ...
  if (s.ok()) {
    meta->file_size = builder->FileSize();
    s = file->Sync();  // Ensure data is physically written
  }
  if (s.ok()) {
    s = file->Close();
  }
  // ... cleanup: delete builder, file; handle errors ...
  return s;
}
```
**Explanation:**
1. **Input:** `BuildTable` receives data via an `Iterator`. An iterator is like a cursor that lets you go through key-value pairs one by one, already in sorted order. It also gets other necessary info like the database name (`dbname`), environment (`env`), options, the `TableCache` (we'll see this next!), and a `FileMetaData` object to store information *about* the new file (like its number, size, smallest key, and largest key).
2. **File Creation:** It creates a new, empty file using `env->NewWritableFile`.
3. **TableBuilder:** It uses a helper object called `TableBuilder` to handle the complex details of formatting the SSTable file structure (data blocks, index blocks, etc.).
4. **Iteration & Adding:** It loops through the `Iterator`. For each key-value pair, it calls `builder->Add()`. Because the input `Iterator` provides keys in sorted order, the `TableBuilder` can write them sequentially to the file.
5. **Metadata:** It records the very first key (`meta->smallest`) and the very last key (`meta->largest`) it processes. This is useful later for quickly knowing the range of keys stored in this file without opening it.
6. **Finishing Up:** It calls `builder->Finish()` to write out the final pieces of the SSTable (like the index). Then it `Sync`s the file to ensure the data is safely on disk and `Close`s it.
7. **Output:** If successful, a new `.ldb` file exists on disk containing the sorted key-value pairs, and the `meta` object is filled with details about this file.
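To make this concrete, here is a rough sketch of how a MemTable flush might drive `BuildTable` (modeled on the spirit of `DBImpl::WriteLevel0Table`; members like `versions_` and `table_cache_` are assumed surrounding context, not shown here):

```c++
// Sketch: flushing the contents of a MemTable 'mem' into a new SSTable.
FileMetaData meta;
meta.number = versions_->NewFileNumber();  // pick a fresh, unique file number
Iterator* iter = mem->NewIterator();       // yields entries in sorted order
Status s = BuildTable(dbname_, env_, options_, table_cache_, iter, &meta);
delete iter;
if (s.ok() && meta.file_size > 0) {
  // Record the new file (number, size, smallest/largest keys) in the
  // version metadata so future reads know to look inside it.
}
```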
## Accessing SSTables Efficiently: `TableCache`
Okay, so we have these SSTable files on disk. But reading from disk is slow. If we need to read from the same SSTable file multiple times (which is common), opening and closing it repeatedly, or re-reading its internal index structure, would be inefficient.
This is where the `TableCache` comes in. Think of the `TableCache` as a **smart librarian**.
1. **Keeps Files Open:** The librarian might keep the most popular books near the front desk instead of running to the far shelves every time someone asks for them. Similarly, the `TableCache` keeps recently used SSTable files open.
2. **Caches Structures:** Just opening the file isn't enough. LevelDB needs to read some index information *within* the SSTable file to find keys quickly. The `TableCache` also keeps this parsed information in memory (RAM). It uses a specific caching strategy called LRU (Least Recently Used) to decide which table information to keep in memory if the cache gets full.
3. **Provides Access:** When LevelDB needs to read data from a specific SSTable (identified by its file number), it asks the `TableCache`. The cache checks if it already has that table open and ready in memory. If yes (a "cache hit"), it returns access quickly. If no (a "cache miss"), it opens the actual file from disk, reads the necessary index info, stores it in the cache for next time, and then returns access.
Let's see how the `TableCache` finds a table:
```c++
// --- File: table_cache.cc ---
// Tries to find the Table structure for a given file number.
// If not in cache, opens the file and loads it.
Status TableCache::FindTable(uint64_t file_number, uint64_t file_size,
                             Cache::Handle** handle) {
  Status s;
  // Create a key for the cache lookup (based on file number)
  char buf[sizeof(file_number)];
  EncodeFixed64(buf, file_number);
  Slice key(buf, sizeof(buf));

  // 1. Try looking up the table in the cache
  *handle = cache_->Lookup(key);
  if (*handle == nullptr) {  // Cache Miss!
    // 2. If not found, open the actual file from disk
    std::string fname = TableFileName(dbname_, file_number);
    RandomAccessFile* file = nullptr;
    Table* table = nullptr;
    s = env_->NewRandomAccessFile(fname, &file);  // Open the file
    // ... handle errors, potentially check for old .sst filename ...
    if (s.ok()) {
      // 3. Parse the Table structure (index etc.) from the file
      s = Table::Open(options_, file, file_size, &table);
    }
    if (s.ok()) {
      // 4. Store the opened file and parsed Table in the cache
      TableAndFile* tf = new TableAndFile;
      tf->file = file;
      tf->table = table;
      *handle = cache_->Insert(key, tf, 1 /*charge*/, &DeleteEntry);
    } else {
      // Error occurred, cleanup
      delete file;
      // Note: Errors are NOT cached. We'll retry opening next time.
    }
  }  // else: Cache Hit! *handle is already valid.
  return s;
}
```
**Explanation:**
1. **Lookup:** It first tries `cache_->Lookup` using the `file_number`.
2. **Cache Miss:** If `Lookup` returns `nullptr`, it means the table isn't in the cache. It then proceeds to open the file (`env_->NewRandomAccessFile`).
3. **Table::Open:** It calls `Table::Open`, which reads the file's footer, parses the index block, and sets up a `Table` object ready for lookups.
4. **Insert:** If opening and parsing succeed, it creates a `TableAndFile` struct (holding both the file handle and the `Table` object) and inserts it into the cache using `cache_->Insert`. Now, the next time `FindTable` is called for this `file_number`, it will be a cache hit.
5. **Cache Hit:** If `Lookup` initially returned a valid handle, `FindTable` simply returns `Status::OK()`, and the caller can use the handle to get the `Table` object.
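One more small piece: when an SSTable file is deleted (for example after a compaction), its cache entry must be dropped too. `TableCache::Evict` does this by erasing the entry under the same file-number key:

```c++
// --- File: table_cache.cc ---
void TableCache::Evict(uint64_t file_number) {
  char buf[sizeof(file_number)];
  EncodeFixed64(buf, file_number);  // same key format used by FindTable
  cache_->Erase(Slice(buf, sizeof(buf)));
}
```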
When LevelDB needs to read data, it often gets an `Iterator` for a specific SSTable via the `TableCache`:
```c++
// --- File: table_cache.cc ---
// Returns an iterator for reading the specified SSTable file.
Iterator* TableCache::NewIterator(const ReadOptions& options,
                                  uint64_t file_number, uint64_t file_size,
                                  Table** tableptr) {
  // ... setup ...
  Cache::Handle* handle = nullptr;
  // Use FindTable to get the Table object (from cache or by opening file)
  Status s = FindTable(file_number, file_size, &handle);
  if (!s.ok()) {
    return NewErrorIterator(s);  // Return an iterator that yields the error
  }

  // Get the Table object from the cache handle
  Table* table = reinterpret_cast<TableAndFile*>(cache_->Value(handle))->table;
  // Ask the Table object to create a new iterator for its data
  Iterator* result = table->NewIterator(options);
  // Important: Register cleanup to release the cache handle when iterator is done
  result->RegisterCleanup(&UnrefEntry, cache_, handle);
  // Optionally return the Table object itself
  if (tableptr != nullptr) {
    *tableptr = table;
  }
  return result;
}
}
```
This function uses `FindTable` to get the `Table` object (either from the cache or by loading it from disk) and then asks that `Table` object to provide an `Iterator` to step through its key-value pairs. It also cleverly registers a cleanup function (`UnrefEntry`) so that when the iterator is no longer needed, the cache handle is released, allowing the cache to potentially evict the table later if needed.
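For completeness, the two cleanup callbacks referenced above are tiny. In table_cache.cc they look essentially like this:

```c++
// --- File: table_cache.cc ---
// Called by the cache when it evicts a TableAndFile entry.
static void DeleteEntry(const Slice& key, void* value) {
  TableAndFile* tf = reinterpret_cast<TableAndFile*>(value);
  delete tf->table;  // frees the parsed index structures
  delete tf->file;   // closes the underlying file handle
  delete tf;
}

// Called when an iterator from NewIterator is destroyed.
static void UnrefEntry(void* arg1, void* arg2) {
  Cache* cache = reinterpret_cast<Cache*>(arg1);
  Cache::Handle* h = reinterpret_cast<Cache::Handle*>(arg2);
  cache->Release(h);  // drop the reference taken by FindTable
}
```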
Here's a diagram showing how a read might use the `TableCache`:
```mermaid
sequenceDiagram
participant Client as Read Operation
participant TableCache
participant Cache as LRUCache
participant OS/FileSystem as FS
participant TableObject as In-Memory Table Rep
Client->>TableCache: Get("some_key", file_num=5, size=1MB)
TableCache->>Cache: Lookup(file_num=5)?
alt Cache Hit
Cache-->>TableCache: Return handle for Table 5
TableCache->>TableObject: Find "some_key" within Table 5 data
TableObject-->>TableCache: Return value / not found
TableCache-->>Client: Return value / not found
else Cache Miss
Cache-->>TableCache: Not found (nullptr)
TableCache->>FS: Open file "000005.ldb"
FS-->>TableCache: Return file handle
TableCache->>TableObject: Create Table 5 representation from file handle + size
TableObject-->>TableCache: Return Table 5 object
TableCache->>Cache: Insert(file_num=5, Table 5 object)
Note right of Cache: Table 5 now cached
TableCache->>TableObject: Find "some_key" within Table 5 data
TableObject-->>TableCache: Return value / not found
TableCache-->>Client: Return value / not found
end
```
## Conclusion
In this chapter, we learned about two fundamental concepts in LevelDB:
1. **SSTable (Table):** These are the immutable, sorted files on disk where LevelDB stores the bulk of its key-value data. Think of them as sorted, bound books. They are created using `BuildTable`.
2. **TableCache:** This acts like an efficient librarian for SSTables. It keeps recently used tables open and their index structures cached in memory (RAM) to speed up access, avoiding slow disk reads whenever possible. It provides access to table data, often via iterators.
These two components work together to provide persistent storage and relatively fast access to the data within those files.
But where does the data *come from* before it gets written into an SSTable? Often, it lives in memory first. In the next chapter, we'll look at the in-memory structure where recent writes are held before being flushed to an SSTable.
Next up: [Chapter 2: MemTable](02_memtable.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 2: MemTable
In [Chapter 1: Table / SSTable & TableCache](01_table___sstable___tablecache.md), we learned how LevelDB stores the bulk of its data permanently on disk in sorted, immutable files called SSTables. We also saw how the `TableCache` helps access these files efficiently.
But imagine you're updating your data frequently: adding new users, changing scores, deleting temporary items. Writing every tiny change directly to a new SSTable file on disk would be incredibly slow, like carving every single note onto a stone tablet! We need a faster way to handle recent changes.
## What's the Problem? Slow Disk Writes for Every Change
Disk drives (even fast SSDs) are much slower than your computer's main memory (RAM). If LevelDB wrote every `Put` or `Delete` operation straight to an SSTable file, your application would constantly be waiting for the disk, making it feel sluggish.
How can we accept new writes quickly but still eventually store them permanently on disk?
## MemTable: The Fast In-Memory Notepad
LevelDB's solution is the **MemTable**. Think of it as a **temporary notepad** or a **scratchpad** that lives entirely in your computer's fast RAM.
1. **In-Memory:** It's stored in RAM, making reads and writes extremely fast.
2. **Holds Recent Writes:** When you `Put` a new key-value pair or `Delete` a key, the change goes into the MemTable first.
3. **Sorted:** Just like SSTables, the data inside the MemTable is kept sorted by key. This is important for efficiency later.
4. **Temporary:** It's only a temporary holding area. Eventually, its contents get written out to a permanent SSTable file on disk.
So, when you write data:
*Your Application* -> `Put("user123", "data")` -> **MemTable** (Fast RAM write!)
This makes write operations feel almost instantaneous to your application.
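From your application's point of view, all of this hides behind the standard API. A quick usage sketch (the path is illustrative):

```c++
#include <cassert>
#include <string>
#include "leveldb/db.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/testdb", &db);
  assert(s.ok());

  // Returns as soon as the write is safely in the MemTable:
  s = db->Put(leveldb::WriteOptions(), "user123", "data");

  // Reads check the MemTable first, so recent writes come back fast:
  std::string value;
  s = db->Get(leveldb::ReadOptions(), "user123", &value);

  delete db;
  return 0;
}
```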
## How Reads Use the MemTable
When you try to read data using `Get(key)`, LevelDB is smart. It knows the most recent data might still be on the "notepad" (MemTable). So, it checks there *first*:
1. **Check MemTable:** Look for the key in the current MemTable.
* If the key is found, return the value immediately (super fast!).
* If a "deletion marker" for the key is found, stop and report "Not Found" (the key was recently deleted).
2. **Check Older MemTable (Immutable):** If there's an older MemTable being flushed (we'll cover this next), check that too.
3. **Check SSTables:** If the key wasn't found in memory (or wasn't deleted there), *then* LevelDB looks for it in the SSTable files on disk, using the [Table / SSTable & TableCache](01_table___sstable___tablecache.md) we learned about in Chapter 1.
This "check memory first" strategy ensures that you always read the most up-to-date value, even if it hasn't hit the disk yet.
```mermaid
sequenceDiagram
participant Client as App Read (Get)
participant LevelDB
participant MemTable as Active MemTable (RAM)
participant ImMemTable as Immutable MemTable (RAM, if exists)
participant TableCache as SSTable Cache (Disk/RAM)
Client->>LevelDB: Get("some_key")
LevelDB->>MemTable: Have "some_key"?
alt Key found in Active MemTable
MemTable-->>LevelDB: Yes, value is "xyz"
LevelDB-->>Client: Return "xyz"
else Key Deleted in Active MemTable
MemTable-->>LevelDB: Yes, it's deleted
LevelDB-->>Client: Return NotFound
else Not in Active MemTable
MemTable-->>LevelDB: No
LevelDB->>ImMemTable: Have "some_key"?
alt Key found in Immutable MemTable
ImMemTable-->>LevelDB: Yes, value is "abc"
LevelDB-->>Client: Return "abc"
else Key Deleted in Immutable MemTable
ImMemTable-->>LevelDB: Yes, it's deleted
LevelDB-->>Client: Return NotFound
else Not in Immutable MemTable
ImMemTable-->>LevelDB: No
LevelDB->>TableCache: Get("some_key") from SSTables
TableCache-->>LevelDB: Found "old_value" / NotFound
LevelDB-->>Client: Return "old_value" / NotFound
end
end
```
## What Happens When the Notepad Fills Up?
The MemTable lives in RAM, which is limited. We can't just keep adding data to it forever. LevelDB has a configured size limit for the MemTable (`options.write_buffer_size`, 4MB by default).
When the MemTable gets close to this size:
1. **Freeze!** LevelDB declares the current MemTable "immutable" (meaning read-only). No new writes go into this specific MemTable anymore. Let's call it `imm_` (Immutable MemTable).
2. **New Notepad:** LevelDB immediately creates a *new*, empty MemTable (`mem_`) to accept incoming writes. Your application doesn't pause; new writes just start going to the fresh MemTable.
3. **Flush to Disk:** A background task starts working on the frozen `imm_`. It reads all the sorted key-value pairs from `imm_` and uses the `BuildTable` process (from [Chapter 1](01_table___sstable___tablecache.md)) to write them into a brand new SSTable file on disk. This new file becomes part of "Level-0" (we'll learn more about levels in [Chapter 8: Compaction](08_compaction.md)).
4. **Discard:** Once the `imm_` is successfully written to the SSTable file, the in-memory `imm_` is discarded, freeing up RAM.
This process ensures that writes are always fast (going to the *new* `mem_`) while the *old* data is efficiently flushed to disk in the background.
```mermaid
graph TD
subgraph Writes
A[Incoming Writes: Put/Delete] --> B(Active MemTable mem_);
end
subgraph MemTable Full
B -- Reaches Size Limit --> C{Freeze mem_ -> becomes imm_};
C --> D(Create New Empty mem_);
A --> D;
C --> E{Background Flush};
end
subgraph Background Flush
E -- Reads Data --> F(Immutable MemTable imm_);
F -- Uses BuildTable --> G([Level-0 SSTable on Disk]);
G -- Flush Complete --> H{Discard imm_};
end
style G fill:#f9f,stroke:#333,stroke-width:2px
```
## Under the Hood: Keeping it Sorted with a SkipList
We mentioned that the MemTable keeps keys sorted. Why?
1. **Efficient Flushing:** When flushing the MemTable to an SSTable, the data needs to be written in sorted order. If the MemTable is already sorted, this is very efficient: we just read through it sequentially.
2. **Efficient Reads:** Keeping it sorted allows for faster lookups within the MemTable itself.
How does LevelDB keep the MemTable sorted while allowing fast inserts? It uses a clever data structure called a **SkipList**.
Imagine a sorted linked list. To find an element, you might have to traverse many nodes. Now, imagine adding some "express lanes" (higher-level links) that skip over several nodes at a time. You can use these express lanes to quickly get close to your target, then drop down to the detailed level (the base list) to find the exact spot. This is the core idea of a SkipList!
* **Fast Inserts:** Adding a new item is generally fast.
* **Fast Lookups:** Finding an item is much faster than a simple linked list, often close to the speed of more complex balanced trees.
* **Efficient Iteration:** Reading all items in sorted order (needed for flushing) is straightforward.
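Here's a sketch of that search, simplified from the template code in skiplist.h (`head_`, `compare_`, and `Node` are the SkipList's internals; the real version also tracks predecessor nodes to support inserts):

```c++
// Find the first node whose key is >= 'key', using the express lanes.
Node* FindGreaterOrEqual(const Key& key) const {
  Node* x = head_;
  int level = GetMaxHeight() - 1;  // start on the highest express lane
  while (true) {
    Node* next = x->Next(level);
    if (next != nullptr && compare_(next->key, key) < 0) {
      x = next;      // still before 'key': keep skipping forward
    } else if (level == 0) {
      return next;   // base list reached: first node >= key (or nullptr)
    } else {
      level--;       // overshot: drop to a denser lane and refine
    }
  }
}
```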
The MemTable essentially wraps a SkipList provided by `skiplist.h`.
```c++
// --- File: db/memtable.h ---
#include "db/skiplist.h" // The SkipList data structure
#include "util/arena.h" // Memory allocator
class MemTable {
private:
// The core data structure: a SkipList.
// The Key is 'const char*' pointing into the Arena.
// KeyComparator helps compare keys correctly (we'll see this later).
typedef SkipList<const char*, KeyComparator> Table;
Arena arena_; // Allocates memory for nodes efficiently
Table table_; // The actual SkipList instance
int refs_; // Reference count for managing lifetime
// ... other members like KeyComparator ...
public:
// Add an entry (Put or Delete marker)
void Add(SequenceNumber seq, ValueType type, const Slice& key,
const Slice& value);
// Look up a key
bool Get(const LookupKey& key, std::string* value, Status* s);
// Create an iterator to scan the MemTable's contents
Iterator* NewIterator();
// Estimate memory usage
size_t ApproximateMemoryUsage();
// Constructor, Ref/Unref omitted for brevity...
};
```
This header shows the `MemTable` class uses an `Arena` for memory management and a `Table` (which is a `SkipList`) to store the data.
## Adding and Getting Data (Code View)
Let's look at simplified versions of `Add` and `Get`.
**Adding an Entry:**
When you call `db->Put(key, value)` or `db->Delete(key)`, it eventually calls `MemTable::Add`.
```c++
// --- File: db/memtable.cc ---
void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key,
                   const Slice& value) {
  // Calculate size needed for the entry in the skiplist.
  // Format includes key size, key, sequence number + type tag, value size, value.
  size_t key_size = key.size();
  size_t val_size = value.size();
  size_t internal_key_size = key_size + 8;  // 8 bytes for seq + type
  const size_t encoded_len = VarintLength(internal_key_size) +
                             internal_key_size + VarintLength(val_size) +
                             val_size;

  // Allocate memory from the Arena
  char* buf = arena_.Allocate(encoded_len);

  // Encode the entry into the buffer 'buf' (details omitted)
  // Format: [key_len][key_bytes][seq_num|type][value_len][value_bytes]
  // ... encoding logic ...

  // Insert the buffer pointer into the SkipList. The SkipList uses the
  // KeyComparator to know how to sort based on the encoded format.
  table_.Insert(buf);
}
```
**Explanation:**
1. **Calculate Size:** Determines how much memory is needed to store the key, value, sequence number, and type. (We'll cover sequence numbers and internal keys in [Chapter 9](09_internalkey___dbformat.md)).
2. **Allocate:** Gets a chunk of memory from the `Arena`. Arenas are efficient allocators for many small objects with similar lifetimes.
3. **Encode:** Copies the key, value, and metadata into the allocated buffer (`buf`).
4. **Insert:** Calls `table_.Insert(buf)`, where `table_` is the SkipList. The SkipList takes care of finding the correct sorted position and linking the new entry.
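For the curious, the elided encoding step follows the format comment almost literally, using the varint/fixed-int helpers from util/coding.h:

```c++
// Sketch of the encoding logic inside MemTable::Add (see memtable.cc):
char* p = EncodeVarint32(buf, internal_key_size);  // [key_len]
std::memcpy(p, key.data(), key_size);              // [key_bytes]
p += key_size;
EncodeFixed64(p, (s << 8) | type);                 // [seq_num|type] packed in 8 bytes
p += 8;
p = EncodeVarint32(p, val_size);                   // [value_len]
std::memcpy(p, value.data(), val_size);            // [value_bytes]
assert(p + val_size == buf + encoded_len);         // buffer exactly filled
```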
**Getting an Entry:**
When you call `db->Get(key)`, it checks the MemTable first using `MemTable::Get`.
```c++
// --- File: db/memtable.cc ---
bool MemTable::Get(const LookupKey& lkey, std::string* value, Status* s) {
  // Get the specially formatted key to search for in the MemTable.
  Slice memkey = lkey.memtable_key();

  // Create an iterator for the SkipList.
  Table::Iterator iter(&table_);
  // Seek to the first entry >= the key we are looking for.
  iter.Seek(memkey.data());

  if (iter.Valid()) {  // Did we find something at or after our key?
    // Decode the key found in the SkipList
    const char* entry = iter.key();
    // ... decode logic to get user_key, sequence, type ...
    Slice found_user_key = /* decoded user key */;
    ValueType found_type = /* decoded type */;

    // Check if the user key matches exactly
    if (comparator_.comparator.user_comparator()->Compare(
            found_user_key, lkey.user_key()) == 0) {
      // It's the right key! Check the type.
      if (found_type == kTypeValue) {  // Is it a Put record?
        // Decode the value and return it
        Slice v = /* decoded value */;
        value->assign(v.data(), v.size());
        return true;  // Found the value!
      } else {  // Must be kTypeDeletion
        // Found a deletion marker for this key. Report "NotFound".
        *s = Status::NotFound(Slice());
        return true;  // Found a deletion!
      }
    }
  }
  // Key not found in this MemTable
  return false;
}
```
**Explanation:**
1. **Get Search Key:** Prepares the key in the format used internally by the MemTable (`LookupKey`).
2. **Create Iterator:** Gets a `SkipList::Iterator`.
3. **Seek:** Uses the iterator's `Seek` method to efficiently find the first entry in the SkipList whose key is greater than or equal to the search key.
4. **Check Found Entry:** If `Seek` finds an entry (`iter.Valid()`):
* It decodes the entry found in the SkipList.
* It compares the *user* part of the key to ensure it's an exact match (not just the next key in sorted order).
* If the keys match, it checks the `type`:
* If it's `kTypeValue`, it decodes the value and returns `true`.
* If it's `kTypeDeletion`, it sets the status to `NotFound` and returns `true` (indicating we found definitive information about the key: it's deleted).
5. **Not Found:** If no matching key is found, it returns `false`.
## Conclusion
The **MemTable** is LevelDB's crucial in-memory cache for recent writes. It acts like a fast notepad:
* Accepts new `Put` and `Delete` operations quickly in RAM.
* Keeps entries sorted using an efficient **SkipList**.
* Allows recent data to be read quickly without touching the disk.
* When full, it's frozen, flushed to a new Level-0 **SSTable** file on disk in the background, and then discarded.
This design allows LevelDB to provide very fast write performance while still ensuring data is eventually persisted safely to disk.
However, what happens if the power goes out *after* data is written to the MemTable but *before* it's flushed to an SSTable? Isn't the data in RAM lost? To solve this, LevelDB uses another component alongside the MemTable: the Write-Ahead Log (WAL).
Next up: [Chapter 3: Write-Ahead Log (WAL) & LogWriter/LogReader](03_write_ahead_log__wal____logwriter_logreader.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 3: Write-Ahead Log (WAL) & LogWriter/LogReader
In [Chapter 2: MemTable](02_memtable.md), we saw how LevelDB uses an in-memory `MemTable` (like a fast notepad) to quickly accept new writes (`Put` or `Delete`) before they are eventually flushed to an [SSTable](01_table___sstable___tablecache.md) file on disk.
This is great for speed! But what if the unthinkable happens? Imagine you've just written some important data. It's sitting safely in the `MemTable` in RAM, but *before* LevelDB gets a chance to write it to a permanent SSTable file, the power cord gets kicked out, or the server crashes!
Uh oh. Since RAM is volatile, anything in the `MemTable` that hadn't been saved to disk is **gone** forever when the power goes out. That's not very reliable for a database!
## What's the Problem? Losing Data on Crashes
How can LevelDB make sure that once your write operation *returns successfully*, the data is safe, even if the system crashes immediately afterwards? Relying only on the `MemTable` isn't enough because it lives in volatile RAM. We need a way to make writes durable (permanent) much sooner.
## Write-Ahead Log (WAL): The Database's Safety Journal
LevelDB's solution is the **Write-Ahead Log (WAL)**, often just called the **log**.
Think of the WAL as a **ship's logbook** or a **court reporter's transcript**.
1. **Write First:** Before the captain takes any significant action (like changing course), they write it down in the logbook *first*. Similarly, before LevelDB modifies the `MemTable` (which is in RAM), it **first appends** a description of the change (e.g., "Put key 'user1' with value 'dataA'") to a special file on disk: the WAL file.
2. **Append-Only:** Like a logbook, entries are just added sequentially to the end. LevelDB doesn't go back and modify old entries in the current WAL file. This makes writing very fast: it's just appending to the end of a file.
3. **On Disk:** Crucially, this WAL file lives on the persistent disk (HDD or SSD), not just in volatile RAM.
4. **Durability:** By writing to the WAL *before* acknowledging a write to the user, LevelDB ensures that even if the server crashes immediately after, the record of the operation is safely stored on disk in the log.
So, the write process looks like this:
*Your Application* -> `Put("user123", "data")` -> **1. Append to WAL file (Disk)** -> **2. Add to MemTable (RAM)** -> *Return Success*
```mermaid
sequenceDiagram
participant App as Application
participant LevelDB
participant WAL as WAL File (Disk)
participant MemTable as MemTable (RAM)
App->>LevelDB: Put("key", "value")
LevelDB->>WAL: Append Put("key", "value") Record
Note right of WAL: Physical disk write
WAL-->>LevelDB: Append successful
LevelDB->>MemTable: Add("key", "value")
MemTable-->>LevelDB: Add successful
LevelDB-->>App: Write successful
```
This "write-ahead" step ensures durability.
## What Happens During Recovery? Replaying the Logbook
Now, let's say the server crashes and restarts. LevelDB needs to recover its state. How does the WAL help?
1. **Check for Log:** When LevelDB starts up, it looks for a WAL file.
2. **Read the Log:** If a WAL file exists, it means the database might not have shut down cleanly, and the last `MemTable`'s contents (which were only in RAM) were lost. LevelDB creates a `LogReader` to read through the WAL file from beginning to end.
3. **Rebuild MemTable:** For each operation record found in the WAL (like "Put key 'user1' value 'dataA'", "Delete key 'user2'"), LevelDB re-applies that operation to a *new*, empty `MemTable` in memory. It's like rereading the ship's logbook to reconstruct what happened right before the incident.
4. **Recovery Complete:** Once the entire WAL is replayed, the `MemTable` is back to the state it was in right before the crash. LevelDB can now continue operating normally, accepting new reads and writes. The data from the WAL is now safely in the new `MemTable`, ready to be flushed to an SSTable later.
The WAL file essentially acts as a temporary backup for the `MemTable` until the `MemTable`'s contents are permanently stored in an SSTable. Once a `MemTable` is successfully flushed to an SSTable, the corresponding WAL file is no longer needed and can be deleted.
## LogWriter: Appending to the Log
The component responsible for writing records to the WAL file is `log::Writer`. Think of it as the dedicated writer making entries in our ship's logbook.
When LevelDB processes a write operation (often coming from a [WriteBatch](05_writebatch.md), which we'll see later), it serializes the batch of changes into a single chunk of data (a `Slice`) and asks the `log::Writer` to add it to the current log file.
```c++
// --- Simplified from db/db_impl.cc ---
// Inside DBImpl::Write(...) after preparing the batch:
Status status = log_->AddRecord(WriteBatchInternal::Contents(write_batch));
// ... check status ...
if (status.ok() && options.sync) {
  // Optionally ensure the data hits the physical disk
  status = logfile_->Sync();
}
if (status.ok()) {
  // Only if WAL write succeeded, apply to MemTable
  status = WriteBatchInternal::InsertInto(write_batch, mem_);
}
// ... handle status ...
```
**Explanation:**
1. `WriteBatchInternal::Contents(write_batch)`: Gets the serialized representation of the write operations (like one or more Puts/Deletes).
2. `log_->AddRecord(...)`: Calls the `log::Writer` instance (`log_`) to append this serialized data as a single record to the current WAL file (`logfile_`).
3. `logfile_->Sync()`: If the `sync` option is set (it is off by default; enabling it trades write throughput for stronger crash durability), this command tells the operating system to *really* make sure the data written to the log file has reached the physical disk platters/flash, not just sitting in some OS buffer. This is crucial for surviving power loss.
4. `WriteBatchInternal::InsertInto(write_batch, mem_)`: Only *after* the log write is confirmed (and synced, if requested) does LevelDB apply the changes to the in-memory `MemTable`.
The `log::Writer` itself handles the details of how records are actually formatted within the log file. Log files are composed of fixed-size blocks (e.g., 32KB). A single record from `AddRecord` might be small enough to fit entirely within the remaining space in the current block, or it might be large and need to be split (fragmented) across multiple physical records spanning block boundaries.
```c++
// --- Simplified from db/log_writer.cc ---
Status Writer::AddRecord(const Slice& slice) {
  const char* ptr = slice.data();
  size_t left = slice.size();  // How much data is left to write?
  Status s;
  bool begin = true;  // Is this the first fragment of this record?
  do {
    const int leftover = kBlockSize - block_offset_;  // Space left in current block
    // ... if leftover < kHeaderSize, fill trailer and start new block ...

    // Calculate how much of the data can fit in this block
    const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
    const size_t fragment_length = (left < avail) ? left : avail;

    // Determine the type of this physical record (fragment)
    RecordType type;
    const bool end = (left == fragment_length);  // Is this the last fragment?
    if (begin && end) {
      type = kFullType;    // Fits entirely in one piece
    } else if (begin) {
      type = kFirstType;   // First piece of a multi-piece record
    } else if (end) {
      type = kLastType;    // Last piece of a multi-piece record
    } else {
      type = kMiddleType;  // Middle piece of a multi-piece record
    }

    // Write this physical record (header + data fragment) to the file
    s = EmitPhysicalRecord(type, ptr, fragment_length);

    // Advance pointers and update remaining size
    ptr += fragment_length;
    left -= fragment_length;
    begin = false;  // Subsequent fragments are not the 'begin' fragment
  } while (s.ok() && left > 0);  // Loop until all data is written or error
  return s;
}

// Simplified - Writes header (checksum, length, type) and payload
Status Writer::EmitPhysicalRecord(RecordType t, const char* ptr, size_t length) {
  // ... format header (buf) with checksum, length, type ...
  // ... compute checksum ...
  // ... Encode checksum into header ...

  // Write header and payload fragment
  Status s = dest_->Append(Slice(buf, kHeaderSize));
  if (s.ok()) {
    s = dest_->Append(Slice(ptr, length));
    // LevelDB might Flush() here or let the caller Sync() later
  }
  block_offset_ += kHeaderSize + length;  // Update position in current block
  return s;
}
```
**Explanation:**
* The `AddRecord` method takes the user's data (`slice`) and potentially breaks it into smaller `fragment_length` chunks.
* Each chunk is written as a "physical record" using `EmitPhysicalRecord`.
* `EmitPhysicalRecord` prepends a small header (`kHeaderSize`, 7 bytes) containing a checksum (for detecting corruption), the length of this fragment, and the `RecordType` (`kFullType`, `kFirstType`, `kMiddleType`, or `kLastType`).
* The `RecordType` tells the `LogReader` later how to reassemble these fragments back into the original complete record.
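Putting that together, each physical record on disk is a 7-byte header followed by its payload, packed into 32KB blocks:

```
<--------------------- one physical record -------------------->
+--------------+--------------+------------+--------------------+
| CRC32C (4 B) | length (2 B) | type (1 B) | payload (length B) |
+--------------+--------------+------------+--------------------+
```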
## LogReader: Reading the Log for Recovery
The counterpart to `LogWriter` is `log::Reader`. This is the component used during database startup (recovery) to read the records back from a WAL file. Think of it as the person carefully reading the ship's logbook after an incident.
The `log::Reader` reads the log file sequentially, block by block. It parses the physical record headers, verifies checksums, and pieces together the fragments (`kFirstType`, `kMiddleType`, `kLastType`) to reconstruct the original data records that were passed to `AddRecord`.
```c++
// --- Simplified from db/db_impl.cc ---
// Inside DBImpl::RecoverLogFile(...)
// Create the log reader for the specific log file number
std::string fname = LogFileName(dbname_, log_number);
SequentialFile* file;
Status status = env_->NewSequentialFile(fname, &file);
// ... check status ...
// Set up reporter for corruption errors
log::Reader::Reporter reporter;
// ... initialize reporter ...
log::Reader reader(file, &reporter, true /*checksum*/, 0 /*initial_offset*/);
// Read records one by one and apply them to a temporary MemTable
std::string scratch;
Slice record;
WriteBatch batch;
MemTable* mem = new MemTable(internal_comparator_);
mem->Ref();
while (reader.ReadRecord(&record, &scratch) && status.ok()) {
  // record now holds a complete record originally passed to AddRecord
  // Parse the record back into a WriteBatch
  WriteBatchInternal::SetContents(&batch, record);

  // Apply the operations from the batch to the MemTable
  status = WriteBatchInternal::InsertInto(&batch, mem);
  // ... check status ...

  // Update the max sequence number seen
  const SequenceNumber last_seq = /* ... get from batch ... */;
  if (last_seq > *max_sequence) {
    *max_sequence = last_seq;
  }

  // Optional: If MemTable gets too big during recovery, flush it
  if (mem->ApproximateMemoryUsage() > options_.write_buffer_size) {
    status = WriteLevel0Table(mem, edit, nullptr);  // Flush to SSTable
    mem->Unref();
    mem = new MemTable(internal_comparator_);
    mem->Ref();
    // ... check status ...
  }
}
delete file;  // Close the log file
// ... handle final MemTable (mem) if not null ...
```
**Explanation:**
1. A `log::Reader` is created, pointing to the WAL file (`.log`) that needs recovery.
2. The code loops using `reader.ReadRecord(&record, &scratch)`.
* `record`: This `Slice` will point to the reassembled data of the next complete logical record found in the log.
* `scratch`: A temporary string buffer the reader might use if a record spans multiple blocks.
3. Inside the loop:
* The `record` (which contains a serialized `WriteBatch`) is parsed back into a `WriteBatch` object.
* `WriteBatchInternal::InsertInto(&batch, mem)` applies the operations (Puts/Deletes) from the recovered batch to the in-memory `MemTable` (`mem`).
* The code keeps track of the latest sequence number encountered.
* Optionally, if the `MemTable` fills up *during* recovery, it can be flushed to an SSTable just like during normal operation.
4. This continues until `ReadRecord` returns `false` (end of log file) or an error occurs.
The `log::Reader::ReadRecord` implementation handles the details of reading blocks, finding headers, checking checksums, and combining `kFirstType`, `kMiddleType`, `kLastType` fragments.
```c++
// --- Simplified from db/log_reader.cc ---
// Reads the next complete logical record. Returns true if successful.
bool Reader::ReadRecord(Slice* record, std::string* scratch) {
  // ... skip records before initial_offset if necessary ...
  scratch->clear();
  record->clear();
  bool in_fragmented_record = false;
  Slice fragment;  // To hold data from one physical record

  while (true) {
    // Reads the next physical record (header + data fragment) from the file blocks.
    // Handles reading across block boundaries internally.
    const unsigned int record_type = ReadPhysicalRecord(&fragment);
    // ... handle resyncing logic after seeking ...

    switch (record_type) {
      case kFullType:
        // ... sanity check for unexpected fragments ...
        *record = fragment;  // Got a complete record in one piece
        return true;

      case kFirstType:
        // ... sanity check for unexpected fragments ...
        scratch->assign(fragment.data(), fragment.size());  // Start of a new fragmented record
        in_fragmented_record = true;
        break;

      case kMiddleType:
        if (!in_fragmented_record) { /* Report corruption */ }
        else { scratch->append(fragment.data(), fragment.size()); }  // Append middle piece
        break;

      case kLastType:
        if (!in_fragmented_record) { /* Report corruption */ }
        else {
          scratch->append(fragment.data(), fragment.size());  // Append final piece
          *record = Slice(*scratch);  // Reassembled record is complete
          return true;
        }
        break;

      case kEof:
        return false;  // End of log file

      case kBadRecord:
        // ... report corruption, clear state ...
        in_fragmented_record = false;
        scratch->clear();
        break;  // Try to find the next valid record

      default:
        // ... report corruption ...
        in_fragmented_record = false;
        scratch->clear();
        break;  // Try to find the next valid record
    }
  }
}
}
```
**Explanation:**
* `ReadRecord` calls `ReadPhysicalRecord` repeatedly in a loop.
* `ReadPhysicalRecord` (internal helper, not shown in full) reads from the file, parses the 7-byte header, checks the CRC, and returns the type and the data fragment (`result`). It handles skipping block trailers and reading new blocks as needed.
* Based on the `record_type`, `ReadRecord` either returns the complete record (`kFullType`), starts assembling fragments (`kFirstType`), appends fragments (`kMiddleType`), or finishes assembling and returns the record (`kLastType`).
* It manages the `scratch` buffer to hold the fragments being assembled.
## Recovery Process Diagram
Here's how the WAL is used during database startup if a crash occurred:
```mermaid
sequenceDiagram
participant App as Application Startup
participant LevelDB as DB::Open()
participant Env as Environment (OS/FS)
participant LogReader as log::Reader
participant MemTable as New MemTable (RAM)
App->>LevelDB: Open Database
LevelDB->>Env: Check for CURRENT file, MANIFEST, etc.
LevelDB->>Env: Look for .log files >= Manifest LogNumber
alt Log file(s) found
LevelDB->>LogReader : Create Reader for log file
loop Read Log Records
LogReader ->> Env: Read next block(s) from log file
Env-->>LogReader: Return data
LogReader ->> LogReader : Parse physical records, reassemble logical record
alt Record Found
LogReader -->> LevelDB: Return next record (WriteBatch data)
LevelDB ->> MemTable: Apply WriteBatch to MemTable
else End of Log or Error
LogReader -->> LevelDB: Indicate EOF / Error
Note right of LevelDB: Loop will exit
end
end
LevelDB ->> LogReader : Destroy Reader
Note right of LevelDB: MemTable now holds recovered state.
else No relevant log files
Note right of LevelDB: Clean shutdown or new DB. No log replay needed.
end
LevelDB-->>App: Database Opened Successfully
```
## Conclusion
The **Write-Ahead Log (WAL)** is a critical component for ensuring **durability** in LevelDB. By writing every operation to an append-only log file on disk *before* applying it to the in-memory `MemTable` and acknowledging the write, LevelDB guarantees that no acknowledged data is lost even if the server crashes.
* The `log::Writer` handles appending records to the current WAL file, dealing with block formatting and fragmentation.
* The `log::Reader` handles reading records back from the WAL file during recovery, verifying checksums and reassembling fragmented records.
* This recovery process replays the logged operations to rebuild the `MemTable` state that was lost in the crash.
The WAL, MemTable, and SSTables work together: WAL provides fast durability for recent writes, MemTable provides fast access to those recent writes in memory, and SSTables provide persistent, sorted storage for the bulk of the data.
Now that we understand the core storage structures (SSTables, MemTable, WAL), we can start looking at how they are managed and coordinated.
Next up: [Chapter 4: DBImpl](04_dbimpl.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

# Chapter 4: DBImpl - The Database General Manager
In the previous chapters, we've explored some key ingredients of LevelDB:
* [SSTables](01_table___sstable___tablecache.md) for storing data permanently on disk.
* The [MemTable](02_memtable.md) for quickly handling recent writes in memory.
* The [Write-Ahead Log (WAL)](03_write_ahead_log__wal____logwriter_logreader.md) for ensuring durability even if the system crashes.
But how do all these pieces work together? Who tells LevelDB to write to the WAL first, *then* the MemTable? Who decides when the MemTable is full and needs to be flushed to an SSTable? Who coordinates reading data from both memory *and* disk files?
## What's the Problem? Orchestrating Everything
Imagine a large library. You have librarians putting books on shelves (SSTables), a front desk clerk taking newly returned books (MemTable), and a security guard logging everyone who enters (WAL). But someone needs to be in charge of the whole operation: the **General Manager**.
This manager doesn't shelve every book themselves, but they direct the staff, manage the budget, decide when to rearrange sections (compaction), and handle emergencies (recovery). Without a manager, it would be chaos!
LevelDB needs a similar central coordinator to manage all its different parts and ensure they work together smoothly and correctly.
## DBImpl: The General Manager of LevelDB
The `DBImpl` class is the heart of LevelDB's implementation. It's the **General Manager** of our database library. It doesn't *contain* the data itself (that's in MemTables and SSTables), but it **orchestrates** almost every operation.
* It takes requests from your application (like `Put`, `Get`, `Delete`).
* It directs these requests to the right components (WAL, MemTable, TableCache).
* It manages the state of the database (like which MemTable is active, which files exist).
* It initiates and manages background tasks like flushing the MemTable and running compactions.
* It handles the recovery process when the database starts up.
Almost every interaction you have with a LevelDB database object ultimately goes through `DBImpl`.
## Key Responsibilities of DBImpl
Think of the `DBImpl` general manager juggling several key tasks:
1. **Handling Writes (`Put`, `Delete`, `Write`):** Ensuring data is safely written to the WAL and then the MemTable. Managing the process when the MemTable fills up.
2. **Handling Reads (`Get`, `NewIterator`):** Figuring out where to find the requested data: checking the active MemTable, the soon-to-be-flushed immutable MemTable, and finally the various SSTable files on disk (using helpers like [Version & VersionSet](06_version___versionset.md) and [Table / SSTable & TableCache](01_table___sstable___tablecache.md)).
3. **Background Maintenance ([Compaction](08_compaction.md)):** Deciding when and how to run compactions to clean up old data, merge SSTables, and keep reads efficient. It schedules and oversees this background work.
4. **Startup and Recovery:** When the database opens, `DBImpl` manages locking the database directory, reading the manifest file ([Version & VersionSet](06_version___versionset.md)), and replaying the [WAL](03_write_ahead_log__wal____logwriter_logreader.md) to recover any data that wasn't flushed before the last shutdown or crash.
5. **Snapshot Management:** Handling requests to create and release snapshots, which provide a consistent view of the database at a specific point in time.
`DBImpl` uses other components extensively to perform these tasks. It holds references to the active MemTable (`mem_`), the immutable MemTable (`imm_`), the WAL (`log_`), the `TableCache`, and the `VersionSet` (which tracks all the SSTable files).
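As a small taste of that last responsibility, snapshots are part of the public API that `DBImpl` implements:

```c++
// Pin a consistent point-in-time view, read from it, then release it.
const leveldb::Snapshot* snap = db->GetSnapshot();
leveldb::ReadOptions read_options;
read_options.snapshot = snap;

std::string value;
leveldb::Status s = db->Get(read_options, "mykey", &value);  // ignores newer writes

db->ReleaseSnapshot(snap);  // let LevelDB reclaim old data versions
```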
## How DBImpl Handles Writes
Let's trace a simple `Put` operation:
1. **Request:** Your application calls `db->Put("mykey", "myvalue")`.
2. **DBImpl Entry:** This call enters the `DBImpl::Put` method (which typically wraps the operation in a [WriteBatch](05_writebatch.md) and calls `DBImpl::Write`).
3. **Queueing (Optional):** `DBImpl` manages a queue of writers to ensure writes happen in order. It might group multiple concurrent writes together for efficiency (`BuildBatchGroup`).
4. **Making Room:** Before writing, `DBImpl` checks if there's space in the current `MemTable` (`mem_`). If not (`MakeRoomForWrite`), it might:
* Pause briefly if Level-0 SSTable count is high (slowdown trigger).
* Wait if the *immutable* MemTable (`imm_`) is still being flushed.
* Wait if Level-0 SSTable count is too high (stop trigger).
* **Trigger a MemTable switch:**
* Mark the current `mem_` as read-only (`imm_`).
* Create a new empty `mem_`.
* Create a new WAL file (`logfile_`).
* Schedule a background task (`MaybeScheduleCompaction`) to flush the old `imm_` to an SSTable.
5. **Write to WAL:** `DBImpl` writes the operation(s) to the current WAL file (`log_->AddRecord(...)`). If requested (`options.sync`), it ensures the WAL data is physically on disk (`logfile_->Sync()`).
6. **Write to MemTable:** Only after the WAL write succeeds, `DBImpl` inserts the data into the active `MemTable` (`mem_->Add(...)` via `WriteBatchInternal::InsertInto`).
7. **Return:** Control returns to your application.
Here's a highly simplified view of the `Write` method:
```c++
// --- Simplified from db/db_impl.cc ---
Status DBImpl::Write(const WriteOptions& options, WriteBatch* updates) {
  // ... acquire mutex, manage writer queue (omitted) ...

  // Step 4: Make sure there's space. This might trigger a MemTable switch
  // and schedule background work. May wait if MemTable is full or
  // too many L0 files exist.
  Status status = MakeRoomForWrite(updates == nullptr /* force compact? */);

  if (status.ok() && updates != nullptr) {
    // ... potentially group multiple concurrent writes (BuildBatchGroup) ...

    // Step 5: Add the batch to the Write-Ahead Log
    status = log_->AddRecord(WriteBatchInternal::Contents(updates));
    if (status.ok() && options.sync) {
      // Ensure log entry is on disk if requested
      status = logfile_->Sync();
      // ... handle sync error by recording background error ...
    }
    // Step 6: Insert the batch into the active MemTable (only if WAL ok)
    if (status.ok()) {
      status = WriteBatchInternal::InsertInto(updates, mem_);
    }
  }
  // ... update sequence number, manage writer queue, release mutex ...
  return status;  // Step 7: Return status to caller
}
}
```
**Explanation:** This code shows the core sequence: check/make room (`MakeRoomForWrite`), write to the log (`log_->AddRecord`), potentially sync the log (`logfile_->Sync`), and finally insert into the MemTable (`InsertInto(..., mem_)`). Error handling and writer coordination are omitted for clarity.
```mermaid
sequenceDiagram
participant App as Application
participant DBImpl
participant WriterQueue as Writer Queue
participant LogWriter as log::Writer (WAL)
participant MemTable as Active MemTable (RAM)
App->>DBImpl: Put("key", "value") / Write(batch)
DBImpl->>WriterQueue: Add writer to queue
Note over DBImpl: Waits if not front of queue
DBImpl->>DBImpl: MakeRoomForWrite()?
alt MemTable Full / L0 Trigger
DBImpl->>DBImpl: Switch MemTable, Schedule Flush
end
DBImpl->>LogWriter: AddRecord(batch_data)
opt Sync Option Enabled
DBImpl->>LogWriter: Sync() Log File
end
LogWriter-->>DBImpl: Log Write Status
alt Log Write OK
DBImpl->>MemTable: InsertInto(batch_data)
MemTable-->>DBImpl: Insert Status
DBImpl->>WriterQueue: Remove writer, Signal next
DBImpl-->>App: Return OK
else Log Write Failed
DBImpl->>WriterQueue: Remove writer, Signal next
DBImpl-->>App: Return Error Status
end
```
## How DBImpl Handles Reads
Reading data involves checking different places in a specific order to ensure the most recent value is found:
1. **Request:** Your application calls `db->Get("mykey")`.
2. **DBImpl Entry:** The call enters `DBImpl::Get`.
3. **Snapshot:** `DBImpl` determines the sequence number to read up to (either from the provided `ReadOptions::snapshot` or the current latest sequence number).
4. **Check MemTable:** It first checks the active `MemTable` (`mem_`). If the key is found (either a value or a deletion marker), the search stops, and the result is returned.
5. **Check Immutable MemTable:** If not found in `mem_`, and if an immutable MemTable (`imm_`) exists (one that's waiting to be flushed), it checks `imm_`. If found, the search stops.
6. **Check SSTables:** If the key wasn't found in memory, `DBImpl` asks the current `Version` (managed by `VersionSet`) to find the key in the SSTable files (`current->Get(...)`). The `Version` object knows which files might contain the key and uses the `TableCache` to access them efficiently.
7. **Update Stats (Optional):** If the read involved checking SSTables, `DBImpl` might update internal statistics about file access (`current->UpdateStats`). If a file is read frequently, this might trigger a future compaction (`MaybeScheduleCompaction`).
8. **Return:** The value found (or a "Not Found" status) is returned to the application.
A simplified view of `Get`:
```c++
// --- Simplified from db/db_impl.cc ---
Status DBImpl::Get(const ReadOptions& options, const Slice& key,
                   std::string* value) {
  Status s;
  SequenceNumber snapshot;
  // ... (Step 3) Determine snapshot sequence number ...

  mutex_.Lock();  // Need lock to access mem_, imm_, current version
  MemTable* mem = mem_;
  MemTable* imm = imm_;
  Version* current = versions_->current();
  mem->Ref();  // Increase reference counts
  if (imm != nullptr) imm->Ref();
  current->Ref();
  mutex_.Unlock();  // Unlock for potentially slow lookups

  LookupKey lkey(key, snapshot);  // Internal key format for lookup

  // Step 4: Check active MemTable
  if (mem->Get(lkey, value, &s)) {
    // Found in mem_ (value or deletion marker)
  }
  // Step 5: Check immutable MemTable (if it exists)
  else if (imm != nullptr && imm->Get(lkey, value, &s)) {
    // Found in imm_
  }
  // Step 6: Check SSTables via current Version
  else {
    Version::GetStats stats;  // To record file access stats
    s = current->Get(options, lkey, value, &stats);
    // Step 7: Maybe update stats and schedule compaction
    if (current->UpdateStats(stats)) {
      mutex_.Lock();
      MaybeScheduleCompaction();  // Needs lock
      mutex_.Unlock();
    }
  }

  // Decrease reference counts
  mutex_.Lock();
  mem->Unref();
  if (imm != nullptr) imm->Unref();
  current->Unref();
  mutex_.Unlock();
  return s;  // Step 8: Return status
}
}
```
**Explanation:** This shows the order of checking: `mem->Get`, `imm->Get`, and finally `current->Get` (which searches SSTables). It also highlights the reference counting (`Ref`/`Unref`) needed because these components might be changed or deleted by background threads while the read is in progress. The lock is held only when accessing shared pointers, not during the actual data lookup.
```mermaid
sequenceDiagram
participant App as Application
participant DBImpl
participant MemTable as Active MemTable (RAM)
participant ImmMemTable as Immutable MemTable (RAM)
participant Version as Current Version
participant TableCache as TableCache (SSTables)
App->>DBImpl: Get("key")
DBImpl->>MemTable: Get(lkey)?
alt Key Found in MemTable
MemTable-->>DBImpl: Return value / deletion
DBImpl-->>App: Return value / NotFound
else Key Not Found in MemTable
MemTable-->>DBImpl: Not Found
DBImpl->>ImmMemTable: Get(lkey)?
alt Key Found in ImmMemTable
ImmMemTable-->>DBImpl: Return value / deletion
DBImpl-->>App: Return value / NotFound
else Key Not Found in ImmMemTable
ImmMemTable-->>DBImpl: Not Found
DBImpl->>Version: Get(lkey) from SSTables?
Version->>TableCache: Find key in relevant SSTables
TableCache-->>Version: Return value / deletion / NotFound
Version-->>DBImpl: Return value / deletion / NotFound
DBImpl-->>App: Return value / NotFound
end
end
```
## Managing Background Work (Compaction)
`DBImpl` is responsible for kicking off background work. It doesn't *do* the compaction itself (that logic is largely within [Compaction](08_compaction.md) and [VersionSet](06_version___versionset.md)), but it manages the *triggering* and the background thread.
* **When is work needed?** `DBImpl` checks if work is needed in a few places:
* After a MemTable switch (`MakeRoomForWrite` schedules flush of `imm_`).
* After a read operation updates file stats (`Get` might call `MaybeScheduleCompaction`).
* After a background compaction finishes (it checks if *more* compaction is needed).
* When explicitly requested (`CompactRange`).
* **Scheduling:** If work is needed and a background task isn't already running, `DBImpl::MaybeScheduleCompaction` sets a flag (`background_compaction_scheduled_`) and asks the `Env` (Environment object, handles OS interactions) to schedule a function (`DBImpl::BGWork`) to run on a background thread.
* **Performing Work:** The background thread eventually calls `DBImpl::BackgroundCall`, which locks the mutex and calls `DBImpl::BackgroundCompaction`. This method decides *what* work to do:
* If `imm_` exists, it calls `CompactMemTable` (which uses `WriteLevel0Table` -> `BuildTable`) to flush it.
* Otherwise, it asks the `VersionSet` to pick an appropriate SSTable compaction (`versions_->PickCompaction()`).
* It then calls `DoCompactionWork` to perform the actual SSTable compaction (releasing the main lock during the heavy lifting).
* **Signaling:** Once background work finishes, it signals (`background_work_finished_signal_.SignalAll()`) any foreground threads that might be waiting (e.g., a write operation waiting for `imm_` to be flushed).
Here's the simplified scheduling logic:
```c++
// --- Simplified from db/db_impl.cc ---
void DBImpl::MaybeScheduleCompaction() {
mutex_.AssertHeld(); // Must hold lock to check/change state
if (background_compaction_scheduled_) {
// Already scheduled
} else if (shutting_down_.load(std::memory_order_acquire)) {
// DB is closing
} else if (!bg_error_.ok()) {
// Background error stopped activity
} else if (imm_ == nullptr && // No MemTable flush needed AND
manual_compaction_ == nullptr && // No manual request AND
!versions_->NeedsCompaction()) { // VersionSet says no work needed
// No work to be done
} else {
// Work needs to be done! Schedule it.
background_compaction_scheduled_ = true;
env_->Schedule(&DBImpl::BGWork, this); // Ask Env to run BGWork later
}
}
```
**Explanation:** This function checks several conditions under a lock. If there's an immutable MemTable to flush (`imm_ != nullptr`) or the `VersionSet` indicates compaction is needed (`versions_->NeedsCompaction()`) and no background task is already scheduled, it marks one as scheduled and tells the environment (`env_`) to run the `BGWork` function in the background.
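The function scheduled by `env_->Schedule` is a thin static trampoline into the background logic. Here's a simplified sketch, closely following `db/db_impl.cc`:
```c++
// --- Simplified from db/db_impl.cc ---
void DBImpl::BGWork(void* db) {
  // Static trampoline: Env::Schedule can only invoke plain function pointers.
  reinterpret_cast<DBImpl*>(db)->BackgroundCall();
}

void DBImpl::BackgroundCall() {
  MutexLock l(&mutex_);  // Hold the lock while touching shared state
  assert(background_compaction_scheduled_);
  if (!shutting_down_.load(std::memory_order_acquire) && bg_error_.ok()) {
    BackgroundCompaction();  // Flush imm_ or run an SSTable compaction
  }
  background_compaction_scheduled_ = false;
  // The compaction we just ran may have produced too many files in some
  // level, so check whether another round of work is needed.
  MaybeScheduleCompaction();
  background_work_finished_signal_.SignalAll();  // Wake waiting threads
}
```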
```mermaid
flowchart TD
A["Write/Read/Compact finishes"] --> B{"Need Compaction?"}
B -->|Yes| C{"BG Task Scheduled?"}
B -->|No| Z["Idle"]
C -->|Yes| Z
C -->|No| D["Mark BG Scheduled = true"]
D --> E["Schedule BGWork"]
E --> F["Background Thread Pool"]
F -->|Runs| G["DBImpl::BGWork"]
G --> H["DBImpl::BackgroundCall"]
H --> I{"Compact imm_ OR Pick/Run SSTable Compaction?"}
I --> J["Perform Compaction Work"]
J --> K["Mark BG Scheduled = false"]
K --> L["Signal Waiting Threads"]
L --> B
```
## Recovery on Startup
When you open a database, `DBImpl::Open` orchestrates the recovery process:
1. **Lock:** It locks the database directory (`env_->LockFile`) to prevent other processes from using it.
2. **Recover VersionSet:** It calls `versions_->Recover()`, which reads the `MANIFEST` file to understand the state of SSTables from the last clean run.
3. **Find Logs:** It scans the database directory for any `.log` files (WAL files) that are newer than the ones recorded in the `MANIFEST`. These logs represent writes that might not have been flushed to SSTables before the last shutdown/crash.
4. **Replay Logs:** For each relevant log file found, it calls `DBImpl::RecoverLogFile`.
* Inside `RecoverLogFile`, it creates a `log::Reader`.
* It reads records (which are serialized `WriteBatch`es) from the log file one by one.
* For each record, it applies the operations (`WriteBatchInternal::InsertInto`) to a temporary in-memory `MemTable`.
* This effectively rebuilds the state of the MemTable(s) as they were just before the crash/shutdown.
5. **Finalize State:** Once all logs are replayed, the recovered MemTable becomes the active `mem_`. If the recovery process itself filled the MemTable, `RecoverLogFile` might even flush it to a Level-0 SSTable (`WriteLevel0Table`). `DBImpl` updates the `VersionSet` with the recovered sequence number and potentially writes a new `MANIFEST`.
6. **Ready:** The database is now recovered and ready for new operations.
Here's a conceptual snippet from the recovery logic:
```c++
// --- Conceptual, simplified from DBImpl::RecoverLogFile ---
// Inside loop processing a single log file during recovery:
while (reader.ReadRecord(&record, &scratch) && status.ok()) {
// Check if record looks like a valid WriteBatch
if (record.size() < 12) { /* report corruption */ continue; }
// Parse the raw log record back into a WriteBatch object
WriteBatchInternal::SetContents(&batch, record);
// Create a MemTable if we don't have one yet for this log
if (mem == nullptr) {
mem = new MemTable(internal_comparator_);
mem->Ref();
}
// Apply the operations from the batch TO THE MEMTABLE
status = WriteBatchInternal::InsertInto(&batch, mem);
// ... handle error ...
// Keep track of the latest sequence number seen
const SequenceNumber last_seq = /* ... get sequence from batch ... */;
if (last_seq > *max_sequence) {
*max_sequence = last_seq;
}
// If the MemTable gets full *during recovery*, flush it!
if (mem->ApproximateMemoryUsage() > options_.write_buffer_size) {
status = WriteLevel0Table(mem, edit, nullptr); // Flush to L0 SSTable
mem->Unref();
mem = nullptr; // Will create a new one if needed
// ... handle error ...
}
}
// After loop, handle the final state of 'mem'
```
**Explanation:** This loop reads each record (a `WriteBatch`) from the log file using `reader.ReadRecord`. It then applies the batch's changes directly to an in-memory `MemTable` (`InsertInto(&batch, mem)`), effectively replaying the lost writes. It even handles flushing this MemTable if it fills up during the recovery process.
## The DBImpl Class (Code Glimpse)
The definition of `DBImpl` in `db_impl.h` shows the key components it manages:
```c++
// --- Simplified from db/db_impl.h ---
class DBImpl : public DB {
public:
DBImpl(const Options& options, const std::string& dbname);
~DBImpl() override;
// Public API methods (implementing DB interface)
Status Put(...) override;
Status Delete(...) override;
Status Write(...) override;
Status Get(...) override;
Iterator* NewIterator(...) override;
const Snapshot* GetSnapshot() override;
void ReleaseSnapshot(...) override;
// ... other public methods ...
private:
// Friend classes allow access to private members
friend class DB;
struct CompactionState; // Helper struct for compactions
struct Writer; // Helper struct for writer queue
// Core methods for internal operations
Status Recover(VersionEdit* edit, bool* save_manifest);
void CompactMemTable();
Status RecoverLogFile(...);
Status WriteLevel0Table(...);
Status MakeRoomForWrite(...);
void MaybeScheduleCompaction();
static void BGWork(void* db); // Background task entry point
void BackgroundCall();
void BackgroundCompaction();
Status DoCompactionWork(...);
// ... other private helpers ...
// == Key Member Variables ==
Env* const env_; // OS interaction layer
const InternalKeyComparator internal_comparator_; // For sorting keys
const Options options_; // Database configuration options
const std::string dbname_; // Database directory path
TableCache* const table_cache_; // Cache for open SSTable files
FileLock* db_lock_; // Lock file handle for DB directory
port::Mutex mutex_; // Main mutex protecting shared state
std::atomic<bool> shutting_down_; // Flag indicating DB closure
port::CondVar background_work_finished_signal_ GUARDED_BY(mutex_); // For waiting
MemTable* mem_ GUARDED_BY(mutex_); // Active memtable (accepts writes)
MemTable* imm_ GUARDED_BY(mutex_); // Immutable memtable (being flushed)
std::atomic<bool> has_imm_; // Fast check if imm_ is non-null
WritableFile* logfile_; // Current WAL file handle
uint64_t logfile_number_ GUARDED_BY(mutex_); // Current WAL file number
log::Writer* log_; // WAL writer object
VersionSet* const versions_ GUARDED_BY(mutex_); // Manages SSTables/Versions
// Queue of writers waiting for their turn
std::deque<Writer*> writers_ GUARDED_BY(mutex_);
// List of active snapshots
SnapshotList snapshots_ GUARDED_BY(mutex_);
// Files being generated by compactions
std::set<uint64_t> pending_outputs_ GUARDED_BY(mutex_);
// Is a background compaction scheduled/running?
bool background_compaction_scheduled_ GUARDED_BY(mutex_);
// Error status from background threads
Status bg_error_ GUARDED_BY(mutex_);
// Compaction statistics
CompactionStats stats_[config::kNumLevels] GUARDED_BY(mutex_);
};
```
**Explanation:** This header shows `DBImpl` inheriting from the public `DB` interface. It contains references to essential components like the `Env`, `Options`, `TableCache`, `MemTable` (`mem_` and `imm_`), WAL (`log_`, `logfile_`), and `VersionSet`. Crucially, it also has a `mutex_` to protect shared state accessed by multiple threads (foreground application threads and background compaction threads) and condition variables (`background_work_finished_signal_`) to allow threads to wait for background work.
## Conclusion
`DBImpl` is the central nervous system of LevelDB. It doesn't store the data itself, but it acts as the **General Manager**, receiving requests and coordinating the actions of all the other specialized components like the MemTable, WAL, VersionSet, and TableCache. It handles the intricate dance between fast in-memory writes, durable logging, persistent disk storage, background maintenance, and safe recovery. Understanding `DBImpl`'s role is key to seeing how all the pieces of LevelDB fit together to create a functional database.
One tool `DBImpl` uses to make writes efficient and atomic is the `WriteBatch`. Let's see how that works next.
Next up: [Chapter 5: WriteBatch](05_writebatch.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
# Chapter 5: WriteBatch - Grouping Changes Together
Welcome back! In [Chapter 4: DBImpl](04_dbimpl.md), we saw how `DBImpl` acts as the general manager, coordinating writes, reads, and background tasks. We learned that when you call `Put` or `Delete`, `DBImpl` handles writing to the [Write-Ahead Log (WAL)](03_write_ahead_log__wal____logwriter_logreader.md) and then updating the [MemTable](02_memtable.md).
But what if you need to make *multiple* changes that should happen *together*?
## What's the Problem? Making Multiple Changes Atomically
Imagine you're managing game scores. When Player A beats Player B, you need to do two things: increase Player A's score and decrease Player B's score.
```
// Goal: Increase playerA score, decrease playerB score
db->Put(options, "score_playerA", "101");
db->Put(options, "score_playerB", "49");
```
What happens if the system crashes right *after* the first `Put` but *before* the second `Put`? Player A gets a point, but Player B *doesn't* lose one. The scores are now inconsistent! This isn't good.
We need a way to tell LevelDB: "Please perform these *multiple* operations (like updating both scores) as a single, indivisible unit. Either *all* of them should succeed, or *none* of them should." This property is called **atomicity**.
## WriteBatch: The Atomic To-Do List
LevelDB provides the `WriteBatch` class to solve this exact problem.
Think of a `WriteBatch` like making a **shopping list** before you go to the store, or giving a librarian a list of multiple transactions to perform all at once (check out book A, return book B).
1. **Collect Changes:** You create an empty `WriteBatch` object. Then, instead of calling `db->Put` or `db->Delete` directly, you call `batch.Put` and `batch.Delete` to add your desired changes to the batch object. This just adds items to your "to-do list" in memory; it doesn't modify the database yet.
2. **Apply Atomically:** Once your list is complete, you hand the entire `WriteBatch` to the database using a single `db->Write(options, &batch)` call.
3. **All or Nothing:** LevelDB guarantees that all the operations (`Put`s and `Delete`s) listed in the `WriteBatch` will be applied **atomically**. They will either *all* succeed and become durable together, or if something goes wrong (like a crash during the process), *none* of them will appear to have happened after recovery.
Using `WriteBatch` for our score update:
```c++
#include "leveldb/write_batch.h"
#include "leveldb/db.h"
// ... assume db is an open LevelDB database ...
leveldb::WriteOptions write_options;
write_options.sync = true; // Ensure durability
// 1. Create an empty WriteBatch
leveldb::WriteBatch batch;
// 2. Add changes to the batch (in memory)
batch.Put("score_playerA", "101"); // Add 'Put playerA' to the list
batch.Delete("old_temp_key"); // Add 'Delete old_temp_key' to the list
batch.Put("score_playerB", "49"); // Add 'Put playerB' to the list
// 3. Apply the entire batch atomically
leveldb::Status status = db->Write(write_options, &batch);
if (status.ok()) {
// Success! Both score_playerA and score_playerB are updated,
// and old_temp_key is deleted.
} else {
// Failure! The database state is unchanged. Neither score was updated,
// and old_temp_key was not deleted.
}
```
**Explanation:**
1. We create a `WriteBatch` called `batch`.
2. We call `batch.Put` and `batch.Delete`. These methods modify the `batch` object itself, not the database. They are very fast as they just record the desired operations internally.
3. We call `db->Write` with the completed `batch`. LevelDB now takes this list and applies it atomically. Thanks to the [WAL](03_write_ahead_log__wal____logwriter_logreader.md), even if the system crashes *during* the `db->Write` call, recovery will ensure either all changes from the batch are applied or none are.
## Performance Benefit Too!
Besides atomicity, `WriteBatch` also often improves performance when making multiple changes:
* **Single Log Write:** LevelDB can write the *entire batch* as a single record to the WAL file on disk. This is usually much faster than writing separate log records for each individual `Put` or `Delete`, reducing disk I/O.
* **Single Lock Acquisition:** The `DBImpl` only needs to acquire its internal lock once for the entire `Write` call, rather than once per operation.
So, even if you don't strictly *need* atomicity, using `WriteBatch` for bulk updates can be faster.
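To make the difference concrete, here is a minimal sketch contrasting the two approaches. The `KeyFor`/`ValueFor` helpers are hypothetical, and the actual speedup depends on your disk and options:
```c++
#include <string>
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Hypothetical helpers, for illustration only.
std::string KeyFor(int i) { return "key_" + std::to_string(i); }
std::string ValueFor(int i) { return "value_" + std::to_string(i); }

void BulkWrite(leveldb::DB* db) {
  leveldb::WriteOptions sync_opts;
  sync_opts.sync = true;

  // Slow: each Put is its own WAL record and its own disk sync.
  for (int i = 0; i < 100; i++) {
    db->Put(sync_opts, KeyFor(i), ValueFor(i));
  }

  // Usually much faster: one WAL record, one sync, one lock acquisition.
  leveldb::WriteBatch batch;
  for (int i = 0; i < 100; i++) {
    batch.Put(KeyFor(i), ValueFor(i));
  }
  db->Write(sync_opts, &batch);
}
```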
## Under the Hood: How WriteBatch Works
What happens inside LevelDB when you call `db->Write(options, &batch)`?
1. **Serialization:** The `WriteBatch` object holds a simple, serialized representation of all the `Put` and `Delete` operations you added. It's basically a byte string (`rep_` internally) containing the sequence of operations and their arguments.
2. **DBImpl Coordination:** The `DBImpl::Write` method receives the `WriteBatch`.
3. **WAL Write:** `DBImpl` takes the entire serialized content of the `WriteBatch` (from `WriteBatchInternal::Contents`) and writes it as **one single record** to the [Write-Ahead Log (WAL)](03_write_ahead_log__wal____logwriter_logreader.md) using `log_->AddRecord()`.
4. **MemTable Update:** If the WAL write is successful (and synced to disk if `options.sync` is true), `DBImpl` then iterates through the operations *within* the `WriteBatch`. For each operation, it applies the change to the in-memory [MemTable](02_memtable.md) (`WriteBatchInternal::InsertInto(batch, mem_)`).
This two-step process (WAL first, then MemTable) ensures both durability and atomicity. If a crash occurs after the WAL write but before the MemTable update finishes, the recovery process will read the *entire batch* from the WAL and re-apply it to the MemTable, ensuring all changes are present.
```mermaid
sequenceDiagram
participant App as Application
participant DBImpl as DBImpl::Write
participant WriteBatch as WriteBatch Object
participant WAL as WAL File (Disk)
participant MemTable as MemTable (RAM)
App->>WriteBatch: batch.Put("k1", "v1")
App->>WriteBatch: batch.Delete("k2")
App->>WriteBatch: batch.Put("k3", "v3")
App->>DBImpl: db->Write(options, &batch)
DBImpl->>WriteBatch: Get serialized contents (rep_)
WriteBatch-->>DBImpl: Return byte string representing all ops
DBImpl->>WAL: AddRecord(entire batch content)
Note right of WAL: Single disk write (if sync)
WAL-->>DBImpl: WAL Write OK
DBImpl->>WriteBatch: Iterate through operations
loop Apply each operation from Batch
WriteBatch-->>DBImpl: Next Op: Put("k1", "v1")
DBImpl->>MemTable: Add("k1", "v1")
WriteBatch-->>DBImpl: Next Op: Delete("k2")
DBImpl->>MemTable: Add("k2", deletion_marker)
WriteBatch-->>DBImpl: Next Op: Put("k3", "v3")
DBImpl->>MemTable: Add("k3", "v3")
end
MemTable-->>DBImpl: MemTable Updates Done
DBImpl-->>App: Write Successful
```
## WriteBatch Internals (Code View)
Let's peek at the code.
**Adding to the Batch:**
When you call `batch.Put("key", "val")` or `batch.Delete("key")`, the `WriteBatch` simply appends a representation of that operation to its internal string buffer (`rep_`).
```c++
// --- File: leveldb/write_batch.cc ---
// Simplified serialization format:
// rep_ :=
// sequence: fixed64 (8 bytes, initially 0)
// count: fixed32 (4 bytes, number of records)
// data: record[count]
// record :=
// kTypeValue varstring varstring |
// kTypeDeletion varstring
// varstring :=
// len: varint32
// data: uint8[len]
void WriteBatch::Put(const Slice& key, const Slice& value) {
// Increment the record count stored in the header
WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);
// Append the type marker (kTypeValue)
rep_.push_back(static_cast<char>(kTypeValue));
// Append the key (length-prefixed)
PutLengthPrefixedSlice(&rep_, key);
// Append the value (length-prefixed)
PutLengthPrefixedSlice(&rep_, value);
}
void WriteBatch::Delete(const Slice& key) {
// Increment the record count stored in the header
WriteBatchInternal::SetCount(this, WriteBatchInternal::Count(this) + 1);
// Append the type marker (kTypeDeletion)
rep_.push_back(static_cast<char>(kTypeDeletion));
// Append the key (length-prefixed)
PutLengthPrefixedSlice(&rep_, key);
}
// Helper to get/set the 4-byte count from the header (bytes 8-11)
int WriteBatchInternal::Count(const WriteBatch* b) {
return DecodeFixed32(b->rep_.data() + 8); // Read count from header
}
void WriteBatchInternal::SetCount(WriteBatch* b, int n) {
EncodeFixed32(&b->rep_[8], n); // Write count to header
}
// Helper to get the full serialized content
Slice WriteBatchInternal::Contents(const WriteBatch* batch) {
return Slice(batch->rep_);
}
```
**Explanation:**
* Each `Put` or `Delete` increments a counter stored in the first 12 bytes (`kHeader`) of the internal string `rep_`.
* It then appends a 1-byte type marker (`kTypeValue` or `kTypeDeletion`).
* Finally, it appends the key (and value for `Put`) using `PutLengthPrefixedSlice`, which writes the length of the slice followed by its data. This makes it easy to parse the operations back later.
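As a hand-worked example, here is what `rep_` would contain after a single `batch.Put("k1", "v1")`, following the format comment above (byte values assume `kTypeValue == 0x01`):
```
bytes  0-7 : sequence = 0   (fixed64; filled in later by DBImpl::Write)
bytes  8-11: count    = 1   (fixed32; one record so far)
byte   12  : 0x01           (kTypeValue tag)
byte   13  : 0x02           (varint32 length of "k1")
bytes 14-15: 'k' '1'        (key data)
byte   16  : 0x02           (varint32 length of "v1")
bytes 17-18: 'v' '1'        (value data)
```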
**Applying the Batch to MemTable:**
When `DBImpl::Write` calls `WriteBatchInternal::InsertInto(batch, mem_)`, this helper function iterates through the serialized `rep_` string and applies each operation to the MemTable.
```c++
// --- File: leveldb/write_batch.cc ---
// Helper class used by InsertInto
namespace {
class MemTableInserter : public WriteBatch::Handler {
public:
SequenceNumber sequence_; // Starting sequence number for the batch
MemTable* mem_; // MemTable to insert into
void Put(const Slice& key, const Slice& value) override {
// Add the Put operation to the MemTable
mem_->Add(sequence_, kTypeValue, key, value);
sequence_++; // Increment sequence number for the next operation
}
void Delete(const Slice& key) override {
// Add the Delete operation (as a deletion marker) to the MemTable
mem_->Add(sequence_, kTypeDeletion, key, Slice()); // Value is ignored
sequence_++; // Increment sequence number for the next operation
}
};
} // namespace
// Applies the batch operations to the MemTable
Status WriteBatchInternal::InsertInto(const WriteBatch* b, MemTable* memtable) {
MemTableInserter inserter;
// Get the starting sequence number assigned by DBImpl::Write
inserter.sequence_ = WriteBatchInternal::Sequence(b);
inserter.mem_ = memtable;
// Iterate() parses rep_ and calls handler.Put/handler.Delete
return b->Iterate(&inserter);
}
// Helper to get/set the 8-byte sequence number from header (bytes 0-7)
SequenceNumber WriteBatchInternal::Sequence(const WriteBatch* b) {
return SequenceNumber(DecodeFixed64(b->rep_.data()));
}
void WriteBatchInternal::SetSequence(WriteBatch* b, SequenceNumber seq) {
EncodeFixed64(&b->rep_[0], seq);
}
```
**Explanation:**
1. `InsertInto` creates a helper object `MemTableInserter`.
2. It gets the starting `SequenceNumber` for this batch (which was assigned by `DBImpl::Write` and stored in the batch's header).
3. It calls `b->Iterate(&inserter)`. The `Iterate` method (shown below; it simply reverses the serialization process) parses the `rep_` string. For each operation it finds, it calls the appropriate method on the `inserter` object (`Put` or `Delete`).
4. The `inserter.Put` and `inserter.Delete` methods simply call `mem_->Add`, passing along the correct sequence number (which increments for each operation within the batch) and the type (`kTypeValue` or `kTypeDeletion`).
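For completeness, the `Iterate` method itself is just a parsing loop, the reverse of the `Put`/`Delete` serialization above. Simplified from `write_batch.cc`:
```c++
// --- Simplified from leveldb/write_batch.cc ---
Status WriteBatch::Iterate(Handler* handler) const {
  Slice input(rep_);
  if (input.size() < kHeader) {
    return Status::Corruption("malformed WriteBatch (too small)");
  }
  input.remove_prefix(kHeader);  // Skip the 12-byte sequence+count header
  Slice key, value;
  int found = 0;
  while (!input.empty()) {
    found++;
    char tag = input[0];  // 1-byte type marker
    input.remove_prefix(1);
    switch (tag) {
      case kTypeValue:
        if (GetLengthPrefixedSlice(&input, &key) &&
            GetLengthPrefixedSlice(&input, &value)) {
          handler->Put(key, value);
        } else {
          return Status::Corruption("bad WriteBatch Put");
        }
        break;
      case kTypeDeletion:
        if (GetLengthPrefixedSlice(&input, &key)) {
          handler->Delete(key);
        } else {
          return Status::Corruption("bad WriteBatch Delete");
        }
        break;
      default:
        return Status::Corruption("unknown WriteBatch tag");
    }
  }
  // Sanity check: the parsed record count must match the header's count.
  if (found != WriteBatchInternal::Count(this)) {
    return Status::Corruption("WriteBatch has wrong count");
  }
  return Status::OK();
}
```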
## Conclusion
The `WriteBatch` is a simple yet powerful tool in LevelDB. It allows you to:
1. **Group Multiple Changes:** Collect several `Put` and `Delete` operations together.
2. **Ensure Atomicity:** Apply these changes as a single, all-or-nothing unit using `db->Write`. This prevents inconsistent states if errors or crashes occur mid-operation.
3. **Improve Performance:** Often makes bulk updates faster by reducing the number of WAL writes and lock acquisitions.
It works by serializing the list of operations into a byte string, which LevelDB writes to the WAL as a single record and then replays into the MemTable.
Now that we understand how individual changes and batches of changes are safely written and stored temporarily in the MemTable and WAL, how does LevelDB manage the overall state of the database, including all the SSTable files on disk? How does it know which files contain the data for a particular key?
Next up: [Chapter 6: Version & VersionSet](06_version___versionset.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
# Chapter 6: Version & VersionSet - The Database Catalog
In the previous chapter, [Chapter 5: WriteBatch](05_writebatch.md), we learned how LevelDB groups multiple `Put` and `Delete` operations together to apply them atomically and efficiently. We saw that writes go first to the [Write-Ahead Log (WAL)](03_write_ahead_log__wal____logwriter_logreader.md) for durability, and then to the in-memory [MemTable](02_memtable.md).
Eventually, the MemTable gets full and is flushed to an [SSTable](01_table___sstable___tablecache.md) file on disk. Over time, LevelDB also runs compactions, which read data from existing SSTables and write new ones, deleting the old ones afterwards. This means the set of SSTable files that represent the database's current state is constantly changing!
## What's the Problem? Tracking a Changing Set of Files
Imagine our library again. Books (SSTables) are constantly being added (from MemTable flushes), removed (after compaction), and sometimes even moved between sections (levels during compaction). How does the librarian know *which* books are currently part of the official collection and where they are located? If a reader asks for information, the librarian can't just guess which books to look in; they need an accurate, up-to-date catalog.
Similarly, LevelDB needs a system to track:
1. Which SSTable files exist and are currently "live" (contain valid data)?
2. Which "level" each live SSTable file belongs to? (Levels are important for compaction, see [Chapter 8: Compaction](08_compaction.md)).
3. What's the overall state of the database, like the next available file number or the sequence number of the last operation?
4. How can reads see a consistent snapshot of the database, even while background tasks are adding and removing files?
## The Solution: Versions, VersionEdits, and the VersionSet
LevelDB uses a trio of concepts to manage this state:
1. **Version:** Think of a `Version` object as **one specific edition of the library's catalog**. It represents a complete, consistent snapshot of the database state at a single point in time. Specifically, it contains lists of all the live SSTable files for *each* level. Once created, a `Version` object is **immutable**: it never changes, just like a printed catalog edition. Reads (`Get` operations or [Iterators](07_iterator.md)) use a specific `Version` to know which files to consult.
2. **VersionEdit:** This is like a **list of corrections and updates** to get from one catalog edition to the next. It describes the *changes* between two versions. A `VersionEdit` might say:
* "Add file number 15 to Level-0." (Because a MemTable was flushed).
* "Remove files 8 and 9 from Level-1." (Because they were compacted).
* "Add file number 25 to Level-2." (The result of the compaction).
* "Update the next available file number to 26."
* "Update the last sequence number."
These edits are small descriptions of changes. They are stored persistently in a special file called the `MANIFEST`.
3. **VersionSet:** This is the **chief librarian** or the **cataloguing department**. It's the central manager for all database state related to the set of live files. The `VersionSet` performs several critical tasks:
* Keeps track of the single `current` Version (the latest catalog edition).
* Reads the `MANIFEST` file during startup to reconstruct the database state.
* Applies `VersionEdit`s to the `current` Version to create *new* `Version`s.
* Manages essential metadata like the `next_file_number_`, `log_number_`, and `last_sequence_`.
* Decides which compactions are needed ([Chapter 8: Compaction](08_compaction.md)).
* Manages the lifecycle of `Version` objects (using reference counting) so that old versions needed by iterators or snapshots aren't deleted prematurely.
**In short:** `VersionSet` uses `VersionEdit`s (from the `MANIFEST`) to create a sequence of immutable `Version`s, each representing the database state at a point in time. The `current` `Version` tells LevelDB which files to read from.
## How Reads Use Versions
When you perform a `Get(key)` operation, the [DBImpl](04_dbimpl.md) needs to know which SSTables to check (after checking the MemTables). It does this by consulting the `current` `Version` held by the `VersionSet`.
```c++
// --- Simplified from db/db_impl.cc Get() ---
Status DBImpl::Get(const ReadOptions& options, const Slice& key,
std::string* value) {
// ... check MemTable, Immutable MemTable first ...
// If not found in memory, check SSTables:
else {
MutexLock l(&mutex_); // Need lock to get current Version pointer safely
Version* current = versions_->current(); // Ask VersionSet for current Version
current->Ref(); // Increment ref count (important!)
mutex_.Unlock(); // Unlock for potentially slow disk I/O
LookupKey lkey(key, snapshot_sequence_number); // Key to search for
Version::GetStats stats;
// Ask the Version object to perform the lookup in its files
Status s = current->Get(options, lkey, value, &stats);
mutex_.Lock(); // Re-acquire lock for cleanup
current->Unref(); // Decrement ref count
// ... maybe trigger compaction based on stats ...
mutex_.Unlock();
return s;
}
// ...
}
```
The key step is `versions_->current()->Get(...)`. The `DBImpl` asks the `VersionSet` (`versions_`) for the pointer to the `current` `Version`. It then calls the `Get` method *on that `Version` object*.
How does `Version::Get` work?
```c++
// --- Simplified from db/version_set.cc ---
Status Version::Get(const ReadOptions& options, const LookupKey& k,
std::string* value, GetStats* stats) {
Slice ikey = k.internal_key();
Slice user_key = k.user_key();
// We search level-by-level
for (int level = 0; level < config::kNumLevels; level++) {
const std::vector<FileMetaData*>& files = files_[level]; // Get list for this level
if (files.empty()) continue; // Skip empty levels
if (level == 0) {
// Level-0 files might overlap, search newest-first
std::vector<FileMetaData*> tmp;
// Find potentially overlapping files in level 0
// ... logic to find relevant files ...
// Sort them newest-first
std::sort(tmp.begin(), tmp.end(), NewestFirst);
// Search each relevant file
for (uint32_t i = 0; i < tmp.size(); i++) {
FileMetaData* f = tmp[i];
// Use TableCache to search the actual SSTable file
Status s = vset_->table_cache_->Get(options, f->number, f->file_size,
ikey, /* saver state */, SaveValue);
// ... check if found/deleted/error and update stats ...
if (/* found or deleted */) return s;
}
} else {
// Levels > 0 files are sorted and non-overlapping
// Binary search to find the single file that might contain the key
uint32_t index = FindFile(vset_->icmp_, files, ikey);
if (index < files.size()) {
FileMetaData* f = files[index];
// Check if user_key is within the file's range
if (/* user_key is within f->smallest/f->largest range */) {
// Use TableCache to search the actual SSTable file
Status s = vset_->table_cache_->Get(options, f->number, f->file_size,
ikey, /* saver state */, SaveValue);
// ... check if found/deleted/error and update stats ...
if (/* found or deleted */) return s;
}
}
}
} // End loop over levels
return Status::NotFound(Slice()); // Key not found in any SSTable
}
```
**Explanation:**
1. The `Version` object has arrays (`files_[level]`) storing `FileMetaData` pointers for each level. `FileMetaData` contains the file number, size, and smallest/largest keys for an SSTable.
2. It iterates through the levels.
3. **Level 0:** Files might overlap, so it finds all potentially relevant files, sorts them newest-first (by file number), and checks each one using the [Table / SSTable & TableCache](01_table___sstable___tablecache.md).
4. **Levels > 0:** Files are sorted and non-overlapping. It performs a binary search (`FindFile`) to quickly locate the *single* file that *might* contain the key. It checks that file's key range and then searches it using the `TableCache`.
5. The search stops as soon as the key is found (either a value or a deletion marker) in any file. If it searches all relevant files in all levels without finding the key, it returns `NotFound`.
The `Version` object acts as the map, guiding the search to the correct SSTable files.
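The binary search used in step 4 is short enough to show in full. It returns the index of the first file whose *largest* key is greater than or equal to the target (simplified from `db/version_set.cc`):
```c++
// --- Simplified from db/version_set.cc ---
int FindFile(const InternalKeyComparator& icmp,
             const std::vector<FileMetaData*>& files, const Slice& key) {
  uint32_t left = 0;
  uint32_t right = files.size();
  while (left < right) {
    uint32_t mid = (left + right) / 2;
    const FileMetaData* f = files[mid];
    if (icmp.Compare(f->largest.Encode(), key) < 0) {
      // Everything at or before 'mid' ends before 'key': look right.
      left = mid + 1;
    } else {
      // 'mid' could still contain 'key': keep it in range, look left.
      right = mid;
    }
  }
  return right;  // == files.size() if no file can contain 'key'
}
```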
## How State Changes: Applying VersionEdits
The database state doesn't stand still. MemTables are flushed, compactions happen. How does the `VersionSet` update the state? By applying `VersionEdit`s.
When a background task (like flushing the immutable MemTable or running a compaction) finishes, it creates a `VersionEdit` describing the changes it made (e.g., "add file X, remove file Y"). It then asks the `VersionSet` to apply this edit.
The core logic is in `VersionSet::LogAndApply`:
```c++
// --- Simplified from db/version_set.cc ---
Status VersionSet::LogAndApply(VersionEdit* edit, port::Mutex* mu) {
// 1. Fill in metadata in the edit (log number, sequence number etc.)
// ... set edit->log_number_, edit->last_sequence_, etc. ...
// 2. Create a new Version based on the current one + the edit
Version* v = new Version(this);
{
Builder builder(this, current_); // Builder starts with 'current_' state
builder.Apply(edit); // Apply the changes described by 'edit'
builder.SaveTo(v); // Save the resulting state into 'v'
}
Finalize(v); // Calculate compaction score/level for the new version
// 3. Write the edit to the MANIFEST file (for persistence)
std::string record;
edit->EncodeTo(&record); // Serialize the VersionEdit
// Unlock mutex while writing to disk (can be slow)
mu->Unlock();
Status s = descriptor_log_->AddRecord(record); // Append edit to MANIFEST log
if (s.ok()) {
s = descriptor_file_->Sync(); // Ensure MANIFEST write is durable
}
// ... handle MANIFEST write errors ...
mu->Lock(); // Re-lock mutex
// 4. Install the new version as the 'current' one
if (s.ok()) {
AppendVersion(v); // Make 'v' the new current_ version
// Update VersionSet's metadata based on the edit
log_number_ = edit->log_number_;
prev_log_number_ = edit->prev_log_number_;
} else {
delete v; // Discard the new version if MANIFEST write failed
}
return s;
}
```
**Explanation:**
1. **Prepare Edit:** Fills in missing metadata fields in the `VersionEdit` (like the current log number and last sequence number).
2. **Build New Version:** Creates a temporary `Builder` object, initialized with the state of the `current_` version. It applies the changes from the `edit` to this builder and then saves the resulting state into a completely *new* `Version` object (`v`).
3. **Log to MANIFEST:** Serializes the `VersionEdit` into a string (`record`) and appends it to the `MANIFEST` log file (`descriptor_log_`). This step makes the state change persistent. If the database crashes and restarts, it can replay the `MANIFEST` file to recover the state.
4. **Install New Version:** If the `MANIFEST` write succeeds, it calls `AppendVersion(v)`. This crucial step updates the `current_` pointer in the `VersionSet` to point to the newly created `Version` `v`. Future read operations will now use this new version. It also updates the `VersionSet`'s own metadata (like `log_number_`).
This process ensures that the database state transitions atomically: a new `Version` only becomes `current` *after* the changes it represents have been safely recorded in the `MANIFEST`.
```mermaid
sequenceDiagram
participant BG as Background Task (Flush/Compact)
participant VE as VersionEdit
participant VS as VersionSet
participant VSCur as Current Version
participant VSBld as VersionSet::Builder
participant V as New Version
participant Manifest as MANIFEST Log File
BG ->> VE: Create edit (add file X, remove Y)
BG ->> VS: LogAndApply(edit)
VS ->> VSCur: Get current state
VS ->> VSBld: Create Builder(based on VSCur)
VSBld ->> VE: Apply(edit)
VSBld ->> V: Save resulting state to New Version
VS ->> V: Finalize()
VE ->> VE: EncodeTo(record)
VS ->> Manifest: AddRecord(record)
Manifest -->> VS: Write Status OK
VS ->> V: AppendVersion(V) // Make V the new 'current'
VS ->> VS: Update log_number etc.
VS -->> BG: Return OK
```
## Version Lifecycle and Snapshots
Why keep old `Version` objects around if we have a `current` one? Because ongoing read operations or snapshots might still need them!
* **Reference Counting:** Each `Version` has a reference count (`refs_`). When `DBImpl::Get` uses a version, it calls `Ref()` (increment count) before starting the lookup and `Unref()` (decrement count) when finished.
* **Snapshots:** When you request a snapshot (`db->GetSnapshot()`), LevelDB essentially gives you a pointer to the `current` `Version` at that moment and increments its reference count. As long as you hold onto that snapshot, the corresponding `Version` object (and the SSTable files it refers to) won't be deleted, even if the `current` version advances due to subsequent writes and compactions. This provides a consistent point-in-time view of the data.
* **Cleanup:** When a `Version`'s reference count drops to zero (meaning no reads or snapshots are using it anymore), it can be safely deleted. The `VersionSet` also keeps track of which underlying SSTable files are no longer referenced by *any* active `Version` and can trigger their deletion from disk ([DBImpl::RemoveObsoleteFiles](04_dbimpl.md)).
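The reference-counting machinery itself is tiny. A simplified sketch from `db/version_set.cc`:
```c++
// --- Simplified from db/version_set.cc ---
void Version::Ref() { ++refs_; }

void Version::Unref() {
  assert(refs_ >= 1);
  --refs_;
  if (refs_ == 0) {
    // No read, iterator, or snapshot needs this catalog edition anymore.
    delete this;
  }
}
```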
## The MANIFEST File
The `MANIFEST` file is crucial for durability. It's a log file (like the [WAL](03_write_ahead_log__wal____logwriter_logreader.md), but for metadata changes) that stores the sequence of `VersionEdit` records.
When LevelDB starts (`DB::Open`), the `VersionSet::Recover` method reads the `MANIFEST` file from beginning to end. It starts with an empty initial state and applies each `VersionEdit` it reads, step-by-step, rebuilding the database's file state in memory. This ensures that LevelDB knows exactly which SSTable files were live when it last shut down (or crashed).
Occasionally, the `MANIFEST` file can grow large. LevelDB might then write a *snapshot* of the entire current state (all files in all levels) as a single large record into a *new* `MANIFEST` file and then switch subsequent edits to that new file. This prevents the recovery process from becoming too slow.
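Conceptually, recovery is just a replay loop over that log. Here's a heavily simplified sketch of `VersionSet::Recover` (the real method also restores `log_number_`, `next_file_number_`, and `last_sequence_`, and reports corruption):
```c++
// --- Conceptual sketch, heavily simplified from VersionSet::Recover ---
Builder builder(this, current_);  // Start from the base (empty) state
while (reader.ReadRecord(&record, &scratch) && status.ok()) {
  VersionEdit edit;
  status = edit.DecodeFrom(record);  // Parse one serialized VersionEdit
  if (status.ok()) {
    builder.Apply(&edit);            // Accumulate this change in memory
  }
}
// After replaying every edit, materialize the final state:
Version* v = new Version(this);
builder.SaveTo(v);
Finalize(v);       // Compute compaction scores for the recovered state
AppendVersion(v);  // This becomes the 'current' Version
```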
## Conclusion
`Version`, `VersionEdit`, and `VersionSet` form the core cataloguing system of LevelDB.
* **Version:** An immutable snapshot of which SSTable files exist at each level. Used by reads to find data.
* **VersionEdit:** A description of changes (files added/deleted, metadata updated) between versions. Persisted in the `MANIFEST` log.
* **VersionSet:** Manages the `current` Version, applies edits to create new versions, handles recovery from the `MANIFEST`, and manages metadata like file numbers and sequence numbers.
Together, they allow LevelDB to manage a constantly changing set of files on disk while providing consistent views for read operations and ensuring the database state can be recovered after a restart.
Now that we understand how LevelDB finds data (checking MemTables, then using the current `Version` to check SSTables via the `TableCache`), how does it provide a way to *scan* through data, not just get single keys?
Next up: [Chapter 7: Iterator](07_iterator.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
# Chapter 7: Iterator - Your Guide Through the Database
Welcome back! In [Chapter 6: Version & VersionSet](06_version___versionset.md), we learned how LevelDB keeps track of all the live SSTable files using `Version` objects and the `VersionSet`. This catalog helps LevelDB efficiently find a single key by looking first in the [MemTable](02_memtable.md) and then pinpointing the right [SSTables](01_table___sstable___tablecache.md) to check.
But what if you don't want just *one* key? What if you want to see *all* the key-value pairs in the database, or all the keys within a specific range?
## What's the Problem? Scanning Multiple Keys
Imagine you have a database storing user scores, with keys like `score:userA`, `score:userB`, `score:userC`, etc. How would you find all the users whose usernames start with 'user'? Or how would you list all scores from highest to lowest?
Calling `db->Get()` repeatedly for every possible key isn't practical or efficient. We need a way to easily **scan** or **traverse** through the key-value pairs stored in the database, in sorted order.
Furthermore, this scan needs to be smart. It has to combine the data from the current MemTable (the fast notepad), potentially an older immutable MemTable, and all the different SSTable files on disk. It also needs to correctly handle situations where a key was updated or deleted, showing you only the *latest* live version of the data, just like `Get` does.
## Iterator: Your Database Research Assistant
LevelDB provides the `Iterator` concept to solve this. Think of an `Iterator` as a **super-smart research assistant**.
You tell the assistant what you're looking for (e.g., "start from the beginning" or "find keys starting with 'user'"). The assistant then efficiently looks through the current notepad (`MemTable`), the previous notepad (`imm_`), and all the relevant books on the shelves (`SSTables`), using the latest catalog (`Version`).
As the assistant finds relevant entries, it presents them to you one by one, in perfect sorted order by key. Crucially, the assistant knows how to:
1. **Merge Sources:** Combine results from memory (MemTable) and disk (SSTables) seamlessly.
2. **Handle Versions:** If the same key exists in multiple places (e.g., an old value in an SSTable and a newer value in the MemTable), the assistant only shows you the *most recent* one based on the database's internal sequence numbers.
3. **Handle Deletions:** If a key has been deleted, the assistant knows to *skip* it entirely, even if older versions of the key exist in SSTables.
4. **Provide a Snapshot:** An iterator typically operates on a consistent snapshot of the database. Data added *after* the iterator was created won't suddenly appear during your scan.
The main iterator you interact with, obtained via `db->NewIterator()`, is often implemented internally by a class called `DBIter`. `DBIter` coordinates the work of lower-level iterators.
## How to Use an Iterator
Using an iterator is quite straightforward. Here's a typical pattern:
```c++
#include "leveldb/db.h"
#include "leveldb/iterator.h"
#include <iostream>
// ... assume db is an open LevelDB database ...
// 1. Create an iterator
leveldb::ReadOptions options;
// options.snapshot = db->GetSnapshot(); // Optional: Use a specific snapshot
leveldb::Iterator* it = db->NewIterator(options);
// 2. Position the iterator (e.g., seek to the first key >= "start_key")
std::string start_key = "user:";
it->Seek(start_key);
// 3. Loop through the keys
std::cout << "Keys starting with '" << start_key << "':" << std::endl;
for (; it->Valid(); it->Next()) {
leveldb::Slice key = it->key();
leveldb::Slice value = it->value();
// Optional: Stop if we go past the desired range
if (!key.starts_with(start_key)) {
break;
}
std::cout << key.ToString() << " => " << value.ToString() << std::endl;
}
// 4. Check for errors (optional but recommended)
if (!it->status().ok()) {
std::cerr << "Iterator error: " << it->status().ToString() << std::endl;
}
// 5. Clean up the iterator and snapshot (if used)
delete it;
// if (options.snapshot != nullptr) {
// db->ReleaseSnapshot(options.snapshot);
// }
```
**Explanation:**
1. **`db->NewIterator(options)`:** You ask the database for a new iterator. You can pass `ReadOptions`, optionally including a specific snapshot you obtained earlier using `db->GetSnapshot()`. If you don't provide a snapshot, the iterator uses an implicit snapshot of the database state at the time of creation.
2. **Positioning:**
* `it->Seek(slice)`: Moves the iterator to the first key-value pair whose key is greater than or equal to the `slice`.
* `it->SeekToFirst()`: Moves to the very first key-value pair in the database.
* `it->SeekToLast()`: Moves to the very last key-value pair.
3. **Looping:**
* `it->Valid()`: Returns `true` if the iterator is currently pointing to a valid key-value pair, `false` otherwise (e.g., if you've reached the end).
* `it->Next()`: Moves the iterator to the next key-value pair in sorted order.
* `it->Prev()`: Moves to the previous key-value pair (less common, but supported).
* `it->key()`: Returns a `Slice` representing the current key.
* `it->value()`: Returns a `Slice` representing the current value. **Important:** The `Slice`s returned by `key()` and `value()` are only valid until the next call that modifies the iterator (`Next`, `Prev`, `Seek`, etc.). If you need to keep the data longer, make a copy (e.g., `key.ToString()`).
4. **`it->status()`:** After the loop, check this to see if any errors occurred during iteration (e.g., disk corruption).
5. **`delete it;`:** Crucially, you **must** delete the iterator when you're done with it to free up resources. If you used an explicit snapshot, release it too.
This simple interface lets you scan through potentially vast amounts of data spread across memory and disk files without needing to know the complex details of where each piece resides.
## Under the Hood: Merging and Filtering
How does the iterator provide this unified, sorted view? It doesn't load everything into memory! Instead, it uses a clever strategy involving **merging** and **filtering**.
1. **Gather Internal Iterators:** When you call `db->NewIterator()`, the `DBImpl` asks for iterators from all the relevant sources, based on the current [Version](06_version___versionset.md):
* An iterator for the active `MemTable`.
* An iterator for the immutable `imm_` (if it exists).
* Iterators for all the files in Level-0.
* A special "concatenating" iterator for Level-1 (which opens SSTable files lazily as needed).
* Similar concatenating iterators for Level-2, Level-3, etc.
2. **Create MergingIterator:** These individual iterators are then passed to a `MergingIterator`. The `MergingIterator` acts like a zipper, taking multiple sorted streams and producing a single output stream that is also sorted. It keeps track of the current position in each input iterator and always yields the smallest key currently available across all inputs. (A minimal sketch of this idea follows this list.)
3. **Wrap with DBIter:** The `MergingIterator` produces *internal* keys (with sequence numbers and types). This merged stream is then wrapped by the `DBIter`. `DBIter` is the "research assistant" we talked about. It reads the stream from the `MergingIterator` and performs the final filtering:
* It compares the sequence number of each internal key with the iterator's snapshot sequence number. Keys newer than the snapshot are ignored.
* It keeps track of the current user key. If it sees multiple versions of the same user key, it only considers the one with the highest sequence number (that's still <= the snapshot sequence).
* If the most recent entry for a user key is a deletion marker (`kTypeDeletion`), it skips that key entirely.
* Only when it finds a valid, non-deleted key (`kTypeValue`) with the highest sequence number for that user key (within the snapshot) does it make that key/value available via `it->key()` and `it->value()`.
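Here's a minimal two-way sketch of the merging idea. This is an illustrative simplification, not LevelDB's actual class; the real `MergingIterator` (in `table/merger.cc`) handles N children plus `Seek` and backward iteration:
```c++
#include "leveldb/comparator.h"
#include "leveldb/iterator.h"

// Merges two already-sorted iterators into one sorted stream.
// The caller retains ownership of the child iterators.
class TwoWayMergeIterator {
 public:
  TwoWayMergeIterator(leveldb::Iterator* a, leveldb::Iterator* b,
                      const leveldb::Comparator* cmp)
      : a_(a), b_(b), cmp_(cmp) {}

  void SeekToFirst() {
    a_->SeekToFirst();
    b_->SeekToFirst();
    PickSmallest();
  }
  bool Valid() const { return current_ != nullptr; }
  leveldb::Slice key() const { return current_->key(); }
  leveldb::Slice value() const { return current_->value(); }

  void Next() {
    current_->Next();  // Advance only the child we just yielded from
    PickSmallest();
  }

 private:
  void PickSmallest() {
    if (!a_->Valid()) {
      current_ = b_->Valid() ? b_ : nullptr;  // nullptr == both exhausted
    } else if (!b_->Valid()) {
      current_ = a_;
    } else {
      // Whichever child currently sits on the smaller key goes first.
      current_ = (cmp_->Compare(a_->key(), b_->key()) <= 0) ? a_ : b_;
    }
  }

  leveldb::Iterator* a_;
  leveldb::Iterator* b_;
  const leveldb::Comparator* cmp_;
  leveldb::Iterator* current_ = nullptr;
};
```
Seeding both children and repeatedly yielding the smaller head is exactly the "zipper" behavior described above; LevelDB generalizes it to N children.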
**Sequence Diagram:**
```mermaid
sequenceDiagram
participant App as Application
participant DBImpl
participant MemTable as Active MemTable
participant ImmMemTable as Immutable MemTable
participant Version as Current Version
participant MergingIter as MergingIterator
participant DBIter
App->>DBImpl: NewIterator(options)
DBImpl->>MemTable: NewIterator()
MemTable-->>DBImpl: Return mem_iter
DBImpl->>ImmMemTable: NewIterator()
ImmMemTable-->>DBImpl: Return imm_iter
DBImpl->>Version: AddIterators(options) # Gets SSTable iterators
Version-->>DBImpl: Return sstable_iters_list
DBImpl->>MergingIter: Create(mem_iter, imm_iter, sstable_iters...)
MergingIter-->>DBImpl: Return merged_iter
DBImpl->>DBIter: Create(merged_iter, snapshot_seq)
DBIter-->>DBImpl: Return db_iter
DBImpl-->>App: Return db_iter (as Iterator*)
App->>DBIter: Seek("some_key")
DBIter->>MergingIter: Seek to internal key for "some_key"
Note right of DBIter: DBIter finds the first valid user entry >= "some_key"
DBIter-->>App: Iterator positioned
App->>DBIter: Valid()?
DBIter-->>App: true
App->>DBIter: key()
DBIter-->>App: Return "user_key_A"
App->>DBIter: Next()
DBIter->>MergingIter: Next() until user key changes
Note right of DBIter: DBIter skips older versions or deleted keys
DBIter->>MergingIter: Next() to find next user key's latest version
DBIter-->>App: Iterator positioned at next valid entry
```
## Code Dive: `DBImpl::NewIterator` and `DBIter`
Let's look at how this is initiated in the code.
**1. Creating the Iterator (`db_impl.cc`)**
When you call `db->NewIterator(options)`, it eventually calls `DBImpl::NewIterator`:
```c++
// --- File: db/db_impl.cc ---
Iterator* DBImpl::NewIterator(const ReadOptions& options) {
SequenceNumber latest_snapshot;
uint32_t seed; // Used for read sampling randomization
// (1) Create the internal merging iterator
Iterator* internal_iter = NewInternalIterator(options, &latest_snapshot, &seed);
// (2) Determine the sequence number for the snapshot
SequenceNumber snapshot_seq =
(options.snapshot != nullptr
? static_cast<const SnapshotImpl*>(options.snapshot)
->sequence_number()
: latest_snapshot);
// (3) Wrap the internal iterator with DBIter
return NewDBIterator(this, // Pass DBImpl pointer for read sampling
user_comparator(),
internal_iter,
snapshot_seq,
seed);
}
```
**Explanation:**
1. `NewInternalIterator`: This helper function (we'll glance at it next) creates the `MergingIterator` that combines MemTables and SSTables.
2. `snapshot_seq`: It figures out which sequence number to use. If the user provided an explicit `options.snapshot`, it uses that snapshot's sequence number. Otherwise, it uses the latest sequence number in the database when the iterator was created (`latest_snapshot`).
3. `NewDBIterator`: This function (defined in `db_iter.cc`) creates the `DBIter` object, passing it the underlying `internal_iter` and the `snapshot_seq` to use for filtering.
**2. Creating the Internal Iterator (`db_impl.cc`)**
The `NewInternalIterator` gathers all the source iterators:
```c++
// --- File: db/db_impl.cc ---
Iterator* DBImpl::NewInternalIterator(const ReadOptions& options,
SequenceNumber* latest_snapshot,
uint32_t* seed) {
mutex_.Lock(); // Need lock to access shared state (mem_, imm_, versions_)
*latest_snapshot = versions_->LastSequence();
*seed = ++seed_; // For random sampling
// Collect together all needed child iterators
std::vector<Iterator*> list;
// Add iterator for active MemTable
list.push_back(mem_->NewIterator());
mem_->Ref(); // Manage lifetime with ref counting
// Add iterator for immutable MemTable (if it exists)
if (imm_ != nullptr) {
list.push_back(imm_->NewIterator());
imm_->Ref();
}
// Add iterators for all SSTable files in the current Version
versions_->current()->AddIterators(options, &list);
versions_->current()->Ref();
// Create the MergingIterator
Iterator* internal_iter =
NewMergingIterator(&internal_comparator_, &list[0], list.size());
// Register cleanup function to Unref MemTables/Version when iterator is deleted
IterState* cleanup = new IterState(&mutex_, mem_, imm_, versions_->current());
internal_iter->RegisterCleanup(CleanupIteratorState, cleanup, nullptr);
mutex_.Unlock();
return internal_iter;
}
```
**Explanation:**
1. It locks the database mutex to safely access the current MemTables (`mem_`, `imm_`) and the current `Version`.
2. It creates iterators for `mem_` and `imm_` using their `NewIterator()` methods ([MemTable](02_memtable.md) uses a SkipList iterator).
3. It calls `versions_->current()->AddIterators(...)`. This method (in `version_set.cc`) adds iterators for Level-0 files and the special concatenating iterators for Levels 1+ to the `list`. See [Version & VersionSet](06_version___versionset.md).
4. `NewMergingIterator` creates the iterator that merges all sources in `list`.
5. `RegisterCleanup` ensures that the MemTables and Version are properly `Unref`'d when the iterator is eventually deleted.
6. It returns the `MergingIterator`.
**3. `DBIter` Filtering Logic (`db_iter.cc`)**
The `DBIter` class takes the `MergingIterator` and applies the filtering logic. Let's look at a simplified `Next()` method:
```c++
// --- File: db/db_iter.cc ---
void DBIter::Next() {
assert(valid_);
if (direction_ == kReverse) {
// ... code to switch from moving backward to forward ...
// Position iter_ at the first entry >= saved_key_
// Fall through to FindNextUserEntry...
direction_ = kForward;
} else {
// We are moving forward. Save the current user key so we can skip
// all other entries for it.
SaveKey(ExtractUserKey(iter_->key()), &saved_key_);
// Advance the internal iterator.
iter_->Next();
}
// Find the next user key entry that is visible at our sequence number.
FindNextUserEntry(true, &saved_key_);
}
// Find the next entry for a different user key, skipping deleted
// or older versions of the key in 'skip'.
void DBIter::FindNextUserEntry(bool skipping, std::string* skip) {
// Loop until we hit an acceptable entry
assert(iter_->Valid() || !valid_); // iter_ might be invalid if Next() moved past end
assert(direction_ == kForward);
do {
if (!iter_->Valid()) { // Reached end of internal iterator
valid_ = false;
return;
}
ParsedInternalKey ikey;
// Parse the internal key (key, sequence, type)
if (ParseKey(&ikey)) {
// Check if the sequence number is visible in our snapshot
if (ikey.sequence <= sequence_) {
// Check the type (Put or Deletion)
switch (ikey.type) {
case kTypeDeletion:
// This key is deleted. Save the user key so we skip
// any older versions of it we might encounter later.
SaveKey(ikey.user_key, skip);
skipping = true; // Ensure we skip older versions
break;
case kTypeValue:
// This is a potential result (a Put operation).
// Is it for the user key we are trying to skip?
if (skipping &&
user_comparator_->Compare(ikey.user_key, *skip) <= 0) {
// Yes, it's hidden by a newer deletion or is an older version
// of the key we just yielded. Skip it.
} else {
// Found a valid entry!
valid_ = true;
// Clear skip key since we found a new valid key
// saved_key_.clear(); // Done in Next() or Seek()
return; // Exit the loop, iterator is now positioned correctly.
}
break;
}
}
} else {
// Corrupted key, mark iterator as invalid
valid_ = false;
status_ = Status::Corruption("corrupted internal key in DBIter");
return;
}
// Current internal key was skipped (too new, deleted, hidden), move to next.
iter_->Next();
} while (true); // Loop until we return or reach the end
}
```
**Explanation:**
* The `Next()` method first handles switching direction if needed. If moving forward, it saves the current user key (`saved_key_`) so it can skip other entries for the same key. It then advances the underlying `iter_` (the `MergingIterator`).
* `FindNextUserEntry` is the core loop. It repeatedly gets the next entry from `iter_`.
* `ParseKey(&ikey)` decodes the internal key, sequence number, and type.
* It checks if `ikey.sequence <= sequence_` (the iterator's snapshot sequence number). If the entry is too new, it's skipped.
* If it's a `kTypeDeletion`, the user key is saved in `skip`, and the `skipping` flag is set to true. Any older entries for this `user_key` will be ignored.
* If it's a `kTypeValue`:
* It checks if `skipping` is true and if the current `ikey.user_key` is less than or equal to the key in `skip`. If so, it means this entry is hidden by a newer deletion or is an older version of a key we just processed, so it's skipped.
* Otherwise, this is the newest, visible version of this user key! The loop terminates, `valid_` is set to true, and the `DBIter` is now positioned at this entry.
* If the current entry from `iter_` was skipped for any reason, the loop continues by calling `iter_->Next()`.
This careful dance ensures that `DBIter` only exposes the correct, latest, non-deleted user key/value pairs according to the snapshot sequence number, while efficiently merging data from all underlying sources.
## Conclusion
LevelDB's `Iterator` provides a powerful and convenient way to scan through key-value pairs. It acts like a smart assistant, giving you a unified, sorted view across data stored in the `MemTable` and numerous `SSTable` files.
Under the hood, it uses a `MergingIterator` to combine multiple sorted sources and the `DBIter` wrapper to filter out deleted entries and older versions based on sequence numbers and the requested snapshot.
This ability to efficiently scan sorted data is not just useful for application queries, but it's also fundamental to how LevelDB maintains itself. How does LevelDB merge old SSTables and incorporate data flushed from the MemTable to keep the database structure efficient? It uses these very same iterator concepts!
Next up: [Chapter 8: Compaction](08_compaction.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)
# Chapter 8: Compaction - Keeping the Library Tidy
In [Chapter 7: Iterator](07_iterator.md), we saw how LevelDB provides iterators to give us a unified, sorted view of our data, cleverly merging information from the in-memory [MemTable](02_memtable.md) and the various [SSTable](01_table___sstable___tablecache.md) files on disk.
This works great, but think about what happens over time. Every time a MemTable fills up, it gets flushed to a *new* SSTable file in Level-0. If you have lots of writes, you'll quickly accumulate many small files in Level-0. Also, when you update or delete a key, LevelDB doesn't modify old SSTables; it just writes a *new* entry (a new value or a deletion marker) in a newer MemTable or SSTable. This means older files contain outdated or deleted data that's just taking up space.
## What's the Problem? A Messy, Inefficient Library
Imagine our library again. New notes and pamphlets (MemTable flushes) keep arriving and get dumped in a temporary pile (Level-0). Meanwhile, older books on the main shelves (higher levels) contain crossed-out paragraphs (deleted data) or outdated information (overwritten data).
This leads to several problems:
1. **Slow Reads:** To find a specific piece of information, the librarian might have to check *many* different pamphlets in the temporary pile (Level-0) before even getting to the main shelves. Reading from many files is slow.
2. **Wasted Space:** The library shelves are cluttered with books containing crossed-out sections or old editions that are no longer needed. This wastes valuable shelf space.
3. **Growing Number of Files:** The temporary pile (Level-0) just keeps growing, making it harder and harder to manage.
We need a process to periodically tidy up this library, organize the temporary pile into the main shelves, and remove the outdated information.
## Compaction: The Background Tidy-Up Crew
**Compaction** is LevelDB's background process that solves these problems. It's like the library staff who work quietly behind the scenes to keep the library organized and efficient.
Here's what compaction does:
1. **Selects Files:** It picks one or more SSTable files from a specific level (let's say Level-N). Often, this starts with files in Level-0.
2. **Finds Overlapping Files:** It identifies the files in the *next* level (Level-N+1) whose key ranges overlap with the selected files from Level-N.
3. **Merges and Filters:** It reads the key-value pairs from *all* these selected files (from both Level-N and Level-N+1) using iterators, much like the merging process we saw in [Chapter 7: Iterator](07_iterator.md). As it merges, it performs crucial filtering:
* It keeps only the *latest* version of each key (based on sequence numbers).
* It completely discards keys that have been deleted.
* It discards older versions of keys that have been updated.
4. **Writes New Files:** It writes the resulting stream of filtered, sorted key-value pairs into *new* SSTable files at Level-N+1. These new files are typically larger and contain only live data.
5. **Updates Catalog:** It updates the database's catalog ([Version & VersionSet](06_version___versionset.md)) to reflect the changes: the old input files (from Level-N and Level-N+1) are marked for deletion, and the new output files (in Level-N+1) are added.
6. **Deletes Old Files:** Finally, the old, now-obsolete input SSTable files are deleted from the disk.
**Analogy:** The library staff takes a batch of pamphlets from the temporary pile (Level-0) and finds the corresponding books on the main shelves (Level-1). They go through both, creating a new, clean edition of the book (new Level-1 SSTable) that incorporates the new information from the pamphlets, removes any crossed-out entries, and keeps only the latest version of each topic. Then, they discard the original pamphlets and the old version of the book.
This process happens continuously in the background, keeping the database structure efficient.
## Triggering Compaction: When to Tidy Up?
How does LevelDB decide when to run a compaction? The [DBImpl](04_dbimpl.md) checks if compaction is needed after writes or reads, or when background work finishes. It uses the [VersionSet](06_version___versionset.md) to determine this, primarily based on two conditions:
1. **Size Compaction:** Each level (except Level-0) has a target size limit. The `VersionSet` computes a "compaction score" for each level; a score >= 1 means the level has outgrown its limit and needs a size compaction (a sketch of this computation follows the list). This is the most common trigger. Level-0 is special: its score is based on the *number* of files rather than their total size, because too many files there significantly slow down reads.
* `config::kL0_CompactionTrigger`: Default is 4 files.
* Higher levels (1+): Trigger based on total bytes (`MaxBytesForLevel`).
2. **Seek Compaction:** A read that has to check a file but finds nothing there still pays for a disk seek. To bound this waste, LevelDB gives every file an `allowed_seeks` budget (roughly proportional to its size) and charges it whenever a `Get` probes the file and has to move on to another one. When the counter drops to zero, the file becomes a compaction candidate even if the level's size limit isn't reached; rewriting such frequently-probed files merges them into the next level so future reads no longer have to check them.
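As promised above, here is a rough sketch of how the per-level compaction score is computed, loosely following `VersionSet::Finalize` in `db/version_set.cc` (the `MaxBytesForLevel` signature is simplified for brevity):
```c++
// --- Simplified sketch based on db/version_set.cc ---
static double MaxBytesForLevel(int level) {
  double result = 10.0 * 1048576.0;  // ~10 MB for levels 0 and 1
  while (level > 1) {
    result *= 10;  // Each level is ~10x larger than the one above it
    level--;
  }
  return result;
}

void VersionSet::Finalize(Version* v) {
  int best_level = -1;
  double best_score = -1;
  for (int level = 0; level < config::kNumLevels - 1; level++) {
    double score;
    if (level == 0) {
      // Level-0 is scored by FILE COUNT: many files slow down reads.
      score = v->files_[level].size() /
              static_cast<double>(config::kL0_CompactionTrigger);
    } else {
      // Other levels are scored by total bytes vs. the size limit.
      score = static_cast<double>(TotalFileSize(v->files_[level])) /
              MaxBytesForLevel(level);
    }
    if (score > best_score) {
      best_level = level;
      best_score = score;
    }
  }
  v->compaction_level_ = best_level;
  v->compaction_score_ = best_score;  // >= 1 means "compaction needed"
}
```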
When `DBImpl::MaybeScheduleCompaction` detects that work is needed (and no other background work is running), it schedules the `DBImpl::BGWork` function to run on a background thread.
```c++
// --- Simplified from db/db_impl.cc ---
void DBImpl::MaybeScheduleCompaction() {
mutex_.AssertHeld(); // Must hold lock to check/change state
if (background_compaction_scheduled_) {
// Already scheduled
} else if (shutting_down_.load(std::memory_order_acquire)) {
// DB is closing
} else if (!bg_error_.ok()) {
// Background error stopped activity
} else if (imm_ == nullptr && // No MemTable flush needed AND
manual_compaction_ == nullptr && // No manual request AND
!versions_->NeedsCompaction()) { // <<-- VersionSet check!
// No work to be done: VersionSet says size/seek limits are okay.
} else {
// Work needs to be done! Schedule it.
background_compaction_scheduled_ = true;
env_->Schedule(&DBImpl::BGWork, this); // Ask Env to run BGWork later
}
}
// --- Simplified from db/version_set.h ---
// In VersionSet::NeedsCompaction()
bool NeedsCompaction() const {
Version* v = current_;
// Check score (size trigger) OR if a file needs compaction due to seeks
return (v->compaction_score_ >= 1) || (v->file_to_compact_ != nullptr);
}
```
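When the background thread actually runs, the entry point is a small trampoline that calls back into `DBImpl`. A trimmed sketch, based on `BGWork` and `BackgroundCall` in `db/db_impl.cc`:
```c++
// --- Simplified sketch based on db/db_impl.cc ---
void DBImpl::BGWork(void* db) {
  reinterpret_cast<DBImpl*>(db)->BackgroundCall();
}

void DBImpl::BackgroundCall() {
  MutexLock l(&mutex_);
  assert(background_compaction_scheduled_);
  if (!shutting_down_.load(std::memory_order_acquire) && bg_error_.ok()) {
    BackgroundCompaction();  // Flush imm_ or run the picked compaction
  }
  background_compaction_scheduled_ = false;
  // The compaction just done may have pushed some level over its limit,
  // so check whether another round is needed before going idle.
  MaybeScheduleCompaction();
  background_work_finished_signal_.SignalAll();
}
```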
## The Compaction Process: A Closer Look
Let's break down the steps involved when a background compaction runs (specifically a major compaction between levels N and N+1):
**1. Picking the Compaction (`VersionSet::PickCompaction`)**
The first step is deciding *what* to compact. `VersionSet::PickCompaction` is responsible for this:
* It checks if a seek-based compaction is pending (`file_to_compact_ != nullptr`). If so, it chooses that file and its level.
* Otherwise, it looks at the `compaction_score_` and `compaction_level_` pre-calculated for the current [Version](06_version___versionset.md). If the score is >= 1, it chooses that level for a size-based compaction.
* It creates a `Compaction` object to hold information about this task.
* It selects an initial set of files from the chosen level (Level-N) to compact. For size compactions, it often picks the file just after the `compact_pointer_` for that level (a bookmark remembering where the last compaction ended) to ensure work spreads across the key range over time.
* For Level-0, since files can overlap, it expands this initial set to include *all* Level-0 files that overlap with the initially chosen file(s).
```c++
// --- Simplified from db/version_set.cc ---
Compaction* VersionSet::PickCompaction() {
Compaction* c = nullptr;
int level;
// Check for seek-triggered compaction first
const bool seek_compaction = (current_->file_to_compact_ != nullptr);
if (seek_compaction) {
level = current_->file_to_compact_level_;
c = new Compaction(options_, level);
c->inputs_[0].push_back(current_->file_to_compact_); // Add the specific file
} else {
// Check for size-triggered compaction
const bool size_compaction = (current_->compaction_score_ >= 1);
if (!size_compaction) {
return nullptr; // No compaction needed
}
level = current_->compaction_level_;
c = new Compaction(options_, level);
// Pick starting file in chosen level (often based on compact_pointer_)
// ... logic to select initial file(s) ...
// c->inputs_[0].push_back(chosen_file);
}
c->input_version_ = current_; // Remember which Version we are compacting
c->input_version_->Ref();
// Expand Level-0 inputs if necessary due to overlap
if (level == 0) {
InternalKey smallest, largest;
GetRange(c->inputs_[0], &smallest, &largest); // Find range of initial file(s)
// Find ALL L0 files overlapping that range
current_->GetOverlappingInputs(0, &smallest, &largest, &c->inputs_[0]);
assert(!c->inputs_[0].empty());
}
// Now figure out the overlapping files in the next level (Level+1)
SetupOtherInputs(c);
return c;
}
```
**2. Setting Up Inputs (`VersionSet::SetupOtherInputs`)**
Once the initial Level-N files are chosen, `SetupOtherInputs` figures out the rest (a simplified sketch follows the list):
* It determines the smallest and largest keys covered by the Level-N input files.
* It finds all files in Level-(N+1) that overlap this key range. These become `c->inputs_[1]`.
* It might slightly expand the Level-N inputs if doing so allows including more Level-N files without pulling in any *additional* Level-(N+1) files (this can make compactions more efficient).
* It finds all files in Level-(N+2) that overlap the *total* key range of the compaction. These are the "grandparents". This is important to prevent creating huge files in Level-(N+1) that would overlap too much data in Level-(N+2), making future compactions expensive.
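Here is a simplified sketch of these steps, based on `VersionSet::SetupOtherInputs` in `db/version_set.cc` (the input-expansion and boundary-file refinements are omitted):
```c++
// --- Simplified sketch based on db/version_set.cc ---
void VersionSet::SetupOtherInputs(Compaction* c) {
  const int level = c->level();
  InternalKey smallest, largest;
  // Key range covered by the chosen Level-N input files.
  GetRange(c->inputs_[0], &smallest, &largest);
  // All Level-(N+1) files overlapping that range become inputs_[1].
  current_->GetOverlappingInputs(level + 1, &smallest, &largest,
                                 &c->inputs_[1]);
  // The combined range of BOTH input sets determines the "grandparents"
  // in Level-(N+2), used later to limit overlap of the output files.
  InternalKey all_start, all_limit;
  GetRange2(c->inputs_[0], c->inputs_[1], &all_start, &all_limit);
  if (level + 2 < config::kNumLevels) {
    current_->GetOverlappingInputs(level + 2, &all_start, &all_limit,
                                   &c->grandparents_);
  }
  // Bookmark where this compaction ends so the next size compaction
  // for this level resumes just after this key range.
  compact_pointer_[level] = largest.Encode().ToString();
}
```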
**3. Performing the Work (`DBImpl::DoCompactionWork`)**
This is where the main merging happens. It runs on the background thread, and importantly, it **releases the main database lock** (`mutex_.Unlock()`) while doing the heavy I/O.
* **Input Iterator:** Creates a `MergingIterator` ([Chapter 7: Iterator](07_iterator.md)) that reads from all input files (Level-N and Level-N+1) as a single sorted stream (`versions_->MakeInputIterator(compact)`).
* **Snapshot:** Determines the oldest sequence number any live snapshot still needs (`compact->smallest_snapshot`). Entries at or below this sequence number can be dropped once a newer entry for the same key makes them invisible.
* **Loop:** Iterates through the `MergingIterator`:
* Reads the next internal key/value.
* **Parses Key:** Extracts user key, sequence number, and type.
* **Checks for Stop:** Decides if the current output file should be finished and a new one started (e.g., due to size limits or too much overlap with grandparents).
* **Drop Logic:** Determines if the current entry should be dropped:
* Has this compaction already seen a *newer* entry for the same user key, with a sequence number at or below `smallest_snapshot`? Then no reader (current or snapshot) can ever observe this entry, so it is dropped. This covers overwritten values and anything shadowed by a newer write.
* Is it a deletion marker at or below `smallest_snapshot` for a key with no data in any deeper level (`IsBaseLevelForKey`)? Then the tombstone has finished its job and is dropped as well.
* **Keep Logic:** If the entry is not dropped:
* Opens a new output SSTable file in Level-(N+1) if one isn't already open (`OpenCompactionOutputFile`).
* Adds the key/value pair to the `TableBuilder` (`compact->builder->Add`).
* Updates the smallest/largest keys for the output file metadata.
* Closes the output file if it reaches the target size (`FinishCompactionOutputFile`).
* Moves to the next input entry (`input->Next()`).
* **Finish:** Writes the last output file.
* **Status:** Checks for errors from the input iterator or file writes.
```c++
// --- Highly simplified loop from db/db_impl.cc DoCompactionWork ---
// Create iterator over Level-N and Level-N+1 input files
Iterator* input = versions_->MakeInputIterator(compact->compaction);
input->SeekToFirst();
// ... Release Mutex ...
while (input->Valid() && !shutting_down_) {
Slice key = input->key();
Slice value = input->value();
// Should we finish the current output file and start a new one?
if (compact->compaction->ShouldStopBefore(key) && compact->builder != nullptr) {
status = FinishCompactionOutputFile(compact, input);
// ... handle status ...
}
// Should we drop this key/value pair?
bool drop = false;
if (ParseInternalKey(key, &ikey)) {
// Logic based on ikey.sequence, ikey.type, smallest_snapshot,
// last_sequence_for_key, IsBaseLevelForKey...
// drop = true if this entry is deleted, shadowed, or obsolete.
} else {
// Corrupt key: keep it, so errors are not silently hidden.
}
if (!drop) {
// Open output file if needed
if (compact->builder == nullptr) {
status = OpenCompactionOutputFile(compact);
// ... handle status ...
}
// Add key/value to the output file being built
compact->builder->Add(key, value);
// ... update output file metadata (smallest/largest key) ...
// Close output file if it's big enough
if (compact->builder->FileSize() >= compact->compaction->MaxOutputFileSize()) {
status = FinishCompactionOutputFile(compact, input);
// ... handle status ...
}
}
// Advance to the next key in the merged input stream
input->Next();
}
// ... Finish the last output file ...
// ... Check input iterator status ...
delete input;
// ... Re-acquire Mutex ...
```
**4. Installing Results (`DBImpl::InstallCompactionResults`)**
If the compaction work finished successfully (a simplified sketch follows the list):
* A `VersionEdit` is created.
* It records the deletion of all input files (from Level-N and Level-N+1).
* It records the addition of all the newly created output files (in Level-N+1), including their file numbers, sizes, and key ranges.
* `VersionSet::LogAndApply` is called to:
* Write the `VersionEdit` to the `MANIFEST` file.
* Create a new `Version` reflecting these changes.
* Make this new `Version` the `current` one.
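In code, this boils down to filling in a `VersionEdit` and handing it to the `VersionSet`. A simplified sketch, following `DBImpl::InstallCompactionResults` in `db/db_impl.cc`:
```c++
// --- Simplified sketch based on db/db_impl.cc ---
Status DBImpl::InstallCompactionResults(CompactionState* compact) {
  mutex_.AssertHeld();
  // Record deletion of every input file (Level-N and Level-N+1).
  compact->compaction->AddInputDeletions(compact->compaction->edit());
  // Record each newly written output file in Level-(N+1).
  const int level = compact->compaction->level();
  for (size_t i = 0; i < compact->outputs.size(); i++) {
    const CompactionState::Output& out = compact->outputs[i];
    compact->compaction->edit()->AddFile(level + 1, out.number,
                                         out.file_size, out.smallest,
                                         out.largest);
  }
  // Persist the edit to the MANIFEST and install the new Version.
  return versions_->LogAndApply(compact->compaction->edit(), &mutex_);
}
```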
**5. Cleaning Up (`DBImpl::RemoveObsoleteFiles`)**
After the new `Version` is successfully installed (a simplified sketch follows the list):
* `DBImpl` calls `RemoveObsoleteFiles`.
* This function gets the list of all files needed by *any* live `Version` (including those held by snapshots or iterators).
* It compares this list with the actual files in the database directory.
* Any file that exists on disk but is *not* in the live set (like the input files from the just-completed compaction) is deleted from the filesystem.
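A simplified sketch of the cleanup, following `DBImpl::RemoveObsoleteFiles` in `db/db_impl.cc` (the keep rules for log, manifest, and temp files are elided):
```c++
// --- Simplified sketch based on db/db_impl.cc ---
void DBImpl::RemoveObsoleteFiles() {
  mutex_.AssertHeld();
  // Gather every file number still referenced by any live Version,
  // plus files currently being written by in-flight compactions.
  std::set<uint64_t> live = pending_outputs_;
  versions_->AddLiveFiles(&live);

  std::vector<std::string> filenames;
  env_->GetChildren(dbname_, &filenames);  // List the db directory
  uint64_t number;
  FileType type;
  for (const std::string& filename : filenames) {
    if (ParseFileName(filename, &number, &type)) {
      bool keep = true;
      if (type == kTableFile) {
        // Keep an SSTable only if some live Version still needs it.
        keep = (live.find(number) != live.end());
      }
      // ... log, manifest, and temp files have their own keep rules ...
      if (!keep) {
        env_->RemoveFile(dbname_ + "/" + filename);  // Obsolete: delete
      }
    }
  }
}
```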
**Compaction Flow Diagram:**
```mermaid
sequenceDiagram
participant DBImplBG as Background Thread
participant VS as VersionSet
participant Version as Current Version
participant InputIter as Merging Iterator
participant Builder as TableBuilder
participant Manifest as MANIFEST Log
participant FS as File System
DBImplBG->>VS: PickCompaction()
VS->>Version: Find files based on score/seeks
VS-->>DBImplBG: Return Compaction object 'c'
DBImplBG->>VS: MakeInputIterator(c)
VS->>Version: Get iterators for input files (L-N, L-N+1)
VS-->>DBImplBG: Return InputIter
DBImplBG->>InputIter: SeekToFirst()
Note over DBImplBG: Releases DB Mutex
loop While InputIter.Valid()
DBImplBG->>InputIter: key(), value()
alt Keep Entry
DBImplBG->>Builder: Open File / Add(key, value)
DBImplBG->>Builder: Finish File if needed
else Drop Entry
Note over DBImplBG: Skip Add() call
end
DBImplBG->>InputIter: Next()
end
Note over DBImplBG: Re-acquires DB Mutex
DBImplBG->>VS: LogAndApply(edit describing changes)
VS->>Manifest: AddRecord(edit)
Manifest-->>VS: OK
VS->>VS: Create New Version, make it current
VS-->>DBImplBG: OK
DBImplBG->>DBImplBG: RemoveObsoleteFiles()
DBImplBG->>FS: Delete old input SSTables
```
## Conclusion
Compaction is the essential background process that keeps LevelDB performant and prevents space usage from growing indefinitely due to old data. It intelligently merges files between levels, filtering out deleted and overwritten entries while preserving a consistent view for readers.
* It's triggered by level size or file access patterns.
* It uses iterators to merge input files efficiently.
* It drops obsolete data based on sequence numbers and deletion markers.
* It writes new, clean SSTable files to the next level.
* It atomically updates the database state using `VersionEdit`s, the `MANIFEST`, and the `VersionSet`.
Understanding compaction helps explain how LevelDB achieves good performance despite its append-only (immutable SSTable) design.
But how does LevelDB manage those sequence numbers and deletion markers internally? How does it combine a user's key with this metadata?
Next up: [Chapter 9: InternalKey & DBFormat](09_internalkey___dbformat.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

@@ -0,0 +1,313 @@
# Chapter 9: InternalKey & DBFormat - LevelDB's Internal Bookkeeping
Welcome to the final chapter of our deep dive into LevelDB's core components! In [Chapter 8: Compaction](08_compaction.md), we saw how LevelDB keeps its storage tidy by merging and rewriting [SSTables](01_table___sstable___tablecache.md) in the background. This compaction process relies heavily on being able to correctly compare different versions of the same key and discard old or deleted data.
But how does LevelDB know which version of a key is newer? If you write `("mykey", "value1")` and later `("mykey", "value2")`, how does LevelDB know that `value2` is the current one? And how does it handle `Delete("mykey")`? It can't just erase entries from immutable SSTable files.
## What's the Problem? Tracking Versions and Deletions
Imagine a simple library catalog that only lists book titles (user keys) and their shelf locations (user values).
1. You add "Adventures of Tom Sawyer" on Shelf A. Catalog: `("Tom Sawyer", "Shelf A")`
2. Later, you move it to Shelf B. If you just add `("Tom Sawyer", "Shelf B")`, how do you know Shelf A is wrong? The catalog now has two entries!
3. Later still, you remove the book entirely. How do you mark this in the catalog?
Just storing the user's key and value isn't enough. LevelDB needs extra internal bookkeeping information attached to every entry to handle updates, deletions, and also [Snapshots](07_iterator.md) (reading the database as it was at a specific point in time).
## The Solution: Sequence Numbers and Value Types
LevelDB solves this by adding two extra pieces of information to every key-value pair internally:
1. **Sequence Number:** Think of this like a **unique version number** or a **timestamp** assigned to every modification. Every time you `Put` or `Delete` data (usually as part of a [WriteBatch](05_writebatch.md)), LevelDB assigns a strictly increasing sequence number to that operation. A higher sequence number means the operation happened more recently. This number increments globally for the entire database.
2. **Value Type:** This is a simple flag indicating whether an entry represents a **value** or a **deletion**.
* `kTypeValue`: Represents a regular key-value pair resulting from a `Put`.
* `kTypeDeletion`: Represents a "tombstone" marker indicating that a key was deleted by a `Delete` operation.
## InternalKey: The Full Story
LevelDB combines the user's key with these two extra pieces of information into a structure called an **InternalKey**.
**InternalKey = `user_key` + `sequence_number` + `value_type`**
This `InternalKey` is what LevelDB *actually* stores and sorts within the [MemTable](02_memtable.md) and [SSTables](01_table___sstable___tablecache.md). When you ask LevelDB for `Get("mykey")`, it internally searches for `InternalKey`s associated with `"mykey"` and uses the sequence numbers and value types to figure out the correct, most recent state.
## Sorting InternalKeys: The Magic Ingredient
How `InternalKey`s are sorted is crucial for LevelDB's efficiency. They are sorted based on the following rules:
1. **User Key:** First, compare the `user_key` part using the standard comparator you configured for the database (e.g., lexicographical order). Keys `apple` come before `banana`.
2. **Sequence Number (Descending):** If the user keys are the same, compare the `sequence_number` in **DESCENDING** order. The entry with the *highest* sequence number comes *first*.
3. **Value Type (Descending):** If user keys and sequence numbers are the same (which shouldn't normally happen for distinct operations), compare the `value_type` in **DESCENDING** order (`kTypeValue` comes before `kTypeDeletion`).
**Why sort sequence numbers descending?** Because when LevelDB looks for a user key, it wants to find the *most recent* version first. By sorting the highest sequence number first, a simple search or iteration naturally encounters the latest state of the key immediately.
**Example:** Let's revisit our `Put`/`Put`/`Delete` example for `mykey`:
1. `Put("mykey", "v1")` -> gets Sequence = 5 -> InternalKey: `("mykey", 5, kTypeValue)`
2. `Put("mykey", "v2")` -> gets Sequence = 10 -> InternalKey: `("mykey", 10, kTypeValue)`
3. `Delete("mykey")` -> gets Sequence = 15 -> InternalKey: `("mykey", 15, kTypeDeletion)`
When these are sorted according to the rules, the order is:
1. `("mykey", 15, kTypeDeletion)` (Highest sequence)
2. `("mykey", 10, kTypeValue)`
3. `("mykey", 5, kTypeValue)` (Lowest sequence)
Now, when you call `Get("mykey")`:
* LevelDB searches for entries matching `mykey`.
* It finds `("mykey", 15, kTypeDeletion)` first because it sorts first.
* It sees the `kTypeDeletion` marker and immediately knows the key is deleted, returning `NotFound` without even needing to look at the older versions (`v2` and `v1`).
**Snapshots:** Snapshots work by using a specific sequence number. If you take a snapshot at sequence 12, a `Get("mykey")` using that snapshot would ignore sequence 15. It would find `("mykey", 10, kTypeValue)` first, see it's `kTypeValue` and `sequence <= 12`, and return `"v2"`.
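Here is how that plays out through the public API. This is a sketch: the sequence numbers in the comments are illustrative, not values your program can observe directly.
```c++
#include <cassert>
#include <string>
#include "leveldb/db.h"

void SnapshotExample(leveldb::DB* db) {
  leveldb::WriteOptions w;
  db->Put(w, "mykey", "v1");     // internally, e.g., sequence 5
  db->Put(w, "mykey", "v2");     // sequence 10
  const leveldb::Snapshot* snap = db->GetSnapshot();  // "frozen" at seq 10
  db->Delete(w, "mykey");        // sequence 15 (tombstone)

  std::string value;
  leveldb::ReadOptions r;
  // A current read finds the tombstone (seq 15) first -> NotFound.
  assert(db->Get(r, "mykey", &value).IsNotFound());
  // A snapshot read ignores entries newer than seq 10 -> sees "v2".
  r.snapshot = snap;
  assert(db->Get(r, "mykey", &value).ok() && value == "v2");
  db->ReleaseSnapshot(snap);
}
```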
## The `dbformat` Module: Defining the Rules
The code that defines the `InternalKey` structure, the `ValueType` enum, sequence numbers, helper functions for manipulating them, and crucial constants is located in `dbformat.h` and `dbformat.cc`.
**1. Key Structures and Constants (`dbformat.h`)**
This header file defines the core types:
```c++
// --- File: db/dbformat.h ---
namespace leveldb {
// Value types: Deletion or Value
enum ValueType { kTypeDeletion = 0x0, kTypeValue = 0x1 };
// ValueType used for seeking. (Uses highest type value)
static const ValueType kValueTypeForSeek = kTypeValue;
// Type for sequence numbers. 56 bits available.
typedef uint64_t SequenceNumber;
// Max possible sequence number.
static const SequenceNumber kMaxSequenceNumber = ((0x1ull << 56) - 1);
// Structure to hold the parsed parts of an InternalKey
struct ParsedInternalKey {
Slice user_key;
SequenceNumber sequence;
ValueType type;
// Constructors... DebugString()...
};
// Helper class to manage the encoded string representation
class InternalKey {
private:
std::string rep_; // Holds the encoded key: user_key + seq/type tag
public:
// Constructors... DecodeFrom()... Encode()... user_key()...
InternalKey(const Slice& user_key, SequenceNumber s, ValueType t);
};
// ... other definitions like LookupKey, InternalKeyComparator ...
} // namespace leveldb
```
**Explanation:**
* Defines `ValueType` enum (`kTypeDeletion`, `kTypeValue`).
* Defines `SequenceNumber` (a 64-bit integer, but only 56 bits are used, leaving 8 bits for the type).
* `ParsedInternalKey`: A temporary struct holding the three components separately.
* `InternalKey`: A class that usually stores the *encoded* form (as a single string) for efficiency.
**2. Encoding and Parsing (`dbformat.cc`, `dbformat.h`)**
LevelDB needs to combine the three parts (`user_key`, `sequence`, `type`) into a single `Slice` (a pointer + length, representing a string) for storage and comparison, and then parse them back out. The sequence and type are packed together into the last 8 bytes of the internal key string.
```c++
// --- File: db/dbformat.h --- (Inline functions)
// Combine sequence and type into 8 bytes (64 bits)
static uint64_t PackSequenceAndType(uint64_t seq, ValueType t) {
// seq uses upper 56 bits, type uses lower 8 bits
return (seq << 8) | t;
}
// Extract the user_key part from an encoded internal key
inline Slice ExtractUserKey(const Slice& internal_key) {
assert(internal_key.size() >= 8);
return Slice(internal_key.data(), internal_key.size() - 8); // All bytes EXCEPT the last 8
}
// --- File: db/dbformat.cc ---
// Append the encoded internal key to a string 'result'
void AppendInternalKey(std::string* result, const ParsedInternalKey& key) {
result->append(key.user_key.data(), key.user_key.size()); // Append user key
// Append the 8-byte packed sequence and type
PutFixed64(result, PackSequenceAndType(key.sequence, key.type));
}
// Parse an encoded internal key 'internal_key' into 'result'
bool ParseInternalKey(const Slice& internal_key, ParsedInternalKey* result) {
const size_t n = internal_key.size();
if (n < 8) return false; // Must have the 8-byte trailer
// Decode the 8-byte trailer
uint64_t num = DecodeFixed64(internal_key.data() + n - 8);
uint8_t c = num & 0xff; // Lower 8 bits are the type
result->sequence = num >> 8; // Upper 56 bits are sequence
result->type = static_cast<ValueType>(c);
result->user_key = Slice(internal_key.data(), n - 8); // The rest is user key
return (c <= static_cast<uint8_t>(kTypeValue)); // Basic validation
}
```
**Explanation:**
* `PackSequenceAndType`: Shifts the sequence number left by 8 bits and combines it with the 1-byte type.
* `AppendInternalKey`: Builds the string representation: user key bytes followed by the 8-byte packed sequence/type.
* `ExtractUserKey`: Returns a slice pointing to the user key portion (all but the last 8 bytes).
* `ParseInternalKey`: Does the reverse of `AppendInternalKey`, extracting the parts from the encoded slice (a small round-trip example follows).
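To make the encoding concrete, here is a small round-trip sketch. It assumes access to the internal `db/dbformat.h` header, and the sequence number 10 is purely illustrative:
```c++
#include <cassert>
#include <string>
#include "db/dbformat.h"

void RoundTripExample() {
  using namespace leveldb;
  std::string encoded;
  ParsedInternalKey in(Slice("mykey"), /*sequence=*/10, kTypeValue);
  AppendInternalKey(&encoded, in);    // "mykey" + 8-byte packed tag
  assert(encoded.size() == 5 + 8);    // user key bytes + trailer

  ParsedInternalKey out;
  assert(ParseInternalKey(Slice(encoded), &out));
  assert(out.user_key == Slice("mykey"));
  assert(out.sequence == 10 && out.type == kTypeValue);
}
```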
**3. Comparing Internal Keys (`dbformat.cc`)**
The `InternalKeyComparator` uses the user-provided comparator for the user keys and then implements the descending sequence number logic.
```c++
// --- File: db/dbformat.cc ---
int InternalKeyComparator::Compare(const Slice& akey, const Slice& bkey) const {
// 1. Compare user keys using the user's comparator
int r = user_comparator_->Compare(ExtractUserKey(akey), ExtractUserKey(bkey));
if (r == 0) {
// User keys are equal, compare sequence numbers (descending)
// Decode the 8-byte tag (seq+type) from the end of each key
const uint64_t anum = DecodeFixed64(akey.data() + akey.size() - 8);
const uint64_t bnum = DecodeFixed64(bkey.data() + bkey.size() - 8);
// Higher sequence number should come first (negative result)
if (anum > bnum) {
r = -1;
} else if (anum < bnum) {
r = +1;
}
// If sequence numbers are also equal, type decides (descending,
// but packed value comparison handles this implicitly).
}
return r;
}
```
**Explanation:** This function first compares user keys. If they differ, that result is returned. If they are the same, it decodes the 8-byte tag from both keys and compares them. Since a higher sequence number results in a larger packed `uint64_t` value, comparing `anum` and `bnum` directly and flipping the sign (`-1` if `anum > bnum`, `+1` if `anum < bnum`) achieves the desired descending order for sequence numbers.
**4. Seeking with LookupKey (`dbformat.h`, `dbformat.cc`)**
When you call `Seek(target_key)` on an iterator (or `Get` looks up a key), LevelDB needs to find the internal key for the latest version of `target_key` at or below the snapshot's sequence number. Because internal keys sort by *descending* sequence and type, the seek target must be encoded carefully: it has to sort *after* every entry that is too new for the snapshot but *before* all the visible ones, or the seek could skip an entry it should have found.
`LookupKey` creates a specially formatted key for seeking in MemTables and internal iterators.
```c++
// --- File: db/dbformat.h ---
// A helper class useful for DBImpl::Get() and Iterator::Seek()
class LookupKey {
public:
// Create a key for looking up user_key at snapshot 'sequence'.
LookupKey(const Slice& user_key, SequenceNumber sequence);
~LookupKey();
// Key for MemTable lookup (includes length prefix for internal key)
Slice memtable_key() const;
// Key for Internal Iterator lookup (user_key + seq/type tag)
Slice internal_key() const;
// User key part
Slice user_key() const;
private:
const char* start_; // Beginning of allocated buffer
const char* kstart_; // Beginning of user_key portion
const char* end_; // End of allocated buffer
char space_[200]; // Avoid heap allocation for short keys
};
// --- File: db/dbformat.cc --- (Simplified Constructor Logic)
LookupKey::LookupKey(const Slice& user_key, SequenceNumber s) {
size_t usize = user_key.size();
// Need space for: internal key length, user key, 8-byte tag
size_t needed = VarintLength(usize + 8) + usize + 8;
char* dst = (needed <= sizeof(space_)) ? space_ : new char[needed];
start_ = dst;
// Encode length of internal key (user_key size + 8)
dst = EncodeVarint32(dst, usize + 8);
kstart_ = dst; // Mark start of internal key part
// Copy user key data
std::memcpy(dst, user_key.data(), usize);
dst += usize;
// Encode the 8-byte tag: Use the target sequence 's' BUT use
// kValueTypeForSeek (which is kTypeValue, the highest type value).
EncodeFixed64(dst, PackSequenceAndType(s, kValueTypeForSeek));
dst += 8;
end_ = dst; // Mark end of buffer
}
```
**Explanation:**
* A `LookupKey` bundles the `user_key` with the target `sequence` number.
* Critically, when creating the 8-byte tag, it uses `kValueTypeForSeek`. Because internal keys are sorted by user key, then *descending* sequence, then *descending* type, seeking for `(user_key, sequence, kValueTypeForSeek)` ensures we find the *first* entry whose user key matches and whose sequence number is less than or equal to the target `sequence`. This correctly handles the descending sort order during seeks (see the short usage sketch below).
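A short usage sketch (again assuming the internal `db/dbformat.h` header; the sequence number 12 is illustrative):
```c++
// Build a seek target for "mykey" as of snapshot sequence 12.
leveldb::LookupKey lkey(leveldb::Slice("mykey"), 12);
leveldb::Slice mkey = lkey.memtable_key();  // varint length + key + tag
leveldb::Slice ikey = lkey.internal_key();  // key + tag (no length prefix)
assert(lkey.user_key() == leveldb::Slice("mykey"));
// Seeking a MemTable or internal iterator to ikey positions it at the
// first entry for "mykey" whose sequence number is <= 12.
```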
**5. Configuration Constants (`dbformat.h`)**
`dbformat.h` also defines key constants that control LevelDB's behavior, especially related to compaction triggers:
```c++
// --- File: db/dbformat.h ---
namespace config {
static const int kNumLevels = 7; // Number of levels in the LSM tree
// Level-0 compaction is started when we hit this many files.
static const int kL0_CompactionTrigger = 4;
// Soft limit on number of level-0 files. We slow down writes at this point.
static const int kL0_SlowdownWritesTrigger = 8;
// Maximum number of level-0 files. We stop writes at this point.
static const int kL0_StopWritesTrigger = 12;
// Maximum level to push a new memtable compaction to if it doesn't overlap.
static const int kMaxMemCompactLevel = 2;
// ... other constants ...
} // namespace config
```
**Explanation:** These constants define parameters like the number of levels and the file count thresholds in Level-0 that trigger compactions or slow down/stop writes. They are part of the database "format" because changing them affects performance and behavior.
**Internal Key Structure Diagram**
```mermaid
graph TB
A[User Application] --> |"Put('key', 'value')"| B(LevelDB)
B --> |"Assigns Seq=10"| C{Internal Operation}
C --> |"Creates"| D[InternalKey String]
D --> I{Storage}
subgraph "Key Components"
D --- E["InternalKey Structure"]
E --> E1["User Key"]
E --> E2["8-byte Tag"]
E2 --> G["Seq # (56 bits)"]
E2 --> H["Type (8 bits)"]
end
subgraph "Sort Order"
I --> J["By User Key"]
J --> K["By Sequence DESC"]
K --> L["By Type DESC"]
end
```
## Conclusion
LevelDB doesn't just store your raw keys and values. It enhances them internally by adding a **sequence number** (like a version timestamp) and a **value type** (Value or Deletion). This combined structure, the **InternalKey**, is what LevelDB actually sorts and stores in its MemTables and SSTables.
The specific way InternalKeys are sorted (user key ascending, sequence number descending) is critical for efficiently finding the latest version of a key and handling deletions and snapshots correctly. The `dbformat` module (`dbformat.h`, `dbformat.cc`) defines these internal structures, their encoding/decoding rules, the comparison logic (`InternalKeyComparator`), the special `LookupKey` for seeks, and other important constants related to the database's structure and behavior.
Understanding `InternalKey` and `dbformat` reveals the clever bookkeeping that allows LevelDB's Log-Structured Merge-Tree design to function correctly and efficiently. This chapter concludes our tour of the fundamental building blocks of LevelDB!
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)

docs/LevelDB/index.md Normal file
@@ -0,0 +1,57 @@
# Tutorial: LevelDB
LevelDB is a fast *key-value storage library* written at Google.
Think of it like a simple database where you store pieces of data (values) associated with unique names (keys).
It's designed to be **very fast** for both writing new data and reading existing data, and it reliably stores everything on **disk**.
It uses a *log-structured merge-tree (LSM-tree)* design to achieve high write performance and manages data in sorted files (*SSTables*) across different levels for efficient reads and space management.
**Source Repository:** [https://github.com/google/leveldb/tree/main/db](https://github.com/google/leveldb/tree/main/db)
```mermaid
flowchart TD
A0["DBImpl"]
A1["MemTable"]
A2["Table / SSTable & TableCache"]
A3["Version & VersionSet"]
A4["Write-Ahead Log (WAL) & LogWriter/LogReader"]
A5["Iterator"]
A6["WriteBatch"]
A7["Compaction"]
A8["InternalKey & DBFormat"]
A0 -- "Manages active/immutable" --> A1
A0 -- "Uses Cache for reads" --> A2
A0 -- "Manages DB state" --> A3
A0 -- "Writes to Log" --> A4
A0 -- "Applies Batches" --> A6
A0 -- "Triggers/Runs Compaction" --> A7
A1 -- "Provides Iterator" --> A5
A1 -- "Stores Keys Using" --> A8
A2 -- "Provides Iterator via Cache" --> A5
A3 -- "References SSTables" --> A2
A3 -- "Picks Files For" --> A7
A4 -- "Recovers MemTable From" --> A1
A4 -- "Contains Batch Data" --> A6
A5 -- "Parses/Hides InternalKey" --> A8
A6 -- "Inserts Into" --> A1
A7 -- "Builds SSTables" --> A2
A7 -- "Updates Versions Via Edit" --> A3
A7 -- "Uses Iterator for Merging" --> A5
```
## Chapters
1. [Table / SSTable & TableCache](01_table___sstable___tablecache.md)
2. [MemTable](02_memtable.md)
3. [Write-Ahead Log (WAL) & LogWriter/LogReader](03_write_ahead_log__wal____logwriter_logreader.md)
4. [DBImpl](04_dbimpl.md)
5. [WriteBatch](05_writebatch.md)
6. [Version & VersionSet](06_version___versionset.md)
7. [Iterator](07_iterator.md)
8. [Compaction](08_compaction.md)
9. [InternalKey & DBFormat](09_internalkey___dbformat.md)
---
Generated by [AI Codebase Knowledge Builder](https://github.com/The-Pocket/Tutorial-Codebase-Knowledge)