
Hitchhiker’s Guide — Option A: LevelDB API over FoundationDB

The question this answers: “I have code that uses LevelDB. Can I swap in FoundationDB underneath with zero changes to my application?”

The deeper question: “What is LevelDB’s API contract, and how exactly does each piece of it map onto FDB primitives?”


Table of Contents

  1. What LevelDB Is (and Isn’t)
  2. The Six Concepts: Get, Put, Delete, Batch, Iterator, Snapshot
  3. Subspaces — The Key Encoding
  4. Write Batches — Atomicity Made Explicit
  5. Iterators — Range Scans Without Cursors
  6. Snapshots — MVCC Exposed to the Caller
  7. The Demo, Step by Step
  8. What the Real LevelDB Does That We Don’t
  9. Real-World Analogue: goleveldb, RocksDB, PebbleDB
  10. Exercises — Build on This
  11. Source Code Deep Dive — Every Line Explained
  12. Production Considerations
  13. Interview Questions — LevelDB, MVCC, and FDB

1. What LevelDB Is (and Isn’t)

LevelDB (released by Google in 2011) is an embedded, ordered, key–value store implemented as a Log-Structured Merge Tree (LSM tree). “Embedded” means it runs in the same process as your application — no server, no network. “Ordered” means the same thing as in FDB: keys are sorted lexicographically, and you can range-scan them efficiently.

LevelDB’s API surface is deliberately tiny:

db.Get(key)
db.Put(key, value)
db.Delete(key)
batch := new(leveldb.Batch)
batch.Put / batch.Delete
db.Write(batch)
iter := db.NewIterator(...)
snap := db.GetSnapshot()

This simplicity is why LevelDB became the embedded storage engine of choice for Chrome (IndexedDB), Bitcoin Core, and countless other applications.

Where LevelDB lives in the storage stack:

Application code
    ↓
LevelDB API (Get/Put/Delete/Batch/Iterator)
    ↓
LSM tree (MemTable + SSTables on disk)
    ↓
Filesystem / OS

Where our layer lives:

Application code
    ↓
[our layer] — same LevelDB-shaped API (Get/Put/Delete/Batch/Iterator)
    ↓
FDB transactions
    ↓
FDB cluster (distributed, replicated)

The application sees the same interface. The durability substrate changes completely.


2. The Six Concepts

Get — One Read, One Transaction

In LevelDB, Get pins a short-lived internal snapshot and reads one key — no lock is held for the duration of the read. There is no explicit transaction; LevelDB handles consistency internally.

In our layer:

func (d *DB) Get(key []byte) ([]byte, error) {
    v, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.Get(d.ns.Pack(key)).Get()
    })
    ...
}

ReadTransact opens a read-only FDB transaction (no conflict tracking on writes, cheaper than a full Transact). The read version is chosen by FDB to be a recent committed version — typically a few milliseconds behind real-time. This gives you a consistent view even if concurrent writers are active.

rt.Get(k) does not block on return. It returns a FutureByteSlice. Calling .Get() on the future is what blocks (sends the request to the FDB storage server and waits for the response). This two-phase call style is how FDB supports pipelining: you can call rt.Get(k1), rt.Get(k2), rt.Get(k3) in sequence, then .Get() all three — FDB sends all three requests before blocking on any response. This is critical for LookupByIndex in option-c, as we’ll see.
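A minimal sketch of that pipelining pattern (the helper and key names here are illustrative, not part of the layer):

func readThree(db fdb.Database, k1, k2, k3 fdb.Key) ([][]byte, error) {
    v, err := db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        // All three requests are dispatched before any .Get() blocks.
        f1, f2, f3 := rt.Get(k1), rt.Get(k2), rt.Get(k3)
        b1, err := f1.Get() // first blocking point; f2 and f3 are already in flight
        if err != nil {
            return nil, err
        }
        b2, err := f2.Get()
        if err != nil {
            return nil, err
        }
        b3, err := f3.Get()
        if err != nil {
            return nil, err
        }
        return [][]byte{b1, b2, b3}, nil
    })
    if err != nil {
        return nil, err
    }
    return v.([][]byte), nil
}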

Put — One Write, One Transaction

func (d *DB) Put(key, value []byte) error {
    _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Set(d.ns.Pack(key), value)
        return nil, nil
    })
    return err
}

Transact opens a read-write transaction. tr.Set adds the (key, value) pair to the transaction’s local write buffer. Nothing is sent to the cluster until the function returns without error, at which point FDB commits. If there was a conflict (another writer touched this key since our read version), FDB calls our function again with a fresh transaction automatically.

Delete — Same as Put, but Clear

tr.Clear(key) adds a tombstone to the write buffer. In FDB’s model, a cleared key is identical to a key that was never written — there is no “null” value. This is important: Get on a cleared key returns nil (which our layer translates to ErrNotFound), not a special tombstone value.
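The implementation mirrors Put — a sketch, assuming the same DB type as above:

func (d *DB) Delete(key []byte) error {
    _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Clear(d.ns.Pack(key)) // buffered locally; becomes a deletion at commit
        return nil, nil
    })
    return err
}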

When Each One is Right

Operation   Use when                                    FDB primitive
Get         Reading a single key                        ReadTransact
Put         Writing a single key                        Transact + Set
Delete      Removing a single key                       Transact + Clear
Batch       Writing multiple keys atomically (below)    Transact (one)

3. Subspaces — The Key Encoding

Every key that hits FDB is first passed through the Subspace.Pack method:

// encoding.go
type Subspace struct{ prefix []byte }

func (s Subspace) Pack(userKey []byte) fdb.Key {
    out := make([]byte, 0, len(s.prefix)+1+len(userKey))
    out = append(out, s.prefix...)
    out = append(out, 0x00)        // separator byte
    out = append(out, userKey...)
    return fdb.Key(out)
}

If your namespace is "demo" and your key is "apple", the actual FDB key is the byte string "demo\x00apple".

Why the separator byte?

Without a separator, the namespaces "foo" and "foobar" would collide: the packed key "foobarsomekey" could belong to either one ("foo" + "barsomekey" or "foobar" + "somekey"), and every key in foobar’s range would also fall inside foo’s prefix range. The separator prevents this: "foo\x00" is not a prefix of "foobar\x00", so the two packed ranges are disjoint.

Why 0x00 specifically?

Because 0x00 is the smallest possible byte value. Any separator below 0xFF would make the range trick below work, but 0x00 also keeps a subspace’s keys sorted before those of any longer prefix (all of "foo\x00..." sorts before "foobar\x00..."). When we compute the range end for a subspace scan, we copy the begin key and change the last byte from 0x00 to 0x01:

func (s Subspace) Range() fdb.KeyRange {
    begin := append([]byte{}, s.prefix...)
    begin = append(begin, 0x00)
    end := append([]byte{}, s.prefix...)
    end = append(end, 0x01)
    return fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)}
}

The range ["demo\x00", "demo\x01") contains exactly and only the keys packed by this subspace. This is a simple, efficient way to express “all keys in this namespace” without needing a sentinel end key.

The tuple layer alternative:

Official FDB client libraries encode subspace ranges using the Tuple encoding, which handles nested subspaces, escaping, and typed values. Our hand-rolled encoding is simpler but less general — for production use, adopt the Tuple layer.
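For comparison, a sketch of the same operations using the official Go binding’s subspace and tuple packages (not used by our layer):

import (
    "github.com/apple/foundationdb/bindings/go/src/fdb/subspace"
    "github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

demo := subspace.Sub("demo")                // prefix is tuple-encoded
key := demo.Pack(tuple.Tuple{"apple"})      // typed elements; escaping handled for you
// A subspace also satisfies fdb.ExactRange, so it can be passed directly
// to GetRange to scan everything under it.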


4. Write Batches — Atomicity Made Explicit

LevelDB’s WriteBatch is the mechanism for writing multiple keys atomically. Without a batch, each Put is a separate transaction — if your process crashes between two Puts, the second one is missing.

b := layer.NewBatch()
b.Put([]byte("user:1:name"), []byte("Alice"))
b.Put([]byte("user:1:email"), []byte("alice@example.com"))
b.Put([]byte("user:1:score"), []byte("100"))
db.Write(b)

After Write, either all three keys exist or none do.

Our implementation:

// batch.go
func (d *DB) Write(b *Batch) error {
    _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        for _, op := range b.ops {
            if op.clear {
                tr.Clear(d.ns.Pack(op.key))
            } else {
                tr.Set(d.ns.Pack(op.key), op.value)
            }
        }
        return nil, nil
    })
    return err
}

One Transact call. All ops go into the transaction buffer together. FDB commits them atomically. This is what LevelDB’s WriteBatch does internally — it appends all ops to the write-ahead log as a single record (one fsync, when synchronous writes are enabled), then applies them to the MemTable.

FDB’s size limits:

FDB transactions are capped at approximately 10 MB of reads + writes. For most use cases this is not a limit, but if you’re writing millions of keys you’ll need to split into multiple transactions. See option-b-leveldb for the chunking pattern.


5. Iterators — Range Scans Without Cursors

LevelDB’s iterator is a cursor over the sorted key space. It supports bidirectional movement and seeking to arbitrary positions.

The streaming vs. materializing decision:

A “streaming” iterator would keep a live FDB transaction open and use fdb.RangeIterator to fetch keys page by page as you call Next(). This is efficient for large ranges but ties the iterator’s lifetime to an open transaction.

FDB transactions have a ~5 second timeout. LevelDB iterators are frequently held open much longer (e.g., while a background compaction reads a full SSTable). Forcing a 5-second limit would break drop-in compatibility.

Our solution: materialize the entire range into a slice upfront:

func newIteratorAt(fdbDB fdb.Database, ns Subspace, start, end []byte, readVersion int64) *Iterator {
    v, _ := fdbDB.Transact(func(tr fdb.Transaction) (interface{}, error) {
        if readVersion > 0 {
            tr.SetReadVersion(readVersion) // pin to a snapshot version when one was given
        }
        return tr.GetRange(ns.RangeWithin(start, end), fdb.RangeOptions{}).GetSliceWithError()
    })
    it.kvs = v.([]fdb.KeyValue)
    ...
}

The entire GetSliceWithError call happens inside one transaction. The transaction closes. The iterator holds the materialized slice — it can outlive the transaction indefinitely.

Trade-off: We read all matching keys eagerly, even if the caller only needs the first few. For small ranges (as in most LevelDB use cases) this is fine. For ranges spanning millions of keys, a streaming approach would be necessary.

Navigation:

it.First()          // idx = 0
it.Next()           // idx++
it.Prev()           // idx--
it.Last()           // idx = len(kvs)-1
it.Seek(target)     // binary (or linear) search for key >= target
it.Key()            // kvs[idx].Key unpacked from subspace
it.Value()          // kvs[idx].Value
it.Valid()          // 0 <= idx < len(kvs)

The Seek in our implementation is a linear scan for pedagogical clarity. A production implementation would use sort.Search (binary search) since the slice is sorted.


6. Snapshots — MVCC Exposed to the Caller

This is where FDB’s MVCC machinery becomes directly visible.

A LevelDB snapshot captures the database state at a point in time. Reads through the snapshot always see that exact state, regardless of subsequent writes.

snap := db.NewSnapshot()
// ... later, after writes have occurred ...
v1, _ := db.Get(key)    // sees current state
v2, _ := snap.Get(key)  // sees state at snapshot time
// v1 != v2 if the key was mutated after the snapshot

How we implement this:

// snapshot.go
type Snapshot struct {
    db          fdb.Database
    ns          Subspace // needed by Get and NewIterator to pack keys
    readVersion int64
}

func (d *DB) NewSnapshot() (*Snapshot, error) {
    rv, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetReadVersion().Get()
    })
    ...
    return &Snapshot{db: d.fdb, ns: d.ns, readVersion: rv.(int64)}, nil
}

GetReadVersion() returns the logical timestamp FDB assigned to our transaction. This is a monotonically increasing integer — FDB’s “version clock”. We store it.

Later, when the snapshot is asked for a key:

func (s *Snapshot) Get(key []byte) ([]byte, error) {
    v, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.SetReadVersion(s.readVersion)   // ← the key line
        return tr.Get(s.ns.Pack(key)).Get()
    })
    ...
}

SetReadVersion pins the transaction to read from the exact version we captured. FDB’s MVCC machinery ensures that the storage servers still have the old version in their version history — provided it hasn’t been garbage collected (the ~5 second window).

Why SetReadVersion requires a writable Transaction:

This is a quirk of the FDB Go API: SetReadVersion is only available on fdb.Transaction (read-write), not on fdb.ReadTransaction (read-only). So we open a writable transaction but never write anything — effectively using it as a “pinnable read transaction”. The transaction is automatically abandoned when the closure returns.

What this maps to in real databases:

  • PostgreSQL: BEGIN; SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; creates a snapshot valid for the transaction’s duration
  • MySQL InnoDB: START TRANSACTION WITH CONSISTENT SNAPSHOT
  • Spanner: read-only transactions with a TimestampBound (an exact read timestamp or bounded staleness)
  • CockroachDB: AS OF SYSTEM TIME <timestamp> clause on SELECT

All of these are, at their core, “read from this specific version of the MVCC chain.”


7. The Demo, Step by Step

The demo in demo/main.go exercises every feature and prints the results. Here is the internal FDB call sequence for the MVCC section:

1. db.Put("color", "red")
   → Transact: Set("demo\x00color", "red"), commit version v1

2. snap = db.NewSnapshot()
   → ReadTransact: GetReadVersion() → returns v1 (or very close)
   → snap.readVersion = v1

3. db.Put("color", "blue")
   → Transact: Set("demo\x00color", "blue"), commit version v2

4. snap.Get("color")
   → Transact: SetReadVersion(v1), Get("demo\x00color")
   → FDB returns "red"  (value at v1)

5. db.Get("color")
   → ReadTransact: Get("demo\x00color")
   → FDB returns "blue" (latest committed value)

This demonstrates that the snapshot truly pins to v1, seeing the pre-mutation value.


8. What the Real LevelDB Does That We Don’t

LevelDB has significant machinery we skip:

LSM Tree internals:

  • MemTable (skip list in memory) + WAL (write-ahead log on disk)
  • SSTable files (sorted, immutable, bloom-filter indexed)
  • Compaction (background merging of SSTables to reclaim space and bound read amplification)
  • Bloom filters (avoid disk reads for non-existent keys)

None of this applies to our layer because FDB handles all of it. FDB’s storage servers use their own B-tree-based storage engines (historically a SQLite-derived B-tree; the newer Redwood engine is a versioned B-tree designed for MVCC) and their own write-ahead logs. We simply call tr.Set and FDB’s internals handle the rest.

What we deliberately emulate:

  • The API surface (same method names and semantics as goleveldb)
  • Namespace isolation (subspace = virtual database)
  • Atomic batches
  • Forward/backward iteration
  • MVCC snapshots

What we don’t emulate:

  • Compaction (irrelevant — FDB handles it)
  • Bloom filters (FDB has its own read optimization)
  • File-level operations (no such thing in FDB)
  • Checksums (FDB provides end-to-end data integrity)

9. Real-World Analogue: goleveldb, RocksDB, PebbleDB

goleveldb (github.com/syndtr/goleveldb) is the Go port of LevelDB that option-b-leveldb uses as a consumer. Its storage.Storage interface is the pluggable storage backend. We’ll see this in option-b’s guide.

RocksDB (Facebook/Meta, 2013) is the evolution of LevelDB: more configurable, multi-threaded compaction, column families, transactional API. It is the storage engine behind MyRocks (a MySQL storage engine used in Meta’s MySQL branch and Percona Server), TiKV, and — until the Pebble migration described below — CockroachDB. Every concept from our layer applies directly to RocksDB’s API.

PebbleDB (CockroachDB, 2019) is a Go implementation of RocksDB’s key ideas, designed for CockroachDB’s specific workload. CockroachDB switched from RocksDB to Pebble in 2021 for improved performance and simpler operations.

The common thread: all of these expose the same Get/Put/Delete/Batch/Iterator/Snapshot interface. Building this interface over FDB means you understand the contract deeply, because you have to implement it rather than just use it.


10. Exercises — Build on This

These are not hypothetical. Each one adds a real capability:

Exercise 1 — Atomic Compare-And-Swap (CAS)

func (d *DB) CAS(key, expected, next []byte) (bool, error)

Read key, compare to expected, write next — all in one transaction. If the key changed since you read it, FDB retries automatically (conflict detection does this for you — you don’t need to loop).
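A sketch of one way to write it, assuming the DB type from this layer and the standard bytes package (cur is nil when the key is absent):

func (d *DB) CAS(key, expected, next []byte) (bool, error) {
    v, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        cur, err := tr.Get(d.ns.Pack(key)).Get()
        if err != nil {
            return false, err
        }
        if !bytes.Equal(cur, expected) {
            return false, nil // nothing buffered, so the commit is a no-op
        }
        tr.Set(d.ns.Pack(key), next)
        return true, nil
    })
    if err != nil {
        return false, err
    }
    return v.(bool), nil
}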

Exercise 2 — TTL (Time-To-Live) Keys

Store expiry timestamps alongside values. Modify Get to return ErrNotFound for expired keys, and add a Sweep() method that clears all expired entries with a ClearRange.
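One possible shape — a hypothetical sketch using encoding/binary and time; the exercise does not prescribe this layout:

// Hypothetical layout: value = 8-byte big-endian unix-seconds expiry + payload.
func (d *DB) PutTTL(key, value []byte, ttl time.Duration) error {
    buf := make([]byte, 8, 8+len(value))
    binary.BigEndian.PutUint64(buf, uint64(time.Now().Add(ttl).Unix()))
    return d.Put(key, append(buf, value...))
}

// decodeTTL strips the expiry header; ok=false means the entry has expired
// and Get should report ErrNotFound.
func decodeTTL(raw []byte) (value []byte, ok bool) {
    exp := int64(binary.BigEndian.Uint64(raw[:8]))
    if time.Now().Unix() >= exp {
        return nil, false
    }
    return raw[8:], true
}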

Exercise 3 — Transactions Across Multiple Keys

tx := db.Begin()
tx.Put("account:alice:balance", "900")
tx.Put("account:bob:balance", "100")
tx.Commit()

This is just a Batch.Write today. Extend it to include optimistic reads (read alice’s balance, check it’s >= 100 before subtracting) — you’ll need a real Transact closure, not just a batch.

Exercise 4 — Prefix Scan

func (d *DB) Scan(prefix []byte) ([]KV, error)

Use RangeWithin(prefix, nil) as the starting range, then stop at the first key that no longer carries the prefix — keys are sorted, so you can break early (or compute a proper exclusive end with fdb.Strinc). This is the basis for the Record Layer’s ScanRecords.
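A sketch, assuming a KV struct and client-side prefix checking:

type KV struct{ Key, Value []byte }

func (d *DB) Scan(prefix []byte) ([]KV, error) {
    v, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        kvs, err := rt.GetRange(d.ns.RangeWithin(prefix, nil), fdb.RangeOptions{}).GetSliceWithError()
        if err != nil {
            return nil, err
        }
        out := make([]KV, 0, len(kvs))
        for _, kv := range kvs {
            userKey := kv.Key[len(d.ns.prefix)+1:] // strip "namespace\x00"
            if !bytes.HasPrefix(userKey, prefix) {
                break // sorted order: past the prefix means we are done
            }
            out = append(out, KV{Key: userKey, Value: kv.Value})
        }
        return out, nil
    })
    if err != nil {
        return nil, err
    }
    return v.([]KV), nil
}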

Exercise 5 — Size Estimate

Use FDB’s GetEstimatedRangeSizeBytes to get an approximate byte count for your subspace without reading it. This is how database engines implement SHOW TABLE STATUS without a full scan.
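A sketch, assuming the binding exposes GetEstimatedRangeSizeBytes on read transactions (available in recent FDB versions):

func (d *DB) EstimatedSizeBytes() (int64, error) {
    v, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetEstimatedRangeSizeBytes(d.ns.Range()).Get()
    })
    if err != nil {
        return 0, err
    }
    return v.(int64), nil
}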


11. Source Code Deep Dive — Every Line Explained

This section walks through the full source of layer/db.go, layer/encoding.go, layer/iterator.go, and layer/snapshot.go with annotations about the non-obvious decisions.

db.go — The Core Database Type

type DB struct {
    fdb fdb.Database
    ns  Subspace
}

Two fields. fdb is the FDB connection (goroutine-safe, long-lived). ns is the namespace: a byte prefix prepended to every key. Multiple DB instances on the same FDB cluster with different ns values are completely isolated — their key ranges do not overlap.

func Open(fdbDB fdb.Database, namespace []byte) *DB {
    return &DB{fdb: fdbDB, ns: NewSubspace(namespace)}
}

Open does not contact FDB. It’s a pure in-memory initialization. The connection to FDB was already established when fdb.OpenDefault() was called in main.go. Open just associates this DB with a prefix.

Why take fdb.Database rather than a cluster address string? This lets the caller decide how the FDB connection is configured (API version, cluster file path, network options) and share one connection across multiple DB instances. Multiple DB instances share one FDB network thread and one connection pool.

func (d *DB) FDB() fdb.Database { return d.fdb }
func (d *DB) Namespace() []byte { return d.ns.prefix }

Accessors for embedding and testing. FDB() lets a consumer pass the FDB connection to another layer (e.g., a Record Layer built on top of this DB). Namespace() lets tests inspect the key prefix.

encoding.go — The Full Subspace Implementation

func (s Subspace) RangeWithin(start, end []byte) fdb.KeyRange {
    var begin, endKey fdb.Key
    if start == nil {
        begin = s.Range().Begin
    } else {
        begin = s.Pack(start)
    }
    if end == nil {
        endKey = s.Range().End
    } else {
        endKey = s.Pack(end)
    }
    return fdb.KeyRange{Begin: begin, End: endKey}
}

RangeWithin lets callers specify a sub-range within the subspace. If start = nil, the range starts at the beginning of the subspace. If end = nil, the range ends at the end of the subspace. This is used by the iterator to implement LevelDB’s NewIterator(slice *util.Range) — an iterator over a restricted key range.

The subtle end encoding: When end is provided, we use Pack(end) as the upper bound. FDB range reads are half-open, like Python’s range(): GetRange(begin, end) returns keys where begin <= key < end. Since Pack(end) = prefix + 0x00 + end, this correctly excludes the end key itself.

iterator.go — Forward/Backward Navigation

func (it *Iterator) compareKeys(a, b []byte) int {
    return bytes.Compare(a, b)
}

One line, but critical: key comparison is lexicographic byte order, not string collation, not numeric order. "9" > "10" under this comparison. This is identical to LevelDB’s default comparator (BytewiseComparator). If you need a different sort order, you need a sort-preserving encoding — which is why encodeInt64 (used in option-c’s index keys) exists.
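A sketch of such an encoding (option-c’s encodeInt64 is described as sign-bit-flipped big-endian; this illustrates the idea using encoding/binary):

func encodeInt64(n int64) []byte {
    u := uint64(n) ^ (1 << 63) // flip the sign bit so negative values sort first
    b := make([]byte, 8)
    binary.BigEndian.PutUint64(b, u) // big-endian: byte order matches numeric order
    return b
}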

func (it *Iterator) Seek(target []byte) {
    for it.idx = 0; it.idx < len(it.kvs); it.idx++ {
        if it.compareKeys(it.Key(), target) >= 0 {
            return
        }
    }
}

Linear scan for Seek. Correct but O(n). A production implementation uses binary search:

it.idx = sort.Search(len(it.kvs), func(i int) bool {
    return bytes.Compare(it.kvs[i].Key, target) >= 0
})

This is O(log n) — important for large result sets. The linear scan is kept here for readability.

snapshot.go — Pinning MVCC Versions

func (s *Snapshot) NewIterator(start, end []byte) *Iterator {
    return newIteratorAt(s.db, s.ns, start, end, s.readVersion)
}

The iterator constructor takes readVersion and passes it to the internal newIteratorAt. Inside newIteratorAt:

func newIteratorAt(fdbDB fdb.Database, ns Subspace, start, end []byte, readVersion int64) *Iterator {
    v, _ := fdbDB.Transact(func(tr fdb.Transaction) (interface{}, error) {
        if readVersion > 0 {
            tr.SetReadVersion(readVersion)
        }
        return tr.GetRange(ns.RangeWithin(start, end), fdb.RangeOptions{}).GetSliceWithError()
    })
    ...
}

If readVersion > 0, we pin the transaction to that version. The range scan returns results as of that exact version. This enables snapshot-consistent iteration — the iterator sees a stable, non-changing view even while concurrent writers are active.

The Transact retry loop and SetReadVersion: Transact retries on conflict. But SetReadVersion sets a fixed read version — the transaction’s read set is pinned. Can we still get a conflict on a read-only operation that reads from a pinned version? No — conflicts are about write-write and read-write conflicts. A transaction that only reads (even via Transact) cannot be retried due to conflict unless it also writes. Our snapshot transactions only read, so Transact will not retry them for conflict reasons. The only reason for retry would be a transaction_too_old error (if readVersion is too old and the data has been GC’d), which surfaces as an error to the caller.


12. Production Considerations

12.1 FDB Transaction Size Limits

FDB enforces hard limits on transactions:

  • 10 MB total mutation size (sum of all Set and Clear calls in one transaction)
  • 10 MB total read size (sum of all values read)
  • 5 seconds maximum transaction duration (from first operation to commit)

For option-a-leveldb, the most likely limit to hit is the iterator: GetSliceWithError() reads the entire range into memory in one transaction. If a namespace contains 50 MB of data and you create an iterator over it, the transaction will blow past these limits and fail with a retriable error.

Production solution: Paginated iteration with cursors:

// Instead of reading everything at once:
var cursor fdb.Key = ns.Range().Begin
const pageSize = 10_000
for {
    v, _ := db.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(fdb.KeyRange{Begin: cursor, End: ns.Range().End},
            fdb.RangeOptions{Limit: pageSize}).GetSliceWithError()
    })
    kvs := v.([]fdb.KeyValue) // ReadTransact returns interface{}; assert the slice type
    if len(kvs) == 0 {
        break
    }
    // process kvs...
    cursor = fdb.Key(append(kvs[len(kvs)-1].Key, 0x00)) // next page starts after last key
}

12.2 FDB Go Binding Concurrency Model

The FDB Go binding uses a single-threaded network loop internally (the FDB C library has one network thread). All transactions are multiplexed over this thread. This means:

  • Multiple goroutines can each issue their own ReadTransact or Transact calls concurrently — the binding safely multiplexes them over the network thread
  • fdb.Database is goroutine-safe
  • But the network thread is single-threaded: if you saturate it (thousands of concurrent transactions), you’ll see latency increase

For high concurrency, FDB recommends batching multiple operations within one transaction rather than creating many small transactions.

12.3 Key Space Planning

Before deploying, decide on your namespace structure. Once data is in production, renaming a namespace (changing the prefix) requires migrating all data — a potentially days-long background job.

Good practice:

// Use a versioned namespace prefix
db := layer.Open(fdb, []byte("myapp:v1:users"))
// If schema changes require a new encoding:
newDB := layer.Open(fdb, []byte("myapp:v2:users"))
// Migrate in background; dual-write during migration window

12.4 Monitoring and Observability

FDB exposes cluster health via its status API:

fdbcli --exec "status json"

Key metrics to monitor:

  • Transactions committed/second — throughput
  • Conflicts/second — high values indicate hot keys or poorly structured transactions
  • Storage server read/write latency — P99 should be < 10ms
  • Data distribution lag — if a shard is being moved, latency spikes

For application-level monitoring, instrument every Transact call with a timer and tag the metric with the operation name.
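For example, a thin wrapper (hypothetical — the metric plumbing will vary):

func (d *DB) instrumentedTransact(op string, f func(fdb.Transaction) (interface{}, error)) (interface{}, error) {
    start := time.Now()
    v, err := d.fdb.Transact(f)
    // Swap the log line for your metrics client (e.g., a histogram keyed by op).
    log.Printf("fdb.transact op=%s dur=%s err=%v", op, time.Since(start), err)
    return v, err
}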


13. Interview Questions — LevelDB, MVCC, and FDB

Q: What is the difference between ReadTransact and Transact in the FDB Go binding?

ReadTransact opens a read-only transaction: no write buffer, no conflict tracking, no commit phase. It’s cheaper than Transact because it skips the commit round-trip. Use ReadTransact whenever you’re only reading. Transact opens a read-write transaction: it tracks the read key set for conflict detection and requires a commit round-trip to apply writes. For single-key reads like Get, using Transact instead of ReadTransact works but wastes latency and cluster resources.

Q: If FDB’s MVCC window is ~5 seconds, what happens to a snapshot older than 5 seconds?

Reading from that snapshot returns a transaction_too_old error. FDB garbage-collects old MVCC versions after the configured version history window (default ~5 seconds). Any transaction — including read-only snapshot reads — that tries to read a version older than the GC horizon fails with a retriable error. In our Snapshot implementation, this means that a snapshot held open longer than ~5 seconds will start returning errors on the next Get or NewIterator call. Production code must handle this by recreating the snapshot.

Q: LevelDB supports custom comparators. What would you need to change to support a different sort order in this layer?

The sort order is determined by FDB’s key ordering, which is always lexicographic byte order. To support a different sort order (e.g., “integers sort numerically”), you would change the key encoding: encode integer keys using encodeInt64 (sign-bit-flipped big-endian) so that their byte order matches their numeric order. You cannot change FDB’s comparator; you can only change what bytes you write as keys. This is why sort-preserving encoding is the fundamental concept in layer design.

Q: How does FDB’s Transact retry loop interact with side effects?

Transact retries the closure function if the transaction conflicts. If the closure has side effects outside FDB (e.g., incrementing a counter, logging, sending an HTTP request), those side effects will execute multiple times on retry. The FDB convention is: the Transact closure must be idempotent or side-effect-free. For logging, use a separate post-commit hook. For counters, use FDB atomic operations (tr.Add) which are commutative and do not require retry logic.
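A sketch of such a counter — the ADD operand is a little-endian integer in FDB’s atomic-op encoding:

func (d *DB) Increment(key []byte, delta int64) error {
    _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        param := make([]byte, 8)
        binary.LittleEndian.PutUint64(param, uint64(delta)) // ADD operands are little-endian
        tr.Add(d.ns.Pack(key), param)                       // applied commutatively at commit
        return nil, nil
    })
    return err
}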

Q: How does an iterator over a snapshot differ from an iterator over the current database state?

A snapshot iterator is pinned to the read version captured at NewSnapshot() time. Even if concurrent writers modify or delete keys between when the snapshot was taken and when the iterator is created, the iterator sees the state at snapshot time. A current-state iterator uses the latest committed version, so it sees all mutations up to the moment the GetRange call is sent to the cluster. In our implementation, the difference is one line: tr.SetReadVersion(readVersion) in the snapshot path.