FoundationDB Layers — Go Implementations
Five self-contained Go implementations that show different ways to build a “layer” on top of FoundationDB. Modeled after real projects (mvsqlite, fdb-record-layer) but pared down for clarity.
Layout
| Folder | Plugs in | Teaches |
|---|---|---|
| option-a-leveldb | Above LevelDB | LevelDB’s external API: iterators, snapshots, write batches |
| option-a-sqlite | Above SQLite | How SQL decomposes into storage ops; vtab query planning |
| option-b-leveldb | Below LevelDB | LevelDB internals: SST files, WAL, MANIFEST, CURRENT |
| option-b-sqlite | Below SQLite | SQLite internals: page model, VFS, journaling |
| option-c-record-layer | Directly on FDB | Native FDB layer: records + secondary indexes |
Prerequisites
- Docker — to run a local single-node FDB cluster
- FoundationDB client library on the host (`libfdb_c`) — required by the Go bindings (CGO)
  - macOS: `brew install foundationdb` (or install the official `.pkg` from Apple’s release page)
- Go 1.22+
Bootstrap the cluster
docker compose up -d
./scripts/bootstrap-fdb.sh
This creates ./fdb.cluster at the repo root. Every demo reads it via the relative path ../fdb.cluster.
To shut down: docker compose down (data persists in ./fdb-data).
To wipe everything: docker compose down -v && rm -rf fdb-data fdb-config fdb.cluster.
Running a demo
Each option is an independent Go module:
cd option-a-leveldb
go run ./demo
See each folder’s docs/ for an architecture walk-through.
Documentation (mdbook)
The docs are structured as an mdbook under book/. To read them locally:
# Install mdbook (once)
brew install mdbook
# Build the book (output → book/dist/)
mdbook build
# Or serve with live-reload
mdbook serve --port 3001
The book is organized into two parts:
- Background — The Hitchhiker’s Guide split into six chapters (storage stack, FDB internals, the layer concept, real-world systems, repo overview, further reading)
- Labs — One overview + architecture walk-through per option
The Hitchhiker’s Guide to FoundationDB and Storage Layers
“Don’t Panic.”
This guide assumes you have seen a key–value store before but have never thought hard about what happens underneath. By the end you should be able to design and build your own layer on top of FoundationDB — not just copy one.
What This Guide Covers
Six chapters that take you from first principles all the way to production-grade distributed systems. Each chapter is self-contained and builds on the previous.
| Chapter | Topic | What you’ll know after |
|---|---|---|
| 1. The Storage Stack | B-trees, LSM-trees, WAL, buffer pools | Why every database makes the same five trade-offs |
| 2. FoundationDB in Depth | MVCC, commit pipeline, cluster roles, simulation | How FDB achieves correctness and why it’s trusted at Apple scale |
| 3. The Layer Concept | Key encoding, subspace pattern, atomicity | How to turn an ordered KV store into any data model |
| 4. How Real Systems Use FDB | CloudKit, mvsqlite, Document Layer, TiKV | That the patterns in this repo are battle-tested in production |
| 5. This Repository | The five lab implementations | How to navigate, run, and extend the labs |
| 6. Reading Guide | Papers, books, source code to read next | Where to go to become a true expert |
How to Read This
If you have 2 hours: Read chapters 1–3, then pick any one lab and read its Architecture Walk-through.
If you have a day: Read all six chapters, then work through all five labs in order (A-leveldb → A-sqlite → B-leveldb → B-sqlite → C-record-layer). The labs increase in conceptual complexity.
If you’re preparing for a systems interview: Focus on chapters 1–3 and the Interview Q&A sections at the end of each lab guide.
If you just want to run the code: Jump straight to Chapter 5 — This Repository for quick-start instructions, then come back to the background chapters for depth.
Core Insight
Every database is a stack of layers, each one a client of the one below. FoundationDB occupies one specific layer: ordered, transactional, distributed key-value storage. Every layer above it — SQL, document model, LevelDB API, SQLite page store, record layer — is a pure encoding problem. Learn the encoding and you understand the database.
That is what this guide teaches.
The Storage Stack — Where Everything Lives
Every database you have ever used is a stack of layers, each hiding detail from the layer above. Knowing exactly where FoundationDB sits in that stack — and why that position is powerful — is the foundation of everything else in this guide.
The Canonical Stack
┌─────────────────────────────────────────────┐
│ Application (SQL, gRPC, REST API)           │ ← you usually live here
├─────────────────────────────────────────────┤
│ Query Planner / Optimizer                   │ ← rewrites queries, picks indexes
├─────────────────────────────────────────────┤
│ Execution Engine (joins, aggregates)        │ ← evaluates the query plan
├─────────────────────────────────────────────┤
│ Storage Engine (rows, columns, indexes)     │ ← where FDB plays
├─────────────────────────────────────────────┤
│ Page / Buffer Cache                         │ ← in-memory hot pages
├─────────────────────────────────────────────┤
│ File / Block I/O (VFS, pread/pwrite)        │ ← option-b-sqlite plugs in here
├─────────────────────────────────────────────┤
│ Kernel: page cache, scheduler, filesystem   │
├─────────────────────────────────────────────┤
│ Physical: SSD / NVMe / network storage      │
└─────────────────────────────────────────────┘
The interesting question is: which layer do you own, and what contract does it expose downward?
Most database literature focuses on the top (SQL, query plans). Most OS literature focuses on the bottom (filesystems, block I/O). The storage engine layer — the middle — is where the real trade-offs live: ordering, atomicity, durability, and concurrency.
FoundationDB is a storage engine that exposes a remarkably clean API to everything above it. This repository explores five different ways to build on top of that API.
B-Trees vs. Log-Structured Merge Trees
Every practical storage engine is built on one of two foundational data structures. Understanding both is essential for reasoning about FDB’s design and the designs of the engines it replaces.
B-Trees
A B-tree stores data as a balanced tree of fixed-size pages (usually 4–16 KB). Updates happen in-place: to change a value, you find its page, overwrite it, and write it back to disk.
        ┌──────────────┐
        │  [20]  [50]  │ ← interior node (keys only)
        └──────────────┘
         /      |      \
┌──────────┐ ┌──────────┐ ┌──────────┐
│ [5][12]  │ │ [25][35] │ │ [55][80] │ ← leaf nodes (key+value)
└──────────┘ └──────────┘ └──────────┘
Properties:
- Read amplification: O(log N) page reads to find a key (very good)
- Write amplification: ~1–3× (one write per page touched, but journaling doubles it)
- Space amplification: ~33% overhead on average (partially filled pages)
- Random writes: each update may touch a different page → high random I/O on spinning disks
Why B-trees dominated for 40 years: HDDs were sequential-read fast, random-read slow. B-trees minimize the number of seeks to reach a key. SSDs changed this equation — random reads became cheap.
Used by: SQLite, PostgreSQL, MySQL InnoDB, Oracle, early MongoDB (MMAPv1)
Log-Structured Merge Trees (LSM Trees)
LevelDB, RocksDB, and Cassandra’s storage engine are all LSM trees. The core idea: never overwrite; always append.
Write path:
incoming write
      ↓
WAL (append-only log on disk — for durability)
      ↓
MemTable (skip list in memory — for fast writes)
      ↓  (when MemTable is full, flush to disk)
Level 0 SSTables (sorted, immutable files)
      ↓  (background compaction)
Level 1, 2, ... SSTables
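The write path above can be sketched in a few lines of Go. This is a toy illustration, not LevelDB’s actual implementation: the memtable is a plain map instead of a skip list, “disk” is in-process memory, and there is no compaction.

```go
package main

import (
	"fmt"
	"sort"
)

// Toy LSM write path: append to a WAL, update the memtable, and flush the
// memtable into an immutable sorted run (a stand-in for an SSTable).

type kv struct{ key, val string }

type lsm struct {
	wal      []kv              // append-only log (stands in for the on-disk WAL)
	memtable map[string]string // in-memory write buffer
	runs     [][]kv            // flushed, sorted, immutable runs ("SSTables")
}

func newLSM() *lsm { return &lsm{memtable: map[string]string{}} }

func (l *lsm) Put(k, v string) {
	l.wal = append(l.wal, kv{k, v}) // 1. durability first: append to the WAL
	l.memtable[k] = v               // 2. then update the in-memory table
}

// Flush sorts the memtable into an immutable run, like a memtable→SSTable flush.
func (l *lsm) Flush() {
	run := make([]kv, 0, len(l.memtable))
	for k, v := range l.memtable {
		run = append(run, kv{k, v})
	}
	sort.Slice(run, func(i, j int) bool { return run[i].key < run[j].key })
	l.runs = append(l.runs, run)
	l.memtable = map[string]string{} // the WAL could now be truncated to here
}

// Get checks the memtable first, then runs from newest to oldest,
// so newer versions shadow older flushed ones.
func (l *lsm) Get(k string) (string, bool) {
	if v, ok := l.memtable[k]; ok {
		return v, true
	}
	for i := len(l.runs) - 1; i >= 0; i-- {
		for _, e := range l.runs[i] {
			if e.key == k {
				return e.val, true
			}
		}
	}
	return "", false
}

func main() {
	db := newLSM()
	db.Put("banana", "2")
	db.Put("apple", "1")
	db.Flush()
	db.Put("apple", "9") // newer version shadows the flushed one
	v, _ := db.Get("apple")
	fmt.Println(v) // 9
}
```

The lookup order (memtable, then newest run to oldest) is exactly why read amplification grows with the number of levels and why real engines add bloom filters.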
Anatomy of an SSTable (Sorted String Table):
┌───────────────────────────────────────────────┐
│ Block 0: apple→1, banana→2, cherry→3 ...      │
│ Block 1: date→4, elderberry→5, fig→6 ...      │
│ ...                                           │
│ Index block: [apple→offset0, date→offset1]    │
│ Bloom filter: membership query, ~1% FP rate   │
│ Footer: offsets to index+filter blocks        │
└───────────────────────────────────────────────┘
Properties:
- Write amplification: high (10–30× in a 7-level LSM; data is rewritten during each level’s compaction)
- Read amplification: moderate (must check all levels for a key; bloom filters help)
- Space amplification: moderate (old versions linger until compaction)
- Sequential writes: almost entirely sequential — excellent on SSDs and HDDs alike
Why LSM trees dominate at scale today: write throughput is often the bottleneck at scale, and LSM trees turn random writes into sequential appends. RocksDB (Facebook) and Cassandra (Apache) both validate this at planetary scale.
The RUM Conjecture
The RUM conjecture (Idreos et al., 2016) formalizes the fundamental trade-off:
You cannot minimize Read cost, Update cost, and Memory/space cost simultaneously. Optimizing two always worsens the third.
| Engine | Read (R) | Update (U) | Memory (M) | Best for |
|---|---|---|---|---|
| B-tree | Low | Medium | Medium | Read-heavy OLTP |
| LSM tree | Medium | Low | High | Write-heavy OLTP |
| Hash table | Very low | Medium | High | Point lookups only |
| Column store | Low | High | Low | Analytical queries (OLAP) |
| FDB (B-tree) | Low | Medium | Medium | Consistent distributed KV |
FDB uses a custom B-tree variant internally on each Storage Server. The distributed nature adds replication overhead but also provides horizontal read scaling.
Write-Ahead Log (WAL) — Crash Recovery Mechanics
Every durable storage engine uses a WAL. Understanding WAL mechanics is essential for understanding what FDB guarantees and what xSync (in the SQLite VFS) must do.
The problem: writing a page to disk is not atomic. A 4 KB page write may partially complete before a power failure. Result: corrupted page.
WAL solution:
1. Before modifying Page X, write the INTENT to the WAL:
WAL record: {LSN: 1042, page: X, before: <old bytes>, after: <new bytes>}
2. fsync() the WAL record to disk (durability guarantee)
3. Now apply the change to Page X in the buffer cache
4. Page X is written back to disk lazily (the WAL already protects us)
On crash recovery:
- Re-apply all WAL records since the last checkpoint → consistent state
- If a WAL record is incomplete (partially written) → discard it; the prior state is clean
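The append/replay cycle can be sketched as follows. This is a minimal illustration assuming length-prefixed records in a byte slice; real WALs (SQLite, LevelDB, FDB’s TLog) add checksums, LSNs, and actual fsync calls.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// Minimal WAL sketch: each record is a 4-byte little-endian length followed
// by the record bytes. Recovery replays complete records and discards a torn
// tail, mirroring the "incomplete record → discard" rule above.

func walAppend(log, rec []byte) []byte {
	var hdr [4]byte
	binary.LittleEndian.PutUint32(hdr[:], uint32(len(rec)))
	return append(append(log, hdr[:]...), rec...)
}

// walReplay returns every complete record; a partially written tail is ignored.
func walReplay(log []byte) [][]byte {
	var recs [][]byte
	r := bytes.NewReader(log)
	for {
		var hdr [4]byte
		if _, err := io.ReadFull(r, hdr[:]); err != nil {
			return recs // end of log, or a torn header
		}
		rec := make([]byte, binary.LittleEndian.Uint32(hdr[:]))
		if _, err := io.ReadFull(r, rec); err != nil {
			return recs // torn record body: crash mid-write, discard
		}
		recs = append(recs, rec)
	}
}

func main() {
	log := walAppend(nil, []byte("set color=red"))
	log = walAppend(log, []byte("set count=5"))
	log = append(log, 0x0D, 0x00)    // simulate a torn write: half a header
	fmt.Println(len(walReplay(log))) // 2 — only the complete records survive
}
```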
FDB’s relationship with WAL:
FDB’s Transaction Log layer IS the WAL. When a commit is acknowledged, the write is persisted in the Transaction Log on f+1 machines (where f is the fault tolerance factor). Storage Servers apply these writes asynchronously. This is why FDB’s commit is durable even if Storage Servers crash before applying the write — the Transaction Log holds the record.
This also explains why our SQLite VFS (option-b-sqlite) makes xSync a no-op: by the time Transact() returns, FDB has already done the equivalent of fsync on the WAL.
Buffer Pool Management
In any storage engine, the buffer pool (or page cache) is the in-memory cache of disk pages. It is often the single most important performance variable.
Key metrics:
- Hit ratio: fraction of page reads served from cache (aim for >99%)
- Eviction policy: LRU (Least Recently Used), LFU (Least Frequently Used), CLOCK
- Write-back vs. write-through: most use write-back (dirty pages flushed lazily to amortize I/O)
Why this matters for FDB:
FDB’s clients do not manage a buffer pool. Each Storage Server manages its own buffer pool internally. When you call rt.Get(key), the FDB client sends a network request; the Storage Server consults its buffer pool and returns the value. You have no visibility into whether this was a cache hit or a disk read.
For read-heavy workloads, adding more Storage Servers (and thus more total buffer pool capacity) linearly increases the read cache size.
Where Each Lab Sits in the Stack
┌────────────────────────────────────────────────────────────────┐
│ Application code (demo/main.go)                                │
├────────────────────────────────────────────────────────────────┤
│ option-a-leveldb │ option-a-sqlite │ option-c-record           │
│ (LevelDB API)    │ (SQL engine)    │ (Record + Index)          │
├────────────────────────────────────────────────────────────────┤
│ FoundationDB client library (fdb.Transact / ReadTransact)      │
├────────────────────────────────────────────────────────────────┤
│ FDB cluster (coordinators, proxies, storage)                   │
├────────────────────────────────────────────────────────────────┤
│ option-b-sqlite            │ option-b-leveldb                  │
│ (SQLite VFS)               │ (LevelDB storage.Storage)         │
├────────────────────────────────────────────────────────────────┤
│ FDB as the backing file store                                  │
└────────────────────────────────────────────────────────────────┘
Options A and C sit above FDB — they use FDB as a storage substrate and build a new API on top. The B options sit below an existing engine — they replace the file system with FDB while keeping the engine’s logic intact.
This is a crucial architectural distinction. In the A/C case, FDB’s ACID semantics are exposed to the caller. In the B case, FDB’s semantics are hidden inside a file-like abstraction, and the engine above (LevelDB, SQLite) provides its own ACID layer — layered on top of FDB’s.
Interview Questions
Q: What is write amplification and why does it matter?
Write amplification is the ratio of bytes written to disk versus bytes written by the application. A write amplification of 10× means writing 1 MB of data causes 10 MB of disk writes. LSM trees have high write amplification (compaction rewrites data multiple times) but low latency writes. B-trees have low write amplification but high random-write I/O. At scale, write amplification directly determines storage cost and SSD wear.
Q: Why does an LSM tree use multiple levels?
Compaction merges SSTables from one level into the next. Without levels, you’d need to merge all SSTables at once — an O(total_data) operation for every write. Leveled compaction keeps each level to a fixed size ratio (~10×), so each compaction is bounded. This bounds read amplification too: there is at most one SSTable per key per level, so a lookup checks at most L SSTables.
Q: What is the difference between a WAL and an SSTable?
A WAL (Write-Ahead Log) is an append-only sequence of write operations — it records WHAT changed, not the final state. It’s used for crash recovery. An SSTable is an immutable, sorted snapshot of actual data — it stores key-value pairs in sorted order for efficient lookup. The WAL is ephemeral (truncated after checkpointing). SSTables are persistent (they ARE the data).
Q: What makes FDB’s storage engine different from a traditional B-tree?
FDB’s storage servers use a custom B-tree variant, but the key difference is architectural: FDB is distributed, sharded, and replicated. Each Storage Server holds a shard (a key range). Reads are served directly by Storage Servers (bypassing the Proxy). Writes go through the Transaction Log for durability before being applied to Storage Servers. This means durability and availability are decoupled — you can lose a Storage Server without losing committed data.
FoundationDB in Depth
FoundationDB is one of the most carefully engineered distributed databases ever built. This chapter goes far beyond the API surface — it covers the commit pipeline, MVCC mechanics, cluster architecture, and the simulation harness that makes FDB trustworthy at Apple-scale production.
2.1 What FDB Actually Is
FoundationDB is a distributed, ordered, transactional key–value store. Each word carries precise meaning:
Distributed: data is automatically sharded across machines. You do not choose a partition key. FDB’s shard boundaries move dynamically as data grows or machines are added. The client library discovers shard locations by reading the cluster file and caching routing information.
Ordered: keys are stored in lexicographic byte order across the entire cluster. GetRange(begin, end) returns keys in sorted order regardless of which Storage Servers hold them. This is the single most powerful property for building layers, because co-location of related data is entirely under your control via key encoding.
Transactional: full ACID across arbitrary key ranges, on multiple machines, at unlimited scale. This is rare. DynamoDB offers single-item atomicity (later added limited cross-item transactions). Cassandra offers lightweight transactions (LWT) at one consistency level. Redis offers multi-key transactions (MULTI/EXEC) without conflict detection. FDB offers true serializable transactions with automatic conflict detection and retry — across any number of keys, on any number of machines.
Key–value: the value is an opaque byte string (max 100 KB). FDB does not care about structure inside the value. All richness is imposed by the layer above — this is by design. It keeps FDB’s guarantees simple and composable.
2.2 The Ordered KV Model β Deep Dive
The FDB key space is conceptually a single sorted array of byte arrays, spanning the entire cluster:
Key space (lexicographic byte order):
"" (smallest possible key)
...
"\x00"
"\x00\x00"
...
"apple"
"apple\x00price"   ← prefix scan works naturally
"apple\x00weight"
"banana"
...
"\xff\xff\xff..." (largest possible key)
Range semantics — half-open intervals:
All FDB ranges are [begin, end) — inclusive begin, exclusive end. This is universal in FDB’s API:
fdb.KeyRange{Begin: fdb.Key("apple\x00"), End: fdb.Key("apple\x01")}
// Returns: apple\x00price, apple\x00weight, apple\x00... everything
// Does NOT return: apple\x01, banana, ...
The trick of incrementing the last byte (\x00 → \x01) creates a tight upper bound for prefix scans. This pattern appears in every layer in this repo.
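The upper-bound computation can be written as a small helper. This standalone sketch also handles the edge case of trailing 0xFF bytes; the official Go bindings expose the same idea via fdb.PrefixRange and fdb.Strinc.

```go
package main

import (
	"bytes"
	"fmt"
)

// prefixEnd computes the tight exclusive upper bound for a prefix scan:
// the prefix with its last byte incremented, carrying over any trailing
// 0xFF bytes (which cannot be incremented).
func prefixEnd(prefix []byte) []byte {
	end := append([]byte(nil), prefix...)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] != 0xFF {
			end[i]++
			return end[:i+1]
		}
	}
	return nil // prefix is all 0xFF bytes: no finite upper bound exists
}

func main() {
	begin := []byte("apple\x00")
	end := prefixEnd(begin) // "apple\x01"
	// Keys carrying the prefix sort inside [begin, end); others fall outside.
	inRange := func(k string) bool {
		return bytes.Compare([]byte(k), begin) >= 0 && bytes.Compare([]byte(k), end) < 0
	}
	fmt.Println(inRange("apple\x00price"), inRange("banana")) // true false
}
```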
Why ordering is the enabler for layers:
Without ordering, you could only do point lookups (O(1) Get). Ordering enables:
- Prefix scans: all keys for user 42 (range scan on `user:42:*`)
- Range queries: all records with `age BETWEEN 25 AND 40` (range scan on encoded age values)
- Sorted iteration: return all records in PK order (scan the records subspace)
- Pagination: “next page” = resume range scan from last seen key
Every SQL feature that involves ORDER BY, BETWEEN, prefix LIKE, or primary key range is implementable purely through ordered range scans.
2.3 Transactions: The Full Picture
ACID Guarantees
FDB transactions implement Strict Serializability (Herlihy & Wing, 1990): the strongest isolation level. It is equivalent to:
- Serializability (transactions appear to execute one at a time)
- Real-time ordering (if T1 commits before T2 starts, T1 appears before T2)
Most databases offer weaker levels by default. PostgreSQL defaults to “Read Committed” (not serializable). MySQL InnoDB defaults to “Repeatable Read”. FDB always runs at strict serializability — there is no weaker option.
The Commit Pipeline
This is the key to understanding FDB’s performance characteristics:
Client                Proxy               Resolver            Transaction Log     Storage Servers
  │                     │                   │                   │                   │
  │ 1. Begin Txn (local, no network)        │                   │                   │
  │                     │                   │                   │                   │
  │ 2. Reads → GetReadVersion               │                   │                   │
  │ ───────────────────►│                   │                   │                   │
  │ ◄─────────────────── readVersion=v100   │                   │                   │
  │                     │                   │                   │                   │
  │ 3. rt.Get(key)      │                   │                   │                   │
  │ ───────────────────────────────────────────────────────────────────────────────►│
  │ ◄───────────────────────────────────────────────────────────────────────────────│
  │ (reads served directly by Storage Servers — Proxy NOT in critical read path)    │
  │                     │                   │                   │                   │
  │ 4. tr.Set(key,val) (buffered locally, no network yet)       │                   │
  │                     │                   │                   │                   │
  │ 5. Commit: send {readVersion, readSet, writeSet}            │                   │
  │ ───────────────────►│                   │                   │                   │
  │                     │ 6. Assign commitVersion=v101          │                   │
  │                     │ ─────────────────►│                   │                   │
  │                     │ 7. Check conflicts (did anyone write  │                   │
  │                     │    a key in readSet between v100      │                   │
  │                     │    and v101?)     │                   │                   │
  │                     │ ◄─────────────────│                   │                   │
  │                     │ 8. If no conflict: write to TLog      │                   │
  │                     │ ─────────────────────────────────────►│                   │
  │                     │ 9. TLog confirms durability (f+1 copies)                  │
  │                     │ ◄─────────────────────────────────────│                   │
  │ 10. Commit ack      │                   │                   │                   │
  │ ◄───────────────────│                   │                   │                   │
  │                     │ (Storage Servers apply writes asynchronously from TLog)   │
Key insight from the pipeline:
- Reads bypass the Proxy — they go directly to Storage Servers. This means read throughput scales linearly with Storage Server count.
- Writes buffer locally — no network traffic until commit. A transaction that writes 1,000 keys generates exactly one network round-trip.
- Conflict checking is done by the Resolver — a separate stateless process that checks whether any key in the transaction’s read set was written by another transaction between `readVersion` and `commitVersion`. No lock managers, no 2PL.
- Durability before acknowledgment — FDB doesn’t ack a commit until the write is in the Transaction Log on `f+1` machines. Storage Servers may lag behind.
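The Resolver’s rule can be made concrete with a sketch: a transaction conflicts if any key in its read set was written by a commit whose version lies between the transaction’s read version and its commit version. Real Resolvers index recent writes by key range in memory; the linear scan here is just to illustrate the rule.

```go
package main

import "fmt"

// write records one committed mutation in the Resolver's recent history.
type write struct {
	key     string
	version uint64
}

// hasConflict reports whether any key in readSet was written by a commit
// with version in (readVersion, commitVersion] — the window this
// transaction must be serialized against.
func hasConflict(readSet []string, readVersion, commitVersion uint64, history []write) bool {
	for _, w := range history {
		if w.version <= readVersion || w.version > commitVersion {
			continue // outside the conflict window
		}
		for _, k := range readSet {
			if k == w.key {
				return true // someone wrote a key we read: abort and retry
			}
		}
	}
	return false
}

func main() {
	history := []write{{"color", 101}, {"count", 99}}
	// Read "color" at v100, commit attempt at v102: "color" was written at v101.
	fmt.Println(hasConflict([]string{"color"}, 100, 102, history)) // true
	// "count" was written at v99, before our read version: no conflict.
	fmt.Println(hasConflict([]string{"count"}, 100, 102, history)) // false
}
```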
Retry Semantics
The db.Transact(func) loop handles retries automatically. The function is called again if:
- A conflict is detected (error code 1020)
- The transaction is too old (error code 1007 — MVCC window expired)
- A transient network error occurs
The retry rule: the function passed to Transact must be idempotent for side effects outside FDB. FDB operations are always safe to retry (each retry gets a fresh transaction). But if your function sends an email, charges a credit card, or writes to another database, a retry will do that twice.
// WRONG: email sent on retry
db.Transact(func(tr fdb.Transaction) (interface{}, error) {
	tr.Set(key, value)
	sendWelcomeEmail(user) // ← sent multiple times on conflict!
	return nil, nil
})

// RIGHT: defer side effects to after commit
_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
	tr.Set(key, value)
	return nil, nil
})
if err == nil {
	sendWelcomeEmail(user) // ← exactly once
}
Read-Only Transactions
db.ReadTransact(func) opens a read-only transaction. It:
- Does not track a write set
- Does not need a commit round-trip
- Cannot conflict (reads never conflict with other reads)
- Uses a recent committed version as the read version
Read-only transactions are cheaper and should be preferred for any operation that only reads data.
2.4 MVCC — Multi-Version Concurrency Control
FDB stores multiple versions of each key, identified by a monotonically increasing version number (a 64-bit integer incremented globally for every commit):
Version timeline:
v100: Set("color", "red")   ← committed at v100
v101: Set("count", "5")     ← committed at v101
v102: Set("color", "blue")  ← committed at v102
v103: Set("count", "6")     ← committed at v103
A transaction reading at version v101 sees:
- "color"="red" (latest version ≤ v101)
- "count"="5" (latest version ≤ v101)
A transaction reading at version v103 sees:
- "color"="blue" (latest version ≤ v103)
- "count"="6" (latest version ≤ v103)
The 5-second window:
FDB garbage-collects old versions after approximately 5 seconds. This is configurable but the default. A transaction that starts at v100 and then tries to read 6 seconds later will get error 1007 (transaction_too_old). This is why long-running transactions are not supported in FDB.
This is a fundamental design choice. FDB optimizes for high-throughput short transactions. If you need to read data while performing long-running computation (e.g., a streaming job), the correct pattern is:
- Read a batch of data, process it
- Commit the batch
- Read the next batch in a new transaction
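The batch-per-transaction pattern looks like this in outline. The store here is an in-memory stand-in so the sketch runs without a cluster; with real FDB, each batch would be one db.ReadTransact call running a limited GetRange that resumes just after the last key seen, so no single transaction outlives the ~5-second MVCC window.

```go
package main

import (
	"fmt"
	"sort"
)

// store is a stand-in for the database; rangeAfter plays the role of a
// limited range read in one short transaction.
type store map[string]string

// rangeAfter returns up to limit keys strictly greater than cursor, in order.
func (s store) rangeAfter(cursor string, limit int) []string {
	keys := make([]string, 0, len(s))
	for k := range s {
		if k > cursor {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	if len(keys) > limit {
		keys = keys[:limit]
	}
	return keys
}

// processAll walks the whole key space in bounded batches, one short
// "transaction" per batch, and returns the number of batches used.
func processAll(s store, batchSize int, process func(k, v string)) int {
	cursor, batches := "", 0
	for {
		keys := s.rangeAfter(cursor, batchSize) // one short transaction per batch
		if len(keys) == 0 {
			return batches
		}
		for _, k := range keys {
			process(k, s[k])
		}
		cursor = keys[len(keys)-1] // resume just after the last key seen
		batches++
	}
}

func main() {
	s := store{"a": "1", "b": "2", "c": "3", "d": "4", "e": "5"}
	n := processAll(s, 2, func(k, v string) {})
	fmt.Println(n) // 3 — five keys in batches of two
}
```

The cursor is the only state carried between transactions, which is what makes the pattern restartable after a crash.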
MVCC vs. Locking
| Feature | MVCC (FDB, PostgreSQL) | Locking (older MySQL) |
|---|---|---|
| Reads block writers | No | Yes (shared lock) |
| Writers block reads | No | Yes (exclusive lock) |
| Deadlocks | Not possible | Possible (lock cycles) |
| Stale reads | Possible if version too old | Not possible |
| Overhead | Garbage collection of old versions | Lock manager overhead |
MVCC is strictly better for read-heavy workloads and any workload that mixes reads and writes.
2.5 Watches, Versionstamps, and Atomic Operations
These three primitives are rarely covered in introductory FDB material but are essential for building production systems.
Watches
tr.Watch(key) returns a FutureNil that fires when the keyβs value changes. The watch is registered at commit time and fires asynchronously.
Production use cases:
- Distributed lock re-acquisition: hold a lock key; watch it to know when released
- Cache invalidation: watch a βcache bustβ key; invalidate local cache when it fires
- Event notification: watch a βlatest eventβ key; re-read when it changes
- Leader election: watch the leader key; trigger re-election when it’s cleared
// Polling-free pub/sub pattern
func watchForChanges(db fdb.Database, key fdb.Key) {
	for {
		var watch fdb.FutureNil
		db.Transact(func(tr fdb.Transaction) (interface{}, error) {
			watch = tr.Watch(key)
			return nil, nil
		})
		watch.Get() // blocks until key changes
		// process the change...
	}
}
Limits: FDB recommends no more than 10,000 active watches per client. Each watch consumes resources on the Storage Server holding that key.
Versionstamps
A versionstamp is a 10-byte identifier that is globally unique and monotonically increasing. It consists of the commit version (8 bytes) plus a user-assigned offset (2 bytes) for ordering within one transaction.
tr.SetVersionstampedKey(keyTemplate, value): writes a key where one 10-byte slot in keyTemplate is replaced with the commit versionstamp.
// Building a totally ordered event log β no coordination required
keyTemplate := append([]byte("events:"), make([]byte, 10)...) // 10-byte placeholder
keyTemplate = append(keyTemplate, []byte(":metadata")...)
tr.SetVersionstampedKey(fdb.Key(keyTemplate), eventPayload)
// After commit: key is "events:<10-byte versionstamp>:metadata"
// Every write gets a globally unique, strictly ordered key
Why this is powerful:
- A sequence of commits from different clients, on different machines, all get globally ordered keys
- No sequence number coordinator required
- Reading all events in order = a single `GetRange("events:", "events\x01")` scan
This is how fdb-record-layer implements change feeds and how any FDB-based event sourcing system should be built.
Atomic Operations
Atomic operations modify a key without reading it in the same transaction. They commute — two concurrent Add(key, 1) calls on the same key will both succeed, with a final result of 2, even if they had the same read version.
| Operation | Semantics | Use case |
|---|---|---|
| `Add(key, n)` | Atomic 64-bit integer addition | Counters, sequence numbers |
| `BitAnd/Or/Xor` | Bitwise operation on the value | Bit set membership |
| `Max/Min` | Atomic max or min of the current value | High-water marks, metrics |
| `SetVersionstamped` | Stamp the commit version into key/value | Ordered event logs |
| `CompareAndClear` | Clear key if current value equals argument | Expiry, conditional delete |
| `AppendIfFits` | Append bytes if result fits in value size | Accumulator patterns |
Critical property: atomic ops do not generate a read conflict for their key. Two transactions can both do tr.Add("counter", 1) without conflicting. This is fundamentally different from read-modify-write (which would conflict).
Under the hood: atomic ops are sent to the Storage Server as special mutation records that the server applies during version materialization. The server has the current value and applies the operation when making the value visible.
2.6 Cluster Architecture — Every Component Explained
          ┌─────────────────────────────────┐
          │         Client Process          │
          │  fdb.OpenDatabase(clusterFile)  │
          └──────────────┬──────────────────┘
                         │ reads cluster file once;
                         │ caches routing table
          ┌──────────────▼──────────────────┐
          │       Coordinators (3–5)        │
          │      Paxos-based consensus      │
          │  Store: cluster configuration   │
          │  Elect: active Proxies, TLogs,  │
          │         Data Distributor,       │
          │         Ratekeeper              │
          └──────────────┬──────────────────┘
                         │
      ┌──────────────────┼───────────────────┐
      ▼                  ▼                   ▼
┌─────────────────┐ ┌──────────────┐ ┌──────────────────┐
│     Proxies     │ │  Resolvers   │ │ Transaction Log  │
│ (GRV + commit)  │ │ (conflicts)  │ │  (WAL, durable)  │
└────────┬────────┘ └──────────────┘ └──────────────────┘
         │ after conflict check + TLog commit,
         │ Storage Servers apply writes asynchronously
┌────────▼───────────────────────────────────────┐
│           Storage Servers (many)               │
│  Each holds a shard (contiguous key range)     │
│  Serves reads directly; applies mutations      │
│  from TLog                                     │
└────────────────────────────────────────────────┘
Proxies (GRV Proxies + Commit Proxies)
FDB 7.x splits proxy responsibilities:
- GRV Proxies (Get Read Version): assign read versions to transactions. This is a global operation — all reads in the cluster must see a consistent version ordering. GRV proxies serialize these assignments.
- Commit Proxies: receive commit requests, forward to Resolvers for conflict checking, then write to the Transaction Log.
Proxies are stateless (they hold no data). They can be added or removed without data migration.
Resolvers
Resolvers track which keys were written in each recent commit. When a Commit Proxy submits a transaction’s read set, the Resolver checks: “was any key in this read set written by a committed transaction between this transaction’s read version and now?” If yes → conflict, retry. If no → proceed.
Resolvers are in-memory only (they hold recent write history, not persistent data). Their state window matches the MVCC window (~5 seconds of commits).
Transaction Log
The TLog is FDB’s Write-Ahead Log — but distributed. Each commit is written to f+1 TLog machines before being acknowledged to the client. TLog uses its own durable storage (either fdatasync-safe files or SSDs).
After Storage Servers confirm they have applied all mutations up to a version (the “durable version”), TLog can truncate its older entries.
Storage Servers
Each Storage Server holds a subset of the key space (a shard). When a client wants to read a key, it looks up which Storage Server holds that key’s shard (from its cached routing table), and sends the read directly to that server.
Storage Servers use a custom storage engine (called “KeyValueStore” internally) built on a variant of SQLite or a custom B-tree, depending on configuration. Each mutation from the TLog is applied to this local store.
Data Distribution: The Data Distributor process monitors shard sizes and access patterns. If a shard grows too large (>250 MB by default), it splits it and migrates half to another Storage Server. This is transparent to clients.
Coordinators
Coordinators run a Paxos-like consensus protocol to maintain the “cluster controller” — the process that manages the lifecycle of all other FDB roles. If the cluster controller fails, coordinators elect a new one. Coordinators themselves are stateless beyond their Paxos log.
The cluster file (fdb.cluster) contains just the coordinator addresses: fdbdemo:fdbdemo@127.0.0.1:4500. This single file is how clients bootstrap their connection to any FDB cluster.
2.7 Production Limits and Sizing
These are not academic β they affect how you design your layer:
| Parameter | Limit / Default | Implication |
|---|---|---|
| Max value size | 100 KB (102,400 bytes) | Large values must be chunked (option-b) |
| Max transaction size | ~10 MB (reads + writes combined) | Bulk loads must be split |
| Max transaction duration | ~5 seconds (MVCC window) | No long-running transactions |
| Max watches per client | ~10,000 | Watches are not free |
| Max key size | 10 KB | Keep keys short (< 1 KB preferred) |
| Read version latency | ~1 ms (local datacenter) | Fast; dominated by network |
| Commit latency | ~2–5 ms (local datacenter) | TLog write + network |
| Max throughput (single proxy) | ~250,000 operations/sec | Scale horizontally by adding proxies |
| Storage per server | ~100 GB–4 TB typical | FDB handles redistribution automatically |
Production rule of thumb: keep transactions small (< 1 MB), short (< 1 second), and focused (few key ranges). This maximizes throughput and minimizes conflicts.
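Splitting a bulk load to respect these limits is mechanical: accumulate key-value pairs until a byte budget is reached, then start a new chunk. In this sketch each chunk would correspond to one db.Transact call; the budget value is the rule-of-thumb target from above, not an FDB-enforced constant.

```go
package main

import "fmt"

// pair is one key-value write destined for the bulk load.
type pair struct {
	key, val []byte
}

// chunk splits pairs into groups whose total key+value size stays within
// budget, so each group can be committed in its own small transaction.
// An oversized single pair still gets a chunk of its own.
func chunk(pairs []pair, budget int) [][]pair {
	var chunks [][]pair
	var cur []pair
	size := 0
	for _, p := range pairs {
		n := len(p.key) + len(p.val)
		if len(cur) > 0 && size+n > budget {
			chunks = append(chunks, cur) // close the current transaction's batch
			cur, size = nil, 0
		}
		cur = append(cur, p)
		size += n
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	pairs := make([]pair, 10)
	for i := range pairs {
		pairs[i] = pair{key: make([]byte, 100), val: make([]byte, 300)} // 400 B each
	}
	fmt.Println(len(chunk(pairs, 1000))) // 5 — two 400 B pairs fit per 1000 B chunk
}
```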
2.8 The Simulation Harness — Why FDB Is Trusted
FDB’s simulation framework is the most sophisticated correctness testing system in any open-source database. Understanding it explains why Apple, Snowflake, and others trust FDB with mission-critical data.
How Simulation Works
The entire FDB cluster β network stack, disk I/O, clocks, process scheduling β is implemented twice:
- The real implementation (for production)
- A deterministic simulation (for testing)
In simulation mode:
- All network calls are function calls (no actual sockets)
- Disk I/O is simulated in memory with configurable failure injection
- Clocks are virtual (controlled by the simulator)
- Processes are coroutines (no real threads)
- Random seeds control all non-determinism
The simulator injects:
- Machine failures: simulate killing any process at any time
- Network partitions: drop messages between specific nodes
- Disk failures: simulate corrupted writes, partial writes
- Clock skew: advance clocks non-uniformly across nodes
- Slow operations: simulate disk latency spikes
Why Determinism Matters
When a bug is found in simulation:
- Record the random seed that triggered it
- Reproduce the exact same sequence of events with the same seed
- Add print statements, breakpoints, rerun β identical behavior every time
This is impossible with real distributed systems, where timing non-determinism makes bugs nearly unreproducible.
Scale of Testing
Before each FDB release, the simulation runs for thousands of machine-years worth of simulated time. A team member once stated: βWeβve found more bugs through simulation than through all other methods combined.β
This testing discipline is why FDB can make strong correctness guarantees. Itβs also why the FDB team is one of few database teams in the world that will confidently claim their ACID implementation is correct under arbitrary failure injection.
What This Means for You
When you use FDB as your storage substrate, you inherit these correctness guarantees. The five layers in this repository can fail in arbitrary ways β the processes can crash, the network can partition β and FDB will maintain consistency. The only failure that can cause data loss is the simultaneous loss of f+1 machines during a transaction commit window (where f is the configured fault tolerance level, typically 1 for a 3-machine cluster or 2 for a 5-machine cluster).
Interview Questions
Q: What is strict serializability and how does it differ from serializability?
Serializability means transactions appear to execute in some sequential order. Strict serializability (also called linearizability + serializability) additionally requires that the sequential order respects real-time ordering: if transaction T1 commits before transaction T2 begins, T1 must appear before T2 in the serial order. This matters for distributed systems where different clients observe commits at different times. FDB provides strict serializability. Most databases (PostgreSQL serializable, MySQL serializable) provide only serializability.
Q: How does FDB avoid deadlocks?
FDB doesnβt use locks. Instead, it uses optimistic concurrency control: reads are conflict-tracked, writes are buffered, and conflict detection happens at commit time. If a conflict is detected, the transaction retries with a new read version β it never βwaitsβ for another transaction to release a lock. Since thereβs no waiting, there canβt be a cycle of transactions waiting on each other.
Q: What happens to a running FDB transaction if the Proxy crashes?
The clientβs Transact loop handles this. The proxy failure will manifest as a network error on the commit request. The FDB client library translates this to a retriable error, and Transact calls the function again with a new transaction and a new proxy (the client learns the new proxy list from the cluster controller, which it locates via the coordinators). The retry is transparent to application code.
Q: Why does FDB limit transaction size to 10 MB?
FDBβs conflict resolution requires the Resolver to hold recent write history in memory. Large transactions would require holding large amounts of data in the Resolverβs conflict window. Additionally, the Transaction Log must write the entire transaction atomically before acknowledging. Large transactions create tail latency. The 10 MB limit keeps both of these bounded. For bulk loading, the correct approach is to batch writes in 1β5 MB transactions.
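The batching advice above can be sketched as a small helper that splits a bulk load into transaction-sized pieces. This is illustrative, not code from the repo: the `KV` type, `batchBySize`, and the 1 MB budget are assumptions following the rule of thumb, and each resulting batch would then be committed with its own `Transact` call.

```go
package main

import "fmt"

// KV is an illustrative key-value pair type for a bulk load.
type KV struct {
	Key, Value []byte
}

// batchBySize splits kvs into batches whose total key+value bytes stay
// under budget, so each batch fits comfortably in one FDB transaction.
func batchBySize(kvs []KV, budget int) [][]KV {
	var batches [][]KV
	var cur []KV
	size := 0
	for _, kv := range kvs {
		n := len(kv.Key) + len(kv.Value)
		if len(cur) > 0 && size+n > budget {
			batches = append(batches, cur)
			cur, size = nil, 0
		}
		cur = append(cur, kv)
		size += n
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	// Ten ~300 KB values with a 1 MB budget: three values per batch.
	var kvs []KV
	for i := 0; i < 10; i++ {
		kvs = append(kvs, KV{Key: []byte{byte(i)}, Value: make([]byte, 300_000)})
	}
	fmt.Println(len(batchBySize(kvs, 1_000_000)))
}
```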
Q: What is a βread versionβ in FDB?
A read version is a 64-bit integer that represents a point in time in FDBβs commit history. Every committed transaction increments the global version counter. A transaction reading at version V sees all commits with version β€ V and no commits with version > V. Read versions are assigned by GRV Proxies and represent a consistent snapshot of the entire cluster at that version.
Q: Can two FDB transactions that donβt share any keys conflict?
No. Conflict detection in FDB is purely key-based. Two transactions with disjoint key sets can never conflict, regardless of how close together they commit. This is why carefully designed key spaces (using subspaces to isolate different logical entities) dramatically reduce conflict rates in high-throughput applications.
The Layer Concept
The layer concept is the central architectural idea of FoundationDB. FDBβs API surface is tiny: Get, Set, Clear, ClearRange, GetRange, and transactions. All the richness of a βdatabaseβ β tables, indexes, files, records, schemas β must be built from these six primitives. This chapter explains exactly how.
3.1 Encoding Is Everything
FDB stores byte strings. There is no concept of integer columns, string columns, or JSON fields at the FDB level. You must encode your data into bytes and decode it back out. The art of building a layer is entirely the art of encoding β turning rich data structures into sequences of (key, value) pairs such that:
- Unambiguous decoding: given a raw key-value pair, you can reconstruct the original data
- Co-location: related data is adjacent in key order, enabling efficient range scans
- Minimal write amplification: updating one piece of data doesnβt force rewrites of unrelated data
- Sort-preserving encoding: values you want to range-query sort correctly under byte comparison
All four properties must hold simultaneously. Letβs examine each.
Property 1: Unambiguous Decoding
Suppose you store two fields, name and city, in a single key:
key: "user:" + name + ":" + city
If name = "Alice:Paris" and city = "London", the key becomes "user:Alice:Paris:London", which looks identical to name = "Alice", city = "Paris:London". Ambiguous. Solution: choose a separator that cannot appear in field values, or escape it.
FDBβs Tuple layer solves this by escaping \x00 inside strings as \x00\xFF, then using bare \x00 as a separator. This makes encoding unambiguous for arbitrary byte strings.
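This escaping scheme can be sketched in a few lines of Go. This is a hand-rolled illustration of the idea, not the official Tuple code; `escapeNull`, `packFields`, and `splitFields` are made-up names:

```go
package main

import (
	"bytes"
	"fmt"
)

// escapeNull makes a field safe to join with bare 0x00 separators:
// every 0x00 inside the field becomes 0x00 0xFF.
func escapeNull(field []byte) []byte {
	return bytes.ReplaceAll(field, []byte{0x00}, []byte{0x00, 0xFF})
}

// packFields joins escaped fields with a bare 0x00 separator.
func packFields(fields ...[]byte) []byte {
	var out []byte
	for i, f := range fields {
		if i > 0 {
			out = append(out, 0x00)
		}
		out = append(out, escapeNull(f)...)
	}
	return out
}

// splitFields reverses packFields: 0x00 followed by 0xFF is an escaped
// null inside a field; a bare 0x00 ends the field.
func splitFields(key []byte) [][]byte {
	fields := [][]byte{{}}
	for i := 0; i < len(key); i++ {
		switch {
		case key[i] == 0x00 && i+1 < len(key) && key[i+1] == 0xFF:
			fields[len(fields)-1] = append(fields[len(fields)-1], 0x00)
			i++ // skip the 0xFF
		case key[i] == 0x00:
			fields = append(fields, []byte{})
		default:
			fields[len(fields)-1] = append(fields[len(fields)-1], key[i])
		}
	}
	return fields
}

func main() {
	// A field containing a null byte round-trips unambiguously.
	packed := packFields([]byte("Alice\x00Paris"), []byte("London"))
	for _, f := range splitFields(packed) {
		fmt.Printf("%q\n", f)
	}
}
```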
Property 2: Co-location
Imagine storing a userβs profile fields as separate keys:
"user:alice:name" β "Alice"
"user:alice:email" β "alice@example.com"
"user:alice:city" β "Paris"
"user:bob:name" β "Bob"
Reading Aliceβs full profile = one GetRange("user:alice:", "user:alice;") call (the end key increments the final ':' byte by one, so it sits just past every key carrying the prefix). All three keys are adjacent in FDBβs sorted key space. One round-trip, no matter how many fields she has.
Contrast with storing all users in one key: "users" β msgpack([alice, bob, ...]). Reading one user requires reading the entire blob. Updating one user requires reading the entire blob, deserializing, modifying, re-serializing, and writing back. Terrible co-location.
Co-location means: keys for the same logical entity should share a prefix, so they sort together and can be read or deleted in one range operation.
Property 3: Minimal Write Amplification
Consider a user with 50 indexed fields. If a user moves from Paris to Tokyo, only two keys change: the old city index entry (cleared) and the new city index entry (set). The userβs record and all other index entries are untouched.
Designing your encoding so that βupdate one fieldβ touches only the keys for that field is the goal. If your encoding stores the entire record in one value, every field update touches the entire value β wasteful when records are large.
Property 4: Sort-Preserving Encoding
If you want to range-query age BETWEEN 25 AND 40, the encoded age values must sort in the same order as the numeric values.
String encoding fails for integers:
"9" > "30" (byte comparison: '9' = 0x39 > '3' = 0x33)
β sorted order: "1", "10", "100", "2", "20", "3", "30", "9" β wrong
Signed big-endian fails across sign boundary:
-1 β 0xFFFFFFFFFFFFFFFF (sorts AFTER all positive numbers)
The correct encoding (used in every layer in this repo):
// encoding.go (option-c-record-layer)
func encodeInt64(x int64) []byte {
b := make([]byte, 8)
// Flip the sign bit. Maps signed range to unsigned range preserving order.
// -2^63 β 0x0000...0000 (smallest unsigned)
// -1 β 0x7FFF...FFFF
// 0 β 0x8000...0000
// 1 β 0x8000...0001
// 2^63-1 β 0xFFFF...FFFF (largest unsigned)
binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
return b
}
This sign-bit flip is the standard trick for sort-preserving signed integer encoding; variants of it appear in Apache HBase, Apache Cassandra, and Google Bigtable. (FDBβs own Tuple layer achieves the same ordering for integers with a different, variable-length scheme.)
Why XOR with 1<<63 is the same as adding 2^63: For any signed 64-bit integer x, uint64(x) XOR (1<<63) == uint64(x) + 2^63 when computed in 64-bit unsigned arithmetic. XOR is just the bit-manipulation shorthand for the offset encoding.
3.2 The Subspace Pattern
The single most important pattern in any FDB layer is the subspace: a byte prefix that namespaces all keys for a logical entity.
// encoding.go (option-a-leveldb)
type Subspace struct {
prefix []byte
}
func (s Subspace) Pack(userKey []byte) fdb.Key {
out := make([]byte, 0, len(s.prefix)+1+len(userKey))
out = append(out, s.prefix...)
out = append(out, 0x00) // separator byte
out = append(out, userKey...)
return fdb.Key(out)
}
func (s Subspace) Range() fdb.KeyRange {
begin := append([]byte{}, s.prefix...)
begin = append(begin, 0x00)
end := append([]byte{}, s.prefix...)
end = append(end, 0x01) // one byte past the separator
return fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)}
}
Why the separator byte?
Without a separator, subspaces "foo" and "foobar" collide: the key "foobar\x00somekey" has "foo" as a prefix but was packed by the "foobar" subspace. The separator \x00 prevents this: "foo\x00" is never a prefix of "foobar\x00" because their 4th bytes differ (\x00 vs b).
The range trick β \x00 β \x01:
Incrementing the separator byte from \x00 to \x01 creates a tight upper bound. Every key in the "demo\x00" subspace sorts before "demo\x01". So GetRange("demo\x00", "demo\x01") returns exactly the keys for the βdemoβ namespace β no more, no less.
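The same idea generalizes to prefixes that do not end in \x00: strip trailing 0xFF bytes, then increment the last remaining byte. The official Go bindings expose this as fdb.Strinc; here is a standalone sketch (`prefixSuccessor` is an illustrative name):

```go
package main

import (
	"bytes"
	"fmt"
)

// prefixSuccessor returns the first key after every key that starts with
// prefix: drop trailing 0xFF bytes, then increment the last byte.
// (This mirrors what the FDB bindings' Strinc helper does.)
func prefixSuccessor(prefix []byte) []byte {
	end := bytes.TrimRight(prefix, "\xff")
	if len(end) == 0 {
		panic("prefix is all 0xFF: no finite successor exists")
	}
	out := append([]byte{}, end...)
	out[len(out)-1]++
	return out
}

func main() {
	fmt.Printf("%q\n", prefixSuccessor([]byte("demo\x00")))  // the \x00 -> \x01 trick
	fmt.Printf("%q\n", prefixSuccessor([]byte("user:alice:"))) // ':' increments to ';'
}
```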
Subspace composition (nested subspaces):
You can build hierarchical key spaces by composing prefixes:
users subspace: "app\x00users\x00"
orders subspace: "app\x00orders\x00"
indexes subspace: "app\x00idx\x00"
city index: "app\x00idx\x00city\x00"
value "Paris": "app\x00idx\x00city\x00Paris\x00"
Each prefix cleanly separates a logical domain. ClearRange("app\x00users\x00", "app\x00users\x01") atomically deletes all users. ClearRange("app\x00idx\x00city\x00Paris\x00", "app\x00idx\x00city\x00Paris\x01") atomically deletes all city=βParisβ index entries.
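Composition can be captured with one small method on the Subspace type from 3.2. This is a sketch, not repo code: `Sub` is an illustrative name, and Pack returns a plain []byte here (instead of fdb.Key) to keep the example self-contained:

```go
package main

import "fmt"

type Subspace struct {
	prefix []byte
}

// Sub derives a child subspace: parent prefix + 0x00 + name.
// Packing under the child then appends another 0x00 separator,
// producing exactly the nested layouts shown above.
func (s Subspace) Sub(name []byte) Subspace {
	p := append([]byte{}, s.prefix...)
	p = append(p, 0x00)
	p = append(p, name...)
	return Subspace{prefix: p}
}

func (s Subspace) Pack(userKey []byte) []byte {
	out := append([]byte{}, s.prefix...)
	out = append(out, 0x00)
	return append(out, userKey...)
}

func main() {
	app := Subspace{prefix: []byte("app")}
	cityIdx := app.Sub([]byte("idx")).Sub([]byte("city"))
	// "app\x00idx\x00city\x00Paris", matching the layout above.
	fmt.Printf("%q\n", cityIdx.Pack([]byte("Paris")))
}
```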
All five layers in this repo use this pattern:
| Layer | Subspace layout |
|---|---|
| option-a-leveldb | ns + 0x00 + userKey |
| option-a-sqlite | ns + 0x00 + tableName (catalog), ns + 0x01 + tableName + 0x00 + pk (rows) |
| option-b-leveldb | ns + 0x01 + fileType + fileNum (meta), ns + 0x02 + fileType + fileNum + chunkNum (data) |
| option-b-sqlite | ns + 0x00 (size key), ns + 0x01 + pageNum (pages) |
| option-c-record-layer | ns + 0x00 + schema + 0x00 + pk (records), ns + 0x01 + schema + 0x00 + field + 0x00 + value + 0x00 + pk (indexes) |
3.3 The Tuple Layer β Industry Standard Encoding
The official FDB client libraries (Python, Java, Go, Ruby) provide a Tuple encoding that standardizes the above patterns. It encodes lists of typed values into byte strings that sort correctly under FDBβs lexicographic comparison.
Tuple type codes and sort behavior:
| Type | Prefix byte | Encoding | Sort behavior |
|---|---|---|---|
| None / nil | 0x00 | just the prefix | always smallest |
| Byte string | 0x01 | bytes + \x00 (with \x00 β \x00\xFF) | lexicographic |
| Unicode string | 0x02 | UTF-8 + \x00 (same escaping) | lexicographic |
| Nested tuple | 0x05 | nested encoding + \x00 | element-by-element |
| Integer 0 | 0x14 | just the prefix | numeric zero |
| Positive integer | 0x15β0x1C | 1β8 bytes big-endian | numerically ascending |
| Negative integer | 0x0Cβ0x13 | 1β8 bytes, bitwise NOT | numerically ascending |
| Float (32-bit) | 0x20 | 4 bytes, sign-bit-flipped IEEE 754 | numerically ascending |
| Double (64-bit) | 0x21 | 8 bytes, sign-bit-flipped IEEE 754 | numerically ascending |
| Boolean false | 0x26 | just the prefix | false < true |
| Boolean true | 0x27 | just the prefix | false < true |
| UUID | 0x30 | 16 bytes | UUID byte order |
| Versionstamp | 0x32/0x33 | 12 bytes | commit version ordering |
Why the layers in this repo donβt use the Tuple layer:
Using the official Tuple layer would be correct and production-appropriate. The layers here use simpler hand-rolled encodings so that each file is readable without understanding the Tuple specification. When reading the source, you can see append(out, 0x00) instead of tuple.Pack(tuple.Tuple{schema, pk}) β the intent is obvious.
For production use, adopt the Tuple layer. It handles edge cases (null bytes in strings, arbitrarily large integers, nested structures) that the hand-rolled encodings in this repo skip.
3.4 Atomicity as a Design Axiom
The most important design discipline in FDB layer development: every logical operation that must be atomic must live inside one transaction.
This sounds obvious but is violated constantly by developers who treat FDB as a fancy cache. Letβs examine the consequences of each atomicity violation:
Case 1: Split Insert (SQL Layer)
// WRONG β two transactions for one logical insert
rowid := allocateRowid() // Transaction 1: increment counter
writeRow(schema, rowid, values) // Transaction 2: write row data
// If crash between T1 and T2:
// β counter incremented, row never written
// β counter is now permanently incorrect (rowid was "consumed")
// β future inserts skip this rowid
The correct approach (used in option-a-sqlite):
// RIGHT β one transaction
db.Transact(func(tr fdb.Transaction) (interface{}, error) {
// Read and increment rowid counter
rowidBytes, _ := tr.Get(rowidKey).Get()
rowid := decode(rowidBytes) + 1
tr.Set(rowidKey, encode(rowid))
// Write the row
tr.Set(rowKey(schema, rowid), encode(values))
return nil, nil
})
Case 2: Split Index Update (Record Layer)
// WRONG β separate transactions for record and index
tr1.Set(recordKey, newRecord) // T1: update record
tr2.Clear(oldIndexKey) // T2: remove old index
tr2.Set(newIndexKey, nil) // T2: add new index
// If crash between T1 and T2:
// β record shows new value
// β index still points to old value
// β LookupByIndex returns stale data forever
// β This is a phantom index entry β silent corruption
The correct approach (used in option-c-record-layer):
// RIGHT β all in one transaction
s.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
// Read old record to know what indexes to remove
prevBytes, _ := tr.Get(recordKey(s.ns, schema, pk)).Get()
if prevBytes != nil {
var prev Record
msgpackUnmarshal(prevBytes, &prev)
for _, f := range sc.Indexes {
if v, ok := prev[f]; ok {
tr.Clear(indexKey(s.ns, schema, f, v, pk))
}
}
}
// Write new record and new indexes
tr.Set(recordKey(s.ns, schema, pk), encoded)
for _, f := range sc.Indexes {
if v, ok := rec[f]; ok {
tr.Set(indexKey(s.ns, schema, f, v, pk), nil)
}
}
return nil, nil
})
Case 3: Split File Rename (LevelDB Storage Layer)
// WRONG β separate transactions for copy and delete
copyFileData(newFd, oldFd) // T1: copy all chunks
deleteFile(oldFd) // T2: clear old chunks
// If crash between T1 and T2:
// β both files exist simultaneously
// β LevelDB manifest references old name
// β on restart, LevelDB sees duplicate SSTables β corruption
The correct approach (used in option-b-leveldb):
// RIGHT β one transaction for atomic rename
s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
kvs, _ := tr.GetRange(s.dataRange(oldfd), ...).GetSliceWithError()
tr.ClearRange(s.dataRange(oldfd)) // clear old
tr.Clear(s.metaKey(oldfd))
for _, kv := range kvs { // write new
tr.Set(translateKey(kv.Key, oldfd, newfd), kv.Value)
}
tr.Set(s.metaKey(newfd), metaBytes)
return nil, nil
})
The pattern: read old state β clear old keys β write new keys β in one transaction. Crash at any point: transaction not committed, neither state is visible. Crash after commit: new state is fully visible. No intermediate state is ever observable.
3.5 Key Space Design Patterns and Anti-Patterns
Pattern: Sequential Integer Keys
key: ns + 0x00 + encode_uint64(rowid)
Values: row data. Row id is monotonically increasing. New rows always insert at the end of the range (append-like behavior in the key space). Range scans return rows in insertion order.
Production consideration: In a write-heavy workload, all inserts land on the same key space boundary, creating a hotspot on the FDB shard holding the highest rowid. FDB will split this shard, but the split lags the write rate. For very high insert rates, use a randomized prefix (hash of rowid) to distribute writes across shards.
Pattern: Reverse Timestamp Keys
key: ns + 0x00 + encode_uint64(MaxUint64 - timestamp)
Most recent entries sort first. A GetRange with no bounds returns entries in reverse chronological order. Useful for βrecent eventsβ, βactivity feedsβ, βaudit logsβ.
Pattern: Composite Keys for Multi-Dimensional Queries
key: ns + encode(field1) + 0x00 + encode(field2) + 0x00 + pk
Example: city + 0x00 + age + 0x00 + pk enables range queries on city AND age:
- All users in Paris sorted by age: GetRange(prefix + "Paris\x00", prefix + "Paris\x01")
- All users in Paris aged 25β40: GetRange(prefix + "Paris\x00" + encode(25), prefix + "Paris\x00" + encode(40) + "\xff")
This is the compound index pattern. It is exactly how MySQLβs compound indexes work.
Anti-Pattern: Large Values
Storing all of a userβs data in one 50 KB value:
key: ns + pk
value: msgpack({name, email, all_orders: [...], all_addresses: [...], ...})
Problems:
- Reading one field requires reading and deserializing the entire value (O(size))
- Updating one field rewrites the entire value (O(size) write amplification)
- The value may approach FDBβs 100 KB limit
- Hot users (many orders) create unbalanced value sizes
Better: normalize into separate keys per logical entity:
ns + 0x00 + pk β profile (name, email)
ns + 0x01 + pk + orderid β individual orders
ns + 0x02 + pk + addrid β individual addresses
Anti-Pattern: Hot Key Contention
key: "global_counter" β every write touches this
All transactions that increment a counter conflict on this key. At high throughput, this becomes a serialization bottleneck.
Solutions:
- FDB atomic ops: tr.Add("global_counter", delta) applies the increment server-side, so it is atomic, conflict-free, and needs no retry (in the Go bindings, delta is a little-endian-encoded byte slice)
- Sharded counters: split into N shard keys; read by summing all shards
- Approximate counting: use a HyperLogLog or count-min sketch in a single key
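The sharded-counter idea in pure Go: picking a shard key and preparing the little-endian delta that FDB's atomic ADD mutation expects. This is a sketch under stated assumptions: the "counters\x00..." key layout, `numShards`, and the helper names are all made up, and the Transact/GetRange calls that would surround them are omitted.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

const numShards = 16

// shardKey spreads increments for one logical counter across numShards
// keys, so concurrent writers rarely touch the same key.
func shardKey(counter, clientID string) []byte {
	h := fnv.New32a()
	h.Write([]byte(clientID))
	shard := h.Sum32() % numShards
	return []byte(fmt.Sprintf("counters\x00%s\x00%02d", counter, shard))
}

// addParam encodes a delta as the little-endian integer FDB's atomic
// ADD mutation expects, e.g. tr.Add(shardKey(...), addParam(1)).
func addParam(delta int64) []byte {
	b := make([]byte, 8)
	binary.LittleEndian.PutUint64(b, uint64(delta))
	return b
}

// sumShards reads the counter: decode and sum every shard value (in real
// code the values come from one GetRange over the counter's subspace).
func sumShards(values [][]byte) int64 {
	var total int64
	for _, v := range values {
		if len(v) == 8 {
			total += int64(binary.LittleEndian.Uint64(v))
		}
	}
	return total
}

func main() {
	fmt.Printf("%q\n", shardKey("page_views", "worker-7"))
	fmt.Println(sumShards([][]byte{addParam(3), addParam(4), addParam(-2)}))
}
```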
3.6 The FDB Blob Layer Pattern
FDBβs value size limit (100 KB) is not a fundamental limitation β itβs a performance tuning decision. The solution is chunking, used in option-b-leveldb:
key: ns + 0x01 + fileType + fileNum + chunkNum β up to 65,536 bytes
To read a file: range-scan all chunks (sorted by chunkNum due to big-endian encoding), concatenate values.
This pattern generalizes to any large object:
- Files: split by byte range (64 KB chunks)
- Images: split by byte range or by tile
- Blobs: split by arbitrary chunk size
- Large JSON documents: split by logical sub-object
FDBβs official Blob Layer (in the foundationdb-python-layers repository) implements this exact pattern. Cloud object stores such as Amazon S3 and Google Cloud Storage are reported to take a similar approach internally: object data is split into chunks, with chunk locations tracked in a distributed metadata store.
3.7 Schema Evolution β How Layers Survive Change
A layerβs encoding is its schema. Changing the encoding without migrating old data is a common source of production incidents. Strategies:
1. Prefix-versioned keys:
ns + version_byte + ...rest
New code writes version = 0x02 keys; old code writes version = 0x01 keys. A background migration reads all 0x01 keys, converts, writes 0x02 keys. Read code handles both versions.
2. Additive-only changes: If your value is msgpack/Protobuf, adding new optional fields is backward compatible. Old code ignores unknown fields. New code sets defaults for missing fields. This is safe and zero-downtime.
3. Dual writes during migration: Write to old AND new key format simultaneously during a migration window. Verify correctness. Backfill old data. Cut over to new format. Remove old code path. Zero-downtime migration.
4. Never reuse key prefixes: Deleted keys leave no trace in FDB (thereβs no tombstone value). But if you reuse a deleted subspaceβs prefix for a new purpose, you risk stale data colliding with new data. Always use new prefixes for new logical entities.
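Strategy 1 in miniature: read code dispatches on the version byte embedded in the key, so old and new formats can coexist during the migration window. The formats themselves are hypothetical (v1 stores a raw string, v2 prepends a length byte); only the dispatch shape is the point.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical formats: v1 stored the value as a raw string,
// v2 prepends a one-byte length.
func decodeV1(value []byte) (string, error) { return string(value), nil }

func decodeV2(value []byte) (string, error) {
	if len(value) == 0 || int(value[0]) > len(value)-1 {
		return "", errors.New("malformed v2 value")
	}
	return string(value[1 : 1+value[0]]), nil
}

// decode dispatches on the version byte that prefixes every key in the
// subspace; a background migration rewrites 0x01 keys as 0x02 keys.
func decode(key, value []byte) (string, error) {
	if len(key) == 0 {
		return "", errors.New("empty key")
	}
	switch key[0] {
	case 0x01:
		return decodeV1(value)
	case 0x02:
		return decodeV2(value)
	default:
		return "", fmt.Errorf("unknown key version 0x%02x", key[0])
	}
}

func main() {
	v1Name, _ := decode([]byte{0x01, 'a'}, []byte("Alice"))
	v2Name, _ := decode([]byte{0x02, 'a'}, []byte{5, 'A', 'l', 'i', 'c', 'e'})
	fmt.Println(v1Name, v2Name)
}
```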
Interview Questions
Q: Why is key ordering so important in FDB β canβt you just use a hash map?
A hash map gives O(1) point lookups but cannot efficiently answer range queries (βall users aged 25β40β), prefix queries (βall keys starting with βuser:42:ββ), or sorted iteration. Ordered storage enables all three with a single range scan. The entire record layer concept β where secondary indexes are just cleverly prefixed keys β relies on ordering. Without ordering, every query would require a full scan.
Q: What is a βsubspaceβ in FDB and why is it important?
A subspace is a key prefix that namespaces a set of keys. It provides namespace isolation (keys in subspace A cannot collide with keys in subspace B), atomic deletion (one ClearRange deletes all keys in a subspace), and efficient range queries (one GetRange over the subspace prefix reads all related data). Every production FDB layer uses subspaces as the fundamental organizational unit.
Q: How do you handle null bytes inside keys in FDB?
FDB keys can contain any bytes, but if you use \x00 as a separator, null bytes inside field values create ambiguity. The official solution is the Tuple encoding: escape \x00 inside string values as \x00\xFF, and use bare \x00 as the separator. The decoder sees \x00\xFF and knows itβs an escaped null; bare \x00 ends the field. The layers in this repo use simple encodings that assume field values donβt contain null bytes β acceptable for a teaching implementation, not for production with arbitrary user data.
Q: Why is the PutRecord function in the record layer more expensive than a simple Set?
PutRecord must read the old record before writing the new one, so it can remove stale index entries. This adds one Get call to every write β a full round-trip to a Storage Server. The cost is necessary: without reading the old record, you canβt know which old index entries to remove. Production systems amortize this by batching multiple record updates in one transaction, or by using techniques like βshadow documentsβ (keeping old field values in the record itself to avoid the read).
Q: What is the difference between Transact and ReadTransact in FDB?
ReadTransact opens a read-only transaction: no conflict tracking, no commit round-trip, cheaper. Itβs for operations that only read data. Transact opens a read-write transaction: tracks the read set for conflict detection, buffers writes, commits at the end. Use ReadTransact whenever youβre not writing anything β itβs faster and imposes less load on the cluster.
How Real Systems Use FDB
One of the most useful ways to understand an architecture is to see it in production at scale. FoundationDB is unusual in that several of the systems built on it are open-source or at least publicly described. Each of the five labs in this repo maps directly to a pattern used in a real, deployed production system.
4.1 Appleβs FoundationDB Record Layer (iCloud / CloudKit)
The fdb-record-layer is a Java library open-sourced by Apple in 2018. It is the storage engine behind Appleβs CloudKit metadata services β the cross-device sync that backs Photos, Contacts, Notes, iMessage, and nearly every other iCloud-connected app.
What it provides:
- Protobuf-typed records (define your schema in .proto files)
- Declarative index definitions: value indexes, rank indexes, aggregate indexes, text indexes
- A query planner that compiles predicates into FDB range scans
- Schema evolution (add/remove/change fields without downtime)
- Continuations (long queries can be resumed across transactions)
- Change feeds via versionstamped keys
How the key layout maps to our option-c-record-layer:
fdb-record-layer uses a similar two-subspace structure:
[storePrefix] [META_DATA_VERSION] β version metadata
[storePrefix] [RECORD_TYPE_KEY] β schema info
[storePrefix] [RECORD] [pk bytes] β serialized Protobuf record
[storePrefix] [INDEX] [indexName] [indexValue] [pk bytes] β index entry
Our record layer uses a simpler version:
ns + 0x00 + schema + 0x00 + pk β msgpack record
ns + 0x01 + schema + 0x00 + field + 0x00 + value + 0x00 + pk β index entry (empty value)
The production scale: CloudKit syncs data for hundreds of millions of Apple devices. Every Photos upload, every note edit, every contact sync creates record layer transactions. FDB handles this via horizontal partitioning: different usersβ data lives in different FDB shards, so per-user transaction load is low. The global throughput is enormous, but each individual transaction is small.
Key insight β continuations: fdb-record-layerβs most important production feature is the βcontinuationβ β a cursor that can be serialized, stored, and resumed across transaction boundaries. Since FDB limits transaction duration to 5 seconds, long queries must be broken into chunks. A continuation saves the last-seen key so the next transaction can resume exactly where the previous one left off. Our labs donβt implement this, but every production layer needs it for full-table scans, bulk exports, and maintenance jobs.
Schema evolution in production: Adding a new field to a record type is safe (existing records just donβt have it; the field defaults to absent). Removing a field requires first removing all code that writes it, then removing the field from the schema. Renaming a field is the dangerous operation β it looks like βadd new field + remove old fieldβ but requires migrating all existing records. fdb-record-layer provides a dedicated schema migration API for this case.
4.2 FoundationDB Document Layer (MongoDB Wire Protocol)
The Document Layer was an open-source FDB layer that spoke the MongoDB wire protocol. A MongoDB client (driver, shell, or Compass) could point at the Document Layer unchanged, and all data was stored in FDB with ACID guarantees.
What it did:
- Parsed MongoDB wire protocol (BSON over TCP)
- Translated BSON documents into FDB key-value pairs
- Translated MongoDB queries ({city: "Paris", age: {$gte: 25}}) into FDB range scans and filters
- Provided MongoDBβs consistency guarantees (which are weaker than FDBβs; the Document Layer could offer stronger consistency than stock MongoDB by virtue of FDBβs serializable transactions)
Key encoding for a BSON document:
[collectionPrefix] [docId] β BSON document value
[indexPrefix] [fieldName] [fieldValue] [docId] β index entry
This is structurally identical to option-c-record-layer with schema = collectionName.
Why it was deprecated: The Document Layer was maintained by FoundationDBβs internal team at Apple. After FDB was open-sourced in 2018, maintaining a full MongoDB-compatible query engine (complicated further by licensing changes in the MongoDB ecosystem in 2019) fell outside the teamβs scope. It was deprecated in 2019, but the codebase remains instructive as a reference implementation.
Lesson: A full relational/document query engine can be built atop FDB in ~50,000 lines of C++. The query planner complexity is in compiling predicate trees to efficient index scan strategies. The storage layer is just option-c-record-layer.
4.3 mvsqlite β SQLite on FDB with Multi-Process MVCC
mvsqlite is a Rust project by Heyang Zhou (losfair). It implements a SQLite VFS backed by FDB such that multiple processes can open the same βdatabaseβ concurrently with full MVCC isolation.
How it maps to option-b-sqlite:
Our pagestore stores pages at:
ns + 0x01 + pageNum (big-endian) β 4096-byte page content
mvsqlite stores pages at:
[namespace] [pageNum_big_endian] β page content at that version
What mvsqlite adds beyond our implementation:
- Page-level MVCC: Instead of storing just the current page, mvsqlite stores multiple historical versions. Old page versions are kept for active readers, implemented by appending a version suffix to the page key:
[namespace] [pageNum] [snapshotVersion] β page at that version
A reader at snapshot V only reads pages at version <= V. The GC process cleans up versions older than the oldest active reader.
- Multi-process write serialization: mvsqlite serializes concurrent writers using an FDB transaction on a βwrite lockβ key. The winner gets exclusive write access; losers retry. SQLiteβs own xLock/xUnlock calls are translated to FDB key operations.
- Write-ahead buffering in FDB: Instead of a local WAL file, mvsqlite buffers pending writes in FDB under a separate subspace. Commit flushes the buffer to the page subspace atomically.
- Namespace mapping: SQLite filenames map to FDB key prefixes. ATTACH DATABASE 'db2.mvsqlite' AS db2 opens a second namespace in the same FDB cluster.
The production use case: A web application running in a multi-process or multi-instance environment (e.g., multiple Gunicorn workers, multiple Docker containers) can share a single mvsqlite database. Each worker reads from a consistent snapshot without locking other workers. Writes are serialized. This is impossible with standard SQLite (which only allows one writer and assumes a shared filesystem).
4.4 LevelDBβs goleveldb Storage Interface Pattern
The goleveldb storage.Storage interface (used in option-b-leveldb) is a seam designed for exactly this purpose: replacing the local filesystem with a remote storage backend while keeping all of LevelDBβs logic intact.
In practice, option-b-leveldb is analogous to how production systems host embedded databases on shared storage:
Real analog β etcd on block storage: etcd uses bbolt (a B-tree engine) for its data. When running in Kubernetes, etcdβs data directory is a persistent volume (a network-attached block device). The βstorage.Storageβ for etcd is the filesystem API on top of that network block device.
Real analog β TiKVβs RocksDB on FDB: TiKV, the storage layer behind TiDB, uses RocksDB. There are research prototypes that replace RocksDBβs Env (RocksDBβs equivalent of storage.Storage) with a remote storage backend backed by a distributed KV store.
Real analog β FoundationDBβs own Blob Layer: FDBβs official Blob Layer (in the Python layers repository) uses the exact chunking pattern from option-b-leveldb:
[prefix] [blobId] [chunkNum] β up to 10,000 bytes
The chunk size is tunable. A naive chunk-at-a-time read costs O(n_chunks) round-trips; issuing one GetRange over the whole chunk range collapses this to a single round-trip (or a few, for very large blobs).
What makes option-b-leveldb interesting architecturally: LevelDB produces 7 types of files (log, sstable, manifest, current, lock, temp, info). Each has different I/O patterns: log files are append-only, sstables are write-once read-many, manifests are written atomically. Our storage.Storage implementation stores all file types as FDB key ranges, but a production implementation would optimize differently per file type (e.g., log files could use FDBβs atomic append via versionstamps rather than read-modify-write chunking).
4.5 Snowflake's Use of FDB
Snowflake uses FoundationDB as its metadata store: the catalog that tracks table schemas, partition file locations, transaction history, and access controls. The actual table data is stored in cloud object storage (S3/Azure Blob/GCS), but the metadata that makes queries possible lives in FDB.
Why FDB for metadata?
- Cloud object storage is strongly consistent for object operations but cannot do atomic multi-object updates. Creating a table, writing partition files, and committing a transaction all need to appear atomically or not at all. FDB provides this.
- Metadata is small but highly contended. FDB's optimistic concurrency and high throughput handle schema-level operations (DDL, partition registration) without bottlenecks.
- FDB's key ordering makes catalog operations efficient: scan all partitions for table T, scan all columns in schema S, scan all transactions in time range [t1, t2].
The encoding (likely, based on public talks):
[catalog] [tenant] [schemaName] [tableName] [partitionId] -> partition metadata
[txnLog] [startVersion] -> transaction record
[schema] [tenant] [schemaName] [tableName] -> column definitions
This is structurally option-a-sqlite (the SQL catalog pattern) extended to a distributed multi-tenant context.
The lesson: FDB is not used because Snowflake needed a storage engine for query data (S3 handles that). FDB is used because catalog metadata needs ACID guarantees that no other component in a cloud-native stack provides cheaply. If you're building a distributed system that needs a consistent metadata store, FDB is frequently the correct answer.
4.6 Apple CloudKit: FDB at Hundreds-of-Millions-of-Devices Scale
CloudKit is Apple's cross-device sync infrastructure. It stores photo metadata, notes content hashes, contact sync tokens, and app-defined data for third-party iOS apps. As of 2023, it serves hundreds of millions of Apple devices.
The FDB usage in CloudKit:
- User records (contacts, notes, photo metadata) are stored as fdb-record-layer records
- Sync tokens are versionstamped keys that encode the exact FDB version at which a sync point was taken
- Conflict resolution uses FDB's read-your-writes consistency to detect and merge concurrent edits
Why versionstamps are central to sync:
// Sync token = FDB versionstamp at the time of last sync
// Next sync: "give me all changes since versionstamp V"
// FDB key encoding: versionstamp -> big-endian 10-byte key prefix
// GetRange from (V, MAX) returns all records created/modified after V
This is FDB's versionstamp feature in action: each committed transaction gets a globally monotonic 10-byte version. Records written in that transaction embed the versionstamp in their key, enabling efficient "fetch all changes since X" queries with a single range scan.
Horizontal partitioning: Each CloudKit user's data lives in a dedicated FDB keyspace (their userID is a subspace prefix). No two users' transactions conflict because they operate on disjoint keys. This is how FDB achieves per-user isolation without database-per-user overhead.
Interview Questions
Q: Why would you use FDB over Postgres for metadata in a distributed system like Snowflake?
Postgres is a single-node database. To scale writes, you need sharding or replication, which introduces coordination complexity. FDB is horizontally scalable by design: adding machines increases throughput linearly. For metadata that needs global ACID guarantees across hundreds of nodes, a single-node Postgres (or even a replicated Postgres with synchronous replication) cannot match FDB's throughput. FDB also runs in the same datacenter/cloud region as the application, eliminating cross-region round-trips for metadata operations.
Q: The Document Layer is deprecated. Does that mean FDB isn't good for document storage?
No. The Document Layer was deprecated because maintaining a MongoDB-compatible query engine (including the query planner, aggregation pipeline, and wire protocol) was expensive. The underlying pattern (document as FDB record, query predicates as range scans over index subspaces) is sound and is exactly what fdb-record-layer implements (with Protobuf instead of BSON). The Document Layer's death was organizational, not architectural.
Q: How does mvsqlite enable multiple concurrent writers to the same SQLite database?
Standard SQLite uses POSIX file locks (or Windows file locks) to serialize writes. One writer holds a RESERVED lock; others wait. mvsqlite replaces file locks with FDB transactions on a "write lock" key. The writer that wins the FDB transaction proceeds; others retry. Because FDB's transaction semantics are stronger than POSIX file locks (FDB provides serializable isolation; POSIX locks only prevent concurrent access), mvsqlite actually provides stronger guarantees than standard SQLite in a networked environment.
Q: What is a "continuation" in fdb-record-layer and why does every production FDB layer need one?
FDB limits transaction duration to 5 seconds. Any operation that might take longer (a full-table scan, bulk export, index rebuild) must be broken into multiple transactions. A continuation is a serialized cursor: it records which keys have been processed so the next transaction can resume from the right place. Without continuations, a query that touches more records than fit in one 5-second window simply cannot be executed. In our labs, we ignore this constraint: all queries run in one transaction, which fails with a transaction-too-old error if the dataset is large enough. Adding continuation support is the single most important production-readiness improvement for any of these layers.
This Repository: Five Implementations
This repo contains five working implementations of FDB layers, ordered from conceptually simple to architecturally sophisticated. Each folder is self-contained: it has its own go.mod, source code, and docs/GUIDE.md with a deep dive into how and why that layer is built the way it is.
foundationdb-labs/
├── option-a-leveldb/        LevelDB API above FDB          -> docs/GUIDE.md
├── option-a-sqlite/         SQL engine above FDB           -> docs/GUIDE.md
├── option-c-record-layer/   Records + secondary indexes    -> docs/GUIDE.md
├── option-b-leveldb/        LevelDB on FDB storage         -> docs/GUIDE.md
└── option-b-sqlite/         SQLite VFS substrate           -> docs/GUIDE.md
5.1 Recommended Reading Order
| Order | Folder | Core concept learned | Difficulty |
|---|---|---|---|
| 1 | option-a-leveldb | Subspaces, MVCC snapshots, write batches, iterators | ★★ |
| 2 | option-c-record-layer | Secondary indexes, index consistency, PK encoding | ★★★ |
| 3 | option-a-sqlite | Table catalog, row encoding, column types | ★★★ |
| 4 | option-b-sqlite | Page model, byte-range I/O, partial-page RMW | ★★★★ |
| 5 | option-b-leveldb | File chunks, storage.Storage interface, atomic rename | ★★★★ |
Start with option-a-leveldb regardless of your background. It introduces all the core patterns (subspaces, range scans, snapshot reads, transaction retry loops) in the simplest possible context.
5.2 Layer-by-Layer Architecture Map
option-a-leveldb (API Layer above FDB)
LevelDB API (Get/Put/Delete/Batch/Iterator/Snapshot)
↓
FDB KV store (subspace-encoded keys)
The layer sits above FDB. User code calls db.Put(key, value). The layer encodes the key as ns + 0x00 + key and calls tr.Set(encodedKey, value). All LevelDB semantics (iterators, snapshots, batches) are implemented using FDB primitives.
Why this layer: LevelDB's API is the lingua franca of embedded KV stores. By implementing it on FDB, existing code that uses LevelDB can be dropped into a distributed, replicated context without changing its API usage.
option-c-record-layer (Record + Index Layer)
Application (PutRecord / GetRecord / LookupByIndex)
↓
FDB (two subspaces: records and indexes)
The layer sits above FDB. A record is stored at a fixed key. Every indexed field also has an index key in a separate subspace. Both are written in the same transaction to keep records and indexes consistent.
Why this layer: This is the fundamental pattern for any "database" feature. Tables, collections, documents: all of them are "records with indexes." Understanding how to keep a record and its indexes consistent under concurrent writes is the core of database internals.
option-a-sqlite (SQL Engine Layer)
SQL (CREATE TABLE / INSERT / SELECT / UPDATE / DELETE)
↓
Table catalog (FDB) + Row store (FDB)
A minimal SQL parser and query engine above FDB. Tables are defined in a catalog (a separate subspace). Rows are encoded as msgpack values keyed by primary key. SELECT * is a range scan over the table's key range.
Why this layer: Shows that a full (if minimal) relational database is ~300 lines of Go above FDB. The interesting parts are the catalog encoding and the primary-key sort-order problem.
option-b-sqlite (SQLite VFS Substrate)
SQLite (full SQL engine)
↓
SQLite VFS API (xRead / xWrite / xSync / xLock)
↓
FDB (page-keyed store: pageNum -> 4096 bytes)
SQLite provides its own SQL engine, query planner, and transaction management. The VFS is just a storage substrate: given a logical page number, read or write 4096 bytes. FDB stores those pages.
Why this layer: Shows how to extend an existing system by replacing its storage interface rather than reimplementing its logic. The interesting parts are partial-page writes (SQLite calls xWrite with sub-page ranges, requiring read-modify-write) and locking (translating SQLite's POSIX lock semantics to FDB key operations).
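The partial-page read-modify-write dance can be sketched in memory. This illustrates the idea only; the lab's actual pagestore types differ, and in the real layer the read and the write-back happen inside one FDB transaction:

```go
package main

import (
	"bytes"
	"fmt"
)

const pageSize = 4096

// pageStore maps pageNum -> a full 4096-byte page, like an FDB-backed VFS.
type pageStore map[uint32][]byte

// writeAt handles an xWrite that does not cover a whole page:
// read the existing page, patch the sub-range, write the page back (RMW).
func (ps pageStore) writeAt(pageNum uint32, off int, data []byte) {
	page, ok := ps[pageNum]
	if !ok {
		page = make([]byte, pageSize) // an absent page reads as zeros
	}
	patched := append([]byte(nil), page...) // copy, then patch the sub-range
	copy(patched[off:], data)
	ps[pageNum] = patched
}

func main() {
	ps := pageStore{}
	ps.writeAt(1, 0, bytes.Repeat([]byte{0xAA}, pageSize)) // full-page write
	ps.writeAt(1, 100, []byte("header"))                   // sub-page write -> RMW
	// Bytes around the patch are preserved by the read-modify-write.
	fmt.Println(ps[1][99], string(ps[1][100:106]), ps[1][106]) // 170 header 170
}
```

The cost is one extra round-trip per sub-page write, which is why real VFS layers try to batch page reads inside the surrounding transaction.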
option-b-leveldb (LevelDB Storage Substrate)
LevelDB (full KV engine)
↓
storage.Storage interface (Create / Open / Rename / List)
↓
FDB (file metadata keys + chunked file data keys)
LevelDB provides its own full KV engine. The storage layer is just a filesystem abstraction: create/open/read/write/rename/delete files. FDB stores those files as key ranges.
Why this layer: Shows that even complex storage engines like LevelDB (which manage their own memtables, compaction, and WAL) can be decoupled from the filesystem. The interesting parts are chunk encoding (64KB chunks to stay under FDB's value limit), atomic rename (LevelDB uses rename to commit new SSTables), and why WAL redundancy is a non-issue (FDB's own log subsystem replaces the LevelDB WAL's durability purpose).
5.3 Running the Labs
Each lab has a demo/main.go that exercises the layer. To run any lab:
# Start FDB (requires Docker)
docker compose up -d
# Run a specific demo
cd option-a-leveldb && go run demo/main.go
cd option-c-record-layer && go run demo/main.go
# ... etc.
Prerequisites:
- Docker + Docker Compose (for the FDB cluster)
- Go 1.22+
- The FDB C client library (installed by the bootstrap script: bash scripts/bootstrap-fdb.sh)
FDB API version: All labs use API version 710 (FDB 7.1.x). This is set at process start:
fdb.MustAPIVersion(710)
5.4 Extending the Labs: Suggested Exercises
Each docs/GUIDE.md has exercises specific to that lab. Here are cross-cutting improvements that apply to multiple labs:
1. Continuations (all labs): Add a ScanWithCursor(cursor []byte, limit int) function that returns at most limit results and a cursor byte string. The cursor encodes the last key seen. The next call resumes from the cursor. This makes full-table scans safe under FDB's 5-second transaction limit.
2. Transactions exposed to the caller (option-a-leveldb, option-a-sqlite): Wrap multiple operations in a user-visible transaction: db.Begin() -> Tx, tx.Put(k, v), tx.Get(k), tx.Commit(). Internally, this is just db.Transact(func(tr) { ... }): all operations share one FDB transaction.
3. Versionstamp-based change feed (option-c-record-layer): When writing a record, also write to a change log subspace: changelogKey = ns + changelogSubspace + versionstamp. A background consumer can do GetRange(lastCheckpoint, MAX) to efficiently read all changes since the last poll. This is how fdb-record-layer implements change feeds.
4. Range index queries (option-c-record-layer): Current index lookup finds exact matches (field = value). Add LookupByRange(schema, field, minValue, maxValue) that does GetRange(indexPrefix(field, minValue), indexPrefix(field, maxValue)). This requires your index value to be sort-preserving encoded (integers via encodeInt64, strings lexicographically).
5. Multi-file SQLite (option-b-sqlite): Support ATTACH DATABASE by mapping SQLite filenames to separate FDB subspaces. Two "attached" databases are just two different ns prefixes in FDB; no code duplication needed.
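Exercise 1's cursor scheme can be sketched against an in-memory sorted key list standing in for an FDB range read (names are illustrative, not the labs' API). Each page would run in its own transaction, which is what keeps scans under the 5-second limit:

```go
package main

import (
	"fmt"
	"sort"
)

// scanWithCursor returns at most `limit` keys starting strictly after the
// cursor. The returned cursor is simply the last key of the page; an empty
// cursor means "start from the beginning", and an empty next-cursor means
// "done". Each call would map to one short FDB transaction.
func scanWithCursor(keys []string, cursor string, limit int) (page []string, next string) {
	start := sort.SearchStrings(keys, cursor)
	if cursor != "" && start < len(keys) && keys[start] == cursor {
		start++ // resume strictly after the cursor key
	}
	end := start + limit
	if end > len(keys) {
		end = len(keys)
	}
	page = keys[start:end]
	if end < len(keys) {
		next = keys[end-1]
	}
	return page, next
}

func main() {
	keys := []string{"a", "b", "c", "d", "e"}
	var got []string
	cursor := ""
	for {
		page, next := scanWithCursor(keys, cursor, 2)
		got = append(got, page...)
		if next == "" {
			break
		}
		cursor = next
	}
	fmt.Println(got) // [a b c d e]
}
```

A production continuation would serialize the cursor (e.g. the packed FDB key) so the caller can persist it between requests.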
Reading Guide: Where to Go Next
You've read through the five chapters of this guide and the lab implementation guides. By now you understand FDB's architecture from the commit pipeline through every component of the cluster, the layer concept from byte-level encoding through production schema evolution, and how five real systems (Apple CloudKit, Snowflake, mvsqlite, Document Layer, Record Layer) deploy these exact patterns.
This chapter maps out the landscape of resources for going deeper in each direction.
6.1 FoundationDB: Going Deeper
Essential Reading
FDB Documentation - The official reference. The "Developer Guide" section covers API semantics in depth, including the precise semantics of read-your-writes consistency, exactly which operations conflict, and the full list of transaction options.
FDB White Paper (SIGMOD 2021) - "FoundationDB: A Distributed Unbundled Transactional Key Value Store." The definitive technical reference for FDB's internal architecture. Covers the simulation framework, commit pipeline, log system, and storage layer in academic depth. If you've read this guide, you have the vocabulary to understand every section of this paper.
FDB Forum - Design discussions, Q&A, and announcements. Some of the most insightful posts are from FDB's core team explaining design decisions.
Source Code
apple/foundationdb - The full FDB C++ source. Key files:
- fdbserver/MasterProxyServer.actor.cpp - Commit Proxy implementation (the commit pipeline)
- fdbserver/Resolver.actor.cpp - Resolver (conflict detection)
- fdbserver/TLogServer.actor.cpp - Transaction Log (write-ahead log)
- fdbserver/StorageServer.actor.cpp - Storage Server (reads, MVCC)
- fdbclient/ReadYourWrites.actor.cpp - Client-side read-your-writes cache
- fdbserver/workloads/ - Simulation workloads (randomized fault testing)
FoundationDB/fdb-record-layer - Java. The most complete example of a production FDB layer. Study FDBRecordContext (transaction management), RecordQueryPlanner (query compilation to range scans), and OnlineIndexer (safe online index builds). After this guide, the source will be directly comprehensible.
6.2 Storage Engines: Going Deeper
Books
Designing Data-Intensive Applications by Martin Kleppmann - The best single-volume overview of storage trade-offs, replication, consistency, and distributed transactions. Chapters 3 (storage engines), 7 (transactions), and 9 (consistency) are directly relevant. If you haven't read this book, stop here and read it. It contextualizes everything in this guide.
Database Internals by Alex Petrov - Deep dive into B-trees (B+ variants, page splits, page merges), LSM trees (all the internals: bloom filters, compaction algorithms, manifest management), and distributed consensus (Paxos, Raft, Multi-Paxos). The LSM chapter is the best public explanation of RocksDB/LevelDB internals.
Papers
The Log-Structured Merge Tree (1996) - The original LSM paper by O'Neil et al. Defines the C0/C1/C2 component model that LevelDB simplifies into level-0/level-1/...
Bigtable: A Distributed Storage System for Structured Data (2006) - The architecture paper for Google Bigtable. Introduced the tablet-server model, SSTable format, and hierarchical metadata server. Direct ancestor of HBase, Cassandra's SSTables, and FDB's storage layer design.
Spanner: Google's Globally-Distributed Database (2012) - How Google built a globally distributed ACID database using TrueTime for external consistency. The "Paxos groups" concept is closely related to FDB's shard-level durability. Essential reading for understanding distributed transactions.
Source Code
google/leveldb - The C++ LevelDB. Read: db/version_set.cc (compaction and manifest management), db/log_reader.cc + db/log_writer.cc (WAL format), table/block.cc + table/format.cc (SSTable on-disk format), util/arena.cc (memtable arena allocator).
syndtr/goleveldb - The Go LevelDB used in option-b-leveldb. The storage/ package defines the storage.Storage interface that our layer implements.
sqlite.org/sqlite - The SQLite source. Study: btree.c (B-tree implementation), vfs.c (VFS abstraction), pager.c (page cache and WAL). The SQLite source is famously well-commented.
facebook/rocksdb - RocksDB is LevelDB's production successor. Much more complex (tiered compaction, bloom filters, column families, transactions), but the fundamental data model is identical.
6.3 SQLite VFS: Going Deeper
SQLite VFS Documentation - The full VFS API with method semantics. Essential for understanding what each of the xRead, xWrite, xSync, xLock, xUnlock methods must guarantee.
SQLite File Format - The page-by-page layout of a SQLite database file. After reading the option-b-sqlite guide, this document will make complete sense. The page header format, freelist, overflow pages, and B-tree cell formats are all documented here.
losfair/mvsqlite - The production FDB-backed SQLite VFS. The Rust code is clean and readable. The mvfs/ directory contains the VFS implementation; mvsqlite/ contains the page store. Compare with option-b-sqlite/pagestore/pagestore.go line by line.
SQLite WAL Mode - The WAL (Write-Ahead Logging) mode documentation. Explains how WAL provides concurrent reads and writes. Our pagestore uses rollback-journal-mode semantics (xSync is a no-op because FDB is durable), but understanding WAL mode helps you see what we're not implementing.
6.4 Distributed Systems Fundamentals
The Part-Time Parliament (Paxos) - Lamport's original Paxos paper. Dense, but the consensus problem it solves (how do N machines agree on one value when any machine can fail?) is the foundation of every distributed database.
In Search of an Understandable Consensus Algorithm (Raft) - The Raft paper. More accessible than Paxos. FDB uses a Paxos variant, not Raft, but the problems solved are identical. After this, read the Raft visualization.
Linearizability: A Correctness Condition for Concurrent Objects - The formal definition of linearizability (which FDB provides for reads and writes) and its relationship to sequential consistency and serializability.
A Critique of ANSI SQL Isolation Levels - The Berenson et al. paper that formalized isolation anomalies (dirty reads, non-repeatable reads, phantoms, lost updates, write skew) and showed that ANSI SQL's definitions were imprecise. Essential vocabulary for discussing "serializable" vs "snapshot isolation" vs "repeatable read."
6.5 Building Something Real: A Progression
If you want to go from "I understand these labs" to "I built something production-worthy," here is a concrete progression:
Step 1: Add continuations to option-a-leveldb.
Add IterateWithCursor(cursor []byte, limit int) ([]KV, nextCursor []byte). This is the single most important production feature missing from all the labs. Every real FDB application needs cursor-based pagination for range queries.
Step 2: Add range index queries to option-c-record-layer.
Implement LookupByRange(schema, field string, minVal, maxVal interface{}) ([]Record, error). This requires: (a) sort-preserving encoding for all index value types, and (b) a range scan on the index subspace instead of a point lookup.
Step 3: Implement a simple query planner.
For WHERE city='Paris' AND age >= 25, decide: which index scan is more selective? Scan city='Paris' and filter on age, or scan age >= 25 and filter on city? Look at how fdb-record-layer's RecordQueryPlanner.java makes this decision.
Step 4: Add online index building.
Given a table with 1 million records, add a new index without downtime. The approach: (a) mark the index as "building", (b) a background job scans records in cursor-paginated chunks and writes index entries, (c) any concurrent PutRecord/DeleteRecord writes to both old and new index state, (d) when the background job finishes, mark the index "ready". This is how fdb-record-layer's OnlineIndexer works.
Step 5: Read the fdb-record-layer source.
After steps 1-4, open fdb-record-layer's FDBRecordStore.java, RecordQueryPlanner.java, and OnlineIndexer.java. You will understand every design decision immediately. The gap between "labs exercise" and "production library used at Apple scale" is these four features: continuations, sort-preserving range indexes, query planning, and online index builds.
Option A: LevelDB API on top of FoundationDB
Pattern: "API layer above LevelDB" - we keep the familiar LevelDB surface (Put/Get/Delete/Batch/Iterator/Snapshot), but the bytes never touch LevelDB. They live in FoundationDB instead.
Why this is interesting
LevelDB is a local embedded KV store. FoundationDB is a distributed transactional KV store. Both speak (key, value) pairs over ordered byte keys, so the API shapes are nearly identical, but the guarantees underneath are wildly different:
| Concept | LevelDB | This layer (over FDB) |
|---|---|---|
| Put | Append to MemTable + WAL on local disk | tr.Set inside db.Transact(...) |
| Get | MemTable -> immutable tables -> SSTs | tr.Get (consistent across the cluster) |
| Batch | Single WAL record | One FDB transaction (cross-key atomic) |
| Range | LevelDB SST merging iterator | tr.GetRange over a Subspace |
| Snapshot | Pins on-disk sequence number | SetReadVersion(capturedVersion) (MVCC) |
The mapping is almost mechanical, and that's the point: it makes FDB's primitives concrete.
Files
layer/
encoding.go Subspace: ns + 0x00 + userKey -> fdb.Key, plus range helpers
db.go Open / Close / Get / Put / Delete
batch.go Batch + DB.Write (atomic multi-op)
iterator.go Forward+backward cursor over GetRange results
snapshot.go MVCC snapshot via captured read version
demo/main.go End-to-end: CRUD -> batch -> range -> snapshot vs. live reads
Key-space layout
<namespace> 0x00 <userKey> -> <value>
Subspace (see layer/encoding.go) keeps multiple logical
databases isolated inside one FDB cluster. The 0x00 separator + the fact
that the user-key portion is appended as opaque bytes gives us the same
lexicographic ordering LevelDB users expect.
Mapping each LevelDB op to an FDB call
- Put(k,v) -> db.Transact(func(tr){ tr.Set(ns.Pack(k), v) }). Transact handles automatic retry on conflict.
- Get(k) -> db.ReadTransact(func(rt){ rt.Get(ns.Pack(k)).Get() }). An empty FDB result (nil) becomes our ErrNotFound.
- Delete(k) -> tr.Clear(ns.Pack(k)) (a no-op if the key is absent).
- Batch.Write -> a single Transact containing many Set/Clear ops. Either all of them land, or none do.
- Iterator -> a materialized list from tr.GetRange(subspace.Range()). Real production code would stream via RangeIterator, but materializing keeps the iterator usable outside a live transaction (the LevelDB shape).
- Snapshot -> capture the read version with tr.GetReadVersion(), then on subsequent reads call tr.SetReadVersion(captured) so FDB serves the data as it was at that point in time.
Running
- Start the shared FDB cluster (from the repo root):
  docker compose up -d
  ./scripts/bootstrap-fdb.sh
- Install the FDB client library on the host (required by the Go bindings, which are CGO-linked to libfdb_c):
  brew install foundationdb   # macOS
- Run the demo:
  cd option-a-leveldb
  go run ./demo -cluster ../fdb.cluster
Expected output (abridged):
Get apple -> red
batch applied (cherry+date inserted, banana deleted)
range scan [a, z):
apple = red
cherry = red
date = brown
snapshot read version = 1234567
live Get apple -> green
live Get cherry -> layer: not found
snap Get apple -> red (was 'red' at snapshot)
snap Get cherry -> red (err=<nil>)
That last block is the punchline: the snapshot sees the world before our post-snapshot writes, because FDB just gives us older versions on demand.
What this layer intentionally omits
- Compaction / SST files: there are none. FDB handles storage internally.
- Bloom filters: not needed; FDB indexes by key range natively.
- Write batches with sequence numbers: FDB's transaction commit version plays that role automatically (tr.GetCommittedVersion() after commit).
See ../option-b-leveldb for the inverse experiment: keeping real LevelDB
code paths (memtable + SSTs) but storing the SST bytes in FDB.
Hitchhiker's Guide - Option A: LevelDB API over FoundationDB
The question this answers: "I have code that uses LevelDB. Can I swap in FoundationDB underneath with zero changes to my application?"
The deeper question: "What is LevelDB's API contract, and how exactly does each piece of it map onto FDB primitives?"
Table of Contents
- What LevelDB Is (and Isn't)
- The Six Concepts: Get, Put, Delete, Batch, Iterator, Snapshot
- Subspaces: The Key Encoding
- Write Batches: Atomicity Made Explicit
- Iterators: Range Scans Without Cursors
- Snapshots: MVCC Exposed to the Caller
- How the Demo Works Step by Step
- What the Real LevelDB Does That We Don't
- Real-World Analogue: goleveldb, RocksDB, PebbleDB
- Exercises: Build on This
1. What LevelDB Is
LevelDB (released by Google in 2011) is an embedded, ordered, key-value store implemented as a Log-Structured Merge tree (LSM tree). "Embedded" means it runs in the same process as your application: no server, no network. "Ordered" means the same thing as in FDB: keys are sorted lexicographically, and you can range-scan them efficiently.
LevelDB's API surface is deliberately tiny:
db.Get(key)
db.Put(key, value)
db.Delete(key)
batch := new(leveldb.Batch)
batch.Put / batch.Delete
db.Write(batch)
iter := db.NewIterator(...)
snap := db.GetSnapshot()
This simplicity is why LevelDB became the embedded storage engine of choice for Chrome (IndexedDB), Bitcoin Core, and countless other applications.
Where LevelDB lives in the storage stack:
Application code
↓
LevelDB API (Get/Put/Delete/Batch/Iterator)
↓
LSM tree (MemTable + SSTables on disk)
↓
Filesystem / OS
Where our layer lives:
Application code
↓
[our layer] - same LevelDB-shaped API (Get/Put/Delete/Batch/Iterator)
↓
FDB transactions
↓
FDB cluster (distributed, replicated)
The application sees the same interface. The durability substrate changes completely.
2. The Six Concepts
Get: One Read, One Transaction
In LevelDB, Get opens a brief "read lock" (via a snapshot) and reads one key. There is no explicit transaction; LevelDB handles it internally.
In our layer:
func (d *DB) Get(key []byte) ([]byte, error) {
v, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
return rt.Get(d.ns.Pack(key)).Get()
})
...
}
ReadTransact opens a read-only FDB transaction (no conflict tracking on writes, cheaper than a full Transact). The read version is chosen by FDB to be a recent committed version, typically a few milliseconds behind real time. This gives you a consistent view even if concurrent writers are active.
rt.Get(k) does not block on return. It returns a FutureByteSlice. Calling .Get() on the future is what blocks (sends the request to the FDB storage server and waits for the response). This two-phase call style is how FDB supports pipelining: you can call rt.Get(k1), rt.Get(k2), rt.Get(k3) in sequence, then .Get() all three. FDB sends all three requests before blocking on any response. This is critical for LookupByIndex in option-c, as we'll see.
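The pipelining win can be illustrated with a toy future in plain Go. This is an analogy, not the fdb package's types: each "round-trip" sleeps 50 ms, and issuing all three reads before blocking lets them overlap instead of summing:

```go
package main

import (
	"fmt"
	"time"
)

// future models FDB's issue-now, block-later call style: starting the
// request returns immediately; Get() blocks until the result arrives.
type future struct{ ch chan string }

func get(key string) future {
	f := future{ch: make(chan string, 1)}
	go func() { // stand-in for a network round-trip to a storage server
		time.Sleep(50 * time.Millisecond)
		f.ch <- "value-of-" + key
	}()
	return f
}

func (f future) Get() string { return <-f.ch }

func main() {
	start := time.Now()
	// Issue all three reads before blocking on any response...
	f1, f2, f3 := get("k1"), get("k2"), get("k3")
	_, _, _ = f1.Get(), f2.Get(), f3.Get()
	// ...so the three 50 ms round-trips overlap rather than taking ~150 ms.
	fmt.Println(time.Since(start) < 120*time.Millisecond)
}
```

Calling .Get() immediately after each get() would serialize the round-trips; deferring the blocking calls is the whole trick.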
Put: One Write, One Transaction
func (d *DB) Put(key, value []byte) error {
_, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
tr.Set(d.ns.Pack(key), value)
return nil, nil
})
return err
}
Transact opens a read-write transaction. tr.Set adds the (key, value) pair to the transaction's local write buffer. Nothing is sent to the cluster until the function returns without error, at which point FDB commits. If there was a conflict (another writer touched this key since our read version), FDB calls our function again with a fresh transaction automatically.
Delete: Same as Put, but Clear
tr.Clear(key) adds a tombstone to the write buffer. In FDB's model, a cleared key is identical to a key that was never written; there is no "null" value. This is important: Get on a cleared key returns nil (which our layer translates to ErrNotFound), not a special tombstone value.
When Each One is Right
| Operation | Use when | FDB primitive |
|---|---|---|
| Get | Reading a single key | ReadTransact |
| Put | Writing a single key | Transact + Set |
| Delete | Removing a single key | Transact + Clear |
| Batch | Writing multiple keys atomically (below) | Transact (one) |
3. Subspaces: The Key Encoding
Every key that hits FDB is first passed through the Subspace.Pack method:
// encoding.go
type Subspace struct{ prefix []byte }
func (s Subspace) Pack(userKey []byte) fdb.Key {
out := make([]byte, 0, len(s.prefix)+1+len(userKey))
out = append(out, s.prefix...)
out = append(out, 0x00) // separator byte
out = append(out, userKey...)
return fdb.Key(out)
}
If your namespace is "demo" and your key is "apple", the actual FDB key
is the byte string "demo\x00apple".
Why the separator byte?
Without a separator, two namespaces "foo" and "foobar" would collide:
the key "foobar\x00somekey" would appear inside both fooβs range and
foobarβs range. The separator \x00 prevents this because "foo\x00" is
not a prefix of "foobar\x00".
Why 0x00 specifically?
Because 0x00 is the smallest possible byte value. When we compute the range
end for a subspace scan, we copy the begin key and change the last byte from
0x00 to 0x01:
func (s Subspace) Range() fdb.KeyRange {
begin := append([]byte{}, s.prefix...)
begin = append(begin, 0x00)
end := append([]byte{}, s.prefix...)
end = append(end, 0x01)
return fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)}
}
The range ["demo\x00", "demo\x01") contains exactly and only the keys packed by this subspace. This is a simple, efficient way to express "all keys in this namespace" without needing a sentinel end key.
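The separator-and-range trick can be verified in plain Go, with pack and inRange mirroring the encoding above (standalone copies for illustration):

```go
package main

import (
	"bytes"
	"fmt"
)

// pack mirrors Subspace.Pack: prefix + 0x00 separator + userKey.
func pack(prefix, userKey []byte) []byte {
	return append(append(append([]byte{}, prefix...), 0x00), userKey...)
}

// inRange mirrors Subspace.Range: a key belongs to the subspace iff it
// falls in [prefix 0x00, prefix 0x01).
func inRange(key, prefix []byte) bool {
	begin := append(append([]byte{}, prefix...), 0x00)
	end := append(append([]byte{}, prefix...), 0x01)
	return bytes.Compare(key, begin) >= 0 && bytes.Compare(key, end) < 0
}

func main() {
	k := pack([]byte("foobar"), []byte("somekey"))
	// The foobar key is inside its own range but never leaks into foo's
	// range, because "foo\x00" is not a prefix of "foobar\x00".
	fmt.Println(inRange(k, []byte("foobar")), inRange(k, []byte("foo"))) // true false
}
```

Without the 0x00 separator, the same key would sort inside foo's range and every scan of namespace "foo" would return foobar's data.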
The tuple layer alternative:
Official FDB client libraries encode subspace ranges using the Tuple encoding, which handles nested subspaces, escaping, and typed values. Our hand-rolled encoding is simpler but less general; for production use, adopt the Tuple layer.
4. Write Batches: Atomicity Made Explicit
LevelDB's WriteBatch is the mechanism for writing multiple keys atomically. Without a batch, each Put is a separate transaction; if your process crashes between two Puts, the second one is missing.
b := layer.NewBatch()
b.Put([]byte("user:1:name"), []byte("Alice"))
b.Put([]byte("user:1:email"), []byte("alice@example.com"))
b.Put([]byte("user:1:score"), []byte("100"))
db.Write(b)
After Write, either all three keys exist or none do.
Our implementation:
// batch.go
func (d *DB) Write(b *Batch) error {
_, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
for _, op := range b.ops {
if op.clear {
tr.Clear(d.ns.Pack(op.key))
} else {
tr.Set(d.ns.Pack(op.key), op.value)
}
}
return nil, nil
})
return err
}
One Transact call. All ops go into the transaction buffer together. FDB commits them atomically. This is exactly what LevelDB's WriteBatch does internally: it writes all ops to the write-ahead log in one fsync, then applies them to the MemTable.
FDB's size limits: FDB transactions are capped at approximately 10 MB of reads + writes. For most use cases this is not a limit, but if you're writing millions of keys you'll need to split the work into multiple transactions. See option-b-leveldb for the chunking pattern.
5. Iterators: Range Scans Without Cursors
LevelDB's iterator is a cursor over the sorted key space. It supports bidirectional movement and seeking to arbitrary positions.
The streaming vs. materializing decision:
A "streaming" iterator would keep a live FDB transaction open and use fdb.RangeIterator to fetch keys page by page as you call Next(). This is efficient for large ranges but ties the iterator's lifetime to an open transaction.
FDB transactions have a ~5 second timeout. LevelDB iterators are frequently held open much longer (e.g., while a background compaction reads a full SSTable). Forcing a 5-second limit would break drop-in compatibility.
Our solution: materialize the entire range into a slice upfront:
func newIteratorAt(fdbDB fdb.Database, ns Subspace, start, end []byte, readVersion int64) *Iterator {
v, _ := fdbDB.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
return rt.GetRange(ns.RangeWithin(start, end), fdb.RangeOptions{}).GetSliceWithError()
})
it.kvs = v.([]fdb.KeyValue)
...
}
The entire GetSliceWithError call happens inside one transaction. The transaction then closes. The iterator holds the materialized slice, so it can outlive the transaction indefinitely.
Trade-off: We read all matching keys eagerly, even if the caller only needs the first few. For small ranges (as in most LevelDB use cases) this is fine. For ranges spanning millions of keys, a streaming approach would be necessary.
Navigation:
it.First() // idx = 0
it.Next() // idx++
it.Prev() // idx--
it.Last() // idx = len(kvs)-1
it.Seek(target) // binary (or linear) search for key >= target
it.Key() // kvs[idx].Key unpacked from subspace
it.Value() // kvs[idx].Value
it.Valid() // 0 <= idx < len(kvs)
The Seek in our implementation is a linear scan for pedagogical clarity.
A production implementation would use sort.Search (binary search) since
the slice is sorted.
6. Snapshots β MVCC Exposed to the Caller
This is where FDBβs MVCC machinery becomes directly visible.
A LevelDB snapshot captures the database state at a point in time. Reads through the snapshot always see that exact state, regardless of subsequent writes.
snap := db.NewSnapshot()
// ... later, after writes have occurred ...
v1, _ := db.Get(key) // sees current state
v2, _ := snap.Get(key) // sees state at snapshot time
// v1 != v2 if the key was mutated after the snapshot
How we implement this:
// snapshot.go
type Snapshot struct {
    db          fdb.Database
    ns          Subspace // same namespace as the parent DB (used by Get below)
    readVersion int64
}
func (d *DB) NewSnapshot() (*Snapshot, error) {
rv, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
return rt.GetReadVersion().Get()
})
...
    return &Snapshot{db: d.fdb, ns: d.ns, readVersion: rv.(int64)}, nil
}
GetReadVersion() returns the logical timestamp FDB assigned to our
transaction. This is a monotonically increasing integer β FDBβs βversion
clockβ. We store it.
Later, when the snapshot is asked for a key:
func (s *Snapshot) Get(key []byte) ([]byte, error) {
v, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
tr.SetReadVersion(s.readVersion) // β the key line
return tr.Get(s.ns.Pack(key)).Get()
})
...
}
SetReadVersion pins the transaction to read from the exact version we
captured. FDBβs MVCC machinery ensures that the storage servers still have
the old version in their version history β provided it hasnβt been garbage
collected (the ~5 second window).
Why SetReadVersion requires a writable Transaction:
This is a quirk of the FDB Go API: SetReadVersion is only available on
fdb.Transaction (read-write), not on fdb.ReadTransaction (read-only).
So we open a writable transaction but never write anything β effectively using
it as a βpinnable read transactionβ. The transaction is automatically
abandoned when the closure returns.
What this maps to in real databases:
- PostgreSQL: BEGIN; SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; creates a snapshot valid for the transaction's duration
- MySQL InnoDB: START TRANSACTION WITH CONSISTENT SNAPSHOT
- Spanner: a read-only BeginTransaction with an exact-staleness timestamp bound
- CockroachDB: AS OF SYSTEM TIME <timestamp> clause on SELECT
All of these are, at their core, βread from this specific version of the MVCC chain.β
7. The Demo, Step by Step
The demo in demo/main.go exercises every feature and prints the results.
Here is the internal FDB call sequence for the MVCC section:
1. db.Put("color", "red")
β Transact: Set("demo\x00color", "red"), commit version v1
2. snap = db.NewSnapshot()
β ReadTransact: GetReadVersion() β returns v1 (or very close)
β snap.readVersion = v1
3. db.Put("color", "blue")
β Transact: Set("demo\x00color", "blue"), commit version v2
4. snap.Get("color")
β Transact: SetReadVersion(v1), Get("demo\x00color")
β FDB returns "red" (value at v1)
5. db.Get("color")
β ReadTransact: Get("demo\x00color")
β FDB returns "blue" (latest committed value)
This demonstrates that the snapshot truly pins to v1, seeing the pre-mutation value.
8. What the Real LevelDB Does That We Donβt
LevelDB has significant machinery we skip:
LSM Tree internals:
- MemTable (skip list in memory) + WAL (write-ahead log on disk)
- SSTable files (sorted, immutable, bloom-filter indexed)
- Compaction (background merging of SSTables to reclaim space and bound read amplification)
- Bloom filters (avoid disk reads for non-existent keys)
None of this applies to our layer because FDB handles all of it. FDBβs
storage servers use their own B-tree-like storage (a custom tree called the
βFDB B-treeβ that supports MVCC) and their own write-ahead logs. We simply
call tr.Set and FDBβs internals handle the rest.
What we deliberately emulate:
- The API surface (same method names and semantics as goleveldb)
- Namespace isolation (subspace = virtual database)
- Atomic batches
- Forward/backward iteration
- MVCC snapshots
What we donβt emulate:
- Compaction (irrelevant β FDB handles it)
- Bloom filters (FDB has its own read optimization)
- File-level operations (no such thing in FDB)
- Checksums (FDB provides end-to-end data integrity)
9. Real-World Analogue: goleveldb, RocksDB, PebbleDB
goleveldb (github.com/syndtr/goleveldb) is the Go port of LevelDB that
option-b-leveldb uses as a consumer. Its storage.Storage interface is the
pluggable storage backend. Weβll see this in option-bβs guide.
RocksDB (Facebook/Meta, 2013) is the evolution of LevelDB: more configurable, multi-threaded compaction, column families, transactional API. It is the storage engine inside MySQL 8 (MyRocks), CockroachDB, TiKV, and many others. Every concept from our layer applies directly to RocksDBβs API.
PebbleDB (CockroachDB, 2019) is a Go implementation of RocksDBβs key ideas, designed for CockroachDBβs specific workload. CockroachDB switched from RocksDB to Pebble in 2021 for improved performance and simpler operations.
The common thread: all of these expose the same Get/Put/Delete/Batch/Iterator/Snapshot interface. Building this interface over FDB means you understand the contract deeply, because you have to implement it rather than just use it.
10. Exercises β Build on This
These are not hypothetical. Each one adds a real capability:
Exercise 1 β Atomic Compare-And-Swap (CAS)
func (d *DB) CAS(key, expected, next []byte) (bool, error)
Read key, compare to expected, write next β all in one transaction. If the key changed since you read it, FDB retries automatically (conflict detection does this for you β you donβt need to loop).
Exercise 2 β TTL (Time-To-Live) Keys
Store expiry timestamps alongside values. Modify Get to return ErrNotFound
for expired keys, and add a Sweep() method that clears all expired entries
with a ClearRange.
Exercise 3 β Transactions Across Multiple Keys
tx := db.Begin()
tx.Put("account:alice:balance", "900")
tx.Put("account:bob:balance", "100")
tx.Commit()
This is just a Batch.Write today. Extend it to include optimistic reads
(read aliceβs balance, check itβs >= 100 before subtracting) β youβll need a
real Transact closure, not just a batch.
Exercise 4 β Prefix Scan
func (d *DB) Scan(prefix []byte) ([]KV, error)
Use RangeWithin(prefix, nil) and filter client-side (FDB has no server-side filtering). This is the basis for the Record Layer's ScanRecords.
Exercise 5 β Size Estimate
Use FDB's GetEstimatedRangeSizeBytes to estimate how many bytes live in your subspace without scanning it. This is how database engines implement SHOW TABLE STATUS without a full scan.
11. Source Code Deep Dive β Every Line Explained
This section walks through the full source of layer/db.go, layer/encoding.go, layer/iterator.go, and layer/snapshot.go with annotations about the non-obvious decisions.
db.go β The Core Database Type
type DB struct {
fdb fdb.Database
ns Subspace
}
Two fields. fdb is the FDB connection (goroutine-safe, long-lived). ns is the namespace: a byte prefix prepended to every key. Multiple DB instances on the same FDB cluster with different ns values are completely isolated β their key ranges do not overlap.
func Open(fdbDB fdb.Database, namespace []byte) *DB {
return &DB{fdb: fdbDB, ns: NewSubspace(namespace)}
}
Open does not contact FDB. Itβs a pure in-memory initialization. The connection to FDB was already established when fdb.OpenDefault() was called in main.go. Open just associates this DB with a prefix.
Why take fdb.Database rather than a cluster address string? This lets the caller decide how the FDB connection is configured (API version, cluster file path, network options) and share one connection across multiple DB instances. Multiple DB instances share one FDB network thread and one connection pool.
func (d *DB) FDB() fdb.Database { return d.fdb }
func (d *DB) Namespace() []byte { return d.ns.prefix }
Accessors for embedding and testing. FDB() lets a consumer pass the FDB connection to another layer (e.g., a Record Layer built on top of this DB). Namespace() lets tests inspect the key prefix.
encoding.go β The Full Subspace Implementation
func (s Subspace) RangeWithin(start, end []byte) fdb.KeyRange {
var begin, endKey fdb.Key
if start == nil {
begin = s.Range().Begin
} else {
begin = s.Pack(start)
}
if end == nil {
endKey = s.Range().End
} else {
endKey = s.Pack(end)
}
return fdb.KeyRange{Begin: begin, End: endKey}
}
RangeWithin lets callers specify a sub-range within the subspace. If start = nil, the range starts at the beginning of the subspace. If end = nil, the range ends at the end of the subspace. This is used by the iterator to implement LevelDBβs NewIterator(slice *util.Range) β an iterator over a restricted key range.
The subtle end encoding: When end is provided, we use Pack(end) as the upper bound. This is an exclusive upper bound in FDB (same as Pythonβs range() β GetRange(begin, end) returns keys where begin <= key < end). Since Pack(end) = prefix + 0x00 + end, this correctly excludes the end key itself.
iterator.go β Forward/Backward Navigation
func (it *Iterator) compareKeys(a, b []byte) int {
return bytes.Compare(a, b)
}
One line, but critical: key comparison is lexicographic byte order, not string collation, not numeric order. "9" > "10" under this comparison. This is identical to LevelDBβs default comparator (BytewiseComparator). If you need a different sort order, you need a sort-preserving encoding β which is why encodeInt64 (used in option-cβs index keys) exists.
func (it *Iterator) Seek(target []byte) {
for it.idx = 0; it.idx < len(it.kvs); it.idx++ {
if it.compareKeys(it.Key(), target) >= 0 {
return
}
}
}
Linear scan for Seek. Correct but O(n). A production implementation uses binary search:
it.idx = sort.Search(len(it.kvs), func(i int) bool {
return bytes.Compare(it.kvs[i].Key, target) >= 0
})
This is O(log n) β important for large result sets. The linear scan is kept here for readability.
snapshot.go β Pinning MVCC Versions
func (s *Snapshot) NewIterator(start, end []byte) *Iterator {
return newIteratorAt(s.db, s.ns, start, end, s.readVersion)
}
The iterator constructor takes readVersion and passes it to the internal newIteratorAt. Inside newIteratorAt:
func newIteratorAt(fdbDB fdb.Database, ns Subspace, start, end []byte, readVersion int64) *Iterator {
v, _ := fdbDB.Transact(func(tr fdb.Transaction) (interface{}, error) {
if readVersion > 0 {
tr.SetReadVersion(readVersion)
}
return tr.GetRange(ns.RangeWithin(start, end), fdb.RangeOptions{}).GetSliceWithError()
})
...
}
If readVersion > 0, we pin the transaction to that version. The range scan returns results as of that exact version. This enables snapshot-consistent iteration β the iterator sees a stable, non-changing view even while concurrent writers are active.
The Transact retry loop and SetReadVersion: Transact retries on conflict. But SetReadVersion sets a fixed read version β the transactionβs read set is pinned. Can we still get a conflict on a read-only operation that reads from a pinned version? No β conflicts are about write-write and read-write conflicts. A transaction that only reads (even via Transact) cannot be retried due to conflict unless it also writes. Our snapshot transactions only read, so Transact will not retry them for conflict reasons. The only reason for retry would be a transaction_too_old error (if readVersion is too old and the data has been GCβd), which surfaces as an error to the caller.
12. Production Considerations
12.1 FDB Transaction Size Limits
FDB enforces hard limits on transactions:
- 10 MB total mutation size (sum of all Set and Clear calls in one transaction)
- 10 MB total read size (sum of all values read)
- 5 seconds maximum transaction duration (from first operation to commit)
For option-a-leveldb, the most likely hit is the iterator: GetSliceWithError() reads the entire range into memory in one transaction. If a namespace contains 50 MB of data and you create an iterator over it, you'll get a transaction_too_large error.
Production solution: Paginated iteration with cursors:
// Instead of reading everything at once:
var cursor fdb.Key = ns.Range().Begin
const pageSize = 10_000
for {
    v, _ := db.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(fdb.KeyRange{Begin: cursor, End: ns.Range().End},
            fdb.RangeOptions{Limit: pageSize}).GetSliceWithError()
    })
    kvs := v.([]fdb.KeyValue) // ReadTransact returns interface{}; assert the slice type
    if len(kvs) == 0 {
        break
    }
    // process kvs...
    cursor = fdb.Key(append(kvs[len(kvs)-1].Key, 0x00)) // next page starts just after the last key
}
12.2 FDB Go Binding Concurrency Model
The FDB Go binding uses a single-threaded network loop internally (the FDB C library has one network thread). All transactions are multiplexed over this thread. This means:
- Multiple goroutines can each run ReadTransact or Transact calls concurrently; the binding serializes them correctly
- fdb.Database is goroutine-safe
- But the network thread is single-threaded: if you saturate it (thousands of concurrent transactions), you'll see latency increase
For high concurrency, FDB recommends batching multiple operations within one transaction rather than creating many small transactions.
12.3 Key Space Planning
Before deploying, decide on your namespace structure. Once data is in production, renaming a namespace (changing the prefix) requires migrating all data β a potentially days-long background job.
Good practice:
// Use a versioned namespace prefix
db := layer.Open(fdb, []byte("myapp:v1:users"))
// If schema changes require a new encoding:
newDB := layer.Open(fdb, []byte("myapp:v2:users"))
// Migrate in background; dual-write during migration window
12.4 Monitoring and Observability
FDB exposes cluster health via its status API:
fdbcli --exec "status json"
Key metrics to monitor:
- Transactions committed/second β throughput
- Conflicts/second β high values indicate hot keys or poorly structured transactions
- Storage server read/write latency β P99 should be < 10ms
- Data distribution lag β if a shard is being moved, latency spikes
For application-level monitoring, instrument every Transact call with a timer and tag the metric with the operation name.
13. Interview Questions β LevelDB, MVCC, and FDB
Q: What is the difference between ReadTransact and Transact in the FDB Go binding?
ReadTransact opens a read-only transaction: no write buffer, no conflict tracking, no commit phase. Itβs cheaper than Transact because it skips the commit round-trip. Use ReadTransact whenever youβre only reading. Transact opens a read-write transaction: it tracks the read key set for conflict detection and requires a commit round-trip to apply writes. For single-key reads like Get, using Transact instead of ReadTransact works but wastes latency and cluster resources.
Q: If FDBβs MVCC window is ~5 seconds, what happens to a snapshot older than 5 seconds?
Reading from that snapshot returns a transaction_too_old error. FDB garbage-collects old MVCC versions after the configured version history window (default ~5 seconds). Any transaction β including read-only snapshot reads β that tries to read a version older than the GC horizon fails with a retriable error. In our Snapshot implementation, this means that a snapshot held open longer than ~5 seconds will start returning errors on the next Get or NewIterator call. Production code must handle this by recreating the snapshot.
Q: LevelDB supports custom comparators. What would you need to change to support a different sort order in this layer?
The sort order is determined by FDBβs key ordering, which is always lexicographic byte order. To support a different sort order (e.g., βintegers sort numericallyβ), you would change the key encoding: encode integer keys using encodeInt64 (sign-bit-flipped big-endian) so that their byte order matches their numeric order. You cannot change FDBβs comparator; you can only change what bytes you write as keys. This is why sort-preserving encoding is the fundamental concept in layer design.
Q: How does FDBβs Transact retry loop interact with side effects?
Transact retries the closure function if the transaction conflicts. If the closure has side effects outside FDB (e.g., incrementing a counter, logging, sending an HTTP request), those side effects will execute multiple times on retry. The FDB convention is: the Transact closure must be idempotent or side-effect-free. For logging, use a separate post-commit hook. For counters, use FDB atomic operations (tr.Add) which are commutative and do not require retry logic.
Q: How does an iterator over a snapshot differ from an iterator over the current database state?
A snapshot iterator is pinned to the read version captured at NewSnapshot() time. Even if concurrent writers modify or delete keys between when the snapshot was taken and when the iterator is created, the iterator sees the state at snapshot time. A current-state iterator uses the latest committed version, so it sees all mutations up to the moment the GetRange call is sent to the cluster. In our implementation, the difference is one line: tr.SetReadVersion(readVersion) in the snapshot path.
Option A β βSQL Layerβ over FoundationDB
Pattern: relational engine above FDB. Tables → subspaces, rows → KV pairs, ACID inherited from FDB transactions. This is conceptually what Apple's old (now-archived) fdb-sql-layer did.
What we ship (and donβt)
We ship the storage half of a SQL engine: a catalog, row encoding, table scans, primary-key lookup, and atomic inserts/updates/deletes β all in about 250 lines.
We do not ship a SQL parser. The Go API itself is the query language:
db.CreateTable(sqllayer.TableDef{Name:"users", PK:"id", Columns: ...})
db.Insert("users", sqllayer.Row{"id":1, "name":"Alice", "city":"Paris"})
db.SelectWhere("users", func(r sqllayer.Row) bool { return r["city"] == "Paris" })
A real SQL frontend would compile SELECT ... WHERE city='Paris' into
exactly the same backend call.
Key layout
<ns> 0x00 <tableName> -> msgpack(TableDef) [catalog]
<ns> 0x01 <tableName> 0x00 <encodedPK> -> msgpack(Row) [data]
<ns> 0x02 <tableName> -> uint64 (LE counter) [rowid seq]
- Catalog lets a fresh Open() reconstruct the in-memory schema cache.
- Data rows live under a per-table subspace, so GetRange on <ns> 0x01 <tableName> 0x00 is a full table scan in key order.
- Seq is used only when the table has no declared PK (we mint a rowid).
PK values are encoded with the same sign-flipping big-endian trick used in option-c-record-layer: integer rows sort numerically, strings sort lexicographically. This is what makes table scans produce ordered output βfor freeβ.
Transactionality
Insert does the seq read+write and the row write in the same
fdb.Transact, so two concurrent inserts can never collide on the same
rowid: FDB will retry one of them. DropTable clears catalog, all rows,
and the seq counter atomically β no orphan rows can be observed after a
drop.
Why this is not a real SQLite layer
A real SQLite vtab would let you write SELECT name FROM users WHERE city='Paris'
in standard SQL and have SQLiteβs planner call back into FDB. Building that
requires:
- Implementing the sqlite3_module C ABI (via CGO with mattn/go-sqlite3, or using modernc.org/sqlite's yet-undocumented vtab hooks).
- Mapping SQLite's xBestIndex / xFilter / xColumn callbacks onto FDB range scans.
- Round-tripping SQLite's rowids to our PK encoding.
Thatβs an order of magnitude more code (~1.5k lines) and adds little to your understanding of FDB itself; the interesting parts β how does the relational model fit on top of an ordered KV store? β are all here.
Running
cd option-a-sqlite
go mod tidy
go run ./demo -cluster ../fdb.cluster
Expected output:
SELECT * FROM users;
map[city:Paris id:1 name:Alice]
map[city:Tokyo id:2 name:Bob]
map[city:Paris id:3 name:Carol]
SELECT * FROM users WHERE city='Paris';
map[city:Paris id:1 name:Alice]
map[city:Paris id:3 name:Carol]
After UPDATE users SET city='Tokyo' WHERE id=1;
map[city:Tokyo id:1 name:Alice]
map[city:Tokyo id:2 name:Bob]
map[city:Paris id:3 name:Carol]
After DELETE FROM users WHERE id=3;
map[city:Tokyo id:1 name:Alice]
map[city:Tokyo id:2 name:Bob]
Hitchhikerβs Guide β Option A: SQL Engine over FoundationDB
The question this answers: βWhat does a relational database actually store? If you strip away the SQL parser and the query planner, what is left?β
The deeper question: βHow do tables, rows, primary keys, and catalogs map onto an ordered keyβvalue store?β
Table of Contents
- What a Relational Database Actually Stores
- The Catalog β A Database About Databases
- Row Encoding β Turning a Map into Bytes
- Primary Keys and Sort Order
- Full Table Scan β The Honest Query
- INSERT as a Rowid Allocation + Write
- UPDATE = Overwrite (Why it Works)
- DELETE = Clear (Why it Works)
- Key Layout In Detail
- What We Didnβt Build (and Why Those Parts Are Hard)
- Real-World Analogues: SQLite, MySQL, PostgreSQL
- Exercises
1. What a Relational Database Actually Stores
A relational database is, at its core, a system for:
- Storing rows (collections of typed values) under a primary key.
- Keeping a catalog (metadata about what tables exist and their schemas).
- Maintaining indexes (secondary data structures for fast lookup by non-primary-key fields).
- Executing queries (finding rows matching a predicate).
Strip away SQL parsing, the query planner, authentication, and network protocols, and you have a structured keyβvalue store with a catalog.
Our layer implements steps 1 and 2. Step 3 is in option-c-record-layer.
Step 4 is a SelectWhere(pred) that does a full scan β honest but slow,
which makes the need for indexes obvious.
2. The Catalog β A Database About Databases
Every real database has a catalog: a table of tables. PostgreSQL stores
it in pg_catalog.pg_class. MySQL uses information_schema.TABLES. SQLite
writes the catalog to the first page of the database file as sqlite_master.
Our catalog lives in FDB under a dedicated subspace:
<ns> 0x00 <tableName> β msgpack(TableDef)
TableDef contains the table name, column definitions (name + type), and the
declared primary-key column name (if any). When Open() is called, we scan
this subspace and reconstruct the in-memory schemas map:
func (d *DB) loadCatalog() error {
// range-scan the catalog subspace
kvs, _ := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
return rt.GetRange(d.catalogRange(), fdb.RangeOptions{}).GetSliceWithError()
})
for _, kv := range kvs {
var def TableDef
msgpack.Unmarshal(kv.Value, &def)
d.schemas[def.Name] = def
}
}
This means the catalog is self-describing β a fresh process connecting to
an existing FDB namespace automatically discovers all tables. You donβt need
to pass schemas at startup. This mirrors how SQLite opens a database file and
reads its catalog from sqlite_master before accepting any queries.
CreateTable is atomic: the catalog write and any initial setup happen
inside one FDB transaction. Two concurrent CreateTable("users") calls will
either both succeed (if they have identical schemas) or one will fail with a
conflict. There is no possibility of a half-created table.
3. Row Encoding β Turning a Map into Bytes
Our Row type is map[string]any. We serialize it with msgpack, which
produces a compact binary representation:
{"name": "Alice", "city": "Paris", "id": 1}
β 83 a4 6e 61 6d 65 a5 41 6c 69 63 65 a4 63 69 74 79 a5 50 61 72 69 73 a2 69 64 01
(msgpack map: 3 entries, name="Alice", city="Paris", id=1)
The row bytes become the value in FDB. The key is derived from the primary key value (see below).
Why msgpack, not JSON or Protobuf?
- JSON is human-readable but wastes space (quoted keys, text numbers).
- Protobuf is compact but requires a schema file and code generation.
- msgpack is compact, self-describing (no external schema), and has first-class Go support.
For a teaching implementation, msgpack hits the right trade-off. For production, Protobuf or Capβn Proto would be better choices, since they support schema evolution (adding fields without breaking old readers) and are more efficient for fixed schemas.
4. Primary Keys and Sort Order
A primary key must be unique and determines the storage order of rows.
In FDB, βstorage orderβ = βkey byte orderβ. So we need to encode primary key values into bytes in a way that preserves the intended sort order.
String PKs: store as raw UTF-8 bytes. UTF-8βs byte ordering preserves lexicographic order for ASCII characters. Enough for a demo; a production system would use Unicode collation.
Integer PKs: this is where it gets interesting.
Naive approach: encode as decimal string. Problem: "9" > "30" in byte
comparison. Rows would sort as: 1, 10, 100, 2, 20, 3β¦ β wrong.
Better: big-endian 64-bit integer. Problem: negative numbers have the high bit
set, so -1 encodes as 0xFF... which sorts after all positive numbers.
Our solution (and the industry standard):
func encodeInt64(x int64) []byte {
b := make([]byte, 8)
// Flip the sign bit. This turns the two's-complement representation into
// an unsigned representation that sorts correctly:
// 0 β 0x8000_0000_0000_0000 (sorts between negatives and positives)
// 1 β 0x8000_0000_0000_0001
// -1 β 0x7FFF_FFFF_FFFF_FFFF (sorts before 0)
// -2 β 0x7FFF_FFFF_FFFF_FFFE (sorts before -1)
binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
return b
}
After this encoding, integer byte strings sort in the same order as their
numeric values. A GetRange on the rows subspace returns rows in ascending
PK order, for free, by FDBβs native byte-order scan.
This is the same trick used by:
- Apache HBase (for row key ordering)
- Google Cloud Bigtable
- FoundationDBβs official Tuple layer (for negative integers)
- Apache Cassandraβs blob serialization
5. Full Table Scan β The Honest Query
SelectAll and SelectWhere both scan the entire rows subspace:
func (d *DB) SelectWhere(table string, pred func(Row) bool) ([]Row, error) {
v, _ := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
return rt.GetRange(d.tableRange(table), fdb.RangeOptions{}).GetSliceWithError()
})
for _, kv := range v.([]fdb.KeyValue) {
var row Row
msgpack.Unmarshal(kv.Value, &row)
if pred == nil || pred(row) {
out = append(out, row)
}
}
}
The predicate is applied in Go, after fetching all rows from FDB. This is
a full table scan β the same operation as SELECT * FROM users WHERE city='Paris' without any index.
Why not push the predicate to FDB?
FDB has no server-side filtering. Its API is: βgive me this key range.β All filtering must happen on the client. This is a deliberate design choice by FDB: the cluster is a dumb-fast KV store; smarts live in the layer.
Real SQL databases push predicates βdownβ to the storage engine to avoid fetching unnecessary rows over the network. In our setup, βthe networkβ is the connection to FDBβs storage servers β for large tables, a full scan incurs real latency. This is the motivation for secondary indexes (option-c).
What a real SQL engine does instead:
A query planner looks at the predicate city='Paris' and checks whether there
is an index on city. If yes, it rewrites the scan as:
scan index(city = 'Paris') β [pk1, pk2, pk3]
fetch rows by pk: Get(pk1), Get(pk2), Get(pk3)
This is exactly what LookupByIndex does in option-c.
6. INSERT as Rowid Allocation + Write
When a table has no declared PK, we allocate a monotonic rowid:
_, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
curBytes, _ := tr.Get(d.seqKey(table)).Get()
var cur uint64
if len(curBytes) == 8 {
cur = binary.LittleEndian.Uint64(curBytes)
}
next := cur + 1
// Write the new counter value
    buf := make([]byte, 8)
    binary.LittleEndian.PutUint64(buf, next)
tr.Set(d.seqKey(table), buf)
// Write the row
tr.Set(d.rowKey(table, encodeInt64(int64(next))), encoded)
return nil, nil
})
The counter read + increment + row write happen in one transaction. Two concurrent inserts cannot get the same rowid β FDBβs conflict detection will cause one to retry with a fresh counter read.
This mirrors how SQLite allocates its rowid (one more than the current
maximum key in the table's B-tree) and how PostgreSQL implements sequences.
The key insight: the counter is itself durable state, and updating it is a
transaction.
7. UPDATE = Overwrite (Why it Works)
LevelDB and FDB share a property: Set(key, newValue) on an existing key
simply replaces it. There is no βupdate in placeβ at the storage level.
Our βUPDATEβ is just another Insert call with the same PK:
db.Insert("users", Row{"id": 1, "name": "Alice", "city": "Tokyo"})
// This calls tr.Set(rowKey(table, encode(1)), msgpack(new_row))
// The old row bytes under the same key are replaced.
FDB does not copy the old value to a TOAST or undo segment β it just writes the new value. The old version is kept in FDBβs MVCC history for ~5 seconds (for in-flight read transactions) and then garbage-collected.
What changes when you have indexes:
Once you have secondary indexes (option-c), an update is no longer just a single key write. You must:
- Read the old row (to find out what old index entries to remove)
- Clear the old index entries
- Write the new row
- Write the new index entries
All in one transaction. This is why option-cβs PutRecord is more complex
than our Insert.
8. DELETE = Clear (Why it Works)
func (d *DB) Delete(table string, pk any) error {
encodedPK, _ := encodePK(pk)
_, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
tr.Clear(d.rowKey(table, encodedPK))
return nil, nil
})
return err
}
Clear adds a tombstone to the transaction buffer. After commit, the key
disappears from FDB as if it was never written. Future reads return nil.
No cascades: unlike SQL with ON DELETE CASCADE, our engine has no
foreign key enforcement. This is an intentional simplification. Real SQL
engines resolve cascades by finding related rows (via indexes on the FK column)
and deleting them in the same transaction.
What about DropTable?
func (d *DB) DropTable(name string) error {
_, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
tr.Clear(d.catalogKey(name)) // remove from catalog
tr.ClearRange(d.tableRange(name)) // delete all rows
tr.Clear(d.seqKey(name)) // reset rowid counter
return nil, nil
    })
    return err
}
One transaction that deletes the catalog entry, all data rows, and the sequence counter. Atomically. A reader that has an open transaction might still see the table (MVCC), but any new transaction started after commit sees the table as gone.
9. Key Layout In Detail
Letβs trace exactly what happens when you call:
db.CreateTable(TableDef{Name: "users", PK: "id", Columns: [...]})
db.Insert("users", Row{"id": 1, "name": "Alice", "city": "Paris"})
db.Insert("users", Row{"id": 2, "name": "Bob", "city": "Tokyo"})
With namespace "demo-sql":
Catalog key:
"demo-sql" + 0x00 + "users"
= 64 65 6d 6f 2d 73 71 6c 00 75 73 65 72 73
β value: msgpack({Name:"users", PK:"id", Columns:[...]})
Row key for id=1:
"demo-sql" + 0x01 + "users" + 0x00 + encodeInt64(1)
= 64 65 6d 6f 2d 73 71 6c 01 75 73 65 72 73 00 80 00 00 00 00 00 00 01
β encoded 1 (sign bit flipped)
β value: msgpack({id:1, name:"Alice", city:"Paris"})
Row key for id=2:
"demo-sql" + 0x01 + "users" + 0x00 + 80 00 00 00 00 00 00 02
β value: msgpack({id:2, name:"Bob", city:"Tokyo"})
Table scan range:
begin: "demo-sql" + 0x01 + "users" + 0x00
end: "demo-sql" + 0x01 + "users" + 0x01
β returns id=1 row, then id=2 row (in ascending id order)
The 0x00 separator between tag, table name, and PK ensures:
- Table βuserβ doesnβt collide with table βusersβ
- Row key doesnβt collide with catalog key (different tag byte: 0x01 vs 0x00)
- Rows for table βusersβ donβt collide with rows for table βordersβ (different table name segment)
10. What We Didnβt Build (and Why Those Parts Are Hard)
Secondary indexes:
We have SelectWhere but it does a full scan. Adding a secondary index
(e.g., βfind all users where city=βParisββ) requires maintaining an extra
key for every row that maps (city value, pk) β "". The hard part is
keeping the index consistent when rows change β see option-c.
SQL parsing:
A SQL parser turns SELECT name FROM users WHERE city='Paris' into an AST.
This is ~1000 lines of code for a minimal SQL grammar. We skip it because
it adds no FDB-specific insight.
Query planner: Given multiple indexes, which one should we use? This is a graph problem (estimating cost of each query plan) that modern databases spend enormous effort on. The basic version is simple (pick the most selective index), but getting it right for all cases is a career.
JOIN: Joining two tables in SQL is conceptually a nested loop (for each row in table A, find matching rows in table B). The hard part is making this fast when both tables are large. Most JOIN algorithms (hash join, sort-merge join) still reduce to: βscan one table, look up matching rows from the other.β
MVCC for uncommitted readers:
We provide MVCC through FDBβs read versions, but we donβt expose
BEGIN TRANSACTION / COMMIT / ROLLBACK at the SQL level. Adding this
requires keeping an open FDB transaction for the duration of the user
transaction β which is possible but requires careful lifecycle management.
11. Real-World Analogues
SQLite
SQLite stores its catalog in sqlite_master, a special table on page 1:
CREATE TABLE sqlite_master (
type TEXT, -- "table", "index", "view", "trigger"
name TEXT,
tbl_name TEXT,
rootpage INTEGER, -- page number of the B-tree root for this table
sql TEXT -- the original CREATE statement
);
Every tableβs rows are stored in a B-tree rooted at rootpage. The B-tree
keys are the rowid values (64-bit integers, always). The row data (all
columns) is packed into the B-tree leaf value using SQLiteβs own compact
encoding.
Our rowKey(table, encodedPK) is the FDB equivalent of the B-tree rowid.
Our msgpack row serialization is the FDB equivalent of SQLiteβs row record
format.
MySQL InnoDB
InnoDB uses a clustered index β the primary key IS the B-tree. All columns are stored in the PK leaf node. Secondary indexes store the PK value (not the physical row location) as their βpointerβ:
PRIMARY KEY index: pk β all_columns
SECONDARY INDEX on city: city β pk
To look up a row by city:
- Scan the secondary index for `city = 'Paris'` → get pk values
- Look up each pk in the primary index → get the full row
This two-step lookup is called a βkey lookupβ in MySQLβs EXPLAIN output. Our option-c-record-layer does exactly this.
PostgreSQL
PostgreSQL uses βheap filesβ β rows are stored in pages in insertion order,
not sorted by PK. The PK is just another index (a B-tree) that maps
pk β ctid (page number + slot within page). Secondary indexes also map
field_value β ctid.
This is subtly different from InnoDB: in PostgreSQL, all indexes point to physical page locations, not to the PK. This makes index maintenance easier (an update to a non-indexed column doesnβt touch any index), but makes VACUUM (garbage collection of old row versions) more complex.
CockroachDB / TiKV / YugabyteDB
These βNewSQLβ databases are essentially what weβre building here: a SQL query engine on top of a distributed, ordered, transactional KV store. Their storage architecture is:
SQL layer (parsing, planning, execution)
β
Encoding layer (rows β keys, indexes β keys)
β
Distributed KV layer (RocksDB + Raft replication per shard)
CockroachDBβs row encoding is:
/<table_id>/<index_id>/<col1_value>/<col2_value>/... β rest_of_columns
This is the same pattern as our <ns>/<tag>/<table_name>/<pk> β just with
more type information and a smarter Tuple encoding.
12. Exercises
Exercise 1 β Add a Secondary Index
Modify Insert to also write index entries:
ns + 0x02 + tableName + 0x00 + columnName + 0x00 + encodedValue + 0x00 + pk β ""
Modify Delete to also clear those entries (requires reading the row first).
Add SelectByIndex(table, column, value) ([]Row, error).
Exercise 2 β Foreign Keys
Add a References field to Column:
type Column struct {
Name string
Type ColumnType
References *ForeignKey // nil if no FK
}
In Insert, check that the referenced PK exists (a Get inside the same
transaction). In Delete, check that no child rows reference this PK (a
GetRange on the FK index).
Exercise 3 β Transactions for Callers
Add a Begin() *Tx method that returns a transaction object. Tx should
buffer operations and commit on tx.Commit() β all inside one FDB
db.Transact() call.
Exercise 4 β DISTINCT and COUNT
Implement SelectDistinct(table, column) and Count(table) using a single
FDB GetRange and in-process aggregation. This is your first aggregation
operator β the same one query engines implement with hash tables or sort passes.
Exercise 5 β Schema Evolution
Add a version number to TableDef. When Open() loads a table with version
< current, run a migration function that adds default values for new columns.
13. Source Code Deep Dive β sqllayer/db.go
The entire engine is in sqllayer/db.go. Letβs trace the critical paths.
The DB Type
type DB struct {
fdb fdb.Database
ns []byte
schemas map[string]TableDef
mu sync.RWMutex
}
schemas is an in-memory cache of the catalog. It is loaded at Open() time from FDB and updated by CreateTable/DropTable. The mutex protects concurrent access to schemas.
Design question: Why cache schemas in memory at all? Because every query needs the schema to know column types (for encoding/decoding) and the PK column name. Fetching the schema from FDB on every query would add a round-trip. The trade-off: the in-memory cache can be stale if another process creates/drops a table. Production systems solve this with schema versioning: every DDL operation increments a global version; processes re-sync when they detect a version mismatch.
Key Helpers
func (d *DB) catalogKey(table string) fdb.Key {
    k := append([]byte{}, d.ns...) // copy the namespace prefix
    k = append(k, 0x00)            // catalog tag byte
    k = append(k, table...)
    return fdb.Key(k)
}
func (d *DB) rowKey(table string, encodedPK []byte) fdb.Key {
prefix := append(append([]byte{}, d.ns...), 0x01)
prefix = append(prefix, []byte(table)...)
prefix = append(prefix, 0x00)
return fdb.Key(append(prefix, encodedPK...))
}
Tag bytes distinguish subspaces:
- `0x00` → catalog entries
- `0x01` → row data
- `0x02` → sequence (rowid counter)
This tag-byte approach is simpler than the full Tuple encoding but achieves the same goal: prevent catalog keys from colliding with row keys even if a table is named βusersβ and another subspace prefix also uses βusersβ.
The Sequence Counter Pattern
func (d *DB) nextRowid(tr fdb.Transaction, table string) (int64, error) {
key := d.seqKey(table)
cur, _ := tr.Get(key).Get()
var n int64 = 1
if len(cur) == 8 {
n = int64(binary.LittleEndian.Uint64(cur)) + 1
}
b := make([]byte, 8)
binary.LittleEndian.PutUint64(b, uint64(n))
tr.Set(key, b)
return n, nil
}
This is a read-modify-write within one transaction. The caller passes tr (an active transaction). The Get reads the current counter value. The Set increments it. Both are in the same transaction as the row write.
If two concurrent inserts both read the same counter value, theyβll both try to write n+1 β but FDBβs conflict detection will catch this: both transactions read the same key, and both tried to write it. One will conflict and retry, which re-reads the counter and gets a fresh value. The retry is invisible to the caller.
Why not use FDBβs atomic add?
tr.Add(key, encodeInt64(1)) // atomic increment, no conflict
tr.Add is a commutative atomic operation β it doesnβt cause read-write conflicts. But it returns no value β you canβt know what rowid was assigned. If you need to use the rowid in the same transaction (to build the row key), you need the read-modify-write pattern. If the rowid is just an opaque row identifier that callers donβt use, tr.Add plus a versionstamp (to create a unique, ordered key without knowing the exact value) is a better approach.
The encodePK Function
func encodePK(v any) ([]byte, error) {
switch x := v.(type) {
case int:
return encodeInt64(int64(x)), nil
case int64:
return encodeInt64(x), nil
case string:
return []byte(x), nil
case []byte:
return x, nil
default:
return nil, fmt.Errorf("unsupported PK type %T", v)
}
}
String PKs are stored as raw UTF-8 bytes. This sorts correctly for ASCII but not for Unicode (e.g., βΓ©β sorts after βzβ in byte order but before βzβ in French locale). For a production system, use the ICU library for locale-aware collation, or restrict PKs to ASCII.
Integer PKs use encodeInt64: big-endian with sign-bit flipped. Read Chapter 3 of this guide for the derivation.
What about composite PKs? (user_id, order_id) as a PK means the row key is encode(user_id) + 0x00 + encode(order_id), with 0x00 separating the components. (FDB's real Tuple encoding uses per-element type tags rather than a bare separator, which also handles values that themselves contain 0x00; for fixed-width integer components the simple separator is enough.) This naturally makes rows for the same user adjacent in key order: a range scan GetRange(encode(user_id) + 0x00, encode(user_id) + 0x01) returns all orders for that user. Our implementation only supports single-column PKs, but the extension is straightforward.
14. Production Architecture β What a Real SQL-on-FDB Looks Like
The five components of a production SQL-on-FDB system:
Component 1: SQL Parser
Takes SQL text, produces an AST. For SELECT name FROM users WHERE city='Paris':
SelectStatement {
Columns: ["name"],
Table: "users",
Where: EqualExpr{Column: "city", Value: "Paris"},
}
Libraries: pingcap/tidb/parser (Go, production-grade), xwb1989/sqlparser (Go, simpler), or write your own using gocc or pigeon.
Component 2: Query Planner
Takes the AST, looks at available indexes, estimates row counts, and picks the cheapest plan:
Without index on city:
plan: TableScan("users") β Filter(city='Paris')
cost: O(n_rows) reads
With index on city:
plan: IndexScan(city='Paris') β PKLookup("users", [pk1, pk2, ...])
cost: O(index_entries) + O(result_rows) reads
The planner is the hardest part of a query engine. Modern planners use statistics (histograms of column value distributions) stored in a catalog table to estimate n_rows for each plan.
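The "basic version" (pick the most selective index) fits in a few lines. Everything below is hypothetical for illustration: the `Stats` type, the row-count numbers, and the plan names are not from the repo.

```go
package main

import "fmt"

// Stats holds the planner's (hypothetical) estimate of how many rows
// each access path would have to read.
type Stats struct {
	TableRows    int64            // full-scan cost
	IndexEntries map[string]int64 // estimated matching entries per usable index
}

// choosePlan picks the cheapest access path: the most selective usable
// index, falling back to a full table scan if no index beats it.
func choosePlan(s Stats, usable []string) string {
	best, bestCost := "TableScan", s.TableRows
	for _, idx := range usable {
		if c, ok := s.IndexEntries[idx]; ok && c < bestCost {
			best, bestCost = "IndexScan("+idx+")", c
		}
	}
	return best
}

func main() {
	s := Stats{
		TableRows:    1_000_000,
		IndexEntries: map[string]int64{"city": 5_000, "signup_date": 300_000},
	}
	fmt.Println(choosePlan(s, []string{"city", "signup_date"}))
	// IndexScan(city)
}
```

Real planners replace those hard-coded counts with histograms maintained in a statistics catalog, which is where the difficulty lives.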
Component 3: Encoding Layer (what we built)
Rows β keys in FDB. Indexes β keys in FDB. All with sort-preserving encodings.
Component 4: FDB Transaction Management
SQL has BEGIN / COMMIT / ROLLBACK. FDB has Transact. Bridging these requires holding an FDB Transaction object open for the duration of the SQL transaction. FDB's 5-second transaction limit means long-running SQL transactions must be broken up or use FDB's GRV (get-read-version) caching.
Component 5: Network Protocol
Postgres wire protocol (libpq-compatible), MySQL wire protocol, or a custom protocol. This is what lets standard clients (psql, JDBC, SQLAlchemy) connect. CockroachDB implements the Postgres wire protocol on top of their FDB-like storage.
15. Interview Questions β Relational Databases and FDB
Q: What is a clustered index, and how does option-a-sqlite implement it?
A clustered index is an index where the data rows are stored at the index leaf nodes, sorted by the index key. In option-a-sqlite, the FDB key for each row IS the primary key, and the row data is the value at that key. Range scanning by primary key returns rows in PK order and fetches the data in the same operation. This is a clustered index: the data is organized by PK. InnoDB uses the same design (the primary key B-tree is the table).
Q: What happens when two concurrent transactions try to insert a row with the same primary key?
Both transactions read the same row key (which doesnβt exist yet β they get nil). Both write a new value to that key. FDBβs conflict detection: one transaction commits; the other conflicts (it wrote to a key that was also written by the committed transaction). The conflicting transaction retries from the beginning. On retry, it reads the row key again β this time it exists. Depending on application semantics, the retry can: (a) return a duplicate key error, (b) return the existing row (SELECT or INSERT/IGNORE), or (c) replace it (INSERT OR REPLACE).
Q: Why does PostgreSQL store rows in heap order rather than clustered by PK?
PostgreSQLβs original design assumed that MVCC updates would frequently need to move rows (updating a row creates a new row version at a new physical location; the old version stays until VACUUM). If rows were clustered by PK, a moved row would need to update all indexes to point to the new physical location. By using heap files with stable physical locations (ctid), updates only create a new heap entry and update the visibility chain β indexes donβt need to change. The trade-off: PK range scans in PostgreSQL require index traversal + random I/O into the heap, while InnoDBβs clustered index gives sequential I/O for PK range scans.
Q: How does your row encoding handle schema migrations β adding a new column to a table with existing rows?
With msgpack, adding a new column to the schema is safe for new rows (they include the new column). Old rows, when read back, will simply not have the column in the map. Application code should handle missing fields with defaults. This is βforward compatibilityβ: old data is compatible with new code. The hard case is βbackward compatibilityβ: if you need all rows to have the new column (for a NOT NULL constraint), you must run a background migration job that reads every old row, adds the default value, and writes it back β one paginated batch per transaction.
This is how every production database handles ALTER TABLE ADD COLUMN.
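A minimal sketch of the forward-compatibility path described above: an old row decoded into a map simply lacks the new column, and application code fills in defaults. The column names and default values here are hypothetical.

```go
package main

import "fmt"

// Row models a decoded msgpack row: a map from column name to value.
type Row map[string]any

// applyDefaults fills in columns that old rows predate ("old data,
// new code"). Existing values are never overwritten.
func applyDefaults(r Row, defaults Row) Row {
	for col, def := range defaults {
		if _, ok := r[col]; !ok {
			r[col] = def
		}
	}
	return r
}

func main() {
	oldRow := Row{"id": int64(1), "name": "Alice"} // written before "city" existed
	r := applyDefaults(oldRow, Row{"city": ""})
	fmt.Println(r["city"] == "") // true
}
```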
Option B β LevelDB on top of FoundationDB
Pattern: existing storage engine, FDB as the disk. We give the unmodified `goleveldb` library an FDB-backed implementation of its `storage.Storage` interface. LevelDB still does its LSM thing (memtables, SSTables, compaction, MANIFEST), but every byte ends up in FDB key ranges instead of on local disk.
This is the mirror image of option-a: the storage engine sits below FDB rather than above it.
How LevelDB sees the world
goleveldb accesses persistence exclusively through a small interface:
type Storage interface {
Lock() (Locker, error)
Log(str string)
SetMeta(FileDesc) error
GetMeta() (FileDesc, error)
List(FileType) ([]FileDesc, error)
Open(FileDesc) (Reader, error)
Create(FileDesc) (Writer, error)
Remove(FileDesc) error
Rename(old, new FileDesc) error
Close() error
}
Each FileDesc is {Type, Num} β e.g. {TypeTable, 42} for SST #42 or
{TypeManifest, 7} for the 7th manifest. Filenames are an implementation
detail; LevelDB never looks at strings.
Our fdbstorage package implements that interface against FDB. The whole
file is ~250 lines.
Key layout
<ns> 0x01 <ftype:1B> <num:int64 BE> -> uint64 BE file size
<ns> 0x02 <ftype:1B> <num:int64 BE> <chunk:uint32 BE> -> 64 KiB chunk
<ns> 0x03 -> current MANIFEST {ftype,num}
<ns> 0x04 -> lock marker
- Files are split into 64 KiB chunks so we stay well under FDB's 100,000-byte value limit and the 10 MB per-transaction limit.
- `Create` returns a `Writer` that buffers in memory and flushes on `Sync`/`Close`. We split the flush across multiple transactions (100 chunks each, ≈6.25 MiB) to safely handle files larger than 10 MB.
- `Rename` is implemented as copy-then-clear inside one transaction. LevelDB only renames small files (temp → real on flush completion), so the inefficiency doesn't matter.
- `SetMeta` writes the manifest pointer atomically. Because FDB transactions are serializable, two concurrent flushes can't observe a half-rotated manifest.
Why this is interesting
You get a real LevelDB instance β with bloom filters, compaction, snapshots, the works β whose durability story is βwhatever FDBβs durability story is.β That means:
- Geo-replication and read scaling come for free from the FDB cluster.
- Backups are FDB backups.
- The local node has no on-disk state at all; it can crash and restart against a different FDB coordinator without losing anything.
The cost is latency: every SST read is at least one FDB round-trip, every flush is many. This isnβt a production architecture; itβs a teaching artifact that proves how cleanly the layers separate.
Running
cd option-b-leveldb
go mod tidy
go run ./demo -cluster ../fdb.cluster
Expected output (the second session re-opens and reads the persisted data):
First session: wrote 3 keys, then closed.
Reopening LevelDB on the same FDB namespace...
apple -> red
banana -> yellow
cherry -> red
Iterating the whole DB:
apple -> red
banana -> yellow
cherry -> red
What this implementation skips
- `Locker` isn't multi-process safe across long-lived processes: if a holder crashes, the lock key stays set. A production version would attach the lock to a client UUID and TTL it via FDB watches.
- `Reader` loads the whole file into memory. LevelDB SSTs are bounded (default 2 MB), so this is fine for a demo but not for huge tables.
- No caching layer. Every `Open` is a fresh FDB scan. A real impl would cache hot SSTs.
Read fdbstorage/storage.go β the whole thing is one file deliberately, so you can follow the data flow end to end.
Hitchhikerβs Guide β Option B: LevelDB on FDB Storage
The question this answers: βCan I run a real LSM-tree storage engine β the actual LevelDB binary with all its compaction logic β with its files stored in FoundationDB instead of a local disk?β
The deeper question: βWhat is the
storage.Storageinterface, why does it exist, and what does it tell us about how databases handle file I/O?β
Table of Contents
- The Storage Abstraction: Why LevelDB Has a Plugin Point
- What LevelDB Actually Writes to Disk
- The storage.Storage Interface β Dissected
- How We Map LevelDB Files to FDB Keys
- Chunking: Overcoming the 100 KiB Value Limit
- Atomic Rename β Durabilityβs Secret Weapon
- The Writer: Batching Chunks into Transactions
- Why the WAL is Redundant With FDB
- The Blob Layer Pattern
- Real-World Analogues: RocksDB on Cloud Storage
- Exercises
1. The Storage Abstraction: Why LevelDB Has a Plugin Point
LevelDBβs storage.Storage interface exists because the original LevelDB
authors (Jeff Dean and Sanjay Ghemawat) designed it for portability. Not
every environment has a POSIX filesystem. Google has internal systems where
storage might be Bigtable, Colossus, or a custom log-structured store.
The interface says: "if you can implement these ten methods, LevelDB will run on your storage." The application code (goleveldb) doesn't know whether it's writing to ext4, NTFS, GCS, or FDB; it just calls the interface.
type Storage interface {
    Lock() (Locker, error)
    Log(str string)
    SetMeta(fd FileDesc) error
    GetMeta() (FileDesc, error)
    List(ft FileType) ([]FileDesc, error)
    Open(fd FileDesc) (Reader, error)
    Create(fd FileDesc) (Writer, error)
    Remove(fd FileDesc) error
    Rename(oldfd, newfd FileDesc) error
    Close() error
}
Our fdbstorage.Storage implements all of these, storing files as FDB
key-value pairs. LevelDB itself (in the syndtr/goleveldb package) calls
these methods. It has no idea the βfilesβ are actually chunks in a distributed
database.
2. What LevelDB Actually Writes to Disk
To understand what we need to store, letβs look at what LevelDB writes:
File types:
TypeJournal (.log) β Write-Ahead Log: records every write before the
MemTable is flushed. Used to recover unflushed writes
after a crash.
TypeManifest (.MANIFEST) β Lists which SSTables are "live" (not yet
garbage-collected). Updated at each compaction.
TypeTable (.ldb / .sst) β Sorted String Tables. Immutable, sorted KV
data files produced by compaction.
TypeCurrent (CURRENT) β A single file containing the name of the latest
MANIFEST file.
TypeTemp (.tmp) β Temporary files used during compaction.
TypeLock (LOCK) β A file held open to prevent two processes from
opening the same database simultaneously.
File descriptor:
type FileDesc struct {
Type FileType // TypeJournal, TypeManifest, etc.
Num int64 // unique file number (monotonically increasing)
}
A LevelDB database directory looks like:
000003.log β journal (WAL)
000004.ldb β SSTable level 0
000005.ldb β SSTable level 0
MANIFEST-000002 β current manifest
CURRENT β "MANIFEST-000002\n"
LOCK β lockfile
When compaction happens:
- LevelDB picks some SSTables, merges and sorts them into a new SSTable.
- It writes the new SSTable as a
.tmpfile (viaCreate(TypeTemp, ...)) - It renames the
.tmpto the final.ldbname (viaRename) - It updates the MANIFEST to list the new SSTable and de-list the old ones.
- It removes the old SSTables (via
Remove).
This is the temp-then-rename durability pattern: create a new file
atomically, then rename it into place. POSIX rename is atomic β the old
name or the new name is visible, never a partial file. Our FDB implementation
must replicate this property.
3. The storage.Storage Interface β Dissected
Letβs look at each method and what it does:
Lock() (util.Releaser, error)
Prevents two processes from opening the same database simultaneously. We
implement this by writing a βlockβ key to FDB. The Releaser clears it.
Create(fd FileDesc) (Writer, error) and Open(fd FileDesc) (Reader, error)
Create starts a new file (for writing). Open opens an existing file (for
reading). In FDB terms: Create returns a writer that buffers bytes; Open
reads all chunks for the file into memory and returns a bytes.Reader.
Remove(fd FileDesc) error
Deletes a file. In FDB: ClearRange over all chunk keys for this file.
Rename(oldfd, newfd FileDesc) error
Renames a file atomically. In FDB: copy all chunks from oldfd keys to
newfd keys, then clear all oldfd keys β in one transaction. This is the
atomic rename.
GetMeta() (FileDesc, error) and SetMeta(fd FileDesc) error
Get/set the βcurrentβ file pointer β which MANIFEST is current. In FDB:
a single key (ns + tagManifest + 0x00) stores the current FileDesc.
This replaces the CURRENT file in LevelDBβs original design.
List() ([]FileDesc, error)
List all files. We implement this as a range scan over the meta key prefix.
We store a meta key for each file alongside its data.
4. How We Map LevelDB Files to FDB Keys
Each LevelDB file is identified by (FileType, FileNum). We encode this as:
Meta key (file existence + type):
ns + tagFileMeta(0x01) + fileType(1 byte) + fileNum(8 bytes BE)
β msgpack({Type: ft, Num: n, size: totalBytes})
Data chunks:
ns + tagFileData(0x02) + fileType(1 byte) + fileNum(8 bytes BE) + chunkNum(8 bytes BE)
β up to 64 KiB of file data
Manifest pointer (replaces CURRENT file):
ns + tagManifest(0x03)
β msgpack(FileDesc{Type: TypeManifest, Num: n})
Lock key:
ns + tagLock(0x04)
β "locked" (any non-empty value means locked)
Why big-endian for file numbers?
Big-endian encoding of fixed-width integers preserves sort order: the most significant byte comes first, so lexicographic byte comparison matches numeric comparison. File numbers are monotonically increasing (LevelDB never reuses a file number), and a range scan over ns+tagFileData+ft+num+* must return chunks in chunk-number order to reassemble the file correctly. Little-endian would break this: chunk 256 encodes as 00 01 00 00 00 00 00 00, which sorts before chunk 1 (01 00 00 00 00 00 00 00) in byte order.
File number and type as part of the key:
This means all chunks of file (TypeJournal, 3) sort together, before all
chunks of (TypeJournal, 4), which sort before (TypeTable, 5). Clean,
hierarchical key organization.
5. Chunking: Overcoming the 100 KiB Value Limit
FDB has a hard limit: values may not exceed 100,000 bytes (just under 100 KiB). A typical LevelDB SSTable is 2-4 MB. We cannot store it in one FDB value.
Our solution: split each file into 64 KiB chunks:
const chunkSize = 64 * 1024 // 65,536 bytes
// Writing a 200 KiB file:
// Chunk 0: bytes [0, 65536)
// Chunk 1: bytes [65536, 131072)
// Chunk 2: bytes [131072, 200000) (partial last chunk)
Each chunk is stored as a separate FDB key-value pair:
ns+0x02+ft+num+00000000_00000000 → 65,536 bytes
ns+0x02+ft+num+00000000_00000001 → 65,536 bytes
ns+0x02+ft+num+00000000_00000002 → 68,928 bytes (partial)
Reading a file: range-scan all chunk keys for (ft, num), sort by chunk
number (already in order due to big-endian encoding), concatenate the values.
// nopCloserReader adds a no-op Close to *bytes.Reader, which already
// provides the Read/Seek/ReadAt methods storage.Reader requires.
type nopCloserReader struct{ *bytes.Reader }

func (nopCloserReader) Close() error { return nil }

func (s *Storage) Open(fd storage.FileDesc) (storage.Reader, error) {
    kvs, err := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(s.dataRange(fd), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    var buf []byte
    for _, kv := range kvs.([]fdb.KeyValue) {
        buf = append(buf, kv.Value...)
    }
    return nopCloserReader{bytes.NewReader(buf)}, nil
}
Why 64 KiB chunks?
- Smaller than FDBβs 100 KiB value limit β
- Large enough to minimize key overhead (a 4 MB SSTable = 64 chunks, not thousands)
- Aligns with filesystem block sizes (4β64 KiB typical)
6. Atomic Rename β Durabilityβs Secret Weapon
POSIX rename(src, dst) is the single most important durability primitive in
filesystems. Its contract: after rename returns, dst exists and
src does not, with no window where neither exists. This is atomic
replacement.
LevelDB uses rename heavily:
Rename(TypeTemp, n, TypeTable, n): promote temp SSTable to final nameRename(TypeTemp, n, TypeManifest, n): promote temp manifest
Without atomicity, a crash during rename could leave:
- Neither file existing β data loss
- Both files existing β ambiguity about which is current
- A partial file at dst β corruption
In FDB, we implement atomic rename as copy + clear in one transaction:
func (s *Storage) Rename(oldfd, newfd storage.FileDesc) error {
_, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
// 1. Read all old chunks
kvs, _ := tr.GetRange(s.dataRange(oldfd), fdb.RangeOptions{}).GetSliceWithError()
// 2. Clear old data
tr.ClearRange(s.dataRange(oldfd))
tr.Clear(s.metaKey(oldfd))
// 3. Write new data
for _, kv := range kvs {
newKey := s.translateChunkKey(kv.Key, oldfd, newfd)
tr.Set(newKey, kv.Value)
}
tr.Set(s.metaKey(newfd), metaBytes)
return nil, nil
})
return err
}
One transaction. The cluster either commits all of this (old is gone, new is present) or none of it (crash safety). The atomicity guarantee is identical to POSIX rename β and arguably stronger, since FDB replicates the commit across multiple machines before returning.
The 10 MB transaction limit:
FDB transactions are limited to ~10 MB of reads + writes. A large SSTable (4 MB) would have chunks adding up to 4 MB of writes in one transaction. Thatβs under the 10 MB limit. But 8 MB SSTables would be risky.
Our Rename reads all chunks in the transaction (4 MB reads) and writes them
all back (4 MB writes) β totaling 8 MB. Safe for typical LevelDB files.
For larger files, weβd need to either:
- Break the rename into multiple transactions (violating atomicity), or
- Use a two-phase approach: write new chunks in a first transaction, then atomically swap the meta key in a second transaction (using a PENDING state key as the βin-progress renameβ marker).
7. The Writer: Batching Chunks into Transactions
When LevelDB writes a new SSTable, it calls Create(fd) which returns a
Writer. The writer accumulates bytes via Write(p []byte). When Close()
is called, we flush everything to FDB.
Batch size:
const maxChunksPerTx = 100 // 100 × 64 KiB = 6.25 MiB per transaction
We flush up to 100 chunks per FDB transaction, staying well within the 10 MB limit. A 20 MB SSTable is flushed in four transactions: three full batches of ≈6.25 MiB and a smaller final one.
func (w *writer) flush(final bool) error {
start := w.flushedChunks
end := start + maxChunksPerTx
if end > len(w.chunks) {
end = len(w.chunks)
}
_, err := w.s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
for i := start; i < end; i++ {
tr.Set(w.s.chunkKey(w.fd, i), w.chunks[i])
}
if final && end == len(w.chunks) {
// Write the meta key only on the final flush
tr.Set(w.s.metaKey(w.fd), metaBytes)
}
return nil, nil
})
    if err != nil {
        return err // don't advance flushedChunks on a failed transaction
    }
    w.flushedChunks = end
    return nil
}
The meta-key-last invariant:
We write the meta key (the fileβs βdirectory entryβ) only in the last batch
of chunks. This ensures that List() never returns a file whose chunks are
only partially written β the file is only βvisibleβ once all its chunks exist.
This is the FDB equivalent of:
- Write all content to a temp file
rename(temp, final)atomically
8. Why the WAL is Redundant With FDB
LevelDBβs Write-Ahead Log (WAL / journal, TypeJournal) exists for one
reason: crash recovery. If the process crashes after writing to the in-memory
MemTable but before flushing the MemTable to an SSTable on disk, the WAL
is replayed to reconstruct the MemTable.
With FDB as the storage backend:
Every write is already durable before Write(p) returns.
Our writer.Write buffers bytes in memory. Our writer.Close flushes to FDB
in transactions. Each FDB Transact call does not return until the commit is
confirmed by FDBβs replication protocol β the data is on at least f+1
machines (where f is the fault tolerance level, typically 2). A process
crash after Close() returns means the data is safe.
The WAL is protecting against βdata written to OS memory but not yet on disk.β
FDBβs Transact eliminates this window. By the time the WAL file is written
through our fdbstorage.Writer, the bytes are already in FDB.
A production implementation would patch goleveldb to skip journal writes entirely (goleveldb exposes no supported option for disabling the WAL, so this means forking it or swapping in a no-op journal writer). Skipping the WAL would roughly halve the bytes written per update and reduce FDB key usage.
9. The Blob Layer Pattern
FDBβs core team documented the βBlob Layerβ pattern: storing binary blobs (arbitrary large byte arrays) in FDB by chunking them. Our file storage is an instance of this pattern.
The Blob Layer pattern:
blob_key + chunkNum β chunk_data
It solves the 100 KiB value limit while preserving atomic operations on the whole blob (via FDB transactions) and efficient byte-range access (read only the chunks you need, e.g., for seeking within a large file).
Applications:
- Store large media files (> 100 KiB) in FDB for atomic metadata-plus-content updates
- Store ML model weights alongside their metadata records
- Store serialized protocol buffers larger than 100 KiB
- Back any file system abstraction (exactly what weβre doing)
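Efficient byte-range access means translating a byte range into the chunk numbers that cover it. A minimal sketch (`chunkSpan` is a hypothetical helper, not part of the repo):

```go
package main

import "fmt"

const chunkSize = 64 * 1024

// chunkSpan returns the inclusive range of chunk numbers covering
// bytes [offset, offset+length) of a blob. length must be > 0.
func chunkSpan(offset, length int64) (first, last int64) {
	return offset / chunkSize, (offset + length - 1) / chunkSize
}

func main() {
	// Reading 100 KiB starting at offset 128 KiB touches chunks 2..3 only.
	first, last := chunkSpan(128*1024, 100*1024)
	fmt.Println(first, last) // 2 3
}
```

The read path then issues one GetRange over just those chunk keys instead of fetching the whole blob.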
10. Real-World Analogues
RocksDB on Remote Storage
Meta (Facebook) has described running RocksDB with SSTables on distributed storage and offloading compaction to remote workers. The storage interface involved is exactly the same concept: RocksDB writes "files" via an abstract interface (its Env), and the implementation stores the bytes in a distributed system.
TiKV on Disaggregated Storage
TiDB (PingCAP) is moving toward a disaggregated architecture where TiKV (which uses RocksDB internally) stores its SSTables in object storage (S3, GCS). The TiKV storage engine writes SSTables through an abstract file interface to S3. This is identical to our pattern.
Pebble (CockroachDB)
CockroachDB replaced RocksDB with Pebble (a Go implementation); Pebble became the default storage engine in v20.2, released in 2020.
Pebble has a vfs.FS interface β a virtual filesystem abstraction β that
allows swapping the storage backend. CockroachDB uses this for testing (an
in-memory FS) and is exploring using it for cloud storage.
The Patternβs Universality
Every LSM-tree engine eventually adds a pluggable storage interface:
- LevelDB:
storage.Storage - RocksDB:
Env(virtual filesystem) - Pebble:
vfs.FS - WiredTiger:
WT_FILE_SYSTEM
Why? Because running the compaction engine without worrying about where data lives is architecturally clean. The engine is responsible for LSM semantics; the storage interface is responsible for durability. Separation of concerns.
11. Exercises
Exercise 1 β Streaming Reader
Instead of materializing the entire file into memory in Open(), return an
io.ReadSeekCloser that fetches chunks lazily. A read at offset 128 KiB should
only fetch chunks 2β3, not chunk 0 and 1.
This reduces memory usage for large SSTables and enables efficient
Seek(offset, io.SeekStart) for random-access reads.
Exercise 2 β File Size Cache
List() currently returns all file descriptors by scanning the meta keys.
Open(fd) reads the meta key to get the file size, then reads all chunk keys.
Add a small in-memory LRU cache mapping FileDesc -> size. On Open, check
the cache first. Invalidate the cache entry on Remove and Rename.
Measure the reduction in FDB round-trips for a workload with many small reads on recently-opened files.
Exercise 3 β Compression
Before storing each 64 KiB chunk, compress it with compress/flate or
github.com/golang/snappy. Store a compression-type byte in the meta key.
On read, decompress transparently.
LevelDB SSTables are already internally compressed (Snappy by default), so this may not reduce size much for TypeTable files. But TypeJournal files are not compressed and might benefit.
Exercise 4 β Two-Phase Large Rename
For files larger than 5 MB (which would exceed the transaction limit in
our current Rename), implement the two-phase rename:
Phase 1: Write all new chunks in multiple transactions. Write a
"rename-pending" key: ns+tagPending+oldfd -> newfd.
Phase 2: In one transaction, atomically: clear the pending key, clear all old chunks and meta, set the meta for new fd (chunks already exist).
On startup, check for any pending keys and complete or roll back the rename. This is essentially a two-phase commit for large file renames.
Exercise 5 β Multi-Tenant Databases
Add a namespace concept: allow multiple LevelDB databases to share one FDB
cluster with independent key spaces. Each New(fdb, namespace) call returns
a storage implementation that is completely isolated from others.
This is how mvsqlite handles multiple SQLite "database files": each is an FDB namespace.
12. Source Code Deep Dive – fdbstorage/storage.go
The Storage Struct
type Storage struct {
db fdb.Database
ns []byte
}
Minimal. db is the FDB connection; ns is the byte prefix for all keys. The entire storage is two fields. All complexity lives in the key encoding and transaction logic.
Key Encoding Helpers
func (s *Storage) metaKey(fd storage.FileDesc) fdb.Key {
// ns + 0x01 + type(1 byte) + num(8 bytes big-endian)
key := make([]byte, len(s.ns)+10)
copy(key, s.ns)
key[len(s.ns)] = 0x01
key[len(s.ns)+1] = byte(fd.Type)
binary.BigEndian.PutUint64(key[len(s.ns)+2:], uint64(fd.Num))
return fdb.Key(key)
}
func (s *Storage) chunkKey(fd storage.FileDesc, chunkNum int) fdb.Key {
// ns + 0x02 + type(1 byte) + num(8 bytes) + chunk(8 bytes)
key := make([]byte, len(s.ns)+18)
copy(key, s.ns)
key[len(s.ns)] = 0x02
key[len(s.ns)+1] = byte(fd.Type)
binary.BigEndian.PutUint64(key[len(s.ns)+2:], uint64(fd.Num))
binary.BigEndian.PutUint64(key[len(s.ns)+10:], uint64(chunkNum))
return fdb.Key(key)
}
Why 8-byte big-endian for chunkNum? Chunk numbers are read back via GetRange, which returns chunks in key order. Big-endian ensures key order equals chunk-number order. If we used little-endian, chunk 256 (LE: 00 01 00 00 00 00 00 00) would sort before chunk 1 (LE: 01 00 00 00 00 00 00 00), which is wrong.
The dataRange Helper
func (s *Storage) dataRange(fd storage.FileDesc) fdb.KeyRange {
begin := s.chunkKey(fd, 0)
// end: same prefix but with chunkNum just past MaxUint64; use the next-prefix trick
endPrefix := make([]byte, len(s.ns)+10) // ns + 0x02 + type + num
copy(endPrefix, s.ns)
endPrefix[len(s.ns)] = 0x02
endPrefix[len(s.ns)+1] = byte(fd.Type)
binary.BigEndian.PutUint64(endPrefix[len(s.ns)+2:], uint64(fd.Num))
end := append(endPrefix, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF)
return fdb.KeyRange{Begin: begin, End: fdb.Key(append(end, 0x01))}
}
This range covers all chunk keys for (type, num) regardless of chunkNum. GetRange(dataRange(fd)) fetches all chunks in order.
The List() Implementation
func (s *Storage) List() ([]storage.FileDesc, error) {
res, err := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
return rt.GetRange(s.metaRange(), fdb.RangeOptions{}).GetSliceWithError()
})
if err != nil {
return nil, err
}
var fds []storage.FileDesc
for _, kv := range res.([]fdb.KeyValue) {
var fd storage.FileDesc
if err := msgpack.Unmarshal(kv.Value, &fd); err != nil {
return nil, err
}
fds = append(fds, fd)
}
return fds, nil
}
A single range scan over all meta keys returns all files in one round-trip. LevelDB calls List() at startup to find all existing files. With FDB, this is O(1) round-trips regardless of file count.
With a local filesystem, List() is an opendir/readdir syscall, also O(1) in latency, but the I/O must go through the local disk controller. With FDB, the I/O goes to the closest FDB storage server over the network, with similar or lower latency than a rotational disk.
The Lock Implementation
func (s *Storage) Lock() (util.Releaser, error) {
_, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
existing, _ := tr.Get(s.lockKey()).Get()
if len(existing) > 0 {
return nil, errors.New("storage: already locked")
}
tr.Set(s.lockKey(), []byte("locked"))
return nil, nil
})
if err != nil {
return nil, err
}
return util.ReleaserFunc(func() {
s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
tr.Clear(s.lockKey())
return nil, nil
})
}), nil
}
The lock is an FDB key. Acquiring the lock means checking whether the key exists and, if not, setting it, all in one atomic transaction. This check-then-set is race-free because FDB's optimistic concurrency ensures that if two processes both read "no lock" and both try to write "locked", only one will commit (the other will conflict and retry, then find the lock held).
Limitation: This is a process-level lock, not a durable lease. If the lock-holding process crashes without calling Release(), the lock remains set until manually cleared. For production, use a lock with an expiry: store the lock as {holder: processID, expires: time.Now().Add(30*time.Second)} and have each lock holder refresh it periodically. A lock that isn't refreshed is treated as expired.
13. Production Considerations
13.1 Transaction Size for Large SSTables
LevelDB level-0 SSTables are 2–4 MB. Level-1 SSTables are larger (up to L1_target_size, configurable). For an L1_target_size of 64 MB, level-1 SSTables are 64 MB each. Our current Rename would fail for files this large (it exceeds the 10 MB transaction limit).
Solution: For production, configure LevelDBβs CompactionTableSize to keep SSTables small:
opts := &opt.Options{
CompactionTableSize: 2 * 1024 * 1024, // 2 MB SSTables
}
2 MB SSTables = 32 chunks of 64 KiB. Rename transaction: 32 reads + 32 writes = 4 MB total. Well within limits.
13.2 Read Performance for Large SSTables
Reading a 2 MB SSTable requires fetching 32 chunks from FDB. Our current implementation reads them in one GetRange: one round-trip, 32 key-value pairs returned. Latency: ~1–5 ms (FDB cluster local read latency).
A local filesystem read of 2 MB: ~1–3 ms on SSD, ~10–20 ms on HDD.
For a warm FDB cluster, FDB storage is competitive with SSDs and dramatically better than spinning disks. For random chunk access (seeking within large files), FDB may be faster because it can pipeline multiple point reads, while a spinning disk requires physical seeking.
13.3 Write Amplification
Our chunking adds write amplification: writing a 64 KiB chunk requires writing the chunk key (18 bytes) + value (64 KiB) = 64 KiB + 18 bytes. The key overhead is <0.03%, negligible.
But FDB itself adds write amplification internally: each committed transaction is written to the Transaction Log (TLog), then asynchronously applied to Storage Servers. The TLog write is sequential (fast). The Storage Server write is to FDB's B-tree (with its own write amplification). FDB's overall write amplification is roughly 3–5x, comparable to RocksDB's LSM write amplification.
13.4 Monitoring
Key metrics for a fdbstorage-backed LevelDB deployment:
- FDB transaction latency P99: should be < 10ms for small transactions (meta reads)
- FDB range scan bytes/second: correlates with compaction throughput
- FDB conflict rate: if high, indicates concurrent compaction and write contention
- LevelDB metrics via db.GetProperty("leveldb.stats"): still valid. LevelDB reports its own view of compaction and SSTable counts; the "disk I/O" underneath is actually FDB I/O
14. Interview Questions – Storage Abstractions and LSM Trees
Q: What is the purpose of the Rename operation in LevelDB's storage interface, and how does your FDB implementation preserve its atomicity guarantee?
Rename is LevelDB's way of atomically promoting a new SSTable (or MANIFEST) into production. During compaction, LevelDB writes the new SSTable to a temp file, then renames it to its final name. POSIX rename is atomic: either the old name or the new name is visible, never a half-written file. Our FDB implementation reads all chunks with the old file descriptor, writes them with the new file descriptor, and clears the old keys, all in one FDB transaction. FDB's transaction atomicity provides the same guarantee: either the old keys or the new keys are visible, never both or neither.
Q: Why does LevelDB use a Write-Ahead Log, and is it still necessary when using FDB as the storage backend?
The WAL protects against crash scenarios where data was written to the in-memory MemTable but not yet flushed to an SSTable on disk. Without a WAL, a crash after the MemTable write but before the SSTable flush would lose those writes. With FDB as storage, our writer.Close() writes chunks to FDB in transactions. Each committed FDB transaction is durable (replicated to at least two machines). A crash after Close() returns has no data loss. The WAL's durability purpose is already provided by FDB. A production implementation would use a no-op WAL to skip the overhead.
Q: What is the 10 MB transaction limit in FDB, and what design patterns avoid hitting it?
FDB limits the total read + write size per transaction to approximately 10 MB to bound the memory required on Commit Proxies and to keep transaction resolution fast. Patterns to stay within the limit: (1) chunk large values (as we do with 64 KiB chunks), (2) break bulk writes into multiple transactions with cursor-based pagination, (3) configure LevelDB's compaction to keep SSTable sizes small (< 2 MB), (4) use FDB atomic operations (tr.Add, tr.SetVersionstampedKey) where possible; atomic operations don't count against the read portion of the limit.
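Pattern (1), chunking, is mechanical. A minimal sketch of the splitting step (splitChunks is our own illustrative helper, not repo code):

```go
package main

import "fmt"

const chunkSize = 64 * 1024 // 64 KiB per FDB value

// splitChunks slices a large value into <= 64 KiB chunks, each of which
// becomes one FDB key-value pair, keeping every write far below the
// per-value and per-transaction size limits.
func splitChunks(data []byte) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := len(data)
		if n > chunkSize {
			n = chunkSize
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

func main() {
	sstable := make([]byte, 2*1024*1024) // a 2 MB SSTable
	chunks := splitChunks(sstable)
	fmt.Println(len(chunks)) // 32
}
```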
Q: How would you extend this implementation to support multiple concurrent LevelDB instances sharing the same FDB cluster?
Give each LevelDB instance its own ns prefix. The FDB key space is naturally partitioned: ns1 + ... keys and ns2 + ... keys are completely disjoint. Multiple instances can read and write concurrently with no coordination overhead; FDB's conflict detection only fires when one transaction's writes intersect another's reads or writes, and different namespaces touch different keys. The lock key (ns + tagLock) is also per-namespace, so locking one instance doesn't affect others.
Option B – SQLite VFS substrate on FoundationDB
Pattern: SQLite (or any pager-style engine) with FDB as the "disk." We implement the byte-range storage primitive a SQLite VFS sits on top of (fixed-size pages keyed by page number, atomic page-level updates) and demonstrate it through a small Go API. Wiring the C-level VFS hooks is mechanical glue that's left for a follow-up.
What a SQLite VFS actually needs
SQLite's pager talks to "the OS" through a thin C interface defined in vfs.h. Boiled down, a VFS file handle exposes:
| SQLite call | What we map it to |
|---|---|
xRead(buf, n, offset) | File.ReadAt(buf, offset) |
xWrite(buf, n, offset) | File.WriteAt(buf, offset) |
xTruncate(size) | File.Truncate(size) |
xFileSize() | File.Size() |
xLock / xUnlock | File.Lock(holder) / File.Unlock(holder) |
xSync | no-op: every WriteAt is already durable in FDB |
SQLite always reads and writes in multiples of the page size (4096 by default) once it's past the 100-byte file header, so storing one FDB KV per page is a natural fit and makes the pager-to-storage mapping 1:1.
Key layout
<ns> 0x00 -> uint64 BE file size in bytes
<ns> 0x01 <pageNum:uint64 BE> -> 4096-byte page
<ns> 0x02 -> lock holder name (or absent)
Where the transactional magic lives
The interesting method is WriteAt. For partial-page writes, we:
- Issue tr.Get for every affected page.
- Merge the existing page bytes with the new bytes.
- tr.Set the resulting full page.
- Update the file-size key if we grew.
All inside one FDB transaction. That gives SQLite a property it cannot get from a normal filesystem: multi-page writes are atomic. SQLite has elaborate journal/WAL machinery to recover from "we crashed halfway through updating pages 17, 18, and 19." On this VFS that recovery code becomes dead: either all three pages flipped or none did.
In practice you'd still set PRAGMA journal_mode = MEMORY (so SQLite skips
the rollback journal it doesn't need) and rely on FDB's transactional commit
as the single durability point.
Hooking this into a real SQLite
There are two paths:
- cgo + mattn/go-sqlite3 or zombiezen.com/go/sqlite: register a custom sqlite3_vfs whose xRead/xWrite/... thunks call into our pagestore.File methods via a CGO bridge. ~300 lines of glue.
- modernc.org/sqlite: pure-Go SQLite. Its vfs subpackage exposes Register/VFS types. Same wiring, no CGO.
Either way the interesting code is what's already here; the C/Go thunks add no further insight into how FDB serves as the storage tier.
Running the demo
cd option-b-sqlite
go mod tidy
go run ./demo -cluster ../fdb.cluster
Expected output:
After writing header (16 B): size=16
After 100-byte cross-page write at offset 4090: size=4190
Read 100 B back, all 'A'? true
Header preserved? "SQLite format 3\x00"
After truncate to 4096: size=4096
Lock contention as expected: pagestore: locked by "conn-A"
Lock handoff OK.
The cross-page write at offset 4090 is the key correctness test: it spans page 0 (bytes 4090–4095) and page 1 (bytes 4096–4189). The output above proves that:
- The pre-existing 16-byte header on page 0 was preserved during the read-modify-write merge.
- Both halves of the payload made it to disk.
- A subsequent Truncate(4096) cleared page 1 entirely.
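The page arithmetic behind that test can be checked in isolation (pageSpans is an illustrative helper, not the demo's code):

```go
package main

import "fmt"

const pageSize = 4096

// span describes the slice of one page touched by a write.
type span struct {
	page       int64 // page number
	start, end int64 // byte range within the page, end exclusive
}

// pageSpans breaks a write of `length` bytes at byte offset `off` into
// per-page byte ranges.
func pageSpans(off, length int64) []span {
	var out []span
	end := off + length
	for p := off / pageSize; p*pageSize < end; p++ {
		s := span{page: p, start: 0, end: pageSize}
		if p*pageSize < off {
			s.start = off - p*pageSize // write begins mid-page
		}
		if (p+1)*pageSize > end {
			s.end = end - p*pageSize // write ends mid-page
		}
		out = append(out, s)
	}
	return out
}

func main() {
	// The demo's 100-byte write at offset 4090:
	for _, s := range pageSpans(4090, 100) {
		fmt.Printf("page %d: bytes [%d, %d)\n", s.page, s.start, s.end)
	}
	// page 0: bytes [4090, 4096)  -> 6 bytes
	// page 1: bytes [0, 94)       -> 94 bytes
}
```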
What's deliberately omitted
- WAL mode. Skipping it forces SQLite into rollback-journal mode, which emits ordinary writes only, exactly what our VFS supports. WAL would require a second "shared-memory" backing store (xShmMap) and is the single largest source of complexity in real VFS implementations.
- Multi-process locking. The lock key works for cooperating clients but has no liveness guarantee on holder crash. Production code would combine it with FDB watches and a heartbeat.
- The C bridge itself. See docs/README.md above for the wiring sketch.
Hitchhiker's Guide – Option B: SQLite VFS over FoundationDB
The question this answers: "How does SQLite actually read and write its database file? Can we replace the file with FoundationDB?"
The deeper question: "What is a database page, how does the pager work, and why is the page model the right abstraction for a VFS?"
Table of Contents
- SQLite's Architecture – From SQL to Bytes
- The Virtual File System (VFS) – SQLite's Plugin Point
- The Page Model – How SQLite Organizes Storage
- Our pagestore – FDB as a Page Store
- The Partial-Page Write Problem – And Its Atomic Solution
- xSync is a No-Op – And Why That's Correct
- Locking – From POSIX Flock to FDB Keys
- Journal Modes: DELETE, WAL, MEMORY
- mvsqlite – The Production Version of This
- Real-World Analogues: libSQL, Litestream, LiteFS
- Exercises
1. SQLite's Architecture – From SQL to Bytes
SQLite processes queries through seven layers:
SQL text
↓ Tokenizer + Parser
Abstract Syntax Tree (AST)
↓ Code Generator
Bytecode program (VDBE instructions)
↓ Virtual Database Engine (VDBE)
B-tree operations (seek, insert, delete on B-tree pages)
↓ Pager
Page cache: reads/writes logical page numbers
↓ OS Interface (VFS)
File reads/writes at byte offsets
↓ Actual storage
We plug in at the VFS layer. Everything above (SQL parsing, the B-tree, the pager) runs as normal. The VFS is called for all I/O, and our implementation redirects those calls to FDB.
The key insight: SQLite doesn't know or care whether the VFS talks to a local file, a network mount, or a distributed database. It only needs the VFS contract to be upheld.
2. The Virtual File System (VFS) – SQLite's Plugin Point
The VFS is documented in SQLite's C API. The key functions (C signatures simplified):
// File operations (called on each open file):
int xRead(file, buf, amount, offset); // read `amount` bytes at `offset`
int xWrite(file, buf, amount, offset); // write `amount` bytes at `offset`
int xTruncate(file, size); // truncate to `size` bytes
int xSync(file, flags); // flush to durable storage
int xFileSize(file, *size); // return current size
int xLock(file, locktype); // acquire SHARED/EXCLUSIVE/etc
int xUnlock(file, locktype); // release lock
int xCheckReservedLock(file, *result); // is anyone holding RESERVED lock?
int xShmMap(file, region, size, ...); // map shared memory region (WAL mode)
Our Go pagestore implements the subset needed for non-WAL mode:
ReadAt, WriteAt, Truncate, FileSize, Lock, Unlock.
Why a Go struct instead of a C VFS?
A real SQLite VFS requires implementing C structs (sqlite3_vfs,
sqlite3_file) and using CGO to register them. This is correct but adds
~200 lines of glue code that would obscure the core concept. Our pagestore
is a Go struct that exercises the exact same I/O patterns as SQLite would use.
The demo calls ReadAt/WriteAt directly, simulating what the SQLite pager
would call through the VFS.
3. The Page Model – How SQLite Organizes Storage
SQLite's database file is divided into fixed-size pages (default 4096 bytes). Every structure in a SQLite database (B-tree interior nodes, B-tree leaf nodes, overflow pages, free-list pages) is exactly one page. The pager manages a cache of these pages and issues I/O to the VFS in page-aligned operations.
SQLite database file layout:
Offset 0 – 4095: Page 1 (database header + B-tree root)
Offset 4096 – 8191: Page 2
Offset 8192 – 12287: Page 3
...
Offset (N-1)*4096 – N*4096-1: Page N
Page 1 is special: its first 100 bytes are the database header:
Offset 0: "SQLite format 3\000" (16 bytes, magic string)
Offset 16: page size (2 bytes)
Offset 18: file format write version (1 byte)
Offset 19: file format read version (1 byte)
...
Offset 28: file change counter (4 bytes)
...
Offset 52: schema format number (4 bytes)
...
The partial-page write problem arises here: the header is 100 bytes, but
SQLite might write just the header (100 bytes at offset 0), while the rest
of page 1 (bytes 100–4095) contains B-tree data. An xWrite(file, header, 100, 0) call writes only 100 bytes, not the full page.
4. Our pagestore – FDB as a Page Store
We store pages as FDB keys:
const PageSize = 4096
// Page data:
// ns + 0x01 + pageNum(uint64 big-endian) -> 4096 bytes
// Database size (in bytes):
// ns + 0x00 -> uint64 (big-endian)
// Exclusive lock:
// ns + 0x02 -> "locked" (any non-empty value)
Page numbers:
SQLite page numbers are 1-based (page 1 is the first page). We store them as 0-padded big-endian uint64s:
Page 1: ns + 0x01 + 0000000000000001
Page 2: ns + 0x01 + 0000000000000002
Page 1000: ns + 0x01 + 00000000000003E8
Big-endian encoding ensures that a GetRange over ns+0x01+0 to
ns+0x01+FFFFFFFFFFFFFFFF returns pages in ascending page-number order,
which is useful for scanning the entire database.
Size tracking:
SQLite calls xFileSize to determine how many pages exist. We track this
explicitly with a size key rather than scanning for the highest-numbered
page key (which would be an O(N) scan). The size key is updated atomically
in the same transaction as every WriteAt and Truncate.
5. The Partial-Page Write Problem – And Its Atomic Solution
This is the trickiest part of implementing a page-based storage engine.
SQLite's VFS contract says:
xWrite(offset, amount, data):
Write `amount` bytes starting at `offset`.
`offset` and `amount` may be any values (not necessarily page-aligned).
Examples of partial-page writes:
- Write header (100 bytes at offset 0) to page 1
- Write a record that spans a 4-byte boundary at the end of a page
- Write a single integer (4 bytes at offset 12) to a page
We cannot simply map WriteAt(offset, data) to Set(pageKey(offset/4096), data) because that would overwrite the entire 4096-byte page with only
the partial data, corrupting the rest of the page.
Our solution: read-modify-write inside one transaction.
func (p *PageStore) WriteAt(data []byte, off int64) (int, error) {
firstPage := off / PageSize
lastPage := (off + int64(len(data)) - 1) / PageSize
_, err := p.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
for pageNum := firstPage; pageNum <= lastPage; pageNum++ {
// 1. Read current page content (or zeros if it doesn't exist yet)
existing, _ := tr.Get(p.pageKey(pageNum)).Get()
page := make([]byte, PageSize)
copy(page, existing) // pad to 4096 with zeros if shorter
// 2. Compute which bytes of `data` go into this page
pageStart := pageNum * PageSize
pageEnd := pageStart + PageSize
writeStart := max(off, pageStart) - pageStart
writeEnd := min(off+int64(len(data)), pageEnd) - pageStart
dataStart := max(off, pageStart) - off
dataEnd := min(off+int64(len(data)), pageEnd) - off
// 3. Overlay the new bytes onto the existing page
copy(page[writeStart:writeEnd], data[dataStart:dataEnd])
// 4. Write the modified page back
tr.Set(p.pageKey(pageNum), page)
}
// 5. Update size key if this write extends the file
...
return nil, nil
})
return len(data), err
}
Steps 1–4 happen inside one FDB transaction. The read and the write are atomic: no concurrent writer can modify the page between our read and write. No partial update is visible to other transactions.
The atomicity guarantee:
If the process crashes after step 1 (read) but before step 4 (write), the transaction is never committed. FDB abandons it. The page is unchanged. No corruption.
If two concurrent writers both try to read-modify-write the same page, FDB's conflict detection will cause one to retry. The second writer will re-read the page (now including the first writer's change) and overlay its own data on top. Correct behavior, no coordination required.
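To see the merge end-to-end, here is an in-memory model of the same read-modify-write (a plain map stands in for FDB, so the transactional machinery is elided; uses the built-in min/max from Go 1.21+):

```go
package main

import "fmt"

const pageSize = 4096

// writeAt overlays data at byte offset off onto a page map, performing the
// same zero-pad-and-merge our FDB transaction does per page.
func writeAt(pages map[int64][]byte, data []byte, off int64) {
	firstPage := off / pageSize
	lastPage := (off + int64(len(data)) - 1) / pageSize
	for p := firstPage; p <= lastPage; p++ {
		page := make([]byte, pageSize)
		copy(page, pages[p]) // existing content, zero-padded if absent
		pageStart := p * pageSize
		ws := max(off, pageStart) - pageStart                  // write start in page
		we := min(off+int64(len(data)), pageStart+pageSize) - pageStart
		ds := max(off, pageStart) - off                        // matching offset in data
		copy(page[ws:we], data[ds:])
		pages[p] = page
	}
}

func main() {
	pages := map[int64][]byte{}
	writeAt(pages, []byte("SQLite format 3\x00"), 0) // the 16-byte header
	payload := make([]byte, 100)
	for i := range payload {
		payload[i] = 'A'
	}
	writeAt(pages, payload, 4090) // the demo's cross-page write

	fmt.Println(string(pages[0][:15])) // SQLite format 3  (header preserved)
	fmt.Println(pages[0][4095] == 'A') // true: tail of page 0 written
	fmt.Println(pages[1][93] == 'A')   // true: head of page 1 written
	fmt.Println(pages[1][94] == 0)     // true: rest of page 1 untouched
}
```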
6. xSync is a No-Op – And Why That's Correct
SQLite calls xSync to tell the VFS: "make sure all previous writes are
durable on physical storage before returning." On a normal filesystem, this
calls fsync(), which flushes the OS page cache to the disk controller and
waits for the disk to confirm durability.
Why do we implement it as a no-op?
FDB's Transact is synchronous and durable.
When WriteAt calls p.db.Transact(...) and it returns without error, the data is already:
- Written to FDB's transaction log on f+1 machines (durably replicated)
- Committed (visible to new transactions)
By the time WriteAt returns to the SQLite pager, the durability guarantee
is already satisfied; this is stronger than fsync on a local disk.
xSync is a no-op because the work it would do has already been done
during xWrite. There is nothing left to flush.
Analogy: Imagine a bank transfer. xWrite is "debit Alice, credit Bob,
commit to ledger." xSync is "make sure the ledger is durable." If the
ledger is already in a replicated distributed database, xSync has nothing
to do.
7. Locking – From POSIX Flock to FDB Keys
SQLite uses file locking to coordinate concurrent access. The POSIX lock levels, in order of increasing exclusivity:
UNLOCKED → SHARED → RESERVED → PENDING → EXCLUSIVE
- SHARED: reader has this. Many readers can hold SHARED simultaneously.
- RESERVED: writer intending to modify. One writer can hold RESERVED while readers continue holding SHARED.
- PENDING: writer waiting for all SHARED holders to release.
- EXCLUSIVE: sole writer, no readers. Required before writing pages.
A full VFS implementation would map these levels to FDB lock keys with appropriate semantics (e.g., a counter for SHARED, a flag for EXCLUSIVE).
Our implementation is simplified: we use one "exclusive lock" key. Setting it means exclusive ownership; clearing it means unlocked. This correctly implements single-writer semantics but doesn't allow concurrent readers.
For multi-reader/single-writer, a production implementation would:
SHARED lock: counter key (increment on lock, decrement on unlock)
EXCLUSIVE lock: flag key
→ acquire EXCLUSIVE: check counter == 0, then set the flag key
→ acquire SHARED: check the flag key is unset, then increment the counter
Both checks-and-sets would use FDB transactions to ensure atomicity.
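The two check-and-set rules can be modeled in memory; in this sketch each method body corresponds to what would run inside one FDB transaction (no FDB wiring, and "blocked" would become a retry or wait in a real lock):

```go
package main

import (
	"errors"
	"fmt"
)

// lockState mirrors the two FDB keys: a SHARED reader counter and an
// EXCLUSIVE flag.
type lockState struct {
	readers   int
	exclusive bool
}

func (l *lockState) acquireShared() error {
	if l.exclusive {
		return errors.New("blocked: exclusive lock held")
	}
	l.readers++
	return nil
}

func (l *lockState) acquireExclusive() error {
	if l.exclusive || l.readers > 0 {
		return errors.New("blocked: readers or writer present")
	}
	l.exclusive = true
	return nil
}

func (l *lockState) releaseShared() { l.readers-- }

func main() {
	var l lockState
	fmt.Println(l.acquireShared())    // <nil>: first reader in
	fmt.Println(l.acquireShared())    // <nil>: readers share freely
	fmt.Println(l.acquireExclusive()) // error: readers still hold SHARED
	l.releaseShared()
	l.releaseShared()
	fmt.Println(l.acquireExclusive()) // <nil>: now sole owner
	fmt.Println(l.acquireShared())    // error: writer holds EXCLUSIVE
}
```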
8. Journal Modes: DELETE, WAL, MEMORY
SQLite has several journal modes, selected with PRAGMA journal_mode=MODE.
The journal is SQLite's crash recovery mechanism (distinct from FDB's own
recovery).
DELETE (default rollback journal): Before modifying a page, SQLite copies the original page content to a separate journal file. If a crash occurs during a write, SQLite replays the journal to restore the original page content.
With FDB, this is redundant: FDB's transactions provide rollback for free.
A partial WriteAt that crashes mid-transaction rolls back automatically.
The journal file would be stored in FDB via our VFS, adding overhead for
no benefit.
WAL (Write-Ahead Log): Instead of copying original pages before modification, SQLite appends new page versions to a WAL file. Readers check the WAL before reading the main database file. A "checkpoint" operation copies WAL pages back to the main file.
WAL mode requires xShmMap, shared memory for the WAL index. This is a
memory-mapped file that multiple processes share to coordinate which WAL
frames are valid. Implementing xShmMap in FDB would require:
- A small FDB key range to store the WAL index state.
- Synchronized access to that state via FDB transactions.
This is complex but possible; mvsqlite does it.
MEMORY: SQLite uses an in-memory journal, not written to any file. Rollback is possible within a transaction but not after a crash.
For our FDB VFS, PRAGMA journal_mode=MEMORY is the right choice: SQLite
won't try to create journal files (which would trigger extra VFS calls), and
crash recovery is handled by FDB. This is what our demo uses implicitly by
not specifying a journal mode (we would need to open the database and run
PRAGMA journal_mode=MEMORY before any writes).
9. mvsqlite – The Production Version of This
mvsqlite (by losfair, Rust) is the production-grade implementation of the same idea. Its architecture:
Namespace mapping: Each SQLite "database file" path is mapped to an FDB namespace prefix. Multiple processes can open the same "file" (namespace) simultaneously with MVCC isolation.
Page-level MVCC: mvsqlite keeps multiple versions of each page, similar to how PostgreSQL keeps multiple row versions. When a reader opens the database, it captures an FDB read version. Page reads are served from FDB at that version. New writes create new page versions. Old versions are kept until all readers that need them complete.
Our pagestore is a simplified, non-MVCC version: every read sees the latest
page version. Adding page-level MVCC would require:
ns + 0x01 + pageNum + version -> 4096 bytes (instead of just pageNum)
With a cleanup process that removes old versions when no readers hold them.
Write-ahead log in FDB: mvsqlite implements WAL mode by storing the WAL log pages in FDB itself:
ns + 0x03 + txnId + pageNum -> 4096 bytes (uncommitted WAL pages)
ns + 0x04 + txnId -> commit record
This allows SQLite's WAL mode to work without xShmMap (the shared memory
region); instead, the WAL index state lives in FDB and is accessed via
transactions.
10. Real-World Analogues
libSQL (Turso)
libSQL is a fork of SQLite that adds multi-tenancy and remote storage. Turso's cloud product stores SQLite databases in a distributed object store (similar to our FDB approach). The architecture:
- Local replicas for low-latency reads
- A primary writes to the distributed store
- Followers pull changes from the store and apply them locally
Litestream
Litestream continuously replicates SQLite databases to S3 or other cloud storage by intercepting the WAL. It monitors the SQLite WAL file and copies new frames to cloud storage in near-real-time. Restore is done by downloading frames and applying them.
Litestream does NOT modify SQLite's VFS; it runs as a separate process that monitors the WAL file. This is simpler but means it can't provide multi-writer consistency (only one process can write to a SQLite database at a time).
LiteFS (Fly.io)
LiteFS mounts a FUSE filesystem that intercepts SQLite writes, replicates them to a primary node, and distributes to replicas. It does intercept at the filesystem level (FUSE = Filesystem in Userspace), which is essentially implementing the VFS pattern at the OS level.
The Pattern's Significance
All of these tools (mvsqlite, libSQL, Litestream, LiteFS) are attacking the same problem: SQLite is an excellent embedded database, but it's tied to a single file on a single machine. The solution in each case is to intercept the storage layer and redirect I/O to a distributed, replicated system.
Our pagestore is the minimal proof-of-concept for this idea: 150 lines of
Go that demonstrate the core page-store primitives.
11. Exercises
Exercise 1 β Page Cache
Add an in-process LRU page cache:
type PageStore struct {
...
cache *lru.Cache // maps pageNum -> [PageSize]byte
}
ReadAt checks the cache before going to FDB. WriteAt updates the cache
after writing to FDB. Evict on Truncate.
Measure the cache hit rate for a typical SQLite workload (a mix of reads and writes). You'll find that B-tree root pages have very high hit rates; they're accessed on every query.
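A minimal LRU over container/list, as one possible shape for the exercise's lru.Cache (illustrative; any LRU implementation works):

```go
package main

import (
	"container/list"
	"fmt"
)

// pageLRU is a tiny fixed-capacity LRU mapping pageNum -> page bytes.
type pageLRU struct {
	cap   int
	order *list.List              // front = most recently used
	items map[int64]*list.Element // pageNum -> list element
}

type entry struct {
	key  int64
	page []byte
}

func newPageLRU(capacity int) *pageLRU {
	return &pageLRU{cap: capacity, order: list.New(), items: map[int64]*list.Element{}}
}

// Get returns a cached page and marks it most recently used.
func (c *pageLRU) Get(key int64) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).page, true
}

// Put inserts or refreshes a page, evicting the least recently used entry
// when over capacity.
func (c *pageLRU) Put(key int64, page []byte) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).page = page
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, page})
	if c.order.Len() > c.cap {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := newPageLRU(2)
	c.Put(1, []byte("root"))
	c.Put(2, []byte("leaf"))
	c.Get(1)                  // touch page 1 so page 2 becomes oldest
	c.Put(3, []byte("other")) // evicts page 2
	_, ok := c.Get(2)
	fmt.Println(ok) // false: evicted
	_, ok = c.Get(1)
	fmt.Println(ok) // true: still cached
}
```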
Exercise 2 β Implement xShmMap for WAL Mode
WAL mode requires shared memory for the WAL index. In mvsqlite, this is done with FDB keys. Implement a simplified version:
- Add a "WAL region" subspace: ns + 0x03 + regionNum -> 32768 bytes
- xShmMap(region, size) reads the FDB key and returns a []byte
- xShmBarrier() writes the modified byte slice back to FDB
- xShmLock maps to an FDB lock key per region
With this, enable PRAGMA journal_mode=WAL in the demo and verify reads
and writes still work.
Exercise 3 β Multi-Version Pages
Change the page key layout to:
ns + 0x01 + pageNum + readVersion -> 4096 bytes
WriteAt reads the current version, writes a new version at the current
FDB commit version. ReadAt scans backwards from the requested read version
to find the most recent page version ≤ that version.
Add a Compact(olderThan version) that clears all page versions older than
a given version. This is the MVCC GC mechanism.
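The backwards scan reduces to a pure function over the stored versions (a sketch; in FDB it would be a reverse GetRange with limit 1 over the versioned keys):

```go
package main

import "fmt"

// visibleVersion returns the largest stored version <= readVersion, or
// (0, false) if none is visible. versions must be sorted ascending, which
// the big-endian key encoding gives us for free in FDB.
func visibleVersion(versions []uint64, readVersion uint64) (uint64, bool) {
	for i := len(versions) - 1; i >= 0; i-- { // scan backwards, newest first
		if versions[i] <= readVersion {
			return versions[i], true
		}
	}
	return 0, false
}

func main() {
	versions := []uint64{10, 20, 35} // versions at which some page was rewritten
	v, ok := visibleVersion(versions, 25)
	fmt.Println(v, ok) // 20 true: a reader at version 25 sees the write at 20
	_, ok = visibleVersion(versions, 5)
	fmt.Println(ok) // false: the page did not exist yet at version 5
}
```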
Exercise 4 β Register as a Real SQLite VFS
Using CGO, implement the actual sqlite3_vfs and sqlite3_file C structs
backed by our pagestore. Register the VFS with sqlite3_vfs_register.
Open a real SQLite connection using this VFS:
db, err := sql.Open("sqlite3", "file:demo.db?vfs=fdbvfs")
db.Exec("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
db.Exec("INSERT INTO t VALUES (1, 'hello')")
All I/O will go through pagestore to FDB. You have now replaced SQLite's
storage engine while keeping the full SQL query interface.
Exercise 5 β Multi-Database Isolation
The pagestore uses a namespace prefix to isolate one database. Add a
DatabaseManager that:
- Lists all databases (range scan over a well-known metadata prefix)
- Creates a new database (allocates a namespace, writes a metadata entry)
- Deletes a database (one ClearRange + metadata clear)
- Returns a pagestore for a given database name
12. Source Code Deep Dive – pagestore/pagestore.go
The File Struct
type File struct {
db fdb.Database
ns []byte
mu sync.Mutex
}
ns is the namespace byte prefix. mu protects concurrent calls to WriteAt and Truncate from a single File instance (though FDB's own transactions provide the real isolation for concurrent processes).
Key Construction
func (f *File) sizeKey() fdb.Key {
key := make([]byte, len(f.ns)+1)
copy(key, f.ns)
key[len(f.ns)] = 0x00
return fdb.Key(key)
}
func (f *File) pageKey(pageNum int64) fdb.Key {
key := make([]byte, len(f.ns)+9)
copy(key, f.ns)
key[len(f.ns)] = 0x01
binary.BigEndian.PutUint64(key[len(f.ns)+1:], uint64(pageNum))
return fdb.Key(key)
}
sizeKey() stores the total file size in bytes. pageKey(n) stores the 4096-byte page at page number n. The big-endian 8-byte encoding of pageNum ensures that a GetRange over the 0x01 subspace returns pages in ascending page-number order.
ReadAt β Handling Partial-Page Reads
func (f *File) ReadAt(p []byte, off int64) (int, error) {
firstPage := off / PageSize
lastPage := (off + int64(len(p)) - 1) / PageSize
_, err := f.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
// Issue all page reads first (pipelined)...
futures := make([]fdb.FutureByteSlice, lastPage-firstPage+1)
for i := firstPage; i <= lastPage; i++ {
futures[i-firstPage] = rt.Get(f.pageKey(i))
}
// ...then wait for them, still inside the transaction (futures belong
// to this transaction attempt and must be resolved before it ends).
for _, fut := range futures {
pageData, err := fut.Get()
if err != nil {
return nil, err
}
// copy the relevant bytes from pageData into p...
_ = pageData
}
return nil, nil
})
return len(p), err
}
Pipelining multiple page reads: All rt.Get(pageKey(i)) calls are issued inside one ReadTransact closure. FDB sends all read requests to the storage servers before waiting for any response. If the read spans 3 pages (common for reads that cross page boundaries), all 3 are fetched in one round-trip. Without pipelining, each page would require a separate round-trip β 3x the latency.
This is the same pipelining pattern used in option-c-record-layerβs LookupByIndex.
WriteAt β The Read-Modify-Write Pattern
func (f *File) WriteAt(p []byte, off int64) (int, error) {
f.mu.Lock()
defer f.mu.Unlock()
firstPage := off / PageSize
lastPage := (off + int64(len(p)) - 1) / PageSize
_, err := f.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
// Issue all reads first (pipelined)
futures := make([]fdb.FutureByteSlice, lastPage-firstPage+1)
for i := firstPage; i <= lastPage; i++ {
futures[i-firstPage] = tr.Get(f.pageKey(i))
}
sizeF := tr.Get(f.sizeKey())
// Wait for reads and apply writes
for i := firstPage; i <= lastPage; i++ {
existing, _ := futures[i-firstPage].Get()
page := make([]byte, PageSize)
copy(page, existing)
// overlay p bytes onto page...
tr.Set(f.pageKey(i), page)
}
// Update size (the size key may not exist yet on a brand-new file)
var curSize int64
if sb := sizeF.MustGet(); len(sb) == 8 {
curSize = int64(binary.BigEndian.Uint64(sb))
}
newEnd := off + int64(len(p))
if newEnd > curSize {
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, uint64(newEnd))
tr.Set(f.sizeKey(), b)
}
return nil, nil
})
return len(p), err
}
The read futures are pipelined: all tr.Get() calls are issued before any .Get() is called to block. All page reads happen in parallel, costing roughly one round-trip of latency no matter how many pages the write spans.
The size update is in the same transaction: The page writes and the size update are atomic. If the process crashes mid-transaction, FDB abandons it: neither the page updates nor the size update are applied. The database remains in its previous state β consistent, readable, no corruption.
pageRange β For Bulk Operations
func (f *File) pageRange() fdb.KeyRange {
begin := f.pageKey(0)
end := make([]byte, len(f.ns)+1)
copy(end, f.ns)
end[len(f.ns)] = 0x02 // one past the page subspace tag (0x01)
return fdb.KeyRange{Begin: begin, End: fdb.Key(end)}
}
Used by Truncate to clear all pages above the new size:
func (f *File) Truncate(size int64) error {
newLastPage := size / PageSize
_, err := f.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
// Clear all pages above newLastPage
clearBegin := f.pageKey(newLastPage + 1)
tr.ClearRange(fdb.KeyRange{Begin: clearBegin, End: f.pageRange().End})
// Update size
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, uint64(size))
tr.Set(f.sizeKey(), b)
return nil, nil
})
return err
}
One ClearRange atomically removes all pages above the new size. This is O(1) in FDB key operations (one range-clear instruction), even if there are thousands of pages to clear. This is vastly more efficient than deleting pages one by one.
13. Production Considerations
13.1 Page Size Selection
The default SQLite page size is 4096 bytes. You can change it with PRAGMA page_size=N (must be set before the first write). Larger pages reduce the number of B-tree levels and improve scan performance on large tables, but increase read amplification for small point lookups. For an FDB-backed store:
- 4096 bytes (default): 4096-byte FDB values. Small key overhead. Good for mixed workloads.
- 8192 bytes: 8192-byte values. Better for large sequential scans.
- 65536 bytes (maximum SQLite page size): 64 KB values. Below FDBβs 100 KB limit. Excellent for large sequential scans, but poor for point lookups.
For most use cases, 4096 is the right choice. Match it to your workloadβs read/write unit.
13.2 Hot Page Contention
Page 1 contains the database header and the root page of the main B-tree. Every query touches page 1. Every transaction increments the βfile change counterβ in the header. This means every write transaction does a read-modify-write on page 1 β even if the query doesnβt touch any user data on page 1 (e.g., inserting into a deep table touches only leaf pages and page 1 for the change counter).
In a high-concurrency write workload, this is a hot key. FDBβs conflict detection will serialize all transactions that write to page 1 β reducing concurrency.
Solution (mvsqliteβs approach): Move the change counter out of the page and into a separate FDB key. Use FDBβs atomic add to increment it without a read-modify-write. Since atomic add is commutative, it doesnβt cause conflicts. The page itself (minus the counter) is written only when actual B-tree structure changes.
13.3 Crash Recovery Testing
With FDBβs transactional guarantees, the traditional SQLite recovery path (re-applying the rollback journal) is never needed. But itβs worth testing:
- Write some data.
- Kill the process mid-write (use kill -9 or os.Exit(1) inside a WriteAt).
- Restart the process and open the database.
- Verify the database is in a consistent state (either the write happened or it didnβt, no partial state).
FDBβs transaction semantics make this trivially correct, but testing it gives you confidence that your implementation upholds the contract.
14. Interview Questions β SQLite, VFS, and Page Models
Q: What is a database page and why do databases use a fixed page size?
A page is the fundamental unit of I/O and storage in a database. Fixed page sizes allow: (1) simple address calculation β page N starts at byte (N-1) * pageSize; (2) alignment with OS page sizes (4 KiB for virtual memory, matching SQLiteβs default); (3) buffer pool management β a page cache stores exactly one page per slot. Variable-size records are packed into pages; a B-tree node is exactly one page. The fixed size ensures that a B-tree node read is always one I/O operation, not an I/O per record.
Q: What is the partial-page write problem, and how does your FDB implementation solve it?
SQLiteβs VFS xWrite can write any byte range β not necessarily page-aligned. If a 100-byte write starts at offset 0, it overwrites bytes 0β99 of page 1, but bytes 100β4095 are unchanged. An implementation that naively maps this to Set(page1Key, 100_bytes) would corrupt page 1 (losing bytes 100β4095). Our solution: read the existing page, overlay the new bytes, write the full 4096-byte page back β in one FDB transaction. The transaction atomicity ensures that no other writer sees a partial page and that a crash during the operation leaves the page unchanged.
Q: Why is xSync a no-op in your VFS implementation?
xSync requests durability: βflush all previous writes to stable storage.β Our WriteAt implementation uses fdb.Transact which does not return until the write is committed and replicated to FDBβs transaction log. The commit is synchronous and durable by the time WriteAt returns. There is nothing left for xSync to flush. The durability guarantee of xSync is already satisfied by the end of WriteAt.
Q: How does SQLite locking work, and how would you implement a multi-reader single-writer lock in FDB?
SQLite uses five lock levels (UNLOCKED, SHARED, RESERVED, PENDING, EXCLUSIVE). For multi-reader single-writer: maintain a SHARED lock counter (number of active readers) and an EXCLUSIVE lock flag. Acquire SHARED: check the exclusive flag is unset, increment the counter (in one FDB transaction). Acquire EXCLUSIVE: check counter == 0, set the exclusive flag (in one FDB transaction). Release SHARED: decrement the counter. Release EXCLUSIVE: clear the flag. All check-and-set operations are atomic in FDB transactions, so no intermediate state is observable. A would-be exclusive lock holder that sees counter > 0 must retry (wait for readers to finish).
This is the NamespaceManager concept in mvsqlite, and itβs how a
multi-tenant SQLite service would work.
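The lock transitions above can be sketched as pure functions over the two pieces of state (reader count, exclusive flag). In a real implementation each method body would run inside one fdb.Transact with the state stored under two FDB keys, and errBusy would drive a retry loop; names here are ours:

```go
package main

import (
	"errors"
	"fmt"
)

// lockState mirrors the two FDB keys described above: a reader counter and
// an exclusive flag. Each method is what one fdb.Transact would do.
type lockState struct {
	readers   int
	exclusive bool
}

var errBusy = errors.New("lock busy: retry")

func (l *lockState) acquireShared() error {
	if l.exclusive {
		return errBusy // a writer holds the lock
	}
	l.readers++
	return nil
}

func (l *lockState) acquireExclusive() error {
	if l.exclusive || l.readers > 0 {
		return errBusy // wait for readers/writer to drain, then retry
	}
	l.exclusive = true
	return nil
}

func (l *lockState) releaseShared()    { l.readers-- }
func (l *lockState) releaseExclusive() { l.exclusive = false }

func main() {
	var l lockState
	fmt.Println(l.acquireShared())    // acquired: <nil>
	fmt.Println(l.acquireExclusive()) // blocked by the reader: lock busy: retry
	l.releaseShared()
	fmt.Println(l.acquireExclusive()) // acquired: <nil>
}
```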
Option C β Record Layer over FoundationDB
Pattern: βstructured records + secondary indexes, native FDBβ β the same idea Appleβs
fdb-record-layeris built around, distilled to ~250 lines of Go so you can read the whole thing in one sitting.
What problem this solves
FDBβs wire-level API is ordered KV. Most applications want records (a name, a city, an email) plus queries like βfind users in Parisβ. A record layer sits in between:
- Records are serialized blobs stored under a primary key.
- Indexes are extra KV entries keyed by (field, value, pk) so range scans on the index subspace return matching primary keys.
- Both updates happen in one FDB transaction, so the index is never out of sync with the records β even under concurrent writers.
Files
recordlayer/
encoding.go Key layouts for records (tag 0x00) and indexes (tag 0x01)
msgpack.go Wrapper around github.com/vmihailenco/msgpack
store.go Open / PutRecord / GetRecord / DeleteRecord / ScanRecords
index.go LookupByIndex (range-scan index β batch-read records)
demo/main.go Users keyed by id, indexed by city
Key layout
records: <ns> 0x00 <schemaName> 0x00 <pk> -> msgpack(record)
indexes: <ns> 0x01 <schemaName> 0x00 <fieldName> 0x00 <value> 0x00 <pk> -> ""
Splitting records and indexes into two top-level βtagsβ keeps each subspace
range-scannable without touching the other. Integers are written as
sign-bit-flipped big-endian (see encodeIndexValue) so negative values sort
before positive in the index β a tiny but useful property when you want range
queries like βage >= 18β.
Why itβs transactional by construction
PutRecord does, inside one db.Transact:
- Read the previous version of the record (if any).
- For each indexed field that changed or was removed, Clear the old index entry.
- Set the new record bytes.
- Set fresh index entries for the new field values.
Because FDB transactions are serializable, no other client can observe a state where the record was updated but its indexes were not. This is the single biggest reason to build atop FDB rather than a non-transactional KV.
Mapping LookupByIndex to FDB ops
db.ReadTransact(rt -> {
idxKVs := rt.GetRange(indexPrefix(schema, field, value)) // ordered scan
for each idxKV:
pendings[i] = rt.Get(recordKey(schema, pk)) // pipelined
for each pending: collect record
})
The pipeline trick (issuing all rt.Get calls before awaiting any) lets FDB
overlap network round-trips. Latency β slowest single read instead of
sum-of-reads.
Running
- Bring up FDB and bootstrap (see top-level README).
cd option-c-record-layer && go run ./demo -cluster ../fdb.cluster
Expected output:
All users (PK order):
u1 -> map[city:Paris name:Alice]
u2 -> map[city:Tokyo name:Bob]
u3 -> map[city:Paris name:Carol]
Lookup city=Paris (via secondary index):
u1 -> map[city:Paris name:Alice]
u3 -> map[city:Paris name:Carol]
After moving Alice to Tokyo:
Lookup city=Paris:
u3 -> map[city:Paris name:Carol]
Lookup city=Tokyo:
u1 -> map[city:Tokyo name:Alice]
u2 -> map[city:Tokyo name:Bob]
What this layer omits compared to fdb-record-layer
- Schema evolution / Protobuf descriptors.
- Index definitions like COUNT, MAX, SUM aggregates.
- A query planner over multiple indexes.
- Versioned records and meta-data subspace.
But the storage shape and atomicity story are exactly the same.
Hitchhikerβs Guide β Option C: Record Layer with Secondary Indexes
The question this answers: βHow does a database maintain secondary indexes so they always agree with the primary data β even under concurrent writes?β
The deeper question: βWhat does Apple actually do inside iCloud, and why did they choose FoundationDB as the substrate?β
Table of Contents
- What the Record Layer Is
- Records vs Indexes: Two Subspaces, One Truth
- The Index Key Format β Anatomy of a Lookup Key
- PutRecord β Read-Modify-Write as a Pattern
- LookupByIndex β Pipelining Explained
- The encodeInt64 Sign-Bit Trick (Derived from First Principles)
- Index Consistency: The Core Guarantee
- The Apple Story: fdb-record-layer and CloudKit
- Real-World Analogues: MySQL, PostgreSQL, MongoDB
- Exercises
1. What the Record Layer Is
The FoundationDB Record Layer is a Java library open-sourced by Apple in 2019. It provides a record-oriented storage layer above FDB with:
- Typed records using Protobuf descriptors
- Declarative indexes (define once; maintenance is automatic)
- A query planner that compiles predicates into index range scans
- Schema evolution without downtime
- Change feeds via versionstamped keys
Our implementation is a Go reimagining: records are map[string]any
(instead of Protobuf messages), indexes are field names (instead of
declarative index definitions), and there is no query planner β you call
LookupByIndex directly. But every core concept is the same.
Why a βrecord layerβ and not just βa KV storeβ?
The moment you have more than one way to find a record, you need an index. The moment you have an index, you have two copies of the same data in two different shapes. The moment you have two copies, you need a protocol for keeping them consistent. The record layer is that protocol.
2. Records vs Indexes: Two Subspaces, One Truth
Our key space uses two tag bytes to separate the two subspaces:
Records subspace:
<ns> + 0x00 + <schemaName> + 0x00 + <pk>
β value: msgpack(record)
Indexes subspace:
<ns> + 0x01 + <schemaName> + 0x00 + <fieldName> + 0x00 + <encodedFieldValue> + 0x00 + <pk>
β value: "" (empty β the key IS the data)
Example with namespace "demo", schema "user", user Alice (pk="alice")
with field city="Paris":
Records:
64 65 6d 6f 00 75 73 65 72 00 61 6c 69 63 65
"demo" 00 "user" 00 "alice"
β msgpack({id:"alice", name:"Alice", city:"Paris"})
Indexes (city field):
64 65 6d 6f 01 75 73 65 72 00 63 69 74 79 00 50 61 72 69 73 00 61 6c 69 63 65
"demo" 01 "user" 00 "city" 00 "Paris" 00 "alice"
β "" (empty value)
Why the value is empty for indexes:
The index key encodes everything you need to find the record: which schema, which field, what value, and what the PK is. The value at an index key carries no additional information. This is identical to how MySQLβs InnoDB secondary indexes work: the B-tree leaf stores just the clustered PK, not the full row.
Why the subspace separation matters:
- A full table scan (read all records) = GetRange(ns+0x00+schema, ns+0x00+schema+0x01), which hits only records, not indexes.
- An index scan (find all users where city="Paris") = GetRange(ns+0x01+schema+0x00+city+0x00+Paris, ns+0x01+schema+0x00+city+0x00+Paris+0x01), which hits only that index column/value, not other indexes or records.
- Clearing all indexes for a schema = ClearRange(ns+0x01+schema, ns+0x01+schema+0x01).
With one tag byte, ranges are perfectly clean. Without it, youβd need carefully chosen sentinels and youβd still risk collisions.
3. The Index Key Format β Anatomy of a Lookup Key
Letβs trace exactly what key is written for each user:
Alice (pk="alice", city="Paris"):
index key = ns + 0x01 + "user" + 0x00 + "city" + 0x00 + "Paris" + 0x00 + "alice"
Bob (pk="bob", city="Tokyo"):
index key = ns + 0x01 + "user" + 0x00 + "city" + 0x00 + "Tokyo" + 0x00 + "bob"
Carol (pk="carol", city="Paris"):
index key = ns + 0x01 + "user" + 0x00 + "city" + 0x00 + "Paris" + 0x00 + "carol"
In FDBβs sorted key space, these appear as:
(sorted by bytes)
...
ns+0x01+"user"+0x00+"city"+0x00+"Paris"+0x00+"alice" β ""
ns+0x01+"user"+0x00+"city"+0x00+"Paris"+0x00+"carol" β ""
ns+0x01+"user"+0x00+"city"+0x00+"Tokyo"+0x00+"bob" β ""
...
A GetRange for city="Paris" scans from ...+"Paris"+0x00 to
...+"Paris"+0x01, returning exactly two keys (Alice and Carol). The PKs
("alice", "carol") are embedded in the key suffix. This is the index scan.
The extractPkFromIndexKey function:
Given the raw FDB index key, we need to extract just the PK suffix. The PK
starts after the last 0x00 byte:
func extractPkFromIndexKey(key fdb.Key) []byte {
for i := len(key) - 1; i >= 0; i-- {
if key[i] == 0x00 {
return key[i+1:]
}
}
return nil
}
We scan backwards because the PK itself might contain 0x00 bytes if it
was a binary key. Scanning backwards from the end finds the last 0x00,
which is the separator between the encoded field value and the PK. This is
safe because the encoded field value ends before that last 0x00, and the
PK follows.
A subtle limitation: if the PK itself can contain 0x00, this parsing
is ambiguous. A production implementation uses the Tuple encoding, which
escapes 0x00 bytes within strings as \x00\xFF, so a bare \x00 terminator
is unambiguous.
4. PutRecord β Read-Modify-Write as a Pattern
This is the most important function in the record layer. Letβs walk through it completely:
func (s *Store) PutRecord(schema Schema, pk string, rec Record) error {
_, err := s.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
// Step 1: Read the OLD record (if it exists)
oldBytes, _ := tr.Get(recordKey(s.ns, schema.Name, pk)).Get()
// Step 2: Decode the old record
var oldRec Record
if len(oldBytes) > 0 {
msgpack.Unmarshal(oldBytes, &oldRec)
}
// Step 3: Remove OLD index entries
for _, field := range schema.Indexes {
if oldVal, ok := oldRec[field]; ok {
tr.Clear(indexKey(s.ns, schema.Name, field, oldVal, pk))
}
}
// Step 4: Write the NEW record
encoded, _ := msgpack.Marshal(rec)
tr.Set(recordKey(s.ns, schema.Name, pk), encoded)
// Step 5: Write NEW index entries
for _, field := range schema.Indexes {
if newVal, ok := rec[field]; ok {
tr.Set(indexKey(s.ns, schema.Name, field, newVal, pk), []byte{})
}
}
return nil, nil
})
return err
}
Why must you read the old record before writing the new one?
Consider Alice moving from Paris to Tokyo:
Before: {id:"alice", name:"Alice", city:"Paris"}
After: {id:"alice", name:"Alice", city:"Tokyo"}
We need to:
- Remove the old index entry for city=Paris, pk=alice
- Add a new index entry for city=Tokyo, pk=alice
- Write the new record
We cannot do step 1 without knowing that Alice was previously in Paris. That information lives in the old record. So we read it first.
Why inside one transaction?
If we split this into two transactions:
Transaction 1: read old record, decide to clear Paris index
Transaction 2: clear Paris index, write new record, write Tokyo index
A crash between T1 and T2 leaves the Paris index entry pointing to a record that now says Tokyo. This is a phantom index entry β a stale pointer that silently returns wrong results.
With all steps in one transaction: either the entire operation commits (both index update and record update) or neither does. The index and the record always agree.
This is what database people mean when they say atomicity eliminates a class of bugs. Itβs not just about performance; itβs about correctness.
The cost: one extra read per update.
Every PutRecord does one Get (the old record) before writing. This adds
the record's key to the transaction's read-conflict range. If two writers
try to update the same record concurrently, one will get a conflict and retry.
This is acceptable and correct β the alternative (no-conflict update) risks
interleaving index writes.
5. LookupByIndex β Pipelining Explained
This is where we use FDBβs pipelining to make multi-key lookups efficient.
The naive approach (one round-trip per record):
// BAD: N sequential round-trips for N results
pks := getIndexScan(...) // one round-trip: get all PKs from index
for _, pk := range pks {
rec := getRecord(pk) // one round-trip each: N round-trips total
results = append(results, rec)
}
For N=100, this is 101 network round-trips. At 1ms per round-trip, thatβs ~100ms for a simple lookup. Unacceptable.
The pipelined approach (two round-trips total):
// GOOD: pipeline all Get calls, then collect all results
_, err = s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
// Phase 1: issue ALL Get calls. None block yet.
futures := make([]fdb.FutureByteSlice, len(pks))
for i, pk := range pks {
futures[i] = rt.Get(recordKey(s.ns, schema.Name, pk))
// rt.Get returns a Future immediately β it has NOT waited for a response.
}
// Phase 2: now block on each future in sequence.
for i, fut := range futures {
b, _ := fut.Get() // blocks until this specific response arrives
// By the time we block on futures[0], the responses for futures[1]
// through futures[N-1] may already be in flight or have arrived.
...
}
})
Phase 1 sends N requests in rapid succession without waiting for any response.
The FDB client library batches these into its network buffer and sends them.
The storage servers process all N in parallel. By the time we call Get() on
the first future, many responses have already arrived.
Network diagram:
Without pipelining: With pipelining:
β Get(pk1) β Get(pk1), Get(pk2), Get(pk3), ... (burst)
β reply β reply(pk1)
β Get(pk2) β reply(pk2)
β reply β reply(pk3)
... (collect all)
Total: N round-trips Total: ~1 round-trip worth of latency
Why this works in FDBβs model:
FDB transactions are snapshot reads. All Get calls within one
ReadTransact closure see the same version of the database. There are no
cross-key dependencies or ordering constraints on reads within a transaction.
So the FDB client can send all reads in any order, and they will all return
consistent (same read version) values.
This is the same principle as HTTP pipelining, Redis pipelining, or database query batching, but with a guarantee that all reads come from the same snapshot.
6. The encodeInt64 Sign-Bit Trick (Derived from First Principles)
When indexing a numeric field (e.g., age: 25), we want the index to sort
numerically. Bytes sort lexicographically. How do we make byte-sort match
numeric sort for 64-bit signed integers?
Step 1: Start with what we have.
A 64-bit signed integer in twoβs complement:
 0  = 0x0000_0000_0000_0000
 1  = 0x0000_0000_0000_0001
-1  = 0xFFFF_FFFF_FFFF_FFFF
-2  = 0xFFFF_FFFF_FFFF_FFFE
MIN = 0x8000_0000_0000_0000  (-2^63)
MAX = 0x7FFF_FFFF_FFFF_FFFF  (2^63 - 1)
Big-endian byte-sort order: 0, 1, MAX, MIN, -2, -1. Every negative value's first byte is at least 0x80, while every non-negative value's is at most 0x7F, so all negatives sort after all non-negatives. Big-endian two's complement therefore does NOT sort correctly across the sign boundary.
Step 2: What would correct order look like?
We want the sorted sequence: β¦, -2, -1, 0, 1, β¦, MAX. In bytes, each element should be strictly less than the next. If we could assign distinct 64-bit unsigned values to each signed value such that the unsigned order matches the signed order, weβd be done.
Step 3: The offset encoding idea.
One option: add 2^63 to shift the range. Then:
-2^63 + 2^63 = 0
-1 + 2^63 = 2^63 - 1
0 + 2^63 = 2^63
1 + 2^63 = 2^63 + 1
2^63-1 + 2^63 = 2^64 - 1
These are now unsigned values in the range [0, 2^64-1], in the same order as the original signed values. And they sort correctly in big-endian bytes!
Step 4: Addition vs. XOR.
Adding 2^63 is the same as flipping just the sign bit (bit 63): the addend has a single 1 bit at position 63, so the only bit the addition can change within a 64-bit word is bit 63 itself; any carry out of bit 63 is discarded by 64-bit wraparound.
More precisely: x + 2^63 = x XOR 2^63 for all x in the range
[-2^63, 2^63-1]. This is because:
- If x β₯ 0: bit 63 of x is 0; adding 2^63 sets it to 1. XOR with 1<<63 also sets bit 63. Same result.
- If x < 0: bit 63 of x is 1; adding 2^63 causes a carry into bit 64, which is discarded in 64-bit arithmetic, leaving bit 63 as 0. XOR with 1<<63 also clears bit 63. Same result.
So the implementation is just:
func encodeInt64(x int64) []byte {
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
return b
}
Verification:
-2 β uint64(-2) = 0xFFFFFFFFFFFFFFFE β XOR 0x8000... β 0x7FFFFFFFFFFFFFFE
-1 β 0xFFFFFFFFFFFFFFFF β XOR β 0x7FFFFFFFFFFFFFFF
0 β 0x0000000000000000 β XOR β 0x8000000000000000
1 β 0x0000000000000001 β XOR β 0x8000000000000001
2 β 0x0000000000000002 β XOR β 0x8000000000000002
Sorted by bytes: 0x7FFFFFFFFFFFFFFE, 0x7FFFFFFFFFFFFFFF, 0x8000000000000000, 0x8000000000000001, 0x8000000000000002, which is the same order as -2, -1, 0, 1, 2. Correct.
This is the standard encoding used by:
- FDBβs official Tuple layer (negative integers)
- Apache HBase row key encoding
- Google Cloud Bigtable ordered keys
- Apache Flink and Spark for sorted key encodings
7. Index Consistency: The Core Guarantee
Letβs state it precisely: at any point in time, for every record with field
F=V and PK=P, there exists an index entry at (F, V, P). And there are no
index entries without corresponding records.
Formally:
forall records: indexEntries(record) β presentIndexKeysforall indexKey: correspondingRecord exists
Both directions. Our PutRecord maintains this invariant within one
transaction:
- Remove old index entries β breaks βno stale entriesβ for old value
- Write new record β old record value is overwritten
- Write new index entries β restores consistency
Between steps 1 and 3, within the transaction, the invariant is temporarily broken. But no other transaction sees this intermediate state β FDBβs isolation means only the final committed state is visible.
What happens if the invariant is violated?
A phantom index entry (index says Alice is in Paris, record says Tokyo):
- LookupByIndex(city="Paris") returns Alice
- Fetching Alice's record shows city=Tokyo
- Application logic sees inconsistent data
This is precisely the class of bugs that a record layer is designed to prevent.
8. The Apple Story: fdb-record-layer and CloudKit
Apple's CloudKit powers iCloud sync for iOS and macOS apps, handling hundreds of millions of devices syncing calendar events, notes, photo metadata, and app-specific data. This is the production scale at which fdb-record-layer operates.
Why FDB?
Before FDB, Apple had a custom distributed database for CloudKit. It had the usual problems: hard to scale, hard to operate, hard to maintain consistency under failures. FDB was acquired in 2015. Over the next few years Apple rebuilt CloudKitβs storage engine on FDB.
Why a record layer on top?
FDB is a KV store. CloudKit stores typed records (calendar events have a
start_date, end_date, title; notes have body, attachments). You
cannot just say βfind all events starting tomorrowβ to an FDB cluster β you
need to translate that query into a range scan over an encoded index.
fdb-record-layer provides:
- Protobuf-typed records (schema validation, efficient encoding)
- Value indexes, rank indexes, text indexes, aggregate indexes
- A query planner that selects the best index for a given predicate
- Change feed via versionstamped index keys (see GUIDE.md chapter on versionstamps)
- Schema evolution (add fields with defaults, deprecate fields, change types safely)
Our implementation omits the Protobuf (replaced with map[string]any) and
the query planner (you call LookupByIndex directly). But the index key
layout and the PutRecord read-modify-write pattern are the same.
9. Real-World Analogues
MySQL InnoDB Secondary Indexes
In InnoDB, a secondary index B-tree contains:
(secondary_key_value, primary_key_value) β ""
Identical to our (field, encodedValue, pk) β "". To look up a row by
a secondary key:
- Range-scan the secondary index to get PKs.
- Look up each PK in the clustered index (the primary B-tree).
This two-step process is sometimes called a "double lookup" or, in SQL Server terminology, a "bookmark lookup". Our LookupByIndex does exactly this.
The reason InnoDB stores the PK (not the heap pointer) in secondary indexes: when a row is updated and moves in the clustered index, secondary indexes donβt need to be updated (they still point to the same PK value, which now lives in a different B-tree page but is found by the clustered index scan).
PostgreSQL GIN / B-Tree Indexes
PostgreSQL secondary indexes store (index_value → ctid), where ctid is the physical page+slot address of a row version. This makes lookups one step instead of two, but when an UPDATE writes a new row version at a new ctid (and the HOT optimization does not apply), every index on the table needs a new entry.
PostgreSQL also supports index-only scans: when every column a query needs is stored in the index (including INCLUDE columns) and the visibility map marks the heap page all-visible, the heap fetch is skipped entirely, trading index space for read speed.
MongoDB Indexes
MongoDB maintains each index as a separate B-tree with one entry per document, keyed by (indexed value, record id), conceptually:
(city:"Paris", "alice"), (city:"Paris", "carol"), (city:"Tokyo", "bob")
MongoDB's analog of LookupByIndex shows up in explain() output as an IXSCAN stage followed by a FETCH stage. The FoundationDB Document Layer (which speaks the MongoDB wire protocol and stores documents in FDB) used exactly our two-subspace approach: one FDB range for documents, one for each index.
10. Exercises
Exercise 1 β COUNT Aggregate Index
Add a special βcountβ index that maintains a per-schema counter:
ns + 0x02 + schemaName + 0x00 β msgpack(int64)
In PutRecord, if old record doesnβt exist, increment. In DeleteRecord,
decrement. Both inside the same transaction as the record+index update.
This gives you COUNT(*) in O(1) instead of a full table scan.
Exercise 2 β Range Index Query
Add LookupByIndexRange(schema, field, minVal, maxVal any) ([]Record, error).
Since the index is sorted by encoded field value, you can do:
begin = indexPrefix(ns, schema, field) + encodeValue(minVal)
end = indexPrefix(ns, schema, field) + encodeValue(maxVal) + 0x01
GetRange(begin, end)
This enables WHERE age >= 25 AND age <= 35 as a single range scan. (Note the end key encodeValue(maxVal) + 0x01 makes the upper bound inclusive; for an exclusive bound like age < 35, end the range at encodeValue(maxVal) with no suffix.)
Make sure encodeValue produces sort-preserving bytes for your supported
types (use encodeInt64 for integers).
Exercise 3 β Composite Indexes
Add support for multi-field indexes:
Schema{Name: "event", Indexes: [{"start", "end"}, {"city", "name"}]}
A composite index on (start, end) allows efficient queries like:
WHERE start >= monday AND end <= friday. The key would be:
ns + 0x01 + schema + 0x00 + "start,end" + 0x00 + encodeValue(start) + 0x00 + encodeValue(end) + 0x00 + pk
Exercise 4 β Covering Indexes
A covering index stores the full record at the index key (instead of just
""). Then LookupByIndex returns full records from the index scan alone,
with no secondary lookup:
index key β msgpack(record) (instead of β "")
Upside: LookupByIndex requires only one FDB range scan, no Get calls.
Downside: PutRecord must update index values when any indexed field changes.
Compare the trade-off with real-world covering indexes in MySQL and
PostgreSQL.
Exercise 5 β Versionstamp-Based Change Feed
Use FDBβs SetVersionstampedKey in PutRecord to write a change log entry:
ns + 0xFF + <10-byte versionstamp> β msgpack({schema, pk, change_type})
Build a SubscribeChanges(since versionstamp) (<-chan Change, error) function
that uses GetRange + Watch to stream changes in real time. This is a
simplified version of fdb-record-layerβs change feed.
11. Source Code Deep Dive
encoding.go β The Key Construction Functions
// tagRecords = 0x00, tagIndexes = 0x01
func recordKey(ns []byte, schema, pk string) fdb.Key {
key := append([]byte{}, ns...) // copy ns so append never aliases the caller's slice
key = append(key, tagRecords)
return fdb.Key(append(key, schema+"\x00"+pk...))
}
func indexKey(ns []byte, schema, field string, val any, pk string) fdb.Key {
key := append([]byte{}, ns...)
key = append(key, tagIndexes)
return fdb.Key(append(key, schema+"\x00"+field+"\x00"+encodeIndexValue(val)+"\x00"+pk...))
}
Why embed the separator \x00 as a Go string literal?
The \x00 byte is the separator between components. Using it directly in a string literal is readable and avoids a separate bytes.Join call. The key correctness requirement: no component value may contain \x00 unescaped. For string field values from map[string]any, callers must validate or escape input. The Tuple encoding in fdb-record-layer handles this by escaping \x00 inside strings as \x00\xFF.
encodeIndexValue β Dispatching on Type
func encodeIndexValue(val any) string {
switch v := val.(type) {
case int64:
return string(encodeInt64(v))
case float64:
// JSON numbers unmarshal as float64 in Go
return string(encodeInt64(int64(v)))
case string:
return v
default:
return fmt.Sprintf("%v", v)
}
}
The float64 case: In Go's encoding/json, all JSON numbers unmarshal to float64 when the target type is any. This means {"age": 25} gives you age = float64(25). The encoding converts float64 to int64 first. This is lossy for non-integer floats – a production implementation would have a separate encodeFloat64 operating on the math.Float64bits representation (flip all bits for negatives, only the sign bit for non-negatives, so byte order matches numeric order).
store.go – ScanRecords with Cursor Logic
func (s *Store) ScanRecords(schema string, limit int) ([]Record, error) {
	res, err := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
		return rt.GetRange(recordRange(s.ns, schema), fdb.RangeOptions{
			Limit: limit,
		}).GetSliceWithError()
	})
	if err != nil {
		return nil, err
	}
	var recs []Record
	for _, kv := range res.([]fdb.KeyValue) {
		var rec Record
		if err := msgpack.Unmarshal(kv.Value, &rec); err != nil {
			return nil, err
		}
		recs = append(recs, rec)
	}
	return recs, nil
}
fdb.RangeOptions{Limit: limit} instructs FDB to stop returning results after limit key-value pairs. This is pushed down to the storage server – the server stops scanning early, reducing both network bandwidth and server-side work.
The continuation problem: For a dataset with 1 million records, a single GetRange with no limit would transfer all 1M records in one response. This hits FDB's per-transaction byte limit (~10 MB) and would transfer far more data than needed. A production scan should use a continuation token:
type Continuation struct {
LastPK string
}
func (s *Store) ScanRecordsPage(schema string, limit int, cont *Continuation) ([]Record, Continuation, error) {
begin := recordRange(s.ns, schema).Begin
if cont != nil {
// Start strictly after the last-seen PK: appending 0x00 yields
// the immediate successor of a key in byte order.
begin = fdb.Key(append(recordKey(s.ns, schema, cont.LastPK), 0x00))
}
// GetRange from begin to end of schema range, limit = limit
...
}
Each page call fetches limit records starting after the continuation. The caller loops, passing the continuation from each response into the next call.
index.go – LookupByIndex Pipelining in Detail
func (s *Store) LookupByIndex(schema Schema, field string, value any) ([]Record, error) {
	prefix := indexPrefix(s.ns, schema.Name, field, value)
	// Step 1: get all PKs from the index (one round-trip).
	// Note: prefix+0xFF as the range end assumes PKs never start with
	// 0xFF; fdb.Strinc(prefix) is the robust way to end a prefix range.
	res, err := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
		return rt.GetRange(
			fdb.KeyRange{Begin: prefix, End: fdb.Key(append(prefix, 0xFF))},
			fdb.RangeOptions{},
		).GetSliceWithError()
	})
	if err != nil {
		return nil, err
	}
	kvs := res.([]fdb.KeyValue)
	pks := make([]string, len(kvs))
	for i, kv := range kvs {
		pks[i] = string(extractPkFromIndexKey(kv.Key))
	}
	// Step 2: fetch all records by PK (pipelined, one round-trip).
	recs := make([]Record, 0, len(pks))
	_, err = s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
		// Issue every Get before blocking on any future, so the client
		// pipelines all N requests to the storage servers.
		futures := make([]fdb.FutureByteSlice, len(pks))
		for i, pk := range pks {
			futures[i] = rt.Get(recordKey(s.ns, schema.Name, pk))
		}
		for _, fut := range futures {
			b, err := fut.Get()
			if err != nil {
				return nil, err
			}
			var rec Record
			if err := msgpack.Unmarshal(b, &rec); err != nil {
				return nil, err
			}
			recs = append(recs, rec)
		}
		return nil, nil
	})
	if err != nil {
		return nil, err
	}
	return recs, nil
}
Two round-trips for any number of results: Step 1 is one GetRange – one round-trip. Step 2 issues N Get calls inside one transaction – they are pipelined to the storage servers, returning in approximately one round-trip (the latency of the slowest individual read, not N × single-read latency).
Why two separate transactions?
The index scan (step 1) and the record fetches (step 2) could be in one ReadTransact closure – they would both see the same read version. We separate them here for clarity. In production, combining them into one transaction is slightly more efficient: one beginRead call, one read version negotiation, one transaction context.
12. Production Considerations
12.1 Write Amplification Under Index Updates
Every PutRecord that changes an indexed field performs:
- 1 read (old record)
- 1 clear (old index entry)
- 1 set (new record)
- 1 set (new index entry)
For a schema with K indexes and a write that changes all K indexed fields:
- 1 read (the old record – it already embeds all K old indexed values, so a single read suffices)
- K clears (old index entries)
- 1 set (new record value)
- K sets (new index entries)
Total mutations: K clears + K+1 sets = 2K+1 write operations per update. For a schema with 5 indexes, every update writes 11 keys. This is standard for indexed databases – MySQL and PostgreSQL have the same write amplification for secondary indexes. FDB's commit pipeline absorbs this efficiently because all the sets and clears go into one transaction batch.
Mitigation: Only maintain indexes on fields that are actually queried. Don't add indexes speculatively.
12.2 Online Index Building
Adding a new index to a schema that already has millions of records requires backfilling: writing index entries for every existing record. This must be done without taking the database offline.
The safe approach:
1. Start writing new index entries in PutRecord immediately (for new and updated records).
2. In a background worker, page through all existing records (using the continuation pattern) and write their index entries.
3. Once the background worker reaches the "frontier" (current write position), the index is consistent.
During step 2, queries against the new index may return incomplete results (missing records that haven't been backfilled yet). This is acceptable if the application can tolerate "eventually consistent" index builds. FDB's versionstamp mechanism can track the backfill position and indicate to callers which index entries are "safe" (backfilled) vs "tentative."
12.3 N+1 Read Problem
LookupByIndex with N results does:
- 1 index range scan
- N record reads (pipelined)
If the calling code then calls LookupByIndex for each result (e.g., "find all events in Paris, then for each event, find all attendees"), you get M index scans + M×N record reads. This is the N+1 problem from ORM land, applied to FDB.
Solution: Batch the secondary lookups. Collect all PKs from all index scans, then issue one pipelined batch of record reads. FDBβs pipelining makes this efficient even when batching hundreds of PKs.
13. Interview Questions – Secondary Indexes and Record Layers
Q: Why does PutRecord need to read the old record before writing the new one?
To maintain index consistency. If a record's indexed field value changes (e.g., city: "Paris" → "Tokyo"), the old index entry (city=Paris,pk=alice) must be deleted and a new entry (city=Tokyo,pk=alice) must be created – in the same transaction as the record write. To know the old index value, we need the old record. Without reading the old record, we can't know which index keys to clear, so stale index entries would accumulate, pointing to records with wrong field values.
Q: How does LookupByIndex achieve two round-trips regardless of result count?
The index range scan (step 1) is one GetRange that returns all PKs for the queried field value – one round-trip. The record reads (step 2) are issued as N rt.Get calls inside one ReadTransact closure. The FDB client library sends all N requests before waiting for any response (pipelining). The storage servers process them in parallel. By the time the first response arrives, most others are already in flight or returned. The effective latency is one round-trip, not N.
Q: What is the encodeInt64 function doing, and why is it necessary?
FDB keys are sorted lexicographically by bytes. For integer field values to sort correctly as numbers (-1 < 0 < 1 < 2), the byte encoding must be sort-preserving. Two's complement big-endian is almost right, but fails at the sign boundary: -1 (0xFFFF...) sorts after MAX_INT (0x7FFF...) in byte order. Flipping the sign bit (XOR with 0x8000_0000_0000_0000) shifts the entire integer range to unsigned, making byte order match signed integer order. The result: every negative number's encoding is less than every non-negative number's encoding, and within each sign region, the encoding preserves order.
Q: How would you implement multi-tenant isolation so that one tenant's records are never visible to another?
Add a tenant identifier to the namespace prefix: ns = globalPrefix + tenantId. All record and index keys for that tenant are prefixed with their unique tenant bytes. A range scan by tenant A (GetRange(ns_A, ns_A + 0xFF)) only returns keys in tenant A's namespace. The FDB cluster enforces this: there are no keys shared between tenant namespaces, so no data can leak across tenants. Additionally, for strict isolation, use FDB's tenant feature (available in 7.x) which restricts a transaction to a specific tenant prefix at the FDB server level.