FoundationDB Layers β€” Go Implementations

Five self-contained Go implementations that show different ways to build a β€œlayer” on top of FoundationDB. Modeled after real projects (mvsqlite, fdb-record-layer) but pared down for clarity.

Layout

| Folder | Plugs in | Teaches |
|---|---|---|
| option-a-leveldb | Above LevelDB | LevelDB’s external API: iterators, snapshots, write batches |
| option-a-sqlite | Above SQLite | How SQL decomposes into storage ops; vtab query planning |
| option-b-leveldb | Below LevelDB | LevelDB internals: SST files, WAL, MANIFEST, CURRENT |
| option-b-sqlite | Below SQLite | SQLite internals: page model, VFS, journaling |
| option-c-record-layer | Directly on FDB | Native FDB layer: records + secondary indexes |

Prerequisites

  1. Docker β€” to run a local single-node FDB cluster
  2. FoundationDB client library on the host (libfdb_c) β€” required by the Go bindings (CGO)
    • macOS: brew install foundationdb (or install the official .pkg from the FoundationDB GitHub releases page)
  3. Go 1.22+

Bootstrap the cluster

docker compose up -d
./scripts/bootstrap-fdb.sh

This creates ./fdb.cluster at the repo root. Every demo reads it via the relative path ../fdb.cluster.
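
For reference, a minimal sketch of what each demo’s startup looks like with the official Go bindings (the API version number and the error handling here are illustrative, not the demos’ exact code):

package main

import (
    "fmt"

    "github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
    fdb.MustAPIVersion(730)                      // must match the installed client library
    db := fdb.MustOpenDatabase("../fdb.cluster") // the file created by bootstrap-fdb.sh

    // Round-trip one key to prove the connection works.
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Set(fdb.Key("hello"), []byte("world"))
        return nil, nil
    })
    if err != nil {
        panic(err)
    }
    fmt.Println("connected")
}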

To shut down: docker compose down (data persists in ./fdb-data).

To wipe everything: docker compose down -v && rm -rf fdb-data fdb-config fdb.cluster.

Running a demo

Each option is an independent Go module:

cd option-a-leveldb
go run ./demo

See each folder’s docs/ for an architecture walk-through.

Documentation (mdbook)

The docs are structured as an mdbook under book/. To read them locally:

# Install mdbook (once)
brew install mdbook

# Build the book (output β†’ book/dist/)
mdbook build

# Or serve with live-reload
mdbook serve --port 3001

The book is organized into two parts:

  • Background β€” The Hitchhiker’s Guide split into six chapters (storage stack, FDB internals, the layer concept, real-world systems, repo overview, further reading)
  • Labs β€” One overview + architecture walk-through per option

The Hitchhiker’s Guide to FoundationDB and Storage Layers

β€œDon’t Panic.”

This guide assumes you have seen a key–value store before but have never thought hard about what happens underneath. By the end you should be able to design and build your own layer on top of FoundationDB β€” not just copy one.


What This Guide Covers

Six chapters that take you from first principles all the way to production-grade distributed systems. Each chapter can be read on its own, but they build on one another in order.

| Chapter | Topic | What you’ll know after |
|---|---|---|
| 1. The Storage Stack | B-trees, LSM-trees, WAL, buffer pools | Why every database makes the same five trade-offs |
| 2. FoundationDB in Depth | MVCC, commit pipeline, cluster roles, simulation | How FDB achieves correctness and why it’s trusted at Apple scale |
| 3. The Layer Concept | Key encoding, subspace pattern, atomicity | How to turn an ordered KV store into any data model |
| 4. How Real Systems Use FDB | CloudKit, mvsqlite, Document Layer, TiKV | That the patterns in this repo are battle-tested in production |
| 5. This Repository | The five lab implementations | How to navigate, run, and extend the labs |
| 6. Reading Guide | Papers, books, source code to read next | Where to go to become a true expert |

How to Read This

If you have 2 hours: Read chapters 1–3, then pick any one lab and read its Architecture Walk-through.

If you have a day: Read all six chapters, then work through all five labs in order (A-leveldb β†’ A-sqlite β†’ B-leveldb β†’ B-sqlite β†’ C-record-layer). The labs increase in conceptual complexity.

If you’re preparing for a systems interview: Focus on chapters 1–3 and the Interview Q&A sections at the end of each lab guide.

If you just want to run the code: Jump straight to Chapter 5 β€” This Repository for quick-start instructions, then come back to the background chapters for depth.


Core Insight

Every database is a stack of layers, each one a client of the one below. FoundationDB occupies one specific layer: ordered, transactional, distributed key-value storage. Every layer above it β€” SQL, document model, LevelDB API, SQLite page store, record layer β€” is a pure encoding problem. Learn the encoding and you understand the database.

That is what this guide teaches.

The Storage Stack β€” Where Everything Lives

Every database you have ever used is a stack of layers, each hiding detail from the layer above. Knowing exactly where FoundationDB sits in that stack β€” and why that position is powerful β€” is the foundation of everything else in this guide.


The Canonical Stack

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Application  (SQL, gRPC, REST API)         β”‚  ← you usually live here
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Query Planner / Optimizer                  β”‚  ← rewrites queries, picks indexes
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Execution Engine  (joins, aggregates)      β”‚  ← evaluates the query plan
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Storage Engine  (rows, columns, indexes)   β”‚  ← where FDB plays
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Page / Buffer Cache                        β”‚  ← in-memory hot pages
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  File / Block I/O  (VFS, pread/pwrite)      β”‚  ← option-b-sqlite plugs in here
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Kernel: page cache, scheduler, filesystem  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Physical: SSD / NVMe / network storage     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The interesting question is: which layer do you own, and what contract does it expose downward?

Most database literature focuses on the top (SQL, query plans). Most OS literature focuses on the bottom (filesystems, block I/O). The storage engine layer β€” the middle β€” is where the real trade-offs live: ordering, atomicity, durability, and concurrency.

FoundationDB is a storage engine that exposes a remarkably clean API to everything above it. This repository explores five different ways to build on top of that API.


B-Trees vs. Log-Structured Merge Trees

Every practical storage engine is built on one of two foundational data structures. Understanding both is essential for reasoning about FDB’s design and the designs of the engines it replaces.

B-Trees

A B-tree stores data as a balanced tree of fixed-size pages (usually 4–16 KB). Updates happen in-place: to change a value, you find its page, overwrite it, and write it back to disk.

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  [20] [50]   β”‚  ← interior node (keys only)
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   /        |        \
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚ [5][12]  β”‚  β”‚[25][35]  β”‚  β”‚[55][80]  β”‚  ← leaf nodes (key+value)
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Properties:

  • Read amplification: O(log N) page reads to find a key (very good)
  • Write amplification: ~1–3Γ— (one write per page touched, but journaling doubles it)
  • Space amplification: ~33% overhead on average (partially filled pages)
  • Random writes: each update may touch a different page β†’ high random I/O on spinning disks

Why B-trees dominated for 40 years: HDDs were sequential-read fast, random-read slow. B-trees minimize the number of seeks to reach a key. SSDs changed this equation β€” random reads became cheap.

Used by: SQLite, PostgreSQL, MySQL InnoDB, Oracle, early MongoDB (MMAPv1)

Log-Structured Merge Trees (LSM Trees)

LevelDB, RocksDB, and Cassandra’s storage engine are all LSM trees. The core idea: never overwrite; always append.

Write path:
  incoming write
      ↓
  WAL (append-only log on disk β€” for durability)
      ↓
  MemTable (skip list in memory β€” for fast writes)
      ↓ (when MemTable is full, flush to disk)
  Level 0 SSTables (sorted, immutable files)
      ↓ (background compaction)
  Level 1, 2, ... SSTables

Anatomy of an SSTable (Sorted String Table):

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
│ Block 0: apple→1, banana→2, cherry→3 ...    │
│ Block 1: date→4, elderberry→5, fig→6 ...    │
β”‚ ...                                          β”‚
│ Index block: [apple→offset0, date→offset1]  │
β”‚ Bloom filter: membership query, ~1% FP rate β”‚
β”‚ Footer: offsets to index+filter blocks      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Properties:

  • Write amplification: high (10–30Γ— in a 7-level LSM; data is rewritten during each level’s compaction)
  • Read amplification: moderate (must check all levels for a key; bloom filters help)
  • Space amplification: moderate (old versions linger until compaction)
  • Sequential writes: almost entirely sequential β†’ excellent on SSDs and HDDs alike

Why LSM trees dominate today: write throughput is often the bottleneck at scale, and LSM trees turn random writes into sequential appends. RocksDB (Facebook) and Cassandra (Apache) both run this design in production at enormous scale.

The RUM Conjecture

The RUM conjecture (Idreos et al., 2016) formalizes the fundamental trade-off:

You cannot minimize Read cost, Update cost, and Memory/space cost simultaneously. Optimizing two always worsens the third.

| Engine | Read (R) | Update (U) | Memory (M) | Best for |
|---|---|---|---|---|
| B-tree | Low | Medium | Medium | Read-heavy OLTP |
| LSM tree | Medium | Low | High | Write-heavy OLTP |
| Hash table | Very low | Medium | High | Point lookups only |
| Column store | Low | High | Low | Analytical queries (OLAP) |
| FDB (B-tree) | Low | Medium | Medium | Consistent distributed KV |

FDB uses a custom B-tree variant internally on each Storage Server. The distributed nature adds replication overhead but also provides horizontal read scaling.


Write-Ahead Log (WAL) β€” Crash Recovery Mechanics

Every durable storage engine uses a WAL. Understanding WAL mechanics is essential for understanding what FDB guarantees and what xSync (in the SQLite VFS) must do.

The problem: writing a page to disk is not atomic. A 4 KB page write may partially complete before a power failure. Result: corrupted page.

WAL solution:

1. Before modifying Page X, write the INTENT to the WAL:
      WAL record: {LSN: 1042, page: X, before: <old bytes>, after: <new bytes>}
2. fsync() the WAL record to disk (durability guarantee)
3. Now apply the change to Page X in the buffer cache
4. Page X is written back to disk lazily (the WAL already protects us)

On crash recovery:

  • Re-apply all WAL records since the last checkpoint β†’ consistent state
  • If a WAL record is incomplete (partially written) β†’ discard it, prior state is clean

FDB’s relationship with WAL: FDB’s Transaction Log layer IS the WAL. When a commit is acknowledged, the write is persisted in the Transaction Log on f+1 machines (where f is the fault tolerance factor). Storage Servers apply these writes asynchronously. This is why FDB’s commit is durable even if Storage Servers crash before applying the write β€” the Transaction Log holds the record.

This also explains why our SQLite VFS (option-b-sqlite) makes xSync a no-op: by the time Transact() returns, FDB has already done the equivalent of fsync on the WAL.


Buffer Pool Management

In any storage engine, the buffer pool (or page cache) is the in-memory cache of disk pages. It is often the single most important performance variable.

Key metrics:

  • Hit ratio: fraction of page reads served from cache (aim for >99%)
  • Eviction policy: LRU (Least Recently Used), LFU (Least Frequently Used), CLOCK
  • Write-back vs. write-through: most use write-back (dirty pages flushed lazily to amortize I/O)

Why this matters for FDB: FDB’s clients do not manage a buffer pool. Each Storage Server manages its own buffer pool internally. When you call rt.Get(key), the FDB client sends a network request; the Storage Server consults its buffer pool and returns the value. You have no visibility into whether this was a cache hit or a disk read.

For read-heavy workloads, adding more Storage Servers adds buffer-pool capacity and read bandwidth, so the cacheable working set and read throughput grow roughly linearly with server count.


Where Each Lab Sits in the Stack

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Application code  (demo/main.go)                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  option-a-leveldb  β”‚  option-a-sqlite  β”‚  option-c-record   β”‚
β”‚  (LevelDB API)     β”‚  (SQL engine)     β”‚  (Record + Index)  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  FoundationDB client library  (fdb.Transact / ReadTransact)  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚           FDB cluster (coordinators, proxies, storage)       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  option-b-sqlite   β”‚  option-b-leveldb                       β”‚
β”‚  (SQLite VFS)      β”‚  (LevelDB storage.Storage)             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚            FDB as the backing file store                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Options A and C sit above FDB — they use FDB as a storage substrate and build a new API on top. The Option B variants sit below an existing engine — they replace the file system with FDB while keeping the engine’s logic intact.

This is a crucial architectural distinction. In the A/C case, FDB’s ACID semantics are exposed to the caller. In the B case, FDB’s semantics are hidden inside a file-like abstraction, and the engine above (LevelDB, SQLite) provides its own ACID layer β€” layered on top of FDB’s.


Interview Questions

Q: What is write amplification and why does it matter?

Write amplification is the ratio of bytes written to disk versus bytes written by the application. A write amplification of 10Γ— means writing 1 MB of data causes 10 MB of disk writes. LSM trees have high write amplification (compaction rewrites data multiple times) but low latency writes. B-trees have low write amplification but high random-write I/O. At scale, write amplification directly determines storage cost and SSD wear.

Q: Why does an LSM tree use multiple levels?

Compaction merges SSTables from one level into the next. Without levels, you’d have to merge all SSTables at once — an O(total data) operation every time the memtable flushes. Leveled compaction keeps each level to a fixed size ratio (~10×), so each compaction is bounded. This bounds read amplification too: below level 0, at most one SSTable per level can contain a given key, so a lookup checks roughly L SSTables at most.

Q: What is the difference between a WAL and an SSTable?

A WAL (Write-Ahead Log) is an append-only sequence of write operations β€” it records WHAT changed, not the final state. It’s used for crash recovery. An SSTable is an immutable, sorted snapshot of actual data β€” it stores key-value pairs in sorted order for efficient lookup. The WAL is ephemeral (truncated after checkpointing). SSTables are persistent (they ARE the data).

Q: What makes FDB’s storage engine different from a traditional B-tree?

FDB’s storage servers use a custom B-tree variant, but the key difference is architectural: FDB is distributed, sharded, and replicated. Each Storage Server holds a shard (a key range). Reads are served directly by Storage Servers (bypassing the Proxy). Writes go through the Transaction Log for durability before being applied to Storage Servers. This means durability and availability are decoupled β€” you can lose a Storage Server without losing committed data.

FoundationDB in Depth

FoundationDB is one of the most carefully engineered distributed databases ever built. This chapter goes far beyond the API surface β€” it covers the commit pipeline, MVCC mechanics, cluster architecture, and the simulation harness that makes FDB trustworthy at Apple-scale production.


2.1 What FDB Actually Is

FoundationDB is a distributed, ordered, transactional key–value store. Each word carries precise meaning:

Distributed: data is automatically sharded across machines. You do not choose a partition key. FDB’s shard boundaries move dynamically as data grows or machines are added. The client library discovers shard locations by reading the cluster file and caching routing information.

Ordered: keys are stored in lexicographic byte order across the entire cluster. GetRange(begin, end) returns keys in sorted order regardless of which Storage Servers hold them. This is the single most powerful property for building layers, because co-location of related data is entirely under your control via key encoding.

Transactional: full ACID across arbitrary key ranges, on multiple machines, at unlimited scale. This is rare. DynamoDB offers single-item atomicity (later added limited cross-item transactions). Cassandra offers lightweight transactions (LWT) at one consistency level. Redis offers multi-key transactions (MULTI/EXEC) without conflict detection. FDB offers true serializable transactions with automatic conflict detection and retry β€” across any number of keys, on any number of machines.

Key–value: the value is an opaque byte string (max 100 KB). FDB does not care about structure inside the value. All richness is imposed by the layer above β€” this is by design. It keeps FDB’s guarantees simple and composable.


2.2 The Ordered KV Model β€” Deep Dive

The FDB key space is conceptually a single sorted array of byte arrays, spanning the entire cluster:

Key space (lexicographic byte order):
  ""  (smallest possible key)
  ...
  "\x00"
  "\x00\x00"
  ...
  "apple"
  "apple\x00price"       ← prefix scan works naturally
  "apple\x00weight"
  "banana"
  ...
  "\xff\xff\xff..."  (largest possible key)

Range semantics β€” half-open intervals:

All FDB ranges are [begin, end) β€” inclusive begin, exclusive end. This is universal in FDB’s API:

fdb.KeyRange{Begin: fdb.Key("apple\x00"), End: fdb.Key("apple\x01")}
// Returns: apple\x00price, apple\x00weight, apple\x00... everything
// Does NOT return: apple\x01, banana, ...

The trick of incrementing the last byte (\x00 β†’ \x01) creates a tight upper bound for prefix scans. This pattern appears in every layer in this repo.
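
A small helper captures the trick (a sketch; the official Go bindings also ship fdb.PrefixRange, which additionally handles prefixes ending in 0xFF):

// prefixRange returns the half-open range covering every key that starts
// with prefix, by incrementing the prefix's final byte. Assumes a non-empty
// prefix whose last byte is not 0xFF (fdb.PrefixRange handles that case).
func prefixRange(prefix []byte) fdb.KeyRange {
    end := append([]byte{}, prefix...)
    end[len(end)-1]++
    return fdb.KeyRange{
        Begin: fdb.Key(prefix),
        End:   fdb.Key(end),
    }
}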

Why ordering is the enabler for layers:

Without ordering, you could only do point lookups (O(1) Get). Ordering enables:

  • Prefix scans: all keys for user 42 (range scan on user:42:*)
  • Range queries: all records with age BETWEEN 25 AND 40 (range scan on encoded age values)
  • Sorted iteration: return all records in PK order (scan the records subspace)
  • Pagination: β€œnext page” = resume range scan from last seen key

Every SQL feature that involves ORDER BY, BETWEEN, prefix LIKE, or primary key range is implementable purely through ordered range scans.


2.3 Transactions: The Full Picture

ACID Guarantees

FDB transactions implement strict serializability — the strongest isolation level, combining serializability with the real-time ordering of linearizability (Herlihy & Wing, 1990). It is equivalent to:

  • Serializability (transactions appear to execute one at a time)
  • Real-time ordering (if T1 commits before T2 starts, T1 appears before T2)

Most databases offer weaker levels by default. PostgreSQL defaults to β€œRead Committed” (not serializable). MySQL InnoDB defaults to β€œRepeatable Read”. FDB always runs at strict serializability β€” there is no weaker option.

The Commit Pipeline

This is the key to understanding FDB’s performance characteristics:

Client                  Proxy               Resolver            Transaction Log       Storage Servers
  β”‚                       β”‚                    β”‚                      β”‚                     β”‚
  β”‚  1. Begin Txn         β”‚                    β”‚                      β”‚                     β”‚
  β”‚  (local, no network)  β”‚                    β”‚                      β”‚                     β”‚
  β”‚                       β”‚                    β”‚                      β”‚                     β”‚
  β”‚  2. Reads β†’ GetReadVersion                 β”‚                      β”‚                     β”‚
  β”‚ ─────────────────────►│                    β”‚                      β”‚                     β”‚
  β”‚ ◄─────────────────────│ readVersion=v100   β”‚                      β”‚                     β”‚
  β”‚                       β”‚                    β”‚                      β”‚                     β”‚
  β”‚  3. rt.Get(key)       β”‚                    β”‚                      β”‚                     β”‚
  β”‚ ─────────────────────────────────────────────────────────────────────────────────────►  β”‚
  β”‚ ◄─────────────────────────────────────────────────────────────────────────────────────  β”‚
  β”‚  (reads served directly by Storage Servers β€” Proxy NOT in critical read path)           β”‚
  β”‚                       β”‚                    β”‚                      β”‚                     β”‚
  β”‚  4. tr.Set(key,val)   β”‚                    β”‚                      β”‚                     β”‚
  β”‚  (buffered locally, no network yet)        β”‚                      β”‚                     β”‚
  β”‚                       β”‚                    β”‚                      β”‚                     β”‚
  β”‚  5. Commit: send {readVersion, readSet, writeSet}                  β”‚                     β”‚
  β”‚ ─────────────────────►│                    β”‚                      β”‚                     β”‚
  β”‚                       β”‚  6. Assign commitVersion=v101             β”‚                     β”‚
  β”‚                       β”‚ ──────────────────►│                      β”‚                     β”‚
  β”‚                       β”‚  7. Check conflictsβ”‚                      β”‚                     β”‚
  β”‚                       β”‚  (did anyone write a key in readSet       β”‚                     β”‚
  β”‚                       β”‚   between v100 and v101?)                 β”‚                     β”‚
  β”‚                       β”‚ ◄──────────────────│                      β”‚                     β”‚
  β”‚                       β”‚  8. If no conflict: write to TLog         β”‚                     β”‚
  β”‚                       β”‚ ─────────────────────────────────────────►│                     β”‚
  β”‚                       β”‚  9. TLog confirms durability (f+1 copies) β”‚                     β”‚
  β”‚                       β”‚ ◄─────────────────────────────────────────│                     β”‚
  β”‚  10. Commit ack       β”‚                    β”‚                      β”‚                     β”‚
  β”‚ ◄─────────────────────│                    β”‚                      β”‚                     β”‚
  β”‚                       β”‚  (Storage Servers apply writes asynchronously from TLog)        β”‚

Key insight from the pipeline:

  1. Reads bypass the Proxy β€” they go directly to Storage Servers. This means read throughput scales linearly with Storage Server count.
  2. Writes buffer locally β€” no network traffic until commit. A transaction that writes 1,000 keys generates exactly one network round-trip.
  3. Conflict checking is done by the Resolver β€” a separate stateless process that checks whether any key in the transaction’s read set was written by another transaction between readVersion and commitVersion. No lock managers, no 2PL.
  4. Durability before acknowledgment β€” FDB doesn’t ack a commit until the write is in the Transaction Log on f+1 machines. Storage Servers may lag behind.

Retry Semantics

The db.Transact(func) loop handles retries automatically. The function is called again if:

  • A conflict is detected (error code 1020)
  • The transaction is too old (error code 1007 β€” MVCC window expired)
  • A transient network error occurs

The retry rule: the function passed to Transact must be idempotent for side effects outside FDB. FDB operations are always safe to retry (each retry gets a fresh transaction). But if your function sends an email, charges a credit card, or writes to another database, a retry will do that twice.

// WRONG: email sent on retry
db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    tr.Set(key, value)
    sendWelcomeEmail(user)   // ← sent multiple times on conflict!
    return nil, nil
})

// RIGHT: defer side effects to after commit
_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    tr.Set(key, value)
    return nil, nil
})
if err == nil {
    sendWelcomeEmail(user)  // ← exactly once
}

Read-Only Transactions

db.ReadTransact(func) opens a read-only transaction. It:

  • Does not track a write set
  • Does not need a commit round-trip
  • Cannot conflict (reads never conflict with other reads)
  • Uses a recent committed version as the read version

Read-only transactions are cheaper and should be preferred for any operation that only reads data.
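
As a sketch, a read-only lookup looks like this (the key name is hypothetical; the future’s Get returns nil for a missing key):

func getColor(db fdb.Database) (string, error) {
    v, err := db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.Get(fdb.Key("color")).Get() // blocks until the value arrives
    })
    if err != nil {
        return "", err
    }
    b, _ := v.([]byte) // nil if the key does not exist
    return string(b), nil
}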


2.4 MVCC β€” Multi-Version Concurrency Control

FDB stores multiple versions of each key, identified by a monotonically increasing version number (a 64-bit integer incremented globally for every commit):

Version timeline:
  v100: Set("color", "red")     ← committed at v100
  v101: Set("count", "5")       ← committed at v101
  v102: Set("color", "blue")    ← committed at v102
  v103: Set("count", "6")       ← committed at v103

A transaction reading at version v101 sees:

  • "color" = "red" (latest version ≀ v101)
  • "count" = "5" (latest version ≀ v101)

A transaction reading at version v103 sees:

  • "color" = "blue" (latest version ≀ v103)
  • "count" = "6" (latest version ≀ v103)

The 5-second window:

FDB garbage-collects old versions after approximately 5 seconds. This is configurable but the default. A transaction that starts at v100 and then tries to read 6 seconds later will get error 1007 (transaction_too_old). This is why long-running transactions are not supported in FDB.

This is a fundamental design choice. FDB optimizes for high-throughput short transactions. If you need to read data while performing long-running computation (e.g., a streaming job), the correct pattern is:

  1. Read a batch of data, process it
  2. Commit the batch
  3. Read the next batch in a new transaction
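
A sketch of that batch pattern, assuming a hypothetical process() helper for the per-record work (done outside any transaction):

// processInBatches scans [begin, end) in bounded chunks, one short
// read-only transaction per chunk, so no transaction outlives the MVCC window.
func processInBatches(db fdb.Database, begin, end fdb.Key, batchSize int) error {
    cursor := begin
    for {
        v, err := db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
            r := fdb.KeyRange{Begin: cursor, End: end}
            return rt.GetRange(r, fdb.RangeOptions{Limit: batchSize}).GetSliceWithError()
        })
        if err != nil {
            return err
        }
        kvs := v.([]fdb.KeyValue)
        if len(kvs) == 0 {
            return nil // range exhausted
        }
        for _, kv := range kvs {
            process(kv) // hypothetical per-record work
        }
        // Resume just after the last key seen (its immediate successor).
        cursor = fdb.Key(append(append([]byte{}, kvs[len(kvs)-1].Key...), 0x00))
    }
}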

MVCC vs. Locking

| Feature | MVCC (FDB, PostgreSQL) | Locking (older MySQL) |
|---|---|---|
| Reads block writers | No | Yes (shared lock) |
| Writers block reads | No | Yes (exclusive lock) |
| Deadlocks | Not possible | Possible (lock cycles) |
| Stale reads | Possible if version too old | Not possible |
| Overhead | Garbage collection of old versions | Lock manager overhead |

MVCC is strictly better for read-heavy workloads and any workload that mixes reads and writes.


2.5 Watches, Versionstamps, and Atomic Operations

These three primitives are rarely covered in introductory FDB material but are essential for building production systems.

Watches

tr.Watch(key) returns a FutureNil that fires when the key’s value changes. The watch is registered at commit time and fires asynchronously.

Production use cases:

  • Distributed lock re-acquisition: hold a lock key; watch it to know when released
  • Cache invalidation: watch a β€œcache bust” key; invalidate local cache when it fires
  • Event notification: watch a β€œlatest event” key; re-read when it changes
  • Leader election: watch the leader key; trigger re-election when it’s cleared

// Polling-free pub/sub pattern
func watchForChanges(db fdb.Database, key fdb.Key) {
    for {
        var watch fdb.FutureNil
        db.Transact(func(tr fdb.Transaction) (interface{}, error) {
            watch = tr.Watch(key)
            return nil, nil
        })
        watch.Get()  // blocks until key changes
        // process the change...
    }
}

Limits: FDB recommends no more than 10,000 active watches per client. Each watch consumes resources on the Storage Server holding that key.

Versionstamps

A versionstamp is a 10-byte identifier that is globally unique and monotonically increasing. It consists of the 8-byte commit version plus a 2-byte transaction batch order, both assigned by the database at commit time. (The Tuple layer’s versionstamp encoding adds 2 more user-assigned bytes for ordering within a single transaction, which is why it is 12 bytes.)

tr.SetVersionstampedKey(keyTemplate, value): writes a key where one 10-byte slot in keyTemplate is replaced with the commit versionstamp.

// Building a totally ordered event log — no coordination required
keyTemplate := append([]byte("events:"), make([]byte, 10)...)  // 10-byte placeholder at offset 7
keyTemplate = append(keyTemplate, []byte(":metadata")...)

// With API version 520+, the raw mutation expects the key to end with a
// 4-byte little-endian offset of the placeholder (the Tuple layer's
// PackWithVersionstamp does this bookkeeping automatically).
offset := make([]byte, 4)
binary.LittleEndian.PutUint32(offset, 7)
keyTemplate = append(keyTemplate, offset...)

tr.SetVersionstampedKey(fdb.Key(keyTemplate), eventPayload)
// After commit: the key is "events:<10-byte versionstamp>:metadata"
// Every write gets a globally unique, strictly ordered key

Why this is powerful:

  • A sequence of commits from different clients, on different machines, all get globally ordered keys
  • No sequence number coordinator required
  • Reading all events in order = a single GetRange over the "events:" prefix (end key "events;", i.e. the prefix with its last byte incremented)

This is how fdb-record-layer implements change feeds and how any FDB-based event sourcing system should be built.

Atomic Operations

Atomic operations modify a key without reading it in the same transaction. They commute β€” two concurrent Add(key, 1) calls on the same key will both succeed, with a final result of 2, even if they had the same read version.

| Operation | Semantics | Use case |
|---|---|---|
| Add(key, n) | Atomic 64-bit integer addition | Counters, sequence numbers |
| BitAnd/Or/Xor | Bitwise operation on the value | Bit set membership |
| Max/Min | Atomic max or min of the current value | High-water marks, metrics |
| SetVersionstamped… | Stamp the commit version into key/value | Ordered event logs |
| CompareAndClear | Clear key if current value equals argument | Expiry, conditional delete |
| AppendIfFits | Append bytes if result fits in value size | Accumulator patterns |

Critical property: atomic ops do not generate a read conflict for their key. Two transactions can both do tr.Add("counter", 1) without conflicting. This is fundamentally different from read-modify-write (which would conflict).

Under the hood: atomic ops are sent to the Storage Server as special mutation records that the server applies during version materialization. The server has the current value and applies the operation when making the value visible.
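
A sketch of a conflict-free counter built on the ADD op (needs encoding/binary; the operand is passed as a little-endian encoded integer, per the atomic-op contract):

func incrementCounter(db fdb.Database, counterKey fdb.Key, delta int64) error {
    param := make([]byte, 8)
    binary.LittleEndian.PutUint64(param, uint64(delta)) // two's complement handles negative deltas
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Add(counterKey, param) // no read, so no read-conflict range on counterKey
        return nil, nil
    })
    return err
}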


2.6 Cluster Architecture β€” Every Component Explained

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚         Client Process           β”‚
                    β”‚   fdb.OpenDatabase(clusterFile)  β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚ reads cluster file once;
                                 β”‚ caches routing table
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚       Coordinators (3–5)         β”‚
                    β”‚  Paxos-based consensus           β”‚
                    β”‚  Store: cluster configuration    β”‚
                    β”‚  Elect: active Proxies, TLogs,   β”‚
                    β”‚         Data Distributor,        β”‚
                    β”‚         Ratekeeper                β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β–Ό                  β–Ό                   β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚    Proxies      β”‚  β”‚   Resolvers  β”‚  β”‚  Transaction Log  β”‚
    β”‚ (GRV + commit)  β”‚  β”‚ (conflicts)  β”‚  β”‚  (WAL, durable)   β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
             β”‚ after conflict check + TLog commit
             β”‚ Storage Servers apply writes asynchronously
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚             Storage Servers (many)             β”‚
    β”‚  Each holds a shard (contiguous key range)    β”‚
    β”‚  Serves reads directly; applies mutations     β”‚
    β”‚  from TLog                                    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Proxies (GRV Proxies + Commit Proxies)

FDB 7.x splits proxy responsibilities:

  • GRV Proxies (Get Read Version): assign read versions to transactions. This is a global operation β€” all reads in the cluster must see a consistent version ordering. GRV proxies serialize these assignments.
  • Commit Proxies: receive commit requests, forward to Resolvers for conflict checking, then write to the Transaction Log.

Proxies are stateless (they hold no data). They can be added or removed without data migration.

Resolvers

Resolvers track which keys were written in each recent commit. When a Commit Proxy submits a transaction’s read set, the Resolver checks: β€œwas any key in this read set written by a committed transaction between this transaction’s read version and now?” If yes β†’ conflict, retry. If no β†’ proceed.

Resolvers are in-memory only (they hold recent write history, not persistent data). Their state window matches the MVCC window (~5 seconds of commits).

Transaction Log

The TLog is FDB’s Write-Ahead Log β€” but distributed. Each commit is written to f+1 TLog machines before being acknowledged to the client. TLog uses its own durable storage (either fdatasync-safe files or SSDs).

After Storage Servers confirm they have applied all mutations up to a version (the β€œdurable version”), TLog can truncate its older entries.

Storage Servers

Each Storage Server holds a subset of the key space (a shard). When a client wants to read a key, it looks up which Storage Server holds that key’s shard (from its cached routing table), and sends the read directly to that server.

Storage Servers use a custom storage engine (called β€œKeyValueStore” internally) built on a variant of SQLite or a custom B-tree, depending on configuration. Each mutation from the TLog is applied to this local store.

Data Distribution: The Data Distributor process monitors shard sizes and access patterns. If a shard grows too large (>250MB by default), it splits it and migrates half to another Storage Server. This is transparent to clients.

Coordinators

Coordinators run a Paxos-like consensus protocol to maintain the β€œcluster controller” β€” the process that manages the lifecycle of all other FDB roles. If the cluster controller fails, coordinators elect a new one. Coordinators themselves are stateless beyond their Paxos log.

The cluster file (fdb.cluster) contains just the coordinator addresses: fdbdemo:fdbdemo@127.0.0.1:4500. This single file is how clients bootstrap their connection to any FDB cluster.


2.7 Production Limits and Sizing

These are not academic β€” they affect how you design your layer:

| Parameter | Limit / Default | Implication |
|---|---|---|
| Max value size | 100 KB (102,400 bytes) | Large values must be chunked (option-b) |
| Max transaction size | ~10 MB (reads + writes combined) | Bulk loads must be split |
| Max transaction duration | ~5 seconds (MVCC window) | No long-running transactions |
| Max watches per client | ~10,000 | Watches are not free |
| Max key size | 10 KB | Keep keys short (< 1 KB preferred) |
| Read version latency | ~1 ms (local datacenter) | Fast; dominated by network |
| Commit latency | ~2–5 ms (local datacenter) | TLog write + network |
| Max throughput (single proxy) | ~250,000 operations/sec | Scale horizontally by adding proxies |
| Storage per server | ~100 GB–4 TB typical | FDB handles redistribution automatically |

Production rule of thumb: keep transactions small (< 1 MB), short (< 1 second), and focused (few key ranges). This maximizes throughput and minimizes conflicts.


2.8 The Simulation Harness β€” Why FDB Is Trusted

FDB’s simulation framework is the most sophisticated correctness testing system in any open-source database. Understanding it explains why Apple, Snowflake, and others trust FDB with mission-critical data.

How Simulation Works

The entire FDB cluster β€” network stack, disk I/O, clocks, process scheduling β€” is implemented twice:

  1. The real implementation (for production)
  2. A deterministic simulation (for testing)

In simulation mode:

  • All network calls are function calls (no actual sockets)
  • Disk I/O is simulated in memory with configurable failure injection
  • Clocks are virtual (controlled by the simulator)
  • Processes are coroutines (no real threads)
  • Random seeds control all non-determinism

The simulator injects:

  • Machine failures: simulate killing any process at any time
  • Network partitions: drop messages between specific nodes
  • Disk failures: simulate corrupted writes, partial writes
  • Clock skew: advance clocks non-uniformly across nodes
  • Slow operations: simulate disk latency spikes

Why Determinism Matters

When a bug is found in simulation:

  1. Record the random seed that triggered it
  2. Reproduce the exact same sequence of events with the same seed
  3. Add print statements, breakpoints, rerun β€” identical behavior every time

This is impossible with real distributed systems, where timing non-determinism makes bugs nearly unreproducible.

Scale of Testing

Before each FDB release, the simulation runs for thousands of machine-years worth of simulated time. A team member once stated: β€œWe’ve found more bugs through simulation than through all other methods combined.”

This testing discipline is why FDB can make strong correctness guarantees. It’s also why the FDB team is one of few database teams in the world that will confidently claim their ACID implementation is correct under arbitrary failure injection.

What This Means for You

When you use FDB as your storage substrate, you inherit these correctness guarantees. The five layers in this repository can fail in arbitrary ways β€” the processes can crash, the network can partition β€” and FDB will maintain consistency. The only failure that can cause data loss is the simultaneous loss of f+1 machines during a transaction commit window (where f is the configured fault tolerance level, typically 1 for a 3-machine cluster or 2 for a 5-machine cluster).


Interview Questions

Q: What is strict serializability and how does it differ from serializability?

Serializability means transactions appear to execute in some sequential order. Strict serializability (also called linearizability + serializability) additionally requires that the sequential order respects real-time ordering: if transaction T1 commits before transaction T2 begins, T1 must appear before T2 in the serial order. This matters for distributed systems where different clients observe commits at different times. FDB provides strict serializability. Most databases (PostgreSQL serializable, MySQL serializable) provide only serializability.

Q: How does FDB avoid deadlocks?

FDB doesn’t use locks. Instead, it uses optimistic concurrency control: reads are conflict-tracked, writes are buffered, and conflict detection happens at commit time. If a conflict is detected, the transaction retries with a new read version β€” it never β€œwaits” for another transaction to release a lock. Since there’s no waiting, there can’t be a cycle of transactions waiting on each other.

Q: What happens to a running FDB transaction if the Proxy crashes?

The client’s Transact loop handles this. The proxy failure will manifest as a network error on the commit request. The FDB client library translates this to a retriable error, and Transact calls the function again with a new transaction and a new proxy (the client discovers the new proxy from the Coordinator). The retry is transparent to application code.

Q: Why does FDB limit transaction size to 10 MB?

FDB’s conflict resolution requires the Resolver to hold recent write history in memory. Large transactions would require holding large amounts of data in the Resolver’s conflict window. Additionally, the Transaction Log must write the entire transaction atomically before acknowledging. Large transactions create tail latency. The 10 MB limit keeps both of these bounded. For bulk loading, the correct approach is to batch writes in 1–5 MB transactions.

Q: What is a β€œread version” in FDB?

A read version is a 64-bit integer that represents a point in time in FDB’s commit history. Every committed transaction increments the global version counter. A transaction reading at version V sees all commits with version ≀ V and no commits with version > V. Read versions are assigned by GRV Proxies and represent a consistent snapshot of the entire cluster at that version.

Q: Can two FDB transactions that don’t share any keys conflict?

No. Conflict detection in FDB is purely key-based. Two transactions with disjoint key sets can never conflict, regardless of how close together they commit. This is why carefully designed key spaces (using subspaces to isolate different logical entities) dramatically reduce conflict rates in high-throughput applications.

The Layer Concept

The layer concept is the central architectural idea of FoundationDB. FDB’s API surface is tiny: Get, Set, Clear, ClearRange, GetRange, and transactions. All the richness of a β€œdatabase” β€” tables, indexes, files, records, schemas β€” must be built from these six primitives. This chapter explains exactly how.


3.1 Encoding Is Everything

FDB stores byte strings. There is no concept of integer columns, string columns, or JSON fields at the FDB level. You must encode your data into bytes and decode it back out. The art of building a layer is entirely the art of encoding β€” turning rich data structures into sequences of (key, value) pairs such that:

  1. Unambiguous decoding: given a raw key-value pair, you can reconstruct the original data
  2. Co-location: related data is adjacent in key order, enabling efficient range scans
  3. Minimal write amplification: updating one piece of data doesn’t force rewrites of unrelated data
  4. Sort-preserving encoding: values you want to range-query sort correctly under byte comparison

All four properties must hold simultaneously. Let’s examine each.

Property 1: Unambiguous Decoding

Suppose you store two fields, name and city, in a single key:

key: "user:" + name + ":" + city

If name = "Alice:Paris" and city = "London", the key becomes "user:Alice:Paris:London", which looks identical to name = "Alice", city = "Paris:London". Ambiguous. Solution: choose a separator that cannot appear in field values, or escape it.

FDB’s Tuple layer solves this by escaping \x00 inside strings as \x00\xFF, then using bare \x00 as a separator. This makes encoding unambiguous for arbitrary byte strings.

Property 2: Co-location

Imagine storing a user’s profile fields as separate keys:

"user:alice:name"    β†’ "Alice"
"user:alice:email"   β†’ "alice@example.com"
"user:alice:city"    β†’ "Paris"
"user:bob:name"      β†’ "Bob"

Reading Alice’s full profile = one GetRange over the "user:alice:" prefix (begin "user:alice:", end "user:alice;" — the prefix with its last byte incremented). All three keys are adjacent in FDB’s sorted key space. One round-trip, no matter how many fields she has.

Contrast with storing all users in one key: "users" β†’ msgpack([alice, bob, ...]). Reading one user requires reading the entire blob. Updating one user requires reading the entire blob, deserializing, modifying, re-serializing, and writing back. Terrible co-location.

Co-location means: keys for the same logical entity should share a prefix, so they sort together and can be read or deleted in one range operation.

Property 3: Minimal Write Amplification

Consider a user with 50 indexed fields. If a user moves from Paris to Tokyo, only two keys change: the old city index entry (cleared) and the new city index entry (set). The user’s record and all other index entries are untouched.

Designing your encoding so that β€œupdate one field” touches only the keys for that field is the goal. If your encoding stores the entire record in one value, every field update touches the entire value β€” wasteful when records are large.

Property 4: Sort-Preserving Encoding

If you want to range-query age BETWEEN 25 AND 40, the encoded age values must sort in the same order as the numeric values.

String encoding fails for integers:

"9"  > "30"  (byte comparison: '9' = 0x39 > '3' = 0x33)
β†’ sorted order: "1", "10", "100", "2", "20", "3", "30", "9" β€” wrong

Signed big-endian fails across sign boundary:

-1 β†’ 0xFFFFFFFFFFFFFFFF  (sorts AFTER all positive numbers)

The correct encoding (used in every layer in this repo):

// encoding.go (option-c-record-layer)
func encodeInt64(x int64) []byte {
    b := make([]byte, 8)
    // Flip the sign bit. Maps signed range to unsigned range preserving order.
    // -2^63 β†’ 0x0000...0000 (smallest unsigned)
    // -1    β†’ 0x7FFF...FFFF
    //  0    β†’ 0x8000...0000
    //  1    β†’ 0x8000...0001
    //  2^63-1 β†’ 0xFFFF...FFFF (largest unsigned)
    binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
    return b
}

This is the standard sort-preserving encoding for signed integers. Variants of it appear in Apache HBase, Apache Cassandra, and Google Bigtable; FDB’s Tuple layer uses the same sign-bit-flip idea for its float and double encodings.

Why XOR with 1<<63 is the same as adding 2^63: For any signed 64-bit integer x, uint64(x) XOR (1<<63) == uint64(x) + 2^63 when computed in 64-bit unsigned arithmetic. XOR is just the bit-manipulation shorthand for the offset encoding.
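
For completeness, the matching decoder is the same bit flip in reverse (a sketch consistent with encodeInt64 above):

// decodeInt64 inverts encodeInt64: flip the sign bit back, then reinterpret
// the 8 bytes as a signed integer.
func decodeInt64(b []byte) int64 {
    return int64(binary.BigEndian.Uint64(b) ^ (1 << 63))
}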


3.2 The Subspace Pattern

The single most important pattern in any FDB layer is the subspace: a byte prefix that namespaces all keys for a logical entity.

// encoding.go (option-a-leveldb)
type Subspace struct {
    prefix []byte
}

func (s Subspace) Pack(userKey []byte) fdb.Key {
    out := make([]byte, 0, len(s.prefix)+1+len(userKey))
    out = append(out, s.prefix...)
    out = append(out, 0x00)        // separator byte
    out = append(out, userKey...)
    return fdb.Key(out)
}

func (s Subspace) Range() fdb.KeyRange {
    begin := append([]byte{}, s.prefix...)
    begin = append(begin, 0x00)
    end := append([]byte{}, s.prefix...)
    end = append(end, 0x01)  // one byte past the separator
    return fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)}
}

Why the separator byte?

Without a separator, subspaces "foo" and "foobar" collide: the key "foobar\x00somekey" has "foo" as a prefix but was packed by the "foobar" subspace. The separator \x00 prevents this: "foo\x00" is never a prefix of "foobar\x00" because their 4th bytes differ (\x00 vs b).

The range trick β€” \x00 β†’ \x01:

Incrementing the separator byte from \x00 to \x01 creates a tight upper bound. Every key in the "demo\x00" subspace sorts before "demo\x01". So GetRange("demo\x00", "demo\x01") returns exactly the keys for the β€œdemo” namespace β€” no more, no less.

Subspace composition (nested subspaces):

You can build hierarchical key spaces by composing prefixes:

users subspace:    "app\x00users\x00"
orders subspace:   "app\x00orders\x00"
indexes subspace:  "app\x00idx\x00"
  city index:      "app\x00idx\x00city\x00"
    value "Paris": "app\x00idx\x00city\x00Paris\x00"

Each prefix cleanly separates a logical domain. ClearRange("app\x00users\x00", "app\x00users\x01") atomically deletes all users. ClearRange("app\x00idx\x00city\x00Paris\x00", "app\x00idx\x00city\x00Paris\x01") atomically deletes all city=β€œParis” index entries.
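
A sketch of how such composition might look on the Subspace type above (Sub is a hypothetical helper, not part of the repo’s API):

// Sub derives a child subspace by appending a separator and a component.
func (s Subspace) Sub(component []byte) Subspace {
    child := append([]byte{}, s.prefix...)
    child = append(child, 0x00)
    child = append(child, component...)
    return Subspace{prefix: child}
}

// Usage:
//   app   := Subspace{prefix: []byte("app")}
//   users := app.Sub([]byte("users"))            // packs keys under "app\x00users\x00..."
//   city  := app.Sub([]byte("idx")).Sub([]byte("city"))
//   tr.ClearRange(users.Range())                 // atomically delete all users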

All five layers in this repo use this pattern:

| Layer | Subspace layout |
|---|---|
| option-a-leveldb | ns + 0x00 + userKey |
| option-a-sqlite | ns + 0x00 + tableName (catalog), ns + 0x01 + tableName + 0x00 + pk (rows) |
| option-b-leveldb | ns + 0x01 + fileType + fileNum (meta), ns + 0x02 + fileType + fileNum + chunkNum (data) |
| option-b-sqlite | ns + 0x00 (size key), ns + 0x01 + pageNum (pages) |
| option-c-record-layer | ns + 0x00 + schema + 0x00 + pk (records), ns + 0x01 + schema + 0x00 + field + 0x00 + value + 0x00 + pk (indexes) |

3.3 The Tuple Layer β€” Industry Standard Encoding

The official FDB client libraries (Python, Java, Go, Ruby) provide a Tuple encoding that standardizes the above patterns. It encodes lists of typed values into byte strings that sort correctly under FDB’s lexicographic comparison.

Tuple type codes and sort behavior:

| Type | Prefix byte | Encoding | Sort behavior |
|---|---|---|---|
| None / nil | 0x00 | just the prefix | always smallest |
| Byte string | 0x01 | bytes + \x00 (with \x00 → \x00\xFF) | lexicographic |
| Unicode string | 0x02 | UTF-8 + \x00 (same escaping) | lexicographic |
| Nested tuple | 0x05 | nested encoding + \x00 | element-by-element |
| Integer 0 | 0x14 | just the prefix | numeric zero |
| Positive integer | 0x15–0x1C | 1–8 bytes big-endian | numerically ascending |
| Negative integer | 0x0C–0x13 | 1–8 bytes, bitwise NOT | numerically ascending |
| Float (32-bit) | 0x20 | 4 bytes, sign-bit-flipped IEEE 754 | numerically ascending |
| Double (64-bit) | 0x21 | 8 bytes, sign-bit-flipped IEEE 754 | numerically ascending |
| Boolean false | 0x26 | just the prefix | false < true |
| Boolean true | 0x27 | just the prefix | false < true |
| UUID | 0x30 | 16 bytes | UUID byte order |
| Versionstamp | 0x32/0x33 | 12 bytes | commit version ordering |

Why the layers in this repo don’t use the Tuple layer:

Using the official Tuple layer would be correct and production-appropriate. The layers here use simpler hand-rolled encodings so that each file is readable without understanding the Tuple specification. When reading the source, you can see append(out, 0x00) instead of tuple.Pack(tuple.Tuple{schema, pk}) β€” the intent is obvious.

For production use, adopt the Tuple layer. It handles edge cases (null bytes in strings, arbitrarily large integers, nested structures) that the hand-rolled encodings in this repo skip.
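
For comparison, the same kind of key built with the bindings’ tuple package (import github.com/apple/foundationdb/bindings/go/src/fdb/tuple; the values here are illustrative):

// tupleDemo packs a typed ("schema", primary key) pair into order-preserving
// key bytes and recovers the typed elements again.
func tupleDemo() error {
    key := tuple.Tuple{"users", int64(42)}.Pack() // handles escaping and integer encoding
    elems, err := tuple.Unpack(key)               // → Tuple{"users", 42}
    if err != nil {
        return err
    }
    _ = elems
    return nil
}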


3.4 Atomicity as a Design Axiom

The most important design discipline in FDB layer development: every logical operation that must be atomic must live inside one transaction.

This sounds obvious but is violated constantly by developers who treat FDB as a fancy cache. Let’s examine the consequences of each atomicity violation:

Case 1: Split Insert (SQL Layer)

// WRONG β€” two transactions for one logical insert
rowid := allocateRowid()           // Transaction 1: increment counter
writeRow(schema, rowid, values)    // Transaction 2: write row data

// If crash between T1 and T2:
// β†’ counter incremented, row never written
// β†’ counter is now permanently incorrect (rowid was "consumed")
// β†’ future inserts skip this rowid

The correct approach (used in option-a-sqlite):

// RIGHT — one transaction
db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    // Read and increment the rowid counter (decode/encode are the layer's own helpers)
    rowid := decode(tr.Get(rowidKey).MustGet()) + 1
    tr.Set(rowidKey, encode(rowid))
    // Write the row in the same transaction
    tr.Set(rowKey(schema, rowid), encode(values))
    return nil, nil
})

Case 2: Split Index Update (Record Layer)

// WRONG β€” separate transactions for record and index
tr1.Set(recordKey, newRecord)        // T1: update record
tr2.Clear(oldIndexKey)               // T2: remove old index
tr2.Set(newIndexKey, nil)            // T2: add new index

// If crash between T1 and T2:
// β†’ record shows new value
// β†’ index still points to old value
// β†’ LookupByIndex returns stale data forever
// β†’ This is a phantom index entry β€” silent corruption

The correct approach (used in option-c-record-layer):

// RIGHT β€” all in one transaction
s.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
    // Read old record to know what indexes to remove
    prevBytes, _ := tr.Get(recordKey(s.ns, schema, pk)).Get()
    if prevBytes != nil {
        var prev Record
        msgpackUnmarshal(prevBytes, &prev)
        for _, f := range sc.Indexes {
            if v, ok := prev[f]; ok {
                tr.Clear(indexKey(s.ns, schema, f, v, pk))
            }
        }
    }
    // Write new record and new indexes
    tr.Set(recordKey(s.ns, schema, pk), encoded)
    for _, f := range sc.Indexes {
        if v, ok := rec[f]; ok {
            tr.Set(indexKey(s.ns, schema, f, v, pk), nil)
        }
    }
    return nil, nil
})

Case 3: Split File Rename (LevelDB Storage Layer)

// WRONG β€” separate transactions for copy and delete
copyFileData(newFd, oldFd)    // T1: copy all chunks
deleteFile(oldFd)              // T2: clear old chunks

// If crash between T1 and T2:
// β†’ both files exist simultaneously
// β†’ LevelDB manifest references old name
// β†’ on restart, LevelDB sees duplicate SSTables β†’ corruption

The correct approach (used in option-b-leveldb):

// RIGHT β€” one transaction for atomic rename
s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    kvs, _ := tr.GetRange(s.dataRange(oldfd), ...).GetSliceWithError()
    tr.ClearRange(s.dataRange(oldfd))   // clear old
    tr.Clear(s.metaKey(oldfd))
    for _, kv := range kvs {            // write new
        tr.Set(translateKey(kv.Key, oldfd, newfd), kv.Value)
    }
    tr.Set(s.metaKey(newfd), metaBytes)
    return nil, nil
})

The pattern: read old state β†’ clear old keys β†’ write new keys β€” in one transaction. Crash at any point: transaction not committed, neither state is visible. Crash after commit: new state is fully visible. No intermediate state is ever observable.


3.5 Key Space Design Patterns and Anti-Patterns

Pattern: Sequential Integer Keys

key: ns + 0x00 + encode_uint64(rowid)

Values: row data. Row id is monotonically increasing. New rows always insert at the end of the range (append-like behavior in the key space). Range scans return rows in insertion order.

Production consideration: In a write-heavy workload, all inserts land on the same key space boundary, creating a hotspot on the FDB shard holding the highest rowid. FDB will split this shard, but the split lags the write rate. For very high insert rates, use a randomized prefix (hash of rowid) to distribute writes across shards.

Pattern: Reverse Timestamp Keys

key: ns + 0x00 + encode_uint64(MaxUint64 - timestamp)

Most recent entries sort first. A GetRange with no bounds returns entries in reverse chronological order. Useful for β€œrecent events”, β€œactivity feeds”, β€œaudit logs”.
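
A sketch of the key encoding (needs encoding/binary and math; ns is the feed’s subspace prefix):

// recentKey builds ns + 0x00 + (MaxUint64 - timestamp), so newer entries
// sort before older ones under byte comparison.
func recentKey(ns []byte, ts uint64) fdb.Key {
    k := append(append([]byte{}, ns...), 0x00)
    suffix := make([]byte, 8)
    binary.BigEndian.PutUint64(suffix, math.MaxUint64-ts)
    return fdb.Key(append(k, suffix...))
}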

Pattern: Composite Keys for Multi-Dimensional Queries

key: ns + encode(field1) + 0x00 + encode(field2) + 0x00 + pk

Example: city + 0x00 + age + 0x00 + pk enables range queries on city AND age:

  • All users in Paris sorted by age: GetRange(prefix + "Paris\x00", prefix + "Paris\x01")
  • All users in Paris aged 25–40: GetRange(prefix + "Paris\x00" + encode(25), prefix + "Paris\x00" + encode(40) + "\xff")

This is the compound index pattern. It is exactly how MySQL’s compound indexes work.
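
A sketch of the "Paris, aged 25–40" range under this layout, reusing the sort-preserving encodeInt64 from earlier (idx is the index’s subspace prefix):

// Index entries look like: idx + city + 0x00 + encodeInt64(age) + 0x00 + pk.
func parisAge25to40(idx []byte) fdb.KeyRange {
    begin := append(append(append([]byte{}, idx...), []byte("Paris")...), 0x00)
    end := append([]byte{}, begin...)
    begin = append(begin, encodeInt64(25)...)
    end = append(end, encodeInt64(40)...)
    end = append(end, 0xFF) // include every pk suffix under age 40
    return fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)}
}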

Anti-Pattern: Large Values

Storing all of a user’s data in one 50 KB value:

key: ns + pk
value: msgpack({name, email, all_orders: [...], all_addresses: [...], ...})

Problems:

  • Reading one field requires reading and deserializing the entire value (O(size))
  • Updating one field rewrites the entire value (O(size) write amplification)
  • The value may approach FDB’s 100 KB limit
  • Hot users (many orders) create unbalanced value sizes

Better: normalize into separate keys per logical entity:

ns + 0x00 + pk              β†’ profile (name, email)
ns + 0x01 + pk + orderid    β†’ individual orders
ns + 0x02 + pk + addrid     β†’ individual addresses

Anti-Pattern: Hot Key Contention

key: "global_counter"   ← every write touches this

All transactions that increment a counter conflict on this key. At high throughput, this becomes a serialization bottleneck.

Solutions:

  1. FDB atomic ops: tr.Add(counterKey, delta) — the mutation is applied server-side, so it never conflicts on the counter key and needs no read-modify-write loop (in the Go bindings the delta is an 8-byte little-endian value; see the sketch after this list)
  2. Sharded counters: split into N shard keys; read by summing all shards
  3. Approximate counting: use a HyperLogLog or count-min sketch in a single key
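A minimal sketch of solutions 1 and 2 combined, assuming an illustrative shard-key layout (prefix + one shard byte) and a delta encoded the way the Go bindings' Add mutation expects (8-byte little-endian):

import (
    "encoding/binary"
    "math/rand"

    "github.com/apple/foundationdb/bindings/go/src/fdb"
)

// incrementSharded bumps one of `shards` counter keys at random. tr.Add is an
// atomic mutation applied server-side, so it never conflicts on the counter key.
func incrementSharded(db fdb.Database, prefix []byte, shards int) error {
    var delta [8]byte
    binary.LittleEndian.PutUint64(delta[:], 1)
    shard := byte(rand.Intn(shards)) // keep shards <= 255 for a one-byte suffix
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Add(fdb.Key(append(append([]byte{}, prefix...), shard)), delta[:])
        return nil, nil
    })
    return err
}

// readSharded sums all shard keys in one range scan.
func readSharded(db fdb.Database, prefix []byte) (int64, error) {
    v, err := db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        rng := fdb.KeyRange{
            Begin: fdb.Key(prefix),
            End:   fdb.Key(append(append([]byte{}, prefix...), 0xFF)),
        }
        kvs, err := rt.GetRange(rng, fdb.RangeOptions{}).GetSliceWithError()
        if err != nil {
            return nil, err
        }
        var sum int64
        for _, kv := range kvs {
            sum += int64(binary.LittleEndian.Uint64(kv.Value)) // Add stores 8-byte values
        }
        return sum, nil
    })
    if err != nil {
        return 0, err
    }
    return v.(int64), nil
}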

3.6 The FDB Blob Layer Pattern

FDB’s value size limit (100 KB) is not a fundamental limitation β€” it’s a performance tuning decision. The solution is chunking, used in option-b-leveldb:

key: ns + 0x01 + fileType + fileNum + chunkNum  β†’  up to 65,536 bytes

To read a file: range-scan all chunks (sorted by chunkNum due to big-endian encoding), concatenate values.
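A minimal read-path sketch under these assumptions (chunkRange is an illustrative helper returning the key range that covers every chunk of one file):

// readFile reassembles a chunked file: one range scan returns the chunks in
// chunk-number order (big-endian encoding), and we concatenate the values.
func readFile(db fdb.Database, chunkRange fdb.KeyRange) ([]byte, error) {
    v, err := db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        kvs, err := rt.GetRange(chunkRange, fdb.RangeOptions{}).GetSliceWithError()
        if err != nil {
            return nil, err
        }
        var out []byte
        for _, kv := range kvs {
            out = append(out, kv.Value...)
        }
        return out, nil
    })
    if err != nil {
        return nil, err
    }
    return v.([]byte), nil
}

Writes go the other way: split the data into 64 KB slices and Set one key per chunk; if the whole file does not fit within the transaction limits, split the writes across several transactions.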

This pattern generalizes to any large object:

  • Files: split by byte range (64 KB chunks)
  • Images: split by byte range or by tile
  • Blobs: split by arbitrary chunk size
  • Large JSON documents: split by logical sub-object

FDB's official Blob Layer (in the FoundationDB python-layers repository) implements this exact pattern. Large object stores such as Amazon S3 and Google Cloud Storage are generally understood to use conceptually similar designs: object data split into chunks, with metadata tracked in a separate distributed store.


3.7 Schema Evolution β€” How Layers Survive Change

A layer’s encoding is its schema. Changing the encoding without migrating old data is a common source of production incidents. Strategies:

1. Prefix-versioned keys:

ns + version_byte + ...rest

New code writes version = 0x02 keys; old code writes version = 0x01 keys. A background migration reads all 0x01 keys, converts them, and writes 0x02 keys. Read code handles both versions (a minimal read-path sketch follows this list of strategies).

2. Additive-only changes: If your value is msgpack/Protobuf, adding new optional fields is backward compatible. Old code ignores unknown fields. New code sets defaults for missing fields. This is safe and zero-downtime.

3. Dual writes during migration: Write to old AND new key format simultaneously during a migration window. Verify correctness. Backfill old data. Cut over to new format. Remove old code path. Zero-downtime migration.

4. Never reuse key prefixes: Deleted keys leave no trace in FDB (there’s no tombstone value). But if you reuse a deleted subspace’s prefix for a new purpose, you risk stale data colliding with new data. Always use new prefixes for new logical entities.
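A minimal read-path sketch for strategy 1, with illustrative version bytes (0x01, 0x02) and hypothetical decodeV1 / decodeV2 / ErrNotFound names for the old and new encodings:

// getRecordAnyVersion tries the new (0x02) key first and falls back to the old
// (0x01) key while a background migration is still running.
func getRecordAnyVersion(rt fdb.ReadTransaction, ns, rest []byte) (*Record, error) {
    v2Key := fdb.Key(append(append(append([]byte{}, ns...), 0x02), rest...))
    v1Key := fdb.Key(append(append(append([]byte{}, ns...), 0x01), rest...))

    if b, err := rt.Get(v2Key).Get(); err != nil {
        return nil, err
    } else if b != nil {
        return decodeV2(b) // new encoding
    }
    if b, err := rt.Get(v1Key).Get(); err != nil {
        return nil, err
    } else if b != nil {
        return decodeV1(b) // old encoding, not yet migrated
    }
    return nil, ErrNotFound
}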


Interview Questions

Q: Why is key ordering so important in FDB β€” can’t you just use a hash map?

A hash map gives O(1) point lookups but cannot efficiently answer range queries (β€œall users aged 25–40”), prefix queries (β€œall keys starting with β€˜user:42:’”), or sorted iteration. Ordered storage enables all three with a single range scan. The entire record layer concept β€” where secondary indexes are just cleverly prefixed keys β€” relies on ordering. Without ordering, every query would require a full scan.

Q: What is a β€œsubspace” in FDB and why is it important?

A subspace is a key prefix that namespaces a set of keys. It provides namespace isolation (keys in subspace A cannot collide with keys in subspace B), atomic deletion (one ClearRange deletes all keys in a subspace), and efficient range queries (one GetRange over the subspace prefix reads all related data). Every production FDB layer uses subspaces as the fundamental organizational unit.

Q: How do you handle null bytes inside keys in FDB?

FDB keys can contain any bytes, but if you use \x00 as a separator, null bytes inside field values create ambiguity. The official solution is the Tuple encoding: escape \x00 inside string values as \x00\xFF, and use bare \x00 as the separator. The decoder sees \x00\xFF and knows it’s an escaped null; bare \x00 ends the field. The layers in this repo use simple encodings that assume field values don’t contain null bytes β€” acceptable for a teaching implementation, not for production with arbitrary user data.

Q: Why is the PutRecord function in the record layer more expensive than a simple Set?

PutRecord must read the old record before writing the new one, so it can remove stale index entries. This adds one Get call to every write β€” a full round-trip to a Storage Server. The cost is necessary: without reading the old record, you can’t know which old index entries to remove. Production systems amortize this by batching multiple record updates in one transaction, or by using techniques like β€œshadow documents” (keeping old field values in the record itself to avoid the read).

Q: What is the difference between Transact and ReadTransact in FDB?

ReadTransact opens a read-only transaction: no conflict tracking, no commit round-trip, cheaper. It’s for operations that only read data. Transact opens a read-write transaction: tracks the read set for conflict detection, buffers writes, commits at the end. Use ReadTransact whenever you’re not writing anything β€” it’s faster and imposes less load on the cluster.

How Real Systems Use FDB

One of the most useful ways to understand an architecture is to see it in production at scale. FoundationDB is unusual in that several of the systems built on it are open-source or at least publicly described. Each of the five labs in this repo maps directly to a pattern used in a real, deployed production system.


4.1 Apple’s FoundationDB Record Layer (iCloud / CloudKit)

The fdb-record-layer is a Java library open-sourced by Apple in 2018. It is the storage engine behind Apple’s CloudKit metadata services β€” the cross-device sync that backs Photos, Contacts, Notes, iMessage, and nearly every other iCloud-connected app.

What it provides:

  • Protobuf-typed records (define your schema in .proto files)
  • Declarative index definitions: value indexes, rank indexes, aggregate indexes, text indexes
  • A query planner that compiles predicates into FDB range scans
  • Schema evolution (add/remove/change fields without downtime)
  • Continuations (long queries can be resumed across transactions)
  • Change feeds via versionstamped keys

How the key layout maps to our option-c-record-layer:

fdb-record-layer uses a similar two-subspace structure:

[storePrefix] [META_DATA_VERSION] β†’ version metadata
[storePrefix] [RECORD_TYPE_KEY]   β†’ schema info
[storePrefix] [RECORD] [pk bytes] β†’ serialized Protobuf record
[storePrefix] [INDEX] [indexName] [indexValue] [pk bytes] β†’ index entry

Our record layer uses a simpler version:

ns + 0x00 + schema + 0x00 + pk    β†’ msgpack record
ns + 0x01 + schema + 0x00 + field + 0x00 + value + 0x00 + pk β†’ index entry (empty value)

The production scale: CloudKit syncs data for hundreds of millions of Apple devices. Every Photos upload, every note edit, every contact sync creates record layer transactions. FDB handles this via horizontal partitioning: different users’ data lives in different FDB shards, so per-user transaction load is low. The global throughput is enormous, but each individual transaction is small.

Key insight β€” continuations: fdb-record-layer’s most important production feature is the β€œcontinuation” β€” a cursor that can be serialized, stored, and resumed across transaction boundaries. Since FDB limits transaction duration to 5 seconds, long queries must be broken into chunks. A continuation saves the last-seen key so the next transaction can resume exactly where the previous one left off. Our labs don’t implement this, but every production layer needs it for full-table scans, bulk exports, and maintenance jobs.

Schema evolution in production: Adding a new field to a record type is safe (existing records just don’t have it; the field defaults to absent). Removing a field requires first removing all code that writes it, then removing the field from the schema. Renaming a field is the dangerous operation β€” it looks like β€œadd new field + remove old field” but requires migrating all existing records. fdb-record-layer provides a dedicated schema migration API for this case.


4.2 FoundationDB Document Layer (MongoDB Wire Protocol)

The Document Layer was an open-source FDB layer that spoke the MongoDB wire protocol. A MongoDB client (driver, shell, or Compass) could point at the Document Layer unchanged, and all data was stored in FDB with ACID guarantees.

What it did:

  • Parsed MongoDB wire protocol (BSON over TCP)
  • Translated BSON documents into FDB key-value pairs
  • Translated MongoDB queries ({city: "Paris", age: {$gte: 25}}) into FDB range scans and filters
  • Provided MongoDB’s consistency guarantees (which are weaker than FDB’s β€” the Document Layer could offer stronger consistency than stock MongoDB by virtue of FDB’s serializable transactions)

Key encoding for a BSON document:

[collectionPrefix] [docId]               β†’ BSON document value
[indexPrefix] [fieldName] [fieldValue]  [docId] β†’ index entry

This is structurally identical to option-c-record-layer with schema = collectionName.

Why it was deprecated: The Document Layer was maintained by FoundationDB's internal team at Apple, and a full MongoDB-compatible query engine (wire protocol, query planner, aggregation) proved too expensive to keep maintaining after FDB was open-sourced in 2018. It was deprecated in 2019, but the codebase remains instructive as a reference implementation.

Lesson: A full relational/document query engine can be built atop FDB in ~50,000 lines of C++. The query planner complexity is in compiling predicate trees to efficient index scan strategies. The storage layer is just option-c-record-layer.


4.3 mvsqlite β€” SQLite on FDB with Multi-Process MVCC

mvsqlite is a Rust project by Heyang Zhou (losfair). It implements a SQLite VFS backed by FDB such that multiple processes can open the same β€œdatabase” concurrently with full MVCC isolation.

How it maps to option-b-sqlite:

Our pagestore stores pages at:

ns + 0x01 + pageNum (big-endian) β†’ 4096-byte page content

mvsqlite stores pages at:

[namespace] [pageNum_big_endian] β†’ page content at that version

What mvsqlite adds beyond our implementation:

  1. Page-level MVCC: Instead of storing just the current page, mvsqlite stores multiple historical versions. Old page versions are kept for active readers. This is implemented by appending a version prefix:

    [namespace] [pageNum] [snapshotVersion] β†’ page at that version
    

    A reader at snapshot V only reads pages at version <= V. The GC process cleans up versions older than the oldest active reader.

  2. Multi-process write serialization: mvsqlite serializes concurrent writers using an FDB transaction on a β€œwrite lock” key. The winner gets exclusive write access; losers retry. SQLite’s own xLock/xUnlock calls are translated to FDB key operations.

  3. Write-ahead buffering in FDB: Instead of a local WAL file, mvsqlite buffers pending writes in FDB under a separate subspace. Commit flushes the buffer to the page subspace atomically.

  4. Namespace mapping: SQLite filenames map to FDB key prefixes. ATTACH DATABASE 'db2.mvsqlite' AS db2 opens a second namespace in the same FDB cluster.

The production use case: A web application running in a multi-process or multi-instance environment (e.g., multiple Gunicorn workers, multiple Docker containers) can share a single mvsqlite database. Each worker reads from a consistent snapshot without locking other workers. Writes are serialized. This is impossible with standard SQLite (which only allows one writer and assumes a shared filesystem).


4.4 LevelDB’s goleveldb Storage Interface Pattern

The goleveldb storage.Storage interface (used in option-b-leveldb) is a seam designed for exactly this purpose: replacing the local filesystem with a remote storage backend while keeping all of LevelDB’s logic intact.

In practice, option-b-leveldb is analogous to how production systems host embedded databases on shared storage:

Real analog β€” etcd on block storage: etcd uses bbolt (a B-tree engine) for its data. When running in Kubernetes, etcd’s data directory is a persistent volume (a network-attached block device). The β€œstorage.Storage” for etcd is the filesystem API on top of that network block device.

Real analog β€” TiKV’s RocksDB on FDB: TiKV, the storage layer behind TiDB, uses RocksDB. There are research prototypes that replace RocksDB’s Env (RocksDB’s equivalent of storage.Storage) with a remote storage backend backed by a distributed KV store.

Real analog β€” FoundationDB’s own Blob Layer: FDB’s official Blob Layer (in the Python layers repository) uses the exact chunking pattern from option-b-leveldb:

[prefix] [blobId] [chunkNum] β†’ up to 10,000 bytes

The chunk size is tunable. A naive chunk-at-a-time read costs O(n_chunks) round-trips; in practice a single GetRange fetches all of a blob's chunks in one round-trip.

What makes option-b-leveldb interesting architecturally: LevelDB produces 7 types of files (log, sstable, manifest, current, lock, temp, info). Each has different I/O patterns: log files are append-only, sstables are write-once read-many, manifests are written atomically. Our storage.Storage implementation stores all file types as FDB key ranges, but a production implementation would optimize differently per file type (e.g., log files could use FDB’s atomic append via versionstamps rather than read-modify-write chunking).


4.5 Snowflake’s Use of FDB

Snowflake uses FoundationDB as its metadata store β€” the catalog that tracks table schemas, partition file locations, transaction history, and access controls. The actual table data is stored in cloud object storage (S3/Azure Blob/GCS), but the metadata that makes queries possible lives in FDB.

Why FDB for metadata?

  • Cloud object storage is strongly consistent for object operations but cannot do atomic multi-object updates. Creating a table, writing partition files, and committing a transaction all need to appear atomically or not at all. FDB provides this.
  • Metadata is small but highly contended. FDB’s optimistic concurrency and high throughput handle schema-level operations (DDL, partition registration) without bottlenecks.
  • FDB’s key ordering makes catalog operations efficient: scan all partitions for table T, scan all columns in schema S, scan all transactions in time range [t1, t2].

The encoding (likely, based on public talks):

[catalog] [tenant] [schemaName] [tableName] [partitionId] β†’ partition metadata
[txnLog]  [startVersion]                                  β†’ transaction record
[schema]  [tenant] [schemaName] [tableName]               β†’ column definitions

This is structurally option-a-sqlite (the SQL catalog pattern) extended to a distributed multi-tenant context.

The lesson: FDB is not used because Snowflake needed a storage engine for query data β€” S3 handles that. FDB is used because catalog metadata needs ACID guarantees that no other component in a cloud-native stack provides cheaply. If you’re building a distributed system that needs a consistent metadata store, FDB is frequently the correct answer.


4.6 Apple CloudKit β€” FDB at Hundreds of Millions of Devices Scale

CloudKit is Apple’s cross-device sync infrastructure. It stores photos metadata, notes content hashes, contact sync tokens, and app-defined data for third-party iOS apps. As of 2023, it services hundreds of millions of Apple devices.

The FDB usage in CloudKit:

  • User records (contacts, notes, photo metadata) are stored as fdb-record-layer records
  • Sync tokens are versionstamped keys that encode the exact FDB version at which a sync point was taken
  • Conflict resolution uses FDB’s read-your-writes consistency to detect and merge concurrent edits

Why versionstamps are central to sync:

// Sync token = FDB versionstamp at the time of last sync
// Next sync: "give me all changes since versionstamp V"
// FDB key encoding: versionstamp β†’ big-endian 10-byte key prefix
// GetRange from (V, MAX) returns all records created/modified after V

This is FDB’s versionstamp feature in action: each committed transaction gets a globally monotonic 10-byte version. Records written in that transaction embed the versionstamp in their key, enabling efficient β€œfetch all changes since X” queries with a single range scan.
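A minimal sketch (not part of the labs) of writing one change-log entry with a versionstamped key; it assumes the Go bindings' tuple helpers (IncompleteVersionstamp, PackWithVersionstamp) and Transaction.SetVersionstampedKey, and the changelog prefix and payload are illustrative:

import (
    "github.com/apple/foundationdb/bindings/go/src/fdb"
    "github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

// appendChange writes one change-log entry whose key ends with the commit
// versionstamp; FDB fills in the 10-byte stamp at commit time. A consumer can
// later GetRange from its last checkpoint to read all changes in commit order.
func appendChange(db fdb.Database, changelogPrefix, payload []byte) error {
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        t := tuple.Tuple{tuple.IncompleteVersionstamp(0)} // placeholder stamp
        key, err := t.PackWithVersionstamp(changelogPrefix)
        if err != nil {
            return nil, err
        }
        tr.SetVersionstampedKey(fdb.Key(key), payload)
        return nil, nil
    })
    return err
}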

Horizontal partitioning: Each CloudKit user’s data lives in a dedicated FDB keyspace (their userID is a subspace prefix). No two users’ transactions conflict because they operate on disjoint keys. This is how FDB achieves per-user isolation without database-per-user overhead.


Interview Questions

Q: Why would you use FDB over Postgres for metadata in a distributed system like Snowflake?

Postgres is fundamentally a single-node database: read replicas scale reads, but to scale writes you need sharding, which introduces coordination complexity. FDB is horizontally scalable by design: adding machines increases throughput roughly linearly. For metadata that needs ACID guarantees across a deployment of hundreds of nodes, a single-node Postgres (or even a synchronously replicated one) cannot match FDB's throughput. FDB also runs in the same datacenter/cloud region as the application, eliminating cross-region round-trips for metadata operations.

Q: The Document Layer is deprecated β€” does that mean FDB isn’t good for document storage?

No. The Document Layer was deprecated because maintaining a MongoDB-compatible query engine (including the query planner, aggregation pipeline, and wire protocol) was expensive to maintain. The underlying pattern β€” document as FDB record, query predicates as range scans over index subspaces β€” is sound and is exactly what fdb-record-layer implements (with Protobuf instead of BSON). The Document Layer’s death was organizational, not architectural.

Q: How does mvsqlite enable multiple concurrent writers to the same SQLite database?

Standard SQLite uses POSIX file locks (or Windows file locks) to serialize writes. One writer holds a RESERVED lock; others wait. mvsqlite replaces file locks with FDB transactions on a β€œwrite lock” key. The writer that wins the FDB transaction proceeds; others retry. Because FDB’s transaction semantics are stronger than POSIX file locks (FDB provides serializable isolation; POSIX locks only prevent concurrent access), mvsqlite actually provides stronger guarantees than standard SQLite in a networked environment.

Q: What is a β€œcontinuation” in fdb-record-layer and why does every production FDB layer need one?

FDB limits transaction duration to 5 seconds. Any operation that might take longer β€” full-table scan, bulk export, index rebuild β€” must be broken into multiple transactions. A continuation is a serialized cursor: it records which keys have been processed so the next transaction can resume from the right place. Without continuations, a query that touches more records than fit in one 5-second window simply cannot be executed. In our labs, we ignore this constraint β€” all queries run in one transaction, which fails with a transaction-too-old error if the dataset is large enough. Adding continuation support is the single most important production readiness improvement for any of these layers.

This Repository β€” Five Implementations

This repo contains five working implementations of FDB layers, ordered from conceptually simple to architecturally sophisticated. Each folder is self-contained: it has its own go.mod, source code, and docs/GUIDE.md with a deep dive into how and why that layer is built the way it is.

foundationdb-labs/
β”œβ”€β”€ option-a-leveldb/     LevelDB API above FDB        β†’ docs/GUIDE.md
β”œβ”€β”€ option-a-sqlite/      SQL engine above FDB         β†’ docs/GUIDE.md
β”œβ”€β”€ option-c-record-layer/ Records + secondary indexes β†’ docs/GUIDE.md
β”œβ”€β”€ option-b-leveldb/     LevelDB on FDB storage       β†’ docs/GUIDE.md
└── option-b-sqlite/      SQLite VFS substrate         β†’ docs/GUIDE.md

| Order | Folder | Core concept learned | Difficulty |
|-------|--------|----------------------|------------|
| 1 | option-a-leveldb | Subspaces, MVCC snapshots, write batches, iterators | ⭐⭐ |
| 2 | option-c-record-layer | Secondary indexes, index consistency, PK encoding | ⭐⭐⭐ |
| 3 | option-a-sqlite | Table catalog, row encoding, column types | ⭐⭐⭐ |
| 4 | option-b-sqlite | Page model, byte-range I/O, partial-page RMW | ⭐⭐⭐⭐ |
| 5 | option-b-leveldb | File chunks, storage.Storage interface, atomic rename | ⭐⭐⭐⭐ |

Start with option-a-leveldb regardless of your background. It introduces all the core patterns β€” subspaces, range scans, snapshot reads, transaction retry loops β€” in the simplest possible context.


5.2 Layer-by-Layer Architecture Map

option-a-leveldb (API Layer above FDB)

LevelDB API (Get/Put/Delete/Batch/Iterator/Snapshot)
                        ↕
         FDB KV store (subspace-encoded keys)

The layer sits above FDB. User code calls db.Put(key, value). The layer encodes the key as ns + 0x00 + key and calls tr.Set(encodedKey, value). All LevelDB semantics (iterators, snapshots, batches) are implemented using FDB primitives.

Why this layer: LevelDB’s API is the lingua franca of embedded KV stores. By implementing it on FDB, existing code that uses LevelDB can be dropped into a distributed, replicated context without changing its API usage.

option-c-record-layer (Record + Index Layer)

Application (PutRecord / GetRecord / LookupByIndex)
                        ↕
   FDB (two subspaces: records and indexes)

The layer sits above FDB. A record is stored at a fixed key. Every indexed field also has an index key in a separate subspace. Both are written in the same transaction to keep records and indexes consistent.

Why this layer: This is the fundamental pattern for any β€œdatabase” feature. Tables, collections, documents β€” all of them are β€œrecords with indexes.” Understanding how to keep a record and its indexes consistent under concurrent writes is the core of database internals.

option-a-sqlite (SQL Engine Layer)

SQL (CREATE TABLE / INSERT / SELECT / UPDATE / DELETE)
                        ↕
     Table catalog (FDB) + Row store (FDB)

A minimal SQL parser and query engine above FDB. Tables are defined in a catalog (a separate subspace). Rows are encoded as msgpack values keyed by primary key. SELECT * is a range scan over the table’s key range.

Why this layer: Shows that a full (if minimal) relational database is ~300 lines of Go above FDB. The interesting parts are the catalog encoding and the primary key sort-order problem.

option-b-sqlite (SQLite VFS Substrate)

SQLite (full SQL engine)
           ↕
   SQLite VFS API (xRead / xWrite / xSync / xLock)
           ↕
   FDB (page-keyed store: pageNum β†’ 4096 bytes)

SQLite provides its own SQL engine, query planner, and transaction management. The VFS is just a storage substrate: given a logical page number, read or write 4096 bytes. FDB stores those pages.

Why this layer: Shows how to extend an existing system by replacing its storage interface rather than reimplementing its logic. The interesting parts are partial-page writes (SQLite calls xWrite with sub-page ranges, requiring read-modify-write) and locking (translating SQLite’s POSIX lock semantics to FDB key operations).

option-b-leveldb (LevelDB Storage Substrate)

LevelDB (full KV engine)
           ↕
   storage.Storage interface (Create / Open / Rename / List)
           ↕
   FDB (file metadata keys + chunked file data keys)

LevelDB provides its own full KV engine. The storage layer is just a filesystem abstraction: create/open/read/write/rename/delete files. FDB stores those files as key ranges.

Why this layer: Shows that even complex storage engines like LevelDB (which manage their own B-trees, compaction, and WAL) can be decoupled from the filesystem. The interesting parts are chunk encoding (64KB chunks to stay under FDB’s value limit), atomic rename (LevelDB uses rename to commit new SSTables), and why WAL redundancy is a non-issue (FDB’s own log subsystem replaces the LevelDB WAL’s durability purpose).


5.3 Running the Labs

Each lab has a demo/main.go that exercises the layer. To run any lab:

# Start FDB (requires Docker)
docker compose up -d

# Run a specific demo
cd option-a-leveldb && go run demo/main.go
cd option-c-record-layer && go run demo/main.go
# ... etc.

Prerequisites:

  • Docker + Docker Compose (for the FDB cluster)
  • Go 1.22+
  • The FDB C client library (libfdb_c) on the host, required by the CGO-linked Go bindings; the cluster itself is bootstrapped with ./scripts/bootstrap-fdb.sh

FDB API version: All labs use API version 710 (FDB 7.1.x). This is set at process start:

fdb.MustAPIVersion(710)

5.4 Extending the Labs β€” Suggested Exercises

Each docs/GUIDE.md has exercises specific to that lab. Here are cross-cutting improvements that apply to multiple labs:

1. Continuations (all labs): Add a ScanWithCursor(cursor []byte, limit int) function that returns at most limit results and a cursor byte string. The cursor encodes the last key seen. The next call resumes from cursor. This makes full-table scans safe under FDB's 5-second transaction limit (a minimal sketch appears at the end of this list).

2. Transactions exposed to the caller (option-a-leveldb, option-a-sqlite): Wrap multiple operations in a user-visible transaction: db.Begin() β†’ Tx, tx.Put(k, v), tx.Get(k), tx.Commit(). Internally, this is just db.Transact(func(tr) { ... }) β€” all operations share one FDB transaction.

3. Versionstamp-based change feed (option-c-record-layer): When writing a record, also write to a change log subspace: changelogKey = ns + changelogSubspace + versionstamp. A background consumer can do GetRange(lastCheckpoint, MAX) to efficiently read all changes since the last poll. This is how fdb-record-layer implements change feeds.

4. Range index queries (option-c-record-layer): Current index lookup finds exact matches (field = value). Add LookupByRange(schema, field, minValue, maxValue) that does GetRange(indexPrefix(field, minValue), indexPrefix(field, maxValue)). This requires your index value to be sort-preserving encoded (integers via encodeInt64, strings lexicographically).

5. Multi-file SQLite (option-b-sqlite): Support ATTACH DATABASE by mapping SQLite filenames to separate FDB subspaces. Two β€œattached” databases are just two different ns prefixes in FDB β€” no code duplication needed.

Reading Guide β€” Where to Go Next

You’ve read through the five chapters of this guide and the lab implementation guides. By now you understand FDB’s architecture from the commit pipeline through every component of the cluster, the layer concept from byte-level encoding through production schema evolution, and how five real systems (Apple CloudKit, Snowflake, mvsqlite, Document Layer, Record Layer) deploy these exact patterns.

This chapter maps out the landscape of resources for going deeper in each direction.


6.1 FoundationDB β€” Going Deeper

Essential Reading

FDB Documentation β€” The official reference. The β€œDeveloper Guide” section covers API semantics in depth, including the precise semantics of read-your-writes consistency, exactly which operations conflict, and the full list of transaction options.

FDB White Paper β€” SIGMOD 2021 β€” β€œFoundationDB: A Distributed Unbundled Transactional Key Value Store.” The definitive technical reference for FDB’s internal architecture. Covers the simulation framework, commit pipeline, log system, and storage layer in academic depth. If you’ve read this guide, you have the vocabulary to understand every section of this paper.

FDB Forum β€” Design discussions, Q&A, and announcements. Some of the most insightful posts are from FDB’s core team explaining design decisions.

Source Code

apple/foundationdb β€” The full FDB C++ source. Key files:

  • fdbserver/MasterProxyServer.actor.cpp β†’ Commit Proxy implementation (the commit pipeline)
  • fdbserver/Resolver.actor.cpp β†’ Resolver (conflict detection)
  • fdbserver/TLogServer.actor.cpp β†’ Transaction Log (write-ahead log)
  • fdbserver/StorageServer.actor.cpp β†’ Storage Server (reads, MVCC)
  • fdbclient/ReadYourWrites.actor.cpp β†’ Client-side read-your-writes cache
  • fdbserver/workloads/ β†’ Simulation workloads (randomized fault testing)

FoundationDB/fdb-record-layer β€” Java. The most complete example of a production FDB layer. Study FDBRecordContext (transaction management), RecordQueryPlanner (query compilation to range scans), and OnlineIndexer (safe online index builds). Reading this source after this guide will be directly comprehensible.


6.2 Storage Engines β€” Going Deeper

Books

Designing Data-Intensive Applications by Martin Kleppmann β€” The best single-volume overview of storage trade-offs, replication, consistency, and distributed transactions. Chapters 3 (storage engines), 7 (transactions), and 9 (consistency) are directly relevant. If you haven’t read this book, stop here and read it. It contextualizes everything in this guide.

Database Internals by Alex Petrov β€” Deep dive into B-trees (B+ variants, page splits, page merges), LSM trees (all the internals: bloom filters, compaction algorithms, manifest management), and distributed consensus (Paxos, Raft, Multi-Paxos). The LSM chapter is the best public explanation of RocksDB/LevelDB internals.

Papers

The Log-Structured Merge Tree (1996) β€” The original LSM paper by O’Neil et al. Defines the C0/C1/C2 component model that LevelDB simplifies into level-0/level-1/…

Bigtable: A Distributed Storage System for Structured Data (2006) β€” The architecture paper for Google Bigtable. Introduced the tablet-server model, SSTable format, and hierarchical metadata server. Direct ancestor of HBase, Cassandra’s SSTables, and FDB’s storage layer design.

Spanner: Google’s Globally-Distributed Database (2012) β€” How Google built a globally distributed ACID database using TrueTime for external consistency. The β€œPaxos groups” concept is closely related to FDB’s shard-level durability. Essential reading for understanding distributed transactions.

Source Code

google/leveldb β€” The C++ LevelDB. Read: db/version_set.cc (compaction and manifest management), db/log_reader.cc + db/log_writer.cc (WAL format), table/block.cc + table/format.cc (SSTable on-disk format), util/arena.cc (memtable arena allocator).

syndtr/goleveldb β€” The Go LevelDB used in option-b-leveldb. The storage/ package defines the storage.Storage interface that our layer implements.

sqlite.org/sqlite — The SQLite source. Study: btree.c (B-tree implementation), os.c and os_unix.c (the VFS layer and the default Unix VFS), pager.c (page cache and rollback journal), and wal.c (WAL mode). The SQLite source is famously well-commented.

facebook/rocksdb β€” RocksDB is LevelDB’s production successor. Much more complex (tiered compaction, bloom filters, column families, transactions) but the fundamental data model is identical.


6.3 SQLite VFS β€” Going Deeper

SQLite VFS Documentation β€” The full VFS API with method semantics. Essential for understanding what each of the xRead, xWrite, xSync, xLock, xUnlock methods must guarantee.

SQLite File Format β€” The page-by-page layout of a SQLite database file. After reading the option-b-sqlite guide, this document will make complete sense. The page header format, freelist, overflow pages, and B-tree cell formats are all documented here.

losfair/mvsqlite β€” The production FDB-backed SQLite VFS. The Rust code is clean and readable. The mvfs/ directory contains the VFS implementation; mvsqlite/ contains the page store. Compare with option-b-sqlite/pagestore/pagestore.go line by line.

SQLite WAL Mode β€” The WAL (Write-Ahead Logging) mode documentation. Explains how WAL provides concurrent reads and writes. Our pagestore uses rollback journal mode semantics (xSync is a no-op because FDB is durable), but understanding WAL mode helps you see what we’re not implementing.


6.4 Distributed Systems Fundamentals

The Part-Time Parliament (Paxos) β€” Lamport’s original Paxos paper. Dense, but the consensus problem it solves (how do N machines agree on one value when any machine can fail?) is the foundation of every distributed database.

In Search of an Understandable Consensus Algorithm (Raft) β€” The Raft paper. More accessible than Paxos. FDB uses a Paxos variant, not Raft, but the problems solved are identical. After this, read the Raft visualization.

Linearizability: A Correctness Condition for Concurrent Objects β€” The formal definition of linearizability (which FDB provides for reads and writes) and its relationship to sequential consistency and serializability.

A Critique of ANSI SQL Isolation Levels β€” The Berenson et al. paper that formalized isolation anomalies (dirty reads, non-repeatable reads, phantoms, lost updates, write skew) and showed that ANSI SQL’s definitions were imprecise. Essential vocabulary for discussing β€œserializable” vs β€œsnapshot isolation” vs β€œrepeatable read.”


6.5 Building Something Real β€” A Progression

If you want to go from β€œI understand these labs” to β€œI built something production-worthy,” here is a concrete progression:

Step 1: Add continuations to option-a-leveldb. Add IterateWithCursor(cursor []byte, limit int) ([]KV, nextCursor []byte). This is the single most important production feature missing from all the labs. Every real FDB application needs cursor-based pagination for range queries.

Step 2: Add range index queries to option-c-record-layer. Implement LookupByRange(schema, field string, minVal, maxVal interface{}) ([]Record, error). This requires: (a) sort-preserving encoding for all index value types, and (b) a range scan on the index subspace instead of a point lookup. A minimal sketch appears after Step 5.

Step 3: Implement a simple query planner. For WHERE city='Paris' AND age >= 25, decide: which index scan is more selective? Scan city=β€˜Paris’ and filter age, or scan age >= 25 and filter city? Look at how fdb-record-layer’s RecordQueryPlanner.java makes this decision.

Step 4: Add online index building. Given a table with 1 million records, add a new index without downtime. The approach: (a) mark index as β€œbuilding”, (b) background job scans records in cursor-paginated chunks and writes index entries, (c) any concurrent PutRecord/DeleteRecord writes to both old and new index state, (d) when background job finishes, mark index β€œready”. This is how fdb-record-layer’s OnlineIndexer works.

Step 5: Read the fdb-record-layer source. After steps 1-4, open fdb-record-layer’s FDBRecordStore.java, RecordQueryPlanner.java, and OnlineIndexer.java. You will understand every design decision immediately. The gap between β€œlabs exercise” and β€œproduction library used at Apple scale” is these four features: continuations, sort-preserving range indexes, query planning, and online index builds.
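A minimal sketch of Step 2, using the Store shape and the recordKey / msgpackUnmarshal names from chapter 3. It takes pre-encoded (sort-preserving) []byte bounds for simplicity, and indexValuePrefix / pkFromIndexKey are illustrative helpers that follow the index key layout shown earlier:

// LookupByRange scans index entries between minVal and maxVal, then fetches
// the matching records. The record reads are all issued before any of them is
// awaited, so they are pipelined in one round of requests.
func (s *Store) LookupByRange(schema, field string, minVal, maxVal []byte) ([]Record, error) {
    v, err := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        rng := fdb.KeyRange{
            Begin: fdb.Key(indexValuePrefix(s.ns, schema, field, minVal)),
            End:   fdb.Key(append(indexValuePrefix(s.ns, schema, field, maxVal), 0xFF)),
        }
        kvs, err := rt.GetRange(rng, fdb.RangeOptions{}).GetSliceWithError()
        if err != nil {
            return nil, err
        }
        futures := make([]fdb.FutureByteSlice, 0, len(kvs))
        for _, kv := range kvs {
            pk := pkFromIndexKey(kv.Key)
            futures = append(futures, rt.Get(recordKey(s.ns, schema, pk)))
        }
        out := make([]Record, 0, len(futures))
        for _, f := range futures {
            b, err := f.Get()
            if err != nil {
                return nil, err
            }
            if b == nil {
                continue // index entry pointing at a deleted record
            }
            var rec Record
            msgpackUnmarshal(b, &rec) // error handling elided, as in chapter 3
            out = append(out, rec)
        }
        return out, nil
    })
    if err != nil {
        return nil, err
    }
    return v.([]Record), nil
}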

Option A β€” LevelDB API on top of FoundationDB

Pattern: β€œAPI layer above LevelDB” β€” we keep the familiar LevelDB surface (Put/Get/Delete/Batch/Iterator/Snapshot) but the bytes never touch LevelDB. They live in FoundationDB instead.

Why this is interesting

LevelDB is a local embedded KV. FoundationDB is a distributed transactional KV. Both speak (key, value) pairs over ordered byte keys, so the API shapes are nearly identical β€” but the guarantees underneath are wildly different:

| Concept | LevelDB | This layer (over FDB) |
|---------|---------|-----------------------|
| Put | Append to MemTable + WAL on local disk | Set inside tr.Transact{...} |
| Get | MemTable → immutable tables → SSTs | tr.Get (consistent across the cluster) |
| Batch | Single WAL record | One FDB transaction (cross-key atomic) |
| Range | LevelDB SST merging iterator | tr.GetRange over a Subspace |
| Snapshot | Pins on-disk sequence number | SetReadVersion(capturedVersion) (MVCC) |

The mapping is almost mechanical, and that’s the point: it makes FDB’s primitives concrete.

Files

layer/
  encoding.go   Subspace: ns + 0x00 + userKey β†’ fdb.Key, plus range helpers
  db.go         Open / Close / Get / Put / Delete
  batch.go      Batch + DB.Write (atomic multi-op)
  iterator.go   Forward+backward cursor over GetRange results
  snapshot.go   MVCC snapshot via captured read version
demo/main.go    End-to-end: CRUD β†’ batch β†’ range β†’ snapshot vs. live reads

Key-space layout

<namespace> 0x00 <userKey>  -> <value>

Subspace (see layer/encoding.go) keeps multiple logical databases isolated inside one FDB cluster. The 0x00 separator + the fact that the user-key portion is appended as opaque bytes gives us the same lexicographic ordering LevelDB users expect.

Mapping each LevelDB op to an FDB call

  • Put(k,v) β†’ db.Transact(func(tr){ tr.Set(ns.Pack(k), v) }). Transact handles automatic retry on conflict.
  • Get(k) β†’ db.ReadTransact(func(rt){ rt.Get(ns.Pack(k)).Get() }). An empty FDB result (nil) becomes our ErrNotFound.
  • Delete(k) β†’ tr.Clear(ns.Pack(k)) (no-op if absent).
  • Batch.Write β†’ a single Transact containing many Set/Clear ops. Either all of them land, or none do.
  • Iterator β†’ materialized list from tr.GetRange(subspace.Range()). Real production code would stream via RangeIterator, but materializing keeps the iterator usable outside a live transaction (the LevelDB shape).
  • Snapshot β†’ capture the read version with tr.GetReadVersion(), then on subsequent reads call tr.SetReadVersion(captured) so FDB serves the data as it was at that point in time.

Running

  1. Start the shared FDB cluster (from the repo root):
    docker compose up -d
    ./scripts/bootstrap-fdb.sh
    
  2. Install the FDB client library on the host (required by the Go bindings, which are CGO-linked to libfdb_c):
    brew install foundationdb        # macOS
    
  3. Run the demo:
    cd option-a-leveldb
    go run ./demo -cluster ../fdb.cluster
    

Expected output (abridged):

Get apple -> red
batch applied (cherry+date inserted, banana deleted)
range scan [a, z):
  apple = red
  cherry = red
  date = brown
snapshot read version = 1234567
live    Get apple   -> green
live    Get cherry  -> layer: not found
snap    Get apple   -> red   (was 'red' at snapshot)
snap    Get cherry  -> red   (err=<nil>)

That last block is the punchline: the snapshot sees the world before our post-snapshot writes, because FDB just gives us older versions on demand.

What this layer intentionally omits

  • Compaction / SST files β€” there are none. FDB handles storage internally.
  • Bloom filters β€” not needed; FDB indexes by key range natively.
  • Write batches with sequence numbers β€” FDB’s transaction commit version plays that role automatically (tr.GetCommittedVersion() after commit).

See ../option-b-leveldb for the inverse experiment: keeping real LevelDB code paths (memtable + SSTs) but storing the SST bytes in FDB.

Hitchhiker’s Guide β€” Option A: LevelDB API over FoundationDB

The question this answers: β€œI have code that uses LevelDB. Can I swap in FoundationDB underneath with zero changes to my application?”

The deeper question: β€œWhat is LevelDB’s API contract, and how exactly does each piece of it map onto FDB primitives?”


Table of Contents

  1. What LevelDB Is (and Isn’t)
  2. The Five Concepts: Get, Put, Delete, Batch, Iterator, Snapshot
  3. Subspaces β€” The Key Encoding
  4. Write Batches β€” Atomicity Made Explicit
  5. Iterators β€” Range Scans Without Cursors
  6. Snapshots β€” MVCC Exposed to the Caller
  7. How the Demo Works Step by Step
  8. What the Real LevelDB Does That We Don’t
  9. Real-World Analogue: goleveldb, RocksDB, PebbleDB
  10. Exercises β€” Build on This

1. What LevelDB Is

LevelDB (released by Google in 2011) is an embedded, ordered, key–value store implemented as a Log-Structured Merge Tree (LSM tree). β€œEmbedded” means it runs in the same process as your application β€” no server, no network. β€œOrdered” means the same thing as in FDB: keys are sorted lexicographically, and you can range-scan them efficiently.

LevelDB’s API surface is deliberately tiny:

db.Get(key)
db.Put(key, value)
db.Delete(key)
batch := new(leveldb.Batch)
batch.Put / batch.Delete
db.Write(batch)
iter := db.NewIterator(...)
snap := db.GetSnapshot()

This simplicity is why LevelDB became the embedded storage engine of choice for Chrome (IndexedDB), Bitcoin Core, and countless other applications.

Where LevelDB lives in the storage stack:

Application code
    ↓
LevelDB API (Get/Put/Delete/Batch/Iterator)
    ↓
LSM tree (MemTable + SSTables on disk)
    ↓
Filesystem / OS

Where our layer lives:

Application code
    ↓
[our layer] β€” same LevelDB-shaped API (Get/Put/Delete/Batch/Iterator)
    ↓
FDB transactions
    ↓
FDB cluster (distributed, replicated)

The application sees the same interface. The durability substrate changes completely.


2. The Five Concepts

Get β€” One Read, One Transaction

In LevelDB, Get opens a brief β€œread lock” (via a snapshot) and reads one key. There is no explicit transaction; LevelDB handles it internally.

In our layer:

func (d *DB) Get(key []byte) ([]byte, error) {
    v, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.Get(d.ns.Pack(key)).Get()
    })
    ...
}

ReadTransact opens a read-only FDB transaction (no conflict tracking on writes, cheaper than a full Transact). The read version is chosen by FDB to be a recent committed version β€” typically a few milliseconds behind real-time. This gives you a consistent view even if concurrent writers are active.

rt.Get(k) does not block on return. It returns a FutureByteSlice. Calling .Get() on the future is what blocks (sends the request to the FDB storage server and waits for the response). This two-phase call style is how FDB supports pipelining: you can call rt.Get(k1), rt.Get(k2), rt.Get(k3) in sequence, then .Get() all three β€” FDB sends all three requests before blocking on any response. This is critical for LookupByIndex in option-c, as we’ll see.

Put β€” One Write, One Transaction

func (d *DB) Put(key, value []byte) error {
    _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Set(d.ns.Pack(key), value)
        return nil, nil
    })
    return err
}

Transact opens a read-write transaction. tr.Set adds the (key, value) pair to the transaction's local write buffer. Nothing is committed until the function returns without error, at which point FDB commits. If the commit fails with a retryable error (for example a conflict, when the closure also reads keys that another writer modified), Transact calls our function again with a fresh transaction automatically. A write-only Put like this one has no read set, so it cannot conflict; the retry loop matters once the closure also reads.

Delete β€” Same as Put, but Clear

tr.Clear(key) adds a tombstone to the write buffer. In FDB’s model, a cleared key is identical to a key that was never written β€” there is no β€œnull” value. This is important: Get on a cleared key returns nil (which our layer translates to ErrNotFound), not a special tombstone value.

When Each One is Right

| Operation | Use when | FDB primitive |
|-----------|----------|---------------|
| Get | Reading a single key | ReadTransact |
| Put | Writing a single key | Transact + Set |
| Delete | Removing a single key | Transact + Clear |
| Batch | Writing multiple keys atomically (below) | Transact (one) |

3. Subspaces β€” The Key Encoding

Every key that hits FDB is first passed through the Subspace.Pack method:

// encoding.go
type Subspace struct{ prefix []byte }

func (s Subspace) Pack(userKey []byte) fdb.Key {
    out := make([]byte, 0, len(s.prefix)+1+len(userKey))
    out = append(out, s.prefix...)
    out = append(out, 0x00)        // separator byte
    out = append(out, userKey...)
    return fdb.Key(out)
}

If your namespace is "demo" and your key is "apple", the actual FDB key is the byte string "demo\x00apple".

Why the separator byte?

Without a separator, two namespaces "foo" and "foobar" would collide: the key "foobar\x00somekey" would appear inside both foo’s range and foobar’s range. The separator \x00 prevents this because "foo\x00" is not a prefix of "foobar\x00".

Why 0x00 specifically?

Because 0x00 is the smallest possible byte value. When we compute the range end for a subspace scan, we copy the begin key and change the last byte from 0x00 to 0x01:

func (s Subspace) Range() fdb.KeyRange {
    begin := append([]byte{}, s.prefix...)
    begin = append(begin, 0x00)
    end := append([]byte{}, s.prefix...)
    end = append(end, 0x01)
    return fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)}
}

The range ["demo\x00", "demo\x01") contains exactly and only the keys packed by this subspace. This is a simple, efficient way to express β€œall keys in this namespace” without needing a sentinel end key.

The tuple layer alternative:

Official FDB client libraries encode subspace ranges using the Tuple encoding, which handles nested subspaces, escaping, and typed values. Our hand-rolled encoding is simpler but less general β€” for production use, adopt the Tuple layer.


4. Write Batches β€” Atomicity Made Explicit

LevelDB’s WriteBatch is the mechanism for writing multiple keys atomically. Without a batch, each Put is a separate transaction β€” if your process crashes between two Puts, the second one is missing.

b := layer.NewBatch()
b.Put([]byte("user:1:name"), []byte("Alice"))
b.Put([]byte("user:1:email"), []byte("alice@example.com"))
b.Put([]byte("user:1:score"), []byte("100"))
db.Write(b)

After Write, either all three keys exist or none do.

Our implementation:

// batch.go
func (d *DB) Write(b *Batch) error {
    _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        for _, op := range b.ops {
            if op.clear {
                tr.Clear(d.ns.Pack(op.key))
            } else {
                tr.Set(d.ns.Pack(op.key), op.value)
            }
        }
        return nil, nil
    })
    return err
}

One Transact call. All ops go into the transaction buffer together. FDB commits them atomically. This is exactly what LevelDB’s WriteBatch does internally β€” it writes all ops to the Write-Ahead Log in one fsync, then applies them to the MemTable.

FDB’s size limits:

FDB transactions are capped at approximately 10 MB of reads + writes. For most use cases this is not a limit, but if you’re writing millions of keys you’ll need to split into multiple transactions. See option-b-leveldb for the chunking pattern.
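If a batch could exceed those limits, one option is to give up whole-batch atomicity and apply it in fixed-size pieces. A minimal sketch on top of the Batch type above (opsPerTxn is chosen by the caller):

// WriteChunked applies a large batch in independent transactions of at most
// opsPerTxn operations each. Unlike Write, the batch as a whole is NOT atomic:
// a crash can leave earlier chunks committed and later ones missing.
func (d *DB) WriteChunked(b *Batch, opsPerTxn int) error {
    for start := 0; start < len(b.ops); start += opsPerTxn {
        end := start + opsPerTxn
        if end > len(b.ops) {
            end = len(b.ops)
        }
        chunk := b.ops[start:end]
        _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
            for _, op := range chunk {
                if op.clear {
                    tr.Clear(d.ns.Pack(op.key))
                } else {
                    tr.Set(d.ns.Pack(op.key), op.value)
                }
            }
            return nil, nil
        })
        if err != nil {
            return err
        }
    }
    return nil
}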


5. Iterators β€” Range Scans Without Cursors

LevelDB’s iterator is a cursor over the sorted key space. It supports bidirectional movement and seeking to arbitrary positions.

The streaming vs. materializing decision:

A β€œstreaming” iterator would keep a live FDB transaction open and use fdb.RangeIterator to fetch keys page by page as you call Next(). This is efficient for large ranges but ties the iterator’s lifetime to an open transaction.

FDB transactions have a ~5 second timeout. LevelDB iterators are frequently held open much longer (e.g., while a background compaction reads a full SSTable). Forcing a 5-second limit would break drop-in compatibility.

Our solution: materialize the entire range into a slice upfront:

func newIteratorAt(fdbDB fdb.Database, ns Subspace, start, end []byte, readVersion int64) *Iterator {
    // Snapshot read-version pinning and error handling are elided here;
    // the full listing appears in the source deep dive (section 11).
    v, _ := fdbDB.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(ns.RangeWithin(start, end), fdb.RangeOptions{}).GetSliceWithError()
    })
    it.kvs = v.([]fdb.KeyValue)
    ...
}

The entire GetSliceWithError call happens inside one transaction. The transaction closes. The iterator holds the materialized slice β€” it can outlive the transaction indefinitely.

Trade-off: We read all matching keys eagerly, even if the caller only needs the first few. For small ranges (as in most LevelDB use cases) this is fine. For ranges spanning millions of keys, a streaming approach would be necessary.

Navigation:

it.First()          // idx = 0
it.Next()           // idx++
it.Prev()           // idx--
it.Last()           // idx = len(kvs)-1
it.Seek(target)     // binary (or linear) search for key >= target
it.Key()            // kvs[idx].Key unpacked from subspace
it.Value()          // kvs[idx].Value
it.Valid()          // 0 <= idx < len(kvs)

The Seek in our implementation is a linear scan for pedagogical clarity. A production implementation would use sort.Search (binary search) since the slice is sorted.


6. Snapshots β€” MVCC Exposed to the Caller

This is where FDB’s MVCC machinery becomes directly visible.

A LevelDB snapshot captures the database state at a point in time. Reads through the snapshot always see that exact state, regardless of subsequent writes.

snap := db.NewSnapshot()
// ... later, after writes have occurred ...
v1, _ := db.Get(key)    // sees current state
v2, _ := snap.Get(key)  // sees state at snapshot time
// v1 != v2 if the key was mutated after the snapshot

How we implement this:

// snapshot.go
type Snapshot struct {
    db          fdb.Database
    readVersion int64
}

func (d *DB) NewSnapshot() (*Snapshot, error) {
    rv, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetReadVersion().Get()
    })
    ...
    return &Snapshot{db: d.fdb, readVersion: rv.(int64)}, nil
}

GetReadVersion() returns the logical timestamp FDB assigned to our transaction. This is a monotonically increasing integer β€” FDB’s β€œversion clock”. We store it.

Later, when the snapshot is asked for a key:

func (s *Snapshot) Get(key []byte) ([]byte, error) {
    v, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.SetReadVersion(s.readVersion)   // ← the key line
        return tr.Get(s.ns.Pack(key)).Get()
    })
    ...
}

SetReadVersion pins the transaction to read from the exact version we captured. FDB’s MVCC machinery ensures that the storage servers still have the old version in their version history β€” provided it hasn’t been garbage collected (the ~5 second window).

Why SetReadVersion requires a writable Transaction:

This is a quirk of the FDB Go API: SetReadVersion is only available on fdb.Transaction (read-write), not on fdb.ReadTransaction (read-only). So we open a writable transaction but never write anything β€” effectively using it as a β€œpinnable read transaction”. The transaction is automatically abandoned when the closure returns.

What this maps to in real databases:

  • PostgreSQL: BEGIN; SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; creates a snapshot valid for the transaction’s duration
  • MySQL InnoDB: START TRANSACTION WITH CONSISTENT SNAPSHOT
  • Spanner: BeginTransaction(mode: READ_ONLY, timestamp: exact_staleness: Duration)
  • CockroachDB: AS OF SYSTEM TIME <timestamp> clause on SELECT

All of these are, at their core, β€œread from this specific version of the MVCC chain.”


7. The Demo, Step by Step

The demo in demo/main.go exercises every feature and prints the results. Here is the internal FDB call sequence for the MVCC section:

1. db.Put("color", "red")
   β†’ Transact: Set("demo\x00color", "red"), commit version v1

2. snap = db.NewSnapshot()
   β†’ ReadTransact: GetReadVersion() β†’ returns v1 (or very close)
   β†’ snap.readVersion = v1

3. db.Put("color", "blue")
   β†’ Transact: Set("demo\x00color", "blue"), commit version v2

4. snap.Get("color")
   β†’ Transact: SetReadVersion(v1), Get("demo\x00color")
   β†’ FDB returns "red"  (value at v1)

5. db.Get("color")
   β†’ ReadTransact: Get("demo\x00color")
   β†’ FDB returns "blue" (latest committed value)

This demonstrates that the snapshot truly pins to v1, seeing the pre-mutation value.


8. What the Real LevelDB Does That We Don’t

LevelDB has significant machinery we skip:

LSM Tree internals:

  • MemTable (skip list in memory) + WAL (write-ahead log on disk)
  • SSTable files (sorted, immutable, bloom-filter indexed)
  • Compaction (background merging of SSTables to reclaim space and bound read amplification)
  • Bloom filters (avoid disk reads for non-existent keys)

None of this applies to our layer because FDB handles all of it. FDB's storage servers use their own storage engine (a B-tree-based engine with MVCC support), and durability comes from FDB's log system rather than a per-layer WAL. We simply call tr.Set and FDB's internals handle the rest.

What we deliberately emulate:

  • The API surface (same method names and semantics as goleveldb)
  • Namespace isolation (subspace = virtual database)
  • Atomic batches
  • Forward/backward iteration
  • MVCC snapshots

What we don’t emulate:

  • Compaction (irrelevant β€” FDB handles it)
  • Bloom filters (FDB has its own read optimization)
  • File-level operations (no such thing in FDB)
  • Checksums (FDB provides end-to-end data integrity)

9. Real-World Analogue: goleveldb, RocksDB, PebbleDB

goleveldb (github.com/syndtr/goleveldb) is the Go port of LevelDB that option-b-leveldb uses as a consumer. Its storage.Storage interface is the pluggable storage backend. We’ll see this in option-b’s guide.

RocksDB (Facebook/Meta, 2013) is the evolution of LevelDB: more configurable, multi-threaded compaction, column families, transactional API. It is the storage engine inside MySQL 8 (MyRocks), CockroachDB, TiKV, and many others. Every concept from our layer applies directly to RocksDB’s API.

PebbleDB (CockroachDB, 2019) is a Go implementation of RocksDB’s key ideas, designed for CockroachDB’s specific workload. CockroachDB switched from RocksDB to Pebble in 2021 for improved performance and simpler operations.

The common thread: all of these expose the same Get/Put/Delete/Batch/Iterator/Snapshot interface. Building this interface over FDB means you understand the contract deeply, because you have to implement it rather than just use it.


10. Exercises β€” Build on This

These are not hypothetical. Each one adds a real capability:

Exercise 1 β€” Atomic Compare-And-Swap (CAS)

func (d *DB) CAS(key, expected, next []byte) (bool, error)

Read key, compare to expected, write next β€” all in one transaction. If the key changed since you read it, FDB retries automatically (conflict detection does this for you β€” you don’t need to loop).
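A minimal sketch (it uses the bytes package for the comparison; a missing key compares as unequal unless expected is empty):

// CAS replaces key's value with next only if the current value equals
// expected. Read and write share one transaction, so FDB's conflict detection
// plus Transact's retry loop guarantee the comparison still holds at commit.
func (d *DB) CAS(key, expected, next []byte) (bool, error) {
    v, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        cur, err := tr.Get(d.ns.Pack(key)).Get()
        if err != nil {
            return false, err
        }
        if !bytes.Equal(cur, expected) {
            return false, nil // value changed (or key missing): report failure
        }
        tr.Set(d.ns.Pack(key), next)
        return true, nil
    })
    if err != nil {
        return false, err
    }
    return v.(bool), nil
}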

Exercise 2 β€” TTL (Time-To-Live) Keys Store expiry timestamps alongside values. Modify Get to return ErrNotFound for expired keys, and add a Sweep() method that clears all expired entries with a ClearRange.

Exercise 3 β€” Transactions Across Multiple Keys

tx := db.Begin()
tx.Put("account:alice:balance", "900")
tx.Put("account:bob:balance", "100")
tx.Commit()

This is just a Batch.Write today. Extend it to include optimistic reads (read alice’s balance, check it’s >= 100 before subtracting) β€” you’ll need a real Transact closure, not just a batch.

Exercise 4 β€” Prefix Scan

rows, err := db.Scan(prefix []byte) ([]KV, error)

Use RangeWithin(prefix, nil) and stop iterating once keys no longer share the prefix (or compute a proper exclusive upper bound by incrementing the last byte of the prefix). This is the basis for the Record Layer's ScanRecords.

Exercise 5 — Size Estimate Use FDB's GetEstimatedRangeSizeBytes (a transaction method that asks the storage servers for a size estimate) to find how many bytes live in your subspace. This is how database engines implement SHOW TABLE STATUS without a full scan.


11. Source Code Deep Dive β€” Every Line Explained

This section walks through the full source of layer/db.go, layer/encoding.go, layer/iterator.go, and layer/snapshot.go with annotations about the non-obvious decisions.

db.go β€” The Core Database Type

type DB struct {
    fdb fdb.Database
    ns  Subspace
}

Two fields. fdb is the FDB connection (goroutine-safe, long-lived). ns is the namespace: a byte prefix prepended to every key. Multiple DB instances on the same FDB cluster with different ns values are completely isolated β€” their key ranges do not overlap.

func Open(fdbDB fdb.Database, namespace []byte) *DB {
    return &DB{fdb: fdbDB, ns: NewSubspace(namespace)}
}

Open does not contact FDB. It’s a pure in-memory initialization. The connection to FDB was already established when fdb.OpenDefault() was called in main.go. Open just associates this DB with a prefix.

Why take fdb.Database rather than a cluster address string? This lets the caller decide how the FDB connection is configured (API version, cluster file path, network options) and share one connection across multiple DB instances. Multiple DB instances share one FDB network thread and one connection pool.

func (d *DB) FDB() fdb.Database { return d.fdb }
func (d *DB) Namespace() []byte { return d.ns.prefix }

Accessors for embedding and testing. FDB() lets a consumer pass the FDB connection to another layer (e.g., a Record Layer built on top of this DB). Namespace() lets tests inspect the key prefix.

encoding.go β€” The Full Subspace Implementation

func (s Subspace) RangeWithin(start, end []byte) fdb.KeyRange {
    var begin, endKey fdb.Key
    if start == nil {
        begin = s.Range().Begin
    } else {
        begin = s.Pack(start)
    }
    if end == nil {
        endKey = s.Range().End
    } else {
        endKey = s.Pack(end)
    }
    return fdb.KeyRange{Begin: begin, End: endKey}
}

RangeWithin lets callers specify a sub-range within the subspace. If start = nil, the range starts at the beginning of the subspace. If end = nil, the range ends at the end of the subspace. This is used by the iterator to implement LevelDB’s NewIterator(slice *util.Range) β€” an iterator over a restricted key range.

The subtle end encoding: When end is provided, we use Pack(end) as the upper bound. This is an exclusive upper bound in FDB (same as Python’s range() β€” GetRange(begin, end) returns keys where begin <= key < end). Since Pack(end) = prefix + 0x00 + end, this correctly excludes the end key itself.
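As a hedged usage sketch (d is the DB from db.go above), a bounded scan over user keys in ["apple", "banana") looks like this:

rng := d.ns.RangeWithin([]byte("apple"), []byte("banana"))
v, _ := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
    return rt.GetRange(rng, fdb.RangeOptions{}).GetSliceWithError()
})
for _, kv := range v.([]fdb.KeyValue) {
    _ = kv // kv.Key is prefix + 0x00 + user key; kv.Value is the stored bytes
}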

iterator.go β€” Forward/Backward Navigation

func (it *Iterator) compareKeys(a, b []byte) int {
    return bytes.Compare(a, b)
}

One line, but critical: key comparison is lexicographic byte order, not string collation, not numeric order. "9" > "10" under this comparison. This is identical to LevelDB’s default comparator (BytewiseComparator). If you need a different sort order, you need a sort-preserving encoding β€” which is why encodeInt64 (used in option-c’s index keys) exists.

func (it *Iterator) Seek(target []byte) {
    for it.idx = 0; it.idx < len(it.kvs); it.idx++ {
        if it.compareKeys(it.Key(), target) >= 0 {
            return
        }
    }
}

Linear scan for Seek. Correct but O(n). A production implementation uses binary search:

it.idx = sort.Search(len(it.kvs), func(i int) bool {
    return bytes.Compare(it.kvs[i].Key, target) >= 0
})

This is O(log n) β€” important for large result sets. The linear scan is kept here for readability.

snapshot.go β€” Pinning MVCC Versions

func (s *Snapshot) NewIterator(start, end []byte) *Iterator {
    return newIteratorAt(s.db, s.ns, start, end, s.readVersion)
}

The iterator constructor takes readVersion and passes it to the internal newIteratorAt. Inside newIteratorAt:

func newIteratorAt(fdbDB fdb.Database, ns Subspace, start, end []byte, readVersion int64) *Iterator {
    v, _ := fdbDB.Transact(func(tr fdb.Transaction) (interface{}, error) {
        if readVersion > 0 {
            tr.SetReadVersion(readVersion)
        }
        return tr.GetRange(ns.RangeWithin(start, end), fdb.RangeOptions{}).GetSliceWithError()
    })
    ...
}

If readVersion > 0, we pin the transaction to that version. The range scan returns results as of that exact version. This enables snapshot-consistent iteration β€” the iterator sees a stable, non-changing view even while concurrent writers are active.

The Transact retry loop and SetReadVersion: Transact retries on conflict, and SetReadVersion pins the transaction to a fixed read version. Can a read-only operation that reads from a pinned version still conflict? No. FDB detects read-write conflicts: a transaction is rejected only if it writes and something it read was modified by another committed transaction. A transaction that only reads (even via Transact) has nothing to commit, so it will not be retried for conflict reasons. The failure mode to watch for is transaction_too_old: if readVersion has fallen outside the MVCC window and that version has been garbage-collected, the read fails, and because every retry pins the same stale version, the error surfaces to the caller.


12. Production Considerations

12.1 FDB Transaction Size Limits

FDB enforces hard limits on transactions:

  • 10 MB total mutation size (sum of all Set and Clear calls in one transaction)
  • 10 MB total read size (sum of all values read)
  • 5 seconds maximum transaction duration (from first operation to commit)

For option-a-leveldb, the most likely hit is the iterator: GetSliceWithError() reads the entire range into memory in one transaction. If a namespace contains 50 MB of data and you create an iterator over it, you’ll get a transaction too-large error.

Production solution: Paginated iteration with cursors:

// Instead of reading everything at once:
var cursor fdb.Key = ns.Range().Begin
const pageSize = 10_000
for {
    v, _ := db.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(fdb.KeyRange{Begin: cursor, End: ns.Range().End},
            fdb.RangeOptions{Limit: pageSize}).GetSliceWithError()
    })
    kvs := v.([]fdb.KeyValue) // ReadTransact returns interface{}
    if len(kvs) == 0 {
        break
    }
    // process kvs...
    cursor = fdb.Key(append(kvs[len(kvs)-1].Key, 0x00)) // next page starts after last key
}

12.2 FDB Go Binding Concurrency Model

The FDB Go binding uses a single-threaded network loop internally (the FDB C library has one network thread). All transactions are multiplexed over this thread. This means:

  • Multiple goroutines can each have their own ReadTransact or Transact calls concurrently β€” these are correctly serialized by the binding
  • fdb.Database is goroutine-safe
  • But the network thread is single-threaded: if you saturate it (thousands of concurrent transactions), you’ll see latency increase

For high concurrency, FDB recommends batching multiple operations within one transaction rather than creating many small transactions.
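A hedged sketch of the difference; pending here is a hypothetical slice of key-value pairs accumulated by the caller:

// One commit round-trip for the whole batch, instead of one per key.
_, err := db.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
    for _, kv := range pending { // pending: []fdb.KeyValue built up by the caller
        tr.Set(kv.Key, kv.Value)
    }
    return nil, nil
})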

12.3 Key Space Planning

Before deploying, decide on your namespace structure. Once data is in production, renaming a namespace (changing the prefix) requires migrating all data β€” a potentially days-long background job.

Good practice:

// Use a versioned namespace prefix
db := layer.Open(fdb, []byte("myapp:v1:users"))
// If schema changes require a new encoding:
newDB := layer.Open(fdb, []byte("myapp:v2:users"))
// Migrate in background; dual-write during migration window

12.4 Monitoring and Observability

FDB exposes cluster health via its status API:

fdbcli --exec "status json"

Key metrics to monitor:

  • Transactions committed/second β€” throughput
  • Conflicts/second β€” high values indicate hot keys or poorly structured transactions
  • Storage server read/write latency β€” P99 should be < 10ms
  • Data distribution lag β€” if a shard is being moved, latency spikes

For application-level monitoring, instrument every Transact call with a timer and tag the metric with the operation name.
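A hedged sketch of that instrumentation; recordLatency is a placeholder for whatever your metrics library provides, and the time import is assumed:

func timedTransact(db fdb.Database, op string,
    fn func(fdb.Transaction) (interface{}, error)) (interface{}, error) {
    start := time.Now()
    v, err := db.Transact(fn)
    // e.g. a latency histogram tagged with the operation name and success/failure
    recordLatency(op, time.Since(start), err)
    return v, err
}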


13. Interview Questions β€” LevelDB, MVCC, and FDB

Q: What is the difference between ReadTransact and Transact in the FDB Go binding?

ReadTransact opens a read-only transaction: no write buffer, no conflict tracking, no commit phase. It’s cheaper than Transact because it skips the commit round-trip. Use ReadTransact whenever you’re only reading. Transact opens a read-write transaction: it tracks the read key set for conflict detection and requires a commit round-trip to apply writes. For single-key reads like Get, using Transact instead of ReadTransact works but wastes latency and cluster resources.

Q: If FDB’s MVCC window is ~5 seconds, what happens to a snapshot older than 5 seconds?

Reading from that snapshot returns a transaction_too_old error. FDB garbage-collects old MVCC versions after the configured version history window (default ~5 seconds). Any transaction β€” including read-only snapshot reads β€” that tries to read a version older than the GC horizon fails with a retriable error. In our Snapshot implementation, this means that a snapshot held open longer than ~5 seconds will start returning errors on the next Get or NewIterator call. Production code must handle this by recreating the snapshot.

Q: LevelDB supports custom comparators. What would you need to change to support a different sort order in this layer?

The sort order is determined by FDB’s key ordering, which is always lexicographic byte order. To support a different sort order (e.g., β€œintegers sort numerically”), you would change the key encoding: encode integer keys using encodeInt64 (sign-bit-flipped big-endian) so that their byte order matches their numeric order. You cannot change FDB’s comparator; you can only change what bytes you write as keys. This is why sort-preserving encoding is the fundamental concept in layer design.

Q: How does FDB’s Transact retry loop interact with side effects?

Transact retries the closure function if the transaction conflicts. If the closure has side effects outside FDB (e.g., incrementing a counter, logging, sending an HTTP request), those side effects will execute multiple times on retry. The FDB convention is: the Transact closure must be idempotent or side-effect-free. For logging, use a separate post-commit hook. For counters, use FDB atomic operations (tr.Add) which are commutative and do not require retry logic.

Q: How does an iterator over a snapshot differ from an iterator over the current database state?

A snapshot iterator is pinned to the read version captured at NewSnapshot() time. Even if concurrent writers modify or delete keys between when the snapshot was taken and when the iterator is created, the iterator sees the state at snapshot time. A current-state iterator uses the latest committed version, so it sees all mutations up to the moment the GetRange call is sent to the cluster. In our implementation, the difference is one line: tr.SetReadVersion(readVersion) in the snapshot path.

Option A β€” β€œSQL Layer” over FoundationDB

Pattern: relational engine above FDB. Tables β†’ subspaces, rows β†’ KV pairs, ACID inherited from FDB transactions. This is conceptually what Apple’s old (now-archived) fdb-sql-layer did.

What we ship (and don’t)

We ship the storage half of a SQL engine: a catalog, row encoding, table scans, primary-key lookup, and atomic inserts/updates/deletes β€” all in about 250 lines.

We do not ship a SQL parser. The Go API itself is the query language:

db.CreateTable(sqllayer.TableDef{Name:"users", PK:"id", Columns: ...})
db.Insert("users", sqllayer.Row{"id":1, "name":"Alice", "city":"Paris"})
db.SelectWhere("users", func(r sqllayer.Row) bool { return r["city"] == "Paris" })

A real SQL frontend would compile SELECT ... WHERE city='Paris' into exactly the same backend call.

Key layout

<ns> 0x00 <tableName>                     -> msgpack(TableDef)    [catalog]
<ns> 0x01 <tableName> 0x00 <encodedPK>    -> msgpack(Row)         [data]
<ns> 0x02 <tableName>                     -> uint64 (LE counter)  [rowid seq]
  • Catalog lets a fresh Open() reconstruct the in-memory schema cache.
  • Data rows live under a per-table subspace, so GetRange on <ns> 0x01 <tableName> 0x00 is a full table scan in key order.
  • Seq is used only when the table has no declared PK (we mint a rowid).

PK values are encoded with the same sign-flipping big-endian trick used in option-c-record-layer: integer rows sort numerically, strings sort lexicographically. This is what makes table scans produce ordered output β€œfor free”.

Transactionality

Insert does the seq read+write and the row write in the same fdb.Transact, so two concurrent inserts can never collide on the same rowid: FDB will retry one of them. DropTable clears catalog, all rows, and the seq counter atomically β€” no orphan rows can be observed after a drop.

Why this is not a real SQLite layer

A real SQLite vtab would let you write SELECT name FROM users WHERE city='Paris' in standard SQL and have SQLite’s planner call back into FDB. Building that requires:

  • Implementing the sqlite3_module C ABI (via CGO with mattn/go-sqlite3, or using modernc.org/sqlite’s yet-undocumented vtab hooks).
  • Mapping SQLite’s xBestIndex / xFilter / xColumn callbacks onto FDB range scans.
  • Round-tripping SQLite’s rowids to our PK encoding.

That’s an order of magnitude more code (~1.5k lines) and adds little to your understanding of FDB itself; the interesting parts β€” how does the relational model fit on top of an ordered KV store? β€” are all here.

Running

cd option-a-sqlite
go mod tidy
go run ./demo -cluster ../fdb.cluster

Expected output:

SELECT * FROM users;
  map[city:Paris id:1 name:Alice]
  map[city:Tokyo id:2 name:Bob]
  map[city:Paris id:3 name:Carol]

SELECT * FROM users WHERE city='Paris';
  map[city:Paris id:1 name:Alice]
  map[city:Paris id:3 name:Carol]

After UPDATE users SET city='Tokyo' WHERE id=1;
  map[city:Tokyo id:1 name:Alice]
  map[city:Tokyo id:2 name:Bob]
  map[city:Paris id:3 name:Carol]

After DELETE FROM users WHERE id=3;
  map[city:Tokyo id:1 name:Alice]
  map[city:Tokyo id:2 name:Bob]

Hitchhiker’s Guide β€” Option A: SQL Engine over FoundationDB

The question this answers: β€œWhat does a relational database actually store? If you strip away the SQL parser and the query planner, what is left?”

The deeper question: β€œHow do tables, rows, primary keys, and catalogs map onto an ordered key–value store?”


Table of Contents

  1. What a Relational Database Actually Stores
  2. The Catalog β€” A Database About Databases
  3. Row Encoding β€” Turning a Map into Bytes
  4. Primary Keys and Sort Order
  5. Full Table Scan β€” The Honest Query
  6. INSERT as a Rowid Allocation + Write
  7. UPDATE = Overwrite (Why it Works)
  8. DELETE = Clear (Why it Works)
  9. Key Layout In Detail
  10. What We Didn’t Build (and Why Those Parts Are Hard)
  11. Real-World Analogues: SQLite, MySQL, PostgreSQL
  12. Exercises

1. What a Relational Database Actually Stores

A relational database is, at its core, a system for:

  1. Storing rows (collections of typed values) under a primary key.
  2. Keeping a catalog (metadata about what tables exist and their schemas).
  3. Maintaining indexes (secondary data structures for fast lookup by non-primary-key fields).
  4. Executing queries (finding rows matching a predicate).

Strip away SQL parsing, the query planner, authentication, and network protocols, and you have a structured key–value store with a catalog.

Our layer implements steps 1 and 2. Step 3 is in option-c-record-layer. Step 4 is a SelectWhere(pred) that does a full scan β€” honest but slow, which makes the need for indexes obvious.


2. The Catalog β€” A Database About Databases

Every real database has a catalog: a table of tables. PostgreSQL stores it in pg_catalog.pg_class. MySQL uses information_schema.TABLES. SQLite writes the catalog to the first page of the database file as sqlite_master.

Our catalog lives in FDB under a dedicated subspace:

<ns> 0x00 <tableName>  β†’  msgpack(TableDef)

TableDef contains the table name, column definitions (name + type), and the declared primary-key column name (if any). When Open() is called, we scan this subspace and reconstruct the in-memory schemas map:

func (d *DB) loadCatalog() error {
    // range-scan the catalog subspace
    v, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(d.catalogRange(), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return err
    }
    for _, kv := range v.([]fdb.KeyValue) {
        var def TableDef
        msgpack.Unmarshal(kv.Value, &def)
        d.schemas[def.Name] = def
    }
    return nil
}

This means the catalog is self-describing β€” a fresh process connecting to an existing FDB namespace automatically discovers all tables. You don’t need to pass schemas at startup. This mirrors how SQLite opens a database file and reads its catalog from sqlite_master before accepting any queries.

CreateTable is atomic: the catalog write and any initial setup happen inside one FDB transaction. Two concurrent CreateTable("users") calls will either both succeed (if they have identical schemas) or one will fail with a conflict. There is no possibility of a half-created table.
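A minimal sketch of that shape, reusing the catalogKey helper and schemas cache shown in the source deep dive below (bytes, fmt, and msgpack imports assumed):

func (d *DB) CreateTable(def TableDef) error {
    enc, err := msgpack.Marshal(def)
    if err != nil {
        return err
    }
    _, err = d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // Reading the catalog key puts it in our conflict set, so a concurrent
        // CreateTable for the same name forces one of the two to retry.
        existing, err := tr.Get(d.catalogKey(def.Name)).Get()
        if err != nil {
            return nil, err
        }
        if existing != nil && !bytes.Equal(existing, enc) {
            return nil, fmt.Errorf("table %q already exists with a different schema", def.Name)
        }
        tr.Set(d.catalogKey(def.Name), enc) // the catalog write commits atomically with the read
        return nil, nil
    })
    if err == nil {
        d.mu.Lock()
        d.schemas[def.Name] = def
        d.mu.Unlock()
    }
    return err
}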


3. Row Encoding β€” Turning a Map into Bytes

Our Row type is map[string]any. We serialize it with msgpack, which produces a compact binary representation:

{"name": "Alice", "city": "Paris", "id": 1}
β†’  83 a4 6e 61 6d 65 a5 41 6c 69 63 65 a4 63 69 74 79 a5 50 61 72 69 73 a2 69 64 01
   (msgpack map: 3 entries, name="Alice", city="Paris", id=1)

The row bytes become the value in FDB. The key is derived from the primary key value (see below).
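A hedged round-trip sketch; the Marshal/Unmarshal calls here match github.com/vmihailenco/msgpack, but any MessagePack library with the same shape works:

row := Row{"id": 1, "name": "Alice", "city": "Paris"}

encoded, _ := msgpack.Marshal(row) // the bytes stored as the FDB value

var decoded Row
_ = msgpack.Unmarshal(encoded, &decoded) // what SelectWhere does for each kv.Value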

Why msgpack, not JSON or Protobuf?

  • JSON is human-readable but wastes space (quoted keys, text numbers).
  • Protobuf is compact but requires a schema file and code generation.
  • msgpack is compact, self-describing (no external schema), and has first- class Go support.

For a teaching implementation, msgpack hits the right trade-off. For production, Protobuf or Cap’n Proto would be better choices, since they support schema evolution (adding fields without breaking old readers) and are more efficient for fixed schemas.


4. Primary Keys and Sort Order

A primary key must be unique and determines the storage order of rows.

In FDB, β€œstorage order” = β€œkey byte order”. So we need to encode primary key values into bytes in a way that preserves the intended sort order.

String PKs: store as raw UTF-8 bytes. UTF-8’s byte ordering preserves lexicographic order for ASCII characters. Enough for a demo; a production system would use Unicode collation.

Integer PKs: this is where it gets interesting.

Naive approach: encode as decimal string. Problem: "9" > "30" in byte comparison. Rows would sort as: 1, 10, 100, 2, 20, 3… β€” wrong.

Better: big-endian 64-bit integer. Problem: negative numbers have the high bit set, so -1 encodes as 0xFF... which sorts after all positive numbers.

Our solution (and the industry standard):

func encodeInt64(x int64) []byte {
    b := make([]byte, 8)
    // Flip the sign bit. This turns the two's-complement representation into
    // an unsigned representation that sorts correctly:
    //  0 β†’ 0x8000_0000_0000_0000 (sorts between negatives and positives)
    //  1 β†’ 0x8000_0000_0000_0001
    // -1 β†’ 0x7FFF_FFFF_FFFF_FFFF (sorts before 0)
    // -2 β†’ 0x7FFF_FFFF_FFFF_FFFE (sorts before -1)
    binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
    return b
}

After this encoding, integer byte strings sort in the same order as their numeric values. A GetRange on the rows subspace returns rows in ascending PK order, for free, by FDB’s native byte-order scan.

This is the same trick used by:

  • Apache HBase (for row key ordering)
  • Google Cloud Bigtable
  • FoundationDB’s official Tuple layer (for negative integers)
  • Apache Cassandra’s blob serialization

5. Full Table Scan β€” The Honest Query

SelectAll and SelectWhere both scan the entire rows subspace:

func (d *DB) SelectWhere(table string, pred func(Row) bool) ([]Row, error) {
    v, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(d.tableRange(table), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    var out []Row
    for _, kv := range v.([]fdb.KeyValue) {
        var row Row
        msgpack.Unmarshal(kv.Value, &row)
        if pred == nil || pred(row) {
            out = append(out, row)
        }
    }
    return out, nil
}

The predicate is applied in Go, after fetching all rows from FDB. This is a full table scan β€” the same operation as SELECT * FROM users WHERE city='Paris' without any index.

Why not push the predicate to FDB?

FDB has no server-side filtering. Its API is: β€œgive me this key range.” All filtering must happen on the client. This is a deliberate design choice by FDB: the cluster is a dumb-fast KV store; smarts live in the layer.

Real SQL databases push predicates β€œdown” to the storage engine to avoid fetching unnecessary rows over the network. In our setup, β€œthe network” is the connection to FDB’s storage servers β€” for large tables, a full scan incurs real latency. This is the motivation for secondary indexes (option-c).

What a real SQL engine does instead:

A query planner looks at the predicate city='Paris' and checks whether there is an index on city. If yes, it rewrites the scan as:

scan index(city = 'Paris') β†’ [pk1, pk2, pk3]
fetch rows by pk: Get(pk1), Get(pk2), Get(pk3)

This is exactly what LookupByIndex does in option-c.
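A hedged sketch of that two-step pattern in this chapter's vocabulary; indexRange and extractPK are hypothetical helpers over an index subspace keyed by (column value, PK), like the one Exercise 1 below asks you to build:

func (d *DB) lookupByIndex(table, column string, value []byte) ([]Row, error) {
    var out []Row
    _, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        // Step 1: scan the index subspace; each index key ends in an encoded PK.
        kvs, err := rt.GetRange(d.indexRange(table, column, value), fdb.RangeOptions{}).GetSliceWithError()
        if err != nil {
            return nil, err
        }
        for _, kv := range kvs {
            // Step 2: fetch the full row by primary key.
            raw, err := rt.Get(d.rowKey(table, extractPK(kv.Key))).Get()
            if err != nil {
                return nil, err
            }
            var row Row
            if err := msgpack.Unmarshal(raw, &row); err != nil {
                return nil, err
            }
            out = append(out, row)
        }
        return nil, nil
    })
    return out, err
}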


6. INSERT as Rowid Allocation + Write

When a table has no declared PK, we allocate a monotonic rowid:

_, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
    curBytes, _ := tr.Get(d.seqKey(table)).Get()
    var cur uint64
    if len(curBytes) == 8 {
        cur = binary.LittleEndian.Uint64(curBytes)
    }
    next := cur + 1
    // Write the new counter value
    buf := make([]byte, 8)
    binary.LittleEndian.PutUint64(buf, next)
    tr.Set(d.seqKey(table), buf)
    // Write the row (encoded = msgpack of the row, built before the transaction)
    tr.Set(d.rowKey(table, encodeInt64(int64(next))), encoded)
    return nil, nil
})

The counter read + increment + row write happen in one transaction. Two concurrent inserts cannot get the same rowid β€” FDB’s conflict detection will cause one to retry with a fresh counter read.

This is essentially how SQLite allocates its rowids (by default, one more than the largest rowid currently in the table’s B-tree; with AUTOINCREMENT, from a counter in the sqlite_sequence table) and how PostgreSQL-style sequences behind SERIAL columns work. The key insight: the counter is just stored state, and updating it is part of the same transaction as the row write.


7. UPDATE = Overwrite (Why it Works)

LevelDB and FDB share a property: Set(key, newValue) on an existing key simply replaces it. There is no β€œupdate in place” at the storage level.

Our β€œUPDATE” is just another Insert call with the same PK:

db.Insert("users", Row{"id": 1, "name": "Alice", "city": "Tokyo"})
// This calls tr.Set(rowKey(table, encode(1)), msgpack(new_row))
// The old row bytes under the same key are replaced.

FDB does not copy the old value to a TOAST or undo segment β€” it just writes the new value. The old version is kept in FDB’s MVCC history for ~5 seconds (for in-flight read transactions) and then garbage-collected.

What changes when you have indexes:

Once you have secondary indexes (option-c), an update is no longer just a single key write. You must:

  1. Read the old row (to find out what old index entries to remove)
  2. Clear the old index entries
  3. Write the new row
  4. Write the new index entries

All in one transaction. This is why option-c’s PutRecord is more complex than our Insert.
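A hedged sketch of those four steps inside one transaction; clearIndexEntries and writeIndexEntries are hypothetical helpers over an index subspace, and newRowBytes is the msgpack-encoded new row:

_, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
    old, err := tr.Get(d.rowKey(table, encodedPK)).Get() // 1. read the old row
    if err != nil {
        return nil, err
    }
    if old != nil {
        clearIndexEntries(tr, table, old) // 2. clear its old index entries
    }
    tr.Set(d.rowKey(table, encodedPK), newRowBytes) // 3. write the new row
    writeIndexEntries(tr, table, newRowBytes)       // 4. write the new index entries
    return nil, nil
})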


8. DELETE = Clear (Why it Works)

func (d *DB) Delete(table string, pk any) error {
    encodedPK, _ := encodePK(pk)
    _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Clear(d.rowKey(table, encodedPK))
        return nil, nil
    })
    return err
}

Clear adds a tombstone to the transaction buffer. After commit, the key disappears from FDB as if it had never been written. Future reads return nil.

No cascades: unlike SQL with ON DELETE CASCADE, our engine has no foreign key enforcement. This is an intentional simplification. Real SQL engines resolve cascades by finding related rows (via indexes on the FK column) and deleting them in the same transaction.

What about DropTable?

func (d *DB) DropTable(name string) error {
    _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Clear(d.catalogKey(name))         // remove from catalog
        tr.ClearRange(d.tableRange(name))    // delete all rows
        tr.Clear(d.seqKey(name))             // reset rowid counter
        return nil, nil
    })
    return err
}

One transaction that deletes the catalog entry, all data rows, and the sequence counter. Atomically. A reader that has an open transaction might still see the table (MVCC), but any new transaction started after commit sees the table as gone.


9. Key Layout In Detail

Let’s trace exactly what happens when you call:

db.CreateTable(TableDef{Name: "users", PK: "id", Columns: [...]})
db.Insert("users", Row{"id": 1, "name": "Alice", "city": "Paris"})
db.Insert("users", Row{"id": 2, "name": "Bob",   "city": "Tokyo"})

With namespace "demo-sql":

Catalog key:
  "demo-sql" + 0x00 + "users"
  = 64 65 6d 6f 2d 73 71 6c 00 75 73 65 72 73
  β†’ value: msgpack({Name:"users", PK:"id", Columns:[...]})

Row key for id=1:
  "demo-sql" + 0x01 + "users" + 0x00 + encodeInt64(1)
  = 64 65 6d 6f 2d 73 71 6c 01 75 73 65 72 73 00 80 00 00 00 00 00 00 01
                                               ↑ encoded 1 (sign bit flipped)
  β†’ value: msgpack({id:1, name:"Alice", city:"Paris"})

Row key for id=2:
  "demo-sql" + 0x01 + "users" + 0x00 + 80 00 00 00 00 00 00 02
  β†’ value: msgpack({id:2, name:"Bob", city:"Tokyo"})

Table scan range:
  begin: "demo-sql" + 0x01 + "users" + 0x00
  end:   "demo-sql" + 0x01 + "users" + 0x01
  β†’ returns id=1 row, then id=2 row (in ascending id order)

The 0x00 separator between tag, table name, and PK ensures:

  • Table β€œuser” doesn’t collide with table β€œusers”
  • Row key doesn’t collide with catalog key (different tag byte: 0x01 vs 0x00)
  • Rows for table β€œusers” don’t collide with rows for table β€œorders” (different table name segment)

10. What We Didn’t Build (and Why Those Parts Are Hard)

Secondary indexes: We have SelectWhere but it does a full scan. Adding a secondary index (e.g., β€œfind all users where city=β€˜Paris’”) requires maintaining an extra key for every row that maps (city value, pk) β†’ "". The hard part is keeping the index consistent when rows change β€” see option-c.

SQL parsing: A SQL parser turns SELECT name FROM users WHERE city='Paris' into an AST. This is ~1000 lines of code for a minimal SQL grammar. We skip it because it adds no FDB-specific insight.

Query planner: Given multiple indexes, which one should we use? This is a search problem (enumerating candidate plans and estimating the cost of each) that modern databases spend enormous effort on. The basic version is simple (pick the most selective index), but getting it right for all cases is a career.

JOIN: Joining two tables in SQL is conceptually a nested loop (for each row in table A, find matching rows in table B). The hard part is making this fast when both tables are large. Most JOIN algorithms (hash join, sort-merge join) still reduce to: β€œscan one table, look up matching rows from the other.”

MVCC for uncommitted readers: We provide MVCC through FDB’s read versions, but we don’t expose BEGIN TRANSACTION / COMMIT / ROLLBACK at the SQL level. Adding this requires keeping an open FDB transaction for the duration of the user transaction β€” which is possible but requires careful lifecycle management.


11. Real-World Analogues

SQLite

SQLite stores its catalog in sqlite_master, a special table on page 1:

CREATE TABLE sqlite_master (
    type TEXT,    -- "table", "index", "view", "trigger"
    name TEXT,
    tbl_name TEXT,
    rootpage INTEGER,  -- page number of the B-tree root for this table
    sql TEXT           -- the original CREATE statement
);

Every table’s rows are stored in a B-tree rooted at rootpage. The B-tree keys are the rowid values (64-bit integers), except in WITHOUT ROWID tables, where the declared primary key is the B-tree key. The row data (all columns) is packed into the B-tree leaf value using SQLite’s own compact record encoding.

Our rowKey(table, encodedPK) is the FDB equivalent of the B-tree rowid. Our msgpack row serialization is the FDB equivalent of SQLite’s row record format.

MySQL InnoDB

InnoDB uses a clustered index β€” the primary key IS the B-tree. All columns are stored in the PK leaf node. Secondary indexes store the PK value (not the physical row location) as their β€œpointer”:

PRIMARY KEY index: pk β†’ all_columns
SECONDARY INDEX on city: city β†’ pk

To look up a row by city:

  1. Scan secondary index for city = 'Paris' β†’ get pk values
  2. Look up each pk in the primary index β†’ get the full row

This two-step lookup is called a β€œkey lookup” in MySQL’s EXPLAIN output. Our option-c-record-layer does exactly this.

PostgreSQL

PostgreSQL uses β€œheap files” β€” rows are stored in pages in insertion order, not sorted by PK. The PK is just another index (a B-tree) that maps pk β†’ ctid (page number + slot within page). Secondary indexes also map field_value β†’ ctid.

This is subtly different from InnoDB: in PostgreSQL, all indexes point to physical page locations, not to the PK. This makes index maintenance easier (an update to a non-indexed column doesn’t touch any index), but makes VACUUM (garbage collection of old row versions) more complex.

CockroachDB / TiKV / YugabyteDB

These β€œNewSQL” databases are essentially what we’re building here: a SQL query engine on top of a distributed, ordered, transactional KV store. Their storage architecture is:

SQL layer (parsing, planning, execution)
    ↓
Encoding layer (rows β†’ keys, indexes β†’ keys)
    ↓
Distributed KV layer (RocksDB + Raft replication per shard)

CockroachDB’s row encoding is:

/<table_id>/<index_id>/<col1_value>/<col2_value>/...  β†’  rest_of_columns

This is the same pattern as our <ns>/<tag>/<table_name>/<pk> β€” just with more type information and a smarter Tuple encoding.


12. Exercises

Exercise 1 β€” Add a Secondary Index

Modify Insert to also write index entries:

ns + 0x02 + tableName + 0x00 + columnName + 0x00 + encodedValue + 0x00 + pk β†’ ""

Modify Delete to also clear those entries (requires reading the row first). Add SelectByIndex(table, column, value) ([]Row, error).

Exercise 2 β€” Foreign Keys

Add a References field to Column:

type Column struct {
    Name       string
    Type       ColumnType
    References *ForeignKey  // nil if no FK
}

In Insert, check that the referenced PK exists (a Get inside the same transaction). In Delete, check that no child rows reference this PK (a GetRange on the FK index).

Exercise 3 β€” Transactions for Callers

Add a Begin() *Tx method that returns a transaction object. Tx should buffer operations and commit on tx.Commit() β€” all inside one FDB db.Transact() call.

Exercise 4 β€” DISTINCT and COUNT

Implement SelectDistinct(table, column) and Count(table) using a single FDB GetRange and in-process aggregation. This is your first aggregation operator β€” the same one query engines implement with hash tables or sort passes.
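A hedged sketch of Count, reusing the tableRange helper from section 5:

func (d *DB) Count(table string) (int, error) {
    v, err := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(d.tableRange(table), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return 0, err
    }
    return len(v.([]fdb.KeyValue)), nil
}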

Exercise 5 β€” Schema Evolution

Add a version number to TableDef. When Open() loads a table with version < current, run a migration function that adds default values for new columns.


13. Source Code Deep Dive β€” sqllayer/db.go

The entire engine is in sqllayer/db.go. Let’s trace the critical paths.

The DB Type

type DB struct {
    fdb     fdb.Database
    ns      []byte
    schemas map[string]TableDef
    mu      sync.RWMutex
}

schemas is an in-memory cache of the catalog. It is loaded at Open() time from FDB and updated by CreateTable/DropTable. The mutex protects concurrent access to schemas.

Design question: Why cache schemas in memory at all? Because every query needs the schema to know column types (for encoding/decoding) and the PK column name. Fetching the schema from FDB on every query would add a round-trip. The trade-off: the in-memory cache can be stale if another process creates/drops a table. Production systems solve this with schema versioning: every DDL operation increments a global version; processes re-sync when they detect a version mismatch.

Key Helpers

func (d *DB) catalogKey(table string) fdb.Key {
    return fdb.Key(append(append([]byte{}, d.ns...), append([]byte{0x00}, []byte(table)...)...))
}

func (d *DB) rowKey(table string, encodedPK []byte) fdb.Key {
    prefix := append(append([]byte{}, d.ns...), 0x01)
    prefix = append(prefix, []byte(table)...)
    prefix = append(prefix, 0x00)
    return fdb.Key(append(prefix, encodedPK...))
}

Tag bytes distinguish subspaces:

  • 0x00 β†’ catalog entries
  • 0x01 β†’ row data
  • 0x02 β†’ sequence (rowid counter)

This tag-byte approach is simpler than the full Tuple encoding but achieves the same goal: prevent catalog keys from colliding with row keys even if a table is named β€œusers” and another subspace prefix also uses β€œusers”.

The Sequence Counter Pattern

func (d *DB) nextRowid(tr fdb.Transaction, table string) (int64, error) {
    key := d.seqKey(table)
    cur, _ := tr.Get(key).Get()
    var n int64 = 1
    if len(cur) == 8 {
        n = int64(binary.LittleEndian.Uint64(cur)) + 1
    }
    b := make([]byte, 8)
    binary.LittleEndian.PutUint64(b, uint64(n))
    tr.Set(key, b)
    return n, nil
}

This is a read-modify-write within one transaction. The caller passes tr (an active transaction). The Get reads the current counter value. The Set increments it. Both are in the same transaction as the row write.

If two concurrent inserts both read the same counter value, they’ll both try to write n+1 β€” but FDB’s conflict detection will catch this: both transactions read the same key, and both tried to write it. One will conflict and retry, which re-reads the counter and gets a fresh value. The retry is invisible to the caller.

Why not use FDB’s atomic add?

one := []byte{1, 0, 0, 0, 0, 0, 0, 0} // little-endian 1: Add treats its operand as a little-endian integer
tr.Add(key, one)                       // atomic increment, no read-write conflict

tr.Add is a commutative atomic operation β€” it doesn’t cause read-write conflicts. But it returns no value β€” you can’t know what rowid was assigned. If you need to use the rowid in the same transaction (to build the row key), you need the read-modify-write pattern. If the rowid is just an opaque row identifier that callers don’t use, tr.Add plus a versionstamp (to create a unique, ordered key without knowing the exact value) is a better approach.

The encodePK Function

func encodePK(v any) ([]byte, error) {
    switch x := v.(type) {
    case int:
        return encodeInt64(int64(x)), nil
    case int64:
        return encodeInt64(x), nil
    case string:
        return []byte(x), nil
    case []byte:
        return x, nil
    default:
        return nil, fmt.Errorf("unsupported PK type %T", v)
    }
}

String PKs are stored as raw UTF-8 bytes. This sorts correctly for ASCII but not for Unicode (e.g., β€œΓ©β€ sorts after β€œz” in byte order but before β€œz” in French locale). For a production system, use the ICU library for locale-aware collation, or restrict PKs to ASCII.

Integer PKs use encodeInt64: big-endian with sign-bit flipped. Read Chapter 3 of this guide for the derivation.

What about composite PKs? (user_id, order_id) as a PK means the row key is encode(user_id) + 0x00 + encode(order_id). The separator between components is 0x00 (the Tuple encoding style). This naturally makes rows for the same user adjacent in key order β€” a range scan GetRange(encode(user_id) + 0x00, encode(user_id) + 0x01) returns all orders for that user. Our implementation only supports single-column PKs, but the extension is straightforward.


14. Production Architecture β€” What a Real SQL-on-FDB Looks Like

The five components of a production SQL-on-FDB system:

Component 1: SQL Parser

Takes SQL text, produces an AST. For SELECT name FROM users WHERE city='Paris':

SelectStatement {
    Columns: ["name"],
    Table:   "users",
    Where:   EqualExpr{Column: "city", Value: "Paris"},
}

Libraries: pingcap/tidb/parser (Go, production-grade), xwb1989/sqlparser (Go, simpler), or write your own using gocc or pigeon.

Component 2: Query Planner

Takes the AST, looks at available indexes, estimates row counts, and picks the cheapest plan:

Without index on city:
  plan: TableScan("users") β†’ Filter(city='Paris')
  cost: O(n_rows) reads

With index on city:
  plan: IndexScan(city='Paris') β†’ PKLookup("users", [pk1, pk2, ...])
  cost: O(index_entries) + O(result_rows) reads

The planner is the hardest part of a query engine. Modern planners use statistics (histograms of column value distributions) stored in a catalog table to estimate n_rows for each plan.

Component 3: Encoding Layer (what we built)

Rows β†’ keys in FDB. Indexes β†’ keys in FDB. All with sort-preserving encodings.

Component 4: FDB Transaction Management

SQL has BEGIN / COMMIT / ROLLBACK. FDB has Transact. Bridging these requires holding an FDB Transaction object open for the duration of the SQL transaction. The FDB 5-second limit means long-running SQL transactions must be broken up or use FDB’s β€œgrv” caching.

Component 5: Network Protocol

Postgres wire protocol (libpq-compatible), MySQL wire protocol, or a custom protocol. This is what lets standard clients (psql, JDBC, SQLAlchemy) connect. CockroachDB implements the Postgres wire protocol on top of their FDB-like storage.


15. Interview Questions β€” Relational Databases and FDB

Q: What is a clustered index, and how does option-a-sqlite implement it?

A clustered index is an index where the data rows are stored at the index leaf nodes, sorted by the index key. In option-a-sqlite, the FDB key for each row is built from the primary key (namespace + table + encoded PK), and the row data is the value at that key. Range scanning by primary key returns rows in PK order and fetches the data in the same operation. This is a clustered index: the data itself is organized by PK. InnoDB uses the same design (the primary-key B-tree is the table).

Q: What happens when two concurrent transactions try to insert a row with the same primary key?

Both transactions read the same row key (which doesn’t exist yet β€” they get nil). Both write a new value to that key. FDB’s conflict detection kicks in: one transaction commits; the other conflicts, because a key in its read set was written by the committed transaction. The conflicting transaction retries from the beginning. On retry, it reads the row key again β€” this time it exists. Depending on application semantics, the retry can: (a) return a duplicate key error, (b) return the existing row (INSERT OR IGNORE semantics), or (c) replace it (INSERT OR REPLACE).

Q: Why does PostgreSQL store rows in heap order rather than clustered by PK?

PostgreSQL’s original design assumed that MVCC updates would frequently need to move rows (updating a row creates a new row version at a new physical location; the old version stays until VACUUM). If rows were clustered by PK, a moved row would need to update all indexes to point to the new physical location. By using heap files with stable physical locations (ctid), updates only create a new heap entry and update the visibility chain β€” indexes don’t need to change. The trade-off: PK range scans in PostgreSQL require index traversal + random I/O into the heap, while InnoDB’s clustered index gives sequential I/O for PK range scans.

Q: How does your row encoding handle schema migrations β€” adding a new column to a table with existing rows?

With msgpack, adding a new column to the schema is safe for new rows (they include the new column). Old rows, when read back, simply won’t have the column in the map; application code should fill in defaults for missing fields. This is backward compatibility: new code can still read old data. The separate, harder job is the backfill: if every row must actually contain the new column (say, for a NOT NULL constraint), you must run a background migration job that reads every old row, adds the default value, and writes it back β€” one paginated batch per transaction.

This is how every production database handles ALTER TABLE ADD COLUMN.

Option B β€” LevelDB on top of FoundationDB

Pattern: existing storage engine, FDB as the disk. We give the unmodified goleveldb library an FDB-backed implementation of its storage.Storage interface. LevelDB still does its LSM thing (memtables, SSTables, compaction, MANIFEST), but every byte ends up in FDB key ranges instead of on local disk.

This is the mirror image of option-a: the storage engine sits below FDB rather than above it.

How LevelDB sees the world

goleveldb accesses persistence exclusively through a small interface:

type Storage interface {
    Lock() (Locker, error)
    Log(str string)
    SetMeta(FileDesc) error
    GetMeta() (FileDesc, error)
    List(FileType) ([]FileDesc, error)
    Open(FileDesc) (Reader, error)
    Create(FileDesc) (Writer, error)
    Remove(FileDesc) error
    Rename(old, new FileDesc) error
    Close() error
}

Each FileDesc is {Type, Num} β€” e.g. {TypeTable, 42} for SST #42 or {TypeManifest, 7} for the 7th manifest. Filenames are an implementation detail; LevelDB never looks at strings.

Our fdbstorage package implements that interface against FDB. The whole file is ~250 lines.

Key layout

<ns> 0x01 <ftype:1B> <num:int64 BE>                 -> uint64 BE file size
<ns> 0x02 <ftype:1B> <num:int64 BE> <chunk:uint32 BE> -> 64 KiB chunk
<ns> 0x03                                             -> current MANIFEST {ftype,num}
<ns> 0x04                                             -> lock marker
  • Files are split into 64 KiB chunks so we stay well under FDB’s 100 KiB-per- value soft limit and the 10 MB-per-transaction hard limit.
  • Create returns a Writer that buffers in memory and flushes on Sync / Close. We split the flush across multiple transactions (100 chunks each β‰ˆ 6 MiB) to safely handle files larger than 10 MB.
  • Rename is implemented as copy-then-clear inside one transaction. LevelDB only renames small files (temp β†’ real on flush completion), so the inefficiency doesn’t matter.
  • SetMeta writes the manifest pointer atomically. Because FDB transactions are serializable, two concurrent flushes can’t observe a half-rotated manifest.

Why this is interesting

You get a real LevelDB instance β€” with bloom filters, compaction, snapshots, the works β€” whose durability story is β€œwhatever FDB’s durability story is.” That means:

  • Geo-replication and read scaling come for free from the FDB cluster.
  • Backups are FDB backups.
  • The local node has no on-disk state at all; it can crash and restart against a different FDB coordinator without losing anything.

The cost is latency: every SST read is at least one FDB round-trip, every flush is many. This isn’t a production architecture; it’s a teaching artifact that proves how cleanly the layers separate.

Running

cd option-b-leveldb
go mod tidy
go run ./demo -cluster ../fdb.cluster

Expected output (the second session re-opens and reads the persisted data):

First session: wrote 3 keys, then closed.
Reopening LevelDB on the same FDB namespace...

  apple -> red
  banana -> yellow
  cherry -> red

Iterating the whole DB:
  apple -> red
  banana -> yellow
  cherry -> red

What this implementation skips

  • Locker isn’t multi-process safe across long-lived processes β€” if a holder crashes the lock key stays set. A production version would attach the lock to a client UUID and TTL it via FDB watches.
  • Reader loads the whole file into memory. LevelDB SSTs are bounded (default 2 MB), so this is fine for a demo but not for huge tables.
  • No caching layer. Every Open is a fresh FDB scan. A real impl would cache hot SSTs.

Read fdbstorage/storage.go β€” the whole thing is one file deliberately, so you can follow the data flow end to end.

Hitchhiker’s Guide β€” Option B: LevelDB on FDB Storage

The question this answers: β€œCan I run a real LSM-tree storage engine β€” the actual LevelDB binary with all its compaction logic β€” with its files stored in FoundationDB instead of a local disk?”

The deeper question: β€œWhat is the storage.Storage interface, why does it exist, and what does it tell us about how databases handle file I/O?”


Table of Contents

  1. The Storage Abstraction: Why LevelDB Has a Plugin Point
  2. What LevelDB Actually Writes to Disk
  3. The storage.Storage Interface β€” Dissected
  4. How We Map LevelDB Files to FDB Keys
  5. Chunking: Overcoming the 100 KiB Value Limit
  6. Atomic Rename β€” Durability’s Secret Weapon
  7. The Writer: Batching Chunks into Transactions
  8. Why the WAL is Redundant With FDB
  9. The Blob Layer Pattern
  10. Real-World Analogues: RocksDB on Cloud Storage
  11. Exercises

1. The Storage Abstraction: Why LevelDB Has a Plugin Point

LevelDB’s storage.Storage interface exists because the original LevelDB authors (Jeff Dean and Sanjay Ghemawat) designed it for portability. Not every environment has a POSIX filesystem. Google has internal systems where storage might be Bigtable, Colossus, or a custom log-structured store.

The interface says: β€œif you can implement these ten methods, LevelDB will run on your storage.” The application code (goleveldb) doesn’t know whether it’s writing to ext4, NTFS, GCS, or FDB β€” it just calls the interface.

type Storage interface {
    Lock() (Locker, error)
    Log(str string)
    SetMeta(fd FileDesc) error
    GetMeta() (FileDesc, error)
    List(ft FileType) ([]FileDesc, error)
    Open(fd FileDesc) (Reader, error)
    Create(fd FileDesc) (Writer, error)
    Remove(fd FileDesc) error
    Rename(oldfd, newfd FileDesc) error
    Close() error
}

Our fdbstorage.Storage implements all of these, storing files as FDB key-value pairs. LevelDB itself (in the syndtr/goleveldb package) calls these methods. It has no idea the β€œfiles” are actually chunks in a distributed database.


2. What LevelDB Actually Writes to Disk

To understand what we need to store, let’s look at what LevelDB writes:

File types:
  TypeJournal  (.log)  β€” Write-Ahead Log: records every write before the
                         MemTable is flushed. Used to recover unflushed writes
                         after a crash.
  TypeManifest (.MANIFEST) β€” Lists which SSTables are "live" (not yet
                         garbage-collected). Updated at each compaction.
  TypeTable    (.ldb / .sst) β€” Sorted String Tables. Immutable, sorted KV
                         data files produced by compaction.
  TypeCurrent  (CURRENT) β€” A single file containing the name of the latest
                         MANIFEST file.
  TypeTemp     (.tmp)   β€” Temporary files used during compaction.
  TypeLock     (LOCK)   β€” A file held open to prevent two processes from
                         opening the same database simultaneously.

File descriptor:
type FileDesc struct {
    Type FileType  // TypeJournal, TypeManifest, etc.
    Num  int64     // unique file number (monotonically increasing)
}

A LevelDB database directory looks like:

000003.log       ← journal (WAL)
000004.ldb       ← SSTable level 0
000005.ldb       ← SSTable level 0
MANIFEST-000002  ← current manifest
CURRENT          ← "MANIFEST-000002\n"
LOCK             ← lockfile

When compaction happens:

  1. LevelDB picks some SSTables, merges and sorts them into a new SSTable.
  2. It writes the new SSTable as a .tmp file (via Create(TypeTemp, ...))
  3. It renames the .tmp to the final .ldb name (via Rename)
  4. It updates the MANIFEST to list the new SSTable and de-list the old ones.
  5. It removes the old SSTables (via Remove).

This is the temp-then-rename durability pattern: create a new file atomically, then rename it into place. POSIX rename is atomic β€” the old name or the new name is visible, never a partial file. Our FDB implementation must replicate this property.


3. The storage.Storage Interface β€” Dissected

Let’s look at each method and what it does:

Lock() (Locker, error) Prevents two processes from opening the same database simultaneously. We implement this by writing a β€œlock” key to FDB; releasing the lock clears it.

Create(fd FileDesc) (Writer, error) and Open(fd FileDesc) (Reader, error) Create starts a new file (for writing). Open opens an existing file (for reading). In FDB terms: Create returns a writer that buffers bytes; Open reads all chunks for the file into memory and returns a bytes.Reader.

Remove(fd FileDesc) error Deletes a file. In FDB: ClearRange over all chunk keys for this file.

Rename(oldfd, newfd FileDesc) error Renames a file atomically. In FDB: copy all chunks from oldfd keys to newfd keys, then clear all oldfd keys β€” in one transaction. This is the atomic rename.

GetMeta() (FileDesc, error) and SetMeta(fd FileDesc) error Get/set the β€œcurrent” file pointer β€” which MANIFEST is current. In FDB: a single key (ns + tagManifest + 0x00) stores the current FileDesc. This replaces the CURRENT file in LevelDB’s original design.

List(ft FileType) ([]FileDesc, error) Lists all files matching the given type mask. We implement this as a range scan over the meta-key prefix; we store a meta key for each file alongside its data.
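A hedged sketch of that scan; metaRange and decodeMetaKey are hypothetical helpers (a range over the 0x01 tag, and parsing the type and number back out of a meta key):

func (s *Storage) List(ft storage.FileType) ([]storage.FileDesc, error) {
    v, err := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(s.metaRange(), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    var fds []storage.FileDesc
    for _, kv := range v.([]fdb.KeyValue) {
        fd := decodeMetaKey(kv.Key)
        if fd.Type&ft != 0 { // goleveldb passes a FileType bitmask to filter on
            fds = append(fds, fd)
        }
    }
    return fds, nil
}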


4. How We Map LevelDB Files to FDB Keys

Each LevelDB file is identified by (FileType, FileNum). We encode this as:

Meta key (file existence + type):
  ns + tagFileMeta(0x01) + fileType(1 byte) + fileNum(8 bytes BE)
  β†’ msgpack({Type: ft, Num: n, size: totalBytes})

Data chunks:
  ns + tagFileData(0x02) + fileType(1 byte) + fileNum(8 bytes BE) + chunkNum(8 bytes BE)
  β†’ up to 64 KiB of file data

Manifest pointer (replaces CURRENT file):
  ns + tagManifest(0x03)
  β†’ msgpack(FileDesc{Type: TypeManifest, Num: n})

Lock key:
  ns + tagLock(0x04)
  β†’ "locked" (any non-empty value means locked)

Why big-endian for file numbers?

Big-endian encoding preserves sort order. File numbers are monotonically increasing (LevelDB never reuses a file number). By storing them big-endian, a range scan over ns+tagFileData+ft+num+* returns chunks in chunk-number order β€” which is the correct order to reassemble the file. Without a fixed-width big-endian encoding β€” say, decimal strings or little-endian integers β€” byte order and numeric order diverge: the string β€œ10” sorts before β€œ2”, and little-endian 256 (00 01 00 ...) sorts before little-endian 1 (01 00 00 ...). Fixed-width big-endian bytes compare in exactly the same order as the numbers they encode.

File number and type as part of the key:

This means all chunks of file (TypeJournal, 3) sort together, before all chunks of (TypeJournal, 4), which sort before (TypeTable, 5). Clean, hierarchical key organization.


5. Chunking: Overcoming the 100 KiB Value Limit

FDB has a hard limit: values may not exceed 100 KiB (102,400 bytes). A typical LevelDB SSTable is 2–4 MB. We cannot store it in one FDB value.

Our solution: split each file into 64 KiB chunks:

const chunkSize = 64 * 1024  // 65,536 bytes

// Writing a 180 KiB (184,320-byte) file:
// Chunk 0: bytes [0, 65536)
// Chunk 1: bytes [65536, 131072)
// Chunk 2: bytes [131072, 184320)  (partial last chunk)

Each chunk is stored as a separate FDB key-value pair:

ns+0x02+ft+num+00000000_00000000  β†’ 65536 bytes
ns+0x02+ft+num+00000000_00000001  β†’ 65536 bytes
ns+0x02+ft+num+00000000_00000002  β†’ 53248 bytes (partial)

Reading a file: range-scan all chunk keys for (ft, num), sort by chunk number (already in order due to big-endian encoding), concatenate the values.

// bytes.Reader provides ReadSeeker and ReaderAt; the small wrapper adds the
// no-op Close that goleveldb's storage.Reader interface requires.
type memReader struct{ *bytes.Reader }

func (memReader) Close() error { return nil }

func (s *Storage) Open(fd storage.FileDesc) (storage.Reader, error) {
    kvs, _ := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(s.dataRange(fd), fdb.RangeOptions{}).GetSliceWithError()
    })
    var buf []byte
    for _, kv := range kvs.([]fdb.KeyValue) {
        buf = append(buf, kv.Value...)
    }
    return memReader{bytes.NewReader(buf)}, nil
}

Why 64 KiB chunks?

  • Smaller than FDB’s 100 KiB value limit βœ“
  • Large enough to minimize key overhead (a 4 MB SSTable = 64 chunks, not thousands)
  • Aligns with filesystem block sizes (4–64 KiB typical)

6. Atomic Rename β€” Durability’s Secret Weapon

POSIX rename(src, dst) is the single most important durability primitive in filesystems. Its contract: after rename returns, dst exists and src does not, with no window where neither exists. This is atomic replacement.

LevelDB uses rename heavily:

  • Rename(TypeTemp, n, TypeTable, n): promote temp SSTable to final name
  • Rename(TypeTemp, n, TypeManifest, n): promote temp manifest

Without atomicity, a crash during rename could leave:

  • Neither file existing β†’ data loss
  • Both files existing β†’ ambiguity about which is current
  • A partial file at dst β†’ corruption

In FDB, we implement atomic rename as copy + clear in one transaction:

func (s *Storage) Rename(oldfd, newfd storage.FileDesc) error {
    _, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // 1. Read all old chunks
        kvs, _ := tr.GetRange(s.dataRange(oldfd), fdb.RangeOptions{}).GetSliceWithError()

        // 2. Clear old data
        tr.ClearRange(s.dataRange(oldfd))
        tr.Clear(s.metaKey(oldfd))

        // 3. Write new data
        for _, kv := range kvs {
            newKey := s.translateChunkKey(kv.Key, oldfd, newfd)
            tr.Set(newKey, kv.Value)
        }
        tr.Set(s.metaKey(newfd), metaBytes)
        return nil, nil
    })
    return err
}

One transaction. The cluster either commits all of this (old is gone, new is present) or none of it (crash safety). The atomicity guarantee is identical to POSIX rename β€” and arguably stronger, since FDB replicates the commit across multiple machines before returning.

The 10 MB transaction limit:

FDB transactions are limited to ~10 MB of reads + writes. A large SSTable (4 MB) would have chunks adding up to 4 MB of writes in one transaction. That’s under the 10 MB limit. But 8 MB SSTables would be risky.

Our Rename reads all chunks in the transaction (4 MB reads) and writes them all back (4 MB writes) β€” totaling 8 MB. Safe for typical LevelDB files.

For larger files, we’d need to either:

  1. Break the rename into multiple transactions (violating atomicity), or
  2. Use a two-phase approach: write new chunks in a first transaction, then atomically swap the meta key in a second transaction (using a PENDING state key as the β€œin-progress rename” marker).

7. The Writer: Batching Chunks into Transactions

When LevelDB writes a new SSTable, it calls Create(fd) which returns a Writer. The writer accumulates bytes via Write(p []byte). When Close() is called, we flush everything to FDB.

Batch size:

const maxChunksPerTx = 100  // 100 Γ— 64 KiB = 6.4 MB per transaction

We flush up to 100 chunks per FDB transaction. This stays well within the 10 MB limit. A 20 MB SSTable would be flushed in 4 transactions of 5 MB each.

func (w *writer) flush(final bool) error {
    start := w.flushedChunks
    end := start + maxChunksPerTx
    if end > len(w.chunks) {
        end = len(w.chunks)
    }
    _, err := w.s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        for i := start; i < end; i++ {
            tr.Set(w.s.chunkKey(w.fd, i), w.chunks[i])
        }
        if final && end == len(w.chunks) {
            // Write the meta key only on the final flush
            tr.Set(w.s.metaKey(w.fd), metaBytes)
        }
        return nil, nil
    })
    w.flushedChunks = end
    return err
}

The meta-key-last invariant:

We write the meta key (the file’s β€œdirectory entry”) only in the last batch of chunks. This ensures that List() never returns a file whose chunks are only partially written β€” the file is only β€œvisible” once all its chunks exist.

This is the FDB equivalent of:

  1. Write all content to a temp file
  2. rename(temp, final) atomically

8. Why the WAL is Redundant With FDB

LevelDB’s Write-Ahead Log (WAL / journal, TypeJournal) exists for one reason: crash recovery. If the process crashes after writing to the in-memory MemTable but before flushing the MemTable to an SSTable on disk, the WAL is replayed to reconstruct the MemTable.

With FDB as the storage backend:

Every write is already durable before Write(p) returns.

Our writer.Write buffers bytes in memory. Our writer.Close flushes to FDB in transactions. Each FDB Transact call does not return until the commit is confirmed by FDB’s replication protocol β€” the data is on at least f+1 machines (where f is the fault tolerance level, typically 2). A process crash after Close() returns means the data is safe.

The WAL is protecting against β€œdata written to OS memory but not yet on disk.” FDB’s Transact eliminates this window. By the time the WAL file is written through our fdbstorage.Writer, the bytes are already in FDB.

A production implementation would patch goleveldb to skip journal writes entirely, or hand it a no-op journal writer. That removes one full copy of every logical write from FDB, roughly halving the bytes written per commit, and reduces key usage.


9. The Blob Layer Pattern

FDB’s core team documented the β€œBlob Layer” pattern: storing binary blobs (arbitrary large byte arrays) in FDB by chunking them. Our file storage is an instance of this pattern.

The Blob Layer pattern:

blob_key + chunkNum  β†’  chunk_data

It solves the 100 KiB value limit while preserving atomic operations on the whole blob (via FDB transactions) and efficient byte-range access (read only the chunks you need, e.g., for seeking within a large file).

Applications:

  • Store large media files (> 100 KiB) in FDB for atomic metadata-plus-content updates
  • Store ML model weights alongside their metadata records
  • Store serialized protocol buffers larger than 100 KiB
  • Back any file system abstraction (exactly what we’re doing)

10. Real-World Analogues

RocksDB Remote Compaction (Project Titan, Ripple)

Meta (Facebook) runs RocksDB on distributed storage in some configurations. Their β€œRipple” project stores RocksDB SSTables in a distributed block store (similar to HDFS or GFS). The storage interface they use is exactly the same concept: RocksDB writes β€œfiles” via an abstract interface; the implementation stores chunks in a distributed system.

TiKV on Disaggregated Storage

TiDB (PingCAP) is moving toward a disaggregated architecture where TiKV (which uses RocksDB internally) stores its SSTables in object storage (S3, GCS). The TiKV storage engine writes SSTables through an abstract file interface to S3. This is identical to our pattern.

Pebble (CockroachDB)

CockroachDB replaced RocksDB with Pebble (a Go implementation) in 2021. Pebble has a vfs.FS interface β€” a virtual filesystem abstraction β€” that allows swapping the storage backend. CockroachDB uses this for testing (an in-memory FS) and is exploring using it for cloud storage.

The Pattern’s Universality

Every LSM-tree engine eventually adds a pluggable storage interface:

  • LevelDB: storage.Storage
  • RocksDB: Env (virtual filesystem)
  • Pebble: vfs.FS
  • WiredTiger: WT_FILE_SYSTEM

Why? Because running the compaction engine without worrying about where data lives is architecturally clean. The engine is responsible for LSM semantics; the storage interface is responsible for durability. Separation of concerns.


11. Exercises

Exercise 1 β€” Streaming Reader

Instead of materializing the entire file into memory in Open(), return an io.ReadSeekCloser that fetches chunks lazily. A read at offset 128 KiB should only fetch chunks 2–3, not chunk 0 and 1.

This reduces memory usage for large SSTables and enables efficient Seek(offset, io.SeekStart) for random-access reads.

Exercise 2 β€” File Size Cache

List() currently returns all file descriptors by scanning the meta keys. Open(fd) reads the meta key to get the file size, then reads all chunk keys.

Add a small in-memory LRU cache mapping FileDesc β†’ size. On Open, check the cache first. Invalidate the cache entry on Remove and Rename.

Measure the reduction in FDB round-trips for a workload with many small reads on recently-opened files.

Exercise 3 β€” Compression

Before storing each 64 KiB chunk, compress it with compress/flate or github.com/golang/snappy. Store a compression-type byte in the meta key. On read, decompress transparently.

LevelDB SSTables are already internally compressed (Snappy by default), so this may not reduce size much for TypeTable files. But TypeJournal files are not compressed and might benefit.

Exercise 4 β€” Two-Phase Large Rename

For files larger than 5 MB (which would exceed the transaction limit in our current Rename), implement the two-phase rename:

Phase 1: Write all new chunks in multiple transactions. Write a β€œrename-pending” key: ns+tagPending+oldfd β†’ newfd.

Phase 2: In one transaction, atomically: clear the pending key, clear all old chunks and meta, set the meta for new fd (chunks already exist).

On startup, check for any pending keys and complete or roll back the rename. This is essentially a two-phase commit for large file renames.

Exercise 5 β€” Multi-Tenant Databases

Add a namespace concept: allow multiple LevelDB databases to share one FDB cluster with independent key spaces. Each New(fdb, namespace) call returns a storage implementation that is completely isolated from others.

This is how mvsqlite handles multiple SQLite β€œdatabase files” β€” each is an FDB namespace.


12. Source Code Deep Dive β€” fdbstorage/storage.go

The Storage Struct

type Storage struct {
    db  fdb.Database
    ns  []byte
}

Minimal. db is the FDB connection; ns is the byte prefix for all keys. The entire storage is two fields. All complexity lives in the key encoding and transaction logic.

Key Encoding Helpers

func (s *Storage) metaKey(fd storage.FileDesc) fdb.Key {
    // ns + 0x01 + type(1 byte) + num(8 bytes big-endian)
    key := make([]byte, len(s.ns)+10)
    copy(key, s.ns)
    key[len(s.ns)] = 0x01
    key[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(key[len(s.ns)+2:], uint64(fd.Num))
    return fdb.Key(key)
}

func (s *Storage) chunkKey(fd storage.FileDesc, chunkNum int) fdb.Key {
    // ns + 0x02 + type(1 byte) + num(8 bytes) + chunk(8 bytes)
    key := make([]byte, len(s.ns)+18)
    copy(key, s.ns)
    key[len(s.ns)] = 0x02
    key[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(key[len(s.ns)+2:], uint64(fd.Num))
    binary.BigEndian.PutUint64(key[len(s.ns)+10:], uint64(chunkNum))
    return fdb.Key(key)
}

Why 8-byte big-endian for chunkNum? Chunk numbers are read back via GetRange which returns chunks in key order. Big-endian ensures key order equals chunk number order. If we used little-endian, chunk 256 (LE: 00 01 00 00 00 00 00 00) would sort before chunk 1 (LE: 01 00 00 00 00 00 00 00) β€” wrong.
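
A two-line check makes the difference concrete (standalone snippet, not repo code):

// Big-endian encodings compare in the same order as the integers themselves;
// little-endian encodings do not.
a, b := make([]byte, 8), make([]byte, 8)
binary.BigEndian.PutUint64(a, 1)
binary.BigEndian.PutUint64(b, 256)
fmt.Println(bytes.Compare(a, b)) // -1: chunk 1 sorts before chunk 256 (correct)

binary.LittleEndian.PutUint64(a, 1)   // 01 00 00 00 00 00 00 00
binary.LittleEndian.PutUint64(b, 256) // 00 01 00 00 00 00 00 00
fmt.Println(bytes.Compare(a, b)) // +1: chunk 256 would sort before chunk 1 (wrong)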

The dataRange Helper

func (s *Storage) dataRange(fd storage.FileDesc) fdb.KeyRange {
    begin := s.chunkKey(fd, 0)
    // end: same prefix but with chunkNum = MaxUint64 + 1 β€” use next-prefix trick
    endPrefix := make([]byte, len(s.ns)+10) // ns + 0x02 + type + num
    copy(endPrefix, s.ns)
    endPrefix[len(s.ns)] = 0x02
    endPrefix[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(endPrefix[len(s.ns)+2:], uint64(fd.Num))
    end := append(endPrefix, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF)
    return fdb.KeyRange{Begin: begin, End: fdb.Key(append(end, 0x01))}
}

This range covers all chunk keys for (type, num) regardless of chunkNum. GetRange(dataRange(fd)) fetches all chunks in order.

The List() Implementation

func (s *Storage) List() ([]storage.FileDesc, error) {
    kvs, _ := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(s.metaRange(), fdb.RangeOptions{}).GetSliceWithError()
    })
    var fds []storage.FileDesc
    for _, kv := range kvs.([]fdb.KeyValue) {
        var fd storage.FileDesc
        msgpack.Unmarshal(kv.Value, &fd)
        fds = append(fds, fd)
    }
    return fds, nil
}

A single range scan over all meta keys returns all files in one round-trip. LevelDB calls List() at startup to find all existing files. With FDB, this is O(1) round-trips regardless of file count.

With a local filesystem, List() is an opendir/readdir pass over one directory: cheap, but served by the local disk. With FDB, the same scan is a single network round-trip to a storage server, which for a nearby cluster is in the same latency ballpark as local disk I/O.

The Lock Implementation

func (s *Storage) Lock() (util.Releaser, error) {
    _, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        existing, _ := tr.Get(s.lockKey()).Get()
        if len(existing) > 0 {
            return nil, errors.New("storage: already locked")
        }
        tr.Set(s.lockKey(), []byte("locked"))
        return nil, nil
    })
    if err != nil {
        return nil, err
    }
    return util.ReleaserFunc(func() {
        s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
            tr.Clear(s.lockKey())
            return nil, nil
        })
    }), nil
}

The lock is an FDB key. Acquiring the lock: check if the key exists; if not, set it β€” in one atomic transaction. This check-then-set is race-free because FDB’s optimistic concurrency ensures that if two processes both read β€œno lock” and both try to write β€œlocked”, only one will commit (the other will conflict and retry, then find the lock held).

Limitation: This is a process-level lock, not a durable lease. If the lock-holding process crashes without calling Release(), the lock remains set until manually cleared. For production, use a lock with an expiry: store the lock as {holder: processID, expires: time.Now().Add(30*time.Second)} and have each lock holder refresh it periodically. A lock that isn’t refreshed is treated as expired.
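
A sketch of such a lease, reusing the repo's lockKey and msgpack helpers; the lease struct, TTL, and refresh policy are assumptions, and it relies on loosely synchronized clocks between lock holders:

type lease struct {
    Holder  string
    Expires time.Time
}

func (s *Storage) tryAcquireLease(holder string, ttl time.Duration) error {
    _, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        existing, err := tr.Get(s.lockKey()).Get()
        if err != nil {
            return nil, err
        }
        if len(existing) > 0 {
            var cur lease
            if err := msgpack.Unmarshal(existing, &cur); err == nil &&
                cur.Holder != holder && time.Now().Before(cur.Expires) {
                return nil, errors.New("storage: lease held by " + cur.Holder)
            }
        }
        // Lock is free, expired, or already ours: (re)write it with a fresh expiry.
        b, _ := msgpack.Marshal(lease{Holder: holder, Expires: time.Now().Add(ttl)})
        tr.Set(s.lockKey(), b)
        return nil, nil
    })
    return err
}

// The holder calls tryAcquireLease again periodically (e.g., every ttl/3) to
// refresh its lease; a holder that stops refreshing loses the lock after ttl.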


13. Production Considerations

13.1 Transaction Size for Large SSTables

LevelDB level-0 SSTables are 2–4 MB. Level-1 SSTables are larger (up to L1_target_size, configurable). For an L1_target_size of 64 MB, level-1 SSTables are 64 MB each. Our current Rename would fail for files this large (it exceeds the 10 MB transaction limit).

Solution: For production, configure LevelDB’s CompactionTableSize to keep SSTables small:

opts := &opt.Options{
    CompactionTableSize: 2 * 1024 * 1024,  // 2 MB SSTables
}

2 MB SSTables = 32 chunks of 64 KiB. Rename transaction: 32 reads + 32 writes = 4 MB total. Well within limits.

13.2 Read Performance for Large SSTables

Reading a 2 MB SSTable requires fetching 32 chunks from FDB. Our current implementation reads them in one GetRange β€” one round-trip, 32 key-value pairs returned. Latency: ~1–5 ms (FDB cluster local read latency).

A local filesystem read of 2 MB: ~1–3 ms on SSD, ~10–20 ms on HDD.

For a warm FDB cluster, FDB storage is competitive with SSDs and dramatically better than spinning disks. For random chunk access (seeking within large files), FDB may be faster because it can pipeline multiple point reads, while a spinning disk requires physical seeking.

13.3 Write Amplification

Our chunking adds write amplification: writing a 64 KiB chunk requires writing the chunk key (18 bytes) + value (64 KiB) = 64 KiB + 18 bytes. The key overhead is <0.03%, negligible.

But FDB itself adds write amplification internally: each committed transaction is written to the Transaction Log (TLog), then asynchronously applied to Storage Servers. The TLog write is sequential (fast). The Storage Server write is to FDB’s B-tree (with its own write amplification). FDB’s overall write amplification is roughly 3–5x β€” comparable to RocksDB’s LSM write amplification.

13.4 Monitoring

Key metrics for a fdbstorage-backed LevelDB deployment:

  • FDB transaction latency P99: should be < 10ms for small transactions (meta reads)
  • FDB range scan bytes/second: correlates with compaction throughput
  • FDB conflict rate: if high, indicates concurrent compaction and write contention
  • LevelDB metrics via db.GetProperty("leveldb.stats"): still valid β€” LevelDB reports its own view of compaction and SSTable counts, just the β€œdisk I/O” is actually FDB I/O

14. Interview Questions β€” Storage Abstractions and LSM Trees

Q: What is the purpose of the Rename operation in LevelDB’s storage interface, and how does your FDB implementation preserve its atomicity guarantee?

Rename is LevelDB’s way of atomically promoting a new SSTable (or MANIFEST) into production. During compaction, LevelDB writes the new SSTable to a temp file, then renames it to its final name. POSIX rename is atomic: either the old name or the new name is visible, never a half-written file. Our FDB implementation reads all chunks with the old file descriptor, writes them with the new file descriptor, and clears the old keys β€” all in one FDB transaction. FDB’s transaction atomicity provides the same guarantee: either the old keys or the new keys are visible, never both or neither.

Q: Why does LevelDB use a Write-Ahead Log, and is it still necessary when using FDB as the storage backend?

The WAL protects against crash scenarios where data was written to the in-memory MemTable but not yet flushed to an SSTable on disk. Without a WAL, a crash after the MemTable write but before the SSTable flush would lose those writes. With FDB as storage, our writer.Close() writes chunks to FDB in transactions. Each committed FDB transaction is durable (replicated to at least two machines). A crash after Close() returns has no data loss. The WAL’s durability purpose is already provided by FDB. A production implementation would use a no-op WAL to skip the overhead.

Q: What is the 10 MB transaction limit in FDB, and what design patterns avoid hitting it?

FDB limits the total read + write size per transaction to approximately 10 MB to bound the memory required on Commit Proxies and to keep transaction resolution fast. Patterns to stay within the limit: (1) chunk large values (as we do with 64 KiB chunks), (2) break bulk writes into multiple transactions with cursor-based pagination, (3) configure LevelDB’s compaction to keep SSTable sizes small (< 2 MB), (4) use FDB atomic operations (tr.Add, tr.SetVersionstampedKey) where possible β€” atomic operations don’t count against the read portion of the limit.

Q: How would you extend this implementation to support multiple concurrent LevelDB instances sharing the same FDB cluster?

Give each LevelDB instance its own ns prefix. The FDB key space is naturally partitioned: ns1 + ... keys and ns2 + ... keys are completely disjoint. Multiple instances can read and write concurrently with no coordination overhead β€” FDB’s conflict detection only fires when two transactions write the same key, and different namespaces use different keys. The lock key (ns + tagLock) is also per-namespace, so locking one instance doesn’t affect others.

Option B β€” SQLite VFS substrate on FoundationDB

Pattern: SQLite (or any pager-style engine) with FDB as the disk. We implement the byte-range storage primitive a SQLite VFS sits on top of β€” fixed-size pages keyed by page number, atomic page-level updates β€” and demonstrate it through a small Go API. Wiring the C-level VFS hooks is mechanical glue that’s left for a follow-up.

What a SQLite VFS actually needs

SQLite’s pager talks to β€œthe OS” through a thin C interface defined in vfs.h. Boiled down, a VFS file handle exposes:

SQLite call                What we map it to
xRead(buf, n, offset)      File.ReadAt(buf, offset)
xWrite(buf, n, offset)     File.WriteAt(buf, offset)
xTruncate(size)            File.Truncate(size)
xFileSize()                File.Size()
xLock / xUnlock            File.Lock(holder) / File.Unlock(holder)
xSync                      no-op β€” every WriteAt is already durable in FDB

SQLite always reads and writes in multiples of the page size (4096 by default) once it’s past the 100-byte file header, so storing one FDB KV per page is a natural fit and makes the pager-to-storage mapping 1:1.

Key layout

<ns> 0x00                            -> uint64 BE  file size in bytes
<ns> 0x01 <pageNum:uint64 BE>        -> 4096-byte page
<ns> 0x02                            -> lock holder name (or absent)

Where the transactional magic lives

The interesting method is WriteAt. For partial-page writes, we:

  1. Issue tr.Get for every affected page.
  2. Merge the existing page bytes with the new bytes.
  3. tr.Set the resulting full page.
  4. Update the file-size key if we grew.

All inside one FDB transaction. That gives SQLite a property it cannot get from a normal filesystem: multi-page writes are atomic. SQLite has elaborate journal/WAL machinery to recover from β€œwe crashed halfway through updating pages 17, 18, and 19.” On this VFS that recovery code becomes dead β€” either all three pages flipped or none did.

In practice you’d still set PRAGMA journal_mode = MEMORY (so SQLite skips the rollback journal it doesn’t need) and rely on FDB’s transactional commit as the single durability point.

Hooking this into a real SQLite

There are two paths:

  1. cgo + mattn/go-sqlite3 or zombiezen.com/go/sqlite: register a custom sqlite3_vfs whose xRead/xWrite/... thunks call into our pagestore.File methods via a CGO bridge. ~300 lines of glue.
  2. modernc.org/sqlite: pure-Go SQLite. Its vfs subpackage exposes Register / VFS types. Same wiring, no CGO.

Either way the interesting code is what’s already here β€” the C/Go thunks add no further insight into how FDB serves as the storage tier.

Running the demo

cd option-b-sqlite
go mod tidy
go run ./demo -cluster ../fdb.cluster

Expected output:

After writing header (16 B): size=16
After 100-byte cross-page write at offset 4090: size=4190
Read 100 B back, all 'A'? true
Header preserved? "SQLite format 3\x00"
After truncate to 4096: size=4096
Lock contention as expected: pagestore: locked by "conn-A"
Lock handoff OK.

The cross-page write at offset 4090 is the key correctness test: it spans page 0 (bytes 4090–4095) and page 1 (bytes 4096–4189). The output above proves that:

  • The pre-existing 16-byte header on page 0 was preserved during the read- modify-write merge.
  • Both halves of the payload made it to disk.
  • A subsequent Truncate(4096) cleared page 1 entirely.

What’s deliberately omitted

  • WAL mode. Skipping it forces SQLite into rollback-journal mode, which emits ordinary writes only β€” exactly what our VFS supports. WAL would require a second β€œshared-memory” backing store (xShmMap) and is the single largest source of complexity in real VFS implementations.
  • Multi-process locking. The lock key works for cooperating clients but has no liveness guarantee on holder crash. Production code would combine it with FDB watches and a heartbeat.
  • The C bridge itself. See docs/README.md above for the wiring sketch.

Hitchhiker’s Guide β€” Option B: SQLite VFS over FoundationDB

The question this answers: β€œHow does SQLite actually read and write its database file? Can we replace the file with FoundationDB?”

The deeper question: β€œWhat is a database page, how does the pager work, and why is the page model the right abstraction for a VFS?”


Table of Contents

  1. SQLite’s Architecture β€” From SQL to Bytes
  2. The Virtual File System (VFS) β€” SQLite’s Plugin Point
  3. The Page Model β€” How SQLite Organizes Storage
  4. Our pagestore β€” FDB as a Page Store
  5. The Partial-Page Write Problem β€” And Its Atomic Solution
  6. xSync is a No-Op β€” And Why That’s Correct
  7. Locking β€” From POSIX Flock to FDB Keys
  8. Journal Modes: DELETE, WAL, MEMORY
  9. mvsqlite β€” The Production Version of This
  10. Real-World Analogues: libSQL, Litestream, LiteFS
  11. Exercises

1. SQLite’s Architecture β€” From SQL to Bytes

SQLite processes queries through seven layers:

SQL text
   ↓ Tokenizer + Parser
   Abstract Syntax Tree (AST)
   ↓ Code Generator
   Bytecode program (VDBE instructions)
   ↓ Virtual Database Engine (VDBE)
   B-tree operations (seek, insert, delete on B-tree pages)
   ↓ Pager
   Page cache: reads/writes logical page numbers
   ↓ OS Interface (VFS)
   File reads/writes at byte offsets
   ↓ Actual storage

We plug in at the VFS layer. Everything above (SQL parsing, the B-tree, the pager) runs as normal. The VFS is called for all I/O, and our implementation redirects those calls to FDB.

The key insight: SQLite doesn’t know or care whether the VFS talks to a local file, a network mount, or a distributed database. It only needs the VFS contract to be upheld.


2. The Virtual File System (VFS) β€” SQLite’s Plugin Point

The VFS is documented in SQLite’s C API. The key functions (C signatures simplified):

// File operations (called on each open file):
int xRead(file, buf, amount, offset);   // read `amount` bytes at `offset`
int xWrite(file, buf, amount, offset);  // write `amount` bytes at `offset`
int xTruncate(file, size);              // truncate to `size` bytes
int xSync(file, flags);                 // flush to durable storage
int xFileSize(file, *size);             // return current size
int xLock(file, locktype);             // acquire SHARED/EXCLUSIVE/etc
int xUnlock(file, locktype);           // release lock
int xCheckReservedLock(file, *result); // is anyone holding RESERVED lock?
int xShmMap(file, region, size, ...);  // map shared memory region (WAL mode)

Our Go pagestore implements the subset needed for non-WAL mode: ReadAt, WriteAt, Truncate, FileSize, Lock, Unlock.

Why a Go struct instead of a C VFS?

A real SQLite VFS requires implementing C structs (sqlite3_vfs, sqlite3_file) and using CGO to register them. This is correct but adds ~200 lines of glue code that would obscure the core concept. Our pagestore is a Go struct that exercises the exact same I/O patterns as SQLite would use. The demo calls ReadAt/WriteAt directly, simulating what the SQLite pager would call through the VFS.


3. The Page Model β€” How SQLite Organizes Storage

SQLite’s database file is divided into fixed-size pages (default 4096 bytes). Every structure in a SQLite database β€” B-tree interior nodes, B-tree leaf nodes, overflow pages, free-list pages β€” is exactly one page. The pager manages a cache of these pages and issues I/O to the VFS in page-aligned operations.

SQLite database file layout:
  Offset  0 – 4095:   Page 1 (database header + B-tree root)
  Offset  4096 – 8191:  Page 2
  Offset  8192 – 12287: Page 3
  ...
  Offset (N-1)*4096 – N*4096-1: Page N

Page 1 is special: its first 100 bytes are the database header:

Offset  0: "SQLite format 3\000" (16 bytes, magic string)
Offset 16: page size (2 bytes)
Offset 18: file format write version (1 byte)
Offset 19: file format read version (1 byte)
...
Offset 28: file change counter (4 bytes)
...
Offset 52: schema format number (4 bytes)
...

The partial-page write problem arises here: the header is 100 bytes, but SQLite might write just the header (100 bytes at offset 0), while the rest of page 1 (bytes 100–4095) contains B-tree data. An xWrite(file, header, 100, 0) call writes only 100 bytes β€” not the full page.


4. Our pagestore β€” FDB as a Page Store

We store pages as FDB keys:

const PageSize = 4096

// Page data:
//   ns + 0x01 + pageNum(uint64 big-endian)  β†’  4096 bytes

// Database file size (in bytes):
//   ns + 0x00  β†’  uint64 (big-endian)

// Exclusive lock:
//   ns + 0x02  β†’  "locked" (any non-empty value)

Page numbers:

SQLite page numbers are 1-based (page 1 is the first page). We store them as 0-padded big-endian uint64s:

Page 1:  ns + 0x01 + 0000000000000001
Page 2:  ns + 0x01 + 0000000000000002
Page 1000: ns + 0x01 + 00000000000003E8

Big-endian encoding ensures that a GetRange over ns+0x01+0 to ns+0x01+FFFFFFFFFFFFFFFF returns pages in ascending page-number order β€” useful for scanning the entire database.

Size tracking:

SQLite calls xFileSize to determine how many pages exist. We track this explicitly with a size key rather than scanning for the highest-numbered page key (which would be an O(N) scan). The size key is updated atomically in the same transaction as every WriteAt and Truncate.
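
A minimal sketch of what that size read can look like (mirroring the File struct and sizeKey helper shown in the deep dive below; treat the body as an assumption):

func (f *File) Size() (int64, error) {
    v, err := f.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.Get(f.sizeKey()).Get()
    })
    if err != nil {
        return 0, err
    }
    b, _ := v.([]byte)
    if len(b) != 8 {
        return 0, nil // size key not written yet: treat as an empty file
    }
    return int64(binary.BigEndian.Uint64(b)), nil
}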


5. The Partial-Page Write Problem β€” And Its Atomic Solution

This is the trickiest part of implementing a page-based storage engine.

SQLite’s VFS contract says:

xWrite(offset, amount, data):
  Write `amount` bytes starting at `offset`.
  `offset` and `amount` may be any values (not necessarily page-aligned).

Examples of partial-page writes:

  • Write header (100 bytes at offset 0) to page 1
  • Write a record that spans a 4-byte boundary at the end of a page
  • Write a single integer (4 bytes at offset 12) to a page

We cannot simply map WriteAt(offset, data) to Set(pageKey(offset/4096), data) because that would overwrite the entire 4096-byte page with only the partial data β€” corrupting the rest of the page.

Our solution: read-modify-write inside one transaction.

func (p *PageStore) WriteAt(data []byte, off int64) (int, error) {
    firstPage := off / PageSize
    lastPage := (off + int64(len(data)) - 1) / PageSize

    _, err := p.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        for pageNum := firstPage; pageNum <= lastPage; pageNum++ {
            // 1. Read current page content (or zeros if it doesn't exist yet)
            existing, _ := tr.Get(p.pageKey(pageNum)).Get()
            page := make([]byte, PageSize)
            copy(page, existing)  // pad to 4096 with zeros if shorter

            // 2. Compute which bytes of `data` go into this page
            pageStart := pageNum * PageSize
            pageEnd := pageStart + PageSize
            writeStart := max(off, pageStart) - pageStart
            writeEnd := min(off+int64(len(data)), pageEnd) - pageStart
            dataStart := max(off, pageStart) - off
            dataEnd := min(off+int64(len(data)), pageEnd) - off

            // 3. Overlay the new bytes onto the existing page
            copy(page[writeStart:writeEnd], data[dataStart:dataEnd])

            // 4. Write the modified page back
            tr.Set(p.pageKey(pageNum), page)
        }
        // 5. Update size key if this write extends the file
        ...
        return nil, nil
    })
    return len(data), err
}

Steps 1–4 happen inside one FDB transaction. The read and the write are atomic: no concurrent writer can modify the page between our read and write. No partial update is visible to other transactions.

The atomicity guarantee:

If the process crashes after step 1 (read) but before step 4 (write), the transaction is never committed. FDB abandons it. The page is unchanged. No corruption.

If two concurrent writers both try to read-modify-write the same page, FDB’s conflict detection will cause one to retry. The second writer will re-read the page (now including the first writer’s change) and overlay its own data on top. Correct behavior, no coordination required.


6. xSync is a No-Op β€” And Why That’s Correct

SQLite calls xSync to tell the VFS: β€œmake sure all previous writes are durable on physical storage before returning.” On a normal filesystem, this calls fsync() which flushes the OS page cache to the disk controller and waits for the disk to confirm durability.

Why do we implement it as a no-op?

FDB’s Transact is synchronous and durable.

When WriteAt calls p.db.Transact(...) and it returns without error, the data is already:

  1. Written to FDB’s transaction log on f+1 machines (durably replicated)
  2. Committed (visible to new transactions)

By the time WriteAt returns to the SQLite pager, the durability guarantee is already satisfied β€” stronger than fsync on a local disk.

xSync is a no-op because the work it would do has already been done during xWrite. There is nothing left to flush.

Analogy: Imagine a bank transfer. xWrite is β€œdebit Alice, credit Bob, commit to ledger.” xSync is β€œmake sure the ledger is durable.” If the ledger is already in a replicated distributed database, xSync has nothing to do.


7. Locking β€” From POSIX Flock to FDB Keys

SQLite uses file locking to coordinate concurrent access. Its lock states (implemented with POSIX file locks on an ordinary filesystem), in order of increasing exclusivity:

UNLOCKED β†’ SHARED β†’ RESERVED β†’ PENDING β†’ EXCLUSIVE
  • SHARED: reader has this. Many readers can hold SHARED simultaneously.
  • RESERVED: writer intending to modify. One writer can hold RESERVED while readers continue holding SHARED.
  • PENDING: writer waiting for all SHARED holders to release.
  • EXCLUSIVE: sole writer, no readers. Required before writing pages.

A full VFS implementation would map these levels to FDB lock keys with appropriate semantics (e.g., a counter for SHARED, a flag for EXCLUSIVE).

Our implementation is simplified: we use one β€œexclusive lock” key. Setting it means exclusive ownership; clearing it means unlocked. This correctly implements single-writer semantics but doesn’t allow concurrent readers.

For multi-reader/single-writer, a production implementation would:

SHARED lock: counter key (increment on lock, decrement on unlock)
EXCLUSIVE lock: flag key
  β†’ acquire EXCLUSIVE: check counter == 0, set flag key
  β†’ acquire SHARED: check flag key is unset, increment counter

Both checks-and-sets would use FDB transactions to ensure atomicity.
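
A hedged sketch of those two acquire paths (the readerCountKey and exclusiveKey names are assumptions, not the demo's actual keys):

func acquireShared(db fdb.Database, readerCountKey, exclusiveKey fdb.Key) error {
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        excl, err := tr.Get(exclusiveKey).Get()
        if err != nil {
            return nil, err
        }
        if len(excl) > 0 {
            return nil, errors.New("pagestore: exclusive lock held")
        }
        // Atomic add: no read of the counter, so readers never conflict with
        // each other, only with a concurrent exclusive acquirer.
        one := make([]byte, 8) // little-endian operand for ADD
        one[0] = 1
        tr.Add(readerCountKey, one)
        return nil, nil
    })
    return err
}

func acquireExclusive(db fdb.Database, readerCountKey, exclusiveKey fdb.Key) error {
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        cnt, err := tr.Get(readerCountKey).Get()
        if err != nil {
            return nil, err
        }
        if len(cnt) == 8 && binary.LittleEndian.Uint64(cnt) > 0 {
            return nil, errors.New("pagestore: readers still active")
        }
        excl, err := tr.Get(exclusiveKey).Get()
        if err != nil {
            return nil, err
        }
        if len(excl) > 0 {
            return nil, errors.New("pagestore: exclusive lock held")
        }
        tr.Set(exclusiveKey, []byte("locked"))
        return nil, nil
    })
    return err
}

Releasing SHARED is an atomic add of -1 (eight 0xFF bytes as the little-endian operand); releasing EXCLUSIVE clears the flag key.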


8. Journal Modes: DELETE, WAL, MEMORY

SQLite has several journal modes, selected with PRAGMA journal_mode=MODE. The journal is SQLite’s crash recovery mechanism (distinct from FDB’s own recovery).

DELETE (default rollback journal): Before modifying a page, SQLite copies the original page content to a separate journal file. If a crash occurs during a write, SQLite replays the journal to restore the original page content.

With FDB, this is redundant: FDB’s transactions provide rollback for free. A partial WriteAt that crashes mid-transaction rolls back automatically. The journal file would be stored in FDB via our VFS, adding overhead for no benefit.

WAL (Write-Ahead Log): Instead of copying original pages before modification, SQLite appends new page versions to a WAL file. Readers check the WAL before reading the main database file. A β€œcheckpoint” operation copies WAL pages back to the main file.

WAL mode requires xShmMap β€” shared memory for the WAL index. This is a memory-mapped file that multiple processes share to coordinate which WAL frames are valid. Implementing xShmMap in FDB would require:

  1. A small FDB key range to store the WAL index state.
  2. Synchronized access to that state via FDB transactions. This is complex but possible β€” mvsqlite does it.

MEMORY: SQLite uses an in-memory journal, not written to any file. Rollback is possible within a transaction but not after a crash.

For our FDB VFS, PRAGMA journal_mode=MEMORY is the right choice: SQLite won’t try to create journal files (which would trigger extra VFS calls), and crash recovery is handled by FDB. Our demo sidesteps the question because it never opens a real SQLite connection; a full integration would open the database and run PRAGMA journal_mode=MEMORY before any writes.


9. mvsqlite β€” The Production Version of This

mvsqlite (by losfair, Rust) is the production-grade implementation of the same idea. Its architecture:

Namespace mapping: Each SQLite β€œdatabase file” path is mapped to an FDB namespace prefix. Multiple processes can open the same β€œfile” (namespace) simultaneously with MVCC isolation.

Page-level MVCC: mvsqlite keeps multiple versions of each page, similar to how PostgreSQL keeps multiple row versions. When a reader opens the database, it captures an FDB read version. Page reads are served from FDB at that version. New writes create new page versions. Old versions are kept until all readers that need them complete.

Our pagestore is a simplified, non-MVCC version: every read sees the latest page version. Adding page-level MVCC would require:

ns + 0x01 + pageNum + version  β†’  4096 bytes  (instead of just pageNum)

With a cleanup process that removes old versions when no readers hold them.
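
A sketch of the read side under that layout; versionedPageKey is a hypothetical helper that appends the version after the page number, and this is not mvsqlite's actual code:

// Newest page version <= readVersion: reverse-scan the page's version
// subspace and take the first hit.
func readPageAtVersion(rt fdb.ReadTransaction, pageNum int64, readVersion uint64) ([]byte, error) {
    begin := versionedPageKey(pageNum, 0)
    end := versionedPageKey(pageNum, readVersion+1) // exclusive upper bound
    kvs, err := rt.GetRange(
        fdb.KeyRange{Begin: begin, End: end},
        fdb.RangeOptions{Limit: 1, Reverse: true},
    ).GetSliceWithError()
    if err != nil {
        return nil, err
    }
    if len(kvs) == 0 {
        return nil, nil // page did not exist at that version
    }
    return kvs[0].Value, nil
}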

Write-ahead log in FDB: mvsqlite implements WAL mode by storing the WAL log pages in FDB itself:

ns + 0x03 + txnId + pageNum  β†’  4096 bytes  (uncommitted WAL pages)
ns + 0x04 + txnId           β†’  commit record

This allows SQLite’s WAL mode to work without xShmMap (the shared memory region) β€” instead, the WAL index state lives in FDB and is accessed via transactions.


10. Real-World Analogues

libSQL (Turso)

libSQL is a fork of SQLite that adds multi-tenancy and remote storage. Turso’s cloud product stores SQLite databases in a distributed object store (similar to our FDB approach). The architecture:

  • Local replicas for low-latency reads
  • A primary writes to the distributed store
  • Followers pull changes from the store and apply them locally

Litestream

Litestream continuously replicates SQLite databases to S3 or other cloud storage by intercepting the WAL. It monitors the SQLite WAL file and copies new frames to cloud storage in near-real-time. Restore is done by downloading frames and applying them.

Litestream does NOT modify SQLite’s VFS β€” it runs as a separate process that monitors the WAL file. This is simpler but means it can’t provide multi-writer consistency (only one process can write to a SQLite database at a time).

LiteFS (Fly.io)

LiteFS mounts a FUSE filesystem that intercepts SQLite writes, replicates them to a primary node, and distributes to replicas. It does intercept at the filesystem level (FUSE = Filesystem in Userspace), which is essentially implementing the VFS pattern at the OS level.

The Pattern’s Significance

All of these tools (mvsqlite, libSQL, Litestream, LiteFS) are attacking the same problem: SQLite is an excellent embedded database, but it’s tied to a single file on a single machine. The solution in each case is to intercept the storage layer and redirect I/O to a distributed, replicated system.

Our pagestore is the minimal proof-of-concept for this idea: 150 lines of Go that demonstrate the core page-store primitives.


11. Exercises

Exercise 1 β€” Page Cache

Add an in-process LRU page cache:

type PageStore struct {
    ...
    cache *lru.Cache  // maps pageNum β†’ [PageSize]byte
}

ReadAt checks the cache before going to FDB. WriteAt updates the cache after writing to FDB. Evict on Truncate.

Measure the cache hit rate for a typical SQLite workload (a mix of reads and writes). You’ll find that B-tree root pages have very high hit rates β€” they’re accessed on every query.

Exercise 2 β€” Implement xShmMap for WAL Mode

WAL mode requires shared memory for the WAL index. In mvsqlite, this is done with FDB keys. Implement a simplified version:

  1. Add a β€œWAL region” subspace: ns + 0x03 + regionNum β†’ 32768 bytes
  2. xShmMap(region, size) reads the FDB key and returns a []byte
  3. xShmBarrier() writes the modified byte slice back to FDB
  4. xShmLock maps to a FDB lock key per-region

With this, enable PRAGMA journal_mode=WAL in the demo and verify reads and writes still work.

Exercise 3 β€” Multi-Version Pages

Change the page key layout to:

ns + 0x01 + pageNum + readVersion  β†’  4096 bytes

WriteAt reads the current version, writes a new version at the current FDB commit version. ReadAt scans backwards from the requested read version to find the most recent page version ≀ that version.

Add a Compact(olderThan version) that clears all page versions older than a given version. This is the MVCC GC mechanism.

Exercise 4 β€” Register as a Real SQLite VFS

Using CGO, implement the actual sqlite3_vfs and sqlite3_file C structs backed by our pagestore. Register the VFS with sqlite3_vfs_register. Open a real SQLite connection using this VFS:

db, err := sql.Open("sqlite3", "file:demo.db?vfs=fdbvfs")
db.Exec("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
db.Exec("INSERT INTO t VALUES (1, 'hello')")

All I/O will go through pagestore to FDB. You have now replaced SQLite’s storage engine while keeping the full SQL query interface.

Exercise 5 β€” Multi-Database Isolation

The pagestore uses a namespace prefix to isolate one database. Add a DatabaseManager that:

  1. Lists all databases (range scan over a well-known metadata prefix)
  2. Creates a new database (allocates a namespace, writes a metadata entry)
  3. Deletes a database (one ClearRange + metadata clear)
  4. Returns a pagestore for a given database name

12. Source Code Deep Dive β€” pagestore/pagestore.go

The File Struct

type File struct {
    db   fdb.Database
    ns   []byte
    mu   sync.Mutex
}

ns is the namespace byte prefix. mu protects concurrent calls to WriteAt and Truncate from a single File instance (though FDB’s own transactions provide the real isolation for concurrent processes).

Key Construction

func (f *File) sizeKey() fdb.Key {
    key := make([]byte, len(f.ns)+1)
    copy(key, f.ns)
    key[len(f.ns)] = 0x00
    return fdb.Key(key)
}

func (f *File) pageKey(pageNum int64) fdb.Key {
    key := make([]byte, len(f.ns)+9)
    copy(key, f.ns)
    key[len(f.ns)] = 0x01
    binary.BigEndian.PutUint64(key[len(f.ns)+1:], uint64(pageNum))
    return fdb.Key(key)
}

sizeKey() stores the total file size in bytes. pageKey(n) stores the 4096-byte page at page number n. The big-endian 8-byte encoding of pageNum ensures that a GetRange over the 0x01 page subspace returns pages in ascending page-number order.

ReadAt β€” Handling Partial-Page Reads

func (f *File) ReadAt(p []byte, off int64) (int, error) {
    firstPage := off / PageSize
    lastPage := (off + int64(len(p)) - 1) / PageSize

    _, err := f.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        // Issue all page reads first; none of them block yet (pipelined).
        futures := make([]fdb.FutureByteSlice, lastPage-firstPage+1)
        for i := firstPage; i <= lastPage; i++ {
            futures[i-firstPage] = rt.Get(f.pageKey(i))
        }
        // Wait for the futures and assemble the result while the
        // transaction is still alive.
        for i := firstPage; i <= lastPage; i++ {
            pageData, err := futures[i-firstPage].Get()
            if err != nil {
                return nil, err
            }
            page := make([]byte, PageSize)
            copy(page, pageData) // pad with zeros if the page is short or missing
            // Copy the slice of this page that overlaps [off, off+len(p)).
            pageStart := i * PageSize
            from := max(off, pageStart) - pageStart
            to := min(off+int64(len(p)), pageStart+PageSize) - pageStart
            copy(p[pageStart+from-off:], page[from:to])
        }
        return nil, nil
    })
    return len(p), err
}

Pipelining multiple page reads: All rt.Get(pageKey(i)) calls are issued inside one ReadTransact closure. FDB sends all read requests to the storage servers before waiting for any response. If the read spans 3 pages (common for reads that cross page boundaries), all 3 are fetched in one round-trip. Without pipelining, each page would require a separate round-trip β€” 3x the latency.

This is the same pipelining pattern used in option-c-record-layer’s LookupByIndex.

WriteAt β€” The Read-Modify-Write Pattern

func (f *File) WriteAt(p []byte, off int64) (int, error) {
    f.mu.Lock()
    defer f.mu.Unlock()

    firstPage := off / PageSize
    lastPage := (off + int64(len(p)) - 1) / PageSize

    _, err := f.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // Issue all reads first (pipelined)
        futures := make([]fdb.FutureByteSlice, lastPage-firstPage+1)
        for i := firstPage; i <= lastPage; i++ {
            futures[i-firstPage] = tr.Get(f.pageKey(i))
        }
        sizeF := tr.Get(f.sizeKey())

        // Wait for reads and apply writes
        for i := firstPage; i <= lastPage; i++ {
            existing, _ := futures[i-firstPage].Get()
            page := make([]byte, PageSize)
            copy(page, existing)
            // overlay p bytes onto page...
            tr.Set(f.pageKey(i), page)
        }

        // Update size
        curSize := int64(binary.BigEndian.Uint64(sizeF.MustGet()))
        newEnd := off + int64(len(p))
        if newEnd > curSize {
            b := make([]byte, 8)
            binary.BigEndian.PutUint64(b, uint64(newEnd))
            tr.Set(f.sizeKey(), b)
        }

        return nil, nil
    })
    return len(p), err
}

The read futures are pipelined: all tr.Get() calls are issued before any .Get() is called to block. This means all page reads happen in parallel β€” one round-trip for up to N pages regardless of N.

The size update is in the same transaction: The page writes and the size update are atomic. If the process crashes mid-transaction, FDB abandons it: neither the page updates nor the size update are applied. The database remains in its previous state β€” consistent, readable, no corruption.

pageRange β€” For Bulk Operations

func (f *File) pageRange() fdb.KeyRange {
    begin := f.pageKey(0)
    end := make([]byte, len(f.ns)+1)
    copy(end, f.ns)
    end[len(f.ns)] = 0x02  // one past the page subspace tag (0x01)
    return fdb.KeyRange{Begin: begin, End: fdb.Key(end)}
}

Used by Truncate to clear all pages above the new size:

func (f *File) Truncate(size int64) error {
    // First page that no longer holds any bytes: ceil(size / PageSize).
    firstClearedPage := (size + PageSize - 1) / PageSize
    _, err := f.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // Clear all pages from firstClearedPage onward
        clearBegin := f.pageKey(firstClearedPage)
        tr.ClearRange(fdb.KeyRange{Begin: clearBegin, End: f.pageRange().End})
        // Update size
        b := make([]byte, 8)
        binary.BigEndian.PutUint64(b, uint64(size))
        tr.Set(f.sizeKey(), b)
        return nil, nil
    })
    return err
}

One ClearRange atomically removes all pages above the new size. This is O(1) in FDB key operations (one range-clear instruction), even if there are thousands of pages to clear. This is vastly more efficient than deleting pages one by one.


13. Production Considerations

13.1 Page Size Selection

The default SQLite page size is 4096 bytes. You can change it with PRAGMA page_size=N (must be set before the first write). Larger pages reduce the number of B-tree levels and improve scan performance on large tables, but increase read amplification for small point lookups. For an FDB-backed store:

  • 4096 bytes (default): 4096-byte FDB values. Small key overhead. Good for mixed workloads.
  • 8192 bytes: 8192-byte values. Better for large sequential scans.
  • 65536 bytes (maximum SQLite page size): 64 KB values. Below FDB’s 100 KB limit. Excellent for large sequential scans, but poor for point lookups.

For most use cases, 4096 is the right choice. Match it to your workload’s read/write unit.

13.2 Hot Page Contention

Page 1 contains the database header and the root page of the main B-tree. Every query touches page 1. Every transaction increments the β€œfile change counter” in the header. This means every write transaction does a read-modify-write on page 1 β€” even if the query doesn’t touch any user data on page 1 (e.g., inserting into a deep table touches only leaf pages and page 1 for the change counter).

In a high-concurrency write workload, this is a hot key. FDB’s conflict detection will serialize all transactions that write to page 1 β€” reducing concurrency.

Solution (mvsqlite’s approach): Move the change counter out of the page and into a separate FDB key. Use FDB’s atomic add to increment it without a read-modify-write. Since atomic add is commutative, it doesn’t cause conflicts. The page itself (minus the counter) is written only when actual B-tree structure changes.
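
A sketch of that counter bump (changeCounterKey is a hypothetical per-database key, not part of our demo):

// Commutative increment: concurrent writers all Add to the same key
// without reading it, so FDB detects no conflict between them.
func bumpChangeCounter(tr fdb.Transaction, changeCounterKey fdb.Key) {
    delta := make([]byte, 8) // little-endian operand, as FDB's ADD expects
    delta[0] = 1
    tr.Add(changeCounterKey, delta)
}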

13.3 Crash Recovery Testing

With FDB’s transactional guarantees, the traditional SQLite recovery path (re-applying the rollback journal) is never needed. But it’s worth testing:

  1. Write some data.
  2. Kill the process mid-write (use kill -9 or os.Exit(1) inside a WriteAt).
  3. Restart the process and open the database.
  4. Verify the database is in a consistent state (either the write happened or it didn’t, no partial state).

FDB’s transaction semantics make this trivially correct, but testing it gives you confidence that your implementation upholds the contract.


14. Interview Questions β€” SQLite, VFS, and Page Models

Q: What is a database page and why do databases use a fixed page size?

A page is the fundamental unit of I/O and storage in a database. Fixed page sizes allow: (1) simple address calculation β€” page N starts at byte (N-1) * pageSize; (2) alignment with OS page sizes (4 KiB for virtual memory, matching SQLite’s default); (3) buffer pool management β€” a page cache stores exactly one page per slot. Variable-size records are packed into pages; a B-tree node is exactly one page. The fixed size ensures that a B-tree node read is always one I/O operation, not an I/O per record.

Q: What is the partial-page write problem, and how does your FDB implementation solve it?

SQLite’s VFS xWrite can write any byte range β€” not necessarily page-aligned. If a 100-byte write starts at offset 0, it overwrites bytes 0–99 of page 1, but bytes 100–4095 are unchanged. An implementation that naively maps this to Set(page1Key, 100_bytes) would corrupt page 1 (losing bytes 100–4095). Our solution: read the existing page, overlay the new bytes, write the full 4096-byte page back β€” in one FDB transaction. The transaction atomicity ensures that no other writer sees a partial page and that a crash during the operation leaves the page unchanged.

Q: Why is xSync a no-op in your VFS implementation?

xSync requests durability: β€œflush all previous writes to stable storage.” Our WriteAt implementation uses fdb.Transact which does not return until the write is committed and replicated to FDB’s transaction log. The commit is synchronous and durable by the time WriteAt returns. There is nothing left for xSync to flush. The durability guarantee of xSync is already satisfied by the end of WriteAt.

Q: How does SQLite locking work, and how would you implement a multi-reader single-writer lock in FDB?

SQLite uses 5 lock levels. For multi-reader single-writer: maintain a SHARED lock counter (number of active readers) and an EXCLUSIVE lock flag. Acquire SHARED: check exclusive flag is unset, increment counter (in one FDB transaction). Acquire EXCLUSIVE: check counter == 0, set exclusive flag (in one FDB transaction). Release SHARED: decrement counter. Release EXCLUSIVE: clear flag. All check-and-set operations are atomic in FDB transactions, so no intermediate state is observable. A would-be exclusive lock holder that sees counter > 0 must retry (wait for readers to finish).

This is the NamespaceManager concept in mvsqlite, and it’s how a multi-tenant SQLite service would work.

Option C β€” Record Layer over FoundationDB

Pattern: β€œstructured records + secondary indexes, native FDB” β€” the same idea Apple’s fdb-record-layer is built around, distilled to ~250 lines of Go so you can read the whole thing in one sitting.

What problem this solves

FDB’s wire-level API is ordered KV. Most applications want records (a name, a city, an email) plus queries like β€œfind users in Paris”. A record layer sits in between:

  • Records are serialized blobs stored under a primary key.
  • Indexes are extra KV entries keyed by (field, value, pk) so range scans on the index subspace return matching primary keys.
  • Both updates happen in one FDB transaction, so the index is never out of sync with the records β€” even under concurrent writers.

Files

recordlayer/
  encoding.go   Key layouts for records (tag 0x00) and indexes (tag 0x01)
  msgpack.go    Wrapper around github.com/vmihailenco/msgpack
  store.go      Open / PutRecord / GetRecord / DeleteRecord / ScanRecords
  index.go      LookupByIndex (range-scan index β†’ batch-read records)
demo/main.go    Users keyed by id, indexed by city

Key layout

records:   <ns> 0x00 <schemaName> 0x00 <pk>                               -> msgpack(record)
indexes:   <ns> 0x01 <schemaName> 0x00 <fieldName> 0x00 <value> 0x00 <pk> -> ""

Splitting records and indexes into two top-level β€œtags” keeps each subspace range-scannable without touching the other. Integers are written as sign-bit-flipped big-endian (see encodeIndexValue) so negative values sort before positive in the index β€” a tiny but useful property when you want range queries like β€œage >= 18”.

Why it’s transactional by construction

PutRecord does, inside one db.Transact:

  1. Read the previous version of the record (if any).
  2. For each indexed field that changed or was removed, Clear the old index entry.
  3. Set the new record bytes.
  4. Set fresh index entries for the new field values.

Because FDB transactions are serializable, no other client can observe a state where the record was updated but its indexes were not. This is the single biggest reason to build atop FDB rather than a non-transactional KV.

Mapping LookupByIndex to FDB ops

db.ReadTransact(rt -> {
    idxKVs := rt.GetRange(indexPrefix(schema, field, value))   // ordered scan
    for each idxKV:
        pendings[i] = rt.Get(recordKey(schema, pk))            // pipelined
    for each pending: collect record
})

The pipeline trick (issuing all rt.Get calls before awaiting any) lets FDB overlap network round-trips. Latency β‰ˆ slowest single read instead of sum-of-reads.

Running

  1. Bring up FDB and bootstrap (see top-level README).
  2. cd option-c-record-layer && go run ./demo -cluster ../fdb.cluster

Expected output:

All users (PK order):
  u1 -> map[city:Paris name:Alice]
  u2 -> map[city:Tokyo name:Bob]
  u3 -> map[city:Paris name:Carol]

Lookup city=Paris (via secondary index):
  u1 -> map[city:Paris name:Alice]
  u3 -> map[city:Paris name:Carol]

After moving Alice to Tokyo:
Lookup city=Paris:
  u3 -> map[city:Paris name:Carol]
Lookup city=Tokyo:
  u1 -> map[city:Tokyo name:Alice]
  u2 -> map[city:Tokyo name:Bob]

What this layer omits compared to fdb-record-layer

  • Schema evolution / Protobuf descriptors.
  • Index definitions like COUNT, MAX, SUM aggregates.
  • Query planner over multiple indexes.
  • Versioned records and meta-data subspace.

But the storage shape and atomicity story are exactly the same.

Hitchhiker’s Guide β€” Option C: Record Layer with Secondary Indexes

The question this answers: β€œHow does a database maintain secondary indexes so they always agree with the primary data β€” even under concurrent writes?”

The deeper question: β€œWhat does Apple actually do inside iCloud, and why did they choose FoundationDB as the substrate?”


Table of Contents

  1. What the Record Layer Is
  2. Records vs Indexes: Two Subspaces, One Truth
  3. The Index Key Format β€” Anatomy of a Lookup Key
  4. PutRecord β€” Read-Modify-Write as a Pattern
  5. LookupByIndex β€” Pipelining Explained
  6. The encodeInt64 Sign-Bit Trick (Derived from First Principles)
  7. Index Consistency: The Core Guarantee
  8. The Apple Story: fdb-record-layer and CloudKit
  9. Real-World Analogues: MySQL, PostgreSQL, MongoDB
  10. Exercises

1. What the Record Layer Is

The FoundationDB Record Layer is a Java library open-sourced by Apple in 2019. It provides a record-oriented storage layer above FDB with:

  • Typed records using Protobuf descriptors
  • Declarative indexes (define once; maintenance is automatic)
  • A query planner that compiles predicates into index range scans
  • Schema evolution without downtime
  • Change feeds via versionstamped keys

Our implementation is a Go reimagining: records are map[string]any (instead of Protobuf messages), indexes are field names (instead of declarative index definitions), and there is no query planner β€” you call LookupByIndex directly. But every core concept is the same.

Why a β€œrecord layer” and not just β€œa KV store”?

The moment you have more than one way to find a record, you need an index. The moment you have an index, you have two copies of the same data in two different shapes. The moment you have two copies, you need a protocol for keeping them consistent. The record layer is that protocol.


2. Records vs Indexes: Two Subspaces, One Truth

Our key space uses two tag bytes to separate the two subspaces:

Records subspace:
  <ns> + 0x00 + <schemaName> + 0x00 + <pk>
  β†’ value: msgpack(record)

Indexes subspace:
  <ns> + 0x01 + <schemaName> + 0x00 + <fieldName> + 0x00 + <encodedFieldValue> + 0x00 + <pk>
  β†’ value: "" (empty β€” the key IS the data)

Example with namespace "demo", schema "user", user Alice (pk="alice") with field city="Paris":

Records:
  64 65 6d 6f 00 75 73 65 72 00 61 6c 69 63 65
  "demo" 00 "user" 00 "alice"
  β†’ msgpack({id:"alice", name:"Alice", city:"Paris"})

Indexes (city field):
  64 65 6d 6f 01 75 73 65 72 00 63 69 74 79 00 50 61 72 69 73 00 61 6c 69 63 65
  "demo" 01 "user" 00 "city" 00 "Paris" 00 "alice"
  β†’ "" (empty value)

Why the value is empty for indexes:

The index key encodes everything you need to find the record: which schema, which field, what value, and what the PK is. The value at an index key carries no additional information. This is identical to how MySQL’s InnoDB secondary indexes work: the B-tree leaf stores just the clustered PK, not the full row.

Why the subspace separation matters:

  • A full table scan (read all records) = GetRange(ns+0x00+schema, ns+0x00+schema+0x01) β€” hits only records, not indexes.
  • An index scan (find all users where city=β€˜Paris’) = GetRange(ns+0x01+schema+0x00+city+0x00+Paris, ns+0x01+schema+0x00+city+0x00+Paris+0x01) β€” hits only that index column/value, not other indexes or records.
  • Clearing all indexes for a schema = ClearRange(ns+0x01+schema, ns+0x01+schema+0x01).

With one tag byte, ranges are perfectly clean. Without it, you’d need carefully chosen sentinels and you’d still risk collisions.
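
A sketch of how the record-scan boundaries can be built (the helper name is an assumption; our store builds these keys inline):

// recordScanRange covers every record of one schema:
//   begin = ns + 0x00 + schema + 0x00
//   end   = ns + 0x00 + schema + 0x01   (one past the separator byte)
func recordScanRange(ns []byte, schema string) fdb.KeyRange {
    prefix := append(append(append([]byte{}, ns...), 0x00), []byte(schema)...)
    begin := append(append([]byte{}, prefix...), 0x00)
    end := append(append([]byte{}, prefix...), 0x01)
    return fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)}
}

A full table scan is then one GetRange(recordScanRange(ns, "user"), fdb.RangeOptions{}); the analogous index range just swaps in the 0x01 tag and appends the field and value.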


3. The Index Key Format β€” Anatomy of a Lookup Key

Let’s trace exactly what key is written for each user:

Alice (pk="alice", city="Paris"):
  index key = ns + 0x01 + "user" + 0x00 + "city" + 0x00 + "Paris" + 0x00 + "alice"

Bob (pk="bob", city="Tokyo"):
  index key = ns + 0x01 + "user" + 0x00 + "city" + 0x00 + "Tokyo" + 0x00 + "bob"

Carol (pk="carol", city="Paris"):
  index key = ns + 0x01 + "user" + 0x00 + "city" + 0x00 + "Paris" + 0x00 + "carol"

In FDB’s sorted key space, these appear as:

(sorted by bytes)
...
ns+0x01+"user"+0x00+"city"+0x00+"Paris"+0x00+"alice"  β†’ ""
ns+0x01+"user"+0x00+"city"+0x00+"Paris"+0x00+"carol"  β†’ ""
ns+0x01+"user"+0x00+"city"+0x00+"Tokyo"+0x00+"bob"    β†’ ""
...

A GetRange for city="Paris" scans from ...+"Paris"+0x00 to ...+"Paris"+0x01, returning exactly two keys (Alice and Carol). The PKs ("alice", "carol") are embedded in the key suffix. This is the index scan.

The extractPkFromIndexKey function:

Given the raw FDB index key, we need to extract just the PK suffix. The PK starts after the last 0x00 byte:

func extractPkFromIndexKey(key fdb.Key) []byte {
    for i := len(key) - 1; i >= 0; i-- {
        if key[i] == 0x00 {
            return key[i+1:]
        }
    }
    return nil
}

We scan backwards because the encoded field value may contain 0x00 bytes (an encoded integer, for example), so scanning forward would stop too early. The last 0x00 in the key is the separator between the encoded field value and the PK, so the backwards scan finds exactly that separator, provided the PK itself contains no 0x00 bytes.

A subtle limitation: if the PK itself can contain 0x00, this parsing is ambiguous. A production implementation uses the Tuple encoding, which escapes 0x00 bytes within strings (\x00\xff) so the separator \x00\x00 is unambiguous.
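
For comparison, a sketch using FDB's Go tuple package, which applies that escaping for you (the element layout here is illustrative, not the Record Layer's):

// import "github.com/apple/foundationdb/bindings/go/src/fdb/tuple"

// Tuple encoding escapes embedded 0x00 bytes (0x00 -> 0x00 0xFF), so the
// elements unpack unambiguously even if the PK or value contains nulls.
key := tuple.Tuple{"user", "city", "Paris", "alice"}.Pack()

elems, _ := tuple.Unpack(key)
fmt.Println(elems[3]) // prints "alice": the PK comes back as its own element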


4. PutRecord β€” Read-Modify-Write as a Pattern

This is the most important function in the record layer. Let’s walk through it completely:

func (s *Store) PutRecord(schema Schema, pk string, rec Record) error {
    _, err := s.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // Step 1: Read the OLD record (if it exists)
        oldBytes, _ := tr.Get(recordKey(s.ns, schema.Name, pk)).Get()

        // Step 2: Decode the old record
        var oldRec Record
        if len(oldBytes) > 0 {
            msgpack.Unmarshal(oldBytes, &oldRec)
        }

        // Step 3: Remove OLD index entries
        for _, field := range schema.Indexes {
            if oldVal, ok := oldRec[field]; ok {
                tr.Clear(indexKey(s.ns, schema.Name, field, oldVal, pk))
            }
        }

        // Step 4: Write the NEW record
        encoded, _ := msgpack.Marshal(rec)
        tr.Set(recordKey(s.ns, schema.Name, pk), encoded)

        // Step 5: Write NEW index entries
        for _, field := range schema.Indexes {
            if newVal, ok := rec[field]; ok {
                tr.Set(indexKey(s.ns, schema.Name, field, newVal, pk), []byte{})
            }
        }
        return nil, nil
    })
    return err
}

Why must you read the old record before writing the new one?

Consider Alice moving from Paris to Tokyo:

Before: {id:"alice", name:"Alice", city:"Paris"}
After:  {id:"alice", name:"Alice", city:"Tokyo"}

We need to:

  1. Remove the old index entry for city=Paris, pk=alice
  2. Add a new index entry for city=Tokyo, pk=alice
  3. Write the new record

We cannot do step 1 without knowing that Alice was previously in Paris. That information lives in the old record. So we read it first.

Why inside one transaction?

If we split this into two transactions:

Transaction 1: read old record, decide to clear Paris index
Transaction 2: clear Paris index, write new record, write Tokyo index

A crash between T1 and T2 leaves the Paris index entry pointing to a record that now says Tokyo. This is a phantom index entry β€” a stale pointer that silently returns wrong results.

With all steps in one transaction: either the entire operation commits (both index update and record update) or neither does. The index and the record always agree.

This is what database people mean when they say atomicity eliminates a class of bugs. It’s not just about performance; it’s about correctness.

The cost: one extra read per update.

Every PutRecord does one Get (the old record) before writing. That read puts the record’s key into the transaction’s read conflict range. If two writers try to update the same record concurrently, one will get a conflict and retry. This is acceptable and correct β€” the alternative (a blind, conflict-free update) risks interleaving index writes.


5. LookupByIndex β€” Pipelining Explained

This is where we use FDB’s pipelining to make multi-key lookups efficient.

The naive approach (one round-trip per record):

// BAD: N sequential round-trips for N results
pks := getIndexScan(...)  // one round-trip: get all PKs from index
for _, pk := range pks {
    rec := getRecord(pk)  // one round-trip each: N round-trips total
    results = append(results, rec)
}

For N=100, this is 101 network round-trips. At 1ms per round-trip, that’s ~100ms for a simple lookup. Unacceptable.

The pipelined approach (two round-trips total):

// GOOD: pipeline all Get calls, then collect all results
_, err = s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
    // Phase 1: issue ALL Get calls. None block yet.
    futures := make([]fdb.FutureByteSlice, len(pks))
    for i, pk := range pks {
        futures[i] = rt.Get(recordKey(s.ns, schema.Name, pk))
        // rt.Get returns a Future immediately β€” it has NOT waited for a response.
    }
    // Phase 2: now block on each future in sequence.
    for i, fut := range futures {
        b, _ := fut.Get()  // blocks until this specific response arrives
        // But by the time we block on futures[1], the response for futures[1]
        // through futures[N-1] may already be in flight or arrived.
        ...
    }
})

Phase 1 sends N requests in rapid succession without waiting for any response. The FDB client library batches these into its network buffer and sends them. The storage servers process all N in parallel. By the time we call Get() on the first future, many responses have already arrived.

Network diagram:

Without pipelining:        With pipelining:
  β†’ Get(pk1)                β†’ Get(pk1), Get(pk2), Get(pk3), ...  (burst)
  ← reply                   ← reply(pk1)
  β†’ Get(pk2)                ← reply(pk2)
  ← reply                   ← reply(pk3)
  ...                       (collect all)
  Total: N round-trips      Total: ~1 round-trip worth of latency

Why this works in FDB’s model:

FDB transactions are snapshot reads. All Get calls within one ReadTransact closure see the same version of the database. There are no cross-key dependencies or ordering constraints on reads within a transaction. So the FDB client can send all reads in any order, and they will all return consistent (same read version) values.

This is the same principle as HTTP pipelining, Redis pipelining, or database query batching, with the added guarantee that every read comes from the same snapshot.


6. The encodeInt64 Sign-Bit Trick (Derived from First Principles)

When indexing a numeric field (e.g., age: 25), we want the index to sort numerically. Bytes sort lexicographically. How do we make byte-sort match numeric sort for 64-bit signed integers?

Step 1: Start with what we have.

A 64-bit signed integer in two’s complement:

 0   = 0x0000_0000_0000_0000
 1   = 0x0000_0000_0000_0001
-1   = 0xFFFF_FFFF_FFFF_FFFF
-2   = 0xFFFF_FFFF_FFFF_FFFE
MIN  = 0x8000_0000_0000_0000   (-2^63)
MAX  = 0x7FFF_FFFF_FFFF_FFFF   (2^63 - 1)

Big-endian byte-sort order: 0, 1, MAX, MIN, -2, -1. MIN starts with byte 0x80 (128), which is greater than 0x00 (0), so every negative number sorts after every non-negative one. Big-endian two’s complement therefore does NOT sort correctly across the sign boundary.

Step 2: What would correct order look like?

We want the sorted sequence: …, -2, -1, 0, 1, …, MAX. In bytes, each element should be strictly less than the next. If we could assign distinct 64-bit unsigned values to each signed value such that the unsigned order matches the signed order, we’d be done.

Step 3: The offset encoding idea.

One option: add 2^63 to shift the range. Then:

-2^63 + 2^63 = 0
    -1 + 2^63 = 2^63 - 1
     0 + 2^63 = 2^63
     1 + 2^63 = 2^63 + 1
2^63-1 + 2^63 = 2^64 - 1

These are now unsigned values in the range [0, 2^64-1], in the same order as the original signed values. And they sort correctly in big-endian bytes!

Step 4: Addition vs. XOR.

Adding 2^63 is the same as flipping just the sign bit (bit 63): adding 2^63 to a 64-bit value either sets bit 63 (if it was 0) or clears bit 63 and carries into bit 64, and that carry is simply discarded in 64-bit (mod 2^64) arithmetic.

More precisely: x + 2^63 = x XOR 2^63 for all x in the range [-2^63, 2^63-1]. This is because:

  • If x β‰₯ 0: bit 63 of x is 0; adding 2^63 sets it to 1. XOR with 1<<63 also sets bit 63. Same result.
  • If x < 0: bit 63 of x is 1; adding 2^63 causes a carry into bit 64, which is discarded in 64-bit arithmetic, leaving bit 63 as 0. XOR with 1<<63 also clears bit 63. Same result.

So the implementation is just:

func encodeInt64(x int64) []byte {
    b := make([]byte, 8)
    binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
    return b
}

Verification:

-2  β†’ uint64(-2) = 0xFFFFFFFFFFFFFFFE β†’ XOR 0x8000... β†’ 0x7FFFFFFFFFFFFFFE
-1  β†’ 0xFFFFFFFFFFFFFFFF β†’ XOR β†’ 0x7FFFFFFFFFFFFFFF
 0  β†’ 0x0000000000000000 β†’ XOR β†’ 0x8000000000000000
 1  β†’ 0x0000000000000001 β†’ XOR β†’ 0x8000000000000001
 2  β†’ 0x0000000000000002 β†’ XOR β†’ 0x8000000000000002

Sorted by bytes: 0x7FFFFFFFFFFFFFFE, 0x7FFFFFFFFFFFFFFF, 0x8000000000000000, 0x8000000000000001, 0x8000000000000002, which is the same order as -2, -1, 0, 1, 2. Correct.
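
If you would rather check the addition/XOR identity mechanically than by hand, a few lines of throwaway Go (standard library only, not part of the repo) will do it:

package main

import (
    "fmt"
    "math"
)

func main() {
    // Verify x + 2^63 == x ^ 2^63 under 64-bit wraparound arithmetic
    // for the boundary values discussed above.
    for _, x := range []int64{math.MinInt64, -2, -1, 0, 1, math.MaxInt64} {
        sum := uint64(x) + (1 << 63) // overflow wraps mod 2^64
        xor := uint64(x) ^ (1 << 63)
        fmt.Printf("%20d  add=%016x  xor=%016x  equal=%v\n", x, sum, xor, sum == xor)
    }
}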

This is the standard encoding used by:

  • FDB’s official Tuple layer (negative integers)
  • Apache HBase row key encoding
  • Google Cloud Bigtable ordered keys
  • Apache Flink and Spark for sorted key encodings

7. Index Consistency: The Core Guarantee

Let’s state it precisely: at any point in time, for every record with field F=V and PK=P, there exists an index entry at (F, V, P). And there are no index entries without corresponding records.

Formally:

  • forall records: indexEntries(record) βŠ† presentIndexKeys
  • forall indexKey: correspondingRecord exists

Both directions. Our PutRecord maintains this invariant within one transaction:

  1. Remove old index entries β†’ temporarily breaks β€œevery record has its index entries” (the old record is still present, its entries are not)
  2. Write new record β†’ the old record value is overwritten
  3. Write new index entries β†’ restores consistency

Between steps 1 and 3, within the transaction, the invariant is temporarily broken. But no other transaction sees this intermediate state β€” FDB’s isolation means only the final committed state is visible.
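
One way to make the invariant concrete is a checker that recomputes every record’s expected index keys and verifies each one is present (the first direction). The sketch below is hypothetical, not repo code; it assumes a Schema.Indexes list of indexed field names and a Record type exposing PK and Fields, and it loads everything at once, which is fine for a lab-sized dataset but would need paging in production.

// checkIndexInvariant recomputes the expected index key for every record and
// verifies that key exists in the index subspace.
func (s *Store) checkIndexInvariant(schema Schema) error {
    recs, err := s.ScanRecords(schema.Name, 0) // Limit 0 means no limit in fdb.RangeOptions
    if err != nil {
        return err
    }
    _, err = s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        for _, rec := range recs {
            for _, field := range schema.Indexes { // assumed: list of indexed field names
                key := indexKey(s.ns, schema.Name, field, rec.Fields[field], rec.PK)
                if rt.Get(key).MustGet() == nil { // nil value means the key is absent
                    return nil, fmt.Errorf("missing index entry: %s field=%s pk=%s",
                        schema.Name, field, rec.PK)
                }
            }
        }
        return nil, nil
    })
    return err
}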

What happens if the invariant is violated?

A phantom index entry (index says Alice is in Paris, record says Tokyo):

  • LookupByIndex(city="Paris") returns Alice
  • Fetching Alice’s record shows city=Tokyo
  • Application logic sees inconsistent data

This is precisely the class of bugs that a record layer is designed to prevent.


8. The Apple Story: fdb-record-layer and CloudKit

Apple’s CloudKit powers iCloud sync for iOS and macOS apps. At peak load it handles hundreds of millions of devices syncing calendar events, notes, photo metadata, and app-specific data. This is the production scale at which fdb-record-layer operates.

Why FDB?

Before FDB, Apple had a custom distributed database for CloudKit. It had the usual problems: hard to scale, hard to operate, hard to maintain consistency under failures. FDB was acquired in 2015. Over the next few years Apple rebuilt CloudKit’s storage engine on FDB.

Why a record layer on top?

FDB is a KV store. CloudKit stores typed records (calendar events have a start_date, end_date, title; notes have body, attachments). You cannot just say β€œfind all events starting tomorrow” to an FDB cluster β€” you need to translate that query into a range scan over an encoded index.

fdb-record-layer provides:

  1. Protobuf-typed records (schema validation, efficient encoding)
  2. Value indexes, rank indexes, text indexes, aggregate indexes
  3. A query planner that selects the best index for a given predicate
  4. Change feed via versionstamped index keys (see GUIDE.md chapter on versionstamps)
  5. Schema evolution (add fields with defaults, deprecate fields, change types safely)

Our implementation omits the Protobuf (replaced with map[string]any) and the query planner (you call LookupByIndex directly). But the index key layout and the PutRecord read-modify-write pattern are the same.


9. Real-World Analogues

MySQL InnoDB Secondary Indexes

In InnoDB, a secondary index B-tree contains:

(secondary_key_value, primary_key_value) β†’ ""

Identical to our (field, encodedValue, pk) β†’ "". To look up a row by a secondary key:

  1. Range-scan the secondary index to get PKs.
  2. Look up each PK in the clustered index (the primary B-tree).

This two-step process is commonly called a β€œdouble lookup” or β€œbookmark lookup.” Our LookupByIndex does exactly this.

The reason InnoDB stores the PK (not a physical row pointer) in secondary indexes: when a row is updated and moves within the clustered index, the secondary indexes do not need to change; they still point to the same PK value, which now lives on a different B-tree page but is found by a fresh lookup in the clustered index.

PostgreSQL GIN / B-Tree Indexes

PostgreSQL B-tree indexes store (index_value β†’ ctid), where ctid is the physical page+slot address of a row version. This makes the lookup a single step, but every index needs a new entry whenever an UPDATE writes the new row version to a different heap page (i.e., whenever the update is not HOT).

PostgreSQL also supports β€œindex-only scans”: when an index contains every column the query needs (optionally via INCLUDE payload columns, similar in spirit to covering indexes) and the visibility map permits it, the query is answered from the index alone, trading index size for read speed.

MongoDB Indexes

MongoDB maintains indexes as separate B-trees:

{ city: "Paris" } β†’ [ ObjectId("alice"), ObjectId("carol") ]

MongoDB’s analog of LookupByIndex shows up in query plans as an index scan (IXSCAN) followed by a document fetch (FETCH). The FoundationDB Document Layer, which spoke the MongoDB wire protocol and stored documents in FDB, used exactly our two-subspace approach: one FDB range for documents, one for each index.


10. Exercises

Exercise 1 β€” COUNT Aggregate Index

Add a special β€œcount” index that maintains a per-schema counter:

ns + 0x02 + schemaName + 0x00  β†’  msgpack(int64)

In PutRecord, if old record doesn’t exist, increment. In DeleteRecord, decrement. Both inside the same transaction as the record+index update.

This gives you COUNT(*) in O(1) instead of a full table scan.
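
If you want the counter updates to avoid read conflicts between concurrent writers, FDB’s atomic ADD mutation is the usual tool. A hedged sketch follows; the key layout matches the exercise, but the value is an 8-byte little-endian integer rather than msgpack, because that is what the ADD mutation operates on (countKey and bumpCount are hypothetical names, not repo code):

func countKey(ns []byte, schema string) fdb.Key {
    return fdb.Key(append(append(append([]byte{}, ns...), 0x02), []byte(schema+"\x00")...))
}

// delta is +1 on insert and -1 on delete; call this inside the PutRecord /
// DeleteRecord transaction so the counter and the record commit (or abort) together.
func bumpCount(tr fdb.Transaction, ns []byte, schema string, delta int64) {
    buf := make([]byte, 8)
    binary.LittleEndian.PutUint64(buf, uint64(delta)) // two's-complement little-endian
    tr.Add(countKey(ns, schema), buf)                 // atomic ADD: no read, no read conflict
}

Reading the count back is a plain Get followed by binary.LittleEndian.Uint64 on the 8-byte value.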

Exercise 2 β€” Range Index Query

Add LookupByIndexRange(schema, field, minVal, maxVal any) ([]Record, error).

Since the index is sorted by encoded field value, you can do:

begin = indexPrefix(ns, schema, field) + encodeValue(minVal)
end   = indexPrefix(ns, schema, field) + encodeValue(maxVal) + 0x01
GetRange(begin, end)

This enables WHERE age >= 25 AND age <= 35 as a single range scan: the trailing 0x01 makes maxVal inclusive, because real index keys continue with a 0x00 separator before the PK; drop the 0x01 for an exclusive upper bound (age < 35). Make sure encodeValue produces sort-preserving bytes for your supported types (use encodeInt64 for integers). A sketch follows below.
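
A possible outline, assuming the three-argument indexPrefix(ns, schema, field) form used in the lines above and reusing encodeIndexValue; this is a sketch for the exercise, not repo code:

// LookupByIndexRange returns the matching PKs; feed them into the same pipelined
// record fetch that LookupByIndex uses.
func (s *Store) LookupByIndexRange(schema Schema, field string, minVal, maxVal any) ([]string, error) {
    prefix := indexPrefix(s.ns, schema.Name, field) // assumed 3-arg form
    begin := append(append([]byte{}, prefix...), []byte(encodeIndexValue(minVal))...)
    end := append(append(append([]byte{}, prefix...), []byte(encodeIndexValue(maxVal))...), 0x01)

    kvs, err := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)},
            fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    var pks []string
    for _, kv := range kvs.([]fdb.KeyValue) {
        pks = append(pks, string(extractPkFromIndexKey(kv.Key)))
    }
    return pks, nil
}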

Exercise 3 β€” Composite Indexes

Add support for multi-field indexes:

Schema{Name: "event", Indexes: [{"start", "end"}, {"city", "name"}]}

A composite index on (start, end) lets a query like WHERE start >= monday AND end <= friday scan only the keys with start >= monday, applying the end predicate as a filter during the scan instead of scanning the whole table. The key would be:

ns + 0x01 + schema + 0x00 + "start,end" + 0x00 + encodeValue(start) + 0x00 + encodeValue(end) + 0x00 + pk
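
A possible key builder for that layout (a hypothetical helper, not repo code; it reuses tagIndexes and encodeIndexValue from encoding.go):

// compositeIndexKey joins the field names with "," and the encoded values with
// 0x00 separators, matching the layout sketched above.
func compositeIndexKey(ns []byte, schema string, fields []string, vals []any, pk string) fdb.Key {
    body := schema + "\x00" + strings.Join(fields, ",") + "\x00"
    for _, v := range vals {
        body += encodeIndexValue(v) + "\x00"
    }
    body += pk
    key := append(append([]byte{}, ns...), tagIndexes)
    return fdb.Key(append(key, []byte(body)...))
}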

Exercise 4 β€” Covering Indexes

A covering index stores the full record at the index key (instead of just ""). Then LookupByIndex returns full records from the index scan alone, with no secondary lookup:

index key β†’ msgpack(record)  (instead of β†’ "")

Upside: LookupByIndex requires only one FDB range scan, no Get calls. Downside: PutRecord must update index values when any indexed field changes. Compare the trade-off with real-world covering indexes in MySQL and PostgreSQL.

Exercise 5 β€” Versionstamp-Based Change Feed

Use FDB’s SetVersionstampedKey in PutRecord to write a change log entry:

ns + 0xFF + <10-byte versionstamp>  β†’  msgpack({schema, pk, change_type})

Build a SubscribeChanges(since versionstamp) (<-chan Change, error) function that uses GetRange + Watch to stream changes in real time. This is a simplified version of fdb-record-layer’s change feed.
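
A sketch of the write side, assuming a helper called from inside the PutRecord transaction (appendChangeLog is not part of the repo). With API version 520 and newer, SetVersionstampedKey expects the key to end with a 4-byte little-endian offset pointing at the 10-byte placeholder that FDB overwrites with the commit versionstamp:

func appendChangeLog(tr fdb.Transaction, ns []byte, schema, pk, changeType string) error {
    key := append(append([]byte{}, ns...), 0xFF) // change-log tag from the exercise
    offset := uint32(len(key))                   // placeholder begins right after the tag
    key = append(key, make([]byte, 10)...)       // 10-byte versionstamp placeholder
    off := make([]byte, 4)
    binary.LittleEndian.PutUint32(off, offset)
    key = append(key, off...)                    // trailing offset consumed by FDB

    val, err := msgpack.Marshal(map[string]any{"schema": schema, "pk": pk, "change_type": changeType})
    if err != nil {
        return err
    }
    tr.SetVersionstampedKey(fdb.Key(key), val)
    return nil
}

On the read side, SubscribeChanges can GetRange the ns + 0xFF subspace starting just after the last versionstamp it delivered; since watches fire on individual keys rather than ranges, one option is a Watch on a dedicated β€œlatest change” key written in the same transaction.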


11. Source Code Deep Dive

encoding.go β€” The Key Construction Functions

// tagRecords = 0x00, tagIndexes = 0x01
func recordKey(ns []byte, schema, pk string) fdb.Key {
    return append(ns, append([]byte{tagRecords}, []byte(schema+"\x00"+pk)...)...)
}

func indexKey(ns []byte, schema, field string, val any, pk string) fdb.Key {
    return append(ns, append([]byte{tagIndexes},
        []byte(schema+"\x00"+field+"\x00"+encodeIndexValue(val)+"\x00"+pk)...)...)
}

Why embed the separator \x00 as a Go string literal?

The \x00 byte is the separator between components. Using it directly in a string literal is readable and avoids a separate bytes.Join call. The key correctness requirement: no component value may contain \x00 unescaped. For string field values from map[string]any, callers must validate or escape input. The Tuple encoding in fdb-record-layer handles this by escaping \x00 inside strings as \x00\xFF.

encodeIndexValue β€” Dispatching on Type

func encodeIndexValue(val any) string {
    switch v := val.(type) {
    case int64:
        return string(encodeInt64(v))
    case float64:
        // JSON numbers unmarshal as float64 in Go
        return string(encodeInt64(int64(v)))
    case string:
        return v
    default:
        return fmt.Sprintf("%v", v)
    }
}

The float64 case: In Go’s encoding/json, all JSON numbers unmarshal to float64 when the target type is any. This means {"age": 25} gives you age = float64(25). The encoding converts float64 to int64 first, which is lossy for non-integer floats. A production implementation would add a separate encodeFloat64 that works on the math.Float64bits representation and applies the standard order-preserving trick: flip only the sign bit for non-negative floats, and flip all 64 bits for negative floats (among negatives, a larger raw bit pattern means a more negative number).
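
A sketch of that float encoding (not in the repo; NaN handling omitted):

// encodeFloat64 produces bytes whose lexicographic order matches numeric order:
// non-negative floats get only the sign bit set, negative floats get every bit
// flipped so that more-negative values sort first.
func encodeFloat64(f float64) []byte {
    bits := math.Float64bits(f)
    if bits&(1<<63) == 0 {
        bits ^= 1 << 63 // non-negative: move above all negatives
    } else {
        bits = ^bits // negative: reverse order within the negatives
    }
    b := make([]byte, 8)
    binary.BigEndian.PutUint64(b, bits)
    return b
}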

store.go β€” ScanRecords with Cursor Logic

func (s *Store) ScanRecords(schema string, limit int) ([]Record, error) {
    kvs, _ := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(recordRange(s.ns, schema), fdb.RangeOptions{
            Limit: limit,
        }).GetSliceWithError()
    })
    var recs []Record
    for _, kv := range kvs.([]fdb.KeyValue) {
        var rec Record
        msgpack.Unmarshal(kv.Value, &rec)
        recs = append(recs, rec)
    }
    return recs, nil
}

fdb.RangeOptions{Limit: limit} instructs FDB to stop returning results after limit key-value pairs. This is pushed down to the storage server β€” the server stops scanning early, reducing both network bandwidth and server-side work.

The continuation problem: For a dataset with 1 million records, a single GetRange with no limit would stream all 1M records to the client. A scan that large runs into FDB’s 5-second transaction lifetime (the read version becomes too old before the scan finishes) and moves far more data than the caller needs. A production scan should use a continuation token:

type Continuation struct {
    LastPK string
}

func (s *Store) ScanRecordsPage(schema string, limit int, cont *Continuation) ([]Record, Continuation, error) {
    begin := recordRange(s.ns, schema).Begin
    if cont != nil {
        // Start strictly after the last-seen PK: appending 0x00 gives the
        // smallest key greater than that record's key.
        begin = fdb.Key(append(recordKey(s.ns, schema, cont.LastPK), 0x00))
    }
    // GetRange from begin to end of schema range, limit = limit
    ...
}

Each page call fetches limit records starting after the continuation. The caller loops, passing the continuation from each response into the next call.
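
For illustration, a caller loop might look like this (hypothetical helper, not repo code; handle stands in for whatever the application does with each page):

// scanAll pages through every record of a schema, one continuation at a time.
func scanAll(store *Store, schema string, pageSize int, handle func([]Record)) error {
    var cont *Continuation
    for {
        recs, next, err := store.ScanRecordsPage(schema, pageSize, cont)
        if err != nil {
            return err
        }
        handle(recs)
        if len(recs) < pageSize {
            return nil // short page: end of the schema range
        }
        cont = &next
    }
}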

index.go β€” LookupByIndex Pipelining in Detail

func (s *Store) LookupByIndex(schema Schema, field string, value any) ([]Record, error) {
    prefix := indexPrefix(s.ns, schema.Name, field, value)
    // Step 1: get all PKs from the index (one round-trip)
    kvs, _ := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(
            fdb.KeyRange{Begin: prefix, End: fdb.Key(append(prefix, 0xFF))},
            fdb.RangeOptions{},
        ).GetSliceWithError()
    })

    kvList := kvs.([]fdb.KeyValue) // ReadTransact returns interface{}; assert back to the slice
    pks := make([]string, len(kvList))
    for i, kv := range kvList {
        pks[i] = string(extractPkFromIndexKey(kv.Key))
    }

    // Step 2: fetch all records by PK (pipelined, one round-trip)
    recs := make([]Record, 0, len(pks))
    s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        futures := make([]fdb.FutureByteSlice, len(pks))
        for i, pk := range pks {
            futures[i] = rt.Get(recordKey(s.ns, schema.Name, pk))
        }
        for _, fut := range futures {
            b, _ := fut.Get()
            var rec Record
            msgpack.Unmarshal(b, &rec)
            recs = append(recs, rec)
        }
        return nil, nil
    })
    return recs, nil
}

Two round-trips for any number of results: Step 1 is one GetRange β†’ one round-trip. Step 2 issues N Get calls inside one transaction β†’ they are pipelined to the storage servers, returning in approximately one round-trip (the latency of the slowest individual read, not N Γ— single-read latency).

Why two separate transactions?

The index scan (step 1) and the record fetches (step 2) could live in one ReadTransact closure, in which case both would see the same read version. We separate them here for clarity. In production, combining them is slightly more efficient (one read-version negotiation, one transaction context) and more robust: with two transactions, a record deleted between step 1 and step 2 comes back as a nil value that the caller must handle.


12. Production Considerations

12.1 Write Amplification Under Index Updates

Every PutRecord that changes an indexed field performs:

  • 1 read (old record)
  • 1 clear (old index entry)
  • 1 set (new record)
  • 1 set (new index entry)

For a schema with K indexes and a write that changes all K indexed fields:

  • K reads (old index values embedded in old record, but we need the old record regardless)
  • K clears (old index entries)
  • 1 set (new record value)
  • K sets (new index entries)

Total mutations: K clears + K+1 sets = 2K+1 write operations per update. For a schema with 5 indexes, every update writes 11 keys. This is standard for indexed databases β€” MySQL and PostgreSQL have the same write amplification for secondary indexes. FDB’s commit pipeline absorbs this efficiently because all K sets and K clears go into one transaction batch.

Mitigation: Only maintain indexes on fields that are actually queried. Don’t add indexes speculatively.

12.2 Online Index Building

Adding a new index to a schema that already has millions of records requires backfilling: writing index entries for every existing record. This must be done without taking the database offline.

The safe approach:

  1. Start writing new index entries in PutRecord immediately (for new and updated records).
  2. In a background worker, page through all existing records (using the continuation pattern) and write their index entries.
  3. Once the background worker reaches the β€œfrontier” (current write position), the index is consistent.

During step 2, queries against the new index may return incomplete results (missing records that haven’t been backfilled yet). This is acceptable if the application can tolerate β€œeventually consistent” index builds. FDB’s versionstamp mechanism can track the backfill position and indicate to callers which index entries are β€œsafe” (backfilled) vs β€œtentative.”
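
Under the same assumptions as the continuation sketch earlier (ScanRecordsPage, and a Record type exposing PK and Fields), a minimal backfill worker for step 2 could look like the following; it is a sketch, not repo code:

// backfillIndex writes index entries for existing records, one page per
// transaction, so no single transaction grows too large or runs too long.
func (s *Store) backfillIndex(schema Schema, field string, pageSize int) error {
    var cont *Continuation
    for {
        recs, next, err := s.ScanRecordsPage(schema.Name, pageSize, cont)
        if err != nil {
            return err
        }
        if _, err := s.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
            for _, rec := range recs {
                tr.Set(indexKey(s.ns, schema.Name, field, rec.Fields[field], rec.PK), []byte{})
            }
            return nil, nil
        }); err != nil {
            return err
        }
        if len(recs) < pageSize {
            return nil // reached the frontier; new writes are already indexed by PutRecord
        }
        cont = &next
    }
}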

12.3 N+1 Read Problem

LookupByIndex with N results does:

  • 1 index range scan
  • N record reads (pipelined)

If the calling code then calls LookupByIndex for each result (e.g., β€œfind all events in Paris, then for each event, find all attendees”), you get M index scans + MΓ—N record reads. This is the N+1 problem from ORM land, applied to FDB.

Solution: Batch the secondary lookups. Collect all PKs from all index scans, then issue one pipelined batch of record reads. FDB’s pipelining makes this efficient even when batching hundreds of PKs.


13. Interview Questions β€” Secondary Indexes and Record Layers

Q: Why does PutRecord need to read the old record before writing the new one?

To maintain index consistency. If a record’s indexed field value changes (e.g., city: β€œParis” β†’ β€œTokyo”), the old index entry (city=Paris,pk=alice) must be deleted and a new entry (city=Tokyo,pk=alice) must be created β€” in the same transaction as the record write. To know the old index value, we need the old record. Without reading the old record, we can’t know which index keys to clear, so stale index entries would accumulate, pointing to records with wrong field values.

Q: How does LookupByIndex achieve O(2) round-trips regardless of result count?

The index range scan (step 1) is one GetRange that returns all PKs for the queried field value β€” one round-trip. The record reads (step 2) are issued as N rt.Get calls inside one ReadTransact closure. The FDB client library sends all N requests before waiting for any response (pipelining). The storage servers process them in parallel. By the time the first response arrives, most others are already in flight or returned. The effective latency is one round-trip, not N.

Q: What is the encodeInt64 function doing, and why is it necessary?

FDB keys are sorted lexicographically by bytes. For integer field values to sort correctly as numbers (βˆ’1 < 0 < 1 < 2), the byte encoding must be sort-preserving. Two’s complement big-endian is almost right, but fails at the sign boundary: βˆ’1 (0xFFFF…) sorts after MAX_INT (0x7FFF…) in byte order. Flipping the sign bit (XOR with 0x8000_0000_0000_0000) shifts the entire integer range to unsigned, making byte order match signed integer order. The result: every negative number’s encoding is less than every non-negative number’s encoding, and within each sign region, the encoding preserves order.

Q: How would you implement multi-tenant isolation so that one tenant’s records are never visible to another?

Add a tenant identifier to the namespace prefix: ns = globalPrefix + tenantId. All record and index keys for that tenant are prefixed with their unique tenant bytes. A range scan by tenant A (GetRange(ns_A, ns_A + 0xFF)) only returns keys in tenant A’s namespace. The FDB cluster enforces this: there are no keys shared between tenant namespaces, so no data can leak across tenants. Additionally, for strict isolation, use FDB’s tenant feature (available in 7.x) which restricts a transaction to a specific tenant prefix at the FDB server level.
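
For the namespace-prefix half of that answer, the construction itself is tiny; a minimal sketch (tenantNamespace is not in the repo):

// tenantNamespace builds the per-tenant prefix that every record and index key
// starts with. The 0x00 terminator keeps tenant ids like "acme" and "acme2"
// from ever sharing a key prefix.
func tenantNamespace(globalPrefix []byte, tenantID string) []byte {
    ns := append(append([]byte{}, globalPrefix...), []byte(tenantID)...)
    return append(ns, 0x00)
}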