The Storage Stack — Where Everything Lives
Every database you have ever used is a stack of layers, each hiding detail from the layer above. Knowing exactly where FoundationDB sits in that stack — and why that position is powerful — is the foundation of everything else in this guide.
The Canonical Stack
┌─────────────────────────────────────────────┐
│ Application (SQL, gRPC, REST API)           │  ← you usually live here
├─────────────────────────────────────────────┤
│ Query Planner / Optimizer                   │  ← rewrites queries, picks indexes
├─────────────────────────────────────────────┤
│ Execution Engine (joins, aggregates)        │  ← evaluates the query plan
├─────────────────────────────────────────────┤
│ Storage Engine (rows, columns, indexes)     │  ← where FDB plays
├─────────────────────────────────────────────┤
│ Page / Buffer Cache                         │  ← in-memory hot pages
├─────────────────────────────────────────────┤
│ File / Block I/O (VFS, pread/pwrite)        │  ← option-b-sqlite plugs in here
├─────────────────────────────────────────────┤
│ Kernel: page cache, scheduler, filesystem   │
├─────────────────────────────────────────────┤
│ Physical: SSD / NVMe / network storage      │
└─────────────────────────────────────────────┘
The interesting question is: which layer do you own, and what contract does it expose downward?
Most database literature focuses on the top (SQL, query plans). Most OS literature focuses on the bottom (filesystems, block I/O). The storage engine layer — the middle — is where the real trade-offs live: ordering, atomicity, durability, and concurrency.
FoundationDB is a storage engine that exposes a remarkably clean API to everything above it. This repository explores five different ways to build on top of that API.
B-Trees vs. Log-Structured Merge Trees
Nearly every general-purpose storage engine is built on one of two foundational data structures. Understanding both is essential for reasoning about FDB’s design and the designs of the engines it replaces.
B-Trees
A B-tree stores data as a balanced tree of fixed-size pages (usually 4–16 KB). Updates happen in-place: to change a value, you find its page, overwrite it, and write it back to disk.
                ┌──────────────┐
                │  [20]  [50]  │              ← interior node (keys only)
                └──────────────┘
             /          |          \
   ┌──────────┐   ┌──────────┐   ┌──────────┐
   │ [5][12]  │   │ [25][35] │   │ [55][80] │   ← leaf nodes (key+value)
   └──────────┘   └──────────┘   └──────────┘
Properties:
- Read amplification: O(log N) page reads to find a key (very good)
- Write amplification: ~1–3× (one write per page touched, but journaling doubles it)
- Space amplification: ~33% overhead on average (partially filled pages)
- Random writes: each update may touch a different page → high random I/O on spinning disks
Why B-trees dominated for 40 years: HDDs were sequential-read fast, random-read slow. B-trees minimize the number of seeks to reach a key. SSDs changed this equation — random reads became cheap.
Used by: SQLite, PostgreSQL, MySQL InnoDB, Oracle, early MongoDB (MMAPv1)
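To make the in-place write path concrete, here is a minimal sketch in Go of updating a single page with pread/pwrite-style calls. The page size and the helper are illustrative, not taken from any particular engine; real B-trees layer a buffer cache and a journal/WAL on top of exactly this primitive.

```go
package main

import "os"

const pageSize = 4096 // illustrative; real engines use 4–16 KB pages

// updatePage reads page pageNo, lets the caller modify it in memory, and
// overwrites the same on-disk location. The overwrite-in-place step is what
// makes B-tree writes random I/O.
func updatePage(f *os.File, pageNo int64, modify func(page []byte)) error {
	page := make([]byte, pageSize)
	off := pageNo * pageSize

	if _, err := f.ReadAt(page, off); err != nil { // pread
		return err
	}
	modify(page) // change the bytes in memory
	if _, err := f.WriteAt(page, off); err != nil { // pwrite, same offset
		return err
	}
	// Without a WAL, a crash in the middle of this write can tear the page.
	return f.Sync()
}
```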
Log-Structured Merge Trees (LSM Trees)
LevelDB, RocksDB, and Cassandra’s storage engine are all LSM trees. The core idea: never overwrite; always append.
Write path:
incoming write
↓
WAL (append-only log on disk — for durability)
↓
MemTable (skip list in memory — for fast writes)
↓ (when MemTable is full, flush to disk)
Level 0 SSTables (sorted, immutable files)
↓ (background compaction)
Level 1, 2, ... SSTables
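A minimal Go sketch of that write path, assuming a map-backed MemTable (real engines use a skip list) and eliding the SSTable file format; flushToL0 below is a simplified stand-in for writing an immutable level-0 file and truncating the WAL.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

const memtableLimit = 4 << 20 // illustrative flush threshold: 4 MB

type lsm struct {
	wal      *os.File          // append-only log, for durability
	memtable map[string][]byte // in-memory structure (here: a plain map)
	memBytes int
}

func (l *lsm) Put(key string, value []byte) error {
	// 1. Append the write to the WAL and fsync: this is what makes it durable.
	if _, err := fmt.Fprintf(l.wal, "%s\x00%s\x00", key, value); err != nil {
		return err
	}
	if err := l.wal.Sync(); err != nil {
		return err
	}
	// 2. Insert into the MemTable: fast, purely in memory.
	l.memtable[key] = value
	l.memBytes += len(key) + len(value)
	// 3. When the MemTable is full, flush it as a sorted, immutable level-0
	//    SSTable; background compaction later merges it into deeper levels.
	if l.memBytes >= memtableLimit {
		return l.flushToL0()
	}
	return nil
}

func (l *lsm) flushToL0() error {
	keys := make([]string, 0, len(l.memtable))
	for k := range l.memtable {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	// ... write the sorted pairs out as a new SSTable file (omitted) ...
	l.memtable = make(map[string][]byte)
	l.memBytes = 0
	return nil // a real engine would also truncate the WAL here
}
```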
Anatomy of an SSTable (Sorted String Table):
┌─────────────────────────────────────────────┐
│ Block 0: apple→1, banana→2, cherry→3 ...    │
│ Block 1: date→4, elderberry→5, fig→6 ...    │
│ ...                                         │
│ Index block: [apple→offset0, date→offset1]  │
│ Bloom filter: membership query, ~1% FP rate │
│ Footer: offsets to index+filter blocks      │
└─────────────────────────────────────────────┘
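A sketch of the corresponding read path in Go, with the SSTable held in memory for brevity; a real SSTable reads its index and data blocks from disk, and the types and bloom-filter stand-in here are illustrative.

```go
package main

import "sort"

type kv struct{ key, value string }

type sstable struct {
	mayContain func(key string) bool // stand-in for the bloom filter
	index      []string              // first key of each data block
	blocks     [][]kv                // sorted key-value pairs per block
}

func (t *sstable) Get(key string) (string, bool) {
	// 1. Bloom filter: most absent keys are rejected without touching a block.
	if !t.mayContain(key) {
		return "", false
	}
	// 2. Index block: find the last block whose first key is <= key.
	i := sort.SearchStrings(t.index, key)
	if i == len(t.index) || t.index[i] != key {
		i-- // key sorts before t.index[i]; the candidate is the previous block
	}
	if i < 0 {
		return "", false
	}
	// 3. Data block: binary search within the sorted block.
	blk := t.blocks[i]
	j := sort.Search(len(blk), func(j int) bool { return blk[j].key >= key })
	if j < len(blk) && blk[j].key == key {
		return blk[j].value, true
	}
	return "", false
}
```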
Properties:
- Write amplification: high (10–30× in a 7-level LSM; data is rewritten during each level’s compaction)
- Read amplification: moderate (must check all levels for a key; bloom filters help)
- Space amplification: moderate (old versions linger until compaction)
- Sequential writes: almost entirely sequential → excellent on SSDs and HDDs alike
Why LSM trees dominate at scale today: write throughput is often the bottleneck, and LSM trees turn random writes into sequential appends. RocksDB (Facebook) and Cassandra (Apache) both demonstrate this in production at enormous scale.
The RUM Conjecture
The RUM conjecture (Athanassoulis et al., 2016) formalizes the fundamental trade-off:
You cannot minimize Read cost, Update cost, and Memory/space cost simultaneously. Optimizing two always worsens the third.
| Engine | Read (R) | Update (U) | Memory (M) | Best for |
|---|---|---|---|---|
| B-tree | Low | Medium | Medium | Read-heavy OLTP |
| LSM tree | Medium | Low | High | Write-heavy OLTP |
| Hash table | Very low | Medium | High | Point lookups only |
| Column store | Low | High | Low | Analytical queries (OLAP) |
| FDB (B-tree) | Low | Medium | Medium | Consistent distributed KV |
FDB uses a custom B-tree variant internally on each Storage Server. The distributed nature adds replication overhead but also provides horizontal read scaling.
Write-Ahead Log (WAL) — Crash Recovery Mechanics
Every durable storage engine uses a WAL. Understanding WAL mechanics is essential for understanding what FDB guarantees and what xSync (in the SQLite VFS) must do.
The problem: writing a page to disk is not atomic. A 4 KB page write may partially complete before a power failure. Result: corrupted page.
WAL solution:
1. Before modifying Page X, write the INTENT to the WAL:
WAL record: {LSN: 1042, page: X, before: <old bytes>, after: <new bytes>}
2. fsync() the WAL record to disk (durability guarantee)
3. Now apply the change to Page X in the buffer cache
4. Page X is written back to disk lazily (the WAL already protects us)
On crash recovery:
- Re-apply all WAL records since the last checkpoint → consistent state
- If a WAL record is incomplete (partially written) → discard it, prior state is clean
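Here is a compact Go sketch of steps 1–3, assuming a hypothetical fixed-header record format; real WALs add checksums, group commit, and checkpointing on top of this skeleton.

```go
package main

import (
	"encoding/binary"
	"os"
)

type walRecord struct {
	LSN    uint64
	PageNo uint64
	After  []byte // new page image (redo); many WALs also keep a "before" image for undo
}

// appendRecord writes {LSN, page, length, bytes} to the end of the WAL file.
func appendRecord(wal *os.File, rec walRecord) error {
	var hdr [20]byte
	binary.LittleEndian.PutUint64(hdr[0:8], rec.LSN)
	binary.LittleEndian.PutUint64(hdr[8:16], rec.PageNo)
	binary.LittleEndian.PutUint32(hdr[16:20], uint32(len(rec.After)))
	if _, err := wal.Write(hdr[:]); err != nil {
		return err
	}
	_, err := wal.Write(rec.After)
	return err
}

// commit performs the WAL protocol: log the intent, fsync it, then mutate the
// buffer cache. The dirty page itself can reach disk lazily, because a crash
// can always be repaired by replaying the WAL from the last checkpoint.
func commit(wal *os.File, rec walRecord, applyToCache func(walRecord)) error {
	if err := appendRecord(wal, rec); err != nil { // step 1: record the intent
		return err
	}
	if err := wal.Sync(); err != nil { // step 2: fsync, the durability point
		return err
	}
	applyToCache(rec) // step 3: update the in-memory page
	return nil
}
```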
FDB’s relationship with WAL:
FDB’s Transaction Log layer IS the WAL. When a commit is acknowledged, the write is persisted in the Transaction Log on f+1 machines (where f is the fault tolerance factor). Storage Servers apply these writes asynchronously. This is why FDB’s commit is durable even if Storage Servers crash before applying the write — the Transaction Log holds the record.
This also explains why our SQLite VFS (option-b-sqlite) makes xSync a no-op: by the time Transact() returns, FDB has already done the equivalent of fsync on the WAL.
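A hedged sketch of that idea in Go; the type and method names below are illustrative, not the actual option-b-sqlite code. Once a page write has been committed through fdb.Transact, there is nothing left for a sync call to flush.

```go
package main

import (
	"encoding/binary"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

// fdbBackedFile is an illustrative stand-in for a VFS-style "file" whose
// pages live under a key prefix in FoundationDB.
type fdbBackedFile struct {
	db     fdb.Database
	prefix []byte
}

func (f *fdbBackedFile) pageKey(pageNo int64) fdb.Key {
	b := make([]byte, len(f.prefix)+8)
	copy(b, f.prefix)
	binary.BigEndian.PutUint64(b[len(f.prefix):], uint64(pageNo))
	return fdb.Key(b)
}

// WritePage commits a page through FDB. When Transact returns without error,
// the mutation is already durable on the Transaction Logs.
func (f *fdbBackedFile) WritePage(pageNo int64, data []byte) error {
	_, err := f.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
		tr.Set(f.pageKey(pageNo), data)
		return nil, nil
	})
	return err
}

// Sync plays the role of SQLite's xSync. There is nothing to flush: durability
// was established at commit time, so it simply reports success.
func (f *fdbBackedFile) Sync() error { return nil }
```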
Buffer Pool Management
In any storage engine, the buffer pool (or page cache) is the in-memory cache of disk pages. It is often the single most important performance variable.
Key metrics:
- Hit ratio: fraction of page reads served from cache (aim for >99%)
- Eviction policy: LRU (Least Recently Used), LFU (Least Frequently Used), CLOCK
- Write-back vs. write-through: most use write-back (dirty pages flushed lazily to amortize I/O)
Why this matters for FDB:
FDB’s clients do not manage a buffer pool. Each Storage Server manages its own buffer pool internally. When you call rt.Get(key), the FDB client sends a network request; the Storage Server consults its buffer pool and returns the value. You have no visibility into whether this was a cache hit or a disk read.
For read-heavy workloads, adding more Storage Servers grows the aggregate buffer pool roughly linearly, so a larger fraction of the working set can be served from memory.
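For example, a point read through the Go bindings looks like the sketch below; the key and API version are illustrative. There is no client-side page cache involved: every Get is resolved by the Storage Server that owns the key's shard.

```go
package main

import (
	"fmt"

	"github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
	fdb.MustAPIVersion(730) // use the API version your cluster supports
	db := fdb.MustOpenDefault()

	val, err := db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
		// This is a network round trip to a Storage Server; whether it is
		// served from that server's buffer pool or from disk is invisible here.
		return rt.Get(fdb.Key("hello")).MustGet(), nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("value: %q\n", val)
}
```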
Where Each Lab Sits in the Stack
┌────────────────────────────────────────────────────────────┐
│ Application code (demo/main.go)                            │
├────────────────────────────────────────────────────────────┤
│ option-a-leveldb   │ option-a-sqlite   │ option-c-record   │
│ (LevelDB API)      │ (SQL engine)      │ (Record + Index)  │
├────────────────────────────────────────────────────────────┤
│ FoundationDB client library (fdb.Transact / ReadTransact)  │
├────────────────────────────────────────────────────────────┤
│ FDB cluster (coordinators, proxies, storage)               │
├────────────────────────────────────────────────────────────┤
│ option-b-sqlite             │ option-b-leveldb             │
│ (SQLite VFS)                │ (LevelDB storage.Storage)    │
├────────────────────────────────────────────────────────────┤
│ FDB as the backing file store                              │
└────────────────────────────────────────────────────────────┘
Options A and C sit above FDB — they use FDB as a storage substrate and build a new API on top. The B options sit below an existing engine — they replace the file system with FDB while keeping the engine’s logic intact.
This is a crucial architectural distinction. In the A/C case, FDB’s ACID semantics are exposed to the caller. In the B case, FDB’s semantics are hidden inside a file-like abstraction, and the engine above (LevelDB, SQLite) provides its own ACID layer — layered on top of FDB’s.
Interview Questions
Q: What is write amplification and why does it matter?
Write amplification is the ratio of bytes physically written to disk to bytes logically written by the application. A write amplification of 10× means writing 1 MB of data causes 10 MB of disk writes. LSM trees have high write amplification (compaction rewrites data multiple times) but fast, sequential writes. B-trees have low write amplification but high random-write I/O. At scale, write amplification directly determines storage cost and SSD wear.
Q: Why does an LSM tree use multiple levels?
Compaction merges SSTables from one level into the next. Without levels, you’d need to merge all SSTables at once — an O(total data) operation on every flush. Leveled compaction keeps each level at a fixed size ratio to the one above it (~10×), so each compaction touches a bounded amount of data. This also bounds read amplification: in levels 1 and above, SSTables have non-overlapping key ranges, so a lookup checks at most one SSTable per level (plus any overlapping level-0 files).
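As a back-of-the-envelope illustration (the 64 MB level-0 target and 10× fanout below are assumed, not universal):

```go
// levelCapacities returns the target size of each level with a 10x fanout,
// starting from an assumed 64 MB at level 0. Seven levels already cover
// roughly 64 TB, which is why a handful of levels suffices in practice.
func levelCapacities(levels int) []int64 {
	caps := make([]int64, levels)
	size := int64(64 << 20) // 64 MB at level 0 (illustrative)
	for i := range caps {
		caps[i] = size
		size *= 10
	}
	return caps // 64 MB, 640 MB, 6.4 GB, 64 GB, 640 GB, 6.4 TB, 64 TB, ...
}
```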
Q: What is the difference between a WAL and an SSTable?
A WAL (Write-Ahead Log) is an append-only sequence of write operations — it records WHAT changed, not the final state. It’s used for crash recovery. An SSTable is an immutable, sorted snapshot of actual data — it stores key-value pairs in sorted order for efficient lookup. The WAL is ephemeral (truncated after checkpointing). SSTables are persistent (they ARE the data).
Q: What makes FDB’s storage engine different from a traditional B-tree?
FDB’s storage servers use a custom B-tree variant, but the key difference is architectural: FDB is distributed, sharded, and replicated. Each Storage Server holds a shard (a key range). Reads are served directly by Storage Servers (bypassing the Proxy). Writes go through the Transaction Log for durability before being applied to Storage Servers. This means durability and availability are decoupled — you can lose a Storage Server without losing committed data.