FoundationDB in Depth

FoundationDB is one of the most carefully engineered distributed databases ever built. This chapter goes far beyond the API surface — it covers the commit pipeline, MVCC mechanics, cluster architecture, and the simulation harness that makes FDB trustworthy in production at Apple scale.


2.1 What FDB Actually Is

FoundationDB is a distributed, ordered, transactional key–value store. Each word carries precise meaning:

Distributed: data is automatically sharded across machines. You do not choose a partition key. FDB’s shard boundaries move dynamically as data grows or machines are added. The client library discovers shard locations by reading the cluster file and caching routing information.

Ordered: keys are stored in lexicographic byte order across the entire cluster. GetRange(begin, end) returns keys in sorted order regardless of which Storage Servers hold them. This is the single most powerful property for building layers, because co-location of related data is entirely under your control via key encoding.

Transactional: full ACID across arbitrary key ranges, on multiple machines, at unlimited scale. This is rare. DynamoDB offers single-item atomicity (it later added limited cross-item transactions). Cassandra offers lightweight transactions (LWT) at one consistency level. Redis offers multi-key transactions (MULTI/EXEC) without conflict detection. FDB offers true serializable transactions with automatic conflict detection and retry — across any number of keys, on any number of machines.

Key–value: the value is an opaque byte string (max 100 KB). FDB does not care about structure inside the value. All richness is imposed by the layer above — this is by design. It keeps FDB’s guarantees simple and composable.
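To make the model concrete, here is a minimal end-to-end example — a sketch that assumes the Go bindings at API version 710 and a cluster reachable through the default cluster file:

package main

import (
    "fmt"
    "log"

    "github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
    fdb.MustAPIVersion(710)     // select the client API version first
    db := fdb.MustOpenDefault() // reads the default cluster file

    // One ACID transaction: the write is buffered locally, the read sees
    // the transaction's own uncommitted write, commit happens on return.
    v, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Set(fdb.Key("hello"), []byte("world")) // value is opaque bytes
        return tr.Get(fdb.Key("hello")).Get()     // read-your-writes
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("hello = %s\n", v.([]byte))
}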


2.2 The Ordered KV Model — Deep Dive

The FDB key space is conceptually a single sorted array of byte arrays, spanning the entire cluster:

Key space (lexicographic byte order):
  ""  (smallest possible key)
  ...
  "\x00"
  "\x00\x00"
  ...
  "apple"
  "apple\x00price"       ← prefix scan works naturally
  "apple\x00weight"
  "banana"
  ...
  "\xff\xff\xff..."  (largest possible key)

Range semantics — half-open intervals:

All FDB ranges are [begin, end) — inclusive begin, exclusive end. This is universal in FDB’s API:

fdb.KeyRange{Begin: fdb.Key("apple\x00"), End: fdb.Key("apple\x01")}
// Returns: apple\x00price, apple\x00weight, apple\x00... everything
// Does NOT return: apple\x01, banana, ...

The trick of incrementing the last byte of the prefix (\x00 → \x01) creates a tight upper bound for prefix scans. This pattern appears in every layer in this repo.
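A sketch of that computation in Go (simplified — it assumes the prefix does not end in 0xFF; the Go bindings' fdb.PrefixRange handles that edge case):

// prefixRange builds the half-open [begin, end) range covering every key
// that starts with prefix, by incrementing the prefix's last byte.
func prefixRange(prefix []byte) fdb.KeyRange {
    end := make([]byte, len(prefix))
    copy(end, prefix)
    end[len(end)-1]++ // "apple\x00" → "apple\x01"
    return fdb.KeyRange{Begin: fdb.Key(prefix), End: fdb.Key(end)}
}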

Why ordering is the enabler for layers:

Without ordering, you could only do point lookups (O(1) Get). Ordering enables:

  • Prefix scans: all keys for user 42 (range scan on user:42:*)
  • Range queries: all records with age BETWEEN 25 AND 40 (range scan on encoded age values)
  • Sorted iteration: return all records in PK order (scan the records subspace)
  • Pagination: “next page” = resume range scan from last seen key (see the sketch below)

Every SQL feature that involves ORDER BY, BETWEEN, prefix LIKE, or primary key range is implementable purely through ordered range scans.
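The pagination bullet above, for instance, falls out of key selectors — a sketch, where cursor is the last key already returned and end bounds the scan:

// nextPage resumes a range scan strictly after cursor and returns up to
// limit key-value pairs — the "resume from last seen key" pattern.
func nextPage(db fdb.Database, cursor, end fdb.Key, limit int) ([]fdb.KeyValue, error) {
    res, err := db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        r := fdb.SelectorRange{
            Begin: fdb.FirstGreaterThan(cursor), // strictly after the last seen key
            End:   fdb.FirstGreaterOrEqual(end),
        }
        return rt.GetRange(r, fdb.RangeOptions{Limit: limit}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    return res.([]fdb.KeyValue), nil
}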


2.3 Transactions: The Full Picture

ACID Guarantees

FDB transactions implement Strict Serializability — the strongest isolation level: serializability plus the real-time ordering guarantee of linearizability (Herlihy & Wing, 1990). It is equivalent to:

  • Serializability (transactions appear to execute one at a time)
  • Real-time ordering (if T1 commits before T2 starts, T1 appears before T2)

Most databases offer weaker levels by default. PostgreSQL defaults to “Read Committed” (not serializable). MySQL InnoDB defaults to “Repeatable Read”. FDB always runs at strict serializability — there is no weaker option.

The Commit Pipeline

This is the key to understanding FDB’s performance characteristics:

Client                  Proxy               Resolver            Transaction Log       Storage Servers
  │                       │                    │                      │                     │
  │  1. Begin Txn         │                    │                      │                     │
  │  (local, no network)  │                    │                      │                     │
  │                       │                    │                      │                     │
  │  2. Reads → GetReadVersion                 │                      │                     │
  │ ─────────────────────►│                    │                      │                     │
  │ ◄─────────────────────│ readVersion=v100   │                      │                     │
  │                       │                    │                      │                     │
  │  3. rt.Get(key)       │                    │                      │                     │
  │ ─────────────────────────────────────────────────────────────────────────────────────►  │
  │ ◄─────────────────────────────────────────────────────────────────────────────────────  │
  │  (reads served directly by Storage Servers — Proxy NOT in critical read path)           │
  │                       │                    │                      │                     │
  │  4. tr.Set(key,val)   │                    │                      │                     │
  │  (buffered locally, no network yet)        │                      │                     │
  │                       │                    │                      │                     │
  │  5. Commit: send {readVersion, readSet, writeSet}                  │                     │
  │ ─────────────────────►│                    │                      │                     │
  │                       │  6. Assign commitVersion=v101             │                     │
  │                       │ ──────────────────►│                      │                     │
  │                       │  7. Check conflicts│                      │                     │
  │                       │  (did anyone write a key in readSet       │                     │
  │                       │   between v100 and v101?)                 │                     │
  │                       │ ◄──────────────────│                      │                     │
  │                       │  8. If no conflict: write to TLog         │                     │
  │                       │ ─────────────────────────────────────────►│                     │
  │                       │  9. TLog confirms durability (f+1 copies) │                     │
  │                       │ ◄─────────────────────────────────────────│                     │
  │  10. Commit ack       │                    │                      │                     │
  │ ◄─────────────────────│                    │                      │                     │
  │                       │  (Storage Servers apply writes asynchronously from TLog)        │

Key insight from the pipeline:

  1. Reads bypass the Proxy — they go directly to Storage Servers. This means read throughput scales linearly with Storage Server count.
  2. Writes buffer locally — no network traffic until commit. A transaction that writes 1,000 keys generates exactly one network round-trip.
  3. Conflict checking is done by the Resolver — a separate stateless process that checks whether any key in the transaction’s read set was written by another transaction between readVersion and commitVersion. No lock managers, no 2PL.
  4. Durability before acknowledgment — FDB doesn’t ack a commit until the write is in the Transaction Log on f+1 machines. Storage Servers may lag behind.
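Insight 2 is worth seeing in code — a sketch; every Set below is buffered in client memory, and the network is touched only when the function returns and the commit is sent:

_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    for i := 0; i < 1000; i++ {
        tr.Set(fdb.Key(fmt.Sprintf("batch:%04d", i)), []byte("payload"))
    }
    return nil, nil // commit: the single round-trip for all 1,000 writes
})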

Retry Semantics

The db.Transact(func) loop handles retries automatically. The function is called again if:

  • A conflict is detected (error code 1020)
  • The transaction is too old (error code 1007 — MVCC window expired)
  • A transient network error occurs

The retry rule: the function passed to Transact may run more than once, so any side effects outside FDB must be idempotent. FDB operations are always safe to retry (each retry gets a fresh transaction). But if your function sends an email, charges a credit card, or writes to another database, a retry will do that twice.

// WRONG: email sent on retry
db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    tr.Set(key, value)
    sendWelcomeEmail(user)   // ← sent multiple times on conflict!
    return nil, nil
})

// RIGHT: defer side effects to after commit
_, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    tr.Set(key, value)
    return nil, nil
})
if err == nil {
    sendWelcomeEmail(user)  // ← exactly once
}

Read-Only Transactions

db.ReadTransact(func) opens a read-only transaction. It:

  • Does not track a write set
  • Does not need a commit round-trip
  • Cannot conflict (reads never conflict with other reads)
  • Uses a recent committed version as the read version

Read-only transactions are cheaper and should be preferred for any operation that only reads data.
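A sketch of the pattern — counting the keys under a hypothetical inventory: prefix in one consistent snapshot:

count, err := db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
    r, err := fdb.PrefixRange([]byte("inventory:")) // hypothetical prefix
    if err != nil {
        return nil, err
    }
    kvs, err := rt.GetRange(r, fdb.RangeOptions{}).GetSliceWithError()
    if err != nil {
        return nil, err
    }
    return len(kvs), nil // no write set, no commit, no conflicts
})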


2.4 MVCC — Multi-Version Concurrency Control

FDB stores multiple versions of each key, identified by a monotonically increasing version number (a 64-bit integer incremented globally for every commit):

Version timeline:
  v100: Set("color", "red")     ← committed at v100
  v101: Set("count", "5")       ← committed at v101
  v102: Set("color", "blue")    ← committed at v102
  v103: Set("count", "6")       ← committed at v103

A transaction reading at version v101 sees:

  • "color" = "red" (latest version ≤ v101)
  • "count" = "5" (latest version ≤ v101)

A transaction reading at version v103 sees:

  • "color" = "blue" (latest version ≤ v103)
  • "count" = "6" (latest version ≤ v103)

The 5-second window:

FDB garbage-collects old versions after approximately 5 seconds (the default; the window is configurable). A transaction that starts at v100 and then tries to read 6 seconds later will get error 1007 (transaction_too_old). This is why long-running transactions are not supported in FDB.

This is a fundamental design choice. FDB optimizes for high-throughput short transactions. If you need to read data while performing long-running computation (e.g., a streaming job), the correct pattern is:

  1. Read a batch of data, process it
  2. Commit the batch
  3. Read the next batch in a new transaction
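A sketch of that loop in Go, carrying a cursor across short transactions (processItem is a hypothetical placeholder for per-record work):

// processAll streams a large key range in batches, one short transaction
// per batch, so no transaction outlives the ~5-second MVCC window.
func processAll(db fdb.Database, begin, end fdb.Key, batchSize int) error {
    cursor := begin
    for {
        res, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
            r := fdb.SelectorRange{
                Begin: fdb.FirstGreaterOrEqual(cursor),
                End:   fdb.FirstGreaterOrEqual(end),
            }
            kvs, err := tr.GetRange(r, fdb.RangeOptions{Limit: batchSize}).GetSliceWithError()
            if err != nil {
                return nil, err
            }
            for _, kv := range kvs {
                processItem(tr, kv) // hypothetical: process + write via tr
            }
            return kvs, nil
        })
        if err != nil {
            return err
        }
        kvs := res.([]fdb.KeyValue)
        if len(kvs) < batchSize {
            return nil // final, short batch: done
        }
        // Resume strictly after the last processed key.
        cursor = append(append(fdb.Key{}, kvs[len(kvs)-1].Key...), 0x00)
    }
}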

MVCC vs. Locking

Feature                MVCC (FDB, PostgreSQL)               Locking (older MySQL)
Reads block writers    No                                   Yes (shared lock)
Writers block reads    No                                   Yes (exclusive lock)
Deadlocks              Not possible                         Possible (lock cycles)
Stale reads            Possible if version too old          Not possible
Overhead               Garbage collection of old versions   Lock manager overhead

MVCC is strictly better for read-heavy workloads and any workload that mixes reads and writes.


2.5 Watches, Versionstamps, and Atomic Operations

These three primitives are rarely covered in introductory FDB material but are essential for building production systems.

Watches

tr.Watch(key) returns a FutureNil that fires when the key’s value changes. The watch is registered at commit time and fires asynchronously.

Production use cases:

  • Distributed lock re-acquisition: hold a lock key; watch it to know when released
  • Cache invalidation: watch a “cache bust” key; invalidate local cache when it fires
  • Event notification: watch a “latest event” key; re-read when it changes
  • Leader election: watch the leader key; trigger re-election when it’s cleared

// Polling-free pub/sub pattern
func watchForChanges(db fdb.Database, key fdb.Key) {
    for {
        var watch fdb.FutureNil
        _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
            watch = tr.Watch(key)
            return nil, nil
        })
        if err != nil {
            continue // registration failed; try again
        }
        if err := watch.Get(); err != nil { // blocks until key changes
            continue // watch errored or was cancelled; re-register
        }
        // process the change...
    }
}

Limits: FDB recommends no more than 10,000 active watches per client. Each watch consumes resources on the Storage Server holding that key.

Versionstamps

A versionstamp is a 10-byte identifier that is globally unique and monotonically increasing. It consists of the 8-byte commit version plus a 2-byte batch order that disambiguates transactions committed at the same version; the Tuple layer can append a further 2-byte user version to order multiple versionstamps written by one transaction.

tr.SetVersionstampedKey(keyTemplate, value): writes a key where one 10-byte slot in keyTemplate is replaced with the commit versionstamp.

// Building a totally ordered event log — no coordination required.
// Since API version 520, the key must end with a 4-byte little-endian
// offset telling FDB where the 10-byte placeholder begins.
prefix := []byte("events:")
keyTemplate := append(prefix, make([]byte, 10)...)         // 10-byte placeholder
keyTemplate = append(keyTemplate, []byte(":metadata")...)
offset := make([]byte, 4)
binary.LittleEndian.PutUint32(offset, uint32(len(prefix))) // placeholder starts at byte 7
keyTemplate = append(keyTemplate, offset...)

tr.SetVersionstampedKey(fdb.Key(keyTemplate), eventPayload)
// After commit: key is "events:<10-byte versionstamp>:metadata"
// Every write gets a globally unique, strictly ordered key
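In practice the Go Tuple layer can hide this bookkeeping: packing a tuple containing tuple.IncompleteVersionstamp(0) with PackWithVersionstamp(prefix) produces a key template with the offset suffix already appended.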

Why this is powerful:

  • A sequence of commits from different clients, on different machines, all get globally ordered keys
  • No sequence number coordinator required
  • Reading all events in order = a single range scan over the events: prefix (e.g. fdb.PrefixRange([]byte("events:")), or the increment-the-last-byte trick from 2.2)

This is how fdb-record-layer implements change feeds and how any FDB-based event sourcing system should be built.

Atomic Operations

Atomic operations modify a key without reading it in the same transaction. They commute — two concurrent Add(key, 1) calls on the same key will both succeed and both take effect (the value ends up incremented by 2), even if they had the same read version.

Operation           Semantics                                     Use case
Add(key, n)         Atomic 64-bit integer addition                Counters, sequence numbers
BitAnd/Or/Xor       Bitwise operation on the value                Bit set membership
Max/Min             Atomic max or min of the current value        High-water marks, metrics
SetVersionstamped   Stamp the commit version into key/value      Ordered event logs
CompareAndClear     Clear key if current value equals argument    Expiry, conditional delete
AppendIfFits        Append bytes if result fits in value size     Accumulator patterns

Critical property: atomic ops do not generate a read conflict for their key. Two transactions can both do tr.Add("counter", 1) without conflicting. This is fundamentally different from read-modify-write (which would conflict).

Under the hood: atomic ops are sent to the Storage Server as special mutation records that the server applies during version materialization. The server has the current value and applies the operation when making the value visible.
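A sketch of a conflict-free counter (the Go bindings' Add takes its operand as an 8-byte little-endian value; assumes encoding/binary is imported):

// incrementCounter adds delta to a counter without ever reading it, so
// concurrent increments commute and never conflict with each other.
func incrementCounter(db fdb.Database, key fdb.Key, delta int64) error {
    param := make([]byte, 8)
    binary.LittleEndian.PutUint64(param, uint64(delta)) // little-endian operand
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Add(key, param) // applied by the Storage Server, not the client
        return nil, nil
    })
    return err
}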


2.6 Cluster Architecture — Every Component Explained

                    ┌─────────────────────────────────┐
                    │         Client Process           │
                    │   fdb.OpenDatabase(clusterFile)  │
                    └────────────┬────────────────────┘
                                 │ reads cluster file once;
                                 │ caches routing table
                    ┌────────────▼────────────────────┐
                    │       Coordinators (3–5)         │
                    │  Paxos-based consensus           │
                    │  Store: cluster configuration    │
                    │  Elect: active Proxies, TLogs,   │
                    │         Data Distributor,        │
                    │         Ratekeeper                │
                    └────────────┬────────────────────┘
                                 │
              ┌──────────────────┼──────────────────┐
              ▼                  ▼                   ▼
    ┌─────────────────┐  ┌──────────────┐  ┌──────────────────┐
    │    Proxies      │  │   Resolvers  │  │  Transaction Log  │
    │ (GRV + commit)  │  │ (conflicts)  │  │  (WAL, durable)   │
    └────────┬────────┘  └──────────────┘  └──────────────────┘
             │ after conflict check + TLog commit
             │ Storage Servers apply writes asynchronously
    ┌────────▼──────────────────────────────────────┐
    │             Storage Servers (many)             │
    │  Each holds a shard (contiguous key range)    │
    │  Serves reads directly; applies mutations     │
    │  from TLog                                    │
    └────────────────────────────────────────────────┘

Proxies (GRV Proxies + Commit Proxies)

FDB 7.x splits proxy responsibilities:

  • GRV Proxies (Get Read Version): assign read versions to transactions. This is a global operation — all reads in the cluster must see a consistent version ordering. GRV proxies serialize these assignments.
  • Commit Proxies: receive commit requests, forward to Resolvers for conflict checking, then write to the Transaction Log.

Proxies are stateless (they hold no data). They can be added or removed without data migration.

Resolvers

Resolvers track which keys were written in each recent commit. When a Commit Proxy submits a transaction’s read set, the Resolver checks: “was any key in this read set written by a committed transaction between this transaction’s read version and now?” If yes → conflict, retry. If no → proceed.

Resolvers are in-memory only (they hold recent write history, not persistent data). Their state window matches the MVCC window (~5 seconds of commits).
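Conceptually — this is a model, not FDB's actual implementation, which tracks key ranges rather than individual keys — the check looks like:

// conflicts reports whether any key this transaction read was written by
// a commit newer than its read version (within the ~5 s history window).
func conflicts(lastWrite map[string]int64, readSet []string, readVersion int64) bool {
    for _, key := range readSet {
        if v, ok := lastWrite[key]; ok && v > readVersion {
            return true // written after we read it → conflict, retry
        }
    }
    return false
}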

Transaction Log

The TLog is FDB’s Write-Ahead Log — but distributed. Each commit is written to f+1 TLog machines before being acknowledged to the client. Each TLog keeps its own durable on-disk log and syncs it before confirming.

After Storage Servers confirm they have applied all mutations up to a version (the “durable version”), TLog can truncate its older entries.

Storage Servers

Each Storage Server holds a subset of the key space (a shard). When a client wants to read a key, it looks up which Storage Server holds that key’s shard (from its cached routing table), and sends the read directly to that server.

Storage Servers use a pluggable local storage engine (behind an interface called “KeyValueStore” internally) — historically a modified SQLite B-tree, more recently the purpose-built Redwood B+-tree, depending on configuration. Each mutation from the TLog is applied to this local store.

Data Distribution: The Data Distributor process monitors shard sizes and access patterns. If a shard grows too large (>250MB by default), it splits it and migrates half to another Storage Server. This is transparent to clients.

Coordinators

Coordinators run a Paxos-like consensus protocol to maintain the “cluster controller” — the process that manages the lifecycle of all other FDB roles. If the cluster controller fails, the coordinators elect a new one. Coordinators hold no user data; their only state is the small, consensus-replicated cluster configuration.

The cluster file (fdb.cluster) contains just the coordinator addresses: fdbdemo:fdbdemo@127.0.0.1:4500. This single file is how clients bootstrap their connection to any FDB cluster.
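Bootstrapping a client is correspondingly small — a sketch with an illustrative path:

fdb.MustAPIVersion(710)
// The cluster file is everything the client needs to find the cluster.
db := fdb.MustOpenDatabase("/etc/foundationdb/fdb.cluster")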


2.7 Production Limits and Sizing

These are not academic — they affect how you design your layer:

Parameter                      Limit / Default                    Implication
Max value size                 100 KB (102,400 bytes)             Large values must be chunked (option-b)
Max transaction size           ~10 MB (reads + writes combined)   Bulk loads must be split
Max transaction duration       ~5 seconds (MVCC window)           No long-running transactions
Max watches per client         ~10,000                            Watches are not free
Max key size                   10 KB                              Keep keys short (< 1 KB preferred)
Read version latency           ~1 ms (local datacenter)           Fast; dominated by network
Commit latency                 ~2–5 ms (local datacenter)         TLog write + network
Max throughput (single proxy)  ~250,000 operations/sec            Scale horizontally by adding proxies
Storage per server             ~100 GB–4 TB typical               FDB handles redistribution automatically

Production rule of thumb: keep transactions small (< 1 MB), short (< 1 second), and focused (few key ranges). This maximizes throughput and minimizes conflicts.
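For the 100 KB value limit, the standard workaround is chunking. A sketch — the blob:<name>:<index> key layout is illustrative, not a fixed convention:

const chunkSize = 100_000 // stay under the 100 KB value limit

// writeBlob splits data across consecutive chunk keys in one transaction.
// Blobs approaching the ~10 MB transaction limit would also need to be
// split across multiple transactions.
func writeBlob(db fdb.Database, name string, data []byte) error {
    _, err := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        for i := 0; len(data) > 0; i++ {
            n := chunkSize
            if len(data) < n {
                n = len(data)
            }
            tr.Set(fdb.Key(fmt.Sprintf("blob:%s:%08d", name, i)), data[:n])
            data = data[n:]
        }
        return nil, nil
    })
    return err
}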


2.8 The Simulation Harness — Why FDB Is Trusted

FDB’s simulation framework is the most sophisticated correctness testing system in any open-source database. Understanding it explains why Apple, Snowflake, and others trust FDB with mission-critical data.

How Simulation Works

The entire FDB cluster — network stack, disk I/O, clocks, process scheduling — is implemented twice:

  1. The real implementation (for production)
  2. A deterministic simulation (for testing)

In simulation mode:

  • All network calls are function calls (no actual sockets)
  • Disk I/O is simulated in memory with configurable failure injection
  • Clocks are virtual (controlled by the simulator)
  • Processes are coroutines (no real threads)
  • Random seeds control all non-determinism

The simulator injects:

  • Machine failures: simulate killing any process at any time
  • Network partitions: drop messages between specific nodes
  • Disk failures: simulate corrupted writes, partial writes
  • Clock skew: advance clocks non-uniformly across nodes
  • Slow operations: simulate disk latency spikes

Why Determinism Matters

When a bug is found in simulation:

  1. Record the random seed that triggered it
  2. Reproduce the exact same sequence of events with the same seed
  3. Add print statements, breakpoints, rerun — identical behavior every time

This is impossible with real distributed systems, where timing non-determinism makes bugs nearly unreproducible.

Scale of Testing

Before each FDB release, the simulation runs for thousands of machine-years’ worth of simulated time. A team member once stated: “We’ve found more bugs through simulation than through all other methods combined.”

This testing discipline is why FDB can make strong correctness guarantees. It’s also why the FDB team is one of the few database teams in the world that will confidently claim their ACID implementation is correct under arbitrary failure injection.

What This Means for You

When you use FDB as your storage substrate, you inherit these correctness guarantees. The five layers in this repository can fail in arbitrary ways — the processes can crash, the network can partition — and FDB will maintain consistency. The only failure that can cause data loss is the simultaneous loss of f+1 machines during a transaction commit window (where f is the configured fault tolerance level, typically 1 for a 3-machine cluster or 2 for a 5-machine cluster).


Interview Questions

Q: What is strict serializability and how does it differ from serializability?

Serializability means transactions appear to execute in some sequential order. Strict serializability (also called linearizability + serializability) additionally requires that the sequential order respects real-time ordering: if transaction T1 commits before transaction T2 begins, T1 must appear before T2 in the serial order. This matters for distributed systems where different clients observe commits at different times. FDB provides strict serializability. Most databases (PostgreSQL serializable, MySQL serializable) provide only serializability.

Q: How does FDB avoid deadlocks?

FDB doesn’t use locks. Instead, it uses optimistic concurrency control: reads are conflict-tracked, writes are buffered, and conflict detection happens at commit time. If a conflict is detected, the transaction retries with a new read version — it never “waits” for another transaction to release a lock. Since there’s no waiting, there can’t be a cycle of transactions waiting on each other.

Q: What happens to a running FDB transaction if the Proxy crashes?

The client’s Transact loop handles this. The proxy failure will manifest as a network error on the commit request. The FDB client library translates this to a retriable error, and Transact calls the function again with a new transaction and a new proxy (the client discovers the new proxy from the Coordinator). The retry is transparent to application code.

Q: Why does FDB limit transaction size to 10 MB?

FDB’s conflict resolution requires the Resolver to hold recent write history in memory. Large transactions would require holding large amounts of data in the Resolver’s conflict window. Additionally, the Transaction Log must write the entire transaction atomically before acknowledging. Large transactions create tail latency. The 10 MB limit keeps both of these bounded. For bulk loading, the correct approach is to batch writes in 1–5 MB transactions.

Q: What is a “read version” in FDB?

A read version is a 64-bit integer that represents a point in time in FDB’s commit history. Every committed transaction increments the global version counter. A transaction reading at version V sees all commits with version ≤ V and no commits with version > V. Read versions are assigned by GRV Proxies and represent a consistent snapshot of the entire cluster at that version.

Q: Can two FDB transactions that don’t share any keys conflict?

No. Conflict detection in FDB is purely key-based. Two transactions with disjoint key sets can never conflict, regardless of how close together they commit. This is why carefully designed key spaces (using subspaces to isolate different logical entities) dramatically reduce conflict rates in high-throughput applications.