How Real Systems Use FDB

One of the most useful ways to understand an architecture is to see it in production at scale. FoundationDB is unusual in that several of the systems built on it are open-source or at least publicly described. Each of the five labs in this repo maps directly to a pattern used in a real, deployed production system.


4.1 Apple’s FoundationDB Record Layer (iCloud / CloudKit)

The fdb-record-layer is a Java library open-sourced by Apple in 2018. It is the storage engine behind Apple’s CloudKit metadata services — the cross-device sync that backs Photos, Contacts, Notes, iMessage, and nearly every other iCloud-connected app.

What it provides:

  • Protobuf-typed records (define your schema in .proto files)
  • Declarative index definitions: value indexes, rank indexes, aggregate indexes, text indexes
  • A query planner that compiles predicates into FDB range scans
  • Schema evolution (add/remove/change fields without downtime)
  • Continuations (long queries can be resumed across transactions)
  • Change feeds via versionstamped keys

How the key layout maps to our option-c-record-layer:

fdb-record-layer uses a similar two-subspace structure:

[storePrefix] [META_DATA_VERSION] → version metadata
[storePrefix] [RECORD_TYPE_KEY]   → schema info
[storePrefix] [RECORD] [pk bytes] → serialized Protobuf record
[storePrefix] [INDEX] [indexName] [indexValue] [pk bytes] → index entry

Our record layer uses a simpler version:

ns + 0x00 + schema + 0x00 + pk    → msgpack record
ns + 0x01 + schema + 0x00 + field + 0x00 + value + 0x00 + pk → index entry (empty value)
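
To make the layout concrete, here is a minimal sketch in Python of how these two key shapes could be built and queried, assuming the fdb and msgpack packages; the namespace constant and helper names are illustrative, not the lab’s actual code.

import fdb
import msgpack

fdb.api_version(630)
db = fdb.open()

NS = b'app'  # namespace prefix (illustrative)

def pack_record_key(schema: bytes, pk: bytes) -> bytes:
    # ns + 0x00 + schema + 0x00 + pk
    return NS + b'\x00' + schema + b'\x00' + pk

def pack_index_key(schema: bytes, field: bytes, value: bytes, pk: bytes) -> bytes:
    # ns + 0x01 + schema + 0x00 + field + 0x00 + value + 0x00 + pk
    # Note: this simple separator scheme assumes the components contain
    # no 0x00 bytes (the labs' simplification; the tuple layer avoids this).
    return NS + b'\x01' + schema + b'\x00' + field + b'\x00' + value + b'\x00' + pk

@fdb.transactional
def save_user(tr, pk: bytes, record: dict):
    tr[pack_record_key(b'user', pk)] = msgpack.packb(record)
    # Maintain the secondary index on "city"; the index entry's value is empty.
    tr[pack_index_key(b'user', b'city', record['city'].encode(), pk)] = b''

@fdb.transactional
def users_in_city(tr, city: bytes):
    prefix = NS + b'\x01' + b'user' + b'\x00' + b'city' + b'\x00' + city + b'\x00'
    # Every key under this prefix ends in a primary key; strip the prefix to get it.
    return [kv.key[len(prefix):] for kv in tr.get_range_startswith(prefix)]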

The production scale: CloudKit syncs data for hundreds of millions of Apple devices. Every Photos upload, every note edit, every contact sync creates record layer transactions. FDB handles this via horizontal partitioning: different users’ data lives in different FDB shards, so per-user transaction load is low. The global throughput is enormous, but each individual transaction is small.

Key insight — continuations: fdb-record-layer’s most important production feature is the “continuation” — a cursor that can be serialized, stored, and resumed across transaction boundaries. Since FDB limits transaction duration to 5 seconds, long queries must be broken into chunks. A continuation saves the last-seen key so the next transaction can resume exactly where the previous one left off. Our labs don’t implement this, but every production layer needs it for full-table scans, bulk exports, and maintenance jobs.
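
A minimal sketch of the pattern in Python, assuming the fdb package; the batch size and helper names are illustrative. Each batch runs in its own transaction, and the continuation is just the successor of the last key seen:

import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def scan_batch(tr, begin: bytes, end: bytes, limit: int = 1000):
    # One bounded chunk, small enough to finish well inside the 5-second limit.
    kvs = list(tr.get_range(begin, end, limit=limit))
    # The continuation is the immediate successor of the last key we saw.
    continuation = kvs[-1].key + b'\x00' if len(kvs) == limit else None
    return kvs, continuation

def scan_all(db, begin: bytes, end: bytes):
    # Drives the chunked scan across transaction boundaries. The continuation
    # is plain bytes, so a real layer could persist it and resume much later.
    cursor = begin
    while cursor is not None:
        kvs, cursor = scan_batch(db, cursor, end)
        for kv in kvs:
            yield kv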

Schema evolution in production: Adding a new field to a record type is safe (existing records just don’t have it; the field defaults to absent). Removing a field requires first removing all code that writes it, then removing the field from the schema. Renaming a field is the dangerous operation — it looks like “add new field + remove old field” but requires migrating all existing records. fdb-record-layer provides a dedicated schema migration API for this case.
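
In the labs’ msgpack encoding, the “adding is safe” rule falls out of decoding: an old record simply lacks the new key, so the reader supplies a default. A tiny sketch, with illustrative field names (assumes the msgpack package):

import msgpack

old = msgpack.packb({'name': 'Ada', 'city': 'Paris'})              # written before the field existed
new = msgpack.packb({'name': 'Lin', 'city': 'Oslo', 'tz': 'CET'})  # written after

def timezone(raw: bytes) -> str:
    # A missing field decodes as an absent dict key; default rather than fail.
    return msgpack.unpackb(raw).get('tz', 'UTC')

assert timezone(old) == 'UTC'
assert timezone(new) == 'CET'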


4.2 FoundationDB Document Layer (MongoDB Wire Protocol)

The Document Layer was an open-source FDB layer that spoke the MongoDB wire protocol. A MongoDB client (driver, shell, or Compass) could point at the Document Layer unchanged, and all data was stored in FDB with ACID guarantees.

What it did:

  • Parsed MongoDB wire protocol (BSON over TCP)
  • Translated BSON documents into FDB key-value pairs
  • Translated MongoDB queries ({city: "Paris", age: {$gte: 25}}) into FDB range scans and filters
  • Provided MongoDB’s consistency guarantees (which are weaker than FDB’s — the Document Layer could offer stronger consistency than stock MongoDB by virtue of FDB’s serializable transactions)

Key encoding for a BSON document:

[collectionPrefix] [docId]                      → BSON document value
[indexPrefix] [fieldName] [fieldValue] [docId]  → index entry

This is structurally identical to option-c-record-layer with schema = collectionName.
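
As a concrete sketch (not the Document Layer’s actual planner), here is how a query like {city: "Paris", age: {$gte: 25}} could execute against this layout using the fdb package’s tuple-layer subspaces: the equality predicate becomes one bounded scan, the $gte predicate another, and the results intersect on docId. All names are illustrative, and JSON stands in for BSON.

import fdb
import json

fdb.api_version(630)
db = fdb.open()

docs = fdb.Subspace(('docs', 'people'))   # [collectionPrefix] [docId] -> document
index = fdb.Subspace(('idx', 'people'))   # [indexPrefix] [field] [value] [docId] -> ''

@fdb.transactional
def save_doc(tr, doc_id: str, doc: dict):
    tr[docs.pack((doc_id,))] = json.dumps(doc).encode()  # JSON stands in for BSON
    tr[index.pack(('city', doc['city'], doc_id))] = b''
    tr[index.pack(('age', doc['age'], doc_id))] = b''

@fdb.transactional
def find(tr, city: str, min_age: int):
    # {city: "Paris"} -> one bounded scan of the city index.
    in_city = {index.unpack(kv.key)[-1]
               for kv in tr.get_range_startswith(index.pack(('city', city)))}
    # {age: {$gte: 25}} -> scan the age index from min_age to the end of the field.
    age_ok = {index.unpack(kv.key)[-1]
              for kv in tr.get_range(index.pack(('age', min_age)),
                                     index.range(('age',)).stop)}
    return in_city & age_ok  # matching docIds; fetch documents as needed

A production planner would usually pick the more selective index and apply the other predicate as a residual filter, or use a compound index so both predicates fold into a single range scan; the intersection above just makes the index-scan translation explicit.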

Why it was deprecated: The Document Layer was maintained by the FoundationDB team at Apple. After FDB itself was open-sourced in 2018, sustaining a full MongoDB-compatible query engine (wire protocol, query planner, and compatibility with an ecosystem whose licensing shifted around 2018–2019) fell outside the team’s scope. The layer was deprecated in 2019, but the codebase remains instructive as a reference implementation.

Lesson: A full relational/document query engine can be built atop FDB in ~50,000 lines of C++. The query planner complexity is in compiling predicate trees to efficient index scan strategies. The storage layer is just option-c-record-layer.


4.3 mvsqlite — SQLite on FDB with Multi-Process MVCC

mvsqlite is a Rust project by Heyang Zhou (losfair). It implements a SQLite VFS backed by FDB such that multiple processes can open the same “database” concurrently with full MVCC isolation.

How it maps to option-b-sqlite:

Our pagestore stores pages at:

ns + 0x01 + pageNum (big-endian) → 4096-byte page content

mvsqlite stores pages at:

[namespace] [pageNum_big_endian] → page content at that version
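
A minimal sketch of the lab’s page-store keying in Python (without mvsqlite’s versioning), assuming the fdb package; the namespace value and helper names are illustrative. pageNum is encoded big-endian so pages sort numerically:

import fdb
import struct

fdb.api_version(630)
db = fdb.open()

NS = b'dbfile'   # one logical SQLite "file" (illustrative)
PAGE_SIZE = 4096

def page_key(page_num: int) -> bytes:
    # ns + 0x01 + pageNum; big-endian keeps pages in numeric key order
    return NS + b'\x01' + struct.pack('>I', page_num)

@fdb.transactional
def write_page(tr, page_num: int, data: bytes):
    assert len(data) == PAGE_SIZE
    tr[page_key(page_num)] = data

@fdb.transactional
def read_page(tr, page_num: int) -> bytes:
    v = tr[page_key(page_num)]
    if v.present():
        return bytes(v)
    return b'\x00' * PAGE_SIZE  # unwritten pages read as zeros, like a sparse file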

What mvsqlite adds beyond our implementation:

  1. Page-level MVCC: Instead of storing just the current page, mvsqlite stores multiple historical versions. Old page versions are kept for active readers. This is implemented by appending a version prefix:

    [namespace] [pageNum] [snapshotVersion] → page at that version
    

    A reader at snapshot V only reads pages at version <= V. The GC process cleans up versions older than the oldest active reader. (A sketch of this snapshot-read rule follows this list.)

  2. Multi-process write serialization: mvsqlite serializes concurrent writers using an FDB transaction on a “write lock” key. The winner gets exclusive write access; losers retry. SQLite’s own xLock/xUnlock calls are translated to FDB key operations.

  3. Write-ahead buffering in FDB: Instead of a local WAL file, mvsqlite buffers pending writes in FDB under a separate subspace. Commit flushes the buffer to the page subspace atomically.

  4. Namespace mapping: SQLite filenames map to FDB key prefixes. ATTACH DATABASE 'db2.mvsqlite' AS db2 opens a second namespace in the same FDB cluster.
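
A sketch of the snapshot-read rule from item 1, assuming versions are stored as 8-byte big-endian integers appended to the page key (an illustrative encoding, not necessarily mvsqlite’s): the newest version at or below V is one reverse range scan with limit 1.

import fdb
import struct

fdb.api_version(630)
db = fdb.open()

NS = b'dbfile'  # illustrative namespace

@fdb.transactional
def read_page_at(tr, page_num: int, snapshot_version: int):
    prefix = NS + struct.pack('>I', page_num)
    begin = prefix
    end = prefix + struct.pack('>Q', snapshot_version + 1)  # covers versions <= V
    # Reverse scan with limit 1 returns the newest version this snapshot may see.
    for kv in tr.get_range(begin, end, limit=1, reverse=True):
        return kv.value
    return None  # the page did not exist at that snapshot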

The production use case: A web application running in a multi-process or multi-instance environment (e.g., multiple Gunicorn workers, multiple Docker containers) can share a single mvsqlite database. Each worker reads from a consistent snapshot without blocking other workers, and writes are serialized. This is impossible with standard SQLite, which allows only one writer at a time and requires every process to share a filesystem with working locks (WAL mode further requires shared memory, i.e., a single host).


4.4 LevelDB’s goleveldb Storage Interface Pattern

The goleveldb storage.Storage interface (used in option-b-leveldb) is a seam designed for exactly this purpose: replacing the local filesystem with a remote storage backend while keeping all of LevelDB’s logic intact.

In practice, option-b-leveldb is analogous to how production systems host embedded databases on shared storage:

Real analog — etcd on block storage: etcd uses bbolt (a B-tree engine) for its data. When running in Kubernetes, etcd’s data directory is a persistent volume (a network-attached block device). The “storage.Storage” for etcd is the filesystem API on top of that network block device.

Real analog — TiKV’s RocksDB on FDB: TiKV, the storage layer behind TiDB, uses RocksDB. There are research prototypes that replace RocksDB’s Env (RocksDB’s equivalent of storage.Storage) with a remote storage backend backed by a distributed KV store.

Real analog — FoundationDB’s own Blob Layer: FDB’s official Blob Layer (in the Python layers repository) uses the exact chunking pattern from option-b-leveldb:

[prefix] [blobId] [chunkNum] → up to 10,000 bytes

The chunk size is tunable. A naive chunk-at-a-time read costs O(n_chunks) round-trips; in practice a single GetRange over the blob’s key range fetches every chunk in one request.
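
A sketch of the pattern with the fdb package’s tuple layer; the subspace name is illustrative, the chunk size is the tunable mentioned above:

import fdb

fdb.api_version(630)
db = fdb.open()

blobs = fdb.Subspace(('blob',))
CHUNK_SIZE = 10_000  # tunable; each chunk must stay under FDB's 100 KB value limit

@fdb.transactional
def write_blob(tr, blob_id: str, data: bytes):
    r = blobs[blob_id].range()
    tr.clear_range(r.start, r.stop)  # drop any previous contents of this blob
    for chunk_num, off in enumerate(range(0, len(data), CHUNK_SIZE)):
        tr[blobs[blob_id].pack((chunk_num,))] = data[off:off + CHUNK_SIZE]

@fdb.transactional
def read_blob(tr, blob_id: str) -> bytes:
    r = blobs[blob_id].range()
    # A single GetRange streams every chunk in key (and therefore chunk) order.
    return b''.join(kv.value for kv in tr.get_range(r.start, r.stop))

Note that this sketch reads and writes a whole blob inside one transaction, so it inherits FDB’s 5-second and 10 MB transaction limits; a real blob layer chunks the work across transactions with a continuation.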

What makes option-b-leveldb interesting architecturally: LevelDB produces 7 types of files (log, sstable, manifest, current, lock, temp, info). Each has different I/O patterns: log files are append-only, sstables are write-once read-many, manifests are written atomically. Our storage.Storage implementation stores all file types as FDB key ranges, but a production implementation would optimize differently per file type (e.g., log files could use FDB’s atomic append via versionstamps rather than read-modify-write chunking).


4.5 Snowflake’s Use of FDB

Snowflake uses FoundationDB as its metadata store — the catalog that tracks table schemas, partition file locations, transaction history, and access controls. The actual table data is stored in cloud object storage (S3/Azure Blob/GCS), but the metadata that makes queries possible lives in FDB.

Why FDB for metadata?

  • Cloud object storage is strongly consistent for object operations but cannot do atomic multi-object updates. Creating a table, writing partition files, and committing a transaction all need to appear atomically or not at all. FDB provides this.
  • Metadata is small but highly contended. FDB’s optimistic concurrency and high throughput handle schema-level operations (DDL, partition registration) without bottlenecks.
  • FDB’s key ordering makes catalog operations efficient: scan all partitions for table T, scan all columns in schema S, scan all transactions in time range [t1, t2].

The encoding (likely, based on public talks):

[catalog] [tenant] [schemaName] [tableName] [partitionId] → partition metadata
[txnLog]  [startVersion]                                  → transaction record
[schema]  [tenant] [schemaName] [tableName]               → column definitions

This is structurally option-a-sqlite (the SQL catalog pattern) extended to a distributed multi-tenant context.
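
To see why the first bullet above matters, here is a hedged sketch of “register a partition” as a single FDB transaction over the speculative layout; all names and the JSON metadata encoding are illustrative:

import fdb
import json

fdb.api_version(630)
db = fdb.open()

catalog = fdb.Subspace(('catalog',))
txn_log = fdb.Subspace(('txnLog',))

@fdb.transactional
def register_partition(tr, tenant, schema, table, partition_id, s3_path, version):
    # Both writes commit atomically: no reader can ever observe the partition
    # pointer without its transaction-log entry, or vice versa. Object storage
    # alone cannot make two objects appear atomically.
    meta = json.dumps({'path': s3_path}).encode()
    tr[catalog.pack((tenant, schema, table, partition_id))] = meta
    log_entry = json.dumps({'op': 'add_partition', 'table': table}).encode()
    tr[txn_log.pack((version,))] = log_entry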

The lesson: FDB is not used because Snowflake needed a storage engine for query data — S3 handles that. FDB is used because catalog metadata needs ACID guarantees that no other component in a cloud-native stack provides cheaply. If you’re building a distributed system that needs a consistent metadata store, FDB is frequently the correct answer.


4.6 Apple CloudKit — FDB at Hundreds of Millions of Devices Scale

CloudKit is Apple’s cross-device sync infrastructure. It stores photo metadata, notes content hashes, contact sync tokens, and app-defined data for third-party iOS apps. As of 2023, it serves hundreds of millions of Apple devices.

The FDB usage in CloudKit:

  • User records (contacts, notes, photo metadata) are stored as fdb-record-layer records
  • Sync tokens are versionstamped keys that encode the exact FDB version at which a sync point was taken
  • Conflict resolution uses FDB’s read-your-writes consistency to detect and merge concurrent edits

Why versionstamps are central to sync:

// Sync token = FDB versionstamp at the time of last sync
// Next sync: "give me all changes since versionstamp V"
// FDB key encoding: versionstamp → big-endian 10-byte key prefix
// GetRange from (V, MAX) returns all records created/modified after V

This is FDB’s versionstamp feature in action: each committed transaction gets a globally monotonic 10-byte version. Records written in that transaction embed the versionstamp in their key, enabling efficient “fetch all changes since X” queries with a single range scan.
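
A sketch of both halves with the fdb Python bindings, assuming API version 520 or later (where set_versionstamped_key takes a 4-byte offset suffix); the raw key prefix is illustrative:

import fdb
import struct

fdb.api_version(630)
db = fdb.open()

PREFIX = b'changes\x00user123\x00'  # illustrative per-user change-feed prefix

@fdb.transactional
def record_change(tr, payload: bytes):
    # Key = PREFIX + 10-byte versionstamp placeholder. The trailing 4-byte
    # little-endian integer tells FDB where to splice the real versionstamp
    # in at commit time, so keys sort in exact commit order.
    key = PREFIX + b'\x00' * 10 + struct.pack('<I', len(PREFIX))
    tr.set_versionstamped_key(key, payload)

@fdb.transactional
def changes_since(tr, token: bytes):
    # token is the 10-byte versionstamp saved at the last sync; scan strictly
    # after it. The last versionstamp returned becomes the next sync token.
    begin = PREFIX + token + b'\x00'
    end = PREFIX + b'\xff' * 11  # past any possible versionstamp
    return [(kv.key[len(PREFIX):], kv.value) for kv in tr.get_range(begin, end)]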

Horizontal partitioning: Each CloudKit user’s data lives in a dedicated FDB keyspace (their userID is a subspace prefix). No two users’ transactions conflict because they operate on disjoint keys. This is how FDB achieves per-user isolation without database-per-user overhead.


Interview Questions

Q: Why would you use FDB over Postgres for metadata in a distributed system like Snowflake?

Postgres is a single-node database. To scale writes, you need sharding or replication, which introduces coordination complexity. FDB is horizontally scalable by design: adding machines increases throughput linearly. For metadata that needs global ACID guarantees across hundreds of nodes, a single-node Postgres (or even a replicated Postgres with synchronous replication) cannot match FDB’s throughput. FDB also runs in the same datacenter/cloud region as the application, eliminating cross-region round-trips for metadata operations.

Q: The Document Layer is deprecated — does that mean FDB isn’t good for document storage?

No. The Document Layer was deprecated because maintaining a MongoDB-compatible query engine (including the query planner, aggregation pipeline, and wire protocol) was expensive. The underlying pattern — document as FDB record, query predicates as range scans over index subspaces — is sound and is exactly what fdb-record-layer implements (with Protobuf instead of BSON). The Document Layer’s death was organizational, not architectural.

Q: How does mvsqlite enable multiple concurrent writers to the same SQLite database?

Standard SQLite uses POSIX file locks (or Windows file locks) to serialize writes. One writer holds a RESERVED lock; others wait. mvsqlite replaces file locks with FDB transactions on a “write lock” key. The writer that wins the FDB transaction proceeds; others retry. Because FDB’s transaction semantics are stronger than POSIX file locks (FDB provides serializable isolation; POSIX locks only prevent concurrent access), mvsqlite actually provides stronger guarantees than standard SQLite in a networked environment.
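
A hedged sketch of that lock pattern with the fdb Python bindings (the key name and token scheme are illustrative):

import fdb
import os

fdb.api_version(630)
db = fdb.open()

LOCK_KEY = b'\xfemvsqlite/write_lock'  # illustrative lock key
OWNER = os.urandom(16)                 # this process's lock token

@fdb.transactional
def try_acquire(tr) -> bool:
    # Read-then-write on a single key. If two processes race, FDB's conflict
    # detection lets only one commit; the loser's retry then sees the lock held.
    if tr[LOCK_KEY].present():
        return False
    tr[LOCK_KEY] = OWNER
    return True

@fdb.transactional
def release(tr):
    if tr[LOCK_KEY] == OWNER:  # only the holder may release
        del tr[LOCK_KEY]

A real implementation also needs a lease or heartbeat so that a crashed holder does not block writers forever.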

Q: What is a “continuation” in fdb-record-layer and why does every production FDB layer need one?

FDB limits transaction duration to 5 seconds. Any operation that might take longer — full-table scan, bulk export, index rebuild — must be broken into multiple transactions. A continuation is a serialized cursor: it records which keys have been processed so the next transaction can resume from the right place. Without continuations, a query that touches more records than fit in one 5-second window simply cannot be executed. In our labs, we ignore this constraint — all queries run in one transaction, which fails with a transaction_too_old error if the dataset is large enough. Adding continuation support is the single most important production-readiness improvement for any of these layers.