The Layer Concept
The layer concept is the central architectural idea of FoundationDB. FDB’s API surface is tiny: Get, Set, Clear, ClearRange, GetRange, and transactions. All the richness of a “database” — tables, indexes, files, records, schemas — must be built from these six primitives. This chapter explains exactly how.
3.1 Encoding Is Everything
FDB stores byte strings. There is no concept of integer columns, string columns, or JSON fields at the FDB level. You must encode your data into bytes and decode it back out. The art of building a layer is entirely the art of encoding — turning rich data structures into sequences of (key, value) pairs such that:
- Unambiguous decoding: given a raw key-value pair, you can reconstruct the original data
- Co-location: related data is adjacent in key order, enabling efficient range scans
- Minimal write amplification: updating one piece of data doesn’t force rewrites of unrelated data
- Sort-preserving encoding: values you want to range-query sort correctly under byte comparison
All four properties must hold simultaneously. Let’s examine each.
Property 1: Unambiguous Decoding
Suppose you store two fields, name and city, in a single key:
key: "user:" + name + ":" + city
If name = "Alice:Paris" and city = "London", the key becomes "user:Alice:Paris:London", which looks identical to name = "Alice", city = "Paris:London". Ambiguous. Solution: choose a separator that cannot appear in field values, or escape it.
FDB’s Tuple layer solves this by escaping \x00 inside strings as \x00\xFF, then using bare \x00 as a separator. This makes encoding unambiguous for arbitrary byte strings.
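A minimal sketch of that escaping (a standalone helper, not from the repo; the repo’s own layers simply assume field values contain no null bytes):

// escapeField makes bare 0x00 usable as a separator: every 0x00 inside
// the value is escaped as 0x00 0xFF, and a final bare 0x00 terminates it.
func escapeField(field []byte) []byte {
    out := make([]byte, 0, len(field)+1)
    for _, c := range field {
        out = append(out, c)
        if c == 0x00 {
            out = append(out, 0xFF) // escaped null: 0x00 0xFF
        }
    }
    return append(out, 0x00) // terminator: bare 0x00
}

The decoder reverses this: on seeing 0x00 it peeks at the next byte; 0xFF means a literal null inside the field, anything else means the field has ended.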
Property 2: Co-location
Imagine storing a user’s profile fields as separate keys:
"user:alice:name" → "Alice"
"user:alice:email" → "alice@example.com"
"user:alice:city" → "Paris"
"user:bob:name" → "Bob"
Reading Alice’s full profile is one GetRange("user:alice:", "user:alice;") call — ";" is the byte immediately after ":", so the range covers exactly the keys with prefix "user:alice:". All three keys are adjacent in FDB’s sorted key space. One round-trip, no matter how many fields she has.
Contrast with storing all users in one key: "users" → msgpack([alice, bob, ...]). Reading one user requires reading the entire blob. Updating one user requires reading the entire blob, deserializing, modifying, re-serializing, and writing back. Terrible co-location.
Co-location means: keys for the same logical entity should share a prefix, so they sort together and can be read or deleted in one range operation.
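A sketch of what that looks like with the Go bindings (readProfile and the key scheme are illustrative, not from the repo):

import "github.com/apple/foundationdb/bindings/go/src/fdb"

// readProfile fetches every field under "user:<name>:" in one range scan.
// ';' is the byte after ':', so [begin, end) covers exactly that prefix.
func readProfile(rt fdb.ReadTransaction, name string) (map[string]string, error) {
    begin := fdb.Key("user:" + name + ":")
    end := fdb.Key("user:" + name + ";")
    kvs, err := rt.GetRange(fdb.KeyRange{Begin: begin, End: end},
        fdb.RangeOptions{}).GetSliceWithError()
    if err != nil {
        return nil, err
    }
    profile := make(map[string]string, len(kvs))
    for _, kv := range kvs {
        profile[string(kv.Key[len(begin):])] = string(kv.Value) // field name → value
    }
    return profile, nil
}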
Property 3: Minimal Write Amplification
Consider a user with 50 indexed fields. If the user moves from Paris to Tokyo, only the keys involving the city field change: the old city index entry (cleared), the new city index entry (set), and the record’s own city entry (updated). Every other index entry is untouched.
Designing your encoding so that “update one field” touches only the keys for that field is the goal. If your encoding stores the entire record in one value, every field update touches the entire value — wasteful when records are large.
Property 4: Sort-Preserving Encoding
If you want to range-query age BETWEEN 25 AND 40, the encoded age values must sort in the same order as the numeric values.
String encoding fails for integers:
"9" > "30" (byte comparison: '9' = 0x39 > '3' = 0x33)
→ sorted order: "1", "10", "100", "2", "20", "3", "30", "9" — wrong
Signed big-endian fails across sign boundary:
-1 → 0xFFFFFFFFFFFFFFFF (sorts AFTER all positive numbers)
The correct encoding (used in every layer in this repo):
// encoding.go (option-c-record-layer)
import "encoding/binary"

func encodeInt64(x int64) []byte {
    b := make([]byte, 8)
    // Flip the sign bit. Maps the signed range to the unsigned range,
    // preserving order:
    //   -2^63  → 0x0000...0000 (smallest unsigned)
    //   -1     → 0x7FFF...FFFF
    //    0     → 0x8000...0000
    //    1     → 0x8000...0001
    //   2^63-1 → 0xFFFF...FFFF (largest unsigned)
    binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
    return b
}
This sign-flip encoding is the standard technique for order-preserving signed integers in byte-ordered stores; Apache HBase’s OrderedBytes encoding uses it, and FDB’s own Tuple layer applies the same sign-bit flip to its float and double encodings (its integer encoding differs; see the table in section 3.3).
Why XOR with 1<<63 is the same as adding 2^63: For any signed 64-bit integer x, uint64(x) XOR (1<<63) == uint64(x) + 2^63 when computed in 64-bit unsigned arithmetic. XOR is just the bit-manipulation shorthand for the offset encoding.
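For completeness, a sketch of the inverse (decodeInt64 is assumed; it simply mirrors encodeInt64 above):

// decodeInt64 undoes the sign-bit flip from encodeInt64.
func decodeInt64(b []byte) int64 {
    return int64(binary.BigEndian.Uint64(b) ^ (1 << 63))
}

// Byte order now agrees with numeric order, e.g.:
//   bytes.Compare(encodeInt64(-1), encodeInt64(0))  == -1
//   bytes.Compare(encodeInt64(9),  encodeInt64(30)) == -1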
3.2 The Subspace Pattern
The single most important pattern in any FDB layer is the subspace: a byte prefix that namespaces all keys for a logical entity.
// encoding.go (option-a-leveldb)
type Subspace struct {
    prefix []byte
}

func (s Subspace) Pack(userKey []byte) fdb.Key {
    out := make([]byte, 0, len(s.prefix)+1+len(userKey))
    out = append(out, s.prefix...)
    out = append(out, 0x00) // separator byte
    out = append(out, userKey...)
    return fdb.Key(out)
}

func (s Subspace) Range() fdb.KeyRange {
    begin := append([]byte{}, s.prefix...)
    begin = append(begin, 0x00)
    end := append([]byte{}, s.prefix...)
    end = append(end, 0x01) // one byte past the separator
    return fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)}
}
Why the separator byte?
Without a separator, subspaces "foo" and "foobar" collide: the key "foobar\x00somekey" has "foo" as a prefix but was packed by the "foobar" subspace. The separator \x00 prevents this: "foo\x00" is never a prefix of "foobar\x00" because their 4th bytes differ (\x00 vs b).
The range trick — \x00 → \x01:
Incrementing the separator byte from \x00 to \x01 creates a tight upper bound. Every key in the "demo\x00" subspace sorts before "demo\x01". So GetRange("demo\x00", "demo\x01") returns exactly the keys for the “demo” namespace — no more, no less.
Subspace composition (nested subspaces):
You can build hierarchical key spaces by composing prefixes:
users subspace: "app\x00users\x00"
orders subspace: "app\x00orders\x00"
indexes subspace: "app\x00idx\x00"
city index: "app\x00idx\x00city\x00"
value "Paris": "app\x00idx\x00city\x00Paris\x00"
Each prefix cleanly separates a logical domain. ClearRange("app\x00users\x00", "app\x00users\x01") atomically deletes all users. ClearRange("app\x00idx\x00city\x00Paris\x00", "app\x00idx\x00city\x00Paris\x01") atomically deletes all city=“Paris” index entries.
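Composition can be expressed directly on the Subspace type from above; a sketch of a hypothetical nesting helper:

// Sub returns a child subspace whose prefix is parent + 0x00 + name.
// Subspace{prefix: []byte("app")}.Sub("idx").Sub("city") produces the
// prefix "app\x00idx\x00city"; Pack then appends the final separator.
func (s Subspace) Sub(name string) Subspace {
    p := append(append([]byte{}, s.prefix...), 0x00)
    return Subspace{prefix: append(p, name...)}
}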
All five layers in this repo use this pattern:
| Layer | Subspace layout |
|---|---|
| option-a-leveldb | ns + 0x00 + userKey |
| option-a-sqlite | ns + 0x00 + tableName (catalog), ns + 0x01 + tableName + 0x00 + pk (rows) |
| option-b-leveldb | ns + 0x01 + fileType + fileNum (meta), ns + 0x02 + fileType + fileNum + chunkNum (data) |
| option-b-sqlite | ns + 0x00 (size key), ns + 0x01 + pageNum (pages) |
| option-c-record-layer | ns + 0x00 + schema + 0x00 + pk (records), ns + 0x01 + schema + 0x00 + field + 0x00 + value + 0x00 + pk (indexes) |
3.3 The Tuple Layer — Industry Standard Encoding
The official FDB client libraries (Python, Java, Go, Ruby) provide a Tuple encoding that standardizes the above patterns. It encodes lists of typed values into byte strings that sort correctly under FDB’s lexicographic comparison.
Tuple type codes and sort behavior:
| Type | Prefix byte | Encoding | Sort behavior |
|---|---|---|---|
| None / nil | 0x00 | just the prefix | always smallest |
| Byte string | 0x01 | bytes + \x00 (with \x00 → \x00\xFF) | lexicographic |
| Unicode string | 0x02 | UTF-8 + \x00 (same escaping) | lexicographic |
| Nested tuple | 0x05 | nested encoding + \x00 | element-by-element |
| Integer 0 | 0x14 | just the prefix | numeric zero |
| Positive integer | 0x15–0x1C | 1–8 bytes big-endian | numerically ascending |
| Negative integer | 0x0C–0x13 | 1–8 bytes, bitwise NOT | numerically ascending |
| Float (32-bit) | 0x20 | 4 bytes, sign-bit-flipped IEEE 754 | numerically ascending |
| Double (64-bit) | 0x21 | 8 bytes, sign-bit-flipped IEEE 754 | numerically ascending |
| Boolean false | 0x26 | just the prefix | false < true |
| Boolean true | 0x27 | just the prefix | false < true |
| UUID | 0x30 | 16 bytes | UUID byte order |
| Versionstamp | 0x32/0x33 | 10 bytes (0x32) or 12 bytes (0x33) | commit version ordering |
Why the layers in this repo don’t use the Tuple layer:
Using the official Tuple layer would be correct and production-appropriate. The layers here use simpler hand-rolled encodings so that each file is readable without understanding the Tuple specification. When reading the source, you can see append(out, 0x00) instead of tuple.Pack(tuple.Tuple{schema, pk}) — the intent is obvious.
For production use, adopt the Tuple layer. It handles edge cases (null bytes in strings, arbitrarily large integers, nested structures) that the hand-rolled encodings in this repo skip.
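For reference, packing and unpacking with the official Go binding look like this:

import "github.com/apple/foundationdb/bindings/go/src/fdb/tuple"

// Typed values in, order-preserving bytes out.
key := tuple.Tuple{"idx", "city", "Paris", int64(42)}.Pack()

// Unpack recovers the typed elements.
elems, err := tuple.Unpack(key)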
3.4 Atomicity as a Design Axiom
The most important design discipline in FDB layer development: every logical operation that must be atomic must live inside one transaction.
This sounds obvious but is violated constantly by developers who treat FDB as a fancy cache. Let’s examine the consequences of each atomicity violation:
Case 1: Split Insert (SQL Layer)
// WRONG — two transactions for one logical insert
rowid := allocateRowid() // Transaction 1: increment counter
writeRow(schema, rowid, values) // Transaction 2: write row data
// If crash between T1 and T2:
// → counter incremented, row never written
// → counter is now permanently incorrect (rowid was "consumed")
// → future inserts skip this rowid
The correct approach (used in option-a-sqlite):
// RIGHT — one transaction
db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    // Read and increment the rowid counter
    raw, err := tr.Get(rowidKey).Get() // Get returns ([]byte, error)
    if err != nil {
        return nil, err
    }
    rowid := decode(raw) + 1 // decode: inverse of the layer's encode helper
    tr.Set(rowidKey, encode(rowid))
    // Write the row under the new rowid, in the same transaction
    tr.Set(rowKey(schema, rowid), encode(values))
    return nil, nil
})
Case 2: Split Index Update (Record Layer)
// WRONG — separate transactions for record and index
tr1.Set(recordKey, newRecord) // T1: update record
tr2.Clear(oldIndexKey) // T2: remove old index
tr2.Set(newIndexKey, nil) // T2: add new index
// If crash between T1 and T2:
// → record shows new value
// → index still points to old value
// → LookupByIndex returns stale data forever
// → This is a phantom index entry — silent corruption
The correct approach (used in option-c-record-layer):
// RIGHT — all in one transaction
s.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
    // Read old record to know what indexes to remove
    prevBytes, _ := tr.Get(recordKey(s.ns, schema, pk)).Get()
    if prevBytes != nil {
        var prev Record
        msgpackUnmarshal(prevBytes, &prev)
        for _, f := range sc.Indexes {
            if v, ok := prev[f]; ok {
                tr.Clear(indexKey(s.ns, schema, f, v, pk))
            }
        }
    }
    // Write new record and new indexes
    tr.Set(recordKey(s.ns, schema, pk), encoded)
    for _, f := range sc.Indexes {
        if v, ok := rec[f]; ok {
            tr.Set(indexKey(s.ns, schema, f, v, pk), nil)
        }
    }
    return nil, nil
})
Case 3: Split File Rename (LevelDB Storage Layer)
// WRONG — separate transactions for copy and delete
copyFileData(newFd, oldFd) // T1: copy all chunks
deleteFile(oldFd) // T2: clear old chunks
// If crash between T1 and T2:
// → both files exist simultaneously
// → LevelDB manifest references old name
// → on restart, LevelDB sees duplicate SSTables → corruption
The correct approach (used in option-b-leveldb):
// RIGHT — one transaction for atomic rename
s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    kvs, _ := tr.GetRange(s.dataRange(oldfd), ...).GetSliceWithError()
    tr.ClearRange(s.dataRange(oldfd)) // clear old
    tr.Clear(s.metaKey(oldfd))
    for _, kv := range kvs { // write new
        tr.Set(translateKey(kv.Key, oldfd, newfd), kv.Value)
    }
    tr.Set(s.metaKey(newfd), metaBytes)
    return nil, nil
})
The pattern: read old state → clear old keys → write new keys — in one transaction. Crash before commit: the transaction never applies, and the old state remains fully intact. Crash after commit: the new state is fully visible. No intermediate state is ever observable.
3.5 Key Space Design Patterns and Anti-Patterns
Pattern: Sequential Integer Keys
key: ns + 0x00 + encode_uint64(rowid)
Values: row data. Row id is monotonically increasing. New rows always insert at the end of the range (append-like behavior in the key space). Range scans return rows in insertion order.
Production consideration: In a write-heavy workload, all inserts land on the same key space boundary, creating a hotspot on the FDB shard holding the highest rowid. FDB will split this shard, but the split lags the write rate. For very high insert rates, use a randomized prefix (hash of rowid) to distribute writes across shards.
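A sketch of the randomized-prefix mitigation (shardedRowKey is hypothetical; note it deliberately trades away in-order scans, since a scan must now fan out across all 256 hash prefixes):

import (
    "encoding/binary"
    "hash/fnv"
)

// shardedRowKey prepends a one-byte hash of the rowid so that sequential
// inserts land on different shards instead of one hot boundary.
func shardedRowKey(ns []byte, rowid uint64) []byte {
    b := make([]byte, 8)
    binary.BigEndian.PutUint64(b, rowid)
    h := fnv.New32a()
    h.Write(b)
    return append(append(append([]byte{}, ns...), 0x00, byte(h.Sum32())), b...)
}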
Pattern: Reverse Timestamp Keys
key: ns + 0x00 + encode_uint64(MaxUint64 - timestamp)
Most recent entries sort first. A GetRange with no bounds returns entries in reverse chronological order. Useful for “recent events”, “activity feeds”, “audit logs”.
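A sketch (reverseTimestampKey is hypothetical):

import (
    "encoding/binary"
    "math"
)

// reverseTimestampKey encodes MaxUint64-ts big-endian, so newer
// timestamps produce smaller keys and sort first.
func reverseTimestampKey(ns []byte, ts uint64) []byte {
    b := make([]byte, 8)
    binary.BigEndian.PutUint64(b, math.MaxUint64-ts)
    return append(append(append([]byte{}, ns...), 0x00), b...)
}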
Pattern: Composite Keys for Multi-Dimensional Queries
key: ns + encode(field1) + 0x00 + encode(field2) + 0x00 + pk
Example: city + 0x00 + age + 0x00 + pk enables range queries on city AND age:
- All users in Paris sorted by age:
  GetRange(prefix + "Paris\x00", prefix + "Paris\x01")
- All users in Paris aged 25–40:
  GetRange(prefix + "Paris\x00" + encode(25), prefix + "Paris\x00" + encode(40) + "\xff")

This is the compound index pattern, and it is essentially how MySQL’s multi-column indexes work under the hood.
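A sketch of building those bounds in Go (parisAged is hypothetical; encodeInt64 is the sort-preserving encoder from section 3.1):

// parisAged builds the [begin, end) bounds for city == "Paris" AND
// 25 <= age <= 40. The trailing 0xFF makes the upper bound inclusive
// for age 40, since every real key continues with a 0x00 separator.
func parisAged(prefix []byte) fdb.KeyRange {
    cityPrefix := append(append(append([]byte{}, prefix...), "Paris"...), 0x00)
    begin := append(append([]byte{}, cityPrefix...), encodeInt64(25)...)
    end := append(append(append([]byte{}, cityPrefix...), encodeInt64(40)...), 0xFF)
    return fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)}
}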
Anti-Pattern: Large Values
Storing all of a user’s data in one 50 KB value:
key: ns + pk
value: msgpack({name, email, all_orders: [...], all_addresses: [...], ...})
Problems:
- Reading one field requires reading and deserializing the entire value (O(size))
- Updating one field rewrites the entire value (O(size) write amplification)
- The value may approach FDB’s 100 KB limit
- Hot users (many orders) create unbalanced value sizes
Better: normalize into separate keys per logical entity:
ns + 0x00 + pk → profile (name, email)
ns + 0x01 + pk + orderid → individual orders
ns + 0x02 + pk + addrid → individual addresses
Anti-Pattern: Hot Key Contention
key: "global_counter" ← every write touches this
All transactions that increment a counter conflict on this key. At high throughput, this becomes a serialization bottleneck.
Solutions:
- FDB atomic ops: tr.Add("global_counter", 1) is atomic, causes no conflicts, and needs no retry (see the sketch below)
- Sharded counters: split into N shard keys; read by summing all shards
- Approximate counting: use a HyperLogLog or count-min sketch in a single key
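With the Go binding, the atomic add looks like this (counterKey is assumed; the operand is a little-endian encoded integer):

// Atomic ADD: applied server-side at commit time. The transaction never
// reads the counter, so it can never conflict on it.
delta := make([]byte, 8)
binary.LittleEndian.PutUint64(delta, 1)
tr.Add(counterKey, delta)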
3.6 The FDB Blob Layer Pattern
FDB’s value size limit (100 KB) is not a fundamental limitation — it’s a performance tuning decision. The solution is chunking, used in option-b-leveldb:
key: ns + 0x01 + fileType + fileNum + chunkNum → up to 65,536 bytes
To read a file: range-scan all chunks (sorted by chunkNum due to big-endian encoding), concatenate values.
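A sketch of that read path (readFile is illustrative, not the repo’s actual function):

// readFile range-scans every chunk of one file and concatenates the
// values; big-endian chunkNum encoding guarantees they arrive in order.
func readFile(rt fdb.ReadTransaction, chunks fdb.KeyRange) ([]byte, error) {
    kvs, err := rt.GetRange(chunks, fdb.RangeOptions{}).GetSliceWithError()
    if err != nil {
        return nil, err
    }
    var out []byte
    for _, kv := range kvs {
        out = append(out, kv.Value...)
    }
    return out, nil
}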
This pattern generalizes to any large object:
- Files: split by byte range (64 KB chunks)
- Images: split by byte range or by tile
- Blobs: split by arbitrary chunk size
- Large JSON documents: split by logical sub-object
FDB’s official Blob Layer (in the FDB foundationdb-python-layers repository) implements this exact pattern. The same chunking idea underpins large-object storage in many systems built on distributed key-value infrastructure.
3.7 Schema Evolution — How Layers Survive Change
A layer’s encoding is its schema. Changing the encoding without migrating old data is a common source of production incidents. Strategies:
1. Prefix-versioned keys:
ns + version_byte + ...rest
New code writes version = 0x02 keys; old code writes version = 0x01 keys. A background migration reads all 0x01 keys, converts them, and writes 0x02 keys. Read code handles both versions during the transition (see the sketch after this list).
2. Additive-only changes: If your value is msgpack/Protobuf, adding new optional fields is backward compatible. Old code ignores unknown fields. New code sets defaults for missing fields. This is safe and zero-downtime.
3. Dual writes during migration: Write to old AND new key format simultaneously during a migration window. Verify correctness. Backfill old data. Cut over to new format. Remove old code path. Zero-downtime migration.
4. Never reuse key prefixes: Deleted keys leave no trace in FDB (there’s no tombstone value). But if you reuse a deleted subspace’s prefix for a new purpose, you risk stale data colliding with new data. Always use new prefixes for new logical entities.
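A sketch of a version-tolerant reader for strategy 1 (Record, decodeV1, and decodeV2 are hypothetical):

import "fmt"

// decodeByVersion routes a key to the right decoder based on the version
// byte that immediately follows the namespace prefix.
func decodeByVersion(ns, key, value []byte) (Record, error) {
    rest := key[len(ns):]
    switch rest[0] {
    case 0x01:
        return decodeV1(rest[1:], value) // legacy layout, readable during migration
    case 0x02:
        return decodeV2(rest[1:], value) // current layout
    default:
        return nil, fmt.Errorf("unknown key version 0x%02x", rest[0])
    }
}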
Interview Questions
Q: Why is key ordering so important in FDB — can’t you just use a hash map?
A hash map gives O(1) point lookups but cannot efficiently answer range queries (“all users aged 25–40”), prefix queries (“all keys starting with ‘user:42:’”), or sorted iteration. Ordered storage enables all three with a single range scan. The entire record layer concept — where secondary indexes are just cleverly prefixed keys — relies on ordering. Without ordering, every query would require a full scan.
Q: What is a “subspace” in FDB and why is it important?
A subspace is a key prefix that namespaces a set of keys. It provides namespace isolation (keys in subspace A cannot collide with keys in subspace B), atomic deletion (one ClearRange deletes all keys in a subspace), and efficient range queries (one GetRange over the subspace prefix reads all related data). Every production FDB layer uses subspaces as the fundamental organizational unit.
Q: How do you handle null bytes inside keys in FDB?
FDB keys can contain any bytes, but if you use \x00 as a separator, null bytes inside field values create ambiguity. The official solution is the Tuple encoding: escape \x00 inside string values as \x00\xFF, and use bare \x00 as the separator. The decoder sees \x00\xFF and knows it’s an escaped null; bare \x00 ends the field. The layers in this repo use simple encodings that assume field values don’t contain null bytes — acceptable for a teaching implementation, not for production with arbitrary user data.
Q: Why is the PutRecord function in the record layer more expensive than a simple Set?
PutRecord must read the old record before writing the new one, so it can remove stale index entries. This adds one Get call to every write — a full round-trip to a Storage Server. The cost is necessary: without reading the old record, you can’t know which old index entries to remove. Production systems amortize this by batching multiple record updates in one transaction, or by using techniques like “shadow documents” (keeping old field values in the record itself to avoid the read).
Q: What is the difference between Transact and ReadTransact in FDB?
ReadTransact opens a read-only transaction: no conflict tracking, no commit round-trip, cheaper. It’s for operations that only read data. Transact opens a read-write transaction: tracks the read set for conflict detection, buffers writes, commits at the end. Use ReadTransact whenever you’re not writing anything — it’s faster and imposes less load on the cluster.