
The Layer Concept

The layer concept is the central architectural idea of FoundationDB. FDB’s API surface is tiny: Get, Set, Clear, ClearRange, GetRange, and transactions. All the richness of a “database” — tables, indexes, files, records, schemas — must be built from these six primitives. This chapter explains exactly how.


3.1 Encoding Is Everything

FDB stores byte strings. There is no concept of integer columns, string columns, or JSON fields at the FDB level. You must encode your data into bytes and decode it back out. The art of building a layer is entirely the art of encoding — turning rich data structures into sequences of (key, value) pairs such that:

  1. Unambiguous decoding: given a raw key-value pair, you can reconstruct the original data
  2. Co-location: related data is adjacent in key order, enabling efficient range scans
  3. Minimal write amplification: updating one piece of data doesn’t force rewrites of unrelated data
  4. Sort-preserving encoding: values you want to range-query sort correctly under byte comparison

All four properties must hold simultaneously. Let’s examine each.

Property 1: Unambiguous Decoding

Suppose you store two fields, name and city, in a single key:

key: "user:" + name + ":" + city

If name = "Alice:Paris" and city = "London", the key becomes "user:Alice:Paris:London", which looks identical to name = "Alice", city = "Paris:London". Ambiguous. Solution: choose a separator that cannot appear in field values, or escape it.

FDB’s Tuple layer solves this by escaping \x00 inside strings as \x00\xFF, then using bare \x00 as a separator. This makes encoding unambiguous for arbitrary byte strings.

Property 2: Co-location

Imagine storing a user’s profile fields as separate keys:

"user:alice:name"    → "Alice"
"user:alice:email"   → "alice@example.com"
"user:alice:city"    → "Paris"
"user:bob:name"      → "Bob"

Reading Alice’s full profile = one GetRange("user:alice:", "user:alice;") call — ';' is the byte after ':', so the range covers exactly the keys with the "user:alice:" prefix. All three keys are adjacent in FDB’s sorted key space. One round-trip, no matter how many fields she has.

Contrast with storing all users in one key: "users" → msgpack([alice, bob, ...]). Reading one user requires reading the entire blob. Updating one user requires reading the entire blob, deserializing, modifying, re-serializing, and writing back. Terrible co-location.

Co-location means: keys for the same logical entity should share a prefix, so they sort together and can be read or deleted in one range operation.

Property 3: Minimal Write Amplification

Consider a user with 50 indexed fields. If a user moves from Paris to Tokyo, only two keys change: the old city index entry (cleared) and the new city index entry (set). The user’s record and all other index entries are untouched.

Designing your encoding so that “update one field” touches only the keys for that field is the goal. If your encoding stores the entire record in one value, every field update touches the entire value — wasteful when records are large.

Property 4: Sort-Preserving Encoding

If you want to range-query age BETWEEN 25 AND 40, the encoded age values must sort in the same order as the numeric values.

String encoding fails for integers:

"9"  > "30"  (byte comparison: '9' = 0x39 > '3' = 0x33)
→ sorted order: "1", "10", "100", "2", "20", "3", "30", "9" — wrong

Signed big-endian fails across sign boundary:

-1 → 0xFFFFFFFFFFFFFFFF  (sorts AFTER all positive numbers)

The correct encoding (used in every layer in this repo):

// encoding.go (option-c-record-layer)
import "encoding/binary"

func encodeInt64(x int64) []byte {
    b := make([]byte, 8)
    // Flip the sign bit. Maps the signed range onto the unsigned range, preserving order.
    // -2^63   → 0x0000...0000 (smallest unsigned)
    // -1      → 0x7FFF...FFFF
    //  0      → 0x8000...0000
    //  1      → 0x8000...0001
    //  2^63-1 → 0xFFFF...FFFF (largest unsigned)
    binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
    return b
}

This offset (biased) encoding is a standard technique for sort-preserving signed integers in ordered key-value systems; Apache HBase’s OrderedBytes encoding uses the same sign-bit flip for fixed-width integers. FDB’s own Tuple layer achieves the same ordering with a different, variable-length scheme (see the table in 3.3).

Why XOR with 1<<63 is the same as adding 2^63: for any signed 64-bit integer x, uint64(x) XOR (1<<63) == uint64(x) + 2^63 when computed modulo 2^64 (i.e., in wrapping unsigned arithmetic). XOR with the top bit is just the bit-manipulation shorthand for this offset encoding.


3.2 The Subspace Pattern

The single most important pattern in any FDB layer is the subspace: a byte prefix that namespaces all keys for a logical entity.

// encoding.go (option-a-leveldb)
type Subspace struct {
    prefix []byte
}

func (s Subspace) Pack(userKey []byte) fdb.Key {
    out := make([]byte, 0, len(s.prefix)+1+len(userKey))
    out = append(out, s.prefix...)
    out = append(out, 0x00)        // separator byte
    out = append(out, userKey...)
    return fdb.Key(out)
}

func (s Subspace) Range() fdb.KeyRange {
    begin := append([]byte{}, s.prefix...)
    begin = append(begin, 0x00)
    end := append([]byte{}, s.prefix...)
    end = append(end, 0x01)  // one byte past the separator
    return fdb.KeyRange{Begin: fdb.Key(begin), End: fdb.Key(end)}
}

Why the separator byte?

Without a separator, subspaces "foo" and "foobar" collide: the key "foobar\x00somekey" has "foo" as a prefix but was packed by the "foobar" subspace. The separator \x00 prevents this: "foo\x00" is never a prefix of "foobar\x00" because their 4th bytes differ (\x00 vs b).

The range trick — \x00\x01:

Incrementing the separator byte from \x00 to \x01 creates a tight upper bound. Every key in the "demo\x00" subspace sorts before "demo\x01". So GetRange("demo\x00", "demo\x01") returns exactly the keys for the “demo” namespace — no more, no less.

Subspace composition (nested subspaces):

You can build hierarchical key spaces by composing prefixes:

users subspace:    "app\x00users\x00"
orders subspace:   "app\x00orders\x00"
indexes subspace:  "app\x00idx\x00"
  city index:      "app\x00idx\x00city\x00"
    value "Paris": "app\x00idx\x00city\x00Paris\x00"

Each prefix cleanly separates a logical domain. ClearRange("app\x00users\x00", "app\x00users\x01") atomically deletes all users. ClearRange("app\x00idx\x00city\x00Paris\x00", "app\x00idx\x00city\x00Paris\x01") atomically deletes all city=“Paris” index entries.

All five layers in this repo use this pattern:

Layer                   Subspace layout
option-a-leveldb        ns + 0x00 + userKey
option-a-sqlite         ns + 0x00 + tableName (catalog), ns + 0x01 + tableName + 0x00 + pk (rows)
option-b-leveldb        ns + 0x01 + fileType + fileNum (meta), ns + 0x02 + fileType + fileNum + chunkNum (data)
option-b-sqlite         ns + 0x00 (size key), ns + 0x01 + pageNum (pages)
option-c-record-layer   ns + 0x00 + schema + 0x00 + pk (records), ns + 0x01 + schema + 0x00 + field + 0x00 + value + 0x00 + pk (indexes)

3.3 The Tuple Layer — Industry Standard Encoding

The official FDB client libraries (Python, Java, Go, Ruby) provide a Tuple encoding that standardizes the above patterns. It encodes lists of typed values into byte strings that sort correctly under FDB’s lexicographic comparison.

Tuple type codes and sort behavior:

Type              Prefix byte(s)  Encoding                                             Sort behavior
None / nil        0x00            just the prefix                                      always smallest
Byte string       0x01            bytes + \x00 terminator (embedded \x00 → \x00\xFF)   lexicographic
Unicode string    0x02            UTF-8 + \x00 terminator (same escaping)              lexicographic
Nested tuple      0x05            nested encoding + \x00                               element-by-element
Integer 0         0x14            just the prefix                                      numeric zero
Positive integer  0x15–0x1C       1–8 bytes, big-endian                                numerically ascending
Negative integer  0x0C–0x13       1–8 bytes, bitwise NOT                               numerically ascending
Float (32-bit)    0x20            4 bytes, sign-bit-flipped IEEE 754                   numerically ascending
Double (64-bit)   0x21            8 bytes, sign-bit-flipped IEEE 754                   numerically ascending
Boolean false     0x26            just the prefix                                      false < true
Boolean true      0x27            just the prefix                                      false < true
UUID              0x30            16 bytes                                             UUID byte order
Versionstamp      0x32/0x33       12 bytes                                             commit version ordering

Why the layers in this repo don’t use the Tuple layer:

Using the official Tuple layer would be correct and production-appropriate. The layers here use simpler hand-rolled encodings so that each file is readable without understanding the Tuple specification. When reading the source, you can see append(out, 0x00) instead of tuple.Pack(tuple.Tuple{schema, pk}) — the intent is obvious.

For production use, adopt the Tuple layer. It handles edge cases (null bytes in strings, arbitrarily large integers, nested structures) that the hand-rolled encodings in this repo skip.


3.4 Atomicity as a Design Axiom

The most important design discipline in FDB layer development: every logical operation that must be atomic must live inside one transaction.

This sounds obvious but is violated constantly by developers who treat FDB as a fancy cache. Let’s examine the consequences of each atomicity violation:

Case 1: Split Insert (SQL Layer)

// WRONG — two transactions for one logical insert
rowid := allocateRowid()           // Transaction 1: increment counter
writeRow(schema, rowid, values)    // Transaction 2: write row data

// If crash between T1 and T2:
// → counter incremented, row never written
// → counter is now permanently incorrect (rowid was "consumed")
// → future inserts skip this rowid

The correct approach (used in option-a-sqlite):

// RIGHT — one transaction
db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    // Read and increment the rowid counter. Get returns raw bytes,
    // so decode before arithmetic (decodeUint64/encode are the layer's helpers).
    rowid := decodeUint64(tr.Get(rowidKey).MustGet()) + 1
    tr.Set(rowidKey, encode(rowid))
    // Write the row in the same transaction
    tr.Set(rowKey(schema, rowid), encode(values))
    return nil, nil
})

Case 2: Split Index Update (Record Layer)

// WRONG — separate transactions for record and index
tr1.Set(recordKey, newRecord)        // T1: update record
tr2.Clear(oldIndexKey)               // T2: remove old index
tr2.Set(newIndexKey, nil)            // T2: add new index

// If crash between T1 and T2:
// → record shows new value
// → index still points to old value
// → LookupByIndex returns stale data forever
// → This is a phantom index entry — silent corruption

The correct approach (used in option-c-record-layer):

// RIGHT — all in one transaction
s.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
    // Read old record to know what indexes to remove
    prevBytes, _ := tr.Get(recordKey(s.ns, schema, pk)).Get()
    if prevBytes != nil {
        var prev Record
        msgpackUnmarshal(prevBytes, &prev)
        for _, f := range sc.Indexes {
            if v, ok := prev[f]; ok {
                tr.Clear(indexKey(s.ns, schema, f, v, pk))
            }
        }
    }
    // Write new record and new indexes
    tr.Set(recordKey(s.ns, schema, pk), encoded)
    for _, f := range sc.Indexes {
        if v, ok := rec[f]; ok {
            tr.Set(indexKey(s.ns, schema, f, v, pk), nil)
        }
    }
    return nil, nil
})

Case 3: Split File Rename (LevelDB Storage Layer)

// WRONG — separate transactions for copy and delete
copyFileData(newFd, oldFd)    // T1: copy all chunks
deleteFile(oldFd)              // T2: clear old chunks

// If crash between T1 and T2:
// → both files exist simultaneously
// → LevelDB manifest references old name
// → on restart, LevelDB sees duplicate SSTables → corruption

The correct approach (used in option-b-leveldb):

// RIGHT — one transaction for atomic rename
s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
    kvs, _ := tr.GetRange(s.dataRange(oldfd), ...).GetSliceWithError()
    tr.ClearRange(s.dataRange(oldfd))   // clear old
    tr.Clear(s.metaKey(oldfd))
    for _, kv := range kvs {            // write new
        tr.Set(translateKey(kv.Key, oldfd, newfd), kv.Value)
    }
    tr.Set(s.metaKey(newfd), metaBytes)
    return nil, nil
})

The pattern: read old state → clear old keys → write new keys — in one transaction. Crash before commit: nothing changed, the old state is fully intact. Crash after commit: the new state is fully visible. No intermediate state is ever observable.


3.5 Key Space Design Patterns and Anti-Patterns

Pattern: Sequential Integer Keys

key: ns + 0x00 + encode_uint64(rowid)

Values: row data. Row id is monotonically increasing. New rows always insert at the end of the range (append-like behavior in the key space). Range scans return rows in insertion order.

Production consideration: In a write-heavy workload, all inserts land on the same key space boundary, creating a hotspot on the FDB shard holding the highest rowid. FDB will split this shard, but the split lags the write rate. For very high insert rates, use a randomized prefix (hash of rowid) to distribute writes across shards.

Pattern: Reverse Timestamp Keys

key: ns + 0x00 + encode_uint64(MaxUint64 - timestamp)

Most recent entries sort first. A GetRange with no bounds returns entries in reverse chronological order. Useful for “recent events”, “activity feeds”, “audit logs”.

Pattern: Composite Keys for Multi-Dimensional Queries

key: ns + encode(field1) + 0x00 + encode(field2) + 0x00 + pk

Example: city + 0x00 + age + 0x00 + pk enables range queries on city AND age:

  • All users in Paris sorted by age: GetRange(prefix + "Paris\x00", prefix + "Paris\x01")
  • All users in Paris aged 25–40: GetRange(prefix + "Paris\x00" + encode(25), prefix + "Paris\x00" + encode(40) + "\xff")

This is the compound index pattern. It is exactly how MySQL’s compound indexes work.

Anti-Pattern: Large Values

Storing all of a user’s data in one 50 KB value:

key: ns + pk
value: msgpack({name, email, all_orders: [...], all_addresses: [...], ...})

Problems:

  • Reading one field requires reading and deserializing the entire value (O(size))
  • Updating one field rewrites the entire value (O(size) write amplification)
  • The value may approach FDB’s 100 KB limit
  • Hot users (many orders) create unbalanced value sizes

Better: normalize into separate keys per logical entity:

ns + 0x00 + pk              → profile (name, email)
ns + 0x01 + pk + orderid    → individual orders
ns + 0x02 + pk + addrid     → individual addresses

Anti-Pattern: Hot Key Contention

key: "global_counter"   ← every write touches this

All transactions that increment a counter conflict on this key. At high throughput, this becomes a serialization bottleneck.

Solutions:

  1. FDB atomic ops: tr.Add(counterKey, littleEndian(1)) — the operand is a little-endian 64-bit value, applied server-side at commit: atomic, no read conflict, no retry needed
  2. Sharded counters: split into N shard keys; read by summing all shards
  3. Approximate counting: use a HyperLogLog or count-min sketch in a single key
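The sharded-counter logic can be sketched without FDB at all: the shard-selection on the write side and the sum on the read side. A map stands in for the key-value store; shardKey is an illustrative helper:

```go
package main

import (
	"fmt"
	"math/rand"
)

const numShards = 16

// shardKey spreads increments across N keys so writers rarely conflict.
func shardKey(counter string, shard int) string {
	return fmt.Sprintf("%s/shard/%02d", counter, shard)
}

func main() {
	store := map[string]int64{} // stand-in for FDB
	// Writers: each increment picks a random shard key.
	for i := 0; i < 1000; i++ {
		store[shardKey("global_counter", rand.Intn(numShards))]++
	}
	// Readers: sum all shards to recover the logical count.
	var total int64
	for s := 0; s < numShards; s++ {
		total += store[shardKey("global_counter", s)]
	}
	fmt.Println(total) // 1000
}
```

The trade-off: writes never contend on a single key, but reads now cost one small range scan (or N gets) instead of one.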

3.6 The FDB Blob Layer Pattern

FDB’s value size limit (100 KB) is not a fundamental limitation — it’s a performance tuning decision. The solution is chunking, used in option-b-leveldb:

key: ns + 0x02 + fileType + fileNum + chunkNum  →  up to 65,536 bytes

To read a file: range-scan all chunks (sorted by chunkNum due to big-endian encoding), concatenate values.
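Splitting and reassembly are pure byte math; a sketch (splitChunks is an illustrative helper, not the repo’s code):

```go
package main

import (
	"bytes"
	"fmt"
)

const chunkSize = 65536 // 64 KB, matching the layout above

// splitChunks slices data into chunkSize pieces; the chunk's position
// becomes the big-endian chunkNum suffix, so a range scan returns
// chunks in order.
func splitChunks(data []byte) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := len(data)
		if n > chunkSize {
			n = chunkSize
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

func main() {
	blob := bytes.Repeat([]byte("x"), 150_000) // a ~146 KB file
	chunks := splitChunks(blob)
	fmt.Println(len(chunks)) // 3: 64 KB + 64 KB + remainder
	// Reading back: concatenate chunk values in key order.
	fmt.Println(bytes.Equal(bytes.Join(chunks, nil), blob)) // true
}
```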

This pattern generalizes to any large object:

  • Files: split by byte range (64 KB chunks)
  • Images: split by byte range or by tile
  • Blobs: split by arbitrary chunk size
  • Large JSON documents: split by logical sub-object

FDB’s official Blob layer (in the foundationdb-python-layers repository) implements this exact pattern. Large-object stores such as Amazon S3 and Google Cloud Storage follow a broadly similar shape — objects split into chunks, with chunk locations tracked by metadata in a distributed store — though their internals differ in the details.


3.7 Schema Evolution — How Layers Survive Change

A layer’s encoding is its schema. Changing the encoding without migrating old data is a common source of production incidents. Strategies:

1. Prefix-versioned keys:

ns + version_byte + ...rest

New code writes version = 0x02 keys; old code writes version = 0x01 keys. A background migration reads all 0x01 keys, converts, writes 0x02 keys. Read code handles both versions.

2. Additive-only changes: If your value is msgpack/Protobuf, adding new optional fields is backward compatible. Old code ignores unknown fields. New code sets defaults for missing fields. This is safe and zero-downtime.

3. Dual writes during migration: Write to old AND new key format simultaneously during a migration window. Verify correctness. Backfill old data. Cut over to new format. Remove old code path. Zero-downtime migration.

4. Never reuse key prefixes: Deleted keys leave no trace in FDB (there’s no tombstone value). But if you reuse a deleted subspace’s prefix for a new purpose, you risk stale data colliding with new data. Always use new prefixes for new logical entities.
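Strategy 1’s read-side dispatch can be sketched as follows. This is an illustrative example, not from the repo: the version byte here prefixes the stored value, but the same switch works if the byte lives in the key; decodeUser and the format constants are hypothetical:

```go
package main

import "fmt"

const (
	formatV1 = 0x01 // legacy: value is just the name
	formatV2 = 0x02 // new: value is name + 0x00 + city
)

// decodeUser handles both on-disk formats during the migration window.
func decodeUser(value []byte) (name, city string) {
	switch value[0] {
	case formatV1:
		return string(value[1:]), "" // city didn't exist in v1
	case formatV2:
		for i, b := range value[1:] {
			if b == 0x00 {
				return string(value[1 : 1+i]), string(value[2+i:])
			}
		}
	}
	panic("unknown format version")
}

func main() {
	name, city := decodeUser(append([]byte{formatV2}, "Alice\x00Paris"...))
	fmt.Println(name, city) // Alice Paris
	name, city = decodeUser(append([]byte{formatV1}, "Alice"...))
	fmt.Println(name, city) // "Alice" with empty city
}
```

New code always writes formatV2; once the background migration has rewritten every v1 value, the v1 case (and the old write path) can be deleted.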


Interview Questions

Q: Why is key ordering so important in FDB — can’t you just use a hash map?

A hash map gives O(1) point lookups but cannot efficiently answer range queries (“all users aged 25–40”), prefix queries (“all keys starting with ‘user:42:’”), or sorted iteration. Ordered storage enables all three with a single range scan. The entire record layer concept — where secondary indexes are just cleverly prefixed keys — relies on ordering. Without ordering, every query would require a full scan.

Q: What is a “subspace” in FDB and why is it important?

A subspace is a key prefix that namespaces a set of keys. It provides namespace isolation (keys in subspace A cannot collide with keys in subspace B), atomic deletion (one ClearRange deletes all keys in a subspace), and efficient range queries (one GetRange over the subspace prefix reads all related data). Every production FDB layer uses subspaces as the fundamental organizational unit.

Q: How do you handle null bytes inside keys in FDB?

FDB keys can contain any bytes, but if you use \x00 as a separator, null bytes inside field values create ambiguity. The official solution is the Tuple encoding: escape \x00 inside string values as \x00\xFF, and use bare \x00 as the separator. The decoder sees \x00\xFF and knows it’s an escaped null; bare \x00 ends the field. The layers in this repo use simple encodings that assume field values don’t contain null bytes — acceptable for a teaching implementation, not for production with arbitrary user data.

Q: Why is the PutRecord function in the record layer more expensive than a simple Set?

PutRecord must read the old record before writing the new one, so it can remove stale index entries. This adds one Get call to every write — a full round-trip to a Storage Server. The cost is necessary: without reading the old record, you can’t know which old index entries to remove. Production systems amortize this by batching multiple record updates in one transaction, or by using techniques like “shadow documents” (keeping old field values in the record itself to avoid the read).

Q: What is the difference between Transact and ReadTransact in FDB?

ReadTransact opens a read-only transaction: no conflict tracking, no commit round-trip, cheaper. It’s for operations that only read data. Transact opens a read-write transaction: tracks the read set for conflict detection, buffers writes, commits at the end. Use ReadTransact whenever you’re not writing anything — it’s faster and imposes less load on the cluster.