Hitchhiker’s Guide — Option C: Record Layer with Secondary Indexes
The question this answers: “How does a database maintain secondary indexes so they always agree with the primary data — even under concurrent writes?”
The deeper question: “What does Apple actually do inside iCloud, and why did they choose FoundationDB as the substrate?”
Table of Contents
- What the Record Layer Is
- Records vs Indexes: Two Subspaces, One Truth
- The Index Key Format — Anatomy of a Lookup Key
- PutRecord — Read-Modify-Write as a Pattern
- LookupByIndex — Pipelining Explained
- The encodeInt64 Sign-Bit Trick (Derived from First Principles)
- Index Consistency: The Core Guarantee
- The Apple Story: fdb-record-layer and CloudKit
- Real-World Analogues: MySQL, PostgreSQL, MongoDB
- Exercises
1. What the Record Layer Is
The FoundationDB Record Layer is a Java library open-sourced by Apple in 2019. It provides a record-oriented storage layer above FDB with:
- Typed records using Protobuf descriptors
- Declarative indexes (define once; maintenance is automatic)
- A query planner that compiles predicates into index range scans
- Schema evolution without downtime
- Change feeds via versionstamped keys
Our implementation is a Go reimagining: records are map[string]any
(instead of Protobuf messages), indexes are field names (instead of
declarative index definitions), and there is no query planner — you call
LookupByIndex directly. But every core concept is the same.
Why a “record layer” and not just “a KV store”?
The moment you have more than one way to find a record, you need an index. The moment you have an index, you have two copies of the same data in two different shapes. The moment you have two copies, you need a protocol for keeping them consistent. The record layer is that protocol.
2. Records vs Indexes: Two Subspaces, One Truth
Our key space uses two tag bytes to separate the two subspaces:
Records subspace:
<ns> + 0x00 + <schemaName> + 0x00 + <pk>
→ value: msgpack(record)
Indexes subspace:
<ns> + 0x01 + <schemaName> + 0x00 + <fieldName> + 0x00 + <encodedFieldValue> + 0x00 + <pk>
→ value: "" (empty — the key IS the data)
Example with namespace "demo", schema "user", user Alice (pk="alice")
with field city="Paris":
Records:
64 65 6d 6f 00 75 73 65 72 00 61 6c 69 63 65
"demo" 00 "user" 00 "alice"
→ msgpack({id:"alice", name:"Alice", city:"Paris"})
Indexes (city field):
64 65 6d 6f 01 75 73 65 72 00 63 69 74 79 00 50 61 72 69 73 00 61 6c 69 63 65
"demo" 01 "user" 00 "city" 00 "Paris" 00 "alice"
→ "" (empty value)
Why the value is empty for indexes:
The index key encodes everything you need to find the record: which schema, which field, what value, and what the PK is. The value at an index key carries no additional information. This is identical to how MySQL’s InnoDB secondary indexes work: the B-tree leaf stores just the clustered PK, not the full row.
Why the subspace separation matters:
- A full table scan (read all records) =
GetRange(ns+0x00+schema, ns+0x00+schema+0x01)— hits only records, not indexes. - An index scan (find all users where city=‘Paris’) =
GetRange(ns+0x01+schema+0x00+city+0x00+Paris, ns+0x01+schema+0x00+city+0x00+Paris+0x01)— hits only that index column/value, not other indexes or records. - Clearing all indexes for a schema =
ClearRange(ns+0x01+schema, ns+0x01+schema+0x01).
With one tag byte, ranges are perfectly clean. Without it, you’d need carefully chosen sentinels and you’d still risk collisions.
3. The Index Key Format — Anatomy of a Lookup Key
Let’s trace exactly what key is written for each user:
Alice (pk="alice", city="Paris"):
index key = ns + 0x01 + "user" + 0x00 + "city" + 0x00 + "Paris" + 0x00 + "alice"
Bob (pk="bob", city="Tokyo"):
index key = ns + 0x01 + "user" + 0x00 + "city" + 0x00 + "Tokyo" + 0x00 + "bob"
Carol (pk="carol", city="Paris"):
index key = ns + 0x01 + "user" + 0x00 + "city" + 0x00 + "Paris" + 0x00 + "carol"
In FDB’s sorted key space, these appear as:
(sorted by bytes)
...
ns+0x01+"user"+0x00+"city"+0x00+"Paris"+0x00+"alice" → ""
ns+0x01+"user"+0x00+"city"+0x00+"Paris"+0x00+"carol" → ""
ns+0x01+"user"+0x00+"city"+0x00+"Tokyo"+0x00+"bob" → ""
...
A GetRange for city="Paris" scans from ...+"Paris"+0x00 to
...+"Paris"+0x01, returning exactly two keys (Alice and Carol). The PKs
("alice", "carol") are embedded in the key suffix. This is the index scan.
The extractPkFromIndexKey function:
Given the raw FDB index key, we need to extract just the PK suffix. The PK
starts after the last 0x00 byte:
func extractPkFromIndexKey(key fdb.Key) []byte {
for i := len(key) - 1; i >= 0; i-- {
if key[i] == 0x00 {
return key[i+1:]
}
}
return nil
}
We scan backwards because the PK itself might contain 0x00 bytes if it
was a binary key. Scanning backwards from the end finds the last 0x00,
which is the separator between the encoded field value and the PK. This is
safe because the encoded field value ends before that last 0x00, and the
PK follows.
A subtle limitation: if the PK itself can contain 0x00, this parsing
is ambiguous. A production implementation uses the Tuple encoding, which
escapes 0x00 bytes within strings (\x00\xff) so the separator \x00\x00
is unambiguous.
4. PutRecord — Read-Modify-Write as a Pattern
This is the most important function in the record layer. Let’s walk through it completely:
func (s *Store) PutRecord(schema Schema, pk string, rec Record) error {
_, err := s.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
// Step 1: Read the OLD record (if it exists)
oldBytes, _ := tr.Get(recordKey(s.ns, schema.Name, pk)).Get()
// Step 2: Decode the old record
var oldRec Record
if len(oldBytes) > 0 {
msgpack.Unmarshal(oldBytes, &oldRec)
}
// Step 3: Remove OLD index entries
for _, field := range schema.Indexes {
if oldVal, ok := oldRec[field]; ok {
tr.Clear(indexKey(s.ns, schema.Name, field, oldVal, pk))
}
}
// Step 4: Write the NEW record
encoded, _ := msgpack.Marshal(rec)
tr.Set(recordKey(s.ns, schema.Name, pk), encoded)
// Step 5: Write NEW index entries
for _, field := range schema.Indexes {
if newVal, ok := rec[field]; ok {
tr.Set(indexKey(s.ns, schema.Name, field, newVal, pk), []byte{})
}
}
return nil, nil
})
return err
}
Why must you read the old record before writing the new one?
Consider Alice moving from Paris to Tokyo:
Before: {id:"alice", name:"Alice", city:"Paris"}
After: {id:"alice", name:"Alice", city:"Tokyo"}
We need to:
- Remove the old index entry for
city=Paris, pk=alice - Add a new index entry for
city=Tokyo, pk=alice - Write the new record
We cannot do step 1 without knowing that Alice was previously in Paris. That information lives in the old record. So we read it first.
Why inside one transaction?
If we split this into two transactions:
Transaction 1: read old record, decide to clear Paris index
Transaction 2: clear Paris index, write new record, write Tokyo index
A crash between T1 and T2 leaves the Paris index entry pointing to a record that now says Tokyo. This is a phantom index entry — a stale pointer that silently returns wrong results.
With all steps in one transaction: either the entire operation commits (both index update and record update) or neither does. The index and the record always agree.
This is what database people mean when they say atomicity eliminates a class of bugs. It’s not just about performance; it’s about correctness.
The cost: one extra read per update.
Every PutRecord does one Get (the old record) before writing. This adds
a read-version conflict key for the record’s current PK slot. If two writers
try to update the same record concurrently, one will get a conflict and retry.
This is acceptable and correct — the alternative (no-conflict update) risks
interleaving index writes.
5. LookupByIndex — Pipelining Explained
This is where we use FDB’s pipelining to make multi-key lookups efficient.
The naive approach (one round-trip per record):
// BAD: N sequential round-trips for N results
pks := getIndexScan(...) // one round-trip: get all PKs from index
for _, pk := range pks {
rec := getRecord(pk) // one round-trip each: N round-trips total
results = append(results, rec)
}
For N=100, this is 101 network round-trips. At 1ms per round-trip, that’s ~100ms for a simple lookup. Unacceptable.
The pipelined approach (two round-trips total):
// GOOD: pipeline all Get calls, then collect all results
_, err = s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
// Phase 1: issue ALL Get calls. None block yet.
futures := make([]fdb.FutureByteSlice, len(pks))
for i, pk := range pks {
futures[i] = rt.Get(recordKey(s.ns, schema.Name, pk))
// rt.Get returns a Future immediately — it has NOT waited for a response.
}
// Phase 2: now block on each future in sequence.
for i, fut := range futures {
b, _ := fut.Get() // blocks until this specific response arrives
// But by the time we block on futures[1], the response for futures[1]
// through futures[N-1] may already be in flight or arrived.
...
}
})
Phase 1 sends N requests in rapid succession without waiting for any response.
The FDB client library batches these into its network buffer and sends them.
The storage servers process all N in parallel. By the time we call Get() on
the first future, many responses have already arrived.
Network diagram:
Without pipelining: With pipelining:
→ Get(pk1) → Get(pk1), Get(pk2), Get(pk3), ... (burst)
← reply ← reply(pk1)
→ Get(pk2) ← reply(pk2)
← reply ← reply(pk3)
... (collect all)
Total: N round-trips Total: ~1 round-trip worth of latency
Why this works in FDB’s model:
FDB transactions are snapshot reads. All Get calls within one
ReadTransact closure see the same version of the database. There are no
cross-key dependencies or ordering constraints on reads within a transaction.
So the FDB client can send all reads in any order, and they will all return
consistent (same read version) values.
This is the same principle as HTTP pipelining, REDIS pipelining, or database query batching — but with a guarantee that all reads are from the same snapshot.
6. The encodeInt64 Sign-Bit Trick (Derived from First Principles)
When indexing a numeric field (e.g., age: 25), we want the index to sort
numerically. Bytes sort lexicographically. How do we make byte-sort match
numeric sort for 64-bit signed integers?
Step 1: Start with what we have.
A 64-bit signed integer in two’s complement:
0 = 0000 0000 0000 0000 0000 0000 0000 0000 (0x0000_0000_0000_0000)
1 = 0000 0000 0000 0000 0000 0000 0000 0001 (0x0000_0000_0000_0001)
-1 = 1111 1111 1111 1111 1111 1111 1111 1111 (0xFFFF_FFFF_FFFF_FFFF)
-2 = 1111 1111 1111 1111 1111 1111 1111 1110 (0xFFFF_FFFF_FFFF_FFFE)
MIN = 1000 0000 0000 0000 0000 0000 0000 0000 (0x8000_0000_0000_0000)
MAX = 0111 1111 1111 1111 1111 1111 1111 1111 (0x7FFF_FFFF_FFFF_FFFF)
Big-endian byte-sort order: MIN, -2, -1, 0, 1, MAX (MIN starts with 0x80, which is 128, larger than 0x00 which is 0). So big-endian two’s complement does NOT sort correctly across the sign boundary.
Step 2: What would correct order look like?
We want the sorted sequence: …, -2, -1, 0, 1, …, MAX. In bytes, each element should be strictly less than the next. If we could assign distinct 64-bit unsigned values to each signed value such that the unsigned order matches the signed order, we’d be done.
Step 3: The offset encoding idea.
One option: add 2^63 to shift the range. Then:
-2^63 + 2^63 = 0
-1 + 2^63 = 2^63 - 1
0 + 2^63 = 2^63
1 + 2^63 = 2^63 + 1
2^63-1 + 2^63 = 2^64 - 1
These are now unsigned values in the range [0, 2^64-1], in the same order as the original signed values. And they sort correctly in big-endian bytes!
Step 4: Addition vs. XOR.
Adding 2^63 is the same as flipping just the sign bit (bit 63). This is because adding 2^63 to a 64-bit value either sets bit 63 (if it was 0) or clears bit 63 and carries into bit 64 (which wraps around, but 2’s complement overflow is the same as XOR on individual bits).
More precisely: x + 2^63 = x XOR 2^63 for all x in the range
[-2^63, 2^63-1]. This is because:
- If x ≥ 0: bit 63 of x is 0; adding 2^63 sets it to 1. XOR with 1<<63 also sets bit 63. Same result.
- If x < 0: bit 63 of x is 1; adding 2^63 causes a carry into bit 64, which is discarded in 64-bit arithmetic, leaving bit 63 as 0. XOR with 1<<63 also clears bit 63. Same result.
So the implementation is just:
func encodeInt64(x int64) []byte {
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
return b
}
Verification:
-2 → uint64(-2) = 0xFFFFFFFFFFFFFFFE → XOR 0x8000... → 0x7FFFFFFFFFFFFFFE
-1 → 0xFFFFFFFFFFFFFFFF → XOR → 0x7FFFFFFFFFFFFFFF
0 → 0x0000000000000000 → XOR → 0x8000000000000000
1 → 0x0000000000000001 → XOR → 0x8000000000000001
2 → 0x0000000000000002 → XOR → 0x8000000000000002
Sorted by bytes: 0x7FFFFFFFFFFFFFE, 0x7FFFFFFFFFFFFFFF, 0x8000000000000000, 0x8000000000000001, 0x8000000000000002 — which is the same order as -2, -1, 0, 1, 2. Correct.
This is the standard encoding used by:
- FDB’s official Tuple layer (negative integers)
- Apache HBase row key encoding
- Google Cloud Bigtable ordered keys
- Apache Flink and Spark for sorted key encodings
7. Index Consistency: The Core Guarantee
Let’s state it precisely: at any point in time, for every record with field
F=V and PK=P, there exists an index entry at (F, V, P). And there are no
index entries without corresponding records.
Formally:
forall records: indexEntries(record) ⊆ presentIndexKeysforall indexKey: correspondingRecord exists
Both directions. Our PutRecord maintains this invariant within one
transaction:
- Remove old index entries → breaks “no stale entries” for old value
- Write new record → old record value is overwritten
- Write new index entries → restores consistency
Between steps 1 and 3, within the transaction, the invariant is temporarily broken. But no other transaction sees this intermediate state — FDB’s isolation means only the final committed state is visible.
What happens if the invariant is violated?
A phantom index entry (index says Alice is in Paris, record says Tokyo):
LookupByIndex(city="Paris")returns Alice- Fetching Alice’s record shows city=Tokyo
- Application logic sees inconsistent data
This is precisely the class of bugs that a record layer is designed to prevent.
8. The Apple Story: fdb-record-layer and CloudKit
Apple’s CloudKit powers iCloud sync for iOS and macOS apps. At its peak it handles hundreds of millions of devices syncing calendar events, notes, photos metadata, and app-specific data. This is the production scale at which fdb-record-layer operates.
Why FDB?
Before FDB, Apple had a custom distributed database for CloudKit. It had the usual problems: hard to scale, hard to operate, hard to maintain consistency under failures. FDB was acquired in 2015. Over the next few years Apple rebuilt CloudKit’s storage engine on FDB.
Why a record layer on top?
FDB is a KV store. CloudKit stores typed records (calendar events have a
start_date, end_date, title; notes have body, attachments). You
cannot just say “find all events starting tomorrow” to an FDB cluster — you
need to translate that query into a range scan over an encoded index.
fdb-record-layer provides:
- Protobuf-typed records (schema validation, efficient encoding)
- Value indexes, rank indexes, text indexes, aggregate indexes
- A query planner that selects the best index for a given predicate
- Change feed via versionstamped index keys (see GUIDE.md chapter on versionstamps)
- Schema evolution (add fields with defaults, deprecate fields, change types safely)
Our implementation omits the Protobuf (replaced with map[string]any) and
the query planner (you call LookupByIndex directly). But the index key
layout and the PutRecord read-modify-write pattern are the same.
9. Real-World Analogues
MySQL InnoDB Secondary Indexes
In InnoDB, a secondary index B-tree contains:
(secondary_key_value, primary_key_value) → ""
Identical to our (field, encodedValue, pk) → "". To look up a row by
a secondary key:
- Range-scan the secondary index to get PKs.
- Look up each PK in the clustered index (the primary B-tree).
This two-step process is called a “double-lookup” or “bookmark lookup”
in MySQL docs. Our LookupByIndex does exactly this.
The reason InnoDB stores the PK (not the heap pointer) in secondary indexes: when a row is updated and moves in the clustered index, secondary indexes don’t need to be updated (they still point to the same PK value, which now lives in a different B-tree page but is found by the clustered index scan).
PostgreSQL GIN / B-Tree Indexes
PostgreSQL secondary indexes store (index_value → ctid) where ctid is the
physical page+slot address. This is faster to look up (one step, not two)
but requires updating ALL indexes when a row moves (e.g., after VACUUM).
PostgreSQL recently added “index-only scans” where the full row is stored in the index (similar to InnoDB’s covering indexes), trading space for read speed.
MongoDB Indexes
MongoDB maintains indexes as separate B-trees:
{ city: "Paris" } → [ ObjectId("alice"), ObjectId("carol") ]
MongoDB’s analog of LookupByIndex is called a “secondary index scan +
fetch”. The MongoDB Document Layer (which stored MongoDB data in FDB) used
exactly our two-subspace approach — one FDB range for documents, one for
each index.
10. Exercises
Exercise 1 — COUNT Aggregate Index
Add a special “count” index that maintains a per-schema counter:
ns + 0x02 + schemaName + 0x00 → msgpack(int64)
In PutRecord, if old record doesn’t exist, increment. In DeleteRecord,
decrement. Both inside the same transaction as the record+index update.
This gives you COUNT(*) in O(1) instead of a full table scan.
Exercise 2 — Range Index Query
Add LookupByIndexRange(schema, field, minVal, maxVal any) ([]Record, error).
Since the index is sorted by encoded field value, you can do:
begin = indexPrefix(ns, schema, field) + encodeValue(minVal)
end = indexPrefix(ns, schema, field) + encodeValue(maxVal) + 0x01
GetRange(begin, end)
This enables WHERE age >= 25 AND age < 35 as a single range scan.
Make sure encodeValue produces sort-preserving bytes for your supported
types (use encodeInt64 for integers).
Exercise 3 — Composite Indexes
Add support for multi-field indexes:
Schema{Name: "event", Indexes: [{"start", "end"}, {"city", "name"}]}
A composite index on (start, end) allows efficient queries like:
WHERE start >= monday AND end <= friday. The key would be:
ns + 0x01 + schema + 0x00 + "start,end" + 0x00 + encodeValue(start) + 0x00 + encodeValue(end) + 0x00 + pk
Exercise 4 — Covering Indexes
A covering index stores the full record at the index key (instead of just
""). Then LookupByIndex returns full records from the index scan alone,
with no secondary lookup:
index key → msgpack(record) (instead of → "")
Upside: LookupByIndex requires only one FDB range scan, no Get calls.
Downside: PutRecord must update index values when any indexed field changes.
Compare the trade-off with real-world covering indexes in MySQL and
PostgreSQL.
Exercise 5 — Versionstamp-Based Change Feed
Use FDB’s SetVersionstampedKey in PutRecord to write a change log entry:
ns + 0xFF + <10-byte versionstamp> → msgpack({schema, pk, change_type})
Build a SubscribeChanges(since versionstamp) (<-chan Change, error) function
that uses GetRange + Watch to stream changes in real time. This is a
simplified version of fdb-record-layer’s change feed.
11. Source Code Deep Dive
encoding.go — The Key Construction Functions
// tagRecords = 0x00, tagIndexes = 0x01
func recordKey(ns []byte, schema, pk string) fdb.Key {
return append(ns, append([]byte{tagRecords}, []byte(schema+"\x00"+pk)...)...)
}
func indexKey(ns []byte, schema, field string, val any, pk string) fdb.Key {
return append(ns, append([]byte{tagIndexes},
[]byte(schema+"\x00"+field+"\x00"+encodeIndexValue(val)+"\x00"+pk)...)...)
}
Why embed the separator \x00 as a Go string literal?
The \x00 byte is the separator between components. Using it directly in a string literal is readable and avoids a separate bytes.Join call. The key correctness requirement: no component value may contain \x00 unescaped. For string field values from map[string]any, callers must validate or escape input. The Tuple encoding in fdb-record-layer handles this by escaping \x00 inside strings as \x00\xFF.
encodeIndexValue — Dispatching on Type
func encodeIndexValue(val any) string {
switch v := val.(type) {
case int64:
return string(encodeInt64(v))
case float64:
// JSON numbers unmarshal as float64 in Go
return string(encodeInt64(int64(v)))
case string:
return v
default:
return fmt.Sprintf("%v", v)
}
}
The float64 case: In Go’s encoding/json, all JSON numbers unmarshal to float64 when the target type is any. This means {"age": 25} gives you age = float64(25). The encoding converts float64 to int64 first. This is lossy for non-integer floats — a production implementation would have a separate encodeFloat64 using the IEEE 754 bit pattern with a sign-flip (identical logic to encodeInt64, but on the math.Float64bits representation).
store.go — ScanRecords with Cursor Logic
func (s *Store) ScanRecords(schema string, limit int) ([]Record, error) {
kvs, _ := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
return rt.GetRange(recordRange(s.ns, schema), fdb.RangeOptions{
Limit: limit,
}).GetSliceWithError()
})
var recs []Record
for _, kv := range kvs.([]fdb.KeyValue) {
var rec Record
msgpack.Unmarshal(kv.Value, &rec)
recs = append(recs, rec)
}
return recs, nil
}
fdb.RangeOptions{Limit: limit} instructs FDB to stop returning results after limit key-value pairs. This is pushed down to the storage server — the server stops scanning early, reducing both network bandwidth and server-side work.
The continuation problem: For a dataset with 1 million records, a single GetRange with no limit would transfer all 1M records in one response. This hits FDB’s per-transaction byte limit (~10 MB) and would transfer far more data than needed. A production scan should use a continuation token:
type Continuation struct {
LastPK string
}
func (s *Store) ScanRecordsPage(schema string, limit int, cont *Continuation) ([]Record, Continuation, error) {
begin := recordRange(s.ns, schema).Begin
if cont != nil {
// Start after the last-seen PK
begin = fdb.Key(append(recordKey(s.ns, schema, cont.LastPK), 0x01...))
}
// GetRange from begin to end of schema range, limit = limit
...
}
Each page call fetches limit records starting after the continuation. The caller loops, passing the continuation from each response into the next call.
index.go — LookupByIndex Pipelining in Detail
func (s *Store) LookupByIndex(schema Schema, field string, value any) ([]Record, error) {
prefix := indexPrefix(s.ns, schema.Name, field, value)
// Step 1: get all PKs from the index (one round-trip)
kvs, _ := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
return rt.GetRange(
fdb.KeyRange{Begin: prefix, End: fdb.Key(append(prefix, 0xFF))},
fdb.RangeOptions{},
).GetSliceWithError()
})
pks := make([]string, len(kvs))
for i, kv := range kvs {
pks[i] = string(extractPkFromIndexKey(kv.Key))
}
// Step 2: fetch all records by PK (pipelined, one round-trip)
recs := make([]Record, 0, len(pks))
s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
futures := make([]fdb.FutureByteSlice, len(pks))
for i, pk := range pks {
futures[i] = rt.Get(recordKey(s.ns, schema.Name, pk))
}
for _, fut := range futures {
b, _ := fut.Get()
var rec Record
msgpack.Unmarshal(b, &rec)
recs = append(recs, rec)
}
return nil, nil
})
return recs, nil
}
Two round-trips for any number of results: Step 1 is one GetRange → one round-trip. Step 2 issues N Get calls inside one transaction → they are pipelined to the storage servers, returning in approximately one round-trip (the latency of the slowest individual read, not N × single-read latency).
Why two separate transactions?
The index scan (step 1) and the record fetches (step 2) could be in one ReadTransact closure — they would both see the same read version. We separate them here for clarity. In production, combining them into one transaction is slightly more efficient: one beginRead call, one read version negotiation, one transaction context.
12. Production Considerations
12.1 Write Amplification Under Index Updates
Every PutRecord that changes an indexed field performs:
- 1 read (old record)
- 1 clear (old index entry)
- 1 set (new record)
- 1 set (new index entry)
For a schema with K indexes and a write that changes all K indexed fields:
- K reads (old index values embedded in old record, but we need the old record regardless)
- K clears (old index entries)
- 1 set (new record value)
- K sets (new index entries)
Total mutations: K clears + K+1 sets = 2K+1 write operations per update. For a schema with 5 indexes, every update writes 11 keys. This is standard for indexed databases — MySQL and PostgreSQL have the same write amplification for secondary indexes. FDB’s commit pipeline absorbs this efficiently because all K sets and K clears go into one transaction batch.
Mitigation: Only maintain indexes on fields that are actually queried. Don’t add indexes speculatively.
12.2 Online Index Building
Adding a new index to a schema that already has millions of records requires backfilling: writing index entries for every existing record. This must be done without taking the database offline.
The safe approach:
- Start writing new index entries in
PutRecordimmediately (for new and updated records). - In a background worker, page through all existing records (using the continuation pattern) and write their index entries.
- Once the background worker reaches the “frontier” (current write position), the index is consistent.
During step 2, queries against the new index may return incomplete results (missing records that haven’t been backfilled yet). This is acceptable if the application can tolerate “eventually consistent” index builds. FDB’s versionstamp mechanism can track the backfill position and indicate to callers which index entries are “safe” (backfilled) vs “tentative.”
12.3 N+1 Read Problem
LookupByIndex with N results does:
- 1 index range scan
- N record reads (pipelined)
If the calling code then calls LookupByIndex for each result (e.g., “find all events in Paris, then for each event, find all attendees”), you get M index scans + M×N record reads. This is the N+1 problem from ORM land, applied to FDB.
Solution: Batch the secondary lookups. Collect all PKs from all index scans, then issue one pipelined batch of record reads. FDB’s pipelining makes this efficient even when batching hundreds of PKs.
13. Interview Questions — Secondary Indexes and Record Layers
Q: Why does PutRecord need to read the old record before writing the new one?
To maintain index consistency. If a record’s indexed field value changes (e.g., city: “Paris” → “Tokyo”), the old index entry (city=Paris,pk=alice) must be deleted and a new entry (city=Tokyo,pk=alice) must be created — in the same transaction as the record write. To know the old index value, we need the old record. Without reading the old record, we can’t know which index keys to clear, so stale index entries would accumulate, pointing to records with wrong field values.
Q: How does LookupByIndex achieve O(2) round-trips regardless of result count?
The index range scan (step 1) is one GetRange that returns all PKs for the queried field value — one round-trip. The record reads (step 2) are issued as N rt.Get calls inside one ReadTransact closure. The FDB client library sends all N requests before waiting for any response (pipelining). The storage servers process them in parallel. By the time the first response arrives, most others are already in flight or returned. The effective latency is one round-trip, not N.
Q: What is the encodeInt64 function doing, and why is it necessary?
FDB keys are sorted lexicographically by bytes. For integer field values to sort correctly as numbers (−1 < 0 < 1 < 2), the byte encoding must be sort-preserving. Two’s complement big-endian is almost right, but fails at the sign boundary: −1 (0xFFFF…) sorts after MAX_INT (0x7FFF…) in byte order. Flipping the sign bit (XOR with 0x8000_0000_0000_0000) shifts the entire integer range to unsigned, making byte order match signed integer order. The result: every negative number’s encoding is less than every non-negative number’s encoding, and within each sign region, the encoding preserves order.
Q: How would you implement multi-tenant isolation so that one tenant’s records are never visible to another?
Add a tenant identifier to the namespace prefix: ns = globalPrefix + tenantId. All record and index keys for that tenant are prefixed with their unique tenant bytes. A range scan by tenant A (GetRange(ns_A, ns_A + 0xFF)) only returns keys in tenant A’s namespace. The FDB cluster enforces this: there are no keys shared between tenant namespaces, so no data can leak across tenants. Additionally, for strict isolation, use FDB’s tenant feature (available in 7.x) which restricts a transaction to a specific tenant prefix at the FDB server level.