Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Hitchhiker’s Guide — Option A: SQL Engine over FoundationDB

The question this answers: “What does a relational database actually store? If you strip away the SQL parser and the query planner, what is left?”

The deeper question: “How do tables, rows, primary keys, and catalogs map onto an ordered key–value store?”


Table of Contents

  1. What a Relational Database Actually Stores
  2. The Catalog — A Database About Databases
  3. Row Encoding — Turning a Map into Bytes
  4. Primary Keys and Sort Order
  5. Full Table Scan — The Honest Query
  6. INSERT as a Rowid Allocation + Write
  7. UPDATE = Overwrite (Why it Works)
  8. DELETE = Clear (Why it Works)
  9. Key Layout In Detail
  10. What We Didn’t Build (and Why Those Parts Are Hard)
  11. Real-World Analogues: SQLite, MySQL, PostgreSQL
  12. Exercises

1. What a Relational Database Actually Stores

A relational database is, at its core, a system for:

  1. Storing rows (collections of typed values) under a primary key.
  2. Keeping a catalog (metadata about what tables exist and their schemas).
  3. Maintaining indexes (secondary data structures for fast lookup by non-primary-key fields).
  4. Executing queries (finding rows matching a predicate).

Strip away SQL parsing, the query planner, authentication, and network protocols, and you have a structured key–value store with a catalog.

Our layer implements steps 1 and 2. Step 3 is in option-c-record-layer. Step 4 is a SelectWhere(pred) that does a full scan — honest but slow, which makes the need for indexes obvious.


2. The Catalog — A Database About Databases

Every real database has a catalog: a table of tables. PostgreSQL stores it in pg_catalog.pg_class. MySQL uses information_schema.TABLES. SQLite writes the catalog to the first page of the database file as sqlite_master.

Our catalog lives in FDB under a dedicated subspace:

<ns> 0x00 <tableName>  →  msgpack(TableDef)

TableDef contains the table name, column definitions (name + type), and the declared primary-key column name (if any). When Open() is called, we scan this subspace and reconstruct the in-memory schemas map:

func (d *DB) loadCatalog() error {
    // range-scan the catalog subspace
    kvs, _ := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(d.catalogRange(), fdb.RangeOptions{}).GetSliceWithError()
    })
    for _, kv := range kvs {
        var def TableDef
        msgpack.Unmarshal(kv.Value, &def)
        d.schemas[def.Name] = def
    }
}

This means the catalog is self-describing — a fresh process connecting to an existing FDB namespace automatically discovers all tables. You don’t need to pass schemas at startup. This mirrors how SQLite opens a database file and reads its catalog from sqlite_master before accepting any queries.

CreateTable is atomic: the catalog write and any initial setup happen inside one FDB transaction. Two concurrent CreateTable("users") calls will either both succeed (if they have identical schemas) or one will fail with a conflict. There is no possibility of a half-created table.


3. Row Encoding — Turning a Map into Bytes

Our Row type is map[string]any. We serialize it with msgpack, which produces a compact binary representation:

{"name": "Alice", "city": "Paris", "id": 1}
→  83 a4 6e 61 6d 65 a5 41 6c 69 63 65 a4 63 69 74 79 a5 50 61 72 69 73 a2 69 64 01
   (msgpack map: 3 entries, name="Alice", city="Paris", id=1)

The row bytes become the value in FDB. The key is derived from the primary key value (see below).

Why msgpack, not JSON or Protobuf?

  • JSON is human-readable but wastes space (quoted keys, text numbers).
  • Protobuf is compact but requires a schema file and code generation.
  • msgpack is compact, self-describing (no external schema), and has first- class Go support.

For a teaching implementation, msgpack hits the right trade-off. For production, Protobuf or Cap’n Proto would be better choices, since they support schema evolution (adding fields without breaking old readers) and are more efficient for fixed schemas.


4. Primary Keys and Sort Order

A primary key must be unique and determines the storage order of rows.

In FDB, “storage order” = “key byte order”. So we need to encode primary key values into bytes in a way that preserves the intended sort order.

String PKs: store as raw UTF-8 bytes. UTF-8’s byte ordering preserves lexicographic order for ASCII characters. Enough for a demo; a production system would use Unicode collation.

Integer PKs: this is where it gets interesting.

Naive approach: encode as decimal string. Problem: "9" > "30" in byte comparison. Rows would sort as: 1, 10, 100, 2, 20, 3… — wrong.

Better: big-endian 64-bit integer. Problem: negative numbers have the high bit set, so -1 encodes as 0xFF... which sorts after all positive numbers.

Our solution (and the industry standard):

func encodeInt64(x int64) []byte {
    b := make([]byte, 8)
    // Flip the sign bit. This turns the two's-complement representation into
    // an unsigned representation that sorts correctly:
    //  0 → 0x8000_0000_0000_0000 (sorts between negatives and positives)
    //  1 → 0x8000_0000_0000_0001
    // -1 → 0x7FFF_FFFF_FFFF_FFFF (sorts before 0)
    // -2 → 0x7FFF_FFFF_FFFF_FFFE (sorts before -1)
    binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
    return b
}

After this encoding, integer byte strings sort in the same order as their numeric values. A GetRange on the rows subspace returns rows in ascending PK order, for free, by FDB’s native byte-order scan.

This is the same trick used by:

  • Apache HBase (for row key ordering)
  • Google Cloud Bigtable
  • FoundationDB’s official Tuple layer (for negative integers)
  • Apache Cassandra’s blob serialization

5. Full Table Scan — The Honest Query

SelectAll and SelectWhere both scan the entire rows subspace:

func (d *DB) SelectWhere(table string, pred func(Row) bool) ([]Row, error) {
    v, _ := d.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(d.tableRange(table), fdb.RangeOptions{}).GetSliceWithError()
    })
    for _, kv := range v.([]fdb.KeyValue) {
        var row Row
        msgpack.Unmarshal(kv.Value, &row)
        if pred == nil || pred(row) {
            out = append(out, row)
        }
    }
}

The predicate is applied in Go, after fetching all rows from FDB. This is a full table scan — the same operation as SELECT * FROM users WHERE city='Paris' without any index.

Why not push the predicate to FDB?

FDB has no server-side filtering. Its API is: “give me this key range.” All filtering must happen on the client. This is a deliberate design choice by FDB: the cluster is a dumb-fast KV store; smarts live in the layer.

Real SQL databases push predicates “down” to the storage engine to avoid fetching unnecessary rows over the network. In our setup, “the network” is the connection to FDB’s storage servers — for large tables, a full scan incurs real latency. This is the motivation for secondary indexes (option-c).

What a real SQL engine does instead:

A query planner looks at the predicate city='Paris' and checks whether there is an index on city. If yes, it rewrites the scan as:

scan index(city = 'Paris') → [pk1, pk2, pk3]
fetch rows by pk: Get(pk1), Get(pk2), Get(pk3)

This is exactly what LookupByIndex does in option-c.


6. INSERT as Rowid Allocation + Write

When a table has no declared PK, we allocate a monotonic rowid:

_, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
    curBytes, _ := tr.Get(d.seqKey(table)).Get()
    var cur uint64
    if len(curBytes) == 8 {
        cur = binary.LittleEndian.Uint64(curBytes)
    }
    next := cur + 1
    // Write the new counter value
    binary.LittleEndian.PutUint64(buf, next)
    tr.Set(d.seqKey(table), buf)
    // Write the row
    tr.Set(d.rowKey(table, encodeInt64(int64(next))), encoded)
    return nil, nil
})

The counter read + increment + row write happen in one transaction. Two concurrent inserts cannot get the same rowid — FDB’s conflict detection will cause one to retry with a fresh counter read.

This is identical to how SQLite allocates its rowid (a monotonic integer stored in the B-tree page header) and how PostgreSQL allocates OIDs and ctids. The key insight: the counter is a row, and updating it is a transaction.


7. UPDATE = Overwrite (Why it Works)

LevelDB and FDB share a property: Set(key, newValue) on an existing key simply replaces it. There is no “update in place” at the storage level.

Our “UPDATE” is just another Insert call with the same PK:

db.Insert("users", Row{"id": 1, "name": "Alice", "city": "Tokyo"})
// This calls tr.Set(rowKey(table, encode(1)), msgpack(new_row))
// The old row bytes under the same key are replaced.

FDB does not copy the old value to a TOAST or undo segment — it just writes the new value. The old version is kept in FDB’s MVCC history for ~5 seconds (for in-flight read transactions) and then garbage-collected.

What changes when you have indexes:

Once you have secondary indexes (option-c), an update is no longer just a single key write. You must:

  1. Read the old row (to find out what old index entries to remove)
  2. Clear the old index entries
  3. Write the new row
  4. Write the new index entries

All in one transaction. This is why option-c’s PutRecord is more complex than our Insert.


8. DELETE = Clear (Why it Works)

func (d *DB) Delete(table string, pk any) error {
    encodedPK, _ := encodePK(pk)
    _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Clear(d.rowKey(table, encodedPK))
        return nil, nil
    })
    return err
}

Clear adds a tombstone to the transaction buffer. After commit, the key disappears from FDB as if it was never written. Future reads return nil.

No cascades: unlike SQL with ON DELETE CASCADE, our engine has no foreign key enforcement. This is an intentional simplification. Real SQL engines resolve cascades by finding related rows (via indexes on the FK column) and deleting them in the same transaction.

What about DropTable?

func (d *DB) DropTable(name string) error {
    _, err := d.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Clear(d.catalogKey(name))         // remove from catalog
        tr.ClearRange(d.tableRange(name))    // delete all rows
        tr.Clear(d.seqKey(name))             // reset rowid counter
        return nil, nil
    })
}

One transaction that deletes the catalog entry, all data rows, and the sequence counter. Atomically. A reader that has an open transaction might still see the table (MVCC), but any new transaction started after commit sees the table as gone.


9. Key Layout In Detail

Let’s trace exactly what happens when you call:

db.CreateTable(TableDef{Name: "users", PK: "id", Columns: [...]})
db.Insert("users", Row{"id": 1, "name": "Alice", "city": "Paris"})
db.Insert("users", Row{"id": 2, "name": "Bob",   "city": "Tokyo"})

With namespace "demo-sql":

Catalog key:
  "demo-sql" + 0x00 + "users"
  = 64 65 6d 6f 2d 73 71 6c 00 75 73 65 72 73
  → value: msgpack({Name:"users", PK:"id", Columns:[...]})

Row key for id=1:
  "demo-sql" + 0x01 + "users" + 0x00 + encodeInt64(1)
  = 64 65 6d 6f 2d 73 71 6c 01 75 73 65 72 73 00 80 00 00 00 00 00 00 01
                                               ↑ encoded 1 (sign bit flipped)
  → value: msgpack({id:1, name:"Alice", city:"Paris"})

Row key for id=2:
  "demo-sql" + 0x01 + "users" + 0x00 + 80 00 00 00 00 00 00 02
  → value: msgpack({id:2, name:"Bob", city:"Tokyo"})

Table scan range:
  begin: "demo-sql" + 0x01 + "users" + 0x00
  end:   "demo-sql" + 0x01 + "users" + 0x01
  → returns id=1 row, then id=2 row (in ascending id order)

The 0x00 separator between tag, table name, and PK ensures:

  • Table “user” doesn’t collide with table “users”
  • Row key doesn’t collide with catalog key (different tag byte: 0x01 vs 0x00)
  • Rows for table “users” don’t collide with rows for table “orders” (different table name segment)

10. What We Didn’t Build (and Why Those Parts Are Hard)

Secondary indexes: We have SelectWhere but it does a full scan. Adding a secondary index (e.g., “find all users where city=‘Paris’”) requires maintaining an extra key for every row that maps (city value, pk) → "". The hard part is keeping the index consistent when rows change — see option-c.

SQL parsing: A SQL parser turns SELECT name FROM users WHERE city='Paris' into an AST. This is ~1000 lines of code for a minimal SQL grammar. We skip it because it adds no FDB-specific insight.

Query planner: Given multiple indexes, which one should we use? This is a graph problem (estimating cost of each query plan) that modern databases spend enormous effort on. The basic version is simple (pick the most selective index), but getting it right for all cases is a career.

JOIN: Joining two tables in SQL is conceptually a nested loop (for each row in table A, find matching rows in table B). The hard part is making this fast when both tables are large. Most JOIN algorithms (hash join, sort-merge join) still reduce to: “scan one table, look up matching rows from the other.”

MVCC for uncommitted readers: We provide MVCC through FDB’s read versions, but we don’t expose BEGIN TRANSACTION / COMMIT / ROLLBACK at the SQL level. Adding this requires keeping an open FDB transaction for the duration of the user transaction — which is possible but requires careful lifecycle management.


11. Real-World Analogues

SQLite

SQLite stores its catalog in sqlite_master, a special table on page 1:

CREATE TABLE sqlite_master (
    type TEXT,    -- "table", "index", "view", "trigger"
    name TEXT,
    tbl_name TEXT,
    rootpage INTEGER,  -- page number of the B-tree root for this table
    sql TEXT           -- the original CREATE statement
);

Every table’s rows are stored in a B-tree rooted at rootpage. The B-tree keys are the rowid values (64-bit integers, always). The row data (all columns) is packed into the B-tree leaf value using SQLite’s own compact encoding.

Our rowKey(table, encodedPK) is the FDB equivalent of the B-tree rowid. Our msgpack row serialization is the FDB equivalent of SQLite’s row record format.

MySQL InnoDB

InnoDB uses a clustered index — the primary key IS the B-tree. All columns are stored in the PK leaf node. Secondary indexes store the PK value (not the physical row location) as their “pointer”:

PRIMARY KEY index: pk → all_columns
SECONDARY INDEX on city: city → pk

To look up a row by city:

  1. Scan secondary index for city = 'Paris' → get pk values
  2. Look up each pk in the primary index → get the full row

This two-step lookup is called a “key lookup” in MySQL’s EXPLAIN output. Our option-c-record-layer does exactly this.

PostgreSQL

PostgreSQL uses “heap files” — rows are stored in pages in insertion order, not sorted by PK. The PK is just another index (a B-tree) that maps pk → ctid (page number + slot within page). Secondary indexes also map field_value → ctid.

This is subtly different from InnoDB: in PostgreSQL, all indexes point to physical page locations, not to the PK. This makes index maintenance easier (an update to a non-indexed column doesn’t touch any index), but makes VACUUM (garbage collection of old row versions) more complex.

CockroachDB / TiKV / YugabyteDB

These “NewSQL” databases are essentially what we’re building here: a SQL query engine on top of a distributed, ordered, transactional KV store. Their storage architecture is:

SQL layer (parsing, planning, execution)
    ↓
Encoding layer (rows → keys, indexes → keys)
    ↓
Distributed KV layer (RocksDB + Raft replication per shard)

CockroachDB’s row encoding is:

/<table_id>/<index_id>/<col1_value>/<col2_value>/...  →  rest_of_columns

This is the same pattern as our <ns>/<tag>/<table_name>/<pk> — just with more type information and a smarter Tuple encoding.


12. Exercises

Exercise 1 — Add a Secondary Index

Modify Insert to also write index entries:

ns + 0x02 + tableName + 0x00 + columnName + 0x00 + encodedValue + 0x00 + pk → ""

Modify Delete to also clear those entries (requires reading the row first). Add SelectByIndex(table, column, value) ([]Row, error).

Exercise 2 — Foreign Keys

Add a References field to Column:

type Column struct {
    Name       string
    Type       ColumnType
    References *ForeignKey  // nil if no FK
}

In Insert, check that the referenced PK exists (a Get inside the same transaction). In Delete, check that no child rows reference this PK (a GetRange on the FK index).

Exercise 3 — Transactions for Callers

Add a Begin() *Tx method that returns a transaction object. Tx should buffer operations and commit on tx.Commit() — all inside one FDB db.Transact() call.

Exercise 4 — DISTINCT and COUNT

Implement SelectDistinct(table, column) and Count(table) using a single FDB GetRange and in-process aggregation. This is your first aggregation operator — the same one query engines implement with hash tables or sort passes.

Exercise 5 — Schema Evolution

Add a version number to TableDef. When Open() loads a table with version < current, run a migration function that adds default values for new columns.


13. Source Code Deep Dive — sqllayer/db.go

The entire engine is in sqllayer/db.go. Let’s trace the critical paths.

The DB Type

type DB struct {
    fdb     fdb.Database
    ns      []byte
    schemas map[string]TableDef
    mu      sync.RWMutex
}

schemas is an in-memory cache of the catalog. It is loaded at Open() time from FDB and updated by CreateTable/DropTable. The mutex protects concurrent access to schemas.

Design question: Why cache schemas in memory at all? Because every query needs the schema to know column types (for encoding/decoding) and the PK column name. Fetching the schema from FDB on every query would add a round-trip. The trade-off: the in-memory cache can be stale if another process creates/drops a table. Production systems solve this with schema versioning: every DDL operation increments a global version; processes re-sync when they detect a version mismatch.

Key Helpers

func (d *DB) catalogKey(table string) fdb.Key {
    return fdb.Key(append(append([]byte{}, d.ns...), append([]byte{0x00}, []byte(table)...)...))
}

func (d *DB) rowKey(table string, encodedPK []byte) fdb.Key {
    prefix := append(append([]byte{}, d.ns...), 0x01)
    prefix = append(prefix, []byte(table)...)
    prefix = append(prefix, 0x00)
    return fdb.Key(append(prefix, encodedPK...))
}

Tag bytes distinguish subspaces:

  • 0x00 → catalog entries
  • 0x01 → row data
  • 0x02 → sequence (rowid counter)

This tag-byte approach is simpler than the full Tuple encoding but achieves the same goal: prevent catalog keys from colliding with row keys even if a table is named “users” and another subspace prefix also uses “users”.

The Sequence Counter Pattern

func (d *DB) nextRowid(tr fdb.Transaction, table string) (int64, error) {
    key := d.seqKey(table)
    cur, _ := tr.Get(key).Get()
    var n int64 = 1
    if len(cur) == 8 {
        n = int64(binary.LittleEndian.Uint64(cur)) + 1
    }
    b := make([]byte, 8)
    binary.LittleEndian.PutUint64(b, uint64(n))
    tr.Set(key, b)
    return n, nil
}

This is a read-modify-write within one transaction. The caller passes tr (an active transaction). The Get reads the current counter value. The Set increments it. Both are in the same transaction as the row write.

If two concurrent inserts both read the same counter value, they’ll both try to write n+1 — but FDB’s conflict detection will catch this: both transactions read the same key, and both tried to write it. One will conflict and retry, which re-reads the counter and gets a fresh value. The retry is invisible to the caller.

Why not use FDB’s atomic add?

tr.Add(key, encodeInt64(1))  // atomic increment, no conflict

tr.Add is a commutative atomic operation — it doesn’t cause read-write conflicts. But it returns no value — you can’t know what rowid was assigned. If you need to use the rowid in the same transaction (to build the row key), you need the read-modify-write pattern. If the rowid is just an opaque row identifier that callers don’t use, tr.Add plus a versionstamp (to create a unique, ordered key without knowing the exact value) is a better approach.

The encodePK Function

func encodePK(v any) ([]byte, error) {
    switch x := v.(type) {
    case int:
        return encodeInt64(int64(x)), nil
    case int64:
        return encodeInt64(x), nil
    case string:
        return []byte(x), nil
    case []byte:
        return x, nil
    default:
        return nil, fmt.Errorf("unsupported PK type %T", v)
    }
}

String PKs are stored as raw UTF-8 bytes. This sorts correctly for ASCII but not for Unicode (e.g., “é” sorts after “z” in byte order but before “z” in French locale). For a production system, use the ICU library for locale-aware collation, or restrict PKs to ASCII.

Integer PKs use encodeInt64: big-endian with sign-bit flipped. Read Chapter 3 of this guide for the derivation.

What about composite PKs? (user_id, order_id) as a PK means the row key is encode(user_id) + 0x00 + encode(order_id). The separator between components is 0x00 (the Tuple encoding style). This naturally makes rows for the same user adjacent in key order — a range scan GetRange(encode(user_id) + 0x00, encode(user_id) + 0x01) returns all orders for that user. Our implementation only supports single-column PKs, but the extension is straightforward.


14. Production Architecture — What a Real SQL-on-FDB Looks Like

The five components of a production SQL-on-FDB system:

Component 1: SQL Parser

Takes SQL text, produces an AST. For SELECT name FROM users WHERE city='Paris':

SelectStatement {
    Columns: ["name"],
    Table:   "users",
    Where:   EqualExpr{Column: "city", Value: "Paris"},
}

Libraries: pingcap/tidb/parser (Go, production-grade), xwb1989/sqlparser (Go, simpler), or write your own using gocc or pigeon.

Component 2: Query Planner

Takes the AST, looks at available indexes, estimates row counts, and picks the cheapest plan:

Without index on city:
  plan: TableScan("users") → Filter(city='Paris')
  cost: O(n_rows) reads

With index on city:
  plan: IndexScan(city='Paris') → PKLookup("users", [pk1, pk2, ...])
  cost: O(index_entries) + O(result_rows) reads

The planner is the hardest part of a query engine. Modern planners use statistics (histograms of column value distributions) stored in a catalog table to estimate n_rows for each plan.

Component 3: Encoding Layer (what we built)

Rows → keys in FDB. Indexes → keys in FDB. All with sort-preserving encodings.

Component 4: FDB Transaction Management

SQL has BEGIN / COMMIT / ROLLBACK. FDB has Transact. Bridging these requires holding an FDB Transaction object open for the duration of the SQL transaction. The FDB 5-second limit means long-running SQL transactions must be broken up or use FDB’s “grv” caching.

Component 5: Network Protocol

Postgres wire protocol (libpq-compatible), MySQL wire protocol, or a custom protocol. This is what lets standard clients (psql, JDBC, SQLAlchemy) connect. CockroachDB implements the Postgres wire protocol on top of their FDB-like storage.


15. Interview Questions — Relational Databases and FDB

Q: What is a clustered index, and how does option-a-sqlite implement it?

A clustered index is an index where the data rows are stored at the index leaf nodes, sorted by the index key. In option-a-sqlite, the FDB key for each row IS the primary key, and the row data is the value at that key. Range scanning by primary key returns rows in PK order and fetches the data in the same operation. This is a clustered index: the data is organized by PK. InnoDB uses the same design (the primary key B-tree is the table).

Q: What happens when two concurrent transactions try to insert a row with the same primary key?

Both transactions read the same row key (which doesn’t exist yet — they get nil). Both write a new value to that key. FDB’s conflict detection: one transaction commits; the other conflicts (it wrote to a key that was also written by the committed transaction). The conflicting transaction retries from the beginning. On retry, it reads the row key again — this time it exists. Depending on application semantics, the retry can: (a) return a duplicate key error, (b) return the existing row (SELECT or INSERT/IGNORE), or (c) replace it (INSERT OR REPLACE).

Q: Why does PostgreSQL store rows in heap order rather than clustered by PK?

PostgreSQL’s original design assumed that MVCC updates would frequently need to move rows (updating a row creates a new row version at a new physical location; the old version stays until VACUUM). If rows were clustered by PK, a moved row would need to update all indexes to point to the new physical location. By using heap files with stable physical locations (ctid), updates only create a new heap entry and update the visibility chain — indexes don’t need to change. The trade-off: PK range scans in PostgreSQL require index traversal + random I/O into the heap, while InnoDB’s clustered index gives sequential I/O for PK range scans.

Q: How does your row encoding handle schema migrations — adding a new column to a table with existing rows?

With msgpack, adding a new column to the schema is safe for new rows (they include the new column). Old rows, when read back, will simply not have the column in the map. Application code should handle missing fields with defaults. This is “forward compatibility”: old data is compatible with new code. The hard case is “backward compatibility”: if you need all rows to have the new column (for a NOT NULL constraint), you must run a background migration job that reads every old row, adds the default value, and writes it back — one paginated batch per transaction.

This is how every production database handles ALTER TABLE ADD COLUMN.