Hitchhiker’s Guide — Option B: LevelDB on FDB Storage

Table of Contents
1. The Storage Abstraction: Why LevelDB Has a Plugin Point
2. What LevelDB Actually Writes to Disk
3. The storage.Storage Interface — Dissected
- 3.1 There Are No Filesystem Calls in the Data Path
4. How We Map LevelDB Files to FDB Keys
5. Key Layout Deep Dive: From Function Call to FDB Bytes
6. Chunking: Overcoming the 100 KiB Value Limit
7. Atomic Rename — Durability’s Secret Weapon
8. The Writer: Batching Chunks into Transactions
9. Why the WAL is Redundant With FDB
10. The Blob Layer Pattern
11. Real-World Analogues
12. Exercises
13. Source Code Deep Dive — fdbstorage/storage.go
14. Production Considerations
15. Interview Questions — Storage Abstractions and LSM Trees
16. Bugs Encountered and Lessons Learned

The question this answers: “Can I run a real LSM-tree storage engine — the actual LevelDB binary with all its compaction logic — with its files stored in FoundationDB instead of a local disk?”

The deeper question: “What is the storage.Storage interface, why does it exist, and what does it tell us about how databases handle file I/O?”


1. The Storage Abstraction: Why LevelDB Has a Plugin Point

goleveldb’s storage.Storage interface exists because LevelDB was designed for portability from the start: the original C++ implementation by Jeff Dean and Sanjay Ghemawat routes all file I/O through a similar abstraction called Env, and goleveldb carries the idea over as storage.Storage. Not every environment has a POSIX filesystem. Google has internal systems where storage might be Bigtable, Colossus, or a custom log-structured store.

The interface says: “if you can implement these 10 methods, LevelDB will run on your storage.” The application code (goleveldb) doesn’t know whether it’s writing to ext4, NTFS, GCS, or FDB; it just calls the interface.

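type Storage interface {
    Lock() (util.Releaser, error)                        // take the exclusive database lock
    Log(str string)                                      // free-form log line; may be a no-op
    Open(fd storage.FileDesc) (storage.Reader, error)    // open an existing file for reading
    Create(fd storage.FileDesc) (storage.Writer, error)  // start a new file for writing
    Remove(fd storage.FileDesc) error                    // delete a file
    Rename(oldfd, newfd storage.FileDesc) error          // atomically rename a file
    GetMeta() (storage.FileDesc, error)                  // which MANIFEST is current?
    SetMeta(fd storage.FileDesc) error                   // record the current MANIFEST
    List(ft storage.FileType) ([]storage.FileDesc, error) // enumerate files matching the type bitmask
    Close() error                                        // release the storage
}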

Our fdbstorage.Storage implements all of these, storing files as FDB key-value pairs. LevelDB itself (in the syndtr/goleveldb package) calls these methods. It has no idea the “files” are actually chunks in a distributed database.

2. What LevelDB Actually Writes to Disk

To understand what we need to store, let’s look at what LevelDB writes:

File types:
  TypeJournal  (.log)  — Write-Ahead Log: records every write before the
                         MemTable is flushed. Used to recover unflushed writes
                         after a crash.
  TypeManifest (.MANIFEST) — Lists which SSTables are "live" (not yet
                         garbage-collected). Updated at each compaction.
  TypeTable    (.ldb / .sst) — Sorted String Tables. Immutable, sorted KV
                         data files produced by compaction.
  TypeCurrent  (CURRENT) — A single file containing the name of the latest
                         MANIFEST file.
  TypeTemp     (.tmp)   — Temporary files used during compaction.
  TypeLock     (LOCK)   — A file held open to prevent two processes from
                         opening the same database simultaneously.

File descriptor:
type FileDesc struct {
    Type FileType  // TypeJournal, TypeManifest, etc.
    Num  int64     // unique file number (monotonically increasing)
}

A LevelDB database directory looks like:

000003.log       ← journal (WAL)
000004.ldb       ← SSTable level 0
000005.ldb       ← SSTable level 0
MANIFEST-000002  ← current manifest
CURRENT          ← "MANIFEST-000002\n"
LOCK             ← lockfile

When compaction happens:

1. LevelDB picks some SSTables, then merges and sorts them into a new SSTable.
2. It writes the new SSTable as a .tmp file (via Create with a TypeTemp FileDesc).
3. It renames the .tmp file to the final .ldb name (via Rename).
4. It updates the MANIFEST to list the new SSTable and de-list the old ones.
5. It removes the old SSTables (via Remove).

This is the temp-then-rename durability pattern: write the new file completely under a temporary name, then rename it into place. POSIX rename is atomic: readers see either the old name or the new name, never a partially written file. Our FDB implementation must replicate this property.
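For comparison, here is a minimal sketch of the same pattern on an ordinary POSIX filesystem, using only the Go standard library. The helper names writeFileAtomic and atomicDemo are made up for this illustration and are not part of goleveldb or fdbstorage.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file in the same directory as
// finalPath, syncs it, and then renames it over finalPath.
func writeFileAtomic(finalPath string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(finalPath), "*.tmp")
	if err != nil {
		return err
	}
	// Clean up the temp file on failure; after a successful rename the
	// temp name no longer exists and this Remove fails harmlessly.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // make the bytes durable before publishing
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), finalPath) // the atomic "publish" step
}

// atomicDemo writes a file atomically inside a scratch directory and
// returns what a reader sees afterwards.
func atomicDemo() (string, error) {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "000004.ldb")
	if err := writeFileAtomic(path, []byte("sstable bytes")); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := atomicDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println(got)
}
```

The temporary file lives in the same directory as the target because a rename is only atomic within one filesystem.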

3. The storage.Storage Interface — Dissected

Let’s look at each method and what it does:

Lock() (util.Releaser, error): Prevents two processes from opening the same database simultaneously. We implement this by writing a “lock” key to FDB. The Releaser clears it.

Create(fd FileDesc) (Writer, error) and Open(fd FileDesc) (Reader, error): Create starts a new file (for writing). Open opens an existing file (for reading). In FDB terms: Create returns a writer that buffers bytes; Open reads all chunks for the file into memory and returns a bytes.Reader.

Remove(fd FileDesc) error: Deletes a file. In FDB: clear the meta key and ClearRange over all chunk keys for this file.

Rename(oldfd, newfd FileDesc) error: Renames a file atomically. In FDB: copy all chunks from oldfd keys to newfd keys, then clear all oldfd keys, all in one transaction. This is the atomic rename.

GetMeta() (FileDesc, error) and SetMeta(fd FileDesc) error: Get/set the “current” file pointer, that is, which MANIFEST is current. In FDB: a single key (ns + tagManifest) stores the current FileDesc. This replaces the CURRENT file in LevelDB’s original design.

List(ft FileType) ([]FileDesc, error): Lists all files whose type matches the ft bitmask. We implement this as a range scan over the meta key prefix; each file has a meta key stored alongside its data.
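The copy-then-clear Rename can be modelled without a cluster. The sketch below is a toy, not the fdbstorage code: a mutex-guarded map stands in for FDB, and holding the mutex for the whole function stands in for "one transaction". The names kv, rename, and renameDemo are invented for this illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// kv is a toy stand-in for FDB: a map guarded by a mutex.
type kv struct {
	mu   sync.Mutex
	data map[string][]byte
}

// rename copies every key under oldPrefix to newPrefix and clears the old
// keys while holding the lock, so no reader can observe a half-moved file.
func (s *kv) rename(oldPrefix, newPrefix string) {
	s.mu.Lock()
	defer s.mu.Unlock()

	// Collect matching keys first so the map is not mutated while ranging.
	var old []string
	for k := range s.data {
		if strings.HasPrefix(k, oldPrefix) {
			old = append(old, k)
		}
	}
	for _, k := range old {
		s.data[newPrefix+strings.TrimPrefix(k, oldPrefix)] = s.data[k]
		delete(s.data, k)
	}
}

// renameDemo renames a two-chunk temp file and returns the sorted key list.
func renameDemo() string {
	s := &kv{data: map[string][]byte{
		"tmp/7/0":   []byte("chunk0"),
		"tmp/7/1":   []byte("chunk1"),
		"table/3/0": []byte("other file"),
	}}
	s.rename("tmp/7/", "table/7/")

	keys := make([]string, 0, len(s.data))
	for k := range s.data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return strings.Join(keys, ",")
}

func main() {
	fmt.Println(renameDemo())
}
```

In the real implementation the atomicity comes from the FDB transaction commit rather than from a lock, but the observable guarantee is the same: either all chunks live under the old name or all live under the new one.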

3.1 There Are No Filesystem Calls in the Data Path

This is the most important thing to understand about the implementation. LevelDB calls our methods thinking it is talking to a real filesystem. But inside every method, instead of OS syscalls, we make FDB transactions.

// What a normal disk-backed implementation would do:
os.Open(path)             // syscall — reads from disk
os.Create(path)           // syscall — writes to disk

// What fdbstorage does instead:
rt.Get(metaKey(fd))                                                // FDB read: does this file exist?
rt.GetRange(dataRange(fd), fdb.RangeOptions{}).GetSliceWithError() // FDB read: fetch all chunks
tr.Set(metaKey(fd), sizeBytes)                                     // FDB write: file metadata
tr.Set(chunkKey(fd, i), chunk)                                     // FDB write: file content

The only os package usage in the entire file is os.ErrNotExist, borrowed for its error semantics. There is no os.Open, no os.Create, no (*os.File).Read or Write, no file descriptor, no inode. The filesystem is entirely replaced by the key naming scheme.

The exact translation points are:

LevelDB expects	We return
`Create(fd)` → `io.WriteCloser`	`&writer{buf: bytes.Buffer{}}` — writes go to memory
`w.Sync()` / `w.Close()`	`flush()` — this is the first and only FDB write
`Open(fd)` → `io.ReadSeeker`	one FDB `ReadTransact` → `bytes.NewReader(buf)`
`Remove(fd)`	`tr.Clear` + `tr.ClearRange` in one transaction
`Rename(old,new)`	copy all keys + clear old keys in one transaction
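The "writes go to memory, flush on Sync or Close" row can be sketched as follows. This is an illustration under assumptions, not the actual writer from storage.go: the type bufferedWriter, its flush callback (which stands in for the FDB transaction), and the countFlushes helper are all hypothetical.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// bufferedWriter is a hypothetical stand-in for the guide's writer type.
// Write only appends to memory; Sync and Close hand the whole buffer to flush.
type bufferedWriter struct {
	buf    bytes.Buffer
	flush  func(full []byte) error // stands in for the FDB transaction
	closed bool
}

func (w *bufferedWriter) Write(p []byte) (int, error) {
	if w.closed {
		return 0, errors.New("write on closed writer")
	}
	return w.buf.Write(p)
}

// Sync persists a snapshot of everything written so far.
func (w *bufferedWriter) Sync() error { return w.flush(w.buf.Bytes()) }

// Close persists the final contents; closing twice is harmless.
func (w *bufferedWriter) Close() error {
	if w.closed {
		return nil
	}
	w.closed = true
	return w.flush(w.buf.Bytes())
}

// countFlushes runs a small write sequence and reports how many times flush
// was called and how large the last flushed snapshot was.
func countFlushes() (calls int, lastLen int) {
	w := &bufferedWriter{}
	w.flush = func(full []byte) error {
		calls++
		lastLen = len(full)
		return nil
	}
	w.Write([]byte("hello "))
	w.Write([]byte("world"))
	w.Sync() // first flush: 11 bytes
	w.Write([]byte("!"))
	w.Close() // second flush: all 12 bytes again
	return calls, lastLen
}

func main() {
	calls, lastLen := countFlushes()
	fmt.Println(calls, lastLen)
}
```

Three Write calls produce only two flushes, and the second flush carries the full 12-byte buffer rather than just the newly written byte.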

4. How We Map LevelDB Files to FDB Keys

4.1 FileDesc: (Type, Num) — Not a Filename

LevelDB never uses string filenames internally. Every file is a FileDesc struct — just two numbers:

type FileDesc struct {
    Type FileType  // what category of file
    Num  int64     // which one — a counter, never reused
}

The String() method (MANIFEST-000001, 000002.log, 000003.ldb) is a display-only format for humans and log messages. Those strings never appear in FDB keys — only the raw integer values do.

Num is assigned by LevelDB’s internal file number counter, which only ever increases. If file 3 is deleted and a new SSTable is created, it gets number 5, not 3. This makes (Type, Num) a safe unique identifier for a key:

{TypeJournal,  2}  →  000002.log         (WAL)
{TypeManifest, 1}  →  MANIFEST-000001
{TypeTable,    3}  →  000003.ldb         (first SSTable)
{TypeTable,    4}  →  000004.ldb
       ↑                                 file 3 deleted → next is 5, not 3
{TypeTable,    5}  →  000005.ldb

The four file types and their byte values:

Constant	Value	Purpose
`TypeManifest`	`0x01`	Tracks which SSTables are live
`TypeJournal`	`0x02`	Write-Ahead Log — records every write before memtable flush
`TypeTable`	`0x04`	SSTable — sorted, immutable on-disk data file
`TypeTemp`	`0x08`	Scratch file used during compaction, then renamed
`TypeAll`	`0x0F`	Bitmask combining all four — used in `List()` filter

4.2 Namespace = Your Database Name

The ns prefix you pass to New() is the entire database identity. Two Storage instances with different namespaces share the same FDB cluster but never see each other’s keys:

storA := fdbstorage.New(db, "alice")  // all keys start with 61 6c 69 63 65
storB := fdbstorage.New(db, "bob")    // all keys start with 62 6f 62

Both write SSTable num=3 (TypeTable=4). Their FDB keys are completely different:

alice's SSTable:  61 6c 69 63 65  01  04  00 00 00 00 00 00 00 03
                  └── "alice" ──┘
bob's   SSTable:  62 6f 62  01  04  00 00 00 00 00 00 00 03
                  └─ "bob" ┘

FDB sees them as unrelated keys in a flat sorted list. No schema, no directory, no tenant table — just different byte prefixes.

4.3 Your User Data is NOT an FDB Key

This surprises most people. When you write:

db.Put([]byte("apple"),  []byte("red"),    nil)
db.Put([]byte("banana"), []byte("yellow"), nil)
db.Put([]byte("cherry"), []byte("red"),    nil)

apple, banana, and cherry never appear as FDB keys. LevelDB packs them together into an SSTable file (a sorted binary format), and our flush() then stores that binary blob across FDB values of at most 64 KiB each:

User writes:   apple → red
               banana → yellow          ← LevelDB holds these in memory
               cherry → red
                     ↓  (on db.Close or memtable flush)
LevelDB creates one SSTable file: {TypeTable, Num=3}
                     ↓
FDB sees:
  "mydb" 01 04 0000000000000003  →  size (8 bytes)      ← file exists
  "mydb" 02 04 0000000000000003 00000000  →  <SSTable binary blob>
            ↑                   ↑            ↑ the blob contains apple+banana+cherry
            TypeTable=4         chunk 0        encoded in LevelDB's internal format

FDB stores opaque bytes. It has no idea what’s inside the chunk values. The apple/banana/cherry data is invisible to FDB — it can only be decoded by LevelDB when it reads the SSTable back.
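Cutting the blob into chunk values is simple slicing. A minimal sketch, assuming the 64 KiB chunk size used in this guide; splitChunks is an illustrative name, not necessarily the helper used in storage.go.

```go
package main

import "fmt"

const chunkSize = 64 * 1024 // 64 KiB, the chunk size used throughout this guide

// splitChunks cuts blob into consecutive slices of at most chunkSize bytes.
// The slices alias blob; nothing is copied. An empty blob yields no chunks.
func splitChunks(blob []byte) [][]byte {
	var chunks [][]byte
	for len(blob) > 0 {
		n := chunkSize
		if len(blob) < n {
			n = len(blob)
		}
		chunks = append(chunks, blob[:n])
		blob = blob[n:]
	}
	return chunks
}

func main() {
	blob := make([]byte, 150000) // pretend this is an SSTable
	for i, c := range splitChunks(blob) {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```

A 150,000-byte blob becomes two full 65,536-byte chunks followed by one 18,928-byte remainder; each slice would become the value of one chunk key.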

4.4 How `db.Get("bob")` Actually Works — The Two-Level Lookup

If FDB only stores opaque SSTable blobs, how does LevelDB ever find bob? The answer is a two-level search: LevelDB decides which file, then FDB delivers that file’s bytes, then LevelDB searches inside the bytes.

db.Get([]byte("bob"))
         │
         ▼
╔══════════════════════════════════════╗
║  LEVEL 1: LevelDB finds the FILE     ║  ← no FDB I/O yet
╠══════════════════════════════════════╣
║  1. Check memtable (unflushed writes)
║  2. Read MANIFEST → live SSTables: #3, #5, #7
║  3. Check each SSTable's Bloom filter
║     → "bob is probably in SSTable #5"
║  4. Read SSTable #5's index block
║     → "bob lives in the block at byte offset 4096"
╚══════════════════════════════════════╝
         │
         ▼  stor.Open(FileDesc{TypeTable, 5}) ← FDB call here
╔════════════════════════════════════════╗
║  LEVEL 2: FDB reassembles the file     ║
╠════════════════════════════════════════╣
║  Range scan: ns 02 04 0000000000000005 *
║   chunk 0 → 64 KB of SSTable bytes
║   chunk 1 → 64 KB of SSTable bytes
║   chunk 2 → remainder
║  → reassemble → bytes.Reader
╚════════════════════════════════════════╝
         │
         ▼  (back in LevelDB, no more FDB I/O)
  binary-search SSTable at byte offset 4096
  find key "bob" → return value "smith"

What the FDB keys encode is only the file address, never the user key:

ns + 0x02 + 0x04 + 0000000000000005 + 00000000
│    │      │      │                  └─ chunk index
│    │      │      └─ file number (5)
│    │      └─ FileType (0x04 = TypeTable)
│    └─ tag byte (0x02 = file data)
└─ namespace ("mydb")

bob is packed inside the value bytes of those chunks, in LevelDB’s own sorted-block binary format, alongside every other key in that SSTable.

The three layers of indexing, summarized:

Layer	What it indexes	How it searches
FDB	File #5, chunks 0–N	Key range scan `ns 02 04 0000000000000005 *`
LevelDB	“bob is in file #5”	MANIFEST + Bloom filter + index block
SSTable binary	“bob is at block offset 4096”	Binary search on sorted key blocks

FDB is a file store. LevelDB is a key–value store built on top of it. The bob → smith lookup is entirely LevelDB’s responsibility. FDB just delivers SSTable #5’s bytes on demand.

4.5 Many Writes, One Flush — How Records Accumulate

Say you write 22 records:

db.Put([]byte("alex"), []byte("20"), nil)
db.Put([]byte("bob"),  []byte("25"), nil)
// ... 20 more

Do each of those create a new FDB key? No. Here is the full path:

db.Put("alex", "20")
db.Put("bob",  "25")      ← both go here immediately:
db.Put(...)               ↓
                    ┌─────────────────────┐
                    │  WAL (TypeJournal)  │  ← append each record as binary
                    │  in-memory buffer   │     Sync() → written to FDB
                    └─────────────────────┘
                    ┌─────────────────────┐
                    │  Memtable           │  ← sorted in-memory skip-list
                    │  alex=20, bob=25... │     all 22 records live here
                    └─────────────────────┘
                          │ (memtable full, ~4 MB, or db.Close())
                          ▼
                    LevelDB flushes the ENTIRE memtable
                    as ONE SSTable file (TypeTable, Num=3)
                          │
                          ▼  stor.Create({TypeTable,3}) → flush() → FDB
       ┌──────────────────────────────────────────────────────────────┐
       │  FDB keys written (all in one or two transactions):          │
       │  ns 01 04 0000000000000003          → meta (size)            │
       │  ns 02 04 0000000000000003 00000000 → chunk 0: sorted blob   │
       │           ↑ all 22 records (alex, bob, ...) packed in here   │
       └──────────────────────────────────────────────────────────────┘

Key rules:

Rule	Why
All 22 records go into the same SSTable	One memtable → one flush → one file
The SSTable is written once, in full	`flush()` writes the whole buffer when LevelDB calls `Sync()` or `Close()`
The SSTable is immutable forever after	SSTables are never modified
A new write to `"alex"` does NOT update SSTable #3	It goes into the next memtable → SSTable #5
Two versions of `"alex"` can exist simultaneously	LevelDB uses sequence numbers to pick the newest

What about the WAL?

The WAL (TypeJournal) is different: it is append-like. As you call db.Put(), LevelDB appends binary records to its journal writer. In our FDB implementation, every Sync() call rewrites the journal’s chunks from scratch (chunk 0, 1, 2, ...) because our writer accumulates into a bytes.Buffer and flush() always writes the full buffer. There is no partial append in FDB; we overwrite all chunks with the latest snapshot of the buffer.

after db.Put("alex"):  WAL buffer = [record-for-alex]
                       Sync() → FDB chunk 0 = [record-for-alex] (32 bytes)

after db.Put("bob"):   WAL buffer = [record-for-alex | record-for-bob]
                       Sync() → FDB chunk 0 = [record-for-alex | record-for-bob]
                                 (entire buffer, overwriting previous chunk 0)

Once the memtable is flushed to an SSTable, the WAL is deleted (it’s no longer needed for recovery — the data is in the immutable SSTable).
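The full-buffer rewrite has a cost worth quantifying. The sketch below is back-of-the-envelope arithmetic, assuming one Sync per record and a fixed 32-byte record as in the example above; walBytesWritten is an illustrative helper, not part of the codebase.

```go
package main

import "fmt"

// walBytesWritten compares two ways of persisting n records of recSize bytes
// with one Sync per record: rewriting the whole buffer on every Sync versus
// a true append that writes only the new record.
func walBytesWritten(n, recSize int) (rewrite, appendOnly int) {
	for i := 1; i <= n; i++ {
		rewrite += i * recSize // Sync number i writes all i records again
		appendOnly += recSize  // an append would write only the new record
	}
	return rewrite, appendOnly
}

func main() {
	rewrite, appendOnly := walBytesWritten(1000, 32)
	fmt.Printf("rewrite: %d bytes, append-only: %d bytes\n", rewrite, appendOnly)
}
```

For 1,000 records the rewrite strategy sends 16,016,000 bytes to FDB where an append would send 32,000. The total grows quadratically with the number of synced records, which is tolerable only because the journal is discarded at every memtable flush.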

Each LevelDB file is identified by (FileType, FileNum). We encode this as:

Meta key (file existence + size):
  ns + tagFileMeta(0x01) + fileType(1 byte) + fileNum(8 bytes BE)
  → total file size (uint64, 8 bytes BE)

Data chunks:
  ns + tagFileData(0x02) + fileType(1 byte) + fileNum(8 bytes BE) + chunkNum(4 bytes BE)
  → up to 64 KiB of file data

Manifest pointer (replaces CURRENT file):
  ns + tagManifest(0x03)
  → fileType(1 byte) + fileNum(8 bytes BE) of the current MANIFEST

Lock key:
  ns + tagLock(0x04)
  → presence of the key means locked; absence means unlocked

Why big-endian for file numbers?

Big-endian encoding preserves sort order. FDB orders keys lexicographically, byte by byte, and for fixed-width unsigned integers that byte order matches numeric order only when the most significant byte comes first. The same property applies to the chunk index: a range scan over ns+tagFileData+ft+num+* returns chunks in chunk-number order, which is the correct order to reassemble the file. With little-endian encoding, chunk 256 (00 01 00 00) would sort before chunk 1 (01 00 00 00), and the reassembled file would be scrambled.
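The ordering claim is easy to check with the standard library. In this sketch, be, le, and sortsBefore are illustrative helpers; bytes.Compare performs the same byte-wise comparison that FDB applies to keys.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// be encodes n as 4 big-endian bytes (most significant byte first).
func be(n uint32) []byte {
	b := make([]byte, 4)
	binary.BigEndian.PutUint32(b, n)
	return b
}

// le encodes n as 4 little-endian bytes (least significant byte first).
func le(n uint32) []byte {
	b := make([]byte, 4)
	binary.LittleEndian.PutUint32(b, n)
	return b
}

// sortsBefore reports whether a orders before b in a byte-wise comparison.
func sortsBefore(a, b []byte) bool { return bytes.Compare(a, b) < 0 }

func main() {
	// Big-endian: 00 00 00 01 vs 00 00 01 00, so 1 sorts before 256.
	fmt.Println("big-endian, 1 before 256:   ", sortsBefore(be(1), be(256)))
	// Little-endian: 01 00 00 00 vs 00 01 00 00, so 256 sorts before 1.
	fmt.Println("little-endian, 1 before 256:", sortsBefore(le(1), le(256)))
}
```

The first line reports true and the second false, which is exactly the scrambling a little-endian chunk index would cause in a range scan.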

File number and type as part of the key:

This means all chunks of file (TypeJournal, 3) sort together, before all chunks of (TypeJournal, 4), which sort before (TypeTable, 5). Clean, hierarchical key organization.

5. Key Layout Deep Dive: From Function Call to FDB Bytes

This section traces the complete journey of every operation — from a Go function call in storage.go down to the raw bytes written in FDB. If you read only one section, read this one.

5.1 The Four Tag Bytes

const (
    tagFileMeta byte = 0x01  // "does this file exist, and how big is it?"
    tagFileData byte = 0x02  // "here are the actual file contents"
    tagManifest byte = 0x03  // "which MANIFEST file is currently active?"
    tagLock     byte = 0x04  // "is this database open by a process?"
)

FDB is a flat key→value store. There are no tables, folders, or schemas — just a sorted sequence of byte keys. To store four conceptually different things (file metadata, file data chunks, the manifest pointer, the lock) without collisions, the very first byte after the namespace prefix tells you what kind of record you are looking at. This is the tag byte.

Think of it like a URL path prefix:

/meta/... → tag 0x01
/data/... → tag 0x02
/manifest → tag 0x03
/lock → tag 0x04

5.2 Anatomy of Every Key Type

ns = "mydb"  (4 bytes: 0x6D 0x79 0x64 0x62)

── tagFileMeta (0x01) ──────────────────────────────────────────────────────
Key:   6D 79 64 62  01  04  00 00 00 00 00 00 00 03
       └── ns ───┘  ↑   ↑   └────── num (int64 BE) ──┘
                    │   └── fd.Type = TypeTable (4 = 0x04)
                    └── tagFileMeta
Value: 00 00 00 00 00 20 00 00   (uint64 BE = 2097152 = 2 MB)

── tagFileData (0x02) ──────────────────────────────────────────────────────
Key:   6D 79 64 62  02  04  00 00 00 00 00 00 00 03  00 00 00 01
       └── ns ───┘  ↑   ↑   └────── num (int64 BE) ──┘  └chunk┘
                    │   └── fd.Type = TypeTable (4)
                    └── tagFileData
Value: <65536 bytes of SSTable data, chunk index 1>

── tagManifest (0x03) ──────────────────────────────────────────────────────
Key:   6D 79 64 62  03
       └── ns ───┘  └── tagManifest (no other fields — there is only one)
Value: 01  00 00 00 00 00 00 00 01
       ↑   └────── Num = 1 (int64 BE) ──────────────────────────────────┘
       └── fd.Type = TypeManifest (1 = 0x01)

── tagLock (0x04) ──────────────────────────────────────────────────────────
Key:   6D 79 64 62  04
       └── ns ───┘  └── tagLock
Value: (empty — presence of the key = locked; absent = unlocked)
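The four layouts above translate into a few lines of Go. This is a sketch under assumptions: the tag constants come from section 5.1, but the function signatures here take the namespace and file type as plain arguments so the example is self-contained, and hexKey is a hypothetical debugging helper. The real methods in storage.go hang off *Storage and take a storage.FileDesc.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

const (
	tagFileMeta byte = 0x01
	tagFileData byte = 0x02
	tagManifest byte = 0x03
	tagLock     byte = 0x04
)

// fileKey builds ns + tag + fileType + fileNum (8 bytes, big-endian).
func fileKey(ns string, tag, fileType byte, num int64) []byte {
	k := make([]byte, 0, len(ns)+2+8+4)
	k = append(k, ns...)
	k = append(k, tag, fileType)
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], uint64(num))
	return append(k, n[:]...)
}

func metaKey(ns string, fileType byte, num int64) []byte {
	return fileKey(ns, tagFileMeta, fileType, num)
}

// chunkKey appends a 4-byte big-endian chunk index to the data-band file key.
func chunkKey(ns string, fileType byte, num int64, idx uint32) []byte {
	var c [4]byte
	binary.BigEndian.PutUint32(c[:], idx)
	return append(fileKey(ns, tagFileData, fileType, num), c[:]...)
}

func manifestKey(ns string) []byte { return append([]byte(ns), tagManifest) }
func lockKey(ns string) []byte     { return append([]byte(ns), tagLock) }

// hexKey renders a key the way the dumps in this guide do, minus the spaces.
func hexKey(k []byte) string { return hex.EncodeToString(k) }

func main() {
	const typeTable = 0x04
	fmt.Println(hexKey(metaKey("mydb", typeTable, 3)))
	fmt.Println(hexKey(chunkKey("mydb", typeTable, 3, 1)))
	fmt.Println(hexKey(manifestKey("mydb")))
	fmt.Println(hexKey(lockKey("mydb")))
}
```

The four printed keys match the hex shown in the anatomy above once the spaces are removed.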

5.3 Journey: `db.Put("hello", "world")` → FDB keys

Here is the full call chain from a single LevelDB write to the bytes that land in FDB.

User code
  db.Put([]byte("hello"), []byte("world"), nil)
    │
    ▼ goleveldb internal: the record is appended to the journal first,
    │  through our storage interface:
    ├─ stor.Create(FileDesc{Type: TypeJournal, Num: 2})   ← once, when the journal is created
    │    └─ returns &writer{fd: {TypeJournal, 2}}
    │
    ├─ writer.Write(journalRecord)    ← buffered in writer.buf
    ├─ writer.Sync()                  ← calls flush() (when goleveldb syncs the journal)
    │    └─ fdb.Transact:
    │         tr.ClearRange(dataRange({TypeJournal, 2}))        ← wipe old
    │         tr.Set(metaKey({TypeJournal, 2}), size_8bytes)    ← tag 0x01
    │         tr.Set(chunkKey({TypeJournal, 2}, 0), chunk0)     ← tag 0x02
    │
    ▼ goleveldb internal
  memTable.Put(...)          ← stored in memory only
    │
    ▼  (later, on MemTable flush or db.Close)
  writer.Close() on the journal; the memtable is written out as an SSTable

After the Sync, two FDB keys exist for the journal:

"mydb" 0x01 0x02 0x00000000_00000002   →  size (8 bytes)
"mydb" 0x02 0x02 0x00000000_00000002 0x00000000  →  journal bytes

(The 0x02 that follows the tag byte in both keys is the file type, TypeJournal = 2, not tagFileData.)

5.4 Journey: `stor.Open(FileDesc{TypeJournal, 2})`

When goleveldb recovers after a restart it calls Open on each journal file it found via List. Here is what happens:

// In storage.go: Open()
func (s *Storage) Open(fd storage.FileDesc) (storage.Reader, error) {
    v, err := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {

        // Step 1: fetch the meta key (tag 0x01)
        // Key: "mydb" 0x01 0x02 0x00000000_00000002
        meta, err := rt.Get(s.metaKey(fd)).Get()
        if err != nil {
            return nil, err
        }
        if meta == nil {
            return nil, os.ErrNotExist   // no meta key means the file does not exist
        }
        // meta == []byte{0x00,0x00,0x00,0x00,0x00,0x00,0x04,0x00}  (1024 bytes)
        size := binary.BigEndian.Uint64(meta)   // = 1024

        // Step 2: range-scan all chunk keys (tag 0x02), i.e. every key that
        // starts with "mydb" 0x02 0x02 0x00000000_00000002
        kvs, err := rt.GetRange(s.dataRange(fd), fdb.RangeOptions{}).GetSliceWithError()
        if err != nil {
            return nil, err
        }
        // kvs[0].Key   = "mydb" 0x02 0x02 ... 0x00000000  (chunk 0)
        // kvs[0].Value = <1024 bytes>

        buf := make([]byte, 0, size)
        for _, kv := range kvs {
            buf = append(buf, kv.Value...)   // reassemble from chunks in order
        }
        return buf, nil
    })
    if err != nil {
        return nil, err
    }
    return &reader{bytes.NewReader(v.([]byte))}, nil
}

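Two FDB reads inside one ReadTransact, so both see the same consistent snapshot. The Get must complete before the range read is issued, so the two reads are sequential rather than a single round-trip.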
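Because the meta value records the total size, the reassembled buffer can be checked against it. The sketch below adds that consistency check as a suggestion; it is not something the Open code above does, and reassemble is a hypothetical helper that works on plain byte slices rather than FDB results.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// reassemble concatenates chunk values in the order given and checks the
// result against the size recorded in the 8-byte big-endian meta value.
func reassemble(meta []byte, chunks [][]byte) ([]byte, error) {
	if len(meta) != 8 {
		return nil, fmt.Errorf("meta value is %d bytes, want 8", len(meta))
	}
	size := binary.BigEndian.Uint64(meta)

	var buf []byte
	for _, c := range chunks {
		buf = append(buf, c...)
	}
	if uint64(len(buf)) != size {
		return nil, fmt.Errorf("reassembled %d bytes, meta says %d", len(buf), size)
	}
	return buf, nil
}

func main() {
	meta := []byte{0, 0, 0, 0, 0, 0, 0, 5} // size = 5
	data, err := reassemble(meta, [][]byte{[]byte("hel"), []byte("lo")})
	fmt.Printf("%q %v\n", data, err)

	// A missing chunk is detected instead of silently returning a short file.
	_, err = reassemble(meta, [][]byte{[]byte("hel")})
	fmt.Println(err)
}
```

A mismatch would indicate a torn write or a bug in the chunking code, so failing loudly is safer than handing LevelDB a truncated file.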

5.5 Journey: `stor.List(TypeAll)`

Called at startup. goleveldb needs to know every file that exists so it can decide which journal files to replay and which SSTables are live.

// In storage.go: List()
func (s *Storage) List(ft storage.FileType) ([]storage.FileDesc, error) {
    // Scan ONLY the tag-0x01 band; never touch the data keys.
    allMeta := fdb.KeyRange{
        Begin: fdb.Key(append([]byte(s.ns), tagFileMeta)),      // "mydb" 0x01
        End:   fdb.Key(append([]byte(s.ns), tagFileMeta+1)),    // "mydb" 0x02
    }
    // Returns all keys that start with "mydb" 0x01 ...
    // Example keys returned:
    //   "mydb" 0x01 0x01 0x00000000_00000001  → MANIFEST-000001
    //   "mydb" 0x01 0x02 0x00000000_00000002  → 000002.log
    //   "mydb" 0x01 0x04 0x00000000_00000003  → 000003.ldb
    //   "mydb" 0x01 0x04 0x00000000_00000004  → 000004.ldb

    v, err := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(allMeta, fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }

    var out []storage.FileDesc
    prefixLen := len(s.ns) + 2   // skip: ns + tagFileMeta + ftype byte
    for _, kv := range v.([]fdb.KeyValue) {
        k := []byte(kv.Key)
        thisFT := storage.FileType(k[len(s.ns)+1])    // byte after the tag
        if ft&thisFT == 0 { continue }                // bitmask filter
        num := int64(binary.BigEndian.Uint64(k[prefixLen : prefixLen+8]))
        out = append(out, storage.FileDesc{Type: thisFT, Num: num})
    }
    return out, nil
}

One range scan, one round-trip. Notice that the data chunks (tag 0x02) are never touched — List only reads the tiny meta keys (tag 0x01). This is why having separate tags for metadata vs data matters: you can enumerate all files without reading any file contents.
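The key-parsing half of List can be exercised without FDB. In this sketch the FileType and FileDesc types are local stand-ins that mirror the goleveldb values from section 4.1, and parseMetaKey is an illustrative helper that adds length and prefix checks the loop above leaves out.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// FileType mirrors the bit-flag values from the table in section 4.1.
type FileType int

const (
	TypeManifest FileType = 1 << iota // 0x01
	TypeJournal                       // 0x02
	TypeTable                         // 0x04
	TypeTemp                          // 0x08

	TypeAll = TypeManifest | TypeJournal | TypeTable | TypeTemp // 0x0F
)

// FileDesc is a local stand-in for storage.FileDesc.
type FileDesc struct {
	Type FileType
	Num  int64
}

// parseMetaKey decodes ns + 0x01 + fileType + fileNum (8 bytes, big-endian).
// ok is false if the key has the wrong length or prefix, or if its file type
// is not selected by the want bitmask.
func parseMetaKey(ns string, key []byte, want FileType) (fd FileDesc, ok bool) {
	if len(key) != len(ns)+2+8 || string(key[:len(ns)]) != ns || key[len(ns)] != 0x01 {
		return FileDesc{}, false
	}
	ft := FileType(key[len(ns)+1])
	if want&ft == 0 {
		return FileDesc{}, false
	}
	num := int64(binary.BigEndian.Uint64(key[len(ns)+2:]))
	return FileDesc{Type: ft, Num: num}, true
}

func main() {
	key := []byte("mydb\x01\x04\x00\x00\x00\x00\x00\x00\x00\x03") // meta key of 000003.ldb

	fd, ok := parseMetaKey("mydb", key, TypeAll)
	fmt.Println(fd.Type == TypeTable, fd.Num, ok)

	_, ok = parseMetaKey("mydb", key, TypeJournal) // filtered out by the bitmask
	fmt.Println(ok)
}
```

Because the file types are single bits, one AND against the requested mask handles both "give me everything" (TypeAll) and "only journals" without a switch statement.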

5.6 Journey: `stor.SetMeta(FileDesc{TypeManifest, 1})`

Called by goleveldb after it writes a new MANIFEST file, to record “this is now the current MANIFEST”.

func (s *Storage) SetMeta(fd storage.FileDesc) error {
    _, err := s.fdb.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // encodeFD packs 9 bytes: 1 byte type + 8 bytes num BE
        // fd = {TypeManifest=1, Num=1}
        // encoded = [0x01, 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x01]
        tr.Set(s.manifestKey(), encodeFD(fd))
        //      └── key: "mydb" 0x03 (just the ns + tag, no other fields)
        return nil, nil
    })
    return err
}

And the reverse, GetMeta():

func (s *Storage) GetMeta() (storage.FileDesc, error) {
    v, err := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.Get(s.manifestKey()).Get()   // key: "mydb" 0x03
    })
    if err != nil {
        return storage.FileDesc{}, err   // without this check, v.([]byte) below would panic
    }
    b := v.([]byte)
    if b == nil {
        return storage.FileDesc{}, os.ErrNotExist   // fresh DB: no manifest yet
    }
    // b = [0x01, 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x01]
    return decodeFD(b)   // → {TypeManifest, Num: 1}
}
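SetMeta and GetMeta rely on encodeFD and decodeFD, which are not shown in this section. The following is one plausible shape for them, assuming only the 9-byte layout stated in the comments above (1 type byte followed by the file number as 8 big-endian bytes). The FileDesc type here is a local stand-in, and the real helpers may differ in details such as error handling.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// FileDesc is a local stand-in for storage.FileDesc; Type holds the
// file-type byte (0x01 = manifest).
type FileDesc struct {
	Type byte
	Num  int64
}

// encodeFD packs a FileDesc into 9 bytes: 1 type byte + 8 bytes num (BE).
func encodeFD(fd FileDesc) []byte {
	b := make([]byte, 9)
	b[0] = fd.Type
	binary.BigEndian.PutUint64(b[1:], uint64(fd.Num))
	return b
}

// decodeFD is the inverse of encodeFD and rejects values of the wrong length.
func decodeFD(b []byte) (FileDesc, error) {
	if len(b) != 9 {
		return FileDesc{}, errors.New("manifest pointer must be exactly 9 bytes")
	}
	return FileDesc{Type: b[0], Num: int64(binary.BigEndian.Uint64(b[1:]))}, nil
}

func main() {
	enc := encodeFD(FileDesc{Type: 0x01, Num: 1})
	fmt.Printf("% x\n", enc)

	dec, err := decodeFD(enc)
	fmt.Println(dec, err)
}
```

The length check matters on the read path: a truncated or foreign value under the manifest key should surface as an error rather than decode into a bogus file number.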

5.7 What the FDB keyspace looks like for a small database

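After running demo/main.go and writing a few keys, the entire FDB namespace "mydb" might look like this (printed as hexdump | ascii, with each file’s meta key shown next to its chunks for readability; FDB itself keeps the keys in the sorted order described under the dump):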

KEY                                                     VALUE
───────────────────────────────────────────────────────────────────────────
mydb·03                                                 \x01·\x00\x00\x00\x00\x00\x00\x00\x01
                                                        ↑ MANIFEST pointer → {TypeManifest, 1}

mydb·01·01·\x00\x00\x00\x00\x00\x00\x00\x01           \x00\x00\x00\x00\x00\x00\x04\xA0
                                                        ↑ meta for MANIFEST-000001, size=1184

mydb·02·01·\x00\x00\x00\x00\x00\x00\x00\x01·\x00\x00\x00\x00   <1184 bytes>
                                                        ↑ MANIFEST-000001 data, chunk 0

mydb·01·02·\x00\x00\x00\x00\x00\x00\x00\x02           \x00\x00\x00\x00\x00\x00\x10\x00
                                                        ↑ meta for 000002.log, size=4096

mydb·02·02·\x00\x00\x00\x00\x00\x00\x00\x02·\x00\x00\x00\x00   <4096 bytes>
                                                        ↑ 000002.log data, chunk 0

mydb·01·04·\x00\x00\x00\x00\x00\x00\x00\x03           \x00\x00\x00\x00\x00\x20\x00\x00
                                                        ↑ meta for 000003.ldb, size=2MB

mydb·02·04·\x00\x00\x00\x00\x00\x00\x00\x03·\x00\x00\x00\x00   <65536 bytes>
mydb·02·04·\x00\x00\x00\x00\x00\x00\x00\x03·\x00\x00\x00\x01   <65536 bytes>
...                                                     ↑ 000003.ldb data, 32 chunks
mydb·02·04·\x00\x00\x00\x00\x00\x00\x00\x03·\x00\x00\x00\x1F   <last chunk>

Key observations:

All 0x01 (meta) keys sort before all 0x02 (data) keys — because 0x01 < 0x02.
Within the 0x01 band, TypeManifest (0x01) sorts before TypeJournal (0x02) sorts before TypeTable (0x04) — because their type bytes are in numeric order.
Chunks within a file sort by chunk index — because BE integers preserve numeric order.
List() range-scans only [mydb·01, mydb·02) — skipping all chunk data entirely.

5.8 Verifying with a test

Here is a self-contained Go test that exercises every code path above and lets you inspect the raw FDB keys to verify they match the byte layout:

// fdbstorage/storage_layout_test.go
package fdbstorage_test

import (
    "bytes"
    "encoding/hex"
    "fmt"
    "testing"

    "github.com/apple/foundationdb/bindings/go/src/fdb"
    "github.com/syndtr/goleveldb/leveldb/storage"

    fdbstorage "github.com/your-module/fdbstorage"
)

func TestKeyLayout(t *testing.T) {
    fdb.MustAPIVersion(620)
    db := fdb.MustOpenDefault()

    stor := fdbstorage.New(db, "testlayout")
    stor.Wipe() // always start from a clean slate

    // ── 1. Write a small file ────────────────────────────────────────────
    fd := storage.FileDesc{Type: storage.TypeJournal, Num: 7}
    w, err := stor.Create(fd)
    if err != nil {
        t.Fatal(err)
    }
    payload := []byte("hello from the journal")
    w.Write(payload)
    if err := w.Close(); err != nil {
        t.Fatal(err)
    }

    // ── 2. SetMeta (manifest pointer) ───────────────────────────────────
    manifest := storage.FileDesc{Type: storage.TypeManifest, Num: 1}
    if err := stor.SetMeta(manifest); err != nil {
        t.Fatal(err)
    }

    // ── 3. Dump all raw FDB keys in the namespace ────────────────────────
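    // Expected keys, in FDB key order (spaces added here for readability;
    // hex.EncodeToString prints each key as one unbroken string):
    //   74657374 6c61796f 7574 01 02 0000000000000007          ← meta key
    //   74657374 6c61796f 7574 02 02 0000000000000007 00000000 ← data chunk 0
    //   74657374 6c61796f 7574 03                              ← manifest key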
    allKeys, err := stor.DumpKeys()
    if err != nil {
        t.Fatal(err)
    }
    fmt.Println("\n=== raw FDB keys in namespace ===")
    for _, k := range allKeys {
        fmt.Printf("  %s\n", hex.EncodeToString(k))
    }
    if len(allKeys) != 3 {
        t.Fatalf("want 3 keys (manifest + meta + 1 chunk), got %d", len(allKeys))
    }

    // ── 4. List round-trip ───────────────────────────────────────────────
    fds, err := stor.List(storage.TypeAll)
    if err != nil {
        t.Fatal(err)
    }
    if len(fds) != 1 || fds[0] != fd {
        t.Fatalf("List: want [%v], got %v", fd, fds)
    }

    // ── 5. Open round-trip (bytes must match) ────────────────────────────
    r, err := stor.Open(fd)
    if err != nil {
        t.Fatal(err)
    }
    buf := make([]byte, len(payload))
    if _, err := r.Read(buf); err != nil {
        t.Fatal(err)
    }
    if !bytes.Equal(buf, payload) {
        t.Fatalf("Open: want %q, got %q", payload, buf)
    }

    // ── 6. GetMeta round-trip ────────────────────────────────────────────
    got, err := stor.GetMeta()
    if err != nil {
        t.Fatal(err)
    }
    if got != manifest {
        t.Fatalf("GetMeta: want %v, got %v", manifest, got)
    }

    // ── 7. Remove clears both meta and data keys ─────────────────────────
    if err := stor.Remove(fd); err != nil {
        t.Fatal(err)
    }
    remaining, _ := stor.DumpKeys()
    // Only the manifest key should remain
    if len(remaining) != 1 {
        t.Fatalf("after Remove: want 1 key, got %d: %v", len(remaining),
            func() []string {
                var s []string
                for _, k := range remaining {
                    s = append(s, hex.EncodeToString(k))
                }
                return s
            }())
    }

    stor.Wipe()
}

To use DumpKeys you need to add this helper to fdbstorage/storage.go:

// DumpKeys returns all raw FDB keys in this storage's namespace, in order.
// Intended for tests and debugging only.
func (s *Storage) DumpKeys() ([][]byte, error) {
    nsEnd := append(append([]byte{}, s.ns...), 0xff)
    v, err := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(fdb.KeyRange{
            Begin: fdb.Key(s.ns),
            End:   fdb.Key(nsEnd),
        }, fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    kvs := v.([]fdb.KeyValue)
    out := make([][]byte, len(kvs))
    for i, kv := range kvs {
        out[i] = []byte(kv.Key)
    }
    return out, nil
}

The test confirms:

Create + Close produces exactly one meta key (0x01) and one chunk key (0x02).
List finds the file by scanning only 0x01 keys.
Open reconstructs the payload exactly by reading the 0x02 chunk.
SetMeta / GetMeta round-trips through the single 0x03 key.
Remove clears both the 0x01 meta key and all 0x02 chunk keys, leaving only the 0x03 manifest pointer behind.

5.9 Live Simulation — Annotated Output

The demo/simulate.go program writes three files, dumps every raw FDB key, reads them back, and calls GetMeta + List. Run it with:

go run ./demo/ -sim

Here is the full output with every byte explained.

Phase 1 — Write

wrote  MANIFEST-000001        "MANIFEST-content"
wrote  000002.log             "WAL-entry-bytes"
wrote  000005.ldb             "SST-block-bytes"

Each call is:

stor.Create(fd)        // returns a *writer with an empty bytes.Buffer — no FDB I/O yet
w.Write([]byte(...))   // appends into writer.buf — still no FDB I/O
w.Close()              // calls flush() → fires fdb.Transact
                       //   tr.Set(metaKey(fd),  sizeBytes)  ← tag 0x01 key
                       //   tr.Set(chunkKey(fd,0), data)     ← tag 0x02 key

Nothing reaches FDB until Close() (or Sync()). LevelDB thinks it has a real file. FDB only learns about it at flush time.

Phase 2 — FDB Keyspace Dump

META   key=73696d0001010000000000000001  →  type=Manifest  num=1   size=16 B
META   key=73696d0001020000000000000002  →  type=Journal   num=2   size=15 B
META   key=73696d0001040000000000000005  →  type=Table     num=5   size=15 B

Byte-by-byte for the first META key:

73 69 6d 00        ← namespace "sim\x00"  (4 bytes)
           01      ← tagFileMeta  (the "directory entry" tag)
              01   ← fd.Type = TypeManifest = 1
                 00 00 00 00 00 00 00 01
                 └────── fd.Num = 1, big-endian int64 ───┘

Value: 00 00 00 00 00 00 00 10  ← uint64 BE = 16 — file size in bytes
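
The byte layout above can be reproduced with a few lines of standalone Go. This is a minimal sketch, not the fdbstorage code itself: metaKeyHex and sizeValueHex are hypothetical helper names, and the tag and type constants are copied from the dump.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// Tag and type bytes as they appear in the keyspace dump.
const (
	tagFileMeta  = 0x01
	typeManifest = 0x01
)

// metaKeyHex builds ns + tag + type + num (8 bytes, big-endian) and
// returns the key as a hex string. Hypothetical helper for illustration.
func metaKeyHex(ns []byte, fileType byte, num int64) string {
	key := make([]byte, 0, len(ns)+10)
	key = append(key, ns...)
	key = append(key, tagFileMeta, fileType)
	key = binary.BigEndian.AppendUint64(key, uint64(num))
	return hex.EncodeToString(key)
}

// sizeValueHex encodes a file size as a uint64 big-endian value.
func sizeValueHex(size uint64) string {
	return hex.EncodeToString(binary.BigEndian.AppendUint64(nil, size))
}

func main() {
	ns := []byte("sim\x00")
	fmt.Println("key:  ", metaKeyHex(ns, typeManifest, 1))
	fmt.Println("value:", sizeValueHex(16))
}
```

Comparing the printed key with the first META line of the dump is a quick way to confirm that a key-encoding change has not shifted any field.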

The three CHUNK keys follow immediately after in the sorted order:

CHUNK  key=73696d000201000000000000000100000000  →  type=Manifest  chunk=0  16 B

73 69 6d 00        ← namespace
           02      ← tagFileData  (actual content)
              01   ← TypeManifest
                 00 00 00 00 00 00 00 01   ← num=1
                                         00 00 00 00  ← chunk index 0 (uint32 BE)

Value: "MANIFEST-content"  (16 bytes, the raw file payload)

The sorted order in FDB is:

All META  (0x01) keys come first because 0x01 < 0x02
  ├── TypeManifest (0x01) meta    ← 0x01·0x01·...
  ├── TypeJournal  (0x02) meta    ← 0x01·0x02·...
  └── TypeTable    (0x04) meta    ← 0x01·0x04·...

All CHUNK (0x02) keys come next
  ├── TypeManifest (0x01) chunk 0 ← 0x02·0x01·...·0x00000000
  ├── TypeJournal  (0x02) chunk 0 ← 0x02·0x02·...·0x00000000
  └── TypeTable    (0x04) chunk 0 ← 0x02·0x04·...·0x00000000

MANIF (0x03) key  ← single key, no extra fields

The MANIF key:

MANIF  key=73696d0003  →  points to type=Manifest num=1

73 69 6d 00        ← namespace
           03      ← tagManifest  (that's it — only one manifest at a time)

Value: 01  00 00 00 00 00 00 00 01
       ↑   └────── Num = 1 ─────┘
       └── fd.Type = TypeManifest = 1

This is exactly what encodeFD produces and decodeFD expects.
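
For readers who want to see the 9-byte value format in isolation, here is a hedged sketch of an encode/decode pair. The function names match the ones mentioned above, but the bodies are an assumption reconstructed from the byte dump (1 type byte followed by an 8-byte big-endian number), and fileDesc is a local stand-in for goleveldb's storage.FileDesc.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// fileDesc is a local stand-in for the two fields of storage.FileDesc.
type fileDesc struct {
	Type byte
	Num  int64
}

// encodeFD packs a descriptor as: type (1 byte) + num (8 bytes, big-endian).
func encodeFD(fd fileDesc) []byte {
	b := make([]byte, 9)
	b[0] = fd.Type
	binary.BigEndian.PutUint64(b[1:], uint64(fd.Num))
	return b
}

// decodeFD reverses encodeFD and rejects values of the wrong length.
func decodeFD(b []byte) (fileDesc, error) {
	if len(b) != 9 {
		return fileDesc{}, errors.New("decodeFD: want exactly 9 bytes")
	}
	return fileDesc{Type: b[0], Num: int64(binary.BigEndian.Uint64(b[1:]))}, nil
}

func main() {
	raw := encodeFD(fileDesc{Type: 1, Num: 1}) // TypeManifest, num=1
	fmt.Printf("% x\n", raw)
	fd, err := decodeFD(raw)
	fmt.Println(fd.Type, fd.Num, err)
}
```

Rejecting short or long values in decodeFD means a corrupted manifest pointer surfaces as an error instead of a silently wrong file number.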

Phase 3 — Read Back

open   MANIFEST-000001        → "MANIFEST-content"
open   000002.log             → "WAL-entry-bytes"
open   000005.ldb             → "SST-block-bytes"

Each Open(fd) fires one FDB ReadTransact with two operations:

1. rt.Get(metaKey(fd))           ← fetch the 0x01 key  → confirms file exists + size
2. rt.GetRange(dataRange(fd))    ← fetch all 0x02 keys  → reassemble bytes in chunk order

Both happen inside the same snapshot. FDB guarantees you see a consistent view: no partial writes, no torn reads across chunks.

Phase 4 — GetMeta + List

GetMeta() → MANIFEST-000001

List(TypeAll):
  MANIFEST-000001
  000002.log
  000005.ldb

GetMeta() is one point-read on the single 0x03 key.

List(TypeAll) is one range scan over [ns·0x01, ns·0x02) — the entire META band. It decodes the type byte and num from each key, never touching a single CHUNK key:

thisFT := storage.FileType(k[len(s.ns)+1])   // byte at position: ns + tagByte + HERE
num    := int64(binary.BigEndian.Uint64(k[prefixLen : prefixLen+8]))

Three files → three META keys scanned → three FileDesc returned. The three CHUNK keys (one per file) are not even requested.

Summary table

Key hex prefix	Tag	What it represents	Read by
`73696d00·01·01·…`	`0x01`	MANIFEST file exists, size=N	`Open`, `List`
`73696d00·01·02·…`	`0x01`	Journal file exists, size=N	`Open`, `List`
`73696d00·01·04·…`	`0x01`	Table file exists, size=N	`Open`, `List`
`73696d00·02·01·…·chunk`	`0x02`	MANIFEST file content	`Open` only
`73696d00·02·02·…·chunk`	`0x02`	Journal file content	`Open` only
`73696d00·02·04·…·chunk`	`0x02`	Table file content	`Open` only
`73696d00·03`	`0x03`	Current MANIFEST pointer	`GetMeta` only

6. Chunking: Overcoming the 100 KiB Value Limit

FDB has a hard limit: a single value may not exceed 100,000 bytes (just under 100 KiB). A typical LevelDB SSTable is 2–4 MB. We cannot store it in one FDB value.

Our solution: split each file into 64 KiB chunks:

const chunkSize = 64 * 1024  // 65,536 bytes

// Writing a 200,000-byte file:
// Chunk 0: bytes [0, 65536)
// Chunk 1: bytes [65536, 131072)
// Chunk 2: bytes [131072, 196608)
// Chunk 3: bytes [196608, 200000)  (partial last chunk, 3,392 bytes)

Each chunk is stored as a separate FDB key-value pair:

ns+0x02+ft+num+00000000_00000000  → 65536 bytes
ns+0x02+ft+num+00000000_00000001  → 65536 bytes
ns+0x02+ft+num+00000000_00000002  → 65536 bytes
ns+0x02+ft+num+00000000_00000003  → 3392 bytes (partial)

Reading a file: range-scan all chunk keys for (ft, num), sort by chunk number (already in order due to big-endian encoding), concatenate the values.

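// memReader adapts *bytes.Reader to storage.Reader (Read, Seek, ReadAt, Close).
type memReader struct{ *bytes.Reader }

func (memReader) Close() error { return nil }

// Simplified: the full Open also reads metaKey(fd) and reports a missing file.
func (s *Storage) Open(fd storage.FileDesc) (storage.Reader, error) {
    v, err := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(s.dataRange(fd), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    var buf []byte
    for _, kv := range v.([]fdb.KeyValue) {
        buf = append(buf, kv.Value...) // chunks arrive in key order
    }
    return memReader{bytes.NewReader(buf)}, nil
}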

Why 64 KiB chunks?

Smaller than FDB’s 100 KiB value limit ✓
Large enough to minimize key overhead (a 4 MB SSTable = 64 chunks, not thousands)
Aligns with filesystem block sizes (4–64 KiB typical)
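
The split-and-reassemble step can be sketched without any FDB dependency. This is an illustrative example only: splitChunks and joinChunks are hypothetical names, and the 200,000-byte buffer is just a convenient size that does not divide evenly into 64 KiB chunks.

```go
package main

import "fmt"

const chunkSize = 64 * 1024 // 65,536 bytes

// splitChunks slices data into chunkSize pieces; the last piece may be
// shorter. The returned slices alias data, so nothing is copied.
func splitChunks(data []byte) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := chunkSize
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

// joinChunks concatenates chunks in order, the way a reader would after
// a range scan has returned them sorted by chunk index.
func joinChunks(chunks [][]byte) []byte {
	var out []byte
	for _, c := range chunks {
		out = append(out, c...)
	}
	return out
}

func main() {
	file := make([]byte, 200000)
	chunks := splitChunks(file)
	for i, c := range chunks {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
	fmt.Println("reassembled:", len(joinChunks(chunks)), "bytes")
}
```

Because every chunk except the last is exactly chunkSize bytes, a byte offset maps to a chunk index with a single integer division.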

7. Atomic Rename — Durability’s Secret Weapon

POSIX rename(src, dst) is the single most important durability primitive in filesystems. Its contract: after rename returns, dst exists and src does not, with no window where neither exists. This is atomic replacement.

LevelDB uses rename heavily:

Rename(TypeTemp, n, TypeTable, n): promote temp SSTable to final name
Rename(TypeTemp, n, TypeManifest, n): promote temp manifest

Without atomicity, a crash during rename could leave:

Neither file existing → data loss
Both files existing → ambiguity about which is current
A partial file at dst → corruption

In FDB, we implement atomic rename as copy + clear in one transaction:

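func (s *Storage) Rename(oldfd, newfd storage.FileDesc) error {
    _, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // 1. Read the old meta value (file size) and all old chunks
        metaBytes, err := tr.Get(s.metaKey(oldfd)).Get()
        if err != nil {
            return nil, err // returning the error lets Transact retry if appropriate
        }
        if metaBytes == nil {
            return nil, os.ErrNotExist // nothing to rename
        }
        kvs, err := tr.GetRange(s.dataRange(oldfd), fdb.RangeOptions{}).GetSliceWithError()
        if err != nil {
            return nil, err
        }

        // 2. Clear old data
        tr.ClearRange(s.dataRange(oldfd))
        tr.Clear(s.metaKey(oldfd))

        // 3. Write new data under the new descriptor
        for _, kv := range kvs {
            newKey := s.translateChunkKey(kv.Key, oldfd, newfd)
            tr.Set(newKey, kv.Value)
        }
        tr.Set(s.metaKey(newfd), metaBytes)
        return nil, nil
    })
    return err
}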

One transaction. The cluster either commits all of this (old is gone, new is present) or none of it (crash safety). The atomicity guarantee is identical to POSIX rename — and arguably stronger, since FDB replicates the commit across multiple machines before returning.
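
The key-rewriting step inside Rename can be illustrated in isolation. The sketch below is an assumption about how such a helper might work, not the actual fdbstorage implementation: it uses the chunk key layout from the simulation dump (4-byte chunk index), takes the namespace length as a parameter, and both function signatures are hypothetical.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

const tagFileData = 0x02

// chunkKey builds ns | tag | type | num (8 bytes BE) | chunk index (4 bytes BE).
func chunkKey(ns []byte, fileType byte, num int64, chunk uint32) []byte {
	key := append([]byte{}, ns...)
	key = append(key, tagFileData, fileType)
	key = binary.BigEndian.AppendUint64(key, uint64(num))
	return binary.BigEndian.AppendUint32(key, chunk)
}

// translateChunkKey returns a copy of oldKey with the type and num fields
// replaced, leaving the namespace, tag, and chunk index untouched.
func translateChunkKey(nsLen int, oldKey []byte, newType byte, newNum int64) []byte {
	newKey := append([]byte{}, oldKey...) // copy so the old key is not mutated
	newKey[nsLen+1] = newType
	binary.BigEndian.PutUint64(newKey[nsLen+2:nsLen+10], uint64(newNum))
	return newKey
}

func main() {
	ns := []byte("sim\x00")
	oldKey := chunkKey(ns, 0x08, 7, 1) // TypeTemp (0x08), num=7, chunk 1
	newKey := translateChunkKey(len(ns), oldKey, 0x04, 7) // promote to TypeTable (0x04)
	fmt.Println("old:", hex.EncodeToString(oldKey))
	fmt.Println("new:", hex.EncodeToString(newKey))
}
```

Only the type and num bytes change, so the chunks keep their relative order under the new descriptor.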

The 10 MB transaction limit:

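FDB limits a transaction to 10,000,000 bytes of affected data: the keys and values it writes, plus the keys and ranges it reads or clears (these are kept for conflict checking). A large SSTable (4 MB) would have chunks adding up to 4 MB of writes in one transaction. That’s under the 10 MB limit. But 8 MB SSTables would be risky.

Our Rename reads all chunks in the transaction and writes them all back. The values returned by reads do not count toward the limit, only the read range does, so budgeting 4 MB of reads plus 4 MB of writes (8 MB) is a deliberately conservative estimate. Safe for typical LevelDB files, provided the copy also completes within FDB’s 5-second transaction time limit.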

For larger files, we’d need to either:

Break the rename into multiple transactions (violating atomicity), or
Use a two-phase approach: write new chunks in a first transaction, then atomically swap the meta key in a second transaction (using a PENDING state key as the “in-progress rename” marker).

8. The Writer: Batching Chunks into Transactions

When LevelDB writes a new SSTable, it calls Create(fd) which returns a Writer. The writer accumulates bytes via Write(p []byte). When Close() is called, we flush everything to FDB.

Batch size:

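const maxChunksPerTx = 100  // 100 × 64 KiB = 6,400 KiB (6.25 MiB, about 6.55 MB) per transaction

We flush up to 100 chunks per FDB transaction. This stays within the 10 MB limit. A 20 MiB SSTable (320 chunks) would be flushed in 4 transactions: three full batches of 100 chunks (6.25 MiB each) and a final batch of 20 chunks (1.25 MiB).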

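// flush writes at most one batch of chunks; the caller loops until
// w.flushedChunks == len(w.chunks).
func (w *writer) flush(final bool) error {
    start := w.flushedChunks
    end := start + maxChunksPerTx
    if end > len(w.chunks) {
        end = len(w.chunks)
    }
    _, err := w.s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        for i := start; i < end; i++ {
            tr.Set(w.s.chunkKey(w.fd, i), w.chunks[i])
        }
        if final && end == len(w.chunks) {
            // Write the meta key only on the final flush.
            // Value: total file size, uint64 big-endian (w.size is assumed
            // to track the number of bytes written so far).
            metaBytes := make([]byte, 8)
            binary.BigEndian.PutUint64(metaBytes, uint64(w.size))
            tr.Set(w.s.metaKey(w.fd), metaBytes)
        }
        return nil, nil
    })
    if err != nil {
        return err // flushedChunks is unchanged, so the batch can be retried
    }
    w.flushedChunks = end
    return nil
}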

The meta-key-last invariant:

We write the meta key (the file’s “directory entry”) only in the last batch of chunks. This ensures that List() never returns a file whose chunks are only partially written — the file is only “visible” once all its chunks exist.

This is the FDB equivalent of:

Write all content to a temp file
rename(temp, final) atomically

9. Why the WAL is Redundant With FDB

LevelDB’s Write-Ahead Log (WAL / journal, TypeJournal) exists for one reason: crash recovery. If the process crashes after writing to the in-memory MemTable but before flushing the MemTable to an SSTable on disk, the WAL is replayed to reconstruct the MemTable.

With FDB as the storage backend:

Every write is already durable by the time Sync() or Close() returns.

Our writer.Write buffers bytes in memory. Our writer.Close (or Sync) flushes to FDB in transactions. Each FDB Transact call does not return until the commit is confirmed by FDB’s replication protocol: the data is on at least f+1 machines, where f is the number of machine failures the cluster’s redundancy mode tolerates (1 for double, 2 for triple). A process crash after Close() returns means the data is safe.

The WAL is protecting against “data written to OS memory but not yet on disk.” FDB’s Transact eliminates this window. By the time the WAL file is written through our fdbstorage.Writer, the bytes are already in FDB.

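A production implementation would patch goleveldb to skip WAL writes entirely, for example with a custom journal implementation that’s a no-op. goleveldb does not appear to expose an option for this (Options.DisableSeeksCompaction only turns off seek-triggered compaction and does not affect the journal). Skipping the journal would reduce FDB key usage and should improve write throughput, though the size of the gain needs to be measured.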

10. The Blob Layer Pattern

FDB’s core team documented the “Blob Layer” pattern: storing binary blobs (arbitrary large byte arrays) in FDB by chunking them. Our file storage is an instance of this pattern.

The Blob Layer pattern:

blob_key + chunkNum  →  chunk_data

It solves the 100 KiB value limit while preserving atomic operations on the whole blob (via FDB transactions) and efficient byte-range access (read only the chunks you need, e.g., for seeking within a large file).

Applications:

Store large media files (> 100 KiB) in FDB for atomic metadata-plus-content updates
Store ML model weights alongside their metadata records
Store serialized protocol buffers larger than 100 KiB
Back any file system abstraction (exactly what we’re doing)

11. Real-World Analogues

RocksDB Remote Compaction (Project Titan, Ripple)

Meta (Facebook) runs RocksDB on distributed storage in some configurations. Their “Ripple” project stores RocksDB SSTables in a distributed block store (similar to HDFS or GFS). The storage interface they use is exactly the same concept: RocksDB writes “files” via an abstract interface; the implementation stores chunks in a distributed system.

TiKV on Disaggregated Storage

TiDB (PingCAP) is moving toward a disaggregated architecture where TiKV (which uses RocksDB internally) stores its SSTables in object storage (S3, GCS). The TiKV storage engine writes SSTables through an abstract file interface to S3. This is identical to our pattern.

Pebble (CockroachDB)

CockroachDB replaced RocksDB with Pebble (a Go implementation) in 2021. Pebble has a vfs.FS interface — a virtual filesystem abstraction — that allows swapping the storage backend. CockroachDB uses this for testing (an in-memory FS) and is exploring using it for cloud storage.

The Pattern’s Universality

Every LSM-tree engine eventually adds a pluggable storage interface:

LevelDB: storage.Storage
RocksDB: Env (virtual filesystem)
Pebble: vfs.FS
WiredTiger: WT_FILE_SYSTEM

Why? Because running the compaction engine without worrying about where data lives is architecturally clean. The engine is responsible for LSM semantics; the storage interface is responsible for durability. Separation of concerns.

12. Exercises

Exercise 1 — Streaming Reader

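Instead of materializing the entire file into memory in Open(), return a reader that fetches chunks lazily while still satisfying storage.Reader (io.ReadSeeker, io.ReaderAt, and io.Closer). A read at offset 128 KiB should only fetch chunk 2 (plus chunk 3 if the read crosses a chunk boundary), not chunks 0 and 1.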

This reduces memory usage for large SSTables and enables efficient Seek(offset, io.SeekStart) for random-access reads.
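
As a starting point for this exercise, the mapping from a byte range to chunk indices can be sketched as below. chunkSpan is a hypothetical helper name; the lazy reader itself is left to the reader.

```go
package main

import "fmt"

const chunkSize = 64 * 1024

// chunkSpan returns the first and last chunk index touched by a read of
// length bytes starting at offset. length must be greater than zero.
func chunkSpan(offset, length int64) (first, last int64) {
	first = offset / chunkSize
	last = (offset + length - 1) / chunkSize
	return first, last
}

func main() {
	first, last := chunkSpan(128*1024, 4096)
	fmt.Printf("4 KiB read at 128 KiB touches chunks %d..%d\n", first, last)

	first, last = chunkSpan(128*1024, 64*1024+1)
	fmt.Printf("64 KiB + 1 byte read at 128 KiB touches chunks %d..%d\n", first, last)
}
```

A lazy reader would turn the (first, last) pair into a single range read over just those chunk keys and then slice the requested bytes out of the result.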

Exercise 2 — File Size Cache

List() currently returns all file descriptors by scanning the meta keys. Open(fd) reads the meta key to get the file size, then reads all chunk keys.

Add a small in-memory LRU cache mapping FileDesc → size. On Open, check the cache first. Invalidate the cache entry on Remove and Rename.

Measure the reduction in FDB round-trips for a workload with many small reads on recently-opened files.

Exercise 3 — Compression

Before storing each 64 KiB chunk, compress it with compress/flate or github.com/golang/snappy. Store a compression-type byte in the meta key. On read, decompress transparently.

LevelDB SSTables are already internally compressed (Snappy by default), so this may not reduce size much for TypeTable files. But TypeJournal files are not compressed and might benefit.

Exercise 4 — Two-Phase Large Rename

For files larger than 5 MB (which would exceed the transaction limit in our current Rename), implement the two-phase rename:

Phase 1: Write all new chunks in multiple transactions. Write a “rename-pending” key: ns+tagPending+oldfd → newfd.

Phase 2: In one transaction, atomically: clear the pending key, clear all old chunks and meta, set the meta for new fd (chunks already exist).

On startup, check for any pending keys and complete or roll back the rename. This is essentially a two-phase commit for large file renames.

Exercise 5 — Multi-Tenant Databases

Add a namespace concept: allow multiple LevelDB databases to share one FDB cluster with independent key spaces. Each New(fdb, namespace) call returns a storage implementation that is completely isolated from others.

This is how mvsqlite handles multiple SQLite “database files” — each is an FDB namespace.

13. Source Code Deep Dive — fdbstorage/storage.go

The Storage Struct

type Storage struct {
    db  fdb.Database
    ns  []byte
}

Minimal. db is the FDB connection; ns is the byte prefix for all keys. The entire storage is two fields. All complexity lives in the key encoding and transaction logic.

Key Encoding Helpers

func (s *Storage) metaKey(fd storage.FileDesc) fdb.Key {
    // ns + 0x01 + type(1 byte) + num(8 bytes big-endian)
    key := make([]byte, len(s.ns)+10)
    copy(key, s.ns)
    key[len(s.ns)] = 0x01
    key[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(key[len(s.ns)+2:], uint64(fd.Num))
    return fdb.Key(key)
}

func (s *Storage) chunkKey(fd storage.FileDesc, chunkNum int) fdb.Key {
    // ns + 0x02 + type(1 byte) + num(8 bytes) + chunk(8 bytes)
    key := make([]byte, len(s.ns)+18)
    copy(key, s.ns)
    key[len(s.ns)] = 0x02
    key[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(key[len(s.ns)+2:], uint64(fd.Num))
    binary.BigEndian.PutUint64(key[len(s.ns)+10:], uint64(chunkNum))
    return fdb.Key(key)
}

Why 8-byte big-endian for chunkNum? Chunk numbers are read back via GetRange which returns chunks in key order. Big-endian ensures key order equals chunk number order. If we used little-endian, chunk 256 (LE: 00 01 00 00 00 00 00 00) would sort before chunk 1 (LE: 01 00 00 00 00 00 00 00) — wrong.
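
The ordering argument can be verified directly with bytes.Compare, which applies the same byte-wise comparison FDB uses for keys. This is a small standalone check; the function names are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// sortsBefore reports whether the 8-byte encoding of a sorts before the
// encoding of b under byte-wise (lexicographic) comparison.
func sortsBefore(order binary.ByteOrder, a, b uint64) bool {
	ka, kb := make([]byte, 8), make([]byte, 8)
	order.PutUint64(ka, a)
	order.PutUint64(kb, b)
	return bytes.Compare(ka, kb) < 0
}

// Thin wrappers so callers do not need to import encoding/binary.
func bigEndianSortsBefore(a, b uint64) bool    { return sortsBefore(binary.BigEndian, a, b) }
func littleEndianSortsBefore(a, b uint64) bool { return sortsBefore(binary.LittleEndian, a, b) }

func main() {
	fmt.Println("big-endian:    chunk 1 before chunk 256?", bigEndianSortsBefore(1, 256))
	fmt.Println("little-endian: chunk 1 before chunk 256?", littleEndianSortsBefore(1, 256))
}
```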

The dataRange Helper

func (s *Storage) dataRange(fd storage.FileDesc) fdb.KeyRange {
    begin := s.chunkKey(fd, 0)
    // end: same prefix but with chunkNum = MaxUint64 + 1 — use next-prefix trick
    endPrefix := make([]byte, len(s.ns)+10) // ns + 0x02 + type + num
    copy(endPrefix, s.ns)
    endPrefix[len(s.ns)] = 0x02
    endPrefix[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(endPrefix[len(s.ns)+2:], uint64(fd.Num))
    end := append(endPrefix, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF)
    return fdb.KeyRange{Begin: begin, End: fdb.Key(append(end, 0x01))}
}

This range covers all chunk keys for (type, num) regardless of chunkNum. GetRange(dataRange(fd)) fetches all chunks in order.

The List() Implementation

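func (s *Storage) List(ft storage.FileType) ([]storage.FileDesc, error) {
    v, err := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(s.metaRange(), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    prefixLen := len(s.ns) + 2 // ns + tag byte + type byte
    var fds []storage.FileDesc
    for _, kv := range v.([]fdb.KeyValue) {
        k := []byte(kv.Key)
        if len(k) < prefixLen+8 {
            continue // malformed key, skip
        }
        // The descriptor is decoded from the key; the value holds only the size.
        thisFT := storage.FileType(k[len(s.ns)+1])
        if ft&thisFT == 0 {
            continue // caller did not ask for this file type
        }
        num := int64(binary.BigEndian.Uint64(k[prefixLen : prefixLen+8]))
        fds = append(fds, storage.FileDesc{Type: thisFT, Num: num})
    }
    return fds, nil
}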

A single range scan over all meta keys returns all files in one round-trip. LevelDB calls List() at startup to find all existing files. With FDB, this is O(1) round-trips regardless of file count.

With a local filesystem, List() is an opendir/readdir syscall — also O(1) in latency, but I/O must go through the local disk controller. With FDB, the I/O goes to the closest FDB storage server over the network, with similar or lower latency than a rotational disk.

The Lock Implementation

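// fdbLock implements storage.Locker by clearing the lock key on Unlock.
type fdbLock struct{ s *Storage }

func (l *fdbLock) Unlock() {
    // Best effort: storage.Locker's Unlock has no error return.
    l.s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Clear(l.s.lockKey())
        return nil, nil
    })
}

func (s *Storage) Lock() (storage.Locker, error) {
    _, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        existing, err := tr.Get(s.lockKey()).Get()
        if err != nil {
            return nil, err // let Transact retry retryable FDB errors
        }
        if len(existing) > 0 {
            return nil, errors.New("storage: already locked")
        }
        tr.Set(s.lockKey(), []byte("locked"))
        return nil, nil
    })
    if err != nil {
        return nil, err
    }
    return &fdbLock{s: s}, nil
}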

The lock is a FDB key. Acquiring the lock: check if the key exists; if not, set it — in one atomic transaction. This check-then-set is race-free because FDB’s optimistic concurrency ensures that if two processes both read “no lock” and both try to write “locked”, only one will commit (the other will conflict and retry, then find the lock held).

Limitation: This is a persistent flag, not a lease. If the lock-holding process crashes without releasing it, the lock remains set until manually cleared. For production, use a lock with an expiry: store the lock as {holder: processID, expires: time.Now().Add(30*time.Second)} and have each lock holder refresh it periodically. A lock that isn’t refreshed is treated as expired.
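
The decision logic for an expiring lock can be sketched separately from FDB. Everything here is hypothetical: the 16-byte value format (holder ID plus expiry in Unix nanoseconds), the function names, and the choice to refuse acquisition when the stored value is malformed. It also assumes the participating processes have reasonably synchronized clocks.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

// encodeLease packs holder ID and expiry (Unix nanoseconds), both big-endian.
func encodeLease(holder uint64, expiresNanos int64) []byte {
	b := binary.BigEndian.AppendUint64(nil, holder)
	return binary.BigEndian.AppendUint64(b, uint64(expiresNanos))
}

// decodeLease reverses encodeLease; ok is false for malformed values.
func decodeLease(raw []byte) (holder uint64, expiresNanos int64, ok bool) {
	if len(raw) != 16 {
		return 0, 0, false
	}
	return binary.BigEndian.Uint64(raw[:8]), int64(binary.BigEndian.Uint64(raw[8:])), true
}

// canAcquire decides whether caller may take or refresh the lock, given
// the raw value read from the lock key inside the transaction.
func canAcquire(raw []byte, caller uint64, nowNanos int64) bool {
	if len(raw) == 0 {
		return true // no lock key present
	}
	holder, expires, ok := decodeLease(raw)
	if !ok {
		return false // unknown format: refuse rather than guess
	}
	return holder == caller || nowNanos >= expires
}

func main() {
	now := time.Now()
	raw := encodeLease(1, now.Add(30*time.Second).UnixNano())
	fmt.Println("other process, lease live:   ", canAcquire(raw, 2, now.UnixNano()))
	fmt.Println("holder refreshing:           ", canAcquire(raw, 1, now.UnixNano()))
	fmt.Println("other process, after expiry: ", canAcquire(raw, 2, now.Add(31*time.Second).UnixNano()))
}
```

Inside Lock, the transaction would read the lock key, call canAcquire, and write a fresh lease only when it returns true; the read-then-write stays race-free for the same reason as the simple flag.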

14. Production Considerations

14.1 Transaction Size for Large SSTables

LevelDB level-0 SSTables are 2–4 MB. Deeper levels can hold larger tables if the engine is configured that way (in goleveldb, via CompactionTableSize and CompactionTableSizeMultiplier). With tables configured at 64 MB, for example, our current Rename would fail (exceeds the 10 MB transaction limit).

Solution: For production, configure LevelDB’s CompactionTableSize to keep SSTables small:

opts := &opt.Options{
    CompactionTableSize: 2 * 1024 * 1024,  // 2 MB SSTables
}

2 MB SSTables = 32 chunks of 64 KiB. Rename transaction: 32 reads + 32 writes = 4 MB total. Well within limits.

14.2 Read Performance for Large SSTables

Reading a 2 MB SSTable requires fetching 32 chunks from FDB. Our current implementation reads them in one GetRange — one round-trip, 32 key-value pairs returned. Latency: ~1–5 ms (FDB cluster local read latency).

A local filesystem read of 2 MB: ~1–3 ms on SSD, ~10–20 ms on HDD.

For a warm FDB cluster, FDB storage is competitive with SSDs and dramatically better than spinning disks. For random chunk access (seeking within large files), FDB may be faster because it can pipeline multiple point reads, while a spinning disk requires physical seeking.

14.3 Write Amplification

Our chunking adds write amplification: writing a 64 KiB chunk requires writing the chunk key (18 bytes) + value (64 KiB) = 64 KiB + 18 bytes. The key overhead is <0.03%, negligible.

But FDB itself adds write amplification internally: each committed transaction is written to the Transaction Log (TLog), then asynchronously applied to Storage Servers. The TLog write is sequential (fast). The Storage Server write is to FDB’s B-tree (with its own write amplification). FDB’s overall write amplification is roughly 3–5x — comparable to RocksDB’s LSM write amplification.

14.4 Monitoring

Key metrics for a fdbstorage-backed LevelDB deployment:

FDB transaction latency P99: should be < 10ms for small transactions (meta reads)
FDB range scan bytes/second: correlates with compaction throughput
FDB conflict rate: if high, indicates concurrent compaction and write contention
LevelDB metrics via db.GetProperty("leveldb.stats"): still valid — LevelDB reports its own view of compaction and SSTable counts, just the “disk I/O” is actually FDB I/O

15. Interview Questions — Storage Abstractions and LSM Trees

Q: What is the purpose of the Rename operation in LevelDB’s storage interface, and how does your FDB implementation preserve its atomicity guarantee?

Rename is LevelDB’s way of atomically promoting a new SSTable (or MANIFEST) into production. During compaction, LevelDB writes the new SSTable to a temp file, then renames it to its final name. POSIX rename is atomic: either the old name or the new name is visible, never a half-written file. Our FDB implementation reads all chunks with the old file descriptor, writes them with the new file descriptor, and clears the old keys — all in one FDB transaction. FDB’s transaction atomicity provides the same guarantee: either the old keys or the new keys are visible, never both or neither.

Q: Why does LevelDB use a Write-Ahead Log, and is it still necessary when using FDB as the storage backend?

The WAL protects against crash scenarios where data was written to the in-memory MemTable but not yet flushed to an SSTable on disk. Without a WAL, a crash after the MemTable write but before the SSTable flush would lose those writes. With FDB as storage, our writer.Close() writes chunks to FDB in transactions. Each committed FDB transaction is durable (replicated to at least two machines). A crash after Close() returns has no data loss. The WAL’s durability purpose is already provided by FDB. A production implementation would use a no-op WAL to skip the overhead.

Q: What is the 10 MB transaction limit in FDB, and what design patterns avoid hitting it?

FDB limits the affected data per transaction to 10,000,000 bytes (keys and values written, plus the keys and ranges read or cleared for conflict checking) to bound the memory required on Commit Proxies and to keep transaction resolution fast. Patterns to stay within the limit: (1) chunk large values (as we do with 64 KiB chunks), (2) break bulk writes into multiple transactions with cursor-based pagination, (3) configure LevelDB’s compaction to keep SSTable sizes small (< 2 MB), (4) use FDB atomic operations (tr.Add, tr.SetVersionstampedKey) where possible, since they mutate a key without reading it first and so add no read conflict range.

Q: How would you extend this implementation to support multiple concurrent LevelDB instances sharing the same FDB cluster?

Give each LevelDB instance its own ns prefix. The FDB key space is naturally partitioned: ns1 + ... keys and ns2 + ... keys are completely disjoint. Multiple instances can read and write concurrently with no coordination overhead: FDB’s conflict detection only fires when a transaction has read a key that another transaction modified before it committed, and different namespaces use different keys. The lock key (ns + tagLock) is also per-namespace, so locking one instance doesn’t affect others.

16. Bugs Encountered and Lessons Learned

Implementing storage.Storage from scratch surfaces several contracts that the goleveldb source doesn’t make obvious. These are real bugs found while getting the demo to work end-to-end.

Bug 1 — `storage.ErrNotExist` Does Not Exist

Symptom:

fdbstorage/storage.go:265:27: undefined: storage.ErrNotExist

What happened:

The initial implementation tried to use storage.ErrNotExist (by analogy with os.ErrNotExist), assuming goleveldb exported a sentinel error value for “file not found”. It doesn’t — goleveldb v1.0.0 exports no such symbol.

The contract (from the goleveldb source comments):

// Open opens file with the given 'file descriptor' read-only.
// Returns os.ErrNotExist error if the file does not exist.
Open(fd FileDesc) (Reader, error)

// GetMeta returns 'file descriptor' stored in meta.
// Returns os.ErrNotExist if meta doesn't store any 'file descriptor'.
GetMeta() (FileDesc, error)

The interface contract is written in English comments, not in types. Both Open and GetMeta must signal absence with os.ErrNotExist — the standard library sentinel, not a goleveldb one.

Fix: Use os.ErrNotExist directly everywhere the storage contract requires “not found”.

Bug 2 — `GetMeta` Returning a Custom Error on Empty Namespace

Symptom:

fdbstorage: no manifest set
exit status 1

What happened:

GetMeta read the manifest key from FDB. When the namespace was empty (first run, no DB created yet), the key was nil. The original code returned:

return storage.FileDesc{}, errors.New("fdbstorage: no manifest set")

goleveldb’s Open distinguishes two cases after calling s.recover():

err = s.recover()
if err != nil {
    if !os.IsNotExist(err) || s.o.GetErrorIfMissing() {
        return  // real error → abort
    }
    err = s.create()  // not-exist → create a fresh DB ← this is what we want
}

The custom error doesn’t satisfy os.IsNotExist, so goleveldb took the “real error → abort” path and surfaced the message to the user.

Fix:

if b == nil {
    return storage.FileDesc{}, os.ErrNotExist
}

Return the standard sentinel. goleveldb then calls s.create() and initialises a fresh database.

Bug 3 — `List()` Always Returned Empty

Symptom:

First session wrote three keys successfully. Second session opened the DB without error, but db.Get("apple") returned leveldb: not found.

What happened:

The List function is supposed to enumerate all files in the namespace. goleveldb calls List(TypeJournal) during recovery to find WAL files that need to be replayed. If the list is empty, no WAL is replayed — the memtable stays empty — and all keys written in the previous session appear missing.

The bug was subtle. List always passed storage.TypeAll to the internal range helper regardless of the ft argument:

func (s *Storage) List(ft storage.FileType) ([]storage.FileDesc, error) {
    v, err := s.fdb.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        // BUG: always uses TypeAll as the range prefix byte
        return rt.GetRange(s.metaRangeForType(storage.TypeAll), ...).GetSliceWithError()
    })

storage.TypeAll is a bitmask (TypeManifest | TypeJournal | TypeTable | TypeTemp = 1 | 2 | 4 | 8 = 0x0F). It is not a real file type. The meta keys in FDB use individual type bytes (0x01, 0x02, 0x04, 0x08). The range starting at <ns> 0x01 0x0F matched nothing because 0x0F is larger than all real type bytes.

The in-loop filter ft & thisFT == 0 was correct; only the FDB range prefix was wrong.

Fix: Scan all meta keys (from <ns> tagFileMeta to <ns> tagFileMeta+1) and let the existing per-item ft filter handle selection:

allMeta := fdb.KeyRange{
    Begin: fdb.Key(append(append([]byte{}, s.ns...), tagFileMeta)),
    End:   fdb.Key(append(append([]byte{}, s.ns...), tagFileMeta+1)),
}

Lesson: A bitmask sentinel (TypeAll) and a concrete type byte are different things. Never use a bitmask as a key prefix component.
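
The difference between a mask and a concrete type is easy to demonstrate. The sketch below uses local constants that mirror the goleveldb values quoted above; matches and isConcrete are hypothetical helper names.

```go
package main

import "fmt"

// fileType mirrors the bit layout of goleveldb's storage.FileType.
type fileType int

const (
	typeManifest fileType = 1 << iota // 0x01
	typeJournal                       // 0x02
	typeTable                         // 0x04
	typeTemp                          // 0x08

	typeAll = typeManifest | typeJournal | typeTable | typeTemp // 0x0F, a mask
)

// matches reports whether a concrete file type is selected by the filter mask.
func matches(filter, concrete fileType) bool {
	return filter&concrete != 0
}

// isConcrete reports whether ft names exactly one file type (a single bit),
// which is the only kind of value that belongs in a key.
func isConcrete(ft fileType) bool {
	return ft != 0 && ft&(ft-1) == 0
}

func main() {
	fmt.Printf("typeAll = %#x, concrete? %v\n", int(typeAll), isConcrete(typeAll))
	fmt.Println("typeAll selects journal:    ", matches(typeAll, typeJournal))
	fmt.Println("typeJournal selects table:  ", matches(typeJournal, typeTable))
}
```

A guard such as isConcrete at the top of any key-building helper would have turned the silent empty scan into an immediate, obvious failure.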

Bug 4 — FDB-Persisted Lock Survives Process Crashes

Symptom:

fdbstorage: namespace already locked
exit status 1

on every run after any run that didn’t exit cleanly.

What happened:

The original Lock() wrote a key to FDB. Unlock() (called by goleveldb inside db.Close()) cleared it. This worked as long as db.Close() ran. But log.Fatal and log.Fatalf call os.Exit(1), which does not run deferred functions. The demo’s second session had:

db, err := leveldb.Open(stor, nil)
// ...
defer db.Close()   // ← registered but ...

v, err := db.Get([]byte(k), nil)
if err != nil {
    log.Fatalf(...)  // ← os.Exit kills the process here; defer never runs
}

goleveldb stores the locker returned by Lock() and calls locker.Unlock() inside db.Close(). When os.Exit fires, db.Close() is never called, the unlock never happens, and the FDB lock key persists until the next manual cleanup.

Two fixes applied:

Switch to an in-process lock — replace the FDB key with a sync.Mutex flag on the Storage struct. An in-process flag resets automatically every time the process starts; stale state from a crashed run is impossible. The trade-off is losing cross-process exclusion, which is acceptable for a single-process demo.
Avoid log.Fatal after defer db.Close() — extract the code into a function that returns an error. Deferred cleanup runs normally when the function returns, even on the error path. The caller then does log.Fatal after cleanup is complete.

Lesson: log.Fatal, log.Fatalf, and os.Exit all bypass defer (runtime.Goexit, by contrast, does run deferred calls before ending the goroutine). Any resource that must be released (DB handles, locks, network connections) should be closed by a return-based error path, not a fatal exit inside a function that registered defer.
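
The return-based pattern looks like this in miniature. The resource type is a stand-in for a DB handle or lock, and run is a hypothetical name for the extracted function.

```go
package main

import (
	"errors"
	"fmt"
)

// resource stands in for a DB handle or lock that must be released.
type resource struct{ closed bool }

func (r *resource) Close() { r.closed = true }

// run does the real work and reports failure by returning an error,
// so the deferred Close executes on every path.
func run(r *resource, fail bool) error {
	defer r.Close()
	if fail {
		return errors.New("lookup failed")
	}
	return nil
}

func main() {
	r := &resource{}
	err := run(r, true)
	// By the time the caller sees the error, cleanup has already happened.
	// A real main would call log.Fatal(err) at this point.
	fmt.Println("error:", err)
	fmt.Println("resource closed before exit:", r.closed)
}
```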

Bug 5 — `os.IsNotExist` Ignores Custom `.Is()` Methods (Go 1.13+)

Symptom:

fdbstorage: file MANIFEST-000003 does not exist
exit status 1

after stale FDB state was left by a prior failed run.

What happened:

The initial fix for Bug 1 used a custom error type with an Is method:

type fileNotExistError struct{ fd storage.FileDesc }

func (e *fileNotExistError) Is(target error) bool {
    return target == os.ErrNotExist
}

errors.Is(err, os.ErrNotExist) correctly returns true for this type. However, goleveldb’s session.recover() uses os.IsNotExist(err), not errors.Is. In Go 1.25, os.IsNotExist is implemented as:

func underlyingErrorIs(err, target error) bool {
    err = underlyingError(err)  // unwraps *PathError, *LinkError, *SyscallError only
    if err == target { return true }
    e, ok := err.(syscallErrorType)
    return ok && e.Is(target)  // only calls .Is() on syscall errors
}

The comment in the Go source is explicit: “To preserve prior behavior, only examine syscall errors.” A custom error type’s Is method is called by errors.Is but not by os.IsNotExist.

Because os.IsNotExist returned false for fileNotExistError, goleveldb surfaced the error directly instead of creating a fresh DB.

Fix: Use *os.PathError as the wrapper — os.IsNotExist explicitly unwraps *PathError in its underlyingError switch:

return nil, &os.PathError{Op: "open", Path: fd.String(), Err: os.ErrNotExist}

underlyingError extracts Err (which is os.ErrNotExist), the == target check passes, and os.IsNotExist returns true.

Lesson: errors.Is and os.IsNotExist are not equivalent in Go 1.13+. errors.Is traverses the full chain calling .Is() methods. os.IsNotExist only inspects a fixed set of OS error wrapper types. Always use *os.PathError{Err: os.ErrNotExist} (not a custom Is method) when you need os.IsNotExist compatibility.
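
The difference is reproducible with the standard library alone. In this sketch, customNotExist is a hypothetical type modelled on the fileNotExistError above, and check is a helper that reports both results side by side.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// customNotExist advertises itself as os.ErrNotExist through an Is method.
type customNotExist struct{ name string }

func (e *customNotExist) Error() string        { return e.name + " does not exist" }
func (e *customNotExist) Is(target error) bool { return target == os.ErrNotExist }

// check returns the verdict of errors.Is and of os.IsNotExist for err.
func check(err error) (viaErrorsIs, viaOsIsNotExist bool) {
	return errors.Is(err, os.ErrNotExist), os.IsNotExist(err)
}

func customErr() error { return &customNotExist{name: "MANIFEST-000003"} }

func pathErr() error {
	return &os.PathError{Op: "open", Path: "MANIFEST-000003", Err: os.ErrNotExist}
}

func main() {
	a, b := check(customErr())
	fmt.Printf("custom Is method: errors.Is=%v os.IsNotExist=%v\n", a, b)

	c, d := check(pathErr())
	fmt.Printf("*os.PathError:    errors.Is=%v os.IsNotExist=%v\n", c, d)
}
```

The *os.PathError form satisfies both checks, which makes it the safer choice whenever the caller's error-testing style is not under your control.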

Bug 6 — Stale FDB State Across Runs

Root cause: Every bug above caused the demo to exit abnormally, leaving partial or inconsistent FDB state: a manifest pointer referencing a file whose data chunks were never written, or data from a session that was never cleanly closed.

Fix: Add Storage.Wipe() — a single range-clear of the entire namespace — and call it at demo startup. This makes each run deterministic. In a real system, Wipe would be replaced with a proper recovery procedure:

Check for a “rename-pending” key (Exercise 4 in this guide) and complete or roll back any in-flight rename.
Use goleveldb’s Recover function (instead of Open) which reads whatever SSTables it can find and rebuilds the MANIFEST from scratch.
Remove orphaned files (those in FDB but not referenced by the MANIFEST) via List + cross-reference with the MANIFEST’s live file set.
