Hitchhiker’s Guide — Option B: LevelDB on FDB Storage

The question this answers: “Can I run a real LSM-tree storage engine — the actual LevelDB binary with all its compaction logic — with its files stored in FoundationDB instead of a local disk?”

The deeper question: “What is the storage.Storage interface, why does it exist, and what does it tell us about how databases handle file I/O?”


Table of Contents

  1. The Storage Abstraction: Why LevelDB Has a Plugin Point
  2. What LevelDB Actually Writes to Disk
  3. The storage.Storage Interface — Dissected
  4. How We Map LevelDB Files to FDB Keys
  5. Chunking: Overcoming the 100 KiB Value Limit
  6. Atomic Rename — Durability’s Secret Weapon
  7. The Writer: Batching Chunks into Transactions
  8. Why the WAL is Redundant With FDB
  9. The Blob Layer Pattern
  10. Real-World Analogues: RocksDB on Cloud Storage
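  11. Exercises
  12. Source Code Deep Dive: fdbstorage/storage.go
  13. Production Considerations
  14. Interview Questions: Storage Abstractions and LSM Trees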

1. The Storage Abstraction: Why LevelDB Has a Plugin Point

goleveldb’s storage.Storage interface exists for portability. It plays the same role as the Env abstraction in the original C++ LevelDB written by Jeff Dean and Sanjay Ghemawat: not every environment has a POSIX filesystem. Google has internal systems where storage might be Bigtable, Colossus, or a custom log-structured store.

The interface says: “if you can implement these 10 methods, LevelDB will run on your storage.” The engine (goleveldb) doesn’t know whether it’s writing to ext4, NTFS, GCS, or FDB; it just calls the interface.

type Storage interface {
    Lock() (util.Releaser, error)
    Log(str string) // diagnostic log line, not a file operation
    Open(fd storage.FileDesc) (storage.Reader, error)
    Create(fd storage.FileDesc) (storage.Writer, error)
    Remove(fd storage.FileDesc) error
    Rename(oldfd, newfd storage.FileDesc) error
    GetMeta() (storage.FileDesc, error)
    SetMeta(fd storage.FileDesc) error
    List(ft storage.FileType) ([]storage.FileDesc, error) // ft is a bitmask of file types
    Close() error
}

Our fdbstorage.Storage implements all of these, storing files as FDB key-value pairs. LevelDB itself (in the syndtr/goleveldb package) calls these methods. It has no idea the “files” are actually chunks in a distributed database.


2. What LevelDB Actually Writes to Disk

To understand what we need to store, let’s look at what LevelDB writes:

File types (goleveldb's storage.FileType values):
  TypeJournal  (NNNNNN.log)       Write-Ahead Log: records every write before the
                                  MemTable is flushed. Used to recover unflushed
                                  writes after a crash.
  TypeManifest (MANIFEST-NNNNNN)  Lists which SSTables are "live" (not yet
                                  garbage-collected). Updated at each compaction.
  TypeTable    (.ldb / .sst)      Sorted String Tables. Immutable, sorted KV data
                                  files produced by MemTable flushes and compaction.
  TypeTemp     (.tmp)             Temporary files.

Two more files are not FileType values; goleveldb's file-backed storage manages them itself:
  CURRENT                         A single file containing the name of the latest
                                  MANIFEST file (what GetMeta/SetMeta read and write).
  LOCK                            A file held open to prevent two processes from
                                  opening the same database simultaneously (what Lock uses).

File descriptor:
type FileDesc struct {
    Type FileType  // TypeJournal, TypeManifest, etc.
    Num  int64     // unique file number (monotonically increasing)
}

A LevelDB database directory looks like:

000003.log       ← journal (WAL)
000004.ldb       ← SSTable level 0
000005.ldb       ← SSTable level 0
MANIFEST-000002  ← current manifest
CURRENT          ← "MANIFEST-000002\n"
LOCK             ← lockfile

When compaction happens:

  1. LevelDB picks some SSTables, merges and sorts them into a new SSTable.
  2. It writes the new SSTable as a .tmp file (via Create(TypeTemp, ...))
  3. It renames the .tmp to the final .ldb name (via Rename)
  4. It updates the MANIFEST to list the new SSTable and de-list the old ones.
  5. It removes the old SSTables (via Remove).

This is the temp-then-rename durability pattern: write the new file completely under a temporary name, then rename it into place. POSIX rename is atomic: either the old name or the new name is visible, never a partial file. Our FDB implementation must replicate this property.
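For comparison, here is a minimal sketch of the same pattern on a local filesystem, using only the Go standard library. The function name atomicWriteFile and the demo helper are illustrative, not part of goleveldb or fdbstorage.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// atomicWriteFile writes data to a temporary file in the same directory as
// path, syncs it, and then renames it over path. Readers see either the old
// contents or the new contents, never a partially written file.
func atomicWriteFile(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".*.tmp")
	if err != nil {
		return err
	}
	// After a successful rename the temp name no longer exists and this
	// Remove fails harmlessly; on any error path it cleans up the temp file.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush contents to stable storage
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path) // the atomic step
}

// demo rewrites a CURRENT-style pointer file twice and returns its final contents.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "CURRENT")
	for _, name := range []string{"MANIFEST-000002\n", "MANIFEST-000007\n"} {
		if err := atomicWriteFile(path, []byte(name)); err != nil {
			return "", err
		}
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", got)
}
```

The temp file is created in the same directory as the target because rename is only atomic within one filesystem.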


3. The storage.Storage Interface — Dissected

Let’s look at each method and what it does:

Lock() (util.Releaser, error): Prevents two processes from opening the same database simultaneously. We implement this by writing a “lock” key to FDB. The Releaser clears it.

Create(fd FileDesc) (Writer, error) and Open(fd FileDesc) (Reader, error): Create starts a new file (for writing). Open opens an existing file (for reading). In FDB terms: Create returns a writer that buffers bytes; Open reads all chunks for the file into memory and returns a reader over that buffer.

Remove(fd FileDesc) error: Deletes a file. In FDB: ClearRange over all chunk keys for this file, plus a Clear of its meta key.

Rename(oldfd, newfd FileDesc) error: Renames a file atomically. In FDB: copy all chunks from oldfd keys to newfd keys, then clear all oldfd keys, in one transaction. This is the atomic rename.

GetMeta() (FileDesc, error) and SetMeta(fd FileDesc) error: Get/set the “current” file pointer, that is, which MANIFEST is current. In FDB: a single key (ns + tagManifest) stores the current FileDesc. This replaces the CURRENT file in LevelDB’s original design.

List(ft FileType) ([]FileDesc, error): Lists the files whose type matches ft (a bitmask, so TypeAll returns everything). We implement this as a range scan over the meta key prefix, since we store a meta key for each file alongside its data.
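To make the contract concrete, here is a toy in-memory model of the file operations. It is a sketch under simplifying assumptions: fileDesc is a local stand-in for storage.FileDesc, Create/Open are collapsed into Put/Get, List has no type filter, and locking and the meta pointer are left out. It is not the fdbstorage implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// fileDesc is a local stand-in for goleveldb's storage.FileDesc.
type fileDesc struct {
	Type int
	Num  int64
}

var errNotExist = errors.New("file does not exist")

// memStore keeps whole files in a map guarded by one mutex.
type memStore struct {
	mu    sync.Mutex
	files map[fileDesc][]byte
}

func newMemStore() *memStore { return &memStore{files: map[fileDesc][]byte{}} }

// Put stores a complete file, replacing any previous contents.
func (m *memStore) Put(fd fileDesc, data []byte) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.files[fd] = append([]byte(nil), data...)
}

// Get returns a copy of the file's contents.
func (m *memStore) Get(fd fileDesc) ([]byte, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	data, ok := m.files[fd]
	if !ok {
		return nil, errNotExist
	}
	return append([]byte(nil), data...), nil
}

// Rename moves a file while holding the lock, so no other caller can observe
// a state where both names exist or neither does.
func (m *memStore) Rename(oldfd, newfd fileDesc) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	data, ok := m.files[oldfd]
	if !ok {
		return errNotExist
	}
	if oldfd == newfd {
		return nil
	}
	m.files[newfd] = data
	delete(m.files, oldfd)
	return nil
}

// List returns all descriptors ordered by type, then number.
func (m *memStore) List() []fileDesc {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make([]fileDesc, 0, len(m.files))
	for fd := range m.files {
		out = append(out, fd)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Type != out[j].Type {
			return out[i].Type < out[j].Type
		}
		return out[i].Num < out[j].Num
	})
	return out
}

func main() {
	s := newMemStore()
	tmp, final := fileDesc{Type: 8, Num: 5}, fileDesc{Type: 4, Num: 5}
	s.Put(tmp, []byte("sstable bytes"))
	if err := s.Rename(tmp, final); err != nil {
		panic(err)
	}
	fmt.Println(s.List())
}
```

The mutex plays the role that a single FDB transaction plays in the real implementation: it is what makes Rename appear instantaneous to other callers.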


4. How We Map LevelDB Files to FDB Keys

Each LevelDB file is identified by (FileType, FileNum). We encode this as:

Meta key (file existence + type):
  ns + tagFileMeta(0x01) + fileType(1 byte) + fileNum(8 bytes BE)
  → msgpack({Type: ft, Num: n, size: totalBytes})

Data chunks:
  ns + tagFileData(0x02) + fileType(1 byte) + fileNum(8 bytes BE) + chunkNum(8 bytes BE)
  → up to 64 KiB of file data

Manifest pointer (replaces CURRENT file):
  ns + tagManifest(0x03)
  → msgpack(FileDesc{Type: TypeManifest, Num: n})

Lock key:
  ns + tagLock(0x04)
  → "locked" (any non-empty value means locked)

Why big-endian for file and chunk numbers?

Big-endian encoding preserves sort order. FDB orders keys lexicographically by byte, and for fixed-width big-endian integers that byte order equals numeric order. File numbers only increase (LevelDB never reuses a file number), so files sort in creation order, and a range scan over ns+tagFileData+ft+num+* returns chunks in chunk-number order, which is the correct order to reassemble the file. Other encodings break this: as decimal strings, chunk "10" would sort before chunk "2"; as little-endian integers, chunk 256 (00 01 00 ...) would sort before chunk 1 (01 00 00 ...).

File number and type as part of the key:

This means all chunks of file (TypeJournal, 3) sort together, before all chunks of (TypeJournal, 4), which sort before (TypeTable, 5). Clean, hierarchical key organization.
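The ordering claims can be checked with a short standalone sketch. The helper names (buildKey, chunkKeyBE, chunkKeyLE, sortsBefore) are made up for this illustration; the key layout follows the one described above.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// buildKey lays out ns + 0x02 + fileType + fileNum + chunkNum, encoding both
// numbers as 8 bytes in the given byte order.
func buildKey(order binary.ByteOrder, ns []byte, fileType byte, fileNum, chunkNum uint64) []byte {
	key := append([]byte(nil), ns...)
	key = append(key, 0x02, fileType)
	var b [8]byte
	order.PutUint64(b[:], fileNum)
	key = append(key, b[:]...)
	order.PutUint64(b[:], chunkNum)
	return append(key, b[:]...)
}

func chunkKeyBE(ns []byte, fileType byte, fileNum, chunkNum uint64) []byte {
	return buildKey(binary.BigEndian, ns, fileType, fileNum, chunkNum)
}

func chunkKeyLE(ns []byte, fileType byte, fileNum, chunkNum uint64) []byte {
	return buildKey(binary.LittleEndian, ns, fileType, fileNum, chunkNum)
}

// sortsBefore reports whether a precedes b in lexicographic byte order,
// which is how FDB orders keys.
func sortsBefore(a, b []byte) bool { return bytes.Compare(a, b) < 0 }

func main() {
	ns := []byte("db/")
	// Big-endian: chunk 1 precedes chunk 256.
	fmt.Println(sortsBefore(chunkKeyBE(ns, 2, 3, 1), chunkKeyBE(ns, 2, 3, 256)))
	// Little-endian: chunk 1 does not precede chunk 256.
	fmt.Println(sortsBefore(chunkKeyLE(ns, 2, 3, 1), chunkKeyLE(ns, 2, 3, 256)))
	// Every chunk of file 3 precedes every chunk of file 4.
	fmt.Println(sortsBefore(chunkKeyBE(ns, 2, 3, 9999), chunkKeyBE(ns, 2, 4, 0)))
}
```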


5. Chunking: Overcoming the 100 KiB Value Limit

FDB has a hard limit: a value may not exceed 100,000 bytes (just under 100 KiB). A typical LevelDB SSTable is 2–4 MB. We cannot store it in one FDB value.

Our solution: split each file into 64 KiB chunks:

const chunkSize = 64 * 1024  // 65,536 bytes

// Writing a 200 KiB (204,800-byte) file:
// Chunk 0: bytes [0, 65536)
// Chunk 1: bytes [65536, 131072)
// Chunk 2: bytes [131072, 196608)
// Chunk 3: bytes [196608, 204800)  (partial last chunk, 8,192 bytes)

Each chunk is stored as a separate FDB key-value pair:

ns+0x02+ft+num+00000000_00000000  → 65536 bytes
ns+0x02+ft+num+00000000_00000001  → 65536 bytes
ns+0x02+ft+num+00000000_00000002  → 65536 bytes
ns+0x02+ft+num+00000000_00000003  → 8192 bytes (partial)
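The splitting itself needs no FDB at all. A minimal sketch, with splitChunks and joinChunks as illustrative helper names rather than functions from fdbstorage:

```go
package main

import "fmt"

const chunkSize = 64 * 1024 // 65,536 bytes

// splitChunks cuts data into pieces of at most size bytes. Only the last
// piece may be shorter. The pieces alias data; they are not copies.
func splitChunks(data []byte, size int) [][]byte {
	if size <= 0 {
		panic("splitChunks: size must be positive")
	}
	var chunks [][]byte
	for len(data) > 0 {
		n := size
		if n > len(data) {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

// joinChunks concatenates chunks in order, reversing splitChunks.
func joinChunks(chunks [][]byte) []byte {
	var out []byte
	for _, c := range chunks {
		out = append(out, c...)
	}
	return out
}

func main() {
	file := make([]byte, 150000)
	for i, c := range splitChunks(file, chunkSize) {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
	fmt.Println("2 MiB table:", len(splitChunks(make([]byte, 2<<20), chunkSize)), "chunks")
}
```

A 150,000-byte file yields two full chunks and an 18,928-byte tail; a 2 MiB table yields exactly 32 chunks.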

Reading a file: range-scan all chunk keys for (ft, num), sort by chunk number (already in order due to big-endian encoding), concatenate the values.

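// memReader adapts bytes.Reader to storage.Reader, which needs
// Read, Seek, ReadAt, and Close.
type memReader struct{ *bytes.Reader }

func (memReader) Close() error { return nil }

func (s *Storage) Open(fd storage.FileDesc) (storage.Reader, error) {
    v, err := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(s.dataRange(fd), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    // Chunks arrive in key order, which is chunk-number order.
    var buf []byte
    for _, kv := range v.([]fdb.KeyValue) {
        buf = append(buf, kv.Value...)
    }
    return memReader{bytes.NewReader(buf)}, nil
}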

Why 64 KiB chunks?

  • Smaller than FDB’s 100 KiB value limit ✓
  • Large enough to minimize key overhead (a 4 MB SSTable = 64 chunks, not thousands)
  • Aligns with filesystem block sizes (4–64 KiB typical)

6. Atomic Rename — Durability’s Secret Weapon

POSIX rename(src, dst) is the single most important durability primitive in filesystems. Its contract: after rename returns, dst exists and src does not, with no window where neither exists. This is atomic replacement.

LevelDB uses rename heavily:

  • Rename(TypeTemp, n, TypeTable, n): promote temp SSTable to final name
  • Rename(TypeTemp, n, TypeManifest, n): promote temp manifest

Without atomicity, a crash during rename could leave:

  • Neither file existing → data loss
  • Both files existing → ambiguity about which is current
  • A partial file at dst → corruption

In FDB, we implement atomic rename as copy + clear in one transaction:

func (s *Storage) Rename(oldfd, newfd storage.FileDesc) error {
    _, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // 1. Read all old chunks
        kvs, err := tr.GetRange(s.dataRange(oldfd), fdb.RangeOptions{}).GetSliceWithError()
        if err != nil {
            return nil, err // abort: nothing has been written yet
        }

        // 2. Clear old data
        tr.ClearRange(s.dataRange(oldfd))
        tr.Clear(s.metaKey(oldfd))

        // 3. Write new data
        for _, kv := range kvs {
            newKey := s.translateChunkKey(kv.Key, oldfd, newfd)
            tr.Set(newKey, kv.Value)
        }
        // metaBytes: the encoded meta record for newfd (encoding elided here)
        tr.Set(s.metaKey(newfd), metaBytes)
        return nil, nil
    })
    return err
}

One transaction. The cluster either commits all of this (old is gone, new is present) or none of it (crash safety). The atomicity guarantee is identical to POSIX rename — and arguably stronger, since FDB replicates the commit across multiple machines before returning.

The 10 MB transaction limit:

FDB limits each transaction to 10,000,000 bytes of affected data. Written keys and values count toward that limit, as do the keys and ranges a transaction reads; the values it reads do not. Renaming a 4 MB SSTable writes about 4 MB of chunk data in one transaction, which is under the limit. An 8 MB SSTable would be risky.

Our Rename also has to read all 4 MB of chunks and write them back within FDB’s 5-second transaction time limit. That is comfortable for typical LevelDB files.
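A rough way to see how much headroom a single-transaction Rename has is to add up the bytes it writes. The sketch below counts only the write side (chunk keys, chunk values, and one meta record). The key length and meta length are caller-supplied assumptions, and the helper names are hypothetical.

```go
package main

import "fmt"

const (
	chunkSize = 64 * 1024
	txLimit   = 10_000_000 // FDB's transaction size limit in bytes
)

// chunkCount returns how many chunks a file of size bytes occupies.
func chunkCount(size int64) int64 {
	return (size + chunkSize - 1) / chunkSize
}

// renameWriteBytes estimates the bytes written by a single-transaction
// Rename: every chunk value and key, plus one meta key and meta value.
func renameWriteBytes(size, keyLen, metaLen int64) int64 {
	return size + chunkCount(size)*keyLen + keyLen + metaLen
}

// fitsInOneTx reports whether that estimate stays under the limit.
func fitsInOneTx(size, keyLen, metaLen int64) bool {
	return renameWriteBytes(size, keyLen, metaLen) < txLimit
}

func main() {
	const keyLen, metaLen = 32, 64 // assumed sizes, for illustration only
	for _, mib := range []int64{2, 4, 8, 16} {
		size := mib << 20
		fmt.Printf("%2d MiB: %d chunks, ~%d bytes written, fits=%v\n",
			mib, chunkCount(size), renameWriteBytes(size, keyLen, metaLen),
			fitsInOneTx(size, keyLen, metaLen))
	}
}
```

Fitting under the hard limit is necessary but not sufficient: large transactions also take longer to commit, so smaller files remain the safer choice.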

For larger files, we’d need to either:

  1. Break the rename into multiple transactions (violating atomicity), or
  2. Use a two-phase approach: write new chunks in a first transaction, then atomically swap the meta key in a second transaction (using a PENDING state key as the “in-progress rename” marker).

7. The Writer: Batching Chunks into Transactions

When LevelDB writes a new SSTable, it calls Create(fd) which returns a Writer. The writer accumulates bytes via Write(p []byte). When Close() is called, we flush everything to FDB.

Batch size:

const maxChunksPerTx = 100  // 100 × 64 KiB = 6.25 MiB (about 6.55 MB) per transaction

We flush up to 100 chunks per FDB transaction, which stays within the 10 MB limit. A 20 MiB SSTable (320 chunks) would be flushed in 4 transactions: three of 100 chunks and a final one of 20 chunks.

func (w *writer) flush(final bool) error {
    start := w.flushedChunks
    end := start + maxChunksPerTx
    if end > len(w.chunks) {
        end = len(w.chunks)
    }
    _, err := w.s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        for i := start; i < end; i++ {
            tr.Set(w.s.chunkKey(w.fd, i), w.chunks[i])
        }
        if final && end == len(w.chunks) {
            // Write the meta key only on the final flush
            tr.Set(w.s.metaKey(w.fd), metaBytes)
        }
        return nil, nil
    })
    if err != nil {
        return err // leave flushedChunks unchanged so the batch can be retried
    }
    w.flushedChunks = end
    return nil
}

The meta-key-last invariant:

We write the meta key (the file’s “directory entry”) only in the last batch of chunks. This ensures that List() never returns a file whose chunks are only partially written — the file is only “visible” once all its chunks exist.

This is the FDB equivalent of:

  1. Write all content to a temp file
  2. rename(temp, final) atomically
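The batching and the meta-key-last rule can be expressed as a small planning function, separate from any FDB calls. This is a sketch; batch and planFlush are hypothetical names, not part of the writer shown above.

```go
package main

import "fmt"

// batch describes one transaction's worth of work.
type batch struct {
	Start, End int  // chunk indexes [Start, End)
	WriteMeta  bool // true only for the batch that completes the file
}

// planFlush splits nChunks into transactions of at most maxPerTx chunks and
// marks the last one as the transaction that also writes the meta key.
func planFlush(nChunks, maxPerTx int) []batch {
	if maxPerTx <= 0 {
		panic("planFlush: maxPerTx must be positive")
	}
	var plan []batch
	for start := 0; start < nChunks; start += maxPerTx {
		end := start + maxPerTx
		if end > nChunks {
			end = nChunks
		}
		plan = append(plan, batch{Start: start, End: end, WriteMeta: end == nChunks})
	}
	if len(plan) == 0 {
		// An empty file has no chunks but still needs its meta key to become visible.
		plan = append(plan, batch{WriteMeta: true})
	}
	return plan
}

func main() {
	// A 20 MiB file is 320 chunks of 64 KiB.
	for i, b := range planFlush(320, 100) {
		fmt.Printf("tx %d: chunks [%d, %d) meta=%v\n", i, b.Start, b.End, b.WriteMeta)
	}
}
```

If the process dies between transactions, chunks from the earlier batches remain in FDB without a meta key. They are invisible to List, but a real implementation would need some way to garbage-collect them.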

8. Why the WAL is Redundant With FDB

LevelDB’s Write-Ahead Log (WAL / journal, TypeJournal) exists for one reason: crash recovery. If the process crashes after writing to the in-memory MemTable but before flushing the MemTable to an SSTable on disk, the WAL is replayed to reconstruct the MemTable.

With FDB as the storage backend:

Every byte handed to the writer is durable by the time Close() returns.

Our writer.Write buffers bytes in memory. Our writer.Close flushes to FDB in transactions. Each FDB Transact call does not return until the commit is confirmed by FDB’s replication protocol — the data is on at least f+1 machines (where f is the fault tolerance level, typically 2). A process crash after Close() returns means the data is safe.

The WAL is protecting against “data written to OS memory but not yet on disk.” FDB’s Transact eliminates this window. By the time the WAL file is written through our fdbstorage.Writer, the bytes are already in FDB.

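A production implementation would patch goleveldb to skip journal writes entirely, or substitute a custom journal that is a no-op. (Options.DisableSeeksCompaction is not a substitute: it only turns off seek-triggered compaction and leaves the journal alone.) Skipping the journal should raise write throughput and reduce FDB key usage, though we have not measured by how much. The trade-off is that writes living only in the MemTable are then lost on a crash until the MemTable has been flushed to an SSTable in FDB.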


9. The Blob Layer Pattern

FDB’s core team documented the “Blob Layer” pattern: storing binary blobs (arbitrary large byte arrays) in FDB by chunking them. Our file storage is an instance of this pattern.

The Blob Layer pattern:

blob_key + chunkNum  →  chunk_data

It solves the 100 KiB value limit while preserving atomic operations on the whole blob (via FDB transactions) and efficient byte-range access (read only the chunks you need, e.g., for seeking within a large file).
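Byte-range access comes down to integer division. The sketch below maps a byte range to the chunk indexes that hold it; chunksForRange is an illustrative name, not a function from the FDB documentation or from fdbstorage.

```go
package main

import "fmt"

// chunksForRange returns the first and last chunk index (both inclusive)
// holding bytes [offset, offset+length). ok is false for an empty or
// invalid range.
func chunksForRange(offset, length, chunkSize int64) (first, last int64, ok bool) {
	if offset < 0 || length <= 0 || chunkSize <= 0 {
		return 0, 0, false
	}
	return offset / chunkSize, (offset + length - 1) / chunkSize, true
}

func main() {
	const chunkSize = 64 * 1024
	first, last, _ := chunksForRange(128*1024, 4096, chunkSize)
	fmt.Println(first, last) // a 4 KiB read at offset 128 KiB stays inside chunk 2
	first, last, _ = chunksForRange(100000, 100000, chunkSize)
	fmt.Println(first, last) // a read that spans several chunks
}
```

A reader built on this would issue a range read from the key of chunk first through the key of chunk last, then trim the leading and trailing bytes that fall outside the requested range.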

Applications:

  • Store large media files (> 100 KiB) in FDB for atomic metadata-plus-content updates
  • Store ML model weights alongside their metadata records
  • Store serialized protocol buffers larger than 100 KiB
  • Back any file system abstraction (exactly what we’re doing)

10. Real-World Analogues

RocksDB Remote Compaction (Project Titan, Ripple)

Meta (Facebook) runs RocksDB on distributed storage in some configurations. Their “Ripple” project stores RocksDB SSTables in a distributed block store (similar to HDFS or GFS). The storage interface they use is exactly the same concept: RocksDB writes “files” via an abstract interface; the implementation stores chunks in a distributed system.

TiKV on Disaggregated Storage

TiDB (PingCAP) is moving toward a disaggregated architecture where TiKV (which uses RocksDB internally) stores its SSTables in object storage (S3, GCS). The TiKV storage engine writes SSTables through an abstract file interface to S3. This is identical to our pattern.

Pebble (CockroachDB)

CockroachDB replaced RocksDB with Pebble (a Go implementation) in 2021. Pebble has a vfs.FS interface — a virtual filesystem abstraction — that allows swapping the storage backend. CockroachDB uses this for testing (an in-memory FS) and is exploring using it for cloud storage.

The Pattern’s Universality

Every LSM-tree engine eventually adds a pluggable storage interface:

  • LevelDB: storage.Storage
  • RocksDB: Env (virtual filesystem)
  • Pebble: vfs.FS
  • WiredTiger: WT_FILE_SYSTEM

Why? Because running the compaction engine without worrying about where data lives is architecturally clean. The engine is responsible for LSM semantics; the storage interface is responsible for durability. Separation of concerns.


11. Exercises

Exercise 1 — Streaming Reader

Instead of materializing the entire file into memory in Open(), return an io.ReadSeekCloser that fetches chunks lazily. A read at offset 128 KiB should start fetching at chunk 2 and never touch chunks 0 and 1.

This reduces memory usage for large SSTables and enables efficient Seek(offset, io.SeekStart) for random-access reads.

Exercise 2 — File Size Cache

List() currently returns all file descriptors by scanning the meta keys. Open(fd) reads the meta key to get the file size, then reads all chunk keys.

Add a small in-memory LRU cache mapping FileDesc → size. On Open, check the cache first. Invalidate the cache entry on Remove and Rename.

Measure the reduction in FDB round-trips for a workload with many small reads on recently-opened files.

Exercise 3 — Compression

Before storing each 64 KiB chunk, compress it with compress/flate or github.com/golang/snappy. Store a compression-type byte in the meta key. On read, decompress transparently.

LevelDB SSTables are already internally compressed (Snappy by default), so this may not reduce size much for TypeTable files. But TypeJournal files are not compressed and might benefit.

Exercise 4 — Two-Phase Large Rename

For files too large to rename safely in a single transaction (anything approaching the 10 MB transaction limit), implement the two-phase rename:

Phase 1: Write all new chunks in multiple transactions. Write a “rename-pending” key: ns+tagPending+oldfd → newfd.

Phase 2: In one transaction, atomically: clear the pending key, clear all old chunks and meta, set the meta for new fd (chunks already exist).

On startup, check for any pending keys and complete or roll back the rename. This is essentially a two-phase commit for large file renames.

Exercise 5 — Multi-Tenant Databases

Add a namespace concept: allow multiple LevelDB databases to share one FDB cluster with independent key spaces. Each New(fdb, namespace) call returns a storage implementation that is completely isolated from others.

This is how mvsqlite handles multiple SQLite “database files” — each is an FDB namespace.


12. Source Code Deep Dive — fdbstorage/storage.go

The Storage Struct

type Storage struct {
    db  fdb.Database
    ns  []byte
}

Minimal. db is the FDB connection; ns is the byte prefix for all keys. The entire storage is two fields. All complexity lives in the key encoding and transaction logic.

Key Encoding Helpers

func (s *Storage) metaKey(fd storage.FileDesc) fdb.Key {
    // ns + 0x01 + type(1 byte) + num(8 bytes big-endian)
    key := make([]byte, len(s.ns)+10)
    copy(key, s.ns)
    key[len(s.ns)] = 0x01
    key[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(key[len(s.ns)+2:], uint64(fd.Num))
    return fdb.Key(key)
}

func (s *Storage) chunkKey(fd storage.FileDesc, chunkNum int) fdb.Key {
    // ns + 0x02 + type(1 byte) + num(8 bytes) + chunk(8 bytes)
    key := make([]byte, len(s.ns)+18)
    copy(key, s.ns)
    key[len(s.ns)] = 0x02
    key[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(key[len(s.ns)+2:], uint64(fd.Num))
    binary.BigEndian.PutUint64(key[len(s.ns)+10:], uint64(chunkNum))
    return fdb.Key(key)
}

Why 8-byte big-endian for chunkNum? Chunk numbers are read back via GetRange which returns chunks in key order. Big-endian ensures key order equals chunk number order. If we used little-endian, chunk 256 (LE: 00 01 00 00 00 00 00 00) would sort before chunk 1 (LE: 01 00 00 00 00 00 00 00) — wrong.

The dataRange Helper

func (s *Storage) dataRange(fd storage.FileDesc) fdb.KeyRange {
    begin := s.chunkKey(fd, 0)
    // end: prefix + 0xFF×8 + 0x01, which sorts just after the largest possible chunk key
    endPrefix := make([]byte, len(s.ns)+10) // ns + 0x02 + type + num
    copy(endPrefix, s.ns)
    endPrefix[len(s.ns)] = 0x02
    endPrefix[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(endPrefix[len(s.ns)+2:], uint64(fd.Num))
    end := append(endPrefix, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF)
    return fdb.KeyRange{Begin: begin, End: fdb.Key(append(end, 0x01))}
}

This range covers all chunk keys for (type, num) regardless of chunkNum. GetRange(dataRange(fd)) fetches all chunks in order.
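To see that these bounds select exactly one file's chunks, the following sketch simulates a range scan over a sorted in-memory key list. It mirrors the key layout above but takes ns as a parameter, and scan and sortKeys are stand-ins for FDB's GetRange, not real binding calls.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"sort"
)

// filePrefix builds ns + 0x02 + type + num, shared by all chunks of one file.
func filePrefix(ns []byte, fileType byte, fileNum uint64) []byte {
	p := append([]byte(nil), ns...)
	p = append(p, 0x02, fileType)
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], fileNum)
	return append(p, n[:]...)
}

// chunkKey appends the big-endian chunk number to the file prefix.
func chunkKey(ns []byte, fileType byte, fileNum, chunkNum uint64) []byte {
	var c [8]byte
	binary.BigEndian.PutUint64(c[:], chunkNum)
	return append(filePrefix(ns, fileType, fileNum), c[:]...)
}

// dataRange returns [begin, end): begin is chunk 0, end is the largest
// possible chunk key with one extra byte appended.
func dataRange(ns []byte, fileType byte, fileNum uint64) (begin, end []byte) {
	begin = chunkKey(ns, fileType, fileNum, 0)
	end = append(chunkKey(ns, fileType, fileNum, ^uint64(0)), 0x01)
	return begin, end
}

// sortKeys orders keys lexicographically, as FDB would store them.
func sortKeys(keys [][]byte) {
	sort.Slice(keys, func(i, j int) bool { return bytes.Compare(keys[i], keys[j]) < 0 })
}

// scan returns the keys k in sorted with begin <= k < end.
func scan(sorted [][]byte, begin, end []byte) [][]byte {
	lo := sort.Search(len(sorted), func(i int) bool { return bytes.Compare(sorted[i], begin) >= 0 })
	hi := sort.Search(len(sorted), func(i int) bool { return bytes.Compare(sorted[i], end) >= 0 })
	return sorted[lo:hi]
}

func main() {
	ns := []byte("db/")
	keys := [][]byte{
		chunkKey(ns, 4, 8, 0), // a different file number
		chunkKey(ns, 4, 7, 2),
		chunkKey(ns, 4, 7, 0),
		chunkKey(ns, 4, 7, 1),
		chunkKey(ns, 2, 7, 0), // a different file type
	}
	sortKeys(keys)
	begin, end := dataRange(ns, 4, 7)
	fmt.Println("chunks found for (type 4, num 7):", len(scan(keys, begin, end)))
}
```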

The List() Implementation

func (s *Storage) List(ft storage.FileType) ([]storage.FileDesc, error) {
    v, err := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(s.metaRange(), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    var fds []storage.FileDesc
    for _, kv := range v.([]fdb.KeyValue) {
        var fd storage.FileDesc
        if err := msgpack.Unmarshal(kv.Value, &fd); err != nil {
            return nil, err
        }
        if fd.Type&ft != 0 { // ft is a bitmask of the wanted file types
            fds = append(fds, fd)
        }
    }
    return fds, nil
}

A single range scan over the meta key prefix returns all files. LevelDB calls List at startup to find all existing files. With FDB this is usually one round-trip for the modest number of files a small database has; a very large result may be fetched in several batches.

With a local filesystem, List() is an opendir/readdir syscall — also O(1) in latency, but I/O must go through the local disk controller. With FDB, the I/O goes to the closest FDB storage server over the network, with similar or lower latency than a rotational disk.

The Lock Implementation

func (s *Storage) Lock() (util.Releaser, error) {
    _, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        existing, err := tr.Get(s.lockKey()).Get()
        if err != nil {
            return nil, err // let Transact retry or surface the failure
        }
        if len(existing) > 0 {
            return nil, errors.New("storage: already locked")
        }
        tr.Set(s.lockKey(), []byte("locked"))
        return nil, nil
    })
    if err != nil {
        return nil, err
    }
    return util.ReleaserFunc(func() {
        s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
            tr.Clear(s.lockKey())
            return nil, nil
        })
    }), nil
}

The lock is an FDB key. Acquiring the lock means checking whether the key exists and, if not, setting it, all in one atomic transaction. This check-then-set is race-free because FDB’s optimistic concurrency ensures that if two processes both read “no lock” and both try to write “locked”, only one will commit (the other will conflict and retry, then find the lock held).

Limitation: This is a process-level lock, not a durable lease. If the lock-holding process crashes without calling Release(), the lock remains set until manually cleared. For production, use a lock with an expiry: store the lock as {holder: processID, expires: time.Now().Add(30*time.Second)} and have each lock holder refresh it periodically. A lock that isn’t refreshed is treated as expired.
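The expiry rule can be written as a pure decision function, which keeps it easy to test apart from FDB. This is a sketch only: lease and tryAcquire are hypothetical names, the 30-second TTL is the figure suggested above, and a real version would run the read, the decision, and the write inside one FDB transaction. It also assumes the participants' clocks roughly agree.

```go
package main

import (
	"fmt"
	"time"
)

// lease is the value stored under the lock key.
type lease struct {
	Holder  string
	Expires time.Time
}

// tryAcquire decides what the lock key should hold after holder asks for the
// lock at time now. cur is nil when the key is absent. The current holder may
// always renew; anyone else succeeds only once the lease has expired.
func tryAcquire(cur *lease, holder string, now time.Time, ttl time.Duration) (lease, bool) {
	if cur != nil && cur.Holder != holder && now.Before(cur.Expires) {
		return *cur, false // someone else holds a live lease
	}
	return lease{Holder: holder, Expires: now.Add(ttl)}, true
}

// demo replays a short sequence of attempts and records which ones succeed.
func demo() []bool {
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	ttl := 30 * time.Second
	var results []bool
	var cur *lease
	step := func(holder string, at time.Duration) {
		next, ok := tryAcquire(cur, holder, t0.Add(at), ttl)
		cur = &next
		results = append(results, ok)
	}
	step("A", 0)              // A acquires the free lock
	step("B", 10*time.Second) // B is refused: A's lease is live
	step("A", 20*time.Second) // A renews
	step("B", 40*time.Second) // B is refused: the renewal extended the lease
	step("B", 51*time.Second) // A stopped renewing, so B takes over
	return results
}

func main() {
	fmt.Println(demo())
}
```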


13. Production Considerations

13.1 Transaction Size for Large SSTables

With goleveldb’s defaults, SSTables at every level target CompactionTableSize (2 MiB). Raising CompactionTableSize or CompactionTableSizeMultiplier produces larger tables at deeper levels. If tables were allowed to grow to 64 MB, our current Rename would fail because it exceeds the 10 MB transaction limit.

Solution: For production, configure LevelDB’s CompactionTableSize to keep SSTables small:

opts := &opt.Options{
    CompactionTableSize: 2 * 1024 * 1024,  // 2 MB SSTables
}

2 MB SSTables = 32 chunks of 64 KiB. A Rename transaction writes those 32 chunks, about 2 MB of affected data. Well within limits.

13.2 Read Performance for Large SSTables

Reading a 2 MB SSTable requires fetching 32 chunks from FDB. Our current implementation reads them in one GetRange — one round-trip, 32 key-value pairs returned. Latency: ~1–5 ms (FDB cluster local read latency).

A local filesystem read of 2 MB: ~1–3 ms on SSD, ~10–20 ms on HDD.

For a warm FDB cluster, FDB storage is competitive with SSDs and dramatically better than spinning disks. For random chunk access (seeking within large files), FDB may be faster because it can pipeline multiple point reads, while a spinning disk requires physical seeking.

13.3 Write Amplification

Our chunking adds write amplification: writing a 64 KiB chunk requires writing the chunk key (18 bytes) + value (64 KiB) = 64 KiB + 18 bytes. The key overhead is <0.03%, negligible.

But FDB itself adds write amplification internally: each committed transaction is written to the Transaction Log (TLog), then asynchronously applied to Storage Servers. The TLog write is sequential (fast). The Storage Server write is to FDB’s B-tree (with its own write amplification). FDB’s overall write amplification is roughly 3–5x — comparable to RocksDB’s LSM write amplification.

13.4 Monitoring

Key metrics for a fdbstorage-backed LevelDB deployment:

  • FDB transaction latency P99: should be < 10ms for small transactions (meta reads)
  • FDB range scan bytes/second: correlates with compaction throughput
  • FDB conflict rate: if high, indicates concurrent compaction and write contention
  • LevelDB metrics via db.GetProperty("leveldb.stats"): still valid — LevelDB reports its own view of compaction and SSTable counts, just the “disk I/O” is actually FDB I/O

14. Interview Questions — Storage Abstractions and LSM Trees

Q: What is the purpose of the Rename operation in LevelDB’s storage interface, and how does your FDB implementation preserve its atomicity guarantee?

Rename is LevelDB’s way of atomically promoting a new SSTable (or MANIFEST) into production. During compaction, LevelDB writes the new SSTable to a temp file, then renames it to its final name. POSIX rename is atomic: either the old name or the new name is visible, never a half-written file. Our FDB implementation reads all chunks with the old file descriptor, writes them with the new file descriptor, and clears the old keys — all in one FDB transaction. FDB’s transaction atomicity provides the same guarantee: either the old keys or the new keys are visible, never both or neither.

Q: Why does LevelDB use a Write-Ahead Log, and is it still necessary when using FDB as the storage backend?

The WAL protects against crash scenarios where data was written to the in-memory MemTable but not yet flushed to an SSTable on disk. Without a WAL, a crash after the MemTable write but before the SSTable flush would lose those writes. With FDB as storage, our writer.Close() writes chunks to FDB in transactions. Each committed FDB transaction is durable (replicated to at least two machines). A crash after Close() returns has no data loss. The WAL’s durability purpose is already provided by FDB. A production implementation would use a no-op WAL to skip the overhead.

Q: What is the 10 MB transaction limit in FDB, and what design patterns avoid hitting it?

FDB limits each transaction to 10,000,000 bytes of affected data, which bounds the memory required on Commit Proxies and keeps transaction resolution fast. Written keys and values count toward the limit, as do the keys and ranges that were read; the values read do not. Patterns to stay within the limit: (1) chunk large values (as we do with 64 KiB chunks), (2) break bulk writes into multiple transactions with cursor-based pagination, (3) configure LevelDB’s compaction to keep SSTable sizes small (about 2 MB), (4) use FDB atomic operations (tr.Add, tr.SetVersionstampedKey) where possible, since they update a key without reading it first.

Q: How would you extend this implementation to support multiple concurrent LevelDB instances sharing the same FDB cluster?

Give each LevelDB instance its own ns prefix. The FDB key space is naturally partitioned: ns1 + ... keys and ns2 + ... keys are completely disjoint, provided no prefix is itself a prefix of another. Multiple instances can read and write concurrently with no coordination overhead. FDB’s conflict detection only fires when one transaction reads a key that another concurrently committed transaction wrote, and different namespaces touch different keys. The lock key (ns + tagLock) is also per-namespace, so locking one instance doesn’t affect others.