Hitchhiker’s Guide — Option B: LevelDB on FDB Storage
The question this answers: “Can I run a real LSM-tree storage engine — the actual LevelDB binary with all its compaction logic — with its files stored in FoundationDB instead of a local disk?”
The deeper question: “What is the storage.Storage interface, why does it exist, and what does it tell us about how databases handle file I/O?”
Table of Contents
- The Storage Abstraction: Why LevelDB Has a Plugin Point
- What LevelDB Actually Writes to Disk
- The storage.Storage Interface — Dissected
- How We Map LevelDB Files to FDB Keys
- Chunking: Overcoming the 100 KiB Value Limit
- Atomic Rename — Durability’s Secret Weapon
- The Writer: Batching Chunks into Transactions
- Why the WAL is Redundant With FDB
- The Blob Layer Pattern
- Real-World Analogues: RocksDB on Cloud Storage
- Exercises
1. The Storage Abstraction: Why LevelDB Has a Plugin Point
goleveldb’s storage.Storage interface mirrors the Env abstraction in the
original C++ LevelDB, which its authors (Jeff Dean and Sanjay Ghemawat)
designed for portability. Not every environment has a POSIX filesystem.
Google has internal systems where storage might be Bigtable, Colossus, or a
custom log-structured store.
The interface says: “if you can implement these 10 methods, LevelDB will run on your storage.” The application code (goleveldb) doesn’t know whether it’s writing to ext4, NTFS, GCS, or FDB; it just calls the interface.
// Signatures abridged; see the goleveldb storage package for the exact forms.
type Storage interface {
    Lock() (util.Releaser, error)
    Log(str string) // free-form logging hook, not a file operation
    Open(fd storage.FileDesc) (storage.Reader, error)
    Create(fd storage.FileDesc) (storage.Writer, error)
    Remove(fd storage.FileDesc) error
    Rename(oldfd, newfd storage.FileDesc) error
    GetMeta() (storage.FileDesc, error)
    SetMeta(fd storage.FileDesc) error
    List() ([]storage.FileDesc, error)
    Close() error
}
Our fdbstorage.Storage implements all of these, storing files as FDB
key-value pairs. LevelDB itself (in the syndtr/goleveldb package) calls
these methods. It has no idea the “files” are actually chunks in a distributed
database.
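To make the plugin-point idea concrete, here is a minimal sketch of an engine-side routine written against a storage interface, with an in-memory map standing in for the backend. FileStore, memStore, and compactInto are hypothetical names invented for this illustration; the interface is deliberately smaller than goleveldb's storage.Storage and is not a drop-in replacement for it.

```go
package main

import (
	"errors"
	"sort"
)

// FileStore is a hypothetical, reduced stand-in for a pluggable storage
// interface. It is NOT goleveldb's storage.Storage.
type FileStore interface {
	Put(name string, data []byte) error
	Get(name string) ([]byte, error)
	Rename(oldName, newName string) error
	Remove(name string) error
	List() []string
}

var errNotExist = errors.New("file does not exist")

// memStore keeps every "file" in a map. Code written against FileStore
// cannot tell it apart from a disk-backed or FDB-backed implementation.
type memStore struct{ files map[string][]byte }

func newMemStore() *memStore { return &memStore{files: map[string][]byte{}} }

func (m *memStore) Put(name string, data []byte) error {
	m.files[name] = append([]byte(nil), data...) // copy so callers can reuse data
	return nil
}

func (m *memStore) Get(name string) ([]byte, error) {
	data, ok := m.files[name]
	if !ok {
		return nil, errNotExist
	}
	return append([]byte(nil), data...), nil
}

func (m *memStore) Rename(oldName, newName string) error {
	data, ok := m.files[oldName]
	if !ok {
		return errNotExist
	}
	if oldName == newName {
		return nil
	}
	m.files[newName] = data
	delete(m.files, oldName)
	return nil
}

func (m *memStore) Remove(name string) error {
	delete(m.files, name)
	return nil
}

func (m *memStore) List() []string {
	names := make([]string, 0, len(m.files))
	for name := range m.files {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// compactInto is engine-side code that depends only on the interface:
// write under a temporary name, then rename into place.
func compactInto(fs FileStore, tmp, final string, data []byte) error {
	if err := fs.Put(tmp, data); err != nil {
		return err
	}
	return fs.Rename(tmp, final)
}
```

Swapping memStore for another implementation would leave compactInto unchanged, which is the property the storage interface is meant to provide.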
2. What LevelDB Actually Writes to Disk
To understand what we need to store, let’s look at what LevelDB writes:
File types:
TypeJournal (.log) — Write-Ahead Log: records every write before the
MemTable is flushed. Used to recover unflushed writes
after a crash.
TypeManifest (.MANIFEST) — Lists which SSTables are "live" (not yet
garbage-collected). Updated at each compaction.
TypeTable (.ldb / .sst) — Sorted String Tables. Immutable, sorted KV
data files produced by compaction.
TypeCurrent (CURRENT) — A single file containing the name of the latest
MANIFEST file.
TypeTemp (.tmp) — Temporary files used during compaction.
TypeLock (LOCK) — A file held open to prevent two processes from
opening the same database simultaneously.
File descriptor:
type FileDesc struct {
    Type FileType // TypeJournal, TypeManifest, etc.
    Num  int64    // unique file number (monotonically increasing)
}
A LevelDB database directory looks like:
000003.log ← journal (WAL)
000004.ldb ← SSTable level 0
000005.ldb ← SSTable level 0
MANIFEST-000002 ← current manifest
CURRENT ← "MANIFEST-000002\n"
LOCK ← lockfile
When compaction happens:
- LevelDB picks some SSTables, merges and sorts them into a new SSTable.
- It writes the new SSTable as a .tmp file (via Create(TypeTemp, ...)).
- It renames the .tmp to the final .ldb name (via Rename).
- It updates the MANIFEST to list the new SSTable and de-list the old ones.
- It removes the old SSTables (via Remove).
This is the temp-then-rename durability pattern: write the new file in full
under a temporary name, then rename it into place. POSIX rename is atomic:
either the old name or the new name is visible, never a partial file. Our FDB
implementation must replicate this property.
3. The storage.Storage Interface — Dissected
Let’s look at each method and what it does:
Lock() (util.Releaser, error)
Prevents two processes from opening the same database simultaneously. We
implement this by writing a “lock” key to FDB. The Releaser clears it.
Create(fd FileDesc) (Writer, error) and Open(fd FileDesc) (Reader, error)
Create starts a new file (for writing). Open opens an existing file (for
reading). In FDB terms: Create returns a writer that buffers bytes; Open
reads all chunks for the file into memory and returns a bytes.Reader.
Remove(fd FileDesc) error
Deletes a file. In FDB: ClearRange over all chunk keys for this file.
Rename(oldfd, newfd FileDesc) error
Renames a file atomically. In FDB: copy all chunks from oldfd keys to
newfd keys, then clear all oldfd keys — in one transaction. This is the
atomic rename.
GetMeta() (FileDesc, error) and SetMeta(fd FileDesc) error
Get/set the “current” file pointer, i.e. which MANIFEST is current. In FDB:
a single key (ns + tagManifest, where tagManifest is 0x03) stores the current
FileDesc. This replaces the CURRENT file in LevelDB’s original design.
List() ([]FileDesc, error)
List all files. We implement this as a range scan over the meta key prefix.
We store a meta key for each file alongside its data.
4. How We Map LevelDB Files to FDB Keys
Each LevelDB file is identified by (FileType, FileNum). We encode this as:
Meta key (file existence + type):
ns + tagFileMeta(0x01) + fileType(1 byte) + fileNum(8 bytes BE)
→ msgpack({Type: ft, Num: n, size: totalBytes})
Data chunks:
ns + tagFileData(0x02) + fileType(1 byte) + fileNum(8 bytes BE) + chunkNum(8 bytes BE)
→ up to 64 KiB of file data
Manifest pointer (replaces CURRENT file):
ns + tagManifest(0x03)
→ msgpack(FileDesc{Type: TypeManifest, Num: n})
Lock key:
ns + tagLock(0x04)
→ "locked" (any non-empty value means locked)
Why big-endian for file and chunk numbers?
FDB orders keys lexicographically by their bytes, and big-endian encoding
makes that byte order agree with numeric order. A range scan over
ns+tagFileData+ft+num+* therefore returns chunks in chunk-number order, which
is the correct order to reassemble the file. With little-endian encoding,
chunk 256 (00 01 00 00 00 00 00 00) would sort before chunk 1
(01 00 00 00 00 00 00 00). File numbers are encoded the same way so that
files of one type sort by number (LevelDB never reuses a file number).
File number and type as part of the key:
This means all chunks of file (TypeJournal, 3) sort together, before all
chunks of (TypeJournal, 4), which sort before (TypeTable, 5). Clean,
hierarchical key organization.
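The ordering claim can be checked without an FDB cluster. The sketch below, using only the Go standard library, compares bytewise ordering of big-endian and little-endian encodings against numeric ordering; encodeBE, encodeLE, and lexOrderMatchesNumeric are helper names made up for this example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"sort"
)

// encodeBE encodes n as 8 big-endian bytes, as the chunk keys do.
func encodeBE(n uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, n)
	return b
}

// encodeLE encodes n as 8 little-endian bytes, for comparison.
func encodeLE(n uint64) []byte {
	b := make([]byte, 8)
	binary.LittleEndian.PutUint64(b, n)
	return b
}

// lexOrderMatchesNumeric reports whether sorting the encoded keys bytewise
// (the order a range scan returns them in) matches sorting the numbers.
func lexOrderMatchesNumeric(encode func(uint64) []byte, nums []uint64) bool {
	sorted := append([]uint64(nil), nums...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	keys := make([][]byte, len(nums))
	for i, n := range nums {
		keys[i] = encode(n)
	}
	sort.Slice(keys, func(i, j int) bool { return bytes.Compare(keys[i], keys[j]) < 0 })

	for i, n := range sorted {
		if !bytes.Equal(keys[i], encode(n)) {
			return false
		}
	}
	return true
}
```

With chunk numbers such as 1, 2, 10, 255, 256, the big-endian encoding keeps the numeric order, while the little-endian encoding places 256 ahead of 1.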
5. Chunking: Overcoming the 100 KiB Value Limit
FDB has a hard limit: values may not exceed 100,000 bytes (just under 100 KiB). A typical LevelDB SSTable is 2–4 MB. We cannot store it in one FDB value.
Our solution: split each file into 64 KiB chunks:
const chunkSize = 64 * 1024 // 65,536 bytes
// Writing a 200 KiB (204,800-byte) file:
// Chunk 0: bytes [0, 65536)
// Chunk 1: bytes [65536, 131072)
// Chunk 2: bytes [131072, 196608)
// Chunk 3: bytes [196608, 204800) (partial last chunk)
Each chunk is stored as a separate FDB key-value pair:
ns+0x02+ft+num+00000000_00000000 → 65536 bytes
ns+0x02+ft+num+00000000_00000001 → 65536 bytes
ns+0x02+ft+num+00000000_00000002 → 65536 bytes
ns+0x02+ft+num+00000000_00000003 → 8192 bytes (partial)
Reading a file: range-scan all chunk keys for (ft, num), sort by chunk
number (already in order due to big-endian encoding), concatenate the values.
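// memReader adapts *bytes.Reader to storage.Reader (Read, Seek, ReadAt)
// by adding a no-op Close.
type memReader struct{ *bytes.Reader }

func (memReader) Close() error { return nil }

func (s *Storage) Open(fd storage.FileDesc) (storage.Reader, error) {
    res, err := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(s.dataRange(fd), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    var buf []byte
    for _, kv := range res.([]fdb.KeyValue) {
        buf = append(buf, kv.Value...) // chunks arrive in key order
    }
    return memReader{bytes.NewReader(buf)}, nil
}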
Why 64 KiB chunks?
- Smaller than FDB’s 100 KiB value limit ✓
- Large enough to minimize key overhead (a 4 MB SSTable = 64 chunks, not thousands)
- Aligns with filesystem block sizes (4–64 KiB typical)
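The chunking scheme in this section can be exercised on its own. The following is a minimal sketch of the split and reassemble steps, assuming the 64 KiB chunk size from the text; splitChunks and joinChunks are illustrative helpers, not functions from fdbstorage.

```go
package main

// chunkSize mirrors the 64 KiB constant used in the text.
const chunkSize = 64 * 1024

// splitChunks cuts data into consecutive slices of at most size bytes.
// The returned slices alias data; copy them if data will be reused.
func splitChunks(data []byte, size int) [][]byte {
	if size <= 0 {
		return nil
	}
	var chunks [][]byte
	for len(data) > 0 {
		n := size
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

// joinChunks concatenates chunks in order, the way Open reassembles a file
// after a range scan has returned the chunk values in key order.
func joinChunks(chunks [][]byte) []byte {
	var out []byte
	for _, c := range chunks {
		out = append(out, c...)
	}
	return out
}
```

Only the last chunk may be shorter than chunkSize, so a reader can compute any chunk's byte offset as its index times chunkSize.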
6. Atomic Rename — Durability’s Secret Weapon
POSIX rename(src, dst) is the single most important durability primitive in
filesystems. Its contract: after rename returns, dst exists and
src does not, with no window where neither exists. This is atomic
replacement.
LevelDB uses rename heavily:
- Rename(TypeTemp, n, TypeTable, n): promote temp SSTable to final name
- Rename(TypeTemp, n, TypeManifest, n): promote temp manifest
Without atomicity, a crash during rename could leave:
- Neither file existing → data loss
- Both files existing → ambiguity about which is current
- A partial file at dst → corruption
In FDB, we implement atomic rename as copy + clear in one transaction:
func (s *Storage) Rename(oldfd, newfd storage.FileDesc) error {
    // metaBytes: encoded meta value for newfd, prepared before the transaction.
    _, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // 1. Read all old chunks
        kvs, err := tr.GetRange(s.dataRange(oldfd), fdb.RangeOptions{}).GetSliceWithError()
        if err != nil {
            return nil, err // Transact retries retryable FDB errors
        }
        // 2. Clear old data and any stale chunks at the destination
        tr.ClearRange(s.dataRange(oldfd))
        tr.Clear(s.metaKey(oldfd))
        tr.ClearRange(s.dataRange(newfd))
        // 3. Write new data
        for _, kv := range kvs {
            newKey := s.translateChunkKey(kv.Key, oldfd, newfd)
            tr.Set(newKey, kv.Value)
        }
        tr.Set(s.metaKey(newfd), metaBytes)
        return nil, nil
    })
    return err
}
One transaction. The cluster either commits all of this (old is gone, new is present) or none of it (crash safety). The atomicity guarantee is identical to POSIX rename — and arguably stronger, since FDB replicates the commit across multiple machines before returning.
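The key-translation step of Rename can be modeled on a plain map. In this sketch, copy-on-write stands in for transaction atomicity: the function builds a new map and leaves the input untouched, so a caller either publishes the whole result or discards it. renamePrefix is a hypothetical helper, the string keys are simplified stand-ins for the binary keys above, and the sketch assumes neither prefix is a prefix of the other.

```go
package main

import "strings"

// renamePrefix returns a copy of kv in which every key under oldPrefix has
// been moved under newPrefix. Keys already under newPrefix are dropped,
// which mirrors clearing the destination range before writing. kv itself
// is not modified.
func renamePrefix(kv map[string][]byte, oldPrefix, newPrefix string) map[string][]byte {
	next := make(map[string][]byte, len(kv))
	for k, v := range kv {
		switch {
		case strings.HasPrefix(k, oldPrefix):
			// Translate the key: keep the chunk suffix, swap the file prefix.
			next[newPrefix+k[len(oldPrefix):]] = v
		case strings.HasPrefix(k, newPrefix):
			// Stale destination chunk: omitted from the result.
		default:
			next[k] = v
		}
	}
	return next
}
```

Dropping stale destination keys matters when the destination already holds a longer file: without it, leftover high-numbered chunks would be appended to the renamed file on the next read.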
The 10 MB transaction limit:
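FDB transactions are limited to 10,000,000 bytes of affected data. The keys and values a transaction writes count toward that limit, as do the keys and ranges it reads, but the values it reads do not. A transaction must also finish within about 5 seconds. Renaming a 4 MB SSTable writes about 4 MB of chunk data in one transaction, which is under the limit. But 8 MB SSTables would be risky.
Our Rename reads all chunks in the transaction (4 MB of values, which do not
count) and writes them all back (4 MB, which does). Safe for typical LevelDB files.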
For larger files, we’d need to either:
- Break the rename into multiple transactions (violating atomicity), or
- Use a two-phase approach: write new chunks in a first transaction, then atomically swap the meta key in a second transaction (using a PENDING state key as the “in-progress rename” marker).
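A rough size estimate helps decide when a single-transaction Rename is safe. The sketch below assumes, as the FDB documentation describes, that the size limit counts written keys and values rather than the values read. It ignores clear mutations and conflict-range keys, so treat the result as a lower bound. renameWriteBytes and fitsInOneTxn are hypothetical helpers, and the key and meta lengths are caller-supplied guesses.

```go
package main

// Sizes in bytes. txnLimit is the 10,000,000-byte transaction size cap;
// chunkSize is the 64 KiB chunk size from the text.
const (
	txnLimit  = 10_000_000
	chunkSize = 64 * 1024
)

// renameWriteBytes estimates the bytes a single-transaction Rename writes
// for a file of fileSize bytes: one key plus value per chunk, plus one
// meta key and value. keyLen and metaLen are assumptions for the estimate.
func renameWriteBytes(fileSize, keyLen, metaLen int64) int64 {
	chunks := (fileSize + chunkSize - 1) / chunkSize // round up
	return fileSize + chunks*keyLen + keyLen + metaLen
}

// fitsInOneTxn reports whether the estimate stays under safetyPercent
// percent of the transaction limit.
func fitsInOneTxn(fileSize, keyLen, metaLen, safetyPercent int64) bool {
	return renameWriteBytes(fileSize, keyLen, metaLen) <= txnLimit*safetyPercent/100
}
```

With an 80% safety budget, a 4 MiB file fits comfortably and an 8 MiB file does not, which matches the "8 MB would be risky" guidance above.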
7. The Writer: Batching Chunks into Transactions
When LevelDB writes a new SSTable, it calls Create(fd) which returns a
Writer. The writer accumulates bytes via Write(p []byte). When Close()
is called, we flush everything to FDB.
Batch size:
const maxChunksPerTx = 100 // 100 × 64 KiB = 6,553,600 bytes (about 6.5 MB) per transaction
We flush up to 100 chunks per FDB transaction. This stays within the 10 MB limit. A 20 MiB SSTable (320 chunks) would be flushed in 4 transactions: three of 100 chunks each and a final one of 20.
func (w *writer) flush(final bool) error {
    start := w.flushedChunks
    end := start + maxChunksPerTx
    if end > len(w.chunks) {
        end = len(w.chunks)
    }
    _, err := w.s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        for i := start; i < end; i++ {
            tr.Set(w.s.chunkKey(w.fd, i), w.chunks[i])
        }
        if final && end == len(w.chunks) {
            // Write the meta key only on the final flush.
            // metaBytes: encoded meta value for w.fd, prepared by the caller.
            tr.Set(w.s.metaKey(w.fd), metaBytes)
        }
        return nil, nil
    })
    if err != nil {
        return err // leave flushedChunks unchanged so the batch can be retried
    }
    w.flushedChunks = end
    return nil
}
The meta-key-last invariant:
We write the meta key (the file’s “directory entry”) only in the last batch
of chunks. This ensures that List() never returns a file whose chunks are
only partially written — the file is only “visible” once all its chunks exist.
This is the FDB equivalent of:
- Write all content to a temp file
- rename(temp, final) atomically
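The batching rule and the meta-key-last invariant can be expressed as a small planning function. This is a sketch under the assumptions in the text (a fixed cap on chunks per transaction); batch and planBatches are names made up for this example.

```go
package main

// batch describes one transaction's worth of chunk writes: chunks in
// [Start, End), plus the meta key when WriteMeta is true.
type batch struct {
	Start, End int
	WriteMeta  bool
}

// planBatches splits numChunks chunk writes into transactions of at most
// maxPerTx chunks and marks only the last one as carrying the meta key.
// An empty file still yields one batch so that its meta key gets written.
func planBatches(numChunks, maxPerTx int) []batch {
	if maxPerTx <= 0 {
		return nil
	}
	var plan []batch
	for start := 0; start < numChunks; start += maxPerTx {
		end := start + maxPerTx
		if end > numChunks {
			end = numChunks
		}
		plan = append(plan, batch{Start: start, End: end})
	}
	if len(plan) == 0 {
		plan = append(plan, batch{})
	}
	plan[len(plan)-1].WriteMeta = true
	return plan
}
```

For 320 chunks with a cap of 100, the plan has four batches and only the last, covering chunks 300 through 319, writes the meta key.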
8. Why the WAL is Redundant With FDB
LevelDB’s Write-Ahead Log (WAL / journal, TypeJournal) exists for one
reason: crash recovery. If the process crashes after writing to the in-memory
MemTable but before flushing the MemTable to an SSTable on disk, the WAL
is replayed to reconstruct the MemTable.
With FDB as the storage backend:
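Every write is durable once the writer has flushed it.
Our writer.Write buffers bytes in memory. Our writer.Close flushes to FDB
in transactions. Each FDB Transact call does not return until the commit is
confirmed by FDB’s replication protocol, so the data is on every replica the
redundancy mode requires (three in triple mode, which tolerates two machine
failures). A process crash after Close() returns means the data is safe.
On a local disk, the WAL is protecting against “data written to OS memory but
not yet on disk.” A committed FDB transaction has no such window. There is a
caveat: LevelDB keeps the journal open and appends to it, relying on the
writer’s Sync() rather than Close(), so a writer that only flushes on Close()
leaves journal records in process memory, where a crash loses them. The
redundancy argument holds only for bytes that have actually been committed to
FDB.
A production implementation could patch goleveldb to skip WAL writes
entirely, provided MemTable contents are made durable some other way.
goleveldb does not appear to expose an option for this;
Options.DisableSeeksCompaction is unrelated and only turns off seek-triggered
compaction. Skipping the WAL would reduce FDB key usage and should improve
write throughput, though the size of the gain needs to be measured.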
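For journal bytes to reach the backing store before Close, the writer needs a Sync that pushes whatever is buffered. The sketch below is one hedged way to do that. It assumes a hypothetical sink callback in place of the FDB transaction, and syncWriter is an illustrative type, not the writer in fdbstorage. A partially filled chunk is rewritten under the same index each time Sync is called.

```go
package main

const chunkSize = 64 * 1024

// sink receives chunk writes. In a real backend each call would become a
// tr.Set inside an FDB transaction. (Hypothetical type for this sketch.)
type sink func(index int, data []byte)

// syncWriter buffers appended bytes, hands full chunks to the sink as they
// fill, and pushes the partial tail chunk on Sync.
type syncWriter struct {
	put   sink
	tail  []byte // bytes of the current, not yet full, chunk
	index int    // chunk index that tail belongs to
	dirty bool   // tail holds bytes the sink has not seen
}

func (w *syncWriter) Write(p []byte) (int, error) {
	n := len(p)
	for len(p) > 0 {
		take := chunkSize - len(w.tail)
		if len(p) < take {
			take = len(p)
		}
		w.tail = append(w.tail, p[:take]...)
		p = p[take:]
		w.dirty = true
		if len(w.tail) == chunkSize {
			w.put(w.index, w.tail) // chunk is complete; hand it off
			w.index++
			w.tail = nil
			w.dirty = false
		}
	}
	return n, nil
}

// Sync rewrites the partial tail chunk so that everything written so far
// has been handed to the sink.
func (w *syncWriter) Sync() error {
	if w.dirty {
		w.put(w.index, append([]byte(nil), w.tail...))
		w.dirty = false
	}
	return nil
}
```

The cost of this design is that a frequently synced journal rewrites its tail chunk many times, so a smaller chunk size for journal files may be worth considering.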
9. The Blob Layer Pattern
FDB’s core team documented the “Blob Layer” pattern: storing binary blobs (arbitrary large byte arrays) in FDB by chunking them. Our file storage is an instance of this pattern.
The Blob Layer pattern:
blob_key + chunkNum → chunk_data
It solves the 100 KiB value limit while preserving atomic operations on the whole blob (via FDB transactions) and efficient byte-range access (read only the chunks you need, e.g., for seeking within a large file).
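Byte-range access comes down to index arithmetic: a read only needs the chunks that overlap the requested range. A minimal sketch, assuming the 64 KiB chunk size used throughout this guide; chunkSpan is a hypothetical helper.

```go
package main

const chunkSize = 64 * 1024

// chunkSpan returns the first and last chunk index needed to serve a read
// of length bytes starting at offset. ok is false when the request is
// empty or the offset is negative.
func chunkSpan(offset, length int64) (first, last int64, ok bool) {
	if offset < 0 || length <= 0 {
		return 0, 0, false
	}
	first = offset / chunkSize
	last = (offset + length - 1) / chunkSize // index of the final byte's chunk
	return first, last, true
}
```

A reader would then issue a range read from the key of chunk first through the key of chunk last, and trim the leading offset % chunkSize bytes from the first chunk.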
Applications:
- Store large media files (> 100 KiB) in FDB for atomic metadata-plus-content updates
- Store ML model weights alongside their metadata records
- Store serialized protocol buffers larger than 100 KiB
- Back any file system abstraction (exactly what we’re doing)
10. Real-World Analogues
RocksDB Remote Compaction (Project Titan, Ripple)
Meta (Facebook) runs RocksDB on distributed storage in some configurations. Their “Ripple” project stores RocksDB SSTables in a distributed block store (similar to HDFS or GFS). The storage interface they use is exactly the same concept: RocksDB writes “files” via an abstract interface; the implementation stores chunks in a distributed system.
TiKV on Disaggregated Storage
TiDB (PingCAP) is moving toward a disaggregated architecture where TiKV (which uses RocksDB internally) stores its SSTables in object storage (S3, GCS). The TiKV storage engine writes SSTables through an abstract file interface to S3. This is identical to our pattern.
Pebble (CockroachDB)
CockroachDB replaced RocksDB with Pebble (a Go implementation) in 2021.
Pebble has a vfs.FS interface — a virtual filesystem abstraction — that
allows swapping the storage backend. CockroachDB uses this for testing (an
in-memory FS) and is exploring using it for cloud storage.
The Pattern’s Universality
Every LSM-tree engine eventually adds a pluggable storage interface:
- LevelDB: storage.Storage
- RocksDB: Env (virtual filesystem)
- Pebble: vfs.FS
- WiredTiger: WT_FILE_SYSTEM
Why? Because running the compaction engine without worrying about where data lives is architecturally clean. The engine is responsible for LSM semantics; the storage interface is responsible for durability. Separation of concerns.
11. Exercises
Exercise 1 — Streaming Reader
Instead of materializing the entire file into memory in Open(), return an
io.ReadSeekCloser that fetches chunks lazily. A read at offset 128 KiB should
only fetch chunks 2–3, not chunk 0 and 1.
This reduces memory usage for large SSTables and enables efficient
Seek(offset, io.SeekStart) for random-access reads.
Exercise 2 — File Size Cache
List() currently returns all file descriptors by scanning the meta keys.
Open(fd) reads the meta key to get the file size, then reads all chunk keys.
Add a small in-memory LRU cache mapping FileDesc → size. On Open, check
the cache first. Invalidate the cache entry on Remove and Rename.
Measure the reduction in FDB round-trips for a workload with many small reads on recently-opened files.
Exercise 3 — Compression
Before storing each 64 KiB chunk, compress it with compress/flate or
github.com/golang/snappy. Store a compression-type byte in the meta key.
On read, decompress transparently.
LevelDB SSTables are already internally compressed (Snappy by default), so this may not reduce size much for TypeTable files. But TypeJournal files are not compressed and might benefit.
Exercise 4 — Two-Phase Large Rename
For files approaching 10 MB (whose chunk writes would exceed the transaction
size limit in our current Rename), implement the two-phase rename:
Phase 1: Write all new chunks in multiple transactions. Write a
“rename-pending” key: ns+tagPending+oldfd → newfd.
Phase 2: In one transaction, atomically: clear the pending key, clear all old chunks and meta, set the meta for new fd (chunks already exist).
On startup, check for any pending keys and complete or roll back the rename. This is essentially a two-phase commit for large file renames.
Exercise 5 — Multi-Tenant Databases
Add a namespace concept: allow multiple LevelDB databases to share one FDB
cluster with independent key spaces. Each New(fdb, namespace) call returns
a storage implementation that is completely isolated from others.
This is how mvsqlite handles multiple SQLite “database files” — each is an FDB namespace.
12. Source Code Deep Dive — fdbstorage/storage.go
The Storage Struct
type Storage struct {
    db fdb.Database
    ns []byte
}
Minimal. db is the FDB connection; ns is the byte prefix for all keys. The entire storage is two fields. All complexity lives in the key encoding and transaction logic.
Key Encoding Helpers
func (s *Storage) metaKey(fd storage.FileDesc) fdb.Key {
    // ns + 0x01 + type(1 byte) + num(8 bytes big-endian)
    key := make([]byte, len(s.ns)+10)
    copy(key, s.ns)
    key[len(s.ns)] = 0x01
    key[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(key[len(s.ns)+2:], uint64(fd.Num))
    return fdb.Key(key)
}

func (s *Storage) chunkKey(fd storage.FileDesc, chunkNum int) fdb.Key {
    // ns + 0x02 + type(1 byte) + num(8 bytes) + chunk(8 bytes)
    key := make([]byte, len(s.ns)+18)
    copy(key, s.ns)
    key[len(s.ns)] = 0x02
    key[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(key[len(s.ns)+2:], uint64(fd.Num))
    binary.BigEndian.PutUint64(key[len(s.ns)+10:], uint64(chunkNum))
    return fdb.Key(key)
}
Why 8-byte big-endian for chunkNum? Chunk numbers are read back via GetRange which returns chunks in key order. Big-endian ensures key order equals chunk number order. If we used little-endian, chunk 256 (LE: 00 01 00 00 00 00 00 00) would sort before chunk 1 (LE: 01 00 00 00 00 00 00 00) — wrong.
The dataRange Helper
func (s *Storage) dataRange(fd storage.FileDesc) fdb.KeyRange {
    begin := s.chunkKey(fd, 0)
    // end: file prefix + 8 bytes of 0xFF + 0x01, which sorts just after the
    // largest possible chunk key for this file (the range end is exclusive)
    endPrefix := make([]byte, len(s.ns)+10) // ns + 0x02 + type + num
    copy(endPrefix, s.ns)
    endPrefix[len(s.ns)] = 0x02
    endPrefix[len(s.ns)+1] = byte(fd.Type)
    binary.BigEndian.PutUint64(endPrefix[len(s.ns)+2:], uint64(fd.Num))
    end := append(endPrefix, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF)
    return fdb.KeyRange{Begin: begin, End: fdb.Key(append(end, 0x01))}
}
This range covers all chunk keys for (type, num) regardless of chunkNum. GetRange(dataRange(fd)) fetches all chunks in order.
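A more common way to cover "every key with this prefix" is to compute the smallest key that sorts after all of them. The FDB Go binding offers fdb.PrefixRange for this purpose. The standard-library sketch below shows the underlying idea only, and prefixEnd is a hypothetical helper written for this illustration.

```go
package main

// prefixEnd returns the smallest key that sorts after every key beginning
// with prefix, so the half-open range [prefix, prefixEnd(prefix)) covers
// exactly those keys. It returns nil when no such key exists (an empty
// prefix or one made only of 0xFF bytes).
func prefixEnd(prefix []byte) []byte {
	end := append([]byte(nil), prefix...) // work on a copy
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] != 0xFF {
			end[i]++
			return end[:i+1] // drop any trailing 0xFF bytes
		}
	}
	return nil
}
```

Applied to the 10-byte file prefix (ns + 0x02 + type + num), this gives a range end that does not depend on how wide the chunk-number field is.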
The List() Implementation
func (s *Storage) List() ([]storage.FileDesc, error) {
    res, err := s.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        return rt.GetRange(s.metaRange(), fdb.RangeOptions{}).GetSliceWithError()
    })
    if err != nil {
        return nil, err
    }
    var fds []storage.FileDesc
    for _, kv := range res.([]fdb.KeyValue) {
        var fd storage.FileDesc
        if err := msgpack.Unmarshal(kv.Value, &fd); err != nil {
            return nil, err // a corrupt meta value should not be skipped silently
        }
        fds = append(fds, fd)
    }
    return fds, nil
}
A single range scan over all meta keys returns all files. LevelDB calls List() at startup to find all existing files. For a database with a modest number of files this is typically one round-trip; the client fetches larger ranges in batches, so very large listings take a few more.
With a local filesystem, List() is an opendir/readdir syscall — also O(1) in latency, but I/O must go through the local disk controller. With FDB, the I/O goes to the closest FDB storage server over the network, with similar or lower latency than a rotational disk.
The Lock Implementation
func (s *Storage) Lock() (util.Releaser, error) {
    _, err := s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        existing, err := tr.Get(s.lockKey()).Get()
        if err != nil {
            return nil, err
        }
        if len(existing) > 0 {
            return nil, errors.New("storage: already locked")
        }
        tr.Set(s.lockKey(), []byte("locked"))
        return nil, nil
    })
    if err != nil {
        return nil, err
    }
    return util.ReleaserFunc(func() {
        // The error is dropped here because Release has no way to report it.
        s.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
            tr.Clear(s.lockKey())
            return nil, nil
        })
    }), nil
}
The lock is a FDB key. Acquiring the lock: check if the key exists; if not, set it — in one atomic transaction. This check-then-set is race-free because FDB’s optimistic concurrency ensures that if two processes both read “no lock” and both try to write “locked”, only one will commit (the other will conflict and retry, then find the lock held).
Limitation: This is a process-level lock, not a durable lease. If the lock-holding process crashes without calling Release(), the lock remains set until manually cleared. For production, use a lock with an expiry: store the lock as {holder: processID, expires: time.Now().Add(30*time.Second)} and have each lock holder refresh it periodically. A lock that isn’t refreshed is treated as expired.
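The expiry rule described above fits in one pure function, which makes it easy to test apart from FDB. This is a sketch only: lease and tryAcquire are hypothetical names, times are plain Unix seconds, and in a real implementation the read of the current lease and the write of the new one would have to happen inside a single FDB transaction. Clock skew between machines is not addressed here.

```go
package main

// lease is the hypothetical value stored under the lock key.
type lease struct {
	Holder      string
	ExpiresUnix int64 // expiry time as Unix seconds
}

// tryAcquire decides what to store under the lock key. cur is the lease
// currently stored (nil if the key is absent). It returns the lease to
// write and true when holder may take or refresh the lock; otherwise it
// returns the existing lease and false.
func tryAcquire(cur *lease, holder string, nowUnix, ttlSeconds int64) (lease, bool) {
	if cur != nil && cur.Holder != holder && nowUnix < cur.ExpiresUnix {
		return *cur, false // held by someone else and not yet expired
	}
	return lease{Holder: holder, ExpiresUnix: nowUnix + ttlSeconds}, true
}
```

A holder refreshes by calling tryAcquire with its own name before the lease expires; once it stops refreshing, any other process succeeds after the expiry time passes.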
13. Production Considerations
13.1 Transaction Size for Large SSTables
LevelDB level-0 SSTables are 2–4 MB. Level-1 SSTables are larger (up to L1_target_size, configurable). For a L1_target_size of 64 MB, level-1 SSTables are 64 MB each. Our current Rename would fail for files this large (exceeds the 10 MB transaction limit).
Solution: For production, configure LevelDB’s CompactionTableSize to keep SSTables small:
opts := &opt.Options{
    CompactionTableSize: 2 * 1024 * 1024, // 2 MB SSTables
}
2 MB SSTables = 32 chunks of 64 KiB. Rename transaction: 32 chunk reads plus 32 chunk writes, of which only the roughly 2 MB written (plus keys) counts toward the size limit. Well within limits.
13.2 Read Performance for Large SSTables
Reading a 2 MB SSTable requires fetching 32 chunks from FDB. Our current implementation reads them in one GetRange call that returns 32 key-value pairs (the client may fetch a range this size in more than one batch). Latency: roughly 1–5 ms for a cluster-local read, as an order-of-magnitude estimate.
A local filesystem read of 2 MB: ~1–3 ms on SSD, ~10–20 ms on HDD.
For a warm FDB cluster, FDB storage is competitive with SSDs and dramatically better than spinning disks. For random chunk access (seeking within large files), FDB may be faster because it can pipeline multiple point reads, while a spinning disk requires physical seeking.
13.3 Write Amplification
Our chunking adds write amplification: writing a 64 KiB chunk requires writing the chunk key (18 bytes) + value (64 KiB) = 64 KiB + 18 bytes. The key overhead is <0.03%, negligible.
But FDB itself adds write amplification internally: each committed transaction is written to the Transaction Log (TLog), then asynchronously applied to Storage Servers. The TLog write is sequential (fast). The Storage Server write is to FDB’s B-tree (with its own write amplification). FDB’s overall write amplification is roughly 3–5x — comparable to RocksDB’s LSM write amplification.
13.4 Monitoring
Key metrics for a fdbstorage-backed LevelDB deployment:
- FDB transaction latency P99: should be < 10ms for small transactions (meta reads)
- FDB range scan bytes/second: correlates with compaction throughput
- FDB conflict rate: if high, indicates concurrent compaction and write contention
- LevelDB metrics via db.GetProperty("leveldb.stats"): still valid. LevelDB reports its own view of compaction and SSTable counts; the “disk I/O” it measures is actually FDB I/O
14. Interview Questions — Storage Abstractions and LSM Trees
Q: What is the purpose of the Rename operation in LevelDB’s storage interface, and how does your FDB implementation preserve its atomicity guarantee?
Rename is LevelDB’s way of atomically promoting a new SSTable (or MANIFEST) into production. During compaction, LevelDB writes the new SSTable to a temp file, then renames it to its final name. POSIX rename is atomic: either the old name or the new name is visible, never a half-written file. Our FDB implementation reads all chunks with the old file descriptor, writes them with the new file descriptor, and clears the old keys — all in one FDB transaction. FDB’s transaction atomicity provides the same guarantee: either the old keys or the new keys are visible, never both or neither.
Q: Why does LevelDB use a Write-Ahead Log, and is it still necessary when using FDB as the storage backend?
The WAL protects against crash scenarios where data was written to the in-memory MemTable but not yet flushed to an SSTable on disk. Without a WAL, a crash after the MemTable write but before the SSTable flush would lose those writes. With FDB as storage, our writer.Close() writes chunks to FDB in transactions. Each committed FDB transaction is durable (replicated to at least two machines). A crash after Close() returns has no data loss. The WAL’s durability purpose is already provided by FDB. A production implementation would use a no-op WAL to skip the overhead.
Q: What is the 10 MB transaction limit in FDB, and what design patterns avoid hitting it?
FDB limits the size of a transaction to 10,000,000 bytes of affected data: the keys and values it writes plus the keys and ranges it reads (values read do not count). Transactions are also limited to about 5 seconds. These limits bound the memory and work needed to resolve and commit a transaction. Patterns to stay within them: (1) chunk large values (as we do with 64 KiB chunks), (2) break bulk writes into multiple transactions with cursor-based pagination, (3) configure LevelDB’s compaction to keep SSTable sizes small (< 2 MB), (4) use FDB atomic operations (tr.Add, tr.SetVersionstampedKey) where possible, since they update a key without reading it and so add no read conflict range.
Q: How would you extend this implementation to support multiple concurrent LevelDB instances sharing the same FDB cluster?
Give each LevelDB instance its own ns prefix. The FDB key space is naturally partitioned: ns1 + ... keys and ns2 + ... keys are completely disjoint, as long as no ns is itself a prefix of another. Multiple instances can read and write concurrently with no coordination overhead: FDB’s conflict detection only fires when a transaction has read a key that another transaction wrote concurrently, and different namespaces touch different keys. The lock key (ns + tagLock) is also per-namespace, so locking one instance doesn’t affect others.
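One detail worth checking is that namespace prefixes do not nest: with raw names, every key under "ab" also starts with "a", so a range scan over namespace "a" would see the other database's keys. FDB's tuple and directory layers are the usual tools for this. The sketch below shows a simpler length-prefix scheme under the assumption that names are at most 255 bytes; nsPrefix and overlaps are hypothetical helpers.

```go
package main

import "bytes"

// nsPrefix builds a key prefix for a namespace name: one length byte
// followed by the name. Prefixes for different names can never nest,
// because names of different lengths differ in the first byte and names
// of equal length differ somewhere in the body. Names longer than 255
// bytes are rejected.
func nsPrefix(name string) ([]byte, bool) {
	if len(name) > 255 {
		return nil, false
	}
	return append([]byte{byte(len(name))}, name...), true
}

// overlaps reports whether one prefix is a prefix of the other, which
// would let the two namespaces see each other's keys in a range scan.
func overlaps(a, b []byte) bool {
	return bytes.HasPrefix(a, b) || bytes.HasPrefix(b, a)
}
```

The same check can run at startup over all configured namespaces to catch a misconfiguration before any data is written.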