Option B — LevelDB on top of FoundationDB
Pattern: existing storage engine, FDB as the disk. We give the unmodified
goleveldb library an FDB-backed implementation of its storage.Storage interface. LevelDB still does its LSM thing (memtables, SSTables, compaction, MANIFEST), but every byte ends up in FDB key ranges instead of on local disk.
This is the mirror image of option-a: the storage engine sits above FDB rather than below it.
How LevelDB sees the world
goleveldb accesses persistence exclusively through a small interface:
    type Storage interface {
        Lock() (Locker, error)
        Log(str string)
        SetMeta(FileDesc) error
        GetMeta() (FileDesc, error)
        List(FileType) ([]FileDesc, error)
        Open(FileDesc) (Reader, error)
        Create(FileDesc) (Writer, error)
        Remove(FileDesc) error
        Rename(old, new FileDesc) error
        Close() error
    }
Each FileDesc is {Type, Num} — e.g. {TypeTable, 42} for SST #42 or
{TypeManifest, 7} for the 7th manifest. Filenames are an implementation
detail; LevelDB never looks at strings.
Our fdbstorage package implements that interface against FDB. The whole
file is ~250 lines.
Key layout
    <ns> 0x01 <ftype:1B> <num:int64 BE>                   -> uint64 BE file size
    <ns> 0x02 <ftype:1B> <num:int64 BE> <chunk:uint32 BE> -> 64 KiB chunk
    <ns> 0x03                                             -> current MANIFEST {ftype,num}
    <ns> 0x04                                             -> lock marker
- Files are split into 64 KiB chunks so we stay well under FDB’s 100,000-byte per-value limit and its 10 MB-per-transaction limit (both are hard limits).
- Create returns a Writer that buffers in memory and flushes on Sync/Close. We split the flush across multiple transactions (100 chunks each ≈ 6 MiB) to safely handle files larger than 10 MB.
- Rename is implemented as copy-then-clear inside one transaction. LevelDB only renames small files (temp → real on flush completion), so the inefficiency doesn’t matter.
- SetMeta writes the manifest pointer atomically. Because FDB transactions are serializable, two concurrent flushes can’t observe a half-rotated manifest.
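The chunking and batching arithmetic described above can be sketched without touching FDB at all. The helper names splitChunks and groupBatches are hypothetical, and the FDB writes themselves are left out: a real writer would presumably run each batch inside its own transaction.

```go
package main

import "fmt"

const (
	chunkSize    = 64 << 10 // 64 KiB per FDB value
	chunksPerTxn = 100      // 100 chunks is about 6.25 MiB per transaction
)

// splitChunks cuts data into chunkSize pieces; the last piece may be shorter.
func splitChunks(data []byte) [][]byte {
	var chunks [][]byte
	for len(data) > 0 {
		n := chunkSize
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

// groupBatches groups chunks into batches of at most chunksPerTxn,
// one batch per (hypothetical) FDB transaction.
func groupBatches(chunks [][]byte) [][][]byte {
	var out [][][]byte
	for len(chunks) > 0 {
		n := chunksPerTxn
		if len(chunks) < n {
			n = len(chunks)
		}
		out = append(out, chunks[:n])
		chunks = chunks[n:]
	}
	return out
}

func main() {
	file := make([]byte, 15<<20) // a 15 MiB file, larger than one transaction allows
	chunks := splitChunks(file)
	batches := groupBatches(chunks)
	fmt.Println(len(chunks), "chunks in", len(batches), "transactions")
	// prints "240 chunks in 3 transactions"
}
```

Since the flush spans several transactions, a file is not written atomically. One way to hide partial files, offered here only as an assumption about a reasonable design, is to write the size key in the last transaction and treat a file without a size key as absent.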
Why this is interesting
You get a real LevelDB instance — with bloom filters, compaction, snapshots, the works — whose durability story is “whatever FDB’s durability story is.” That means:
- Geo-replication and read scaling come for free from the FDB cluster.
- Backups are FDB backups.
- The local node has no on-disk state at all; it can crash and restart against a different FDB coordinator without losing anything.
The cost is latency: every SST read is at least one FDB round-trip, every flush is many. This isn’t a production architecture; it’s a teaching artifact that proves how cleanly the layers separate.
Running
    cd option-b-leveldb
    go mod tidy
    go run ./demo -cluster ../fdb.cluster
Expected output (the second session re-opens and reads the persisted data):
    First session: wrote 3 keys, then closed.
    Reopening LevelDB on the same FDB namespace...
    apple -> red
    banana -> yellow
    cherry -> red
    Iterating the whole DB:
    apple -> red
    banana -> yellow
    cherry -> red
What this implementation skips
- Locker isn’t robust to crashes: if a holder crashes, the lock key stays set. A production version would attach the lock to a client UUID and TTL it via FDB watches.
- Reader loads the whole file into memory. LevelDB SSTs are bounded (default 2 MB), so this is fine for a demo but not for huge tables.
- No caching layer. Every Open is a fresh FDB scan. A real implementation would cache hot SSTs.
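To make the caching point concrete, here is one possible shape for such a cache: a small LRU keyed by file descriptor with a byte budget. It is a sketch under assumptions, not part of fdbstorage. The fileDesc type is a local stand-in for goleveldb's {Type, Num} pair, the load callback stands in for the FDB scan, and the code is not safe for concurrent use.

```go
package main

import (
	"container/list"
	"fmt"
)

// fileDesc is a local stand-in for goleveldb's FileDesc ({Type, Num}).
type fileDesc struct {
	Type int
	Num  int64
}

type entry struct {
	fd   fileDesc
	data []byte
}

// fileCache is a byte-budgeted LRU cache of whole files.
type fileCache struct {
	budget int        // maximum total bytes to keep
	used   int        // bytes currently cached
	order  *list.List // front = most recently used
	items  map[fileDesc]*list.Element
}

func newFileCache(budget int) *fileCache {
	return &fileCache{budget: budget, order: list.New(), items: map[fileDesc]*list.Element{}}
}

// get returns the cached bytes for fd, calling load (the FDB scan) on a miss.
func (c *fileCache) get(fd fileDesc, load func(fileDesc) ([]byte, error)) ([]byte, error) {
	if el, ok := c.items[fd]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).data, nil
	}
	data, err := load(fd)
	if err != nil {
		return nil, err
	}
	c.items[fd] = c.order.PushFront(&entry{fd: fd, data: data})
	c.used += len(data)
	// Evict least recently used files until within budget, keeping the newest.
	for c.used > c.budget && c.order.Len() > 1 {
		el := c.order.Back()
		e := el.Value.(*entry)
		c.order.Remove(el)
		delete(c.items, e.fd)
		c.used -= len(e.data)
	}
	return data, nil
}

func main() {
	loads := 0
	load := func(fd fileDesc) ([]byte, error) {
		loads++
		return make([]byte, 2<<20), nil // pretend every SST is 2 MiB
	}
	c := newFileCache(4 << 20) // room for two such files
	for _, n := range []int64{1, 2, 1, 3, 2} {
		if _, err := c.get(fileDesc{Type: 4, Num: n}, load); err != nil {
			panic(err)
		}
	}
	fmt.Println("loads from FDB:", loads) // prints "loads from FDB: 4"
}
```

Caching whole files is safe for SSTs because LevelDB never modifies a table after writing it. A real version would still need to drop entries in Remove and Rename, and should probably skip mutable files such as the journal and manifest.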
Read fdbstorage/storage.go — the whole thing is one file deliberately, so you can follow the data flow end to end.