Hitchhiker’s Guide — Option B: SQLite VFS over FoundationDB

The question this answers: “How does SQLite actually read and write its database file? Can we replace the file with FoundationDB?”

The deeper question: “What is a database page, how does the pager work, and why is the page model the right abstraction for a VFS?”

SQLite’s Architecture — From SQL to Bytes
The Virtual File System (VFS) — SQLite’s Plugin Point
The Page Model — How SQLite Organizes Storage
Our pagestore — FDB as a Page Store
The Partial-Page Write Problem — And Its Atomic Solution
xSync is a No-Op — And Why That’s Correct
Locking — From POSIX Flock to FDB Keys
Journal Modes: DELETE, WAL, MEMORY
mvsqlite — The Production Version of This
Real-World Analogues: libSQL, Litestream, LiteFS
Exercises

1. SQLite’s Architecture — From SQL to Bytes

SQLite processes queries through seven layers:

SQL text
   ↓ Tokenizer + Parser
   Abstract Syntax Tree (AST)
   ↓ Code Generator
   Bytecode program (VDBE instructions)
   ↓ Virtual Database Engine (VDBE)
   B-tree operations (seek, insert, delete on B-tree pages)
   ↓ Pager
   Page cache: reads/writes logical page numbers
   ↓ OS Interface (VFS)
   File reads/writes at byte offsets
   ↓ Actual storage

We plug in at the VFS layer. Everything above (SQL parsing, the B-tree, the pager) runs as normal. The VFS is called for all I/O, and our implementation redirects those calls to FDB.

The key insight: SQLite doesn’t know or care whether the VFS talks to a local file, a network mount, or a distributed database. It only needs the VFS contract to be upheld.

2. The Virtual File System (VFS) — SQLite’s Plugin Point

The VFS is documented in SQLite’s C API. The key functions (C signatures simplified):

// File operations (called on each open file):
int xRead(file, buf, amount, offset);   // read `amount` bytes at `offset`
int xWrite(file, buf, amount, offset);  // write `amount` bytes at `offset`
int xTruncate(file, size);              // truncate to `size` bytes
int xSync(file, flags);                 // flush to durable storage
int xFileSize(file, *size);             // return current size
int xLock(file, locktype);             // acquire SHARED/EXCLUSIVE/etc
int xUnlock(file, locktype);           // release lock
int xCheckReservedLock(file, *result); // is anyone holding RESERVED lock?
int xShmMap(file, region, size, ...);  // map shared memory region (WAL mode)

Our Go pagestore implements the subset needed for non-WAL mode: ReadAt, WriteAt, Truncate, FileSize, Lock, Unlock.

Why a Go struct instead of a C VFS?

A real SQLite VFS requires implementing C structs (sqlite3_vfs, sqlite3_file) and using CGO to register them. This is correct but adds ~200 lines of glue code that would obscure the core concept. Our pagestore is a Go struct that exercises the exact same I/O patterns as SQLite would use. The demo calls ReadAt/WriteAt directly, simulating what the SQLite pager would call through the VFS.

3. The Page Model — How SQLite Organizes Storage

SQLite’s database file is divided into fixed-size pages (default 4096 bytes). Every structure in a SQLite database — B-tree interior nodes, B-tree leaf nodes, overflow pages, free-list pages — is exactly one page. The pager manages a cache of these pages and issues I/O to the VFS in page-aligned operations.

SQLite database file layout:
  Offset  0 – 4095:   Page 1 (database header + B-tree root)
  Offset  4096 – 8191:  Page 2
  Offset  8192 – 12287: Page 3
  ...
  Offset (N-1)*4096 – N*4096-1: Page N

Page 1 is special: its first 100 bytes are the database header:

Offset  0: "SQLite format 3\000" (16 bytes, magic string)
Offset 16: page size (2 bytes)
Offset 18: file format write version (1 byte)
Offset 19: file format read version (1 byte)
...
Offset 28: file change counter (4 bytes)
...
Offset 52: schema format number (4 bytes)
...

The partial-page write problem arises here: the header is 100 bytes, but SQLite might write just the header (100 bytes at offset 0), while the rest of page 1 (bytes 100–4095) contains B-tree data. An xWrite(file, header, 100, 0) call writes only 100 bytes — not the full page.

4. Our pagestore — FDB as a Page Store

We store pages as FDB keys:

const PageSize = 4096

// Page data:
//   ns + 0x01 + pageNum(uint64 big-endian)  →  4096 bytes

// Database size (in pages):
//   ns + 0x00  →  uint64 (big-endian)

// Exclusive lock:
//   ns + 0x02  →  "locked" (any non-empty value)

Page numbers:

SQLite page numbers are 1-based (page 1 is the first page). We store them as 0-padded big-endian uint64s:

Page 1:  ns + 0x01 + 0000000000000001
Page 2:  ns + 0x01 + 0000000000000002
Page 1000: ns + 0x01 + 00000000000003E8

Big-endian encoding ensures that a GetRange over ns+0x01+0 to ns+0x01+FFFFFFFFFFFFFFFF returns pages in ascending page-number order — useful for scanning the entire database.

Size tracking:

SQLite calls xFileSize to determine how many pages exist. We track this explicitly with a size key rather than scanning for the highest-numbered page key (which would be an O(N) scan). The size key is updated atomically in the same transaction as every WriteAt and Truncate.

5. The Partial-Page Write Problem — And Its Atomic Solution

This is the trickiest part of implementing a page-based storage engine.

SQLite’s VFS contract says:

xWrite(offset, amount, data):
  Write `amount` bytes starting at `offset`.
  `offset` and `amount` may be any values (not necessarily page-aligned).

Examples of partial-page writes:

Write header (100 bytes at offset 0) to page 1
Write a record that spans a 4-byte boundary at the end of a page
Write a single integer (4 bytes at offset 12) to a page

We cannot simply map WriteAt(offset, data) to Set(pageKey(offset/4096), data) because that would overwrite the entire 4096-byte page with only the partial data — corrupting the rest of the page.

Our solution: read-modify-write inside one transaction.

func (p *PageStore) WriteAt(data []byte, off int64) (int, error) {
    firstPage := off / PageSize
    lastPage := (off + int64(len(data)) - 1) / PageSize

    _, err := p.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        for pageNum := firstPage; pageNum <= lastPage; pageNum++ {
            // 1. Read current page content (or zeros if it doesn't exist yet)
            existing, _ := tr.Get(p.pageKey(pageNum)).Get()
            page := make([]byte, PageSize)
            copy(page, existing)  // pad to 4096 with zeros if shorter

            // 2. Compute which bytes of `data` go into this page
            pageStart := pageNum * PageSize
            pageEnd := pageStart + PageSize
            writeStart := max(off, pageStart) - pageStart
            writeEnd := min(off+int64(len(data)), pageEnd) - pageStart
            dataStart := max(off, pageStart) - off
            dataEnd := min(off+int64(len(data)), pageEnd) - off

            // 3. Overlay the new bytes onto the existing page
            copy(page[writeStart:writeEnd], data[dataStart:dataEnd])

            // 4. Write the modified page back
            tr.Set(p.pageKey(pageNum), page)
        }
        // 5. Update size key if this write extends the file
        ...
        return nil, nil
    })
    return len(data), err
}

Steps 1–4 happen inside one FDB transaction. The read and the write are atomic: no concurrent writer can modify the page between our read and write. No partial update is visible to other transactions.

The atomicity guarantee:

If the process crashes after step 1 (read) but before step 4 (write), the transaction is never committed. FDB abandons it. The page is unchanged. No corruption.

If two concurrent writers both try to read-modify-write the same page, FDB’s conflict detection will cause one to retry. The second writer will re-read the page (now including the first writer’s change) and overlay its own data on top. Correct behavior, no coordination required.

6. xSync is a No-Op — And Why That’s Correct

SQLite calls xSync to tell the VFS: “make sure all previous writes are durable on physical storage before returning.” On a normal filesystem, this calls fsync() which flushes the OS page cache to the disk controller and waits for the disk to confirm durability.

Why do we implement it as a no-op?

FDB’s Transact is synchronous and durable.

When WriteAt calls p.db.Transact(...) and it returns without error, the data is already:

Written to FDB’s transaction log on f+1 machines (durably replicated)
Committed (visible to new transactions)

By the time WriteAt returns to the SQLite pager, the durability guarantee is already satisfied — stronger than fsync on a local disk.

xSync is a no-op because the work it would do has already been done during xWrite. There is nothing left to flush.

Analogy: Imagine a bank transfer. xWrite is “debit Alice, credit Bob, commit to ledger.” xSync is “make sure the ledger is durable.” If the ledger is already in a replicated distributed database, xSync has nothing to do.

7. Locking — From POSIX Flock to FDB Keys

SQLite uses file locking to coordinate concurrent access. The POSIX lock levels, in order of increasing exclusivity:

UNLOCKED → SHARED → RESERVED → PENDING → EXCLUSIVE

SHARED: reader has this. Many readers can hold SHARED simultaneously.
RESERVED: writer intending to modify. One writer can hold RESERVED while readers continue holding SHARED.
PENDING: writer waiting for all SHARED holders to release.
EXCLUSIVE: sole writer, no readers. Required before writing pages.

A full VFS implementation would map these levels to FDB lock keys with appropriate semantics (e.g., a counter for SHARED, a flag for EXCLUSIVE).

Our implementation is simplified: we use one “exclusive lock” key. Setting it means exclusive ownership; clearing it means unlocked. This correctly implements single-writer semantics but doesn’t allow concurrent readers.

For multi-reader/single-writer, a production implementation would:

SHARED lock: counter key (increment on lock, decrement on unlock)
EXCLUSIVE lock: flag key
  → acquire EXCLUSIVE: check counter == 0, set flag key
  → acquire SHARED: check flag key is unset, increment counter

Both checks-and-sets would use FDB transactions to ensure atomicity.

8. Journal Modes: DELETE, WAL, MEMORY

SQLite has several journal modes, selected with PRAGMA journal_mode=MODE. The journal is SQLite’s crash recovery mechanism (distinct from FDB’s own recovery).

DELETE (default rollback journal): Before modifying a page, SQLite copies the original page content to a separate journal file. If a crash occurs during a write, SQLite replays the journal to restore the original page content.

With FDB, this is redundant: FDB’s transactions provide rollback for free. A partial WriteAt that crashes mid-transaction rolls back automatically. The journal file would be stored in FDB via our VFS, adding overhead for no benefit.

WAL (Write-Ahead Log): Instead of copying original pages before modification, SQLite appends new page versions to a WAL file. Readers check the WAL before reading the main database file. A “checkpoint” operation copies WAL pages back to the main file.

WAL mode requires xShmMap — shared memory for the WAL index. This is a memory-mapped file that multiple processes share to coordinate which WAL frames are valid. Implementing xShmMap in FDB would require:

A small FDB key range to store the WAL index state.
Synchronized access to that state via FDB transactions. This is complex but possible — mvsqlite does it.

MEMORY: SQLite uses an in-memory journal, not written to any file. Rollback is possible within a transaction but not after a crash.

For our FDB VFS, PRAGMA journal_mode=MEMORY is the right choice: SQLite won’t try to create journal files (which would trigger extra VFS calls), and crash recovery is handled by FDB. This is what our demo uses implicitly by not specifying a journal mode (we would need to open the database and run PRAGMA journal_mode=MEMORY before any writes).

9. mvsqlite — The Production Version of This

mvsqlite (by losfair, Rust) is the production-grade implementation of the same idea. Its architecture:

Namespace mapping: Each SQLite “database file” path is mapped to an FDB namespace prefix. Multiple processes can open the same “file” (namespace) simultaneously with MVCC isolation.

Page-level MVCC: mvsqlite keeps multiple versions of each page, similar to how PostgreSQL keeps multiple row versions. When a reader opens the database, it captures an FDB read version. Page reads are served from FDB at that version. New writes create new page versions. Old versions are kept until all readers that need them complete.

Our pagestore is a simplified, non-MVCC version: every read sees the latest page version. Adding page-level MVCC would require:

ns + 0x01 + pageNum + version  →  4096 bytes  (instead of just pageNum)

With a cleanup process that removes old versions when no readers hold them.

Write-ahead log in FDB: mvsqlite implements WAL mode by storing the WAL log pages in FDB itself:

ns + 0x03 + txnId + pageNum  →  4096 bytes  (uncommitted WAL pages)
ns + 0x04 + txnId           →  commit record

This allows SQLite’s WAL mode to work without xShmMap (the shared memory region) — instead, the WAL index state lives in FDB and is accessed via transactions.

10. Real-World Analogues

libSQL (Turso)

libSQL is a fork of SQLite that adds multi-tenancy and remote storage. Turso’s cloud product stores SQLite databases in a distributed object store (similar to our FDB approach). The architecture:

Local replicas for low-latency reads
A primary writes to the distributed store
Followers pull changes from the store and apply them locally

Litestream

Litestream continuously replicates SQLite databases to S3 or other cloud storage by intercepting the WAL. It monitors the SQLite WAL file and copies new frames to cloud storage in near-real-time. Restore is done by downloading frames and applying them.

Litestream does NOT modify SQLite’s VFS — it runs as a separate process that monitors the WAL file. This is simpler but means it can’t provide multi-writer consistency (only one process can write to a SQLite database at a time).

LiteFS (Fly.io)

LiteFS mounts a FUSE filesystem that intercepts SQLite writes, replicates them to a primary node, and distributes to replicas. It does intercept at the filesystem level (FUSE = Filesystem in Userspace), which is essentially implementing the VFS pattern at the OS level.

The Pattern’s Significance

All of these tools (mvsqlite, libSQL, Litestream, LiteFS) are attacking the same problem: SQLite is an excellent embedded database, but it’s tied to a single file on a single machine. The solution in each case is to intercept the storage layer and redirect I/O to a distributed, replicated system.

Our pagestore is the minimal proof-of-concept for this idea: 150 lines of Go that demonstrate the core page-store primitives.

11. Exercises

Exercise 1 — Page Cache

Add an in-process LRU page cache:

type PageStore struct {
    ...
    cache *lru.Cache  // maps pageNum → [PageSize]byte
}

ReadAt checks the cache before going to FDB. WriteAt updates the cache after writing to FDB. Evict on Truncate.

Measure the cache hit rate for a typical SQLite workload (a mix of reads and writes). You’ll find that B-tree root pages have very high hit rates — they’re accessed on every query.

Exercise 2 — Implement xShmMap for WAL Mode

WAL mode requires shared memory for the WAL index. In mvsqlite, this is done with FDB keys. Implement a simplified version:

Add a “WAL region” subspace: ns + 0x03 + regionNum → 32768 bytes
xShmMap(region, size) reads the FDB key and returns a []byte
xShmBarrier() writes the modified byte slice back to FDB
xShmLock maps to a FDB lock key per-region

With this, enable PRAGMA journal_mode=WAL in the demo and verify reads and writes still work.

Exercise 3 — Multi-Version Pages

Change the page key layout to:

ns + 0x01 + pageNum + readVersion  →  4096 bytes

WriteAt reads the current version, writes a new version at the current FDB commit version. ReadAt scans backwards from the requested read version to find the most recent page version ≤ that version.

Add a Compact(olderThan version) that clears all page versions older than a given version. This is the MVCC GC mechanism.

Exercise 4 — Register as a Real SQLite VFS

Using CGO, implement the actual sqlite3_vfs and sqlite3_file C structs backed by our pagestore. Register the VFS with sqlite3_vfs_register. Open a real SQLite connection using this VFS:

db, err := sql.Open("sqlite3", "file:demo.db?vfs=fdbvfs")
db.Exec("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
db.Exec("INSERT INTO t VALUES (1, 'hello')")

All I/O will go through pagestore to FDB. You have now replaced SQLite’s storage engine while keeping the full SQL query interface.

Exercise 5 — Multi-Database Isolation

The pagestore uses a namespace prefix to isolate one database. Add a DatabaseManager that:

Lists all databases (range scan over a well-known metadata prefix)
Creates a new database (allocates a namespace, writes a metadata entry)
Deletes a database (one ClearRange + metadata clear)
Returns a pagestore for a given database name

12. Source Code Deep Dive — pagestore/pagestore.go

The File Struct

type File struct {
    db   fdb.Database
    ns   []byte
    mu   sync.Mutex
}

ns is the namespace byte prefix. mu protects concurrent calls to WriteAt and Truncate from a single File instance (though FDB’s own transactions provide the real isolation for concurrent processes).

Key Construction

func (f *File) sizeKey() fdb.Key {
    key := make([]byte, len(f.ns)+1)
    copy(key, f.ns)
    key[len(f.ns)] = 0x00
    return fdb.Key(key)
}

func (f *File) pageKey(pageNum int64) fdb.Key {
    key := make([]byte, len(f.ns)+9)
    copy(key, f.ns)
    key[len(f.ns)] = 0x01
    binary.BigEndian.PutUint64(key[len(f.ns)+1:], uint64(pageNum))
    return fdb.Key(key)
}

sizeKey() stores the total file size in bytes. pageKey(n) stores the 4096-byte page at page number n. The big-endian 8-byte encoding of pageNum ensures that GetRange over sizeKey to the end of the 0x01 subspace returns pages in ascending page-number order.

ReadAt — Handling Partial-Page Reads

func (f *File) ReadAt(p []byte, off int64) (int, error) {
    firstPage := off / PageSize
    lastPage := (off + int64(len(p)) - 1) / PageSize

    futures := make([]fdb.FutureByteSlice, lastPage-firstPage+1)

    _, err := f.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
        for i := firstPage; i <= lastPage; i++ {
            futures[i-firstPage] = rt.Get(f.pageKey(i))
        }
        return nil, nil
    })
    // Wait for all futures and assemble the result
    for i, fut := range futures {
        pageData, _ := fut.Get()
        // copy relevant bytes from page into p...
    }
}

Pipelining multiple page reads: All rt.Get(pageKey(i)) calls are issued inside one ReadTransact closure. FDB sends all read requests to the storage servers before waiting for any response. If the read spans 3 pages (common for reads that cross page boundaries), all 3 are fetched in one round-trip. Without pipelining, each page would require a separate round-trip — 3x the latency.

This is the same pipelining pattern used in option-c-record-layer’s LookupByIndex.

WriteAt — The Read-Modify-Write Pattern

func (f *File) WriteAt(p []byte, off int64) (int, error) {
    f.mu.Lock()
    defer f.mu.Unlock()

    firstPage := off / PageSize
    lastPage := (off + int64(len(p)) - 1) / PageSize

    _, err := f.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // Issue all reads first (pipelined)
        futures := make([]fdb.FutureByteSlice, lastPage-firstPage+1)
        for i := firstPage; i <= lastPage; i++ {
            futures[i-firstPage] = tr.Get(f.pageKey(i))
        }
        sizeF := tr.Get(f.sizeKey())

        // Wait for reads and apply writes
        for i := firstPage; i <= lastPage; i++ {
            existing, _ := futures[i-firstPage].Get()
            page := make([]byte, PageSize)
            copy(page, existing)
            // overlay p bytes onto page...
            tr.Set(f.pageKey(i), page)
        }

        // Update size
        curSize := int64(binary.BigEndian.Uint64(sizeF.MustGet()))
        newEnd := off + int64(len(p))
        if newEnd > curSize {
            b := make([]byte, 8)
            binary.BigEndian.PutUint64(b, uint64(newEnd))
            tr.Set(f.sizeKey(), b)
        }

        return nil, nil
    })
    return len(p), err
}

The read futures are pipelined: all tr.Get() calls are issued before any .Get() is called to block. This means all page reads happen in parallel — one round-trip for up to N pages regardless of N.

The size update is in the same transaction: The page writes and the size update are atomic. If the process crashes mid-transaction, FDB abandons it: neither the page updates nor the size update are applied. The database remains in its previous state — consistent, readable, no corruption.

pageRange — For Bulk Operations

func (f *File) pageRange() fdb.KeyRange {
    begin := f.pageKey(0)
    end := make([]byte, len(f.ns)+1)
    copy(end, f.ns)
    end[len(f.ns)] = 0x02  // one past the page subspace tag (0x01)
    return fdb.KeyRange{Begin: begin, End: fdb.Key(end)}
}

Used by Truncate to clear all pages above the new size:

func (f *File) Truncate(size int64) error {
    newLastPage := size / PageSize
    _, err := f.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        // Clear all pages above newLastPage
        clearBegin := f.pageKey(newLastPage + 1)
        tr.ClearRange(fdb.KeyRange{Begin: clearBegin, End: f.pageRange().End})
        // Update size
        b := make([]byte, 8)
        binary.BigEndian.PutUint64(b, uint64(size))
        tr.Set(f.sizeKey(), b)
        return nil, nil
    })
    return err
}

One ClearRange atomically removes all pages above the new size. This is O(1) in FDB key operations (one range-clear instruction), even if there are thousands of pages to clear. This is vastly more efficient than deleting pages one by one.

13. Production Considerations

13.1 Page Size Selection

The default SQLite page size is 4096 bytes. You can change it with PRAGMA page_size=N (must be set before the first write). Larger pages reduce the number of B-tree levels and improve scan performance on large tables, but increase read amplification for small point lookups. For an FDB-backed store:

4096 bytes (default): 4096-byte FDB values. Small key overhead. Good for mixed workloads.
8192 bytes: 8192-byte values. Better for large sequential scans.
65536 bytes (maximum SQLite page size): 64 KB values. Below FDB’s 100 KB limit. Excellent for large sequential scans, but poor for point lookups.

For most use cases, 4096 is the right choice. Match it to your workload’s read/write unit.

13.2 Hot Page Contention

Page 1 contains the database header and the root page of the main B-tree. Every query touches page 1. Every transaction increments the “file change counter” in the header. This means every write transaction does a read-modify-write on page 1 — even if the query doesn’t touch any user data on page 1 (e.g., inserting into a deep table touches only leaf pages and page 1 for the change counter).

In a high-concurrency write workload, this is a hot key. FDB’s conflict detection will serialize all transactions that write to page 1 — reducing concurrency.

Solution (mvsqlite’s approach): Move the change counter out of the page and into a separate FDB key. Use FDB’s atomic add to increment it without a read-modify-write. Since atomic add is commutative, it doesn’t cause conflicts. The page itself (minus the counter) is written only when actual B-tree structure changes.

13.3 Crash Recovery Testing

With FDB’s transactional guarantees, the traditional SQLite recovery path (re-applying the rollback journal) is never needed. But it’s worth testing:

Write some data.
Kill the process mid-write (use kill -9 or os.Exit(1) inside a WriteAt).
Restart the process and open the database.
Verify the database is in a consistent state (either the write happened or it didn’t, no partial state).

FDB’s transaction semantics make this trivially correct, but testing it gives you confidence that your implementation upholds the contract.

14. Interview Questions — SQLite, VFS, and Page Models

Q: What is a database page and why do databases use a fixed page size?

A page is the fundamental unit of I/O and storage in a database. Fixed page sizes allow: (1) simple address calculation — page N starts at byte (N-1) * pageSize; (2) alignment with OS page sizes (4 KiB for virtual memory, matching SQLite’s default); (3) buffer pool management — a page cache stores exactly one page per slot. Variable-size records are packed into pages; a B-tree node is exactly one page. The fixed size ensures that a B-tree node read is always one I/O operation, not an I/O per record.

Q: What is the partial-page write problem, and how does your FDB implementation solve it?

SQLite’s VFS xWrite can write any byte range — not necessarily page-aligned. If a 100-byte write starts at offset 0, it overwrites bytes 0–99 of page 1, but bytes 100–4095 are unchanged. An implementation that naively maps this to Set(page1Key, 100_bytes) would corrupt page 1 (losing bytes 100–4095). Our solution: read the existing page, overlay the new bytes, write the full 4096-byte page back — in one FDB transaction. The transaction atomicity ensures that no other writer sees a partial page and that a crash during the operation leaves the page unchanged.

Q: Why is xSync a no-op in your VFS implementation?

xSync requests durability: “flush all previous writes to stable storage.” Our WriteAt implementation uses fdb.Transact which does not return until the write is committed and replicated to FDB’s transaction log. The commit is synchronous and durable by the time WriteAt returns. There is nothing left for xSync to flush. The durability guarantee of xSync is already satisfied by the end of WriteAt.

Q: How does SQLite locking work, and how would you implement a multi-reader single-writer lock in FDB?

SQLite uses 5 lock levels. For multi-reader single-writer: maintain a SHARED lock counter (number of active readers) and an EXCLUSIVE lock flag. Acquire SHARED: check exclusive flag is unset, increment counter (in one FDB transaction). Acquire EXCLUSIVE: check counter == 0, set exclusive flag (in one FDB transaction). Release SHARED: decrement counter. Release EXCLUSIVE: clear flag. All check-and-set operations are atomic in FDB transactions, so no intermediate state is observable. A would-be exclusive lock holder that sees counter > 0 must retry (wait for readers to finish).

This is the NamespaceManager concept in mvsqlite, and it’s how a multi-tenant SQLite service would work.

FoundationDB Layers — Go Implementations