Hitchhiker’s Guide — Option B: SQLite VFS over FoundationDB
The question this answers: “How does SQLite actually read and write its database file? Can we replace the file with FoundationDB?”
The deeper question: “What is a database page, how does the pager work, and why is the page model the right abstraction for a VFS?”
Table of Contents
- SQLite’s Architecture — From SQL to Bytes
- The Virtual File System (VFS) — SQLite’s Plugin Point
- The Page Model — How SQLite Organizes Storage
- Our pagestore — FDB as a Page Store
- The Partial-Page Write Problem — And Its Atomic Solution
- xSync is a No-Op — And Why That’s Correct
- Locking — From POSIX Flock to FDB Keys
- Journal Modes: DELETE, WAL, MEMORY
- mvsqlite — The Production Version of This
- Real-World Analogues: libSQL, Litestream, LiteFS
- Exercises
1. SQLite’s Architecture — From SQL to Bytes
SQLite processes queries through seven layers:
SQL text
↓ Tokenizer + Parser
Abstract Syntax Tree (AST)
↓ Code Generator
Bytecode program (VDBE instructions)
↓ Virtual Database Engine (VDBE)
B-tree operations (seek, insert, delete on B-tree pages)
↓ Pager
Page cache: reads/writes logical page numbers
↓ OS Interface (VFS)
File reads/writes at byte offsets
↓ Actual storage
We plug in at the VFS layer. Everything above (SQL parsing, the B-tree, the pager) runs as normal. The VFS is called for all I/O, and our implementation redirects those calls to FDB.
The key insight: SQLite doesn’t know or care whether the VFS talks to a local file, a network mount, or a distributed database. It only needs the VFS contract to be upheld.
2. The Virtual File System (VFS) — SQLite’s Plugin Point
The VFS is documented in SQLite’s C API. The key functions (C signatures simplified):
// File operations (called on each open file):
int xRead(file, buf, amount, offset); // read `amount` bytes at `offset`
int xWrite(file, buf, amount, offset); // write `amount` bytes at `offset`
int xTruncate(file, size); // truncate to `size` bytes
int xSync(file, flags); // flush to durable storage
int xFileSize(file, *size); // return current size
int xLock(file, locktype); // acquire SHARED/EXCLUSIVE/etc
int xUnlock(file, locktype); // release lock
int xCheckReservedLock(file, *result); // is anyone holding RESERVED lock?
int xShmMap(file, region, size, ...); // map shared memory region (WAL mode)
Our Go pagestore implements the subset needed for non-WAL mode:
ReadAt, WriteAt, Truncate, FileSize, Lock, Unlock.
Why a Go struct instead of a C VFS?
A real SQLite VFS requires implementing C structs (sqlite3_vfs,
sqlite3_file) and using CGO to register them. This is correct but adds
~200 lines of glue code that would obscure the core concept. Our pagestore
is a Go struct that exercises the exact same I/O patterns as SQLite would use.
The demo calls ReadAt/WriteAt directly, simulating what the SQLite pager
would call through the VFS.
3. The Page Model — How SQLite Organizes Storage
SQLite’s database file is divided into fixed-size pages (default 4096 bytes). Every structure in a SQLite database — B-tree interior nodes, B-tree leaf nodes, overflow pages, free-list pages — is exactly one page. The pager manages a cache of these pages and issues I/O to the VFS in page-aligned operations.
SQLite database file layout:
Offset 0 – 4095: Page 1 (database header + B-tree root)
Offset 4096 – 8191: Page 2
Offset 8192 – 12287: Page 3
...
Offset (N-1)*4096 – N*4096-1: Page N
Page 1 is special: its first 100 bytes are the database header:
Offset 0: "SQLite format 3\000" (16 bytes, magic string)
Offset 16: page size (2 bytes)
Offset 18: file format write version (1 byte)
Offset 19: file format read version (1 byte)
...
Offset 28: file change counter (4 bytes)
...
Offset 52: schema format number (4 bytes)
...
The partial-page write problem arises here: the header is 100 bytes, but
SQLite might write just the header (100 bytes at offset 0), while the rest
of page 1 (bytes 100–4095) contains B-tree data. An xWrite(file, header, 100, 0) call writes only 100 bytes — not the full page.
4. Our pagestore — FDB as a Page Store
We store pages as FDB keys:
const PageSize = 4096
// Page data:
// ns + 0x01 + pageNum(uint64 big-endian) → 4096 bytes
// Database size (in pages):
// ns + 0x00 → uint64 (big-endian)
// Exclusive lock:
// ns + 0x02 → "locked" (any non-empty value)
Page numbers:
SQLite page numbers are 1-based (page 1 is the first page). We store them as 0-padded big-endian uint64s:
Page 1: ns + 0x01 + 0000000000000001
Page 2: ns + 0x01 + 0000000000000002
Page 1000: ns + 0x01 + 00000000000003E8
Big-endian encoding ensures that a GetRange over ns+0x01+0 to
ns+0x01+FFFFFFFFFFFFFFFF returns pages in ascending page-number order —
useful for scanning the entire database.
Size tracking:
SQLite calls xFileSize to determine how many pages exist. We track this
explicitly with a size key rather than scanning for the highest-numbered
page key (which would be an O(N) scan). The size key is updated atomically
in the same transaction as every WriteAt and Truncate.
5. The Partial-Page Write Problem — And Its Atomic Solution
This is the trickiest part of implementing a page-based storage engine.
SQLite’s VFS contract says:
xWrite(offset, amount, data):
Write `amount` bytes starting at `offset`.
`offset` and `amount` may be any values (not necessarily page-aligned).
Examples of partial-page writes:
- Write header (100 bytes at offset 0) to page 1
- Write a record that spans a 4-byte boundary at the end of a page
- Write a single integer (4 bytes at offset 12) to a page
We cannot simply map WriteAt(offset, data) to Set(pageKey(offset/4096), data) because that would overwrite the entire 4096-byte page with only
the partial data — corrupting the rest of the page.
Our solution: read-modify-write inside one transaction.
func (p *PageStore) WriteAt(data []byte, off int64) (int, error) {
firstPage := off / PageSize
lastPage := (off + int64(len(data)) - 1) / PageSize
_, err := p.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
for pageNum := firstPage; pageNum <= lastPage; pageNum++ {
// 1. Read current page content (or zeros if it doesn't exist yet)
existing, _ := tr.Get(p.pageKey(pageNum)).Get()
page := make([]byte, PageSize)
copy(page, existing) // pad to 4096 with zeros if shorter
// 2. Compute which bytes of `data` go into this page
pageStart := pageNum * PageSize
pageEnd := pageStart + PageSize
writeStart := max(off, pageStart) - pageStart
writeEnd := min(off+int64(len(data)), pageEnd) - pageStart
dataStart := max(off, pageStart) - off
dataEnd := min(off+int64(len(data)), pageEnd) - off
// 3. Overlay the new bytes onto the existing page
copy(page[writeStart:writeEnd], data[dataStart:dataEnd])
// 4. Write the modified page back
tr.Set(p.pageKey(pageNum), page)
}
// 5. Update size key if this write extends the file
...
return nil, nil
})
return len(data), err
}
Steps 1–4 happen inside one FDB transaction. The read and the write are atomic: no concurrent writer can modify the page between our read and write. No partial update is visible to other transactions.
The atomicity guarantee:
If the process crashes after step 1 (read) but before step 4 (write), the transaction is never committed. FDB abandons it. The page is unchanged. No corruption.
If two concurrent writers both try to read-modify-write the same page, FDB’s conflict detection will cause one to retry. The second writer will re-read the page (now including the first writer’s change) and overlay its own data on top. Correct behavior, no coordination required.
6. xSync is a No-Op — And Why That’s Correct
SQLite calls xSync to tell the VFS: “make sure all previous writes are
durable on physical storage before returning.” On a normal filesystem, this
calls fsync() which flushes the OS page cache to the disk controller and
waits for the disk to confirm durability.
Why do we implement it as a no-op?
FDB’s Transact is synchronous and durable.
When WriteAt calls p.db.Transact(...) and it returns without error, the
data is already:
- Written to FDB’s transaction log on
f+1machines (durably replicated) - Committed (visible to new transactions)
By the time WriteAt returns to the SQLite pager, the durability guarantee
is already satisfied — stronger than fsync on a local disk.
xSync is a no-op because the work it would do has already been done
during xWrite. There is nothing left to flush.
Analogy: Imagine a bank transfer. xWrite is “debit Alice, credit Bob,
commit to ledger.” xSync is “make sure the ledger is durable.” If the
ledger is already in a replicated distributed database, xSync has nothing
to do.
7. Locking — From POSIX Flock to FDB Keys
SQLite uses file locking to coordinate concurrent access. The POSIX lock levels, in order of increasing exclusivity:
UNLOCKED → SHARED → RESERVED → PENDING → EXCLUSIVE
- SHARED: reader has this. Many readers can hold SHARED simultaneously.
- RESERVED: writer intending to modify. One writer can hold RESERVED while readers continue holding SHARED.
- PENDING: writer waiting for all SHARED holders to release.
- EXCLUSIVE: sole writer, no readers. Required before writing pages.
A full VFS implementation would map these levels to FDB lock keys with appropriate semantics (e.g., a counter for SHARED, a flag for EXCLUSIVE).
Our implementation is simplified: we use one “exclusive lock” key. Setting it means exclusive ownership; clearing it means unlocked. This correctly implements single-writer semantics but doesn’t allow concurrent readers.
For multi-reader/single-writer, a production implementation would:
SHARED lock: counter key (increment on lock, decrement on unlock)
EXCLUSIVE lock: flag key
→ acquire EXCLUSIVE: check counter == 0, set flag key
→ acquire SHARED: check flag key is unset, increment counter
Both checks-and-sets would use FDB transactions to ensure atomicity.
8. Journal Modes: DELETE, WAL, MEMORY
SQLite has several journal modes, selected with PRAGMA journal_mode=MODE.
The journal is SQLite’s crash recovery mechanism (distinct from FDB’s own
recovery).
DELETE (default rollback journal): Before modifying a page, SQLite copies the original page content to a separate journal file. If a crash occurs during a write, SQLite replays the journal to restore the original page content.
With FDB, this is redundant: FDB’s transactions provide rollback for free.
A partial WriteAt that crashes mid-transaction rolls back automatically.
The journal file would be stored in FDB via our VFS, adding overhead for
no benefit.
WAL (Write-Ahead Log): Instead of copying original pages before modification, SQLite appends new page versions to a WAL file. Readers check the WAL before reading the main database file. A “checkpoint” operation copies WAL pages back to the main file.
WAL mode requires xShmMap — shared memory for the WAL index. This is a
memory-mapped file that multiple processes share to coordinate which WAL
frames are valid. Implementing xShmMap in FDB would require:
- A small FDB key range to store the WAL index state.
- Synchronized access to that state via FDB transactions. This is complex but possible — mvsqlite does it.
MEMORY: SQLite uses an in-memory journal, not written to any file. Rollback is possible within a transaction but not after a crash.
For our FDB VFS, PRAGMA journal_mode=MEMORY is the right choice: SQLite
won’t try to create journal files (which would trigger extra VFS calls), and
crash recovery is handled by FDB. This is what our demo uses implicitly by
not specifying a journal mode (we would need to open the database and run
PRAGMA journal_mode=MEMORY before any writes).
9. mvsqlite — The Production Version of This
mvsqlite (by losfair, Rust) is the production-grade implementation of the same idea. Its architecture:
Namespace mapping: Each SQLite “database file” path is mapped to an FDB namespace prefix. Multiple processes can open the same “file” (namespace) simultaneously with MVCC isolation.
Page-level MVCC: mvsqlite keeps multiple versions of each page, similar to how PostgreSQL keeps multiple row versions. When a reader opens the database, it captures an FDB read version. Page reads are served from FDB at that version. New writes create new page versions. Old versions are kept until all readers that need them complete.
Our pagestore is a simplified, non-MVCC version: every read sees the latest
page version. Adding page-level MVCC would require:
ns + 0x01 + pageNum + version → 4096 bytes (instead of just pageNum)
With a cleanup process that removes old versions when no readers hold them.
Write-ahead log in FDB: mvsqlite implements WAL mode by storing the WAL log pages in FDB itself:
ns + 0x03 + txnId + pageNum → 4096 bytes (uncommitted WAL pages)
ns + 0x04 + txnId → commit record
This allows SQLite’s WAL mode to work without xShmMap (the shared memory
region) — instead, the WAL index state lives in FDB and is accessed via
transactions.
10. Real-World Analogues
libSQL (Turso)
libSQL is a fork of SQLite that adds multi-tenancy and remote storage. Turso’s cloud product stores SQLite databases in a distributed object store (similar to our FDB approach). The architecture:
- Local replicas for low-latency reads
- A primary writes to the distributed store
- Followers pull changes from the store and apply them locally
Litestream
Litestream continuously replicates SQLite databases to S3 or other cloud storage by intercepting the WAL. It monitors the SQLite WAL file and copies new frames to cloud storage in near-real-time. Restore is done by downloading frames and applying them.
Litestream does NOT modify SQLite’s VFS — it runs as a separate process that monitors the WAL file. This is simpler but means it can’t provide multi-writer consistency (only one process can write to a SQLite database at a time).
LiteFS (Fly.io)
LiteFS mounts a FUSE filesystem that intercepts SQLite writes, replicates them to a primary node, and distributes to replicas. It does intercept at the filesystem level (FUSE = Filesystem in Userspace), which is essentially implementing the VFS pattern at the OS level.
The Pattern’s Significance
All of these tools (mvsqlite, libSQL, Litestream, LiteFS) are attacking the same problem: SQLite is an excellent embedded database, but it’s tied to a single file on a single machine. The solution in each case is to intercept the storage layer and redirect I/O to a distributed, replicated system.
Our pagestore is the minimal proof-of-concept for this idea: 150 lines of
Go that demonstrate the core page-store primitives.
11. Exercises
Exercise 1 — Page Cache
Add an in-process LRU page cache:
type PageStore struct {
...
cache *lru.Cache // maps pageNum → [PageSize]byte
}
ReadAt checks the cache before going to FDB. WriteAt updates the cache
after writing to FDB. Evict on Truncate.
Measure the cache hit rate for a typical SQLite workload (a mix of reads and writes). You’ll find that B-tree root pages have very high hit rates — they’re accessed on every query.
Exercise 2 — Implement xShmMap for WAL Mode
WAL mode requires shared memory for the WAL index. In mvsqlite, this is done with FDB keys. Implement a simplified version:
- Add a “WAL region” subspace:
ns + 0x03 + regionNum → 32768 bytes xShmMap(region, size)reads the FDB key and returns a[]bytexShmBarrier()writes the modified byte slice back to FDBxShmLockmaps to a FDB lock key per-region
With this, enable PRAGMA journal_mode=WAL in the demo and verify reads
and writes still work.
Exercise 3 — Multi-Version Pages
Change the page key layout to:
ns + 0x01 + pageNum + readVersion → 4096 bytes
WriteAt reads the current version, writes a new version at the current
FDB commit version. ReadAt scans backwards from the requested read version
to find the most recent page version ≤ that version.
Add a Compact(olderThan version) that clears all page versions older than
a given version. This is the MVCC GC mechanism.
Exercise 4 — Register as a Real SQLite VFS
Using CGO, implement the actual sqlite3_vfs and sqlite3_file C structs
backed by our pagestore. Register the VFS with sqlite3_vfs_register.
Open a real SQLite connection using this VFS:
db, err := sql.Open("sqlite3", "file:demo.db?vfs=fdbvfs")
db.Exec("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
db.Exec("INSERT INTO t VALUES (1, 'hello')")
All I/O will go through pagestore to FDB. You have now replaced SQLite’s
storage engine while keeping the full SQL query interface.
Exercise 5 — Multi-Database Isolation
The pagestore uses a namespace prefix to isolate one database. Add a
DatabaseManager that:
- Lists all databases (range scan over a well-known metadata prefix)
- Creates a new database (allocates a namespace, writes a metadata entry)
- Deletes a database (one
ClearRange+ metadata clear) - Returns a
pagestorefor a given database name
12. Source Code Deep Dive — pagestore/pagestore.go
The File Struct
type File struct {
db fdb.Database
ns []byte
mu sync.Mutex
}
ns is the namespace byte prefix. mu protects concurrent calls to WriteAt and Truncate from a single File instance (though FDB’s own transactions provide the real isolation for concurrent processes).
Key Construction
func (f *File) sizeKey() fdb.Key {
key := make([]byte, len(f.ns)+1)
copy(key, f.ns)
key[len(f.ns)] = 0x00
return fdb.Key(key)
}
func (f *File) pageKey(pageNum int64) fdb.Key {
key := make([]byte, len(f.ns)+9)
copy(key, f.ns)
key[len(f.ns)] = 0x01
binary.BigEndian.PutUint64(key[len(f.ns)+1:], uint64(pageNum))
return fdb.Key(key)
}
sizeKey() stores the total file size in bytes. pageKey(n) stores the 4096-byte page at page number n. The big-endian 8-byte encoding of pageNum ensures that GetRange over sizeKey to the end of the 0x01 subspace returns pages in ascending page-number order.
ReadAt — Handling Partial-Page Reads
func (f *File) ReadAt(p []byte, off int64) (int, error) {
firstPage := off / PageSize
lastPage := (off + int64(len(p)) - 1) / PageSize
futures := make([]fdb.FutureByteSlice, lastPage-firstPage+1)
_, err := f.db.ReadTransact(func(rt fdb.ReadTransaction) (interface{}, error) {
for i := firstPage; i <= lastPage; i++ {
futures[i-firstPage] = rt.Get(f.pageKey(i))
}
return nil, nil
})
// Wait for all futures and assemble the result
for i, fut := range futures {
pageData, _ := fut.Get()
// copy relevant bytes from page into p...
}
}
Pipelining multiple page reads: All rt.Get(pageKey(i)) calls are issued inside one ReadTransact closure. FDB sends all read requests to the storage servers before waiting for any response. If the read spans 3 pages (common for reads that cross page boundaries), all 3 are fetched in one round-trip. Without pipelining, each page would require a separate round-trip — 3x the latency.
This is the same pipelining pattern used in option-c-record-layer’s LookupByIndex.
WriteAt — The Read-Modify-Write Pattern
func (f *File) WriteAt(p []byte, off int64) (int, error) {
f.mu.Lock()
defer f.mu.Unlock()
firstPage := off / PageSize
lastPage := (off + int64(len(p)) - 1) / PageSize
_, err := f.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
// Issue all reads first (pipelined)
futures := make([]fdb.FutureByteSlice, lastPage-firstPage+1)
for i := firstPage; i <= lastPage; i++ {
futures[i-firstPage] = tr.Get(f.pageKey(i))
}
sizeF := tr.Get(f.sizeKey())
// Wait for reads and apply writes
for i := firstPage; i <= lastPage; i++ {
existing, _ := futures[i-firstPage].Get()
page := make([]byte, PageSize)
copy(page, existing)
// overlay p bytes onto page...
tr.Set(f.pageKey(i), page)
}
// Update size
curSize := int64(binary.BigEndian.Uint64(sizeF.MustGet()))
newEnd := off + int64(len(p))
if newEnd > curSize {
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, uint64(newEnd))
tr.Set(f.sizeKey(), b)
}
return nil, nil
})
return len(p), err
}
The read futures are pipelined: all tr.Get() calls are issued before any .Get() is called to block. This means all page reads happen in parallel — one round-trip for up to N pages regardless of N.
The size update is in the same transaction: The page writes and the size update are atomic. If the process crashes mid-transaction, FDB abandons it: neither the page updates nor the size update are applied. The database remains in its previous state — consistent, readable, no corruption.
pageRange — For Bulk Operations
func (f *File) pageRange() fdb.KeyRange {
begin := f.pageKey(0)
end := make([]byte, len(f.ns)+1)
copy(end, f.ns)
end[len(f.ns)] = 0x02 // one past the page subspace tag (0x01)
return fdb.KeyRange{Begin: begin, End: fdb.Key(end)}
}
Used by Truncate to clear all pages above the new size:
func (f *File) Truncate(size int64) error {
newLastPage := size / PageSize
_, err := f.db.Transact(func(tr fdb.Transaction) (interface{}, error) {
// Clear all pages above newLastPage
clearBegin := f.pageKey(newLastPage + 1)
tr.ClearRange(fdb.KeyRange{Begin: clearBegin, End: f.pageRange().End})
// Update size
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, uint64(size))
tr.Set(f.sizeKey(), b)
return nil, nil
})
return err
}
One ClearRange atomically removes all pages above the new size. This is O(1) in FDB key operations (one range-clear instruction), even if there are thousands of pages to clear. This is vastly more efficient than deleting pages one by one.
13. Production Considerations
13.1 Page Size Selection
The default SQLite page size is 4096 bytes. You can change it with PRAGMA page_size=N (must be set before the first write). Larger pages reduce the number of B-tree levels and improve scan performance on large tables, but increase read amplification for small point lookups. For an FDB-backed store:
- 4096 bytes (default): 4096-byte FDB values. Small key overhead. Good for mixed workloads.
- 8192 bytes: 8192-byte values. Better for large sequential scans.
- 65536 bytes (maximum SQLite page size): 64 KB values. Below FDB’s 100 KB limit. Excellent for large sequential scans, but poor for point lookups.
For most use cases, 4096 is the right choice. Match it to your workload’s read/write unit.
13.2 Hot Page Contention
Page 1 contains the database header and the root page of the main B-tree. Every query touches page 1. Every transaction increments the “file change counter” in the header. This means every write transaction does a read-modify-write on page 1 — even if the query doesn’t touch any user data on page 1 (e.g., inserting into a deep table touches only leaf pages and page 1 for the change counter).
In a high-concurrency write workload, this is a hot key. FDB’s conflict detection will serialize all transactions that write to page 1 — reducing concurrency.
Solution (mvsqlite’s approach): Move the change counter out of the page and into a separate FDB key. Use FDB’s atomic add to increment it without a read-modify-write. Since atomic add is commutative, it doesn’t cause conflicts. The page itself (minus the counter) is written only when actual B-tree structure changes.
13.3 Crash Recovery Testing
With FDB’s transactional guarantees, the traditional SQLite recovery path (re-applying the rollback journal) is never needed. But it’s worth testing:
- Write some data.
- Kill the process mid-write (use
kill -9oros.Exit(1)inside aWriteAt). - Restart the process and open the database.
- Verify the database is in a consistent state (either the write happened or it didn’t, no partial state).
FDB’s transaction semantics make this trivially correct, but testing it gives you confidence that your implementation upholds the contract.
14. Interview Questions — SQLite, VFS, and Page Models
Q: What is a database page and why do databases use a fixed page size?
A page is the fundamental unit of I/O and storage in a database. Fixed page sizes allow: (1) simple address calculation — page N starts at byte (N-1) * pageSize; (2) alignment with OS page sizes (4 KiB for virtual memory, matching SQLite’s default); (3) buffer pool management — a page cache stores exactly one page per slot. Variable-size records are packed into pages; a B-tree node is exactly one page. The fixed size ensures that a B-tree node read is always one I/O operation, not an I/O per record.
Q: What is the partial-page write problem, and how does your FDB implementation solve it?
SQLite’s VFS xWrite can write any byte range — not necessarily page-aligned. If a 100-byte write starts at offset 0, it overwrites bytes 0–99 of page 1, but bytes 100–4095 are unchanged. An implementation that naively maps this to Set(page1Key, 100_bytes) would corrupt page 1 (losing bytes 100–4095). Our solution: read the existing page, overlay the new bytes, write the full 4096-byte page back — in one FDB transaction. The transaction atomicity ensures that no other writer sees a partial page and that a crash during the operation leaves the page unchanged.
Q: Why is xSync a no-op in your VFS implementation?
xSync requests durability: “flush all previous writes to stable storage.” Our WriteAt implementation uses fdb.Transact which does not return until the write is committed and replicated to FDB’s transaction log. The commit is synchronous and durable by the time WriteAt returns. There is nothing left for xSync to flush. The durability guarantee of xSync is already satisfied by the end of WriteAt.
Q: How does SQLite locking work, and how would you implement a multi-reader single-writer lock in FDB?
SQLite uses 5 lock levels. For multi-reader single-writer: maintain a SHARED lock counter (number of active readers) and an EXCLUSIVE lock flag. Acquire SHARED: check exclusive flag is unset, increment counter (in one FDB transaction). Acquire EXCLUSIVE: check counter == 0, set exclusive flag (in one FDB transaction). Release SHARED: decrement counter. Release EXCLUSIVE: clear flag. All check-and-set operations are atomic in FDB transactions, so no intermediate state is observable. A would-be exclusive lock holder that sees counter > 0 must retry (wait for readers to finish).
This is the NamespaceManager concept in mvsqlite, and it’s how a
multi-tenant SQLite service would work.