Write-Path Optimizations¶

This document catalogs every significant design decision in kiwi-db and explains how each one contributes to a write-optimized LSM architecture. Sections are ordered by impact on write throughput — the highest-impact architectural decisions first, supporting optimizations later. Each section provides enough context for generating presentation slides — including the problem, the chosen solution, what alternatives exist, and the concrete benefit.

1. L0 Multiple Files / L1+ Single File: Write-Optimized Level Organization¶

Problem: In a traditional leveled LSM-Tree, every level maintains a single sorted run. When a memtable is flushed, the resulting SSTable must be merged into the existing L0 sorted run before the flush can complete. This merge-on-write design means the write path is blocked by compaction-like I/O — the very thing compaction was supposed to defer.

Decision: kiwi-db splits level organization into two distinct strategies:

Level	Files	Key Ranges	Flush Cost	Read Cost
L0	Up to 10 SSTables	Overlapping — same key can exist in multiple files	O(1): just write a new file, no merge	Must check all L0 files
L1–L3	Exactly 1 SSTable per level	Non-overlapping — fully merged and deduplicated	N/A (produced by compaction)	Check one file per level

Architecture Diagram

How this optimizes writes:

The flush pipeline writes a new L0 SSTable and returns immediately. No merge, no sorting against existing L0 data, no blocking on compaction. The write path never touches existing SSTables — it only creates new files.

Traditional LSM (merge-on-flush):
  freeze memtable → merge with L0 sorted run → replace L0 → done
  └── blocked on merge I/O (100–500ms for 64MB)

kiwi-db (append-to-L0):
  freeze memtable → write new L0 file → register in manifest → done
  └── no merge; flush is pure sequential write (~50ms for 64MB)

This is the foundational write-throughput optimization. Everything else — the parallel flush pipeline, the subprocess compaction, the level reservations — builds on this separation between "fast L0 append" and "deferred background merge."

The read-side tradeoff: Since L0 files have overlapping key ranges, get() must check all L0 files (not just one). This is mitigated by:

Bloom filters — each L0 file has its own bloom filter. A negative bloom check skips the entire file in microseconds.
Bounded L0 count — compaction triggers at l0_compaction_threshold (default 10). L0 never grows unbounded.
Newest-first ordering — L0 files are scanned newest-first. For hot keys (recently written), the first file often has the answer.

Compaction benefit: When L0 reaches the threshold, all L0 files are merged into a single L1 file in one pass. L1+ levels always hold exactly one file, so subsequent compactions (L1→L2, L2→L3) are simple two-file merges — no fan-out, no multi-way coordination.

Alternative considered: Universal compaction (all levels allow multiple files). Rejected because it increases read amplification at every level and makes point lookups expensive deep in the tree. The L0-multiple/L1+-single hybrid gives the write speed of universal compaction at L0 with the read efficiency of leveled compaction at L1+.

2. Parallel Flush Pipeline with Ordered Commits¶

Problem: Flushing memtable snapshots to SSTables is I/O-bound (~200–500ms per 64MB SSTable). Sequential flushing wastes disk bandwidth when multiple snapshots are queued.

Decision: Two-phase flush with parallel writes and serialized commits:

Phase	Parallelism	What happens
Phase 1: Write	Parallel (up to `flush_max_workers`, default 2)	SSTableWriter produces data.bin + index.bin + filter.bin + meta.json
Phase 2: Commit	Serialized (oldest-first via event chain)	Register reader → pop queue → truncate WAL → trigger compaction

The ordering mechanism is an asyncio.Event chain:

slot0.prev_committed = pre_set_event       ← commits immediately
slot0.my_committed   = Event_A

slot1.prev_committed = Event_A             ← waits for slot0
slot1.my_committed   = Event_B

Why it matters: Disk writes are the bottleneck. Parallelizing them doubles throughput on NVMe. But commits must be FIFO to maintain the immutable queue invariant — the oldest snapshot must be committed first so pop_oldest() removes the correct entry.

Failure handling: If any slot fails, batch_abort is set — all downstream slots skip their commits. The failed snapshot stays in the queue and is retried on the next dispatch cycle. my_committed is always set in a finally block to prevent event chain deadlock.

Operational Complexity Comparison (K = queued snapshots, W = max_workers, T_io = single SSTable write time):

Approach	Flush Latency (K snapshots)	Disk Utilization	Commit Ordering
kiwi-db (parallel write + serial commit)	T_io × ceil(K/W) — parallel bounded by W	~100% with W=2 on NVMe	Guaranteed FIFO via event chain
Sequential flush (one at a time)	T_io × K — fully serial	~50% on NVMe (idle between flushes)	Trivially ordered
Fully parallel (no ordering)	T_io × 1 (all concurrent)	~100%	Broken — manifest non-deterministic, queue pops unordered
Batched flush (group then write)	T_io + merge cost	Variable	Requires post-merge sort

With W=2 and K=4 queued snapshots: sequential takes 4 × T_io, parallel takes 2 × T_io — a 2x throughput improvement with correctness preserved by the event chain. The commit phase is O(1) per slot (register + pop + truncate), so the serialization overhead is negligible (<5ms vs 200–500ms write time).

3. Subprocess Compaction (GIL Bypass)¶

Problem: Compaction merges large SSTables — a CPU-bound operation. On CPython 3.12 with GIL enabled, threading cannot achieve true parallelism for this workload.

Decision: Compaction runs in a ProcessPoolExecutor subprocess:

Main Process (asyncio)              Subprocess (own GIL)
─────────────────────               ─────────────────────
CompactionManager                   run_compaction(task)
  _build_task() → CompactionTask      open input SSTables
  asyncio.to_thread(                   KWayMergeIterator
    pool.submit(run_compaction)          dedup + GC tombstones
  )                                    SSTableWriter.finish_sync()
  ← SSTableMeta result                return SSTableMeta
  commit_compaction_async()

Why it matters: The merge is CPU-bound (decode records, compare keys, re-encode). Running it in a subprocess with its own GIL means the main event loop continues accepting writes and reads uninterrupted.

CompactionTask serialization: The task is a frozen dataclass with only primitive fields (str, int, list, dict, float) — no file handles, no locks, no live objects. Safe to send across the process boundary.

Non-blocking dispatch: asyncio.create_task() launches the job and returns immediately. The engine doesn't wait for compaction to finish.

Operational Complexity Comparison (M = total records to merge, T_merge = merge time):

Approach	Write Path Blocking	Merge Parallelism	Event Loop Impact
kiwi-db (ProcessPoolExecutor subprocess)	Zero — fire-and-forget dispatch	True parallelism — own GIL, own memory space	None — event loop free for put/get
In-thread compaction (GIL)	Zero (background thread)	None — GIL serializes with writes	CPU-bound merge competes with writes for GIL
Synchronous compaction (inline)	O(M) — blocks until merge completes	N/A	Blocked — no writes during merge
Async compaction (to_thread)	Zero (offloaded)	None — still one GIL	Moderate — thread pool contention with WAL fsync

The subprocess approach is the only option that achieves true parallelism on CPython 3.12. A 64MB L0→L1 merge (~650K records) takes ~500ms of CPU. With in-thread compaction, this 500ms competes with the write path for GIL time slices. With subprocess, both run at full speed simultaneously.

4. Write Path: Atomic Under a Single Lock¶

Problem: Traditional LSMs either use separate locks per component (WAL lock, memtable lock) risking inconsistency, or a coarse global lock that blocks reads during writes.

Decision: A single _mem.write_lock (RLock) protects the entire atomic write sequence:

SeqGenerator.next() — monotonic sequence number
WALManager.sync_append() — encode + write + fsync (durability)
MemTableManager.put() — skip list insert (visibility)
MemTableManager.maybe_freeze() — threshold check + possible freeze

Why it matters: The WAL is fsynced before the memtable is updated. If the process crashes after fsync but before memtable put, recovery replays the WAL entry. If it crashes before fsync, the write was never acknowledged — no data loss in either case.

Alternative considered: Per-component locks with two-phase commit. Rejected because it adds complexity without throughput gain — the bottleneck is fsync latency (~1ms), not lock contention.

Lock ordering rule: _mem.write_lock → _wal_lock. Never reversed. Enforced by code structure — the engine acquires write_lock, then sync_append acquires wal_lock internally.

Operational Complexity Comparison (T_fsync = WAL fsync latency, T_put = memtable insert):

Approach	Write Latency	Consistency Guarantee	Read Blocking
kiwi-db (single atomic lock)	T_fsync + T_put — sequential under one lock	Full — WAL durable before memtable visible	None — readers are lock-free
Per-component locks (WAL lock + memtable lock)	T_fsync ‖ T_put — potentially parallel	Requires 2PC or ordering protocol	Depends on lock design
Global RWLock (read/write lock)	T_fsync + T_put — same latency	Full	Readers block during writes
Lock-free CAS (compare-and-swap)	T_fsync + T_put — plus retry overhead	Requires careful memory ordering	None

The single-lock approach has the same write latency as alternatives (T_fsync dominates at ~1ms, T_put is microseconds) but is simpler and doesn't block readers. Per-component locks add coordination complexity for no throughput gain because fsync is the bottleneck, not lock contention.

5. Concurrent Skip List with Lock-Free Reads¶

Problem: The memtable needs to support concurrent writes from multiple threads while allowing lock-free reads for the hot read path.

Decision: A probabilistic skip list with three concurrency mechanisms:

Mechanism	Who	How
Per-node locks	Writers	Lock only predecessor nodes at insertion boundary
Lock-free reads	Readers	Check `fully_linked` and `marked` boolean flags (GIL-atomic)
Deterministic lock ordering	Writers	Acquire locks in ascending `id()` order to prevent deadlock

Why it matters: Readers (the get() path) never block. The fully_linked flag is the visibility barrier — it's set to True only after all forward pointers are linked, making the node atomically visible to concurrent readers.

Alternative comparison:

Structure	Sorted Iteration	Point Lookup	Concurrent Writes	Lock-Free Reads
Skip List	O(n)	O(log n)	Per-node locks	Yes
Hash Map	Requires sort	O(1) avg	Global lock	Yes
B-Tree	O(n)	O(log n)	Complex rebalancing	No

Skip list wins because SSTable flush requires sorted iteration, and lock-free reads are critical for throughput.

Safety bound: 64 retries before raising SkipListInsertError. Prevents live-lock under extreme contention.

Operational Complexity Comparison (N = entries, C = concurrent writers, R = concurrent readers):

Structure	Write	Read	Freeze (sorted snapshot)	Write Contention (C writers)	Read Contention (R readers)
Skip List (kiwi-db)	O(log N) per-node lock	O(log N) lock-free	O(N) lock-free iteration	O(C × log N) — disjoint keys don't contend	Zero — no locks
B-Tree + RWLock	O(log N) write-locked	O(log N) read-locked	O(N) read-locked	O(C × log N) — serialized by global write lock	O(R × log N) — blocked during writes
Hash Map + sort-on-freeze	O(1) amortized	O(1)	O(N log N) sort required	O(C) — global lock	Zero (if lock-free hash)
Red-Black Tree + RWLock	O(log N)	O(log N)	O(N)	Rebalancing holds lock longer	Blocked during rebalance

The skip list's unique advantage: writes are O(log N) with per-node locking (concurrent writes to disjoint keys proceed without contention) and reads are O(log N) with zero locking (GIL-atomic flag checks). The hash map alternative has O(1) writes but pays O(N log N) at freeze time — for a 64MB memtable with ~650K entries, that's a significant freeze stall.

6. Backpressure: Bounded Immutable Queue¶

Problem: If the flush pipeline falls behind, frozen memtable snapshots accumulate in memory without bound, eventually causing OOM.

Decision: The immutable queue has a maximum depth (immutable_queue_max_len, default 4). When full, maybe_freeze() blocks the writer on a Condition.wait() with a 60-second timeout:

Writer calls put() → threshold reached → maybe_freeze()
  └─ queue full? → wait(60s) → FreezeBackpressureTimeout if still full

Why it matters: Backpressure propagates to the caller — if the disk can't keep up, writes slow down rather than consuming unbounded memory. The timeout detects stuck flush pipelines (e.g., disk failure) and raises an actionable error.

Operational Complexity Comparison (Q = queue depth, M = memtable size, D = disk throughput):

Approach	Memory Usage	Write Stall Behavior	Failure Detection
kiwi-db (bounded queue + timeout)	O(Q × M) — bounded at 4 × 64MB = 256MB max	Controlled — blocks writer, resumes when space frees	60s timeout → FreezeBackpressureTimeout
Unbounded queue (no limit)	O(∞) — grows until OOM	None — writes never stall, then process crashes	OOM kill (uncontrolled)
Drop-oldest (evict unflushed)	O(Q × M) — bounded	None — but loses data silently	No detection needed (data lost)
Rate limiting (token bucket)	O(Q × M) — bounded	Smooth degradation	Latency spike detection

The bounded queue with timeout is the only approach that provides: (1) bounded memory (no OOM), (2) no data loss (snapshot stays in queue), and (3) actionable failure detection (timeout raises a named exception the caller can handle).

7. SSTable Four-File Layout with Completeness Signal¶

Problem: A crash during SSTable write must not produce a corrupted table that poisons reads on recovery.

Decision: Each SSTable is a directory with four files, written in a specific order:

1. data.bin      ← sorted records in blocks (written during put() calls)
2. index.bin     ← sparse index (written during finish())
3. filter.bin    ← bloom filter (written during finish(), concurrent with index)
4. meta.json     ← metadata (written LAST — completeness signal)

Why it matters: meta.json is the completeness signal. If it exists, all other files are guaranteed to be fully written and fsynced. If it's missing (crash during write), the SSTable directory is ignored on recovery. No corruption, no partial reads.

For L0 SSTables, finish() writes bloom and index concurrently via asyncio.gather() — parallelizing two I/O operations that are independent.

8. WAL Design: Msgpack Framing with Per-Write Fsync¶

Problem: The WAL must be durable (survive process crash), compact (minimize I/O), and recoverable (detect corruption).

Decision: Each WAL entry is framed as:

[4B payload length][msgpack(seq, ts, op, key, value)][4B CRC32]

Msgpack: Compact binary encoding (vs JSON), fast to encode/decode
Length prefix: Enables fast forward scanning during replay
CRC32 per frame: Detects corruption at frame granularity — a partial write from a crash is detected and replay stops cleanly
Fsync per write: Every append() calls fd.flush() + os.fsync() before returning

Why it matters: Per-write fsync is the conservative choice — every acknowledged write is durable. During recovery, entries already flushed to SSTables are filtered by sequence number (seq > sst_max_seq), avoiding redundant replay.

9. Level Reservation for Conflict-Free Compaction¶

Problem: Multiple compaction jobs can run concurrently, but adjacent-level jobs conflict. An L0→L1 job reads L1, so a concurrent L1→L2 job writing to L1 would cause data corruption.

Decision: A set-based reservation system tracks active levels:

_active_levels: set[Level] = set()

# Reserve: blocked if either level is already active
if src in _active_levels or dst in _active_levels:
    return False  # skip this job, try next cycle
_active_levels.add(src)
_active_levels.add(dst)

Conflict matrix (with max_levels = 3):

Running Job	Reserves	Blocks	Can Parallel With
L0→L1	{0, 1}	L0→L1, L1→L2	L2→L3
L1→L2	{1, 2}	L0→L1, L1→L2, L2→L3	Nothing
L2→L3	{2, 3}	L1→L2, L2→L3	L0→L1

Why it matters: Only truly independent jobs run in parallel. After a job completes, check_and_compact() is re-triggered to pick up deferred work — cascading compactions happen naturally.

10. Bloom Filter: Data-Derived Sizing with Environment-Aware FPR¶

Problem: A bloom filter that's too large wastes memory and build time. One that's too small has excessive false positives, causing unnecessary disk reads.

Decision: Two parameters, sourced differently:

Parameter	Source	Value
`bloom_n` (expected items)	Actual data count — `len(snapshot)` for flush, `sum(record_counts)` for compaction	Exact
`bloom_fpr` (false positive rate)	Config — `bloom_fpr_dev = 0.05`, `bloom_fpr_prod = 0.01`	Environment-aware

Why it matters: Using the actual record count produces an optimally sized filter — no over-provisioning, no under-provisioning. The FPR tradeoff is:

Dev (5%): Smaller filters, faster builds, good enough for small test datasets
Prod (1%): Larger filters, fewer false disk reads on production workloads

The optimal bit count and hash count are derived from standard formulas: - Bit count: m = -n × ln(fpr) / ln²(2) - Hash count: k = (m/n) × ln(2)

11. Lazy Loading with Cross-Reader Cache Reuse¶

Problem: Opening an SSTable eagerly (loading bloom filter + sparse index at open time) is wasteful when most L0 files are compacted away before being queried.

Decision: Bloom and index are loaded lazily on first get() call, not at open time:

Check BlockCache for cached bytes (sentinel offsets: -1 for bloom, -2 for index)
On cache miss, read from disk (filter.bin, index.bin)
Populate cache for future readers
Deserialize into BloomFilter and SparseIndex objects

Why it matters: At startup, the engine opens all SSTables but only mmaps data.bin. The bloom and index are deferred until actually needed. After the first read, they're cached — even if the reader is closed and a new one opened for the same file (common after engine restart), the cache hit avoids disk I/O entirely.

12. Three-Tier Block Cache with Differential Eviction¶

Problem: A single LRU cache treats all entries equally, but bloom filters and indexes are far more valuable than individual data blocks — they're reused across many queries.

Decision: Three separate LRU caches with independent capacities:

Tier	Sentinel Offset	Default Capacity	Eviction Priority
Data blocks	>= 0	256 entries	Evicted first
Sparse indexes	-2	64 entries	Medium retention
Bloom filters	-1	64 entries	Evicted last

Routing is automatic — the offset parameter determines which tier receives the entry.

Why it matters: Under heavy read load, data blocks churn through the cache rapidly. Without tiering, they would evict bloom filters and indexes, forcing expensive re-loads. With tiering, blooms and indexes stay resident while data blocks rotate through their own pool.

13. Runtime-Mutable Configuration¶

Problem: Production workloads change over time. Fixed thresholds require restart to tune.

Decision: All thresholds are stored in config.json and read at the point of use (not cached):

# In MemTableManager.maybe_freeze():
if self._config.is_prod:
    return self._active.size_bytes >= self._config.max_memtable_bytes

Updating a config value:

engine.update_config("max_memtable_entries", 50)  # takes effect on next put()

The update is persisted atomically (temp file + os.replace()) and logged.

Why it matters: No restart needed to tune freeze thresholds, compaction triggers, bloom FPR, or cache sizes. The dev/prod mode switch (config.set("env", "prod")) changes multiple behaviors at once — freeze trigger type, bloom FPR, compaction threshold scaling.

14. Recovery: Seq-Based WAL Filtering¶

Problem: After a crash, the WAL may contain entries that were already flushed to SSTables before the crash. Replaying them is redundant and wastes time.

Decision: Recovery filters WAL entries by sequence number:

1. sst_max_seq = SSTableManager.max_seq_seen()     ← highest seq in all SSTables
2. entries = WAL.replay()                            ← all entries in WAL
3. unflushed = [e for e in entries if e.seq > sst_max_seq]
4. MemTableManager.restore(unflushed)               ← replay only unflushed
5. SeqGenerator.restore(max_unflushed_seq)           ← ensure next seq is higher

Why it matters: On a store with millions of entries, the WAL might have thousands of entries from the last few seconds. But most are already in SSTables. Seq-based filtering skips them in O(n) with no disk I/O — just integer comparison.

15. AsyncRWLock with Writer Preference¶

Problem: During compaction commits, the SSTableManager needs exclusive access to a level's file list. But reads are far more frequent than compaction commits. A reader-preference lock would starve the compaction commit under continuous read load.

Decision: Writer-preference async RWLock — once a writer is waiting, new readers block:

Reader arrives → if writer_waiting > 0: block
Writer arrives → wait for readers == 0, then proceed

Each compaction level has its own lock. This means: - L0 reads and L2→L3 compaction proceed simultaneously (different locks) - L0 reads block briefly during L0→L1 commit (same lock, < 5ms)

Why it matters: Compaction must complete promptly to prevent L0 buildup. Writer preference ensures compaction commits aren't starved, while per-level granularity limits reader blocking to only the affected level.

16. mmap Reads with Zero-Copy Block Scanning¶

Problem: Reading SSTable records involves I/O. Traditional approaches copy data from kernel to user space, then deserialize each record.

Decision: data.bin is memory-mapped at open time. Reads use memoryview slicing — no copy until cache insertion:

mv = memoryview(self._mm)           # zero-copy view into mmap
for rec in iter_block(mv, start, end):  # decode records on-the-fly
    if rec.key == key:
        return (rec.seq, rec.timestamp_ms, rec.value)

Cache insertion is the only copy:

self._cache.put(file_id, offset, bytes(self._mm[offset:end]))

Why it matters: After the OS page cache is warm, mmap reads are effectively memory reads — no syscall, no copy. The iter_block() generator decodes records lazily and stops early when rec.key > key (records are sorted).

17. Manifest: Atomic Persistence via Temp + Replace¶

Problem: The manifest tracks L0 ordering and per-level file assignments. A corrupt manifest after crash means all SSTables become orphaned.

Decision: Manifest written via temp file + os.replace():

tmp = tempfile.NamedTemporaryFile(dir=manifest_dir, ...)
tmp.write(json.dumps(layout))
os.fsync(tmp.fileno())
os.replace(tmp_path, manifest_path)  # atomic on POSIX

Why it matters: os.replace() is atomic on POSIX filesystems. Either the old manifest or the new one is visible — never a partial write. If crash occurs before replace, the old manifest is intact. If after, the new one is fully written and fsynced.

On startup, the manifest is reconciled with what's actually on disk — orphan files are adopted, missing files are skipped.

Summary: Write-Path Optimization Stack¶

The following table summarizes how each decision contributes to write throughput, ordered by impact:

#	Layer	Optimization	Effect on Writes
1	Level Architecture	L0 multiple files / L1+ single file	Flush is pure append — no merge-on-write
2	Flush	Parallel SSTable writes + ordered commits	2x disk throughput on NVMe
3	Compaction	Subprocess execution (GIL bypass)	Event loop stays free during merge
4	API	Single atomic write lock	Minimal critical section; durability before visibility
5	Memtable	Concurrent skip list (lock-free reads)	Writers don't contend with readers
6	Freeze	Backpressure queue (bounded depth)	Prevents OOM; propagates disk pressure to callers
7	SSTable	Four-file layout + async bloom/index writes	Crash-safe; L0 finish() parallelizes I/O
8	WAL	Msgpack framing + per-write fsync	Compact, durable, corruption-detectable
9	Compaction	Level reservation (conflict-free parallel jobs)	Non-adjacent compactions run simultaneously
10	Bloom	Data-derived sizing + env-aware FPR	Optimal filter per SSTable; dev speed vs prod accuracy
11	Cache	Lazy loading + cross-reader reuse	No I/O for unopened SSTables
12	Cache	Three-tier LRU (differential eviction)	Bloom/index survive data block churn
13	Config	Runtime mutability (no restart)	Live threshold tuning
14	Recovery	Seq-based WAL filtering	Skip already-flushed entries
15	Compaction	AsyncRWLock with writer preference	Compaction commits aren't starved
16	SSTable	mmap + zero-copy block scanning	Read optimization; no copy after page cache warm
17	Manifest	Atomic temp + os.replace	Crash-safe state persistence