System Architecture¶
kiwi-db is a Log-Structured Merge Tree key-value store built in Python 3.12+. It provides durable, crash-safe storage with an async-first API. The engine coordinates six subsystems — WAL, memtable, SSTable, flush pipeline, compaction, and observability — through a strict layered dependency model where no layer imports from a layer above it.
Layered Dependency Model¶
The codebase is organized into six tiers, from leaf modules with no internal dependencies up to the top-level engine coordinator:
| Tier | Modules | Depends on |
|---|---|---|
| 0 | types.py, common/crc.py, common/uuid7.py, common/errors.py |
Nothing (stdlib only) |
| 1 | bloom/, cache/, index/, common/rwlock.py, common/merge_iterator.py |
Tier 0 |
| 2 | memtable/, wal/ |
Tier 0–1 |
| 3 | sstable/, engine/config.py, engine/seq_generator.py, engine/wal_manager.py |
Tier 0–2 |
| 4 | engine/memtable_manager.py, engine/sstable_manager.py, compaction/ |
Tier 0–3 |
| 5 | engine/flush_pipeline.py, engine/compaction_manager.py |
Tier 0–4 |
| 6 | engine/lsm_engine.py |
All tiers |
This strict ordering prevents circular imports and makes each layer independently testable.
Write Path¶
Every write follows this sequence atomically under a single lock:
engine.put(key, value)
│
├─ 1. SeqGenerator.next() → monotonic sequence number
├─ 2. WALManager.sync_append() → durable on disk (fsync)
├─ 3. MemTableManager.put() → insert into active skip list
└─ 4. MemTableManager.maybe_freeze()
│
├─ threshold not reached → return
└─ threshold reached → freeze active table
│
├─ snapshot skip list → ImmutableMemTable
├─ push to immutable queue
├─ create fresh ActiveMemTable
└─ signal FlushPipeline
Atomicity guarantee: all four steps happen under _mem.write_lock. Either all succeed or the caller sees the exception.
Read Path¶
Reads scan from newest to oldest data:
engine.get(key)
│
├─ 1. Active memtable → O(log n) skip list lookup
├─ 2. Immutable queue → newest-first scan (0–4 snapshots)
├─ 3. L0 SSTables → all files checked (overlapping ranges)
│ └─ bloom → index bisect → block scan
└─ 4. L1, L2, L3 SSTables → one file per level, sequential
└─ bloom → index bisect → block scan
The first match with the highest sequence number wins. Tombstones indicate deleted keys — the engine returns None.
Recovery Model¶
On startup, the engine replays the WAL to rebuild in-memory state:
- Load config and open WAL file
- Load all SSTables from disk via manifest
- Determine
max_seq_seenacross all SSTables - Replay WAL entries with
seq > max_seq_seeninto fresh memtable - Restore sequence generator to highest seen seq
- Start flush pipeline and compaction manager
Entries already flushed to SSTables are skipped. The WAL is the source of truth for unflushed data.
Thread and Async Model¶
| Component | Model | Reason |
|---|---|---|
| MemTable operations | Sync (threading locks) | Pure in-memory, microseconds |
| WAL append | Sync under lock, offloaded via to_thread for async callers |
Must fsync before returning |
| Flush pipeline | Async daemon task | Non-blocking I/O, parallel writes |
| Compaction merge | Subprocess (ProcessPoolExecutor) |
CPU-bound, bypasses GIL |
| SSTable reads | Sync (mmap) | Memory-mapped, no syscall after page cache warm |
| Config access | Sync (RLock) | Thread-safe attribute reads |