System Architecture¶

kiwi-db is a Log-Structured Merge Tree key-value store built in Python 3.12+. It provides durable, crash-safe storage with an async-first API. The engine coordinates six subsystems — WAL, memtable, SSTable, flush pipeline, compaction, and observability — through a strict layered dependency model where no layer imports from a layer above it.

Layered Dependency Model¶

The codebase is organized into six tiers, from leaf modules with no internal dependencies up to the top-level engine coordinator:

Tier	Modules	Depends on
0	`types.py`, `common/crc.py`, `common/uuid7.py`, `common/errors.py`	Nothing (stdlib only)
1	`bloom/`, `cache/`, `index/`, `common/rwlock.py`, `common/merge_iterator.py`	Tier 0
2	`memtable/`, `wal/`	Tier 0–1
3	`sstable/`, `engine/config.py`, `engine/seq_generator.py`, `engine/wal_manager.py`	Tier 0–2
4	`engine/memtable_manager.py`, `engine/sstable_manager.py`, `compaction/`	Tier 0–3
5	`engine/flush_pipeline.py`, `engine/compaction_manager.py`	Tier 0–4
6	`engine/lsm_engine.py`	All tiers

This strict ordering prevents circular imports and makes each layer independently testable.

Write Path¶

Every write follows this sequence atomically under a single lock:

engine.put(key, value)
    │
    ├─ 1. SeqGenerator.next()         → monotonic sequence number
    ├─ 2. WALManager.sync_append()    → durable on disk (fsync)
    ├─ 3. MemTableManager.put()       → insert into active skip list
    └─ 4. MemTableManager.maybe_freeze()
              │
              ├─ threshold not reached → return
              └─ threshold reached → freeze active table
                    │
                    ├─ snapshot skip list → ImmutableMemTable
                    ├─ push to immutable queue
                    ├─ create fresh ActiveMemTable
                    └─ signal FlushPipeline

Atomicity guarantee: all four steps happen under _mem.write_lock. Either all succeed or the caller sees the exception.

Read Path¶

Reads scan from newest to oldest data:

engine.get(key)
    │
    ├─ 1. Active memtable       → O(log n) skip list lookup
    ├─ 2. Immutable queue       → newest-first scan (0–4 snapshots)
    ├─ 3. L0 SSTables           → all files checked (overlapping ranges)
    │      └─ bloom → index bisect → block scan
    └─ 4. L1, L2, L3 SSTables  → one file per level, sequential
           └─ bloom → index bisect → block scan

The first match with the highest sequence number wins. Tombstones indicate deleted keys — the engine returns None.

Recovery Model¶

On startup, the engine replays the WAL to rebuild in-memory state:

Load config and open WAL file
Load all SSTables from disk via manifest
Determine max_seq_seen across all SSTables
Replay WAL entries with seq > max_seq_seen into fresh memtable
Restore sequence generator to highest seen seq
Start flush pipeline and compaction manager

Entries already flushed to SSTables are skipped. The WAL is the source of truth for unflushed data.

Thread and Async Model¶

Component	Model	Reason
MemTable operations	Sync (threading locks)	Pure in-memory, microseconds
WAL append	Sync under lock, offloaded via `to_thread` for async callers	Must fsync before returning
Flush pipeline	Async daemon task	Non-blocking I/O, parallel writes
Compaction merge	Subprocess (`ProcessPoolExecutor`)	CPU-bound, bypasses GIL
SSTable reads	Sync (mmap)	Memory-mapped, no syscall after page cache warm
Config access	Sync (RLock)	Thread-safe attribute reads

Architecture Diagram¶

Architecture diagram