Compaction Design¶
Compaction is the process of merging SSTables across levels to reduce read amplification, reclaim disk space, and garbage-collect deletion tombstones. Without compaction, every read would need to scan an ever-growing number of L0 files.
Why Compaction Matters¶
| Problem | Without compaction | With compaction |
|---|---|---|
| Read amplification | Grows with every flush (scan all L0 files) | Bounded by level count (3–4 levels) |
| Disk space | Dead keys and tombstones accumulate forever | Garbage-collected during merge |
| Bloom filter effectiveness | Each L0 file has its own filter | Merged filter covers more keys |
Level Model¶
L0: [file_a] [file_b] [file_c] ... (up to 10 files, overlapping key ranges)
↓ compaction (merge all L0 + existing L1)
L1: [single_merged_file] (non-overlapping, fully deduplicated)
↓ compaction (when L1 exceeds threshold)
L2: [single_merged_file]
↓
L3: [single_merged_file] (oldest data, largest file)
- L0: Multiple files with overlapping key ranges. Flushed directly from memtable.
- L1–L3: Exactly one file per level. Produced by merging the level above with the existing file at this level.
Trigger Mechanism¶
Compaction is flush-triggered, not daemon-driven. After every successful flush commit, the flush pipeline calls CompactionManager.check_and_compact(). This avoids idle polling and ensures compaction happens promptly when needed.
Thresholds¶
| Transition | Trigger | Config |
|---|---|---|
| L0 → L1 | File count >= threshold | l0_compaction_threshold (default: 10) |
| L1 → L2 | Dev: entry count, Prod: byte size | Cascading: 10^(N+1) x base |
| L2 → L3 | Dev: entry count, Prod: byte size | Cascading: 10^(N+1) x base |
Dev base = max_memtable_entries, Prod base = max_memtable_bytes.
Merge Algorithm¶
The KWayMergeIterator merges N sorted input iterators (one per input SSTable) into a single globally sorted stream:
- Deduplication: For equal keys, only the record with the highest sequence number is kept
- Tombstone GC: Tombstones with
seq < seq_cutoffare dropped (the deletion has propagated to all levels) - Tombstone preservation: Tombstones above the cutoff are kept (needed to shadow values in deeper levels)
The merge output is written to a new SSTable via SSTableWriter.finish_sync() (synchronous, since it runs in a subprocess).
Task-Based Execution¶
Each compaction run is represented by a CompactionTask — a frozen dataclass with only primitive fields that can safely cross the subprocess boundary. The full task lifecycle (trigger → build → dispatch → merge → commit → re-trigger) and how multiple tasks run as parallel async jobs is documented in detail in Compaction Manager.
Subprocess Merge¶
Compaction merges run in a ProcessPoolExecutor subprocess to achieve true parallelism, bypassing CPython's GIL. The subprocess has its own memory space and file handles with no shared state. The dispatch is non-blocking — asyncio.to_thread(pool.submit(...).result) keeps the event loop free during the merge.
Compaction Commit¶
After the subprocess produces a new SSTable, the commit is atomic:
- Register new reader (destination level becomes readable)
- Write manifest (durable)
- Mark old files for deletion (deferred by ref-count)
- Update in-memory state
- Evict stale cache blocks
Write locks on source and destination levels prevent reads from seeing inconsistent state.