Parallel Flush Correctness¶
This document proves the correctness of the parallel flush design — specifically, that data is never unreachable and that commit ordering is preserved despite concurrent writes.
The Core Invariant¶
At every point in time, every durable key must be findable by get() — either in the immutable queue or in the SSTable registry. Never in neither.
This means:
| Phase | Parallel? | Reason |
|---|---|---|
| SSTable disk write | Yes | Each writes to a separate directory |
| Registry register + queue pop + WAL truncate | No — strictly ordered | Determines read visibility and WAL safety |
Writes are embarrassingly parallel. Commits must be serialized in oldest-first order.
Event Chain Mechanism¶
Each flush slot carries two asyncio.Event objects:
slot0.prev_committed = pre_set_event ← no predecessor, can commit immediately
slot0.my_committed = Event_A
slot1.prev_committed = Event_A ← must wait for slot0
slot1.my_committed = Event_B
slot2.prev_committed = Event_B ← must wait for slot1
slot2.my_committed = Event_C
Each worker blocks on prev_committed.wait() only after its disk write is complete. The wait does not delay the write — only the commit.
Crash Safety¶
Crash during Phase 1 (write)¶
- Snapshot still in immutable queue
- Incomplete SSTable directory has no
meta.json— ignored on recovery - WAL intact — recovery replays all entries
Crash after register, before pop¶
- SSTable registered and readable
- Snapshot still in queue (redundant but safe)
- WAL replay deduplicates by sequence number
Crash after pop, before WAL truncate¶
- SSTable has all data
- WAL entries still present — replayed and deduplicated on recovery
Crash after WAL truncate¶
- Clean state: SSTable has data, WAL trimmed
Out-of-Order Completion¶
When a newer snapshot finishes writing before an older one:
snap0 (oldest): ──[write: 400ms]─────────────────[commit A]──►
snap1 (newer): ──[write: 200ms]──[wait: 200ms]──────────────[commit B]──►
snap1 finishes writing 200ms early. It blocks on Event_A.wait() until snap0 commits. Total elapsed: 400ms + commit overhead. No ordering violation.
State Machine (2 Workers)¶
Queue at dispatch: [snap1(newer)] [snap0(oldest)]
slot0: prev=pre_set, my=Event_A
slot1: prev=Event_A, my=Event_B
Worker A picks slot0:
Phase 1: write snap0 to disk
Phase 2: prev already set → commit immediately
register L0a → pop snap0 → truncate WAL
set Event_A
Worker B picks slot1:
Phase 1: write snap1 to disk (concurrent with A)
Phase 2: wait for Event_A
register L0b → pop snap1 → truncate WAL
set Event_B
After both complete: queue empty, both SSTables registered, WAL truncated, Event_B set for the next batch.
Failure Propagation¶
If any slot fails during Phase 1 (write):
batch_abortevent is set — all downstream slots skip their commitsmy_committedis set in thefinallyblock — prevents chain deadlock- The failed snapshot stays in the queue — retried on the next dispatch cycle
- WAL is not truncated — data remains recoverable