# LSM Trees vs B+ Trees
This document explains what a Log-Structured Merge Tree (LSM Tree) is, how it compares to the traditional B+ Tree used in most relational databases, and why kiwi-db chose an LSM architecture.
## What is a B+ Tree?
A B+ Tree is a balanced, disk-oriented tree structure where:
- All data lives in leaf nodes (sorted by key)
- Internal nodes store only keys and child pointers for navigation
- Every insert, update, or delete modifies the tree in place — finding the correct leaf, updating it, and potentially splitting or merging nodes to maintain balance
B+ Trees are the foundation of most relational databases (PostgreSQL, MySQL/InnoDB, SQLite) and are optimized for read-heavy workloads where data is queried far more often than it is written.
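The layout above can be sketched as a toy model (the `Internal`/`Leaf` classes and the two-level tree are invented for illustration, not any real database's node format):

```python
import bisect

class Internal:
    """Internal node: sorted separator keys plus child pointers."""
    def __init__(self, keys, children):
        self.keys = keys              # len(keys) == len(children) - 1
        self.children = children

class Leaf:
    """Leaf node: all key/value data lives here, sorted by key."""
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values

def search(node, key):
    # Descend from the root, choosing one child at each internal node.
    while isinstance(node, Internal):
        node = node.children[bisect.bisect_right(node.keys, key)]
    # Binary search within the leaf page.
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i]
    return None

# Two-level example: keys < 10 in the left leaf, >= 10 in the right.
root = Internal([10], [Leaf([1, 5], ["a", "b"]), Leaf([10, 20], ["c", "d"])])
```

Updates are the expensive part: inserting into a full `Leaf` would mean splitting it and pushing a new separator key into the parent — exactly the in-place maintenance described next.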
### How a B+ Tree write works
1. Traverse tree from root to target leaf → O(log_B N) random reads
2. Modify the leaf page in place → 1 random write
3. If leaf overflows → split + update parent → 1–2 more random writes
4. Fsync the modified pages → random I/O
Each write requires random I/O — seeking to the correct page on disk, modifying it, and fsyncing. On HDDs this is ~10ms per seek. On SSDs it's faster but still involves write amplification from page rewrites.
### How a B+ Tree read works
1. Traverse tree from root to target leaf → O(log_B N) page reads
2. Binary search within the leaf page → O(log B) comparisons
Reads are efficient because the tree is always sorted and balanced. With a branching factor of ~100, even a billion keys need only 4–5 page reads.
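That page-read count is easy to verify with back-of-envelope arithmetic (the branching factor of 100 is the text's illustrative figure):

```python
import math

def page_reads(n_keys, branching_factor):
    # Tree height ≈ ceil(log_B N); one page read per level.
    return math.ceil(math.log(n_keys) / math.log(branching_factor))

reads = page_reads(10**9, 100)   # log_100(1e9) = 4.5, so 5 page reads
```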
## What is an LSM Tree?
A Log-Structured Merge Tree takes the opposite approach:
- Writes go to an in-memory buffer (memtable) first, with durability provided by a sequential WAL append — no random disk I/O on the write path
- When the buffer fills up, it's flushed as a sorted file (SSTable) to disk — purely sequential I/O
- Background compaction merges sorted files to keep read performance bounded
- Reads check the memtable first, then search through sorted files on disk
LSM Trees are the foundation of most write-heavy systems (Cassandra, RocksDB, LevelDB, HBase, ScyllaDB) and are optimized for write-heavy workloads where throughput matters more than single-read latency.
### How an LSM Tree write works
1. Append to WAL (sequential write + fsync) → O(1) sequential I/O
2. Insert into in-memory memtable → O(log N) in-memory
3. Done — no disk page modification → no random I/O
Every write is sequential I/O — appending to the WAL and inserting into memory. No seeking, no page splitting, no in-place modification. This is fundamentally cheaper than B+ Tree writes.
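A minimal sketch of that write path (illustrative only: kiwi-db's real memtable is a skip list, and the WAL and SSTable formats here are invented):

```python
import json
import os

class WriteAheadLog:
    def __init__(self, path):
        self.f = open(path, "a", encoding="utf-8")

    def append(self, key, value):
        # Sequential append followed by fsync — no seeks involved.
        self.f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())

class LSMWriter:
    def __init__(self, wal_path, flush_threshold=4):
        self.wal = WriteAheadLog(wal_path)
        self.memtable = {}            # stand-in for kiwi-db's skip list
        self.flush_threshold = flush_threshold
        self.flushed = []             # toy "SSTables": sorted key/value lists

    def put(self, key, value):
        self.wal.append(key, value)   # step 1: durable, sequential
        self.memtable[key] = value    # step 2: in-memory insert
        if len(self.memtable) >= self.flush_threshold:
            self._flush()             # background in a real LSM engine

    def _flush(self):
        # Writing the sorted memtable out is one sequential pass.
        self.flushed.append(sorted(self.memtable.items()))
        self.memtable.clear()
```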
### How an LSM Tree read works
1. Check memtable (in-memory) → O(log N)
2. Check each SSTable on disk:
    a. Bloom filter test → O(k) hash checks
    b. Sparse index bisect → O(log B) comparisons
    c. Block read + scan → 1 sequential read
3. Return result with highest sequence number
Reads are more expensive than B+ Tree reads because multiple sources must be checked; the cost depends on the number of levels and SSTables. Bloom filters mitigate this by eliminating most negative lookups.
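The lookup order can be sketched end to end (a toy bloom filter with arbitrary sizing; newest-first SSTable ordering stands in for sequence numbers):

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: k hash probes into an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _probes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means definitely absent; True means "probably present".
        return all(self.bits >> p & 1 for p in self._probes(key))

def lsm_get(key, memtable, sstables):
    """sstables: newest-first list of (bloom, {key: value}) pairs."""
    if key in memtable:                   # 1. check memtable
        return memtable[key]
    for bloom, table in sstables:         # 2. check each SSTable
        if not bloom.might_contain(key):  #    skip files that can't match
            continue
        if key in table:                  #    newest hit wins
            return table[key]
    return None                           # 3. key absent everywhere
```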
## Side-by-Side Comparison
| Dimension | B+ Tree | LSM Tree | Winner |
|---|---|---|---|
| Write latency | O(log_B N) random I/O | O(1) sequential I/O | LSM |
| Write throughput | Bounded by random IOPS | Bounded by sequential bandwidth | LSM |
| Write amplification | 1× (in-place) + page splits | L/B to T·L/B (compaction rewrites) | B+ Tree |
| Read latency (point) | O(log_B N) — predictable | O(L · φ) to O(T · L · φ) — depends on levels + bloom FPR | B+ Tree |
| Read amplification | 1× — single tree traversal | L to T·L — check each level | B+ Tree |
| Range scan | O(k) — leaf pages are linked | O(k · L) — merge across levels | B+ Tree |
| Space amplification | ~1× (in-place updates) | ≈1 + 1/T (leveling) to T× (tiering) | B+ Tree |
| Sequential disk access | Mostly random | Mostly sequential | LSM |
| SSD friendliness | Random writes wear SSD faster | Sequential writes extend SSD life | LSM |
| Concurrent writes | Page-level locking, splits block | Memtable-level locking, no disk contention | LSM |
## The RUM Conjecture
The Read-Update-Memory (RUM) conjecture states that any data structure can optimize at most two of three dimensions:
- Read cost
- Update (write) cost
- Memory (space) overhead
You cannot minimize all three simultaneously. Every design makes a tradeoff:
| Structure | Optimizes | Sacrifices |
|---|---|---|
| B+ Tree | Read + Memory | Write (random I/O per update) |
| LSM Tree (leveling) | Read + Memory | Write (compaction rewrites) |
| LSM Tree (tiering) | Write + Memory | Read (multiple runs per level) |
| LSM Tree (kiwi-db hybrid) | Write + Read (balanced) | Memory (bounded L0 overlap) |
kiwi-db's hybrid approach (tiering at L0, leveling at L1+) trades a small, bounded read penalty at L0 for dramatically cheaper writes — the best tradeoff for write-heavy workloads.
## Why LSM for Write-Heavy Workloads
### 1. Sequential I/O is 10–100× cheaper than random I/O
On an HDD, a sequential write of 64KB takes ~0.1ms. A random write of 4KB takes ~10ms. The LSM write path is entirely sequential — WAL append + SSTable flush. The B+ Tree write path requires random page modifications.
On SSDs, the gap is smaller (~3–10×) but still significant at scale. Additionally, random small writes to SSDs cause write amplification at the flash translation layer (FTL), reducing SSD lifespan.
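Plugging in those HDD figures (illustrative numbers from the text, not measurements):

```python
seek_ms = 10.0       # one random 4 KB write costs ~1 seek
seq_64kb_ms = 0.1    # one sequential 64 KB write, head already positioned

random_writes_per_s = 1000 / seek_ms      # ≈ 100 random writes/s
seq_writes_per_s = 1000 / seq_64kb_ms     # ≈ 10,000 sequential writes/s

# ~100× more operations per second, each moving 16× more bytes.
speedup = seq_writes_per_s / random_writes_per_s
```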
### 2. Write path has no disk contention
B+ Tree writes modify shared pages — concurrent writers must coordinate via page-level locks, and page splits can cascade. LSM writes go to an in-memory memtable. Multiple threads can write concurrently with fine-grained locking (per-node in kiwi-db's skip list). The disk is only involved during background flush, which is non-blocking.
### 3. Writes never block on compaction
In kiwi-db, compaction runs in a background subprocess. The write path (WAL + memtable) is completely independent of compaction. In a B+ Tree, there's no concept of background reorganization — every write directly modifies the tree.
### 4. Write amplification is deferred and amortized
A B+ Tree write touches 1–3 pages immediately (leaf + potential splits). An LSM write touches 0 pages immediately (just WAL + memory). The compaction cost is paid later, in the background, and amortized across many writes.
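The amortization can be made concrete with the standard leveled-compaction cost model (matching the T·L/B expression in the comparison table; the parameter values below are illustrative, not kiwi-db defaults):

```python
import math

def lsm_levels(n_entries, memtable_entries, fanout):
    """On-disk levels L needed to hold N entries with growth factor T."""
    return math.ceil(math.log(n_entries / memtable_entries, fanout))

def amortized_write_ios(n_entries, memtable_entries, fanout, entries_per_block):
    # Leveled compaction rewrites each entry ~fanout times per level,
    # but every block write carries entries_per_block entries: T * L / B.
    levels = lsm_levels(n_entries, memtable_entries, fanout)
    return fanout * levels / entries_per_block

# 1B entries, 1M-entry memtable, fanout 10, 100 entries/block:
# 3 levels and ~0.3 amortized I/Os per insert, all of it sequential,
# versus ~4-5 immediate random I/Os for a B+ Tree write.
```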
## Why B+ Trees Win for Read-Heavy Workloads
### 1. Single traversal, predictable latency
A B+ Tree read follows one path from root to leaf — O(log_B N) pages, always. An LSM read may check memtable + multiple SSTables, with latency varying based on where the key lives.
### 2. No bloom filter false positives
B+ Tree reads are deterministic — the key is either in the leaf or it isn't. LSM reads rely on bloom filters that have a false positive rate (1–5%). Each false positive triggers an unnecessary block read.
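The false-positive rate follows the standard bloom filter estimate, p ≈ (1 − e^(−kn/m))^k. The 10-bits-per-key, 7-hash configuration below is a common choice, not necessarily kiwi-db's:

```python
import math

def bloom_fpr(n_keys, bits_per_key, k_hashes):
    """Standard bloom filter false-positive estimate: (1 - e^(-k*n/m))^k."""
    m = n_keys * bits_per_key                        # total filter bits
    return (1 - math.exp(-k_hashes * n_keys / m)) ** k_hashes

# ~10 bits per key with k = 7 hashes gives roughly a 0.8% false-positive
# rate, independent of key count (bits per key is what matters).
fpr = bloom_fpr(1_000_000, 10, 7)
```

Each false positive costs one wasted block read, so the expected extra reads per point lookup are roughly p times the number of SSTables whose filters are consulted.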
### 3. Range scans are a linked-list traversal
B+ Tree leaf nodes are linked. A range scan reads consecutive pages — perfect sequential I/O. An LSM range scan must merge results from multiple levels, each potentially requiring seeks to different SSTables.
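The multi-level merge is a k-way merge over sorted runs; a sketch using Python's `heapq.merge` (tombstone handling and real SSTable block iterators are omitted for brevity):

```python
import heapq

def lsm_range_scan(sorted_runs):
    """Merge sorted runs into one sorted stream of (key, value) pairs.

    sorted_runs: list of [(key, value), ...] lists, each sorted by key;
    index 0 is the newest run (e.g. the memtable), so for duplicate keys
    the first entry seen after the merge carries the newest value.
    """
    tagged = [
        [(key, age, value) for key, value in run]   # tag entries with run age
        for age, run in enumerate(sorted_runs)      # age 0 = newest
    ]
    last_key = object()                             # sentinel: matches no key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:                         # first hit = newest value
            yield key, value
            last_key = key
```

Because ties on `key` are broken by `age`, a key updated in the memtable shadows the stale copies still sitting in lower levels.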
## Where kiwi-db Sits on the Spectrum
kiwi-db is designed for write-heavy workloads with point reads — the sweet spot for LSM Trees. Its specific optimizations push further toward write performance:
| Feature | Effect |
|---|---|
| L0 tiering (multiple files) | Flush is O(N) — no merge-on-write |
| Parallel flush pipeline | 2× disk throughput on NVMe |
| Subprocess compaction | GIL bypass — writes never blocked |
| Lock-free skip list reads | Concurrent writes don't block reads |
| Bloom filters (per-SSTable) | Point reads skip irrelevant files |
| Three-tier block cache | Hot metadata stays resident |
| Environment-aware bloom FPR | Dev speed vs prod accuracy |
For workloads that are primarily read-heavy with infrequent writes, a B+ Tree (PostgreSQL, SQLite) would be more appropriate. For workloads with high write throughput, time-series data, log ingestion, or append-heavy patterns, the LSM architecture provides fundamentally better performance.