Skip to content

LSM Trees vs B+ Trees

This document explains what a Log-Structured Merge Tree (LSM Tree) is, how it compares to the traditional B+ Tree used in most relational databases, and why kiwi-db chose an LSM architecture.


What is a B+ Tree?

A B+ Tree is a balanced, disk-oriented tree structure where:

  • All data lives in leaf nodes (sorted by key)
  • Internal nodes store only keys and child pointers for navigation
  • Every insert, update, or delete modifies the tree in place — finding the correct leaf, updating it, and potentially splitting or merging nodes to maintain balance

B+ Trees are the foundation of most relational databases (PostgreSQL, MySQL/InnoDB, SQLite) and are optimized for read-heavy workloads where data is queried far more often than it is written.

How a B+ Tree write works

1. Traverse tree from root to target leaf         → O(log_B N) random reads
2. Modify the leaf page in place                   → 1 random write
3. If leaf overflows → split + update parent       → 1–2 more random writes
4. Fsync the modified pages                        → random I/O

Each write requires random I/O — seeking to the correct page on disk, modifying it, and fsyncing. On HDDs this is ~10ms per seek. On SSDs it's faster but still involves write amplification from page rewrites.

How a B+ Tree read works

1. Traverse tree from root to target leaf         → O(log_B N) page reads
2. Binary search within the leaf page             → O(log B) comparisons

Reads are efficient because the tree is always sorted and balanced. With a branching factor of ~100, even a billion keys need only 4–5 page reads.


What is an LSM Tree?

A Log-Structured Merge Tree takes the opposite approach:

  • Writes go to an in-memory buffer (memtable) first — no disk I/O on the write path
  • When the buffer fills up, it's flushed as a sorted file (SSTable) to disk — purely sequential I/O
  • Background compaction merges sorted files to keep read performance bounded
  • Reads check the memtable first, then search through sorted files on disk

LSM Trees are the foundation of most write-heavy systems (Cassandra, RocksDB, LevelDB, HBase, ScyllaDB) and are optimized for write-heavy workloads where throughput matters more than single-read latency.

How an LSM Tree write works

1. Append to WAL (sequential write + fsync)        → O(1) sequential I/O
2. Insert into in-memory memtable                   → O(log N) in-memory
3. Done — no disk page modification                 → no random I/O

Every write is sequential I/O — appending to the WAL and inserting into memory. No seeking, no page splitting, no in-place modification. This is fundamentally cheaper than B+ Tree writes.

How an LSM Tree read works

1. Check memtable (in-memory)                      → O(log N)
2. Check each SSTable on disk:
   a. Bloom filter test                            → O(k) hash checks
   b. Sparse index bisect                          → O(log B) comparisons
   c. Block read + scan                            → 1 sequential read
3. Return result with highest sequence number

Reads are more expensive than B+ Trees because multiple sources must be checked. The cost depends on the number of levels and SSTables. Bloom filters mitigate this by eliminating most negative lookups.


Side-by-Side Comparison

Dimension B+ Tree LSM Tree Winner
Write latency O(log_B N) random I/O O(1) sequential I/O LSM
Write throughput Bounded by random IOPS Bounded by sequential bandwidth LSM
Write amplification 1× (in-place) + page splits L/B to T·L/B (compaction rewrites) B+ Tree
Read latency (point) O(log_B N) — predictable O(L · φ) to O(T · L · φ) — depends on levels + bloom FPR B+ Tree
Read amplification 1× — single tree traversal L to T·L — check each level B+ Tree
Range scan O(k) — leaf pages are linked O(k · L) — merge across levels B+ Tree
Space amplification ~1× (in-place updates) 1/T to T× (depends on compaction strategy) B+ Tree
Sequential disk access Mostly random Mostly sequential LSM
SSD friendliness Random writes wear SSD faster Sequential writes extend SSD life LSM
Concurrent writes Page-level locking, splits block Memtable-level locking, no disk contention LSM

The RUM Conjecture

The Read-Update-Memory (RUM) conjecture states that any data structure can optimize at most two of three dimensions:

  • Read cost
  • Update (write) cost
  • Memory (space) overhead

You cannot minimize all three simultaneously. Every design makes a tradeoff:

Structure Optimizes Sacrifices
B+ Tree Read + Memory Write (random I/O per update)
LSM Tree (leveling) Read + Memory Write (compaction rewrites)
LSM Tree (tiering) Write + Memory Read (multiple runs per level)
LSM Tree (kiwi-db hybrid) Write + Read (balanced) Memory (bounded L0 overlap)

kiwi-db's hybrid approach (tiering at L0, leveling at L1+) trades a small, bounded read penalty at L0 for dramatically cheaper writes — the best tradeoff for write-heavy workloads.


Why LSM for Write-Heavy Workloads

1. Sequential I/O is 10–100× cheaper than random I/O

On an HDD, a sequential write of 64KB takes ~0.1ms. A random write of 4KB takes ~10ms. The LSM write path is entirely sequential — WAL append + SSTable flush. The B+ Tree write path requires random page modifications.

On SSDs, the gap is smaller (~3–10×) but still significant at scale. Additionally, random small writes to SSDs cause write amplification at the flash translation layer (FTL), reducing SSD lifespan.

2. Write path has no disk contention

B+ Tree writes modify shared pages — concurrent writers must coordinate via page-level locks, and page splits can cascade. LSM writes go to an in-memory memtable. Multiple threads can write concurrently with fine-grained locking (per-node in kiwi-db's skip list). The disk is only involved during background flush, which is non-blocking.

3. Writes never block on compaction

In kiwi-db, compaction runs in a background subprocess. The write path (WAL + memtable) is completely independent of compaction. In a B+ Tree, there's no concept of background reorganization — every write directly modifies the tree.

4. Write amplification is deferred and amortized

A B+ Tree write touches 1–3 pages immediately (leaf + potential splits). An LSM write touches 0 pages immediately (just WAL + memory). The compaction cost is paid later, in the background, and amortized across many writes.


Why B+ Trees Win for Read-Heavy Workloads

1. Single traversal, predictable latency

A B+ Tree read follows one path from root to leaf — O(log_B N) pages, always. An LSM read may check memtable + multiple SSTables, with latency varying based on where the key lives.

2. No bloom filter false positives

B+ Tree reads are deterministic — the key is either in the leaf or it isn't. LSM reads rely on bloom filters that have a false positive rate (1–5%). Each false positive triggers an unnecessary block read.

3. Range scans are a linked-list traversal

B+ Tree leaf nodes are linked. A range scan reads consecutive pages — perfect sequential I/O. An LSM range scan must merge results from multiple levels, each potentially requiring seeks to different SSTables.


Where kiwi-db Sits on the Spectrum

kiwi-db is designed for write-heavy workloads with point reads — the sweet spot for LSM Trees. Its specific optimizations push further toward write performance:

Feature Effect
L0 tiering (multiple files) Flush is O(N) — no merge-on-write
Parallel flush pipeline 2× disk throughput on NVMe
Subprocess compaction GIL bypass — writes never blocked
Lock-free skip list reads Concurrent writes don't block reads
Bloom filters (per-SSTable) Point reads skip irrelevant files
Three-tier block cache Hot metadata stays resident
Environment-aware bloom FPR Dev speed vs prod accuracy

For workloads that are primarily read-heavy with infrequent writes, a B+ Tree (PostgreSQL, SQLite) would be more appropriate. For workloads with high write throughput, time-series data, log ingestion, or append-heavy patterns, the LSM architecture provides fundamentally better performance.