Bloom Filter
Probabilistic membership test for fast negative lookups.
Configuration
The bloom filter's false positive rate is controlled per environment via config.json:
| Config key | Default | Effect |
|---|---|---|
| bloom_fpr_dev | 0.05 (5%) | Smaller filters and faster builds; suited to small dev datasets |
| bloom_fpr_prod | 0.01 (1%) | Fewer false disk reads; suited to production workloads |
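As an illustration of how the two keys might be resolved, the sketch below reads a config.json payload and falls back to the defaults in the table. The helper name active_bloom_fpr and the env argument are hypothetical and are not part of the project's API.

```python
import json

# Defaults taken from the configuration table.
DEFAULTS = {"bloom_fpr_dev": 0.05, "bloom_fpr_prod": 0.01}


def active_bloom_fpr(config_text: str, env: str) -> float:
    """Return the false positive rate for env ("dev" or "prod").

    Hypothetical helper for illustration only; the project exposes the
    active value through its own configuration object.
    """
    config = json.loads(config_text)
    key = f"bloom_fpr_{env}"
    if key not in DEFAULTS:
        raise ValueError(f"unknown environment: {env!r}")
    return float(config.get(key, DEFAULTS[key]))


# A config.json that overrides only the production rate.
sample = '{"bloom_fpr_prod": 0.005}'
dev_fpr = active_bloom_fpr(sample, "dev")    # falls back to the default
prod_fpr = active_bloom_fpr(sample, "prod")  # uses the override
```

A lower rate costs more bits per key, so tightening bloom_fpr_prod trades memory and build time for fewer wasted disk reads.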
The active value is available via a convenience property.
The expected item count (bloom_n) is not configurable; it is derived from the actual data:
- Flush path: len(snapshot), the exact entry count of the immutable memtable
- Compaction path: sum of reader.meta.record_count across all input SSTables
This sizes the bloom filter for each SSTable from its real entry count rather than from the constructor default.
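The two derivations can be sketched as follows. The Meta and Reader classes are minimal stand-ins for the project's SSTable reader types, and the helper function names are hypothetical; only len(snapshot) and reader.meta.record_count come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Meta:
    record_count: int  # number of records stored in one SSTable


@dataclass
class Reader:
    meta: Meta  # stand-in for an SSTable reader exposing its metadata


def bloom_n_for_flush(snapshot) -> int:
    """Flush path: one filter entry per key in the immutable memtable."""
    return len(snapshot)


def bloom_n_for_compaction(readers) -> int:
    """Compaction path: total record count across all input SSTables."""
    return sum(r.meta.record_count for r in readers)


snapshot = {b"a": b"1", b"b": b"2", b"c": b"3"}
readers = [Reader(Meta(120)), Reader(Meta(80))]

flush_n = bloom_n_for_flush(snapshot)            # 3
compaction_n = bloom_n_for_compaction(readers)   # 200
```

On the compaction path the sum counts a key once per input SSTable, so when inputs overlap it is an upper bound on the number of keys in the output.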
BloomFilter
BloomFilter(n=1000000, fpr=0.01)
Bases: Serializable
Fixed-size Bloom filter backed by mmh3 hashes.
Initialize a Bloom filter with optimal bit count and hash count.
The optimal bit count and hash count are derived from n and fpr
using the standard formulas. Callers should pass the actual or
expected number of items — the flush path uses len(snapshot)
and the compaction path uses the sum of input record counts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Expected number of elements to be inserted. Determines the bit array size together with fpr. Clamped to 1 if non-positive. | 1000000 |
| fpr | float | Desired false positive rate. | 0.01 |
Source code in app/bloom/filter.py
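The standard sizing formulas are m = -n ln(p) / (ln 2)^2 for the bit count and k = (m / n) ln 2 for the hash count, where p is the false positive rate. The sketch below applies them; the function name and the rounding choices (ceiling for m, nearest integer for k) are assumptions and may differ from what filter.py does.

```python
import math


def optimal_parameters(n: int, fpr: float) -> tuple[int, int]:
    """Return (bit_count, hash_count) for n expected items at rate fpr."""
    n = max(1, n)  # mirrors the documented clamp for non-positive n
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole bit
    m = math.ceil(-n * math.log(fpr) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest integer, at least 1
    k = max(1, round(m / n * math.log(2)))
    return m, k


# Constructor defaults: one million items at a 1% false positive rate.
bits, hashes = optimal_parameters(1_000_000, 0.01)
```

With the defaults this works out to roughly 9.59 million bits (about 1.2 MB) and 7 hash functions, or a little under 10 bits per key.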
add(key)
Insert key into the filter.
may_contain(key)
Return True if key might be present (false positives allowed).
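To make the add and may_contain contract concrete, here is a minimal self-contained filter. It is an illustrative stand-in, not the project's BloomFilter: it uses hashlib.blake2b from the standard library in place of mmh3, and the double-hashing scheme for deriving bit positions is an assumption.

```python
import hashlib
import math


class TinyBloom:
    """Illustrative stand-in, not the project's BloomFilter."""

    def __init__(self, n: int = 1000, fpr: float = 0.01) -> None:
        n = max(1, n)
        self.m = math.ceil(-n * math.log(fpr) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: bytes):
        # Double hashing: derive k bit indexes from two halves of one digest.
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # forced odd, never zero
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key: bytes) -> bool:
        # False is definitive; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


bloom = TinyBloom(n=1000, fpr=0.01)
inserted = [f"key-{i}".encode() for i in range(1000)]
for key in inserted:
    bloom.add(key)

# Every inserted key must be reported as possibly present.
no_false_negatives = all(bloom.may_contain(k) for k in inserted)

# Keys that were never inserted should mostly be rejected.
probes = [f"absent-{i}".encode() for i in range(10_000)]
observed_fpr = sum(bloom.may_contain(k) for k in probes) / len(probes)
```

A False result lets the caller skip the SSTable entirely, which is where the fast negative lookup comes from; a True result still requires reading the table to confirm.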
to_bytes()
Serialize the filter to bytes with CRC footer.
from_bytes(data) classmethod
Deserialize a filter from data.
Raises CorruptRecordError if data is truncated or
CRC verification fails.
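A round trip with a CRC footer might look like the sketch below. The byte layout (a little-endian header holding the bit count and hash count, then the bit array, then a CRC32 footer) is an assumption made for illustration, and CorruptRecordError is redefined locally as a stand-in for the project's exception.

```python
import struct
import zlib


class CorruptRecordError(Exception):
    """Stand-in for the project's exception of the same name."""


_HEADER = struct.Struct("<QI")  # assumed layout: bit count (m), hash count (k)
_FOOTER = struct.Struct("<I")   # CRC32 over header + bit array


def to_bytes(m: int, k: int, bits: bytes) -> bytes:
    body = _HEADER.pack(m, k) + bits
    return body + _FOOTER.pack(zlib.crc32(body))


def from_bytes(data: bytes):
    if len(data) < _HEADER.size + _FOOTER.size:
        raise CorruptRecordError("truncated bloom filter")
    body, footer = data[:-_FOOTER.size], data[-_FOOTER.size:]
    (expected,) = _FOOTER.unpack(footer)
    if zlib.crc32(body) != expected:
        raise CorruptRecordError("CRC mismatch")
    m, k = _HEADER.unpack_from(body)
    bits = body[_HEADER.size:]
    if len(bits) != (m + 7) // 8:
        raise CorruptRecordError("bit array length does not match header")
    return m, k, bits


blob = to_bytes(16, 3, b"\x0f\xf0")
round_trip = from_bytes(blob)

# Flip one bit in the payload and confirm the CRC check catches it.
tampered = bytearray(blob)
tampered[_HEADER.size] ^= 0x01
try:
    from_bytes(bytes(tampered))
    detected = False
except CorruptRecordError:
    detected = True
```

Checking the CRC before interpreting the header means a damaged file is rejected up front instead of producing a filter that silently returns wrong answers.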