DDIA Chapter 4 Guide — Storage and Retrieval
May 10, 2026
DDIA Chapter 3 — Data Models and Query Languages focused on data models. Chapter 4 moves one layer down: how databases store data, and how they retrieve it efficiently.
These are foundational questions for every database. A minimal implementation can already support set and get operations:
#!/bin/bash

db_set () {
    # Append the key-value pair to the end of the file; old values are never overwritten.
    echo "$1,$2" >> database
}

db_get () {
    # Find every record for the key, strip the key prefix,
    # and keep only the most recent value.
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
This works, but it does not scale: db_get scans the entire file, so lookup is O(n). Each time the dataset doubles, lookup time roughly doubles.
Indexes
Indexes are additional data structures that speed up reads. The trade-off is extra storage and slower writes, because index structures must be maintained.
Hash Indexes
A straightforward approach is to keep an in-memory hash map from key to on-disk location. This avoids full scans for point lookups.
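A minimal sketch of this idea in Python (the class name and the log format are illustrative, not from any real engine): every write appends to the log and records the byte offset in an in-memory dict, so a read can seek straight to the latest record instead of scanning.

```python
import os

class HashIndexedLog:
    """Append-only log with an in-memory hash index: key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}          # maps each key to the offset of its newest record
        open(path, "a").close()  # make sure the log file exists

    def set(self, key, value):
        offset = os.path.getsize(self.path)  # new record starts at end of file
        with open(self.path, "a") as f:
            f.write(f"{key},{value}\n")
        self.index[key] = offset             # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path) as f:
            f.seek(offset)                   # jump directly to the record: no scan
            line = f.readline().rstrip("\n")
        return line.split(",", 1)[1]
```

Note that writes stay cheap (one append plus one dict update), while reads become a single seek, which is exactly the trade-off the text describes.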
Limitations are clear:
- Memory pressure grows with key count.
- In-memory state is lost after restart and must be rebuilt.
- Range queries are inefficient, because a hash map keeps keys in no particular order.
SSTable and LSM
To improve on hash-only designs, keys can be stored in sorted order using SSTables (Sorted String Tables).
Benefits include:
- Lower memory usage with sparse indexes.
- Faster range scans due to sorted blocks.
- Better compression from shared key prefixes.
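To make the sparse-index benefit concrete, here is a sketch (all names illustrative) of a point lookup over a sorted segment: the index keeps only every 100th key, so memory stays small, and sortedness means only one small block ever needs scanning.

```python
import bisect

# A sorted segment of 1,000 key-value pairs, and a sparse index
# holding only every 100th key with its position in the segment.
segment = sorted((f"key{i:04d}", f"val{i}") for i in range(1000))
sparse_index = [(segment[i][0], i) for i in range(0, len(segment), 100)]

def lookup(key):
    keys = [k for k, _ in sparse_index]
    # Find the last indexed key <= the target key...
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None  # target sorts before every indexed key
    start = sparse_index[pos][1]
    # ...then scan only the block between two index entries.
    for k, v in segment[start:start + 100]:
        if k == key:
            return v
    return None
```

Because keys are sorted, knowing the offsets of "key0500" and "key0600" is enough to bound the search for any key between them.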
The downside is write amplification if you rewrite large sorted files directly. Real systems address this with memtables and background merge/compaction, which leads to the LSM-tree pattern:
- Log-Structured: writes are append-oriented.
- Merge: segments are merged in the background, keeping only the latest value for each key and using tombstones to mark deletions.
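The merge step can be sketched as follows (a simplification assuming segments fit in memory; the TOMBSTONE marker is illustrative, not from any real engine):

```python
# Illustrative deletion marker written in place of a value.
TOMBSTONE = object()

def compact(segments):
    """Merge segments ordered oldest -> newest; each is a sorted list of (key, value)."""
    latest = {}
    for segment in segments:       # later segments overwrite earlier ones
        for key, value in segment:
            latest[key] = value
    # Emit a new sorted segment, dropping keys whose newest record is a tombstone.
    return sorted((k, v) for k, v in latest.items() if v is not TOMBSTONE)

old = [("a", 1), ("b", 2), ("c", 3)]
new = [("a", 10), ("b", TOMBSTONE)]
merged = compact([old, new])  # "a" keeps its newest value; "b" is dropped
```

Real LSM engines stream-merge sorted segments rather than loading them into a dict, but the rule is the same: newest value wins, and tombstones let a delete in a new segment mask an older value.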
Bloom Filters
LSM systems may still check multiple SSTables for a missing key. Bloom filters help avoid unnecessary disk reads for keys that are definitely absent.
They are space-efficient probabilistic structures:
- If any required bit is 0, the key is definitely not present.
- If all bits are 1, the key might be present (possible false positives).
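A toy Bloom filter showing both behaviors above (sizes and the hashing scheme here are arbitrary choices for illustration; real systems tune bit-array size and hash count to the expected key volume):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [0] * size

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # A single zero bit proves the key was never added;
        # all-ones only means "maybe" (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(key))
```

An LSM read path consults the filter first and skips any SSTable whose filter answers "definitely absent", saving a disk read per skipped segment.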
B-Tree vs LSM
B-trees, which keep keys sorted in fixed-size pages and update those pages in place, are still the dominant index structure in many production systems.
General trade-off:
- LSM trees often win on write throughput and sequential I/O.
- B-trees often provide more stable read latency and efficient range scans.
So the practical choice is workload-driven: write-heavy patterns often favor LSM; read-heavy patterns often favor B-trees.