DDIA Chapter 4 Guide — Storage and Retrieval
May 10, 2026
DDIA Chapter 3 — Data Models and Query Languages focused on data models. Chapter 4 moves one layer down: how databases store data, and how they retrieve it efficiently.
These are foundational questions for every database. A minimal implementation can already support set and get operations:
#!/bin/bash

db_set () {
    # Append the key-value pair to the end of the file; old values are never overwritten.
    echo "$1,$2" >> database
}

db_get () {
    # Find every record for the key, strip the key prefix,
    # and keep only the most recent value.
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
This works, but it does not scale: db_get scans the entire file, so lookup is O(n). Each time the dataset doubles, lookup time roughly doubles.
Indexes
Indexes are additional data structures that speed up reads. The trade-off is extra storage and slower writes, because index structures must be maintained.
Hash Indexes
A straightforward approach is to keep an in-memory hash map from key to on-disk location. This avoids full scans for point lookups.
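A minimal sketch of this idea in Python (the class name and the log format are illustrative, not from any real engine): every write appends to the log and records the byte offset in an in-memory dict, so a read can seek straight to the latest record instead of scanning.

```python
import os

class HashIndexedLog:
    """Append-only log with an in-memory hash index: key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}          # maps each key to the offset of its newest record
        open(path, "a").close()  # make sure the log file exists

    def set(self, key, value):
        offset = os.path.getsize(self.path)  # new record starts at end of file
        with open(self.path, "a") as f:
            f.write(f"{key},{value}\n")
        self.index[key] = offset             # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path) as f:
            f.seek(offset)                   # jump directly to the record: no scan
            line = f.readline().rstrip("\n")
        return line.split(",", 1)[1]
```

Note that writes stay cheap (one append plus one dict update), while reads become a single seek, which is exactly the trade-off the text describes.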
Limitations are clear:
- Memory pressure grows with key count.
- In-memory state is lost after restart and must be rebuilt.
- Range queries are inefficient, because a hash map keeps keys in no particular order.
SSTable and LSM
To improve on hash-only designs, keys can be stored in sorted order using SSTables (Sorted String Tables).
Benefits include:
- Lower memory usage with sparse indexes.
- Faster range scans due to sorted blocks.
- Better compression from shared key prefixes.
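To make the sparse-index benefit concrete, here is a sketch (all names illustrative) of a point lookup over a sorted segment: the index keeps only every 100th key, so memory stays small, and sortedness means only one small block ever needs scanning.

```python
import bisect

# A sorted segment of 1,000 key-value pairs, and a sparse index
# holding only every 100th key with its position in the segment.
segment = sorted((f"key{i:04d}", f"val{i}") for i in range(1000))
sparse_index = [(segment[i][0], i) for i in range(0, len(segment), 100)]

def lookup(key):
    keys = [k for k, _ in sparse_index]
    # Find the last indexed key <= the target key...
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None  # target sorts before every indexed key
    start = sparse_index[pos][1]
    # ...then scan only the block between two index entries.
    for k, v in segment[start:start + 100]:
        if k == key:
            return v
    return None
```

Because keys are sorted, knowing the offsets of "key0500" and "key0600" is enough to bound the search for any key between them.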
The downside is write amplification if you rewrite large sorted files directly. Real systems address this with memtables and background merge/compaction, which leads to the LSM-tree pattern:
- Log-Structured: writes are append-oriented.
- Merge: segments are merged in the background, keeping only the latest value for each key and using tombstones to mark deletions.
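The merge step can be sketched as follows (a simplification assuming segments fit in memory; the TOMBSTONE marker is illustrative, not from any real engine):

```python
# Illustrative deletion marker written in place of a value.
TOMBSTONE = object()

def compact(segments):
    """Merge segments ordered oldest -> newest; each is a sorted list of (key, value)."""
    latest = {}
    for segment in segments:       # later segments overwrite earlier ones
        for key, value in segment:
            latest[key] = value
    # Emit a new sorted segment, dropping keys whose newest record is a tombstone.
    return sorted((k, v) for k, v in latest.items() if v is not TOMBSTONE)

old = [("a", 1), ("b", 2), ("c", 3)]
new = [("a", 10), ("b", TOMBSTONE)]
merged = compact([old, new])  # "a" keeps its newest value; "b" is dropped
```

Real LSM engines stream-merge sorted segments rather than loading them into a dict, but the rule is the same: newest value wins, and tombstones let a delete in a new segment mask an older value.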
Bloom Filters
LSM systems may still check multiple SSTables for a missing key. Bloom filters help avoid unnecessary disk reads for keys that are definitely absent.
They are space-efficient probabilistic structures:
- If any required bit is 0, the key is definitely not present.
- If all bits are 1, the key might be present (possible false positives).
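A toy Bloom filter showing both behaviors above (sizes and the hashing scheme here are arbitrary choices for illustration; real systems tune bit-array size and hash count to the expected key volume):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [0] * size

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # A single zero bit proves the key was never added;
        # all-ones only means "maybe" (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(key))
```

An LSM read path consults the filter first and skips any SSTable whose filter answers "definitely absent", saving a disk read per skipped segment.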
B-Tree vs LSM
B-trees, which keep keys sorted in fixed-size pages and update those pages in place, are still the dominant index structure in many production systems.
General trade-off:
- LSM trees often win on write throughput and sequential I/O.
- B-trees often provide more stable read latency and efficient range scans.
So the practical choice is workload-driven: write-heavy patterns often favor LSM; read-heavy patterns often favor B-trees.