DDIA Chapter 4 Guide — Storage and Retrieval

May 10, 2026

☕️ Support Us
Your support helps us continue to provide quality content. 👉 Buy Me a Coffee

In DDIA Chapter 3 — Data Models and Query Languages, the focus was on data models. In Chapter 4, the discussion moves one layer down: how databases store data, and how they retrieve it efficiently.

These are foundational questions for every database. A minimal implementation can already support set and get operations:

#!/bin/bash

# Append a key,value record to the end of the file (an O(1) write).
db_set () {
  echo "$1,$2" >> database
}

# Find every line for the key, strip the key prefix, and keep only the
# last match, since the most recent append wins (an O(n) read).
db_get () {
  grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

This works, but it does not scale. db_get scans the whole file, so lookup is O(n). As the dataset doubles, lookup time roughly doubles.

Indexes

Indexes are additional data structures that speed up reads. The trade-off is extra storage and slower writes, because index structures must be maintained.

Hash Indexes

A straightforward approach is to keep an in-memory hash map from key to on-disk location. This avoids full scans for point lookups.
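As an illustrative sketch (not code from the book), a hash-indexed append-only log might look like the following in Python. The file name `database_demo`, the class name, and the method names are hypothetical; the `key,value` record format mirrors the bash example above.

```python
import os

class HashIndexedLog:
    """Append-only log with an in-memory hash index (key -> byte offset)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key
        open(path, "a").close()  # ensure the file exists

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)  # record the offset where this append lands
            offset = f.tell()
            f.write(record)
        self.index[key] = offset  # later writes overwrite earlier offsets

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None  # no full scan needed for a missing key
        with open(self.path, "rb") as f:
            f.seek(offset)  # jump straight to the record
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]

db = HashIndexedLog("database_demo")
db.set("42", "hello")
db.set("42", "world")
print(db.get("42"))  # world
```

Reads become a single seek instead of a scan, which is essentially the design popularized by Bitcask.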

Limitations are clear:

  • Memory pressure grows with key count.
  • In-memory state is lost after restart and must be rebuilt.
  • Range queries are inefficient, because a hash map does not keep keys in sorted order.

SSTable and LSM

To improve on hash-only designs, keys can be stored in sorted order using SSTables (Sorted String Tables).

Benefits include:

  • Lower memory usage with sparse indexes.
  • Faster range scans due to sorted blocks.
  • Better compression from shared key prefixes.
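To make the sparse-index idea concrete, here is a hedged Python sketch over an in-memory sorted segment. The function names and the `stride` parameter are illustrative; a real SSTable would store compressed blocks on disk and index one key per block.

```python
import bisect

def build_sparse_index(sorted_keys, stride):
    # Keep only every stride-th key, with its position in the segment.
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), stride)]

def sstable_get(sorted_items, sparse, stride, key):
    indexed_keys = [k for k, _ in sparse]
    # Find the last indexed key <= the target key.
    i = bisect.bisect_right(indexed_keys, key) - 1
    if i < 0:
        return None  # key sorts before the first entry in the segment
    start = sparse[i][1]
    # Scan at most one "block" of `stride` entries.
    for k, v in sorted_items[start:start + stride]:
        if k == key:
            return v
    return None

items = [("apple", 1), ("banana", 2), ("cherry", 3),
         ("date", 4), ("fig", 5), ("grape", 6), ("kiwi", 7)]
sparse = build_sparse_index([k for k, _ in items], stride=3)
print(sstable_get(items, sparse, 3, "fig"))   # 5
print(sstable_get(items, sparse, 3, "plum"))  # None
```

Because the segment is sorted, the index only needs one entry per block rather than one per key, which is where the memory saving comes from.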

The downside is write amplification if you rewrite large sorted files directly. Real systems address this with memtables and background merge/compaction, which leads to the LSM-tree pattern:

  • Log-Structured: append-oriented writes.
  • Merge: merge segments and keep the latest value, with tombstones for deletes.
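The merge step can be sketched as follows, assuming segments are in-memory sorted lists of pairs and `None` stands in for a tombstone (both assumptions are mine, not the book's):

```python
TOMBSTONE = None  # deletion marker written in place of a value

def compact(segments):
    """Merge sorted segments into one; segments[0] is the newest.

    For each key, keep the value from the newest segment that contains it,
    and drop any key whose newest entry is a tombstone.
    """
    merged = {}
    # Walk from oldest to newest so newer entries overwrite older ones.
    for segment in reversed(segments):
        for key, value in segment:
            merged[key] = value
    # Drop tombstones and emit the result in sorted key order.
    return [(k, v) for k, v in sorted(merged.items()) if v is not TOMBSTONE]

newest = [("banana", TOMBSTONE), ("cherry", 30)]
oldest = [("apple", 1), ("banana", 2), ("cherry", 3)]
print(compact([newest, oldest]))  # [('apple', 1), ('cherry', 30)]
```

A production compaction would stream-merge the already-sorted segments rather than materialize a dict; this sketch only shows the latest-value-wins and tombstone rules.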

Bloom Filters

When a key is absent, an LSM lookup may still have to check the memtable and then every SSTable before concluding that the key does not exist. Bloom filters cheaply rule out segments that definitely do not contain the key, avoiding those unnecessary disk reads.

They are space-efficient probabilistic structures:

  • If any required bit is 0, the key is definitely not present.
  • If all bits are 1, the key might be present (possible false positives).
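The two rules above can be demonstrated with a small Python sketch; the bit-array size `m`, the hash count `k`, and the use of salted SHA-256 are arbitrary choices for illustration:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # If any required bit is 0, the key was definitely never added.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("apple")
bf.add("banana")
print(bf.might_contain("apple"))   # True
print(bf.might_contain("durian"))  # almost certainly False; false positives are possible
```

An LSM engine keeps one filter per SSTable; a `might_contain` result of False lets the lookup skip that segment's disk read entirely.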

B-Tree vs LSM

B-trees remain the dominant index structure in most relational databases and many other production systems.

General trade-off:

  • LSM trees often win on write throughput and sequential I/O.
  • B-trees often provide more stable read latency and efficient range scans.

So the practical choice is workload-driven: write-heavy patterns often favor LSM; read-heavy patterns often favor B-trees.


Support ExplainThis

If you found this content helpful, please consider supporting our work with a one-time donation of whatever amount feels right to you through this Buy Me a Coffee page.

Creating in-depth technical content takes significant time. Your support helps us continue producing high-quality educational content accessible to everyone.
