DDIA Ch2 Reading Guide — Non-functional Requirements

When building software products, the capabilities a product must provide are classified as functional requirements. In a social platform, for example, letting users browse feeds, leave comments, and like posts are all functional requirements. But product quality is not determined by features alone. There are other qualities that users care about just as much.

Performance is a clear example. Many studies have shown that faster page loading in e-commerce can significantly improve user retention. In real-world usage, few people will stick with a platform that makes them wait five or ten seconds every time they open a page. Performance is a non-functional requirement, and so are usability, security, and other system qualities.

From our perspective, implementing features is the minimum bar for software engineers. If you want to grow into a senior engineer, “it works” is not enough. Designing well for non-functional requirements is a core part of the job.

Among many possible non-functional requirements, Chapter 2 focuses on four: performance, reliability, scalability, and maintainability. Let’s go through them step by step.

Performance

When people talk about software performance, they usually start with two metrics: latency and throughput.

Latency is the time between a request entering the system and a response being returned. Lower is better. If one API responds in 100 ms and another in 500 ms, the first has lower latency. Throughput refers to how many requests or how much data a system can process per second. Higher is generally better.

In theory, system designers want both low latency and high throughput. In practice, with limited resources, there is often a tradeoff. As request volume rises on a machine, the CPU may already be busy handling current work, so new requests wait longer before they are processed, increasing latency.
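The tradeoff above can be illustrated with a minimal sketch of a single worker serving requests in arrival order. The function name, the fixed 10 ms service time, and the arrival intervals are assumptions made for illustration; they do not come from the book.

```python
def simulate_fifo(arrival_interval_ms: float, service_ms: float, n_requests: int) -> list[float]:
    """Return per-request response times (waiting + service) for one worker."""
    response_times = []
    worker_free_at = 0.0  # time at which the worker finishes its current request
    for i in range(n_requests):
        arrival = i * arrival_interval_ms
        start = max(arrival, worker_free_at)  # wait if the worker is still busy
        finish = start + service_ms
        worker_free_at = finish
        response_times.append(finish - arrival)
    return response_times


# Light load: a request every 20 ms, each taking 10 ms. Nobody waits.
light = simulate_fifo(arrival_interval_ms=20, service_ms=10, n_requests=5)

# Heavy load: a request every 5 ms, each taking 10 ms. The queue keeps growing.
heavy = simulate_fifo(arrival_interval_ms=5, service_ms=10, n_requests=5)

print(light)
print(heavy)
```

Under light load every request takes only its service time. Under heavy load each request waits behind the ones before it, so response time climbs with every new arrival even though the work per request is unchanged.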

The chapter discusses throughput more in the scalability section. Here, we’ll focus first on latency.

Breaking Down Response Time

The book recommends decomposing response time into stages. After a request arrives, it experiences network delay, possibly queueing delay, then execution time. After processing completes, the response still needs network transfer time on the way back. The book does not go into every layer, but in real systems you can break this down further. For example, once the response reaches the client, you can continue into frontend rendering stages. In our article on the critical rendering path, we explain that frontend-side breakdown in more detail.

This decomposition is highly practical. A good approach is to measure timing at every important step in the flow. Then, when response time regresses, the team can quickly identify which segment is abnormal and narrow down the issue faster.
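One way to measure each step is a small timing helper wrapped around every stage. The sketch below is only an illustration: the `timed` helper, the stage names, and the `time.sleep` calls standing in for real work are all hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings[stage] = timings.get(stage, 0.0) + elapsed_ms


def handle_request() -> dict[str, float]:
    timings: dict[str, float] = {}
    with timed("queue_wait", timings):
        time.sleep(0.002)  # placeholder for time spent waiting in a queue
    with timed("execute", timings):
        time.sleep(0.005)  # placeholder for the actual processing
    with timed("serialize", timings):
        time.sleep(0.001)  # placeholder for building the response
    return timings


timings = handle_request()
slowest = max(timings, key=timings.get)
print(timings, "slowest stage:", slowest)
```

In a real service these per-stage numbers would usually be exported to a metrics or tracing system rather than printed, so a regression can be traced to the stage that changed.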

Why Median and Percentiles Beat Averages

For latency measurement, the book recommends avoiding averages as the primary metric (we made the same point in our article “What is SLO and how do you set one for your team?”).

The reason is simple: averages hide distribution. You might see a 200 ms average response time and assume the system is healthy, while in reality one group of requests is very fast and another group is painfully slow.

Median and percentiles make this visible. If p50 is 200 ms, half of requests complete within 200 ms and the other half take longer. If p95 is 1.5 seconds, 5% of requests exceed 1.5 seconds. In systems with very large user bases, 5% can still represent hundreds of thousands of users. Percentile-based metrics help teams protect the real user experience at scale.
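A small worked example shows how an average can hide a slow group of requests. The sample data is made up for illustration, and the `percentile` helper uses the nearest-rank method, which is one of several common percentile definitions.

```python
import math
import statistics


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 90 fast requests at 50 ms and 10 slow ones at 1550 ms.
latencies_ms = [50] * 90 + [1550] * 10

mean = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

print(f"mean={mean} ms, p50={p50} ms, p95={p95} ms")
```

The mean comes out at 200 ms, which looks healthy, yet no request actually took 200 ms: the median is 50 ms and p95 is 1550 ms. The percentiles expose the slow tail that the average smooths over.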

Beyond “don’t rely on averages,” the author also touches on SLOs in this section. If you want a deeper treatment, the dedicated article mentioned above covers the topic in more depth.

Reliability

For an application, reliability means users don’t run into unexpected failures such as features not working as intended or performance degrading to unusable levels.

But failures are inevitable in software. A perfectly failure-free system is unrealistic. From a systems perspective, what matters more is resilience: when something goes wrong, the system should tolerate faults and continue operating. Chaos engineering is a mature industry practice for this (one famous example is Netflix’s Chaos Monkey), where teams deliberately inject faults into production-like environments and verify the system keeps running.
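The idea of injecting faults and checking that the system still works can be sketched at a very small scale. This is not how Chaos Monkey operates (it works at the level of running infrastructure); the `FaultInjector` wrapper, the retry helper, and `fetch_profile` below are hypothetical names used only to illustrate the principle.

```python
class FaultInjector:
    """Wrap a callable and raise an error on chosen call numbers (illustrative helper)."""

    def __init__(self, func, fail_on_calls):
        self.func = func
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise ConnectionError(f"injected fault on call {self.calls}")
        return self.func(*args, **kwargs)


def call_with_retry(func, attempts: int = 3):
    """Retry on connection errors; re-raise the last error if every attempt fails."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as err:
            last_error = err
    raise last_error


def fetch_profile():
    # Stand-in for a call to a downstream dependency.
    return {"user": "alice"}


# Make the first two calls fail, then verify the caller still gets a result.
flaky = FaultInjector(fetch_profile, fail_on_calls={1, 2})
result = call_with_retry(flaky)
print(result, "after", flaky.calls, "calls")
```

The point of the exercise is the verification step: the faults are deliberate, and the check confirms that the tolerance mechanism (here, a simple retry) actually absorbs them.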

The book groups common failures into three categories: hardware faults, software errors, and human errors.

Hardware faults include things like disk failures, damaged undersea cables, or power-grid disruptions caused by extreme weather. Because software ultimately runs on hardware, these events can make services unavailable. A direct mitigation is redundancy: keep backup capacity ready. If one data center is impacted by a disaster, traffic can be shifted to another region. (For related ideas, see our article on blast radius and strategic containment.)
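Shifting traffic to another region can be sketched as picking the first healthy entry from a preference list. The region names and the health map are assumptions for illustration; a real setup would rely on health checks and routing at the DNS or load-balancer level.

```python
def pick_region(preferred_order: list[str], health: dict[str, bool]) -> str:
    """Return the first region in preference order that is currently healthy."""
    for region in preferred_order:
        if health.get(region, False):  # unknown regions are treated as unhealthy
            return region
    raise RuntimeError("no healthy region available")


regions = ["us-east", "eu-west", "ap-south"]  # hypothetical region names

# Normal operation: the primary region serves traffic.
normal = pick_region(regions, {"us-east": True, "eu-west": True, "ap-south": True})

# The primary data center is down: traffic shifts to the next healthy region.
failover = pick_region(regions, {"us-east": False, "eu-west": True, "ap-south": True})

print(normal, failover)
```

Redundancy only helps if the backup region has enough spare capacity to absorb the shifted traffic, which is why the text describes it as keeping backup capacity ready.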

Compared with hardware faults, software errors often have a broader blast radius. Hardware incidents are usually localized by geography, while software bugs can propagate globally. A recent example is the 2025 Cloudflare global incident, where a software issue caused worldwide impact.

Finally, human errors are failures introduced by operational mistakes. As discussed in our article on feature flags, major cloud outages have been triggered by configuration mistakes. Clear interfaces, strong testing, and production monitoring can significantly reduce the damage from this class of failures.
