What Is Chaos Engineering, and Why Is the Industry Embracing It?

When people hear "chaos engineering," the first things that usually come to mind are Chaos Monkey and the Simian Army, the open-source tools introduced by Netflix. Similar practices existed before Netflix named them, but Netflix was the company that popularized the word "chaos" in this context and turned chaos engineering into a mainstream discipline.

Netflix introduced Chaos Monkey internally in 2010. The name comes from a simple image: a wild monkey running through a data center and causing damage, such as unplugging servers or breaking disks. The reliability team's goal was straightforward: even if that monkey showed up, Netflix's service should remain available to end users.

Because Netflix heavily depended on AWS, early Chaos Monkey experiments focused on simulating AWS instance failures. But production incidents come from many failure modes, so Netflix expanded beyond Chaos Monkey and built the Simian Army. One example is Latency Monkey, which simulates increased latency in RESTful APIs to test whether the overall service can still hold up.
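The idea behind latency injection can be sketched in a few lines of Python. This is only an illustration, not Netflix's Latency Monkey: the decorator name inject_latency, its parameters, and the get_title handler are all hypothetical. It wraps a handler so that a chosen fraction of calls is delayed before the real work runs.

```python
import random
import time
from functools import wraps


def inject_latency(delay_s, probability=1.0, sleep=time.sleep, rng=random.random):
    """Wrap a callable so that a fraction of calls is delayed by delay_s seconds.

    sleep and rng are injectable so the behavior can be tested without waiting.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Delay only the sampled fraction of calls, then run the real handler.
            if rng() < probability:
                sleep(delay_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_s=0.05, probability=1.0)
def get_title(title_id):
    # Hypothetical handler standing in for a REST endpoint.
    return {"id": title_id, "name": "example"}
```

Callers of get_title now see roughly 50 ms of extra latency on every call, which is enough to check whether timeouts and fallbacks upstream behave as expected.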

After Chaos Monkey proved useful, Netflix pushed further with Chaos Kong. Chaos Monkey terminated individual instances; Chaos Kong simulated a full AWS region outage. The point was to ensure that if one region failed, the system could fail over smoothly to another region without interrupting service.

Before the term chaos engineering was standardized, many companies were already doing similar drills. At Amazon, the well-known "Master of Disaster" Jesse Robbins promoted GameDay exercises that simulated events like data-center disasters, broken dependencies, and network partitions to test system resilience.

Why chaos engineering matters

Once you understand this background, the value for software teams becomes clearer.

Chaos engineering is like a vaccine for software systems. Vaccines intentionally introduce a controlled threat to build immunity; chaos engineering intentionally introduces failures to build resilience. In a controlled environment, teams can terminate instances, inject configuration mistakes, or add latency to verify whether the current system can tolerate stress.

In real systems, incidents can originate anywhere: data centers, network paths, third-party dependencies, or configuration errors. Chaos engineering gives teams evidence-based confidence that they can handle those surprises.

For example, if a system keeps serving users through a controlled disruption, such as a simulated regional outage, the team has concrete evidence that its architecture tolerates that failure mode.

There is also an operational advantage: chaos experiments are usually planned and executed during working hours. If something does go wrong, the reliability team is online and can respond immediately, instead of waking up on-call engineers in the middle of the night.

For larger products where production incidents are expensive, this confidence is valuable enough that companies like Netflix invest in dedicated Chaos Engineer roles.

How chaos engineering helped Netflix through large-scale restarts

Netflix shared a concrete case at Cassandra Summit. AWS informed Netflix about a security issue that required restarting all EC2 instances. That meant every Cassandra node Netflix ran on EC2 had to restart.

For reliability teams, this type of mass restart is high pressure. You cannot assume every instance will recover cleanly. If some fail to come back, the team must detect that quickly and reroute traffic to healthy nodes.

This sounds simple on paper, but if the workflow has never been rehearsed end to end, the detection and traffic-shifting path may break. Before the real restart day, Netflix used chaos engineering to simulate failures in a smaller subset of Cassandra nodes. They verified that failed nodes were detected, replacement instances were launched, and requests were redirected correctly.
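A rehearsal of this kind can be modeled as a toy simulation. The sketch below is not Netflix's tooling; the Cluster and Node classes and the rehearse function are hypothetical names used only to show the three checks described above: failed nodes are detected, requests avoid them, and replacements are launched.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Node:
    name: str
    healthy: bool = True


class Cluster:
    """Toy cluster model; real detection and replacement are far more involved."""

    def __init__(self, size):
        self._ids = count(1)
        self.nodes = [self._new_node() for _ in range(size)]

    def _new_node(self):
        return Node(name=f"node-{next(self._ids)}")

    def detect_failed(self):
        return [n for n in self.nodes if not n.healthy]

    def replace_failed(self):
        # Remove each failed node and launch a fresh one in its place.
        failed = self.detect_failed()
        for node in failed:
            self.nodes.remove(node)
            self.nodes.append(self._new_node())
        return failed

    def route(self, request_id):
        # Send the request to a healthy node only.
        healthy = [n for n in self.nodes if n.healthy]
        if not healthy:
            raise RuntimeError("no healthy nodes available")
        return healthy[request_id % len(healthy)].name


def rehearse(cluster, victims):
    """Fail the first `victims` nodes, then check detection, routing, replacement."""
    for node in cluster.nodes[:victims]:
        node.healthy = False
    detected = cluster.detect_failed()
    failed_names = {n.name for n in detected}
    served_by = {cluster.route(i) for i in range(20)}
    replaced = cluster.replace_failed()
    return {
        "detected": len(detected),
        "routed_to_failed": bool(served_by & failed_names),
        "replaced": len(replaced),
        "size_after": len(cluster.nodes),
    }
```

Running rehearse(Cluster(6), victims=2) exercises the whole path on a small subset, which mirrors the idea of practicing on a few nodes before the real restart day.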

The exercise exposed process issues, which they fixed before the real event. On the actual restart day, 218 nodes were restarted globally, and 22 had problems. Because the response flow had already been practiced, the team handled those failed nodes smoothly.

That case captures the practical value of chaos engineering: by testing failure in advance, engineers can approach high-risk production events with much less anxiety.

How to run chaos engineering

Netflix's engineering team outlines a four-step approach in the book Chaos Engineering.

Step one is defining a steady state with measurable metrics. Netflix examples include streams per second, upstream API latency and error rates, and system throughput. This is conceptually similar to defining SLOs.
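One way to make a steady state concrete is to write it down as explicit thresholds. The sketch below is an assumption about how that might look, not a format from the book: the SteadyState class, its field names, and the numbers in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds that describe 'normal' behavior; values are illustrative."""
    min_streams_per_second: float
    max_p99_latency_ms: float
    max_error_rate: float

    def violations(self, sample):
        # Return the names of the metrics in `sample` that fall outside the thresholds.
        problems = []
        if sample["streams_per_second"] < self.min_streams_per_second:
            problems.append("streams_per_second")
        if sample["p99_latency_ms"] > self.max_p99_latency_ms:
            problems.append("p99_latency_ms")
        if sample["error_rate"] > self.max_error_rate:
            problems.append("error_rate")
        return problems
```

For instance, SteadyState(min_streams_per_second=1000, max_p99_latency_ms=250, max_error_rate=0.01) returns an empty list for a healthy sample and names the offending metrics otherwise, which gives later steps something measurable to compare against.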

When deciding where to start, ask two questions: "Which area is most likely to fail?" and "Which area would hurt us most if it failed?" A critical review of your own system helps identify weak points and prioritize experiments.
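The two questions can be turned into a rough ranking. The scoring rule below (likelihood multiplied by impact, each on a 1 to 5 scale) is an assumption for illustration rather than something the book prescribes, and the component names and scores in the usage note are made up.

```python
def rank_targets(candidates):
    """Order experiment candidates by likelihood * impact, highest first."""
    return sorted(
        candidates,
        key=lambda c: c["likelihood"] * c["impact"],
        reverse=True,
    )
```

Given candidates such as {"name": "primary-cache", "likelihood": 4, "impact": 3} and {"name": "auth-service", "likelihood": 1, "impact": 5}, the cache would be tested first because its combined score is higher. The scores themselves should come from the team's own review of past incidents and architecture.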

Step two is forming a hypothesis, typically a null hypothesis: injecting failure should not materially affect steady-state behavior. For example, "Injecting faults into the primary cache will not increase end-user latency."

Step three is injecting failures to challenge that hypothesis. You might increase cache-miss traffic or overload the primary cache with high request volume.
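As a minimal sketch of the cache-miss case, the wrapper below forces a configurable fraction of lookups to miss so that the fallback path to the origin gets exercised. FaultyCache and read_through are hypothetical names for this illustration, not part of any specific caching library.

```python
import random


class FaultyCache:
    """In-memory cache that turns a fraction of would-be hits into forced misses."""

    def __init__(self, miss_rate=0.0, rng=random.random):
        self._data = {}
        self.miss_rate = miss_rate
        self._rng = rng
        self.forced_misses = 0

    def get(self, key):
        # Only entries that are actually cached can become forced misses.
        if key in self._data and self._rng() < self.miss_rate:
            self.forced_misses += 1
            return None
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def read_through(cache, key, load_from_origin):
    """Return the cached value, falling back to the origin on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_origin(key)
        cache.set(key, value)
    return value
```

Raising miss_rate step by step while watching origin load and end-user latency shows how much cache degradation the system absorbs before the steady state is affected.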

Step four is evaluation: did the hypothesis hold? If latency remains stable despite multiple fault injections, you have evidence that the system is robust for that scenario.

If the hypothesis is broken, for example latency degrades, stop the experiment and run a post-incident review. The goal is not to "win" the experiment, but to discover weaknesses early and improve the system.
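The evaluation step, including the rule to stop as soon as the hypothesis breaks, might be sketched as follows. The run_experiment function, its callback parameters, and the 10% tolerance are assumptions chosen for this example; a real experiment would compare against the team's own steady-state thresholds.

```python
from statistics import mean


def run_experiment(measure_latency_ms, inject_fault, remove_fault,
                   samples=5, tolerance=0.10):
    """Compare latency under fault against a baseline and abort on the first breach."""
    # Baseline: average latency before any fault is injected.
    baseline = mean(measure_latency_ms() for _ in range(samples))
    limit = baseline * (1 + tolerance)

    observed = []
    held = True
    inject_fault()
    try:
        for _ in range(samples):
            observed.append(measure_latency_ms())
            if observed[-1] > limit:
                # Hypothesis broken: stop measuring instead of prolonging the impact.
                held = False
                break
    finally:
        # Always remove the fault, even if a measurement raises.
        remove_fault()

    return {
        "hypothesis_held": held,
        "aborted": not held,
        "baseline_ms": baseline,
        "observed_ms": observed,
    }
```

The finally block is the important design choice here: the fault is removed whether the hypothesis holds, breaks, or the measurement itself fails, so the experiment cannot leave the system in a degraded state.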
