How Can Teams Handle Incidents Effectively?

In earlier articles, we discussed how software engineers should handle on-call work, and how to run effective incident reviews. But before those two topics, engineering teams need to answer an even more basic question: how can a team make sure it is ready to face and handle incidents effectively?

Google's SRE book breaks incident response management into three stages: Preparation, Response, and Learning. Mapped to the earlier articles, on-call is about response, and incident review is about learning. In this article, we will focus on preparation.

Build the Foundation for Effective Incident Response

Incident response cannot wait until an incident actually happens. If the team has prepared well in advance, it can respond more calmly when something goes wrong, and it can resolve the incident faster and more effectively. That is why teams should build the foundation for incident response before they need it.

Reduce Noise During Incident Response

As shown in Google's SRE book, the first step in the Response phase is detecting the incident. Ideally, incidents should be detected automatically by the system. If the team only learns about a problem from end users, it is usually too late. To detect incidents effectively, teams need solid monitoring in place.

In practice, one of the most common monitoring problems is alert storms, which we discussed in our article on how software engineers should approach monitoring. Imagine an on-call engineer constantly receiving alerts, only to discover after investigation that each one is a false alarm. This kind of noise does not just waste the on-call engineer's time. It also causes alert fatigue: when the next alert comes in, people may assume it is another false alarm and fail to treat it seriously.

So if we want on-call engineers to focus on solving the real problem during an incident, one of the most important things is to reduce noise.

Reducing alert noise means not turning everything into a paging alert. In practice, there are two useful angles. First, prioritize paging alerts around SLOs. If you are not familiar with SLOs, you can revisit "What Is an SLO? Why Does It Matter for Software Teams?" Signals that do not directly affect SLOs can live in dashboards, low-priority notifications, or supporting debugging data instead of being escalated into urgent alerts. Second, define clear priorities such as P1 through P4, and make it explicit which levels require on-call action. For example, P4 issues should usually not page the on-call engineer.
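The two angles above can be sketched as a small routing rule. This is only an illustration under stated assumptions: the Alert fields, the choice that P1 and P2 are the paging levels, and the three destinations are hypothetical, not taken from any specific alerting tool.

```python
from dataclasses import dataclass

# Assumption for this sketch: only P1 and P2 are allowed to page on-call.
PAGING_PRIORITIES = {"P1", "P2"}


@dataclass
class Alert:
    name: str
    priority: str      # "P1" through "P4"
    affects_slo: bool  # True if the signal is tied directly to an SLO


def route(alert: Alert) -> str:
    """Return where an alert should go: 'page', 'ticket', or 'dashboard'."""
    # Page only when the alert is both SLO-affecting and high priority.
    if alert.affects_slo and alert.priority in PAGING_PRIORITIES:
        return "page"
    # Everything else above P4 becomes a low-priority notification.
    if alert.priority in {"P1", "P2", "P3"}:
        return "ticket"
    # P4 signals stay on dashboards as supporting debugging data.
    return "dashboard"
```

With a rule like this, a P4 disk-usage warning never reaches the pager, while a P1 checkout error rate tied to an SLO does.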

Whenever the team receives a false alert, the on-call engineer should revisit it afterward. They might silence the alert, adjust the threshold, or change the conditions so the same false alarm does not happen again. When defining alerts, teams should also consider how alerts relate to one another, so similar alerts do not fire at the same time and create unnecessary noise.
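One way to keep similar alerts from firing at the same time is to group them by a shared key and suppress repeats within a time window. The sketch below assumes each alert is a (timestamp, service, symptom) tuple and uses a hypothetical 5-minute window. Real alerting systems usually offer their own grouping settings, so treat this as a picture of the idea rather than a recommended implementation.

```python
from typing import Dict, List, Tuple

# Assumed grouping window for this sketch: 5 minutes.
GROUP_WINDOW_SECONDS = 300

AlertEvent = Tuple[float, str, str]  # (timestamp_seconds, service, symptom)


def group_alerts(alerts: List[AlertEvent]) -> List[AlertEvent]:
    """Keep only the first alert per (service, symptom) within each window."""
    last_sent: Dict[Tuple[str, str], float] = {}
    kept: List[AlertEvent] = []
    for ts, service, symptom in sorted(alerts):
        key = (service, symptom)
        # Suppress a repeat if the same key was already sent recently.
        if key in last_sent and ts - last_sent[key] < GROUP_WINDOW_SECONDS:
            continue
        last_sent[key] = ts
        kept.append((ts, service, symptom))
    return kept
```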

Build Foundations That Speed Up Mitigation

Beyond reducing alert noise, teams also need foundations that make investigation and mitigation faster. These preparations may not feel urgent day to day, but when an incident happens, they can help the team stop the bleeding quickly.

One of the most important preparations is a runbook. A runbook should clearly document what to do for each type of alert. For example, when alert X fires, the suggested first place to check is Y. It should also define severity levels clearly: which incidents are P1 and need immediate action, which are P4 and can be noted without being handled immediately, and so on.

The team also needs to keep the runbook up to date. If an on-call engineer encounters an incident that cannot be handled by following the existing runbook, they should update the runbook after the incident is resolved. That way, the next person facing the same issue can mitigate it faster.
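A runbook entry like the one described above could be kept as structured data, so it is easy to look up by alert name and easy to spot gaps. The field names, the alert name, and the date below are all hypothetical. Many teams keep runbooks as wiki pages instead, and that works just as well.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RunbookEntry:
    alert: str
    severity: str            # for example "P1" or "P4"
    first_checks: List[str]  # suggested first places to look
    last_updated: str        # ISO date of the last revision


# Hypothetical example entry; the content is illustrative only.
RUNBOOK: Dict[str, RunbookEntry] = {
    "checkout_error_rate_high": RunbookEntry(
        alert="checkout_error_rate_high",
        severity="P1",
        first_checks=[
            "Check the most recent deployment",
            "Inspect the order API error logs",
        ],
        last_updated="2024-01-15",
    ),
}


def lookup(alert_name: str) -> Optional[RunbookEntry]:
    """Return the runbook entry for an alert, or None if none exists yet."""
    return RUNBOOK.get(alert_name)
```

A None result is itself useful: it marks a gap that the on-call engineer should fill in once the incident is resolved.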

Good software engineering practices also help with mitigation. For example, deployments should avoid bundling too many changes together. It is better to deploy in small batches, as covered in our article on how to manage the production release process. When a team deploys in small batches, it is easier to identify which deployment caused the incident. Rolling back a small deployment also has a smaller blast radius and is less likely to affect unrelated functionality.

Feature flags are also useful during deployment. If you are not familiar with them, see "What Are Feature Flags? Why Should Teams Use Them?" Feature flags make it easier to isolate functionality. If feature X causes a problem, the team can disable only feature X without affecting feature Y. This makes the decision much simpler, because the team does not need to ask product stakeholders whether it is acceptable to turn off feature Y as collateral damage.
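The isolation described above can be pictured with a minimal in-memory flag store. Everything here is an assumption made for illustration, including the flag names and the render_checkout function. A real setup would typically read flags from a flag service or a config store rather than a module-level dictionary.

```python
from typing import Dict, List

# Hypothetical in-memory flag store; both features start enabled.
flags: Dict[str, bool] = {"feature_x": True, "feature_y": True}


def is_enabled(name: str) -> bool:
    """Unknown flags default to off, which is the safer failure mode."""
    return flags.get(name, False)


def disable(name: str) -> None:
    """Kill switch: turn off one feature without touching the others."""
    flags[name] = False


def render_checkout() -> List[str]:
    """Return the sections shown on a hypothetical checkout page."""
    sections = ["cart"]
    if is_enabled("feature_x"):
        sections.append("feature_x")
    if is_enabled("feature_y"):
        sections.append("feature_y")
    return sections
```

Calling disable("feature_x") during an incident removes only that section. feature_y keeps working, and no redeploy is needed.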

Make Responsibilities Clear

When an incident happens, clear ownership helps the team get oriented faster. It prevents everyone from panicking, talking over one another, or getting stuck because nobody knows who should do what. During an incident, there should be at least three core roles: Incident Commander, Communications Lead, and Operations Lead.

The Incident Commander owns the overall situation and knows the current state of the response. This role does not have to be filled by the most senior person. It should be filled by someone who can understand the whole picture, coordinate information, assign work, and drive decisions. In some teams, this might be a technical lead. In more mature incident management processes, the role is usually assigned based on the incident context, relevant experience, and availability.

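The Incident Commander brings in the people who can help resolve the incident and assigns tasks so the team can start investigating. If someone gets blocked, the Incident Commander should help unblock them. The Operations Lead, in turn, does the hands-on work: investigating the system and applying mitigations under the Incident Commander's coordination.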

The Communications Lead records the current state of the incident, keeps stakeholders updated, and serves as the communication point for external parties. With a Communications Lead in place, the Incident Commander can focus on coordination without worrying about missing details when updating stakeholders. With today's AI tools, teams can also use AI to assist with notes and updates. For example, during an incident call, the team can generate a transcript and have AI summarize the latest status every 5 to 10 minutes, then automatically push updates to the relevant communication channels.

For example, suppose an e-commerce checkout page suddenly starts showing a large number of order failure messages. The Incident Commander would first scan the relevant information, then quickly bring in the right people: frontend engineers responsible for the page, backend engineers, and the relevant customer support owner. Next, the Incident Commander would assign investigation tracks, such as asking the frontend engineer to check the latest deployment, asking the backend engineer to inspect the relevant APIs and logs, and asking customer support to draft user-facing communication based on the current information.

When bringing people into the response, make sure the investigation includes enough perspectives. Incidents usually involve the broader system, and each person on the team only understands part of that system. The team often needs to combine knowledge from multiple people to understand what is actually happening. For example, if the database is overloaded, the database team may be able to identify that the query pattern changed, but not why it changed. In that case, the team needs to talk to the application team that wrote the query.

Multiple perspectives also help prevent fixation. People naturally tend to look at a problem from a specific angle. If that angle is not the one needed to solve the problem, the team may struggle to find the real cause.

That is why teams should maintain a contact list, ideally inside the runbook. For example, if a certain page breaks, the list should say who should be pulled in. The list should be more granular than just frontend and backend. It can include backend API owners, database owners, algorithm owners, and so on. This does not mean every incident requires pulling everyone in immediately. The list can be tiered: who must be pulled in first, and who should be brought in if the team still cannot find the cause after a certain amount of time.
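A tiered contact list like this can be sketched as a mapping from a component to escalation tiers, where each tier becomes active after a certain amount of time. The component name, owner labels, and the 30 and 60 minute thresholds are hypothetical placeholders.

```python
from typing import Dict, List, Tuple

# Each tier is (minutes_since_incident_start, people_to_pull_in).
# All names and thresholds are made up for illustration.
CONTACTS: Dict[str, List[Tuple[int, List[str]]]] = {
    "checkout_page": [
        (0, ["frontend-checkout-owner", "backend-orders-api-owner"]),
        (30, ["database-owner"]),
        (60, ["algorithm-owner"]),
    ],
}


def who_to_pull_in(component: str, minutes_elapsed: int) -> List[str]:
    """Return everyone who should be involved by this point in the incident."""
    people: List[str] = []
    for threshold, names in CONTACTS.get(component, []):
        # A tier applies once the incident has lasted at least this long.
        if minutes_elapsed >= threshold:
            people.extend(names)
    return people
```

At the start of a checkout incident, only the first tier is pulled in. If the cause is still unknown after 30 minutes, the database owner joins as well.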

Create Psychological Safety for the Team

Beyond process, roles, and investigation methods, one more factor matters a lot: the team needs enough psychological safety for people to investigate and resolve incidents effectively.

In Thinking, Fast and Slow, Nobel laureate Daniel Kahneman describes two modes of human thinking. One is fast, more automatic thinking. The other is slower thinking that involves more reasoning. Fast thinking reduces cognitive load. After all, if every tiny decision on the way from waking up to brushing your teeth required deep thought, you would be overwhelmed before doing anything meaningful. Slow thinking, on the other hand, helps us reason through details and make higher-quality decisions.

During incidents, teams should ideally use slow thinking. They should avoid rushed, random decisions and focus on finding the core issue. To do that, the on-call engineer needs to stay calm. From the team's perspective, this means reducing the pressure the on-call engineer faces during an incident.

Google's SRE practices also emphasize that team leaders should make sure on-call members have enough psychological safety. This safety helps on-call engineers avoid reacting anxiously, which reduces rushed fast thinking.

According to Google's team, the most effective way to build this kind of safety is to make sure on-call members clearly understand three things:

Escalation paths: if they cannot solve the problem alone, who can they ask for help?
A clearly defined incident response process: if on-call engineers know where to start, they will feel less panic and anxiety.
Blameless incident reviews: we discussed this in detail in how to run effective incident reviews, which is worth revisiting.

In practice, the second point usually requires regular drills. Drills help team members become more familiar with the incident response process. Otherwise, even if people know the process in theory, they may still panic when they have to execute it for real. In the next section, we will discuss how teams can use drills to deepen their understanding of incident response.
