How Should Software Engineers Conduct Incident Reviews?

September 26, 2024


Think back to your last significant outage or incident. What happened after you fixed the immediate problem? Did your team just move on, relieved the crisis was over? Or did you take time to understand what went wrong and why?

This question matters more than you might think. Previously, we've discussed Application Monitoring Best Practices for Software Engineers and On-Call Best Practices for Software Engineers. Now let's explore what happens after incidents: the incident review process.

For minor issues with minimal user impact, you might skip a formal review. But when incidents cause meaningful disruption, a structured incident review becomes essential. So here's the key question: what's the real purpose of these reviews?

What Are Incident Reviews Really About?

Before diving into the mechanics, let's clarify what incident reviews actually accomplish. You might hear different names in the industry: incident review, sev review (where "sev" refers to severity levels we discussed in Application Monitoring Best Practices for Software Engineers), or postmortem (borrowed from medical practice, where examining what went wrong after the fact reveals critical insights).

But regardless of the name, ask yourself: when an incident happens, what should your team's primary focus be?

The answer might surprise you. It's not about finding who's responsible. It's not about assigning blame. The real purpose is much more constructive.

Think about it this way: incidents are going to happen. No engineer is perfect, no system is flawless, and no process catches every edge case. Given this reality, what's the most valuable thing you can do after an incident occurs? The answer is learning. Specifically, learning how to prevent the same incident from happening again.

The Learning Mindset

When you approach incident reviews with a learning mindset, something interesting happens. Instead of people getting defensive or trying to cover their tracks, they become collaborative problem-solvers. Everyone involved starts asking better questions:

  • What conditions led to this incident?
  • What warning signs did we miss?
  • What processes or tools could have caught this earlier?
  • How can we make our systems more resilient?

This collaborative learning becomes even more valuable when it's documented and shared. Many well-established companies maintain detailed incident review records not just for the immediate team, but for other teams and departments to learn from as well.

This cross-team sharing matters because of the interconnected nature of modern software systems. An incident in one team's service often reveals architectural weaknesses, monitoring blind spots, or deployment practices that affect other teams too. When teams share their learnings, they're essentially building collective organizational resilience. Instead of having each team learn the same lessons separately through their own painful incidents, cross-team sharing allows the entire organization to benefit from each incident's insights.

Why Blameless Reviews Matter

While this collaborative learning approach sounds ideal, achieving it requires careful attention to how incident reviews are conducted. Unfortunately, many teams undermine their own learning potential by turning incident reviews into investigations to find the "guilty party." But consider this: if you knew that making a mistake would result in public blame and potential career damage, how likely would you be to take innovative risks? How transparent would you be about what actually happened?

This is why the concept of "blameless incident reviews" has become a cornerstone of healthy engineering cultures.

Let me share a powerful example. In 2021, after a major global outage at Meta, Engineering VP Vijaye Rau shared insights about their blameless approach. He emphasized that asking "who caused this incident?" at review time doesn't help solve the problem. Worse, it discourages the kind of innovative risk-taking that drives technical progress.

But here's an even more striking example from Taiwan's software history. In what's known as the "Trend 594 incident," an engineer at Trend Micro released code that wasn't fully tested, causing widespread customer system failures and resulting in a $5.4 billion market cap loss.

Faced with such massive financial impact, most leaders would be tempted to find someone to blame. But CEO Eva Chen (Chen Yi-hua) chose a different path. She later shared: "I knew if I asked 'Who wrote this? Why wasn't it tested properly before release?' the company would be finished right then and there." Instead of seeking blame, she focused the organization on innovation and learning.

The result? This incident led Trend Micro to rethink their architecture fundamentally, pioneering cloud-based security solutions that gave them a 3-4 year competitive advantage in the market.

Consider the cultural implications here. When teams fear blame, destructive behaviors emerge: people hide problems, delay reporting issues, and avoid taking the kinds of calculated risks that drive innovation. But when teams focus on learning from failures, they become stronger and more resilient.

Building Your Incident Review Practice

As an engineer, how can you contribute to effective incident reviews? Start by asking yourself these questions:

Before the review:

  • What timeline of events led to this incident?
  • What systems and processes were involved?
  • What was the impact on users and the business?

During the review:

  • What can we learn from this incident?
  • What preventive measures could we implement?
  • How can we improve our detection and response times?
  • What would we do differently if this happened again?

After the review:

  • What specific action items came out of this discussion?
  • Who owns each improvement initiative?
  • How will we track progress on these improvements?
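The after-the-review questions boil down to tracking who owns which improvement and whether it's done. As a minimal sketch of how a team might structure that record, here is one possible shape in Python; the field names, incident details, and owner names are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    """One concrete improvement with a single accountable owner."""
    description: str
    owner: str
    due: date
    done: bool = False


@dataclass
class IncidentReview:
    """Minimal record mirroring the before/during/after questions."""
    title: str
    timeline: list[str]        # before: what events led to the incident?
    impact: str                # before: effect on users and the business
    learnings: list[str]       # during: what did we learn?
    action_items: list[ActionItem] = field(default_factory=list)  # after

    def open_items(self) -> list[ActionItem]:
        """Track progress: which improvements are still outstanding?"""
        return [item for item in self.action_items if not item.done]


# Hypothetical example review — all details below are made up.
review = IncidentReview(
    title="Checkout latency spike",
    timeline=["14:02 deploy", "14:10 p99 latency alert", "14:25 rollback"],
    impact="A portion of checkouts timed out for roughly 20 minutes",
    learnings=["Canary stage was skipped under deadline pressure"],
    action_items=[
        ActionItem("Make canary stage mandatory in CI", owner="deploy-team",
                   due=date(2024, 10, 1)),
    ],
)
print(len(review.open_items()))  # → 1
```

The point of writing it down in a structured way, whatever the format, is that "who owns each improvement" and "how will we track progress" stop being rhetorical questions and become fields someone has to fill in.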

In addition, as you develop in your engineering career, strive to build a culture where incidents become learning opportunities rather than blame sessions. This means creating an environment where:

  • Engineers feel safe reporting issues quickly without fear of retribution
  • Teams focus on systemic improvements rather than individual mistakes
  • Failure is treated as valuable data for building more resilient systems
  • Innovation is encouraged, even when it sometimes leads to problems
  • Knowledge sharing across teams is the norm, not the exception

The truth is that you don't need to be in a leadership position to influence this culture. Every engineer can contribute by modeling blameless behavior, asking constructive questions in reviews, and sharing learnings with colleagues.

Remember: the next incident will happen. The question isn't whether you'll face another outage or bug, but whether your team will be better prepared because of what you learned from the last one. Make your incident reviews count.


Support ExplainThis

If you found this content helpful, please consider supporting our work with a one-time donation of whatever amount feels right to you through this Buy Me a Coffee page, or share the article with your friends to help us reach more readers.

Creating in-depth technical content takes significant time. Your support helps us continue producing high-quality educational content accessible to everyone.
