On-Call Best Practices for Software Engineers

August 28, 2024


In our previous article about Application Monitoring Best Practices for Software Engineers, we discussed how monitoring alerts trigger responses from on-call team members. But who exactly are these on-call team members, and how does the whole system work? This article covers everything you need to know about effective on-call practices, from understanding your role to handling incidents like a pro.

Who Should Handle On-Call Duties?

In the early days of software development, development and operations were completely separate worlds. Developers would build features and toss them over the wall to operations teams for deployment and maintenance. If something broke at 3 AM, that was the ops team's problem, not the developer's.

This separation seemed logical at first, but it created serious problems in practice. Communication gaps between teams made scaling difficult, and the lack of ownership meant developers gave little thought to production stability when writing code. When application issues occurred, operations teams often had to pull in developers anyway, since they were the only ones who truly understood the code.

The "You Build It, You Own It" Approach

The industry solution was to involve developers directly in operations. This follows the principle "You build it, you own it" – if you write the code, you're responsible for its entire lifecycle, not just the fun development part.

This shift changed everything. When developers know they'll be the ones getting woken up at 3 AM for bugs they create, they naturally write more careful, stable code. Suddenly, error handling isn't an afterthought – it's a survival skill.

Different companies have implemented this philosophy in various ways. Google keeps SRE teams focused only on the most critical and stable services like Ads, Gmail, and Search, while development teams handle everything else. Airbnb uses a collaborative model where developers handle on-call duties while SREs help implement operational best practices.

Datadog goes for shared responsibility with both SRE and developers participating in every rotation. Amazon takes the most extreme approach, following pure "you build it, you own it" where developers handle everything from development to operations.

How On-Call Rotation Actually Works

Now that we understand why developers share on-call duties, let's dive into how it actually works day-to-day. The system is designed to ensure someone is always available to handle incidents while preventing any one person from burning out.

Most teams use weekly rotations with two key roles: a primary on-call person who gets the first alert when issues occur, and a secondary on-call person who serves as backup. The frequency depends on team size – on a team of 12 engineers, each person is primary only once every 12 weeks, which is quite manageable.
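To make the mechanics concrete, here's a quick sketch (in Python, with made-up names) of how a weekly primary/secondary rotation can be generated. In practice a tool like PagerDuty or Opsgenie manages the schedule for you – this just shows the shape of it.

```python
from datetime import date, timedelta

# Hypothetical roster -- swap in your own team.
TEAM = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"]

def weekly_rotation(start: date, weeks: int):
    """Yield (week_start, primary, secondary) tuples.

    The secondary is simply the next person in the roster, so everyone
    serves as backup in a different week than their primary shift.
    """
    for i in range(weeks):
        primary = TEAM[i % len(TEAM)]
        secondary = TEAM[(i + 1) % len(TEAM)]
        yield start + timedelta(weeks=i), primary, secondary

if __name__ == "__main__":
    for week_start, primary, secondary in weekly_rotation(date(2024, 9, 2), 6):
        print(f"{week_start}: primary={primary}, secondary={secondary}")
```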

How Escalation Works

The escalation process should be simple and effective. When an incident occurs, the primary on-call person gets the first alert. If they don't respond within the timeout period (usually 5-15 minutes), the secondary gets alerted. Still no response? The alerts keep climbing: team lead, then department manager, and so on.
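Here's a rough sketch of that escalation chain expressed as a simple policy loop. The contacts, timeouts, and notification functions are placeholders – a real paging service evaluates this logic for you – but it captures the idea of walking up the chain until someone acknowledges.

```python
import time

# Hypothetical escalation chain: (role, contact, timeout in minutes).
ESCALATION_CHAIN = [
    ("primary", "alice@example.com", 10),
    ("secondary", "bob@example.com", 10),
    ("team_lead", "carol@example.com", 15),
    ("department_manager", "dave@example.com", 15),
]

def notify(contact: str, incident_id: str) -> None:
    # Placeholder: send a page/SMS/push via your alerting provider.
    print(f"Paging {contact} about incident {incident_id}")

def acknowledged(incident_id: str) -> bool:
    # Placeholder: ask the alerting system whether someone hit "Ack".
    return False

def escalate(incident_id: str) -> None:
    """Walk the chain until somebody acknowledges the incident."""
    for role, contact, timeout_minutes in ESCALATION_CHAIN:
        notify(contact, incident_id)
        deadline = time.monotonic() + timeout_minutes * 60
        while time.monotonic() < deadline:
            if acknowledged(incident_id):
                print(f"{role} acknowledged {incident_id}")
                return
            time.sleep(30)  # poll every 30 seconds
    print(f"Nobody acknowledged {incident_id}; restarting the chain")
```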

This system works because nobody wants to wake up their boss at 2 AM. The social pressure to respond quickly is real and effective. It also ensures that no incident falls through the cracks – someone will always be reached eventually.

Designing Sustainable Rotations

The key to sustainable on-call is preventing burnout through smart rotation design. This means several things working together:

  • Reasonable frequency so people don't get exhausted
  • Clear expectations so everyone knows their responsibilities
  • Proper documentation with runbooks and procedures
  • Team support through secondary on-call and escalation processes

During job interviews, smart candidates ask about on-call frequency and team practices. This isn't being picky – it's being realistic about what you're signing up for.

When the Pager Goes Off: Your Action Plan

So the moment arrives: you're on-call and your phone starts buzzing with an alert. What now? Having a clear action plan prevents panic and ensures you handle incidents professionally.

Start immediately by clicking the "Ack" (Acknowledge) button. This tells everyone you've received the alert and are working on it, stopping the escalation process. It's a simple action, but it's crucial – it shows you're on top of the situation.
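In practice you'll usually just tap "Ack" in your paging app, but under the hood it's a single API call that stops the escalation clock. Here's a hypothetical sketch – the endpoint, token, and payload are made up, so check your provider's docs for the real details.

```python
import requests

# Hypothetical values -- substitute your paging provider's real endpoint,
# credentials, and incident ID.
PAGING_API = "https://paging.example.com/api/incidents"  # hypothetical URL
API_TOKEN = "YOUR_API_TOKEN"

def acknowledge(incident_id: str) -> None:
    """Mark an incident as acknowledged so escalation stops."""
    response = requests.put(
        f"{PAGING_API}/{incident_id}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"status": "acknowledged"},
        timeout=10,
    )
    response.raise_for_status()
    print(f"Incident {incident_id} acknowledged")
```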

Assess First, Then Communicate

Your next job is detective work. Is this a real incident or a false positive? Sometimes monitoring systems cry wolf, so investigate quickly to confirm whether there's actually a problem. Look at related metrics, check recent deployments, and see if users are actually affected.
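Here's a rough sketch of that first triage pass expressed as code. The metrics and deploy lookups are placeholders – in reality you'll be eyeballing dashboards – but the logic is the same: compare against a baseline and check what changed recently.

```python
def recent_deploys(minutes: int = 60) -> list[str]:
    # Placeholder: query your deploy tracker or CI system for recent releases.
    return ["checkout-service v1.42.0 deployed 12 minutes ago"]

def error_rate(window_minutes: int = 5) -> float:
    # Placeholder: query your metrics backend (Prometheus, Datadog, ...).
    return 0.07  # 7% of requests failing in the last window

def triage(baseline_error_rate: float = 0.01) -> None:
    """Quick sanity check: is this a real incident or a noisy alert?"""
    rate = error_rate()
    if rate <= baseline_error_rate * 2:
        print(f"Error rate {rate:.1%} is near baseline -- likely a false positive")
        return
    print(f"Error rate {rate:.1%} is well above baseline -- treat as real")
    for deploy in recent_deploys():
        print(f"Suspicious recent change: {deploy}")

if __name__ == "__main__":
    triage()
```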

Once you've confirmed it's real, communication becomes critical. Most companies use a two-channel approach during incidents. There's a broad team channel where you post status updates to keep everyone informed. Then there's an incident-specific channel for the technical discussion among people actively working on the problem.
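As a concrete example, here's a minimal sketch of the two-channel pattern using Slack's Python SDK. The channel names, token, and messages are placeholders – the point is simply to keep broad status updates separate from the technical back-and-forth.

```python
from slack_sdk import WebClient  # pip install slack_sdk

client = WebClient(token="YOUR_SLACK_BOT_TOKEN")  # placeholder token

def post_status_update(summary: str) -> None:
    """Broad team channel: short, non-technical status updates for everyone."""
    client.chat_postMessage(channel="#team-updates", text=f"[INCIDENT] {summary}")

def post_incident_detail(detail: str) -> None:
    """Incident-specific channel: technical discussion for active responders."""
    client.chat_postMessage(channel="#inc-checkout-errors", text=detail)

post_status_update("Checkout errors elevated since 02:10 UTC; mitigation in progress.")
post_incident_detail("Error rate spiked right after deploy v1.42.0; rolling back now.")
```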

For customer-facing issues, external communication matters too. Status pages and social media updates keep users informed and show that the problem is being actively handled. When OpenAI's API goes down, they immediately update both their status page and Twitter – this transparency builds trust even during outages.

The Golden Rule: Fix First, Understand Later

Here's where many engineers make a critical mistake: they try to understand the root cause before fixing the immediate problem. Don't do this. The golden rule of incident response is "Mitigation first, investigation later."

This means three steps in order:

  1. Stop the bleeding – prevent further damage
  2. Restore service – get systems working again
  3. Investigate root causes – understand what happened (after service is restored)

Why this order? Because you're racing against time. Companies have SLA commitments that cost real money when broken. Every second of downtime damages user experience and potentially costs revenue. Plus, trying to debug under time pressure often leads to panic-driven decisions that make things worse.

Common mitigation strategies are often surprisingly simple. If the issue started after a recent deployment, immediately roll back – you can figure out what went wrong later. Use circuit breakers to disable problematic features temporarily. Redirect traffic away from struggling servers. Add more capacity if it's a resource issue. The goal is stopping the pain, not proving you're smart.
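As one concrete example, here's a hedged sketch of a feature kill switch, one of the simplest mitigation levers. The flag store (an environment variable) and flag name are purely illustrative – real systems use a feature-flag service so flags can be flipped at runtime without a deploy.

```python
import os

def feature_enabled(flag_name: str) -> bool:
    """Check a kill switch before running a risky code path.

    Here the "flag store" is just an environment variable for illustration;
    a feature-flag service lets you flip this without redeploying.
    """
    return os.environ.get(flag_name, "on") == "on"

def handle_checkout(order_id: str) -> None:
    if not feature_enabled("NEW_PRICING_ENGINE"):  # hypothetical flag name
        legacy_pricing(order_id)   # fall back to the known-good path
        return
    new_pricing(order_id)          # the code path suspected in the incident

def legacy_pricing(order_id: str) -> None:
    print(f"Pricing {order_id} with the stable legacy engine")

def new_pricing(order_id: str) -> None:
    print(f"Pricing {order_id} with the new engine")
```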

Building a Culture of Effective On-Call

Individual incident response skills matter, but they're not enough. Great on-call practices require team-wide commitment to building sustainable systems and processes.

Start with documentation that actually helps under pressure. Create runbooks that include common incident types and their solutions, clear escalation procedures, contact information for key systems, and step-by-step troubleshooting guides. When you're half-awake at 3 AM, you'll be grateful for documentation that tells you exactly what to do.

After every incident, invest time in learning. Document what happened, identify root causes, implement preventive measures, and update your procedures. This isn't bureaucracy – it's how you prevent the same problem from ruining your sleep again next month.

Balance remains crucial throughout this process. Don't let fear of incidents prevent innovation and risk-taking, but also don't sacrifice reliability for speed. Each team needs to find their own sweet spot based on their users' needs and business requirements.

Remember that on-call shouldn't feel like a solo battle. Encourage team members to help each other during complex incidents, share knowledge and experiences, and continuously improve processes together. The best on-call cultures treat incidents as learning opportunities that make everyone better engineers.

When done right, on-call duties become more than just a necessary evil – they become a valuable learning experience that deepens your understanding of production systems and makes you a more well-rounded engineer. You'll sleep better knowing your systems are robust, and you'll code better knowing you'll be the one fixing any problems you create.

