Application Monitoring Best Practices for Software Engineers

Many people think that monitoring and incident response should be handled by Site Reliability Engineering (SRE) teams. While this was true in the past, the industry has shifted significantly. Today, most companies expect developers to be directly involved in monitoring and incident response for the systems they build.

A survey by Increment found that major companies including Amazon, Dropbox, Meta, Google, and Netflix now have developers handle monitoring and on-call responsibilities. This makes sense because developers understand their applications best. When application-level problems occur, SRE teams often need to pull in developers anyway to resolve complex issues.

As a software engineer, you need to understand monitoring, on-call duties, and incident response. This article focuses on the monitoring fundamentals you need to know.

What Should You Monitor and Why?

Monitoring means continuously observing your entire system and its components by collecting different metrics to determine if anything is wrong. Through monitoring, you can discover problems early instead of waiting for users to report issues.

The problems you might detect include system failures or outages, performance issues that could become serious, and usage reaching limits that require scaling.

Here's a simple example: Imagine your website normally gets 10,000 visitors per day. If monitoring shows a sudden spike to 1 million visitors, this unusual pattern might indicate a problem (like a DDoS attack). With monitoring, engineers can investigate immediately instead of waiting for user complaints.

Without monitoring, you'd only discover the traffic spike after users start complaining they can't access your site. By then, you're already behind in solving the problem.
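
The traffic example above can be sketched as a simple baseline comparison. This is a minimal illustration, not a production detector: the function name is_traffic_anomaly and the "5 times the recent average" factor are assumptions chosen for the example, and real systems usually account for daily and weekly seasonality.

```python
from statistics import mean


def is_traffic_anomaly(history, today, factor=5.0):
    """Return True if today's visitor count exceeds `factor` times the recent average.

    `history` is a list of recent daily visitor counts (the baseline).
    `factor` is an assumed threshold, not an industry standard.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history)
    return today > baseline * factor


daily_visitors = [9_800, 10_200, 10_050, 9_900, 10_100]  # roughly 10,000 per day
print(is_traffic_anomaly(daily_visitors, 1_000_000))  # True
print(is_traffic_anomaly(daily_visitors, 10_400))     # False
```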

Key Metrics: What You Must Monitor

To monitor effectively, you need to know what to watch. The things you measure are called metrics. If you don't monitor the right metrics, you might miss critical problems.

Google's SRE book recommends four essential metrics, which it calls the "four golden signals", that every system should monitor. While different systems may need additional specific metrics, these four form the foundation:

1. Latency (Performance)

Latency measures how long each part of your system takes to respond. Slow responses frustrate users and can indicate deeper problems. Track response times from frontend to backend, database query times, and API response times.
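
One way to collect latency for individual operations is to wrap them in a timer. The sketch below uses only the Python standard library; the timed helper and the in-memory latencies_ms dictionary are hypothetical names for illustration, and a real service would send these measurements to its metrics backend instead.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store: operation name -> list of durations in milliseconds.
latencies_ms = {}


@contextmanager
def timed(operation):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        latencies_ms.setdefault(operation, []).append(elapsed_ms)


# Usage: time a (simulated) database query.
with timed("db_query"):
    time.sleep(0.01)  # stand-in for the real query

print(latencies_ms["db_query"])
```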

2. Traffic

Traffic measures how much your system is being used. Unusual traffic patterns can indicate attacks, viral content, or system issues. Key metrics include UV (Unique Visitors), PV (Page Views), and QPS (Queries Per Second), the number of requests your system handles each second.
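
As a rough illustration of how QPS can be computed, the sketch below counts requests in a sliding time window. The QpsCounter class is a hypothetical helper, and timestamps are passed in explicitly (in seconds) so the behavior is easy to follow; a real service would typically rely on its metrics library for this.

```python
from collections import deque


class QpsCounter:
    """Count requests in a sliding window and report queries per second."""

    def __init__(self, window_seconds=1.0):
        self.window = window_seconds
        self.timestamps = deque()

    def record(self, now):
        """Register one request that arrived at time `now` (seconds)."""
        self.timestamps.append(now)

    def qps(self, now):
        """Return requests per second over the window ending at `now`."""
        cutoff = now - self.window
        # Drop requests that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window


counter = QpsCounter(window_seconds=1.0)
for t in [0.0, 0.1, 0.5, 0.9, 1.1]:
    counter.record(t)
print(counter.qps(now=1.25))  # 3.0 (requests at 0.5, 0.9 and 1.1 are in the window)
```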

3. Errors (Stability)

Error metrics show how often your system fails or works incorrectly. Errors directly impact user experience and indicate system problems. Monitor your error rate (percentage of requests that fail), success rate (percentage of requests that work correctly), and uptime (how long your system stays running without problems).
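
These three numbers are simple ratios. The sketch below shows one common way to compute them; the function names are illustrative, and expressing uptime as a percentage of a time period is an assumption about how your team chooses to report it.

```python
def error_rate_percent(total_requests, failed_requests):
    """Percentage of requests that failed (0.0 when there was no traffic)."""
    if total_requests == 0:
        return 0.0
    return failed_requests * 100 / total_requests


def success_rate_percent(total_requests, failed_requests):
    """Percentage of requests that succeeded."""
    return 100.0 - error_rate_percent(total_requests, failed_requests)


def uptime_percent(period_seconds, downtime_seconds):
    """Share of the period during which the system was up, as a percentage."""
    return (period_seconds - downtime_seconds) * 100 / period_seconds


print(error_rate_percent(1000, 50))    # 5.0
print(success_rate_percent(1000, 50))  # 95.0
print(uptime_percent(30 * 24 * 3600, 2592))  # uptime over a 30-day period
```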

4. Saturation (Resource Usage)

Saturation measures how much of your system's capacity is being used. High resource usage can lead to performance problems or crashes. The key metrics are CPU usage, memory usage, and network bandwidth consumption.
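
A saturation check boils down to comparing usage against capacity. The sketch below assumes you already have usage readings from your monitoring agent; the saturated_resources function, the resource names, and the 80% threshold are all illustrative assumptions.

```python
def saturated_resources(usage, capacity, threshold=0.8):
    """Return {resource: utilization} for resources at or above `threshold`.

    `usage` and `capacity` map resource names to values in the same unit.
    `threshold` is a fraction of capacity (0.8 means 80%).
    """
    flagged = {}
    for name, used in usage.items():
        utilization = used / capacity[name]
        if utilization >= threshold:
            flagged[name] = utilization
    return flagged


usage = {"cpu_cores": 7.2, "memory_gb": 10, "network_mbps": 300}
capacity = {"cpu_cores": 8, "memory_gb": 32, "network_mbps": 1000}
print(sorted(saturated_resources(usage, capacity)))  # ['cpu_cores']
```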

Important Tips for Performance Monitoring

Watch for Leading Indicators: Some problems cause others. For example, slow database queries or message queue backlogs can trigger multiple other alerts. Monitor these "upstream" problems first.

Use Percentiles, Not Just Averages: Instead of only looking at average response times, monitor percentiles like P50, P90, P95, P99, and P99.9.

Here's what percentiles mean: if your P90 response time is 1 second, 90% of requests complete within 1 second (and the remaining 10% take longer). Percentiles give you a better picture of user experience than an average does, especially for popular applications where even 1% of users represents many people.
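
To make the difference between averages and percentiles concrete, here is a small sketch using the nearest-rank method, which is one of several common percentile definitions (monitoring tools may interpolate or estimate instead). The sample response times are made up for illustration.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Ten response times in milliseconds; one request was very slow.
response_ms = [120, 95, 110, 105, 2300, 100, 98, 102, 115, 108]

print(mean(response_ms))            # 325.3, skewed by the single slow request
print(percentile(response_ms, 50))  # 105
print(percentile(response_ms, 90))  # 120
print(percentile(response_ms, 99))  # 2300
```

The average suggests every request is slow, while the percentiles show that most requests are fast and a small tail is very slow.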

Alert Levels: Not All Problems Are Equal

When you set up monitoring, you'll create alerts that notify your team when specific conditions occur. For example, you might set an alert for when your API success rate drops below 95%.

However, not all problems deserve the same level of attention. A success rate dropping to 94% is different from dropping to 20%. You need different response levels for different severity levels.

Industry Standard Alert Levels

Most companies categorize alerts using either a priority system or a severity system.

The Priority System uses P0 for most critical issues (drop everything and fix immediately), P1 for important issues (fix within business hours), P2 for moderate issues (fix within a few days), and P3 for minor issues (fix when convenient).

The Severity System uses SEV-1 for most severe issues like customer data loss or complete system outages, SEV-2 for moderate severity like major feature failures, and SEV-3 for minor issues like cosmetic problems or non-critical features.
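
As an illustration of mapping a metric to alert levels, the sketch below classifies an API success rate into priority levels. Only the 95% alert threshold comes from the example earlier in this article; the 90% and 50% cut-offs and the function name are assumptions that each team would tune for its own system.

```python
def classify_success_rate(success_rate_percent):
    """Map an API success rate to a priority level, or None if no alert is needed.

    Thresholds below 95% are illustrative assumptions, not a standard.
    """
    if success_rate_percent < 50:
        return "P0"  # most traffic is failing: drop everything
    if success_rate_percent < 90:
        return "P1"  # significant failures
    if success_rate_percent < 95:
        return "P2"  # just below the alert threshold
    return None      # healthy, no alert


print(classify_success_rate(94))  # P2
print(classify_success_rate(20))  # P0
print(classify_success_rate(99))  # None
```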

Response Time Requirements

Different alert levels require different response and resolution times. Response time is how quickly you acknowledge receiving the alert (usually by clicking an "Ack" button), while resolution time is how long it takes from the alert firing until the problem is actually resolved in production.

Here's a typical example (your team may set different standards):

Alert Level	Response Time	Resolution Time
P0 or SEV-1	1 minute	1 hour
P1 or SEV-2	15 minutes	24 hours
P2 or SEV-3	30 minutes	48 hours
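
The example table above can be encoded as data and used to check whether an incident met its targets. This is a sketch under the same example values; the SLA_TARGETS name and the sla_breaches function are hypothetical, and your team's targets may differ.

```python
from datetime import timedelta

# (response time limit, resolution time limit) per alert level, from the example table.
SLA_TARGETS = {
    "P0": (timedelta(minutes=1), timedelta(hours=1)),
    "P1": (timedelta(minutes=15), timedelta(hours=24)),
    "P2": (timedelta(minutes=30), timedelta(hours=48)),
}


def sla_breaches(level, time_to_ack, time_to_resolve):
    """Return which targets were missed: 'response', 'resolution', both, or none."""
    ack_limit, resolve_limit = SLA_TARGETS[level]
    breaches = []
    if time_to_ack > ack_limit:
        breaches.append("response")
    if time_to_resolve > resolve_limit:
        breaches.append("resolution")
    return breaches


# A P0 acknowledged in 45 seconds but resolved after 90 minutes.
print(sla_breaches("P0", timedelta(seconds=45), timedelta(minutes=90)))  # ['resolution']
```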

How to Handle Alerts When They Fire

When an alert triggers, follow this process:

1. Acknowledge the Alert: click the "Ack" button to let your team know you've received the alert and are working on it.
2. Assess the Severity: determine what level of problem this is and whether it's a real issue or a false positive.
3. Investigate and Respond: if it's a false positive, adjust your monitoring thresholds to reduce noise. If it's a real problem, start troubleshooting immediately. If it's caused by recent deployments, roll back first and investigate later.
4. Communicate Constantly: keep your team updated throughout the incident. Regular communication helps coordinate response efforts and keeps stakeholders informed.
5. Follow Up: after resolving the issue, notify relevant team members that the problem is fixed, ask for confirmation that everything is working normally, and for major incidents, schedule a post-incident review meeting.
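
The "Investigate and Respond" step above is essentially a small decision tree, which can be sketched as follows. The function and the action names it returns are hypothetical labels for illustration, not part of any alerting tool.

```python
def next_action(is_false_positive, caused_by_recent_deployment):
    """Pick the first response step for an acknowledged alert."""
    if is_false_positive:
        return "adjust_thresholds"  # reduce noise so the alert stays meaningful
    if caused_by_recent_deployment:
        return "roll_back"          # restore service first, investigate later
    return "troubleshoot"           # real problem with no obvious trigger


print(next_action(is_false_positive=False, caused_by_recent_deployment=True))  # roll_back
```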

Key Takeaways

Modern software development requires engineers to be directly involved in monitoring their systems. By understanding what to monitor, how to set up appropriate alerts, and how to respond effectively, you can catch problems early and minimize their impact on users.

Remember: good monitoring isn't just about collecting data – it's about getting the right information at the right time to make quick, informed decisions about your system's health.

How Should Software Engineers Handle On-Call Duties?