What Is an SLO? Why SLOs Matter for Software Teams

When discussing reliability engineering, SLO (Service-Level Objective) is often the first term that comes to mind. Google's landmark book Site Reliability Engineering even states that "without SLOs, there is no reason for SRE to exist," since SRE teams use SLOs to prioritize work and ensure these objectives are consistently met.

What are SLI, SLO, and SLA?

Before defining SLO precisely, let's step back and think about what a software product is fundamentally for.

Try answering these questions before reading on:

1. Why does a software product exist? (Think about software you've written at work or in personal projects and why you invest time building it.)
2. How should a software team judge whether it's reliably achieving the product's purpose? (Consider: if someone asked about your team's product reliability, how would you respond? What would you base that answer on?)
3. As an engineer, you want to propose to your product manager that the team dedicate more time to technical improvements and refactoring. How would you demonstrate that this time investment has value?

In a software team, the first question is typically answered by the product manager (though good engineers should have their own perspective too). The second and third questions are challenges that senior engineers face in their day-to-day work, and SLOs exist precisely to help answer them.

SLI Comes Before SLO

SLO stands for Service-Level Objective: a target for some aspect of service quality. For engineers, an SLO is a goal the team commits to meeting. Common examples include "API requests have a 99.9% success rate" or "90% of API requests complete within 10 milliseconds." Different products have different SLOs, and engineers are responsible for defining them and making sure they are met.

Unfortunately, some engineering teams pick targets arbitrarily to appease management, or simply copy SLOs from other products. To avoid this, establish meaningful SLIs before setting SLOs, so that your objectives are neither arbitrary nor borrowed from someone else's context.

SLI stands for Service-Level Indicator: a measurable property of the service that captures what matters most for your software product.

Different products prioritize different aspects. For financial or accounting software, correctness is paramount; even minor errors are unacceptable. In these cases, teams often sacrifice speed to ensure accuracy.

Social media applications, by contrast, prioritize speed, since users may abandon a slow experience. Correctness requirements are lower: if the like count isn't perfectly accurate in real time, that's usually acceptable.

Typically, you'd define SLIs in collaboration with your product manager, documenting what's most important for your product. Common indicators include speed (latency), availability, durability, accuracy, and completeness.
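To make the idea of an indicator concrete, here is a minimal sketch of computing two of these SLIs (availability and latency) from a batch of observed requests. The `Request` record shape, the function names, and the sample data are assumptions made for illustration; real systems usually derive these numbers from monitoring or logging pipelines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed API request (hypothetical record shape)."""
    succeeded: bool
    latency_ms: float


def availability_sli(requests):
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    return sum(r.succeeded for r in requests) / len(requests)


def latency_sli(requests, threshold_ms):
    """Fraction of requests that completed within threshold_ms."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


# Made-up sample: three successes, one failure; two requests at or under 10 ms.
sample = [
    Request(succeeded=True, latency_ms=8.0),
    Request(succeeded=True, latency_ms=12.0),
    Request(succeeded=False, latency_ms=30.0),
    Request(succeeded=True, latency_ms=5.0),
]

print(availability_sli(sample))   # 0.75
print(latency_sli(sample, 10.0))  # 0.5
```

Both indicators are expressed as a ratio of "good" events to all events, which makes them easy to compare against a percentage target.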

Setting SLOs Based on SLIs

Once you've identified important SLIs, you'll likely realize something's missing. If your product manager says availability is crucial, most engineers' first question is: "What counts as high availability?"

For instance, is 99% availability good? Or does it need to be 99.9%? Does a 100ms API response time qualify as low latency? Or should it be under 50ms?

This is where the "O" in SLO comes in. Based on SLIs, you set specific objectives. SLIs tell you what's important; SLOs define the standard you need to meet. Even with identical SLIs, different systems might have different SLOs. Even within the same system, the same indicator might have different targets: a checkout endpoint, for example, may warrant a stricter availability target than a search endpoint. We'll explore how to set appropriate targets for your team later.
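The relationship between an indicator and its objective can be sketched in a few lines. This is only an illustration: the `SLO` class, the endpoint names, and the target values are hypothetical, chosen to show one measured SLI value passing one target and failing another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target for one SLI, expressed as a fraction (0.999 means 99.9%)."""
    name: str
    target: float

    def is_met(self, measured_sli: float) -> bool:
        """An SLO is met when the measured SLI is at or above the target."""
        return measured_sli >= self.target


# Same indicator (availability), different targets per endpoint (hypothetical).
checkout = SLO(name="checkout availability", target=0.999)
search = SLO(name="search availability", target=0.99)

measured = 0.995  # an assumed availability measurement for the period

print(checkout.is_met(measured))  # False
print(search.is_met(measured))    # True
```

The same measurement counts as a miss for one endpoint and a pass for the other, which is why the target has to be chosen per product rather than copied.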

Returning to the second question from earlier (how should a software team judge whether it's reliably achieving the product's purpose?), the answer is to define SLOs around reliability. If the team is meeting its SLOs, it is reliably achieving that purpose.

SLA: The Customer Promise

For an engineering team's internal purposes, SLOs are sufficient. But from a business perspective, teams often go further by defining SLAs (Service-Level Agreements), which are formal service guarantees.

SLAs represent external commitments to customers. Imagine your team is building a competitor to AWS S3. To convince customers to use your product instead, you might say, "Our product is more reliable than AWS S3." Naturally, customers ask: "How do you guarantee that?"

SLAs answer this question by putting your money where your mouth is. If you claim 99.999% availability but offer no recourse if you miss it, customers won't trust you. But if you say, "If we don't hit 99.999% availability, we'll refund everything," you're far more likely to win them over.

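AWS S3's SLA (link) illustrates this with tiered service credits. At the time of writing, for the S3 Standard storage class, a monthly uptime below 99.9% earns a 10% credit, below 99.0% a 25% credit, and below 95.0% a 100% credit. Note that these are credits against future bills rather than cash refunds.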
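A tiered SLA of this kind amounts to a lookup from measured uptime to compensation. The sketch below shows one way to express that; the function name and the tier values are illustrative assumptions, not a quotation of any provider's current terms, so check the actual SLA before relying on specific numbers.

```python
def service_credit_percent(uptime_percent, tiers):
    """Return the credit (as a percent of fees) owed for a measured uptime.

    tiers is a list of (uptime_floor_percent, credit_percent) pairs. A tier
    is breached when uptime falls below its floor; the largest credit among
    the breached tiers applies. No breached tier means no credit.
    """
    breached = [credit for floor, credit in tiers if uptime_percent < floor]
    return max(breached, default=0)


# Illustrative tiers only (assumed values): (uptime floor %, credit %).
ILLUSTRATIVE_TIERS = [(99.9, 10), (99.0, 25), (95.0, 100)]

print(service_credit_percent(99.95, ILLUSTRATIVE_TIERS))  # 0
print(service_credit_percent(99.5, ILLUSTRATIVE_TIERS))   # 10
print(service_credit_percent(97.0, ILLUSTRATIVE_TIERS))   # 25
print(service_credit_percent(90.0, ILLUSTRATIVE_TIERS))   # 100
```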

Most teams set their internal SLOs stricter than their SLAs to create a buffer. For example, if the SLA is 99.9%, the internal SLO might be 99.95%. This way, an SLO miss triggers remediation before the team breaches the SLA and owes compensation.
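The size of that buffer is easier to see as allowed downtime. The following is a rough sketch that assumes a 30-day window and treats availability as a simple fraction of minutes; the function name is made up for this example.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window for an availability target.

    target is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - target) * total_minutes


sla_budget = allowed_downtime_minutes(0.999)   # external SLA: 99.9%
slo_budget = allowed_downtime_minutes(0.9995)  # internal SLO: 99.95%

print(round(sla_budget, 1))               # 43.2
print(round(slo_budget, 1))               # 21.6
print(round(sla_budget - slo_budget, 1))  # 21.6 minutes of buffer
```

Under these assumptions, the internal SLO is missed after about 21.6 minutes of downtime in a month, leaving roughly another 21.6 minutes to react before the 99.9% SLA is breached.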