The Pragmatic SRE Guide to SLOs: From Business Goals to Error Budgets
Go beyond uptime percentages—learn how to map business goals into user-centric SLOs, define error budgets, and set up actionable alerting with real-world examples.
Intro: Why SLOs Matter More Than Uptime Percentages
For years, teams measured success using uptime percentages — “five nines” became the gold standard.
But uptime alone doesn’t reflect user experience. A system can be “up” but still unusable if login fails, checkout lags, or search results never load.
This is where Service Level Objectives (SLOs) shine. They align engineering goals with business outcomes, ensuring teams optimize reliability where it matters most.
Mapping Business Goals to User Journeys
Every business goal — whether it’s revenue growth, retention, or engagement — depends on key user journeys. For an e-commerce app, that might mean:
- Browsing/searching for products
- Adding items to the cart
- Completing checkout
Instead of “99.9% uptime,” measure how reliably users can complete these journeys.
Identify Critical Paths (Checkout, Search, Login)
Focus reliability efforts on flows where downtime directly hurts users or revenue:
- Login service → Gatekeeper for everything.
- Checkout service → High latency = abandoned carts.
- Search service → Broken discovery = fewer conversions.
Prioritize these over “nice-to-have” features.
Translate UX Latency & Errors into SLIs
Define Service Level Indicators (SLIs) — measurable signals tied to user experience:
- Availability → % of successful requests.
- Latency → Response times under X ms.
- Correctness → No unexpected server-side errors.
Example SLI:
99% of checkout API requests must respond within 300ms.
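As a sketch, an SLI like this is just the fraction of "good" requests over total requests. The sample latencies below are hypothetical illustration values, not real measurements:

```python
# Sketch: computing a latency SLI from raw request data.
# The latency values here are hypothetical, for illustration only.
latencies_ms = [120, 95, 210, 280, 150, 900, 130, 250, 175, 110]
threshold_ms = 300  # "good" request = responds within 300ms

good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
sli = good / len(latencies_ms)  # fraction of good requests

print(f"Latency SLI: {sli:.1%} of requests under {threshold_ms}ms")
```

In production you would compute this from metrics (e.g. histogram buckets) rather than raw samples, but the ratio-of-good-events definition is the same.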
Defining SLOs That Teams Actually Use
SLOs should be clear, simple, and actionable.
✅ Good Example:
99.5% of checkout requests succeed within 300ms over 30 days.
❌ Bad Example:
System uptime must be 99.999% except during deploys, brownouts, or load tests.
Latency Percentiles (p95, p99) vs Averages
Never rely on averages — they hide outliers that ruin user experience.
- p95 latency → 95% of requests are faster than X ms.
- p99 latency → 99% of requests are faster than X ms.
If checkout p99 = 5 seconds, the slowest 1% of your users wait 5 seconds or more — and they are often your heaviest users.
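A tiny sketch makes the average-vs-percentile point concrete. With 99 fast requests and one 5-second outlier (hypothetical values, nearest-rank percentile for simplicity):

```python
# Sketch: why averages hide tail latency (hypothetical values).
# 99 fast requests plus one 5-second outlier.
latencies_ms = [100] * 99 + [5000]

avg = sum(latencies_ms) / len(latencies_ms)
# Simple nearest-rank p99: the value 99% of the way up the sorted list.
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]

print(f"average = {avg:.0f}ms, p99 = {p99}ms")
```

The average (149ms) looks healthy while the p99 (5000ms) exposes the outlier — exactly the failure mode averages hide.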
Error Budgets Explained with Examples
An error budget defines how much failure is acceptable before action is required.
Example:
- SLO = 99.9% availability per month.
- 30 days = 43,200 minutes.
- Allowed downtime = 43.2 minutes.
If you “spend” 30 minutes in week one, slow down risky deploys to preserve your budget.
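The arithmetic above can be sketched in a few lines:

```python
# Sketch of the error-budget arithmetic above.
slo = 0.999                     # 99.9% availability target
window_minutes = 30 * 24 * 60   # 30 days = 43,200 minutes

budget_minutes = (1 - slo) * window_minutes  # 43.2 minutes allowed
spent = 30                                   # minutes "spent" in week one
remaining = budget_minutes - spent           # ~13.2 minutes left

print(f"budget: {budget_minutes:.1f}m, remaining: {remaining:.1f}m")
```

With less than a third of the budget left and three weeks to go, freezing risky deploys is the rational move.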
Alerting Based on Error Budgets
Traditional monitoring = noisy paging. Modern observability ties alerts to error budget burn instead.
- Slow burn → Handle during business hours.
- Fast burn → Wake someone up immediately.
This cuts alert fatigue and ensures engineers focus on user-impacting issues.
Burn Rate Alerts (Fast vs Slow Burn)
| Burn Rate | Budget Exhausted | Action |
|---|---|---|
| 14.4x | ~2 days | Critical page |
| 6x | ~5 days | Investigate soon |
| 1x | Exactly 30 days | Open a ticket |
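The relationship is simple: a burn rate of B means you consume the budget B times faster than planned, so a 30-day budget lasts 30/B days. A minimal sketch:

```python
# Sketch: burn rate vs. time to exhaust a 30-day error budget.
# Burn rate = observed error rate / budgeted error rate;
# at burn rate B, the budget lasts (window / B).
window_days = 30

for burn_rate in (14.4, 6, 1):
    days = window_days / burn_rate
    print(f"{burn_rate:>5}x burn -> budget gone in {days:.1f} days")
```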
Example Grafana/Prometheus Alert Rules
```yaml
groups:
  - name: slo.rules
    rules:
      - alert: FastBurn
        expr: rate(errors_total[5m]) / rate(requests_total[5m]) > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Error budget burning ~14x too fast; exhausted in ~2 days at this rate"
      - alert: SlowBurn
        expr: rate(errors_total[1h]) / rate(requests_total[1h]) > (1 - 0.999) * 6
        for: 2h
        labels:
          severity: warning
        annotations:
          description: "Error budget burning steadily; check during business hours"
```
Key Takeaways
- Uptime ≠ reliability; optimize for user journeys instead.
- Define SLIs & SLOs tied to real UX signals like latency & errors.
- Use error budgets to balance innovation and stability.
- Alert on budget burn rates, not individual blips.
- p95/p99 percentiles matter more than averages.
- SLO-driven operations empower teams to deliver consistent, user-focused reliability.
Done right, SLOs align engineering with business, balance feature velocity with stability, and drive meaningful reliability improvements.