The Pragmatic SRE Guide to SLOs: From Business Goals to Error Budgets
Go beyond uptime percentages—learn how to map business goals into user-centric SLOs, define error budgets, and set up actionable alerting with real-world examples.
Intro: Why SLOs Matter More Than Uptime Percentages
For years, teams measured success using uptime percentages — “five nines” became the gold standard.
But uptime alone doesn’t reflect user experience. A system can be “up” but still unusable if login fails, checkout lags, or search results never load.
This is where Service Level Objectives (SLOs) shine. They align engineering goals with business outcomes, ensuring teams optimize reliability where it matters most.
Mapping Business Goals to User Journeys
Every business goal — whether it’s revenue growth, retention, or engagement — depends on key user journeys. For an e-commerce app, that might mean:
- Browsing/searching for products
- Adding items to the cart
- Completing checkout
Instead of “99.9% uptime,” measure how reliably users can complete these journeys.
Identify Critical Paths (Checkout, Search, Login)
Focus reliability efforts on flows where downtime directly hurts users or revenue:
- Login service → Gatekeeper for everything.
- Checkout service → High latency = abandoned carts.
- Search service → Broken discovery = fewer conversions.
Prioritize these over “nice-to-have” features.
Translate UX Latency & Errors into SLIs
Define Service Level Indicators (SLIs) — measurable signals tied to user experience:
- Availability → % of successful requests.
- Latency → Response times under X ms.
- Correctness → No unexpected server-side errors.
Example SLI:
99% of checkout API requests must respond within 300ms.
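As a sketch, an SLI like this is just the fraction of "good" requests over total requests. The sample latencies below are hypothetical illustration values, not real measurements:

```python
# Sketch: computing a latency SLI from raw request data.
# The latency values here are hypothetical, for illustration only.
latencies_ms = [120, 95, 210, 280, 150, 900, 130, 250, 175, 110]
threshold_ms = 300  # "good" request = responds within 300ms

good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
sli = good / len(latencies_ms)  # fraction of good requests

print(f"Latency SLI: {sli:.1%} of requests under {threshold_ms}ms")
```

In production you would compute this from metrics (e.g. histogram buckets) rather than raw samples, but the ratio-of-good-events definition is the same.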
Defining SLOs That Teams Actually Use
SLOs should be clear, simple, and actionable.
✅ Good Example:
99.5% of checkout requests succeed within 300ms over 30 days.
❌ Bad Example:
System uptime must be 99.999% except during deploys, brownouts, or load tests.
Latency Percentiles (p95, p99) vs Averages
Never rely on averages — they hide outliers that ruin user experience.
- p95 latency → 95% of requests are faster than X ms.
- p99 latency → 99% of requests are faster than X ms.
If checkout p99 = 5 seconds, the slowest 1% of your users wait 5 seconds or more — and they are often your heaviest users.
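A tiny sketch makes the average-vs-percentile point concrete. With 99 fast requests and one 5-second outlier (hypothetical values, nearest-rank percentile for simplicity):

```python
# Sketch: why averages hide tail latency (hypothetical values).
# 99 fast requests plus one 5-second outlier.
latencies_ms = [100] * 99 + [5000]

avg = sum(latencies_ms) / len(latencies_ms)
# Simple nearest-rank p99: the value 99% of the way up the sorted list.
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]

print(f"average = {avg:.0f}ms, p99 = {p99}ms")
```

The average (149ms) looks healthy while the p99 (5000ms) exposes the outlier — exactly the failure mode averages hide.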
Error Budgets Explained with Examples
An error budget defines how much failure is acceptable before action is required.
Example:
- SLO = 99.9% availability per month.
- 30 days = 43,200 minutes.
- Allowed downtime = 43.2 minutes.
If you “spend” 30 minutes in week one, slow down risky deploys to preserve your budget.
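The arithmetic above can be sketched in a few lines:

```python
# Sketch of the error-budget arithmetic above.
slo = 0.999                     # 99.9% availability target
window_minutes = 30 * 24 * 60   # 30 days = 43,200 minutes

budget_minutes = (1 - slo) * window_minutes  # 43.2 minutes allowed
spent = 30                                   # minutes "spent" in week one
remaining = budget_minutes - spent           # ~13.2 minutes left

print(f"budget: {budget_minutes:.1f}m, remaining: {remaining:.1f}m")
```

With less than a third of the budget left and three weeks to go, freezing risky deploys is the rational move.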
Alerting Based on Error Budgets
Traditional monitoring = noisy paging. Modern observability ties alerts to error budget burn instead.
- Slow burn → Handle during business hours.
- Fast burn → Wake someone up immediately.
This cuts alert fatigue and ensures engineers focus on user-impacting issues.
Burn Rate Alerts (Fast vs Slow Burn)
| Burn Rate | Budget Exhausted | Action |
|---|---|---|
| 14.4x | ~2 days | Critical page |
| 6x | ~5 days | Investigate soon |
| 1x | Exactly 30 days | Open a ticket |
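The relationship is simple: a burn rate of B means you consume the budget B times faster than planned, so a 30-day budget lasts 30/B days. A minimal sketch:

```python
# Sketch: burn rate vs. time to exhaust a 30-day error budget.
# Burn rate = observed error rate / budgeted error rate;
# at burn rate B, the budget lasts (window / B).
window_days = 30

for burn_rate in (14.4, 6, 1):
    days = window_days / burn_rate
    print(f"{burn_rate:>5}x burn -> budget gone in {days:.1f} days")
```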
Example Grafana/Prometheus Alert Rules
```yaml
groups:
  - name: slo.rules
    rules:
      - alert: FastBurn
        expr: rate(errors_total[5m]) / rate(requests_total[5m]) > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Error budget burning ~14x too fast; exhausted in ~2 days at this rate"
      - alert: SlowBurn
        expr: rate(errors_total[1h]) / rate(requests_total[1h]) > (1 - 0.999) * 6
        for: 2h
        labels:
          severity: warning
        annotations:
          description: "Error budget burning steadily; check during business hours"
```
Key Takeaways
- Uptime ≠ reliability; optimize for user journeys instead.
- Define SLIs & SLOs tied to real UX signals like latency & errors.
- Use error budgets to balance innovation and stability.
- Alert on budget burn rates, not individual blips.
- p95/p99 percentiles matter more than averages.
- SLO-driven operations empower teams to deliver consistent, user-focused reliability.
Done right, SLOs align engineering with business, balance feature velocity with stability, and drive meaningful reliability improvements.