principle, javascript, Major
SLOs and error budgets: defining and tracking reliability targets
SLO, SLI, SLA, error budget, burn rate, availability, reliability target, 30 day window, recording rules
Problem
Teams argue about whether a service is 'reliable enough' based on gut feeling. When incidents happen, there is no shared definition of acceptable failure, leading to over-engineering in some areas and under-investment in others.
Solution
Define SLIs (what you measure), SLOs (your target), and error budgets (allowed unreliability).
SLI: a specific metric, for example the ratio of successful requests to total requests over a time window.
SLO: a target for that SLI, for example 99.9% of requests succeed over a rolling 30-day window.
Error budget: 1 - SLO. For a 99.9% SLO, 0.1% of requests may fail, which is equivalent to about 43 minutes of complete downtime in a 30-day window.
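The arithmetic behind the 43-minute figure can be sketched in a few lines. This is an illustrative helper, not part of any monitoring library; the function names and the 30-day default are assumptions chosen to match the definitions above.

```python
def error_budget_fraction(slo: float) -> float:
    """Fraction of requests (or time) allowed to fail under the given SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of complete outage that would spend the whole budget."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return error_budget_fraction(slo) * window_minutes


# Compare a few common targets over the same 30-day window.
for slo in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(slo)
    print(f"SLO {slo:.2%}: about {minutes:.1f} minutes of full downtime")
```

Each extra nine cuts the budget by a factor of ten, which is why the target should follow from what users need rather than from what sounds impressive.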
When the error budget is exhausted, freeze feature work and focus on reliability. When it is full, it is safe to take risks with new deployments.
# Prometheus SLO queries (PromQL comments start with #)
# SLI: availability, the ratio of non-5xx requests to all requests
sum(rate(http_requests_total{status_code!~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))
# Error budget remaining, as a percentage (100 = untouched, 0 = exhausted)
100 * (
  1 - (
    sum(rate(http_requests_total{status_code=~"5.."}[30d]))
    / sum(rate(http_requests_total[30d]))
  ) / 0.001  # 0.001 = 1 - 0.999 SLO
)
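The same calculation can be checked offline from raw request counts, which helps when validating a dashboard. The sketch below is a minimal illustration; `BudgetStatus`, `budget_status`, and the sample counts are hypothetical, not taken from any real service.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    sli: float               # observed success ratio
    budget_consumed: float   # 1.0 means the whole budget is spent
    budget_remaining: float  # goes negative once the SLO is breached


def budget_status(good: int, total: int, slo: float = 0.999) -> BudgetStatus:
    """Derive SLI and error-budget usage from request counts in the window."""
    if total <= 0:
        raise ValueError("total must be positive")
    error_ratio = (total - good) / total
    consumed = error_ratio / (1.0 - slo)  # share of the allowed errors used
    return BudgetStatus(
        sli=good / total,
        budget_consumed=consumed,
        budget_remaining=1.0 - consumed,
    )


# Hypothetical window: 10M requests, 7,000 of them failed.
status = budget_status(good=9_993_000, total=10_000_000)
print(f"SLI {status.sli:.4%}, budget remaining {status.budget_remaining:.0%}")
```

With these made-up counts the service has spent roughly 70% of its budget, leaving about 30% for the rest of the window.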
Why
Error budgets create a shared language between product and engineering. They turn reliability into a business conversation: 'we have 30% of our error budget left this month, should we deploy this risky change?'
Gotchas
- Queries over a 30-day range are expensive in Prometheus because each evaluation reads 30 days of samples; consider recording rules that pre-compute the ratios
- SLOs should be based on what users actually experience, not internal health checks
- Multi-window alerting (1h, 6h, 3d burn rate) catches fast burns and slow burns that single-window alerts miss
- An SLA is a contract with legal consequences — your SLO should be stricter than your SLA to give a buffer
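The multi-window idea from the list above can be sketched as a burn-rate check. Burn rate is the observed error ratio divided by the budget (1 - SLO); a burn rate of 1 spends exactly the whole budget over the SLO window. The thresholds here (14.4, 6, 1) follow the example commonly cited from the Google SRE Workbook for a 30-day SLO and should be treated as assumptions to tune. The function names and sample error ratios are hypothetical, and the short confirmation windows that production alerts usually pair with each long window are left out for brevity.

```python
SLO = 0.999

# Long window -> burn-rate threshold. Assumed values for a 30-day SLO.
BURN_RATE_THRESHOLDS = {"1h": 14.4, "6h": 6.0, "3d": 1.0}


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def firing_windows(error_ratios: dict[str, float], slo: float = SLO) -> list[str]:
    """Return the windows whose burn rate meets or exceeds its threshold."""
    return [
        window
        for window, threshold in BURN_RATE_THRESHOLDS.items()
        if burn_rate(error_ratios[window], slo) >= threshold
    ]


# Fast burn: a sharp spike in the last hour, not yet visible over longer windows.
fast = {"1h": 0.02, "6h": 0.004, "3d": 0.0005}
# Slow burn: a steady 0.15% error ratio that never trips the short windows.
slow = {"1h": 0.0015, "6h": 0.0015, "3d": 0.0015}

print("fast burn fires:", firing_windows(fast))
print("slow burn fires:", firing_windows(slow))
```

The slow-burn case is the one a single short-window alert would miss: the error ratio is only 1.5 times the budget rate, yet left alone it would exhaust the budget before the 30-day window ends.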
Context
Establishing reliability targets for a production service as part of an SRE practice