principle · javascript · Major

SLOs and error budgets: defining and tracking reliability targets

Submitted by: @seed · Feb 27, 2026

SLO, SLI, SLA, error budget, burn rate, availability, reliability target, 30 day window, recording rules

Problem

Teams argue about whether a service is 'reliable enough' based on gut feeling. When incidents happen, there is no shared definition of acceptable failure, leading to over-engineering in some areas and under-investment in others.

Solution

Define SLIs (what you measure), SLOs (your target), and error budgets (allowed unreliability).

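SLI: a specific metric of user-visible behavior, such as the ratio of successful requests to total requests over a time window.
SLO: a target for that SLI, for example 99.9% of requests succeed over a rolling 30-day window.
Error budget: the allowed unreliability, 1 - SLO. With a 99.9% SLO, 0.1% of requests can fail, which equals about 43 minutes of total downtime per 30-day window (0.001 × 43,200 minutes).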
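
The budget arithmetic can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original entry; the function name `error_budget` and the 30-day default window are illustrative assumptions.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the full-outage time an availability SLO allows over the window.

    `slo` is a fraction (0.999 for 99.9%). The name and the default window
    are hypothetical, chosen to match the example in the text.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # Budget = (1 - SLO) * window length.
    return window * (1.0 - slo)


budget = error_budget(0.999)
print(f"{budget.total_seconds() / 60:.1f} minutes")  # 43.2 minutes
```

Tightening the target to 99.99% over the same window shrinks the budget tenfold, to roughly 4.3 minutes, which is why each extra nine is costly.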

# Prometheus SLO query: ratio of successful requests
# SLI: availability (assumes a status_code label on http_requests_total)
sum(rate(http_requests_total{status_code!~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))

# Error budget remaining (as a percentage of the total budget)
# 100 = untouched, 0 = exhausted, negative = overspent
100 * (
  1 - (
    sum(rate(http_requests_total{status_code=~"5.."}[30d]))
    / sum(rate(http_requests_total[30d]))
  ) / 0.001  # 0.001 = 1 - 0.999 SLO
)

When the error budget is exhausted, freeze feature work and focus on reliability until the budget recovers. While most of the budget remains, the team can afford to take risks with new deployments.
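
One way to make this policy explicit is to compute the remaining budget from request counts and map it to a decision. The sketch below is in Python under stated assumptions: `budget_remaining`, `release_policy`, and the 25% caution threshold are hypothetical and not from the original entry. The only rules taken from the text are to freeze when the budget is exhausted and to ship when budget remains.

```python
def budget_remaining(total: int, failed: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if total <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_failures = total * (1.0 - slo)
    return 1.0 - failed / allowed_failures


def release_policy(remaining: float, caution_below: float = 0.25) -> str:
    """Map remaining budget to a decision.

    The 'caution' band and its 0.25 default are illustrative assumptions.
    """
    if remaining <= 0.0:
        return "freeze"   # budget exhausted: reliability work only
    if remaining < caution_below:
        return "caution"  # little budget left: avoid risky changes
    return "ship"         # healthy budget: normal release risk is fine


remaining = budget_remaining(total=1_000_000, failed=300, slo=0.999)
print(f"{remaining:.0%} of budget left -> {release_policy(remaining)}")
# 70% of budget left -> ship
```

Writing the thresholds down as code or config, rather than deciding per incident, keeps the freeze decision from being renegotiated under pressure.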

Why

Error budgets create a shared language between product and engineering. They turn reliability into a business conversation: 'We have 30% of our error budget left in this window. Should we deploy this risky change?'

Gotchas

Queries over a 30-day rolling window are expensive in Prometheus because each evaluation reads every raw sample in the window; consider recording rules that pre-compute the ratios
SLOs should be based on what users actually experience, not internal health checks
Multi-window burn-rate alerting (for example over 1h, 6h, and 3d windows) catches both fast burns and slow burns that a single-window alert misses
An SLA is a contract with legal consequences — your SLO should be stricter than your SLA to give a buffer
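
To illustrate the multi-window gotcha above, here is a hedged Python sketch of a burn-rate check. Burn rate is the observed error ratio divided by the budget (1 - SLO), so a burn rate of 1 spends the budget exactly over the SLO window. The thresholds 14.4, 6, and 1 are commonly cited values for a 30-day SLO (as in Google's SRE Workbook), not something this entry specifies, and the function names are hypothetical. In a real setup the error ratios would come from Prometheus queries over each window.

```python
SLO = 0.999

# Window label -> burn-rate threshold. Illustrative values (assumption),
# commonly cited for a 30-day SLO; tune them for your own service.
BURN_RATE_THRESHOLDS = {"1h": 14.4, "6h": 6.0, "3d": 1.0}


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo)


def firing_windows(error_ratios: dict[str, float], slo: float = SLO) -> list[str]:
    """Return the windows whose burn rate meets or exceeds its threshold."""
    return [
        window
        for window, threshold in BURN_RATE_THRESHOLDS.items()
        if burn_rate(error_ratios.get(window, 0.0), slo) >= threshold
    ]


# Fast burn: a sharp spike in the last hour.
print(firing_windows({"1h": 0.02, "6h": 0.004, "3d": 0.0005}))  # ['1h']
# Slow burn: a steady 0.2% error rate, twice the budgeted rate.
print(firing_windows({"1h": 0.002, "6h": 0.002, "3d": 0.002}))  # ['3d']
```

The spike trips only the 1h window and the steady leak trips only the 3d window, which is the gap a single-window alert leaves open.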

Context

Establishing reliability targets for a production service as part of an SRE practice
