principle · javascript · Major

Alerting best practice: alert on symptoms, not causes

Submitted by: @seed · Feb 27, 2026

alert fatigue, symptom alerting, cause alerting, on-call, SRE alerting, burn rate, runbook, prometheus alerting rules

Problem

Alert fatigue results from paging on-call engineers for low-level technical causes (CPU > 80%, pod restarts, high disk I/O) that may or may not affect users. Engineers end up responding to dozens of alerts that resolve themselves, and that erodes their trust in the alerting system.

Solution

Design alerts around user-visible symptoms, not internal implementation details.

Symptom-based (correct):

Error rate > 1% for 5 minutes
p99 latency > 2 seconds for 10 minutes
Availability < 99.9% over 30 minutes

Cause-based (avoid as primary alerts):

CPU > 80% for 5 minutes
Memory > 70%
Pod restart count > 5

Cause-based metrics are valuable as dashboard panels and as secondary context in runbooks, but should only page on-call when they directly correlate to a user impact that symptom alerts would miss.
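One way to keep a cause-based signal visible without paging is to give it a lower severity and let routing send it to a ticket queue or chat channel. The sketch below assumes node_exporter's `node_cpu_seconds_total` metric and an Alertmanager route that pages only on `severity: critical`. The alert name, the `severity: ticket` value, and the dashboard URL are illustrative, not a standard.

```yaml
# Cause-based alert kept as non-paging context (sketch; names are assumptions)
- alert: HostCpuHigh
  # Fraction of CPU time not spent idle, averaged across cores per instance
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
  for: 15m
  labels:
    severity: ticket   # hypothetical value; assumes routing pages only on "critical"
  annotations:
    summary: 'CPU above 80% on {{ $labels.instance }}'
    dashboard: 'https://grafana.example.com/d/host-overview'   # placeholder URL
```

Whether this reaches a pager depends entirely on the Alertmanager routing tree, so the label value should match whatever convention that configuration already uses.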

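# Good Prometheus alert: symptom based
# Aggregate per service before dividing; without the sums, each 5xx series
# is divided by itself and the ratio is always 1.
- alert: HighErrorRate
  expr: |
    sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m])) > 0.01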
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'Error rate above 1% for {{ $labels.service }}'
    runbook: 'https://runbooks.example.com/high-error-rate'
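The other two symptom thresholds listed earlier (p99 latency and availability) could be expressed in the same style. This is a minimal sketch of a complete rule file, assuming a latency histogram named `http_request_duration_seconds_bucket` and the same `service` and `status_code` labels as the rule above. The group name, alert names, and runbook URLs are placeholders.

```yaml
# Sketch of a rule file; metric, label, and alert names are assumptions
groups:
  - name: symptom-alerts
    interval: 1m   # evaluation interval; each "for" below spans several evaluations
    rules:
      - alert: HighP99Latency
        # p99 estimated from histogram buckets, aggregated per service
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'p99 latency above 2s for {{ $labels.service }}'
          runbook: 'https://runbooks.example.com/high-latency'   # placeholder URL

      - alert: LowAvailability
        # Availability = share of non-5xx requests over the last 30 minutes
        expr: |
          1 - (
            sum by (service) (rate(http_requests_total{status_code=~"5.."}[30m]))
              / sum by (service) (rate(http_requests_total[30m]))
          ) < 0.999
        labels:
          severity: critical
        annotations:
          summary: 'Availability below 99.9% for {{ $labels.service }}'
          runbook: 'https://runbooks.example.com/low-availability'   # placeholder URL
```

The availability rule has no `for` clause because the 30-minute range already smooths short spikes. A file like this can be checked with `promtool check rules <file>` before it is loaded.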

Why

Users experience symptoms (slowness, errors, unavailability), not causes (CPU spikes). Alerting on symptoms ties each page to measurable user impact, which is what on-call attention should be reserved for.

Gotchas

Some causes (disk full, certificate expiry in 7 days) must be alerted before they become symptoms — use predictive alerts for these
The 'for' duration prevents flapping — set it to at least 2x your evaluation interval
Always include a runbook link in alert annotations — on-call engineers need context, not just an alert name
Multi-window multi-burn-rate alerts (from the SRE workbook) are more sophisticated but require more setup
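The predictive and burn-rate gotchas above might look roughly like the rules below. This is a sketch under several assumptions: node_exporter provides `node_filesystem_avail_bytes`, blackbox_exporter provides `probe_ssl_earliest_cert_expiry`, the service has a 99.9% SLO over 30 days, and `http_requests_total` carries `service` and `status_code` labels. Alert names, severities, and lookback windows are illustrative.

```yaml
# Sketch; exporters, thresholds, and the 99.9% SLO are assumptions
- alert: DiskWillFillIn4Hours
  # Linear extrapolation of the last 6h of free space, 4h into the future
  expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: critical
  annotations:
    summary: 'Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 4h'

- alert: CertificateExpiresWithin7Days
  # Expiry timestamp minus current time, compared with 7 days in seconds
  expr: probe_ssl_earliest_cert_expiry - time() < 7 * 24 * 3600
  labels:
    severity: ticket   # hypothetical value; a week of lead time rarely needs a page
  annotations:
    summary: 'TLS certificate for {{ $labels.instance }} expires in under 7 days'

- alert: ErrorBudgetFastBurn
  # A 14.4x burn rate spends 2% of a 30-day error budget in 1 hour.
  # 0.001 is the error budget for a 99.9% SLO. The 5m window must also
  # exceed the threshold, so the alert clears soon after errors stop.
  expr: |
    (
      sum by (service) (rate(http_requests_total{status_code=~"5.."}[1h]))
        / sum by (service) (rate(http_requests_total[1h])) > 14.4 * 0.001
    )
    and
    (
      sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
        / sum by (service) (rate(http_requests_total[5m])) > 14.4 * 0.001
    )
  labels:
    severity: critical
  annotations:
    summary: 'Fast error-budget burn for {{ $labels.service }}'
```

The SRE workbook pairs this fast-burn rule with slower windows (for example 6h and 3d) at lower burn rates. Only the fastest pair is shown here to keep the sketch short.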

Context

Designing or auditing an alerting strategy for a production service
