Alerting best practice: alert on symptoms, not causes
alert fatigue, symptom alerting, cause alerting, on-call, SRE alerting, burn rate, runbook, prometheus alerting rules
Problem
Alert fatigue results from paging on-call engineers for low-level technical causes (CPU > 80%, pod restarts, high disk I/O) that may or may not affect users. Engineers end up responding to dozens of alerts that resolve on their own, which erodes trust in the alerting system.
Solution
Design alerts around user-visible symptoms, not internal implementation details.
Symptom-based (correct):
- Error rate > 1% for 5 minutes
- p99 latency > 2 seconds for 10 minutes
- Availability < 99.9% over 30 minutes
Cause-based (avoid as primary alerts):
- CPU > 80% for 5 minutes
- Memory > 70%
- Pod restart count > 5
Cause-based metrics are valuable as dashboard panels and as secondary context in runbooks, but should only page on-call when they directly correlate to a user impact that symptom alerts would miss.
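The latency threshold from the symptom list above can be written as a Prometheus rule in the same way. This is a minimal sketch, assuming request durations are exported as a histogram named `http_request_duration_seconds` with a `service` label; the metric name, the group name, and the runbook URL are illustrative assumptions, not taken from a real deployment.

```yaml
groups:
  - name: symptom-alerts  # hypothetical group name
    rules:
      - alert: HighP99Latency
        # p99 per service, computed from histogram buckets over a 5m rate window.
        # Assumes a histogram metric http_request_duration_seconds (not in the original).
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'p99 latency above 2s for {{ $labels.service }}'
          runbook: 'https://runbooks.example.com/high-p99-latency'  # placeholder URL
```

A rule file of this shape can be validated with `promtool check rules <file>` before it is loaded.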
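# Good Prometheus alert: symptom-based
- alert: HighErrorRate
  # Aggregate per service so the 5xx rate is divided by the total request rate.
  # Without the sum, each 5xx series is only matched against itself and the ratio is always 1.
  expr: |
    sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
      / sum by (service) (rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'Error rate above 1% for {{ $labels.service }}'
    runbook: 'https://runbooks.example.com/high-error-rate'
Why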
Users experience symptoms (slowness, errors, unavailability), not causes (CPU spikes). Alerting on symptoms ensures that every page corresponds to real user impact, which is what ultimately matters.
Gotchas
- Some causes (disk full, certificate expiry in 7 days) must be alerted before they become symptoms — use predictive alerts for these
- The 'for' duration prevents flapping — set it to at least 2x your evaluation interval
- Always include a runbook link in alert annotations — on-call engineers need context, not just an alert name
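- Multi-window, multi-burn-rate alerts (from the Google SRE Workbook) page on how fast the error budget is being consumed rather than on a fixed threshold; they reduce false positives further but require a defined SLO and more rule setup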
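The predictive and burn-rate gotchas above can be sketched as Prometheus rules. This is an illustration under stated assumptions, not a drop-in configuration: it assumes node_exporter (`node_filesystem_avail_bytes`) and blackbox_exporter (`probe_ssl_earliest_cert_expiry`) are being scraped, that `http_requests_total` carries `service` and `status_code` labels as in the example earlier, and that the SLO is 99.9% availability over 30 days. Group names, alert names, severities, windows, and runbook URLs are hypothetical.

```yaml
groups:
  - name: predictive-alerts  # hypothetical group name
    rules:
      - alert: DiskWillFillIn4Hours
        # Linear extrapolation of the last 6h of free space, 4h into the future.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: 'Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is projected to fill within 4 hours'
          runbook: 'https://runbooks.example.com/disk-filling'  # placeholder URL
      - alert: CertificateExpiresWithin7Days
        # probe_ssl_earliest_cert_expiry is a Unix timestamp; subtracting time() gives seconds remaining.
        expr: probe_ssl_earliest_cert_expiry - time() < 7 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: 'TLS certificate for {{ $labels.instance }} expires in less than 7 days'
          runbook: 'https://runbooks.example.com/cert-expiry'  # placeholder URL
  - name: burn-rate-alerts  # hypothetical group name
    rules:
      - alert: ErrorBudgetFastBurn
        # Assumed SLO: 99.9% over 30 days, so the allowed error ratio is 0.001.
        # A burn rate of 14.4 spends about 2% of the 30-day budget in 1 hour.
        # The long window (1h) shows the burn is significant; the short window (5m)
        # confirms it is still happening, so the alert clears soon after recovery.
        expr: |
          (
            sum by (service) (rate(http_requests_total{status_code=~"5.."}[1h]))
              / sum by (service) (rate(http_requests_total[1h]))
            > (14.4 * 0.001)
          )
          and
          (
            sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
              / sum by (service) (rate(http_requests_total[5m]))
            > (14.4 * 0.001)
          )
        labels:
          severity: critical
        annotations:
          summary: 'Error budget for {{ $labels.service }} is burning at more than 14.4x the sustainable rate'
          runbook: 'https://runbooks.example.com/error-budget-burn'  # placeholder URL
```

The burn-rate rule is only the fast-burn pair; a fuller setup would typically add slower window pairs with lower burn-rate thresholds that open tickets instead of paging.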
Context
Designing or auditing an alerting strategy for a production service