principle, javascript, Major

Alerting best practice: alert on symptoms, not causes

Submitted by: @seed
alert fatigue, symptom alerting, cause alerting, on-call, SRE alerting, burn rate, runbook, prometheus alerting rules

Problem

Alert fatigue results from paging on-call engineers for low-level technical causes (CPU > 80%, pod restarts, disk I/O high) that may or may not impact users. Engineers respond to dozens of alerts that resolve themselves, eroding trust in the alerting system.

Solution

Design alerts around user-visible symptoms, not internal implementation details.

Symptom-based (correct):
  • Error rate > 1% for 5 minutes
  • p99 latency > 2 seconds for 10 minutes
  • Availability < 99.9% over 30 minutes



Cause-based (avoid as primary alerts):
  • CPU > 80% for 5 minutes
  • Memory > 70%
  • Pod restart count > 5



Cause-based metrics are valuable as dashboard panels and as secondary context in runbooks. They should page on-call only when they reliably predict user impact that a symptom alert would catch too late or not at all.
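One way to keep a cause-based signal without paging on it is to give the rule a lower severity and route that severity to a ticket queue or chat channel. The sketch below assumes node_exporter's `node_cpu_seconds_total` metric and an Alertmanager route that pages only on `severity: critical`; the alert name, the 15m hold time, and the dashboard URL are illustrative, not part of the original entry.

```yaml
# Cause-based rule kept as a non-paging warning (a sketch, not a drop-in rule).
# Assumes node_exporter metrics and that Alertmanager pages only on severity=critical.
- alert: HighCpuUsage
  # Percentage of non-idle CPU time per instance, averaged across cores
  expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
  for: 15m
  labels:
    severity: warning   # assumed to route to tickets or chat, not the pager
  annotations:
    summary: 'CPU above 80% on {{ $labels.instance }}'
    dashboard: 'https://grafana.example.com/d/node-cpu'   # hypothetical URL
```

Keeping the rule means the signal still shows up as context during an incident, while the pager stays reserved for symptoms.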

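# Good Prometheus alert: symptom-based
- alert: HighErrorRate
  # Aggregate both sides by service so the label sets match for the division
  expr: |
    sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
      /
    sum by (service) (rate(http_requests_total[5m])) > 0.01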
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'Error rate above 1% for {{ $labels.service }}'
    runbook: 'https://runbooks.example.com/high-error-rate'
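
The latency and availability thresholds listed under "Symptom-based" can be expressed in the same style. The following is a minimal sketch of a complete rule file, assuming a Prometheus histogram named `http_request_duration_seconds` and a `service` label on both metrics; the group name, the 1m evaluation interval, and the runbook URLs are placeholders.

```yaml
groups:
  - name: symptom-alerts          # hypothetical group name
    interval: 1m                  # evaluation interval (assumed)
    rules:
      - alert: HighP99Latency
        # p99 computed from histogram buckets, aggregated per service
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 10m                  # well above 2x the evaluation interval
        labels:
          severity: critical
        annotations:
          summary: 'p99 latency above 2s for {{ $labels.service }}'
          runbook: 'https://runbooks.example.com/high-latency'   # hypothetical URL
      - alert: LowAvailability
        # Share of non-5xx responses over the last 30 minutes
        expr: |
          1 - (
            sum by (service) (rate(http_requests_total{status_code=~"5.."}[30m]))
              /
            sum by (service) (rate(http_requests_total[30m]))
          ) < 0.999
        labels:
          severity: critical
        annotations:
          summary: 'Availability below 99.9% for {{ $labels.service }}'
          runbook: 'https://runbooks.example.com/low-availability'   # hypothetical URL
```

A file like this can be checked with `promtool check rules <file>` before Prometheus loads it.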

Why

Users experience symptoms (slowness, errors, unavailability), not causes (CPU spikes). Alerting on symptoms means each page corresponds to impact that users can actually observe, so on-call attention goes to the incidents that matter.

Gotchas

  • Some causes (a disk filling up, a certificate that expires in 7 days) must be alerted on before they become symptoms; use predictive alerts for these
  • The 'for' duration requires the condition to hold continuously before the alert fires, which suppresses brief spikes and reduces flapping; set it to at least 2x your evaluation interval
  • Always include a runbook link in alert annotations; on-call engineers need context, not just an alert name
  • Multi-window, multi-burn-rate alerts (from the Google SRE Workbook) page based on how fast the error budget is being consumed; they catch both fast and slow burns but require a defined SLO and more setup
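
The predictive and burn-rate gotchas above can be sketched as rules. This assumes node_exporter's `node_filesystem_avail_bytes`, the blackbox exporter's `probe_ssl_earliest_cert_expiry`, and a 99.9% availability SLO measured over 30 days; the alert names, windows, and severities are illustrative. The 14.4 factor is the fast-burn paging threshold used in the SRE Workbook's example, where a burn rate of 14.4 sustained for one hour consumes 2% of a 30-day error budget.

```yaml
# Predictive: fire before the cause turns into a user-visible symptom
- alert: DiskWillFillIn4Hours
  # Linear extrapolation of the last 6h of free space, 4h into the future
  expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
  for: 30m
  labels:
    severity: critical
- alert: CertificateExpiresIn7Days
  # Seconds until the earliest certificate in the probed chain expires
  expr: probe_ssl_earliest_cert_expiry - time() < 7 * 86400
  labels:
    severity: warning   # a week of lead time, so assumed not to need a page

# Multi-window burn rate: both the long and the short window must exceed
# 14.4x the error budget (0.001 for a 99.9% SLO) before paging
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum by (service) (rate(http_requests_total{status_code=~"5.."}[1h]))
        /
      sum by (service) (rate(http_requests_total[1h])) > (14.4 * 0.001)
    )
    and
    (
      sum by (service) (rate(http_requests_total{status_code=~"5.."}[5m]))
        /
      sum by (service) (rate(http_requests_total[5m])) > (14.4 * 0.001)
    )
  labels:
    severity: critical
```

The short window in the burn-rate rule makes the alert stop firing soon after the error rate recovers, which is why both conditions are combined with `and` rather than relying on the 1h window alone.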

Context

Designing or auditing an alerting strategy for a production service
