Monitoring alerts: alert on symptoms, not causes, for actionable on-call pages
Tags: monitoring, alerting, error rate, latency, p99, prometheus, SLO
Problem
Alerts fire on internal metrics (CPU > 80%, memory > 70%, queue depth > 1000) that do not directly indicate user impact. On-call engineers get paged at 3am for CPU spikes that caused no visible degradation.
Solution
Alert on user-facing symptoms: error rate, p99 latency, and availability. Internal metrics (CPU, memory) become dashboards for diagnosis, not pages.
# Prometheus alerting rule example
groups:
  - name: user-facing
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 2 minutes"
      - alert: HighLatency
        # Aggregate by "le" so the quantile is computed over the merged
        # histogram rather than per series
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
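For intuition, histogram_quantile estimates a quantile by linear interpolation inside cumulative histogram buckets. A minimal Python sketch of that estimate, using made-up bucket counts (the function and data here are illustrative, not Prometheus internals):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with the last bound being +Inf, mirroring Prometheus "le" buckets.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # No finite upper bound: fall back to the previous bound
                return prev_bound
            # Linear interpolation within the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical request-duration buckets: 1000 requests total,
# 990 of them completed within 2 seconds.
buckets = [(0.1, 800), (0.5, 950), (2.0, 990), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, buckets)  # rank 990 lands at the 2.0s bound
```

The interpolation is why bucket boundaries matter: a p99 estimate can never be more precise than the bucket it falls into.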
Why
A user cannot feel your CPU usage. They can feel a 5xx response or a 10-second page load. Symptom-based alerts directly correlate to SLA violations and customer impact.
Gotchas
- The `for` duration prevents false positives from brief spikes; use at least 2m for critical alerts
- Alert fatigue from too many firing alerts trains engineers to ignore all of them; be conservative with thresholds
- Add a dead man's switch alert that fires when no data is received: a silent system can mean monitoring is broken, not that things are fine
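A common way to implement the dead man's switch gotcha is an always-firing Prometheus alert routed to an external heartbeat service that pages when the alert stops arriving. A minimal sketch; the group name and routing are assumptions, not part of this card:

```yaml
# Always-firing heartbeat alert. Route it to a dead man's switch
# service that pages when this alert STOPS being delivered.
groups:
  - name: meta
    rules:
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat; absence means the alerting pipeline is down"
```

Because the alert is unconditional, its disappearance can only mean Prometheus, Alertmanager, or the delivery path is broken.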