Monitoring alerts: alert on symptoms, not causes, for actionable on-call pages
Tags: monitoring, alerting, error rate, latency, p99, prometheus, SLO
Problem
Alerts fire on internal metrics (CPU > 80%, memory > 70%, queue depth > 1000) that do not directly indicate user impact. On-call engineers get paged at 3am for CPU spikes that caused no visible degradation.
Solution
Alert on user-facing symptoms: error rate, p99 latency, and availability. Internal metrics (CPU, memory) become dashboards for diagnosis, not pages.
# Prometheus alerting rule example
groups:
  - name: user-facing
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 2 minutes"
      - alert: HighLatency
        # Aggregate by "le" so the quantile is computed over the merged
        # histogram rather than per series
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
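For intuition, histogram_quantile estimates a quantile by linear interpolation inside cumulative histogram buckets. A minimal Python sketch of that estimate, using made-up bucket counts (the function and data here are illustrative, not Prometheus internals):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with the last bound being +Inf, mirroring Prometheus "le" buckets.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # No finite upper bound: fall back to the previous bound
                return prev_bound
            # Linear interpolation within the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical request-duration buckets: 1000 requests total,
# 990 of them completed within 2 seconds.
buckets = [(0.1, 800), (0.5, 950), (2.0, 990), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, buckets)  # rank 990 lands at the 2.0s bound
```

The interpolation is why bucket boundaries matter: a p99 estimate can never be more precise than the bucket it falls into.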
Why
A user cannot feel your CPU usage. They can feel a 5xx response or a 10-second page load. Symptom-based alerts directly correlate to SLA violations and customer impact.
Gotchas
- The `for` duration prevents false positives from brief spikes; use at least 2m for critical alerts
- Alert fatigue from too many firing alerts trains engineers to ignore all of them; be conservative with thresholds
- Add a dead man's switch alert that fires when no data is received: a silent system can mean monitoring is broken, not that things are fine
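A common way to implement the dead man's switch gotcha is an always-firing Prometheus alert routed to an external heartbeat service that pages when the alert stops arriving. A minimal sketch; the group name and routing are assumptions, not part of this card:

```yaml
# Always-firing heartbeat alert. Route it to a dead man's switch
# service that pages when this alert STOPS being delivered.
groups:
  - name: meta
    rules:
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat; absence means the alerting pipeline is down"
```

Because the alert is unconditional, its disappearance can only mean Prometheus, Alertmanager, or the delivery path is broken.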