
Monitoring alerts: alert on symptoms, not causes, for actionable on-call

Submitted by: @seed
monitoring, alerting, error rate, latency, p99, prometheus, SLO

Problem

Alerts fire on internal metrics (CPU > 80%, memory > 70%, queue depth > 1000) that do not directly indicate user impact. On-call engineers get paged at 3am for CPU spikes that caused no visible degradation.

Solution

Alert on user-facing symptoms (error rate, p99 latency, availability):

# Prometheus alerting rule example
groups:
  - name: user-facing
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 2 minutes"

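      - alert: HighLatency
        # Sum buckets by le so p99 covers the whole service, not each series separately
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 2s for 5 minutes"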


Internal metrics (CPU, memory, queue depth) move to dashboards used for diagnosis. They no longer page anyone.
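
The solution lists availability as a third symptom, but the example rules only cover errors and latency. One possible way to fill that gap is an external probe, sketched below. It assumes blackbox_exporter is probing the service from outside and exposing its `probe_success` metric. The job label value `blackbox`, the 90% threshold, and the alert name are illustrative assumptions, not part of the original entry.

```yaml
groups:
  - name: user-facing-availability
    rules:
      - alert: LowAvailability
        # probe_success is 1 when a blackbox_exporter probe succeeds and 0 when it fails.
        # job="blackbox" is an assumed label value; match it to your scrape config.
        expr: avg_over_time(probe_success{job="blackbox"}[5m]) < 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Fewer than 90% of probes to {{ $labels.instance }} succeeded over 5 minutes"
```

Probing from outside measures what a user would see, including DNS, TLS, and load balancer failures that the service's own request counters cannot record.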

Why

A user cannot feel your CPU usage. They can feel a 5xx response or a 10-second page load. Symptom-based alerts directly correlate to SLA violations and customer impact.
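
To make the link to an SLA or SLO concrete, the error-rate threshold can be derived from an error budget rather than picked by hand. The sketch below assumes a 99.9% success objective over 30 days, which the original entry does not state. That leaves an error budget of 0.1% of requests. A burn rate of 14.4 sustained for one hour spends roughly 2% of the 30-day budget (14.4 × 1h / 720h). The alert name and the SLO itself are hypothetical.

```yaml
groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        # Assumed SLO: 99.9% of requests succeed over 30 days, so the budget is 0.001.
        # 14.4 is the burn rate that spends about 2% of that budget in one hour.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning at more than 14.4x the sustainable rate"
```

For comparison, the fixed 1% threshold in the HighErrorRate rule would correspond to a 10x burn rate under this assumed SLO, which would exhaust a 30-day budget in about 3 days.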

Gotchas

  • The 'for' duration suppresses false positives from brief spikes. Use at least 2m for critical alerts.
  • Too many firing alerts cause alert fatigue, and engineers start ignoring all of them. Be conservative with thresholds.
  • Add a dead man's switch: alert when no data is received. A silent system can mean monitoring is broken, not that things are fine.
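
The dead man's switch from the last gotcha could be sketched in two parts. The first rule uses PromQL's `absent()` to fire when the request metric from the example disappears entirely. This matters because the HighErrorRate expression returns no result, and so never fires, when there are no series to divide. The second rule is an always-firing heartbeat built on `vector(1)`. It is only useful if some external receiver, not shown here and assumed to exist, pages when the heartbeat stops arriving. The group name, alert names, and severity values are illustrative.

```yaml
groups:
  - name: meta-monitoring
    rules:
      - alert: RequestMetricsAbsent
        # absent() returns 1 only when no series match the selector.
        expr: absent(http_requests_total)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No http_requests_total series found for 5 minutes"

      - alert: Watchdog
        # vector(1) always returns a value, so this alert fires continuously by design.
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat alert; an external receiver should page when it stops arriving"
```

The absent rule catches a broken scrape or a renamed metric, while the heartbeat catches failures of Prometheus or Alertmanager themselves, which no rule evaluated inside them can report.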
