Prometheus alerting rules best practices

Submitted by: @anonymous · Mar 2, 2026

prometheus alerts, alerting rules, alert fatigue, sre alerting, histogram_quantile

Problem

Alerting rules need to catch real problems without causing alert fatigue.

Solution

Prometheus alerting patterns:

# rules/alerts.yml
groups:
  - name: application
    rules:
      # HIGH ERROR RATE
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m  # Must persist for 5 minutes
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
          description: "{{ $value | humanizePercentage }} of requests failing"

      # HIGH LATENCY
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 1 second"

      # DISK RUNNING OUT
      - alert: DiskSpaceLow
        expr: |
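          # Skip pseudo filesystems, which tend to produce noisy or meaningless ratios
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) < 0.1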
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }} ({{ $labels.mountpoint }})"

      # INSTANCE DOWN
      - alert: InstanceDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
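
The static 10% threshold in DiskSpaceLow says nothing about how quickly a disk is filling. As a complement, here is a minimal sketch of a trend-based rule using PromQL's predict_linear(). The file name, group name, alert name, the 6h lookback, the 4h horizon, and the fstype filter are all assumptions to tune per environment, not values from the original rules.

```yaml
# rules/disk_forecast.yml (hypothetical file name)
groups:
  - name: capacity            # assumed group name
    rules:
      # Fit a linear trend over the last 6h of free space and extrapolate
      # 4h (4 * 3600 seconds) ahead; a value below 0 means "predicted full".
      - alert: DiskWillFillIn4Hours
        expr: |
          predict_linear(
            node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h],
            4 * 3600
          ) < 0
        for: 30m               # the trend must hold, not just spike once
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 4 hours"
```

A trend rule like this tends to fire earlier on fast-growing volumes and stay quiet on disks that sit at a stable 92% for months, which is the kind of noise the static threshold can produce.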

Best practices:

Always use for: duration to avoid flapping alerts
Use rate() over a time window on counters, not raw counter values or single instant samples
Alert on symptoms (error rate), not causes (CPU high)
Two severities: critical (page someone) and warning (review next day)
Include actionable descriptions
Validate rule files with promtool check rules, and unit-test alert behavior with promtool test rules
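
To make the testing advice concrete, here is a sketch of a promtool unit test for the InstanceDown rule above. The test file path, the job="api" label, and the instance name are made up for illustration. The rule_files path assumes the rules live in rules/alerts.yml next to a tests/ directory.

```yaml
# tests/alerts_test.yml (hypothetical path)
rule_files:
  - ../rules/alerts.yml        # resolved relative to this test file

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The target reports up == 0 from t=0 through t=10m.
      - series: 'up{job="api", instance="api-1:9090"}'
        values: '0+0x10'
    alert_rule_test:
      # After 1 minute the alert is only pending (for: 3m), so nothing fires.
      - eval_time: 1m
        alertname: InstanceDown
        exp_alerts: []
      # After 5 minutes the for: duration has elapsed and the alert fires.
      - eval_time: 5m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: api-1:9090
            exp_annotations:
              summary: "api-1:9090 is down"
```

Run syntax validation with promtool check rules rules/alerts.yml and the unit test with promtool test rules tests/alerts_test.yml. The first only catches malformed rules. The second checks that the alert actually fires, and stays silent, when expected.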

Anti-patterns:

Alerting on every metric (alert fatigue)
No for: duration (fires on blips)
Alerts nobody responds to (remove them)
Too many critical alerts (everything becomes noise)
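
The severity labels only reduce noise if they are routed differently. Below is a minimal Alertmanager routing sketch under that assumption. The receiver names, grouping labels, and timing values are illustrative, and the receivers are left empty on purpose: the actual pager or chat integration has to be filled in.

```yaml
# alertmanager.yml (sketch; receiver names are hypothetical)
route:
  receiver: team-review            # default for anything unmatched
  group_by: ['alertname', 'job']   # batch related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # critical: page someone
    - matchers:                    # matchers syntax needs Alertmanager 0.22+
        - severity="critical"
      receiver: on-call-pager
    # warning: review next day, so repeat far less often
    - matchers:
        - severity="warning"
      receiver: team-review
      repeat_interval: 24h

receivers:
  # Add the real integration (pager, chat, email) under each receiver.
  - name: on-call-pager
  - name: team-review
```

Keeping the paging route limited to severity="critical" makes it easy to audit how many rules can wake someone up, which helps keep that set small.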

Why

The Google SRE book's guidance on paging boils down to this: every page should be actionable, should require intelligence to handle, and should concern a novel problem. Bad alerting causes fatigue, and fatigue causes missed real incidents.

Context

Production monitoring and alerting
