
Prometheus alerting rules best practices

Submitted by: @anonymous
Tags: prometheus alerts, alerting rules, alert fatigue, sre alerting, histogram_quantile

Problem

Alerting rules need to catch real problems without causing alert fatigue.

Solution

Prometheus alerting patterns:

# rules/alerts.yml
groups:
  - name: application
    rules:
      # HIGH ERROR RATE
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m  # Must persist for 5 minutes
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
          description: "{{ $value | humanizePercentage }} of requests failing"

      # HIGH LATENCY
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 1 second"

      # DISK RUNNING OUT
      - alert: DiskSpaceLow
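        # Skip in-memory and overlay filesystems, which tend to produce noisy alerts
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }} ({{ $labels.mountpoint }})"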

      # INSTANCE DOWN
      - alert: InstanceDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
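
The rule file above only takes effect once Prometheus is told to load it. A minimal sketch of the relevant part of prometheus.yml follows; the 1m evaluation interval and the Alertmanager address localhost:9093 are assumptions for illustration, not values from this entry.

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 1m      # assumed; how often alert expressions are evaluated

rule_files:
  - rules/alerts.yml           # the rule file shown above

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']   # assumed Alertmanager address
```

The for: durations in the rules are measured against these evaluations, so a short for: combined with a long evaluation interval leaves very few samples to confirm the condition.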


Best practices:
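  • Always set a for: duration so brief blips do not fire or flap alerts
  • Apply rate() over a time window to counters; do not alert on raw counter values
  • Alert on symptoms (error rate, latency), not causes (high CPU)
  • Use two severities: critical (pages someone) and warning (reviewed the next working day)
  • Write actionable descriptions that say what is failing and where
  • Validate rule syntax with promtool check rules and unit-test alert behavior with promtool test rules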
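
promtool check rules only validates syntax and expressions. To check that an alert actually fires when expected, promtool test rules can replay synthetic series against the rule file. The sketch below tests InstanceDown; the file name rules/alerts_test.yml, the job name api, and the instance api-1:9090 are hypothetical, and the test file is assumed to sit next to alerts.yml.

```yaml
# rules/alerts_test.yml (hypothetical file, same directory as alerts.yml)
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Target reports down (0) from t=0 through t=5m
      - series: 'up{job="api", instance="api-1:9090"}'
        values: '0x5'
    alert_rule_test:
      # After 2m the alert is still pending (for: 3m), so nothing fires yet
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts: []
      # After 4m the for: duration has elapsed and the alert fires
      - eval_time: 4m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: api-1:9090
            exp_annotations:
              summary: "api-1:9090 is down"
```

Run it with promtool test rules rules/alerts_test.yml. Checking both the pending and the firing state guards against someone shortening or removing the for: clause later.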



Anti-patterns:
  • Alerting on every metric (alert fatigue)
  • No for: duration (fires on blips)
  • Alerts nobody responds to (remove them)
  • Too many critical alerts (everything becomes noise)
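
One way to keep critical alerts rare and meaningful is to route the two severities to different destinations in Alertmanager. The following is a sketch only: the receiver names and webhook URLs are hypothetical placeholders, and the intervals are assumptions to tune per team.

```yaml
# alertmanager.yml (fragment)
route:
  receiver: team-review            # default receiver for anything unmatched
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-pager       # pages a human immediately
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: team-review        # reviewed the next working day
      repeat_interval: 24h

receivers:
  - name: oncall-pager
    webhook_configs:
      - url: 'http://pager-gateway.example.internal/notify'    # hypothetical
  - name: team-review
    webhook_configs:
      - url: 'http://ticket-gateway.example.internal/notify'   # hypothetical
```

If an alert routed to oncall-pager is regularly acknowledged without action, that is a signal to downgrade it to warning or delete it.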

Why

The Google SRE book's guidance on paging is that every page should be actionable, should require human intelligence to handle, and should concern a novel problem. Bad alerting causes fatigue, and fatigue causes real incidents to be missed.

Context

Production monitoring and alerting
