Prometheus alerting rules best practices
Tags: prometheus alerts, alerting rules, alert fatigue, sre alerting, histogram_quantile
Problem
Need effective alerting rules that catch real problems without causing alert fatigue.
Solution
Prometheus alerting patterns:
# rules/alerts.yml
groups:
  - name: application
    rules:
      # HIGH ERROR RATE
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m  # Must persist for 5 minutes
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
          description: "{{ $value | humanizePercentage }} of requests failing"

      # HIGH LATENCY
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 1 second"

      # DISK RUNNING OUT
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"

      # INSTANCE DOWN
      - alert: InstanceDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"

Best practices:
- Always use for: duration to avoid flapping alerts
- Use rate() over time windows, not instant values
- Alert on symptoms (error rate), not causes (CPU high)
- Two severities: critical (page someone) and warning (review next day)
- Include actionable descriptions
- Test alerts with promtool check rules
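
promtool check rules only validates that the rule file parses. To check that an alert fires when expected, promtool test rules can run unit tests against synthetic series. The following is a minimal sketch for the InstanceDown rule, assuming the rules live in rules/alerts.yml. The test file name, job label, and instance value are hypothetical.

```yaml
# rules/alerts_test.yml (hypothetical file name, placed next to alerts.yml)
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  # Case 1: a 2-minute blip. The target recovers before "for: 3m" elapses,
  # so no InstanceDown alert should be firing.
  - interval: 1m
    input_series:
      - series: 'up{job="app", instance="app-1:9090"}'   # hypothetical labels
        values: '1 0 0 1 1 1 1'
    alert_rule_test:
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts: []

  # Case 2: a sustained outage. up is 0 from minute 1 onward,
  # so the alert should be firing by minute 5.
  - interval: 1m
    input_series:
      - series: 'up{job="app", instance="app-1:9090"}'
        values: '1 0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 5m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: app
              instance: app-1:9090
            exp_annotations:
              summary: "app-1:9090 is down"
```

Under these assumptions the tests would be run with promtool test rules rules/alerts_test.yml. The first case also illustrates why for: matters: without it, the 2-minute blip would fire immediately.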
Anti-patterns:
- Alerting on every metric (alert fatigue)
- No for: duration (fires on blips)
- Alerts nobody responds to (remove them)
- Too many critical alerts (everything becomes noise)
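
One way to keep critical alerts from turning into noise is to let only severity: critical reach the pager and send everything else to a queue that is reviewed later. The following Alertmanager routing sketch illustrates the idea. Alertmanager is not covered by the rules file above, and the receiver names and webhook URLs are hypothetical placeholders.

```yaml
# alertmanager.yml (sketch; receiver names and URLs are hypothetical)
route:
  receiver: ticket-queue            # default route: warnings end up here
  group_by: [alertname, instance]
  routes:
    - matchers:
        - severity="critical"       # only critical alerts page someone
      receiver: on-call-pager

receivers:
  - name: on-call-pager
    webhook_configs:
      - url: http://pager.example.internal/hook     # hypothetical endpoint
  - name: ticket-queue
    webhook_configs:
      - url: http://tickets.example.internal/hook   # hypothetical endpoint
```

With a split like this, promoting an alert to critical is a deliberate decision to page someone, which makes it easier to keep that set small.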
Why
The Google SRE book argues that every page should be actionable, should require intelligence to respond to, and should be about a novel problem. Bad alerting causes fatigue, and fatigue causes real incidents to be missed.
Context
Production monitoring and alerting