principle, javascript, Major
Monitoring basics: what to instrument and alert on
monitoring, golden signals, alerting, SLO, latency, error rate, observability, Prometheus
Problem
Applications go down or degrade and nobody notices until users complain. Or alerts fire constantly on noise, training teams to ignore them.
Solution
Instrument the four golden signals:
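// Assumes an Express app; `metrics` stands for your metrics client
// (any StatsD-style client exposing histogram(name, value, tags)).
const express = require('express');
const responseTime = require('response-time');
const app = express();

// 1. Latency: record every response time in a histogram so the backend
// can derive p50, p95, and p99
app.use(responseTime((req, res, time) => {
  metrics.histogram('http.response_time_ms', time, {
    method: req.method,
    // Use the route pattern, not the raw URL, to keep label cardinality low
    route: req.route?.path || 'unknown',
    status: res.statusCode
  });
}));
// 2. Error rate: 5xx responses as a percentage of all requests
// 3. Traffic: requests per second
// 4. Saturation: CPU, memory, DB connection pool usage
// Alert thresholds (start conservative)
// - Error rate > 1% for 5 minutes
// - p99 latency > 2s for 5 minutes
// - Disk usage > 80%
// - Memory usage > 90% for 10 minutes
Minimal monitoring stack: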
- Metrics: Prometheus + Grafana (self-hosted) or Datadog
- Uptime: UptimeRobot (free tier covers basics)
- Errors: Sentry
- Logs: Loki or CloudWatch
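The snippet above only implements the latency signal; signals 2 to 4 appear as comments. One possible way to fill them in is sketched below. `RequestStats` and `saturation` are hypothetical names introduced for illustration, and the in-memory sliding window is an assumption made to keep the example self-contained; a real setup would more likely use counters and gauges from the metrics client.

```javascript
const os = require('os');

// Hypothetical in-memory tracker (not part of any library). In production
// these would be counters in your metrics backend, not an array in memory.
class RequestStats {
  constructor(windowMs = 60_000) {
    this.windowMs = windowMs;
    this.events = []; // { at: epoch ms, status: HTTP status code }
  }

  // Call once per finished request, e.g. from res.on('finish').
  record(status, at = Date.now()) {
    this.events.push({ at, status });
  }

  // Drop events that fall outside the sliding window ending at `now`.
  recent(now = Date.now()) {
    this.events = this.events.filter((e) => now - e.at < this.windowMs);
    return this.events;
  }

  // 2. Error rate: share of requests in the window that returned 5xx.
  errorRate(now = Date.now()) {
    const events = this.recent(now);
    if (events.length === 0) return 0;
    const errors = events.filter((e) => e.status >= 500).length;
    return errors / events.length;
  }

  // 3. Traffic: average requests per second over the window.
  requestsPerSecond(now = Date.now()) {
    return this.recent(now).length / (this.windowMs / 1000);
  }
}

// 4. Saturation: how full the host is. Load average is divided by core
// count; note that os.loadavg() always returns zeros on Windows.
function saturation() {
  return {
    memoryUsedRatio: 1 - os.freemem() / os.totalmem(),
    loadPerCore: os.loadavg()[0] / os.cpus().length,
  };
}
```

In an Express app, `record` could be called from a `res.on('finish', ...)` handler, and the three values exported on a timer. DB connection pool usage depends on the driver, so it is left out here.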
Why
The four golden signals (latency, errors, traffic, saturation) give the most signal about system health with the least noise. Alert on symptoms, not causes.
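The "for 5 minutes" qualifier in the thresholds is what keeps a symptom alert from firing on a single spike. The sketch below shows one way to express that idea. `breachedFor`, `rules`, and `firingAlerts` are hypothetical names, and the sample format (`{ at, value }`) is an assumption; alerting systems such as Prometheus provide this behaviour natively, so this only illustrates the logic.

```javascript
const MINUTE = 60_000;

// Hypothetical helper: true only when every sample in the trailing window
// is above the threshold AND the data reaches back to the window start.
function breachedFor(samples, threshold, durationMs, now) {
  const windowStart = now - durationMs;
  // Without history covering the full window we cannot claim the breach
  // has lasted the whole duration.
  const hasHistory = samples.some((s) => s.at <= windowStart);
  const inWindow = samples.filter((s) => s.at >= windowStart && s.at <= now);
  return hasHistory && inWindow.length > 0 &&
    inWindow.every((s) => s.value > threshold);
}

// Symptom-based rules taken from the thresholds listed in this entry.
const rules = [
  { name: 'error rate', metric: 'errorRate', threshold: 0.01, forMs: 5 * MINUTE },
  { name: 'p99 latency', metric: 'p99LatencyMs', threshold: 2000, forMs: 5 * MINUTE },
];

// `series` maps a metric name to an array of { at: epoch ms, value }.
function firingAlerts(series, now) {
  return rules
    .filter((r) => breachedFor(series[r.metric] || [], r.threshold, r.forMs, now))
    .map((r) => r.name);
}
```

A brief spike that recovers within the window produces no alert, which is the intended trade-off: slightly slower detection in exchange for far fewer false pages.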
Gotchas
- Alert on user-visible symptoms (error rate, latency), not internal metrics (CPU): high CPU is not always a problem
- Paging alerts should be actionable and require an immediate response; otherwise use a ticket or email
- Define and track your SLIs and SLOs: know what 'up' means before an incident, not during one
- Keep dashboards simple: a dashboard with 50 panels is useless during an incident
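To make the SLI/SLO point concrete, here is a small sketch of an availability SLI and its error budget. The function name `errorBudget` and the 99.9% target used in the usage note are illustrative assumptions, not values from this entry.

```javascript
// Hypothetical helper. sloTarget is a fraction, e.g. 0.999 for "99.9% of
// requests succeed over the measurement window".
function errorBudget(sloTarget, totalRequests, failedRequests) {
  // SLI: the measured success ratio. No traffic counts as fully healthy.
  const sli = totalRequests === 0 ? 1 : 1 - failedRequests / totalRequests;
  // Error budget: how many failures the SLO tolerates in this window.
  const allowedFailures = (1 - sloTarget) * totalRequests;
  let budgetRemaining;
  if (allowedFailures === 0) {
    budgetRemaining = failedRequests === 0 ? 1 : 0;
  } else {
    // Fraction of the budget still unspent; negative once the SLO is blown.
    budgetRemaining = 1 - failedRequests / allowedFailures;
  }
  return { sli, allowedFailures, budgetRemaining, sloMet: sli >= sloTarget };
}
```

With a 99.9% target and 1,000,000 requests, roughly 1,000 failures are allowed, so 250 failures leaves about three quarters of the budget. Writing this down ahead of time is one way to agree on what 'up' means before an incident.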