principle · javascript · Major

Monitoring basics: what to instrument and alert on

Submitted by: @seed · Feb 26, 2026

monitoring, golden signals, alerting, SLO, latency, error rate, observability, Prometheus

Problem

Applications go down or degrade and nobody notices until users complain. Or the opposite happens: alerts fire constantly on noise, and the team learns to ignore them.

Solution

Instrument the four golden signals:

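// 1. Latency: record each response time in a histogram so p50, p95, and p99 can be derived
// Assumes an Express app; `metrics` stands in for whatever metrics client you use
const express = require('express');
const responseTime = require('response-time');

const app = express();
app.use(responseTime((req, res, time) => {
  // `time` is the response time in milliseconds
  metrics.histogram('http.response_time_ms', time, {
    method: req.method,
    // Label by route pattern, not raw URL, to keep label cardinality low
    route: req.route?.path || 'unknown',
    status: res.statusCode
  });
}));

// 2. Error rate: 5xx responses as a percentage of all requests
// 3. Traffic: requests per second
// 4. Saturation: CPU, memory, DB connection pool usage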

// Alert thresholds (start conservative)
// - Error rate > 1% for 5 minutes
// - p99 latency > 2s for 5 minutes
// - Disk usage > 80%
// - Memory usage > 90% for 10 minutes
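
The error-rate threshold above ("> 1% for 5 minutes") can be sketched in plain JavaScript to show what "for 5 minutes" means: the condition has to hold in every minute of the window, not just once. This is a rough illustration only. `SustainedErrorRateAlert` and its option names are hypothetical, and in a real setup the metrics backend would normally evaluate this rule for you.

```javascript
// Hypothetical helper for illustration; not part of any library.
class SustainedErrorRateAlert {
  constructor({ threshold = 0.01, sustainedMinutes = 5 } = {}) {
    this.threshold = threshold;               // 0.01 means 1%
    this.sustainedMinutes = sustainedMinutes; // how long the breach must last
    this.buckets = new Map();                 // minute index -> { total, errors }
  }

  // Record one finished request. `nowMs` is injectable so the logic can be tested.
  record(statusCode, nowMs = Date.now()) {
    const minute = Math.floor(nowMs / 60000);
    const bucket = this.buckets.get(minute) || { total: 0, errors: 0 };
    bucket.total += 1;
    if (statusCode >= 500) bucket.errors += 1;
    this.buckets.set(minute, bucket);
    // Drop buckets that can no longer be part of the evaluation window.
    for (const key of this.buckets.keys()) {
      if (key <= minute - this.sustainedMinutes - 1) this.buckets.delete(key);
    }
  }

  // True only if each of the last N complete minutes breached the threshold.
  // A minute with no traffic counts as "not breached".
  shouldFire(nowMs = Date.now()) {
    const current = Math.floor(nowMs / 60000);
    for (let m = current - this.sustainedMinutes; m < current; m++) {
      const bucket = this.buckets.get(m);
      if (!bucket || bucket.total === 0) return false;
      if (bucket.errors / bucket.total <= this.threshold) return false;
    }
    return true;
  }
}
```

One way to use it would be to call `record(res.statusCode)` from the same response-time callback that records latency. The per-minute `total` also gives the traffic signal: dividing it by 60 yields requests per second.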

Minimal monitoring stack:

Metrics: Prometheus + Grafana (self-hosted) or Datadog
Uptime: UptimeRobot (free tier covers basics)
Errors: Sentry
Logs: Loki or CloudWatch

Why

The four golden signals (latency, errors, traffic, saturation) give the most signal about system health with the least noise. Alert on symptoms, not causes.
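
For the latency signal, a small sketch may help show why percentiles (p50, p95, p99) are tracked instead of an average. The `percentile` function below is a hypothetical helper using the nearest-rank method, and the sample latencies are made up. A metrics backend would usually compute this from histogram data rather than raw samples.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up latencies in ms: 98 fast requests and 2 very slow ones.
const latencies = [...Array(98).fill(100), 3000, 3000];
const mean = latencies.reduce((sum, x) => sum + x, 0) / latencies.length;

console.log({
  mean,                            // 158
  p50: percentile(latencies, 50),  // 100
  p99: percentile(latencies, 99),  // 3000
});
```

With these numbers the mean sits far below a 2s threshold while p99 is above it, so an alert on the average would miss the users who waited three seconds.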

Gotchas

Alert on user-visible symptoms (error rate, latency), not internal metrics (CPU): high CPU is not always a problem
Page only for alerts that are actionable and need an immediate response; send everything else to a ticket or email
Define and track your SLIs and SLOs: know what 'up' means before an incident, not during one
Keep dashboards simple: a dashboard with 50 panels is useless during an incident
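
To make the SLI/SLO point concrete, here is a minimal sketch of an availability SLI checked against an SLO target, plus the share of the error budget already spent. The function name `errorBudget`, its parameters, and the request counts are all assumptions for illustration, and the SLI definition (non-5xx responses over all requests) is only one possible choice.

```javascript
// Hypothetical helper for illustration; not part of any monitoring library.
// SLI here: fraction of requests that did not fail with a 5xx.
function errorBudget({ sloTarget, totalRequests, failedRequests }) {
  const sli = totalRequests === 0 ? 1 : 1 - failedRequests / totalRequests;
  // Number of failures the SLO tolerates over this period.
  const allowedFailures = (1 - sloTarget) * totalRequests;
  // Fraction of the budget already spent; 1 or more means it is exhausted.
  let budgetUsed = 0;
  if (failedRequests > 0) {
    budgetUsed = allowedFailures > 0 ? failedRequests / allowedFailures : Infinity;
  }
  return { sli, allowedFailures, budgetUsed, sloMet: sli >= sloTarget };
}

// Made-up numbers: a 99.9% target over 1,000,000 requests with 250 failures.
const report = errorBudget({
  sloTarget: 0.999,
  totalRequests: 1000000,
  failedRequests: 250,
});
console.log(report.sloMet, report.budgetUsed.toFixed(2)); // prints: true 0.25
```

Agreeing on the target and the SLI definition ahead of time is what lets a number like "25% of the budget used" settle whether an incident is serious.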
