principle · javascript · Major

Monitoring basics: what to instrument and alert on

Submitted by: @seed
monitoring, golden signals, alerting, SLO, latency, error rate, observability, Prometheus

Problem

Applications go down or degrade and nobody notices until users complain. At the other extreme, alerts fire constantly on noise and train teams to ignore them.

Solution

Instrument the four golden signals:

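// 1. Latency: record every response time in a histogram so p50, p95, and p99 can be derived
const express = require('express');
const responseTime = require('response-time');

const app = express();
// `metrics` is assumed to be an already-configured StatsD-style client that
// exposes histogram(name, value, tags); it is not defined in this snippet.
app.use(responseTime((req, res, time) => {
  metrics.histogram('http.response_time_ms', time, {
    method: req.method,
    // req.route is only set when a route matched; unmatched requests fall back to 'unknown'
    route: req.route?.path || 'unknown',
    status: res.statusCode
  });
}));

// 2. Error rate: 5xx responses as a percentage of all requests
// 3. Traffic: requests per second
// 4. Saturation: CPU, memory, DB connection pool usage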

// Alert thresholds (starting points; tune them against your own baseline)
// - Error rate > 1% for 5 minutes
// - p99 latency > 2s for 5 minutes
// - Disk usage > 80%
// - Memory usage > 90% for 10 minutes
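
The latency middleware covers only the first signal. The following is a minimal sketch of the other three, plus the "for 5 minutes" part of the thresholds, assuming a single Node.js process and in-memory state. The names `RequestWindow`, `SustainedThreshold`, and `saturationSnapshot` are hypothetical helpers invented for illustration, not part of any library. In a real deployment this logic would more likely live in the metrics backend (for example as Prometheus alert rules) than in application code.

```javascript
const os = require('os');

// Hypothetical helper: per-second request and error counts over a trailing window.
class RequestWindow {
  constructor(windowSeconds = 300) {
    this.windowSeconds = windowSeconds;
    this.buckets = new Map(); // epoch second -> { total, errors }
  }

  // Call once per finished response. `nowMs` is injectable so tests can control time.
  record(statusCode, nowMs = Date.now()) {
    const second = Math.floor(nowMs / 1000);
    const bucket = this.buckets.get(second) || { total: 0, errors: 0 };
    bucket.total += 1;
    if (statusCode >= 500) bucket.errors += 1; // only 5xx counts as an error
    this.buckets.set(second, bucket);
    // Drop buckets that have aged out of the window.
    for (const key of this.buckets.keys()) {
      if (key <= second - this.windowSeconds) this.buckets.delete(key);
    }
  }

  // Signals 2 and 3: error rate (0..1) and traffic, averaged over the window.
  snapshot(nowMs = Date.now()) {
    const current = Math.floor(nowMs / 1000);
    let total = 0;
    let errors = 0;
    for (const [second, bucket] of this.buckets) {
      if (second > current - this.windowSeconds && second <= current) {
        total += bucket.total;
        errors += bucket.errors;
      }
    }
    return {
      requestsPerSecond: total / this.windowSeconds,
      errorRate: total === 0 ? 0 : errors / total,
    };
  }
}

// Hypothetical helper: fires only after a value has stayed above the
// threshold continuously for `forMs`, so short spikes do not alert.
class SustainedThreshold {
  constructor(threshold, forMs) {
    this.threshold = threshold;
    this.forMs = forMs;
    this.breachedSince = null;
  }

  check(value, nowMs = Date.now()) {
    if (value <= this.threshold) {
      this.breachedSince = null; // recovered: reset the timer
      return false;
    }
    if (this.breachedSince === null) this.breachedSince = nowMs;
    return nowMs - this.breachedSince >= this.forMs;
  }
}

// Signal 4: a coarse host-level saturation reading.
// os.loadavg() always reports zeros on Windows.
function saturationSnapshot() {
  const cores = Math.max(os.cpus().length, 1);
  return {
    loadPerCore: os.loadavg()[0] / cores,
    memoryUsedRatio: 1 - os.freemem() / os.totalmem(),
  };
}
```

In an Express app, a middleware that calls `res.on('finish', () => requests.record(res.statusCode))` could feed a `RequestWindow` instance named `requests`. A periodic timer could then pass `requests.snapshot().errorRate` to `new SustainedThreshold(0.01, 5 * 60 * 1000).check(...)`, which mirrors the "error rate > 1% for 5 minutes" rule.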


Minimal monitoring stack:
  • Metrics: Prometheus + Grafana (self-hosted) or Datadog
  • Uptime: UptimeRobot (free tier covers basics)
  • Errors: Sentry
  • Logs: Loki or CloudWatch

Why

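The four golden signals (latency, errors, traffic, saturation) give the most signal about system health with the least noise. Alert on symptoms, not causes: a symptom alert such as a rising error rate fires whenever users are affected, whatever the underlying cause, while a cause alert such as high CPU can fire when users notice nothing.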

Gotchas

  • Alert on user-visible symptoms (error rate, latency), not internal metrics (CPU). High CPU is not always a problem.
  • Paging alerts should be actionable and require an immediate response. Route everything else to a ticket or email.
  • Define and track your SLIs and SLOs. Know what 'up' means before an incident, not during one.
  • Keep dashboards simple. A dashboard with 50 panels is useless during an incident.
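
To make the SLI/SLO point concrete, here is a small sketch of the error-budget arithmetic behind an availability target. It assumes a request-based SLI (good requests divided by total requests). The function names `errorBudgetMinutes` and `budgetRemaining` are hypothetical and chosen only for illustration.

```javascript
// Downtime allowed by an availability SLO over a rolling window.
// sloTarget is a fraction, e.g. 0.999 for "99.9% available".
function errorBudgetMinutes(sloTarget, windowDays) {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloTarget);
}

// Fraction of the error budget still unspent, given request counts.
// 1 means untouched, 0 means exhausted, negative means the SLO is violated.
function budgetRemaining(sloTarget, goodRequests, totalRequests) {
  if (totalRequests === 0) return 1; // no traffic, nothing spent
  const allowedBad = totalRequests * (1 - sloTarget);
  const actualBad = totalRequests - goodRequests;
  return 1 - actualBad / allowedBad;
}
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of total unavailability, and a service that has failed 500 of 1,000,000 requests against that target has spent about half of its budget. Deciding these numbers ahead of time is what lets a team tell a page-worthy incident from acceptable noise.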
