principle · javascript · Tip

Four Golden Signals: the minimum viable observability for any service

Submitted by: @seed
four golden signals, latency percentile, traffic RPS, error rate, saturation, Google SRE book, p99, capacity planning

Problem

Teams instrument dozens of metrics but still miss user-impacting issues because the most important signals are buried among the rest. Conversely, teams setting up a new service often don't know where to start with observability.

Solution

Instrument and dashboard the four golden signals defined in the Google SRE book. These four signals cover the most important aspects of any service's health:

  1. Latency — time to serve a request. Distinguish successful vs error latency. Use p50/p95/p99, not average.
  2. Traffic — demand on the system. Requests per second, queries per second.
  3. Errors — rate of requests that fail. Explicit (5xx) and implicit (wrong content, too slow).
  4. Saturation — how full the service is. CPU, memory, queue depth, connection pool utilization.
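As a rough illustration of the four definitions above, the sketch below derives each signal from a window of request records. The record shape (`durationMs`, `status`), the `inFlight` and `capacity` inputs, and the function names are assumptions made for this example, not part of any particular library.

```javascript
// Nearest-rank percentile over an ascending-sorted array; p is an integer 1..100.
function percentile(sorted, p) {
  if (sorted.length === 0) return null;
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// records: [{ durationMs, status }] observed during windowSeconds.
// inFlight / capacity: hypothetical saturation inputs (e.g. busy workers / pool size).
function goldenSignals(records, windowSeconds, inFlight, capacity) {
  const ok = records.filter((r) => r.status < 500);
  const failed = records.filter((r) => r.status >= 500);
  // Latency is computed over successful requests only, so fast failures do not skew it.
  const okLatencies = ok.map((r) => r.durationMs).sort((a, b) => a - b);
  const mean = okLatencies.length
    ? okLatencies.reduce((sum, d) => sum + d, 0) / okLatencies.length
    : null;
  return {
    latency: {
      mean,
      p50: percentile(okLatencies, 50),
      p95: percentile(okLatencies, 95),
      p99: percentile(okLatencies, 99),
    },
    trafficRps: records.length / windowSeconds,
    errorRatio: records.length ? failed.length / records.length : 0,
    saturation: inFlight / capacity,
  };
}

// 98 fast requests, 2 slow ones, and 5 server errors in a 10-second window.
const sample = [
  ...Array.from({ length: 98 }, () => ({ durationMs: 50, status: 200 })),
  ...Array.from({ length: 2 }, () => ({ durationMs: 5000, status: 200 })),
  ...Array.from({ length: 5 }, () => ({ durationMs: 20, status: 503 })),
];
const signals = goldenSignals(sample, 10, 8, 10);
console.log(signals);
```

With this sample the mean latency is 149 ms and p95 is 50 ms, while p99 is 5000 ms. Only the p99 figure reveals that some users waited five seconds.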



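// Minimal set of Prometheus metrics (prom-client) implementing the four signals
const client = require('prom-client');
// Traffic and errors: one counter; the error rate is derived from the status label
const requestsTotal = new client.Counter({ name: 'requests_total', help: 'Total requests served', labelNames: ['method', 'status'] });
// Latency: histogram with buckets in seconds, so percentiles can be estimated at query time
const requestDuration = new client.Histogram({ name: 'request_duration_seconds', help: 'Request duration in seconds', labelNames: ['method', 'status'], buckets: [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5] });
// Saturation: gauges updated from periodic samples
const queueDepth = new client.Gauge({ name: 'queue_depth', help: 'Items waiting in the work queue' });
const cpuUtilization = new client.Gauge({ name: 'cpu_utilization_ratio', help: 'CPU utilization as a ratio from 0 to 1' });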
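Declaring metrics is only half the job; something has to feed them on every request. The sketch below shows one possible way to do that with a wrapper around an async handler. The `instrument` function, the `record` callback, and the request/response shapes are hypothetical; in a real service the callback would forward to whatever metrics client is in use.

```javascript
// Wraps an async handler so every call reports method, status, and duration.
// `record` is a hypothetical callback standing in for the real metrics client.
function instrument(handler, record) {
  return async function instrumented(req) {
    const start = process.hrtime.bigint(); // monotonic clock, nanoseconds
    let status = 500; // assume failure unless the handler returns a response
    try {
      const res = await handler(req);
      status = res.status;
      return res;
    } finally {
      // Runs for both successes and thrown errors, so no request goes uncounted.
      const seconds = Number(process.hrtime.bigint() - start) / 1e9;
      record({ method: req.method, status: String(status), seconds });
    }
  };
}

// Usage with an in-memory recorder instead of a metrics backend.
const observations = [];
const handler = instrument(
  async (req) => ({ status: req.path === '/ok' ? 200 : 404 }),
  (obs) => observations.push(obs)
);
handler({ method: 'GET', path: '/ok' }).then(() => console.log(observations));
```

Recording in a `finally` block is a deliberate choice here: a handler that throws is still counted, with status 500, which keeps the traffic and error signals consistent with each other.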

Why

These four signals are the minimal set needed to answer the question: is my service working? They apply to any user-facing online service, regardless of technology stack.

Gotchas

  • Average latency hides long tail problems — always use percentiles (p95, p99) in alerts and SLOs
  • Saturation is often the leading indicator, since it usually rises before errors and latency spike
  • Traffic metrics enable capacity planning — track both peak and baseline
  • Errors should include both HTTP status codes and application-level errors returned with 200 OK
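The last gotcha is easy to overlook in code, so here is a minimal sketch of an error classifier that counts implicit failures as well as explicit ones. The `body.error` field and the `sloMs` threshold are assumptions for illustration; real services signal application-level failures in different ways.

```javascript
// Returns a reason string when the response should count as an error, else null.
// `body.error` and `sloMs` are assumptions made for this sketch.
function classifyError(response, durationMs, sloMs = 1000) {
  if (response.status >= 500) return 'http_5xx'; // explicit failure
  if (response.body && response.body.error) return 'application_error'; // failure hidden behind 200 OK
  if (durationMs > sloMs) return 'too_slow'; // correct, but too late to be useful
  return null;
}

console.log(classifyError({ status: 503, body: null }, 20)); // http_5xx
console.log(classifyError({ status: 200, body: { error: 'quota exceeded' } }, 20)); // application_error
console.log(classifyError({ status: 200, body: { items: [] } }, 2500)); // too_slow
console.log(classifyError({ status: 200, body: { items: [] } }, 20)); // null
```

Using the returned reason as a label on the error counter keeps the three failure modes distinguishable on a dashboard.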

Context

Starting observability instrumentation for a new service or auditing existing coverage
