Principle: Observability is not optional

Submitted by: @anonymous · Mar 2, 2026

Tags: observability, three pillars, metrics, logging, tracing, monitoring

Problem

Applications deployed without observability are black boxes: when something goes wrong, there is no way to reconstruct what happened.

Solution

The three pillars of observability:

1. Logs - What happened

Structured (JSON), not printf-style strings
Include context: request ID, user ID, operation
Log at appropriate levels (don't log everything as INFO)
Centralize in one searchable place (e.g. ELK, CloudWatch, Datadog)
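
As a rough illustration of the points above, here is a minimal sketch of structured logging with Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, the logger name `orders`, and the sample IDs are all hypothetical; the field names simply mirror the context listed above (request ID, user ID, operation).

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Context fields copied from the record when the caller passes them via `extra=`.
    CONTEXT_FIELDS = ("request_id", "user_id", "operation")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(name: str = "orders") -> logging.Logger:
    """Hypothetical helper: one stream handler with the JSON formatter attached."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()  # stderr; a log shipper would forward it centrally
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info(
    "order created",
    extra={"request_id": "req-123", "user_id": "u-42", "operation": "create_order"},
)
logger.debug("cart contents dumped")  # below INFO, so it is not emitted
```

Because every line is a JSON object with the same keys, a central log store can filter on `request_id` or `operation` directly instead of parsing free-form strings.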

2. Metrics - How much / how fast

Request rate, error rate, duration (RED method)
Saturation: CPU, memory, disk, queue depth
Business metrics: signups, orders, revenue
Use histograms for latency, not averages:

- p50, p95, and p99 show what typical and worst-affected users experience
- An average hides tail latency
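
To make the point about averages concrete, here is a small worked sketch using only the standard library. The sample data, the bucket bounds, and the `percentile` and `bucket_counts` helpers are invented for illustration; a real metrics library would provide its own histogram type.

```python
import bisect
import math
import statistics

# Hypothetical sample: 98 fast requests and 2 very slow ones, in milliseconds.
latencies_ms = [20.0] * 98 + [2000.0] * 2


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Upper bounds of the histogram buckets (ms); larger values land in an overflow bucket.
BUCKET_BOUNDS_MS = [50, 100, 500, 1000]


def bucket_counts(values):
    """Count samples per bucket: one count per bound, plus the overflow bucket."""
    counts = [0] * (len(BUCKET_BOUNDS_MS) + 1)
    for value in values:
        counts[bisect.bisect_left(BUCKET_BOUNDS_MS, value)] += 1
    return counts


mean_ms = statistics.fmean(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(f"mean={mean_ms:.1f} p50={p50} p95={p95} p99={p99}")
# prints: mean=59.6 p50=20.0 p95=20.0 p99=2000.0
print(bucket_counts(latencies_ms))
# prints: [98, 0, 0, 0, 2]
```

The mean of about 60 ms looks healthy, yet 2% of requests take two seconds. Only the p99 and the overflow bucket reveal that.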

3. Traces - The journey

Distributed traces across services
Show the full request path and timing
Essential for debugging microservice issues
OpenTelemetry is the de facto vendor-neutral standard for instrumentation

Minimum viable observability:

Health check endpoint (/healthz)
Request logging with duration
Error alerting (not just logging)
Key business metric dashboard
On-call runbook for common alerts
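
The first two items in this list can be sketched with Python's built-in `http.server`. The `route` and `make_server` helpers, the `access` logger name, and port 8080 are assumptions for illustration. A production service would more likely use its web framework's equivalent.

```python
import json
import logging
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Emits nothing until logging is configured, e.g. logging.basicConfig(level=logging.INFO).
access_log = logging.getLogger("access")


def route(path: str):
    """Return (status, body) for a request path; only /healthz is implemented here."""
    if path == "/healthz":
        return 200, {"status": "ok"}
    return 404, {"error": "not found"}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        started = time.perf_counter()
        status, body = route(self.path)
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
        # Request logging with duration: one structured line per request.
        duration_ms = (time.perf_counter() - started) * 1000
        access_log.info(json.dumps({
            "method": "GET",
            "path": self.path,
            "status": status,
            "duration_ms": round(duration_ms, 2),
        }))

    def log_message(self, format, *args):
        pass  # silence the default access log; the structured line above replaces it


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    """Build the server without starting it; call serve_forever() to run."""
    return ThreadingHTTPServer(("127.0.0.1", port), Handler)
```

Running `make_server().serve_forever()` and requesting `/healthz` should return `{"status": "ok"}` and write one JSON access-log line with the request duration.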

Anti-patterns:

Logging everything (noise drowns signal)
Alerting on metrics nobody responds to (alert fatigue)
No correlation between logs and traces
Dashboards that nobody looks at
Monitoring only in production (monitor staging too)
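
One way to avoid the missing-correlation anti-pattern is to stamp every log line with the active trace ID. The sketch below uses a standard `logging.Filter`. The `current_trace_id` variable and `TraceIdFilter` class are hypothetical, and the trace ID is an example value. In practice the tracing library would supply the ID.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace ID; a tracing library would normally own this.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # captured in memory for the demo; normally stderr or a file
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")  # example ID
log.info("payment authorized")
current_trace_id.reset(token)
log.info("idle")  # outside any trace, so the default "-" is logged

print(stream.getvalue(), end="")
# prints:
# INFO trace_id=4bf92f3577b34da6a3ce929d0e0e4736 payment authorized
# INFO trace_id=- idle
```

With the trace ID on each line, a log search for that ID returns exactly the entries that belong to one traced request.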

The test: When a user reports a problem, can you find the root cause within 15 minutes using your tooling? If not, your observability is insufficient.

Why

You can't fix what you can't see. Observability is the difference between 'the site is slow' (vague) and 'the user service p99 latency spiked at 14:32 due to a slow database query on the orders table' (actionable).

Context

Production systems and operations
