Principle: Observability is not optional
Tags: observability, three pillars, metrics, logging, tracing, monitoring
Problem
Applications deployed without observability are black boxes: when something goes wrong, there is no way to reconstruct what happened.
Solution
The three pillars of observability:
1. Logs - What happened
- Structured (JSON), not printf-style strings
- Include context: request ID, user ID, operation
- Log at appropriate levels (don't log everything as INFO)
- Centralize (ELK, CloudWatch, Datadog)
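A minimal sketch of the logging points above, using only Python's standard `logging` module. The `JsonFormatter` class, the `orders` logger name, and the sample field values are illustrative assumptions, not part of any particular stack; shipping the output to a central store (ELK, CloudWatch, Datadog) is left to whatever agent you already run.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Hypothetical context fields; use whatever your services actually carry.
    CONTEXT_FIELDS = ("request_id", "user_id", "operation")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines at INFO and above."""
    logger = logging.getLogger("orders")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


buffer = io.StringIO()  # stand-in for stdout or a log shipper
log = build_logger(buffer)
log.info(
    "order created",
    extra={"request_id": "req-123", "user_id": "u-42", "operation": "create_order"},
)
log.debug("cache probe")  # below INFO, so it is dropped
print(buffer.getvalue(), end="")
```

Because every line is a JSON object with the same keys, a log store can filter on `request_id` directly instead of pattern-matching free text.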
2. Metrics - How much / how fast
- Request rate, error rate, duration (RED method)
- Saturation: CPU, memory, disk, queue depth
- Business metrics: signups, orders, revenue
- Use histograms for latency, not averages:
  - p50, p95, p99 tell the real story
  - An average hides tail latency
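To see why the average misleads, here is a small worked example with made-up latencies. It computes nearest-rank percentiles from raw samples for clarity; a real metrics system would typically estimate them from histogram buckets instead. The `percentile` helper and the sample data are assumptions for illustration.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical workload: 90 fast requests and 10 very slow ones.
latencies_ms = [20] * 90 + [2000] * 10

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

In this data set the mean is 218 ms, a latency no request actually experienced. The p50 of 20 ms shows the typical request is fast, while the p95 and p99 of 2000 ms expose the slow tail that one in ten users hits.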
3. Traces - The journey
- Distributed traces across services
- Show the full request path and timing
- Essential for debugging microservice issues
- OpenTelemetry is the de facto vendor-neutral standard for instrumentation
Minimum viable observability:
- Health check endpoint (/healthz)
- Request logging with duration
- Error alerting (not just logging)
- Key business metric dashboard
- On-call runbook for common alerts
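The first two items of this list can be sketched as a plain WSGI application plus a logging middleware, using only the standard library. The names `app`, `with_request_logging`, `request_log`, and `call` are illustrative assumptions, and the in-memory list stands in for a real log sink. Alerting, dashboards, and runbooks depend on your tooling and are not shown.

```python
import json
import time

request_log = []  # stand-in for a real structured log sink


def app(environ, start_response):
    """Tiny WSGI app exposing a health check endpoint."""
    if environ.get("PATH_INFO") == "/healthz":
        body = json.dumps({"status": "ok"}).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]


def with_request_logging(wrapped):
    """Wrap a WSGI app and record method, path, status and duration per request."""

    def middleware(environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status
            return start_response(status, headers, exc_info)

        start = time.perf_counter()
        try:
            return wrapped(environ, recording_start_response)
        finally:
            # Measures time until the app returns its body iterable;
            # streaming responses would need timing around iteration too.
            request_log.append({
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "status": captured.get("status"),
                "duration_ms": (time.perf_counter() - start) * 1000,
            })

    return middleware


logged_app = with_request_logging(app)


def call(path):
    """Invoke the app in-process, without a server, and return (status, body)."""
    statuses = []

    def start_response(status, headers, exc_info=None):
        statuses.append(status)

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(logged_app(environ, start_response))
    return statuses[0], body


print(call("/healthz"))
print(request_log[-1])
```

Keeping the timing in middleware means every endpoint gets a duration entry without each handler having to remember to log one.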
Anti-patterns:
- Logging everything (noise drowns signal)
- Alerting on metrics nobody responds to (alert fatigue)
- No correlation between logs and traces
- Dashboards that nobody looks at
- Monitoring only in production (monitor staging too)
The test: When a user reports a problem, can you find the root cause within 15 minutes using your tooling? If not, your observability is insufficient.
Why
You can't fix what you can't see. Observability is the difference between 'the site is slow' (vague) and 'the user service p99 latency spiked at 14:32 due to a slow database query on the orders table' (actionable).
Context
Production systems and operations