Application Performance Monitoring (APM) Setup Checklist

Submitted by: @anonymous · Mar 2, 2026

Tags: APM, monitoring, tracing, RED method, p99 latency, observability, alerting

Problem

The application is slow or unreliable in production, but there is no visibility into where time is spent, which requests are failing, or what is degrading.

Solution

Essential APM instrumentation:

1. Request tracing (most important)

Trace every HTTP request: method, path, status, duration
Add trace IDs that propagate across services
Sample high-traffic endpoints (e.g., 10% of GET /api/feed)
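
The three points above can be sketched as a small wrapper. This is only an illustration, not a replacement for a tracing library: the header name `X-Trace-Id`, the `SAMPLE_RATES` table, and the `traced` and `should_sample` helpers are all hypothetical names chosen for this example, and the handler is assumed to be a plain function returning a status code.

```python
import hashlib
import time
import uuid

TRACE_HEADER = "X-Trace-Id"  # assumed header name, not a standard
SAMPLE_RATES = {("GET", "/api/feed"): 0.10}  # endpoints not listed are always traced


def should_sample(method, path, trace_id):
    """Decide from a hash of the trace ID so every service reaches the same decision."""
    rate = SAMPLE_RATES.get((method, path), 1.0)
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rate * 10_000


def traced(handler, sink):
    """Wrap handler(method, path, headers) -> status; append one span per sampled call."""
    def wrapper(method, path, headers):
        # Reuse the caller's trace ID if present, otherwise start a new trace.
        trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
        outgoing = {**headers, TRACE_HEADER: trace_id}  # passed on to downstream calls
        start = time.perf_counter()
        status = 500  # recorded if the handler raises
        try:
            status = handler(method, path, outgoing)
            return status
        finally:
            if should_sample(method, path, trace_id):
                sink.append({
                    "trace_id": trace_id,
                    "method": method,
                    "path": path,
                    "status": status,
                    "duration_ms": (time.perf_counter() - start) * 1000.0,
                })
    return wrapper
```

Hashing the trace ID, rather than calling a random number generator per service, keeps a sampled trace complete: either every hop records its span or none does.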

2. Key metrics (RED method)

Rate: requests per second
Errors: error rate percentage
Duration: p50, p95, p99 latency
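
As a rough illustration of how the three RED numbers fall out of raw request records, the sketch below summarizes one time window. It assumes each record is a dict with `status` and `duration_ms` keys, counts only 5xx responses as errors, and uses the nearest-rank percentile definition; all of these are choices made for the example, and metric backends may define them differently.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 1..100)."""
    if not sorted_values:
        raise ValueError("no values")
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests, window_seconds):
    """Rate, Errors, Duration for the requests observed in one window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r["duration_ms"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)  # assumption: 5xx only
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate_pct": 100.0 * errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```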

3. Database monitoring

Query duration and count
Slow query log (>100ms)
Connection pool utilization
N+1 query detection
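
One simple way to approximate the slow query log and N+1 detection is to record every query per request, flag those above the 100 ms threshold, and count how often the same query shape repeats. The sketch below assumes queries arrive as SQL strings with inline literals; `QueryMonitor`, `normalize`, and the repeat threshold of 5 are hypothetical and would need tuning against real traffic.

```python
import re
from collections import Counter

SLOW_QUERY_MS = 100        # threshold from the checklist above
N_PLUS_ONE_THRESHOLD = 5   # assumed: same query shape this many times in one request


def normalize(sql):
    """Replace literals with '?' so queries differing only in parameters group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # standalone numeric literals
    return " ".join(sql.split())         # collapse whitespace


class QueryMonitor:
    """Collects the queries issued while handling a single request."""

    def __init__(self):
        self.shapes = Counter()
        self.slow = []

    def record(self, sql, duration_ms):
        self.shapes[normalize(sql)] += 1
        if duration_ms > SLOW_QUERY_MS:
            self.slow.append((sql, duration_ms))

    def n_plus_one_suspects(self):
        """Query shapes repeated often enough to suggest a missing join or batch load."""
        return [s for s, n in self.shapes.items() if n >= N_PLUS_ONE_THRESHOLD]
```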

4. External dependency tracking

API call duration and error rates
Circuit breaker state
Timeout frequency
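
To make these three signals concrete, here is a minimal circuit breaker that also counts calls, errors, timeouts, and rejected calls. It is a sketch under assumptions: the thresholds, the state names, and the `CircuitBreaker` and `CircuitOpenError` classes are invented for this example, and a production setup would more likely export these counters from an existing resilience library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.state = "closed"   # closed -> open -> half_open -> closed
        self.failures = 0
        self.opened_at = None
        self.stats = {"calls": 0, "errors": 0, "timeouts": 0, "rejected": 0}

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_after_s:
                self.stats["rejected"] += 1
                raise CircuitOpenError("dependency circuit is open")
            self.state = "half_open"  # allow one trial call through
        self.stats["calls"] += 1
        try:
            result = fn()
        except Exception as exc:
            self.stats["errors"] += 1
            if isinstance(exc, TimeoutError):
                self.stats["timeouts"] += 1
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"
        return result
```

Exporting `state` and `stats` as metrics gives the error rate, timeout frequency, and breaker state listed above for each dependency.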

5. Custom business metrics

Sign-ups per minute
Orders processed
Payment success/failure rate
Queue depth and processing time

6. Alerting thresholds

# Alert on symptoms, not causes
alerts:
  - name: High Error Rate
    condition: error_rate > 1% for 5m
    severity: critical
  - name: High Latency
    condition: p99_latency > 2s for 5m
    severity: warning
  - name: Saturation
    condition: cpu_usage > 80% for 10m
    severity: warning
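
The `for 5m` clause in the conditions above means the threshold must be exceeded continuously for that long before the alert fires. The exact semantics depend on the alerting tool; the sketch below shows one plausible interpretation, with `SustainedThreshold` as a hypothetical helper fed one sample per evaluation interval.

```python
class SustainedThreshold:
    """Fires only after the value has stayed above the threshold for `for_seconds`."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.breach_started = None  # timestamp of the first sample in the current breach

    def observe(self, timestamp, value):
        """Record one sample; return True if the alert should be firing now."""
        if value > self.threshold:
            if self.breach_started is None:
                self.breach_started = timestamp
            return timestamp - self.breach_started >= self.for_seconds
        # Any sample at or below the threshold resets the pending breach.
        self.breach_started = None
        return False


# Example: "error_rate > 1% for 5m", evaluated once per minute.
high_error_rate = SustainedThreshold(threshold=1.0, for_seconds=300)
```

Requiring a sustained breach filters out short spikes that resolve on their own and would not need human action.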

Tools: Datadog, New Relic, Grafana + Prometheus, Sentry (errors), Jaeger (tracing).

Why

You can't fix what you can't see. APM turns debugging from "something is slow" into "the /api/checkout endpoint p99 is 3s because the PostgreSQL query on line 42 takes 2.8s".

Gotchas

Don't alert on metrics that don't require human action; they train people to ignore alerts
p99 matters more than the average: an average can look healthy while the slowest 1% of requests are still a terrible experience for real users

Context

Setting up production monitoring
