
Application Performance Monitoring (APM) Setup Checklist

Tags: APM, monitoring, tracing, RED method, p99 latency, observability, alerting

Problem

An application is slow or unreliable in production, but there is no visibility into where time is spent, which requests are failing, or what is degrading.

Solution

Essential APM instrumentation:

1. Request tracing (most important)
  • Trace every HTTP request: method, path, status, duration
  • Add trace IDs that propagate across services
  • Sample high-traffic endpoints rather than storing every trace (e.g., keep 10% of GET /api/feed traces)
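
The tracing points above can be sketched in a few lines of framework-free Python. This is an illustration under assumptions, not any particular APM agent's API: the X-Trace-Id header name, the trace_request helper, and the in-memory sink list are all hypothetical (real deployments often use the W3C traceparent header and a tracing library).

```python
import random
import time
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name chosen for this sketch


def should_sample(endpoint, sample_rates, default_rate=1.0):
    """Decide whether to keep a trace; endpoints not listed are always kept."""
    return random.random() < sample_rates.get(endpoint, default_rate)


def trace_request(method, path, headers, handler, sample_rates, sink):
    """Run handler(), record one span in sink, return (status, outgoing_headers)."""
    # Reuse the caller's trace ID so one trace follows the request across services.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    outgoing = {TRACE_HEADER: trace_id}  # attach this to any downstream calls
    start = time.perf_counter()
    status = 500  # reported if the handler raises before returning a status
    try:
        status = handler()
        return status, outgoing
    finally:
        if should_sample(f"{method} {path}", sample_rates):
            sink.append({
                "trace_id": trace_id,
                "method": method,
                "path": path,
                "status": status,
                "duration_s": time.perf_counter() - start,
            })


spans = []
rates = {"GET /api/feed": 0.10}  # keep roughly 10% of feed traces
trace_request("GET", "/api/checkout", {}, lambda: 200, rates, spans)
```

The sampling decision here is made per request; a multi-service setup would usually make it once at the edge and propagate it with the trace ID so that a trace is either kept or dropped as a whole.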



2. Key metrics (RED method)
  • Rate: requests per second
  • Errors: percentage of requests that fail
  • Duration: p50, p95, p99 latency
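
As a rough illustration of the three RED numbers, the sketch below computes them from a list of (status, duration) pairs collected over a time window. The red_summary and percentile helpers are hypothetical names, percentiles use the simple nearest-rank method, and only 5xx responses are counted as errors, which is an assumption a real service may want to change.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int, 1-100)."""
    rank = (pct * len(sorted_values) + 99) // 100  # integer ceiling of pct% * n
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests, window_seconds):
    """requests: non-empty list of (status_code, duration_seconds) tuples."""
    durations = sorted(duration for _, duration in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate_pct": 100 * errors / len(requests),
        "p50_s": percentile(durations, 50),
        "p95_s": percentile(durations, 95),
        "p99_s": percentile(durations, 99),
    }


# 98 fast successes and 2 slow failures observed over a 10-second window.
sample = [(200, 0.1)] * 98 + [(500, 3.0)] * 2
summary = red_summary(sample, window_seconds=10)
```

With this sample the p50 and p95 are both 0.1 s while the p99 is 3.0 s, which is why the tail percentiles are tracked alongside the median.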



3. Database monitoring
  • Query duration and count
  • Slow query log (e.g., queries slower than 100 ms)
  • Connection pool utilization
  • N+1 query detection
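
One way to picture the query-level checks is a thin wrapper around the database connection. The sketch below uses sqlite3 from the standard library only as a stand-in; QueryMonitor and its method names are hypothetical, and the N+1 check is a crude heuristic (the same SQL text repeated many times), not what a real APM agent does internally.

```python
import re
import sqlite3
import time
from collections import Counter

SLOW_QUERY_SECONDS = 0.100  # the 100 ms threshold from the checklist


class QueryMonitor:
    """Wraps a DB-API connection to time queries and count repeated statements."""

    def __init__(self, conn, slow_threshold=SLOW_QUERY_SECONDS):
        self.conn = conn
        self.slow_threshold = slow_threshold
        self.statement_counts = Counter()  # normalized SQL text -> executions
        self.slow_queries = []             # (sql, elapsed_seconds)

    def execute(self, sql, params=()):
        start = time.perf_counter()
        rows = self.conn.execute(sql, params).fetchall()
        elapsed = time.perf_counter() - start
        # Parameterized SQL keeps the text identical across calls, so repeats group.
        self.statement_counts[re.sub(r"\s+", " ", sql.strip())] += 1
        if elapsed > self.slow_threshold:
            self.slow_queries.append((sql, elapsed))
        return rows

    def n_plus_one_suspects(self, min_repeats=5):
        """Statements run at least min_repeats times, e.g. within one request."""
        return [sql for sql, n in self.statement_counts.items() if n >= min_repeats]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
monitor = QueryMonitor(conn)
for user_id in range(10):  # one query per user: the classic N+1 shape
    monitor.execute("SELECT id FROM orders WHERE user_id = ?", (user_id,))
suspects = monitor.n_plus_one_suspects()
```

In practice one monitor instance would be created per request so that the repeat counts reflect a single request rather than the whole process lifetime.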



4. External dependency tracking
  • API call duration and error rates
  • Circuit breaker state
  • Timeout frequency
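
To make "circuit breaker state" concrete, here is a minimal breaker that also keeps the counters an APM dashboard would chart. It is a sketch under assumptions: the CircuitBreaker class, its thresholds, and the counter names are invented for illustration and do not mirror a specific library.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None
        self.calls = 0      # calls actually sent to the dependency
        self.errors = 0     # calls that raised (timeouts included)
        self.rejected = 0   # calls refused while the circuit was open

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half_open"
        return "open"

    def call(self, fn):
        if self.state == "open":
            self.rejected += 1
            raise RuntimeError("circuit open: dependency call skipped")
        self.calls += 1
        try:
            result = fn()
        except Exception:
            self.errors += 1
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure streak.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

Exporting state, errors / calls, and rejected per dependency gives the three signals listed above without any extra bookkeeping at the call sites.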



5. Custom business metrics
  • Sign-ups per minute
  • Orders processed
  • Payment success/failure rate
  • Queue depth and processing time
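
Business metrics usually need only counters and gauges emitted from application code. The sketch below keeps them in memory to show the shape; BusinessMetrics and the metric names are hypothetical, and a real setup would send these values to whichever metrics backend is in use.

```python
from collections import Counter


class BusinessMetrics:
    """In-memory counters (monotonic totals) and gauges (latest value)."""

    def __init__(self):
        self.counters = Counter()
        self.gauges = {}

    def incr(self, name, amount=1):
        self.counters[name] += amount

    def gauge(self, name, value):
        self.gauges[name] = value

    def payment_success_rate(self):
        """Fraction of payments that succeeded, or None if none were attempted."""
        ok = self.counters["payment.success"]
        failed = self.counters["payment.failure"]
        total = ok + failed
        return None if total == 0 else ok / total


metrics = BusinessMetrics()
metrics.incr("signup")               # call where a sign-up completes
metrics.incr("payment.success", 3)
metrics.incr("payment.failure")
metrics.gauge("queue.depth", 42)     # sampled periodically by the worker
rate = metrics.payment_success_rate()
```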



6. Alerting thresholds
# Alert on symptoms, not causes
alerts:
  - name: High Error Rate
    condition: error_rate > 1% for 5m
    severity: critical
  - name: High Latency
    condition: p99_latency > 2s for 5m
    severity: warning
  - name: Saturation
    condition: cpu_usage > 80% for 10m
    severity: warning
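
The "for 5m" part of each condition means the threshold must be exceeded continuously, not just once. A small evaluator makes that explicit; alert_fires is a hypothetical helper written for this sketch, and real alerting systems differ in how they handle gaps between samples.

```python
def alert_fires(samples, threshold, for_seconds, now):
    """samples: list of (timestamp_seconds, value), sorted by timestamp.

    Returns True only if there is history reaching back to the start of the
    window and every sample inside the window is above the threshold.
    """
    window_start = now - for_seconds
    if not samples or samples[0][0] > window_start:
        return False  # not enough history to say the condition held throughout
    recent = [value for ts, value in samples if ts >= window_start]
    return bool(recent) and all(value > threshold for value in recent)


# Error rate (%) sampled once a minute: 0.5% at first, 2% for the last 5 minutes.
error_rate = [(t, 0.5 if t < 300 else 2.0) for t in range(0, 601, 60)]
critical = alert_fires(error_rate, threshold=1.0, for_seconds=300, now=600)
```

A single dip below the threshold inside the window resets the alert, which keeps short spikes from paging anyone.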


Tools: Datadog, New Relic, Grafana + Prometheus, Sentry (errors), Jaeger (tracing).

Why

You can't fix what you can't see. APM turns 'something is slow' into 'the /api/checkout endpoint has a p99 of 3s because the PostgreSQL query on line 42 takes 2.8s'.

Gotchas

  • Don't alert on metrics that don't require human action
  • p99 matters more than the average: an average can look healthy while 1% of requests are terrible, and the users behind those requests still matter

Context

Setting up production monitoring
