pattern | javascript | Major
Incident response runbook template: structured on-call response
runbook, incident response, MTTR, on-call playbook, rollback, escalation, status page update, Slack incident channel
Problem
On-call engineers respond to incidents ad hoc, spending critical minutes figuring out where to look, whom to notify, and what actions to take. Each engineer has their own approach, so response quality is inconsistent and MTTR varies widely from one incident to the next.
Solution
Write a runbook, based on a shared template, for every alert that pages on-call. A runbook is a living document that walks the responder through diagnosis and mitigation.
Store runbooks in your wiki with a predictable URL scheme so alerts can link directly to them.
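One way to make the URL scheme predictable is to derive the runbook link from the alert name, so the alerting configuration never needs a hand-maintained mapping. The following is a minimal sketch, not a prescribed implementation: the base URL `https://wiki.example.com/runbooks` and the function names `alertSlug` and `runbookUrl` are assumptions made for illustration.

```javascript
// Hypothetical wiki location; replace with wherever your runbooks live.
const RUNBOOK_BASE_URL = "https://wiki.example.com/runbooks";

// Convert an alert name such as "HighErrorRate" into a URL slug.
function alertSlug(alertName) {
  return alertName
    .replace(/([a-z0-9])([A-Z])/g, "$1-$2") // split camelCase boundaries
    .replace(/[^A-Za-z0-9]+/g, "-")         // turn spaces and punctuation into hyphens
    .replace(/^-+|-+$/g, "")                // trim leading and trailing hyphens
    .toLowerCase();
}

// Build the link that an alert notification can embed.
function runbookUrl(alertName) {
  return `${RUNBOOK_BASE_URL}/${alertSlug(alertName)}`;
}

console.log(runbookUrl("HighErrorRate"));
// https://wiki.example.com/runbooks/high-error-rate
```

With a convention like this, anyone who knows the alert name can also guess the runbook address without searching the wiki.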
Runbook template:
# Alert: HighErrorRate
## Severity: P1
## Owner: Backend Team
## Last Updated: 2024-01-15
## What is happening
The error rate for the order service has exceeded 1% for more than 5 minutes.
This means users are experiencing checkout failures.
## Immediate actions (first 5 minutes)
1. Check status page — is this known? https://status.example.com
2. Check recent deploys — https://deploy.example.com/history
3. Open Grafana dashboard — https://grafana.example.com/d/orders
4. Check Sentry for new error groups — https://sentry.example.com/orders
## Diagnosis steps
- If errors started after a deploy → initiate rollback (see Rollback Runbook)
- If errors are on /api/payment only → check Stripe status https://status.stripe.com
- If errors are DB-related → check RDS metrics (query time, connections)
## Escalation
- 15 min no resolution → page backend team lead
- User data at risk → page security team
## Communication
- Update status page within 5 minutes
- Post in #incidents Slack channel
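The escalation rules in the template above are simple enough to encode, for example in an incident bot that reminds the responder whom to page. This is a sketch under stated assumptions: the incident object shape, the function name `escalationTargets`, and the target identifiers are hypothetical, and only the 15-minute threshold and the two rules come from the template.

```javascript
// Threshold taken from the template's Escalation section.
const ESCALATE_AFTER_MINUTES = 15;

// Returns who else should be paged for an incident.
// `incident` is a hypothetical shape:
//   { startedAt: Date, resolved: boolean, userDataAtRisk: boolean }
function escalationTargets(incident, now = new Date()) {
  if (incident.resolved) return [];

  const targets = [];
  const minutesOpen = (now - incident.startedAt) / 60000;

  // "15 min no resolution -> page backend team lead"
  if (minutesOpen >= ESCALATE_AFTER_MINUTES) targets.push("backend-team-lead");

  // "User data at risk -> page security team"
  if (incident.userDataAtRisk) targets.push("security-team");

  return targets;
}
```

Passing `now` as a parameter keeps the function deterministic, which makes the rules easy to unit test.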
Why
Runbooks reduce MTTR by giving on-call engineers a starting point rather than a blank slate. They encode institutional knowledge so that a junior engineer can respond effectively to an incident at 3am.
Gotchas
- Runbooks become outdated: review them after every incident and add any diagnostic steps discovered along the way
- A runbook that says 'ask Alice' instead of providing the actual steps is a runbook that will fail at 3am
- Keep runbooks short: a 20-page document will not be read during an incident, so aim for 1-2 pages
- Include links to every dashboard and tool: the on-call engineer should not be googling for URLs during an incident
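Several of the gotchas above can be caught mechanically, for instance by a periodic job that scans the wiki. The sketch below assumes runbooks follow the template's format, including a `Last Updated: YYYY-MM-DD` line. The function name `lintRunbook` and the limits `MAX_AGE_DAYS` and `MAX_LINES` are hypothetical values chosen for illustration, and the checks are rough heuristics rather than a complete validator.

```javascript
// Hypothetical limits; tune them to your own review cadence and page size.
const MAX_AGE_DAYS = 180;
const MAX_LINES = 120;

// Returns a list of human-readable warnings for one runbook's text.
function lintRunbook(text, today = new Date()) {
  const warnings = [];

  // Staleness: read the "Last Updated: YYYY-MM-DD" header from the template.
  const updated = text.match(/Last Updated:\s*(\d{4}-\d{2}-\d{2})/);
  if (!updated) {
    warnings.push("missing 'Last Updated' date");
  } else {
    const ageDays = (today - new Date(updated[1])) / 86400000;
    if (ageDays > MAX_AGE_DAYS) {
      warnings.push(`last updated ${Math.floor(ageDays)} days ago`);
    }
  }

  // Steps that point at a person ("ask Alice") instead of giving instructions.
  // Heuristic: "ask" followed by a capitalized word.
  if (/\b[Aa]sk [A-Z][a-z]+\b/.test(text)) {
    warnings.push("step refers to a person instead of giving instructions");
  }

  // Length: long documents will not be read during an incident.
  const lineCount = text.split("\n").length;
  if (lineCount > MAX_LINES) warnings.push(`too long: ${lineCount} lines`);

  // Links: the responder should not have to search for dashboards.
  if (!/https?:\/\/\S+/.test(text)) {
    warnings.push("no dashboard or tool links found");
  }

  return warnings;
}
```

A check like this does not replace the post-incident review, but it can flag runbooks that nobody has touched in a while.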
Context
Building operational readiness for a production service with an on-call rotation