pattern, javascript, Major

Incident response runbook template: structured on-call response

Submitted by: @seed
runbook, incident response, MTTR, on-call playbook, rollback, escalation, status page update, Slack incident channel

Problem

On-call engineers respond to incidents ad hoc, spending critical minutes figuring out where to look, who to notify, and what actions to take. Each engineer has their own approach, so response quality is inconsistent and MTTR varies widely from one incident to the next.

Solution

Write a runbook, based on a shared template, for every alert that pages on-call. A runbook is a living document that walks the responder through diagnosis and mitigation.

Runbook template:
# Alert: HighErrorRate

## Severity: P1
## Owner: Backend Team
## Last Updated: 2024-01-15

## What is happening
The error rate for the order service has exceeded 1% for more than 5 minutes.
This means users are experiencing checkout failures.

## Immediate actions (first 5 minutes)
1. Check status page — is this known? https://status.example.com
2. Check recent deploys — https://deploy.example.com/history
3. Open Grafana dashboard — https://grafana.example.com/d/orders
4. Check Sentry for new error groups — https://sentry.example.com/orders

## Diagnosis steps
- If errors started after a deploy → initiate rollback (see Rollback Runbook)
- If errors are on /api/payment only → check Stripe status https://status.stripe.com
- If errors are DB-related → check RDS metrics (query time, connections)

## Escalation
- No resolution after 15 minutes → page backend team lead
- User data at risk → page security team

## Communication
- Update status page within 5 minutes
- Post in #incidents Slack channel


Store runbooks in your wiki with a predictable URL scheme so alerts can link directly to them.
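
One way to get a predictable URL scheme is to derive the runbook path from the alert name, so alert definitions can compute the link instead of hard-coding it. This is a minimal sketch; the base URL `https://wiki.example.com/runbooks` and the kebab-case slug convention are assumptions, not something the entry prescribes.

```javascript
// Assumed wiki location; replace with wherever your runbooks actually live.
const RUNBOOK_BASE_URL = "https://wiki.example.com/runbooks";

// Convert an alert name such as "HighErrorRate" into a URL-safe slug.
function slugifyAlertName(alertName) {
  return alertName
    .replace(/([a-z0-9])([A-Z])/g, "$1-$2") // split camelCase word boundaries
    .replace(/[^A-Za-z0-9]+/g, "-") // collapse anything else into one hyphen
    .replace(/^-+|-+$/g, "") // trim leading and trailing hyphens
    .toLowerCase();
}

// Build the runbook link to attach to an alert's annotations or description.
function runbookUrl(alertName) {
  return `${RUNBOOK_BASE_URL}/${slugifyAlertName(alertName)}`;
}

console.log(runbookUrl("HighErrorRate"));
// prints "https://wiki.example.com/runbooks/high-error-rate"
```

Because the link is a pure function of the alert name, a CI check can also verify that every paging alert has a runbook page at the expected address.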

Why

Runbooks reduce MTTR by giving on-call engineers a starting point rather than a blank slate. They encode institutional knowledge so that a junior engineer can respond effectively to an incident at 3am.

Gotchas

  • Runbooks become outdated — review them after every incident and update with new diagnostic steps discovered
  • A runbook that says 'ask Alice' instead of providing the actual steps is a runbook that will fail at 3am
  • Keep runbooks short — a 20-page document will not be read during an incident; aim for 1-2 pages
  • Include links to every dashboard and tool — the on-call engineer should not be googling for URLs during an incident
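
Several of the gotchas above can be checked mechanically, for example in CI over the runbook directory. The sketch below assumes runbooks follow the Markdown template shown earlier (`## ` headings and a `## Last Updated: YYYY-MM-DD` line); the 180-day review interval and 120-line length limit are arbitrary placeholder thresholds.

```javascript
// Sections every runbook is expected to have, taken from the template above.
const REQUIRED_SECTIONS = [
  "What is happening",
  "Immediate actions",
  "Diagnosis steps",
  "Escalation",
  "Communication",
];
const MAX_AGE_DAYS = 180; // assumed review interval, tune to your team
const MAX_LINES = 120; // rough stand-in for "1-2 pages", also an assumption
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return a list of human-readable problems; an empty list means the runbook passes.
function lintRunbook(markdown, today = new Date()) {
  const problems = [];
  const lines = markdown.split("\n");
  const headings = lines
    .filter((line) => line.startsWith("## "))
    .map((line) => line.slice(3).trim());

  for (const section of REQUIRED_SECTIONS) {
    if (!headings.some((heading) => heading.startsWith(section))) {
      problems.push(`Missing section: ${section}`);
    }
  }

  // Staleness: compare the "Last Updated" date against the review interval.
  const updated = markdown.match(/^## Last Updated:\s*(\d{4}-\d{2}-\d{2})\s*$/m);
  if (!updated) {
    problems.push("Missing 'Last Updated' date");
  } else {
    const ageDays = Math.floor((today - new Date(updated[1])) / MS_PER_DAY);
    if (ageDays > MAX_AGE_DAYS) {
      problems.push(`Last updated ${ageDays} days ago`);
    }
  }

  // Responders should not have to search for dashboards during an incident.
  if (!/https?:\/\/\S+/.test(markdown)) {
    problems.push("No dashboard or tool links found");
  }
  if (lines.length > MAX_LINES) {
    problems.push(`Too long: ${lines.length} lines`);
  }
  return problems;
}

// Example: a runbook that has a date but no links and only one required section.
const draft = "# Alert: HighErrorRate\n## Last Updated: 2024-01-15\n## Escalation\n- page the lead";
console.log(lintRunbook(draft, new Date("2024-02-01")));
```

A lint like this cannot tell whether the steps are still correct, so it complements the post-incident review rather than replacing it.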

Context

Building operational readiness for a production service with an on-call rotation.
