
What do you tell someone new to going on-call or PagerDuty?

Submitted by: @import:stackexchange-devops

Problem

[[ Based on the tags and the wording, this sounds like a PagerDuty-specific question, but in my mind it could apply to any tech or monitoring stack. PagerDuty is just so dominant in the SF-based jobs I've seen that it is like tech Kleenex. ]]

I've been searching in my spare time for a few months for an answer to: what do you tell a new recruit to pager duty (or PagerDuty or pagerduty)? And I don't mainly mean mere "install the app and pray" training, though those are key parts of it. I'm looking more for generic, general, globalizable advice like:

  • ask for help as early as possible
  • don't go sleep-deprived "too" many nights in a row (~3)
  • know who is likely to answer with low latency and who won't notice work again until Monday 9am (DOH!)
  • it is easier to find an answer on StackExchange than on random forums at 3am (free plug)
  • triage: how big of a fire is it? how many fire trucks do you call in? etc.
  • when to merge incidents
  • when to snooze incidents... but that is so scary!! "You don't get fired for snoozing here?" hahaha.
  • good notification settings
  • good phone/volume settings for getting woken up, even if you sleep like a log
  • scheduling and calendar layers
  • rules of thumb and recommended tools
  • know who to call for what subsystem (varies based on locality and org); a wiki or Confluence is often part of that



Maybe I'm searching wrong. I'm open to meta-suggestions for better ways to find this needle in the haystack. Searching for "starting pagerduty" makes sense to me, but it returns results on how to set these things up programmatically or technically, not on how to instruct the humans in dealing with these sorts of things. PagerDuty does try to address this, but to me it doesn't cover the breadth of responsibilities that the job entails.

I also admit that I'm asking for everything and the kitchen sink and the city water supply for a bonus round. Getting little pieces of this as links and pointing folks at it would be a huge improvement over what

Solution

There are a number of things that I think it's important to know before going on-call:

  • How to triage an issue. There should be a defined list of priorities with examples so that you can identify how serious a problem is.
  • What the priority levels mean. Can you ack this and then go back to sleep?
  • Who to contact post-triage. The on-caller's primary responsibility is triage, not fixing. Fixing the problem is good, too, but once you've triaged the issue, then it's up to you to rope in other folks from other teams and manage communications between them.

Jumping straight into fixing problems before triaging is the most common mistake I've seen (because, after all, we're used to fixing things and think of that as our job). But communications are often vital, and doing things in the wrong order can mean customers are left in the dark because you got caught up in trying to fix a problem before letting your support people know there is a problem.
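
To make the "defined list of priorities with examples" idea concrete, here is a minimal sketch of what such a list could look like if you wrote it down as code. Everything in it is an assumption for illustration: the level names (P1 to P3), the example incidents, and the helper `on_call_steps` are hypothetical, and your org's real levels and rules will differ. The only thing taken from the advice above is the ordering: triage first, then communicate, then fix.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Priority(IntEnum):
    """Hypothetical priority levels; a real team defines its own."""
    P1 = 1  # customer-facing outage
    P2 = 2  # degraded service
    P3 = 3  # no customer impact yet


@dataclass(frozen=True)
class Policy:
    example: str                  # a sample incident at this level
    can_wait_until_morning: bool  # "can you ack this and go back to sleep?"
    notify_support: bool          # should support people be told right away?


# Hypothetical priority list with one example per level (illustrative only).
RUNBOOK = {
    Priority.P1: Policy("checkout is down for all customers", False, True),
    Priority.P2: Policy("one region is slow but still serving traffic", False, True),
    Priority.P3: Policy("disk at 70% on a batch host", True, False),
}


def on_call_steps(priority: Priority) -> List[str]:
    """Return the ordered steps for an alert: triage, communicate, then fix."""
    policy = RUNBOOK[priority]
    steps = [f"triage: classified as {priority.name}"]
    if policy.notify_support:
        # Communication comes before any attempt at fixing.
        steps.append("communicate: tell support there is a problem")
    if policy.can_wait_until_morning:
        steps.append("ack and follow up in working hours")
    else:
        steps.append("escalate: rope in the owning team")
        steps.append("fix: work the problem")
    return steps


if __name__ == "__main__":
    for level in Priority:
        print(level.name, "->", on_call_steps(level))
```

The point of the table is not the code itself but that the answers to "how serious is this?" and "can I go back to sleep?" are written down before 3am, so the on-caller looks them up instead of deciding under pressure.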

Context

StackExchange DevOps Q#5230, answer score: 2
