
Trigger PagerDuty alert after a number of incidents

Submitted by: @import:stackexchange-devops

Problem

Is it possible to only have an alert triggered after a certain number of incidents from an integration?

For example, if my application reports a single non-critical failure of some kind, it's worthwhile to troubleshoot, but probably not bad enough to wake somebody up in the middle of the night.

However, if the application is reporting the same failure over and over again, it's a symptom of a bigger problem that somebody should look at ASAP.

Any ideas?

Solution

The easiest way to build this is with a metrics-based alerting system such as Prometheus or Datadog. Those systems let you increment a counter and then see, on a graph, how many incidents happened in a given period of time. Most metrics systems integrate nicely with PagerDuty for the alerting piece. Having one of them running also gives you a historical baseline for your systems, which is often incredibly handy for figuring out what went wrong and when.

The threshold should probably be a number of incidents over a given period of time rather than a raw total. If you just pick an arbitrary cumulative number, say 100, you will always reach it eventually, no matter how healthy the system is. A rate such as 10 incidents in an hour, on the other hand, could genuinely be bad for you.
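To make the "count over a period of time" idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the thresholding logic, not how Prometheus, Datadog, or PagerDuty implement it; the class name `WindowedThreshold` and its parameters are made up for this example, and the limit of 10 per hour is just the figure used above.

```python
from collections import deque


class WindowedThreshold:
    """Hypothetical helper: counts events in a sliding time window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now):
        """Record one event at time `now` (in seconds).

        Returns True when the number of events still inside the
        window has reached the limit, i.e. when someone should be paged.
        """
        self._timestamps.append(now)
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.limit


# Ten failures, one per minute: the tenth one crosses the threshold.
burst = WindowedThreshold(limit=10, window_seconds=3600)
burst_results = [burst.record(now=i * 60) for i in range(10)]

# The same ten failures, one per day: the threshold is never reached.
slow = WindowedThreshold(limit=10, window_seconds=3600)
slow_results = [slow.record(now=i * 86400) for i in range(10)]
```

In a real setup the metrics system would do this windowing for you; the sketch only shows why a rate behaves differently from a running total.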

The hardest part is that I don't think PagerDuty makes it easy to access metrics about the number of incidents. You can look at pretty graphs in their UI, but there's no easy way to feed that data into your metrics system. You would need to add a bit of code in each place that creates incidents so that it also increments this counter.
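As a rough sketch of that "bit of code", the wrapper below counts each failure before handing it to whatever currently creates the incident. Everything here is hypothetical: `report_failure`, `failure_counts`, and the `create_incident` callable are placeholder names, and the in-memory `Counter` stands in for the counter object your metrics client would provide.

```python
from collections import Counter

# Stand-in for a counter exposed to your metrics system (assumption).
failure_counts = Counter()


def report_failure(kind, create_incident):
    """Count the failure, then create the incident as before.

    `create_incident` is a hypothetical callable representing the
    existing code path that opens an incident.
    """
    failure_counts[kind] += 1
    create_incident(kind)
    return failure_counts[kind]


# Usage: a list's append method plays the role of the incident sender.
sent = []
report_failure("db_timeout", sent.append)
report_failure("db_timeout", sent.append)
```

Routing every incident through one function like this keeps the counter and the incidents in step, so the alert threshold can be defined on the counter rather than on PagerDuty's own data.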

Context

StackExchange DevOps Q#2215, answer score: 2
