How do you monitor and alert on thrashing containers?

Submitted by: @import:stackexchange-devops

Problem

Are there any proposed best practices for how to monitor for thrashing containers? We're in a situation where we have containers that might try to come up and will crash -> restart a few dozen times without anyone noticing. However, it's "normal" for containers to be so ephemeral, so tracking this sort of anomaly seems difficult (what if your workload really is just 30 seconds?)

I'm just wondering if there are any "good enough" alerting practices that anyone else has come up with for tracking unhealthy restarts vs healthy restarts?

Solution

Regardless of whether containers are ephemeral, there are two things you can treat as abnormal:

  • Non-zero exit codes
  • Restart count > 0 at the orchestration layer

How you alert on these will vary widely between orchestration layers, providers, and monitoring tools, so I'll stick to the principles:

Non-zero exit codes

This one seems obvious enough. If you have an ephemeral workload, make sure it exits cleanly (exit code 0) when it finishes; any non-zero exit then becomes a reliable signal that something went wrong. Even if each workload lasts only 30 seconds and you launch thousands of them (small, short-lived worker style), as long as everything exits cleanly, there's nothing to worry about.
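As a rough illustration of the exit-code check, here is a minimal Python sketch that assumes you already have a pod object as a dict, for example one item from `kubectl get pods -o json`. The function name `failed_terminations` is made up for this example; the `containerStatuses`, `state`, `lastState`, `terminated` and `exitCode` fields come from the Kubernetes pod status. Other orchestrators will expose the same idea under different names.

```python
from typing import Any


def failed_terminations(pod: dict[str, Any]) -> list[tuple[str, int]]:
    """Return (container name, exit code) for every non-zero termination.

    Hypothetical helper: `pod` is assumed to be a Kubernetes pod object
    already parsed from JSON into a dict.
    """
    failures: list[tuple[str, int]] = []
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for status in statuses:
        # A container can be terminated right now ("state") or have been
        # terminated just before its most recent restart ("lastState").
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated")
            if terminated and terminated.get("exitCode", 0) != 0:
                failures.append((status["name"], terminated["exitCode"]))
    return failures
```

A short-lived worker that finished with exit code 0 yields an empty list, so only the containers that actually failed would reach your alerting.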

Restart count at the orchestration layer

Most orchestration layers expose a restart count. In Kubernetes, it's literally the 'Restart Count' field shown for each container when you describe a pod. These restarts are the orchestration layer restarting a container because it was in a 'stopped' state when it wasn't supposed to be. Ephemeral workloads won't accumulate a restart count by their very nature (or, if they do, the "final" shutdown of the container when the workload is done won't cause a restart).
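For Kubernetes specifically, the same number is available as `restartCount` on each entry of `status.containerStatuses`. The sketch below assumes you feed it the parsed output of `kubectl get pods -o json` (a list object with an `items` array). The helper names and the default threshold of 5 are placeholders for illustration, not recommendations.

```python
from typing import Any


def restart_counts(pod_list: dict[str, Any]) -> dict[str, int]:
    """Map pod name -> total restarts summed over its containers.

    Hypothetical helper: `pod_list` is assumed to be the parsed JSON of
    `kubectl get pods -o json`.
    """
    counts: dict[str, int] = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        statuses = (pod.get("status") or {}).get("containerStatuses") or []
        counts[name] = sum(s.get("restartCount", 0) for s in statuses)
    return counts


def pods_over_threshold(pod_list: dict[str, Any], threshold: int = 5) -> list[str]:
    """Return the names of pods whose restart total exceeds `threshold`."""
    counts = restart_counts(pod_list)
    return sorted(name for name, count in counts.items() if count > threshold)
```

In practice you would probably let your monitoring stack scrape this value rather than poll it yourself, but the comparison it performs is the same.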

Whether you alert as soon as the restart count crosses a certain threshold, or get more clever and calculate a rate (roughly: restart count / pod or service lifetime), is up to you.
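The rate idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the limit of 3 restarts per hour, and the minimum of 3 restarts before anything is flagged (so that a single restart on a very young pod doesn't look like a huge rate).

```python
def restarts_per_hour(restart_count: int, age_seconds: float) -> float:
    """Restart rate: restart count divided by the pod's lifetime in hours."""
    if age_seconds <= 0:
        raise ValueError("age_seconds must be positive")
    return restart_count / (age_seconds / 3600.0)


def is_thrashing(
    restart_count: int,
    age_seconds: float,
    max_per_hour: float = 3.0,  # hypothetical limit, tune for your workload
    min_restarts: int = 3,      # hypothetical floor to ignore one-off restarts
) -> bool:
    """Flag a pod only if it restarted often enough AND fast enough."""
    if restart_count < min_restarts:
        return False
    return restarts_per_hour(restart_count, age_seconds) > max_per_hour
```

With these placeholder values, a pod that restarted 30 times in its first ten minutes is flagged, while one that restarted 4 times over a week is not.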

Context

StackExchange DevOps Q#8192, answer score: 3
