How do I troubleshoot missing data in my Prometheus database?

Submitted by: @import:stackexchange-devops

Problem

I have been gradually integrating Prometheus into my monitoring workflows, in order to gather detailed metrics about running infrastructure.

During this, I have noticed that I often run into a peculiar issue: sometimes an exporter that Prometheus is supposed to pull data from becomes unresponsive, either because a network misconfiguration has made it inaccessible or simply because the exporter crashed.

Whatever the reason, I find that some of the data I expect to see in Prometheus is missing, and there is nothing in the series for a certain time period. Sometimes, one exporter failing (timing out?) also seems to cause others to fail (perhaps the first timeout pushed the entire job above a top-level timeout? I am just speculating).

All I see is a gap in the series, as shown in the visualization above. There is nothing in the log when this happens, and Prometheus's self-metrics also seem fairly barren. I have had to resort to manually replicating what Prometheus is doing and seeing where it breaks. This is irksome. There must be a better way! While I do not need realtime alerts, I at least want to be able to see that an exporter failed to deliver data. Even a boolean "hey check your data" flag would be a start.

How do I obtain meaningful information about Prometheus failing to obtain data from exporters? How do I understand why gaps exist without having to perform a manual simulation of Prometheus data gathering? What are the sensible practices in this regard, perhaps even when extended to monitoring data collections in general, beyond Prometheus?

Solution

I think you can do some kind of alerting on a metric rate with something like this:

ALERT DropInMetricsFromExporter
  IF rate(<metric_name>[1m]) == 0
  FOR 3m
  ANNOTATIONS {
    summary = "Rate of metrics is 0 {{ $labels.<your_label> }}",
    description = "Rate of metric dropped, actually: {{ $value }}%",
  }


The main idea is to alert whenever the metric's rate stays at 0 for 3 minutes. With the proper metric name and a label identifying which exporter the series comes from, the alert should give you the correct information.
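
To make the semantics of that rule concrete, here is a minimal Python sketch of the check it performs. It is not Prometheus's actual implementation: the names `simple_rate` and `alert_fires`, the 15-second evaluation step, and the sample data are all assumptions made for illustration, and the real `rate()` also handles counter resets and extrapolation, which this ignores.

```python
from typing import List, Optional, Tuple

# One scraped sample: (unix timestamp in seconds, counter value).
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample], now: float, window: float = 60.0) -> Optional[float]:
    """Simplified per-second rate over the trailing window.

    Assumes samples are sorted by time with distinct timestamps.
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def alert_fires(samples: List[Sample], now: float, hold: float = 180.0,
                step: float = 15.0, window: float = 60.0) -> bool:
    """True if the rate was exactly 0 at every evaluation in the last `hold` seconds.

    Mirrors `IF rate(...[1m]) == 0 FOR 3m`; `step` is a hypothetical
    evaluation interval.
    """
    t = now - hold
    while t <= now:
        # A missing rate (None) does not equal 0, so it breaks the streak.
        if simple_rate(samples, t, window) != 0:
            return False
        t += step
    return True


# Hypothetical series scraped every 15 seconds.
stalled = [(float(t), 42.0) for t in range(0, 301, 15)]      # counter stopped moving
growing = [(float(t), float(t)) for t in range(0, 301, 15)]  # counter keeps increasing
gap = [(float(t), 42.0) for t in range(0, 91, 15)]           # scrapes stop after t=90

print(alert_fires(stalled, now=300.0))  # True
print(alert_fires(growing, now=300.0))  # False
print(alert_fires(gap, now=300.0))      # False
```

In this sketch a window with fewer than two samples produces no rate at all, so a counter that stops moving trips the check but a series that stops arriving does not. As far as I know, `rate()` in Prometheus likewise returns nothing for such a window, which is worth keeping in mind when picking the metric.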

Choosing the right metric to monitor for each exporter can be complex; without more insight into your setup, it is hard to give better advice.

This blog post could also serve as inspiration for a more generic detection.

Code Snippets

ALERT DropInMetricsFromExporter
  IF rate(<metric_name>[1m]) == 0
  FOR 3m
  ANNOTATIONS {
    summary = "Rate of metrics is 0 {{ $labels.<your_label> }}",
    description = "Rate of metric dropped, actually: {{ $value }}%",
}

Context

StackExchange DevOps Q#149, answer score: 5
