pattern, javascript, Moderate
Histogram bucket design: choosing boundaries that match your latency distribution
histogram buckets, histogram_quantile, bucket boundaries, SLO boundary, p99 accuracy, le label, explicit boundaries, native histogram
Problem
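Prometheus histogram_quantile() returns inaccurate percentile estimates when the bucket boundaries do not match the actual distribution of observed values. Typical symptoms: p99 latency sits pinned at the largest finite bucket boundary because the slowest requests land in the +Inf bucket, or it falls inside one very wide bucket and barely reflects real changes in latency.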
Solution
Design histogram bucket boundaries to bracket the expected range of values with appropriate resolution. More buckets near the SLO boundary give better accuracy.
For HTTP latency (target < 200ms SLO):
const promClient = require('prom-client');

const latency = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2, 5, 10],
});
// Buckets in seconds: 5ms, 10ms, 25ms, 50ms, 100ms, 200ms (SLO), 300ms, 500ms, 1s, 2s, 5s, 10s
Rules:
- Include a bucket at your SLO boundary (0.2 for 200ms SLO)
- Use more buckets in the range where most requests fall
- Include an upper bucket well above the maximum expected value
- Keep bucket count under 15 — each bucket is a separate time series
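The mechanical rules above can be checked in code before a metric ships. The following is a minimal sketch, not part of prom-client: checkBuckets and its option names (slo, maxExpected, maxCount) are hypothetical. It does not check the "more buckets where most requests fall" rule, since that depends on your observed traffic.

```javascript
// Hypothetical helper (not a prom-client API): returns a list of rule violations.
function checkBuckets(buckets, { slo, maxExpected, maxCount = 15 }) {
  const problems = [];
  // Cumulative buckets only make sense with strictly increasing boundaries.
  for (let i = 1; i < buckets.length; i++) {
    if (buckets[i] <= buckets[i - 1]) {
      problems.push(`boundaries not strictly increasing at index ${i}`);
    }
  }
  // Rule: include a bucket exactly at the SLO boundary.
  if (!buckets.includes(slo)) {
    problems.push(`no bucket at SLO boundary ${slo}`);
  }
  // Rule: the top finite bucket should sit above the maximum expected value.
  if (buckets.length === 0 || buckets[buckets.length - 1] <= maxExpected) {
    problems.push(`top bucket is not above max expected value ${maxExpected}`);
  }
  // Rule: keep the bucket count under the limit.
  if (buckets.length >= maxCount) {
    problems.push(`${buckets.length} buckets, expected fewer than ${maxCount}`);
  }
  return problems;
}

const httpBuckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2, 5, 10];
const okResult = checkBuckets(httpBuckets, { slo: 0.2, maxExpected: 3 });
const badResult = checkBuckets([0.1, 0.5, 1], { slo: 0.2, maxExpected: 3 });
console.log(okResult);  // no violations
console.log(badResult); // missing SLO bucket, top bucket too low
```

Running a check like this in a unit test catches a missing SLO boundary before dashboards depend on the metric.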
For OTel histograms, use explicit advice boundaries:
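const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('http-server'); // meter name is a placeholder
const histogram = meter.createHistogram('http.request.duration', {
  unit: 'ms',
  advice: { explicitBucketBoundaries: [5, 10, 25, 50, 100, 200, 500, 1000, 5000] }, // in ms
});
Why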
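histogram_quantile interpolates linearly within the bucket that contains the target rank, assuming observations are spread evenly across that bucket. If the p99 falls in a very wide bucket, the estimate can be off by up to the width of the bucket; if it falls in the +Inf bucket, the function can only return the highest finite boundary. Narrow buckets around the SLO boundary keep the estimate accurate where it matters.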
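To see how bucket width drives the error, here is a simplified model of the interpolation for classic histograms. It is a sketch for illustration, not Prometheus source code: estimateQuantile and the sample counts are made up, and it assumes 0 < q <= 1 with cumulative counts sorted by le and ending in +Inf.

```javascript
// Simplified model of linear interpolation over cumulative buckets.
// buckets: [{ le: upperBound, count: cumulativeCount }], last le must be Infinity.
function estimateQuantile(q, buckets) {
  if (buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank lands in the +Inf bucket: only the highest finite boundary is known.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread uniformly across the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Made-up data: 1000 requests, 980 under 200ms, the other 20 all near 250ms.
const coarse = [
  { le: 0.2, count: 980 }, { le: 1, count: 1000 }, { le: Infinity, count: 1000 },
];
const fine = [
  { le: 0.2, count: 980 }, { le: 0.3, count: 1000 }, { le: 0.5, count: 1000 },
  { le: 1, count: 1000 }, { le: Infinity, count: 1000 },
];
const overflow = [
  { le: 0.2, count: 900 }, { le: 1, count: 950 }, { le: Infinity, count: 1000 },
];
console.log(estimateQuantile(0.99, coarse));   // about 0.6: wide bucket overstates p99
console.log(estimateQuantile(0.99, fine));     // about 0.25: close to the real value
console.log(estimateQuantile(0.99, overflow)); // 1: pinned at the top finite boundary
```

The same 20 slow requests yield a p99 of roughly 600ms or 250ms depending only on whether a 0.3 boundary exists.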
Gotchas
- Prometheus native histograms (experimental) use sparse exponential buckets instead of hand-picked boundaries, which largely avoids this problem
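- The _bucket metric has one time series per le boundary, including the implicit le="+Inf" bucket, and each histogram also exports _sum and _count: 12 explicit boundaries = 13 bucket series + 2 = 15 time series per label combination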
- Don't use the same bucket boundaries for all metric types — DB query latency (0-10ms) needs different buckets than page render (0-5000ms)
- Changing bucket boundaries is a breaking change: recording rules, alerts, and dashboards that select a specific le value or aggregate buckets across instances will give wrong results while old and new boundaries coexist
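The cardinality cost from the gotchas above is easy to estimate up front. This is a back-of-the-envelope sketch for classic histograms; histogramSeriesCount is a hypothetical helper, and the label-combination count is an assumed input you would take from your own label sets.

```javascript
// Series exported by one classic histogram, per label combination:
// one _bucket series per explicit boundary, plus the implicit le="+Inf" bucket,
// plus the _sum and _count series.
function histogramSeriesCount(boundaryCount, labelCombinations) {
  const perCombination = boundaryCount + 1 + 2;
  return perCombination * labelCombinations;
}

const single = histogramSeriesCount(12, 1);   // the 12-boundary HTTP example, no labels
const labelled = histogramSeriesCount(12, 50); // e.g. 10 routes x 5 status classes
console.log(single, labelled);
```

With 50 label combinations, the 12-boundary histogram already produces several hundred series, which is why adding buckets and adding labels should be budgeted together.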
Context
Configuring histogram metrics for accurate percentile calculations in Prometheus