pattern, javascript, Moderate

Histogram bucket design: choosing boundaries that match your latency distribution

Submitted by: @seed
histogram buckets, histogram_quantile, bucket boundaries, SLO boundary, p99 accuracy, le label, explicit boundaries, native histogram

Problem

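Prometheus histogram_quantile() returns inaccurate percentile estimates when the bucket boundaries do not match the actual distribution of observed values. Typical symptoms: p99 latency sits pinned at the highest finite bucket boundary (the quantile fell into the +Inf bucket), or it jumps around inside one very wide bucket.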

Solution

Design histogram bucket boundaries to bracket the expected range of values with appropriate resolution. More buckets near the SLO boundary give better accuracy.

For HTTP latency (target < 200ms SLO):
const promClient = require('prom-client');

const latency = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2, 5, 10],
});
// Buckets in seconds: 5ms, 10ms, 25ms, 50ms, 100ms, 200ms(SLO), 300ms, 500ms, 1s, 2s, 5s, 10s


Rules:
  1. Include a bucket at your SLO boundary (0.2 for 200ms SLO)
  2. Use more buckets in the range where most requests fall
  3. Include an upper bucket well above the maximum expected value
  4. Keep bucket count under 15 — each bucket is a separate time series
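The rules above can be turned into a quick sanity check before a bucket list ships. This is a minimal sketch: checkBuckets and its option names (slo, maxExpected, maxCount) are hypothetical and not part of prom-client. Rule 2 is not checked because it needs real traffic data.

```javascript
// Hypothetical helper (not a prom-client API): returns a list of rule violations.
function checkBuckets(buckets, { slo, maxExpected, maxCount = 15 }) {
  const problems = [];

  // Boundaries should be strictly increasing, since buckets are cumulative.
  for (let i = 1; i < buckets.length; i++) {
    if (buckets[i] <= buckets[i - 1]) {
      problems.push(`boundaries not strictly increasing at index ${i}`);
    }
  }

  // Rule 1: a bucket must sit exactly on the SLO boundary.
  if (!buckets.includes(slo)) {
    problems.push(`no bucket at SLO boundary ${slo}`);
  }

  // Rule 3: the top finite bucket should be above the largest expected value.
  if (buckets.length === 0 || buckets[buckets.length - 1] <= maxExpected) {
    problems.push(`top bucket does not exceed expected maximum ${maxExpected}`);
  }

  // Rule 4: keep the bucket count under the limit.
  if (buckets.length >= maxCount) {
    problems.push(`${buckets.length} buckets, expected fewer than ${maxCount}`);
  }

  return problems;
}

const httpBuckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2, 5, 10];
console.log(checkBuckets(httpBuckets, { slo: 0.2, maxExpected: 3 })); // []
```

The maxExpected value of 3 seconds is an assumed figure for illustration; substitute whatever your own traffic shows.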



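For OTel histograms, pass explicit bucket boundaries through the advice option and declare the unit so the boundaries are unambiguous:
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('http-server'); // meter name is illustrative
const histogram = meter.createHistogram('http.request.duration', {
  unit: 'ms',
  advice: { explicitBucketBoundaries: [5, 10, 25, 50, 100, 200, 500, 1000, 5000] }, // in ms
});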

Why

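histogram_quantile() assumes observations are spread evenly within a bucket and interpolates linearly between that bucket's lower and upper bounds, so the estimate can be off by as much as the width of the bucket the quantile lands in. If the p99 lands in the top +Inf bucket, the function can only return the highest finite boundary, which says nothing about the real value. Narrow buckets around the SLO boundary keep the error small where it matters.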
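To see the effect of bucket width, here is a simplified model of the interpolation in plain JavaScript. It is a sketch of the idea, not the Prometheus source: estimateQuantile is a hypothetical name, the sample counts are invented, and the input is assumed to have at least one finite bucket, end with le = Infinity, and be queried with 0 < q <= 1.

```javascript
// Simplified model of quantile estimation over classic cumulative buckets.
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le,
// with the last entry being le = Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  // First bucket whose cumulative count reaches the rank.
  const i = buckets.findIndex((b) => b.count >= rank);

  // Quantile landed in the +Inf bucket: only the highest finite bound is known.
  if (i === buckets.length - 1) return buckets[i - 1].le;

  // The lowest bucket is assumed to start at 0.
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;

  // Linear interpolation inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Same 1000 requests, two bucket layouts (counts are made up for illustration).
const coarse = [
  { le: 0.1, count: 900 },
  { le: 1, count: 1000 },
  { le: Infinity, count: 1000 },
];
const fine = [
  { le: 0.1, count: 900 },
  { le: 0.15, count: 960 },
  { le: 0.2, count: 995 },
  { le: 0.3, count: 1000 },
  { le: Infinity, count: 1000 },
];

console.log(estimateQuantile(0.99, coarse)); // about 0.91
console.log(estimateQuantile(0.99, fine));   // about 0.193
```

With the coarse layout the p99 is interpolated across a 0.1 to 1 second bucket and reports a 200ms SLO as badly violated; with the finer layout the same data yields an estimate just under the SLO boundary.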

Gotchas

  • Prometheus native histograms (experimental) use exponential buckets whose boundaries are derived automatically, which largely removes the need to hand-pick boundaries
  • The _bucket metric has one time series per le boundary plus one for le="+Inf", so 15 buckets = 16 _bucket series per label combination, on top of the _sum and _count series
  • Don't use the same bucket boundaries for all metric types — DB query latency (0-10ms) needs different buckets than page render (0-5000ms)
  • Changing bucket boundaries is a breaking change: recording rules and dashboards that reference a specific le value (for example le="0.2") or aggregate buckets across instances with different boundaries will give wrong results
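
The cardinality cost in the second gotcha is easy to estimate up front. A rough sketch, assuming a classic histogram that exposes one _bucket series per configured boundary, one for le="+Inf", plus _sum and _count; seriesPerHistogram is a hypothetical helper and the label counts are invented:

```javascript
// Rough estimate of time series produced by one classic histogram.
// Per label combination: configured buckets + the +Inf bucket + _sum + _count.
function seriesPerHistogram(bucketCount, labelCombinations) {
  return (bucketCount + 1 + 2) * labelCombinations;
}

// Example: 12 buckets, labelled by 5 routes x 4 methods x 6 status codes.
const combos = 5 * 4 * 6;
console.log(seriesPerHistogram(12, combos)); // 1800
```

Label cardinality multiplies the cost of every extra bucket, so trimming labels often buys more room for buckets near the SLO boundary than trimming the bucket list itself.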

Context

Configuring histogram metrics for accurate percentile calculations in Prometheus
