pattern | Moderate | pending
Reproducing intermittent bugs — strategies for flaky issues
intermittent bug, flaky, race condition, reproduce, thread sanitizer, instrumentation
nodejs, python, linux
Problem
A bug occurs sporadically and can't be reproduced on demand. It shows up in production but not in development, the steps that trigger it are unknown, and investigation keeps hitting dead ends.
Solution
(1) Gather data: record every occurrence with its timestamp, affected users, request payload, and system metrics at the time. Look for patterns in time of day, load level, specific inputs, and specific infrastructure.
(2) Timing-related: add timestamped logging around the suspected area, and control the clock in tests (mock or freeze time) so that time-dependent paths can be triggered deliberately.
(3) Concurrency-related: for native code such as C or C++, run under a thread sanitizer (TSan); in any language, increase parallelism in tests and inject random delays to widen race windows.
(4) Resource-related: test under memory pressure, disk pressure, and connection limits.
(5) Data-related: check for specific data patterns such as Unicode, empty strings, very large values, and nulls.
(6) Write down a hypothesis, add instrumentation to confirm or refute it, and repeat.
(7) If you still can't reproduce it: add enough telemetry that the next occurrence captures the data you need.
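As a rough illustration of step (3), the Python sketch below raises parallelism and injects a random delay into a suspected read-modify-write so that a rare lost update becomes frequent. The `Counter` class, the `stress` helper, and the `max_jitter_s` parameter are hypothetical names invented for this example, standing in for whatever shared state is actually under suspicion.

```python
import random
import threading
import time


class Counter:
    """Hypothetical shared state with a non-atomic increment (read, pause, write)."""

    def __init__(self, lock=None, max_jitter_s=0.0):
        self.value = 0
        self._lock = lock
        self._max_jitter_s = max_jitter_s

    def _read_modify_write(self):
        current = self.value
        # Injected random delay: widens the window between the read and the
        # write, so an interleaving that is rare in production happens often.
        if self._max_jitter_s > 0:
            time.sleep(random.uniform(0, self._max_jitter_s))
        self.value = current + 1

    def increment(self):
        if self._lock is None:
            self._read_modify_write()
        else:
            with self._lock:
                self._read_modify_write()


def stress(counter, threads=8, increments_per_thread=25):
    """Hammer the counter from several threads and return the final value."""

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value


if __name__ == "__main__":
    expected = 8 * 25
    unsafe = stress(Counter(max_jitter_s=0.001))
    safe = stress(Counter(lock=threading.Lock(), max_jitter_s=0.001))
    print(f"expected={expected} unsafe={unsafe} safe={safe}")
```

If the unlocked run falls short of the expected total while the locked run matches it, that supports the race hypothesis. In real code the delay would normally sit behind a test-only flag so that it never ships to production.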
Why
Intermittent bugs are usually caused by timing (race conditions), resource exhaustion (memory, connections), specific data patterns, or environmental differences. Systematic data collection narrows the search space.
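To make the narrowing concrete, here is a small Python sketch that tallies recorded occurrences along a few dimensions, as in step (1). The incident records, field names (`ts`, `host`, `payload_bytes`), and values are made up for illustration; real data would come from your own logs or error tracker.

```python
from collections import Counter
from datetime import datetime

# Hypothetical incident records; every field name and value is illustrative.
INCIDENTS = [
    {"ts": "2024-03-01T02:14:09", "host": "web-3", "payload_bytes": 120},
    {"ts": "2024-03-02T02:47:51", "host": "web-3", "payload_bytes": 98},
    {"ts": "2024-03-04T14:05:33", "host": "web-1", "payload_bytes": 131072},
    {"ts": "2024-03-05T02:31:02", "host": "web-3", "payload_bytes": 110},
    {"ts": "2024-03-07T02:03:40", "host": "web-2", "payload_bytes": 105},
]


def tally(incidents, key_fn):
    """Count incidents per bucket, most frequent bucket first."""
    return Counter(key_fn(incident) for incident in incidents).most_common()


def hour_of_day(incident):
    """Bucket an incident by the hour (0-23) of its timestamp."""
    return datetime.fromisoformat(incident["ts"]).hour


def host(incident):
    """Bucket an incident by the host that served it."""
    return incident["host"]


if __name__ == "__main__":
    print("by hour:", tally(INCIDENTS, hour_of_day))
    print("by host:", tally(INCIDENTS, host))
```

A bucket that dominates the tally only becomes a lead once it is compared against normal traffic: if one host also serves most requests, it will naturally collect most incidents. A cluster that survives that comparison, such as a particular hour that coincides with a scheduled job, is a good candidate for the hypothesis in step (6).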