This is a fascinating approach to finding hard-to-reproduce
bugs caused by event interleaving.
I'm particularly interested in it
because rr record and replay plus chaos
mode is directly applicable to
Go programs, whereas deterministic simulation
testing (DST) is next to impossible in a Go program
using more than 4GB of memory (like most of my
programs), because that much memory rules out wasm.
In contrast to DST, the rr+chaos approach
accepts that you will be randomly
sampling executions, but by recording all of them you
can still get reproducibility when you do hit the issue.
rr is very efficient at recording. Green test runs can be quickly
discarded.
In a blog post from 2016, Robert O'Callahan, one of the principal rr authors,
talks about the design of rr's chaos mode for provoking hard-to-find
concurrency bugs:
> To cut a long story short, here's an approach that works.
> Use just two thread priorities, "high" and "low". Make
> most threads high-priority; I give each thread a 0.1
> probability of being low priority. Periodically re-randomize
> thread priorities. Randomize timeslice lengths.
>
> Here's the good part: periodically choose a short random interval,
> up to a few seconds long, and during that interval do not
> allow low-priority threads to run at all, even if they're
> the only runnable threads. Since these intervals can
> prevent all forward progress (no control of priority inversion),
> limit their length to no more than 20% of total run time.
>
> The intuition is that many of our intermittent test failures
> depend on CPU starvation (e.g. a machine temporarily
> hanging), so we're emulating intense starvation of a few
> "victim" threads, and allowing high-priority threads to
> wait for timeouts or input from the environment
> without interruption.
>
> With this approach, rr can reproduce my bug in
> several runs out of a thousand. I've also been able [...]
> we've been chasing for a while. A couple of other
> people have found this enabled reproducing their
> bugs. I'm sure there are still bugs this approach
> can't reproduce, but it's good progress.
>
> I just landed all this work on rr master. The
> normal scheduler doesn't do this randomization,
> because it reduces throughput, i.e. slows down
> recording for easy-to-reproduce bugs.
> Run rr record -h to enable chaos mode for
> hard-to-reproduce bugs.
Links to more info and background on rr:
Enjoy.
- Jason