rr's chaos mode for hard to reproduce bugs

Jason E. Aten

Jun 4, 2025, 7:18:45 AM

to golang-nuts

This is a fascinating approach to finding hard-to-reproduce bugs
caused by event interleaving.

I'm particularly interested in this approach because rr record and
replay plus chaos mode is directly applicable to Go programs, whereas
deterministic simulation testing (DST) is next to impossible in a Go
program using more than 4 GB of memory (like most of my programs),
because that rules out wasm.

In contrast to DST, the rr+chaos approach accepts that you will be
randomly sampling executions, but by recording all of them you can
still get reproducibility when you do hit the issue. rr is very
efficient at recording, and green test runs can be quickly discarded.

In a blog post from 2016, Robert O'Callahan, one of the principal rr
authors, talks about the design of rr's chaos mode for provoking
hard-to-find concurrency bugs:

> To cut a long story short, here's an approach that works.
> Use just two thread priorities, "high" and "low". Make
> most threads high-priority; I give each thread a 0.1
> probability of being low priority. Periodically re-randomize
> thread priorities. Randomize timeslice lengths.
>
> Here's the good part: periodically choose a short random interval,
> up to a few seconds long, and during that interval do not
> allow low-priority threads to run at all, even if they're
> the only runnable threads. Since these intervals can
> prevent all forward progress (no control of priority inversion),
> limit their length to no more than 20% of total run time.
>
> The intuition is that many of our intermittent test failures
> depend on CPU starvation (e.g. a machine temporarily
> hanging), so we're emulating intense starvation of a few
> "victim" threads, and allowing high-priority threads to
> wait for timeouts or input from the environment
> without interruption.
>
> With this approach, rr can reproduce my bug in
> several runs out of a thousand. I've also been able
> to reproduce a top intermittent (now being fixed),
> an intermittent test failure that was assigned to me,
> and an intermittent shutdown hang in IndexedDB
> we've been chasing for a while. A couple of other
> people have found this enabled reproducing their
> bugs. I'm sure there are still bugs this approach
> can't reproduce, but it's good progress.
>
> I just landed all this work on rr master. The
> normal scheduler doesn't do this randomization,
> because it reduces throughput, i.e. slows down
> recording for easy-to-reproduce bugs.
> Run rr record -h to enable chaos mode for
> hard-to-reproduce bugs.

 -- https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html

Links to more info and background on rr:

https://rr-project.org/
https://github.com/rr-debugger/rr
https://github.com/rr-debugger/rr/wiki/Usage
https://github.com/rr-debugger/rr/wiki/Testimonials
https://github.com/rr-debugger/rr/wiki/Building-And-Installing
https://arxiv.org/pdf/1705.05937
https://fitzgen.com/2015/11/02/back-to-the-futurre.html
https://www.percona.com/blog/replay-the-execution-of-mysql-with-rr-record-and-replay/
https://www.youtube.com/watch?v=61kD3x4Pu8I

Robert's talk, "Taming Non-determinism", from 9 years ago is a good
technical introduction to rr:

https://www.youtube.com/watch?v=H4iNuufAe_8

NB: The Delve debugger for Go supports rr, so you can get goroutine stack traces when replaying a recording.

Enjoy.

- Jason

Jason E. Aten

Jun 5, 2025, 4:28:07 PM

to golang-nuts

Hmm. Maybe rr found a bug in the runtime GC code(?)... or maybe it will
be hard to use rr on Go. Let's see what the runtime folks say on this
issue:

https://github.com/golang/go/issues/74019

cpasmaboiteaspam

Jun 6, 2025, 2:05:35 AM

to golang-nuts

Hello Jason,

Intuitively, playing with the timing of goroutines, manually injecting
delays where none exist, is what newcomers do to experiment with
deadlocks or simply to build a better understanding; at least I did
that very often. Therefore your suggestion definitely makes sense to
me, from the perspective of a non-low-level programmer.

(Not a useful email; I only want to show some support for your various
posts, which I read with lots of interest. Others' too... but
physically speaking it is hard to follow everybody. I liked the last
post from Robert Griesemer, and I like all the posts on the Go blog
anyway... and the coroutines... it's endless... =)

Thank you,
Thank you all.

Jason E. Aten

Jun 6, 2025, 3:32:46 AM

to golang-nuts

Thanks for your note of appreciation, cpasmaboiteaspam.
