Description
Go version
go version go1.24.2 linux/riscv64
Output of go env
in your module/workspace:
AR='ar'
CC='riscv64-unknown-linux-gnu-gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='riscv64-unknown-linux-gnu-g++'
GCCGO='gccgo'
GO111MODULE=''
GOARCH='riscv64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -pthread -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1454745784=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='riscv64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/tmp/benchmark_demo/go.mod'
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GORISCV64='rva20u64'
GOROOT='/usr/lib/go'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/lib/go/pkg/tool/linux_riscv64'
GOVCS=''
GOVERSION='go1.24.2'
GOWORK=''
PKG_CONFIG='pkg-config'
What did you do?
I was studying branch prediction behavior on real RISC-V hardware (StarFive VisionFive 2, JH7110 => 4x SiFive U74-MC) by creating two nearly identical benchmark functions that differ only in data preparation before b.ResetTimer():
- Created a minimal test case with two benchmark functions: BenchmarkSortedData calls sort.Ints(data) before b.ResetTimer(); BenchmarkUnsortedData is the same function without the sort call
- Both functions have identical benchmark loops after b.ResetTimer()
- Ran the benchmarks with default optimizations: go test -bench=. riscv_bug_test.go
The complete Go source file and the generated RISC-V assembly are attached for full analysis.
What did you see happen?
1. Performance Results (Problematic)
BenchmarkSortedData-4 6843 175703 ns/op (SLOW - 4x slower!)
BenchmarkUnsortedData-4 27356 43874 ns/op (FAST)
2. Assembly Generation Issues
The compiler generated drastically different assembly for the two functions:
BenchmarkSortedData: 152 bytes, 48-byte stack frame ($40-8)
BenchmarkUnsortedData: 124 bytes, 24-byte stack frame ($16-8)
The benchmark loops after b.ResetTimer() are identical in the Go source, but the compiler:
- Uses different stack layouts (56(SP) vs 32(SP) offsets)
- Applies different inlining strategies
- Generates different register allocation patterns
Verification with optimizations disabled
When running with go test -bench=. -gcflags="-N -l" riscv_bug_test.go, the performance difference becomes consistent with the hardware:
BenchmarkSortedData-4 975 1229971 ns/op (Predictable branches - faster)
BenchmarkUnsortedData-4 858 1397883 ns/op (Unpredictable branches - slower)
This shows the 4x artificial difference disappears once optimizations are disabled, revealing the real ~14% difference caused by CPU branch prediction behavior.
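The ~14% figure follows directly from the two -N -l measurements:

```go
package main

import "fmt"

func main() {
	// ns/op measured with -gcflags="-N -l" (from the results above).
	sorted := 1229971.0
	unsorted := 1397883.0
	// Relative slowdown of the unpredictable-branch case.
	fmt.Printf("real slowdown: %.1f%%\n", (unsorted/sorted-1)*100) // prints "real slowdown: 13.7%"
}
```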
Expected Assembly Behavior
Both functions should generate similar assembly code, since their Go source is identical after b.ResetTimer(). The compiler should:
- Use similar stack layouts (same frame size and local variable allocation)
- Apply consistent optimization strategies for the identical benchmark loops
- Generate comparable code size (within a few bytes)
- Respect the b.ResetTimer() optimization boundary: code before the timer reset should not influence code generation after it
Expected Performance Results
The performance difference should reflect real CPU behavior (branch prediction effects), approximately:
BenchmarkSortedData-4 ~1000 ~1200000 ns/op (Predictable branches)
BenchmarkUnsortedData-4 ~900 ~1400000 ns/op (Unpredictable branches)
This would show a realistic ~15% difference due to branch misprediction penalties on the SiFive U74-MC, not an artificial 4x compiler-generated difference.
Expected consistency
The same optimization level should produce the same code structure for logically equivalent functions, regardless of data preparation steps that occur before the measurement boundary.
Root cause analysis (updated)
After further investigation, with collaboration from other AI systems, the root cause appears to be related to the inlining budget heuristics in the RISC-V backend.
The presence of sort.Ints(data) before b.ResetTimer() causes the compiler to:
- Perceive higher function complexity due to the sort operation's internal loops and calls
- Consume inlining budget during the analysis phase
- Adopt different optimization strategies for subsequent code, including the benchmark loop
- Allocate larger stack frames as a preventive measure (48 vs 24 bytes)
- Generate different register allocation patterns due to the altered stack layout
This creates a cascade effect where identical Go code after b.ResetTimer() produces different assembly, due to compiler state changes from code that should not influence the measured performance.
Technical Impact
- Different stack frame sizes: $40-8 vs $16-8
- Different register allocation strategies
- Different code generation for identical source code
- 4x artificial performance difference masking real CPU behavior
Related Issues
This appears related to #50821 (AMD64 register allocation inconsistency), suggesting a broader compiler optimization pipeline issue affecting multiple architectures.