
cmd/compile: riscv performance degradation #74606

Closed
@admnd

Description


Go version

go version go1.24.2 linux/riscv64

Output of go env in your module/workspace:

AR='ar'
CC='riscv64-unknown-linux-gnu-gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='riscv64-unknown-linux-gnu-g++'
GCCGO='gccgo'
GO111MODULE=''
GOARCH='riscv64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -pthread -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1454745784=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='riscv64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/tmp/benchmark_demo/go.mod'
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GORISCV64='rva20u64'
GOROOT='/usr/lib/go'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/root/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/lib/go/pkg/tool/linux_riscv64'
GOVCS=''
GOVERSION='go1.24.2'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

I was studying branch prediction behavior on real RISC-V hardware (StarFive VisionFive 2, JH7110 SoC with 4x SiFive U74-MC cores) by creating two nearly identical benchmark functions that differ only in the data preparation done before b.ResetTimer():

  1. Created a minimal test case with two benchmark functions:

    • BenchmarkSortedData: calls sort.Ints(data) before b.ResetTimer()
    • BenchmarkUnsortedData: same function without the sort call
  2. Both functions have identical benchmark loops after b.ResetTimer()

  3. Ran the benchmarks with default optimizations:

    go test -bench=. riscv_bug_test.go

Complete Go source file and generated RISC-V assembly have been attached for full analysis:

riscv_bug_test.go.txt

riscv_bug_test.S.txt
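
For reference, the two benchmarks have roughly the following shape (a minimal sketch only; the attached riscv_bug_test.go.txt is authoritative, and the slice size, threshold value, and helper names here are illustrative):

    package main

    import (
        "math/rand"
        "sort"
        "testing"
    )

    // sink keeps the result live so the compiler cannot eliminate the loop.
    var sink int

    const size = 64 * 1024 // illustrative; the attached file may use a different size

    func makeData() []int {
        data := make([]int, size)
        for i := range data {
            data[i] = rand.Intn(256)
        }
        return data
    }

    // Sorted input: the data-dependent branch in the loop becomes predictable.
    func BenchmarkSortedData(b *testing.B) {
        data := makeData()
        sort.Ints(data) // the only difference from BenchmarkUnsortedData
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            sum := 0
            for _, v := range data {
                if v >= 128 { // branch whose predictability is under test
                    sum += v
                }
            }
            sink = sum
        }
    }

    // Unsorted input: identical measured loop, unpredictable branch.
    func BenchmarkUnsortedData(b *testing.B) {
        data := makeData()
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            sum := 0
            for _, v := range data {
                if v >= 128 {
                    sum += v
                }
            }
            sink = sum
        }
    }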

What did you see happen?

1. Performance Results (Problematic)

BenchmarkSortedData-4       6843    175703 ns/op  (SLOW - 4x slower!)
BenchmarkUnsortedData-4    27356     43874 ns/op  (FAST)

2. Assembly Generation Issues
The compiler generated drastically different assembly for the two functions:

BenchmarkSortedData: 152 bytes, 48-byte stack frame ($40-8)
BenchmarkUnsortedData: 124 bytes, 24-byte stack frame ($16-8)

The benchmark loops after b.ResetTimer() are identical in Go source code, but the compiler:

  1. Uses different stack layouts (56(SP) vs 32(SP) offsets)
  2. Applies different inlining strategies
  3. Generates different register allocation patterns

Verification with disabled optimizations

When running with -gcflags="-N -l" (optimizations and inlining disabled), the relative performance matches the expected branch-prediction behavior:

BenchmarkSortedData-4        975   1229971 ns/op  (Predictable branches - faster)
BenchmarkUnsortedData-4      858   1397883 ns/op  (Unpredictable branches - slower)

This shows that the 4x artificial difference disappears when optimizations are disabled, revealing the real ~14% difference caused by CPU branch-prediction behavior.

Expected Assembly Behavior

Both functions should generate similar assembly code since they have identical Go source code after b.ResetTimer(). The compiler should:

  1. Use similar stack layouts (same frame size and local variable allocation)
  2. Apply consistent optimization strategies for the identical benchmark loops
  3. Generate comparable code size (within a few bytes)
  4. Respect the b.ResetTimer() optimization boundary - code before the timer reset should not influence code generation after it (a sketch probing this expectation follows this list)
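
One way to probe the last expectation empirically (a sketch under the assumption that the pre-timer preparation code is what perturbs code generation; it reuses makeData and sink from the sketch above, and the benchmark names are illustrative) is to move all data preparation to package level so that the two benchmark bodies become textually identical:

    // Data prepared once at package init, so neither benchmark body
    // contains any preparation code before b.ResetTimer().
    var (
        sortedData = func() []int {
            d := makeData()
            sort.Ints(d)
            return d
        }()
        unsortedData = makeData()
    )

    // The two benchmark bodies below are now textually identical apart
    // from the package-level slice they read.
    func BenchmarkSortedDataPrepped(b *testing.B) {
        data := sortedData
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            sum := 0
            for _, v := range data {
                if v >= 128 {
                    sum += v
                }
            }
            sink = sum
        }
    }

    func BenchmarkUnsortedDataPrepped(b *testing.B) {
        data := unsortedData
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            sum := 0
            for _, v := range data {
                if v >= 128 {
                    sum += v
                }
            }
            sink = sum
        }
    }

If these two variants compile to identical assembly and the 4x gap collapses to roughly the ~14% seen with optimizations disabled, that would confirm that the preparation code before b.ResetTimer(), not the measured loop itself, is what changes code generation.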

Expected Performance Results

The performance difference should reflect real CPU behavior (branch prediction effects), approximately:

BenchmarkSortedData-4        ~1000   ~1200000 ns/op  (Predictable branches)
BenchmarkUnsortedData-4       ~900   ~1400000 ns/op  (Unpredictable branches)

This would show a realistic ~15% difference due to branch misprediction penalties on the SiFive U74-MC, not an artificial 4x compiler-generated difference.

Expected consistency

The same optimization level should produce the same code structure for logically equivalent functions, regardless of data preparation steps that occur before the measurement boundary.

Root cause analysis (updated)

After further investigation (aided by other AI systems), the root cause appears to be related to the inlining budget heuristics as they apply to the RISC-V backend.

The presence of sort.Ints(data) before b.ResetTimer() causes the compiler to:

  1. Perceive higher function complexity due to the sort operation's internal loops and calls
  2. Consume inlining budget during the analysis phase
  3. Adopt different optimization strategies for subsequent code, including the benchmark loop
  4. Allocate larger stack frames as a preventive measure (48 vs 24 bytes)
  5. Generate different register allocation patterns due to the altered stack layout

This creates a cascade effect where identical Go code after b.ResetTimer() produces different assembly due to compiler state changes from code that shouldn't influence the measured performance.
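
If the altered inlining and stack-layout state of the enclosing benchmark function is indeed the cause, a possible workaround (a sketch, not part of the attached reproducer; the helper name is hypothetical and it reuses makeData and sink from the sketch above) is to force the measured loop into a single non-inlinable helper, so both benchmarks execute exactly the same generated loop code:

    // sumAbove is a hypothetical shared helper; //go:noinline keeps the
    // measured loop out of both benchmark functions, so the compiler's
    // per-function inlining and stack-frame decisions cannot differ between them.
    //go:noinline
    func sumAbove(data []int, threshold int) int {
        sum := 0
        for _, v := range data {
            if v >= threshold { // the data-dependent branch being measured
                sum += v
            }
        }
        return sum
    }

    func BenchmarkSortedDataHelper(b *testing.B) {
        data := makeData()
        sort.Ints(data)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            sink = sumAbove(data, 128)
        }
    }

    func BenchmarkUnsortedDataHelper(b *testing.B) {
        data := makeData()
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            sink = sumAbove(data, 128)
        }
    }

With the loop isolated this way, any remaining difference between the two benchmarks should reflect hardware branch prediction rather than per-function code generation; the assembly emitted for each function can be compared with -gcflags=-S.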

Technical Impact

  • Different stack frame sizes: $40-8 vs $16-8
  • Different register allocation strategies
  • Different code generation for identical source code
  • 4x artificial performance difference masking real CPU behavior

Related Issues

This appears related to #50821 (AMD64 register allocation inconsistency), suggesting a broader compiler optimization pipeline issue affecting multiple architectures.

Edits: typo fixes only; results unchanged.

Metadata

    Labels

    BugReport: Issues describing a possible bug in the Go implementation.
    NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
    arch-riscv: Issues solely affecting the riscv64 architecture.
    compiler/runtime: Issues related to the Go compiler and/or runtime.
