DRAFT : GPU memory validity backtrace #4845

Draft: wants to merge 3 commits into base: master


victor-decaria-nnl
Contributor

This is for discussion for now. I'm not sure where the best place is to put something like this.

@camierjs This adds a debug tool meant to be used together with the debug memory backend. The idea is that if you get an MmuError while running your program with the debug backend, you can preload this library to find out why it happens. The debug backend uses mprotect to enforce the validity of memory between host and device; this library intercepts and tracks those mprotect calls. The backend currently has some false positives, but as they are fixed I think a tool like this could make the debug backend even more useful for debugging GPU code. As far as I am aware, there are no existing tools for this. Tools like NVIDIA's compute-sanitizer probably aren't useful here, because they only detect access to unallocated or uninitialized memory, not whether values got out of sync between host and device.

To use it, set LD_PRELOAD=/path/to/mprotect_trace.so before executing your program.
If the library is in your current working directory, you may need to use
LD_PRELOAD=./mprotect_trace.so

It may be possible to rewrite this using libunwind, but I found it easier to get informative traces with libbacktrace.

Limitations:

  • Right now I only have this working with the CMake build. It requires Linux, libbacktrace, and C++17. I tested with GCC, but it may also work with Clang.
  • I probably need to refine the logic around the lock guards for thread safety, but that wasn't my focus.
  • This finds the offending protection call based on an address, a memory range, and the most recent timestamp matching those parameters. If the timestamp resolution is too coarse, multiple calls could meet the same criteria. This seems like an unlikely case, but it could be eliminated with more work.
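
The lookup described in the last limitation can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `ProtectRecord`, `ProtectLog`, `record`, and `find_latest` are hypothetical names, and a monotonic sequence counter stands in for the timestamp, which is one possible way to avoid ties between calls recorded at the same clock reading. A real tool would also store a captured backtrace with each record.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <optional>
#include <vector>

// Hypothetical record of one intercepted protection call.
struct ProtectRecord {
    std::uintptr_t addr;  // start address passed to the call
    std::size_t len;      // length of the range in bytes
    int prot;             // protection flags passed to the call
    std::uint64_t seq;    // monotonic sequence number (assumed stand-in for a timestamp)
};

// Hypothetical log of protection calls, guarded by a single mutex.
class ProtectLog {
public:
    // Append a record and return the sequence number assigned to it.
    std::uint64_t record(std::uintptr_t addr, std::size_t len, int prot) {
        assert(len > 0 && "an empty range cannot contain a faulting address");
        std::lock_guard<std::mutex> lock(mutex_);
        records_.push_back({addr, len, prot, next_seq_});
        return next_seq_++;
    }

    // Return the most recent record whose range [addr, addr + len) contains
    // the faulting address, or std::nullopt if no recorded range covers it.
    std::optional<ProtectRecord> find_latest(std::uintptr_t fault) const {
        std::lock_guard<std::mutex> lock(mutex_);
        std::optional<ProtectRecord> best;
        for (const ProtectRecord& r : records_) {
            // Written as a subtraction so that addr + len cannot overflow.
            const bool covers = fault >= r.addr && fault - r.addr < r.len;
            if (covers && (!best || r.seq > best->seq)) {
                best = r;
            }
        }
        return best;
    }

private:
    mutable std::mutex mutex_;
    std::vector<ProtectRecord> records_;
    std::uint64_t next_seq_ = 0;
};
```

Because every record gets a distinct sequence number, two calls on overlapping ranges can always be ordered, whereas a coarse clock timestamp could assign them the same value. The linear scan keeps the sketch short; an interval-based index would be a reasonable replacement if the log grows large.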

@victor-decaria-nnl victor-decaria-nnl added WIP Work in Progress GPU labels May 1, 2025
@victor-decaria-nnl victor-decaria-nnl self-assigned this May 1, 2025
@victor-decaria-nnl victor-decaria-nnl marked this pull request as draft May 1, 2025 15:55
@victor-decaria-nnl victor-decaria-nnl changed the title GPU memory validity backtrace DRAFT : GPU memory validity backtrace May 1, 2025