
Valgrind Memory Leak Checking #8954


Open: wiredfool wants to merge 24 commits into main from valgrind-leakcheck

Conversation

@wiredfool (Member) commented May 13, 2025

Changes proposed in this pull request:

  • Add support in the Makefile to use valgrind to check for definite leaks (a rough sketch of such a target follows this list).
  • Add some suppressions for Python object allocations during leak checks; Python itself is fairly noisy under leak checking.
  • Fix a leak in the Arrow schema where child schemas weren't released.
  • Fix a leak in webp encode on error
  • Fix a leak in UnsharpMask on error
  • Fix a leak in TiffEncode on error
  • Fix a leak in Font getmask on error
  • Fix a leak in JpegEncode on error
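
For illustration, a leak-checking Makefile target along these lines could look like the sketch below; the target name, the suppression-file path, and the exact flag set are placeholders, not necessarily what this PR adds:

```make
# Hypothetical leak-check target -- names and paths are placeholders.
# PYTHONMALLOC=malloc makes CPython use the system allocator so valgrind
# sees every allocation; --errors-for-leak-kinds=definite turns only
# definite leaks into errors, and --error-exitcode=1 fails the run on them.
valgrind-leak:
	PYTHONMALLOC=malloc valgrind \
		--tool=memcheck --leak-check=full \
		--show-leak-kinds=definite --errors-for-leak-kinds=definite \
		--suppressions=valgrind-python.supp --error-exitcode=1 \
		python3 -m pytest -vv Tests/
```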

To Do:

  • Consider how to run this periodically in CI. This is significantly slower than plain valgrind -- locally I had a 4 hour run at one point. That's too slow for merges or PRs, but we still need to see the results.
  • Consider setting the leak detection to something other than definite.
  • Review the Python suppressions. It's entirely possible that some of them mask real leaks -- they'd point to issues with our PyObject handling at the C level, such as returning from a function on an error path without decrefing (see the suppression sketch after this list).
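
For context, a valgrind suppression for a Python allocation stack looks roughly like this; the suppression name and the frames shown are illustrative, not copied from this branch:

```
{
   hypothetical_pymalloc_definite_leak
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   fun:_PyObject_Malloc
   ...
}
```

Any leak whose stack bottoms out in malloc called from _PyObject_Malloc (with any further callers matched by the trailing ...) gets silenced, which is exactly why an overly broad entry could hide a real refcounting bug in our C code.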

Note -- this is built on the Arrow memory leak check PR #8953.

@wiredfool added the Bug and Memory labels May 13, 2025
@aclark4life (Member) commented:

@wiredfool We can trigger a workflow manually or run daily independent of PRs.

@wiredfool (Member, Author) commented:

> We can trigger a workflow manually or run daily independent of PRs.

The question then is how do we prevent this from being run as a scheduled action and subsequently ignored? The tension is that this is by far the most valuable when a PR is in flight, because then it's obvious when you've gone from current (xfail) to fail, but the runtime is likely 2-3x worse than our longest other test run, which is already too long. (Even running a single test with valgrind takes ~1 min, because all the pytest setup infra runs under valgrind as well.)
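
For a sense of scale, checking a single test under valgrind looks something like the line below (the test id is just an example); most of that minute is pytest's own startup running under valgrind:

```sh
PYTHONMALLOC=malloc valgrind --leak-check=full --show-leak-kinds=definite \
    python3 -m pytest -vv Tests/test_image.py::TestImage::test_sanity
```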

@aclark4life (Member) commented:

In that case, long PR it is! We typically have PRs queued for days, or longer, before merging.

@wiredfool (Member, Author) commented:

On my machine:
4711 passed, 345 skipped, 11 xfailed, 8 warnings in 15074.00s (4:11:14)

That's vs 87 minutes for ordinary valgrind on this PR in the GHA testing.

Unfortunately, we're single-threaded for valgrind tests.

@wiredfool (Member, Author) commented:

OK, we'll see how long this takes: https://github.com/python-pillow/Pillow/actions/runs/15054197221/job/42316085246?pr=8954

I'm only running this on the pull request, not the push. I'm expecting 5 hours.

This injects the test script from the Pillow repo instead of building a new copy of the valgrind image. At some point we can move it over, but I'd like to be able to tune the suppressions without having to rebuild the valgrind image, so it's probably better to leave it here for now.
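
Concretely, "injecting" here just means bind-mounting the checkout (and its test script) into the prebuilt valgrind image rather than baking the script into the image; the image name and script path below are placeholders:

```sh
# Sketch only -- image name and script path are placeholders.
docker run --rm -v "$(pwd)":/Pillow -w /Pillow \
    pillow-valgrind-image ./Tests/run_valgrind_leak.sh
```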

wiredfool added 3 commits May 16, 2025 12:08
* ensure that the env variable is set in the makefile
* Some failing tests are on main but not last released version
@wiredfool (Member, Author) commented:

OK, so one more leak found, and a couple of timeouts that weren't xfailed under valgrind. Total test time was 3 hours, which was a little better than I was expecting -- I guess core speeds have gotten a little better since this machine was built.

So what this whole exercise is showing is that we're reasonably good at dealing with memory lifetimes for the happy case, but there are a lot of cases where C-level error handling returns without freeing the in-flight items that would otherwise be passed out or handled properly. I suspect there are more leaks like this out there -- the ones we're hitting are the ones exercised by our test suite. There are a couple of places where I've added changes that aren't picked up by the code coverage but are essentially the same pattern as the leak, which suggests we have other error-path leaks that we'll probably find with additional testing.
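
The shape of the bug being fixed in most of these commits is roughly the following -- not Pillow's actual code, just a schematic of an error path that returns before releasing an earlier allocation:

```c
#include <stdlib.h>

/* Schematic of the error-path leak pattern: an early return on failure
 * skips freeing what was already allocated in this function. */
static int
encode_something(size_t size)
{
    unsigned char *buffer = malloc(size);
    if (!buffer) {
        return -1;
    }

    unsigned char *scratch = malloc(size);
    if (!scratch) {
        /* The leaky version simply did: return -1;  (buffer never freed) */
        free(buffer);   /* the fix: release in-flight allocations on error */
        return -1;
    }

    /* ... real work using buffer and scratch ... */

    free(scratch);
    free(buffer);
    return 0;
}
```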

pytest-timeout doesn't raise a timeout error.
@wiredfool force-pushed the valgrind-leakcheck branch from dfec187 to 20b49a3 on May 17, 2025 08:47
wiredfool and others added 3 commits May 23, 2025 10:57
Comment on lines +3 to +4
## Run this as the test script in the docker valgrind image.
## Note -- can be included directly into the docker image,

Suggested change:
- ## Run this as the test script in the docker valgrind image.
- ## Note -- can be included directly into the docker image,
+ ## Run this as the test script in the Docker valgrind image.
+ ## Note -- can be included directly into the Docker image,
