refactor(profiling): pool memalloc traceback scratch buffers #13185

Merged
nsrip-dd merged 2 commits into main from nick.ripley/pool-memalloc-traceback-buffers on Apr 17, 2025

Conversation


@nsrip-dd nsrip-dd commented Apr 11, 2025

Right now there is a global scratch buffer used when collecting
tracebacks for memalloc. This is a resource usage optimization. We have
enough space set aside to record a maximum depth traceback. We collect
the traceback and put it in the scratch space. Then, once we know how
many frames there really are, we can allocate a new traceback that's
just the right size and store that for later reporting. This saves
wasted memory and extra work dynamically growing the traceback to the
right size.
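
A minimal sketch of that scratch-buffer pattern, purely for illustration: the buffer layout, the MAX_NFRAME value, and the function name here are assumptions, not the actual code in ddtrace/profiling/collector/_memalloc_tb.c.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_NFRAME 512 /* assumed maximum traceback depth */

/* One global scratch buffer, sized for a maximum-depth traceback. */
static struct {
    size_t nframe;
    void *frames[MAX_NFRAME];
} g_scratch;

/* Collect into the scratch buffer, then copy into a right-sized
 * allocation that is kept for later reporting. */
static void **collect_traceback(size_t *nframe_out)
{
    g_scratch.nframe = 0;
    /* ... walk the Python frames, appending to g_scratch.frames and
     * incrementing g_scratch.nframe, up to MAX_NFRAME ... */

    size_t n = g_scratch.nframe;
    void **tb = malloc(n * sizeof(void *));
    if (tb != NULL)
        memcpy(tb, g_scratch.frames, n * sizeof(void *));
    *nframe_out = n;
    return tb;
}
```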

However, there is only one scratch buffer. There is a hypothetical
logical race accessing the buffer. Before 2.18, when we added locks, it
was a real logical race. Essentially what happens is:

  • Thread A is going to collect a traceback, and starts using the buffer
  • Thread A sets the frame count to 0 and starts writing frames to the
    buffer
  • Thread A, partway through, triggers a memory allocation when getting
    the next frame that just so happens to trigger garbage collection,
    which in turn just so happens to run interpreter code which releases
    the GIL
  • Thread B can start running now that the GIL was released
  • Thread B happens to also want to collect a traceback. It can do so and
    starts using the same buffer
  • Thread B resets the frame count to 0
  • Thread B writes all its frames, leaving the frame count at N, and just
    so happens not to release the GIL
  • Thread B finishes sampling and eventually releases the GIL
  • Thread A picks up where it left off, continuing to write frames. Only
    now the frame counter is N, so it is writing at the Nth offset in the
    buffer.
  • Thread A runs out of Python frames, and returns a new traceback, which
    will have the N frames that thread B wrote plus the remaining frames
    that A collected.

This is a logical race that leads to double frees. Now we have two
tracebacks holding some of the exact same frame objects, and they will
eventually both be freed. This corrupts internal allocator state and
leads to crashes.

I say this is hypothetical because now, even though we have locks in
this code, there are actually two separate locks which can be acquired
independently. Both the allocation profiler and heap profiler could
simultaneously collect tracebacks. I haven't been able to actually
trigger this, even with my recently-added reproducer for the race that
ddtrace 2.18 addressed, but I believe the problem is there.

Even without proving this is a problem in practice, it is also an
impediment to performance optimizations. The synchronization we added in
ddtrace 2.18 has introduced a lot of overhead. We can, in theory, use
the GIL to protect most of the memory profiler data structures. But we
need to guard against logical races like this.

Keeping one buffer and locking it for the duration of traceback
collection is not a solution. If the GIL could possibly be released,
then this creates a lock order inversion bug. Another thread grabs the
GIL and tries to get a traceback, but can't since the lock is held. And
so the thread can't release the GIL. So the program deadlocks. For the
same reason, we can't just give the allocation and heap profilers their
own scratch buffers.

NB: We are actually only saved from this problem today because we use
try locks. And I'm not confident that we are correctly handling the
cases where we fail to acquire the existing locks. I would rather have
a solution that can't possibly introduce lock ordering bugs.

Instead, we can use a pool of buffers. Basically, when a thread wants a
traceback, it grabs a buffer from the pool or creates a new one when the
pool is empty. Then the thread has exclusive access to that buffer. When
the thread is done with the buffer, it places it back in the pool. We
cap the size of the pool so that it doesn't get stuck at its
high-water-mark size if we typically only need one or two buffers at a
time.
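
A rough sketch of such a pool, assuming the GIL is held around every call; the type and function names are made up, and the cap of four matches the pool size mentioned in the review discussion below.

```c
#include <stdlib.h>

#define TB_POOL_CAP 4   /* pool size cap mentioned in the discussion below */
#define MAX_NFRAME 512  /* assumed maximum traceback depth */

typedef struct {
    size_t nframe;
    void *frames[MAX_NFRAME];
} tb_scratch_t;

static tb_scratch_t *tb_pool[TB_POOL_CAP];
static size_t tb_pool_count;

/* Caller must hold the GIL; the pool itself has no lock of its own. */
static tb_scratch_t *tb_pool_get(void)
{
    if (tb_pool_count > 0)
        return tb_pool[--tb_pool_count];
    /* Pool is empty: allocate a fresh buffer with plain malloc. */
    return malloc(sizeof(tb_scratch_t));
}

/* Caller must hold the GIL. Buffers beyond the cap are freed so the
 * pool does not sit at its high-water-mark size forever. */
static void tb_pool_put(tb_scratch_t *buf)
{
    if (tb_pool_count < TB_POOL_CAP)
        tb_pool[tb_pool_count++] = buf;
    else
        free(buf);
}
```

Plain malloc/free is used deliberately in this sketch; the follow-up commit below explains why PyMem_RawMalloc/PyMem_RawFree are avoided for the pool.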

This is safe if we hold the GIL on entry to the buffer management
functions, as the functions themselves don't call back into Python and
risk releasing the GIL. If we ever wanted to support no-GIL mode or
subinterpreters, it would be trivial to add synchronization to the
pool.
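
For illustration, "adding synchronization to the pool" could look like the following, reusing the hypothetical names from the pool sketch above. This stays deadlock-free only because the pool functions never call into Python, so the GIL cannot be released while the mutex is held.

```c
#include <pthread.h>

/* Assumes tb_scratch_t, tb_pool_get(), and tb_pool_put() from the pool
 * sketch above are in scope. */
static pthread_mutex_t tb_pool_mutex = PTHREAD_MUTEX_INITIALIZER;

static tb_scratch_t *tb_pool_get_sync(void)
{
    pthread_mutex_lock(&tb_pool_mutex);
    /* tb_pool_get never calls into Python, so it can't release the GIL
     * while the mutex is held; no lock-order inversion is possible. */
    tb_scratch_t *buf = tb_pool_get();
    pthread_mutex_unlock(&tb_pool_mutex);
    return buf;
}

static void tb_pool_put_sync(tb_scratch_t *buf)
{
    pthread_mutex_lock(&tb_pool_mutex);
    tb_pool_put(buf);
    pthread_mutex_unlock(&tb_pool_mutex);
}
```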

I suspect that in the steady state we will not need to allocate new
buffers often and the overhead will be comparable to what it is today.
But it'll be much safer and more suitable for further optimizations to
the memory profiler.

PROF-11531

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy


CODEOWNERS have been resolved as:

ddtrace/profiling/collector/_memalloc_tb.c                              @DataDog/profiling-python

@nsrip-dd nsrip-dd added the changelog/no-changelog A changelog entry is not required for this PR. label Apr 11, 2025

github-actions bot commented Apr 11, 2025

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 234 ± 5 ms.

The average import time from base is: 240 ± 5 ms.

The import time difference between this PR and base is: -5.7 ± 0.2 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 2.225 ms (0.95%)
ddtrace.bootstrap.sitecustomize 1.552 ms (0.66%)
ddtrace.bootstrap.preload 1.552 ms (0.66%)
ddtrace.internal.products 1.552 ms (0.66%)
ddtrace.internal.remoteconfig.client 0.714 ms (0.30%)
ddtrace 0.673 ms (0.29%)


pr-commenter bot commented Apr 11, 2025

Benchmarks

Benchmark execution time: 2025-04-15 19:54:55

Comparing candidate commit 6bb3660 in PR branch nick.ripley/pool-memalloc-traceback-buffers with baseline commit 73764dc in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 506 metrics, 2 unstable metrics.

@nsrip-dd nsrip-dd force-pushed the nick.ripley/pool-memalloc-traceback-buffers branch from c8c3195 to 487ec4e Compare April 14, 2025 14:55
@nsrip-dd nsrip-dd marked this pull request as ready for review April 14, 2025 15:48
@nsrip-dd nsrip-dd requested a review from a team as a code owner April 14, 2025 15:48

@taegyunkim taegyunkim left a comment


Have you noticed any difference on native profiles with/without this?

Also, it would be interesting to see how this behaves when there are many threads, which could potentially lead to more alloc/free for buffers

@P403n1x87 P403n1x87 added the Profiling Continous Profling label Apr 15, 2025
For the memalloc traceback scratch buffer pool, use plain malloc and
free rather than PyMem_RawMalloc/PyMem_RawFree. Those functions happen
to just call plain malloc/free, but that's not really guaranteed. It's
also important that the pool management functions never release the GIL,
which is easier to guarantee if we simply don't call any Python C API
functions. We may also want to profile the "raw" allocators,
though that's a project for another time.

Related to the above GIL concerns, add an assertion that we hold the GIL
in the pool management functions. We rely on the GIL to protect the pool
state right now. Having the assertions both documents that reliance in
the code and helps us debug in case we start seeing crashes in the
memalloc code.
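
A minimal sketch of that kind of assertion, assuming the hypothetical pool functions from the earlier sketch call it on entry; PyGILState_Check() returns 1 when the calling thread holds the GIL, which is exactly the invariant the commit describes.

```c
#include <Python.h>
#include <assert.h>

/* Guard at the top of each pool-management function: the GIL is the only
 * synchronization for the pool state, so the caller must hold it. */
static void tb_pool_assert_gil(void)
{
    assert(PyGILState_Check());
}
```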

@nsrip-dd nsrip-dd left a comment


Have you noticed any difference on native profiles with/without this?

In my admittedly very simple testing, these functions don't even show up in profiles. I'm using Linux perf and recording a simple Flask application with the Python profiler enabled, under load. I'm also running this change with the test from #13026 under perf.

Also, it would be interesting to see how this behaves when there are many threads, which could potentially lead to more alloc/free for buffers

Yeah, I'd be interested in that as well. Right now, since we have the profile locks, I think we're pretty unlikely to have multiple threads hit this code. In my local workspace where I'm moving around the locking, I currently only see two buffers get allocated. The worst case would be where a thread samples, happens to release the GIL while collecting a traceback, and then another thread starts sampling. And then that thread also releases the GIL while collecting a traceback, and then a third thread starts sampling, and so on... Basically we start doing a new buffer per sample if we have more than 4 (pool size) threads getting tracebacks during sampling concurrently. For a thread to release the GIL in sampling it needs to happen to trigger garbage collection during traceback collection. So in the pathological case where we have more than, like, 2 buffers in the pool, we need to have a bunch of threads sample and trigger garbage collection back to back and not reach a steady state...

Long story short, my gut feeling is that the pathological case of constantly allocating new scratch buffers is unlikely to happen. I think Python 3.12 and later don't hit this GIL release any more, since they don't do garbage collection directly when allocating a new object.

We could alternatively do a "dynamic array" approach, where we make a fresh traceback buffer for each sample and grow it to the right size. I chose the pool approach because it's simple to implement and (at least to me) easy to see is correct. But the dynamic array would let us not have global state, at the expense of doing several allocations and copies. The worst case for the pool would have us doing two allocations and a copy for the traceback space. Might be worse for a dynamic array approach?

LMK if you have any thoughts on the current approach vs something else. I can sketch out the dynamic array approach if you want.

@nsrip-dd nsrip-dd merged commit 8c189a1 into main Apr 17, 2025
319 of 320 checks passed
@nsrip-dd nsrip-dd deleted the nick.ripley/pool-memalloc-traceback-buffers branch April 17, 2025 18:38
nsrip-dd added a commit that referenced this pull request May 14, 2025
We added locking to memalloc, the memory profiler, in #11460 in order to
address crashes. These locks made the crashes go away, but significantly
increased the baseline overhead of the profiler and introduced subtle
bugs. The locks we added turned out to be fundamentally incompatible
with the global interpreter lock (GIL), at least with the implementation
from #11460. This PR refactors the profiler to use the GIL exclusively
for locking.

First, we should acknowledge no-GIL and subinterpreters. As of right
now, our module does not support either. A module has to explicitly
opt-in to support either, so there is no risk of those modes being
enabled under our feet. Supporting either mode is likely a repo-wide
project. For now, we can assume the GIL exists.

This work was motivated by overhead. We currently acquire and release
locks in every memory allocation and free. Even when the locks aren't
contended, allocations and frees are very frequent, and the extra work
adds up. We add about 8x overhead to the baseline cost of allocation
just with our locking, not including the cost of actually sampling an
allocation. We can't get rid of this overhead just by reducing sampling
frequency.

There are a few rules to follow in order to use the GIL correctly for
locking (rules 2 and 3 are sketched in code after the list):

1) The GIL is held when a C extension function is called, _except_
   possibly in the raw allocator, which we do not profile
2) The GIL may be released during C Python API calls. Even if it is
   released, though, it will be held again after the call
3) Thus, the GIL creates critical sections only between C Python API
   calls, and the beginning and end of C extension functions. Modifications
   to shared state across those points are not atomic.
4) If we take a lock of our own in C extension code (e.g. a
   pthread_mutex), and the extension code releases the GIL, then the
   program will deadlock due to lock order inversion. We can only safely
   take our own locks in C extension code while the GIL is released.
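
To make rules 2 and 3 concrete, here is an illustrative extension function (not taken from this PR) where state guarded only by the GIL is still not updated atomically, because a C API call in the middle may release and re-acquire the GIL:

```c
#include <Python.h>

static Py_ssize_t g_count; /* shared state, guarded only by the GIL */

static PyObject *do_work(PyObject *self, PyObject *obj)
{
    (void)self;
    g_count += 1;                    /* GIL held on entry (rule 1) */
    PyObject *s = PyObject_Str(obj); /* may release the GIL internally (rule 2) */
    g_count += 1;                    /* other threads may have run and touched
                                      * g_count in between (rule 3) */
    return s;
}
```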

The crashes that #11460 addressed were due to breaking the first three
rules. In particular, we could race on accessing the shared scratch
buffer used when collecting tracebacks, which led to double frees.
See #13185 for more details.

Our mitigation involved using C locks around any access to the shared
profiler state. We nearly broke rule 4 in the process. However, we used
try-locks specifically out of a fear of introducing deadlocks. Try-locks
mean that we attempt to acquire the lock, but return a failure if the
lock is already held. This stopped deadlocks, but introduced bugs: For
example:

- If we failed to take the lock when trying to report allocation
  profile events, we'd raise an exception when it was in fact not
  reasonable for doing that to fail. See #12075.
- memalloc_heap_untrack, which removes tracked allocations, was guarded
  with a try-lock. If we couldn't acquire the lock, we would fail to
  remove a record for an allocation and effectively leak memory.
  See #13317
- We attempted to make our locking fork-safe. The first attempt was
  inefficient; we made it less inefficient but the fix only "worked"
  because of try-locks. See #11848

Try-locks hide concurrency problems and we shouldn't use them. Using our
own locks requires releasing the GIL before acquisition, and then
re-acquiring the GIL. That adds unnecessary overhead. We don't
inherently need to do any off-GIL work. So, we should try to just use
the GIL as long as it is available.

The basic refactor is actually pretty simple. In a nutshell, we
rearrange the memalloc_add_event and memalloc_heap_track functions so
that they make the sampling decision, then take a traceback, then insert
the traceback into the appropriate data structure. Collecting a
traceback can release the GIL, so we make sure that modifying the data
structure happens completely after the traceback is collected. We also
safeguard against the possibility that the profiler was stopped during
sampling, if the GIL was released. This requires a small rearrangement
of memalloc_stop to make sure that the sampling functions don't see
partially-freed profiler data structures.
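
A hedged sketch of that ordering; every name below is an illustrative stand-in rather than the real memalloc API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder types and helpers, standing in for the real memalloc code. */
typedef struct { int dummy; } traceback_t;

static bool should_sample(size_t size) { return size >= 4096; }  /* placeholder policy */
static traceback_t *collect_traceback(void) { return malloc(sizeof(traceback_t)); } /* may release the GIL in the real code */
static bool profiler_is_running(void) { return true; }           /* placeholder */
static void publish_event(traceback_t *tb) { free(tb); }         /* placeholder */

static void on_allocation(size_t size)
{
    if (!should_sample(size))
        return;                            /* 1. sampling decision, GIL held */

    traceback_t *tb = collect_traceback(); /* 2. the GIL may be released here */
    if (tb == NULL)
        return;

    if (!profiler_is_running()) {          /* 3. re-check: profiler state may have
                                            *    changed while the GIL was released */
        free(tb);
        return;
    }
    publish_event(tb);                     /* 4. publish with no GIL release in between */
}
```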

For testing, I have mainly used the code from test_memealloc_data_race_regression.
I also added a debug mode, enabled by compiling with
MEMALLOC_TESTING_GIL_RELEASE, which releases the GIL at places where it
would be expected. For performance, I examined the overhead of profiling
on a basic Flask application.
nsrip-dd added a commit that referenced this pull request May 15, 2025
nsrip-dd added a commit that referenced this pull request May 20, 2025
nsrip-dd added a commit that referenced this pull request May 20, 2025
nsrip-dd added a commit that referenced this pull request May 20, 2025