refactor(profiling): pool memalloc traceback scratch buffers #13185

Merged
nsrip-dd merged 2 commits into main from nick.ripley/pool-memalloc-traceback-buffers on Apr 17, 2025

Conversation


@nsrip-dd nsrip-dd commented Apr 11, 2025

Right now there is a global scratch buffer used when collecting
tracebacks for memalloc. This is a resource usage optimization. We have
enough space set aside to record a maximum depth traceback. We collect
the traceback and put it in the scratch space. Then, once we know how
many frames there really are, we can allocate a new traceback that's
just the right size and store that for later reporting. This saves
wasted memory and extra work dynamically growing the traceback to the
right size.
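
A minimal sketch of that scratch-buffer pattern, purely for illustration: the buffer layout, the MAX_NFRAME value, and the function name here are assumptions, not the actual code in ddtrace/profiling/collector/_memalloc_tb.c.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_NFRAME 512 /* assumed maximum traceback depth */

/* One global scratch buffer, sized for a maximum-depth traceback. */
static struct {
    size_t nframe;
    void *frames[MAX_NFRAME];
} g_scratch;

/* Collect into the scratch buffer, then copy into a right-sized
 * allocation that is kept for later reporting. */
static void **collect_traceback(size_t *nframe_out)
{
    g_scratch.nframe = 0;
    /* ... walk the Python frames, appending to g_scratch.frames and
     * incrementing g_scratch.nframe, up to MAX_NFRAME ... */

    size_t n = g_scratch.nframe;
    void **tb = malloc(n * sizeof(void *));
    if (tb != NULL)
        memcpy(tb, g_scratch.frames, n * sizeof(void *));
    *nframe_out = n;
    return tb;
}
```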

However, there is only one scratch buffer. There is a hypothetical
logical race accessing the buffer. Before 2.18, when we added locks, it
was a real logical race. Essentially what happens is:

  • Thread A is going to collect a traceback, and starts using the buffer
  • Thread A sets the frame count to 0 and starts writing frames to the
    buffer
  • Thread A, partway through, triggers a memory allocation when getting
    the next frame that just so happens to trigger garbage collection,
    which in turn just so happens to run interpreter code which releases
    the GIL
  • Thread B can start running now that the GIL was released
  • Thread B happens to also want to collect a traceback. It can do so and
    starts using the same buffer
  • Thread B resets the frame count to 0
  • Thread B writes all its frames, leaving the frame count at N, and just
    so happens not to release the GIL
  • Thread B finishes sampling and eventually releases the GIL
  • Thread A picks up where it left off, continuing to write frames. Only
    now the frame counter is N, so it is writing at the Nth offset in the
    buffer.
  • Thread A runs out of Python frames, and returns a new traceback, which
    will have the N frames that thread B wrote plus the remaining frames
    that A collected.

This is a logical race that leads to double frees. Now we have two
tracebacks holding some of the exact same frame objects, and they will
eventually both be freed. This corrupts internal allocator state and
leads to crashes.

I say this is hypothetical because now, even though we have locks in
this code, there are actually two separate locks which can be acquired
independently. Both the allocation profiler and heap profiler could
simultaneously collect tracebacks. I haven't been able to actually
trigger this, even with my recently-added reproducer for the race that
ddtrace 2.18 addressed, but I believe the problem is there.

Even without proving this is a problem in practice, it is also an
impediment to performance optimizations. The synchronization we added in
ddtrace 2.18 has introduced a lot of overhead. We can, in theory, use
the GIL to protect most of the memory profiler data structures. But we
need to guard against logical races like this.

Keeping one buffer and locking it for the duration of traceback
collection is not a solution. If the GIL could possibly be released,
then this creates a lock order inversion bug. Another thread grabs the
GIL and tries to get a traceback, but can't since the lock is held. And
so the thread can't release the GIL. So the program deadlocks. For the
same reason, we can't just give the allocation and heap profilers their
own scratch buffers.

NB: We are actually only saved from this problem today because we use
try locks. And I'm not confident that we are correctly handling the
cases where we fail to acquire the existing locks. I would rather have
a solution that can't possibly introduce lock ordering bugs.

Instead, we can use a pool of buffers. Basically, when a thread wants a
traceback, it grabs a buffer from the pool or creates a new one when the
pool is empty. Then the thread has exclusive access to that buffer. When
the thread is done with the buffer, it places it back in the pool. We
cap the size of the pool so that it doesn't get stuck at its
high-water-mark size if we typically only need one or two buffers at a
time.
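
A rough sketch of such a pool, assuming the GIL is held around every call; the type and function names are made up, and the cap of four matches the pool size mentioned in the review discussion below.

```c
#include <stdlib.h>

#define TB_POOL_CAP 4   /* pool size cap mentioned in the discussion below */
#define MAX_NFRAME 512  /* assumed maximum traceback depth */

typedef struct {
    size_t nframe;
    void *frames[MAX_NFRAME];
} tb_scratch_t;

static tb_scratch_t *tb_pool[TB_POOL_CAP];
static size_t tb_pool_count;

/* Caller must hold the GIL; the pool itself has no lock of its own. */
static tb_scratch_t *tb_pool_get(void)
{
    if (tb_pool_count > 0)
        return tb_pool[--tb_pool_count];
    /* Pool is empty: allocate a fresh buffer with plain malloc. */
    return malloc(sizeof(tb_scratch_t));
}

/* Caller must hold the GIL. Buffers beyond the cap are freed so the
 * pool does not sit at its high-water-mark size forever. */
static void tb_pool_put(tb_scratch_t *buf)
{
    if (tb_pool_count < TB_POOL_CAP)
        tb_pool[tb_pool_count++] = buf;
    else
        free(buf);
}
```

Plain malloc/free is used deliberately in this sketch; the follow-up commit below explains why PyMem_RawMalloc/PyMem_RawFree are avoided for the pool.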

This is safe if we hold the GIL on entry to the buffer management
functions, as the functions themselves don't call back into Python and
risk releasing the GIL. If we ever wanted to support no-GIL mode or
subinterpreters, it would be trivial to add synchronization to the
pool.
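
For illustration, "adding synchronization to the pool" could look like the following, reusing the hypothetical names from the pool sketch above. This stays deadlock-free only because the pool functions never call into Python, so the GIL cannot be released while the mutex is held.

```c
#include <pthread.h>

/* Assumes tb_scratch_t, tb_pool_get(), and tb_pool_put() from the pool
 * sketch above are in scope. */
static pthread_mutex_t tb_pool_mutex = PTHREAD_MUTEX_INITIALIZER;

static tb_scratch_t *tb_pool_get_sync(void)
{
    pthread_mutex_lock(&tb_pool_mutex);
    /* tb_pool_get never calls into Python, so it can't release the GIL
     * while the mutex is held; no lock-order inversion is possible. */
    tb_scratch_t *buf = tb_pool_get();
    pthread_mutex_unlock(&tb_pool_mutex);
    return buf;
}

static void tb_pool_put_sync(tb_scratch_t *buf)
{
    pthread_mutex_lock(&tb_pool_mutex);
    tb_pool_put(buf);
    pthread_mutex_unlock(&tb_pool_mutex);
}
```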

I suspect that in the steady state we will not need to allocate new
buffers often and the overhead will be comparable to what it is today.
But it'll be much safer and more suitable for further optimizations to
the memory profiler.

PROF-11531

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy


CODEOWNERS have been resolved as:

ddtrace/profiling/collector/_memalloc_tb.c                              @DataDog/profiling-python

@nsrip-dd nsrip-dd added the changelog/no-changelog A changelog entry is not required for this PR. label Apr 11, 2025

github-actions bot commented Apr 11, 2025

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 234 ± 5 ms.

The average import time from base is: 240 ± 5 ms.

The import time difference between this PR and base is: -5.7 ± 0.2 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 2.225 ms (0.95%)
ddtrace.bootstrap.sitecustomize 1.552 ms (0.66%)
ddtrace.bootstrap.preload 1.552 ms (0.66%)
ddtrace.internal.products 1.552 ms (0.66%)
ddtrace.internal.remoteconfig.client 0.714 ms (0.30%)
ddtrace 0.673 ms (0.29%)


pr-commenter bot commented Apr 11, 2025

Benchmarks

Benchmark execution time: 2025-04-15 19:54:55

Comparing candidate commit 6bb3660 in PR branch nick.ripley/pool-memalloc-traceback-buffers with baseline commit 73764dc in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 506 metrics, 2 unstable metrics.

@nsrip-dd nsrip-dd force-pushed the nick.ripley/pool-memalloc-traceback-buffers branch from c8c3195 to 487ec4e Compare April 14, 2025 14:55
@nsrip-dd nsrip-dd marked this pull request as ready for review April 14, 2025 15:48
@nsrip-dd nsrip-dd requested a review from a team as a code owner April 14, 2025 15:48

@taegyunkim taegyunkim left a comment


Have you noticed any difference on native profiles with/without this?

Also, it would be interesting to see how this behaves when there are many threads, which could potentially lead to more alloc/free for buffers

@P403n1x87 P403n1x87 added the Profiling Continous Profling label Apr 15, 2025
For the memalloc traceback scratch buffer pool, use plain malloc and
free rather than PyMem_RawMalloc/PyMem_RawFree. Those functions happen
to just call plain malloc/free, but that's not really guaranteed. It's
also important that the pool management functions never release the GIL,
which is easier to guarantee if we simply don't call any Python C API
functions. We may also want to profile the "raw" allocators,
though that's a project for another time.

Related to the above GIL concerns, add an assertion that we hold the GIL
in the pool management functions. We rely on the GIL to protect the pool
state right now. Having the assertions both documents that reliance in
the code and helps us debug in case we start seeing crashes in the
memalloc code.
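
A minimal sketch of that kind of assertion, assuming the hypothetical pool functions from the earlier sketch call it on entry; PyGILState_Check() returns 1 when the calling thread holds the GIL, which is exactly the invariant the commit describes.

```c
#include <Python.h>
#include <assert.h>

/* Guard at the top of each pool-management function: the GIL is the only
 * synchronization for the pool state, so the caller must hold it. */
static void tb_pool_assert_gil(void)
{
    assert(PyGILState_Check());
}
```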

@nsrip-dd nsrip-dd left a comment


Have you noticed any difference on native profiles with/without this?

In my admittedly very simple testing, these functions don't even show up in profiles. I'm using Linux perf and recording a simple Flask application with the Python profiler enabled, under load. I'm also running this change with the test from #13026 under perf.

Also, it would be interesting to see how this behaves when there are many threads, which could potentially lead to more alloc/free for buffers

Yeah, I'd be interested in that as well. Right now, since we have the profile locks, I think we're pretty unlikely to have multiple threads hit this code. In my local workspace where I'm moving around the locking, I currently only see two buffers get allocated. The worst case would be where a thread samples, happens to release the GIL while collecting a traceback, and then another thread starts sampling. And then that thread also releases the GIL while collecting a traceback, and then a third thread starts sampling, and so on... Basically we start doing a new buffer per sample if we have more than 4 (pool size) threads getting tracebacks during sampling concurrently. For a thread to release the GIL in sampling it needs to happen to trigger garbage collection during traceback collection. So in the pathological case where we have more than, like, 2 buffers in the pool, we need to have a bunch of threads sample and trigger garbage collection back to back and not reach a steady state...

Long story short, my gut feeling is that the pathological case of constantly allocating new scratch buffers is unlikely to happen. I think Python 3.12 and later don't hit this GIL release any more, since they don't do garbage collection directly when allocating a new object.

We could alternatively do a "dynamic array" approach, where we make a fresh traceback buffer for each sample and grow it to the right size. I chose the pool approach because it's simple to implement and (at least to me) easy to see is correct. But the dynamic array would let us not have global state, at the expense of doing several allocations and copies. The worst case for the pool would have us doing two allocations and a copy for the traceback space. Might be worse for a dynamic array approach?

LMK if you have any thoughts on the current approach vs something else. I can sketch out the dynamic array approach if you want.

@nsrip-dd nsrip-dd merged commit 8c189a1 into main Apr 17, 2025
319 of 320 checks passed
@nsrip-dd nsrip-dd deleted the nick.ripley/pool-memalloc-traceback-buffers branch April 17, 2025 18:38
nsrip-dd added a commit that referenced this pull request May 14, 2025
We added locking to memalloc, the memory profiler, in #11460 in order to
address crashes. These locks made the crashes go away, but significantly
increased the baseline overhead of the profiler and introduced subtle
bugs. The locks we added turned out to be fundamentally incompatible
with the global interpreter lock (GIL), at least with the implementation
from #11460. This PR refactors the profiler to use the GIL exclusively
for locking.

First, we should acknowledge no-GIL and subinterpreters. As of right
now, our module does not support either. A module has to explicitly
opt-in to support either, so there is no risk of those modes being
enabled under our feet. Supporting either mode is likely a repo-wide
project. For now, we can assume the GIL exists.

This work was motivated by overhead. We currently acquire and release
locks in every memory allocation and free. Even when the locks aren't
contended, allocations and frees are very frequent, and the extra work
adds up. We add about 8x overhead to the baseline cost of allocation
just with our locking, not including the cost of actually sampling an
allocation. We can't get rid of this overhead just by reducing sampling
frequency.

There are a few rules to follow in order to use the GIL correctly for
locking (rules 2 and 3 are sketched in code after the list):

1) The GIL is held when a C extension function is called, _except_
   possibly in the raw allocator, which we do not profile
2) The GIL may be released during C Python API calls. Even if it is
   released, though, it will be held again after the call
3) Thus, the GIL creates critical sections only between C Python API
   calls, and the beginning and end of C extension functions. Modifications
   to shared state across those points are not atomic.
4) If we take a lock of our own in C extension code (e.g. a
   pthread_mutex), and the extension code releases the GIL, then the
   program will deadlock due to lock order inversion. We can only safely
   take our own locks in C extension code while the GIL is released.
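
To make rules 2 and 3 concrete, here is an illustrative extension function (not taken from this PR) where state guarded only by the GIL is still not updated atomically, because a C API call in the middle may release and re-acquire the GIL:

```c
#include <Python.h>

static Py_ssize_t g_count; /* shared state, guarded only by the GIL */

static PyObject *do_work(PyObject *self, PyObject *obj)
{
    (void)self;
    g_count += 1;                    /* GIL held on entry (rule 1) */
    PyObject *s = PyObject_Str(obj); /* may release the GIL internally (rule 2) */
    g_count += 1;                    /* other threads may have run and touched
                                      * g_count in between (rule 3) */
    return s;
}
```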

The crashes that #11460 addressed were due to breaking the first three
rules. In particular, we could race on accessing the shared scratch
buffer used when collecting tracebacks, which led to double frees.
See #13185 for more details.

Our mitigation involved using C locks around any access to the shared
profiler state. We nearly broke rule 4 in the process. However, we used
try-locks specifically out of a fear of introducing deadlocks. Try-locks
mean that we attempt to acquire the lock, but return a failure if the
lock is already held. This stopped deadlocks, but introduced bugs: For
example:

- If we failed to take the lock when trying to report allocation
  profile events, we'd raise an exception when it was in fact not
  reasonable for doing that to fail. See #12075.
- memalloc_heap_untrack, which removes tracked allocations, was guarded
  with a try-lock. If we couldn't acquire the lock, we would fail to
  remove a record for an allocation and effectively leak memory.
  See #13317
- We attempted to make our locking fork-safe. The first attempt was
  inefficient; we made it less inefficient but the fix only "worked"
  because of try-locks. See #11848

Try-locks hide concurrency problems and we shouldn't use them. Using our
own locks requires releasing the GIL before acquisition, and then
re-acquiring the GIL. That adds unnecessary overhead. We don't
inherently need to do any off-GIL work. So, we should try to just use
the GIL as long as it is available.

The basic refactor is actually pretty simple. In a nutshell, we
rearrange the memalloc_add_event and memalloc_heap_track functions so
that they make the sampling decision, then take a traceback, then insert
the traceback into the appropriate data structure. Collecting a
traceback can release the GIL, so we make sure that modifying the data
structure happens completely after the traceback is collected. We also
safeguard against the possibility that the profiler was stopped during
sampling, if the GIL was released. This requires a small rearrangement
of memalloc_stop to make sure that the sampling functions don't see
partially-freed profiler data structures.
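
A hedged sketch of that ordering; every name below is an illustrative stand-in rather than the real memalloc API:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder types and helpers, standing in for the real memalloc code. */
typedef struct { int dummy; } traceback_t;

static bool should_sample(size_t size) { return size >= 4096; }  /* placeholder policy */
static traceback_t *collect_traceback(void) { return malloc(sizeof(traceback_t)); } /* may release the GIL in the real code */
static bool profiler_is_running(void) { return true; }           /* placeholder */
static void publish_event(traceback_t *tb) { free(tb); }         /* placeholder */

static void on_allocation(size_t size)
{
    if (!should_sample(size))
        return;                            /* 1. sampling decision, GIL held */

    traceback_t *tb = collect_traceback(); /* 2. the GIL may be released here */
    if (tb == NULL)
        return;

    if (!profiler_is_running()) {          /* 3. re-check: profiler state may have
                                            *    changed while the GIL was released */
        free(tb);
        return;
    }
    publish_event(tb);                     /* 4. publish with no GIL release in between */
}
```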

For testing, I have mainly used the code from test_memealloc_data_race_regression.
I also added a debug mode, enabled by compiling with
MEMALLOC_TESTING_GIL_RELEASE, which releases the GIL at places where it
would be expected. For performance, I examined the overhead of profiling
on a basic Flask application.
nsrip-dd added a commit that referenced this pull request May 15, 2025
nsrip-dd added a commit that referenced this pull request May 20, 2025
nsrip-dd added a commit that referenced this pull request May 20, 2025
nsrip-dd added a commit that referenced this pull request May 20, 2025