-
Notifications
You must be signed in to change notification settings - Fork 437
fix(profiling): fix potential StackCollector deadlock #4108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
23004ed
to
aa66610
Compare
Would you have an example traceback to include? That'd make a good illustration for the commit at least. |
@jd added! |
brettlangdon
previously approved these changes
Aug 17, 2022
jd
previously approved these changes
Aug 17, 2022
@Mergifyio backport 1.4 1.3 0.x 0.61 |
✅ Backports have been created
|
6ca0b9f
to
78c35be
Compare
It is possible that while the profiling StackCollector is handling the on_span_activate event that a gc occurs which could result in a weakref finalize callback being called which then could activate a new span. This sequence of events results in a deadlock as the lock the StackCollector uses is not reentrant. A regression test would be great but we cannot think of a tractable test case which wouldn't be flaky.
78c35be
to
c5708ff
Compare
mergify bot
pushed a commit
that referenced
this pull request
Aug 17, 2022
It is possible that while the profiling StackCollector is handling the on_span_activate event that a gc occurs which could result in a weakref finalize callback being called which then could activate a new span. This sequence of events results in a deadlock as the lock the StackCollector uses is not reentrant. A regression test would be great but we cannot think of a tractable test case which wouldn't be flaky. (cherry picked from commit b0967c4)
mergify bot
pushed a commit
that referenced
this pull request
Aug 17, 2022
It is possible that while the profiling StackCollector is handling the on_span_activate event that a gc occurs which could result in a weakref finalize callback being called which then could activate a new span. This sequence of events results in a deadlock as the lock the StackCollector uses is not reentrant. A regression test would be great but we cannot think of a tractable test case which wouldn't be flaky. (cherry picked from commit b0967c4)
mergify bot
pushed a commit
that referenced
this pull request
Aug 17, 2022
It is possible that while the profiling StackCollector is handling the on_span_activate event that a gc occurs which could result in a weakref finalize callback being called which then could activate a new span. This sequence of events results in a deadlock as the lock the StackCollector uses is not reentrant. A regression test would be great but we cannot think of a tractable test case which wouldn't be flaky. (cherry picked from commit b0967c4)
mergify bot
pushed a commit
that referenced
this pull request
Aug 17, 2022
It is possible that while the profiling StackCollector is handling the on_span_activate event that a gc occurs which could result in a weakref finalize callback being called which then could activate a new span. This sequence of events results in a deadlock as the lock the StackCollector uses is not reentrant. A regression test would be great but we cannot think of a tractable test case which wouldn't be flaky. (cherry picked from commit b0967c4)
mergify bot
added a commit
that referenced
this pull request
Aug 18, 2022
#4115) This is an automatic backport of pull request #4108 done by [Mergify](https://mergify.com). --- <details> <summary>Mergify commands and options</summary> <br /> More conditions and actions can be found in the [documentation](https://docs.mergify.com/). You can also trigger Mergify actions by commenting on this pull request: - `@Mergifyio refresh` will re-evaluate the rules - `@Mergifyio rebase` will rebase this PR on its base branch - `@Mergifyio update` will merge the base branch into this PR - `@Mergifyio backport <destination>` will backport this PR on `<destination>` branch Additionally, on Mergify [dashboard](https://dashboard.mergify.com/) you can: - look at your merge queues - generate the Mergify configuration with the config editor. Finally, you can contact us on https://mergify.com </details>
14 tasks
mergify bot
pushed a commit
that referenced
this pull request
Aug 25, 2022
## Description #4108 replaced an internal use of the original `threading.Lock` function rather than the `gevent.threading.Lock` with the original `threading.RLock` function. However, at least in Python 2, there is an indirection with the implementation of `threading._RLock` class where it the `threading._allocate_lock` called is in fact `gevent.threading._allocate_lock`. When the gevent patched lock is used instead, we can encounter deadlocks or even gevent exceptions like the following: ``` ddtrace.profiling.collector.stack._ThreadSpanLinks.get_active_span_from_thread_id File "ddtrace/profiling/_threading.pyx", line 112, in ddtrace.profiling._threading._ThreadLink.get_object File "/usr/local/lib/python2.7/threading.py", line 174, in acquire rc = self.__block.acquire(blocking) File "src/gevent/_semaphore.py", line 198, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4541) File "src/gevent/_semaphore.py", line 226, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4367) File "src/gevent/_semaphore.py", line 166, in gevent._semaphore.Semaphore._do_wait (src/gevent/gevent._semaphore.c:3562) File "/usr/local/lib/python2.7/site-packages/gevent/hub.py", line 630, in switch return RawGreenlet.switch(self) gevent.hub.LoopExit: ('This operation would block forever', <Hub at 0x7fe2bda052d0 epoll pending=0 ref=0 fileno=267>) ``` The added test for the issue: https://app.circleci.com/pipelines/github/DataDog/dd-trace-py/20715/workflows/602ec8d6-d8ba-4191-ada8-0da76adbdc6b/jobs/1405224 ## Checklist - [x] Title must conform to [conventional commit](https://github.com/conventional-changelog/commitlint/tree/master/%40commitlint/config-conventional). - [x] Add additional sections for `feat` and `fix` pull requests. - [x] Ensure tests are passing for affected code. - [x] [Library documentation](https://github.com/DataDog/dd-trace-py/tree/1.x/docs) and/or [Datadog's documentation site](https://github.com/DataDog/documentation/) is updated. Link to doc PR in description. ## Reviewer Checklist - [x] Title is accurate. - [x] Description motivates each change. - [x] No unnecessary changes were introduced in this PR. - [x] PR cannot be broken up into smaller PRs. - [x] Avoid breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Tests provided or description of manual testing performed is included in the code or PR. - [x] Release note has been added for fixes and features, or else `changelog/no-changelog` label added. - [x] All relevant GitHub issues are correctly linked. - [ ] Backports are identified and tagged with Mergifyio. - [ ] Add to milestone.
mergify bot
pushed a commit
that referenced
this pull request
Aug 25, 2022
## Description #4108 replaced an internal use of the original `threading.Lock` function rather than the `gevent.threading.Lock` with the original `threading.RLock` function. However, at least in Python 2, there is an indirection with the implementation of `threading._RLock` class where it the `threading._allocate_lock` called is in fact `gevent.threading._allocate_lock`. When the gevent patched lock is used instead, we can encounter deadlocks or even gevent exceptions like the following: ``` ddtrace.profiling.collector.stack._ThreadSpanLinks.get_active_span_from_thread_id File "ddtrace/profiling/_threading.pyx", line 112, in ddtrace.profiling._threading._ThreadLink.get_object File "/usr/local/lib/python2.7/threading.py", line 174, in acquire rc = self.__block.acquire(blocking) File "src/gevent/_semaphore.py", line 198, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4541) File "src/gevent/_semaphore.py", line 226, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4367) File "src/gevent/_semaphore.py", line 166, in gevent._semaphore.Semaphore._do_wait (src/gevent/gevent._semaphore.c:3562) File "/usr/local/lib/python2.7/site-packages/gevent/hub.py", line 630, in switch return RawGreenlet.switch(self) gevent.hub.LoopExit: ('This operation would block forever', <Hub at 0x7fe2bda052d0 epoll pending=0 ref=0 fileno=267>) ``` The added test for the issue: https://app.circleci.com/pipelines/github/DataDog/dd-trace-py/20715/workflows/602ec8d6-d8ba-4191-ada8-0da76adbdc6b/jobs/1405224 ## Checklist - [x] Title must conform to [conventional commit](https://github.com/conventional-changelog/commitlint/tree/master/%40commitlint/config-conventional). - [x] Add additional sections for `feat` and `fix` pull requests. - [x] Ensure tests are passing for affected code. - [x] [Library documentation](https://github.com/DataDog/dd-trace-py/tree/1.x/docs) and/or [Datadog's documentation site](https://github.com/DataDog/documentation/) is updated. Link to doc PR in description. ## Reviewer Checklist - [x] Title is accurate. - [x] Description motivates each change. - [x] No unnecessary changes were introduced in this PR. - [x] PR cannot be broken up into smaller PRs. - [x] Avoid breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Tests provided or description of manual testing performed is included in the code or PR. - [x] Release note has been added for fixes and features, or else `changelog/no-changelog` label added. - [x] All relevant GitHub issues are correctly linked. - [ ] Backports are identified and tagged with Mergifyio. - [ ] Add to milestone. (cherry picked from commit b0d8765)
mergify bot
pushed a commit
that referenced
this pull request
Aug 25, 2022
## Description #4108 replaced an internal use of the original `threading.Lock` function rather than the `gevent.threading.Lock` with the original `threading.RLock` function. However, at least in Python 2, there is an indirection with the implementation of `threading._RLock` class where it the `threading._allocate_lock` called is in fact `gevent.threading._allocate_lock`. When the gevent patched lock is used instead, we can encounter deadlocks or even gevent exceptions like the following: ``` ddtrace.profiling.collector.stack._ThreadSpanLinks.get_active_span_from_thread_id File "ddtrace/profiling/_threading.pyx", line 112, in ddtrace.profiling._threading._ThreadLink.get_object File "/usr/local/lib/python2.7/threading.py", line 174, in acquire rc = self.__block.acquire(blocking) File "src/gevent/_semaphore.py", line 198, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4541) File "src/gevent/_semaphore.py", line 226, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4367) File "src/gevent/_semaphore.py", line 166, in gevent._semaphore.Semaphore._do_wait (src/gevent/gevent._semaphore.c:3562) File "/usr/local/lib/python2.7/site-packages/gevent/hub.py", line 630, in switch return RawGreenlet.switch(self) gevent.hub.LoopExit: ('This operation would block forever', <Hub at 0x7fe2bda052d0 epoll pending=0 ref=0 fileno=267>) ``` The added test for the issue: https://app.circleci.com/pipelines/github/DataDog/dd-trace-py/20715/workflows/602ec8d6-d8ba-4191-ada8-0da76adbdc6b/jobs/1405224 ## Checklist - [x] Title must conform to [conventional commit](https://github.com/conventional-changelog/commitlint/tree/master/%40commitlint/config-conventional). - [x] Add additional sections for `feat` and `fix` pull requests. - [x] Ensure tests are passing for affected code. - [x] [Library documentation](https://github.com/DataDog/dd-trace-py/tree/1.x/docs) and/or [Datadog's documentation site](https://github.com/DataDog/documentation/) is updated. Link to doc PR in description. ## Reviewer Checklist - [x] Title is accurate. - [x] Description motivates each change. - [x] No unnecessary changes were introduced in this PR. - [x] PR cannot be broken up into smaller PRs. - [x] Avoid breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Tests provided or description of manual testing performed is included in the code or PR. - [x] Release note has been added for fixes and features, or else `changelog/no-changelog` label added. - [x] All relevant GitHub issues are correctly linked. - [ ] Backports are identified and tagged with Mergifyio. - [ ] Add to milestone. (cherry picked from commit b0d8765)
mergify bot
pushed a commit
that referenced
this pull request
Aug 25, 2022
## Description #4108 replaced an internal use of the original `threading.Lock` function rather than the `gevent.threading.Lock` with the original `threading.RLock` function. However, at least in Python 2, there is an indirection with the implementation of `threading._RLock` class where it the `threading._allocate_lock` called is in fact `gevent.threading._allocate_lock`. When the gevent patched lock is used instead, we can encounter deadlocks or even gevent exceptions like the following: ``` ddtrace.profiling.collector.stack._ThreadSpanLinks.get_active_span_from_thread_id File "ddtrace/profiling/_threading.pyx", line 112, in ddtrace.profiling._threading._ThreadLink.get_object File "/usr/local/lib/python2.7/threading.py", line 174, in acquire rc = self.__block.acquire(blocking) File "src/gevent/_semaphore.py", line 198, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4541) File "src/gevent/_semaphore.py", line 226, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4367) File "src/gevent/_semaphore.py", line 166, in gevent._semaphore.Semaphore._do_wait (src/gevent/gevent._semaphore.c:3562) File "/usr/local/lib/python2.7/site-packages/gevent/hub.py", line 630, in switch return RawGreenlet.switch(self) gevent.hub.LoopExit: ('This operation would block forever', <Hub at 0x7fe2bda052d0 epoll pending=0 ref=0 fileno=267>) ``` The added test for the issue: https://app.circleci.com/pipelines/github/DataDog/dd-trace-py/20715/workflows/602ec8d6-d8ba-4191-ada8-0da76adbdc6b/jobs/1405224 ## Checklist - [x] Title must conform to [conventional commit](https://github.com/conventional-changelog/commitlint/tree/master/%40commitlint/config-conventional). - [x] Add additional sections for `feat` and `fix` pull requests. - [x] Ensure tests are passing for affected code. - [x] [Library documentation](https://github.com/DataDog/dd-trace-py/tree/1.x/docs) and/or [Datadog's documentation site](https://github.com/DataDog/documentation/) is updated. Link to doc PR in description. ## Reviewer Checklist - [x] Title is accurate. - [x] Description motivates each change. - [x] No unnecessary changes were introduced in this PR. - [x] PR cannot be broken up into smaller PRs. - [x] Avoid breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Tests provided or description of manual testing performed is included in the code or PR. - [x] Release note has been added for fixes and features, or else `changelog/no-changelog` label added. - [x] All relevant GitHub issues are correctly linked. - [ ] Backports are identified and tagged with Mergifyio. - [ ] Add to milestone. (cherry picked from commit b0d8765)
mergify bot
pushed a commit
that referenced
this pull request
Aug 25, 2022
## Description #4108 replaced an internal use of the original `threading.Lock` function rather than the `gevent.threading.Lock` with the original `threading.RLock` function. However, at least in Python 2, there is an indirection with the implementation of `threading._RLock` class where it the `threading._allocate_lock` called is in fact `gevent.threading._allocate_lock`. When the gevent patched lock is used instead, we can encounter deadlocks or even gevent exceptions like the following: ``` ddtrace.profiling.collector.stack._ThreadSpanLinks.get_active_span_from_thread_id File "ddtrace/profiling/_threading.pyx", line 112, in ddtrace.profiling._threading._ThreadLink.get_object File "/usr/local/lib/python2.7/threading.py", line 174, in acquire rc = self.__block.acquire(blocking) File "src/gevent/_semaphore.py", line 198, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4541) File "src/gevent/_semaphore.py", line 226, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4367) File "src/gevent/_semaphore.py", line 166, in gevent._semaphore.Semaphore._do_wait (src/gevent/gevent._semaphore.c:3562) File "/usr/local/lib/python2.7/site-packages/gevent/hub.py", line 630, in switch return RawGreenlet.switch(self) gevent.hub.LoopExit: ('This operation would block forever', <Hub at 0x7fe2bda052d0 epoll pending=0 ref=0 fileno=267>) ``` The added test for the issue: https://app.circleci.com/pipelines/github/DataDog/dd-trace-py/20715/workflows/602ec8d6-d8ba-4191-ada8-0da76adbdc6b/jobs/1405224 ## Checklist - [x] Title must conform to [conventional commit](https://github.com/conventional-changelog/commitlint/tree/master/%40commitlint/config-conventional). - [x] Add additional sections for `feat` and `fix` pull requests. - [x] Ensure tests are passing for affected code. - [x] [Library documentation](https://github.com/DataDog/dd-trace-py/tree/1.x/docs) and/or [Datadog's documentation site](https://github.com/DataDog/documentation/) is updated. Link to doc PR in description. ## Reviewer Checklist - [x] Title is accurate. - [x] Description motivates each change. - [x] No unnecessary changes were introduced in this PR. - [x] PR cannot be broken up into smaller PRs. - [x] Avoid breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Tests provided or description of manual testing performed is included in the code or PR. - [x] Release note has been added for fixes and features, or else `changelog/no-changelog` label added. - [x] All relevant GitHub issues are correctly linked. - [ ] Backports are identified and tagged with Mergifyio. - [ ] Add to milestone. (cherry picked from commit b0d8765) # Conflicts: # ddtrace/internal/nogevent.py
Kyle-Verhoog
added a commit
that referenced
this pull request
Sep 1, 2022
It is possible that while the profiling StackCollector is handling the on_span_activate event that a gc occurs which could result in a weakref finalize callback being called which then could activate a new span. This sequence of events results in a deadlock as the lock the StackCollector uses is not reentrant. A regression test would be great but we cannot think of a tractable test case which wouldn't be flaky. (cherry picked from commit b0967c4) Co-authored-by: Kyle Verhoog <[email protected]> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Kyle-Verhoog
pushed a commit
that referenced
this pull request
Sep 6, 2022
## Description #4108 replaced an internal use of the original `threading.Lock` function rather than the `gevent.threading.Lock` with the original `threading.RLock` function. However, at least in Python 2, there is an indirection with the implementation of `threading._RLock` class where it the `threading._allocate_lock` called is in fact `gevent.threading._allocate_lock`. When the gevent patched lock is used instead, we can encounter deadlocks or even gevent exceptions like the following: ``` ddtrace.profiling.collector.stack._ThreadSpanLinks.get_active_span_from_thread_id File "ddtrace/profiling/_threading.pyx", line 112, in ddtrace.profiling._threading._ThreadLink.get_object File "/usr/local/lib/python2.7/threading.py", line 174, in acquire rc = self.__block.acquire(blocking) File "src/gevent/_semaphore.py", line 198, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4541) File "src/gevent/_semaphore.py", line 226, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4367) File "src/gevent/_semaphore.py", line 166, in gevent._semaphore.Semaphore._do_wait (src/gevent/gevent._semaphore.c:3562) File "/usr/local/lib/python2.7/site-packages/gevent/hub.py", line 630, in switch return RawGreenlet.switch(self) gevent.hub.LoopExit: ('This operation would block forever', <Hub at 0x7fe2bda052d0 epoll pending=0 ref=0 fileno=267>) ``` The added test for the issue: https://app.circleci.com/pipelines/github/DataDog/dd-trace-py/20715/workflows/602ec8d6-d8ba-4191-ada8-0da76adbdc6b/jobs/1405224 (cherry picked from commit b0d8765) Co-authored-by: Tahir H. Butt <[email protected]> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Kyle-Verhoog
added a commit
that referenced
this pull request
Sep 8, 2022
## Description #4108 replaced an internal use of the original `threading.Lock` function rather than the `gevent.threading.Lock` with the original `threading.RLock` function. However, at least in Python 2, there is an indirection with the implementation of `threading._RLock` class where it the `threading._allocate_lock` called is in fact `gevent.threading._allocate_lock`. When the gevent patched lock is used instead, we can encounter deadlocks or even gevent exceptions like the following: ``` ddtrace.profiling.collector.stack._ThreadSpanLinks.get_active_span_from_thread_id File "ddtrace/profiling/_threading.pyx", line 112, in ddtrace.profiling._threading._ThreadLink.get_object File "/usr/local/lib/python2.7/threading.py", line 174, in acquire rc = self.__block.acquire(blocking) File "src/gevent/_semaphore.py", line 198, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4541) File "src/gevent/_semaphore.py", line 226, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4367) File "src/gevent/_semaphore.py", line 166, in gevent._semaphore.Semaphore._do_wait (src/gevent/gevent._semaphore.c:3562) File "/usr/local/lib/python2.7/site-packages/gevent/hub.py", line 630, in switch return RawGreenlet.switch(self) gevent.hub.LoopExit: ('This operation would block forever', <Hub at 0x7fe2bda052d0 epoll pending=0 ref=0 fileno=267>) ``` The added test for the issue: https://app.circleci.com/pipelines/github/DataDog/dd-trace-py/20715/workflows/602ec8d6-d8ba-4191-ada8-0da76adbdc6b/jobs/1405224 ## Checklist - [x] Title must conform to [conventional commit](https://github.com/conventional-changelog/commitlint/tree/master/%40commitlint/config-conventional). - [x] Add additional sections for `feat` and `fix` pull requests. - [x] Ensure tests are passing for affected code. - [x] [Library documentation](https://github.com/DataDog/dd-trace-py/tree/1.x/docs) and/or [Datadog's documentation site](https://github.com/DataDog/documentation/) is updated. Link to doc PR in description. ## Reviewer Checklist - [x] Title is accurate. - [x] Description motivates each change. - [x] No unnecessary changes were introduced in this PR. - [x] PR cannot be broken up into smaller PRs. - [x] Avoid breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Tests provided or description of manual testing performed is included in the code or PR. - [x] Release note has been added for fixes and features, or else `changelog/no-changelog` label added. - [x] All relevant GitHub issues are correctly linked. - [ ] Backports are identified and tagged with Mergifyio. - [ ] Add to milestone. (cherry picked from commit b0d8765) # Conflicts: # ddtrace/internal/nogevent.py Co-authored-by: Tahir H. Butt <[email protected]> Co-authored-by: Kyle Verhoog <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
It is possible that while the profiling StackCollector is handling the
on_span_activate event that a gc occurs which could result in a weakref
finalize callback being called which then could activate a new span.
This sequence of events results in a deadlock as the lock the
StackCollector uses is not reentrant.
A regression test would be great but we cannot think of a tractable test
case which wouldn't be flaky.
Here's an example stacktrace of this happening: