-
Notifications
You must be signed in to change notification settings - Fork 437
ddtrace
deadlock in on_span_start
#4978
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@nierob just out of curiosity: I gather this is running on Python 3.11. Do you also have the profiler on when this happens, or is it just tracing? |
We do not use profiler, so I do not know. Yes we use python3.11 I went through logs and events in our system, down to December and I can see that we likely had a similar issue with ddtrace 1.6, it was just not debugged deeply enough, so it is hard to say, but symptoms were the same . |
Is there any way I can help debugging it? At the moment the only workaround I have is to disable the tracing, which is increasing my fear of missing out ;-). |
@P403n1x87 you mentioned in #4149 that running on latest CPython 3.11 was a clear recommendation (?) I also see from release notes of 1.7.3 ("This fix resolves an issue in Python 3.11 where a PyFrameObject strong reference count was not properly decremented in the stack collector.") which might be related to this issue? |
I'm seeing a similar issue on py3.9 - out of curiosity, what's the full list of Python modules that you're patching with ddtrace? |
In addition to the default we explicitly call: patch_all(
aiohttp=True,
aioredis=True,
asyncio=True,
asyncpg=True,
httplib=True,
logging=True,
requests=True,
) |
The recommendation is to run the latest CPython 3.11 when using the 3.11 release series, to benefit from any patch releases. It's not a recommendation to migrate, e.g., from 3.9 to 3.11.
Given that the profiler was off for the original post, and that there are reports of the same issue with CPython 3.9, I think those fixes are orthogonal to the problem in question. |
What happens if you move from |
I have noticed that we also define From what I see "sqlalchemy": False, # Prefer DB client instrumentation so from my understanding it should not be loaded. I tried to debug it a bit and I do not see In [4]: sorted(ddtrace._monkey._PATCHED_MODULES)
Out[4]:
['aiohttp',
'asyncio',
'asyncpg',
'botocore',
'celery',
'elasticsearch',
'fastapi',
'flask',
'futures',
'gevent',
'httplib',
'jinja2',
'logging',
'mako',
'psycopg',
'pymysql',
'pynamodb',
'redis',
'requests',
'rq',
'sqlite3',
'starlette'] I can try to explicitly disable |
In our case, we had both If disabling sqlalchemy doesn't work for you, I'd go through the list of modules you're patching and disable them one-by-one to find the culprit - maybe it's the combination of |
We've received another report of the same deadlock traceback appearing when |
I've tried disabling sqlalchemy tracing (I think?), and still hit it. The stack is really similar to the original stack in the issue description:
Library versions:
What I thiiiiink is happening is that in the middle of trying to get the lock during one call to |
@leifwalsh I think you are onto something! We have seen this before with a deadlock in the profiler that was fixed by moving to a re-entrant lock: #4108 We could do the same here for the lock used internally when spans are started and finished:
To test this theory, could someone try to change this to an |
The reason why this is an issue for ddtrace is that the dbapi instrumentation added to the rollback, as seen in the sqlalchemy stack trace above. This problem should remain if only the sqlalchemy integration is disabled as the pyscopg integration would add the dbapi instrumentation. Mitigating this issue would require also disabling the psycopg integration. As for remediation, another alternative to changing the lock type is to avoid tracing the rollback that occurs during garbage collection. But I that seems a gnarly bit of logic if we need to still trace all rollbacks in general. |
I patched a deadlocking application with the EDIT: nvm, #6325 has already been opened. |
This fix resolves an issue where sqlalchemy rollbacks could intermittently cause deadlocks due SpanAggregator's `on_span_start` using a Lock. This fix was put behind a flag `DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible performance impacts. While our benchmarks don't show anything major, and we suspect this change will not impact performance much at all, we want to be cautious and test on internal services before making it true by default. We plan on gathering feedback and testing on internal services, as long as we don't see a negative impact, we plan on making this the new default. Should fix: #4978 ## Checklist - [x] Change(s) are motivated and described in the PR description. - [x] Testing strategy is described if automated tests are not included in the PR. - [x] Risk is outlined (performance impact, potential for breakage, maintainability, etc). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed. If no release note is required, add label `changelog/no-changelog`. - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)). - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) ## Reviewer Checklist - [x] Title is accurate. - [x] No unnecessary changes are introduced. - [x] Description motivates each change. - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Testing strategy adequately addresses listed risk(s). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] Release note makes sense to a user of the library. - [x] Reviewer has explicitly acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment. - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) --------- Co-authored-by: Brett Langdon <[email protected]>
This fix resolves an issue where sqlalchemy rollbacks could intermittently cause deadlocks due SpanAggregator's `on_span_start` using a Lock. This fix was put behind a flag `DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible performance impacts. While our benchmarks don't show anything major, and we suspect this change will not impact performance much at all, we want to be cautious and test on internal services before making it true by default. We plan on gathering feedback and testing on internal services, as long as we don't see a negative impact, we plan on making this the new default. Should fix: #4978 ## Checklist - [x] Change(s) are motivated and described in the PR description. - [x] Testing strategy is described if automated tests are not included in the PR. - [x] Risk is outlined (performance impact, potential for breakage, maintainability, etc). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed. If no release note is required, add label `changelog/no-changelog`. - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)). - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) ## Reviewer Checklist - [x] Title is accurate. - [x] No unnecessary changes are introduced. - [x] Description motivates each change. - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Testing strategy adequately addresses listed risk(s). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] Release note makes sense to a user of the library. - [x] Reviewer has explicitly acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment. - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) --------- Co-authored-by: Brett Langdon <[email protected]> (cherry picked from commit 18ca85e)
This fix resolves an issue where sqlalchemy rollbacks could intermittently cause deadlocks due SpanAggregator's `on_span_start` using a Lock. This fix was put behind a flag `DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible performance impacts. While our benchmarks don't show anything major, and we suspect this change will not impact performance much at all, we want to be cautious and test on internal services before making it true by default. We plan on gathering feedback and testing on internal services, as long as we don't see a negative impact, we plan on making this the new default. Should fix: #4978 ## Checklist - [x] Change(s) are motivated and described in the PR description. - [x] Testing strategy is described if automated tests are not included in the PR. - [x] Risk is outlined (performance impact, potential for breakage, maintainability, etc). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed. If no release note is required, add label `changelog/no-changelog`. - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)). - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) ## Reviewer Checklist - [x] Title is accurate. - [x] No unnecessary changes are introduced. - [x] Description motivates each change. - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Testing strategy adequately addresses listed risk(s). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] Release note makes sense to a user of the library. - [x] Reviewer has explicitly acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment. - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) --------- Co-authored-by: Brett Langdon <[email protected]> (cherry picked from commit 18ca85e)
Backport 18ca85e from #6325 to 1.16. This fix resolves an issue where sqlalchemy rollbacks could intermittently cause deadlocks due SpanAggregator's `on_span_start` using a Lock. This fix was put behind a flag `DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible performance impacts. While our benchmarks don't show anything major, and we suspect this change will not impact performance much at all, we want to be cautious and test on internal services before making it true by default. We plan on gathering feedback and testing on internal services, as long as we don't see a negative impact, we plan on making this the new default. Should fix: #4978 ## Checklist - [x] Change(s) are motivated and described in the PR description. - [x] Testing strategy is described if automated tests are not included in the PR. - [x] Risk is outlined (performance impact, potential for breakage, maintainability, etc). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed. If no release note is required, add label `changelog/no-changelog`. - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)). - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) ## Reviewer Checklist - [x] Title is accurate. - [x] No unnecessary changes are introduced. - [x] Description motivates each change. - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Testing strategy adequately addresses listed risk(s). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] Release note makes sense to a user of the library. - [x] Reviewer has explicitly acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment. - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) Co-authored-by: Zachary Groves <[email protected]>
Backport 18ca85e from #6325 to 1.17. This fix resolves an issue where sqlalchemy rollbacks could intermittently cause deadlocks due SpanAggregator's `on_span_start` using a Lock. This fix was put behind a flag `DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible performance impacts. While our benchmarks don't show anything major, and we suspect this change will not impact performance much at all, we want to be cautious and test on internal services before making it true by default. We plan on gathering feedback and testing on internal services, as long as we don't see a negative impact, we plan on making this the new default. Should fix: #4978 ## Checklist - [x] Change(s) are motivated and described in the PR description. - [x] Testing strategy is described if automated tests are not included in the PR. - [x] Risk is outlined (performance impact, potential for breakage, maintainability, etc). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed. If no release note is required, add label `changelog/no-changelog`. - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)). - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) ## Reviewer Checklist - [x] Title is accurate. - [x] No unnecessary changes are introduced. - [x] Description motivates each change. - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Testing strategy adequately addresses listed risk(s). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] Release note makes sense to a user of the library. - [x] Reviewer has explicitly acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment. - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) Co-authored-by: Zachary Groves <[email protected]>
This fix resolves an issue where sqlalchemy rollbacks could intermittently cause deadlocks due SpanAggregator's `on_span_start` using a Lock. This fix was put behind a flag `DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible performance impacts. While our benchmarks don't show anything major, and we suspect this change will not impact performance much at all, we want to be cautious and test on internal services before making it true by default. We plan on gathering feedback and testing on internal services, as long as we don't see a negative impact, we plan on making this the new default. Should fix: #4978 ## Checklist - [x] Change(s) are motivated and described in the PR description. - [x] Testing strategy is described if automated tests are not included in the PR. - [x] Risk is outlined (performance impact, potential for breakage, maintainability, etc). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed. If no release note is required, add label `changelog/no-changelog`. - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)). - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) ## Reviewer Checklist - [x] Title is accurate. - [x] No unnecessary changes are introduced. - [x] Description motivates each change. - [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [x] Testing strategy adequately addresses listed risk(s). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] Release note makes sense to a user of the library. - [x] Reviewer has explicitly acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment. - [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) --------- Co-authored-by: Brett Langdon <[email protected]>
Following up on the work here: #6325 For this issue: #4978 After testing on staging for delancie crawlers without issue, we can make Rlock the default for SpanAggregator. ## Checklist - [x] Change(s) are motivated and described in the PR description. - [x] Testing strategy is described if automated tests are not included in the PR. - [x] Risk is outlined (performance impact, potential for breakage, maintainability, etc). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed. If no release note is required, add label `changelog/no-changelog`. - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)). - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) ## Reviewer Checklist - [ ] Title is accurate. - [ ] No unnecessary changes are introduced. - [ ] Description motivates each change. - [ ] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [ ] Testing strategy adequately addresses listed risk(s). - [ ] Change is maintainable (easy to change, telemetry, documentation). - [ ] Release note makes sense to a user of the library. - [ ] Reviewer has explicitly acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment. - [ ] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) - [ ] If this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from `@DataDog/security-design-and-guidance`. - [ ] This PR doesn't touch any of that. --------- Co-authored-by: Munir Abdinur <[email protected]>
Following up on the work here: #6325 For this issue: #4978 After testing on staging for delancie crawlers without issue, we can make Rlock the default for SpanAggregator. ## Checklist - [x] Change(s) are motivated and described in the PR description. - [x] Testing strategy is described if automated tests are not included in the PR. - [x] Risk is outlined (performance impact, potential for breakage, maintainability, etc). - [x] Change is maintainable (easy to change, telemetry, documentation). - [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed. If no release note is required, add label `changelog/no-changelog`. - [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/)). - [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) ## Reviewer Checklist - [ ] Title is accurate. - [ ] No unnecessary changes are introduced. - [ ] Description motivates each change. - [ ] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [ ] Testing strategy adequately addresses listed risk(s). - [ ] Change is maintainable (easy to change, telemetry, documentation). - [ ] Release note makes sense to a user of the library. - [ ] Reviewer has explicitly acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment. - [ ] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting) - [ ] If this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from `@DataDog/security-design-and-guidance`. - [ ] This PR doesn't touch any of that. --------- Co-authored-by: Munir Abdinur <[email protected]>
Summary of problem
ddtrace
deadlock inon_span_start
Which version of dd-trace-py are you using?
ddtrace==1.7.2
Which version of pip are you using?
pip 22.3.1
Which libraries and their versions are you using?
`pip freeze`
Brotli==1.0.9 Jinja2==3.1.2 Mako==1.2.2 MarkupSafe==2.1.1 Pillow==9.3.0 PyJWT==2.6.0 PyMySQL==1.0.2 PyNaCl==1.5.0 PyYAML==6.0 Pygments==2.13.0 SQLAlchemy==1.4.46 aenum==3.1.11 aiocontextvars==0.2.2 aiohttp==3.8.3 aiosignal==1.2.0 alembic==1.9.2 amqp==5.1.1 anyio==3.6.1 arq==0.25.0 asttokens==2.0.8 async-cache==1.1.1 async-timeout==4.0.2 asyncpg==0.27.0 attrs==22.2.0 backcall==0.2.0 bcrypt==4.0.1 billiard==4.1.0 bleach==5.0.1 bytecode==0.13.0 cachetools==5.3.0 cattrs==22.1.0 celery==5.3.0b1 certifi==2022.12.7 cffi==1.15.1 charset-normalizer==2.1.1 click-didyoumean==0.3.0 click-plugins==1.1.1 click-repl==0.2.0 click==8.1.3 cryptography==38.0.4 cssselect2==0.6.0 datadog==0.44.0 ddsketch==2.0.4 ddtrace==1.7.2 decorator==5.1.1 defusedxml==0.7.1 dnspython==2.2.1 email-validator==1.3.1 emoji==2.2.0 envier==0.4.0 et-xmlfile==1.1.0 executing==1.0.0 fastapi==0.89.1 fonttools==4.37.1 frozenlist==1.3.1 greenlet==1.1.3 h11==0.12.0 hiredis==2.0.0 html5lib==1.1 idna==3.3 ipython==8.8.0 isodate==0.6.1 jedi==0.18.1 jsonschema==3.2.0 kombu==5.3.0b2 lxml==4.9.2 markdown-it-py==2.1.0 matplotlib-inline==0.1.6 mdurl==0.1.2 multidict==5.2.0 mypy-extensions==0.4.3 netsgiro==2.0.1 numpy==1.23.2 openpyxl==3.0.10 packaging==21.3 pandas==1.5.2 paramiko==2.12.0 parso==0.8.3 passlib==1.7.4 pdfkit==1.0.0 pexpect==4.8.0 phonenumbers==8.13.4 pickleshare==0.7.5 platformdirs==2.5.2 prompt-toolkit==3.0.30 protobuf==4.21.6 psycopg2==2.9.5 ptyprocess==0.7.0 pure-eval==0.2.2 pyOpenSSL==22.0.0 pycparser==2.21 pydantic==1.10.4 pydyf==0.5.0 pyparsing==3.0.9 pypdf==3.2.0 pyphen==0.12.0 pyrsistent==0.18.1 python-Levenshtein==0.12.2 python-dateutil==2.8.2 python-dotenv==0.20.0 python-json-logger==2.0.4 python-multipart==0.0.5 pytz==2022.2.1 redis==4.4.2 requests-file==1.5.1 requests-pkcs12==1.15 requests-toolbelt==0.9.1 requests==2.28.1 rich==13.2.0 ruff==0.0.230 sentry-sdk==1.13.0 six==1.16.0 sniffio==1.2.0 sqlparse==0.4.3 stack-data==0.5.0 starlette==0.22.0 tenacity==8.0.1 thefuzz==0.19.0 tinycss2==1.1.1 tornado==6.2 tqdm==4.64.1 traitlets==5.3.0 typed-argument-parser==1.7.2 typing-inspect==0.8.0 typing_extensions==4.3.0 ua-parser==0.16.1 urllib3==1.26.14 user-agents==2.2.0 uvicorn==0.20.0 vine==5.0.0 wcwidth==0.2.5 weasyprint==57.2 webencodings==0.5.1 xmlsec==1.3.13 xmltodict==0.13.0 yarl==1.8.1 zeep==4.2.1 zopfli==0.2.1How can we reproduce your problem?
It is random, happens after a while, ranging from 30 min to couple of days.
What is the result that you get?
Deadlock. The app does not react to most of signals, one exception is ABRT which allowed to print the stack trace:
In line
ddtrace/internal/processor/trace.py, line 206 in on_span_finish
a lock is hold and then recursivelyon_span_start
is called. The curiosity of it is call fromddtrace/internal/writer.py, line 545 in write
tosqlalchemy/pool/base.py, line 509 in
which almost feel like a stack corruption, but can it be somegreenlet
switch issue? I have suspicious that more code we have for async sqlalchemy the more likely is to cause the deadlock, but it can be a red herring.What is the result that you expected?
No deadlocks, just fun and prosperity.
The text was updated successfully, but these errors were encountered: