Skip to content

ddtrace deadlock in on_span_start #4978

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nierob opened this issue Jan 27, 2023 · 15 comments · Fixed by #6325
Closed

ddtrace deadlock in on_span_start #4978

nierob opened this issue Jan 27, 2023 · 15 comments · Fixed by #6325
Labels
Tracing Distributed Tracing

Comments

@nierob
Copy link

nierob commented Jan 27, 2023

Summary of problem

ddtrace deadlock in on_span_start

Which version of dd-trace-py are you using?

ddtrace==1.7.2

Which version of pip are you using?

pip 22.3.1

Which libraries and their versions are you using?

`pip freeze` Brotli==1.0.9 Jinja2==3.1.2 Mako==1.2.2 MarkupSafe==2.1.1 Pillow==9.3.0 PyJWT==2.6.0 PyMySQL==1.0.2 PyNaCl==1.5.0 PyYAML==6.0 Pygments==2.13.0 SQLAlchemy==1.4.46 aenum==3.1.11 aiocontextvars==0.2.2 aiohttp==3.8.3 aiosignal==1.2.0 alembic==1.9.2 amqp==5.1.1 anyio==3.6.1 arq==0.25.0 asttokens==2.0.8 async-cache==1.1.1 async-timeout==4.0.2 asyncpg==0.27.0 attrs==22.2.0 backcall==0.2.0 bcrypt==4.0.1 billiard==4.1.0 bleach==5.0.1 bytecode==0.13.0 cachetools==5.3.0 cattrs==22.1.0 celery==5.3.0b1 certifi==2022.12.7 cffi==1.15.1 charset-normalizer==2.1.1 click-didyoumean==0.3.0 click-plugins==1.1.1 click-repl==0.2.0 click==8.1.3 cryptography==38.0.4 cssselect2==0.6.0 datadog==0.44.0 ddsketch==2.0.4 ddtrace==1.7.2 decorator==5.1.1 defusedxml==0.7.1 dnspython==2.2.1 email-validator==1.3.1 emoji==2.2.0 envier==0.4.0 et-xmlfile==1.1.0 executing==1.0.0 fastapi==0.89.1 fonttools==4.37.1 frozenlist==1.3.1 greenlet==1.1.3 h11==0.12.0 hiredis==2.0.0 html5lib==1.1 idna==3.3 ipython==8.8.0 isodate==0.6.1 jedi==0.18.1 jsonschema==3.2.0 kombu==5.3.0b2 lxml==4.9.2 markdown-it-py==2.1.0 matplotlib-inline==0.1.6 mdurl==0.1.2 multidict==5.2.0 mypy-extensions==0.4.3 netsgiro==2.0.1 numpy==1.23.2 openpyxl==3.0.10 packaging==21.3 pandas==1.5.2 paramiko==2.12.0 parso==0.8.3 passlib==1.7.4 pdfkit==1.0.0 pexpect==4.8.0 phonenumbers==8.13.4 pickleshare==0.7.5 platformdirs==2.5.2 prompt-toolkit==3.0.30 protobuf==4.21.6 psycopg2==2.9.5 ptyprocess==0.7.0 pure-eval==0.2.2 pyOpenSSL==22.0.0 pycparser==2.21 pydantic==1.10.4 pydyf==0.5.0 pyparsing==3.0.9 pypdf==3.2.0 pyphen==0.12.0 pyrsistent==0.18.1 python-Levenshtein==0.12.2 python-dateutil==2.8.2 python-dotenv==0.20.0 python-json-logger==2.0.4 python-multipart==0.0.5 pytz==2022.2.1 redis==4.4.2 requests-file==1.5.1 requests-pkcs12==1.15 requests-toolbelt==0.9.1 requests==2.28.1 rich==13.2.0 ruff==0.0.230 sentry-sdk==1.13.0 six==1.16.0 sniffio==1.2.0 sqlparse==0.4.3 stack-data==0.5.0 starlette==0.22.0 tenacity==8.0.1 thefuzz==0.19.0 tinycss2==1.1.1 tornado==6.2 tqdm==4.64.1 traitlets==5.3.0 typed-argument-parser==1.7.2 typing-inspect==0.8.0 typing_extensions==4.3.0 ua-parser==0.16.1 urllib3==1.26.14 user-agents==2.2.0 uvicorn==0.20.0 vine==5.0.0 wcwidth==0.2.5 weasyprint==57.2 webencodings==0.5.1 xmlsec==1.3.13 xmltodict==0.13.0 yarl==1.8.1 zeep==4.2.1 zopfli==0.2.1

How can we reproduce your problem?

It is random, happens after a while, ranging from 30 min to couple of days.

What is the result that you get?

Deadlock. The app does not react to most of signals, one exception is ABRT which allowed to print the stack trace:

File "python3.11/site-packages/ddtrace/internal/processor/trace.py", line 162 in on_span_start
File "python3.11/site-packages/ddtrace/tracer.py", line 728 in _start_span
File "python3.11/site-packages/ddtrace/tracer.py", line 810 in trace
File "python3.11/site-packages/ddtrace/contrib/dbapi/init.py", line 306 in _trace_method
File "python3.11/site-packages/ddtrace/contrib/dbapi/init.py", line 325 in rollback
File "python3.11/site-packages/sqlalchemy/engine/default.py", line 683 in do_rollback
File "python3.11/site-packages/sqlalchemy/pool/base.py", line 1038 in _reset
File "python3.11/site-packages/sqlalchemy/pool/base.py", line 763 in _finalize_fairy
File "python3.11/site-packages/sqlalchemy/pool/base.py", line 509 in
File "python3.11/site-packages/ddtrace/internal/writer.py", line 545 in write
File "python3.11/site-packages/ddtrace/internal/processor/trace.py", line 206 in on_span_finish
File "python3.11/site-packages/ddtrace/tracer.py", line 746 in _on_span_finish
File "python3.11/site-packages/ddtrace/span.py", line 245 in finish
File "python3.11/site-packages/ddtrace/span.py", line 503 in exit
File "python3.11/site-packages/ddtrace/contrib/dbapi/init.py", line 89 in _trace_method
File "python3.11/site-packages/ddtrace/contrib/psycopg/patch.py", line 70 in _trace_method
File "python3.11/site-packages/ddtrace/contrib/dbapi/init.py", line 138 in execute
File "python3.11/site-packages/sqlalchemy/engine/default.py", line 736 in do_execute
File "python3.11/site-packages/sqlalchemy/engine/base.py", line 1900 in _execute_context
File "python3.11/site-packages/sqlalchemy/engine/base.py", line 1572 in _execute_clauseelement
File "python3.11/site-packages/sqlalchemy/sql/elements.py", line 334 in _execute_on_connection
File "python3.11/site-packages/sqlalchemy/engine/base.py", line 1705 in _execute_20
File "python3.11/site-packages/sqlalchemy/orm/session.py", line 1714 in execute
...

In line ddtrace/internal/processor/trace.py, line 206 in on_span_finish a lock is hold and then recursively on_span_start is called. The curiosity of it is call from ddtrace/internal/writer.py, line 545 in write to sqlalchemy/pool/base.py, line 509 in which almost feel like a stack corruption, but can it be some greenlet switch issue? I have suspicious that more code we have for async sqlalchemy the more likely is to cause the deadlock, but it can be a red herring.

What is the result that you expected?

No deadlocks, just fun and prosperity.

@P403n1x87
Copy link
Contributor

dtrace/internal/writer.py, line 545 calls into Cython, so that's perhaps why we see a jump straight to sqlalchemy/pool/base.py, line 509 which seems to be in the body of a lambda within a weakref. The deadlock is clearly caused by on_span_start being called while on_span_finish is running, with both methods attempting to acquire a non-reentrant lock. I think it remans to determine whether the call to put on the encoder in dtrace/internal/writer.py, line 545 can cause side-effects that would trigger the traced rollback operation.

@nierob just out of curiosity: I gather this is running on Python 3.11. Do you also have the profiler on when this happens, or is it just tracing?

@nierob
Copy link
Author

nierob commented Jan 27, 2023

@nierob just out of curiosity: I gather this is running on Python 3.11. Do you also have the profiler on when this happens, or is it just tracing?

We do not use profiler, so I do not know.

Yes we use python3.11

I went through logs and events in our system, down to December and I can see that we likely had a similar issue with ddtrace 1.6, it was just not debugged deeply enough, so it is hard to say, but symptoms were the same .

@P403n1x87 P403n1x87 added the Tracing Distributed Tracing label Jan 30, 2023
@nierob
Copy link
Author

nierob commented Feb 6, 2023

Is there any way I can help debugging it? At the moment the only workaround I have is to disable the tracing, which is increasing my fear of missing out ;-).

@hermansc
Copy link

@P403n1x87 you mentioned in #4149 that running on latest CPython 3.11 was a clear recommendation (?)

I also see from release notes of 1.7.3 ("This fix resolves an issue in Python 3.11 where a PyFrameObject strong reference count was not properly decremented in the stack collector.") which might be related to this issue?

@aetimmes
Copy link

I'm seeing a similar issue on py3.9 - out of curiosity, what's the full list of Python modules that you're patching with ddtrace?

@nierob
Copy link
Author

nierob commented Feb 27, 2023

In addition to the default we explicitly call:

    patch_all(
        aiohttp=True,
        aioredis=True,
        asyncio=True,
        asyncpg=True,
        httplib=True,
        logging=True,
        requests=True,
    )

@P403n1x87
Copy link
Contributor

@P403n1x87 you mentioned in #4149 that running on latest CPython 3.11 was a clear recommendation (?)

The recommendation is to run the latest CPython 3.11 when using the 3.11 release series, to benefit from any patch releases. It's not a recommendation to migrate, e.g., from 3.9 to 3.11.

I also see from release notes of 1.7.3 ("This fix resolves an issue in Python 3.11 where a PyFrameObject strong reference count was not properly decremented in the stack collector.") which might be related to this issue?

Given that the profiler was off for the original post, and that there are reports of the same issue with CPython 3.9, I think those fixes are orthogonal to the problem in question.

@aetimmes
Copy link

aetimmes commented Feb 27, 2023

In addition to the default we explicitly call:

    patch_all(
        aiohttp=True,
        aioredis=True,
        asyncio=True,
        asyncpg=True,
        httplib=True,
        logging=True,
        requests=True,
    )

What happens if you move from patch_all to patch to only patch the allowlisted modules? My current working theory for my version of this issue is that patching both SQLAlchemy and psycopg is creating the scenario under which the deadlocks can occur.

@nierob
Copy link
Author

nierob commented Feb 27, 2023

In addition to the default we explicitly call:

    patch_all(
        aiohttp=True,
        aioredis=True,
        asyncio=True,
        asyncpg=True,
        httplib=True,
        logging=True,
        requests=True,
    )

What happens if you move from patch_all to patch to only patch the allowlisted modules? My current working theory for my version of this issue is that patching both SQLAlchemy and psycopg is creating the scenario under which the deadlocks can occur.

I have noticed that we also define DD_TRACE_TORNADO_ENABLED="true" environment variable. Should not matter in this context.

From what I see sqlalchemy is not autoloaded. The code says:

"sqlalchemy": False,  # Prefer DB client instrumentation

so from my understanding it should not be loaded. I tried to debug it a bit and I do not see sqlalchemy in the patched modules set:

In [4]: sorted(ddtrace._monkey._PATCHED_MODULES)
Out[4]:
['aiohttp',
 'asyncio',
 'asyncpg',
 'botocore',
 'celery',
 'elasticsearch',
 'fastapi',
 'flask',
 'futures',
 'gevent',
 'httplib',
 'jinja2',
 'logging',
 'mako',
 'psycopg',
 'pymysql',
 'pynamodb',
 'redis',
 'requests',
 'rq',
 'sqlite3',
 'starlette']

I can try to explicitly disable sqlalchemy.

@aetimmes
Copy link

aetimmes commented Mar 1, 2023

In our case, we had both psycopg and sqlalchemy patched for tracing, and the workaround was disabling sqlalchemy - sqlalchemy was starting a trace for a transaction, and then in the finalization of the trace span, the transaction was being rolled back, which triggered a new psycopg trace to be created, which created the re-entrant lock condition that caused the deadlock.

If disabling sqlalchemy doesn't work for you, I'd go through the list of modules you're patching and disable them one-by-one to find the culprit - maybe it's the combination of psycopg2 and asyncpg that's tripping you up?

@github-actions github-actions bot added the stale label Apr 1, 2023
@emmettbutler
Copy link
Collaborator

emmettbutler commented Jun 5, 2023

We've received another report of the same deadlock traceback appearing when sqlalchemy, pymysql, and mysqlclient are all in use together.

@github-actions github-actions bot removed the stale label Jun 6, 2023
@leifwalsh
Copy link

I've tried disabling sqlalchemy tracing (I think?), and still hit it. The stack is really similar to the original stack in the issue description:

  File "/path/to/ddtrace/internal/processor/trace.py", line 179, in on_span_start
    with self._lock:
  File "/path/to/ddtrace/tracer.py", line 758, in _start_span
    p.on_span_start(span)
  File "/path/to/ddtrace/tracer.py", line 840, in trace
    return self.start_span(
  File "/path/to/ddtrace/contrib/dbapi/__init__.py", line 296, in _trace_method
    with pin.tracer.trace(name, service=ext_service(pin, self._self_config)) as s:
  File "/path/to/ddtrace/contrib/dbapi/__init__.py", line 320, in rollback
    return self._trace_method(self.__wrapped__.rollback, span_name, {}, *args, **kwargs)
  File "/path/to/sqlalchemy/engine/default.py", line 683, in do_rollback
    dbapi_connection.rollback()
  File "/path/to/sqlalchemy/pool/base.py", line 1038, in _reset
    pool._dialect.do_rollback(self)
  File "/path/to/sqlalchemy/pool/base.py", line 763, in _finalize_fairy
    fairy._reset(pool, transaction_was_reset)
  File "/path/to/sqlalchemy/pool/base.py", line 509, in <lambda>
    and _finalize_fairy(
  Garbage-collecting
  File "/path/to/ddtrace/context.py", line 105, in _update_tags
    with self._lock:
  File "/path/to/ddtrace/internal/processor/trace.py", line 139, in process_trace
    ctx._update_tags(chunk_root)
  File "/path/to/ddtrace/internal/processor/trace.py", line 219, in on_span_finish
    spans = tp.process_trace(spans)
  File "/path/to/ddtrace/tracer.py", line 776, in _on_span_finish
    p.on_span_finish(span)
  File "/path/to/ddtrace/span.py", line 274, in _finish_ns
    cb(self)
  File "/path/to/ddtrace/span.py", line 261, in finish
    self._finish_ns(time_ns())
  File "/path/to/ddtrace/span.py", line 538, in __exit__
    self.finish()
  File "/path/to/ddtrace/contrib/dbapi/__init__.py", line 117, in _trace_method
    return method(*args, **kwargs)
  File "/path/to/ddtrace/contrib/psycopg/patch.py", line 82, in _trace_method
    return super(Psycopg2TracedCursor, self)._trace_method(
  File "/path/to/ddtrace/contrib/dbapi/__init__.py", line 148, in execute
    return self._trace_method(
  File "/path/to/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
  File "/path/to/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/path/to/sqlalchemy/engine/base.py", line 1572, in _execute_clauseelement
    ret = self._execute_context(
  File "/path/to/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/path/to/sqlalchemy/engine/base.py", line 1705, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/path/to/sqlalchemy/orm/persistence.py", line 1001, in _emit_update_statements
    c = connection._execute_20(
  File "/path/to/sqlalchemy/orm/persistence.py", line 237, in save_obj
    _emit_update_statements(
  File "/path/to/sqlalchemy/orm/unitofwork.py", line 630, in execute
    util.preloaded.orm_persistence.save_obj(
  File "/path/to/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/path/to/sqlalchemy/orm/session.py", line 3544, in _flush
    flush_context.execute()
  File "/path/to/sqlalchemy/orm/session.py", line 3444, in flush
    self._flush(objects)
  File "/path/to/sqlalchemy/orm/session.py", line 808, in _prepare_impl
    self.session.flush()
  File "/path/to/sqlalchemy/orm/session.py", line 829, in commit
    self._prepare_impl()
  File "/path/to/sqlalchemy/orm/session.py", line 1451, in commit
    self._transaction.commit(_to_root=self.future)
  ... app code ...

Library versions:

  • ddtrace 1.12.6
  • sqlalchemy 1.4.46

What I thiiiiink is happening is that in the middle of trying to get the lock during one call to on_span_finish, the garbage collector runs and decides to delete something which, in its destruction path, tries to roll back a transaction to close a connection, and this enters on_span_start again.

@majorgreys
Copy link
Contributor

@leifwalsh I think you are onto something!

We have seen this before with a deadlock in the profiler that was fixed by moving to a re-entrant lock: #4108

We could do the same here for the lock used internally when spans are started and finished:

_lock = attr.ib(init=False, factory=threading.Lock, repr=False)

To test this theory, could someone try to change this to an RLock?

@majorgreys
Copy link
Contributor

The reason why this is an issue for ddtrace is that the dbapi instrumentation added to the rollback, as seen in the sqlalchemy stack trace above.

This problem should remain if only the sqlalchemy integration is disabled as the pyscopg integration would add the dbapi instrumentation. Mitigating this issue would require also disabling the psycopg integration.

As for remediation, another alternative to changing the lock type is to avoid tracing the rollback that occurs during garbage collection. But I that seems a gnarly bit of logic if we need to still trace all rollbacks in general.

@aetimmes
Copy link

aetimmes commented Jul 14, 2023

I patched a deadlocking application with the RLock fix and the deadlock condition has yet to resurface after >12 hours of continued load. Going to submit a PR to fix.

EDIT: nvm, #6325 has already been opened.

brettlangdon added a commit that referenced this issue Jul 24, 2023
This fix resolves an issue where sqlalchemy rollbacks could
intermittently cause deadlocks due SpanAggregator's `on_span_start`
using a Lock. This fix was put behind a flag
`DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible
performance impacts. While our benchmarks don't show anything major, and
we suspect this change will not impact performance much at all, we want
to be cautious and test on internal services before making it true by
default.

We plan on gathering feedback and testing on internal services, as long
as we don't see a negative impact, we plan on making this the new
default.

Should fix: #4978


## Checklist

- [x] Change(s) are motivated and described in the PR description.
- [x] Testing strategy is described if automated tests are not included
in the PR.
- [x] Risk is outlined (performance impact, potential for breakage,
maintainability, etc).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed. If no release note is required, add label
`changelog/no-changelog`.
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/)).
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [x] Title is accurate.
- [x] No unnecessary changes are introduced.
- [x] Description motivates each change.
- [x] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes unless absolutely necessary.
- [x] Testing strategy adequately addresses listed risk(s).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] Release note makes sense to a user of the library.
- [x] Reviewer has explicitly acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment.
- [x] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: Brett Langdon <[email protected]>
github-actions bot pushed a commit that referenced this issue Jul 24, 2023
This fix resolves an issue where sqlalchemy rollbacks could
intermittently cause deadlocks due SpanAggregator's `on_span_start`
using a Lock. This fix was put behind a flag
`DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible
performance impacts. While our benchmarks don't show anything major, and
we suspect this change will not impact performance much at all, we want
to be cautious and test on internal services before making it true by
default.

We plan on gathering feedback and testing on internal services, as long
as we don't see a negative impact, we plan on making this the new
default.

Should fix: #4978

## Checklist

- [x] Change(s) are motivated and described in the PR description.
- [x] Testing strategy is described if automated tests are not included
in the PR.
- [x] Risk is outlined (performance impact, potential for breakage,
maintainability, etc).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed. If no release note is required, add label
`changelog/no-changelog`.
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/)).
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [x] Title is accurate.
- [x] No unnecessary changes are introduced.
- [x] Description motivates each change.
- [x] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes unless absolutely necessary.
- [x] Testing strategy adequately addresses listed risk(s).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] Release note makes sense to a user of the library.
- [x] Reviewer has explicitly acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment.
- [x] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: Brett Langdon <[email protected]>
(cherry picked from commit 18ca85e)
github-actions bot pushed a commit that referenced this issue Jul 24, 2023
This fix resolves an issue where sqlalchemy rollbacks could
intermittently cause deadlocks due SpanAggregator's `on_span_start`
using a Lock. This fix was put behind a flag
`DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible
performance impacts. While our benchmarks don't show anything major, and
we suspect this change will not impact performance much at all, we want
to be cautious and test on internal services before making it true by
default.

We plan on gathering feedback and testing on internal services, as long
as we don't see a negative impact, we plan on making this the new
default.

Should fix: #4978

## Checklist

- [x] Change(s) are motivated and described in the PR description.
- [x] Testing strategy is described if automated tests are not included
in the PR.
- [x] Risk is outlined (performance impact, potential for breakage,
maintainability, etc).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed. If no release note is required, add label
`changelog/no-changelog`.
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/)).
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [x] Title is accurate.
- [x] No unnecessary changes are introduced.
- [x] Description motivates each change.
- [x] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes unless absolutely necessary.
- [x] Testing strategy adequately addresses listed risk(s).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] Release note makes sense to a user of the library.
- [x] Reviewer has explicitly acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment.
- [x] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: Brett Langdon <[email protected]>
(cherry picked from commit 18ca85e)
mabdinur pushed a commit that referenced this issue Jul 25, 2023
Backport 18ca85e from #6325 to 1.16.

This fix resolves an issue where sqlalchemy rollbacks could
intermittently cause deadlocks due SpanAggregator's `on_span_start`
using a Lock. This fix was put behind a flag
`DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible
performance impacts. While our benchmarks don't show anything major, and
we suspect this change will not impact performance much at all, we want
to be cautious and test on internal services before making it true by
default.

We plan on gathering feedback and testing on internal services, as long
as we don't see a negative impact, we plan on making this the new
default.

Should fix: #4978


## Checklist

- [x] Change(s) are motivated and described in the PR description.
- [x] Testing strategy is described if automated tests are not included
in the PR.
- [x] Risk is outlined (performance impact, potential for breakage,
maintainability, etc).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed. If no release note is required, add label
`changelog/no-changelog`.
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/)).
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [x] Title is accurate.
- [x] No unnecessary changes are introduced.
- [x] Description motivates each change.
- [x] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes unless absolutely necessary.
- [x] Testing strategy adequately addresses listed risk(s).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] Release note makes sense to a user of the library.
- [x] Reviewer has explicitly acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment.
- [x] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

Co-authored-by: Zachary Groves <[email protected]>
mabdinur pushed a commit that referenced this issue Jul 25, 2023
Backport 18ca85e from #6325 to 1.17.

This fix resolves an issue where sqlalchemy rollbacks could
intermittently cause deadlocks due SpanAggregator's `on_span_start`
using a Lock. This fix was put behind a flag
`DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible
performance impacts. While our benchmarks don't show anything major, and
we suspect this change will not impact performance much at all, we want
to be cautious and test on internal services before making it true by
default.

We plan on gathering feedback and testing on internal services, as long
as we don't see a negative impact, we plan on making this the new
default.

Should fix: #4978


## Checklist

- [x] Change(s) are motivated and described in the PR description.
- [x] Testing strategy is described if automated tests are not included
in the PR.
- [x] Risk is outlined (performance impact, potential for breakage,
maintainability, etc).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed. If no release note is required, add label
`changelog/no-changelog`.
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/)).
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [x] Title is accurate.
- [x] No unnecessary changes are introduced.
- [x] Description motivates each change.
- [x] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes unless absolutely necessary.
- [x] Testing strategy adequately addresses listed risk(s).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] Release note makes sense to a user of the library.
- [x] Reviewer has explicitly acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment.
- [x] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

Co-authored-by: Zachary Groves <[email protected]>
romainkomorn-exdatadog pushed a commit that referenced this issue Aug 8, 2023
This fix resolves an issue where sqlalchemy rollbacks could
intermittently cause deadlocks due SpanAggregator's `on_span_start`
using a Lock. This fix was put behind a flag
`DD_TRACE_SPAN_AGGREGATOR_RLOCK` which defaults to false due to possible
performance impacts. While our benchmarks don't show anything major, and
we suspect this change will not impact performance much at all, we want
to be cautious and test on internal services before making it true by
default.

We plan on gathering feedback and testing on internal services, as long
as we don't see a negative impact, we plan on making this the new
default.

Should fix: #4978


## Checklist

- [x] Change(s) are motivated and described in the PR description.
- [x] Testing strategy is described if automated tests are not included
in the PR.
- [x] Risk is outlined (performance impact, potential for breakage,
maintainability, etc).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed. If no release note is required, add label
`changelog/no-changelog`.
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/)).
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [x] Title is accurate.
- [x] No unnecessary changes are introduced.
- [x] Description motivates each change.
- [x] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes unless absolutely necessary.
- [x] Testing strategy adequately addresses listed risk(s).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] Release note makes sense to a user of the library.
- [x] Reviewer has explicitly acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment.
- [x] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: Brett Langdon <[email protected]>
ZStriker19 added a commit that referenced this issue Sep 8, 2023
Following up on the work here:
#6325

For this issue: #4978

After testing on staging for delancie crawlers without issue, we can
make Rlock the default for SpanAggregator.

## Checklist

- [x] Change(s) are motivated and described in the PR description.
- [x] Testing strategy is described if automated tests are not included
in the PR.
- [x] Risk is outlined (performance impact, potential for breakage,
maintainability, etc).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed. If no release note is required, add label
`changelog/no-changelog`.
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/)).
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [ ] Title is accurate.
- [ ] No unnecessary changes are introduced.
- [ ] Description motivates each change.
- [ ] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes unless absolutely necessary.
- [ ] Testing strategy adequately addresses listed risk(s).
- [ ] Change is maintainable (easy to change, telemetry, documentation).
- [ ] Release note makes sense to a user of the library.
- [ ] Reviewer has explicitly acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment.
- [ ] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
- [ ] If this PR touches code that signs or publishes builds or
packages, or handles credentials of any kind, I've requested a review
from `@DataDog/security-design-and-guidance`.
- [ ] This PR doesn't touch any of that.

---------

Co-authored-by: Munir Abdinur <[email protected]>
ZStriker19 added a commit that referenced this issue Sep 11, 2023
Following up on the work here:
#6325

For this issue: #4978

After testing on staging for delancie crawlers without issue, we can
make Rlock the default for SpanAggregator.

## Checklist

- [x] Change(s) are motivated and described in the PR description.
- [x] Testing strategy is described if automated tests are not included
in the PR.
- [x] Risk is outlined (performance impact, potential for breakage,
maintainability, etc).
- [x] Change is maintainable (easy to change, telemetry, documentation).
- [x] [Library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
are followed. If no release note is required, add label
`changelog/no-changelog`.
- [x] Documentation is included (in-code, generated user docs, [public
corp docs](https://github.com/DataDog/documentation/)).
- [x] Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist

- [ ] Title is accurate.
- [ ] No unnecessary changes are introduced.
- [ ] Description motivates each change.
- [ ] Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes unless absolutely necessary.
- [ ] Testing strategy adequately addresses listed risk(s).
- [ ] Change is maintainable (easy to change, telemetry, documentation).
- [ ] Release note makes sense to a user of the library.
- [ ] Reviewer has explicitly acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment.
- [ ] Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
- [ ] If this PR touches code that signs or publishes builds or
packages, or handles credentials of any kind, I've requested a review
from `@DataDog/security-design-and-guidance`.
- [ ] This PR doesn't touch any of that.

---------

Co-authored-by: Munir Abdinur <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Tracing Distributed Tracing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants