Is there a potential regression in performance of CAS Add() in v1.21.0? #1748
Comments
@chlunde Welcome! I'm here to help you solve bugs, answer questions, and become a contributor. The concern about a potential performance regression in CAS Add() appears well founded. The microbenchmark results show that with the 10 ms initial sleep, operations take significantly longer, with median values around 25 ms, compared to a 10 µs initial sleep or no sleep at all, where execution times are much shorter (around 3-4 ms). This suggests that the initial sleep duration is causing unnecessary delays, particularly in low-contention scenarios with fewer goroutines. Reducing the initial sleep duration, or using an alternative such as runtime.Gosched(), could mitigate this.
I work with @dethi. The particular case where this came up is one where we launch several short-lived goroutines that each do a little bit of work and increment a counter. It's not acceptable for us for a single metric update to sleep for milliseconds. We have a benchmark for this code path, and it got 10x slower due to this change. Reducing the initial sleep to 10 microseconds reduces the average time by a lot, but still incurs a 20% penalty.
To echo @chlunde's analysis, most code is not doing metric increments in a tight loop. They are doing some actual work and then occasionally updating a metric with the expectation that this metric update should be quick. I feel the best way to handle that workflow is with a tight CompareAndSwap loop, like the code was doing before #1661. In the normal scenario contention should be rare. And when there is a bit of contention it should be resolved in a small number of retries. Sleeping for any amount of time is way too costly for a metric update.
Ack, thanks for the feedback! Also recording feedback from @tsuna. I think we have enough evidence that the 10 ms penalty is too high. We should revert the problematic change and consider adding a separate implementation for high-contention scenarios, if we can't have a dynamic algorithm for this (e.g. noticing contention and switching mechanisms). Help wanted on bridging this gap!
This is now fixed in v1.21.1 (announcement). We will be hardening the release process even further (#1759) to prevent this in the future. Apologies for the inconvenience, and thanks everyone here for the quick research and report! 💪🏽 Feedback is welcome on what more we can do to prevent regressions like this. We might contact you in the future for some realistic benchmarks to set up as acceptance tests. The high-concurrency optimization is planned to be reintroduced eventually too, but in a much safer manner, potentially behind a separate API.
Please do! I will try to prepare better benchmark suites for this too |
Re. #1661

I wonder if we should reconsider the value of 10 ms for the initial sleep, because in my mind this is a very long time. I worry that callers of Add() would not expect it, and that it could lead to high tail latencies and lower throughput for a normal workflow. I also think the microbenchmark is biased against normal workloads of Add().

To illustrate the issue, I have drafted what I hope is a more balanced microbenchmark with four goroutines. On my machine it shows regressions compared to 1.20.0: a task that took 3 ms before now takes 13 ms to 39 ms, with median values around 25 ms. With an initial sleep of 10 µs instead of 10 ms, the execution time is still 3-4 ms. The variants compared are:

- 10 ms initial sleep
- 10 µs initial sleep
- No sleep
- runtime.Gosched()
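A rough, self-contained sketch of the kind of "more balanced" microbenchmark described (my own reconstruction from the description above, not the original benchmark; the run helper and iteration counts are invented): each of four goroutines does a small unit of work and then performs exactly one CAS-based metric update, so contention on the counter is occasional rather than constant.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"sync/atomic"
	"time"
)

// run launches the given number of goroutines; each does iters units of
// "work" (a stand-in math.Sqrt call) followed by one CAS-based counter
// increment, and returns the final counter value.
func run(goroutines, iters int) float64 {
	var bits uint64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				_ = math.Sqrt(float64(i)) // stand-in for real work
				for { // the single metric update per unit of work
					old := atomic.LoadUint64(&bits)
					new := math.Float64bits(math.Float64frombits(old) + 1)
					if atomic.CompareAndSwapUint64(&bits, old, new) {
						break
					}
				}
			}
		}()
	}
	wg.Wait()
	return math.Float64frombits(bits)
}

func main() {
	start := time.Now()
	total := run(4, 10000)
	fmt.Println(total, "in", time.Since(start))
}
```

With this interleaving of work and updates, any fixed sleep taken inside the counter's retry path dominates the elapsed time, which is exactly the bias the benchmark is meant to expose.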
--
edit: replaced Inc() with Add(), as Inc() is not relevant here