
Is there a potential regression in performance of CAS Add() in v1.21.0? #1748


Closed
chlunde opened this issue Feb 24, 2025 · 10 comments

@chlunde

chlunde commented Feb 24, 2025

Re. #1661

I wonder if we should reconsider the 10 ms value for the initial sleep, because in my mind this is a very long time. I worry that callers of Add() would not expect it, and that it could lead to high tail latencies and lower throughput for a normal workload.
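For context, the backoff introduced in #1661 gives the update loop roughly the following shape. This is a simplified sketch only; the exact constants, jitter, and cap in the library may differ, and the package and function names here are illustrative.

package sketch

import (
	"math"
	"math/rand"
	"sync/atomic"
	"time"
)

// atomicUpdateFloatSketch retries a CompareAndSwap and, on each failure,
// sleeps a jittered delay that doubles up to a cap. The 10 ms starting
// value is the number under discussion; the doubling factor and cap are
// assumptions for illustration.
func atomicUpdateFloatSketch(bits *uint64, updateFunc func(float64) float64) {
	delay := 10 * time.Millisecond
	const maxDelay = 320 * time.Millisecond
	for {
		loaded := atomic.LoadUint64(bits)
		updated := math.Float64bits(updateFunc(math.Float64frombits(loaded)))
		if atomic.CompareAndSwapUint64(bits, loaded, updated) {
			return
		}
		// On CAS failure, sleep a jittered fraction of the current delay,
		// then back off exponentially up to the cap.
		time.Sleep(time.Duration(rand.Int63n(int64(delay))))
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}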

I think the microbenchmark is biased against normal workloads:

  • it does not perform any work between calls to Add()
  • in many of its cases, it uses a very high number of goroutines compared to the number of cores, which hides the throughput problem

To illustrate the issue, I have drafted what I hope is a more balanced microbenchmark with four goroutines. On my machine it shows a regression compared to 1.20.0: a task that took 3 ms before now takes 13 ms to 39 ms, with median values around 25 ms. With an initial sleep of 10 µs instead of 10 ms, the execution time is still 3-4 ms.

// Note: this test lives in package prometheus (atomic_update_test.go) so it
// can call the unexported atomicUpdateFloat.
package prometheus

import (
	"sync"
	"testing"
	"time"
	"unsafe"
)

func TestAtomicUpdateFloatLatency(t *testing.T) {
	var val float64 = 0.0
	bits := (*uint64)(unsafe.Pointer(&val))
	var wg sync.WaitGroup
	numGoroutines := 4 // Low contention scenario
	latencies := make([]time.Duration, numGoroutines)
	workData := make([]int, numGoroutines)

	work := func(idx int) {
		// Simulate some work
		j := workData[idx]
		for i := 0; i < 10000; i++ {
			j += i
		}

		workData[idx] += j
	}

	testStart := time.Now()
	for i := 0; i < numGoroutines; i++ {
		wg.Add(1)
		go func(idx int) {
			defer wg.Done()

			for j := 0; j < 1000; j++ {
				// t := time.Now() // uncomment to discover how long it takes to do the work on your machine
				work(idx)
				// fmt.Println("Work took", time.Since(t))

				start := time.Now()
				atomicUpdateFloat(bits, func(f float64) float64 {
					return f + 1.0
				})
				latencies[idx] += time.Since(start)
			}
		}(i)
	}

	wg.Wait()

	// Check if any operation took longer than expected due to backoff
	for i, lat := range latencies {
		if lat >= 5*time.Millisecond {
			t.Errorf("Operation %d took %v, which suggests unnecessary backoff delay", i, lat)
		}
	}

	t.Logf("Test took %v", time.Since(testStart))
}

Results with the current 10 ms initial sleep:

$ go test -count=10 -v -run=TestAtomicUpdateFloatLatency
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 2 took 11.077548ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 10.215797ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 13.678618ms
--- FAIL: TestAtomicUpdateFloatLatency (0.01s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 21.180438ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 10.336778ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 24.43109ms
--- FAIL: TestAtomicUpdateFloatLatency (0.02s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 10.22996ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 1 took 20.774034ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 2 took 10.588107ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 25.983047ms
--- FAIL: TestAtomicUpdateFloatLatency (0.03s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 1 took 10.4611ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 2 took 21.187434ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 10.116956ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 28.06019ms
--- FAIL: TestAtomicUpdateFloatLatency (0.03s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 10.948254ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 2 took 10.333145ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 15.054686ms
--- FAIL: TestAtomicUpdateFloatLatency (0.02s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 10.308113ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 20.704634ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 24.944142ms
--- FAIL: TestAtomicUpdateFloatLatency (0.02s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 21.336846ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 1 took 10.494176ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 31.890589ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 36.762684ms
--- FAIL: TestAtomicUpdateFloatLatency (0.04s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 10.540261ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 21.214645ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 29.41141ms
--- FAIL: TestAtomicUpdateFloatLatency (0.03s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 10.518466ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 1 took 22.075644ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 2 took 31.846374ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 37.270233ms
--- FAIL: TestAtomicUpdateFloatLatency (0.04s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 1 took 21.45678ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 2 took 10.683375ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 30.958185ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 39.396503ms
--- FAIL: TestAtomicUpdateFloatLatency (0.04s)
FAIL
exit status 1
FAIL	github.com/prometheus/client_golang/prometheus	0.280s

Results with a 10 µs initial sleep:

$ go test -count=10 -v -run=TestAtomicUpdateFloatLatency
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 4.733696ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 5.024125ms
--- PASS: TestAtomicUpdateFloatLatency (0.01s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.034559ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.877559ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 2.856746ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.730594ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.782137ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 2.988741ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 4.279611ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.001143ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
PASS
ok  	github.com/prometheus/client_golang/prometheus	0.041s

Results with no sleep between retries:

$ go test -count=10 -v -run=TestAtomicUpdateFloatLatency
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.753508ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.454005ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.747001ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.300866ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.250745ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 2.99001ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.589476ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.858731ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 4.221545ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.713847ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
PASS
ok  	github.com/prometheus/client_golang/prometheus	0.039s

Results with runtime.Gosched() instead of sleeping:

$ go test -count=10 -v -run=TestAtomicUpdateFloatLatency
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.875757ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.412897ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.197536ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.328073ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.137486ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.119635ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 2.965832ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.105811ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.154215ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 2.921208ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
PASS
ok  	github.com/prometheus/client_golang/prometheus	0.036s

--

edit: replaced Inc() with Add() as Inc is not relevant


dosubot bot commented Feb 24, 2025

@chlunde Welcome! I'm here to help you solve bugs, answer questions, and become a contributor.

The concern raised about the potential regression in performance of the CAS Inc() function in version 1.21.0 of the client_golang library seems to be related to the initial sleep duration of 10 milliseconds used in the atomicUpdateFloat function. This duration is considered long and could lead to high tail latencies and lower throughput, especially in scenarios where the function is called frequently without much work in between.

The microbenchmark results provided illustrate that with the 10 ms initial sleep, operations take significantly longer, with median values around 25 ms, compared to using a 10 µs initial sleep or no sleep at all, where execution times are much shorter (around 3-4 ms). This suggests that the initial sleep duration could be causing unnecessary delays, particularly in low contention scenarios with fewer goroutines.

The suggestion to reduce the initial sleep duration or use alternatives like runtime.Gosched() could potentially mitigate these performance issues by reducing unnecessary backoff delays and improving throughput in typical workloads.


@dosubot dosubot bot added the bug label Feb 24, 2025
@chlunde chlunde changed the title Is there a potential regression in performance of CAS Inc() in v1.21.0? Is there a potential regression in performance of CAS Add() in v1.21.0? Feb 24, 2025
@dethi

dethi commented Feb 26, 2025

> I wonder if we should reconsider the 10 ms value for the initial sleep, because in my mind this is a very long time. I worry that callers of Add() would not expect it, and that it could lead to high tail latencies and lower throughput for a normal workload.

This is exactly what surprised us in production. We had been chasing a performance degradation in our system: the engineering team bisected our git history to find the commit causing a 10x regression in our benchmark that had escaped to prod. It turned out to be one of our regular Go module updates, which contained this prometheus change. The code affected by the regression has a few prometheus counters that we added a while back, all using the .Add(...) function.

(image attachment)

@aaronbee

I work with @dethi. The particular case where this came up is where we launch several short-lived goroutines that each do a little bit of work and increment a counter.

It's not acceptable for us for a metric Add operation to take 10 ms (or much longer if you get unlucky). In the particular case where we are hitting this, we are counting the cache hits that avoid a roundtrip to our DB. The roundtrip to the DB often takes less than 10 ms, so for the metric Add to potentially take longer than that is really bad for us! We appreciate the work here, but I think this behavior needs to be reconsidered.

We have a benchmark for this code path. It got 10x slower due to this change. Reducing the initial sleep to 10 microseconds reduces the average time by a lot, but still incurs a 20% penalty.

@aaronbee

aaronbee commented Feb 27, 2025

To echo @chlunde's analysis, most code is not doing metric increments in a tight loop; it is doing some actual work and then occasionally updating a metric, with the expectation that the metric update will be quick. I feel the best way to handle that workflow is a tight CompareAndSwap loop, like the code used before #1661 (roughly the shape sketched below). In the normal scenario contention should be rare, and when there is a bit of contention it should be resolved in a small number of retries. Sleeping for any amount of time is far too costly for a metric update.
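For reference, the tight-loop style I mean looks roughly like this. It is a minimal sketch, not the exact pre-#1661 code, and the package and function names are illustrative only.

package sketch

import (
	"math"
	"sync/atomic"
)

// addFloatTight retries the CompareAndSwap immediately on failure, with no
// sleeping. Under low contention it almost always succeeds on the first
// attempt, so the update typically completes without any retries.
func addFloatTight(bits *uint64, v float64) {
	for {
		loaded := atomic.LoadUint64(bits)
		updated := math.Float64bits(math.Float64frombits(loaded) + v)
		if atomic.CompareAndSwapUint64(bits, loaded, updated) {
			return
		}
	}
}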

@chlunde
Author

chlunde commented Feb 27, 2025

FYI @imorph @bwplotka - see comments above

@bwplotka
Member

bwplotka commented Mar 4, 2025

Ack, thanks for the feedback!

Also recording additional feedback from @tsuna.

I think we have enough evidence that the 10 ms penalty is too high; thanks for the feedback.

We should revert the problematic change and consider adding a separate implementation for high-contention scenarios, if we can't come up with a dynamic algorithm for this (e.g. noticing contention and switching mechanisms).

Help wanted on bridging this gap!
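To make the dynamic-algorithm idea a bit more concrete, one possible shape (purely illustrative, not a concrete proposal; the thresholds are placeholders and would need benchmarking) is to retry immediately a few times, then yield, and only sleep under sustained contention:

package sketch

import (
	"math"
	"runtime"
	"sync/atomic"
	"time"
)

// updateFloatAdaptive escalates its backoff only when the CAS keeps failing:
// retry immediately first, then yield the processor, then sleep briefly.
func updateFloatAdaptive(bits *uint64, updateFunc func(float64) float64) {
	const spinTries = 8 // placeholder threshold
	for tries := 0; ; tries++ {
		loaded := atomic.LoadUint64(bits)
		updated := math.Float64bits(updateFunc(math.Float64frombits(loaded)))
		if atomic.CompareAndSwapUint64(bits, loaded, updated) {
			return
		}
		switch {
		case tries < spinTries:
			// low contention: retry immediately
		case tries < 4*spinTries:
			runtime.Gosched() // moderate contention: yield to other goroutines
		default:
			time.Sleep(10 * time.Microsecond) // sustained contention: brief sleep
		}
	}
}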

@bwplotka
Member

bwplotka commented Mar 4, 2025

#1757

@bwplotka
Member

bwplotka commented Mar 4, 2025

This is now fixed in v1.21.1 (announcement).

We will be hardening the release process even further (#1759) to prevent this in the future. Apologies for the inconvenience, and thanks everyone here for the quick research and report! 💪🏽 Feedback is welcome on what more we can do to prevent regressions like this. We might contact some of you in the future for realistic benchmarks to set up as acceptance tests.

The high-concurrency optimization is still planned to be reintroduced eventually, but in a much safer manner, potentially behind a separate API.

@bwplotka bwplotka closed this as completed Mar 4, 2025
@imorph
Contributor

imorph commented Mar 5, 2025

OMG!

Looks like I really messed up here. @chlunde thanks for noticing!

@bwplotka thanks for reverting! May I try #1760 but with more rigorous testing this time?

@bwplotka
Member

bwplotka commented Mar 5, 2025

Please do! I will try to prepare better benchmark suites for this too.
