
Is there a potential regression in performance of CAS Add() in v1.21.0? #1748


Closed
chlunde opened this issue Feb 24, 2025 · 10 comments

@chlunde

chlunde commented Feb 24, 2025

Re. #1661

I wonder if we should reconsider the 10 ms value for the initial sleep, because in my mind this is a very long time. I worry that callers of Add() would not expect it, and that it could lead to high tail latencies and lower throughput for a normal workload.
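For context, the backoff introduced in #1661 gives the update loop roughly the following shape. This is a simplified sketch only; the exact constants, jitter, and cap in the library may differ, and the package and function names here are illustrative.

package sketch

import (
	"math"
	"math/rand"
	"sync/atomic"
	"time"
)

// atomicUpdateFloatSketch retries a CompareAndSwap and, on each failure,
// sleeps a jittered delay that doubles up to a cap. The 10 ms starting
// value is the number under discussion; the doubling factor and cap are
// assumptions for illustration.
func atomicUpdateFloatSketch(bits *uint64, updateFunc func(float64) float64) {
	delay := 10 * time.Millisecond
	const maxDelay = 320 * time.Millisecond
	for {
		loaded := atomic.LoadUint64(bits)
		updated := math.Float64bits(updateFunc(math.Float64frombits(loaded)))
		if atomic.CompareAndSwapUint64(bits, loaded, updated) {
			return
		}
		// On CAS failure, sleep a jittered fraction of the current delay,
		// then back off exponentially up to the cap.
		time.Sleep(time.Duration(rand.Int63n(int64(delay))))
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}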

I think the microbenchmark is biased against normal workloads:

  • it does not perform any work between calls to Add()
  • in many of its cases, it uses a very high number of goroutines compared to the number of cores, which hides the throughput problem

To illustrate the issue, I have drafted what I hope is a more balanced microbenchmark with four goroutines. On my machine it shows a regression compared to 1.20.0: a task that took 3 ms before now takes 13 ms to 39 ms, with median values around 25 ms. With an initial sleep of 10 µs instead of 10 ms, the execution time is still 3-4 ms.

// Note: this test lives in package prometheus (atomic_update_test.go) so it
// can call the unexported atomicUpdateFloat.
package prometheus

import (
	"sync"
	"testing"
	"time"
	"unsafe"
)

func TestAtomicUpdateFloatLatency(t *testing.T) {
	var val float64 = 0.0
	bits := (*uint64)(unsafe.Pointer(&val))
	var wg sync.WaitGroup
	numGoroutines := 4 // Low contention scenario
	latencies := make([]time.Duration, numGoroutines)
	workData := make([]int, numGoroutines)

	work := func(idx int) {
		// Simulate some work
		j := workData[idx]
		for i := 0; i < 10000; i++ {
			j += i
		}

		workData[idx] += j
	}

	testStart := time.Now()
	for i := 0; i < numGoroutines; i++ {
		wg.Add(1)
		go func(idx int) {
			defer wg.Done()

			for j := 0; j < 1000; j++ {
				// t := time.Now() // uncomment to discover how long it takes to do the work on your machine
				work(idx)
				// fmt.Println("Work took", time.Since(t))

				start := time.Now()
				atomicUpdateFloat(bits, func(f float64) float64 {
					return f + 1.0
				})
				latencies[idx] += time.Since(start)
			}
		}(i)
	}

	wg.Wait()

	// Check if any operation took longer than expected due to backoff
	for i, lat := range latencies {
		if lat >= 5*time.Millisecond {
			t.Errorf("Operation %d took %v, which suggests unnecessary backoff delay", i, lat)
		}
	}

	t.Logf("Test took %v", time.Since(testStart))
}

Results with the current 10 ms initial sleep:

$ go test -count=10 -v -run=TestAtomicUpdateFloatLatency
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 2 took 11.077548ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 10.215797ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 13.678618ms
--- FAIL: TestAtomicUpdateFloatLatency (0.01s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 21.180438ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 10.336778ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 24.43109ms
--- FAIL: TestAtomicUpdateFloatLatency (0.02s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 10.22996ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 1 took 20.774034ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 2 took 10.588107ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 25.983047ms
--- FAIL: TestAtomicUpdateFloatLatency (0.03s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 1 took 10.4611ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 2 took 21.187434ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 10.116956ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 28.06019ms
--- FAIL: TestAtomicUpdateFloatLatency (0.03s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 10.948254ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 2 took 10.333145ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 15.054686ms
--- FAIL: TestAtomicUpdateFloatLatency (0.02s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 10.308113ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 20.704634ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 24.944142ms
--- FAIL: TestAtomicUpdateFloatLatency (0.02s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 21.336846ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 1 took 10.494176ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 31.890589ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 36.762684ms
--- FAIL: TestAtomicUpdateFloatLatency (0.04s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 10.540261ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 21.214645ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 29.41141ms
--- FAIL: TestAtomicUpdateFloatLatency (0.03s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 0 took 10.518466ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 1 took 22.075644ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 2 took 31.846374ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 37.270233ms
--- FAIL: TestAtomicUpdateFloatLatency (0.04s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:213: Operation 1 took 21.45678ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 2 took 10.683375ms, which suggests unnecessary backoff delay
    atomic_update_test.go:213: Operation 3 took 30.958185ms, which suggests unnecessary backoff delay
    atomic_update_test.go:217: Test took 39.396503ms
--- FAIL: TestAtomicUpdateFloatLatency (0.04s)
FAIL
exit status 1
FAIL	github.com/prometheus/client_golang/prometheus	0.280s

Results with a 10 µs initial sleep:

$ go test -count=10 -v -run=TestAtomicUpdateFloatLatency
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 4.733696ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 5.024125ms
--- PASS: TestAtomicUpdateFloatLatency (0.01s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.034559ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.877559ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 2.856746ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.730594ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.782137ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 2.988741ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 4.279611ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.001143ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
PASS
ok  	github.com/prometheus/client_golang/prometheus	0.041s

Results with no sleep between retries:

$ go test -count=10 -v -run=TestAtomicUpdateFloatLatency
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.753508ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.454005ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.747001ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.300866ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.250745ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 2.99001ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.589476ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.858731ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 4.221545ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.713847ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
PASS
ok  	github.com/prometheus/client_golang/prometheus	0.039s

Results with runtime.Gosched() instead of sleeping:

$ go test -count=10 -v -run=TestAtomicUpdateFloatLatency
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.875757ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.412897ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.197536ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.328073ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.137486ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.119635ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 2.965832ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.105811ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 3.154215ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
=== RUN   TestAtomicUpdateFloatLatency
    atomic_update_test.go:217: Test took 2.921208ms
--- PASS: TestAtomicUpdateFloatLatency (0.00s)
PASS
ok  	github.com/prometheus/client_golang/prometheus	0.036s

--

edit: replaced Inc() with Add() as Inc is not relevant


dosubot bot commented Feb 24, 2025

@chlunde Welcome! I'm here to help you solve bugs, answer questions, and become a contributor.

The concern raised about the potential regression in performance of the CAS Inc() function in version 1.21.0 of the client_golang library seems to be related to the initial sleep duration of 10 milliseconds used in the atomicUpdateFloat function. This duration is considered long and could lead to high tail latencies and lower throughput, especially in scenarios where the function is called frequently without much work in between.

The microbenchmark results provided illustrate that with the 10 ms initial sleep, operations take significantly longer, with median values around 25 ms, compared to using a 10 µs initial sleep or no sleep at all, where execution times are much shorter (around 3-4 ms). This suggests that the initial sleep duration could be causing unnecessary delays, particularly in low contention scenarios with fewer goroutines.

The suggestion to reduce the initial sleep duration or use alternatives like runtime.Gosched() could potentially mitigate these performance issues by reducing unnecessary backoff delays and improving throughput in typical workloads.


@dosubot dosubot bot added the bug label Feb 24, 2025
@chlunde chlunde changed the title Is there a potential regression in performance of CAS Inc() in v1.21.0? Is there a potential regression in performance of CAS Add() in v1.21.0? Feb 24, 2025
@dethi

dethi commented Feb 26, 2025

> I wonder if we should reconsider the 10 ms value for the initial sleep, because in my mind this is a very long time. I worry that callers of Add() would not expect it, and that it could lead to high tail latencies and lower throughput for a normal workload.

This is exactly what surprised us in production. We had been chasing a performance degradation in our system: the engineering team bisected our git history to find the commit causing a 10x regression in our benchmark that had escaped to prod. It turned out to be one of our regular Go module updates, which contained this prometheus change. The code affected by the regression has a few prometheus counters that we added a while back, all using the .Add(...) function.

(image attachment)

@aaronbee

I work with @dethi. The particular case where this came up is where we launch several short-lived goroutines that each do a little bit of work and increment a counter.

It's not acceptable for us for a metric Add operation to take 10 ms (or much longer if you get unlucky). In the particular case where we are hitting this, we are counting the cache hits that avoid a roundtrip to our DB. The roundtrip to the DB often takes less than 10 ms, so for the metric Add to potentially take longer than that is really bad for us! We appreciate the work here, but I think this behavior needs to be reconsidered.

We have a benchmark for this code path. It got 10x slower due to this change. Reducing the initial sleep to 10 microseconds reduces the average time by a lot, but still incurs a 20% penalty.

@aaronbee

aaronbee commented Feb 27, 2025

To echo @chlunde's analysis, most code is not doing metric increments in a tight loop; it is doing some actual work and then occasionally updating a metric, with the expectation that the metric update will be quick. I feel the best way to handle that workflow is a tight CompareAndSwap loop, like the code used before #1661 (roughly the shape sketched below). In the normal scenario contention should be rare, and when there is a bit of contention it should be resolved in a small number of retries. Sleeping for any amount of time is far too costly for a metric update.
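For reference, the tight-loop style I mean looks roughly like this. It is a minimal sketch, not the exact pre-#1661 code, and the package and function names are illustrative only.

package sketch

import (
	"math"
	"sync/atomic"
)

// addFloatTight retries the CompareAndSwap immediately on failure, with no
// sleeping. Under low contention it almost always succeeds on the first
// attempt, so the update typically completes without any retries.
func addFloatTight(bits *uint64, v float64) {
	for {
		loaded := atomic.LoadUint64(bits)
		updated := math.Float64bits(math.Float64frombits(loaded) + v)
		if atomic.CompareAndSwapUint64(bits, loaded, updated) {
			return
		}
	}
}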

@chlunde
Author

chlunde commented Feb 27, 2025

FYI @imorph @bwplotka - see comments above

@bwplotka
Member

bwplotka commented Mar 4, 2025

Ack, thanks for the feedback!

Also recording additional feedback from @tsuna.

I think we have enough evidence that the 10 ms penalty is too high; thanks for the feedback.

We should revert the problematic change and consider adding a separate implementation for high-contention scenarios, if we can't come up with a dynamic algorithm for this (e.g. noticing contention and switching mechanisms).

Help wanted on bridging this gap!
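To make the dynamic-algorithm idea a bit more concrete, one possible shape (purely illustrative, not a concrete proposal; the thresholds are placeholders and would need benchmarking) is to retry immediately a few times, then yield, and only sleep under sustained contention:

package sketch

import (
	"math"
	"runtime"
	"sync/atomic"
	"time"
)

// updateFloatAdaptive escalates its backoff only when the CAS keeps failing:
// retry immediately first, then yield the processor, then sleep briefly.
func updateFloatAdaptive(bits *uint64, updateFunc func(float64) float64) {
	const spinTries = 8 // placeholder threshold
	for tries := 0; ; tries++ {
		loaded := atomic.LoadUint64(bits)
		updated := math.Float64bits(updateFunc(math.Float64frombits(loaded)))
		if atomic.CompareAndSwapUint64(bits, loaded, updated) {
			return
		}
		switch {
		case tries < spinTries:
			// low contention: retry immediately
		case tries < 4*spinTries:
			runtime.Gosched() // moderate contention: yield to other goroutines
		default:
			time.Sleep(10 * time.Microsecond) // sustained contention: brief sleep
		}
	}
}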

@bwplotka
Member

bwplotka commented Mar 4, 2025

#1757

@bwplotka
Member

bwplotka commented Mar 4, 2025

This is now fixed in v1.21.1 (announcement).

We will be hardening the release process even further (#1759) to prevent this in the future. Apologies for the inconvenience, and thanks everyone here for the quick research and report! 💪🏽 Feedback is welcome on what more we can do to prevent regressions like this. We might contact some of you in the future for realistic benchmarks to set up as acceptance tests.

The high-concurrency optimization is still planned to be reintroduced eventually, but in a much safer manner, potentially behind a separate API.

@bwplotka bwplotka closed this as completed Mar 4, 2025
@imorph
Contributor

imorph commented Mar 5, 2025

OMG!

Looks like I really messed up here. @chlunde thanks for noticing!

@bwplotka thanks for reverting! May I try #1760 but with more rigorous testing this time?

@bwplotka
Member

bwplotka commented Mar 5, 2025

Please do! I will try to prepare better benchmark suites for this too.
