Skip to content

storage: Add load-based split telemetry #39192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 27, 2019

Conversation

andy-kimball
Copy link
Contributor

Add "kv.split.load" telemetry counter. It is incremented each time a load-
based split occurs.

Release note: None

@andy-kimball andy-kimball requested review from danhhz, ajwerner, nvanbenschoten and a team July 31, 2019 16:41
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andy-kimball andy-kimball requested a review from knz July 31, 2019 16:42
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you reckon this could be tested?

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)

Copy link
Contributor Author

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to suggestions on how to test this in reasonable, automated way. It's one of the things I was hoping to get some feedback about from this group, since I didn't see much precedent for it in the core code to follow.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)

Copy link
Contributor

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I searched for an end-to-end load-based split in our unit tests but couldn't find one. That's a little surprising to me, but I guess our feeling at the time was that the smaller unit tests + the suite of splits/load roachtests provided sufficient coverage.

In terms of testing these telemetry hooks, we seem to be pretty inconsistent. We have a number of tests in pkg/kv that get a lot of utility out of peering in at the TxnCoordSender's metrics (see checkTxnMetrics). In other places we haven't found them to be as useful. I think in general we've taken the stance that we'll test metrics if doing so helps test the larger component, but we don't test metrics just to make sure that they're hooked up correctly. I personally don't have a strong feeling either way, other than that I would be concerned about a strict rule here causing us to avoid adding telemetry in hard-to-reach places in the first place.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @danhhz)


pkg/storage/split_queue.go, line 32 at r1 (raw file):

// loadBasedSplitCounter is incremented each time a load-based split occurs.
var loadBasedSplitCounter = telemetry.GetCounterOnce("kv.split.load")

What's your preference for static variables like this vs. adding the counter as a field in splitQueue like we do in kv.TxnMetrics (among other places)?

Copy link
Contributor Author

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)


pkg/storage/split_queue.go, line 32 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What's your preference for static variables like this vs. adding the counter as a field in splitQueue like we do in kv.TxnMetrics (among other places)?

Well, it seems like there's a general pattern in SQL-land to use global counters. This is a better idea when the counters are used by objects that are created and destroyed often. There's no reason to add overhead to creation and extra memory to objects. However, I assume that splitQueue isn't created often, so that isn't a concern here. So I'm fine using it as a field if that's what we prefer in core. I'll just point out that over time, we might end up getting a mishmash of the two patterns, because we'll probably add telemetry to core objects where using the field pattern isn't appropriate.

Copy link
Contributor Author

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I tried putting a panic where I call out to telemetry (in split_queue.go), and then ran all the tests in pkg/storage, and never hit the panic. So it seems that load-based splitting has 0 tests in the storage package, which isn't a great situation. I think we've got to do a better job with building unit-testable code, since using roachtests for this is like using a sledgehammer to drive a finishing nail.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)

Copy link
Contributor Author

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this code is only tested via roachtest, I manually verified that the counter is being correctly incremented.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, @knz, and @nvanbenschoten)


pkg/storage/split_queue.go, line 32 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Well, it seems like there's a general pattern in SQL-land to use global counters. This is a better idea when the counters are used by objects that are created and destroyed often. There's no reason to add overhead to creation and extra memory to objects. However, I assume that splitQueue isn't created often, so that isn't a concern here. So I'm fine using it as a field if that's what we prefer in core. I'll just point out that over time, we might end up getting a mishmash of the two patterns, because we'll probably add telemetry to core objects where using the field pattern isn't appropriate.

I changed the code to use a field for the counter rather than a global. PTAL.

Copy link
Contributor

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @danhhz, @knz, and @nvanbenschoten)


pkg/storage/split_queue.go, line 72 at r2 (raw file):

		db:             db,
		purgChan:       purgChan,
		loadBasedCount: telemetry.GetCounterOnce("kv.split.load"),

I think we want GetCounter, not GetCounterOnce.

@andy-kimball
Copy link
Contributor Author


pkg/storage/split_queue.go, line 72 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think we want GetCounter, not GetCounterOnce.

Done. I didn't appreciate the difference.

Copy link
Contributor

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, @knz, and @nvanbenschoten)


pkg/storage/split_queue.go, line 72 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Done. I didn't appreciate the difference.

Did you push?

Add "kv.split.load" telemetry counter. It is incremented each time a load-
based split occurs.

Release note: None
Copy link
Contributor Author

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, @knz, and @nvanbenschoten)


pkg/storage/split_queue.go, line 72 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you push?

Whoops, got distracted by something else and neglected to push.

Copy link
Contributor

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @knz)

@andy-kimball
Copy link
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request Aug 27, 2019
39192: storage: Add load-based split telemetry r=andy-kimball a=andy-kimball

Add "kv.split.load" telemetry counter. It is incremented each time a load-
based split occurs.

Release note: None

Co-authored-by: Andrew Kimball <[email protected]>
@craig
Copy link
Contributor

craig bot commented Aug 27, 2019

Build succeeded

@craig craig bot merged commit 92d968f into cockroachdb:master Aug 27, 2019
@andy-kimball andy-kimball deleted the telemetry branch August 28, 2019 01:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants