-
Notifications
You must be signed in to change notification settings - Fork 3.9k
storage: Add load-based split telemetry #39192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you reckon this could be tested?
Reviewed 1 of 1 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to suggestions on how to test this in reasonable, automated way. It's one of the things I was hoping to get some feedback about from this group, since I didn't see much precedent for it in the core code to follow.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I searched for an end-to-end load-based split in our unit tests but couldn't find one. That's a little surprising to me, but I guess our feeling at the time was that the smaller unit tests + the suite of splits/load
roachtests provided sufficient coverage.
In terms of testing these telemetry hooks, we seem to be pretty inconsistent. We have a number of tests in pkg/kv
that get a lot of utility out of peering in at the TxnCoordSender
's metrics (see checkTxnMetrics
). In other places we haven't found them to be as useful. I think in general we've taken the stance that we'll test metrics if doing so helps test the larger component, but we don't test metrics just to make sure that they're hooked up correctly. I personally don't have a strong feeling either way, other than that I would be concerned about a strict rule here causing us to avoid adding telemetry in hard-to-reach places in the first place.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @danhhz)
pkg/storage/split_queue.go, line 32 at r1 (raw file):
// loadBasedSplitCounter is incremented each time a load-based split occurs. var loadBasedSplitCounter = telemetry.GetCounterOnce("kv.split.load")
What's your preference for static variables like this vs. adding the counter as a field in splitQueue
like we do in kv.TxnMetrics
(among other places)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)
pkg/storage/split_queue.go, line 32 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What's your preference for static variables like this vs. adding the counter as a field in
splitQueue
like we do inkv.TxnMetrics
(among other places)?
Well, it seems like there's a general pattern in SQL-land to use global counters. This is a better idea when the counters are used by objects that are created and destroyed often. There's no reason to add overhead to creation and extra memory to objects. However, I assume that splitQueue
isn't created often, so that isn't a concern here. So I'm fine using it as a field if that's what we prefer in core. I'll just point out that over time, we might end up getting a mishmash of the two patterns, because we'll probably add telemetry to core objects where using the field pattern isn't appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I tried putting a panic where I call out to telemetry (in split_queue.go
), and then ran all the tests in pkg/storage
, and never hit the panic. So it seems that load-based splitting has 0 tests in the storage package, which isn't a great situation. I think we've got to do a better job with building unit-testable code, since using roachtests for this is like using a sledgehammer to drive a finishing nail.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @nvanbenschoten)
a510cfe
to
d880e0a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this code is only tested via roachtest, I manually verified that the counter is being correctly incremented.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, @knz, and @nvanbenschoten)
pkg/storage/split_queue.go, line 32 at r1 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Well, it seems like there's a general pattern in SQL-land to use global counters. This is a better idea when the counters are used by objects that are created and destroyed often. There's no reason to add overhead to creation and extra memory to objects. However, I assume that
splitQueue
isn't created often, so that isn't a concern here. So I'm fine using it as a field if that's what we prefer in core. I'll just point out that over time, we might end up getting a mishmash of the two patterns, because we'll probably add telemetry to core objects where using the field pattern isn't appropriate.
I changed the code to use a field for the counter rather than a global. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @danhhz, @knz, and @nvanbenschoten)
pkg/storage/split_queue.go, line 72 at r2 (raw file):
db: db, purgChan: purgChan, loadBasedCount: telemetry.GetCounterOnce("kv.split.load"),
I think we want GetCounter
, not GetCounterOnce
.
pkg/storage/split_queue.go, line 72 at r2 (raw file): Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Done. I didn't appreciate the difference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, @knz, and @nvanbenschoten)
pkg/storage/split_queue.go, line 72 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Done. I didn't appreciate the difference.
Did you push?
Add "kv.split.load" telemetry counter. It is incremented each time a load- based split occurs. Release note: None
d880e0a
to
92d968f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, @knz, and @nvanbenschoten)
pkg/storage/split_queue.go, line 72 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did you push?
Whoops, got distracted by something else and neglected to push.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @danhhz, and @knz)
bors r+ |
39192: storage: Add load-based split telemetry r=andy-kimball a=andy-kimball Add "kv.split.load" telemetry counter. It is incremented each time a load- based split occurs. Release note: None Co-authored-by: Andrew Kimball <[email protected]>
Build succeeded |
Add "kv.split.load" telemetry counter. It is incremented each time a load-
based split occurs.
Release note: None