movr: Add stats collection to movr workload run #41138
Conversation
cc @jseldess, no need to open an issue
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @rohany)
pkg/workload/movr/movr.go, line 553 at r1 (raw file):
err := work()
elapsed := timeutil.Since(start)
hists.Get(key).Record(elapsed)
we definitely only want to update when err is nil. we've had issues in the past with it being misleading to mix successful and failing queries in the same histogram
if you'd like to measure the errors as well, i'd make them separate buckets (either one big errors bucket or something like key + "-error", though probably the former to limit histogram explosion)
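for concreteness, a rough sketch of what i mean, reusing start/work/hists/key from the snippet above (the errors bucket name is just an example, not something that exists today):

```go
// only successful queries land in the per-query histogram
err := work()
elapsed := timeutil.Since(start)
if err == nil {
	hists.Get(key).Record(elapsed)
} else {
	// one big errors bucket to limit histogram explosion
	hists.Get(`errors`).Record(elapsed)
}
```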
pkg/workload/movr/movr.go, line 626 at r1 (raw file):
}
hists := reg.GetHandle()
the handle returned by this is not threadsafe, you need one per worker. I think this happens to work now since there appears to be one worker, but let's avoid leaving this gotcha around in case someone goes to add more workers later
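for example, something along these lines (a hypothetical sketch, not the current code; concurrency and runQueries are stand-ins):

```go
// one histogram handle per worker, since handles are not threadsafe
for i := 0; i < concurrency; i++ {
	hists := reg.GetHandle() // fresh handle scoped to this worker
	ql.WorkerFns = append(ql.WorkerFns, func(ctx context.Context) error {
		return runQueries(ctx, db, hists) // stand-in for the actual work fn
	})
}
```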
pkg/workload/movr/movr.go, line 666 at r1 (raw file):
	return err
} else if rng.Float64() < 0.1 {
	// Apply a promo code to an account.
aren't these more useful to track at the level of a logical "movr api call"? so there'd be one for "apply promo code" instead of breaking it down for each db call in it. see how tpcc works for what i'm suggesting
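i.e. something like this (sketch only; applyPromoCode is a stand-in for the multi-query logical operation, not an existing helper):

```go
// time the whole logical movr api call, not each db query inside it
start := timeutil.Now()
if err := applyPromoCode(db, rng); err != nil { // stand-in helper
	return err
}
hists.Get(`applyPromoCode`).Record(timeutil.Since(start))
```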
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)
pkg/workload/movr/movr.go, line 553 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
we definitely only want to update when err is nil. we've had issues in the past with it being misleading to mix successful and failing queries in the same histogram
if you'd like to measure the errors as well, i'd make them separate buckets (either one big errors bucket or something like key + "-error", though probably the former to limit histogram explosion)
Ok, that makes sense.
pkg/workload/movr/movr.go, line 626 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
the handle returned by this is not threadsafe, you need one per worker. I think this happens to work now since there appears to be one worker, but let's avoid leaving this gotcha around in case someone goes to add more workers later
Yeah, i saw that. There is only one worker right now, so this is OK. However, I'm not sure how to change this to avoid the gotcha. Is leaving a comment denoting that this is the case the correct thing to do?
pkg/workload/movr/movr.go, line 666 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
aren't these more useful to track at the level of a logical "movr api call"? so there'd be one for "apply promo code" instead of breaking it down for each db call in it. see how tpcc works for what i'm suggesting
I can condense some of these queries into one timed execution, but I wanted to separate the getRandom* from the other queries, because these are not part of the original movr application. They were added as a utility for me to easily generate random values, while the movr app we have kind of makes a local in-memory copy of the db and samples from it. So i didn't want to include those queries as part of the timing of a particular API call in order to avoid having times that differ a decent amount from the published movr app.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rohany)
pkg/workload/movr/movr.go, line 626 at r1 (raw file):
Previously, rohany (Rohan Yadav) wrote…
Yeah, i saw that. There is only one worker right now, so this is OK. However, I'm not sure how to change this to avoid a gotcha? Does leaving a comment denoting that this is the case the correct thing to do?
I think we need to bite the bullet on a small refactor of this code. Like I said on slack, this should all probably move to be closer to what tpcc is doing. I'm wary of making this code more complex and brittle and then calling it "low risk"
pkg/workload/movr/movr.go, line 666 at r1 (raw file):
Previously, rohany (Rohan Yadav) wrote…
I can condense some of these queries into one timed execution, but I wanted to separate the getRandom* from the other queries, because these are not part of the original movr application. They were added as a utility for me to easily generate random values, while the movr app we have kind of makes a local in-memory copy of the db and samples from it. So i didn't want to include those queries as part of the timing of a particular API call in order to avoid having times that differ a decent amount from the published movr app.
Hmm, i'm not particularly concerned about this matching the published movr app. Should I be? I'd rather optimize for this making sense to people kicking the tires on cockroachdb, which is likely to be the majority use of workload run movr
Force-pushed from 12236db to 682765c.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)
pkg/workload/movr/movr.go, line 626 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
I think we need to bite the bullet on a small refactor of this code. Like I said on slack, this should all probably move to be closer to what tpcc is doing. I'm wary of making this code more complex and brittle and then calling it "low risk"
Ok, i did the refactor. it does feel better now
pkg/workload/movr/movr.go, line 666 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
Hmm, i'm not particularly concerned about this matching the published movr app. Should I be? I'd rather optimize for this making sense to people kicking the tires on cockroachdb, which is likely to be the majority use of workload run movr
I thought about it more and I agree -- if you want to see the exact same thing as the movr app, just run the docker image yourself. Otherwise, cockroach workload run movr doesn't need to be exact.
one last plea for not leaving this as a gotcha, but feel free to merge even if you don't
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @rohany)
pkg/workload/movr/movr.go, line 758 at r2 (raw file):
}
// Hists is not threadsafe! If this workload expands to returning multiple workers,
i think doing this is trivial now. if you make a new type struct worker that has a db *gosql.DB and a hists *histograms.Histograms, you can move all the work fns and movrQuerySimulation to be methods on that struct
if you really don't feel like doing this now, then let's move this comment to be above the ql.WorkerFns = append(ql.WorkerFns, movrQuerySimulation) line, which is much more likely to be seen by someone making the change you mention than it would be up here
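for reference, a sketch of the struct-based version described above (names follow the comment; imports and method bodies elided):

```go
// sketch of the suggested refactor, not the final code
type worker struct {
	db    *gosql.DB
	hists *histograms.Histograms
}

// the work fns become methods on worker, so each worker owns its own
// non-threadsafe histograms handle
func (w *worker) movrQuerySimulation(ctx context.Context) error {
	// ... issue queries against w.db and record timings into w.hists ...
	return nil
}
```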
This PR adds stats tracking for each kind of query in the movr workload so that output is displayed from cockroach workload run. Additionally, this refactors the movr workload to define the work as functions on a worker struct. This hopefully will avoid a common gotcha of having different workers sharing the same non-threadsafe histograms object.

Release justification: low-risk, nice-to-have feature

Release note: None
Force-pushed from 682765c to 460f9e7.
Thanks for your persistence -- it feels a lot better now. Can you take another quick look?
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @rohany)
Thanks for your patience on this, I feel good that we picked off the cleanup now.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @rohany)
bors r=danhhz
40493: sql: Display inherited constraints in SHOW PARTITIONS r=andreimatei a=rohany

SHOW PARTITIONS now displays the inherited zone configuration of the partitions in a separate column. To accomplish this, the crdb_internal.zones table now holds on to the inherited constraints of each zone in a separate column. Additionally, the crdb_internal.partitions table holds on to the zone_id and subzone_id of the zone configuration the partition refers to. These ids correspond to the zone configuration at the lowest point in that partition's "inheritance chain".

Release justification: Adds a low-risk, good-to-have UX feature.

Fixes #40349.

Release note (sql change):
* SHOW PARTITIONS now displays inherited zone configurations.
* Adds the zone_id and subzone_id columns to crdb_internal.partitions, which form a link to the corresponding zone config in crdb_internal.zones that applies to the partition.
* Renames the config_yaml, config_sql, and config_proto columns in crdb_internal.zones to raw_config_yaml, raw_config_sql, and raw_config_proto.
* Adds the columns full_config_sql and full_config_yaml to the crdb_internal.zones table, which display the full/inherited zone configuration.

41138: movr: Add stats collection to movr workload run r=danhhz a=rohany

This PR adds stats tracking for each kind of query in the movr workload so that output is displayed from cockroach workload run. Additionally, this refactors the movr workload to define the work as functions on a worker struct. This hopefully will avoid a common gotcha of having different workers sharing the same non-threadsafe histograms object.

Release justification: low-risk, nice-to-have feature

Release note: None

41196: store,bulk: log when delaying AddSSTable, collect + log more timings in bulk-ingest r=dt a=dt

storage: log when AddSSTable requests are delayed

If the rate-limiting and back-pressure mechanisms kick in, they can dramatically delay requests in some cases. However, it can currently be unclear that this is happening, and the system may simply appear slow. Logging when requests are delayed by more than a second should help identify when this is the cause of slowness.

Release note: none.

Release justification: low-risk (logging only) change that could significantly help in diagnosing 'stuck' jobs based on logs (which are often all we have to go on).

bulk: track and log more timings

This tracks and logs time spent in the various stages of ingestion: sorting, splitting, and flushing. This helps when trying to diagnose why a job is 'slow' or 'stuck'.

Release note: none.

Release justification: low-risk (logging only) changes that improve our ability to diagnose problems.

Co-authored-by: Rohan Yadav <[email protected]>
Co-authored-by: Rohan Yadav <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Build succeeded