movr: Add stats collection to movr workload run #41138

Merged: 1 commit, Sep 30, 2019

Conversation

@rohany rohany commented Sep 26, 2019

This PR adds stats tracking for each kind of query in the movr workload
so that output is displayed from cockroach workload run. Additionally,
this refactors the movr workload to define the work as functions on a
worker struct. This should avoid a common gotcha of having different
workers share the same non-thread-safe histograms object.

Release justification: low-risk, nice-to-have feature

Release note: None

@rohany rohany requested a review from danhhz September 26, 2019 20:10
@cockroach-teamcity

This change is Reviewable

rohany commented Sep 26, 2019

cc @jseldess, no need to open an issue

@danhhz danhhz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @rohany)


pkg/workload/movr/movr.go, line 553 at r1 (raw file):

		err := work()
		elapsed := timeutil.Since(start)
		hists.Get(key).Record(elapsed)

we definitely only want to update when err is nil. we've had issues in the past with it being misleading to mix successful and failing queries in the same histogram

if you'd like to measure the errors as well, i'd make them separate buckets (either one big errors bucket or something like key + "-error", though probably the former to limit histogram explosion)
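A minimal Go sketch of the split suggested here, assuming the timeutil helpers and the hists.Get(...).Record(...) API shown in the diff above; the "errors" bucket name is illustrative, not necessarily what the final code uses:

	start := timeutil.Now()
	err := work()
	elapsed := timeutil.Since(start)
	if err == nil {
		// Only successful queries land in the per-query histogram.
		hists.Get(key).Record(elapsed)
	} else {
		// One shared bucket for all failures limits histogram explosion.
		hists.Get("errors").Record(elapsed)
	}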


pkg/workload/movr/movr.go, line 626 at r1 (raw file):

	}

	hists := reg.GetHandle()

the handle returned by this is not threadsafe, you need one per worker. I think this happens to work now since there appears to be one worker, but let's avoid leaving this gotcha around in case someone goes to add more workers later
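A hedged sketch of the one-handle-per-worker pattern, assuming the reg.GetHandle() API above; the worker struct, the concurrency variable, and the run method are illustrative stand-ins:

	for i := 0; i < concurrency; i++ {
		// Each worker owns its own handle; handles are not threadsafe.
		w := &worker{db: db, hists: reg.GetHandle()}
		ql.WorkerFns = append(ql.WorkerFns, w.run)
	}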


pkg/workload/movr/movr.go, line 666 at r1 (raw file):

			return err
		} else if rng.Float64() < 0.1 {
			// Apply a promo code to an account.

aren't these more useful to track at the level of a logical "movr api call"? so there'd be one for "apply promo code" instead of breaking it down for each db call in it. see how tpcc works for what i'm suggesting
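Illustrative only: timing at the level of a logical movr API call might look like the sketch below, where applyPromoCode is a hypothetical helper that may issue several db calls internally:

	start := timeutil.Now()
	if err := w.applyPromoCode(ctx); err != nil {
		return err
	}
	// One histogram entry per logical operation, not per db call.
	w.hists.Get("applyPromoCode").Record(timeutil.Since(start))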

@rohany rohany left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)


pkg/workload/movr/movr.go, line 553 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

we definitely only want to update when err is nil. we've had issues in the past with it being misleading to mix successful and failing queries in the same histogram

if you'd like to measure the errors as well, i'd make them separate buckets (either one big errors bucket or something like key + "-error", though probably the former to limit histogram explosion)

Ok, that makes sense.


pkg/workload/movr/movr.go, line 626 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

the handle returned by this is not threadsafe, you need one per worker. I think this happens to work now since there appears to be one worker, but let's avoid leaving this gotcha around in case someone goes to add more workers later

Yeah, i saw that. There is only one worker right now, so this is OK. However, I'm not sure how to change this to avoid the gotcha. Is leaving a comment denoting that this is the case the correct thing to do?


pkg/workload/movr/movr.go, line 666 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

aren't these more useful to track at the level of a logical "movr api call"? so there'd be one for "apply promo code" instead of breaking it down for each db call in it. see how tpcc works for what i'm suggesting

I can condense some of these queries into one timed execution, but I wanted to separate the getRandom* queries from the others because they are not part of the original movr application. They were added as a utility for me to easily generate random values, whereas the movr app we have makes a sort of local in-memory copy of the db and samples from it. So i didn't want to include those queries in the timing of a particular API call, to avoid producing times that differ a decent amount from the published movr app.

@danhhz danhhz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rohany)


pkg/workload/movr/movr.go, line 626 at r1 (raw file):

Previously, rohany (Rohan Yadav) wrote…

Yeah, i saw that. There is only one worker right now, so this is OK. However, I'm not sure how to change this to avoid the gotcha. Is leaving a comment denoting that this is the case the correct thing to do?

I think we need to bite the bullet on a small refactor of this code. Like I said on slack, this should all probably move to be closer to what tpcc is doing. I'm wary of making this code more complex and brittle and then calling it "low risk"


pkg/workload/movr/movr.go, line 666 at r1 (raw file):

Previously, rohany (Rohan Yadav) wrote…

I can condense some of these queries into one timed execution, but I wanted to separate the getRandom* queries from the others because they are not part of the original movr application. They were added as a utility for me to easily generate random values, whereas the movr app we have makes a sort of local in-memory copy of the db and samples from it. So i didn't want to include those queries in the timing of a particular API call, to avoid producing times that differ a decent amount from the published movr app.

Hmm, i'm not particularly concerned about this matching the published movr app. Should I be? I'd rather optimize for this making sense to people kicking the tires on cockroachdb, which is likely to be the majority use of workload run movr

@rohany rohany force-pushed the movr-workload-stats branch from 12236db to 682765c on September 26, 2019 22:15
@rohany rohany left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz)


pkg/workload/movr/movr.go, line 626 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

I think we need to bite the bullet on a small refactor of this code. Like I said on slack, this should all probably move to be closer to what tpcc is doing. I'm wary of making this code more complex and brittle and then calling it "low risk"

Ok, i did the refactor. It does feel better now.


pkg/workload/movr/movr.go, line 666 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Hmm, i'm not particularly concerned about this matching the published movr app. Should I be? I'd rather optimize for this making sense to people kicking the tires on cockroachdb, which is likely to be the majority use of workload run movr

I thought about it more and I agree -- if you want to see the exact same thing as the movr app, just run the docker image yourself. Otherwise, cockroach workload run movr doesn't need to be exact.

@danhhz danhhz left a comment

:lgtm: one last plea for not leaving this as a gotcha, but feel free to merge even if you don't

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @rohany)


pkg/workload/movr/movr.go, line 758 at r2 (raw file):

	}

	// Hists is not threadsafe! If this workload expands to returning multiple workers,

i think doing this is trivial now. if you make a new worker struct type that has a db *gosql.DB and a hists *histograms.Histograms, you can move all the work fns and movrQuerySimulation to be methods on that struct

if you really don't feel like doing this now, then let's move this comment to be above the ql.WorkerFns = append(ql.WorkerFns, movrQuerySimulation) line, which is much more likely to be seen by someone making the change you mention than it would be up here
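A minimal sketch of the worker-struct refactor described here, using the field types named above; the method body and context plumbing are illustrative:

	type worker struct {
		db    *gosql.DB
		hists *histograms.Histograms
	}

	// The work fns become methods, so each worker only ever touches
	// its own histograms handle.
	func (w *worker) movrQuerySimulation(ctx context.Context) error {
		// ... issue movr queries and record timings via w.hists ...
		return nil
	}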

@rohany rohany force-pushed the movr-workload-stats branch from 682765c to 460f9e7 on September 30, 2019 14:36

@rohany rohany left a comment

Thanks for your persistence -- it feels a lot better now. Can you take another quick look?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @rohany)

@danhhz danhhz left a comment

:lgtm_strong: \o/

Thanks for your patience on this. I feel good that we picked the cleanup off now.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @rohany)

rohany commented Sep 30, 2019

bors r=danhhz

craig bot pushed a commit that referenced this pull request Sep 30, 2019
40493: sql: Display inherited constraints in SHOW PARTITIONS  r=andreimatei a=rohany

SHOW PARTITIONS now displays the inherited zone configuration of the
partitions in a separate column. To accomplish this, the
crdb_internal.zones table now holds on to the inherited constraints of
each zone in a separate column. Additionally, the
crdb_internal.partitions table holds on to the zone_id and subzone_id of
the zone configuration the partition refers to. These IDs correspond to
the zone configuration at the lowest point in that partition's
"inheritance chain".

Release justification: Adds a low-risk, good-to-have UX feature.

Fixes #40349.

Release note (sql change):
* SHOW PARTITIONS now displays inherited zone configurations.
* Adds the zone_id and subzone_id columns to crdb_internal.partitions,
which link to the corresponding zone config in crdb_internal.zones
that applies to the partitions.
* Renames the config_yaml, config_sql, and config_proto columns in
crdb_internal.zones to raw_config_yaml, raw_config_sql, and
raw_config_proto.
* Adds the columns full_config_sql and full_config_yaml to the
crdb_internal.zones table, which display the full/inherited zone
configuration.

41138: movr: Add stats collection to movr workload run r=danhhz a=rohany

This PR adds stats tracking for each kind of query in the movr workload
so that output is displayed from cockroach workload run. Additionally,
this refactors the movr workload to define the work as functions on a
worker struct. This should avoid a common gotcha of having different
workers share the same non-thread-safe histograms object.

Release justification: low-risk, nice-to-have feature

Release note: None

41196: store,bulk: log when delaying AddSSTable, collect + log more timings in bulk-ingest r=dt a=dt

storage: log when AddSSTable requests are delayed

If the rate-limiting and back-pressure mechanisms kick in, they can dramatically delay requests in some cases.
However, it can currently be unclear that this is happening, and the system may simply appear slow.
Logging when requests are delayed by more than a second should help identify when this is the cause of slowness.

Release note: none.

Release justification: low-risk (logging only) change that could significantly help in diagnosing 'stuck' jobs based on logs (which are often all we have to go on).

bulk: track and log more timings

This tracks and logs time spent in the various stages of ingestion: sorting, splitting, and flushing.
This helps when trying to diagnose why a job is 'slow' or 'stuck'.

Release note: none.

Release justification: low-risk (logging only) changes that improve ability to diagnose problems.


Co-authored-by: Rohan Yadav <[email protected]>
Co-authored-by: Rohan Yadav <[email protected]>
Co-authored-by: David Taylor <[email protected]>

craig bot commented Sep 30, 2019

Build succeeded
