stats,opt: performance improvements for histograms #39178
Conversation
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @rytaft)
pkg/sql/opt/memo/statistics_builder.go, line 558 at r2 (raw file):
colStat := sb.copyColStat(colSet, s, inputColStat)
if inputColStat.Histogram != nil {
	if s.Selectivity != 1 || scan.HardLimit.IsSet() {
In the HardLimit case, we're not modifying the histogram AFAICT
pkg/sql/opt/memo/statistics_builder.go, line 567 at r2 (raw file):
}
if s.Selectivity != 1 {
[nit] would be a bit more clear if we moved these two inside the block above
Do you have any benchmarks showing the improvement this PR gives?
Reviewed 19 of 19 files at r1, 5 of 6 files at r2.
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @justinj and @rytaft)
pkg/sql/opt/props/histogram.go, line 159 at r2 (raw file):
return 0, false
default:
would it ever be relevant to return 1 if lowerBound = upperBound, or is this function never called like that?
pkg/sql/stats/histogram.go, line 90 at r1 (raw file):
break
} else if c > 0 {
	panic("samples not sorted")
nit/not actually a part of this change, but should this be an AssertionFailedf?
pkg/sql/stats/histogram.proto, line 42 at r1 (raw file):
// value because it is estimated by distributing the known distinct
// count for the column among the buckets, in proportion to the number
// of rows in each bucket.
nit: consider mentioning that this value is in fact derived from the rest of the data, but is included to avoid re-computing it later?
TFTRs!
Do you have any benchmarks showing the improvement this PR gives?
The PR message has some perf numbers - is that not what you had in mind?
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @justinj and @RaduBerinde)
pkg/sql/opt/memo/statistics_builder.go, line 558 at r2 (raw file):
Previously, RaduBerinde wrote…
In the HardLimit case, we're not modifying the histogram AFAICT
We actually might be, in finalizeFromRowCount. But I realized that we already account for the limit with cardinality, so I removed the check for the hard limit below -- all will be updated if needed in finalizeFromRowCount.
pkg/sql/opt/memo/statistics_builder.go, line 567 at r2 (raw file):
Previously, RaduBerinde wrote…
[nit] would be a bit more clear if we moved these two inside the block above
How? We still want to execute this block even if the histogram is nil.
pkg/sql/opt/props/histogram.go, line 159 at r2 (raw file):
Previously, justinj (Justin Jaffray) wrote…
would it ever be relevant to return 1 if lowerBound = upperBound, or is this function never called like that?
No, because the upperBound is exclusive. So in that case we still want to return 0.
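(For illustration only: a minimal sketch of the half-open semantics discussed above, using a hypothetical countInRange helper rather than the real histogram code. Because the upper bound is exclusive, a range where lowerBound == upperBound is empty, so the count is 0.)

```go
package main

import "fmt"

// countInRange counts values in the half-open interval [lowerBound, upperBound).
// Because upperBound is exclusive, lowerBound == upperBound describes an empty
// range and the result is 0. Hypothetical helper, not the CockroachDB code.
func countInRange(values []int, lowerBound, upperBound int) int {
	count := 0
	for _, v := range values {
		if v >= lowerBound && v < upperBound {
			count++
		}
	}
	return count
}

func main() {
	vals := []int{1, 2, 2, 3, 5}
	fmt.Println(countInRange(vals, 2, 3)) // 2
	fmt.Println(countInRange(vals, 2, 2)) // 0: empty half-open interval
}
```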
pkg/sql/stats/histogram.go, line 90 at r1 (raw file):
Previously, justinj (Justin Jaffray) wrote…
nit/not actually a part of this change, but should this be an AssertionFailedf?
Done.
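(As a sketch of the suggestion above: the github.com/cockroachdb/errors package provides errors.AssertionFailedf, which reports an internal invariant violation as an error instead of panicking. The checkSorted helper below is hypothetical, not the actual histogram sampling code.)

```go
package main

import (
	"fmt"
	"sort"

	"github.com/cockroachdb/errors"
)

// checkSorted stands in for the "samples not sorted" check: rather than
// panicking, it returns an assertion-failure error built with
// errors.AssertionFailedf.
func checkSorted(samples []int) error {
	if !sort.IntsAreSorted(samples) {
		return errors.AssertionFailedf("samples not sorted")
	}
	return nil
}

func main() {
	if err := checkSorted([]int{3, 1, 2}); err != nil {
		fmt.Println(err)
	}
}
```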
pkg/sql/stats/histogram.proto, line 42 at r1 (raw file):
Previously, justinj (Justin Jaffray) wrote…
nit: consider mentioning that this value is in fact derived from the rest of the data, but is included to avoid re-computing it later?
Done.
Ah nice! I missed that -- I only looked at the commit messages in Reviewable.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @justinj and @rytaft)
pkg/sql/opt/memo/statistics_builder.go, line 558 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
We actually might be, in finalizeFromRowCount. But I realized that we already account for the limit with cardinality, so I removed the check for the hard limit below -- all will be updated if needed in finalizeFromRowCount.
I see. This seems fragile; the condition here has to match what finalizeFromRowCount is doing. I don't have a good solution though... perhaps move the condition inside a canFinalizeFromRowCountModifyHistogram helper (and put that next to finalizeFromRowCount), and assert that it returns true when we modify the histogram (inside finalizeFromRowCount).
pkg/sql/opt/memo/statistics_builder.go, line 567 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
How? We still want to execute this block even if the histogram is nil.
The block could be
colStat.Histogram = inputColStat.Histogram
if <need copy> {
	colStat.Histogram = colStat.Histogram.Copy()
	...
}
But I didn't realize finalizeFromRowCount plays a role, so this doesn't help much.
This commit moves the code to estimate the number of distinct values per histogram bucket out of the optimizer and into the stats package. The purpose of doing this is to remove the overhead of calculating the number of distinct values per bucket from the critical path of query planning, and instead perform the calculation offline when initially generating the histogram. Release note: None
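(A rough sketch of the estimation described in this commit, using simplified types and field names rather than the actual stats package: the column's known distinct count is spread across buckets in proportion to each bucket's row count, so the per-bucket estimate can be stored with the histogram instead of recomputed during planning.)

```go
package main

import "fmt"

// bucket is a simplified stand-in for a histogram bucket; the real proto
// carries more fields (upper bound, equality counts, etc.).
type bucket struct {
	NumRange      float64 // rows falling in the bucket's range
	DistinctRange float64 // estimated distinct values, filled in below
}

// distributeDistinct spreads a known column-wide distinct count across the
// buckets in proportion to each bucket's row count.
func distributeDistinct(buckets []bucket, totalDistinct float64) {
	var totalRows float64
	for i := range buckets {
		totalRows += buckets[i].NumRange
	}
	if totalRows == 0 {
		return
	}
	for i := range buckets {
		buckets[i].DistinctRange = totalDistinct * buckets[i].NumRange / totalRows
	}
}

func main() {
	b := []bucket{{NumRange: 10}, {NumRange: 30}, {NumRange: 60}}
	distributeDistinct(b, 50)
	fmt.Println(b) // distinct counts 5, 15, 30
}
```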
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @justinj and @RaduBerinde)
pkg/sql/opt/memo/statistics_builder.go, line 558 at r2 (raw file):
Previously, RaduBerinde wrote…
I see. This seems fragile; the condition here has to match what finalizeFromRowCount is doing. I don't have a good solution though... perhaps move the condition inside a canFinalizeFromRowCountModifyHistogram helper (and put that next to finalizeFromRowCount), and assert that it returns true when we modify the histogram (inside finalizeFromRowCount).
Thinking about this more, I think it would be much cleaner if histograms are immutable. Histogram.Filter already returns a new histogram without changing the original, so I changed Histogram.ApplySelectivity to do the same thing. There are cases where this may cause a histogram to get copied twice (e.g., a scan that is both constrained and limited), but in the general case I think this makes everything much cleaner and more efficient, and will save us a lot of headaches down the road. Let me know what you think.
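(A minimal sketch of the immutable approach described here, with simplified types rather than the real props.Histogram: applying a selectivity returns a scaled copy and leaves the original histogram untouched.)

```go
package main

import "fmt"

// histogram is a simplified stand-in; only the pieces needed to illustrate
// the immutability point are included.
type histogram struct {
	buckets []float64 // row count per bucket
}

// applySelectivity returns a new histogram scaled by selectivity, leaving the
// receiver unchanged -- the "return a copy instead of mutating" approach.
func (h *histogram) applySelectivity(selectivity float64) *histogram {
	scaled := make([]float64, len(h.buckets))
	for i, rows := range h.buckets {
		scaled[i] = rows * selectivity
	}
	return &histogram{buckets: scaled}
}

func main() {
	orig := &histogram{buckets: []float64{100, 200, 300}}
	filtered := orig.applySelectivity(0.1)
	fmt.Println(orig.buckets)     // unchanged: [100 200 300]
	fmt.Println(filtered.buckets) // [10 20 30]
}
```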
pkg/sql/opt/memo/statistics_builder.go, line 567 at r2 (raw file):
Previously, RaduBerinde wrote…
The block could be
colStat.Histogram = inputColStat.Histogram
if <need copy> {
	colStat.Histogram = colStat.Histogram.Copy()
	...
}
But I didn't realize finalizeFromRowCount plays a role, so this doesn't help much.
Done.
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @justinj and @rytaft)
pkg/sql/opt/memo/statistics_builder.go, line 558 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Thinking about this more, I think it would be much cleaner if histograms are immutable. Histogram.Filter already returns a new histogram without changing the original, so I changed Histogram.ApplySelectivity to do the same thing. There are cases where this may cause a histogram to get copied twice (e.g., a scan that is both constrained and limited), but in the general case I think this makes everything much cleaner and more efficient, and will save us a lot of headaches down the road. Let me know what you think.
Awesome, sounds great to me.
This commit improves the performance of histograms in the optimizer by avoiding allocating and copying the full histogram unless strictly necessary. Additionally, it changes the code for filtering histograms to use binary search instead of performing a linear scan. Release note: None
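(A sketch of the binary-search idea in this commit, with illustrative names and types rather than the actual optimizer code: since bucket upper bounds are sorted, sort.Search can locate the first bucket that overlaps a filter span instead of scanning every bucket.)

```go
package main

import (
	"fmt"
	"sort"
)

// firstBucketAtOrAbove returns the index of the first bucket whose upper bound
// is >= spanStart. upperBounds must be sorted; sort.Search performs the lookup
// in O(log n) instead of a linear scan.
func firstBucketAtOrAbove(upperBounds []int, spanStart int) int {
	return sort.Search(len(upperBounds), func(i int) bool {
		return upperBounds[i] >= spanStart
	})
}

func main() {
	upperBounds := []int{10, 20, 30, 40, 50}
	fmt.Println(firstBucketAtOrAbove(upperBounds, 25)) // 2 (bucket with upper bound 30)
	fmt.Println(firstBucketAtOrAbove(upperBounds, 60)) // 5 (past the last bucket)
}
```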
Thanks!
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @justinj)
39178: stats,opt: performance improvements for histograms r=rytaft a=rytaft

**stats,opt: move histogram bucket distinct count estimation code**

This commit moves the code to estimate the number of distinct values per histogram bucket out of the optimizer and into the stats package. The purpose of doing this is to remove the overhead of calculating the number of distinct values per bucket from the critical path of query planning, and instead perform the calculation offline when initially generating the histogram.

**opt: performance improvements for histograms**

This commit improves the performance of histograms in the optimizer by avoiding allocating and copying the full histogram unless strictly necessary. Additionally, it changes the code for filtering histograms to use binary search instead of performing a linear scan.

As a result of both of these commits, the overhead of histograms is reduced. Prior to these changes, the overhead of histograms caused 14% lower throughput and 29% higher latency when running kv with 95% reads. After, it caused only 4.6% lower throughput and 11% higher latency. (Clearly there is still work to do, but this is some progress...)

Co-authored-by: Rebecca Taft <[email protected]>
Build succeeded
stats,opt: move histogram bucket distinct count estimation code
This commit moves the code to estimate the number of distinct values
per histogram bucket out of the optimizer and into the stats package.
The purpose of doing this is to remove the overhead of calculating the
number of distinct values per bucket from the critical path of query
planning, and instead perform the calculation offline when initially
generating the histogram.
opt: performance improvements for histograms
This commit improves the performance of histograms in the optimizer
by avoiding allocating and copying the full histogram unless strictly
necessary. Additionally, it changes the code for filtering histograms
to use binary search instead of performing a linear scan.
As a result of both of these commits, the overhead of histograms is reduced. Prior to these changes, the overhead of histograms caused 14% lower throughput and 29% higher latency when running kv with 95% reads. After, it caused only 4.6% lower throughput and 11% higher latency. (Clearly there is still work to do, but this is some progress...)