optimize joining for slices #50340
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
This PR is marked as WIP — @Emerentius what remains to be done?
Force-pushed from 0080466 to af97dc1 (compare)
@shepmaster I've also extended the PR with a (slight) additional optimization. If the tests pass, I'd say this is good to go.
A specialization bug? I didn't change the set of types the trait is implemented for.
Ok, I've removed the specialization. It would be nice if this optimization could somehow be implemented as a specialization, though.
Ping from triage @alexcrichton! This PR needs your review.
Ah, sorry about the delay here, but thanks for the PR @Emerentius! This looks somewhat scary, though, in the sense that it's replacing some already somewhat tricky code with a lot of unsafe code and not a lot of tests. Would it be possible to reduce some of the unsafe code? (Is it all required?) Additionally, would it be possible to bolster up the test suite here to hopefully head off any lurking issues? Maybe also copy the implementation locally and try fuzzing it?
It's not so much a lot of unsafe code as a little unsafe code repeated 6x. I've compressed it by unifying the different cases and adding a helper macro. It seems like the speed didn't suffer, or at least not by much.
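The unification described above could look roughly like this. This is a sketch with illustrative names (`copy_and_advance`, `join_bytes` are not the PR's actual identifiers), and it stays in safe code by moving the output slice out with `mem::take` and splitting it with `split_at_mut` instead of unchecked indexing:

```rust
// Copy `$bytes` into the front of the `&mut [u8]` in `$target`,
// then leave only the remaining tail behind in `$target`.
macro_rules! copy_and_advance {
    ($target:ident, $bytes:expr) => {
        let len = $bytes.len();
        let (head, tail) = std::mem::take(&mut $target).split_at_mut(len);
        head.copy_from_slice($bytes);
        $target = tail;
    };
}

fn join_bytes(slices: &[&[u8]], sep: &[u8]) -> Vec<u8> {
    // exact output length: all pieces plus (n - 1) separators
    let total = slices.iter().map(|s| s.len()).sum::<usize>()
        + sep.len() * slices.len().saturating_sub(1);
    let mut result = vec![0u8; total];
    let mut target = &mut result[..];
    let mut iter = slices.iter();
    if let Some(first) = iter.next() {
        // one helper covers first element, separators and later elements
        copy_and_advance!(target, first);
        for s in iter {
            copy_and_advance!(target, sep);
            copy_and_advance!(target, s);
        }
    }
    result
}
```

The `mem::take` is what lets `$target` be reassigned to the tail: it moves the slice out of the binding so the split halves keep the original lifetime.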
Thanks!
Would it be possible to gist a safe version of the code? It sounds and looks like the unsafety here derives from the desire to avoid bounds checks, right? Often, from taking a look at the IR, it's possible to poke around, tweak iterators, or do something similar that convinces LLVM it can eliminate all the bounds checks.
It looks naively correct to me, but I've often found that if LLVM can't prove that bounds checks can be eliminated, then our own logic isn't always right.
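A safe formulation of the kind being asked for might look like the following sketch (not the PR's code): peeling the first element off the iterator removes the per-iteration "is this the first element?" check, and `push_str` keeps all indexing in safe code, giving LLVM a fair chance to elide checks given the exact `with_capacity`:

```rust
fn join_str(strings: &[&str], sep: &str) -> String {
    let mut iter = strings.iter();
    let first = match iter.next() {
        Some(first) => first,
        None => return String::new(),
    };
    // iter.len() is already strings.len() - 1 here, so no underflow
    let cap = strings.iter().map(|s| s.len()).sum::<usize>() + sep.len() * iter.len();
    let mut result = String::with_capacity(cap);
    result.push_str(first);
    for s in iter {
        result.push_str(sep);
        result.push_str(s);
    }
    result
}
```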
src/liballoc/slice.rs (outdated)
```rust
} else {
    result.push(sep.clone())
let mut iter = self.iter();
iter.next().map_or(vec![], |first| {
```
Could this and the pattern below perhaps use a `match` with an early return to avoid the extra indentation below?
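In general form, the suggestion is to replace a `map_or` whose closure wraps the whole function body with a `match` plus an early return, which keeps the main path at the outer indentation level (the function and values here are made up for illustration):

```rust
fn sum_tail(values: &[u32]) -> u32 {
    let mut iter = values.iter();
    let first = match iter.next() {
        Some(first) => *first,
        // early return instead of nesting everything in a closure
        None => return 0,
    };
    // main path continues un-indented from here
    first + iter.sum::<u32>()
}
```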
sure
src/liballoc/str.rs (outdated)
```rust
let sep_len = sep.len();
let mut iter = slice.iter();
iter.next().map_or(vec![], |first| {
    // this is wrong without the guarantee that `slice` is non-empty
```
This clause is specifically referring to `slice.len() - 1`, right? Could it perhaps instead use `iter.len()`, because that's already offset by one at this point? (And we know it's exact.)
Yeah. This aligned better with my reasoning, but I guess avoiding the `- 1` is one less place to check mentally for overflow.
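The point about `iter.len()` can be seen in isolation: `slice::Iter` is an `ExactSizeIterator`, so after one `next()` its `len()` already equals `slice.len() - 1`, with no subtraction that could underflow on an empty slice (a small illustrative function, not from the PR):

```rust
fn separators_needed<T>(slice: &[T]) -> usize {
    let mut iter = slice.iter();
    match iter.next() {
        // iter.len() is exact and already offset by the consumed element
        Some(_) => iter.len(),
        None => 0,
    }
}
```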
src/liballoc/str.rs
Outdated
let len = $bytes.len(); | ||
$target.get_unchecked_mut(..len) | ||
.copy_from_slice($bytes); | ||
$target = {$target}.get_unchecked_mut(len..); |
Can these two `get_unchecked_mut` calls perhaps be replaced with `split_at_mut` to avoid slicing twice?
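What the suggestion amounts to, sketched as a safe function (the name `copy_and_advance` is illustrative): the two unchecked slicings collapse into a single split whose one bounds check covers both halves.

```rust
// replaces:
//     $target.get_unchecked_mut(..len).copy_from_slice($bytes);
//     $target = {$target}.get_unchecked_mut(len..);
fn copy_and_advance<'a>(target: &mut &'a mut [u8], bytes: &[u8]) {
    // take the slice out of the binding so the split halves
    // keep the full lifetime 'a, then store the tail back
    let (head, tail) = std::mem::take(target).split_at_mut(bytes.len());
    head.copy_from_slice(bytes);
    *target = tail;
}
```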
It seems so. I moved away from `split_at` fairly early in my optimizations, when I was still using the regular bench suite to gauge speed. I can't see any meaningful improvement now, so I might as well change that back.
The benchmarks are incredibly variable and regularly show spurious variations on the order of ~5%, and even up to 20% (!). Strangely, the variations somewhat persist across different benchmarks in the same run, almost like a constant factor applied to all benches. I know they are spurious because they affect the old join just as much as the new one.
The variability is what made me switch to the Criterion bench framework, but that hasn't helped at all.
Here is a safe version of the code: https://gist.github.com/Emerentius/6bbc1302367111a76b2b0841a5194a2c I've done some fuzzing as well, though I'm not sure how much I can trust it.
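The kind of fuzzing in question can be done as a differential check: compare a reimplementation (here, a deliberately naive one, purely illustrative) against std's own `join` on a batch of inputs, including empty slices, empty separators and multi-byte UTF-8:

```rust
fn naive_join(strings: &[&str], sep: &str) -> String {
    let mut out = String::new();
    for (i, s) in strings.iter().enumerate() {
        if i > 0 {
            out.push_str(sep);
        }
        out.push_str(s);
    }
    out
}

fn differential_check() {
    let pieces = ["", "a", "αβ", "longer piece", "🦀"];
    let seps = ["", ",", ", ", "::", "🦀🦀"];
    for &sep in &seps {
        // check every prefix length, from empty slice upward
        for n in 0..=pieces.len() {
            let input = &pieces[..n];
            assert_eq!(naive_join(input, sep), input.join(sep));
        }
    }
}
```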
@Emerentius to confirm, the current version is the optimized version you'd like to land?
@alexcrichton Yes, I'd like to land the current version. I've rerun the benches on a desktop PC, which has much less benchmark variability than my laptop, and confirmed there is no meaningful difference between the checked and unchecked versions. Here are some plots of the speedup as a function of various variables, and of its difference between checked and unchecked indexing. Don't read too much into differences between specific lines; there are still some spurious variations.
@bors: r+ Ok, this is looking great. Thanks so much again for the analysis done here, as well as the PR! The usage of unsafe here definitely makes sense to me, and I think this is also a common enough operation that speeding it up is likely worth it. Thanks again!
📌 Commit 6d0c5b8 has been approved
🔒 Merge conflict
@alexcrichton I've rebased the code, ready for another try.
@bors: r+
📌 Commit bd2e23e has been approved
🔒 Merge conflict
for both Vec<T> and String
- eliminates the boolean `first` flag in `fn join()`

for String only
- eliminates repeated bounds checks in `join()`, `concat()`
- adds fast paths for small string separators up to a len of 4 bytes

Old tests already cover the new fast path of str joining; this adds tests for joining into Strings with long separators (>4 bytes) and for joining into Vec<T>, T: Clone + !Copy. Vec<T: Copy> will be specialised when specialisation type inference bugs are fixed.

further reduce unsafe fn calls
reduce right drift
assert! sufficient capacity
Ok, apparently I've rebased on an outdated copy.
@bors: r+
📌 Commit 12bd288 has been approved
optimize joining for slices

This improves the speed of string joining by up to 3x. It removes the boolean flag check every iteration, eliminates repeated bounds checks, and adds fast paths for small separators up to a len of 4 bytes. These optimizations gave me ~10%, ~50% and ~80% improvements respectively over the previous speed. Those are multiplicative.

The 3x improvement happens in the optimal case of joining many small strings together in my microbenchmarks. Improvements flatten out for larger strings, of course, as more time is spent copying bits around. I've run a few benchmarks [with this code](https://github.com/Emerentius/join_bench). They are pretty noisy despite high iteration counts, but in total one can see the trends.

```
len_separator  len_string  n_strings  speedup
            4          10         10     2.38
            4          10        100     3.41
            4          10       1000     3.43
            4          10      10000     3.25
            4         100         10     2.23
            4         100        100     2.73
            4         100       1000     1.33
            4         100      10000     1.14
            4        1000         10     1.33
            4        1000        100     1.15
            4        1000       1000     1.08
            4        1000      10000     1.04
           10          10         10     1.61
           10          10        100     1.74
           10          10       1000     1.77
           10          10      10000     1.75
           10         100         10     1.58
           10         100        100     1.65
           10         100       1000     1.24
           10         100      10000     1.12
           10        1000         10     1.23
           10        1000        100     1.11
           10        1000       1000     1.05
           10        1000      10000     0.997
          100          10         10     1.66
          100          10        100     1.78
          100          10       1000     1.28
          100          10      10000     1.16
          100         100         10     1.37
          100         100        100     1.26
          100         100       1000     1.09
          100         100      10000     1.0
          100        1000         10     1.19
          100        1000        100     1.12
          100        1000       1000     1.05
          100        1000      10000     1.12
```

The string joining with small or empty separators is now ~50% faster than the old concatenation (small strings). The same approach can also improve the performance of joining into vectors. If this approach is acceptable, I can apply it for concatenation and for vectors as well. Alternatively, concat could just call `.join("")`.
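The "fast paths for small separators" idea can be sketched as follows. This is an assumed shape, not the PR's actual code: dispatching on the separator length turns the copy into a fixed-size write that the compiler can lower to one or two stores instead of a generic memcpy call. (The real change specializes the whole join loop per length rather than branching on every push, but the dispatch principle is the same.)

```rust
fn push_sep(out: &mut Vec<u8>, sep: &[u8]) {
    match sep {
        &[] => {}
        &[a] => out.push(a),
        &[a, b] => out.extend_from_slice(&[a, b]),
        &[a, b, c] => out.extend_from_slice(&[a, b, c]),
        &[a, b, c, d] => out.extend_from_slice(&[a, b, c, d]),
        // separators longer than 4 bytes take the generic path
        _ => out.extend_from_slice(sep),
    }
}
```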
☀️ Test successful - status-appveyor, status-travis