Skip to content

optimize joining for slices #50340

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 1, 2018
Merged

optimize joining for slices #50340

merged 4 commits into from
Jun 1, 2018

Conversation

Emerentius
Copy link
Contributor

@Emerentius Emerentius commented Apr 30, 2018

This improves the speed of string joining up to 3x.
It removes the boolean flag check every iteration, eliminates repeated bounds checks and adds a fast paths for small separators up to a len of 4 bytes
These optimizations gave me ~10%, ~50% and ~80% improvements respectively over the previous speed. Those are multiplicative.

3x improvement happens for the optimal case of joining many small strings together in my microbenchmarks. Improvements flatten out for larger strings of course as more time is spent copying bits around. I've run a few benchmarks with this code. They are pretty noise despite high iteration counts, but in total one can see the trends.

len_separator  len_string   n_strings     speedup
           4          10          10        2.38
           4          10         100        3.41
           4          10        1000        3.43
           4          10       10000        3.25
           4         100          10        2.23
           4         100         100        2.73
           4         100        1000        1.33
           4         100       10000        1.14
           4        1000          10        1.33
           4        1000         100        1.15
           4        1000        1000        1.08
           4        1000       10000        1.04
          10          10          10        1.61
          10          10         100        1.74
          10          10        1000        1.77
          10          10       10000        1.75
          10         100          10        1.58
          10         100         100        1.65
          10         100        1000        1.24
          10         100       10000        1.12
          10        1000          10        1.23
          10        1000         100        1.11
          10        1000        1000        1.05
          10        1000       10000       0.997
         100          10          10        1.66
         100          10         100        1.78
         100          10        1000        1.28
         100          10       10000        1.16
         100         100          10        1.37
         100         100         100        1.26
         100         100        1000        1.09
         100         100       10000         1.0
         100        1000          10        1.19
         100        1000         100        1.12
         100        1000        1000        1.05
         100        1000       10000        1.12

The string joining with small or empty separators is now ~50% faster than the old concatenation (small strings). The same approach can also improve the performance of joining into vectors.

If this approach is acceptable, I can apply it for concatenation and for vectors as well. Alternatively, concat could just call .join("").

@rust-highfive
Copy link
Contributor

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 30, 2018
@shepmaster shepmaster added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 6, 2018
@shepmaster
Copy link
Member

This PR is marked as WIP — @Emerentius what remains to be done?

@Emerentius Emerentius force-pushed the master branch 3 times, most recently from 0080466 to af97dc1 Compare May 7, 2018 16:32
@Emerentius
Copy link
Contributor Author

Emerentius commented May 7, 2018

@shepmaster
I was mainly hoping for a comment on whether the loop specialization by separator length and the accompanying increase in code size was acceptable here, but it seems like the optimizer can get rid of the unused cases for constant size separators which is probably the most common case.
The unsafe code seems straightforward but I'd like a quick sanity check on it.

I've also extended the PR to (slightly) optimize Vec<T> and I've added a specialization for Vec<T: Copy> that utilizes the same changes I've made for String.
It should be possible to apply the same optimization for Vec<T: Clone> but I don't know how. My attempts have given me astounding pessimizations.

If the tests pass, I'd say this is good to go.

@rust-highfive

This comment has been minimized.

@Emerentius
Copy link
Contributor Author

[00:18:29] error[E0284]: type annotations required: cannot resolve `<[&[rustc::ty::Predicate<'_>]] as std::slice::SliceConcatExt<_>>::Output == std::vec::Vec<rustc::ty::Predicate<'_>>`
[00:18:29]     --> librustc_typeck/collect.rs:1303:75
[00:18:29]      |
[00:18:29] 1303 |         [&explicit.predicates[..], &tcx.inferred_outlives_of(def_id)[..]].concat()
[00:18:29] 
[00:18:30] error: aborting due to previous error

A specialization bug? I didn't change the set of types the trait is implemented for.

@rust-highfive

This comment has been minimized.

@Emerentius
Copy link
Contributor Author

Ok, I've removed the specialization for SliceConcatExt<T: Copy> for now. String joining will just be a little faster than Vec joining for the moment.

It would be nice if this optimization could be somehow implemented as a specialization of Extend, so all Vecs and Strings could benefit from it. A problem there is that it needs to clone the iter so it can iterate twice. Once to find the total len to extend by and another time to copy data. It could have unexpected side-effects. A specialization of just FlatMap<slice::Iter<T: Copy>> on the other hand is pretty limiting.

@pietroalbini pietroalbini added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels May 14, 2018
@Emerentius Emerentius changed the title [wip] optimize joining for slices optimize joining for slices May 16, 2018
@pietroalbini
Copy link
Member

Ping from triage @alexcrichton! This PR needs your review.

@alexcrichton
Copy link
Member

Ah sorry about the delay here, but thanks for the PR @Emerentius!

This looks somewhat scary though in the sense that it's replacing some already somewhat tricky code with a lot of unsafe code and not a lot of tests. Would it be possible to reduce some of the unsafe code? (is it all required?) Additionally would it be possible to bolster up the test suite here to hopefully head off any possible lurking issues? Maybe copy the implementation locally as well and try fuzzing it?

@Emerentius
Copy link
Contributor Author

Emerentius commented May 23, 2018

It's not so much a lot of unsafe code but a little unsafe code repeated 6x. I've compressed it by unifying the different cases and adding a helper macro. It seems like the speed didn't suffer or at least not by much.

That is actually somewhat mystifying because now the fast paths are the same as the general path. Just the fact that I'm matching on a specific separator length before running the exact same code lets it run faster. It seems like an optimizer bug. Actually, this is not that strange if the function is not inlined.

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Would it be possible to gist a safe version of the code? It sounds and looks like the unsafety here is derived from the desire to avoid bounds checks, right? Often from taking a look at the IR it's possible to poke around, tweak iterators, or something like that which allows us to convince LLVM it should eliminate all the bounds checks away.

It looks naively correct to me but I've often found that if LLVM can't prove that bounds checks can be eliminated then our own logic isn't always right

} else {
result.push(sep.clone())
let mut iter = self.iter();
iter.next().map_or(vec![], |first| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this and the pattern below perhaps use a match with an early return to avoid the extra indentation below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

let sep_len = sep.len();
let mut iter = slice.iter();
iter.next().map_or(vec![], |first| {
// this is wrong without the guarantee that `slice` is non-empty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clause is specifically referring to slice.len() - 1, right? Could it perhaps instead use iter.len() because that's already offset by one at this point? (and we know it's exact)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. This aligned better with my reasoning but I guess avoiding the - 1 is one less place to check mentally for overflow.

let len = $bytes.len();
$target.get_unchecked_mut(..len)
.copy_from_slice($bytes);
$target = {$target}.get_unchecked_mut(len..);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this two get_unchecked_mut be replaced with split_at perhaps to avoid calling this twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems so. I moved away from split_at fairly early in my optimizations when I still used the regular bench suite to gauge speed. I can't see any meaningful improvement now so I might as well change that back.

The benchmarks are incredibly variable and regularly show spurious variations on the order of ~5% and even up to 20% (!). The variations are strangely somewhat persisting across different benchmarks in the same run in that it's almost like a constant factor for all benches. I know they are spurious because they affect the old join just as much as the new.
The variability is what made me switch to the Criterion bench framework but that hasn't helped at all.

@Emerentius
Copy link
Contributor Author

Emerentius commented May 25, 2018

Here is a safe version of the code: https://gist.github.com/Emerentius/6bbc1302367111a76b2b0841a5194a2c
The bounds checks are a major part of the speedup, yes. This safe version is a lot slower.

I've done some fuzzing with cargo fuzz for a few hours with total input sizes up to 4K and 4M respectively and encountered no cases where the resulting String was different from the old join. Note however, that I'm on a laptop with 15W CPU and had it run at 25% of one core.

I'm not sure if I can trust Vec::with_capacity to always give me a capacity >= n so I've added an assert for that.

@alexcrichton
Copy link
Member

@Emerentius to confirm, the current version is the optimized version you'd like to land? It looks like there's only one reason for the unsafe in join_generic_copy which is a one-time check outside of the inner loop. If that's so is it still needed?

(note that split_at_mut is doing bounds checks, so I thought that was gonna hit perf too much to actually land)

@Emerentius
Copy link
Contributor Author

Emerentius commented May 30, 2018

@alexcrichton Yes, I'd like to land the current version.
The unsafe is needed to get a slice into the buffer past the current len of the Vec and then later to set the Vec's len. It still avoids the repeated .reserve() calls, capacity checks and len updates that are inside extend_from_slice.

I've rerun the benches on a desktop PC which has much less benchmark variability than my laptop and confirmed there is no meaningful difference between the get_unchecked_mut and the split_at_mut version. The bounds checks aren't elided with split_at_mut but they don't matter except for the smallest of joins where it's 10% slower. That difference dissipates quickly.

Here are some plots for the speedup as a function of various variables and its difference between checked and unchecked indexing. Don't read too much into differences of specific lines. There are still some spurious variations.
Data is from a desktop, which is why the maximum speedup is now less than the 3x shown in the OP.
unchecked_checked_comparison

@alexcrichton
Copy link
Member

@bors: r+

Ok this is looking great, thanks so much again for the analysis done here as well as the PR! The usage of unsafe here definitely makes sense to me and I think this is also a common enough operation that speeding it up is likely worth it.

Thanks again!

@bors
Copy link
Collaborator

bors commented May 30, 2018

📌 Commit 6d0c5b8 has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 30, 2018
@bors
Copy link
Collaborator

bors commented May 30, 2018

🔒 Merge conflict

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels May 30, 2018
@Emerentius
Copy link
Contributor Author

@alexcrichton I've rebased the code, ready for another try.

@alexcrichton
Copy link
Member

@bors: r+

@bors
Copy link
Collaborator

bors commented Jun 1, 2018

📌 Commit bd2e23e has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jun 1, 2018
@bors
Copy link
Collaborator

bors commented Jun 1, 2018

🔒 Merge conflict

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jun 1, 2018
for both Vec<T> and String
- eliminates the boolean first flag in fn join()

for String only
- eliminates repeated bounds checks in join(), concat()
- adds fast paths for small string separators up to a len of 4 bytes
old tests cover the new fast path of str joining already
this adds tests for joining into Strings with long separators (>4 byte) and
for joining into Vec<T>, T: Clone + !Copy. Vec<T: Copy> will be
specialised when specialisation type inference bugs are fixed.
further reduce unsafe fn calls
reduce right drift
assert! sufficient capacity
@Emerentius
Copy link
Contributor Author

Emerentius commented Jun 1, 2018

Ok, apparently I've rebased on an outdated copy because git fetch doesn't automatically fetch from all remotes. Rebased again.

@alexcrichton
Copy link
Member

@bors: r+

@bors
Copy link
Collaborator

bors commented Jun 1, 2018

📌 Commit 12bd288 has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jun 1, 2018
@bors
Copy link
Collaborator

bors commented Jun 1, 2018

⌛ Testing commit 12bd288 with merge 747e655...

bors added a commit that referenced this pull request Jun 1, 2018
optimize joining for slices

This improves the speed of string joining up to 3x.
It removes the boolean flag check every iteration, eliminates repeated bounds checks and adds a fast paths for small separators up to a len of 4 bytes
These optimizations gave me ~10%, ~50% and ~80% improvements respectively over the previous speed. Those are multiplicative.

3x improvement happens for the optimal case of joining many small strings together in my microbenchmarks. Improvements flatten out for larger strings of course as more time is spent copying bits around. I've run a few benchmarks [with this code](https://github.com/Emerentius/join_bench). They are pretty noise despite high iteration counts, but in total one can see the trends.

```
len_separator  len_string   n_strings     speedup
           4          10          10        2.38
           4          10         100        3.41
           4          10        1000        3.43
           4          10       10000        3.25
           4         100          10        2.23
           4         100         100        2.73
           4         100        1000        1.33
           4         100       10000        1.14
           4        1000          10        1.33
           4        1000         100        1.15
           4        1000        1000        1.08
           4        1000       10000        1.04
          10          10          10        1.61
          10          10         100        1.74
          10          10        1000        1.77
          10          10       10000        1.75
          10         100          10        1.58
          10         100         100        1.65
          10         100        1000        1.24
          10         100       10000        1.12
          10        1000          10        1.23
          10        1000         100        1.11
          10        1000        1000        1.05
          10        1000       10000       0.997
         100          10          10        1.66
         100          10         100        1.78
         100          10        1000        1.28
         100          10       10000        1.16
         100         100          10        1.37
         100         100         100        1.26
         100         100        1000        1.09
         100         100       10000         1.0
         100        1000          10        1.19
         100        1000         100        1.12
         100        1000        1000        1.05
         100        1000       10000        1.12
```

The string joining with small or empty separators is now ~50% faster than the old concatenation (small strings). The same approach can also improve the performance of joining into vectors.

If this approach is acceptable, I can apply it for concatenation and for vectors as well. Alternatively, concat could just call `.join("")`.
@bors
Copy link
Collaborator

bors commented Jun 1, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing 747e655 to master...

@bors bors merged commit 12bd288 into rust-lang:master Jun 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants