optimize joining for slices #50340
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
This PR is marked as WIP — @Emerentius what remains to be done?
Force-pushed from 0080466 to af97dc1 (compare)
@shepmaster I've also extended the PR with a (slight) additional optimization. If the tests pass, I'd say this is good to go.
A specialization bug? I didn't change the set of types the trait is implemented for.
Ok, I've removed the specialization. It would be nice if this optimization could somehow be implemented as a specialization, though.
Ping from triage @alexcrichton! This PR needs your review.
Ah, sorry about the delay here, but thanks for the PR @Emerentius! This looks somewhat scary, though, in the sense that it's replacing some already somewhat tricky code with a lot of unsafe code and not a lot of tests. Would it be possible to reduce some of the unsafe code? (Is it all required?) Additionally, would it be possible to bolster up the test suite here to hopefully head off any lurking issues? Maybe also copy the implementation locally and try fuzzing it?
It's not so much a lot of unsafe code as a little unsafe code repeated 6x. I've compressed it by unifying the different cases and adding a helper macro. It seems like the speed didn't suffer, or at least not by much.
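The unification described above could look roughly like this. This is a sketch with illustrative names (`copy_and_advance`, `join_bytes` are not the PR's actual identifiers), and it stays in safe code by moving the output slice out with `mem::take` and splitting it with `split_at_mut` instead of unchecked indexing:

```rust
// Copy `$bytes` into the front of the `&mut [u8]` in `$target`,
// then leave only the remaining tail behind in `$target`.
macro_rules! copy_and_advance {
    ($target:ident, $bytes:expr) => {
        let len = $bytes.len();
        let (head, tail) = std::mem::take(&mut $target).split_at_mut(len);
        head.copy_from_slice($bytes);
        $target = tail;
    };
}

fn join_bytes(slices: &[&[u8]], sep: &[u8]) -> Vec<u8> {
    // exact output length: all pieces plus (n - 1) separators
    let total = slices.iter().map(|s| s.len()).sum::<usize>()
        + sep.len() * slices.len().saturating_sub(1);
    let mut result = vec![0u8; total];
    let mut target = &mut result[..];
    let mut iter = slices.iter();
    if let Some(first) = iter.next() {
        // one helper covers first element, separators and later elements
        copy_and_advance!(target, first);
        for s in iter {
            copy_and_advance!(target, sep);
            copy_and_advance!(target, s);
        }
    }
    result
}
```

The `mem::take` is what lets `$target` be reassigned to the tail: it moves the slice out of the binding so the split halves keep the original lifetime.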
Thanks!
Would it be possible to gist a safe version of the code? It sounds and looks like the unsafety here derives from the desire to avoid bounds checks, right? Often, from taking a look at the IR, it's possible to poke around, tweak iterators, or do something similar that convinces LLVM it can eliminate all the bounds checks.
It looks naively correct to me, but I've often found that if LLVM can't prove that bounds checks can be eliminated, then our own logic isn't always right.
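A safe formulation of the kind being asked for might look like the following sketch (not the PR's code): peeling the first element off the iterator removes the per-iteration "is this the first element?" check, and `push_str` keeps all indexing in safe code, giving LLVM a fair chance to elide checks given the exact `with_capacity`:

```rust
fn join_str(strings: &[&str], sep: &str) -> String {
    let mut iter = strings.iter();
    let first = match iter.next() {
        Some(first) => first,
        None => return String::new(),
    };
    // iter.len() is already strings.len() - 1 here, so no underflow
    let cap = strings.iter().map(|s| s.len()).sum::<usize>() + sep.len() * iter.len();
    let mut result = String::with_capacity(cap);
    result.push_str(first);
    for s in iter {
        result.push_str(sep);
        result.push_str(s);
    }
    result
}
```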
src/liballoc/slice.rs (outdated)
```rust
} else {
    result.push(sep.clone())
let mut iter = self.iter();
iter.next().map_or(vec![], |first| {
```
Could this and the pattern below perhaps use a `match` with an early return to avoid the extra indentation below?
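In general form, the suggestion is to replace a `map_or` whose closure wraps the whole function body with a `match` plus an early return, which keeps the main path at the outer indentation level (the function and values here are made up for illustration):

```rust
fn sum_tail(values: &[u32]) -> u32 {
    let mut iter = values.iter();
    let first = match iter.next() {
        Some(first) => *first,
        // early return instead of nesting everything in a closure
        None => return 0,
    };
    // main path continues un-indented from here
    first + iter.sum::<u32>()
}
```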
sure
src/liballoc/str.rs (outdated)
```rust
let sep_len = sep.len();
let mut iter = slice.iter();
iter.next().map_or(vec![], |first| {
    // this is wrong without the guarantee that `slice` is non-empty
```
This clause is specifically referring to `slice.len() - 1`, right? Could it perhaps instead use `iter.len()`, because that's already offset by one at this point? (And we know it's exact.)
Yeah. This aligned better with my reasoning, but I guess avoiding the `- 1` is one less place to check mentally for overflow.
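The point about `iter.len()` can be seen in isolation: `slice::Iter` is an `ExactSizeIterator`, so after one `next()` its `len()` already equals `slice.len() - 1`, with no subtraction that could underflow on an empty slice (a small illustrative function, not from the PR):

```rust
fn separators_needed<T>(slice: &[T]) -> usize {
    let mut iter = slice.iter();
    match iter.next() {
        // iter.len() is exact and already offset by the consumed element
        Some(_) => iter.len(),
        None => 0,
    }
}
```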
src/liballoc/str.rs
Outdated
let len = $bytes.len(); | ||
$target.get_unchecked_mut(..len) | ||
.copy_from_slice($bytes); | ||
$target = {$target}.get_unchecked_mut(len..); |
Can these two `get_unchecked_mut` calls perhaps be replaced with `split_at_mut` to avoid slicing twice?
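What the suggestion amounts to, sketched as a safe function (the name `copy_and_advance` is illustrative): the two unchecked slicings collapse into a single split whose one bounds check covers both halves.

```rust
// replaces:
//     $target.get_unchecked_mut(..len).copy_from_slice($bytes);
//     $target = {$target}.get_unchecked_mut(len..);
fn copy_and_advance<'a>(target: &mut &'a mut [u8], bytes: &[u8]) {
    // take the slice out of the binding so the split halves
    // keep the full lifetime 'a, then store the tail back
    let (head, tail) = std::mem::take(target).split_at_mut(bytes.len());
    head.copy_from_slice(bytes);
    *target = tail;
}
```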
It seems so. I moved away from `split_at` fairly early in my optimizations, when I was still using the regular bench suite to gauge speed. I can't see any meaningful improvement now, so I might as well change that back.
The benchmarks are incredibly variable and regularly show spurious variations on the order of ~5%, and even up to 20% (!). Strangely, the variations somewhat persist across different benchmarks in the same run, almost like a constant factor applied to all benches. I know they are spurious because they affect the old join just as much as the new one.
The variability is what made me switch to the Criterion bench framework, but that hasn't helped at all.
Here is a safe version of the code: https://gist.github.com/Emerentius/6bbc1302367111a76b2b0841a5194a2c I've done some fuzzing as well, though I'm not sure how much I can trust it.
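The kind of fuzzing in question can be done as a differential check: compare a reimplementation (here, a deliberately naive one, purely illustrative) against std's own `join` on a batch of inputs, including empty slices, empty separators and multi-byte UTF-8:

```rust
fn naive_join(strings: &[&str], sep: &str) -> String {
    let mut out = String::new();
    for (i, s) in strings.iter().enumerate() {
        if i > 0 {
            out.push_str(sep);
        }
        out.push_str(s);
    }
    out
}

fn differential_check() {
    let pieces = ["", "a", "αβ", "longer piece", "🦀"];
    let seps = ["", ",", ", ", "::", "🦀🦀"];
    for &sep in &seps {
        // check every prefix length, from empty slice upward
        for n in 0..=pieces.len() {
            let input = &pieces[..n];
            assert_eq!(naive_join(input, sep), input.join(sep));
        }
    }
}
```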
@Emerentius to confirm, the current version is the optimized version you'd like to land?
@alexcrichton Yes, I'd like to land the current version. I've rerun the benches on a desktop PC, which has much less benchmark variability than my laptop, and confirmed there is no meaningful difference between the checked and unchecked versions. Here are some plots of the speedup as a function of various variables, and of its difference between checked and unchecked indexing. Don't read too much into differences between specific lines; there are still some spurious variations.
@bors: r+ Ok, this is looking great. Thanks so much again for the analysis done here, as well as the PR! The usage of unsafe here definitely makes sense to me, and I think this is also a common enough operation that speeding it up is likely worth it. Thanks again!
📌 Commit 6d0c5b8 has been approved
🔒 Merge conflict
@alexcrichton I've rebased the code, ready for another try.
@bors: r+
📌 Commit bd2e23e has been approved
🔒 Merge conflict
for both Vec<T> and String
- eliminates the boolean `first` flag in `fn join()`

for String only
- eliminates repeated bounds checks in `join()`, `concat()`
- adds fast paths for small string separators up to a len of 4 bytes

Old tests already cover the new fast path of str joining; this adds tests for joining into Strings with long separators (>4 bytes) and for joining into Vec<T>, T: Clone + !Copy. Vec<T: Copy> will be specialised when specialisation type inference bugs are fixed.

further reduce unsafe fn calls
reduce right drift
assert! sufficient capacity
Ok, apparently I've rebased on an outdated copy.
@bors: r+
📌 Commit 12bd288 has been approved
optimize joining for slices

This improves the speed of string joining by up to 3x. It removes the boolean flag check every iteration, eliminates repeated bounds checks, and adds fast paths for small separators up to a len of 4 bytes. These optimizations gave me ~10%, ~50% and ~80% improvements respectively over the previous speed. Those are multiplicative.

The 3x improvement happens in the optimal case of joining many small strings together in my microbenchmarks. Improvements flatten out for larger strings, of course, as more time is spent copying bits around. I've run a few benchmarks [with this code](https://github.com/Emerentius/join_bench). They are pretty noisy despite high iteration counts, but in total one can see the trends.

```
len_separator  len_string  n_strings  speedup
            4          10         10     2.38
            4          10        100     3.41
            4          10       1000     3.43
            4          10      10000     3.25
            4         100         10     2.23
            4         100        100     2.73
            4         100       1000     1.33
            4         100      10000     1.14
            4        1000         10     1.33
            4        1000        100     1.15
            4        1000       1000     1.08
            4        1000      10000     1.04
           10          10         10     1.61
           10          10        100     1.74
           10          10       1000     1.77
           10          10      10000     1.75
           10         100         10     1.58
           10         100        100     1.65
           10         100       1000     1.24
           10         100      10000     1.12
           10        1000         10     1.23
           10        1000        100     1.11
           10        1000       1000     1.05
           10        1000      10000     0.997
          100          10         10     1.66
          100          10        100     1.78
          100          10       1000     1.28
          100          10      10000     1.16
          100         100         10     1.37
          100         100        100     1.26
          100         100       1000     1.09
          100         100      10000     1.0
          100        1000         10     1.19
          100        1000        100     1.12
          100        1000       1000     1.05
          100        1000      10000     1.12
```

The string joining with small or empty separators is now ~50% faster than the old concatenation (small strings). The same approach can also improve the performance of joining into vectors. If this approach is acceptable, I can apply it for concatenation and for vectors as well. Alternatively, concat could just call `.join("")`.
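The "fast paths for small separators" idea can be sketched as follows. This is an assumed shape, not the PR's actual code: dispatching on the separator length turns the copy into a fixed-size write that the compiler can lower to one or two stores instead of a generic memcpy call. (The real change specializes the whole join loop per length rather than branching on every push, but the dispatch principle is the same.)

```rust
fn push_sep(out: &mut Vec<u8>, sep: &[u8]) {
    match sep {
        &[] => {}
        &[a] => out.push(a),
        &[a, b] => out.extend_from_slice(&[a, b]),
        &[a, b, c] => out.extend_from_slice(&[a, b, c]),
        &[a, b, c, d] => out.extend_from_slice(&[a, b, c, d]),
        // separators longer than 4 bytes take the generic path
        _ => out.extend_from_slice(sep),
    }
}
```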
☀️ Test successful - status-appveyor, status-travis