
Performance regression #15203

Closed
retep998 opened this issue Jun 26, 2014 · 8 comments · Fixed by #15464

Comments

@retep998
Member

When moving from the June 20 to the June 26 nightly, this code became twice as slow (comparison of ASM included):

https://gist.github.com/retep998/bf4fd704bbd456912c22

Went from 250 microseconds to 500 microseconds.

My build command:

    rustc.exe --opt-level=3 euler.rs --test --emit=asm,link -Ctarget-cpu=amdfam10 -Cllvm-args="--x86-asm-syntax=intel"

Using Windows 8.1 x64 with an AMD Phenom II processor.

@Blei
Contributor

Blei commented Jun 26, 2014

cc @dotdash, there's a suspicion that it's caused by the bool -> i1 conversion.

@retep998
Member Author

Okay, I have narrowed it down specifically to:

    let mut sieve = Vec::from_elem(size, true);

For some reason, that line is responsible for almost all the time taken by that function. The old version simply ran memset across the Vec, while the new version iterates over the entire Vec, storing 1 in each element manually.

@mahkoh
Contributor

mahkoh commented Jun 26, 2014

    let mut sieve = Vec::with_capacity(size);
    unsafe {
        let p = sieve.as_mut_ptr();
        std::ptr::set_memory(p, 1, size);
        sieve.set_len(size);
    }

This generates code with memset. And it is much, much slower than the code that stores the 1s by hand. Lots of pagefaults too.

@dotdash
Contributor

dotdash commented Jun 26, 2014

@mahkoh that code is almost three times faster than storing the ones for me.

Original:

running 1 test
test bench_p3 ... bench:    315992 ns/iter (+/- 1850)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

With your "hack":

running 1 test
test bench_p3 ... bench:    138248 ns/iter (+/- 304)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

@mahkoh
Contributor

mahkoh commented Jun 26, 2014

Ha, you're right. It's the same here. I didn't realize that Rust would change the benchmark depending on the results; I was only looking at the output of perf.

@dotdash
Contributor

dotdash commented Jun 26, 2014

LLVM's LoopIdiomRecognize pass currently only considers integers whose width is a whole number of bytes, i.e. i8, i16 and so on, when looking for opportunities to replace stores with a memset. I've modified the relevant places to also accept values with a bitwidth < 8. With that, at least opt generates memset calls again, but it's just a hack so far and still crashes in the LLVM test suite. Also, we'll have to see what upstream thinks of this. I'll try to refine the patch this weekend and submit it upstream then.

@dotdash
Contributor

dotdash commented Jun 26, 2014

Actually, the test suite crash was just my fragile "debugging" code. Removing that made it pass the test suite. So I'll test the new patched LLVM version with rustc tomorrow and submit the patch upstream if it works as expected.

@retep998 Thanks for spotting this!

@dotdash
Contributor

dotdash commented Jun 29, 2014

Just a small update, since this is taking longer than expected / "announced" ;-)

Looking through LLVM's code, I noticed a few more optimizations that won't trigger when i1 is used for the in-memory representation of boolean values. Extending all those places to handle i1 isn't trivial and IMHO not worth the trouble.

Also, I realized that I actually misread the comment in the clang sources that made me use i1 for the memory representation of boolean values, instead of just for the SSA values. The latter is what clang does, and it means that the value is truncated on loads and zero-extended on stores.

The difference from what Rust used to do is that the conversion happens on load and store instead of at e.g. br. IOW, it depends on where the value lives, not on its current usage. I'm now working on a patch to replicate clang's behaviour in rustc.

dotdash added a commit to dotdash/rust that referenced this issue Jul 6, 2014
LLVM doesn't really like types with a bit-width that isn't a multiple of
8 and disables various optimizations if it encounters such types used
with loads/stores. OTOH, booleans must be represented as i1 when used as
SSA values. To get the best results, we must use i1 for SSA values, and
i8 when storing the value to memory.

By using range asserts on loads, LLVM can eliminate the required
zero-extend and truncate operations.

Fixes rust-lang#15203
bors added a commit that referenced this issue Jul 7, 2014
LLVM doesn't handle i1 values in allocas/memory very well and skips a number of optimizations if it hits them. So we have to do the same thing that Clang does: use i1 for SSA values, but store i8 in memory.

Fixes #15203.
bors added a commit to rust-lang-ci/rust that referenced this issue Jul 17, 2023
Shuffle some proc_macro_expand query things around

Removes some unnecessary extra work we are doing in proc-macro expansion and, more importantly, wraps the result of the proc_macro_expand query in an `Arc`, so that we can reuse the instance for the `macro_expand` query's result.

4 participants