
Performance regression #15203

Closed
retep998 opened this issue Jun 26, 2014 · 8 comments · Fixed by #15464

Comments

@retep998
Member

When moving from the June 20 to the June 26 nightly, this code became twice as slow (comparison of ASM included):

https://gist.github.com/retep998/bf4fd704bbd456912c22

Went from 250 microseconds to 500 microseconds.

My build command:

    rustc.exe --opt-level=3 euler.rs --test --emit=asm,link -Ctarget-cpu=amdfam10 -Cllvm-args="--x86-asm-syntax=intel"

Using Windows 8.1 x64 with an AMD Phenom II processor.

@Blei
Contributor

Blei commented Jun 26, 2014

cc @dotdash, there's a suspicion that it's caused by the bool -> i1 conversion.

@retep998
Member Author

Okay, I have narrowed it down specifically to:

    let mut sieve = Vec::from_elem(size, true);

For some reason, that line is responsible for almost all the time taken by that function. The old version simply ran memset across the Vec, while the new version iterates over the entire Vec, storing 1 in each element manually.

@mahkoh
Contributor

mahkoh commented Jun 26, 2014

    let mut sieve = Vec::with_capacity(size);
    unsafe {
        let p = sieve.as_mut_ptr();
        std::ptr::set_memory(p, 1, size);
        sieve.set_len(size);
    }

This generates code with memset. And it is much, much slower than the code that stores the 1s by hand. Lots of pagefaults too.

@dotdash
Contributor

dotdash commented Jun 26, 2014

@mahkoh that code is almost three times faster than storing the ones for me.

Original:

running 1 test
test bench_p3 ... bench:    315992 ns/iter (+/- 1850)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

With your "hack":

running 1 test
test bench_p3 ... bench:    138248 ns/iter (+/- 304)

test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured

@mahkoh
Contributor

mahkoh commented Jun 26, 2014

Ha, you're right. It's the same here. I didn't realize that Rust would change the benchmark depending on the results; I was only looking at the output of perf.

@dotdash
Contributor

dotdash commented Jun 26, 2014

LLVM's LoopIdiomRecognize pass currently only considers integers whose width is a whole number of bytes, i.e. i8, i16 and so on, when looking for opportunities to replace stores with a memset. I've modified the relevant places to also accept values with a bitwidth < 8. With that, at least opt generates memset calls again, but it's just a hack so far and still crashes in the LLVM test suite. Also, we'll have to see what upstream thinks of this. I'll try to refine the patch this weekend and submit it upstream then.

@dotdash
Contributor

dotdash commented Jun 26, 2014

Actually, the test suite crash was just my fragile "debugging" code. Removing that made it pass the test suite. So I'll test the new patched LLVM version with rustc tomorrow and submit the patch upstream if it works as expected.

@retep998 Thanks for spotting this!

@dotdash
Contributor

dotdash commented Jun 29, 2014

Just a small update, since this is taking longer than expected / "announced" ;-)

Looking through LLVM's code, I noticed a few more optimizations that won't trigger when i1 is used for the in-memory representation of boolean values. Extending all those places to handle i1 isn't trivial and IMHO not worth the trouble.

Also, I realized that I actually misread the comment in the clang sources that made me use i1 for the memory representation of boolean values, instead of just for the SSA values. The latter is what clang does, and it means that the value is truncated on loads and zero-extended on stores.

The difference from what Rust used to do is that the conversion happens on load and store instead of at e.g. br. IOW, it depends on where the value lives, not on its current usage. I'm now working on a patch to replicate clang's behaviour in rustc.

dotdash added a commit to dotdash/rust that referenced this issue Jul 6, 2014
LLVM doesn't really like types with a bit-width that isn't a multiple of
8 and disables various optimizations if it encounters such types used
with loads/stores. OTOH, booleans must be represented as i1 when used as
SSA values. To get the best results, we must use i1 for SSA values, and
i8 when storing the value to memory.

By using range asserts on loads, LLVM can eliminate the required
zero-extend and truncate operations.

Fixes rust-lang#15203
bors added a commit that referenced this issue Jul 7, 2014
LLVM doesn't handle i1 values in allocas/memory very well and skips a number of optimizations if it hits them. So we have to do the same thing that Clang does: use i1 for SSA values, but store i8 in memory.

Fixes #15203.
bors added a commit to rust-lang-ci/rust that referenced this issue Jul 17, 2023
Shuffle some proc_macro_expand query things around

Removes some unnecessary extra work we are doing in proc-macro expansion and, more importantly, wraps the result of the proc_macro_expand query in an `Arc`, so that we can reuse the instance for the `macro_expand` query's result.

4 participants