Skip to content

do simple literal prefix scanning in regex! #95

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mkpankov opened this issue Jun 24, 2015 · 11 comments
Closed

do simple literal prefix scanning in regex! #95

mkpankov opened this issue Jun 24, 2015 · 11 comments
Labels

Comments

@mkpankov
Copy link

Hi,

I saw the news of that regex got refactored and optimized and decided to check my old benchmark. I was very surprised it now runs twice as long!

How to reproduce (using multirust for versions as older regex doesn't compile with newer nightly Rust):

git clone https://github.com/mkpankov/parse-rust.git
cd parse-rust
multirust override nightly-2015-06-24
git checkout 4076c404caf1560a466e9f0799817035089fe841
cargo build --release
time zcat mp3-logs-with-fake-ips.log.gz | ./target/release/parse-rust
// outputs around 4s on my machine
multirust override nightly-2015-05-25
git checkout e33d410291fa7f134eef628b5591d605cd68b218
cargo clean
cargo build --release
time zcat mp3-logs-with-fake-ips.log.gz | ./target/release/parse-rust
// outputs around 2s on my machine

I'm sorry I can't pinpoint it more accurately (maybe it's Rust changes, not regex), but recent major changes of regex might be it. Two times degradation is severe in my opinion, and needs action.

regex versions:

  • new, degraded:
 "regex 0.1.38 (registry+https://github.com/rust-lang/crates.io-index)",
 "regex_macros 0.1.20 (registry+https://github.com/rust-lang/crates.io-index)",
  • old, fast:
 "regex 0.1.30 (registry+https://github.com/rust-lang/crates.io-index)",
 "regex_macros 0.1.18 (registry+https://github.com/rust-lang/crates.io-index)",

Some background: back when I did this I compared Rust version to C++ version (doing almost stupid translation) and Rust beat C++ by about 40% w/o using compile-time regex. This kind of degradation puts it back behind C++ 😞

@mkpankov
Copy link
Author

Okay, seems the real problem is in regex! and regex_macros. Switching to regex::Regex::new instead gives me ~1s time, which is about twice as good as the best of regex! ones. 👍

If that means there's nothing to fix in this crate - feel free to close this, I'll refile the issue where appropriate.

@BurntSushi
Copy link
Member

One of the major optimizations for dynamic regexes is in how it handles literal prefixes. Unfortunately, the added complexity of that makes it a little tricky to get it into the regex! macro, so it's no longer doing literal prefix matching. There's probably something simple we can do to at least recover the loss without encoding a full DFA (which is #26).

@BurntSushi BurntSushi changed the title Performance degraded 2 times comparing to month-old version do simple literal prefix scanning in regex! Jun 24, 2015
@BurntSushi BurntSushi added the bug label Jun 24, 2015
@BurntSushi
Copy link
Member

@mkpankov Thank you for noticing this and reporting it!

@ArtemGr
Copy link

ArtemGr commented Jul 2, 2015

BTW, regarding the literal prefix matching, are you using tricks like https://github.com/shepmaster/jetscii (PCMPESTRI) to make them even faster?

@BurntSushi
Copy link
Member

jetscii must run on nightly Rust, so no. (@shepmaster knows of my interest though!) I am using memchr though, which only works in the single-byte-prefix case. I suppose jetscii is a candidate for use in regex! though.

@ArtemGr
Copy link

ArtemGr commented Jul 2, 2015

That might shift the performance scales in favour of regex! once again. = )

@bluss
Copy link
Member

bluss commented Jul 28, 2015

FYI jetscii is limited to ascii at the moment but it could just as well be implemented to look for any byte values.

@shepmaster
Copy link
Member

Perhaps of interest, jetscii 0.3.0 introduces support for finding substrings:

Method Speed
str.find(Substring::new("xyzzy")) 5017 MB/s
str.find("xyzzy") 3837 MB/s

@JoeyAcc
Copy link

JoeyAcc commented Dec 16, 2016

Out of curiosity, what is the status of this year-and-a-half old bug?

@BurntSushi
Copy link
Member

@JoeyAcc I have zero plans to work on the regex! macro in the foreseeable future. I'd personally rather see it removed until there's a plausible way to make it work on stable Rust.

@BurntSushi
Copy link
Member

Closing in favor of #26.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants