
WIP: Add HashMap variant with incremental resize #166

Closed
jonhoo wants to merge 2 commits

Conversation

@jonhoo (Contributor) commented Jun 29, 2020

This patch adds a variant of HashMap, `IncrHashMap`, that supports
incremental resizing. It uses the same underlying hash table
implementation as `HashMap` (i.e., `RawTable`), but where the standard
implementation performs all-at-once resizing when the map must grow to
accommodate new elements, this implementation instead spreads the
resizing load across subsequent inserts.

To do this, it keeps both a new, resized map, and the old, pre-resize
map around after a resize. The new map starts out mostly empty, and read
operations search in both tables. New inserts go into the new map, but
they also move a few items from the old map to the new one. When the old
map is empty, it is dropped, and all operations hit only the new map.
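
A minimal sketch of this scheme (not the patch's actual code, which builds directly on `RawTable`): here std's `HashMap` stands in for the two tables, the batch size is an arbitrary illustrative constant, and the trigger that splits the map into old/new when it must grow is omitted.

use std::collections::HashMap;
use std::hash::Hash;

struct TwoTableMap<K, V> {
    new: HashMap<K, V>,
    old: Option<HashMap<K, V>>, // Some(_) only while a resize is in flight
}

impl<K: Eq + Hash + Clone, V> TwoTableMap<K, V> {
    const MOVES_PER_INSERT: usize = 4; // illustrative batch size

    fn get(&self, k: &K) -> Option<&V> {
        // Reads search both tables until the old one has been drained.
        self.new
            .get(k)
            .or_else(|| self.old.as_ref().and_then(|old| old.get(k)))
    }

    fn insert(&mut self, k: K, v: V) -> Option<V> {
        // Evict any stale copy from the old table so reads can't observe it.
        let stale = self.old.as_mut().and_then(|old| old.remove(&k));
        // New inserts always go into the new table...
        let prev = self.new.insert(k, v).or(stale);
        // ...and each insert also carries a few items from old to new.
        let drained = match &mut self.old {
            Some(old) => {
                for _ in 0..Self::MOVES_PER_INSERT {
                    match old.keys().next().cloned() {
                        Some(k) => {
                            let v = old.remove(&k).unwrap();
                            self.new.insert(k, v);
                        }
                        None => break,
                    }
                }
                old.is_empty()
            }
            None => false,
        };
        if drained {
            // Old table is empty: drop it; operations now hit only `new`.
            self.old = None;
        }
        prev
    }
}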

This map implementation offers more stable insert performance than
`HashMap`, but it does so at a cost:

 - Reads and removals of old or missing keys are slower for a while
   after a resize.
 - Inserts are slower than their `HashMap` counterparts for a while
   after a resize.
 - Memory is not reclaimed immediately after a resize, only after all
   the keys have been moved.
 - `IncrHashMap` is slightly larger than `HashMap`, to account for the
   bookkeeping needed to manage two maps during and after a resize.
@jonhoo (Contributor, Author) commented Jun 29, 2020

I've opened this as a draft, because I don't yet know if this is something that would even be reasonable to add to hashbrown. It could be its own crate, though it shares pretty much its entire implementation, so it seemed reasonable to start it as a fork. It is still missing a lot of pieces that I didn't want to bother with for a first draft (like the `Entry` and `reserve`/`shrink` APIs).

I cooked up a simple benchmark (code below), and preliminary results seem pretty promising:

HashMap max: 27.835029ms, mean: 1.351µs
IncrHashMap max: 1.759415ms, mean: 1.35µs

[timeline graph: per-insert latency over time for HashMap vs. IncrHashMap]


// Imports added for completeness; `IncrHashMap`'s exact path is assumed.
use std::collections::HashMap;
use std::time::{Duration, Instant};

use hashbrown::IncrHashMap;

const N: u32 = 1 << 21;

fn main() {
    let mut hm = HashMap::new();
    let mut t = Instant::now();
    let mut mx = 0.0f64;
    let mut sum = Duration::new(0, 0);
    for i in 0..N {
        hm.insert(i, i);
        let t2 = Instant::now();
        let took = t2.duration_since(t);
        t = t2;
        mx = mx.max(took.as_secs_f64());
        sum += took;
        println!("{} std {} ms", i, took.as_secs_f64() * 1000.0);
    }
    eprintln!(
        "HashMap max: {:?}, mean: {:?}",
        Duration::from_secs_f64(mx),
        sum / N
    );

    let mut hm = IncrHashMap::new();
    let mut t = Instant::now();
    let mut mx = 0.0f64;
    let mut sum = Duration::new(0, 0);
    for i in 0..N {
        hm.insert(i, i);
        let t2 = Instant::now();
        let took = t2.duration_since(t);
        t = t2;
        mx = mx.max(took.as_secs_f64());
        sum += took;
        println!("{} incremental {} ms", i, took.as_secs_f64() * 1000.0);
    }
    eprintln!(
        "IncrHashMap max: {:?}, mean: {:?}",
        Duration::from_secs_f64(mx),
        sum / N
    );
}

@jonhoo (Contributor, Author) commented Jun 29, 2020

Apart from the missing APIs, there are two key performance-sensitive bits that should probably be fixed before using this for anything serious. Both of them come down to avoiding a linear scan over the prefix of the old map that has already been moved. These are:

// By changing the state of the table, we have invalidated the table iterator
// we keep for what elements are left to move. So, we re-compute it.
//
// TODO: We should be able to "fix up" the iterator rather than replace it,
// which would save us from iterating over the prefix of empty buckets we've
// left in our wake from the moves so far.
lo.items = lo.table.iter();

and

In hashbrown/src/incremental.rs (lines 1079 to 1080 at 9ee7cea):

// TODO: make this re-use knowledge of progress from t.items
t.table.into_iter()

The remaining small linear factor that creeps in at the end of the graph above comes, I believe, from instantiating the iterator over the "old" map during a resize here:

// NOTE: Calling next here could be expensive, as the iter needs to search for the
// next non-empty bucket. As the map grows in size, that search time will increase
// linearly.

It's not clear that we can be much smarter about that one, but it should also only matter for very large maps.
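
A toy illustration of the cost (not the actual `RawTable` code): if the drain loop re-creates its iterator after every batch, each refresh must re-walk the already-emptied prefix, so draining n items can cost O(n²) bucket visits overall; carrying a resume point across calls brings that back to O(n). Here buckets are modeled as `Option` slots:

// Draining a batch from a toy bucket array. Scanning from index 0 on every
// call would re-visit every already-emptied bucket; carrying `resume` across
// calls skips the drained prefix instead.
fn drain_batch<T>(old: &mut [Option<T>], resume: &mut usize, batch: usize) -> Vec<T> {
    let mut moved = Vec::with_capacity(batch);
    let mut i = *resume;
    while i < old.len() && moved.len() < batch {
        if let Some(item) = old[i].take() {
            moved.push(item);
        }
        i += 1;
    }
    *resume = i; // pick up here on the next call
    moved
}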

@jonhoo (Contributor, Author) commented Jun 29, 2020

I'll add that `incremental.rs` is very similar to `map.rs`, so it may be easier to review this by diffing the two files than by reading the file from scratch.

I'll hold off on working more on this until I hear back whether it's even feasible to land something like this in the first place. And if so, whether this is the right approach (as opposed to, say, making this the default behavior, or having a marker type on HashMap).

@jonhoo (Contributor, Author) commented Jun 29, 2020

Interesting... It looks like CI also checks that the test suite runs on MSRV. Is that intentional? Do we want to limit test code to running on Rust 1.32 as well?

@Amanieu (Member) commented Jun 29, 2020

Great work! I think it would be best for this to be a separate crate. hashbrown exposes the `RawTable` API if you enable the "raw" feature, so your code should mostly work as-is. I am happy to accept any changes to `RawTable` that you might need for this!
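
For reference, a hypothetical sketch of what building on the raw API looks like from a downstream crate, assuming `hashbrown = { version = "0.8", features = ["raw"] }` in Cargo.toml (exact `RawTable` signatures may differ between versions):

use core::hash::{BuildHasher, Hash, Hasher};
use std::collections::hash_map::RandomState;

use hashbrown::raw::RawTable;

fn main() {
    let state = RandomState::new();
    let hash_key = |k: &u32| {
        let mut h = state.build_hasher();
        k.hash(&mut h);
        h.finish()
    };

    // The raw API works on (key, value) pairs with caller-supplied hashes.
    let mut table: RawTable<(u32, u32)> = RawTable::new();
    let h = hash_key(&42);
    table.insert(h, (42, 1), |&(k, _)| hash_key(&k));
    assert_eq!(table.get(h, |&(k, _)| k == 42), Some(&(42, 1)));
}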

@jonhoo (Contributor, Author) commented Jun 29, 2020

Ah, that's fantastic, I missed the raw feature! I'll implement it as a separate crate straight away!

@jonhoo closed this Jun 29, 2020
@jonhoo (Contributor, Author) commented Jun 29, 2020

@Amanieu I think the only thing I need from `RawTable` that is currently missing is a way to get the current "full" capacity (`bucket_mask_to_capacity(self.bucket_mask)`), which I need to compute the next capacity to resize to. Although I guess I can probably work around that by using `::new` + `.reserve`.
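
A sketch of that workaround at the map level (`make_resized` is a hypothetical helper, not part of the patch): sizing the replacement table from the number of live items lets `reserve` pick the actual bucket count internally, so the old table's full capacity never needs to be read.

use core::hash::Hash;
use hashbrown::HashMap;

// Hypothetical helper: build the post-resize table with `new` + `reserve`
// instead of computing the doubled capacity from the old bucket mask.
fn make_resized<K: Eq + Hash, V>(live_items: usize) -> HashMap<K, V> {
    let mut table = HashMap::new();
    table.reserve(live_items.saturating_mul(2));
    table
}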

@jonhoo (Contributor, Author) commented Jun 30, 2020

@Amanieu I would also love to hear your take on the issues in #166 (comment). I think I need some way to "refresh" a `RawIter` without having it re-start from the first bucket. Is that something you think might be doable?

@jonhoo (Contributor, Author) commented Jun 30, 2020

I made a crate out of this over at https://github.com/jonhoo/griddle/. The open questions above have been filed there: jonhoo/griddle#1 ("refreshing" an iterator), jonhoo/griddle#2 (resuming one iterator from another), jonhoo/griddle#3 (cost of starting a new iterator), and jonhoo/griddle#5 (using `new` + `reserve` instead of `with_capacity`).
