
WIP: Add HashMap variant with incremental resize #166

Closed
jonhoo wants to merge 2 commits

Conversation

@jonhoo (Contributor) commented Jun 29, 2020

This patch adds a variant of HashMap, `IncrHashMap`, that supports
incremental resizing. It uses the same underlying hash table
implementation as `HashMap` (i.e., `RawTable`), but where the standard
implementation performs all-at-once resizing when the map must grow to
accommodate new elements, this implementation instead spreads the
resizing load across subsequent inserts.

To do this, it keeps both a new, resized map, and the old, pre-resize
map around after a resize. The new map starts out mostly empty, and read
operations search in both tables. New inserts go into the new map, but
they also move a few items from the old map to the new one. When the old
map is empty, it is dropped, and all operations hit only the new map.
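
A minimal sketch of this scheme (not the patch's actual code, which builds directly on `RawTable`): here std's `HashMap` stands in for the two tables, the batch size is an arbitrary illustrative constant, and the trigger that splits the map into old/new when it must grow is omitted.

use std::collections::HashMap;
use std::hash::Hash;

struct TwoTableMap<K, V> {
    new: HashMap<K, V>,
    old: Option<HashMap<K, V>>, // Some(_) only while a resize is in flight
}

impl<K: Eq + Hash + Clone, V> TwoTableMap<K, V> {
    const MOVES_PER_INSERT: usize = 4; // illustrative batch size

    fn get(&self, k: &K) -> Option<&V> {
        // Reads search both tables until the old one has been drained.
        self.new
            .get(k)
            .or_else(|| self.old.as_ref().and_then(|old| old.get(k)))
    }

    fn insert(&mut self, k: K, v: V) -> Option<V> {
        // Evict any stale copy from the old table so reads can't observe it.
        let stale = self.old.as_mut().and_then(|old| old.remove(&k));
        // New inserts always go into the new table...
        let prev = self.new.insert(k, v).or(stale);
        // ...and each insert also carries a few items from old to new.
        let drained = match &mut self.old {
            Some(old) => {
                for _ in 0..Self::MOVES_PER_INSERT {
                    match old.keys().next().cloned() {
                        Some(k) => {
                            let v = old.remove(&k).unwrap();
                            self.new.insert(k, v);
                        }
                        None => break,
                    }
                }
                old.is_empty()
            }
            None => false,
        };
        if drained {
            // Old table is empty: drop it; operations now hit only `new`.
            self.old = None;
        }
        prev
    }
}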

This map implementation offers more stable insert performance than
`HashMap`, but it does so at a cost:

 - Reads and removals of old or missing keys are slower for a while
   after a resize.
 - Inserts are slower than their `HashMap` counterparts for a while
   after a resize.
 - Memory is not reclaimed immediately after a resize, only after all
   the keys have been moved.
 - `IncrHashMap` is slightly larger than `HashMap`, to account for the
   bookkeeping needed to manage two maps during and after a resize.
@jonhoo (Contributor, Author) commented Jun 29, 2020

I've opened this as a draft, because I don't yet know if this is something that would even be reasonable to add to hashbrown. It could be its own crate, though it shares pretty much its entire implementation, so it seemed reasonable to start it as a fork. It is still missing a lot of pieces that I didn't want to bother with for a first draft (like the `Entry` and `reserve`/`shrink` APIs).

I cooked up a simple benchmark (code below), and preliminary results seem pretty promising:

HashMap max: 27.835029ms, mean: 1.351µs
IncrHashMap max: 1.759415ms, mean: 1.35µs

[timeline graph: per-insert latency over time for HashMap vs. IncrHashMap]


// Imports added for completeness; `IncrHashMap`'s exact path is assumed.
use std::collections::HashMap;
use std::time::{Duration, Instant};

use hashbrown::IncrHashMap;

const N: u32 = 1 << 21;

fn main() {
    let mut hm = HashMap::new();
    let mut t = Instant::now();
    let mut mx = 0.0f64;
    let mut sum = Duration::new(0, 0);
    for i in 0..N {
        hm.insert(i, i);
        let t2 = Instant::now();
        let took = t2.duration_since(t);
        t = t2;
        mx = mx.max(took.as_secs_f64());
        sum += took;
        println!("{} std {} ms", i, took.as_secs_f64() * 1000.0);
    }
    eprintln!(
        "HashMap max: {:?}, mean: {:?}",
        Duration::from_secs_f64(mx),
        sum / N
    );

    let mut hm = IncrHashMap::new();
    let mut t = Instant::now();
    let mut mx = 0.0f64;
    let mut sum = Duration::new(0, 0);
    for i in 0..N {
        hm.insert(i, i);
        let t2 = Instant::now();
        let took = t2.duration_since(t);
        t = t2;
        mx = mx.max(took.as_secs_f64());
        sum += took;
        println!("{} incremental {} ms", i, took.as_secs_f64() * 1000.0);
    }
    eprintln!(
        "IncrHashMap max: {:?}, mean: {:?}",
        Duration::from_secs_f64(mx),
        sum / N
    );
}

@jonhoo (Contributor, Author) commented Jun 29, 2020

Apart from the missing APIs, there are two key performance-sensitive bits that should probably be fixed before using this for anything serious. Both of them come down to avoiding a linear scan over the prefix of the old map that has already been moved. These are:

// By changing the state of the table, we have invalidated the table iterator
// we keep for what elements are left to move. So, we re-compute it.
//
// TODO: We should be able to "fix up" the iterator rather than replace it,
// which would save us from iterating over the prefix of empty buckets we've
// left in our wake from the moves so far.
lo.items = lo.table.iter();

and

In hashbrown/src/incremental.rs (lines 1079 to 1080 at 9ee7cea):

// TODO: make this re-use knowledge of progress from t.items
t.table.into_iter()

The remaining small linear factor that creeps in at the end of the graph above comes, I believe, from instantiating the iterator over the "old" map during a resize here:

// NOTE: Calling next here could be expensive, as the iter needs to search for the
// next non-empty bucket. As the map grows in size, that search time will increase
// linearly.

It's not clear that we can be much smarter about that one, but it should also only matter for very large maps.
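
A toy illustration of the cost (not the actual `RawTable` code): if the drain loop re-creates its iterator after every batch, each refresh must re-walk the already-emptied prefix, so draining n items can cost O(n²) bucket visits overall; carrying a resume point across calls brings that back to O(n). Here buckets are modeled as `Option` slots:

// Draining a batch from a toy bucket array. Scanning from index 0 on every
// call would re-visit every already-emptied bucket; carrying `resume` across
// calls skips the drained prefix instead.
fn drain_batch<T>(old: &mut [Option<T>], resume: &mut usize, batch: usize) -> Vec<T> {
    let mut moved = Vec::with_capacity(batch);
    let mut i = *resume;
    while i < old.len() && moved.len() < batch {
        if let Some(item) = old[i].take() {
            moved.push(item);
        }
        i += 1;
    }
    *resume = i; // pick up here on the next call
    moved
}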

@jonhoo (Contributor, Author) commented Jun 29, 2020

I'll add that `incremental.rs` is very similar to `map.rs`, so it may be easier to review this by diffing the two files than by reading the file from scratch.

I'll hold off on working more on this until I hear back whether it's even feasible to land something like this in the first place. And if so, whether this is the right approach (as opposed to, say, making this the default behavior, or having a marker type on HashMap).

@jonhoo (Contributor, Author) commented Jun 29, 2020

Interesting... It looks like CI also checks that the test suite runs on MSRV. Is that intentional? Do we want to limit test code to running on Rust 1.32 as well?

@Amanieu (Member) commented Jun 29, 2020

Great work! I think it would be best for this to be a separate crate. hashbrown exposes the `RawTable` API if you enable the "raw" feature, so your code should mostly work as-is. I am happy to accept any changes to `RawTable` that you might need for this!
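
For reference, a hypothetical sketch of what building on the raw API looks like from a downstream crate, assuming `hashbrown = { version = "0.8", features = ["raw"] }` in Cargo.toml (exact `RawTable` signatures may differ between versions):

use core::hash::{BuildHasher, Hash, Hasher};
use std::collections::hash_map::RandomState;

use hashbrown::raw::RawTable;

fn main() {
    let state = RandomState::new();
    let hash_key = |k: &u32| {
        let mut h = state.build_hasher();
        k.hash(&mut h);
        h.finish()
    };

    // The raw API works on (key, value) pairs with caller-supplied hashes.
    let mut table: RawTable<(u32, u32)> = RawTable::new();
    let h = hash_key(&42);
    table.insert(h, (42, 1), |&(k, _)| hash_key(&k));
    assert_eq!(table.get(h, |&(k, _)| k == 42), Some(&(42, 1)));
}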

@jonhoo (Contributor, Author) commented Jun 29, 2020

Ah, that's fantastic, I missed the raw feature! I'll implement it as a separate crate straight away!

@jonhoo closed this Jun 29, 2020
@jonhoo (Contributor, Author) commented Jun 29, 2020

@Amanieu I think the only thing I need from `RawTable` that is currently missing is a way to get the current "full" capacity (`bucket_mask_to_capacity(self.bucket_mask)`), which I need to compute the next capacity to resize to. Although I guess I can probably work around that by using `::new` + `.reserve`.
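
A sketch of that workaround at the map level (`make_resized` is a hypothetical helper, not part of the patch): sizing the replacement table from the number of live items lets `reserve` pick the actual bucket count internally, so the old table's full capacity never needs to be read.

use core::hash::Hash;
use hashbrown::HashMap;

// Hypothetical helper: build the post-resize table with `new` + `reserve`
// instead of computing the doubled capacity from the old bucket mask.
fn make_resized<K: Eq + Hash, V>(live_items: usize) -> HashMap<K, V> {
    let mut table = HashMap::new();
    table.reserve(live_items.saturating_mul(2));
    table
}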

@jonhoo (Contributor, Author) commented Jun 30, 2020

@Amanieu I would also love to hear your take on the issues in #166 (comment). I think I need some way to "refresh" a `RawIter` without having it re-start from the first bucket. Is that something you think might be doable?

@jonhoo (Contributor, Author) commented Jun 30, 2020

I made a crate out of this over at https://github.com/jonhoo/griddle/. The open questions above have been filed there: jonhoo/griddle#1 ("refreshing" an iterator), jonhoo/griddle#2 (resuming one iterator from another), jonhoo/griddle#3 (cost of starting a new iterator), and jonhoo/griddle#5 (using `new` + `reserve` instead of `with_capacity`).
