
Memory management (fixes #22, closes #24) #23


Merged
merged 64 commits
Aug 31, 2024

Conversation

mnemnion
Contributor

@mnemnion mnemnion commented Jul 3, 2024

This PR adds proper memory management to diffz.

Almost all the tests are ported to use std.testing.allocator, and they all pass. There are a few in the diff test which I didn't feel like rewriting; I would encourage one of the maintainers to do so after the merge.

mnemnion added 22 commits July 1, 2024 20:24
This adds a test running step and a zig.zon manifest.  Also adds to the
.gitignore the new location of caches, .zig-out.
halfMatch now correctly manages its own memory, like it's supposed to.
The tests currently create Diffs with constant/static text, which
causes a panic on any attempt to free said text.

This pass fixes a couple leaks, and lays a foundation for consistently
freeing any diff's .text field when no longer in use: this requires
that, as it will be in real programs, the diff text is allocated to
begin with.
I'll need to knock out all the tests in diffMergeSemantic so that
I can flip the rest of the diffCleanupMerge tests on, without trying
to free .rodata memory.

The real annoyance of the decision to punt on memory management is in
the tests, which all have to be rewritten.  Whoever you are, you could
at least have written the tests responsibly, and spared the person who
cleans up your sloppy code from a bunch of drudgery.
Found a double free in the falsed-out one, so that's next up...
Also changing the use of .replaceRange, which led me to think that the
item at the pointer location was being, y'know, replaced, with a use of
.insert, which is, in fact, correct.
I had missed that some slices in cleanupSemantic were being taken from
the same string/slice that was already at that index, due to the prior
code renaming them.  Ganbatte!
The library now manages its own memory, as it should.
@travisstaloch
Collaborator

I haven't looked closely at this PR yet but it seems like a nice improvement at a glance. One thing I would like to see is some kind of benchmark, perhaps just of the tests or diffing some large files.

Also would like to hear back from someone who's run zls with these changes and a debug or release safe build for a few days or so.

@mnemnion
Contributor Author

mnemnion commented Jul 6, 2024

One thing I would like to see is some kind of benchmark, perhaps just of the tests or diffing some large files.

A benchmark would be a valid addition to the library, but one I don't have time to write.

To be fair to the PR, the benchmark should be run against the existing library both with and without an arena. The PR still allows for an arena, after all; it just doesn't require one. That would separate performance regressions caused by the memory-management changes from any which are due simply to not using an arena (and which therefore just argue for using one). All things being equal, it's probably faster to run most code with an arena and free everything at the end, but that trades off against additional memory pressure, so it's not a guarantee.

I did eliminate some unnecessary copying in the process of switching over to valid use of any allocator, so it's possible that the code is faster both with an arena and without. I look forward to your conclusions.

@travisstaloch
Collaborator

A benchmark would be a valid addition to the library, but one I don't have time to write.

How about just measuring the runtime of the tests before and after this PR? You could just modify build.zig to install the tests and run $ time zig-out/bin/tests to get a ballpark. For more detail, copy the before and after test binaries somewhere and use something like hyperfine or poop to measure.
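A rough sketch of that workflow in shell; the binary names `tests-main` and `tests-mm` are assumptions for binaries built from the two branches, not files in this repository:

```shell
# Hypothetical A/B timing loop over test binaries built from the
# main branch and the memory-management branch, copied aside first.
for bin in tests-main tests-mm; do
  if [ -x "./$bin" ]; then
    time "./$bin"          # ballpark wall-clock comparison
  else
    echo "skipping $bin (not built)"
  fi
done
```

hyperfine or poop can then be pointed at the same two binaries when proper statistics are wanted.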

I agree with what you said about the existing code having an advantage. Seems like using an arena both before and after might be the closest to a fair comparison since free/destroy for arenas is a no-op afaik.

I think that having some kind of timings would go a long way toward inspiring confidence in these changes.

Also if you use zls: build it with this PR in release safe and try running it for a few days or more and report back if you notice anything unusual?

@mnemnion
Contributor Author

mnemnion commented Jul 6, 2024

How about just measuring the runtime of the tests before and after this PR?

That's more straightforward, yes. It's a bit tricky to set up so that the tests can be run with either an arena or not, but taking a baseline is very simple.

"bit tricky" meaning readily achievable with comptime and a conditional defer on deinitializing the test allocator, also moving it to a global. But just running an A/B test with the branches and system time is easy. Will let you know.

@mnemnion
Contributor Author

mnemnion commented Jul 6, 2024

Original branch:

Executed in  177.64 millis    fish           external
   usr time  133.54 millis    0.05 millis  133.49 millis
   sys time   34.21 millis    1.77 millis   32.43 millis

New branch using testing allocator:

Executed in  204.25 millis    fish           external
   usr time  151.01 millis    0.04 millis  150.97 millis
   sys time   45.58 millis    1.26 millis   44.31 millis

So in Debug mode, 1.15X slower.

But! That's in safety mode, where the testing allocator doesn't actually release memory, and does comprehensive checks to catch leaks and report use-after-free and double-free.

Something interesting happens when we switch to -Doptimize=ReleaseFast, here's the original:

________________________________________________________
Executed in  177.07 millis    fish           external
   usr time  132.75 millis    0.05 millis  132.69 millis
   sys time   36.29 millis    1.86 millis   34.44 millis

Which barely changes, and here's the memory-management branch:

________________________________________________________
Executed in  166.93 millis    fish           external
   usr time  127.30 millis    0.04 millis  127.26 millis
   sys time   32.20 millis    1.24 millis   30.96 millis

So in release mode, the original arena code is 1.06X slower than the memory-management branch; put another way, the new branch is about 6% faster. This is readily interpreted: almost all allocations are u8, so the allocator has a lot of flexibility for reusing recently-freed memory.

Also bear in mind that the tests are inherently doing more work, since all Diff strings except the comparison ones have to be duped so that freeing them doesn't segfault, so the practical savings are possibly greater than this quick check shows.

Worth emphasizing that the short duration makes these numbers pretty fuzzy, and more and larger tests with some built-in benchmarking would be a useful addition. I also think it's worthwhile to set the tests up so the allocator can be swapped, if I find some time I might do that part.

But as a preliminary, it looks like the changes have a minimal effect on performance, and a positive one where it counts.

Perhaps worth noting: I ran each of the four tests three times and chose the middle value, as a cheap way to reduce bias, and also so that I'd spot any large differences between runs. The differences were always under ten milliseconds.
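For anyone wanting to reproduce it, the middle-of-three selection can be sketched in shell; the timed command below is a stand-in (`true`), not the real test binary:

```shell
# Run a command three times, record wall-clock milliseconds for each
# run, and report the median (middle value) of the three.
runs=""
for i in 1 2 3; do
  start=$(date +%s%N)             # GNU date: nanoseconds since epoch
  true                            # stand-in for e.g. zig-out/bin/tests
  end=$(date +%s%N)
  runs="$runs $(( (end - start) / 1000000 ))"
done
median=$(printf '%s\n' $runs | sort -n | sed -n '2p')
echo "median: ${median}ms"
```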

@travisstaloch
Collaborator

travisstaloch commented Jul 7, 2024

Thanks! One thing to consider: the zls install notes do recommend using a ReleaseSafe build, so that might be the most important build mode to consider.

Also bear in mind that the tests are inherently doing more work, since all Diff strings except the comparison ones have to be duped so that freeing them doesn't segfault.

Oh, that's right. I forgot about this. Maybe benching the tests isn't a good measurement because of this added work. My apologies if that's the case.

Perhaps @SuperAuguste or @Techatrix or someone else might propose a better way for us to benchmark these changes from zls. Is there any existing way to benchmark zls that we should be using here? If not, perhaps a better benchmark might be to just time the zls tests? I can't remember if all of zls's tests use the same allocation strategy as in main.

Also, please let us know whether these benchmarks are even necessary. I don't want to be making unnecessary work for anyone.

@mnemnion
Contributor Author

mnemnion commented Jul 7, 2024

I'm afraid I don't see what role zls plays here. To be clear, that's not a disagreement, I literally don't know what part zls can play in testing this.

But (easily done) middle-of-three for ReleaseSafe for the main branch:

________________________________________________________
Executed in  171.26 millis    fish           external
   usr time  129.24 millis    0.06 millis  129.18 millis
   sys time   31.87 millis    1.83 millis   30.04 millis

and for memory-management:

________________________________________________________
Executed in  174.95 millis    fish           external
   usr time  130.14 millis    0.04 millis  130.10 millis
   sys time   34.81 millis    2.07 millis   32.74 millis

That's within the margin of error where it's pure luck which one completed quicker. I think the important part here is that there are no observable performance regressions with the data I can easily generate.

If anyone would like to add more test data, or set up some timing from within the test itself, I would be happy to merge that and run it as well.

@Techatrix
Member

I was interested to see if there were any differences in memory usage, so I did some benchmarks myself.
All runs are in ReleaseFast with check_lines == true.

diff Sema.zig & Sema.zig (100 times per run)
[nix-shell:~/repos/diffz]$ poop -d 10000 ./bench1-old-arena-fast ./bench1-new-arena-fast ./bench1-new-gpa-fast
Benchmark 1 (190 runs): ./bench1-old-arena-fast
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          52.0ms ±  324us    50.8ms … 53.1ms         12 ( 6%)        0%
  peak_rss           3.37MB ± 15.8KB    3.32MB … 3.39MB         16 ( 8%)        0%
  cpu_cycles         45.5M  ±  317K     44.9M  … 46.7M           2 ( 1%)        0%
  instructions       14.1M  ± 2.06K     14.1M  … 14.1M          17 ( 9%)        0%
  cache_references   1.22M  ± 24.5K     1.15M  … 1.25M          17 ( 9%)        0%
  cache_misses       2.54K  ± 1.99K      762   … 15.4K          21 (11%)        0%
  branch_misses       695   ± 42.7       589   …  918            7 ( 4%)        0%
Benchmark 2 (191 runs): ./bench1-new-arena-fast
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          51.9ms ±  334us    51.3ms … 53.2ms          9 ( 5%)          -  0.3% ±  0.1%
  peak_rss           3.50MB ± 17.3KB    3.46MB … 3.58MB         19 (10%)        💩+  3.9% ±  0.1%
  cpu_cycles         45.5M  ±  360K     44.9M  … 47.1M           5 ( 3%)          -  0.1% ±  0.1%
  instructions       14.1M  ± 2.14K     14.1M  … 14.1M          18 ( 9%)          +  0.0% ±  0.0%
  cache_references   1.22M  ± 19.3K     1.14M  … 1.26M           7 ( 4%)          +  0.2% ±  0.4%
  cache_misses       1.99K  ± 1.78K      656   … 20.2K          15 ( 8%)        ⚡- 21.6% ± 14.9%
  branch_misses       629   ± 43.7       540   …  749            4 ( 2%)        ⚡-  9.4% ±  1.2%
Benchmark 3 (186 runs): ./bench1-new-gpa-fast
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          53.1ms ±  273us    52.6ms … 54.1ms          7 ( 4%)        💩+  2.0% ±  0.1%
  peak_rss           3.50MB ± 21.9KB    3.47MB … 3.61MB          1 ( 1%)        💩+  3.9% ±  0.1%
  cpu_cycles         45.7M  ±  403K     45.0M  … 47.3M           9 ( 5%)          +  0.3% ±  0.2%
  instructions       14.1M  ± 1.81K     14.1M  … 14.1M          14 ( 8%)          +  0.2% ±  0.0%
  cache_references   1.22M  ± 12.8K     1.14M  … 1.25M           5 ( 3%)          +  0.1% ±  0.3%
  cache_misses       3.13K  ± 1.93K     1.34K  … 20.0K          11 ( 6%)        💩+ 23.0% ± 15.6%
  branch_misses       868   ± 78.7       611   … 1.10K           6 ( 3%)        💩+ 25.0% ±  1.8%
diff Sema.zig & AstGen.zig
[nix-shell:~/repos/diffz]$ poop -d 10000 ./bench2-old-arena-fast ./bench2-new-arena-fast ./bench2-new-gpa-fast
Benchmark 1 (10 runs): ./bench2-old-arena-fast
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.01s  ±  387us    1.01s  … 1.01s           0 ( 0%)        0%
  peak_rss           58.8MB ± 63.3KB    58.7MB … 58.9MB          0 ( 0%)        0%
  cpu_cycles         3.83G  ± 8.37M     3.82G  … 3.84G           0 ( 0%)        0%
  instructions       10.2G  ± 48.3M     10.1G  … 10.2G           0 ( 0%)        0%
  cache_references   2.84M  ± 64.2K     2.75M  … 2.92M           0 ( 0%)        0%
  cache_misses       68.8K  ± 23.5K     54.6K  …  123K           0 ( 0%)        0%
  branch_misses      50.4M  ±  135K     50.3M  … 50.7M           0 ( 0%)        0%
Benchmark 2 (10 runs): ./bench2-new-arena-fast
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.01s  ±  422us    1.01s  … 1.01s           0 ( 0%)          -  0.0% ±  0.0%
  peak_rss           58.8MB ± 67.7KB    58.7MB … 58.9MB          0 ( 0%)          -  0.0% ±  0.1%
  cpu_cycles         3.83G  ± 10.5M     3.82G  … 3.84G           0 ( 0%)          -  0.0% ±  0.2%
  instructions       10.2G  ± 62.8M     10.1G  … 10.3G           0 ( 0%)          +  0.2% ±  0.5%
  cache_references   2.84M  ± 62.9K     2.73M  … 2.95M           0 ( 0%)          +  0.0% ±  2.1%
  cache_misses       65.0K  ± 26.4K     52.1K  …  139K           1 (10%)          -  5.5% ± 34.1%
  branch_misses      50.5M  ±  205K     50.1M  … 50.8M           0 ( 0%)          +  0.1% ±  0.3%
Benchmark 3 (10 runs): ./bench2-new-gpa-fast
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.01s  ±  370us    1.01s  … 1.01s           0 ( 0%)          -  0.1% ±  0.0%
  peak_rss           41.7MB ± 41.9KB    41.6MB … 41.7MB          1 (10%)        ⚡- 29.1% ±  0.1%
  cpu_cycles         3.84G  ± 1.96M     3.83G  … 3.84G           0 ( 0%)          +  0.1% ±  0.1%
  instructions       9.97G  ±  171M     9.50G  … 10.1G           1 (10%)          -  1.8% ±  1.2%
  cache_references   2.61M  ± 58.3K     2.48M  … 2.69M           0 ( 0%)        ⚡-  7.8% ±  2.0%
  cache_misses       56.3K  ± 8.48K     49.7K  … 78.8K           1 (10%)          - 18.1% ± 24.1%
  branch_misses      50.4M  ± 1.02M     47.5M  … 51.0M           1 (10%)          -  0.1% ±  1.4%
diff Sema.zig & Sema2.zig (100 times per run)

Sema2.zig is identical to Sema.zig except that I have added two new lines somewhere in the middle of the document.

[nix-shell:~/repos/diffz]$ poop -d 10000 ./bench3-old-arena-fast ./bench3-new-arena-fast ./bench3-new-gpa-fast
Benchmark 1 (70 runs): ./bench3-old-arena-fast
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           143ms ±  642us     142ms …  145ms          0 ( 0%)        0%
  peak_rss           5.90MB ± 6.38KB    5.90MB … 5.92MB          0 ( 0%)        0%
  cpu_cycles          313M  ±  848K      312M  …  315M           0 ( 0%)        0%
  instructions        997M  ± 2.00K      997M  …  997M           3 ( 4%)        0%
  cache_references   1.56M  ± 22.5K     1.46M  … 1.59M           2 ( 3%)        0%
  cache_misses       43.3K  ± 6.13K     32.2K  … 57.1K           0 ( 0%)        0%
  branch_misses      2.68K  ±  155      2.10K  … 2.97K           1 ( 1%)        0%
Benchmark 2 (70 runs): ./bench3-new-arena-fast
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           143ms ±  653us     142ms …  145ms          2 ( 3%)          -  0.1% ±  0.2%
  peak_rss           5.90MB ± 7.43KB    5.90MB … 5.93MB          0 ( 0%)          +  0.0% ±  0.0%
  cpu_cycles          313M  ±  737K      312M  …  315M           2 ( 3%)          -  0.1% ±  0.1%
  instructions        997M  ± 2.96K      997M  …  997M           4 ( 6%)          +  0.0% ±  0.0%
  cache_references   1.50M  ± 12.6K     1.45M  … 1.53M           3 ( 4%)        ⚡-  4.0% ±  0.4%
  cache_misses       37.0K  ± 6.94K     20.6K  … 53.8K           0 ( 0%)        ⚡- 14.4% ±  5.0%
  branch_misses      2.17K  ±  152      1.83K  … 2.46K           0 ( 0%)        ⚡- 19.0% ±  1.9%
Benchmark 3 (71 runs): ./bench3-new-gpa-fast
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           142ms ±  616us     140ms …  144ms          2 ( 3%)          -  1.0% ±  0.1%
  peak_rss           5.91MB ± 7.77KB    5.90MB … 5.93MB          0 ( 0%)          +  0.0% ±  0.0%
  cpu_cycles          303M  ± 1.03M      302M  …  308M           3 ( 4%)        ⚡-  3.2% ±  0.1%
  instructions        997M  ± 5.43K      997M  …  997M           6 ( 8%)          +  0.0% ±  0.0%
  cache_references   1.62M  ± 31.9K     1.54M  … 1.67M           0 ( 0%)        💩+  3.7% ±  0.6%
  cache_misses       49.0K  ± 22.9K     13.1K  … 81.8K           0 ( 0%)          + 13.3% ± 12.8%
  branch_misses      3.20K  ±  258      2.33K  … 3.71K           4 ( 6%)        💩+ 19.7% ±  2.6%

There isn't much of an impact on wall time, but there is a nice reduction in peak RSS in the Sema.zig vs. AstGen.zig run.
I assume that ZLS's diffs are most similar to the third benchmark, so it won't be noticeably affected by these changes.

@mnemnion
Contributor Author

mnemnion commented Jul 9, 2024

Alright, that's sorted.

Worth noting that this:

It looks like missing coverage is related to diffLineMode only, which isn't exercising the inside of the halfMatch code. Maybe worth circling back to.

is where the leaks were: in the code without coverage. I described it incorrectly, though; it was in the diffCompute path, which uses a half match.

Coverage now stands at 99.2%, and suggests that diffLineMode doesn't have comprehensive leak checks yet (because the errdefers in that section of the code are not triggered by the test). Total line coverage is essential for this kind of code.

@travisstaloch
Collaborator

I've just pulled the latest changes from this branch and am test driving. Didn't notice any issues since yesterday.

@mnemnion mnemnion changed the title Memory management (fixes #22) Memory management (fixes #22, closes #24) Jul 12, 2024
@mnemnion
Contributor Author

Added checkAllFailingAllocations test for diffLineMode which, yes, smoked out some bugs.

Running this test takes an extraordinarily long time, because the algorithm by its nature makes a great many tiny allocations with no obvious way to batch them, and the exhaustive test function is quadratic in the number of allocations.
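The quadratic blow-up is easy to see: inducing failure at each allocation index in turn means an operation that performs N allocations is re-run N times, for on the order of N*(N+1)/2 allocations overall. A quick back-of-the-envelope illustration:

```shell
# Total allocations performed by an exhaustive allocation-failure
# check, as a function of N allocations in the operation under test.
for n in 10 100 1000; do
  echo "$n allocations -> $(( n * (n + 1) / 2 )) total across all runs"
done
```

This is why truncating the test data and lowering the line-check threshold brings the test back down to a reasonable runtime.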

It triggers line mode through the "front door" so to speak, and can no doubt be replaced with something shorter which calls diffLineMode directly.

Remaining un-covered lines are little things like the Diff format string, and some errdefers which I don't think are triggerable. But I'd rather have a few red lines in kcov than be wrong about that.

This lets the test set it to a low value.  That in turn allows the full
allocation failure check to complete in a reasonable amount of time.
@mnemnion
Contributor Author

Added a field which configures the line-check threshold, using that and truncating the test data gets the allocation failure test down to something comparable to the other tests.

That's probably going to be my last tweak to this branch.

It is in the source text, and should be in diffz.
@Techatrix Techatrix merged commit 7cf1ba9 into ziglibs:main Aug 31, 2024
@mnemnion mnemnion deleted the memory-management branch September 3, 2024 18:00