Measure startup, each bench run and shutdown separately #13162
base: master
Conversation
Force-pushed from 9411ff0 to 082bacd
Force-pushed from dc74b09 to 11adfd2
$profileOutSpecific = $profileOut . '.' . $k;

runCommand([
    'callgrind_annotate',
Any idea how to convince callgrind_annotate to resolve all symbols, especially the ones defined in the SQLite lib? libsqlite3-dev is installed.
You would need to compile with CFLAGS="-g -O2" to get debug symbols. But with multiple repetitions we don't generally care about startup/shutdown; it's not very meaningful for applications run with opcache. If you want to report startup/shutdown, you can do so without repetitions.
Not for SQLite, since we're not compiling its source files.
Debug symbols for Debian packages are commonly available via the corresponding -dbgsym package.
In the Ubuntu Docker image, after apt update, I get only these:
# apt-cache search libsqlite
libsqlite3-0 - SQLite 3 shared library
libsqlite3-dev - SQLite 3 development files
libsqlite3-mod-ceph - SQLite3 VFS for Ceph
libsqlite3-mod-ceph-dev - SQLite3 VFS for Ceph (development files)
libsqlite-tcl - SQLite 2 Tcl bindings
libsqlite0 - SQLite 2 shared library
libsqlite0-dev - SQLite 2 development files
libsqlite3-gst - SQLite bindings for GNU Smalltalk
libsqlite3-mod-blobtoxy - SQLite3 extension module for read-only BLOB to X/Y mapping
libsqlite3-mod-csvtable - SQLite3 extension module for read-only access to CSV files
libsqlite3-mod-impexp - SQLite3 extension module for SQL script, XML, JSON and CSV import/export
libsqlite3-mod-rasterlite2 - SQLite 3 module for huge raster coverages
libsqlite3-mod-spatialite - Geospatial extension for SQLite - loadable module
libsqlite3-mod-virtualpg - Loadable dynamic extension to both SQLite and SpatiaLite
libsqlite3-mod-xpath - SQLite3 extension module for querying XML data with XPath
libsqlite3-mod-zipfile - SQLite3 extension module for read-only access to ZIP files
libsqlite3-ocaml - Embeddable SQL Database for OCaml Programs (runtime)
libsqlite3-ocaml-dev - Embeddable SQL Database for OCaml Programs (development)
libsqlite3-tcl - SQLite 3 Tcl bindings
libsqliteodbc - ODBC driver for SQLite embedded database
Thank you!! That link brought me to https://valgrind.org/docs/manual/manual-core.html#manual-core.debuginfod. Committed in d4fe3ef.
Now I still see some missing sources for the core Linux libs:
--------------------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
--------------------------------------------------------------------------------
./csu/../csu/libc-start.c
./csu/../sysdeps/nptl/libc_start_call_main.h
./csu/./csu/errno-loc.c
./ctype/../include/ctype.h
./debug/./debug/explicit_bzero_chk.c
./elf/../sysdeps/generic/dl-protected.h
./elf/../sysdeps/generic/ldsodefs.h
./elf/../sysdeps/x86_64/dl-machine.h
./elf/../sysdeps/x86_64/dl-runtime.h
./elf/../sysdeps/x86_64/dl-trampoline.h
./elf/./elf/dl-lookup.c
./elf/./elf/dl-runtime.c
./io/../sysdeps/posix/getcwd.c
./io/../sysdeps/unix/syscall-template.S
...
./io/../sysdeps/unix/sysv/linux/write.c
./libio/./libio/fileops.c
./libio/./libio/iofflush.c
./libio/./libio/libioP.h
./malloc/./malloc/arena.c
./malloc/./malloc/malloc.c
./malloc/./malloc/morecore.c
./misc/../sysdeps/unix/syscall-template.S
./misc/../sysdeps/unix/sysv/linux/brk.c
./misc/../sysdeps/unix/sysv/linux/x86_64/syscall.S
./misc/./misc/sbrk.c
./misc/./misc/syslog.c
...
./stdlib/../stdlib/strtol.c
./stdlib/../stdlib/strtol_l.c
./stdlib/./stdlib/getenv.c
./string/../sysdeps/x86_64/multiarch/memchr-avx2.S
...
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/main.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/malloc.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/mem1.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/memjournal.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/mutex.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/mutex_unix.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/notify.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/os.c
...
Any idea?
Is there any evidence that picking the fastest run is the most accurate representation for our benchmark? It would seem that this heavily depends on the distribution / outliers.
I will collect some data using GitHub Actions. In general, the lowest time is usually a good metric, as it is bounded by the fastest reality and immune to random delays caused by nondeterministic events like context switches, allocations, resource delays, etc. So far I have been able to identify the reason why the Symfony and Wordpress times are unstable.
That would be great. We should demonstrate how much more stable the benchmark actually becomes this way.
I suppose that should be ok. @nielsdos Any thoughts on this? In case this is the approach we want to go with: I don't think the approach of dumping stats for every iteration is great, because it makes Valgrind unusable without this benchmarking script. It would be better if we could track the fastest run internally and dump it at the end, but unfortunately Valgrind's API is quite lacking.
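For reference, a minimal sketch of what per-iteration dumping with the Callgrind client-request macros can look like (illustrative only, not the PR's actual code; the label format and function names are made up):

```c
#include <stdio.h>
#include <valgrind/callgrind.h>

/* Illustrative sketch: emit one Callgrind dump per benchmark iteration so a
 * driver script can later pick the fastest (or median) run from the dumps.
 * The client-request macros are no-ops when not running under Valgrind. */
static void run_iterations(void (*handle_request)(void), int repeat)
{
    char label[32];

    for (int i = 0; i < repeat; i++) {
        handle_request();

        /* Dumps the counters collected since the previous dump and resets
         * them, so each dump covers exactly one iteration. */
        snprintf(label, sizeof(label), "run-%d", i);
        CALLGRIND_DUMP_STATS_AT(label);
    }
}
```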
We found the same in #12130 and iluuu1994#99, but the mitigations were not satisfactory. Depending on the layout of the allocated data, SSE can be used more effectively. I do wonder whether it may be worth checking if the code could be optimized by allocating a bit more memory and aligning it manually to make use of SSE. This would fix the benchmark instability and potentially improve performance. If this happens primarily in SQLite, it may be out of our control.
Not an expert, I'd have to refresh my stats course, but anyway at first sight: noise can only make things slower, never faster. So to stabilize benchmarks it can indeed make sense to use the minimum, as is often done. But it can give you false confidence in the results if noise is common for the benchmarked program (as is often the case for I/O-heavy programs). Anyway, I think the minimum could be fine, but it would be nice to know the variance and what the underlying distribution looks like.
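To make that intuition slightly more formal (a simplified model, not a claim about the actual data): if each measured run is deterministic work plus non-negative noise, $T_i = t + N_i$ with $N_i \ge 0$, then

$$\min_i T_i = t + \min_i N_i \;\ge\; t,$$

and the minimum approaches $t$ as the number of runs grows, provided nearly noise-free runs occur with non-zero probability; frequent, heavy noise (e.g. I/O) breaks that assumption.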
I ran 500 benchmarks of each demo. In the figure above (showing the first few run times, excl. startup) you can see that tracing JIT needs about 10 runs for the Ir (instructions) count to stabilize. Non-JIT is stable in about 2-3 runs. In Symfony JIT and JIT/non-JIT Wordpress, there are distant outliers. Distribution of Symfony JIT after the 12.5% percentile of outliers is removed. I did a few experiments with the trimmed mean, but all the results were almost the same as the median - so let's use the median for reporting.
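For reference, the standard definitions being compared here (my notation, not from the PR): with sorted run times $x_{(1)} \le \dots \le x_{(n)}$ and trimming fraction $\alpha$,

$$\bar{x}_\alpha = \frac{1}{\,n - 2\lfloor \alpha n \rfloor\,} \sum_{i=\lfloor \alpha n \rfloor + 1}^{\,n - \lfloor \alpha n \rfloor} x_{(i)},$$

with $\alpha = 0$ giving the plain mean and $\alpha \to 0.5$ approaching the median.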
Force-pushed from 71088d9 to a978f8c
Before this PR, when there were no […]. In a following PR, I will improve the benchmarking further to collect avg and sd for repeated startup/shutdown as well, and to store all of this data in the benchmarking repo. For now, let me know if there are more questions about this PR. It prelands some changes and makes the single reported Ir value more stable:
Thanks for investigating! Can you clarify what you mean by "x% more stable"? How did you gather these numbers?
Since you say that the results are similar to the trimmed mean, and we're already using the mean (untrimmed, but primed), I wouldn't expect there to be a big difference.
    /* measure shutdown separately */
    CALLGRIND_DUMP_STATS;
#endif
As mentioned, I don't think dumping stats to stdout/stderr repeatedly is a good solution, because it makes Valgrind useless for all other use-cases where this script is not used. A better solution would be if we could calculate the mean or whatever metric we find most useful, from C, and output that when the script terminates.
Valgrind allows specifying the output file using --log-file="filename". This would allow accessing the output from C. Unfortunately, there seems to be no way to set this programmatically, so it would have to be set manually when invoking PHP with Valgrind.
Yes, --callgrind-out-file= cannot be accessed/detected as it is not exported.
What do you want to achieve - does CALLGRIND_DUMP_STATS dump stats even if Callgrind is not enabled? Also notice that all CALLGRIND_DUMP_STATS calls are made only if the CGI benchmark (-T) is enabled.
A better solution would be if we could calculate the mean or whatever metric we find most useful, from C, and output that when the script terminates.
This is not possible with Valgrind, and it's also probably not wanted, as having multiple Callgrind files is helpful for investigating time differences.
What I mean is that Dmitry and others use valgrind php-cgi -Tn ... for things other than Symfony/Wordpress/etc. However, in this case the script will now dump the stats in every iteration, requiring manually extracting the values and calculating some mean. So the output becomes essentially useless.
In 9d77734 I have put the CALLGRIND_DUMP_STATS call behind an env var condition. If the env var is not present, nothing is dumped during the run.
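For illustration, the gating could look roughly like this (a sketch only; the env var name below is hypothetical, the real one is whatever 9d77734 introduces):

```c
#include <stdlib.h>
#include <valgrind/callgrind.h>

/* Sketch: only emit the per-iteration dump when explicitly requested, so a
 * plain `valgrind php-cgi -T ...` invocation still produces a single profile.
 * PHP_VALGRIND_DUMP_PER_RUN is a made-up name for illustration. */
static void dump_stats_if_requested(void)
{
    if (getenv("PHP_VALGRIND_DUMP_PER_RUN") != NULL) {
        CALLGRIND_DUMP_STATS;
    }
}
```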
@iluuu1994 can I ask you to review this PR?
benchmark/benchmark.php
Outdated
@@ -63,7 +63,8 @@ function runSymfonyDemo(bool $jit): array {
     cloneRepo($dir, 'https://github.com/php/benchmarking-symfony-demo-2.2.3.git');
     runPhpCommand([$dir . '/bin/console', 'cache:clear']);
     runPhpCommand([$dir . '/bin/console', 'cache:warmup']);
-    return runValgrindPhpCgiCommand('symfony-demo', [$dir . '/public/index.php'], cwd: $dir, jit: $jit, warmup: 50, repeat: 50);
+
+    return runValgrindPhpCgiCommand('symfony-demo', [$dir . '/public/index.php'], cwd: $dir, jit: $jit, repeat: 50);
Removing the warmup would negatively impact the statistical mean. You should get more accurate results if you keep the warmup.
CGI warmup is the same as a normal CGI repeat, with the exception of how the time is reported. This is not relevant to the Ir/Valgrind benchmarking.
If you mean to solely skip measuring the first 50 runs and then measure the following 50 runs, this is not needed, as I have verified the results and discussed them in #13162 (comment) - basically non-JIT Ir is stable after 2 runs and JIT Ir is stable after about 10 runs.
I compared the current mean (of 490 data points for each benchmark; the first 10 were always excluded) with the trimmed mean and divided that by the standard deviation. The differences are thus relative to the sd and explained by the outliers shown in the charts above.
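Expressed as a formula, the comparison above is (with $\bar{x}$ the plain mean over the 490 data points, $\bar{x}_\alpha$ the trimmed mean, and $s$ the sample standard deviation):

$$\Delta = \frac{\lvert \bar{x} - \bar{x}_\alpha \rvert}{s}$$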
Ok, thanks. Unfortunately, the 1.1% improvement for Symfony is not very significant. We don't have much trouble with the other benchmarks. I'll investigate to see if the following is possible:
Force-pushed from 035a222 to 9d77734
Force-pushed from a9fd686 to a267b33
Dear maintainers, can you please review this PR? It improves the collection to capture metrics per script run, allowing robust statistics to be applied. Apart from more robust/stable results, the benchmarking should work as before. In a following PR, I plan to also measure the startup/1st (compile) run multiple times to monitor that too. For that, I would like to land this PR first.
I appreciate your effort on this. However, it didn't seem like this significantly improved stability in your last comment? I see that it improves Wordpress significantly more than Symfony. However, Wordpress was never a problem to begin with. See https://nielsdos.github.io/php-benchmark-visualisation/ - Wordpress looks quite predictable. Symfony has very large outliers, being ~0.5% faster. (Edit: Although tbh it has looked much better lately.) Please correct me if I'm wrong. I wonder if it would help to run the benchmark without ASLR, and possibly without […]
@iluuu1994 the improvement is much better - see https://3v4l.org/Ugnd8#v8.3.4. The numbers there were collected using GH Actions for runs 10-500 (the first 10 runs were excluded so as not to skew the distributions; this is what the original impl. did with the "warmup iterations"). The absolute avg vs. median error is more than 50x better for the JITted variants. Can we agree that the median is a much better metric for benchmarking comparisons/charts? If yes, can this PR be reviewed so I know exactly what to do in order to get it merged? Thank you.
Turns out disabling ASLR is indeed possible. #13769 Let's see if that resolves the issue fully, or whether this PR is still necessary.
Disabling ASLR seems to have no improvement on the stddev. As presented with the data in #13162 (comment), this PR has a significant positive impact on the benchmark stddev. And with individual measurements, statistics can easily be plotted/computed.
@mvorisek Fair enough, I will have a look in a couple of weeks. Feel free to ping me in 2-3 weeks if you don't get a response.
I concur. I disabled ASLR in my benchmark as well, but I didn't get a noticeable improvement in std dev either.
(The two benchmarks didn't test the same PHP commits, though.)
Individual times for each run should be collected to plot data with robust statistics (like confidence intervals).
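As one concrete option (a standard distribution-free construction, not something specified in this PR): with sorted per-run times $x_{(1)} \le \dots \le x_{(n)}$, an order-statistic interval gives a confidence interval for the median,

$$P\!\left(x_{(j)} \le \operatorname{median} \le x_{(n+1-j)}\right) \;\ge\; 1 - 2\sum_{i=0}^{j-1}\binom{n}{i}2^{-n}.$$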