Measure startup, each bench run and shutdown separately #13162
base: master
Conversation
Force-pushed from 9411ff0 to 082bacd
Force-pushed from dc74b09 to 11adfd2
$profileOutSpecific = $profileOut . '.' . $k;

runCommand([
    'callgrind_annotate',
Any idea how to convince callgrind_annotate to resolve all symbols, especially the ones defined in the SQLite lib? libsqlite3-dev is installed.
You would need to compile with CFLAGS="-g -O2" to get debug symbols. But with multiple repetitions we don't generally care about startup/shutdown; it's not very meaningful for applications run with opcache. If you want to report startup/shutdown, you can do so without repetitions.
Not for SQLite, since we're not compiling its source files.
Debug symbols for Debian packages are commonly available via the corresponding -dbgsym package.
In the Ubuntu Docker image, after apt update, I get only these:
# apt-cache search libsqlite
libsqlite3-0 - SQLite 3 shared library
libsqlite3-dev - SQLite 3 development files
libsqlite3-mod-ceph - SQLite3 VFS for Ceph
libsqlite3-mod-ceph-dev - SQLite3 VFS for Ceph (development files)
libsqlite-tcl - SQLite 2 Tcl bindings
libsqlite0 - SQLite 2 shared library
libsqlite0-dev - SQLite 2 development files
libsqlite3-gst - SQLite bindings for GNU Smalltalk
libsqlite3-mod-blobtoxy - SQLite3 extension module for read-only BLOB to X/Y mapping
libsqlite3-mod-csvtable - SQLite3 extension module for read-only access to CSV files
libsqlite3-mod-impexp - SQLite3 extension module for SQL script, XML, JSON and CSV import/export
libsqlite3-mod-rasterlite2 - SQLite 3 module for huge raster coverages
libsqlite3-mod-spatialite - Geospatial extension for SQLite - loadable module
libsqlite3-mod-virtualpg - Loadable dynamic extension to both SQLite and SpatiaLite
libsqlite3-mod-xpath - SQLite3 extension module for querying XML data with XPath
libsqlite3-mod-zipfile - SQLite3 extension module for read-only access to ZIP files
libsqlite3-ocaml - Embeddable SQL Database for OCaml Programs (runtime)
libsqlite3-ocaml-dev - Embeddable SQL Database for OCaml Programs (development)
libsqlite3-tcl - SQLite 3 Tcl bindings
libsqliteodbc - ODBC driver for SQLite embedded database
Thank you!! That link brought me to https://valgrind.org/docs/manual/manual-core.html#manual-core.debuginfod. Committed in d4fe3ef.
Now I still see some missing sources for the core Linux libs:
--------------------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
--------------------------------------------------------------------------------
./csu/../csu/libc-start.c
./csu/../sysdeps/nptl/libc_start_call_main.h
./csu/./csu/errno-loc.c
./ctype/../include/ctype.h
./debug/./debug/explicit_bzero_chk.c
./elf/../sysdeps/generic/dl-protected.h
./elf/../sysdeps/generic/ldsodefs.h
./elf/../sysdeps/x86_64/dl-machine.h
./elf/../sysdeps/x86_64/dl-runtime.h
./elf/../sysdeps/x86_64/dl-trampoline.h
./elf/./elf/dl-lookup.c
./elf/./elf/dl-runtime.c
./io/../sysdeps/posix/getcwd.c
./io/../sysdeps/unix/syscall-template.S
...
./io/../sysdeps/unix/sysv/linux/write.c
./libio/./libio/fileops.c
./libio/./libio/iofflush.c
./libio/./libio/libioP.h
./malloc/./malloc/arena.c
./malloc/./malloc/malloc.c
./malloc/./malloc/morecore.c
./misc/../sysdeps/unix/syscall-template.S
./misc/../sysdeps/unix/sysv/linux/brk.c
./misc/../sysdeps/unix/sysv/linux/x86_64/syscall.S
./misc/./misc/sbrk.c
./misc/./misc/syslog.c
...
./stdlib/../stdlib/strtol.c
./stdlib/../stdlib/strtol_l.c
./stdlib/./stdlib/getenv.c
./string/../sysdeps/x86_64/multiarch/memchr-avx2.S
...
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/main.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/malloc.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/mem1.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/memjournal.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/mutex.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/mutex_unix.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/notify.c
/build/sqlite3-sDRO1D/sqlite3-3.37.2/./src/os.c
...
Any idea?
Is there any evidence that picking the fastest run is the most accurate representation for our benchmark? It would seem that this heavily depends on the distribution / outliers.
I will collect some data using GitHub Actions. In general, the lowest time is usually a good metric, as it is bounded by the fastest reality and immune to random delays caused by nondeterministic events like context switches, allocations, resource delays, etc. So far I have been able to identify the reason why the Symfony and Wordpress times are unstable.
That would be great. We should demonstrate how much more stable the benchmark actually becomes this way.
I suppose that should be ok. @nielsdos Any thoughts on this? In case this is the approach we want to go with: I don't think the approach of dumping stats for every iteration is great, because it makes Valgrind unusable without this benchmarking script. It would be better if we could track the fastest run internally and dump it at the end, but unfortunately Valgrind's API is quite lacking.
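For reference, a minimal sketch of what per-iteration dumping with the Callgrind client-request macros can look like (illustrative only, not the PR's actual code; the label format and function names are made up):

```c
#include <stdio.h>
#include <valgrind/callgrind.h>

/* Illustrative sketch: emit one Callgrind dump per benchmark iteration so a
 * driver script can later pick the fastest (or median) run from the dumps.
 * The client-request macros are no-ops when not running under Valgrind. */
static void run_iterations(void (*handle_request)(void), int repeat)
{
    char label[32];

    for (int i = 0; i < repeat; i++) {
        handle_request();

        /* Dumps the counters collected since the previous dump and resets
         * them, so each dump covers exactly one iteration. */
        snprintf(label, sizeof(label), "run-%d", i);
        CALLGRIND_DUMP_STATS_AT(label);
    }
}
```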
We found the same in #12130 and iluuu1994#99, but the mitigations were not satisfactory. Depending on the layout of the allocated data, SSE can be used more effectively. I do wonder whether it may be worth checking if the code could be optimized by allocating a bit more memory and aligning it manually to make use of SSE. This would fix the benchmark instability and potentially improve performance. If this happens primarily in SQLite, it may be out of our control.
Not an expert, I'd have to refresh my stats course, but anyway at first sight: noise can only make things slower, never faster. So to stabilize benchmarks it can indeed make sense to use the minimum, as is often done. But it can give you false confidence in the results if noise is common for the benchmarked program (as is often the case for I/O-heavy programs). Anyway, I think the minimum could be fine, but it would be nice to know the variance and what the underlying distribution looks like.
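To make that intuition slightly more formal (a simplified model, not a claim about the actual data): if each measured run is deterministic work plus non-negative noise, $T_i = t + N_i$ with $N_i \ge 0$, then

$$\min_i T_i = t + \min_i N_i \;\ge\; t,$$

and the minimum approaches $t$ as the number of runs grows, provided nearly noise-free runs occur with non-zero probability; frequent, heavy noise (e.g. I/O) breaks that assumption.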
I ran 500 benchmarks of each demo. In the figure above (showing the first few run times, excl. startup) you can see that tracing JIT needs about 10 runs for the Ir (instructions) count to stabilize. Non-JIT is stable in about 2-3 runs. In Symfony JIT and JIT/non-JIT Wordpress, there are distant outliers. Distribution of Symfony JIT after the 12.5% percentile of outliers is removed. I did a few experiments with the trimmed mean, but all the results were almost the same as the median - so let's use the median for reporting.
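For reference, the standard definitions being compared here (my notation, not from the PR): with sorted run times $x_{(1)} \le \dots \le x_{(n)}$ and trimming fraction $\alpha$,

$$\bar{x}_\alpha = \frac{1}{\,n - 2\lfloor \alpha n \rfloor\,} \sum_{i=\lfloor \alpha n \rfloor + 1}^{\,n - \lfloor \alpha n \rfloor} x_{(i)},$$

with $\alpha = 0$ giving the plain mean and $\alpha \to 0.5$ approaching the median.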
Force-pushed from 71088d9 to a978f8c
Before this PR, when there were no […]. In a following PR, I will improve the benchmarking further to collect avg and sd for repeated startup/shutdown as well, and to store all of this data in the benchmarking repo. For now, let me know if there are more questions about this PR. It prelands some changes and makes the single reported Ir value more stable:
Thanks for investigating! Can you clarify what you mean by "x% more stable"? How did you gather these numbers?
Since you say that the results are similar to the trimmed mean, and we're already using the mean (untrimmed, but primed), I wouldn't expect there to be a big difference.
    /* measure shutdown separately */
    CALLGRIND_DUMP_STATS;
#endif
As mentioned, I don't think dumping stats to stdout/stderr repeatedly is a good solution, because it makes Valgrind useless for all other use-cases where this script is not used. A better solution would be if we could calculate the mean or whatever metric we find most useful, from C, and output that when the script terminates.
Valgrind allows specifying the output file using --log-file="filename". This would allow accessing the output from C. Unfortunately, there seems to be no way to set this programmatically, so it would have to be set manually when invoking PHP with Valgrind.
Yes, --callgrind-out-file= cannot be accessed/detected as it is not exported.
What do you want to achieve - does CALLGRIND_DUMP_STATS dump stats even if Callgrind is not enabled? Also notice that all CALLGRIND_DUMP_STATS calls are made only if the CGI benchmark (-T) is enabled.
A better solution would be if we could calculate the mean or whatever metric we find most useful, from C, and output that when the script terminates.
This is not possible with Valgrind, and it's also probably not wanted, as having multiple Callgrind files is helpful for investigating time differences.
What I mean is that Dmitry and others use valgrind php-cgi -Tn ... for things other than Symfony/Wordpress/etc. However, in this case the script will now dump the stats in every iteration, requiring manually extracting the values and calculating some mean. So the output becomes essentially useless.
In 9d77734 I have put the CALLGRIND_DUMP_STATS call behind an env var condition. If the env var is not present, nothing is dumped during the run.
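For illustration, the gating could look roughly like this (a sketch only; the env var name below is hypothetical, the real one is whatever 9d77734 introduces):

```c
#include <stdlib.h>
#include <valgrind/callgrind.h>

/* Sketch: only emit the per-iteration dump when explicitly requested, so a
 * plain `valgrind php-cgi -T ...` invocation still produces a single profile.
 * PHP_VALGRIND_DUMP_PER_RUN is a made-up name for illustration. */
static void dump_stats_if_requested(void)
{
    if (getenv("PHP_VALGRIND_DUMP_PER_RUN") != NULL) {
        CALLGRIND_DUMP_STATS;
    }
}
```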
@iluuu1994 can I ask you to review this PR?
benchmark/benchmark.php
Outdated
@@ -63,7 +63,8 @@ function runSymfonyDemo(bool $jit): array {
     cloneRepo($dir, 'https://github.com/php/benchmarking-symfony-demo-2.2.3.git');
     runPhpCommand([$dir . '/bin/console', 'cache:clear']);
     runPhpCommand([$dir . '/bin/console', 'cache:warmup']);
-    return runValgrindPhpCgiCommand('symfony-demo', [$dir . '/public/index.php'], cwd: $dir, jit: $jit, warmup: 50, repeat: 50);
+
+    return runValgrindPhpCgiCommand('symfony-demo', [$dir . '/public/index.php'], cwd: $dir, jit: $jit, repeat: 50);
Removing the warmup would negatively impact the statistical mean. You should get more accurate results if you keep the warmup.
CGI warmup is the same as a normal CGI repeat, with the exception of how the time is reported. This is not relevant to the Ir/Valgrind benchmarking.
If you mean to solely skip measuring the first 50 runs and then measure the following 50 runs, this is not needed, as I have verified the results and discussed them in #13162 (comment) - basically non-JIT Ir is stable after 2 runs and JIT Ir is stable after about 10 runs.
I compared the current mean (of 490 data points for each benchmark; the first 10 were always excluded) with the trimmed mean and divided that by the standard deviation. The differences are thus relative to the sd and explained by the outliers shown in the charts above.
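Expressed as a formula, the comparison above is (with $\bar{x}$ the plain mean over the 490 data points, $\bar{x}_\alpha$ the trimmed mean, and $s$ the sample standard deviation):

$$\Delta = \frac{\lvert \bar{x} - \bar{x}_\alpha \rvert}{s}$$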
Ok, thanks. Unfortunately, the 1.1% improvement for Symfony is not very significant. We don't have much trouble with the other benchmarks. I'll investigate to see if the following is possible:
Force-pushed from 035a222 to 9d77734
Force-pushed from a9fd686 to a267b33
Dear maintainers, can you please review this PR? It improves the collection to capture metrics per script run, allowing robust statistics to be applied. Apart from more robust/stable results, the benchmarking should work as before. In a following PR, I plan to also measure the startup/1st (compile) run multiple times to monitor that too. For that, I would like to land this PR first.
I appreciate your effort on this. However, it didn't seem like this significantly improved stability in your last comment? I see that it improves Wordpress significantly more than Symfony. However, Wordpress was never a problem to begin with. See https://nielsdos.github.io/php-benchmark-visualisation/ - Wordpress looks quite predictable. Symfony has very large outliers, being ~0.5% faster. (Edit: Although tbh it has looked much better lately.) Please correct me if I'm wrong. I wonder if it would help to run the benchmark without ASLR, and possibly without […]
@iluuu1994 the improvement is much better - see https://3v4l.org/Ugnd8#v8.3.4. The numbers there were collected using GH Actions for runs 10-500 (the first 10 runs were excluded so as not to skew the distributions; this is what the original impl. did with the "warmup iterations"). The absolute avg vs. median error is more than 50x better for the JITted variants. Can we agree that the median is a much better metric for benchmarking comparisons/charts? If yes, can this PR be reviewed so I know exactly what to do in order to get it merged? Thank you.
Turns out disabling ASLR is indeed possible. #13769 Let's see if that resolves the issue fully, or whether this PR is still necessary.
Disabling ASLR seems to have no improvement on the stddev. As presented with the data in #13162 (comment), this PR has a significant positive impact on the benchmark stddev. And with individual measurements, statistics can easily be plotted/computed.
@mvorisek Fair enough, I will have a look in a couple of weeks. Feel free to ping me in 2-3 weeks if you don't get a response.
I concur. I disabled ASLR in my benchmark as well, but I didn't get a noticeable improvement in std dev either.
(The two benchmarks didn't test the same PHP commits, though.)
Individual times for each run should be collected to plot data with robust statistics (like confidence intervals).
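As one concrete option (a standard distribution-free construction, not something specified in this PR): with sorted per-run times $x_{(1)} \le \dots \le x_{(n)}$, an order-statistic interval gives a confidence interval for the median,

$$P\!\left(x_{(j)} \le \operatorname{median} \le x_{(n+1-j)}\right) \;\ge\; 1 - 2\sum_{i=0}^{j-1}\binom{n}{i}2^{-n}.$$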