Remove zend_strtod mutex #13974

Merged
merged 2 commits into from
Apr 23, 2024

Member

@arnaud-lb arnaud-lb commented Apr 15, 2024

zend_strtod.c uses global state (mostly an allocation freelist) protected by a mutex in ZTS builds. This state is used by zend_dtoa(), zend_strtod(), and their variants, which creates a lot of contention under concurrent load. zend_dtoa() is used to format floats as strings, e.g. in sprintf, json_encode, serialize, and uniqid.

In this PR I move the global state to the thread-specific executor_globals and remove the mutex.
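As a rough illustration of that change, the sketch below keeps a dtoa-style block freelist in thread-local storage, so allocating and releasing blocks never takes a lock. This is not the code from the PR: `strtod_state`, `balloc`, `bfree`, and `state_shutdown` are hypothetical names, and C11 `_Thread_local` stands in for php-src's per-thread executor_globals.

```c
#include <stdlib.h>

#define KMAX 7  /* largest size class kept in the cache (assumed value) */

typedef struct Bigint {
    struct Bigint *next;
    int k;              /* size class: room for 2^k words */
    unsigned int x[1];
} Bigint;

/* Hypothetical stand-in for the fields moved into executor_globals. */
typedef struct strtod_state {
    Bigint *freelist[KMAX + 1];
} strtod_state;

/* One copy per thread, so no mutex is needed around it. */
static _Thread_local strtod_state state;

static Bigint *balloc(int k)
{
    Bigint *b;

    if (k < 0 || k > 30) {
        return NULL;
    }
    if (k <= KMAX && (b = state.freelist[k]) != NULL) {
        state.freelist[k] = b->next;        /* reuse a cached block */
    } else {
        size_t words = (size_t)1 << k;
        b = malloc(sizeof(Bigint) + (words - 1) * sizeof(unsigned int));
        if (b == NULL) {
            return NULL;
        }
        b->k = k;
    }
    b->next = NULL;
    return b;
}

static void bfree(Bigint *b)
{
    if (b == NULL) {
        return;
    }
    if (b->k > KMAX) {
        free(b);                            /* too large to cache */
        return;
    }
    b->next = state.freelist[b->k];         /* park it for the next balloc */
    state.freelist[b->k] = b;
}

/* Release everything cached by the calling thread. */
static void state_shutdown(void)
{
    for (int i = 0; i <= KMAX; i++) {
        Bigint *b = state.freelist[i];
        while (b != NULL) {
            Bigint *next = b->next;
            free(b);
            b = next;
        }
        state.freelist[i] = NULL;
    }
}
```

In this sketch each thread pays for its own cache and has to call `state_shutdown()` before it exits, which is the price of removing the shared lock.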

The impact in non-concurrent environments is nil or negligible, but there is a considerable speedup in concurrent environments, especially on Alpine/musl. Comparing master to this branch, frankenphp-demo is about 10% faster under Apache/musl, 20% faster under FrankenPHP/glibc, and 40% faster under FrankenPHP/musl. The synthetic json_encode.php benchmark is over 80% faster under Apache.

Benchmarks:

I'm using two benchmarks:

  • frankenphp-demo (requesting /api/monsters.jsonld). In this benchmark, the frankenphp-demo app is set up in dev mode
  • json_encode.php is a synthetic benchmark encoding an array of 100 floats

In 3 separate environments:

  • php-cgi without concurrency
  • Apache mpm_event mod_php ZTS (100 concurrent requests)
  • FrankenPHP in worker mode (100 concurrent requests)

Opcache is enabled in the php-cgi and Apache benchmarks, because otherwise compilation time dominates. It is disabled in FrankenPHP because it is redundant in worker mode.

Bookworm uses glibc, Alpine (3.19.1) uses musl (1.2.4).

Results:

php-cgi -T10,500 frankenphp-demo repeated 5 times:

master-bookworm:      mean:  1.7227;  stddev:  0.0035;  diff:  +0.00% (baseline)
branch-bookworm:      mean:  1.7195;  stddev:  0.0037;  diff:  -0.19%  

master-alpine:        mean:  1.8676;  stddev:  0.0031;  diff:  -0.00% (baseline)
branch-alpine:        mean:  1.8700;  stddev:  0.0029;  diff:  +0.13%  

master-zts-bookworm:  mean:  1.7909;  stddev:  0.0026;  diff:  -0.00% (baseline)
branch-zts-bookworm:  mean:  1.7943;  stddev:  0.0014;  diff:  +0.19%  

master-zts-alpine:    mean:  1.9928;  stddev:  0.0059;  diff:  -0.00% (baseline)
branch-zts-alpine:    mean:  1.9900;  stddev:  0.0031;  diff:  -0.14%

Also in Valgrind:

valgrind php-cgi -T1,10:

master-bookworm:      mean:  1200273448.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-bookworm:      mean:  1200174784.0000;  stddev:  0.0000;  diff:  -0.01%

master-alpine:        mean:  1193572485.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-alpine:        mean:  1193577574.0000;  stddev:  0.0000;  diff:  +0.00%

master-zts-bookworm:  mean:  1245688708.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-zts-bookworm:  mean:  1245684830.0000;  stddev:  0.0000;  diff:  -0.00%

master-zts-alpine:    mean:  1275329963.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-zts-alpine:    mean:  1275302862.0000;  stddev:  0.0000;  diff:  -0.00%  

php-cgi -T10,5000 json_encode.php repeated 5 times:

master-bookworm:      mean:  0.2541;  stddev:  0.0003;  diff:  -0.00% (baseline)
branch-bookworm:      mean:  0.2532;  stddev:  0.0002;  diff:  -0.33%

master-alpine:        mean:  0.2694;  stddev:  0.0057;  diff:  +0.00% (baseline)
branch-alpine:        mean:  0.2702;  stddev:  0.0098;  diff:  +0.30%

master-zts-bookworm:  mean:  0.4092;  stddev:  0.0014;  diff:  -0.00% (baseline)
branch-zts-bookworm:  mean:  0.2665;  stddev:  0.0023;  diff:  -34.86%

master-zts-alpine:    mean:  0.5691;  stddev:  0.0015;  diff:  -0.00% (baseline)
branch-zts-alpine:    mean:  0.2862;  stddev:  0.0006;  diff:  -49.71%

Valgrind:

master-bookworm:      mean:  67901890.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-bookworm:      mean:  68082973.0000;  stddev:  0.0000;  diff:  +0.27%

master-alpine:        mean:  39943906.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-alpine:        mean:  40117586.0000;  stddev:  0.0000;  diff:  +0.43%

master-zts-bookworm:  mean:  74304991.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-zts-bookworm:  mean:  69262486.0000;  stddev:  0.0000;  diff:  -6.79%

master-zts-alpine:    mean:  45453755.0000;  stddev:  0.0000;  diff:  +0.00% (baseline)
branch-zts-alpine:    mean:  41906922.0000;  stddev:  0.0000;  diff:  -7.80%

Apache mpm_event mod_php ZTS frankenphp-demo:

master-zts-bookworm: 10.863000; +0.00% (baseline)
branch-zts-bookworm: 10.876000; +0.12%

master-zts-alpine: 12.218000; +0.00% (baseline)
branch-zts-alpine: 10.885000; -10.91%

Apache mpm_event mod_php ZTS json_encode.php:

master-zts-bookworm: 1.476000; +0.00% (baseline)
branch-zts-bookworm: 0.228000; -84.55%

master-zts-alpine: 1.499000; +0.00% (baseline)
branch-zts-alpine: 0.243000; -83.79%
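The diff column in these tables appears to be the relative change of the branch against the master baseline; that reading is an assumption based on the numbers, not something the benchmark script is shown to do. A minimal sketch of the arithmetic, with the hypothetical helper name `rel_diff_percent`:

```c
/* Relative change of `branch` versus `baseline`, in percent.
 * A negative result means the branch measured lower than master. */
static double rel_diff_percent(double baseline, double branch)
{
    return (branch - baseline) / baseline * 100.0;
}
```

For the bookworm json_encode.php row, `rel_diff_percent(1.476, 0.228)` gives about -84.55, matching the table. Rows whose means were rounded before printing may differ from this formula in the last digit.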

FrankenPHP frankenphp-demo:

master-bookworm:  77   +0.00% (baseline)
branch-bookworm:  62   -18.99%

master-alpine:    120  +0.00% (baseline)
branch-alpine:    68   -43.57%

@arnaud-lb
Member Author

Unsurprisingly the change is very visible in perf:
(here for FrankenPHP on Alpine)

# Event 'cpu_atom/cycles/P'
#
# Baseline  Delta Abs  Shared Object        Symbol                                                                 
# ........  .........  ...................  .......................................................................
#
    30.24%    -30.23%  [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
     7.80%     +9.31%  libphp.so            [.] execute_ex
     2.39%     +4.45%  libphp.so            [.] zend_gc_collect_cycles
     4.62%     +4.21%  libz.so.1.3.1        [.] 0x0000000000003da8
     3.74%     -3.74%  ld-musl-x86_64.so.1  [.] pthread_mutex_timedlock
     2.90%     -2.89%  ld-musl-x86_64.so.1  [.] pthread_mutex_lock
     0.98%     +2.19%  libphp.so            [.] gc_scan
     1.89%     -1.88%  ld-musl-x86_64.so.1  [.] pthread_mutex_unlock
     1.06%     +1.43%  libphp.so            [.] zend_hash_find
     1.30%     +1.38%  frankenphp           [.] 0x0000000000009f06
     1.76%     +1.34%  ld-musl-x86_64.so.1  [.] memcpy
     1.22%     -1.21%  [kernel.kallsyms]    [k] futex_wake
     1.21%     -1.19%  [kernel.kallsyms]    [k] try_to_wake_up

@arnaud-lb arnaud-lb changed the title Remove zend_strtod mutex [wip] Remove zend_strtod mutex Apr 15, 2024
Member

@iluuu1994 iluuu1994 left a comment

Nice! Only a shallow review, but couldn't find any mistakes. 👍

Comment on lines +555 to 559
#ifdef MULTIPLE_THREADS
static MUTEX_T dtoa_mutex;
static MUTEX_T pow5mult_mutex;
#endif /* ZTS */

Member

Suggested change
#ifdef MULTIPLE_THREADS
static MUTEX_T dtoa_mutex;
static MUTEX_T pow5mult_mutex;
#endif /* ZTS */

You forgot to remove these?

Member

And the now obsolete usages of the acquire/release macros. I suppose you just did the minimum to draft this for now.

Member Author

I didn't plan to remove this and the acquire/release macros as they are part of the "API" of the file, which appears to be a reusable piece of code imported from elsewhere. These macros are documented at the beginning of the file:

php-src/Zend/zend_strtod.c

Lines 147 to 155 in 077891f

* #define MULTIPLE_THREADS if the system offers preemptively scheduled
* multiple threads. In this case, you must provide (or suitably
* #define) two locks, acquired by ACQUIRE_DTOA_LOCK(n) and freed
* by FREE_DTOA_LOCK(n) for n = 0 or 1. (The second lock, accessed
* in pow5mult, ensures lazy evaluation of only one copy of high
* powers of 5; omitting this lock would introduce a small
* probability of wasting memory, but would otherwise be harmless.)
* You must also invoke freedtoa(s) to free the value s returned by
* dtoa. You may do so whether or not MULTIPLE_THREADS is #defined.

There are many knobs like this in this file, many of which we will never use, like KR_headers.

So I only removed the definition of MULTIPLE_THREADS, and left the default no-op definitions of ACQUIRE_DTOA_LOCK and FREE_DTOA_LOCK.

I don't mind removing their use as well if you think it's better.

I feel that we should eventually replace this code by more modern implementations of strtod and dtoa. It should be possible to implement these without memory allocations. Also I don't know if we still need to support VAX/IBM arithmetic. This feels risky and largely out of scope of this PR however.
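To illustrate why leaving the lock macros in place is harmless, here is a minimal sketch of the pattern described above: with MULTIPLE_THREADS undefined, the macros expand to nothing and the call sites compile as if they were not there. The macro names come from the quoted file header; `bump` and `calls` are hypothetical, and the exact fallback definitions in zend_strtod.c may be spelled differently.

```c
/* MULTIPLE_THREADS is intentionally left undefined in this sketch. */

#ifdef MULTIPLE_THREADS
/* A threaded configuration would have to supply real locks here. */
#error "provide ACQUIRE_DTOA_LOCK(n) and FREE_DTOA_LOCK(n)"
#else
#define ACQUIRE_DTOA_LOCK(n) /* nothing */
#define FREE_DTOA_LOCK(n)    /* nothing */
#endif

/* Per-thread state, so the (now empty) critical section needs no lock. */
static _Thread_local int calls;

static int bump(void)
{
    ACQUIRE_DTOA_LOCK(0);   /* expands to an empty statement */
    int v = ++calls;
    FREE_DTOA_LOCK(0);      /* expands to an empty statement */
    return v;
}
```

Keeping the call sites means the imported file stays close to its upstream form, at the cost of a few lines that no longer do anything.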

@dkarlovi

Already value of looking into supporting musl, amazing work!

@devnexen
Member

@arnaud-lb just curious, any change in the perf improvement since you moved to system allocation?

@arnaud-lb
Member Author

@devnexen no, results are the same

I switched back to system malloc because zend_dtoa may be used outside of the request lifecycle via e.g. zend_error("... %f").

@arnaud-lb arnaud-lb changed the title [wip] Remove zend_strtod mutex Remove zend_strtod mutex Apr 17, 2024
@arnaud-lb arnaud-lb marked this pull request as ready for review April 17, 2024 12:05
@arnaud-lb arnaud-lb requested a review from dstogov as a code owner April 17, 2024 12:05
@dstogov
Member

dstogov commented Apr 17, 2024

I've never looked into zend_strtod.c code before and I got "a culture shock" :)

As I understood they implemented their own malloc cache, then added mutexes to make it thread safe...
I would suggest trying to remove these caches (remove the freelists and modify Balloc/Bfree to use [e]malloc/[e]free).

p5s is a linked list that caches precomputed numbers - 5**n where n < 32 - (5, 25, 125, 625, ..., 5**31).
It should be possible to pre-compute 32 numbers...

zend_dtoa() uses a thread-safe variable to keep the resulting string. But we use it in just two places and explicitly free the allocated memory. Switching to explicit [e]malloc and [e]free won't make any difference.

Anyway, I don't object to this PR. It doesn't make things worse.

@arnaud-lb
Member Author

@dstogov thank you for the review. I've tried here, but the micro benchmark is 20% slower after removing the freelist (2.5% under valgrind). There is no slowdown on other benchmarks, however. Let me know what you prefer.

@dstogov
Member

dstogov commented Apr 19, 2024

> @dstogov thank you for the review. I've tried here, but the micro benchmark is 20% slower after removing the freelist (2.5% under valgrind). There is no slowdown on other benchmarks, however. Let me know what you prefer.

Thanks! I'll take a look on Monday.

@crrodriguez
Contributor

@arnaud-lb I'm with @dstogov here: the freelist cache is not something you really want to have. It will hide bugs for very little benefit.

@dstogov
Member

dstogov commented Apr 22, 2024

> @arnaud-lb I'm with @dstogov here: the freelist cache is not something you really want to have. It will hide bugs for very little benefit.

I just asked to test the profitability of the freelists, and @arnaud-lb showed that the benefit is significant: 20%.

Of course, we shouldn't accept a 20% slowdown, even for synthetic tests.
So the idea of per-thread freelist caches makes sense.

Member

@dstogov dstogov left a comment

I have to admit that my first impression of the dtoa() implementation was wrong.
Its freelist cache implementation makes sense.
Making these caches thread-local also makes sense.

@arnaud-lb arnaud-lb merged commit 9bbc195 into php:master Apr 23, 2024
10 checks passed
nielsdos added a commit to nielsdos/php-src that referenced this pull request Oct 25, 2024
nielsdos added a commit that referenced this pull request Oct 28, 2024
This happens because on ZTS we execute `executor_globals_ctor`, which resets the
`freelist` and `p5s` pointers, while on NTS we don't.
On NTS we can reuse the caches but on ZTS we can't, the easiest fix is
to call `zend_shutdown_strtod` when preloading is shut down.
This regressed in GH-13974 and therefore only exists in PHP 8.4 and
higher.

Closes GH-16602.
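The failure mode in that commit message can be sketched in isolation: if a constructor simply resets the cache pointers while blocks are still parked in the cache, those blocks become unreachable, whereas running the shutdown routine first frees them. The code below is a simplified model under that assumption, not php-src code; `cache`, `cache_put`, `cache_ctor`, `cache_shutdown`, and `live_blocks` are hypothetical names standing in for the strtod fields of executor_globals, `executor_globals_ctor`, and `zend_shutdown_strtod`.

```c
#include <stdlib.h>

typedef struct node {
    struct node *next;
} node;

/* Hypothetical stand-in for the strtod caches in executor_globals. */
typedef struct cache {
    node *freelist;
} cache;

static int live_blocks;     /* blocks allocated and not yet freed */

/* Allocate a block and park it in the cache, as a freelist would. */
static void cache_put(cache *c)
{
    node *n = malloc(sizeof *n);
    if (n == NULL) {
        return;
    }
    live_blocks++;
    n->next = c->freelist;
    c->freelist = n;
}

/* Like a globals constructor: resets the pointer, frees nothing. */
static void cache_ctor(cache *c)
{
    c->freelist = NULL;
}

/* Like a shutdown hook: frees every block still in the cache. */
static void cache_shutdown(cache *c)
{
    while (c->freelist != NULL) {
        node *n = c->freelist;
        c->freelist = n->next;
        free(n);
        live_blocks--;
    }
}
```

Calling `cache_ctor()` on a populated cache leaves `live_blocks` above zero with no pointer left to free them; calling `cache_shutdown()` before `cache_ctor()` brings it back to zero, which mirrors the ordering fix described in the commit.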