-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Remove zend_strtod mutex #13974
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove zend_strtod mutex #13974
Conversation
Unsurprisingly the change is very visible in
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Only a shallow review, but couldn't find any mistakes. 👍
#ifdef MULTIPLE_THREADS | ||
static MUTEX_T dtoa_mutex; | ||
static MUTEX_T pow5mult_mutex; | ||
#endif /* ZTS */ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#ifdef MULTIPLE_THREADS | |
static MUTEX_T dtoa_mutex; | |
static MUTEX_T pow5mult_mutex; | |
#endif /* ZTS */ |
You forgot to remove these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the now obsolete usages of the acquire/release macros. I suppose you just did the minimum to draft this for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't plan to remove this and the acquire/release macros as they are part of the "API" of the file, which appears to be a reusable piece of code imported from elsewhere. These macros are documented at the beginning of the file:
Lines 147 to 155 in 077891f
* #define MULTIPLE_THREADS if the system offers preemptively scheduled | |
* multiple threads. In this case, you must provide (or suitably | |
* #define) two locks, acquired by ACQUIRE_DTOA_LOCK(n) and freed | |
* by FREE_DTOA_LOCK(n) for n = 0 or 1. (The second lock, accessed | |
* in pow5mult, ensures lazy evaluation of only one copy of high | |
* powers of 5; omitting this lock would introduce a small | |
* probability of wasting memory, but would otherwise be harmless.) | |
* You must also invoke freedtoa(s) to free the value s returned by | |
* dtoa. You may do so whether or not MULTIPLE_THREADS is #defined. |
There are many knobs like this in this file, many of which we will never use, like KR_headers
.
So I only removed the definition of MULTIPLE_THREADS, and left the default no-op definitions of ACQUIRE_DTOA_LOCK and FREE_DTOA_LOCK.
I don't mind removing their use as well if you think it's better.
I feel that we should eventually replace this code by more modern implementations of strtod and dtoa. It should be possible to implement these without memory allocations. Also I don't know if we still need to support VAX/IBM arithmetic. This feels risky and largely out of scope of this PR however.
Already value of looking into supporting musl, amazing work! |
@arnaud-lb just curious, any change in the perf improvement since you moved to system allocation ? |
@devnexen no, results are the same I switched back to system malloc because |
I've never looked into As I understood they implemented their own malloc cache, then added mutexes to make it thread safe...
Anyway, I don't object against this PR. It doesn't make things worse. |
@dstogov thank you for the review. I've tried here, but the micro benchmark is 20% slower after removing the freelist (2.5% under valgrind). There is no slowdown on other benchmarks, however. Let me know what you prefer. |
Thanks! I'll take a look on Monday. |
@arnaud-lb Im with @dstogov here, the freelists cache is not something you really want to have. it will hide bugs for very little benefit. |
I just asked to test the profitability of Of course, we shouldn't make 20% slowdown even for synthetic tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to admit that my first impression from dtoa() implementation was wrong.
Its freelists cache implementation makes sense.
Making this caches thread-local also makes sense.
This happens because on ZTS we execute `executor_globals_ctor` which reset the `freelist` and `p5s` pointers, while on NTS we don't. On NTS we can reuse the caches but on ZTS we can't, the easiest fix is to call `zend_shutdown_strtod` when preloading is shut down. This regressed in phpGH-13974 and therefore only exists in PHP 8.4 and higher.
This happens because on ZTS we execute `executor_globals_ctor` which reset the `freelist` and `p5s` pointers, while on NTS we don't. On NTS we can reuse the caches but on ZTS we can't, the easiest fix is to call `zend_shutdown_strtod` when preloading is shut down. This regressed in GH-13974 and therefore only exists in PHP 8.4 and higher. Closes GH-16602.
zend_strtod.c
uses a global state (mostly an allocation freelist) protected by a mutex in ZTS builds. This state is used byzend_dtoa()
,zend_strtod()
, and variants. This creates a lot of contention in concurrent loads.zend_dtoa()
is used to format floats to string, e.g. in sprintf, json_encode, serialize, uniqid.In this PR I move the global state to the thread specific
executor_globals
and remove the mutex.The impact on non-concurrent environments is null or negligible, but there is a considerable speed up on concurrent environments, especially on Alpine/Musl. When comparing master to this branch, the frankenphp-demo is sped up 10% under Apache/musl, 20% under FrankenPHP/glibc, and 40% under FrankenPHP/musl. Some synthetic benchmark is 80% faster.
Benchmarks:
I'm using two benchmarks:
/api/monsters.jsonld
). In this benchmark, the frankenphp-demo app is setup in dev modeIn 3 separate environments:
Opcache is enabled in the php-cgi and apache benchmarks, otherwise compilation time dominates. It is disabled in FrankenPHP because it is redundant in worker mode.
Bookworm uses glibc, Alpine (3.19.1) uses musl (1.2.4).
Results:
php-cgi -T10,500 frankenphp-demo repeated 5 times:
Also in Valgrind:
php-cgi -T10,5000 json_encode.php repeated 5 times:
Valgrind:
Apache mpm_event mod_php ZTS frankenphp-demo:
Apache mpm_event mod_php ZTS json_encode.php:
FrankenPHP frankenphp-demo: