Cacheline demote to improve performance #11101
I think we should first check whether CLDEMOTE is available, using CPUID. Can you show CLDEMOTE being used in any other JIT engines? @wxue1 we have been working on a new-generation JIT engine for a while. See https://github.com/dstogov/ir
Thanks, I will check this based on your comment.
Great, more IR optimizations could be added then. I'll take a deeper look at it later. Also, could you help review PR #10897? I have updated it.
Yeah, I saw it. Sorry, I'm short on time right now and that PR is not simple. I'll take a careful look on Monday or Tuesday.
From the software developer's manual: on processors that do not support the CLDEMOTE instruction (including legacy hardware), the instruction is treated as a NOP. I have updated the code to apply it to the tracing JIT only. No other JIT engine has adopted CLDEMOTE yet, but DPDK, LLVM, and other projects support it.
10M/64 of NOPs :(
I see they provide the ability to generate CLDEMOTE but don't use it themselves. A few years ago Intel engineers suggested replacing non-temporal stores into opcache memory with regular ones.
Oh, I think I finally understand how CLDEMOTE may improve overall speed. It hints the CPU to move data out of the L1 and L2 caches into L3, and then the data in L3 can be reused by other CPU cores.
Yes, let me check Zend/zend_cpuinfo.h to support CLDEMOTE.
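For reference, the CPUID check discussed above could look roughly like the sketch below. It is not the code from this patch or from Zend/zend_cpuinfo.h: the helper name `cpu_has_cldemote` is hypothetical, and the sketch assumes GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid_count`. Intel documents the CLDEMOTE feature flag as CPUID leaf 7, sub-leaf 0, ECX bit 25.

```c
#include <stdbool.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* Hypothetical helper (name is an assumption, not from the PHP source):
 * reports whether the CPU advertises CLDEMOTE. */
static bool cpu_has_cldemote(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid_count returns 0 when leaf 7 is not supported at all. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    /* CPUID.(EAX=07H, ECX=0):ECX[bit 25] is the CLDEMOTE feature flag. */
    return (ecx & (1u << 25)) != 0;
#else
    /* Non-x86 targets or other compilers: report "not available". */
    return false;
#endif
}
```

The result would typically be computed once at startup and cached in a flag, so the hot path only tests a boolean.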
You may also try to use CLDEMOTE in ext/opcache/ZendAccelerator.c at the end of cache_script_in_shared_memory() to demote the cached script. Of course, this requires some patch adjustment. shared_cacheline_demote(new_persistent_script->mem, new_persistent_script->size);
Yeah, I tried that before. But PHP stores the script structures in a hash table, which is scattered in memory, so the effect of CLDEMOTE is not obvious.
Once code is emitted to JIT buffer, hint the hardware to demote the corresponding cache lines to more distant level so other CPUs can access them more quickly. This gets nearly 1% performance gain on our workload. Signed-off-by: Xue,Wang <[email protected]> Signed-off-by: Tao,Su <[email protected]> Signed-off-by: Hu,chen <[email protected]>
I updated the patch so that it checks via CPUINFO whether the cldemote instruction is available.
/* hint to the hardware to push out the cache line that contains the linear address */
#if ZEND_JIT_SUPPORT_CLDEMOTE
	if (cpu_support_cldemote && JIT_G(trigger) == ZEND_JIT_ON_HOT_TRACE) {
		shared_cacheline_demote((uintptr_t)entry, size);
	}
#endif
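The body of shared_cacheline_demote() is not shown in this conversation. As a rough sketch of what such a helper has to do, the code below walks a byte range one cache line at a time and invokes a per-line hint. Everything here is an assumption made for illustration: the name `demote_range`, the 64-byte line size, and the callback parameter, which stands in for the real instruction so the sketch compiles without special flags. An actual implementation would presumably issue `_cldemote()` from `<immintrin.h>` (built with `-mcldemote`) for each line.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; 64 bytes on current x86 parts. */
#define CACHE_LINE_SIZE 64

/* Hypothetical helper: calls demote_line once for every cache line that
 * overlaps [start, start + size) and returns the number of lines visited.
 * demote_line may be NULL, in which case the lines are only counted. */
static size_t demote_range(uintptr_t start, size_t size,
                           void (*demote_line)(const void *))
{
    size_t lines = 0;
    uintptr_t end = start + size;
    /* Round the start address down to its cache line boundary. */
    uintptr_t p = start & ~(uintptr_t)(CACHE_LINE_SIZE - 1);

    if (size == 0) {
        return 0;
    }
    for (; p < end; p += CACHE_LINE_SIZE) {
        if (demote_line != NULL) {
            demote_line((const void *)p);
        }
        lines++;
    }
    return lines;
}
```

With this stride, a 10 MiB buffer means 163840 hint instructions, which is the "10M/64" cost raised earlier in the thread and the reason the hint is restricted to the tracing JIT.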
What is the reason to do CLDEMOTE only for TRACEs?
You mentioned that the function JIT may produce more than 10M of code in a single request and that CLDEMOTE takes time, so we only support the tracing JIT.
Thanks! Next I will focus on my other inline function patch and the PHP9 IR.