Cacheline demote to improve performance #11101

Merged: 1 commit, May 15, 2023

Conversation

wxue1 (Contributor) commented Apr 19, 2023

Once code is emitted to JIT buffer, hint the hardware to demote the corresponding cache lines to more distant level so other CPUs can access them more quickly.
This gets nearly 1% performance gain on our workload.

dstogov (Member) commented Apr 19, 2023

I think we should first check whether CLDEMOTE is available, using CPUID.
Second, I'm not sure we should always use it. Function JIT may produce more than 10M of code in a single request.
Just iterating over this 10M may take time; the 10M won't fit into the cache and will evict something useful; and finally, it is not necessarily needed by the other cores.

Can you demonstrate CLDEMOTE usage in some other JIT engines?

@wxue1 we have been working on a new generation JIT engine for a while. See https://github.com/dstogov/ir
You are welcome there :)

wxue1 (Contributor Author) commented Apr 20, 2023

CLDEMOTE

Thanks, I will check it based on your comment.

> @wxue1 we have been working on a new generation JIT engine for a while. See https://github.com/dstogov/ir You are welcome there :)

Great, more IR optimizations could be added then. I'll dig into this later.

One more thing: could you help review PR #10897? I have updated it.

dstogov (Member) commented Apr 20, 2023

> One more thing: could you help review PR #10897? I have updated it.

Yeah, I saw. Sorry, I'm short on time right now and that PR is not simple. I'll take a careful look on Monday or Tuesday.

wxue1 force-pushed the cacheline_demote branch 2 times, most recently from 1435f91 to 00c6d3d (April 23, 2023 09:03)

wxue1 (Contributor Author) commented Apr 24, 2023

> I think we should first check whether CLDEMOTE is available, using CPUID. Second, I'm not sure we should always use it. Function JIT may produce more than 10M of code in a single request. Just iterating over this 10M may take time; the 10M won't fit into the cache and will evict something useful; and finally, it is not necessarily needed by the other cores.

On processors which do not support the CLDEMOTE instruction (including legacy hardware) the instruction will be treated as a NOP. (from the software development manual)

I updated the code to apply CLDEMOTE for the tracing JIT only.

CLDEMOTE hasn't been adopted by other JIT engines, but DPDK, LLVM and other projects support it.

dstogov (Member) commented Apr 24, 2023

> I think we should first check whether CLDEMOTE is available, using CPUID. Second, I'm not sure we should always use it. Function JIT may produce more than 10M of code in a single request. Just iterating over this 10M may take time; the 10M won't fit into the cache and will evict something useful; and finally, it is not necessarily needed by the other cores.

> On processors which do not support the CLDEMOTE instruction (including legacy hardware) the instruction will be treated as a NOP. (from the software development manual)

10M/64 of NOPs :(
See Zend/zend_cpuinfo.h. It shouldn't be a big problem to detect CLDEMOTE support.

> CLDEMOTE hasn't been adopted by other JIT engines, but DPDK, LLVM and other projects support it.

I see they provide the ability to generate CLDEMOTE but don't use it themselves.

A few years ago, Intel engineers suggested replacing non-temporal stores into opcache memory with regular ones.
See 68185bafbe2
Your current suggestion is somewhat the opposite, or something in between.

dstogov (Member) commented Apr 24, 2023

Oh, I think I finally understand how CLDEMOTE may improve the overall speed. It hints the CPU to evict data from the L1 and L2 caches into L3. Then the data in L3 may be reused by other CPU cores.

wxue1 (Contributor Author) commented Apr 25, 2023

> Oh, I think I finally understand how CLDEMOTE may improve the overall speed. It hints the CPU to evict data from the L1 and L2 caches into L3. Then the data in L3 may be reused by other CPU cores.

Yes, let me check Zend/zend_cpuinfo.h to support CLDEMOTE.

dstogov (Member) commented Apr 25, 2023

You may also try to use CLDEMOTE in ext/opcache/ZendAccelerator.c at the end of cache_script_in_shared_memory() to demote the cached script. Of course, this requires some adjustment of the patch.

	shared_cacheline_demote(new_persistent_script->mem, new_persistent_script->size);
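
For readers following the thread, here is a minimal sketch of what a range-demote helper with the shape of the call above could look like. It is not taken from the merged patch. It assumes 64-byte cache lines, x86-64, and a compiler that ships the `_cldemote` intrinsic (GCC 9+ or Clang 8+); `cache_lines_spanned` and `demote_range` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on current x86-64 parts. */
#define CACHE_LINE_SIZE 64u

/* Only pull in the intrinsic where the compiler is known to provide it. */
#if defined(__x86_64__) && ((defined(__clang__) && __clang_major__ >= 8) || \
	(!defined(__clang__) && defined(__GNUC__) && __GNUC__ >= 9))
# include <immintrin.h>
# define HAVE_CLDEMOTE_INTRINSIC 1
# define CLDEMOTE_TARGET __attribute__((target("cldemote")))
#else
# define HAVE_CLDEMOTE_INTRINSIC 0
# define CLDEMOTE_TARGET
#endif

/* Number of cache lines overlapped by the byte range [start, start + size). */
static size_t cache_lines_spanned(uintptr_t start, size_t size)
{
	uintptr_t mask = ~(uintptr_t)(CACHE_LINE_SIZE - 1);
	uintptr_t first, last;

	if (size == 0) {
		return 0;
	}
	first = start & mask;              /* line holding the first byte */
	last = (start + size - 1) & mask;  /* line holding the last byte */
	return (size_t)((last - first) / CACHE_LINE_SIZE) + 1;
}

/*
 * Hypothetical helper: issue one CLDEMOTE hint per cache line of the buffer.
 * The caller is expected to have checked CPU support beforehand.
 * Returns the number of lines covered; without the intrinsic it only counts.
 */
static CLDEMOTE_TARGET size_t demote_range(const void *buf, size_t size)
{
	size_t lines = cache_lines_spanned((uintptr_t)buf, size);
#if HAVE_CLDEMOTE_INTRINSIC
	uintptr_t line = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
	size_t i;

	for (i = 0; i < lines; i++, line += CACHE_LINE_SIZE) {
		_cldemote((void *)line); /* a hint only; the hardware may ignore it */
	}
#endif
	return lines;
}
```

The loop issues one instruction per line, which is the cost raised earlier in the thread: if "10M" is read as 10 MiB, `cache_lines_spanned(0, 10 * 1024 * 1024)` gives 163840 iterations.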

wxue1 (Contributor Author) commented Apr 25, 2023

> You may also try to use CLDEMOTE in ext/opcache/ZendAccelerator.c at the end of cache_script_in_shared_memory()

Yeah, I tried that before. But PHP stores the script structure in a hash table, which is scattered, so the effect of CLDEMOTE is not noticeable.

Once code is emitted to JIT buffer, hint the hardware to
demote the corresponding cache lines to more distant level
so other CPUs can access them more quickly.
This gets nearly 1% performance gain on our workload.

Signed-off-by: Xue,Wang   <[email protected]>
Signed-off-by: Tao,Su     <[email protected]>
Signed-off-by: Hu,chen    <[email protected]>

wxue1 (Contributor Author) commented May 12, 2023

> Oh, I think I finally understand how CLDEMOTE may improve the overall speed. It hints the CPU to evict data from the L1 and L2 caches into L3. Then the data in L3 may be reused by other CPU cores.

> Yes, let me check Zend/zend_cpuinfo.h to support CLDEMOTE.

I updated the patch to check, via CPUINFO, whether the CLDEMOTE instruction is available.
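
As an illustration of the kind of check discussed here, the following is a minimal standalone sketch of CLDEMOTE detection through CPUID. It does not use `Zend/zend_cpuinfo.h` and is not the code from the patch. It assumes GCC or Clang (for `<cpuid.h>` and `__get_cpuid_count`), and relies on the Intel SDM's definition of the feature flag as CPUID.(EAX=07H, ECX=0):ECX bit 25. The function names are hypothetical.

```c
#if defined(__x86_64__) || defined(__i386__)
# include <cpuid.h> /* GCC/Clang helper for the CPUID instruction */
#endif

/* CPUID leaf 7, subleaf 0, ECX bit 25 reports CLDEMOTE (per the Intel SDM). */
#define CPUID7_ECX_CLDEMOTE (1u << 25)

/* Pure helper: decode the CLDEMOTE bit from a raw ECX value. */
static int ecx_has_cldemote(unsigned int ecx)
{
	return (ecx & CPUID7_ECX_CLDEMOTE) != 0;
}

/* Returns 1 if the running CPU reports CLDEMOTE support, 0 otherwise. */
static int cpu_has_cldemote(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	/* __get_cpuid_count returns 0 when leaf 7 is not available. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
		return 0;
	}
	return ecx_has_cldemote(ecx);
#else
	/* Not an x86 target: the instruction does not exist here. */
	return 0;
#endif
}
```

Running the check once at startup and caching the result in a flag keeps the per-compilation cost to a single branch, which matches the shape of the `cpu_support_cldemote` test in the reviewed snippet.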

Comment on lines +1001 to +1006
/* hint to the hardware to push out the cache line that contains the linear address */
#if ZEND_JIT_SUPPORT_CLDEMOTE
	if (cpu_support_cldemote && JIT_G(trigger) == ZEND_JIT_ON_HOT_TRACE) {
		shared_cacheline_demote((uintptr_t)entry, size);
	}
#endif
Member

What is the reason to do CLDEMOTE only for TRACEs?

Contributor Author

> What is the reason to do CLDEMOTE only for TRACEs?

You mentioned that Function JIT may produce more than 10M of code in a single request and that CLDEMOTE takes time. So we only apply it for the tracing JIT.

dstogov merged commit 6bd5464 into php:master on May 15, 2023

wxue1 (Contributor Author) commented May 15, 2023

Thanks! I will focus on my other inline function patch and PHP9 IR next.
