JIT deepest function firstly #10897

Status: Open. Wants to merge 1 commit into base: master.

wxue1 (Contributor) commented Mar 21, 2023

The following code:

<?php
function add($x){
    return $x + 1;
}

$x = 0;
for($i = 0; $i < 1000; $i++) {
    if($i % 10 == 0) {
        $x = add($x);
    }
    if($i % 8 == 0) {
        $x = add($x);
    }
}
echo("x=$x\n");
?>

The JIT generates duplicated code in this case:

---- TRACE 2 start (side trace 1/7) $main() /home/wxue/php-src/t2.php:12
0013 INIT_FCALL 1 112 string("add")
     >init add
0014 SEND_VAR CV0($x) 1 ; op1(int)
0015 V2 = DO_UCALL
     >enter add
0001  T1 = ADD CV0($x) int(1) ; op1(int)
0002  RETURN T1 ; op1(int)
     <back /home/wxue/php-src/t2.php
0016 ASSIGN CV0($x) V2 ; op1(int) op2(int)
0017 PRE_INC CV1($i) ; op1(int)
---- TRACE 2 stop (link to 1)

---- TRACE 3 start (side trace 1/4) $main() /home/wxue/php-src/t2.php:9
0006 INIT_FCALL 1 112 string("add")
     >init add
0007 SEND_VAR CV0($x) 1 ; op1(int)
0008 V2 = DO_UCALL
     >enter add
0001  T1 = ADD CV0($x) int(1) ; op1(int)
0002  RETURN T1 ; op1(int)
     <back /home/wxue/php-src/t2.php
0009 ASSIGN CV0($x) V2 ; op1(int) op2(int)
0010 T3 = MOD CV1($i) int(8) ; op1(int)
0011 T2 = IS_EQUAL T3 int(0) ; op1(int)
0012 ;JMPZ T2 0017
0017 PRE_INC CV1($i) ; op1(int)
---- TRACE 3 stop (link to 1)

But I expected the add function to be JITTed only once.

Duplicated JITTed code puts extra pressure on the instruction cache. Because the same function is sometimes JITTed in several root traces or side traces, this patch reduces duplication by JITting the deepest inlined function first.
It improves the performance of our workload by 3% in tracing mode.

@wxue1 wxue1 requested review from dstogov and iluuu1994 as code owners March 21, 2023 06:33
@wxue1 wxue1 changed the title JIT deepest function firstly #10896 JIT deepest function firstly Mar 21, 2023
dstogov (Member) commented Mar 21, 2023

This needs careful review and testing.
On one hand this may reduce the size of the generated code. On the other hand this may disable inlining of functions into a hot loop.

wxue1 (Contributor Author) commented Mar 21, 2023

> This needs careful review and testing. On one hand this may reduce the size of the generated code. On the other hand this may disable inlining of functions into a hot loop.

Yes. The inlined function is JITTed first, and the loop is JITTed afterwards if it continues to run.

The log looks like this:
---- TRACE 2 start (loop) $main() /home/wxue/php-src/dup.php:10
0014 T2 = IS_SMALLER CV1($i) int(1000) ; op1(int)
0015 ;JMPNZ T2 0009
0009 INIT_FCALL 1 112 string("add")
>init add
0010 SEND_VAR CV0($x) 1 ; op1(int)
0011 V2 = DO_UCALL
>enter add
---- TRACE 2 abort (JIT deeper function and skip current trace)
---- TRACE 2 start (enter) add() /home/wxue/php-src/dup.php:3
0001 T1 = ADD CV0($x) int(1) ; op1(int)
0002 RETURN T1 ; op1(int)
---- TRACE 2 stop (return)
...
---- TRACE 3 start (loop) $main() /home/wxue/php-src/dup.php:10
0014 T2 = IS_SMALLER CV1($i) int(1000) ; op1(int)
0015 ;JMPNZ T2 0009
0009 INIT_FCALL 1 112 string("add")
>init add
0010 SEND_VAR CV0($x) 1 ; op1(int)
0011 V2 = DO_UCALL
>enter add
---- TRACE 3 stop (link to 2)

@wxue1 wxue1 force-pushed the JIT_duplication branch from fa09d4e to df5cdce Compare March 21, 2023 07:39
dstogov (Member) commented Mar 21, 2023

> Yes. The inlined function is JITTed first, and the loop is JITTed afterwards if it continues to run.

This is not what I would like to do. This disables inlining and reduces specialization.

dstogov (Member) commented Mar 21, 2023

I think the better decision would be to fall back to compiling the inlined function on its own if its trace becomes too long. All the specializations inherited from the caller (e.g. types of arguments) have to be discarded.

wxue1 (Contributor Author) commented Mar 22, 2023

> This disables inlining and reduces specialization.

Yeah, it seems a little aggressive. But with this patch a function is JITTed twice if arguments of different types are passed to it, which is consistent with the existing side-exit design. It is also convenient to link to the JITTed function later.

The JIT log looks like this:
---- TRACE 1 start (loop) $main() /home/wxue/php-src/dup.php:7
0008 T2 = IS_SMALLER CV1($i) int(1000) ; op1(int)
0009 ;JMPNZ T2 0003
0003 INIT_FCALL 1 112 string("add")
>init add
0004 SEND_VAR CV0($x) 1 ; op1(int)
0005 V2 = DO_UCALL
>enter add
---- TRACE 1 abort (JIT deeper function and skip current trace)
---- TRACE 1 start (enter) add() /home/wxue/php-src/dup.php:3 <--- JIT add() with int argument
0001 T1 = ADD CV0($x) int(1) ; op1(int)
0002 RETURN T1 ; op1(int)
---- TRACE 1 stop (return)

---- TRACE 5 start (side trace 1/0) add() /home/wxue/php-src/dup.php:3 <-- JIT add() with float argument
0001 T1 = ADD CV0($x) int(1) ; op1(float)
0002 RETURN T1 ; op1(float)
---- TRACE 5 stop (return)

dstogov (Member) commented Mar 22, 2023

TRACE 5 in your example is a side trace of TRACE 1. So when we call add() we always enter TRACE 1, check for an int argument, perform a side exit to TRACE 5, check for a float argument, and only then continue execution. When add() is inlined we may completely eliminate these type checks and jumps.

It makes sense to avoid inlining of big functions, but not to avoid inlining altogether.
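The guard chain described in this comment can be modelled with a small C sketch. This is only an illustration of the control flow, not the code the JIT emits: the `value` type, `add_standalone`, `add_inlined_int`, and the `guards_run` counter are all hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TYPE_INT, TYPE_FLOAT } value_type;

typedef struct {
    value_type type;
    union { long i; double d; } as;
} value;

/* Hypothetical model of add() compiled on its own: a root trace
 * specialized for int, plus a side trace for float that is only
 * reached after the int guard fails (a side exit). */
value add_standalone(value x, int *guards_run)
{
    assert(guards_run != NULL);

    (*guards_run)++;              /* root trace: guard "is int?" */
    if (x.type == TYPE_INT) {
        x.as.i += 1;
        return x;
    }

    (*guards_run)++;              /* side trace: guard "is float?" */
    if (x.type == TYPE_FLOAT) {
        x.as.d += 1.0;
    }
    return x;                     /* other types are not modelled */
}

/* Hypothetical model of add() inlined into a caller that already
 * knows the argument is an int: no guard is needed at all. */
long add_inlined_int(long x)
{
    return x + 1;
}
```

In this model an int argument passes one guard and a float argument passes two before any work is done, while the inlined variant runs none. How many checks the real JIT keeps depends on what it can prove about the caller.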

wxue1 (Contributor Author) commented Mar 22, 2023

> TRACE 5 in your example is a side trace of TRACE 1. So when we call add() we always enter TRACE 1, check for an int argument, perform a side exit to TRACE 5, check for a float argument, and only then continue execution. When add() is inlined we may completely eliminate these type checks and jumps.
>
> It makes sense to avoid inlining of big functions, but not to avoid inlining altogether.

Actually, the type check is needed in the final JITTed code both with and without our patch.
In the screenshot, the left side shows add() JITTed on its own with an int argument, and the right side shows a JITTed loop that includes add().
[image: side-by-side screenshot of the generated code]

Yes, I admit that the number of direct jumps increases. But this overhead is acceptable since a direct call causes fewer branch mispredictions.

In our WordPress workload, the total JIT buffer size used was reduced by nearly 40%.
The WordPress performance gain is 3% on SPR (Intel) and 1% on Milan (AMD), and the swoole workload gains 1% on SPR.
Maybe more tests are needed?

dstogov (Member) commented Mar 22, 2023

If the type of an argument is known on the caller side, we don't need to check it on the inlined callee side. In your example $x is a global variable; in general it may be a reference and it may be changed indirectly at any time. If you wrap the "global" code into a function main(), that type check disappears.

<?php
function add($x){
    return $x + 1;
}

function main() {
    $x = 0;
    for($i = 0; $i < 1000; $i++) {
        if($i % 10 == 0) {
            $x = add($x);
        }
        if($i % 8 == 0) {
            $x = add($x);
        }
    }
    echo("x=$x\n");
}
main();
?>

You may also compare the RETURN code on the left and right in the screenshots above. The code on the left (with your patch) additionally checks whether EX(return_value) is NULL, and then checks for rare EX_CALL_INFO() flags.

So in this example a single inlining saves at least 12 executed instructions: 2 indirect jumps between traces, 3 conditional jumps, 2 EG(jit_trace_num) assignments. It also avoids generating never-executed code for rare RETURN cases.

Tracing JIT duplicates code by design, and your proposed patch is definitely not good enough.

Actually, you may achieve a similar result by changing the opcache.jit_hot_loop and opcache.jit_hot_func settings. Set hot_func=10 and hot_loop=20 to JIT functions first. Interestingly, for this example the version with inlining uses less code.

$ sapi/cli/php -d opcache.jit_debug=0x200 -d opcache.jit_hot_loop=20 -d opcache.jit_hot_func=10 add.php
x=225
JIT memory usage: 3584

$ sapi/cli/php -d opcache.jit_debug=0x200 add.php
x=225
JIT memory usage: 3536

I see your intention, and I agree that we may limit inlining of big functions, but this should be done in some other way.

wxue1 (Contributor Author) commented Mar 23, 2023

> I see your intention, and I agree that we may limit inlining of big functions, but this should be done in some other way.

Thanks for your comment. So let's discuss the trade-off and find an appropriate implementation. I am glad to try and test. 😊

  1. Maybe we can add a switch (such as the opcache.jit_hot_loop parameter) for my patch?
  2. Maybe we can set a parameter value? When the trace is too long, we JIT the function separately.
  3. Maybe we can check whether the inlined function is already JITTed in other JITTed code?

dstogov (Member) commented Mar 23, 2023

I think the problem might be solved by counting the number of inlined instructions, aborting compilation when this number exceeds some limit, and falling back to compiling only the tail of the recorded trace for the inlined function that caused the abort.

wxue1 (Contributor Author) commented Mar 24, 2023

> I think the problem might be solved by counting the number of inlined instructions, aborting compilation when this number exceeds some limit, and falling back to compiling only the tail of the recorded trace for the inlined function that caused the abort.

Okay. On abort, do we compile the deepest inlined function, or only the inlined function that caused the abort (which may itself contain nested inlined functions)?

dstogov (Member) commented Mar 27, 2023

> Okay. On abort, do we compile the deepest inlined function, or only the inlined function that caused the abort (which may itself contain nested inlined functions)?

I think the best solution would be to compile the function that caused the limit overflow. It may include other small inlined functions.

During tracing we remember function entries (push them onto some stack). When we trace a function exit and the distance from the corresponding function entry is above some limit, we decide to discard the start of the trace (before that function entry) and compile a trace for that function only. Note that the function may already be compiled.
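The entry-stack idea above can be sketched in C as follows. This is a minimal model under stated assumptions, not tracer code from php-src: the `rec_op` record kinds, `MAX_INLINE_DEPTH`, and `find_oversized_inline` are hypothetical, and the "distance" is taken to be the number of records between an entry and its matching exit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record kinds: an ordinary VM instruction, a function
 * entry, and the matching return to the caller. */
enum rec_op { REC_VM, REC_ENTER, REC_BACK };

#define MAX_INLINE_DEPTH 16

/* Walk a recorded trace and return the index of the ENTER record of
 * the first inlined function whose body (records between ENTER and
 * its BACK) is longer than `limit`. Returns -1 if no function
 * exceeds the limit or the nesting is deeper than the stack allows. */
int find_oversized_inline(const enum rec_op *trace, int len, int limit)
{
    int entry_stack[MAX_INLINE_DEPTH];
    int depth = 0;

    assert(trace != NULL && len >= 0 && limit >= 0);

    for (int i = 0; i < len; i++) {
        if (trace[i] == REC_ENTER) {
            if (depth == MAX_INLINE_DEPTH) {
                return -1;                 /* too deep: give up */
            }
            entry_stack[depth++] = i;      /* remember the entry */
        } else if (trace[i] == REC_BACK && depth > 0) {
            int entry = entry_stack[--depth];
            if (i - entry - 1 > limit) {
                return entry;              /* compile from here only */
            }
        }
    }
    return -1;
}
```

A caller would discard the records before the returned index and compile a trace starting at that function entry. Because a small nested callee exits before its caller does, the first function to exceed the limit can be the outer one, with the nested callee still inlined inside it.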

@wxue1 wxue1 force-pushed the JIT_duplication branch from df5cdce to a358996 Compare April 18, 2023 08:25
wxue1 (Contributor Author) commented Apr 18, 2023

> Okay. On abort, do we compile the deepest inlined function, or only the inlined function that caused the abort (which may itself contain nested inlined functions)?
>
> I think the best solution would be to compile the function that caused the limit overflow. It may include other small inlined functions.

I tried an easy way to get a similar effect. When a trace is too long (idx > JIT_G(max_inline_func_length)), we JIT the inlined function first. The parameter max_inline_func_length can be set by developers. Currently our WordPress workload gets the best performance when max_inline_func_length is 8. The default value needs more testing later.

I'm looking forward to your comments.

@@ -1062,6 +1062,13 @@ zend_jit_trace_stop ZEND_FASTCALL zend_jit_trace_execute(zend_execute_data *ex,

trace_flags = ZEND_OP_TRACE_INFO(opline, offset)->trace_flags;
if (trace_flags) {
/* if inlined functions are too long, stop current tracing and restart a new one */
if (trace_buffer[idx-1].op == ZEND_JIT_TRACE_ENTER && idx > JIT_G(max_inline_func_length)) {
Member commented:

This doesn't limit the inlined function length. This limits function inlining to the first trace instructions.

wxue1 (Contributor Author) commented Apr 24, 2023

> This doesn't limit the inlined function length. This limits function inlining to the first trace instructions.

Actually, the Java JIT supports inlining functions up to a particular size. Here we limit the size by splitting the long trace at max_inline_func_length.

If we limited the inlined function length, some functions might never be JITTed. So I chose a compromise.

Member commented:

Your code allows inlining only if it occurs in the first 16 instructions of the trace. It easily prohibits inlining of short functions.

---- TRACE 44 start (side trace 43/3) WP_Scripts::localize() /home/www/bench/wordpress-3.6/wp-includes/class.wp-scripts.php:148
0027 FE_FREE V8 ; op1(array)
0028 T10 = ROPE_INIT 3 string("var ")
0029 T10 = ROPE_ADD 1 T10 CV1($object_name) ; op2(string)
0030 T9 = ROPE_END 2 T10 string(" = ")
0031 INIT_FCALL 1 96 string("json_encode")
     >init json_encode
0032 SEND_VAR CV2($l10n) 1 ; op1(array)
0033 V10 = DO_ICALL
     >call json_encode
0034 T8 = FAST_CONCAT T9 V10 ; op1(string) op2(string)
0035 CV6($script) = FAST_CONCAT T8 string(";") ; op1(string)
0036 T8 = ISSET_ISEMPTY_CV (empty) CV3($after) ; op1(undef)
0037 ;JMPNZ T8 0042
0042 INIT_METHOD_CALL 2 THIS string("get_data")
     >init WP_Dependencies::get_data
0043 SEND_VAR_EX CV0($handle) 1 ; op1(string)
0044 SEND_VAL_EX string("data") 2
0045 V8 = DO_FCALL
     >enter WP_Dependencies::get_data
---- TRACE 44 abort (JIT inlined function and skip current trace)
---- TRACE 44 start (enter) WP_Dependencies::get_data() /home/www/bench/wordpress-3.6/wp-includes/class.wp-dependencies.php:163
0002 T3 = FETCH_OBJ_IS THIS string("registered") ; val(array)
0003 T2 = ISSET_ISEMPTY_DIM_OBJ (isset) T3 CV0($handle) ; op1(array) op2(string) val(object)
0004 ;JMPNZ T2 0006
0006 T3 = FETCH_OBJ_IS THIS string("registered") ; val(array)
0007 T2 = FETCH_DIM_IS T3 CV0($handle) ; op1(array) op2(string) val(object)
0008 T3 = FETCH_OBJ_IS T2 string("extra") ; op1(object of class _WP_Dependency) val(array)
0009 T2 = ISSET_ISEMPTY_DIM_OBJ (isset) T3 CV1($key) ; op1(array) op2(string) val(undef)
0010 ;JMPNZ T2 0012
0011 RETURN bool(false)
---- TRACE 44 stop (return)

Java doesn't use a tracing JIT. The current PHP tracing rules were mainly modeled on LuaJIT.
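The distinction made in this review thread, between limiting where a function entry may appear in the trace and limiting how long the inlined callee is, can be shown with two small predicates. Both functions are hypothetical and exist only for comparison; the first mirrors the shape of the `idx > JIT_G(max_inline_func_length)` test in the patch, the second is what a length limit would look like.

```c
#include <assert.h>
#include <stdbool.h>

/* Position rule (shape of the patch): reject inlining when the
 * function ENTER is recorded beyond `limit` records into the trace,
 * no matter how short the callee is. */
bool position_rule_rejects(int enter_position, int limit)
{
    assert(enter_position >= 0 && limit >= 0);
    return enter_position > limit;
}

/* Length rule (what the option name suggests): reject inlining only
 * when the callee itself is longer than `limit` instructions. */
bool length_rule_rejects(int callee_length, int limit)
{
    assert(callee_length >= 0 && limit >= 0);
    return callee_length > limit;
}
```

With illustrative numbers loosely based on TRACE 44 above (an entry recorded about 20 records into the trace, a callee of 9 opcodes, a limit of 16), the position rule rejects the inlining while the length rule would allow it. The position value is an approximation chosen for the example, not a figure taken from the log.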

dstogov (Member) commented Apr 24, 2023

In my opinion, this doesn't make a lot of sense, at least in the current state.

By design, if some function is called from different places it should be JIT-ed first, because its counter is going to be triggered first. Otherwise it should be inlined to provide more opportunities for optimisation.

As I said, it is possible to limit the size of the inlined function, but the currently proposed patch does a slightly different thing.

I made some benchmarks using the following command and I see only a very small improvement.

sapi/cgi/php-cgi -d opcache.jit=1254 -T50,1000 /var/www/html/bench/wordpress-3.6/index.php >/dev/null
                                          Time      N traces    Code Size      CPU Insn Fetches (-T50,100)
php-master -d opcache.jit=0              2.41 sec      0              0         1,212,170
php-master -d opcache.jit=1254           2.27 sec    230        196,672           996,715
php-master+patch -d opcache.jit=1254     2.26 sec    278        197,520           992,554

I also see that some hot code is now blacklisted:

---- TRACE 179 abort (JIT inlined function and skip current trace)
ESCAPE-30-2: ; (unknown)
	movq $0x3e28b40, %r15
	jmpq *(%r15)

wxue1 (Contributor Author) commented Apr 24, 2023

> By design, if some function is called from different places it should be JIT-ed first, because its counter is going to be triggered first. Otherwise it should be inlined to provide more opportunities for optimisation.

Yes, there is a delay for the function counter here. The next time, the function will be JITTed first.
And ZEND_JIT_TRACE_STOP_MAY_RECOVER could help by trying the JIT again before blacklisting it.

> As I said, it is possible to limit the size of the inlined function, but the currently proposed patch does a slightly different thing.
>
> I made some benchmarks using the following command and I see only a very small improvement.

The performance gain is 2-3% on our WordPress workload, and the patch also reduces the total JIT buffer size.
In summary, this is a trade-off between JITting long inlined functions separately and keeping them inlined for the sake of further optimizations.
I hope this patch can be merged, since the parameter can be tuned by the user.

dstogov (Member) commented Apr 24, 2023

In my benchmark, the time improvement is ~0.44%, which is less than the measurement error, but the CPU instruction fetches measured with valgrind show a similar improvement of ~0.4%. The code size actually increased slightly.

The patch doesn't limit the size of the inlined function; it just splits the traces in a different (more aggressive) way.

> I hope this patch can be merged, since the parameter can be tuned by the user.

First of all, opcache.jit_max_inline_func_length should behave according to its name.

Maybe in that case the trace should be shortened to the actual enter.
Maybe it's possible to compile the first trace (before the enter) and then continue recording for the called function.

wxue1 (Contributor Author) commented Apr 24, 2023

> First of all, opcache.jit_max_inline_func_length should behave according to its name.

Yes, forgive my awkward name. This is similar to 'ZEND_JIT_TRACE_MAX_LENGTH'. How about 'opcache.jit_max_trace_length'?

> Maybe in that case the trace should be shortened to the actual enter. Maybe it's possible to compile the first trace (before the enter) and then continue recording for the called function.

I'm a little confused about 'the trace should be shortened to the actual enter'.
Compiling the first trace up to the enter requires prior profiling information. Do we then decide to split that long trace?

dstogov (Member) commented Apr 24, 2023

> First of all, opcache.jit_max_inline_func_length should behave according to its name.
>
> Yes, forgive my awkward name. This is similar to 'ZEND_JIT_TRACE_MAX_LENGTH'. How about 'opcache.jit_max_trace_length'?

I don't see a problem with replacing the hardcoded ZEND_JIT_TRACE_MAX_LENGTH with a configurable opcache.jit_max_trace_length.

> Maybe in that case the trace should be shortened to the actual enter. Maybe it's possible to compile the first trace (before the enter) and then continue recording for the called function.
>
> I'm a little confused about 'the trace should be shortened to the actual enter'. Compiling the first trace up to the enter requires prior profiling information. Do we then decide to split that long trace?

Take a look at how the backtrack_* variables are used in the tracer. At some point we remember the place where we may stop the trace, but continue recording, and in case of a future failure we compile the trace only up to the backtrack_* point (shortening it).
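The backtracking idea can be modelled with a tiny recorder. This is a sketch under stated assumptions rather than the tracer's actual data structures: `trace_recorder`, `recorder_step`, and `recorder_finish` are hypothetical names, and only a single backtrack point is kept, whereas the comment refers to several backtrack_* variables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int length;     /* number of records captured so far */
    int backtrack;  /* first remembered safe stop point, -1 if none */
} trace_recorder;

void recorder_init(trace_recorder *r)
{
    assert(r != NULL);
    r->length = 0;
    r->backtrack = -1;
}

/* Record one instruction. `safe_stop` means the trace could legally
 * end just before this instruction; the first such position is
 * remembered, but recording continues. */
void recorder_step(trace_recorder *r, bool safe_stop)
{
    assert(r != NULL);
    if (safe_stop && r->backtrack < 0) {
        r->backtrack = r->length;
    }
    r->length++;
}

/* Return how many leading records should be compiled. On failure the
 * trace is shortened to the remembered point, or dropped (0) if no
 * safe stop was seen. */
int recorder_finish(const trace_recorder *r, bool failed)
{
    assert(r != NULL);
    if (!failed) {
        return r->length;
    }
    return r->backtrack >= 0 ? r->backtrack : 0;
}
```

For example, if the fourth recorded instruction is a safe stop and recording later fails after six records, this model compiles only the first three records instead of discarding the whole trace.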

wxue1 (Contributor Author) commented Apr 24, 2023

> Take a look at how the backtrack_* variables are used in the tracer. At some point we remember the place where we may stop the trace, but continue recording, and in case of a future failure we compile the trace only up to the backtrack_* point (shortening it).

Nice mechanism. Let me take a look.

So our discussion shows two options.

  1. JIT only the long inlined function.
    This is nice in theory, but needs a more complicated implementation (remember function entry and exit, handle functions that call other functions 😅, modify the trace_buffer).

  2. My patch: JIT a long trace in separate parts, taking the function entry as the boundary. JIT the first part and then the next part (this needs changes).
    This is easy to implement and leaves room for future inlined-function optimizations.

So, my dear maintainer, which one do you choose? I'll continue to modify the code accordingly.

dstogov (Member) commented Apr 24, 2023

Option (2) may work with a simple implementation if we delay JIT-ing the second part (the called function) until the next entry.

Anyway, this may break the original design.
As I remember, the tracer might prefer to try inlining even when the called function was compiled before.

wxue1 (Contributor Author) commented Apr 25, 2023

> Option (2) may work with a simple implementation if we delay JIT-ing the second part (the called function) until the next entry.
> Anyway, this may break the original design. As I remember, the tracer might prefer to try inlining even when the called function was compiled before.

A small question: if we use a 'backtrack_' variable to compile the first part, do we then need to relink to the first part after the second part (the inlined function) is compiled? I see that the original code doesn't link to the JITTed function.

else if (trace_flags & ZEND_JIT_TRACE_START_ENTER) {
    if (start != ZEND_JIT_TRACE_START_RETURN) {
        // TODO: We may try to inline function ???
        stop = ZEND_JIT_TRACE_STOP_LINK;
        break;
    }
    if (backtrack_link_to_enter < 0) {
        backtrack_link_to_enter = idx;
        link_to_enter_opline = opline;
    }
}

@wxue1 wxue1 force-pushed the JIT_duplication branch from a358996 to b4abcf6 Compare April 27, 2023 08:06
Duplicated JITTed code brings overhead for the instruction cache.
This patch reduces duplication by JITting inlined function
separately if trace is too long because the same function is
JITTed in different root trace or side trace sometimes.
It increases 2%~3% performance of our workload in tracing mode.

Signed-off-by: Wang, Xue   <[email protected]>
Signed-off-by: Yang, Lin A <[email protected]>
Signed-off-by: Su, Tao     <[email protected]>
@wxue1 wxue1 force-pushed the JIT_duplication branch from b4abcf6 to 2d9784b Compare June 13, 2023 08:12
wxue1 (Contributor Author) commented Jun 15, 2023

Dear maintainer, I sent you a mail for this PR. Hope to get your reply.
