Just-In-Time Compiler in PHP 8

Nikita Popov @ betterCode PHP 8

About Me
● Dmitry Stogov works on JIT
● I work on everything else :)
● My JIT involvement is mostly QA

Just-In-Time (JIT) Compiler
[Diagram: PHP Code → Opcodes → Virtual Machine → CPU. With the JIT: Opcodes → JIT → Machine Code → CPU]

History
● Old project, started by Zend back in the PHP 5 days
● Mainly implemented by Dmitry Stogov

History
● Early prototypes: the rest of PHP was too slow for the JIT to matter
– Too many allocations
– Too much memory usage
– Too much pointer chasing
– Cache locality is key
● PHPNG (later: PHP 7) project started to optimize PHP
● Large performance improvements (2x), no JIT needed!

History
● SSA and type inference from JIT integrated into opcache
● Used for opcode optimizations
– Constant Propagation
– Dead Code Elimination
– Refcount Optimization

Configuration
● Enable opcache
● opcache.jit_buffer_size=128M
● Done!

Configuration
● Advanced configuration:
– opcache.jit (CRTO)
– opcache.jit_debug, opcache.jit_bisect_limit
– opcache.jit_max_root_traces, opcache.jit_max_side_traces, opcache.jit_max_exit_counters
– opcache.jit_hot_loop, opcache.jit_hot_func, opcache.jit_hot_return, opcache.jit_hot_side_exit
– opcache.jit_blacklist_root_trace, opcache.jit_blacklist_side_trace
– opcache.jit_max_loop_unrolls, opcache.jit_max_recursive_calls, opcache.jit_max_recursive_returns, opcache.jit_max_polymorphic_calls
– https://www.php.net/manual/en/opcache.configuration.php
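
To check whether the settings took effect, the "jit" section of opcache_get_status() can be inspected. A minimal sketch: jit_summary() is a helper name made up for this example, and the array keys used (jit, enabled, on, buffer_size, buffer_free) are the ones PHP 8.0 reports, so treat them as an assumption on other versions. On the command line, opcache also needs opcache.enable_cli=1.

```php
<?php
/**
 * Summarize the "jit" section of an opcache_get_status() result.
 * jit_summary() is a hypothetical helper, not part of PHP.
 */
function jit_summary(array|false $status): string
{
    if ($status === false || !isset($status['jit'])) {
        return 'JIT information not available';
    }
    $jit = $status['jit'];
    if (empty($jit['enabled']) || empty($jit['on'])) {
        return 'JIT is off';
    }
    $used = $jit['buffer_size'] - $jit['buffer_free'];
    return sprintf('JIT is on, %d of %d buffer bytes used', $used, $jit['buffer_size']);
}

// opcache_get_status() only exists when the opcache extension is loaded.
if (function_exists('opcache_get_status')) {
    echo jit_summary(opcache_get_status(false)), "\n";
}
```

Passing the status array in as a parameter keeps the helper usable in a test without a running opcache.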

Performance
[Bar chart comparing bench.php, micro_bench.php, PHP-Parser, amphp, Symfony Demo and "With Preloading" against the baseline (Opcache + No JIT); axis from 0 to 3.5]

Performance
● Heavily depends on workload
● Larger impact the more time is spent executing PHP code (rather than e.g. DB queries)
● More useful for "non-standard" applications
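
One way to see the workload dependence is to time a purely CPU-bound function with and without the JIT. The sketch below is only a rough harness: time_call() is a made-up helper, and the loop count is arbitrary. A script that mostly waits on a database would show little difference under the same measurement.

```php
<?php
// CPU-bound workload: almost all time is spent executing PHP opcodes.
function sum(int $n): int|float
{
    $sum = 0;
    for ($i = 0; $i < $n; $i++) {
        $sum += $i;
    }
    return $sum;
}

// time_call() is a hypothetical helper: runs $fn once and
// returns [result, elapsed milliseconds].
function time_call(callable $fn): array
{
    $start = hrtime(true);            // monotonic clock, in nanoseconds
    $result = $fn();
    return [$result, (hrtime(true) - $start) / 1e6];
}

[$result, $ms] = time_call(fn() => sum(5_000_000));
printf("sum = %d, %.1f ms\n", $result, $ms);
```

Running the script once with -d opcache.jit_buffer_size=0 and once with -d opcache.jit_buffer_size=128M (plus -d opcache.enable_cli=1 in both cases) gives the two numbers to compare.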

Function JIT
● opcache.jit=function
● Always JITs a whole function

Function JIT
[Diagram: PHP Code → Opcodes → Virtual Machine → CPU, plus Opcodes → (Trigger) → JIT → Machine Code → CPU]

Function JIT
● Trigger: When to JIT
– 0: All functions, on script load
– 1: All functions, on first execution
– 2: Profile first request, JIT hot functions
– 3: Profile on the fly, JIT hot functions
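
The trigger is the third digit (T) of the four-digit opcache.jit value (CRTO). The sketch below decodes it; decode_jit_trigger() is a made-up helper, the descriptions for 0 to 3 are the ones listed above, and the aliases (function = 1205, tracing = 1254) and trigger 5 for the tracing JIT are taken from the opcache manual page, so double-check them against your PHP version.

```php
<?php
// decode_jit_trigger() is a hypothetical helper for illustration only.
function decode_jit_trigger(string $crto): string
{
    // Aliases for the four-digit form, as documented in the manual.
    $aliases = ['function' => '1205', 'tracing' => '1254'];
    $crto = $aliases[$crto] ?? $crto;
    if (!preg_match('/^\d{4}$/', $crto)) {
        throw new InvalidArgumentException("Not a CRTO value: $crto");
    }
    $trigger = (int) $crto[2];        // digits are C, R, T, O
    return match ($trigger) {
        0 => 'all functions, on script load',
        1 => 'all functions, on first execution',
        2 => 'profile first request, JIT hot functions',
        3 => 'profile on the fly, JIT hot functions',
        5 => 'tracing JIT',
        default => 'not covered in this sketch',
    };
}

echo decode_jit_trigger('function'), "\n"; // all functions, on script load
```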

<?php
function sum(int $n) {
    $sum = 0;
    for ($i = 0; $i < $n; $i++) {
        $sum += $i;
    }
    return $sum;
}

<?php
entry:
    $sum = 0;
    $i = 0;
    goto cond;
loop:
    $sum += $i;
    $i++;
cond:
    if ($i < $n) goto loop;
finish:
    return $sum;

<?php
entry:
    $sum_0 = 0;
    $i_0 = 0;
    goto cond;
loop:
    $sum_2 = $sum_1 + $i_1;
    $i_2 = $i_1 + 1;
cond:
    $sum_1 = phi(entry: $sum_0, loop: $sum_2);
    $i_1 = phi(entry: $i_0, loop: $i_2);
    if ($i_1 < $n) goto loop;
finish:
    return $sum_1;

<?php
entry:
    $sum_0 = 0;                                  # int
    $i_0 = 0;                                    # int
    goto cond;
loop:
    $sum_2 = $sum_1 + $i_1;                      # int|float
    $i_2 = $i_1 + 1;                             # int
cond:
    $sum_1 = phi(entry: $sum_0, loop: $sum_2);   # int|float
    $i_1 = phi(entry: $i_0, loop: $i_2);         # int
    if ($i_1 < $n) goto loop;
finish:
    return $sum_1;
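
The int|float type for $sum comes from PHP's integer overflow rule: an addition that does not fit into an int produces a float. A small demonstration (the variable names are only for this example):

```php
<?php
$max = PHP_INT_MAX;
$sum = $max + 1;            // integer overflow: the result is a float

var_dump(is_int($max));     // bool(true)
var_dump(is_float($sum));   // bool(true)
```

$i, by contrast, is only incremented while $i < $n holds, which is presumably why the inference can keep it as int.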

...
.L2:
    mov $0x0, 0x60(%r14)       # Assign int(0) to $sum (%r14 is the frame pointer)
    mov $0x4, 0x68(%r14)
    xor %rdx, %rdx             # Assign 0 to $i (in register)
    jmp .L5
.L3:
    mov %rsi, 0x50(%r14)
    mov $0x4, 0x58(%r14)
    cmp $0x4, 0x68(%r14)       # Check whether $sum is int
    jnz .L10
    mov 0x60(%r14), %rax       # Load $sum to register
    add %rdx, %rax             # Add $sum and $i
    jo .L9                     # Check if addition overflowed
    mov %rax, 0x60(%r14)       # Write result back
.L4:
    add $0x1, %rdx             # Increment $i (in register)
.L5:
...

...
.L9:                                        # Reached when the addition overflowed (jo .L9)
    vxorps %xmm0, %xmm0, %xmm0
    vcvtsi2sd 0x60(%r14), %xmm0, %xmm0      # Convert $sum to float
    vcvtsi2sd %rdx, %xmm1, %xmm1            # Convert $i to float
    vaddsd %xmm1, %xmm0, %xmm0              # Add $sum and $i as floats
    vmovsd %xmm0, 0x60(%r14)
    mov $0x5, 0x68(%r14)                    # Mark $sum slot as float
    jmp .L4
.L10:                                       # Reached when $sum is not int (jnz .L10): convert $i to float
    vaddsd 0x60(%r14), %xmm0, %xmm0         # Add (float)$i to $sum
    jmp .L4
.L11:
...
This code is almost certainly unused!
Can't store $sum in register, because it might turn float

Tracing JIT
[Diagram, built up in stages: VM Execution + Profiling → (Hot) → Trace Collection → Trace Compilation → Trace Execution → (Deoptimization) → back to VM Execution]

<?php
entry:
    $sum = 0;
    $i = 0;
    goto cond;
loop:
    $sum += $i;
    $i++;
cond:
    if ($i < $n) goto loop;
finish:
    return $sum;

<?php
# Trace recorded for the loop, one instruction at a time
trace:
    if ($i < $n)
    $sum += $i;
    $i++;
    goto trace;

<?php
# The same trace in SSA form
    $sum_0 = ...;
    $i_0 = ...;
trace:
    $sum_1 = phi($sum_0, $sum_2);
    $i_1 = phi($i_0, $i_2);
    if ($i_1 < $n)
    $sum_2 = $sum_1 + $i_1;
    $i_2 = $i_1 + 1;
    goto trace;

<?php
# The SSA trace with the type information collected while tracing
    $sum_0 = ...;                    # int
    $i_0 = ...;
trace:
    $sum_1 = phi($sum_0, $sum_2);
    $i_1 = phi($i_0, $i_2);
    if ($i_1 < $n)                   # does not exit
    $sum_2 = $sum_1 + $i_1;          # int
    $i_2 = $i_1 + 1;
    goto trace;

    sub $0x10, %rsp
    mov $EG(jit_trace_num), %rax
    mov $0x1, (%rax)
    cmp $0x4, 0x68(%r14)           # Check if $sum is int (exit 0)
    jnz jit$$trace_exit_0
    mov 0x50(%r14), %rcx           # Load $n, $sum, $i into registers
    mov 0x60(%r14), %rdx
    mov 0x70(%r14), %rsi
.L1:
    cmp %rcx, %rsi                 # Check $i < $n (exit 1)
    jge jit$$trace_exit_1
    mov %rdx, %rax                 # $sum += $i, check overflow (exit 2)
    add %rsi, %rax
    jo jit$$trace_exit_2
    mov %rax, %rdx
    add $0x1, %rsi                 # $i++
    mov $EG(vm_interrupt), %rax    # Check VM interrupt, like timeout (exit 3)
    cmp $0x0, (%rax)
    jz .L1
    jmp jit$$trace_exit_3
Exits go to VM or side traces
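
The guard-and-exit structure of this trace can be modeled in plain PHP. This is only a sketch of the control flow: run_trace() is a made-up function, not engine code, and it reports which exit was taken instead of jumping to the VM or a side trace.

```php
<?php
// run_trace() is a hypothetical model of the compiled trace.
// It returns [exit number, $sum, $i] at the point where the trace is left.
function run_trace(int|float $sum, int $i, int $n): array
{
    if (!is_int($sum)) {
        return [0, $sum, $i];           // exit 0: $sum is not int
    }
    while (true) {
        if ($i >= $n) {
            return [1, $sum, $i];       // exit 1: loop condition failed
        }
        $next = $sum + $i;
        if (!is_int($next)) {
            return [2, $sum, $i];       // exit 2: addition overflowed, $sum unchanged
        }
        $sum = $next;
        $i++;
        // exit 3 (VM interrupt such as a timeout) is not modeled here
    }
}

var_dump(run_trace(0, 0, 5)[0]);        // int(1): the loop ran to completion
```

Inside the loop there is no type check on $sum any more; the single guard before the loop plus the overflow exit are what allow $sum to stay in a register.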

TRACE-2$sum$5:
    mov $0x2, (%rax)
    cmp 0x50(%r14), %rax
    cmp $0x5, 0x68(%r14)             # Check if $sum is float
    vcvtsi2sd %rax, %xmm0, %xmm0     # $sum += (float) $i
    add $0x1, 0x70(%r14)
    cmp $0x0, (%rax)
    jz TRACE-1$sum$5+4
[Final build: the trace 1 listing again, with one of its exits now labeled "Trace 2"]

Interception
● Each opcode stores a "VM handler" pointer
● Replace handler at function entry, loop headers, returns
● Handler counts executions and invokes JIT
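
The handler swap can be illustrated with a toy model. Everything here is hypothetical (the Opcode class, the threshold of 3, the "compiled" handler being just another closure); the real counters are controlled by the opcache.jit_hot_* settings listed under Configuration.

```php
<?php
// Hypothetical model: each "opcode" holds a handler that can be swapped.
final class Opcode
{
    /** @var callable */
    public $handler;
    public int $count = 0;

    public function __construct(callable $original, int $hotThreshold)
    {
        // Counting handler installed in place of the original one.
        $this->handler = function (int $x) use ($original, $hotThreshold): int {
            if (++$this->count >= $hotThreshold) {
                // "Invoke the JIT": here we merely swap in another handler.
                $this->handler = fn(int $x): int => $original($x);
            }
            return $original($x);
        };
    }
}

$op = new Opcode(fn(int $x): int => $x + 1, hotThreshold: 3);
for ($i = 0; $i < 5; $i++) {
    ($op->handler)($i);
}
echo $op->count, "\n"; // counting stops once the handler has been replaced
```

The point of the design is that code which never gets hot pays only for a counter increment, while hot code stops counting after the swap.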

Trace Collection
● Separate VM that collects type info while executing
● Traces can span different loops and functions
– Calls effectively get "inlined"

Code Generation
● Early prototypes used LLVM
– Architecture agnostic
– Supports many sophisticated optimizations
– But: Extremely slow compile times
● Now using DynASM from the LuaJIT project
– Very fast
– But: Architecture specific

|.macro LONG_MATH_REG, opcode, dst_reg, src_reg
||  switch (opcode) {
||      case ZEND_ADD:
|           add dst_reg, src_reg
||          break;
||      case ZEND_SUB:
|           sub dst_reg, src_reg
||          break;
||      case ZEND_MUL:
|           imul dst_reg, src_reg
||          break;
||      case ZEND_BW_OR:
|           or dst_reg, src_reg
||          break;
||      case ZEND_BW_AND:
|           and dst_reg, src_reg
||          break;
...
||  }
|.endmacro
Lines starting with "||": C code
Lines starting with "|": X86 Assembly with placeholders
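
The idea behind the macro, host-language control flow choosing which assembly template to emit, can be mimicked in PHP. This is only an analogue under stated assumptions: emit_long_math_reg() is a made-up function and it returns assembly text, whereas DynASM encodes machine code.

```php
<?php
// emit_long_math_reg() is a hypothetical PHP analogue of the macro above:
// the match plays the role of the C switch, the sprintf() template the
// role of the assembly line with placeholders.
function emit_long_math_reg(string $opcode, string $dstReg, string $srcReg): string
{
    $mnemonic = match ($opcode) {
        'ZEND_ADD'    => 'add',
        'ZEND_SUB'    => 'sub',
        'ZEND_MUL'    => 'imul',
        'ZEND_BW_OR'  => 'or',
        'ZEND_BW_AND' => 'and',
        default       => throw new InvalidArgumentException("Unhandled opcode $opcode"),
    };
    return sprintf('%s %s, %s', $mnemonic, $dstReg, $srcReg);
}

echo emit_long_math_reg('ZEND_ADD', 'rax', 'rdx'), "\n"; // add rax, rdx
```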

Code Generation
● DynASM itself supports many architectures
● But the JIT code has to be written separately for each architecture
● No support for M1 at this time, sorry!

Closing Thoughts
● Performance benefit workload dependent
– Try it!
● Room for improvement
– E.g. optimizations (loop invariant code motion, etc.)
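
For readers unfamiliar with loop invariant code motion, here is the transformation applied by hand to a small PHP function. It only illustrates what the optimization does; the function names are made up and nothing here claims the JIT performs it.

```php
<?php
// Before: strlen($prefix) * 2 does not change inside the loop,
// yet it is recomputed on every iteration.
function total_before(array $words, string $prefix): int
{
    $total = 0;
    foreach ($words as $word) {
        $total += strlen($word) + strlen($prefix) * 2;
    }
    return $total;
}

// After: the invariant expression is hoisted out of the loop.
function total_after(array $words, string $prefix): int
{
    $pad = strlen($prefix) * 2;
    $total = 0;
    foreach ($words as $word) {
        $total += strlen($word) + $pad;
    }
    return $total;
}

var_dump(total_before(['a', 'bc'], 'xy') === total_after(['a', 'bc'], 'xy')); // bool(true)
```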

Closing Thoughts
● Concern: Stability
– Increased potential for hard to debug, hard to reproduce bugs
● Concern: Maintenance
– Only one person really understands the JIT
