SlideShare a Scribd company logo
A race of two compilers:
Graal JIT vs. C2 JIT.
Which one offers better runtime
performance?
by Ionuţ Baloşin
ionutbalosin.com
@ionutbalosin
About Me
Software Architect
Technical Trainer
Writer
Speaker
SOFTWARE ARCHITECTURE JAVA PERFORMANCE AND TUNING
IonutBalosin.com
Today’s Content
01 Intro
02 Scalar Replacement
03 Virtual Calls
04 Vectorization & Loop Optimizations
05 Summary
www.ionutbalosin.comwww.ionutbalosin.com
Intro
www.ionutbalosin.com 5
https://ionutbalosin.com/2019/04/jvm-jit-compilers-benchmarks-report-19-04
www.ionutbalosin.com
Configuration
6
CPU Intel i7-8550U Kaby Lake R
MEMORY 16GB DDR4 2400 MHz
OS / Kernel Ubuntu 18.10 / 4.18.0-17-generic
Java Version 12 Linux-x64
JVM / CE
JMH v1.21
5x10s warm-up iterations, 5x10s measurement
iterations, 3 JVM forks, single threaded
Note: C2 JIT might need less timings, being a native Compiler.
www.ionutbalosin.com
This talk is NOT about EE.
Current optimizations might differ for other JVMs.
7
www.ionutbalosin.com
Fair comparison?
8
to
www.ionutbalosin.com
Fair comparison?
9
to
Well … it depends from the perspective:
Application/end user: performance and (licensing) costs are the
leading factors
Compiler/internals (e.g. implementation, maturity): very different
www.ionutbalosin.com
Graal JIT
✓ Java implementation
✓ still early stage (recently become
production ready; e.g. v19.0.0)
✓ targets JVM-based languages (e.g.
Java, Scala, Kotlin), dynamic
languages (e.g. JavaScript, Ruby, R)
but also native languages (e.g. C/C++,
Rust)
✓ modular design (JVMCI plug-in arch)
✓ Ahead Of Time flavor
10
www.ionutbalosin.com
Graal JIT
✓ Java implementation
✓ still early stage (recently become
production ready; e.g. v19.0.0)
✓ targets JVM-based languages (e.g.
Java, Scala, Kotlin), dynamic
languages (e.g. JavaScript, Ruby, R)
but also native languages (e.g. C/C++,
Rust)
✓ modular design (JVMCI plug-in arch)
✓ Ahead Of Time flavor
C2 JIT
✓ C++ implementation
✓ mature / production grade runtime
✓ written mainly to optimize Java
source code
✓ is essentially a local maximum of
performance and is reaching the end
of its design lifetime
✓ latest improvements were more
focused on intrinsics and
vectorization, rather than adding
new optimization heuristics
11
www.ionutbalosin.com
Nevertheless, nowadays the baseline for Graal JIT still
remains C2 JIT!
The target is to leverage its performance to be at least
on par with C2 JIT especially on real world applications /
production code.
12
www.ionutbalosin.com
There are 3 main distinguishing factors for Graal JIT:
13
Improved inlining
Partial Escape Analysis
Guard optimizations for efficient
handling of speculative code
www.ionutbalosin.com 14
1. Partial Escape Analysis (PEA) determines the escapability of objects on
individual branches and reduces object allocations even if the objects
escapes on some paths.
PEA works on compiler’s intermediate representation not on bytecode.
It is effective in conjunction with other parts of the compiler, such as
inlining, global value numbering, and constant folding.
www.ionutbalosin.com 15
2. Improved inlining strategy explores the call tree and performs late
inlining as default instead of inlining during bytecode parsing.
Graal JIT devise three novel heuristics: adaptive decision thresholds, callsite
clustering, and deep inlining trials.
www.ionutbalosin.com 16
3. Guard optimizations for efficient handling of speculative code has its
biggest impact on the JavaScript, Ruby, and R implementations built on
Graal’s multi-language framework Truffle.
www.ionutbalosin.comwww.ionutbalosin.com
Scalar Replacement
www.ionutbalosin.com
Object scopes:
✓ NoEscape (local to current method/thread)
✓ ArgEscape (passed as an argument but not visible by other threads)
✓ GlobalEscape (visible outside of current method/thread)
18
Escape Analysis (EA) analyses the scope of a new object and decides
whether it might be allocated or not on the heap.
EA is not an optimization but rather an analysis phase for the optimizer.
The result of this analisys allows compiler to perform optimizations like
object allocations, synchronization primitives and field accesses.
www.ionutbalosin.com
For NoEscape objects compiler can remap the accesses
to the object fields to accesses to synthetic local
operands: Scalar Replacement optimization.
19
www.ionutbalosin.com
@Param({"3"}) private int param;
@Param({"128"}) private int size;
@Benchmark
public int no_escape() {
LightWrapper w = new LightWrapper(param);
return w.f1 + w.f2;
}
public static class LightWrapper {
public int f1;
public int f2;
//...
}
20
CASE I
www.ionutbalosin.com
@Param({"3"}) private int param;
@Param({"128"}) private int size;
@Benchmark
public int no_escape() {
LightWrapper w = new LightWrapper(param);
return w.f1 + w.f2;
}
public static class LightWrapper {
public int f1;
public int f2;
//...
}
21
equivalent to
return f1 + f2;
CASE I
www.ionutbalosin.com
@Param({"3"}) private int param;
@Param({"128"}) private int size;
@Benchmark
public int no_escape_object_array() {
HeavyWrapper w = new HeavyWrapper(param, size);
return w.f1 + w.f2 + w.z.length;
}
public static class HeavyWrapper {
public int f1;
public int f2;
public byte z[];
// ...
}
22
CASE II
www.ionutbalosin.com
@Param({"3"}) private int param;
@Param({"128"}) private int size;
@Benchmark
public int no_escape_object_array() {
HeavyWrapper w = new HeavyWrapper(param, size);
return w.f1 + w.f2 + w.z.length;
}
public static class HeavyWrapper {
public int f1;
public int f2;
public byte z[];
// ...
}
23
equivalent to
return f1 + f2 + length;
CASE II
www.ionutbalosin.com
@Param(value = {"false"}) private boolean objectEscapes;
@Benchmark
public void branch_escape_object(Blackhole bh) {
HeavyWrapper wrapper = new HeavyWrapper(param, size);
HeavyWrapper result;
if (objectEscapes) // false
result = wrapper;
else
result = globalCachedWrapper;
bh.consume(result);
}
24
CASE III
www.ionutbalosin.com
@Param(value = {"false"}) private boolean objectEscapes;
@Benchmark
public void branch_escape_object(Blackhole bh) {
HeavyWrapper wrapper = new HeavyWrapper(param, size);
HeavyWrapper result;
if (objectEscapes) // false
result = wrapper;
else
result = globalCachedWrapper;
bh.consume(result);
}
25
CASE III
branch is never taken,
wrapper never escapes,
its scope is NoEscape
www.ionutbalosin.com
@Param(value = {"false"}) private boolean objectEscapes;
@Benchmark
public void branch_escape_object(Blackhole bh) {
HeavyWrapper wrapper = new HeavyWrapper(param, size);
HeavyWrapper result;
if (objectEscapes) // false
result = wrapper;
else
result = globalCachedWrapper;
bh.consume(result);
}
26
branch is never taken,
wrapper never escapes,
its scope is NoEscape
CASE III
Based on Partial Escape Analysis, the
wrapper object allocation is eliminated!
www.ionutbalosin.com
@Benchmark
public boolean arg_escape_object() {
HeavyWrapper wrapper1 = new HeavyWrapper(param, size);
HeavyWrapper wrapper2 = new HeavyWrapper(param, size);
boolean match = false;
if (wrapper1.equals(wrapper2))
match = true;
return match;
}
27
CASE IV
www.ionutbalosin.com
@Benchmark
public boolean arg_escape_object() {
HeavyWrapper wrapper1 = new HeavyWrapper(param, size);
HeavyWrapper wrapper2 = new HeavyWrapper(param, size);
boolean match = false;
if (wrapper1.equals(wrapper2))
match = true;
return match;
}
28
wrapper1 scope is NoEscape
wrapper2 scope is:
- NoEscape, if inlining of equals() succeeds
- ArgEscape, if inlining fails or disabled
CASE IV
www.ionutbalosin.com 29
www.ionutbalosin.com 30
C1/C2 JIT
C1/C2 JIT
C1/C2 JIT
www.ionutbalosin.com
Conclusions
31
Graal JIT was able to get rid of heap allocations offering a constant
response time, around 3 ns/op.
C1/C2 JIT struggles to get rid of the heap allocations if:
- there is a condition which does not make obvious at compile time if
the object escapes,
- or if the object scope becomes local (i.e. NoEscape) after inlining.
C1/C2 JIT by default does not consider escaping arrays if their size (i.e.
the number of elements) is greater than 64.
-XX:EliminateAllocationArraySizeLimit=<arraySize>
www.ionutbalosin.comwww.ionutbalosin.com
Virtual Calls
www.ionutbalosin.com
Monomorphic call site - optimistically points to only one concrete
method that has ever been used at that particular call site.
33
www.ionutbalosin.com
Bimorphic call site - optimistically points to only two concrete
methods which can be invoked at a particular call site.
34
www.ionutbalosin.com
Megamorphic call site - optimistically points to three or possibly
more methods which can be invoked at a particular call site.
35
. . .
www.ionutbalosin.com
// CMath = {Alg1, Alg2, Alg3, Alg4, Alg5, Alg6}
public int execute(CMath cmath, int i) {
return cmath.compute(i); // param * constant
}
36
megamorphic call site
www.ionutbalosin.com
@Benchmark
public int monomorphic() {
return execute(alg1, param) // param * 17
}
@Benchmark @OperationsPerInvocation(2)
public int bimorphic() {
return execute(alg1, param) + // param * 17
execute(alg2, param) // param * 19
}
37
www.ionutbalosin.com
@Benchmark
@OperationsPerInvocation(3)
public int megamorphic_3() {
return execute(alg1, param) + // param * 17
execute(alg2, param) + // param * 19
execute(alg3, param) // param * 23
}
/*
Similar test cases for:
- megamorphic_4() {...}
- megamorphic_5() {...}
- megamorphic_6() {...}
*/
38
www.ionutbalosin.com 39
www.ionutbalosin.com 40
C1/C2 JIT
C1/C2 JIT
C1/C2 JIT
C1/C2 JIT
www.ionutbalosin.comwww.ionutbalosin.com
bimorphic()
<<under the hood>>
41
www.ionutbalosin.com
C2 JIT, level 4, execute() ; generated assembly - bimorphic case
mov r10d,DWORD PTR [rsi+0x8]
; implicit exception
cmp r10d,0x23757f ; {metadata(&apos;Alg1&apos;)}
je L0
cmp r10d,0x2375c2 ; {metadata(&apos;Alg2&apos;)}
jne L1
imul eax,edx,0x13 ; - Alg2::compute
jmp 0x00007f48986678ce ; return
L0: mov eax,edx ; - Alg1::compute
shl eax,0x4
add eax,edx
L1: call 0x00007f4890ba1300 ;*invokevirtual compute
;{runtime_call UncommonTrapBlob}
42
www.ionutbalosin.com
C2 JIT, level 4, execute() ; generated assembly - bimorphic case
mov r10d,DWORD PTR [rsi+0x8]
; implicit exception
cmp r10d,0x23757f ; {metadata(&apos;Alg1&apos;)}
je L0
cmp r10d,0x2375c2 ; {metadata(&apos;Alg2&apos;)}
jne L1
imul eax,edx,0x13 ; - Alg2::compute
jmp 0x00007f48986678ce ; return
L0: mov eax,edx ; - Alg1::compute
shl eax,0x4
add eax,edx
L1: call 0x00007f4890ba1300 ;*invokevirtual compute
;{runtime_call UncommonTrapBlob}
43
Alg1
www.ionutbalosin.com
C2 JIT, level 4, execute() ; generated assembly - bimorphic case
mov r10d,DWORD PTR [rsi+0x8]
; implicit exception
cmp r10d,0x23757f ; {metadata(&apos;Alg1&apos;)}
je L0
cmp r10d,0x2375c2 ; {metadata(&apos;Alg2&apos;)}
jne L1
imul eax,edx,0x13 ; - Alg2::compute
jmp 0x00007f48986678ce ; return
L0: mov eax,edx ; - Alg1::compute
shl eax,0x4
add eax,edx
L1: call 0x00007f4890ba1300 ;*invokevirtual compute
;{runtime_call UncommonTrapBlob}
44
Alg1
Alg2
www.ionutbalosin.com
C2 JIT, level 4, execute() ; generated assembly - bimorphic case
mov r10d,DWORD PTR [rsi+0x8]
; implicit exception
cmp r10d,0x23757f ; {metadata(&apos;Alg1&apos;)}
je L0
cmp r10d,0x2375c2 ; {metadata(&apos;Alg2&apos;)}
jne L1
imul eax,edx,0x13 ; - Alg2::compute
jmp 0x00007f48986678ce ; return
L0: mov eax,edx ; - Alg1::compute
shl eax,0x4
add eax,edx
L1: call 0x00007f4890ba1300 ;*invokevirtual compute
;{runtime_call UncommonTrapBlob}
45
Alg1
Alg2
www.ionutbalosin.com
Graal JIT, execute() ; generated assembly - bimorphic case
cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007f4609ed6dc0
je L0
cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007f4609ed6dc8
je L1
jmp L2
L0: imul eax,edx,0x13 ; - Alg2::compute
ret
L1: mov eax,edx
shl eax,0x4
add eax,edx ; - Alg1::compute
ret
L2: call 0x00007f45ffba10a4 ;*invokevirtual compute
;{runtime_call DeoptimizationBlob}
46
www.ionutbalosin.com
Graal JIT, execute() ; generated assembly - bimorphic case
cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007f4609ed6dc0
je L0
cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007f4609ed6dc8
je L1
jmp L2
L0: imul eax,edx,0x13 ; - Alg2::compute
ret
L1: mov eax,edx
shl eax,0x4
add eax,edx ; - Alg1::compute
ret
L2: call 0x00007f45ffba10a4 ;*invokevirtual compute
;{runtime_call DeoptimizationBlob}
47
Alg1
www.ionutbalosin.com
Graal JIT, execute() ; generated assembly - bimorphic case
cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007f4609ed6dc0
je L0
cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007f4609ed6dc8
je L1
jmp L2
L0: imul eax,edx,0x13 ; - Alg2::compute
ret
L1: mov eax,edx
shl eax,0x4
add eax,edx ; - Alg1::compute
ret
L2: call 0x00007f45ffba10a4 ;*invokevirtual compute
;{runtime_call DeoptimizationBlob}
48
Alg1
Alg2
www.ionutbalosin.com
Graal JIT, execute() ; generated assembly - bimorphic case
cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007f4609ed6dc0
je L0
cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007f4609ed6dc8
je L1
jmp L2
L0: imul eax,edx,0x13 ; - Alg2::compute
ret
L1: mov eax,edx
shl eax,0x4
add eax,edx ; - Alg1::compute
ret
L2: call 0x00007f45ffba10a4 ;*invokevirtual compute
;{runtime_call DeoptimizationBlob}
49
Alg1
Alg2
www.ionutbalosin.com
Conclusions - bimorphic call site
50
Both compilers (Graal JIT and C2 JIT) optimizes bimorphic call sites in
a similar approach:
- inlines implementations
- adds a deoptimization routine (Graal JIT) or an uncommon trap (C2
JIT) for another (unknown) possible target invocation.
www.ionutbalosin.comwww.ionutbalosin.com
megamorphic_3()
<<under the hood>>
51
www.ionutbalosin.com
C2 JIT, level 4, execute() ; generated assembly - megamorphic_3 case
movabs rax,0xffffffffffffffff
call 0x00007fd65cb9fb80 ;*invokevirtual compute
;{virtual_call}
52
Not any further optimization.
www.ionutbalosin.com
Graal JIT, execute() ; generated assembly - megamorphic_3 case
cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007fb849ee0bc0
je L0
cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007fb849ee0bc8
je L1
cmp rax,QWORD PTR [rip+0xffffffffffffffae] # 0x00007fb849ee0bd0
je L2
jmp L3
L0: ... ;- Alg1::compute
L1: ... ;- Alg3::compute
L2: ... ;- Alg2::compute
L3: call 0x00007fb83fba10a4 ;*invokevirtual compute
;{runtime_call DeoptimizationBlob}
53
www.ionutbalosin.com
Graal JIT, execute() ; generated assembly - megamorphic_3 case
cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007fb849ee0bc0
je L0
cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007fb849ee0bc8
je L1
cmp rax,QWORD PTR [rip+0xffffffffffffffae] # 0x00007fb849ee0bd0
je L2
jmp L3
L0: ... ;- Alg1::compute
L1: ... ;- Alg3::compute
L2: ... ;- Alg2::compute
L3: call 0x00007fb83fba10a4 ;*invokevirtual compute
;{runtime_call DeoptimizationBlob}
54
Alg1
www.ionutbalosin.com
Graal JIT, execute() ; generated assembly - megamorphic_3 case
cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007fb849ee0bc0
je L0
cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007fb849ee0bc8
je L1
cmp rax,QWORD PTR [rip+0xffffffffffffffae] # 0x00007fb849ee0bd0
je L2
jmp L3
L0: ... ;- Alg1::compute
L1: ... ;- Alg3::compute
L2: ... ;- Alg2::compute
L3: call 0x00007fb83fba10a4 ;*invokevirtual compute
;{runtime_call DeoptimizationBlob}
55
Alg1
Alg2
www.ionutbalosin.com
Graal JIT, execute() ; generated assembly - megamorphic_3 case
cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007fb849ee0bc0
je L0
cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007fb849ee0bc8
je L1
cmp rax,QWORD PTR [rip+0xffffffffffffffae] # 0x00007fb849ee0bd0
je L2
jmp L3
L0: ... ;- Alg1::compute
L1: ... ;- Alg3::compute
L2: ... ;- Alg2::compute
L3: call 0x00007fb83fba10a4 ;*invokevirtual compute
;{runtime_call DeoptimizationBlob}
56
Alg1
Alg2
Alg3
www.ionutbalosin.com
Graal JIT, execute() ; generated assembly - megamorphic_3 case
cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007fb849ee0bc0
je L0
cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007fb849ee0bc8
je L1
cmp rax,QWORD PTR [rip+0xffffffffffffffae] # 0x00007fb849ee0bd0
je L2
jmp L3
L0: ... ;- Alg1::compute
L1: ... ;- Alg3::compute
L2: ... ;- Alg2::compute
L3: call 0x00007fb83fba10a4 ;*invokevirtual compute
;{runtime_call DeoptimizationBlob}
57
Alg1
Alg2
Alg3
www.ionutbalosin.com
Conclusions - megamorphic_3 call site
58
Graal JIT optimizes three target method calls by:
- inlining the target implementations
- adding a deoptimization routine for another (unknown) possible
target invocation.
C2 JIT does not perform any further optimization, there is a virtual call
(e.g. vtable lookup) → due to “polluted profile” context (i.e. method is
used in many different contexts with independent operand types).
Note: megamorphic_4(), megamorphic_5(), and megamorphic_6() are
optimized based on similar principles by Graal JIT and C2 JIT.
www.ionutbalosin.comwww.ionutbalosin.com
Vectorization &
Loop Optimizations
www.ionutbalosin.com
Parallel execution
60
C[i]
B[i]
A[i]
x
=
C[i] C[i+1] C[i+2] C[i+3]
B[i]
A[i]
x
=
B[i+1]
A[i+1]
x
=
B[i+2]
A[i+2]
x
=
B[i+3]
A[i+3]
x
=
Scalar version processes
a single pair of operands
at a time.
Vectorized version carries out the
same instructions on multiple
pairs of operands at once.
www.ionutbalosin.com
Auto vectorization: modern compilers analyze loops in serial code to
identify if they could be vectorized (i.e. perform loop transformations).
Loops requirements for auto vectorization:
- countable loops → only unrolled loops
- single entry and exit
- straight-line code (e.g. no switch, if with masking is allowed)
- only for most inner loop (caution in case of loop interchange or loop
collapsing)
- no functions calls, but intrinsics
61
www.ionutbalosin.com
Countable loops
1) for (int i = start; i < limit; i += stride) {
// loop body
}
2) int i = start;
while (i < limit) {
// loop body
i += stride;
}
62
- limit is loop invariant
- stride is constant (compile-time known)
www.ionutbalosin.comwww.ionutbalosin.com
C[i] = A[i] x B[i]
63
www.ionutbalosin.com
@Param({"262144"})
private int size;
private float[] A, B, C;
@Benchmark
public float[] multiply_2_arrays_elements() {
for (int i = 0; i < size; i++) {
C[i] = A[i] * B[i];
}
return C;
}
64
www.ionutbalosin.com 65
Graal JIT
www.ionutbalosin.com
1 opr/lcyc
C2 JIT loop vectorization pattern
66
Scalar pre-loop
Vectorized
main-loop
Vectorized
post-loop
Scalar post-loop
1st opr
alignment (CPU caches) 4x8 opr/lcyc
1x8 opr/lcyc
e.g. YMM registers
e.g. YMM registers
Legend: opr → loop body (i.e. operand(s) / array element(s)); lcyc → loop cycle
www.ionutbalosin.com
C2 JIT, Scalar pre-loop (1 float operation/loop cycle)
L0000:
vmovss xmm2,DWORD PTR [rdi+r13*4+0x10] ;*faload
vmulss xmm2,xmm2,DWORD PTR [rsi+r13*4+0x10] ;*fmul
vmovss DWORD PTR [rax+r13*4+0x10],xmm2 ;*fastore
inc r13d ;*iinc
cmp r13d,ebx
jl L0000 ;*if_icmpge
/* Pattern: C[i] = A[i] x B[i]; i+=1 */
67
www.ionutbalosin.com
C2 JIT, Vectorized main-loop (4x8 float operations/loop cycle)
L0001:
vmovdqu ymm2,YMMWORD PTR [rdi+r13*4+0x10]
vmulps ymm2,ymm2,YMMWORD PTR [rsi+r13*4+0x10]
vmovdqu YMMWORD PTR [rax+r13*4+0x10],ymm2
movsxd rcx,r13d
...
vmovdqu ymm2,YMMWORD PTR [rdi+rcx*4+0x70]
vmulps ymm2,ymm2,YMMWORD PTR [rsi+rcx*4+0x70]
vmovdqu YMMWORD PTR [rax+rcx*4+0x70],ymm2 ;*fastore
add r13d,0x20 ;*iinc
cmp r13d,r9d
jl L0001 ;*if_icmpge
/* Pattern: C[i:i+7] = A[i:i+7] x B[i:i+7]; i+=32 */
68
www.ionutbalosin.com
C2 JIT, Vectorized post-loop (8 float operations/loop cycle)
L0002:
vmovdqu ymm2,YMMWORD PTR [rdi+r13*4+0x10]
vmulps ymm2,ymm2,YMMWORD PTR [rsi+r13*4+0x10]
vmovdqu YMMWORD PTR [rax+r13*4+0x10],ymm2 ;*fastore
add r13d,0x8 ;*iinc
cmp r13d,r9d
jl L0002 ;*if_icmpge
/* Pattern: C[i:i+7] = A[i:i+7] x B[i:i+7]; i+=8 */
69
www.ionutbalosin.com
C2 JIT, Scalar post-loop (1 float operation/loop cycle)
L0004:
vmovss xmm4,DWORD PTR [rdi+r13*4+0x10] ;*faload
vmulss xmm4,xmm4,DWORD PTR [rsi+r13*4+0x10] ;*fmul
vmovss DWORD PTR [rax+r13*4+0x10],xmm4 ;*fastore
inc r13d ;*iinc
cmp r13d,r11d
jl L0004 ;*iload_1
/* Pattern: C[i] = A[i] x B[i]; i+=1 */
70
www.ionutbalosin.com
1 opr/lcyc
Graal JIT loop optimization pattern
71
Loop peeling
Scalar loop
alignment (CPU caches)
Lack of loop vectorization support.
No loop unrolling.
Legend: opr → loop body (i.e. operand(s) / array element(s)); lcyc → loop cycle
www.ionutbalosin.com
Graal JIT, Loop peeling (2 float operations)
shl rcx,0x3 ;*getfield B
vmovss xmm0,DWORD PTR [rcx+r13*4+0x10] ;*faload
shl r8,0x3 ;*getfield A
vmulss xmm0,xmm0,DWORD PTR [r8+r13*4+0x10] ;*fmul
vmovss DWORD PTR [r11+r13*4+0x10],xmm0 ;*fastore
mov edx,r13d
inc edx ;*iinc
vmovss xmm0,DWORD PTR [rcx+rdx*4+0x10] ;*faload
vmulss xmm0,xmm0,DWORD PTR [r8+rdx*4+0x10] ;*fmul
vmovss DWORD PTR [r11+rdx*4+0x10],xmm0 ;*fastore
lea edx,[r13+0x2] ;*iinc
jmp L0001 ;main_loop
72
www.ionutbalosin.com
Graal JIT, Scalar loop (1 float operation/loop cycle)
L0000:
vmovss xmm0,DWORD PTR [rcx+rdx*4+0x10] ;*faload
vmulss xmm0,xmm0,DWORD PTR [r8+rdx*4+0x10] ;*fmul
vmovss DWORD PTR [r11+rdx*4+0x10],xmm0 ;*fastore
inc edx ;*iinc
L0001:
cmp r10d,edx
jg L0000 ;*if_icmpge
/* Pattern: C[i] = A[i] x B[i]; i+=1 */
73
www.ionutbalosin.com
Conclusions – multiply_2_arrays_elements
74
C2 JIT uses vectorization for:
- main loop - YMM AVX2 registers → 4x8 float operations/loop cycle,
- post loop - YMM AVX2 registers → 8 float operations/loop cycle.
C2 JIT handles the remaining post loop without unrolling and without
vectorization, just one by one until reaches the end.
Graal JIT has not vectorization support → this is a core feature missing
in GraalVM CE!
Graal JIT does not trigger loop unrolling (in this case).
www.ionutbalosin.comwww.ionutbalosin.com
C[l] = A[l] x B[l]
<<stride:long>>
75
www.ionutbalosin.com
@Param({"262144"})
private int size;
private float[] A, B, C;
@Benchmark
public float[] multiply_2_arrays_elements_long_stride() {
for (long l = 0; l < size; l++) {
C[(int) l] = A[(int) l] * B[(int) l];
}
return C;
}
76
www.ionutbalosin.com 77
www.ionutbalosin.com
C2 JIT loop optimization pattern
Graal JIT loop optimization pattern
78
1 opr/lcyc
Scalar loop
1 opr/lcyc
Scalar loop
Lack of loop vectorization support.
No loop unrolling.
Legend: opr → loop body (i.e. operand(s) / array element(s)); lcyc → loop cycle
www.ionutbalosin.com
C2 JIT, Scalar loop (1 float operation/loop cycle)
L0000:
vmovss DWORD PTR [rsi+rcx*4+0x10],xmm1 ;*goto
mov rcx,QWORD PTR [r15+0x108]
add r13,0x1 ;*ladd
test DWORD PTR [rcx],eax ;* SAFEPOINT POLL *
cmp r13,rdx
jge L0002 ;*ifge
mov ecx,r13d ;*l2i
vmovss xmm1,DWORD PTR [rax+rcx*4+0x10] ;*faload
vmulss xmm1,xmm1,DWORD PTR [r14+rcx*4+0x10] ;*fmul
cmp ecx,r9d
jb L0000
/* Pattern: C[i] = A[i] x B[i]; i+=1 */
79
www.ionutbalosin.com
Graal JIT, Scalar loop (1 float operation/loop cycle)
L0000:
mov esi,ecx ;*l2i
vmovss xmm0,DWORD PTR [rdi+rsi*4+0x10] ;*faload
vmulss xmm0,xmm0,DWORD PTR [r8+rsi*4+0x10] ;*fmul
vmovss DWORD PTR [r11+rsi*4+0x10],xmm0 ;*fastore
mov rsi,QWORD PTR [r15+0x108] ;*lload_1
test DWORD PTR [rsi],eax ;* SAFEPOINT POLL *
inc rcx ;*ladd
cmp r10,rcx
jg L0000 ;*ifge
/* Pattern: C[i] = A[i] x B[i]; i+=1 */
80
www.ionutbalosin.com
Conclusions – multiply_2_arrays_elements_long_stride
81
Both compilers (Graal JIT and C2 JIT) optimizes the float array
elements multiplication with long stride in a similar manner:
- one main scalar loop
- no loop unrolling
- no vectorization
- a safepoint poll is added (performance impact).
www.ionutbalosin.comwww.ionutbalosin.com
unpredictable_branch()
82
www.ionutbalosin.com
@Param({"4096"}) private int thresholdLimit;
private final int SIZE = 16384;
private int[] array = new int[SIZE];
// Initialize array elements with values between [0, thresholdLimit)
// e.g. array[i] = random.nextInt(thresholdLimit); // i = 0...SIZE
@Benchmark
public int unpredictable_branch() {
int sum = 0;
for (final int value : array) {
if (value <= (thresholdLimit / 2)) {
sum += value;
}
}
return sum;
}
83
www.ionutbalosin.com 84
www.ionutbalosin.com 85
Graal JIT
Graal JIT
www.ionutbalosin.com
C2 JIT loop optimization pattern
86
Loop peeling
Scalar
main-loop
Scalar
post-loop
alignment (CPU caches)
1st opr
1 opr/lcyc
e.g. unrolling factor = 8
8 opr/lcyc
Legend: opr → loop body (i.e. operand(s) / array element(s)); lcyc → loop cycle
www.ionutbalosin.com
C2 JIT, Loop peeling (1st array element)
mov ebp,DWORD PTR [rsi+0x14] ;*getfield array
mov r10d,DWORD PTR [r12+rbp*8+0xc] ;*arraylength
mov eax,DWORD PTR [r12+rbp*8+0x10] ;*iaload
mov ecx,DWORD PTR [rsi+0x10] ;*getfield thresholdLimit
mov edi,0x1 ;main loop start index
87
www.ionutbalosin.com
C2 JIT, Scalar main-loop (8 array elements/loop cycle)
L0000:
mov r10d,DWORD PTR [rbp+rdi*4+0x10] ;*iaload
mov r9d,DWORD PTR [rbp+rdi*4+0x14]
mov r8d,DWORD PTR [rbp+rdi*4+0x18]
mov ebx,DWORD PTR [rbp+rdi*4+0x1c]
mov ecx,DWORD PTR [rbp+rdi*4+0x20]
mov edx,DWORD PTR [rbp+rdi*4+0x24]
mov esi,DWORD PTR [rbp+rdi*4+0x28]
mov r11d,eax
add r11d,r10d
cmp r10d,r13d ; r13d = thresholdLimit/2
cmovle eax,r11d ;*iinc
mov r10d,DWORD PTR [rbp+rdi*4+0x2c] ;*iaload
...
add edi,0x8 ;*iinc
cmp edi,r14d
jl L0000 ;*if_icmpge
88
www.ionutbalosin.com
C2 JIT, Scalar post-loop (1 array element/loop cycle)
L0001:
mov r11d,DWORD PTR [rbp+rdi*4+0x10] ;*iaload
mov r9d,eax
add r9d,r11d
cmp r11d,r13d ; r13d = thresholdLimit/2
cmovle eax,r9d
inc edi ;*iinc
cmp edi,r10d
jl L0001 ;*if_icmpge
89
www.ionutbalosin.com
1 opr/lcyc
Graal JIT loop optimization pattern
90
Loop peeling
Scalar loop
alignment (CPU caches)
1st opr
No loop unrolling.
Legend: opr → loop body (i.e. operand(s) / array element(s)); lcyc → loop cycle
www.ionutbalosin.com
Graal JIT, Loop peeling (1st array element)
mov r11d,DWORD PTR [rsi+0x10] ;*getfield thresholdLimit
mov r8d,DWORD PTR [rax*8+0x10] ;*iaload
shl rax,0x3 ;*getfield array
mov r11d,0x1
91
www.ionutbalosin.com
Graal JIT, Scalar loop (1 array element/loop cycle)
L0000:
mov ecx,DWORD PTR [rax+r11*4+0x10] ;*iaload
cmp ecx,r9d ; r13d = thresholdLimit/2
jg L0003 ;*if_icmpgt
add r8d,ecx ;*iadd
inc r11d ;*iinc
L0001: cmp r10d,r11d
jg L0000 ;*if_icmpge
92
www.ionutbalosin.com
Conclusions – unpredictable_branch
93
C2 JIT optimizes the loop as follows:
- main loop - with an unrolling factor of 8
- post loop - one by one, until it reaches the end.
Graal JIT does not unroll the loop (second example without unrolling)!
www.ionutbalosin.comwww.ionutbalosin.com
Summary
www.ionutbalosin.com 95
Graal JIT C1/C2 JIT
Scalar Replacement
Virtual Calls
Vectorization
Loop Unrolling
Intrinsics
Scalar Replacement: nice optimizations in Graal JIT (e.g. Partial Escape Analysis).
Virtual Calls: better optimized by Graal JIT (e.g. megamorphic call sites).
Vectorization: core compiler feature “missing” in Graal JIT (Graal CE).
Loop unrolling: not found in current Graal JIT test cases. Missing or just limited!?
Intrinsics: not covered but the support is lower in Graal JIT. See reference.
Gracias
Thanks
Danke
Merci
Resources
Slides
https://ionutbalosin.com/talks
Further Readings
https://ionutbalosin.com/2019/04/jvm-jit-compilers-benchmarks-report-19-04/
Performance Characterization and Optimizations in Graal - Jean-Philippe Halimi
ionutbalosin.com
@ionutbalosin

More Related Content

What's hot

Java Crash分析(2012-05-10)
Java Crash分析(2012-05-10)Java Crash分析(2012-05-10)
Java Crash分析(2012-05-10)
Kris Mok
 
QGIS初級編2018
QGIS初級編2018QGIS初級編2018
QGIS初級編2018
Jyun Tanaka
 
Advanced backup methods (Postgres@CERN)
Advanced backup methods (Postgres@CERN)Advanced backup methods (Postgres@CERN)
Advanced backup methods (Postgres@CERN)
Anastasia Lubennikova
 
JVM: A Platform for Multiple Languages
JVM: A Platform for Multiple LanguagesJVM: A Platform for Multiple Languages
JVM: A Platform for Multiple Languages
Kris Mok
 
How many ways to monitor oracle golden gate - OOW14
How many ways to monitor oracle golden gate - OOW14How many ways to monitor oracle golden gate - OOW14
How many ways to monitor oracle golden gate - OOW14
Bobby Curtis
 
Oracle 12c Automatic Data Optimization (ADO) - ILM
Oracle 12c Automatic Data Optimization (ADO) - ILMOracle 12c Automatic Data Optimization (ADO) - ILM
Oracle 12c Automatic Data Optimization (ADO) - ILM
Monowar Mukul
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化areyouok
 
仮想化技術によるマルウェア対策とその問題点
仮想化技術によるマルウェア対策とその問題点仮想化技術によるマルウェア対策とその問題点
仮想化技術によるマルウェア対策とその問題点
Kuniyasu Suzaki
 
たのしいPowershell Empire
たのしいPowershell EmpireたのしいPowershell Empire
たのしいPowershell Empire
monochrojazz
 
Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...
Adrian Huang
 
Elastic JVM for Scalable Java EE Applications Running in Containers #Jakart...
Elastic JVM  for Scalable Java EE Applications  Running in Containers #Jakart...Elastic JVM  for Scalable Java EE Applications  Running in Containers #Jakart...
Elastic JVM for Scalable Java EE Applications Running in Containers #Jakart...
Jelastic Multi-Cloud PaaS
 
EnterpriseDB's Best Practices for Postgres DBAs
EnterpriseDB's Best Practices for Postgres DBAsEnterpriseDB's Best Practices for Postgres DBAs
EnterpriseDB's Best Practices for Postgres DBAs
EDB
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
Babak Farrokhi
 
UKOUG - 25 years of hints and tips
UKOUG - 25 years of hints and tipsUKOUG - 25 years of hints and tips
UKOUG - 25 years of hints and tips
Connor McDonald
 
計算スケジューリングの効果~もし,Halideがなかったら?~
計算スケジューリングの効果~もし,Halideがなかったら?~計算スケジューリングの効果~もし,Halideがなかったら?~
計算スケジューリングの効果~もし,Halideがなかったら?~
Norishige Fukushima
 
Rac 12c optimization
Rac 12c optimizationRac 12c optimization
Rac 12c optimization
Riyaj Shamsudeen
 
【B-4】オープンソース開発で、フリー静的解析ツールを使ってみる
【B-4】オープンソース開発で、フリー静的解析ツールを使ってみる【B-4】オープンソース開発で、フリー静的解析ツールを使ってみる
【B-4】オープンソース開発で、フリー静的解析ツールを使ってみる
Developers Summit
 
An overview of reference architectures for Postgres
An overview of reference architectures for PostgresAn overview of reference architectures for Postgres
An overview of reference architectures for Postgres
EDB
 
Spring batch
Spring batchSpring batch
Spring batch
Chandan Kumar Rana
 
3.Java EE7 徹底入門 CDI&EJB
3.Java EE7 徹底入門 CDI&EJB3.Java EE7 徹底入門 CDI&EJB
3.Java EE7 徹底入門 CDI&EJB
Tsunenaga Hanyuda
 

What's hot (20)

Java Crash分析(2012-05-10)
Java Crash分析(2012-05-10)Java Crash分析(2012-05-10)
Java Crash分析(2012-05-10)
 
QGIS初級編2018
QGIS初級編2018QGIS初級編2018
QGIS初級編2018
 
Advanced backup methods (Postgres@CERN)
Advanced backup methods (Postgres@CERN)Advanced backup methods (Postgres@CERN)
Advanced backup methods (Postgres@CERN)
 
JVM: A Platform for Multiple Languages
JVM: A Platform for Multiple LanguagesJVM: A Platform for Multiple Languages
JVM: A Platform for Multiple Languages
 
How many ways to monitor oracle golden gate - OOW14
How many ways to monitor oracle golden gate - OOW14How many ways to monitor oracle golden gate - OOW14
How many ways to monitor oracle golden gate - OOW14
 
Oracle 12c Automatic Data Optimization (ADO) - ILM
Oracle 12c Automatic Data Optimization (ADO) - ILMOracle 12c Automatic Data Optimization (ADO) - ILM
Oracle 12c Automatic Data Optimization (ADO) - ILM
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
 
仮想化技術によるマルウェア対策とその問題点
仮想化技術によるマルウェア対策とその問題点仮想化技術によるマルウェア対策とその問題点
仮想化技術によるマルウェア対策とその問題点
 
たのしいPowershell Empire
たのしいPowershell EmpireたのしいPowershell Empire
たのしいPowershell Empire
 
Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...Decompressed vmlinux: linux kernel initialization from page table configurati...
Decompressed vmlinux: linux kernel initialization from page table configurati...
 
Elastic JVM for Scalable Java EE Applications Running in Containers #Jakart...
Elastic JVM  for Scalable Java EE Applications  Running in Containers #Jakart...Elastic JVM  for Scalable Java EE Applications  Running in Containers #Jakart...
Elastic JVM for Scalable Java EE Applications Running in Containers #Jakart...
 
EnterpriseDB's Best Practices for Postgres DBAs
EnterpriseDB's Best Practices for Postgres DBAsEnterpriseDB's Best Practices for Postgres DBAs
EnterpriseDB's Best Practices for Postgres DBAs
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
 
UKOUG - 25 years of hints and tips
UKOUG - 25 years of hints and tipsUKOUG - 25 years of hints and tips
UKOUG - 25 years of hints and tips
 
計算スケジューリングの効果~もし,Halideがなかったら?~
計算スケジューリングの効果~もし,Halideがなかったら?~計算スケジューリングの効果~もし,Halideがなかったら?~
計算スケジューリングの効果~もし,Halideがなかったら?~
 
Rac 12c optimization
Rac 12c optimizationRac 12c optimization
Rac 12c optimization
 
【B-4】オープンソース開発で、フリー静的解析ツールを使ってみる
【B-4】オープンソース開発で、フリー静的解析ツールを使ってみる【B-4】オープンソース開発で、フリー静的解析ツールを使ってみる
【B-4】オープンソース開発で、フリー静的解析ツールを使ってみる
 
An overview of reference architectures for Postgres
An overview of reference architectures for PostgresAn overview of reference architectures for Postgres
An overview of reference architectures for Postgres
 
Spring batch
Spring batchSpring batch
Spring batch
 
3.Java EE7 徹底入門 CDI&EJB
3.Java EE7 徹底入門 CDI&EJB3.Java EE7 徹底入門 CDI&EJB
3.Java EE7 徹底入門 CDI&EJB
 

Similar to A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers better runtime performance?

100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects 100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects
Andrey Karpov
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
Tim Ellison
 
100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects
PVS-Studio
 
Handling Exceptions In C &amp; C++ [Part B] Ver 2
Handling Exceptions In C &amp; C++ [Part B] Ver 2Handling Exceptions In C &amp; C++ [Part B] Ver 2
Handling Exceptions In C &amp; C++ [Part B] Ver 2
ppd1961
 
Embedded C programming session10
Embedded C programming  session10Embedded C programming  session10
Embedded C programming session10
Keroles karam khalil
 
C programming session10
C programming  session10C programming  session10
C programming session10
Keroles karam khalil
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Jorisimec.archive
 
The Performance Engineer's Guide To HotSpot Just-in-Time Compilation
The Performance Engineer's Guide To HotSpot Just-in-Time CompilationThe Performance Engineer's Guide To HotSpot Just-in-Time Compilation
The Performance Engineer's Guide To HotSpot Just-in-Time Compilation
Monica Beckwith
 
HKG15-300: Art's Quick Compiler: An unofficial overview
HKG15-300: Art's Quick Compiler: An unofficial overviewHKG15-300: Art's Quick Compiler: An unofficial overview
HKG15-300: Art's Quick Compiler: An unofficial overview
Linaro
 
Java Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey KovalenkoJava Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey Kovalenko
Valeriia Maliarenko
 
BOS K8S Meetup - Finetuning LLama 2 Model on GKE.pdf
BOS K8S Meetup - Finetuning LLama 2 Model on GKE.pdfBOS K8S Meetup - Finetuning LLama 2 Model on GKE.pdf
BOS K8S Meetup - Finetuning LLama 2 Model on GKE.pdf
MichaelOLeary82
 
Analyzing the Dolphin-emu project
Analyzing the Dolphin-emu projectAnalyzing the Dolphin-emu project
Analyzing the Dolphin-emu project
PVS-Studio
 
How to write clean & testable code without losing your mind
How to write clean & testable code without losing your mindHow to write clean & testable code without losing your mind
How to write clean & testable code without losing your mind
Andreas Czakaj
 
Skiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in DSkiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in D
Mithun Hunsur
 
Browser exploitation SEC-T 2019 stockholm
Browser exploitation SEC-T 2019 stockholmBrowser exploitation SEC-T 2019 stockholm
Browser exploitation SEC-T 2019 stockholm
Jameel Nabbo
 
Two C++ Tools: Compiler Explorer and Cpp Insights
Two C++ Tools: Compiler Explorer and Cpp InsightsTwo C++ Tools: Compiler Explorer and Cpp Insights
Two C++ Tools: Compiler Explorer and Cpp Insights
Alison Chaiken
 
Goroutine stack and local variable allocation in Go
Goroutine stack and local variable allocation in GoGoroutine stack and local variable allocation in Go
Goroutine stack and local variable allocation in Go
Yu-Shuan Hsieh
 
maXbox Starter 45 Robotics
maXbox Starter 45 RoboticsmaXbox Starter 45 Robotics
maXbox Starter 45 Robotics
Max Kleiner
 
Improving app performance using .Net Core 3.0
Improving app performance using .Net Core 3.0Improving app performance using .Net Core 3.0
Improving app performance using .Net Core 3.0
Richard Banks
 
Top 10 bugs in C++ open source projects, checked in 2016
Top 10 bugs in C++ open source projects, checked in 2016Top 10 bugs in C++ open source projects, checked in 2016
Top 10 bugs in C++ open source projects, checked in 2016
PVS-Studio
 

Similar to A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers better runtime performance? (20)

100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects 100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 
100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects100 bugs in Open Source C/C++ projects
100 bugs in Open Source C/C++ projects
 
Handling Exceptions In C &amp; C++ [Part B] Ver 2
Handling Exceptions In C &amp; C++ [Part B] Ver 2Handling Exceptions In C &amp; C++ [Part B] Ver 2
Handling Exceptions In C &amp; C++ [Part B] Ver 2
 
Embedded C programming session10
Embedded C programming  session10Embedded C programming  session10
Embedded C programming session10
 
C programming session10
C programming  session10C programming  session10
C programming session10
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
 
The Performance Engineer's Guide To HotSpot Just-in-Time Compilation
The Performance Engineer's Guide To HotSpot Just-in-Time CompilationThe Performance Engineer's Guide To HotSpot Just-in-Time Compilation
The Performance Engineer's Guide To HotSpot Just-in-Time Compilation
 
HKG15-300: Art's Quick Compiler: An unofficial overview
HKG15-300: Art's Quick Compiler: An unofficial overviewHKG15-300: Art's Quick Compiler: An unofficial overview
HKG15-300: Art's Quick Compiler: An unofficial overview
 
Java Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey KovalenkoJava Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey Kovalenko
 
BOS K8S Meetup - Finetuning LLama 2 Model on GKE.pdf
BOS K8S Meetup - Finetuning LLama 2 Model on GKE.pdfBOS K8S Meetup - Finetuning LLama 2 Model on GKE.pdf
BOS K8S Meetup - Finetuning LLama 2 Model on GKE.pdf
 
Analyzing the Dolphin-emu project
Analyzing the Dolphin-emu projectAnalyzing the Dolphin-emu project
Analyzing the Dolphin-emu project
 
How to write clean & testable code without losing your mind
How to write clean & testable code without losing your mindHow to write clean & testable code without losing your mind
How to write clean & testable code without losing your mind
 
Skiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in DSkiron - Experiments in CPU Design in D
Skiron - Experiments in CPU Design in D
 
Browser exploitation SEC-T 2019 stockholm
Browser exploitation SEC-T 2019 stockholmBrowser exploitation SEC-T 2019 stockholm
Browser exploitation SEC-T 2019 stockholm
 
Two C++ Tools: Compiler Explorer and Cpp Insights
Two C++ Tools: Compiler Explorer and Cpp InsightsTwo C++ Tools: Compiler Explorer and Cpp Insights
Two C++ Tools: Compiler Explorer and Cpp Insights
 
Goroutine stack and local variable allocation in Go
Goroutine stack and local variable allocation in GoGoroutine stack and local variable allocation in Go
Goroutine stack and local variable allocation in Go
 
maXbox Starter 45 Robotics
maXbox Starter 45 RoboticsmaXbox Starter 45 Robotics
maXbox Starter 45 Robotics
 
Improving app performance using .Net Core 3.0
Improving app performance using .Net Core 3.0Improving app performance using .Net Core 3.0
Improving app performance using .Net Core 3.0
 
Top 10 bugs in C++ open source projects, checked in 2016
Top 10 bugs in C++ open source projects, checked in 2016Top 10 bugs in C++ open source projects, checked in 2016
Top 10 bugs in C++ open source projects, checked in 2016
 

More from J On The Beach

Massively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard wayMassively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard way
J On The Beach
 
Big Data On Data You Don’t Have
Big Data On Data You Don’t HaveBig Data On Data You Don’t Have
Big Data On Data You Don’t Have
J On The Beach
 
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
J On The Beach
 
Pushing it to the edge in IoT
Pushing it to the edge in IoTPushing it to the edge in IoT
Pushing it to the edge in IoT
J On The Beach
 
Drinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actorsDrinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actors
J On The Beach
 
How do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server patternHow do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server pattern
J On The Beach
 
Java, Turbocharged
Java, TurbochargedJava, Turbocharged
Java, Turbocharged
J On The Beach
 
When Cloud Native meets the Financial Sector
When Cloud Native meets the Financial SectorWhen Cloud Native meets the Financial Sector
When Cloud Native meets the Financial Sector
J On The Beach
 
The big data Universe. Literally.
The big data Universe. Literally.The big data Universe. Literally.
The big data Universe. Literally.
J On The Beach
 
Streaming to a New Jakarta EE
Streaming to a New Jakarta EEStreaming to a New Jakarta EE
Streaming to a New Jakarta EE
J On The Beach
 
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
J On The Beach
 
Pushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and BlazorPushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and Blazor
J On The Beach
 
Axon Server went RAFTing
Axon Server went RAFTingAxon Server went RAFTing
Axon Server went RAFTing
J On The Beach
 
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
J On The Beach
 
Madaari : Ordering For The Monkeys
Madaari : Ordering For The MonkeysMadaari : Ordering For The Monkeys
Madaari : Ordering For The Monkeys
J On The Beach
 
Servers are doomed to fail
Servers are doomed to failServers are doomed to fail
Servers are doomed to fail
J On The Beach
 
Interaction Protocols: It's all about good manners
Interaction Protocols: It's all about good mannersInteraction Protocols: It's all about good manners
Interaction Protocols: It's all about good manners
J On The Beach
 
Leadership at every level
Leadership at every levelLeadership at every level
Leadership at every level
J On The Beach
 
Machine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesMachine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind Libraries
J On The Beach
 
Getting started with Deep Reinforcement Learning
Getting started with Deep Reinforcement LearningGetting started with Deep Reinforcement Learning
Getting started with Deep Reinforcement Learning
J On The Beach
 

More from J On The Beach (20)

Massively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard wayMassively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard way
 
Big Data On Data You Don’t Have
Big Data On Data You Don’t HaveBig Data On Data You Don’t Have
Big Data On Data You Don’t Have
 
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
Acoustic Time Series in Industry 4.0: Improved Reliability and Cyber-Security...
 
Pushing it to the edge in IoT
Pushing it to the edge in IoTPushing it to the edge in IoT
Pushing it to the edge in IoT
 
Drinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actorsDrinking from the firehose, with virtual streams and virtual actors
Drinking from the firehose, with virtual streams and virtual actors
 
How do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server patternHow do we deploy? From Punched cards to Immutable server pattern
How do we deploy? From Punched cards to Immutable server pattern
 
Java, Turbocharged
Java, TurbochargedJava, Turbocharged
Java, Turbocharged
 
When Cloud Native meets the Financial Sector
When Cloud Native meets the Financial SectorWhen Cloud Native meets the Financial Sector
When Cloud Native meets the Financial Sector
 
The big data Universe. Literally.
The big data Universe. Literally.The big data Universe. Literally.
The big data Universe. Literally.
 
Streaming to a New Jakarta EE
Streaming to a New Jakarta EEStreaming to a New Jakarta EE
Streaming to a New Jakarta EE
 
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
The TIPPSS Imperative for IoT - Ensuring Trust, Identity, Privacy, Protection...
 
Pushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and BlazorPushing AI to the Client with WebAssembly and Blazor
Pushing AI to the Client with WebAssembly and Blazor
 
Axon Server went RAFTing
Axon Server went RAFTingAxon Server went RAFTing
Axon Server went RAFTing
 
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
The Six Pitfalls of building a Microservices Architecture (and how to avoid t...
 
Madaari : Ordering For The Monkeys
Madaari : Ordering For The MonkeysMadaari : Ordering For The Monkeys
Madaari : Ordering For The Monkeys
 
Servers are doomed to fail
Servers are doomed to failServers are doomed to fail
Servers are doomed to fail
 
Interaction Protocols: It's all about good manners
Interaction Protocols: It's all about good mannersInteraction Protocols: It's all about good manners
Interaction Protocols: It's all about good manners
 
Leadership at every level
Leadership at every levelLeadership at every level
Leadership at every level
 
Machine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind LibrariesMachine Learning: The Bare Math Behind Libraries
Machine Learning: The Bare Math Behind Libraries
 
Getting started with Deep Reinforcement Learning
Getting started with Deep Reinforcement LearningGetting started with Deep Reinforcement Learning
Getting started with Deep Reinforcement Learning
 

Recently uploaded

Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
NaapbooksPrivateLimi
 
Software Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdfSoftware Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdf
MayankTawar1
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024
Sharepoint Designs
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
ayushiqss
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
varshanayak241
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 

Recently uploaded (20)

Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Visitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
 
Software Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdfSoftware Testing Exam imp Ques Notes.pdf
Software Testing Exam imp Ques Notes.pdf
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
 
Strategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptxStrategies for Successful Data Migration Tools.pptx
Strategies for Successful Data Migration Tools.pptx
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 

A race of two compilers: GraalVM JIT versus HotSpot JIT C2. Which one offers better runtime performance?

  • 1. A race of two compilers: Graal JIT vs. C2 JIT. Which one offers better runtime performance? by Ionuţ Baloşin ionutbalosin.com @ionutbalosin
  • 2. About Me Software Architect Technical Trainer Writer Speaker SOFTWARE ARCHITECTURE JAVA PERFORMANCE AND TUNING IonutBalosin.com
  • 3. Today’s Content 01 Intro 02 Scalar Replacement 03 Virtual Calls 04 Vectorization & Loop Optimizations 05 Summary
  • 6. www.ionutbalosin.com Configuration 6 CPU Intel i7-8550U Kaby Lake R MEMORY 16GB DDR4 2400 MHz OS / Kernel Ubuntu 18.10 / 4.18.0-17-generic Java Version 12 Linux-x64 JVM / CE JMH v1.21 5x10s warm-up iterations, 5x10s measurement iterations, 3 JVM forks, single threaded Note: C2 JIT might need less timings, being a native Compiler.
  • 7. www.ionutbalosin.com This talk is NOT about EE. Current optimizations might differ for other JVMs. 7
  • 9. www.ionutbalosin.com Fair comparison? 9 to Well … it depends from the perspective: Application/end user: performance and (licensing) costs are the leading factors Compiler/internals (e.g. implementation, maturity): very different
  • 10. www.ionutbalosin.com Graal JIT ✓ Java implementation ✓ still early stage (recently become production ready; e.g. v19.0.0) ✓ targets JVM-based languages (e.g. Java, Scala, Kotlin), dynamic languages (e.g. JavaScript, Ruby, R) but also native languages (e.g. C/C++, Rust) ✓ modular design (JVMCI plug-in arch) ✓ Ahead Of Time flavor 10
  • 11. www.ionutbalosin.com Graal JIT ✓ Java implementation ✓ still early stage (recently become production ready; e.g. v19.0.0) ✓ targets JVM-based languages (e.g. Java, Scala, Kotlin), dynamic languages (e.g. JavaScript, Ruby, R) but also native languages (e.g. C/C++, Rust) ✓ modular design (JVMCI plug-in arch) ✓ Ahead Of Time flavor C2 JIT ✓ C++ implementation ✓ mature / production grade runtime ✓ written mainly to optimize Java source code ✓ is essentially a local maximum of performance and is reaching the end of its design lifetime ✓ latest improvements were more focused on intrinsics and vectorization, rather than adding new optimization heuristics 11
  • 12. www.ionutbalosin.com Nevertheless, nowadays the baseline for Graal JIT still remains C2 JIT! The target is to leverage its performance to be at least on par with C2 JIT especially on real world applications / production code. 12
  • 13. www.ionutbalosin.com There are 3 main distinguishing factors for Graal JIT: 13 Improved inlining Partial Escape Analysis Guard optimizations for efficient handling of speculative code
  • 14. www.ionutbalosin.com 14 1. Partial Escape Analysis (PEA) determines the escapability of objects on individual branches and reduces object allocations even if the objects escapes on some paths. PEA works on compiler’s intermediate representation not on bytecode. It is effective in conjunction with other parts of the compiler, such as inlining, global value numbering, and constant folding.
  • 15. www.ionutbalosin.com 15 2. Improved inlining strategy explores the call tree and performs late inlining as default instead of inlining during bytecode parsing. Graal JIT devise three novel heuristics: adaptive decision thresholds, callsite clustering, and deep inlining trials.
  • 16. www.ionutbalosin.com 16 3. Guard optimizations for efficient handling of speculative code has its biggest impact on the JavaScript, Ruby, and R implementations built on Graal’s multi-language framework Truffle.
  • 18. www.ionutbalosin.com Object scopes: ✓ NoEscape (local to current method/thread) ✓ ArgEscape (passed as an argument but not visible by other threads) ✓ GlobalEscape (visible outside of current method/thread) 18 Escape Analysis (EA) analyses the scope of a new object and decides whether it might be allocated or not on the heap. EA is not an optimization but rather an analysis phase for the optimizer. The result of this analisys allows compiler to perform optimizations like object allocations, synchronization primitives and field accesses.
  • 19. www.ionutbalosin.com For NoEscape objects compiler can remap the accesses to the object fields to accesses to synthetic local operands: Scalar Replacement optimization. 19
  • 20. www.ionutbalosin.com @Param({"3"}) private int param; @Param({"128"}) private int size; @Benchmark public int no_escape() { LightWrapper w = new LightWrapper(param); return w.f1 + w.f2; } public static class LightWrapper { public int f1; public int f2; //... } 20 CASE I
  • 21. www.ionutbalosin.com @Param({"3"}) private int param; @Param({"128"}) private int size; @Benchmark public int no_escape() { LightWrapper w = new LightWrapper(param); return w.f1 + w.f2; } public static class LightWrapper { public int f1; public int f2; //... } 21 equivalent to return f1 + f2; CASE I
  • 22. www.ionutbalosin.com @Param({"3"}) private int param; @Param({"128"}) private int size; @Benchmark public int no_escape_object_array() { HeavyWrapper w = new HeavyWrapper(param, size); return w.f1 + w.f2 + w.z.length; } public static class HeavyWrapper { public int f1; public int f2; public byte z[]; // ... } 22 CASE II
  • 23. www.ionutbalosin.com @Param({"3"}) private int param; @Param({"128"}) private int size; @Benchmark public int no_escape_object_array() { HeavyWrapper w = new HeavyWrapper(param, size); return w.f1 + w.f2 + w.z.length; } public static class HeavyWrapper { public int f1; public int f2; public byte z[]; // ... } 23 equivalent to return f1 + f2 + length; CASE II
  • 24. www.ionutbalosin.com @Param(value = {"false"}) private boolean objectEscapes; @Benchmark public void branch_escape_object(Blackhole bh) { HeavyWrapper wrapper = new HeavyWrapper(param, size); HeavyWrapper result; if (objectEscapes) // false result = wrapper; else result = globalCachedWrapper; bh.consume(result); } 24 CASE III
  • 25. www.ionutbalosin.com @Param(value = {"false"}) private boolean objectEscapes; @Benchmark public void branch_escape_object(Blackhole bh) { HeavyWrapper wrapper = new HeavyWrapper(param, size); HeavyWrapper result; if (objectEscapes) // false result = wrapper; else result = globalCachedWrapper; bh.consume(result); } 25 CASE III branch is never taken, wrapper never escapes, its scope is NoEscape
  • 26. www.ionutbalosin.com @Param(value = {"false"}) private boolean objectEscapes; @Benchmark public void branch_escape_object(Blackhole bh) { HeavyWrapper wrapper = new HeavyWrapper(param, size); HeavyWrapper result; if (objectEscapes) // false result = wrapper; else result = globalCachedWrapper; bh.consume(result); } 26 branch is never taken, wrapper never escapes, its scope is NoEscape CASE III Based on Partial Escape Analysis, the wrapper object allocation is eliminated!
  • 27. www.ionutbalosin.com @Benchmark public boolean arg_escape_object() { HeavyWrapper wrapper1 = new HeavyWrapper(param, size); HeavyWrapper wrapper2 = new HeavyWrapper(param, size); boolean match = false; if (wrapper1.equals(wrapper2)) match = true; return match; } 27 CASE IV
  • 28. www.ionutbalosin.com @Benchmark public boolean arg_escape_object() { HeavyWrapper wrapper1 = new HeavyWrapper(param, size); HeavyWrapper wrapper2 = new HeavyWrapper(param, size); boolean match = false; if (wrapper1.equals(wrapper2)) match = true; return match; } 28 wrapper1 scope is NoEscape wrapper2 scope is: - NoEscape, if inlining of equals() succeeds - ArgEscape, if inlining fails or disabled CASE IV
  • 31. www.ionutbalosin.com Conclusions 31 Graal JIT was able to get rid of heap allocations offering a constant response time, around 3 ns/op. C1/C2 JIT struggles to get rid of the heap allocations if: - there is a condition which does not make obvious at compile time if the object escapes, - or if the object scope becomes local (i.e. NoEscape) after inlining. C1/C2 JIT by default does not consider escaping arrays if their size (i.e. the number of elements) is greater than 64. -XX:EliminateAllocationArraySizeLimit=<arraySize>
  • 33. www.ionutbalosin.com Monomorphic call site - optimistically points to only one concrete method that has ever been used at that particular call site. 33
  • 34. www.ionutbalosin.com Bimorphic call site - optimistically points to only two concrete methods which can be invoked at a particular call site. 34
  • 35. www.ionutbalosin.com Megamorphic call site - optimistically points to three or possibly more methods which can be invoked at a particular call site. 35 . . .
  • 36. www.ionutbalosin.com // CMath = {Alg1, Alg2, Alg3, Alg4, Alg5, Alg6} public int execute(CMath cmath, int i) { return cmath.compute(i); // param * constant } 36 megamorphic call site
  • 37. www.ionutbalosin.com @Benchmark public int monomorphic() { return execute(alg1, param) // param * 17 } @Benchmark @OperationsPerInvocation(2) public int bimorphic() { return execute(alg1, param) + // param * 17 execute(alg2, param) // param * 19 } 37
  • 38. www.ionutbalosin.com @Benchmark @OperationsPerInvocation(3) public int megamorphic_3() { return execute(alg1, param) + // param * 17 execute(alg2, param) + // param * 19 execute(alg3, param) // param * 23 } /* Similar test cases for: - megamorphic_4() {...} - megamorphic_5() {...} - megamorphic_6() {...} */ 38
  • 40. www.ionutbalosin.com 40 C1/C2 JIT C1/C2 JIT C1/C2 JIT C1/C2 JIT
  • 42. www.ionutbalosin.com C2 JIT, level 4, execute() ; generated assembly - bimorphic case mov r10d,DWORD PTR [rsi+0x8] ; implicit exception cmp r10d,0x23757f ; {metadata(&apos;Alg1&apos;)} je L0 cmp r10d,0x2375c2 ; {metadata(&apos;Alg2&apos;)} jne L1 imul eax,edx,0x13 ; - Alg2::compute jmp 0x00007f48986678ce ; return L0: mov eax,edx ; - Alg1::compute shl eax,0x4 add eax,edx L1: call 0x00007f4890ba1300 ;*invokevirtual compute ;{runtime_call UncommonTrapBlob} 42
  • 43. www.ionutbalosin.com C2 JIT, level 4, execute() ; generated assembly - bimorphic case mov r10d,DWORD PTR [rsi+0x8] ; implicit exception cmp r10d,0x23757f ; {metadata(&apos;Alg1&apos;)} je L0 cmp r10d,0x2375c2 ; {metadata(&apos;Alg2&apos;)} jne L1 imul eax,edx,0x13 ; - Alg2::compute jmp 0x00007f48986678ce ; return L0: mov eax,edx ; - Alg1::compute shl eax,0x4 add eax,edx L1: call 0x00007f4890ba1300 ;*invokevirtual compute ;{runtime_call UncommonTrapBlob} 43 Alg1
  • 44. www.ionutbalosin.com C2 JIT, level 4, execute() ; generated assembly - bimorphic case mov r10d,DWORD PTR [rsi+0x8] ; implicit exception cmp r10d,0x23757f ; {metadata(&apos;Alg1&apos;)} je L0 cmp r10d,0x2375c2 ; {metadata(&apos;Alg2&apos;)} jne L1 imul eax,edx,0x13 ; - Alg2::compute jmp 0x00007f48986678ce ; return L0: mov eax,edx ; - Alg1::compute shl eax,0x4 add eax,edx L1: call 0x00007f4890ba1300 ;*invokevirtual compute ;{runtime_call UncommonTrapBlob} 44 Alg1 Alg2
  • 45. www.ionutbalosin.com C2 JIT, level 4, execute() ; generated assembly - bimorphic case mov r10d,DWORD PTR [rsi+0x8] ; implicit exception cmp r10d,0x23757f ; {metadata(&apos;Alg1&apos;)} je L0 cmp r10d,0x2375c2 ; {metadata(&apos;Alg2&apos;)} jne L1 imul eax,edx,0x13 ; - Alg2::compute jmp 0x00007f48986678ce ; return L0: mov eax,edx ; - Alg1::compute shl eax,0x4 add eax,edx L1: call 0x00007f4890ba1300 ;*invokevirtual compute ;{runtime_call UncommonTrapBlob} 45 Alg1 Alg2
  • 46. www.ionutbalosin.com Graal JIT, execute() ; generated assembly - bimorphic case cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007f4609ed6dc0 je L0 cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007f4609ed6dc8 je L1 jmp L2 L0: imul eax,edx,0x13 ; - Alg2::compute ret L1: mov eax,edx shl eax,0x4 add eax,edx ; - Alg1::compute ret L2: call 0x00007f45ffba10a4 ;*invokevirtual compute ;{runtime_call DeoptimizationBlob} 46
  • 47. www.ionutbalosin.com Graal JIT, execute() ; generated assembly - bimorphic case cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007f4609ed6dc0 je L0 cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007f4609ed6dc8 je L1 jmp L2 L0: imul eax,edx,0x13 ; - Alg2::compute ret L1: mov eax,edx shl eax,0x4 add eax,edx ; - Alg1::compute ret L2: call 0x00007f45ffba10a4 ;*invokevirtual compute ;{runtime_call DeoptimizationBlob} 47 Alg1
  • 48. www.ionutbalosin.com Graal JIT, execute() ; generated assembly - bimorphic case cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007f4609ed6dc0 je L0 cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007f4609ed6dc8 je L1 jmp L2 L0: imul eax,edx,0x13 ; - Alg2::compute ret L1: mov eax,edx shl eax,0x4 add eax,edx ; - Alg1::compute ret L2: call 0x00007f45ffba10a4 ;*invokevirtual compute ;{runtime_call DeoptimizationBlob} 48 Alg1 Alg2
  • 49. www.ionutbalosin.com Graal JIT, execute() ; generated assembly - bimorphic case cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007f4609ed6dc0 je L0 cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007f4609ed6dc8 je L1 jmp L2 L0: imul eax,edx,0x13 ; - Alg2::compute ret L1: mov eax,edx shl eax,0x4 add eax,edx ; - Alg1::compute ret L2: call 0x00007f45ffba10a4 ;*invokevirtual compute ;{runtime_call DeoptimizationBlob} 49 Alg1 Alg2
  • 50. www.ionutbalosin.com Conclusions - bimorphic call site 50 Both compilers (Graal JIT and C2 JIT) optimizes bimorphic call sites in a similar approach: - inlines implementations - adds a deoptimization routine (Graal JIT) or an uncommon trap (C2 JIT) for another (unknown) possible target invocation.
  • 52. www.ionutbalosin.com C2 JIT, level 4, execute() ; generated assembly - megamorphic_3 case movabs rax,0xffffffffffffffff call 0x00007fd65cb9fb80 ;*invokevirtual compute ;{virtual_call} 52 Not any further optimization.
  • 53. www.ionutbalosin.com Graal JIT, execute() ; generated assembly - megamorphic_3 case cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007fb849ee0bc0 je L0 cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007fb849ee0bc8 je L1 cmp rax,QWORD PTR [rip+0xffffffffffffffae] # 0x00007fb849ee0bd0 je L2 jmp L3 L0: ... ;- Alg1::compute L1: ... ;- Alg3::compute L2: ... ;- Alg2::compute L3: call 0x00007fb83fba10a4 ;*invokevirtual compute ;{runtime_call DeoptimizationBlob} 53
  • 54. www.ionutbalosin.com Graal JIT, execute() ; generated assembly - megamorphic_3 case cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007fb849ee0bc0 je L0 cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007fb849ee0bc8 je L1 cmp rax,QWORD PTR [rip+0xffffffffffffffae] # 0x00007fb849ee0bd0 je L2 jmp L3 L0: ... ;- Alg1::compute L1: ... ;- Alg3::compute L2: ... ;- Alg2::compute L3: call 0x00007fb83fba10a4 ;*invokevirtual compute ;{runtime_call DeoptimizationBlob} 54 Alg1
  • 55. www.ionutbalosin.com Graal JIT, execute() ; generated assembly - megamorphic_3 case cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007fb849ee0bc0 je L0 cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007fb849ee0bc8 je L1 cmp rax,QWORD PTR [rip+0xffffffffffffffae] # 0x00007fb849ee0bd0 je L2 jmp L3 L0: ... ;- Alg1::compute L1: ... ;- Alg3::compute L2: ... ;- Alg2::compute L3: call 0x00007fb83fba10a4 ;*invokevirtual compute ;{runtime_call DeoptimizationBlob} 55 Alg1 Alg2
  • 56. www.ionutbalosin.com Graal JIT, execute() ; generated assembly - megamorphic_3 case cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007fb849ee0bc0 je L0 cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007fb849ee0bc8 je L1 cmp rax,QWORD PTR [rip+0xffffffffffffffae] # 0x00007fb849ee0bd0 je L2 jmp L3 L0: ... ;- Alg1::compute L1: ... ;- Alg3::compute L2: ... ;- Alg2::compute L3: call 0x00007fb83fba10a4 ;*invokevirtual compute ;{runtime_call DeoptimizationBlob} 56 Alg1 Alg2 Alg3
  • 57. www.ionutbalosin.com Graal JIT, execute() ; generated assembly - megamorphic_3 case cmp rax,QWORD PTR [rip+0xffffffffffffffb8] # 0x00007fb849ee0bc0 je L0 cmp rax,QWORD PTR [rip+0xffffffffffffffb3] # 0x00007fb849ee0bc8 je L1 cmp rax,QWORD PTR [rip+0xffffffffffffffae] # 0x00007fb849ee0bd0 je L2 jmp L3 L0: ... ;- Alg1::compute L1: ... ;- Alg3::compute L2: ... ;- Alg2::compute L3: call 0x00007fb83fba10a4 ;*invokevirtual compute ;{runtime_call DeoptimizationBlob} 57 Alg1 Alg2 Alg3
  • 58. www.ionutbalosin.com Conclusions - megamorphic_3 call site 58 Graal JIT optimizes three target method calls by: - inlining the target implementations - adding a deoptimization routine for another (unknown) possible target invocation. C2 JIT does not perform any further optimization, there is a virtual call (e.g. vtable lookup) → due to “polluted profile” context (i.e. method is used in many different contexts with independent operand types). Note: megamorphic_4(), megamorphic_5(), and megamorphic_6() are optimized based on similar principles by Graal JIT and C2 JIT.
  • 60. www.ionutbalosin.com Parallel execution 60 C[i] B[i] A[i] x = C[i] C[i+1] C[i+2] C[i+3] B[i] A[i] x = B[i+1] A[i+1] x = B[i+2] A[i+2] x = B[i+3] A[i+3] x = Scalar version processes a single pair of operands at a time. Vectorized version carries out the same instructions on multiple pairs of operands at once.
  • 61. www.ionutbalosin.com Auto vectorization: modern compilers analyze loops in serial code to identify if they could be vectorized (i.e. perform loop transformations). Loops requirements for auto vectorization: - countable loops → only unrolled loops - single entry and exit - straight-line code (e.g. no switch, if with masking is allowed) - only for most inner loop (caution in case of loop interchange or loop collapsing) - no functions calls, but intrinsics 61
  • 62. www.ionutbalosin.com Countable loops 1) for (int i = start; i < limit; i += stride) { // loop body } 2) int i = start; while (i < limit) { // loop body i += stride; } 62 - limit is loop invariant - stride is constant (compile-time known)
  • 64. www.ionutbalosin.com @Param({"262144"}) private int size; private float[] A, B, C; @Benchmark public float[] multiply_2_arrays_elements() { for (int i = 0; i < size; i++) { C[i] = A[i] * B[i]; } return C; } 64
  • 66. www.ionutbalosin.com 1 opr/lcyc C2 JIT loop vectorization pattern 66 Scalar pre-loop Vectorized main-loop Vectorized post-loop Scalar post-loop 1st opr alignment (CPU caches) 4x8 opr/lcyc 1x8 opr/lcyc e.g. YMM registers e.g. YMM registers Legend: opr → loop body (i.e. operand(s) / array element(s)); lcyc → loop cycle
  • 67. www.ionutbalosin.com C2 JIT, Scalar pre-loop (1 float operation/loop cycle) L0000: vmovss xmm2,DWORD PTR [rdi+r13*4+0x10] ;*faload vmulss xmm2,xmm2,DWORD PTR [rsi+r13*4+0x10] ;*fmul vmovss DWORD PTR [rax+r13*4+0x10],xmm2 ;*fastore inc r13d ;*iinc cmp r13d,ebx jl L0000 ;*if_icmpge /* Pattern: C[i] = A[i] x B[i]; i+=1 */ 67
  • 68. www.ionutbalosin.com C2 JIT, Vectorized main-loop (4x8 float operations/loop cycle) L0001: vmovdqu ymm2,YMMWORD PTR [rdi+r13*4+0x10] vmulps ymm2,ymm2,YMMWORD PTR [rsi+r13*4+0x10] vmovdqu YMMWORD PTR [rax+r13*4+0x10],ymm2 movsxd rcx,r13d ... vmovdqu ymm2,YMMWORD PTR [rdi+rcx*4+0x70] vmulps ymm2,ymm2,YMMWORD PTR [rsi+rcx*4+0x70] vmovdqu YMMWORD PTR [rax+rcx*4+0x70],ymm2 ;*fastore add r13d,0x20 ;*iinc cmp r13d,r9d jl L0001 ;*if_icmpge /* Pattern: C[i:i+7] = A[i:i+7] x B[i:i+7]; i+=32 */ 68
  • 69. www.ionutbalosin.com C2 JIT, Vectorized post-loop (8 float operations/loop cycle) L0002: vmovdqu ymm2,YMMWORD PTR [rdi+r13*4+0x10] vmulps ymm2,ymm2,YMMWORD PTR [rsi+r13*4+0x10] vmovdqu YMMWORD PTR [rax+r13*4+0x10],ymm2 ;*fastore add r13d,0x8 ;*iinc cmp r13d,r9d jl L0002 ;*if_icmpge /* Pattern: C[i:i+7] = A[i:i+7] x B[i:i+7]; i+=8 */ 69
  • 70. www.ionutbalosin.com C2 JIT, Scalar post-loop (1 float operation/loop cycle) L0004: vmovss xmm4,DWORD PTR [rdi+r13*4+0x10] ;*faload vmulss xmm4,xmm4,DWORD PTR [rsi+r13*4+0x10] ;*fmul vmovss DWORD PTR [rax+r13*4+0x10],xmm4 ;*fastore inc r13d ;*iinc cmp r13d,r11d jl L0004 ;*iload_1 /* Pattern: C[i] = A[i] x B[i]; i+=1 */ 70
  • 71. www.ionutbalosin.com 1 opr/lcyc Graal JIT loop optimization pattern 71 Loop peeling Scalar loop alignment (CPU caches) Lack of loop vectorization support. No loop unrolling. Legend: opr → loop body (i.e. operand(s) / array element(s)); lcyc → loop cycle
  • 72. www.ionutbalosin.com Graal JIT, Loop peeling (2 float operations) shl rcx,0x3 ;*getfield B vmovss xmm0,DWORD PTR [rcx+r13*4+0x10] ;*faload shl r8,0x3 ;*getfield A vmulss xmm0,xmm0,DWORD PTR [r8+r13*4+0x10] ;*fmul vmovss DWORD PTR [r11+r13*4+0x10],xmm0 ;*fastore mov edx,r13d inc edx ;*iinc vmovss xmm0,DWORD PTR [rcx+rdx*4+0x10] ;*faload vmulss xmm0,xmm0,DWORD PTR [r8+rdx*4+0x10] ;*fmul vmovss DWORD PTR [r11+rdx*4+0x10],xmm0 ;*fastore lea edx,[r13+0x2] ;*iinc jmp L0001 ;main_loop 72
  • 73. www.ionutbalosin.com Graal JIT, Scalar loop (1 float operation/loop cycle) L0000: vmovss xmm0,DWORD PTR [rcx+rdx*4+0x10] ;*faload vmulss xmm0,xmm0,DWORD PTR [r8+rdx*4+0x10] ;*fmul vmovss DWORD PTR [r11+rdx*4+0x10],xmm0 ;*fastore inc edx ;*iinc L0001: cmp r10d,edx jg L0000 ;*if_icmpge /* Pattern: C[i] = A[i] x B[i]; i+=1 */ 73
  • 74. www.ionutbalosin.com Conclusions – multiply_2_arrays_elements 74 C2 JIT uses vectorization for: - main loop - YMM AVX2 registers → 4x8 float operations/loop cycle, - post loop - YMM AVX2 registers → 8 float operations/loop cycle. C2 JIT handles the remaining post loop without unrolling and without vectorization, just one by one until reaches the end. Graal JIT has not vectorization support → this is a core feature missing in GraalVM CE! Graal JIT does not trigger loop unrolling (in this case).
  • 76. www.ionutbalosin.com @Param({"262144"}) private int size; private float[] A, B, C; @Benchmark public float[] multiply_2_arrays_elements_long_stride() { for (long l = 0; l < size; l++) { C[(int) l] = A[(int) l] * B[(int) l]; } return C; } 76
  • 78. www.ionutbalosin.com C2 JIT loop optimization pattern Graal JIT loop optimization pattern 78 1 opr/lcyc Scalar loop 1 opr/lcyc Scalar loop Lack of loop vectorization support. No loop unrolling. Legend: opr → loop body (i.e. operand(s) / array element(s)); lcyc → loop cycle
  • 79. www.ionutbalosin.com C2 JIT, Scalar loop (1 float operation/loop cycle) L0000: vmovss DWORD PTR [rsi+rcx*4+0x10],xmm1 ;*goto mov rcx,QWORD PTR [r15+0x108] add r13,0x1 ;*ladd test DWORD PTR [rcx],eax ;* SAFEPOINT POLL * cmp r13,rdx jge L0002 ;*ifge mov ecx,r13d ;*l2i vmovss xmm1,DWORD PTR [rax+rcx*4+0x10] ;*faload vmulss xmm1,xmm1,DWORD PTR [r14+rcx*4+0x10] ;*fmul cmp ecx,r9d jb L0000 /* Pattern: C[i] = A[i] x B[i]; i+=1 */ 79
  • 80. www.ionutbalosin.com Graal JIT, Scalar loop (1 float operation/loop cycle) L0000: mov esi,ecx ;*l2i vmovss xmm0,DWORD PTR [rdi+rsi*4+0x10] ;*faload vmulss xmm0,xmm0,DWORD PTR [r8+rsi*4+0x10] ;*fmul vmovss DWORD PTR [r11+rsi*4+0x10],xmm0 ;*fastore mov rsi,QWORD PTR [r15+0x108] ;*lload_1 test DWORD PTR [rsi],eax ;* SAFEPOINT POLL * inc rcx ;*ladd cmp r10,rcx jg L0000 ;*ifge /* Pattern: C[i] = A[i] x B[i]; i+=1 */ 80
  • 81. www.ionutbalosin.com Conclusions – multiply_2_arrays_elements_long_stride 81 Both compilers (Graal JIT and C2 JIT) optimizes the float array elements multiplication with long stride in a similar manner: - one main scalar loop - no loop unrolling - no vectorization - a safepoint poll is added (performance impact).
  • 83. www.ionutbalosin.com @Param({"4096"}) private int thresholdLimit; private final int SIZE = 16384; private int[] array = new int[SIZE]; // Initialize array elements with values between [0, thresholdLimit) // e.g. array[i] = random.nextInt(thresholdLimit); // i = 0...SIZE @Benchmark public int unpredictable_branch() { int sum = 0; for (final int value : array) { if (value <= (thresholdLimit / 2)) { sum += value; } } return sum; } 83
  • 86. www.ionutbalosin.com C2 JIT loop optimization pattern 86 Loop peeling Scalar main-loop Scalar post-loop alignment (CPU caches) 1st opr 1 opr/lcyc e.g. unrolling factor = 8 8 opr/lcyc Legend: opr → loop body (i.e. operand(s) / array element(s)); lcyc → loop cycle
  • 87. www.ionutbalosin.com C2 JIT, Loop peeling (1st array element) mov ebp,DWORD PTR [rsi+0x14] ;*getfield array mov r10d,DWORD PTR [r12+rbp*8+0xc] ;*arraylength mov eax,DWORD PTR [r12+rbp*8+0x10] ;*iaload mov ecx,DWORD PTR [rsi+0x10] ;*getfield thresholdLimit mov edi,0x1 ;main loop start index 87
  • 88. www.ionutbalosin.com C2 JIT, Scalar main-loop (8 array elements/loop cycle) L0000: mov r10d,DWORD PTR [rbp+rdi*4+0x10] ;*iaload mov r9d,DWORD PTR [rbp+rdi*4+0x14] mov r8d,DWORD PTR [rbp+rdi*4+0x18] mov ebx,DWORD PTR [rbp+rdi*4+0x1c] mov ecx,DWORD PTR [rbp+rdi*4+0x20] mov edx,DWORD PTR [rbp+rdi*4+0x24] mov esi,DWORD PTR [rbp+rdi*4+0x28] mov r11d,eax add r11d,r10d cmp r10d,r13d ; r13d = thresholdLimit/2 cmovle eax,r11d ;*iinc mov r10d,DWORD PTR [rbp+rdi*4+0x2c] ;*iaload ... add edi,0x8 ;*iinc cmp edi,r14d jl L0000 ;*if_icmpge 88
  • 89. www.ionutbalosin.com C2 JIT, Scalar post-loop (1 array element/loop cycle) L0001: mov r11d,DWORD PTR [rbp+rdi*4+0x10] ;*iaload mov r9d,eax add r9d,r11d cmp r11d,r13d ; r13d = thresholdLimit/2 cmovle eax,r9d inc edi ;*iinc cmp edi,r10d jl L0001 ;*if_icmpge 89
  • 90. www.ionutbalosin.com 1 opr/lcyc Graal JIT loop optimization pattern 90 Loop peeling Scalar loop alignment (CPU caches) 1st opr No loop unrolling. Legend: opr → loop body (i.e. operand(s) / array element(s)); lcyc → loop cycle
  • 91. www.ionutbalosin.com Graal JIT, Loop peeling (1st array element) mov r11d,DWORD PTR [rsi+0x10] ;*getfield thresholdLimit mov r8d,DWORD PTR [rax*8+0x10] ;*iaload shl rax,0x3 ;*getfield array mov r11d,0x1 91
  • 92. www.ionutbalosin.com Graal JIT, Scalar loop (1 array element/loop cycle) L0000: mov ecx,DWORD PTR [rax+r11*4+0x10] ;*iaload cmp ecx,r9d ; r13d = thresholdLimit/2 jg L0003 ;*if_icmpgt add r8d,ecx ;*iadd inc r11d ;*iinc L0001: cmp r10d,r11d jg L0000 ;*if_icmpge 92
  • 93. www.ionutbalosin.com Conclusions – unpredictable_branch 93 C2 JIT optimizes the loop as follows: - main loop - with an unrolling factor of 8 - post loop - one by one, until it reaches the end. Graal JIT does not unroll the loop (second example without unrolling)!
  • 95. www.ionutbalosin.com 95 Graal JIT C1/C2 JIT Scalar Replacement Virtual Calls Vectorization Loop Unrolling Intrinsics Scalar Replacement: nice optimizations in Graal JIT (e.g. Partial Escape Analysis). Virtual Calls: better optimized by Graal JIT (e.g. megamorphic call sites). Vectorization: core compiler feature “missing” in Graal JIT (Graal CE). Loop unrolling: not found in current Graal JIT test cases. Missing or just limited!? Intrinsics: not covered but the support is lower in Graal JIT. See reference.