JIT in modern runtimes
Challenges and solutions
Alexey Ragozin
Deutsche Bank
Presentation outline
Why dynamic languages are slow
 Virtual calls
 Untyped / Weak typed data
Two approaches to JIT
 Method based JIT
 Tracing JIT
JIT in HotSpot JVM
 Interpreter overview
 JIT dirty tricks
Good old C++
[Diagram: object, its VTABLE (00: methodA, 01: methodB, 02: methodC, 03: methodD) and the CODEOBJECT the slots point into]
Plain inheritance
Good old C++
Multiple inheritance
[Diagram: object with two vtables, the primary VTABLE (00: methodA, 01: methodB, 02: methodC, 03: methodD) and a secondary VTABLE (00: methodX, 01: methodY, 02: methodZ), both pointing into the CODEOBJECT]
Good old C++
More fun with multiple inheritance
[Diagram: two diamond-shaped hierarchies, A at the top, B and C in the middle, D at the bottom, illustrating the diamond problem of multiple inheritance]
Branch misprediction penalty
• Intel Nehalem – 17 cycles
• Intel Sandy/Ivy Bridge – 15 cycles
• Intel Haswell – 15–20 cycles
• AMD K8 / K10 – 13 cycles
• AMD Bulldozer – 19–22 cycles
http://www.agner.org/optimize/microarchitecture.pdf
Cost of virtual call
Two memory accesses before the actual jump
• Memory access is serialized
• CPU pipeline is blocked
Memory access timings
• L1 cache ~0.5 ns
• L2 cache ~7 ns
• RAM ~100 ns
Cost of virtual call
Fields are stored in a hash table
Access to a field
• Arithmetic operation
• Memory read
• Condition check
• Memory read
Cost of dynamic class metadata
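The cost listed above can be modelled with a tiny Java sketch (DynObject is a hypothetical stand-in for a dynamic language object, not a real VM class): every field access pays for a hash computation, a bucket read, a key comparison and a value read.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a dynamic-language object: fields live in a
// hash table instead of fixed offsets, so each access is a lookup.
public class DynObject {
    private final Map<String, Object> fields = new HashMap<>();

    public void set(String name, Object value) {
        fields.put(name, value);   // hash, probe, key compare, store
    }

    public Object get(String name) {
        return fields.get(name);   // hash, probe, key compare, load
    }

    public static void main(String[] args) {
        DynObject p = new DynObject();
        p.set("x", 42);
        System.out.println(p.get("x")); // every access repeats the lookup
    }
}
```

A statically typed field read, by contrast, compiles to a single load at a fixed offset.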
Are interpreters that slow?
switch(byteCode) {
case STORE: ...
case LOAD: ...
case ASTORE: ...
case ALOAD: ...
...
}
Fast interpreter in HotSpot JVM
Bytecode interpreter in HotSpot JVM
• Each bytecode instruction has a routine written
in assembly language
• Dispatch – a jump to the corresponding routine
• Each routine ends with a jump back to dispatch
 No stack frame is produced per instruction
 Dispatch table and code are well cached
 CPU pipeline is kept busy
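The dispatch loop can be sketched in Java (hypothetical opcodes, greatly simplified compared to HotSpot's assembly templates):

```java
// Minimal switch-dispatched stack-machine interpreter (illustrative only).
public class MiniInterp {
    static final int PUSH = 0, ADD = 1, HALT = 2;

    public static int run(int[] code) {
        int[] stack = new int[16];
        int sp = 0, pc = 0;
        while (true) {
            switch (code[pc++]) {                       // dispatch
                case PUSH: stack[sp++] = code[pc++]; break;
                case ADD:  stack[sp - 2] += stack[sp - 1]; sp--; break;
                case HALT: return stack[sp - 1];
                default:   throw new IllegalStateException("bad opcode");
            }
        }
    }

    public static void main(String[] args) {
        // computes 2 + 3
        System.out.println(run(new int[]{PUSH, 2, PUSH, 3, ADD, HALT})); // 5
    }
}
```

HotSpot's template interpreter performs the same dispatch with direct jumps in generated assembly, which is why no call frame is created per instruction.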
JIT compilation approaches
Classic
Method based compilation
+ runtime profiling
+ profiling driven optimization
Tracing JIT
Recording whole execution paths (trace)
+ fall back to interpreter if execution deviates from the path
+ maintain a tree of compiled traces
JIT compilation approaches
Classic
Method based compilation
– JVM, V8, Firefox IonMonkey
Tracing JIT
Recording whole execution paths (trace)
– Flash, TraceMonkey, PyPy, LuaJIT
Tracing JIT
Interpretation mode
• Record actions and branch conditions (recording a trace)
Profiling
• Detect “hot” traces
Trace compilation
• Non branching code is generated
• Guards instead of branching
• Whole trace optimization
• Guard violation – fall back to interpreter
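A guard can be sketched in Java (hypothetical code shapes; a real tracing JIT emits machine code): the branch condition observed while recording becomes a type guard, and a violated guard leaves the trace for the generic path.

```java
// Illustrative shape of a compiled trace with a guard (not real JIT output).
public class TraceSketch {
    // "Compiled" trace for: total += item, recorded with item as an Integer
    public static long compiledTrace(long total, Object item) {
        if (!(item instanceof Integer)) {        // guard, not a branch
            return interpreterFallback(total, item);
        }
        return total + (Integer) item;           // straight-line fast path
    }

    // Slow generic path standing in for the interpreter
    public static long interpreterFallback(long total, Object item) {
        return total + ((Number) item).longValue();
    }

    public static void main(String[] args) {
        long t = 0;
        for (Object o : new Object[]{1, 2, 3L}) {
            t = compiledTrace(t, o);             // the Long takes the fallback
        }
        System.out.println(t); // 6
    }
}
```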
Tracing JIT
Strengths
• Devirtualization and inlining
• Hash lookups are also “deconditioned”
• Efficient “hot loops” optimization
Weaknesses
• Tracing SLOWS down interpretation
• Long “warm up” time
Dynamic types problem
V8 – shadow classes
• Shadow classes are strongly typed
TraceMonkey – shape inference/property cache
• Inline caching in compiled code
LuaJIT – hash table access optimized trace
HREFK: if (hash[17].key != key) goto exit
HLOAD: x = hash[17].value
-or-
HSTORE: hash[17].value = x
References
1. LuaJIT
http://article.gmane.org/gmane.comp.lang.lua.general/58908
2. Incremental Dynamic Code Generation with Trace Trees
http://www.ics.uci.edu/~franz/Site/pubs-pdf/ICS-TR-06-16.pdf
3. V8 Design aspects
https://developers.google.com/v8/design
4. RPython
http://tratt.net/laurie/research/pubs/papers/bolz_tratt__the_impact_of_metatracing_on_vm_design_and_implementation.pdf
HotSpot JVM
HotSpot JVM JIT
• Fast interpreter
• Two JIT compilers (C1 / C2)
• Runtime profiling
• Deoptimization of compiled code on the fly
• On Stack Replacement (OSR)
Devirtualization
Call site profiling
• Monomorphic
– a single destination for the majority of calls
• Bimorphic
– two destinations dominate
• Polymorphic
Devirtualization
“Inline” method caching
if (list.getClass() == ArrayList.class) {
    /* NON-VIRTUAL */ list.ArrayList#size()
} else {
    /* VIRTUAL */ list.size();
}
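For a bimorphic call site the same check is simply chained; a sketch in the slide's pseudo-notation (the LinkedList case is hypothetical, chosen only for illustration):

```
if (list.getClass() == ArrayList.class) {
    /* NON-VIRTUAL */ list.ArrayList#size()
} else if (list.getClass() == LinkedList.class) {
    /* NON-VIRTUAL */ list.LinkedList#size()
} else {
    /* VIRTUAL */ list.size();
}
```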
Incremental compilation
Collections.indexedBinarySearch()
MyPojo
…
int mid = (low + high) >>> 1;
Comparable<? super T> midVal = list.get(mid);
int cmp = midVal.compareTo(key);
…
(both calls above – list.get() and midVal.compareTo() – are polymorphic)
List<String> keys = new ArrayList<String>();
List<String> vals = new ArrayList<String>();
public String get(String key) {
int n = Collections.binarySearch(keys, key);
return n < 0 ? null : vals.get(n);
}
Incremental compilation
 MyPojo.get() is compiled by JIT
– Collections.binarySearch() gets inlined
 Calls in Collections.binarySearch() become
monomorphic
 JIT continues profiling at runtime
 Calls to get() and compareTo() will be
inlined once MyPojo.get() is recompiled
On Stack Replacement
JIT can recompile main() and replace the return
address on the stack while execution is inside
some method called from the loop
public static void main() {
long s = System.nanoTime();
for(int i = 0; i != N; ++i) {
/* a lot of code */
...
}
long avg = (System.nanoTime() - s) / N;
}
Escape analysis
Heritage of old days – the dreaded synchronized
 buf is not used outside the method
 all methods of buf are inlined
 synchronization code can be removed
public String toString() {
StringBuffer buf = new StringBuffer();
buf.append("X=").append(x);
buf.append(",Y=").append(y);
return buf.toString();
}
Scalar replacement
After inlining distance() into length()
 JIT will replace the Point objects with a few scalar variables
public double length() {
return distance(
new Point(ax, ay),
new Point(bx, by));
}
public double distance(Point a, Point b) {
double w = a.x - b.x;
double h = a.y - b.y;
return Math.sqrt(w*w + h*h);
}
Garbage collection and JIT
JIT can inline final static fields
• The memory address is placed in compiled code
• GC treats compiled code much like a data structure
 Compiled methods act as GC roots
 GC will fix addresses inside compiled code if an object is relocated
public class Singleton {
public static final
Singleton INSTANCE = new Singleton();
}
About code optimization
“Beautiful planes fly better”
– presumably a saying of aircraft engineers
THANK YOU
Alexey Ragozin (alexey.ragozin@gmail.com)
http://blog.ragozin.info
http://aragozin.timepad.ru
