JIT compilation in modern platforms – challenges and solutions
Slide deck from a tech meetup at the Deutsche Bank Technology Center, Moscow

Presentation Transcript

  • JIT in modern runtimes: challenges and solutions. Alexey Ragozin, Deutsche Bank
  • Presentation outline
    • Why dynamic languages are slow
      • Virtual calls
      • Untyped / weakly typed data
    • Two approaches to JIT
      • Method-based JIT
      • Tracing JIT
    • JIT in HotSpot JVM
      • Interpreter overview
      • JIT dirty tricks
  • Good old C++: plain inheritance [diagram: OBJECT, VTABLE (00: methodA, 01: methodB, 02: methodC, 03: methodD), CODE]
  • Good old C++: multiple inheritance [diagram: OBJECT, first VTABLE (00: methodA, 01: methodB, 02: methodC, 03: methodD), second VTABLE (00: methodX, 01: methodY, 02: methodZ), CODE]
  • Good old C++: more fun with multiple inheritance [diagram: classes A, B, C, D]
  • Cost of virtual call: branch misprediction penalty
    • Intel Nehalem – 17 cycles
    • Intel Sandy/Ivy Bridge – 15 cycles
    • Intel Haswell – 15-20 cycles
    • AMD K8 / K10 – 13 cycles
    • AMD Bulldozer – 19-22 cycles
    Source: http://www.agner.org/optimize/microarchitecture.pdf
  • Cost of virtual call
    • Two memory accesses before the actual jump
    • Memory accesses are serialized
    • CPU pipeline is blocked
    Memory access timings
    • L1 cache ~0.5 ns
    • L2 cache ~7 ns
    • RAM ~100 ns
  • Cost of dynamic class metadata: fields are stored in a hash table. Access to a field takes
    • Arithmetic operation
    • Memory read
    • Condition check
    • Memory read
  • Are interpreters that slow?
      switch (byteCode) {
        case STORE: ...
        case LOAD: ...
        case ASTORE: ...
        case ALOAD: ...
        ...
      }
  • Fast interpreter in HotSpot JVM. Bytecode interpreter in HotSpot JVM:
    • Each bytecode instruction has a routine written in assembly language
    • Dispatch – jump to the corresponding routine
    • Each routine ends with a jump back to dispatch
    Consequences:
    • No stack frame is produced per instruction
    • Dispatch table and code are well cached
    • CPU pipeline is kept busy
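For contrast with HotSpot's assembly-routine interpreter, a naive switch-based dispatch loop can be sketched in Java as follows. This is only an illustration: the opcodes (`PUSH`, `ADD`, `MUL`, `HALT`) and the class name are hypothetical and are not JVM bytecodes.

```java
/** Toy stack-machine interpreter; the opcode set is hypothetical, not JVM bytecode. */
final class SwitchInterpreter {
    static final int PUSH = 0; // push the next code word as a constant
    static final int ADD = 1;  // pop two values, push their sum
    static final int MUL = 2;  // pop two values, push their product
    static final int HALT = 3; // stop and return the top of the stack

    static int run(int[] code) {
        int[] stack = new int[16]; // fixed-size operand stack, enough for a sketch
        int sp = 0;                // index of the next free stack slot
        int pc = 0;                // index of the next code word
        while (true) {
            int op = code[pc++];
            // Every instruction goes through this single switch: one shared,
            // hard-to-predict indirect branch per interpreted instruction.
            switch (op) {
                case PUSH:
                    stack[sp++] = code[pc++];
                    break;
                case ADD: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a + b;
                    break;
                }
                case MUL: {
                    int b = stack[--sp];
                    int a = stack[--sp];
                    stack[sp++] = a * b;
                    break;
                }
                case HALT:
                    return stack[sp - 1];
                default:
                    throw new IllegalStateException("unknown opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        // Computes (2 + 3) * 4.
        int[] program = {PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT};
        System.out.println(run(program)); // prints 20
    }
}
```

In the HotSpot design described above, each routine jumps to the next one directly instead of returning to a central switch; plain Java cannot express that, so the sketch shows only the baseline.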
  • JIT compilation approaches
    • Classic: method-based compilation + runtime profiling + profiling-driven optimization
    • Tracing JIT: recording whole execution paths (traces) + fall back to the interpreter if execution deviates from the path + maintain a tree of compiled traces
  • JIT compilation approaches
    • Classic, method-based compilation – JVM, V8, Firefox IonMonkey
    • Tracing JIT, recording whole execution paths (traces) – Flash, TraceMonkey, PyPy, LuaJIT
  • Tracing JIT
    • Interpretation mode: record actions and branch conditions (recording a trace)
    • Profiling: detect “hot” traces
    • Trace compilation:
      • Non-branching code is generated
      • Guards instead of branching
      • Whole-trace optimization
      • Guard violation – fall back to the interpreter
  • Tracing JIT
    • Strong
      • Devirtualization and inlining
      • Hash lookups are also “deconditioned”
      • Efficient optimization of “hot loops”
    • Weak
      • Tracing SLOWS down interpretation
      • Long “warm-up” time
  • Dynamic types problem
    • V8 – hidden classes: hidden classes are strongly typed
    • TraceMonkey – shape inference / property cache: inline caching in compiled code
    • LuaJIT – hash table access is optimized in the trace:
      HREFK: if (hash[17].key != key) goto exit
      HLOAD: x = hash[17].value
      -or-
      HSTORE: hash[17].value = x
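The guarded slot access from the LuaJIT lines above (check that a known slot still holds the expected key, otherwise leave the fast path) can be imitated in Java. This is a minimal sketch under stated assumptions: `Table`, `CachedLoad` and `guardFailures` are hypothetical names, the table is a fixed-size open-addressing map that never resizes, and none of it reflects LuaJIT's actual data structures.

```java
/** Sketch of a guarded, slot-cached hash lookup; all names are hypothetical. */
final class SlotGuardSketch {
    /** Tiny open-addressing table with a fixed capacity (no resizing, no deletion). */
    static final class Table {
        final String[] keys = new String[32];
        final Object[] values = new Object[32];

        /** Generic path: hash the key, then probe linearly for it or for an empty slot. */
        int findSlot(String key) {
            int i = (key.hashCode() & 0x7fffffff) % keys.length;
            while (keys[i] != null && !keys[i].equals(key)) {
                i = (i + 1) % keys.length;
            }
            return i;
        }

        void put(String key, Object value) {
            int i = findSlot(key);
            keys[i] = key;
            values[i] = value;
        }
    }

    /** One access site that remembers which slot held its key last time. */
    static final class CachedLoad {
        final String key;
        int slot = -1;         // cached slot index, -1 until the first lookup
        int guardFailures = 0; // how often the slow path was taken

        CachedLoad(String key) {
            this.key = key;
        }

        Object load(Table table) {
            // Guard, analogous to "if (hash[17].key != key) goto exit".
            if (slot >= 0 && key.equals(table.keys[slot])) {
                return table.values[slot]; // fast path: no hashing, no probing
            }
            guardFailures++;
            slot = table.findSlot(key);    // slow path: full lookup, then re-cache
            return table.keys[slot] == null ? null : table.values[slot];
        }
    }

    public static void main(String[] args) {
        Table table = new Table();
        table.put("x", 42);
        CachedLoad load = new CachedLoad("x");
        System.out.println(load.load(table)); // slow path on first use
        System.out.println(load.load(table)); // guarded fast path afterwards
    }
}
```

A real tracing JIT bakes the slot index into generated machine code as a constant; the mutable `slot` field here only stands in for that.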
  • References
    1. LuaJIT: http://article.gmane.org/gmane.comp.lang.lua.general/58908
    2. Incremental Dynamic Code Generation with Trace Trees: http://www.ics.uci.edu/~franz/Site/pubs-pdf/ICS-TR-06-16.pdf
    3. V8 design aspects: https://developers.google.com/v8/design
    4. RPython: http://tratt.net/laurie/research/pubs/papers/bolz_tratt__the_impact_of_metatracing_on_vm_design_and_implementation.pdf
  • HotSpot JVM
  • HotSpot JVM JIT
    • Fast interpreter
    • Two JIT compilers (C1 / C2)
    • Runtime profiling
    • “Deoptimization” of code on the fly
    • On Stack Replacement (OSR)
  • Devirtualization: call site profiling
    • Monomorphic – a single destination for the majority of calls
    • Bimorphic – there are two most frequent destinations
    • Polymorphic
  • Devirtualization: “inline” method caching
      if (list.getClass() == ArrayList.class) {
        /* NON VIRTUAL */
        list.ArrayList#size()
      } else {
        /* VIRTUAL */
        list.size();
      }
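The shape of that guard can be written out by hand in compilable Java. This is a sketch only: `GuardedCallSketch`, `sizeOf` and `fastPathHits` are hypothetical names, and at the source level the cast-then-call is still an ordinary virtual call; the point is the class check followed by a call to one known implementation, which is what lets the JIT bind and inline it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

/** Hand-written imitation of a type-guarded call site; all names are hypothetical. */
final class GuardedCallSketch {
    static int fastPathHits = 0; // counts how often the guard succeeded

    static int sizeOf(List<?> list) {
        if (list.getClass() == ArrayList.class) {
            // Guard passed: the receiver's exact class is known, so the target
            // is fixed to ArrayList.size() and could be inlined by a JIT.
            fastPathHits++;
            return ((ArrayList<?>) list).size();
        }
        // Guard failed: ordinary virtual dispatch through the List interface.
        return list.size();
    }

    public static void main(String[] args) {
        List<String> arrayList = new ArrayList<>(Arrays.asList("a", "b", "c"));
        List<String> linkedList = new LinkedList<>(arrayList);
        System.out.println(sizeOf(arrayList));  // takes the guarded path
        System.out.println(sizeOf(linkedList)); // takes the virtual path
        System.out.println(fastPathHits);
    }
}
```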
  • Incremental compilation
    Collections.indexedBinarySearch() (the calls list.get() and midVal.compareTo() are polymorphic):
      …
      int mid = (low + high) >>> 1;
      Comparable<? super T> midVal = list.get(mid);
      int cmp = midVal.compareTo(key);
      …
    MyPojo:
      List<String> keys = new ArrayList<String>();
      List<String> vals = new ArrayList<String>();
      public String get(String key) {
        int n = Collections.binarySearch(keys, key);
        return n < 0 ? null : vals.get(n);
      }
  • Incremental compilation
    • MyPojo.get() is compiled by JIT; Collections.binarySearch() gets inlined
    • Calls in Collections.binarySearch() become monomorphic
    • JIT continues profiling at runtime
    • Calls to get() and compareTo() will be inlined once MyPojo.get() is recompiled
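To experiment with this scenario, the MyPojo fragment can be completed into a self-contained class. This is a sketch: the `put` method and the driver loop in `main` are additions that do not appear on the slides, included only so the class can be populated and exercised.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sorted parallel lists with binary-search lookup, after the MyPojo slide. */
final class MyPojo {
    private final List<String> keys = new ArrayList<>();
    private final List<String> vals = new ArrayList<>();

    /** Hypothetical helper (not on the slides): insert while keeping keys sorted. */
    void put(String key, String val) {
        int n = Collections.binarySearch(keys, key);
        if (n >= 0) {
            vals.set(n, val);          // key already present: overwrite
        } else {
            int insertAt = -(n + 1);   // binarySearch encodes the insertion point
            keys.add(insertAt, key);
            vals.add(insertAt, val);
        }
    }

    /** Same logic as the slide: the only list type ever seen here is ArrayList of String. */
    String get(String key) {
        int n = Collections.binarySearch(keys, key);
        return n < 0 ? null : vals.get(n);
    }

    public static void main(String[] args) {
        MyPojo pojo = new MyPojo();
        for (int i = 0; i < 1000; i++) {
            pojo.put("key" + i, "value" + i);
        }
        // Call get() often enough that the JIT is likely to compile it.
        int hits = 0;
        for (int i = 0; i < 1_000_000; i++) {
            if (pojo.get("key" + (i % 2000)) != null) {
                hits++;
            }
        }
        System.out.println(hits);
    }
}
```

Running it with the HotSpot flag `-XX:+PrintCompilation` lists methods as they are compiled; whether and when a particular method is inlined depends on the JVM version and its thresholds.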
  • On Stack Replacement: JIT can recompile main and replace the return address on the stack while execution is in some method inside the loop
      public static void main() {
        long s = System.nanoTime();
        for (int i = 0; i != N; ++i) {
          /* a lot of code */
          ...
        }
        long avg = (System.nanoTime() - s) / N;
      }
  • Escape analysis: a heritage of the old days, the dreaded synchronized
    • buf is not used outside of the method
    • all methods of buf are inlined
    • synchronization code can be removed
      public String toString() {
        StringBuffer buf = new StringBuffer();
        buf.append("X=").append(x);
        buf.append(",Y=").append(y);
        return buf.toString();
      }
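The effect of removing the locks can be approximated by hand: `StringBuffer` methods are synchronized, while `StringBuilder` offers the same appends without locking. The sketch below puts both variants side by side; the class name and the fields `x` and `y` are assumptions for illustration, and the second method is a manual analogue of the optimization, not the code the JIT emits.

```java
/** Two equivalent toString-style methods; class and field names are hypothetical. */
final class LockElisionSketch {
    private final int x;
    private final int y;

    LockElisionSketch(int x, int y) {
        this.x = x;
        this.y = y;
    }

    /** As on the slide: every append() acquires and releases the buffer's monitor. */
    String withStringBuffer() {
        StringBuffer buf = new StringBuffer();
        buf.append("X=").append(x);
        buf.append(",Y=").append(y);
        return buf.toString();
    }

    /** Hand-written analogue of the result: same appends, no synchronization. */
    String withStringBuilder() {
        StringBuilder buf = new StringBuilder();
        buf.append("X=").append(x);
        buf.append(",Y=").append(y);
        return buf.toString();
    }

    public static void main(String[] args) {
        LockElisionSketch p = new LockElisionSketch(1, 2);
        System.out.println(p.withStringBuffer());
        System.out.println(p.withStringBuilder());
    }
}
```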
  • Scalar replacement: after inlining distance() into length(), JIT will replace the Point objects with a few scalar variables
      public double length() {
        return distance(
          new Point(ax, ay),
          new Point(bx, by));
      }
      public double distance(Point a, Point b) {
        double w = a.x - b.x;
        double h = a.y - b.y;
        return Math.sqrt(w*w + h*h);
      }
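What scalar replacement amounts to can be shown by writing the transformed method by hand. In this sketch the enclosing class, the `Point` class and `lengthScalarized()` are assumptions added for illustration; the last method is a source-level picture of the result, not actual JIT output.

```java
/** Source-level picture of scalar replacement; class names are hypothetical. */
final class ScalarReplacementSketch {
    static final class Point {
        final double x;
        final double y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    private final double ax, ay, bx, by;

    ScalarReplacementSketch(double ax, double ay, double bx, double by) {
        this.ax = ax;
        this.ay = ay;
        this.bx = bx;
        this.by = by;
    }

    /** As on the slide: allocates two short-lived Point objects per call. */
    double length() {
        return distance(new Point(ax, ay), new Point(bx, by));
    }

    static double distance(Point a, Point b) {
        double w = a.x - b.x;
        double h = a.y - b.y;
        return Math.sqrt(w * w + h * h);
    }

    /** After inlining and scalar replacement: the Points' fields become locals. */
    double lengthScalarized() {
        double w = ax - bx;
        double h = ay - by;
        return Math.sqrt(w * w + h * h);
    }

    public static void main(String[] args) {
        ScalarReplacementSketch s = new ScalarReplacementSketch(0, 0, 3, 4);
        System.out.println(s.length());           // prints 5.0
        System.out.println(s.lengthScalarized()); // prints 5.0
    }
}
```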
  • Garbage collection and JIT: JIT can inline static final fields
    • Memory address is placed in compiled code
    • GC treats compiled code much like a data structure
      • Compiled methods act as GC roots
      • GC will fix the address inside compiled code if the object is relocated
      public class Singleton {
        public static final Singleton INSTANCE = new Singleton();
      }
  • About code optimization: “Beautiful planes fly better” – presumably a saying of aircraft engineers
  • THANK YOU. Alexey Ragozin (alexey.ragozin@gmail.com) http://blog.ragozin.info http://aragozin.timepad.ru