
Low latency Java apps


A presentation on how automatic memory management and adaptive compilation impact the latency of applications. Includes some ideas on how to minimise these effects.


  1. Understanding the Java Virtual Machine and Low Latency Applications
     Simon Ritter, Technology Evangelist
  2. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
  3. People Want Fast Applications
     •  What does fast mean?
        •  “I want the answer as fast as possible”
        •  “I want all the answers as fast as possible”
     •  These two goals are somewhat orthogonal from a programming perspective
     •  One fast answer = Low Latency
     •  All answers as fast as possible = High Throughput
  4. The Java Virtual Machine: Performance Considerations
     •  It’s virtual, not physical
        •  Conversion from bytecodes to native instructions and library calls
        •  Interpreted mode and Just In Time (JIT) compilation
     •  Automatic memory management
        •  The new operator allocates space for an object
        •  The garbage collector eliminates the need for a programmatic ‘free’
        •  No explicit pointer manipulation (much safer)
     •  Multi-threaded
        •  Each object has a single monitor
        •  Programmatic locking, some automated unlocking
  5. Application Stack (top to bottom)
     •  Java Application
     •  Application Server (optional)
     •  Java Virtual Machine
     •  Operating System
     •  Hardware (CPU/Memory/Bus)
     The slide highlights the impact that tuning changes at the Java Virtual Machine level will have on the stack.
  6. Memory Management
  7. What You Need To Know About GC: HotSpot VM Heap Layout
     (Heap layout diagram; the permanent generation’s removal was planned for JDK 8)
  8. What You Need To Know About GC: HotSpot VM Heap Layout (diagram)
  9. Important Concepts of GC
     •  Frequency of minor GCs is dictated by:
        •  Rate of object allocation
        •  Size of the Eden space
     •  Frequency of object promotion into tenured space is dictated by:
        •  Frequency of minor GCs (how quickly objects age)
        •  Size of the survivor spaces
        •  Tenuring threshold (default 7)
     •  Ideally, as little data as possible should be promoted
        •  Promotion involves copying, thus I/O, and must be stop-the-world
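The knobs on the slide above map to HotSpot command-line flags. A hedged sketch of a config fragment; the values are placeholders for illustration, not recommendations:

```shell
# Illustrative sizing flags; values are placeholders, not recommendations.
# -Xmn sets the young generation (Eden + survivor spaces) size.
# -XX:SurvivorRatio sets the ratio of Eden to a single survivor space.
# -XX:MaxTenuringThreshold caps the minor GCs an object survives before promotion.
java -Xmn512m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 MyApp
```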
  10. Important Concepts of GC
      •  Object retention impacts latency more than object allocation
         •  GC only visits live objects
         •  GC time is a function of the number of live objects and graph complexity
      •  Object allocation is very cheap
         •  ~10 cycles in the common case
         •  Compare to ~30 cycles for the fastest malloc algorithms
      •  Reclaiming short-lived objects is very cheap
         •  Weak generational hypothesis: most objects die young
  11. Quick Rules of Thumb
      •  Don’t be afraid to allocate quickly-disposed-of objects
         •  Especially for intermediate results
      •  GC loves small, immutable, short-lived objects
         •  So long as they don’t survive a minor GC
      •  Try to avoid complex inter-object relationships
         •  Reduces the complexity of the graph to be analysed by the GC
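The rules of thumb above can be sketched as code. A minimal illustration (the `Distance` class and its methods are invented for this example, not from the slides): a small immutable value object whose short-lived temporaries die before the next minor GC, making them nearly free to reclaim.

```java
// A small immutable value object; illustrative names only.
final class Distance {
    private final double metres;

    Distance(double metres) { this.metres = metres; }

    // Returning a new instance instead of mutating keeps the object
    // immutable; the short-lived temporary is cheap for the GC to reclaim.
    Distance plus(Distance other) {
        return new Distance(this.metres + other.metres);
    }

    double metres() { return metres; }

    public static void main(String[] args) {
        Distance total = new Distance(0);
        // A million short-lived temporaries: each dies before the next
        // minor GC, so reclaiming them costs the collector almost nothing.
        for (int i = 0; i < 1_000_000; i++) {
            total = total.plus(new Distance(1.0));
        }
        System.out.println(total.metres()); // prints 1000000.0
    }
}
```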
  12. Quick Rules of Thumb: However…
      •  Don’t allocate objects needlessly
         •  More frequent allocations mean more frequent GCs
         •  More frequent GCs imply faster object aging
         •  Faster object aging means faster promotion to the old generation
         •  Which means more frequent concurrent collections or full compacting collections of the old generation
      •  It is better to use short-lived immutable objects than long-lived mutable objects
  13. The Ideal GC Scenario
      •  After the application initialization phase, only experience minor GCs, with negligible old generation growth
         •  Ideally, never experience the need for an old generation collection
         •  Minor GCs are [generally] the fastest
      •  Start with Parallel GC
         •  i.e. -XX:+UseParallelOldGC or -XX:+UseParallelGC
         •  Parallel GC offers the fastest minor GC times
         •  So long as you have multiple cores/CPUs
      •  Move to CMS if old generation collection is needed
         •  Minor GC times will be slower due to promotion into free lists
         •  Hopefully this will avoid a full compacting collection of the old generation
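The collector-selection advice above, as a config fragment. Heap sizes here are placeholders; the right values depend entirely on your workload:

```shell
# Illustrative only; exact sizing depends on your workload.
# Start with the parallel collectors (fastest minor GCs on multi-core machines):
java -XX:+UseParallelOldGC -Xms4g -Xmx4g MyApp

# Move to CMS only if old generation collection pauses become a problem:
java -XX:+UseConcMarkSweepGC -Xms4g -Xmx4g MyApp
```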
  14. Concurrent GC: An Interesting Aside
      •  Concurrent collectors require a write barrier to track potentially hidden live objects
         •  The write barrier tracks all writes to objects and records the creation and removal of references between objects
      •  Write barriers introduce performance overhead
         •  The size of the overhead depends on the implementation
      •  Stop-the-world GC does not require a write barrier
      •  Hence, the ideal situation is:
         •  Use Parallel GC or ParallelOld GC and avoid old generation collection
         •  Thus avoiding full GCs
  15. GC Friendly Programming (1)
      •  Large objects
         •  Expensive to allocate (may not use the fast path; may go straight into the old generation)
         •  Expensive to initialise (the Java spec requires zeroing)
      •  Large objects of different sizes can cause heap fragmentation
         •  For non-compacting or partially compacting GCs
      •  Avoid large object allocations (if you can)
         •  Especially frequent large object allocations during the application’s “steady state”
         •  Not so bad during application warm-up (pooling)
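The warm-up point above can be sketched as code. A hedged illustration (class name and buffer size are invented for this example): pay the large-allocation cost once at initialisation, then reuse the buffer in steady state.

```java
import java.util.Arrays;

// Sketch: allocate the large buffer once during warm-up, reuse it in
// steady state. Class name and sizes are illustrative, not from the slides.
class WarmupBuffer {
    // One up-front large allocation; it may land straight in the old
    // generation, but that cost is paid once, not on every request.
    private static final byte[] SCRATCH = new byte[8 * 1024 * 1024];

    // Steady-state work reuses the buffer instead of re-allocating it.
    static int fill(byte value) {
        Arrays.fill(SCRATCH, value);
        return SCRATCH.length;
    }

    public static void main(String[] args) {
        System.out.println(fill((byte) 1)); // prints 8388608
    }
}
```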
  16. GC Friendly Programming (2)
      •  Data structure resizing
         •  Avoid resizing of array-backed “container objects”
         •  Use the constructor that takes an explicit size parameter
      •  Resizing leads to unnecessary object allocation
         •  Can also contribute to fragmentation (non-compacting GC)
      •  Object pooling issues
         •  Contributes to the live objects visited during GC
         •  GC pause is a function of the number of live objects
         •  Access to the pool requires locking
            •  A scalability issue
         •  Weigh against the benefits of large object allocation at start-up
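The "explicit size parameter" advice above, as a short sketch using the standard `ArrayList` and `HashMap` constructors:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PresizedContainers {
    public static void main(String[] args) {
        int expected = 10_000;

        // Sized up front: the backing array is allocated once, so no
        // intermediate arrays are created and discarded during growth.
        List<Integer> ids = new ArrayList<>(expected);

        // HashMap resizes when it passes capacity * load factor (0.75 by
        // default), so size by expected entries / 0.75 to avoid rehashing.
        Map<Integer, String> names = new HashMap<>((int) (expected / 0.75f) + 1);

        for (int i = 0; i < expected; i++) {
            ids.add(i);
            names.put(i, "item-" + i);
        }
        System.out.println(ids.size() + " " + names.size()); // prints 10000 10000
    }
}
```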
  17. GC Friendly Programming (3)
      •  Finalizers
         •  Simple rule: DON’T USE THEM!
         •  Unless you really, really, really (and I mean REALLY) have to
         •  Require at least 2 GC cycles, and those GC cycles are slower
         •  Use a method to explicitly free resources, and call it manually before the object is no longer required
      •  Reference objects
         •  A possible alternative to finalizers (as an almost-last resort)
      •  SoftReference: an important note
         •  The referent is cleared by the GC; how aggressively it is cleared is at the mercy of the GC implementation
         •  That “aggressiveness” dictates the degree of object retention
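The "explicitly free resources" advice above is what `AutoCloseable` and try-with-resources give you for free. A minimal sketch, assuming a hypothetical `NativeBuffer` resource (not from the slides):

```java
// Sketch: instead of a finalizer, expose close() so callers can free the
// resource deterministically. NativeBuffer is a hypothetical example class.
class NativeBuffer implements AutoCloseable {
    private byte[] buffer = new byte[1024 * 1024];
    private boolean closed = false;

    boolean isClosed() { return closed; }

    @Override
    public void close() {
        // Release the resource eagerly; no extra GC cycle is needed.
        buffer = null;
        closed = true;
    }

    public static void main(String[] args) {
        NativeBuffer escaped;
        // try-with-resources guarantees close() runs when the block exits.
        try (NativeBuffer buf = new NativeBuffer()) {
            escaped = buf;
        }
        System.out.println(escaped.isClosed()); // prints true
    }
}
```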
  18. Subtle Object Retention (1)
      •  Consider this class:

             class MyImpl extends ClassWithFinalizer {
                 private byte[] buffer = new byte[1024 * 1024 * 2];
                 ....

      •  Consequence of the finalizer in the super-class
         •  At least 2 GC cycles to free the byte array
      •  One solution: use composition instead of inheritance

             class MyImpl {
                 private ClassWithFinalizer classWithFinalizer;
                 private byte[] buffer = new byte[1024 * 1024 * 2];
                 ....
  19. Subtle Object Retention (2)
      •  Inner classes
         •  Have an implicit reference to the outer instance
         •  Can potentially increase object retention and graph complexity
      •  The net effect is the potential for increased GC duration
         •  Thus increased latency
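The implicit reference above is easy to avoid with a static nested class. A minimal sketch (class names are illustrative):

```java
class Outer {
    private final byte[] bigState = new byte[1024 * 1024];

    // Non-static inner class: every instance carries a hidden Outer.this
    // reference, keeping bigState reachable for as long as the task lives.
    class InnerTask implements Runnable {
        public void run() { }
    }

    // Static nested class: no hidden reference to the enclosing instance,
    // so it does not prolong the outer object's retention.
    static class NestedTask implements Runnable {
        public void run() { }
    }

    public static void main(String[] args) {
        Runnable nested = new NestedTask();           // no Outer instance needed
        Runnable inner = new Outer().new InnerTask(); // pins a new Outer (and its bigState)
        System.out.println((nested != null) + " " + (inner != null)); // prints true true
    }
}
```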
  20. Garbage First (G1) Garbage Collection
      •  Known limitations in current GC algorithms
         •  CMS: no compaction, need for a remark phase
         •  ParallelOld: full heap compaction, potentially long STW pauses
         •  A pause target can be set, but it is best-effort, with no guarantees
         •  Problems arise with increases in heap size, throughput, and live set
      •  G1 Collector
         •  Detlefs, Flood, Heller, Printezis (2004)
  21. G1 Collector
      •  CMS replacement (available from JRE 7u4 onwards)
      •  Server-“style” garbage collector
      •  Parallel
      •  Concurrent
      •  Generational
      •  Good throughput
      •  Main differences between CMS and G1:
         •  Compacting
         •  Improved ease-of-use
         •  Predictable (though not hard real-time)
  22. G1 Collector: High Level Overview
      •  Region-based heap
         •  Dynamic young generation sizing
         •  Partial compaction using evacuation
      •  Snapshot At The Beginning (SATB)
         •  Avoids the remark phase
      •  Pause target
         •  Select the number of regions in young and mixed collections that fits the target
      •  Garbage First
         •  Select regions that contain mostly garbage
         •  Minimal work for maximal return
  23. Colour Key (for the diagrams that follow)
      •  Non-Allocated Space
      •  Young Generation
      •  Old Generation
      •  Recently Copied in Young Generation
      •  Recently Copied in Old Generation
  24. Young GCs in CMS
      •  Young generation, split into
         •  Eden
         •  Survivor spaces
      •  Old generation
         •  In-place de-allocation
         •  Managed by free lists
  25. Young GCs in CMS
      •  End of young generation GC (diagram)
  26. Young GCs in G1
      •  Heap split into regions
         •  Currently 1MB regions
      •  Young generation
         •  A set of regions
      •  Old generation
         •  A set of regions
  27. Young GCs in G1
      •  During a young generation GC
         •  Survivors from the young regions are evacuated to:
            •  Survivor regions
            •  Old regions
  28. Young GCs in G1
      •  End of young generation GC (diagram)
  29. Old GCs in CMS (Sweeping After Marking)
      •  Concurrent marking phase
         •  Two stop-the-world pauses
            •  Initial mark
            •  Remark
         •  Marks reachable (live) objects
         •  Unmarked objects are deduced to be unreachable (dead)
  30. Old GCs in CMS (Sweeping After Marking)
      •  End of concurrent sweeping phase
         •  All unmarked objects are de-allocated
  31. Old GCs in G1 (After Marking)
      •  Concurrent marking phase
         •  One stop-the-world pause
            •  Remark
            •  (Initial mark is piggybacked on an evacuation pause)
         •  Calculates liveness information per region
         •  Empty regions can be reclaimed immediately
  32. Old GCs in G1 (After Marking)
      •  End of remark phase (diagram)
  33. Old GCs in G1 (After Marking)
      •  Reclaiming old regions
         •  Pick regions with a low live ratio
         •  Collect them piggy-backed on young GCs
         •  Only a few old regions are collected per such GC
  34. Old GCs in G1 (After Marking)
      •  We might leave some garbage objects in the heap
         •  In regions with a very high live ratio
         •  We might collect them later
  35. CMS vs. G1 Comparison (diagram)
  36. Latency Is A Key Goal
      •  Oracle is actively researching new ways to reduce latency and make it more predictable
      •  The direction this work takes needs to be driven by requirements
  37. Adaptive Compilation
  38. JIT Compilation Facts: Optimisation Decisions
      •  Data: classes loaded and code paths executed
      •  The JIT compiler does not know about all the code in the application
         •  Unlike a traditional compiler
      •  Optimisation decisions are based on runtime history
         •  No potential to predict the future profile
      •  Decisions made may turn out to be sub-optimal later
         •  Limits some types of optimisation used
      •  As the profile changes, the JIT needs to react
         •  Throw away compiled code that is no longer required
         •  Re-optimise based on the new profile
  39. JIT Compilation Facts: Internal Profiling
      •  Need to determine which methods are hot or cold
      •  Invocation counting
         •  Handled by the bytecode interpreter, or by including an add instruction in the native code
         •  Can have noticeable run-time overhead
      •  Thread sampling
         •  Periodically check thread code and register instruction pointers
         •  Minimising application disruption requires a custom thread implementation or OS support
      •  Hardware-based sampling
         •  Platform-specific instrumentation mechanisms
  40. JIT Compilation Facts: JIT Assumptions
      •  Methods will probably not be overridden
         •  Can be called with a fixed address
      •  A float will probably never be NaN
         •  Use hardware instructions rather than the floating point library
      •  Exceptions will probably not be thrown in a try block
         •  All catch blocks are marked as cold
      •  A lock will probably not be saturated
         •  Start as a fast spinlock
      •  A lock will probably be taken and released by the same thread
         •  Sequential unlock/acquire operations can be treated as a no-op
  41. Inlining and Virtualisation: Competing Forces
      •  The most effective optimisation is method inlining
      •  Virtual methods are the biggest barrier to this
      •  Good news:
         •  The JIT can de-virtualize methods if it only sees one implementation
         •  Makes it a monomorphic call
      •  Bad news:
         •  If the JIT compiler later discovers an additional implementation, it must de-optimize
         •  Re-optimise to make it a bi-morphic call
         •  Reduced performance, especially if extended to a third implementation and a megamorphic call
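The call-site morphism described above can be sketched as code. A hedged illustration (the `Pricer` interface and its implementations are invented for this example): while only `FlatPricer` has been loaded, the virtual call in `total` is monomorphic and the JIT can de-virtualize and inline `price`; once `TieredPricer` is also observed at that call site, the JIT must de-optimize to a bi-morphic dispatch.

```java
// Illustrative types only; not from the presentation.
interface Pricer {
    double price(double qty);
}

final class FlatPricer implements Pricer {   // final: trivially monomorphic
    public double price(double qty) { return qty * 2.0; }
}

class TieredPricer implements Pricer {       // a second implementation
    public double price(double qty) { return qty > 100 ? qty * 1.5 : qty * 2.0; }
}

class MorphismDemo {
    // A virtual call site: its morphism depends on which Pricer
    // implementations the JIT has observed here at runtime.
    static double total(Pricer p, double[] qtys) {
        double t = 0;
        for (double q : qtys) t += p.price(q);
        return t;
    }

    public static void main(String[] args) {
        double[] qtys = {10, 20, 30};
        System.out.println(total(new FlatPricer(), qtys));   // prints 120.0
        System.out.println(total(new TieredPricer(), qtys)); // prints 120.0
    }
}
```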
  42. Inlining and Virtualisation: Important Points
      •  Implementation changes during “steady-state”
         •  Will slow down the application
      •  Write JIT-friendly code?
         •  No! Remember: “Beware premature optimisation”
      •  What to do?
         •  Code naturally and let the JIT figure it out
         •  Profile to find problem areas
         •  Modify code only in problem areas to improve performance
  43. Conclusions
      •  Java uses a virtual machine, so it:
         •  Has automatic memory management
         •  Has adaptive compilation of bytecodes
      •  How these features work will have a significant impact on the performance of your application
      •  Profile, profile, profile!
      •  Avoid premature optimisation
  44. Resources
      •  tuning-6-140523.html
      •  vmoptions-jsp-140102.html