
I know why your Java is slow


Slide deck from Devoxx 2016



  1. @aragozin #Devoxx #WhyJavaSlow
     I Know Why Your Java is Slow
     Alexey Ragozin
  2. After months of hard work, with tons of new features well tested and polished, your application is live! Everything runs smoothly until …
  3. Users start complaining about performance. The application has become unusable and stakeholders are getting nervous. You have to fix it! ASAP
  4. (Diagram: Browser, Server, Application, Database)
  5. (Diagram: Browser, Server, Application, Database – annotated with performance concerns: Caching, JavaScript, HTTP, Network, SQL performance, CPU, Memory / Swapping, Disk IO, Virtualization)
  6. What exactly does “slow” mean – clarify your KPIs
      • Business transaction ≠ Page
      • Business transaction ≠ SQL transaction
      • Page ≠ HTTP request
      • HTTP request ≠ SQL transaction
     Your system is a black box
      • Do you think you know how it works? You do not!
     Profiling produces three types of data
      • Lies – incorrectly interpreted data
      • Darn lies – incorrectly measured data
      • Statistics – data which will help you fix your system
  7. What types of bottlenecks can we find in Java?
     CPU bound
      • Single core bottleneck
      • CPU starvation
     Thread contention
     Memory and GC related
      • Frequent young GC
      • Abnormally long GC pauses
      • Frequent full GC
  8. Single core bottleneck
      • A certain singleton thread consumes 100% of one CPU core
     CPU starvation
      • A number of threads compete for physical cores
      • Threads neither sleep nor wait, yet CPU usage is far below 100%
     Thread contention
      • Threads spend considerable time in synchronization (acquiring locks, waiting)
     Frequent young GC
      • Frequent young GC consumes a significant share of the CPU budget
      • Caused by intensive memory allocation in application code
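Thread contention in the sense above can be observed from inside the JVM itself. A minimal sketch (an editor's illustration, not from the deck) using the standard `ThreadMXBean`, which reports per-thread BLOCKED time once contention monitoring is enabled:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.atomic.AtomicLong;

public class ContentionDemo {

    static final Object LOCK = new Object();

    // Each worker repeatedly grabs the shared monitor; before exiting it
    // asks ThreadMXBean how long it itself spent BLOCKED on monitors.
    static Runnable worker(ThreadMXBean mx, AtomicLong blockedMs) {
        return () -> {
            for (int i = 0; i < 200; i++) {
                synchronized (LOCK) {
                    long spin = System.nanoTime() + 100_000; // hold the lock ~0.1ms
                    while (System.nanoTime() < spin) { }
                }
            }
            long t = mx.getThreadInfo(Thread.currentThread().getId()).getBlockedTime();
            blockedMs.accumulateAndGet(Math.max(t, 0), Math::max);
        };
    }

    public static long measureBlockedMillis() throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (mx.isThreadContentionMonitoringSupported()) {
            mx.setThreadContentionMonitoringEnabled(true);
        }
        AtomicLong blockedMs = new AtomicLong();
        Thread a = new Thread(worker(mx, blockedMs), "contender-a");
        Thread b = new Thread(worker(mx, blockedMs), "contender-b");
        a.start(); b.start();
        a.join(); b.join();
        return blockedMs.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("max blocked time: " + measureBlockedMillis() + "ms");
    }
}
```

The reported value depends on scheduling and may be small, but it makes the "time spent in synchronization" from the slide a concrete, measurable number.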
  9. Thread CPU usage
      • Number of CPU cycles spent on this thread (translated into time or percentage)
      • Calculated by the OS (User + Kernel)
      • A single thread can consume 100% at most
     Java thread in RUNNABLE state
      • Not BLOCKED, WAITING, SLEEPING or PARKED
      • A thread in a blocking socket read is RUNNABLE
     A Java RUNNABLE thread
      • may not be runnable from the OS perspective
      • may be runnable but not on CPU
      • may actually be running on CPU (counted towards thread CPU usage)
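The gap between "RUNNABLE" and "actually on CPU" can be probed with `ThreadMXBean` as well. A sketch (illustrative, not from the slides): a busy loop is charged real CPU time, whereas a thread sleeping or blocked in I/O would stay RUNNABLE-or-not yet accumulate almost none:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCpuDemo {

    // Burn CPU for roughly the given wall-clock time, then report the
    // CPU time (user + kernel, in nanoseconds) the OS charged to this thread.
    public static long burnAndMeasureNanos(long millis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isCurrentThreadCpuTimeSupported()) {
            return -1; // platform does not expose per-thread CPU time
        }
        long deadline = System.currentTimeMillis() + millis;
        long sink = 0;
        while (System.currentTimeMillis() < deadline) {
            sink += sink * 31 + 1; // busy work to keep the core occupied
        }
        // returns -1 if per-thread CPU measurement is disabled
        return mx.getCurrentThreadCpuTime();
    }

    public static void main(String[] args) {
        System.out.println("thread cpu ns: " + burnAndMeasureNanos(100));
    }
}
```

Tools like JVisualVM and SJK ttop (next slides) aggregate exactly this counter across all threads.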
  10. Bird’s-eye view of a Java process using JVisualVM
  11. Bird’s-eye view of a Java process using JProfiler
  12. Bird’s-eye view of a Java process using SJK ttop

      > java -jar sjk.jar ttop -p 12345 -n 20 -o CPU
      2016-11-07T00:09:50.091+0400 Process summary
        process cpu=268.35%
        application cpu=244.42% (user=222.50% sys=21.93%)
        other: cpu=23.93%
        GC cpu=6.28% (young=1.62%, old=4.66%)
        heap allocation rate 1218mb/s
        safe point rate: 1.5 (events/s) avg. safe point pause: 43.24ms
        safe point sync time: 0.03% processing time: 6.39% (wallclock time)
      [000056] user=17.45% sys= 0.62% alloc= 106mb/s - hz._hzInstance_2_dev.generic-operation.thread-0
      [000094] user=17.76% sys= 0.31% alloc= 113mb/s - hz._hzInstance_3_dev.generic-operation.thread-1
      [000093] user=16.83% sys= 0.00% alloc= 111mb/s - hz._hzInstance_3_dev.generic-operation.thread-0
      [000020] user=16.06% sys= 0.15% alloc= 108mb/s - hz._hzInstance_1_dev.generic-operation.thread-0
      [000021] user=15.44% sys= 0.15% alloc= 110mb/s - hz._hzInstance_1_dev.generic-operation.thread-1
      [000057] user=14.36% sys= 0.00% alloc= 110mb/s - hz._hzInstance_2_dev.generic-operation.thread-1
      [000105] user=13.59% sys= 0.00% alloc= 72mb/s - hz._hzInstance_3_dev.cached.thread-1
      [000079] user=13.43% sys= 0.15% alloc= 67mb/s - hz._hzInstance_2_dev.cached.thread-3
      [000042] user=10.96% sys= 0.62% alloc= 65mb/s - hz._hzInstance_1_dev.cached.thread-2
      [000174] user=10.65% sys= 0.31% alloc= 66mb/s - hz._hzInstance_3_dev.cached.thread-7
      [000123] user= 8.96% sys= 0.00% alloc= 55mb/s - hz._hzInstance_4_dev.response
      [000129] user= 7.72% sys= 0.31% alloc= 21mb/s - hz._hzInstance_4_dev.generic-operation.thread-0
      [000168] user= 6.95% sys= 0.62% alloc= 33mb/s - hz._hzInstance_1_dev.cached.thread-6
      [000178] user= 7.57% sys= 0.00% alloc= 32mb/s - hz._hzInstance_2_dev.cached.thread-9
      [000166] user= 6.48% sys= 0.46% alloc= 33mb/s - hz._hzInstance_1_dev.cached.thread-5
      [000130] user= 6.02% sys= 0.62% alloc= 20mb/s - hz._hzInstance_4_dev.generic-operation.thread-1
      [000181] user= 5.56% sys= 0.00% alloc= 34mb/s - hz._hzInstance_2_dev.cached.thread-12
      [000014] user= 2.32% sys= 0.15% alloc= 7345kb/s - hz._hzInstance_1_dev.response
  13. Tracking CPU hogs via sampling, using JProfiler
  14. Tracking CPU hogs via sampling: flame graph with SJK
  15. JEE world example – JBoss + Seam + Hibernate
  16. JEE world example – JBoss + Seam + Hibernate

      Command:
      sjk ssa -f tracedump.std --categorize -tf **.CoyoteAdapter.service -nc
          JDBC=**.jdbc
          Hibernate=org.hibernate
          "Facelets compile=com.sun.faces.facelets.compiler.Compiler.compile"
          "Seam bijection=org.jboss.seam.**.aroundInvoke/!**.proceed"
          JSF.execute=com.sun.faces.lifecycle.LifecycleImpl.execute
          JSF.render=com.sun.faces.lifecycle.LifecycleImpl.render
          Other=**

      Report:
      Total samples      2732050  100.00%
      JDBC                405439   14.84%
      Hibernate           802932   29.39%
      Facelets compile    395784   14.49%
      Seam bijection      385491   14.11%
      JSF.execute         290355   10.63%
      JSF.render          297868   10.90%
      Other               154181    5.64%

      (Chart: categories stacked over time, 0–100%, built in Excel)
  17. Tracking CPU hot spots using thread sampling
      PRO
       • Low overhead
       • No upfront configuration
       • Can identify CPU hot spots, IO hot spots, contention
       • Data are comparable across runs
      CON
       • Limited precision
       • No real method execution time is measured
       • Safe point bias
       • Attribution is limited to the static program structure – methods
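The sampling approach the slide evaluates can be sketched in a few lines of plain Java (a toy illustration, not how JProfiler or SJK are actually implemented): periodically snapshot all stacks and count the top frame of each RUNNABLE thread; hot methods accumulate the most samples, with exactly the precision and safe-point-bias caveats listed above:

```java
import java.util.HashMap;
import java.util.Map;

public class MiniSampler {

    // Take `rounds` snapshots, `periodMillis` apart, and build a histogram
    // of top-of-stack frames of RUNNABLE threads.
    public static Map<String, Integer> sample(int rounds, long periodMillis)
            throws InterruptedException {
        Map<String, Integer> histogram = new HashMap<>();
        for (int i = 0; i < rounds; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                if (e.getKey().getState() != Thread.State.RUNNABLE) continue;
                StackTraceElement[] stack = e.getValue();
                if (stack.length == 0) continue;
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                histogram.merge(top, 1, Integer::sum); // one sample for this frame
            }
            Thread.sleep(periodMillis);
        }
        return histogram;
    }

    public static void main(String[] args) throws InterruptedException {
        sample(10, 10).forEach((frame, hits) -> System.out.println(hits + "\t" + frame));
    }
}
```

Note that `Thread.getAllStackTraces()` itself pauses threads at safe points, which is precisely where the safe point bias in the CON list comes from.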
  18. Instrumentation profiling
       • Modifies byte code at runtime
       • Modifies the behavior of the program
       • Affects JIT compilation
       • A probe can access method arguments
      When should/can instrumentation be used?
       • You have narrowed down the problem area
       • You need more context to find the root cause
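The runtime byte code modification mentioned above is built on the standard `java.lang.instrument` API. A skeleton sketch (the `com/myapp/` package filter is a made-up example; a real probe would rewrite `classBytes` with a byte code library such as ASM):

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class TimingAgent {

    // Hypothetical filter: only touch application classes
    // (com/myapp/ is an invented package for illustration).
    static boolean shouldInstrument(String className) {
        return className != null && className.startsWith("com/myapp/");
    }

    // Entry point invoked by the JVM when started with -javaagent:agent.jar
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> beingRedefined, ProtectionDomain domain,
                                    byte[] classBytes) {
                if (shouldInstrument(className)) {
                    // A real profiler would rewrite classBytes here
                    // (e.g. with ASM) to insert entry/exit probes.
                    System.out.println("would instrument: " + className);
                }
                return null; // null = leave the class unchanged
            }
        });
    }
}
```

Keeping the filter narrow matters: every transformed method is a method the JIT must treat differently, which is the "modifies behavior / affects JIT" cost the slide warns about.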
  19. BTrace
       • Open Source
       • Byte code instrumentation
       • Probes are coded in Java
       • CLI and API

      BTrace script:

      @Property
      Profiler prof = Profiling.newProfiler();

      @OnMethod(clazz = "org.jboss.seam.Component",
                method = "/(inject|disinject|outject)/")
      void entryByMethod2(@ProbeClassName String className,
                          @ProbeMethodName String methodName,
                          @Self Object component) {
          if (component != null) {
              Field nameField = field(classOf(component), "name", true);
              if (nameField != null) {
                  String name = (String) get(nameField, component);
                  Profiling.recordEntry(prof,
                      concat("org.jboss.seam.Component.",
                          concat(methodName, concat(":", name))));
              }
          }
      }

      @OnMethod(clazz = "org.jboss.seam.Component",
                method = "/(inject|disinject|outject)/",
                location = @Location(value = Kind.RETURN))
      void exitByMthd2(@ProbeClassName String className,
                       @ProbeMethodName String methodName,
                       @Self Object component,
                       @Duration long duration) {
          if (component != null) {
              Field nameField = field(classOf(component), "name", true);
              if (nameField != null) {
                  String name = (String) get(nameField, component);
                  Profiling.recordExit(prof,
                      concat("org.jboss.seam.Component.",
                          concat(methodName, concat(":", name))),
                      duration);
              }
          }
      }
  20. Is Garbage Collection the one to blame?
       • Enable GC logs to see the whole picture
         -XX:+PrintGCDetails
         -XX:+PrintReferenceGC
      Common problems
       • JVM does not have enough memory
       • Intensive allocation of short-lived objects by application code
       • Reference problems
         - Large object resurrection by finalizers
         - Too many references
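The second common problem – intensive allocation of short-lived objects – is easy to reproduce and observe even without GC logs, through `GarbageCollectorMXBean` (an illustrative sketch, not from the deck):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPressureDemo {

    // Sum of collection counts across all collectors (young + old).
    static long totalGcCount() {
        long n = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            n += Math.max(0, gc.getCollectionCount());
        }
        return n;
    }

    // Allocate ~sizeMb of immediately-dead garbage and report how many
    // collections that churn triggered.
    public static long churn(int sizeMb) {
        long before = totalGcCount();
        for (int i = 0; i < sizeMb; i++) {
            byte[] garbage = new byte[1024 * 1024]; // dies on the next iteration
            if (garbage.length == 0) throw new AssertionError(); // keep allocation live
        }
        return totalGcCount() - before;
    }

    public static void main(String[] args) {
        System.out.println("GCs triggered: " + churn(512));
    }
}
```

Run it with a small heap (e.g. -Xmx64m) and the count climbs quickly; the same counters are what frequent-young-GC symptoms look like in monitoring tools.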
  21. HotSpot JVM options cheatsheet – by Alexey Ragozin
      (All concrete numbers in the JVM options on this card are for illustration purposes only!)

      Young space tenuring
      -XX:InitialTenuringThreshold=8 – Initial value for tenuring threshold (number of collections before an object is promoted to old space)
      -XX:MaxTenuringThreshold=15 – Max value for tenuring threshold
      -XX:PretenureSizeThreshold=2m – Max object size allowed to be allocated in young space (larger objects are allocated directly in old space). Thread-local allocation bypasses this check, so if the TLAB is large enough an object exceeding the size threshold may still be allocated in young space.
      -XX:+AlwaysTenure – Promote all objects surviving young collection immediately to tenured space (equivalent of -XX:MaxTenuringThreshold=0)
      -XX:+NeverTenure – Objects from young space never get promoted to tenured space unless survivor space is too small to keep them

      Thread local allocation
      -XX:+UseTLAB – Use thread-local allocation blocks in Eden
      -XX:+ResizeTLAB – Let the JVM resize TLABs per thread
      -XX:TLABSize=1m – Initial size of a thread’s TLAB
      -XX:MinTLABSize=64k – Min size of TLAB

      CMS initiating options (options for “deterministic” CMS; they disable some heuristics and require careful validation)
      -XX:+UseCMSInitiatingOccupancyOnly – Use predefined occupancy as the only criterion for starting a CMS collection (disables adaptive behaviour)
      -XX:CMSInitiatingOccupancyFraction=70 – Percentage of CMS generation occupancy to start a CMS cycle. A negative value means that CMSTriggerRatio is used.
      -XX:CMSBootstrapOccupancy=50 – Percentage of CMS generation occupancy at which to initiate CMS collection for bootstrapping collection stats
      -XX:CMSTriggerRatio=70 – Percentage of MinHeapFreeRatio in CMS generation that is allocated before a CMS collection cycle commences
      -XX:CMSTriggerInterval=60000 – Periodically triggers CMS collection. Useful for deterministic object finalization.
      -XX:CMSWaitDuration=30000 – Once a CMS collection is triggered, CMS waits for the next young collection to perform initial mark right after it. This parameter specifies how long CMS can wait for that young collection.

      CMS Stop-the-World pauses tuning
      -XX:+CMSScavengeBeforeRemark – Force young collection before the remark phase
      -XX:CMSScheduleRemarkEdenSizeThreshold – If Eden usage is below this value, don’t try to schedule remark
      -XX:CMSScheduleRemarkEdenPenetration=20 – Eden occupancy % at which to try and schedule the remark pause
      -XX:CMSScheduleRemarkSamplingRatio=4 – Start sampling Eden top at least before young generation occupancy reaches 1/<ratio> of the size at which we plan to schedule remark
      -XX:+CMSParallelRemarkEnabled – Whether parallel remark is enabled (enabled by default)
      -XX:+CMSParallelSurvivorRemarkEnabled – Whether parallel remark of survivor space is enabled; effective only with the option above (enabled by default)
      -XX:+CMSParallelInitialMarkEnabled – Whether parallel initial mark is enabled (enabled by default)

      CMS Concurrency options
      -XX:+CMSConcurrentMTEnabled – Use multiple threads for concurrent phases
      -XX:ConcGCThreads=2 – Number of parallel threads used for concurrent phases

      Parallel processing
      -XX:ParallelGCThreads=16 – Number of parallel threads used for stop-the-world phases
      -XX:+ParallelRefProcEnabled – Enable parallel processing of references during GC pauses

      Misc CMS options
      -XX:+CMSIncrementalMode – Enable incremental CMS mode. Incremental mode was meant for servers with a small number of CPUs, but may be used on multicore servers to benefit from a more conservative initiation strategy.
      -XX:+CMSClassUnloadingEnabled – If not enabled, CMS will not clean permanent space. You may need to enable it for containers such as JEE or OSGi.
      -XX:CMSOldPLABMin=16 / -XX:CMSOldPLABMax=1024 – Min and max size of CMS gen PLAB caches per worker per block size

      CMS Diagnostic options
      -XX:PrintCMSStatistics=1 – Print additional CMS statistics. Very verbose if n=2.
      -XX:+PrintCMSInitiationStatistics – Print CMS initiation details
      -XX:+CMSDumpAtPromotionFailure – Dump useful information about the state of the CMS old generation upon a promotion failure
      -XX:+CMSPrintChunksInDump – (with the option above) Add more detailed information about the free chunks
      -XX:+CMSPrintObjectsInDump – (with the option above) Add more detailed information about the allocated objects

      Misc options
      -XX:+DisableExplicitGC – JVM will ignore application calls to System.gc()
      -XX:+ExplicitGCInvokesConcurrent – Let System.gc() trigger concurrent collection instead of full GC
      -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses – Same as above, but also triggers permanent space collection
      -XX:SoftRefLRUPolicyMSPerMB=1000 – Factor for calculating soft reference TTL based on free heap size
      -XX:OnOutOfMemoryError=… – Command to be executed in case of out of memory, e.g. “kill -9 %p” on Unix or “taskkill /F /PID %p” on Windows

      Garbage First (G1)
      -XX:G1HeapRegionSize=32m – Size of a heap region
      -XX:MaxGCPauseMillis=500 – Target GC pause duration. G1 is not deterministic, so there is no guarantee that GC pauses satisfy this limit.
      -XX:G1ReservePercent=10 – Percentage of heap to keep free. Reserved memory is used as a last resort to avoid promotion failure.
      -XX:G1ConfidencePercent=50 – Confidence level for MMU/pause prediction
      -XX:G1HeapWastePercent=10 – If the garbage level is below this threshold, G1 will not attempt to reclaim memory further
      -XX:G1MixedGCCountTarget=8 – Target number of mixed collections after a marking cycle
      -XX:InitiatingHeapOccupancyPercent=45 – Percentage of (entire) heap occupancy that triggers concurrent GC

      Available combinations of garbage collection algorithms in HotSpot JVM (young collector / old collector / JVM flags)
      Serial (DefNew) / Serial Mark Sweep Compact / -XX:+UseSerialGC
      Parallel scavenge (PSYoungGen) / Serial Mark Sweep Compact (PSOldGen) / -XX:+UseParallelGC
      Parallel scavenge (PSYoungGen) / Parallel Mark Sweep Compact (ParOldGen) / -XX:+UseParallelOldGC
      Parallel (ParNew) / Serial Mark Sweep Compact / -XX:+UseParNewGC
      Serial (DefNew) / Concurrent Mark Sweep / -XX:-UseParNewGC -XX:+UseConcMarkSweepGC (note the minus before UseParNewGC, which explicitly disables parallel mode)
      Parallel (ParNew) / Concurrent Mark Sweep / -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
      Garbage First (G1) / – / -XX:+UseG1GC

      GC logging
      -verbose:gc or -XX:+PrintGC – Print basic GC info
      -XX:+PrintGCDetails – Print more detailed GC info
      -XX:+PrintGCTimeStamps – Print timestamps for each GC event (seconds counted from start of JVM)
      -XX:+PrintGCDateStamps – Print date stamps at garbage collection events: 2011-09-08T14:20:29.557+0400: [GC...
      -Xloggc:<file> – Redirects GC output to a file instead of console

      GC log detail options
      -XX:+PrintTLAB – Print TLAB allocation statistics
      -XX:+PrintReferenceGC – Print times for special (weak, JNI, etc.) reference processing during STW pauses
      -XX:+PrintJNIGCStalls – Report if GC is waiting for native code to unpin an object in memory
      -XX:+PrintClassHistogramBeforeFullGC – Print class histogram before full GC
      -XX:+PrintClassHistogramAfterFullGC – Print class histogram after full GC
      -XX:+PrintGCCause – Add the cause of GC to the log
      -XX:+PrintHeapAtGC – Print heap details on GC
      -XX:+PrintHeapAtSIGBREAK – Print heap details on signal
      -XX:+PrintAdaptiveSizePolicy – Print young space sizing decisions
      -XX:+PrintPromotionFailure – Print additional information on promotion failure
      -XX:+PrintPLAB – Print survivor PLAB details
      -XX:+PrintOldPLAB – Print old space PLAB details
      -XX:+PrintTenuringDistribution – Print detailed demography of young space after each collection
      -XX:+PrintGCTaskTimeStamps – Print timestamps for individual GC worker thread tasks (very verbose)
      -XX:+PrintGCApplicationStoppedTime – Print a summary after each JVM safepoint (including non-GC ones)
      -XX:+PrintGCApplicationConcurrentTime – Print time for each concurrent phase of GC

      GC log rotation
      -XX:+UseGCLogFileRotation – Enable GC log rotation
      -XX:GCLogFileSize=512m – Size threshold for a GC log file
      -XX:NumberOfGCLogFiles=5 – Number of GC log files

      Memory sizing options
      -Xms256m or -XX:InitialHeapSize=256m – Initial size of JVM heap (young + old)
      -Xmx2g or -XX:MaxHeapSize=2g – Max size of JVM heap (young + old)
      -XX:NewSize=64m -XX:MaxNewSize=64m – Absolute (initial and max) size of young space (Eden + 2 Survivors)
      -XX:NewRatio=3 – Alternative way to specify the size of young space. Sets the ratio of young vs old space (e.g. -XX:NewRatio=2 means that young space is 2 times smaller than old space, i.e. 1/3 of heap size).
      -XX:SurvivorRatio=15 – Sets the size of a single survivor space relative to Eden space size (e.g. -XX:NewSize=64m -XX:SurvivorRatio=6 means that each Survivor space will be 8m and Eden will be 48m)
      -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=1g – Initial and max size of the JVM’s metaspace
      -XX:CompressedClassSpaceSize=1g – Memory reserved for compressed class space (64-bit only)
      -XX:InitialCodeCacheSize=256m -XX:ReservedCodeCacheSize=512m – Initial and max size of the code cache area
      -Xss256k (size in bytes) or -XX:ThreadStackSize=256 (size in Kbytes) – Thread stack size
      -XX:MaxDirectMemorySize=2g – Maximum amount of memory available for NIO off-heap byte buffers

      (Diagram: Java process memory = JVM memory [Java heap (-Xms/-Xmx): young gen (Eden, Survivor0, Survivor1) + old gen; non-heap: thread stacks, Metaspace, compressed class space, code cache, NIO direct buffers, other JVM memory] + non-JVM memory (native libraries))

      GC options cheat sheet download at
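Whether sizing flags like those above actually took effect can be cross-checked at runtime through `MemoryPoolMXBean` (an illustrative sketch; pool names and layout vary by JVM version and collector):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class HeapLayoutDump {

    // One line per memory pool: type, name, and configured max (or "unbounded").
    public static String describePools() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) continue; // pool may be invalid
            long max = usage.getMax();
            sb.append(pool.getType()).append('\t')
              .append(pool.getName()).append('\t')
              .append(max < 0 ? "unbounded" : (max / (1024 * 1024)) + "mb")
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describePools());
        System.out.println("max heap: "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024) + "mb");
    }
}
```

Running this with different -Xmx / -XX:NewSize settings makes the Eden, survivor, and old-gen split from the diagram directly visible.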
  22. Visualize GC dynamics with GCViewer
  23. Visualize GC dynamics with Mission Control
  24. Tracing object allocation hot spots
      Traditional approach – instrumentation
       • Instruments each allocation site
       • Intrusive and expensive
      Flight Recorder approach – outside-of-TLAB allocation tracing
       • Low overhead
       • No byte code transformation
       • Biased sampling
  25. Tracking allocation hotspots with Mission Control
  26. Conclusion
       • A bottleneck can be anywhere in the system
       • Move from top to bottom
       • Use the right tools at each level
       • Focus on application KPIs
       • There is no silver-bullet tool
  27. KEEP CALM AND MAKE YOUR JAVA FAST
      Alexey Ragozin