
I know why your Java is slow

Slide deck from Devoxx 2016



  1. I Know Why Your Java is Slow. Alexey Ragozin (@aragozin), #Devoxx #WhyJavaSlow
  2. After months of hard work, with tons of new features well tested and polished, your application is live! Everything runs smoothly until …
  3. Users start complaining about performance, the application has become unusable, stakeholders are getting nervous. You have to fix it! ASAP.
  4. (Diagram: the system as a chain of tiers – Browser, Server, Application, Database)
  5. (Same diagram, annotated with where slowdowns can hide: caching, JavaScript, HTTP, network, SQL performance, CPU, memory / swapping, disk IO, virtualization)
  6. What exactly does "slow" mean? Clarify your KPIs:
     • Business transaction ≠ Page
     • Business transaction ≠ SQL transaction
     • Page ≠ HTTP request
     • HTTP request ≠ SQL transaction
     Your system is a black box. Do you think you know how it works? You do not!
     Profiling produces three types of data:
     • Lies – incorrectly interpreted data
     • Darn lies – incorrectly measured data
     • Statistics – data which will help you fix your system
  7. What types of bottlenecks can we find in Java?
     CPU bound
     • Single core bottleneck
     • CPU starvation
     Thread contention
     Memory and GC related
     • Frequent young GC
     • Abnormally long GC pauses
     • Frequent full GC
  8. Single core bottleneck
     • A certain singleton thread consumes 100% CPU
     CPU starvation
     • A number of threads compete for the physical cores
     • Threads neither sleep nor wait, yet CPU usage stays far below 100%
     Thread contention
     • Threads spend considerable time in synchronization (acquiring, waiting) – see the sketch below
     Frequent young GC
     • Frequent young collections consume a noticeable share of the CPU budget
     • Caused by intensive memory allocation in application code
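     To make "thread contention" concrete, here is a minimal, hypothetical Java sketch (not from the slides): every worker funnels through one synchronized method, so under load most threads sit BLOCKED on the same monitor while overall CPU usage stays low.

     import java.util.HashMap;
     import java.util.Map;

     // Hypothetical illustration of a contended singleton lock.
     public class ContendedCache {

         private final Map<String, String> cache = new HashMap<>();

         // Every reader and writer goes through the same monitor,
         // so only one thread makes progress at a time.
         public synchronized String lookup(String key) {
             return cache.computeIfAbsent(key, k -> expensiveLoad(k));
         }

         private String expensiveLoad(String key) {
             // simulate slow work while the lock is held
             try { Thread.sleep(5); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
             return key.toUpperCase();
         }

         public static void main(String[] args) {
             ContendedCache cache = new ContendedCache();
             for (int i = 0; i < 16; i++) {
                 new Thread(() -> {
                     for (int j = 0; j < 1_000; j++) {
                         cache.lookup("key-" + (j % 100));
                     }
                 }, "worker-" + i).start();
             }
         }
     }

     A thread dump of such a process shows most workers BLOCKED on the ContendedCache monitor, which is exactly what sampling tools surface as contention.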
  9. Thread CPU usage
     • Number of CPU cycles spent on this thread (translated into time or a percentage)
     • Calculated by the OS (user + kernel)
     • A single thread can consume at most 100%
     Java thread in RUNNABLE state
     • Not BLOCKED, WAITING, SLEEPING or PARKED
     • A thread in a blocking socket read is RUNNABLE
     • A Java RUNNABLE thread
       – may not be runnable from the OS perspective
       – may be runnable but not on a CPU
       – may actually be running on a CPU (counted towards thread CPU usage)
     A small ThreadMXBean sketch follows below.
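     As an illustration of where those per-thread CPU numbers come from, here is a sketch using the standard ThreadMXBean API (the same counters that JVisualVM, JProfiler and SJK build on); the output format is made up for the example. Note that a thread parked in a blocking socket read shows as RUNNABLE yet accrues almost no CPU time.

     import java.lang.management.ManagementFactory;
     import java.lang.management.ThreadInfo;
     import java.lang.management.ThreadMXBean;

     // Dump per-thread CPU usage of the current JVM.
     public class ThreadCpuSnapshot {
         public static void main(String[] args) {
             ThreadMXBean threads = ManagementFactory.getThreadMXBean();
             if (threads.isThreadCpuTimeSupported()) {
                 threads.setThreadCpuTimeEnabled(true);
             }
             for (long id : threads.getAllThreadIds()) {
                 ThreadInfo info = threads.getThreadInfo(id);
                 if (info == null) continue;                     // thread may have died
                 long cpuNanos = threads.getThreadCpuTime(id);   // user + kernel
                 long userNanos = threads.getThreadUserTime(id);
                 System.out.printf("%-40s %-13s cpu=%.1fms user=%.1fms%n",
                         info.getThreadName(), info.getThreadState(),
                         cpuNanos / 1e6, userNanos / 1e6);
             }
         }
     }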
  10. Bird's-eye view of a Java process with JVisualVM
  11. Bird's-eye view of a Java process with JProfiler
  12. Bird's-eye view of a Java process with SJK ttop (https://github.com/aragozin/jvm-tools)
      > java -jar sjk.jar ttop -p 12345 -n 20 -o CPU
      2016-11-07T00:09:50.091+0400 Process summary
        process cpu=268.35%
        application cpu=244.42% (user=222.50% sys=21.93%)
        other: cpu=23.93%
        GC cpu=6.28% (young=1.62%, old=4.66%)
        heap allocation rate 1218mb/s
        safe point rate: 1.5 (events/s) avg. safe point pause: 43.24ms
        safe point sync time: 0.03% processing time: 6.39% (wallclock time)
      [000056] user=17.45% sys= 0.62% alloc=  106mb/s - hz._hzInstance_2_dev.generic-operation.thread-0
      [000094] user=17.76% sys= 0.31% alloc=  113mb/s - hz._hzInstance_3_dev.generic-operation.thread-1
      [000093] user=16.83% sys= 0.00% alloc=  111mb/s - hz._hzInstance_3_dev.generic-operation.thread-0
      [000020] user=16.06% sys= 0.15% alloc=  108mb/s - hz._hzInstance_1_dev.generic-operation.thread-0
      [000021] user=15.44% sys= 0.15% alloc=  110mb/s - hz._hzInstance_1_dev.generic-operation.thread-1
      [000057] user=14.36% sys= 0.00% alloc=  110mb/s - hz._hzInstance_2_dev.generic-operation.thread-1
      [000105] user=13.59% sys= 0.00% alloc=   72mb/s - hz._hzInstance_3_dev.cached.thread-1
      [000079] user=13.43% sys= 0.15% alloc=   67mb/s - hz._hzInstance_2_dev.cached.thread-3
      [000042] user=10.96% sys= 0.62% alloc=   65mb/s - hz._hzInstance_1_dev.cached.thread-2
      [000174] user=10.65% sys= 0.31% alloc=   66mb/s - hz._hzInstance_3_dev.cached.thread-7
      [000123] user= 8.96% sys= 0.00% alloc=   55mb/s - hz._hzInstance_4_dev.response
      [000129] user= 7.72% sys= 0.31% alloc=   21mb/s - hz._hzInstance_4_dev.generic-operation.thread-0
      [000168] user= 6.95% sys= 0.62% alloc=   33mb/s - hz._hzInstance_1_dev.cached.thread-6
      [000178] user= 7.57% sys= 0.00% alloc=   32mb/s - hz._hzInstance_2_dev.cached.thread-9
      [000166] user= 6.48% sys= 0.46% alloc=   33mb/s - hz._hzInstance_1_dev.cached.thread-5
      [000130] user= 6.02% sys= 0.62% alloc=   20mb/s - hz._hzInstance_4_dev.generic-operation.thread-1
      [000181] user= 5.56% sys= 0.00% alloc=   34mb/s - hz._hzInstance_2_dev.cached.thread-12
      [000014] user= 2.32% sys= 0.15% alloc= 7345kb/s - hz._hzInstance_1_dev.response
  13. Tracking CPU hogs by sampling with JProfiler
  14. Tracking CPU hogs by sampling: flame graph with SJK
  15. JEE world example – JBoss + Seam + Hibernate
  16. JEE world example – JBoss + Seam + Hibernate
      Command:
      sjk ssa -f tracedump.std --categorize -tf **.CoyoteAdapter.service -nc
          JDBC=**.jdbc
          Hibernate=org.hibernate
          "Facelets compile=com.sun.faces.facelets.compiler.Compiler.compile"
          "Seam bijection=org.jboss.seam.**.aroundInvoke/!**.proceed"
          JSF.execute=com.sun.faces.lifecycle.LifecycleImpl.execute
          JSF.render=com.sun.faces.lifecycle.LifecycleImpl.render
          Other=**
      Report:
      Total samples      2732050   100.00%
      JDBC                405439    14.84%
      Hibernate           802932    29.39%
      Facelets compile    395784    14.49%
      Seam bijection      385491    14.11%
      JSF.execute         290355    10.63%
      JSF.render          297868    10.90%
      Other               154181     5.64%
      (Chart: the same breakdown over time, stacked to 100%, built in Excel)
  17. Tracking CPU hot spots using thread sampling
      PRO
      • Low overhead
      • No upfront configuration
      • Can identify CPU hot spots, IO hot spots, contention
      • Data are comparable across runs
      CON
      • Limited precision
      • No real method execution time is measured
      • Safe point bias
      • Granularity is limited to static program structure – methods
      A toy sampler sketch follows below.
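      For intuition only, a toy sampler along the lines described above (an assumption-level sketch, not the talk's tooling): it periodically collects stack traces via Thread.getAllStackTraces() and counts the top frames of RUNNABLE threads. It inherits the same safe point bias the slide warns about.

      import java.util.HashMap;
      import java.util.Map;

      // Naive in-process sampler: count top-of-stack frames of RUNNABLE threads.
      public class NaiveSampler {
          public static void main(String[] args) throws InterruptedException {
              Map<String, Integer> hits = new HashMap<>();
              for (int i = 0; i < 200; i++) {                 // ~200 samples, 10ms apart
                  for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                      StackTraceElement[] stack = e.getValue();
                      if (e.getKey().getState() == Thread.State.RUNNABLE && stack.length > 0) {
                          hits.merge(stack[0].toString(), 1, Integer::sum);
                      }
                  }
                  Thread.sleep(10);
              }
              // print the ten most frequently sampled frames
              hits.entrySet().stream()
                  .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                  .limit(10)
                  .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
          }
      }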
  18. Instrumentation profiling
      • Modifies byte code at runtime
      • Modifies the behaviour of the program
      • Affects JIT compilation
      • A probe can access method arguments
      When should/can instrumentation be used?
      • You have narrowed down the problem area
      • You need more context to find the root cause
      A minimal agent skeleton follows below.
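      A minimal skeleton of how instrumentation-based profilers hook into the JVM, sketched from the standard java.lang.instrument API (class names and the package filter are illustrative, not from the talk). Real tools rewrite the byte code with a library such as ASM or Javassist; this stub only logs which classes it would touch.

      import java.lang.instrument.ClassFileTransformer;
      import java.lang.instrument.Instrumentation;
      import java.security.ProtectionDomain;

      // Loaded with -javaagent:agent.jar; the manifest names this class as Premain-Class.
      public class ProfilerAgent {

          public static void premain(String agentArgs, Instrumentation inst) {
              inst.addTransformer(new ClassFileTransformer() {
                  @Override
                  public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                                          ProtectionDomain protectionDomain, byte[] classfileBuffer) {
                      if (className != null && className.startsWith("org/jboss/seam/")) {
                          // a real probe would insert timing calls around interesting methods here
                          System.out.println("would instrument " + className);
                      }
                      return null; // null = leave the class unchanged
                  }
              });
          }
      }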
  19. BTrace (https://github.com/jbachorik/btrace2)
      • Open source
      • Byte code instrumentation
      • Probes are coded in Java
      • CLI and API
      BTrace script:

      @Property
      Profiler prof = Profiling.newProfiler();

      // entry probe: record entry into Seam injection/outjection methods
      @OnMethod(clazz = "org.jboss.seam.Component",
                method = "/(inject|disinject|outject)/")
      void entryByMethod2(@ProbeClassName String className,
                          @ProbeMethodName String methodName,
                          @Self Object component) {
          if (component != null) {
              Field nameField = field(classOf(component), "name", true);
              if (nameField != null) {
                  String name = (String) get(nameField, component);
                  Profiling.recordEntry(prof,
                      concat("org.jboss.seam.Component.", concat(methodName, concat(":", name))));
              }
          }
      }

      // exit probe: record duration on return from the same methods
      @OnMethod(clazz = "org.jboss.seam.Component",
                method = "/(inject|disinject|outject)/",
                location = @Location(value = Kind.RETURN))
      void exitByMthd2(@ProbeClassName String className,
                       @ProbeMethodName String methodName,
                       @Self Object component,
                       @Duration long duration) {
          if (component != null) {
              Field nameField = field(classOf(component), "name", true);
              if (nameField != null) {
                  String name = (String) get(nameField, component);
                  Profiling.recordExit(prof,
                      concat("org.jboss.seam.Component.", concat(methodName, concat(":", name))),
                      duration);
              }
          }
      }
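      For reference, a script like the one above is typically compiled and attached to a running JVM with the BTrace command line client, roughly `btrace <pid> SeamProfiler.java` (the script file name is illustrative; check the exact invocation against the BTrace release you use).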
  20. Is garbage collection the one to blame?
      • Enable GC logs to see the whole picture
        – -XX:+PrintGCDetails
        – -XX:+PrintReferenceGC
      Common problems
      • The JVM does not have enough memory
      • Intensive allocation of short-lived objects by the application (see the sketch below)
      • Reference problems
        – Large object resurrection by finalizers
        – Too many references
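      A hypothetical before/after sketch of "intensive allocation of short-lived objects" (names and numbers invented for illustration): the naive variant allocates a defensive copy and a builder on every call and keeps the young collector busy; the second variant reuses a per-thread buffer and allocates almost nothing on the hot path.

      // Hypothetical illustration of allocation pressure on a hot path.
      public class AllocationPressure {

          // allocation-heavy: every call produces garbage that dies young
          static String formatRecordNaive(int id, byte[] payload) {
              byte[] copy = payload.clone();
              StringBuilder sb = new StringBuilder();
              sb.append("record-").append(id).append(':').append(copy.length);
              return sb.toString();
          }

          // reuse a per-thread buffer instead of allocating per call
          private static final ThreadLocal<StringBuilder> BUF =
                  ThreadLocal.withInitial(() -> new StringBuilder(64));

          static String formatRecordReusing(int id, byte[] payload) {
              StringBuilder sb = BUF.get();
              sb.setLength(0);
              sb.append("record-").append(id).append(':').append(payload.length);
              return sb.toString();
          }

          public static void main(String[] args) {
              byte[] payload = new byte[256];
              for (int i = 0; i < 1_000_000; i++) {
                  formatRecordNaive(i, payload);     // watch young GC frequency with -XX:+PrintGCDetails
                  formatRecordReusing(i, payload);
              }
          }
      }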
  21. HotSpot JVM GC options cheat sheet (by Alexey Ragozin – http://blog.ragozin.info; download at http://blog.ragozin.info/2016/10/hotspot-jvm-garbage-collection-options.html). All concrete numbers in the options below are for illustration purposes only.

      Young space tenuring
      • -XX:InitialTenuringThreshold=8 – Initial value of the tenuring threshold (number of collections before an object is promoted to old space)
      • -XX:MaxTenuringThreshold=15 – Max value of the tenuring threshold
      • -XX:PretenureSizeThreshold=2m – Max object size allowed to be allocated in young space (larger objects are allocated directly in old space). Thread local allocation bypasses this check, so if the TLAB is large enough an object exceeding the size threshold may still be allocated in young space.
      • -XX:+AlwaysTenure – Promote all objects surviving a young collection immediately to tenured space (equivalent of -XX:MaxTenuringThreshold=0)
      • -XX:+NeverTenure – Objects from young space never get promoted to tenured space unless the survivor space is too small to keep them

      Thread local allocation
      • -XX:+UseTLAB – Use thread local allocation blocks in Eden
      • -XX:+ResizeTLAB – Let the JVM resize TLABs per thread
      • -XX:TLABSize=1m – Initial size of a thread's TLAB
      • -XX:MinTLABSize=64k – Min size of a TLAB

      CMS initiating options (options for "deterministic" CMS: they disable some heuristics and require careful validation)
      • -XX:+UseCMSInitiatingOccupancyOnly – Use the predefined occupancy as the only criterion for starting a CMS collection (disable adaptive behaviour)
      • -XX:CMSInitiatingOccupancyFraction=70 – Percentage of CMS generation occupancy at which to start a CMS cycle. A negative value means that CMSTriggerRatio is used.
      • -XX:CMSBootstrapOccupancy=50 – Percentage of CMS generation occupancy at which to initiate a CMS collection for bootstrapping collection stats
      • -XX:CMSTriggerRatio=70 – Percentage of MinHeapFreeRatio in the CMS generation that is allocated before a CMS collection cycle commences
      • -XX:CMSTriggerInterval=60000 – Periodically triggers a CMS collection. Useful for deterministic object finalization.
      • -XX:CMSWaitDuration=30000 – Once a CMS collection is triggered, CMS waits for the next young collection to perform the initial mark right after it. This parameter specifies how long CMS can wait for a young collection.

      CMS stop-the-world pause tuning
      • -XX:+CMSScavengeBeforeRemark – Force a young collection before the remark phase
      • -XX:+CMSScheduleRemarkEdenSizeThreshold – If Eden usage is below this value, don't try to schedule remark
      • -XX:CMSScheduleRemarkEdenPenetration=20 – Eden occupancy % at which to try and schedule the remark pause
      • -XX:CMSScheduleRemarkSamplingRatio=4 – Start sampling Eden top at least before young generation occupancy reaches 1/<this ratio> of the size at which we plan to schedule remark

      CMS concurrency options
      • -XX:+CMSConcurrentMTEnabled – Use multiple threads for concurrent phases
      • -XX:+CMSParallelRemarkEnabled – Whether parallel remark is enabled (enabled by default)
      • -XX:+CMSParallelSurvivorRemarkEnabled – Whether parallel remark of survivor space is enabled; effective only with the option above (enabled by default)
      • -XX:+CMSParallelInitialMarkEnabled – Whether parallel initial mark is enabled (enabled by default)

      Misc CMS options
      • -XX:+CMSIncrementalMode – Enable incremental CMS mode. Incremental mode was meant for servers with a small number of CPUs, but may be used on multicore servers to benefit from a more conservative initiation strategy.
      • -XX:+CMSClassUnloadingEnabled – If not enabled, CMS will not clean permanent space. You may need to enable it for containers such as JEE or OSGi.
      • -XX:CMSOldPLABMin=16 / -XX:CMSOldPLABMax=1024 – Min and max size of CMS old gen PLAB caches per worker per block size

      CMS diagnostic options
      • -XX:PrintCMSStatistics=1 – Print additional CMS statistics. Very verbose if n=2.
      • -XX:+PrintCMSInitiationStatistics – Print CMS initiation details
      • -XX:+CMSDumpAtPromotionFailure – Dump useful information about the state of the CMS old generation upon a promotion failure
      • -XX:+CMSPrintChunksInDump – (with the option above) Add more detailed information about the free chunks
      • -XX:+CMSPrintObjectsInDump – (with the option above) Add more detailed information about the allocated objects

      Garbage First (G1)
      • -XX:G1HeapRegionSize=32m – Size of a heap region
      • -XX:MaxGCPauseMillis=500 – Target GC pause duration. G1 is not deterministic, so there is no guarantee GC pauses satisfy this limit.
      • -XX:G1ReservePercent=10 – Percentage of heap to keep free. Reserved memory is used as a last resort to avoid promotion failure.
      • -XX:G1ConfidencePercent=50 – Confidence level for MMU/pause prediction
      • -XX:G1HeapWastePercent=10 – If the garbage level is below this threshold, G1 will not attempt to reclaim memory further
      • -XX:G1MixedGCCountTarget=8 – Target number of mixed collections after a marking cycle
      • -XX:InitiatingHeapOccupancyPercent=45 – Percentage of (entire) heap occupancy at which to trigger a concurrent GC cycle

      Parallel processing
      • -XX:ConcGCThreads=2 – Number of parallel threads used for concurrent phases
      • -XX:ParallelGCThreads=16 – Number of parallel threads used for stop-the-world phases
      • -XX:+ParallelRefProcEnabled – Enable parallel processing of references during GC pauses

      Misc options
      • -XX:+DisableExplicitGC – The JVM will ignore application calls to System.gc()
      • -XX:+ExplicitGCInvokesConcurrent – Let System.gc() trigger a concurrent collection instead of a full GC
      • -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses – Same as above, but also triggers permanent space collection
      • -XX:SoftRefLRUPolicyMSPerMB=1000 – Factor for calculating soft reference TTL based on free heap size
      • -XX:OnOutOfMemoryError=… – Command to be executed on out of memory, e.g. "kill -9 %p" on Unix or "taskkill /F /PID %p" on Windows

      Available combinations of garbage collection algorithms in HotSpot JVM (young collector / old collector → JVM flags)
      • Serial (DefNew) / Serial Mark Sweep Compact → -XX:+UseSerialGC
      • Parallel scavenge (PSYoungGen) / Serial Mark Sweep Compact (PSOldGen) → -XX:+UseParallelGC
      • Parallel scavenge (PSYoungGen) / Parallel Mark Sweep Compact (ParOldGen) → -XX:+UseParallelOldGC
      • Parallel (ParNew) / Serial Mark Sweep Compact → -XX:+UseParNewGC
      • Serial (DefNew) / Concurrent Mark Sweep → -XX:-UseParNewGC -XX:+UseConcMarkSweepGC (notice the minus before UseParNewGC, which explicitly disables parallel mode)
      • Parallel (ParNew) / Concurrent Mark Sweep → -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
      • Garbage First (G1) → -XX:+UseG1GC

      GC logging
      • -verbose:gc or -XX:+PrintGC – Print basic GC info
      • -XX:+PrintGCDetails – Print more detailed GC info
      • -XX:+PrintGCTimeStamps – Print timestamps for each GC event (seconds counted from JVM start)
      • -XX:+PrintGCDateStamps – Print date stamps at garbage collection events, e.g. 2011-09-08T14:20:29.557+0400: [GC...
      • -Xloggc:<file> – Redirects GC output to a file instead of the console

      GC log rotation
      • -XX:+UseGCLogFileRotation – Enable GC log rotation
      • -XX:GCLogFileSize=512m – Size threshold for a GC log file
      • -XX:NumberOfGCLogFiles=5 – Number of GC log files

      GC log detail options
      • -XX:+PrintTenuringDistribution – Print detailed demography of young space after each collection
      • -XX:+PrintTLAB – Print TLAB allocation statistics
      • -XX:+PrintReferenceGC – Print times for special (weak, JNI, etc.) reference processing during STW pauses
      • -XX:+PrintJNIGCStalls – Reports if GC is waiting for native code to unpin an object in memory
      • -XX:+PrintClassHistogramBeforeFullGC / -XX:+PrintClassHistogramAfterFullGC – Print a class histogram before/after full GC
      • -XX:+PrintGCCause – Add the cause of GC to the log
      • -XX:+PrintHeapAtGC – Print heap details on GC
      • -XX:+PrintHeapAtSIGBREAK – Print heap details on signal
      • -XX:+PrintAdaptiveSizePolicy – Print young space sizing decisions
      • -XX:+PrintPromotionFailure – Print additional information for promotion failures
      • -XX:+PrintPLAB – Print survivor PLAB details
      • -XX:+PrintOldPLAB – Print old space PLAB details
      • -XX:+PrintGCTaskTimeStamps – Print timestamps for individual GC worker thread tasks (very verbose)
      • -XX:+PrintGCApplicationStoppedTime – Print a summary after each JVM safepoint (including non-GC ones)
      • -XX:+PrintGCApplicationConcurrentTime – Print the time of each concurrent GC phase

      Memory sizing options
      • -Xms256m or -XX:InitialHeapSize=256m – Initial size of the JVM heap (young + old)
      • -Xmx2g or -XX:MaxHeapSize=2g – Max size of the JVM heap (young + old)
      • -XX:NewSize=64m -XX:MaxNewSize=64m – Absolute (initial and max) size of young space (Eden + 2 survivors)
      • -XX:NewRatio=3 – Alternative way to specify the size of young space. Sets the ratio of young vs old space (e.g. -XX:NewRatio=2 means young space is 2 times smaller than old space, i.e. 1/3 of heap size).
      • -XX:SurvivorRatio=15 – Sets the size of a single survivor space relative to Eden (e.g. -XX:NewSize=64m -XX:SurvivorRatio=6 means each survivor space will be 8m and Eden will be 48m)
      • -XX:MetaspaceSize=512m -XX:MaxMetaspaceSize=1g – Initial and max size of the JVM's metaspace
      • -XX:CompressedClassSpaceSize=1g – Memory reserved for the compressed class space (64-bit only)
      • -XX:InitialCodeCacheSize=256m -XX:ReservedCodeCacheSize=512m – Initial and max size of the code cache area
      • -Xss256k (size in bytes) or -XX:ThreadStackSize=256 (size in kbytes) – Thread stack size
      • -XX:MaxDirectMemorySize=2g – Maximum amount of memory available for NIO off-heap byte buffers

      (Diagram: Java process memory layout – JVM memory = Java heap (young gen: Eden, Survivor 0, Survivor 1; old gen; sized by -Xms/-Xmx) plus non-heap (thread stacks, metaspace, compressed class space, code cache, NIO direct buffers, other JVM memory), plus non-JVM memory used by native libraries.)
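      As a usage sketch only (flag values are illustrative, in the spirit of the cheat sheet's own caveat), a CMS-based service with full GC logging might be started along these lines:
      java -Xms4g -Xmx4g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintReferenceGC -Xloggc:gc.log -jar app.jar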
  22. Visualize GC dynamics with GCViewer (https://github.com/chewiebug/GCViewer)
  23. Visualize GC dynamics with Mission Control
  24. Tracing object allocation hot spots
      Traditional approach – instrumentation
      • Instrumenting each allocation site
      • Intrusive and expensive
      Flight Recorder approach – out-of-TLAB allocation tracing
      • Low overhead
      • No byte code transformation
      • Biased sampling
      An example of starting a recording follows below.
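      For context, on the Oracle JDK 7/8 builds of that era Flight Recorder is a commercial feature that has to be unlocked; a recording can be started at launch roughly like this (duration and file name are illustrative; verify the flags against your JDK):
      java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=duration=120s,filename=allocation.jfr -jar app.jar
      The resulting .jfr file can then be opened in Mission Control, as shown on the next slide.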
  25. Tracking allocation hot spots with Mission Control
  26. Conclusion
      • A bottleneck could be anywhere in the system
      • Move from top to bottom
      • Use the right tools at each level
      • Focus on application KPIs
      • There is no silver bullet tool
  27. KEEP CALM AND MAKE YOUR JAVA FAST. Alexey Ragozin – alexey.ragozin@gmail.com – http://blog.ragozin.info
