The JVM Magic Baruch Sadogursky Consultant & Architect, AlphaCSP
Agenda Introduction GC Magic 101 General Optimizations Compiler Optimizations What can I do? Programming tips JVM configuration flags
Introduction
Introduction In the past, the JVM was considered by many to be Java’s Achilles’ heel Interpreter?! The JVM team improved performance by 300 to 3000 times JDK 1.6 compared to JDK 1.0 Java is measured to be 50% to 100+% the speed of C and C++ Jake2 vs Quake2 How can it be?
Java Virtual Machines Zoo CEE-J Excelsior JET Hewlett-Packard J9 (IBM) Jbed Jblend Jrockit MRJ MicroJvm MS JVM OJVM PERC Blackdown Java CVM Gemstone Golden Code Development Intent Novell NSIcomCrE-ME ChaiVM HotSpot AegisVM Apache Harmony CACAO Dalvik IcedTea IKVM.NET Jamiga JamVM Jaos JC Jelatine JVM JESSICA Jikes RVM Jnode JOP Juice Jupiter JX Kaffe leJOS Mika VM Mysaifu NanoVM SableVM Squawk virtual machine SuperWaba TinyVM VMkit of Low Level Virtual Machine Wonka VM Xam 5
HotSpot Virtual Machine Originally developed by Longview Technologies, first released by Sun in 1999 Contains: Class loader Bytecode interpreter 2 Virtual machines 7 Garbage collectors 2 Compilers Runtime libraries
HotSpot Virtual Machine Configured by hundreds of -XX flags Reminder -X options are non-standard -XX options have specific system requirements for correct operation Both are subject to change without notice
GC Magic 101
GC Is Slow? GC has bad performance reputation Reduces throughput Introduces pauses Unpredictable Uncontrolled Performance degradation is proportional to objects count Just give me the damn free() and malloc()! I’ll be just fine! Is it so?
Generational Collectors Weak generational hypothesis Most objects die young (AKA Infant mortality) Few old to young references Generations: regions holding objects of different ages GC is done separately once a generation fills Different GC algorithms The young (nursery) generation Collected by “Minor garbage collection” The old (tenured) generation Collected by “Major garbage collection”
GC Magic 101 Young is better than Tenured Let your objects die in young generation When possible and makes sense
GC Magic 101 Swapping is bad Application's memory footprint should not exceed the available physical memory
GC Magic 101 Choose: Throughput (client) Low-pause (server)
Tracking Collectors Algorithms Mark-Sweep collector Mark phase marks each reachable object Sweep phase “sweeps” the heap Non marked objects reclaimed as garbage Copying collector Heap is divided into two equal spaces When active space fills, live objects are copied to the unused space Only live objects are examined The roles of the spaces are then flipped
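The mark and sweep phases described on this slide can be sketched with a toy object graph. This is only an illustration of the algorithm, not how HotSpot implements it; the class `MarkSweepSketch` and its `Node` type are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a mark-sweep collector (illustrative only, not HotSpot code).
class MarkSweepSketch {
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        boolean marked;

        Node(String name) { this.name = name; }
    }

    // Mark phase: flag every object reachable from the given node.
    static void mark(Node node) {
        if (node.marked) return;
        node.marked = true;
        for (Node ref : node.refs) mark(ref);
    }

    // Sweep phase: drop unmarked objects and clear the marks on survivors.
    // Returns the number of reclaimed objects.
    static int sweep(List<Node> heap) {
        int before = heap.size();
        heap.removeIf(n -> !n.marked);
        for (Node n : heap) n.marked = false;
        return before - heap.size();
    }

    static int collect(List<Node> heap, List<Node> roots) {
        for (Node root : roots) mark(root);
        return sweep(heap);
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c");
        a.refs.add(b); // a -> b is reachable from the root; c is garbage
        List<Node> heap = new ArrayList<>(List.of(a, b, c));
        int reclaimed = collect(heap, List.of(a));
        System.out.println("Reclaimed " + reclaimed + " object(s)");
    }
}
```

Note that the sweep visits the whole heap, while a copying collector would only touch the live objects.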
Compaction Compaction: The collector moves all live objects to the bottom of the heap Remaining memory is reclaimed Reduces the cost of objects allocation No potential fragmentation The drawback is slower completion of GC
The Young generation Consists of Eden + two survivor spaces Objects are initially allocated in Eden All HotSpot young collectors are stop-the-world copying collectors Done in parallel for parallel garbage collectors Collections are relatively fast and proportional to number of live objects
The Young generation
The Tenured generation Objects surviving several GC cycles, are promoted to the tenured generation Use -XX:MaxTenuringThreshold=# to change Collectors algorithms used are variations of Mark-Sweep More space efficient Characteristics Lower garbage density Bigger heap space Fewer GC cycles
Generation Collectors
Garbage Collectors
GC Flags
When to Use
Garbage First (G1) New in JDK 1.6 u14 (May 29th) All memory is divided into 1MB buckets Calculates objects liveness in buckets Drops “dead” buckets If a bucket is not total garbage, it’s not dropped Collects the most garbage buckets first Pauses only on “mark” No sweep User can provide pause time goals Actual seconds or Percentage of runtime G1 records bucket collection time and can estimate how many buckets to collect during pause
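The "most garbage first, within a pause goal" idea can be sketched as a simple greedy selection. This is a deliberately simplified model, not G1's actual heuristic: the `Bucket` type, its fields and the cost estimates are all hypothetical values made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of choosing which buckets to collect (not G1's real algorithm).
class GarbageFirstSketch {
    static final class Bucket {
        final int id;
        final int garbageKb;          // hypothetical: reclaimable space in the bucket
        final double estimatedMillis; // hypothetical: recorded cost of collecting it

        Bucket(int id, int garbageKb, double estimatedMillis) {
            this.id = id;
            this.garbageKb = garbageKb;
            this.estimatedMillis = estimatedMillis;
        }
    }

    // Pick buckets with the most garbage first until the pause budget is used up.
    static List<Bucket> chooseCollectionSet(List<Bucket> buckets, double pauseGoalMillis) {
        List<Bucket> sorted = new ArrayList<>(buckets);
        sorted.sort(Comparator.comparingInt((Bucket b) -> b.garbageKb).reversed());
        List<Bucket> chosen = new ArrayList<>();
        double budget = pauseGoalMillis;
        for (Bucket b : sorted) {
            if (b.estimatedMillis > budget) break;
            chosen.add(b);
            budget -= b.estimatedMillis;
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Bucket> buckets = List.of(
                new Bucket(1, 900, 2.0), new Bucket(2, 100, 1.0), new Bucket(3, 600, 2.0));
        for (Bucket b : chooseCollectionSet(buckets, 4.5)) {
            System.out.println("collect bucket " + b.id);
        }
    }
}
```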
Garbage First (G1) Targets multi-processor machines and large heaps G1 will be the long-term replacement for the CMS collector Unlike CMS, compacts to battle fragmentation A bucket’s space is fully reclaimed Better throughput Predictable pauses (high probability) Garbage left in buckets with high live ratio May be collected later
Benefits of G1 No imbalance of young-tenured generation Generations are only logical Generations are merely sets of buckets More predictable GC pauses Parallelism and concurrency in collections No fragmentation due to compaction Better heap utilization Better GC ergonomics
Young GCs in G1 Done using evacuation pauses Stop-The-World parallel collections Evacuates surviving objects between sets of buckets
Old GCs in G1 Drops dead buckets Calculates liveness info per bucket Identifies best buckets for subsequent evacuation pauses Collects them piggy-backed on young GCs
GC Ergonomics
GC Ergonomics The goal of ergonomics is to provide good performance with little or no tuning Better matches the needs of different application types The HotSpot VM, garbage collector and heap size are automatically chosen Based on OS, RAM and number of CPUs Server vs. client class machine Hints at the characteristics of the application
GC Ergonomics
GC Ergonomics With the parallel collectors, one can specify performance goals In contrast to specifying the heap size Improves performance for large applications Max Pause Time Goal Use -XX:MaxGCPauseMillis=<N> Both generation separately Or: Average + Variance No pause time goal by default
GC Ergonomics Throughput Goal Use -XX:GCTimeRatio=<N> The ratio of GC Vs. application time is 1/(1+N) If N=19, GC time goal is 1/(1+19) or 5% Default N is 99, meaning GC time is 1% Minimum Footprint Goal Priority of goals Maximum pause time goal Throughput goal Minimum footprint goal
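The throughput-goal arithmetic on this slide, 1/(1+N), is easy to check with a few lines of code. The class and method names below are hypothetical; only the formula comes from the slide.

```java
// Worked arithmetic for -XX:GCTimeRatio=<N>: GC time goal = 1 / (1 + N).
class GcTimeRatioSketch {
    // Fraction of total run time the collector is allowed to use for a given N.
    static double gcTimeFraction(int n) {
        return 1.0 / (1 + n);
    }

    public static void main(String[] args) {
        System.out.println("N=19 -> GC time goal " + 100 * gcTimeFraction(19) + "%");
        System.out.println("N=99 -> GC time goal " + 100 * gcTimeFraction(99) + "%");
    }
}
```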
GC Ergonomics Performance goals may not be met Pause time and throughput goals are somewhat contradicting The pause time goal shrinks the generation The throughput goal grows the generation Statistics are kept by the GC Adaptive to changes in application behavior
GC Tweaking
Heap Size The larger the heap space, the better For both young and old generation Larger space: less frequent GCs, lower GC overhead, objects more likely to become garbage Smaller space: faster GCs (not always! see later) Sometimes max heap size is dictated by available memory and/or max space the JVM can address You have to find a good balance between young and old generation size
Heap Size Maximize the number of objects reclaimed in the young generation Application's memory footprint should not exceed the available physical memory Swapping is bad The above apply to all our GCs
Heap Size -Xmx<size> : max heap size young generation + old generation -Xms<size> : initial heap size young generation + old generation -Xmn<size> : young generation size -XX:PermSize=<size> : permanent generation initial size -XX:MaxPermSize=<size> : permanent generation max size
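One way to confirm that heap flags took effect is to ask the running JVM. The sketch below uses the standard `Runtime` API; the class name `HeapSizeSketch` and the helper `toMiB` are hypothetical, and the value reported by `maxMemory()` is only approximately the `-Xmx` setting.

```java
// Prints the heap limits the JVM is actually running with.
class HeapSizeSketch {
    static long toMiB(long bytes) {
        return bytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() roughly corresponds to -Xmx; the exact value can differ slightly.
        System.out.println("max heap:          " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory() is the heap currently committed (starts near -Xms).
        System.out.println("committed heap:    " + toMiB(rt.totalMemory()) + " MiB");
        System.out.println("free in committed: " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

Running it with, say, `-Xms256m -Xmx512m` (example values) and comparing the output is a quick sanity check before deeper tuning.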
Heap Size When -Xms != -Xmx, heap growth or shrinking requires a Full GC Set -Xms to desired heap size Set -Xmx even higher “just in case” Even full GC is better than OOM crash Same for -XX:PermSize and -XX:MaxPermSize Same for -XX:NewSize and -XX:MaxNewSize -Xmn Combines both
Tenuring Measure tenuring with -XX:+PrintTenuringDistribution Avoid tenuring for short or even medium-lived objects! Less promotion into the old generation Less frequent old GCs Promote long-lived objects ASAP Yeah, this conflicts with the previous bullet Better to copy more than to promote more -XX:TargetSurvivorRatio=<percent>, e.g., 50 How much of the survivor space should be filled Typically leave extra space to deal with “spikes”
Permanent Space Classes aren’t unloaded by default -XX:+CMSClassUnloadingEnabled to enable Classloader should be collected It holds references to classes Each object holds reference to classloader
GC Options
GC Statistics Options GC logging has extremely low / non-existent overhead It’s very helpful when diagnosing production issues Enable it In production too! -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution Shows the tenuring threshold and the ages of objects in the new generation
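Besides the logging flags, basic GC statistics can also be read from inside the application through the standard `java.lang.management` API. This is a minimal sketch; the class name `GcStatsSketch` and the helper `totalCollections` are hypothetical, and the collector names printed depend on the JVM and the collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Reads collection counts and accumulated GC time from the running JVM.
class GcStatsSketch {
    // Sum of collections over all collectors; undefined counts (-1) are skipped.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
        System.out.println("total collections: " + totalCollections());
    }
}
```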
GC Is Slow? – The Answers Reduces throughput You choose Introduces pauses You choose Unpredictable Not any more Uncontrolled Configurable Performance degradation is proportional to objects count Not true Just give me the damn free() and malloc()! I’ll be just fine! Bad idea (see more later)
General Optimizations
HotSpot Optimizations JIT Compilation Compiler Optimizations Generates more performant code than you could write in native Adaptive Optimization Split Time Verification Class Data Sharing
Two Virtual Machines? Client VM Reducing start-up time and memory footprint -client CL flag Server VM Maximum program execution speed -server CL flag Auto-detection Server: >1 CPUs & >=2GB of physical memory Win32 – always detected as client Many 64bit OSes don’t have client VMs
Just-In-Time Compilation Everyone knows about JIT! Hot code is compiled to native What is “hot”? Server VM – 10000 invocations Client VM – 1500 invocations Use -XX:CompileThreshold=# to change More invocations – better optimizations Fewer invocations – shorter warmup time
Just-In-Time Compilation The code is being optimized by the compiler Coming soon…
Adaptive Optimization Allows HotSpot to uncompile previously compiled code Much more aggressive, even speculative optimizations may be performed And rolled back if something goes wrong or new data gathered E.g. classloading might invalidate inlining
Split Time Verification Java suffers from long boot time One of the reasons is bytecode verification Valid flow control Type safety Visibility To ease the load on the weak KVM, J2ME started performing part of the verification at compile time It’s good, so now it’s in Java SE 6 too
Class Data Sharing Helps improve startup time During JDK installation part of rt.jar is preloaded into shared memory file which is attached in runtime No need to reload and reverify those classes every time
Compiler Optimizations
Two Types of Optimizations Java has two compilers: javac bytecode compiler HotSpot VM JIT compiler Both implement similar optimizations Bytecode compiler is limited Dynamic linking Can apply only static optimizations
Warning Caution! Don’t try this at home yourself! The source code you are about to see is not real! It’s pseudo assembly code Don’t write such code! Source code should be readable and object-oriented Bytecode will become performant automagically
Optimization Rules Make the common case fast Don't worry about uncommon/infrequent case Defer optimization decisions Until you have data Revisit decisions if data warrants
Null check Elimination Java is a null-safe language A pointer can’t point to a meaningless portion of memory Null checks are added by the compiler, NullPointerException is thrown JVM’s profiler can eliminate those checks
Example – Original Source
Example – Null Check Elimination
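The code from the original example slides did not survive, so here is a small stand-in that illustrates the idea. It is not the author's example: `NullCheckSketch`, `Point` and both methods are hypothetical, and the second method only shows conceptually what the optimizer can assume, since the real transformation happens in generated machine code.

```java
// Conceptual illustration of redundant null check elimination.
class NullCheckSketch {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // As written: each field access logically carries its own null check.
    static int sum(Point p) {
        return p.x + p.y;
    }

    // Conceptual view after optimization: a single check guards both accesses,
    // because p is known to be non-null once the first access has succeeded.
    static int sumConceptual(Point p) {
        if (p == null) throw new NullPointerException();
        return p.x + p.y;
    }

    public static void main(String[] args) {
        Point p = new Point(1, 2);
        System.out.println(sum(p) == sumConceptual(p));
    }
}
```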
Inlining Love Encapsulation? Getters and setters Love clean and simple code? Small methods Use static code analysis? Small methods No penalty for using those! JIT brings the implementation of these methods into a containing method This optimization is known as “Inlining”
Inlining Not just about eliminating call overhead Provides optimizer with bigger blocks Enables other optimizations hoisting, dead code elimination, code motion, strength reduction
Inlining But wait, all public non-final methods in Java are virtual! HotSpot examines the exact case in place In most cases there is only one implementation, which can be inlined But wait, more implementations may be loaded later! In such case HotSpot undoes the inlining Speculative inlining By default limited to 35 bytes of bytecode Use -XX:MaxInlineSize=# to change
Example - Inlining
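As the original example code is missing from this transcript, the following stand-in shows what inlining of getters and setters amounts to. The names (`InliningSketch`, `Counter`) are hypothetical, and the "inlined" method is written by hand only to show the effect; in practice you write the first version and the JIT does the rest.

```java
// Hand-written illustration of what inlining small accessors achieves.
class InliningSketch {
    static final class Counter {
        private int value;
        int getValue() { return value; }
        void setValue(int v) { value = v; }
    }

    // As written: five small method calls.
    static int incrementTwice(Counter c) {
        c.setValue(c.getValue() + 1);
        c.setValue(c.getValue() + 1);
        return c.getValue();
    }

    // Roughly what remains once the accessors are inlined and the
    // resulting bigger block is optimized: plain field arithmetic.
    static int incrementTwiceInlined(Counter c) {
        c.value += 2;
        return c.value;
    }

    public static void main(String[] args) {
        System.out.println(incrementTwice(new Counter()) == incrementTwiceInlined(new Counter()));
    }
}
```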
Example – Source Code Revision
Example – Source Code Revision
Code Hoisting Hoist = to raise or lift Size optimization Eliminate duplicate code in method bodies by hoisting expressions or statements Duplicate bytecode, not necessarily source code
Example – Code Hoisting
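The original example is not part of this transcript; a minimal stand-in for hoisting duplicated work out of two branches could look like this. Class and method names are hypothetical, and the second method is a source-level picture of a transformation that really happens on compiled code.

```java
// Illustration of hoisting a duplicated expression out of both branches.
class CodeHoistingSketch {
    // Before: the same multiplication appears in both branches.
    static int before(boolean flag, int a, int b) {
        if (flag) {
            int t = a * b;
            return t + 1;
        } else {
            int t = a * b;
            return t - 1;
        }
    }

    // After: the common expression is hoisted above the branch.
    static int after(boolean flag, int a, int b) {
        int t = a * b;
        return flag ? t + 1 : t - 1;
    }

    public static void main(String[] args) {
        System.out.println(before(true, 3, 4) == after(true, 3, 4));
        System.out.println(before(false, 3, 4) == after(false, 3, 4));
    }
}
```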
Bounds Check Elimination Java promises automatic boundary checks for arrays Exception is thrown If the code already guarantees that the index stays within the array’s bounds, the automatic check can be eliminated
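A typical case where the check is provably redundant is a loop whose condition already bounds the index. The sketch below is an assumption-labeled illustration (hypothetical class `BoundsCheckSketch`); whether a given JVM actually removes the check depends on the compiler and the code shape.

```java
// A loop shape in which per-access bounds checks are provably redundant.
class BoundsCheckSketch {
    static long sum(int[] values) {
        long total = 0;
        // The loop condition guarantees 0 <= i < values.length on every access,
        // so the compiler can prove the automatic check can never fail.
        for (int i = 0; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] {1, 2, 3}));
    }
}
```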
Loop Unrolling Some loops shouldn’t be loops In terms of performance, not code readability Those can be unrolled to a set of statements If the boundaries are dynamic, partial unroll will occur
Example – Loop Unrolling
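Since the slide's own example is missing here, this stand-in shows a partial unroll by a factor of four with a remainder loop, which is the shape needed when the bounds are dynamic. The names and the unroll factor are assumptions chosen for illustration; the JIT picks its own factor.

```java
// Hand-unrolled loop to illustrate what partial unrolling produces.
class LoopUnrollingSketch {
    // As written.
    static int sumRolled(int[] v) {
        int total = 0;
        for (int i = 0; i < v.length; i++) total += v[i];
        return total;
    }

    // Partially unrolled by 4: fewer loop-condition tests and jumps,
    // plus a remainder loop for lengths that are not a multiple of 4.
    static int sumUnrolled(int[] v) {
        int total = 0;
        int i = 0;
        int limit = v.length - 3;
        for (; i < limit; i += 4) {
            total += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
        }
        for (; i < v.length; i++) {
            total += v[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(sumRolled(data) == sumUnrolled(data));
    }
}
```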
Example – Inlining
Escape Analysis Escape analysis is not an optimization It is a check that an object does not escape local scope E.g. created in private method, assigned to local variable and not returned Escape analysis opens up possibilities for lots of optimizations
Scalar Replacement Remember the rule “new == always new object”? False! JVM can optimize away allocations Fields are hoisted into registers Object becomes unneeded But object creation is cheap! Yep, but GC is not so cheap…
Example – Source Code Revision
Example – Scalar Replacement
Example – Scalar Replacement
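The example code from these slides is not in the transcript, so the following stand-in illustrates the idea with a temporary object that never escapes its method. `ScalarReplacementSketch` and `Point` are hypothetical names, and the second method is a source-level picture of the result rather than something you should write by hand.

```java
// Illustration of scalar replacement for a non-escaping temporary object.
class ScalarReplacementSketch {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // As written: allocates a Point that never leaves this method.
    static int distanceSquared(int x1, int y1, int x2, int y2) {
        Point d = new Point(x2 - x1, y2 - y1);
        return d.x * d.x + d.y * d.y;
    }

    // Conceptual result: the fields become plain locals (registers),
    // and no object is allocated or later collected.
    static int distanceSquaredScalar(int x1, int y1, int x2, int y2) {
        int dx = x2 - x1;
        int dy = y2 - y1;
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        System.out.println(distanceSquared(0, 0, 3, 4) == distanceSquaredScalar(0, 0, 3, 4));
    }
}
```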
Lock Coarsening HotSpot merges adjacent synchronized blocks using the same lock The compiler is allowed to move statements into merged coarse blocks Tradeoff between performance and responsiveness Reduces instruction count But locks are held longer
Example – Source Code Revision
Example – Lock Coarsening
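In place of the missing slide code, here is a small stand-in for lock coarsening: two adjacent blocks on the same lock, and the merged form the compiler may produce. All names are hypothetical, and the merged method is written by hand only to show the effect.

```java
// Illustration of merging adjacent synchronized blocks on the same lock.
class LockCoarseningSketch {
    private final Object lock = new Object();
    private int a, b;

    // As written: the lock is acquired and released twice in a row.
    void updateSeparately() {
        synchronized (lock) { a++; }
        synchronized (lock) { b++; }
    }

    // Coarsened: one acquire/release pair, but the lock is held longer.
    void updateCoarsened() {
        synchronized (lock) {
            a++;
            b++;
        }
    }

    int sum() {
        synchronized (lock) { return a + b; }
    }

    public static void main(String[] args) {
        LockCoarseningSketch s = new LockCoarseningSketch();
        s.updateSeparately();
        s.updateCoarsened();
        System.out.println(s.sum());
    }
}
```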
Lock Elision A thread enters a lock that no other thread will synchronize on Synchronization has no effect Can be deduced using escape analysis Such locks can be elided Elides 4 StringBuffer synchronized calls:
Example - Lock Elision
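The slide's own code is not in the transcript. A plausible stand-in with exactly four synchronized `StringBuffer` calls on a buffer that never escapes the method is sketched below; the class and method names are hypothetical and this is not necessarily the example the author showed.

```java
// A local StringBuffer that no other thread can ever see.
class LockElisionSketch {
    static String join(String a, String b, String c) {
        StringBuffer sb = new StringBuffer(); // local, never escapes this method
        sb.append(a);                         // synchronized call 1
        sb.append(b);                         // synchronized call 2
        sb.append(c);                         // synchronized call 3
        return sb.toString();                 // synchronized call 4
    }

    public static void main(String[] args) {
        System.out.println(join("lock", " ", "elision"));
    }
}
```

Because escape analysis can show that `sb` is confined to one thread, the locking in all four calls has no observable effect and can be dropped.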
Constants Folding Trivial optimization How many constants are there? More than you think! Inlining generates constants Unrolling generates constants Escape analysis generates constants JIT determines what is constant in runtime Whatever doesn’t change
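The simplest form of constant folding already happens in javac for compile-time constant expressions; the JIT extends the idea to values that only turn out to be constant at run time. The sketch below shows the static case, with hypothetical class and method names.

```java
// Compile-time constant folding: the arithmetic is done once, not per call.
class ConstantFoldingSketch {
    // javac folds this constant expression to 86400 in the class file.
    static final int SECONDS_PER_DAY = 24 * 60 * 60;

    static long millisInDays(int days) {
        // Widen to long first to avoid int overflow for large inputs.
        return (long) days * SECONDS_PER_DAY * 1000L;
    }

    public static void main(String[] args) {
        System.out.println(millisInDays(1));
    }
}
```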
Dead Code Elimination Dead code - code that has no effect on the outcome of the program execution
public static void main(String[] args) {
    long start = System.nanoTime();
    int result = 0; // computed but never read, so the loop below is dead code
    for (int i = 0; i < 10 * 1000 * 1000; i++) {
        result += Math.sqrt(i);
    }
    long duration = (System.nanoTime() - start) / 1000000;
    System.out.format("Test duration: %d (ms) %n", duration);
}
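The benchmark on this slide never reads `result`, so the JIT is free to drop the loop and the measured time may say nothing about `Math.sqrt`. A common remedy is to make the result observable, for example by returning and printing it. The sketch below is one way to do that; `DeadCodeSketch` and `sumOfRoots` are hypothetical names, and serious measurements would still need proper warm-up or a benchmarking harness.

```java
// Variant of the benchmark in which the computed value stays live.
class DeadCodeSketch {
    // Returns the sum of truncated square roots so the caller can use it.
    static long sumOfRoots(int n) {
        long result = 0;
        for (int i = 0; i < n; i++) {
            result += (long) Math.sqrt(i);
        }
        return result;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long result = sumOfRoots(10 * 1000 * 1000);
        long duration = (System.nanoTime() - start) / 1000000;
        // Printing the result keeps the loop from being treated as dead code.
        System.out.format("Result: %d, test duration: %d (ms)%n", result, duration);
    }
}
```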
OSR - On Stack Replacement Normally code is switched from interpretation to native in heap context Before entering method OSR - switch from interpretation to compiled code in local context In the middle of a method call JVM tracks code block execution count Fewer optimizations May prevent bound check elimination and loop unrolling
Out-Of-Order Execution
Programming & Tuning Tips
How Can I Help? Just write good quality Java code Object Orientation Polymorphism Abstraction Encapsulation DRY KISS Let the HotSpot optimize
How Can I Help? final keyword For fields: Allows caching Allows lock coarsening For methods: Simplifies Inlining decisions Immutable objects die younger
JVM tuning tips Reminder: -XX options are non standard Added for HotSpot development purposes Mostly tested on Solaris 10 Platform dependent Some options may contradict each other Know and experiment with these options
Monitoring & Troubleshooting
References The HotSpot Home Page Java HotSpot VM Options Dynamic compilation and performance measurement Urban performance legends, revisited Synchronization optimizations in Mustang Robust Java benchmarking Garbage Collection Tuning
References JavaOne 2009 Sessions: Garbage Collection Tuning in the Java HotSpot™ Virtual Machine Under the Hood: Inside a High-Performance JVM™ Machine Practical Lessons in Memory Analysis Debugging Your Production JVM™ Machine Inside Out: A Modern Virtual Machine Revealed
Virtual machines don't have to be slow; they don't even have to be slower than running native code. All you have to do is write your code, sit back and let the JVM do its magic! Learn about various JVM runtime optimizations and why it is considered one of the best VMs in the world.