This document summarizes a presentation on garbage collection tuning in the Java HotSpot Virtual Machine. It introduces the presenters and their backgrounds in GC and Java performance, stresses that GC tuning is an art that requires experience, and offers tuning advice for the young generation, the Parallel GC, and the Concurrent Mark-Sweep (CMS) GC. Monitoring GC performance and avoiding heap fragmentation are also covered.
GC Tuning in the HotSpot Java VM - a FISL 10 Presentation
1. Garbage Collection Tuning in the Java HotSpot™ Virtual Machine
Tony Printezis, Charlie Hunt, Ludovic Poitou
Sun Microsystems, Inc.
2. Who We Are
• Tony Printezis
> GC Group / HotSpot JVM development team
> Been working on the HotSpot JVM since 2006
> 10+ years of GC experience
• Charlie Hunt
> Java Platform Performance Engineering Group
> Works with many Sun product teams and customers
> 10+ years of Java technology performance work
• Ludovic Poitou (just the narrator)
> Directory Services Engineering, OpenDS Community guy
> 10+ years of scaling LDAP directories, now with Java
3. If you remember only one thing...
GC Tuning is an Art!
4. GC Tuning is an Art
• Unfortunately, we can't give you a flawless recipe or a flowchart that will apply to all your GC tuning scenarios
• GC tuning involves a lot of common pattern recognition
• This pattern recognition requires experience
> We have a lot of it. :-)
5. Agenda
• Introductions
• Brief GC Overview
• GC Tuning
> Tuning the young generation
> Tuning Parallel GC
> Tuning CMS
• Monitoring the GC
• Conclusions
6. GCs in the HotSpot JVM
• Three available GCs:
> Serial GC
> Parallel GC / Parallel Old GC
> Concurrent Mark-Sweep GC (CMS)
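For reference, a minimal sketch of how each collector is selected on the command line (MyApp is a hypothetical application class):
  java -XX:+UseSerialGC MyApp                              (Serial GC)
  java -XX:+UseParallelGC -XX:+UseParallelOldGC MyApp      (Parallel GC with Parallel Old GC)
  java -XX:+UseConcMarkSweepGC MyApp                       (CMS)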
7. Heap Layout (same for all GCs)
[Diagram: the heap is divided into a Young Generation, an Old Generation, and a Permanent Generation.]
8. Young Generation
[Diagram: allocation (new Object()) goes into Eden; two Survivor Spaces sit alongside it.]
9. Old Generation
[Diagram: receives promotion, i.e., survivors from the Young Generation.]
10. Permanent Generation
[Diagram: allocation happens only directly from the JVM.]
11. Agenda
• Introductions
• Brief GC Overview
• GC Tuning
> Tuning the young generation
> Tuning Parallel GC
> Tuning CMS
• Monitoring the GC
• Conclusions
12. Your Dream GC
• You would really like a GC that has
> Low GC overhead,
> Low GC pause times, and
> Good space efficiency
• Unfortunately, you'll have to pick two (any two!)
13. Heap Sizing Tuning Advice
Supersize it!
14. Heap Sizing Trade-Offs
• Generally, the larger the heap space, the better
> For both young and old generation
> Larger space: less frequent GCs, lower GC overhead, objects more likely to become garbage
> Smaller space: faster GCs (not always! see later)
• Sometimes max heap size is dictated by available memory and/or max space the JVM can address
> You have to find a good balance between young and old generation size
15. Generation Size Roles
• Young Generation Size
> Dictates frequency of minor GCs
> Dictates how many objects will be reclaimed in the young generation
– Along with tenuring threshold + survivor space size tuning
• Old Generation
> Should comfortably hold the application's steady-state live size
> Decrease the major GC frequency as much as possible
16. Two Very Important Points
• You should try to maximize the number of objects reclaimed in the young generation
> This is probably the most important piece of advice when sizing a heap and/or tuning the young generation
• Your application's memory footprint should not exceed the available physical memory
> This is probably the second most important piece of advice when sizing a heap
• The above apply to all our GCs
17. Sizing Heap Spaces
• -Xmx<size> : max heap size
> young generation + old generation
• -Xms<size> : initial heap size
> young generation + old generation
• -Xmn<size> : young generation size
• Applications with emphasis on performance tend to set -Xms and -Xmx to the same value
• When -Xms != -Xmx, heap growth or shrinking requires a Full GC
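As an illustration (sizes are hypothetical and workload-dependent), a performance-oriented invocation typically pins the heap and sizes the young generation explicitly:
  java -Xms2g -Xmx2g -Xmn512m MyApp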
18. Should -Xms == -Xmx?
• Set -Xms to what you think would be your desired heap size
> It's expensive to grow the heap
• If memory allows, set -Xmx to something larger than -Xms “just in case”
> Maybe the application is hit with more load
> Maybe the DB gets larger over time
• On most occasions, it's better to do a Full GC and grow the heap than to get an OOM and crash
19. Sizing Heap Spaces (ii)
• -XX:PermSize=<size> : permanent generation initial size
• -XX:MaxPermSize=<size> : permanent generation max size
• Applications with emphasis on performance almost always set -XX:PermSize and -XX:MaxPermSize to the same value
> Growing or shrinking the permanent generation requires a Full GC too
• Unfortunately, the permanent generation occupancy is hard to predict
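A sketch extending the previous example with a pinned permanent generation (256m is a hypothetical value):
  java -Xms2g -Xmx2g -XX:PermSize=256m -XX:MaxPermSize=256m MyApp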
20. Stop-The-World Parallel GC Threads
• The number of parallel GC threads is controlled by -XX:ParallelGCThreads=<num>
• Default value assumes only one JVM per system
• Set the parallel GC thread number according to:
> Number of JVMs deployed on the system / processor set / zone
> CPU chip architecture
– Multiple hardware threads per chip core, i.e., UltraSPARC T1 / T2
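For example, with a hypothetical four JVMs sharing a 16-way machine, each might be capped at a quarter of the CPUs:
  java -XX:+UseParallelGC -XX:ParallelGCThreads=4 MyApp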
21. Agenda
• Introductions
• Brief GC Overview
• GC Tuning
> Tuning the young generation
> Tuning Parallel GC
> Tuning CMS
• Monitoring the GC
• Conclusions
22. Young Generation Sizing
• Eden size determines
> The frequency of minor GCs
> Which objects will be reclaimed at age 0
– Newly-allocated objects in Eden start from age 0
– Their age is incremented at every minor GC
• Increasing the size of Eden will not always affect minor GC times
> Remember: minor GC times are proportional to the number of objects they copy (i.e., the live objects), not to the young generation size
23.–25. Young Object Survivor Ratio (i–iii)
[Diagram series: objects plotted on an age axis from 0 (youngest, newly allocated) to oldest; the survivor ratio determines how much young-generation space is reserved for objects aging through the survivor spaces.]
26. Sizing Heap Spaces (iii)
• -XX:NewSize=<size> : initial young generation size
• -XX:MaxNewSize=<size> : max young generation size
• -XX:NewRatio=<ratio> : young generation to old generation ratio
• Applications with emphasis on performance tend to use -Xmn to size the young generation, since it combines the use of -XX:NewSize and -XX:MaxNewSize
27. Tenuring
• -XX:TargetSurvivorRatio=<percent>, e.g., 50
> How much of the survivor space should be filled
– Typically leave extra space to deal with “spikes”
• -XX:InitialTenuringThreshold=<threshold>
• -XX:MaxTenuringThreshold=<threshold>
• -XX:+AlwaysTenure
> Never keep any objects in the survivor spaces
• -XX:SurvivorRatio=<Integer>, e.g., 6
> Eden to Survivor Size Ratio
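A combined sketch (all values hypothetical; validate them against the observed tenuring distribution, shown on the next slides):
  java -Xmn512m -XX:SurvivorRatio=6 -XX:TargetSurvivorRatio=50 -XX:MaxTenuringThreshold=8 MyApp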
28. Tenuring Threshold Trade-Offs
• Try to retain as many objects as possible in the survivor spaces, so that they can be reclaimed in the young generation
> Less promotion into the old generation
> Less frequent old GCs
• But also, try not to unnecessarily copy very long-lived objects between the survivors
> Unnecessary overhead on minor GCs
• Not always easy to find the perfect balance
> Generally: better to copy more than to promote more
29. Tenuring Distribution
• Monitor the tenuring distribution with -XX:+PrintTenuringDistribution
Desired survivor size 6684672 bytes, new threshold 8 (max 8)
- age 1: 2315488 bytes, 2315488 total
- age 2: 19528 bytes, 2335016 total
- age 3: 96 bytes, 2335112 total
- age 4: 32 bytes, 2335144 total
• Young generation seems well tuned here
> We can even decrease the survivor space size
30. Tenuring Distribution (ii)
Desired survivor size 3342336 bytes, new threshold 1 (max 6)
- age 1: 3956928 bytes, 3956928 total
• Survivor space too small!
> Increase survivor space and/or eden size
31. Tenuring Distribution (iii)
Desired survivor size 3342336 bytes, new threshold 6 (max 6)
- age 1: 2483440 bytes, 2483440 total
- age 2: 501240 bytes, 2984680 total
- age 3: 50016 bytes, 3034696 total
- age 4: 49088 bytes, 3083784 total
- age 5: 48616 bytes, 3132400 total
- age 6: 50128 bytes, 3182528 total
• Might be able to do better
> Either increase max tenuring threshold
> Or even set max tenuring threshold to 2
– If ages > 6 still have around 50K of surviving bytes
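For the distribution above, the second option would be expressed as follows (a sketch, assuming the roughly 50K of long-lived bytes is cheaper to promote once than to keep copying between survivors):
  java -XX:MaxTenuringThreshold=2 MyApp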
32. Agenda
• Introductions
• Brief GC Overview
• GC Tuning
> Tuning the young generation
> Tuning Parallel GC
> Tuning CMS
• Monitoring the GC
• Conclusions
33. Parallel GC Ergonomics
• The Parallel GC has ergonomics
> i.e., auto-tuning
• Ergonomics help in improving out-of-the-box GC performance
• To get maximum performance, most customers we know do manual tuning
34. Parallel GC Tuning Advice
• Tune the young generation as described so far
• Try to avoid / decrease the frequency of major GCs
• We know of customers who use the Parallel GC in low-pause environments
> Avoid Full GCs by avoiding / minimizing promotion
> Maximize heap size
35. NUMA
• Non-Uniform Memory Access
> Applicable to most SPARC, Opteron, and more recently Intel platforms
• -XX:+UseNUMA
• Splits the young generation into partitions
> Each partition “belongs” to a CPU
• Allocates new objects into the partition that belongs to the allocating CPU
• Big win for some applications
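A sketch of enabling it (the NUMA-aware allocator applies to the throughput collector):
  java -XX:+UseParallelGC -XX:+UseNUMA MyApp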
36. Agenda
• Introductions
• Brief GC Overview
• GC Tuning
> Tuning the young generation
> Tuning Parallel GC
> Tuning CMS
• Monitoring the GC
• Conclusions
37. CMS Tuning Advice
• Tune the young generation as described so far
• Need to be even more careful about avoiding premature promotion
> Originally we were using an +AlwaysTenure policy
> We have since changed our mind :-)
• Promotion in CMS is expensive (free lists)
• The more often promotion / reclamation happens, the more likely fragmentation will settle in the heap
38. CMS Tuning Advice (ii)
• We know customers who tune their applications to do mostly minor GCs, even with CMS
> CMS is used as a “safety net” when application load exceeds what they have provisioned for
> Schedule Full GCs at non-critical times (say, late at night) to “tidy up” the heap and minimize fragmentation
39. Fragmentation
• Two types
> External fragmentation
– No free chunk is large enough to satisfy an allocation
> Internal fragmentation
– Allocator rounds up allocation requests
– Free space wasted due to this rounding up
40. Fragmentation (ii)
• The bad news: you can never eliminate it!
> It has been proven
• The good news: you can decrease its likelihood
> Decrease promotion into the CMS old generation
> Be careful when coding
– Large objects of various sizes are the main cause
41. Concurrent CMS GC Threads
• The number of parallel CMS threads is controlled by -XX:ParallelCMSThreads=<num>
> Available in post-6 JVMs
• Trade-off
> CMS cycle duration vs.
> Concurrent overhead during a CMS cycle
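A sketch with a hypothetical thread count:
  java -XX:+UseConcMarkSweepGC -XX:ParallelCMSThreads=2 MyApp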
42. Permanent Generation and CMS
• To date, classes will not be unloaded by default from the permanent generation when using CMS
> Both -XX:+CMSClassUnloadingEnabled and -XX:+CMSPermGenSweepingEnabled need to be set to enable class unloading in CMS
> The 2nd switch is not needed in post-6u4 JVMs
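A sketch for a pre-6u4 JVM (on later JVMs only the first of the two unloading flags should be needed):
  java -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled MyApp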
43. Setting CMS Initiating Threshold
• Again, a tricky trade-off!
• Starting a CMS cycle too early
> Frequent CMS cycles
> High concurrent overhead
• Starting a CMS cycle too late
> Chance of an evacuation failure / Full GC
• Initiating heap occupancy should be (much) higher than the application's steady-state live size
• Otherwise, CMS will constantly do CMS cycles
44. Common CMS Scenarios
• Applications that promote non-trivial amounts of objects to the old generation
> Old generation grows at a non-trivial rate
> Very frequent CMS cycles
> CMS cycles need to start relatively early
• Applications that promote very few or even no objects to the old generation
> Old generation grows very slowly, if at all
> Very infrequent CMS cycles
> CMS cycles can start quite late
45. Initiating CMS Cycles
• CMS will try to automatically find the best initiating occupancy
> It first does a CMS cycle early to collect stats
> Then, it tries to start cycles as late as possible, but early enough not to run out of heap before the cycle completes
> It keeps collecting stats and adjusting when to start cycles
> Sometimes, the second cycle starts too late
46. Initiating CMS Cycles (ii)
• -XX:CMSInitiatingOccupancyFraction=<percent>
> Occupancy percentage of the CMS old generation that triggers a CMS cycle
• -XX:+UseCMSInitiatingOccupancyOnly
> Don't use the ergonomic initiating occupancy
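A sketch that fixes the trigger at a hypothetical 70% old generation occupancy and disables the ergonomic trigger:
  java -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly MyApp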
47. Initiating CMS Cycles (iii)
• -XX:CMSInitiatingPermOccupancyFraction=<percent>
> Occupancy percentage of the permanent generation that triggers a CMS cycle
> Class unloading must be enabled
50. CMS Cycle Initiation Example (iii)
• Cycle started too late:
[ParNew 742993K->648506K(773376K), 0.1688876 secs]
[ParNew 753466K->659042K(773376K), 0.1695921 secs]
[CMS-initial-mark 661142K(773376K), 0.0861029 secs]
[Full GC 645986K->234335K(655360K), 8.9112629 secs]
[ParNew 339295K->247490K(773376K), 0.0230993 secs]
[ParNew 352450K->259959K(773376K), 0.1933945 secs]
51. Start CMS Cycles Explicitly
• If relying on explicit GCs and you want them to be concurrent, use:
> -XX:+ExplicitGCInvokesConcurrent
– Requires a post-6 JVM
> -XX:+ExplicitGCInvokesConcurrentAndUnloadsClasses
– Requires a post-6u4 JVM
• Useful when wanting to cause references / finalizers to be processed
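A sketch: with the flag below, an application's own System.gc() call triggers a concurrent CMS cycle instead of a stop-the-world Full GC:
  java -XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent MyApp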
52. Agenda
• Introductions
• Brief GC Overview
• GC Tuning
> Tuning the young generation
> Tuning Parallel GC
> Tuning CMS
• Monitoring the GC
• Conclusions
53. Monitoring the GC
• Online
> VisualVM: http://visualvm.dev.java.net/
> VisualGC:
– http://java.sun.com/performance/jvmstat/
– VisualGC is also available as a VisualVM plug-in
– Can monitor multiple JVMs within the same tool
• Offline
> GC Logging
> PrintGCStats
> GChisto
54. GC Logging in Production
• Don't be afraid to enable GC logging in production
> Very helpful when diagnosing production issues
• Extremely low / non-existent overhead
> Maybe some large files in your file system. :-)
> We are surprised that customers are still afraid to enable it
• Real customer quote:
> “If someone doesn't enable GC logging in production, I shoot them!”
55. Important GC Logging Parameters
• You need at least:
> -XX:+PrintGCTimeStamps
– Add -XX:+PrintGCDateStamps if you must
> -XX:+PrintGCDetails
– Preferred over -verbose:gc as it's more detailed
• Also useful:
> -Xloggc:<file>
> Separates GC logging output from application output
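Putting the recommended flags together (the log file name is hypothetical):
  java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log MyApp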
56. PrintGCStats
• Summarizes GC logs
• Downloadable script from
> http://java.sun.com/developer/technicalArticles/Programming/turbo/PrintGCStats.zip
• Usage
> PrintGCStats -v cpus=<num> <gc log file>
– Where <num> is the number of CPUs on the machine where the GC log was obtained
• It might not work with some of the printing flags
57. PrintGCStats Parallel GC
what count total mean max stddev
gen0t(s) 193 11.470 0.05943 0.687 0.0633
gen1t(s) 1 7.350 7.34973 7.350 0.0000
GC(s) 194 18.819 0.09701 7.350 0.5272
alloc(MB) 193 11244.609 58.26222 100.875 18.8519
promo(MB) 193 807.236 4.18257 96.426 9.9291
used0(MB) 193 16018.930 82.99964 114.375 17.4899
used1(MB) 1 635.896 635.89648 635.896 0.0000
used(MB) 194 91802.213 473.20728 736.490 87.8376
commit0(MB) 193 17854.188 92.50874 114.500 9.8209
commit1(MB) 193 123520.000 640.00000 640.000 0.0000
commit(MB) 193 141374.188 732.50874 754.500 9.8209
alloc/elapsed_time = 11244.609 MB / 77.237 s = 145.586 MB/s
alloc/tot_cpu_time = 11244.609 MB / 1235.792 s = 9.099 MB/s
alloc/mut_cpu_time = 11244.609 MB / 934.682 s = 12.030 MB/s
promo/elapsed_time = 807.236 MB / 77.237 s = 10.451 MB/s
promo/gc0_time = 807.236 MB / 11.470 s = 70.380 MB/s
gc_seq_load = 301.110 s / 1235.792 s = 24.366%
gc_conc_load = 0.000 s / 1235.792 s = 0.000%
gc_tot_load = 301.110 s / 1235.792 s = 24.366%
59. GChisto
• Graphical GC log visualizer
• Under development
> Currently, can only show pause times
• Open source at
> http://gchisto.dev.java.net/
• It might not work with some of the printing flags
60. GCHisto (ii)
[Screenshot: GCHisto pause-time view.]
62. Agenda
• Introductions
• Brief GC Overview
• GC Tuning
> Tuning the young generation
> Tuning Parallel GC
> Tuning CMS
• Monitoring the GC
• Conclusions
63. Conclusions
• Remember: GC tuning is an Art
• The talk contained
> Basic GC tuning concepts
> How to monitor GCs
> What to look out for
> Examples of good tuning practices
• ...and practice makes perfect!
64. Garbage Collection Tuning in the Java HotSpot™ Virtual Machine
Tony Printezis, Charlie Hunt
Antonios.Printezis@sun.com
Charlie.Hunt@sun.com