Advanced JVM Tuning
David Keenan
Language Runtimes and Performance
JavaOne 2013
CON4540
Performance Tuning Overview
Performance Tuning Overview
Top-Down Analysis
- Commonly used when you have the ability to change code at the highest level of the software stack.
1. Monitor the target application under load
- System-level diagnostics
- JVM-level diagnostics
2. Profile the application under load
3. Identify bottlenecks, analyze, and optimize.
- Make code more efficient
- Reduce allocation rates
4. Repeat
Performance Tuning Overview
Bottom-Up Analysis
- Commonly used when you do not have the ability to change code at the highest level of the software stack.
- JVM and OS performance optimization is a common use case.
1. Monitor CPU-level statistics against the target application under load
- Use hardware counters (cache misses, path length, etc.)
- Take a hardware profile and map it to instructions, OS/JVM code, and Scala/Java code
- Use tools when available, otherwise visually inspect the assembly code
2. Manipulate static and runtime compilers to address code issues
- Missed optimizations
- Example: autobox elision (see the sketch after this list)
3. Manipulate the javac / Scala compiler
4. Manipulate core platform libraries
5. Identify issues at higher levels of the application stack
6. Repeat
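As a concrete illustration of the autoboxing case, here is a minimal Java sketch (not from the original deck; class and method names are illustrative): the boxed accumulator allocates Integer objects in a hot loop unless the JIT manages to elide them, which is exactly the kind of issue a bottom-up pass over hardware profiles and assembly can expose.

public class AutoboxExample {
    // Boxed accumulator: each "sum += i" unboxes, adds, and re-boxes,
    // allocating a new Integer (via Integer.valueOf) on most iterations.
    static Integer sumBoxed(int n) {
        Integer sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    // Primitive accumulator: no boxing and no allocation in the loop.
    static long sumPrimitive(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }
}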
Performance Triangle: Latency, Throughput, Memory Footprint
Performance Triangle: Reduce Latency
Performance Triangle: Increase Throughput
Performance Triangle: Smaller Memory Footprint
Performance Metrics
Choosing the Right Metrics
Identify Metrics
- What's important to your users?
- What influences your bottom line?
- What are you willing to trade off?
Define Success
- If it's not broken, don't fix it.
- Perfect is the enemy of done.
Choosing the Right Metrics
We want it all!
- High throughput
- Fast response times
- Small footprint
But …
- There's no free lunch.
Choose your metrics wisely
- Target metrics that impact your customers first
Use Statistics!
- High variability can render some metrics useless
Throughput Metrics
Transactions per Second (TPS)
- # of Transactions / Time
- Aka pages/sec, queries/sec, hits/sec
- Good measure of top-end performance
Average Response Time
- Inverse of TPS
- Time / # of Transactions
- Sometimes a rolling average
CPU Utilization
- Measure of computational efficiency
- Good for capacity planning, not for development regression testing (new features can increase work).
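A minimal sketch of these throughput metrics (not from the deck; the class name and window size are illustrative assumptions): transactions per second plus a rolling average response time over the last N samples.

import java.util.ArrayDeque;
import java.util.Deque;

public class ThroughputMetrics {
    private final Deque<Long> recentResponseTimesMs = new ArrayDeque<>();
    private final int windowSize;
    private long transactions;
    private final long startNanos = System.nanoTime();

    public ThroughputMetrics(int windowSize) { this.windowSize = windowSize; }

    // Record one completed transaction and its response time.
    public synchronized void record(long responseTimeMs) {
        transactions++;
        recentResponseTimesMs.addLast(responseTimeMs);
        if (recentResponseTimesMs.size() > windowSize) {
            recentResponseTimesMs.removeFirst();
        }
    }

    // Transactions per second: # of transactions / elapsed time.
    public synchronized double tps() {
        double elapsedSec = (System.nanoTime() - startNanos) / 1e9;
        return transactions / elapsedSec;
    }

    // Rolling average response time over the last windowSize transactions.
    public synchronized double rollingAverageResponseMs() {
        return recentResponseTimesMs.stream().mapToLong(Long::longValue).average().orElse(0.0);
    }
}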
Latency Metrics
Maximum response time
- Worst case
99% response time
- Drops a few outliers
90% response time
- May drop too many outliers and give a false sense of security
Critical Injection Rate
- Critical jOPs in SPECjbb2013
- Achievable throughput under response time SLA
Not Average Response Time
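A hedged sketch of how these percentile metrics can be computed from raw samples (not from the deck; the nearest-rank method and the sample values are assumptions for illustration):

import java.util.Arrays;

public class LatencyPercentiles {
    // Nearest-rank percentile over a sorted array of response times.
    static long percentile(long[] sortedMs, double pct) {
        int rank = (int) Math.ceil(pct / 100.0 * sortedMs.length);
        return sortedMs[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] samplesMs = {12, 15, 14, 80, 13, 16, 500, 14, 15, 13};
        Arrays.sort(samplesMs);
        System.out.println("max = " + samplesMs[samplesMs.length - 1] + " ms"); // worst case
        System.out.println("p99 = " + percentile(samplesMs, 99.0) + " ms");     // drops a few outliers
        System.out.println("p90 = " + percentile(samplesMs, 90.0) + " ms");     // may drop too many
    }
}

With only ten samples p99 and max coincide; on production-sized sample sets they diverge, which is the point of tracking both.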
Memory Footprint Metrics
Heap size after Full GC (live data size), covered on an upcoming slide
Native process size
- e.g. ps aux, filtered by PID
Static footprint
- Size of the application binary
- Size of the .jar
- Why does it matter?
- Download/deployment speed
- Update/refresh speed
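Where ps is not convenient, the same native footprint can be read by the process itself. A minimal Java sketch (Linux-specific, not from the deck):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class NativeFootprint {
    public static void main(String[] args) throws IOException {
        // /proc/self/status exposes the JVM's own native memory numbers on Linux.
        for (String line : Files.readAllLines(Paths.get("/proc/self/status"))) {
            if (line.startsWith("VmRSS:") || line.startsWith("VmSize:")) {
                System.out.println(line.trim()); // e.g. "VmRSS:   1234567 kB"
            }
        }
    }
}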
JVM Tuning Basics
JVM Tuning Basics
Track size of Old Generation after Full GCs
[GC 435426K->392697K(657920K), 0.1411660 secs]
[Full GC 392697K->390333K(927232K), 0.5547680 secs]
[GC 625853K->592369K(1000960K), 0.1852460 secs]
[GC 831473K->800585K(1068032K), 0.1707610 secs]
[Full GC 800585K->798499K(1456640K), 1.9056030 secs]
Calculating Live Data Size
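The same number can also be read programmatically. A hedged sketch (not from the deck) using the standard java.lang.management API; the old-generation pool name varies by collector ("CMS Old Gen", "PS Old Gen", "G1 Old Gen", "Tenured Gen", ...):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class LiveDataSize {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Old Gen") || name.contains("Tenured")) {
                // Occupancy as of the last collection of this pool: a rough
                // proxy for live data size when taken after a Full GC.
                MemoryUsage afterGc = pool.getCollectionUsage();
                if (afterGc != null) {
                    System.out.printf("%s: %d MB used after last GC%n", name, afterGc.getUsed() >> 20);
                }
            }
        }
    }
}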
JVM Tuning Basics
Track size of Old Generation after Young GCs if no Full GC events occur
2013-09-10T05:39:03.489+0000: [GC[ParNew: 11766264K->18476K(13212096K), 0.0326070 secs] 12330878K->583306K(16357824K), 0.0327090 secs] [Times: user=0.48 sys=0.01, real=0.03 secs]
2013-09-10T05:42:54.666+0000: [GC[ParNew: 11762604K->20088K(13212096K), 0.0270110 secs] 12327434K->585068K(16357824K), 0.0271140 secs] [Times: user=0.39 sys=0.00, real=0.02 secs]
2013-09-10T05:46:41.623+0000: [GC[ParNew: 11764216K->21013K(13212096K), 0.0267490 secs] 12329196K->586133K(16357824K), 0.0268490 secs] [Times: user=0.40 sys=0.00, real=0.03 secs]
Calculating Live Data Size
JVM Tuning Basics
Size of Old Generation
- Good starting point: 2X the size of live data at steady state.
- If the object promotion rate causes frequent CMS cycles, increase the size of the old generation.
- If the live data size is 5GB, the starting point should be ~10GB.
- That is the Old Generation size alone, not the total heap.
- Set -Xms and -Xmx to the same value
- Nobody really needs extra Full GC pauses
Young and Old Generation Sizing
JVM Tuning Basics
Size of Young Generation
- Young gen = Old gen is a good starting point.
- Young generation size should increase with the allocation rate
- Sometimes 2-3x larger than the Old Gen
- Young GC times are dominated by copying live objects to the Survivor spaces, not by the overall size of the Young Generation
- Size it so that most objects die in the Young Generation
- Higher allocation rates -> larger Young Generation
Young and Old Generation Sizing
JVM Tuning Basics
Example Enterprise Application
- Significant application state
- In-memory cache size: 3.5GB
- Overall live data size: 4GB
- High allocation rate of transient data
- Most objects die in the large young generation
- Suggested initial heap sizing:
- -Xms16g -Xmx16g -Xmn8g
Young and Old Generation Sizing
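The rule of thumb above can be written down directly. A minimal sketch (not from the deck; the helper name is an assumption) that turns a measured live data size into the starting-point flags:

public class HeapSizingStartingPoint {
    // Starting point only: old gen ~2x live data, young gen ~ old gen,
    // -Xms = -Xmx = old + young. Tune from here using GC logs.
    static String suggestFlags(long liveDataGb) {
        long oldGenGb = 2 * liveDataGb;
        long youngGenGb = oldGenGb;
        long heapGb = oldGenGb + youngGenGb;
        return String.format("-Xms%dg -Xmx%dg -Xmn%dg", heapGb, heapGb, youngGenGb);
    }

    public static void main(String[] args) {
        // 4 GB of live data, as in the example application above.
        System.out.println(suggestFlags(4)); // prints "-Xms16g -Xmx16g -Xmn8g"
    }
}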
JVM Tuning Basics
Throughput
- -XX:+UseParallelOldGC
Low server response times?
- CMS
- Older technology
- Can be highly tuned, but tuning can be brittle
- -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
- G1
- Current development focus
- Young GC times slower than CMS
- -XX:+UseG1GC
Choosing a Garbage Collector
JVM Tuning Basics
Recommended GC Logging Flags
- -XX:+PrintGCDateStamps
- -XX:+PrintGCDetails
- -XX:+PrintGCTimeStamps
- -Xloggc:/tmp/file
Other Helpful Flags
- -XX:+PrintHeapAtGC
- -XX:+PrintTenuringDistribution
- -XX:+PrintGCApplicationStoppedTime
- -XX:+PrintReferenceGC
GC logging flags
JVM Tuning Basics
2013-09-10T05:39:03.489+0000: [GC[ParNew: 11766264K->18476K(13212096K), 0.0326070 secs] 12330878K->583306K(16357824K), 0.0327090 secs] [Times: user=0.48 sys=0.01, real=0.03 secs]
2013-09-10T05:42:54.666+0000: [GC[ParNew: 11762604K->20088K(13212096K), 0.0270110 secs] 12327434K->585068K(16357824K), 0.0271140 secs] [Times: user=0.39 sys=0.00, real=0.02 secs]
2013-09-10T05:46:41.623+0000: [GC[ParNew: 11764216K->21013K(13212096K), 0.0267490 secs] 12329196K->586133K(16357824K), 0.0268490 secs] [Times: user=0.40 sys=0.00, real=0.03 secs]
((YGen before GC) - (YGen after GC)) / ΔTime
(11764216K - 21013K) / (5:46:41.623 - 5:42:54.666)
11.2 GB / 227 sec = ~50 MB/sec
Calculating Allocation Rate
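A hedged sketch of the same arithmetic in code (not from the deck; the values are copied by hand from the log lines above rather than parsed):

public class AllocationRate {
    // Young-gen occupancy freed by one ParNew collection, divided by the
    // wall-clock time since the previous collection.
    static double allocationRateMBPerSec(long youngBeforeKb, long youngAfterKb, double deltaSeconds) {
        return (youngBeforeKb - youngAfterKb) / 1024.0 / deltaSeconds;
    }

    public static void main(String[] args) {
        // 05:46:41.623 minus 05:42:54.666 is roughly 227 seconds.
        double deltaSeconds = (46 * 60 + 41.623) - (42 * 60 + 54.666);
        System.out.printf("~%.1f MB/sec%n",
                allocationRateMBPerSec(11764216L, 21013L, deltaSeconds)); // ~50 MB/sec
    }
}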
JVM Tuning Basics
2013-09-10T05:39:03.489+0000: [GC[ParNew: 11766264K->18476K(13212096K), 0.0326070 secs] 12330878K->583306K(16357824K), 0.0327090 secs] [Times: user=0.48 sys=0.01, real=0.03 secs]
2013-09-10T05:42:54.666+0000: [GC[ParNew: 11762604K->20088K(13212096K), 0.0270110 secs] 12327434K->585068K(16357824K), 0.0271140 secs] [Times: user=0.39 sys=0.00, real=0.02 secs]
2013-09-10T05:46:41.623+0000: [GC[ParNew: 11764216K->21013K(13212096K), 0.0267490 secs] 12329196K->586133K(16357824K), 0.0268490 secs] [Times: user=0.40 sys=0.00, real=0.03 secs]
ΔOld Generation Size / ΔTime
(586133K - 583306K) / (5:46:41.623+0000 - 5:39:03.489+0000)
2827 KB / 458 sec = ~6.2 KB/sec
Calculating Promotion Rate
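And the corresponding promotion-rate calculation as a sketch (not from the deck); heap occupancy after each young GC is used as a proxy for old-generation growth, exactly as in the numbers above:

public class PromotionRate {
    // Growth of post-GC heap occupancy between two ParNew events,
    // divided by the wall-clock time between them.
    static double promotionRateKBPerSec(long heapAfterEarlierGcKb, long heapAfterLaterGcKb, double deltaSeconds) {
        return (heapAfterLaterGcKb - heapAfterEarlierGcKb) / deltaSeconds;
    }

    public static void main(String[] args) {
        // 05:46:41.623 minus 05:39:03.489 is roughly 458 seconds.
        double deltaSeconds = (46 * 60 + 41.623) - (39 * 60 + 3.489);
        System.out.printf("~%.1f KB/sec%n",
                promotionRateKBPerSec(583306L, 586133L, deltaSeconds)); // ~6.2 KB/sec
    }
}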
Tuning for Latency
Tuning for Latency
Enable CMS
- -XX:+UseConcMarkSweepGC
Good to have
- -XX:+CMSScavengeBeforeRemark
- -XX:+ParallelRefProcEnabled
- -XX:CMSInitiatingOccupancyFraction=70
Start with Basic Tuning Guidelines
- -XX:PermSize=256m -XX:MaxPermSize=256m
- Old Gen Size is 2X Live Data Size
- Young Gen Size = Old Gen Size
Using CMS
Tuning for Latency
General rules of thumb
- Increase young gen size to handle higher allocation rates.
- Increase young gen size if the promotion rate is high.
- The application may suffer from premature promotion, i.e. promotion caused by too-frequent young GCs.
- A larger young gen decreases GC frequency and gives objects more time to die.
- Increase Old Gen size if the promotion rate is still high, to avoid allocation failures and concurrent mode failures.
Using CMS
Tuning for Latency
CMS Tuned for Latency
-Xmx18g -Xms18g -XX:PermSize=256m
-XX:MaxPermSize=256M -XX:+CMSScavengeBeforeRemark
-XX:-OmitStackTraceInFastThrow -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
-XX:SurvivorRatio=6 -XX:NewSize=8g
-XX:MaxNewSize=8g -verbose:gc
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
Note: Increased Young Gen Size, Survivor Ratio Tuning
Using CMS
Tuning for Latency
Enable G1
-XX:+UseG1GC -XX:MaxGCPauseMillis=100
- Start with just overall heap size and target pause time.
- Increase Young Generation Size for High Allocation
- Tune to keep remembered set processing low
Using G1GC
Tuning for Latency
G1 Tuning to Consider
-XX:InitiatingHeapOccupancyPercent=90
-XX:G1MixedGCLiveThresholdPercent: the occupancy threshold of live objects in an old region for it to be included in a mixed collection.
-XX:G1HeapWastePercent: the threshold of garbage that you can tolerate in the heap.
-XX:G1MixedGCCountTarget: the target number of mixed garbage collections within which the regions with at most G1MixedGCLiveThresholdPercent live data should be collected.
-XX:G1OldCSetRegionThresholdPercent: a limit on the maximum number of old regions that can be collected during a mixed collection.
Reference: Monica Beckwith's InfoQ article, "G1: One Garbage Collector To Rule Them All":
http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All
Using G1GC
Tuning for Latency
G1GC Tuned for Latency
- -XX:+TieredCompilation -XX:InitialCodeCacheSize=256m
-XX:ReservedCodeCacheSize=256m -Xmx18g -Xms18g
-XX:PermSize=256m -XX:MaxPermSize=256M -XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=90
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
Note: MaxGCPauseMillis is the biggest tuning knob.
Don’t start with CMS Tuning!
Using G1GC
Tuning for Throughput
Enable ParallelOldGC
- -XX:+UseParallelOldGC
Old Gen needs to be 2-4X live data size (LDS)
Young generation should be ¾ of the heap
Often used when tuning for throughput
- -XX:+AggressiveOpts
- -XX:+TieredCompilation
Disable adaptive sizing and tune survivor spaces directly:
- -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=7
-XX:TargetSurvivorRatio=90
Using ParallelOldGC
Tuning for Throughput
Tuning for Throughput
ParallelOldGC tuned for Throughput:
-showversion -server -XX:-UseBiasedLocking
-XX:LargePageSizeInBytes=2m -XX:+AlwaysPreTouch
-XX:+UseLargePages -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xms29g -Xmx29g -Xmn27g -XX:+UseParallelOldGC
-XX:ParallelGCThreads=24 -XX:SurvivorRatio=16
-XX:TargetSurvivorRatio=90 -XX:-UseAdaptiveSizePolicy
-XX:+AggressiveOpts -XX:InitialCodeCacheSize=160m
-XX:ReservedCodeCacheSize=160m -XX:+TieredCompilation
Using ParallelOldGC
Enable G1
- -XX:+UseG1GC
Old Gen needs to be 2X live data size (LDS)
Young generation should be ¾ of the heap
Often used when tuning for throughput
- -XX:+AggressiveOpts
- -XX:+TieredCompilation
Using G1GC
Tuning for Throughput
Tuning for Throughput
G1GC tuned for throughput:
-showversion -server -XX:-UseBiasedLocking
-XX:LargePageSizeInBytes=2m -XX:+AlwaysPreTouch
-XX:+UseLargePages -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xms28g -Xmx28g -Xmn21g -XX:+UseG1GC
-XX:+AggressiveOpts
-XX:InitialCodeCacheSize=160m -XX:ReservedCodeCacheSize=160m
-XX:+TieredCompilation
Using G1GC
Enable CMS, and tune for throughput
- -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
- Configure the heap to avoid promotion
- Application design should separate stateful and stateless components to allow targeted tuning.
Young generation should be ¾ of the heap
- Young generation should be sized to ensure nearly all objects die young.
- Very large heaps, very large old generation
- Use memory to avoid the need for Full GC.
Tune survivor spaces manually, etc.
- -XX:SurvivorRatio=7 -XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
Using CMS
Tuning for Throughput
Tuning for Throughput
CMS Tuned for Throughput
-Xmx18g -Xms18g -XX:PermSize=256m
-XX:MaxPermSize=256M -XX:+CMSScavengeBeforeRemark
-XX:-OmitStackTraceInFastThrow -XX:+AggressiveOpts
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=90
-XX:+UseCMSInitiatingOccupancyOnly
-XX:SurvivorRatio=6 -XX:NewSize=16g
-XX:MaxNewSize=16g -verbose:gc
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:InitialCodeCacheSize=160m -XX:ReservedCodeCacheSize=160m
-XX:+TieredCompilation
Using CMS
Tuning for Footprint
Enable ParallelOldGC
- -XX:+UseParallelOldGC
Old Gen needs to be 2X live data size (LDS)
Young generation should start at 1/2 the Old Generation size.
Strategy is to reduce the young and old generation sizes independently until a maximum acceptable end-user response time is met.
Definitely not low-pause: trading higher response times for lower footprint and lower throughput.
Using ParallelOldGC
Tuning for Footprint
Tuning for Footprint
ParallelOldGC tuned for Footprint
-showversion -server -XX:LargePageSizeInBytes=2m
-XX:+UseLargePages -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xms8g -Xmx8g -Xmn4g -XX:+UseParallelOldGC
-XX:-UseAdaptiveSizePolicy -XX:+AggressiveOpts
-XX:PermSize=256m -XX:MaxPermSize=256M
Using ParallelOldGC
Enable G1
- -XX:+UseG1GC
Heap should be 3x live data size (LDS)
- Do not tune the size of the young generation
- Allow G1 to adapt the size
- Tune only after observing the minimum size G1 settles on
Increase the pause target to decrease GC overhead
- -XX:MaxGCPauseMillis=400
Strategy is to reduce the young and old generation sizes independently until a maximum acceptable end-user response time is met.
Using G1GC
Tuning for Footprint
Tuning for Footprint
G1 Tuned for Footprint
-showversion -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-Xms12g -Xmx12g -XX:+UseG1GC -XX:InitialCodeCacheSize=160m
-XX:ReservedCodeCacheSize=160m
Using G1GC
Enable CMS, and tune for footprint
- -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
Old Gen needs to be 2X live data size (LDS)
Young generation should start at 1/2 the Old Generation size.
- Young generation should be sized so "enough" objects die young, reducing the pressure on CMS
- Promotion rate needs to be low enough that the CMS concurrent threads don't lose the race (concurrent mode failures)
Strategy is to reduce the young and old generation sizes independently until a maximum acceptable end-user response time is met.
- Young Generation first, then Old Gen.
Using CMS
Tuning for Footprint
Tuning for Footprint
Example of a highly tuned CMS deploy for footprint:
-Xmx12g -Xms12g -Xmn4g -XX:PermSize=256m
-XX:MaxPermSize=256M -XX:+CMSScavengeBeforeRemark
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=60
-XX:SurvivorRatio=6 -verbose:gc
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
Note: Increased Young Gen Size, Survivor Ratio Tuning
Using CMS
Common Performance Issues
Common Performance Issues
Size of Permanent Generation
- Perm. Gen. only collects and resizes at Full GC.
Heap before GC invocations=40019 (full 36522):
 par new generation total 15354176K, used 14K [0x00000003b9c00000, 0x0000000779c00000, 0x0000000779c00000)
  eden space 14979712K, 0% used [0x00000003b9c00000, 0x00000003b9c039a8, 0x000000074c0a0000)
  from space 374464K, 0% used [0x000000074c0a0000, 0x000000074c0a0000, 0x0000000762e50000)
  to space 374464K, 0% used [0x0000000762e50000, 0x0000000762e50000, 0x0000000779c00000)
 concurrent mark-sweep generation total 2097152K, used 588343K [0x0000000779c00000, 0x00000007f9c00000, 0x00000007f9c00000)
 concurrent-mark-sweep perm gen total 102400K, used 102399K [0x00000007f9c00000, 0x0000000800000000, 0x0000000800000000)
2013-09-05T17:21:39.530+0000: [Full GC[CMS: 588343K->588343K(2097152K), 1.6166150 secs] 588357K->588343K(17451328K), [CMS Perm : 102399K->102399K(102400K)], 1.6167040 secs] [Times: user=1.57 sys=0.00, real=1.61 secs]
Heap after GC invocations=40020 (full 36523):
 par new generation total 15354176K, used 0K [0x00000003b9c00000, 0x0000000779c00000, 0x0000000779c00000)
  eden space 14979712K, 0% used [0x00000003b9c00000, 0x00000003b9c00000, 0x000000074c0a0000)
  from space 374464K, 0% used [0x000000074c0a0000, 0x000000074c0a0000, 0x0000000762e50000)
  to space 374464K, 0% used [0x0000000762e50000, 0x0000000762e50000, 0x0000000779c00000)
 concurrent mark-sweep generation total 2097152K, used 588343K [0x0000000779c00000, 0x00000007f9c00000, 0x00000007f9c00000)
 concurrent-mark-sweep perm gen total 102400K, used 102399K [0x00000007f9c00000, 0x0000000800000000, 0x0000000800000000)
}
Recommendation: -XX:PermSize=256m -XX:MaxPermSize=256m
In Enterprise Software
Common Performance Issues
Size of Code Cache
- Default size is 64 MB, or 96 MB when running TieredCompilation
- Enterprise applications have lots of code
Aggressively Tune to Avoid the Issue
- Tuning without TieredCompilation:
- -XX:InitialCodeCacheSize=128m
-XX:ReservedCodeCacheSize=128m
- Tuning with TieredCompilation:
- -XX:InitialCodeCacheSize=256m
-XX:ReservedCodeCacheSize=256m
In Enterprise Software
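Code cache pressure can also be watched from inside the process. A hedged sketch (not from the deck); on the HotSpot builds discussed here the pool is named "Code Cache", while newer JDKs split it into several "CodeHeap" pools:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class CodeCacheCheck {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().startsWith("Code")) {
                MemoryUsage usage = pool.getUsage();
                // When the code cache fills, the JIT stops compiling and the
                // application silently falls back to interpreted code.
                System.out.printf("%s: %d of %d MB used%n",
                        pool.getName(), usage.getUsed() >> 20, usage.getMax() >> 20);
            }
        }
    }
}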
OpenJDK Development at Twitter
What’s up with Twitter and JDK Development?
Twitter runs Java + Scala on the HotSpot JVM
- Most Highly Optimized Managed Runtime
- Open source :-)
- Massive performance gains moving technologies
Own and Optimize our Platform
- Build out diagnostic tools
- Build, test, and deploy OpenJDK
- Optimize HotSpot Runtime Compilers for Scala, etc.
- Tailored GC for Twitter's needs
- Extremely low latency requirements (< 10 ms)
@TwitterJDK
What’s up with Twitter and JDK Development?
Contribute Back to the Community
- Working closely with Oracle Java Development
- Collaborating with Other OpenJDK contributors
- Posting tools to Github and OpenJDK repositories
Interesting, isn't it?
- We’re just ramping up now.
- Follow us soon: @TwitterJDK (new idea)
- Follow me at: @dagskeenan
- #jointheflock
@TwitterJDK
#ThankYou