1. 7 Deadly Sins of Enterprise Java Programming and Deployment in the Multicore Era
Anil Kumar: anil.kumar@intel.com
Kumar Shiv: kumar.shiv@intel.com
Mahesh Somani: msomani@ebay.com
JavaOne 2010
*Other names and brands may be claimed as the property of others.
2. Agenda
• SALIGIA: (First letter of the seven deadly sins in Latin)
– Superbia, Avaritia, Luxuria, Invidia, Gula, Ira, Acedia
Latin      Meaning        Implication for a geek
Superbia   pride          my code is a piece of perfection
Luxuria    extravagance   beefing up unnecessary areas
Gula       gluttony       too many features and object allocations
Acedia     neglect        neglecting scaling tests and corner cases
Avaritia   greed          too much cost cutting on critical resources
Invidia    envy           watching the competition gain market share
Ira        wrath          what follows from the Almighty! (management)
3. Agenda
• Performance progression in Multi-core era
• Quick details on latest s/w and h/w platforms
• Discussion of seven common pitfalls
• Summary
• References
4. Multi-core era: Progression of performance
• Phenomenal performance gain from the combination of h/w and s/w
[Chart: SPECjbb2005 throughput in K bops on 2-socket (2S) platforms, 2005 to 2010. Data labels in ascending order: 36, 51, 64, 138, 252, 368, 557, 604, 632, 928, 1,011.]
• H/W capabilities increased in line with Moore's Law
– ~10x-15x gain from h/w alone
• S/W changes unlock the full potential of the h/w capabilities
– Roughly a doubling of performance
5. Multi-core era: Rapid increase in # of cores
Year  Processor code name  Microarchitecture  Xeon series     # of cores  Hyper-Threading  LLC cache
2005  Irwindale            NetBurst           Xeon DP         1           2                2MB L2
2005  Paxville-DP          NetBurst           Dual-Core Xeon  2           2                4MB L2
2006  Dempsey-DP           NetBurst           5000            2           2                4MB L2
2006  Woodcrest            Core               5100            2           None             4MB L2
2007  Wolfdale-DP          Core               5200            2           None             6MB L2
2006  Clovertown           Core               5300            4           None             8MB L2
2007  Harpertown           Penryn             5400            4           None             12MB L2
2009  Nehalem-EP           Nehalem            5500            4           2                8MB L3
2010  Westmere-EP          Nehalem            5600            6           2                12MB L3
• In addition to # of cores, many other advanced features deliver an excellent user experience by default
6. Multi-core era: Increase in # of cores to continue
[Diagram: tick-tock roadmap across the 65nm, 45nm, 32nm and 22nm process nodes, covering the Intel® Core™, Nehalem and Sandy Bridge microarchitectures.]
Intel® Xeon® 5600: Intel's first 32nm server processor, with 6 cores and 12 threads
• Many more advanced features to enhance user experience
7. Multi-core era: More Platform level features
[Timeline: Q1'10 to Q3'10 (JAN to OCT), launch of the Xeon 5600 & 7500 series; 2010/11 platform]
Xeon 7500 Series (up to 8 cores per socket)
• 4 socket & greater
• Nehalem-EX (up to 8C + HT)
• New mission-critical RAS
• New levels of scalability
• Glueless 8-socket servers
• 16S – 32S scalable node controller
• NUMA
• DDR3, Intel® QPI, PCI Express* 2.0 Technology
Xeon 5600 Processor (up to 6 cores per socket)
• 2 socket
• Westmere (up to 6C + HT)
• Intel® AES-NI
• Intel TXT
• NUMA
• DDR3, Intel® QPI, PCI Express* 2.0 Technology
8. Intel’s role in improving Java Performance
Intel Java Team:
• Out-of-box optimal settings of the H/W platform
• Working with JVM vendors (Oracle and IBM)
• Influencing future CPU design
• Working with H/W and S/W profiling tools
• Working with ISVs (application level stack)
Application characterization helps in better deployment decisions as well as optimal utilization of a platform
• Many more advanced features to enhance user experience
9. S/W impact: Intel relationship with ISVs
Java Applications (widely used apps)
JVMs (3 major JVMs)
OS (all major Operating Systems)
• Very active role with OS vendors
• Engaged with all three major JVM vendors
– Sun HotSpot (Now Oracle HotSpot)
– Oracle JRockit
– IBM J9
• Interaction with widely used Java applications
• Optimizing complete s/w stack for latest processor while ensuring
excellent performance on existing s/w stack
Close active relationship with ISV partners like: eBay
10. Application environments
• Batch processing: stand-alone and/or cluster of systems
• Computation intensive, etc.
• 3-tier application servers: Client -> Java app server -> Backend (DB etc.)
• High frequency trading/financial latency sensitive apps
• Java + native mix
• Virtualized environment
• Cloud environment : very limited
11. Factors impacting deployment configuration
Application deployment: application configuration by itself can be very complex
+ JVM
+ OS
+ Other: power management, Turbo, prefetching, BIOS settings
+ Processor: SKU (GHz, caches), # of cores, HT (Hyper-Threading)
+ Memory: DIMM population (capacity, latency), DIMM type (speed), bandwidth
+ Disk I/O
+ Network
Simple logical mapping (real interaction is much more complex and intertwined)
12. Performance methodology
[Diagram: application experts view the s/w side through application-level monitoring; the Intel Java performance team views the h/w side through H/W performance monitoring counters.]
• Many performance and scaling issues are reflected in h/w resource utilization, which is tracked by performance monitoring counters
– There are >400 performance monitoring counters
13. Performance monitoring counters analysis
• Severe cases are easy to spot
• Moderate cases, or extracting the last 30% of the performance:
– It is more art than science!
– 4-5 counters can be collected at a time, at a minimum granularity of 100 ms
– Absolute values are not very useful
– Solution involves relative values and correlation of h/w resources
[Chart: latency vs. load, with the problem region marked.]
• Histogram patterns to identify phases and anomalies
[Chart: L2 cache load misses (L2_LD_MIss) over time, y-axis 0 to 4,000,000, x-axis samples 1 to 97, with GC phases marked.]
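The point about relative values and correlation can be illustrated with a minimal sketch. It assumes the counter samples have already been exported into arrays; the class name `CounterCorrelation` and all sample values are hypothetical and not from the talk. The sketch normalizes L2 load misses by retired instructions and computes a Pearson correlation against a latency series.

```java
import java.util.Arrays;

final class CounterCorrelation {

    // Turn absolute counts into a relative value, e.g. misses per instruction.
    static double[] ratio(double[] counter, double[] reference) {
        double[] out = new double[counter.length];
        for (int i = 0; i < counter.length; i++) {
            out[i] = counter[i] / reference[i];
        }
        return out;
    }

    // Pearson correlation coefficient between two equally sized series.
    static double pearson(double[] x, double[] y) {
        double meanX = Arrays.stream(x).average().orElse(0.0);
        double meanY = Arrays.stream(y).average().orElse(0.0);
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical samples, one value per collection interval.
        double[] l2LoadMisses = {1.0e6, 1.2e6, 3.5e6, 1.1e6};
        double[] instRetired  = {5.0e8, 5.1e8, 4.0e8, 5.0e8};
        double[] latencyMs    = {10, 11, 30, 10};

        double[] missesPerInst = ratio(l2LoadMisses, instRetired);
        System.out.printf("correlation(misses/inst, latency) = %.3f%n",
                pearson(missesPerInst, latencyMs));
    }
}
```

A coefficient close to 1 only suggests that the counter moves together with the symptom; it is a starting point for deeper analysis, not proof of cause.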
14. Locating source of problem
• Out-of-box collection from performance analyzers is only useful for simple applications
– Inlining makes it very hard to track the source of a problem
• What are the next steps then?
– Disabling inlining when collecting profiles often works
– Analysis across JITed method code annotated with h/w counters helps to identify the issue when the method hotness profile is FLAT
h/w counters (normalized per sec)
Method  Instruction       CLK  Inst. Retd.  Cache misses
A       mov eax, [ebx]    110  52           55
        mov [var], ebx    82   43           5
B       str DB hello, 0   15   56           12
        push <mem>        85   25           10
C       add <reg>,<reg>   200  90           25
        add <reg>,<mem>   24   32           8
• Close cooperation between Intel Java team and
application performance team is crucial
15. Seven common pitfalls
1. Multi-threading, serialization, locks
2. Lack of basic characterization
3. JVM selection and JVM parameters
4. Heap management and GC
5. Estimate and peak performance
6. Monitoring (GC log etc.)
7. S/W + H/W configuration
– Including Network, Disk I/O, OS support
(Side labels: app architects and programmers for the first items; issues during testing and deployment for the later items.)
Customer IP and data security being paramount,
only generic examples are being shared
16. 1: Multithreading, serialization and locks
• Often the first attempt at multi-threading is riddled with too many locks
– As programmers, it is better to be safe than sorry
– But once the application is running, H/W-level and JVM-level profiling can identify potential locks to revisit
– False sharing is another cause of poor scaling
[Diagram: objects obj1 to obj4 created by new, each used by a different thread (Thread1 to Thread4) on a different CPU. Scaling issue if the objects are manipulated very often and the threads can run across multiple chips.]
• App is multithreaded, but serialization occurs at the JVM, class library or JNI component level
• Most issues are exposed when pushing system utilization beyond 60% or some throughput level
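A minimal sketch of the false sharing effect described above. It is not from the talk: the class name `FalseSharingDemo`, the 64-byte cache line and the 16-long padding stride are assumptions. Each thread increments only its own counter; the two runs differ only in how far apart the counters sit in memory.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class FalseSharingDemo {

    // Assumption: 64-byte cache lines (8 longs). A 16-long stride keeps
    // each counter on its own line with some extra margin.
    static final int PADDED_STRIDE = 16;

    // Starts `threads` workers; each increments its own slot `iterations`
    // times. Slots are `stride` longs apart. Returns elapsed nanoseconds.
    static long run(int threads, int stride, long iterations) {
        AtomicLongArray slots = new AtomicLongArray(threads * stride);
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            final int index = t * stride;
            workers[t] = new Thread(() -> {
                for (long i = 0; i < iterations; i++) {
                    slots.incrementAndGet(index);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        long elapsed = System.nanoTime() - start;
        // No slot is shared, so every counter must equal `iterations`.
        for (int t = 0; t < threads; t++) {
            if (slots.get(t * stride) != iterations) {
                throw new IllegalStateException("unexpected counter value");
            }
        }
        return elapsed;
    }

    public static void main(String[] args) {
        long n = 20_000_000L;
        System.out.println("adjacent slots: " + run(4, 1, n) / 1_000_000 + " ms");
        System.out.println("padded slots:   " + run(4, PADDED_STRIDE, n) / 1_000_000 + " ms");
    }
}
```

On a multi-socket system the adjacent layout is usually slower even though no data is logically shared, but the size of the gap depends on the hardware, the JVM and thread placement, so timings should be treated as indicative only.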
17. 2: Lack of basic characterization
• Baseline measurement for light load conditions
– Throughput and response time are very critical as feedback to the tester as well as to identify any anomaly
[Chart: response time vs. CPU utilization % (50% and 100% marked), with an anomaly highlighted.]
• Basic profile of application:
– Some surprises could be detected early
[Charts: two plots of throughput vs. # of chips (0 to 5).]
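One way to turn throughput-vs-chips measurements into an early warning is to compute scaling efficiency relative to the single-chip baseline. The sketch below is an illustration under assumptions: the class `ScalingCheck`, the 0.8 threshold and the sample numbers are hypothetical.

```java
final class ScalingCheck {

    // efficiency[i] = throughput with (i + 1) chips divided by
    // (i + 1) times the single-chip throughput; 1.0 means linear scaling.
    static double[] efficiency(double[] throughputByChips) {
        double base = throughputByChips[0];
        double[] eff = new double[throughputByChips.length];
        for (int i = 0; i < eff.length; i++) {
            eff[i] = throughputByChips[i] / (base * (i + 1));
        }
        return eff;
    }

    // Returns the first chip count whose efficiency falls below the
    // threshold, or -1 if scaling stays above it everywhere.
    static int firstAnomaly(double[] throughputByChips, double minEfficiency) {
        double[] eff = efficiency(throughputByChips);
        for (int i = 0; i < eff.length; i++) {
            if (eff[i] < minEfficiency) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical throughput measured with 1, 2, 3 and 4 chips.
        double[] measured = {20, 39, 57, 60};
        System.out.println("first chip count below 80% efficiency: "
                + firstAnomaly(measured, 0.8));
    }
}
```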
18. 3: JVM selection and JVM parameters
• What if end-user environment is unknown?
– Some information could still be given to the user
• JVM selection:
– Often the latest JVMs provide the best performance for the latest h/w
– Throughput computing: ~10% impact is very common
Latest versions of Oracle HotSpot JVMs for Xeon 5500/5600/7500 series
S314665: A Journey to the Center of the Java Universe, Wed 1PM, Parc 55/Embarcadero
– Response time sensitive apps:
Standard JVMs vs. real-time JVMs
• JVM parameters:
– Up to ~50% benefit (some possibility of negative impact in niche cases)
Locks, strings, heap/GC are common examples that help most applications
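When the end-user environment is unknown, an application can at least record which JVM and parameters it is actually running with. The sketch below uses the standard `java.lang.management` API; the class name `JvmInfo` and the output format are illustrative choices, not something the talk prescribes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import java.util.List;

final class JvmInfo {

    // Builds a one-line description of the running JVM and its startup flags.
    static String describe() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        List<String> jvmArgs = runtime.getInputArguments();
        return runtime.getVmVendor() + " " + runtime.getVmName()
                + " " + runtime.getVmVersion()
                + " | cpus=" + Runtime.getRuntime().availableProcessors()
                + " | args=" + jvmArgs;
    }

    public static void main(String[] args) {
        // Logging this once at startup makes later comparisons between
        // deployments (JVM version, flags, CPU count) much easier.
        System.out.println(describe());
    }
}
```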
19. 4: Heap management and GC
• Desired goal for heap:
– Avoid memory swapping while using a heap large enough to reduce GC frequency
Total RAM > Total (Java heap + non-heap memory)
Old space > Total (long-lived objects) to avoid old space GC
– What if # of instances launched is unknown?
• GC choices:
– Throughput computing:
– Response time sensitive apps:
[Chart: throughput vs. heap size.]
– When deploying multiple instances, # of GC threads impacts response time
– 64-bit JVM: beware of compressed pointers/references
– Sudden jump at 4GB or 32GB heap boundaries
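The rule "Total RAM > Total (Java heap + non-heap memory)" can be checked with simple arithmetic before deployment. In this sketch the helper `fitsInRam`, the 24 GB of RAM and the instance count are hypothetical; the heap and non-heap figures are read from the running JVM through standard APIs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

final class HeapBudget {

    // True when all instances together need less memory than is installed:
    // instances * (heap + non-heap) < total RAM.
    static boolean fitsInRam(long totalRamBytes, int instances,
                             long heapBytes, long nonHeapBytes) {
        long needed = instances * (heapBytes + nonHeapBytes);
        return needed < totalRamBytes;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        long maxHeap = Runtime.getRuntime().maxMemory();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        long nonHeapUsed = memory.getNonHeapMemoryUsage().getUsed();

        // Assumption for illustration: a 24 GB machine running 4 instances.
        long totalRam = 24 * gb;
        int instances = 4;

        System.out.println("max heap (MB): " + (maxHeap >> 20));
        System.out.println("non-heap used (MB): " + (nonHeapUsed >> 20));
        System.out.println("fits in RAM: "
                + fitsInRam(totalRam, instances, maxHeap, nonHeapUsed));
    }
}
```

The check ignores memory used by the OS and other processes, so a real budget should leave additional headroom.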
20. 5: Estimate and peak performance
• Model and anticipate demand
• Pay attention to demand spikes at specific times of day
• Stress test in the target environment
– Don’t assume linear performance
• Software and hardware configuration lead to non-linear behavior
– Hyper-Threading
– Resource caps
[Chart: throughput vs. CPU utilization % (50% and 100% marked); the HT gain is application dependent.]
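To see why linear extrapolation is risky, compare it with a deliberately simple model. The model is an assumption made for this illustration, not a result from the talk: throughput grows linearly up to 50% CPU utilization (one thread per core), and the second Hyper-Threading thread then adds only a fraction `htGain` on top. The class `PeakEstimate` and the numbers are hypothetical.

```java
final class PeakEstimate {

    // Naive estimate: assume throughput is proportional to CPU utilization.
    static double linear(double throughput, double cpuUtilization) {
        return throughput / cpuUtilization;
    }

    // Assumed model: everything above 50% utilization comes from the
    // second HT thread per core, which adds `htGain` (e.g. 0.25 = +25%).
    static double withHyperThreading(double throughputAtHalf, double htGain) {
        return throughputAtHalf * (1.0 + htGain);
    }

    public static void main(String[] args) {
        double measuredAtHalf = 1000.0; // hypothetical ops/sec at 50% CPU
        System.out.println("linear estimate at 100%: "
                + linear(measuredAtHalf, 0.5));
        System.out.println("HT-aware estimate at 100%: "
                + withHyperThreading(measuredAtHalf, 0.25));
    }
}
```

Because the HT gain is application dependent, the value of `htGain` has to come from a stress test in the target environment rather than from a rule of thumb.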
21. 6: Monitoring
• Low overhead, always-on
• Helps with root cause analysis
• Hardware, OS, JVM, and Application level
monitoring
• Capture and log the important metrics periodically
– CPU, Processes, GC
– Logical resource caps like thread pools and connection
pools
– Errors, external resource utilization (DB, services)
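A minimal sketch of low-overhead, always-on monitoring using only standard JDK management beans. The class name `LightMonitor`, the one-second period and the log format are illustrative assumptions; a production setup would write to a log file and add application-level metrics such as pool usage and error counts.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class LightMonitor {

    // Collects a few cheap metrics into a single log line.
    static String snapshot() {
        long gcCount = 0;
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcMillis += Math.max(0, gc.getCollectionTime());
        }
        Runtime rt = Runtime.getRuntime();
        long heapUsedMb = (rt.totalMemory() - rt.freeMemory()) >> 20;
        // Negative when the platform does not provide a load average.
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        return String.format("load=%.2f threads=%d heapUsedMB=%d gcCount=%d gcTimeMs=%d",
                load, threads, heapUsedMb, gcCount, gcMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Log one line per second; the period is an arbitrary choice.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println(snapshot()), 0, 1, TimeUnit.SECONDS);
        Thread.sleep(3500);
        scheduler.shutdownNow();
    }
}
```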
22. 6: Monitoring and Profiling: case studies
• Insufficient heap size
– Default heap size is very inconsistent across JVMs/OSes
• Memory swapping from too large a heap
– JVM starting first, non-Java memory/shared memory space
• Inconsistent default nursery/old space size
• Thread pool size auto-tuning for various deployments
• No monitoring, detection and notification to user
• OS level:
– Too many context switches, interrupts, exceptions
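For the thread pool case, one common approach is to derive the pool size from the number of logical CPUs visible to the JVM instead of hard-coding it. The sketch below is an illustration only: the helper `poolSize` and the threads-per-CPU ratio are hypothetical, and the right ratio depends on how much the work blocks on I/O.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class PoolSizing {

    // Scales the pool with the logical CPU count; never returns less than 1.
    static int poolSize(int logicalCpus, double threadsPerCpu) {
        return Math.max(1, (int) Math.round(logicalCpus * threadsPerCpu));
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // Assumption: 1.0 thread per logical CPU as a starting point for
        // CPU-bound work; tune it from measurements on the target system.
        int size = poolSize(cpus, 1.0);
        ExecutorService pool = Executors.newFixedThreadPool(size);
        System.out.println("logical cpus=" + cpus + " pool size=" + size);
        pool.shutdown();
    }
}
```

Logging the chosen size at startup also addresses the "no notification to user" case listed above.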
23. 7: S/W + H/W configuration
• Inconsistent user experience
• Degradation from s/w changes and/or h/w upgrades:
– H/W features:
Turbo, CPU SKU, NUMA, memory population, # of cores increased but GHz decreased, power management
– S/W features
– Deployment configurations: not our area of expertise
– Disk and network I/O: did not keep up with the increased processing power
24. Summary
1. Architect the design to scale
2. Control the Java + JNI environment
3. Heap and GC type
4. JVM and parameter selection
5. Estimate and peak performance
6. Light weight monitoring
7. H/W and S/W configuration
Thank you !
Anil Kumar: anil.kumar@intel.com
Kumar Shiv: kumar.shiv@intel.com
Mahesh Somani: msomani@ebay.com
http://software.intel.com/sites/oss/pdfs/322727-001US_Java_Perf_Xeon_wp.pdf
25. Backup
26. EMON and VTune: H/W counters profiling
• EMON (Intel internal tool)
– >500 h/w counters can be profiled from a 30-minute run
– Analysis helps in understanding:
– How the application is stressing h/w resources
– Where scaling issues may occur (prediction/estimation)
– Deployment strategy for similar scenarios
• Intel VTune Performance Analyzer:
– H/W counters causing a bottleneck can be profiled using Intel VTune Performance Analyzer to identify the methods
– Oh yes, after JITing and optimizations, method names and asm code match perfectly (just kidding)
– Requires in-depth knowledge and some tricks to map asm code to Java source code (disable inlining, if possible)
– http://software.intel.com/en-us/intel-vtune/
27. Non-Uniform Memory Access (NUMA)
[Diagram: two Nehalem-EP processors and a Tylersburg-EP chipset.]
Intel® Core™ microarchitecture (Nehalem-EP)
Intel® Next Generation Server Processor Technology (Tylersburg-EP)
28. Scaling over older generation
• Most Java applications should get a significant boost
– 50% or more gain for SPECjbb2005, SPECjvm2008 and SPECjAppServer2004 for Nehalem-EP over Core 2
• For some niche apps, Xeon 5400 > Xeon 5500
– When the app fits into the (2x6MB L2) of the Xeon 5400 series and
– Does not fit into the (4x256k L2 + 8MB L3) of the Xeon 5500 series
[Diagram: Xeon 5400 series (Core 2 based): Core 1 to Core 4, each with its own L1, and two 6MB L2 caches. Xeon 5500 series (Nehalem-EP based): Core 1 to Core 4, each with HT0/HT1, its own L1 and a 256k L2, plus an 8MB L3.]
30. Core scaling: Performance evaluation within a socket
Xeon 5600 series (Westmere-EP): Core 1 to Core 6, each with two logical threads (HT:0, HT:1), 12M shared last level cache
• Compare without HT threads
        Core 1    Core 2    Core 3    Core 4    Core 5    Core 6
        HT0 HT1   HT0 HT1   HT0 HT1   HT0 HT1   HT0 HT1   HT0 HT1
run 1    X
run 2    X         X
run 3    X         X         X
run 4    X         X         X         X
run 5    X         X         X         X         X
run 6    X         X         X         X         X         X
• Compare with HT threads
        Core 1    Core 2    Core 3    Core 4    Core 5    Core 6
        HT0 HT1   HT0 HT1   HT0 HT1   HT0 HT1   HT0 HT1   HT0 HT1
run 1    X   X
run 2    X   X     X   X
run 3    X   X     X   X     X   X
run 4    X   X     X   X     X   X     X   X
run 5    X   X     X   X     X   X     X   X     X   X
run 6    X   X     X   X     X   X     X   X     X   X     X   X
X : Logical thread will be used
31. Socket scaling: Overall performance evaluation
• Core scaling ensures performance within a socket
• Socket scaling ensures overall performance
• Multiple JVM instances:
[Diagram: Run 1 to Run 4.]
• Single JVM instance:
– Good to have NUMA disabled for consistency
– Stresses snooping bandwidth
[Diagram: Run 1 to Run 4.]