Seven deadly

  1. 7 Deadly Sins of Enterprise Java Programming and Deployment in the Multicore Era
     Anil Kumar: anil.kumar@intel.com
     Kumar Shiv: kumar.shiv@intel.com
     Mahesh Somani: msomani@ebay.com
     JavaOne 2010
     *Other names and brands may be claimed as the property of others.
  2. Agenda
     • SALIGIA (first letters of the seven deadly sins in Latin): Superbia, Avaritia, Luxuria, Invidia, Gula, Ira, Acedia

     Latin     Meaning       Implication for a geek
     Superbia  pride         my code is a piece of perfection
     Luxuria   extravagance  beefing up unnecessary areas
     Gula      gluttony      too many features and object allocations
     Acedia    neglect       neglecting scaling tests and corner cases
     Avaritia  greed         too much cost cutting on critical resources
     Invidia   envy          watching the competition gain market share
     Ira       wrath         what follows from the Almighty! (management)
  3. Agenda
     • Performance progression in the multi-core era
     • Quick details on the latest s/w and h/w platforms
     • Discussion of seven common pitfalls
     • Summary
     • References
  4. Multi-core era: Progression of performance
     • Phenomenal performance gain from the h/w + s/w combination
     [Chart: SPECjbb2005 K bops on a 2S platform, 2005 to 2010; data labels range from 36 to 1,011]
     • H/W capabilities increased in line with Moore's Law
       – ~10x-15x gain just from h/w
     • S/W changes to unlock the full potential of h/w capabilities
       – ~Doubling of performance
  5. Multi-core era: Rapid increase in # of cores

     Year  Processor code name  Microarchitecture  Xeon series     # of cores  Hyper-Threading  LLC
     2005  Irwindale            NetBurst           Xeon DP         1           2                2MB L2
     2005  Paxville-DP          NetBurst           Dual-Core Xeon  2           2                4MB L2
     2006  Dempsey-DP           NetBurst           5000            2           2                4MB L2
     2006  Woodcrest            Core               5100            2           None             4MB L2
     2007  Wolfdale-DP          Core               5200            2           None             6MB L2
     2006  Clovertown           Core               5300            4           None             8MB L2
     2007  Harpertown           Penryn             5400            4           None             12MB L2
     2009  Nehalem-EP           Nehalem            5500            4           2                8MB L3
     2010  Westmere-EP          Nehalem            5600            6           2                12MB L3

     • In addition to # of cores, many other advanced features deliver an excellent user experience by default
  6. Multi-core era: Increase in # of cores to continue
     [Diagram: tick-tock roadmap across the 65nm, 45nm, 32nm and 22nm processes, covering the Intel® Core™, Nehalem and Sandy Bridge microarchitectures]
     • Intel® Xeon® 5600: Intel's first 32nm server processor, with 6 cores and 12 threads
     • Many more advanced features to enhance user experience
  7. Multi-core era: More platform level features
     [Timeline: Q1'10 to Q3'10, January to October; launch of the Xeon 5600 and 7500 series; 2010/11 platform]
     • Xeon 7500 series: up to 8 cores per socket
       – 4 socket and greater
       – Nehalem EX (up to 8C + HT)
       – New mission critical RAS
       – New levels of scalability
       – Glueless 8 socket servers
       – NUMA
       – 16S to 32S scalable node controller
       – DDR3, Intel® QPI, PCI Express* 2.0 technology
     • Xeon 5600 processor: up to 6 cores per socket
       – 2 socket
       – Westmere (up to 6C + HT)
       – Intel® AES-NI
       – Intel TXT
       – NUMA
       – DDR3, Intel® QPI, PCI Express* 2.0 technology
  8. Intel's role in improving Java performance
     [Diagram: the Intel Java team at the center of five activities]
     • Out-of-box optimal settings of the H/W platform
     • Working with JVM vendors Oracle and IBM
     • Influencing future CPU design
     • Working with H/W and S/W profiling tools
     • Working with ISVs on the application level stack
     Application characterization helps in better deployment decisions as well as optimal utilization of a platform
     • Many more advanced features to enhance user experience
  9. S/W impact: Intel relationship with ISVs
     [Stack diagram: Java applications (widely used apps) / JVMs (3 major JVMs) / OS (all major operating systems)]
     • Very active role with OS vendors
     • Engaged with all three major JVM vendors
       – Sun HotSpot (now Oracle HotSpot)
       – Oracle JRockit
       – IBM J9
     • Interaction with widely used Java applications
     • Optimizing the complete s/w stack for the latest processor while ensuring excellent performance on the existing s/w stack
     Close, active relationship with ISV partners like eBay
  10. Application environments
     • Batch processing: stand-alone and/or a cluster of systems
       – Computation intensive etc.
     • 3-tier application servers
       [Diagram: Client / Java App server / Backend DB etc.]
     • High frequency trading/financial latency sensitive apps
     • Java + native mix
     • Virtualized environment
     • Cloud environment: very limited
  11. Factors impacting deployment configuration
     • Application deployment configuration by itself can be very complex, plus:
       – JVM, OS, other
       – BIOS settings: power management, Turbo, prefetching
       – HT: Hyper Threading
       – Processor: # of cores, SKU (GHz, caches)
       – Memory: DIMM population (capacity, latency), DIMM type (speed), bandwidth
       – Disk I/O
       – Network
     Simple logical mapping (real interaction is much more complex and intertwined)
  12. Performance methodology
     [Diagram: many performance and scaling issues, seen from two sides]
     • Application experts' view: application level monitoring (s/w)
     • View of the Intel Java performance team: H/W performance monitoring counters (h/w)
     • Many performance and scaling issues are reflected in h/w resource utilization, which is tracked by performance monitoring counters
       – There are >400 performance monitoring counters
  13. Performance monitoring counters analysis
     • Severe cases are easy to spot
     • Moderate cases, or extracting the last 30% of performance:
       – It is more art than science!
       – 4-5 counters can be collected at a time, with a minimum granularity of 100 ms
       – Absolute values are not very useful
       – The solution involves relative values and correlation of h/w resources
       [Inset chart labels: problem, Latency, Load]
     • Histogram patterns identify phases and anomalies
       [Chart: L2_LD_Miss (L2 cache load misses, 0 to 4,000,000) over time (samples 1 to 97), with a GC phase marked]
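The point about relative values can be illustrated with a minimal sketch. It assumes the counter samples have already been exported as plain numbers; the class name `CounterAnalysis`, the threshold factor and the sample values are hypothetical and are not taken from the talk or from any Intel tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CounterAnalysis {
    /** Cache misses per 1,000 retired instructions: a relative value comparable across runs. */
    static double missesPerKiloInstructions(long misses, long instructionsRetired) {
        if (instructionsRetired <= 0) {
            throw new IllegalArgumentException("instructionsRetired must be positive");
        }
        return 1000.0 * misses / instructionsRetired;
    }

    /** Median of the samples; the input array is left unmodified. */
    static double median(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Indexes of samples above factor * median: candidate phases (for example GC) or anomalies. */
    static List<Integer> spikes(double[] samples, double factor) {
        double threshold = factor * median(samples);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < samples.length; i++) {
            if (samples[i] > threshold) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical per-interval L2 load-miss counts, not measured data.
        double[] l2Misses = {500_000, 520_000, 480_000, 3_500_000, 510_000};
        System.out.println("intervals to inspect: " + spikes(l2Misses, 3.0));
        System.out.println("misses per 1000 instructions: " + missesPerKiloInstructions(55, 52_000));
    }
}
```

Comparing each interval against the median of the same run, rather than against a fixed absolute number, is one simple way to make phases stand out; a real analysis would also correlate several counters.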
  14. Locating source of problem
     • Out-of-box collection from performance analyzers is only useful for simple applications
       – Inlining makes it very hard to track the source of a problem
     • What are the next steps then?
       – Disabling inlining when collecting profiles often works
       – Analysis across JITed method code punctuated with h/w counters helps to identify the issue when the method hotness profile is FLAT

     h/w counters (normalized per sec)
     Methods  Code               CLK  Inst. Retd.  Cache misses
     A        mov eax, [ebx]     110  52           55
              mov [var], ebx      82  43            5
     B        str DB hello, 0     15  56           12
              push <mem>          85  25           10
     C        add <reg>,<reg>    200  90           25
              add <reg>,<mem>     24  32            8

     • Close cooperation between the Intel Java team and the application performance team is crucial
  15. Seven common pitfalls
     1. Multi-threading, serialization, locks
     2. Lack of basic characterization
     3. JVM selection and JVM parameters
     4. Heap management and GC
     5. Estimate and peak performance
     6. Monitoring (GC log etc.)
     7. S/W + H/W configuration
        – Including network, disk I/O, OS support
     [Side labels on the slide: App architects and programmers; Testing and deployment; Issues during]
     Customer IP and data security being paramount, only generic examples are being shared
  16. 1: Multithreading, serialization and locks
     • Often the first attempt at multi-threading is riddled with too many locks
       – As programmers, it is better to be safe than sorry
       – But, once the application is running, H/W level and JVM level profiling can identify potential locks to revisit
       – False sharing is another cause of poor scaling
       [Diagram: one "new" allocation holding obj1 to obj4, used by Thread1 to Thread4, each on its own CPU]
       Scaling issue if the objects are manipulated very often and the threads can run across multiple chips
     • The app is multithreaded, but there is serialization at the JVM, class library or JNI component level
     • Most issues are exposed when pushing system utilization beyond 60% or some throughput level
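One common way to reduce false sharing between per-thread counters is to space them apart so that each is likely to sit on its own cache line. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes 64-byte cache lines (8 `long` values per line), and the class name `StripedCounters` is hypothetical. The JVM does not guarantee array alignment, so the padding lowers the chance of sharing a line rather than eliminating it.

```java
import java.util.concurrent.atomic.AtomicLongArray;

class StripedCounters {
    // Assumption: 64-byte cache lines, so 8 longs per line.
    private static final int STRIDE = 8;

    private final AtomicLongArray cells;
    private final int slots;

    StripedCounters(int slots) {
        this.slots = slots;
        this.cells = new AtomicLongArray(slots * STRIDE);
    }

    /** Each thread increments only its own slot, STRIDE elements away from its neighbours. */
    void increment(int slot) {
        cells.incrementAndGet(slot * STRIDE);
    }

    /** Sum over all slots; only the first element of each stride holds data. */
    long sum() {
        long total = 0;
        for (int s = 0; s < slots; s++) {
            total += cells.get(s * STRIDE);
        }
        return total;
    }

    /** Runs one worker per slot and returns the combined count. */
    static long runDemo(int threads, int incrementsPerThread) {
        StripedCounters counters = new StripedCounters(threads);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int slot = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counters.increment(slot);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return counters.sum();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 1_000_000));
    }
}
```

Setting `STRIDE` to 1 gives the adjacent layout from the slide's diagram; timing both variants on a multi-socket machine is a simple way to see whether false sharing matters for a given workload.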
  17. 2: Lack of basic characterization
     • Baseline measurement for light load conditions
       – Throughput and response time are very critical as feedback to the tester as well as to identify any anomaly
       [Chart: response time vs. CPU utilization % (50%, 100%), with an anomaly marked]
     • Basic profile of the application:
       – Some surprises could be detected early
       [Charts: two plots of throughput vs. # of chips (0 to 5)]
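A baseline of throughput and response time under light load can be collected with very little code. The following is a minimal single-threaded sketch; the class name `Baseline` and the measured operation are hypothetical, and a real measurement would add warm-up iterations so that JIT compilation does not distort the numbers.

```java
class Baseline {
    final long operations;
    final double throughputPerSec;
    final double meanLatencyMicros;
    final double maxLatencyMicros;

    private Baseline(long operations, long elapsedNanos, long maxNanos) {
        this.operations = operations;
        this.throughputPerSec = operations * 1_000_000_000.0 / elapsedNanos;
        this.meanLatencyMicros = elapsedNanos / 1000.0 / operations;
        this.maxLatencyMicros = maxNanos / 1000.0;
    }

    /** Runs op sequentially and records throughput plus mean and worst-case response time. */
    static Baseline measure(Runnable op, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        long maxNanos = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            long t0 = System.nanoTime();
            op.run();
            maxNanos = Math.max(maxNanos, System.nanoTime() - t0);
        }
        // Guard against a zero interval on coarse clocks.
        long elapsed = Math.max(1, System.nanoTime() - start);
        return new Baseline(iterations, elapsed, maxNanos);
    }

    public static void main(String[] args) {
        // Hypothetical unit of work standing in for one application request.
        Baseline b = measure(() -> Math.sqrt(System.nanoTime()), 100_000);
        System.out.printf("ops=%d throughput/s=%.0f mean=%.3fus max=%.3fus%n",
                b.operations, b.throughputPerSec, b.meanLatencyMicros, b.maxLatencyMicros);
    }
}
```

Keeping the light-load numbers on record gives testers a reference point, so a later response-time anomaly at moderate utilization is recognizable as one.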
  18. 3: JVM selection and JVM parameters
     • What if the end-user environment is unknown?
       – But some information could be given to the user
     • JVM selection:
       – The latest JVMs often provide the best performance for the latest h/w
       – Throughput computing: ~10% impact is very common
         Latest versions of the Oracle HotSpot JVM for the Xeon 5500/5600/7500 series
         S314665: A Journey to the Center of the Java Universe, Wed 1PM, Parc 55/Embarcadero
       – Response time sensitive apps: standard JVMs vs. real time JVMs
     • JVM parameters:
       – Up to ~50% benefit (some possibility of negative impact in niche cases)
         Locks, strings and heap/GC are common examples that help most applications
  19. 4: Heap management and GC
     • Desired goal for heap:
       – Avoid memory swapping while using a heap large enough to reduce GC frequency
         Total RAM > Total (Java heap + non-heap memory)
         Old space > Total (long-lived objects), to avoid old space GC
       – What if the # of instances launched is unknown?
     • GC choices:
       – Throughput computing
       – Response time sensitive apps
       – When deploying multiple instances, the # of GC threads impacts response
       – 64-bit JVM: beware of compressed pointers/references
       – Sudden jump at the 4GB or 32GB heap boundaries
       [Chart: throughput vs. heap]
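The inequality "Total RAM > Total (Java heap + non-heap memory)" can be turned into a quick pre-deployment check. This is a sketch with hypothetical names (`HeapBudget`, `fitsInRam`, `maxInstances`) and hypothetical sizes; it follows the slide's inequality literally and therefore does not reserve memory for the OS or other processes, which a real budget should.

```java
class HeapBudget {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    /** True when RAM strictly exceeds the heap plus non-heap footprint of all JVM instances. */
    static boolean fitsInRam(long totalRamBytes, int instances, long heapBytes, long nonHeapBytes) {
        long perInstance = Math.addExact(heapBytes, nonHeapBytes);
        long needed = Math.multiplyExact((long) instances, perInstance);
        return totalRamBytes > needed;
    }

    /** Largest instance count n for which n * (heap + non-heap) stays strictly below total RAM. */
    static int maxInstances(long totalRamBytes, long heapBytes, long nonHeapBytes) {
        long perInstance = Math.addExact(heapBytes, nonHeapBytes);
        if (perInstance <= 0 || totalRamBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (int) ((totalRamBytes - 1) / perInstance);
    }

    public static void main(String[] args) {
        // Hypothetical deployment: 16 GB of RAM, 3 GB heap and 512 MB non-heap per JVM.
        System.out.println(fitsInRam(16 * GB, 4, 3 * GB, 512 * MB));
        System.out.println(fitsInRam(16 * GB, 5, 3 * GB, 512 * MB));
        System.out.println(maxInstances(16 * GB, 3 * GB, 512 * MB));
    }
}
```

When the number of instances is not known in advance, `maxInstances` gives an upper bound that can be documented for the person deploying the application.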
  20. 5: Estimate and peak performance
     • Model and anticipate demand
     • Pay attention to demand spikes at specific times of day
     • Stress test in the target environment
       – Don't assume linear performance
     • Software and hardware configuration lead to non-linear behavior
       – Hyper threading (HT) gain
       – Resource caps are application dependent
       [Chart: throughput vs. CPU utilization % (50%, 100%)]
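The warning against assuming linear performance can be made concrete by comparing a stress-test result with a linear extrapolation from a light-load measurement. The class `LinearityCheck` and all numbers below are hypothetical; the sketch only shows the arithmetic.

```java
class LinearityCheck {
    /** Throughput predicted at targetUtil by scaling a light-load measurement linearly. */
    static double linearEstimate(double baseThroughput, double baseUtilPercent, double targetUtilPercent) {
        if (baseUtilPercent <= 0) {
            throw new IllegalArgumentException("baseUtilPercent must be positive");
        }
        return baseThroughput * targetUtilPercent / baseUtilPercent;
    }

    /** Fraction of the linear estimate actually achieved; 1.0 would mean perfectly linear scaling. */
    static double achievedFraction(double measuredThroughput, double baseThroughput,
                                   double baseUtilPercent, double targetUtilPercent) {
        return measuredThroughput / linearEstimate(baseThroughput, baseUtilPercent, targetUtilPercent);
    }

    public static void main(String[] args) {
        // Hypothetical: 1,000 ops/s at 25% CPU, but only 3,000 ops/s measured at 100% CPU.
        System.out.println(linearEstimate(1000, 25, 100));
        System.out.println(achievedFraction(3000, 1000, 25, 100));
    }
}
```

A fraction well below 1.0 at high utilization suggests planning capacity from the stress-test figure rather than from the extrapolated one.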
  21. 6: Monitoring
     • Low overhead, always-on
     • Helps with root cause analysis
     • Hardware, OS, JVM, and application level monitoring
     • Capture and log the important metrics periodically
       – CPU, processes, GC
       – Logical resource caps like thread pools and connection pools
       – Errors, external resource utilization (DB, services)
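Periodic capture of JVM-level metrics can be sketched with the standard `java.lang.management` beans. The class name `LightweightMonitor`, the log format and the 10-second period are assumptions made for illustration; application-level caps such as connection pools would need their own hooks.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class LightweightMonitor {
    /** One log line with heap usage, cumulative GC activity, thread count and system load. */
    static String snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long gcCount = 0;
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcMillis += Math.max(0, gc.getCollectionTime());
        }
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        // Negative when the platform does not provide a load average.
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return String.format("heapUsed=%d heapCommitted=%d gcCount=%d gcMillis=%d threads=%d loadAvg=%.2f",
                heap.getUsed(), heap.getCommitted(), gcCount, gcMillis, threads, load);
    }

    /** Logs a snapshot at a fixed period on a daemon thread so it never blocks JVM shutdown. */
    static ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "lightweight-monitor");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> System.out.println(snapshot()), 0, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = start(10);
        Thread.sleep(1000); // let the first snapshot print
        scheduler.shutdownNow();
    }
}
```

Reading a handful of MXBean values every few seconds is cheap enough to leave on in production, which fits the "low overhead, always-on" goal.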
  22. 6: Monitoring and profiling: case studies
     • Insufficient heap size
       – Default heap size is very inconsistent across JVMs/OSes
     • Memory swapping from too large a heap
       – JVM starting first, non-Java memory/shared memory space
     • Inconsistent default nursery/old space size
     • Thread pool size auto-tuning for various deployments
     • No monitoring, detection and notification to the user
     • OS level:
       – Too many context switches, interrupts, exceptions
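Because defaults differ across JVMs and operating systems, one inexpensive safeguard is to log the effective settings at startup instead of trusting them. The sketch below uses only standard APIs; the class name `EffectiveSettings` and the report format are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class EffectiveSettings {
    /** Describes the JVM, the maximum heap it will use, the visible processors and the JVM arguments. */
    static String report() {
        Runtime rt = Runtime.getRuntime();
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        return "jvm=" + System.getProperty("java.vm.name") + " " + System.getProperty("java.vm.version")
                + " maxHeapBytes=" + rt.maxMemory()
                + " processors=" + rt.availableProcessors()
                + " jvmArgs=" + jvmArgs;
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Comparing this line across environments makes an unexpectedly small default heap, or a thread pool sized from a different core count, visible before it becomes a support case.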
  23. 7: S/W + H/W configuration
     • Inconsistent user experience
     • Degradation from changes in s/w and/or h/w upgrades:
       – H/W features: Turbo, CPU SKU, NUMA, memory population, # of cores increased but GHz decreased, power management
       – S/W features
       – Deployment configurations: not our area of expertise
       – Disk and network I/O: did not keep up with the increased processing power
  24. Summary
     1. Architect the design to scale
     2. Control the Java + JNI environment
     3. Heap and GC type
     4. JVM and parameter selection
     5. Estimate and peak performance
     6. Light weight monitoring
     7. H/W and S/W configuration
     Thank you!
     Anil Kumar: anil.kumar@intel.com
     Kumar Shiv: kumar.shiv@intel.com
     Mahesh Somani: msomani@ebay.com
     http://software.intel.com/sites/oss/pdfs/322727-001US_Java_Perf_Xeon_wp.pdf
  25. Backup
  26. EMON and VTune: H/W counters profiling
     • EMON (Intel internal tool)
       – >500 h/w counters can be profiled from a 30-minute run
       – Analysis helps in understanding:
         – How the application is stressing h/w resources
         – Where scaling issues may occur (prediction/estimation)
         – Deployment strategy for similar scenarios
     • Intel VTune Performance Analyzer:
       – H/W counters causing a bottleneck can be profiled using Intel VTune Performance Analyzer to identify the methods
       – Oh! Yes, after JITing and optimizations, method name and asm code match perfectly (just kidding)
       – Requires in-depth knowledge and some tricks to map asm code to Java source code (disable inlining, if possible)
       – http://software.intel.com/en-us/intel-vtune/
  27. Non-Uniform Memory Access (NUMA)
     [Diagram: two Nehalem-EP processors connected to Tylersburg-EP]
     • Intel® Core™ microarchitecture (Nehalem-EP)
     • Intel® Next Generation Server Processor Technology (Tylersburg-EP)
  28. Scaling over older generation
     • Most Java applications should get a significant boost
       – 50% or more gain for SPECjbb2005, SPECjvm2008 and SPECjAppServer2004 for Nehalem-EP over Core 2
     • For some niche apps, Xeon 5400 > Xeon 5500
       – When the working set fits into the (2x6MB L2) of the Xeon 5400 series and
       – Does not fit into the (4x256k L2 + 8MB L3) of the Xeon 5500 series
     [Diagram: Xeon 5500 series (Nehalem-EP based): Core 1 to Core 4, each with HT0/HT1, L1 and a 256k L2, sharing an 8MB L3]
     [Diagram: Xeon 5400 series (Core 2 based): Core 1 to Core 4, each with L1, and two 6MB L2 caches]
  30. Core scaling: Performance evaluation within a socket
     [Diagram: Xeon 5600 series (Westmere-EP): Core 1 to Core 6, each with HT:0 and HT:1, sharing a 12M last level cache]
     X: logical thread will be used
     • Compare without HT threads

       Run    Core 1  Core 2  Core 3  Core 4  Core 5  Core 6
       run 1  X
       run 2  X       X
       run 3  X       X       X
       run 4  X       X       X       X
       run 5  X       X       X       X       X
       run 6  X       X       X       X       X       X

     • Compare with HT threads (HT0 and HT1 of each core)

       Run    Core 1  Core 2  Core 3  Core 4  Core 5  Core 6
       run 1  X X
       run 2  X X     X X
       run 3  X X     X X     X X
       run 4  X X     X X     X X     X X
       run 5  X X     X X     X X     X X     X X
       run 6  X X     X X     X X     X X     X X     X X
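The run plan above can be generated programmatically, for example to feed an OS-level CPU affinity tool. The sketch makes a loud assumption: logical CPU ids `2*c` and `2*c+1` are taken to be the two hardware threads of core `c`. Real numbering is OS and BIOS specific, so the mapping must be checked against the machine's topology before use; the class name `CoreScalingPlan` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CoreScalingPlan {
    /**
     * Logical CPU ids used by a run that exercises the first `cores` cores.
     * Assumption (hypothetical numbering): ids 2*c and 2*c+1 are the hardware threads of core c.
     */
    static List<Integer> logicalCpus(int cores, boolean useHyperThreads) {
        List<Integer> cpus = new ArrayList<>();
        for (int c = 0; c < cores; c++) {
            cpus.add(2 * c);          // first hardware thread of core c
            if (useHyperThreads) {
                cpus.add(2 * c + 1);  // sibling hardware thread of core c
            }
        }
        return cpus;
    }

    public static void main(String[] args) {
        int coresPerSocket = 6; // Xeon 5600 series, as on the slide
        for (int run = 1; run <= coresPerSocket; run++) {
            System.out.println("run " + run
                    + " without HT: " + logicalCpus(run, false)
                    + " | with HT: " + logicalCpus(run, true));
        }
    }
}
```

Running the same workload pinned to each list in turn gives the two scaling curves the slide compares: core scaling alone, and core scaling with both hardware threads active.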
  31. Socket scaling: Overall performance evaluation
     • Core scaling ensures performance within a socket
     • Socket scaling ensures overall performance
     • Multiple JVM instances:
       [Diagram: Run 1, Run 2, Run 3, Run 4]
     • Single JVM instance:
       – Good to have NUMA disabled for consistency
       – Stresses snooping bandwidth
       [Diagram: Run 1, Run 2, Run 3, Run 4]
