VMworld 2013: Virtualizing and Tuning Large Scale Java Platforms


VMworld 2013

Emad Benjamin, VMware

Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare

Published in: Technology

VMworld 2013: Virtualizing and Tuning Large Scale Java Platforms

  1. Virtualizing and Tuning Large Scale Java Platforms
     Emad Benjamin, VMware
     VAPP4536 #VAPP4536
  2. About the Speaker
     • I have been with VMware for the last 8 years, working on Java and vSphere
     • 20 years of experience as a Software Engineer/Architect, with the last 15 years focused on Java development
     • Open source contributions
     • Prior work with Cisco, Oracle, and banking/trading systems
     • Authored the following books:
       - Virtualizing and Tuning Large Scale Java Platforms
       - Enterprise Java Applications Architecture on VMware
  3. Disclaimer
     • This session may contain product features that are currently under development.
     • This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
     • Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
     • Technical feasibility and market demand will affect final delivery.
     • Pricing and packaging for any new technologies or features discussed or presented have not been determined.
  4. Agenda
     • Overview
     • Design and Sizing Java Platforms
     • Performance
     • Best Practices and Tuning
     • Customer Success Stories
     • Questions
  5. Java Platforms Overview
  6. Conventional Java Platforms
     Java platforms are multitier and multi-org.
     Tiers and organizational key stakeholder departments (from the diagram):
     • Load Balancer Tier (load balancers): IT Operations Network Team
     • Web Server Tier (web servers): IT Operations Server Team
     • Java App Tier (Java applications): IT Apps – Java Dev Team
     • DB Server Tier (DB servers): IT Ops & Apps Dev Team
  7. Middleware Platform Architecture on vSphere
     High-uptime, scalable, and dynamic enterprise Java applications
     • Application services: load balancers as VMs, web servers, Java application servers, DB servers
     • Shared infrastructure services: capacity on demand, high availability, dynamic
     • Shared, always-on infrastructure: VMware vSphere
  8. Java Platforms Design and Sizing
  9. Design and Sizing of Java Platforms on vSphere
     Step 1 – Establish Load Profile
     • From production logs/monitoring reports, measure:
       - Concurrent users
       - Requests per second
       - Peak response time
       - Average response time
     • Establish your response time SLA
     Step 2 – Establish Benchmark
     • Iterate through the benchmark test until you are satisfied with the load profile metrics and your intended SLA; after each benchmark iteration you may have to adjust the application configuration
     • Adjust the vSphere environment to scale out/up in order to achieve your desired number of VMs, number of vCPUs, and RAM configurations
     Step 3 – Size Production Environment
     • The size of the production environment will have been established in Step 2; either roll out the environment from Step 2 or build a new one based on the numbers established
  10. Step 2 – Establish Benchmark
      ESTABLISH BUILDING BLOCK VM: establish vertical scalability (scale-up test)
      • Establish how many JVMs to run on a VM
      • Establish how large a VM should be in terms of vCPU and memory
      DETERMINE HOW MANY VMs: establish horizontal scalability (scale-out test)
      • How many VMs do you need to meet your response time SLAs without reaching 70%-80% saturation of CPU?
      • Establish your horizontal scalability factor before bottlenecks appear in your application
      Test flow (from the diagram):
      • Run the scale-out test with building block VMs and check: SLA OK?
      • If yes, the test is complete
      • If no, investigate the bottlenecked layer: network, storage, application configuration, and vSphere
      • If the scale-out bottlenecked layer is removed, iterate the scale-out test
      • If it is a building block app/VM configuration problem, adjust and iterate
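The scale-out decision described on this slide can be sketched as a small selection routine. This is only an illustration: the class `ScaleOutSizing`, its `Result` type, and the sample numbers are hypothetical and not part of the talk. It picks the smallest VM count whose measured response time meets the SLA while CPU stays below a chosen ceiling (the slide suggests staying under 70%-80%).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the scale-out test decision; not from the talk.
final class ScaleOutSizing {
    private ScaleOutSizing() {}

    /** One benchmark iteration: VM count, measured response time, and CPU utilization (0.0 to 1.0). */
    static final class Result {
        final int vmCount;
        final double responseTimeMs;
        final double cpuUtilization;

        Result(int vmCount, double responseTimeMs, double cpuUtilization) {
            this.vmCount = vmCount;
            this.responseTimeMs = responseTimeMs;
            this.cpuUtilization = cpuUtilization;
        }
    }

    /** Returns the smallest VM count that meets the SLA below the CPU ceiling, or -1 if none does. */
    static int smallestPassingVmCount(List<Result> results, double slaMs, double cpuCeiling) {
        int best = -1;
        for (Result r : results) {
            boolean passes = r.responseTimeMs <= slaMs && r.cpuUtilization < cpuCeiling;
            if (passes && (best == -1 || r.vmCount < best)) {
                best = r.vmCount;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up measurements for illustration only.
        List<Result> results = Arrays.asList(
                new Result(1, 900.0, 0.95),
                new Result(2, 450.0, 0.85),
                new Result(3, 240.0, 0.72),
                new Result(4, 230.0, 0.55));
        System.out.println("VMs needed: " + smallestPassingVmCount(results, 250.0, 0.80));
    }
}
```

A result of -1 corresponds to the "investigate bottlenecked layer" branch of the flow: no tested configuration met the SLA, so the bottleneck has to be removed before iterating.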
  11. Design and Sizing HotSpot JVMs on vSphere
      Memory layout (diagram labels):
      • VM Memory: Guest OS Memory plus JVM Memory
      • Non-direct memory ("heap"): JVM Max Heap (-Xmx), Initial Heap (-Xms)
      • Direct native memory ("off-the-heap"): Perm Gen (-XX:MaxPermSize), Java Stack (-Xss per thread), Other mem
  12. Design and Sizing of HotSpot JVMs on vSphere
      • Guest OS memory is approximately 1GB (depends on the OS and other processes)
      • Perm Size is an area additional to the -Xmx (max heap) value and is not GC-ed because it contains class-level information
      • "Other mem" is additional memory required for NIO buffers, JIT code cache, classloaders, socket buffers (receive/send), JNI, and GC internal info
      • VM Memory = Guest OS Memory + JVM Memory
      • JVM Memory = JVM Max Heap (-Xmx value) + JVM Perm Size (-XX:MaxPermSize) + NumberOfConcurrentThreads * (-Xss) + "other mem"
      • If you have multiple JVMs (N JVMs) on a VM, then VM Memory = Guest OS Memory + N * JVM Memory
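The two formulas above translate directly into arithmetic. The sketch below is a minimal illustration; the class and method names are hypothetical, and the inputs in `main` are made-up values rather than figures from the talk. All sizes are in MB except the per-thread stack size, which is in KB.

```java
// Hypothetical helper illustrating the slide's sizing formulas; not from the talk.
final class JvmSizing {
    private JvmSizing() {}

    /** JVM Memory = max heap + perm size + threads * stack size + "other mem" (result in MB). */
    static long jvmMemoryMb(long maxHeapMb, long maxPermMb, int threads, long stackKb, long otherMb) {
        long stacksMb = (threads * stackKb + 1023) / 1024; // round thread stacks up to whole MB
        return maxHeapMb + maxPermMb + stacksMb + otherMb;
    }

    /** VM Memory = guest OS memory + N * JVM memory (result in MB). */
    static long vmMemoryMb(long guestOsMb, int jvmCount, long jvmMemoryMb) {
        return guestOsMb + (long) jvmCount * jvmMemoryMb;
    }

    public static void main(String[] args) {
        // Illustrative inputs: 2 JVMs, each with a 2048 MB heap, 128 MB perm,
        // 200 threads at 512 KB stack, and 150 MB of "other mem"; 1024 MB for the guest OS.
        long jvm = jvmMemoryMb(2048, 128, 200, 512, 150);
        long vm = vmMemoryMb(1024, 2, jvm);
        System.out.println("JVM memory: " + jvm + " MB, VM memory: " + vm + " MB");
    }
}
```

The VM memory figure is also the value the deck recommends using for the memory reservation.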
  13. Sizing Example
      • JVM Max Heap -Xmx (4096m); Initial Heap -Xms (4096m)
      • Perm Gen -XX:MaxPermSize (256m)
      • Java Stack -Xss per thread (256k*100)
      • Other mem (=217m)
      • JVM Memory (4588m)
      • Guest OS Memory: 500m used by OS
      • VM Memory (5088m); set memory reservation to 5088m
  14. Larger JVMs for In-Memory Data Management Systems
      • JVM Max Heap -Xmx (30g); Initial Heap -Xms (30g)
      • Perm Gen -XX:MaxPermSize (0.5g)
      • Java Stack -Xss per thread (1M*500)
      • Other mem (=1g)
      • JVM Memory for SQLFire (32g)
      • Guest OS Memory: 0.5-1g used by OS
      • VM Memory for SQLFire (34g); set memory reservation to 34g
  15. NUMA Local Memory with Overhead Adjustment
      NUMA local memory = (Physical RAM on vSphere host - (Physical RAM on vSphere host * Number of VMs on vSphere host * 1% RAM overhead) - vSphere RAM overhead) / Number of sockets on vSphere host
  16. Middleware ESXi Cluster
      • Host: 96GB RAM, 2 sockets, 8 pCPU per socket
      • Middleware components: 47GB RAM VMs with 8 vCPU
      • Locator/heartbeat for middleware: DO NOT vMotion
      • Memory available for all VMs = 96*0.98 - 1GB => 94GB
      • Per-NUMA-node memory => 94/2 = 47GB
  17. 96GB RAM on server; each NUMA node has 94/2 = 47GB
      • 8 vCPU VMs with less than 47GB RAM on each VM
      • If a VM is sized greater than 47GB or 8 CPUs, then NUMA interleaving occurs and can cause a 30% drop in memory throughput performance
      [Diagram label: ESX Scheduler]
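The per-NUMA-node arithmetic from the preceding slides can be sketched as follows. The class and method names are hypothetical. The 1% per-VM overhead and the 1GB vSphere overhead are taken from the slides, with 2 VMs assumed so that the overhead factor matches the 0.98 shown there. Computed exactly, the 96GB two-socket example comes to about 46.5GB per node, which the slides round to 47GB.

```java
// Hypothetical helper illustrating the NUMA sizing arithmetic; not from the talk.
final class NumaSizing {
    private NumaSizing() {}

    /** (RAM - RAM * vmCount * 1% - vSphere overhead) / sockets, all in GB. */
    static double numaLocalMemoryGb(double physicalRamGb, int vmCount,
                                    double vSphereOverheadGb, int sockets) {
        if (sockets <= 0) {
            throw new IllegalArgumentException("sockets must be positive");
        }
        double perVmOverheadGb = physicalRamGb * vmCount * 0.01;
        return (physicalRamGb - perVmOverheadGb - vSphereOverheadGb) / sockets;
    }

    /** A VM stays NUMA-local if both its RAM and its vCPU count fit within one node. */
    static boolean fitsInNumaNode(double vmRamGb, int vmVcpus,
                                  double numaLocalGb, int coresPerSocket) {
        return vmRamGb <= numaLocalGb && vmVcpus <= coresPerSocket;
    }

    public static void main(String[] args) {
        double node = numaLocalMemoryGb(96.0, 2, 1.0, 2);
        System.out.printf("NUMA local memory: %.2f GB%n", node);
        System.out.println("40GB/8vCPU VM fits: " + fitsInNumaNode(40.0, 8, node, 8));
        System.out.println("48GB/8vCPU VM fits: " + fitsInNumaNode(48.0, 8, node, 8));
    }
}
```

A VM that fails the `fitsInNumaNode` check is the case the slide warns about, where memory is interleaved across nodes.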
  18. 128GB RAM on server
      • 2vCPU VMs with less than 20GB RAM on each VM
      • A 4vCPU VM with 40GB RAM is split by ESXi into 2 NUMA clients (available in ESX 4.1)
      [Diagram label: ESXi Scheduler]
  19. Java Platform Categories – Category 1
      Category 1: 100s to 1000s of JVMs
      • Smaller JVMs: < 4GB heap, 4.5GB Java process, and 5GB for the VM
      • vSphere hosts with < 96GB RAM are more suitable, because by the time you stack the many JVM instances you are likely to reach the CPU boundary before you can consume all of the RAM. For example, if you instead chose a vSphere host with 256GB RAM, then 256/4.5GB => 57 JVMs, which would clearly reach the CPU boundary
      • Multiple JVMs per VM
      • Use resource pools to manage different LOBs (diagram: Resource Pool 1, Gold, LOB 1; Resource Pool 2, Silver, LOB 2)
      • Use 4-socket servers to get more cores
  20. Most Common Sizing and Configuration Question
      • Option 1: scale out VM and JVM (best)
      • Option 2: scale up JVM heap size (2nd best)
      • Option 3: scale up VM and JVM (3rd best)
      [Diagram: JVM-1 through JVM-4, JVM-1A, and JVM-2A on 2vCPU VMs, labeled 2GB and 4GB]
  21. What Else to Consider When Sizing?
      • Mixed workloads: a job scheduler and a web app require different GC tuning
      • Job schedulers care about throughput
      • Web apps care about minimizing latency and response time
      • You can't have both reduced response time and increased throughput without compromise
      • Separate the concerns for optimal tuning
      [Diagram: Job and Web workloads across JVM-1 to JVM-4, scaled vertically and horizontally]
  22. Java Platform Categories – Category 2
      Category 2: a dozen very large JVMs
      • Fewer JVMs (< 20)
      • Very large JVMs, 32GB to 128GB
      • Always deploy 1 VM per NUMA node and size it to fit perfectly
      • 1 JVM per VM
      • Choose 2-socket vSphere hosts, and install ample memory (128GB to 512GB)
      • Examples are in-memory databases, like SQLFire and GemFire
      • Apply latency-sensitive best practices: disable interrupt coalescing on pNIC and vNIC
      • Dedicated vSphere cluster
      • Use 2-socket servers to get larger NUMA nodes
  23. Java Platform Categories – Category 3
      Category 3: Category 1 accessing data from Category 2
      [Diagram: Resource Pool 1, Gold, LOB 1; Resource Pool 2, Silver, LOB 2]
  24. Java Platforms Performance
  25. Performance Perspective
      See "Performance of Enterprise Java Applications on VMware vSphere 4.1 and SpringSource tc Server" at http://www.vmware.com/resources/techresources/10158
  26. Performance Perspective
      See "Performance of Enterprise Java Applications on VMware vSphere 4.1 and SpringSource tc Server" at http://www.vmware.com/resources/techresources/10158
      [Chart: % CPU and response time (R/T), with an 80% threshold marked]
  27. SQLFire vs. Traditional RDBMS
      • SQLFire scaled 4x compared to RDBMS
      • Response times of SQLFire are 5x to 30x faster than RDBMS
      • Response times on SQLFire are more stable and constant with increased load
      • RDBMS response times increase with increased load
  28. Load Testing SpringTrader Using Client-Server Topology
      • Application tier: SpringTrader Integration Services (integration patterns) and 4 SpringTrader Application Services
      • Data tier: SQLFire Member 1, SQLFire Member 2, and redundant locators
  29. vFabric Reference Architecture Scalability Test
      [Chart: Maximum Passing Users and Scaling. X-axis: Number of Application Services VMs (1 to 4); y-axes: Number of Users (0 to 12000) and Scaling from 1 App Services VM (0.00 to 4.00)]
      With this topology: 10400 user sessions
  30. 10k Users Load Test Response Time
      [Chart: Operation 90th-Percentile Response-Time, Four Application Services VMs. X-axis: Number of Users (0 to 12000); y-axis: Seconds (0 to 7). Series: HomePage, Register, Login, DashboardTab, PortfolioTab, TradeTab, GetHoldingsPage, GetOrdersPage, SellOrder, GetQuote, BuyOrder, Logout, MarketSummary]
      At 10400 user sessions: approx. 0.25 seconds response time
  31. Java Platforms Best Practices and Tuning
  32. Most Common VM Size for Java Workloads
      • 2 vCPU VM with 1 JVM, for tier-1 production workloads
      • Maintain this ratio as you scale out or scale up, i.e. 1 JVM : 2 vCPU
      • Scale-out is preferred over scale-up, but both can work
      • You can diverge from this ratio for less critical workloads
      [Diagram: 2 vCPU VM, 1 JVM (-Xmx 4096m), approx. 5GB RAM reservation]
  33. However, for Large JVMs + CMS
      For large JVMs: 4+ vCPU VM, 1 JVM (8-128GB)
      • Start with a 4+ vCPU VM with 1 JVM for tier-1 production workloads of the in-memory data management type
      • You will likely increase the JVM size instead of launching a second JVM instance
      • 4 or more vCPUs allow ParallelGCThreads to be set to 50% of the vCPUs available to the JVM, i.e. 2 or more GC threads
      • The ability to increase ParallelGCThreads is critical to YoungGen scalability for large JVMs
      • ParallelGCThreads should be set to 50% of the vCPUs available to the JVM and not more; you want to ensure there are other vCPUs available for other transactions
  34. Which GC?
      ESX doesn't care which GC you select, because of the degree of independence of Java from the OS and of the OS from the hypervisor
  35. GC Policy Types
      Serial GC
      • Mark, sweep, and compact algorithm
      • Both minor and full GC are stop-the-world
      • Stop-the-world GC means the application is stopped while GC is executing
      • Not a very scalable algorithm
      • Suited for smaller (<200MB) JVMs, such as client machines
      Throughput GC
      • Parallel GC
      • Similar to Serial GC, but uses multiple worker threads in parallel to increase throughput
      • Both Young and Old Generation collection are multi-threaded, but still stop-the-world
      • Number of threads is set by -XX:ParallelGCThreads=<nThreads>
      • NOT concurrent, meaning that when the GC worker threads run, they pause your application threads. If this is a problem, move to CMS, where GC threads are concurrent
  36. GC Policy Types
      Concurrent GC
      • Concurrent Mark and Sweep (CMS), no compaction
      • Concurrent implies that when GC is running it doesn't pause your application threads; this is the key difference from throughput/parallel GC
      • Suited for applications that care more about response time than throughput
      • CMS does use more heap when compared to throughput/ParallelGC
      • CMS works on the Old Generation concurrently, but the Young Generation is collected using ParNewGC, a version of the throughput collector
      • Has multiple phases:
        - Initial mark (short pause)
        - Concurrent mark (no pause)
        - Pre-cleaning (no pause)
        - Re-mark (short pause)
        - Concurrent sweeping (no pause)
      G1
      • Only in J7 and mostly experimental; equivalent to CMS + compacting
  37. Tuning GC – Art Meets Science!
      Either you tune for throughput or for latency, one at the cost of the other.
      Tuning decisions:
      • Increase throughput (Job): improved throughput, longer R/T, increased latency impact
      • Reduce latency (Web): improved R/T, reduced latency impact, slightly reduced throughput
  38. Parallel Young Gen and CMS Old Gen
      [Diagram: application threads, minor GC threads, and concurrent mark and sweep GC]
      • Young Generation (sized by -Xmn, with survivor spaces S0 and S1): minor GC, parallel in YoungGen using -XX:+UseParNewGC and -XX:ParallelGCThreads
      • Old Generation (sized as -Xmx minus -Xmn): major GC, concurrent in OldGen using -XX:+UseConcMarkSweepGC
  39. High Level GC Tuning Recipe
      • Step A – Young Gen Tuning: measure minor GC duration and frequency; adjust -Xmn (Young Gen size) and/or ParallelGCThreads
      • Step B – Old Gen Tuning: measure major GC duration and frequency; adjust heap space -Xmx
      • Step C – Survivor Spaces Tuning: adjust -Xmn and/or survivor spaces
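The "measure" steps in the recipe can be approximated from inside a running JVM with the standard `java.lang.management` API. This is a minimal sketch, not the method used in the talk (GC logs are the more common source); the class `GcStats` is a hypothetical name. Note that `getCollectionTime()` reports accumulated collection time, which is only a rough stand-in for pause duration, especially for concurrent collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reports collection counts and times per collector.
final class GcStats {
    private GcStats() {}

    /** Average time per collection in ms, or 0.0 when no collections were recorded. */
    static double averagePauseMs(long count, long totalMs) {
        return (count > 0 && totalMs >= 0) ? (double) totalMs / count : 0.0;
    }

    /** One summary line per registered collector (typically one young and one old collector). */
    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();  // -1 if the collector does not report it
            long timeMs = gc.getCollectionTime();  // -1 if the collector does not report it
            lines.add(gc.getName() + ": collections=" + count
                    + ", totalMs=" + timeMs
                    + ", avgMs=" + averagePauseMs(count, timeMs));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : snapshot()) {
            System.out.println(line);
        }
    }
}
```

Sampling `snapshot()` at two points in time and dividing the change in collection count by the elapsed interval gives the frequency the recipe asks for.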
  40. CMS Collector Example
      java -Xms30g -Xmx30g -Xmn10g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
        -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
        -XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8
        -XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=4
        -XX:+UseCompressedOops -XX:+OptimizeStringConcat -XX:+UseCompressedStrings
        -XX:+UseStringCache
      • This JVM configuration scales up and down effectively
      • Set -Xmx equal to -Xms, and -Xmn to 33% of -Xmx
      • -XX:ParallelGCThreads=<minimum 2, but less than 50% of the vCPUs available to the JVM>. NOTE: ideally use it for VMs with 4 or more vCPUs; if used on 2vCPU VMs, drop the -XX:ParallelGCThreads option and let Java select it
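The sizing rules in the notes above can be sketched as a small flag builder. The class `CmsFlags` and its methods are hypothetical. The sketch reads the thread rule as "at least 2 and at most 50% of vCPUs, omitted below 4 vCPUs", which is one interpretation of the slides, and it emits only a subset of the flags in the example. The flags apply to HotSpot JVMs of the era of the talk; CMS itself was removed in JDK 14.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that derives a subset of the example's flags from heap size and vCPU count.
final class CmsFlags {
    private CmsFlags() {}

    /** Young generation at roughly 33% of the heap, in whole GB (at least 1). */
    static int youngGenGb(int heapGb) {
        return Math.max(1, heapGb / 3);
    }

    /** GC thread count, or 0 when the flag should be omitted so the JVM picks its own default. */
    static int parallelGcThreads(int vcpus) {
        return vcpus >= 4 ? Math.max(2, vcpus / 2) : 0;
    }

    static List<String> build(int heapGb, int vcpus) {
        List<String> flags = new ArrayList<>();
        flags.add("-Xms" + heapGb + "g"); // initial heap equals max heap
        flags.add("-Xmx" + heapGb + "g");
        flags.add("-Xmn" + youngGenGb(heapGb) + "g");
        flags.add("-XX:+UseConcMarkSweepGC");
        flags.add("-XX:+UseParNewGC");
        flags.add("-XX:CMSInitiatingOccupancyFraction=75");
        flags.add("-XX:+UseCMSInitiatingOccupancyOnly");
        int threads = parallelGcThreads(vcpus);
        if (threads > 0) {
            flags.add("-XX:ParallelGCThreads=" + threads);
        }
        return flags;
    }

    public static void main(String[] args) {
        // A 30GB heap on an assumed 8 vCPU VM; the slide does not state the vCPU count.
        System.out.println(String.join(" ", build(30, 8)));
    }
}
```

For a 30GB heap on an assumed 8 vCPU VM this yields the same -Xmn10g and -XX:ParallelGCThreads=4 values as the example; on a 2 vCPU VM the thread flag is left out.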
  41. IBM JVM – GC Choice
      -Xgcpolicy:optthruput (default)
      • Usage: performs the mark and sweep operations during garbage collection when the application is paused, to maximize application throughput. Mostly not suitable for multi-CPU machines
      • Example (Job): apps that demand high throughput but are not very sensitive to the occasional long garbage collection pause
      -Xgcpolicy:optavgpause
      • Usage: performs the mark and sweep concurrently while the application is running to minimize pause times; this provides the best application response times. There is still a stop-the-world GC, but the pause is significantly shorter. After GC, the app threads help out and sweep objects (concurrent sweep)
      • Example (Web): apps sensitive to long latencies, such as transaction-based systems where response times are expected to be stable
      -Xgcpolicy:gencon
      • Usage: treats short-lived and long-lived objects differently to provide a combination of lower pause times and high application throughput. Before the heap fills up, each app thread helps out and marks objects (concurrent mark)
      • Example (Web): latency-sensitive apps where objects in the transaction don't survive beyond the transaction commit
  42. Middleware on VMware – Best Practices
      • Enterprise Java Applications on VMware Best Practices Guide: http://www.vmware.com/resources/techresources/1087
      • Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs: http://www.vmware.com/resources/techresources/10220
      • vFabric SQLFire Best Practices Guide: http://www.vmware.com/resources/techresources/10327
      • vFabric Reference Architecture: http://tinyurl.com/cjkvftt
  43. Middleware on VMware – Best Practices Summary
      • Follow the design and sizing examples we discussed thus far
      • Set an appropriate memory reservation
      • Leave HT enabled; size based on vCPU = 1.25 pCPU if needed
      • RHEL6 and SLES 11 SP1 have a tickless kernel that does not rely on a high-frequency interrupt-based timer, and is therefore much friendlier to virtualized latency-sensitive workloads
      • Do not overcommit memory
      • Locator/heartbeat processes should not be vMotion® migrated; doing so could lead to network split-brain problems
      • vMotion over 10Gbps when doing scheduled maintenance
      • Use affinity and anti-affinity rules to avoid redundant copies on the same VMware ESX®/ESXi host
  44. Middleware on VMware – Best Practices
      • Disable NIC interrupt coalescing on the physical and virtual NIC
      • Extremely helpful in reducing latency for latency-sensitive virtual machines
      • Disable virtual interrupt coalescing for VMXNET3
        - It can lead to some performance penalties for other virtual machines on the ESXi host, as well as higher CPU utilization to deal with the higher rate of interrupts from the physical NIC
      • This implies it is best to use a dedicated ESX cluster for middleware platforms
        - All hosts are configured the same way for latency sensitivity, and this ensures that non-middleware workloads, such as other enterprise applications, are not negatively impacted
        - This is applicable to Category 2 workloads
  45. Middleware on VMware – Benefits
      • Flexibility to change compute resources and VM sizes, and to add more hosts
      • Ability to apply hardware and OS patches while minimizing downtime
      • Create a more manageable system through reduced middleware sprawl
      • Ability to tune the entire stack within one platform
      • Ability to monitor the entire stack within one platform
      • Ability to handle seasonal workloads: commit resources when they are needed and remove them when they are not
  46. Customer Success Stories
  47. NewEdge
      • Virtualized GemFire workload
      • Multiple geographic active-active datacenters
      • Multiple terabytes of data kept in memory
      • 1000s of transactions per second
      • Multiple vSphere clusters
      • Each cluster has 4 vSphere hosts and 8 large 98GB+ JVMs
      http://www.vmware.com/files/pdf/customers/VMware-Newedge-12Q4-EN-Case-Study.pdf
  48. Cardinal Health Virtualization Journey
      Theme: Centralized IT; Shared Service; Capital Intensive - High Response; Variable Cost Subscription Services
      2005 – 2008: Consolidation
      • Virtual: < 40% virtual, < 2,000 VMs, < 2,355 physical
      • DC: Data Center Optimization, 30 DCs to 2 DCs
      • HW: Transition to Blades, < 10% utilization, < 10:1 VM/physical
      • SW: Low Criticality Systems, 8x5 applications
      2009 – 2011: Internal cloud
      • Virtual: > 58% virtual, > 3,852 VMs, < 3,049 physical
      • DC: Power Remediation, P2Vs on refresh
      • HW: HW Commoditization, 15% utilization, 30:1 VM/physical
      • SW: Business Critical Systems, SAP ~ 382, WebSphere ~ 290, Unix to Linux ~ 655
      2012 – 2015: Cloud Resources
      • Virtual: > 90% virtual, > 8,000 VMs, < 800 physical
      • DC: Optimizing DCs, internal disaster recovery, metered service offerings (SaaS, PaaS, IaaS)
      • HW: Shrinking HW Footprint, > 50% utilization, > 60:1 VM/physical
      • SW: Heavy Lifting Systems, database servers
  49. Why Virtualize WebSphere on VMware
      • DC strategy alignment
        - Pooled resources capacity ~15% utilization
        - Elasticity for changing workloads
        - Unix to Linux
        - Disaster recovery
      • Simplification and manageability
        - High availability for thousands instead of thousands of high availability solutions
        - Network and system management in DMZ
      • Five-year cost savings ~ $6 million
        - Hardware savings ~ $660K
        - WAS licensing ~ $862K
        - Unix to Linux ~ $3.7M
        - DMZ ports ~ > $1M
  50. Thank you. Are there any questions?
      Emad Benjamin, ebenjamin@vmware.com
      You can get the book here: https://www.createspace.com/3632131
  51. Second Book
      • Emad Benjamin, ebenjamin@vmware.com
      • Preview chapter available at the VMworld bookstore
      • You can get the book here:
        - Safari: http://tinyurl.com/lj8dtjr
        - Later on Amazon: http://tinyurl.com/kez9trj
  52. Other VMware Activities Related to This Session
      • HOL: HOL-SDC-1304 vSphere Performance Optimization
      • Group Discussions: VAPP1010-GD Java with Emad Benjamin