Lego Cloud SAP Virtualization Week 2012


This session will demonstrate that by extending KVM we can deliver, non-disruptively, the next level of IaaS platform modularization. We will first show instantaneous live migration of VMs, then introduce the memory aggregation concept, and finally show how to achieve full operational flexibility by disaggregating datacenter resources into their core elements.


  1. 1. TRND04: The Lego Cloud. Benoit Hudzia, Sr. Researcher, SAP Research CEC Belfast; Aidan Shribman, Sr. Researcher, SAP Research Israel
  2. 2. Agenda: Introduction; Hardware Trends; Live Migration; Memory Aggregation; Compute Aggregation; Summary. © 2012 SAP AG. All rights reserved.
  3. 3. Introduction: the evolution of the datacenter
  4. 4. Evolution of Virtualization: no virtualization → consolidation (basic) → flexible resource management (cloud) → resource disaggregation (true utility cloud)
  5. 5. Why Disaggregate Resources? Better performance: replacing slow local devices (e.g. disk) with fast remote devices (e.g. DRAM); many remote devices working in parallel (DRAM, disk, compute). Superior scalability: going beyond the boundaries of a single node. Improved economics: do more with existing hardware; reach better hardware utilization levels.
  6. 6. The Hecatonchires Project. Hecatonchires: the "Hundred-Handed One". Original idea: provide Distributed Shared Memory (DSM) capabilities to the cloud. Strategic goal: full resource liberation brought to the cloud by providing more resource flexibility to the current cloud paradigm, breaking nodes down into their basic elements (CPU, memory, I/O); extending the existing cloud software stack (KVM, Qemu, libvirt, OpenStack) without degrading any existing capabilities; and using commodity cloud hardware: medium-sized hosts (typically 64 GB and 8/16 cores) and standard interconnects (such as 1 GbE or 10 GbE). Initiated by Benoit Hudzia in 2011. Currently developed by two small teams of researchers from the TI Practice, located in Belfast and Ra'anana.
  7. 7. High-Level Architecture. No special hardware required, only RDMA-enabled NICs, which support the low-overhead, low-latency communication layer. VMs are no longer bounded by host size, as resources such as memory, I/O and compute can be aggregated. Different-sized VMs can share the infrastructure, so we can still support smaller VMs not requiring dedicated hosts. The application stack runs unmodified. (Diagram: guest VMs spanning servers #1 to #n, each server contributing CPUs, memory and I/O over fast RDMA communication.)
  8. 8. The Team: Panoramic View
  9. 9. Hardware Trends: are hosts getting closer?
  10. 10. CPUs stopped getting faster. Moore's law prevailed until 2003, when core speed hit a practical limit of about 3.4 GHz. In the datacenter, cores are even slower, running at 2.0-2.8 GHz for power-conservation reasons. Since 2000 you do get more cores, but that does not affect compute-cycle and compute-instruction latencies. Effectively, arbitrary sequential algorithms have not gotten faster since then.
  11. 11. DRAM latency has remained constant. CPU clock speed and memory bandwidth increased steadily (at least until 2000), but memory latency remained constant, so local memory has gotten slower from the CPU perspective. Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
  12. 12. Disk latency has virtually not improved. A standard 1980s disk spins at 3,600 RPM; a standard 2010s disk spins at 7,200 RPM. A 2x speedup in 30 years is negligible; effectively, disk has become slower from the CPU perspective. Average latency (ms) by spindle speed: 3,600 RPM: 8.3; 4,200 RPM: 7.1; 4,500 RPM: 6.7; 4,900 RPM: 6.1; 5,200 RPM: 5.8; 5,400 RPM: 5.6; 7,200 RPM: 4.2; 10,000 RPM: 3.0; 12,000 RPM: 2.5; 15,000 RPM: 2.0. Panda et al. Supercomputing 2009
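The latency figures above follow directly from spindle speed: average rotational latency is the time for half a revolution. A quick sketch confirming the chart's numbers (the formula is standard; the function name is ours):

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average rotational latency = time for half a revolution, in ms."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

# Matches the slide's chart: 3,600 RPM -> 8.3 ms, 7,200 RPM -> 4.2 ms
for rpm in (3_600, 7_200, 15_000):
    print(rpm, round(avg_rotational_latency_ms(rpm), 1))
```

Doubling RPM halves rotational latency, which is why three decades of spindle-speed progress bought only a ~4x improvement.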
  13. 13. But: networks are steadily getting faster. Since 1979 we went from 0.01 Gbit/s up to 64 Gbit/s, a 6,400x speedup, driven by a competitive marketplace: 10 and 40 Gbps Ethernet originated from network interconnects; 40 Gbps QDR InfiniBand originated from computer-internal bus technology. InfiniBand/Ethernet convergence: Virtual Protocol Interconnect, InfiniBand over Ethernet, RDMA over Converged Enhanced Ethernet, using standard semantics defined by OFED. Panda et al. Supercomputing 2009
  14. 14. And: communication stacks are becoming faster. Network-stack deficiencies: application/OS context switches, intermediate buffer copies, transport processing. The RDMA OFED Verbs API provides zero copy, offloading of transport processing to the NIC (RoCE), and the flexibility to use IB, GE or iWARP. The result: reduced latency, processor offloading, operational flexibility.
  15. 15. Benchmarking modern interconnects. Intel MPI Benchmarks (IMB), typically used in HPC and parallel computing, comparing: 4x DDR IB using the Verbs API; 10 GE with TOE (TCP offload engine) iWARP; 1 GE. Measured broadcast latencies: IB: 2 us; 10 GE: 8.23 us; 1 GE: 46.52 us. (Charts: broadcast latency and exchange bandwidth.) Source: Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet, IBM
  16. 16. Conclusion: remote nodes have gotten closer. Interconnects have become much faster, and fast interconnects have become a commodity, moving out of the High Performance Computing (HPC) niche. An IB latency of 2,000 ns is only about 20x slower than RAM and about 100x faster than SSD. Remote page faulting is much faster than traditional disk-backed page swapping! HANA Performance Analysis, Chaim Bendelac, 2011
  17. 17. Result: blurring of the physical node boundaries. (Diagram: latency scale from 0 ns locally, through 100 ns for DRAM and 2,000 ns for remote DRAM over IB, up to 10,000,000 ns for disk.)
  18. 18. Live Migration: pretext to Hecatonchire
  19. 19. Enabling live migration of SAP workloads. Business problem: typical SAP workloads (e.g. SAP ERP) are transactional and large (possibly 64 GB), with a fast rate of memory writes. Classic live migration fails for such workloads, as rapid memory writes cause memory pages to be re-sent over and over again. Hecatonchire's solution: enable live migration by reducing both the number of pages re-sent and the cost of a page re-send; non-intrusively reduce downtime, service degradation, and total migration time.
  20. 20. Live migration technique: (1) Pre-migration: VM active on host A; destination host selected. (2) Reservation: initialize container on target host. (3) Iterative pre-copy: copy dirty pages in successive rounds while the VM stays live on host A. (4) Stop and copy: suspend on host A; sync remaining state (block devices mirrored). (5) Commitment: VM state on host A released; activate on host B; redirect network traffic.
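The iterative pre-copy phase can be sketched as a toy convergence model (illustrative only, not QEMU's implementation; all names and parameters are ours): each round re-sends the pages the guest dirtied during the previous transfer, so the dirty set shrinks geometrically as long as the guest re-dirties less than it receives.

```python
def iterative_precopy(num_pages=1000, redirty_ratio=0.3,
                      stop_threshold=20, max_rounds=30):
    """Toy pre-copy model: round 1 sends every page; while a round
    transfers, the guest re-dirties a fraction of the pages just sent.
    Stop-and-copy once the dirty set is small (or give up converging)."""
    dirty, sent, rounds = num_pages, 0, 0
    while dirty > stop_threshold and rounds < max_rounds:
        sent += dirty                        # transfer current dirty set
        dirty = int(dirty * redirty_ratio)   # pages re-dirtied mid-transfer
        rounds += 1
    # the remaining dirty pages are sent during the stop-and-copy downtime
    return sent + dirty, dirty, rounds
```

With `redirty_ratio >= 1.0` the loop only exits via `max_rounds`, which is exactly the failure mode the previous slide describes for write-heavy SAP workloads.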
  21. 21. Pre-copy live migration. Reducing the number of page re-sends: LRU page reordering, so that pages with a high chance of being re-dirtied soon are delayed until later. Reducing the cost of a page re-send: the XBZRLE delta encoder represents page changes much more efficiently.
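The idea behind XBZRLE (XOR-based zero run-length encoding) can be shown in a few lines: XOR the old and new copies of a page, then store only the runs of changed bytes, since most of the XOR image is zero when a page changes slightly. This is a toy sketch of the concept, not QEMU's actual encoder or wire format:

```python
def xbzrle_encode(old: bytes, new: bytes) -> list:
    """XOR old and new, then emit (zero_run_length, changed_bytes) pairs."""
    diff = bytes(a ^ b for a, b in zip(old, new))
    out, i = [], 0
    while i < len(diff):
        zrun = 0
        while i < len(diff) and diff[i] == 0:   # count unchanged bytes
            zrun += 1
            i += 1
        lit = bytearray()
        while i < len(diff) and diff[i] != 0:   # collect changed bytes
            lit.append(diff[i])
            i += 1
        out.append((zrun, bytes(lit)))
    return out

def xbzrle_decode(old: bytes, encoded: list) -> bytes:
    """Rebuild the new page by XOR-patching the old copy."""
    page, pos = bytearray(old), 0
    for zrun, lit in encoded:
        pos += zrun                              # skip unchanged run
        for b in lit:
            page[pos] ^= b                       # apply delta byte
            pos += 1
    return bytes(page)
```

For a 4 KB page with a handful of changed words, the encoded form is a few bytes instead of a full page re-send, which is where the cost reduction comes from.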
  22. 22. More than one way to live migrate…
      Iterative pre-copy live migration: pre-migrate and reservation → X pre-copy rounds (live on A) → stop and copy (downtime) → commit (live on B).
      Post-copy live migration: pre-migrate and reservation → stop and copy minimal state (downtime) → one page-pushing round (degraded on B) → commit (live on B).
      Hybrid post-copy live migration: pre-migrate and reservation → X pre-copy rounds (live on A) → stop and copy (downtime) → one page-pushing round (degraded on B) → commit (live on B).
  23. 23. Post-copy live migration using fast interconnects. In post-copy live migration, the state of the VM is transferred to the destination and activated before memory is transferred. The post-copy implementation includes handling of remote page faults and background transfer of memory pages. Service degradation is mitigated by: RDMA zero-copy interconnects; pre-paging (similar in concept to pre-fetching); hybrid post-copy, which begins with a pre-copy phase; and MMU integration, eliminating the need for a VM pause.
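The two mechanisms above — on-demand remote page faults plus a background push — can be modeled in a small sketch (class and method names are hypothetical; the real system does this at the MMU/RDMA level, not in Python):

```python
class PostCopyDest:
    """Toy model of a post-copy destination: the VM runs here at once;
    a page absent locally is fetched from the source on first access
    (a remote page fault), while background pre-paging pushes the rest."""
    def __init__(self, source_pages):
        self.source = dict(source_pages)   # pages still on the source host
        self.local = {}                    # pages already transferred
        self.faults = 0

    def read(self, pfn):
        if pfn not in self.local:          # remote page fault path
            self.faults += 1
            self.local[pfn] = self.source.pop(pfn)
        return self.local[pfn]

    def background_push(self, n=1):
        for pfn in list(self.source)[:n]:  # pre-paging: push remaining pages
            self.local[pfn] = self.source.pop(pfn)
```

The design point is that total migration time becomes deterministic (each page moves exactly once), and fast interconnects keep the cost of each demand fault low.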
  24. 24. Demo
  25. 25. Memory Aggregation: in the oven…
  26. 26. The Memory Cloud turns memory into a distributed memory service: it breaks memory from the bounds of the physical box, yields double-digit percentage gains in economics, and is transparent to IT deployment, with performance at scale and reliability. (Diagram: servers 1-3, each running a VM with its application, RAM and storage, pooled into the memory cloud.)
  27. 27. RRAIM: Remote Redundant Array of Inexpensive Memory, supporting large memory instances on demand.
      Business problem: current instance memory sizes are constrained by the physical host's memory size (Amazon's biggest VM occupies a whole physical host); heavy swap usage slows execution time for data-intensive applications.
      Hecatonchire solution: the VM swaps to a memory cloud, accessing remote DRAM via fast zero-copy RDMA interconnects; remote DRAM latency is hidden by page pre-pushing; MMU integration keeps it transparent to applications and VMs; reliability comes from a RAID-1 (mirroring)-like scheme.
      Hecatonchire value proposition: memory aggregation on demand; totally transparent to the workload (no integration needed); no hardware investment, no dedicated servers. (Diagram: application memory cloud backed by compression, de-duplication, N-tier storage and HR-HA.)
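The RAID-1-like reliability scheme amounts to mirroring every remote page write across two hosts so that losing one host loses no data. A minimal sketch of that idea (the class and API are hypothetical; Hecatonchire does this over RDMA inside the kernel):

```python
class RRAIMStore:
    """Toy RAID-1-style mirroring of memory pages across two remote hosts."""
    def __init__(self):
        self.mirrors = [{}, {}]            # two remote memory stores

    def write(self, pfn, data):
        for store in self.mirrors:         # every write goes to both mirrors
            store[pfn] = data

    def read(self, pfn):
        for store in self.mirrors:         # fall through to the survivor
            if pfn in store:
                return store[pfn]
        raise KeyError(pfn)

    def fail(self, idx):
        self.mirrors[idx].clear()          # simulate losing one remote host
```

As with disk RAID-1, the trade is capacity (every page stored twice) for availability, which matches the slide's "mirroring-like schema" framing.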
  28. 28. Hecatonchire / RRAIM: breakthrough capability. Breaking the memory-box barrier for memory-intensive applications. (Chart: access speed (1 us to 10 ms) vs. capacity (MB to PB): L1/L2 cache and DRAM, then SSD, then local disk, NAS and SAN; embedded, local and networked resources are separated by a performance barrier.)
  29. 29. Lego Cloud architecture (memory block). Memory VMs, compute VMs and combination VMs combine guest and host memory drawn from an RRAIM memory cloud, coordinated by cloud management services, across many physical nodes hosting a variety of VMs.
  30. 30. Instant flash cloning on demand. Business problem: burst load / service usage that cannot be satisfied in time. Existing solutions (vendors: Amazon, VMware, RightScale) start the VM from a disk image, requiring the full VM OS startup sequence. Hecatonchire solution: using a paused VM as the source for copy-on-write (CoW), we perform a post-copy live migration. Hecatonchire value proposition: just-in-time (sub-second) provisioning.
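The copy-on-write half of flash cloning can be sketched in a few lines (illustrative only; names are ours, and the real mechanism operates on guest page tables, not dictionaries): the clone shares the paused parent's pages and copies a page only when it first writes to it, so a new clone is ready in sub-second time.

```python
class CowClone:
    """Toy copy-on-write clone of a paused VM's memory."""
    def __init__(self, parent_pages):
        self.parent = parent_pages         # shared with the paused source VM
        self.private = {}                  # pages this clone has written

    def read(self, pfn):
        # reads fall through to the shared parent until first write
        return self.private.get(pfn, self.parent[pfn])

    def write(self, pfn, data):
        self.private[pfn] = data           # copy-on-write: parent untouched
```

Combined with post-copy migration, the clone can start running on another host immediately and pull shared pages on demand, which is what makes "instant" cloning possible.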
  31. 31. Instant flash cloning on demand. We can clone VMs to meet demand much faster than other solutions, reducing infrastructure costs while still minimizing lost opportunities: just-in-time provisioning. Requires application integration: we track OS/application metrics in running VMs or in the load balancer (LB); alerts are defined if metrics pass a pre-defined threshold; according to alerts we can scale up, adding more resources, or scale down to save on resources not utilized. Amazon Web Services - Guide
  32. 32. Compute Aggregation: our next challenge
  33. 33. Cost-effective "small" HPC grid. High Performance Computing (HPC): supercomputers at the frontline of processing speeds, with 10k-100k cores; typical benchmark: Top500 (linear algebra). A small HPC setup uses 10-20 commodity (2 TB / 80 core) nodes. Typical applications: relational databases, analytics tasks (linear algebra), simulations. Hecatonchire value proposition: optimal price/performance using commodity hardware; operational flexibility (node downtime without downing the cluster); seamless deployment within an existing cloud.
  34. 34. Distributed Shared Memory (DSM). Traditional cluster: distributed memory; standard interconnects; an OS instance on each node; distribution handled by the application. ccNUMA: cache-coherent shared memory; fast interconnects; one OS instance; distribution handled by hardware. Vendors: ScaleMP, Numascale, others.
  35. 35. Distributed shared memory: inherent limitations. Linux provides NUMA topology discovery: the distance between compute cores, and the distance from cores to memory. But while the Linux OS is aware of the NUMA layout, the application may not be. Cache coherency may get very expensive: inter-core (L3 cache): 20 ns; inter-socket (main memory): 100 ns; inter-node (IB, remote memory): 2,000 ns. Thus the ccNUMA architecture may not "really" be transparent to the application!
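The latency tiers above make the limitation concrete: even a small fraction of inter-node traffic dominates average access time. A toy cost model using the slide's numbers (the model structure and names are ours, not part of Hecatonchire):

```python
# Latency tiers from the slide, in nanoseconds
LATENCY_NS = {"l3": 20, "local_dram": 100, "remote_ib": 2000}

def avg_access_ns(mix):
    """Average memory-access latency for a workload, where `mix` maps
    each tier to the fraction of accesses hitting that tier."""
    return sum(frac * LATENCY_NS[tier] for tier, frac in mix.items())

# All-local workload vs. one sending just 10% of accesses over IB
local_only = avg_access_ns({"l3": 0.5, "local_dram": 0.5})
with_remote = avg_access_ns({"l3": 0.5, "local_dram": 0.4,
                             "remote_ib": 0.1})
```

Here 10% remote traffic raises the average from 60 ns to 250 ns, which is why a NUMA-oblivious application on a ccNUMA system may see the hardware's "transparency" but not its performance.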
  36. 36. Summary
  37. 37. Roadmap.
      2011: Live migration: pre-copy XBZRLE delta encoding; pre-copy LRU page reordering; post-copy using RDMA interconnects.
      2012: Resource aggregation: cloud management integration; memory aggregation with RAIM (Redundant Array of Inexpensive Memory); I/O aggregation with vRAID (virtual Redundant Array of Inexpensive Disks); flash cloning.
      2013: Lego landscape: CPU aggregation (ccNUMA); flexible resource management.
  38. 38. Key takeaways. Hecatonchire extends the standard Linux stack, requiring only standard commodity hardware. With Hecatonchire, unmodified applications or VMs can tap into remote resources transparently. To be released as open source under the GPLv2 and LGPL licenses to the Qemu and Linux communities. Developed by SAP Research TI.
  39. 39. Thank you. Contact information: Benoit Hudzia, Sr. Researcher, SAP Research CEC; Aidan Shribman, Sr. Researcher, SAP Research. Hecatonchire wiki: hire%2C++Distributed+Shared+Memory+%28DSM%29+And+Datacenter+Resources+disaggregation+for+Cloud
  40. 40. Appendix
  41. 41. Linux Kernel Virtual Machine (KVM). Released as a Linux kernel module (LKM) under the GPLv2 license in 2007 by Qumranet. Provides full virtualization via the Intel VT-x and AMD-V virtualization extensions to the x86 instruction set. Uses Qemu for invoking KVM, for handling I/O, and for advanced capabilities such as VM live migration. KVM is considered the primary hypervisor on most major Linux distributions, such as RedHat and SuSE.
  42. 42. Remote page faulting: architecture comparison. Hecatonchire: no context switches; zero-copy; uses iWARP RDMA (Hudzia and Shribman, SYSTOR 2012). Yobusame: context switches into user mode; uses standard TCP/IP transport (Hirofuchi and Yamahata, KVM Forum 2011).
  43. 43. Legal Disclaimer. The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation, and SAP's strategy and possible future developments, products and/or platform directions and functionality are all subject to change and may be changed by SAP at any time for any reason without notice. The information in this document is not a commitment, promise or legal obligation to deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or omissions in this document, except if such damages were caused by SAP intentionally or by gross negligence. All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as of their dates, and they should not be relied upon in making purchasing decisions.