
Lego Cloud SAP Virtualization Week 2012

This session demonstrates that by extending KVM we can deliver, non-disruptively, the next level of IaaS platform modularization. We first show instantaneous live migration of VMs, then introduce the memory aggregation concept, and finally show how to achieve full operational flexibility by disaggregating datacenter resources into their core elements.


  1. TRND04: The Lego Cloud. Benoit Hudzia, Sr. Researcher, SAP Research CEC Belfast; Aidan Shribman, Sr. Researcher, SAP Research Israel
  2. Agenda: Introduction; Hardware Trends; Live Migration; Memory Aggregation; Compute Aggregation; Summary
  3. Introduction: the evolution of the datacenter
  4. Evolution of virtualization: No virtualization -> Consolidation -> Basic (Cloud) -> Flexible Resources Management -> Resources Disaggregation (True utility Cloud)
  5. Why disaggregate resources? Better performance: replace slow local devices (e.g. disk) with fast remote devices (e.g. DRAM), and have many remote devices working in parallel (DRAM, disk, compute). Superior scalability: go beyond the boundaries of a single node. Improved economics: do more with existing hardware and reach better hardware-utilization levels.
  6. The Hecatonchires Project. Hecatonchires: "Hundred Headed One". Original idea: provide Distributed Shared Memory (DSM) capabilities to the cloud. Strategic goal: full resource liberation brought to the cloud by providing more resource flexibility within the current cloud paradigm, breaking nodes down into their basic elements (CPU, memory, I/O); extending the existing cloud software stack (KVM, Qemu, libvirt, OpenStack) without degrading any existing capabilities; and using commodity cloud hardware: medium-sized hosts (typically 64 GB and 8/16 cores) and standard interconnects (such as 1 GE or 10 GE). Initiated by Benoit Hudzia in 2011; currently developed by two small teams of researchers from the TI Practice, located in Belfast and Ra'anana.
  7. High-level architecture. No special hardware is required beyond RDMA-enabled NICs, which support the low-overhead, low-latency communication layer. VMs are no longer bounded by host size, as H/W resources such as memory, I/O and compute can be aggregated. Different-sized VMs can share the infrastructure, so smaller VMs not requiring dedicated hosts are still supported. The application stack runs unmodified. [Diagram: guest VMs (App/OS) spanning Server #1 .. Server #n, each contributing CPUs, memory and I/O over fast RDMA communication.]
  8. The Team - Panoramic View
  9. Hardware trends: are hosts getting closer?
  10. CPUs stopped getting faster. Moore's law prevailed until 2003, when core speed hit a practical limit of about 3.4 GHz. In the datacenter, cores run even slower, at 2.0-2.8 GHz, for power-conservation reasons. Since 2000 you do get more cores, but that does not affect compute-cycle or instruction latencies. Effectively, arbitrary sequential algorithms have not gotten faster since. Source: http://www.intel.com/pressroom/kits/quickrefyr.htm
  11. DRAM latency has remained constant. CPU clock speed and memory bandwidth increased steadily (at least until 2000), but memory latency remained constant - so local memory has effectively gotten slower from the CPU's perspective. Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
  12. Disk latency has virtually not improved. A standard 1980s disk spun at 3,600 RPM; a standard 2010s disk spins at 7,200 RPM. A 2x speedup in 30 years is negligible - effectively, disk has become slower from the CPU's perspective.

      RPM                   3,600  4,200  4,500  4,900  5,200  5,400  7,200  10,000  12,000  15,000
      Average latency (ms)    8.3    7.1    6.7    6.1    5.8    5.6    4.2     3.0     2.5     2.0

      Panda et al. Supercomputing 2009
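
The latency row is just the average rotational latency - half a revolution, i.e. 30,000 / RPM milliseconds. A quick sketch of ours (not from the deck) that reproduces the table:

```c
#include <stdio.h>

/* Average rotational latency is half a revolution:
 * latency_ms = 0.5 * 60,000 / RPM = 30,000 / RPM */
int main(void)
{
    const int rpm[] = {3600, 4200, 4500, 4900, 5200, 5400,
                       7200, 10000, 12000, 15000};
    for (size_t i = 0; i < sizeof rpm / sizeof rpm[0]; i++)
        printf("%6d RPM -> %.1f ms\n", rpm[i], 30000.0 / rpm[i]);
    return 0;
}
```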
  13. But: networks are steadily getting faster. Since 1979 we went from 0.01 Gbit/s up to 64 Gbit/s - a 6,400x speedup. A competitive marketplace: 10 and 40 Gbps Ethernet originated from network interconnects, while 40 Gbps QDR InfiniBand originated from computer-internal bus technology. InfiniBand/Ethernet convergence: Virtual Protocol Interconnects; InfiniBand over Ethernet; RDMA over Converged Enhanced Ethernet; all using standard semantics defined by OFED. [Chart: network performance in Gbit/s over time.] Panda et al. Supercomputing 2009
  14. And: communication stacks are becoming faster. Network-stack deficiencies: application/OS context switches, intermediate buffer copies, and transport processing. The RDMA OFED Verbs API provides zero copy, offloading of TCP to the NIC using RoCE, and the flexibility to use IB, GE or iWARP. The result: reduced latency, processor offloading, and operational flexibility.
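
For a flavor of the verbs API named above, here is a minimal fragment of ours (not deck code) that posts a one-sided RDMA READ; it assumes an already-connected queue pair qp, a registered local region mr, and the peer's remote_addr/rkey exchanged out of band:

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Post a one-sided RDMA READ: pull `len` bytes from the peer's memory
 * into our registered buffer without involving the remote CPU.
 * Assumes qp is connected (RTS) and mr covers local_buf. */
int rdma_read(struct ibv_qp *qp, struct ibv_mr *mr, void *local_buf,
              uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED,  /* request a completion */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);  /* 0 on success */
}
```

The completion is reaped from the send CQ with ibv_poll_cq before local_buf is reused; no context switch or intermediate copy happens on either side.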
  15. Benchmarking modern interconnects. Intel MPI Benchmarks (IMB) broadcast latency and exchange bandwidth, typically used in HPC and parallel computing. Comparing: 4x DDR IB using the Verbs API; 10 GE with a TOE (TCP offload engine); iWARP; and 1 GE. Measured latencies: IB 2 us; 10 GE 8.23 us; 1 GE 46.52 us. Source: Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet, IBM
  16. Conclusion: remote nodes have gotten closer. Interconnects have become much faster, and fast interconnects have become a commodity, moving out of the High Performance Computing (HPC) niche. IB latency of 2,000 ns is only about 20x slower than RAM and about 100x faster than SSD. Remote page faulting is much faster than traditional disk-backed page swapping! HANA Performance Analysis, Chaim Bendelac, 2011
  17. Result: blurring of the physical node boundaries. [Diagram: latency scale - local RAM ~100 ns, remote RAM over IB ~2,000 ns, disk ~10,000,000 ns.]
  18. Live Migration: pretext to Hecatonchire
  19. Enabling live migration of SAP workloads. Business problem: typical SAP workloads (e.g. SAP ERP) are transactional, large (possibly 64 GB), with a fast rate of memory writes; classic live migration fails for such workloads, as rapid memory writes cause memory pages to be re-sent over and over again. Hecatonchire's solution: enable live migration by reducing both the number of pages re-sent and the cost of a page re-send; non-intrusive, reducing downtime, service degradation, and total migration time.
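
Why rapid writers break pre-copy: each round must re-send whatever was dirtied while the previous round was on the wire, so the process converges only if the dirty rate stays below the transfer bandwidth. A toy model of ours, with illustrative numbers rather than measurements:

```c
#include <stdio.h>

/* Toy model of iterative pre-copy: each round re-sends the pages
 * dirtied during the previous round. Converges geometrically when
 * dirty_rate < bandwidth; stalls as the two approach each other. */
int main(void)
{
    double remaining = 64.0;        /* GB still to send (a 64 GB VM) */
    const double bandwidth = 1.25;  /* GB/s (10 GE), illustrative */
    const double dirty_rate = 0.8;  /* GB/s dirtied by the workload */

    for (int round = 1; round <= 30 && remaining > 0.1; round++) {
        double secs = remaining / bandwidth;  /* time for this round */
        remaining = dirty_rate * secs;        /* dirtied meanwhile */
        printf("round %2d: %.1f s, %6.2f GB left\n", round, secs, remaining);
    }
    return 0;
}
```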
  20. Live migration technique:
      - Pre-migration: VM active on host A; destination host selected (block devices mirrored).
      - Reservation: initialize container on target host.
      - Iterative pre-copy: copy dirty pages in successive rounds.
      - Stop and copy: suspend VM on host A; redirect network traffic; synch remaining VM state; activate on host B.
      - Commitment: VM state on host A released.
  21. Pre-copy live migration. Reducing the number of page re-sends: LRU page reordering, such that pages with a high chance of being re-dirtied soon are delayed until later. Reducing the cost of a page re-send: the XBZRLE delta encoder represents page changes much more efficiently.
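
The idea behind XBZRLE (XOR-based zero run length encoding): diff the old and new copies of a page and send only the changed runs. A simplified sketch of ours - QEMU's real encoder uses variable-length counters and a cache of previously sent pages:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy XBZRLE-style encoder: compare old and new page bytewise
 * (equivalent to run-length encoding their XOR) and emit
 * (zero-run length, changed-run length, changed bytes) triples.
 * A page that changed little compresses to a few bytes. */
static size_t xbzrle_encode(const uint8_t *old, const uint8_t *new,
                            size_t page, uint8_t *out)
{
    size_t i = 0, o = 0;
    while (i < page) {
        uint16_t zrun = 0, nzrun = 0;
        while (i < page && old[i] == new[i] && zrun < UINT16_MAX)
            zrun++, i++;
        size_t start = i;
        while (i < page && old[i] != new[i] && nzrun < UINT16_MAX)
            nzrun++, i++;
        memcpy(out + o, &zrun, 2);            o += 2;
        memcpy(out + o, &nzrun, 2);           o += 2;
        memcpy(out + o, new + start, nzrun);  o += nzrun;
    }
    return o;  /* encoded size; caller falls back if >= page */
}

int main(void)
{
    uint8_t a[4096] = {0}, b[4096] = {0}, out[8192];
    memcpy(b + 100, "dirty", 5);  /* small change within the page */
    printf("encoded %zu bytes instead of 4096\n",
           xbzrle_encode(a, b, sizeof a, out));
    return 0;
}
```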
  22. More than one way to live migrate:
      - Iterative pre-copy live migration: pre-migrate/reservation; X pre-copy rounds; stop and copy; commit. Live on A -> downtime -> live on B.
      - Post-copy live migration: pre-migrate/reservation; stop and copy; commit; 1 page-pushing round. Live on A -> downtime -> degraded on B -> live on B.
      - Hybrid post-copy live migration: pre-migrate/reservation; X iterative pre-copy rounds; stop and copy; commit; 1 page-pushing round. Live on A -> downtime -> degraded on B -> live on B.
      In each case the total migration time spans all phases.
  23. Post-copy live migration using fast interconnects. In post-copy live migration, the state of the VM is transferred to the destination and activated before memory is transferred. The post-copy implementation includes handling of remote page faults and background transfer of memory pages. Service degradation is mitigated by: RDMA zero-copy interconnects; pre-paging, similar in concept to pre-fetching; hybrid post-copy, which begins with a pre-copy phase; and MMU integration, eliminating the need for a VM pause.
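
Hecatonchire implemented remote page faults through its own kernel module and MMU integration; the deck predates Linux's userfaultfd (merged in 4.3, years after this talk), but the fault-service loop is easiest to illustrate with it today. A self-contained sketch of ours - the remote fetch is faked with a memset where a real implementation would issue an RDMA read from the source host:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static long PAGE;

/* Fault-handling thread: each missing-page fault would trigger a
 * remote fetch from the migration source; here we fill a marker. */
static void *pager(void *arg)
{
    int uffd = *(int *)arg;
    void *buf;
    posix_memalign(&buf, PAGE, PAGE);
    for (;;) {
        struct pollfd pfd = { .fd = uffd, .events = POLLIN };
        poll(&pfd, 1, -1);
        struct uffd_msg msg;
        if (read(uffd, &msg, sizeof msg) != sizeof msg) continue;
        if (msg.event != UFFD_EVENT_PAGEFAULT) continue;
        memset(buf, 0x42, PAGE);              /* stand-in for RDMA read */
        struct uffdio_copy cp = {
            .dst = msg.arg.pagefault.address & ~(PAGE - 1),
            .src = (unsigned long)buf, .len = PAGE,
        };
        ioctl(uffd, UFFDIO_COPY, &cp);        /* resolve the fault */
    }
    return NULL;
}

int main(void)
{
    PAGE = sysconf(_SC_PAGESIZE);
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    char *mem = mmap(NULL, 16 * PAGE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)mem, .len = 16 * PAGE },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);       /* faults now routed to uffd */

    pthread_t t;
    pthread_create(&t, NULL, pager, &uffd);
    printf("first touch: 0x%02x\n", (unsigned char)mem[3 * PAGE]);
    return 0;
}
```

Build with gcc -pthread on a kernel with userfaultfd enabled; the first touch of an unpopulated page blocks until the pager thread supplies it, exactly the post-copy access pattern.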
  24. Demo
  25. Memory Aggregation: in the oven …
  26. The Memory Cloud: turns memory into a distributed memory service. [Diagram: VMs and applications on Servers 1-3 drawing RAM from a shared pool.] Breaks memory from the bounds of the physical box. Yields double-digit percentage gains in IT economics. Transparent deployment with performance at scale and reliability.
  27. RRAIM: Remote Redundant Array of Inexpensive Memory - supporting large memory instances on demand.
      Business problem: current instance memory sizes are constrained by the physical host's memory size (Amazon's biggest VM occupies a whole physical host); heavy swap usage slows execution time for data-intensive applications.
      Hecatonchire solution: the VM swaps to the memory cloud - access remote DRAM via fast zero-copy RDMA interconnects; hide remote DRAM latency by page pre-pushing; MMU integration for transparency to applications and VMs; reliability via a RAID-1 (mirroring) like schema.
      Value proposition: memory aggregation on demand; totally transparent to the workload (no integration needed); no hardware investment, no dedicated servers! [Diagram labels: application RAM backed by the memory cloud; compression / de-duplication / n-tiers storage / HR-HA.]
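
The RAID-1 idea can be sketched with the same verbs primitives as before: post an identical RDMA WRITE to two replica hosts and acknowledge only when both complete. The mirror struct and function below are our hypothetical illustration, not Hecatonchire's API:

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Hypothetical mirror descriptor: one per replica host. */
struct mirror {
    struct ibv_qp *qp;     /* connected queue pair to this replica */
    uint64_t remote_addr;  /* where this page lives on the replica */
    uint32_t rkey;
};

/* RAID-1 style page write-out: post the same RDMA WRITE to both
 * replicas; completion of both means the page is safely mirrored. */
int rraim_write_page(struct mirror m[2], void *page, uint32_t len,
                     struct ibv_mr *mr)
{
    for (int i = 0; i < 2; i++) {
        struct ibv_sge sge = {
            .addr = (uintptr_t)page, .length = len, .lkey = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = m[i].remote_addr;
        wr.wr.rdma.rkey        = m[i].rkey;
        struct ibv_send_wr *bad = NULL;
        int rc = ibv_post_send(m[i].qp, &wr, &bad);
        if (rc) return rc;  /* caller degrades to single-copy mode */
    }
    return 0;  /* poll both CQs before acking the write upstream */
}
```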
  28. Hecatonchire / RRAIM: breakthrough capability - breaking the memory box barrier for memory-intensive applications. [Chart: access speed (1 μs to 10 ms) vs capacity (MB to PB); embedded resources (L1/L2 cache, DRAM), local resources (SSD, local disk), networked resources (NAS, SAN), with the performance barrier between them.]
  29. Lego Cloud architecture (memory block). [Diagram: memory VMs, compute VMs and combination VMs - host memory, guest memory, guest & host memory - backed by the RRAIM memory cloud, with cloud management services spanning many physical nodes hosting a variety of VMs.]
  30. Instant flash cloning on demand. Business problem: burst load / service usage that cannot be satisfied in time. Existing solutions (vendors: Amazon, VMware, RightScale) start a VM from a disk image and require a full VM OS startup sequence. Hecatonchire solution: use a paused VM as the source for copy-on-write (CoW) and perform a post-copy live migration. Value proposition: just-in-time (sub-second) provisioning.
  31. Instant flash cloning on demand. We can clone VMs to meet demand much faster than other solutions, reducing infrastructure costs while still minimizing lost opportunities: just-in-time provisioning. Requires application integration: we track OS/application metrics in running VMs or in the load balancer (LB); alerts fire when metrics pass a pre-defined threshold; according to alerts we scale up, adding more resources, or scale down to save on unutilized resources. (Cf. the Amazon Web Services guide.) See the CoW sketch below.
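
CoW cloning from a paused VM is the same trick fork() plays with a process image: the clone starts instantly and pages are duplicated only when one side writes. A toy analogy of ours, not Hecatonchire code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork() clones a large address space instantly because pages are
 * shared copy-on-write: only pages the clone writes get duplicated.
 * Flash cloning applies the same semantics to a whole (paused) VM. */
int main(void)
{
    size_t size = 256 * 1024 * 1024;   /* "VM memory": 256 MB */
    char *mem = malloc(size);
    memset(mem, 1, size);              /* fault it all in */

    pid_t pid = fork();                /* instant CoW clone */
    if (pid == 0) {
        mem[0] = 2;                    /* only this page is copied */
        printf("clone: one page diverged, rest still shared\n");
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent: mem[0] still %d\n", mem[0]);
    free(mem);
    return 0;
}
```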
  32. Compute Aggregation: our next challenge
  33. Cost-effective "small" HPC grid. High Performance Computing (HPC): supercomputers at the frontline of processing speeds, with 10k-100k cores; typical benchmark: Top 500 (linear algebra); small HPC using 10-20 commodity (2 TB / 80-core) nodes. Typical applications: relational databases; analytics tasks (linear algebra); simulations. Value proposition: optimal price/performance using commodity hardware; operational flexibility - node downtime without downing the cluster; seamless deployment within an existing cloud.
  34. Distributed Shared Memory (DSM).
      Traditional cluster: distributed memory; standard interconnects; OS instance on each node; distribution handled by the application.
      ccNUMA: cache-coherent shared memory; fast interconnects; one OS instance; distribution handled by hardware. Vendors: ScaleMP, Numascale, others.
  35. Distributed Shared Memory - inherent limitations. Linux provides NUMA topology discovery: distance between compute cores, and from cores to memory (see the sketch below). But while the Linux OS is aware of the NUMA layout, the application may not be. Cache coherency may get very expensive: inter-core (L3 cache) 20 ns; inter-socket (main memory) 100 ns; inter-node (IB, remote memory) 2,000 ns. Thus the ccNUMA architecture may not "really" be transparent to the application!
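
A small sketch of ours for the topology discovery mentioned above, using libnuma's distance matrix (the same data the kernel exposes under /sys/devices/system/node/node&lt;N&gt;/distance); 10 means local, larger values proportionally slower:

```c
#include <numa.h>    /* link with -lnuma */
#include <stdio.h>

/* Print the kernel's NUMA distance matrix: exactly the access
 * asymmetry that a ccNUMA-spanning application must care about. */
int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    int n = numa_max_node();
    for (int i = 0; i <= n; i++) {
        for (int j = 0; j <= n; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}
```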
  36. Summary
  37. Roadmap.
      2011 - Live migration: pre-copy XBZRLE delta encoding; pre-copy LRU page reordering; post-copy using RDMA interconnects.
      2012 - Resource aggregation: cloud management integration; memory aggregation - RAIM (Redundant Array of Inexpensive Memory); I/O aggregation - vRAID (virtual Redundant Array of Inexpensive Disks); flash cloning.
      2013 - Lego landscape: CPU aggregation (ccNUMA); flexible resource management.
  38. Key takeaways. Hecatonchire extends the standard Linux stack, requiring only standard commodity hardware. With Hecatonchire, unmodified applications and VMs can tap into remote resources transparently. To be released as open source under the GPLv2 and LGPL licenses, to the Qemu and Linux communities. Developed by SAP Research TI.
  39. Thank you. Contact information: Benoit Hudzia, Sr. Researcher, SAP Research CEC Belfast, benoit.hudzia@sap.com; Aidan Shribman, Sr. Researcher, SAP Research Israel, aidan.Shribman@sap.com. Hecatonchire wiki: https://wiki.wdf.sap.corp/wiki/display/cecbelfast/Hecatonchire%2C++Distributed+Shared+Memory+%28DSM%29+And+Datacenter+Resources+disaggregation+for+Cloud
  40. Appendix
  41. Linux Kernel Virtual Machine (KVM). Released as a Linux kernel module (LKM) under the GPLv2 license in 2007 by Qumranet. Full virtualization via the Intel VT-x and AMD-V virtualization extensions to the x86 instruction set. Uses Qemu for invoking KVM, for handling of I/O, and for advanced capabilities such as VM live migration. KVM is considered the primary hypervisor on most major Linux distributions, such as RedHat and SuSE.
  42. Remote page faulting - architecture comparison.
      Hecatonchire: no context switches; zero-copy; uses iWARP RDMA. (Hudzia and Shribman, SYSTOR 2012)
      Yobusame: context switches into user mode; uses standard TCP/IP transport. (Horofuchi and Yamahata, KVM Forum 2011)
  43. Legal Disclaimer. The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation, and SAP's strategy and possible future developments, products and/or platform directions and functionality are all subject to change and may be changed by SAP at any time for any reason without notice. The information in this document is not a commitment, promise or legal obligation to deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent. All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as of their dates, and they should not be relied upon in making purchasing decisions.
