SAP Virtualization Week 2012 - The Lego Cloud

Transcript

  • 1. SAP Virtualization Week 2012: TRND04; SAP DKOM 2012: NA 6747. The Lego Cloud. Benoit Hudzia, Sr. Researcher, SAP Research CEC Belfast (UK); Aidan Shribman, Sr. Researcher, SAP Research Israel
  • 2. Agenda: Introduction; Hardware Trends; Live Migration; Flash Cloning; Memory Pooling; Distributed Shared Memory; Summary
  • 3. Introduction: The evolution of the datacenter
  • 4. Evolution of Virtualization: No virtualization → Basic consolidation → Flexible resource management (Cloud) → Resource disaggregation (True Utility Cloud)
  • 5. Why Disaggregate Resources?
    Better performance: replace slow local devices (e.g. disk) with fast remote devices (e.g. DRAM); many remote devices working in parallel (DRAM, disk, compute).
    Superior scalability: go beyond the boundaries of a single node.
    Improved economics: do more with existing hardware and reach better hardware utilization levels.
  • 6. The Hecatonchire Project
    "Hecatonchires" in Greek mythology means "Hundred-Handed Ones". The original idea: provide Distributed Shared Memory (DSM) capabilities to the cloud.
    Strategic goal: full resource liberation brought to the cloud by:
    – Breaking down physical nodes into their core elements (CPU, memory, I/O)
    – Extending the existing cloud software stack (KVM, QEMU, libvirt, OpenStack) without degrading any existing capabilities
    – Using commodity cloud hardware and standard interconnects
    Initiated by Benoit Hudzia in 2011. Currently developed by two SAP Research TI Practice teams located in Belfast and Ra'anana.
    Hecatonchire is not a monolithic project but a set of separate capabilities. We are currently identifying stakeholders and defining use cases for each capability.
  • 7. Hecatonchire Architecture
    Cluster servers:
    – Commodity hosts (e.g. 64 GB, 16 cores)
    – Commodity network adapters: standard – SoftiWARP over 1 GbE; enterprise – RoCE/iWARP over 10 GbE or native IB
    – A modified version of the QEMU/KVM hypervisor
    – An RDMA remote-memory kernel module
    Guests / VMs:
    – Use resources (CPUs, memory, I/O) from one or several underlying hosts over fast RDMA communication
    – Existing OS/applications can run transparently (not exactly, but we will get to this later)
  • 8. The Team – Panoramic View
  • 9. Hardware Trends: The blurring of physical host boundaries
  • 10. DRAM Latency Has Remained Constant
    CPU clock speed and memory bandwidth have increased steadily while memory latency has remained constant. As a result, local memory appears slower from the CPU's perspective.
    Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
  • 11. CPU Cores Stopped Getting Faster
    Moore's law prevailed, but around 2005 clock speeds hit a practical limit of about 3.4 GHz. The "single-threaded free lunch" (as coined by Herb Sutter) is over: CPU cores have stopped getting faster, but you do get more cores now.
    Sources: http://www.intel.com/pressroom/kits/quickrefyr.htm; "The Free Lunch Is Over" by Herb Sutter
  • 12. But… Interconnects Continue to Evolve (providing higher bandwidth and lower latency)
  • 13. Result: Remote Nodes Are Becoming "Closer"
    Accessing DRAM on a remote host via IB interconnects is only about 20x slower than local DRAM. Remote DRAM is roughly 100x faster than local SSD and 5,000x faster than local HDD.
    Source: HANA Performance Analysis, Intel Westmere (formerly Nehalem-C) and IB QDR, Chaim Bendelac, 2011
  • 14. Result: Blurring the Boundaries of the Physical Host
    [Figure: access latencies across hosts – local DRAM in the 15–100 ns range, remote DRAM over RDMA at roughly 2,000 ns, local disk at roughly 10,000,000 ns]
  • 15. Live Migration: Serving as a platform to evaluate remote page faulting
  • 16. Enabling Live Migration of SAP Workloads
    Business problem: Typical SAP workloads such as SAP ERP are transactional, large, and have a fast rate of memory writes. Classic live migration fails for such workloads because rapid memory writes cause memory pages to be re-sent over and over again.
    Hecatonchire's solution: Enable live migration by reducing both the number of pages re-sent and the cost of each page re-send. Across-the-board improvement of live-migration metrics:
    – Downtime: reduced
    – Service degradation: reduced
    – Total migration time: reduced
  • 17. Classic Pre-Copy Live Migration
    Stages: pre-migration (destination host selected, block devices mirrored); reservation (initialize container on target host); iterative pre-copy (VM active on host A, copy dirty pages in successive rounds); stop-and-copy (suspend VM on host A, copy remaining state, redirect network traffic, synch devices); commitment (VM state on host A released, VM activated on host B).
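    The convergence problem is easy to see with a back-of-the-envelope simulation of the iterative pre-copy loop. The sketch below is illustrative only: the page counts, dirty rate and round cap are invented, not measurements from this presentation. It shows why a write-heavy guest keeps forcing pages to be re-sent.

      /* Toy simulation of the iterative pre-copy loop described above: keep
       * resending the pages the guest dirtied during the previous round until
       * the remaining dirty set is small enough to stop the VM and copy the
       * rest. All numbers are invented for illustration. */
      #include <stdio.h>

      #define PAGES        100000   /* guest pages                            */
      #define DIRTY_RATE   30000    /* pages re-dirtied per round (toy value) */
      #define STOP_THRESH  5000     /* stop-and-copy when this few remain     */
      #define MAX_ROUNDS   30

      int main(void) {
          long dirty = PAGES;        /* round 0: everything must be sent */
          long sent_total = 0;
          int round = 0;

          while (dirty > STOP_THRESH && round < MAX_ROUNDS) {
              sent_total += dirty;   /* transfer this round's dirty pages */
              /* While we were sending, the guest kept writing: the next
               * round's dirty set is whatever it touched meanwhile. */
              dirty = DIRTY_RATE < PAGES ? DIRTY_RATE : PAGES;
              round++;
          }
          sent_total += dirty;       /* final stop-and-copy while VM is paused */
          printf("rounds: %d, pages sent: %ld (%.1fx the VM size)\n",
                 round, sent_total, (double)sent_total / PAGES);
          return 0;
      }

    With these toy numbers the loop never converges: it hits the round cap after shipping roughly ten times the VM's memory, which is exactly the behaviour the optimizations on the next slide target.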
  • 18. Hecatonchire Pre-Copy Live Migration
    Reducing the number of page re-sends: LRU page reordering so that pages with a low chance of being re-dirtied are sent first (contribution to QEMU planned for 2012).
    Reducing the cost of a page re-send: the XBZRLE delta encoder represents page changes much more efficiently than resending whole pages (contributed to QEMU during 2011).
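    To make the delta-encoding idea concrete, here is a minimal sketch in the spirit of XBZRLE (xor-based zero run-length encoding). It is not QEMU's actual code or wire format, just an illustration of why a page with a few changed bytes can be re-sent in tens of bytes instead of 4 KiB.

      /* Sketch of the XBZRLE idea: bytes where the old and new copy of a page
       * match (i.e. XOR to zero) are encoded as a "skip" length, and changed
       * bytes are sent literally, as alternating
       * <skip length, literal length, literal bytes> records. */
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define PAGE_SIZE 4096

      /* Returns the encoded length, or -1 if it would not fit in dst
       * (in which case a real implementation resends the full page). */
      static int xbzrle_encode(const uint8_t *old_page, const uint8_t *new_page,
                               uint8_t *dst, int dst_len) {
          int out = 0, i = 0;
          while (i < PAGE_SIZE) {
              int zrun = 0, nzrun = 0;
              while (i + zrun < PAGE_SIZE && zrun < 255 &&
                     old_page[i + zrun] == new_page[i + zrun])
                  zrun++;
              while (i + zrun + nzrun < PAGE_SIZE && nzrun < 255 &&
                     old_page[i + zrun + nzrun] != new_page[i + zrun + nzrun])
                  nzrun++;
              if (out + 2 + nzrun > dst_len)
                  return -1;
              dst[out++] = (uint8_t)zrun;   /* skip this many unchanged bytes */
              dst[out++] = (uint8_t)nzrun;  /* then overwrite this many bytes */
              memcpy(dst + out, new_page + i + zrun, nzrun);
              out += nzrun;
              i += zrun + nzrun;
          }
          return out;
      }

      int main(void) {
          uint8_t old_page[PAGE_SIZE] = {0}, new_page[PAGE_SIZE] = {0};
          uint8_t buf[PAGE_SIZE];
          new_page[100] = 0x42;             /* guest dirtied a single byte */
          int n = xbzrle_encode(old_page, new_page, buf, sizeof buf);
          printf("encoded %d bytes instead of %d\n", n, PAGE_SIZE);
          return 0;
      }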
  • 19. More Than One Way to Live Migrate…
    Pre-copy live migration: pre-migrate and reservation; X iterative pre-copy rounds; stop-and-copy; commit. The VM is live on A until the downtime window, then live on B.
    Post-copy live migration: pre-migrate and reservation; stop-and-copy of minimal state; one page-pushing round; commit. Live on A, brief downtime, degraded on B while pages arrive, then live on B.
    Hybrid post-copy live migration: pre-migrate and reservation; X iterative pre-copy rounds; stop-and-copy; one page-pushing round; commit. Live on A, brief downtime, degraded on B, then live on B.
    Total migration time spans from the start of migration until the VM is fully live on B.
  • 20. Hecatonchire Post-Copy Live Migration
    In post-copy live migration we reverse the order:
    1. Transfer of state: transfer the VM running state from A to B and immediately activate the VM on B.
    2. Transfer of memory: B can initiate a network-bound page fault handled by A; in the background, memory is actively pushed from A to B until completion.
    Post-copy has some unique advantages: downtime is minimal, as only a few MBs of a GB-sized VM need to be transferred before re-activation; total migration time is minimal and predictable.
    Hecatonchire's unique enhancements: a low-latency RDMA page-transfer protocol; a demand pre-paging (pre-fetching) mechanism; full Linux MMU integration; hybrid post-copy support.
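    The network-bound page fault mechanism can be illustrated in user space with an ordinary SIGSEGV handler: map a region with no access rights, and on the first touch of each page "fetch" its contents before the faulting access is retried. This is only a conceptual sketch (Hecatonchire handles the fault in the kernel/hypervisor and fetches over RDMA, and mprotect/memset are not formally async-signal-safe), but the control flow is the same: fault, fetch, resume.

      /* Conceptual demand-paging demo: pages become valid the first time
       * they are touched, filled by a stand-in for the remote source host. */
      #include <signal.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      #define REGION_PAGES 16
      static size_t page_size;
      static char *region;

      /* Stand-in "source host": in Hecatonchire this would be an RDMA fetch. */
      static void fetch_page_from_source(char *dst, size_t index) {
          memset(dst, 'A' + (int)(index % 26), page_size);
      }

      static void fault_handler(int sig, siginfo_t *si, void *ctx) {
          (void)sig; (void)ctx;
          char *addr = (char *)((uintptr_t)si->si_addr & ~(page_size - 1));
          if (addr < region || addr >= region + REGION_PAGES * page_size)
              _exit(1);                       /* a real crash, not our region */
          size_t index = (size_t)(addr - region) / page_size;
          mprotect(addr, page_size, PROT_READ | PROT_WRITE);
          fetch_page_from_source(addr, index); /* "resolve" the remote fault */
      }

      int main(void) {
          page_size = (size_t)sysconf(_SC_PAGESIZE);
          region = mmap(NULL, REGION_PAGES * page_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (region == MAP_FAILED) { perror("mmap"); return 1; }

          struct sigaction sa;
          memset(&sa, 0, sizeof sa);
          sa.sa_flags = SA_SIGINFO;
          sa.sa_sigaction = fault_handler;
          sigaction(SIGSEGV, &sa, NULL);

          /* Touching an unmapped page triggers the handler, which fetches it. */
          printf("first byte of page 3: %c\n", region[3 * page_size]);
          printf("first byte of page 9: %c\n", region[9 * page_size]);
          return 0;
      }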
  • 21. Demo
  • 22. Flash Cloning: Sub-second elastic auto-scaling
  • 23. Automated Elasticity
    Elasticity is the basis of cloud economics: you can scale up or scale down on demand, and you only pay for what you use.
    The chart depicts the scaling evolution: scale-up approach – purchase bigger machines to meet rising demand; traditional scale-out approach – reconfigure the cluster size according to demand; automated elasticity – grow and shrink your resources automatically in response to changing demand as represented by monitored metrics.
    If you cannot respond fast enough, you may either miss business opportunities or have to increase your margin of purchased resources.
    Source: Amazon Web Services guide
  • 24. Hecatonchire Flash Cloning
    Business problem: AWS auto-scaling (and others) takes minutes to scale up:
    – Disk image cloned from a template (AMI) image
    – Full boot-up sequence of the VM
    – Acquiring an IP address via DHCP
    – Starting up the application
    Hecatonchire solution: provide just-in-time (sub-second) scaling according to demand:
    – Clone a paused source VM copy-on-write (CoW), including disk image, VM memory and VM state (registers, etc.)
    – Use a post-copy live-migration schema: page faulting fetches missing pages, with active page pushing in the background
    – Create a private network switch per clone (avoiding the need to assign a new MAC and reconfigure IP)
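    The copy-on-write behaviour that flash cloning relies on for the VM's memory image can be seen in miniature with fork(): the clone is created instantly, shares the parent's pages, and only pages that are actually written get duplicated. This is an analogy, not Hecatonchire's cloning code.

      /* Toy CoW illustration: after fork() the child shares the parent's
       * anonymous pages; writing to a page gives the child a private copy
       * of that page only, leaving the parent's data untouched. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/wait.h>
      #include <unistd.h>

      #define BUF_SIZE (256 * 1024 * 1024)   /* 256 MiB of "VM memory" */

      int main(void) {
          char *mem = malloc(BUF_SIZE);
          if (!mem) return 1;
          memset(mem, 0xAB, BUF_SIZE);        /* parent touches every page */

          pid_t pid = fork();                 /* "clone": instant, no copy yet */
          if (pid == 0) {
              mem[4096] = 0x01;               /* only this page gets copied */
              printf("clone sees parent data 0x%02x, own write 0x%02x\n",
                     (unsigned char)mem[0], (unsigned char)mem[4096]);
              _exit(0);
          }
          waitpid(pid, NULL, 0);
          printf("parent data untouched: 0x%02x\n", (unsigned char)mem[4096]);
          free(mem);
          return 0;
      }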
  • 25. Memory Pooling: Tapping into unused memory resources of remote hosts
  • 26. Hecatonchire Breakthrough Capability
    Breaking the memory-box barrier for memory-intensive applications.
    [Chart: access speed (nsec/usec/msec) versus capacity (MB to PB) for embedded resources, local disk/SSD and networked resources (NAS, SAN), with the performance barrier Hecatonchire aims to cross]
  • 27. The Memory Cloud
    Turns memory into a distributed memory service spanning servers, applications, memory and storage.
    Business problem: Large amounts of DRAM are required on demand from shared cloud hosts. Current cloud offerings are limited by the size of their physical host; AWS cannot go beyond 68 GB of DRAM, as these large-memory instances fully occupy the physical host.
    Hecatonchire solution: Access remote DRAM via a low-latency RDMA stack (using pre-pushing to hide latency). MMU integration gives transparent consumption by applications and VMs, and as a result also supports compression (zcache), de-duplication (KSM) and N-tier storage. No hardware investment needed, no dedicated servers.
  • 28. RRAIM: Remote Redundant Array of Inexpensive Memory
    Memory fault tolerance as part of a full HA solution: RRAIM-1 (mirroring) keeps an active master copy and a slave copy of the VM's RAM on different physical nodes. Combined with VM high availability (Hecatonchire with KVM Kemari / Xen Remus) under the cloud management stack, many physical nodes host a variety of VMs.
  • 29. Distributed Shared Memory: Our next challenge
  • 30. Cache-Coherent Non-Uniform Memory Access (ccNUMA)
    Traditional cluster: distributed memory; standard interconnects; an OS instance on each node; distribution handled by the application.
    ccNUMA: cache-coherent shared memory; fast interconnects; one OS instance; distribution handled by hardware/hypervisor.
  • 31. Hecatonchire Distributed Shared Memory (DSM) VM
  • 32. Hecatonchire DSM – Cache Coherency (CC) Challenge
    Standard ccNUMA: inter-node (2,000 ns) cache coherency takes too long; inter-node reads are expensive, and the processor cache is not large enough.
    Adding COMA (Cache-Only Memory Architecture): can help improve performance for multi-read scenarios, but a COMA implementation requires a 4 KB cache line, leading to false data sharing.
    NUMA topology / dynamic NUMA topology: an application's NUMA-aware implementation may not be complete, and dynamic changes in NUMA topology will not be supported by most current apps; we need to hide some of the performance challenges so that we can expose a fixed NUMA topology.
    Adding vCPU live migration: the compact vCPU state (only several KB) can be live-migrated.
  • 33. Summary
  • 34. Roadmap
    2011 – Live migration: pre-copy XBZRLE delta encoding; pre-copy LRU page reordering; post-copy using RDMA interconnects.
    2012 – Memory cloud: memory pooling; memory fault tolerance (RRAIM); flash cloning.
    2013 – Lego landscape: distributed shared memory; flexible resource management.
  • 35. Key Takeaways
    Hecatonchire extends the standard Linux stack, requiring only standard commodity hardware.
    With Hecatonchire, unmodified applications or VMs (which are NUMA-aware) can tap into remote resources transparently.
    To be released as open source under the GPLv2 and LGPL licenses to the QEMU and Linux communities.
    Developed by the SAP Research Technology Infrastructure (TI) Practice.
  • 36. Thank You
    Benoit Hudzia, Sr. Researcher, SAP Research CEC Belfast – benoit.hudzia@sap.com
    Aidan Shribman, Sr. Researcher, SAP Research Israel – aidan.shribman@sap.com
  • 37. Appendix
  • 38. Communication Stacks Have Become Leaner
    Traditional network interface: application/OS context switches; intermediate buffer copies; the OS handles transport processing.
    RDMA adapters: zero copy directly from/to application physical memory; transport processing is offloaded to the RDMA adapter, effectively bypassing the OS and CPU; a standard interface, OFED "verbs", supports all RDMA adapters (IB, RoCE, iWARP).
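    A small, hedged example of the verbs calls behind "zero copy": the application registers (pins) a buffer once with ibv_reg_mr(), and the returned lkey/rkey are what later work requests such as IBV_WR_RDMA_READ/IBV_WR_RDMA_WRITE reference, so the adapter can DMA directly into application memory. Error handling is trimmed; it needs an RDMA-capable device (hardware IB/RoCE/iWARP or a software provider such as siw/rxe) and links with -libverbs.

      /* Open the first RDMA device and register a buffer for RDMA access. */
      #include <infiniband/verbs.h>
      #include <stdio.h>
      #include <stdlib.h>

      int main(void) {
          int num;
          struct ibv_device **devs = ibv_get_device_list(&num);
          if (!devs || num == 0) {
              fprintf(stderr, "no RDMA devices found\n");
              return 1;
          }
          struct ibv_context *ctx = ibv_open_device(devs[0]);
          if (!ctx) { fprintf(stderr, "cannot open device\n"); return 1; }
          struct ibv_pd *pd = ibv_alloc_pd(ctx);

          /* Register 1 MiB of application memory for local and remote access.
           * Registration pins the pages so the adapter can DMA without the
           * kernel copying data or the CPU being involved per transfer. */
          size_t len = 1 << 20;
          void *buf = malloc(len);
          struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                         IBV_ACCESS_LOCAL_WRITE |
                                         IBV_ACCESS_REMOTE_READ |
                                         IBV_ACCESS_REMOTE_WRITE);
          if (!mr) { perror("ibv_reg_mr"); return 1; }
          printf("registered %zu bytes on %s, lkey=0x%x rkey=0x%x\n",
                 len, ibv_get_device_name(devs[0]), mr->lkey, mr->rkey);

          ibv_dereg_mr(mr);
          free(buf);
          ibv_dealloc_pd(pd);
          ibv_close_device(ctx);
          ibv_free_device_list(devs);
          return 0;
      }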
  • 39. Linux Kernel Virtual Machine (KVM)
    Released as a Linux kernel module (LKM) under the GPLv2 license in 2007 by Qumranet.
    Full virtualization via the Intel VT-x and AMD-V virtualization extensions to the x86 instruction set.
    Uses QEMU for invoking KVM, for handling I/O, and for advanced capabilities such as VM live migration.
    KVM is considered the primary hypervisor on most major Linux distributions such as Red Hat and SUSE.
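    A minimal probe of the /dev/kvm ioctl interface that QEMU builds on: it checks the API version and creates an empty VM file descriptor. This is a generic illustration of the public KVM API, not code from the Hecatonchire project.

      /* Probe /dev/kvm: requires the kvm module and VT-x/AMD-V support. */
      #include <fcntl.h>
      #include <linux/kvm.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <unistd.h>

      int main(void) {
          int kvm = open("/dev/kvm", O_RDWR);
          if (kvm < 0) {
              perror("open /dev/kvm (is the kvm module loaded?)");
              return 1;
          }
          int version = ioctl(kvm, KVM_GET_API_VERSION, 0);
          printf("KVM API version: %d\n", version);   /* stable value is 12 */

          /* A VM file descriptor is what QEMU attaches vCPUs and memory to. */
          int vmfd = ioctl(kvm, KVM_CREATE_VM, 0);
          printf("created empty VM, fd=%d\n", vmfd);

          close(vmfd);
          close(kvm);
          return 0;
      }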
  • 40. Remote Page Faulting Architecture Comparison
    Hecatonchire: no context switches; zero copy; uses iWARP RDMA (Hudzia and Shribman, SYSTOR 2012).
    Yabusame: context switches into user mode; uses standard TCP/IP transport (Hirofuchi and Yamahata, KVM Forum 2011).
  • 41. Hecatonchire DSM VM – ccNUMA Challenge
    Linux NUMA topology: Linux is aware of the NUMA topology (which cores and memory banks reside in each zone/node) and exposes it for applications to make use of.
    But it is up to the application to be NUMA-aware; if not, it may suffer when running on a NUMA topology. And even if the application is NUMA-aware, the longer time needed for cache coherency (cc) may hurt performance.
    Intel Nehalem memory hierarchy: inter-core (L3 cache) 20 ns; inter-socket (main memory) 100 ns; inter-node over IB (remote memory) 2,000 ns.
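    How an application consumes the topology Linux exposes can be sketched with libnuma (link with -lnuma): query the node count, allocate memory on a chosen node, and read the node-distance matrix. The node numbers and sizes below are illustrative; in a setup like the one described above, remote memory could surface as additional, more distant nodes.

      /* Query NUMA topology and place an allocation on a specific node. */
      #include <numa.h>
      #include <stdio.h>

      int main(void) {
          if (numa_available() < 0) {
              fprintf(stderr, "NUMA is not available on this system\n");
              return 1;
          }
          int nodes = numa_num_configured_nodes();
          printf("configured NUMA nodes: %d\n", nodes);

          /* Place 64 MiB explicitly on node 0; a NUMA-aware app would pick
           * the node closest to the threads that touch this memory. */
          size_t len = 64UL << 20;
          void *buf = numa_alloc_onnode(len, 0);
          if (!buf) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }

          /* Report the distance matrix row for node 0 (10 means local). */
          for (int n = 0; n < nodes; n++)
              printf("distance(0,%d) = %d\n", n, numa_distance(0, n));

          numa_free(buf, len);
          return 0;
      }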
  • 42. Legal Disclaimer
    The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, any related presentation, and SAP's strategy and possible future developments, products and/or platform directions and functionality are all subject to change and may be changed by SAP at any time for any reason without notice. The information in this document is not a commitment, promise or legal obligation to deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or omissions in this document, except if such damages were caused by SAP intentionally or through gross negligence.
    All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as of their dates, and they should not be relied upon in making purchasing decisions.