Hecatonchire - KVM Forum 2012 - Benoit Hudzia


Modularizing and aggregating physical resources in a datacenter depends not only on low-latency networking, but also on software techniques to deliver such capabilities. In this session we will present some practical features and results of our work, as well as discuss implementation details. Among these features are delivering high-performance, transparent, and partially fault-tolerant memory aggregation, and reducing the downtime of live migration using post-copy implementations.
We will present and discuss methods of transparently integrating with the MMU at the OS level, without impacting a running VM.
We will then introduce the Hecatonchire project ( http://www.hecatonchire.com ), which aims at the disaggregation of datacenter resources and enables true utility computing.

Published in: Technology
  • Speaker notes:
    • Sequential walk: each thread starts reading from (256k / nb_threads) * thread_id and ends when it reaches the start of the following thread's range.
    • Binary split: we split the memory into nb_threads regions; each thread then does a binary-split walk within its region.
    • Random walk: each thread reads a page randomly chosen within the overall memory region (no duplicates).
    • We are maxing out the 10GbE bandwidth with iWARP.
    • We suspect that we do not have enough cores to saturate the QDR link.
    • We see almost no noticeable degradation when threads > cores.
    • SoftIwarp has a significant overhead (CPU, latency, memory use).
  • Transcript

    • 1. Memory Aggregation for KVM - Hecatonchire Project. Benoit Hudzia, Sr. Researcher, SAP Research Belfast. With the contribution of Aidan Shribman, Roei Tell, Steve Walsh, Peter Izsak. November 2012
    • 2. Agenda • Memory as a Utility • Raw Performance • First Use Case: Post-Copy • Second Use Case: Memory Aggregation • Lego Cloud • Summary
    • 3. Memory as a Utility: How We Liquefied Memory Resources
    • 4. The Idea: Turning memory into a distributed memory service. Breaks memory from the bounds of the physical box. Transparent deployment with performance at scale and reliability.
    • 5. High Level Principle. [Diagram: Memory Sponsor A and Memory Sponsor B back, over the network, parts of the virtual memory address space of a memory-demanding process running on the Memory Demander.]
    • 6. How Does It Work (Simplified Version). [Diagram: Physical Node A and Physical Node B, each with an MMU (+ TLB), page table, coherency engine, and RDMA engine. On node A, a virtual-address miss lands on a remote PTE (a custom swap entry); the coherency engine extracts the page table entry and the RDMA engine sends a page request over the network fabric to node B, whose coherency engine extracts the page and prepares it for RDMA transfer; the page response is installed on node A with a PTE write and an MMU update/invalidate.] (A toy user-space sketch of this flow appears after the transcript.)
    • 7. Reducing Effects of Network-Bound Page Faults. Full Linux MMU integration (reducing the system-wide effects/cost of a page fault): enables page-fault transparency, pausing only the requesting thread. Low-latency RDMA engine and page-transfer protocol (reducing the latency/cost of page faults): implemented fully in kernel mode on OFED verbs; can use the fastest RDMA hardware available (IB, iWARP, RoCE); tested with software RDMA solutions (Soft-iWARP and SoftRoCE), so no special hardware is required. Demand pre-paging (pre-fetching) mechanism (reducing the number of page faults): currently only a simple fetch of the pages surrounding the page on which the fault occurred (sketched after the transcript).
    • 8. Transparent Solution. Minimal modification of the kernel (simple and minimal intrusion): 4 hooks in the static kernel, virtually no overhead when enabled for normal operation. Paging and memory cgroup support (transparent tiered memory): pages are pushed back to their sponsor when paging occurs, or, if they are local, they can be swapped out normally (see the push-back sketch after the transcript). KVM-specific support (virtualization friendly): shadow page tables (EPT / NPT) and KVM asynchronous page fault.
    • 9. Transparent Solution (cont.). Scalable Active-Active mode (distributed shared memory): shared-nothing with distributed index; write-invalidate with distributed index (end of this year). Library LibHeca (ease of integration): a simple API bootstrapping and synching all participating nodes. We also support: KSM, huge pages, discontinuous shared memory regions, and multiple DSM / VM groups on the same physical node.
    • 10. Raw Performance: How fast can we move memory around?
    • 11. Raw Bandwidth Usage. HW: 4-core i5-2500 CPU @ 3.30GHz; SoftIwarp 10GbE; iWARP Chelsio T422 10GbE; IB ConnectX2 QDR 40 Gbps. [Chart: total Gbit/s for sequential, binary-split, and random walks over 1GB of shared RAM, with 1 to 8 threads, for SoftIwarp (SIW), iWARP (IW), and InfiniBand (IB). Annotations: maxing out bandwidth; not enough cores to saturate (?); no degradation under high load; software RDMA has significant overhead.]
    • 12. Hard Page Fault Resolution Performance. SoftIwarp (10 GbE): average resolution time 355 μs, average one-way time over the wire 150+ μs, best resolution time 74 μs. iWARP (10 GbE): average 48 μs, one-way 4-6 μs, best 28 μs. InfiniBand (40 Gbps): average 29 μs, one-way 2-4 μs, best 16 μs.
    • 13. Average Compounded Page Fault Resolution Time (with Prefetch). [Chart: microseconds (1 to 6) vs. number of threads (1 to 8) for iWARP 10GbE and IB 40Gbps, each with sequential, binary-split, and random walks.]
    • 14. Post-Copy Live Migration: the technology's first use case
    • 15. Post-Copy / Pre-Copy / Hybrid Comparison. [Chart: downtime in seconds vs. VM RAM (1 GB, 4 GB, 10 GB, 14 GB) for pre-copy (forced after 60 s), post-copy, hybrid (3 seconds), and hybrid (5 seconds).] Host: Intel Core i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM. Network: 10 GbE, Chelsio T422-CR iWARP. Workload: App Mem Bench (~80% of the VM RAM). Dirtying rate: 1GB/s (256k pages dirtied per second).
    • 16. Post-Copy vs Pre-Copy Under Load. [Chart: degradation (%) over ~95 seconds for post-copy and pre-copy at dirtying rates of 1, 5, 25, 50, and 100 GB/s.] Virtual machine: 1 GB RAM, 1 vCPU; workload: App Mem Bench. Hardware: Intel Core i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM; network: 10 GbE switch, NIC: Chelsio T422-CR (iWARP).
    • 17. Post-Copy Migration of a HANA DB. Baseline: downtime N/A, 0% benchmark performance degradation. Pre-copy: downtime 7.47 s, benchmark failed. Post-copy: downtime 675 ms, 5% benchmark performance degradation. Virtual machine: 10 GB RAM, 4 vCPU; application: HANA (in-memory database); workload: SAP-H (TPC-H variant). Hardware: Intel Core i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM; fabric: 10 GbE switch; NIC: Chelsio iWARP T422-CR.
    • 18. Memory Aggregation: second use case, scaling out memory
    • 19. Scaling Out Virtual Machine Memory. Business problem: heavy swap usage slows execution time for data-intensive applications. Hecatonchire / RRAIM solution: the VM swaps to a memory-cloud swap resource; applications use memory mobility for high performance; completely transparent, no integration required; act on results sooner; high reliability built in (compression / deduplication / N-tier storage / HR-HA); enables iteration or additional data to improve results.
    • 20. Redundant Array of Inexpensive RAM: RRAIM. 1. Each memory region is backed by two remote nodes; remote page faults and swap-outs are initiated simultaneously to all relevant nodes. 2. There is no immediate effect on the computation node upon failure of a node. 3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node. (A toy sketch of this mirroring appears after the transcript.)
    • 21. Quicksort Benchmark with Memory Constraint (512 MB, 1 GB, and 2 GB datasets). Memory ratio (constrained using cgroup) vs. overhead: 3:4 - DSM 2.08%, RRAIM 5.21%; 1:2 - DSM 2.62%, RRAIM 6.15%; 1:3 - DSM 3.35%, RRAIM 9.21%; 1:4 - DSM 4.15%, RRAIM 8.68%; 1:5 - DSM 4.71%, RRAIM 9.28%.
    • 22. Scaling Out HANA Memory. Ratio vs. overhead: 1:2 - DSM 1%, RRAIM 0.887%; 1:3 - DSM 1.6%, RRAIM 1.548%; 2:1:1 - DSM 0.1%; 1:1:1 - DSM 1.5%. Virtual machine: 18 GB RAM, 4 vCPU; application: HANA (in-memory database); workload: SAP-H (TPC-H variant). Hardware: memory host Intel Core i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM; compute host Intel Xeon X5650 @ 2.56GHz, 8 cores, 96GB RAM; fabric: InfiniBand QDR 40Gbps switch + Mellanox ConnectX2.
    • 23. Transitioning to a Memory Cloud (ongoing work). [Diagram: many physical nodes hosting a variety of VMs - memory VMs (memory sponsors), compute VMs (memory demanders), and combination VMs (both sponsor and demander) - tied together by RRAIM into a memory cloud, with cloud management services (OpenStack).]
    • 24. Lego Cloud: going beyond memory
    • 25. Virtual Distributed Shared Memory System (Compute Cloud). Compute aggregation idea: a virtual machine's compute and memory span multiple physical nodes. Challenges: coherency protocol; granularity (false sharing). Hecatonchire value proposition: optimal price/performance by using commodity hardware; operational flexibility (node downtime without downing the cluster); seamless deployment within existing clouds. [Diagram: guest VMs spanning servers #1 to #n, each server contributing CPUs, memory, and I/O over fast RDMA communication.]
    • 26. Disaggregation of Datacentre (and Cloud) Resources (our aim). Breaking out the functions of memory, compute, and I/O, and optimizing the delivery of each. Disaggregation provides three primary benefits. Better performance: each function is isolated, limiting the scope of what each box must do, and dedicated hardware and software can be leveraged, which increases performance. Superior scalability: functions are isolated from each other, so one function can be altered without impacting the others. Improved economics: cost-effective deployment of resources through improved provisioning and consolidation of disparate equipment.
    • 27. Summary
    • 28. Hecatonchire Project. Features: distributed shared memory; memory extension via memory servers; HA features. Future: distributed workload execution. Uses standard cloud interfaces, optimises cloud infrastructure, supports COTS HW.
    • 29. Key Takeaways. The Hecatonchire project aims at disaggregating datacentre resources. It currently delivers memory-cloud capabilities. Enhancements are to be released as open source under the GPLv2 and LGPL licenses by the end of November 2012. Hosted on GitHub; check www.hecatonchire.com. Developed by the SAP Research Technology Infrastructure (TI) Programme.
    • 30. Thank you. Benoit Hudzia, Sr. Researcher, SAP Research Belfast. benoit.hudzia@sap.com
    • 31. Backup Slide
    • 32. Instant Flash Cloning On-Demand. Business problem: burst load / service usage that cannot be satisfied in time. Existing solutions (vendors: Amazon / VMware / RightScale): start the VM from a disk image, which requires the full VM OS startup sequence. Hecatonchire solution: go live after cloning the VM state (MBs) and the hot memory (<5%); use the post-copy live-migration schema in the background; complete the background transfer and disconnect from the source (sketched after the transcript). Hecatonchire value proposition: just-in-time (sub-second) provisioning.
    • 33. DRAM Latency Has Remained Constant. CPU clock speed and memory bandwidth increased steadily (at least until 2000), but memory latency remained constant, so local memory has gotten slower from the CPU's perspective. Source: J. Karstens, In-Memory Technology at SAP, DKOM 2010.
    • 34. CPUs Stopped Getting Faster. Moore's law prevailed until 2005, when core speed hit a practical limit of about 3.4 GHz. Since 2005 you do get more cores, but the "single-threaded free lunch" is over; effectively, arbitrary sequential algorithms have not gotten faster since. Sources: http://www.intel.com/pressroom/kits/quickrefyr.htm ; "The Free Lunch Is Over..." by Herb Sutter.
    • 35. While... Interconnect Link Speed Has Kept Growing. Source: Panda et al., Supercomputing 2009.
    • 36. Result: Remote Nodes Have Gotten Closer. Accessing DRAM on a remote host via IB interconnects is only 20x slower than local DRAM, and remote DRAM has far better performance than paging in from an SSD or HDD device. Fast interconnects have become a commodity, moving out of the High Performance Computing (HPC) niche. Source: HANA Performance Analysis, Chaim Bendelac, 2011.
    • 37. Post-Copy Live Migration (pre-migration). [Timeline diagram, repeated on slides 37-41: phases pre-migrate, reservation, stop and 1 copy round, page pushing, commit; the guest is live on host A, then briefly down, then degraded on host B, then live on B; the total migration time spans all phases. Here the guest VM runs on host A.]
    • 38. Post-Copy Live Migration (reservation). [Same timeline; resources for the guest VM are reserved on host B.]
    • 39. Post-Copy Live Migration (stop and copy). [Same timeline; the downtime phase, covering the stop and one copy round.]
    • 40. Post-Copy Live Migration (post-copy). [Same timeline; the guest runs degraded on host B while page faults pull pages from host A and host A pushes the remaining pages.]
    • 41. Post-Copy Live Migration (commit). [Same timeline; the migration commits and the guest is live on host B.] (The phases are also enumerated in the sketch after the transcript.)
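
The following sketch illustrates the remote page-fault flow of slide 6 as a tiny user-space C program. It is a toy model, not the Hecatonchire code: the two kernels, the RDMA fabric, and the MMU are collapsed into two in-process arrays plus a per-page "remote" flag standing in for the custom swap entry, and the page-request/response exchange becomes a direct function call.

/* Toy demander/sponsor model: two "nodes" in one process. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    16

static uint8_t sponsor_mem[NPAGES][PAGE_SIZE];   /* node B: backing page frames */
static uint8_t demander_mem[NPAGES][PAGE_SIZE];  /* node A: local page frames */
static int     pte_remote[NPAGES];               /* stand-in for the custom swap entry */

/* Stand-in for the page-request / page-response exchange over the RDMA fabric. */
static void sponsor_handle_request(int pfn, uint8_t *out)
{
    memcpy(out, sponsor_mem[pfn], PAGE_SIZE);
}

/* Demander-side fault resolution: in the real system only the faulting
 * thread pauses here while the coherency engine resolves the page. */
static uint8_t *demander_read(int pfn)
{
    if (pte_remote[pfn]) {
        sponsor_handle_request(pfn, demander_mem[pfn]); /* fetch the page */
        pte_remote[pfn] = 0;               /* "install" it: PTE write + MMU update */
    }
    return demander_mem[pfn];
}

int main(void)
{
    for (int p = 0; p < NPAGES; p++) {
        memset(sponsor_mem[p], 'A' + p, PAGE_SIZE);
        pte_remote[p] = 1;                 /* every page starts remote */
    }
    printf("first byte of page 3: %c\n", demander_read(3)[0]);
    return 0;
}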
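
Building on the same toy example, this sketch shows the demand pre-paging idea from slide 7: when a fault occurs, a small window of surrounding pages is pulled as well so that nearby accesses do not fault again. The window size and helper name are illustrative assumptions, not values from the project.

#define PREFETCH_WINDOW 4   /* illustrative window size, not the real value */

static void prefetch_around(int pfn)
{
    int lo = pfn - PREFETCH_WINDOW < 0 ? 0 : pfn - PREFETCH_WINDOW;
    int hi = pfn + PREFETCH_WINDOW >= NPAGES ? NPAGES - 1 : pfn + PREFETCH_WINDOW;

    for (int p = lo; p <= hi; p++)
        if (pte_remote[p])
            demander_read(p);   /* reuses the fault path from the first sketch */
}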
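
Again extending the toy example, this sketch illustrates the tiered-memory behaviour mentioned on slide 8: under memory pressure, a page that has a sponsor is pushed back over the fabric and its PTE is re-marked as remote, while purely local pages would take the normal swap path. The function names are illustrative only.

/* Swap-out target on the sponsor side. */
static void sponsor_store(int pfn, const uint8_t *data)
{
    memcpy(sponsor_mem[pfn], data, PAGE_SIZE);
}

static void demander_evict(int pfn)
{
    if (!pte_remote[pfn]) {
        sponsor_store(pfn, demander_mem[pfn]);  /* push the page back to its sponsor */
        pte_remote[pfn] = 1;                    /* re-mark the PTE as remote */
    }
    /* pages with no sponsor would go down the normal local swap path instead */
}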
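
This sketch illustrates the RRAIM idea of slide 20 with the same toy model: every remote region is backed by two sponsor mirrors, swap-outs are written to all live mirrors, and a fault can be served by whichever mirror survives, so losing one sponsor has no immediate effect on the computation node. Structure and names are assumptions made for illustration.

#define NSPONSORS 2                     /* each region backed by two remote nodes */

static uint8_t mirror_mem[NSPONSORS][NPAGES][PAGE_SIZE];
static int     mirror_alive[NSPONSORS] = { 1, 1 };

/* Swap-out: initiated "simultaneously" to all relevant sponsors. */
static void rraim_swap_out(int pfn, const uint8_t *data)
{
    for (int s = 0; s < NSPONSORS; s++)
        if (mirror_alive[s])
            memcpy(mirror_mem[s][pfn], data, PAGE_SIZE);
}

/* Fault-in: any surviving mirror can serve the page. */
static int rraim_fault_in(int pfn, uint8_t *out)
{
    for (int s = 0; s < NSPONSORS; s++)
        if (mirror_alive[s]) {
            memcpy(out, mirror_mem[s][pfn], PAGE_SIZE);
            return 0;
        }
    return -1;                          /* all mirrors lost */
}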
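
The flash-cloning flow of slide 32 can be sketched with the same primitives: only a small hot fraction of pages is copied before the clone goes live, and everything else stays marked remote and arrives on demand (or in the background), like the post-copy path. Treating the first pages as the hot set is a simplification for the example, not how the hot set is actually identified.

static void flash_clone_go_live(double hot_fraction)
{
    int hot_pages = (int)(NPAGES * hot_fraction);   /* stands in for the "<5%" hot set;   */
                                                    /* with 16 toy pages this rounds down, */
                                                    /* a real VM has millions of pages     */
    for (int p = 0; p < NPAGES; p++)
        pte_remote[p] = 1;              /* the clone starts with everything remote */
    for (int p = 0; p < hot_pages; p++)
        demander_read(p);               /* eagerly copy the hot pages */

    /* the clone can resume here; the rest of memory arrives on demand via
     * page faults, or via background pushing, exactly like post-copy */
}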
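
Finally, the post-copy phases shown on slides 37-41 can be written down as a small, self-contained enumeration; the comments summarise what happens in each phase and where the downtime sits.

#include <stdio.h>

enum postcopy_phase {
    PRE_MIGRATE,      /* live on host A */
    RESERVATION,      /* resources reserved on host B */
    STOP_AND_COPY,    /* downtime: device state plus one short copy round */
    POST_COPY,        /* degraded on host B: page faults plus page pushing */
    COMMIT            /* live on host B, host A released */
};

static const char *phase_name[] = {
    "pre-migrate (live on A)",
    "reservation",
    "stop and copy (downtime)",
    "post-copy (degraded on B)",
    "commit (live on B)",
};

int main(void)
{
    for (int p = PRE_MIGRATE; p <= COMMIT; p++)
        printf("phase %d: %s\n", p, phase_name[p]);
    return 0;
}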
