Hecatonchire: KVM Forum 2012, Benoit Hudzia

Memory Aggregation For KVM
Hecatonchire Project

Benoit Hudzia; Sr. Researcher; SAP Research Belfast
With the contribution of Aidan Shribman, Roei Tell, Steve Walsh, Peter Izsak

November 2012
Agenda

•   Memory as a Utility
•   Raw Performance
•   First Use Case : Post Copy
•   Second Use case : Memory aggregation
•   Lego Cloud
•   Summary


© 2012 SAP AG. All rights reserved.        2
Memory as a Utility
How we Liquefied Memory Resources
The Idea: Turning memory into a distributed memory service




Breaks memory from the bounds of the physical box
Transparent deployment with performance at scale and reliability

High Level Principle

[Diagram: a memory demanding process on the Memory Demander node uses a virtual memory address space that is backed, over the network, by Memory Sponsor A and Memory Sponsor B.]
How does it work
(Simplified Version)

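[Diagram: remote page fault resolution between Physical Node A (requester) and Physical Node B (page holder), connected by a network fabric.]

Physical Node A:
1. A virtual address lookup misses in the MMU (+ TLB).
2. The page table entry is a remote PTE (custom swap entry).
3. The coherency engine hands a page request to the RDMA engine, which sends it over the network fabric.

Physical Node B:
4. The RDMA engine passes the request to the coherency engine.
5. The PTE and the MMU entry are invalidated and the page is extracted.
6. The page is prepared for RDMA transfer and returned as the page response.

Physical Node A:
7. The PTE is written and the MMU updated, so the virtual address resolves to a local physical address.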
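The fault path on this slide can be mimicked in a few lines of user-space Python. This is only a toy model under stated assumptions: `Node`, `RemotePTE`, `extract_page` and the 4 KiB page size are hypothetical names and values chosen for illustration, and a plain method call stands in for the RDMA page request and response. It is not the kernel implementation.

```python
# Toy model of the fault path; every name here is hypothetical, not Hecatonchire code.
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size


@dataclass
class RemotePTE:
    """Stands in for the custom swap entry: records which node holds the page."""
    owner: str


@dataclass
class Node:
    name: str
    page_table: dict = field(default_factory=dict)  # vpn -> bytes or RemotePTE
    peers: dict = field(default_factory=dict)       # node name -> Node

    def extract_page(self, vpn: int) -> bytes:
        """Holder side: drop the local mapping and hand the page over."""
        return self.page_table.pop(vpn)

    def read(self, vaddr: int) -> int:
        """Requester side: return the byte at vaddr, faulting the page in if needed."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        entry = self.page_table[vpn]
        if isinstance(entry, RemotePTE):
            # Fault: ask the owning node for the page, then map it locally.
            entry = self.peers[entry.owner].extract_page(vpn)
            self.page_table[vpn] = entry
        return entry[offset]
```

In this sketch a second read of the same page finds local bytes in the page table and never contacts the peer, which mirrors the "PTE write, update MMU" step.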
Reducing Effects of Network Bound Page Faults


Full Linux MMU integration (reducing the system-wide effects/cost of page faults)
• Page faults are resolved transparently (only the requesting thread is paused)

Low latency RDMA engine and page transfer protocol (reducing the latency/cost of page faults)
• Implemented fully in kernel mode using OFED verbs
• Can use the fastest RDMA hardware available (IB, iWARP, RoCE)
• Tested with software RDMA solutions (SoftiWARP and SoftRoCE); no special hardware required

Demand pre-paging (pre-fetching) mechanism (reducing the number of page faults)
• Currently only a simple fetch of the pages surrounding the page on which the fault occurred



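The "pages surrounding the faulting page" policy can be sketched as a window computation. The function name, the `radius` parameter and the `resident` set are assumptions made for this example; the slide does not say how many neighbouring pages are fetched or how already-present pages are handled.

```python
def prefetch_window(fault_vpn, region_start, region_end, radius=2, resident=frozenset()):
    """Return the neighbouring page numbers to request along with a faulting page.

    The window is clamped to [region_start, region_end) and skips the faulting
    page itself as well as pages already present locally.
    """
    lo = max(region_start, fault_vpn - radius)
    hi = min(region_end - 1, fault_vpn + radius)
    return [vpn for vpn in range(lo, hi + 1)
            if vpn != fault_vpn and vpn not in resident]
```

A larger radius removes more faults on sequential walks but wastes bandwidth on random access patterns, so it is the natural tuning knob in a scheme like this.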
Transparent Solution
Minimal Modification of the kernel (simple and minimal intrusion)
•    Four hooks in the static kernel; virtually no overhead for normal operation when enabled

Paging and memory cgroup support (Transparent Tiered Memory)
• Pages are pushed back to their sponsor when paging occurs; pages that are local can be
  swapped out normally

KVM Specific support (Virtualization Friendly)
• Shadow page table (EPT / NPT)
• KVM Asynchronous Page Fault


Transparent Solution (cont.)
Scalable Active – Active Mode (Distributed Shared Memory)
   • Shared Nothing with distributed index
   • Write invalidate with distributed index (end of this year)


Library LibHeca (Ease of integration)
   • Simple API for bootstrapping and synchronizing all participating nodes


We also support:
   •     KSM
   •     Huge Page
   •     Discontinuous Shared Memory Region
   •     Multiple DSM / VM groups on the same physical node
Raw Performance
How fast can we move memory around ?
Raw Bandwidth usage
HW: 4-core i5-2500 CPU @ 3.30GHz; SoftiWARP over 10GbE; iWARP on Chelsio T422 10GbE; IB on ConnectX2 QDR 40 Gbps
[Charts: total bandwidth in Gbit/s (Y axis, 0 to 25) for 1 to 8 threads, per transport (SIW = SoftiWARP, IW = iWARP, IB = InfiniBand), in three panels: sequential walk, bin split walk and random walk over 1GB of shared RAM.]
Chart annotations:
• Not enough cores to saturate (?)
• Maxing out bandwidth
• No degradation under high load
• Software RDMA has significant overhead
Hard Page Fault Resolution Performance

Transport               Resolution time,   Time spent over the wire,   Resolution time,
                        average (μs)       one way, average (μs)       best (μs)
SoftiWARP (10 GbE)      355                150+                        74
iWARP (10 GbE)          48                 4-6                         28
InfiniBand (40 Gbps)    29                 2-4                         16
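One way to read the average resolution times is to convert them into the throughput a single thread could reach if every page access were a blocking hard fault. The sketch below does that arithmetic; the 4 KiB page size and the function name are assumptions, and the figure ignores prefetching and multi-threading, which the deck uses to reach much higher aggregate bandwidth.

```python
PAGE_BYTES = 4096  # assumption: 4 KiB pages


def fault_bound_gbps(resolution_us: float, page_bytes: int = PAGE_BYTES) -> float:
    """Throughput in Gbit/s of one thread that takes one blocking fault per page."""
    return page_bytes * 8 / (resolution_us * 1e-6) / 1e9


for name, avg_us in [("SoftiWARP", 355), ("iWARP", 48), ("InfiniBand", 29)]:
    print(f"{name}: {fault_bound_gbps(avg_us):.2f} Gbit/s")
```

With the averages from the table this gives roughly 0.09, 0.68 and 1.13 Gbit/s per thread, far below link speed, which is why reducing the number of faults matters as much as reducing their latency.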
Average Compounded Page Fault Resolution Time
(With Prefetch)

[Chart: average compounded page fault resolution time in microseconds (Y axis, 1 to 6) against thread count (X axis, 1 to 8 threads). Series: IW 10GE and IB 40 Gbps, each for sequential, binary split and random walk.]
Post-Copy Live Migration
Technology first Use Case
Post Copy – Pre Copy – Hybrid Comparison
[Chart: downtime in seconds (Y axis, 0 to 4) against VM RAM size (1 GB, 4 GB, 10 GB, 14 GB). Series: Pre-copy (forced after 60 s), Post-copy, Hybrid (3 seconds), Hybrid (5 seconds).]
Host: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
Network: 10 GbE, Chelsio T422-CR iWARP
Workload: App Mem Bench (~80% of the VM RAM); dirtying rate: 1 GB/s (256k pages dirtied per second)
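The dirtying-rate figure and the difficulty pre-copy has at this rate can be checked with a little arithmetic. The sketch below assumes 4 KiB pages and uses a common textbook model of iterative pre-copy (each round resends whatever was dirtied during the previous round); it is not taken from the deck, and the function names are made up for this example. A 10 GbE link moves at most 1.25 GB/s before protocol overhead.

```python
PAGE_BYTES = 4096  # assumption: 4 KiB pages


def pages_per_second(dirty_bytes_per_s: int, page_bytes: int = PAGE_BYTES) -> int:
    """Convert a dirtying rate in bytes/s into pages/s."""
    return dirty_bytes_per_s // page_bytes


def precopy_remaining_gb(vm_gb, link_gb_per_s, dirty_gb_per_s, rounds):
    """Dirty memory (GB) still to send after a number of pre-copy rounds."""
    remaining = vm_gb
    for _ in range(rounds):
        round_time = remaining / link_gb_per_s          # seconds to send this round
        remaining = min(vm_gb, dirty_gb_per_s * round_time)  # dirtied meanwhile
    return remaining
```

Here `pages_per_second(2**30)` is 262144, the "256k" on the slide. In this model a 10 GB VM dirtying 1 GB/s over a 1.25 GB/s link shrinks its dirty set by only 20% per round, so convergence is slow, and at any dirtying rate above the link speed the dirty set never shrinks at all.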
Post Copy vs Pre copy under load


[Chart: performance degradation in % (Y axis, 0 to 100) over time in seconds (X axis, 1 to 95). Series: post-copy and pre-copy, each at dirtying rates of 1, 5, 25, 50 and 100 GB/s.]

Virtual Machine:
• 1 GB RAM, 1 vCPU
• Workload: App Mem Bench
Hardware:
• Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
• Network: 10 GbE switch; NIC: Chelsio T422-CR (iWARP)
Post Copy Migration of HANA DB

                                      Baseline    Pre-Copy            Post-Copy
Downtime                              N/A         7.47 s              675 ms
Benchmark performance degradation     0%          Benchmark failed    5%

Virtual Machine:
• 10 GB RAM, 4 vCPU
• Application: HANA (in-memory database)
• Workload: SAP-H (TPC-H variant)
Hardware:
• Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
• Fabric: 10 GbE switch
• NIC: Chelsio iWARP T422-CR
Memory Aggregation
Second use case: Scaling out Memory
Scaling Out Virtual Machine Memory
Business Problem
   • Heavy swap usage slows execution time for data intensive applications
Hecatonchire / RRAIM Solution
   • Applications use memory mobility as a high performance swap resource
         • Completely transparent
         • No integration required
         • Act on results sooner
         • High reliability built in
   • Enables iteration or additional data to improve results
[Diagram: Solution. The VM swaps application RAM to a memory cloud, which provides compression, deduplication, N-tier storage and HR-HA.]
Redundant Array of Inexpensive RAM: RRAIM




  1.     Memory region backed by two remote nodes. Remote page faults and
         swap-outs are initiated simultaneously to all relevant nodes.

  2.     No immediate effect on the computation node upon failure of a node.

  3.     When a new remote node enters the cluster, it synchronizes with the
         computation node and the mirror node.
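The three RRAIM properties above can be illustrated with a small in-process model. All class and method names below are hypothetical, the "nodes" are plain Python objects, and the two sponsors are written one after the other rather than simultaneously as on the slide; the sketch only shows the mirroring logic, not the real protocol.

```python
class Sponsor:
    """A remote memory node holding copies of swapped-out pages."""
    def __init__(self, name):
        self.name = name
        self.pages = {}
        self.alive = True


class MirroredRegion:
    """Memory region backed by two sponsors (hypothetical model of RRAIM)."""
    def __init__(self, primary, mirror):
        self.sponsors = [primary, mirror]

    def _live(self):
        return [s for s in self.sponsors if s.alive]

    def swap_out(self, vpn, data):
        live = self._live()
        if not live:
            raise RuntimeError("no live sponsor left")
        for sponsor in live:          # the slide issues these simultaneously
            sponsor.pages[vpn] = data

    def fault_in(self, vpn):
        for sponsor in self._live():  # any surviving copy will do
            if vpn in sponsor.pages:
                return sponsor.pages[vpn]
        raise KeyError(vpn)

    def replace_failed(self, newcomer):
        live = self._live()
        if not live:
            raise RuntimeError("no live sponsor to synchronize from")
        survivor = live[0]
        newcomer.pages = dict(survivor.pages)  # newcomer synchronizes with the mirror
        self.sponsors = [survivor, newcomer]
```

Losing one sponsor leaves `fault_in` working from the other copy, and a newcomer added with `replace_failed` restores two-way redundancy.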
Quicksort Benchmark with Memory Constraint
[Charts: Quicksort benchmark with 512 MB, 1 GB and 2 GB datasets; overhead in % (Y axis, 0.00% to 10.00%) against memory ratio (3:4, 1:2, 1:3, 1:4, 1:5). Series: DSM overhead, RRAIM overhead.]

Memory ratio                  DSM overhead   RRAIM overhead
(constrained using cgroup)
3:4                           2.08%          5.21%
1:2                           2.62%          6.15%
1:3                           3.35%          9.21%
1:4                           4.15%          8.68%
1:5                           4.71%          9.28%
Scaling out HANA

Memory ratio   DSM overhead   RRAIM overhead
1:2            1%             0.887%
1:3            1.6%           1.548%
2:1:1          0.1%           -
1:1:1          1.5%           -

[Chart: overhead in % (Y axis, 0.00% to 1.60%) against memory ratio (1:2, 1:3, 2:1:1, 1:1:1). Series: DSM overhead, RRAIM overhead.]

Virtual Machine:
• 18 GB RAM, 4 vCPU
• Application: HANA (in-memory database)
• Workload: SAP-H (TPC-H variant)
Hardware:
• Memory host: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
• Compute host: Intel(R) Xeon(R) CPU X5650 @ 2.56GHz, 8 cores, 96GB RAM
• Fabric: Infiniband QDR 40Gbps switch + Mellanox ConnectX2
Transitioning to a Memory Cloud
(Ongoing work)
[Diagram: many physical nodes hosting a variety of VMs. A Memory VM acts as memory sponsor, a Compute VM as memory demander, and a Combination VM as both memory sponsor and demander. Together they form a memory cloud with RRAIM, managed by Memory Cloud Management Services (OpenStack).]
Lego Cloud
Going beyond Memory
Virtual Distributed Shared Memory System
(Compute Cloud)
Compute aggregation
• Idea: virtual machine compute and memory span multiple physical nodes

Challenges
• Coherency protocol
• Granularity (false sharing)

Hecatonchire Value Proposition
• Optimal price / performance by using commodity hardware
• Operational flexibility: node downtime without downing the cluster
• Seamless deployment within existing cloud

[Diagram: guest VMs (App, OS, H/W) spanning Server #1, Server #2 ... Server #n, each contributing CPUs, memory and I/O, linked by fast RDMA communication.]
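The granularity challenge can be made concrete with a generic single-writer, write-invalidate model. This is a textbook-style sketch, not the Hecatonchire coherency protocol; the class name, the `unit_bytes` parameter and the `ping_pong` helper are assumptions introduced for the example.

```python
# Generic single-writer, write-invalidate model; not the Hecatonchire protocol.
class WriteInvalidateDSM:
    def __init__(self, unit_bytes: int = 4096):
        self.unit_bytes = unit_bytes   # coherency granularity (assumed 4 KiB page)
        self.owner = {}                # coherency unit -> node holding the writable copy
        self.invalidations = 0

    def write(self, node: str, addr: int) -> None:
        unit = addr // self.unit_bytes
        previous = self.owner.get(unit)
        if previous is not None and previous != node:
            self.invalidations += 1    # the previous owner's copy is invalidated
        self.owner[unit] = node


def ping_pong(dsm: WriteInvalidateDSM, writes: int = 10) -> int:
    """Node A writes address 0 and node B writes address 8, alternately."""
    for i in range(writes):
        if i % 2 == 0:
            dsm.write("A", 0)
        else:
            dsm.write("B", 8)
    return dsm.invalidations
```

With page-sized units the two nodes never touch the same bytes, yet every write after the first invalidates the other node's copy (9 invalidations for 10 writes): that is false sharing. With an 8-byte unit the same access pattern causes none, at the cost of tracking far more units.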
Disaggregation of datacentre ( and cloud ) resources
(Our Aim)

Breaking out the functions of memory, compute and I/O, and optimizing the delivery of each.

Disaggregation provides three primary benefits:
• Better Performance:
    • Each function is isolated => limiting the scope of
       what each box must do
    • We can leverage dedicated hardware and software
       => increases performance.
• Superior Scalability:
    • Functions are isolated from each other => alter one
       function without impacting the others.
• Improved Economics:
    • Cost-effective deployment of resources => improved
       provisioning and consolidation of disparate
       equipment

Summary
Hecatonchire Project

 • Features:
  • Distributed Shared Memory
  • Memory extension via Memory Servers
  • HA features
  • Future: distributed workload execution
 • Use standard Cloud interface
 • Optimise Cloud infrastructure
 • Support COTS HW

Key takeaways
•       The Hecatonchire project aims at disaggregating
        datacentre resources

•       The Hecatonchire project currently delivers memory
        cloud capabilities

•       Enhancements to be released as open source under
        GPLv2 and LGPL licenses by the end of November
        2012

•       Hosted on GitHub, check: www.hecatonchire.com

•       Developed by SAP Research Technology
        Infrastructure (TI) Programme
Thank you

Benoit Hudzia; Sr. Researcher;
SAP Research Belfast
benoit.hudzia@sap.com
Backup Slide
Instant Flash Cloning On-Demand

Business Problem
• Burst load / service usage that cannot be satisfied in time

Existing solutions
• Vendors: Amazon / VMware / RightScale
• Start up VM from disk image
• Requires full VM OS startup sequence

Hecatonchire Solution
• Go live after VM-state (MBs) and hot memory (<5%) cloning
• Use post-copy live-migration schema in background
• Complete background transfer and disconnect from source

Hecatonchire Value Proposition
• Just in time (sub-second) provisioning

DRAM Latency Has Remained Constant


CPU clock speed and memory bandwidth
increased steadily (at least until 2000)

But memory latency remained constant – so
local memory has gotten slower from the CPU
perspective




                                              Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010

CPUs Stopped Getting Faster


Moore’s law prevailed until 2005, when core clock
speed hit a practical limit of about 3.4 GHz

Since 2005 you do get more cores, but the
“single threaded free lunch” is over

                                               Source: http://www.intel.com/pressroom/kits/quickrefyr.htm

Effectively, arbitrary sequential algorithms
have not gotten faster since

                                               Source: “The Free Lunch Is Over..” by Herb Sutter
While … Interconnect Link Speed has Kept Growing




                                      Panda et al. Supercomputing 2009
Result: Remote Nodes Have Gotten Closer


Accessing DRAM on a remote host via IB
interconnects is only 20x slower than local
DRAM

And remote DRAM has far better performance
than paging in from an SSD or HDD device

Fast interconnects have become a commodity
- moving out of the High Performance
Computing (HPC) niche




                                              HANA Performance Analysis, Chaim Bendelac, 2011
Post-Copy Live Migration

Phases (together forming the total migration time): Pre-migrate, Reservation, Stop and Copy, Page Pushing (1 round), Commit
VM state along the timeline: Live on A, Downtime, Degraded on B, Live on B

[Diagram sequence, one slide per phase:]
1. Pre-migration: the guest VM is shown on Host A only.
2. Reservation: the guest VM is shown on Host A and on Host B.
3. Stop and copy: the guest VM is shown on Host A and on Host B.
4. Post-copy: the guest VM is shown on both hosts, with page faults and page pushes flowing between Host A and Host B.
5. Commit: the guest VM is shown on Host B only.
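The post-copy phase can be sketched as a small simulation: the destination starts empty, guest accesses to missing pages are resolved as demand faults against the source, and a background pusher moves the rest. The function name, the one-push-per-access pacing and the data structures are assumptions for illustration only; the real mechanism works on guest memory in the kernel, not on Python dictionaries.

```python
def post_copy(source_pages, access_trace):
    """Simulate post-copy: returns (destination pages, demand faults, background pushes).

    source_pages: dict of page number -> content held by the source host.
    access_trace: page numbers touched by the guest once it runs on the destination
                  (every entry must be a page that exists in source_pages).
    """
    dest = {}
    pending = dict(source_pages)   # pages not yet transferred
    faults = pushes = 0
    for vpn in access_trace:
        if vpn not in dest:
            dest[vpn] = pending.pop(vpn)   # demand fault: fetch from the source
            faults += 1
        if pending:                        # assumed pacing: one push per guest access
            pushed = next(iter(pending))
            dest[pushed] = pending.pop(pushed)
            pushes += 1
    while pending:                         # finish the background transfer
        vpn, data = pending.popitem()
        dest[vpn] = data
        pushes += 1
    return dest, faults, pushes            # all pages moved: ready to commit
```

Every page crosses the network exactly once, either as a fault or as a push, which is why total transferred data is bounded by the VM's memory size regardless of how fast the guest dirties pages.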

Hecatonchire kvm forum_2012_benoit_hudzia

  • 1. Memory Aggregation For KVM: Hecatonchire Project. Benoit Hudzia, Sr. Researcher, SAP Research Belfast. With the contribution of Aidan Shribman, Roei Tell, Steve Walsh, Peter Izsak. November 2012.
  • 2. Agenda
      • Memory as a Utility
      • Raw Performance
      • First Use Case: Post Copy
      • Second Use Case: Memory Aggregation
      • Lego Cloud
      • Summary
  • 3. Memory as a Utility: How we Liquefied Memory Resources
  • 4. The Idea: Turning memory into a distributed memory service
      • Breaks memory from the bounds of the physical box
      • Transparent deployment with performance at scale and reliability
  • 5. High Level Principle
      [Diagram: a Memory Demanding Process on the Memory Demander node has a Virtual Memory Address Space that is connected, over the Network, to Memory Sponsor A and Memory Sponsor B.]
  • 6. How does it work (Simplified Version)
      [Diagram of a remote page fetch between Physical Node A and Physical Node B, connected by a Network Fabric. Labels recovered from the diagram:
       Node A: Virtual Address, MMU (+ TLB), Physical Address; Miss, Page Table Entry, Remote PTE (Custom Swap Entry), Coherency Engine, Page request, RDMA Engine; on return: PTE write, Update MMU.
       Node B: RDMA Engine, Coherency Engine, Invalidate PTE, Invalidate MMU, Extract Page, Prepare Page for RDMA transfer, Page Response; MMU (+ TLB), Page Table Entry, Physical Address.]
  • 7. Reducing Effects of Network Bound Page Faults
      • Full Linux MMU integration (reduces the system-wide effects/cost of a page fault)
          - Page faults are handled transparently: only the requesting thread is paused
      • Low latency RDMA engine and page transfer protocol (reduces the latency/cost of page faults)
          - Implemented fully in kernel mode with OFED verbs
          - Can use the fastest RDMA hardware available (IB, iWARP, RoCE)
          - Tested with software RDMA solutions (SoftiWARP and SoftRoCE): no special hardware required
      • Demand pre-paging (pre-fetching) mechanism (reduces the number of page faults)
          - Currently only a simple fetch of the pages surrounding the page on which the fault occurred
  • 8. Transparent Solution
      • Minimal modification of the kernel (simple and minimal intrusion)
          - 4 hooks in the static kernel; virtually no overhead when enabled for normal operation
      • Paging and memory cgroup support (transparent tiered memory)
          - Pages are pushed back to their sponsor when paging occurs; if they are local, they can be swapped out normally
      • KVM-specific support (virtualization friendly)
          - Shadow page table (EPT / NPT)
          - KVM asynchronous page fault
  • 9. Transparent Solution (cont.)
      • Scalable active-active mode (distributed shared memory)
          - Shared nothing with distributed index
          - Write invalidate with distributed index (end of this year)
      • Library LibHeca (ease of integration)
          - Simple API for bootstrapping and synchronizing all participating nodes
      • We also support:
          - KSM
          - Huge pages
          - Discontinuous shared memory regions
          - Multiple DSM / VM groups on the same physical node
  • 10. Raw Performance: How fast can we move memory around?
  • 11. Raw Bandwidth Usage
      Hardware: 4-core i5-2500 CPU @ 3.30GHz; SoftiWARP on 10GbE; iWARP on Chelsio T422 10GbE; InfiniBand ConnectX2 QDR 40 Gbps.
      [Bar chart: total Gbit/s (y-axis, 0 to 25) for 1 to 8 threads. Three panels: sequential walk, bin split walk and random walk over 1GB of shared RAM, each measured on SoftiWARP (SIW), iWARP (IW) and InfiniBand (IB).]
      Chart annotations: "Maxing out bandwidth"; "Not enough cores to saturate (?)"; "No degradation under high load"; "Software RDMA has significant overhead".
  • 12. Hard Page Fault Resolution Performance

      Fabric                 Resolution time,   Time spent over the wire,   Resolution time,
                             average (μs)       one way, average (μs)       best (μs)
      SoftiWARP (10GbE)      355                150+                        74
      iWARP (10GbE)          48                 4-6                         28
      InfiniBand (40 Gbps)   29                 2-4                         16

  • 13. Average Compounded Page Fault Resolution Time (With Prefetch)
      [Line chart: resolution time in microseconds (y-axis, 1 to 6) against number of threads (1 to 8). Series: IW 10GE sequential, IB 40 Gbps sequential, IW 10GE binary split, IB 40 Gbps binary split, IW 10GE random walk, IB random walk.]
  • 15. Post Copy – Pre Copy – Hybrid Comparison
      [Bar chart: downtime in seconds (y-axis, 0 to 4) against VM RAM (1 GB, 4 GB, 10 GB, 14 GB). Series: pre-copy (forced after 60s), post-copy, hybrid (3 seconds), hybrid (5 seconds).]
      • Host: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
      • Network: 10GbE, Chelsio T422-CR (iWARP)
      • Workload: App Mem Bench (~80% of the VM RAM)
      • Dirtying rate: 1GB/s (256k pages dirtied per second)
  • 16. Post Copy vs Pre Copy under load
      [Line chart: degradation in % (y-axis, 0 to 100) against time in seconds (x-axis, 1 to 95). Series: post-copy and pre-copy, each at dirtying rates of 1GB/s, 5GB/s, 25GB/s, 50GB/s and 100GB/s.]
      • Virtual machine: 1 GB RAM, 1 vCPU; workload: App Mem Bench
      • Hardware: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
      • Network: 10GbE switch; NIC: Chelsio T422-CR (iWARP)
  • 17. Post Copy Migration of HANA DB

                                              Baseline   Pre-Copy           Post-Copy
      Downtime                                N/A        7.47 s             675 ms
      Benchmark performance degradation       0%         Benchmark failed   5%

      • Virtual machine: 10 GB RAM, 4 vCPU; application: HANA (in-memory database); workload: SAP-H (TPC-H variant)
      • Hardware: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
      • Fabric: 10GbE switch; NIC: Chelsio T422-CR (iWARP)
  • 18. Memory Aggregation. Second use case: Scaling out Memory
  • 19. Scaling Out Virtual Machine Memory
      • Business problem
          - Heavy swap usage slows execution time for data intensive applications
      • Hecatonchire / RRAIM solution
          - Applications use memory mobility for a high performance swap resource
          - Completely transparent
          - No integration required
          - Act on results sooner
          - High reliability built in
          - Enables iteration or additional data to improve results
      [Diagram labels: Solution; Application; RAM; VM swaps to memory cloud; Memory Cloud; Compression / Deduplication / N-tiers storage / HR-HA.]
  • 20. Redundant Array of Inexpensive RAM: RRAIM
      1. A memory region is backed by two remote nodes. Remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
      2. The failure of a remote node has no immediate effect on the computation node.
      3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
  • 21. Quicksort Benchmark with Memory Constraint

      Memory ratio                    DSM overhead   RRAIM overhead
      (constraint using cgroup)
      3:4                             2.08%          5.21%
      1:2                             2.62%          6.15%
      1:3                             3.35%          9.21%
      1:4                             4.15%          8.68%
      1:5                             4.71%          9.28%

      [Bar charts: DSM overhead and RRAIM overhead (y-axis, 0% to 10%) per memory ratio. Panels: Quicksort Benchmark 512 MB Dataset, 1GB Dataset, 2GB Dataset.]
  • 22. Scaling out HANA

      Memory ratio   DSM overhead   RRAIM overhead
      1:2            1%             0.887%
      1:3            1.6%           1.548%
      2:1:1          0.1%           -
      1:1:1          1.5%           -

      [Bar chart: DSM overhead and RRAIM overhead (y-axis, 0% to 1.6%) per memory ratio.]
      • Virtual machine: 18 GB RAM, 4 vCPU; application: HANA (in-memory database); workload: SAP-H (TPC-H variant)
      • Memory host: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 4 cores, 16GB RAM
      • Compute host: Intel(R) Xeon(R) CPU X5650 @ 2.56GHz, 8 cores, 96GB RAM
      • Fabric: InfiniBand QDR 40Gbps switch + Mellanox ConnectX2
  • 23. Transitioning to a Memory Cloud (Ongoing work)
      [Diagram: three kinds of VM: Memory VM (memory sponsor), Compute VM (memory demander) and Combination VM (memory sponsor and demander), joined through RRAIM into a Memory Cloud. Below: Memory Cloud Management Services (OpenStack), and many physical nodes hosting a variety of VMs.]
  • 25. Virtual Distributed Shared Memory System (Compute Cloud)
      • Compute aggregation
          - Idea: virtual machine compute and memory span multiple physical nodes
      • Challenges
          - Coherency protocol
          - Granularity (false sharing)
      • Hecatonchire value proposition
          - Optimal price / performance by using commodity hardware
          - Operational flexibility: node downtime without downing the cluster
          - Seamless deployment within existing cloud
      [Diagram: guest VMs (App, OS, H/W) spanning Server #1, Server #2 ... Server #n, each with CPUs, Memory and I/O, linked by Fast RDMA Communication.]
  • 26. Disaggregation of datacentre (and cloud) resources (Our Aim)
      Breaking out the functions of memory, compute and I/O, and optimizing the delivery of each. Disaggregation provides three primary benefits:
      • Better performance:
          - Each function is isolated, limiting the scope of what each box must do
          - We can leverage dedicated hardware and software, which increases performance
      • Superior scalability:
          - Functions are isolated from each other, so one function can be altered without impacting the others
      • Improved economics:
          - Cost-effective deployment of resources: improved provisioning and consolidation of disparate equipment
  • 28. Hecatonchire Project
      • Features:
          - Distributed shared memory
          - Memory extension via memory servers
          - HA features
          - Future: distributed workload execution
      • Uses standard cloud interfaces
      • Optimises cloud infrastructure
      • Supports COTS hardware
  • 29. Key takeaways
      • The Hecatonchire project aims at disaggregating datacentre resources
      • The Hecatonchire project currently delivers memory cloud capabilities
      • Enhancements to be released as open source under GPLv2 and LGPL licenses by the end of November 2012
      • Hosted on GitHub, check: www.hecatonchire.com
      • Developed by SAP Research Technology Infrastructure (TI) Programme
  • 30. Thank you. Benoit Hudzia; Sr. Researcher; SAP Research Belfast; benoit.hudzia@sap.com
  • 32. Instant Flash Cloning On-Demand
      • Business problem
          - Burst load / service usage that cannot be satisfied in time
      • Existing solutions
          - Vendors: Amazon / VMware / RightScale
          - Start up the VM from a disk image
          - Requires the full VM OS startup sequence
      • Hecatonchire solution
          - Go live after VM-state (MBs) and hot memory (<5%) cloning
          - Use the post-copy live-migration schema in the background
          - Complete the background transfer and disconnect from the source
      • Hecatonchire value proposition
          - Just in time (sub-second) provisioning
  • 33. DRAM Latency Has Remained Constant
      CPU clock speed and memory bandwidth increased steadily (at least until 2000), but memory latency remained constant, so local memory has gotten slower from the CPU perspective.
      Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
  • 34. CPUs Stopped Getting Faster
      Moore's law prevailed until 2005, when core speed hit a practical limit of about 3.4 GHz. Since 2005 you do get more cores, but the "single threaded free lunch" is over.
      Source: http://www.intel.com/pressroom/kits/quickrefyr.htm
      Effectively, arbitrary sequential algorithms have not gotten faster since.
      Source: "The Free Lunch Is Over.." by Herb Sutter
  • 35. While … Interconnect Link Speed has Kept Growing
      [Chart. Source: Panda et al., Supercomputing 2009]
  • 36. Result: Remote Nodes Have Gotten Closer
      • Accessing DRAM on a remote host via IB interconnects is only 20x slower than local DRAM
      • Remote DRAM has far better performance than paging in from an SSD or HDD device
      • Fast interconnects have become a commodity, moving out of the High Performance Computing (HPC) niche
      Source: HANA Performance Analysis, Chaim Bendelac, 2011
  • 37. Post-Copy Live Migration (pre-migration)
      [Diagram: one Guest VM; Host A and Host B.]
  • 38. Post-Copy Live Migration (reservation)
      [Diagram: two Guest VM boxes; Host A and Host B.]
  • 39. Post-Copy Live Migration (stop and copy)
      [Diagram: two Guest VM boxes; Host A and Host B.]
  • 40. Post-Copy Live Migration (post-copy)
      [Diagram: two Guest VM boxes; Host A and Host B; arrows labelled Page fault and Page push.]
  • 41. Post-Copy Live Migration (commit)
      [Diagram: one Guest VM; Host A and Host B.]
      Slides 37 to 41 share the same timeline. Phases: Pre-migrate, Reservation, Stop and Copy, Page Pushing (1 Round), Commit. VM state: Live on A, Downtime, Degraded on B, Live on B. The whole span is labelled Total Migration Time.
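Slide 7 describes the pre-fetching mechanism only as fetching the pages surrounding the faulting page. A minimal sketch of that idea follows; the function names and the symmetric `radius` parameter are hypothetical (the slides do not give a window size), and the model ignores eviction.

```python
def prefetch_window(fault_page, total_pages, radius=4):
    """Pages to fetch on a fault: the faulting page plus its neighbours.

    `radius` is an assumed tuning knob, clipped to the region bounds.
    """
    lo = max(0, fault_page - radius)
    hi = min(total_pages - 1, fault_page + radius)
    return list(range(lo, hi + 1))


def count_faults(accesses, total_pages, radius):
    """Count faults for an access sequence when every fault also
    brings in the surrounding window (no eviction modelled)."""
    resident = set()
    faults = 0
    for page in accesses:
        if page not in resident:
            faults += 1
            resident.update(prefetch_window(page, total_pages, radius))
    return faults


if __name__ == "__main__":
    walk = range(100)                     # sequential walk over 100 pages
    print(count_faults(walk, 100, 0))     # no prefetch: one fault per page
    print(count_faults(walk, 100, 4))     # window of +/-4 pages around each fault
```

For a sequential walk, only the pages ahead of the fault are useful, which is why a window of nine pages cuts the fault count by a factor of five rather than nine in this model.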
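The dirtying rate on slide 15 (1GB/s, 256k pages per second) is consistent with 4 KiB pages, and it sits close to what a 10GbE link can carry. The sketch below works through that arithmetic with an idealized iterative pre-copy model. The model, the function names and the stop threshold are assumptions for illustration, not the behaviour of any particular hypervisor.

```python
PAGE_SIZE = 4096  # bytes; assumed 4 KiB pages


def pages_per_second(dirty_bytes_per_s, page_size=PAGE_SIZE):
    """Convert a dirtying rate in bytes/s into pages/s."""
    return dirty_bytes_per_s // page_size


def precopy_rounds(ram_bytes, dirty_rate, bandwidth, stop_threshold,
                   max_rounds=1000):
    """Idealized pre-copy: each round resends what was dirtied while the
    previous round was on the wire.

    Returns (rounds, remaining_bytes); rounds is None when the remaining
    dirty set never drops to stop_threshold within max_rounds.
    """
    remaining = ram_bytes
    for rounds in range(1, max_rounds + 1):
        send_time = remaining / bandwidth
        remaining = min(ram_bytes, dirty_rate * send_time)
        if remaining <= stop_threshold:
            return rounds, remaining
    return None, remaining


if __name__ == "__main__":
    gib = 2 ** 30
    print(pages_per_second(gib))                      # 262144, i.e. 256k pages/s
    line_rate = 10e9 / 8                              # 10GbE upper bound, bytes/s
    print(precopy_rounds(4 * gib, gib, line_rate, 64 * 2 ** 20))
```

In this model the dirty set shrinks by the ratio of dirtying rate to bandwidth each round, so when the two are close, many rounds are needed, and when the dirtying rate reaches the bandwidth, pre-copy never converges on its own.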
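The RRAIM behaviour listed on slide 20 (a region mirrored on two remote nodes, no immediate effect when one fails, resynchronization when a new node joins) can be sketched as follows. The classes and method names are hypothetical, the "simultaneous" requests are modelled as a simple loop, and the new node is synchronized from the surviving mirror only, which simplifies the slide's description.

```python
class RemoteNode:
    """A remote memory sponsor holding page copies (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.pages = {}
        self.alive = True


class RraimRegion:
    """A memory region backed by two remote nodes that mirror each other."""

    def __init__(self, primary, mirror):
        self.nodes = [primary, mirror]

    def swap_out(self, page_no, data):
        # The slide says swap-outs go to all relevant nodes at once;
        # a sequential loop stands in for that here.
        for node in self.nodes:
            if node.alive:
                node.pages[page_no] = data

    def page_fault(self, page_no):
        # Any live replica can answer, so one node failing is not fatal.
        for node in self.nodes:
            if node.alive and page_no in node.pages:
                return node.pages[page_no]
        raise KeyError(f"page {page_no} lost on all replicas")

    def replace_failed(self, new_node):
        # Simplification: the newcomer copies from the surviving mirror.
        survivor = next(n for n in self.nodes if n.alive)
        new_node.pages = dict(survivor.pages)
        self.nodes = [survivor, new_node]


if __name__ == "__main__":
    a, b = RemoteNode("a"), RemoteNode("b")
    region = RraimRegion(a, b)
    region.swap_out(1, b"payload")
    a.alive = False                       # one sponsor fails
    print(region.page_fault(1))           # still served by the mirror
    region.replace_failed(RemoteNode("c"))
```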

Editor's Notes

  1. Walk patterns:
       • Sequential: each thread starts reading at page (256k / number of threads) × thread id and stops when it reaches the start of the following thread's range.
       • Binary split: the memory is split into as many regions as there are threads; each thread then does a binary split walk within its region.
       • Random walk: each thread reads pages chosen at random within the overall memory region (no duplicates).
     Observations:
       • We are maxing out the 10GbE bandwidth with iWARP.
       • We suspect that we do not have enough cores to saturate the QDR link.
       • There is almost no noticeable degradation when threads > cores.
       • SoftiWARP has a significant overhead (CPU, latency, memory use).
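The three walk patterns in the note above can be sketched as page-order generators. The sequential and random walks follow the note directly. The note does not define the order of a "binary split walk", so the breadth-first midpoint order below is an assumption, and all function names are hypothetical.

```python
import random
from collections import deque


def sequential_walk(total_pages, n_threads, thread_id):
    """Pages from this thread's start up to the next thread's start."""
    chunk = total_pages // n_threads
    start = chunk * thread_id
    end = total_pages if thread_id == n_threads - 1 else start + chunk
    return list(range(start, end))


def binary_split_walk(start, end):
    """Visit the midpoint of [start, end), then the midpoints of each half,
    level by level (assumed interpretation of 'binary split')."""
    order = []
    queue = deque([(start, end)])
    while queue:
        lo, hi = queue.popleft()
        if lo >= hi:
            continue
        mid = (lo + hi) // 2
        order.append(mid)
        queue.append((lo, mid))
        queue.append((mid + 1, hi))
    return order


def random_walk(total_pages, seed):
    """Every page of the whole region exactly once, in random order."""
    pages = list(range(total_pages))
    random.Random(seed).shuffle(pages)
    return pages


if __name__ == "__main__":
    total = 256 * 1024                    # 256k pages of 4 KiB = 1GB
    seq = sequential_walk(total, 4, 1)
    print(seq[0], seq[-1])                # thread 1 covers the second quarter
    print(binary_split_walk(0, 8))
    print(len(random_walk(total, seed=1)))
```

The binary split order jumps across the region on every access, so it defeats a neighbour-based prefetcher far more than the sequential walk does, which is presumably why the benchmark includes it.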