Project Hecatonchire
Dr. Benoit Hudzia, TIP HANA Cloud Computing, Systems Engineering
July 2013
Project Hecatonchire - The Lego Cloud : Status, Vision, Roadmap 08/2013

Hecatonchire aims to address these gaps by developing key enabling technologies that allow virtualization to go beyond the boundaries of the underlying hardware, breaking the barrier of using a single physical server for a virtual machine. With Hecatonchire, compute, memory, and I/O resources are provisioned across multiple hosts and can be consumed in dynamically changing quantities instead of fixed bundles. This effectively enables a fluid transformation of cloud infrastructures from commodity-sized physical nodes to very large virtual machines, meeting the rapidly growing demand for cloud-based services.
This slide deck provides the vision, status, and roadmap of the Hecatonchire project.
Website: http://hecatonchire.com/
Git: https://github.com/hecatonchire

• Main point: networking is now fast in comparison with other latencies. The immediate conclusion is to scale out.
• Walk patterns: Sequential: each thread starts reading at (256k / nb_threads) * thread_id and ends when it reaches the start of the following thread's range. Binary split: memory is split into nb_threads regions; each thread then does a binary-split walk within its region. Random walk: each thread reads pages chosen randomly within the overall memory region (no duplicates).
• We are maxing out the 10 GbE bandwidth with iWARP.
• We suspect that we do not have enough cores to saturate the QDR link.
• We see almost no noticeable degradation when threads > cores.
• SoftIwarp has a significant overhead (CPU, latency, memory use).
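The three access patterns described in these notes can be sketched as follows. This is an illustrative reconstruction in Python; function names such as `sequential_walk` are ours and not from the project's benchmark code, and we assume a region of `n_pages` pages divided among `n_threads` threads.

```python
import random

def sequential_walk(n_pages, n_threads, tid):
    """Each thread scans its contiguous slice of the region in order."""
    chunk = n_pages // n_threads
    start = chunk * tid
    end = n_pages if tid == n_threads - 1 else start + chunk
    return list(range(start, end))

def binary_split_walk(n_pages, n_threads, tid):
    """Each thread visits its slice midpoint-first, recursively."""
    chunk = n_pages // n_threads
    start = chunk * tid
    end = n_pages if tid == n_threads - 1 else start + chunk
    order = []
    def split(lo, hi):          # hi is exclusive
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        order.append(mid)
        split(lo, mid)
        split(mid + 1, hi)
    split(start, end)
    return order

def random_walk(n_pages, seed=0):
    """Every page is read exactly once, in random order (no duplicates)."""
    pages = list(range(n_pages))
    random.Random(seed).shuffle(pages)
    return pages
```

Each function returns the page-visit order for one thread; all three visit every page in the thread's range exactly once, so the patterns differ only in locality.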
Transcript: Project Hecatonchire - The Lego Cloud: Status, Vision, Roadmap 08/2013

1. Project Hecatonchire. Dr. Benoit Hudzia, TIP HANA Cloud Computing, Systems Engineering. July 2013
2. The Need: Next-Generation Datacentre
3. Advantages of Current Datacenter Designs (© 2013 SAP AG. All rights reserved.)
• Cheap, off-the-shelf, commodity parts
• No need for custom servers or networking kit
• (Sort of) easy to scale horizontally
• Runs standard software; no need for "cluster" or "grid" OSs
• Ideal for VMs
• Highly redundant
• Homogeneous
4. Lots of Problems
• Datacenters mix customers and applications
• Heterogeneous, unpredictable workload patterns
• Competition over resources
• How to achieve high reliability?
• Inefficient resource-consumption model
• Heat and power: 30 billion watts per year worldwide; may cost more than the machines
5. Achieving Right-Sized Servers
• Eliminates unnecessary components
• Increased power efficiency
• Optimized performance by workload
Using high-volume components to build high-value systems.
6. The Lego Cloud: Original Concept
7. Original Idea: Lego Cloud
Resource aggregation:
• Idea: a virtual machine's compute and memory span multiple physical nodes
• Nodes can be specialized for specific functionality: memory servers, compute servers, I/O servers (storage, accelerators such as GPUs, etc.)
Hecatonchire value proposition:
• Optimal price/performance by using commodity hardware
• Seamless deployment within existing clouds
• Memory / CPU / I/O resource pooling
• Just-in-time resource consumption
• Heterogeneous workloads and hardware
• Simplified reliability, availability, and serviceability model
(Diagram: guest VMs spanning Server #1 through Server #n, each with CPUs, memory, and I/O, connected by fast RDMA communication.)
8. From Rack Scale to Cloud Scale
(Diagram: CPUs, memory, and I/O pooled over fast RDMA communication, scaling from rack scale to datacentre scale; auto-scaling VMs; efficient resource pooling.)
9. Aggregating Resources with Fast Networking (author: Chaim Bendalac)
Bandwidth and latency (4 kB transfer) by fabric:
• PCIe SSD (FusionIO): 15 Gbps, ~50 µs
• PCIe fabric: 64 Gbps, ~0.8 µs
• Infiniband QDR: 40 Gbps, 4 µs
• Infiniband FDR: 54 Gbps, 1 µs
• iWARP 10 GbE: 10 Gbps, 6-7 µs
• iWARP 40 GbE: 40 Gbps, 1-2 µs
• Intel silicon photonics: 100 Gbps, > 1 µs
10. The Memory Cloud: Focus on Memory Aggregation
11. How Do We Offer Better Memory TCO?
• Memory in the nodes of current clusters is overscaled to fit the requirements of "any" application
• It remains unused most of the time
• How can we unleash memory-constrained applications by using the memory in the rest of the nodes?
Memory that grows with your business, not before.
12. Decouple Memory from Core Aggregation
Many shared-memory parallel applications do not scale beyond a few tens of cores, yet they may benefit from large amounts of memory:
• In-memory databases
• Data mining
• VMs
• Scientific applications
• etc.
Eliminate physical limitations of the cloud / datacentre.
13. The Idea: Turning Memory into a Distributed Memory Service
Break memory out of the bounds of the physical box: transparent deployment with performance at scale and reliability.
14. Architecture: High-Level Version
15. High-Level Architecture (Linux kernel)
• Lightweight: minimal set of hooks in the MMU (5)
• Kernel module with 3 core components: coherency engine, RDMA engine, management/tracing
• 3 existing consumption models: native, user-space library (via a char device), KVM
• 4th model under development, hgroup: similar to a cgroup, it allows barebone memory extension without application modification
(Diagram: apps consume the module through the user-space library, char device, KVM, or hgroup, on top of the kernel MMU and RDMA drivers.)
16. Extending Virtual Machine Memory
• Instead of using swap to free up RAM, push pages to remote hosts
• If a page is needed again, request it back
17. How Does It Work (Simplified Version)
On physical node A, a virtual address misses in the MMU (+ TLB): the page table entry is a remote PTE (a custom swap entry), so the coherency engine sends a page request over the network fabric via the RDMA engine. On physical node B, the coherency engine invalidates its PTE and MMU, extracts the page, prepares it for RDMA transfer, and returns the page response. Node A then writes the PTE and updates the MMU so the faulting access can complete.
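The remote page-fault flow on this slide can be modelled in a few lines. This is a conceptual Python sketch, not the in-kernel implementation; the class, the `REMOTE` marker, and the method names are our own stand-ins for the PTE machinery and RDMA messages.

```python
# Toy model of the simplified remote page-fault flow: a miss on a remote PTE
# triggers a page request to the owner, which invalidates its copy and
# returns the page; the requester then updates its own PTE/MMU.

REMOTE = "remote"   # stand-in for the custom swap entry stored in the PTE

class Node:
    def __init__(self, name):
        self.name = name
        self.page_table = {}   # virtual page -> local page data, or REMOTE
        self.peer = None

    def access(self, vpage):
        entry = self.page_table.get(vpage)
        if entry is not REMOTE:
            return entry                      # MMU hit: access completes locally
        # Miss on a remote PTE: coherency engine sends a page request.
        data = self.peer.handle_page_request(vpage)
        self.page_table[vpage] = data         # PTE write + MMU update
        return data

    def handle_page_request(self, vpage):
        # Owner side: invalidate PTE/MMU, extract the page, send it back.
        data = self.page_table[vpage]
        self.page_table[vpage] = REMOTE       # requester now owns the page
        return data

a, b = Node("A"), Node("B")
a.peer, b.peer = b, a
b.page_table[0x1000] = b"payload"   # page currently lives on node B
a.page_table[0x1000] = REMOTE       # node A holds only a remote PTE

assert a.access(0x1000) == b"payload"   # fault resolved over the "fabric"
assert b.page_table[0x1000] is REMOTE   # ownership migrated to node A
```

Note how ownership migrates with the page: a later access from B would fault in the opposite direction.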
18. Hecatonchire Memory Structure
• 2^64 groups (DSM-VM, e.g. a meta-VM / process), each owning a single virtual memory address space
• 2^64 sub-groups (SVM, a physical process) per group
• 2^64 memory regions (MRs) per group, partitioning the 64-bit address space
• Each sub-group is associated with a single process and can own 0 or more MRs
• A virtual memory address space can be backed by more than one memory region (RRAIM)
• Each memory region is owned by one physical node (sub-group)
• Under development: MR split/merge, MR relocation
(Diagram: a Heca instance contains groups; each group has a virtual memory address space, sub-groups, and MRs.)
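The group / sub-group / memory-region hierarchy above can be sketched with plain data classes. The names below are ours, not the project's kernel structures, and where the real system has 2^64-wide ID spaces we use small integers for illustration.

```python
# Sketch of the Hecatonchire memory hierarchy: a group (DSM-VM) owns one
# virtual address space; sub-groups (physical processes/nodes) own zero or
# more memory regions (MRs); one address may be backed by several MRs (RRAIM).
from dataclasses import dataclass, field

@dataclass
class MemoryRegion:
    mr_id: int
    start: int        # offset into the group's 64-bit address space
    length: int

@dataclass
class SubGroup:       # one physical process / node
    svm_id: int
    regions: list = field(default_factory=list)

@dataclass
class Group:          # DSM-VM: owns a single virtual memory address space
    group_id: int
    subgroups: dict = field(default_factory=dict)

    def add_mr(self, svm_id, mr):
        self.subgroups.setdefault(svm_id, SubGroup(svm_id)).regions.append(mr)

    def backing_nodes(self, addr):
        """RRAIM: an address may be backed by more than one region/node."""
        return [sg.svm_id for sg in self.subgroups.values()
                if any(mr.start <= addr < mr.start + mr.length
                       for mr in sg.regions)]

g = Group(1)
g.add_mr(10, MemoryRegion(1, start=0, length=1 << 30))   # primary backer
g.add_mr(11, MemoryRegion(2, start=0, length=1 << 30))   # RRAIM mirror
assert sorted(g.backing_nodes(0x1000)) == [10, 11]       # two backers
```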
19. Coherency Engine
• Different coherency models: shared-nothing, write-invalidate, and some variants
• Custom distributed index: worst case O(log n) hops; on a 16-node cluster with random R/W on each node, 97% of lookups take < 2 hops and 99% take < 3 hops
• Smart prefetching
(Diagram: per-node coherency engine with its index and page table/MMU, prefetching pages between the nodes' RAM.)
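The write-invalidate model named on this slide can be illustrated with a minimal directory: before a node writes a shared page, every other copy is invalidated, so later readers fault and re-fetch the new version. This is a flat-directory toy of our own; the deck's engine uses a distributed index with O(log n) worst-case hops, which this sketch does not model.

```python
# Minimal write-invalidate coherency sketch: the directory tracks which
# nodes hold a copy of each page; a write invalidates all other copies.

class Directory:
    def __init__(self):
        self.copies = {}           # page -> set of nodes holding a copy

    def read(self, node, page):
        self.copies.setdefault(page, set()).add(node)

    def write(self, node, page):
        holders = self.copies.get(page, set())
        invalidated = holders - {node}
        self.copies[page] = {node}   # writer now holds the only valid copy
        return invalidated           # nodes whose copy must be dropped

d = Directory()
d.read(0, "p1"); d.read(1, "p1"); d.read(2, "p1")
assert d.write(3, "p1") == {0, 1, 2}   # all sharers invalidated on write
assert d.copies["p1"] == {3}           # writer holds the exclusive copy
```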
20. RDMA Engine
• In-kernel implementation
• Optimised memory use
• Fabric-agnostic (as long as we have RDMA verbs)
• Tested on: iWARP (Chelsio), IB (ConnectX2, QDR), SoftIwarp, RoCE, SoftRoCE
• Multiplexes groups/apps
• Flow control / multi-queue support
• Under development: multi-NIC support for a single group; NIC multiplexing (active/active failover)
21. Raw Performance: How Fast Can We Move Memory Pages?
22. Hard Page Fault Resolution Performance (10 GB sequential R/W scan)
Average resolution time per 4 KB page, and 4 KB page IOPS with prefetch:
• SoftIwarp (10 GbE): 330 µs, 50k/s
• iWARP (10 GbE): 28 µs, 250k/s
• Infiniband (QDR): 16 µs, 650k/s
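A back-of-envelope check of the Infiniband row: with one outstanding fault at a time, a 16 µs round trip caps a stream at 62,500 faults/s, so the quoted 650k pages/s implies prefetch covers roughly 10 pages per round trip, and the resulting ~21 Gbps is a plausible fraction of QDR's 32 Gbps usable data rate. The derivation is ours, computed only from the table's numbers:

```python
# Sanity-check the Infiniband QDR row of the table above.
latency_s = 16e-6            # average hard-fault resolution time
iops_prefetch = 650_000      # quoted 4 KB page IOPS with prefetch
page_bytes = 4096

sync_iops = 1 / latency_s                          # one fault in flight
pages_per_roundtrip = iops_prefetch / sync_iops    # implied prefetch depth
throughput_gbps = iops_prefetch * page_bytes * 8 / 1e9

assert round(sync_iops) == 62_500
assert 10 < pages_per_roundtrip < 11     # prefetch covers ~10 pages per fault
assert 21 < throughput_gbps < 22         # ~21.3 Gbps of QDR's 32 Gbps data rate
```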
23. Average Compounded Page Fault Resolution Time (R/W with Prefetch)
(Chart: resolution time in microseconds, roughly 1-6 µs, for 1 to 8 threads, comparing iWARP 10 GbE and IB 40 Gbps under sequential, binary-split, and random-walk access patterns.)
24. Use Case: In-Memory Database. Scaling Out HANA Memory
25. Memory Scale-Out for SAP HANA: Benchmark
Hardware:
• Server with Intel Xeon Westmere: 4 sockets, 10 cores each, 1 TB RAM
• Fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX2; 10 GbE Ethernet switch + Chelsio T422 NIC
Virtual machine:
• Large size: 1 TB RAM, 40 vCPU; hypervisor: KVM (note: 5-7% overhead due to virtualization vs barebone)
• Application: SAP HANA (in-memory database)
• Workload: OLAP (TPC-H variant); data size ~2.5 TB uncompressed => ~300 GB compressed; 18 different queries; 15 iterations of each query set
26. Use Case: Scaling Out HANA Virtual Memory
The VM sees 4 TB of RAM while constrained to 1 TB of physical RAM (scratch pad); the rest is backed by remote nodes over the RDMA fabric, each with 1 TB of virtual memory allocated. We never consume more than 4 TB of physical RAM at any time.
27. Scaling Out HANA (Half of Memory Remote)
Virtual machine: 1 TB RAM, 40 vCPU; application: HANA (in-memory database); workload: OLAP (TPC-H variant), ~300 GB compressed data set.
Hardware: Intel Xeon Westmere, 4 sockets, 10 cores, 1 TB RAM; fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX2.
HECA overhead per query set: ~3% at 1, 20, 40, and 60 users.
(Chart: query-set completion time in seconds, 0-500, versus number of users, baseline vs half-half.)
28. Per-Query Overhead
(Chart: overhead per query, -50% to 200%, for queries 1-18 at 1, 20, and 40 users.)
29. Scaling Out HANA
Virtual machine: 1 TB RAM, 40 vCPU; application: HANA (in-memory database); workload: OLAP (TPC-H variant), 80 users; ~300 GB compressed data set.
Hardware: Intel Xeon Westmere, 4 sockets, 10 cores, 1 TB RAM; fabric: Infiniband QDR 40 Gbps switch + Mellanox ConnectX2.
HECA overhead per query set (80 users) by memory ratio:
• 1:2 ratio: 4%
• 1:3 ratio: 5.6%
• 2:1:1 ratio: 0.9%
• 1:1:1 ratio: 2%
30. High Availability of the Memory Node (RRAIM) Running HANA
• Memory region backed by two remote nodes
• Remote page faults and swap-outs are initiated simultaneously to all relevant nodes
• No immediate effect on the computation node upon failure of a node
• When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node
RRAIM overhead vs Heca (medium, 128 GB, SAP-H benchmark, 80 users) by memory ratio:
• 1:2 ratio: -0.2%
• 1:3 ratio: -0.1%
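The RRAIM idea on this slide, mirrored swap-outs so a fault can be served even if one backer fails, can be sketched as follows. This models only the redundancy logic, not the RDMA path or resynchronization; the class and method names are ours.

```python
# Sketch of RRAIM: every swap-out is written simultaneously to all remote
# backers, so a later page fault survives the failure of any single backer.

class RRAIMRegion:
    def __init__(self, backers):
        self.backers = {b: {} for b in backers}   # node -> its page store
        self.failed = set()

    def swap_out(self, page, data):
        for store in self.backers.values():       # mirrored write to all
            store[page] = data

    def fault_in(self, page):
        for node, store in self.backers.items():
            if node not in self.failed and page in store:
                return store[page]                # first live backer answers
        raise RuntimeError("all backers lost")

r = RRAIMRegion(backers=["node-b", "node-c"])
r.swap_out(0x2000, b"hot data")
r.failed.add("node-b")                    # one mirror dies
assert r.fault_in(0x2000) == b"hot data"  # computation node is unaffected
```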
31. Enterprise-Class Features
• Online RAM deduplication (via KSM): before/after dedup
• Automatic memory tiering: local memory, remote memory, compressed memory, SSD/NVRAM, HDD, across local and remote nodes
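KSM-style deduplication collapses pages with identical content into one shared physical copy (broken apart again by copy-on-write on the next store). A toy content-hash illustration of that before/after picture; this is not the kernel's actual KSM scanner, which compares page contents via red-black trees rather than a hash table.

```python
# Toy content-based page deduplication in the spirit of KSM: identical
# pages collapse to one stored copy.
import hashlib

def dedup(pages):
    """Map each page to a shared copy keyed by its content hash."""
    store = {}                     # sha256 hex digest -> one shared page
    mapping = []                   # per-page reference into the store
    for p in pages:
        h = hashlib.sha256(p).hexdigest()
        store.setdefault(h, p)
        mapping.append(h)
    return store, mapping

pages = [b"\x00" * 4096, b"abc" + b"\x00" * 4093, b"\x00" * 4096]
store, mapping = dedup(pages)
assert len(store) == 2             # three pages, two unique contents
assert mapping[0] == mapping[2]    # the duplicates share one copy
```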
32. Next: Work in Progress and Possible Future Directions
33. Transitioning to a Memory Cloud: Transparent Cloud Integration
• Automatically reclaim underutilized or abandoned memory resources
• Automatically and intelligently redeploy memory workloads across the infrastructure
(Diagram: many physical nodes hosting a variety of VMs, memory demanders, memory sponsors, and combined sponsor/demander VMs, managed by cloud management services (OpenStack, Heca-NOVA) to form a memory cloud.)
34. Hecatonchire Cloud
• A hierarchical, coordinated, global system can set and manage power and budgets, respond to faults, and support enclave components that leverage machine learning
• Resource management is hierarchical, and managers are stackable
• Resource managers are integrated, customizable, and adaptable
• Sharing is avoided whenever possible; strict enforcement is costly
35. Barebone Memory Scale-Out and Fine-Grained Memory Sharing
Barebone (bare-metal) memory scale-out:
• No need for virtualization and/or a hypervisor
• Similar to a Linux control group
• Allows barebone memory scale-out for HANA or any other application
Fine-grained distributed shared memory for user space:
• Memory coherence models: shared-nothing, write-invalidate
• Read/write protection
• POSIX-style shared memory
36. Hierarchical Memory
• In the future, a significant portion of memory will be non-volatile
• Helps reduce power; helps with resilience; helps with cost
• Aim: offer transparent use of, and/or ease the use of, storage-class memory
• Transparent integration with storage-class memory: local NVRAM, remote NVRAM, hybrid DRAM/NVRAM
• Interface directly with NVMe
• Persistence / HA
37. CPU Aggregation
• Leverage ACPI virtualization features from Intel/AMD
• "Titanomachie": smart vCPU and data scheduling, mixing prefetch, vCPU migration/placement, and data/computation reordering
• Provide variable-granularity support, from cache line to huge page
38. Incremental Effort
• Prefetch
• Support for new fabrics
• Dirty-bit tracking and hardware features from next-generation Intel/AMD architectures
• MR-IOV support
39. Thank You!
Contact information: Dr. Benoit Hudzia, Benoit.Hudzia@sap.com
Hecatonchire project:
WWW: http://www.hecatonchire.com
Github: https://github.com/hecatonchire/
40. Backup Slides
41. Raw Bandwidth Usage
HW: 4-core i5-2500 CPU @ 3.30 GHz; SoftIwarp 10 GbE; iWARP Chelsio T422 10 GbE; IB ConnectX2 QDR 40 Gbps.
(Chart: total Gbit/s, 0-25, for 1 to 8 threads, comparing SoftIwarp, iWARP, and IB under sequential, binary-split, and random walks over 1 GB of shared RAM.)
Observations: maxing out bandwidth; not enough cores to saturate (?); no degradation under high load; software RDMA has significant overhead.
42. Enabling Live Migration of a HANA DB (Small Instance)
• Baseline: downtime N/A; 0% performance degradation (80 users)
• Pre-copy (standard): 7.47 s downtime; benchmark failed (HANA crash, VM unresponsive)
• Post-copy (Heca): 675 ms downtime; 5% performance degradation
43. Redundant Array of Inexpensive RAM: RRAIM
1. Memory region backed by two remote nodes; remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
2. No immediate effect on the computation node upon failure of a node.
3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
44. Quicksort Benchmark with Memory Constraint
HECA and RRAIM overhead by memory ratio (constraint using cgroups):
• 3:4 ratio: 2.08% HECA, 5.21% RRAIM
• 1:2 ratio: 2.62% HECA, 6.15% RRAIM
• 1:3 ratio: 3.35% HECA, 9.21% RRAIM
• 1:4 ratio: 4.15% HECA, 8.68% RRAIM
• 1:5 ratio: 4.71% HECA, 9.28% RRAIM
(Charts: DSM vs RRAIM overhead, 0-10%, for 512 MB, 1 GB, and 2 GB quicksort data sets.)
