HANA Memory Scale-Out Using the Hecatonchire Project


This session presents a memory scale-out solution that liberates SAP HANA, and similarly memory-demanding enterprise applications, from the classical limitations of the underlying physical servers. The solution relies on key enabling technology developed within SAP. It allows applications or hypervisors to go beyond the boundaries of the underlying hardware, effectively enabling a fluid transformation from commodity-sized physical nodes to very large virtual instances, in order to meet the rapidly growing demands of memory-intensive applications.

Speaker notes:
  • Main point: networking is now fast in comparison with other latencies; the immediate conclusion is to scale out.
  • Sequential walk: each thread starts reading at offset (256 KB / number of threads) × thread ID and stops when it reaches the start of the following thread's region.
  • Binary-split walk: the memory is split into one region per thread; each thread then does a binary-split walk within its region.
  • Random walk: each thread reads pages chosen at random from the overall memory region (no duplicates).
  • We are maxing out the 10 GbE bandwidth with iWARP.
  • We suspect that we do not have enough cores to saturate the QDR link.
  • We see almost no noticeable degradation when threads > cores.
  • SoftiWARP has significant overhead (CPU, latency, memory use).
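The three walk patterns described in the notes can be sketched as page-index generators. This is an illustrative reconstruction, not the actual benchmark code; the region size and the breadth-first reading of "binary split" are assumptions.

```python
# Illustrative sketch of the three access patterns from the speaker notes.
# Region sizes are in pages; the benchmark's real parameters may differ.
import random

def sequential_walk(region_pages, n_threads, tid):
    """Each thread reads its contiguous slice, from (region/n)*tid up to
    the start of the following thread's slice."""
    chunk = region_pages // n_threads
    return list(range(tid * chunk, (tid + 1) * chunk))

def binary_split_walk(region_pages, n_threads, tid):
    """Memory split into one region per thread; each thread visits its
    region midpoint-first, recursively (one reading of 'binary split')."""
    chunk = region_pages // n_threads
    order, spans = [], [(tid * chunk, (tid + 1) * chunk)]
    while spans:
        lo, hi = spans.pop(0)
        if lo >= hi:
            continue
        mid = (lo + hi) // 2
        order.append(mid)
        spans.append((lo, mid))       # left half, excluding mid
        spans.append((mid + 1, hi))   # right half, excluding mid
    return order

def random_walk(region_pages, rng=None):
    """Pages chosen at random from the whole region, no duplicates:
    a random permutation."""
    rng = rng or random.Random(0)
    pages = list(range(region_pages))
    rng.shuffle(pages)
    return pages
```

Each generator touches every page of its slice exactly once, so the three patterns move the same volume of data and differ only in locality.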

    1. Virtualized Memory Scale-Out for SAP HANA. Dr. Benoit Hudzia, SAP Research – Next Business and Technology. Cloud and Virtualization Week, April 2013.
    2. Agenda
       • Reducing memory TCO via scale-out
       • Memory as a service
       • HANA memory scale-out
       • Conclusion and next steps
       © 2013 SAP AG. All rights reserved.
    3. Reducing Memory TCO via Scale-Out (Hecatonchire Project – Memory Cloud)
    4. Achieving Right-Sized Servers
       • Eliminates unnecessary components
       • Increases power efficiency
       • Optimizes performance per workload
       Using high-volume components to build high-value systems.
    5. How Do We Offer Better Memory TCO?
       • Memory in the nodes of current clusters is over-provisioned in order to fit the requirements of "any" application.
       • It remains unused most of the time.
       • How can we unleash a memory-constrained application by using the memory in the rest of the nodes?
       Memory that grows with your business, not before it.
    6. Decoupling Memory from Core Aggregation
       Many shared-memory parallel applications do not scale beyond a few tens of cores, yet may still benefit from large amounts of memory:
       • In-memory databases
       • Data mining
       • VMs
       • Scientific applications
       • etc.
       Eliminates a physical limitation of the cloud / data center.
    7. Scaling Out with Fast Networking (figure by Chaim Bendalac)
       Low-latency memory transfers can reduce the need for over-provisioning while also providing much lower latency than swapping data out to disk.
    8. Memory as a Service (Hecatonchire Project – Memory Cloud)
    9. The Idea: Turning Memory into a Distributed Memory Service
       • Breaks memory out of the bounds of the physical box.
       • Transparent deployment with performance at scale, and reliability.
    10. High-Level Principle
        Diagram: memory sponsors A and B expose RAM over the network to a memory demander, whose virtual memory address space backs a memory-demanding process.
    11. Hard Page-Fault Resolution Performance (4 KB pages)
        • SoftiWARP (10 GbE): 150+ µs one-way wire time; 330 µs average resolution time with prefetch; 50k transfers/s
        • iWARP (10 GbE): 4–6 µs one-way; 28 µs average resolution with prefetch; 250k transfers/s
        • InfiniBand (40 Gbps): 2–4 µs one-way; 16 µs average resolution with prefetch; 650k transfers/s
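The transfer rates on this slide imply an effective bandwidth that can be checked with simple arithmetic; the helper below is only an illustration of that calculation.

```python
# Effective bandwidth implied by the page-fault transfer rates above.
# Rates and the 4 KB page size come from the slide; the conversion
# to Gbit/s uses decimal gigabits (1e9 bits).
PAGE_SIZE = 4096  # bytes

rates = {
    "SoftiWARP (10 GbE)": 50_000,
    "iWARP (10 GbE)": 250_000,
    "InfiniBand (40 Gbps)": 650_000,
}

def effective_gbit_per_s(pages_per_s, page_size=PAGE_SIZE):
    """Pages per second -> payload bandwidth in Gbit/s."""
    return pages_per_s * page_size * 8 / 1e9

for fabric, rate in rates.items():
    print(f"{fabric}: {effective_gbit_per_s(rate):.2f} Gbit/s")
```

iWARP's roughly 8.2 Gbit/s sits near the 10 GbE line rate, consistent with the "maxing out 10 GbE" note, while InfiniBand's roughly 21.3 Gbit/s is well below QDR's ~32 Gbit/s effective data rate, consistent with the suspicion that there are not enough cores to saturate the QDR link.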
    12. Average Compounded Page-Fault Resolution Time (with Prefetch)
        Chart: average resolution time in microseconds (roughly 1–6 µs) for 1–8 threads, comparing iWARP 10 GbE and InfiniBand 40 Gbps under sequential, binary-split, and random walks.
    13. Benchmark of Scaled-Out HANA (Hecatonchire Project – Memory Cloud)
    14. Use Case: HANA Memory Scale-Out via Memory Aggregation
        • Aggregate the RAM of a cluster into one huge HANA DB.
        Overview, given n nodes, each with X GB of RAM:
        • 1 node runs a VM with a HANA DB (high core-count-to-memory ratio).
        • n-1 nodes serve as memory containers (low core-count-to-memory ratio).
        • Deploy a VM with n*X GB of memory using Hecatonchire.
        • Instead of using swap to free up RAM, push pages to remote hosts.
        • If a page is needed again, request it back.
        Up to 30% hardware cost reduction compared to a single-node instance.
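The push/pull scheme on this slide can be sketched as a toy pager that sends evicted pages to a remote host's RAM instead of to swap, and faults them back on access. All class and method names here are illustrative, not Hecatonchire's API, and the victim-selection policy is a deliberate simplification.

```python
# Toy sketch of the slide's scheme: push cold pages to a remote host's
# RAM rather than swapping to disk; request them back on access.
# Names and the FIFO-ish eviction policy are illustrative assumptions.

class RemoteHost:
    """Stands in for a memory-container node (the n-1 sponsor nodes)."""
    def __init__(self):
        self.store = {}                  # page number -> page contents

class Pager:
    def __init__(self, local_capacity, remote):
        self.local = {}                  # resident pages
        self.capacity = local_capacity
        self.remote = remote

    def write(self, page_no, data):
        self.local[page_no] = data
        self._evict_if_needed()

    def read(self, page_no):
        if page_no not in self.local:                  # remote page fault:
            data = self.remote.store.pop(page_no)      # request it back
            self.local[page_no] = data
            self._evict_if_needed()
        return self.local[page_no]

    def _evict_if_needed(self):
        # "Instead of using swap": push the victim page to the remote host.
        while len(self.local) > self.capacity:
            victim = next(iter(self.local))            # oldest-inserted page
            self.remote.store[victim] = self.local.pop(victim)

pager = Pager(local_capacity=2, remote=RemoteHost())
for i in range(4):
    pager.write(i, f"page-{i}")
assert pager.read(0) == "page-0"   # faulted back from the remote host
```

The real system does this transparently at the MMU level over RDMA; the sketch only shows the data movement.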
    15. Memory Scale-Out for SAP HANA: Benchmark Setup
        • Application: SAP HANA (in-memory database); hypervisor: KVM
        • Workload: OLAP (TPC-H variant); ~600 GB uncompressed data; 18 different queries; 15 iterations of each query set
        • Virtual machines: small (64 GB RAM, 32 vCPUs) and medium (128 GB RAM, 40 vCPUs)
        • Hardware: servers with Intel Xeon Westmere, 4 sockets × 10 cores, 512 GB RAM
        • Fabric: InfiniBand QDR 40 Gbps switch + Mellanox ConnectX-2, and 10 GbE Ethernet switch + Chelsio T422 NICs
    16. Scaling Out HANA (Small Size)
        Chart: query-set completion time (s) for the baseline vs. a half-half memory split, at 1, 20, 40, and 60 users.
        Heca overhead per query set: ~3% at each of 1, 20, 40, and 60 users.
        VM: 64 GB RAM, 32 vCPUs (small); hardware and workload as in the setup slide (Westmere 4-socket, InfiniBand QDR + ConnectX-2, TPC-H variant on ~600 GB uncompressed data).
    17. Per-Query Overhead
        Chart: overhead per query (queries 1–18) at 1, 20, and 40 users, ranging roughly from -50% to 200%.
    18. Scaling Out HANA (Medium Size)
        Heca overhead per query set at 80 users, by memory ratio:
        • 1:2 ratio: 1%
        • 1:3 ratio: 1.6%
        • 2:1:1 ratio: 0.1%
        • 1:1:1 ratio: 1.5%
        VM: 128 GB RAM, 40 vCPUs (medium); hardware and workload as in the setup slide, with 80 users.
    19. Production- and Cloud-Ready (Hecatonchire Project – Memory Cloud)
    20. High Availability of Memory Nodes (RRAIM) Running HANA
        • Each memory region is backed by two remote nodes (RAM mirroring).
        • Remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
        • Failure of a node has no immediate effect on the computation node.
        • When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
        RRAIM overhead (medium 128 GB instance, SAP-H vs. Heca benchmark, 80 users): 1:2 ratio, -0.2%; 1:3 ratio, -0.1%.
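The RRAIM behavior described above (mirrored swap-outs, failure tolerance, resynchronization of a joining node) can be modeled in a few lines. This is a sketch of the idea only; the class and method names are illustrative, not Hecatonchire's.

```python
# Sketch of the RRAIM idea: every swap-out goes to two remote nodes
# simultaneously, so one node can fail without losing pages.
# Names are illustrative assumptions, not the real implementation.

class MemoryNode:
    def __init__(self):
        self.store = {}      # page number -> page contents
        self.alive = True

class RRAIMRegion:
    def __init__(self, primary, mirror):
        self.nodes = [primary, mirror]

    def swap_out(self, page_no, data):
        # Initiated simultaneously to all relevant (live) nodes.
        for node in self.nodes:
            if node.alive:
                node.store[page_no] = data

    def fault_in(self, page_no):
        # Any surviving replica can serve the remote page fault.
        for node in self.nodes:
            if node.alive and page_no in node.store:
                return node.store[page_no]
        raise KeyError(page_no)

    def resync(self, new_node):
        # A node joining the cluster synchronizes from a surviving replica.
        source = next(n for n in self.nodes if n.alive)
        new_node.store = dict(source.store)
        self.nodes.append(new_node)

a, b = MemoryNode(), MemoryNode()
region = RRAIMRegion(a, b)
region.swap_out(7, "page-7")
a.alive = False                         # primary fails...
assert region.fault_in(7) == "page-7"   # ...computation is unaffected
```

In the real system the mirrored writes go out as parallel RDMA operations, which is why the slide can report near-zero overhead for mirroring.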
    21. Enterprise-Class Features
        • Online RAM deduplication (via KSM)
        • Automatic memory tiering: local memory → remote memory → compressed memory → SSD → HDD
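The deduplication feature boils down to mapping identical pages onto one shared copy. A minimal content-digest sketch of that idea follows; note that real KSM compares page contents via trees and handles copy-on-write breaking, which this deliberately omits.

```python
# Minimal sketch of online RAM deduplication: pages with identical
# contents collapse to a single shared copy. Real KSM is far more
# involved (content trees, copy-on-write); this only shows the saving.
import hashlib

def deduplicate(pages):
    """Return (unique page store keyed by content digest,
    mapping from page index to digest)."""
    store, mapping = {}, {}
    for i, content in enumerate(pages):
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)   # keep first copy only
        mapping[i] = digest
    return store, mapping

# Two identical zero pages and one distinct page: 3 pages, 2 copies kept.
pages = [b"\x00" * 4096, b"\x00" * 4096, b"app" + b"\x00" * 4093]
store, mapping = deduplicate(pages)
print(f"{len(pages)} pages -> {len(store)} unique copies")
```

Zero-filled and otherwise identical pages are common in VM fleets, which is why dedup pays off for the memory-cloud scenario on this slide.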
    22. Enabling Live Migration of a HANA DB (Small Instance)
        • Baseline (no migration): downtime N/A; 0% performance degradation.
        • Pre-copy (standard): 7.47 s downtime; benchmark (80 users) failed (HANA crash, VM unresponsive).
        • Post-copy (Heca): 675 ms downtime; 5% performance degradation.
    23. Next (Hecatonchire Project – Memory Cloud)
    24. Transitioning to a Memory Cloud: Transparent Cloud Integration
        • VM roles: memory VM (memory sponsor), compute VM (memory demander), and combination VM (both memory sponsor and demander).
        • Heca-NOVA: memory-cloud management services integrated with OpenStack, spanning many physical nodes hosting a variety of VMs.
        • Automatically reclaims underutilized or abandoned memory resources.
        • Automatically and intelligently redeploys memory workloads across the infrastructure.
    25. Barebones Memory Scale-Out and Fine-Grained Memory Sharing
        Barebones memory scale-out:
        • No need for virtualization or a hypervisor; similar to a Linux control group.
        • Allows barebones memory scale-out for HANA or any other application.
        Fine-grained distributed shared memory:
        • Memory coherence model: shared-nothing, write-invalidate, with read/write protection.
    26. Virtual Distributed Shared Memory System (Compute Cloud)
        Compute-aggregation idea: a virtual machine whose compute and memory span multiple physical nodes (guest VMs over servers #1..#n, each contributing CPUs, memory, and I/O, linked by fast RDMA communication).
        Challenges:
        • Coherency protocol
        • Granularity (false sharing)
        Hecatonchire value proposition:
        • Optimal price/performance by using commodity hardware
        • Operational flexibility: node downtime without downing the cluster
        • Seamless deployment within an existing cloud
    27. Hecatonchire: Open-Source Project
        https://github.com/hecatonchire/
        http://www.hecatonchire.com
    28. Open-Source Roadmap
        Standardization:
        • Linux kernel upstream
        • KVM/QEMU upstream
        • OpenStack module (Nova-Heca)
        Collaboration, collaboration, collaboration!
        • Academic and industrial
    29. Thank You!
        Contact information: Dr. Benoit Hudzia, Benoit.Hudzia@sap.com
        Hecatonchire project: http://www.hecatonchire.com and https://github.com/hecatonchire/
    30. Backup Slides
    31. How Does It Work (Simplified Version)
        On physical node A, a virtual-address miss in the MMU (+ TLB) finds a page-table entry marked as a remote PTE (a custom swap entry). The coherency engine extracts the entry and issues a page request through the RDMA engine over the network fabric. On physical node B, the coherency engine invalidates the corresponding PTE, extracts the page, prepares it for RDMA transfer, and sends back the page response. Node A then updates its MMU and page table with the received page.
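The request/response flow above can be modeled as two nodes with per-page PTE states, the faulting node flipping a page from a remote (swap-entry) state to present while the sponsor invalidates its copy. The states and names below are illustrative stand-ins, not the real kernel structures.

```python
# Toy model of the simplified flow on this slide: a miss on node A hits
# a PTE marked remote, A requests the page, node B invalidates its PTE
# and responds, and A maps the page. Names/states are illustrative.

REMOTE, PRESENT, INVALID = "remote", "present", "invalid"

class Node:
    def __init__(self):
        self.pte = {}    # virtual page -> (state, owner-or-None)
        self.mem = {}    # resident frames: virtual page -> data

    def handle_page_request(self, vpage):
        """Sponsor side (node B): invalidate the PTE, extract the page."""
        data = self.mem.pop(vpage)
        self.pte[vpage] = (INVALID, None)
        return data                       # the "page response"

    def access(self, vpage):
        """Demander side (node A): resolve a hard fault on a remote PTE."""
        state, owner = self.pte[vpage]
        if state == REMOTE:                           # hard page fault
            data = owner.handle_page_request(vpage)   # request/response
            self.mem[vpage] = data                    # update page table
            self.pte[vpage] = (PRESENT, None)
        return self.mem[vpage]

a, b = Node(), Node()
b.mem[0] = "hot page"
b.pte[0] = (PRESENT, None)
a.pte[0] = (REMOTE, b)        # custom swap entry pointing at the sponsor
assert a.access(0) == "hot page"
assert b.pte[0] == (INVALID, None)
```

In the real system the request and response travel over RDMA and the PTE manipulation happens inside the kernel's fault handler; the sketch only captures the state transitions.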
    32. Raw Bandwidth Usage
        Hardware: 4-core i5-2500 CPU @ 3.30 GHz; SoftiWARP 10 GbE, iWARP Chelsio T422 10 GbE, InfiniBand ConnectX-2 QDR 40 Gbps.
        Chart: total Gbit/s for sequential, binary-split, and random walks over 1 GB of shared RAM, at 1–8 threads, for each fabric. Annotations: iWARP maxes out the 10 GbE bandwidth; possibly not enough cores to saturate QDR; no degradation under high load; software RDMA (SoftiWARP) has significant overhead.
    33. Redundant Array of Inexpensive RAM (RRAIM)
        1. Each memory region is backed by two remote nodes; remote page faults and swap-outs are initiated simultaneously to all relevant nodes.
        2. Failure of a node has no immediate effect on the computation node.
        3. When a new remote node enters the cluster, it synchronizes with the computation node and the mirror node.
    34. Quicksort Benchmark with Memory Constraint (constrained using cgroups)
        Heca (DSM) and RRAIM overhead by memory ratio, across 512 MB, 1 GB, and 2 GB datasets:
        • 3:4 ratio: 2.08% Heca, 5.21% RRAIM
        • 1:2 ratio: 2.62% Heca, 6.15% RRAIM
        • 1:3 ratio: 3.35% Heca, 9.21% RRAIM
        • 1:4 ratio: 4.15% Heca, 8.68% RRAIM
        • 1:5 ratio: 4.71% Heca, 9.28% RRAIM
    35. (Backup) Per-query overhead chart, duplicate of slide 17: overhead per query (queries 1–18) at 1, 20, and 40 users.
