HyperLoop: Group-Based NIC Offloading
to Accelerate Replicated Transactions
in Multi-tenant Storage Systems
Daehyeok Kim
Amirsaman Memaripour, Anirudh Badam, Yibo Zhu, Hongqiang Harry Liu
Jitendra Padhye, Shachar Raindel, Steven Swanson
Vyas Sekar, Srinivasan Seshan
Multi-tenant Storage Systems
2
[Figure: a storage frontend replicating data to a chain of replica servers]
• Replicated transactions
  – Data availability and integrity
  – Consistent and atomic updates
  – e.g., chain replication
• Multiple replica instances are co-located on the same server
Problem: Large and Unpredictable Latency
• Both average and tail latencies increase as the number of tenants grows
• The gap between average and tail latency also widens
3
[Chart: YCSB on MongoDB; x-axis: number of tenants on a server (9–27); y-axis: latency in ms (0–160); average vs. 99th-percentile latency]
CPU Involvement on Replicas
4
• CPU involvement for executing and forwarding transactional operations
• Arbitrary CPU scheduling → unpredictable latency
• Replicas’ CPU utilization hits 100%
[Figure: frontend and two replicas in a chain. On each replica, the Storage/Net library (run by CPUs) sits on the critical path of operations: it (1) executes the logging of DATA into storage and (2) forwards the logging to the next replica, with the tail replica returning an ACK.]
Our Goal
5
[Figure: today's storage system, where the Storage/Net libraries run by CPUs sit on the critical DATA/ACK path, versus our goal, where only the frontend software and the replicas' NICs and storage remain on the critical path]
Removing replica CPUs from the critical path!
Our Work: HyperLoop
• Framework for building fast chain-replicated transactional
storage systems, enabled by three key ideas:
1. RDMA NIC + NVM
2. Leveraging the programmability of RDMA NICs
3. APIs covering key transactional operations
• Minimal modifications for existing applications
• e.g., 866 lines for MongoDB out of ~500K lines of code
• Up to 24x tail latency reduction in storage applications
6
Outline
• Motivation
• HyperLoop Design
• Implementation and Evaluation
7
Idea 1: RDMA NIC + NVM
• RDMA (Remote Direct Memory Access) NICs
• Enables direct memory access over the network without involving remote CPUs
• NVM (Non-Volatile Memory)
• Provides a durable storage medium for RDMA NICs
8
[Figure: replica stack with Idea 1 (NVM/RDMA library, RDMA NIC, NVM) replacing today's stack (Storage/Net library, NIC, storage)]
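For illustration only (this is not HyperLoop's actual code), a minimal sketch of Idea 1 on the replica side: an NVM-backed buffer is registered with the RNIC so that remote NICs can WRITE into durable memory without involving the local CPU. The tmpfs-backed file mirrors the evaluation's NVM emulation; the file path and function name are assumptions.

```c
#include <stddef.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Sketch: expose a log region living on (emulated) NVM to remote RNICs.
 * 'pd' is an already-created protection domain; error handling is omitted.
 * A tmpfs-backed file stands in for NVM, as in the paper's evaluation. */
static struct ibv_mr *register_nvm_log(struct ibv_pd *pd, size_t len)
{
    int fd = open("/dev/shm/replica_log", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, len);
    void *log = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Remote WRITE access lets the previous node in the chain log data
     * directly into durable memory without waking this replica's CPU. */
    return ibv_reg_mr(pd, log, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}
```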
Roadblock 1: Operation Forwarding
• RDMA NICs can execute the logging operation
• CPUs are still involved to request the RNICs to forward operations
9
[Figure: the frontend issues Logging DATA as an RDMA WRITE; each replica RNIC writes the DATA into its NVM log, but the replica software, run by CPUs, must still (1) detect the completed logging and (2) request the RNIC to forward the logging along the group of replicas, with the tail replica forwarding the ACK.]
Roadblock 2: Transactional Operations
• CPUs are involved to execute and forward operations
• RDMA NIC primitives do not support some key transactional operations
(e.g., locking, commit)
10
[Figure: the frontend issues a Commit log operation; because RNIC primitives cannot copy DATA from the log region to the data region, the replica software, run by CPUs, must (1) execute the commit and (2) forward the commit along the group of replicas, with the tail replica forwarding the ACK.]
Can We Avoid the Roadblocks?
11
[Figure: today's storage system versus the offloaded design, where replication functionality moves from the CPU-run Storage/Net libraries into the RNICs backed by NVM, taking replica CPUs off the critical path of operations]
Pushing replication functionality to RNICs!
Idea 2:
Leveraging the Programmability of RNICs
• Commodity RDMA NICs are not fully programmable
• Opportunity: RDMA WAIT operation
• Supported by commodity RDMA NICs
• Allows a NIC to wait for the completion of an event (e.g., receiving a message)
• Upon completion, triggers the NIC to perform pre-programmed operations
12
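To make the WAIT mechanism concrete, here is a rough sketch of arming a replica RNIC with "wait for one receive completion, then release the operations queued behind it". This is not HyperLoop's code: the experimental cross-channel verbs shown (verbs_exp.h, IBV_EXP_WR_CQE_WAIT, ibv_exp_post_send) follow MLNX_OFED's extended verbs as best understood here, and the exact header, structure, and field names vary across releases, so treat them as assumptions.

```c
#include <string.h>
#include <infiniband/verbs.h>
#include <infiniband/verbs_exp.h>   /* MLNX_OFED experimental verbs (assumed) */

/* Sketch only: post a WAIT work request so that one completion on recv_cq
 * unblocks whatever is queued after it on this QP. Field names are
 * best-effort and version-dependent, not a definitive API reference. */
static int arm_wait_then_forward(struct ibv_qp *qp, struct ibv_cq *recv_cq)
{
    struct ibv_exp_send_wr wait_wr, *bad_wr;
    memset(&wait_wr, 0, sizeof(wait_wr));

    wait_wr.exp_opcode              = IBV_EXP_WR_CQE_WAIT; /* wait for a CQE  */
    wait_wr.task.cqe_wait.cq        = recv_cq;             /* on the recv CQ  */
    wait_wr.task.cqe_wait.cqe_count = 1;                   /* one message     */

    /* The WRITE this WAIT unblocks was posted earlier with an empty
     * parameter template; the frontend fills in (Src, Dst, Len) at run time. */
    return ibv_exp_post_send(qp, &wait_wr, &bad_wr);
}
```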
Bootstrapping – Program the NICs
13
• Step 1: Frontend library collects the base addresses of memory regions registered to replica RNICs
• Step 2: HyperLoop library programs replica RNICs with RDMA WAIT and the template of the target operation
[Figure: frontend and replicas R1 and R2. R1's RNIC is programmed with "1. WAIT for receiving, 2. Forward WRITE with Param (Src, Dst, Len)"; R2's RNIC with "1. WAIT for receiving, 2. Forward ACK".]
Forwarding Operations
14
• Idea: Manipulating the parameter regions of programmed operations
• Replica NICs can then forward operations with the proper parameters
[Figure: the frontend issues the Update log WRITE (DATA to R1-0xA) and, immediately after, a SEND carrying Param (R1-0xA, R2-0xB, 64). Receiving these satisfies R1's programmed WAIT, so R1's RNIC forwards the WRITE (DATA to R2-0xB) with the filled-in parameters; R2's RNIC in turn forwards the ACK. Replica CPUs stay idle throughout.]
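As a purely illustrative sketch of the frontend-side bookkeeping: given the base addresses collected at bootstrap, the frontend can compute the (Src, Dst, Len) triple for each hop and push it to the replica with an ordinary SEND. The struct layout, array, and function below are assumptions, not HyperLoop's wire format.

```c
#include <stdint.h>

#define MAX_REPLICAS 16   /* illustrative bound, not from the paper */

/* Hypothetical layout of the parameter region behind a pre-posted WRITE;
 * it only illustrates the (Src, Dst, Len) triple shown on the slide. */
struct hop_param {
    uint64_t src;   /* address in this replica's log region (e.g., R1-0xA)  */
    uint64_t dst;   /* address in the next replica's region (e.g., R2-0xB)  */
    uint32_t len;   /* number of bytes to forward                           */
};

/* Base addresses of the replicas' log regions, collected at bootstrap. */
static uint64_t log_base[MAX_REPLICAS];

/* Parameters the frontend SENDs right after the data WRITE; receiving this
 * SEND satisfies the replica RNIC's WAIT and triggers the forwarded WRITE. */
static struct hop_param make_hop_param(int hop, uint64_t log_off, uint32_t len)
{
    struct hop_param p = {
        .src = log_base[hop]     + log_off,
        .dst = log_base[hop + 1] + log_off,
        .len = len,
    };
    return p;
}
```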
Idea 3:
APIs for Transactional Operations
Transactional Operations     | Memory Operations                    | HyperLoop Primitives
Logging                      | Memory write to log region           | Group Log
Commit                       | Memory copy from log to data region  | Group Commit
Locking/Unlocking            | Compare-and-swap on lock region      | Group Lock/Unlock
15
See the paper for details!
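The implementation exposes these primitives as C APIs; the declarations below are a hypothetical sketch of what such an interface could look like. The names, signatures, and the opaque group handle are assumptions; HyperLoop's actual API may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C interface for the group primitives; names and types are
 * illustrative, not HyperLoop's published API. */
typedef struct hl_group hl_group_t;   /* a chain of replica RNICs */

/* Append 'len' bytes to every replica's log region (memory write). */
int hl_group_log(hl_group_t *g, const void *buf, size_t len);

/* Copy a committed log entry into every replica's data region (memory copy). */
int hl_group_commit(hl_group_t *g, uint64_t log_off, uint64_t data_off, size_t len);

/* Acquire/release a lock word on every replica (compare-and-swap). */
int hl_group_lock(hl_group_t *g, uint64_t lock_off);
int hl_group_unlock(hl_group_t *g, uint64_t lock_off);
```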
Transaction with HyperLoop Primitives
16
[Figure: frontend software with the HyperLoop library driving two replicas; each replica runs the HyperLoop library and has an RNIC and NVM holding log and data regions]
1. Update log (Group Log)
2. Grab a lock (Group Lock)
3. Commit the log (Group Commit)
4. Release the lock (Group Unlock)
Replica NICs can execute and forward operations!
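Using the hypothetical interface sketched after the primitives table, the four steps above might compose as follows; this is a sketch under the same naming assumptions, not HyperLoop's code.

```c
/* Sketch of one replicated write transaction with the hypothetical API. */
int replicated_put(hl_group_t *g, const void *val, size_t len,
                   uint64_t log_off, uint64_t data_off, uint64_t lock_off)
{
    if (hl_group_log(g, val, len))            /* 1. append to every log   */
        return -1;
    if (hl_group_lock(g, lock_off))           /* 2. grab the global lock  */
        return -1;
    int rc = hl_group_commit(g, log_off, data_off, len);  /* 3. commit    */
    hl_group_unlock(g, lock_off);             /* 4. release the lock      */
    return rc;
}
```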
HyperLoop: Putting It All Together
17
[Figure: frontend and replica stacks (HyperLoop library, RNIC, NVM) annotated with the three ideas]
• Idea 1: RDMA NIC + NVM
• Idea 2: Leveraging the programmability of RNICs
• Idea 3: APIs for transactional operations: Group Log, Commit, Lock/Unlock
Outline
• Motivation
• HyperLoop Design
• Implementation and Evaluation
18
Implementation and Case Studies
• HyperLoop library
• C APIs for HyperLoop group primitives
• Modify user-space RDMA verbs library and NIC driver
• Case Studies
• RocksDB: Add the replication logic using HyperLoop library
(modify 120 LOCs)
• MongoDB: Replace the existing replication logic with HyperLoop library
(modify 866 LOCs)
19
Evaluation
• Experimental setup
• 8 servers with two Intel 8-core CPUs, 64 GB RAM, Mellanox CX-3 NIC
• Use tmpfs to emulate NVM
• Baseline implementation
• Provides the same APIs as HyperLoop, but involves replica CPUs
• Experiments
• Microbenchmark: Latency reduction in group memory operations
• Application benchmark: Latency reduction in applications
20
Microbenchmark – Write Latency
21
[Chart: 99th-percentile write latency vs. data size (128 B to 8 KB); log-scale latency in μs (1 to 10^5); HyperLoop-99th is up to 801.8x lower than Baseline-99th]
Application benchmark – MongoDB
22
22
[Chart: 99th-percentile latency in ms (0 to 8) across YCSB workloads A, B, D, E, and F; HyperLoop-99th is 4.9x lower than Baseline-99th on workload A]
Other Results
• Latency reduction for other memory operations on a group:
• Memory copy: 496 – 848x
• Compare-and-Swap: 849x
• Tail latency reduction for RocksDB: 5.7 – 24.2x
• Latency reduction regardless of group sizes
23
Conclusion
• Predictable low latency is lacking in replicated storage systems
• Root cause: CPU involvement on the critical path
• Our solution: HyperLoop
– Offloads transactional operations to commodity RNICs + NVM
– Minimal modifications for existing applications
• Result: up to 24x tail latency reduction in in-memory storage apps
• More opportunities
– Other data center workloads
– Efficient remote memory utilization
24
Backup slides
25
Background: Work Queue in RDMA NICs
• A work queue is a FIFO queue in memory
• Each queue entry is a “work request”
• A work request includes the operation and its parameters (e.g., addresses, length)
• The NIC fetches and executes the work requests via DMA
26
[Figure: a send work queue holding WRITE work requests (OP: WRITE, Addrs, Len) that the NIC fetches via DMA and executes]
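For reference, a minimal example of building and posting such a work request with the standard verbs API; the queue pair, memory region, and remote address/rkey are assumed to be set up elsewhere, and the function name is illustrative.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post one RDMA WRITE work request: the queue entry carries the opcode plus
 * its parameters (local address/length, remote address/rkey), which the NIC
 * later fetches and executes. Setup of qp/mr/remote state is omitted. */
static int post_write(struct ibv_qp *qp, struct ibv_mr *mr,
                      uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```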
Editor's Notes
  1. Thanks for the introduction. Hi I am Daehyeok Kim. I am a PhD student at Carnegie Mellon. Today, I will tell you how we can make replicated storage transactions faster by offloading them to the NIC!
  2. To give you some context for this work, many online services today rely on replicated storage service and systems, for instance, Azure storage, amazon s3 and open source implementations such as MongoDB and RocksDB. To achieve high availability, integrity and consistency, these systems use some form of replicated transactions. (click) For instance, chain replication is a popular replication protocol used by these storage services today. (click) Notice that these storage systems are inherently multitenant .. which means that 100s of replica instances can be co-located on a single server to improve resource utilization. (complicated)
  3. However, we have seen that such replicated storage systems can suffer from high and unpredictable latencies! (click) This result shows one such measurement -- the latency of MongoDB in a multi-tenant setting running the YCSB which is a widely used database benchmark workload; the x-axis is the number of tenants in a server, y-axis is the latency for running the workload. And, the blue and red lines represent the average and tail latencies, respectively. (click) We see that as the number of tenants on a server grows, the average and tail latencies increase. Further, the gap between average and tail grows quite rapidly. Naturally, such large and unpredictable latencies impact the end-to-end performance of storage-intensive applications.
  4. So, why does this happen? To understand this, it is useful to look at the software and hardware stack of these storage applications Let us use chain replication as an example --  Suppose there are two replicas and a frontend server in a replication group forming a chain. (click) Assume that the frontend server wants to append the DATA to the log with logging operation. There are two typical steps each replica performs: (click) First, when the replica software receives the operation along with DATA, it first executes the operation; in logging, it will store data to the storage. (click) The next step is to forward the operation to the next replica in the chain. The same thing happens when processing other transactional operations such as commit and locking. (click) Replica software performs the executing and forwarding steps with storage and network software stacks, which are scheduled and run by CPUs. (click) When CPUs are busy, arbitrary scheduling delay can occur while processing the incoming transaction requests, which can impact the replication latency. Indeed, in our experiments we observe replica CPU utilization hits 100% ! Note that in a multi-tenant case, since multiple replica instances are contending for limited CPU, the latency problem will become worse.
  5. Our goal is to improve the latency of replicated storage systems. (click) Ideally, we want to remove the replica CPUs from the critical path and offload the operations to the NICs and storage!
  6. To meet this goal, we present HyperLoop, a framework to enable fast replicated transactional storage systems. HyperLoop by itself is not a full end-to-end storage system, but rather it provides building blocks that developers can use to build such systems. In particular, we focus on chain replication as it’s widely used today. There are three key ideas in our design: (click) First, we leverage the combination of two hardware technologies, RDMA and Non-Volatile Memory, to enable the NIC to directly replicate data to durable storage. To overcome some practical challenges in realizing this idea, we leverage the programmability supported by commodity RDMA NICs and propose a set of primitive APIs that covers key transactional operations. (click) Our results show HyperLoop can reduce tail latency by up to 24 times in in-memory storage applications. On a practical note, realizing these benefits requires only minimal modifications to existing applications. For instance, integrating HyperLoop into MongoDB requires 866 lines of modifications out of around 500 thousand lines of source code.
  7. Now let me explain the three key ideas in the HyperLoop framework.
  8. To remove replica CPUs from the critical path of chain replication, we observe an opportunity to combine two hardware technologies -- RDMA and NVM. RDMA allows us to access remote memory region through the network without involving remote CPUs.   Now, NVM or Non-Volatile Memory can provide RDMA NICs with durable storage, so the NICs can write data to durable remote regions. However, just writing data to NVMs via RDMA NICs is not enough for offloading replicated transactional operations from the CPUs as we will see.
  9. The first challenge is that replicas’ CPUs are still involved to forward operations to a group of replicas. Let me bring back the logging example to explain this. Here, Logging operation can be replaced with RDMA WRITE, since what it does is writing data to the log region in the memory. (click) Thus, the NIC on replicas can execute logging operation on behalf of CPUs. (click) However, replica CPUs still need to be involved to forward the operation.
  10. The second challenge is that while existing RDMA primitives might be sufficient for data logging, they cannot perform other key transactional operations, which are required by most storage systems today. Let me explain this problem using the COMMIT operation as an example. (click) When the frontend server issues a commit operation, the data in the log region has to be copied to the data region in memory. (click) However, since RDMA NICs do not natively support such a commit operation, the replica CPUs need to execute the operation as well as forward it as previously mentioned. Similarly, executing other transactional operations such as locking will also require the replica CPUs.
  11. So, can we address these challenges to let the NICs execute and forward the transactional operations? (click) In particular, is it possible to push the replication and transaction functionality of replicas into the NICs?
  12. At first glance, it seems that we need to effectively put the software stack on to the NIC -- that is, putting full-fledged programmability into your RDMA NICs. Fortunately, we observe that there is a cool feature called RDMA WAIT that we can repurpose to make the replicas’ NICs forward operations without CPUs! Essentially, the WAIT operation allows a NIC to wait for the completion of an event, for example receiving, and upon the completion of the event, it triggers the NIC to perform pre-programmed operations. While this set of operations is indeed limited, we found it is sufficient for our needs! Next I will describe at a high level how we can leverage this feature.
  13. Let us first look at the bootstrapping phase of our mechanism. Suppose there are two replicas, R1 and R2 and storage frontend server as shown in the figure. (click) As soon as the replication group is established, the HyperLoop library in the frontend server collects the base addresses of memory regions registered to replica RNICs. The library will use this information in the data path as we will see later. (click) Then the library on replicas programs the NICs with the RDMA WAIT and the template of target operations, for instance WRITE. The template contains an empty parameter values for the operation. By using WAIT operation, the NIC will wait for a completion of a receiving event and be triggered to forward the programmed operation.
  14. After the bootstrapping phase, the replicas CPUs do not need to be involved in the data path. Now, although the NICs can forward the programmed operation as soon as it detects a receiving event, we first need to set the parameter of operation correctly without involving CPUs. (click) To do this, we design a mechanism to enable the frontend software to manipulate the parameters of the operation programmed on replicas NIC. Let me bring back the previous example to explain how our idea works. (click) To update the log, the frontend server issues a WRITE operation to the first replica. (click) And immediately after this, it will also send parameters for the WRITE operation programmed on the NICs. And upon receiving this, the replica’s NIC will update the parameter of the WRITE operation as shown in the figure without involving a CPU. (click) This receiving event will also trigger the NIC to forward the WRITE operation with updated parameters. (click) With this technique, the replica NICs are now being able to forward the operation with the proper parameter configured by the frontend server without involving replicas CPUs.
  15. Ok, then the only thing remaining is making the NICs execute transactional operations. Let’s see how we can do it. As mentioned earlier, existing RDMA primitives do not support key transactional operations such as locking and commit. Fortunately, we show by construction that it is indeed possible to abstract these transactional operations into memory operations. (click) In particular, we can transform logging into a memory write, commit into a local memory copy, and locking into an atomic compare-and-swap. While RNICs natively support the compare-and-swap operation, they do not support memory copy, so to offload memory copy to the NIC we design a memory copy primitive. (click) Combining this memory abstraction with our operation forwarding mechanism, we design HyperLoop primitives to execute transactional operations on a group of replicas. The HyperLoop primitives consist of Group Log, Commit, Lock, and Unlock. (click) Please refer to our paper for the detailed design of each memory primitive.
  16. Now let me show you how a write operation in a typical storage application can be performed with HyperLoop primitives. The frontend server first updates the log with the group log. Then it tries to grab a global lock with the group lock. If it grabs the lock successfully, it will commit the log using group commit. Then, it will release the lock using group unlock. (click) So, with Hyperloop primitives, the storage application can perform transactional operations on a group of replicas without involving their CPUs.
  17. To summarize, our main goal was to remove replicas’ CPUs from the critical path of chain-replicated storage transactions. We achieve this goal using three key ideas: (click) First, we leverage the combination of RDMA NIC and NVM to offload transactional operations from CPUs. And to solve the practical challenges, (click) we show how the limited programmability of RNICs suffices to forward operations to a group of replicas (click) and we provide new APIs that cover key transactional operations such as logging, locking, and commit. Developers can use HyperLoop primitives to build fast replicated transactional systems.
  18. We implemented the HyperLoop primitives as a C API library. To implement the operation forwarding mechanism, we modified parts of the user-space RDMA verbs library and the NIC driver. As case studies, we integrate HyperLoop into two in-memory databases, RocksDB and MongoDB. Since RocksDB does not natively support replication, we added replication functionality using HyperLoop by modifying around 120 lines of code. For MongoDB, we replaced the existing replication logic with HyperLoop by modifying around 800 lines of code.
  19. To evaluate the benefits of HyperLoop, we use a testbed consisting of 8 servers equipped with RDMA NICs. In the evaluation, we assume storage nodes are equipped with battery-backed DRAM which is a form of NVM available today. So we use tmpfs to emulate Non-volatile memory on top of DRAM. As a comparison point, we implemented the baseline library which provides the same set of APIs as HyperLoop library, but involves replica CPUs to handle operations. We evaluated the performance benefits using Microbenchmark and application benchmark workload.
  20. Using the microbenchmark, we measured the latency of each primitive while emulating busy CPUs on replicas by introducing CPU-intensive tasks. (click) This graph shows the tail latency when performing memory writes on a group of replicas; the x-axis is data size and the y-axis is log-scaled latency. (click) We can see that compared to the baseline, HyperLoop reduces 99th-percentile tail latency by around 800 times. We also observed that the average latency is reduced by about 30 times compared to the baseline. This result demonstrates that offloading the memory operations to NICs is beneficial to reduce both tail and average latencies.
  21. And using the application benchmark, we evaluated the performance benefits for existing storage applications. I will show you the results for MongoDB in this talk. As mentioned earlier, we modified the backend part of MongoDB to integrate our baseline and Hyperloop library. Here, X-axis represents different YCSB workloads and the Y-axis represents the latency for running each workload. We emulate the case where 10 tenants instances are co-located. We can see that HyperLoop can reduce tail end-to-end latency of MongoDB for all workload. (click) For instance, it reduces tail latency by around 5 times for workload A.
  22. We have other results in the paper showing the latency reduction when performing memory copy and compare-and-swap on a group of replicas. Also, we showed that HyperLoop can reduce the tail latency for RocksDB by up to 24 times.
  23. To conclude the talk, predictable low latency is lacking in replicated storage systems today. The root cause is CPU involvement on replicas. To solve this problem, we propose HyperLoop, which accelerates transactional operations by offloading them to a group of NICs with the support of NVM. And we showed that it can be applied to existing applications with minimal modifications and reduce both the average and tail latencies. Our work shows the feasibility and benefits of abstracting transactions into memory operations and offloading them to a group of NICs in the context of storage systems. There are lots of opportunities for future work in terms of applying our insights to other data center workloads and generalizing this approach to utilizing remote memory more efficiently and effectively. With that, I’m happy to take any questions. Thank you.
  24. And we evaluate the performance benefits for real-world storage applications. As mentioned earlier, we modified the backend part of MongoDB to integrate our baseline and Hyperloop library. Here, X-axis represents different YCSB workloads and the Y-axis represents the latency for running each workload. We emulate the case where 10 tenants instances are co-located. We can see that HyperLoop can reduce both average and tail end-to-end latency of MongoDB for all workload. For instance, it reduces tail latency by around 5 times compared to the baseline for workload A.
  25. However, it turns out that just writing data to NVMs via RDMA NICs is not enough for entirely offloading replicated transactional operations from the CPUs. There are two key reasons: The first reason is that replicas’ CPUs are still involved to let RDMA NICs to forward operations to a group of replicas. I will use the same WRITE operation as an example to explain this. (click) The storage frontend software requests the RNIC to issue WRITE operation to replicate data to a group of replica. Upon receiving this operation, the RNIC on the first replica will perform the operation on the NVM. (click) Then, the replica software which is run by the host CPU will notice the WRITE operations has been completed. And then it requests the NIC to send the WRITE operation with data which just has been replicated to the log region. (click) And the replica software on the last replica in the chain will send an ACK to the frontend server once it confirms the WRITE operation has been completed, which also involves the host CPU. So, this involves replicas CPUs to forward WRITE operations to another replica in the group.
  26. The second reason of involving replicas’ CPUs is that while existing RDMA primitives might be sufficient for simple data replication, they cannot express transactional operations which are required by most of storage systems today. I will explain this problem using COMMIT operation as an example. In the previous example, the DATA was replicated to the log region of replicas with the help of replicas’ CPUs. Now, the frontend server wants to commit the log. (click) It sends a COMMIT operation along with the parameter describing the memory location of data to be committed and the destination location in the data region. Since existing RDMA primitives does not support such a commit operation, the operation must be sent as a message via RDMA WRITE or SEND primitive. (click) Once the replica software on the first replica receives the COMMIT message, it will parse and executes the COMMIT operation and let the NIC send the commit operation message to the next replica. (click) The next replica process the COMMIT in the same manner. (click) In this example, replica CPUs are involved not only to forward the COMMIT operation but also to execute it. Similarly, performing other transactional operations such as locking will also require the involvement of replica CPUs. The root cause of this CPU involvement is that existing RDMA primitives only support simple write or read primitives and does not support transactional operations. So, we need new abstractions and primitives that can effectively leverage the combination of RDMA NICs and NVMs to entirely offload replicated transactions from the CPUs to a group of replicas NICs.
  27. After this bootstrapping phase, the replicas CPUs do not need to be involved in the data path. Now I will explain how HyperLoop supports group operation by explaining its data path. We develop a technique which enables the frontend server to manipulate the parameters of pre-programmed operations on the replicas’ NICs. We realized this idea based on our observation that the operations’ parameter region can be registered to the NIC so that remote NICs can access it. Let me explain how it works: After the bootstrapping phase, the RNICs on the replicas is programmed with WAIT and WRITE operation. (click) The frontend server issues a WRITE operation to the first replica’s NIC. (click) And immediately after this, it will also send parameters for the WRITE operation programmed on the RNIC on the first replica. The Hyperloop Library can calculate this parameter values using the memory address information of replicas which was collected during the bootstrapping phase. And upon receiving these operations, the replica’s NIC will update the parameter of the pre-programmed WRITE operation. (click) Also this receiving event will trigger the NIC to forward the updated WRITE operation to the next replica server’s NIC. With this technique, the frontend server can manipulate the parameters of operations pre-programmed on the replica NICs in the group with arbitrary parameters. And the replica NICs are now being able to forward the operation to the next replica in the same group with the parameter set by the frontend server. So, frontend server can perform any HyperLoop group primitives on a group of replica NICs without involving CPUs.
  28. As mentioned earlier, existing RDMA primitives do not support transactional operations such as locking and committing, so replica CPUs need to be involved to perform such operations. To address this, we observed that not only simple logging but also other transactional operations can be abstracted by memory operations which can be offloaded to RDMA NICs. For example, locking can be expressed with the compare and swap and committing can be achieved with local memory copy. However, while RNICs natively support compare and swap operation, it does not support memory copy. To offload memory copy to the NIC, we design the memory copy primitive by extending RDMA write operation. With our abstraction, logging, locking and committing can be transformed into memory operations and RNICs can perform them. Please refer to our paper for detailed design for each memory primitive. Based on our abstraction, we propose HyperLoop primitives that perform replicated transactional operations on a group of replicas without involving their CPUs in the critical path. HyperLoop primitives consist of group WRITE, group LOCK and UNLOCK and group COMMIT. Let me explain how a write transactional operation can be performed with HyperLoop primitives. (click) For example, to perform a write operation, the frontend server first updates the log with the group WRITE. (click) Then it tries to grab a global lock with the group LOCK. (click) If it grabs the lock successfully, it will commit the log using group COMMIT. (click) Then, it will release the lock using the group UNLOCK. Now, all these transactional operations can be performed by RNICs but forwarding the operations to the next replica still requires the involvement of replica CPUs. So, how can we perform the operations on a group of RNICs without CPU involvement?
  29. Next, we wanted to see whether HyperLoop can achieve the similar latency reduction even when the group size becomes larger. Here, the y-axis is the ratio between the latency achieved by HyperLoop and the baseline when performing write operations. This result shows that HyperLoop can achieve significant latency reduction regardless of the group sizes.
  30. So, Why does this happen? To understand this, it is useful to look at the software and hardware stack running for storage in a replica server. (click) As we can see, the data replication software components are scheduled and run by CPUs, arbitrary scheduling delay can occur while processing the incoming replication and transactions request, which can impact the replication latency. Further, in a multi-tenant case, since multiple replica instances for different tenants are contending with each other for limited CPU resources, the latency problem can become worse.
  31. Unfortunately, it turns out just writing data to NVMs via RDMA NICs is not enough for entirely offloading replicated transactional operations from the CPUs. (click) This example illustrates the strawman design of using NVM and RDMA for storage replication. The red arrows shows the critical path of data replication. You can see that replicas CPUs are still involved in the critical path of operation. (click) There are two key reasons for this.
  32. (Point the key part) The first challenge is that replicas’ CPUs are still involved to control RDMA NICs to forward operations to a group of replicas. Let me bring back the logging example to explain this. Logging operation can be replaced with RDMA WRITE, since what it does is writing data to the log region. (click) The storage frontend issues RDMA WRITE operation to replicate data. The NIC on the first replica will receive and perform the operation on the NVM. (click) Then, the replica software which will notice the WRITE operations has been completed. And then it requests the NIC to send the WRITE operation with the replicated data. And the replica software on the last replica in the group will send an ACK to the frontend server once it confirms the WRITE operation has been completed, which involves the host CPU. (click) So to forward the operations, replicas CPUs still need to be involved.
  33. The second challenge is that while existing RDMA primitives might be sufficient for simple data replication, they cannot perform transactional operations, which are required by most of storage systems today. Let me explain this problem using COMMIT operation as an example. From the previous example, the DATA has been replicated to the log region of replicas. Now, the frontend server wants to commit the log. (click) It sends a COMMIT operation along with the parameter describing the data being committed and the destination location. Since existing RDMA primitives do not support such a commit operation, the operation must be sent as a message via RDMA WRITE or SEND. (click) Once the replica software on the first replica receives the COMMIT message, it will parse and executes the operation and request the NIC to send the commit message to the next replica. (click) The next replica process the COMMIT in the same manner. (click) In this example, we see that replica CPUs are involved not only to forward the COMMIT operation but also to execute it. Similarly, performing other transactional operations such as locking will also require the replica CPUs since they are not natively supported by RDMA primitives.
  34. Now let me show you how a write operation can be performed with HyperLoop primitives. The frontend server first updates the log with the group log. Then it tries to grab a global lock with the group lock. If it grabs the lock successfully, it will commit the log using group commit. Then, it will release the lock using the group unlock. So, with Hyperloop primitives the storage application can perform transactional operations without involving replicas CPUs.
  35. To give you some context for this work, many online services today rely on replicated storage systems, for instance, Azure storage, amazon s3 and open source implementations such as MongoDB and RocksDB. To achieve high availability, integrity and consistency, these systems use some form of replicated transactions. For instance, chain replication is a popular replication protocol used by these storage services today. Notice that these storage systems are inherently multitenant .. which means that 100s of replica instances can be co-located on a single server to improve resource utilization. (complicated)