HyperLoop: Group-Based NIC Offloading
to Accelerate Replicated Transactions
in Multi-tenant Storage Systems
Daehyeok Kim
Amirsaman Memaripour, Anirudh Badam, Yibo Zhu, Hongqiang Harry Liu
Jitendra Padhye, Shachar Raindel, Steven Swanson
Vyas Sekar, Srinivasan Seshan
Multi-tenant Storage Systems
2
[Figure: a storage frontend replicating data to a chain of replica servers]
• Replicated transactions
  – Data availability and integrity
  – Consistent and atomic updates
  – e.g., chain replication
• Multiple replica instances are co-located on the same server
Problem: Large and Unpredictable Latency
• Both average and tail latencies increase as the number of tenants grows
• The gap between average and tail latency also widens
3
[Chart: YCSB on MongoDB; x-axis: number of tenants on a server (9–27); y-axis: latency in ms (0–160); average vs. 99th-percentile latency]
CPU Involvement on Replicas
4
• CPU involvement for executing and forwarding transactional operations
• Arbitrary CPU scheduling → unpredictable latency
• Replicas’ CPU utilization hits 100%
[Figure: frontend and two replicas in a chain. On each replica, the Storage/Net library (run by CPUs) sits on the critical path of operations: it (1) executes the logging of DATA into storage and (2) forwards the logging to the next replica, with the tail replica returning an ACK.]
Our Goal
5
[Figure: today's storage system, where the Storage/Net libraries run by CPUs sit on the critical DATA/ACK path, versus our goal, where only the frontend software and the replicas' NICs and storage remain on the critical path]
Removing replica CPUs from the critical path!
Our Work: HyperLoop
• Framework for building fast chain-replicated transactional
storage systems, enabled by three key ideas:
1. RDMA NIC + NVM
2. Leveraging the programmability of RDMA NICs
3. APIs covering key transactional operations
• Minimal modifications for existing applications
• e.g., 866 lines for MongoDB out of ~500K lines of code
• Up to 24x tail latency reduction in storage applications
6
Outline
• Motivation
• HyperLoop Design
• Implementation and Evaluation
7
Idea 1: RDMA NIC + NVM
• RDMA (Remote Direct Memory Access) NICs
• Enables direct memory access over the network without involving remote CPUs
• NVM (Non-Volatile Memory)
• Provides a durable storage medium for RDMA NICs
8
[Figure: replica stack with Idea 1 (NVM/RDMA library, RDMA NIC, NVM) replacing today's stack (Storage/Net library, NIC, storage)]
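For illustration only (this is not HyperLoop's actual code), a minimal sketch of Idea 1 on the replica side: an NVM-backed buffer is registered with the RNIC so that remote NICs can WRITE into durable memory without involving the local CPU. The tmpfs-backed file mirrors the evaluation's NVM emulation; the file path and function name are assumptions.

```c
#include <stddef.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Sketch: expose a log region living on (emulated) NVM to remote RNICs.
 * 'pd' is an already-created protection domain; error handling is omitted.
 * A tmpfs-backed file stands in for NVM, as in the paper's evaluation. */
static struct ibv_mr *register_nvm_log(struct ibv_pd *pd, size_t len)
{
    int fd = open("/dev/shm/replica_log", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, len);
    void *log = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* Remote WRITE access lets the previous node in the chain log data
     * directly into durable memory without waking this replica's CPU. */
    return ibv_reg_mr(pd, log, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}
```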
Roadblock 1: Operation Forwarding
• RDMA NICs can execute the logging operation
• CPUs are still involved to request the RNICs to forward operations
9
[Figure: the frontend issues Logging DATA as an RDMA WRITE; each replica RNIC writes the DATA into its NVM log, but the replica software, run by CPUs, must still (1) detect the completed logging and (2) request the RNIC to forward the logging along the group of replicas, with the tail replica forwarding the ACK.]
Roadblock 2: Transactional Operations
• CPUs are involved to execute and forward operations
• RDMA NIC primitives do not support some key transactional operations
(e.g., locking, commit)
10
[Figure: the frontend issues a Commit log operation; because RNIC primitives cannot copy DATA from the log region to the data region, the replica software, run by CPUs, must (1) execute the commit and (2) forward the commit along the group of replicas, with the tail replica forwarding the ACK.]
Can We Avoid the Roadblocks?
11
[Figure: today's storage system versus the offloaded design, where replication functionality moves from the CPU-run Storage/Net libraries into the RNICs backed by NVM, taking replica CPUs off the critical path of operations]
Pushing replication functionality to RNICs!
Idea 2:
Leveraging the Programmability of RNICs
• Commodity RDMA NICs are not fully programmable
• Opportunity: RDMA WAIT operation
• Supported by commodity RDMA NICs
• Allows a NIC to wait for the completion of an event (e.g., receiving a message)
• Upon completion, triggers the NIC to perform pre-programmed operations
12
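To make the WAIT mechanism concrete, here is a rough sketch of arming a replica RNIC with "wait for one receive completion, then release the operations queued behind it". This is not HyperLoop's code: the experimental cross-channel verbs shown (verbs_exp.h, IBV_EXP_WR_CQE_WAIT, ibv_exp_post_send) follow MLNX_OFED's extended verbs as best understood here, and the exact header, structure, and field names vary across releases, so treat them as assumptions.

```c
#include <string.h>
#include <infiniband/verbs.h>
#include <infiniband/verbs_exp.h>   /* MLNX_OFED experimental verbs (assumed) */

/* Sketch only: post a WAIT work request so that one completion on recv_cq
 * unblocks whatever is queued after it on this QP. Field names are
 * best-effort and version-dependent, not a definitive API reference. */
static int arm_wait_then_forward(struct ibv_qp *qp, struct ibv_cq *recv_cq)
{
    struct ibv_exp_send_wr wait_wr, *bad_wr;
    memset(&wait_wr, 0, sizeof(wait_wr));

    wait_wr.exp_opcode              = IBV_EXP_WR_CQE_WAIT; /* wait for a CQE  */
    wait_wr.task.cqe_wait.cq        = recv_cq;             /* on the recv CQ  */
    wait_wr.task.cqe_wait.cqe_count = 1;                   /* one message     */

    /* The WRITE this WAIT unblocks was posted earlier with an empty
     * parameter template; the frontend fills in (Src, Dst, Len) at run time. */
    return ibv_exp_post_send(qp, &wait_wr, &bad_wr);
}
```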
Bootstrapping – Program the NICs
13
• Step 1: Frontend library collects the base addresses of memory regions registered to replica RNICs
• Step 2: HyperLoop library programs replica RNICs with RDMA WAIT and the template of the target operation
[Figure: frontend and replicas R1 and R2. R1's RNIC is programmed with "1. WAIT for receiving, 2. Forward WRITE with Param (Src, Dst, Len)"; R2's RNIC with "1. WAIT for receiving, 2. Forward ACK".]
Forwarding Operations
14
• Idea: Manipulating the parameter regions of programmed operations
• Replica NICs can then forward operations with the proper parameters
[Figure: the frontend issues the Update log WRITE (DATA to R1-0xA) and, immediately after, a SEND carrying Param (R1-0xA, R2-0xB, 64). Receiving these satisfies R1's programmed WAIT, so R1's RNIC forwards the WRITE (DATA to R2-0xB) with the filled-in parameters; R2's RNIC in turn forwards the ACK. Replica CPUs stay idle throughout.]
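As a purely illustrative sketch of the frontend-side bookkeeping: given the base addresses collected at bootstrap, the frontend can compute the (Src, Dst, Len) triple for each hop and push it to the replica with an ordinary SEND. The struct layout, array, and function below are assumptions, not HyperLoop's wire format.

```c
#include <stdint.h>

#define MAX_REPLICAS 16   /* illustrative bound, not from the paper */

/* Hypothetical layout of the parameter region behind a pre-posted WRITE;
 * it only illustrates the (Src, Dst, Len) triple shown on the slide. */
struct hop_param {
    uint64_t src;   /* address in this replica's log region (e.g., R1-0xA)  */
    uint64_t dst;   /* address in the next replica's region (e.g., R2-0xB)  */
    uint32_t len;   /* number of bytes to forward                           */
};

/* Base addresses of the replicas' log regions, collected at bootstrap. */
static uint64_t log_base[MAX_REPLICAS];

/* Parameters the frontend SENDs right after the data WRITE; receiving this
 * SEND satisfies the replica RNIC's WAIT and triggers the forwarded WRITE. */
static struct hop_param make_hop_param(int hop, uint64_t log_off, uint32_t len)
{
    struct hop_param p = {
        .src = log_base[hop]     + log_off,
        .dst = log_base[hop + 1] + log_off,
        .len = len,
    };
    return p;
}
```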
Idea 3:
APIs for Transactional Operations
Transactional Operations     | Memory Operations                    | HyperLoop Primitives
Logging                      | Memory write to log region           | Group Log
Commit                       | Memory copy from log to data region  | Group Commit
Locking/Unlocking            | Compare-and-swap on lock region      | Group Lock/Unlock
15
See the paper for details!
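The implementation exposes these primitives as C APIs; the declarations below are a hypothetical sketch of what such an interface could look like. The names, signatures, and the opaque group handle are assumptions; HyperLoop's actual API may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C interface for the group primitives; names and types are
 * illustrative, not HyperLoop's published API. */
typedef struct hl_group hl_group_t;   /* a chain of replica RNICs */

/* Append 'len' bytes to every replica's log region (memory write). */
int hl_group_log(hl_group_t *g, const void *buf, size_t len);

/* Copy a committed log entry into every replica's data region (memory copy). */
int hl_group_commit(hl_group_t *g, uint64_t log_off, uint64_t data_off, size_t len);

/* Acquire/release a lock word on every replica (compare-and-swap). */
int hl_group_lock(hl_group_t *g, uint64_t lock_off);
int hl_group_unlock(hl_group_t *g, uint64_t lock_off);
```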
Transaction with HyperLoop Primitives
16
[Figure: frontend software with the HyperLoop library driving two replicas; each replica runs the HyperLoop library and has an RNIC and NVM holding log and data regions]
1. Update log (Group Log)
2. Grab a lock (Group Lock)
3. Commit the log (Group Commit)
4. Release the lock (Group Unlock)
Replica NICs can execute and forward operations!
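Using the hypothetical interface sketched after the primitives table, the four steps above might compose as follows; this is a sketch under the same naming assumptions, not HyperLoop's code.

```c
/* Sketch of one replicated write transaction with the hypothetical API. */
int replicated_put(hl_group_t *g, const void *val, size_t len,
                   uint64_t log_off, uint64_t data_off, uint64_t lock_off)
{
    if (hl_group_log(g, val, len))            /* 1. append to every log   */
        return -1;
    if (hl_group_lock(g, lock_off))           /* 2. grab the global lock  */
        return -1;
    int rc = hl_group_commit(g, log_off, data_off, len);  /* 3. commit    */
    hl_group_unlock(g, lock_off);             /* 4. release the lock      */
    return rc;
}
```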
HyperLoop: Putting It All Together
17
[Figure: frontend and replica stacks (HyperLoop library, RNIC, NVM) annotated with the three ideas]
• Idea 1: RDMA NIC + NVM
• Idea 2: Leveraging the programmability of RNICs
• Idea 3: APIs for transactional operations: Group Log, Commit, Lock/Unlock
Outline
• Motivation
• HyperLoop Design
• Implementation and Evaluation
18
Implementation and Case Studies
• HyperLoop library
• C APIs for HyperLoop group primitives
• Modify user-space RDMA verbs library and NIC driver
• Case Studies
• RocksDB: Add the replication logic using HyperLoop library
(modify 120 LOCs)
• MongoDB: Replace the existing replication logic with HyperLoop library
(modify 866 LOCs)
19
Evaluation
• Experimental setup
• 8 servers with two Intel 8-core CPUs, 64 GB RAM, Mellanox CX-3 NIC
• Use tmpfs to emulate NVM
• Baseline implementation
• Provides the same APIs as HyperLoop, but involves replica CPUs
• Experiments
• Microbenchmark: Latency reduction in group memory operations
• Application benchmark: Latency reduction in applications
20
Microbenchmark – Write Latency
21
[Chart: 99th-percentile write latency vs. data size (128 B to 8 KB); log-scale latency in μs (1 to 10^5); HyperLoop-99th is up to 801.8x lower than Baseline-99th]
Application benchmark – MongoDB
22
22
[Chart: 99th-percentile latency in ms (0 to 8) across YCSB workloads A, B, D, E, and F; HyperLoop-99th is 4.9x lower than Baseline-99th on workload A]
Other Results
• Latency reduction for other memory operations on a group:
• Memory copy: 496 – 848x
• Compare-and-Swap: 849x
• Tail latency reduction for RocksDB: 5.7 – 24.2x
• Latency reduction regardless of group sizes
23
Conclusion
• Predictable low latency is lacking in replicated storage systems
• Root cause: CPU involvement on the critical path
• Our solution: HyperLoop
– Offloads transactional operations to commodity RNICs + NVM
– Minimal modifications for existing applications
• Result: up to 24x tail latency reduction in in-memory storage apps
• More opportunities
– Other data center workloads
– Efficient remote memory utilization
24
Backup slides
25
Background: Work Queue in RDMA NICs
• A work queue is a FIFO queue in memory
• Each queue entry is a “work request”
• A work request includes the operation and its parameters (e.g., addresses, length)
• The NIC fetches and executes the work requests via DMA
26
[Figure: a send work queue holding WRITE work requests (OP: WRITE, Addrs, Len) that the NIC fetches via DMA and executes]
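For reference, a minimal example of building and posting such a work request with the standard verbs API; the queue pair, memory region, and remote address/rkey are assumed to be set up elsewhere, and the function name is illustrative.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post one RDMA WRITE work request: the queue entry carries the opcode plus
 * its parameters (local address/length, remote address/rkey), which the NIC
 * later fetches and executes. Setup of qp/mr/remote state is omitted. */
static int post_write(struct ibv_qp *qp, struct ibv_mr *mr,
                      uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```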
Editor's Notes
  1. Thanks for the introduction. Hi I am Daehyeok Kim. I am a PhD student at Carnegie Mellon. Today, I will tell you how we can make replicated storage transactions faster by offloading them to the NIC!
  2. To give you some context for this work, many online services today rely on replicated storage service and systems, for instance, Azure storage, amazon s3 and open source implementations such as MongoDB and RocksDB. To achieve high availability, integrity and consistency, these systems use some form of replicated transactions. (click) For instance, chain replication is a popular replication protocol used by these storage services today. (click) Notice that these storage systems are inherently multitenant .. which means that 100s of replica instances can be co-located on a single server to improve resource utilization. (complicated)
  3. However, we have seen that such replicated storage systems can suffer from high and unpredictable latencies! (click) This result shows one such measurement -- the latency of MongoDB in a multi-tenant setting running the YCSB which is a widely used database benchmark workload; the x-axis is the number of tenants in a server, y-axis is the latency for running the workload. And, the blue and red lines represent the average and tail latencies, respectively. (click) We see that as the number of tenants on a server grows, the average and tail latencies increase. Further, the gap between average and tail grows quite rapidly. Naturally, such large and unpredictable latencies impact the end-to-end performance of storage-intensive applications.
  4. So, why does this happen? To understand this, it is useful to look at the software and hardware stack of these storage applications Let us use chain replication as an example --  Suppose there are two replicas and a frontend server in a replication group forming a chain. (click) Assume that the frontend server wants to append the DATA to the log with logging operation. There are two typical steps each replica performs: (click) First, when the replica software receives the operation along with DATA, it first executes the operation; in logging, it will store data to the storage. (click) The next step is to forward the operation to the next replica in the chain. The same thing happens when processing other transactional operations such as commit and locking. (click) Replica software performs the executing and forwarding steps with storage and network software stacks, which are scheduled and run by CPUs. (click) When CPUs are busy, arbitrary scheduling delay can occur while processing the incoming transaction requests, which can impact the replication latency. Indeed, in our experiments we observe replica CPU utilization hits 100% ! Note that in a multi-tenant case, since multiple replica instances are contending for limited CPU, the latency problem will become worse.
  5. Our goal is to improve the latency of replicated storage systems. (click) Ideally, we want to remove the replica CPUs from the critical path and offload the operations to the NICs and storage!
  6. To meet this goal, we present HyperLoop, a framework to enable fast replicated transactional storage systems. HyperLoop by itself is not a full end-to-end storage system, but rather it provides building blocks that developers can use to build such systems. In particular, we focus on chain replication as it’s widely used today. There are three key ideas in our design: (click) First, we leverage the combination of two hardware technologies, RDMA and Non-Volatile Memory, to enable the NIC to directly replicate data to durable storage. To overcome some practical challenges in realizing this idea, we leverage the programmability supported by commodity RDMA NICs and propose a set of primitive APIs that covers key transactional operations. (click) Our results show HyperLoop can reduce tail latency by up to 24 times in in-memory storage applications. On a practical note, realizing these benefits requires only minimal modifications to existing applications. For instance, integrating HyperLoop into MongoDB requires 866 lines of modifications out of around 500 thousand lines of source code.
  7. Now let me explain the three key ideas in the HyperLoop framework.
  8. To remove replica CPUs from the critical path of chain replication, we observe an opportunity to combine two hardware technologies -- RDMA and NVM. RDMA allows us to access remote memory region through the network without involving remote CPUs.   Now, NVM or Non-Volatile Memory can provide RDMA NICs with durable storage, so the NICs can write data to durable remote regions. However, just writing data to NVMs via RDMA NICs is not enough for offloading replicated transactional operations from the CPUs as we will see.
  9. The first challenge is that replicas’ CPUs are still involved to forward operations to a group of replicas. Let me bring back the logging example to explain this. Here, Logging operation can be replaced with RDMA WRITE, since what it does is writing data to the log region in the memory. (click) Thus, the NIC on replicas can execute logging operation on behalf of CPUs. (click) However, replica CPUs still need to be involved to forward the operation.
  10. The second challenge is that while existing RDMA primitives might be sufficient for data logging, they cannot perform other key transactional operations, which are required by most storage systems today. Let me explain this problem using the COMMIT operation as an example. (click) When the frontend server issues a commit operation, the data in the log region has to be copied to the data region in memory. (click) However, since RDMA NICs do not natively support such a commit operation, the replica CPUs need to execute the operation as well as forward it as previously mentioned. Similarly, executing other transactional operations such as locking will also require the replica CPUs.
  11. So, can we address these challenges to let the NICs execute and forward the transactional operations? (click) In particular, is it possible to push the replication and transaction functionality of replicas into the NICs?
  12. At first glance, it seems that we need to effectively put the software stack on to the NIC -- that is, putting full-fledged programmability into your RDMA NICs. Fortunately, we observe that there is a cool feature called RDMA WAIT that we can repurpose to make the replicas’ NICs forward operations without CPUs! Essentially, the WAIT operation allows a NIC to wait for the completion of an event, for example receiving, and upon the completion of the event, it triggers the NIC to perform pre-programmed operations. While this set of operations is indeed limited, we found it is sufficient for our needs! Next I will describe at a high level how we can leverage this feature.
  13. Let us first look at the bootstrapping phase of our mechanism. Suppose there are two replicas, R1 and R2 and storage frontend server as shown in the figure. (click) As soon as the replication group is established, the HyperLoop library in the frontend server collects the base addresses of memory regions registered to replica RNICs. The library will use this information in the data path as we will see later. (click) Then the library on replicas programs the NICs with the RDMA WAIT and the template of target operations, for instance WRITE. The template contains an empty parameter values for the operation. By using WAIT operation, the NIC will wait for a completion of a receiving event and be triggered to forward the programmed operation.
  14. After the bootstrapping phase, the replicas CPUs do not need to be involved in the data path. Now, although the NICs can forward the programmed operation as soon as it detects a receiving event, we first need to set the parameter of operation correctly without involving CPUs. (click) To do this, we design a mechanism to enable the frontend software to manipulate the parameters of the operation programmed on replicas NIC. Let me bring back the previous example to explain how our idea works. (click) To update the log, the frontend server issues a WRITE operation to the first replica. (click) And immediately after this, it will also send parameters for the WRITE operation programmed on the NICs. And upon receiving this, the replica’s NIC will update the parameter of the WRITE operation as shown in the figure without involving a CPU. (click) This receiving event will also trigger the NIC to forward the WRITE operation with updated parameters. (click) With this technique, the replica NICs are now being able to forward the operation with the proper parameter configured by the frontend server without involving replicas CPUs.
  15. Ok, then the only thing remaining is making the NICs execute transactional operations. Let’s see how we can do it. As mentioned earlier, existing RDMA primitives do not support key transactional operations such as locking and commit. Fortunately, we show by construction that it is indeed possible to abstract these transactional operations into memory operations. (click) In particular, we can transform logging into a memory write, commit into a local memory copy, and locking into an atomic compare-and-swap. While RNICs natively support the compare-and-swap operation, they do not support memory copy, so to offload memory copy to the NIC we design a memory copy primitive. (click) Combining this memory abstraction with our operation forwarding mechanism, we design HyperLoop primitives to execute transactional operations on a group of replicas. The HyperLoop primitives consist of Group Log, Commit, Lock, and Unlock. (click) Please refer to our paper for the detailed design of each memory primitive.
  16. Now let me show you how a write operation in a typical storage application can be performed with HyperLoop primitives. The frontend server first updates the log with the group log. Then it tries to grab a global lock with the group lock. If it grabs the lock successfully, it will commit the log using group commit. Then, it will release the lock using group unlock. (click) So, with Hyperloop primitives, the storage application can perform transactional operations on a group of replicas without involving their CPUs.
  17. To summarize, our main goal was to remove replicas’ CPUs from the critical path of chain-replicated storage transactions. We achieve this goal using three key ideas: (click) First, we leverage the combination of RDMA NIC and NVM to offload transactional operations from CPUs. And to solve the practical challenges, (click) we show how the limited programmability of RNICs suffices to forward operations to a group of replicas (click) and we provide new APIs that cover key transactional operations such as logging, locking, and commit. Developers can use HyperLoop primitives to build fast replicated transactional systems.
  18. We implemented the HyperLoop primitives as a C API library. To implement the operation forwarding mechanism, we modified parts of the user-space RDMA verbs library and the NIC driver. As case studies, we integrate HyperLoop into two in-memory databases, RocksDB and MongoDB. Since RocksDB does not natively support replication, we added replication functionality using HyperLoop by modifying around 120 lines of code. For MongoDB, we replaced the existing replication logic with HyperLoop by modifying around 800 lines of code.
  19. To evaluate the benefits of HyperLoop, we use a testbed consisting of 8 servers equipped with RDMA NICs. In the evaluation, we assume storage nodes are equipped with battery-backed DRAM which is a form of NVM available today. So we use tmpfs to emulate Non-volatile memory on top of DRAM. As a comparison point, we implemented the baseline library which provides the same set of APIs as HyperLoop library, but involves replica CPUs to handle operations. We evaluated the performance benefits using Microbenchmark and application benchmark workload.
  20. Using the microbenchmark, we measured the latency of each primitive while emulating busy CPUs on replicas by introducing CPU-intensive tasks. (click) This graph shows the tail latency when performing memory writes on a group of replicas; the x-axis is data size and the y-axis is log-scaled latency. (click) We can see that compared to the baseline, HyperLoop reduces 99th-percentile tail latency by around 800 times. We also observed that the average latency is reduced by about 30 times compared to the baseline. This result demonstrates that offloading the memory operations to NICs is beneficial to reduce both tail and average latencies.
  21. And using the application benchmark, we evaluated the performance benefits for existing storage applications. I will show you the results for MongoDB in this talk. As mentioned earlier, we modified the backend part of MongoDB to integrate our baseline and Hyperloop library. Here, X-axis represents different YCSB workloads and the Y-axis represents the latency for running each workload. We emulate the case where 10 tenants instances are co-located. We can see that HyperLoop can reduce tail end-to-end latency of MongoDB for all workload. (click) For instance, it reduces tail latency by around 5 times for workload A.
  22. We have other results in the paper showing the latency reduction when performing memory copy and compare-and-swap on a group of replicas. Also, we showed that HyperLoop can reduce the tail latency for RocksDB by up to 24 times.
  23. To conclude the talk, predictable low latency is lacking in replicated storage systems today. The root cause is CPU involvement on replicas. To solve this problem, we propose HyperLoop, which accelerates transactional operations by offloading them to a group of NICs with the support of NVM. And we showed that it can be applied to existing applications with minimal modifications and reduce both the average and tail latencies. Our work shows the feasibility and benefits of abstracting transactions into memory operations and offloading them to a group of NICs in the context of storage systems. There are lots of opportunities for future work in terms of applying our insights to other data center workloads and generalizing this approach to utilizing remote memory more efficiently and effectively. With that, I’m happy to take any questions. Thank you.
  24. And we evaluate the performance benefits for real-world storage applications. As mentioned earlier, we modified the backend part of MongoDB to integrate our baseline and Hyperloop library. Here, X-axis represents different YCSB workloads and the Y-axis represents the latency for running each workload. We emulate the case where 10 tenants instances are co-located. We can see that HyperLoop can reduce both average and tail end-to-end latency of MongoDB for all workload. For instance, it reduces tail latency by around 5 times compared to the baseline for workload A.
  25. However, it turns out that just writing data to NVMs via RDMA NICs is not enough for entirely offloading replicated transactional operations from the CPUs. There are two key reasons: The first reason is that replicas’ CPUs are still involved to let RDMA NICs to forward operations to a group of replicas. I will use the same WRITE operation as an example to explain this. (click) The storage frontend software requests the RNIC to issue WRITE operation to replicate data to a group of replica. Upon receiving this operation, the RNIC on the first replica will perform the operation on the NVM. (click) Then, the replica software which is run by the host CPU will notice the WRITE operations has been completed. And then it requests the NIC to send the WRITE operation with data which just has been replicated to the log region. (click) And the replica software on the last replica in the chain will send an ACK to the frontend server once it confirms the WRITE operation has been completed, which also involves the host CPU. So, this involves replicas CPUs to forward WRITE operations to another replica in the group.
  26. The second reason of involving replicas’ CPUs is that while existing RDMA primitives might be sufficient for simple data replication, they cannot express transactional operations which are required by most of storage systems today. I will explain this problem using COMMIT operation as an example. In the previous example, the DATA was replicated to the log region of replicas with the help of replicas’ CPUs. Now, the frontend server wants to commit the log. (click) It sends a COMMIT operation along with the parameter describing the memory location of data to be committed and the destination location in the data region. Since existing RDMA primitives does not support such a commit operation, the operation must be sent as a message via RDMA WRITE or SEND primitive. (click) Once the replica software on the first replica receives the COMMIT message, it will parse and executes the COMMIT operation and let the NIC send the commit operation message to the next replica. (click) The next replica process the COMMIT in the same manner. (click) In this example, replica CPUs are involved not only to forward the COMMIT operation but also to execute it. Similarly, performing other transactional operations such as locking will also require the involvement of replica CPUs. The root cause of this CPU involvement is that existing RDMA primitives only support simple write or read primitives and does not support transactional operations. So, we need new abstractions and primitives that can effectively leverage the combination of RDMA NICs and NVMs to entirely offload replicated transactions from the CPUs to a group of replicas NICs.
  27. After this bootstrapping phase, the replicas CPUs do not need to be involved in the data path. Now I will explain how HyperLoop supports group operation by explaining its data path. We develop a technique which enables the frontend server to manipulate the parameters of pre-programmed operations on the replicas’ NICs. We realized this idea based on our observation that the operations’ parameter region can be registered to the NIC so that remote NICs can access it. Let me explain how it works: After the bootstrapping phase, the RNICs on the replicas is programmed with WAIT and WRITE operation. (click) The frontend server issues a WRITE operation to the first replica’s NIC. (click) And immediately after this, it will also send parameters for the WRITE operation programmed on the RNIC on the first replica. The Hyperloop Library can calculate this parameter values using the memory address information of replicas which was collected during the bootstrapping phase. And upon receiving these operations, the replica’s NIC will update the parameter of the pre-programmed WRITE operation. (click) Also this receiving event will trigger the NIC to forward the updated WRITE operation to the next replica server’s NIC. With this technique, the frontend server can manipulate the parameters of operations pre-programmed on the replica NICs in the group with arbitrary parameters. And the replica NICs are now being able to forward the operation to the next replica in the same group with the parameter set by the frontend server. So, frontend server can perform any HyperLoop group primitives on a group of replica NICs without involving CPUs.
  28. As mentioned earlier, existing RDMA primitives do not support transactional operations such as locking and committing, so replica CPUs need to be involved to perform such operations. To address this, we observed that not only simple logging but also other transactional operations can be abstracted by memory operations which can be offloaded to RDMA NICs. For example, locking can be expressed with the compare and swap and committing can be achieved with local memory copy. However, while RNICs natively support compare and swap operation, it does not support memory copy. To offload memory copy to the NIC, we design the memory copy primitive by extending RDMA write operation. With our abstraction, logging, locking and committing can be transformed into memory operations and RNICs can perform them. Please refer to our paper for detailed design for each memory primitive. Based on our abstraction, we propose HyperLoop primitives that perform replicated transactional operations on a group of replicas without involving their CPUs in the critical path. HyperLoop primitives consist of group WRITE, group LOCK and UNLOCK and group COMMIT. Let me explain how a write transactional operation can be performed with HyperLoop primitives. (click) For example, to perform a write operation, the frontend server first updates the log with the group WRITE. (click) Then it tries to grab a global lock with the group LOCK. (click) If it grabs the lock successfully, it will commit the log using group COMMIT. (click) Then, it will release the lock using the group UNLOCK. Now, all these transactional operations can be performed by RNICs but forwarding the operations to the next replica still requires the involvement of replica CPUs. So, how can we perform the operations on a group of RNICs without CPU involvement?
  29. Next, we wanted to see whether HyperLoop can achieve the similar latency reduction even when the group size becomes larger. Here, the y-axis is the ratio between the latency achieved by HyperLoop and the baseline when performing write operations. This result shows that HyperLoop can achieve significant latency reduction regardless of the group sizes.
  30. So, Why does this happen? To understand this, it is useful to look at the software and hardware stack running for storage in a replica server. (click) As we can see, the data replication software components are scheduled and run by CPUs, arbitrary scheduling delay can occur while processing the incoming replication and transactions request, which can impact the replication latency. Further, in a multi-tenant case, since multiple replica instances for different tenants are contending with each other for limited CPU resources, the latency problem can become worse.
  31. Unfortunately, it turns out just writing data to NVMs via RDMA NICs is not enough for entirely offloading replicated transactional operations from the CPUs. (click) This example illustrates the strawman design of using NVM and RDMA for storage replication. The red arrows shows the critical path of data replication. You can see that replicas CPUs are still involved in the critical path of operation. (click) There are two key reasons for this.
  32. (Point the key part) The first challenge is that replicas’ CPUs are still involved to control RDMA NICs to forward operations to a group of replicas. Let me bring back the logging example to explain this. Logging operation can be replaced with RDMA WRITE, since what it does is writing data to the log region. (click) The storage frontend issues RDMA WRITE operation to replicate data. The NIC on the first replica will receive and perform the operation on the NVM. (click) Then, the replica software which will notice the WRITE operations has been completed. And then it requests the NIC to send the WRITE operation with the replicated data. And the replica software on the last replica in the group will send an ACK to the frontend server once it confirms the WRITE operation has been completed, which involves the host CPU. (click) So to forward the operations, replicas CPUs still need to be involved.
  33. The second challenge is that while existing RDMA primitives might be sufficient for simple data replication, they cannot perform transactional operations, which are required by most of storage systems today. Let me explain this problem using COMMIT operation as an example. From the previous example, the DATA has been replicated to the log region of replicas. Now, the frontend server wants to commit the log. (click) It sends a COMMIT operation along with the parameter describing the data being committed and the destination location. Since existing RDMA primitives do not support such a commit operation, the operation must be sent as a message via RDMA WRITE or SEND. (click) Once the replica software on the first replica receives the COMMIT message, it will parse and executes the operation and request the NIC to send the commit message to the next replica. (click) The next replica process the COMMIT in the same manner. (click) In this example, we see that replica CPUs are involved not only to forward the COMMIT operation but also to execute it. Similarly, performing other transactional operations such as locking will also require the replica CPUs since they are not natively supported by RDMA primitives.
  34. Now let me show you how a write operation can be performed with HyperLoop primitives. The frontend server first updates the log with the group log. Then it tries to grab a global lock with the group lock. If it grabs the lock successfully, it will commit the log using group commit. Then, it will release the lock using the group unlock. So, with Hyperloop primitives the storage application can perform transactional operations without involving replicas CPUs.
  35. To give you some context for this work, many online services today rely on replicated storage systems, for instance, Azure storage, amazon s3 and open source implementations such as MongoDB and RocksDB. To achieve high availability, integrity and consistency, these systems use some form of replicated transactions. For instance, chain replication is a popular replication protocol used by these storage services today. Notice that these storage systems are inherently multitenant .. which means that 100s of replica instances can be co-located on a single server to improve resource utilization. (complicated)