4. Containerization and RDMA are in Conflict!
[Figure: two hosts, each with an RDMA NIC. Containers 1 and 2 on Host 1 must share the host IP (10.0.0.1); after migration to Host 2, a container must adopt the new host IP (20.0.0.1). Neither namespace isolation nor portability holds.]
5. Existing H/W based Virtualization Isn’t Working
[Figure: with SR-IOV, the NIC exposes virtual functions (VFs) behind a NIC switch, so Container 1 (10.0.0.1) and Container 2 (10.0.0.2) each get their own IP on Host 1. After migrating Container 2 to Host 2, its VF there is configured with 20.0.0.1, so portability is still broken. VF: Virtual Function.]
6. Sub-optimal Performance of Containerized Apps
[Figure: training speed (images/sec) of ResNet-50, Inception-v3, and AlexNet (image-classification CNN training), and the CDF of time per step for speech-recognition RNN training, comparing native RDMA against Container+TCP. Native RDMA is 9.2x-14.4x faster.]
RDMA networking can improve the training speed of NN models by ~10x!
7. Our Work: FreeFlow
• Enable high-speed RDMA networking capabilities for containerized applications
• Compatible with existing RDMA applications
• Close to native RDMA performance
• Evaluation with real-world data-intensive applications
9. FreeFlow Design Overview
[Figure: native RDMA vs. FreeFlow. Natively, the RDMA app issues Verbs API calls to the Verbs library, which sends NIC commands to the RDMA NIC. With FreeFlow, containers (e.g., 10.0.0.1 and 20.0.0.1) issue Verbs calls to a FreeFlow layer on the host (30.0.0.1), which drives the RDMA NIC on their behalf.]
10. Background on RDMA
"Host 1 wants to write the contents of MEM-1 to MEM-2 on Host 2."
1. Control path
- Set up the RDMA context
- Post work requests (e.g., write)
2. Data path
- NIC processes work requests
- NIC directly accesses memory
[Figure: on each host, the RDMA app issues verbs calls through the Verbs library; the RDMA NIC accesses MEM-1/MEM-2 directly via the RDMA context.]
11. FreeFlow in the Scene
"Container 1 wants to write the contents of MEM-1 to MEM-2 on Container 2."
[Figure: FreeFlow sits between each container's Verbs library and the RDMA NIC, keeping a shadow context (S-RDMA CTX) and shadow memory (S-MEM-1/S-MEM-2) alongside the app's RDMA context and memory.]
C1: How to forward verbs calls?
C2: How to synchronize memory?
12. Challenge 1: Verbs forwarding in Control Path
[Figure: a shim under the RDMA app intercepts ibv_post_send(struct ibv_qp* qp, ...) — but what should it send to FreeFlow?]
Attempt 1: Forward "as is" → Incorrect
Attempt 2: "Serialize" and forward → Inefficient
struct ibv_qp {
    struct ibv_context *context;
    ...
};
13. Internal Structure of Verbs Library
[Figure: inside the Verbs library, the parameters of a call like ibv_post_send(struct ibv_qp* qp, ...) are serialized by the library itself before it issues the NIC command.]
Parameters are serialized by the Verbs library!
14. FreeFlow Control Path Channel
Idea: leverage the serialized output of the Verbs library.
[Figure: the FreeFlow library's shim intercepts the Verbs library's serialized output and calls Write(VNIC_fd, serialized parameters); the FreeFlow router receives it through the VNIC and issues the NIC command.]
Parameters are forwarded correctly without manual serialization!
15. Challenge 2: Synchronizing Memory for Data Path
• Shadow memory in FreeFlow router
• A copy of application’s memory region
• Directly accessed by NICs
• S-MEM and MEM must be synchronized.
• How to synchronize S-MEM and MEM?
[Figure: the FreeFlow router keeps S-RDMA CTX and S-MEM alongside the container's RDMA CTX and MEM; the NIC accesses S-MEM.]
16. Strawman Approach for Synchronization
"Container 1 wants to write the contents of MEM-1 to MEM-2 on Container 2."
[Figure: data must travel MEM-1 → S-MEM-1 → NIC → NIC → S-MEM-2 → MEM-2; how do the routers keep app memory and shadow memory in sync?]
Explicit synchronization:
- High frequency → high overhead
- Low frequency → wrong data for the app
17. Containers can Share Memory Regions
[Figure: MEM and S-MEM placed in one shared memory region on the host.]
MEM and S-MEM can be located in the same physical memory region:
• The FreeFlow router runs in a container, and containers on the same host can share memory.
18. Zero-copy Synchronization in Data Path
[Figure: MEM-1 and S-MEM-1 share one physical memory region on the host.]
Synchronization without explicit memory copies. How to allocate MEM-1 in the shadow memory space?
Method 1: Allocate shared buffers with FreeFlow APIs
Method 2: Re-map the app's memory space to the shadow memory space
FreeFlow supports both!
23. Does FreeFlow Support High Throughput?
[Figure: throughput (Gbps) vs. message size (2KB-1MB) for native RDMA and FreeFlow. FreeFlow matches native RDMA at 8KB and above; below that it is bounded by control path channel performance.]
24. Do Applications Benefit from FreeFlow?
[Figure: CDF of time per training step for Container+TCP, native RDMA, and FreeFlow. FreeFlow's median step time is 8.7x faster than Container+TCP and close to native RDMA.]
25. Summary
• Containerization today can’t benefit from the speed of RDMA.
• Existing solutions for NIC virtualization don’t work (e.g., SR-IOV).
• FreeFlow enables containerized apps to use RDMA.
• Challenges and Key Ideas
• Control path: Leveraging Verbs library structure for efficient Verbs forwarding
• Data path: Zero-copy memory synchronization
• Performance close to native RDMA
github.com/microsoft/freeflow
Editor's Notes
Thanks for the introduction. My name is Daehyeok Kim. I am a PhD student at Carnegie Mellon University.
Today, I will tell you how we can virtualize RDMA networking for containerized clouds with FreeFlow!
So, there are two recent trends in developing cloud applications – Containerization and RDMA networking.
Container solutions such as Docker and LXC offer
(click) lightweight isolation and portability, which lowers the complexity of deploying and managing applications.
On the other hand, Remote Direct Memory Access or RDMA networking offers
(click) significantly higher throughput and lower latency while maintaining lower CPU utilization.
As a result, many applications, especially data-intensive ones, adopt containers or RDMA networking, but not both at the same time.
You may be wondering why they are not used together, and I will explain this in the next couple of slides.
So, what are the benefits of containerizing apps, especially networked applications?
With containerization, for traditional TCP/IP container networking, containers can use their own network namespace with the support of a software switch.
(click) Within their namespace, they can use their own IP addresses regardless of the address of host namespace. (click) So, it supports namespace isolation.
(click) Moreover, even when the container is migrated to another host machine, it can retain its IP address,
(click) which supports portability of the application running inside the container.
However, when it comes to RDMA networking, containers cannot support these two properties because they are in conflict.
(click) In this case, containers have to use the host namespace to directly interact with the RDMA network interface card.
Because of this, each container cannot have an isolated network namespace, which means all containers have to use the same IP address.
And another issue happens when migrating the container.
(click) Since the namespace on the new host machine uses another IP address,
(click) the migrated container also has to change its IP address to this address,
(click) limiting the portability of containers.
One natural solution would be to use an existing hardware I/O virtualization technique such as SR-IOV.
At a high level, (click) an SR-IOV-capable NIC provides a set of virtual NICs called virtual functions, and
(click) we can assign a different IP address to each virtual function.
(click) Then virtual machines or containers can use virtual functions in the host namespace but with different IP addresses.
(click)
This provides namespace isolation, with one caveat: you are limited to the set of IP addresses assigned to a particular virtual function.
While this might not affect namespace isolation on a single machine, it limits portability.
As a result, containerized applications cannot use RDMA today and it results in sub-optimal performance.
To show this, we ran different neural-net training workloads on TensorFlow, with containers using TCP networking and with native RDMA networking.
(click) As you can see here, containerized TensorFlow with TCP networking shows significantly worse performance than the native RDMA case.
(click) Native RDMA can improve the performance by around 10 times.
So our goal was simple: we wanted to let containerized applications use high-speed RDMA networking while achieving isolation and portability.
In this talk, I will present FreeFlow, a software-based virtual RDMA networking solution for containerized apps that achieves our goal.
With FreeFlow, existing containerized RDMA applications can run with minimal or no code modification.
Our evaluation with real data-intensive applications shows that FreeFlow provides performance close to native RDMA.
Now let me explain the key ideas in the design of FreeFlow.
In the native RDMA case, an RDMA application directly talks to the RDMA NIC via the Verbs library, which translates
(click) the user-space RDMA commands into
(click) NIC commands.
(click) In FreeFlow, we introduce a software layer that virtualizes RDMA networking for containers.
At high-level, (click) the FreeFlow layer gets Verbs calls from RDMA apps on containers,
and (click) processes them by interacting with the physical RDMA NIC on behalf of the apps.
Since FreeFlow allows containers to run their own network namespace, it provides isolation and portability.
However, while realizing this idea, we faced several challenges.
To explain them, I first briefly introduce the background on how native RDMA works.
Consider a simple example where Host 1 wants to write the contents of MEM-1 into MEM-2 on Host 2.
The RDMA app and RDMA NIC communicate through two different paths: the control path and the data path.
**In the control path,
**the RDMA app issues a set of verbs calls to set up the RDMA context and register the memory regions.
**Once the context is ready, it exchanges the context with the remote side to establish an RDMA data channel.
**Then it can post work requests, for example, write request in this case.
**Then, in the data path, the NIC processes work requests
by directly accessing memory without involving the CPU, which is where most of RDMA’s performance gain comes from.
Now let’s bring FreeFlow into the scene.
**FreeFlow provides apps in container with virtual resources for both control and data path.
When the app creates a context, the FreeFlow layer creates a shadow memory for the app’s memory region.
Later, in the data path, the NIC will directly access this shadow memory instead of the app’s memory region.
(click) Because of this, these two memory regions must be synchronized.
(click) FreeFlow then creates the corresponding shadow context and registers it with the physical RDMA NIC.
(click) Then, the NIC will directly access the shadow memory.
While it sounds like a simple idea, there are two key challenges.
(click) The first one is how to forward verbs calls from the app to the FreeFlow layer efficiently.
(click) The second challenge is how to synchronize the shadow and app memory regions without incurring much overhead.
I will explain how we solve these in the next slides.
The first challenge is how to efficiently intercept and forward verbs calls from the application.
(click) That is, what should the app send to the FreeFlow layer so that verbs calls are forwarded correctly?
(click) We first came up with a design that puts a shim layer under the app to intercept the verbs calls.
(click) In this example, the shim layer intercepts “post_send”, which is used for posting send or write work requests.
Then the next question is: what does the shim forward to the FreeFlow layer?
(click) Our first attempt was to forward the call as it is, but that failed. Why?
(click) Because as we can see here the verb call takes a pointer argument,
(click) which makes this approach incorrect.
(click) The second approach we took was to serialize and forward.
(click) But we observe that the qp data structure contains lots of pointer member variables,
(click) which makes serialization very difficult and inefficient.
Fortunately, we found a hint by looking (click) one step deeper into the existing verbs library.
(click) When the app issues a verbs call, the verbs library outputs serialized parameters before it sends the corresponding NIC command.
This means that the complexity of verbs call arguments is simplified internally by the library.
So, based on this observation,
(click) we decouple the FreeFlow layer into two parts, the FreeFlow router and the FreeFlow library, and make the VNIC the interface between them.
(click) Our shim layer intercepts the serialized output of the full-fledged verbs library instead of intercepting app-facing APIs.
(click) Then, we make it forward the call to the virtual NIC on the FreeFlow router.
(click) This way, verbs calls can be forwarded correctly without explicit serialization.
The second challenge is how to synchronize the shadow memory with the corresponding app’s memory region in the data path.
(click) As mentioned earlier, the shadow memory is a copy of the application’s memory region, and the NIC directly accesses the shadow memory.
(click) And those memory regions must be synchronized for correct data path operations.
(click) We observe that synchronization is not trivial because of the characteristics of RDMA operations.
Let me explain this challenge with an example.
Let’s assume Container 1 wants to write the contents of its memory region, MEM-1, to another memory region, MEM-2, on Container 2.
When the write request is issued, the FreeFlow router needs to somehow copy the data from MEM-1 to the shadow memory,
(click) and then the NIC will read from the shadow memory and transmit it to the NIC on the other side.
(click) Then the receiver NIC puts the data into S-MEM-2 without CPU involvement.
(click) Since there is no signal, the FreeFlow router does not know when it should copy the data from the shadow memory to the app memory.
(click) One straightforward solution would be to explicitly synchronize the two memory regions, but it is neither efficient nor effective.
(click) If synchronization is too frequent, it consumes an excessive amount of CPU cycles and memory bandwidth.
If the frequency is too low, the app can end up reading wrong data. Both results are undesirable.
So how can we solve this problem?
We observe an opportunity in the fact that containers on the same host machine are just processes, and processes can share memory regions.
(click) Since the FreeFlow router and the applications run in different containers but on the same host,
(click) they can share memory space by locating it in the same physical memory region.
Now we know that the app’s regions and the shadow memory can be allocated in the shared memory space.
(click) The question, then, is how the app can allocate its memory region in the shadow memory space.
(click) FreeFlow provides two methods for this.
(click) The first method is to allocate the app’s memory buffer using FreeFlow’s memory allocation API, so that FreeFlow places the buffer in the shared memory region.
It requires some amount of application code modification.
(click) The second method is to re-map the app’s memory space to the shadow memory region. While it does not need code modification, it requires the application’s memory buffer to be page-aligned. In our experience, many RDMA applications already make their data buffers page-aligned, so the FreeFlow router can just remap the buffer without code modification.
Because each method has different trade-offs, we provide both for developers.
So, the end result is FreeFlow, a system that allows you to combine these two great technologies, containerization and RDMA.
In particular, we introduce the FreeFlow router with two key ideas.
**First, we leverage the internal structure of the verbs library to forward verbs calls efficiently.
**Second, we design zero-copy synchronization in the data path and provide two different methods for developers.
**As a result, FreeFlow achieves near-native RDMA performance for containers while providing isolation and portability, as we will see in the next couple of slides.
We implemented the FreeFlow library by modifying the user-space verbs libraries and drivers.
And we implemented FreeFlow router in C++.
(click) Our testbed consists of two servers equipped with RDMA-capable NICs.
We used docker as a container solution.
One nice thing about RDMA is that it provides single-digit-microsecond latency, which is incredibly valuable for latency-sensitive applications.
So, we first measured the latency of RDMA operations.
Two containers were running on different hosts.
(click) This graph shows the RDMA WRITE latency.
The X-axis is message size and the Y-axis is latency.
(click) We can see that, compared to the native RDMA case, FreeFlow adds at most 0.4 us of extra latency.
The extra delay is mainly due to the IPC between the FreeFlow library and FreeFlow router.
High throughput is another advantage of using RDMA.
(click) This graph shows the RDMA SEND throughput.
The sender transmits 1GB of data with different message sizes.
The X-axis is message size and the Y-axis is throughput in Gbps.
(click) We see that with message sizes equal to or larger than 8KB, FreeFlow achieves the same full throughput as native RDMA.
(click) When the message sizes are smaller, like 2KB, FreeFlow achieves nearly half of the full throughput.
This is because the throughput is bounded by the capacity of the control path channel between the FreeFlow library and the router.
How about data-intensive app performance?
As an example, we trained a speech-recognition RNN model on TensorFlow.
We compared the performance of container TCP networking, native RDMA, and FreeFlow.
(click) This figure shows the CDF of the time spent for each training step,
(click) We can see that the median training time with FreeFlow is around 8.7 times faster than the container TCP case, and it is also very close to native RDMA.
To conclude the talk: containerization today cannot benefit from the performance of RDMA.
Existing virtualization techniques don’t work for it.
Today I introduced FreeFlow, which enables containerized apps to use RDMA with minimal or no code modification.
We addressed two key challenges to virtualize the control and data paths.
We showed it can support near-native RDMA performance.
Please check out our paper and github repo for more details.
With that, I’m happy to take questions, thank you.
We also ran an image-recognition workload implemented with TensorFlow.
For image recognition, we ran three specific models: ResNet-50, Inception-v3, and AlexNet.
The figure shows the median training speed with 10th-percentile and 99th-percentile values.
From the results of all three models, we conclude, first, that network performance is indeed a bottleneck in distributed training.
Comparing host RDMA with host TCP, host RDMA performs 1.8 to 3.0 times better in terms of training speed.
The gap between FreeFlow and Weave on container overlay is even wider.
For example, FreeFlow runs 14.6 times faster on AlexNet.
Second, FreeFlow performance is very close to host RDMA.
The difference is less than 4.9%.
First, SR-IOV is an I/O virtualization technique supported by hardware devices such as NICs.
It provides high performance by allowing VMs to directly access physical NICs.
However, it has fundamental portability limitations, since it requires reconfiguration of the hardware NICs when a VM is migrated to another host machine.
Also, it does not provide controllability, since all traffic bypasses hypervisors and software switches.
The second possible solution is using SoftRoCE,
which is a software-emulated RDMA being developed by Mellanox.
It could easily achieve isolation, portability, and controllability
since it implements RDMA stack on top of kernel network stack,
so it could potentially leverage existing virtual network solutions for TCP/IP networking.
However, it has low performance, since packets are processed in the kernel stack.
Lastly, the idea of Hybrid Virtualization could be extended for containers.
The basic idea is that it virtualizes control operations of RDMA using hypervisor environments,
while data operations are set up to bypass the hypervisor completely like in SR-IOV.
Although it can provide isolation, portability, and performance,
it suffers from low controllability, since it can only manipulate control-plane commands.
Lastly, we ran Spark on two servers.
One of the server runs a master container that schedules jobs on slave containers.
Both of the servers run a slave container.
The RDMA extension for Spark is closed source.
We downloaded the binary from the official website and did not make any modifications.
We demonstrate the basic benchmarks shipped with the Spark distribution – GroupBy and SortBy.
As shown in the figure, we draw similar observations as with TensorFlow:
the performance of the network significantly impacts end-to-end application performance.
When running with RvS, performance is very close to host RDMA, better than host TCP, and up to 1.8 times better than running containers with Weave.
Let me first explain a design decision we made to efficiently intercept RDMA API calls.
There are multiple ways to intercept the communications between applications and physical NICs,
but we must choose an efficient one.
A number of commercial technologies supporting RDMA are available today,
including Infiniband, RoCE and iWarp.
Applications may also use several different high-level APIs to access RDMA features,
such as MPI and rsocket.
As shown in the figure, the “narrow waist” of these APIs is the IB Verbs API, so called Verbs.
Thus, we choose to support Verbs in RvS and by doing so we can naturally support all higher-level APIs.
Let me briefly explain some basic concepts in Verbs.
Verbs uses a concept of “queue pairs” called QP for data transfer.
For every RDMA connection, each of the two endpoints has a send queue (SQ) and a receive queue (RQ), together called a QP.
Each endpoint also has a separate completion queue, called a CQ, that is used by the NIC to notify the endpoint about the completion of send or receive requests.
To transparently support Verbs, RvS creates virtual QPs and CQs in virtual NICs and relates operations on them to operations on the real QPs and CQs in the physical NICs.
Cloud application developers are seeking better performance, lower management cost, and higher resource efficiency.
This has led to the growing adoption of two technologies: containerization and RDMA networking.
Containers offer lightweight isolation and portability, which lowers the complexity of deploying and managing cloud applications.
Because of this, containers are now a standard way of managing and deploying large cloud applications.
On the other hand, Remote Direct Memory Access or **RDMA** networking offers significantly higher throughput, lower latency while maintaining lower CPU utilization than standard TCP/IP based networking.
Because of this, many data-intensive applications, for example, deep learning and data analytics frameworks are adopting RDMA as their communication method.
So, our question is ** can containerized applications use RDMA? **
At a high level, FreeFlow introduces two components, the FreeFlow library and the FreeFlow router, to virtualize RDMA networking for containers while achieving isolation, portability, and controllability.
The FreeFlow library provides a communication channel between the applications and the FreeFlow router to forward RDMA commands issued by the applications.
The key role of the FreeFlow router is to virtualize RDMA NIC resources by creating shadow RDMA contexts and memory regions corresponding to the ones created by applications.
It forwards the intercepted RDMA commands to the RDMA NIC based on this shadow context.
Since FreeFlow allows containers to run their own network namespace, it provides isolation and portability.
Also, by intercepting both control and data path commands, it provides full controllability.
The only remaining question is how to provide good performance while guaranteeing those properties.
Why is it hard to achieve good performance in this design?
To realize our design, we need to solve two problems:
First, since both control and data path RDMA commands issued by applications are forwarded to the FreeFlow router, the forwarding mechanism must be efficient.
This is difficult because the structure of RDMA command arguments is complex.
Second, the FreeFlow router maintains shadow contexts and memory regions, which must be kept synchronized with the application’s.
Unfortunately, the two trends are fundamentally at odds with each other especially in clouds.
The core value of containerization is to provide an efficient and flexible resource management to applications.
To support this, containers have three properties in networking:
**isolation, portability, and controllability.**
For isolation, each container needs to have its dedicated network namespace to eliminate conflicts with other containers.
Second, to provide portability, a container typically uses virtual networking to communicate with other containers,
and its virtual IP address sticks with the container regardless of which host machine it is *placed on* or *migrated to*.
For controllability, container orchestrators can easily enforce
control plane policies such as admission control and routing, and data plane policies such as QoS.
These properties are particularly useful in multi-tenant cloud environments.
To support these properties, in TCP/IP-based operations,
networking is fully virtualized via software switches.
However, it is hard to fully virtualize RDMA networking.
RDMA achieves high networking performance by offloading network processing to hardware NICs.
Because of this, it’s difficult to modify the control plane states in hardware
while it’s also hard to control the data path since traffic directly goes between memory and NICs.
As a result, several data-intensive applications that have adopted both of these technologies use RDMA
only when running in dedicated bare-metal clusters.
However, using dedicated clusters to run a single application is not cost-efficient
for either service providers or customers.
We all know that there have been a number of I/O virtualization solutions for virtual machines, including hardware virtualization such as SR-IOV and RDMA-specific software-based approaches such as HyV.
It’s important to note that none of these solutions have been developed or used for containers.
Can we just use them for virtualizing RDMA in containerized clouds?
First, SR-IOV is an I/O virtualization technique supported by hardware devices such as NICs.
It provides high performance by allowing VMs to directly access physical NICs.
However, it has fundamental portability limitations, since it requires a static configuration on the hardware NICs whenever a VM is migrated to another host machine.
Also, it does not provide controllability, since all traffic bypasses hypervisors and any software layers.
On the other hand, there is a software-based technique called HyV.
Its basic idea is that it virtualizes control operations of RDMA using hypervisor environments,
while data operations are set up to bypass the hypervisor completely like in SR-IOV.
Unlike SR-IOV, it can provide isolation, portability, and limited controllability with the support of the hypervisor.
So, it seems like software-based approach could be useful to realize RNIC virtualization for containers because it can intercept RDMA commands and manage NIC resources in the software.
By leveraging this, can we come up with a software-based RNIC virtualization for containers?
To motivate the first problem, let me explain how RDMA commands initiated by applications are delivered to the NICs.
This figure on the left-hand side illustrates how the native RDMA case works.
The first challenge is how to efficiently intercept and forward verbs calls from the application.
As mentioned earlier, the FreeFlow router needs to get verbs calls from the app and interact with the NIC to execute the calls on behalf of the app.
**One natural way of doing this is to insert a shim layer under the app which captures and forwards every verbs call.
However, we realized that this approach can be inaccurate or inefficient.
**Let me take the example of one specific verbs call, post_send, which is used for posting a send or write work request.
**Here, the call requires an argument which is a pointer to a qp structure, so just forwarding it as it is results in incorrect verbs call forwarding.
**Moreover, we observe that the qp structure contains lots of pointers to different structure types, which makes serialization hard and inefficient.
So, how can we efficiently forward a verbs call which takes complex arguments?
Fortunately, we found a hint by looking one step deeper into the verbs library.
In the native RDMA case, when the app issues a verbs call such as post_send, the verbs library outputs serialized parameters before it passes the call to the RDMA NIC.
This way, the NIC can digest and process the verbs call without manual serialization,
which means that the complexity of function call arguments is simplified internally by the verbs library.
So, based on this observation, **our shim layer intercepts the serialized output of the full-fledged verbs library instead of intercepting app-facing APIs.
In particular, while the verbs library originally delivers verbs calls to the physical NIC, we make it forward the call to the virtual NIC on the FreeFlow router.
**This way, verbs calls can be forwarded correctly without manual serialization.
Let’s assume Container 1 wants to write the contents of its memory region, MEM-1, to another memory region, MEM-2, on Container 2.
**When performing an RDMA Write, the NIC directly writes data into the shadow memory region without involving the host CPU, so the FreeFlow router does not know exactly when it has to synchronize the data in shadow memory with the corresponding app memory region.
**One straightforward solution would be to keep copying the shadow memory content, S-MEM-2, to the app memory, MEM-2.
However, this consumes additional CPU cycles and memory bandwidth while incurring non-negligible memory-copy latency, which can affect data path performance.
So how can we solve this problem?
We observe an opportunity in the fact that containers on the same host machine are just processes, and processes can share memory regions.
Since the FreeFlow router and the applications run in different containers but on the same host, they can share memory space by locating it in the same physical memory region.
By leveraging this observation, we design zero-copy memory synchronization for the data path.
**At a high level, FreeFlow allocates the app and shadow memory space in the same physical memory region, so those two memory regions can be synchronized without explicit memory copies.
**In particular, FreeFlow provides two options to achieve this.
One is to allocate the app’s memory buffer using FreeFlow’s memory allocation API, so that FreeFlow places the buffer in the shared memory region.
It requires some amount of application code modification.
The second option is to re-map the app’s memory space to the shadow memory region. While it does not need code modification, it requires the application’s memory buffer to be page-aligned. In our experience with real-world applications, many RDMA applications already make their data buffers page-aligned, so the FreeFlow router can just remap the buffer to the shadow memory region without code modification.
We evaluate FreeFlow by answering two key questions:
The first is: does FreeFlow provide low latency and high throughput, close to bare-metal RDMA?
The second is: can real-world applications using RDMA benefit from FreeFlow?