FreeFlow: Software-based Virtual RDMA
Networking for Containerized Clouds
Daehyeok Kim
Tianlong Yu¹, Hongqiang Liu³, Yibo Zhu⁴, Jitu Padhye², Shachar Raindel²,
Chuanxiong Guo⁴, Vyas Sekar¹, Srinivasan Seshan¹
¹Carnegie Mellon University, ²Microsoft, ³Alibaba Group, ⁴ByteDance
Two Trends in Cloud Applications
• Containerization: lightweight isolation and portability
• RDMA networking: higher networking performance
Benefits of Containerization
[Diagram: Two hosts, each with a NIC and a software switch. Network apps run in containers with per-container virtual IPs (Container 1: 10.0.0.1, Container 2: 20.0.0.1), decoupled from the host NIC IPs (30.0.0.1, 40.0.0.1). When Container 2 migrates from Host 1 to Host 2, it keeps IP 20.0.0.1.]
Namespace isolation and portability.
Containerization and RDMA are in Conflict!
[Diagram: With native RDMA, the apps in both containers on Host 1 share the RDMA NIC's IP (10.0.0.1), so namespace isolation is lost. After Container 2 migrates to Host 2, its address becomes the new RDMA NIC's IP (20.0.0.1), so portability is lost as well.]
Existing H/W based Virtualization Isn’t Working
Using Single Root I/O Virtualization (SR-IOV), each container binds to a Virtual Function (VF) of the RDMA NIC, and the NIC's embedded switch connects the VFs.
[Diagram: On Host 1, Container 1 (VF 1, IP 10.0.0.1) and Container 2 (VF 2, IP 10.0.0.2) each own a VF, so namespace isolation holds. But a VF is tied to the physical NIC: after migration, Container 2 must attach to a new VF on Host 2's NIC and take a new IP (20.0.0.1), so portability is still broken.]
Sub-optimal Performance of Containerized Apps
[Charts: (1) Training speed (images/sec) for image-classification CNNs (ResNet-50, Inception-v3, AlexNet): native RDMA outperforms Container+TCP by 9.2x to 14.4x. (2) CDF of time per step (sec) for speech-recognition RNN training: native RDMA completes steps far faster than Container+TCP.]
RDMA networking can improve the training speed of NN models by ~10x!
Our Work: FreeFlow
• Enables high-speed RDMA networking for containerized applications
• Compatible with existing RDMA applications
• Close-to-native RDMA performance
• Evaluated with real-world data-intensive applications
6
Outline
• Motivation
• FreeFlow Design
• Implementation and Evaluation
7
FreeFlow Design Overview
[Diagram: Native RDMA vs. FreeFlow. Natively, the RDMA app invokes the Verbs API; the verbs library translates the calls into NIC commands for the RDMA NIC. With FreeFlow, apps in Container 1 (IP 10.0.0.1) and Container 2 (IP 20.0.0.1) still invoke the unmodified Verbs API, but FreeFlow (host IP 30.0.0.1) interposes between the containers and the RDMA NIC.]
Background on RDMA
"Host 1 wants to write contents in MEM-1 to MEM-2 on Host 2."
1. Control path: set up the RDMA context; post work requests (e.g., a write).
2. Data path: the NIC processes the work requests and directly accesses application memory.
[Diagram: On each host, the RDMA app and verbs library set up an RDMA CTX; the RDMA NIC then reads MEM-1 on Host 1 and writes MEM-2 on Host 2 without further CPU involvement.]
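The control-path/data-path split can be made concrete with a toy model (a sketch only: `WorkRequest`, `post_write`, and `nic_process` are invented names, not the real verbs API):

```python
# Toy model of RDMA's two paths: the control path only *posts* a
# descriptor of the transfer; the "NIC" later executes it by touching
# memory directly, without involving the application again.
from dataclasses import dataclass

@dataclass
class WorkRequest:            # hypothetical stand-in for an RDMA write WR
    src: bytearray
    dst: bytearray
    length: int

work_queue = []

def post_write(src, dst, length):
    """Control path: enqueue a work request; no data moves yet."""
    work_queue.append(WorkRequest(src, dst, length))

def nic_process():
    """Data path: the 'NIC' drains the queue, copying memory directly."""
    while work_queue:
        wr = work_queue.pop(0)
        wr.dst[:wr.length] = wr.src[:wr.length]

mem1 = bytearray(b"hello from host 1")
mem2 = bytearray(len(mem1))

post_write(mem1, mem2, len(mem1))    # control path: MEM-2 is still empty
assert mem2 == bytearray(len(mem1))
nic_process()                        # data path: the bytes arrive
assert mem2 == mem1
```

The point of the split is that the slow, CPU-driven part (posting) happens once, while the bulk data movement bypasses the CPU entirely.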
FreeFlow in the Scene
"Container 1 wants to write contents in MEM-1 to MEM-2 on Container 2."
[Diagram: On each host, FreeFlow sits between the container's verbs library and the RDMA NIC, keeping a shadow context (S-RDMA CTX) and shadow memory (S-MEM-1 / S-MEM-2) that the NIC actually uses.]
This raises two challenges:
C1: How to forward verbs calls?
C2: How to synchronize memory?
Challenge 1: Verbs forwarding in Control Path
[Diagram: Inside the container, the RDMA app calls the Verbs API, e.g. ibv_post_send(struct ibv_qp* qp, ...), into a shim; the shim must somehow deliver the call to the verbs library in FreeFlow, which issues the NIC commands to the RDMA NIC.]
Attempt 1: Forward the call "as is" → incorrect: parameters such as struct ibv_qp contain pointers (e.g., struct ibv_context *context) that are meaningless in FreeFlow's address space.
Attempt 2: Manually "serialize" every parameter and forward → inefficient.
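Why Attempt 1 fails is easy to reproduce outside RDMA. The sketch below uses a hypothetical `qp_like` struct standing in for `struct ibv_qp`: copying a struct's raw bytes forwards only a process-local pointer value, never the data behind it.

```python
# "Forward as is" copies pointer *values*, not the pointed-to data.
import ctypes

class qp_like(ctypes.Structure):                 # stand-in for struct ibv_qp
    _fields_ = [("context", ctypes.c_void_p),    # pointer field
                ("qp_num", ctypes.c_uint32)]

payload = ctypes.create_string_buffer(b"RDMA context state")
qp = qp_like(ctypes.addressof(payload), 42)

raw = bytes(qp)                                  # Attempt 1: raw byte copy
# The pointed-to context bytes are NOT in the forwarded message...
assert b"RDMA context state" not in raw
# ...only a process-local address, useless in another address space.
ptr_size = ctypes.sizeof(ctypes.c_void_p)
assert ctypes.c_void_p.from_buffer_copy(raw[:ptr_size]).value == ctypes.addressof(payload)
```

A receiver in another process would dereference that address and crash (or read garbage), which is exactly why naive forwarding is incorrect.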
Internal Structure of Verbs Library
[Diagram: same call path as in Challenge 1: the app's ibv_post_send(struct ibv_qp* qp, ...) must reach the verbs library in FreeFlow.]
Key observation: the parameters are already serialized by the verbs library itself! To issue NIC commands, it flattens structures such as struct ibv_qp into a serialized form.
FreeFlow Control Path Channel
Idea: leverage the serialized output of the verbs library.
[Diagram: Inside the container, the FreeFlow library sits beneath the verbs library. The app calls ibv_post_send(struct ibv_qp* qp, ...); the verbs library serializes the parameters as usual, and the FreeFlow library forwards them over a virtual NIC file descriptor, Write(VNIC_fd, serialized parameters), to the FreeFlow Router on the host, which replays them onto the RDMA NIC.]
Parameters are forwarded correctly without manual serialization!
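A minimal sketch of the channel idea, with a socketpair standing in for the VNIC descriptor and a made-up three-field message (opcode, qp_num, length) in place of the verbs library's real serialized output:

```python
# Toy control-path channel: once parameters are flat bytes (as the verbs
# library already produces), forwarding is just a write on a descriptor.
import socket
import struct

lib_end, router_end = socket.socketpair()    # stands in for VNIC_fd

def forward_call(opcode, qp_num, length):
    """Container side: ship the already-serialized parameters."""
    lib_end.sendall(struct.pack("!III", opcode, qp_num, length))

def router_receive():
    """Router side: read the flat bytes and recover the call."""
    return struct.unpack("!III", router_end.recv(12))

forward_call(opcode=1, qp_num=7, length=4096)    # e.g., a "post send"
assert router_receive() == (1, 7, 4096)
```

No pointers cross the boundary, and no per-call hand-written serialization is needed, which is the property FreeFlow gets for free from the verbs library.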
Challenge 2: Synchronizing Memory for Data Path
• Shadow memory (S-MEM) in the FreeFlow router:
• A copy of the application's memory region (MEM)
• Directly accessed by the NIC
• S-MEM and MEM must be kept synchronized, but how?
[Diagram: the container holds the RDMA CTX and MEM; the FreeFlow Router holds S-RDMA CTX and S-MEM, which the RDMA NIC accesses.]
Strawman Approach for Synchronization
"Container 1 wants to write contents in MEM-1 to MEM-2 on Container 2."
[Diagram: data must flow MEM-1 → S-MEM-1 → (RDMA) → S-MEM-2 → MEM-2, so each router must explicitly copy between MEM and S-MEM.]
Explicit synchronization is a bad trade-off:
• High copy frequency → high overhead
• Low copy frequency → the app can observe stale (wrong) data
Containers can Share Memory Regions
• The FreeFlow router itself runs in a container
• Therefore MEM and S-MEM can be located on the same physical memory region, via host shared memory
[Diagram: the app's MEM-1 and the router's S-MEM-1 both map the same shared-memory region on the host.]
Zero-copy Synchronization in Data Path
[Diagram: the app's MEM-1 and the router's S-MEM-1 are two mappings of one shared-memory region on the host, so the NIC's view and the app's view stay in sync.]
Synchronization without explicit memory copies. But how does MEM-1 get allocated in the shadow memory space? FreeFlow supports both of these methods:
• Method 1: the app allocates shared buffers with FreeFlow APIs
• Method 2: FreeFlow re-maps the app's memory space onto the shadow memory space
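The shared-memory trick can be sketched with POSIX shared memory (here via Python's stdlib `multiprocessing.shared_memory`; the MEM / S-MEM names map to the two attachments and are this sketch's own labels):

```python
# Zero-copy sync in miniature: MEM and S-MEM as two attachments to one
# shared-memory segment. A write through either mapping is instantly
# visible through the other: no copy, no explicit synchronization step.
from multiprocessing import shared_memory

seg = shared_memory.SharedMemory(create=True, size=64)    # "MEM" backing store
peer = shared_memory.SharedMemory(name=seg.name)          # "S-MEM" attaches to it

mem, s_mem = seg.buf, peer.buf
mem[:5] = b"hello"                       # app writes into MEM
assert bytes(s_mem[:5]) == b"hello"      # router's S-MEM sees it instantly

s_mem[:5] = b"world"                     # incoming RDMA data lands in S-MEM
final = bytes(mem[:5])                   # app reads it back, zero copies

peer.close()                             # clean up both mappings
seg.close()
seg.unlink()
assert final == b"world"
```

Because both "processes" address the same physical pages, the high-frequency/low-frequency copy trade-off from the strawman disappears entirely.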
FreeFlow Design Summary
• FreeFlow control path channel
• Zero-copy memory synchronization
FreeFlow provides near-native RDMA performance for containers!
[Diagram: RDMA apps in Container 1 (IP 10.0.0.1) and Container 2 (IP 20.0.0.1) reach the RDMA NIC through per-container VNICs and the FreeFlow Router (IP 30.0.0.1).]
Outline
• Motivation
• FreeFlow Design
• Implementation and Evaluation
19
Implementation and Experimental Setup
• FreeFlow library
• Adds ~4,000 lines of C to libibverbs and libmlx4
• FreeFlow router
• ~2,000 lines of C++
• Testbed setup
• Two Intel Xeon E5-2620 8-core CPUs, 64 GB RAM
• 56 Gbps Mellanox ConnectX-3 NICs
• Docker containers
20
Does FreeFlow Support Low Latency?
[Chart: latency (µs) vs. message size (64 B to 4 KB): FreeFlow closely tracks native RDMA across sizes, with a gap of about 0.38 µs.]
Does FreeFlow Support High Throughput?
[Chart: throughput (Gbps) vs. message size (2 KB to 1 MB): FreeFlow matches native RDMA at large messages; at small messages it falls short, bounded by control path channel performance.]
Do Applications Benefit from FreeFlow?
[Chart: CDF of time per step (sec) for the RNN training workload: FreeFlow nearly matches native RDMA and is ~8.7x faster than Container+TCP.]
Summary
• Containerization today can't benefit from the speed of RDMA.
• Existing NIC virtualization solutions (e.g., SR-IOV) don't solve this.
• FreeFlow enables containerized apps to use RDMA.
• Challenges and key ideas:
• Control path: leveraging the verbs library's internal serialization for efficient verbs forwarding
• Data path: zero-copy memory synchronization
• Performance close to native RDMA
24
github.com/microsoft/freeflow
More Related Content

What's hot

Cilium - Bringing the BPF Revolution to Kubernetes Networking and Security
Cilium - Bringing the BPF Revolution to Kubernetes Networking and SecurityCilium - Bringing the BPF Revolution to Kubernetes Networking and Security
Cilium - Bringing the BPF Revolution to Kubernetes Networking and SecurityThomas Graf
 
Kernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric LeblondKernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric LeblondAnne Nicolas
 
Atom The Redis Streams-Powered Microservices SDK: Dan Pipemazo
Atom The Redis Streams-Powered Microservices SDK: Dan PipemazoAtom The Redis Streams-Powered Microservices SDK: Dan Pipemazo
Atom The Redis Streams-Powered Microservices SDK: Dan PipemazoRedis Labs
 
2015 FOSDEM - OVS Stateful Services
2015 FOSDEM - OVS Stateful Services2015 FOSDEM - OVS Stateful Services
2015 FOSDEM - OVS Stateful ServicesThomas Graf
 
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK
 
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPDockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPThomas Graf
 
Cilium - BPF & XDP for containers
 Cilium - BPF & XDP for containers Cilium - BPF & XDP for containers
Cilium - BPF & XDP for containersDocker, Inc.
 
http server on user-level mTCP stack accelerated by DPDK
http server on user-level mTCP stack accelerated by DPDKhttp server on user-level mTCP stack accelerated by DPDK
http server on user-level mTCP stack accelerated by DPDKLinaro
 
Investigating the Impact of Network Topology on the Processing Times of SDN C...
Investigating the Impact of Network Topology on the Processing Times of SDN C...Investigating the Impact of Network Topology on the Processing Times of SDN C...
Investigating the Impact of Network Topology on the Processing Times of SDN C...Steffen Gebert
 
LF_DPDK17_Lagopus Router
LF_DPDK17_Lagopus RouterLF_DPDK17_Lagopus Router
LF_DPDK17_Lagopus RouterLF_DPDK
 
The Next Generation Firewall for Red Hat Enterprise Linux 7 RC
The Next Generation Firewall for Red Hat Enterprise Linux 7 RCThe Next Generation Firewall for Red Hat Enterprise Linux 7 RC
The Next Generation Firewall for Red Hat Enterprise Linux 7 RCThomas Graf
 
LF_DPDK_Mellanox bifurcated driver model
LF_DPDK_Mellanox bifurcated driver modelLF_DPDK_Mellanox bifurcated driver model
LF_DPDK_Mellanox bifurcated driver modelLF_DPDK
 
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...LF_DPDK
 
Introduction to eBPF
Introduction to eBPFIntroduction to eBPF
Introduction to eBPFRogerColl2
 
Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Cheng-Chun William Tu
 
BPF & Cilium - Turning Linux into a Microservices-aware Operating System
BPF  & Cilium - Turning Linux into a Microservices-aware Operating SystemBPF  & Cilium - Turning Linux into a Microservices-aware Operating System
BPF & Cilium - Turning Linux into a Microservices-aware Operating SystemThomas Graf
 
Introduction to Remote Procedure Call
Introduction to Remote Procedure CallIntroduction to Remote Procedure Call
Introduction to Remote Procedure CallAbdelrahman Al-Ogail
 
Link Capacity Estimation in Wireless Software Defined Networks
Link Capacity Estimation in Wireless Software Defined NetworksLink Capacity Estimation in Wireless Software Defined Networks
Link Capacity Estimation in Wireless Software Defined NetworksFarzaneh Pakzad
 
Packet Framework - Cristian Dumitrescu
Packet Framework - Cristian DumitrescuPacket Framework - Cristian Dumitrescu
Packet Framework - Cristian Dumitrescuharryvanhaaren
 
Sun RPC (Remote Procedure Call)
Sun RPC (Remote Procedure Call)Sun RPC (Remote Procedure Call)
Sun RPC (Remote Procedure Call)Peter R. Egli
 

What's hot (20)

Cilium - Bringing the BPF Revolution to Kubernetes Networking and Security
Cilium - Bringing the BPF Revolution to Kubernetes Networking and SecurityCilium - Bringing the BPF Revolution to Kubernetes Networking and Security
Cilium - Bringing the BPF Revolution to Kubernetes Networking and Security
 
Kernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric LeblondKernel Recipes 2017 - EBPF and XDP - Eric Leblond
Kernel Recipes 2017 - EBPF and XDP - Eric Leblond
 
Atom The Redis Streams-Powered Microservices SDK: Dan Pipemazo
Atom The Redis Streams-Powered Microservices SDK: Dan PipemazoAtom The Redis Streams-Powered Microservices SDK: Dan Pipemazo
Atom The Redis Streams-Powered Microservices SDK: Dan Pipemazo
 
2015 FOSDEM - OVS Stateful Services
2015 FOSDEM - OVS Stateful Services2015 FOSDEM - OVS Stateful Services
2015 FOSDEM - OVS Stateful Services
 
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDKLF_DPDK17_Accelerating P4-based Dataplane with DPDK
LF_DPDK17_Accelerating P4-based Dataplane with DPDK
 
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPDockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
 
Cilium - BPF & XDP for containers
 Cilium - BPF & XDP for containers Cilium - BPF & XDP for containers
Cilium - BPF & XDP for containers
 
http server on user-level mTCP stack accelerated by DPDK
http server on user-level mTCP stack accelerated by DPDKhttp server on user-level mTCP stack accelerated by DPDK
http server on user-level mTCP stack accelerated by DPDK
 
Investigating the Impact of Network Topology on the Processing Times of SDN C...
Investigating the Impact of Network Topology on the Processing Times of SDN C...Investigating the Impact of Network Topology on the Processing Times of SDN C...
Investigating the Impact of Network Topology on the Processing Times of SDN C...
 
LF_DPDK17_Lagopus Router
LF_DPDK17_Lagopus RouterLF_DPDK17_Lagopus Router
LF_DPDK17_Lagopus Router
 
The Next Generation Firewall for Red Hat Enterprise Linux 7 RC
The Next Generation Firewall for Red Hat Enterprise Linux 7 RCThe Next Generation Firewall for Red Hat Enterprise Linux 7 RC
The Next Generation Firewall for Red Hat Enterprise Linux 7 RC
 
LF_DPDK_Mellanox bifurcated driver model
LF_DPDK_Mellanox bifurcated driver modelLF_DPDK_Mellanox bifurcated driver model
LF_DPDK_Mellanox bifurcated driver model
 
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
LF_DPDK17_OpenNetVM: A high-performance NFV platforms to meet future communic...
 
Introduction to eBPF
Introduction to eBPFIntroduction to eBPF
Introduction to eBPF
 
Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017Compiling P4 to XDP, IOVISOR Summit 2017
Compiling P4 to XDP, IOVISOR Summit 2017
 
BPF & Cilium - Turning Linux into a Microservices-aware Operating System
BPF  & Cilium - Turning Linux into a Microservices-aware Operating SystemBPF  & Cilium - Turning Linux into a Microservices-aware Operating System
BPF & Cilium - Turning Linux into a Microservices-aware Operating System
 
Introduction to Remote Procedure Call
Introduction to Remote Procedure CallIntroduction to Remote Procedure Call
Introduction to Remote Procedure Call
 
Link Capacity Estimation in Wireless Software Defined Networks
Link Capacity Estimation in Wireless Software Defined NetworksLink Capacity Estimation in Wireless Software Defined Networks
Link Capacity Estimation in Wireless Software Defined Networks
 
Packet Framework - Cristian Dumitrescu
Packet Framework - Cristian DumitrescuPacket Framework - Cristian Dumitrescu
Packet Framework - Cristian Dumitrescu
 
Sun RPC (Remote Procedure Call)
Sun RPC (Remote Procedure Call)Sun RPC (Remote Procedure Call)
Sun RPC (Remote Procedure Call)
 

Similar to FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds

OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBITOpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBITOpenNebula Project
 
DockerCon EU 2015: The Missing Piece: when Docker networking unleashing soft ...
DockerCon EU 2015: The Missing Piece: when Docker networking unleashing soft ...DockerCon EU 2015: The Missing Piece: when Docker networking unleashing soft ...
DockerCon EU 2015: The Missing Piece: when Docker networking unleashing soft ...Docker, Inc.
 
Building a sdn solution for the deployment of web application stacks in docker
Building a sdn solution for the deployment of web application stacks in dockerBuilding a sdn solution for the deployment of web application stacks in docker
Building a sdn solution for the deployment of web application stacks in dockerJorge Juan Mendoza
 
How OpenShift SDN helps to automate
How OpenShift SDN helps to automateHow OpenShift SDN helps to automate
How OpenShift SDN helps to automateIlkka Tengvall
 
DockerCon - The missing piece : when Docker networking unleashes software arc...
DockerCon - The missing piece : when Docker networking unleashes software arc...DockerCon - The missing piece : when Docker networking unleashes software arc...
DockerCon - The missing piece : when Docker networking unleashes software arc...Laurent Grangeau
 
The missing piece : when Docker networking and services finally unleashes so...
 The missing piece : when Docker networking and services finally unleashes so... The missing piece : when Docker networking and services finally unleashes so...
The missing piece : when Docker networking and services finally unleashes so...Adrien Blind
 
drbd9_and_drbdmanage_may_2015
drbd9_and_drbdmanage_may_2015drbd9_and_drbdmanage_may_2015
drbd9_and_drbdmanage_may_2015Alexandre Huynh
 
What CloudStackers Need To Know About LINSTOR/DRBD
What CloudStackers Need To Know About LINSTOR/DRBDWhat CloudStackers Need To Know About LINSTOR/DRBD
What CloudStackers Need To Know About LINSTOR/DRBDShapeBlue
 
middleware in embedded systems
middleware in embedded systemsmiddleware in embedded systems
middleware in embedded systemsAkhil Kumar
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking VN
 
DockerCon 2022 - From legacy to Kubernetes, securely & quickly
DockerCon 2022 - From legacy to Kubernetes, securely & quicklyDockerCon 2022 - From legacy to Kubernetes, securely & quickly
DockerCon 2022 - From legacy to Kubernetes, securely & quicklyEric Smalling
 
Scaling the Container Dataplane
Scaling the Container Dataplane Scaling the Container Dataplane
Scaling the Container Dataplane Michelle Holley
 
micro-ROS: bringing ROS 2 to MCUs
micro-ROS: bringing ROS 2 to MCUsmicro-ROS: bringing ROS 2 to MCUs
micro-ROS: bringing ROS 2 to MCUseProsima
 
FIWARE Wednesday Webinars - The Use of DDS Middleware in Robotics (Part 2)
FIWARE Wednesday Webinars - The Use of DDS Middleware in Robotics (Part 2)FIWARE Wednesday Webinars - The Use of DDS Middleware in Robotics (Part 2)
FIWARE Wednesday Webinars - The Use of DDS Middleware in Robotics (Part 2)FIWARE
 
SDN: an introduction
SDN: an introductionSDN: an introduction
SDN: an introductionLuca Profico
 
cncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetescncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetesKrishna-Kumar
 
Network simulator 2 a simulation tool for linux
Network simulator 2 a simulation tool for linuxNetwork simulator 2 a simulation tool for linux
Network simulator 2 a simulation tool for linuxPratik Joshi
 
3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds
3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds
3-Way Scripts as a Practical Platform for Secure Distributed Code in CloudsTokyo University of Science
 

Similar to FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds (20)

OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBITOpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
OpenNebulaConf 2016 - The DRBD SDS for OpenNebula by Philipp Reisner, LINBIT
 
DockerCon EU 2015: The Missing Piece: when Docker networking unleashing soft ...
DockerCon EU 2015: The Missing Piece: when Docker networking unleashing soft ...DockerCon EU 2015: The Missing Piece: when Docker networking unleashing soft ...
DockerCon EU 2015: The Missing Piece: when Docker networking unleashing soft ...
 
Simplify Networking for Containers
Simplify Networking for ContainersSimplify Networking for Containers
Simplify Networking for Containers
 
Building a sdn solution for the deployment of web application stacks in docker
Building a sdn solution for the deployment of web application stacks in dockerBuilding a sdn solution for the deployment of web application stacks in docker
Building a sdn solution for the deployment of web application stacks in docker
 
How OpenShift SDN helps to automate
How OpenShift SDN helps to automateHow OpenShift SDN helps to automate
How OpenShift SDN helps to automate
 
DockerCon - The missing piece : when Docker networking unleashes software arc...
DockerCon - The missing piece : when Docker networking unleashes software arc...DockerCon - The missing piece : when Docker networking unleashes software arc...
DockerCon - The missing piece : when Docker networking unleashes software arc...
 
The missing piece : when Docker networking and services finally unleashes so...
 The missing piece : when Docker networking and services finally unleashes so... The missing piece : when Docker networking and services finally unleashes so...
The missing piece : when Docker networking and services finally unleashes so...
 
drbd9_and_drbdmanage_may_2015
drbd9_and_drbdmanage_may_2015drbd9_and_drbdmanage_may_2015
drbd9_and_drbdmanage_may_2015
 
What CloudStackers Need To Know About LINSTOR/DRBD
What CloudStackers Need To Know About LINSTOR/DRBDWhat CloudStackers Need To Know About LINSTOR/DRBD
What CloudStackers Need To Know About LINSTOR/DRBD
 
middleware in embedded systems
middleware in embedded systemsmiddleware in embedded systems
middleware in embedded systems
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
 
DockerCon 2022 - From legacy to Kubernetes, securely & quickly
DockerCon 2022 - From legacy to Kubernetes, securely & quicklyDockerCon 2022 - From legacy to Kubernetes, securely & quickly
DockerCon 2022 - From legacy to Kubernetes, securely & quickly
 
Scaling the Container Dataplane
Scaling the Container Dataplane Scaling the Container Dataplane
Scaling the Container Dataplane
 
micro-ROS: bringing ROS 2 to MCUs
micro-ROS: bringing ROS 2 to MCUsmicro-ROS: bringing ROS 2 to MCUs
micro-ROS: bringing ROS 2 to MCUs
 
Remote Web Desk
Remote Web DeskRemote Web Desk
Remote Web Desk
 
FIWARE Wednesday Webinars - The Use of DDS Middleware in Robotics (Part 2)
FIWARE Wednesday Webinars - The Use of DDS Middleware in Robotics (Part 2)FIWARE Wednesday Webinars - The Use of DDS Middleware in Robotics (Part 2)
FIWARE Wednesday Webinars - The Use of DDS Middleware in Robotics (Part 2)
 
SDN: an introduction
SDN: an introductionSDN: an introduction
SDN: an introduction
 
cncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetescncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetes
 
Network simulator 2 a simulation tool for linux
Network simulator 2 a simulation tool for linuxNetwork simulator 2 a simulation tool for linux
Network simulator 2 a simulation tool for linux
 
3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds
3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds
3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds
 

Recently uploaded

Vital Signs of Animals Presentation By Aftab Ahmed Rahimoon
Vital Signs of Animals Presentation By Aftab Ahmed RahimoonVital Signs of Animals Presentation By Aftab Ahmed Rahimoon
Vital Signs of Animals Presentation By Aftab Ahmed Rahimoonintarciacompanies
 
MSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdfMSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdfSuchita Rawat
 
Heads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfHeads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfbyp19971001
 
Polyethylene and its polymerization.pptx
Polyethylene and its polymerization.pptxPolyethylene and its polymerization.pptx
Polyethylene and its polymerization.pptxMuhammadRazzaq31
 
Film Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfFilm Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfPharmatech-rx
 
Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Fabiano Dalpiaz
 
Taphonomy and Quality of the Fossil Record
Taphonomy and Quality of the  Fossil RecordTaphonomy and Quality of the  Fossil Record
Taphonomy and Quality of the Fossil RecordSangram Sahoo
 
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxSaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx
SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxPat (JS) Heslop-Harrison
 
PHOTOSYNTHETIC BACTERIA (OXYGENIC AND ANOXYGENIC)
PHOTOSYNTHETIC BACTERIA  (OXYGENIC AND ANOXYGENIC)PHOTOSYNTHETIC BACTERIA  (OXYGENIC AND ANOXYGENIC)
PHOTOSYNTHETIC BACTERIA (OXYGENIC AND ANOXYGENIC)kushbuR
 
Technical english Technical english.pptx
Technical english Technical english.pptxTechnical english Technical english.pptx
Technical english Technical english.pptxyoussefboujtat3
 
NUMERICAL Proof Of TIme Electron Theory.
NUMERICAL Proof Of TIme Electron Theory.NUMERICAL Proof Of TIme Electron Theory.
NUMERICAL Proof Of TIme Electron Theory.syedmuneemqadri
 
Classification of Kerogen, Perspective on palynofacies in depositional envi...
Classification of Kerogen,  Perspective on palynofacies in depositional  envi...Classification of Kerogen,  Perspective on palynofacies in depositional  envi...
Classification of Kerogen, Perspective on palynofacies in depositional envi...Sangram Sahoo
 
GBSN - Biochemistry (Unit 3) Metabolism
GBSN - Biochemistry (Unit 3) MetabolismGBSN - Biochemistry (Unit 3) Metabolism
GBSN - Biochemistry (Unit 3) MetabolismAreesha Ahmad
 
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...
dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...dkNET
 
NuGOweek 2024 programme final FLYER short.pdf
NuGOweek 2024 programme final FLYER short.pdfNuGOweek 2024 programme final FLYER short.pdf
NuGOweek 2024 programme final FLYER short.pdfpablovgd
 
EU START PROJECT. START-Newsletter_Issue_4.pdf
EU START PROJECT. START-Newsletter_Issue_4.pdfEU START PROJECT. START-Newsletter_Issue_4.pdf
EU START PROJECT. START-Newsletter_Issue_4.pdfStart Project
 
Costs to heap leach gold ore tailings in Karamoja region of Uganda
Costs to heap leach gold ore tailings in Karamoja region of UgandaCosts to heap leach gold ore tailings in Karamoja region of Uganda
Costs to heap leach gold ore tailings in Karamoja region of UgandaTimothyOkuna
 
A Scientific PowerPoint on Albert Einstein
A Scientific PowerPoint on Albert EinsteinA Scientific PowerPoint on Albert Einstein
A Scientific PowerPoint on Albert Einsteinxgamestudios8
 

Recently uploaded (20)

Vital Signs of Animals Presentation By Aftab Ahmed Rahimoon
Vital Signs of Animals Presentation By Aftab Ahmed RahimoonVital Signs of Animals Presentation By Aftab Ahmed Rahimoon
Vital Signs of Animals Presentation By Aftab Ahmed Rahimoon
 
MSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdfMSC IV_Forensic medicine - Mechanical injuries.pdf
MSC IV_Forensic medicine - Mechanical injuries.pdf
 
Heads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdfHeads-Up Multitasker: CHI 2024 Presentation.pdf
Heads-Up Multitasker: CHI 2024 Presentation.pdf
 
Polyethylene and its polymerization.pptx
Polyethylene and its polymerization.pptxPolyethylene and its polymerization.pptx
Polyethylene and its polymerization.pptx
 
Film Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdfFilm Coated Tablet and Film Coating raw materials.pdf
Film Coated Tablet and Film Coating raw materials.pdf
 
Information science research with large language models: between science and ...
Information science research with large language models: between science and ...Information science research with large language models: between science and ...
Information science research with large language models: between science and ...
 
Taphonomy and Quality of the Fossil Record
Taphonomy and Quality of the  Fossil RecordTaphonomy and Quality of the  Fossil Record
Taphonomy and Quality of the Fossil Record
 

FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds

  • 1. FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds Daehyeok Kim Tianlong Yu1, Hongqiang Liu3, Yibo Zhu4, Jitu Padhye2, Shachar Raindel2 Chuanxiong Guo4, Vyas Sekar1, Srinivasan Seshan1 Carnegie Mellon University1, Microsoft2, Alibaba group3, Bytedance4
  • 2. Two Trends in Cloud Applications • Lightweight isolation • Portability 1 • Higher networking performance Containerization RDMA networking
  • 3. Benefits of Containerization 2 NIC Container 1 IP: 10.0.0.1 IP: 30.0.0.1 Network App Container 2 IP: 20.0.0.1 Network App Host 1 Host 2 NIC Container 2 Network App IP: 20.0.0.1 IP: 40.0.0.1 Migration Namespace Isolation Portability Software Switch Software Switch
  • 4. Containerization and RDMA are in Conflict! 3 RDMA NIC Container 1 IP: 10.0.0.1 IP: 10.0.0.1 RDMA App Container 2 IP: 10.0.0.1 RDMA App Host 1 Host 2 RDMA NIC Container 2 RDMA App IP: 20.0.0.1 IP: 20.0.0.1 Migration Namespace Isolation Portability
  • 5. Existing H/W based Virtualization Isn’t Working 4 Container 1 IP: 10.0.0.1 IP: 10.0.0.1 RDMA App Container 2 IP: 10.0.0.2 RDMA App Container 2 RDMA App IP: 20.0.0.1 Host 1 Host 2 VF 1 VF 2 NIC Switch RDMA NIC IP: 10.0.0.2 VF NIC Switch IP: 20.0.0.1 Migration Using Single Root I/O Virtualization (SR-IOV) VF Virtual Function Namespace Isolation Portability
  • 6. Sub-optimal Performance of Containerized Apps 5 [Bar chart: Training Speed (images/sec) for Resnet-50, Inception-v3, and Alexnet; Native RDMA vs. Container+TCP; image classification CNN training] [CDF chart: time per step (sec); Native RDMA vs. Container+TCP; speech recognition RNN training] RDMA networking can improve the training speed of NN models by ~10x! (9.2x, 14.4x)
  • 7. Our Work: FreeFlow • Enable high speed RDMA networking capabilities for containerized applications • Compatible with existing RDMA applications • Close to native RDMA performance • Evaluation with real-world data-intensive applications 6
  • 8. Outline • Motivation • FreeFlow Design • Implementation and Evaluation 7
  • 9. FreeFlow Design Overview 8 RDMA App RDMA NIC Native RDMA FreeFlow RDMA NIC Container 1 IP: 10.0.0.1 RDMA App Container 2 IP: 20.0.0.1 FreeFlow IP: 30.0.0.1 Host Host Verbs library Verbs library Verbs API Verbs API RDMA App Verbs API NIC command
  • 10. Background on RDMA 9 RDMA App RDMA NIC MEM-1RDMA CTX RDMA App RDMA NIC MEM-2 RDMA CTX 1. Control path - Setup RDMA Context - Post work requests (e.g., write) 2. Data path - NIC processes work requests - NIC directly accesses memory “Host 1 wants to write contents in MEM-1 to MEM-2 on Host 2” Host 1 Host 2 Verbs library Verbs library
  • 11. FreeFlow in the Scene 10 RDMA App RDMA NIC MEM-1RDMA CTX “Container 1 wants to write contents in MEM-1 to MEM-2 on Container 2” Container 1 Container 2 FreeFlow S-RDMA CTX S-MEM-1 RDMA App RDMA NIC RDMA CTXMEM-2 FreeFlow S-MEM-2 S-RDMA CTX C1: How to forward verbs calls? Verbs library Verbs library C2: How to synchronize memory?
  • 12. Challenge 1: Verbs forwarding in Control Path 11 RDMA App RDMA NIC FreeFlow Verbs library Container NIC command Verbs API ? RDMA App Shim ibv_post_send (struct ibv_qp* qp, …) Attempt 1: Forward “as it is” Incorrect Attempt 2: “Serialize” and forward  Inefficient struct ibv_qp { struct ibv_context *context; …. };
  • 13. Internal Structure of Verbs Library 12 RDMA App RDMA NIC FreeFlow Verbs library Container NIC command Verbs API ? RDMA App Shim ibv_post_send (struct ibv_qp* qp, …) struct ibv_qp { struct ibv_context *context; …. }; Parameters are serialized by Verbs library!
  • 14. FreeFlow Control Path Channel 13 RDMA App FreeFlow library Write (VNIC_fd, serialized parameters) Parameters are forwarded correctly without manual serialization! Idea: Leveraging the serialized output of verbs library Verbs library Shim RDMA App RDMA NIC FreeFlow Router Verbs library Container NIC command Verbs API VNIC ibv_post_send (struct ibv_qp* qp, ….) FreeFlow Router VNIC
  • 15. Challenge 2: Synchronizing Memory for Data Path 14 • Shadow memory in FreeFlow router • A copy of application’s memory region • Directly accessed by NICs • S-MEM and MEM must be synchronized. • How to synchronize S-MEM and MEM? RDMA App RDMA NIC MEMRDMA CTX FreeFlow Router S-RDMA CTX S-MEM Verbs library VNIC Container
  • 16. Strawman Approach for Synchronization 15 “Container 1 wants to write contents in MEM-1 to MEM-2 on Container 2” RDMA App RDMA NIC MEM-1RDMA CTX FreeFlow Router S-RDMA CTX S-MEM-1 Verbs library VNIC Container RDMA App RDMA NIC RDMA CTXMEM-2 FreeFlow Router S-MEM-2 S-RDMA CTX Verbs library VNIC Container DATA ? Explicit synchronization High freq. High overhead Low freq. Wrong data for app
  • 17. Containers can Share Memory Regions 16 RDMA App RDMA NIC MEM-1RDMA CTX FreeFlow Router S-RDMA CTX S-MEM-1 Verbs library VNIC Container Host Shared memory MEM MEM and S-MEM can be located on the same physical memory region • FreeFlow router is running in a container
  • 18. Zero-copy Synchronization in Data Path 17 RDMA App RDMA NIC MEM-1RDMA CTX FreeFlow Router S-RDMA CTX S-MEM-1 Verbs library VNIC Container Host Shared memory MEM Synchronization without explicit memory copy: Method 1: Allocate shared buffers with FreeFlow APIs Method 2: Re-map app’s memory space to shadow memory space FreeFlow supports both! How to allocate MEM-1 in the shadow memory space?
  • 19. FreeFlow Design Summary 18 FreeFlow control path channel Zero-copy memory synchronization FreeFlow provides near native RDMA performance for containers! RDMA NIC Container 1 IP: 10.0.0.1 RDMA App Container 2 IP: 20.0.0.1 FreeFlow Router IP: 30.0.0.1 Verbs library RDMA App VNIC VNIC
  • 20. Outline • Motivation • FreeFlow Design • Implementation and Evaluation 19
  • 21. Implementation and Experimental Setup • FreeFlow Library • Add 4000 lines in C to libibverbs and libmlx4. • FreeFlow Router • 2000 lines in C++ • Testbed setup • Two Intel Xeon E5-2620 8-core CPUs, 64 GB RAM • 56 Gbps Mellanox ConnectX-3 NICs • Docker containers 20
  • 22. Does FreeFlow Support Low Latency? 21 [Chart: latency (μs) vs. message size (64 B–4 KB); Native RDMA vs. FreeFlow; gap at most 0.38 μs]
  • 23. Does FreeFlow Support High Throughput? 22 [Chart: throughput (Gbps) vs. message size (2 KB–1 MB); Native RDMA vs. FreeFlow] Bounded by control path channel performance
  • 24. Do Applications Benefit from FreeFlow? 23 [CDF chart: time per step (sec); Container+TCP vs. Native RDMA vs. FreeFlow; FreeFlow is 8.7x faster than Container+TCP]
  • 25. Summary • Containerization today can’t benefit from the speed of RDMA. • Existing solutions for NIC virtualization (e.g., SR-IOV) don’t work. • FreeFlow enables containerized apps to use RDMA. • Challenges and Key Ideas • Control path: Leveraging Verbs library structure for efficient Verbs forwarding • Data path: Zero-copy memory synchronization • Performance close to native RDMA 24 github.com/microsoft/freeflow

Editor's Notes

  1. Thanks for the introduction. My name is Daehyeok Kim. I am a PhD student at Carnegie Mellon University. Today, I will tell you how we can virtualize RDMA networking for containerized clouds with FreeFlow!
  2. So, there are two recent trends in developing cloud applications – containerization and RDMA networking. Container solutions such as Docker and LXC offer (click) lightweight isolation and portability, which lowers the complexity of deploying and managing the applications. On the other hand, Remote Direct Memory Access or RDMA networking offers (click) significantly higher throughput and lower latency while maintaining lower CPU utilization. As a result, many applications, especially data-intensive applications, adopt containers or RDMA networking, but not both at the same time. You may be wondering why they are not used together, and I will explain this in the next couple of slides.
  3. So, what are the benefits of containerizing apps, especially networked applications? With containerization, using traditional TCP/IP container networking, containers can use their own network namespaces with the support of a software switch. (click) Within their namespaces, they can use their own IP addresses regardless of the address of the host namespace. (click) So, it supports namespace isolation. (click) Moreover, even when a container is migrated to another host machine, it can retain its IP address, (click) which supports portability of the application running inside the container.
  4. However, when it comes to RDMA networking, containers cannot support these two properties because they are in conflict. (click) In this case, containers have to use the host namespace to directly interact with the RDMA network interface card. Because of this, each container cannot have an isolated network namespace, which means all containers have to use the same IP address. Another issue arises when migrating a container. (click) Since the namespace on the new host machine uses another IP address, (click) the migrated container also has to change its IP address to this address, (click) limiting the portability of containers.
  5. One natural solution we can think of would be using an existing hardware I/O virtualization technique such as SR-IOV. At a high level, (click) an SR-IOV capable NIC provides a set of virtual NICs called virtual functions, and (click) we can assign different IP addresses to each virtual function. (click) Then virtual machines or containers can use a virtual function in the host namespace but with different IP addresses. (click) This provides namespace isolation with one caveat: you are limited to the set of IP addresses assigned to a particular virtual function. While this might not affect namespace isolation on a single machine, when you have to deal with portability, this limits that ability.
  6. As a result, containerized applications cannot use RDMA today, which results in sub-optimal performance. To show this, we ran different neural net training workloads on TensorFlow with container TCP networking and native RDMA networking. (click) As you can see here, containerized TensorFlow with TCP networking shows significantly worse performance than the native RDMA case. (click) Native RDMA can improve the performance by around 10 times.
  7. So our goal was simple: we wanted to enable containerized applications to use high-speed RDMA networking while achieving isolation and portability. In this talk, I will present FreeFlow, a software-based virtual RDMA networking solution for containerized apps that achieves this goal. With FreeFlow, any existing containerized RDMA application can run with minimal or no code modification. Our evaluation with real data-intensive applications shows that FreeFlow can provide performance close to native RDMA.
  8. Now let me explain the key ideas in the design of FreeFlow.
  9. In the native RDMA case, an RDMA application directly talks to the RDMA NIC via the Verbs library, which translates (click) user-space RDMA commands into (click) NIC commands. (click) In FreeFlow, we introduce a software layer that virtualizes RDMA networking for containers. At a high level, (click) the FreeFlow layer receives Verbs calls from RDMA apps in containers and (click) processes them by interacting with the physical RDMA NIC on behalf of the apps. Since FreeFlow allows containers to run in their own network namespaces, it provides isolation and portability. However, while realizing this idea, we faced several challenges. To explain them, I will first briefly introduce how native RDMA works.
  10. Let me use a simple example where Host 1 wants to write the contents of MEM-1 into MEM-2 on Host 2. The RDMA app and RDMA NIC communicate through two different paths: the control path and the data path. **In the control path, **the RDMA app issues a set of verbs calls to set up the RDMA context and register the memory regions. **Once the context is ready, it exchanges the context with the remote side to establish an RDMA data channel. **Then it can post work requests, for example, a write request in this case. **Then, in the data path, the NIC processes work requests by directly accessing memory without involving the CPU, which is where most of RDMA’s performance gain comes from.
  11. Now let’s bring FreeFlow into the scene. **FreeFlow provides apps in containers with virtual resources for both the control and data paths. When the app creates a context, the FreeFlow layer creates a shadow copy of the app’s memory region. Later, in the data path, the NIC will directly access this shadow memory instead of the app’s memory region. (click) Because of this, the two memory regions must be synchronized. (click) FreeFlow then creates a corresponding shadow context and registers it with the physical RDMA NIC. (click) Then, the NIC will directly access the shadow memory. While this sounds like a simple idea, there are two key challenges. (click) The first is how to forward verbs calls from the app to the FreeFlow layer efficiently. (click) The second is how to synchronize the shadow and app memory regions without incurring much overhead. I will explain how we solve these in the next slides.
  12. The first challenge is how to efficiently intercept and forward verbs calls from the application. (click) That is, what should the app send to the FreeFlow layer to forward a verbs call correctly? (click) We first came up with a design that puts a shim layer under the app to intercept verbs calls. (click) In this example, the shim layer intercepts “post_send”, which is used for posting send or write work requests. Then the next question is: what does the shim forward to the FreeFlow layer? (click) Our first attempt was to forward the call as it is, but this failed. Why? (click) Because, as we can see here, the verbs call takes a pointer argument, (click) which makes this approach incorrect. (click) The second approach we took was to serialize and forward. (click) But we observed that the qp data structure contains lots of pointer member variables, (click) which makes serialization very difficult and inefficient.
  13. Fortunately, we found a hint by looking (click) one step deeper into the existing verbs library. (click) When the app issues a verbs call, the verbs library outputs serialized parameters before it sends the corresponding NIC command. This means that the complexity of verbs call arguments is simplified internally by the library.
  14. So, based on this observation, (click) we decouple the FreeFlow layer into two parts, the FreeFlow router and the FreeFlow library, and make the VNIC the interface between them. (click) Our shim layer intercepts the serialized output of the full-fledged verbs library instead of intercepting app-facing APIs. (click) Then, we make it forward the call to the virtual NIC on the FreeFlow router. (click) This way, verbs calls can be forwarded correctly without explicit serialization.
  15. The second challenge is how to synchronize shadow memory with the corresponding app’s memory region in the data path. (click) As mentioned earlier, the shadow memory is a copy of the application’s memory region, and the NIC directly accesses the shadow memory. (click) Those memory regions must be synchronized for correct data path operation. (click) We observe that the synchronization is not trivial because of the characteristics of RDMA operations. Let me explain this challenge with an example.
  16. Let’s assume container 1 wants to write the contents of its memory region, MEM-1, to another memory region, MEM-2, on container 2. When the write request is issued, the FreeFlow router needs to somehow copy the data from MEM-1 to the shadow memory. (click) Then the NIC will read from the shadow memory and transmit it to the NIC on the other side. (click) Then the receiver NIC puts the data into S-MEM-2 without CPU involvement. (click) Since there is no signal, the FreeFlow router does not know when it should copy the data from the shadow memory to the app’s memory. (click) One straightforward solution would be to explicitly synchronize the two memory regions, but it’s neither efficient nor effective. (click) If synchronization is too frequent, it consumes an excessive amount of CPU cycles and memory bandwidth; if it is too infrequent, the app can end up reading wrong data. Both results are undesirable. So how can we solve this problem?
  17. We observe an opportunity in the fact that containers on the same host machine are just processes, and they can share memory regions. (click) Since the FreeFlow router and the applications run in different containers but on the same host, (click) they can share memory space by locating MEM and S-MEM in the same physical memory region.
  18. Now we know that the app’s regions and the shadow memory can be allocated in shared memory space. (click) The question, then, is how the app can allocate its memory region in the shadow memory space. (click) FreeFlow provides two methods for this. (click) The first method is to allocate the app’s memory buffer using FreeFlow’s memory allocation API so that FreeFlow allocates the buffer in the shared memory region. This requires some application code modification. (click) The second method is to re-map the app’s memory space to the shadow memory region. While it needs no code modification, it requires the application’s memory buffer to be page-aligned. In our experience, many RDMA applications already make their data buffers page-aligned, so the FreeFlow router can simply remap the buffer without code modification. Because each method has different trade-offs, we provide both for developers.
  19. So, the end result is a system, FreeFlow, that allows you to combine those two great technologies, containerization and RDMA. In particular, we introduce the FreeFlow router with two key ideas: **First, we leverage the internal structure of the verbs library to forward verbs calls efficiently. **Second, we design zero-copy synchronization in the data path and provide two different methods for developers. **As a result, FreeFlow achieves near-native RDMA performance for containers while providing isolation and portability, as we will see in the next couple of slides.
  20. We implemented the FreeFlow library by modifying the user-space verbs libraries and drivers, and we implemented the FreeFlow router in C++. (click) Our testbed consists of two servers equipped with RDMA-capable NICs. We used Docker as the container solution.
  21. One nice thing about RDMA is that it provides latency of single-digit microseconds, which is incredibly valuable for latency-sensitive applications. So, we first measured the latency of RDMA operations, with two containers running on different hosts. (click) This graph shows the RDMA WRITE latency; the X-axis is message size and the Y-axis is latency. (click) We can see that compared to the native RDMA case, FreeFlow adds at most 0.4 μs of extra latency. The extra delay is mainly due to the IPC between the FreeFlow library and the FreeFlow router.
  22. High throughput is another advantage of RDMA. (click) This graph shows the RDMA SEND throughput. The sender transmits 1 GB of data with different message sizes. The X-axis is message size and the Y-axis is throughput in Gbps. (click) We see that with message sizes of 8 KB or larger, FreeFlow achieves the same full throughput as native RDMA. (click) With smaller messages, such as 2 KB, FreeFlow achieves nearly half of the full throughput. This is because the throughput is bounded by the capacity of the control path channel between the FreeFlow library and the router.
  23. What about data-intensive app performance? As an example, we trained a speech recognition RNN model on TensorFlow and compared the performance of container TCP networking, native RDMA, and FreeFlow. (click) This figure shows the CDF of the time spent in each training step. (click) The median training time with FreeFlow is around 8.7 times faster than the container TCP case and is also very close to native RDMA.
  24. To conclude the talk: containerization today cannot benefit from the performance of RDMA, and existing virtualization techniques don’t work for it. Today I introduced FreeFlow, which enables containerized apps to use RDMA with minimal or no code modification. We addressed two key challenges to virtualize the control and data paths, and we showed it can deliver near-native RDMA performance. Please check out our paper and GitHub repo for more details. With that, I’m happy to take questions. Thank you.
  25. We also ran the image recognition workload implemented with TensorFlow, using three specific models: ResNet-50, Inception-v3, and AlexNet. The figure shows the median training speed per second with 10th-percentile and 99th-percentile values. From the results of all three models, we conclude, first, that network performance is indeed a bottleneck in distributed training. Comparing host RDMA with host TCP, host RDMA performs 1.8 to 3.0 times better in terms of training speed. The gap between FreeFlow and Weave on a container overlay is even wider; for example, FreeFlow runs 14.6 times faster on AlexNet. Second, FreeFlow’s performance is very close to host RDMA; the difference is less than 4.9%.
  26. First, SR-IOV is an I/O virtualization technique supported by hardware devices such as NICs. It provides high performance by allowing VMs to directly access physical NICs. However, it has fundamental portability limitations since it requires reconfiguration of hardware NICs when a VM is migrated to another host machine. Also, it does not provide controllability since all traffic bypasses hypervisors and software switches.
  27. The second possible solution is SoftRoCE, a software-emulated RDMA being developed by Mellanox. It could easily achieve isolation, portability, and controllability since it implements the RDMA stack on top of the kernel network stack, so it could potentially leverage existing virtual network solutions for TCP/IP networking. However, it has low performance since packets are processed in the kernel stack.
  28. Lastly, the idea of hybrid virtualization could be extended to containers. The basic idea is to virtualize RDMA control operations using the hypervisor, while data operations are set up to bypass the hypervisor completely, as in SR-IOV. Although it can provide isolation, portability, and performance, it suffers from low controllability since it can only manipulate control plane commands.
  29. Lastly, we ran Spark on two servers. One server runs a master container that schedules jobs on slave containers; both servers run a slave container. The RDMA extension for Spark is closed source; we downloaded the binary from its official website and did not make any modification. We ran the basic benchmarks shipped with the Spark distribution – GroupBy and SortBy. As shown in the figure, we draw observations similar to those from running TensorFlow: the performance of the network does impact the application end-to-end performance significantly. When running with RvS, the performance is very close to running on host RDMA, better than host TCP, and up to 1.8 times better than running containers with Weave.
  30. Let me first explain a design decision we made to efficiently intercept RDMA API calls. There are multiple ways to intercept the communication between applications and physical NICs, but we must choose an efficient one. A number of commercial technologies supporting RDMA are available today, including InfiniBand, RoCE, and iWARP. Applications may also use several different high-level APIs to access RDMA features, such as MPI and rsocket. As shown in the figure, the “narrow waist” of these APIs is the IB Verbs API, so-called Verbs. Thus, we choose to support Verbs in RvS, and by doing so we naturally support all higher-level APIs. Let me briefly explain some basic concepts in Verbs. Verbs uses a concept of “queue pairs” (QPs) for data transfer. For every RDMA connection, each of the two endpoints has a send queue (SQ) and a receive queue (RQ), together called a QP. Each endpoint also has a separate completion queue (CQ) that is used by the NIC to notify the endpoint about the completion of send or receive requests. To transparently support Verbs, RvS creates virtual QPs and CQs in virtual NICs and relates operations on them to operations on real QPs and CQs in the physical NICs.
  31. Cloud application developers are seeking better performance, lower management cost, and higher resource efficiency. This has led to growing adoption of two technologies: containerization and RDMA networking. Containers offer lightweight isolation and portability, which lowers the complexity of deploying and managing cloud applications. Because of this, containers are now a standard way of managing and deploying large cloud applications. On the other hand, Remote Direct Memory Access or **RDMA** networking offers significantly higher throughput and lower latency while maintaining lower CPU utilization than standard TCP/IP-based networking. Because of this, many data-intensive applications, for example, deep learning and data analytics frameworks, are adopting RDMA as their communication method. So, our question is: ** can containerized applications use RDMA? **
  32. At a high level, FreeFlow introduces two components, the FreeFlow library and the FreeFlow router, to virtualize RDMA networking for containers while achieving isolation, portability, and controllability. The FreeFlow library provides a communication channel between the applications and the FreeFlow router to forward RDMA commands issued by the applications. The key role of the FreeFlow router is to virtualize RDMA NIC resources by creating shadow RDMA contexts and memory regions corresponding to the ones created by applications. It forwards the intercepted RDMA commands to the RDMA NIC based on this shadow context. Since FreeFlow allows containers to run their own network namespaces, it can provide isolation and portability. Also, by intercepting both control and data path commands, it can provide full controllability. The only question left is how to provide good performance while guaranteeing those properties.
  33. Why is it hard to achieve good performance in this design? To realize our design, we need to solve two problems. First, since both control and data path RDMA commands issued by applications are forwarded to the FreeFlow router, the forwarding mechanism should be efficient. This is difficult to achieve because we observe that the structure of RDMA command arguments is complex. Second, as the FreeFlow router maintains shadow context …
  34. Unfortunately, the two trends are fundamentally at odds with each other, especially in clouds. The core value of containerization is to provide efficient and flexible resource management to applications. To support this, containers have three properties in networking: **isolation, portability, and controllability.** For isolation, each container needs its own dedicated network namespace to eliminate conflicts with other containers. Second, to provide portability, a container typically uses virtual networking to communicate with other containers, and its virtual IP address sticks with the container regardless of which host machine it is *placed on* or *migrates to*. For controllability, container orchestrators can easily enforce control plane policies such as admission control and routing, and data plane policies such as QoS. These properties are particularly useful in multi-tenant cloud environments. To support these properties, in TCP/IP-based operation, networking is fully virtualized via software switches. However, it is hard to fully virtualize RDMA networking. RDMA achieves high networking performance by offloading network processing to hardware NICs. Because of this, it is difficult to modify control plane states in hardware, and it is also hard to control the data path since traffic goes directly between memory and NICs. As a result, several data-intensive applications that have adopted both of these technologies use RDMA only when running in dedicated bare-metal clusters. However, using dedicated clusters to run a single application is not cost-efficient for either service providers or customers.
  35. We all know that there have been a number of I/O virtualization solutions for virtual machines, including hardware virtualization such as SR-IOV and RDMA-specific software-based approaches such as HyV. It is important to note that none of these solutions was developed or used for containers. Can we just use them to virtualize RDMA for containerized clouds?
  36. First, SR-IOV is an I/O virtualization technique supported by hardware devices such as NICs. It provides high performance by allowing VMs to directly access physical NICs. However, it has fundamental portability limitations since it requires a static configuration on hardware NICs whenever a VM is migrated to another host machine. Also, it does not provide controllability since all traffic bypasses hypervisors and any software layers.
  37. On the other hand, there is a software-based technique called HyV. Its basic idea is to virtualize RDMA control operations using the hypervisor, while data operations are set up to bypass the hypervisor completely, as in SR-IOV. Unlike SR-IOV, it can provide isolation, portability, and limited controllability with the support of the hypervisor. So it seems that a software-based approach could be useful for realizing RNIC virtualization for containers, because it can intercept RDMA commands and manage NIC resources in software. By leveraging this, can we come up with software-based RNIC virtualization for containers?
  38. To motivate the first problem, let me explain how RDMA commands initiated by applications are delivered to the NICs. This figure on the left-hand side illustrates how the native RDMA case works.
  40. The first challenge is how to efficiently intercept and forward verbs calls from the application. As mentioned earlier, the FreeFlow router needs to receive verbs calls from the app and interact with the NIC to execute them on behalf of the app. **One natural way of doing this is to insert a shim layer under the app that captures and forwards every verbs call. However, we realized that this approach can be inaccurate or inefficient. **Let me take the example of one specific verbs call, post_send, which is used for posting a send or write work request. **Here, the call requires an argument that is a pointer to a qp structure, so just forwarding it as it is results in incorrect verbs call forwarding. **Moreover, we observe that the qp structure contains lots of pointers to different structure types, which makes serialization hard and inefficient. So, how can we efficiently forward a verbs call that takes complex arguments?
  41. Fortunately, we found a hint by looking one step deeper into the verbs library. In the native RDMA case, when the app issues a verbs call such as post_send, the verbs library produces serialized parameters before it passes the call down to the RDMA NIC. This way, the NIC can digest and process the verbs call without manual serialization. In other words, the complexity of the function call arguments is already flattened internally by the verbs library.
  44. So, based on this observation, **our shim layer intercepts the serialized output of the full-fledged verbs library instead of intercepting the app-facing APIs. In particular, while the verbs library originally delivers verbs calls to the physical NIC, we make it forward the calls to the virtual NIC on the FreeFlow router. **This way, verbs calls can be forwarded correctly without manual serialization.
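As a hedged sketch of that redirection (the socket-based channel and the function name are our illustration; the actual transport between the FreeFlow library and router may differ), the key point is that once the verbs library has produced a flat command buffer, forwarding it is a plain byte-wise send with no pointer chasing:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* In native libibverbs, each verbs call is flattened into a
 * command buffer that is written toward the NIC. The modified
 * library keeps that serialization step but redirects the
 * write to the FreeFlow router's channel, modeled here as a
 * connected Unix-domain socket. */
ssize_t forward_serialized_cmd(int router_fd, const void *cmd, size_t len) {
    /* cmd is already self-contained -- no embedded pointers --
     * so sending the raw bytes forwards the call correctly. */
    return send(router_fd, cmd, len, 0);
}
```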
  45. Let’s assume container 1 wants to write the contents of its memory region, MEM-1, to another memory region, MEM-2, on container 2. **Because the NIC performing an RDMA Write puts data directly into the shadow memory region without involving the host CPU, the FreeFlow router does not know exactly when it has to synchronize the data in shadow memory with the corresponding app memory region. **One straightforward solution would be to keep copying the shadow memory content, S-MEM-2, to the app memory, MEM-2. However, this consumes additional CPU cycles and memory bandwidth while incurring non-negligible memory copy latency, which can hurt data-path performance. So how can we solve this problem?
  46. We observe an opportunity in the fact that containers on the same host machine are just processes, and processes can share memory regions. Since the FreeFlow router and the applications run in different containers but on the same host, they can share memory space by placing their buffers in the same physical memory region.
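A minimal sketch of that observation, using POSIX shared memory (the object name and function name here are just for the demo): any two same-host processes that map the same shared-memory object see each other's writes with no copying at all.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

/* Map a POSIX shared-memory object of the given size. Two
 * containers on the same host can each call this with the same
 * name to get views of the same physical pages. Returns
 * MAP_FAILED on error. */
void *map_shared_region(const char *name, size_t len) {
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return MAP_FAILED;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return MAP_FAILED; }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after closing the fd */
    return p;
}
```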
  47. By leveraging this observation, we design zero-copy memory synchronization for the data path. **At a high level, FreeFlow allocates the app and shadow memory in the same physical memory region, so that the two regions stay synchronized without explicit memory copies. **In particular, FreeFlow provides two options to achieve this. The first is to allocate the app's memory buffer using FreeFlow's memory allocation API, so that FreeFlow places the buffer in the shared memory region; this requires a small amount of application code modification. The second option is to remap the app's memory space onto the shadow memory region. While it needs no code modification, it requires the application's memory buffer to be page-aligned. From our experience with real-world applications, many RDMA applications already make their data buffers page-aligned, so the FreeFlow router can simply remap the buffer onto the shadow memory region without code modification.
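The second (remapping) option can be sketched as follows; the function name is ours, and this is a simplified illustration of the idea rather than FreeFlow's actual implementation. If the app's buffer is page-aligned, its virtual pages can be replaced with the shared shadow pages, after which the app's reads and writes and the router's shadow region touch the same physical memory:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Remap a page-aligned app buffer onto a shared-memory object
 * (the shadow region). On success, app_buf and any other
 * mapping of shm_fd are the same physical memory -- writes on
 * either side are visible to both with zero copies. */
void *remap_to_shadow(void *app_buf, size_t len, int shm_fd) {
    long page = sysconf(_SC_PAGESIZE);
    /* remapping works only on whole, page-aligned ranges */
    if ((uintptr_t)app_buf % (uintptr_t)page != 0 || len % (size_t)page != 0)
        return MAP_FAILED;
    /* MAP_FIXED replaces the app's existing pages with the
     * shared-memory object's pages at the same address. */
    return mmap(app_buf, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_FIXED, shm_fd, 0);
}
```

Note the page-alignment check: this is exactly why the no-modification option only applies to apps whose buffers are already page-aligned.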
  48. We evaluate FreeFlow by answering two key questions. First, does FreeFlow provide low latency and high throughput, close to bare-metal RDMA? Second, can real-world applications using RDMA benefit from FreeFlow?