4. Containerization and RDMA are in Conflict!
[Figure: two hosts, each with an RDMA NIC. Containers 1 and 2 on Host 1 must share the host IP (10.0.0.1); after migration to Host 2, a container must adopt the new host IP (20.0.0.1). Neither namespace isolation nor portability holds.]
5. Existing H/W based Virtualization Isn’t Working
[Figure: with SR-IOV, the NIC exposes virtual functions (VFs) behind a NIC switch, so Container 1 (10.0.0.1) and Container 2 (10.0.0.2) each get their own IP on Host 1. After migrating Container 2 to Host 2, its VF there is configured with 20.0.0.1, so portability is still broken. VF: Virtual Function.]
6. Sub-optimal Performance of Containerized Apps
[Figure: training speed (images/sec) of ResNet-50, Inception-v3, and AlexNet (image-classification CNN training), and the CDF of time per step for speech-recognition RNN training, comparing native RDMA against Container+TCP. Native RDMA is 9.2x-14.4x faster.]
RDMA networking can improve the training speed of NN models by ~10x!
7. Our Work: FreeFlow
• Enable high-speed RDMA networking capabilities for containerized applications
• Compatible with existing RDMA applications
• Close to native RDMA performance
• Evaluation with real-world data-intensive applications
9. FreeFlow Design Overview
[Figure: native RDMA vs. FreeFlow. Natively, the RDMA app issues Verbs API calls to the Verbs library, which sends NIC commands to the RDMA NIC. With FreeFlow, containers (e.g., 10.0.0.1 and 20.0.0.1) issue Verbs calls to a FreeFlow layer on the host (30.0.0.1), which drives the RDMA NIC on their behalf.]
10. Background on RDMA
"Host 1 wants to write the contents of MEM-1 to MEM-2 on Host 2."
1. Control path
- Set up the RDMA context
- Post work requests (e.g., write)
2. Data path
- NIC processes work requests
- NIC directly accesses memory
[Figure: on each host, the RDMA app issues verbs calls through the Verbs library; the RDMA NIC accesses MEM-1/MEM-2 directly via the RDMA context.]
11. FreeFlow in the Scene
"Container 1 wants to write the contents of MEM-1 to MEM-2 on Container 2."
[Figure: FreeFlow sits between each container's Verbs library and the RDMA NIC, keeping a shadow context (S-RDMA CTX) and shadow memory (S-MEM-1/S-MEM-2) alongside the app's RDMA context and memory.]
C1: How to forward verbs calls?
C2: How to synchronize memory?
12. Challenge 1: Verbs forwarding in Control Path
[Figure: a shim under the RDMA app intercepts ibv_post_send(struct ibv_qp* qp, ...) — but what should it send to FreeFlow?]
Attempt 1: Forward "as is" → Incorrect
Attempt 2: "Serialize" and forward → Inefficient
struct ibv_qp {
    struct ibv_context *context;
    ...
};
13. Internal Structure of Verbs Library
[Figure: inside the Verbs library, the parameters of a call like ibv_post_send(struct ibv_qp* qp, ...) are serialized by the library itself before it issues the NIC command.]
Parameters are serialized by the Verbs library!
14. FreeFlow Control Path Channel
Idea: leverage the serialized output of the Verbs library.
[Figure: the FreeFlow library's shim intercepts the Verbs library's serialized output and calls Write(VNIC_fd, serialized parameters); the FreeFlow router receives it through the VNIC and issues the NIC command.]
Parameters are forwarded correctly without manual serialization!
15. Challenge 2: Synchronizing Memory for Data Path
• Shadow memory in FreeFlow router
• A copy of application’s memory region
• Directly accessed by NICs
• S-MEM and MEM must be synchronized.
• How to synchronize S-MEM and MEM?
[Figure: the FreeFlow router keeps S-RDMA CTX and S-MEM alongside the container's RDMA CTX and MEM; the NIC accesses S-MEM.]
16. Strawman Approach for Synchronization
"Container 1 wants to write the contents of MEM-1 to MEM-2 on Container 2."
[Figure: data must travel MEM-1 → S-MEM-1 → NIC → NIC → S-MEM-2 → MEM-2; how do the routers keep app memory and shadow memory in sync?]
Explicit synchronization:
- High frequency → high overhead
- Low frequency → wrong data for the app
17. Containers can Share Memory Regions
[Figure: MEM and S-MEM placed in one shared memory region on the host.]
MEM and S-MEM can be located in the same physical memory region:
• The FreeFlow router runs in a container, and containers on the same host can share memory.
18. Zero-copy Synchronization in Data Path
[Figure: MEM-1 and S-MEM-1 share one physical memory region on the host.]
Synchronization without explicit memory copies. How to allocate MEM-1 in the shadow memory space?
Method 1: Allocate shared buffers with FreeFlow APIs
Method 2: Re-map the app's memory space to the shadow memory space
FreeFlow supports both!
23. Does FreeFlow Support High Throughput?
[Figure: throughput (Gbps) vs. message size (2KB-1MB) for native RDMA and FreeFlow. FreeFlow matches native RDMA at 8KB and above; below that it is bounded by control path channel performance.]
24. Do Applications Benefit from FreeFlow?
[Figure: CDF of time per training step for Container+TCP, native RDMA, and FreeFlow. FreeFlow's median step time is 8.7x faster than Container+TCP and close to native RDMA.]
25. Summary
• Containerization today can’t benefit from the speed of RDMA.
• Existing solutions for NIC virtualization don’t work (e.g., SR-IOV).
• FreeFlow enables containerized apps to use RDMA.
• Challenges and Key Ideas
• Control path: Leveraging Verbs library structure for efficient Verbs forwarding
• Data path: Zero-copy memory synchronization
• Performance close to native RDMA
github.com/microsoft/freeflow
Editor's Notes
Thanks for the introduction. My name is Daehyeok Kim. I am a PhD student at Carnegie Mellon University.
Today, I will tell you how we can virtualize RDMA networking for containerized clouds with FreeFlow!
So, there are two recent trends in developing cloud applications – Containerization and RDMA networking.
Container solutions such as Docker and LXC offer
(click) lightweight isolation and portability, which lowers the complexity of deploying and managing applications.
On the other hand, Remote Direct Memory Access or RDMA networking offers
(click) significantly higher throughput and lower latency while maintaining lower CPU utilization.
As a result, many applications, especially data-intensive ones, adopt containers or RDMA networking, but not both at the same time.
You may be wondering why they are not used together, and I will explain this in the next couple of slides.
So, what are the benefits of containerizing apps, especially networked applications?
With containerization, for traditional TCP/IP container networking, containers can use their own network namespace with the support of a software switch.
(click) Within their namespace, they can use their own IP addresses regardless of the address of host namespace. (click) So, it supports namespace isolation.
(click) Moreover, even when the container is migrated to another host machine, it can retain its IP address,
(click) which supports portability of the application running inside the container.
However, when it comes to RDMA networking, containers cannot support these two properties because they are in conflict.
(click) In this case, containers have to use the host namespace to directly interact with the RDMA network interface card.
Because of this, each container cannot have an isolated network namespace, which means all containers have to use the same IP address.
And another issue happens when migrating the container.
(click) Since the namespace on the new host machine uses another IP address,
(click) the migrated container also has to change its IP address to this address,
(click) limiting the portability of containers.
One natural solution would be to use an existing hardware I/O virtualization technique such as SR-IOV.
At a high level, (click) an SR-IOV-capable NIC provides a set of virtual NICs called virtual functions, and
(click) we can assign a different IP address to each virtual function.
(click) Then virtual machines or containers can use virtual functions in the host namespace but with different IP addresses.
(click)
This provides namespace isolation, with one caveat: you are limited to the set of IP addresses assigned to a particular virtual function.
While this might not affect namespace isolation on a single machine, it limits portability.
As a result, containerized applications cannot use RDMA today and it results in sub-optimal performance.
To show this, we ran different neural-net training workloads on TensorFlow, with containers using TCP networking and with native RDMA networking.
(click) As you can see here, containerized TensorFlow with TCP networking shows significantly worse performance than the native RDMA case.
(click) Native RDMA can improve the performance by around 10 times.
So our goal was simple: we wanted to let containerized applications use high-speed RDMA networking while achieving isolation and portability.
In this talk, I will present FreeFlow, a software-based virtual RDMA networking solution for containerized apps that achieves our goal.
With FreeFlow, existing containerized RDMA applications can run with minimal or no code modification.
Our evaluation with real data-intensive applications shows that FreeFlow provides performance close to native RDMA.
Now let me explain the key ideas in the design of FreeFlow.
In the native RDMA case, an RDMA application directly talks to the RDMA NIC via the Verbs library, which translates
(click) the user-space RDMA commands into
(click) NIC commands.
(click) In FreeFlow, we introduce a software layer that virtualizes RDMA networking for containers.
At high-level, (click) the FreeFlow layer gets Verbs calls from RDMA apps on containers,
and (click) processes them by interacting with the physical RDMA NIC on behalf of the apps.
Since FreeFlow allows containers to run their own network namespace, it provides isolation and portability.
However, while realizing this idea, we faced several challenges.
To explain them, I first briefly introduce the background on how native RDMA works.
Consider a simple example where Host 1 wants to write the contents of MEM-1 into MEM-2 on Host 2.
The RDMA app and RDMA NIC communicate through two different paths: the control path and the data path.
**In the control path,
**the RDMA app issues a set of verbs calls to set up the RDMA context and register the memory regions.
**Once the context is ready, it exchanges the context with the remote side to establish an RDMA data channel.
**Then it can post work requests, for example, write request in this case.
**Then, in the data path, the NIC processes work requests
by directly accessing memory without involving the CPU, which is where most of RDMA’s performance gain comes from.
Now let’s bring FreeFlow into the scene.
**FreeFlow provides apps in container with virtual resources for both control and data path.
When the app creates a context, the FreeFlow layer creates a shadow memory for the app’s memory region.
Later, in the data path, the NIC will directly access this shadow memory instead of the app’s memory region.
(click) Because of this, these two memory regions must be synchronized.
(click) FreeFlow then creates the corresponding shadow context and registers it with the physical RDMA NIC.
(click) Then, the NIC will directly access the shadow memory.
While it sounds like a simple idea, there are two key challenges.
(click) The first one is how to forward verbs calls from the app to the FreeFlow layer efficiently.
(click) The second challenge is how to synchronize the shadow and app memory regions without incurring much overhead.
I will explain how we solve these in the next slides.
The first challenge is how to efficiently intercept and forward verbs calls from the application.
(click) That is, what should the app send to the FreeFlow layer so that verbs calls are forwarded correctly?
(click) We first came up with a design that puts a shim layer under the app to intercept the verbs calls.
(click) In this example, the shim layer intercepts “post_send”, which is used for posting send or write work requests.
Then the next question is: what does the shim forward to the FreeFlow layer?
(click) Our first attempt was to forward the call as it is, but that failed. Why?
(click) Because as we can see here the verb call takes a pointer argument,
(click) which makes this approach incorrect.
(click) The second approach we took was to serialize and forward.
(click) But we observe that the qp data structure contains lots of pointer member variables,
(click) which makes serialization very difficult and inefficient.
Fortunately, we found a hint by looking (click) one step deeper into the existing verbs library.
(click) When the app issues a verbs call, the verbs library outputs serialized parameters before it sends the corresponding NIC command.
This means that the complexity of verbs call arguments is simplified internally by the library.
So, based on this observation,
(click) we decouple the FreeFlow layer into two parts, the FreeFlow router and the FreeFlow library, and make the VNIC the interface between them.
(click) Our shim layer intercepts the serialized output of the full-fledged verbs library instead of intercepting app-facing APIs.
(click) Then, we make it forward the call to the virtual NIC on the FreeFlow router.
(click) This way, verbs calls can be forwarded correctly without explicit serialization.
The second challenge is how to synchronize the shadow memory with the corresponding app’s memory region in the data path.
(click) As mentioned earlier, the shadow memory is a copy of the application’s memory region, and the NIC directly accesses the shadow memory.
(click) And those memory regions must be synchronized for correct data path operations.
(click) We observe that synchronization is not trivial because of the characteristics of RDMA operations.
Let me explain this challenge with an example.
Let’s assume Container 1 wants to write the contents of its memory region, MEM-1, to another memory region, MEM-2, on Container 2.
When the write request is issued, the FreeFlow router needs to somehow copy the data from MEM-1 to the shadow memory,
(click) and then the NIC will read from the shadow memory and transmit it to the NIC on the other side.
(click) Then the receiver NIC puts the data into S-MEM-2 without CPU involvement.
(click) Since there is no signal, the FreeFlow router does not know when it should copy the data from the shadow memory to the app memory.
(click) One straightforward solution would be to explicitly synchronize the two memory regions, but it is neither efficient nor effective.
(click) If synchronization is too frequent, it consumes an excessive amount of CPU cycles and memory bandwidth.
If the frequency is too low, the app can end up reading wrong data. Both results are undesirable.
So how can we solve this problem?
We observe an opportunity in the fact that containers on the same host machine are just processes, and processes can share memory regions.
(click) Since the FreeFlow router and the applications run in different containers but on the same host,
(click) they can share memory space by locating it in the same physical memory region.
Now we know that the app’s regions and the shadow memory can be allocated in the shared memory space.
(click) The question, then, is how the app can allocate its memory region in the shadow memory space.
(click) FreeFlow provides two methods for this.
(click) The first method is to allocate the app’s memory buffer using FreeFlow’s memory allocation API, so that FreeFlow places the buffer in the shared memory region.
It requires some amount of application code modification.
(click) The second method is to re-map the app’s memory space to the shadow memory region. While it does not need code modification, it requires the application’s memory buffer to be page-aligned. In our experience, many RDMA applications already make their data buffers page-aligned, so the FreeFlow router can just remap the buffer without code modification.
Because each method has different trade-offs, we provide both for developers.
So, the end result is FreeFlow, a system that allows you to combine these two great technologies, containerization and RDMA.
In particular, we introduce the FreeFlow router with two key ideas.
**First, we leverage the internal structure of the verbs library to forward verbs calls efficiently.
**Second, we design zero-copy synchronization in the data path and provide two different methods for developers.
**As a result, FreeFlow achieves near-native RDMA performance for containers while providing isolation and portability, as we will see in the next couple of slides.
We implemented the FreeFlow library by modifying the user-space verbs libraries and drivers.
And we implemented FreeFlow router in C++.
(click) Our testbed consists of two servers equipped with RDMA-capable NICs.
We used docker as a container solution.
One nice thing about RDMA is that it provides single-digit-microsecond latency, which is incredibly valuable for latency-sensitive applications.
So, we first measured the latency of RDMA operations.
Two containers were running on different hosts.
(click) This graph shows the RDMA WRITE latency.
The X-axis is message size and the Y-axis is latency.
(click) We can see that, compared to the native RDMA case, FreeFlow adds at most 0.4 us of extra latency.
The extra delay is mainly due to the IPC between the FreeFlow library and FreeFlow router.
High throughput is another advantage of using RDMA.
(click) This graph shows the RDMA SEND throughput.
The sender transmits 1GB of data with different message sizes.
The X-axis is message size and the Y-axis is throughput in Gbps.
(click) We see that with message sizes equal to or larger than 8KB, FreeFlow achieves the same full throughput as native RDMA.
(click) When the message sizes are smaller, like 2KB, FreeFlow achieves nearly half of the full throughput.
This is because the throughput is bounded by the capacity of the control path channel between the FreeFlow library and the router.
How about data-intensive app performance?
As an example, we trained a speech-recognition RNN model on TensorFlow.
We compared the performance of container TCP networking, native RDMA, and FreeFlow.
(click) This figure shows the CDF of the time spent for each training step,
(click) We can see that the median training time with FreeFlow is around 8.7 times faster than the container TCP case, and it is also very close to native RDMA.
To conclude the talk: containerization today cannot benefit from the performance of RDMA.
Existing virtualization techniques don’t work for it.
Today I introduced FreeFlow, which enables containerized apps to use RDMA with minimal or no code modification.
We addressed two key challenges to virtualize the control and data paths.
We showed it can support near-native RDMA performance.
Please check out our paper and github repo for more details.
With that, I’m happy to take questions, thank you.
We also ran an image-recognition workload implemented with TensorFlow.
For image recognition, we ran three specific models: ResNet-50, Inception-v3, and AlexNet.
The figure shows the median training speed with 10th-percentile and 99th-percentile values.
From the results of all three models, we conclude, first, that network performance is indeed a bottleneck in distributed training.
Comparing host RDMA with host TCP, host RDMA performs 1.8 to 3.0 times better in terms of training speed.
The gap between FreeFlow and Weave on container overlay is even wider.
For example, FreeFlow runs 14.6 times faster on AlexNet.
Second, FreeFlow performance is very close to host RDMA.
The difference is less than 4.9%.
First, SR-IOV is an I/O virtualization technique supported by hardware devices such as NICs.
It provides high performance by allowing VMs to directly access physical NICs.
However, it has fundamental portability limitations, since it requires reconfiguration of the hardware NICs when a VM is migrated to another host machine.
Also, it does not provide controllability, since all traffic bypasses hypervisors and software switches.
The second possible solution is using SoftRoCE,
which is a software-emulated RDMA being developed by Mellanox.
It could easily achieve isolation, portability, and controllability
since it implements RDMA stack on top of kernel network stack,
so it could potentially leverage existing virtual network solutions for TCP/IP networking.
However, it has low performance, since packets are processed in the kernel stack.
Lastly, the idea of Hybrid Virtualization could be extended for containers.
The basic idea is that it virtualizes control operations of RDMA using hypervisor environments,
while data operations are set up to bypass the hypervisor completely like in SR-IOV.
Although it can provide isolation, portability, and performance,
it suffers from low controllability, since it can only manipulate control-plane commands.
Lastly, we ran Spark on two servers.
One of the server runs a master container that schedules jobs on slave containers.
Both of the servers run a slave container.
The RDMA extension for Spark is closed source.
We downloaded the binary from the official website and did not make any modifications.
We demonstrate the basic benchmarks shipped with the Spark distribution – GroupBy and SortBy.
As shown in the figure, we draw similar observations as with TensorFlow:
the performance of the network significantly impacts end-to-end application performance.
When running with RvS, performance is very close to host RDMA, better than host TCP, and up to 1.8 times better than running containers with Weave.
Let me first explain a design decision we made to efficiently intercept RDMA API calls.
There are multiple ways to intercept the communications between applications and physical NICs,
but we must choose an efficient one.
A number of commercial technologies supporting RDMA are available today,
including Infiniband, RoCE and iWarp.
Applications may also use several different high-level APIs to access RDMA features,
such as MPI and rsocket.
As shown in the figure, the “narrow waist” of these APIs is the IB Verbs API, so called Verbs.
Thus, we choose to support Verbs in RvS and by doing so we can naturally support all higher-level APIs.
Let me briefly explain some basic concepts in Verbs.
Verbs uses a concept of “queue pairs” called QP for data transfer.
For every RDMA connection, each of the two endpoints has a send queue (SQ) and a receive queue (RQ), together called a QP.
Each endpoint also has a separate completion queue, called a CQ, that is used by the NIC to notify the endpoint about the completion of send or receive requests.
To transparently support Verbs, RvS creates virtual QPs and CQs in virtual NICs and relates operations on them to operations on the real QPs and CQs in the physical NICs.
Cloud application developers are seeking better performance, lower management cost, and higher resource efficiency.
This has led to the growing adoption of two technologies: containerization and RDMA networking.
Containers offer lightweight isolation and portability, which lowers the complexity of deploying and managing cloud applications.
Because of this, containers are now a standard way of managing and deploying large cloud applications.
On the other hand, Remote Direct Memory Access or **RDMA** networking offers significantly higher throughput, lower latency while maintaining lower CPU utilization than standard TCP/IP based networking.
Because of this, many data-intensive applications, for example, deep learning and data analytics frameworks are adopting RDMA as their communication method.
So, our question is ** can containerized applications use RDMA? **
At a high level, FreeFlow introduces two components, the FreeFlow library and the FreeFlow router, to virtualize RDMA networking for containers while achieving isolation, portability, and controllability.
The FreeFlow library provides a communication channel between the applications and the FreeFlow router to forward RDMA commands issued by the applications.
The key role of the FreeFlow router is to virtualize RDMA NIC resources by creating shadow RDMA contexts and memory regions corresponding to the ones created by applications.
It forwards the intercepted RDMA commands to the RDMA NIC based on this shadow context.
Since FreeFlow allows containers to run their own network namespace, it provides isolation and portability.
Also, by intercepting both control and data path commands, it provides full controllability.
The only remaining question is how to provide good performance while guaranteeing those properties.
Why is it hard to achieve good performance in this design?
To realize our design, we need to solve two problems:
First, since both control and data path RDMA commands issued by applications are forwarded to the FreeFlow router, the forwarding mechanism must be efficient.
This is difficult because the structure of RDMA command arguments is complex.
Second, the FreeFlow router maintains shadow contexts and memory regions, which must be kept synchronized with the application’s.
Unfortunately, the two trends are fundamentally at odds with each other especially in clouds.
The core value of containerization is to provide an efficient and flexible resource management to applications.
To support this, containers have three properties in networking:
**isolation, portability, and controllability.**
For isolation, each container needs to have its dedicated network namespace to eliminate conflicts with other containers.
Second, to provide portability, a container typically uses virtual networking to communicate with other containers,
and its virtual IP address sticks with the container regardless of which host machine it is *placed on* or *migrated to*.
For controllability, container orchestrators can easily enforce
control plane policies such as admission control and routing, and data plane policies such as QoS.
These properties are particularly useful in multi-tenant cloud environments.
To support these properties, in TCP/IP-based operations,
networking is fully virtualized via software switches.
However, it is hard to fully virtualize RDMA networking.
RDMA achieves high networking performance by offloading network processing to hardware NICs.
Because of this, it’s difficult to modify the control plane states in hardware
while it’s also hard to control the data path since traffic directly goes between memory and NICs.
As a result, several data-intensive applications that have adopted both of these technologies use RDMA
only when running in dedicated bare-metal clusters.
However, using dedicated clusters to run a single application is not cost-efficient
for either service providers or customers.
We all know that there have been a number of I/O virtualization solutions for virtual machines, including hardware virtualization such as SR-IOV and RDMA-specific software-based approaches such as HyV.
It’s important to note that none of these solutions have been developed or used for containers.
Can we just use them for virtualizing RDMA in containerized clouds?
First, SR-IOV is an I/O virtualization technique supported by hardware devices such as NICs.
It provides high performance by allowing VMs to directly access physical NICs.
However, it has fundamental portability limitations, since it requires a static configuration on the hardware NICs whenever a VM is migrated to another host machine.
Also, it does not provide controllability, since all traffic bypasses hypervisors and any software layers.
On the other hand, there is a software-based technique called HyV.
Its basic idea is that it virtualizes control operations of RDMA using hypervisor environments,
while data operations are set up to bypass the hypervisor completely like in SR-IOV.
Unlike SR-IOV, it can provide isolation, portability, and limited controllability with the support of the hypervisor.
So, it seems like software-based approach could be useful to realize RNIC virtualization for containers because it can intercept RDMA commands and manage NIC resources in the software.
By leveraging this, can we come up with a software-based RNIC virtualization for containers?
To motivate the first problem, let me explain how RDMA commands initiated by applications are delivered to the NICs.
This figure on the left-hand side illustrates how the native RDMA case works.
The first challenge is how to efficiently intercept and forward verbs calls from the application.
As mentioned earlier, the FreeFlow router needs to get verbs calls from the app and interact with the NIC to execute the calls on behalf of the app.
**One natural way of doing this is to insert a shim layer under the app which captures and forwards every verbs call.
However, we realized that this approach can be inaccurate or inefficient.
**Let me take the example of one specific verbs call, post_send, which is used for posting a send or write work request.
**Here, the call requires an argument which is a pointer to a qp structure, so just forwarding it as it is results in incorrect verbs call forwarding.
**Moreover, we observe that the qp structure contains lots of pointers to different structure types, which makes serialization hard and inefficient.
So, how can we efficiently forward a verbs call which takes complex arguments?
Fortunately, we found a hint by looking one step deeper into the verbs library.
In the native RDMA case, when the app issues a verbs call such as post_send, the verbs library outputs serialized parameters before it passes the call to the RDMA NIC.
This way, the NIC can digest and process the verbs call without manual serialization,
which means that the complexity of function call arguments is simplified internally by the verbs library.
So, based on this observation, **our shim layer intercepts the serialized output of the full-fledged verbs library instead of intercepting app-facing APIs.
In particular, while the verbs library originally delivers verbs calls to the physical NIC, we make it forward the call to the virtual NIC on the FreeFlow router.
**This way, verbs calls can be forwarded correctly without manual serialization.
Let’s assume Container 1 wants to write the contents of its memory region, MEM-1, to another memory region, MEM-2, on Container 2.
**When performing an RDMA Write, the NIC directly writes data into the shadow memory region without involving the host CPU, so the FreeFlow router does not know exactly when it has to synchronize the data in shadow memory with the corresponding app memory region.
**One straightforward solution would be to keep copying the shadow memory content, S-MEM-2, to the app memory, MEM-2.
However, this consumes additional CPU cycles and memory bandwidth while incurring non-negligible memory-copy latency, which can affect data path performance.
So how can we solve this problem?
We observe an opportunity in the fact that containers on the same host machine are just processes, and processes can share memory regions.
Since the FreeFlow router and the applications run in different containers but on the same host, they can share memory space by locating it in the same physical memory region.
By leveraging this observation, we design zero-copy memory synchronization for the data path.
**At a high level, FreeFlow allocates the app and shadow memory space in the same physical memory region, so those two memory regions can be synchronized without explicit memory copies.
**In particular, FreeFlow provides two options to achieve this.
One is to allocate the app’s memory buffer using FreeFlow’s memory allocation API, so that FreeFlow places the buffer in the shared memory region.
It requires some amount of application code modification.
The second option is to re-map the app’s memory space to the shadow memory region. While it does not need code modification, it requires the application’s memory buffer to be page-aligned. In our experience with real-world applications, many RDMA applications already make their data buffers page-aligned, so the FreeFlow router can just remap the buffer to the shadow memory region without code modification.
We evaluate FreeFlow by answering two key questions:
The first is: does FreeFlow provide low latency and high throughput, close to bare-metal RDMA?
The second is: can real-world applications using RDMA benefit from FreeFlow?