Hello, I’m Yoshiro Yamabe, and I work at the NTT Software Innovation Center, where I work on Remote Direct Memory Access (RDMA) technologies. Today, I’ll talk about my experience in “RDMA programming design and case studies for better-performing distributed applications”.
First, I’ll talk about the merits, demerits, and potential of the RDMA protocol.
RDMA provides low latency and low CPU overhead thanks to two major features: zero-copy and CPU bypass.
So the potential of RDMA is very attractive: for example, in a simple ping-pong program, RDMA’s latency is about one fifth of IPoIB’s.
However, it is more difficult to implement than conventional TCP/IP programming.
For example, to establish an RDMA connection, the peers first need to exchange various pieces of information over IP. There are many other complications as well: the IPoIB version of this benchmark is about 300 lines including comments and blank lines, while the RDMA version is about 800 lines.
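To give a feel for that out-of-band exchange, here is a minimal sketch of two peers swapping connection parameters over a plain TCP-style socket before any RDMA traffic can flow. The field set (QP number, LID, packet sequence number, remote key, buffer address) follows typical InfiniBand verbs usage but is an illustrative assumption, not the exact handshake of any particular implementation.

```python
import socket
import struct

# Connection parameters each peer must share before RDMA traffic can start.
# The exact field set is an assumption for illustration:
# qp_num, lid, psn, rkey, vaddr.
CONN_FMT = "!IHIIQ"

def pack_conn_info(qp_num, lid, psn, rkey, vaddr):
    return struct.pack(CONN_FMT, qp_num, lid, psn, rkey, vaddr)

def recv_conn_info(sock):
    """Read exactly one peer's parameter record from the socket."""
    need = struct.calcsize(CONN_FMT)
    data = b""
    while len(data) < need:
        data += sock.recv(need - len(data))
    return struct.unpack(CONN_FMT, data)

# A local socket pair stands in for the two hosts' IP connection.
a, b = socket.socketpair()
a.sendall(pack_conn_info(qp_num=42, lid=1, psn=1000, rkey=0xDEAD, vaddr=0x7F00))
b.sendall(pack_conn_info(qp_num=99, lid=2, psn=2000, rkey=0xBEEF, vaddr=0x8F00))
remote_at_b = recv_conn_info(b)  # b now knows a's QP number, rkey, etc.
remote_at_a = recv_conn_info(a)  # and vice versa
a.close()
b.close()
```

Only after both sides hold the other’s parameters can they transition their queue pairs to a connected state, which is one reason the RDMA program ends up so much longer than the IPoIB one.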
I thought distributed deep learning would be a good fit for RDMA, so I chose MXNet as a distributed deep learning framework and tried to apply RDMA to it.
MXNet adopts a parameter server mechanism for distributed learning.
A parameter server setup consists of two types of nodes: worker nodes, which compute parameters, and server nodes, which gather parameters from the workers, aggregate them, and finally scatter the aggregated parameters back to all workers.
So the processing flow of each batch is: first, the workers calculate parameters; second, each worker pushes its parameters to the server; third, the server aggregates them; and last, the server pushes the aggregated parameters back to each worker. Then the next batch starts.
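The batch flow above can be sketched as a toy simulation in plain Python (no MXNet; using element-wise averaging as the aggregation rule, which is an assumption for illustration):

```python
# Toy parameter-server round: workers push their parameters, the server
# aggregates them (here: element-wise average, an assumed rule), and the
# server pushes the result back to every worker.

def run_batch(worker_params):
    # Steps 1-2: each worker computes its parameters and pushes to the server.
    pushed = [list(p) for p in worker_params]
    # Step 3: the server aggregates the pushed parameters.
    n = len(pushed)
    aggregated = [sum(vals) / n for vals in zip(*pushed)]
    # Step 4: the server pushes the aggregated parameters back to each worker.
    return [list(aggregated) for _ in worker_params]

workers = [[1.0, 2.0], [3.0, 4.0]]
result = run_batch(workers)
# Every worker now holds the same averaged parameters: [2.0, 3.0]
```

Each push and pull in this loop is a network transfer, which is exactly where RDMA can help.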
Zooming in on this flow, the details look like this.
In the existing implementation, which uses ZeroMQ, there is a user-to-kernel or kernel-to-user memcpy for each send/recv.
If RDMA is applied, these copies are eliminated, at the cost of some additional send/recv processing. But the memory copy latency is large, so RDMA can still reduce the overall overhead.
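The copy-versus-no-copy distinction can be illustrated in Python terms, as a loose analogy only: slicing a buffer materializes a new copy (like the per-send/recv memcpy in the ZeroMQ path), while a `memoryview` references the same underlying memory (like the NIC reading the application buffer directly under RDMA). This is not the real kernel mechanism, just the shape of the idea.

```python
# Loose analogy for copy vs. zero-copy, not the actual kernel mechanism.

buf = bytearray(b"parameter-data")

# "Copy" path: slicing materializes a separate buffer, like the
# user<->kernel memcpy done on every send/recv in the ZeroMQ path.
copied = bytes(buf[:9])

# "Zero-copy" path: a memoryview shares the same underlying memory,
# like RDMA letting the NIC access the application buffer directly.
view = memoryview(buf)[:9]

# Update the original buffer in place (same length, no resize).
buf[0:9] = b"PARAMETER"

# The copy is now stale; the view sees the update because no data moved.
```

After the in-place update, `copied` still holds the old bytes while `view` reflects the new ones, mirroring why eliminating the copy also eliminates its latency.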
The red boxes in this figure indicate the memcpy from the data-update memory area to the communication memory area. Ideally, these copies should also be eliminated, but that is out of scope for this implementation.