SlideShare a Scribd company logo
1 of 12
Copyright©2018 NTT corp. All Rights Reserved.
RDMA programming design and case studies
– for better performance distributed applications
NTT Software Innovation Center
Yoshiro Yamabe
2Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
• Remote Direct Memory Access(RDMA) is low latency and low cpu
overhead, but hard to implement
• In simple micro-benchmark, RDMA’s latency is about one fifth of
IPoIB’s latency (IPoIB : IP over Infiniband)
RDMA features, potential
1
10
100
1000
10000
100000
1000000
8 32 128 512 2K 8K 32K 128K 512K 2M 8M 32M 128M 512M
Latency[us]
Data size[byte]
IPoIB
RDMA
3usec vs 15usec
3Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
• To make use of RDMA potential, I tried to apply RDMA to MXNet, a
distributed deep learning framework
– MXNet adopts Parameter Server mechanism for distributed learning
• RDMA was seemed to be effective for this model
– There are two times of communication between worker and server for each
batch
About case studies
Parameter
Server
worker
2. Push parameter
to Server node
4. Pull parameter from
server node
1. Calculate
Parameter on each
worker node
3. Aggregate parameters
from all worker nodes
worker
worker
4Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
Data flow comparison – existing implementation and RDMA
calc
aggr
worker
server
calc
push pull
Parameter
Server
worker
2. push
4. pull1.
calculation
3.
aggregation
calc
aggr
calc
pullpush
worker
server
Existing(ZeroM
Q)
RDMA(ibverb
s) pack/unpack message
copy
user->kernel copy
send process(short)
recv process
• By applying RDMA, I can reduce 2 memcpies per communication,
total 4 copies per push and pull
Reduced
overhead
5Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
• RDMA mechanisms
– RDMA operations(Operations : WRITE(put), READ(get), etc…)
– RDMA operations’ options(inline send, immediate data, etc…)
– Way to detect communication completions(Polling or MSI-X interrupt)
• Kernel layer mechanisms
– ring buffer for asynchronous communication (≒ socket buffer)
– Separating detect completion thread
RDMA implementation Tunings
Types Tunings STEP1 STEP2
RDMA
mechanisms
RDMA operations 〇(using RDMA
WRITE)
〇(=STEP1)
RDMA operations’
options
〇 〇(=STEP1)
Way to detect completion - 〇
Kernel layer
mechanisms
Ring buffer 〇 〇
Separating detect
completion thread
- 〇
6Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
Result(Step1 vs Step2)
0
10
20
30
40
50
60
70
1 2 3
Timecost[s]
# of worker
STEP1
STEP2
• Step2 implementation is 1.5x faster than Step1 implementation
– “step0 implementation” which doesn’t have ring buffer is many times slower
than step1 implementation
1.5x speed up
7Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
Result(But…)
0
10
20
30
40
50
60
70
1 2 3
Timecost[s]
# of worker
zeromq + IPoIB 
• Existing implementation(zeromq) is fast!! 
– Step2 implementation is faster than it, but a bit…
(There are still other tuning so RDMA can be faster!!)
STEP1
STEP2
8Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
• RDMA is efficient for reducing latency, cpu overhead, but there
are many techniques to achieve high performance
– Selecting optimal options, flags, and redesign functions!
– Porting a part of kernel layer tunings into user layer programs
• As a matter of course, it is need to choose appropriate applications for
RDMA
– CPU centric, network bottleneck, etc…
– In this workload and environment, network load may be not adequate for
confirming RDMA effect with lightweight change
• My Conclusion
– To enjoy RDMA effect easily, libraries providing RDMA design patterns are
needed
– “true zero-copy RDMA implementation” is needed for higher performance
Conclusion
9Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
User
memory
• RDMA can reduce overhead and accelerate applications by these two
features
– zero-copy
• NIC can read/write data to user memory directly
– cpu-bypass
• When using one-sided RDMA, client can send data without involving server’s cpu
The Features of RDMA
User
memory
kernel memory
memcpy
NIC kernel memoryNIC
memcpy
User
memory
User
memory
kernel memory
RDM
ANIC kernel memory
RDM
ANIC
IP
RDMA
10Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
• Various resources for RDMA
– Memory Region(MR)
• RDMA NIC can only read data on MR / write data to MR
– Completion Queue(CQ)
• Communication’s success/failure information(completion) are pushed into CQ, and
program can get it by pop from CQ by two ways
– Polling or completion channel(MSI-X interrupt)
– Queue Pair(QP)
• similar to socket
• By posting request to QP, RDMA NIC execute data transfer referring to it.
– RDMA Operations
• SEND, RDMA WRITE, RDMA READ, ATOMIC
– SEND needs for receiver side to post receive request to QP(two-sided RDMA)
– Others don’t need for receiver side to any operation
(one-sided RDMA, cpu-bypass)
• Service Types
– RC, UC, UD, RD
– RC can detect packet loss and re-send, but slower than other UC and UD
RDMA Programming Features
11Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
• Ring Memory Region
– By using ring MR, sender can issue RDMA independent of receiver status
– It’s similar to IP’s socket buffer in kernel space
• Create additional thread for getting completion from CQ
– Getting completion in send function in application is inefficient because IP’s
send function returns when send data pass to kernel
– Using polling to get work completion because CPU resource waste can be
avoided by pinning thread to core
Various techniques of RDMA programming for this attempt
・・・
A
B
C
D E
RING MR
CQ
polling
Polling Thread
RDMA NIC
generate
completion
write data process data
Receiving
Thread
pass
informatio
n
receiver side
RDMA Operation
12Copyright©2018 NTT corp. All Rights Reserved.
NTT Confidential
Experimental environment
OS Ubuntu16.04
CPU Intel® Xeon® CPU E5-2697 v3 @ 2.60GHz
# of core, node 2 numa nodes and 14core 28thread for each numa node
Memory 256GB
Network Mellanox ConnectX-5, 100Gbps cable
GPU Tesla V100
• Using 1 server nodes and 1-3 worker nodes
• GPU and HCA are on the same numa node(e.g. node1) on all
machines, and using numactl to avoid “remote access”
• Using Cifar-10 and ResNet50 to training
• Environment

More Related Content

What's hot

Linux の hugepage の開発動向
Linux の hugepage の開発動向Linux の hugepage の開発動向
Linux の hugepage の開発動向Naoya Horiguchi
 
PCCC22:インテル株式会社 テーマ1「インテル® Agilex™ FPGA デバイス 最新情報」
PCCC22:インテル株式会社 テーマ1「インテル® Agilex™ FPGA デバイス 最新情報」PCCC22:インテル株式会社 テーマ1「インテル® Agilex™ FPGA デバイス 最新情報」
PCCC22:インテル株式会社 テーマ1「インテル® Agilex™ FPGA デバイス 最新情報」PC Cluster Consortium
 
不揮発メモリ(NVDIMM)とLinuxの対応動向について
不揮発メモリ(NVDIMM)とLinuxの対応動向について不揮発メモリ(NVDIMM)とLinuxの対応動向について
不揮発メモリ(NVDIMM)とLinuxの対応動向についてYasunori Goto
 
CXL_説明_公開用.pdf
CXL_説明_公開用.pdfCXL_説明_公開用.pdf
CXL_説明_公開用.pdfYasunori Goto
 
IIJmio meeting 25 スマートフォンはなぜ「つながらない」のか
IIJmio meeting 25 スマートフォンはなぜ「つながらない」のかIIJmio meeting 25 スマートフォンはなぜ「つながらない」のか
IIJmio meeting 25 スマートフォンはなぜ「つながらない」のかtechlog (Internet Initiative Japan Inc.)
 
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説Takateru Yamagishi
 
Interrupt Affinityについて
Interrupt AffinityについてInterrupt Affinityについて
Interrupt AffinityについてTakuya ASADA
 
FPGA+SoC+Linux実践勉強会資料
FPGA+SoC+Linux実践勉強会資料FPGA+SoC+Linux実践勉強会資料
FPGA+SoC+Linux実践勉強会資料一路 川染
 
EBPF and Linux Networking
EBPF and Linux NetworkingEBPF and Linux Networking
EBPF and Linux NetworkingPLUMgrid
 
AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD
 
NVMCT #1 ~今さら聞けないSSDの基本~
NVMCT #1 ~今さら聞けないSSDの基本~NVMCT #1 ~今さら聞けないSSDの基本~
NVMCT #1 ~今さら聞けないSSDの基本~Fixstars Corporation
 
Shared Memory Centric Computing with CXL & OMI
Shared Memory Centric Computing with CXL & OMIShared Memory Centric Computing with CXL & OMI
Shared Memory Centric Computing with CXL & OMIAllan Cantle
 
データ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラデータ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラNVIDIA Japan
 
Pacemaker + PostgreSQL レプリケーション構成(PG-REX)のフェイルオーバー高速化
Pacemaker + PostgreSQL レプリケーション構成(PG-REX)のフェイルオーバー高速化Pacemaker + PostgreSQL レプリケーション構成(PG-REX)のフェイルオーバー高速化
Pacemaker + PostgreSQL レプリケーション構成(PG-REX)のフェイルオーバー高速化kazuhcurry
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveNetronome
 
閉域網接続の技術入門
閉域網接続の技術入門閉域網接続の技術入門
閉域網接続の技術入門Masayuki Kobayashi
 
10分で分かるLinuxブロックレイヤ
10分で分かるLinuxブロックレイヤ10分で分かるLinuxブロックレイヤ
10分で分かるLinuxブロックレイヤTakashi Hoshino
 

What's hot (20)

Linux の hugepage の開発動向
Linux の hugepage の開発動向Linux の hugepage の開発動向
Linux の hugepage の開発動向
 
PCCC22:インテル株式会社 テーマ1「インテル® Agilex™ FPGA デバイス 最新情報」
PCCC22:インテル株式会社 テーマ1「インテル® Agilex™ FPGA デバイス 最新情報」PCCC22:インテル株式会社 テーマ1「インテル® Agilex™ FPGA デバイス 最新情報」
PCCC22:インテル株式会社 テーマ1「インテル® Agilex™ FPGA デバイス 最新情報」
 
不揮発メモリ(NVDIMM)とLinuxの対応動向について
不揮発メモリ(NVDIMM)とLinuxの対応動向について不揮発メモリ(NVDIMM)とLinuxの対応動向について
不揮発メモリ(NVDIMM)とLinuxの対応動向について
 
CXL_説明_公開用.pdf
CXL_説明_公開用.pdfCXL_説明_公開用.pdf
CXL_説明_公開用.pdf
 
リペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudy
リペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudyリペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudy
リペア時間短縮にむけた取り組み@Yahoo! JAPAN #casstudy
 
IIJmio meeting 25 スマートフォンはなぜ「つながらない」のか
IIJmio meeting 25 スマートフォンはなぜ「つながらない」のかIIJmio meeting 25 スマートフォンはなぜ「つながらない」のか
IIJmio meeting 25 スマートフォンはなぜ「つながらない」のか
 
DPDK
DPDKDPDK
DPDK
 
DPDK In Depth
DPDK In DepthDPDK In Depth
DPDK In Depth
 
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
CUDAのアセンブリ言語基礎のまとめ PTXとSASSの概説
 
Interrupt Affinityについて
Interrupt AffinityについてInterrupt Affinityについて
Interrupt Affinityについて
 
FPGA+SoC+Linux実践勉強会資料
FPGA+SoC+Linux実践勉強会資料FPGA+SoC+Linux実践勉強会資料
FPGA+SoC+Linux実践勉強会資料
 
EBPF and Linux Networking
EBPF and Linux NetworkingEBPF and Linux Networking
EBPF and Linux Networking
 
AMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor ArchitectureAMD EPYC™ Microprocessor Architecture
AMD EPYC™ Microprocessor Architecture
 
NVMCT #1 ~今さら聞けないSSDの基本~
NVMCT #1 ~今さら聞けないSSDの基本~NVMCT #1 ~今さら聞けないSSDの基本~
NVMCT #1 ~今さら聞けないSSDの基本~
 
Shared Memory Centric Computing with CXL & OMI
Shared Memory Centric Computing with CXL & OMIShared Memory Centric Computing with CXL & OMI
Shared Memory Centric Computing with CXL & OMI
 
データ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラデータ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラ
 
Pacemaker + PostgreSQL レプリケーション構成(PG-REX)のフェイルオーバー高速化
Pacemaker + PostgreSQL レプリケーション構成(PG-REX)のフェイルオーバー高速化Pacemaker + PostgreSQL レプリケーション構成(PG-REX)のフェイルオーバー高速化
Pacemaker + PostgreSQL レプリケーション構成(PG-REX)のフェイルオーバー高速化
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep Dive
 
閉域網接続の技術入門
閉域網接続の技術入門閉域網接続の技術入門
閉域網接続の技術入門
 
10分で分かるLinuxブロックレイヤ
10分で分かるLinuxブロックレイヤ10分で分かるLinuxブロックレイヤ
10分で分かるLinuxブロックレイヤ
 

Similar to RDMA programming design and case studies – for better performance distributed applications

Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus SDN/OpenFlow switch
 
Introduction to netlink in linux kernel (english)
Introduction to netlink in linux kernel (english)Introduction to netlink in linux kernel (english)
Introduction to netlink in linux kernel (english)Sneeker Yeh
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
 
Software Stacks to enable SDN and NFV
Software Stacks to enable SDN and NFVSoftware Stacks to enable SDN and NFV
Software Stacks to enable SDN and NFVYoshihiro Nakajima
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchJim St. Leger
 
Approaching hyperconvergedopenstack
Approaching hyperconvergedopenstackApproaching hyperconvergedopenstack
Approaching hyperconvergedopenstackIkuo Kumagai
 
Dccp evaluation for sip signaling ict4 m
Dccp evaluation for sip signaling   ict4 m Dccp evaluation for sip signaling   ict4 m
Dccp evaluation for sip signaling ict4 m Agus Awaludin
 
Understanding performance aspects of etcd and Raft
Understanding performance aspects of etcd and RaftUnderstanding performance aspects of etcd and Raft
Understanding performance aspects of etcd and RaftHitoshi Mitake
 
OpenStack Summit Tokyo - Know-how of Challlenging Deploy/Operation NTT DOCOMO...
OpenStack Summit Tokyo - Know-how of Challlenging Deploy/Operation NTT DOCOMO...OpenStack Summit Tokyo - Know-how of Challlenging Deploy/Operation NTT DOCOMO...
OpenStack Summit Tokyo - Know-how of Challlenging Deploy/Operation NTT DOCOMO...Masaaki Nakagawa
 
Summer training embedded system and its scope
Summer training  embedded system and its scopeSummer training  embedded system and its scope
Summer training embedded system and its scopeArshit Rai
 
NTTs Journey with Openstack-final
NTTs Journey with Openstack-finalNTTs Journey with Openstack-final
NTTs Journey with Openstack-finalshintaro mizuno
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Ontico
 
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...peknap
 
Summer training embedded system and its scope
Summer training  embedded system and its scopeSummer training  embedded system and its scope
Summer training embedded system and its scopeArshit Rai
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 

Similar to RDMA programming design and case studies – for better performance distributed applications (20)

Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics WorkshopLagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
Lagopus presentation on 14th Annual ON*VECTOR International Photonics Workshop
 
slides
slidesslides
slides
 
Introduction to netlink in linux kernel (english)
Introduction to netlink in linux kernel (english)Introduction to netlink in linux kernel (english)
Introduction to netlink in linux kernel (english)
 
DSP Processor
DSP Processor DSP Processor
DSP Processor
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
Software Stacks to enable SDN and NFV
Software Stacks to enable SDN and NFVSoftware Stacks to enable SDN and NFV
Software Stacks to enable SDN and NFV
 
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitchDPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
DPDK Summit - 08 Sept 2014 - NTT - High Performance vSwitch
 
Approaching hyperconvergedopenstack
Approaching hyperconvergedopenstackApproaching hyperconvergedopenstack
Approaching hyperconvergedopenstack
 
Dccp evaluation for sip signaling ict4 m
Dccp evaluation for sip signaling   ict4 m Dccp evaluation for sip signaling   ict4 m
Dccp evaluation for sip signaling ict4 m
 
Understanding performance aspects of etcd and Raft
Understanding performance aspects of etcd and RaftUnderstanding performance aspects of etcd and Raft
Understanding performance aspects of etcd and Raft
 
OpenStack Summit Tokyo - Know-how of Challlenging Deploy/Operation NTT DOCOMO...
OpenStack Summit Tokyo - Know-how of Challlenging Deploy/Operation NTT DOCOMO...OpenStack Summit Tokyo - Know-how of Challlenging Deploy/Operation NTT DOCOMO...
OpenStack Summit Tokyo - Know-how of Challlenging Deploy/Operation NTT DOCOMO...
 
Summer training embedded system and its scope
Summer training  embedded system and its scopeSummer training  embedded system and its scope
Summer training embedded system and its scope
 
Real Time Image Processing
Real Time Image ProcessingReal Time Image Processing
Real Time Image Processing
 
Rust at Ather
Rust at AtherRust at Ather
Rust at Ather
 
NTTs Journey with Openstack-final
NTTs Journey with Openstack-finalNTTs Journey with Openstack-final
NTTs Journey with Openstack-final
 
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro)
 
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
 
Embedded systems
Embedded systemsEmbedded systems
Embedded systems
 
Summer training embedded system and its scope
Summer training  embedded system and its scopeSummer training  embedded system and its scope
Summer training embedded system and its scope
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 

More from NTT Software Innovation Center

A Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between BusinessesA Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between BusinessesNTT Software Innovation Center
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤NTT Software Innovation Center
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤NTT Software Innovation Center
 
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...NTT Software Innovation Center
 
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU ComputingNTT Software Innovation Center
 
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”NTT Software Innovation Center
 
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~NTT Software Innovation Center
 
Network Implosion: Effective Model Compression for ResNets via Static Layer P...
Network Implosion: Effective Model Compression for ResNets via Static Layer P...Network Implosion: Effective Model Compression for ResNets via Static Layer P...
Network Implosion: Effective Model Compression for ResNets via Static Layer P...NTT Software Innovation Center
 
Why and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategyWhy and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategyNTT Software Innovation Center
 
外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法NTT Software Innovation Center
 
デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題NTT Software Innovation Center
 
Building images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKitBuilding images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKitNTT Software Innovation Center
 
Real-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility servicesReal-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility servicesNTT Software Innovation Center
 
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組NTT Software Innovation Center
 
統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組NTT Software Innovation Center
 
OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向NTT Software Innovation Center
 

More from NTT Software Innovation Center (20)

A Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between BusinessesA Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between Businesses
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
 
不揮発WALバッファ
不揮発WALバッファ不揮発WALバッファ
不揮発WALバッファ
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
 
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
 
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
 
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
 
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
 
Network Implosion: Effective Model Compression for ResNets via Static Layer P...
Network Implosion: Effective Model Compression for ResNets via Static Layer P...Network Implosion: Effective Model Compression for ResNets via Static Layer P...
Network Implosion: Effective Model Compression for ResNets via Static Layer P...
 
Why and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategyWhy and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategy
 
外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法
 
デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題
 
Building images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKitBuilding images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKit
 
Real-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility servicesReal-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility services
 
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
 
統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組
 
MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針
 
OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向
 

Recently uploaded

Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 

Recently uploaded (20)

Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 

RDMA programming design and case studies – for better performance distributed applications

  • 1. Copyright©2018 NTT corp. All Rights Reserved. RDMA programming design and case studies – for better performance distributed applications NTT Software Innovation Center Yoshiro Yamabe
  • 2. 2Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential • Remote Direct Memory Access(RDMA) is low latency and low cpu overhead, but hard to implement • In simple micro-benchmark, RDMA’s latency is about one fifth of IPoIB’s latency (IPoIB : IP over Infiniband) RDMA features, potential 1 10 100 1000 10000 100000 1000000 8 32 128 512 2K 8K 32K 128K 512K 2M 8M 32M 128M 512M Latency[us] Data size[byte] IPoIB RDMA 3usec vs 15usec
  • 3. 3Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential • To make use of RDMA potential, I tried to apply RDMA to MXNet, a distributed deep learning framework – MXNet adopts Parameter Server mechanism for distributed learning • RDMA was seemed to be effective for this model – There are two times of communication between worker and server for each batch About case studies Parameter Server worker 2. Push parameter to Server node 4. Pull parameter from server node 1. Calculate Parameter on each worker node 3. Aggregate parameters from all worker nodes worker worker
  • 4. 4Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential Data flow comparison – existing implementation and RDMA calc aggr worker server calc push pull Parameter Server worker 2. push 4. pull1. calculation 3. aggregation calc aggr calc pullpush worker server Existing(ZeroM Q) RDMA(ibverb s) pack/unpack message copy user->kernel copy send process(short) recv process • By applying RDMA, I can reduce 2 memcpies per communication, total 4 copies per push and pull Reduced overhead
  • 5. 5Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential • RDMA mechanisms – RDMA operations(Operations : WRITE(put), READ(get), etc…) – RDMA operations’ options(inline send, immediate data, etc…) – Way to detect communication completions(Polling or MSI-X interrupt) • Kernel layer mechanisms – ring buffer for asynchronous communication (≒ socket buffer) – Separating detect completion thread RDMA implementation Tunings Types Tunings STEP1 STEP2 RDMA mechanisms RDMA operations 〇(using RDMA WRITE) 〇(=STEP1) RDMA operations’ options 〇 〇(=STEP1) Way to detect completion - 〇 Kernel layer mechanisms Ring buffer 〇 〇 Separating detect completion thread - 〇
  • 6. 6Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential Result(Step1 vs Step2) 0 10 20 30 40 50 60 70 1 2 3 Timecost[s] # of worker STEP1 STEP2 • Step2 implementation is 1.5x faster than Step1 implementation – “step0 implementation” which doesn’t have ring buffer is many times slower than step1 implementation 1.5x speed up
  • 7. 7Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential Result(But…) 0 10 20 30 40 50 60 70 1 2 3 Timecost[s] # of worker zeromq + IPoIB  • Existing implementation(zeromq) is fast!!  – Step2 implementation is faster than it, but a bit… (There are still other tuning so RDMA can be faster!!) STEP1 STEP2
  • 8. 8Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential • RDMA is efficient for reducing latency, cpu overhead, but there are many techniques to achieve high performance – Selecting optimal options, flags, and redesign functions! – Porting a part of kernel layer tunings into user layer programs • As a matter of course, it is need to choose appropriate applications for RDMA – CPU centric, network bottleneck, etc… – In this workload and environment, network load may be not adequate for confirming RDMA effect with lightweight change • My Conclusion – To enjoy RDMA effect easily, libraries providing RDMA design patterns are needed – “true zero-copy RDMA implementation” is needed for higher performance Conclusion
  • 9. 9Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential User memory • RDMA can reduce overhead and accelerate applications by these two features – zero-copy • NIC can read/write data to user memory directly – cpu-bypass • When using one-sided RDMA, client can send data without involving server’s cpu The Features of RDMA User memory kernel memory memcpy NIC kernel memoryNIC memcpy User memory User memory kernel memory RDM ANIC kernel memory RDM ANIC IP RDMA
  • 10. 10Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential • Various resources for RDMA – Memory Region(MR) • RDMA NIC can only read data on MR / write data to MR – Completion Queue(CQ) • Communication’s success/failure information(completion) are pushed into CQ, and program can get it by pop from CQ by two ways – Polling or completion channel(MSI-X interrupt) – Queue Pair(QP) • similar to socket • By posting request to QP, RDMA NIC execute data transfer referring to it. – RDMA Operations • SEND, RDMA WRITE, RDMA READ, ATOMIC – SEND needs for receiver side to post receive request to QP(two-sided RDMA) – Others don’t need for receiver side to any operation (one-sided RDMA, cpu-bypass) • Service Types – RC, UC, UD, RD – RC can detect packet loss and re-send, but slower than other UC and UD RDMA Programming Features
  • 11. 11Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential • Ring Memory Region – By using ring MR, sender can issue RDMA independent of receiver status – It’s similar to IP’s socket buffer in kernel space • Create additional thread for getting completion from CQ – Getting completion in send function in application is inefficient because IP’s send function returns when send data pass to kernel – Using polling to get work completion because CPU resource waste can be avoided by pinning thread to core Various techniques of RDMA programming for this attempt ・・・ A B C D E RING MR CQ polling Polling Thread RDMA NIC generate completion write data process data Receiving Thread pass informatio n receiver side RDMA Operation
  • 12. 12Copyright©2018 NTT corp. All Rights Reserved. NTT Confidential Experimental environment OS Ubuntu16.04 CPU Intel® Xeon® CPU E5-2697 v3 @ 2.60GHz # of core, node 2 numa nodes and 14core 28thread for each numa node Memory 256GB Network Mellanox ConnectX-5, 100Gbps cable GPU Tesla V100 • Using 1 server nodes and 1-3 worker nodes • GPU and HCA are on the same numa node(e.g. node1) on all machines, and using numactl to avoid “remote access” • Using Cifar-10 and ResNet50 to training • Environment

Editor's Notes

  1. Hello, I’m Yoshiro Yamabe, I’m working NTT Software Innovation Center. Now I tackle about Remote Direct Memory Access, RDMA technologies. Today, I’ll talk about my experience, as “RDMA programming design and case studies, for better performance distributed applications”.
  2. At first, I talk about RDMA protocol’s merit, demerit and potential. RDMA provides low latency and low cpu overhead, because of two major features, zero-copy and cpu-bypass. So the potential of RDMA is very attractive, for example, in simple ping-pong program, RDMA’s latency is one fifth of IPoIB’s latency. However, it is difficult to implement than existing TCP/IP programming. For example, to establish RDMA connection, it is need to exchange various information by using IP. And there are many other complications, so this benchmark’s IPoIB program is about three hundred lines with comments and spaces, but RDMA program is about 800 lines.
  3. I thought distributed deep learning is compatible for RDMA, so I chose MXNet as a distributed deep learning framework and tried to apply RDMA to it. MXNet adopts parameter server mechanisms for distributed learning. Parameter server consists of two types of nodes, one is worker node which computes parameter, and the other is server node which gathers parameter from worker nodes, and aggregates them, and finally, scatter parameter to all worker nodes. So the processing flow of each batch is, at first, calculate parameter on workers, second, each worker push parameter to server, third, server aggregates them, last, server push parameters to each workers. Then next batch will start.
  4. Zoom up this flow, the detail is like this. In existing implementation, zeromq, there are user to kernel or kernel to user memcpy for each send/recv. If RDMA is applied, these copies are reduced instead of increasing additional send/recv processes. But memory copy latency is large, so RDMA can reduces overhead. Red box in this figure indicate memcpy from data update memory area to communication memory area, and ideally, this types of copies should be reduced, But it is out of scope in this implementation.