LA-UR 10-05188
Implementation & Comparison of RDMA Over Ethernet
Students: Lee Gaiser, Brian Kraus, and James Wernicke
Mentors: Andree Jacobson, Susan Coulter, Jharrod LaFon, and Ben McClelland
Summary
Background, Objective, Testing Environment, Methodology, Results, Conclusion, Further Work, Challenges, Lessons Learned, Acknowledgments, References & Links, Questions
Background: Remote Direct Memory Access (RDMA)
RDMA provides high-throughput, low-latency networking:
- Reduce consumption of CPU cycles
- Reduce communication latency
Images courtesy of http://www.hpcwire.com/features/17888274.html
Background: InfiniBand
InfiniBand is a switched-fabric communication link designed for HPC:
- High throughput
- Low latency
- Quality of service
- Failover
- Scalability
- Reliable transport
How do we interface this high-performance link with existing Ethernet infrastructure?
Background: RDMA over Converged Ethernet (RoCE)
- Provides InfiniBand-like performance and efficiency on ubiquitous Ethernet infrastructure.
- Uses the same transport and network layers as the IB stack and swaps the link layer for Ethernet.
- Implements IB verbs over Ethernet.
- Not quite IB strength, but it's getting close.
- As of OFED 1.5.1, code written for OFED RDMA automatically works with RoCE (a minimal sketch follows).
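Because RoCE keeps the IB verbs API intact, an application needs no separate code path for IB and RoCE; at most it queries the port attributes to see which link layer it landed on. The snippet below is a minimal sketch of that idea (our own illustration, not part of the original slides), assuming libibverbs headers recent enough to expose the link_layer field added alongside RoCE support:

```c
/* Illustrative sketch: enumerate RDMA devices and report whether each
 * device's first port is native InfiniBand or RoCE (Ethernet link layer).
 * Compile with: gcc probe_roce.c -o probe_roce -libverbs             */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) { perror("ibv_get_device_list"); return 1; }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx) continue;

        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0) {
            /* IBV_LINK_LAYER_ETHERNET indicates a RoCE port;
             * IBV_LINK_LAYER_INFINIBAND indicates native IB.    */
            printf("%s: link layer = %s\n",
                   ibv_get_device_name(devs[i]),
                   port.link_layer == IBV_LINK_LAYER_ETHERNET
                       ? "Ethernet (RoCE)" : "InfiniBand");
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```

All of the verbs resources an RDMA program actually uses (protection domains, queue pairs, memory registration) are created through the same calls on either fabric, which is what the OFED 1.5.1 claim above amounts to.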
Objective
We would like to answer the following questions:
- What kind of performance can we get out of RoCE on our cluster?
- Can we implement RoCE in software (Soft RoCE), and how does it compare with hardware RoCE?
Testing Environment
Hardware:
- HP ProLiant DL160 G6 servers
- Mellanox MNPH29B-XTC 10GbE adapters
- 50/125 OFNR cabling
Operating System:
- CentOS 5.3 with a 2.6.32.16 kernel
Software/Drivers:
- OpenFabrics Enterprise Distribution (OFED) 1.5.2-rc2 (RoCE) & 1.5.1-rxe (Soft RoCE)
- OSU Micro-Benchmarks (OMB) 3.1.1
- OpenMPI 1.4.2
Methodology
- Set up a pair of nodes for each technology: IB, RoCE, Soft RoCE, and no RDMA.
- Install, configure & run minimal services on the test nodes to maximize machine performance.
- Directly connect the nodes to maximize network performance.
- Acquire latency benchmarks: OSU MPI Latency Test.
- Acquire bandwidth benchmarks: OSU MPI Uni-Directional Bandwidth Test and OSU MPI Bi-Directional Bandwidth Test.
- Script it all to perform many repetitions (a minimal sketch of the latency measurement follows).
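The OSU latency test is essentially an MPI ping-pong between two ranks, so a stripped-down version of the measurement fits in a few dozen lines. The sketch below is our own illustration (not the OSU code): rank 0 and rank 1 bounce a buffer back and forth, and the average one-way latency is reported for each message size.

```c
/* Minimal MPI ping-pong latency sketch, loosely modeled on what
 * osu_latency measures.  Illustration only, not the OSU benchmark. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int size = 1; size <= 4 * 1024 * 1024; size *= 2) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {        /* ping */
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) { /* pong */
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)              /* half the round trip = one-way latency */
            printf("%8d bytes: %10.2f us one-way\n",
                   size, (t1 - t0) * 1e6 / (2.0 * ITERS));
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```

Built with mpicc and launched with mpirun -np 2 across the two directly connected nodes, the same binary runs unmodified over IB, RoCE, Soft RoCE, or plain TCP; only the MPI transport selection changes, which is what makes the four configurations directly comparable.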
Results: Latency
Results: Uni-directional Bandwidth
Results: Bi-directional Bandwidth
Results: Analysis
RoCE vs. no RDMA:
- Up to 5.7x speedup in latency
- Up to 3.7x increase in bandwidth
IB QDR vs. RoCE:
- IB is less than 1 µs faster than RoCE at a 128-byte message.
- IB peak bandwidth is 2-2.5x greater than RoCE.
Further Work & Questions
- How does RoCE perform over collectives?
- Can we further optimize the RoCE configuration to yield better performance?
- Can we stabilize the Soft RoCE configuration?
- How much does Soft RoCE affect the compute nodes' ability to perform?
- How does RoCE compare with iWARP?
Challenges
- Finding an OS that works with OFED & RDMA: Fedora 13 was too new, Ubuntu 10 wasn't supported, and CentOS 5.5 was missing some drivers.
- Had to compile a new kernel with IB/RoCE support.
- Built OpenMPI 1.4.2 from source, but it wasn't configured for RDMA; used the OpenMPI 1.4.1 supplied with OFED instead.
- The machines communicating via Soft RoCE frequently lock up during the OSU bandwidth tests.

Lessons Learned
- Installing and configuring HPC clusters
- Building, installing, and fixing the Linux kernel, modules, and drivers
- Working with IB, 10GbE, and RDMA technologies
- Using tools such as OMB 3.1.1 and netperf for benchmarking performance
Acknowledgments
Andree Jacobson, Susan Coulter, Jharrod LaFon, Ben McClelland, Sam Gutierrez, Bob Pearson
References & Links
- Subramoni, H. et al. RDMA over Ethernet – A Preliminary Study. OSU. http://nowlab.cse.ohio-state.edu/publications/conf-presentations/2009/subramoni-hpidc09.pdf
- Feldman, M. RoCE: An Ethernet-InfiniBand Love Story. HPCWire.com. April 22, 2010. http://www.hpcwire.com/blogs/RoCE-An-Ethernet-InfiniBand-Love-Story-91866499.html
- Woodruff, R. Access to InfiniBand from Linux. Intel. October 29, 2009. http://software.intel.com/en-us/articles/access-to-infiniband-from-linux/
- OFED 1.5.2-rc2: http://www.openfabrics.org/downloads/OFED/ofed-1.5.2/OFED-1.5.2-rc2.tgz
- OFED 1.5.1-rxe: http://www.systemfabricworks.com/pub/OFED-1.5.1-rxe.tgz
- OMB 3.1.1: http://mvapich.cse.ohio-state.edu/benchmarks/OMB-3.1.1.tgz

Editor's Notes

  1. Introduce yourself, the institute, your teammates, and what you’ve been working on for the past two months.
  2. Rephrase these bullet points with a little more elaboration.
  3. Emphasize how RDMA eliminates unnecessary communication.
  4. Explain that we are using IB QDR and what that means.
  5. Emphasize that the biggest advantage of RoCE is latency, not necessarily bandwidth. Talk about 40Gb & 100Gb Ethernet on the horizon.
  6. OSU benchmarks were more appropriate than netperf
  7. The highlight here is that latency between IB & RoCE differs by only 1.7 µs at 128-byte messages. It continues to be very close up through 4K messages. Also notice that the RoCE and no-RDMA latencies converge at larger message sizes.
  8. Note that RoCE & IB are very close up to 1K message size. IB QDR peaks out at about 3 GB/s, RoCE at about 1.2 GB/s.
  9. Note that the bandwidth trends are similar to the uni-directional bandwidths. Explain that Soft RoCE could not complete this test. IB QDR peaks at about 5.5 GB/s and RoCE peaks at about 2.3 GB/s.