To Infiniband and Beyond

Harvard HPC Seminar Series

Teresa Kaltz, PhD, High Performance Technical Computing, FAS, Harvard

Due to the wide availability and low cost of high speed networking, commodity clusters have become the de facto standard for building high performance parallel computing systems. This talk will introduce Infiniband, the leading high speed interconnect technology, and compare its deployment and performance with Ethernet. In addition, some emerging interconnect technologies and trends in cluster networking will be discussed.


To Infiniband and Beyond

  1. To Infiniband and Beyond: High Speed Interconnects in Commodity HPC Clusters
     Teresa Kaltz, PhD, Research Computing
     December 3, 2009
  2. Interconnect Types on Top 500
     "On the latest TOP500 list, there is exactly one 10 GigE deployment, compared to 181 InfiniBand-connected systems."
     Michael Feldman, HPCwire Editor
  3. Top 500 Interconnects 2002-2009
     [Chart: number of Top 500 systems by interconnect family (Ethernet, Infiniband, Other), 2002-2009]
  4. What is Infiniband Anyway?
     •  Open, standard interconnect architecture
        –  http://www.infinibandta.org/index.php
        –  Complete specification available for download
     •  Complete "ecosystem"
        –  Both hardware and software
     •  High bandwidth, low latency, switch-based
     •  Allows remote direct memory access (RDMA)
  5. Why Remote DMA?
     •  TCP offload engines reduce overhead via offloading protocol processing like checksum
     •  2 copies on receive: NIC -> kernel -> user
     •  Solution is Remote DMA (RDMA)

                                     Percent Overhead
        Per byte
          User-system copy           16.5 %
          TCP Checksum               15.2 %
          Network-memory copy        31.8 %
        Per packet
          Driver                      8.2 %
          TCP+IP+ARP protocols        8.2 %
          OS overhead                19.8 %
  6. What is RDMA?
  7. Infiniband Signalling Rate
     •  Each link is a point to point serial connection
     •  Usually aggregated into groups of four
     •  Unidirectional effective bandwidth
        –  SDR 4X: 1 GB/s
        –  DDR 4X: 2 GB/s
        –  QDR 4X: 4 GB/s
     •  Bidirectional bandwidth twice unidirectional
     •  Many factors impact measured performance!
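     These effective rates follow from the standard per-lane signalling rates (2.5 Gb/s SDR, 5 Gb/s DDR, 10 Gb/s QDR), the 4X lane aggregation, and 8b/10b encoding, which leaves 80% of the raw rate for data. The arithmetic below is a quick check added here for clarity, not part of the slide:

        QDR 4X: 4 lanes x 10 Gb/s  x 0.8 = 32 Gb/s = 4 GB/s unidirectional
        DDR 4X: 4 lanes x  5 Gb/s  x 0.8 = 16 Gb/s = 2 GB/s
        SDR 4X: 4 lanes x 2.5 Gb/s x 0.8 =  8 Gb/s = 1 GB/s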
  8. Infiniband Roadmap from IBTA
  9. DDR 4X Unidirectional Bandwidth
     •  Achieved bandwidth limited by PCIe 8x Gen 1
     •  Current platforms mostly ship with PCIe Gen 2
  10. QDR 4X Unidirectional Bandwidth
      •  Still seem to have bottleneck at host if using QDR
      http://mvapich.cse.ohio-state.edu/performance/interNode.shtml
  11. Latency Measurements: IB vs GbE
  12. Infiniband Latency Measurements
  13. Infiniband Silicon Vendors
      •  Both switch and HCA parts
         –  Mellanox: Infiniscale, Infinihost
         –  Qlogic: Truescale, Infinipath
      •  Many OEMs use their silicon
      •  Large switches
         –  Parts arranged in fat tree topology
  14. Infiniband Switch Hardware
      •  24 port silicon product line (pictured at right: 24, 48, 96, 144 and 288 port models)
      •  Scales to thousands of ports
      •  Host-based and hardware-based subnet management
      •  Current generation (QDR) based on 36 port silicon
      •  Up to 864 ports in single switch!!
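      These port counts are consistent with the fat tree arrangement mentioned on slide 13. As a rough check (this arithmetic is an addition, not on the slide): a two-level fat tree of 24 port silicon yields 24 leaf chips x 12 host ports = 288 ports, and the QDR generation's 36 port silicon yields 48 leaf chips x 18 host ports = 864 ports, with the other half of each leaf's ports cabled internally to the spine stage.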
  15. Infiniband Topology
      •  Infiniband uses credit-based flow control
         –  Need to avoid loops in topology that may produce deadlock
      •  Common topology for small and medium size networks is tree (CLOS)
      •  Mesh/torus more cost effective for large clusters (>2500 hosts)
  16. Infiniband Routing
      •  Infiniband is statically routed
      •  Subnet management software discovers fabric and generates set of routing tables
         –  Most subnet managers support multiple routing algorithms
      •  Tables updated with changes in topology only
      •  Often cannot achieve theoretical bisection bandwidth with static routing
      •  QDR silicon introduces adaptive routing
  17. HPCC Random Ring Benchmark
      [Chart: average bandwidth (MB/s) vs. number of enclosures for four routing configurations, "Routing 1" through "Routing 4"]
  18. Infiniband Specification for Software
      •  IB specification does not define API
      •  Actions are known as "verbs"
         –  Services provided to upper layer protocols
         –  Send verb, receive verb, etc
      •  Community has standardized around open source distribution called OFED to provide verbs
      •  Some Infiniband software is also available from vendors
         –  Subnet management
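      To make the "verbs" layer concrete, the short C program below is an illustrative sketch added here, not from the talk; it assumes a host with libibverbs from OFED installed and is built with "-libverbs". It simply opens the first HCA and queries its device and port attributes. A real RDMA transfer would additionally create a protection domain, registered memory regions, completion queues and a queue pair, which is what the MPI libraries on the next slide handle internally.

        /* Minimal libibverbs sketch: open the first HCA and query it.   */
        /* Illustrative editorial example, not part of the slides.       */
        #include <stdio.h>
        #include <infiniband/verbs.h>

        int main(void)
        {
            int num_devices = 0;
            struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
            if (!dev_list || num_devices == 0) {
                fprintf(stderr, "no Infiniband devices found\n");
                return 1;
            }

            struct ibv_context *ctx = ibv_open_device(dev_list[0]);
            if (!ctx) {
                fprintf(stderr, "cannot open %s\n", ibv_get_device_name(dev_list[0]));
                return 1;
            }

            struct ibv_device_attr dev_attr;
            struct ibv_port_attr port_attr;
            ibv_query_device(ctx, &dev_attr);
            ibv_query_port(ctx, 1, &port_attr);      /* HCA ports are numbered from 1 */

            printf("device %s: %d physical port(s), max %d QPs, port 1 state %d\n",
                   ibv_get_device_name(dev_list[0]), dev_attr.phys_port_cnt,
                   dev_attr.max_qp, (int) port_attr.state);

            ibv_close_device(ctx);
            ibv_free_device_list(dev_list);
            return 0;
        }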
  19. Application Support of Infiniband
      •  All MPI implementations support native IB
         –  OpenMPI, MVAPICH, Intel MPI
      •  Existing socket applications
         –  IP over IB
         –  Sockets direct protocol (SDP)
      •  Does NOT require re-link of application
      •  Oracle uses RDS (reliable datagram sockets)
         –  First available in Oracle 10g R2
      •  Developer can program to "verbs" layer
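      Because the MPI libraries listed above already speak native IB, a plain MPI ping-pong is the simplest way to reproduce latency numbers like those on slides 11 and 12. The sketch below is an illustrative addition (it assumes any IB-aware MPI such as MVAPICH or OpenMPI); build it with mpicc and launch two ranks on two IB-connected hosts.

        /* MPI ping-pong between ranks 0 and 1; illustrative example, not from the slides. */
        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char **argv)
        {
            int rank, i, iters = 1000, nbytes = 8;
            static char buf[1 << 20];            /* room for larger message sizes as well */
            double t0, t1;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
            for (i = 0; i < iters; i++) {
                if (rank == 0) {
                    MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            t1 = MPI_Wtime();

            if (rank == 0)
                printf("%d-byte half round-trip latency: %.2f us\n",
                       nbytes, (t1 - t0) / iters / 2.0 * 1e6);

            MPI_Finalize();
            return 0;
        }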
  20. Infiniband Software Layers
  21. OFED Software
      •  Openfabrics Enterprise Distribution software from Openfabrics Alliance
         –  http://www.openfabrics.org/
      •  Contains everything needed to run Infiniband
         –  HCA drivers
         –  verbs implementation
         –  subnet management
         –  diagnostic tools
      •  Versions qualified together
  22. Openfabrics Software Components
  23. 23. "High Performance" Ethernet •  1 GbE cheap and ubiquitous –  hardware acceleration –  multiple multiport NIC's –  supported in kernel •  10 GbE still used primarily as uplinks from edge switches and as backbone •  Some vendors providing 10 GbE to server –  low cost NIC on motherboard –  HCA's with performance proportional to cost 23
  24. RDMA over Ethernet
      •  NIC capable of RDMA is called RNIC
      •  RDMA is primary method of reducing latency on host side
      •  Multiple vendors have RNICs
         –  Mainstream: Broadcom, Intel, etc.
         –  Boutique: Chelsio, Mellanox, etc.
      •  New Ethernet standards
         –  "Data Center Bridging"; "Converged Enhanced Ethernet"; "Data Center Ethernet"; etc.
  25. What is iWARP?
      •  RDMA consortium (RDMAC) standardized some protocols which are now part of the IETF Remote Direct Data Placement (RDDP) working group
         –  http://www.rdmaconsortium.org/home
      •  Also defined SRP, iSER in addition to verbs
      •  iWARP supported in OFED
      •  Most specification work complete in ~2003
  26. RDMA over Ethernet?
      "The name ‘RoCEE’ (RDMA over Converged Enhanced Ethernet) is a working name. You might hear me say RoXE, RoE, RDMAoE, IBXoE, IBXE or any other of a host of equally obscure names."
      Tom Talpey, Microsoft Corporation; Paul Grun, System Fabric Works (August 2009)
  27. The Future: InfiniFibreNet
      •  Vendors moving towards "converged fabrics"
      •  Using same "fabric" for both networking and storage
      •  Storage protocols and IB over Ethernet
      •  Storage protocols over Infiniband
         –  NFS over RDMA, Lustre
      •  Gateway switches and converged adapters
         –  Various combinations of Ethernet, IB and FC
  28. Any Questions?
      THANK YOU! (And no mention of The Cloud)