Trends in Data Center Interconnects, with a Focus on InfiniBand

The title overpromises in several ways, but this is what came out of cramming InfiniBand at the last minute.

Transcript

  • 1. Trends in Data Center Interconnects, with a Focus on InfiniBand (2010-11-22)
  • 2. Outline • TOP500 • InfiniBand • Data Center Ethernet • I/F
  • 3. InfiniBand • 1990s: SAN (System Area Network) – PC/WS, CPU – e.g., Myrinet, QSNet, Fibre Channel • InfiniBand Trade Association – 1999: Compaq, Dell, IBM, Intel, Microsoft, Sun • 2000/10: Version 1 • 2008/6: Version 1.2.1 – Volume 1 – Volume 2 – “to design a scalable and high performance communication and I/O architecture by taking an integrated view of computing, networking and storage technologies.” • HPC, MPI
  • 4. InfiniBand • RDMA (Remote Direct Memory Access) • 10 to 40 Gbps, 120 Gbps • QoS • 16, 1 • CPU, OS • 48K, 2^128
  • 5. Outline • TOP500 • InfiniBand • Data Center Ethernet • I/F
  • 6. IT • e.g., Edge Virtual Bridging, Port Profile Migration
  • 7. HP POD @ SC2010
  • 8. Data Center Network Convergence • Ethernet, InfiniBand, Myrinet – Fibre Channel – Ethernet • InfiniBand • IEEE Data Center Bridging (DCB) – Data Center Ethernet (DCE), Converged Enhanced Ethernet (CEE)
  • 9. DC network (figure): Internet, core switch, aggregate switch, access (ToR) switch, rack
  • 10. DC comparison (Arista) • 80%, 80% • DB, MapReduce • 64:1, 200:1 • 75 to 150 us, 5 to 10 us • < 1 Tbps, 10 Tbps • < 20, wire speed • 10G port: 100 W, 10 W
  • 11. TOP500 (chart): InfiniBand and Ethernet • 2005: InfiniBand
  • 12. TOP500, 2010/11 (two pie charts; categories: Ethernet, InfiniBand, Proprietary, Custom, Other) • Ethernet 45%, InfiniBand 43%, remaining slices 4%, 2%, 6% • Ethernet 22%, InfiniBand 46%, remaining slices 2%, 10%, 20%
  • 13. LINPACK efficiency, 2010/11 (plot: efficiency (%) = Rmax / Rpeak against TOP500 rank, 0 to 500) • InfiniBand: 80% • 10 Gigabit Ethernet: 74% • Gigabit Ethernet: 54%
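The efficiency figures on this slide are the ratio of measured LINPACK performance (Rmax) to theoretical peak (Rpeak). The following is a minimal sketch of that arithmetic; the function name and the sample values are made up for illustration and are not taken from the TOP500 list.

```python
def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Return LINPACK efficiency in percent, i.e. Rmax / Rpeak * 100."""
    if rpeak_tflops <= 0:
        raise ValueError("Rpeak must be positive")
    return 100.0 * rmax_tflops / rpeak_tflops


# Hypothetical system: 800 TFLOPS measured against a 1000 TFLOPS peak.
print(linpack_efficiency(800.0, 1000.0))  # 80.0
```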
  • 14. Outline • TOP500 • InfiniBand • Data Center Ethernet • I/F
  • 15. InfiniBand vs. Ethernet (figure) • InfiniBand: RDMA between application buffers through the HCAs, with OS bypass (zero copy) • TCP/IP over Ethernet: data is copied between application buffers and OS buffers on its way through the NICs • RDMA: CPU
  • 16. IB architecture (figure): IB subnet with compute node (CPU, Mem, host system bus, HCA), storage node (TCA, target controller), switch, IB links, and routers to IB or another network • HCA: Host Channel Adapter • TCA: Target Channel Adapter
  • 17. IB • TCP/IP • RDMA • QP
  • 18. IB • QP: Queue Pair • WQE: Work Queue Element • CQE: Completion Queue Element
  • 19. Link speed • Per lane – Single Data Rate (SDR): 2.5 Gbps – Double Data Rate (DDR): 5.0 Gbps – Quad Data Rate (QDR): 10.0 Gbps • Link widths: 4x, 12x – HCA-HCA: 40 Gbps (QDR 4x) – 120 Gbps (QDR 12x) • 8b/10b encoding – 40 x 8/10 = 32 Gbps (QDR 4x)
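The bandwidth arithmetic on this slide (per-lane rate times link width, then the 8b/10b coding overhead) can be sketched as follows. The table and function names are illustrative assumptions; only the rates and the 8/10 factor come from the slide.

```python
# Signaling rate per 1x lane in Gbps, as listed on the slide.
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}


def link_rates(speed: str, width: int) -> tuple[float, float]:
    """Return (signaling rate, data rate) in Gbps for a link.

    The data rate applies the 8b/10b encoding factor: every 10 bits
    on the wire carry 8 bits of data.
    """
    raw = LANE_GBPS[speed] * width
    data = raw * 8 / 10
    return raw, data


print(link_rates("QDR", 4))   # (40.0, 32.0)
print(link_rates("QDR", 12))  # (120.0, 96.0)
```

The first call reproduces the slide's 40 x 8/10 = 32 Gbps for QDR 4x.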
  • 20. InfiniBand: per 1x lane bandwidth (Gbps) • SDR (Single Data Rate): 2.5 Gbps • DDR (Double Data Rate): 5 Gbps • QDR (Quad Data Rate): 10 Gbps • FDR (Fourteen Data Rate): 14 Gbps • EDR (Enhanced Data Rate): 26 Gbps • HDR: High Data Rate • NDR: Next Data Rate • http://www.infinibandta.org/content/pages.php?pg=technology_overview
  • 21. IB packets • LRH: Local Routing Header • GRH: Global Routing Header • BTH: Base Transport Header • ETH: Extended Transport Header – Send – Read – Write – Acks • 256B to 4KB • Excerpt: “The link layer describes the packet format and protocols for packet operation, e.g. flow control and how packets are routed within a subnet between the source and destination. There are two types of packets. Link Management Packets: these are packets used to train and maintain link operation. These packets are created and consumed within the Link Layer and are not subject to flow control. Link management packets are used to negotiate operational parameters between the ports at each end of the link such as bit rate, link width, etc. They are also used to convey flow control credits and maintain link integrity. Link management packets are never forwarded to other links. Data Packets: these are the packets that convey IBA operations and they consist of a number of different headers, which might or might not be present.” • Figure: IBA Data Packet Format (Start Delimiter; Data Symbols: LRH, GRH, BTH, ETH, Payload, I Data, ICRC, VCRC; End Delimiter; Idles; Upper Layer Protocol, Transport Layer Protocol, Network Layer Protocol, Link Layer Protocol)
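  • 22. QoS • Ethernet • SL (Service Level), VL (Virtual Lane) – Head-of-line blocking – 16 • Figure: SL, VL, receive buffers, arbitration, credit-based link control, virtual lane mux and de-mux of packets
  • 23. DLID (Destination Local ID), SL • Figure: the LRH carries SL (4 bit) and DLID (16 bit) ahead of the payload; at the switch port, the FDB maps DLID to output port and the SL to VL table selects the virtual lane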
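The lookup shown in the figure can be sketched as below: the switch reads SL and DLID from the LRH, finds the output port in the FDB, and maps the SL to a VL. This is a simplified illustration under stated assumptions, not an implementation of the specification: the class and field names are hypothetical, and a single SL to VL table is used for the whole switch.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    sl: int         # Service Level, 4-bit LRH field
    dlid: int       # Destination Local ID, 16-bit LRH field
    payload: bytes


class Switch:
    def __init__(self, fdb: dict[int, int], sl_to_vl: dict[int, int]):
        self.fdb = fdb            # forwarding database: DLID -> output port
        self.sl_to_vl = sl_to_vl  # SL -> VL (assumption: one table per switch)

    def forward(self, pkt: Packet):
        """Return (output port, virtual lane), or None if the DLID is unknown."""
        if not 0 <= pkt.sl < 2 ** 4:
            raise ValueError("SL must fit in 4 bits")
        if not 0 <= pkt.dlid < 2 ** 16:
            raise ValueError("DLID must fit in 16 bits")
        port = self.fdb.get(pkt.dlid)
        if port is None:
            return None  # no FDB entry: this sketch simply drops the packet
        return port, self.sl_to_vl[pkt.sl]


# Hypothetical tables: DLID 0x0012 leaves via port 3; odd SLs use VL 1.
switch = Switch(fdb={0x0012: 3}, sl_to_vl={sl: sl % 2 for sl in range(16)})
print(switch.forward(Packet(sl=5, dlid=0x0012, payload=b"data")))  # (3, 1)
```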
  • 24. Congestion control • Explicit Congestion Notification – Tree saturation • Ver 1.2 • Excerpt: “When an adapter receives a BECN […], it responds by throttling back its injection of packets. This reduces congestion. Over time, the source gradually reduces its throttling, which may again cause congestion to be detected. If all parameters, such as the rates of throttling and reduction in throttling over time, are appropriately set, the network should settle into a stable state with just enough congestion to keep the sources quenched. The setting of such parameters is under control of a Congestion Control Manager (CCM), which establishes their values. It may adjust them over time, but they must be set appropriately to quench congestion quickly; software intervention by the CCM is too slow to avoid the problem.” • Figure 2. Technique of InfiniBand Congestion Control: source HCA, switch, destination HCA; FECN, BECN; CCT index, threshold, timer
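The feedback loop in the excerpt and figure can be illustrated with a toy model: a switch marks packets (FECN) when a queue passes a threshold, the destination returns a BECN, the source raises its index into a Congestion Control Table (CCT) to throttle, and a timer lowers the index again. All names, table values, and step sizes here are assumptions chosen for illustration; real parameters are set by the Congestion Control Manager.

```python
def switch_marks_fecn(queue_depth: int, threshold: int) -> bool:
    """A switch marks a packet with FECN once its queue exceeds the threshold."""
    return queue_depth > threshold


class SourceThrottle:
    """Toy source HCA: the CCT index selects an injection delay."""

    def __init__(self, cct: list[int], step: int = 1):
        self.cct = cct      # hypothetical table: index -> delay between packets
        self.index = 0      # index 0 means no throttling
        self.step = step    # how far one BECN moves the index

    def on_becn(self) -> None:
        # A BECN arrived: throttle harder, saturating at the table end.
        self.index = min(self.index + self.step, len(self.cct) - 1)

    def on_timer(self) -> None:
        # The timer expired without new BECNs: relax throttling gradually.
        self.index = max(self.index - 1, 0)

    @property
    def delay(self) -> int:
        return self.cct[self.index]


src = SourceThrottle(cct=[0, 1, 2, 4, 8])
if switch_marks_fecn(queue_depth=12, threshold=8):
    src.on_becn()   # destination echoed the FECN as a BECN
    src.on_becn()
print(src.delay)    # 2
src.on_timer()
print(src.delay)    # 1
```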
  • 25. Transport services • QP • QP, CQ • 4 service types – RC ~ TCP, UD ~ UDP – connected: RC (reliable), UC (unreliable) – connectionless: RD (reliable), UD (unreliable)
  • 26. InfiniBand topologies • Fat tree (Clos), 3D torus • Fat tree – TSUBAME 2.0 – TACC Ranger: Sun Magnum (3456 ports, 5-stage Clos network) • 3D torus – Sandia Red Sky • NASA Pleiades: 10D hypercube • Google (ISCA '10): flattened butterfly
  • 27. TOP500 #4, Green500 #2 (NEC Confidential) ? 4-7 • http://www.gsic.titech.ac.jp/~ccwww/TSUBAME20.pdf
  • 28. TOP500 top 10 network topologies, 1993 to 2010 (one row per rank, entries in chronological order) • Rank 1: F X M X X X H H M M M M M M M F F F X X X X X M M M M M M M F F F M M • Rank 2: F F X M M M X X H M M M H H H M F F F F F F F H M M M M M M M M M F F • Rank 3: F F F F M M M H X M M M M H H H M F F F F F F X H F F M M C M C M M F • Rank 4: F F F F M M M M H H M H H M F H H M F F F F M F X H H F M O F M C M M • Rank 5: F X X X X M M M M M M M H H F H H F F F F F F F F F F M F M M M F M • Rank 6: X X F F M M M H H H H M H F H F F F F F F F M M F F F M M F M C C • Rank 7: F X X X F X M M X M M M H M H F H M F X F X F F X F F M M C M M M F • Rank 8: M M F F F F F X M X M H H M M F X H F F F F M M M F M H F M O M F M M • Rank 9: M F F M F X M M M M M M H H F F H F F F F F M M M F F M M M M F M • Rank 10: F M F X M M M M M F M M F F F F F F F F M M X M H M C F F M M H • Legend: M = mesh or torus, C = hypercube, F = fat tree, O = other, H = hierarchical, X = crossbar, blank = no network • Based on Ada Gavrilovska, “Attaining High Performance Communications,” 2010
  • 29. InfiniBand • RDMA (Remote Direct Memory Access) • 10 to 40 Gbps, 120 Gbps • QoS • 16, 1 • CPU, OS • 48K, 2^128
  • 30. Outline • TOP500 • InfiniBand • Data Center Ethernet • I/F
  • 31. Data Center Bridging • FCoE, lossless, IEEE 802.1 – Fibre Channel – MPI • 802.1Qau: Congestion Notification • 802.1Qbb: Priority-based Flow Control – PAUSE • 802.1Qaz: Enhanced Transmission Selection
  • 32. DC comparison (Arista) • 80%, 80% • DB, MapReduce • 64:1, 200:1 • 75 to 150 us, 5 to 10 us • < 1 Tbps, 10 Tbps • < 20, wire speed • 10G port: 100 W, 10 W
  • 33. DC network (figure): Internet, core switch, aggregate switch, access (ToR) switch, rack
  • 34. Spanning Tree Protocol • Ethernet • 802.1D: Spanning Tree Protocol • 802.1w: RSTP (Rapid STP) – 802.1D-2004 • PVST (Per VLAN ST) – VLAN, ST – topology, VLAN ID, ST – Cisco • 802.1s: MSTP (Multiple STP) – VLAN, ST – PVST, ST – RSTP – 802.1Q-2003, VLAN
  • 35. IETF TRILL • Transparent Interconnection of Lots of Links • STP – FSPF, IS-IS – TRILL, RBridge, IP TTL, L2 – Fat tree • Multi-chassis Link Aggregation (MLAG)
  • 36. IETF TRILL (figure): servers A and Z connected through RBridges (Routing bridges) B, C, D, … E running IS-IS • TRILL frame: L2 header (A MAC, B MAC), L3 header (hop count, B IP, E IP), payload (A MAC, Z MAC, payload) • RBridge, hop
  • 37. Outline • TOP500 • InfiniBand • Data Center Ethernet • I/F
  • 38. Nehalem-EP platform (figure) • DDR3 DIMM: 10.6 GB/s x 3ch • Nehalem-EP x 2, QPI: 12.8 GB/s • Intel 5520 chipset • PCIe switches, PCIe 2.0 8x: 4 GB/s (NIC, HCA, SATA) • 40GbE NIC: 26 Gbps • QPI: Quick Path Interconnect
  • 39. Nehalem-EX platform (figure) • Nehalem-EX x 4, QPI: 12.8 GB/s • DDR3 DIMM behind SMBs, SMI: 6.4 GB/s x 4ch • Boxboro chipset x 2, PCIe
  • 40. PCI Express
  • 41. 100 Gbps • 40 GbE: PCIe 2.0 16x or PCIe 3.0 8x • 100 GbE: PCIe 3.0 16x • Intel Sandy Bridge – CPU, PCIe 3.0 – QPI: 8 GT/s • CPU, I/O
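The PCIe slot requirements on this slide follow from the per-lane signaling rate and line coding of each generation: PCIe 2.0 runs at 5 GT/s with 8b/10b coding, PCIe 3.0 at 8 GT/s with 128b/130b. The sketch below works through that arithmetic. The names are illustrative, and it ignores packet and protocol overhead, so real throughput is lower.

```python
# Generation -> (GT/s per lane, line-coding efficiency).
PCIE = {
    "2.0": (5.0, 8 / 10),      # 8b/10b encoding
    "3.0": (8.0, 128 / 130),   # 128b/130b encoding
}


def pcie_gbps(gen: str, lanes: int) -> float:
    """Raw data bandwidth of a PCIe link in Gbps, one direction."""
    gt_per_s, efficiency = PCIE[gen]
    return gt_per_s * efficiency * lanes


def fits(gen: str, lanes: int, ethernet_gbps: float) -> bool:
    """Can this slot carry the given Ethernet line rate (before overhead)?"""
    return pcie_gbps(gen, lanes) >= ethernet_gbps


for gen, lanes, rate in [("2.0", 8, 40), ("2.0", 16, 40), ("3.0", 8, 40), ("3.0", 16, 100)]:
    print(f"PCIe {gen} x{lanes}: {pcie_gbps(gen, lanes):.1f} Gbps, "
          f"{rate} GbE fits: {fits(gen, lanes, rate)}")
```

Under these assumptions a PCIe 2.0 8x slot provides 32 Gbps, which is short of 40 GbE, while PCIe 2.0 16x and PCIe 3.0 8x both provide a little over 60 Gbps.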
  • 42. Outline • TOP500 • InfiniBand • Data Center Ethernet • I/F
  • 43. Fulcrum FocalPoint FM6000 • DCD • 12W • 10Gbps 1W • FM6372: 18 x 40 GbE – 12, 72W
    Model   Max port BW  SGMII ports  XAUI ports  KR/XFI ports  KR4 ports
    FM6316  160G         72           16          16            4
    FM6324  240G         72           24          24            8
    FM6332  320G         72           24          32            8
    FM6348  480G         72           24          48            12
    FM6364  640G         72           24          64            16
    FM6372  720G         72           24          72            18
  • 44. InfiniBand Switch Silicon Family (InfiniScale / InfiniScale III / InfiniScale IV) • # IB ports: 8 (4X) / 24 (4X) or 8 (12X) / 36 (4X) or 12 (12X) • Link speed: 10 Gb/s / 10, 20 Gb/s / 20, 40 Gb/s • Ball-to-ball latency: 240 ns / 200, 140 ns / 120, 100 ns • Switching capacity: 160 Gb/s / 960 Gb/s / 2880 Gb/s • CPU interface: PCI 2.2 or MPC860 (master and slave); PCIe 2.0 x4; MPC860 (slave only) • Typ. power (W): 18 / 25 (SDR), 30 (DDR) / 74 (DDR), 85 (QDR) • Package (mm): 40x40 / 40x40 / 45x45 • RoHS compliance: R5, R5, R5; R6 IC available • © 2009 MELLANOX TECHNOLOGIES - CONFIDENTIAL
  • 45. InfiniBand • Data Center Bridging (Ethernet) • InfiniBand • 100Gbps: CPU, I/O