Cisco usNIC: how it works, how it is used in Open MPI

In this talk, I expand on the slides I presented at the EuroMPI conference in Madrid, Spain, in September 2013 (I re-used some of the slides from that Madrid presentation, but there's a bunch of new content in the latter half of the slide deck).

This talk is a technical deep dive into how Cisco's usNIC technology works, and how we use that technology in the BTL plugin that we wrote for Open MPI.

I originally gave this talk at Lawrence Berkeley Labs on Thursday, November 7, 2013.

Cisco usNIC: how it works, how it is used in Open MPI

  1. Cisco Userspace NIC (usNIC). Jeff Squyres, Cisco Systems, Inc. November 7, 2013.
  2. Yes, we sell servers now.
  3. Record-setting servers: Cisco UCS 1U and 2U servers with Intel Ivy Bridge. Ultra-low latency Ethernet: Cisco 2 x 10Gb VIC (yes, really!). 40Gb top-of-rack and core switches: Cisco 10/40Gb Nexus switches.
  4. Industry-leading compute without compromise (Cisco UCS: many server form factors, one system). Rack: UCS C220 M3, 2-socket, ideal for HPC compute-intensive applications; UCS C240 M3, 2-socket, perfect as HPC cluster head node or IO node; UCS C420 M3, 4-socket rack server for large-memory compute workloads. Blade: UCS B200 M3, 2-socket blade form factor; UCS B420 M3, 4-socket blade for large-memory compute workloads.
  5. Worldwide x86 server blade market share: UCS is impacting the growth of established vendors like HP; legacy offerings are flat-lining or in decline; Cisco's growth is out-pacing the market; UCS is #2 and climbing. Customers have shifted 19.3% of the global x86 blade server market to Cisco, and over 26% in the Americas (source: IDC Worldwide Quarterly Server Tracker, Q1 2013 revenue share, May 2013). Demand for data center innovation has vaulted the Cisco Unified Computing System (UCS) to the #2 position in the fast-growing x86 server market segment.
  6. Benchmark world records across Cisco UCS: best CPU performance, best virtualization and cloud performance, best database performance, best enterprise application performance, best enterprise middleware performance, and best HPC performance (8, 9, 14, 15, 16, and 18 world records across these categories).
  7. One wire to rule them all: commodity traffic (e.g., ssh), cluster / hardware management, file system / IO traffic, and MPI traffic, all on 10G or 40G with real QoS.
  8. Low-latency, high-density 10/40Gb switches. Nexus 3548: 190ns port-to-port latency (L2 and L3), created for HPC / HFT, 48 x 10Gb / 12 x 40Gb ports. Nexus 6004: 1us port-to-port latency, 384 x 10Gb / 96 x 40Gb ports. Cisco Nexus: years of experience rolled into dependable solutions.
  9. Single spine-leaf fabric. Characteristics: 3 hops; low oversubscription (non-blocking); < ~3.5 usecs depending on config and workload; 10G or 40G capable; spine 4 to 16 wide; leaf count determined by spine density.
     Spine - Leaf | Port Scale | Latency | Spines | Leafs
     10G fabric, 6004 - 6001 | 18,432 x 10G, 3:1 | ~3 usecs cut-through | 16 | 384
     40G fabric, 6004 - 6004 | 7,680 x 40G, 5:1 | ~3 usecs cut-through | 16 | 96
     Mixed fabric, 6004 - 6001 | 4,680 x 10G, 3:1 | ~3 usecs store-and-forward | 4 | 96
     10G fabric, 6004 - 3548 | 12,288 x 10G, 3:1 | ~1.5 usecs cut-through | 16 | 384
     40G fabric, 6004 - 3548 | 1,152 x 40G, 1:1 | ~1.5 usecs cut-through | 6 | 96
     Mixed fabric, 6004 - 3548 | 3,072 x 10G, 3:1 | ~1.5 usecs store-and-forward | 4 | 96
     ...many other configurations are also possible.
  10. Two-tier spine fabric. Characteristics: 3 hops within a pod, 5 hops for DC east-west traffic; low oversubscription (non-blocking); < ~3.5 usecs depending on config and workload; 10G or 40G capable; two spine layers.
     Spine2 - Spine1 - Leaf | Port Scale | Latency | Spine2 | Spine1 | Leafs
     10G fabric, 6004 - 6004 - 6001 | 55,296 x 10G, 3:1 | ~3-5 usecs cut-through | 48 | 16 x 6 | 192
     40G fabric, 6004 - 6004 - 6004 | 23,040 x 40G, 5:1 | ~3-5 usecs cut-through | 48 | 16 | 48
     Mixed fabric, 6004 - 6004 - 6001 | 18,432 x 10G, 3:1 | ~3-5 usecs store-and-forward | 32 | 4 x 8 | 48
     10G fabric, 6004 - 6004 - 3548 | 24,576 x 10G, 2:1 | ~1.5-3.5 usecs cut-through | 32 | 16 x 4 | 192
     40G fabric, 6004 - 6004 - 3548 | 2,304 x 40G, 1:1 | ~1.5-3.5 usecs cut-through | 24 | 6 x 8 | 48
     Mixed fabric, 6004 - 6004 - 3548 | 9,216 x 10G, 2:1 | ~1.5-3.5 usecs store-and-forward | 24 | 6 x 8 | 48
  11. (Image-only slide; no text content.)
  12. usNIC overview. Direct access to the NIC hardware from Linux userspace: operating-system bypass, via the Linux Verbs API (UD). Utilizes the Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency: 2nd-generation 80Gbps Cisco ASIC, 2 x 10Gbps Ethernet ports (2 x 40Gbps coming ...soon...), PCI and mezzanine form factors. Half-round-trip (HRT) ping-pong latencies (Intel E5-2690 v2 servers): raw back-to-back 1.57μs; MPI back-to-back 1.85μs; through MPI + Nexus 3548 2.05μs. These numbers keep going down.
  13. TCP/IP vs. usNIC software stacks. TCP/IP path: the application goes through a userspace sockets library, the kernel TCP stack, a general Ethernet driver, and the Cisco VIC driver to reach the Cisco VIC hardware. usNIC path: the application uses a userspace verbs library that talks to the Cisco VIC hardware directly for the send and receive fast path; the kernel's verbs IB core and the Cisco usNIC driver are used only for bootstrapping and setup.
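
To make the "bootstrapping and setup" part of the slide above concrete, here is a minimal sketch (not the actual Open MPI BTL code) of opening a verbs device and creating an unreliable-datagram (UD) queue pair with standard libibverbs calls; this is the kind of setup that goes through the kernel once, after which sends and receives bypass it. Error handling is abbreviated, and treating the first device in the list as a usnic_X device is an assumption for the sketch.

    /* Sketch: open a verbs device and create a UD QP (setup path only). */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no verbs devices found\n");
            return 1;
        }

        /* Assumption: the first device is a usnic_X device. */
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);

        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .qp_type = IBV_QPT_UD,      /* unreliable datagram, as usNIC uses */
            .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                         .max_send_sge = 1, .max_recv_sge = 1 },
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        printf("created UD QP %u on %s\n", qp->qp_num,
               ibv_get_device_name(devs[0]));

        ibv_destroy_qp(qp);
        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }
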
  14. MPI directly injects L2 frames into the network through the userspace verbs library and the Cisco VIC hardware, and receives L2 frames directly from the VIC.
  15. Each MPI process reaches the VIC through the x86 chipset's VT-d IOMMU. The VIC is an SR-IOV NIC: each MPI process has its own queue pairs, a hardware classifier steers inbound L2 frames to the right queue pair, and outbound L2 frames go straight from the process's queue pair onto the wire.
  16. The VIC exposes one Physical Function (PF) per physical port, each with its own MAC address (e.g., aa:bb:cc:dd:ee:ff and aa:bb:cc:dd:ee:fe). Each PF contains multiple SR-IOV Virtual Functions (VFs), and each VF contains queue pairs (QPs).
  17. Each MPI process gets its own VF (and that VF's QPs) behind a PF / physical port on the VIC, with its memory accesses going through the Intel IOMMU.
  18. The IOMMU is used for physical ↔ virtual memory translation; the usnic verbs driver programs (and deprograms) the IOMMU. The VIC issues DMA using virtual addresses, and the Intel IOMMU translates them to physical addresses in the userspace process's RAM.
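
For the slide above, the point in the verbs API where the driver gets the chance to pin pages and program the IOMMU is memory registration. Here is a hedged sketch using the standard ibv_reg_mr() call; the helper name register_buffer and the choice of alignment are illustrative, not part of the usNIC code.

    /* Sketch: register a buffer so the VIC's virtual-addressed DMA resolves. */
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = NULL;
        /* Page-aligned buffers keep the IOMMU mappings simple (assumption). */
        if (posix_memalign(&buf, 4096, len) != 0) {
            return NULL;
        }

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (mr == NULL) {
            free(buf);   /* registration failed; nothing was mapped */
        }
        return mr;       /* ibv_dereg_mr(mr) later undoes the mapping */
    }
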
  19. For the purposes of this talk, assume that each physical port has one Linux ethX device. Each ethX device corresponds to a PF, and each usnic_Y device corresponds to an ethX device. Example: on one VIC, physical (fiber) port 0 is eth4 / usnic_0 and physical port 1 is eth5 / usnic_1.
  20. (lstopo / hwloc diagram of a test server: Intel Xeon E5-2690 ("Sandy Bridge"), 2 sockets, 8 cores and 64GB per socket. The two ports of the VIC attached to socket 0 appear as eth4 / usnic_0 and eth5 / usnic_1; the two ports of the VIC attached to socket 1 appear as eth6 / usnic_2 and eth7 / usnic_3.)
  21. (Zoomed view of the same lstopo diagram, focusing on the PCI devices: the eth4 / usnic_0 and eth5 / usnic_1 VIC ports on one socket, and the eth6 / usnic_2 and eth7 / usnic_3 VIC ports on the other.)
  22. Open MPI software stack, top to bottom: application, Open MPI layer (OMPI), point-to-point messaging layer (PML), byte transfer layer (BTL), operating system, hardware.
  23. MPI_Send / MPI_Recv (etc.) are driven by the OB1 PML, which uses one usnic BTL module per device: /dev/usnic_0 and /dev/usnic_1 on VIC 0, /dev/usnic_2 and /dev/usnic_3 on VIC 1.
  24. BTL = Byte Transfer Layer: point-to-point transfer plugins in the OMPI layer; no particular protocol is assumed or required. The "usnic" BTL uses unreliable datagram (UD) verbs; handles all fragmentation and re-assembly itself (rather than in the PML); handles retransmissions and ACKs in software with a sliding-window retransmission scheme (sketched below); and directly injects / directly receives L2 Ethernet frames.
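
To illustrate the sliding-window retransmission scheme mentioned on the slide above, here is a hedged sketch of the bookkeeping involved. All names (seq_window, win_send, win_ack) and the window size are hypothetical illustrations, not Open MPI symbols; the real BTL logic also handles wraparound, timers, and retransmission of the unacknowledged range.

    /* Sketch: sliding-window send tracking on top of an unreliable transport. */
    #include <stdbool.h>
    #include <stdint.h>

    #define WINDOW_SIZE 64            /* max unacknowledged fragments in flight */

    struct seq_window {
        uint32_t base;                /* oldest unacknowledged sequence number */
        uint32_t next;                /* next sequence number to assign        */
        bool     pending[WINDOW_SIZE];/* pending[s % WINDOW_SIZE]: awaiting ACK */
    };

    /* Returns the sequence number to stamp on a fragment, or -1 if the window
     * is full and the sender must wait for ACKs before transmitting more. */
    static int64_t win_send(struct seq_window *w)
    {
        if (w->next - w->base >= WINDOW_SIZE) {
            return -1;                              /* window full: defer send */
        }
        uint32_t seq = w->next++;
        w->pending[seq % WINDOW_SIZE] = true;
        return seq;
    }

    /* Process a cumulative ACK: everything up to and including 'acked' is
     * delivered, so slide the window base forward. */
    static void win_ack(struct seq_window *w, uint32_t acked)
    {
        while (w->base <= acked && w->base < w->next) {
            w->pending[w->base % WINDOW_SIZE] = false;
            w->base++;
        }
    }
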
  25. There is one BTL module for each usNIC verbs device. Each module has two UD queue pairs: a priority QP for small and control packets, and a data QP for up-to-MTU-sized data packets. Each QP has its own CQ; the QPs may or may not be on the same VF. The overall BTL glue polls the CQs for each device: first the priority CQs, then the data CQs (see the sketch below).
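
Here is a hedged sketch of the polling order described on the slide above, using the standard ibv_poll_cq() verb. The usnic_dev structure, the handle_completion() callback, and the batch size of 16 are hypothetical stand-ins, not the actual Open MPI progress engine.

    /* Sketch: drain each device's priority CQ before its data CQ. */
    #include <infiniband/verbs.h>

    struct usnic_dev {
        struct ibv_cq *priority_cq;   /* completions for small/control packets  */
        struct ibv_cq *data_cq;       /* completions for MTU-sized data packets */
    };

    void handle_completion(struct usnic_dev *dev, struct ibv_wc *wc);

    /* One pass of the progress engine over all devices. */
    void poll_all_devices(struct usnic_dev *devs, int ndevs)
    {
        struct ibv_wc wc[16];

        for (int i = 0; i < ndevs; i++) {
            /* Priority CQ first: latency-sensitive traffic gets serviced early. */
            int n = ibv_poll_cq(devs[i].priority_cq, 16, wc);
            for (int j = 0; j < n; j++) {
                handle_completion(&devs[i], &wc[j]);
            }

            /* Then the data CQ. */
            n = ibv_poll_cq(devs[i].data_cq, 16, wc);
            for (int j = 0; j < n; j++) {
                handle_completion(&devs[i], &wc[j]);
            }
        }
    }
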
  26. "Raw" latency (no MPI, no verbs) is 1.57μs, and back-to-back MPI latency on Sandy Bridge is 1.85μs. Verbs is responsible for about 80ns of the difference (not related to the MPI API), and all the rest of OMPI adds only about 200ns: 1.57μs raw + ~0.08μs verbs + ~0.20μs OMPI ≈ 1.85μs.
  27. Deferred and piggy-backed ACKs. Example timeline between processes A and B: B ACKs A's first message immediately (ACK N); for a later burst of messages the ACK is deferred until several have arrived (ACK N+2); and when B has its own message to send, the deferred ACK is piggy-backed on it (Msg+ACK N+2).
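
A hedged sketch of the deferral and piggy-backing idea from the slide above: remember the highest in-order sequence received and either attach that ACK to the next outbound message or flush a standalone ACK after a small threshold. The names (ack_state, note_received, attach_ack) and the threshold value are hypothetical; the real BTL logic differs in detail.

    /* Sketch: defer ACKs and piggy-back them on outbound messages. */
    #include <stdbool.h>
    #include <stdint.h>

    struct ack_state {
        uint32_t highest_inorder;   /* highest in-order sequence received      */
        uint32_t unacked_count;     /* fragments received since last ACK sent  */
        bool     ack_pending;       /* true if the peer is owed an ACK         */
    };

    #define ACK_THRESHOLD 4         /* flush a standalone ACK after this many  */

    /* Called when a fragment arrives: defer the ACK instead of sending now.
     * Returns true if a standalone ACK should be sent immediately. */
    static bool note_received(struct ack_state *st, uint32_t seq)
    {
        if (seq == st->highest_inorder + 1) {
            st->highest_inorder = seq;
        }
        st->ack_pending = true;
        return ++st->unacked_count >= ACK_THRESHOLD;
    }

    /* Called when we are about to send our own message: piggy-back the ACK.
     * Returns true if the outgoing header should carry Msg+ACK. */
    static bool attach_ack(struct ack_state *st, uint32_t *ack_seq_out)
    {
        if (!st->ack_pending) {
            return false;           /* nothing owed; send a plain message */
        }
        *ack_seq_out = st->highest_inorder;
        st->ack_pending = false;
        st->unacked_count = 0;
        return true;
    }
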
  28. Normal send path: the host writes a WQ (work queue) descriptor, then writes the WQ index to the VIC via PIO. The VIC reads the WQ descriptor (only now does it have the buffer address), reads the packet buffer from RAM, and sends it on the wire.
  29. Optimized send path: the host still writes the WQ descriptor, but the PIO write to the VIC carries the WQ index plus an encoded buffer address. The VIC can start reading the packet buffer from RAM without waiting for the descriptor read to return the address, so the frame is sent ~400ns sooner.
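
A purely illustrative sketch of the "index + encoded address" PIO idea from the slide above. The register layout, encoding, and struct are made up for illustration; the real VIC doorbell interface lives in the usNIC driver and firmware and is not shown in the talk. Real code would also need a store fence before the doorbell write.

    /* Sketch: pack WQ index and a truncated buffer address into one PIO write. */
    #include <stdint.h>

    struct wq_descriptor {          /* hypothetical work-queue descriptor in RAM */
        uint64_t buf_addr;
        uint32_t buf_len;
        uint32_t flags;
    };

    static inline void post_send_pio(volatile uint64_t *doorbell,
                                     struct wq_descriptor *wq, uint32_t index,
                                     uint64_t buf_addr, uint32_t len)
    {
        wq[index].buf_addr = buf_addr;      /* descriptor is still written...   */
        wq[index].buf_len  = len;
        wq[index].flags    = 0;

        /* ...but the doorbell write itself carries enough (index + encoded
         * address, hypothetical encoding) for the NIC to begin the buffer DMA
         * without waiting for the descriptor fetch. */
        uint64_t encoded = ((buf_addr >> 6) << 16) | (index & 0xffff);
        *doorbell = encoded;
    }
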
  30. Minimize the length of the priority receive queue: using 2048 different receive buffers is about 200ns worse than using 64, a result of an IOMMU cache effect. We therefore scale the length of the priority RQ with the number of processes in the job, using only as much of the IOMMU-mapped space as needed (see the sketch below).
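
A hedged sketch of scaling the priority RQ length with job size, as described on the slide above. The per-peer factor and the clamp bounds are illustrative assumptions, not the values Open MPI actually uses (the 64 and 2048 figures are simply the two sizes compared on the slide).

    /* Sketch: size the priority RQ to the job instead of the maximum. */
    static int priority_rq_length(int nprocs_in_job)
    {
        int per_peer = 2;                 /* assumed receive buffers per peer   */
        int len = nprocs_in_job * per_peer;

        if (len < 64)   len = 64;         /* floor: the small, fast size above  */
        if (len > 2048) len = 2048;       /* illustrative posting limit         */
        return len;
    }
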
  31. Use fast paths wherever possible: be friendly to the optimizer and the instruction cache. This made a noticeable difference (!):
      if (fastpathable)
          do_it_inline();
      else
          call_slower_path();
  32. (Same lstopo topology diagram, annotated: MPI processes running on these cores, i.e., the cores of one socket / NUMA node.)
  33. (Same diagram: for short messages, only use the usNIC devices local to those cores, usnic_0 and usnic_1.)
  34. (Same diagram: for long messages, use ALL usNIC devices, usnic_0 through usnic_3.)
  35. Everything above the firmware is open source: Open MPI (Cisco is distributing an Open MPI 1.6.5 build; the usNIC support is upstream in Open MPI 1.7.3), the libibverbs plugin, and the verbs kernel module.
  36. Benchmark configuration. Hardware: Cisco UCS C220 M3 rack server; Intel E5-2690 processor, 2.9 GHz (3.3 GHz Turbo), 2 sockets, 8 cores/socket; 1600 MHz DDR3 memory, 8 GB x 16, 128 GB installed; Cisco VIC 1225 with ultra-low-latency networking usNIC driver; Cisco Nexus 3548 48-port 10 Gbps ultra-low-latency Ethernet switch. Software: OS CentOS 6.4, kernel 2.6.32-358.el6.x86_64 (SMP); NetPIPE (ver 3.7.1); Intel MPI Benchmarks (ver 3.2.4); High Performance Linpack (ver 2.1); other: Intel C Compiler (ver 13.0.1), Open MPI (ver 1.6.5), Cisco usNIC (1.0.0.7x).
  37. (NetPIPE chart of Cisco usNIC latency and throughput vs. message size: 2.05 usecs latency for small messages, 9.3 Gbps throughput.)
  38. (IMB chart: PingPing and PingPong latency track together; 2.05 usecs PingPong latency, 2.10 usecs PingPing latency, with PingPong and PingPing throughput (MB/s) and latency (usecs) plotted against message size up to 4 MB.)
  39. (IMB chart: full bi-directional performance for both Exchange and SendRecv; 2.11 usecs SendRecv latency, 2.58 usecs Exchange latency, with SendRecv and Exchange throughput (MB/s) and latency (usecs) plotted against message size up to 4 MB.)
  40. High Performance Linpack. GFLOPS = FLOPS/cycle x number of CPU cores x frequency (GHz); E5-2690 max GFLOPS = 8 x 16 x 3.3 = 422 GFLOPS. Single-node HPL score (16 cores): 340.51 GFLOPS*; 32-node HPL score (512 cores): 9,773.45 GFLOPS. Efficiency relative to the single-machine score: (9,773.45) / (340.51 x 32) x 100 = 89.69%. Measured GFLOPS by core count: 16: 340.51; 32: 673.68; 64: 1,271.14; 128: 2,647.09; 256: 5,258.27; 512: 9,773.45. (*Score may improve with additional compiler settings or newer compiler versions.)
  41. Thank you.
