Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Cisco's journey from Verbs to Libfabric

This is one of two mini-talks that I gave at Euro MPI 2015 / Bordeaux.

It describes the journey Cisco undertook to evaluate two different Linux operating-system bypass APIs: Verbs and Libfabric. I detail the technical points we evaluated in both APIs, and ultimately show why we picked Libfabric over Verbs.

  • Be the first to comment

Cisco's journey from Verbs to Libfabric

  1. 1. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 1 Cisco’s Journey From Verbs to Libfabric Abondon the shackles of Verbs Embrace the freedom of Libfabric Jeffrey M. Squyres Cisco Systems 23 September 2015
  2. 2. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 2 Application Kernel Cisco VIC ethX port TCP stack General Ethernet driver enic.ko Userspace sockets API userspace library Application Verbs IB core usnic.ko Send and receive fast path usNICTCP/IP
  3. 3. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 3 Verbs is a fine API. …if you make InfiniBand hardware.
  4. 4. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 4 ...but now there’s this libfabric thing (see libfabric.org community for details)
  5. 5. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 5 Keep in mind, Cisco already supports UD Verbs
  6. 6. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 6 •  Monotonic enum •  Could not add popular Ethernet values 1500 9000 •  usNIC verbs provider had to lie (!) …just like iWARP providers •  MPI had to match verbs device with IP interface to find real MTU Verbs IBV_MTU_256 IBV_MTU_512 IBV_MTU_1024 IBV_MTU_2048 IBV_MTU_4096
  7. 7. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 7 •  Integer (not enum) endpoint attribute Libfabric
  8. 8. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 8 •  Integer (not enum) endpoint attribute Libfabric DONE
  9. 9. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 9 •  Mandatory GRH structure InfiniBand-specific header •  40 bytes UDP header is 42 bytes …and a different format •  Breaks ib_ud_pingpong •  usnic verbs provider used “magic” ibv_port_query() to return extensions pointers E.g., enable 42-byte UDP mode Verbs e t len chksmacdmac … ver len n e x t h o p sgid dgid UDP header: 42 bytes GRH: 40 bytes
  10. 10. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 10 •  FI_MSG_PREFIX and ep_attr.msg_prefix_size Libfabric e t len chksmacdmac … payload
  11. 11. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 11 •  FI_MSG_PREFIX and ep_attr.msg_prefix_size Libfabric e t len chksmacdmac … payload DONE
  12. 12. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 12 •  Tuple: (device, port) Usually a physical device and port Does not match virtualized VIC hardware •  Queue pair •  Completion queue Verbs PU P#9 PCI 8086:1521 eth3 PCI 1137:0043 eth4 usnic_0 PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:0043 ibv_device ibv_port QP QP CQ QP
  13. 13. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 13 •  Maps nicely to SR-IOV •  Fabric à PCI physical function (PF) •  Domain à PCI virtual function (VF) •  Endpoint à Resources in VF PU P#9 PCI 8086:1521 eth3 PCI 1137:0043 eth4 usnic_0 PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:00cf PCI 1137:0043 Libfabric fi_fabric fi_domain fi_endpoint (resources in domain) EP EP CQ EP
  14. 14. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 14 •  GID and GUID No easy mapping back to IP interface •  usnic verbs provider encoded MAC in GID Still cumbersome to map back to IP interface •  Could use RDMA CM …but that would be a ton more code Verbs mac[0] = gid->raw[8] ^ 2; mac[1] = gid->raw[9]; mac[2] = gid->raw[10]; mac[3] = gid->raw[13]; mac[4] = gid->raw[14]; mac[5] = gid->raw[15];
  15. 15. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 15 •  Can use IP addressing directly Libfabric Everything is awesome
  16. 16. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 16 •  Can use IP addressing directly Libfabric Everything is awesomeDONE
  17. 17. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 17 •  Generic send call ibv_post_send(…SG list…) Lots of branches •  Wasteful allocations •  No prefixed receive •  Branching in completions Verbs
  18. 18. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 18 •  Multiple types of send calls fi_send(buffer, …) •  Variable-length prefix receive Provider-specific •  Fewer branches in completions Libfabric
  19. 19. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 19 1.9 1.95 2 2.05 2.1 2.15 2.2 2.25 2.3 2.35 2.4 0.1 1 10 100 Time(microseconds) Buffer size Open MPI with usNIC: IMB PingPong Latency imb-pingpong-ompi-1.8-verbs.out imb-pingpong-ompi-1.8-libfabric.out
  20. 20. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 20 61000 62000 63000 64000 65000 66000 67000 68000 69000 1e+06 Bandwidth(megabits/second) Buffer size Open MPI with usNIC: IMB SendRecv Bandwidth imb-sendrecv-ompi-1.8-verbs.out imb-sendrecv-ompi-1.8-libfabric.out
  21. 21. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 21 •  Performance issues •  Memory registration still a problem •  No MPI-style tag matching •  One-sided capabilities do not match MPI •  Network topology is a separate API Verbs
  22. 22. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 22 •  Performance happiness •  Many MPI-helpful features: Tag matching One-sided operations Triggered operations •  Inherently designed to be more than just point-to-point •  More work to be done… but promising MMU notify Network topology Libfabric
  23. 23. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 23 •  Long design discussions about how to expose Ethernet / VIC concepts in the verbs API …usually with few good answers Especially problematic with new VIC features over time •  Conclusion: possible (obviously), but not preferable •  Whole API designed with multiple vendor hardware models in mind •  Much easier to match our hardware to core Libfabric concepts •  Conclusion: much more preferable than verbs LibfabricVerbs
  24. 24. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24 Ok, so let’s do libfabric!
  25. 25. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25© 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25 Does it play well with MPI?
  26. 26. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 26 Byte Transport Layer (BTL) plugins Matching Transport Layer (MTL) plugins MPI_Send(…)
  27. 27. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 27 •  Inherently multi-device •  Round-robin for small messages •  Striping for large messages •  Major protocol decisions and MPI message matching driven by an Open MPI engine Byte Transport Layer (BTL) plugins
  28. 28. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 28 Matching Transport Layer (MTL) plugins •  Most details hidden by network API • MXM • Portals • PSM •  As a side effect, must handle: • Process loopback • Server loopback (usually via shared memory)
  29. 29. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 29 Byte Transport Layer (BTL) plugins Matching Transport Layer (MTL) plugins •  IB / iWarp (verbs) •  Portals •  SCIF •  Shared memory •  TCP •  uGNI •  usNIC (verbs) •  MXM •  Portals •  PSM •  PSM2
  30. 30. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 30 •  IB / iWarp (verbs) •  Portals •  SCIF •  Shared memory •  TCP •  uGNI •  usNIC Byte Transport Layer (BTL) plugins Matching Transport Layer (MTL) plugins •  MXM •  Portals •  PSM •  PSM2 •  ofi libfabric
  31. 31. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 31 libfabric usnic BTL ofi MTL •  Cisco developed •  usNIC-specific •  OFI point-to-point / UD •  Tested with usNIC •  Intel developed •  Provider neutral •  OFI tag matching •  Tested with PSM / PSM2
  32. 32. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 32 Bootstrapping Message passing There are two main parts of the usNIC BTL
  33. 33. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 33 verbs bootstrapping verbs message passing These two parts were previously written to the Verbs API
  34. 34. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 34 verbs bootstrapping verbs message passing sideband bootstrapping 1.  Find the corresponding ethX device 2.  Obtain MTU 3.  Open usNIC-specific configuration options Per the previous slides, the Verbs API requires some… help… in the form of sideband bootstrapping
  35. 35. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 35 verbs bootstrapping verbs message passing sideband bootstrapping libfabric bootstrapping à libfabric message passing à Now let’s convert to use the libfabric API
  36. 36. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 36 verbs bootstrapping verbs message passing sideband bootstrapping libfabric bootstrapping à libfabric message passing àPretty much a ~1:1 swap of verbs à libfabric calls Bootstrapping sequence totally different / not comparable …but libfabric needs no sideband bootstrapping (got to delete several hundred lines of OMPI code – yay!)
  37. 37. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 37 •  For a specific provider Ask fi_getinfo() for prov_name=“usnic” •  Use usNIC extensions Netmask, link speed, IP device name, etc. •  usNIC-specific error messages •  For any tag-matching provider •  No extension use 100% portable •  Generic error messages usnic BTL ofi MTL
  38. 38. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 38 •  For a specific provider Ask fi_getinfo() for prov_name=“usnic” •  Use usNIC extensions Netmask, link speed, IP device name, etc. •  usNIC-specific error messages •  For any tag-matching provider •  No extension use 100% portable •  Generic error messages usnic BTL ofi MTL Both libfabric usage models co-exist (and play well with each other) inside a single MPI implementation. Proof positive of successful co-design of libfabric and MPI implementations.
  39. 39. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 39 •  For a specific provider Ask fi_getinfo() for prov_name=“usnic” •  Use usNIC extensions Netmask, link speed, IP device name, etc. •  usNIC-specific error messages •  For any tag-matching provider •  No extension use 100% portable •  Generic error messages usnic BTL ofi MTL
  40. 40. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 40 •  Libfabric is the Way Forward for Cisco Open community Matches our hardware Performance benefits Features benefits •  Libfabric matches MPI Has features MPI has been asking for… for years Optimistic about its future (come join us!) http://libfabric.org
  41. 41. © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 41 Thank you.

    Be the first to comment

    Login to see the comments

  • DmitryGladkov2

    Sep. 8, 2017

This is one of two mini-talks that I gave at Euro MPI 2015 / Bordeaux. It describes the journey Cisco undertook to evaluate two different Linux operating-system bypass APIs: Verbs and Libfabric. I detail the technical points we evaluated in both APIs, and ultimately show why we picked Libfabric over Verbs.

Views

Total views

2,242

On Slideshare

0

From embeds

0

Number of embeds

689

Actions

Downloads

110

Shares

0

Comments

0

Likes

1

×