High Performance Interconnects:
Landscape, Assessments &
Rankings
Dan Olds
Partner, OrionX
April 12, 2017
High Performance Interconnects (HPI)
 The very top end of the networking market – for when you absolutely need high bandwidth and low latency
 Without HPI, you don’t really have a cluster – or at least not one that works very well
 Performance has been rising at 30% annually
 Spending on HPI is also rising significantly
Current HPI line
[Diagram: the current HPI line within the networking market, positioned by link speed (1G, 10G, 40G, 100G), single-rack vs. multi-rack scale, network protocol (TCP/IP, InfiniBand, Specialized, OPA), and application communication layer (MPI, JDBC, RMI, IIOP, SOAP, etc.); the HPI market segment sits at the high end.]
Three Types of HPI
 Ethernet
– Sold by a host of providers: Cisco, HPE, Juniper, and many others
– Tried and true interconnect, easiest to implement
– Bandwidth is comparable to the others, but latency is far higher (tens of microseconds over TCP/IP versus around a microsecond or less)
 Proprietary
– Primarily sold by Cray, SGI, and IBM plus a few others
– You have to purchase one of their systems to get their brand of HPI
– Intel is a new entrant in this segment of the market, although without an accompanying system
 InfiniBand
– Mellanox has emerged as the de facto leader
– Highest performance based on published numbers: 200 Gb/s, 200M messages/s, 90 ns switch latency
Key Differences in HPI: Product Maturity/Position
Ethernet
 Ethernet has been around longer than any other HPI, but has been surpassed in performance
– Still many installations, but it has lost much of its share at the high end
– Latency (tens of microseconds over TCP/IP, versus around a microsecond or less for the other HPIs) is the problem, not bandwidth
Key Differences in HPI: Product Maturity/Position
 Intel Omni-Path Architecture (OPA)
– Intel’s Omni-Path is still in its infancy, with very few installations
– A handful of customers (including some big names), few, if any, in production
– Claims bandwidth/latency/message rate same or better than InfiniBand (covered later)
Key Differences in HPI: Product Maturity/Position
 InfiniBand
– Has been in the HPI market since the early 2000s
– Thousands of customers, millions of nodes
– Now makes up a large proportion of the Top500 list (187 systems)
– Synonymous with Mellanox these days
Key Differences in HPI: Technology
Onload vs. Offload
 Onload: the main CPU handles all network processing chores; the adapter and switches just pass messages. Examples:
– Intel Omni-Path Architecture, Ethernet
– Also PC servers and old UNIX systems, where the CPU handled every task and took an interrupt on every communication
 Offload: the HCA and switches handle all network processing tasks, with little or no need for main CPU cycles, allowing the CPU to keep processing applications (a sketch follows below). Examples:
– Mellanox InfiniBand
– Mainframes, whose communication-assist processors let the CPU process applications, not communications
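To make the Offload point concrete, here is a minimal sketch (mine, not from the deck; compute() and the buffer sizes are purely illustrative) of the usual pattern: post non-blocking MPI transfers, keep the CPU busy with application work, then wait. With an offload-capable HCA the fabric can progress the transfer while compute() runs; with Onload, the host CPU must also find cycles for the protocol work.

/* overlap.c – sketch of communication/computation overlap (assumes an even
 * number of ranks; run with, e.g., "mpirun -np 2 ./overlap") */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)                      /* 1M doubles per buffer */

static void compute(double *v, int n)    /* stand-in for real application work */
{
    for (int i = 0; i < n; i++)
        v[i] = v[i] * 1.000001 + 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *sendbuf = malloc(N * sizeof(double));
    double *recvbuf = malloc(N * sizeof(double));
    double *work    = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { sendbuf[i] = rank; recvbuf[i] = 0.0; work[i] = i; }

    int peer = rank ^ 1;                 /* pair up ranks 0-1, 2-3, ... */
    MPI_Request reqs[2];

    /* Post the exchange, then keep the CPU busy instead of waiting idly. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    compute(work, N);                    /* the overlap window */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    free(sendbuf); free(recvbuf); free(work);
    MPI_Finalize();
    return 0;
}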
Offload Details
 Network protocol load includes:
– Link Layer: packet layout, packet forwarding,
flow control, data integrity, QoS
– Network layer: adds header, routing of
packets from one subnet to another
– Transport layer: in-order packet delivery,
divides data into packets, receiver
reassembles packets, sends/receives
acknowledgements
– MPI operations: scatter, gather, broadcast,
etc.
 With offload, ALL of these operations are handled by the adapter hardware – for example, an InfiniBand HCA (a sketch follows below)
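The MPI operations listed above can also be issued in non-blocking form; the short sketch below (again mine, with arbitrary names and sizes) shows a non-blocking broadcast, which offload-capable adapters and switches can progress while every rank keeps computing, whereas with onload that progress comes out of host CPU cycles.

/* ibcast.c – sketch of a non-blocking MPI collective */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double params[4096] = {0};
    if (rank == 0)                              /* root fills the data to broadcast */
        for (int i = 0; i < 4096; i++)
            params[i] = i * 0.5;

    MPI_Request req;
    MPI_Ibcast(params, 4096, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

    /* ...independent per-rank work can proceed here while the broadcast progresses... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);          /* broadcast complete on every rank */

    MPI_Finalize();
    return 0;
}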
Onload Details
 Network protocol load includes:
– Link layer: packet layout, packet forwarding,
flow control, data integrity, QoS
– Network layer: header, routing of packets
from one subnet to another
– Transport layer: in-order packet delivery,
divides data into packets, receiver
reassembles packets, sends/receives
acknowledgements
– MPI operations: scatter, gather, broadcast,
etc.
 With onload, ALL of these operations are
performed by the host processor, using host
memory
Onload vs. Offload
 Onload vs. Offload
isn’t a big deal when
the cluster is
small…
Onload vs. Offload
 But it becomes a very big deal as the cluster grows larger
 It is particularly a problem for scatter/gather-type collective operations, where the head node can be overrun trying to process messages
Onload vs. Offload
 As node count increases, the performance of Onload will drop
– Higher node count = more messaging, and more pressure on the head node
 Node counts are increasing significantly
 Offload uses dedicated hardware ASICs
– Much faster at protocol processing than general-purpose CPUs (see the quick arithmetic below)
 MPI processing is not highly parallel
– With Onload, this means that messaging speed is limited by the speed of a single core
– This has no bearing on Offload speed
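The quick arithmetic referenced above (mine, not the deck’s; it assumes a 2.5 GHz core clock and uses the 200M messages/s figure quoted earlier for Mellanox):

\[
t_{\text{msg}} \;=\; \frac{1}{200 \times 10^{6}\ \text{msg/s}} \;=\; 5\ \text{ns},
\qquad
5\ \text{ns} \times 2.5\ \text{GHz} \;=\; 12.5\ \text{cycles per message}
\]

A budget of roughly a dozen clock cycles per message leaves no room for software protocol processing, which is why dedicated ASICs matter at these rates.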
Rampant FUD War
 Cost of HPI in cluster budget is
typically ~25% of total
 Prices in high tech typically
don’t increase over time
 Price points for new products are typically the same as those of the former high-end products they replace… e.g., high-end PCs, low-end servers, etc.
From: The Next Platform, “Intel Stretches Deep Learning On
Scalable System Framework”, 5/10/16
More FUD War…..
 All images provided by Intel, all from The Next Platform story “Intel Stretches Deep Learning on Scalable System Framework,” May 10th, 2016
 What else do these images have in common?
FUD Wars – Behind the Numbers
 It’s all in the fine print,
right?
 Here’s Intel’s fine print for
the graphs on the last
slide….
 “dapl” is the key – it is an Intel MPI fabric setting (shm:dapl in the fine print) that doesn’t allow for offload operations à la InfiniBand
…..48 port (B0 silicon). IOU Non-posted Prefetch disabled in BIOS.
Snoop hold-off timer = 9. EDR based on internal testing: Intel MPI
5.1.3, shm:dapl fabric, RHEL 7.2 -genv
I_MPI_DAPL_EAGER_MESSAGE_AGGREGATION off. Mellanox
EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox
SB7700 – 36 Port EDR InfiniBand switch. MLNX_OFED_LINUX-3.2-
2.0.0.0 (OFED-3.2-2.0.0). IOU Non-posted Prefetch enabled in BIOS.
1. osu_latency 8 B message. 2. osu_bw 1 MB message. 3.
osu_mbw_mr, 8 B………
Software and workloads used in performance tests may have been
optimized for performance only on Intel microprocessors.
Even More FUD
 100% CPU core utilization on an Offload HCA?!!
 Does anyone believe this?!!
 If that were true, about half of the Top500 systems would be absolutely useless
 Intel is using a CPU polling mechanism that pegs the CPU on the Mellanox box at 100%, yet it has nothing to do with actual network processing (see the sketch below)
 Both Intel and Mellanox
have benchmarked OPA at
~65% CPU utilization
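The sketch referenced above: a stand-alone illustration I have added (it is not Intel’s or Mellanox’s test code) of why a polled “CPU utilization” figure misleads. The OS charges a busy-poll loop a full core even though it performs no protocol processing at all, while a blocking wait over the same interval is charged almost nothing.

/* poll_vs_block.c – why "100% CPU" from a polling wait is not real protocol load */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double cpu_seconds(void)          /* CPU time actually consumed by this process */
{
    return (double)clock() / CLOCKS_PER_SEC;
}

int main(void)
{
    struct timespec start, now;
    double elapsed, cpu0;

    /* 1. Busy-poll for ~1 s of wall time: the core shows up as 100% busy. */
    cpu0 = cpu_seconds();
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);   /* stand-in for polling a completion queue */
        elapsed = (now.tv_sec - start.tv_sec) + (now.tv_nsec - start.tv_nsec) / 1e9;
    } while (elapsed < 1.0);
    printf("busy-poll:     %.2f s of CPU per 1 s waited\n", cpu_seconds() - cpu0);

    /* 2. Blocking wait for 1 s: the same wait consumes almost no CPU. */
    cpu0 = cpu_seconds();
    sleep(1);                                   /* stand-in for an interrupt-driven wait */
    printf("blocking wait: %.2f s of CPU per 1 s waited\n", cpu_seconds() - cpu0);

    return 0;
}

Which of the two numbers a benchmark reports depends on how the MPI library chooses to wait, not on how much work the fabric actually puts on the host CPU.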
FUD Aside, here are the numbers…
                    Intel OPA          Mellanox EDR / HDR InfiniBand
Bandwidth           100 Gb/s           100 Gb/s / 200 Gb/s
Latency (µs)        0.93               0.85 or less / 0.90 or less
Message rate        89 million/s*      150 million/s / 200 million/s
* this number, provided by Intel, has dropped from >150 million in 2015
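To put the bandwidth and message-rate rows in perspective, a rough calculation (mine; it ignores packet headers and all protocol overhead) for the 8-byte messages used in the osu_mbw_mr test cited in the fine print:

\[
\frac{100\ \text{Gb/s}}{8\ \text{B} \times 8\ \text{bits/B}} \;\approx\; 1.56 \times 10^{9}\ \text{msg/s}
\]

The wire could in principle carry far more 8-byte messages than any of the quoted message rates, so for small-message traffic it is the adapter’s message rate, not the link bandwidth, that is the binding limit.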
HPI Roadmaps
 InfiniBand roadmap
shows HDR now
(200Gb/s) and NDR
down the road
(400Gb/s? 2020?)
 Can’t find a solid Intel
OPA roadmap
 Ethernet roadmap
shows 200Gb/s in
2018-19
Major HPI Choices: OrionX analysis
                     Market                        Customer                      Product
Vendor               Presence  Trends  Overall     Readiness  Needs  Overall     Capabilities  Roadmap  Overall
Mellanox             9         9       9           8          9      8.5         9             10       9.5
Ethernet vendors     7         7       7           9          6      7.5         7             6        6.5
Intel                6         8       7           6          7      6.5         7             8        7.5
OrionX Constellation
[Chart: OrionX Constellation positioning of Mellanox, Intel, and the Ethernet vendors]
Vendor               Market    Product    Customer
Ethernet             7         6.5        7.5
Mellanox             9         9.5        8.5
Intel                7         7.5        6.5
OrionX Constellation™ reports
Questions?
Comments?
Concerns?