Mellanox Interconnect presentation in OpenPOWER Brazil workshop

© 2019 Mellanox Technologies | Confidential
Paving the Road to Exascale
March 2019
Interconnect Your Future
Guilherme Fuhrken
gfuhrken@Mellanox.com
+55-11-99934-2297

Exponential Data Growth Everywhere (Source: IDC)
Cloud & Web 2.0, Big Data, Enterprise, Business Intelligence, HPC, Storage, Security, AI & Machine Learning, Internet of Things
HPC: High-Performance Compute
AI: Artificial Intelligence

HPC and AI Need the Most Intelligent Interconnect
Higher Data Speeds
Faster Data Processing
Better Data Security
Adapters, Switches, Cables & Transceivers, SmartNIC, System on a Chip

Mellanox Accelerates Leading HPC and AI Systems
World’s Top 3 Supercomputers
1. Summit CORAL System: World’s Fastest HPC / AI System, 9.2K InfiniBand Nodes
2. Sierra CORAL System: #2 USA Supercomputer
3. Wuxi Supercomputing Center: Fastest Supercomputer in China, 41K InfiniBand Nodes

Mellanox Accelerates Leading HPC and AI Systems (Examples)
• JUWELS Supercomputer
• Fastest HPC / AI System in Japan
• The World's Fastest Industry Supercomputer
• Fastest Supercomputer in Canada (Dragonfly+ Topology)
• NASA Ames Research Center (20K InfiniBand Nodes)
• World’s Leading Industry Supercomputer

The Need for Intelligent and Faster Interconnect
CPU-Centric (Onload): Onload Network
• Must Wait for the Data
• Creates Performance Bottlenecks
Data-Centric (Offload): In-Network Computing
• Analyze Data as it Moves!
• Higher Performance and Scale
Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

In-Network Computing Delivers Highest Performance
• In-Network Computing (Scalable Hierarchical Aggregation and Reduction Protocol): Delivers Highest Application Performance
• Self-Healing Technology: Unbreakable Data Centers
• GPUDirect™ RDMA: GPU Acceleration Technology
Critical for HPC and Machine Learning Applications
Callouts: 10X Performance Acceleration, 35X Performance Acceleration, 5000X Faster Network Recovery

InfiniBand Delivers Best Return on Investment
• 30%-250% Higher Return on Investment
• Up to 50% Saving on Capital and Operation Expenses
• Highest Applications Performance, Scalability and Productivity
Chart: application areas are Molecular Dynamics, Genomics, Weather, Automotive and Chemistry; gains shown are 1.9X, 2X, 1.4X, 2.5X and 1.3X Better

Mellanox Supercharges Leading AI Companies
60% Higher ROI
50% Lower CapEx & OpEx
Unlocking the Power of Artificial Intelligence
RDMA Supercharges Leading AI Frameworks
Cognitive Toolkit

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
• Reliable Scalable General Purpose Primitive
  - In-network Tree based aggregation mechanism
  - Large number of groups
  - Multiple simultaneous outstanding operations
• Applicable to Multiple Use-cases
  - HPC Applications using MPI / SHMEM
  - Distributed Machine Learning applications
• Scalable High Performance Collective Offload
  - Barrier, Reduce, All-Reduce, Broadcast and more
  - Sum, Min, Max, Min-loc, Max-loc, OR, XOR, AND
  - Integer and Floating-Point, 16/32/64 bits
SHARP Tree
• SHARP Tree Root
• SHARP Tree Aggregation Node (Process running on HCA)
• SHARP Tree Endnode (Process running on HCA)
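
The tree-based aggregation described on this slide can be illustrated with a small host-side simulation. This is only a sketch of the general idea (end nodes contribute values, aggregation nodes combine them level by level, the root's result is sent back to everyone). It is not the SHARP API, and the function name `tree_allreduce` and its `radix` parameter are hypothetical.

```python
from functools import reduce
import operator

# A subset of the reduction operators listed on the slide.
OPS = {
    "sum": operator.add,
    "min": min,
    "max": max,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}

def tree_allreduce(values, op="sum", radix=2):
    """Reduce `values` up a tree with fan-in `radix`, then broadcast the result.

    Returns (per-node results, number of aggregation levels used).
    """
    combine = OPS[op]
    level = list(values)  # one contribution per end node
    levels = 0
    while len(level) > 1:
        # Each aggregation node combines up to `radix` children.
        level = [reduce(combine, level[i:i + radix])
                 for i in range(0, len(level), radix)]
        levels += 1
    root_value = level[0]
    # Broadcast step: every end node receives the same reduced value.
    return [root_value] * len(values), levels

results, depth = tree_allreduce([3, 1, 4, 1, 5, 9, 2, 6], op="sum", radix=2)
print(results[0], depth)  # prints "31 3"
```

With a larger fan-in the same eight values need fewer levels, which is the intuition behind doing the aggregation in high-radix switches rather than on the hosts.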

SHARP AllReduce Performance Advantages (128 Nodes)
SHARP enables 75% Reduction in Latency
Providing Scalable Flat Latency
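
As a quick worked check of what the 75% figure means, the sketch below converts a latency reduction into a speedup factor. It also counts the communication rounds of a host-based recursive-doubling allreduce at 128 nodes. That algorithm is an assumption chosen for comparison, since the slide does not name the software baseline. Both function names are hypothetical.

```python
def speedup_from_reduction(reduction_pct):
    """Convert a percentage latency reduction into a speedup factor."""
    remaining_fraction = 1.0 - reduction_pct / 100.0
    return 1.0 / remaining_fraction

def recursive_doubling_rounds(nodes):
    """Rounds needed by recursive doubling: log2(nodes) for a power-of-two count.

    For other counts this returns the ceiling of log2(nodes) as an approximation.
    """
    return (nodes - 1).bit_length()

print(speedup_from_reduction(75))      # prints 4.0
print(recursive_doubling_rounds(128))  # prints 7
```

A 75% reduction therefore corresponds to one quarter of the original latency. Under the stated assumption, the software baseline grows by one round each time the node count doubles, which is the growth that a flat latency curve avoids.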

Oak Ridge National Laboratory – CORAL Summit Supercomputer
SHARP AllReduce Performance Advantages
SHARP Enables Highest Performance

SHARP Performance – Application (OSU)
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/
Source: Prof. DK Panda, Ohio State University

SHARP Performance Advantage for AI
• SHARP provides 16% Performance Increase for deep learning, initial results
• TensorFlow with Horovod running ResNet50 benchmark, HDR InfiniBand (ConnectX-6, Quantum)

GPUDirect

10X Higher Performance with GPUDirect™ RDMA
Accelerates HPC and Deep Learning performance
Lowest communication latency for GPUs
GPUDirect™ RDMA

HDR InfiniBand

Highest-Performance 200Gb/s InfiniBand Solutions
• Transceivers, Active Optical and Copper Cables (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
• 40 HDR (200Gb/s) InfiniBand Ports, 80 HDR100 InfiniBand Ports: Throughput of 16Tb/s, <90ns Latency
• 200Gb/s Adapter: 0.6us latency, 215 million messages per second (10 / 25 / 40 / 50 / 56 / 100 / 200Gb/s)
• MPI, SHMEM/PGAS, UPC: For Commercial and Open Source Applications, Leverages Hardware Accelerations
• System on Chip and SmartNIC: Programmable adapter, Smart Offloads

ConnectX-6 HDR InfiniBand Adapter
Leading Connectivity
• 200Gb/s InfiniBand and Ethernet
  - HDR, HDR100, EDR (100Gb/s) and lower speeds
  - 200GbE, 100GbE and lower speeds
• Single and dual ports
Leading Performance
• 50Gb/s PAM4 SerDes
• 200Gb/s throughput, 0.6usec latency, 215 million messages per second
• PCIe Gen3 / Gen4, 32 lanes
• Integrated PCIe switch
• Multi-Host - up to 8 hosts, supporting 4 dual-socket servers
Leading Features
• In-network computing and memory for HPC collective offloads
• Security – Block-level encryption to storage, key management, FIPS
• Storage – NVMe Emulation, NVMe-oF target, Erasure coding, T10/DIF
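
The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes a 200Gb/s port is built from 50Gb/s PAM4 lanes. It also uses the standard PCIe signalling rates (8 GT/s per lane for Gen3, 16 GT/s for Gen4, both with 128b/130b encoding), which come from the PCIe specifications rather than from the slide. Packet and protocol overhead is ignored, and `pcie_gbps` is a hypothetical helper name.

```python
def pcie_gbps(gen, lanes):
    """Approximate PCIe bandwidth in Gb/s per direction, after line encoding."""
    gt_per_s = {3: 8.0, 4: 16.0}[gen]   # raw transfer rate per lane
    return gt_per_s * lanes * 128 / 130  # 128b/130b encoding

port_gbps = 200
serdes_gbps = 50
print(f"SerDes lanes per port: {port_gbps // serdes_gbps}")

for gen, lanes in [(3, 16), (3, 32), (4, 16)]:
    bw = pcie_gbps(gen, lanes)
    print(f"Gen{gen} x{lanes}: {bw:.1f} Gb/s, enough for {port_gbps}Gb/s: {bw >= port_gbps}")
```

Under these assumptions a Gen3 x16 link falls short of 200Gb/s, while Gen3 x32 or Gen4 x16 covers it. That is one plausible reading of why the slide lists 32 lanes.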

HDR InfiniBand Switch: QM8700, 1U Series
40 QSFP56 ports (50G PAM4 per lane)
• 40 ports of HDR, 200G
• 80 ports of HDR100, 100G
Superior performance
• 90ns latency
• 390M packets per sec (64B)
• 16Tb/s aggregate bandwidth
Superior resiliency
• 22" depth
• 6 fans (5+1), hot swappable
• 2 power supplies (1+1), hot swappable

HDR InfiniBand Switch: CS8500, Modular Series
800 QSFP56 ports
• 800 ports of HDR, 200G
• 1600 ports of HDR100, 100G
Superior performance
• 300ns latency
• 320Tb/s aggregate bandwidth
• LCD Tablet IO panel
Water-cooled solution
• Liquid – Liquid 4U CDU
• Liquid – Air 42U (350mm wide) stand alone HEX
• 0C – 35C (air) or 40C (water) operating air range
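
The aggregate bandwidth figures on the two HDR switch slides can be reproduced by multiplying port count by port speed. The sketch below assumes the quoted aggregate counts both directions of every port, an assumption that happens to match the slide numbers. It also assumes each HDR port can be split into two HDR100 ports. The function names are hypothetical.

```python
def aggregate_tbps(ports, port_gbps, bidirectional=True):
    """Aggregate switch bandwidth in Tb/s for `ports` ports of `port_gbps` each."""
    directions = 2 if bidirectional else 1
    return ports * port_gbps * directions / 1000

def hdr100_ports(hdr_ports):
    """HDR100 port count, assuming each 200G HDR port splits into two 100G ports."""
    return hdr_ports * 2

print(aggregate_tbps(40, 200), hdr100_ports(40))    # QM8700: prints "16.0 80"
print(aggregate_tbps(800, 200), hdr100_ports(800))  # CS8500: prints "320.0 1600"
```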

BlueField SoC
Advantages and Platforms

BlueField Block Diagram
• Tile Architecture - 16 ARM® A72 CPUs subsystem
  - SkyMesh™ fully coherent low-latency interconnect
  - 8MB L2 Cache, 8 Tiles
  - 48KB I-Cache, 32KB D-Cache per core
  - 12MB L3 Last Level Cache
  - ARM Frequency: 0.8GHz - 1.3GHz
• Dual Port 100G IO Controller, based on ConnectX-5
  - Dual 100Gb/s Ethernet/InfiniBand, compatible with ConnectX-5
  - NVMe-oF hardware accelerator
  - High-end Networking Offloads: RDMA, Erasure Coding, T10-DIF
• Fully Integrated PCIe switch
  - 32 Bifurcated PCIe Gen3/4 lanes (up to 200Gb/s)
  - Root Complex or Endpoint modes
  - 2x16, 4x8, 8x4 or 16x2 configurations
• Hardware Accelerators, Crypto Engines
  - Bulk crypto by A72 Neon ISA (AES, SHA)
  - Public Key acceleration, True RNG
• Memory Controllers
  - 2x Channels DDR4 Memory Controllers w/ ECC
  - NVDIMM-N Support
Diagram labels: Dual VPI Ports (Ethernet/InfiniBand: 1, 10, 25, 40, 50, 100G); 32 lanes PCIe Gen3/4
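
The bifurcation options listed for the integrated PCIe switch can be sanity-checked with a short script. It reads each "NxW" entry as N links of W lanes each, which is an interpretation of the slide's notation. The helper name `parse_bifurcation` is hypothetical.

```python
TOTAL_LANES = 32
CONFIGS = ["2x16", "4x8", "8x4", "16x2"]

def parse_bifurcation(config):
    """Split a string such as '4x8' into (number of links, lanes per link)."""
    links, width = (int(part) for part in config.split("x"))
    return links, width

for config in CONFIGS:
    links, width = parse_bifurcation(config)
    # Every listed option should account for all 32 lanes.
    print(f"{config}: {links} links x {width} lanes = {links * width} lanes")
```

Every listed configuration uses exactly 32 lanes, so the options differ only in how the same lane budget is divided among attached devices or hosts.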

BlueField for Smart Solutions
BlueField SoC (System on Chip)
• SoC: Compute, networking and PCIe connectivity
  - Dual port VPI EDR/100GbE
  - 16 Arm cores
  - 32 lanes of PCIe switch Gen3/4
Storage Solutions
• NVMe-based storage platforms
• RDMA, NVMe over Fabrics, RAID, Signature offload
• Partner solutions based on BlueField storage controller
Smart Adapters
• In-network computing and collective offloads
• Co-processor running proprietary smart algorithms
• Security and privacy algorithms

Mellanox Ethernet Switch Systems

Open Ethernet – The Freedom to Optimize
• Open APIs
• Automation
• End-to-End Interconnect
• Network-OS Choice: SONiC, SDK, Customer-OS

The only predictable 25/50/100Gb/s Ethernet switch
Spectrum: The Ultimate 25/100GbE Switch
Full wire speed, non-blocking switch
ZPL: Zero Packet Loss for all packet sizes
• Doesn’t drop packets per RFC2544
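
RFC 2544 defines throughput as the highest offered rate at which the device forwards every test frame it receives, measured across a set of frame sizes. The sketch below shows one common way to search for that rate, a binary search over offered load. The simulated device, `find_throughput` and `make_simulated_dut` are hypothetical stand-ins for a real traffic generator. Nothing here is Mellanox test tooling.

```python
# Ethernet frame sizes (bytes) recommended by RFC 2544.
RFC2544_FRAME_SIZES = [64, 128, 256, 512, 1024, 1280, 1518]

def find_throughput(send_trial, line_rate_pct=100.0, resolution_pct=0.1):
    """Binary-search the highest load (percent of line rate) with zero loss.

    `send_trial(rate_pct)` must return (frames_sent, frames_received).
    """
    sent, received = send_trial(line_rate_pct)
    if sent == received:
        return line_rate_pct          # no loss even at full line rate
    lo, hi = 0.0, line_rate_pct       # lo: loss-free bound, hi: lossy bound
    while hi - lo > resolution_pct:
        mid = (lo + hi) / 2
        sent, received = send_trial(mid)
        if sent == received:
            lo = mid
        else:
            hi = mid
    return lo

def make_simulated_dut(capacity_pct):
    """Hypothetical device that drops frames once load exceeds `capacity_pct`."""
    def send_trial(rate_pct, frames=1_000_000):
        if rate_pct <= capacity_pct:
            return frames, frames
        return frames, int(frames * capacity_pct / rate_pct)
    return send_trial

# A wire-speed device reports 100% at every frame size.
report = {size: find_throughput(make_simulated_dut(100.0))
          for size in RFC2544_FRAME_SIZES}
print(report[64], report[1518])
```

A switch that meets the zero-packet-loss claim on this slide would behave like the wire-speed case, while a device that saturates earlier would converge to a lower percentage.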

Open Ethernet SN2000 Series
• SN2700 – 32x100GbE (up to 64 x 50/25/10GbE): The Ideal 100GbE ToR / Aggregation
• SN2410 – 8x100GbE + 48x25GbE: 25GbE → 100GbE ToR
• SN2100 – 16x100GbE ports (up to 32 x 50GbE, 64 x 25/10GbE): Ideal storage / Database 25/100GbE Switch
• SN2010 – 4x100GbE + 18x10/25GbE: Ideal Hyperconverged Switch, 10/25GbE → 100GbE half width ToR
• Predictable Performance
• Fair Traffic Distribution for Cloud
• Best-in-Class Throughput, Latency, Power Consumption
• Zero Packet Loss
300ns latency
Energy efficiency: SN2700 – 169W, SN2410 – 165W, SN2100 – 94W
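
The "up to" port counts quoted for the SN2100 follow from splitting each 100GbE port into lower-speed ports. The sketch below assumes a 100GbE port is carried on four 25Gb/s lanes, so it can be broken out two ways or four ways. The lane layout is an assumption, since the slide gives only the resulting counts. The helper name `breakout` is hypothetical.

```python
def breakout(ports, port_gbps, split):
    """Return (port count, per-port Gb/s) after splitting each port `split` ways."""
    return ports * split, port_gbps // split

# SN2100: 16 x 100GbE ports
print(breakout(16, 100, 2))  # prints (32, 50)
print(breakout(16, 100, 4))  # prints (64, 25)
```

Both results match the 32 x 50GbE and 64 x 25GbE figures given for the SN2100.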

Open Ethernet SN3000 Series
Introducing speeds from 1GbE through 400GbE
• SN3800 – 64x100GbE (QSFP-28) 2U
• SN3700 – 32x200GbE (QSFP-56) 1U
• SN3510 – 48x25/50GbE (SFP-56) + 6x400GbE (QSFP-DD) 1U
• SN3200 – 16x400GbE (QSFP-DD) ½U
Roles shown on the slide: Ideal Spine/Super-spine; High Density Leaf/200GbE Spine/Super-spine; Ideal Leaf; Compact ½U Switch

End-to-End Solutions for All Platforms
Highest Performance and Scalability for Intel, AMD, IBM Power, NVIDIA, Arm and FPGA-based Compute and Storage Platforms at 10, 20, 25, 40, 50, 100, 200 and 400Gb/s Speeds
Unleashing the Power of All Compute Architectures
x86, OpenPOWER, GPU, ARM, FPGA

Thank You
Guilherme Fuhrken
gfuhrken@Mellanox.com
+55-11-99934-2297
