Co-Design Architecture for Exascale
2. © 2016 Mellanox Technologies 2
Co-Design Architecture to Enable Exascale Performance
§ CPU-centric design: limited to main CPU usage – results in performance limitations
§ Co-Design: creating synergies between software, in-CPU computing, in-network computing and in-storage computing – enables higher performance and scale
3. © 2016 Mellanox Technologies 3
The Intelligence is Moving to the Interconnect
Past: the intelligence resided in the CPU. Future: the intelligence moves to the interconnect.
4. © 2016 Mellanox Technologies 4
Intelligent Interconnect Delivers Higher Datacenter ROI
§ Network functions on the CPU: computing cycles are consumed by the network instead of by the users' applications
§ Smart network: network offloads move those functions into the interconnect, keeping the computing for applications and increasing datacenter value
5. © 2016 Mellanox Technologies 5
Breaking the Application Latency Wall
§ Today: Network device latencies are on the order of 100 nanoseconds
§ Challenge: Enabling the next order of magnitude improvement in application performance
§ Solution: Creating synergies between software and hardware – intelligent interconnect
Intelligent Interconnect Paves the Road to Exascale Performance
§ 10 years ago: network ~10 microseconds, communication framework ~100 microseconds
§ Today: network ~0.1 microseconds, communication framework ~10 microseconds
§ Future: co-design network ~0.05 microseconds, communication framework ~1 microsecond
6. © 2016 Mellanox Technologies 6
Introducing Switch-IB 2 World’s First Smart Switch
7. © 2016 Mellanox Technologies 7
Introducing Switch-IB 2 World’s First Smart Switch
§ The world's fastest switch, with <90 nanosecond latency
§ 36 ports, 100Gb/s per port, 7.2Tb/s throughput, 7.02 billion messages/sec
§ Adaptive routing, congestion control, support for multiple topologies
World's First Smart Switch
Built for Scalable Compute and Storage Infrastructures
10X Higher Performance with the New Switch SHArP Technology
8. © 2016 Mellanox Technologies 8
SHArP (Scalable Hierarchical Aggregation Protocol) Technology
Delivering 10X Performance Improvement
for MPI and SHMEM/PGAS Communications
Switch-IB 2 Enables the Switch Network to
Operate as a Co-Processor
SHArP Enables Switch-IB 2 to Manage and
Execute MPI Operations in the Network
9. © 2016 Mellanox Technologies 9
SHArP Performance Advantage
§ MiniFE is a finite element mini-application
• Implements kernels representative of implicit finite-element applications
10X to 25X Performance Improvement
MPI Allreduce Collective
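To make the offload concrete, here is a minimal MPI sketch (not from the original deck) of the Allreduce pattern SHArP accelerates: the application code is standard MPI, and a SHArP-enabled stack executes the reduction tree inside the switches underneath it.

```c
/* Minimal MPI_Allreduce example -- the collective SHArP offloads to the
 * switch fabric. The application code is unchanged; with a SHArP-enabled
 * MPI the reduction runs in Switch-IB 2 instead of on the CPUs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local = (double)rank;  /* each rank contributes one value */

    /* Sum across all ranks; the result is delivered to every rank. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", nprocs, global);

    MPI_Finalize();
    return 0;
}
```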
10. © 2016 Mellanox Technologies 10
The Intelligence is Moving to the Interconnect
Communication Frameworks (MPI, SHMEM/PGAS)
The Only Approach to Deliver 10X Performance Improvements
§ Interconnect offloads: application transport, RDMA, SR-IOV, collectives, Peer-Direct, GPUDirect and more
§ MPI / SHMEM offloads (rolling out Q1'16 and Q3'16)
11. © 2016 Mellanox Technologies 11
Multi-Host Socket Direct™ – Low Latency Socket Communication
§ Each CPU has direct network access
§ Avoids QPI for I/O – improves performance
§ Enables GPU / peer direct on both sockets
§ Solution is transparent to software
Multi-Host Socket Direct Performance
§ 50% lower CPU utilization
§ 20% lower latency
Multi-Host Evaluation Kit
Lower Application Latency, Free Up the CPU
12. © 2016 Mellanox Technologies 12
Introducing ConnectX-4 Lx Programmable Adapter
Scalable, Efficient, High-Performance and Flexible Solution
Security
Cloud/Virtualization
Storage
High Performance Computing
Precision Time Synchronization
Networking + FPGA
Mellanox Acceleration Engines and FPGA Programmability on One Adapter
13. © 2016 Mellanox Technologies 13
Mellanox InfiniBand Proven and Most Scalable HPC Interconnect
“Summit” System “Sierra” System
Paving the Road to Exascale
14. © 2016 Mellanox Technologies 14
NCAR-Wyoming Supercomputing Center (NWSC) – “Cheyenne”
§ Cheyenne supercomputer system
§ 5.34-petaflop SGI ICE XA Cluster
§ Intel “Broadwell” processors
§ More than 4K compute nodes
§ Mellanox EDR InfiniBand interconnect
§ Mellanox Unified Fabric Manager
§ Partial 9D Enhanced Hypercube interconnect topology
§ DDN SFA14KX systems
§ 20 petabytes of usable file system space
§ IBM GPFS (General Parallel File System)
15. © 2016 Mellanox Technologies 15
High-Performance 100Gb/s Interconnect Solutions
§ Adapters: 100Gb/s, 0.7µs latency, 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
§ InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, 7.2Tb/s throughput, 7.02 billion msg/sec (195M msg/sec/port)
§ Ethernet switch: 32 100GbE ports or 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), 6.4Tb/s throughput
§ Transceivers and cables: active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100Gb/s); VCSELs, silicon photonics and copper
16. © 2016 Mellanox Technologies 16
Leading Supplier of End-to-End Interconnect Solutions
Enabling the Use of Data – Store and Analyze
Comprehensive end-to-end InfiniBand and Ethernet portfolio (VPI): ICs, adapter cards, switches/gateways, software, cables/modules, Metro / WAN, NPU & multicore (NPS, TILE)
17. © 2016 Mellanox Technologies 17
The Performance Advantage of EDR 100G InfiniBand (28-80%)
18. © 2016 Mellanox Technologies 18
End-to-End Interconnect Solutions for All Platforms
Highest Performance and Scalability for
X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms
10, 20, 25, 40, 50, 56 and 100Gb/s Speeds
Smart Interconnect to Unleash The Power of All Compute Architectures
19. © 2016 Mellanox Technologies 19
Technology Roadmap – One-Generation Lead over the Competition
Timeline: 2000 → 2005 → 2010 → 2020, from Terascale to Petascale to Exascale
§ Interconnect speeds: 20G → 40G → 56G → 100G (2015) → 200G → Mellanox 400G
§ TOP500 2003: Virginia Tech (Apple) ranked 3rd – Mellanox Connected
§ "Roadrunner", the 1st Petascale system – Mellanox Connected
20. © 2016 Mellanox Technologies 20
OpenStack Over InfiniBand – Extreme Performance in the Cloud
§ Transparent InfiniBand integration into OpenStack
• Supported since the Havana release
§ RDMA directly from the VM via SR-IOV
§ MAC to GUID mapping
§ VLAN to PKey mapping
§ InfiniBand SDN network
§ Ideal fit for high performance computing clouds
InfiniBand Enables the Highest Performance and Efficiency
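As a hedged illustration of "RDMA directly from the VM": an SR-IOV virtual function passed into the guest shows up as an ordinary IB device, so standard libibverbs code runs unchanged. A minimal device-open sketch (assuming libibverbs in the guest; error handling trimmed):

```c
/* Minimal libibverbs sketch: inside a VM, an SR-IOV virtual function
 * appears as an ordinary IB device, so standard verbs code runs unchanged. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* Open the first device (the SR-IOV VF when running in a guest). */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    printf("opened device: %s\n", ibv_get_device_name(devs[0]));

    /* A protection domain is the starting point for RDMA resources
     * (queue pairs, memory regions). */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```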
21. © 2016 Mellanox Technologies 21
§ Mellanox end-to-end: ConnectX-4 NIC family, Switch-IB / Spectrum switches and 25/100Gb/s cables
§ Brings 100Gb/s speeds to the cloud with minimal CPU utilization
• Both VMs and hypervisors
• Accelerations are critical to reach line rate – SR-IOV, RDMA, etc.
25, 50 and 100Gb/s Clouds Are Here!
Measured: 92.412 Gb/s throughput at 0.71% CPU utilization
22. © 2016 Mellanox Technologies 22
Unified Communication – X Framework (UCX)
The Next Generation HPC Software Framework to Meet the Needs of Future Systems and Applications
23. © 2016 Mellanox Technologies 23
Exascale Co-Design Collaboration
Collaborative Effort
Industry, National Laboratories and Academia
The Next Generation
HPC Software Framework
24. © 2016 Mellanox Technologies 24
A Collaboration Effort
§ Mellanox co-designs network interface and contributes MXM technology
• Infrastructure, transport, shared memory, protocols, integration with OpenMPI/SHMEM, MPICH
§ ORNL co-designs network interface and contributes UCCS project
• InfiniBand optimizations, Cray devices, shared memory
§ NVIDIA co-designs high-quality support for GPU devices
• GPUDirect, GDR copy, etc.
§ IBM co-designs network interface and contributes ideas and concepts from PAMI
§ UH/UTK focus on integration with their research platforms
25. © 2016 Mellanox Technologies 25
Mellanox HPC-X™ Scalable HPC Software Toolkit
§ Complete MPI, PGAS/OpenSHMEM and UPC package
§ Maximize application performance
§ For commercial and open source applications
§ Based on UCX (Unified Communication – X Framework)
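A minimal sketch of the UCP layer such packages sit on, assuming the standard UCX API: initialize a communication context requesting tag matching, the primitive that MPI-style libraries build on.

```c
/* Minimal UCX (UCP) initialization sketch: create a communication context
 * requesting tag matching, the feature MPI-style libraries build on. */
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void)
{
    ucp_config_t *config;
    ucp_context_h context;
    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG,   /* tag-matched send/receive */
    };

    if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
        return 1;

    if (ucp_init(&params, config, &context) != UCS_OK) {
        ucp_config_release(config);
        return 1;
    }
    ucp_config_release(config);

    printf("UCX context initialized\n");
    ucp_cleanup(context);
    return 0;
}
```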
26. © 2016 Mellanox Technologies 26
Mellanox Delivers Highest MPI (HPC-X) Performance
Enabling Highest Applications Scalability and Performance
Mellanox ConnectX-4 Collectives Offload
27. © 2016 Mellanox Technologies 27
Mellanox Delivers Highest Applications Performance (HPC-X)
§ Quantum Espresso application
Test Case | # Nodes | Intel MPI time (s) | Bull MPI (HPC-X) time (s) | Gain
A         | 43      | 584                | 368                       | 37%
B         | 196     | 2592               | 998                       | 61%
Enabling Highest Applications Scalability and Performance
28. © 2016 Mellanox Technologies 28
Maximize Performance via Accelerator and GPU Offloads
GPUDirect RDMA Technology
29. © 2016 Mellanox Technologies 29
GPUs are Everywhere!
GPUDirect RDMA / Sync
[Diagram: CPU, chipset and system memory alongside the GPU and GPU memory – the network path reaches GPU memory directly]
30. © 2016 Mellanox Technologies 30
GPUDirect™ RDMA (GPUDirect 3.0), using PeerDirect™
§ Eliminates CPU bandwidth and latency bottlenecks
§ Uses remote direct memory access (RDMA) transfers between GPUs
§ Results in significantly improved MPI efficiency between GPUs in remote nodes
§ Based on PCIe PeerDirect technology
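A sketch of what using GPUDirect RDMA looks like from the application, assuming a CUDA-aware MPI build (the deck shows no code): device pointers go straight into MPI calls, and the HCA moves the data to and from GPU memory with no host staging copy.

```c
/* CUDA-aware MPI sketch: pass device pointers straight to MPI_Send/Recv.
 * With GPUDirect RDMA, the HCA reads/writes GPU memory directly, with no
 * staging copy through host memory. Requires a CUDA-aware MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    double *d_buf;
    const int n = 1 << 20;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, n * sizeof(double));  /* GPU memory */

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  /* device ptr */
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```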
31. © 2016 Mellanox Technologies 31
Mellanox GPUDirect RDMA Performance Advantage
§ HOOMD-blue is a general-purpose molecular dynamics simulation code accelerated on GPUs
§ GPUDirect RDMA allows direct peer-to-peer GPU communication over InfiniBand
• Unlocks performance between the GPU and InfiniBand
• Significantly decreases GPU-GPU communication latency
• Completely offloads all GPU communications across the network from the CPU
102% Improvement – 2X Application Performance!
32. © 2016 Mellanox Technologies 32
GPUDirect Sync (GPUDirect 4.0)
§ GPUDirect RDMA (3.0) – direct data path between the GPU and Mellanox interconnect
• Control path still uses the CPU
- CPU prepares and queues communication tasks on GPU
- GPU triggers communication on HCA
- Mellanox HCA directly accesses GPU memory
§ GPUDirect Sync (GPUDirect 4.0)
• Both the data path and the control path go directly between the GPU and the Mellanox interconnect
[Chart: 2D stencil benchmark – average time per iteration (µs) vs. number of nodes/GPUs (2 and 4); RDMA+PeerSync is 27% and 23% faster, respectively, than RDMA only]
Maximum Performance
For GPU Clusters
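As a hedged sketch of the control-path idea, CUDA stream memory operations (cuStreamWriteValue32 / cuStreamWaitValue32, available since CUDA 8) are one plausible building block for GPU-triggered communication; the real GPUDirect Sync plumbing is library-internal, and doorbell_dptr / completion_dptr below are hypothetical addresses pre-arranged with the NIC.

```c
/* Hedged sketch in the spirit of GPUDirect Sync: stream memory operations
 * let a CUDA stream ring a "doorbell" and wait on a completion flag that
 * the NIC polls/updates, removing the CPU from the critical path.
 * doorbell_dptr and completion_dptr are hypothetical mapped addresses. */
#include <cuda.h>

void enqueue_gpu_triggered_send(CUstream stream,
                                CUdeviceptr doorbell_dptr,
                                CUdeviceptr completion_dptr)
{
    /* After prior kernels in the stream finish, write the doorbell value
     * that tells the pre-posted network operation to go. */
    cuStreamWriteValue32(stream, doorbell_dptr, 1, 0);

    /* Block subsequent work in the stream until the NIC writes the
     * completion flag -- no CPU involvement in between. */
    cuStreamWaitValue32(stream, completion_dptr, 1,
                        CU_STREAM_WAIT_VALUE_EQ);
}
```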
33. © 2016 Mellanox Technologies 33
Remote GPU Access through rCUDA
GPU Servers → GPU as a Service
§ Client side: the application links against the rCUDA library, which forwards each CUDA call through the network interface
§ Server side: the rCUDA daemon receives the calls and executes them via the local CUDA driver and runtime on the node's GPUs
rCUDA provides remote access from every node to any GPU in the system
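Because rCUDA interposes on the CUDA API, the client needs no rCUDA-specific code: a plain CUDA program like the sketch below runs unmodified, with each call forwarded to the remote daemon (which server owns the GPU is set through rCUDA's deployment configuration, not in the code).

```c
/* Plain CUDA program: under rCUDA the same binary runs unmodified, the
 * rCUDA client library forwarding each CUDA call over the network to a
 * daemon on the GPU server. No rCUDA-specific calls appear here. */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main(void)
{
    const int n = 1024;
    float h[1024], *d;

    for (int i = 0; i < n; i++) h[i] = 1.0f;

    cudaMalloc((void **)&d, n * sizeof(float));   /* may be a remote GPU */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);   /* expect 2.0 */
    return 0;
}
```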
34. © 2016 Mellanox Technologies 34
Interconnect Architecture Comparison
Offload versus Onload (Non-Offload)
35. © 2016 Mellanox Technologies 35
Offload versus Onload (Non-Offload)
§ Two interconnect architectures exist – offload-based and onload-based
§ Offload architecture
• The interconnect manages and executes all network operations
• The interconnect can include application acceleration engines
• Offloads the CPU, freeing CPU cycles for the applications
• Development requires a large R&D investment
• Higher data center ROI
§ Onload architecture
• A CPU-centric approach – everything must be executed on and by the CPU
• The CPU is responsible for all network functions; the interconnect only pushes the data onto the wire
• Cannot support acceleration engines or RDMA; network transport is done by the CPU
• Onloads the CPU, reducing the CPU cycles available to the applications
• Does not require R&D investment or interconnect expertise
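One concrete payoff of offload is communication/computation overlap: a nonblocking collective can progress in the NIC and switches while the CPU computes. A minimal MPI sketch (illustrative only; compute_inner_cells is a hypothetical stand-in for application work):

```c
/* Overlap sketch: with offload hardware, the nonblocking allreduce
 * progresses in the interconnect while the CPU runs application work. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for application work overlapped with comm. */
static void compute_inner_cells(void) { }

int main(int argc, char **argv)
{
    int rank;
    double local, global;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = (double)rank;

    /* Start the reduction; with offload it runs on the NIC/switch. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    compute_inner_cells();             /* CPU cycles stay with the app */

    MPI_Wait(&req, MPI_STATUS_IGNORE); /* reduction result now ready */
    if (rank == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}
```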
36. © 2016 Mellanox Technologies 36
Sandia National Laboratory Paper – Offloading versus Onloading
37. © 2016 Mellanox Technologies 37
Interconnect Throughput – Offload versus Onload
With onload, network performance dramatically depends on CPU frequency – the offloading advantage in data throughput:
§ 20% higher at the common Xeon frequency (2.6GHz)
§ 250% higher at the common Xeon Phi frequency (~1GHz)
38. © 2016 Mellanox Technologies 38
Only Offload Architecture Can Enable Co-Processors
§ Offloading: highest performance at all CPU frequencies
§ Onloading: performance loss with lower CPU frequency (compare common Xeon vs. common Xeon Phi frequencies)
Onloading Technology Not Suitable for Co-Processors!
39. © 2016 Mellanox Technologies 39
Mellanox InfiniBand Leadership Over Omni-Path
§ Switch latency: 20% lower
§ Message rate: 44% higher
§ Power consumption per switch port: 25% lower
§ Scalability / CPU efficiency: 2X higher
§ 100Gb/s link speed shipping since 2014 – gain competitive advantage today
§ 200Gb/s link speed in 2017 – protect your future
Smart Network For Smart Systems
RDMA, Acceleration Engines, Programmability
Higher Performance
Unlimited Scalability
Higher Resiliency
Proven!