Co-Design Architecture for Exascale
2. © 2016 Mellanox Technologies 2
Co-Design Architecture to Enable Exascale Performance
§ CPU-centric design: limited to main CPU usage – results in performance limitations
§ Co-Design: creating synergies between software, in-CPU computing, in-network computing and in-storage computing – enables higher performance and scale
3. © 2016 Mellanox Technologies 3
The Intelligence is Moving to the Interconnect
Past: the intelligence resided in the CPU. Future: the intelligence moves to the interconnect.
4. © 2016 Mellanox Technologies 4
Intelligent Interconnect Delivers Higher Datacenter ROI
§ Network functions on the CPU: computing cycles are consumed by the network instead of by the users' applications
§ Smart network: network offloads move those functions into the interconnect, keeping the computing for applications and increasing datacenter value
5. © 2016 Mellanox Technologies 5
Breaking the Application Latency Wall
§ Today: Network device latencies are on the order of 100 nanoseconds
§ Challenge: Enabling the next order of magnitude improvement in application performance
§ Solution: Creating synergies between software and hardware – intelligent interconnect
Intelligent Interconnect Paves the Road to Exascale Performance
§ 10 years ago: network ~10 microseconds, communication framework ~100 microseconds
§ Today: network ~0.1 microseconds, communication framework ~10 microseconds
§ Future: co-design network ~0.05 microseconds, communication framework ~1 microsecond
6. © 2016 Mellanox Technologies 6
Introducing Switch-IB 2 World’s First Smart Switch
7. © 2016 Mellanox Technologies 7
Introducing Switch-IB 2 World’s First Smart Switch
§ The world's fastest switch, with <90 nanosecond latency
§ 36 ports, 100Gb/s per port, 7.2Tb/s throughput, 7.02 billion messages/sec
§ Adaptive routing, congestion control, support for multiple topologies
World's First Smart Switch
Built for Scalable Compute and Storage Infrastructures
10X Higher Performance with the New Switch SHArP Technology
8. © 2016 Mellanox Technologies 8
SHArP (Scalable Hierarchical Aggregation Protocol) Technology
Delivering 10X Performance Improvement
for MPI and SHMEM/PGAS Communications
Switch-IB 2 Enables the Switch Network to
Operate as a Co-Processor
SHArP Enables Switch-IB 2 to Manage and
Execute MPI Operations in the Network
9. © 2016 Mellanox Technologies 9
SHArP Performance Advantage
§ MiniFE is a finite element mini-application
• Implements kernels representative of implicit finite-element applications
10X to 25X Performance Improvement
MPI Allreduce Collective
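To make the offload concrete, here is a minimal MPI sketch (not from the original deck) of the Allreduce pattern SHArP accelerates: the application code is standard MPI, and a SHArP-enabled stack executes the reduction tree inside the switches underneath it.

```c
/* Minimal MPI_Allreduce example -- the collective SHArP offloads to the
 * switch fabric. The application code is unchanged; with a SHArP-enabled
 * MPI the reduction runs in Switch-IB 2 instead of on the CPUs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local = (double)rank;  /* each rank contributes one value */

    /* Sum across all ranks; the result is delivered to every rank. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", nprocs, global);

    MPI_Finalize();
    return 0;
}
```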
10. © 2016 Mellanox Technologies 10
The Intelligence is Moving to the Interconnect
Communication Frameworks (MPI, SHMEM/PGAS)
The Only Approach to Deliver 10X Performance Improvements
§ Interconnect offloads: application transport, RDMA, SR-IOV, collectives, Peer-Direct, GPUDirect and more
§ MPI / SHMEM offloads (rolling out Q1'16 and Q3'16)
11. © 2016 Mellanox Technologies 11
Multi-Host Socket Direct™ – Low Latency Socket Communication
§ Each CPU has direct network access
§ Avoids QPI for I/O – improves performance
§ Enables GPU / peer direct on both sockets
§ Solution is transparent to software
Multi-Host Socket Direct Performance
§ 50% lower CPU utilization
§ 20% lower latency
Multi-Host Evaluation Kit
Lower Application Latency, Free Up the CPU
12. © 2016 Mellanox Technologies 12
Introducing ConnectX-4 Lx Programmable Adapter
Scalable, Efficient, High-Performance and Flexible Solution
Security
Cloud/Virtualization
Storage
High Performance Computing
Precision Time Synchronization
Networking + FPGA
Mellanox Acceleration Engines and FPGA Programmability on One Adapter
13. © 2016 Mellanox Technologies 13
Mellanox InfiniBand Proven and Most Scalable HPC Interconnect
“Summit” System “Sierra” System
Paving the Road to Exascale
14. © 2016 Mellanox Technologies 14
NCAR-Wyoming Supercomputing Center (NWSC) – “Cheyenne”
§ Cheyenne supercomputer system
§ 5.34-petaflop SGI ICE XA Cluster
§ Intel “Broadwell” processors
§ More than 4K compute nodes
§ Mellanox EDR InfiniBand interconnect
§ Mellanox Unified Fabric Manager
§ Partial 9D Enhanced Hypercube interconnect topology
§ DDN SFA14KX systems
§ 20 petabytes of usable file system space
§ IBM GPFS (General Parallel File System)
15. © 2016 Mellanox Technologies 15
High-Performance 100Gb/s Interconnect Solutions
§ Adapters: 100Gb/s, 0.7µs latency, 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s)
§ InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, 7.2Tb/s throughput, 7.02 billion msg/sec (195M msg/sec/port)
§ Ethernet switch: 32 100GbE ports or 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), 6.4Tb/s throughput
§ Transceivers and cables: active optical and copper cables (10 / 25 / 40 / 50 / 56 / 100Gb/s); VCSELs, silicon photonics and copper
16. © 2016 Mellanox Technologies 16
Leading Supplier of End-to-End Interconnect Solutions
Enabling the Use of Data – Store and Analyze
Comprehensive end-to-end InfiniBand and Ethernet portfolio (VPI): ICs, adapter cards, switches/gateways, software, cables/modules, Metro / WAN, NPU & multicore (NPS, TILE)
17. © 2016 Mellanox Technologies 17
The Performance Advantage of EDR 100G InfiniBand (28-80%)
18. © 2016 Mellanox Technologies 18
End-to-End Interconnect Solutions for All Platforms
Highest Performance and Scalability for
X86, Power, GPU, ARM and FPGA-based Compute and Storage Platforms
10, 20, 25, 40, 50, 56 and 100Gb/s Speeds
Smart Interconnect to Unleash The Power of All Compute Architectures
19. © 2016 Mellanox Technologies 19
Technology Roadmap – One-Generation Lead over the Competition
Timeline: 2000 → 2005 → 2010 → 2020, from Terascale to Petascale to Exascale
§ Interconnect speeds: 20G → 40G → 56G → 100G (2015) → 200G → Mellanox 400G
§ TOP500 2003: Virginia Tech (Apple) ranked 3rd – Mellanox Connected
§ "Roadrunner", the 1st Petascale system – Mellanox Connected
20. © 2016 Mellanox Technologies 20
OpenStack Over InfiniBand – Extreme Performance in the Cloud
§ Transparent InfiniBand integration into OpenStack
• Supported since the Havana release
§ RDMA directly from the VM via SR-IOV
§ MAC to GUID mapping
§ VLAN to PKey mapping
§ InfiniBand SDN network
§ Ideal fit for high performance computing clouds
InfiniBand Enables the Highest Performance and Efficiency
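As a hedged illustration of "RDMA directly from the VM": an SR-IOV virtual function passed into the guest shows up as an ordinary IB device, so standard libibverbs code runs unchanged. A minimal device-open sketch (assuming libibverbs in the guest; error handling trimmed):

```c
/* Minimal libibverbs sketch: inside a VM, an SR-IOV virtual function
 * appears as an ordinary IB device, so standard verbs code runs unchanged. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* Open the first device (the SR-IOV VF when running in a guest). */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    printf("opened device: %s\n", ibv_get_device_name(devs[0]));

    /* A protection domain is the starting point for RDMA resources
     * (queue pairs, memory regions). */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```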
21. © 2016 Mellanox Technologies 21
§ Mellanox end-to-end: ConnectX-4 NIC family, Switch-IB / Spectrum switches and 25/100Gb/s cables
§ Brings 100Gb/s speeds to the cloud with minimal CPU utilization
• Both VMs and hypervisors
• Accelerations are critical to reach line rate – SR-IOV, RDMA, etc.
25, 50 and 100Gb/s Clouds Are Here!
Measured: 92.412 Gb/s throughput at 0.71% CPU utilization
22. © 2016 Mellanox Technologies 22
Unified Communication – X Framework (UCX)
The Next Generation HPC Software Framework to Meet the Needs of Future Systems and Applications
23. © 2016 Mellanox Technologies 23
Exascale Co-Design Collaboration
Collaborative Effort
Industry, National Laboratories and Academia
The Next Generation
HPC Software Framework
24. © 2016 Mellanox Technologies 24
A Collaboration Effort
§ Mellanox co-designs network interface and contributes MXM technology
• Infrastructure, transport, shared memory, protocols, integration with OpenMPI/SHMEM, MPICH
§ ORNL co-designs network interface and contributes UCCS project
• InfiniBand optimizations, Cray devices, shared memory
§ NVIDIA co-designs high-quality support for GPU devices
• GPUDirect, GDR copy, etc.
§ IBM co-designs network interface and contributes ideas and concepts from PAMI
§ UH/UTK focus on integration with their research platforms
25. © 2016 Mellanox Technologies 25
Mellanox HPC-X™ Scalable HPC Software Toolkit
§ Complete MPI, PGAS/OpenSHMEM and UPC package
§ Maximize application performance
§ For commercial and open source applications
§ Based on UCX (Unified Communication – X Framework)
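A minimal sketch of the UCP layer such packages sit on, assuming the standard UCX API: initialize a communication context requesting tag matching, the primitive that MPI-style libraries build on.

```c
/* Minimal UCX (UCP) initialization sketch: create a communication context
 * requesting tag matching, the feature MPI-style libraries build on. */
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void)
{
    ucp_config_t *config;
    ucp_context_h context;
    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG,   /* tag-matched send/receive */
    };

    if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
        return 1;

    if (ucp_init(&params, config, &context) != UCS_OK) {
        ucp_config_release(config);
        return 1;
    }
    ucp_config_release(config);

    printf("UCX context initialized\n");
    ucp_cleanup(context);
    return 0;
}
```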
26. © 2016 Mellanox Technologies 26
Mellanox Delivers Highest MPI (HPC-X) Performance
Enabling Highest Applications Scalability and Performance
Mellanox ConnectX-4 Collectives Offload
27. © 2016 Mellanox Technologies 27
Mellanox Delivers Highest Applications Performance (HPC-X)
§ Quantum Espresso application
Test Case | # Nodes | Intel MPI time (s) | Bull MPI (HPC-X) time (s) | Gain
A         | 43      | 584                | 368                       | 37%
B         | 196     | 2592               | 998                       | 61%
Enabling Highest Applications Scalability and Performance
28. © 2016 Mellanox Technologies 28
Maximize Performance via Accelerator and GPU Offloads
GPUDirect RDMA Technology
29. © 2016 Mellanox Technologies 29
GPUs are Everywhere!
GPUDirect RDMA / Sync
[Diagram: CPU, chipset and system memory alongside the GPU and GPU memory – the network path reaches GPU memory directly]
30. © 2016 Mellanox Technologies 30
GPUDirect™ RDMA (GPUDirect 3.0), using PeerDirect™
§ Eliminates CPU bandwidth and latency bottlenecks
§ Uses remote direct memory access (RDMA) transfers between GPUs
§ Results in significantly improved MPI efficiency between GPUs in remote nodes
§ Based on PCIe PeerDirect technology
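A sketch of what using GPUDirect RDMA looks like from the application, assuming a CUDA-aware MPI build (the deck shows no code): device pointers go straight into MPI calls, and the HCA moves the data to and from GPU memory with no host staging copy.

```c
/* CUDA-aware MPI sketch: pass device pointers straight to MPI_Send/Recv.
 * With GPUDirect RDMA, the HCA reads/writes GPU memory directly, with no
 * staging copy through host memory. Requires a CUDA-aware MPI library. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    double *d_buf;
    const int n = 1 << 20;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, n * sizeof(double));  /* GPU memory */

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  /* device ptr */
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```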
31. © 2016 Mellanox Technologies 31
Mellanox GPUDirect RDMA Performance Advantage
§ HOOMD-blue is a general-purpose molecular dynamics simulation code accelerated on GPUs
§ GPUDirect RDMA allows direct peer-to-peer GPU communication over InfiniBand
• Unlocks performance between the GPU and InfiniBand
• Significantly decreases GPU-GPU communication latency
• Completely offloads all GPU communications across the network from the CPU
102% Improvement – 2X Application Performance!
32. © 2016 Mellanox Technologies 32
GPUDirect Sync (GPUDirect 4.0)
§ GPUDirect RDMA (3.0) – direct data path between the GPU and Mellanox interconnect
• Control path still uses the CPU
- CPU prepares and queues communication tasks on GPU
- GPU triggers communication on HCA
- Mellanox HCA directly accesses GPU memory
§ GPUDirect Sync (GPUDirect 4.0)
• Both the data path and the control path go directly between the GPU and the Mellanox interconnect
[Chart: 2D stencil benchmark – average time per iteration (µs) vs. number of nodes/GPUs (2 and 4); RDMA+PeerSync is 27% and 23% faster, respectively, than RDMA only]
Maximum Performance
For GPU Clusters
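As a hedged sketch of the control-path idea, CUDA stream memory operations (cuStreamWriteValue32 / cuStreamWaitValue32, available since CUDA 8) are one plausible building block for GPU-triggered communication; the real GPUDirect Sync plumbing is library-internal, and doorbell_dptr / completion_dptr below are hypothetical addresses pre-arranged with the NIC.

```c
/* Hedged sketch in the spirit of GPUDirect Sync: stream memory operations
 * let a CUDA stream ring a "doorbell" and wait on a completion flag that
 * the NIC polls/updates, removing the CPU from the critical path.
 * doorbell_dptr and completion_dptr are hypothetical mapped addresses. */
#include <cuda.h>

void enqueue_gpu_triggered_send(CUstream stream,
                                CUdeviceptr doorbell_dptr,
                                CUdeviceptr completion_dptr)
{
    /* After prior kernels in the stream finish, write the doorbell value
     * that tells the pre-posted network operation to go. */
    cuStreamWriteValue32(stream, doorbell_dptr, 1, 0);

    /* Block subsequent work in the stream until the NIC writes the
     * completion flag -- no CPU involvement in between. */
    cuStreamWaitValue32(stream, completion_dptr, 1,
                        CU_STREAM_WAIT_VALUE_EQ);
}
```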
33. © 2016 Mellanox Technologies 33
Remote GPU Access through rCUDA
GPU Servers → GPU as a Service
§ Client side: the application links against the rCUDA library, which forwards each CUDA call through the network interface
§ Server side: the rCUDA daemon receives the calls and executes them via the local CUDA driver and runtime on the node's GPUs
rCUDA provides remote access from every node to any GPU in the system
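Because rCUDA interposes on the CUDA API, the client needs no rCUDA-specific code: a plain CUDA program like the sketch below runs unmodified, with each call forwarded to the remote daemon (which server owns the GPU is set through rCUDA's deployment configuration, not in the code).

```c
/* Plain CUDA program: under rCUDA the same binary runs unmodified, the
 * rCUDA client library forwarding each CUDA call over the network to a
 * daemon on the GPU server. No rCUDA-specific calls appear here. */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main(void)
{
    const int n = 1024;
    float h[1024], *d;

    for (int i = 0; i < n; i++) h[i] = 1.0f;

    cudaMalloc((void **)&d, n * sizeof(float));   /* may be a remote GPU */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);   /* expect 2.0 */
    return 0;
}
```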
34. © 2016 Mellanox Technologies 34
Interconnect Architecture Comparison
Offload versus Onload (Non-Offload)
35. © 2016 Mellanox Technologies 35
Offload versus Onload (Non-Offload)
§ Two interconnect architectures exist – offload-based and onload-based
§ Offload architecture
• The interconnect manages and executes all network operations
• The interconnect can include application acceleration engines
• Offloads the CPU, freeing CPU cycles for the applications
• Development requires a large R&D investment
• Higher data center ROI
§ Onload architecture
• A CPU-centric approach – everything must be executed on and by the CPU
• The CPU is responsible for all network functions; the interconnect only pushes the data onto the wire
• Cannot support acceleration engines or RDMA; network transport is done by the CPU
• Onloads the CPU, reducing the CPU cycles available to the applications
• Does not require R&D investment or interconnect expertise
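One concrete payoff of offload is communication/computation overlap: a nonblocking collective can progress in the NIC and switches while the CPU computes. A minimal MPI sketch (illustrative only; compute_inner_cells is a hypothetical stand-in for application work):

```c
/* Overlap sketch: with offload hardware, the nonblocking allreduce
 * progresses in the interconnect while the CPU runs application work. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for application work overlapped with comm. */
static void compute_inner_cells(void) { }

int main(int argc, char **argv)
{
    int rank;
    double local, global;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = (double)rank;

    /* Start the reduction; with offload it runs on the NIC/switch. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    compute_inner_cells();             /* CPU cycles stay with the app */

    MPI_Wait(&req, MPI_STATUS_IGNORE); /* reduction result now ready */
    if (rank == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}
```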
36. © 2016 Mellanox Technologies 36
Sandia National Laboratory Paper – Offloading versus Onloading
37. © 2016 Mellanox Technologies 37
Interconnect Throughput – Offload versus Onload
With onload, network performance dramatically depends on CPU frequency – the offloading advantage in data throughput:
§ 20% higher at the common Xeon frequency (2.6GHz)
§ 250% higher at the common Xeon Phi frequency (~1GHz)
38. © 2016 Mellanox Technologies 38
Only Offload Architecture Can Enable Co-Processors
§ Offloading: highest performance at all CPU frequencies
§ Onloading: performance loss with lower CPU frequency (compare common Xeon vs. common Xeon Phi frequencies)
Onloading Technology Not Suitable for Co-Processors!
39. © 2016 Mellanox Technologies 39
Mellanox InfiniBand Leadership Over Omni-Path
§ Switch latency: 20% lower
§ Message rate: 44% higher
§ Power consumption per switch port: 25% lower
§ Scalability / CPU efficiency: 2X higher
§ 100Gb/s link speed shipping since 2014 – gain competitive advantage today
§ 200Gb/s link speed in 2017 – protect your future
Smart Network For Smart Systems
RDMA, Acceleration Engines, Programmability
Higher Performance
Unlimited Scalability
Higher Resiliency
Proven!