New Advancements in HPC Interconnect Technology
                               October 2012, HPC@mellanox.com
Leading Server and Storage Interconnect Provider

Markets: Web 2.0, Cloud, High-Performance Computing, Database, Storage

Comprehensive End-to-End 10/40/56Gb/s Ethernet and 56Gb/s InfiniBand Portfolio:
ICs, Adapter Cards, Switches/Gateways, Software, Cables
                  Scalability, Reliability, Power, Performance
Mellanox in the TOP500

InfiniBand has become the de facto interconnect solution for High-Performance Computing
  • Most used interconnect on the TOP500 list – 210 systems
  • FDR InfiniBand connects the fastest InfiniBand system on the TOP500 list – 2.9 Petaflops,
    91% efficiency (LRZ)


 InfiniBand connects more of the world’s sustained Petaflop systems than any other
  interconnect (40% - 8 systems out of 20)

 Most used interconnect solution in the TOP100, TOP200, TOP300, TOP400, TOP500
  •   Connects 47% (47 systems) of the TOP100 while Ethernet only 2% (2 systems)
  •   Connects 55% (110 systems) of the TOP200 while Ethernet only 10% (20 systems)
  •   Connects 52.7% (158 systems) of the TOP300 while Ethernet only 22.7% (68 systems)
  •   Connects 46.3% (185 systems) of the TOP400 while Ethernet only 33.8% (135 systems)


FDR InfiniBand-based systems have increased 10X since Nov'11


Mellanox InfiniBand Paves the Road to Exascale




[Figure: Petaflop-scale systems worldwide, Mellanox Connected]
Mellanox 56Gb/s FDR Technology




InfiniBand Roadmap




[Roadmap: 2002 – 10Gb/s, 2005 – 20Gb/s, 2008 – 40Gb/s, 2011 – 56Gb/s]
    Highest Performance, Reliability, Scalability, Efficiency
FDR InfiniBand New Features and Capabilities

Performance / Scalability:
•    >12GB/s bandwidth, <0.7usec latency
•    PCI Express 3.0
•    InfiniBand Routing and IB-Ethernet Bridging

Reliability / Efficiency:
•    Link bit encoding – 64/66
•    Forward Error Correction
•    Lower power consumption
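For context, a back-of-the-envelope view of what the 64/66 link encoding buys over QDR's 8b/10b on a 4x link (assuming the >12GB/s figure above refers to the bidirectional data rate):

    \text{QDR (8b/10b):}\quad 4 \times 10\,\text{Gb/s} \times \tfrac{8}{10} = 32\,\text{Gb/s} \approx 4.0\,\text{GB/s per direction}
    \text{FDR (64/66):}\quad 4 \times 14.0625\,\text{Gb/s} \times \tfrac{64}{66} \approx 54.5\,\text{Gb/s} \approx 6.8\,\text{GB/s per direction}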

Virtual Protocol Interconnect (VPI) Technology


VPI Adapter:
• Applications: Networking, Storage, Clustering, Management
• Acceleration Engines
• PCIe 3.0 host interface
• Ethernet: 10/40 Gb/s; InfiniBand: 10/20/40/56 Gb/s
• Form factors: LOM, Adapter Card, Mezzanine Card

VPI Switch:
• Unified Fabric Manager, Switch OS Layer
• Port configurations: 64 ports 10GbE, 36 ports 40GbE, 48 10GbE + 12 40GbE, 36 ports IB up to 56Gb/s
• 8 VPI subnets
FDR InfiniBand PCIe 3.0 vs QDR InfiniBand PCIe 2.0




                               Double the Bandwidth, Half the Latency
                                 120% Higher Application ROI
FDR InfiniBand Application Examples


[Charts: application examples showing 17%, 18% and 20% performance gains with FDR InfiniBand]
FDR InfiniBand Meets the Needs of a Changing Storage World

SSDs, the storage hierarchy, In-Memory Computing…..

Remote I/O access needs to be equal to local I/O access

[Diagram: SMB client running an I/O micro-benchmark to an SMB server over IB FDR, with the server backed by FusionIO SSDs]
            Native Throughput Performance over InfiniBand FDR
FDR/QDR InfiniBand Comparisons – Linpack Efficiency




•   Derived from the June 2012 TOP500 list
•   Highest and lowest outliers removed from each group
Powered by Mellanox FDR InfiniBand




Connect-IB
                               The Foundation for Exascale Computing




Roadmap of Interconnect Innovations

InfiniHost (2002): World's first InfiniBand HCA – 10Gb/s InfiniBand, PCI-X host interface, 1 million msg/sec
InfiniHost III (2005): World's first PCIe InfiniBand HCA – 20Gb/s InfiniBand, PCIe 1.0, 2 million msg/sec
ConnectX (1,2,3) (2008-11): World's first Virtual Protocol Interconnect (VPI) adapter – 40/56Gb/s InfiniBand, PCIe 2.0/3.0 x8, 33 million msg/sec
Connect-IB (June 2012): The Exascale Foundation
Connect-IB Performance Highlights

▪ World’s first 100Gb/s interconnect adapter
   • PCIe 3.0 x16, dual FDR 56Gb/s InfiniBand ports to provide >100Gb/s

▪ Highest InfiniBand message rate: 130 million messages per second
   • 4X higher than other InfiniBand solutions

▪ <0.7 micro-second application latency

▪ Supports GPUDirect RDMA for direct GPU-to-GPU communication

▪ Unmatchable Storage Performance
   • 8,000,000 IOPS (1 QP), 18,500,000 IOPS (32 QPs)

▪ Enhanced congestion-control mechanism

▪ Supports Scalable HPC with MPI, SHMEM and PGAS offloads
                     Enter the World of Boundless Performance
Mellanox Scalable Solutions
                                   The Co-Design Architecture




Mellanox ScalableHPC Accelerates Parallel Applications




MXM:
  -  Reliable Messaging Optimized for Mellanox HCAs
  -  Hybrid Transport Mechanism
  -  Efficient Memory Registration
  -  Receive-Side Tag Matching

FCA:
  -  Topology-Aware Collective Optimization
  -  Hardware Multicast
  -  Separate Virtual Fabric for Collectives
  -  CORE-Direct Hardware Offload

Both libraries are built on the InfiniBand Verbs API (see the sketch below).
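As a rough illustration of the Verbs layer that MXM and FCA sit on, here is a minimal libibverbs resource-setup sketch (C; error handling trimmed, queue depths are arbitrary illustrative values):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **list = ibv_get_device_list(&num);      /* enumerate HCAs */
        if (!list || num == 0) { fprintf(stderr, "no InfiniBand devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(list[0]);        /* open the first HCA */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* protection domain */
        struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0); /* completion queue */

        struct ibv_qp_init_attr attr = {
            .send_cq = cq, .recv_cq = cq,
            .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,                                 /* reliable connection */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);              /* queue pair */

        printf("opened %s, QP number 0x%x\n", ibv_get_device_name(list[0]), qp->qp_num);

        ibv_destroy_qp(qp); ibv_destroy_cq(cq); ibv_dealloc_pd(pd);
        ibv_close_device(ctx); ibv_free_device_list(list);
        return 0;
    }

Connecting the QP to a remote peer (exchanging LIDs/QPNs and transitioning QP states) is omitted here; MXM and FCA handle that plumbing internally.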

Mellanox MXM – HPCC Random Ring Latency




Mellanox MXM Scalability




Mellanox MPI Optimizations – Highest Scalability at LLNL




Mellanox MPI optimizations enable linear strong scaling for LLNL applications

                  World Leading Performance and Scalability
Fabric Collective Accelerations for MPI/SHMEM




Collective Operation Challenges at Large Scale

 Collective algorithms are not topology aware
  and can be inefficient




 Congestion due to many-to-many
  communications


[Chart: ideal vs. actual collective performance at scale]

 Slow nodes and OS jitter affect
  scalability and increase variability




Mellanox Collectives Acceleration Components

CORE-Direct
   • Adapter-based hardware offloading for collective operations
   • Includes floating-point capability on the adapter for data reductions
   • The CORE-Direct API is exposed through the Mellanox drivers


FCA
   • FCA is a software plug-in package that integrates into available MPIs (see the MPI sketch below)
   • Provides scalable topology-aware collective operations
   • Utilizes powerful InfiniBand multicast and QoS capabilities
   • Integrates CORE-Direct collective hardware offloads
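
Because FCA plugs in at the MPI-library level, application code stays plain MPI; the collectives below are the kind of calls that get redirected onto the fabric and adapter. A minimal sketch in standard MPI C (nothing in the code is FCA-specific):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank contributes one value; the global sum is a classic
           many-to-many collective that FCA/CORE-Direct can offload. */
        double local = (double)rank, global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);   /* another offload-eligible collective */

        if (rank == 0)
            printf("sum over %d ranks = %.0f\n", size, global);

        MPI_Finalize();
        return 0;
    }

Whether these calls go through FCA is decided at the MPI layer (for example through the MCA parameters of Mellanox's Open MPI builds), not in the application source.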



FCA collective performance with OpenMPI




Fabric Collective Accelerations Provide Linear Scalability

[Charts: Barrier collective latency (us), Reduce collective latency (us), and 8-byte Broadcast bandwidth (KB*processes) vs. number of processes (PPN=8), with and without FCA]
Application Example: CPMD and Amber

    CPMD and Amber are leading molecular dynamics applications

    Result: FCA accelerates CPMD by nearly 35% and Amber by 33%
       • At 16 nodes, 192 cores
       • Performance benefit increases with cluster size – higher benefit expected at larger scale




*Acknowledgment: HPC Advisory Council for providing the performance results             Higher is better
Accelerator and GPU Offloads




GPU Networking Communication (pre-GPUDirect)

In GPU-to-GPU communications, GPU applications use "pinned" buffers
  • A section of host memory dedicated to the GPU
  • Allows optimizations such as write-combining and overlapping GPU computation and data transfer for best performance


The InfiniBand verbs API also uses "pinned" buffers for efficient communication (see the sketch below)
  • Zero-copy data transfers, kernel bypass


The CPU is involved in the GPU data path
  • Memory copies between the different "pinned" buffers
  • Slows down GPU communications and creates a communication bottleneck
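To make "pinned" concrete on the InfiniBand side: ibv_reg_mr pins a buffer's pages and returns the keys the HCA needs for zero-copy transfers. A minimal sketch (the pd argument is assumed to come from the usual device/PD setup; in the pre-GPUDirect flow the CPU still copies between the GPU's staging buffer and this network buffer):

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Register a host buffer with the HCA for zero-copy RDMA.
       ibv_reg_mr pins the pages and returns lkey/rkey handles. */
    struct ibv_mr *register_pinned(struct ibv_pd *pd, void *buf, size_t len)
    {
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }

    /* Pre-GPUDirect data path: before posting a send, the CPU copies data
       from the GPU's pinned staging buffer into the registered network
       buffer -- the extra copy that GPUDirect later removes. */
    void stage_for_send(void *net_buf, const void *gpu_staging, size_t len)
    {
        memcpy(net_buf, gpu_staging, len);
    }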


[Diagram: pre-GPUDirect transmit and receive paths – data is copied between the GPU's pinned system-memory buffer and the InfiniBand pinned buffer (steps 1 and 2) before crossing the Mellanox InfiniBand fabric]
NVIDIA GPUDirect 1.0

GPUDirect 1.0 was a joint development between Mellanox and NVIDIA
  • Eliminates system memory copies for GPU communications across the network using the InfiniBand Verbs API
  • Reduces latency by 30% for GPU communications


GPUDirect 1.0 availability was announced in May 2010
  • Available in CUDA 4.0 and higher




[Diagram: GPUDirect 1.0 transmit and receive paths – the GPU and the InfiniBand adapter share the same pinned system-memory buffer (step 1), removing the CPU copy between pinned buffers]
GPUDirect 1.0 – Application Performance



LAMMPS
   • 3 nodes, 10% gain

[Charts: 3 nodes with 1 GPU per node vs. 3 nodes with 3 GPUs per node]




 Amber – Cellulose
   • 8 nodes, 32% gain




 Amber – FactorX
   • 8 nodes, 27% gain

GPUDirect RDMA Peer-to-Peer

GPUDirect RDMA (previously known as GPUDirect 3.0)
Enables peer-to-peer communication directly between the HCA and the GPU
Dramatically reduces overall latency for GPU-to-GPU communications (see the sketch below)


[Diagram: GPUDirect RDMA – on each node the Mellanox HCA reads/writes GPU GDDR5 memory directly over PCI Express 3.0, bypassing the CPU and system memory; the nodes are connected over Mellanox VPI]
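With GPUDirect RDMA and a CUDA-aware MPI, an application can hand device pointers straight to MPI and let the HCA read and write GPU memory directly. A sketch of that usage pattern (assumes an MPI build with CUDA support and at least two ranks; error checking omitted):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        float *dbuf;
        cudaMalloc((void **)&dbuf, n * sizeof(float));   /* buffer lives in GPU memory */

        /* With a CUDA-aware MPI and GPUDirect RDMA, the device pointer is passed
           directly; the HCA moves the data without staging it in host memory. */
        if (rank == 0)
            MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(dbuf);
        MPI_Finalize();
        return 0;
    }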
Optimizing GPU and Accelerator Communications

  NVIDIA GPUs
    • Mellanox was an original partner in the co-development of GPUDirect 1.0
    • Recently announced support of GPUDirect RDMA Peer-to-Peer GPU-to-HCA data path



  AMD GPUs
    • Sharing of System Memory: AMD DirectGMA Pinned supported today
    • AMD DirectGMA P2P: Peer-to-Peer GPU-to-HCA data path under development



  Intel MIC
    • MIC software development system enables the MIC to communicate directly over the
       InfiniBand verbs API to Mellanox devices




GPU as a Service

[Diagram: left – "GPUs in every server," each CPU paired with a local GPU; right – "GPUs as a Service," CPU nodes access a shared pool of GPUs through vGPUs over the fabric]


 GPUs as a network-resident service
  • Little to no overhead when using FDR InfiniBand


 Virtualize and decouple GPU services from CPU
  services
  • A new paradigm in cluster flexibility
  • Lower cost, lower power and ease of use with shared GPU
    resources
  • Remove difficult physical requirements of the GPU for
    standard compute servers


Accelerating Big Data Applications




Hadoop™ Map Reduce Framework for Unstructured Data

Map Reduce Programming Model
  [Diagram: raw data is split across mappers; each Map emits (key, 1) pairs, the pairs are sorted/shuffled by key, and each Reduce sums the values for its keys]

Map
    • The map function is applied to a set of input values and produces a set of key/value pairs
Reduce
    • The reduce function aggregates the key/value pair data into a scalar
    • Each reduce function receives all the data for an individual "key" from all the mappers (see the toy example below)
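
To make the model concrete, a toy word count in C that mimics the two phases above: "map" emits a (word, 1) pair per input record and "reduce" sums the values per key. This illustrates only the programming model (Hadoop jobs are normally written in Java against the MapReduce API), and the fixed input list is hypothetical:

    #include <stdio.h>
    #include <string.h>

    struct pair { const char *key; int value; };

    int main(void)
    {
        const char *input[] = { "car", "river", "car", "road", "river", "car" };
        const int n = sizeof(input) / sizeof(input[0]);

        /* Map phase: emit one (word, 1) pair per input record. */
        struct pair mapped[sizeof(input) / sizeof(input[0])];
        for (int i = 0; i < n; i++)
            mapped[i] = (struct pair){ input[i], 1 };

        /* Shuffle/Reduce phase: group the pairs by key and sum their values. */
        struct pair reduced[sizeof(input) / sizeof(input[0])];
        int nkeys = 0;
        for (int i = 0; i < n; i++) {
            int j;
            for (j = 0; j < nkeys; j++)
                if (strcmp(reduced[j].key, mapped[i].key) == 0) break;
            if (j == nkeys) reduced[nkeys++] = (struct pair){ mapped[i].key, 0 };
            reduced[j].value += mapped[i].value;
        }

        for (int i = 0; i < nkeys; i++)
            printf("%s: %d\n", reduced[i].key, reduced[i].value);   /* car: 3, river: 2, road: 1 */
        return 0;
    }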
Mellanox Unstructured Data Accelerator (UDA)

 Plug-in architecture for Hadoop
   • Hadoop applications are unmodified
   • Plug-in to Apache Hadoop
   • Enabled via XML configuration



Accelerates Map Reduce operations
   • Data communication over RDMA
     - Uses RDMA for in-memory processing
     - Kernel bypass minimizes CPU overhead for faster execution
   • Enables the Reduce operation to start in parallel with the Shuffle operation
     - Reduces disk I/O operations
   • Supports InfiniBand and Ethernet



Enables Competitive Advantage with Big Data Analytics Acceleration
UDA Performance Benefit for Map Reduce

Terasort Benchmark* (16GB data per node; execution time in seconds, lower is better)

[Chart: execution time for 1GE, 10GE and UDA over 10GE at 8, 10 and 12 nodes; annotations show a 45% reduction (~2X acceleration) with UDA]

*TeraSort is a popular benchmark used to measure the performance of a Hadoop cluster



                                                   Fastest Job Completion!
Accelerating Big Data Analytics – EMC/GreenPlum

 EMC 1000-Node Analytic Platform
 Accelerates Industry's Hadoop Development
 24 PetaByte of physical storage
   • Half of every written word since the inception of mankind
 Mellanox InfiniBand FDR Solutions




Hadoop Acceleration: 2X Faster Hadoop Job Run-Time


               High Throughput, Low Latency, RDMA Critical for ROI
Thank You
                               HPC@mellanox.com




