Mellanox hpc update @ hpcday 2012 kiev – Presentation Transcript

  • New Advancements in HPC Interconnect Technology – October 2012, HPC@mellanox.com
  • Leading Server and Storage Interconnect Provider – High-Performance Computing, Web 2.0, Cloud, Database, Storage
    - Comprehensive end-to-end 10/40/56Gb/s Ethernet and 56Gb/s InfiniBand portfolio: ICs, adapter cards, switches/gateways, software, cables
    - Scalability, reliability, power, performance
  • Mellanox in the TOP500
    - InfiniBand has become the de facto interconnect solution for High Performance Computing
    - Most used interconnect on the TOP500 list – 210 systems
    - FDR InfiniBand connects the fastest InfiniBand system on the TOP500 list – 2.9 Petaflops, 91% efficiency (LRZ)
    - InfiniBand connects more of the world's sustained Petaflop systems than any other interconnect (40% – 8 systems out of 20)
    - Most used interconnect solution in the TOP100, TOP200, TOP300, TOP400 and TOP500:
      • Connects 47% (47 systems) of the TOP100, while Ethernet connects only 2% (2 systems)
      • Connects 55% (110 systems) of the TOP200, while Ethernet connects only 10% (20 systems)
      • Connects 52.7% (158 systems) of the TOP300, while Ethernet connects only 22.7% (68 systems)
      • Connects 46.3% (185 systems) of the TOP400, while Ethernet connects only 33.8% (135 systems)
    - FDR InfiniBand based systems increased 10X since Nov'11
  • Mellanox InfiniBand Paves the Road to Exascale [chart: Petaflop systems, Mellanox Connected]
  • Mellanox 56Gb/s FDR Technology
  • InfiniBand Roadmap – 2002: 10Gb/s, 2005: 20Gb/s, 2008: 40Gb/s, 2011: 56Gb/s. Highest performance, reliability, scalability, efficiency.
  • FDR InfiniBand New Features and Capabilities (a bandwidth arithmetic sketch follows below)
    - Performance / Scalability: >12GB/s bandwidth, <0.7usec latency; PCI Express 3.0; InfiniBand routing and IB-Ethernet bridging
    - Reliability / Efficiency: 64/66 link bit encoding; Forward Error Correction; lower power consumption
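As a rough sanity check on the bandwidth figure above, the sketch below multiplies the standard FDR per-lane signaling rate (14.0625 Gb/s), the four lanes of a 4x port, and the 64/66 encoding efficiency. Under these assumptions a single 4x FDR port delivers roughly 6.8 GB/s per direction, so the >12GB/s number is consistent with bidirectional throughput; the inputs are spec-level assumptions, not figures taken from this deck.

```c
/* Back-of-the-envelope FDR 4x bandwidth arithmetic (assumed figures:
 * 14.0625 Gb/s per-lane signaling, 4 lanes, 64/66 link bit encoding). */
#include <stdio.h>

int main(void) {
    const double lane_gbps = 14.0625;     /* FDR per-lane signaling rate */
    const int    lanes     = 4;           /* standard 4x port */
    const double encoding  = 64.0 / 66.0; /* 64/66 encoding efficiency */

    double data_gbps = lane_gbps * lanes * encoding; /* per direction */
    double data_GBps = data_gbps / 8.0;              /* bits -> bytes */

    printf("FDR 4x data rate: %.2f Gb/s (%.2f GB/s) per direction\n",
           data_gbps, data_GBps);
    printf("Bidirectional:    %.2f GB/s\n", 2.0 * data_GBps);
    return 0;
}
```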
  • Virtual Protocol Interconnect (VPI) Technology
    - VPI Adapter: Ethernet 10/40 Gb/s or InfiniBand 10/20/40/56 Gb/s on the same device; PCIe 3.0; LOM, adapter card and mezzanine card form factors; acceleration engines for networking, storage, clustering and management applications
    - VPI Switch: switch OS layer; configurations of 64 ports 10GbE, 36 ports 40GbE, 48x 10GbE + 12x 40GbE, or 36 ports InfiniBand up to 56Gb/s; up to 8 VPI subnets
    - Unified Fabric Manager
  • FDR InfiniBand PCIe 3.0 vs. QDR InfiniBand PCIe 2.0 – double the bandwidth, half the latency; 120% higher application ROI
  • FDR InfiniBand Application Examples [charts: 17%, 20% and 18% performance gains]
  • FDR InfiniBand Meets the Needs of a Changing Storage World
    - SSDs, the storage hierarchy, in-memory computing
    - Remote I/O access needs to be equal to local I/O access
    - [diagram: SMB I/O micro-benchmark – SMB client and SMB server connected over IB FDR, each with FusionIO devices; native throughput performance over InfiniBand FDR]
  • FDR/QDR InfiniBand Comparisons – Linpack Efficiency
    - Derived from the June 2012 TOP500 list
    - Highest and lowest outlier removed from each group
  • Powered by Mellanox FDR InfiniBand
  • Connect-IB – The Foundation for Exascale Computing
  • Roadmap of Interconnect Innovations
    - InfiniHost (2002): world's first InfiniBand HCA; 10Gb/s InfiniBand; PCI-X host interface; 1 million msg/sec
    - InfiniHost III (2005): world's first PCIe InfiniBand HCA; 20Gb/s InfiniBand; PCIe 1.0; 2 million msg/sec
    - ConnectX (1,2,3) (2008-11): world's first Virtual Protocol Interconnect (VPI) adapter; 40/56Gb/s InfiniBand; PCIe 2.0, 3.0 x8; 33 million msg/sec
    - Connect-IB (June 2012): the Exascale foundation
  • Connect-IB Performance Highlights
    - World's first 100Gb/s interconnect adapter: PCIe 3.0 x16, dual FDR 56Gb/s InfiniBand ports to provide >100Gb/s
    - Highest InfiniBand message rate: 130 million messages per second – 4X higher than other InfiniBand solutions
    - <0.7 microsecond application latency
    - Supports GPUDirect RDMA for direct GPU-to-GPU communication
    - Unmatched storage performance: 8,000,000 IOPs (1 QP), 18,500,000 IOPs (32 QPs)
    - Enhanced congestion-control mechanism
    - Supports Scalable HPC with MPI, SHMEM and PGAS offloads
    - Enter the World of Boundless Performance
  • Mellanox Scalable Solutions – The Co-Design Architecture
  • Mellanox ScalableHPC Accelerates Parallel Applications (both libraries sit on the InfiniBand Verbs API; a minimal verbs sketch follows below)
    - MXM: reliable messaging optimized for Mellanox HCAs; hybrid transport mechanism; efficient memory registration; receive-side tag matching
    - FCA: topology-aware collective optimization; hardware multicast; separate virtual fabric for collectives; CORE-Direct hardware offload
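For readers unfamiliar with the verbs layer that MXM and FCA build on, here is a minimal, hedged sketch that simply enumerates the local HCAs and queries a few device attributes through libibverbs. It assumes an RDMA-capable host with libibverbs installed (link with -libverbs) and abbreviates error handling.

```c
/* Enumerate RDMA devices and print basic attributes via the verbs API. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr attr;
        if (ibv_query_device(ctx, &attr) == 0)
            printf("%s: %d port(s), max_qp=%d, max_cq=%d\n",
                   ibv_get_device_name(devs[i]),
                   attr.phys_port_cnt, attr.max_qp, attr.max_cq);

        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```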
  • Mellanox MXM – HPCC Random Ring Latency [chart]
  • Mellanox MXM Scalability [chart]
  • Mellanox MPI Optimizations – Highest Scalability at LLNL
    - Mellanox MPI optimizations enable linear strong scaling for an LLNL application
    - World-leading performance and scalability
  • Fabric Collectives Accelerations for MPI/SHMEM
  • Collective Operation Challenges at Large Scale
    - Collective algorithms are not topology aware and can be inefficient
    - Congestion due to many-to-many communications [diagram: ideal vs. actual]
    - Slow nodes and OS jitter affect scalability and increase variability
  • Mellanox Collectives Acceleration Components (an MPI collective sketch follows below)
    - CORE-Direct: adapter-based hardware offloading for collective operations; includes floating-point capability on the adapter for data reductions; the CORE-Direct API is exposed through the Mellanox drivers
    - FCA: a software plug-in package that integrates into available MPIs; provides scalable, topology-aware collective operations; utilizes powerful InfiniBand multicast and QoS capabilities; integrates the CORE-Direct collective hardware offloads
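The FCA offload is transparent to application code; what it accelerates are ordinary MPI collectives like the one sketched below. The example is a plain MPI_Allreduce in C; whether it runs through FCA/CORE-Direct or through host-based algorithms depends on how the MPI library is configured (for instance via an MCA parameter in Mellanox-provided Open MPI builds – the exact knob is an assumption here, not something this deck specifies).

```c
/* A simple MPI collective of the kind FCA/CORE-Direct can offload. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank, global = 0.0;

    /* Reduction across all ranks; with a collective offload enabled this
     * can be executed by the fabric/adapter instead of the host CPUs. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %.1f\n", size, global);

    MPI_Finalize();
    return 0;
}
```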
  • FCA collective performance with OpenMPI [charts]
  • Fabric Collective Accelerations Provide Linear Scalability [charts: Barrier and Reduce collective latency (us) vs. number of processes (PPN=8), and 8-byte Broadcast bandwidth (KB*processes) vs. number of processes, each with and without FCA]
  • Application Example: CPMD and Amber
    - CPMD and Amber are leading molecular dynamics applications
    - Result: FCA accelerates CPMD by nearly 35% and Amber by 33% (at 16 nodes, 192 cores; higher is better)
    - Performance benefit increases with cluster size – higher benefit expected at larger scale
    - *Acknowledgment: HPC Advisory Council for providing the performance results
  • Accelerator and GPU Offloads
  • GPU Networking Communication (pre-GPUDirect)
    - In GPU-to-GPU communications, GPU applications use "pinned" buffers: a section of host memory dedicated to the GPU, allowing optimizations such as write-combining and overlapping GPU computation with data transfer for best performance
    - The InfiniBand verbs API also uses "pinned" buffers for efficient communication: zero-copy data transfers and kernel bypass (see the registration sketch below)
    - The CPU is involved in the GPU data path: memory copies between the different "pinned" buffers slow down GPU communications and create a communication bottleneck
    - [diagram: transmit and receive paths – GPU memory to system-memory pinned buffers to Mellanox InfiniBand, with the CPU/chipset copying between buffers]
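The verbs-side "pinning" mentioned above corresponds to memory registration: the buffer's pages are locked and mapped for DMA by the HCA, so transfers from that buffer need no intermediate copy. A minimal sketch, assuming libibverbs and at least one RDMA device, with error handling abbreviated:

```c
/* Register (pin) a host buffer with the HCA for zero-copy RDMA. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) return 1;
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) return 1;

    size_t len = 1 << 20;          /* 1 MiB staging buffer */
    void *buf = malloc(len);

    /* Registration pins the pages and gives the HCA a DMA-able mapping,
     * so sends/receives from this buffer avoid intermediate copies. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr)
        printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
               len, mr->lkey, mr->rkey);

    if (mr) ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```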
  • NVIDIA GPUDirect 1.0
    - GPUDirect 1.0 was a joint development between Mellanox and NVIDIA
    - Eliminates system memory copies for GPU communications across the network using the InfiniBand verbs API
    - Reduces latency by 30% for GPU communications
    - GPUDirect 1.0 availability was announced in May 2010; available in CUDA-4.0 and higher
    - [diagram: transmit and receive paths with a shared pinned buffer per node – no CPU copy between the GPU and InfiniBand buffers]
  • GPUDirect 1.0 – Application Performance
    - LAMMPS: 3 nodes, 10% gain [chart: 3 nodes with 1 GPU per node and with 3 GPUs per node]
    - Amber (Cellulose): 8 nodes, 32% gain
    - Amber (FactorX): 8 nodes, 27% gain
  • GPUDirect RDMA – Peer-to-Peer
    - GPUDirect RDMA (previously known as GPUDirect 3.0)
    - Enables peer-to-peer communication directly between the HCA and the GPU
    - Dramatically reduces overall latency for GPU-to-GPU communications (see the registration sketch below)
    - [diagram: GPU and Mellanox HCA on the same PCI Express 3.0 root, moving data directly between GDDR5 GPU memory and the fabric, bypassing system memory]
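A hedged sketch of the GPUDirect RDMA idea: when the platform provides a peer-memory bridge between the CUDA driver and the RDMA stack (for example NVIDIA's nv_peer_mem module, which is an assumption about the deployment rather than something stated in this deck), the HCA can register GPU device memory directly, so RDMA operations target GPU memory without a host staging copy. Without that support, the registration below would simply fail.

```c
/* Attempt to register GPU device memory with the HCA (GPUDirect RDMA).
 * Assumes CUDA runtime + libibverbs and a peer-memory module; error
 * handling is abbreviated. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) return 1;
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) return 1;
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) return 1;

    void *gpu_buf = NULL;
    size_t len = 1 << 20;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) return 1;

    /* With GPUDirect RDMA support this registers the GPU memory itself;
     * without it, a host bounce buffer would be needed instead. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("GPU buffer registration %s\n", mr ? "succeeded" : "failed");

    if (mr) ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```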
  • Optimizing GPU and Accelerator Communications
    - NVIDIA GPUs: Mellanox was an original partner in the co-development of GPUDirect 1.0, and recently announced support for GPUDirect RDMA – a peer-to-peer GPU-to-HCA data path
    - AMD GPUs: sharing of system memory (AMD DirectGMA Pinned) is supported today; AMD DirectGMA P2P, a peer-to-peer GPU-to-HCA data path, is under development
    - Intel MIC: the MIC software development system enables the MIC to communicate directly over the InfiniBand verbs API to Mellanox devices
  • GPU as a Service
    - [diagram: "GPUs in every server" vs. "GPUs as a Service" – GPUs consolidated into a shared pool accessed by CPU nodes through virtual GPUs]
    - GPUs as a network-resident service, with little to no overhead when using FDR InfiniBand
    - Virtualize and decouple GPU services from CPU services: a new paradigm in cluster flexibility
    - Lower cost, lower power and ease of use with shared GPU resources
    - Removes difficult physical requirements of the GPU for standard compute servers
  • Accelerating Big Data Applications
  • Hadoop™ Map Reduce – Framework for Unstructured Data
    - Map Reduce programming model [diagram: raw data split across mappers, sorted by key, then aggregated by reducers]
    - Map: the map function is applied to a set of input values and calculates a set of key/value pairs
    - Reduce: the reduce function aggregates the key/value pair data into a scalar; each reducer receives all the data for an individual "key" from all the mappers
    - (a word-count sketch of this flow follows below)
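To make the map/sort/reduce flow concrete, here is a tiny single-process word-count in C; the input string and fixed-size arrays are made up for illustration and stand in for the distributed mappers and reducers the slide describes.

```c
/* Single-process illustration of the map -> sort/shuffle -> reduce flow. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct kv { char key[32]; int value; };   /* one (key, 1) pair from map */

static int cmp(const void *a, const void *b) {
    return strcmp(((const struct kv *)a)->key, ((const struct kv *)b)->key);
}

int main(void) {
    const char *input = "to be or not to be";  /* stand-in for raw data */
    struct kv pairs[64];
    int n = 0;

    /* Map: emit (word, 1) for every word in the raw data. */
    char buf[128];
    snprintf(buf, sizeof buf, "%s", input);
    for (char *w = strtok(buf, " "); w && n < 64; w = strtok(NULL, " ")) {
        snprintf(pairs[n].key, sizeof pairs[n].key, "%s", w);
        pairs[n++].value = 1;
    }

    /* Shuffle/sort: group identical keys together. */
    qsort(pairs, n, sizeof pairs[0], cmp);

    /* Reduce: aggregate all values for each key into a scalar. */
    for (int i = 0; i < n; ) {
        int sum = 0, j = i;
        while (j < n && strcmp(pairs[j].key, pairs[i].key) == 0)
            sum += pairs[j++].value;
        printf("%s\t%d\n", pairs[i].key, sum);
        i = j;
    }
    return 0;
}
```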
  • Mellanox Unstructured Data Accelerator (UDA)
    - Plug-in architecture for Hadoop: Hadoop applications are unmodified; plugs into Apache Hadoop; enabled via XML configuration
    - Accelerates Map Reduce operations: data communication over RDMA for in-memory processing, with kernel bypass minimizing CPU overhead for faster execution
    - Enables the Reduce operation to start in parallel with the Shuffle operation, reducing disk I/O
    - Supports InfiniBand and Ethernet
    - Enables competitive advantage with Big Data analytics acceleration
  • UDA Performance Benefit for Map Reduce – TeraSort Benchmark*
    - 16GB data per node [chart: execution time (sec, lower is better) at 8, 10 and 12 nodes for 1GE, 10GE and UDA 10GE – 45% improvement, ~2X acceleration, fastest job completion]
    - *TeraSort is a popular benchmark used to measure the performance of a Hadoop cluster
  • Accelerating Big Data Analytics – EMC/GreenPlum
    - EMC 1000-node analytic platform accelerates the industry's Hadoop development
    - 24 PetaBytes of physical storage – half of every written word since the inception of mankind
    - Mellanox InfiniBand FDR solutions: 2X faster Hadoop job run-time
    - High throughput, low latency and RDMA are critical for ROI
  • Thank You – HPC@mellanox.com