New Advancements in HPC Interconnect Technology
                               October 2012, HPC@mellanox.com
Leading Server and Storage Interconnect Provider

Markets: Web 2.0, Cloud, High-Performance Computing, Database, Storage

Comprehensive End-to-End 10/40/56Gb/s Ethernet and 56Gb/s InfiniBand Portfolio:
ICs, Adapter Cards, Switches/Gateways, Software, Cables
                  Scalability, Reliability, Power, Performance
Mellanox in the TOP500

InfiniBand has become the de facto interconnect solution for High-Performance Computing
  • Most used interconnect on the TOP500 list – 210 systems
  • FDR InfiniBand connects the fastest InfiniBand system on the TOP500 list – 2.9 Petaflops,
    91% efficiency (LRZ)


 InfiniBand connects more of the world’s sustained Petaflop systems than any other
  interconnect (40% - 8 systems out of 20)

 Most used interconnect solution in the TOP100, TOP200, TOP300, TOP400, TOP500
  •   Connects 47% (47 systems) of the TOP100 while Ethernet only 2% (2 systems)
  •   Connects 55% (110 systems) of the TOP200 while Ethernet only 10% (20 systems)
  •   Connects 52.7% (158 systems) of the TOP300 while Ethernet only 22.7% (68 systems)
  •   Connects 46.3% (185 systems) of the TOP400 while Ethernet only 33.8% (135 systems)


FDR InfiniBand-based systems have increased 10X since Nov'11


Mellanox InfiniBand Paves the Road to Exascale




[Figure: Petaflop-scale systems worldwide, Mellanox Connected]
Mellanox 56Gb/s FDR Technology




InfiniBand Roadmap




[Roadmap: 2002 – 10Gb/s, 2005 – 20Gb/s, 2008 – 40Gb/s, 2011 – 56Gb/s]
    Highest Performance, Reliability, Scalability, Efficiency
FDR InfiniBand New Features and Capabilities

Performance / Scalability:
•    >12GB/s bandwidth, <0.7usec latency
•    PCI Express 3.0
•    InfiniBand Routing and IB-Ethernet Bridging

Reliability / Efficiency:
•    Link bit encoding – 64/66
•    Forward Error Correction
•    Lower power consumption
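For context, a back-of-the-envelope view of what the 64/66 link encoding buys over QDR's 8b/10b on a 4x link (assuming the >12GB/s figure above refers to the bidirectional data rate):

    \text{QDR (8b/10b):}\quad 4 \times 10\,\text{Gb/s} \times \tfrac{8}{10} = 32\,\text{Gb/s} \approx 4.0\,\text{GB/s per direction}
    \text{FDR (64/66):}\quad 4 \times 14.0625\,\text{Gb/s} \times \tfrac{64}{66} \approx 54.5\,\text{Gb/s} \approx 6.8\,\text{GB/s per direction}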

Virtual Protocol Interconnect (VPI) Technology


VPI Adapter:
• Applications: Networking, Storage, Clustering, Management
• Acceleration Engines
• PCIe 3.0 host interface
• Ethernet: 10/40 Gb/s; InfiniBand: 10/20/40/56 Gb/s
• Form factors: LOM, Adapter Card, Mezzanine Card

VPI Switch:
• Unified Fabric Manager, Switch OS Layer
• Port configurations: 64 ports 10GbE, 36 ports 40GbE, 48 10GbE + 12 40GbE, 36 ports IB up to 56Gb/s
• 8 VPI subnets
FDR InfiniBand PCIe 3.0 vs QDR InfiniBand PCIe 2.0




                               Double the Bandwidth, Half the Latency
                                 120% Higher Application ROI
FDR InfiniBand Application Examples


[Charts: application examples showing 17%, 18% and 20% performance gains with FDR InfiniBand]
FDR InfiniBand Meets the Needs of a Changing Storage World

SSDs, the storage hierarchy, In-Memory Computing…..

Remote I/O access needs to be equal to local I/O access

[Diagram: SMB client running an I/O micro-benchmark to an SMB server over IB FDR, with the server backed by FusionIO SSDs]
            Native Throughput Performance over InfiniBand FDR
FDR/QDR InfiniBand Comparisons – Linpack Efficiency




•   Derived from the June 2012 TOP500 list
•   Highest and lowest outliers removed from each group
Powered by Mellanox FDR InfiniBand




Connect-IB
                               The Foundation for Exascale Computing




Roadmap of Interconnect Innovations

InfiniHost (2002): World's first InfiniBand HCA – 10Gb/s InfiniBand, PCI-X host interface, 1 million msg/sec
InfiniHost III (2005): World's first PCIe InfiniBand HCA – 20Gb/s InfiniBand, PCIe 1.0, 2 million msg/sec
ConnectX (1,2,3) (2008-11): World's first Virtual Protocol Interconnect (VPI) adapter – 40/56Gb/s InfiniBand, PCIe 2.0/3.0 x8, 33 million msg/sec
Connect-IB (June 2012): The Exascale Foundation
Connect-IB Performance Highlights

▪ World’s first 100Gb/s interconnect adapter
   • PCIe 3.0 x16, dual FDR 56Gb/s InfiniBand ports to provide >100Gb/s

▪ Highest InfiniBand message rate: 130 million messages per second
   • 4X higher than other InfiniBand solutions

▪ <0.7 micro-second application latency

▪ Supports GPUDirect RDMA for direct GPU-to-GPU communication

▪ Unmatchable Storage Performance
   • 8,000,000 IOPS (1 QP), 18,500,000 IOPS (32 QPs)

▪ Enhanced congestion-control mechanism

▪ Supports Scalable HPC with MPI, SHMEM and PGAS offloads
                     Enter the World of Boundless Performance
Mellanox Scalable Solutions
                                   The Co-Design Architecture




Mellanox ScalableHPC Accelerates Parallel Applications




MXM:
  -  Reliable Messaging Optimized for Mellanox HCAs
  -  Hybrid Transport Mechanism
  -  Efficient Memory Registration
  -  Receive-Side Tag Matching

FCA:
  -  Topology-Aware Collective Optimization
  -  Hardware Multicast
  -  Separate Virtual Fabric for Collectives
  -  CORE-Direct Hardware Offload

Both libraries are built on the InfiniBand Verbs API (see the sketch below).
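As a rough illustration of the Verbs layer that MXM and FCA sit on, here is a minimal libibverbs resource-setup sketch (C; error handling trimmed, queue depths are arbitrary illustrative values):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **list = ibv_get_device_list(&num);      /* enumerate HCAs */
        if (!list || num == 0) { fprintf(stderr, "no InfiniBand devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(list[0]);        /* open the first HCA */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* protection domain */
        struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0); /* completion queue */

        struct ibv_qp_init_attr attr = {
            .send_cq = cq, .recv_cq = cq,
            .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,                                 /* reliable connection */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);              /* queue pair */

        printf("opened %s, QP number 0x%x\n", ibv_get_device_name(list[0]), qp->qp_num);

        ibv_destroy_qp(qp); ibv_destroy_cq(cq); ibv_dealloc_pd(pd);
        ibv_close_device(ctx); ibv_free_device_list(list);
        return 0;
    }

Connecting the QP to a remote peer (exchanging LIDs/QPNs and transitioning QP states) is omitted here; MXM and FCA handle that plumbing internally.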

Mellanox MXM – HPCC Random Ring Latency




Mellanox MXM Scalability




Mellanox MPI Optimizations – Highest Scalability at LLNL




Mellanox MPI optimizations enable linear strong scaling for LLNL applications

                  World Leading Performance and Scalability
Fabric Collective Accelerations for MPI/SHMEM




Collective Operation Challenges at Large Scale

 Collective algorithms are not topology aware
  and can be inefficient




 Congestion due to many-to-many
  communications


[Chart: ideal vs. actual collective performance at scale]

 Slow nodes and OS jitter affect
  scalability and increase variability




Mellanox Collectives Acceleration Components

CORE-Direct
   • Adapter-based hardware offloading for collective operations
   • Includes floating-point capability on the adapter for data reductions
   • The CORE-Direct API is exposed through the Mellanox drivers


FCA
   • FCA is a software plug-in package that integrates into available MPIs (see the MPI sketch below)
   • Provides scalable topology-aware collective operations
   • Utilizes powerful InfiniBand multicast and QoS capabilities
   • Integrates CORE-Direct collective hardware offloads
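
Because FCA plugs in at the MPI-library level, application code stays plain MPI; the collectives below are the kind of calls that get redirected onto the fabric and adapter. A minimal sketch in standard MPI C (nothing in the code is FCA-specific):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank contributes one value; the global sum is a classic
           many-to-many collective that FCA/CORE-Direct can offload. */
        double local = (double)rank, global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);   /* another offload-eligible collective */

        if (rank == 0)
            printf("sum over %d ranks = %.0f\n", size, global);

        MPI_Finalize();
        return 0;
    }

Whether these calls go through FCA is decided at the MPI layer (for example through the MCA parameters of Mellanox's Open MPI builds), not in the application source.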



FCA collective performance with OpenMPI




Fabric Collective Accelerations Provide Linear Scalability

[Charts: Barrier collective latency (us), Reduce collective latency (us), and 8-byte Broadcast bandwidth (KB*processes) vs. number of processes (PPN=8), with and without FCA]
Application Example: CPMD and Amber

    CPMD and Amber are leading molecular dynamics applications

    Result: FCA accelerates CPMD by nearly 35% and Amber by 33%
       • At 16 nodes, 192 cores
       • Performance benefit increases with cluster size – higher benefit expected at larger scale




*Acknowledgment: HPC Advisory Council for providing the performance results             Higher is better
Accelerator and GPU Offloads




GPU Networking Communication (pre-GPUDirect)

In GPU-to-GPU communications, GPU applications use "pinned" buffers
  • A section of host memory dedicated to the GPU
  • Allows optimizations such as write-combining and overlapping GPU computation and data transfer for best performance


The InfiniBand verbs API also uses "pinned" buffers for efficient communication (see the sketch below)
  • Zero-copy data transfers, kernel bypass


The CPU is involved in the GPU data path
  • Memory copies between the different "pinned" buffers
  • Slows down GPU communications and creates a communication bottleneck
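To make "pinned" concrete on the InfiniBand side: ibv_reg_mr pins a buffer's pages and returns the keys the HCA needs for zero-copy transfers. A minimal sketch (the pd argument is assumed to come from the usual device/PD setup; in the pre-GPUDirect flow the CPU still copies between the GPU's staging buffer and this network buffer):

    #include <string.h>
    #include <infiniband/verbs.h>

    /* Register a host buffer with the HCA for zero-copy RDMA.
       ibv_reg_mr pins the pages and returns lkey/rkey handles. */
    struct ibv_mr *register_pinned(struct ibv_pd *pd, void *buf, size_t len)
    {
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }

    /* Pre-GPUDirect data path: before posting a send, the CPU copies data
       from the GPU's pinned staging buffer into the registered network
       buffer -- the extra copy that GPUDirect later removes. */
    void stage_for_send(void *net_buf, const void *gpu_staging, size_t len)
    {
        memcpy(net_buf, gpu_staging, len);
    }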


[Diagram: pre-GPUDirect transmit and receive paths – data is copied between the GPU's pinned system-memory buffer and the InfiniBand pinned buffer (steps 1 and 2) before crossing the Mellanox InfiniBand fabric]
NVIDIA GPUDirect 1.0

GPUDirect 1.0 was a joint development between Mellanox and NVIDIA
  • Eliminates system memory copies for GPU communications across the network using the InfiniBand Verbs API
  • Reduces latency by 30% for GPU communications


GPUDirect 1.0 availability was announced in May 2010
  • Available in CUDA 4.0 and higher




[Diagram: GPUDirect 1.0 transmit and receive paths – the GPU and the InfiniBand adapter share the same pinned system-memory buffer (step 1), removing the CPU copy between pinned buffers]
GPUDirect 1.0 – Application Performance



LAMMPS
   • 3 nodes, 10% gain

[Charts: 3 nodes with 1 GPU per node vs. 3 nodes with 3 GPUs per node]




 Amber – Cellulose
   • 8 nodes, 32% gain




 Amber – FactorX
   • 8 nodes, 27% gain

GPUDirect RDMA Peer-to-Peer

GPUDirect RDMA (previously known as GPUDirect 3.0)
Enables peer-to-peer communication directly between the HCA and the GPU
Dramatically reduces overall latency for GPU-to-GPU communications (see the sketch below)


[Diagram: GPUDirect RDMA – on each node the Mellanox HCA reads/writes GPU GDDR5 memory directly over PCI Express 3.0, bypassing the CPU and system memory; the nodes are connected over Mellanox VPI]
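With GPUDirect RDMA and a CUDA-aware MPI, an application can hand device pointers straight to MPI and let the HCA read and write GPU memory directly. A sketch of that usage pattern (assumes an MPI build with CUDA support and at least two ranks; error checking omitted):

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        float *dbuf;
        cudaMalloc((void **)&dbuf, n * sizeof(float));   /* buffer lives in GPU memory */

        /* With a CUDA-aware MPI and GPUDirect RDMA, the device pointer is passed
           directly; the HCA moves the data without staging it in host memory. */
        if (rank == 0)
            MPI_Send(dbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(dbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(dbuf);
        MPI_Finalize();
        return 0;
    }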
Optimizing GPU and Accelerator Communications

  NVIDIA GPUs
    • Mellanox was an original partner in the co-development of GPUDirect 1.0
    • Recently announced support of GPUDirect RDMA Peer-to-Peer GPU-to-HCA data path



  AMD GPUs
    • Sharing of System Memory: AMD DirectGMA Pinned supported today
    • AMD DirectGMA P2P: Peer-to-Peer GPU-to-HCA data path under development



  Intel MIC
    • MIC software development system enables the MIC to communicate directly over the
       InfiniBand verbs API to Mellanox devices




GPU as a Service

[Diagram: left – "GPUs in every server," each CPU paired with a local GPU; right – "GPUs as a Service," CPU nodes access a shared pool of GPUs through vGPUs over the fabric]


 GPUs as a network-resident service
  • Little to no overhead when using FDR InfiniBand


 Virtualize and decouple GPU services from CPU
  services
  • A new paradigm in cluster flexibility
  • Lower cost, lower power and ease of use with shared GPU
    resources
  • Remove difficult physical requirements of the GPU for
    standard compute servers


Accelerating Big Data Applications




Hadoop™ Map Reduce Framework for Unstructured Data

Map Reduce Programming Model
  [Diagram: raw data is split across mappers; each Map emits (key, 1) pairs, the pairs are sorted/shuffled by key, and each Reduce sums the values for its keys]

Map
    • The map function is applied to a set of input values and produces a set of key/value pairs
Reduce
    • The reduce function aggregates the key/value pair data into a scalar
    • Each reduce function receives all the data for an individual "key" from all the mappers (see the toy example below)
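
To make the model concrete, a toy word count in C that mimics the two phases above: "map" emits a (word, 1) pair per input record and "reduce" sums the values per key. This illustrates only the programming model (Hadoop jobs are normally written in Java against the MapReduce API), and the fixed input list is hypothetical:

    #include <stdio.h>
    #include <string.h>

    struct pair { const char *key; int value; };

    int main(void)
    {
        const char *input[] = { "car", "river", "car", "road", "river", "car" };
        const int n = sizeof(input) / sizeof(input[0]);

        /* Map phase: emit one (word, 1) pair per input record. */
        struct pair mapped[sizeof(input) / sizeof(input[0])];
        for (int i = 0; i < n; i++)
            mapped[i] = (struct pair){ input[i], 1 };

        /* Shuffle/Reduce phase: group the pairs by key and sum their values. */
        struct pair reduced[sizeof(input) / sizeof(input[0])];
        int nkeys = 0;
        for (int i = 0; i < n; i++) {
            int j;
            for (j = 0; j < nkeys; j++)
                if (strcmp(reduced[j].key, mapped[i].key) == 0) break;
            if (j == nkeys) reduced[nkeys++] = (struct pair){ mapped[i].key, 0 };
            reduced[j].value += mapped[i].value;
        }

        for (int i = 0; i < nkeys; i++)
            printf("%s: %d\n", reduced[i].key, reduced[i].value);   /* car: 3, river: 2, road: 1 */
        return 0;
    }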
Mellanox Unstructured Data Accelerator (UDA)

 Plug-in architecture for Hadoop
   • Hadoop applications are unmodified
   • Plug-in to Apache Hadoop
   • Enabled via XML configuration



Accelerates Map Reduce operations
   • Data communication over RDMA
     - Uses RDMA for in-memory processing
     - Kernel bypass minimizes CPU overhead for faster execution
   • Enables the Reduce operation to start in parallel with the Shuffle operation
     - Reduces disk I/O operations
   • Supports InfiniBand and Ethernet



Enables Competitive Advantage with Big Data Analytics Acceleration
UDA Performance Benefit for Map Reduce

Terasort Benchmark* (16GB data per node; execution time in seconds, lower is better)

[Chart: execution time for 1GE, 10GE and UDA over 10GE at 8, 10 and 12 nodes; annotations show a 45% reduction (~2X acceleration) with UDA]

*TeraSort is a popular benchmark used to measure the performance of a Hadoop cluster



                                                   Fastest Job Completion!
Accelerating Big Data Analytics – EMC/GreenPlum

 EMC 1000-Node Analytic Platform
 Accelerates Industry's Hadoop Development
 24 PetaByte of physical storage
   • Half of every written word since the inception of mankind
 Mellanox InfiniBand FDR Solutions




Hadoop Acceleration: 2X Faster Hadoop Job Run-Time


               High Throughput, Low Latency, RDMA Critical for ROI
Thank You
                               HPC@mellanox.com




