Voltaire fca en_nov10: Presentation Transcript

  • Voltaire Fabric Collective Accelerator™ (FCA)
    Ghislain de Jacquelot – ghislaindj@voltaire.com
    November 19, 2010
    © 2010 Voltaire Inc.
  • MPI Collectives
    ► Collective operations = group communication (all-to-all, one-to-all, all-to-one)
    ► Synchronous by nature, so they consume many "wait" cycles on large clusters
    ► Popular examples: Reduce, Allreduce, Barrier, Bcast, Gather, Allgather
    [Chart: collective operations as a percentage of MPI job runtime for ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-adapco STAR-CD and Dacapo]
    Your cluster might be spending half its time on idle collective cycles.
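A minimal C/MPI sketch of the blocking collectives listed above; every rank must reach the call before any rank can continue, which is where the "wait" cycles accumulate on large clusters (the values reduced here are purely illustrative):

```c
/* Minimal illustration of blocking MPI collectives (build with: mpicc collectives.c) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;   /* each rank contributes one value */
    double sum   = 0.0;

    /* All ranks block here until every contribution has been combined. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Explicit synchronization point: no rank proceeds until all arrive. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", size, sum);

    MPI_Finalize();
    return 0;
}
```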
  • The Challenge: Collective Operations Scalability
    ► Grouping algorithms are topology-unaware and inefficient
    ► Network congestion due to "all-to-all" communication
    ► Slow nodes and OS involvement impair scalability and predictability
    ► The more powerful servers get (GPUs, more cores), the worse collectives scale in the fabric
    [Chart: expected vs. actual collective scaling]
  • The Voltaire InfiniBand Fabric: Equipped for the Challenge
    ► Grid Director switches: fabric processing power
    ► Unified Fabric Manager (UFM): topology-aware orchestrator
    [Diagram: two-tier fabric of Grid Director 4036 switches managed by UFM]
    Fabric computing is used to address the collective challenge.
  • Introducing: Voltaire Fabric Collective Accelerator
    ► FCA Manager: topology-based collective tree, separate virtual network for result distribution (IB multicast), integration with job schedulers
    ► Grid Director switches: collective operations offloaded to switch CPUs
    ► FCA Agent: inter-core processing localized and optimized
    Breakthrough performance with no additional hardware.
  • Efficient Collectives with FCA
    1. Pre-configuration
    2. Inter-core processing
    3. First-tier offload
    4. Second-tier offload (result at root)
    5. Result distribution (single message)
    6. Allreduce on 100K cores in 25 usec
    [Diagram: staged reduction across a two-tier fabric of Grid Director 4036 switches]
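The slide describes a staged reduction: combine within each node, then up the switch tiers, then distribute a single result. The sketch below mimics that structure in software only, using per-node sub-communicators as a stand-in for the switch-CPU offload; it is not the FCA code path, and it uses the MPI-3 call MPI_Comm_split_type, which post-dates this 2010 deck:

```c
/* Two-level reduction sketch: reduce inside each node, reduce across node
 * leaders, then broadcast the result back - a software stand-in for the
 * staged offload described on the slide, not the actual FCA implementation. */
#include <mpi.h>

double staged_allreduce_sum(double local, MPI_Comm world)
{
    int rank;
    MPI_Comm_rank(world, &rank);

    /* Step 2 on the slide: inter-core processing - group ranks sharing a node. */
    MPI_Comm node;
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node);

    int node_rank;
    MPI_Comm_rank(node, &node_rank);

    double node_sum = 0.0;
    MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node);

    /* Steps 3-4: only node leaders take part in the fabric-level reduction. */
    MPI_Comm leaders;
    MPI_Comm_split(world, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    double total = 0.0;
    if (node_rank == 0)
        MPI_Allreduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, leaders);

    /* Step 5: distribute the single result back to every core on the node. */
    MPI_Bcast(&total, 1, MPI_DOUBLE, 0, node);

    if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    MPI_Comm_free(&node);
    return total;
}
```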
  • FCA Benefits: Slashing Job Runtime
    ► Slashing runtime: IMB Allreduce on 2048 cores takes >3000 usec with plain Open MPI versus <30 usec with FCA
    ► Eliminating runtime variation:
      • OS jitter - eliminated in the switches
      • Traffic congestion - significantly lower number of messages
      • Cross-application interference - collectives offloaded to a private virtual network
    [Charts: IMB Allreduce latency at 2048 cores; completion-time distribution for server-based vs. FCA-based collectives]
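The runtime figures above come from latency benchmarking; a rough way to reproduce that kind of measurement yourself is sketched below (a simplified home-grown timer in the spirit of the IMB Allreduce test, not the IMB harness itself):

```c
/* Rough Allreduce latency probe: average over many iterations, report the
 * slowest rank (a simplified stand-in for the IMB Allreduce benchmark). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    double in = 1.0, out = 0.0;

    MPI_Barrier(MPI_COMM_WORLD);               /* start everyone together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    double local_usec = (t1 - t0) / iters * 1e6;
    double max_usec   = 0.0;                   /* slowest rank dominates   */
    MPI_Reduce(&local_usec, &max_usec, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("avg Allreduce latency: %.2f usec\n", max_usec);

    MPI_Finalize();
    return 0;
}
```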
  • FCA Benefits: Unprecedented Scalability on HPC Clusters
    ► Extreme performance improvement on raw collectives (more than 100x for Allreduce and Barrier vs. Open MPI)
    ► As process count increases, the percentage of time spent in MPI collectives increases
    ► Scaling follows the number of switch hops, not the number of nodes - O(log18)
    [Chart: Allreduce and Barrier latency vs. rank count (up to ~1200 ranks), Open MPI vs. FCA]
    Enabling capability computing on HPC clusters.
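The O(log18) claim follows from the switch radix. Assuming 18 downlinks per 36-port Grid Director 4036 in a fat tree (an assumption used only for this back-of-the-envelope calculation), the number of switch tiers a message crosses grows with the logarithm of the node count, not linearly:

```c
/* Back-of-the-envelope: switch tiers needed for N nodes in a fat tree,
 * assuming 18 downlinks per 36-port switch (illustrative assumption only). */
#include <stdio.h>

static int tiers_needed(long nodes, int downlinks)
{
    int  tiers = 0;
    long capacity = 1;
    while (capacity < nodes) {   /* grow the tree one tier at a time */
        capacity *= downlinks;
        tiers++;
    }
    return tiers;
}

int main(void)
{
    long sizes[] = { 18, 324, 1000, 5832, 100000 };
    for (int i = 0; i < 5; i++)
        printf("%6ld nodes -> %d switch tier(s)\n",
               sizes[i], tiers_needed(sizes[i], 18));
    return 0;
}
```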
  • Additional Benefits
    ► Simple, fully integrated
      • No changes to the application required
    ► Tolerance to a higher oversubscription (blocking) ratio
      • Same performance at lower cost
    ► Enables use of non-blocking collectives
      • Part of future MPI implementations
      • FCA guarantees no computation-power penalty
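Non-blocking collectives were "part of future MPI implementations" when this deck was written; MPI-3 later standardized them. A sketch of the benefit the slide refers to, overlapping local computation with an in-flight reduction (MPI-3 API, shown purely for illustration):

```c
/* Overlapping computation with a non-blocking Allreduce (MPI-3's
 * MPI_Iallreduce, standardized after this 2010 deck). */
#include <mpi.h>

void overlapped_step(double *local, double *global, MPI_Comm comm)
{
    MPI_Request req;

    /* Start the reduction; the call returns immediately. */
    MPI_Iallreduce(local, global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* ... do useful local work here while the collective is in flight ... */

    /* Block only when the reduced value is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```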
  • FCA: What Is the Alternative/Competitive Solution? NIC-Based Offload
    ► FCA provides: topology awareness, network congestion elimination, computation offload in the fabric switches, result distribution based on IB multicast, support for non-blocking collectives, and OS "noise" reduction
    ► Expected MPI job runtime improvement: 30-40% with FCA vs. 1-2% with NIC-based offload
    A fabric-wide challenge requires a fabric-wide solution.
  • Benchmarks 1/4
  • FCA Impact on FLUENT (rating: higher is better)
    [Charts: FLUENT rating for aircraft_2m, eddy_417k, sedan_4m and truck_111m at 88 ranks, InfiniBand vs. InfiniBand + FCA]
    Setup: 11 x HP DL160; Intel Xeon 5550; parallel FLUENT 12.1.4 (1998); CentOS 5.4; Open MPI 1.4.1
  • Benchmarks 2/4
  • System Configuration (newest installation)
    ► Node type: NEC HPC 1812Rb-2
      • CPU: 2 x Intel X5550; memory: 6 x 2 GB; IB: 1 x InfiniHost DDR onboard
    ► System configuration: 186 nodes
      • 24 nodes per switch (DDR), 12 QDR links to tier-2 switches (non-blocking)
    ► OS: CentOS 5.4
    ► Open MPI: 1.4.1
    ► FCA: 1.0_RC3 rev 2760
    ► UFM: 2.3 RC7
    ► Switch: 3.0.629
    [Diagram: fabric topology (4 x QDR uplinks, 24 x DDR node links per edge switch)]
  • IMB (Pallas) Benchmark Results
    ► Collective latency (usec) for Allreduce, Reduce and Barrier: up to 100x faster with FCA than with Open MPI
    ► Collective runtime reduction of up to 99.5% vs. Open MPI, measured up to ~2500 ranks (16 ranks per node)
    [Charts: collective latency and collective runtime reduction vs. number of ranks, Open MPI vs. FCA]
  • OpenFOAM - I
    ► OpenFOAM
      • Open-source CFD solver produced by a commercial company, OpenCFD
      • Used by many leading automotive companies
    [Chart: OpenFOAM CFD aerodynamic benchmark (64 cores), runtime in seconds, Open MPI 1.4.1 vs. Open MPI 1.4.1 + FCA]
  • Benchmarks 3/4
  • System Configuration
    ► Node type: NEC HPC
      • CPU: Nehalem X5560, 2.8 GHz, 4 cores x 2 sockets; IB: 1 x InfiniHost DDR HCA
    ► System configuration: 700 nodes
      • 30 nodes per switch (DDR), 6 QDR links to tier-2 switches (oversubscribed)
    ► OS: Scientific Linux 5.3
    ► Open MPI: 1.4.1
    ► FCA: 1.1
    ► UFM: 2.3
    ► Switch: 3.0.629
    [Diagram: fabric topology (3 x QDR uplinks, 30 x DDR node links per edge switch)]
  • OpenFOAM - II
    ► ERCOFTAC UFR 2-02
      • http://qnet-ercoftac.cfms.org.uk/index.php?title=Flow_past_cylinder
      • Used in many areas of engineering, including civil and environmental
      • Run with OpenFOAM (pimpleFoam solver)
    [Chart: ERCOFTAC UFR 2-02, flow past a square cylinder (256 cores), runtime with Open MPI 1.4.1 vs. FCA]
  • Molecular Dynamics: LS1-Mardyn (>95% improvement)
    ► The case is 50,000 molecules, single Lennard-Jones, with a homogeneous distribution of molecules at the start of the simulation
    ► "agglo" uses a custom reduce operator (not supported by FCA), while "split" uses a standard one
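The "agglo" vs. "split" distinction comes down to the reduction operator: a user-defined MPI_Op (created with MPI_Op_create) has to be evaluated by host code, while a built-in operator such as MPI_SUM is the kind of operation an offload engine can handle. A sketch of the two variants (the absmax operator here is just an example of a custom op):

```c
/* Custom vs. built-in reduction operators: the user-defined one ("agglo"-style)
 * must run on the hosts, while the standard MPI_SUM ("split"-style) is the kind
 * of operation that is eligible for fabric offload. */
#include <mpi.h>

/* User-defined operator: element-wise maximum of absolute values. */
static void absmax(void *in, void *inout, int *len, MPI_Datatype *dt)
{
    double *a = (double *)in, *b = (double *)inout;
    for (int i = 0; i < *len; i++) {
        double v = a[i] < 0 ? -a[i] : a[i];
        double w = b[i] < 0 ? -b[i] : b[i];
        b[i] = v > w ? v : w;
    }
    (void)dt;
}

void reduce_both_ways(double *vals, double *out, int n, MPI_Comm comm)
{
    /* Built-in operator: can be handled by an offload engine. */
    MPI_Allreduce(vals, out, n, MPI_DOUBLE, MPI_SUM, comm);

    /* User-defined operator: falls back to the host-side code path. */
    MPI_Op op;
    MPI_Op_create(absmax, 1 /* commutative */, &op);
    MPI_Allreduce(vals, out, n, MPI_DOUBLE, op, comm);
    MPI_Op_free(&op);
}
```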
  • Benchmarks 4/4
  • Setup
    ► 80 x BL460 blades, each with two Intel Xeon X5670 CPUs @ 2.93 GHz (192 cores per enclosure)
    ► Voltaire QDR InfiniBand
    ► Platform MPI 8.0
    ► FLUENT version 12.1
    ► STAR-CD version 4.12
  • FLUENT, 192 Cores (rating: higher is better)
    [Charts: FLUENT rating for truck_poly_14m, truck_14m and truck_111m, PMPI vs. PMPI + FCA]
  • STAR-CD A-Class Benchmark, 192 Cores (runtime: lower is better)
    [Chart: runtime comparison]
  • Logistics & Roadmap
  • FCA Ordering & Packaging
    ► SWL-00347: FCA add-on license for 1 node
    ► SWL-00344: UFM-FCA bundle license for 1 node
    ► Switch CPU software ships automatically on all switches starting from version 3.0
      • Upgrading to the latest version is recommended
    ► The FCA add-on package includes:
      • FCA Manager - add-on to UFM
      • OMA - host add-on for Open MPI (not required for other MPIs once they are supported)
    ► The bundle includes the above as well as UFM itself
    ► The FCA license is installed on the UFM server
  • FCA Roadmap
    ► FCA v1.1 (available Q2 2010)
      • Collective operations: MPI_Reduce and MPI_Allreduce (MAX & SUM), MPI_Bcast, MPI_Barrier; integer & floating point (32/64-bit), up to 8 elements (128 bytes)
      • Topologies: fat tree, HyperScale, torus
      • MPI: Open MPI; SDK available for MPI integration
    ► FCA v2.0 (available Q4 2010)
      • Allgather
      • Support for all well-known arithmetic functions for Reduce/Allreduce (MIN, XOR, etc.)
      • Increased message size for Bcast, Reduce & Allreduce
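Given the v1.1 envelope above (built-in SUM/MAX reductions on a handful of integer or floating-point elements), an application that wants its reductions to remain offload-eligible keeps the calls inside those limits. A minimal sketch of such a call, with the function name and buffer sizes chosen only for illustration:

```c
/* Allreduce kept inside the v1.1 envelope described above:
 * built-in MPI_SUM, 64-bit floating point, no more than 8 elements. */
#include <mpi.h>

void reduce_residuals(double residuals[8], double global[8], MPI_Comm comm)
{
    /* 8 x MPI_DOUBLE with MPI_SUM: within the 8-element limit quoted on the
     * slide, so the operation is the kind FCA v1.1 can accelerate. */
    MPI_Allreduce(residuals, global, 8, MPI_DOUBLE, MPI_SUM, comm);
}
```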
  • FCA SDK: Integration with Additional MPIs
    ► Easy-to-use software development kit
    ► Integration to be performed by the MPI vendor
    ► Package includes:
      • Documentation
      • High-level & flow presentation
      • Software packages: dynamically linked library (binary only), header files, sample application
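The deck does not show the SDK's actual interface, so the sketch below uses invented stand-ins (accel_supported, accel_allreduce) purely to illustrate the general pattern an MPI vendor might follow: hand an eligible collective to an accelerator library and fall back to the library's own algorithm otherwise. None of these names belong to the FCA SDK.

```c
/* Generic dispatch shim: accel_supported()/accel_allreduce() are hypothetical
 * stand-ins for an accelerator library, NOT the FCA SDK API. */
#include <mpi.h>

/* Stub "accelerator": accepts a small SUM of doubles (illustrative envelope). */
static int accel_supported(int count, MPI_Datatype dt, MPI_Op op)
{
    return count <= 8 && dt == MPI_DOUBLE && op == MPI_SUM;
}

static int accel_allreduce(void *in, void *out, int count, MPI_Comm comm)
{
    /* A real accelerator would drive the fabric here; the stub reuses MPI. */
    return MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, comm);
}

int shim_allreduce(void *in, void *out, int count,
                   MPI_Datatype dt, MPI_Op op, MPI_Comm comm)
{
    /* Offload when the operation fits the accelerator's envelope ... */
    if (accel_supported(count, dt, op))
        return accel_allreduce(in, out, count, comm);

    /* ... otherwise use the MPI library's own algorithm. */
    return MPI_Allreduce(in, out, count, dt, op, comm);
}
```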
  • Coming Soon: Platform MPI (formerly HP-MPI) Support
    ► Platform MPI version 8.x - Q3 2010
    ► Initial benchmarking expected at the end of Q2 2010
    ► Other MPI vendors are evaluating the technology as well, leveraging the Voltaire SDK
  • Voltaire Fabric Collective Accelerator Summary
    ► Fabric computing offload
      • Combination of software & hardware in a single solution
      • Offloads blocking computational tasks
      • Algorithms leverage the topology for computation (trees)
    ► Extreme MPI performance & scalability
      • Capability computing on commodity clusters
      • Two orders of magnitude (hundred-times) faster collective runtime
      • Scales by number of hops, not number of nodes
      • Variation eliminated - consistent results
    ► Transparent to the application
      • Plug & play - no code changes needed
    Accelerate your fabric!
  • Thank You