Voltaire fca en_nov10


  1. Voltaire Fabric Collective Accelerator™ (FCA)
     Ghislain de Jacquelot – ghislaindj@voltaire.com
     © 2010 Voltaire Inc. | November 19, 2010
  2. MPI Collectives Percentage
     ► Collective operations = group communication (all-to-all, one-to-all, all-to-one)
     ► Synchronous by nature = they consume many "wait" cycles on large clusters
     ► Popular examples (a minimal illustration follows below):
       • Reduce
       • Allreduce
       • Barrier
       • Bcast
       • Gather
       • Allgather
     [Chart: collective operations as a percentage of MPI job runtime for ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-adapco STAR-CD, and Dacapo]
     Your cluster might be spending half its time on idle collective cycles.
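     For readers less familiar with MPI, here is a minimal C sketch (not part of the original deck) of two of the collectives named above. These calls are blocking, so every rank waits for the slowest participant, which is where the "wait" cycles come from; the values and the choice of MPI_SUM are arbitrary.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Every rank contributes one value; every rank receives the global
               sum. All ranks block here until the slowest one arrives. */
            double local = (double)rank, global_sum = 0.0;
            MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            /* Barrier: pure synchronization, no data moved. */
            MPI_Barrier(MPI_COMM_WORLD);

            if (rank == 0)
                printf("sum over %d ranks = %f\n", size, global_sum);

            MPI_Finalize();
            return 0;
        }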
  3. The Challenge: Collective Operations Scalability
     ► Grouping algorithms are unaware of the topology and therefore inefficient
     ► Network congestion due to "all-to-all" communication
     ► Slow nodes and OS involvement impair scalability and predictability
     ► The more powerful servers get (GPUs, more cores), the more poorly collectives scale in the fabric
     [Chart: expected vs. actual scaling]
  4. The Voltaire InfiniBand Fabric: Equipped for the Challenge
     [Diagram: two tiers of Voltaire 4036 Grid Director switches]
     ► Grid Director switches: fabric processing power
     ► Unified Fabric Manager (UFM): topology-aware orchestrator
     Fabric computing in use to address the collective challenge.
  5. Introducing: Voltaire Fabric Collective Accelerator
     [Diagram: FCA components overlaid on the two-tier switch fabric]
     ► Grid Director switches: collective operations offloaded to the switch CPUs
     ► FCA Agent: inter-core processing localized & optimized
     ► Unified Fabric Manager (UFM): topology-aware orchestrator
     ► FCA Manager: topology-based collective tree, separate virtual network, IB multicast for result distribution, integration with job schedulers
     Breakthrough performance with no additional hardware.
  6. Efficient Collectives with FCA
     [Diagram: aggregation across server cores and two switch tiers, with fan-in 36 → 648 → 11,664]
     1. Pre-configuration
     2. Inter-core processing
     3. 1st-tier offload
     4. 2nd-tier offload (result at root)
     5. Result distribution (single message)
     6. Allreduce on 100K cores in 25 usec
     (A software-only sketch of this hierarchy follows below.)
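     The sketch below mimics the same reduction hierarchy purely in host software, as one way to picture the data flow: reduce inside each node first, then across one leader per node, then fan the result back out. FCA performs the cross-node stages in the switch CPUs and distributes the result with IB multicast; nothing here is the FCA implementation, and MPI_Comm_split_type is an MPI-3 convenience used only for brevity.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* Step 2: group ranks that share a node ("inter-core processing"). */
            MPI_Comm node_comm;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                                MPI_INFO_NULL, &node_comm);
            int node_rank;
            MPI_Comm_rank(node_comm, &node_rank);

            double local = (double)rank, node_sum = 0.0, global_sum = 0.0;
            MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);

            /* Steps 3-4: one leader per node joins the cross-node reduction
               (the part FCA moves into the switch tiers). */
            MPI_Comm leader_comm;
            MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                           rank, &leader_comm);
            if (node_rank == 0)
                MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                              leader_comm);

            /* Step 5: distribute the result back to every core on the node. */
            MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);

            if (rank == 0) printf("global sum = %f\n", global_sum);

            if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
            MPI_Comm_free(&node_comm);
            MPI_Finalize();
            return 0;
        }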
  7. FCA Benefits: Slashing Job Runtime
     ► Slashing runtime
     ► Eliminating runtime variation
       • OS jitter – eliminated in the switches
       • Traffic congestion – significantly lower number of messages
       • Cross-application interference – collectives offloaded onto a private virtual network
     [Chart: IMB Allreduce completion-time distribution on 2,048 cores – server-based collectives (Open MPI) >3,000 usec vs. FCA-based collectives <30 usec]
     (A minimal timing loop in the spirit of the IMB measurement follows below.)
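     The numbers above come from the IMB (Intel MPI Benchmarks, formerly Pallas) Allreduce test. A stripped-down C loop in the same spirit looks roughly like this; the iteration count and payload size are arbitrary choices for the sketch, not the IMB defaults.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            const int iters = 1000;
            double in = 1.0, out = 0.0;

            MPI_Barrier(MPI_COMM_WORLD);           /* start everyone together */
            double t0 = MPI_Wtime();
            for (int i = 0; i < iters; i++)
                MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            double t1 = MPI_Wtime();

            if (rank == 0)
                printf("avg Allreduce latency: %.1f usec\n",
                       (t1 - t0) / iters * 1e6);

            MPI_Finalize();
            return 0;
        }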
  8. FCA Benefits: Unprecedented Scalability on HPC Clusters
     [Chart: collective latency vs. node count for ompi-Allreduce/Barrier (by node) and FCA-Allreduce/Barrier, showing a >100X improvement]
     ► Extreme performance improvement on raw collectives (>100X)
     ► Latency scales with the number of switch hops, not the number of nodes – O(log₁₈ N)
     ► As the process count increases
       • the % of time spent in MPI increases
       • the % of time spent in collectives increases (>50%)
     Enabling capability computing on HPC clusters. (A back-of-the-envelope hop-count example follows below.)
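     Assuming the base-18 figure reflects the 36-port Grid Director's fat-tree fan-in (half the ports facing down), a quick calculation shows how slowly the number of switch tiers grows with node count; the radix is an assumption, not stated explicitly on the slide.

        #include <math.h>
        #include <stdio.h>

        int main(void) {
            /* Assumed radix: 18 downward links per 36-port switch. */
            int nodes[] = { 100, 1000, 10000, 100000 };
            for (int i = 0; i < 4; i++) {
                int tiers = (int)ceil(log((double)nodes[i]) / log(18.0));
                printf("%6d nodes -> %d switch tiers\n", nodes[i], tiers);
            }
            return 0;
        }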
  9. Additional Benefits
     ► Simple, fully integrated
       • No changes to the application required
     ► Tolerance to a higher oversubscription (blocking) ratio
       • Same performance at lower cost
     ► Enables use of non-blocking collectives (see the sketch below)
       • Part of future MPI implementations
       • FCA guarantees no computation-power penalty
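     Non-blocking collectives were still a future MPI feature when this deck was written; MPI_Iallreduce (later standardized in MPI-3.0) illustrates the intended pattern of overlapping computation with the collective, which is where a fabric offload avoids stealing host CPU cycles. The helper some_local_work is invented for illustration.

        #include <mpi.h>
        #include <stdio.h>

        static double some_local_work(void) { return 42.0; }   /* placeholder */

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            double in = (double)rank, out = 0.0;
            MPI_Request req;

            /* Start the collective, keep computing, wait only when needed. */
            MPI_Iallreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
            double other = some_local_work();      /* overlapped computation */
            MPI_Wait(&req, MPI_STATUS_IGNORE);     /* result needed from here */

            if (rank == 0) printf("sum=%f, local=%f\n", out, other);
            MPI_Finalize();
            return 0;
        }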
  10. FCA: What Is the Alternative/Competitive Solution?
      FCA is compared against NIC-based offload on the following criteria:
      ► Topology awareness
      ► Network congestion elimination
      ► Computation offload to the fabric switches
      ► Result distribution based on IB multicast
      ► Support for non-blocking collectives
      ► OS "noise" reduction
      Expected MPI job runtime improvement: 30-40% (FCA) vs. 1-2% (NIC-based offload).
      A fabric-wide challenge requires a fabric-wide solution.
  11. Benchmarks 1/4
  12. FCA Impact on FLUENT Rating (Higher Is Better)
      [Charts: FLUENT rating at 88 ranks for the aircraft_2m, eddy_417k, sedan_4m, and truck_111m cases – InfiniBand vs. InfiniBand + FCA]
      Setup: 11 x HP DL160; Intel Xeon 5550; parallel FLUENT 12.1.4 (1998); CentOS 5.4; Open MPI 1.4.1
  13. Benchmarks 2/4
  14. System Configuration
      Newest installation:
      ► Node type: NEC HPC 1812Rb-2
        • CPU: 2 x Intel X5550; memory: 6 x 2 GB; IB: 1 x InfiniHost DDR onboard
      ► System configuration: 186 nodes
        • 24 nodes per switch (DDR), 12 QDR links to tier-2 switches (non-blocking)
      ► OS: CentOS 5.4
      ► Open MPI: 1.4.1
      ► FCA: 1.0_RC3 rev 2760
      ► UFM: 2.3 RC7
      ► Switch: 3.0.629
      [Topology diagram: 24 x DDR, 4 x QDR links]
  15. IMB (Pallas) Benchmark Results
      [Chart: collective latency (usec) vs. number of ranks (16 ranks per node) for ompi- and FCA-based Allreduce, Reduce, and Barrier – up to 100X faster]
      [Chart: collective runtime reduction (%) of FCA vs. Open MPI for Allreduce, Reduce, and Barrier – up to 99.5% runtime reduction]
  16. OpenFOAM – I
      ► OpenFOAM
        • Open-source CFD solver produced by a commercial company, OpenCFD
        • Used by many leading automotive companies
      [Chart: OpenFOAM CFD aerodynamic benchmark runtime in seconds on 64 cores – Open MPI 1.4.1 vs. Open MPI 1.4.1 + FCA]
  17. Benchmarks 3/4
  18. System Configuration
      ► Node type: NEC HPC
        • CPU: Nehalem X5560, 2.8 GHz, 4 cores x 2 sockets; IB: 1 x InfiniHost DDR HCA
      ► System configuration: 700 nodes
        • 30 nodes per switch (DDR), 6 QDR links to tier-2 switches (oversubscribed)
      ► OS: Scientific Linux 5.3
      ► Open MPI: 1.4.1
      ► FCA: 1.1
      ► UFM: 2.3
      ► Switch: 3.0.629
      [Topology diagram: 30 x DDR, 3 x QDR links]
  19. OpenFOAM – II
      ► ERCOFTAC UFR 2-02
        • http://qnet-ercoftac.cfms.org.uk/index.php?title=Flow_past_cylinder
        • Used in many areas of engineering, including civil and environmental
        • Run with OpenFOAM (pimpleFoam solver)
      [Chart: ERCOFTAC UFR 2-02, flow past a square cylinder, 256 cores – runtime with Open MPI 1.4.1 vs. FCA]
  20. Molecular Dynamics: LS1-Mardyn
      ► The case is 50,000 molecules, single Lennard-Jones; the distribution of molecules is homogeneous at the beginning of the simulation.
      ► The "agglo" variant uses a custom reduce operator (not supported by FCA), while the "split" variant uses a standard one (see the example below).
      [Chart: >95% improvement]
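      A hedged C sketch of the distinction the slide draws (the operator body and names are invented for illustration): a built-in operator such as MPI_SUM is something the fabric can evaluate, while an operator registered with MPI_Op_create is an arbitrary host function, so that collective has to stay on the host-based path.

         #include <mpi.h>
         #include <stdio.h>

         /* User-defined element-wise maximum, illustrative only. */
         static void pairwise_max(void *in, void *inout, int *len, MPI_Datatype *dt) {
             double *a = (double *)in, *b = (double *)inout;
             (void)dt;
             for (int i = 0; i < *len; i++)
                 if (a[i] > b[i]) b[i] = a[i];
         }

         int main(int argc, char **argv) {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             double in = (double)rank, out;

             /* Standard operator: eligible for fabric offload. */
             MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

             /* Custom operator: falls back to the host-based collective path. */
             MPI_Op my_op;
             MPI_Op_create(pairwise_max, 1 /* commutative */, &my_op);
             MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, my_op, MPI_COMM_WORLD);
             MPI_Op_free(&my_op);

             if (rank == 0) printf("done\n");
             MPI_Finalize();
             return 0;
         }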
  21. Benchmarks 4/4
  22. Setup
      ► 80 x BL460 blades, each with two Intel Xeon X5670 CPUs @ 2.93 GHz
      ► Voltaire QDR InfiniBand
      ► Platform MPI 8.0
      ► FLUENT version 12.1
      ► STAR-CD version 4.12
      192 cores per enclosure
  23. FLUENT, 192 Cores (Rating: Higher Is Better)
      [Charts: FLUENT rating for truck_poly_14m, truck_14m, and truck_111m – PMPI vs. PMPI + FCA]
  24. STAR-CD A-Class Benchmark, 192 Cores (Runtime: Lower Is Better)
  25. Logistics & Roadmap
  26. FCA Ordering & Packaging
      • SWL-00347 – FCA add-on license for 1 node
      • SWL-00344 – UFM-FCA bundle license for 1 node
      ► The switch-CPU software ships automatically on all switches starting from version 3.0
        • Upgrading to the latest version is recommended
      ► The FCA add-on package includes:
        • FCA Manager – add-on to UFM
        • OMA – host add-on for Open MPI (not required for other MPIs once they are supported)
      ► The bundle includes the above as well as UFM itself
      ► The FCA license is installed on the UFM server
  27. FCA Roadmap
      ► FCA v1.1 (available Q2 2010)
        • Collective operations (an example within these limits follows below)
          - MPI_Reduce, MPI_Allreduce (MAX & SUM)
          - MPI_Bcast
          - Integer & floating point (32/64-bit), up to 8 elements (128 bytes)
          - MPI_Barrier
        • Topologies
          - Fat tree
          - HyperScale
          - Torus
        • MPI
          - Open MPI
          - SDK available for MPI integration
      ► FCA v2.0 (available Q4 2010)
        • Allgather
        • Support for all well-known arithmetic functions for Reduce/Allreduce (MIN, XOR, etc.)
        • Increased message size for Bcast, Reduce & Allreduce
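      For orientation, the calls below stay within the v1.1 envelope listed above (Bcast, Barrier, and Allreduce with MAX on a few 64-bit elements). Whether a particular call is actually offloaded depends on the FCA-enabled MPI build and fabric configuration, which this sketch does not model; values are arbitrary.

         #include <mpi.h>
         #include <stdio.h>

         int main(int argc, char **argv) {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             /* Small broadcast payload: 8 x 64-bit elements. */
             double params[8] = { 0 };
             if (rank == 0)
                 for (int i = 0; i < 8; i++) params[i] = (double)i;
             MPI_Bcast(params, 8, MPI_DOUBLE, 0, MPI_COMM_WORLD);

             /* Allreduce with MAX on a single 64-bit element. */
             double local_max = (double)rank, global_max = 0.0;
             MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX,
                           MPI_COMM_WORLD);

             MPI_Barrier(MPI_COMM_WORLD);
             if (rank == 0) printf("global max rank id = %f\n", global_max);
             MPI_Finalize();
             return 0;
         }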
  28. FCA SDK – Integration with Additional MPIs
      ► Easy-to-use software development kit
      ► Integration to be performed by the MPI vendor
      ► Package includes:
        • Documentation
        • High-level & flow presentation
        • Software packages
          - Dynamically linked library (binary only)
          - Header files
          - Sample application
  29. Coming Soon: Platform MPI (Formerly HP-MPI) Support
      ► Platform MPI version 8.x – Q3 2010
      ► Initial benchmarking expected at the end of Q2 2010
      ► Other MPI vendors are evaluating the technology as well
        • Leveraging the Voltaire SDK
  30. Voltaire Fabric Collective Accelerator: Summary
      ► Fabric computing offload
        • A combination of software and hardware in a single solution
        • Offloads blocking computational tasks
        • Algorithms leverage the topology for computation (trees)
      ► Extreme MPI performance & scalability
        • Capability computing on commodity clusters
        • Collective runtimes up to two orders of magnitude (100X) faster
        • Scales with the number of hops, not the number of nodes
        • Variation eliminated – consistent results
      ► Transparent to the application
        • Plug & play – no code changes required
      Accelerate your fabric!
  31. Thank You
