Voltaire - Reducing the Runtime of Collective Communications
Presented at ISC '10 Birds of a Feather Session

Presentation Transcript

  • Reducing the Runtime of Collective Communications. ISC'10 Birds of a Feather Session, June 3, 2010. © 2010 Voltaire Inc.
  • Agenda
    ► Scalability Challenges for Group Communication
    ► Voltaire Fabric Collective Accelerator™ (FCA™) (Yaron Haviv, CTO, Voltaire)
    ► Customer Experience: University of Braunschweig (Josef Schüle)
  • About Voltaire (NASDAQ: VOLT)
    ► Leading provider of scale-out data center fabrics
      • Used by more than 30% of Fortune 100 companies
      • Hundreds of installations of over 1,000 servers
    ► Addressing the challenges of HPC, virtualized data centers, and clouds
    ► More than half of TOP500 InfiniBand sites
    ► InfiniBand and 10GbE scale-out fabrics
    End-to-end scale-out fabric product line
  • MPI Collectives
    ► Collective operations = group communication (all-to-all, one-to-all, all-to-one)
    ► Synchronous by nature, so they consume many "wait" cycles on large clusters (a timing sketch follows this slide)
    ► Popular examples: Reduce, Allreduce, Barrier, Bcast, Gather, Allgather
    [Chart: collective operations as a percentage of MPI job runtime for ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-Adapco STAR-CD, and Dacapo]
    Your cluster might be spending half its time on idle collective cycles.
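To make the "wait cycles" point concrete, here is a minimal sketch (not from the presentation) that measures how long each rank sits blocked in a collective when one rank lags; the one-second sleep on rank 0 is an artificial stand-in for load imbalance:

```c
/* Minimal sketch of measuring "wait cycles" in a blocking collective.
 * Build with an MPI compiler wrapper, e.g. mpicc barrier_wait.c. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Simulate imbalance: rank 0 "computes" longer than everyone else. */
    if (rank == 0)
        sleep(1);

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);   /* every other rank idles here */
    double waited = MPI_Wtime() - t0;

    printf("rank %d waited %.6f s in MPI_Barrier\n", rank, waited);

    MPI_Finalize();
    return 0;
}
```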
  • Collective Example: Allreduce
    ► Allreduce, the concept: perform a specific operation on all arguments and distribute the result to all processes
      [Diagram: SUM examples in which contributions such as 15, 8, and 7 reduce to 30 on every process]
    ► Allreduce on a 4-node cluster
      [Diagram: each node's eight ranks hold the values 1, 2, 5, 6, 3, 4, 7, 8 (local sum 36); after the SUM Allreduce every rank on every node holds the global sum 4 x 36 = 144]
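A minimal MPI sketch of the figure's scenario, assuming 32 ranks (4 nodes x 8 ranks per node) that contribute the values 1, 2, 5, 6, 3, 4, 7, 8 in a repeating pattern; every rank ends up with the global sum 144:

```c
/* SUM Allreduce reproducing the slide's example: run with 32 ranks
 * (4 nodes x 8 ranks) and every rank prints the global sum 144. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int values[8] = {1, 2, 5, 6, 3, 4, 7, 8}; /* per-node pattern */
    int mine = values[rank % 8];
    int sum  = 0;

    /* Every rank contributes one integer; every rank receives the sum. */
    MPI_Allreduce(&mine, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: global sum = %d\n", rank, sum); /* 144 with 32 ranks */

    MPI_Finalize();
    return 0;
}
```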
  • Now try running it on a petascale machine...
    [Diagram: dozens of core switches (3 hops), hundreds of edge switches (1 hop), tens of thousands of cores]
    A single operation takes more than 3,000 usec: not scalable.
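For context, the >3,000 usec figure can be read against the standard alpha-beta cost model for a host-based recursive-doubling Allreduce (a textbook model, not one given in the presentation):

```latex
% Alpha-beta model for recursive-doubling Allreduce over P processes with
% message size m; \alpha = per-message latency, \beta = per-byte cost:
T_{\mathrm{allreduce}} \;\approx\; \log_2(P)\,(\alpha + \beta m)
% For P = 10^4 cores, \log_2(P) \approx 13.3 sequential steps; each step
% crosses up to 3 switch hops and passes through the host OS, so per-step
% overheads compound into the millisecond range quoted above.
```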
  • The Challenge: Collective Operations Scalability
    ► Grouping algorithms are inefficient and unaware of the topology
    ► Network congestion results from all-to-all communication
    ► Slow nodes and OS involvement impair scalability and predictability
      [Chart: expected vs. actual scaling]
    ► The more powerful servers get (GPUs, more cores), the poorer collectives scale in the fabric
  • The Voltaire InfiniBand Fabric: Equipped for the Challenge
    ► Grid Director switches: fabric processing power
    ► Unified Fabric Manager (UFM): topology-aware orchestrator
    Fabric computing in use to address the collective challenge
  • Introducing: Voltaire Fabric Collective Accelerator
    ► FCA Manager: topology-based collective tree; separate virtual network for result distribution; IB multicast; integration with job schedulers
    ► Grid Director switches: collective operations offloaded to switch CPUs
    ► FCA Agent: inter-core processing localized and optimized
    Breakthrough performance with no additional hardware (a conceptual sketch of the tree-based flow follows this slide)
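A conceptual sketch of the hierarchical idea, not Voltaire's implementation: reduce within each node first, let one leader rank per node join the fabric-level reduction, then fan the result back out. The MPI-3 call MPI_Comm_split_type postdates this 2010 talk and is used here only to express the structure:

```c
/* Hierarchical reduction sketch: intra-node reduce, inter-node reduce
 * among per-node leaders, then local fan-out of the result. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Step 1: group ranks by node (shared-memory domain). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Step 2: one leader per node joins the inter-node communicator. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    int mine = world_rank + 1, node_sum = 0, total = 0;

    /* Intra-node reduction ("inter-core processing" in FCA terms). */
    MPI_Reduce(&mine, &node_sum, 1, MPI_INT, MPI_SUM, 0, node_comm);

    /* Inter-node reduction; FCA would offload this to switch CPUs. */
    if (node_rank == 0)
        MPI_Allreduce(&node_sum, &total, 1, MPI_INT, MPI_SUM, leader_comm);

    /* Result distribution; FCA would use IB multicast instead. */
    MPI_Bcast(&total, 1, MPI_INT, 0, node_comm);

    printf("rank %d: total = %d\n", world_rank, total);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```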
  • Efficient Collectives with FCA
    1. Pre-configuration (topology-based tree setup)
    2. Inter-core processing (each node's eight ranks, holding 1, 2, 5, 6, 3, 4, 7, 8, reduce locally to 36)
    3. 1st-tier offload (each edge switch reduces its nodes' partial results to 648)
    4. 2nd-tier offload (the core switch produces the final result, 11664, at the root)
    5. Result distribution (a single multicast message returns 11664 to all ranks)
    6. Allreduce on 100K cores in 25 usec
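The example values are internally consistent with a fan-in of 18 at each switch tier; the 18x factor is derived from the figure's numbers, not stated on the slide:

```latex
% Per-node reduction over 8 ranks:
1 + 2 + 5 + 6 + 3 + 4 + 7 + 8 = 36
% First-tier (edge switch) offload, implying 18 nodes per edge switch:
18 \times 36 = 648
% Second-tier (core switch) offload, implying 18 edge switches:
18 \times 648 = 11664
% One IB multicast then distributes 11664 to every rank.
```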
  • UFM Integrated with Job Schedulers
    ► Job submitted in the scheduler; a matching job is automatically created in UFM
    ► Fabric-wide policy pushed to match application requirements: QoS, routing, placement, collectives
    ► Application-level monitoring and optimization based on fabric-wide measurements
  • FCA Benefits: Slashing Job Runtime
    ► Slashing runtime
      [Chart: IMB Allreduce on 2,048 cores; Open MPI: >3,000 usec vs. FCA: <30 usec]
    ► Eliminating runtime variation
      • OS jitter: eliminated by running in the switches
      • Traffic congestion: significantly lower number of messages
      • Cross-application interference: collectives offloaded onto a private virtual network
      [Chart: completion-time distribution, server-based collectives vs. FCA-based collectives]
  • FCA Benefits: Unprecedented Scalability on HPC Clusters
    [Chart, log scale: collective latency vs. core count (up to ~1,200) for ompi-Allreduce-bynode, ompi-Barrier-bynode, FCA-Allreduce, and FCA-Barrier; FCA is more than 180x faster]
    ► Extreme performance improvement on raw collectives
    ► As the process count increases, the percentage of time spent in MPI collectives increases (often above 50%)
    ► Scales with the number of switch hops, not the number of nodes: O(log18 N) (see the worked example after this slide)
    Enabling capability computing on HPC clusters
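A worked example of the log18 scaling claim, sketched under the assumption of a fan-in of 18 downlinks per switch tier (consistent with the 18x factor in the FCA example figure above; the slide does not state the switch radix explicitly):

```latex
% Tree depth for a reduction with fan-in 18 at every switch tier:
\mathrm{tiers}(N) \;=\; \lceil \log_{18} N \rceil
% Example: 18^2 = 324 and 18^3 = 5832, so going from 324 nodes to
% 5,832 nodes adds only one tier. Offloaded collective latency grows
% with this tier count, not with N itself.
```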
  • Additional Benefits
    ► Simple and fully integrated: no changes to the application required
    ► Tolerates a higher oversubscription (blocking) ratio: the same performance at lower cost
    ► Enables the use of non-blocking collectives (part of future MPI implementations; a sketch follows this slide), and FCA guarantees no computation-power penalty
    ► Reduces fabric congestion, avoiding interference with other jobs
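The non-blocking collectives mentioned above were later standardized in MPI-3 as MPI_Iallreduce and friends (after this 2010 talk). A sketch of the compute/communication overlap they enable, where independent_work is a hypothetical placeholder for application work that does not depend on the reduction result:

```c
/* Overlapping computation with a non-blocking collective. With an
 * offloaded collective (as FCA provides), the overlap is real rather
 * than nominal, since the host CPU is free while the fabric reduces. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical placeholder for result-independent application work. */
static double independent_work(void)
{
    double acc = 0.0;
    for (int i = 1; i <= 1000000; i++)
        acc += 1.0 / i;
    return acc;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double mine = rank + 1.0, sum = 0.0;
    MPI_Request req;

    /* Start the collective, keep computing while it progresses. */
    MPI_Iallreduce(&mine, &sum, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);
    double local = independent_work();
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d: sum = %f, local = %f\n", rank, sum, local);

    MPI_Finalize();
    return 0;
}
```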
  • Customer Experience: University of Braunschweig. June 3, 2010.
  • About University of Braunschweig
    ► General overview
      • Founded in 1745
      • 120 institutes with ca. 2,900 employees
      • Ca. 13,000 students
    ► Main fields of research
      • Mobility and transport (road, rail, air, and space)
      • Biological and biotechnological research
      • Digital television
  • System Configuration (newest installation)
    ► Node type: NEC HPC 1812Rb-2
      • CPU: 2 x Intel X5550; memory: 6 x 2 GB; IB: 1 x InfiniHost DDR onboard
    ► System configuration: 186 nodes
      • 24 nodes per edge switch (DDR), 12 QDR links to tier-2 switches (non-blocking)
    ► OS: CentOS 5.4
    ► Open MPI: 1.4.1
    ► FCA: 1.0_RC3 rev 2760
    ► UFM: 2.3 RC7
    ► Switch software: 3.0.629
    [Diagram: edge switches with 24 DDR node links each and QDR uplinks to the tier-2 switches]
  • FCA Performance: A Real Cluster Example with 2,048 Ranks
    [Chart, log scale: collective latency (usec) vs. number of ranks (16 ranks per node, up to ~2,048) for ompi-Allreduce, ompi-Barrier, FCA-Allreduce, and FCA-Barrier; at 2,048 ranks FCA is 180x faster, down from ~4,000 usec]
  • Real Application Results
    ► OpenFOAM
      • Open-source CFD solver produced by a commercial company, OpenCFD
      • Used by many leading automotive companies
      [Chart: OpenFOAM CFD aerodynamic benchmark on 64 cores, runtime in seconds; Open MPI 1.4.1 vs. Open MPI 1.4.1 + FCA; 41% better with FCA]
    ► Expected benefits for several other applications, e.g. DL_POLY (molecular dynamics)
  • Voltaire Fabric Collective Accelerator: Summary
    ► Fully integrated fabric-computing offload
      • A combination of software and hardware in a single solution
      • Offloads blocking computational tasks
      • Algorithms leverage the topology for computation (trees)
    ► Extreme MPI performance and scalability
      • Capability computing on commodity clusters
      • Two orders of magnitude (hundred-times) faster collective runtime
      • Scales by the number of hops, not the number of nodes
      • Variation eliminated: consistent results
    ► Transparent to the application
      • Plug and play; no code changes needed
    Accelerate your fabric!
  • Q&A