Comparing single-node and multi-node performance
of an important fusion HPC code benchmark
Igor Sfiligoi, Frank Würthwein – UCSD
Emily A. Belli, Jeff Candy – General Atomics
Problem statement
• HPC systems have enjoyed tremendous growth in compute throughput
• Mostly driven by the adoption of many-GPU compute nodes
• The interconnect-to-compute ratio, however, has not kept pace
• Especially when it comes to node-to-node bandwidth
• This limits the scaling of communication-heavy applications
On the flip side
The largest single many-GPU node is now capable of doing the work
of O(100) CPU-only nodes (with high interconnect bandwidth).
But such hardware is hard to come by!
CGYRO – An HPC Fusion Simulation Tool
• Eulerian gyrokinetic turbulence solver
• MPI-enabled, can scale from O(10) to O(1k) compute chips
• Communication-heavy
• Large buffers exchanged through MPI_AllToAll (see the sketch below)
https://gafusion.github.io/doc/cgyro.html
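As a rough illustration of the MPI_AllToAll pattern mentioned above, here is a minimal, self-contained C+MPI sketch. It is not CGYRO source code: the buffer size ("block") and the data layout are made-up placeholders, chosen only to show every rank exchanging a large block with every other rank.

/* Schematic sketch of the kind of exchange described above: every MPI
 * rank trades a large, equally sized block with every other rank.
 * Buffer sizes here are placeholders, not CGYRO's. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Hypothetical per-destination block: 1M doubles (~8 MB per peer). */
    const int block = 1 << 20;
    double *sendbuf = malloc((size_t)block * nranks * sizeof(double));
    double *recvbuf = malloc((size_t)block * nranks * sizeof(double));

    /* ... fill sendbuf with the locally owned slice of the field ... */

    /* The communication-heavy step: each rank sends `block` doubles to
     * every other rank and receives the same amount back.  On multi-node
     * systems its cost is set by node-to-node bandwidth. */
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}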
Tested systems
• Google Cloud (GCP) a2-megagpu-16g instance – 16x A100, NVLINK-only
• NERSC HPC Perlmutter (Phase 1) – 4x A100 per node, NVLINK + 2x 100 Gbps
• ORNL HPC Summit – 6x V100 per node, NVLINK + 2x 100 Gbps
• NERSC HPC Cori – CPU-only Intel Xeon Phi KNL nodes, 40 Gbps
Detailed running environment described in the paper.
Measured the runtime of the mainstream nl03 benchmark input.
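A crude way to compare the systems listed above is to divide each node's NIC bandwidth by the number of compute chips it hosts. The short C sketch below does just that, using only the figures from the list; it deliberately ignores the very different per-chip compute speeds (A100 vs V100 vs KNL), so treat it as a rough proxy for the interconnect-to-compute ratio discussed later, not the paper's metric.

/* Rough proxy: node-to-node NIC bandwidth per compute chip, from the
 * configurations listed above.  Per-chip compute differences are ignored. */
#include <stdio.h>

int main(void)
{
    struct { const char *name; double nic_gbps; int chips; } sys[] = {
        { "GCP a2-megagpu-16g", 0.0,       16 },  /* single node, NVLINK-only */
        { "Perlmutter Phase 1", 2 * 100.0,  4 },
        { "Summit",             2 * 100.0,  6 },
        { "Cori (KNL)",         40.0,       1 },
    };

    for (int i = 0; i < 4; i++) {
        if (sys[i].nic_gbps == 0.0)
            printf("%-20s  all traffic stays on NVLINK inside the node\n", sys[i].name);
        else
            printf("%-20s  %.1f Gbps per compute chip\n",
                   sys[i].name, sys[i].nic_gbps / sys[i].chips);
    }
    return 0;
}

Per chip the numbers look comparable, but a KNL delivers only a small fraction of an A100's compute (the next chart shows 16 A100s matching roughly 200 KNL nodes), so per unit of compute Cori has far more network bandwidth; that imbalance is what the following slides quantify.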
Looking at compute speed alone
[Bar chart: compute time per CGYRO nl03 step for Summit 32x6 V100, Summit 16x6 V100, Perlmutter 16x4 A100, Perlmutter 8x4 A100, GCP 1x16 A100, Cori 128x1 KNL, and Cori 256x1 KNL]
• Single GCP node as fast as 200 Cori nodes
• Almost ideal scaling as we add more A100s
• V100 about half A100 speed
Actual CGYRO speed
[Bar chart: total time per CGYRO nl03 step, split into compute-only and total time, for the same configurations as above]
• Single GCP node faster than 8x Perlmutter nodes
• Single GCP node faster than 16x Summit nodes
• CGYRO is completely communication bound on multi-node GPU HPC systems
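The compute-only vs. total-time split shown above can be reproduced with very simple instrumentation: time the local work and the collective separately with MPI_Wtime. The sketch below is a self-contained stand-in, not CGYRO's actual timer code; the dummy workload (a sin loop), the step count, and the buffer sizes are placeholders.

/* Sketch of splitting "compute-only" time from "total time" per step.
 * Not CGYRO's instrumentation; workload and sizes are placeholders. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int count = 1 << 18;                 /* per-peer doubles (made up) */
    const long n = (long)count * nranks;
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (long i = 0; i < n; i++) sendbuf[i] = (double)i;

    const int nsteps = 10;
    double t_compute = 0.0, t_total = 0.0;

    for (int s = 0; s < nsteps; s++) {
        double t0 = MPI_Wtime();

        for (long i = 0; i < n; i++)           /* stand-in for the local work */
            sendbuf[i] = sin(sendbuf[i]);

        double t1 = MPI_Wtime();

        MPI_Alltoall(sendbuf, count, MPI_DOUBLE,   /* the communication-heavy part */
                     recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD);

        double t2 = MPI_Wtime();
        t_compute += t1 - t0;
        t_total   += t2 - t0;
    }

    if (rank == 0)
        printf("per step: compute %.3f s, total %.3f s, fraction in compute %.0f%%\n",
               t_compute / nsteps, t_total / nsteps, 100.0 * t_compute / t_total);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

The printed "fraction in compute" is the metric shown in the charts that follow.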
Fraction in compute
[Bar chart: fraction of time spent in compute for GCP, Perlmutter Phase 1, Summit, and Cori; annotated values 58% and 25%]
The higher interconnect-to-compute ratio makes the single GCP instance much faster than 8 Perlmutter nodes,
even though it has half the A100 GPUs. (We would need about 12 Perlmutter nodes to reach parity,
i.e. 3x the number of GPUs: 12 nodes x 4 A100s = 48 GPUs vs. 16 in the GCP instance.)
Fraction in compute
[Same chart as above]
NVLINK-connected A100s in GCP have the same interconnect-to-compute ratio as
Cori's Cray Aries-connected Intel Xeon Phi KNLs.
While it takes more nodes, the CPU-based Cori behaves similarly to the single-node GCP instance.
Fraction in compute
[Same chart as above; annotated speed ratios 1.9x, 1.5x, and 1.25x]
Doubling the network capacity about as effective as adding 50% more GPUs.
Hypothesis confirmed
• We confirm that networking has not been keeping up with compute in recent GPU-based HPC systems
• This clearly hurts CGYRO, and likely other communication-heavy HPC applications, too
• GPUs provide a great compute benefit for CGYRO
• 16 A100 GPUs as fast as ~200 high-end KNL CPUs
Desiderata
• High-interconnect GPU-based HPC systems desired for CGYRO
(and likely other communication-heavy HPC applications)
• Large-GPU-count nodes (like GCP) great when the problem fits
• And many mainstream CGYRO problems do
• But only GCP has them; we could not find one on-prem!
• Larger problems will have to settle for lower compute efficiencies
• Unless system providers increase the interconnect throughput
• Adding more NICs would help,
but we look forward to external NVLINK switches, too
Acknowledgements
• This work was partially supported by
• The U.S. Department of Energy under awards DE-FG02-95ER54309,
DE-FC02-06ER54873 (Edge Simulation Laboratory) and
DE-SC0017992 (AToM SciDAC-4 project).
• The US National Science Foundation (NSF) Grant OAC-1826967.
• An award of computer time was provided by the INCITE program.
• This research used resources of the Oak Ridge Leadership Computing
Facility, which is an Office of Science User Facility supported under
Contract DE-AC05-00OR22725.
• Computing resources were also provided by the National Energy Research
Scientific Computing Center, which is an Office of Science User Facility
supported under Contract DE-AC02-05CH11231.