Comparing single-node and multi-node performance
of an important fusion HPC code benchmark
Igor Sfiligoi, Frank Würthwein – UCSD
Emily A. Belli, Jeff Candy – General Atomics
Problem statement
• HPC systems have enjoyed tremendous growth in compute throughput
• Mostly driven by the adoption of many-GPU compute nodes
• The interconnect-to-compute ratio, however, has not kept pace
• Especially when it comes to node-to-node bandwidth
• This limits the scaling of communication-heavy applications
On the flip side
The largest single many-GPU node is now capable of doing the work
of O(100) CPU-only nodes (with high interconnect bandwidth).
But such hardware is hard to come by!
CGYRO – An HPC Fusion Simulation Tool
• Eulerian gyrokinetic turbulence solver
• MPI-enabled, can scale from O(10) to O(1k) compute chips
• Communication-heavy
• Large buffers exchanged through MPI_AllToAll (see the sketch below)
https://gafusion.github.io/doc/cgyro.html
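As a rough illustration of the MPI_AllToAll pattern mentioned above, here is a minimal, self-contained C+MPI sketch. It is not CGYRO source code: the buffer size ("block") and the data layout are made-up placeholders, chosen only to show every rank exchanging a large block with every other rank.

/* Schematic sketch of the kind of exchange described above: every MPI
 * rank trades a large, equally sized block with every other rank.
 * Buffer sizes here are placeholders, not CGYRO's. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Hypothetical per-destination block: 1M doubles (~8 MB per peer). */
    const int block = 1 << 20;
    double *sendbuf = malloc((size_t)block * nranks * sizeof(double));
    double *recvbuf = malloc((size_t)block * nranks * sizeof(double));

    /* ... fill sendbuf with the locally owned slice of the field ... */

    /* The communication-heavy step: each rank sends `block` doubles to
     * every other rank and receives the same amount back.  On multi-node
     * systems its cost is set by node-to-node bandwidth. */
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}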
Tested systems
• Google Cloud (GCP) a2-megagpu-16g instance – 16x A100, NVLINK-only
• NERSC HPC Perlmutter (Phase 1) – 4x A100 per node, NVLINK + 2x 100 Gbps
• ORNL HPC Summit – 6x V100 per node, NVLINK + 2x 100 Gbps
• NERSC HPC Cori – CPU-only Intel Xeon Phi KNL nodes, 40 Gbps
Detailed running environment described in the paper.
Measured the runtime of the mainstream nl03 benchmark input.
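A crude way to compare the systems listed above is to divide each node's NIC bandwidth by the number of compute chips it hosts. The short C sketch below does just that, using only the figures from the list; it deliberately ignores the very different per-chip compute speeds (A100 vs V100 vs KNL), so treat it as a rough proxy for the interconnect-to-compute ratio discussed later, not the paper's metric.

/* Rough proxy: node-to-node NIC bandwidth per compute chip, from the
 * configurations listed above.  Per-chip compute differences are ignored. */
#include <stdio.h>

int main(void)
{
    struct { const char *name; double nic_gbps; int chips; } sys[] = {
        { "GCP a2-megagpu-16g", 0.0,       16 },  /* single node, NVLINK-only */
        { "Perlmutter Phase 1", 2 * 100.0,  4 },
        { "Summit",             2 * 100.0,  6 },
        { "Cori (KNL)",         40.0,       1 },
    };

    for (int i = 0; i < 4; i++) {
        if (sys[i].nic_gbps == 0.0)
            printf("%-20s  all traffic stays on NVLINK inside the node\n", sys[i].name);
        else
            printf("%-20s  %.1f Gbps per compute chip\n",
                   sys[i].name, sys[i].nic_gbps / sys[i].chips);
    }
    return 0;
}

Per chip the numbers look comparable, but a KNL delivers only a small fraction of an A100's compute (the next chart shows 16 A100s matching roughly 200 KNL nodes), so per unit of compute Cori has far more network bandwidth; that imbalance is what the following slides quantify.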
Looking at compute speed alone
[Bar chart: compute time per CGYRO nl03 step for Summit 32x6 V100, Summit 16x6 V100, Perlmutter 16x4 A100, Perlmutter 8x4 A100, GCP 1x16 A100, Cori 128x1 KNL, and Cori 256x1 KNL]
• Single GCP node as fast as 200 Cori nodes
• Almost ideal scaling as we add more A100s
• V100 about half A100 speed
Actual CGYRO speed
[Bar chart: total time per CGYRO nl03 step, split into compute-only and total time, for the same configurations as above]
• Single GCP node faster than 8x Perlmutter nodes
• Single GCP node faster than 16x Summit nodes
• CGYRO is completely communication bound on multi-node GPU HPC systems
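The compute-only vs. total-time split shown above can be reproduced with very simple instrumentation: time the local work and the collective separately with MPI_Wtime. The sketch below is a self-contained stand-in, not CGYRO's actual timer code; the dummy workload (a sin loop), the step count, and the buffer sizes are placeholders.

/* Sketch of splitting "compute-only" time from "total time" per step.
 * Not CGYRO's instrumentation; workload and sizes are placeholders. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int count = 1 << 18;                 /* per-peer doubles (made up) */
    const long n = (long)count * nranks;
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (long i = 0; i < n; i++) sendbuf[i] = (double)i;

    const int nsteps = 10;
    double t_compute = 0.0, t_total = 0.0;

    for (int s = 0; s < nsteps; s++) {
        double t0 = MPI_Wtime();

        for (long i = 0; i < n; i++)           /* stand-in for the local work */
            sendbuf[i] = sin(sendbuf[i]);

        double t1 = MPI_Wtime();

        MPI_Alltoall(sendbuf, count, MPI_DOUBLE,   /* the communication-heavy part */
                     recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD);

        double t2 = MPI_Wtime();
        t_compute += t1 - t0;
        t_total   += t2 - t0;
    }

    if (rank == 0)
        printf("per step: compute %.3f s, total %.3f s, fraction in compute %.0f%%\n",
               t_compute / nsteps, t_total / nsteps, 100.0 * t_compute / t_total);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

The printed "fraction in compute" is the metric shown in the charts that follow.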
Fraction in compute
[Bar chart: fraction of time spent in compute for GCP, Perlmutter Phase 1, Summit, and Cori; annotated values 58% and 25%]
The higher interconnect-to-compute ratio makes the single GCP instance much faster than 8 Perlmutter nodes,
even though it has half the A100 GPUs. (We would need about 12 Perlmutter nodes to reach parity,
i.e. 3x the number of GPUs: 12 nodes x 4 A100s = 48 GPUs vs. 16 in the GCP instance.)
Fraction in compute
[Same chart as above]
NVLINK-connected A100s in GCP have the same interconnect-to-compute ratio as
Cori's Cray Aries-connected Intel Xeon Phi KNLs.
While it takes more nodes, the CPU-based Cori behaves similarly to the single-node GCP instance.
Fraction in compute
[Same chart as above; annotated speed ratios 1.9x, 1.5x, and 1.25x]
Doubling the network capacity about as effective as adding 50% more GPUs.
Hypothesis confirmed
• We confirm that networking has not been keeping up with compute in recent GPU-based HPC systems
• This clearly hurts CGYRO, and likely other communication-heavy HPC applications, too
• GPUs provide a great compute benefit for CGYRO
• 16 A100 GPUs as fast as ~200 high-end KNL CPUs
Desiderata
• High-interconnect GPU-based HPC systems desired for CGYRO
(and likely other communication-heavy HPC applications)
• Large-GPU-count nodes (like GCP) great when the problem fits
• And many mainstream CGYRO problems do
• But only GCP has them; we could not find one on-prem!
• Larger problems will have to settle for lower compute efficiencies
• Unless system providers increase the interconnect throughput
• Adding more NICs would help,
but we look forward to external NVLINK switches, too
Acknowledgements
• This work was partially supported by
• The U.S. Department of Energy under awards DE-FG02-95ER54309,
DE-FC02-06ER54873 (Edge Simulation Laboratory) and
DE-SC0017992 (AToM SciDAC-4 project).
• The US National Science Foundation (NSF) Grant OAC-1826967.
• An award of computer time was provided by the INCITE program.
• This research used resources of the Oak Ridge Leadership Computing
Facility, which is an Office of Science User Facility supported under
Contract DE-AC05-00OR22725.
• Computing resources were also provided by the National Energy Research
Scientific Computing Center, which is an Office of Science User Facility
supported under Contract DE-AC02-05CH11231.