ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Early application experiences on Summit
Wayne Joubert
Scientific Computing Group
Oak Ridge Leadership Computing Facility
3rd OpenPOWER Academia Discussion Group Workshop
Nov. 10, 2018
2
Summit – background
Officially launched June 8, 2018
World’s fastest supercomputer
Peak speed 200 PF
#1 on TOP500, @ 122.3 PF, June 2018
#1 Green500 level-3 measured system
#1 on HPCG benchmark
Used by 5 out of 6 Gordon Bell Finalist teams
Achieved world’s first ExaOp calculation by an
application, @ 2.36 ExaOps (FP16 ExaFlops)
Not yet officially accepted, but already achieving
impressive results on conventional science and
machine learning applications
3 Slide courtesy Jack Wells
4 Slide courtesy Jack Wells
5
Summit early users
• 1,080 compute nodes have been available to
users since December 2017; the system was
subsequently built up to its present 4,608 nodes
• Used by 13 CAAR teams (Center
for Accelerated Application
Readiness)
• 65 Letters of Intent submitted for the Summit Early
Science program – these teams were allowed on
Summit for application readiness work
• Gordon Bell teams (5)
• System Acceptance Test team –
preparations for final system
acceptance testing
Graphic courtesy Tjerk Straatsma
6
Summit early science applicants
• Received 65 LOIs in January, 47 full
proposals in June
• Awardees will be among the first users
to get access to Summit after
acceptance
• Notably, 12 of the 65 LOIs (~ 20%)
had a machine learning component –
remarkable growth in a short period of
time
• Tremendous interest in running on
Summit, from Early Science as well as
2019 INCITE projects (announcement
Monday)
7
Summit Gordon Bell Teams
Slide courtesy Jack Wells
8
Summit Gordon Bell Finalist Projects
• CoMet team used Tensor Cores to achieve 2.36 ExaOps performance on a
comparative genomics application
• Prabhat’s LBL team, deep learning application, 1.13 ExaOps peak and 0.999
ExaOps sustained performance for identification of extreme weather patterns
from high-resolution climate simulation data
• University of Tokyo team used AI and transprecision computing for
earthquake simulation
• ORNL / Robert Patton team, MENNDL code, 152 PetaOps analyzing atomic-
level materials properties from electron microscopy data
• LBL-led team using LQCD code with mixed precision multigrid solver to
study the physics of subatomic particles
• Full presentations @ SC18 sessions, Wed. 3:30-5:00, Thu. 10:30-12:00
9
Summit: first impressions
• Our early experience with Summit is that it is an
extremely powerful system
– Very strong GPUs
– Apps are often getting a higher fraction of peak than on
Titan – improvements to GPU hardware, software over time
– New features useful for some apps, e.g., Tensor Cores,
NVMe devices
– Low-congestion fat tree interconnect with adaptive routing
• Many apps have gotten impressive results already
• The early system was somewhat rough around the
edges, with a number of issues we have had to
work through with the vendors
• The system has progressively gotten better as all
parties have been working through the issues
CAAR application performance to
date – number of nodes scaled to
(out of 4,608 Summit nodes) and
performance vs. CPU-only
From “Early Application Results on Summit,” T.P.
Straatsma, Smoky Mountain Conference 2018
10
Summit node performance
• Summit nodes are achieving a high percentage of theoretical
peak across their key performance characteristics
• For details see Vazhkudai et al., “The Design, Deployment,
and Evaluation of the CORAL Pre-Exascale Systems,” @
SC18, Wed. 3:30PM
11
Summit node performance: CPU memory subsystem
• Using the Stream benchmark to measure CPU memory bandwidth
• Theoretical peak 340 GB/sec, actual ~ 275 GB/sec, ~ 82% of peak
• Significant boost from previous Titan, JaguarPF nodes
SUMMIT: peak 170 x 2 = 340 GB/sec, actual ~ 275 GB/sec (~ 82% of peak)
TITAN: peak 25.6 x 2 = 51.2 GB/sec, actual ~ 34 GB/sec (~ 67% of peak)
JAGUARPF: peak 25.6 GB/sec, actual ~ 19 GB/sec (~ 75% of peak)
12
Summit node performance: GPU HBM memory
• Theoretical peak bandwidth of 900 GB/sec
• Measured performance from GPU Stream benchmark: 789 (Copy), 788 (Mul), 831 (Add and
Triad) GB/sec, representing 88%-92% of peak.
• Compares extremely well to Titan K20X, ~ 181 GB/sec out of 250 GB/sec peak (72%)
• Innovations were made in the GPU memory to achieve a higher fraction of peak performance
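The HBM numbers above come from the GPU Stream benchmark. As a rough illustration (not the actual benchmark code), a minimal CUDA triad kernel timed with CUDA events gives the same kind of bandwidth estimate; the array size and repetition count here are assumptions.

  // stream_triad.cu -- minimal GPU STREAM-style triad sketch (illustrative, not the benchmark itself)
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void triad(double* a, const double* b, const double* c, double s, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + s * c[i];
  }

  int main() {
    const size_t n = 1ul << 28;                  // ~268M doubles per array (assumed size, ~6.4 GB total)
    double *a, *b, *c;
    cudaMalloc(&a, n * sizeof(double));
    cudaMalloc(&b, n * sizeof(double));
    cudaMalloc(&c, n * sizeof(double));
    cudaMemset(b, 0, n * sizeof(double));
    cudaMemset(c, 0, n * sizeof(double));

    const unsigned int nblocks = (unsigned int)((n + 255) / 256);
    triad<<<nblocks, 256>>>(a, b, c, 2.0, n);    // warm-up
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    const int reps = 20;
    cudaEventRecord(t0);
    for (int r = 0; r < reps; ++r) triad<<<nblocks, 256>>>(a, b, c, 2.0, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0; cudaEventElapsedTime(&ms, t0, t1);

    // Triad moves three arrays (two reads, one write) per repetition.
    double gbytes = 3.0 * n * sizeof(double) * reps / 1e9;
    printf("Triad bandwidth: %.1f GB/sec\n", gbytes / (ms / 1e3));
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
  }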
13
Summit node performance: CPU-GPU NVLINK
• Previously relied on much slower PCIe-2 connection on Titan
• On-socket transfer rates reach 92% and 86% of the 50 and 100 GB/sec peaks (unidirectional / bidirectional)
• Off-socket transfers go through the X-Bus, are slower
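For illustration only, a host-to-device bandwidth check along the lines below (pinned host memory, large transfers) is the kind of measurement behind these NVLink numbers; the transfer size and repetition count are assumptions, and this is not the actual test harness used.

  // h2d_bandwidth.cu -- illustrative host-to-device bandwidth check (not the actual test used)
  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
    const size_t nbytes = 1ul << 30;             // 1 GB transfer (assumed size)
    void *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, nbytes);                  // pinned host memory for full interconnect bandwidth
    cudaMalloc(&d, nbytes);

    cudaMemcpy(d, h, nbytes, cudaMemcpyHostToDevice);   // warm-up
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    const int reps = 10;
    cudaEventRecord(t0);
    for (int r = 0; r < reps; ++r) cudaMemcpy(d, h, nbytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0; cudaEventElapsedTime(&ms, t0, t1);

    printf("Host-to-device bandwidth: %.1f GB/sec\n", (double)nbytes * reps / 1e9 / (ms / 1e3));
    cudaFreeHost(h); cudaFree(d);
    return 0;
  }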
14
Summit node performance: Infiniband interconnect
• Node-to-node bandwidth, latency measured using IMB benchmark
• Achieving 22.36 GB/sec unidirectional and 44.29 GB/sec bidirectional, out of peaks of 25 and 50
GB/sec respectively, for sufficiently large messages
• ~ 89% of theoretical peak
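The numbers above come from the Intel MPI Benchmarks (IMB). A minimal ping-pong sketch of the same unidirectional measurement is shown below for illustration; the message size and repetition count are assumptions, and this is not the IMB code.

  // pingpong.cpp -- minimal node-to-node bandwidth sketch (illustrative, not the IMB benchmark)
  #include <mpi.h>
  #include <vector>
  #include <cstdio>

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 1 << 26;                  // 64 MB message (assumed "sufficiently large")
    const int reps = 100;
    std::vector<char> buf(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; ++r) {
      if (rank == 0) {
        MPI_Send(buf.data(), nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf.data(), nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      } else if (rank == 1) {
        MPI_Recv(buf.data(), nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf.data(), nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
      }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
      double t_msg = (t1 - t0) / (2.0 * reps);   // time for one one-way message
      printf("Unidirectional bandwidth: %.2f GB/sec\n", nbytes / t_msg / 1e9);
    }
    MPI_Finalize();
    return 0;
  }

Running the two ranks on different nodes gives the node-to-node figure.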
15
GTC application (CAAR; acceptance test code)
• GTC (Gyrokinetic Toroidal Code) is a particle-in-cell (PIC) application
to simulate magnetically confined plasmas in fusion reactors such as
ITER
• Written in Fortran 90
• Accelerated primarily with OpenACC
• OpenMP acceleration of CPU-only parts; also an OpenMP code
version for Intel Xeon Phi
• Project personnel: Zhihong Lin (co-PI), William Tang (co-PI), Jian
Bao, Wayne Joubert, Matthew Niemerg, Lei Shi, Sam Taimourzadeh,
Bei Wang, Peng Wang, Wenlu Zhang
• http://phoenix.ps.uci.edu/gtc_group
16
GTC application: experiences porting to Summit GPUs
• Expensive particle push and charge loops are mapped to the GPU using OpenACC, with data kept
persistent on the GPUs (see the OpenACC sketch after this list)
• (aside: a number of codes since 2012 have used OpenACC on Titan and Summit, including
several Summit CAAR codes. Some codes are now starting to use OpenMP 4, e.g., PSDNS on
Titan)
• “Shift” operation — moving particles between MPI ranks — uses highly optimized custom CUDA code
for high performance, taking advantage of OpenACC/CUDA interoperability
• Poisson field solver
– original code used PETSc ILU(0)+GMRES sparse solver (CPU-only)
– now uses NVIDIA’s AMGX algebraic multigrid solver using GPUs, > 20X faster
– also an option to use the Hypre algebraic multigrid solver (GPU support in development)
• A build option exists to use GPU Unified Memory; originally it was much slower than explicit transfers,
but performance is now near parity thanks to PGI compiler improvements
• GTC has significant I/O requirements. The GTC I/O behaviors uncovered some issues with the
Summit GPFS file system, which were addressed
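GTC itself is Fortran; the C-style sketch below is only meant to illustrate the OpenACC pattern referenced in the first bullet above, i.e., a particle push loop with data kept persistent on the device across time steps. The names and the loop body are illustrative, not GTC's.

  // openacc_push.cpp -- illustrative OpenACC pattern with persistent device data (not GTC code)
  #include <vector>

  void push_particles(std::vector<double>& x, std::vector<double>& v,
                      const std::vector<double>& efield, double dt, int nsteps) {
    const int n = (int)x.size();
    double* xp = x.data();
    double* vp = v.data();
    const double* ep = efield.data();

    // Keep particle data resident on the GPU across all time steps,
    // instead of copying it back and forth around every kernel launch.
    #pragma acc data copy(xp[0:n], vp[0:n]) copyin(ep[0:n])
    for (int step = 0; step < nsteps; ++step) {
      #pragma acc parallel loop
      for (int i = 0; i < n; ++i) {
        vp[i] += dt * ep[i];       // simplified "push" (placeholder physics)
        xp[i] += dt * vp[i];
      }
    }
  }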
17
GTC results
Weak scaling to 4500 nodes of Summit
18
CoMet application (INCITE, Early Science, Gordon Bell)
CoMet = Combinatorial Metrics code
A new biosciences application used to find genomic features within a
population
Not a “traditional” modeling and simulation code (e.g., continuum PDE
solver, PIC, Monte Carlo, etc.)
Also is not a deep learning app per se, though is part of an AI workflow
Best described as a data analytics application used in comparative
genomics studies
Gordon Bell Finalist -- see talk Thurs 11:30 AM
19
CoMet application
The primary computation is an all-to-all comparison of vectors
Computationally similar to a distributed DGEMM operation, as in the
ScaLAPACK library and PBLAS — very computationally intensive, but
also requires communication of very large matrices
Written in C++, uses CUDA, cuBLAS and modified MAGMA calls
Uses explicit calls for both asynchronous MPI point-to-point messages
and asynchronous CPU/GPU transfers, with pipelining to overlap
compute and transfer
OpenMP threading is used for CPU work, done concurrently with GPU
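The pipelining just described can be sketched as a double-buffered loop in which the receive of the next block overlaps the host-to-device copy and GPU compute of the current block. This is only a schematic of the pattern, not CoMet's code; the buffer layout, block handling, and placeholder kernel are assumptions (pinned host buffers and a matching sender are also assumed).

  // pipeline.cu -- schematic overlap of MPI receives, host-to-device copies, and GPU compute
  #include <mpi.h>
  #include <cuda_runtime.h>

  __global__ void process(const double* block, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = block[i] * block[i];     // placeholder compute
  }

  // h_buf: two pinned host buffers; d_buf: two device buffers; d_out: per-block output slices.
  void pipelined_recv_compute(double* h_buf[2], double* d_buf[2], double* d_out,
                              int block_len, int num_blocks, int src_rank) {
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);
    MPI_Request req[2];

    // Prime the pipeline: post the receive for block 0.
    MPI_Irecv(h_buf[0], block_len, MPI_DOUBLE, src_rank, 0, MPI_COMM_WORLD, &req[0]);

    for (int b = 0; b < num_blocks; ++b) {
      const int cur = b % 2, nxt = (b + 1) % 2;

      // Wait for the current block's receive to complete.
      MPI_Wait(&req[cur], MPI_STATUS_IGNORE);

      // Before reusing the other host buffer, make sure any copy still reading it has
      // finished; then post the next receive so communication overlaps this block's work.
      if (b + 1 < num_blocks) {
        cudaStreamSynchronize(stream[nxt]);
        MPI_Irecv(h_buf[nxt], block_len, MPI_DOUBLE, src_rank, b + 1, MPI_COMM_WORLD, &req[nxt]);
      }

      // Asynchronously stage the current block to the GPU and process it on its own stream.
      cudaMemcpyAsync(d_buf[cur], h_buf[cur], block_len * sizeof(double),
                      cudaMemcpyHostToDevice, stream[cur]);
      process<<<(block_len + 255) / 256, 256, 0, stream[cur]>>>(
          d_buf[cur], d_out + (size_t)b * block_len, block_len);
    }
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
  }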
20
CoMet algorithm: Custom Correlation Coefficient (CCC)
Used to analyze allele data from a genome, encoded as 2-bit vector elements
Base implementation uses bitwise operations (AND, OR, NOT, shift, mask,
__popcll, etc.) to operate on this binary allelic data
[Figure: two example vectors v1 and v2 composed of 2-bit entries; all combinations of bits from the
left and right vector elements are taken and tallied into a 2x2 table representing how the two
vectors are related]
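For illustration, the per-element tally behind this figure can be written with population counts as in the sketch below. This is a deliberately simplified, unpacked version; CoMet's actual kernels pack many 2-bit elements per 64-bit word and use __popcll plus masking, and are far more optimized.

  // ccc_tally.cu -- simplified per-element sketch of the CCC 2x2 tally (illustrative, not CoMet's kernel)

  // Each vector element is a 2-bit allele value stored in the low bits of a byte.
  // For a single vector pair, tally[2*i+j] accumulates, over all elements, the number of
  // (bit of v1 element == i) x (bit of v2 element == j) combinations.
  __global__ void ccc_tally(const unsigned char* v1, const unsigned char* v2, int n,
                            unsigned long long* tally) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= n) return;
    const int a = v1[e] & 0x3;
    const int b = v2[e] & 0x3;
    const int a1 = __popc(a);      // number of 1-bits in the 2-bit element of v1
    const int a0 = 2 - a1;         // number of 0-bits
    const int b1 = __popc(b);
    const int b0 = 2 - b1;
    atomicAdd(&tally[0], (unsigned long long)(a0 * b0));   // (0,0) table entry
    atomicAdd(&tally[1], (unsigned long long)(a0 * b1));   // (0,1) table entry
    atomicAdd(&tally[2], (unsigned long long)(a1 * b0));   // (1,0) table entry
    atomicAdd(&tally[3], (unsigned long long)(a1 * b1));   // (1,1) table entry
  }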
21
CCC method: mapping to Tensor Cores
• Each vector is replaced by two
vectors, each containing the number
of 0s and 1s of each element of the
original vector, forming a new matrix
of vectors V
• Then taking the dense matrix-matrix
product V^T V generates all 2x2 tables
for all vector pairs
• HGEMM applied using call to
cublasGemmEx in cuBLAS library,
gives identical result to original
method
[Figure: an original vector of 2-bit entries is replaced by two FP16 vectors holding, for each entry,
the number of 0 bits and the number of 1 bits]
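A minimal sketch of the kind of cublasGemmEx call involved is shown below: FP16 inputs with FP32 accumulation, computing V^T V so that the result matrix holds the table entries for all vector pairs. The matrix layout, the FP32 output type, and the explicit math-mode call are assumptions of this sketch rather than a description of CoMet's exact code.

  // hgemm_tc.cu -- illustrative cublasGemmEx call using Tensor Cores (FP16 inputs, FP32 accumulate)
  #include <cublas_v2.h>
  #include <cuda_fp16.h>

  // Compute C = V^T * V for an FP16 matrix V of size k x n (column-major), as in the
  // CCC-to-GEMM mapping: C then holds the pairwise table entries for all vector pairs.
  void vt_v_product(cublasHandle_t handle, const __half* d_V, float* d_C, int k, int n) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);    // allow Tensor Core paths
    cublasGemmEx(handle,
                 CUBLAS_OP_T, CUBLAS_OP_N,               // C = V^T * V
                 n, n, k,
                 &alpha,
                 d_V, CUDA_R_16F, k,                     // A = V (k x n), lda = k
                 d_V, CUDA_R_16F, k,                     // B = V (k x n), ldb = k
                 &beta,
                 d_C, CUDA_R_32F, n,                     // C (n x n), accumulated in FP32
                 CUDA_R_32F,                             // compute type
                 CUBLAS_GEMM_ALGO4_TENSOR_OP);           // algorithm setting noted later in the talk
  }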
22
CoMet performance
• Achieved 2.36 ExaOps (mixed
precision ExaFlops) at 4,560
nodes (99% of Summit) using the
Tensor Cores
• Near-perfect scaling made
possible by Summit’s Mellanox
Infiniband fat tree network with
adaptive routing
• Equivalent to 86.4 TF per GPU
for the whole computation
(including communications and
transfers)
• > 4X faster than original bitwise
implementation on Summit GPUs
W. Joubert, J. Nance, D. Weighill, D. Jacobson, “Parallel Accelerated Vector Similarity Calculations
for Genomics Applications,” Parallel Computing, vol. 75, July 2018, pp. 130-145,
https://www.sciencedirect.com/science/article/pii/S016781911830084X
W. Joubert, J. Nance, S. Climer, D. Weighill, D. Jacobson, “Parallel Accelerated Custom Correlation
Coefficient Calculations for Genomics Applications,” arxiv 1705.08213 [cs], Parallel Computing,
accepted.
Wayne Joubert, Deborah Weighill, David Kainer, Sharlee Climer, Amy Justice, Kjiersten Fagnan,
Daniel Jacobson, “Attacking the Opioid Epidemic: Determining the Epistatic and Pleiotropic Genetic
Architectures for Chronic Pain and Opioid Addiction,” SC18, Gordon Bell finalist, to appear.
23
Summit Power Consumption
• 2-way CCC/sp/tc @
4560 nodes
• Summit power usage
for 1 out of 4 phases of
the run, duration ~ 50
sec.
• Avg power: 11.45 MW
(20% higher than HPL)
• 206 GigaOps / Watt
24
Issues / challenges of using Tensor Cores
• Matrices for this problem are tall and skinny – axis order had to be reversed to
give shorter leading matrix dimension for better TC performance (about 2X faster)
(thanks to Sean Treichler of NVIDIA for suggestion)
• HGEMM performance as a function of matrix size is irregular, hard to precisely
predict – performed extensive timing tests with Baidu DeepBench benchmark to
try to understand – advisable to pad up to a multiple of a small power of 2 (e.g., 8,
16, 32) – however too much padding will be wasteful
• There are many tuning options for HGEMM (~16 choices for the algorithm setting)
– determined CUBLAS_GEMM_ALGO4_TENSOR_OP was the best – would prefer if
default setting would give this performance (hoping for improvements with CUDA
10)
• TC/HGEMM has surprising data-dependent performance: 125 TF theoretical
peak, 113 TF achievable on zero-filled matrices, 105 TF peak on random CCC
matrices, ~95 TF peak on matrices with fully random FP16 entries
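As a hedged illustration of the padding and algorithm-selection points above, a helper like the following could round dimensions up and time individual algorithm settings with cublasGemmEx; the padding multiple, matrix shapes, and timing approach are assumptions, not the procedure actually used.

  // gemm_tuning.cu -- illustrative padding helper and per-algorithm timing for cublasGemmEx
  #include <cublas_v2.h>
  #include <cuda_fp16.h>
  #include <cuda_runtime.h>

  // Round a matrix dimension up to a multiple of `mult` (e.g., 8, 16, or 32).
  static int pad_up(int n, int mult) { return ((n + mult - 1) / mult) * mult; }

  // Time one FP16-input GEMM with a given algorithm setting (matrices already allocated/padded).
  static float time_gemm(cublasHandle_t h, const __half* A, const __half* B, float* C,
                         int m, int n, int k, cublasGemmAlgo_t algo) {
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                 A, CUDA_R_16F, m, B, CUDA_R_16F, k, &beta,
                 C, CUDA_R_32F, m, CUDA_R_32F, algo);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0; cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
  }

Looping algo over CUBLAS_GEMM_DEFAULT_TENSOR_OP and CUBLAS_GEMM_ALGO0_TENSOR_OP through CUBLAS_GEMM_ALGO15_TENSOR_OP and keeping the fastest is one way to arrive at a choice such as CUBLAS_GEMM_ALGO4_TENSOR_OP.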
25
Issues
• Measurements on 1 Summit GPU
using nvidia-smi
• Data-dependent performance of
Tensor Cores is due to 300W
power/frequency throttling of Voltas
on Summit
• Baidu DeepBench GEMM
benchmark has a bug (reported),
incorrectly fills FP16 matrices with
zeros instead of the intended
random values, thus miscomputes
GPU performance
26
Reduced precision: other possible opportunities
• We are starting to look at other opportunities for using reduced
precision for science calculations on Summit
• In the past scientists have had accuracy concerns and usually
required double precision
– E.g., S3D combustion code, 2009 paper found single precision not adequate
• New hardware (16X faster HGEMM than DGEMM) may call for a
second look
– ICL/Dongarra group already developing iterative refinement dense solver
using Tensor Cores (see talk @ SC18, Wed. 4PM)
– Deep learning projects already seeing high rates, e.g., peak 1.13 ExaOps
– Previous work on reduced precision iterative solvers e.g., Turner/Walker 1992
paper on reduced precision GMRES sparse solver
– Need to carefully evaluate on a case-by-case basis
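For a concrete picture of the iterative refinement idea mentioned above, the schematic below solves in reduced precision and corrects in double precision. It uses a naive single-precision Gaussian elimination purely as a stand-in for a fast reduced-precision (e.g., FP16 Tensor Core) factorization; it is not the ICL/Dongarra implementation.

  // iterative_refinement.cpp -- schematic mixed-precision iterative refinement (illustrative)
  #include <vector>
  #include <cmath>

  // Solve A x = b by naive Gaussian elimination in float, standing in for a fast
  // reduced-precision factorization. A is n x n, row-major; no pivoting (sketch only).
  static std::vector<float> solve_lowprec(std::vector<float> A, std::vector<float> b) {
    const int n = (int)b.size();
    for (int k = 0; k < n; ++k)
      for (int i = k + 1; i < n; ++i) {
        float m = A[i * n + k] / A[k * n + k];
        for (int j = k; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
        b[i] -= m * b[k];
      }
    std::vector<float> x(n);
    for (int i = n - 1; i >= 0; --i) {
      float s = b[i];
      for (int j = i + 1; j < n; ++j) s -= A[i * n + j] * x[j];
      x[i] = s / A[i * n + i];
    }
    return x;
  }

  // Iterative refinement: reduced-precision correction solves, double-precision residuals.
  std::vector<double> refine(const std::vector<double>& A, const std::vector<double>& b,
                             int n, int max_iters = 10, double tol = 1e-12) {
    std::vector<float> Af(A.begin(), A.end());
    std::vector<double> x(n, 0.0);
    for (int it = 0; it < max_iters; ++it) {
      // r = b - A x, computed in double precision.
      std::vector<double> r(n);
      double rnorm = 0.0;
      for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int j = 0; j < n; ++j) s -= A[i * n + j] * x[j];
        r[i] = s;
        rnorm += s * s;
      }
      if (std::sqrt(rnorm) < tol) break;
      // Correction solve in reduced precision, solution update in double.
      std::vector<float> rf(r.begin(), r.end());
      std::vector<float> d = solve_lowprec(Af, rf);
      for (int i = 0; i < n; ++i) x[i] += (double)d[i];
    }
    return x;
  }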
27
Summit: general comments on user experiences
• The most common execution configuration on Summit is one MPI rank owning one GPU
and some CPU cores (as on Titan), though some codes are using other
configurations, and no doubt users will experiment with still others
• Have requested jsrun options that would allow arbitrary execution configuration
on nodes—some users absolutely need this flexibility, e.g., 2 apps need
nonuniform resource sets for master/slave execution
• Earlier saw long jsrun/MPI init times on Summit, especially at large node/rank
counts. This has improved considerably.
• Earlier Spectrum MPI beta versions we received had never been run at such high
node counts—various issues encountered and bugs filed—IBM has worked to
address
28
Summit: general comments
• We would prefer more vendor tuning of third party libraries, as we have had in
the past. IBM does give us some optimized build instructions for third party
libraries.
• A more general concern regarding the broader industry: every new HPC system
we get has more complex node hardware and software stack. We hope HPC
vendors very carefully manage this complexity. Users want and need advanced
hardware features but also need reliable, coherent software to access them.
• Similarly, users mix programming models, e.g., MPI, OpenMP, OpenACC,
Kokkos, etc., sometimes in complex ways. We need interoperability
and coherency between these (for example: can an OpenMP thread launch an
OpenACC kernel?)
29
Summit: general comments
• GPU high-speed atomic update operations of Volta (and Pascal) have made a
huge impact on some applications
• Unified memory and automatic migration of data to the GPU are very helpful for some codes—
e.g., codes with deep data structures. However, some users will prefer manual
control of the exact timing of transfers for performance (see the sketch after this list).
• Most codes that run at our center also run at other sites. Use of vendor-specific
libraries or features that give higher performance may be avoided by some users
to maintain portability. We prefer standards-based approaches when possible.
• MPS will be used by some, but can add complexity, e.g., need to manage CUDA
mem handles. Also, MPS adds to the myriad of complexities to manage (resource
sets, ranks per node, SMT mode, NUMA domains, thread binding, GPU binding,
etc.).
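To make the unified-memory trade-off above concrete, the sketch below contrasts a managed allocation (automatic migration) with an explicit copy where the user controls transfer timing; the sizes and the kernel are illustrative.

  // managed_vs_explicit.cu -- illustrative contrast of unified memory vs. explicit transfers
  #include <vector>
  #include <cuda_runtime.h>

  __global__ void scale(double* x, double s, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
  }

  void run_managed(size_t n) {
    double* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(double));   // pages migrate to the GPU on demand
    for (size_t i = 0; i < n; ++i) x[i] = 1.0;   // first touched on the CPU
    scale<<<(unsigned int)((n + 255) / 256), 256>>>(x, 2.0, n);
    cudaDeviceSynchronize();                     // x is directly usable on the CPU again
    cudaFree(x);
  }

  void run_explicit(size_t n) {
    std::vector<double> h(n, 1.0);
    double* d = nullptr;
    cudaMalloc(&d, n * sizeof(double));
    // The user decides exactly when data moves, which some codes prefer for performance.
    cudaMemcpy(d, h.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    scale<<<(unsigned int)((n + 255) / 256), 256>>>(d, 2.0, n);
    cudaMemcpy(h.data(), d, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d);
  }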
30
Summit: general comments
• We like having multiple compilers for risk mitigation, but there may not be any single
compiler satisfying all requirements for a project, e.g., OpenACC, fast CPU code
generation, etc. Also, Fortran is important, used by slightly under half of projects
(INCITE 2014).
• Features like RDMA and GPUDirect are important to users. RDMA is needed by at
least one library (ExaTensor) used by 3 CAAR projects.
• Because of Titan, we have already optimized many of our codes for CPU-GPU
interconnect bandwidth (overlapped transfers, data persistence on GPUs) and latency
(large transfers, longer-running kernels). However, some users still need to run many
kernels, e.g., QMCPACK, thus still rely on low-latency kernel launch.
• Inter-node messages come in many possible sizes depending on the app, e.g., halo exchanges (e.g.,
S3D-Legion), large (~ 1 GB) messages (ScaLAPACK, CoMet, SLATE), and small latency-
limited messages (climate codes)—teams will work to optimize each of these cases.
31
Conclusions
• Summit has shown itself a very powerful system for multiple
applications so far
• We have worked with IBM and other partners to resolve issues
• We are looking forward to the new science that Summit will make
possible in the near future
32
This research used resources of the Oak Ridge Leadership
Computing Facility at the Oak Ridge National Laboratory, which
is supported by the Office of Science of the U.S. Department of
Energy under Contract No. DE-AC05-00OR22725.
Questions?
Wayne Joubert
joubert@ornl.gov
