NVIDIA HPC Software at a Glance
NARUHIKO TAN | HPC SOLUTION ARCHITECT
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
PROGRAMMING THE NVIDIA PLATFORM
CPU, GPU, and Network
ACCELERATED STANDARD LANGUAGES
ISO C++, ISO Fortran
PLATFORM SPECIALIZATION
CUDA
ACCELERATION LIBRARIES
Core | Communication | Math | Data Analytics | AI | Quantum
std::transform(par, x, x+n, y, y,
[=](float x, float y){ return y +
a*x; }
);
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
import cunumeric as np
…
def saxpy(a, x, y):
y[:] += a*x
#pragma acc data copy(x,y) {
...
std::transform(par, x, x+n, y, y,
[=](float x, float y){
return y + a*x;
});
...
}
#pragma omp target data map(x,y) {
...
std::transform(par, x, x+n, y, y,
[=](float x, float y){
return y + a*x;
});
...
}
__global__
void saxpy(int n, float a,
float *x, float *y) {
int i = blockIdx.x*blockDim.x +
threadIdx.x;
if (i < n) y[i] += a*x[i];
}
int main(void) {
...
cudaMemcpy(d_x, x, ...);
cudaMemcpy(d_y, y, ...);
saxpy<<<(N+255)/256,256>>>(...);
cudaMemcpy(y, d_y, ...);
ACCELERATED STANDARD LANGUAGES
ISO C++, ISO Fortran
INCREMENTAL PORTABLE OPTIMIZATION
OpenACC, OpenMP
PLATFORM SPECIALIZATION
CUDA
NVIDIA HPC SDK
Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud
Develop for the NVIDIA Platform: GPU, CPU and Interconnect
Libraries | Accelerated C++ and Fortran | Directives | CUDA
7-8 Releases Per Year | Freely Available
Compilers: nvcc, nvc, nvc++, nvfortran
Programming Models: Standard C++ & Fortran, OpenACC & OpenMP, CUDA
Core Libraries: libcu++, Thrust, CUB
Math Libraries: cuBLAS, cuTENSOR, cuSPARSE, cuSOLVER, cuFFT, cuRAND
Communication Libraries: HPC-X (SHARP, HCOLL, UCX, SHMEM, MPI), NVSHMEM, NCCL
Profilers: Nsight Systems, Nsight Compute
Debugger: cuda-gdb (host and device)
ACCELERATED COMPUTING WITH
STANDARD LANGUAGES
§ C++
PILLARS OF STANDARD LANGUAGE PARALLELISM
Copyright (C) 2021 Bryce Adelstein Lelbach
Common Algorithms that Dispatch to
Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel
Algorithms that Run Anywhere
sender auto
algorithm (sender auto s) {
return s | bulk(N,
[] (auto data) {
// ...
}
) | bulk(N,
[] (auto data) {
// ...
}
);
}
Mechanisms for Composing Parallel
Invocations into Task Graphs
sender auto
algorithm (sender auto s) {
return s | bulk(
[] (auto data) {
// ...
}
) | bulk(
[] (auto data) {
// ...
}
);
}
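The first pillar above, common algorithms that dispatch to vendor-optimized parallel libraries, is plain ISO C++. A minimal sketch, assuming nvc++ with -stdpar=gpu (the sizes and values are illustrative, not from the deck):

```cpp
// Dot product via a C++17 parallel algorithm; with nvc++ -stdpar=gpu the call
// below can be offloaded, with other compilers it runs on CPU threads.
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
  std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);
  float dot = std::transform_reduce(std::execution::par_unseq,
                                    x.begin(), x.end(), y.begin(), 0.0f);
  std::printf("dot = %f\n", dot);
  return 0;
}
```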
C++ with OpenMP
Ø Composable, compact and elegant
Ø Easy to read and maintain
Ø ISO Standard
Ø Portable – nvc++, g++, icpc, MSVC, …
Standard C++
#pragma omp parallel // OpenMP parallel region
{
#pragma omp for // OpenMP for loop
for (MInt i = 0; i < noCells; i++) { // Loop over all cells
if (timeStep % ipow2[maxLevel_ - clevel[i * distLevel]] == 0) { // Multi-grid loop
const MInt distStartId = i * nDist; // More offsets for 1D accesses // Local offsets
const MInt distNeighStartId = i * distNeighbors;
const MFloat* const distributionsStart = &distributions[distStartId];
for (MInt j = 0; j < nDist - 1; j += 2) { // Unrolled loop over distributions (factor 2)
if (neighborId[i * distNeighbors + j] > -1) { // First unrolled iteration
const MInt n1StartId = neighborId[distNeighStartId + j] * nDist;
oldDistributions[n1StartId + j] = distributionsStart[j]; // 1D access AoS format
}
if (neighborId[i * distNeighbors + j + 1] > -1) { // Second unrolled iteration
const MInt n2StartId = neighborId[distNeighStartId + j + 1] * nDist;
oldDistributions[n2StartId + j + 1] = distributionsStart[j + 1];
}
}
oldDistributions[distStartId + lastId] = distributionsStart[lastId]; // Zero-th distribution
}
}
}
std::for_each_n(par_unseq, start, noCells, [=](auto i) { // Parallel for
if (timeStep % IPOW2[maxLevel_ - a_level(i)] != 0) // Multi-level loop
return;
for (MInt j = 0; j < nDist; ++j) {
if (auto n = c_neighborId(i, j); n == -1) continue;
a_oldDistribution(n, j) = a_distribution(i, j); // SoA or AoS mem_fn
}
});
M-AIA WITH C++17 PARALLEL ALGORITHMS
Multi-physics simulation framework
from RWTH Aachen University
Ø Hierarchical grids, complex moving geometries
Ø Adaptive meshing, load balancing
Ø Numerical methods: FV, DG, LBM, FEM, Level-Set, ...
Ø Physics: aeroacoustics, combustion, biomedical, ...
Ø Developed by ~20 PhDs (Mech. Eng.), ~500k LOC++
Ø Programming model: MPI + ISO C++ parallelism
M-AIA
Multi-physics simulation framework developed at the Institute of Aerodynamics, RWTH Aachen University
Decaying isotropic turbulence
400k fully-resolved particles
[Chart: Relative speed-up: OpenMP (2x EPYC 7742) = 1.0, ISO C++ (2x EPYC 7742) = 1.025, ISO C++ (A100) = 8.74]
PARALLELISM IN C++ ROADMAP
C++ 14 C++ 17 C++ 20 C++ PIPELINE
• Memory model
enhancements
• Lambdas
• Atomics
extensions
• Generic Lambda
Expressions
• Parallel algorithms
• Forward progress
guarantees
• Memory model
clarifications
• Scalable
synchronization
library
• Ranges
• Span
• Linear algebra
algorithms
• Asynchronous
parallel algorithms
• Senders-receivers
• Mdspan
• Range-based
parallel algorithms
• Extended floating-
point types
General parallelism user facing feature
How users run C++
code on GPUs today
Co-designed with
V100 hardware
support
Custom algorithms
and async.
control flow
N-dimensional
loops and
usability
Extended C++ interface
to BLAS/Lapack
General usability
of performance
provided by executors
C++ 11
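Of the pipeline items listed above, mdspan has landed in C++23. A minimal sketch, assuming a standard library that provides <mdspan> (the reference implementation ships it as std::experimental::mdspan):

```cpp
// std::mdspan: a non-owning multi-dimensional view over a flat buffer (C++23).
#include <cstdio>
#include <mdspan>
#include <vector>

int main() {
  std::vector<double> buf(3 * 4, 0.0);
  std::mdspan a(buf.data(), 3, 4);            // view buf as a 3x4 matrix
  for (std::size_t i = 0; i < a.extent(0); ++i)
    for (std::size_t j = 0; j < a.extent(1); ++j)
      a[i, j] = double(i) * 10 + double(j);   // C++23 multi-argument operator[]
  std::printf("a[2, 3] = %g\n", a[2, 3]);
  return 0;
}
```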
PILLARS OF STANDARD LANGUAGE PARALLELISM
Copyright (C) 2021 Bryce Adelstein Lelbach
With Senders & Receivers
Today
Common Algorithms that Dispatch to
Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel
Algorithms that Run Anywhere
sender auto
algorithm (sender auto s) {
return s | bulk(N,
[] (auto data) {
// ...
}
) | bulk(N,
[] (auto data) {
// ...
}
);
}
Mechanisms for Composing Parallel
Invocations into Task Graphs
sender auto
algorithm (sender auto s) {
return s | bulk(
[] (auto data) {
// ...
}
) | bulk(
[] (auto data) {
// ...
}
);
}
SENDERS & RECEIVERS
Maxwell’s equations
template <typename ComputeSchedulerT, typename WriteSchedulerT>
auto maxwell_eqs(ComputeSchedulerT &scheduler, WriteSchedulerT &writer)
{
return repeat_n(
n_outer_iterations,
repeat_n(
n_inner_iterations,
schedule(scheduler)
| bulk(grid.cells, update_h(accessor))
| bulk(grid.cells, update_e(time, dt, accessor)))
| transfer(writer)
| then(dump_results(report_step, accessor)))
| then([]{ printf("simulation complete\n"); })
);
}
Simplify Work Across CPUs and
Accelerators
• Uniform abstraction between code and
diverse resources
• ISO standard
• Write once, run everywhere
ELECTROMAGNETISM
Raw performance & % of peak
std::sync_wait(maxwell(inline_scheduler, inline_scheduler));
std::sync_wait(maxwell(openmp_scheduler, inline_scheduler));
std::sync_wait(maxwell(cuda, inline_scheduler));
§ CPUs: AMD EPYC 7742 CPUs, GPUs: NVIDIA A100-SXM4-80
§ Inline (1 CPU HW thread), OpenMP-128 (1x CPU), OpenMP-256 (2x CPUs), Graph (1x GPU), Multi-GPU (2x GPUs)
§ clang-12 with -O3 -DNDEBUG -mtune=native -fopenmp
[Charts: Speedup by scheduler, and efficiency vs. STREAM TRIAD by scheduler, for OpenMP 128, OpenMP 256, CUDA (1 A100), and CUDA (2 A100)]
STRONG SCALING USING ISO STANDARD C++
NVIDIA SUPERPOD
§ 140x NVIDIA DGX A100 640GB
§ 1120x NVIDIA A100-SXM4-80 GPUs
[Chart: Speedup and efficiency vs. number of GPUs (0 to 1200): Maxwell S&R scaling against ideal scaling]
PARALLEL ALGORITHMS AND SENDERS & RECEIVERS
PALABOS CARBON SEQUESTRATION
Copyright (C) 2022 NVIDIA
Ø Palabos is a framework for fluid dynamics simulations using
Lattice-Boltzmann methods.
Ø Code for multi-component flow through a porous media
ported to C++ Senders and Receivers.
Ø Application: simulating carbon sequestration in sandstone.
Christian Huber (Brown University), Jonas Latt (University of Geneva)
Georgy Evtushenko (NVIDIA), Gonzalo Brito (NVIDIA)
[Chart: Strong scaling on 32 to 512 A100 GPUs]
ACCELERATED COMPUTING WITH
STANDARD LANGUAGES
§ FORTRAN
MODERN FORTRAN FEATURES FOR HPC
Standard Parallelism and Concurrency Features
DO CONCURRENT Reductions
Support for reduction operations
on concurrent loops (à la OpenACC/OpenMP). Began
supporting in nvfortran 21.11.
Fortran 202X
Coming in 2023
Atomics
Propose support for atomic
variable accesses
Asynchronous Tasking
Propose support for asynchronous
tasks
Fortran 202Y
In discussion
DO CONCURRENT
Data parallel loop construct, locality
specifiers. Supported in nvfortran
Array Intrinsics
Various math intrinsics that may
apply to entire arrays and map to
accelerated libraries supported in
nvfortran.
Co-Arrays
Partitioned Global Address Space
arrays, teams of processes (images),
collectives & synchronization.
Awaiting F18.
Fortran 2018
MINIWEATHER
Standard Language Parallelism in Climate/Weather Applications
Mini-App written in C++ and Fortran that simulates
weather-like fluid flows using Finite Volume and
Runge-Kutta methods.
Existing parallelization in MPI, OpenMP, OpenACC, …
Included in the SPEChpc benchmark suite*
Open-source and commonly-used in training events.
https://github.com/mrnorman/miniWeather/
MiniWeather
[Chart: MiniWeather runtime comparison: OpenMP (CPU), DO CONCURRENT (CPU), DO CONCURRENT (GPU), OpenACC]
do concurrent (ll=1:NUM_VARS, k=1:nz, i=1:nx)
local(x,z,x0,z0,xrad,zrad,amp,dist,wpert)
if (data_spec_int == DATA_SPEC_GRAVITY_WAVES) then
x = (i_beg-1 + i-0.5_rp) * dx
z = (k_beg-1 + k-0.5_rp) * dz
x0 = xlen/8
z0 = 1000
xrad = 500
zrad = 500
amp = 0.01_rp
dist = sqrt( ((x-x0)/xrad)**2 + ((z-z0)/zrad)**2 ) * pi / 2._rp
if (dist <= pi / 2._rp) then
wpert = amp * cos(dist)**2
else
wpert = 0._rp
endif
tend(i,k,ID_WMOM) = tend(i,k,ID_WMOM) &
+ wpert*hy_dens_cell(k)
endif
state_out(i,k,ll) = state_init(i,k,ll) &
+ dt * tend(i,k,ll)
enddo
Source: HPC SDK 22.1, AMD EPYC 7742, NVIDIA A100. MiniWeather: NX=2000, NZ=1000, SIM_TIME=5.
OpenACC version uses the -gpu=managed option.
*SPEChpc is a trademark of The Standard Performance Evaluation Corporation
POT3D: DO CONCURRENT
POT3D is a Fortran application for approximating solar
coronal magnetic fields.
Included in the SPEChpc benchmark suite*
Existing parallelization in MPI & OpenACC
Optimized the DO CONCURRENT version by using
OpenACC solely for data motion and atomics
https://github.com/predsci/POT3D
POT3D
!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k ))*dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)
Data courtesy of Predictive Science Inc. *SPEChpc is a trademark of The Standard Performance Evaluation Corporation
ACCELERATED COMPUTING WITH
STANDARD LANGUAGES
§ PYTHON
PRODUCTIVITY
Sequential and Composable Code
§ Sequential semantics - no visible
parallelism or synchronization
§ Name-based global data – no partitioning
§ Composable – can combine with other
libraries and datatypes
def cg_solve(A, b, conv_iters):
x = np.zeros_like(b)
r = b - A.dot(x)
p = r
rsold = r.dot(r)
converged = False
max_iters = b.shape[0]
for i in range(max_iters):
Ap = A.dot(p)
alpha = rsold / (p.dot(Ap))
x = x + alpha * p
r = r - alpha * Ap
rsnew = r.dot(r)
if i % conv_iters == 0 and \
np.sqrt(rsnew) < 1e-10:
converged = i
break
beta = rsnew / rsold
p = r + beta * p
rsold = rsnew
PERFORMANCE
§Transparently run at any scale needed to address computational challenges at hand
§Automatically leverage all the available hardware
Transparent Acceleration
Supercomputer
Multi-GPU
GPU
DPU
Grace
CPU
COMPUTATIONAL FLUID DYNAMICS
[Chart: Distributed NumPy performance, weak scaling: time (seconds) vs. number of GPUs (1 to 1024) and relative dataset size, cuPy vs. Legate]
for _ in range(iter):
un = u.copy()
vn = v.copy()
b = build_up_b(rho, dt, dx, dy, u, v)
p = pressure_poisson_periodic(b, nit, p, dx, dy)
…
Extracted from “CFD Python” course at https://github.com/barbagroup/CFDPython
Barba, Lorena A., and Forsyth, Gilbert F. (2018). CFD Python: the 12 steps to Navier-Stokes equations. Journal of
Open Source Education, 1(9), 21, https://doi.org/10.21105/jose.00021
• CFD codes like:
• Shallow-Water Equation Solver
• Oil Pipeline Risk Management: Geoclaw-
landspill simulations
• Python Libraries: Jupyter, NumPy, SciPy,
SymPy, Matplotlib
CFD Python on cuNumeric!
ACCELERATED STANDARD LANGUAGES
Parallel performance for wherever your code runs
std::transform(par, x, x+n, y,
y,[=](float x, float y){
return y + a*x;
}
);
import cunumeric as np
…
def saxpy(a, x, y):
y[:] += a*x
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
ISO C++ ISO Fortran Python
CPU GPU
nvc++ -stdpar=multicore
nvfortran -stdpar=multicore
legate --cpus 16 saxpy.py
nvc++ -stdpar=gpu
nvfortran -stdpar=gpu
legate --gpus 1 saxpy.py
LEARN MORE
GTC2022 sessions
§ No More Porting: Coding for GPUs with Standard C++, Fortran, and Python [S41496]
§ Shifting through the Gears of GPU Programming Understanding Performance and Portability Trade-offs [S41620]
§ C++ Standard Parallelism [S41960]
§ Future of Standard and CUDA C++ [S41961]
§ Connect with Experts: Standard and CUDA C++ User Forum [CWE41949]
§ From Directives to DO CONCURRENT: A Case Study in Standard Parallelism [S41318]
§ Evaluating Your Options for Accelerated Numerical Computing in Pure Python [S41645]
Blogs
§ Developing Accelerated Code with Standard Language Parallelism
§ Accelerating Standard C++ with GPUs Using stdpar
§ Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK
§ Bringing Tensor Cores to Standard Fortran
§ Accelerating Python on GPUs with nvc++ and Cython
LEARN MORE
Blogs
§ Multi-GPU Programming with Standard Parallel C++, Part 1
§ Multi-GPU Programming with Standard Parallel C++, Part 2
Open-source codes
§ LULESH: https://github.com/LLNL/LULESH
§ STLBM: https://gitlab.com/unigehpfs/stlbm
§ MiniWeather: https://github.com/mrnorman/miniWeather/
§ POT3D: https://github.com/predsci/POT3D
§ Legate: https://github.com/nv-legate
§ Jacobi example using C++ standard parallelism: https://gitlab.com/unigehpfs/paralg
NVIDIA HPC SDK Documentation
https://docs.nvidia.com/hpc-sdk/index.html
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
NVIDIA MATH LIBRARIES
Linear Algebra, FFT, RNG and Basic Math
CUDA Math API
cuFFT
cuSPARSE cuSOLVER
cuBLAS cuTENSOR
cuRAND CUTLASS
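As a concrete single-GPU starting point for the libraries listed above, a minimal cuFFT sketch (error handling omitted; the transform size is illustrative):

```cpp
// 1D complex-to-complex FFT with cuFFT, executed in place on the device.
#include <cuda_runtime.h>
#include <cufft.h>
#include <vector>

int main() {
  const int n = 1024;
  std::vector<cufftComplex> h(n, cufftComplex{1.0f, 0.0f});

  cufftComplex *d = nullptr;
  cudaMalloc(&d, n * sizeof(cufftComplex));
  cudaMemcpy(d, h.data(), n * sizeof(cufftComplex), cudaMemcpyHostToDevice);

  cufftHandle plan;
  cufftPlan1d(&plan, n, CUFFT_C2C, 1);      // one batch of length n
  cufftExecC2C(plan, d, d, CUFFT_FORWARD);  // forward transform, in place
  cudaMemcpy(h.data(), d, n * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

  cufftDestroy(plan);
  cudaFree(d);
  return 0;
}
```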
MATH LIBRARIES
§ MULTI-GPU MATH LIBRARIES
[Chart: Speedup (larger is better) vs. 3D FFT size (64³ to 512³) for 1, 2, 4, and 8 GPUs]
cuFFTXt: MAXIMIZING SINGLE-NODE PERFORMANCE
Speedups for 3D C2C versus CTK 11.0
* A100 80GB Default clocks: CTK 11.0 vs. CTK 11.6
Recently Introduced
§ Up to 10x improvements for single-node multi-GPU (SNMG) FFTs
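For reference, a hedged sketch of the single-node multi-GPU cuFFTXt flow behind these numbers, assuming two visible GPUs and an illustrative problem size (error checks omitted):

```cpp
// Plan a 3D C2C transform across two GPUs and execute it on
// library-distributed buffers via the cuFFTXt descriptor API.
#include <cufft.h>
#include <cufftXt.h>
#include <complex>
#include <vector>

int main() {
  const int nx = 256, ny = 256, nz = 256;
  std::vector<std::complex<float>> h(size_t(nx) * ny * nz, {1.0f, 0.0f});

  cufftHandle plan;
  cufftCreate(&plan);
  int gpus[2] = {0, 1};
  size_t workSizes[2];
  cufftXtSetGPUs(plan, 2, gpus);                     // spread the plan over 2 GPUs
  cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, workSizes);

  cudaLibXtDesc *d = nullptr;
  cufftXtMalloc(plan, &d, CUFFT_XT_FORMAT_INPLACE);  // library-distributed buffer
  cufftXtMemcpy(plan, d, h.data(), CUFFT_COPY_HOST_TO_DEVICE);
  cufftXtExecDescriptorC2C(plan, d, d, CUFFT_FORWARD);
  cufftXtMemcpy(plan, h.data(), d, CUFFT_COPY_DEVICE_TO_HOST);

  cufftXtFree(d);
  cufftDestroy(plan);
  return 0;
}
```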
cuTENSORMg: MULTI-GPU TENSOR CONTRACTIONS
Performance of FP32 Tensor Contractions on DGX A100
Data residing on Host (Dotted) or Device (Solid) Memory
* DGX A100 80GB
§ Introduced in cuTENSOR v1.4
§ Out-of-core released in v1.5
[Chart: TFLOPS (larger is better) vs. contraction size (M = N = K, 4096 to 196608) for 1, 2, 4, and 8 GPUs, with data resident in device (solid) or host (dotted) memory]
Releasing cuTENSOR v1.5
§ Added Out-of-core Functionality
§ Library wide optimizations
cuSOLVERMp: DENSE LINEAR ALGEBRA AT SCALE
LU Decomposition (GETRF+GETRS) w/ Pivoting on Summit Supercomputer
[Chart: Time in seconds (smaller is better) vs. number of GPUs (1 to 4096), comparing the prior state of the art with HPC SDK 21.11]
* Summit: 6x V100 16GB per node
Released in HPC SDK 21.11
§ LU Decomposition
§ With & Without pivoting
§ Cholesky
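cuSOLVERMp distributes this factorization across nodes; as an illustration of the same GETRF+GETRS pattern on a single GPU (not the cuSOLVERMp API), a hedged sketch with the dense cuSOLVER API, error checks omitted and a tiny illustrative system:

```cpp
// Solve A x = b on one GPU: LU with pivoting (getrf) then triangular solves (getrs).
#include <cublas_v2.h>
#include <cusolverDn.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
  const int n = 3, nrhs = 1;
  std::vector<float> A = {4.f, 1.f, 2.f, 1.f, 3.f, 0.f, 2.f, 0.f, 5.f}; // column-major 3x3
  std::vector<float> b = {1.f, 2.f, 3.f};

  float *dA, *dB, *dWork; int *dPiv, *dInfo; int lwork = 0;
  cudaMalloc(&dA, 9 * sizeof(float)); cudaMalloc(&dB, 3 * sizeof(float));
  cudaMalloc(&dPiv, n * sizeof(int)); cudaMalloc(&dInfo, sizeof(int));
  cudaMemcpy(dA, A.data(), 9 * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dB, b.data(), 3 * sizeof(float), cudaMemcpyHostToDevice);

  cusolverDnHandle_t h; cusolverDnCreate(&h);
  cusolverDnSgetrf_bufferSize(h, n, n, dA, n, &lwork);
  cudaMalloc(&dWork, lwork * sizeof(float));
  cusolverDnSgetrf(h, n, n, dA, n, dWork, dPiv, dInfo);                 // LU with pivoting
  cusolverDnSgetrs(h, CUBLAS_OP_N, n, nrhs, dA, n, dPiv, dB, n, dInfo); // solve

  cudaMemcpy(b.data(), dB, 3 * sizeof(float), cudaMemcpyDeviceToHost);  // x overwrites b
  cusolverDnDestroy(h);
  cudaFree(dA); cudaFree(dB); cudaFree(dWork); cudaFree(dPiv); cudaFree(dInfo);
  return 0;
}
```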
cuFFTMp: FFTs AT SCALE - SLAB DECOMPOSITION
Distributed 3D FFT Performance: Comparison by Precision
[Chart: TFLOPS (larger is better) for distributed 3D C2C and Z2Z FFTs, problem sizes 2048³ to 16384³ on 8 to 4096 GPUs; C2C peaks at about 1,860 TFLOPS]
* Selene: A100 80GB @ 1410 MHz
Coming to HPC SDK 22.3
§ Distributed 2D/3D FFTs
§ Slab Decomposition
§ Pencil Decomposition (Preview)
§ Helper functions: Pencils ↔ Slabs
[Chart: TFLOPS (larger is better) vs. problem size (1024³ to 8192³) for GPU counts from 32 to 2048]
cuFFTMp: FFTs AT SCALE - PENCIL DECOMPOSITION
Distributed 3D FFT Performance: C2C Comparison by GPU Count
* Selene: A100 80GB @ 1410 MHz
Coming to HPC SDK 22.3
§ Distributed 2D/3D FFTs
§ Slab Decomposition
§ Pencil Decomposition (Preview)
§ Helper functions: Pencils ↔ Slabs
[S41494] A Deep Dive into the Latest HPC Software
MATH LIBRARIES
§ MATH LIBRARY DEVICE EXTENSIONS
MATH LIBRARIES DEVICE EXTENSIONS
cuFFTDx Performance: Comparison with cuFFT across various sizes
[Chart: TFLOPS (larger is better) vs. 1D FFT size (2 to 32768), cuFFTDx vs. cuFFT]
* A100 80GB @ 1410 MHz
Released in MathDx 22.02
§ Available on DevZone
§ Supports Volta+ architectures
§ FFT 1D sizes up to 32k
Future Releases
§ cuBLASDx/cuSOLVERDx
§ 2D/3D FFTs
§ Windows Support
LEARN MORE
GTC2022 sessions
§ An Explanation of Slab and Pencil Decomposition Performance Across Supercomputing Clusters [S41153]
§ Recent Developments in NVIDIA Math Libraries [S41491]
§ Connect with Experts: NVIDIA Math Libraries [CWE41721]
§ Connect with Experts: Thrust, CUB, and libcu++ User Forum [CWE41948]
§ NVSHMEM: CUDA-Integrated Communication for NVIDIA GPUs (a Magnum IO session) [S41044]
Examples
§ CUDA Library Samples: https://github.com/NVIDIA/CUDALibrarySamples
MathDx 22.02
§ https://developer.nvidia.com/mathdx
LEARN MORE
Math Libraries Documentation
https://docs.nvidia.com/hpc-sdk/index.html#math-libraries
Blog
§ Multinode Multi-GPU: Using NVIDIA cuFFTMp FFTs at Scale
§ Extending Block-Cyclic Tensors for Multi-GPU with NVIDIA cuTENSORMg
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
DEVELOPER TOOLS
Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
Debuggers: cuda-gdb, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
Correctness Checker: Compute Sanitizer
IDE integrations: Nsight Eclipse Edition, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
DEVELOPER TOOLS
§ COMPUTE DEBUGGERS/IDE
CUDA-GDB
Command-Line and IDE Back-End Debugger
§ Unified CPU and CUDA
Debugging
§ CUDA-C/SASS support
§ Built on GDB and uses
many of the same CLI
commands
COMPUTE SANITIZER
Automatically Scan for Bugs and Memory Issues
Compute Sanitizer checks correctness issues via sub-tools:
§ Memcheck – Memory access error and leak detection tool
§ Racecheck – Shared memory data access hazard detection tool
§ Initcheck – Uninitialized device global memory access detection tool
§ Synccheck – Thread synchronization hazard detection tool
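As an illustration of what Memcheck reports, a deliberately broken kernel (hypothetical example, not from the deck); running it under `compute-sanitizer --tool memcheck` flags the out-of-bounds write:

```cpp
// One thread past the end of a 32-element allocation: an invalid
// __global__ write that Memcheck reports with the offending address.
#include <cuda_runtime.h>

__global__ void oob(int *p) { p[threadIdx.x] = 42; }

int main() {
  int *d = nullptr;
  cudaMalloc(&d, 32 * sizeof(int)); // room for 32 ints
  oob<<<1, 33>>>(d);                // 33 threads: thread 32 writes out of bounds
  cudaDeviceSynchronize();
  cudaFree(d);
  return 0;
}
```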
DEVELOPER TOOLS
§ COMPUTE DEBUGGERS/IDE NEW FEATURES
CORRECTNESS TOOLS FEATURES
§OptiX support in Compute Sanitizer
§ Automatically find correctness issues in OptiX workloads
§Core Dump support in Compute Sanitizer
§ Generate core dumps on detected issues
§5x performance increase in core dump
generation
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 1 bytes
========= at 0x4d70 in
/home/cuda/optixBasic/draw_solid_color.cu:69:__raygen__dra
w_solid_color_0xebf766b2f0642d4e
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7f878f900403 is out of bounds
========= and is 262,132 bytes after the nearest
allocation at 0x7f878f8c0400 of size 16 bytes
========= Device Frame:NVIDIA internal [0x430]
========= Saved host backtrace up to driver entry
point at kernel launch time
========= Host Frame: [0x60fbaa]
========= in /lib/x86_64-linux-
gnu/libnvoptix.so.1
========= Host Frame:optix_stubs.h:568:optixLaunch
[0xe1ff]
========= in
/home/cuda/optixBasic/optixBasic
========= Host
Frame:/home/cuda/optixBasic/optixBasic.cpp:227:main
[0xb735]
========= in
/home/cuda/optixBasic/optixBasic
========= Host
Frame:../sysdeps/nptl/libc_start_call_main.h:58:__libc_sta
rt_call_main [0x2dfd0]
========= in /lib/x86_64-linux-
gnu/libc.so.6
========= Host Frame:../csu/libc-
start.c:379:__libc_start_main [0x2e07d]
========= in /lib/x86_64-linux-
gnu/libc.so.6
========= Host Frame: [0x8dde]
========= in
/home/cuda/optixBasic/optixBasic
DEVELOPER TOOLS
§ NSIGHT SYSTEMS
NSIGHT SYSTEMS
System Profiler
Key Features:
§ System-wide application algorithm tuning
§ Multi-process tree support
§ Locate optimization opportunities
§ Visualize millions of events on a very fast GUI timeline
§ Or gaps of unused CPU and GPU time
§ Balance your workload across multiple CPUs and GPUs
§ CPU algorithms, utilization and thread state
GPU streams, kernels, memory transfers, etc.
§ Command line, Standalone, IDE integration
OS: Linux (x86, Power, Arm SBSA, Tegra), Windows, MacOSX (host)
GPUs: Pascal+
§ Docs/product: https://developer.nvidia.com/nsight-systems
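Annotating code with NVTX ranges (listed among the profiling tools earlier) makes application phases visible on the Nsight Systems timeline. A minimal sketch, assuming the NVTX v3 header shipped with the CUDA toolkit:

```cpp
// Mark a named range around a phase of work; nsys shows it on the timeline.
#include <nvtx3/nvToolsExt.h> // NVTX v3 is header-only; older NVTX links -lnvToolsExt

void simulation_step() { /* work to be profiled */ }

int main() {
  nvtxRangePushA("simulation step"); // begin named range
  simulation_step();
  nvtxRangePop();                    // end range
  return 0;
}
```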
DEVELOPER TOOLS
§ NSIGHT SYSTEMS NEW FEATURES
MULTI-REPORT TILING
Visualize More Parallel Activity
Open multiple reports loaded onto the same timeline, aligned by wall-clock time
EXPERT SYSTEMS & STATISTICS
Built-in Data Analytics with Advice
NVIDIA NETWORKING ADAPTER SAMPLING
§ Profile NVIDIA networking adapters
§ Sent / Received /
Congestion
§ Correlate with expected
network traffic and other
system activities
GPU DIRECT STORAGE SUPPORT
GPU Metrics Sampling of PCIe BAR1 Requests & CuFile Trace
§ Direct communication to GPU memory
§ cuFile APIs used for GPUDirect Storage
DEVELOPER TOOLS
§ NSIGHT COMPUTE
NSIGHT COMPUTE
Kernel Profiling Tool
Key Features:
§ Interactive CUDA API debugging and kernel profiling
§ Built-in rules expertise
§ Fully customizable data collection and display
§ Command line, Standalone, IDE integration, Remote targets
OS: Linux (x86, Power, Tegra, Arm SBSA), Windows, MacOSX
(host only)
GPUs: Volta, Turing, Ampere
Docs/product: https://developer.nvidia.com/nsight-compute
DEVELOPER TOOLS
§ NSIGHT COMPUTE NEW FEATURES
REGISTER DEPENDENCY VISUALIZATION
Visualize Register Usage and Dependency Chains
§ SASS view in the Source page
§ Tracking reads and writes for each register
§ Identify long dependency chains
§ Detect inefficient register usage
§ Columns show all dependencies for:
§ Registers
§ Predicates
§ Uniform Registers
§ Uniform Predicates
STANDALONE SOURCE VIEWER
§ View of side-by-side
assembly and correlated
source code for CUDA
kernels
§ No profile required
§ Open .cubin files directly
§ Helps identify compiler
optimizations and
inefficiencies
OCCUPANCY CALCULATOR
Model Hardware Usage and Identify Limiters
§ Model theoretical
hardware usage
§ Understand limitations
from hardware vs.
kernel parameters
§ Configure model to vary
HW and kernel
parameters
§ Opened from an existing
report or as a new
activity
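A programmatic counterpart to this occupancy model is the CUDA runtime occupancy API; a minimal sketch (the kernel and block size are illustrative):

```cpp
// Query how many blocks of a given size can be resident per SM for a kernel.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] += a * x[i];
}

int main() {
  int blocks = 0;
  const int blockSize = 256;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, saxpy, blockSize, 0);
  std::printf("max active blocks per SM at blockSize=%d: %d\n", blockSize, blocks);
  return 0;
}
```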
HIERARCHICAL ROOFLINE
§ Visualize multiple levels of the memory
hierarchy
§ Identify bottlenecks caused by memory
limitations
§ Determine how modifying algorithms may (or
may not) impact performance
LEARN MORE
GTC2022 sessions
§ Optimizing Communication with Nsight Systems Network Profiling [S41500]
§ Latest Updates to CUDA Developer Tools [D4121]
§ How to Understand and Optimize Shared Memory Accesses using Nsight Compute [S41723]
§ Connect with Experts: What’s in Your CUDA Toolbox? Profiling, Optimization, and Debugging Tools [CWE41541]
§ What, Where, and Why? Use CUDA Developer Tools to Detect, Locate, and Explain Bugs and Bottlenecks [S41493]
Nsight Systems Documentation
§ https://docs.nvidia.com/nsight-systems/
Nsight Compute Documentation
§ https://docs.nvidia.com/nsight-compute/
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute