2. AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
4. PROGRAMMING THE NVIDIA PLATFORM
CPU, GPU, and Network
ACCELERATED STANDARD LANGUAGES
ISO C++, ISO Fortran
PLATFORM SPECIALIZATION
CUDA
ACCELERATION LIBRARIES
Core Communication
Math Data Analytics AI Quantum
std::transform(par, x, x+n, y, y,
[=](float x, float y){ return y + a*x; }
);
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
import cunumeric as np
…
def saxpy(a, x, y):
y[:] += a*x
#pragma acc data copy(x,y) {
...
std::transform(par, x, x+n, y, y,
[=](float x, float y){
return y + a*x;
});
...
}
#pragma omp target data map(x,y) {
...
std::transform(par, x, x+n, y, y,
[=](float x, float y){
return y + a*x;
});
...
}
__global__
void saxpy(int n, float a,
float *x, float *y) {
int i = blockIdx.x*blockDim.x +
threadIdx.x;
if (i < n) y[i] += a*x[i];
}
int main(void) {
...
cudaMemcpy(d_x, x, ...);
cudaMemcpy(d_y, y, ...);
saxpy<<<(N+255)/256,256>>>(...);
cudaMemcpy(y, d_y, ...);
ACCELERATED STANDARD LANGUAGES
ISO C++, ISO Fortran
INCREMENTAL PORTABLE OPTIMIZATION
OpenACC, OpenMP
PLATFORM SPECIALIZATION
CUDA
5. NVIDIA HPC SDK
Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud
Develop for the NVIDIA Platform: GPU, CPU and Interconnect
Libraries | Accelerated C++ and Fortran | Directives | CUDA
7-8 Releases Per Year | Freely Available
DEVELOPMENT
Compilers: nvcc, nvc, nvc++, nvfortran
Programming Models: Standard C++ & Fortran, OpenACC & OpenMP, CUDA
Core Libraries: libcu++, Thrust, CUB
Math Libraries: cuBLAS, cuTENSOR, cuSPARSE, cuSOLVER, cuFFT, cuRAND
Communication Libraries: HPC-X (MPI, UCX, SHMEM, SHARP, HCOLL), NVSHMEM, NCCL
ANALYSIS
Profilers: Nsight Systems, Nsight Compute
Debugger: cuda-gdb (Host, Device)
7. PILLARS OF STANDARD LANGUAGE PARALLELISM
Copyright (C) 2021 Bryce Adelstein Lelbach
Common Algorithms that Dispatch to
Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel
Algorithms that Run Anywhere
sender auto
algorithm (sender auto s) {
return s | bulk(N,
[] (auto data) {
// ...
}
) | bulk(N,
[] (auto data) {
// ...
}
);
}
Mechanisms for Composing Parallel
Invocations into Task Graphs
sender auto
algorithm (sender auto s) {
return s | bulk(
[] (auto data) {
// ...
}
) | bulk(
[] (auto data) {
// ...
}
);
}
8. C++ with OpenMP
Ø Composable, compact and elegant
Ø Easy to read and maintain
Ø ISO Standard
Ø Portable – nvc++, g++, icpc, MSVC, …
Standard C++
#pragma omp parallel // OpenMP parallel region
{
#pragma omp for // OpenMP for loop
for (MInt i = 0; i < noCells; i++) { // Loop over all cells
if (timeStep % ipow2[maxLevel_ - clevel[i * distLevel]] == 0) { // Multi-grid loop
const MInt distStartId = i * nDist; // Local offsets
const MInt distNeighStartId = i * distNeighbors; // More offsets for 1D accesses
const MFloat* const distributionsStart = &distributions[distStartId];
for (MInt j = 0; j < nDist - 1; j += 2) { // Unrolled loop distributions (factor 2)
if (neighborId[i * distNeighbors + j] > -1) { // First unrolled iteration
const MInt n1StartId = neighborId[distNeighStartId + j] * nDist;
oldDistributions[n1StartId + j] = distributionsStart[j]; // 1D access AoS format
}
if (neighborId[i * distNeighbors + j + 1] > -1) { // Second unrolled iteration
const MInt n2StartId = neighborId[distNeighStartId + j + 1] * nDist;
oldDistributions[n2StartId + j + 1] = distributionsStart[j + 1];
}
}
oldDistributions[distStartId + lastId] = distributionsStart[lastId]; // Zero-th distribution
}
}
}
std::for_each_n(par_unseq, start, noCells, [=](auto i) { // Parallel for
if (timeStep % IPOW2[maxLevel_ - a_level(i)] != 0) // Multi-level loop
return;
for (MInt j = 0; j < nDist; ++j) {
auto n = c_neighborId(i, j); // Declared outside the if so it stays in scope below
if (n == -1) continue;
a_oldDistribution(n, j) = a_distribution(i, j); // SoA or AoS mem_fn
}
});
M-AIA WITH C++17 PARALLEL ALGORITHMS
Multi-physics simulation framework
from RWTH Aachen University
Ø Hierarchical grids, complex moving geometries
Ø Adaptive meshing, load balancing
Ø Numerical methods: FV, DG, LBM, FEM, Level-Set, ...
Ø Physics: aeroacoustics, combustion, biomedical, ...
Ø Developed by ~20 PhDs (Mech. Eng.), ~500k LOC++
Ø Programming model: MPI + ISO C++ parallelism
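The push-style propagation in the parallel loop above can also be sketched with array operations. The following is a minimal NumPy illustration, not M-AIA code: `propagate`, the zero-initialised output, and the 4-cell neighbor table are assumptions made for the example, and the multi-level time-step check is left out.

```python
import numpy as np

def propagate(distributions, neighbor_id):
    """Push distribution j of cell i to cell neighbor_id[i, j].

    An entry of -1 in neighbor_id means "no neighbor"; that value is skipped.
    Unlike the slide's in-place update, the output starts from zeros (assumption).
    """
    old = np.zeros_like(distributions)
    cells, dirs = np.nonzero(neighbor_id >= 0)
    old[neighbor_id[cells, dirs], dirs] = distributions[cells, dirs]
    return old

# Hypothetical 4-cell line with 2 directions: j=0 pushes right, j=1 pushes left.
neighbor_id = np.array([[1, -1], [2, 0], [3, 1], [-1, 2]])
distributions = np.arange(8, dtype=float).reshape(4, 2)
old_distributions = propagate(distributions, neighbor_id)
```

Each (cell, direction) pair is written independently of the others, which is the property that lets the C++ version run under `par_unseq`.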
9. M-AIA
Multi-physics simulation framework developed at the Institute of Aerodynamics, RWTH Aachen University
Decaying isotropic turbulence
400k fully-resolved particles
[Chart: relative speed-up. OpenMP (2x EPYC 7742): 1; ISO C++ (2x EPYC 7742): 1.025; ISO C++ (A100): 8.74]
10. PARALLELISM IN C++ ROADMAP
C++ 11 | C++ 14 | C++ 17 | C++ 20 | C++ PIPELINE
• Memory model enhancements
• Lambdas
• Atomics extensions
• Generic Lambda Expressions
• Parallel algorithms
• Forward progress guarantees
• Memory model clarifications
• Scalable synchronization library
• Ranges
• Span
• Linear algebra algorithms
• Asynchronous parallel algorithms
• Senders-receivers
• Mdspan
• Range-based parallel algorithms
• Extended floating-point types
Annotations:
General parallelism user facing feature
How users run C++ code on GPUs today
Co-designed with V100 hardware support
Custom algorithms and async. control flow
N-dimensional loops and usability
Extended C++ interface to BLAS/Lapack
General usability of performance provided by executors
11. PILLARS OF STANDARD LANGUAGE PARALLELISM
With Senders & Receivers
Today
Common Algorithms that Dispatch to
Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel
Algorithms that Run Anywhere
sender auto
algorithm (sender auto s) {
return s | bulk(N,
[] (auto data) {
// ...
}
) | bulk(N,
[] (auto data) {
// ...
}
);
}
Mechanisms for Composing Parallel
Invocations into Task Graphs
sender auto
algorithm (sender auto s) {
return s | bulk(
[] (auto data) {
// ...
}
) | bulk(
[] (auto data) {
// ...
}
);
}
12. SENDERS & RECEIVERS
Maxwell’s equations
template <class ComputeSchedulerT, class WriteSchedulerT>
auto maxwell_eqs(ComputeSchedulerT &scheduler, WriteSchedulerT &writer)
{
return repeat_n(
n_outer_iterations,
repeat_n(
n_inner_iterations,
schedule(scheduler)
| bulk(grid.cells, update_h(accessor))
| bulk(grid.cells, update_e(time, dt, accessor)))
| transfer(writer)
| then(dump_results(report_step, accessor)))
| then([]{ printf("simulation complete\n"); });
}
Simplify Work Across CPUs and Accelerators
• Uniform abstraction between code and diverse resources
• ISO standard
• Write once, run everywhere
14. STRONG SCALING USING ISO STANDARD C++
NVIDIA SUPERPOD
§ 140x NVIDIA DGX-A100 640
§ 1120x NVIDIA A100-SXM4-80 GPUs
[Chart: strong scaling. x-axis: Number of GPUs (0 to 1200); y-axes: Speedup (0 to 40) and Efficiency (0 to 1.2); series: Maxwell SR Scaling, Ideal Scaling, Efficiency]
PARALLEL ALGORITHMS AND SENDERS & RECEIVERS
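The three series in the chart (measured speedup, ideal speedup, efficiency) are related by simple arithmetic. A minimal sketch follows, assuming speedup is taken relative to the smallest GPU count; the function name and the timings are hypothetical and are not the measurements behind the slide.

```python
def strong_scaling(gpus, seconds):
    """Return (speedup, ideal, efficiency) lists relative to the first entry.

    speedup[i]    = t_0 / t_i
    ideal[i]      = gpus_i / gpus_0
    efficiency[i] = speedup[i] / ideal[i]
    """
    base_gpus, base_time = gpus[0], seconds[0]
    speedup = [base_time / t for t in seconds]
    ideal = [g / base_gpus for g in gpus]
    efficiency = [s / i for s, i in zip(speedup, ideal)]
    return speedup, ideal, efficiency

# Hypothetical wall-clock times for a fixed problem size.
gpus = [32, 64, 128]
seconds = [100.0, 52.0, 28.0]
speedup, ideal, efficiency = strong_scaling(gpus, seconds)
```

An efficiency close to 1 means the measured curve tracks the ideal line; it drops as communication and fixed overheads start to dominate.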
15. PALABOS CARBON SEQUESTRATION
Copyright (C) 2022 NVIDIA
Ø Palabos is a framework for fluid dynamics simulations using
Lattice-Boltzmann methods.
Ø Code for multi-component flow through porous media
ported to C++ Senders and Receivers.
Ø Application: simulating carbon sequestration in sandstone.
Christian Huber (Brown University), Jonas Latt (University of Geneva)
Georgy Evtushenko (NVIDIA), Gonzalo Brito (NVIDIA)
[Chart: Strong Scaling. x-axis: A100 GPUs (32 to 512); y-axis ticks 0 to 16]
17. MODERN FORTRAN FEATURES FOR HPC
Standard Parallelism and Concurrency Features
Fortran 2018
DO CONCURRENT
Data parallel loop construct, locality specifiers. Supported in nvfortran.
Array Intrinsics
Various math intrinsics that may apply to entire arrays and map to accelerated libraries supported in nvfortran.
Co-Arrays
Partitioned Global Address Space arrays, teams of processes (images), collectives & synchronization. Awaiting F18.
Fortran 202X (coming in 2023)
DO CONCURRENT Reductions
Support for reduction operations on concurrent loops (à la OpenACC/OpenMP). Supported in nvfortran since 21.11.
Fortran 202Y (in discussion)
Atomics
Proposed support for atomic variable accesses
Asynchronous Tasking
Proposed support for asynchronous tasks
18. MINIWEATHER
Standard Language Parallelism in Climate/Weather Applications
Mini-App written in C++ and Fortran that simulates
weather-like fluid flows using Finite Volume and
Runge-Kutta methods.
Existing parallelization in MPI, OpenMP, OpenACC, …
Included in the SPEChpc benchmark suite*
Open-source and commonly-used in training events.
https://github.com/mrnorman/miniWeather/
MiniWeather
[Chart: MiniWeather results for OpenMP (CPU), Concurrent (CPU), Concurrent (GPU), OpenACC; axis ticks 0 to 20]
do concurrent (ll=1:NUM_VARS, k=1:nz, i=1:nx) &
local(x,z,x0,z0,xrad,zrad,amp,dist,wpert)
if (data_spec_int == DATA_SPEC_GRAVITY_WAVES) then
x = (i_beg-1 + i-0.5_rp) * dx
z = (k_beg-1 + k-0.5_rp) * dz
x0 = xlen/8
z0 = 1000
xrad = 500
zrad = 500
amp = 0.01_rp
dist = sqrt( ((x-x0)/xrad)**2 + ((z-z0)/zrad)**2 ) * pi / 2._rp
if (dist <= pi / 2._rp) then
wpert = amp * cos(dist)**2
else
wpert = 0._rp
endif
tend(i,k,ID_WMOM) = tend(i,k,ID_WMOM) &
+ wpert*hy_dens_cell(k)
endif
state_out(i,k,ll) = state_init(i,k,ll) &
+ dt * tend(i,k,ll)
enddo
Source: HPC SDK 22.1, AMD EPYC 7742, NVIDIA A100. MiniWeather: NX=2000, NZ=1000, SIM_TIME=5.
OpenACC version uses -gpu=managed option.
*SPEChpc is a trademark of The Standard Performance Evaluation Corporation
19. POT3D: DO CONCURRENT
POT3D is a Fortran application for approximating solar
coronal magnetic fields.
Included in the SPEChpc benchmark suite*
Existing parallelization in MPI & OpenACC
Optimized the DO CONCURRENT version by using
OpenACC solely for data motion and atomics
https://github.com/predsci/POT3D
POT3D
!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k ))*dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)
Data courtesy of Predictive Science Inc. *SPEChpc is a trademark of The Standard Performance Evaluation Corporation
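For comparison with the Python material later in the deck, the same radial difference can be written as a whole-array expression. This is a sketch only: it assumes `dr_i` holds inverse grid spacings (suggested by the name, not stated on the slide), puts the radial index on axis 0, and uses a made-up test field; it is not POT3D code.

```python
import numpy as np

def radial_field(phi, dr_i):
    """br(i,j,k) = (phi(i+1,j,k) - phi(i,j,k)) * dr_i(i) along axis 0."""
    return np.diff(phi, axis=0) * dr_i[:, None, None]

# Hypothetical grid sizes; np_ avoids clashing with the numpy alias.
nr, nt, np_ = 5, 3, 2
r = np.linspace(1.0, 2.0, nr)
phi = np.broadcast_to((r ** 2)[:, None, None], (nr, nt, np_)).copy()
dr_i = 1.0 / np.diff(r)  # assumed meaning: inverse radial spacing
br = radial_field(phi, dr_i)
```

With phi = r**2 the differences reduce to r(i+1) + r(i), which gives a quick sanity check on the indexing.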
21. PRODUCTIVITY
Sequential and Composable Code
§ Sequential semantics - no visible
parallelism or synchronization
§ Name-based global data – no partitioning
§ Composable – can combine with other
libraries and datatypes
def cg_solve(A, b, conv_iters):
    x = np.zeros_like(b)
    r = b - A.dot(x)
    p = r
    rsold = r.dot(r)
    converged = False
    max_iters = b.shape[0]
    for i in range(max_iters):
        Ap = A.dot(p)
        alpha = rsold / (p.dot(Ap))
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r.dot(r)
        if i % conv_iters == 0 and \
                np.sqrt(rsnew) < 1e-10:
            converged = i
            break
        beta = rsnew / rsold
        p = r + beta * p
        rsold = rsnew
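A self-contained way to try the solver is sketched below. It uses plain NumPy as a stand-in for cuNumeric, adds a `return x` and a `tol` parameter that the slide does not show, and builds a small random symmetric positive-definite system; all of these are assumptions for illustration.

```python
import numpy as np  # stand-in; the slide's code assumes "import cunumeric as np"

def cg_solve(A, b, conv_iters=1, tol=1e-10):
    """Conjugate gradient for a symmetric positive-definite A (sketch)."""
    x = np.zeros_like(b)
    r = b - A.dot(x)
    p = r
    rsold = r.dot(r)
    for i in range(b.shape[0]):
        Ap = A.dot(p)
        alpha = rsold / p.dot(Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r.dot(r)
        # Only test convergence every conv_iters iterations, as on the slide.
        if i % conv_iters == 0 and np.sqrt(rsnew) < tol:
            break
        p = r + (rsnew / rsold) * p
        rsold = rsnew
    return x

# Hypothetical well-conditioned SPD test system.
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8.0 * np.eye(8)
b = rng.standard_normal(8)
x = cg_solve(A, b)
```

Checking convergence only every `conv_iters` iterations matters more on a distributed runtime, where reading a scalar back to the control code can force a synchronization.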
22. PERFORMANCE
§ Transparently run at any scale needed to address computational challenges at hand
§ Automatically leverage all the available hardware
Transparent Acceleration
Supercomputer
Multi-GPU
GPU
DPU
Grace
CPU
23. COMPUTATIONAL FLUID DYNAMICS
[Chart: Distributed NumPy Performance (weak scaling). x-axis: relative dataset size / number of GPUs (1 to 1024); y-axis: time in seconds (0 to 150); series: cuPy, Legate]
for _ in range(iter):
    un = u.copy()
    vn = v.copy()
    b = build_up_b(rho, dt, dx, dy, u, v)
    p = pressure_poisson_periodic(b, nit, p, dx, dy)
    …
Extracted from “CFD Python” course at https://github.com/barbagroup/CFDPython
Barba, Lorena A., and Forsyth, Gilbert F. (2018). CFD Python: the 12 steps to Navier-Stokes equations. Journal of
Open Source Education, 1(9), 21, https://doi.org/10.21105/jose.00021
• CFD codes like:
• Shallow-Water Equation Solver
• Oil Pipeline Risk Management: Geoclaw-landspill simulations
• Python Libraries: Jupyter, NumPy, SciPy, SymPy, Matplotlib
CFD Python on cuNumeric!
24. ACCELERATED STANDARD LANGUAGES
Parallel performance for wherever your code runs
std::transform(par, x, x+n, y,
y,[=](float x, float y){
return y + a*x;
}
);
import cunumeric as np
…
def saxpy(a, x, y):
y[:] += a*x
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
ISO C++ ISO Fortran Python
CPU GPU
nvc++ -stdpar=multicore
nvfortran -stdpar=multicore
legate --cpus 16 saxpy.py
nvc++ -stdpar=gpu
nvfortran -stdpar=gpu
legate --gpus 1 saxpy.py
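A possible `saxpy.py` for the Python column might look like the sketch below. The fallback to NumPy when cuNumeric is not installed, the array size, and the input values are assumptions added so the script also runs on a plain Python installation.

```python
try:
    import cunumeric as np  # used when the script is launched through Legate
except ImportError:
    import numpy as np      # fallback, assumed here for illustration only

def saxpy(a, x, y):
    """In-place y <- a*x + y, as on the slide."""
    y[:] += a * x

n = 1_000_000
x = np.ones(n, dtype=np.float32)
y = np.full(n, 2.0, dtype=np.float32)
saxpy(3.0, x, y)
```

The source stays the same for CPU and GPU runs; only the launch command changes.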
25. LEARN MORE
GTC2022 sessions
§ No More Porting: Coding for GPUs with Standard C++, Fortran, and Python [S41496]
§ Shifting through the Gears of GPU Programming: Understanding Performance and Portability Trade-offs [S41620]
§ C++ Standard Parallelism [S41960]
§ Future of Standard and CUDA C++ [S41961]
§ Connect with Experts: Standard and CUDA C++ User Forum [CWE41949]
§ From Directives to DO CONCURRENT: A Case Study in Standard Parallelism [S41318]
§ Evaluating Your Options for Accelerated Numerical Computing in Pure Python [S41645]
Blogs
§ Developing Accelerated Code with Standard Language Parallelism
§ Accelerating Standard C++ with GPUs Using stdpar
§ Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK
§ Bringing Tensor Cores to Standard Fortran
§ Accelerating Python on GPUs with nvc++ and Cython
26. LEARN MORE
Blogs
§ Multi-GPU Programming with Standard Parallel C++, Part 1
§ Multi-GPU Programming with Standard Parallel C++, Part 2
Open-source codes
§ LULESH: https://github.com/LLNL/LULESH
§ STLBM: https://gitlab.com/unigehpfs/stlbm
§ MiniWeather: https://github.com/mrnorman/miniWeather/
§ POT3D: https://github.com/predsci/POT3D
§ Legate: https://github.com/nv-legate
§ Jacobi example using C++ standard parallelism: https://gitlab.com/unigehpfs/paralg
NVIDIA HPC SDK Documentation
https://docs.nvidia.com/hpc-sdk/index.html
27. AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
28. NVIDIA MATH LIBRARIES
Linear Algebra, FFT, RNG and Basic Math
CUDA Math API
cuFFT
cuSPARSE cuSOLVER
cuBLAS cuTENSOR
cuRAND CUTLASS
31. cuTENSORMg: MULTI-GPU TENSOR CONTRACTIONS
Performance of FP32 Tensor Contractions on DGX A100
Data residing on Host (Dotted) or Device (Solid) Memory
* DGX A100 80GB
§ Introduced in cuTENSOR v1.4
§ Out-of-core released in v1.5
[Chart: FP32 tensor contraction throughput in TFLOPS (larger is better, 0 to 160) vs. sizes M = N = K (4096 to 196608); series: 1, 2, 4, and 8 GPUs, each with data on Device or Host]
Releasing cuTENSOR v1.5
§ Added Out-of-core Functionality
§ Library wide optimizations
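The TFLOPS axis can be related to problem size with the usual operation count. The sketch below assumes the benchmarked contraction is equivalent to an M x K by K x N product costing about 2*M*N*K floating-point operations; it uses NumPy's `einsum` on the CPU as a stand-in and does not call cuTENSORMg.

```python
import time
import numpy as np

def contraction_tflops(m, n, k, seconds):
    """Throughput in TFLOPS, assuming about 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e12

# Small hypothetical size so the CPU run stays short.
m = n = k = 256
a = np.random.default_rng(0).random((m, k), dtype=np.float32)
b = np.random.default_rng(1).random((k, n), dtype=np.float32)

start = time.perf_counter()
c = np.einsum("mk,kn->mn", a, b)  # contract over the shared index k
elapsed = time.perf_counter() - start
rate = contraction_tflops(m, n, k, elapsed)
```

Under this count, a contraction with M = N = K = 4096 that took one second would correspond to roughly 0.137 TFLOPS.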
32. cuSOLVERMp: DENSE LINEAR ALGEBRA AT SCALE
LU Decomposition (GETRF+GETRS) w/ Pivoting on Summit Supercomputer
[Chart: LU decomposition time in seconds (smaller is better, 1 to 2048) vs. number of GPUs (1 to 4096); series: State-of-the-Art, HPC SDK 21.11]
* Summit: 6x V100 16GB per node
Released in HPC SDK 21.11
§ LU Decomposition
§ With & Without pivoting
§ Cholesky
36. MATH LIBRARIES DEVICE EXTENSIONS
cuFFTDx Performance: Comparison with cuFFT across various sizes
[Chart: TFLOPS (larger is better, 0 to 25) vs. 1D FFT sizes (2 to 32768); series: cuFFTDx, cuFFT]
* A100 80GB @ 1410 MHz
Released in MathDx 22.02
§ Available on DevZone
§ Support Volta+ architecture
§ FFT 1D sizes up to 32k
Future Releases
§ cuBLASDx/cuSOLVERDx
§ 2D/3D FFTs
§ Windows Support
37. LEARN MORE
GTC2022 sessions
§ An Explanation of Slab and Pencil Decomposition Performance Across Supercomputing Clusters [S41153]
§ Recent Developments in NVIDIA Math Libraries [S41491]
§ Connect with Experts: NVIDIA Math Libraries [CWE41721]
§ Connect with Experts: Thrust, CUB, and libcu++ User Forum [CWE41948]
§ NVSHMEM: CUDA-Integrated Communication for NVIDIA GPUs (a Magnum IO session) [S41044]
Examples
§ CUDA Library Samples: https://github.com/NVIDIA/CUDALibrarySamples
MathDx 22.02
§ https://developer.nvidia.com/mathdx
38. LEARN MORE
Math Libraries Documentation
https://docs.nvidia.com/hpc-sdk/index.html#math-libraries
Blog
§ Multimode Multi-GPU: Using NVIDIA cuFFTMp FFTs at Scale
§ Extending Block-Cyclic Tensors for Multi-GPU with NVIDIA cuTENSORMg
39. AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
40. DEVELOPER TOOLS
Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
Debuggers: cuda-gdb, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
Correctness Checker: Compute Sanitizer
IDE integrations: Nsight Eclipse Edition, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
42. CUDA-GDB
Command-Line and IDE Back-End Debugger
§ Unified CPU and CUDA
Debugging
§ CUDA-C/SASS support
§ Built on GDB and uses
many of the same CLI
commands
43. COMPUTE SANITIZER
Automatically Scan for Bugs and Memory Issues
Compute Sanitizer checks correctness issues via sub-tools:
§ Memcheck – Memory access error and leak detection tool.
§ Racecheck – Shared memory data access hazard detection tool.
§ Initcheck – Uninitialized device global memory access detection tool.
§ Synccheck – Thread synchronization hazard detection tool.
45. CORRECTNESS TOOLS FEATURES
§ OptiX support in Compute Sanitizer
§ Automatically find correctness issues in OptiX workloads
§ Core Dump support in Compute Sanitizer
§ Generate core dumps on detected issues
§ 5x performance increase in core dump generation
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 1 bytes
========= at 0x4d70 in /home/cuda/optixBasic/draw_solid_color.cu:69:__raygen__draw_solid_color_0xebf766b2f0642d4e
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7f878f900403 is out of bounds
========= and is 262,132 bytes after the nearest allocation at 0x7f878f8c0400 of size 16 bytes
========= Device Frame:NVIDIA internal [0x430]
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x60fbaa]
========= in /lib/x86_64-linux-gnu/libnvoptix.so.1
========= Host Frame:optix_stubs.h:568:optixLaunch [0xe1ff]
========= in /home/cuda/optixBasic/optixBasic
========= Host Frame:/home/cuda/optixBasic/optixBasic.cpp:227:main [0xb735]
========= in /home/cuda/optixBasic/optixBasic
========= Host Frame:../sysdeps/nptl/libc_start_call_main.h:58:__libc_start_call_main [0x2dfd0]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:../csu/libc-start.c:379:__libc_start_main [0x2e07d]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame: [0x8dde]
========= in /home/cuda/optixBasic/optixBasic
47. NSIGHT SYSTEMS
System Profiler
Key Features:
§ System-wide application algorithm tuning
§ Multi-process tree support
§ Locate optimization opportunities
§ Visualize millions of events on a very fast GUI timeline
§ Or gaps of unused CPU and GPU time
§ Balance your workload across multiple CPUs and GPUs
§ CPU algorithms, utilization and thread state
§ GPU streams, kernels, memory transfers, etc.
§ Command line, Standalone, IDE integration
OS: Linux (x86, Power, Arm SBSA, Tegra), Windows, MacOSX (host)
GPUs: Pascal+
§ Docs/product: https://developer.nvidia.com/nsight-systems
52. NVIDIA NETWORKING ADAPTER SAMPLING
§ Profile NVIDIA Networking adapters
§ Sent / Received / Congestion
§ Correlate with expected network traffic and other system activities
53. GPU DIRECT STORAGE SUPPORT
GPU Metrics Sampling of PCIe BAR1 Requests & cuFile Trace
§ Direct communication to GPU memory
§ cuFile APIs used for GPU Direct Storage
57. REGISTER DEPENDENCY VISUALIZATION
Visualize Register Usage and Dependency Chains
§ SASS view in the Source page
§ Tracking reads and writes for each register
§ Identify long dependency chains
§ Detect inefficient register usage
§ Columns show all dependencies for:
§ Registers
§ Predicates
§ Uniform Registers
§ Uniform Predicates
58. STANDALONE SOURCE VIEWER
§ View of side-by-side
assembly and correlated
source code for CUDA
kernels
§ No profile required
§ Open .cubin files directly
§ Helps identify compiler
optimizations and
inefficiencies
59. OCCUPANCY CALCULATOR
Model Hardware Usage and Identify Limiters
§ Model theoretical
hardware usage
§ Understand limitations
from hardware vs.
kernel parameters
§ Configure model to vary
HW and kernel
parameters
§ Opened from an existing
report or as a new
activity
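The kind of model behind an occupancy calculator can be sketched in a few lines. This is a deliberately simplified illustration, not the tool's actual algorithm: it ignores register and shared-memory allocation granularity, and the default per-SM limits are assumptions resembling an A100-class GPU.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps=64, max_blocks=32, regs_per_sm=65536,
                          smem_per_sm=164 * 1024, warp_size=32):
    """Return (occupancy, limiter) for one SM under simplified limits.

    All default hardware limits are assumed values for illustration.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    threads_allocated = warps_per_block * warp_size
    # Resident blocks per SM allowed by each resource.
    limits = {
        "warps": max_warps // warps_per_block,
        "blocks": max_blocks,
        "registers": regs_per_sm // (regs_per_thread * threads_allocated),
        "shared memory": (smem_per_sm // smem_per_block
                          if smem_per_block else max_blocks),
    }
    limiter = min(limits, key=limits.get)
    blocks = limits[limiter]
    return blocks * warps_per_block / max_warps, limiter

full = theoretical_occupancy(256, 32, 0)
reg_bound = theoretical_occupancy(256, 64, 0)
smem_bound = theoretical_occupancy(128, 32, 48 * 1024)
```

Varying one kernel parameter at a time, as in the three calls above, shows which resource becomes the limiter.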
60. HIERARCHICAL ROOFLINE
§ Visualize multiple levels of the memory
hierarchy
§ Identify bottlenecks caused by memory
limitations
§ Determine how modifying algorithms may (or
may not) impact performance
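The roofline idea behind this view can be written down directly: attainable throughput at each memory level is the smaller of peak compute and arithmetic intensity times that level's bandwidth. In the sketch below the peak, the bandwidths, and the intensities are hypothetical numbers, not measurements of any GPU.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(peak compute, FLOP/byte * GB/s)."""
    return min(peak_gflops, intensity * bandwidth_gbs)

# Hypothetical machine: peak compute in GFLOP/s and per-level bandwidth in GB/s.
PEAK = 10000.0
LEVELS = {"L1": 10000.0, "L2": 4000.0, "DRAM": 1500.0}

def roofline(intensities):
    """intensities maps each memory level to the kernel's FLOP/byte at that level."""
    return {level: attainable_gflops(intensities[level], PEAK, bw)
            for level, bw in LEVELS.items()}

# Hypothetical kernel: caches filter traffic, so intensity grows toward DRAM.
bounds = roofline({"L1": 0.5, "L2": 1.0, "DRAM": 4.0})
bottleneck = min(bounds, key=bounds.get)
```

In this made-up case the L2 level gives the lowest bound, so reducing DRAM traffic alone would not be expected to help.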
61. LEARN MORE
GTC2022 sessions
§ Optimizing Communication with Nsight Systems Network Profiling [S41500]
§ Latest Updates to CUDA Developer Tools [D4121]
§ How to Understand and Optimize Shared Memory Accesses using Nsight Compute [S41723]
§ Connect with Experts: What’s in Your CUDA Toolbox? Profiling, Optimization, and Debugging Tools [CWE41541]
§ What, Where, and Why? Use CUDA Developer Tools to Detect, Locate, and Explain Bugs and Bottlenecks [S41493]
Nsight Systems Documentation
§ https://docs.nvidia.com/nsight-systems/
Nsight Compute Documentation
§ https://docs.nvidia.com/nsight-compute/