NVIDIA HPC ソフトウエア斜め読み (A Quick Look at NVIDIA HPC Software)
NARUHIKO TAN | HPC SOLUTION ARCHITECT
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
PROGRAMMING THE NVIDIA PLATFORM
CPU, GPU, and Network
ACCELERATED STANDARD LANGUAGES
ISO C++, ISO Fortran
PLATFORM SPECIALIZATION
CUDA
ACCELERATION LIBRARIES
Core, Communication, Math, Data Analytics, AI, Quantum
std::transform(par, x, x+n, y, y,
  [=](float x, float y){ return y + a*x; }
);

do concurrent (i = 1:n)
  y(i) = y(i) + a*x(i)
enddo

import cunumeric as np
…
def saxpy(a, x, y):
    y[:] += a*x

#pragma acc data copy(x,y) {
  ...
  std::transform(par, x, x+n, y, y,
    [=](float x, float y){ return y + a*x; });
  ...
}

#pragma omp target data map(x,y) {
  ...
  std::transform(par, x, x+n, y, y,
    [=](float x, float y){ return y + a*x; });
  ...
}

__global__
void saxpy(int n, float a, float *x, float *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] += a*x[i];
}

int main(void) {
  ...
  cudaMemcpy(d_x, x, ...);
  cudaMemcpy(d_y, y, ...);
  saxpy<<<(N+255)/256,256>>>(...);
  cudaMemcpy(y, d_y, ...);
ACCELERATED STANDARD LANGUAGES
ISO C++, ISO Fortran
INCREMENTAL PORTABLE OPTIMIZATION
OpenACC, OpenMP
PLATFORM SPECIALIZATION
CUDA
NVIDIA HPC SDK
Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud
Develop for the NVIDIA Platform: GPU, CPU and Interconnect
Libraries | Accelerated C++ and Fortran | Directives | CUDA
7-8 Releases Per Year | Freely Available
Compilers: nvcc, nvc, nvc++, nvfortran
Programming Models: Standard C++ & Fortran, OpenACC & OpenMP, CUDA
Core Libraries: libcu++, Thrust, CUB
Math Libraries: cuBLAS, cuTENSOR, cuSPARSE, cuSOLVER, cuFFT, cuRAND
Communication Libraries: HPC-X (MPI, SHMEM, UCX, HCOLL, SHARP), NVSHMEM, NCCL
Development & Analysis Profilers: Nsight Systems, Nsight Compute
Debugger: cuda-gdb (host and device)
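To make the stdpar workflow above concrete, the following is a minimal, self-contained version of the saxpy fragment shown on the previous slide. The file name, vector size, and printed value are illustrative; the compile lines assume the nvc++ compiler from the HPC SDK (the same flags reappear later in the deck).

// saxpy_stdpar.cpp : ISO C++ parallel algorithm, no CUDA-specific code
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    const float a = 2.0f;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // Same pattern as the slide: y = a*x + y, dispatched by the execution policy
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [=](float xi, float yi) { return yi + a * xi; });

    std::printf("y[0] = %f\n", y[0]); // expect 4.0
}

Build (illustrative): nvc++ -stdpar=gpu -o saxpy saxpy_stdpar.cpp for GPU offload, or nvc++ -stdpar=multicore for CPU threads.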
ACCELERATED COMPUTING WITH
STANDARD LANGUAGES
§ C++
PILLARS OF STANDARD LANGUAGE PARALLELISM
Copyright (C) 2021 Bryce Adelstein Lelbach
Common Algorithms that Dispatch to
Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel
Algorithms that Run Anywhere
sender auto
algorithm (sender auto s) {
  return s | bulk(N,
               [] (auto data) {
                 // ...
               })
           | bulk(N,
               [] (auto data) {
                 // ...
               });
}
Mechanisms for Composing Parallel
Invocations into Task Graphs
sender auto
algorithm (sender auto s) {
  return s | bulk(
               [] (auto data) {
                 // ...
               })
           | bulk(
               [] (auto data) {
                 // ...
               });
}
C++ with OpenMP
Ø Composable, compact and elegant
Ø Easy to read and maintain
Ø ISO Standard
Ø Portable – nvc++, g++, icpc, MSVC, …
Standard C++
#pragma omp parallel // OpenMP parallel region
{
  #pragma omp for // OpenMP for loop
  for (MInt i = 0; i < noCells; i++) { // Loop over all cells
    if (timeStep % ipow2[maxLevel_ - clevel[i * distLevel]] == 0) { // Multi-grid loop
      const MInt distStartId = i * nDist; // Local offsets for 1D accesses
      const MInt distNeighStartId = i * distNeighbors;
      const MFloat* const distributionsStart = &distributions[distStartId];
      for (MInt j = 0; j < nDist - 1; j += 2) { // Unrolled loop over distributions (factor 2)
        if (neighborId[i * distNeighbors + j] > -1) { // First unrolled iteration
          const MInt n1StartId = neighborId[distNeighStartId + j] * nDist;
          oldDistributions[n1StartId + j] = distributionsStart[j]; // 1D access, AoS format
        }
        if (neighborId[i * distNeighbors + j + 1] > -1) { // Second unrolled iteration
          const MInt n2StartId = neighborId[distNeighStartId + j + 1] * nDist;
          oldDistributions[n2StartId + j + 1] = distributionsStart[j + 1];
        }
      }
      oldDistributions[distStartId + lastId] = distributionsStart[lastId]; // Zero-th distribution
    }
  }
}

std::for_each_n(par_unseq, start, noCells, [=](auto i) { // Parallel for
  if (timeStep % IPOW2[maxLevel_ - a_level(i)] != 0) // Multi-level loop
    return;
  for (MInt j = 0; j < nDist; ++j) {
    if (auto n = c_neighborId(i, j); n == -1) continue;
    a_oldDistribution(n, j) = a_distribution(i, j); // SoA or AoS mem_fn
  }
});
M-AIA WITH C++17 PARALLEL ALGORITHMS
Multi-physics simulation framework
from RWTH Aachen University
Ø Hierarchical grids, complex moving geometries
Ø Adaptive meshing, load balancing
Ø Numerical methods: FV, DG, LBM, FEM, Level-Set, ...
Ø Physics: aeroacoustics, combustion, biomedical, ...
Ø Developed by ~20 PhDs (Mech. Eng.), ~500k LOC++
Ø Programming model: MPI + ISO C++ parallelism
M-AIA
Multi-physics simulation framework developed at the Institute of Aerodynamics, RWTH Aachen University
Decaying isotropic turbulence
400k fully-resolved particles
[Chart: relative speed-up, OpenMP (2x EPYC 7742): 1.0x, ISO C++ (2x EPYC 7742): 1.025x, ISO C++ (A100): 8.74x]
PARALLELISM IN C++ ROADMAP
C++ 11 / C++ 14 (the general parallelism user-facing foundation): Memory model enhancements, Lambdas, Atomics extensions, Generic Lambda Expressions
C++ 17 (how users run C++ code on GPUs today): Parallel algorithms, Forward progress guarantees, Memory model clarifications
C++ 20 (co-designed with V100 hardware support): Scalable synchronization library, Ranges, Span
C++ PIPELINE: Linear algebra algorithms (extended C++ interface to BLAS/LAPACK), Asynchronous parallel algorithms, Senders-receivers (custom algorithms and async control flow), Mdspan (N-dimensional loops and usability), Range-based parallel algorithms, Extended floating-point types, general usability of the performance provided by executors
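As a reminder of what the C++17 row above already delivers in practice, the sketch below uses a parallel transform_reduce (a dot product). This is an illustrative example, not code from the deck; with nvc++ -stdpar=gpu this kind of standard algorithm is what offloads to the GPU today.

#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Dot product expressed with a C++17 parallel algorithm.
float dot(const std::vector<float>& x, const std::vector<float>& y) {
    return std::transform_reduce(std::execution::par_unseq,
                                 x.begin(), x.end(), y.begin(),
                                 0.0f,                      // initial value
                                 std::plus<float>{},        // reduction operation
                                 std::multiplies<float>{}); // element-wise operation
}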
PILLARS OF STANDARD LANGUAGE PARALLELISM
Copyright (C) 2021 Bryce Adelstein Lelbach
With Senders & Receivers
Today
Common Algorithms that Dispatch to
Vendor-Optimized Parallel Libraries
Tools to Write Your Own Parallel
Algorithms that Run Anywhere
sender auto
algorithm (sender auto s) {
  return s | bulk(N,
               [] (auto data) {
                 // ...
               })
           | bulk(N,
               [] (auto data) {
                 // ...
               });
}
Mechanisms for Composing Parallel
Invocations into Task Graphs
sender auto
algorithm (sender auto s) {
  return s | bulk(
               [] (auto data) {
                 // ...
               })
           | bulk(
               [] (auto data) {
                 // ...
               });
}
SENDERS & RECEIVERS
Maxwell’s equations
template <class ComputeSchedulerT, class WriteSchedulerT>
auto maxwell_eqs(ComputeSchedulerT &scheduler, WriteSchedulerT &writer)
{
  return repeat_n(
           n_outer_iterations,
           repeat_n(
             n_inner_iterations,
             schedule(scheduler)
               | bulk(grid.cells, update_h(accessor))
               | bulk(grid.cells, update_e(time, dt, accessor)))
             | transfer(writer)
             | then(dump_results(report_step, accessor)))
         | then([]{ printf("simulation complete\n"); });
}
Simplify Work Across CPUs and Accelerators
• Uniform abstraction between code and diverse resources
• ISO standard
• Write once, run everywhere
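The Maxwell pipeline above relies on a senders/receivers library. As a smaller, self-contained illustration of the same schedule | bulk | then pattern, here is a sketch that assumes the stdexec reference implementation of P2300; the stdexec:: namespace, the exec::static_thread_pool scheduler, and the array sizes are assumptions of this example, not material from the deck.

#include <stdexec/execution.hpp>
#include <exec/static_thread_pool.hpp>
#include <vector>
#include <cstdio>

int main() {
    exec::static_thread_pool pool{4};   // CPU scheduler; a GPU scheduler could be swapped in
    auto sched = pool.get_scheduler();

    const std::size_t n = 1 << 20;
    const float a = 2.0f;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    auto work = stdexec::schedule(sched)                                    // start on the chosen resource
              | stdexec::bulk(n, [&](std::size_t i) { y[i] += a * x[i]; })  // parallel saxpy
              | stdexec::then([] { std::puts("saxpy complete"); });         // continuation

    stdexec::sync_wait(std::move(work));  // block until the whole task graph finishes
}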
ELECTROMAGNETISM
Raw performance & % of peak
std::sync_wait(maxwell(inline_scheduler, inline_scheduler));
std::sync_wait(maxwell(openmp_scheduler, inline_scheduler));
std::sync_wait(maxwell(cuda, inline_scheduler));
§ CPUs: AMD EPYC 7742 CPUs, GPUs: NVIDIA A100-SXM4-80
§ Inline (1 CPU HW thread), OpenMP-128 (1x CPU), OpenMP-256 (2x CPUs), Graph (1x GPU), Multi-GPU (2x GPUs)
§ clang-12 with –O3 –DNDEBUG –mtune=native -fopenmp
[Charts: Speedup and Efficiency vs. STREAM TRIAD, per scheduler: OpenMP 128, OpenMP 256, CUDA (1 A100), CUDA (2 A100)]
STRONG SCALING USING ISO STANDARD C++
NVIDIA SUPERPOD
§ 140x NVIDIA DGX-A100 640
§ 1120x NVIDIA A100-SXM4-80 GPUs
[Chart: Maxwell senders & receivers speedup vs. number of GPUs (up to ~1,100), with ideal scaling and parallel efficiency]
PARALLEL ALGORITHMS AND SENDERS & RECEIVERS
PALABOS CARBON SEQUESTRATION
Copyright (C) 2022 NVIDIA
Ø Palabos is a framework for fluid dynamics simulations using
Lattice-Boltzmann methods.
Ø Code for multi-component flow through a porous media
ported to C++ Senders and Receivers.
Ø Application: simulating carbon sequestration in sandstone.
Christian Huber (Brown University), Jonas Latt (University of Geneva)
Georgy Evtushenko (NVIDIA), Gonzalo Brito (NVIDIA)
[Chart: strong scaling from 32 to 512 A100 GPUs]
ACCELERATED COMPUTING WITH
STANDARD LANGUAGES
§ FORTRAN
MODERN FORTRAN FEATURES FOR HPC
Standard Parallelism and Concurrency Features
DO CONCURRENT Reductions
Support for reduction operations on concurrent loops (à la OpenACC/OpenMP). Supported since nvfortran 21.11.
Fortran 202X
Coming in 2023
Atomics
Propose support for atomic
variable accesses
Asynchronous Tasking
Propose support for asynchronous
tasks
Fortran 202Y
In discussion
DO CONCURRENT
Data parallel loop construct, locality
specifiers. Supported in nvfortran
Array Intrinsics
Various math intrinsics that may
apply to entire arrays and map to
accelerated libraries supported in
nvfortran.
Co-Arrays
Partitioned Global Address Space
arrays, teams of processes (images),
collectives & synchronization.
Awaiting F18.
Fortran 2018
MINIWEATHER
Standard Language Parallelism in Climate/Weather Applications
Mini-App written in C++ and Fortran that simulates
weather-like fluid flows using Finite Volume and
Runge-Kutta methods.
Existing parallelization in MPI, OpenMP, OpenACC, …
Included in the SPEChpc benchmark suite*
Open-source and commonly-used in training events.
https://github.com/mrnorman/miniWeather/
MiniWeather
[Chart: relative speed-up of the OpenMP (CPU), DO CONCURRENT (CPU), DO CONCURRENT (GPU), and OpenACC versions]
do concurrent (ll=1:NUM_VARS, k=1:nz, i=1:nx) &
    local(x,z,x0,z0,xrad,zrad,amp,dist,wpert)
  if (data_spec_int == DATA_SPEC_GRAVITY_WAVES) then
    x = (i_beg-1 + i-0.5_rp) * dx
    z = (k_beg-1 + k-0.5_rp) * dz
    x0 = xlen/8
    z0 = 1000
    xrad = 500
    zrad = 500
    amp = 0.01_rp
    dist = sqrt( ((x-x0)/xrad)**2 + ((z-z0)/zrad)**2 ) * pi / 2._rp
    if (dist <= pi / 2._rp) then
      wpert = amp * cos(dist)**2
    else
      wpert = 0._rp
    endif
    tend(i,k,ID_WMOM) = tend(i,k,ID_WMOM) + wpert*hy_dens_cell(k)
  endif
  state_out(i,k,ll) = state_init(i,k,ll) + dt * tend(i,k,ll)
enddo
Source: HPC SDK 22.1, AMD EPYC 7742, NVIDIA A100. MiniWeather: NX=2000, NZ=1000, SIM_TIME=5.
OpenACC version uses –gpu=managed option.
*SPEChpc is a trademark of The Standard Performance Evaluation Corporation
POT3D: DO CONCURRENT
POT3D is a Fortran application for approximating solar
coronal magnetic fields.
Included in the SPEChpc benchmark suite*
Existing parallelization in MPI & OpenACC
Optimized the DO CONCURRENT version by using
OpenACC solely for data motion and atomics
https://github.com/predsci/POT3D
POT3D
!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k ))*dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)
Data courtesy of Predictive Science Inc. *SPEChpc is a trademark of The Standard Performance Evaluation Corporation
ACCELERATED COMPUTING WITH
STANDARD LANGUAGES
§ PYTHON
PRODUCTIVITY
Sequential and Composable Code
§ Sequential semantics - no visible
parallelism or synchronization
§ Name-based global data – no partitioning
§ Composable – can combine with other
libraries and datatypes
def cg_solve(A, b, conv_iters):
    x = np.zeros_like(b)
    r = b - A.dot(x)
    p = r
    rsold = r.dot(r)
    converged = False
    max_iters = b.shape[0]
    for i in range(max_iters):
        Ap = A.dot(p)
        alpha = rsold / (p.dot(Ap))
        x = x + alpha * p
        r = r - alpha * Ap
        rsnew = r.dot(r)
        if i % conv_iters == 0 and np.sqrt(rsnew) < 1e-10:
            converged = i
            break
        beta = rsnew / rsold
        p = r + beta * p
        rsold = rsnew
PERFORMANCE
§ Transparently run at any scale needed to address computational challenges at hand
§ Automatically leverage all the available hardware
Transparent Acceleration
Supercomputer
Multi-GPU
GPU
DPU
Grace CPU
COMPUTATIONAL FLUID DYNAMICS
[Chart: Distributed NumPy performance, weak scaling: time (seconds) vs. number of GPUs and relative dataset size (1 to 1024), comparing cuPy and Legate]
for _ in range(iter):
    un = u.copy()
    vn = v.copy()
    b = build_up_b(rho, dt, dx, dy, u, v)
    p = pressure_poisson_periodic(b, nit, p, dx, dy)
    …
Extracted from “CFD Python” course at https://github.com/barbagroup/CFDPython
Barba, Lorena A., and Forsyth, Gilbert F. (2018). CFD Python: the 12 steps to Navier-Stokes equations. Journal of
Open Source Education, 1(9), 21, https://doi.org/10.21105/jose.00021
• CFD codes like:
  • Shallow-Water Equation Solver
  • Oil Pipeline Risk Management: Geoclaw-landspill simulations
• Python Libraries: Jupyter, NumPy, SciPy, SymPy, Matplotlib
CFD Python on cuNumeric!
ACCELERATED STANDARD LANGUAGES
Parallel performance for wherever your code runs
std::transform(par, x, x+n, y, y,
  [=](float x, float y){ return y + a*x; }
);
import cunumeric as np
…
def saxpy(a, x, y):
y[:] += a*x
do concurrent (i = 1:n)
y(i) = y(i) + a*x(i)
enddo
ISO C++ | ISO Fortran | Python
CPU: nvc++ -stdpar=multicore | nvfortran -stdpar=multicore | legate --cpus 16 saxpy.py
GPU: nvc++ -stdpar=gpu | nvfortran -stdpar=gpu | legate --gpus 1 saxpy.py
LEARN MORE
GTC2022 sessions
§ No More Porting: Coding for GPUs with Standard C++, Fortran, and Python [S41496]
§ Shifting through the Gears of GPU Programming Understanding Performance and Portability Trade-offs [S41620]
§ C++ Standard Parallelism [S41960]
§ Future of Standard and CUDA C++ [S41961]
§ Connect with Experts: Standard and CUDA C++ User Forum [CWE41949]
§ From Directives to DO CONCURRENT: A Case Study in Standard Parallelism [S41318]
§ Evaluating Your Options for Accelerated Numerical Computing in Pure Python [S41645]
Blogs
§ Developing Accelerated Code with Standard Language Parallelism
§ Accelerating Standard C++ with GPUs Using stdpar
§ Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK
§ Bringing Tensor Cores to Standard Fortran
§ Accelerating Python on GPUs with nvc++ and Cython
LEARN MORE
Blogs
§ Multi-GPU Programming with Standard Parallel C++, Part1
§ Multi-GPU Programming with Standard Parallel C++, Part2
Open-source codes
§ LULESH: https://github.com/LLNL/LULESH
§ STLBM: https://gitlab.com/unigehpfs/stlbm
§ MiniWeather: https://github.com/mrnorman/miniWeather/
§ POT3D: https://github.com/predsci/POT3D
§ Legate: https://github.com/nv-legate
§ Jacobi example using C++ standard parallelism: https://gitlab.com/unigehpfs/paralg
NVIDIA HPC SDK Documentation
https://docs.nvidia.com/hpc-sdk/index.html
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
NVIDIA MATH LIBRARIES
Linear Algebra, FFT, RNG and Basic Math
CUDA Math API
cuFFT
cuSPARSE cuSOLVER
cuBLAS cuTENSOR
cuRAND CUTLASS
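To show how one of the libraries listed above is typically driven from host code, here is a hedged sketch of a cuBLAS SAXPY, the same operation used throughout this deck. Error checking is omitted and the sizes are illustrative; build with something like nvcc saxpy_cublas.cpp -lcublas.

// saxpy_cublas.cpp : host code calling cuBLAS
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    const float a = 2.0f;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);   // y = a*x + y on the GPU
    cublasDestroy(handle);

    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("y[0] = %f\n", y[0]);             // expect 4.0

    cudaFree(d_x);
    cudaFree(d_y);
}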
MATH LIBRARIES
§ MULTI-GPU MATH LIBRARIES
cuFFTXt: MAXIMIZING SINGLE-NODE PERFORMANCE
Speedups for 3D C2C versus CTK 11.0
[Chart: speedup (larger is better) vs. 3D FFT size (64^3 to 512^3), for 1, 2, 4, and 8 GPUs]
* A100 80GB Default clocks: CTK 11.0 vs. CTK 11.6
Recently Introduced
§ Up to 10x improvements for SNMG FFTs
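For orientation, the single-node multi-GPU path above is driven through the cufftXt descriptor API. The sketch below follows that API as documented for cuFFT (plan, assign GPUs, allocate a distributed descriptor, execute, copy back); the function names are real cufftXt entry points, but the sizes, GPU IDs, and omitted error checking are assumptions of this illustration.

#include <cufftXt.h>
#include <cuda_runtime.h>

// In-place 3D C2C FFT spread across two GPUs on one node.
void fft3d_two_gpus(cufftComplex* host_data, int nx, int ny, int nz) {
    cufftHandle plan;
    cufftCreate(&plan);

    int gpus[2] = {0, 1};
    cufftXtSetGPUs(plan, 2, gpus);                 // distribute the plan over both GPUs

    size_t work_sizes[2];                          // one workspace size per GPU
    cufftMakePlan3d(plan, nx, ny, nz, CUFFT_C2C, work_sizes);

    cudaLibXtDesc* desc;                           // descriptor for data split across GPUs
    cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);
    cufftXtMemcpy(plan, desc, host_data, CUFFT_COPY_HOST_TO_DEVICE);

    cufftXtExecDescriptorC2C(plan, desc, desc, CUFFT_FORWARD);

    cufftXtMemcpy(plan, host_data, desc, CUFFT_COPY_DEVICE_TO_HOST);
    cufftXtFree(desc);
    cufftDestroy(plan);
}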
cuTENSORMg: MULTI-GPU TENSOR CONTRACTIONS
Performance of FP32 Tensor Contractions on DGX A100
Data residing on Host (Dotted) or Device (Solid) Memory
* DGX A100 80GB
§ Introduced in cuTENSOR v1.4
§ Out-of-core released in v1.5
[Chart: TFLOPS (larger is better) vs. contraction size (M = N = K, 4096 to 196608), for 1, 2, 4, and 8 GPUs with data on host (dotted) or device (solid) memory]
Releasing cuTENSOR v1.5
§ Added Out-of-core Functionality
§ Library wide optimizations
cuSOLVERMp: DENSE LINEAR ALGEBRA AT SCALE
LU Decomposition (GETRF+GETRS) w/ Pivoting on Summit Supercomputer
[Chart: time in seconds (smaller is better) vs. number of GPUs (1 to 4096), comparing the state-of-the-art baseline with HPC SDK 21.11]
* Summit: 6x V100 16GB per node
Released in HPC SDK 21.11
§ LU Decomposition
§ With & Without pivoting
§ Cholesky
cuFFTMp: FFTs AT SCALE - SLAB DECOMPOSITION
Distributed 3D FFT Performance: Comparison by Precision
[Chart: TFLOPS (larger is better) for C2C and Z2Z vs. number of GPUs (8 to 4096) and problem size (2048^3 to 16384^3); C2C reaches 1,860 TFLOPS at the largest scale]
* Selene: A100 80GB @ 1410 MHz
Coming to HPC SDK 22.3
§ Distributed 2D/3D FFTs
§ Slab Decomposition
§ Pencil Decomposition (Preview)
§ Helper functions: Pencils <-> Slabs
cuFFTMp: FFTs AT SCALE - PENCIL DECOMPOSITION
Distributed 3D FFT Performance: C2C Comparison by GPU Count
[Chart: TFLOPS (larger is better) vs. problem size (1024^3 to 8192^3) for 32 to 2048 GPUs, reaching 527 TFLOPS at the largest scale]
* Selene: A100 80GB @ 1410 MHz
Coming to HPC SDK 22.3
§ Distributed 2D/3D FFTs
§ Slab Decomposition
§ Pencil Decomposition (Preview)
§ Helper functions: Pencils <-> Slabs
[S41494] A Deep Dive into the Latest HPC Software
MATH LIBRARIES
§ MATH LIBRARY DEVICE EXTENSIONS
MATH LIBRARIES DEVICE EXTENSIONS
cuFFTDx Performance: Comparison with cuFFT across various sizes
[Chart: TFLOPS (larger is better) vs. 1D FFT size (2 to 32768), cuFFTDx vs. cuFFT]
* A100 80GB @ 1410 MHz
Released in MathDx 22.02
§ Available on DevZone
§ Support Volta+ architecture
§ FFT 1D sizes up to 32k
Future Releases
§ cuBLASDx/cuSOLVERDx
§ 2D/3D FFTs
§ Windows Support
LEARN MORE
GTC2022 sessions
§ An Explanation of Slab and Pencil Decomposition Performance Across Supercomputing Clusters [S41153]
§ Recent Developments in NVIDIA Math Libraries [S41491]
§ Connect with Experts: NVIDIA Math Libraries [CWE41721]
§ Connect with Experts: Thrust, CUB, and libcu++ User Forum [CWE41948]
§ NVSHMEM: CUDA-Integrated Communication for NVIDIA GPUs (a Magnum IO session) [S41044]
Examples
§ CUDA Library Samples: https://github.com/NVIDIA/CUDALibrarySamples
MathDx 22.02
§ https://developer.nvidia.com/mathdx
LEARN MORE
Math Libraries Documentation
https://docs.nvidia.com/hpc-sdk/index.html#math-libraries
Blog
§ Multimode Multi-GPU: Using NVIDIA cuFFTMp FFTs at Scale
§ Extending Block-Cyclic Tensors for Multi-GPU with NVIDIA cuTENSORMg
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
DEVELOPER TOOLS
Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
Debuggers: cuda-gdb, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
Correctness Checker: Compute Sanitizer
IDE integrations: Nsight Eclipse Edition, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
DEVELOPER TOOLS
§ COMPUTE DEBUGGERS/IDE
CUDA-GDB
Command-Line and IDE Back-End Debugger
§ Unified CPU and CUDA debugging
§ CUDA-C/SASS support
§ Built on GDB and uses many of the same CLI commands (a typical session is sketched below)
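A typical session, shown here only as a hedged illustration: the binary and kernel names are hypothetical, and -g -G builds the device code with debug information.

$ nvcc -g -G -o saxpy saxpy.cu
$ cuda-gdb ./saxpy
(cuda-gdb) break saxpy                          # breakpoint in the kernel, same syntax as gdb
(cuda-gdb) run
(cuda-gdb) info cuda kernels                    # list kernels currently running on the GPU
(cuda-gdb) cuda block (0,0,0) thread (32,0,0)   # switch focus to one GPU thread
(cuda-gdb) print y[i]
(cuda-gdb) continue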
COMPUTE SANITIZER
Automatically Scan for Bugs and Memory Issues
Compute Sanitizer checks correctness issues via sub-tools (example invocations below):
§ Memcheck: memory access error and leak detection tool.
§ Racecheck: shared memory data access hazard detection tool.
§ Initcheck: uninitialized device global memory access detection tool.
§ Synccheck: thread synchronization hazard detection tool.
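Hedged example invocations against an arbitrary application binary (./app is a placeholder); memcheck is the default tool:

$ compute-sanitizer ./app                      # memcheck by default
$ compute-sanitizer --tool racecheck ./app     # shared-memory data race detection
$ compute-sanitizer --tool initcheck ./app     # uninitialized global memory accesses
$ compute-sanitizer --tool synccheck ./app     # invalid synchronization usage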
DEVELOPER TOOLS
§ COMPUTE DEBUGGERS/IDE NEW FEATURES
CORRECTNESS TOOLS FEATURES
§ OptiX support in Compute Sanitizer: automatically find correctness issues in OptiX workloads
§ Core dump support in Compute Sanitizer: generate core dumps on detected issues
§ 5x performance increase in core dump generation
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 1 bytes
=========     at 0x4d70 in /home/cuda/optixBasic/draw_solid_color.cu:69:__raygen__draw_solid_color_0xebf766b2f0642d4e
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f878f900403 is out of bounds
=========     and is 262,132 bytes after the nearest allocation at 0x7f878f8c0400 of size 16 bytes
=========     Device Frame:NVIDIA internal [0x430]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x60fbaa] in /lib/x86_64-linux-gnu/libnvoptix.so.1
=========     Host Frame:optix_stubs.h:568:optixLaunch [0xe1ff] in /home/cuda/optixBasic/optixBasic
=========     Host Frame:/home/cuda/optixBasic/optixBasic.cpp:227:main [0xb735] in /home/cuda/optixBasic/optixBasic
=========     Host Frame:../sysdeps/nptl/libc_start_call_main.h:58:__libc_start_call_main [0x2dfd0] in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:../csu/libc-start.c:379:__libc_start_main [0x2e07d] in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x8dde] in /home/cuda/optixBasic/optixBasic
DEVELOPER TOOLS
§ NSIGHT SYSTEMS
NSIGHT SYSTEMS
System Profiler
Key Features:
§ System-wide application algorithm tuning
§ Multi-process tree support
§ Locate optimization opportunities
§ Visualize millions of events on a very fast GUI timeline, or gaps of unused CPU and GPU time
§ Balance your workload across multiple CPUs and GPUs
§ CPU algorithms, utilization, and thread state; GPU streams, kernels, memory transfers, etc.
§ Command line, standalone GUI, IDE integration (example command line below)
OS: Linux (x86, Power, Arm SBSA, Tegra), Windows, MacOSX (host)
GPUs: Pascal+
§ Docs/product: https://developer.nvidia.com/nsight-systems
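A hedged example of collecting a system-wide trace from the command line; the report name and trace selection are illustrative, and the report file extension varies by version:

$ nsys profile -t cuda,nvtx,osrt --stats=true -o my_report ./app
$ nsys stats my_report.nsys-rep        # summary tables from an existing report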
DEVELOPER TOOLS
§ NSIGHT SYSTEMS NEW FEATURES
MULTI-REPORT TILING
Visualize More Parallel Activity
Open multiple reports, loaded on the same timeline based on wall-clock time
EXPERT SYSTEMS & STATISTICS
Built-in Data Analytics with Advice
NVIDIA NETWORKING ADAPTER SAMPLING
§ Profile NVIDIA Networking adaptors
§ Sent / Received / Congestion
§ Correlate with expected network traffic and other system activities
GPUDIRECT STORAGE SUPPORT
GPU Metrics Sampling of PCIe BAR1 Requests & cuFile Trace
§ Direct communication to GPU memory
§ cuFile APIs used for GPUDirect Storage
DEVELOPER TOOLS
§ NSIGHT COMPUTE
NSIGHT COMPUTE
Kernel Profiling Tool
Key Features:
§ Interactive CUDA API debugging and kernel profiling
§ Built-in rules expertise
§ Fully customizable data collection and display
§ Command line, standalone GUI, IDE integration, remote targets (example command line below)
OS: Linux (x86, Power, Tegra, Arm SBSA), Windows, MacOSX (host only)
GPUs: Volta, Turing, Ampere
Docs/product: https://developer.nvidia.com/nsight-compute
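A hedged example of profiling a kernel from the command line; the kernel and report names are illustrative:

$ ncu --set full -o saxpy_profile ./saxpy            # collect the full section set into a report
$ ncu --kernel-name saxpy --launch-count 1 ./saxpy   # limit collection to one launch of one kernel
$ ncu-ui saxpy_profile.ncu-rep                       # open the report in the GUI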
DEVELOPER TOOLS
§ NSIGHT COMPUTE NEW FEATURES
REGISTER DEPENDENCY VISUALIZATION
Visualize Register Usage and Dependency Chains
§ SASS view in the Source page
§ Tracking reads and writes for each register
§ Identify long dependency chains
§ Detect inefficient register usage
§ Columns show all dependencies for:
§ Registers
§ Predicates
§ Uniform Registers
§ Uniform Predicates
STANDALONE SOURCE VIEWER
§ View of side-by-side assembly and correlated source code for CUDA kernels
§ No profile required; open .cubin files directly
§ Helps identify compiler optimizations and inefficiencies
OCCUPANCY CALCULATOR
Model Hardware Usage and Identify Limiters
§ Model theoretical hardware usage
§ Understand limitations from hardware vs. kernel parameters
§ Configure model to vary HW and kernel parameters
§ Opened from an existing report or as a new activity
HIERARCHICAL ROOFLINE
§ Visualize multiple levels of the memory hierarchy
§ Identify bottlenecks caused by memory limitations
§ Determine how modifying algorithms may (or may not) impact performance
LEARN MORE
GTC2022 sessions
§ Optimizing Communication with Nsight Systems Network Profiling [S41500]
§ Latest Updates to CUDA Developer Tools [D4121]
§ How to Understand and Optimize Shared Memory Accesses using Nsight Compute [S41723]
§ Connect with Experts: What’s in Your CUDA Toolbox? Profiling, Optimization, and Debugging Tools [CWE41541]
§ What, Where, and Why? Use CUDA Developer Tools to Detect, Locate, and Explain Bugs and Bottlenecks [S41493]
Nsight Systems Documentation
§ https://docs.nvidia.com/nsight-systems/
Nsight Compute Documentation
§ https://docs.nvidia.com/nsight-compute/
AGENDA
Accelerated Computing with Standard Languages
§ C++
§ Fortran
§ Python
Math Libraries
§ Multi-GPU Math Libraries
§ Math Library Device Extensions
Developer Tools
§ Compute debuggers/IDE
§ Nsight Systems
§ Nsight Compute
NVIDIA HPC ソフトウエア斜め読み

More Related Content

What's hot

What's hot (20)

Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
Deep Learningのための専用プロセッサ「MN-Core」の開発と活用(2022/10/19東大大学院「 融合情報学特別講義Ⅲ」)
 
CUDAプログラミング入門
CUDAプログラミング入門CUDAプログラミング入門
CUDAプログラミング入門
 
GPU と PYTHON と、それから最近の NVIDIA
GPU と PYTHON と、それから最近の NVIDIAGPU と PYTHON と、それから最近の NVIDIA
GPU と PYTHON と、それから最近の NVIDIA
 
Topology Managerについて / Kubernetes Meetup Tokyo 50
Topology Managerについて / Kubernetes Meetup Tokyo 50Topology Managerについて / Kubernetes Meetup Tokyo 50
Topology Managerについて / Kubernetes Meetup Tokyo 50
 
データ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラデータ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラ
 
CuPy解説
CuPy解説CuPy解説
CuPy解説
 
TensorFlow XLAは、 中で何をやっているのか?
TensorFlow XLAは、 中で何をやっているのか?TensorFlow XLAは、 中で何をやっているのか?
TensorFlow XLAは、 中で何をやっているのか?
 
Introduction to OpenCL (Japanese, OpenCLの基礎)
Introduction to OpenCL (Japanese, OpenCLの基礎)Introduction to OpenCL (Japanese, OpenCLの基礎)
Introduction to OpenCL (Japanese, OpenCLの基礎)
 
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
 
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
 
不老におけるOptunaを利用した分散ハイパーパラメータ最適化 - 今村秀明(名古屋大学 Optuna講習会)
不老におけるOptunaを利用した分散ハイパーパラメータ最適化 - 今村秀明(名古屋大学 Optuna講習会)不老におけるOptunaを利用した分散ハイパーパラメータ最適化 - 今村秀明(名古屋大学 Optuna講習会)
不老におけるOptunaを利用した分散ハイパーパラメータ最適化 - 今村秀明(名古屋大学 Optuna講習会)
 
ドメイン適応の原理と応用
ドメイン適応の原理と応用ドメイン適応の原理と応用
ドメイン適応の原理と応用
 
1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門
 
DQNからRainbowまで 〜深層強化学習の最新動向〜
DQNからRainbowまで 〜深層強化学習の最新動向〜DQNからRainbowまで 〜深層強化学習の最新動向〜
DQNからRainbowまで 〜深層強化学習の最新動向〜
 
[DL輪読会]大規模分散強化学習の難しい問題設定への適用
[DL輪読会]大規模分散強化学習の難しい問題設定への適用[DL輪読会]大規模分散強化学習の難しい問題設定への適用
[DL輪読会]大規模分散強化学習の難しい問題設定への適用
 
オープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステム
オープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステムオープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステム
オープンソースコンパイラNNgenでつくるエッジ・ディープラーニングシステム
 
A100 GPU 搭載! P4d インスタンス 使いこなしのコツ
A100 GPU 搭載! P4d インスタンス使いこなしのコツA100 GPU 搭載! P4d インスタンス使いこなしのコツ
A100 GPU 搭載! P4d インスタンス 使いこなしのコツ
 
モデルアーキテクチャ観点からの高速化2019
モデルアーキテクチャ観点からの高速化2019モデルアーキテクチャ観点からの高速化2019
モデルアーキテクチャ観点からの高速化2019
 
近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer近年のHierarchical Vision Transformer
近年のHierarchical Vision Transformer
 
PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜
PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜
PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜
 

Similar to NVIDIA HPC ソフトウエア斜め読み

Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
Sri Prasanna
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
gothicane
 
Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/Invoke
Dmitri Nesteruk
 

Similar to NVIDIA HPC ソフトウエア斜め読み (20)

Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
main
mainmain
main
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
 
Haskell Accelerate
Haskell  AccelerateHaskell  Accelerate
Haskell Accelerate
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
Unmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/InvokeUnmanaged Parallelization via P/Invoke
Unmanaged Parallelization via P/Invoke
 
Automatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELAutomatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSEL
 
Arvindsujeeth scaladays12
Arvindsujeeth scaladays12Arvindsujeeth scaladays12
Arvindsujeeth scaladays12
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
Build 2016 - B880 - Top 6 Reasons to Move Your C++ Code to Visual Studio 2015
 
Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015Ehsan parallel accelerator-dec2015
Ehsan parallel accelerator-dec2015
 
NvFX GTC 2013
NvFX GTC 2013NvFX GTC 2013
NvFX GTC 2013
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Workshop "Can my .NET application use less CPU / RAM?", Yevhen TatarynovWorkshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
Workshop "Can my .NET application use less CPU / RAM?", Yevhen Tatarynov
 
Overview Of Parallel Development - Ericnel
Overview Of Parallel Development -  EricnelOverview Of Parallel Development -  Ericnel
Overview Of Parallel Development - Ericnel
 

More from NVIDIA Japan

More from NVIDIA Japan (20)

HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?
 
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
 
20221021_JP5.0.2-Webinar-JP_Final.pdf
20221021_JP5.0.2-Webinar-JP_Final.pdf20221021_JP5.0.2-Webinar-JP_Final.pdf
20221021_JP5.0.2-Webinar-JP_Final.pdf
 
開発者が語る NVIDIA cuQuantum SDK
開発者が語る NVIDIA cuQuantum SDK開発者が語る NVIDIA cuQuantum SDK
開発者が語る NVIDIA cuQuantum SDK
 
NVIDIA Modulus: Physics ML 開発のためのフレームワーク
NVIDIA Modulus: Physics ML 開発のためのフレームワークNVIDIA Modulus: Physics ML 開発のためのフレームワーク
NVIDIA Modulus: Physics ML 開発のためのフレームワーク
 
Magnum IO GPUDirect Storage 最新情報
Magnum IO GPUDirect Storage 最新情報Magnum IO GPUDirect Storage 最新情報
Magnum IO GPUDirect Storage 最新情報
 
GTC November 2021 – テレコム関連アップデート サマリー
GTC November 2021 – テレコム関連アップデート サマリーGTC November 2021 – テレコム関連アップデート サマリー
GTC November 2021 – テレコム関連アップデート サマリー
 
テレコムのビッグデータ解析 & AI サイバーセキュリティ
テレコムのビッグデータ解析 & AI サイバーセキュリティテレコムのビッグデータ解析 & AI サイバーセキュリティ
テレコムのビッグデータ解析 & AI サイバーセキュリティ
 
必見!絶対におすすめの通信業界セッション 5 つ ~秋の GTC 2020~
必見!絶対におすすめの通信業界セッション 5 つ ~秋の GTC 2020~必見!絶対におすすめの通信業界セッション 5 つ ~秋の GTC 2020~
必見!絶対におすすめの通信業界セッション 5 つ ~秋の GTC 2020~
 
2020年10月29日 プロフェッショナルAI×Roboticsエンジニアへのロードマップ
2020年10月29日 プロフェッショナルAI×Roboticsエンジニアへのロードマップ2020年10月29日 プロフェッショナルAI×Roboticsエンジニアへのロードマップ
2020年10月29日 プロフェッショナルAI×Roboticsエンジニアへのロードマップ
 
2020年10月29日 Jetson活用によるAI教育
2020年10月29日 Jetson活用によるAI教育2020年10月29日 Jetson活用によるAI教育
2020年10月29日 Jetson活用によるAI教育
 
2020年10月29日 Jetson Nano 2GBで始めるAI x Robotics教育
2020年10月29日 Jetson Nano 2GBで始めるAI x Robotics教育2020年10月29日 Jetson Nano 2GBで始めるAI x Robotics教育
2020年10月29日 Jetson Nano 2GBで始めるAI x Robotics教育
 
COVID-19 研究・対策に活用可能な NVIDIA ソフトウェアと関連情報
COVID-19 研究・対策に活用可能な NVIDIA ソフトウェアと関連情報COVID-19 研究・対策に活用可能な NVIDIA ソフトウェアと関連情報
COVID-19 研究・対策に活用可能な NVIDIA ソフトウェアと関連情報
 
Jetson Xavier NX クラウドネイティブをエッジに
Jetson Xavier NX クラウドネイティブをエッジにJetson Xavier NX クラウドネイティブをエッジに
Jetson Xavier NX クラウドネイティブをエッジに
 
GTC 2020 発表内容まとめ
GTC 2020 発表内容まとめGTC 2020 発表内容まとめ
GTC 2020 発表内容まとめ
 
NVIDIA Jetson導入事例ご紹介
NVIDIA Jetson導入事例ご紹介NVIDIA Jetson導入事例ご紹介
NVIDIA Jetson導入事例ご紹介
 
JETSON 最新情報 & 自動外観検査事例紹介
JETSON 最新情報 & 自動外観検査事例紹介JETSON 最新情報 & 自動外観検査事例紹介
JETSON 最新情報 & 自動外観検査事例紹介
 
HELLO AI WORLD - MEET JETSON NANO
HELLO AI WORLD - MEET JETSON NANOHELLO AI WORLD - MEET JETSON NANO
HELLO AI WORLD - MEET JETSON NANO
 
Final 20200326 jetson edge comuputing digital seminar 1 final (1)
Final 20200326 jetson edge comuputing digital seminar 1 final (1)Final 20200326 jetson edge comuputing digital seminar 1 final (1)
Final 20200326 jetson edge comuputing digital seminar 1 final (1)
 
20200326 jetson edge comuputing digital seminar 1 final
20200326 jetson edge comuputing digital seminar 1 final20200326 jetson edge comuputing digital seminar 1 final
20200326 jetson edge comuputing digital seminar 1 final
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 

NVIDIA HPC ソフトウエア斜め読み

  • 2. AGENDA Accelerated Computing with Standard Languages § C++ § Fortran § Python Math Libraries § Multi-GPU Math Libraries § Math Library Device Extensions Developer Tools § Compute debuggers/IDE § Nsight Systems § Nsight Compute
  • 3. AGENDA Accelerated Computing with Standard Languages § C++ § Fortran § Python Math Libraries § Multi-GPU Math Libraries § Math Library Device Extensions Developer Tools § Compute debuggers/IDE § Nsight Systems § Nsight Compute
  • 4. PROGRAMMING THE NVIDIA PLATFORM CPU, GPU, and Network ACCELERATED STANDARD LANGUAGES ISO C++, ISO Fortran PLATFORM SPECIALIZATION CUDA ACCELERATION LIBRARIES Core Communication Math Data Analytics AI Quantum std::transform(par, x, x+n, y, y, [=](float x, float y){ return y + a*x; } ); do concurrent (i = 1:n) y(i) = y(i) + a*x(i) enddo import cunumeric as np … def saxpy(a, x, y): y[:] += a*x #pragma acc data copy(x,y) { ... std::transform(par, x, x+n, y, y, [=](float x, float y){ return y + a*x; }); ... } #pragma omp target data map(x,y) { ... std::transform(par, x, x+n, y, y, [=](float x, float y){ return y + a*x; }); ... } __global__ void saxpy(int n, float a, float *x, float *y) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] += a*x[i]; } int main(void) { ... cudaMemcpy(d_x, x, ...); cudaMemcpy(d_y, y, ...); saxpy<<<(N+255)/256,256>>>(...); cudaMemcpy(y, d_y, ...); ACCELERATED STANDARD LANGUAGES ISO C++, ISO Fortran INCREMENTAL PORTABLE OPTIMIZATION OpenACC, OpenMP PLATFORM SPECIALIZATION CUDA
  • 5. NVIDIA HPC SDK Available at developer.nvidia.com/hpc-sdk, on NGC, via Spack, and in the Cloud Develop for the NVIDIA Platform: GPU, CPU and Interconnect Libraries | Accelerated C++ and Fortran | Directives | CUDA 7-8 Releases Per Year | Freely Available Compilers nvcc nvc nvc++ nvfortran Programming Models Standard C++ & Fortran OpenACC & OpenMP CUDA Core Libraries libcu++ Thrust CUB Math Libraries cuBLAS cuTENSOR cuSPARSE cuSOLVER cuFFT cuRAND Communication Libraries HPC-X NVSHMEM NCCL DEVELOPMENT Profilers Nsight Systems Compute Debugger cuda-gdb Host Device ANALYSIS SHARP HCOLL UCX SHMEM MPI
  • 7. PILLARS OF STANDARD LANGUAGE PARALLELISM 7 Copyright (C) 2021 Bryce Adelstein Lelbach Common Algorithms that Dispatch to Vendor-Optimized Parallel Libraries Tools to Write Your Own Parallel Algorithms that Run Anywhere sender auto algorithm (sender auto s) { return s | bulk(N, [] (auto data) { // ... } ) | bulk(N, [] (auto data) { // ... } ); } Mechanisms for Composing Parallel Invocations into Task Graphs sender auto algorithm (sender auto s) { return s | bulk( [] (auto data) { // ... } ) | bulk( [] (auto data) { // ... } ); }
  • 8. C++ with OpenMP Ø Composable, compact and elegant Ø Easy to read and maintain Ø ISO Standard Ø Portable – nvc++, g++, icpc, MSVC, … Standard C++ #pragma omp parallel // OpenMP parallel region { #pragma omp for // OpenMP for loop for (MInt i = 0; i < noCells; i++) { // Loop over all cells if (timeStep % ipow2[maxLevel_ – clevel[i * distLevel]] == 0) { // Multi-grid loop const MInt distStartId = i * nDist; // More offsets for 1D accesses // Local offsets const MInt distNeighStartId = i * distNeighbors; const MFloat* const distributionsStart = &[distributions[distStartId]; for (MInt j = 0; j < nDist – 1; j += 2) { // Unrolled loop distributions (factor 2) if (neighborId[I * distNeighbors + j] > -1) { // First unrolled iteration const MInt n1StartId = neighborId[distNeighStartId + j] * nDist; oldDistributions[n1StartId + j] = distributionsStart[j]; // 1D access AoS format } if (neighborId[I * distNeighbors + j + 1] > -1) { // Second unrolled iteration const MInt n2StartId = neighborId[distNeighStartId + j + 1] * nDist; oldDistributions[n2StartId + j + 1] = distributionsStart[j + 1]; } } oldDistributions[distStartId + lastId] = distributionsStart[lastId]; // Zero-th distribution } } } std::for_each_n(par_unseq, start, noCells, [=](auto i) { // Parallel for if (timeStep % IPOW2[maxLevel_ – a_level(i)] != 0) // Multi-level loop return; for (MInt j = 0; j < nDist; ++j) { if (auto n = c_neighborId(i, j); n == -1) continue; a_oldDistribution(n, j) = a_distribution(i, j); // SoA or AoS mem_fn } }); M-AIA WITH C++17 PARALLEL ALGORITHMS Multi-physics simulation framework from RWTH Aachen University Ø Hierarchical grids, complex moving geometries Ø Adaptive meshing, load balancing Ø Numerical methods: FV, DG, LBM, FEM, Level-Set, ... Ø Physics: aeroacoustics, combustion, biomedical, ... Ø Developed by ~20 PhDs (Mech. Eng.), ~500k LOC++ Ø Programming model: MPI + ISO C++ parallelism
  • 9. M-AIA Multi-physics simulation framework developed at the Institute of Aerodynamics, RWTH Aachen University Decaying isotropic turbulence 400k fully-resolved particles 1 1.025 8.74 0 1 2 3 4 5 6 7 8 9 10 OpenMP (2x EPYC 7742) ISO C++ (2x EPYC 7742) ISO C++ (A100) Relative Speed-Up
  • 10. PARALLELISM IN C++ ROADMAP C++ 14 C++ 17 C++ 20 C++ PIPELINE • Memory model enhancements • Lambdas • Atomics extensions • Generic Lambda Expressions • Parallel algorithms • Forward progress guarantees • Memory model clarifications • Scalable synchronization library • Ranges • Span • Linear algebra algorithms • Asynchronous parallel algorithms • Senders-receivers • Mdspan • Range-based parallel algorithms • Extended floating- point types General parallelism user facing feature How users run C++ code on GPUs today Co-designed with V100 hardware support Custom algorithms and async. control flow N-dimensional loops and usability Extended C++ interface to BLAS/Lapack General usability of performance provided by executors C++ 11
  • 11. PILLARS OF STANDARD LANGUAGE PARALLELISM 11 Copyright (C) 2021 Bryce Adelstein Lelbach With Senders & Receivers Today Common Algorithms that Dispatch to Vendor-Optimized Parallel Libraries Tools to Write Your Own Parallel Algorithms that Run Anywhere sender auto algorithm (sender auto s) { return s | bulk(N, [] (auto data) { // ... } ) | bulk(N, [] (auto data) { // ... } ); } Mechanisms for Composing Parallel Invocations into Task Graphs sender auto algorithm (sender auto s) { return s | bulk( [] (auto data) { // ... } ) | bulk( [] (auto data) { // ... } ); }
  • 12. SENDERS & RECEIVERS Maxwell’s equations template <ComputeSchedulerT, WriteSchedulerT> auto maxwell_eqs(ComputeSchedulerT &scheduler, WriteSchedulerT &writer) { return repeat_n( n_outer_iterations, repeat_n( n_inner_iterations, schedule(scheduler) | bulk(grid.cells, update_h(accessor)) | bulk(grid.cells, update_e(time, dt, accessor))) | transfer(writer) | then(dump_results(report_step, accessor))) | then([]{ printf("simulation completen"); }) ); } Simplify Work Across CPUs and Accelerators • Uniform abstraction between code and diverse resources • ISO standard • Write once, run everywhere •
  • 13. ELECTROMAGNETISM Raw performance & % of peak std::sync_wait(maxwell(inline_scheduler, inline_scheduler)); std::sync_wait(maxwell(openmp_scheduler, inline_scheduler)); std::sync_wait(maxwell(cuda, inline_scheduler)); § CPUs: AMD EPYC 7742 CPUs, GPUs: NVIDIA A100-SXM4-80 § Inline (1 CPU HW thread), OpenMP-128 (1x CPU), OpenMP-256 (2x CPUs), Graph (1x GPU), Multi-GPU (2x GPUs) § clang-12 with –O3 –DNDEBUG –mtune=native -fopenmp 0 5 10 15 20 25 30 OpenMP 128 OpenMP 256 CUDA (1 A100) CUDA (2 A100) Speedup Scheduler 0 10 20 30 40 50 60 70 80 90 100 OpenMP 128 OpenMP 256 CUDA (1 A100) CUDA (2 A100) Efficiency vs STREAM TRIAD Scheduler
  • 14. STRONG SCALING USING ISO STANDARD C++ NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE. NVIDIA SUPERPOD § 140x NVIDIA DGX-A100 640 § 1120x NVIDIA A100-SXM4-80 GPUs 0 0.2 0.4 0.6 0.8 1 1.2 0 5 10 15 20 25 30 35 40 0 200 400 600 800 1000 1200 Speedup Number of GPUs Maxwell SR Scaling Ideal Scaling Efficiency PARALLEL ALGORITHMS AND SENDERS & RECIEVERS
  • 15. PALABOS CARBON SEQUESTRATION 15 Copyright (C) 2022 NVIDIA Ø Palabos is a framework for fluid dynamics simulations using Lattice-Boltzmann methods. Ø Code for multi-component flow through a porous media ported to C++ Senders and Receivers. Ø Application: simulating carbon sequestration in sandstone. Christian Huber (Brown University), Jonas Latt (University of Geneva) Georgy Evtushenko (NVIDIA), Gonzalo Brito (NVIDIA) 0 4 8 12 16 32 128 224 320 416 512 A100 GPUs Strong Scaling
  • 16. ACCELERATED COMPUTING WITH STANDARD LANGUAGES § FORTRAN
  • 17. MODERN FORTRAN FEATURES FOR HPC Standard Parallelism and Concurrency Features DO CONCURRENT Reductions Support for reduction operations on concurrent loops (ala OpenACC/OpenMP). Began supporting in nvfortran 21.11. Fortran 202X Coming in 2023 Atomics Propose support for atomic variable accesses Asynchronous Tasking Propose support for asynchronous tasks Fortran 202Y In discussion DO CONCURRENT Data parallel loop construct, locality specifiers. Supported in nvfortran Array Intrinsics Various math intrinsics that may apply to entire arrays and map to accelerated libraries supported in nvfortran. Co-Arrays Partitioned Global Address Space arrays, teams of processes (images), collectives & synchronization. Awaiting F18. Fortran 2018
  • 18. MINIWEATHER Standard Language Parallelism in Climate/Weather Applications Mini-App written in C++ and Fortran that simulates weather-like fluid flows using Finite Volume and Runge-Kutta methods. Existing parallelization in MPI, OpenMP, OpenACC, … Included in the SPEChpc benchmark suite* Open-source and commonly-used in training events. https://github.com/mrnorman/miniWeather/ MiniWeather 0 10 20 OpenMP (CPU) Concurrent (CPU) Concurrent (GPU) OpenACC do concurrent (ll=1:NUM_VARS, k=1:nz, i=1:nx) local(x,z,x0,z0,xrad,zrad,amp,dist,wpert) if (data_spec_int == DATA_SPEC_GRAVITY_WAVES) then x = (i_beg-1 + i-0.5_rp) * dx z = (k_beg-1 + k-0.5_rp) * dz x0 = xlen/8 z0 = 1000 xrad = 500 zrad = 500 amp = 0.01_rp dist = sqrt( ((x-x0)/xrad)**2 + ((z-z0)/zrad)**2 ) * pi / 2._rp if (dist <= pi / 2._rp) then wpert = amp * cos(dist)**2 else wpert = 0._rp endif tend(i,k,ID_WMOM) = tend(i,k,ID_WMOM) + wpert*hy_dens_cell(k) endif state_out(i,k,ll) = state_init(i,k,ll) + dt * tend(i,k,ll) enddo Source: HPC SDK 22.1, AMD EPYC 7742, NVIDIA A100. MiniWeather: NX=2000, NZ=1000, SIM_TIME=5. OpenACC version uses –gpu=managed option. *SPEChpc is a trademark of The Standard Performance Evaluation Corporation
  • 19. POT3D: DO CONCURRENT POT3D is a Fortran application for approximating solar coronal magnetic fields. Included in the SPEChpc benchmark suite* Existing parallelization in MPI & OpenACC Optimized the DO CONCURRENT version by using OpenACC solely for data motion and atomics https://github.com/predsci/POT3D POT3D !$acc enter data copyin(phi,dr_i) !$acc enter data create(br) do concurrent (k=1:np,j=1:nt,i=1:nrm1) br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k ))*dr_i(i) enddo !$acc exit data delete(phi,dr_i,br) Data courtesy of Predictive Science Inc. *SPEChpc is a trademark of The Standard Performance Evaluation Corporation
  • 20. ACCELERATED COMPUTING WITH STANDARD LANGUAGES § PYTHON
  • 21. PRODUCTIVITY Sequential and Composable Code § Sequential semantics - no visible parallelism or synchronization § Name-based global data – no partitioning § Composable – can combine with other libraries and datatypes def cg_solve(A, b, conv_iters): x = np.zeros_like(b) r = b - A.dot(x) p = r rsold = r.dot(r) converged = False max_iters = b.shape[0] for i in range(max_iters): Ap = A.dot(p) alpha = rsold / (p.dot(Ap)) x = x + alpha * p r = r - alpha * Ap rsnew = r.dot(r) if i % conv_iters == 0 and np.sqrt(rsnew) < 1e-10: converged = i break beta = rsnew / rsold p = r + beta * p rsold = rsnew
  • 22. PERFORMANCE §Transparently run at any scale needed to address computational challenges at hand §Automatically leverage all the available hardware Transparent Acceleration Supercomputer Multi-GPU GPU DPU Grace CPU
  • 23. COMPUTATIONAL FLUID DYNAMICS Time (seconds) Relative dataset size Number of GPUs 0 50 100 150 1 2 4 8 16 32 64 128 256 512 1024 Distributed NumPy Performance (weak scaling) cuPy Legate for _ in range(iter): un = u.copy() vn = v.copy() b = build_up_b(rho, dt, dx, dy, u, v) p = pressure_poisson_periodic(b, nit, p, dx, dy) … Extracted from “CFD Python” course at https://github.com/barbagroup/CFDPython Barba, Lorena A., and Forsyth, Gilbert F. (2018). CFD Python: the 12 steps to Navier-Stokes equations. Journal of Open Source Education, 1(9), 21, https://doi.org/10.21105/jose.00021 • CFD codes like: • Shallow-Water Equation Solver • Oil Pipeline Risk Management: Geoclaw- landspill simulations • Python Libraries: Jupyter, NumPy, SciPy, SymPy, Matplotlib CFD Python on cuNumeric!
  • 24. ACCELERATED STANDARD LANGUAGES Parallel performance for wherever your code runs std::transform(par, x, x+n, y, y,[=](float x, float y){ return y + a*x; } ); import cunumeric as np … def saxpy(a, x, y): y[:] += a*x do concurrent (i = 1:n) y(i) = y(i) + a*x(i) enddo ISO C++ ISO Fortran Python CPU GPU nvc++ -stdpar=multicore nvfortran –stdpar=multicore legate –cpus 16 saxpy.py nvc++ -stdpar=gpu nvfortran –stdpar=gpu legate –gpus 1 saxpy.py
  • 25. LERN MORE GTC2022 sessions § No More Porting: Coding for GPUs with Standard C++, Fortran, and Python [S41496] § Shifting through the Gears of GPU Programming Understanding Performance and Portability Trade-offs [S41620] § C++ Standard Parallelism [S41960] § Future of Standard and CUDA C++ [S41961] § Connect with Experts: Standard and CUDA C++ User Forum [CWE41949] § From Directives to DO CONCURRENT: A Case Study in Standard Parallelism [S41318] § Evaluating Your Options for Accelerated Numerical Computing in Pure Python [S41645] Blogs § Developing Accelerated Code with Standard Language Parallelism § Accelerating Standard C++ with GPUs Using stdpar § Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK § Bringing Tensor Cores to Standard Fortran § Accelerating Python on GPUs with nvc++ and Cython
  • 26. LERN MORE Blogs § Multi-GPU Programming with Standard Parallel C++, Part1 § Multi-GPU Programming with Standard Parallel C++, Part2 Open-source codes § LULESH: https://github.com/LLNL/LULESH § STLBM: https://gitlab.com/unigehpfs/stlbm § MiniWeather: https://github.com/mrnorman/miniWeather/ § POT3D: https://github.com/predsci/POT3D § Legate: https://github.com/nv-legate § Jacobi example using C++ standard parallelism: https://gitlab.com/unigehpfs/paralg NVIDIA HPC SDK Documentation https://docs.nvidia.com/hpc-sdk/index.html
  • 27. AGENDA Accelerated Computing with Standard Languages § C++ § Fortran § Python Math Libraries § Multi-GPU Math Libraries § Math Library Device Extensions Developer Tools § Compute debuggers/IDE § Nsight Systems § Nsight Compute
  • 28. NVIDIA MATH LIBRARIES Linear Algebra, FFT, RNG and Basic Math CUDA Math API cuFFT cuSPARSE cuSOLVER cuBLAS cuTENSOR cuRAND CUTLASS
  • 29. MATH LIBRARIES § MULTI-GPU MATH LIBRARIES
  • 31. cuTENSORMg: MULTI-GPU TENSOR CONTRACTIONS Performance of FP32 Tensor Contractions on DGX A100 Data residing on Host (Dotted) or Device (Solid) Memory * DGX A100 80GB § Introduced in cuTENSOR v1.4 § Out-of-core released in v1.5 § 0 20 40 60 80 100 120 140 160 4096 8192 16384 32768 49152 65536 81920 98304 114688 131072 147456 163840 180224 196608 TFLOPS (LARGER IS BETTER) SIZES: M = N = K 1 - Device 1 - Host 2 - Device 2 - Host 4 - Device 4 - Host 8 - Device 8 - Host Releasing cuTENSOR v1.5 § Added Out-of-core Functionality § Library wide optimizations
  • 32. cuSOLVERMp: DENSE LINEAR ALGEBRA AT SCALE LU Decomposition (GETRF+GETRS) w/ Pivoting on Summit Supercomputer 1 2 4 8 16 32 64 128 256 512 1024 2048 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 TIME IN SECONDS (SMALLER IS BETTER) NUMBER OF GPUS State-of-the-Art HPC SDK 21.11 * Summit: 6x V100 16GB per node Released in HPC SDK 21.11 § LU Decomposition § With & Without pivoting § Cholesky
• 33. cuFFTMp: FFTs AT SCALE - SLAB DECOMPOSITION
[Figure: Distributed 3D FFT performance on Selene (A100 80GB @ 1410 MHz), comparison by precision (C2C vs. Z2Z) — TFLOPS (larger is better) vs. number of GPUs (8–4096) and problem size (cubed, 2048–16384)]
Coming to HPC SDK 22.3
§ Distributed 2D/3D FFTs
§ Slab decomposition
§ Pencil decomposition (preview)
§ Helper functions: Pencils ↔ Slabs
• 34. cuFFTMp: FFTs AT SCALE - PENCIL DECOMPOSITION
[Figure: Distributed 3D FFT performance on Selene (A100 80GB @ 1410 MHz), C2C, comparison by GPU count (32–2048 GPUs) — TFLOPS (larger is better) vs. problem size (cubed, 1024–8192)]
Coming to HPC SDK 22.3
§ Distributed 2D/3D FFTs
§ Slab decomposition
§ Pencil decomposition (preview)
§ Helper functions: Pencils ↔ Slabs
[S41494] A Deep Dive into the Latest HPC Software
  • 35. MATH LIBRARIES § MATH LIBRARY DEVICE EXTENSIONS
• 36. MATH LIBRARIES DEVICE EXTENSIONS
[Figure: cuFFTDx performance on A100 80GB @ 1410 MHz, comparison with cuFFT across 1D FFT sizes (2–32768) — TFLOPS (larger is better)]
Released in MathDx 22.02
§ Available on DevZone
§ Supports Volta+ architectures
§ FFT 1D sizes up to 32k
Future Releases
§ cuBLASDx/cuSOLVERDx
§ 2D/3D FFTs
§ Windows support
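For context, the host-side cuFFT baseline that the chart above compares against looks roughly like the sketch below (single-GPU, batched 1D C2C; error checks omitted, sizes are arbitrary examples). cuFFTDx instead lets you run the equivalent transform inside your own CUDA kernel via C++ templates, which is what enables the fusion gains shown above.

#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int fft_size = 1024;     // 1D FFT length
    const int batch    = 1 << 14;  // many independent transforms

    cufftComplex* data;
    cudaMalloc(&data, sizeof(cufftComplex) * fft_size * batch);
    // ... fill `data` with input signals ...

    // Plan and execute a batched 1D complex-to-complex FFT on the GPU
    cufftHandle plan;
    cufftPlan1d(&plan, fft_size, CUFFT_C2C, batch);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}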
• 37. LEARN MORE
GTC2022 sessions
§ An Explanation of Slab and Pencil Decomposition Performance Across Supercomputing Clusters [S41153]
§ Recent Developments in NVIDIA Math Libraries [S41491]
§ Connect with Experts: NVIDIA Math Libraries [CWE41721]
§ Connect with Experts: Thrust, CUB, and libcu++ User Forum [CWE41948]
§ NVSHMEM: CUDA-Integrated Communication for NVIDIA GPUs (a Magnum IO session) [S41044]
Examples
§ CUDA Library Samples: https://github.com/NVIDIA/CUDALibrarySamples
MathDx 22.02
§ https://developer.nvidia.com/mathdx
• 38. LEARN MORE
Math Libraries Documentation
https://docs.nvidia.com/hpc-sdk/index.html#math-libraries
Blogs
§ Multinode Multi-GPU: Using NVIDIA cuFFTMp FFTs at Scale
§ Extending Block-Cyclic Tensors for Multi-GPU with NVIDIA cuTENSORMg
  • 39. AGENDA Accelerated Computing with Standard Languages § C++ § Fortran § Python Math Libraries § Multi-GPU Math Libraries § Math Library Device Extensions Developer Tools § Compute debuggers/IDE § Nsight Systems § Nsight Compute
• 40. DEVELOPER TOOLS
Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
Debuggers: cuda-gdb, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
Correctness Checker: Compute Sanitizer
IDE integrations: Nsight Eclipse Edition, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
  • 41. DEVELOPER TOOLS § COMPUTE DEBUGGERS/IDE
  • 42. CUDA-GDB Command-Line and IDE Back-End Debugger § Unified CPU and CUDA Debugging § CUDA-C/SASS support § Built on GDB and uses many of the same CLI commands
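To make the workflow concrete, here is a minimal sketch — the kernel, file name, and session commands below are illustrative assumptions, not output from a real session. Compile with device debug information, e.g. nvcc -g -G saxpy.cu -o saxpy.

// saxpy.cu -- tiny program used to illustrate a cuda-gdb session
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];   // a convenient place for a breakpoint
}

int main() {
    const int n = 1 << 10;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    std::printf("y[0] = %f\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}

// A typical session (commands entered at the (cuda-gdb) prompt):
//   cuda-gdb ./saxpy
//   (cuda-gdb) break saxpy                        # stop at kernel entry
//   (cuda-gdb) run
//   (cuda-gdb) info cuda kernels                  # list active kernels
//   (cuda-gdb) cuda block (0,0,0) thread (5,0,0)  # switch focus to one GPU thread
//   (cuda-gdb) print i                            # inspect variables in that thread
//   (cuda-gdb) continue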
• 43. COMPUTE SANITIZER
Automatically Scan for Bugs and Memory Issues
Compute Sanitizer checks correctness issues via sub-tools:
§ Memcheck – Memory access error and leak detection tool.
§ Racecheck – Shared memory data access hazard detection tool.
§ Initcheck – Uninitialized device global memory access detection tool.
§ Synccheck – Thread synchronization hazard detection tool.
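A minimal sketch of how these sub-tools are typically invoked (the kernel, file, and binary names are illustrative; the deliberate out-of-bounds write is the kind of bug Memcheck reports):

// oob.cu -- deliberately writes one element past the end of the allocation
#include <cuda_runtime.h>

__global__ void fill(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n) data[i] = 1.0f;   // bug: '<=' allows an out-of-bounds write at i == n
}

int main() {
    const int n = 1024;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    fill<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}

// Typical invocations (one sub-tool per run):
//   compute-sanitizer --tool memcheck  ./oob   # out-of-bounds/misaligned accesses, leaks
//   compute-sanitizer --tool racecheck ./oob   # shared-memory data race hazards
//   compute-sanitizer --tool initcheck ./oob   # reads of uninitialized global memory
//   compute-sanitizer --tool synccheck ./oob   # invalid thread synchronization usage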
  • 44. DEVELOPER TOOLS § COMPUTE DEBUGGERS/IDE NEW FEATURES
• 45. CORRECTNESS TOOLS FEATURES
§ OptiX support in Compute Sanitizer
  § Automatically find correctness issues in OptiX workloads
§ Core dump support in Compute Sanitizer
  § Generate core dumps on detected issues
§ 5x performance increase in core dump generation
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 1 bytes
=========     at 0x4d70 in /home/cuda/optixBasic/draw_solid_color.cu:69:__raygen__draw_solid_color_0xebf766b2f0642d4e
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f878f900403 is out of bounds
=========     and is 262,132 bytes after the nearest allocation at 0x7f878f8c0400 of size 16 bytes
=========     Device Frame:NVIDIA internal [0x430]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x60fbaa]
=========                in /lib/x86_64-linux-gnu/libnvoptix.so.1
=========     Host Frame:optix_stubs.h:568:optixLaunch [0xe1ff]
=========                in /home/cuda/optixBasic/optixBasic
=========     Host Frame:/home/cuda/optixBasic/optixBasic.cpp:227:main [0xb735]
=========                in /home/cuda/optixBasic/optixBasic
=========     Host Frame:../sysdeps/nptl/libc_start_call_main.h:58:__libc_start_call_main [0x2dfd0]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:../csu/libc-start.c:379:__libc_start_main [0x2e07d]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x8dde]
=========                in /home/cuda/optixBasic/optixBasic
• 47. NSIGHT SYSTEMS
System Profiler
Key Features:
§ System-wide application algorithm tuning
§ Multi-process tree support
§ Locate optimization opportunities
  § Visualize millions of events on a very fast GUI timeline
  § Or gaps of unused CPU and GPU time
§ Balance your workload across multiple CPUs and GPUs
  § CPU algorithms, utilization, and thread state; GPU streams, kernels, memory transfers, etc.
§ Command line, standalone, IDE integration
OS: Linux (x86, Power, Arm SBSA, Tegra), Windows, MacOSX (host)
GPUs: Pascal+
§ Docs/product: https://developer.nvidia.com/nsight-systems
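As a sketch of a typical workflow, the application can be annotated with NVTX ranges so they appear on the Nsight Systems timeline; the kernel, range name, and report name below are arbitrary examples:

#include <nvtx3/nvToolsExt.h>   // NVTX annotations (headers ship with the CUDA Toolkit)
#include <cuda_runtime.h>

__global__ void step(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 1.01f;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    nvtxRangePushA("timestep loop");          // named range visible on the timeline
    for (int it = 0; it < 100; ++it) {
        step<<<(n + 255) / 256, 256>>>(d, n);
    }
    cudaDeviceSynchronize();
    nvtxRangePop();

    cudaFree(d);
    return 0;
}

// Collect a system-wide profile including CUDA and NVTX activity:
//   nsys profile --trace=cuda,nvtx,osrt -o report ./app
// then open the generated report file in the Nsight Systems GUI.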
  • 48. DEVELOPER TOOLS § NSIGHT SYSTEMS NEW FEATURES
• 50. MULTI-REPORT TILING
Visualize More Parallel Activity
§ Open multiple reports
§ Loaded on the same timeline, aligned by wall-clock time
  • 51. EXPERT SYSTEMS & STATISTICS Built-in Data Analytics with Advice
• 52. NVIDIA NETWORKING ADAPTER SAMPLING
§ Profile NVIDIA networking adapters
§ Sent / Received / Congestion
§ Correlate with expected network traffic and other system activities
  • 53. GPU DIRECT STORAGE SUPPORT GPU Metrics Sampling of PCIe BAR1 Requests & CuFile Trace § Direct communication to GPU memory § CUFILE APIs used for GPU Direct Storage
• 55. NSIGHT COMPUTE
Kernel Profiling Tool
Key Features:
§ Interactive CUDA API debugging and kernel profiling
§ Built-in rules expertise
§ Fully customizable data collection and display
§ Command line, standalone, IDE integration, remote targets
OS: Linux (x86, Power, Tegra, Arm SBSA), Windows, MacOSX (host only)
GPUs: Volta, Turing, Ampere
Docs/product: https://developer.nvidia.com/nsight-compute
  • 56. DEVELOPER TOOLS § NSIGHT COMPUTE NEW FEATURES
  • 57. REGISTER DEPENDENCY VISUALIZATION Visualize Register Usage and Dependency Chains § SASS view in the Source page § Tracking reads and writes for each register § Identify long dependency chains § Detect inefficient register usage § Columns show all dependencies for: § Registers § Predicates § Uniform Registers § Uniform Predicates
  • 58. STANDALONE SOURCE VIEWER § View of side-by-side assembly and correlated source code for CUDA kernels § No profile required § Open .cubin files directly § Helps identify compiler optimizations and inefficiencies
  • 59. OCCUPANCY CALCULATOR Model Hardware Usage and Identify Limiters § Model theoretical hardware usage § Understand limitations from hardware vs. kernel parameters § Configure model to vary HW and kernel parameters § Opened from an existing report or as a new activity
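The same kind of what-if analysis can also be driven from code with the CUDA occupancy APIs; a minimal sketch follows (the kernel is an arbitrary example):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int blockSize = 256, numBlocksPerSm = 0, device = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    // Active blocks per SM for this kernel at blockSize threads, 0 bytes of dynamic shared memory
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, kernel, blockSize, 0);

    float occupancy = (numBlocksPerSm * blockSize) /
                      static_cast<float>(prop.maxThreadsPerMultiProcessor);
    std::printf("Theoretical occupancy at block size %d: %.0f%%\n", blockSize, occupancy * 100.f);

    // Let the runtime suggest a block size that maximizes occupancy
    int minGridSize = 0, bestBlock = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &bestBlock, kernel, 0, 0);
    std::printf("Suggested block size: %d\n", bestBlock);
    return 0;
}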
  • 60. HIERARCHICAL ROOFLINE § Visualize multiple levels of the memory hierarchy § Identify bottlenecks caused by memory limitations § Determine how modifying algorithms may (or may not) impact performance
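For reference, each roofline in the hierarchy is the standard bound below (a reminder of the general model, evaluated per memory level, not anything specific to Nsight Compute):

\[
\text{Attainable FLOP/s} \;=\; \min\bigl(\text{Peak FLOP/s},\ \mathrm{AI}\times \text{Peak bandwidth at that level}\bigr),
\qquad
\mathrm{AI} \;=\; \frac{\text{FLOPs}}{\text{bytes moved at that level}}
\]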
• 61. LEARN MORE
GTC2022 sessions
§ Optimizing Communication with Nsight Systems Network Profiling [S41500]
§ Latest Updates to CUDA Developer Tools [D4121]
§ How to Understand and Optimize Shared Memory Accesses using Nsight Compute [S41723]
§ Connect with Experts: What’s in Your CUDA Toolbox? Profiling, Optimization, and Debugging Tools [CWE41541]
§ What, Where, and Why? Use CUDA Developer Tools to Detect, Locate, and Explain Bugs and Bottlenecks [S41493]
Nsight Systems Documentation
§ https://docs.nvidia.com/nsight-systems/
Nsight Compute Documentation
§ https://docs.nvidia.com/nsight-compute/
  • 62. AGENDA Accelerated Computing with Standard Languages § C++ § Fortran § Python Math Libraries § Multi-GPU Math Libraries § Math Library Device Extensions Developer Tools § Compute debuggers/IDE § Nsight Systems § Nsight Compute