SlideShare a Scribd company logo
Jack Han | Solutions Architect | jahan@nvidia.com
PROFILING
DEEP LEARNING NETWORK
2
AGENDA
Profiling Deep Learning Network
Introduction to Nsight Systems
PyTorch NVTX Annotation
Examples and Benefits
Mixed Precision with BERT
TensorFlow NVTX Plugins
Plugin Usage Example
3
HOW TO SPEED-UP NETWORK
Use the latest GPU and more GPU?
Wait NVIDIA to release the new GPU or newer library?
Optimize Neural Network Computing or Something
Need to understand your network operation
We are talking about this
4
PROFILING NEURAL NETWORK
Profiling with cProfiler + Snakeviz
python -m cProfile -o 100_percent_gpu_utilization.prof train.py
snakeviz 100_percent_gpu_utilization.prof
https://sagivtech.com/2017/09/19/optimizing-pytorch-training-code/
Useful?
5
TENSORFLOW PROFILER
Show timeline and can trace network
Still difficult to understand
https://github.com/tensorflow/profiler-ui
Useful?
6
NVIDIA PROFILING TOOLS
CUPTI (CUDA Profiling Tools Interface)
Nsight Systems
Nsight Compute
Nsight Visual Studio Edition
Nsight Eclipse Edition
NVIDIA Profiler (nvprof)
NVIDIA Visual Profiler (nvvp)
7
NSIGHT PRODUCT FAMILY
Standalone Performance Tools
Nsight Systems system-wide application algorithm tuning
Nsight Compute Debug/optimize specific CUDA kernel
Nsight Graphics Debug/optimize specific graphics
IDE plugins
Nsight Eclipse Edicion/Visual Studio editor, debugger, some perf analysis
8
NSIGHT SYSTEMS
Profile System-wide application
Multi-process tree, GPU workload trace, etc
Investigate your workload across multiple CPUs and GPUs
CPU algorithms, utilization, and thread states
GPU streams kernels, memory transfers, etc
NVTX, CUDA & Library API, etc
Ready for Big Data
docker, user privilege (linux), cli, etc
Overview
9
Thread/core
migration
Thread state
Processes and
threads
CUDA and OpenGL
API trace
cuDNN and
cuBLAS trase
Kernel and
memory transfer
activities
Multi-GPU
10
NVTX Tracining
11
TRANSITIONING TO PROFILE A KERNEL
Dive into kernel analysis
12
NVIDIA NSIGHT COMPUTE
Interactive CUDA API debugging and kernel profiling
Fast Data Collection
Graphical and multiple kernel comparison reports
Improved Workflow and Fully Customizable
(Baselining, Programmable UI/Rules)
Command Line, Standalone, IDE Integration
Platform Support
OS: Linux(x86,ARM), Windows, OSX (host only)
GPUs: Pascal, Volta, Turing
Next Generation Kernel Profiler
Kernel Profile Comparisons with Baseline
Metric Data
Source Correlation
13
PROFILING GPU APPLICATION
How to measure
Focusing GPU Computing
Low GPU
Utilization
Low SM
Efficiency
Low Achieved
Occupancy
Memory
Bottleneck
Instructions
Bottleneck
GPU Profiling
CPU/GPU
Tracing
Application
Tracing
• Too few threads
• Register limit
• Large shared
memory
…
• Cache misses
• Bandwidth limit
• Access pattern
…
• Arithmetic
• Control flow
…
NVIDIA (Visual) Profiler / Nsight Compute
NVIDIA Supports them with cuDNN, cuBLAS, and so on
14
GPU Profiling
CPU/GPU
Tracing
Application
Tracing
PROFILING GPU APPLICATION
How to measure
Focusing System Operation
Low GPU
Utilization
Low SM
Efficiency
Low Achieved
Occupancy
Memory
Bottleneck
Instructions
Bottleneck
CPU-Only
Activities
Memcopy
Latency
Kernel Launch
Latency
Job Startup /
Checkpoints
CPU
Computation
I/O
Nsight System / Application Tracing
15
NSIGHT SYSTEMS PROFILE
Profile with CLI
$ nsys profile –t cuda,osrt,nvtx,cudnn,cublas 
-y 60 
-d 20 
–o baseline 
-f true 
–w true 
python main.py
cuda – GPU kernel
osrt – OS runtime
nvtx – NVIDIA Tools Extension
cudnn – CUDA Deep NN library
cublas– CUDA BLAS library
https://docs.nvidia.com/nsight-systems/#nsight_systems/2019.3.6-x86/06-cli-profiling.htm
ßAPIs to be traced
ßDelayed profile (sec)
ßProfiling duration (sec)
ßOutput filename
ßOverwrite when it’s true
ßDisplay
ßExecution command
16
DISABLING ASYNCHRONOUS OPERATION
Elimination of CPU interop effects
export CUDA_LAUNCH_BLOCKING=1
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution
17
PROFILING
PYTORCH MODEL
18
BASELINE PROFILE
MNIST Training: 89 sec, <5% utilization
CPU waits on a semaphore and starves the GPU!
GPU Starvation GPU Starvation
19
NVTX ANNOTATIONS
def train(args, model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
20
NVTX ANNOTATIONS
import torch.cuda.nvtx as nvtx
def train(args, model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
nvtx.range_push("Batch " + str(batch_idx))
nvtx.range_push("Copy to device")
data, target = data.to(device), target.to(device)
nvtx.range_pop()
nvtx.range_push("Forward pass")
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
nvtx.range_pop()
nvtx.range_pop()
https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
21
BASELINE PROFILE (WITH NVTX)
GPU is idle during data loading
Data is loaded using a single thread. This starves the GPU!
fwd/bwd fwd/bwd
5.1ms
22
OPTIMIZE SOURCE CODE
Data loader was configured to use 1 worker thread
Let’s switch to using 8 worker threads:
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
kwargs = {'num_workers’: 8, 'pin_memory': True} if use_cuda else {}
23
AFTER OPTIMIZATION
Time for data loading reduced for each bath
60us Reduced from 5.1ms to 60us for batch 50
24
REAL EXAMPLE
https://news.developer.nvidia.com/nvidia-achieves-4x-speedup-on-bert-neural-network/
How?
25
INITIAL PROFILE
No NVTX
Difficult to understand à no useful
26
loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
...
if args.fp16:
optimizer.backward(loss)
else:
loss.backward()
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
# modify learning rate with special warm up BERT uses
# if args.fp16 is False, BertAdam is used and handles this automatically
lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
for param_group in optimizer.param_groups:
param_group['lr'] = lr_this_step
optimizer.step()
optimizer.zero_grad()
global_step += 1
Foward
Backward
NVTX TAGGING
27
nvtx.range_push("Batch " + str(step))
nvtx.range_push("Forward pass")
loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
nvtx.range_pop()
...
nvtx.range_push("Backward pass")
if args.fp16:
optimizer.backward(loss)
else:
loss.backward()
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
# modify learning rate with special warm up BERT uses
# if args.fp16 is False, BertAdam is used and handles this automatically
lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
for param_group in optimizer.param_groups:
param_group['lr'] = lr_this_step
optimizer.step()
optimizer.zero_grad()
global_step += 1
nvtx.range_pop()
nvtx.range_pop()
Foward
Backward
NVTX TAGGING
28
BERT FP32 BENCHMARK
HuggingFace’s pretrained BERT
Getting API List
~460 ms
~60% GEMM
4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
29
BERT FP16 BENCHMARK
HuggingFace’s pretrained BERT
Tensor Cores
APIs
~222 ms
2.1x Speed up
4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
30
132 ms
8.5 ms
APEX OPTIMIZER OPTIMIZATION
APEX/csrc/adam_cuda_kernel()
Low Utilization
Higher Utilization
FP32
FP16
31
PROFILING GPU APPLICATION
You can investigate the performance bottleneck in each layer
NVIDIA provides such inefficient layers / graphs optimization support
For example,
Like this, adam_cuda_kernel in APEX
cudnnMultiHeadAttnForward() in cuDNN
BERT speed-up
Reminding
cudnnMultiHeadAttnForward
32
PROFILING
TENSORFLOW GRAPHS
33
NVTX WITH TENSORFLOW
The TF_DISABLE_NVTX_RANGES variable enables
and disables NVTX ranges in TensorFlow
NVTX collect is enabled by default
Available in NGC Container
Having NVTX plugin for TF
Import nvtx plugin
Annotation in python code
Get data from through Profiler
34
NVTX PLUGINS FOR DEEP LEARNING
Allows users to add their own NVIDIA Tools Extension (NVTX) events and time ranges to a
TensorFlow graph
Ranges are added by wrapping regions of the computation graph with start and end
operations
Profiling TensorFlow for the graphs
NVTX Context
35
SETUP
Install NVTX plugins
pip install nvtx-plugins-tf
No need to install this plugin since NGC 19.08
36
HOW TO USE
Adding Markers
Function Detector
import nvtx.plugins.tf as nvtx_tf
x, nvtx_context = nvtx_tf.ops.start(x, message='Dense 1-2', 
domain_name='Forward’, grad_domain_name='Gradient’, enabled=ENABLE_NVTX, trainable=True)
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1')
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’)
x = nvtx_tf.ops.end(x, nvtx_context)
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_3')
import nvtx.plugins.tf as nvtx_tf
@nvtx_tf.ops.trace(message='Dense Block', domain_name='Forward',
grad_domain_name='Gradient', enabled=ENABLE_NVTX, trainable=True)
def dense_block(x):
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1')
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’)
return x
37
WORKING WITH SESSION/TF.ESTIMATOR
from nvtx.plugins.tf.estimator import NVTXHook
nvtx_callback = NVTXHook(skip_n_steps=1, name='Train’)
training_hooks=[]
training_hooks.append(nvtx_callback)
with tf.train.MonitoredSession(hooks=[nvtx_callback]) as sess:
tf.estimator.Estimator(hooks=training_hooks, ...)
38
PROFIING MPI PROCESS
To profile everything, putting the data into one file
To profile everything, putting the data into each node into a separated file
nsys [nsys options] mpirun [mpi options]
mpirun [mpi options] nsys profile [nsys options]
39
COMMAND EXAMPLES
1 GPU profiling
4 GPU profiling
NGC TensorFlow ResNet50-1.5
nsys profile -y 40 -d 20 -f true -w true -o tf_amp_4gpu 
-t osrt,nvtx,cuda,cublas,cudnn 
mpiexec --allow-run-as-root --bind-to socket -np 4 
python ./main.py --mode=training_benchmark --warmup_steps 200 
--num_iter 500 --iter_unit batch --results_dir=results --batch_size 64
nsys profile -y 40 -d 20 -f true -w true -o tf_amp 
-t osrt,nvtx,cuda,cublas,cudnn 
python ./main.py --mode=training_benchmark --warmup_steps 200 
--num_iter 500 --iter_unit batch --results_dir=results --batch_size 64
40
TF WITH NVTX
41
AMP ENABLED
42
MULTI-GPU
Profiling deep learning network using NVIDIA nsight systems

More Related Content

What's hot

GPGPU Seminar (GPU Accelerated Libraries, 3 of 3, Thrust)
GPGPU Seminar (GPU Accelerated Libraries, 3 of 3, Thrust) GPGPU Seminar (GPU Accelerated Libraries, 3 of 3, Thrust)
GPGPU Seminar (GPU Accelerated Libraries, 3 of 3, Thrust)
智啓 出川
 
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層
智啓 出川
 

What's hot (20)

NVIDIA Modulus: Physics ML 開発のためのフレームワーク
NVIDIA Modulus: Physics ML 開発のためのフレームワークNVIDIA Modulus: Physics ML 開発のためのフレームワーク
NVIDIA Modulus: Physics ML 開発のためのフレームワーク
 
「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正版)」
「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正版)」「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正版)」
「NVIDIA プロファイラを用いたPyTorch学習最適化手法のご紹介(修正版)」
 
Slurmのジョブスケジューリングと実装
Slurmのジョブスケジューリングと実装Slurmのジョブスケジューリングと実装
Slurmのジョブスケジューリングと実装
 
GPGPU Seminar (GPU Accelerated Libraries, 3 of 3, Thrust)
GPGPU Seminar (GPU Accelerated Libraries, 3 of 3, Thrust) GPGPU Seminar (GPU Accelerated Libraries, 3 of 3, Thrust)
GPGPU Seminar (GPU Accelerated Libraries, 3 of 3, Thrust)
 
20190625 OpenACC 講習会 第3部
20190625 OpenACC 講習会 第3部20190625 OpenACC 講習会 第3部
20190625 OpenACC 講習会 第3部
 
fpgax #11+TFUG ハード部:DNN専用ハードについて語る会-2019-02-02 MN-coreについて 金子 紘也
fpgax #11+TFUG ハード部:DNN専用ハードについて語る会-2019-02-02 MN-coreについて 金子 紘也fpgax #11+TFUG ハード部:DNN専用ハードについて語る会-2019-02-02 MN-coreについて 金子 紘也
fpgax #11+TFUG ハード部:DNN専用ハードについて語る会-2019-02-02 MN-coreについて 金子 紘也
 
HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?
 
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層
 
CXL_説明_公開用.pdf
CXL_説明_公開用.pdfCXL_説明_公開用.pdf
CXL_説明_公開用.pdf
 
PGI CUDA FortranとGPU最適化ライブラリの一連携法
PGI CUDA FortranとGPU最適化ライブラリの一連携法PGI CUDA FortranとGPU最適化ライブラリの一連携法
PGI CUDA FortranとGPU最適化ライブラリの一連携法
 
1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門1076: CUDAデバッグ・プロファイリング入門
1076: CUDAデバッグ・プロファイリング入門
 
CPU / GPU高速化セミナー!性能モデルの理論と実践:実践編
CPU / GPU高速化セミナー!性能モデルの理論と実践:実践編CPU / GPU高速化セミナー!性能モデルの理論と実践:実践編
CPU / GPU高速化セミナー!性能モデルの理論と実践:実践編
 
Chainer でのプロファイリングをちょっと楽にする話
Chainer でのプロファイリングをちょっと楽にする話Chainer でのプロファイリングをちょっと楽にする話
Chainer でのプロファイリングをちょっと楽にする話
 
Chainer で Tensor コア (fp16) を使いこなす
Chainer で Tensor コア (fp16) を使いこなすChainer で Tensor コア (fp16) を使いこなす
Chainer で Tensor コア (fp16) を使いこなす
 
Gpu vs fpga
Gpu vs fpgaGpu vs fpga
Gpu vs fpga
 
Magnum IO GPUDirect Storage 最新情報
Magnum IO GPUDirect Storage 最新情報Magnum IO GPUDirect Storage 最新情報
Magnum IO GPUDirect Storage 最新情報
 
開発者が語る NVIDIA cuQuantum SDK
開発者が語る NVIDIA cuQuantum SDK開発者が語る NVIDIA cuQuantum SDK
開発者が語る NVIDIA cuQuantum SDK
 
第 1 回 Jetson ユーザー勉強会
第 1 回 Jetson ユーザー勉強会第 1 回 Jetson ユーザー勉強会
第 1 回 Jetson ユーザー勉強会
 
いまさら聞けない!CUDA高速化入門
いまさら聞けない!CUDA高速化入門いまさら聞けない!CUDA高速化入門
いまさら聞けない!CUDA高速化入門
 
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
CPU / GPU高速化セミナー!性能モデルの理論と実践:理論編
 

Similar to Profiling deep learning network using NVIDIA nsight systems

Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
PeterAndreasEntschev
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 

Similar to Profiling deep learning network using NVIDIA nsight systems (20)

Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
 
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
RAPIDS Overview
RAPIDS OverviewRAPIDS Overview
RAPIDS Overview
 
Accelerating Data Science With GPUs
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUs
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
 
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsHow to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9
 
Nvidia at SEMICon, Munich
Nvidia at SEMICon, MunichNvidia at SEMICon, Munich
Nvidia at SEMICon, Munich
 
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Training course lect1
Training course lect1Training course lect1
Training course lect1
 

Recently uploaded

Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
Kamal Acharya
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdf
Kamal Acharya
 
Digital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfDigital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdf
AbrahamGadissa
 

Recently uploaded (20)

Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamKIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
 
Online blood donation management system project.pdf
Online blood donation management system project.pdfOnline blood donation management system project.pdf
Online blood donation management system project.pdf
 
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWINGBRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering Workshop
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdf
 
Pharmacy management system project report..pdf
Pharmacy management system project report..pdfPharmacy management system project report..pdf
Pharmacy management system project report..pdf
 
Construction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxConstruction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptx
 
shape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptxshape functions of 1D and 2 D rectangular elements.pptx
shape functions of 1D and 2 D rectangular elements.pptx
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
A case study of cinema management system project report..pdf
A case study of cinema management system project report..pdfA case study of cinema management system project report..pdf
A case study of cinema management system project report..pdf
 
Digital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdfDigital Signal Processing Lecture notes n.pdf
Digital Signal Processing Lecture notes n.pdf
 
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptxCloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
 
Online resume builder management system project report.pdf
Online resume builder management system project report.pdfOnline resume builder management system project report.pdf
Online resume builder management system project report.pdf
 
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical EngineeringIntroduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
Introduction to Machine Learning Unit-5 Notes for II-II Mechanical Engineering
 
Event Management System Vb Net Project Report.pdf
Event Management System Vb Net  Project Report.pdfEvent Management System Vb Net  Project Report.pdf
Event Management System Vb Net Project Report.pdf
 
Top 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering ScientistTop 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering Scientist
 

Profiling deep learning network using NVIDIA nsight systems

  • 1. Jack Han | Solutions Architect | jahan@nvidia.com PROFILING DEEP LEARNING NETWORK
  • 2. 2 AGENDA Profiling Deep Learning Network Introduction to Nsight Systems PyTorch NVTX Annotation Examples and Benefits Mixed Precision with BERT TensorFlow NVTX Plugins Plugin Usage Example
  • 3. 3 HOW TO SPEED-UP NETWORK Use the latest GPU and more GPU? Wait NVIDIA to release the new GPU or newer library? Optimize Neural Network Computing or Something Need to understand your network operation We are talking about this
  • 4. 4 PROFILING NEURAL NETWORK Profiling with cProfiler + Snakeviz python -m cProfile -o 100_percent_gpu_utilization.prof train.py snakeviz 100_percent_gpu_utilization.prof https://sagivtech.com/2017/09/19/optimizing-pytorch-training-code/ Useful?
  • 5. 5 TENSORFLOW PROFILER Show timeline and can trace network Still difficult to understand https://github.com/tensorflow/profiler-ui Useful?
  • 6. 6 NVIDIA PROFILING TOOLS CUPTI (CUDA Profiling Tools Interface) Nsight Systems Nsight Compute Nsight Visual Studio Edition Nsight Eclipse Edition NVIDIA Profiler (nvprof) NVIDIA Visual Profiler (nvvp)
  • 7. 7 NSIGHT PRODUCT FAMILY Standalone Performance Tools Nsight Systems system-wide application algorithm tuning Nsight Compute Debug/optimize specific CUDA kernel Nsight Graphics Debug/optimize specific graphics IDE plugins Nsight Eclipse Edicion/Visual Studio editor, debugger, some perf analysis
  • 8. 8 NSIGHT SYSTEMS Profile System-wide application Multi-process tree, GPU workload trace, etc Investigate your workload across multiple CPUs and GPUs CPU algorithms, utilization, and thread states GPU streams kernels, memory transfers, etc NVTX, CUDA & Library API, etc Ready for Big Data docker, user privilege (linux), cli, etc Overview
  • 9. 9 Thread/core migration Thread state Processes and threads CUDA and OpenGL API trace cuDNN and cuBLAS trase Kernel and memory transfer activities Multi-GPU
  • 11. 11 TRANSITIONING TO PROFILE A KERNEL Dive into kernel analysis
  • 12. 12 NVIDIA NSIGHT COMPUTE Interactive CUDA API debugging and kernel profiling Fast Data Collection Graphical and multiple kernel comparison reports Improved Workflow and Fully Customizable (Baselining, Programmable UI/Rules) Command Line, Standalone, IDE Integration Platform Support OS: Linux(x86,ARM), Windows, OSX (host only) GPUs: Pascal, Volta, Turing Next Generation Kernel Profiler Kernel Profile Comparisons with Baseline Metric Data Source Correlation
  • 13. 13 PROFILING GPU APPLICATION How to measure Focusing GPU Computing Low GPU Utilization Low SM Efficiency Low Achieved Occupancy Memory Bottleneck Instructions Bottleneck GPU Profiling CPU/GPU Tracing Application Tracing • Too few threads • Register limit • Large shared memory … • Cache misses • Bandwidth limit • Access pattern … • Arithmetic • Control flow … NVIDIA (Visual) Profiler / Nsight Compute NVIDIA Supports them with cuDNN, cuBLAS, and so on
  • 14. 14 GPU Profiling CPU/GPU Tracing Application Tracing PROFILING GPU APPLICATION How to measure Focusing System Operation Low GPU Utilization Low SM Efficiency Low Achieved Occupancy Memory Bottleneck Instructions Bottleneck CPU-Only Activities Memcopy Latency Kernel Launch Latency Job Startup / Checkpoints CPU Computation I/O Nsight System / Application Tracing
  • 15. 15 NSIGHT SYSTEMS PROFILE Profile with CLI $ nsys profile –t cuda,osrt,nvtx,cudnn,cublas -y 60 -d 20 –o baseline -f true –w true python main.py cuda – GPU kernel osrt – OS runtime nvtx – NVIDIA Tools Extension cudnn – CUDA Deep NN library cublas– CUDA BLAS library https://docs.nvidia.com/nsight-systems/#nsight_systems/2019.3.6-x86/06-cli-profiling.htm ßAPIs to be traced ßDelayed profile (sec) ßProfiling duration (sec) ßOutput filename ßOverwrite when it’s true ßDisplay ßExecution command
  • 16. 16 DISABLING ASYNCHRONOUS OPERATION Elimination of CPU interop effects export CUDA_LAUNCH_BLOCKING=1 https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution
  • 18. 18 BASELINE PROFILE MNIST Training: 89 sec, <5% utilization CPU waits on a semaphore and starves the GPU! GPU Starvation GPU Starvation
  • 19. 19 NVTX ANNOTATIONS def train(args, model, device, train_loader, optimizer, epoch): model.train() for batch_idx, (data, target) in enumerate(train_loader): data, target = data.to(device), target.to(device) optimizer.zero_grad() output = model(data) loss = F.nll_loss(output, target) https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
  • 20. 20 NVTX ANNOTATIONS import torch.cuda.nvtx as nvtx def train(args, model, device, train_loader, optimizer, epoch): model.train() for batch_idx, (data, target) in enumerate(train_loader): nvtx.range_push("Batch " + str(batch_idx)) nvtx.range_push("Copy to device") data, target = data.to(device), target.to(device) nvtx.range_pop() nvtx.range_push("Forward pass") optimizer.zero_grad() output = model(data) loss = F.nll_loss(output, target) nvtx.range_pop() nvtx.range_pop() https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
  • 21. 21 BASELINE PROFILE (WITH NVTX) GPU is idle during data loading Data is loaded using a single thread. This starves the GPU! fwd/bwd fwd/bwd 5.1ms
  • 22. 22 OPTIMIZE SOURCE CODE Data loader was configured to use 1 worker thread Let’s switch to using 8 worker threads: kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {} kwargs = {'num_workers’: 8, 'pin_memory': True} if use_cuda else {}
  • 23. 23 AFTER OPTIMIZATION Time for data loading reduced for each bath 60us Reduced from 5.1ms to 60us for batch 50
  • 25. 25 INITIAL PROFILE No NVTX Difficult to understand à no useful
  • 26. 26 loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions) ... if args.fp16: optimizer.backward(loss) else: loss.backward() if (step + 1) % args.gradient_accumulation_steps == 0: if args.fp16: # modify learning rate with special warm up BERT uses # if args.fp16 is False, BertAdam is used and handles this automatically lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion) for param_group in optimizer.param_groups: param_group['lr'] = lr_this_step optimizer.step() optimizer.zero_grad() global_step += 1 Foward Backward NVTX TAGGING
  • 27. 27 nvtx.range_push("Batch " + str(step)) nvtx.range_push("Forward pass") loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions) nvtx.range_pop() ... nvtx.range_push("Backward pass") if args.fp16: optimizer.backward(loss) else: loss.backward() if (step + 1) % args.gradient_accumulation_steps == 0: if args.fp16: # modify learning rate with special warm up BERT uses # if args.fp16 is False, BertAdam is used and handles this automatically lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion) for param_group in optimizer.param_groups: param_group['lr'] = lr_this_step optimizer.step() optimizer.zero_grad() global_step += 1 nvtx.range_pop() nvtx.range_pop() Foward Backward NVTX TAGGING
  • 28. 28 BERT FP32 BENCHMARK HuggingFace’s pretrained BERT Getting API List ~460 ms ~60% GEMM 4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
  • 29. 29 BERT FP16 BENCHMARK HuggingFace’s pretrained BERT Tensor Cores APIs ~222 ms 2.1x Speed up 4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
  • 30. 30 132 ms 8.5 ms APEX OPTIMIZER OPTIMIZATION APEX/csrc/adam_cuda_kernel() Low Utilization Higher Utilization FP32 FP16
  • 31. 31 PROFILING GPU APPLICATION You can investigate the performance bottleneck in each layer NVIDIA provides such inefficient layers / graphs optimization support For example, Like this, adam_cuda_kernel in APEX cudnnMultiHeadAttnForward() in cuDNN BERT speed-up Reminding cudnnMultiHeadAttnForward
  • 33. 33 NVTX WITH TENSORFLOW The TF_DISABLE_NVTX_RANGES variable enables and disables NVTX ranges in TensorFlow NVTX collect is enabled by default Available in NGC Container Having NVTX plugin for TF Import nvtx plugin Annotation in python code Get data from through Profiler
  • 34. 34 NVTX PLUGINS FOR DEEP LEARNING Allows users to add their own NVIDIA Tools Extension (NVTX) events and time ranges to a TensorFlow graph Ranges are added by wrapping regions of the computation graph with start and end operations Profiling TensorFlow for the graphs NVTX Context
  • 35. 35 SETUP Install NVTX plugins pip install nvtx-plugins-tf No need to install this plugin since NGC 19.08
  • 36. 36 HOW TO USE Adding Markers Function Detector import nvtx.plugins.tf as nvtx_tf x, nvtx_context = nvtx_tf.ops.start(x, message='Dense 1-2', domain_name='Forward’, grad_domain_name='Gradient’, enabled=ENABLE_NVTX, trainable=True) x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1') x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’) x = nvtx_tf.ops.end(x, nvtx_context) x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_3') import nvtx.plugins.tf as nvtx_tf @nvtx_tf.ops.trace(message='Dense Block', domain_name='Forward', grad_domain_name='Gradient', enabled=ENABLE_NVTX, trainable=True) def dense_block(x): x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1') x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’) return x
  • 37. 37 WORKING WITH SESSION/TF.ESTIMATOR from nvtx.plugins.tf.estimator import NVTXHook nvtx_callback = NVTXHook(skip_n_steps=1, name='Train’) training_hooks=[] training_hooks.append(nvtx_callback) with tf.train.MonitoredSession(hooks=[nvtx_callback]) as sess: tf.estimator.Estimator(hooks=training_hooks, ...)
  • 38. 38 PROFIING MPI PROCESS To profile everything, putting the data into one file To profile everything, putting the data into each node into a separated file nsys [nsys options] mpirun [mpi options] mpirun [mpi options] nsys profile [nsys options]
  • 39. 39 COMMAND EXAMPLES 1 GPU profiling 4 GPU profiling NGC TensorFlow ResNet50-1.5 nsys profile -y 40 -d 20 -f true -w true -o tf_amp_4gpu -t osrt,nvtx,cuda,cublas,cudnn mpiexec --allow-run-as-root --bind-to socket -np 4 python ./main.py --mode=training_benchmark --warmup_steps 200 --num_iter 500 --iter_unit batch --results_dir=results --batch_size 64 nsys profile -y 40 -d 20 -f true -w true -o tf_amp -t osrt,nvtx,cuda,cublas,cudnn python ./main.py --mode=training_benchmark --warmup_steps 200 --num_iter 500 --iter_unit batch --results_dir=results --batch_size 64