Profiling deep learning network using NVIDIA nsight systems

Jack (Jaegeun) Han
Jack (Jaegeun) HanSolutions Architect / Software Engineer at NVIDIA
Jack Han | Solutions Architect | jahan@nvidia.com
PROFILING
DEEP LEARNING NETWORK
2
AGENDA
Profiling Deep Learning Network
Introduction to Nsight Systems
PyTorch NVTX Annotation
Examples and Benefits
Mixed Precision with BERT
TensorFlow NVTX Plugins
Plugin Usage Example
3
HOW TO SPEED-UP NETWORK
Use the latest GPU and more GPU?
Wait NVIDIA to release the new GPU or newer library?
Optimize Neural Network Computing or Something
Need to understand your network operation
We are talking about this
4
PROFILING NEURAL NETWORK
Profiling with cProfiler + Snakeviz
python -m cProfile -o 100_percent_gpu_utilization.prof train.py
snakeviz 100_percent_gpu_utilization.prof
https://sagivtech.com/2017/09/19/optimizing-pytorch-training-code/
Useful?
5
TENSORFLOW PROFILER
Show timeline and can trace network
Still difficult to understand
https://github.com/tensorflow/profiler-ui
Useful?
6
NVIDIA PROFILING TOOLS
CUPTI (CUDA Profiling Tools Interface)
Nsight Systems
Nsight Compute
Nsight Visual Studio Edition
Nsight Eclipse Edition
NVIDIA Profiler (nvprof)
NVIDIA Visual Profiler (nvvp)
7
NSIGHT PRODUCT FAMILY
Standalone Performance Tools
Nsight Systems system-wide application algorithm tuning
Nsight Compute Debug/optimize specific CUDA kernel
Nsight Graphics Debug/optimize specific graphics
IDE plugins
Nsight Eclipse Edicion/Visual Studio editor, debugger, some perf analysis
8
NSIGHT SYSTEMS
Profile System-wide application
Multi-process tree, GPU workload trace, etc
Investigate your workload across multiple CPUs and GPUs
CPU algorithms, utilization, and thread states
GPU streams kernels, memory transfers, etc
NVTX, CUDA & Library API, etc
Ready for Big Data
docker, user privilege (linux), cli, etc
Overview
9
Thread/core
migration
Thread state
Processes and
threads
CUDA and OpenGL
API trace
cuDNN and
cuBLAS trase
Kernel and
memory transfer
activities
Multi-GPU
10
NVTX Tracining
11
TRANSITIONING TO PROFILE A KERNEL
Dive into kernel analysis
12
NVIDIA NSIGHT COMPUTE
Interactive CUDA API debugging and kernel profiling
Fast Data Collection
Graphical and multiple kernel comparison reports
Improved Workflow and Fully Customizable
(Baselining, Programmable UI/Rules)
Command Line, Standalone, IDE Integration
Platform Support
OS: Linux(x86,ARM), Windows, OSX (host only)
GPUs: Pascal, Volta, Turing
Next Generation Kernel Profiler
Kernel Profile Comparisons with Baseline
Metric Data
Source Correlation
13
PROFILING GPU APPLICATION
How to measure
Focusing GPU Computing
Low GPU
Utilization
Low SM
Efficiency
Low Achieved
Occupancy
Memory
Bottleneck
Instructions
Bottleneck
GPU Profiling
CPU/GPU
Tracing
Application
Tracing
• Too few threads
• Register limit
• Large shared
memory
…
• Cache misses
• Bandwidth limit
• Access pattern
…
• Arithmetic
• Control flow
…
NVIDIA (Visual) Profiler / Nsight Compute
NVIDIA Supports them with cuDNN, cuBLAS, and so on
14
GPU Profiling
CPU/GPU
Tracing
Application
Tracing
PROFILING GPU APPLICATION
How to measure
Focusing System Operation
Low GPU
Utilization
Low SM
Efficiency
Low Achieved
Occupancy
Memory
Bottleneck
Instructions
Bottleneck
CPU-Only
Activities
Memcopy
Latency
Kernel Launch
Latency
Job Startup /
Checkpoints
CPU
Computation
I/O
Nsight System / Application Tracing
15
NSIGHT SYSTEMS PROFILE
Profile with CLI
$ nsys profile –t cuda,osrt,nvtx,cudnn,cublas 
-y 60 
-d 20 
–o baseline 
-f true 
–w true 
python main.py
cuda – GPU kernel
osrt – OS runtime
nvtx – NVIDIA Tools Extension
cudnn – CUDA Deep NN library
cublas– CUDA BLAS library
https://docs.nvidia.com/nsight-systems/#nsight_systems/2019.3.6-x86/06-cli-profiling.htm
ßAPIs to be traced
ßDelayed profile (sec)
ßProfiling duration (sec)
ßOutput filename
ßOverwrite when it’s true
ßDisplay
ßExecution command
16
DISABLING ASYNCHRONOUS OPERATION
Elimination of CPU interop effects
export CUDA_LAUNCH_BLOCKING=1
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution
17
PROFILING
PYTORCH MODEL
18
BASELINE PROFILE
MNIST Training: 89 sec, <5% utilization
CPU waits on a semaphore and starves the GPU!
GPU Starvation GPU Starvation
19
NVTX ANNOTATIONS
def train(args, model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
20
NVTX ANNOTATIONS
import torch.cuda.nvtx as nvtx
def train(args, model, device, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
nvtx.range_push("Batch " + str(batch_idx))
nvtx.range_push("Copy to device")
data, target = data.to(device), target.to(device)
nvtx.range_pop()
nvtx.range_push("Forward pass")
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
nvtx.range_pop()
nvtx.range_pop()
https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
21
BASELINE PROFILE (WITH NVTX)
GPU is idle during data loading
Data is loaded using a single thread. This starves the GPU!
fwd/bwd fwd/bwd
5.1ms
22
OPTIMIZE SOURCE CODE
Data loader was configured to use 1 worker thread
Let’s switch to using 8 worker threads:
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
kwargs = {'num_workers’: 8, 'pin_memory': True} if use_cuda else {}
23
AFTER OPTIMIZATION
Time for data loading reduced for each bath
60us Reduced from 5.1ms to 60us for batch 50
24
REAL EXAMPLE
https://news.developer.nvidia.com/nvidia-achieves-4x-speedup-on-bert-neural-network/
How?
25
INITIAL PROFILE
No NVTX
Difficult to understand à no useful
26
loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
...
if args.fp16:
optimizer.backward(loss)
else:
loss.backward()
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
# modify learning rate with special warm up BERT uses
# if args.fp16 is False, BertAdam is used and handles this automatically
lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
for param_group in optimizer.param_groups:
param_group['lr'] = lr_this_step
optimizer.step()
optimizer.zero_grad()
global_step += 1
Foward
Backward
NVTX TAGGING
27
nvtx.range_push("Batch " + str(step))
nvtx.range_push("Forward pass")
loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
nvtx.range_pop()
...
nvtx.range_push("Backward pass")
if args.fp16:
optimizer.backward(loss)
else:
loss.backward()
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
# modify learning rate with special warm up BERT uses
# if args.fp16 is False, BertAdam is used and handles this automatically
lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion)
for param_group in optimizer.param_groups:
param_group['lr'] = lr_this_step
optimizer.step()
optimizer.zero_grad()
global_step += 1
nvtx.range_pop()
nvtx.range_pop()
Foward
Backward
NVTX TAGGING
28
BERT FP32 BENCHMARK
HuggingFace’s pretrained BERT
Getting API List
~460 ms
~60% GEMM
4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
29
BERT FP16 BENCHMARK
HuggingFace’s pretrained BERT
Tensor Cores
APIs
~222 ms
2.1x Speed up
4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
30
132 ms
8.5 ms
APEX OPTIMIZER OPTIMIZATION
APEX/csrc/adam_cuda_kernel()
Low Utilization
Higher Utilization
FP32
FP16
31
PROFILING GPU APPLICATION
You can investigate the performance bottleneck in each layer
NVIDIA provides such inefficient layers / graphs optimization support
For example,
Like this, adam_cuda_kernel in APEX
cudnnMultiHeadAttnForward() in cuDNN
BERT speed-up
Reminding
cudnnMultiHeadAttnForward
32
PROFILING
TENSORFLOW GRAPHS
33
NVTX WITH TENSORFLOW
The TF_DISABLE_NVTX_RANGES variable enables
and disables NVTX ranges in TensorFlow
NVTX collect is enabled by default
Available in NGC Container
Having NVTX plugin for TF
Import nvtx plugin
Annotation in python code
Get data from through Profiler
34
NVTX PLUGINS FOR DEEP LEARNING
Allows users to add their own NVIDIA Tools Extension (NVTX) events and time ranges to a
TensorFlow graph
Ranges are added by wrapping regions of the computation graph with start and end
operations
Profiling TensorFlow for the graphs
NVTX Context
35
SETUP
Install NVTX plugins
pip install nvtx-plugins-tf
No need to install this plugin since NGC 19.08
36
HOW TO USE
Adding Markers
Function Detector
import nvtx.plugins.tf as nvtx_tf
x, nvtx_context = nvtx_tf.ops.start(x, message='Dense 1-2', 
domain_name='Forward’, grad_domain_name='Gradient’, enabled=ENABLE_NVTX, trainable=True)
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1')
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’)
x = nvtx_tf.ops.end(x, nvtx_context)
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_3')
import nvtx.plugins.tf as nvtx_tf
@nvtx_tf.ops.trace(message='Dense Block', domain_name='Forward',
grad_domain_name='Gradient', enabled=ENABLE_NVTX, trainable=True)
def dense_block(x):
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1')
x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’)
return x
37
WORKING WITH SESSION/TF.ESTIMATOR
from nvtx.plugins.tf.estimator import NVTXHook
nvtx_callback = NVTXHook(skip_n_steps=1, name='Train’)
training_hooks=[]
training_hooks.append(nvtx_callback)
with tf.train.MonitoredSession(hooks=[nvtx_callback]) as sess:
tf.estimator.Estimator(hooks=training_hooks, ...)
38
PROFIING MPI PROCESS
To profile everything, putting the data into one file
To profile everything, putting the data into each node into a separated file
nsys [nsys options] mpirun [mpi options]
mpirun [mpi options] nsys profile [nsys options]
39
COMMAND EXAMPLES
1 GPU profiling
4 GPU profiling
NGC TensorFlow ResNet50-1.5
nsys profile -y 40 -d 20 -f true -w true -o tf_amp_4gpu 
-t osrt,nvtx,cuda,cublas,cudnn 
mpiexec --allow-run-as-root --bind-to socket -np 4 
python ./main.py --mode=training_benchmark --warmup_steps 200 
--num_iter 500 --iter_unit batch --results_dir=results --batch_size 64
nsys profile -y 40 -d 20 -f true -w true -o tf_amp 
-t osrt,nvtx,cuda,cublas,cudnn 
python ./main.py --mode=training_benchmark --warmup_steps 200 
--num_iter 500 --iter_unit batch --results_dir=results --batch_size 64
40
TF WITH NVTX
41
AMP ENABLED
42
MULTI-GPU
Profiling deep learning network using NVIDIA nsight systems
1 of 43

Recommended

データ爆発時代のネットワークインフラ by
データ爆発時代のネットワークインフラデータ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラNVIDIA Japan
677 views26 slides
The Deep Learning Compiler by
The Deep Learning CompilerThe Deep Learning Compiler
The Deep Learning CompilerTae Young Lee
110 views17 slides
HPC+AI ってよく聞くけど結局なんなの by
HPC+AI ってよく聞くけど結局なんなのHPC+AI ってよく聞くけど結局なんなの
HPC+AI ってよく聞くけど結局なんなのNVIDIA Japan
714 views69 slides
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio... by
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio..."Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...Edge AI and Vision Alliance
1.7K views35 slides
マルチコアのプログラミング技法 -- OpenCLとWebCL by
マルチコアのプログラミング技法 -- OpenCLとWebCLマルチコアのプログラミング技法 -- OpenCLとWebCL
マルチコアのプログラミング技法 -- OpenCLとWebCLmaruyama097
7.7K views285 slides
Kubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-cluster by
Kubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-clusterKubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-cluster
Kubernetes meetup-tokyo-13-customizing-kubernetes-for-ml-clusterPreferred Networks
3.6K views37 slides

More Related Content

What's hot

2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール by
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール智啓 出川
1.6K views50 slides
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層 by
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層智啓 出川
3.2K views104 slides
Hopper アーキテクチャで、変わること、変わらないこと by
Hopper アーキテクチャで、変わること、変わらないことHopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないことNVIDIA Japan
1.2K views60 slides
GPU - Basic Working by
GPU - Basic WorkingGPU - Basic Working
GPU - Basic WorkingNived R Nambiar
1.5K views33 slides
プログラマ目線から見たRDMAのメリットと その応用例について by
プログラマ目線から見たRDMAのメリットとその応用例についてプログラマ目線から見たRDMAのメリットとその応用例について
プログラマ目線から見たRDMAのメリットと その応用例についてMasanori Itoh
5.3K views54 slides
Gnn overview by
Gnn overviewGnn overview
Gnn overviewLouis (Yufeng) Wang
881 views28 slides

What's hot(20)

2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール by 智啓 出川
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
2015年度GPGPU実践プログラミング 第6回 パフォーマンス解析ツール
智啓 出川1.6K views
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層 by 智啓 出川
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層
2015年度GPGPU実践プログラミング 第5回 GPUのメモリ階層
智啓 出川3.2K views
Hopper アーキテクチャで、変わること、変わらないこと by NVIDIA Japan
Hopper アーキテクチャで、変わること、変わらないことHopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないこと
NVIDIA Japan1.2K views
プログラマ目線から見たRDMAのメリットと その応用例について by Masanori Itoh
プログラマ目線から見たRDMAのメリットとその応用例についてプログラマ目線から見たRDMAのメリットとその応用例について
プログラマ目線から見たRDMAのメリットと その応用例について
Masanori Itoh5.3K views
1070: CUDA プログラミング入門 by NVIDIA Japan
1070: CUDA プログラミング入門1070: CUDA プログラミング入門
1070: CUDA プログラミング入門
NVIDIA Japan7.6K views
PGI CUDA FortranとGPU最適化ライブラリの一連携法 by 智啓 出川
PGI CUDA FortranとGPU最適化ライブラリの一連携法PGI CUDA FortranとGPU最適化ライブラリの一連携法
PGI CUDA FortranとGPU最適化ライブラリの一連携法
智啓 出川2.5K views
Magnum IO GPUDirect Storage 最新情報 by NVIDIA Japan
Magnum IO GPUDirect Storage 最新情報Magnum IO GPUDirect Storage 最新情報
Magnum IO GPUDirect Storage 最新情報
NVIDIA Japan1.3K views
CUDA 6の話@関西GPGPU勉強会#5 by Yosuke Onoue
CUDA 6の話@関西GPGPU勉強会#5CUDA 6の話@関西GPGPU勉強会#5
CUDA 6の話@関西GPGPU勉強会#5
Yosuke Onoue3.9K views
2015年度先端GPGPUシミュレーション工学特論 第2回 GPUによる並列計算の概念と メモリアクセス by 智啓 出川
2015年度先端GPGPUシミュレーション工学特論 第2回 GPUによる並列計算の概念とメモリアクセス2015年度先端GPGPUシミュレーション工学特論 第2回 GPUによる並列計算の概念とメモリアクセス
2015年度先端GPGPUシミュレーション工学特論 第2回 GPUによる並列計算の概念と メモリアクセス
智啓 出川1.8K views
第13回 配信講義 計算科学技術特論B(2022) by RCCSRENKEI
第13回 配信講義 計算科学技術特論B(2022)第13回 配信講義 計算科学技術特論B(2022)
第13回 配信講義 計算科学技術特論B(2022)
RCCSRENKEI188 views
GPGPU Computation by jtsagata
GPGPU ComputationGPGPU Computation
GPGPU Computation
jtsagata293 views
続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2 by Preferred Networks
続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2
続・PFN のオンプレML基盤の取り組み / オンプレML基盤 on Kubernetes 〜PFN、ヤフー〜 #2
Preferred Networks1.5K views
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes by inside-BigData.com
Benefits of Multi-rail Cluster Architectures for GPU-based NodesBenefits of Multi-rail Cluster Architectures for GPU-based Nodes
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes
inside-BigData.com1.1K views
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is... by Preferred Networks
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Preferred Networks9.2K views
IIBMP2019 講演資料「オープンソースで始める深層学習」 by Preferred Networks
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
Preferred Networks2.8K views
GPGPU Seminar (GPU Accelerated Libraries, 1 of 3, cuBLAS) by 智啓 出川
GPGPU Seminar (GPU Accelerated Libraries, 1 of 3, cuBLAS) GPGPU Seminar (GPU Accelerated Libraries, 1 of 3, cuBLAS)
GPGPU Seminar (GPU Accelerated Libraries, 1 of 3, cuBLAS)
智啓 出川2.6K views

Similar to Profiling deep learning network using NVIDIA nsight systems

Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS by
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSDatabricks
2K views38 slides
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a... by
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...E-Commerce Brasil
11 views70 slides
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf by
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfDLow6
11 views21 slides
Hardware & Software Platforms for HPC, AI and ML by
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
1.3K views49 slides
CUDA-Python and RAPIDS for blazing fast scientific computing by
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
1K views82 slides
Volta (Tesla V100) の紹介 by
Volta (Tesla V100) の紹介Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介NVIDIA Japan
3.9K views30 slides

Similar to Profiling deep learning network using NVIDIA nsight systems(20)

Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS by Databricks
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Databricks2K views
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a... by E-Commerce Brasil
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf by DLow6
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdfS51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
S51281 - Accelerate Data Science in Python with RAPIDS_1679330128290001YmT7.pdf
DLow611 views
Hardware & Software Platforms for HPC, AI and ML by inside-BigData.com
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
inside-BigData.com1.3K views
CUDA-Python and RAPIDS for blazing fast scientific computing by inside-BigData.com
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
Volta (Tesla V100) の紹介 by NVIDIA Japan
Volta (Tesla V100) の紹介Volta (Tesla V100) の紹介
Volta (Tesla V100) の紹介
NVIDIA Japan3.9K views
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS by PeterAndreasEntschev
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Accelerating Data Science With GPUs by iguazio
Accelerating Data Science With GPUsAccelerating Data Science With GPUs
Accelerating Data Science With GPUs
iguazio899 views
PGI Compilers & Tools Update- March 2018 by NVIDIA
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
NVIDIA3.8K views
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs by Altoros
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsHow to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
Altoros2.7K views
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc... by Chris Fregly
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
Chris Fregly9.1K views
Vpu technology &gpgpu computing by Arka Ghosh
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh1.1K views
Vpu technology &gpgpu computing by Arka Ghosh
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh1 view
Vpu technology &gpgpu computing by Arka Ghosh
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh441 views
Vpu technology &gpgpu computing by Arka Ghosh
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh396 views
Training course lect1 by Noor Dhiya
Training course lect1Training course lect1
Training course lect1
Noor Dhiya15 views

Recently uploaded

SWM L1-L14_drhasan (Part 1).pdf by
SWM L1-L14_drhasan (Part 1).pdfSWM L1-L14_drhasan (Part 1).pdf
SWM L1-L14_drhasan (Part 1).pdfMahmudHasan747870
48 views150 slides
K8S Roadmap.pdf by
K8S Roadmap.pdfK8S Roadmap.pdf
K8S Roadmap.pdfMaryamTavakkoli2
6 views1 slide
Activated sludge process .pdf by
Activated sludge process .pdfActivated sludge process .pdf
Activated sludge process .pdf8832RafiyaAltaf
8 views32 slides
NEW SUPPLIERS SUPPLIES (copie).pdf by
NEW SUPPLIERS SUPPLIES (copie).pdfNEW SUPPLIERS SUPPLIES (copie).pdf
NEW SUPPLIERS SUPPLIES (copie).pdfgeorgesradjou
14 views30 slides
Wire Rope by
Wire RopeWire Rope
Wire RopeIwiss Tools Co.,Ltd
9 views5 slides
Saikat Chakraborty Java Oracle Certificate.pdf by
Saikat Chakraborty Java Oracle Certificate.pdfSaikat Chakraborty Java Oracle Certificate.pdf
Saikat Chakraborty Java Oracle Certificate.pdfSaikatChakraborty787148
15 views1 slide

Recently uploaded(20)

NEW SUPPLIERS SUPPLIES (copie).pdf by georgesradjou
NEW SUPPLIERS SUPPLIES (copie).pdfNEW SUPPLIERS SUPPLIES (copie).pdf
NEW SUPPLIERS SUPPLIES (copie).pdf
georgesradjou14 views
Informed search algorithms.pptx by Dr.Shweta
Informed search algorithms.pptxInformed search algorithms.pptx
Informed search algorithms.pptx
Dr.Shweta16 views
Machine Element II Course outline.pdf by odatadese1
Machine Element II Course outline.pdfMachine Element II Course outline.pdf
Machine Element II Course outline.pdf
odatadese18 views
Dynamics of Hard-Magnetic Soft Materials by Shivendra Nandan
Dynamics of Hard-Magnetic Soft MaterialsDynamics of Hard-Magnetic Soft Materials
Dynamics of Hard-Magnetic Soft Materials
Shivendra Nandan14 views
A multi-microcontroller-based hardware for deploying Tiny machine learning mo... by IJECEIAES
A multi-microcontroller-based hardware for deploying Tiny machine learning mo...A multi-microcontroller-based hardware for deploying Tiny machine learning mo...
A multi-microcontroller-based hardware for deploying Tiny machine learning mo...
IJECEIAES13 views
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,... by AakashShakya12
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
AakashShakya1263 views
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th... by ahmedmesaiaoun
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...
ahmedmesaiaoun12 views
An approach of ontology and knowledge base for railway maintenance by IJECEIAES
An approach of ontology and knowledge base for railway maintenanceAn approach of ontology and knowledge base for railway maintenance
An approach of ontology and knowledge base for railway maintenance
IJECEIAES12 views

Profiling deep learning network using NVIDIA nsight systems

  • 1. Jack Han | Solutions Architect | jahan@nvidia.com PROFILING DEEP LEARNING NETWORK
  • 2. 2 AGENDA Profiling Deep Learning Network Introduction to Nsight Systems PyTorch NVTX Annotation Examples and Benefits Mixed Precision with BERT TensorFlow NVTX Plugins Plugin Usage Example
  • 3. 3 HOW TO SPEED-UP NETWORK Use the latest GPU and more GPU? Wait NVIDIA to release the new GPU or newer library? Optimize Neural Network Computing or Something Need to understand your network operation We are talking about this
  • 4. 4 PROFILING NEURAL NETWORK Profiling with cProfiler + Snakeviz python -m cProfile -o 100_percent_gpu_utilization.prof train.py snakeviz 100_percent_gpu_utilization.prof https://sagivtech.com/2017/09/19/optimizing-pytorch-training-code/ Useful?
  • 5. 5 TENSORFLOW PROFILER Show timeline and can trace network Still difficult to understand https://github.com/tensorflow/profiler-ui Useful?
  • 6. 6 NVIDIA PROFILING TOOLS CUPTI (CUDA Profiling Tools Interface) Nsight Systems Nsight Compute Nsight Visual Studio Edition Nsight Eclipse Edition NVIDIA Profiler (nvprof) NVIDIA Visual Profiler (nvvp)
  • 7. 7 NSIGHT PRODUCT FAMILY Standalone Performance Tools Nsight Systems system-wide application algorithm tuning Nsight Compute Debug/optimize specific CUDA kernel Nsight Graphics Debug/optimize specific graphics IDE plugins Nsight Eclipse Edicion/Visual Studio editor, debugger, some perf analysis
  • 8. 8 NSIGHT SYSTEMS Profile System-wide application Multi-process tree, GPU workload trace, etc Investigate your workload across multiple CPUs and GPUs CPU algorithms, utilization, and thread states GPU streams kernels, memory transfers, etc NVTX, CUDA & Library API, etc Ready for Big Data docker, user privilege (linux), cli, etc Overview
  • 9. 9 Thread/core migration Thread state Processes and threads CUDA and OpenGL API trace cuDNN and cuBLAS trase Kernel and memory transfer activities Multi-GPU
  • 11. 11 TRANSITIONING TO PROFILE A KERNEL Dive into kernel analysis
  • 12. 12 NVIDIA NSIGHT COMPUTE Interactive CUDA API debugging and kernel profiling Fast Data Collection Graphical and multiple kernel comparison reports Improved Workflow and Fully Customizable (Baselining, Programmable UI/Rules) Command Line, Standalone, IDE Integration Platform Support OS: Linux(x86,ARM), Windows, OSX (host only) GPUs: Pascal, Volta, Turing Next Generation Kernel Profiler Kernel Profile Comparisons with Baseline Metric Data Source Correlation
  • 13. 13 PROFILING GPU APPLICATION How to measure Focusing GPU Computing Low GPU Utilization Low SM Efficiency Low Achieved Occupancy Memory Bottleneck Instructions Bottleneck GPU Profiling CPU/GPU Tracing Application Tracing • Too few threads • Register limit • Large shared memory … • Cache misses • Bandwidth limit • Access pattern … • Arithmetic • Control flow … NVIDIA (Visual) Profiler / Nsight Compute NVIDIA Supports them with cuDNN, cuBLAS, and so on
  • 14. 14 GPU Profiling CPU/GPU Tracing Application Tracing PROFILING GPU APPLICATION How to measure Focusing System Operation Low GPU Utilization Low SM Efficiency Low Achieved Occupancy Memory Bottleneck Instructions Bottleneck CPU-Only Activities Memcopy Latency Kernel Launch Latency Job Startup / Checkpoints CPU Computation I/O Nsight System / Application Tracing
  • 15. 15 NSIGHT SYSTEMS PROFILE Profile with CLI $ nsys profile –t cuda,osrt,nvtx,cudnn,cublas -y 60 -d 20 –o baseline -f true –w true python main.py cuda – GPU kernel osrt – OS runtime nvtx – NVIDIA Tools Extension cudnn – CUDA Deep NN library cublas– CUDA BLAS library https://docs.nvidia.com/nsight-systems/#nsight_systems/2019.3.6-x86/06-cli-profiling.htm ßAPIs to be traced ßDelayed profile (sec) ßProfiling duration (sec) ßOutput filename ßOverwrite when it’s true ßDisplay ßExecution command
  • 16. 16 DISABLING ASYNCHRONOUS OPERATION Elimination of CPU interop effects export CUDA_LAUNCH_BLOCKING=1 https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#asynchronous-concurrent-execution
  • 18. 18 BASELINE PROFILE MNIST Training: 89 sec, <5% utilization CPU waits on a semaphore and starves the GPU! GPU Starvation GPU Starvation
  • 19. 19 NVTX ANNOTATIONS def train(args, model, device, train_loader, optimizer, epoch): model.train() for batch_idx, (data, target) in enumerate(train_loader): data, target = data.to(device), target.to(device) optimizer.zero_grad() output = model(data) loss = F.nll_loss(output, target) https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
  • 20. 20 NVTX ANNOTATIONS import torch.cuda.nvtx as nvtx def train(args, model, device, train_loader, optimizer, epoch): model.train() for batch_idx, (data, target) in enumerate(train_loader): nvtx.range_push("Batch " + str(batch_idx)) nvtx.range_push("Copy to device") data, target = data.to(device), target.to(device) nvtx.range_pop() nvtx.range_push("Forward pass") optimizer.zero_grad() output = model(data) loss = F.nll_loss(output, target) nvtx.range_pop() nvtx.range_pop() https://pytorch.org/docs/stable/_modules/torch/cuda/nvtx.html
  • 21. 21 BASELINE PROFILE (WITH NVTX) GPU is idle during data loading Data is loaded using a single thread. This starves the GPU! fwd/bwd fwd/bwd 5.1ms
  • 22. 22 OPTIMIZE SOURCE CODE Data loader was configured to use 1 worker thread Let’s switch to using 8 worker threads: kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {} kwargs = {'num_workers’: 8, 'pin_memory': True} if use_cuda else {}
  • 23. 23 AFTER OPTIMIZATION Time for data loading reduced for each bath 60us Reduced from 5.1ms to 60us for batch 50
  • 25. 25 INITIAL PROFILE No NVTX Difficult to understand à no useful
  • 26. 26 loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions) ... if args.fp16: optimizer.backward(loss) else: loss.backward() if (step + 1) % args.gradient_accumulation_steps == 0: if args.fp16: # modify learning rate with special warm up BERT uses # if args.fp16 is False, BertAdam is used and handles this automatically lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion) for param_group in optimizer.param_groups: param_group['lr'] = lr_this_step optimizer.step() optimizer.zero_grad() global_step += 1 Foward Backward NVTX TAGGING
  • 27. 27 nvtx.range_push("Batch " + str(step)) nvtx.range_push("Forward pass") loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions) nvtx.range_pop() ... nvtx.range_push("Backward pass") if args.fp16: optimizer.backward(loss) else: loss.backward() if (step + 1) % args.gradient_accumulation_steps == 0: if args.fp16: # modify learning rate with special warm up BERT uses # if args.fp16 is False, BertAdam is used and handles this automatically lr_this_step = args.learning_rate * warmup_linear.get_lr(global_step, args.warmup_proportion) for param_group in optimizer.param_groups: param_group['lr'] = lr_this_step optimizer.step() optimizer.zero_grad() global_step += 1 nvtx.range_pop() nvtx.range_pop() Foward Backward NVTX TAGGING
  • 28. 28 BERT FP32 BENCHMARK HuggingFace’s pretrained BERT Getting API List ~460 ms ~60% GEMM 4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
  • 29. 29 BERT FP16 BENCHMARK HuggingFace’s pretrained BERT Tensor Cores APIs ~222 ms 2.1x Speed up 4 V100 GPUs w/ NVLINK, Batch size: 32, max_seq_length: 512
  • 30. 30 132 ms 8.5 ms APEX OPTIMIZER OPTIMIZATION APEX/csrc/adam_cuda_kernel() Low Utilization Higher Utilization FP32 FP16
  • 31. 31 PROFILING GPU APPLICATION You can investigate the performance bottleneck in each layer NVIDIA provides such inefficient layers / graphs optimization support For example, Like this, adam_cuda_kernel in APEX cudnnMultiHeadAttnForward() in cuDNN BERT speed-up Reminding cudnnMultiHeadAttnForward
  • 33. 33 NVTX WITH TENSORFLOW The TF_DISABLE_NVTX_RANGES variable enables and disables NVTX ranges in TensorFlow NVTX collect is enabled by default Available in NGC Container Having NVTX plugin for TF Import nvtx plugin Annotation in python code Get data from through Profiler
  • 34. 34 NVTX PLUGINS FOR DEEP LEARNING Allows users to add their own NVIDIA Tools Extension (NVTX) events and time ranges to a TensorFlow graph Ranges are added by wrapping regions of the computation graph with start and end operations Profiling TensorFlow for the graphs NVTX Context
  • 35. 35 SETUP Install NVTX plugins pip install nvtx-plugins-tf No need to install this plugin since NGC 19.08
  • 36. 36 HOW TO USE Adding Markers Function Detector import nvtx.plugins.tf as nvtx_tf x, nvtx_context = nvtx_tf.ops.start(x, message='Dense 1-2', domain_name='Forward’, grad_domain_name='Gradient’, enabled=ENABLE_NVTX, trainable=True) x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1') x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’) x = nvtx_tf.ops.end(x, nvtx_context) x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_3') import nvtx.plugins.tf as nvtx_tf @nvtx_tf.ops.trace(message='Dense Block', domain_name='Forward', grad_domain_name='Gradient', enabled=ENABLE_NVTX, trainable=True) def dense_block(x): x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_1') x = tf.layers.dense(x, 1000, activation=tf.nn.relu, name='dense_2’) return x
  • 37. 37 WORKING WITH SESSION/TF.ESTIMATOR from nvtx.plugins.tf.estimator import NVTXHook nvtx_callback = NVTXHook(skip_n_steps=1, name='Train’) training_hooks=[] training_hooks.append(nvtx_callback) with tf.train.MonitoredSession(hooks=[nvtx_callback]) as sess: tf.estimator.Estimator(hooks=training_hooks, ...)
  • 38. 38 PROFIING MPI PROCESS To profile everything, putting the data into one file To profile everything, putting the data into each node into a separated file nsys [nsys options] mpirun [mpi options] mpirun [mpi options] nsys profile [nsys options]
  • 39. 39 COMMAND EXAMPLES 1 GPU profiling 4 GPU profiling NGC TensorFlow ResNet50-1.5 nsys profile -y 40 -d 20 -f true -w true -o tf_amp_4gpu -t osrt,nvtx,cuda,cublas,cudnn mpiexec --allow-run-as-root --bind-to socket -np 4 python ./main.py --mode=training_benchmark --warmup_steps 200 --num_iter 500 --iter_unit batch --results_dir=results --batch_size 64 nsys profile -y 40 -d 20 -f true -w true -o tf_amp -t osrt,nvtx,cuda,cublas,cudnn python ./main.py --mode=training_benchmark --warmup_steps 200 --num_iter 500 --iter_unit batch --results_dir=results --batch_size 64