Performance Tools for
Computer Vision Applications
@denkiwakame
1
2018/12/15 Computer Vision Study Group @ Kanto
Performance Tools for CV - Agenda
● NVIDIA GPU Profiler
○ nvprof
○ nvvp
○ NVIDIA Nsight Systems
● Tensorflow / Keras
○ tf.timeline
● Others
○ perf, gperftools, ….
○ cProfile, yep, ...
2
_人人人人人人人人人人人人人人_
> This talk focuses on GPUs <
 ̄Y^Y^Y^Y^Y^Y^Y^Y^Y^Y^Y^Y ̄
What is Profiling?
WHY DO YOU NEED TO PROFILE YOUR APPLICATION?
3
What is Profiling?
● Measuring the performance of an application
● Simple Profiling
○ Measure how long each part takes (e.g., by inserting timers; a minimal sketch follows this slide)
● Advanced Profiling
○ Analyze why a given part is slow (requires dedicated tools)
4
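For the simple case, here is a minimal host-side timing sketch in Python; preprocess and run_inference are hypothetical stand-ins for your own stages. Keep in mind that CUDA launches are asynchronous, so for GPU work you must synchronize before reading a host-side timer.

import time

def preprocess():
    time.sleep(0.01)   # hypothetical stage; stands in for real preprocessing

def run_inference():
    time.sleep(0.05)   # hypothetical stage; stands in for the real model call

t0 = time.perf_counter()
preprocess()
t1 = time.perf_counter()
run_inference()
t2 = time.perf_counter()
print(f"preprocess: {(t1 - t0) * 1e3:.1f} ms")
print(f"inference:  {(t2 - t1) * 1e3:.1f} ms")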
nvprof: NVIDIA PROFILER
5
nvprof
● Command-line profiler
○ /usr/local/cuda/bin/nvprof
● Usage
6
$ nvprof [nvprof-options] <app> [arguments]
==17126== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 28.09% 9.6689ms 32 302.15us 231.69us 544.94us maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt
8.13% 2.7997ms 34 82.344us 42.145us 121.09us maxwell_scudnn_128x128_relu_interior_nn
7.46% 2.5673ms 72 35.657us 4.6720us 569.87us normalize_kernel(int, float*, float*, float*, int, int, int)
7.01% 2.4122ms 104 23.194us 1.7600us 243.82us copy_kernel(int, float*, int, int, float*, int, int)
6.96% 2.3956ms 116 20.651us 1.1840us 242.98us activate_array_kernel(float*, int, ACTIVATION)
6.46% 2.2237ms 75 29.649us 3.0400us 301.03us add_bias_kernel(float*, float*, int, int, int)
6.24% 2.1489ms 72 29.846us 3.3600us 299.91us scale_bias_kernel(float*, float*, int, int)
5.98% 2.0587ms 184 11.188us 1.4080us 112.64us fill_kernel(int, float, float*, int)
5.87% 2.0187ms 3 672.90us 28.960us 1.8676ms [CUDA memcpy DtoH]
5.16% 1.7760ms 4 444.00us 414.16us 516.91us maxwell_scudnn_128x128_relu_small_nn
5.10% 1.7553ms 32 54.854us 7.2320us 163.24us void cudnn::winograd::generateWinogradTilesKernel<int=0, float,
float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
Four profiling modes
● summary mode (default)
$ nvprof <application>
● trace mode
$ nvprof --print-gpu-trace --print-api-trace
(--print-gpu-trace: every activity that occurs on the GPU; --print-api-trace: CUDA Runtime API + Driver API calls)
● event/metric summary mode
$ nvprof --events <event-name> --metrics <metric-name>
● event/metric trace mode
$ nvprof --aggregate-mode off [event|metric]
7
nvprof
● summary mode (default)
==17126== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 28.09% 9.6689ms 32 302.15us 231.69us 544.94us maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt
8.13% 2.7997ms 34 82.344us 42.145us 121.09us maxwell_scudnn_128x128_relu_interior_nn
7.46% 2.5673ms 72 35.657us 4.6720us 569.87us normalize_kernel(int, float*, float*, float*, int, int, int)
7.01% 2.4122ms 104 23.194us 1.7600us 243.82us copy_kernel(int, float*, int, int, float*, int, int)
6.96% 2.3956ms 116 20.651us 1.1840us 242.98us activate_array_kernel(float*, int, ACTIVATION)
6.46% 2.2237ms 75 29.649us 3.0400us 301.03us add_bias_kernel(float*, float*, int, int, int)
6.24% 2.1489ms 72 29.846us 3.3600us 299.91us scale_bias_kernel(float*, float*, int, int)
5.98% 2.0587ms 184 11.188us 1.4080us 112.64us fill_kernel(int, float, float*, int)
5.87% 2.0187ms 3 672.90us 28.960us 1.8676ms [CUDA memcpy DtoH]
5.16% 1.7760ms 4 444.00us 414.16us 516.91us maxwell_scudnn_128x128_relu_small_nn
5.10% 1.7553ms 32 54.854us 7.2320us 163.24us void cudnn::winograd::generateWinogradTilesKernel<int=0, float,
float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
4.00% 1.3780ms 23 59.912us 18.272us 250.79us shortcut_kernel(int, int, int, int, int, int, int, int, int, int, float*, int, int, int,
float, float, float*)
1.36% 467.82us 1 467.82us 467.82us 467.82us maxwell_scudnn_128x64_relu_small_nn
0.86% 296.90us 1 296.90us 296.90us 296.90us maxwell_scudnn_128x32_relu_small_nn
0.48% 165.99us 2 82.994us 82.882us 83.106us maxwell_scudnn_128x64_relu_interior_nn
0.39% 134.98us 1 134.98us 134.98us 134.98us maxwell_scudnn_128x32_relu_interior_nn
0.26% 89.508us 43 2.0810us 1.7280us 9.3440us
cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
0.17% 58.018us 2 29.009us 19.809us 38.209us upsample_kernel(unsigned long, float*, int, int, int, int, int, int, float, float*)
API calls: 90.08% 285.51ms 798 357.78us 3.4690us 282.40ms cudaLaunch
9.68% 30.674ms 3 10.225ms 1.6737ms 24.491ms cudaMemcpy
0.11% 363.03us 3540 102ns 86ns 1.5170us cudaSetupArgument
8
Example: checking Tensor Core utilization
● Performs a 4x4 matrix multiply in a single cycle
○ Introduced with the Volta architecture
9
https://www.nvidia.com/content/apac/gtc/ja/pdf/2017/1055.pdf
Listing the available metrics
● --query-metrics
10
$ nvprof --query-metrics
Available Metrics:
Name Description
Device 0 (TITAN V):
...
shared_load_transactions_per_request: Average number of shared memory load transactions performed for each
shared memory load
shared_store_transactions_per_request: Average number of shared memory store transactions performed for each
shared memory store
local_load_transactions_per_request: Average number of local memory load transactions performed for each local
memory load
local_store_transactions_per_request: Average number of local memory store transactions performed for each local
memory store
…
half_precision_fu_utilization: The utilization level of the multiprocessor function units that execute 16 bit floating-point
instructions on a scale of 0 to 10. Note that this doesn't specify the utilization level of tensor core unit
tensor_precision_fu_utilization: The utilization level of the multiprocessor function units that execute tensor core
instructions on a scale of 0 to 10
(callouts: shared memory metrics above; and the tensor core metric!)
Let's try it...
● Run with the metric specified
11
$ nvprof --metrics tensor_precision_fu_utilization <application>
Invocations Metric Name Metric Description Min Max Avg
Device "TITAN V (0)"
Kernel: volta_s884cudnn_fp16_128x128_ldg8_relu_exp_interior_nhwc_tn_v1
3 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (5) Mid (4)
Kernel: volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
27 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) High (7) Mid (6)
Kernel: volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_interior_nhwc2nchw_tn_v1
20 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (4) Mid (6) Mid (5)
Kernel: volta_fp16_s884cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
14 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Low (2) Mid (5) Mid (4)
Kernel: volta_fp16_s884cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
11 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Low (3) High (7) Mid (6)
utilization level
Profiling Scope
● Restrict which part of the run is profiled
○ Wrap the region you want to measure in cudaProfilerStart(); / cudaProfilerStop();
12
#include <cuda_profiler_api.h>

cudaProfilerStart();   // profiling data is collected from here ...
// do something to profile
...
cudaProfilerStop();    // ... until here
$ nvprof --profile-from-start off <application>
The --profile-from-start off option is required
What about calling the CUDA API from Python?
● Just run the script under nvprof as usual
● You can also use ctypes (a context-manager wrapper sketch appears below)
13
$ nvprof [nvprof-options] python ...
Python script:
import ctypes

_cudart = ctypes.CDLL('libcudart.so')   # load the CUDA runtime library
ret = _cudart.cudaProfilerStart()       # same effect as the C API call
# call cuda-based methods
ret = _cudart.cudaProfilerStop()
https://docs.python.jp/3/library/ctypes.html
(Diagram: a CUDA-based Python extension library, xxxlib.cpython-35m-x86_64-linux-gnu.so, linking against libcuda…...so)
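Building on the snippet above, a small convenience wrapper can scope profiling with a context manager; this is a sketch, and cuda_profiler_range is a hypothetical helper name, not part of any library:

import ctypes
from contextlib import contextmanager

_cudart = ctypes.CDLL('libcudart.so')

@contextmanager
def cuda_profiler_range():
    # nvprof records only what happens inside this block when the
    # script is run with --profile-from-start off
    _cudart.cudaProfilerStart()
    try:
        yield
    finally:
        _cudart.cudaProfilerStop()

# usage, under: nvprof --profile-from-start off python script.py
# with cuda_profiler_range():
#     run_inference()   # hypothetical CUDA-backed call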
nvvp: NVIDIA VISUAL PROFILER
14
nvvp : nvidia visual profiler
● GUI version of the profiler
○ Just follow its navigation and click through
15
$ nvvp
Checking the timeline
16
(Screenshot: the timeline view)
Examining kernel performance
17
(Screenshot: detailed analysis pane, the kernel list sorted heaviest-first, and analysis of the selected kernel)
Examining kernel performance
18
compute resources / memory bandwidth / latency
Which one limits performance? Drill down for a more detailed analysis
Primary Performance Limiter
● Both High:
○ Both the compute units and the memory bandwidth are heavily utilized
● Memory High, Compute Low:
○ Bound by memory bandwidth
● Compute High, Memory Low:
○ Bound by compute resources (see the sketch below)
19
(Chart: Compute vs. Memory utilization)
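As a rough illustration of the compute-vs-memory distinction, here is a back-of-the-envelope arithmetic-intensity check; every number below is an assumption (ballpark TITAN V-class peaks), not a measurement:

# Is this kernel compute- or memory-bound? (minimal sketch)
flops = 2 * 1024**3           # assumed: floating-point ops performed by the kernel
bytes_moved = 512 * 1024**2   # assumed: bytes read + written to DRAM
peak_flops = 14e12            # assumed peak compute, ~14 TFLOP/s FP32
peak_bw = 650e9               # assumed peak DRAM bandwidth, ~650 GB/s

intensity = flops / bytes_moved          # FLOP per byte the kernel achieves
machine_balance = peak_flops / peak_bw   # FLOP per byte the GPU can sustain

if intensity < machine_balance:
    print(f"likely memory-bound ({intensity:.1f} < {machine_balance:.1f} FLOP/B)")
else:
    print(f"likely compute-bound ({intensity:.1f} >= {machine_balance:.1f} FLOP/B)")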
Details omitted here, as this turns into a discussion of GPU architecture
http://on-demand.gputechconf.com/gtc/2013/webinar/gtc-express-guided-analysis-nvidia-visual-profiler.pdf
20
NVLink view
● Shows the NVLink topology and communication throughput
○ Note: the GPU-to-GPU topology can also be checked with $ nvidia-smi topo --matrix
21
Remote Profiling
● Forwarding nvvp over X Forwarding is painfully slow
○ Better: dump the profile with nvprof, then scp it over
● You can also place a relay script on an intermediate server
○ https://docs.nvidia.com/cuda/profiler-users-guide/index.html#remote-profiling-one-hop
22
$ nvprof --analysis-metrics -o profile.nvvp <application>
Required for detailed kernel analysis
https://devblogs.nvidia.com/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/
No GPU is needed on the machine that views the profile
Remote Profiling
● However...
23
$ nvprof -o profile.nvvp <application>
→ shows little more than the timeline
$ nvprof --analysis-metrics -o profile.nvvp <application>
→ replays the kernels over and over
Inconvenient...
Restrict the scope with --kernels ... ?
Even a slightly complex application takes forever
Remote Profiling
● You can also profile the remote machine directly, without dumping *.nvvp files and copying them over
24
Take-home message: NVIDIA's own materials are the most detailed
● https://docs.nvidia.com/cuda/profiler-users-guide/index.html
● http://www.robots.ox.ac.uk/~seminars/seminars/Extra/2015_10_08_JeremyAppleyard.pdf
25
> Note that Visual Profiler and nvprof will be deprecated in a future CUDA release.
> It is recommended to use next-generation tools NVIDIA Nsight Compute for GPU profiling
> and NVIDIA Nsight Systems for GPU and CPU sampling and tracing.
26
Nsight Systems
NVIDIA Nsight Systems for GPU and CPU sampling and tracing
27
https://www.youtube.com/watch?time_continue=3&v=UaFnnXH6U4E
Watch the official video! (I'll leave it at that)
28
Question:
30
1. I use a GPU in my day-to-day research/development
2. I have written a CUDA kernel before
3. I profile my code properly
4. I have "completely understood" GPU architecture
tf.timeline
tensorflow/Keras
31
Tensorflow timeline
● A profiling feature that ships with Tensorflow itself
32
import tensorflow as tf
from tensorflow.python.client import timeline

# build your model ...
ops = …

with tf.Session() as sess:
    # add additional options to trace the session execution
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(ops, options=options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json file
    fetched_timeline = timeline.Timeline(run_metadata.step_stats)
    chrome_trace = fetched_timeline.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(chrome_trace)
tf.timeline from Keras
● Also available from Keras with the Tensorflow backend
33
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
model.compile(loss='...',
              optimizer='...',
              options=run_options,
              run_metadata=run_metadata)
…
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())
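Note: with this pattern the step stats are only collected once the model actually runs, so call model.fit() (or predict()) at the elided step before building the Timeline object; dumping it straight after compile() would yield an empty trace.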
chrome://tracing
● Conforms to Chrome's Trace Event Format
○ Load the JSON via chrome://tracing in the Chrome browser (a minimal example follows)
34
(Screenshot: the timeline, with time per node)
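For reference, here is a minimal hand-written trace in that format, generated from Python; the event names and timings are made up ("X" marks a complete event, and ts/dur are in microseconds):

import json

events = [
    {"name": "conv1", "ph": "X", "ts": 0,   "dur": 950, "pid": 0, "tid": 0},
    {"name": "relu1", "ph": "X", "ts": 950, "dur": 120, "pid": 0, "tid": 0},
]
with open('trace.json', 'w') as f:
    json.dump({"traceEvents": events}, f)
# open trace.json via chrome://tracing to see the two slices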
chrome://tracing
● You can also zoom in on just the region of interest
35
(Screenshot: a selected range and the total processing time within it)
Monitoring inter-GPU communication
● Comparing All-Reduce algorithms
36
Catapult
● Chrome Performance tools*
○ https://github.com/catapult-project/catapult
○ Used by Chrome / Go / Android
○ Trace Event Format details:
■ https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview
● Projects
○ Trace-viewer: the Javascript codebase that loads trace files and creates the UI
○ Telemetry
○ Performance Dashboard
○ Systrace
○ Web Page Replay
37
[*] https://docs.google.com/document/d/1QADiFe0ss7Ydq-LUNOPpIf6z4KXGuWs_ygxiJxoMZKo/edit
Tensorflow Profiler and Advisor
38
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md
Summary
● This may feel a bit off-topic for a CV meetup, but ...
○ A quick tour of GPU profiling tools
○ Master the tools and aim for the world's fastest implementation !!!
39