2015年度GPGPU実践プログラミング　第6回　パフォーマンス解析ツール

第6回パフォーマンス解析ツール
長岡技術科学大学電気電子情報工学専攻出川智啓

今回の内容
2015/05/20GPGPU実践プログラミング2
 プロファイリング
 Compute Profiler
 (Nsight)
 Warpと並列実行の制約
 Occupancy Calculator
 ベクトル和のプロファイリング

プログラムの挙動
 経験的なプログラム高速化手法
 スレッド数は128か256
 メモリはアドレス順にアクセス
等
 プログラム内部で挙動がどのように変化するか
 パフォーマンスカウンタ機能による確認
 プロファイリング

プロファイリング
 パフォーマンスカウンタの数値を出力
 プログラム実行時の挙動の記録
 ボトルネックの特定
 プロファイラ
 コマンドラインで実行するプロファイラ
 Compute Profiler
 nvprof(CUDA 5以降)
 GUIから操作するプロファイラ
 Compute Visual Profiler
 Nsight

Compute Profiler
 CUDAの実行環境に組み込まれたプロファイラ
 環境変数を設定するだけで有効化
 使い方
1. 環境変数の設定 $ export CUDA_PROFILE=1
2. プログラムの実行 $ ./a.out
3. プロファイル結果(標準はcuda_profile_0.log)の確認

Compute Profilerの環境変数
 CUDA_PROFILE
 1でプロファイル実行，0でプロファイル停止
 CUDA_PROFILE_CONFIG
 プロファイル項目を記述したファイルの名前
 CUDA_PROFILE_LOG
 出力ファイル名（標準はcuda_profile_0.log）
 CUDA_PROFILE_CSV
 1で出力ファイルをCSV（カンマ区切り）形式に変更

差分法のプロファイル
 $ export CUDA_PROFILE=1
 $ nvcc ‐arch=sm_20 differentiate.cu
 $ ./a.out
 cuda_profile_0.logが出力される
 $ cat cuda_profile_0.log

プロファイル結果の確認
 標準で出力される値
 method カーネルや関数（API）の名称
 gputime GPU上で処理に要した時間（s単位）
 cputime CPUで処理（=カーネル起動）に要した時間
実際の実行時間=cputime+gputime
 occupancy 同時実行されているWarp数と実行可能な最大
Warp数の比（一般的には1に近いほどよい）
$ cat cuda_profile_0.log
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla M2050
# TIMESTAMPFACTOR fffff614a81cd038
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 3285.216 ] cputime=[ 3657.000 ]
method=[ _Z13differentiatePdS_ ] gputime=[ 197.408 ] cputime=[ 18.000 ] occupancy=[ 1.000 ]
method=[ memcpyDtoH ] gputime=[ 3598.752 ] cputime=[ 4505.000 ]

GPUの並列処理の実際
 GPUのスレッドを管理
 Streaming Multiprocessor内のCUDA Core数
 Tesla世代 8
 Fermi世代 32
 Kepler世代 192
 Maxwell世代 128
 Warp（ウォープ）という処理の単位の導入
 32スレッドをまとめて1 Warp
 1 Warpごとに処理を実行
256スレッドの同時実行は不可能

 処理はWarp単位で実行
 32スレッドごとに処理を実行
 1ブロックあたりのスレッド数が256=Warp 8個
 ある命令を発行すると，Warp単位で同じ命令を実行
 Warp内でCUDA CoreはSIMD的に動作
 複数のまとまったスレッドが同じ演算を同時に実行
 SIMT(Single Instruction Multiple Threads)
 各スレッドが分岐命令に基づいて異なる処理を実行することもできる
（=厳密なSIMDではない）

 パイプライン実行
 あるWarpがグローバルメモリへアクセス，データ待ち
 その他の実行可能なWarpが計算を実行
 グローバルメモリへのアクセスのレイテンシを隠蔽
計算メモリアクセス計算Warp0
Warp1
Warp2
Warp3
実行開始処理時間
計算メモリアクセス計算
計算メモリアクセス計算
計算メモリアクセス
・・・
・・・

 Warpの同時実行
 同時に実行されているWarp(Active Warp)が多いほどレイテ
ンシ隠蔽が容易
 Active Warpを多くすることで高速化
 Warpの同時実行数はブロック内で使用している資源（レジスタ，
共有メモリ）によって変化
 占有率(Occupancy)
 同時実行されているWarp数と実行可能な最大Warp数の比
 一般的には1に近い方が高速

並列実行に対する制約
 GPUの構成（資源の量，スケジューラ）に起因するハード
ウェア的・ソフトウェア的制約
Tesla世代 Fermi世代
Warpサイズ 32スレッド
1ブロックあたりのスレッド数 512 1024
1SMあたりの最大スレッド数 1024 1536
1SMあたりのWarp数 32 48
1SMあたりのブロック数 8 8
1SMあたりの共有メモリサイズ 16384 byte 49152 byte
1SMあたりの32bitレジスタ数 16384本 32768本
https://www.softek.co.jp/SPG/Pgi/TIPS/public/accel/gpu‐accel2.html

 1ブロックあたりのスレッド数 256
 1ブロックあたりのWarp数
256thread/block / 32thread/Warp = 8 Warp/block
 1SMが処理するブロック数（1SMあたり最大48Warpを処理）
48 Warp/SM / 8 Warp/block = 6 block/SM（<8）
 1SMあたり6ブロックを並列処理すればよい（最大8ブロック）
 同時実行されるWarp数は
8 Warp/block × 6 block/SM = 48 Warp/SM.
 1SMあたり最大48Warp同時実行できるので，占有率は
48 Warp/SM /48 Warp/SM = 1(=100%)
太字：利用するGPUによって決定
下線：ユーザの設定により決まる値
斜体：処理内容やコンパイラによって決定

 1ブロックあたりのスレッド数 64
 1ブロックあたりのWarp数
64thread/block / 32thread/Warp = 2 Warp/block
 1SMが処理するブロック数（1SMあたり最大48Warpを処理）
48 Warp/SM / 2 Warp/block = 24 block/SM（>8）
 1SMあたり8ブロックを並列処理（24ブロックの処理は不可能）
 同時実行されるWarp数は
2 Warp/block × 8 block/SM = 16 Warp/SM.
 1SMあたり最大48Warp同時実行できるので，占有率は
16 Warp/SM /48 Warp/SM = 0.333(=33%)

 並列実行されるブロックの数
 1ブロックあたりの共有メモリやレジスタの使用量で制限
 1ブロックあたりのスレッド数 Nt
 1ブロックが利用する共有メモリS[byte]
49152/S [block/SM]
 1スレッドが利用するレジスタR[本]
32768/(Nt×R) [block/SM]
 並列実行されるブロック数
min{8, 48/(Nt/32), 49152/S, 32768/(Nt×R)}

Occupancyの計算
 CUDA Occupancy Calculator
 NVIDIAが提供*
 Occupancy等の値を実行前に推定
 必要な情報
 GPUの世代
 共有メモリのサイズ設定
 64kBのうち何kBを共有メモリとするか
 1ブロックあたりのスレッド数
 1スレッドあたりのレジスタ使用数
 1ブロックあたりの共有メモリ使用量
*http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls

CUDA Occupancy Calculator
 デモ
GPGPU実践プログラミング18 2015/05/20

共有メモリやレジスタ数使用量の確認
 コンパイルオプションを付加
 ‐Xptxas –v もしくは ‐‐ptxas‐options=‐v
 実行可能バイトコードの情報を出力
$ nvcc ‐arch=sm_20 differentiate.cu ‐Xptxas ‐v
ptxas info    : Compiling entry function '_Z13differentiatePdS_' for 'sm_20'
ptxas info    : Function properties for _Z13differentiatePdS_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 19 registers, 48 bytes cmem[0], 12 bytes cmem[16]
使用レジスタ数コンスタントメモリ使用量（サイズ）
（コンパイラの最適化により利用）

共有メモリやレジスタ数使用量の確認
 共有メモリを利用したカーネル
 コンスタントメモリを利用したカーネル
$ nvcc ‐arch=sm_20 differentiate_shared.cu ‐Xptxas ‐v
ptxas info    : Used 18 registers, 2064+0 bytes smem, 48 bytes cmem[0], 4
bytes cmem[16]
共有メモリ使用量（サイズ）
$ nvcc ‐arch=sm_20 differentiate_constant.cu ‐Xptxas ‐v
ptxas info    : Used 13 registers, 2064+0 bytes smem, 48 bytes cmem[0], 8
bytes cmem[2]

CUDA Occupancy Calculator
 デモ

レジスタ数の制約*
 Compute Capalicity2.0世代のGPUのレジスタ利用
 1 Warpあたり64本単位でレジスタを確保
 レジスタ数が問題となってOccupancy 100%の達成が困難
 Occupancy 100%を達成するためには
 ハードウェア的制約
 1SMあたりのレジスタ数 32768本以下
 1SMあたりのWarp数 48(=48×32スレッド)以下
 1 Warpあたり64本単位でレジスタを確保
*http://www.gdep.jp/column/view/33

レジスタ数の制約*
 Occupancy 100%を達成するためには
 1スレッドあたりの利用可能レジスタ数
32768本/(48Warp/SM×32thread/Warp)=21.3=21本/thread
 1スレッドあたり21本以下
 1 Warpが確保するレジスタ数
21本/thread×32thread/Warp=672→704(64の倍数)に切り上げ
 1SMあたりの使用レジスタ数
704本/Warp×48Warp/SM=33792本/SM (>32768)
 1スレッドあたりの使用レジスタ数が20本以下でないと
100%を達成できない
*http://www.gdep.jp/column/view/33

詳細なプロファイリング
 gputime, cputime, occupancy以外のプロファイル
 設定ファイルに測定する項目を記述
 設定ファイルを環境変数CUDA_PROFILE_CONFIGで指定
 prof.confがよく使われる
 $ export CUDA_PROFILE_CONFIG=prof.conf
 設定後，CUDA_PROFILE=1にしてプログラムを実行
 $ export CUDA_PROFILE=1

詳細なプロファイリング
 prof.confの内容
 Profiler User's Guide
http://docs.nvidia.com/cuda/profiler‐users‐guide/index.html
 4.3. Command Line Profiler Configuration中のTable 2
 9. Metrics Reference中のTable 3
 nvprofを利用することも可能（CUDA 5以降）
 $ nvprof ‐‐query‐eventsの結果を参照

カーネル起動に関する情報
名称内容
gpustarttimestamp カーネルもしくはメモリ転送開始時のタイムスタンプ(ns，64bit16進数)
gpuendtimestamp カーネルもしくはメモリ転送完了時のタイムスタンプ(ns，64bit16進数)
timestamp カーネル起動もしくはメモリ転送開始時のタイムスタンプ(s，単精度実数)
gridsize カーネル起動時のブロック数(x,y)
gridsize3d カーネル起動時のブロック数(x,y,z)
threadblocksize カーネル起動時のスレッド数(x,y,z)
dynsmemperblock 動的に割り当てられた共有メモリのサイズ
stasmemperblock 静的に割り当てられた共有メモリのサイズ
regperthread 1スレッドあたりのレジスタ数
memtransferdir メモリ転送の方向(0. host‐>device, 1. device‐>host)
memtransfersze メモリ転送サイズ
memtransferhostmemtype ホストメモリの種類（ページング可能かページロックか）
streamid カーネルを起動したストリームのID
cacheconfigrequested キャッシュ設定の要求共有メモリ：L1キャッシュ
0. 設定無し 1. 48kB:16kB
2. 16kB:48kB 3. 32kB:32kB*cacheconfigexecuted カーネル起動時のキャッシュ設定
*Kepler世代から

カーネル実行時のイベント
 全てのGPUで取得できる情報
名称内容
local_load ローカルメモリからの読込回数（1SM内のWarpあたり）
local store ローカルメモリからの書込回数（1SM内のWarpあたり）
gld_request グローバルメモリからの読込回数（1SM内のWarpあたり）
gst_request グローバルメモリへの書込回数（1SM内のWarpあたり）
divergent_branch divergent branchの発生回数
branch プログラム中の分岐命令の数
sm_cta_launched 1SM内で実行されたブロックの数

 Compute Capability 1.xのGPUで取得できる情報
名称内容
gld_incoherent 非コアレスアクセスによるグローバルメモリからの読込回数
gld_coherent コアレスアクセスによるグローバルメモリからの読込回数
gld_32b 32バイト単位で行われたグローバルメモリからの読込回数
gst_incoherent 非コアレスアクセスによるグローバルメモリへの書込回数
gst_coherent コアレスアクセスによるグローバルメモリへの書込回数
gst_32b 32バイト単位で行われたグローバルメモリへの書込回数
instructions 実行された命令数
warp_serialize 共有メモリ，コンスタントメモリでコンフリクトした回数
cta_launched 起動されたブロックの数

名称内容
instructions_issued
（バージョンによってはinst_issued）
命令発行数（命令のreplay*を含む）
instructions_issued1_0 命令群0の1命令発行に要するサイクル数
instructions_executed
（バージョンによってはinst_executed）
命令発行数（replayを含まない）
warps_launched 起動されたWarpの数（1SMあたり）
threads_launched 起動されたスレッドの数（1SMあたり）
active_cycles 1SM内で最低でも1 Warpが実行されたサイクル数
active_warps
1SM内でアクティブになったWarpの数の合計
サイクルごとにアクティブなWarpの数（0~48の間で）を加算
shared_load 共有メモリからの読込回数（1SM内のWarpあたり）
shared_store 共有メモリへの書込回数（1SM内のWarpあたり）
*同一Warp内のスレッド
が同じ命令を逐次的に
実行
バンクコンフリクト等が
原因で並列実行できな
い場合に発生
http://on‐demand.gputechconf.com/
gtc‐express/2011/presentations/In
st_limited_kernels_Oct2011.pdf

名称内容
l1_global_load_hit L1キャッシュにヒットしたグローバルメモリからの読込回数
l1_global_load_miss L1キャッシュでキャッシュミスしたグローバルメモリからの読込回数
l1_local_load_hit L1キャッシュにヒットしたローカルメモリからの読込回数
l1_local_load_miss L1キャッシュでキャッシュミスしたローカルメモリからの読込回数
l1_local_store_hit L1キャッシュにヒットしたローカルメモリへの書込回数
l1_local_store_miss L1キャッシュでキャッシュミスしたローカルメモリへの書込回数
l1_shared_bank_conflict 共有メモリ内でバンクコンフリクトした回数
uncached_global_load_transacti
on
キャッシュされないグローバルメモリからの読込回数
32,64,128ビットアクセスに対して1,2,4で加算
global_store_transaction
グローバルメモリへの書込回数
32,64,128ビットアクセスに対して1,2,4で加算

 合計で70個程度(※動作は未確認)
•elapsed_cycles_sm
•global_store_transaction
•l1_global_load_hit
•l1_global_load_miss
•l1_local_load_hit
•l1_local_load_miss
•l1_local_store_hit
•l1_local_store_miss
•l1_shared_bank_conflict
•sm_cta_launched
•tex0_cache_sector_misses
•tex0_cache_sector_queries
•uncached_global_load_transaction
•fb_subp0_read_sectors
•fb_subp0_write_sectors
•fb_subp1_read_sectors
•fb_subp1_write_sectors
•l2_subp0_read_hit_sectors
•l2_subp0_read_sector_misses
•l2_subp0_read_sector_queries
•l2_subp0_read_sysmem_sector_queries
•l2_subp0_read_tex_hit_sectors
•l2_subp0_read_tex_sector_queries
•l2_subp0_total_read_sector_queries
•l2_subp0_total_write_sector_queries
•l2_subp0_write_sector_misses
•l2_subp0_write_sector_queries l2
•l2_subp0_write_sysmem_sector_queries
•l2_subp1_read_hit_sectors
•l2_subp1_read_sector_misses
•l2_subp1_read_sector_queries
•l2_subp1_read_sysmem_sector_queries
•l2_subp1_read_tex_hit_sectors
•l2_subp1_read_tex_sector_queries
•l2_subp1_total_read_sector_queries
•l2_subp1_total_write_sector_queries
•l2_subp1_write_sector_misses l2 subp1
•l2_subp1_write_sector_queries l2 subp1
•l2_subp1_write_sysmem_sector_queries
•gld_inst_128bit
•gld_inst_16bit
•gld_inst_32bit
•gld_inst_64bit
•gld_inst_8bit
•gst_inst_128bit
•gst_inst_16bit
•gst_inst_32bit
•gst_inst_64bit
•gst_inst_8bit
•active_cycles
•active_warps
•atom_count
•branch
•divergent_branch
•gld_request
•gred_count
•gst_request
•inst_executed
•inst_issued
•local_load
•local_store
•prof_trigger_00
•prof_trigger_01
•prof_trigger_02
•prof_trigger_03
•prof_trigger_04
•prof_trigger_05
•prof_trigger_06
•prof_trigger_07
•shared_load
•shared_store
•thread_inst_executed_0
•thread_inst_executed_1
•threads_launched
•warps_launched
http://www.ics.uci.edu/~roblim1/docs/cupticounters.txt

プログラム挙動の確認
 スレッド数の変化によるメモリアクセスの変化
 グローバルメモリからの読込(prof_gld)
 gld_request
 l1_global_load_hit
 l1_global_load_miss
 グローバルメモリへの書込(prof_gst)
 gst_request
 global_store_transaction
 命令
 inst_issued (prof_inst_issued)
 inst_executed (prof_inst_executed)

#define N (512000)
#define Nbytes (N*sizeof(float))
#define NT (256)
#define NB (N/NT) //=2000
__global__ void init(float *a,
float *b, float *c){
int i = blockIdx.x*blockDim.x
+ threadIdx.x;
a[i] = 1.0;
b[i] = 2.0;
c[i] = 0.0;
}
__global__ void add(float *a,
+ threadIdx.x;
c[i] = a[i] + b[i];
}
int main(void){
float *a,*b,*c;
cudaMalloc((void **)&a, Nbytes);
cudaMalloc((void **)&b, Nbytes);
cudaMalloc((void **)&c, Nbytes);
init<<< NB, NT>>>(a,b,c);
add<<< NB, NT>>>(a,b,c);
return 0;
}
ベクトル和（最適*アクセス版）
vectoradd_coalesce.cu
*コアレス/アラインアクセス

#define N (512000)
#define NT (250)
#define NB (N/NT) //=2048
+ threadIdx.x;
a[i] = 1.0;
b[i] = 2.0;
c[i] = 0.0;
}
+ threadIdx.x;
c[i] = a[i] + b[i];
}
int main(void){
float *a,*b,*c;
return 0;
}
ベクトル和（非最適*アクセス版）
vectoradd_noncoalesce.cu
*アンコアレスもしくはミスアライン
アクセス

export CUDA_PROFILE=1
nvcc ‐arch=sm_20 vectoradd_coalesce.cu
#
export CUDA_PROFILE_CONFIG=prof_gld
export CUDA_PROFILE_LOG=vectoradd_c_profile_gld.log
./a.out
#
export CUDA_PROFILE_CONFIG=prof_gst
export CUDA_PROFILE_LOG=vectoradd_c_profile_gst.log
./a.out
#
export CUDA_PROFILE_CONFIG=prof_inst_issued
export CUDA_PROFILE_LOG=vectoradd_c_profile_inst_issued.log
./a.out
#
export CUDA_PROFILE_CONFIG=prof_inst_executed
export CUDA_PROFILE_LOG=vectoradd_c_profile_inst_executed.log
./a.out
#
#
#
nvcc ‐arch=sm_20 vectoradd_noncoalesce.cu
#
export CUDA_PROFILE_CONFIG=prof_gld
export CUDA_PROFILE_LOG=vectoradd_n_profile_gld.log
./a.out
#
export CUDA_PROFILE_CONFIG=prof_gst
export CUDA_PROFILE_LOG=vectoradd_n_profile_gst.log
./a.out
#
export CUDA_PROFILE_CONFIG=prof_inst_issued
export CUDA_PROFILE_LOG=vectoradd_n_profile_inst_issued.log
./a.out
#
export CUDA_PROFILE_LOG=vectoradd_n_profile_inst_executed.log
./a.out
#
コンパイルと実行（シェルスクリプト）
vectoradd.sh

実行結果（ベクトル和のみ抜粋, Occupancyはどちらも1）
 l1_global_load_missの値が増加
 globa_store_transactionの値が大幅に増加
 inst_issuedの値が大幅に増加
内容最適アクセス非最適アクセス
gputime+cputime
（4回の平均）
56.808+103
=159.808
76.616+122.5
=199.116
gld_request 2336 2368
l1_global_load_hit 0 1941
l1_global_load_miss 2336 2577
gst_request 1136 1168
global_store_transaction 1136 2252
inst_issued 24113 31399
inst_executed 19720 19992

メモリアクセスの評価
 transactions/request
 1 Warpがアクセス要求に対してキャッシュラインをいくつ使うか
 理想値
 32 thread/Warp × variable size[byte/thread]
/ 128 byte/line
 4バイト変数では1，8バイト変数では2
 読込
 (l1_global_load_hit+l1_global_load_miss)/gld_re
quest
 書込
 global_store_transaction/gst_requet

transactions/request
 理想値
 float型（4バイト）の読込 1
 float型（4バイト）の書込 1
 非最適アクセスではメモリ転送の効率が50%に低下
 第5回 GPUのメモリ階層でグローバルメモリの項目を参照
読込書込
最適アクセス (2336+0)/2336=1 1136/1136=1
非最適アクセス (1941+2577)/2368≈2 2252/1168≈2

プログラム挙動の確認
 命令の実行に要したサイクルの評価
 実行した命令数
 inst_executed (prof_inst_executed)
 Warpが起動したサイクル数
 active_cycles (prof_cycle)

#define N (8*1024) //64kBに収める
#define NT (256)
#define NB (N/NT)
+ threadIdx.x;
a[i] = 1.0;
b[i] = 2.0;
c[i] = 0.0;
}
+ threadIdx.x;
c[i] = a[i] + b[i];
}
int main(void){
float *a,*b,*c;
return 0;
}
GPUプログラム（グローバルメモリ利用）
2015/05/20GPGPU実践プログラミング
vectoradd.cu
40

#define N (8*1024)
#define NT (256)
#define NB (N/NT)
__constant__ float a[N],b[N];
__global__ void init(float *c){
+ threadIdx.x;
c[i] = 0.0f;
}
__global__ void add(float *c){
+ threadIdx.x;
c[i] = a[i] + b[i];
}
int main(void){
float *c;
float *host_a,*host_b;
int i;
host_a=(float *)malloc(Nbytes);
host_b=(float *)malloc(Nbytes);
cudaMalloc((void **)&c,Nbytes);
for(i=0;i<N;i++){
host_a[i] = 1.0f;
host_b[i] = 2.0f;
}
cudaMemcpyToSymbol
(a,host_a,Nbytes);
cudaMemcpyToSymbol
(b,host_b,Nbytes);
init<<< NB, NT>>>(c);
add<<< NB, NT>>>(c);
return 0;
}
コンスタントメモリ（単純な置き換え）
vectoradd_constant.cu
41

#define N (8*1024)
#define NT (256)
#define NB (N/NT)
__constant__ float a, b;
__global__ void init(float *c){
+ threadIdx.x;
c[i] = 0.0f;
}
__global__ void add(float *c){
+ threadIdx.x;
c[i] = a + b;
}
int main(void){
float *c;
float host_a,host_b;
host_a=1.0f;
host_b=2.0f;
cudaMalloc((void **)&c,Nbytes);
//host_a,host_bが配列ではないので
//アドレスを取り出すために&を付ける
cudaMemcpyToSymbol
(a,&host_a,sizeof(float));
cudaMemcpyToSymbol
(b,&host_b,sizeof(float));
init<<< NB, NT>>>(c);
add<<< NB, NT>>>(c);
return 0;
}
コンスタントメモリ（同一アドレス参照）
vectoradd_broadcast.cu
42

#
nvcc ‐arch=sm_20 vectoradd.cu
#
export CUDA_PROFILE_LOG=vectoradd_profile_inst_executed.log
./a.out
#
export CUDA_PROFILE_CONFIG=prof_cycle
export CUDA_PROFILE_LOG=vectoradd_profile_cycle.log
./a.out
#
#
nvcc ‐arch=sm_20 vectoradd_constant.cu
#
export CUDA_PROFILE_LOG=vectoradd_constant_profile_inst_executed.log
./a.out
#
export CUDA_PROFILE_LOG=vectoradd_constant_profile_cycle.log
./a.out
#
#
nvcc ‐arch=sm_20 vectoradd_broadcast.cu
#
export CUDA_PROFILE_LOG=vectoradd_broadcast_profile_inst_executed.log
./a.out
#
export CUDA_PROFILE_LOG=vectoradd_broadcast_profile_cycle.log
./a.out
#
コンパイルと実行（シェルスクリプト）
vectoradd_cycle.sh

実行結果（ベクトル和のみ抜粋, Occupancyはいずれも1）
 inst_issuedの値が減少
 active_cyclesの値が大幅に増加
 同じ処理の実行に余分なサイクルが必要
 1サイクル（クロック）あたりの命令実行数(≠演算回数)
 グローバルメモリ 408/ 658=0.62
 コンスタントメモリ 336/2984=0.11
 ブロードキャスト 288/ 599=0.48
内容グローバルメモリコンスタントメモリブロードキャスト
gputime+cputime
（2回の平均）
2.96+49
=51.96
7.216+53.5
=60.716
2.96+49.5
=52.46
inst_executed 408 336 288
active_cycles 658 2984 599

簡易的なプロファイリング(CUDA 5以降)
 実行と同時にプロファイル結果を出力
 $ nvprof ./a.out
 $ nvprof ‐‐query‐events
 プロファイルできるイベント一覧を表示
 $ nvprof ‐events inst_executed ./a.out
 プロファイルするイベントを指定して実行

Nsight
 デバッグ・プロファイルツール
 https://developer.nvidia.com/nsight‐eclipse‐
edition
 https://developer.nvidia.com/nvidia‐nsight‐
visual‐studio‐edition
 Visual StudioやEclipseと統合し，開発→デバッグ→
プロファイリングが可能

Available Events
 The profiler supports logging of following counters during kernel execution
only on GPUs with Compute Capability 2.0 or higher
Name description
instructions_issued Number of instructions issued including replays
instructions_issued1_0 Number of cycles that issue one instruction for instruction group 0
instructions_issued2_0
Number of cycles that issue two instructions for instruction group
0
instructions_issued1_1 Number of cycles that issue one instruction for instruction group 1
instructions_issued2_1
Number of cycles that issue two instructions for instruction group
1
instructions_executed Number of instructions executed, do not include replays
warps_launched Number of warps launched in a SM
threads_launched Number of threads launched in a SM
active_cycles
Count of cycles in which at least one warp is active in a
multiprocessor
active_warps
Accumulated count of no. of warps which are active per cycle in a
multiprocessor. Each cycle increments it by the number of warps
active in that cycle (in range 0‐48)
shared_load Number of executed shared load instructions per warp in a SM
shared_store Number of executed shared store instructions per warp in a SM
http://www.cs.cmu.edu/afs/cs/academic/class/15668‐s11/www/cuda‐doc/Compute_Profiler.txt

Available Events
Name description
l1_global_load_hit       Number of global load hits in L1 cache
l1_global_load_miss      Number of global load misses in L1 cache
l1_local_load_hit        Number of local load hits in L1 cache
l1_local_load_miss       Number of local load misses in L1 cache
l1_local_store_hit       Number of local store hits in L1 cache
l1_local_store_miss      Number of local store misses in L1 cache
l1_shared_bank_conflict Count of no. of bank conflicts in shared memory
on
Number of uncached global load transactions. This increments by 1,
2 or 4 for 32, 64 and 128 bit accesses respectively
on
Number of uncached global load transactions. This increments by 1,
2, or 4 for 32, 64 and 128 bit  accesses respectively
global_store_transaction
Number of global store transactions. This increments by 1, 2, or 4
for 32, 64 and 128 bit accesses respectively

Available Events
Name description
l2_subp0_write_sector_misses
Accumulated write sector misses from L2 cache for slice 0 for all
the L2 cache units
l2_subp1_write_sector_misses
Accumulated write sectors misses from L2 cache for slice 1 for all
the L2 cache units
l2_subp0_read_sector_misses
Accumulated read sectors misses from L2 cache for slice 0 for all
the L2 cache units
l2_subp1_read_sector_misses
Accumulated read sectors misses from L2 cache for slice 1 for all
the L2 cache units
l2_subp0_write_sector_queries
Accumulated write sector queries from L1 to L2 cache for slice 0 of
all the L2 cache units
l2_subp1_write_sector_queries
Accumulated write sector queries from L1 to L2 cache for slice 1 of
l2_subp0_read_sector_queries
Accumulated read sector queries from L1 to L2 cache for slice 0 of
l2_subp1_read_sector_queries
Accumulated read sector queries from L1 to L2 cache for slice 1 of
tex0_cache_sector_queries      Number of texture cache sector queries for texture unit 0
tex0_cache_sector_misses Number of texture cache sector misses for texture unit 0
tex1_cache_sector_queries Number of texture cache sector queries for texture unit 1
tex1_cache_sector_misses Number of texture cache sector misses for texture unit 1

Available Events
Name description
fb_subp0_read_sectors
Number of read requests sent to sub‐partition 0 of all the DRAM
units
fb_subp1_read_sectors
units
fb_subp0_write_sectors
Number of write requests sent to sub‐partition 0 of all the DRAM
units
fb_subp1_write_sectors
units
fb0_subp0_read_sectors Number of read requests sent to sub‐partition 0 of DRAM unit 0
fb0_subp0_write_sectors Number of write requests sent to sub‐partition 0 of DRAM unit 0

2015年度GPGPU実践プログラミング　第6回　パフォーマンス解析ツール

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 2015年度GPGPU実践プログラミング　第6回　パフォーマンス解析ツール

Similar to 2015年度GPGPU実践プログラミング　第6回　パフォーマンス解析ツール (20)

More from 智啓出川

More from 智啓出川 (20)