[若渴計畫] From GPU Hardware Concepts to Coding CUDA

AJ
2014.6.17

Is a GPU only good for being a graphics card?
Can it be used for parallel computing?

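Two major vendors
• NVIDIA
• AMD
• Both vendors offer open source projects that enthusiasts can join
• What can you join? I have not looked into it yet
• My lab only has NVIDIA cards, so I use NVIDIA
• Which programming model do NVIDIA cards use?
• Single-instruction multiple thread (SIMT) programming model
What design concepts
does this model bring you?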

Starting from NVIDIA GPU design concepts

On NVIDIA GPUs, SIMT can be viewed through three properties
• Single instruction, multiple register sets
• Single instruction, multiple addresses
• Single instruction, multiple flow paths
http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html

Single Instruction, Multiple Register Sets
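// CPU version: a loop over all elements
for (i = 0; i < n; ++i) a[i] = b[i] + c[i];

// CUDA version: one thread per element
__global__ void add(float *a, float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = b[i] + c[i]; // no loop!
}
Costs:
• Each thread gets its own register set, so there is redundancy (for example, every thread keeps its own copy of values that are identical across threads).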

Single Instruction, Multiple Addresses
__global__ void apply(short* a, short* b, short* lut) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = lut[b[i]]; // indirect memory access
    // compare with a direct access: a[i] = b[i]
}
Cost:
• For DRAM, random access is inefficient compared with sequential access.
• For shared memory, random access is slowed down by bank conflicts (contention). Shared memory is not discussed here for now.

Single Instruction, Multiple Flow Paths
__global__ void find(int* vec, int len,
                     int* ind, int* nfound,
                     int nthreads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int last = 0;
    int* myind = ind + tid*len;
    for (int i = tid; i < len; i += nthreads) {
        if (vec[i]) { // flow divergence
            myind[last] = i;
            last++;
        }
    }
    nfound[tid] = last;
}
[Figure: threads read vec with coalesced reads; at if (vec[i]) they split into threads where the condition holds and threads where it does not; per-thread state is kept in registers.]

Those are the design properties of SIMT.
Next, look at the Kepler GK110 chip block diagram.

• 15 SMX (streaming multiprocessors) x 192 cores
• 4 warp schedulers per SMX
• 65,536 registers per SMX
From NVIDIA Kepler GK110 architecture whitepaper

• What is the warp scheduler for?
• Allocating resources inside the SMX
From NVIDIA Kepler GK110 architecture whitepaper

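Warps run in SIMT fashion
1. On NVIDIA GPUs, a "warp" is a group of 32 threads that run together. Each thread needs its own registers.
2. A warp executes in SIMT fashion, meaning its 32 threads execute the same instruction. On flow divergence, the hardware runs each branch path in turn with the threads that are not on that path masked off, so performance drops. (James Balfour, "CUDA Threads and Atomics", p.13~p.18).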
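The cost of flow divergence depends on whether the branch condition differs inside a warp or only between warps. The sketch below contrasts the two cases; the kernel names, the helper run_branch_demo, and the block size of 128 are illustrative assumptions, not taken from the slides.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Divergent: even and odd threads of the same warp take different paths,
// so the warp has to run both paths, one after the other.
__global__ void branch_per_thread(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) out[i] = 2 * i;
    else                      out[i] = 3 * i;
}

// Warp-uniform (for a 1D block): all threads of a warp evaluate the
// condition the same way, so each warp runs only one of the two paths.
__global__ void branch_per_warp(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) out[i] = 2 * i;
    else                                   out[i] = 3 * i;
}

// Hypothetical host helper: runs one of the two kernels on n elements (n > 0).
std::vector<int> run_branch_demo(bool per_warp, int n) {
    std::vector<int> host(n);
    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));
    int threads = 128;                          // assumed block size
    int blocks = (n + threads - 1) / threads;   // round up
    if (per_warp) branch_per_warp<<<blocks, threads>>>(dev, n);
    else          branch_per_thread<<<blocks, threads>>>(dev, n);
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Both kernels do the same amount of useful work per thread; timing them with a profiler on a large n is one way to see how much the intra-warp branch costs on a given GPU.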

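Warps run in SIMT fashion (cont.)
1. Several warps make up a "block". A block is mapped to one SMX, and the warp schedulers in that SMX switch among the warps of the block. Each warp has its own register sets.
2. As the figure shows, warp scheduling within a block has zero overhead (fast context switch), because all state is kept in the register sets. A warp is either active or suspended.
3. You choose how many threads a block has, but the maximum you can specify depends on the compute capability the hardware supports.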
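The per-block thread limit can be read from the device at run time instead of being looked up in a table. A minimal sketch using cudaGetDeviceProperties from the CUDA runtime; the function name max_threads_per_block is an illustrative assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints a few hardware limits of `device` and returns the maximum number
// of threads per block, or -1 if the device cannot be queried.
int max_threads_per_block(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    printf("%s: compute capability %d.%d, %d multiprocessors\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    printf("max threads per block: %d, warp size: %d, registers per block: %d\n",
           prop.maxThreadsPerBlock, prop.warpSize, prop.regsPerBlock);
    return prop.maxThreadsPerBlock;
}
```

Calling max_threads_per_block(0) queries the first GPU. The bundled deviceQuery sample reports the same kind of information in more detail.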

How a thread ID maps to a warp
• Warp ID (warpid)
• Which warp does a thread in a block belong to? threadIdx.x / 32 (for a 1D block)
• Thread ID = 0 ~ 31 -> warp 0
• Thread ID = 32 ~ 63 -> warp 1
• …
http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf , p.2
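The mapping can be checked directly on the device. A small sketch, assuming a 1D block; the names warp_and_lane and compute_warp_ids are illustrative, and the "lane" is the position of a thread inside its warp.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void warp_and_lane(int *warp_id, int *lane_id) {
    int t = threadIdx.x;           // 1D block assumed
    warp_id[t] = t / warpSize;     // warpSize is a built-in device variable
    lane_id[t] = t % warpSize;     // position of the thread inside its warp
}

// Hypothetical helper: launches one block of `threads` threads
// (between 1 and the per-block limit) and copies the IDs back.
void compute_warp_ids(int threads, std::vector<int> &warp_id,
                      std::vector<int> &lane_id) {
    warp_id.resize(threads);
    lane_id.resize(threads);
    size_t bytes = threads * sizeof(int);
    int *dev_warp = nullptr, *dev_lane = nullptr;
    cudaMalloc(&dev_warp, bytes);
    cudaMalloc(&dev_lane, bytes);
    warp_and_lane<<<1, threads>>>(dev_warp, dev_lane);
    cudaMemcpy(warp_id.data(), dev_warp, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(lane_id.data(), dev_lane, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_warp);
    cudaFree(dev_lane);
}
```

For 2D or 3D blocks the threads are first linearized (x fastest, then y, then z) and the same division by the warp size applies to the linear index.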

The GPU principles above (and of course there are more), together with integration with the CPU,
gave rise to the CUDA coding environment.

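Things to note when using CUDA
• Know which NVIDIA GPU architecture you are using.
• NVIDIA Tesla K20c
• According to https://developer.nvidia.com/cuda-gpus , the Tesla K20c has Compute Capability 3.5.
• To install the CUDA environment, see http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#axzz33nDhVV00 . The compiler is called nvcc.
• The latest CUDA version is 6.0, but I installed 5.0 (too lazy to upgrade).
• After installing the CUDA environment, run the bundled deviceQuery executable to check that the installation works.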

/usr/local/cuda-5.0/samples/1_Utilities/deviceQuery

You may wonder: how can the same CUDA program
run on GPUs with different numbers of SMXs?
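One part of the answer is that blocks are independent of each other, so the hardware can hand them to however many SMXs the GPU has. A grid-stride loop takes this one step further: the kernel stays correct for any grid size, so the launch can be sized from the SMX count at run time. The sketch below assumes device 0, a block size of 256, and a heuristic of four blocks per SMX; the names scale and scale_on_gpu are illustrative.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride loop: each thread handles elements i, i + stride, ...
// so the result does not depend on how many blocks were launched.
__global__ void scale(float *x, float s, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= s;
}

// Hypothetical host helper: scales x in place on device 0.
void scale_on_gpu(std::vector<float> &x, float s) {
    if (x.empty()) return;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int threads = 256;                           // assumed block size
    int blocks = 4 * prop.multiProcessorCount;   // assumed heuristic
    int n = static_cast<int>(x.size());
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<blocks, threads>>>(dev, s, n);
    cudaMemcpy(x.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```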

CUDA 5: Separate Compilation & Linking
From Introducing CUDA 5.pdf

##########################################################
# compiler setting
##########################################################
CC = gcc
CXX = g++
NVCC = nvcc
CFLAGS = -g -Wall
CXXFLAGS = $(CFLAGS) -Weffc++ -pg
LIBS = -lm -lglut -lGLU -lGL
INCPATH = -I./

# Continuation lines need a trailing backslash
OBJS = main.o \
       c_a.o \
       c_b.o \
       cpp_a.o \
       cpp_b.o \
       cu_a.o \
       cu_b.o
EXEC = output

# Recipe lines must start with a tab character
all: $(OBJS)
	$(NVCC) $(OBJS) -o $(EXEC) $(LIBS) -pg
# -arch=sm_35 matches Compute Capability 3.5 (Tesla K20c)
%.o: %.cu
	$(NVCC) -c $< -o $@ -g -G -arch=sm_35
%.o: %.cpp
	$(CXX) -c $(CXXFLAGS) $(INCPATH) $< -o $@
%.o: %.c
	$(CC) -c $(CFLAGS) $(INCPATH) $< -o $@
#########################################################

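Suppose you are handed someone else's parallel program.
Here is a good method to try that may improve its performance.

The ILP method <= so the ILP we learned back in school can be used like this!
• Merge multiple threads -> ILP increases -> a chance to coalesce global memory accesses -> fewer blocks -> more registers used per thread -> occupancy decreases
(Vasily Volkov, “Better Performance at Lower Occupancy”)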

First, what is occupancy?
• Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently. (In other words: does the number of threads you run at any one time fill up the maximum number of threads the GPU can run at once?)
From Optimizing CUDA – Part II © NVIDIA Corporation 2009
• Suppose one SMX of some GPU can run at most 1536 threads concurrently and has 32K registers
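With those two limits, occupancy can be estimated by hand. The host-only sketch below works through the arithmetic under simplifying assumptions: 32K is taken as 32768 registers, the warp size is 32, and register allocation granularity, the per-SMX block limit, and shared memory are all ignored. The names SmLimits and occupancy are illustrative.

```cuda
#include <algorithm>

// Per-multiprocessor limits used in the estimate.
struct SmLimits {
    int max_threads;     // e.g. 1536
    int max_registers;   // e.g. 32768
    int warp_size;       // 32
};

// Simplified occupancy estimate: resident warps / maximum resident warps.
double occupancy(SmLimits sm, int threads_per_block, int regs_per_thread) {
    int blocks_by_threads = sm.max_threads / threads_per_block;
    int blocks_by_regs = sm.max_registers / (threads_per_block * regs_per_thread);
    int blocks = std::min(blocks_by_threads, blocks_by_regs);
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    int max_warps = sm.max_threads / sm.warp_size;
    return static_cast<double>(blocks * warps_per_block) / max_warps;
}
```

For blocks of 256 threads on this example SMX: at 20 registers per thread, six blocks fit (6 x 8 = 48 of 48 warps, occupancy 1.0); at 32 registers per thread only four blocks fit (32 of 48 warps, about 0.67); at 64 registers per thread two blocks fit (16 of 48 warps, about 0.33). This is why raising the register count per thread lowers occupancy.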

An NVIDIA engineer
(http://stackoverflow.com/users/749748/harrism)
said on Stack Overflow:
• In general, as Jared mentions, using too many registers per thread is
not desirable because it reduces occupancy, and therefore reduces
latency hiding ability in the kernel. GPUs thrive on parallelism and do
so by covering memory latency with work from other threads.
• Therefore, you should probably not optimize arrays into registers.
Instead, ensure that your memory accesses to those arrays across
threads are as close to sequential as possible so you maximize
coalescing (i.e. minimize memory transactions).
http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable

In other words, whether occupancy is high or low,
give memory accesses the chance to coalesce.

More on how ILP affects NVIDIA GPUs
http://continuum.io/blog/cudapy_ilp_opt
[Figure: moving data between the cores and the memory controller]

What does that effect look like in code?
For ILP = 2, in pseudocode:
# read
i = thread.id
ai = a[i]
bi = b[i]
j = i+5
aj = a[j]
bj = b[j]
# compute
ci = core(ai, bi)
cj = core(aj, bj)
# write
c[i] = ci
c[j] = cj
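The same read / compute / write pattern can be written as a CUDA kernel. This is a sketch rather than the code behind the slides: the core operation is assumed to be an addition, and the second element is taken at offset blockDim.x (instead of the fixed offset in the pseudocode) so that neighbouring threads still touch neighbouring addresses. The names add_ilp2 and add_with_ilp2 are illustrative.

```cuda
#include <cuda_runtime.h>
#include <vector>

// ILP = 2: each thread processes two independent elements, i and j.
__global__ void add_ilp2(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * (2 * blockDim.x) + threadIdx.x;
    int j = i + blockDim.x;
    // read: the four loads do not depend on each other
    float ai = 0.0f, bi = 0.0f, aj = 0.0f, bj = 0.0f;
    if (i < n) { ai = a[i]; bi = b[i]; }
    if (j < n) { aj = a[j]; bj = b[j]; }
    // compute: two independent operations
    float ci = ai + bi;
    float cj = aj + bj;
    // write
    if (i < n) c[i] = ci;
    if (j < n) c[j] = cj;
}

// Hypothetical host helper; assumes a.size() == b.size() and a non-empty input.
std::vector<float> add_with_ilp2(const std::vector<float> &a,
                                 const std::vector<float> &b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    std::vector<float> c(n);
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 128;                                  // assumed block size
    int per_block = 2 * threads;                        // elements per block
    int blocks = (n + per_block - 1) / per_block;       // half as many blocks
    add_ilp2<<<blocks, threads>>>(da, db, dc, n);
    cudaMemcpy(c.data(), dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return c;
}
```

Compared with one element per thread, the grid has half as many blocks and each thread holds more values in registers, which matches the chain described for the ILP method.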

With ILP = 4, the practical effect is that the GPU pipeline is used more effectively
http://continuum.io/blog/cudapy_ilp_opt

Summary of the main ideas above
• Hide latency = do other operations while waiting for latency
• Increase ILP
• Increase occupancy

As just mentioned for the ILP method,
the number of registers a thread uses is an important consideration.

Interpreting Output of --ptxas-options=-v
http://stackoverflow.com/questions/12388207/interpreting-output-of-ptxas-options-v
http://stackoverflow.com/questions/7241062/is-local-memory-slower-than-shared-memory-in-cuda
• Each CUDA thread is using 46 registers?
Yes, correct
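• There is no register spilling to local memory? (Local memory is per-thread memory that resides in device memory; it is not shared memory.)
Yes, correct
• Is 72 bytes the sum-total of the memory for the stack frames of the __global__ (the parallel routines you write, i.e. kernels) and __device__ (routines called from __global__ functions) functions?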
Yes, correct

How do I limit the number of registers a thread uses?
• control register usage with the nvcc flag: --maxrregcount
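Besides the file-wide --maxrregcount flag, CUDA also offers the __launch_bounds__ qualifier, which lets the compiler derive a register budget for one kernel. A sketch, with illustrative values (256 threads per block, at least 4 resident blocks per multiprocessor, a cap of 32 registers) and a hypothetical file name kernel.cu:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Possible compile lines (kernel.cu is a hypothetical file name):
//   nvcc -arch=sm_35 --ptxas-options=-v -c kernel.cu   # report register usage
//   nvcc -arch=sm_35 --maxrregcount=32 -c kernel.cu    # cap registers for the whole file

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// promises at most 256 threads per block and asks for at least 4 resident blocks.
__global__ void __launch_bounds__(256, 4)
saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical host helper; assumes x.size() == y.size() and a non-empty input.
void saxpy_on_gpu(float a, const std::vector<float> &x, std::vector<float> &y) {
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 256;                       // must not exceed the launch bound
    saxpy<<<(n + threads - 1) / threads, threads>>>(a, dx, dy, n);
    cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}
```

A kernel this small is unlikely to come near any register limit; the point is only where the qualifier and the flags go. Lowering the limit too far forces spills to local memory, which the ptxas report makes visible.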

If the total number of registers the threads need exceeds the number of registers on the GPU,
what does the compiler do?

A Stack Overflow expert says:
• PTX level allows many more virtual registers than the hardware.
Those are mapped to hardware registers at load time. The register
limit you specify allows you to set an upper limit on the hardware
resources used by the generated binary. It serves as a heuristic for the
compiler to decide when to spill (see below) registers when compiling
to PTX already so certain concurrency needs can be met.
• For Fermi GPUs there are at most 64 hardware registers. The 64th is
used by the ABI as the stack pointer and thus for "register spilling" (it
means freeing up registers by temporarily storing their values on the
stack and happens when more registers are needed than available) so
it is untouchable.

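We just said that spending more registers buys time through memory coalescing.
But using too many registers (spilling) increases memory access time.
So what should we do?

Ha! However much we talk, you only find out by coding.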

Where can my program place the data it needs?

Mohamed Zahran,
“Lecture 6: CUDA Memories”
• Access speed: shared memory > constant memory > global memory

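How does the way data is declared determine which memory is accessed?

(The description is inaccurate; it depends on where the compiler places the data.)
A Stack Overflow expert says: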
• Dynamically indexed arrays cannot be stored in registers, because the GPU
register file is not dynamically addressable.
• Scalar variables are automatically stored in registers by the compiler.
• Statically-indexed (i.e. where the index can be determined at compile
time), small arrays (say, less than 16 floats) may be stored in registers by the
compiler.
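The declarations below illustrate where each kind of variable usually ends up. This is a sketch: the names (coeff, bias, tile, memory_spaces, run_memory_spaces) are made up for illustration, and the register placement in the comments is what the compiler typically does, not a guarantee.

```cuda
#include <cuda_runtime.h>
#include <vector>

__constant__ float coeff[4];     // constant memory, set from the host
__device__ float bias = 0.5f;    // global memory, lives for the whole program

__global__ void memory_spaces(const float *in, float *out, int n) {
    __shared__ float tile[256];                     // shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // scalar: normally a register
    float acc = 0.0f;                               // scalar: normally a register
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // in and out point to global memory
    __syncthreads();
    float w[4];   // small array; indices are compile-time constants after unrolling,
                  // so the compiler may keep it in registers
    #pragma unroll
    for (int k = 0; k < 4; ++k) w[k] = coeff[k];
    #pragma unroll
    for (int k = 0; k < 4; ++k) acc += w[k] * tile[threadIdx.x];
    if (i < n) out[i] = acc + bias;
}

// Hypothetical host helper; the block size must match the size of tile (256).
std::vector<float> run_memory_spaces(const std::vector<float> &in) {
    const float h_coeff[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
    int n = static_cast<int>(in.size());            // assumes n > 0
    size_t bytes = n * sizeof(float);
    std::vector<float> out(n);
    float *din = nullptr, *dout = nullptr;
    cudaMalloc(&din, bytes);
    cudaMalloc(&dout, bytes);
    cudaMemcpy(din, in.data(), bytes, cudaMemcpyHostToDevice);
    memory_spaces<<<(n + 255) / 256, 256>>>(din, dout, n);
    cudaMemcpy(out.data(), dout, bytes, cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);
    return out;
}
```

If w were indexed with a value known only at run time, the compiler would have to place it in local memory instead, which is the case the first quoted point describes.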

Summing two vectors
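Jason Sanders, Edward Kandrot, “CUDA by Example”

Where does the data come from? It is copied from CPU memory to global
memory

How do I call the parallel routine I wrote?
• The call must specify how many threads each block has and how many blocks the grid has
• In the launch configuration <<<blocks, threads>>>, the call above means the grid has N blocks and each block runs 1 thread

Then copy the result from GPU global memory back to CPU memory for
further processing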
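Put together, the three steps (copy in, launch, copy out) can be sketched as follows. It follows the N blocks of 1 thread configuration described above; the helper name sum_vectors and the minimal error handling are illustrative assumptions rather than the book's exact code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per block, so the block index alone identifies the element.
__global__ void add(const int *a, const int *b, int *c, int n) {
    int tid = blockIdx.x;
    if (tid < n) c[tid] = a[tid] + b[tid];
}

// Hypothetical host helper. Returns false if the inputs are unusable
// or the CUDA runtime reports an error.
bool sum_vectors(const std::vector<int> &a, const std::vector<int> &b,
                 std::vector<int> &c) {
    if (a.empty() || a.size() != b.size()) return false;
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(int);
    c.resize(n);
    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    cudaMalloc(&dev_a, bytes);
    cudaMalloc(&dev_b, bytes);
    cudaMalloc(&dev_c, bytes);
    // 1. CPU memory -> GPU global memory
    cudaMemcpy(dev_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b.data(), bytes, cudaMemcpyHostToDevice);
    // 2. launch: n blocks, 1 thread per block
    add<<<n, 1>>>(dev_a, dev_b, dev_c, n);
    // 3. GPU global memory -> CPU memory (this copy waits for the kernel)
    cudaMemcpy(c.data(), dev_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return cudaGetLastError() == cudaSuccess;
}
```

One thread per block is the simplest configuration to explain, but each block then occupies a whole warp with only one of its 32 lanes doing work, so real code normally uses many threads per block.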

Summary of the flow above
http://en.wikipedia.org/wiki/CUDA

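Why can the thread count and block count be specified in
1D, 2D, or 3D?

• A block is 9x9 (81 threads); because there are 100 threads, two blocks are needed

When the thread count is not a multiple of 32, choosing among 1D, 2D, and 3D
layouts comes down to comparing which one fills the warps more fully!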
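The comparison can be done with a little arithmetic before launching anything. The host-only sketch below counts warps for a given grid and block shape, using the fact that a block of T threads occupies ceil(T / 32) warps; the function names are illustrative and a warp size of 32 is assumed.

```cuda
#include <cuda_runtime.h>   // for dim3

// Warps occupied by one block: its threads are linearized and cut into groups of 32.
int warps_per_block(dim3 block) {
    int threads = static_cast<int>(block.x * block.y * block.z);
    return (threads + 31) / 32;
}

// Total warps launched by a grid of such blocks.
int warps_launched(dim3 grid, dim3 block) {
    int blocks = static_cast<int>(grid.x * grid.y * grid.z);
    return blocks * warps_per_block(block);
}

// Fraction of launched warp lanes that do useful work.
double lane_utilization(int useful_threads, dim3 grid, dim3 block) {
    return static_cast<double>(useful_threads) / (32.0 * warps_launched(grid, block));
}
```

For 100 useful threads, two 9x9 blocks give 2 x 3 = 6 warps (192 lanes, about 52% used), while one 1D block of 128 threads gives 4 warps (128 lanes, about 78% used). The warps_launched event reported by nvprof can be used to confirm such counts on a real run.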

Profiling Tool: nvprof
nvprof --events warps_launched,threads_launched ./<executable> <executable arguments> > result

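Q&A-1: Discussion of flow divergence
• The JIT approach
• Use profiling to learn which cases are true and which are false, then hand them separately to the JIT to run concurrently
• Browsers use this approach to speed up processing
• This approach uses a lot of memory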

Q&A-2: NVIDIA/AMD
• NVIDIA
• Laptops, servers
• AMD
• Mobile phones

Q&A-3: Discussion of Single Instruction, Multiple Addresses
• How compilers handle random access
• Pointer (points-to) analysis

Q&A-4:
• CUDA LLVM Compiler
• CUDA currently does not support OpenCL 2.0
• https://developer.nvidia.com/opencl

Q&A-5: Discussion of tracing code
• cuda-gdb
• http://docs.nvidia.com/cuda/cuda-gdb/#axzz34ufkPsqt
• EX:
• Note: For disassembly instruction to work
properly, cuobjdump must be installed and present in your $PATH.

Q&A-6: Where is GPU machine code placed for execution?
-> Not sure whether locality is discussed for GPUs.

Q&A-7: Is there a benefit to splitting a function up and parallelizing the parts?
• Function()
function1()
function2()
function3()
• ?

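Q&A-8: Parallelizing collision avoidance for a 5-axis machine
• Every time the cutter moves one step, use the GPU to check for a collision
• Problem: the GPU draws power continuously
• If the 5-axis machine carves for a whole day, would the GPU's power consumption not be alarming?
• Trade-off: power consumption vs. speed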

CUDA Toolkit Documentation
• http://docs.nvidia.com/cuda/index.html#axzz33uurtJU9
