[若渴計畫] From GPU Hardware Concepts to Coding CUDA

Material I read six months ago, organized and shared at the 若渴計畫 study group.

Notes for slides
  • splashtop
  • There was a question raised back then
  • About how to explain a programming model: at the last seminar I heard an explanation I liked,
    namely to describe what design concepts using that programming model brings you.

  • So next we explain SIMT
  • This is how, in CUDA, you describe what each thread does.
    i identifies a particular thread i.
    Element i of array b and element i of array c are added and stored into element i of array a.
    The hardware resources each thread occupies are shown in the figure on the right.
  • Each thread i fetches element b[i] of some lookup table lut.
    This means every thread can compute its own memory address to access.
  • A warp: each warp is made up of 32 threads.
    32 threads that originally execute together in lockstep within one warp can end up being executed as two warps, one after the other.
  • Latency hiding: while waiting on latency, go do something else.
  • When each thread uses the maximum number of registers,
    8 warps can execute at the same time,
    so if the design had 9 dispatch units, one dispatch unit would be redundant in that situation.
  • [若渴計畫] From GPU Hardware Concepts to Coding CUDA

    1. From GPU Hardware Concepts to Coding CUDA (AJ, 2014.6.17)
    2. Is a GPU only good as a graphics card? Can it be used for parallel computation?
    3. Two major vendors • NVIDIA • AMD • Both vendors offer open-source projects that enthusiasts can join • Join to do what? I have not looked into it yet • My lab only has NVIDIA cards, so NVIDIA it is ~"~ • What programming model do you use to program an NVIDIA card? • The single-instruction, multiple-thread (SIMT) programming model. What design concepts does this model bring you?
    4. Starting from the design concepts of the NVIDIA GPU
    5. On an NVIDIA GPU, SIMT can be viewed through three properties • Single instruction, multiple register sets • Single instruction, multiple addresses • Single instruction, multiple flow paths http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
    6. Single Instruction, Multiple Register Sets
        for(i=0;i<n;++i) a[i]=b[i]+c[i];
        __global__ void add(float *a, float *b, float *c) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            a[i]=b[i]+c[i]; //no loop!
        }
        Costs: • each thread gets its own register set, so redundancy occurs. http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
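        A hedged host-side sketch of how a kernel like this is typically launched (the slide only shows the kernel; the element count n, the block size of 256, and the device pointers d_a/d_b/d_c are my assumptions):
            int n = 1 << 20;                                          // number of elements (assumption)
            int threadsPerBlock = 256;                                // a common choice, not from the slide
            int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // round up
            add<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c);          // one thread per element
            // If n is not a multiple of the block size, the kernel also needs an
            // "if (i < n)" guard, which the slide's minimal version omits.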
    7. Single Instruction, Multiple Addresses
        __global__ void apply(short* a, short* b, short* lut) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            a[i] = lut[b[i]]; //indirect memory access
            // a[i] = b[i]
        }
        Cost: • For DRAM, random access is inefficient compared with sequential access. • For shared memory, random access is slowed down by bank contention. (Shared memory is not discussed here.) http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
    8. Single Instruction, Multiple Flow Paths http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
        __global__ void find(int* vec, int len, int* ind, int* nfound, int nthreads) {
            int tid = blockIdx.x * blockDim.x + threadIdx.x;
            int last = 0;
            int* myind = ind + tid*len;
            for(int i=tid; i<len; i+=nthreads) {
                if( vec[i] ) { //flow divergence
                    myind[last] = i;
                    last++;
                }
            }
            nfound[tid] = last;
        }
        (Figure: a vec array of length len read with coalescing; per-thread registers; the if(vec[i]) branch is taken by some threads and not by others.)
    9. Those are the SIMT design properties. Now let's look at the Kepler GK110 chip block diagram.
    10. • 15 SMX (streaming multiprocessors) x 192 cores each • 4 warp schedulers per SMX • 65,536 registers per SMX. From the NVIDIA Kepler GK110 architecture whitepaper
    11. • What is the warp scheduler for? • It allocates resources inside the SMX. From the NVIDIA Kepler GK110 architecture whitepaper (figure: warp1, warp2)
    12. Warps operate in SIMT fashion 1. In NVIDIA terms, a "warp" is a group of (32) threads that run together, and each thread needs its own registers. 2. Within a warp, execution is SIMT, i.e. the 32 threads execute the same instruction. When flow diverges, the hardware handles it by splitting the work into multiple warp passes, and performance suffers. (James Balfour, "CUDA Threads and Atomics", p.13~p.18) http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
    13. Warps operate in SIMT fashion (cont.) 1. Several warps make up a "block"; a block is mapped to one SMX, and the warp scheduler inside the SMX switches among the warps of that block for execution. Each warp has its own register set. 2. As the figure shows, warp scheduling within a block has zero overhead (fast context switch), because all state is held in the register sets; a warp can be active or suspended. http://www.yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html 3. You can specify how many threads a block has, but the maximum number of threads per block depends on the compute capability the hardware supports.
    14. How thread IDs map to warps • Warp ID (warpid) • How do you know which warp a given thread in a block belongs to? threadIdx.x / 32 • Thread ID = 0 ~ 31 -> warp 0 • Thread ID = 32 ~ 63 -> warp 1 • … http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf , p.2
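        A small hedged sketch of that mapping inside a kernel (warp_id/lane_id are illustrative names; PTX also exposes a %warpid special register, but the plain division below is what the slide describes):
            __global__ void show_warp_mapping(int *warp_id, int *lane_id) {
                int tid = threadIdx.x;        // thread index within its block
                warp_id[tid] = tid / 32;      // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
                lane_id[tid] = tid % 32;      // position of the thread inside its warp
            }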
    15. Take the GPU principles above (and more, of course), add integration with the CPU, and you get the CUDA coding environment.
    16. Things to watch when using CUDA • Which NVIDIA GPU architecture you are using. • NVIDIA Tesla K20c • From https://developer.nvidia.com/cuda-gpus you can see that the Tesla K20c has Compute Capability 3.5. • To set up the CUDA environment, see http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#axzz33nDhVV00 . The compiler is called nvcc. • The latest CUDA version is 6.0, but I installed 5.0 XD (too lazy to upgrade, ha). • After installing CUDA, run the bundled deviceQuery executable to check that the installation is correct.
    17. /usr/local/cuda-5.0/samples/1_Utilities/deviceQuery
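        Besides running the prebuilt deviceQuery sample, the same information can be read in your own code with cudaGetDeviceProperties; a minimal sketch (compile with nvcc):
            #include <cstdio>
            #include <cuda_runtime.h>

            int main() {
                cudaDeviceProp prop;
                cudaGetDeviceProperties(&prop, 0);                     // query device 0
                printf("name: %s\n", prop.name);
                printf("compute capability: %d.%d\n", prop.major, prop.minor);
                printf("multiprocessors (SMX): %d\n", prop.multiProcessorCount);
                printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
                printf("32-bit registers per block: %d\n", prop.regsPerBlock);
                return 0;
            }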
    18. You might wonder: how does the same CUDA program manage to run on GPUs with different numbers of SMXs?
    19. Block Scalability
    20. Program Compilation
    21. CUDA 5: Separate Compilation & Linking. From Introducing CUDA 5.pdf
    22. Makefile example
    23. ##########################################################
        # compiler setting
        ##########################################################
        CC = gcc
        CXX = g++
        NVCC = nvcc
        CFLAGS = -g -Wall
        CXXFLAGS = $(CFLAGS) -Weffc++ -pg
        LIBS = -lm -lglut -lGLU -lGL
        INCPATH = -I./
    24. OBJS = main.o c_a.o c_b.o cpp_a.o cpp_b.o cu_a.o cu_b.o
        EXEC = output
    25. all: $(OBJS)
            $(NVCC) $(OBJS) -o $(EXEC) $(LIBS) -pg
        %.o:%.cu
            $(NVCC) -c $< -o $@ -g -G -arch=sm_35
        %.o:%.cpp
            $(CXX) -c $(CXXFLAGS) $(INCPATH) $< -o $@
        %.o:%.c
            $(CC) -c $(CFLAGS) $(INCPATH) $< -o $@
        #########################################################
    26. Suppose you are handed someone else's parallelized program; here is a promising method you can try to improve its performance.
    27. The ILP method (so the ILP we learned back in school can be used like this!!) • Merge several threads into one -> ILP increases -> better chance to coalesce global memory accesses -> fewer blocks -> each thread uses more registers -> occupancy drops (Vasily Volkov, "Better Performance at Lower Occupancy")
    28. First, what is occupancy? • Occupancy = number of warps running concurrently on a multiprocessor divided by the maximum number of warps that can run concurrently. (In other words: does the number of threads you actually run at a time fill up the maximum number the GPU can run at a time?) From Optimizing CUDA – Part II © NVIDIA Corporation 2009 • Suppose one SMX of some GPU can run at most 1536 threads at a time and has 32K registers
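        A worked example with the limits assumed on this slide (1536 resident threads, i.e. 48 warps, and 32K = 32768 registers per SMX; real limits depend on the GPU):
        • 32768 registers / 1536 threads ≈ 21, so as long as each thread uses about 21 registers or fewer, the register file does not limit occupancy (48 of 48 warps can be resident).
        • If each thread needs 64 registers, only 32768 / 64 = 512 threads = 16 warps fit, so occupancy = 16 / 48 ≈ 33%.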
    29. An NVIDIA engineer (http://stackoverflow.com/users/749748/harrism) says on Stack Overflow • In general, as Jared mentions, using too many registers per thread is not desirable because it reduces occupancy, and therefore reduces latency hiding ability in the kernel. GPUs thrive on parallelism and do so by covering memory latency with work from other threads. • Therefore, you should probably not optimize arrays into registers. Instead, ensure that your memory accesses to those arrays across threads are as close to sequential as possible so you maximize coalescing (i.e. minimize memory transactions). http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable
    30. In other words, whether or not occupancy is high, give memory accesses the chance to be coalesced when reading.
    31. Continuing on how ILP affects the NVIDIA GPU http://continuum.io/blog/cudapy_ilp_opt (figure: moving data between the Core and the Memory controller)
    32. What does this effect look like in code? With ILP = 2, in pseudocode (shown on the right of the slide):
        # read
        i = thread.id
        ai = a[i]
        bi = b[i]
        j = i+5
        aj = a[j]
        bj = b[j]
        # compute
        ci = core(ai, bi)
        cj = core(aj, bj)
        # write
        c[i] = ci
        c[j] = cj
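        A hedged CUDA sketch of the same ILP = 2 idea (the slide's pseudocode offsets the second element by 5; here the offset is the total number of launched threads so that different threads never overlap, and core() is just a placeholder for the real per-element computation):
            __device__ float core(float x, float y) { return x + y; }   // placeholder computation

            // Assumes the arrays hold at least twice as many elements as there are threads.
            __global__ void ilp2(float *c, const float *a, const float *b) {
                int i = blockIdx.x * blockDim.x + threadIdx.x;
                int j = i + gridDim.x * blockDim.x;   // second, independent element

                // read: the two independent load pairs can be in flight at the same time
                float ai = a[i], bi = b[i];
                float aj = a[j], bj = b[j];

                // compute
                float ci = core(ai, bi);
                float cj = core(aj, bj);

                // write
                c[i] = ci;
                c[j] = cj;
            }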
    33. With ILP=4, the practical effect is that the GPU pipeline is kept fuller. http://continuum.io/blog/cudapy_ilp_opt
    34. Summary of the main ideas above • Hide latency = do other operations when waiting for latency • Increase ILP • Increase occupancy
    35. As just mentioned for the ILP method, the number of registers a thread uses is an important consideration.
    36. Interpreting the output of --ptxas-options=-v http://stackoverflow.com/questions/12388207/interpreting-output-of-ptxas-options-v http://stackoverflow.com/questions/7241062/is-local-memory-slower-than-shared-memory-in-cuda • Each CUDA thread is using 46 registers? Yes, correct • There is no register spilling to local memory? Yes, correct • Is 72 bytes the sum-total of the memory for the stack frames of the __global__ (the parallelized kernels you write) and __device__ (subroutines called from __global__ functions) functions? Yes, correct
    37. How do I limit the number of registers a thread uses? • Control register usage with the nvcc flag: --maxrregcount
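        For example (the file name and the cap of 32 registers are illustrative), combining the register report from slide 36 with a register cap:
            nvcc -c kernel.cu -arch=sm_35 --ptxas-options=-v --maxrregcount=32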
    38. If the total number of registers allocated to the threads exceeds the number of registers on the GPU, what does the compiler do?
    39. A Stack Overflow expert explains • PTX level allows many more virtual registers than the hardware. Those are mapped to hardware registers at load time. The register limit you specify allows you to set an upper limit on the hardware resources used by the generated binary. It serves as a heuristic for the compiler to decide when to spill (see below) registers when compiling to PTX already so certain concurrency needs can be met. • For Fermi GPUs there are at most 64 hardware registers. The 64th is used by the ABI as the stack pointer and thus for "register spilling" (it means freeing up registers by temporarily storing their values on the stack and happens when more registers are needed than available) so it is untouchable. http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable
    40. We just said we spend extra registers to gain time for memory coalescing, but using too many registers increases memory access time. So what do we do?
    41. Ha! However much we argue about it, we only find out by coding it~~~~~
    42. Where can I tell my program to place the data it needs?
    43. Mohamed Zahran, "Lecture 6: CUDA Memories" • Access speed: shared memory > constant memory > global memory
    44. How does the way data is declared determine which kind of memory it is stored in?
    45. (The description above is not exact; it depends on where the compiler actually places things.) A Stack Overflow expert: • Dynamically indexed arrays cannot be stored in registers, because the GPU register file is not dynamically addressable. • Scalar variables are automatically stored in registers by the compiler. • Statically-indexed (i.e. where the index can be determined at compile time), small arrays (say, less than 16 floats) may be stored in registers by the compiler. http://stackoverflow.com/questions/12167926/forcing-cuda-to-use-register-for-a-variable
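        A hedged sketch illustrating those three rules (the compiler's actual choices can differ; the names are made up for this example):
            __global__ void placement_demo(float *out, const float *in, int k) {
                float s = in[threadIdx.x];            // scalar: normally kept in a register

                float small4[4];                      // small, statically indexed array:
                small4[0] = s;        small4[1] = s * 2.0f;   // may also end up in registers
                small4[2] = s * 3.0f; small4[3] = s * 4.0f;

                float dyn[4];                         // indexed with k, unknown at compile time:
                dyn[k & 3] = s;                       // cannot live in registers, so it is
                                                      // placed in local memory
                out[threadIdx.x] = small4[3] + dyn[k & 3];
            }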
    46. Let's look at a simple example.
    47. Summing two vectors. Jason Sanders, Edward Kandrot, "CUDA by Example"
    48. Where does the data come from? It is copied from CPU memory to global memory. Jason Sanders, Edward Kandrot, "CUDA by Example"
    49. How do I call the parallel routine I wrote? • At launch you specify how many threads each block has and how many blocks the grid has • The launch above means the grid has N blocks, each executing with 1 thread (figure labels: blocks, threads) Jason Sanders, Edward Kandrot, "CUDA by Example"
    50. Write the results back from GPU global memory to CPU memory for further processing. Jason Sanders, Edward Kandrot, "CUDA by Example"
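        A minimal self-contained sketch of the whole flow from slides 47-50. This is my own reconstruction in the spirit of the CUDA by Example sample, not the book's exact code, and error checking is omitted:
            #include <cstdio>
            #include <cuda_runtime.h>

            #define N 1024

            __global__ void add(int *a, int *b, int *c) {
                int i = blockIdx.x;            // one thread per block, matching the <<<N,1>>> launch on slide 49
                if (i < N) c[i] = a[i] + b[i];
            }

            int main() {
                int a[N], b[N], c[N];
                for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }      // fill inputs on the CPU

                int *d_a, *d_b, *d_c;
                cudaMalloc((void**)&d_a, N * sizeof(int));                   // allocate GPU global memory
                cudaMalloc((void**)&d_b, N * sizeof(int));
                cudaMalloc((void**)&d_c, N * sizeof(int));

                cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice); // CPU memory -> global memory
                cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

                add<<<N, 1>>>(d_a, d_b, d_c);                                // N blocks, 1 thread each

                cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost); // global memory -> CPU memory

                printf("c[10] = %d\n", c[10]);
                cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
                return 0;
            }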
    51. Putting the above flow together http://en.wikipedia.org/wiki/CUDA
    52. Why can the thread count and block count you specify be 1D, 2D, or 3D?
    53. • 1 block 4 (figure)
    54. • One block is 9x9, so with 100 threads there are two blocks (figure)
    55. • 2 blocks (figure)
    56. When the number of threads is not a multiple of 32, the point of comparing the 1D, 2D, and 3D partitions is to see which one packs the warps more fully!!!
    57. How do you measure how long the GPU runs?
    58. Profiling tool: nvprof
        nvprof --events warps_launched,threads_launched ./<executable> <its arguments> > result
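        Besides nvprof, kernel time is often measured directly in code with CUDA events; a hedged fragment (my_kernel, blocks and threads are placeholders for your own launch):
            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);

            cudaEventRecord(start);                     // record just before the launch
            my_kernel<<<blocks, threads>>>(/* args */); // the kernel being timed (placeholder)
            cudaEventRecord(stop);                      // record just after the launch
            cudaEventSynchronize(stop);                 // wait until the kernel has finished

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
            printf("kernel time: %f ms\n", ms);

            cudaEventDestroy(start);
            cudaEventDestroy(stop);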
    59. Q&A
    60. Q&A-1: Discussion of flow divergence • A JIT approach • Profile the program to learn which conditions come out true or false, then split the cases and hand them to a JIT to execute concurrently • Browsers use this kind of approach to speed things up • This approach is very memory-hungry
    61. Q&A-2: NVIDIA / AMD • NVIDIA • laptops, servers • AMD • mobile phones
    62. Q&A-3: Discussion of Single Instruction, Multiple Addresses • How the compiler handles random access • Pointer (points-to) analysis
    63. Q&A-4: • CUDA LLVM Compiler • CUDA currently does not support OpenCL 2.0 • https://developer.nvidia.com/opencl
    64. Q&A-5: Discussion of tracing the code • cuda-gdb • http://docs.nvidia.com/cuda/cuda-gdb/#axzz34ufkPsqt • EX: • Note: For disassembly instructions to work properly, cuobjdump must be installed and present in your $PATH.
    65. Q&A-6: Where is GPU machine code placed for execution? Is locality something people discuss for GPUs?
    66. Q&A-7: Is there a benefit to splitting a function apart for parallelization? • Function() -> function1() function2() function3() • ?
    67. Q&A-8: Collision-avoidance parallelization on a 5-axis machine • Every step the cutter takes, the GPU checks whether anything collides • Problem: the GPU keeps drawing power • If the 5-axis machine engraves all day long, won't the GPU's power consumption be scary? • Trade-off: power consumption vs. speed
    68. CUDA Toolkit Documentation • http://docs.nvidia.com/cuda/index.html#axzz33uurtJU9
