49. 01/28/15 49
Parallel Computing on a GPU
• 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
• Available in laptops, desktops, and clusters
• GPU parallelism is doubling every year
• Programmable in C with CUDA tools
• Multithreaded model exploits application data parallelism and thread parallelism
[Images: GeForce 8800, Tesla S870, Tesla D870]
50.
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU
[Diagram: GeForce 8800 architecture — the Host feeds an Input Assembler and Thread Execution Manager; eight texture units, each with a Parallel Data Cache; Load/store units connect to Global Memory]
51.
Arrays of Parallel Threads
• A CUDA kernel executes as an array of threads
– All threads run the same code
– Each thread has an ID, used to compute memory addresses and to make control decisions
[Diagram: threads with threadID 0 through 7, each running:]
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
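The per-thread code above can be filled out into a complete kernel. This is a minimal sketch: `func`, the array names, and the launch configuration are illustrative assumptions, not from the slides.

```cuda
#include <cuda_runtime.h>

// Hypothetical elementwise function applied by every thread.
__device__ float func(float x) { return 2.0f * x + 1.0f; }

// Every thread runs this same code; threadIdx.x plays the role of
// the slide's threadID within a single block.
__global__ void apply_func(const float *input, float *output) {
    int threadID = threadIdx.x;
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}

// Host-side launch, e.g. with one block of 8 threads as in the diagram:
//   apply_func<<<1, 8>>>(d_input, d_output);
```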
52.
Thread Blocks: Scalable Cooperation
• Divide the monolithic thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
– Threads in different blocks cannot cooperate
[Diagram: Thread Block 0 through Thread Block N - 1; each block contains threads with threadID 0–7, all running the same code: float x = input[threadID]; float y = func(x); output[threadID] = y;]
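Block-level cooperation can be sketched with a hypothetical kernel that reverses elements within each block; the kernel name, block size, and array names are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 8

__global__ void reverse_within_block(const float *input, float *output) {
    // Shared memory is visible to every thread in the same block.
    __shared__ float tile[BLOCK_SIZE];

    int local  = threadIdx.x;
    // The global index combines the block ID and thread ID, so many
    // blocks together cover one large array.
    int global = blockIdx.x * blockDim.x + local;

    tile[local] = input[global];
    __syncthreads();  // barrier: wait until the whole block has loaded

    // Each thread reads an element loaded by another thread in the same
    // block; threads in other blocks are not visible here.
    output[global] = tile[blockDim.x - 1 - local];
}
```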
53.
CUDA Device Memory Allocation (cont.)
Code example:
• Allocate a 64 × 64 single-precision float array
• Attach the allocated storage to Md
• "d" is often used to indicate a device data structure

int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);
cudaFree(Md);
54.
CUDA Host-Device Data Transfer
• cudaMemcpy()
– Memory data transfer
– Requires four parameters:
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type of transfer
– Host to Host (e.g., behavior on the CPU side)
– Host to Device (e.g., sending the compiled program and its input to the graphics card)
– Device to Host (e.g., returning results after computation)
– Device to Device (e.g., compute uni…)
[Diagram: CUDA memory model — the Host connects to the device's Global Memory; a Grid contains Block (0, 0) and Block (1, 0), each with its own Shared Memory; within each block, Thread (0, 0) and Thread (1, 0) have private Registers]
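Putting the allocation and transfer calls together, a minimal host-side round trip might look like the following sketch; the array size and variable names are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

int main(void) {
    const int N = 64;
    int size = N * sizeof(float);

    float h_data[64];                      // host ("h") buffer
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_data;                         // device ("d") buffer
    cudaMalloc((void**)&d_data, size);

    // Host to Device: destination first, then source, byte count, direction.
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    // ... launch kernels operating on d_data here ...

    // Device to Host: copy the results back.
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return 0;
}
```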
58. Types of Parallel Computing Models
• Data parallel – a single instruction operates on multiple data items at once: Single Instruction, Multiple Data (SIMD)
• One controller drives many processors, which apply the same operation to each element of a data set (a "data vector") simultaneously, achieving spatial parallelism.
• Task parallel – different instructions operate on different data: Multiple Instruction, Multiple Data (MIMD)
• SPMD (single program, multiple data): every processor runs the same program, but each on its own data. Note that the processors do not run in lockstep with one another.
• SPMD is equivalent to MIMD, since each MIMD program can be made SPMD (similarly for SIMD, but not in a practical sense).
• Message passing (and MPI) is for MIMD/SPMD parallelism.
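In CUDA terms, both styles above can be sketched in one program; the kernel names and the use of blockIdx to select a task are illustrative assumptions, not a standard pattern from the slides.

```cuda
#include <cuda_runtime.h>

// Data parallel (SIMD/SPMD flavor): every thread runs the same
// instruction stream on its own element of the data vector.
__global__ void scale(float *v, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    v[i] = a * v[i];
}

// Task parallel (MIMD flavor, approximated): different blocks branch
// into different work on different data.
__global__ void two_tasks(float *v, float *w) {
    int i = threadIdx.x;
    if (blockIdx.x == 0)
        v[i] = v[i] + 1.0f;   // task 1 operates on array v
    else
        w[i] = w[i] * 2.0f;   // task 2 operates on array w
}
```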