GPU Programming and SMAC
Case Study
Zhengjie Lu
Master's Student, Electronics Group
Electrical Engineering
Technische Universiteit Eindhoven, NL
Contents
Part 1: GPU Programming
1.1 NVIDIA GPU Hardware
1.2 NVIDIA CUDA Programming
1.3 Programming Environment
Part 2: SMAC Case Study
2.1 SMAC Introduction
2.2 SMAC Mapping
2.3 Experiment & Analysis
Part 3: Conclusion & Future Development
Concepts
1. GPU
• Graphics processing unit (graphics card)
• Chip vendors: NVIDIA and ATI
2. CUDA Programming
• “Compute Unified Device Architecture”
• Supported by NVIDIA
3. SMAC Application
• “Simplified Method for Atmospheric Correction”
Part 1: GPU Programming
1.1 NVIDIA GPU Hardware
• What does an NVIDIA GPU look like?
1.1 NVIDIA GPU Hardware
• Example: NVIDIA 8-Series GPU
− 128 stream processors (SPs), 1.35 GHz per processor
− 16 shared memories: one shared by every 8 SPs; small but fast
− 1 global memory: shared by all 128 SPs; slow but large
− 1 constant memory: shared by all 128 SPs; small but fast
1.1 NVIDIA GPU Hardware
[Figure: GPU architecture — stream multi-processors (SMs), each containing stream processors (SPs) and a shared memory, plus the global/constant memory shared by all SMs]
1.1 NVIDIA GPU Hardware
• Connection between GPU and CPU
− CPU => main memory => GPU global memory => GPU
− GPU => GPU global memory => main memory => CPU
1.1 NVIDIA GPU Hardware
• Hardware summary:
− Multi-threading is supported in hardware by the SPs.
− SPs inside an SM communicate with each other through the shared memory.
− SMs communicate with each other through the global memory.
− GPU and CPU communicate with each other through their memories: global memory ↔ main memory.
1.2 NVIDIA CUDA Programming
• CUDA programming concepts
− Thread: the basic execution unit
− Block: a collection of threads
− Grid: a collection of blocks
1.2 NVIDIA CUDA Programming
• CUDA programming concepts
− A grid is mapped onto the GPU by the scheduler
− A block is mapped onto an SM by the scheduler
− A thread is mapped onto an SP by the scheduler (see the sketch below)
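As a sketch of how this hierarchy shows up in code (the kernel name, sizes, and pointer argument are invented for illustration), a launch configuration fixes the grid and block shape, and each thread derives a unique global index from the built-in coordinates:

// Each thread derives a unique global index from its block and thread
// coordinates (the scheduler maps blocks onto SMs and threads onto SPs).
__global__ void example_kernel(float* data)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    data[gid] = 2.0f * data[gid];
}

// Launch a grid of 4 blocks, each block holding 192 threads.
void launch_example(float* deviceData)
{
    dim3 block(192, 1, 1);
    dim3 grid(4, 1, 1);
    example_kernel<<<grid, block>>>(deviceData);
}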
1.2 NVIDIA CUDA Programming
1.2 NVIDIA CUDA Programming
• CPU programming convention:
i. Allocate the CPU memory
ii. Run the CPU kernel
• CUDA programming convention:
i. Allocate the GPU memory
ii. Copy the input to the GPU memory
iii. Run the GPU kernel
iv. Copy the output from the GPU memory
1.2 NVIDIA CUDA Programming
• Example:
/*******************************************************/
/* File: main.c                                        */
/* Description: 8x8 matrix addition on CPU             */
/*******************************************************/
// Data definition
int mat1[64] = {…};
int mat2[64] = {…};
int mat3[64];
// Matrix addition on CPU
void matrixAdd_CPU(int index, int* IN1, int* IN2, int* OUT);
// Main body
int main()
{
    // Run the matrix addition on CPU
    matrixAdd_CPU(64, mat1, mat2, mat3);
    return 0;
}

/****************************************************/
/* File: main.cu                                    */
/* Description: 8x8 matrix addition on GPU          */
/****************************************************/
// Data definition
int mat1[64] = {…};
int mat2[64] = {…};
int mat3[64];
// Matrix addition on GPU
void matrixAdd_GPU(int index, int* IN1, int* IN2, int* OUT);
// Main body
int main()
{
    // Run the matrix addition on GPU
    matrixAdd_GPU(64, mat1, mat2, mat3);
    return 0;
}
1.2 NVIDIA CUDA Programming
/**************************************************************/
/* File: main.c                                               */
/* Description: 8x8 matrix addition on CPU                    */
/**************************************************************/
// Matrix addition on CPU
void matrixAdd_CPU(int index, int* IN1, int* IN2, int* OUT)
{
    int i;
    for (i = 0; i < index; i++) {
        OUT[i] = IN1[i] + IN2[i];
    }
}
/****************************************************/
/* File: main.cu                                    */
/* Description: 8x8 matrix addition on GPU          */
/****************************************************/
// Matrix addition on GPU
void matrixAdd_GPU(int index, int* IN1, int* IN2, int* OUT)
{
    int* deviceIN1;
    int* deviceIN2;
    int* deviceOUT;
    dim3 grid(1, 1, 1);
    dim3 block(index, 1, 1);
    // Allocate GPU memory
    cutilSafeCall(cudaMalloc((void**) &deviceIN1, sizeof(int) * index));
    cutilSafeCall(cudaMalloc((void**) &deviceIN2, sizeof(int) * index));
    cutilSafeCall(cudaMalloc((void**) &deviceOUT, sizeof(int) * index));
    // Copy the input
    cutilSafeCall(cudaMemcpy(deviceIN1, IN1, sizeof(int) * index, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(deviceIN2, IN2, sizeof(int) * index, cudaMemcpyHostToDevice));
    // Run the matrix addition
    GPU_kernel<<<grid, block>>>(deviceIN1, deviceIN2, deviceOUT);
    // Copy the output
    cutilSafeCall(cudaMemcpy(OUT, deviceOUT, sizeof(int) * index, cudaMemcpyDeviceToHost));
    // Free GPU memory
    cudaFree(deviceIN1);
    cudaFree(deviceIN2);
    cudaFree(deviceOUT);
}
/*****************************************/
/* File: main.cu                         */
/* Description: GPU kernel               */
/*****************************************/
__global__ void GPU_kernel(int* IN1, int* IN2, int* OUT)
{
    int tx = threadIdx.x;            // each thread adds one element pair
    OUT[tx] = IN1[tx] + IN2[tx];
}
1.2 NVIDIA CUDA Programming
• CUDA programming optimizations
1) Use the registers and shared memories
2) Maximize the number of threads per block
3) Global memory access coalescing
4) Avoid shared memory bank conflicts
5) Group the byte accesses
6) Stream execution
1.2 NVIDIA CUDA Programming
1) Use the registers and shared memories
− Registers are the fastest (8192 32-bit registers per SM)
− Shared memory is fast but small (16 KB per SM)
− Global memory is slow but large (at least 256 MB)
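A minimal sketch of point 1 (the kernel name, block size, and averaging operation are invented): each global word is loaded into shared memory once, then read twice from the fast on-chip copy:

// Each thread averages its element with its right neighbor. Staging the
// block's elements in shared memory loads each global word once and then
// reads it twice from the fast on-chip memory.
__global__ void neighbor_avg(const float* in, float* out, int n)
{
    __shared__ float tile[193];                   // blockDim.x (<= 192) + 1 halo word
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    if (gid < n)
        tile[threadIdx.x] = in[gid];              // one coalesced load per thread
    if (threadIdx.x == 0 && gid + blockDim.x < n)
        tile[blockDim.x] = in[gid + blockDim.x];  // thread 0 loads the halo element
    __syncthreads();                              // tile is now fully resident

    if (gid + 1 < n)
        out[gid] = 0.5f * (tile[threadIdx.x] + tile[threadIdx.x + 1]);
}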
1.2 NVIDIA CUDA Programming
2) Maximize the number of threads per block
i. Determine the register/memory budget per thread with the tool cudaProf
ii. Determine the maximum number of threads with the tool cudaCal
iii. Determine the number of blocks (see the sketch below)
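A minimal host-side sketch of step iii, assuming the per-block thread limit from steps i–ii is already known; N, THREADS_PER_BLOCK, and some_kernel (any bounds-checked kernel) are illustrative names:

void launch(int N, float* deviceIn, float* deviceOut)
{
    const int THREADS_PER_BLOCK = 192;   // per-block limit found in steps i-ii
    // Round up so every element is covered even when N is not a multiple of
    // the block size; the kernel must then bounds-check its global index.
    int numBlocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    dim3 block(THREADS_PER_BLOCK, 1, 1);
    dim3 grid(numBlocks, 1, 1);
    some_kernel<<<grid, block>>>(deviceIn, deviceOut, N);
}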
1.2 NVIDIA CUDA Programming
3) Global memory access coalescing
− Global memory access pattern: 16 threads (a half-warp) at a time
− The 16 threads must access 16 contiguous words in global memory
− The 1st thread must access a global memory address that is 16-word aligned (see the sketch below)
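A sketch of the two patterns (kernel names invented): the first kernel satisfies the rules above; the second breaks contiguity with a stride of two words:

// Coalesced: thread k of a half-warp reads word k of a 16-word-aligned run.
__global__ void copy_coalesced(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Non-coalesced: a stride of 2 words breaks the contiguity rule, so the
// half-warp's loads are serialized into separate memory transactions.
__global__ void copy_strided(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[2 * i];
}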
1.2 NVIDIA CUDA Programming
[Figure: one coalesced and three non-coalesced global memory access patterns]
1.2 NVIDIA CUDA Programming
4) Avoid shared memory bank conflicts
− Shared memory access pattern: 16 threads (a half-warp) per access
− 16 KB shared memory: 16 memory banks of 1 KB each
− Two threads should not access different addresses inside the same memory bank (see the sketch below)
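An illustrative kernel, assuming a single 16-thread block (one half-warp); the name and data are invented:

__global__ void bank_demo(float* out)
{
    __shared__ float s[256];
    int tid = threadIdx.x;                 // run with one block of 16 threads
    for (int k = 0; k < 16; k++)           // fill the whole array cooperatively
        s[16 * k + tid] = (float)(16 * k + tid);
    __syncthreads();

    float ok  = s[tid];        // conflict-free: each thread hits a different bank
    float bad = s[16 * tid];   // 16-way conflict: all 16 threads hit bank 0
    out[tid] = ok + bad;
}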
1.2 NVIDIA CUDA Programming
[Figure: one conflict-free and two bank-conflicting shared memory access patterns]
1.2 NVIDIA CUDA Programming
5) Group the byte accesses
− Read or write adjacent bytes as one word instead of byte by byte (see the sketch below)
[Figure: ungrouped (byte-wise) vs. grouped (word-wise) access]
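A sketch of the difference, using CUDA's built-in uchar4 vector type to group four bytes into one 32-bit access (kernel names invented):

// Ungrouped: four separate one-byte transactions per thread.
__global__ void copy_bytes(const unsigned char* in, unsigned char* out)
{
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    out[i]     = in[i];
    out[i + 1] = in[i + 1];
    out[i + 2] = in[i + 2];
    out[i + 3] = in[i + 3];
}

// Grouped: the same four bytes move as one 32-bit uchar4 word.
__global__ void copy_grouped(const uchar4* in, uchar4* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}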
1.2 NVIDIA CUDA Programming
6) Stream execution
− Split the work across streams so that host-device copies in one stream can overlap with kernel execution in another (see the sketch below)
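A hedged sketch of stream execution (the chunking, names, and kernel are invented; the host buffers must be page-locked, e.g. allocated with cudaMallocHost, for the asynchronous copies to actually overlap):

__global__ void stream_kernel(const int* in, int* out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = in[gid] + 1;              // stand-in for the real per-element work
}

void run_streams(const int* hIN, int* hOUT, int* dIN, int* dOUT, int index)
{
    const int NSTREAMS = 8;
    cudaStream_t streams[NSTREAMS];
    int chunk = index / NSTREAMS;        // assume index divides evenly by 8 and 192

    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSTREAMS; s++) {
        int off = s * chunk;
        // Copy-in, kernel, and copy-out are queued in one stream; work queued
        // in different streams may overlap on the device.
        cudaMemcpyAsync(dIN + off, hIN + off, chunk * sizeof(int),
                        cudaMemcpyHostToDevice, streams[s]);
        stream_kernel<<<chunk / 192, 192, 0, streams[s]>>>(dIN + off, dOUT + off);
        cudaMemcpyAsync(hOUT + off, dOUT + off, chunk * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < NSTREAMS; s++) {
        cudaStreamSynchronize(streams[s]);   // wait until the stream's work is done
        cudaStreamDestroy(streams[s]);
    }
}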
1.2 NVIDIA CUDA Programming
• Tips
− Examples: NVIDIA SDK
− Programming: “NVIDIA CUDA Programming Guide”
− Optimization: “NVIDIA CUDA C Programming: Best
Practices Guide”
1.3 Programming Environment
1. Preparation
− Windows: Microsoft Visual C++ 2008 Express
− Linux
1.3 Programming Environment
2. CUDA installation
− Step 1: Download the CUDA package suitable for your operating system
(http://www.nvidia.com/object/cuda_get.html)
− Step 2: Install the CUDA driver
− Step 3: Install the CUDA Toolkit
− Step 4: Install the CUDA SDK
− Step 5: Verify the installation by running the SDK examples
1.3 Programming Environment
3. CUDA project setup (Windows)
− Step 1: Download “CUDA Wizard” and install it
(http://www.comp.hkbu.edu.hk/~kyzhao/)
− Step 2: Open VC++ 2008 Express
− Step 3: Click “File” and choose “New/Project”
− Step 4: Choose “CUDA” in “Project types” and then
select “CUDAWinApp” in “Visual Studio installed
templates”
− Step 5: Name the CUDA project and click “OK”.
− Step 6: Click “Solution Explorer”
1.3 Programming Environment
3. CUDA project setup (Windows)
− Step 7: right click “Source Files” and choose
“Add/New Item…”
− Step 8: Click “Code” in “Categories” and then choose
“C++. File (.cpp)” in “Visual Studio installed
templates”
− Step 9: Name the file “main.cu” and click “Add”
− Step 10: Repeat Steps 7~9 to create another file named “GPU_kernel.cu”
− Step 11: Click “Solution Explorer” and select
“GPU_kernel.cu” under the menu “Source Files”
1.3 Programming Environment
3. CUDA project setup (Windows)
− Step 12: Right click “GPU_kernel.cu”
− Step 13: Click “Configuration Properties” and then
click “General”
− Step 14: Select “Custom Build Tool” in “Tool” and
click OK
− Step 15: Implement your GPU kernel in “GPU_kernel.cu” and everything else in “main.cu”
1.3 Programming Environment
• Tips
− Set up a CUDA project on Linux:
http://sites.google.com/site/5kk70gpu/installation
http://forums.nvidia.com/lofiversion/index.php?f62.html
Part 2: SMAC Case Study
2.1 SMAC Introduction
• SMAC algorithm
− SMAC stands for “Simplified Method for Atmospheric Correction”
− A fast computation of the atmospheric reflections
2.1 SMAC Introduction
• SMAC application profile
Data size: 5781 x 10 x 4 Bytes = 231,240 Bytes
2.1 SMAC Introduction
• SMAC in the satellite data center
[Figure: data-center processing pipeline — the SMAC algorithm followed by image remapping]
2.2 SMAC Mapping
• Mapping approach
2.2 SMAC Mapping
• Mapping approach
− GPU.cu: GPU operation functions
− GPU_kernel.cu: SMAC kernel
2.2 SMAC Mapping
• SMAC kernel on GPU
− Data size: 64 x 5781 x 4 Bytes
− CPU time:
− GPU time:
− Demo
2.3 Experiment & Analysis
• Experiment Preparation
HARDWARE
− CPU: Intel Duo-Core, 2.5 GHz per core
− GPU: NVIDIA 32-core GPU, 0.95 GHz per core
− Main memory: 4 GB
− PCI-E: PCI Express 1.0 x16
− Operating system: Windows Vista Enterprise
− CUDA version: CUDA 1.1
SOFTWARE
− GPU maximum registers per thread: 60
− GPU thread number: 192 x 4 (#threads per block x #blocks)
− CPU thread number: 1
2.3 Experiment & Analysis
• Experiment setup
− Performance
− GPU improvement
\[
\text{GPU Improvement} = \frac{\text{CPU time}}{\text{GPU time}}
\]
\[
\text{CPU time} = t^{\text{CPU}}_{\text{stop}} - t^{\text{CPU}}_{\text{start}}, \qquad
\text{GPU time} = t^{\text{GPU}}_{\text{stop}} - t^{\text{GPU}}_{\text{start}}
\]
2.3 Experiment & Analysis
• Experiment setup
− Linear execution-time prediction
\[
\text{CPU time} = \text{CPU overhead} + \text{Bytes} \times \text{CPU speed}
\]
\[
\begin{aligned}
\text{GPU time} &= \text{GPU memory time} + \text{GPU run time} \\
&= (\text{GPU memory overhead} + \text{Bytes} \times \text{GPU memory speed}) \\
&\quad + (\text{GPU kernel overhead} + \text{Bytes} \times \text{GPU kernel speed}) \\
&= (\text{GPU memory overhead} + \text{GPU kernel overhead}) \\
&\quad + \text{Bytes} \times (\text{GPU memory speed} + \text{GPU kernel speed}) \\
&= \text{GPU overhead} + \text{Bytes} \times \text{GPU speed}
\end{aligned}
\]
\[
\text{Improvement} = \frac{\text{CPU overhead} + \text{Bytes} \times \text{CPU speed}}
{\text{GPU overhead} + \text{Bytes} \times \text{GPU speed}}
\;\approx\;
\frac{\text{Bytes} \times \text{CPU speed}}{\text{Bytes} \times \text{GPU speed}}
= \frac{\text{CPU speed}}{\text{GPU speed}}
\]
(The approximation only holds for large-size data!)
2.3 Experiment & Analysis
• Experiment setup
− Linear execution-time prediction
1-thread: \( \text{CPU time} = 5.39 \times 10^{-5} \times \text{data size} \)
1-stream: \( \text{GPU time} = 1.67 + 2.41 \times 10^{-6} \times \text{data size} \)
8-stream: \( \text{GPU time} = 4.45 + 2.01 \times 10^{-6} \times \text{data size} \)
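To make the fits concrete, a small C check that evaluates them at the SMAC data size from 2.2 and computes the asymptotic speedup (units follow the slide, presumably milliseconds; the label-to-fit pairing follows the slide layout):

#include <stdio.h>

int main(void)
{
    double bytes = 64.0 * 5781 * 4;          /* SMAC data size from section 2.2 */
    double cpu   = 5.39e-5 * bytes;          /* 1-thread CPU fit */
    double gpu1  = 1.67 + 2.41e-6 * bytes;   /* 1-stream GPU fit */
    double gpu8  = 4.45 + 2.01e-6 * bytes;   /* 8-stream GPU fit */

    printf("CPU: %.1f  GPU(1-stream): %.1f  GPU(8-stream): %.1f\n",
           cpu, gpu1, gpu8);
    /* For large data the overheads vanish and the improvement tends to
       CPU speed / GPU speed = 5.39e-5 / 2.01e-6, about 26.8x, in line
       with the "25 times faster" conclusion in Part 3. */
    printf("asymptotic improvement (8-stream): %.1fx\n", 5.39e-5 / 2.01e-6);
    return 0;
}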
2.3 Experiment & Analysis
• Experiment result
− GPU improvement
2.3 Experiment & Analysis
• Experiment result
− Linear execution-time model
2.3 Experiment & Analysis
• Experiment result
− Linear execution-time model
2.3 Experiment & Analysis
• Roofline model
[Figure: generic roofline model; both axes log-scaled]
2.3 Experiment & Analysis
• Roofline model with SMAC kernel

Hardware: NVIDIA Quadro FX570M
− PCI Express bandwidth (GB/sec): 4
− Peak performance (GFlops/sec): 91.2
− Peak performance without FMA unit (GFlops/sec): 30.4

Software: SMAC kernel on GPU
− Data size (Bytes): 59,719,680
− Issued instruction number (Flops): 4,189,335,552
− Execution time (ms): 79.2
− Instruction density (Flops/Byte): 70.15
− Instruction throughput (GFlops/sec): 52.8
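The last two rows follow from the ones above; as a quick check (reading the data size as 59,719,680 Bytes, which matches the stated density, and within rounding of the slide's throughput figure):

\[
\frac{4{,}189{,}335{,}552\ \text{Flops}}{59{,}719{,}680\ \text{Bytes}} \approx 70.15\ \text{Flops/Byte},
\qquad
\frac{4{,}189{,}335{,}552\ \text{Flops}}{79.2\ \text{ms}} \approx 52.9\ \text{GFlops/sec}
\]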
2.3 Experiment & Analysis
[Figure: roofline model of SMAC on the GPU (log-log). X axis: Flops/Byte, 0.25 to 250; Y axis: GFlops/sec, 1 to 100. Ceilings: peak performance, peak performance without FMA, and the hard-disk IO bandwidth. The SMAC kernel sits at 70.15 Flops/Byte and reaches 52.8 GFlops/sec.]
3. Conclusion & Future Development
• SMAC application:
− The bottleneck is the hard disk IO.
• SMAC kernel on GPU:
− The bottleneck is the computation.
− About 25 times faster than the CPU when large-size data is processed with streams.
− The performance ceiling is approached as the data size becomes “infinitely” large.
3. Conclusion & Future Development
• Future development
− Power measurement: “Consumption of Contemporary
Graphics Accelerators”
3. Conclusion & Future Development
• Power measurement: physical setup
8 x 0.12 Ω (5 W)
3. Conclusion & Future Development
• Future development
− Improve the hard disk I/O
− Employ more powerful GPU
[Figure: measurement results for data sizes ranging from 254,364 up to 520,937,472 Bytes, doubling at each step]
Q & A
Thanks for your attention!