GPU Programming and SMAC
Case Study
Zhengjie Lu
Master's Student, Electronics Group
Electrical Engineering
Technische Universiteit Eindhoven, NL
Contents
Part 1: GPU Programming
1.1 NVIDIA GPU Hardware
1.2 NVIDIA CUDA Programming
1.3 Programming Environment
Part 2: SMAC Case Study
2.1 SMAC Introduction
2.2 SMAC Mapping
2.3 Experiment & Analysis
Part 3: Conclusion & Future Development
Concepts
1. GPU
• Graphics processing unit (graphics card)
• Chip vendors: NVIDIA and ATI
2. CUDA Programming
• “Compute Unified Device Architecture”
• Supported by NVIDIA
3. SMAC Application
• “Simplified Method for Atmospheric Correction”
Part 1: GPU Programming
1.1 NVIDIA GPU Hardware
• What does an NVIDIA GPU look like?
1.1 NVIDIA GPU Hardware
• Example: NVIDIA 8-Series GPU
− 128 stream processors (SPs), 1.35 GHz per processor
− 16 shared memories: one shared by every 8 SPs; small but fast
− 1 global memory: shared by all 128 SPs; slow but large
− 1 constant memory: shared by all 128 SPs; small but fast
1.1 NVIDIA GPU Hardware
[Figure: GPU architecture — stream multi-processors (SMs), each containing stream processors (SPs) and a shared memory, plus the global/constant memory shared by all SMs]
1.1 NVIDIA GPU Hardware
• Connection between GPU and CPU
− CPU => main memory => GPU global memory => GPU
− GPU => GPU global memory => main memory => CPU
1.1 NVIDIA GPU Hardware
• Hardware summary:
− Multi-threading is supported in hardware by the SPs.
− SPs inside an SM communicate with each other through the shared memory.
− SMs communicate with each other through the global memory.
− GPU and CPU communicate with each other through their memories: global memory ↔ main memory.
1.2 NVIDIA CUDA Programming
• CUDA programming concepts
− Thread: the basic execution unit
− Block: a collection of threads
− Grid: a collection of blocks
1.2 NVIDIA CUDA Programming
• CUDA programming concepts
− A grid is mapped onto the GPU by the scheduler
− A block is mapped onto an SM by the scheduler
− A thread is mapped onto an SP by the scheduler (see the sketch below)
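As a sketch of how this hierarchy shows up in code (the kernel name, sizes, and pointer argument are invented for illustration), a launch configuration fixes the grid and block shape, and each thread derives a unique global index from the built-in coordinates:

// Each thread derives a unique global index from its block and thread
// coordinates (the scheduler maps blocks onto SMs and threads onto SPs).
__global__ void example_kernel(float* data)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    data[gid] = 2.0f * data[gid];
}

// Launch a grid of 4 blocks, each block holding 192 threads.
void launch_example(float* deviceData)
{
    dim3 block(192, 1, 1);
    dim3 grid(4, 1, 1);
    example_kernel<<<grid, block>>>(deviceData);
}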
1.2 NVIDIA CUDA Programming
1.2 NVIDIA CUDA Programming
• CPU programming convention:
i. Allocate the CPU memory
ii. Run the CPU kernel
• CUDA programming convention:
i. Allocate the GPU memory
ii. Copy the input to the GPU memory
iii. Run the GPU kernel
iv. Copy the output from the GPU memory
1.2 NVIDIA CUDA Programming
• Example:
/*******************************************************/
/* File: main.c                                        */
/* Description: 8x8 matrix addition on CPU             */
/*******************************************************/
// Data definition
int mat1[64] = {…};
int mat2[64] = {…};
int mat3[64];
// Matrix addition on CPU
void matrixAdd_CPU(int index, int* IN1, int* IN2, int* OUT);
// Main body
int main()
{
    // Run the matrix addition on CPU
    matrixAdd_CPU(64, mat1, mat2, mat3);
    return 0;
}

/****************************************************/
/* File: main.cu                                    */
/* Description: 8x8 matrix addition on GPU          */
/****************************************************/
// Data definition
int mat1[64] = {…};
int mat2[64] = {…};
int mat3[64];
// Matrix addition on GPU
void matrixAdd_GPU(int index, int* IN1, int* IN2, int* OUT);
// Main body
int main()
{
    // Run the matrix addition on GPU
    matrixAdd_GPU(64, mat1, mat2, mat3);
    return 0;
}
1.2 NVIDIA CUDA Programming
/**************************************************************/
/* File: main.c                                               */
/* Description: 8x8 matrix addition on CPU                    */
/**************************************************************/
// Matrix addition on CPU
void matrixAdd_CPU(int index, int* IN1, int* IN2, int* OUT)
{
    int i;
    for (i = 0; i < index; i++) {
        OUT[i] = IN1[i] + IN2[i];
    }
}
/****************************************************/
/* File: main.cu                                    */
/* Description: 8x8 matrix addition on GPU          */
/****************************************************/
// Matrix addition on GPU
void matrixAdd_GPU(int index, int* IN1, int* IN2, int* OUT)
{
    int* deviceIN1;
    int* deviceIN2;
    int* deviceOUT;
    dim3 grid(1, 1, 1);
    dim3 block(index, 1, 1);
    // Allocate GPU memory
    cutilSafeCall(cudaMalloc((void**) &deviceIN1, sizeof(int) * index));
    cutilSafeCall(cudaMalloc((void**) &deviceIN2, sizeof(int) * index));
    cutilSafeCall(cudaMalloc((void**) &deviceOUT, sizeof(int) * index));
    // Copy the input
    cutilSafeCall(cudaMemcpy(deviceIN1, IN1, sizeof(int) * index, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(deviceIN2, IN2, sizeof(int) * index, cudaMemcpyHostToDevice));
    // Run the matrix addition
    GPU_kernel<<<grid, block>>>(deviceIN1, deviceIN2, deviceOUT);
    // Copy the output
    cutilSafeCall(cudaMemcpy(OUT, deviceOUT, sizeof(int) * index, cudaMemcpyDeviceToHost));
    // Free GPU memory
    cudaFree(deviceIN1);
    cudaFree(deviceIN2);
    cudaFree(deviceOUT);
}
/*****************************************/
/* File: main.cu                         */
/* Description: GPU kernel               */
/*****************************************/
__global__ void GPU_kernel(int* IN1, int* IN2, int* OUT)
{
    int tx = threadIdx.x;            // each thread adds one element pair
    OUT[tx] = IN1[tx] + IN2[tx];
}
1.2 NVIDIA CUDA Programming
• CUDA programming optimizations
1) Use the registers and shared memories
2) Maximize the number of threads per block
3) Global memory access coalescing
4) Avoid shared memory bank conflicts
5) Group the byte accesses
6) Stream execution
1.2 NVIDIA CUDA Programming
1) Use the registers and shared memories
− Registers are the fastest (8192 32-bit registers per SM)
− Shared memory is fast but small (16 KB per SM)
− Global memory is slow but large (at least 256 MB)
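A minimal sketch of point 1 (the kernel name, block size, and averaging operation are invented): each global word is loaded into shared memory once, then read twice from the fast on-chip copy:

// Each thread averages its element with its right neighbor. Staging the
// block's elements in shared memory loads each global word once and then
// reads it twice from the fast on-chip memory.
__global__ void neighbor_avg(const float* in, float* out, int n)
{
    __shared__ float tile[193];                   // blockDim.x (<= 192) + 1 halo word
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    if (gid < n)
        tile[threadIdx.x] = in[gid];              // one coalesced load per thread
    if (threadIdx.x == 0 && gid + blockDim.x < n)
        tile[blockDim.x] = in[gid + blockDim.x];  // thread 0 loads the halo element
    __syncthreads();                              // tile is now fully resident

    if (gid + 1 < n)
        out[gid] = 0.5f * (tile[threadIdx.x] + tile[threadIdx.x + 1]);
}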
1.2 NVIDIA CUDA Programming
2) Maximize the number of threads per block
i. Determine the register/memory budget per thread with the tool cudaProf
ii. Determine the maximum number of threads with the tool cudaCal
iii. Determine the number of blocks (see the sketch below)
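A minimal host-side sketch of step iii, assuming the per-block thread limit from steps i–ii is already known; N, THREADS_PER_BLOCK, and some_kernel (any bounds-checked kernel) are illustrative names:

void launch(int N, float* deviceIn, float* deviceOut)
{
    const int THREADS_PER_BLOCK = 192;   // per-block limit found in steps i-ii
    // Round up so every element is covered even when N is not a multiple of
    // the block size; the kernel must then bounds-check its global index.
    int numBlocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    dim3 block(THREADS_PER_BLOCK, 1, 1);
    dim3 grid(numBlocks, 1, 1);
    some_kernel<<<grid, block>>>(deviceIn, deviceOut, N);
}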
1.2 NVIDIA CUDA Programming
3) Global memory access coalescing
− Global memory access pattern: 16 threads (a half-warp) at a time
− The 16 threads must access 16 contiguous words in global memory
− The 1st thread must access a global memory address that is 16-word aligned (see the sketch below)
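A sketch of the two patterns (kernel names invented): the first kernel satisfies the rules above; the second breaks contiguity with a stride of two words:

// Coalesced: thread k of a half-warp reads word k of a 16-word-aligned run.
__global__ void copy_coalesced(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Non-coalesced: a stride of 2 words breaks the contiguity rule, so the
// half-warp's loads are serialized into separate memory transactions.
__global__ void copy_strided(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[2 * i];
}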
1.2 NVIDIA CUDA Programming
[Figure: one coalesced and three non-coalesced global memory access patterns]
1.2 NVIDIA CUDA Programming
4) Avoid shared memory bank conflicts
− Shared memory access pattern: 16 threads (a half-warp) per access
− 16 KB shared memory: 16 memory banks of 1 KB each
− Two threads should not access different addresses inside the same memory bank (see the sketch below)
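An illustrative kernel, assuming a single 16-thread block (one half-warp); the name and data are invented:

__global__ void bank_demo(float* out)
{
    __shared__ float s[256];
    int tid = threadIdx.x;                 // run with one block of 16 threads
    for (int k = 0; k < 16; k++)           // fill the whole array cooperatively
        s[16 * k + tid] = (float)(16 * k + tid);
    __syncthreads();

    float ok  = s[tid];        // conflict-free: each thread hits a different bank
    float bad = s[16 * tid];   // 16-way conflict: all 16 threads hit bank 0
    out[tid] = ok + bad;
}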
1.2 NVIDIA CUDA Programming
[Figure: one conflict-free and two bank-conflicting shared memory access patterns]
1.2 NVIDIA CUDA Programming
5) Group the byte accesses
− Read or write adjacent bytes as one word instead of byte by byte (see the sketch below)
[Figure: ungrouped (byte-wise) vs. grouped (word-wise) access]
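A sketch of the difference, using CUDA's built-in uchar4 vector type to group four bytes into one 32-bit access (kernel names invented):

// Ungrouped: four separate one-byte transactions per thread.
__global__ void copy_bytes(const unsigned char* in, unsigned char* out)
{
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    out[i]     = in[i];
    out[i + 1] = in[i + 1];
    out[i + 2] = in[i + 2];
    out[i + 3] = in[i + 3];
}

// Grouped: the same four bytes move as one 32-bit uchar4 word.
__global__ void copy_grouped(const uchar4* in, uchar4* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}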
1.2 NVIDIA CUDA Programming
6) Stream execution
− Split the work across streams so that host-device copies in one stream can overlap with kernel execution in another (see the sketch below)
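A hedged sketch of stream execution (the chunking, names, and kernel are invented; the host buffers must be page-locked, e.g. allocated with cudaMallocHost, for the asynchronous copies to actually overlap):

__global__ void stream_kernel(const int* in, int* out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = in[gid] + 1;              // stand-in for the real per-element work
}

void run_streams(const int* hIN, int* hOUT, int* dIN, int* dOUT, int index)
{
    const int NSTREAMS = 8;
    cudaStream_t streams[NSTREAMS];
    int chunk = index / NSTREAMS;        // assume index divides evenly by 8 and 192

    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSTREAMS; s++) {
        int off = s * chunk;
        // Copy-in, kernel, and copy-out are queued in one stream; work queued
        // in different streams may overlap on the device.
        cudaMemcpyAsync(dIN + off, hIN + off, chunk * sizeof(int),
                        cudaMemcpyHostToDevice, streams[s]);
        stream_kernel<<<chunk / 192, 192, 0, streams[s]>>>(dIN + off, dOUT + off);
        cudaMemcpyAsync(hOUT + off, dOUT + off, chunk * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < NSTREAMS; s++) {
        cudaStreamSynchronize(streams[s]);   // wait until the stream's work is done
        cudaStreamDestroy(streams[s]);
    }
}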
1.2 NVIDIA CUDA Programming
• Tips
− Examples: NVIDIA SDK
− Programming: “NVIDIA CUDA Programming Guide”
− Optimization: “NVIDIA CUDA C Programming: Best
Practices Guide”
1.3 Programming Environment
1. Preparation
− Windows: Microsoft Visual C++ 2008 Express
− Linux
1.3 Programming Environment
2. CUDA installation
− Step 1: Download the CUDA package suitable for your operating system
(http://www.nvidia.com/object/cuda_get.html)
− Step 2: Install the CUDA driver
− Step 3: Install the CUDA Toolkit
− Step 4: Install the CUDA SDK
− Step 5: Verify the installation by running the SDK examples
1.3 Programming Environment
3. CUDA project setup (Windows)
− Step 1: Download “CUDA Wizard” and install it
(http://www.comp.hkbu.edu.hk/~kyzhao/)
− Step 2: Open VC++ 2008 Express
− Step 3: Click “File” and choose “New/Project”
− Step 4: Choose “CUDA” in “Project types” and then
select “CUDAWinApp” in “Visual Studio installed
templates”
− Step 5: Name the CUDA project and click “OK”.
− Step 6: Click “Solution Explorer”
1.3 Programming Environment
3. CUDA project setup (Windows)
− Step 7: right click “Source Files” and choose
“Add/New Item…”
− Step 8: Click “Code” in “Categories” and then choose
“C++. File (.cpp)” in “Visual Studio installed
templates”
− Step 9: Name the file “main.cu” and click “Add”
− Step 10: Repeat Steps 7~9 to create another file named “GPU_kernel.cu”
− Step 11: Click “Solution Explorer” and select
“GPU_kernel.cu” under the menu “Source Files”
1.3 Programming Environment
3. CUDA project setup (Windows)
− Step 12: Right click “GPU_kernel.cu”
− Step 13: Click “Configuration Properties” and then
click “General”
− Step 14: Select “Custom Build Tool” in “Tool” and
click OK
− Step 15: Implement your GPU kernel in “GPU_kernel.cu” and everything else in “main.cu”
1.3 Programming Environment
• Tips
− Set up a CUDA project on Linux:
http://sites.google.com/site/5kk70gpu/installation
http://forums.nvidia.com/lofiversion/index.php?f62.html
Part 2: SMAC Case Study
2.1 SMAC Introduction
• SMAC algorithm
− SMAC stands for “Simplified Method for Atmospheric Correction”
− A fast computation of the atmospheric reflections
2.1 SMAC Introduction
• SMAC application profile
Data size: 5781 x 10 x 4 Bytes = 231,240 Bytes
2.1 SMAC Introduction
• SMAC in the satellite data center
[Figure: data-center processing pipeline — the SMAC algorithm followed by image remapping]
2.2 SMAC Mapping
• Mapping approach
2.2 SMAC Mapping
• Mapping approach
− GPU.cu: GPU operation functions
− GPU_kernel.cu: SMAC kernel
2.2 SMAC Mapping
• SMAC kernel on GPU
− Data size: 64 x 5781 x 4 Bytes
− CPU time:
− GPU time:
− Demo
2.3 Experiment & Analysis
• Experiment Preparation
HARDWARE
− CPU: Intel Duo-Core, 2.5 GHz per core
− GPU: NVIDIA 32-core GPU, 0.95 GHz per core
− Main memory: 4 GB
− PCI-E: PCI Express 1.0 x16
− Operating system: Windows Vista Enterprise
− CUDA version: CUDA 1.1
SOFTWARE
− GPU maximum registers per thread: 60
− GPU thread number: 192 x 4 (#threads per block x #blocks)
− CPU thread number: 1
2.3 Experiment & Analysis
• Experiment setup
− Performance
− GPU improvement
\[
\text{GPU Improvement} = \frac{\text{CPU time}}{\text{GPU time}}
\]
\[
\text{CPU time} = t^{\text{CPU}}_{\text{stop}} - t^{\text{CPU}}_{\text{start}}, \qquad
\text{GPU time} = t^{\text{GPU}}_{\text{stop}} - t^{\text{GPU}}_{\text{start}}
\]
2.3 Experiment & Analysis
• Experiment setup
− Linear execution-time prediction
\[
\text{CPU time} = \text{CPU overhead} + \text{Bytes} \times \text{CPU speed}
\]
\[
\begin{aligned}
\text{GPU time} &= \text{GPU memory time} + \text{GPU run time} \\
&= (\text{GPU memory overhead} + \text{Bytes} \times \text{GPU memory speed}) \\
&\quad + (\text{GPU kernel overhead} + \text{Bytes} \times \text{GPU kernel speed}) \\
&= (\text{GPU memory overhead} + \text{GPU kernel overhead}) \\
&\quad + \text{Bytes} \times (\text{GPU memory speed} + \text{GPU kernel speed}) \\
&= \text{GPU overhead} + \text{Bytes} \times \text{GPU speed}
\end{aligned}
\]
\[
\text{Improvement} = \frac{\text{CPU overhead} + \text{Bytes} \times \text{CPU speed}}
{\text{GPU overhead} + \text{Bytes} \times \text{GPU speed}}
\;\approx\;
\frac{\text{Bytes} \times \text{CPU speed}}{\text{Bytes} \times \text{GPU speed}}
= \frac{\text{CPU speed}}{\text{GPU speed}}
\]
(The approximation only holds for large-size data!)
2.3 Experiment & Analysis
• Experiment setup
− Linear execution-time prediction
1-thread: \( \text{CPU time} = 5.39 \times 10^{-5} \times \text{data size} \)
1-stream: \( \text{GPU time} = 1.67 + 2.41 \times 10^{-6} \times \text{data size} \)
8-stream: \( \text{GPU time} = 4.45 + 2.01 \times 10^{-6} \times \text{data size} \)
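To make the fits concrete, a small C check that evaluates them at the SMAC data size from 2.2 and computes the asymptotic speedup (units follow the slide, presumably milliseconds; the label-to-fit pairing follows the slide layout):

#include <stdio.h>

int main(void)
{
    double bytes = 64.0 * 5781 * 4;          /* SMAC data size from section 2.2 */
    double cpu   = 5.39e-5 * bytes;          /* 1-thread CPU fit */
    double gpu1  = 1.67 + 2.41e-6 * bytes;   /* 1-stream GPU fit */
    double gpu8  = 4.45 + 2.01e-6 * bytes;   /* 8-stream GPU fit */

    printf("CPU: %.1f  GPU(1-stream): %.1f  GPU(8-stream): %.1f\n",
           cpu, gpu1, gpu8);
    /* For large data the overheads vanish and the improvement tends to
       CPU speed / GPU speed = 5.39e-5 / 2.01e-6, about 26.8x, in line
       with the "25 times faster" conclusion in Part 3. */
    printf("asymptotic improvement (8-stream): %.1fx\n", 5.39e-5 / 2.01e-6);
    return 0;
}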
2.3 Experiment & Analysis
• Experiment result
− GPU improvement
2.3 Experiment & Analysis
• Experiment result
− Linear execution-time model
2.3 Experiment & Analysis
• Experiment result
− Linear execution-time model
2.3 Experiment & Analysis
• Roofline model
[Figure: generic roofline model; both axes log-scaled]
2.3 Experiment & Analysis
• Roofline model with SMAC kernel

Hardware: NVIDIA Quadro FX570M
− PCI Express bandwidth (GB/sec): 4
− Peak performance (GFlops/sec): 91.2
− Peak performance without FMA unit (GFlops/sec): 30.4

Software: SMAC kernel on GPU
− Data size (Bytes): 59,719,680
− Issued instruction number (Flops): 4,189,335,552
− Execution time (ms): 79.2
− Instruction density (Flops/Byte): 70.15
− Instruction throughput (GFlops/sec): 52.8
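The last two rows follow from the ones above; as a quick check (reading the data size as 59,719,680 Bytes, which matches the stated density, and within rounding of the slide's throughput figure):

\[
\frac{4{,}189{,}335{,}552\ \text{Flops}}{59{,}719{,}680\ \text{Bytes}} \approx 70.15\ \text{Flops/Byte},
\qquad
\frac{4{,}189{,}335{,}552\ \text{Flops}}{79.2\ \text{ms}} \approx 52.9\ \text{GFlops/sec}
\]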
2.3 Experiment & Analysis
[Figure: roofline model of SMAC on the GPU (log-log). X axis: Flops/Byte, 0.25 to 250; Y axis: GFlops/sec, 1 to 100. Ceilings: peak performance, peak performance without FMA, and the hard-disk IO bandwidth. The SMAC kernel sits at 70.15 Flops/Byte and reaches 52.8 GFlops/sec.]
3. Conclusion & Future Development
• SMAC application:
− The bottleneck is the hard disk IO.
• SMAC kernel on GPU:
− The bottleneck is the computation.
− About 25 times faster than the CPU when large-size data is processed with streams.
− The performance ceiling is approached as the data size becomes “infinitely” large.
3. Conclusion & Future Development
• Future development
− Power measurement: “Consumption of Contemporary
Graphics Accelerators”
3. Conclusion & Future Development
• Power measurement: physical setup
8 x 0.12 Ω (5 W)
3. Conclusion & Future Development
• Future development
− Improve the hard disk I/O
− Employ more powerful GPU
[Figure: measurement results for data sizes ranging from 254,364 up to 520,937,472 Bytes, doubling at each step]
Q & A
Thanks for your attention!