2. CPUs vs GPUs
CPU
• 10s–100s of processing cores
• Pre-defined instruction set & datapath widths
• Optimized for general-purpose computing
GPU
• 1,000s of processing cores
• Pre-defined instruction set & datapath widths
• Highly effective at parallel execution
[Diagram: a CPU die with a large Control unit, large Cache, a few ALUs, and DRAM, versus a GPU die with many rows of small ALUs, minimal Control and Cache per row, and DRAM]
3. How GPU acceleration works
1. Copy data from CPU (host) memory to GPU (device) memory over the PCI bus
2. Process on GPU
3. Copy result data from GPU memory back to CPU memory
[Diagram: CPU with CPU memory, connected over the PCI bus to GPU with GPU memory]
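The three steps above can be sketched with the CUDA runtime API. This is a minimal, hypothetical fragment (not from the original deck): error checking is omitted, and `kernel`, `numBlocks`, and `blockSize` are placeholder names for any `__global__` function and its launch configuration.

```cuda
float *d_x;                                    // device (GPU) pointer
cudaMalloc(&d_x, N * sizeof(float));           // allocate GPU memory

// Step 1: copy input from CPU (host) memory to GPU (device) memory
cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);

// Step 2: process on the GPU; the data already resides in GPU memory,
// so only the launch command crosses the PCI bus here
kernel<<<numBlocks, blockSize>>>(d_x, N);

// Step 3: copy results back from GPU memory to CPU memory
cudaMemcpy(x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(d_x);                                 // release GPU memory
```

Because both copies traverse the PCI bus, acceleration only pays off when Step 2 does enough work to outweigh the transfer cost.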
4. GPU Manufacturers & Programming Platforms
• Manufacturers
• NVIDIA
• Intel
• ..
• GPU Programming Platforms
• CUDA (Compute Unified Device Architecture) – NVIDIA's parallel computing platform; specific to NVIDIA GPUs
• OpenCL – an open, cross-vendor programming model
5. NVIDIA GPU Hello World Vector Add in C & CUDA
// The __global__ keyword indicates this function (a kernel) runs on the GPU
__global__
void add(int n, float a, float *x, float *y)
{
    // Compute this thread's global index; each thread handles one element
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
int main(void)
{
    // Initialize your arrays (CPU & GPU memory)
    // Some code here ……
    // Step 1 - Copy data from CPU (Host) memory to GPU (Device) memory
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
    // Step 2 - Perform add on 1M elements on GPU: (N+255)/256 blocks of 256 threads
    add<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
    // Step 3 - Copy result data back from GPU (Device) memory to CPU (Host) memory
    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
    // Free memory on GPU (Device)
    // ….
}
7. Summary
1. GPU instance ≠ Faster performance
• Program & compile your code to target the GPU
• Operations must be parallelizable & compute intensive
• Need lots of data
• Profile your code to measure CPU vs GPU utilization