IAP09 CUDA@MIT 6.963 - Lecture 01: GPU Computing using CUDA (David Luebke, NVIDIA)

IAP09 CUDA@MIT/6.963

David Luebke
NVIDIA Research

The “New” Moore’s Law

Computers no longer get faster, just wider

You must re-think your algorithms to be parallel!

Data-parallel computing is the most scalable solution

© NVIDIA Corporation 2008

Enter the GPU

Massive economies of scale

Massively parallel


Enter CUDA

Scalable parallel programming model

Minimal extensions to familiar C/C++ environment

Heterogeneous serial-parallel computing


Sound Bite

GPUs + CUDA
=
The Democratization of Parallel Computing

Massively parallel computing has become
a commodity technology

MOTIVATION

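[Figure: speedup factors shown for ten example applications: 146X, 36X, 19X, 17X, 100X, 149X, 47X, 20X, 24X, 30X; the application labels were not preserved]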

CUDA: ‘C’ FOR PARALLELISM

Standard C Code

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
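To make the index arithmetic of the parallel SAXPY concrete, here is a minimal CPU-only sketch in plain C++. It is not CUDA code: the names saxpy_thread and saxpy_emulated are hypothetical, and the built-in variables blockIdx.x, blockDim.x, and threadIdx.x are replaced by ordinary function parameters. The loops simply visit every (block, thread) pair that the launch would create.

```cpp
#include <vector>

// CPU stand-in for one CUDA thread of saxpy_parallel. The index computation
// and bounds check match the kernel; the three index arguments are
// hypothetical parameters standing in for CUDA's built-in variables.
void saxpy_thread(int blockIdx_x, int blockDim_x, int threadIdx_x,
                  int n, float a, const float *x, float *y)
{
    int i = blockIdx_x * blockDim_x + threadIdx_x;
    if (i < n) y[i] = a * x[i] + y[i];   // threads past the end do nothing
}

// Emulates saxpy_parallel<<<nblocks, threadsPerBlock>>>(n, a, x, y) by
// running every thread of every block one after another.
void saxpy_emulated(int n, float a, const std::vector<float> &x,
                    std::vector<float> &y, int threadsPerBlock = 256)
{
    // Round up so that nblocks * threadsPerBlock >= n.
    int nblocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    for (int b = 0; b < nblocks; ++b)
        for (int t = 0; t < threadsPerBlock; ++t)
            saxpy_thread(b, threadsPerBlock, t, n, a, x.data(), y.data());
}
```

With n = 1000 and 256 threads per block, the rounding yields 4 blocks (1024 threads), and the `if (i < n)` guard keeps the last 24 threads from writing past the end of the arrays.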

Hierarchy of concurrent threads

Parallel kernels composed of many threads

all threads execute the same sequential program

Threads are grouped into thread blocks

threads in the same block can cooperate

Threads/blocks have unique IDs

[Diagram: Thread t; Block b with threads t0 t1 … tB; Kernel foo() with many blocks]

Hierarchical organization

Thread: per-thread local memory
Block: per-block shared memory, local barrier
Kernel 0, Kernel 1, ...: per-device global memory, global barrier between kernels

Heterogeneous Programming

CUDA = serial program with parallel kernels, all in C

Serial C code executes in a CPU thread

Parallel kernel C code executes in thread blocks
across multiple processing elements

Serial Code
Parallel Kernel: foo<<< nBlk, nTid >>>(args);
Serial Code
Parallel Kernel: bar<<< nBlk, nTid >>>(args);

Thread = virtualized scalar processor

Independent thread of execution

has its own PC, variables (registers), processor state, etc.

no implication about how threads are scheduled

CUDA threads might be physical threads

as on NVIDIA GPUs

CUDA threads might be virtual threads

might pick 1 block = 1 physical thread on multicore CPU


Block = virtualized multiprocessor

Provides programmer flexibility

freely choose processors to fit data

freely customize for each kernel launch

Thread block = a (data) parallel task

all blocks in kernel have the same entry point

but may execute any code they want

Thread blocks of kernel must be independent tasks

program valid for any interleaving of block executions


Blocks must be independent

Any possible interleaving of blocks should be valid

presumed to run to completion without pre-emption

can run in any order

can run concurrently OR sequentially

Blocks may coordinate but not synchronize

shared queue pointer: OK

shared lock: BAD … can easily deadlock

Independence requirement gives scalability
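The independence requirement can be illustrated with a small CPU-only C++ sketch. This is an emulation, not CUDA code, and the names block_body and run_blocks are hypothetical. Each emulated block reads its own slice of the input and writes only its own output slot, so the result is the same whatever order the blocks run in.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical "block body": block b sums its own slice of the input and
// writes only out[b]. It never reads or waits on another block's result.
void block_body(int b, int blockSize,
                const std::vector<int> &in, std::vector<int> &out)
{
    int begin = b * blockSize;
    int end = std::min(begin + blockSize, static_cast<int>(in.size()));
    int s = 0;
    for (int i = begin; i < end; ++i) s += in[i];
    out[b] = s;
}

// Runs the blocks sequentially in the order listed in `order`, standing in
// for one of the many interleavings a real scheduler might choose.
std::vector<int> run_blocks(const std::vector<int> &order, int blockSize,
                            const std::vector<int> &in)
{
    std::vector<int> out(order.size(), 0);
    for (int b : order) block_body(b, blockSize, in, out);
    return out;
}
```

Running the three blocks of a 10-element input in the order 0, 1, 2 or in the order 2, 0, 1 produces the same per-block sums, which is the property that lets the same kernel scale across machines with different numbers of multiprocessors.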


Scalable Execution Model

Kernel launched by host

Blocks Run on Multiprocessors

[Diagram: a row of multiprocessors, each labeled MT IU, SP, and Shared Memory, above a common Device Memory]


Synchronization & Cooperation

Threads within block may synchronize with barriers
… Step 1 …
__syncthreads();
… Step 2 …

Blocks coordinate via atomic memory operations

e.g., increment shared queue pointer with atomicInc()

Implicit barrier between dependent kernels
vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);


Using per-block shared memory

Variables shared across block
__shared__ int *begin, *end;

Scratchpad memory
__shared__ int scratch[blocksize];
scratch[threadIdx.x] = begin[threadIdx.x];
// … compute on scratch values …
begin[threadIdx.x] = scratch[threadIdx.x];

Communicating values between threads
scratch[threadIdx.x] = begin[threadIdx.x];
__syncthreads();
int left = scratch[threadIdx.x - 1];  // valid only for threadIdx.x > 0
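The role of the barrier in the neighbor exchange can be sketched on the CPU in plain C++. This is an emulation under stated assumptions, not CUDA code: the function name left_neighbors and the fill parameter (the value given to thread 0, which has no left neighbor) are hypothetical. The two separate loops play the part of the two phases that __syncthreads() separates.

```cpp
#include <vector>

// Emulates one thread block exchanging values through a scratchpad.
// Returns, for each emulated thread, the value held by its left neighbor.
// Thread 0 has no left neighbor and receives `fill` (an assumption).
std::vector<int> left_neighbors(const std::vector<int> &begin, int fill = 0)
{
    const int blocksize = static_cast<int>(begin.size());
    std::vector<int> scratch(blocksize);  // stands in for __shared__ int scratch[blocksize]
    std::vector<int> left(blocksize);

    // Phase 1: every thread stores its own element.
    for (int tid = 0; tid < blocksize; ++tid)
        scratch[tid] = begin[tid];

    // __syncthreads() would sit here: phase 2 may start only after every
    // thread has finished its phase 1 write.

    // Phase 2: every thread reads the element written by its neighbor.
    for (int tid = 0; tid < blocksize; ++tid)
        left[tid] = (tid > 0) ? scratch[tid - 1] : fill;

    return left;
}
```

Without the barrier, a thread could read scratch[threadIdx.x - 1] before its neighbor has written it; finishing the first loop before starting the second is the sequential equivalent of that guarantee.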


Example: Parallel Reduction

Summing up a sequence with 1 thread:
int sum = 0;
for(int i=0; i<N; ++i) sum += x[i];

Parallel reduction builds a summation tree

each thread holds 1 element

stepwise partial sums

N threads need log N steps

one possible approach: butterfly pattern


Parallel Reduction for 1 Block

// INPUT: Thread i holds value x_i
int i = threadIdx.x;
__shared__ int sum[blocksize];

// One thread per element
sum[i] = x_i;
__syncthreads();

for (int bit = blocksize/2; bit > 0; bit /= 2)
{
    int t = sum[i] + sum[i^bit];
    __syncthreads();
    sum[i] = t;
    __syncthreads();
}
// OUTPUT: Every thread now holds sum in sum[i]
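The butterfly pattern can be traced on the CPU with a short plain C++ sketch. It is an emulation rather than CUDA code, the name butterfly_sum is hypothetical, and it assumes the block size is a power of two so that i ^ bit always names a valid partner. The read loop and the write loop correspond to the code on either side of the first __syncthreads() in the kernel.

```cpp
#include <vector>

// Emulates the butterfly reduction for one block. Assumes sum.size() is a
// power of two (an assumption of this sketch). Every element of the
// returned vector holds the total of the input values.
std::vector<int> butterfly_sum(std::vector<int> sum)
{
    const int blocksize = static_cast<int>(sum.size());
    std::vector<int> t(blocksize);  // each thread's private temporary

    for (int bit = blocksize / 2; bit > 0; bit /= 2)
    {
        // Read phase: thread i combines its value with partner i ^ bit.
        for (int i = 0; i < blocksize; ++i)
            t[i] = sum[i] + sum[i ^ bit];
        // First __syncthreads(): all reads finish before any write.
        for (int i = 0; i < blocksize; ++i)
            sum[i] = t[i];
        // Second __syncthreads(): all writes finish before the next step.
    }
    return sum;
}
```

For the input 1, 2, ..., 8 the three steps give 6 8 10 12 6 8 10 12, then 16 20 16 20 16 20 16 20, then 36 in every position, which matches the "every thread now holds sum" comment in the kernel.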

Summing Up CUDA

CUDA = C + a few simple extensions

makes it easy to start writing basic parallel programs

Three key abstractions:
1. hierarchy of parallel threads
2. corresponding levels of synchronization
3. corresponding memory spaces

Supports massive parallelism of manycore GPUs


More efficient reduction tree

template<class T> __device__ T reduce(T *x)
{
    unsigned int i = threadIdx.x;
    unsigned int n = blockDim.x;

    for (unsigned int offset = n/2; offset > 0; offset /= 2)
    {
        // Only the first `offset` threads are active in this step
        if (i < offset) x[i] += x[i + offset];
        __syncthreads();
    }

    // Note that only thread 0 has full sum
    return x[i];
}


Reduction tree execution example
Step               Active threads   Contents of x (shared memory)
Input              -                10  1  8 -1  0 -2  3  5 -2 -3  2  7  0 11  0  2
x[i] += x[i+8];    0-7               8 -2 10  6  0  9  3  7 -2 -3  2  7  0 11  0  2
x[i] += x[i+4];    0-3               8  7 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
x[i] += x[i+2];    0-1              21 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2
x[i] += x[i+1];    0                41 20 13 13  0  9  3  7 -2 -3  2  7  0 11  0  2

Final result: x[0] = 41
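The execution example can be reproduced with a CPU-only C++ sketch of the same reduction tree. This is an emulation, not the device function itself: the name tree_reduce is hypothetical, and it assumes the array length is a power of two. The inner loop runs the active threads of one step; moving to the next loop iteration plays the role of __syncthreads().

```cpp
#include <vector>

// Emulates the reduction tree for one block. Assumes x.size() is a power
// of two (an assumption of this sketch). Returns the array as it looks
// after the last step; only x[0] holds the full sum.
std::vector<int> tree_reduce(std::vector<int> x)
{
    const unsigned int n = static_cast<unsigned int>(x.size());
    for (unsigned int offset = n / 2; offset > 0; offset /= 2)
    {
        // Threads 0 .. offset-1 are active. Each reads from the upper half
        // of the current range and writes to the lower half, so the order
        // in which the active threads run does not matter.
        for (unsigned int i = 0; i < offset; ++i)
            x[i] += x[i + offset];
        // __syncthreads() would separate this step from the next one.
    }
    return x;
}
```

Feeding in the 16 input values from the example leaves 41 in x[0], with the remaining entries holding the leftover partial sums from earlier steps.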

Pattern 1: Fit kernel to the data

Code lets B-thread block reduce B-element array

For larger sequences:

launch N/B blocks to reduce each B-element subsequence

write N/B partial sums to temporary array

repeat until done

We’ll be done in log_B N steps
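The repeat-until-done scheme can be sketched on the CPU in plain C++. This is an emulation, not CUDA code; the names reduce_pass and reduce_multipass are hypothetical, and the sketch assumes B >= 2 (with B = 1 the array would never shrink). Each call to reduce_pass stands in for one kernel launch in which every block reduces one B-element subsequence.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One emulated kernel launch: ceil(N/B) blocks, each summing one
// B-element chunk into a temporary array of partial sums.
std::vector<long long> reduce_pass(const std::vector<long long> &in, std::size_t B)
{
    std::vector<long long> partial((in.size() + B - 1) / B, 0);
    for (std::size_t b = 0; b < partial.size(); ++b)
    {
        std::size_t end = std::min(in.size(), (b + 1) * B);
        for (std::size_t i = b * B; i < end; ++i)
            partial[b] += in[i];
    }
    return partial;
}

// Repeats passes until a single value remains. Assumes B >= 2.
// If `passes` is non-null it receives the number of launches used.
long long reduce_multipass(std::vector<long long> data, std::size_t B,
                           int *passes = nullptr)
{
    int count = 0;
    while (data.size() > 1)
    {
        data = reduce_pass(data, B);
        ++count;
    }
    if (passes) *passes = count;
    return data.empty() ? 0 : data[0];
}
```

With N = 4096 and B = 16 the array shrinks 4096, 256, 16, 1, so three passes are needed, in line with log_B N = 3. When N is not a power of B the count rounds up.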


Pattern 2: Fit kernel to the machine

For a block size of B:
1) Launch B blocks, each summing N/B elements
2) Launch 1 block to combine B partial sums

[Diagram: Stage 1, many blocks: each block reduces its own segment, e.g. 3 1 7 0 4 1 6 3 -> 4 7 5 9 -> 11 14 -> 25, giving one partial sum of 25 per block]
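The two-stage scheme can be sketched on the CPU in plain C++. This is an emulation, not CUDA code; the names stage1_partials and stage2_combine are hypothetical, and giving each block a contiguous chunk of ceil(N/B) elements is an assumption of this sketch (the slides do not say how elements are assigned to blocks).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stage 1: B emulated blocks, each summing a contiguous chunk of about
// N/B elements. Contiguous assignment is an assumption of this sketch.
std::vector<long long> stage1_partials(const std::vector<long long> &in, std::size_t B)
{
    std::vector<long long> partial(B, 0);
    const std::size_t chunk = (in.size() + B - 1) / B;  // ceil(N/B)
    for (std::size_t b = 0; b < B; ++b)
    {
        std::size_t end = std::min(in.size(), (b + 1) * chunk);
        for (std::size_t i = b * chunk; i < end; ++i)
            partial[b] += in[i];
    }
    return partial;
}

// Stage 2: a single emulated block combines the B partial sums.
long long stage2_combine(const std::vector<long long> &partial)
{
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

Using the segment 3 1 7 0 4 1 6 3 repeated eight times as input with B = 8, stage 1 yields a partial sum of 25 per block and stage 2 combines them into 200. Unlike Pattern 1, the number of launches stays at two regardless of N.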
