Sci-Prog seminar series
Talks on computing and programming related topics ranging from basic to
                           advanced levels.



                Talk: Using GPUs for parallel processing
                          A. Stephen McGough


        Website: http://conferences.ncl.ac.uk/sciprog/index.php
   Research community site: contact Matt Wade for access
            Alerts mailing list: sci-prog-seminars@ncl.ac.uk
                   (sign up at http://lists.ncl.ac.uk )

Organisers: Dr Liz Petrie, Dr Matt Wade, Dr Stephen McGough,
                 Dr Ben Allen and Gregg Iceton
Using GPUs for parallel processing

         A. Stephen McGough
Why?
• Moore’s law (really an observation) is dead?
   – “the number of transistors on integrated circuits
     doubles approximately every two years”
   – Processors aren’t getting faster… They’re getting fatter

• Processor speed and energy: power scales roughly with
  the cube of clock frequency (Power ∝ frequency³)
   – Assume a 1 GHz core consumes 1 watt
   – A 4 GHz core then consumes ~64 watts
   – Four 1 GHz cores consume only ~4 watts

                             Computers are going many-core
What?
• The games industry is a multi-billion dollar business
• Gamers want photo-realistic games
  – Computationally expensive
  – Requires complex physics calculations
• The latest generation of Graphical Processing Units
  are therefore many-core parallel processors
  – General Purpose Graphical Processing Units - GPGPUs
Not just normal processors
• 1000s of cores
  – But each core is simpler than a normal processor core
  – Multiple cores perform the same action at the same
    time – Single Instruction Multiple Data – SIMD
• Conventional processor -> Minimize latency
  – Of a single program
• GPU -> Maximize throughput of all cores
• Potential for orders of magnitude speed-up
“If you were plowing a field, which would you
        rather use: two strong oxen or 1024 chickens?”

• Famous quote from Seymour Cray arguing for
  small numbers of processors
  – But the chickens are now winning
• Need a new way to think about programming
  – Need hugely parallel algorithms
     • Many existing algorithms won’t work (efficiently)
Some Issues with GPGPUs
• Cores are slower than a standard CPU
   – But you have lots more
• No direct control on when your code runs on a core
   – GPGPU decides where and when
      • Can’t communicate between cores
      • Order of execution is ‘random’
   – Synchronization is through exiting parallel GPU code
• SIMD only works (efficiently) if all cores are doing the
  same thing
   – NVIDIA GPUs have Warps of 32 cores working together
      • Code divergence forces a Warp to run each branch path
        in turn, costing extra passes (see the sketch below)
• Cores can interfere with each other
   – Overwriting each other’s memory
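A minimal sketch of divergence (illustrative, not from the slides): odd and even
threads in the same Warp take different branches, so the hardware runs the two
paths one after the other rather than in parallel.

 __global__ void divergent(float *out, int N) {
   int id = blockDim.x * blockIdx.x + threadIdx.x; // absolute thread id
   if (id >= N) {return;} // not a valid ID
   if (id % 2 == 0)
     out[id] = id * 2.0f; // even threads take this path...
   else
     out[id] = id * 0.5f; // ...while odd threads in the Warp wait, then take this one
 }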
How
• Many approaches
  – OpenGL – for the mad Guru
  – Compute Unified Device Architecture (CUDA)
  – OpenCL – emerging standard
  – Dynamic Parallelism – For existing code loops
• Focus here on CUDA
  – Well developed and supported
  – Exploits full power of GPGPU
CUDA
• CUDA is a set of extensions to C/C++
   – (and Fortran)
• Code consists of sequential and parallel parts
   – Parallel parts are written as kernels
           • Describe what one thread of the code will do
 Typical program flow:

  Start    Sequential code
              |
           Transfer data to card
              |
           Execute Kernel
              |
           Transfer data from card
              |
  Finish   Sequential code
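A minimal sketch of that flow in code (the kernel name my_kernel and the data are
made up for illustration; error checking omitted):

 __global__ void my_kernel(float *data, int N) {
   int id = blockDim.x * blockIdx.x + threadIdx.x;
   if (id < N) data[id] *= 2.0f; // each thread handles one element
 }

 int main() {
   const int N = 1024;
   float *data_host = new float[N];                   // sequential: set up data on the host
   for (int i = 0; i < N; i++) data_host[i] = (float)i;

   float *data_device;
   cudaMalloc((void**) &data_device, N*sizeof(float));
   cudaMemcpy(data_device, data_host, N*sizeof(float), cudaMemcpyHostToDevice);

   my_kernel<<<(N + 255)/256, 256>>>(data_device, N);  // execute kernel on the card

   cudaMemcpy(data_host, data_device, N*sizeof(float), cudaMemcpyDeviceToHost);
   cudaFree(data_device); delete[] data_host;          // sequential: tidy up
   return 0;
 }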
Example: Vector Addition
• One dimensional data
• Add two vectors (A,B) together to produce C
• Need to define the kernel to run and the main
  code
• Each thread can compute a single value for C
Example: Vector Addition
• Pseudo code for the kernel:
  – Identify which element in the vector I’m computing: i
  – Compute C[i] = A[i] + B[i]


• How do we identify our index (i)?
Blocks and Threads
• In CUDA the whole data
  space is the Grid
   – Divided into a number
     of blocks
      • Divided into a number of
        threads
• Blocks can be executed
  in any order
• Threads in a block are
  executed together
• Blocks and Threads can
  be 1D, 2D or 3D
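As a concrete illustration (a sketch, not from the slides; the kernel name touch is
made up), grid and block shapes are declared with CUDA’s dim3 type:

 __global__ void touch(float *data, int N, int M) {
   int x = blockDim.x * blockIdx.x + threadIdx.x; // column index
   int y = blockDim.y * blockIdx.y + threadIdx.y; // row index
   if (x < N && y < M) data[y * N + x] = 1.0f;    // one element per thread
 }

 void launch(float *data_device, int N, int M) {
   dim3 threadsPerBlock(16, 16);              // 16 x 16 = 256 threads per block
   dim3 numBlocks((N + 15)/16, (M + 15)/16);  // enough blocks to cover N x M
   touch<<<numBlocks, threadsPerBlock>>>(data_device, N, M);
 }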
Blocks
• As Blocks are executed in arbitrary order, this gives
  CUDA the opportunity to scale to the number of cores
  in a particular device
Thread id
• CUDA provides three pieces of data for
  identifying a thread
  – BlockIdx – block identity
  – BlockDim – the size of a block (no of threads in block)
  – ThreadIdx – identity of a thread in a block
• Can use these to compute the absolute thread id
        id = BlockIdx * BlockDim + ThreadIdx
• EG: BlockIdx = 2, BlockDim = 3, ThreadIdx = 1
• id = 2 * 3 + 1 = 7
                       Block 0    Block 1    Block 2
         Thread index  0  1  2    0  1  2    0  1  2
         Absolute id   0  1  2    3  4  5    6  7  8
Example: Vector Addition
                         Kernel code

 // __global__ marks the entry point for a kernel;
 // otherwise this is a normal function definition
 __global__ void vector_add(double *A, double *B,
                            double* C, int N) {
   // Find my thread id - block and thread
   int id = blockDim.x * blockIdx.x + threadIdx.x; // compute my absolute thread id
   // We might be invalid - if the data size is not
   // completely divisible by the number of blocks
   if (id >= N) {return;} // I'm not a valid ID
   C[id] = A[id] + B[id]; // do my work
 }
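For context (a sketch; the full launch appears in the main code a few slides on),
the kernel is started from the host with the <<<blocks, threads>>> syntax:

 int threads = 256;
 int blocks  = (N + threads - 1) / threads; // round up so every element is covered
 vector_add<<<blocks, threads>>>(A_device, B_device, C_device, N);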
Example: Vector Addition
         Pseudo code for sequential code
• Create Data on Host Computer

• Create space on device

• Copy data to device
• Run Kernel
• Copy data back to host and do something with it
• Clean up
Host and Device
• Data needs copying to / from the GPU (device)
• Often end up with the same data on both
  – Suffix variable names with _device or _host
     • To help identify where data is

        A_host  (Host memory)   <--->   A_device  (Device memory)
Example: Vector Addition
int N = 2000;
double *A_host = new double[N]; // Create data on host computer
double *B_host = new double[N]; double *C_host = new double[N];
for(int i=0; i<N; i++) {    A_host[i] = i; B_host[i] = (double)i/N; }
double *A_device, *B_device, *C_device; // allocate space on device GPGPU
cudaMalloc((void**) &A_device, N*sizeof(double));
cudaMalloc((void**) &B_device, N*sizeof(double));
cudaMalloc((void**) &C_device, N*sizeof(double));
// Copy data from host memory to device memory
cudaMemcpy(A_device, A_host, N*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(B_device, B_host, N*sizeof(double), cudaMemcpyHostToDevice);
// How many blocks will we need? Choose block size of 256 and round up
int blocks = (N + 255)/256;
vector_add<<<blocks, 256>>>(A_device, B_device, C_device, N); // run kernel
// Copy data back (cudaMemcpy waits for the kernel to finish)
cudaMemcpy(C_host, C_device, N*sizeof(double), cudaMemcpyDeviceToHost);
// do something with result

// free device memory
cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
delete[] A_host; delete[] B_host; delete[] C_host; // free host memory (allocated with new[])
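The slides omit error checking for brevity; in practice every CUDA call returns a
status that should be inspected. A minimal sketch (the CUDA_CHECK macro name is made
up here, but cudaGetErrorString and cudaGetLastError are standard API; needs
<cstdio> and <cstdlib>):

 #define CUDA_CHECK(call)                                          \
   do {                                                            \
     cudaError_t err = (call);                                     \
     if (err != cudaSuccess) {                                     \
       fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
               cudaGetErrorString(err), __FILE__, __LINE__);       \
       exit(EXIT_FAILURE);                                         \
     }                                                             \
   } while (0)

 // e.g.
 // CUDA_CHECK(cudaMalloc((void**) &A_device, N*sizeof(double)));
 // vector_add<<<blocks, 256>>>(A_device, B_device, C_device, N);
 // CUDA_CHECK(cudaGetLastError()); // catches kernel launch errors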
More Complex: Matrix Addition
• Now a 2D problem
  – BlockIdx, BlockDim, ThreadIdx now have x and y
• But general principles hold
  – For kernel
     • Compute location in matrix in two dimensions
  – For main code
     • Define and transmit data
• But keep data 1D
  – Why?
Why data in 1D?
• If you define data as 2D there is no guarantee
  that data will be a contiguous block of memory
  – Can’t be transmitted to card in one command




                [Diagram: rows allocated separately, with some other
                 data sitting between them in memory]
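To make the point concrete (a sketch, not from the slides): a pointer-to-pointer
“2D array” gives no contiguity guarantee, whereas a single allocation does.

 int N = 20, M = 10;

 // Each row is a separate allocation - rows need not be adjacent in memory,
 // so copying the matrix to the device would take one cudaMemcpy per row.
 double **A2d = new double*[M];
 for (int j = 0; j < M; j++) { A2d[j] = new double[N]; }

 // A single allocation of N*M doubles is one contiguous block,
 // so the whole matrix can go to the device with a single cudaMemcpy.
 double *A1d = new double[N * M];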
Faking 2D data
• 2D data size N*M
• Define 1D array of size N*M
• Index the element at column x, row y as
                    y*N + x
  – Each row of N values is stored one after another
• Then can transfer to device in one go


          [Diagram: Row 1 | Row 2 | Row 3 | Row 4 stored end-to-end in one array]
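A small helper (hypothetical, not in the slides) keeps the flattened indexing in
one place; __host__ __device__ lets both CPU and GPU code call it.

 __host__ __device__ inline int flat_index(int x, int y, int N) {
   return y * N + x; // column x, row y; rows of length N stored contiguously
 }

 // e.g.  A_host[flat_index(i, j, N)] = i;   instead of   A_host[i + j * N] = i;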
Example: Matrix Add
                              Kernel
__global__ void matrix_add(double *A, double *B, double* C, int N, int M)
{
  // Find my thread id - block and thread (both dimensions this time)
  int idX = blockDim.x * blockIdx.x + threadIdx.x;
  int idY = blockDim.y * blockIdx.y + threadIdx.y;
  if (idX >= N || idY >= M) {return;} // I'm not a valid ID
  int id = idY * N + idX; // compute the 1D location
  C[id] = A[id] + B[id]; // do my work
}
Example: Matrix Addition
                              Main Code
// Define matrices on host
int N = 20;
int M = 10;
double *A_host = new double[N * M]; // Create data on host computer
double *B_host = new double[N * M];
double *C_host = new double[N * M];
for(int i=0; i<N; i++) {
  for (int j = 0; j < M; j++) {
    A_host[i + j * N] = i; B_host[i + j * N] = (double)j/M;
  }
}

// Define space on device GPGPU
double *A_device, *B_device, *C_device;
cudaMalloc((void**) &A_device, N*M*sizeof(double));
cudaMalloc((void**) &B_device, N*M*sizeof(double));
cudaMalloc((void**) &C_device, N*M*sizeof(double));

// Copy data from host memory to device memory
cudaMemcpy(A_device, A_host, N*M*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(B_device, B_host, N*M*sizeof(double), cudaMemcpyHostToDevice);

// Run kernel: how many blocks will we need? Choose block size of 16 x 16 and round up
int blocksX = (N + 15)/16;
int blocksY = (M + 15)/16;
dim3 dimGrid(blocksX, blocksY);
dim3 dimBlocks(16, 16);
matrix_add<<<dimGrid, dimBlocks>>>(A_device, B_device, C_device, N, M);

// Copy data back from device to host
cudaMemcpy(C_host, C_device, N*M*sizeof(double), cudaMemcpyDeviceToHost);
// do something with the result, e.g.
//for (int i = 0; i < N*M; i++) printf("C[%d,%d] = %f\n", i/N, i%N, C_host[i]);

// Tidy up: free device memory, then host memory (allocated with new[])
cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
delete[] A_host; delete[] B_host; delete[] C_host;
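A quick host-side check of the result (a sketch, not in the original slides; it
would run before the host arrays are freed):

 int errors = 0;
 for (int j = 0; j < M; j++) {
   for (int i = 0; i < N; i++) {
     double expected = A_host[i + j * N] + B_host[i + j * N];
     if (C_host[i + j * N] != expected) { errors++; }
   }
 }
 printf("Matrix add: %d mismatching elements\n", errors);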
Running Example
• Computer: condor-gpu01
  – Set path
     • set path = ( $path /usr/local/cuda/bin/ )
• Compile with the nvcc command, then just run the
  resulting binary file (see the sketch below)

• C2050, 440 cores, 3GB RAM
  – Single precision: 1.03 Tflops
  – Double precision: 515 Gflops
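For example (the source file name is illustrative; the path setup above may differ
on other machines):

  nvcc vector_add.cu -o vector_add    # compile with the CUDA compiler
  ./vector_add                        # run the resulting binary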
Summary and Questions
• GPGPUs have great potential for parallelism
• But at a cost
   – Not ‘normal’ parallel computing
   – Need to think about problems in a new way
• Further reading
   – NVIDIA CUDA Zone https://developer.nvidia.com/category/zone/cuda-zone
   – Online courses https://www.coursera.org/course/hetero