GPU Computing Workshop
Vahid Amiri
National Workshop of Cloud Computing
Cloud Computing Lab, Amirkabir University
Vahidamiry.ir
Nov 2012
Bio-Informatics and Life Sciences
Computational Electromagnetics and
Electrodynamics
Computational Finance
Weather, Atmospheric, Ocean Modeling
and Space Sciences
2
Computational Fluid Dynamics
Data Mining, Analytics, and Databases
Molecular Dynamics
Numerical Analytics
3
 Cluster
 Grid
 Cloud
4
 Parallel and distributed processing system
 Consists of a collection of interconnected stand-
alone computers
 Appears as a single system to users and
applications
5
6
 Distributed, heterogeneous resources for large
experiments
 Compute and storage resources
 Network of Machines
 Larger number of resources
 Extended all over the world
 Different administrative domains
7
 Investment in infrastructure
 Power and Cooling Management
 Management
 Maintenance
 Complexity
 Cost
8
 Computing as a utility
 Easy to access
▪ Easy Configuration
 Pay-as-you-go
 Flexibility
 Scalability
 No need for infrastructure management
9
 IaaS
 Cloud-Based Cluster
▪ Amazon EC2
▪ GoGrid
▪ IBM
▪ Rackspace
 PaaS
 Amazon Elastic MapReduce
 Google App Engine – MapReduce Service
 SaaS
10
11
12
 Companies developing software solutions for applications in
the cloud:
 CloudBroker
 Cyclone
 Plura Processing
 Penguin on Demand
13
 Supporting five technical domains:
 Computational fluid dynamics (CFD)
 Finite element analysis
 Computational chemistry and materials
 Computational biology
14
 Performance penalties
 Users voluntarily lose almost all control over the
execution environment
 Virtualization Technology
▪ Performance loss introduced by the virtualization
mechanism
 Cloud Environment
▪ Overheads and the sharing of computing and
communication resources
15
 IaaS HPC
 MPI Cluster
 MapReduce Cluster
 GPU Cluster!!!
 …..
16
17
 General-Purpose computation using GPUs
 Data-parallel algorithms leverage GPU
attributes
 Using graphics hardware for non-graphics computations
 Can improve performance by orders
of magnitude in certain types of
applications
18
 GPUs contain a much larger number of dedicated
ALUs than CPUs.
 GPUs have extensive support for the stream-
processing paradigm, which is related to SIMD
(Single Instruction, Multiple Data) processing.
 Each processing unit on the GPU has local
memory, which improves data manipulation and
reduces fetch time.
19
20
21
 Multiprocessor (MP) = thread processor = ALU
22
 The GPU is viewed as a compute device that:
 Is a coprocessor to the CPU or host
 Has its own DRAM (device memory)
 Runs many threads in parallel
 Data-parallel portions of an application are
executed on the device as kernels which run in
parallel on many threads
 Differences between GPU and CPU threads
 GPU threads are extremely lightweight
▪ Very little creation overhead
 GPU needs 1000s of threads for full efficiency
▪ Multi-core CPU needs only a few
23
 Host: the CPU and its memory (host memory)
 Device: the GPU and its memory (device memory)
24
 CUDA is a set of development tools for creating applications that
execute on the GPU (Graphics Processing Unit).
 The API is an extension to the ANSI C programming language
 Low learning curve
 CUDA was developed by NVIDIA and as such can only run on
NVIDIA GPUs of the G8x series and up.
 CUDA was released on February 15, 2007 for PC, with a beta version
for Mac OS X on August 19, 2008.
25
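A minimal sketch of what such a CUDA program looks like (the kernel name and launch configuration are illustrative, not from the slides):

```cuda
#include <cstdio>

// Illustrative kernel: each thread reports its block and thread ID.
__global__ void hello() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();        // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();  // wait for the device to finish
    return 0;
}
```

Compiled with `nvcc` from the CUDA Toolkit; it requires an NVIDIA GPU to run.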
26
 A kernel is executed as a grid of thread
blocks
 A thread block is a batch of threads that
can cooperate with each other by:
 Synchronizing their execution
 Efficiently sharing data through a low
latency shared memory
[Figure: the host launches Kernel 1 on Grid 1 (a 3×2 array of blocks) and Kernel 2 on Grid 2; Block (1, 1) of Grid 2 is expanded into a 5×3 array of threads, Thread (0, 0) through Thread (4, 2).]
27
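The grid/block hierarchy maps directly onto the kernel launch configuration. A hedged sketch matching a 3×2 grid of 5×3-thread blocks (the kernel name is a placeholder):

```cuda
dim3 dimGrid(3, 2);   // grid: Blocks (0,0) ... (2,1)
dim3 dimBlock(5, 3);  // each block: Threads (0,0) ... (4,2)
// someKernel stands in for any __global__ function:
someKernel<<<dimGrid, dimBlock>>>(/* arguments */);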
 Threads and blocks have IDs
 So each thread can decide
what data to work on
 Simplifies memory
addressing when processing
multidimensional data
[Figure: Grid 1 as a 3×2 array of blocks; Block (1, 1) expanded into a 5×3 array of threads, each with a 2-D thread ID.]
28
 Parallel computations are arranged as
grids
 One grid executes after another
 Blocks are assigned to SMs: a block runs
entirely on a single SM, but multiple
blocks can be assigned to the same SM
 A block consists of elements (threads)
29
30
31
 Demo!
32
 CUDA device driver
 CUDA Software Development Kit
 CUDA Toolkit
 You (probably) need experience with C or C++
33
 Thread block – an array of concurrent threads
that execute the same program and can
cooperate to compute the result
 A thread ID has corresponding 1-, 2-, or 3-D
indices
 Threads of a thread block share memory
34
 Each thread can:
 R/W per-thread registers
 R/W per-thread local memory
 R/W per-block shared memory
 R/W per-grid global memory
 Read only per-grid constant memory
 Read only per-grid texture memory
 The host can R/W global,
constant, and texture
memories
[Figure: CUDA memory model — each thread has its own registers and local memory, each block has shared memory, and the whole grid shares global, constant, and texture memory, which the host can also access.]
35
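A sketch of where each of these memory spaces appears in code (all names illustrative):

```cuda
__constant__ float coeff;                // per-grid constant memory (read-only in kernels)

__global__ void kernel(float *g_data) {  // g_data points into global memory
    __shared__ float tile[64];           // per-block shared memory
    int i = threadIdx.x;
    float tmp = g_data[i];               // tmp lives in per-thread registers
    tile[i] = tmp;
    __syncthreads();                     // make the shared data visible block-wide
    g_data[i] = tile[i] * coeff;
}
```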
 cudaMalloc()
 Allocates object in the device Global Memory
 Requires two parameters
▪ Address of a pointer to the allocated object
▪ Size of the allocated object
 cudaFree()
 Frees an object from device Global Memory
const int BLOCK_SIZE = 64;
float *d_f;
int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
cudaMalloc((void**)&d_f, size);
cudaFree(d_f);
36
 cudaMemcpy()
 memory data transfer
 Requires four parameters
▪ Pointer to source
▪ Pointer to destination
▪ Number of bytes copied
▪ Type of transfer
▪ Host to Host
▪ Host to Device
▪ Device to Host
▪ Device to Device
cudaMemcpy(d_f, f, size, cudaMemcpyHostToDevice);
cudaMemcpy(f, d_f, size, cudaMemcpyDeviceToHost);
[Figure: CUDA memory model (registers, local, shared, global, constant, and texture memory), as on the previous memory-spaces slide.]
37
 __global__ defines a kernel function
 Must return void
__device__ float DeviceFunc()  — executed on: device, callable from: device
__global__ void KernelFunc()   — executed on: device, callable from: host
__host__ float HostFunc()      — executed on: host, callable from: host
38
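The three qualifiers might be combined as follows (function names are illustrative):

```cuda
__device__ float square(float x) {            // device-only helper, called from kernels
    return x * x;
}

__global__ void squareAll(float *v, int n) {  // kernel, launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);
}

__host__ void launch(float *d_v, int n) {     // ordinary host function
    squareAll<<<(n + 255) / 256, 256>>>(d_v, n);
}
```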
 Allocate the memory on the GPU
 Copy the arrays ‘a’ and ‘b’ to the GPU
 Call the kernel function
 Copy the array ‘c’ back from the GPU to the CPU
 Free the memory allocated on the GPU
39
 Step 1: Allocate the memory on the GPU
int a[N], b[N], c[N];
int *d_a, *d_b, *d_c;
cudaMalloc( (void**)&d_a, N * sizeof(int) );
cudaMalloc( (void**)&d_b, N * sizeof(int) );
cudaMalloc( (void**)&d_c, N * sizeof(int) );
40
 Step 2: Copy the arrays ‘a’ and ‘b’ to the GPU
cudaMemcpy(d_a, a, N * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, N * sizeof(int),
cudaMemcpyHostToDevice);
 Step 3: Call the kernel function
add<<<N,1>>>(d_a, d_b, d_c);
41
 Step 4: Copy the array ‘c’ back from the GPU to the
CPU
cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
 Step 5: Free the memory allocated on the GPU
cudaFree( d_a );
cudaFree( d_b );
cudaFree( d_c );
42
 kernel function
__global__ void add( int *a, int *b, int *c ) {
int tid = blockIdx.x;
if (tid < N)
c[tid] = a[tid] + b[tid];
}
43
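Putting steps 1–5 and the kernel together gives a complete program; the input values and the final printf are our additions for illustration:

```cuda
#include <cstdio>
#define N 512

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;           // one element per block
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main() {
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));

    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(d_a, d_b, d_c);   // N blocks of one thread each

    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    printf("c[10] = %d\n", c[10]);  // a[10] + b[10] = 10 + 20 = 30
    return 0;
}
```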
 We’ve seen parallel vector addition using:
 Several blocks with one thread each
 One block with several threads
44
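For comparison, the one-block/many-threads variant differs only in how tid is computed and in the launch configuration (N as on the previous slides):

```cuda
// Variant: one block of N threads instead of N blocks of one thread.
__global__ void add(int *a, int *b, int *c) {
    int tid = threadIdx.x;   // index within the single block
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
// launched as: add<<<1, N>>>(d_a, d_b, d_c);
// note: N must not exceed the per-block thread limit (512 on these GPUs)
```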
45
46
[Figure: CUDA memory model (registers, local, shared, global, constant, and texture memory).]
 P = M × N, with all matrices of size WIDTH × WIDTH
 One thread handles one element of P
 M and N are loaded WIDTH times from
global memory
[Figure: matrices M, N, and P, each WIDTH × WIDTH.]
47
 Memory latency can be hidden by keeping a
large number of threads busy
 Keep number of threads per block (block size)
and number of blocks per grid (grid size) as
large as possible
 Constant memory can be used for constant
data (variables that do not change).
 Constant memory is cached.
48
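Constant memory usage might look like this (symbol names are illustrative; cudaMemcpyToSymbol is the API for writing it from the host):

```cuda
__constant__ float scale;   // per-grid, read-only in kernels, cached on chip

__global__ void scaleAll(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= scale;      // reads are served from the constant cache
}

// Host side, before the launch:
// float h_scale = 2.0f;
// cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));
```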
49
 Each thread within the
block computes one
element of Csub
50
51
[Figure: CUDA memory model (registers, local, shared, global, constant, and texture memory).]
 Recall that the “stream processors” of the
GPU are organized as MPs (multi-processors)
and every MP has its own set of resources:
 Registers
 Local memory
 The block size needs to be chosen such that
there are enough resources in an MP to
execute a block at a time.
52
 Critical for performance
 Recommended value is 192 or 256
 Maximum value is 512
 Limited by number of registers on the MP
53
 Run with different block sizes!
[Figure: matrices M, N, and P, each WIDTH × WIDTH.]
54
55
[Chart: Data – Test 1. Runtimes (0–3000 scale) for input sizes S 128, S 512, S 1024, S 3079, and S 4096, comparing the block-16 and shared-memory versions.]
56
[Chart: runtimes (0–3500 scale) for the same input sizes, comparing the block-16, shared, block-32, block-64, block-128, and block-512 configurations.]
57