GPU Computing Workshop
Vahid Amiri
National Workshop of Cloud Computing
Cloud Computing Lab, Amirkabir University
Vahidamiry.ir
Nov 2012
Bio-Informatics and Life Sciences
Computational Electromagnetics and
Electrodynamics
Computational Finance
Weather, Atmospheric, Ocean Modeling
and Space Sciences
2
Computational Fluid Dynamics
Data Mining, Analytics, and Databases
Molecular Dynamics
Numerical Analytics
3
 Cluster
 Grid
 Cloud
4
 Parallel and distributed processing system
 Consists of a collection of interconnected stand-
alone computers
 Appears as a single system to users and
applications
5
6
 Distributed, heterogeneous resources for large
experiments
 Compute and storage resources
 Network of Machines
 Larger number of resources
 Extended all over the world
 Different administrative domains
7
 Investment in infrastructure
 Power and Cooling Management
 Management
 Maintenance
 Complexity
 Cost
8
 Computing as a utility
 Easy to access
▪ Easy Configuration
 Pay-as-you-go
 Flexibility
 Scalability
 No need for infrastructure management
9
 IaaS
 Cloud-Based Cluster
▪ Amazon EC2
▪ GoGrid
▪ IBM
▪ Rackspace
 PaaS
 Amazon Elastic MapReduce
 Google App Engine – MapReduce Service
 SaaS
10
11
12
 Companies developing software solutions for applications in
the cloud:
 CloudBroker
 Cyclone
 Plura Processing
 Penguin on Demand
13
 Supporting five technical domains:
 Computational fluid dynamics (CFD)
 Finite element analysis
 Computational chemistry and materials
 Computational biology
14
 Performance penalties
 Users voluntarily lose almost all control over the
execution environment
 Virtualization Technology
▪ Performance loss introduced by the virtualization
mechanism
 Cloud Environment
▪ Overheads and the sharing of computing and
communication resources
15
 IaaS HPC
 MPI Cluster
 MapReduce Cluster
 GPU Cluster!!!
 …..
16
17
 General-Purpose computation using GPUs
 Data-parallel algorithms leverage GPU
attributes
 Using graphics hardware for non-graphics computations
 Can improve performance by orders
of magnitude in certain types of
applications
18
 GPUs contain a much larger number of dedicated
ALUs than CPUs.
 GPUs have extensive support for the stream-
processing paradigm, which is related to SIMD
(Single Instruction, Multiple Data) processing.
 Each processing unit on the GPU has local
memory, which improves data manipulation and
reduces fetch time.
19
20
21
 Multiprocessor (MP) = thread processor = ALU
22
 The GPU is viewed as a compute device that:
 Is a coprocessor to the CPU or host
 Has its own DRAM (device memory)
 Runs many threads in parallel
 Data-parallel portions of an application are
executed on the device as kernels which run in
parallel on many threads
 Differences between GPU and CPU threads
 GPU threads are extremely lightweight
▪ Very little creation overhead
 GPU needs 1000s of threads for full efficiency
▪ Multi-core CPU needs only a few
23
 Host: the CPU and its memory (host memory)
 Device: the GPU and its memory (device memory)
24
 CUDA is a set of development tools for creating applications that
execute on the GPU (Graphics Processing Unit).
 The API is an extension to the ANSI C programming language
 Low learning curve
 CUDA was developed by NVIDIA and as such can only run on
NVIDIA GPUs of the G8x series and up.
 CUDA was released on February 15, 2007 for PC, with a beta version
for Mac OS X on August 19, 2008.
25
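A minimal sketch of what such a CUDA program looks like (the kernel name and launch configuration are illustrative, not from the slides):

```cuda
#include <cstdio>

// Illustrative kernel: each thread reports its block and thread ID.
__global__ void hello() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();        // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();  // wait for the device to finish
    return 0;
}
```

Compiled with `nvcc` from the CUDA Toolkit; it requires an NVIDIA GPU to run.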
26
 A kernel is executed as a grid of thread
blocks
 A thread block is a batch of threads that
can cooperate with each other by:
 Synchronizing their execution
 Efficiently sharing data through a low
latency shared memory
[Figure: the host launches Kernel 1 on Grid 1 (a 3×2 array of blocks) and Kernel 2 on Grid 2; Block (1, 1) of Grid 2 is expanded into a 5×3 array of threads, Thread (0, 0) through Thread (4, 2).]
27
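The grid/block hierarchy maps directly onto the kernel launch configuration. A hedged sketch matching a 3×2 grid of 5×3-thread blocks (the kernel name is a placeholder):

```cuda
dim3 dimGrid(3, 2);   // grid: Blocks (0,0) ... (2,1)
dim3 dimBlock(5, 3);  // each block: Threads (0,0) ... (4,2)
// someKernel stands in for any __global__ function:
someKernel<<<dimGrid, dimBlock>>>(/* arguments */);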
 Threads and blocks have IDs
 So each thread can decide
what data to work on
 Simplifies memory
addressing when processing
multidimensional data
[Figure: Grid 1 as a 3×2 array of blocks; Block (1, 1) expanded into a 5×3 array of threads, each with a 2-D thread ID.]
28
 Parallel computations are arranged as
grids
 One grid executes after another
 Blocks are assigned to SMs: a block runs
entirely on a single SM, but multiple
blocks can be assigned to the same SM
 A block consists of elements (threads)
29
30
31
 Demo!
32
 CUDA device driver
 CUDA Software Development Kit
 CUDA Toolkit
 You (probably) need experience with C or C++
33
 Thread block – an array of concurrent threads
that execute the same program and can
cooperate to compute the result
 A thread ID has corresponding 1-, 2-, or 3-D
indices
 Threads of a thread block share memory
34
 Each thread can:
 R/W per-thread registers
 R/W per-thread local memory
 R/W per-block shared memory
 R/W per-grid global memory
 Read only per-grid constant memory
 Read only per-grid texture memory
 The host can R/W global,
constant, and texture
memories
[Figure: CUDA memory model — each thread has its own registers and local memory, each block has shared memory, and the whole grid shares global, constant, and texture memory, which the host can also access.]
35
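A sketch of where each of these memory spaces appears in code (all names illustrative):

```cuda
__constant__ float coeff;                // per-grid constant memory (read-only in kernels)

__global__ void kernel(float *g_data) {  // g_data points into global memory
    __shared__ float tile[64];           // per-block shared memory
    int i = threadIdx.x;
    float tmp = g_data[i];               // tmp lives in per-thread registers
    tile[i] = tmp;
    __syncthreads();                     // make the shared data visible block-wide
    g_data[i] = tile[i] * coeff;
}
```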
 cudaMalloc()
 Allocates object in the device Global Memory
 Requires two parameters
▪ Address of a pointer to the allocated object
▪ Size of the allocated object
 cudaFree()
 Frees an object from device Global Memory
const int BLOCK_SIZE = 64;
float *d_f;
int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
cudaMalloc((void**)&d_f, size);
cudaFree(d_f);
36
 cudaMemcpy()
 memory data transfer
 Requires four parameters
▪ Pointer to source
▪ Pointer to destination
▪ Number of bytes copied
▪ Type of transfer
▪ Host to Host
▪ Host to Device
▪ Device to Host
▪ Device to Device
cudaMemcpy(d_f, f, size, cudaMemcpyHostToDevice);
cudaMemcpy(f, d_f, size, cudaMemcpyDeviceToHost);
[Figure: CUDA memory model (registers, local, shared, global, constant, and texture memory), as on the previous memory-spaces slide.]
37
 __global__ defines a kernel function
 Must return void
__device__ float DeviceFunc()  — executed on: device, callable from: device
__global__ void KernelFunc()   — executed on: device, callable from: host
__host__ float HostFunc()      — executed on: host, callable from: host
38
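The three qualifiers might be combined as follows (function names are illustrative):

```cuda
__device__ float square(float x) {            // device-only helper, called from kernels
    return x * x;
}

__global__ void squareAll(float *v, int n) {  // kernel, launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);
}

__host__ void launch(float *d_v, int n) {     // ordinary host function
    squareAll<<<(n + 255) / 256, 256>>>(d_v, n);
}
```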
 Allocate the memory on the GPU
 Copy the arrays ‘a’ and ‘b’ to the GPU
 Call the kernel function
 Copy the array ‘c’ back from the GPU to the CPU
 Free the memory allocated on the GPU
39
 Step 1: Allocate the memory on the GPU
int a[N], b[N], c[N];
int *d_a, *d_b, *d_c;
cudaMalloc( (void**)&d_a, N * sizeof(int) );
cudaMalloc( (void**)&d_b, N * sizeof(int) );
cudaMalloc( (void**)&d_c, N * sizeof(int) );
40
 Step 2: Copy the arrays ‘a’ and ‘b’ to the GPU
cudaMemcpy(d_a, a, N * sizeof(int),
cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, N * sizeof(int),
cudaMemcpyHostToDevice);
 Step 3: Call the kernel function
add<<<N,1>>>(d_a, d_b, d_c);
41
 Step 4: Copy the array ‘c’ back from the GPU to the
CPU
cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
 Step 5: Free the memory allocated on the GPU
cudaFree( d_a );
cudaFree( d_b );
cudaFree( d_c );
42
 kernel function
__global__ void add( int *a, int *b, int *c ) {
int tid = blockIdx.x;
if (tid < N)
c[tid] = a[tid] + b[tid];
}
43
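Putting steps 1–5 and the kernel together gives a complete program; the input values and the final printf are our additions for illustration:

```cuda
#include <cstdio>
#define N 512

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;           // one element per block
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main() {
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));

    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(d_a, d_b, d_c);   // N blocks of one thread each

    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    printf("c[10] = %d\n", c[10]);  // a[10] + b[10] = 10 + 20 = 30
    return 0;
}
```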
 We’ve seen parallel vector addition using:
 Several blocks with one thread each
 One block with several threads
44
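For comparison, the one-block/many-threads variant differs only in how tid is computed and in the launch configuration (N as on the previous slides):

```cuda
// Variant: one block of N threads instead of N blocks of one thread.
__global__ void add(int *a, int *b, int *c) {
    int tid = threadIdx.x;   // index within the single block
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
// launched as: add<<<1, N>>>(d_a, d_b, d_c);
// note: N must not exceed the per-block thread limit (512 on these GPUs)
```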
45
46
[Figure: CUDA memory model (registers, local, shared, global, constant, and texture memory).]
 P = M × N, with all matrices of size WIDTH × WIDTH
 One thread handles one element of P
 M and N are loaded WIDTH times from
global memory
[Figure: matrices M, N, and P, each WIDTH × WIDTH.]
47
 Memory latency can be hidden by keeping a
large number of threads busy
 Keep number of threads per block (block size)
and number of blocks per grid (grid size) as
large as possible
 Constant memory can be used for constant
data (variables that do not change).
 Constant memory is cached.
48
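Constant memory usage might look like this (symbol names are illustrative; cudaMemcpyToSymbol is the API for writing it from the host):

```cuda
__constant__ float scale;   // per-grid, read-only in kernels, cached on chip

__global__ void scaleAll(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= scale;      // reads are served from the constant cache
}

// Host side, before the launch:
// float h_scale = 2.0f;
// cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));
```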
49
 Each thread within the
block computes one
element of Csub
50
51
[Figure: CUDA memory model (registers, local, shared, global, constant, and texture memory).]
 Recall that the “stream processors” of the
GPU are organized as MPs (multi-processors)
and every MP has its own set of resources:
 Registers
 Local memory
 The block size needs to be chosen such that
there are enough resources in an MP to
execute a block at a time.
52
 Critical for performance
 Recommended value is 192 or 256
 Maximum value is 512
 Limited by number of registers on the MP
53
 Run with different block sizes!
[Figure: matrices M, N, and P, each WIDTH × WIDTH.]
54
55
[Chart: Data – Test 1. Runtimes (0–3000 scale) for input sizes S 128, S 512, S 1024, S 3079, and S 4096, comparing the block-16 and shared-memory versions.]
56
[Chart: runtimes (0–3500 scale) for the same input sizes, comparing the block-16, shared, block-32, block-64, block-128, and block-512 configurations.]
57