CUDA PROGRAMMING
BY Direct Parallelization & CUBLAS
C.M. WANG
Research Assistant
OUTLINE
 Preparation
 CUBLAS
 Direct Parallelization
PREPARATION
Things to do before you start coding
SET UP THE PATH
so the shell can find nvcc and the loader can find the CUDA libraries

1. Open the bash profile:
[YourAccount@John ~]$ ls -a
[YourAccount@John ~]$ vi .bash_profile

2. Add these lines to the file:
export PATH=/usr/local/cuda-5.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-5.0/lib:/usr/local/cuda-5.0/lib64:$LD_LIBRARY_PATH
CREATE THE MAKEFILE
to configure the compilation of your source code

Create a makefile something like this (the command line under the target must start with a tab):
MAIN=filename
${MAIN}.e:
	nvcc ${MAIN}.cu -o ${MAIN}.e -m64 -arch sm_35 -lcublas -O3
CHECK YOUR MEMORY
so the GPU can actually store the data in the computation

Global memory available on one card: 5 GB.
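One way to check this from inside a program (instead of running nvidia-smi) is cudaMemGetInfo; a minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void){
    size_t free_bytes, total_bytes;
    cudaSetDevice(0);                              /* the card you plan to use         */
    cudaMemGetInfo(&free_bytes, &total_bytes);     /* free / total global memory on it */
    printf("free: %zu MB, total: %zu MB\n",
           free_bytes >> 20, total_bytes >> 20);
    return 0;
}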
CUBLAS
BLAS (Basic Linear Algebra Subprograms) implemented on the GPU via CUDA
CUBLAS
Declaration of Host Array
Declaration of Device Array
Assignment of CUDA Device
Allocation of Device Array
Initialization of CUBLAS
Copy Array : Host to Device
Execution of CUBLAS
Copy Array : Device to Host
Termination of CUBLAS
De-allocation of Device Array
#include <cuda_runtime.h>
#include <cublas_v2.h>
…
double* M;   /* host matrix */
double* m;   /* device matrix */
/* similar for V & v & A & a */
…
cudaSetDevice(0);
…
TS=sizeof(double);
size=n*n*TS;
cudaMalloc( (void**)&m, size );
/* similar for V & v & A & a */
…
cublasStatus_t status;
cublasHandle_t handle;
status=cublasCreate(&handle);
...
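Since the return value of cublasCreate is captured in status above, a minimal sketch of checking it (assuming <stdio.h> is included and this runs in a function that can return an error code):

if( status != CUBLAS_STATUS_SUCCESS ){
    fprintf(stderr, "cublasCreate failed\n");
    return 1;
}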
…
cublasSetVector(n*n,TS,M,1,m,1);
/* similar for V & v */
…
cublasDgemv( handle,
CUBLAS_OP_N,
n, n, &alpha, m, n, v, 1, &beta, a, 1
);
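cublasDgemv computes a = alpha * M * v + beta * a; CUBLAS_OP_N means M is used untransposed, and CUBLAS assumes column-major storage. The snippet assumes alpha and beta were declared earlier on the host; for a plain matrix-vector product they would be:

double alpha = 1.0, beta = 0.0;   /* a = 1.0 * M * v + 0.0 * a = M * v */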
…
cublasGetVector(n,TS,a,1,A,1);
…
cublasDestroy(handle);
…
cudaFree(m);
/* similar for v & a */
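For reference, here are the pieces above assembled into one compilable program. It is a minimal sketch with no error checking; n and the fill values are arbitrary choices made for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void){
    int n = 4;
    size_t TS = sizeof(double);
    size_t size = n * n * TS;

    /* host arrays: matrix M (column-major), input vector V, result A */
    double *M = (double*)malloc(size);
    double *V = (double*)malloc(n * TS);
    double *A = (double*)malloc(n * TS);
    for (int j = 0; j < n; j++){
        V[j] = 1.0;
        for (int i = 0; i < n; i++) M[j*n + i] = (i == j) ? 2.0 : 0.0;
    }

    /* device arrays */
    double *m, *v, *a;
    cudaSetDevice(0);
    cudaMalloc((void**)&m, size);
    cudaMalloc((void**)&v, n * TS);
    cudaMalloc((void**)&a, n * TS);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* copy host -> device */
    cublasSetVector(n*n, TS, M, 1, m, 1);
    cublasSetVector(n,   TS, V, 1, v, 1);

    /* a = 1.0 * M * v + 0.0 * a */
    double alpha = 1.0, beta = 0.0;
    cublasDgemv(handle, CUBLAS_OP_N, n, n, &alpha, m, n, v, 1, &beta, a, 1);

    /* copy the result back and print it */
    cublasGetVector(n, TS, a, 1, A, 1);
    for (int i = 0; i < n; i++) printf("A[%d] = %f\n", i, A[i]);

    cublasDestroy(handle);
    cudaFree(m); cudaFree(v); cudaFree(a);
    free(M); free(V); free(A);
    return 0;
}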
DIRECT PARALLELIZATION
Assign the job to each thread directly
Direct Parallelization

[Figure: the CUDA thread hierarchy. A grid contains blocks and each block contains threads. The grid size is gridDim.x by gridDim.y, a block's position in the grid is (blockIdx.x, blockIdx.y), the block size is blockDim.x by blockDim.y, and a thread's position inside its block is (threadIdx.x, threadIdx.y).]
Direct Parallelization

GPU_ID = blockDim.x * blockIdx.x + threadIdx.x

[Figure: a one-dimensional grid of 5 blocks with 2 threads each. The thread with blockIdx.x = 3 and threadIdx.x = 1 gets GPU_ID = 2 * 3 + 1 = 7; across the grid the GPU_IDs run 0 1 2 3 4 5 6 7 8 9.]
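As a quick sanity check, here is a sketch of a kernel (hypothetical name ids, writing into a device array out of length gridDim.x * blockDim.x) that stores each thread's GPU_ID; launched as ids<<<5,2>>>(out) it reproduces the numbering 0 to 9 above:

__global__ void ids(int *out){
    int gpu_id = blockDim.x * blockIdx.x + threadIdx.x;   /* global thread index */
    out[gpu_id] = gpu_id;
}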
#include <cuda_runtime.h>
#define IJToIdx(i,j,n) ((j)*(n)+(i))   /* column-major index of element (i,j) in an n-by-n matrix */
…
double* M;   /* host matrix */
double* m;   /* device matrix */
/* similar for V & v & A & a */
…
cudaSetDevice(0);
Direct Parallelization
Declaration of Host Array
Declaration of Device Array
Assignment of CUDA Device
Allocation of Device Array
Copy Array : Host to Device
Determination of Size for Grid & Block
Execution of Kernel
Copy Array : Device to Host
De-allocation of Device Array
…
TS=sizeof(double);
size=n*n*TS;
cudaMalloc( (void**)&m, size );
/* similar for V & v & A & a */
…
cudaMemcpy(m, M, size,
cudaMemcpyHostToDevice);
/* similar for V & v */
...
1. Memory assessment
2. Memory alignment
3. Data flow

Use as many threads as possible: one thread per entry a[i] of the result, where
a[i] = m[i][1]*v[1] + … + m[i][j]*v[j] + … + m[i][n]*v[n]
…
My_Dgemv<<<n,1>>>( … );
…
__global__ void My_Dgemv( … ){
    /* algorithm for the matrix-vector product a = M * v */
}
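A note on <<<n,1>>>: it launches n blocks of one thread each, so GPU_ID reduces to blockIdx.x, which is exactly what the kernel body below uses. A more common configuration packs several threads into each block; a sketch, assuming a block size of 256 chosen for illustration:

int threads = 256;                          /* assumed block size            */
int blocks  = (n + threads - 1) / threads;  /* enough blocks to cover n rows */
My_Dgemv<<<blocks, threads>>>( … );
/* inside the kernel the row index then becomes
   i = blockDim.x * blockIdx.x + threadIdx.x,
   guarded by "if (i >= n) return;" for the spare threads in the last block */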
__global__ void My_Dgemv( … ){
…
id = blockIdx.x;       /* one block (of one thread) per row */
i = id;
a[i] = 0;
for(j = 0; j < n; j++){
    a[i] = a[i] + m[ IJToIdx(i,j,n) ] * v[j];
}
}
…
cudaMemcpy(M, m, size,
cudaMemcpyDeviceToHost);
/* similar for A & a */
...
cudaFree(m);
/* similar for v & a */
…
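Putting the direct-parallelization pieces together, here is a minimal self-contained sketch. It follows the <<<n,1>>> launch above (one block of one thread per row); the kernel parameters, n, and the fill values are filled in here for illustration since the slides elide them, and error checking is omitted.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define IJToIdx(i,j,n) ((j)*(n)+(i))        /* column-major indexing, as above */

__global__ void My_Dgemv(int n, const double *m, const double *v, double *a){
    int i = blockIdx.x;                     /* one block (of one thread) per row */
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += m[IJToIdx(i,j,n)] * v[j];
    a[i] = sum;                             /* a = M * v */
}

int main(void){
    int n = 4;
    size_t TS = sizeof(double), size = n * n * TS;

    /* host arrays: matrix M, input vector V, result A */
    double *M = (double*)malloc(size);
    double *V = (double*)malloc(n * TS);
    double *A = (double*)malloc(n * TS);
    for (int j = 0; j < n; j++){
        V[j] = 1.0;
        for (int i = 0; i < n; i++) M[IJToIdx(i,j,n)] = (i == j) ? 2.0 : 0.0;
    }

    /* device arrays */
    double *m, *v, *a;
    cudaSetDevice(0);
    cudaMalloc((void**)&m, size);
    cudaMalloc((void**)&v, n * TS);
    cudaMalloc((void**)&a, n * TS);
    cudaMemcpy(m, M, size,   cudaMemcpyHostToDevice);
    cudaMemcpy(v, V, n * TS, cudaMemcpyHostToDevice);

    My_Dgemv<<<n, 1>>>(n, m, v, a);         /* n blocks, 1 thread each */

    cudaMemcpy(A, a, n * TS, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("A[%d] = %f\n", i, A[i]);

    cudaFree(m); cudaFree(v); cudaFree(a);
    free(M); free(V); free(A);
    return 0;
}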
Performance
[Figure: execution time (ms) versus the dimension of the vector.]
END OF THE REPORT
Thank you for your attention
C.M. WANG
Research Assistant