2013 0928 programming by cuda

Trying a new style for my presentation

Transcript

  • 1. CUDA PROGRAMMING BY Direct Parallelization & CUBLAS. C.M. WANG, Research Assistant
  • 2. OUTLINE: Preparation; CUBLAS; Direct Parallelization
  • 3. PREPARATION: the things to do before you start coding
  • 4. SET UP THE PATH so your compiler knows where to find the libraries
    1. Open the bash profile:
       [YourAccount@John ~]$ ls -a
       [YourAccount@John ~]$ vi .bash_profile
    2. Add these lines to the file:
       export PATH=/usr/local/cuda-5.0/bin:$PATH
       export LD_LIBRARY_PATH=/usr/local/cuda-5.0/lib:/usr/local/cuda-5.0/lib64:$LD_LIBRARY_PATH
  • 5. CREATE THE MAKEFILE to configure your compilation for the source code
    Create a makefile something like this (the nvcc line must start with a tab):
       MAIN=filename
       ${MAIN}.e: ${MAIN}.cu
       	nvcc ${MAIN}.cu -o ${MAIN}.e -m64 -arch sm_35 -lcublas -O3
  • 6. CHECK YOUR MEMORY so the GPU can actually store the data used in the computation. Global memory available on one card: 5 GB.
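The memory check can be done on paper before any allocation. The sketch below estimates the device bytes needed for the arrays used later in the slides (an n x n matrix `m` and two length-n vectors `v` and `a`, all `double`). The function names are hypothetical, and the capacity passed in is an assumption: the slide's "5 GB" is a nominal figure, and the amount actually free on a card is usually somewhat lower.

```c
#include <stdint.h>

/* Bytes needed on the device for an n x n matrix plus two length-n
   vectors, all stored as double (hypothetical helper, not from the slides). */
uint64_t dgemv_device_bytes(uint64_t n)
{
    return (n * n + 2 * n) * (uint64_t)sizeof(double);
}

/* Returns 1 if those arrays fit in capacity_bytes of global memory, else 0. */
int fits_on_device(uint64_t n, uint64_t capacity_bytes)
{
    return dgemv_device_bytes(n) <= capacity_bytes;
}
```

With a capacity of 5,000,000,000 bytes, n = 20000 fits (about 3.2 GB) while n = 30000 does not (about 7.2 GB). At run time, `cudaMemGetInfo` reports the free and total device memory, which is a more reliable figure than the nominal one.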
  • 7. CUBLAS BLAS implemented on GPU via CUDA
  • 8. CUBLAS
    Workflow:
       1. Declaration of Host Array
       2. Declaration of Device Array
       3. Assignment of CUDA Device
       4. Allocation of Device Array
       5. Initialization of CUBLAS
       6. Copy Array: Host to Device
       7. Execution of CUBLAS
       8. Copy Array: Device to Host
       9. Termination of CUBLAS
       10. De-allocation of Device Array
    Declarations and device assignment:
       #include <cuda_runtime.h>
       #include <cublas_v2.h>
       …
       double* M;   /* host array */
       double* m;   /* device array */
       /* similar for V & v & A & a */
       …
       cudaSetDevice(0);
  • 9. CUBLAS: Allocation of Device Array; Initialization of CUBLAS
       …
       TS=sizeof(double);
       size=n*n*TS;
       cudaMalloc( (void**)&m, size );
       /* similar for v & a, with n*TS bytes each */
       …
       cublasStatus_t status;
       cublasHandle_t handle;
       status=cublasCreate(&handle);
       ...
  • 10. CUBLAS: Copy Array: Host to Device; Execution of CUBLAS
       …
       cublasSetVector(n*n,TS,M,1,m,1);
       /* similar for V & v */
       …
       cublasDgemv( handle, CUBLAS_OP_N, n, n, &alpha, m, n, v, 1, &beta, a, 1 );
  • 11. CUBLAS: Copy Array: Device to Host; Termination of CUBLAS; De-allocation of Device Array
       …
       cublasGetVector(n,TS,a,1,A,1);
       …
       cublasDestroy(handle);
       …
       cudaFree(m);
       /* similar for v & a */
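To check the result that comes back from `cublasDgemv`, it helps to have a CPU version of the same operation. The sketch below is a plain C reference for the non-transposed case used on the slides (`CUBLAS_OP_N`), computing y = alpha * A * x + beta * y with A stored column-major, as BLAS expects. The name `ref_dgemv_n` is hypothetical and the routine is for illustration, not a replacement for the library call.

```c
#include <stddef.h>

/* CPU reference for y = alpha * A * x + beta * y (no transpose).
   A is rows x cols, stored column-major with leading dimension lda,
   so element (i, j) lives at A[j * lda + i]. */
void ref_dgemv_n(size_t rows, size_t cols, double alpha,
                 const double *A, size_t lda,
                 const double *x, double beta, double *y)
{
    for (size_t i = 0; i < rows; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < cols; ++j) {
            sum += A[j * lda + i] * x[j];
        }
        /* When beta is zero, y is treated as output only and is not read. */
        y[i] = (beta == 0.0) ? alpha * sum : alpha * sum + beta * y[i];
    }
}
```

For the call on the slides, the matching reference call would be `ref_dgemv_n(n, n, alpha, M, n, V, beta, A)` on the host arrays, and its output can be compared element by element with the vector copied back by `cublasGetVector`.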
  • 12. DIRECT PARALLELIZATION: assign the job to each thread directly
  • 13. Direct Parallelization
  • 14. Direct Parallelization
  • 15. Direct Parallelization [Diagram: a grid made of blocks, each block made of threads; the grid size is gridDim.x by gridDim.y]
  • 16. Direct Parallelization [Diagram: the block at position (1,1) in the grid, located by blockIdx.x and blockIdx.y]
  • 17. Direct Parallelization [Diagram: the thread at position (0,1) in a block of size blockDim.x by blockDim.y, located by threadIdx.x and threadIdx.y]
  • 18. Direct Parallelization: GPU_ID = blockDim.x * blockIdx.x + threadIdx.x [Diagram: a grid of five blocks with two threads each]
  • 19. Direct Parallelization: GPU_ID = blockDim.x * blockIdx.x + threadIdx.x
    Example: GPU_ID = 2 * 3 + 1 = 7
    GPU_ID over the grid of five blocks with two threads each: 0 1 2 3 4 5 6 7 8 9
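The index formula can be tried out on the CPU before writing a kernel. The sketch below mirrors `blockDim.x * blockIdx.x + threadIdx.x` for a one-dimensional launch and lists the index every (block, thread) pair would receive. The function names are hypothetical, and the sample values (block size 2, block 3, thread 1) are one reading of the numbers shown in the slide's diagram.

```c
/* Global index of a thread in a 1-D launch, mirroring
   blockDim.x * blockIdx.x + threadIdx.x on the device. */
unsigned int global_id(unsigned int block_dim_x,
                       unsigned int block_idx_x,
                       unsigned int thread_idx_x)
{
    return block_dim_x * block_idx_x + thread_idx_x;
}

/* Fills ids[] (length grid_dim_x * block_dim_x) with the index that each
   (block, thread) pair would get, visiting blocks and threads in order. */
void enumerate_ids(unsigned int grid_dim_x, unsigned int block_dim_x,
                   unsigned int *ids)
{
    unsigned int k = 0;
    for (unsigned int b = 0; b < grid_dim_x; ++b) {
        for (unsigned int t = 0; t < block_dim_x; ++t) {
            ids[k++] = global_id(block_dim_x, b, t);
        }
    }
}
```

With five blocks of two threads, `enumerate_ids(5, 2, ids)` yields 0 through 9 with no gaps or repeats, which is why the formula can be used directly as an array index.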
  • 20. Direct Parallelization
    Workflow:
       1. Declaration of Host Array
       2. Declaration of Device Array
       3. Assignment of CUDA Device
       4. Allocation of Device Array
       5. Copy Array: Host to Device
       6. Determination of Size for Grid & Block
       7. Execution of Kernel
       8. Copy Array: Device to Host
       9. De-allocation of Device Array
    Declarations and device assignment:
       #include <cuda_runtime.h>
       #define IJToIdx(i,j,n) ((j)*(n)+(i))   /* column-major index of element (i,j) */
       …
       double* M;   /* host array */
       double* m;   /* device array */
       /* similar for V & v & A & a */
       …
       cudaSetDevice(0);
  • 21. Direct Parallelization: Allocation of Device Array; Copy Array: Host to Device
       …
       TS=sizeof(double);
       size=n*n*TS;
       cudaMalloc( (void**)&m, size );
       /* similar for v & a, with n*TS bytes each */
       …
       cudaMemcpy(m, M, size, cudaMemcpyHostToDevice);
       /* similar for V & v */
       ...
  • 22. Direct Parallelization: Determination of Size for Grid & Block
    Things to consider:
       1. Memory assessment
       2. Memory alignment
       3. Data flow
    Use as many threads as possible. Each element of the result is an independent row-times-vector product:
       a[i] = m[i1]*v[1] + … + m[ij]*v[j] + … + m[in]*v[n]
  • 23. Direct Parallelization: Determination of Size for Grid & Block; Execution of Kernel
       …
       My_Dgemv<<<n,1>>>( … );   /* n blocks, 1 thread per block */
       …
       __global__ void My_Dgemv( … ){
         /* algorithm for MV */
       }
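The launch on the slide uses n blocks of one thread each. If more than one thread per block is wanted, the grid size has to be rounded up so that every row still gets a thread. The helper below is a minimal sketch of that rounding; `blocks_needed` and `threads_per_block` are hypothetical names, not part of the slides or of the CUDA API.

```c
/* Number of blocks needed so that blocks * threads_per_block >= n.
   Assumes threads_per_block > 0 and that n + threads_per_block - 1
   does not overflow unsigned int. */
unsigned int blocks_needed(unsigned int n, unsigned int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With one thread per block this reduces to n blocks, matching the `<<<n,1>>>` launch. With a larger block size, the last block may contain threads whose index is n or more, so a kernel launched this way would need to compute its row from the global index and return early when that index is out of range.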
  • 24. Direct Parallelization: Execution of Kernel
       __global__ void My_Dgemv( … ){
         …
         id=blockIdx.x;   /* one block per row of the matrix */
         i=id;
         a[i]=0;
         for(j=0; j<n; j++){
           a[i]=a[i]+m[ IJToIdx(i,j,n) ]*v[j];
         }
       }
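The kernel logic can be checked on the CPU by running the per-block work in an ordinary loop. The sketch below is a serial stand-in for a launch with n blocks of one thread, where block i computes the dot product of row i of the column-major matrix with the input vector. The function names and the upper-case macro are hypothetical, and the argument list is an assumption, since the slides elide the kernel's parameters.

```c
#include <stddef.h>

/* Column-major index of element (i, j) in an n x n matrix. */
#define IJ_TO_IDX(i, j, n) ((j) * (n) + (i))

/* Work done by the single thread of block block_idx:
   a[i] = sum over j of m(i, j) * v[j], with i = block_idx. */
static void my_dgemv_block(size_t block_idx, size_t n,
                           const double *m, const double *v, double *a)
{
    size_t i = block_idx;
    double sum = 0.0;
    for (size_t j = 0; j < n; ++j) {
        sum += m[IJ_TO_IDX(i, j, n)] * v[j];
    }
    a[i] = sum;
}

/* Serial stand-in for a launch of n blocks with one thread each:
   the blocks run one after another instead of in parallel. */
void emulate_my_dgemv(size_t n, const double *m, const double *v, double *a)
{
    for (size_t b = 0; b < n; ++b) {
        my_dgemv_block(b, n, m, v, a);
    }
}
```

Each block writes only its own `a[i]` and reads only `m` and `v`, so the order in which blocks run does not affect the result. Accumulating into a local variable and writing `a[i]` once also avoids repeated reads and writes of the output array inside the loop.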
  • 25. Direct Parallelization: Copy Array: Device to Host; De-allocation of Device Array
       …
       cudaMemcpy(A, a, n*TS, cudaMemcpyDeviceToHost);   /* copy the result vector back */
       ...
       cudaFree(m);
       /* similar for v & a */
       …
  • 26. Performance [Plot: time (ms) versus dimension of vector]
  • 27. END OF THE REPORT. Thank you for your attention. C.M. WANG, Research Assistant
  • 28. Color theme (Kuler: copy of stormy orange): RGB 255,102,0; RGB 255,255,250; RGB 91,96,95; RGB 161,161,148; background RGB 70,70,70. https://kuler.adobe.com/Copy-of-Stormy-Orange-color-theme-2828733/