2013 0928 programming by cuda

Trying a new style for my presentation

Published in: Technology

  1. 1. CUDA PROGRAMMING by Direct Parallelization & CUBLAS. C.M. WANG, Research Assistant
  2. 2. OUTLINE: Preparation; CUBLAS; Direct Parallelization
  3. 3. PREPARATION: Things to do before you start coding
  4. 4. SET UP THE PATH so your compiler knows where to find the libraries
      1. Open the bash profile:
         [YourAccount@John ~]$ ls -a
         [YourAccount@John ~]$ vi .bash_profile
      2. Add these lines to the file:
         export PATH=/usr/local/cuda-5.0/bin:$PATH
         export LD_LIBRARY_PATH=/usr/local/cuda-5.0/lib:/usr/local/cuda-5.0/lib64:$LD_LIBRARY_PATH
  5. 5. CREATE THE MAKEFILE to configure compilation of your source code
      Create a makefile something like this (the nvcc line must begin with a tab character):
      MAIN=filename
      ${MAIN}.e: ${MAIN}.cu
          nvcc ${MAIN}.cu -o ${MAIN}.e -m64 -arch sm_35 -lcublas -O3
  6. 6. CHECK YOUR MEMORY so the GPU can actually store the data used in the computation. Global memory available on one card: 5 GB.
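The memory check can also be done from inside a program. The following is a minimal sketch, assuming device 0 is the card of interest; `fits_on_device` is a hypothetical helper name, not part of CUDA. It uses `cudaMemGetInfo` to ask how much global memory is currently free.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if `bytes` of global memory are
// currently free on `device`, false otherwise (or if any CUDA call fails).
bool fits_on_device(int device, size_t bytes) {
    assert(device >= 0);
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaSetDevice(device) != cudaSuccess) return false;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) return false;
    printf("device %d: %zu MiB free of %zu MiB\n",
           device, free_bytes >> 20, total_bytes >> 20);
    return bytes <= free_bytes;
}
```

For example, an n x n matrix of doubles needs n*n*sizeof(double) bytes, so n = 20000 needs 3.2 GB, and `fits_on_device(0, 3200000000)` would report whether that much is free.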
  7. 7. CUBLAS: BLAS implemented on the GPU via CUDA
  8. 8. CUBLAS
      Steps: Declaration of Host Array; Declaration of Device Array; Assignment of CUDA Device; Allocation of Device Array; Initialization of CUBLAS; Copy Array: Host to Device; Execution of CUBLAS; Copy Array: Device to Host; Termination of CUBLAS; De-allocation of Device Array
      #include <cuda_runtime.h>
      #include <cublas_v2.h>
      …
      double* M; /* host array */
      double* m; /* device array */
      /* similar for V & v & A & a */
      …
      cudaSetDevice(0);
  9. 9. CUBLAS (Allocation of Device Array; Initialization of CUBLAS)
      …
      TS=sizeof(double);
      size=n*n*TS;
      cudaMalloc( (void**)&m, size ); /* similar for v & a, with n*TS bytes each */
      …
      cublasStatus_t status;
      cublasHandle_t handle;
      status=cublasCreate(&handle);
      ...
  10. 10. CUBLAS (Copy Array: Host to Device; Execution of CUBLAS)
      …
      cublasSetVector(n*n,TS,M,1,m,1); /* similar for V & v */
      …
      cublasDgemv( handle, CUBLAS_OP_N, n, n, &alpha, m, n, v, 1, &beta, a, 1 );
  11. 11. CUBLAS (Copy Array: Device to Host; Termination of CUBLAS; De-allocation of Device Array)
      …
      cublasGetVector(n,TS,a,1,A,1);
      …
      cublasDestroy(handle);
      …
      cudaFree(m); /* similar for v & a */
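The CUBLAS steps on the preceding slides can be assembled into one host function. This is a sketch under stated assumptions rather than the presenter's actual program: the function name `gemv_cublas` and the error-handling style are illustrative, the matrix is assumed to be stored column-major (as CUBLAS expects), and alpha = 1, beta = 0 are assumed so that the result is simply A = M * V.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical wrapper: computes A = M * V on device 0 with cublasDgemv.
// M is a column-major n x n host matrix; V and A are host vectors of length n.
// Returns 0 on success, -1 if any CUDA or CUBLAS call fails.
int gemv_cublas(int n, const double* M, const double* V, double* A) {
    assert(n > 0);
    const double alpha = 1.0, beta = 0.0;   // assumed scalars: a = 1*m*v + 0*a
    double *m = NULL, *v = NULL, *a = NULL; // device arrays
    cublasHandle_t handle;
    int ok = -1;

    if (cudaSetDevice(0) != cudaSuccess) return -1;                // assign device
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return -1; // init CUBLAS

    if (cudaMalloc((void**)&m, sizeof(double) * n * n) == cudaSuccess &&
        cudaMalloc((void**)&v, sizeof(double) * n) == cudaSuccess &&
        cudaMalloc((void**)&a, sizeof(double) * n) == cudaSuccess &&
        // copy host to device
        cublasSetVector(n * n, sizeof(double), M, 1, m, 1) == CUBLAS_STATUS_SUCCESS &&
        cublasSetVector(n, sizeof(double), V, 1, v, 1) == CUBLAS_STATUS_SUCCESS &&
        // execute: a = alpha * m * v + beta * a
        cublasDgemv(handle, CUBLAS_OP_N, n, n, &alpha, m, n, v, 1, &beta, a, 1)
            == CUBLAS_STATUS_SUCCESS &&
        // copy device to host
        cublasGetVector(n, sizeof(double), a, 1, A, 1) == CUBLAS_STATUS_SUCCESS) {
        ok = 0;
    }

    cudaFree(a);            // cudaFree(NULL) is a no-op, so this is safe
    cudaFree(v);            // even if an allocation failed
    cudaFree(m);
    cublasDestroy(handle);  // terminate CUBLAS
    return ok;
}
```

With the makefile shown earlier, `-lcublas` links the library this function needs.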
  12. 12. DIRECT PARALLELIZATION: Assign the work to each thread directly
  13. 13. Direct Parallelization
  14. 14. Direct Parallelization
  15. 15. Direct Parallelization [Diagram: a grid made of blocks, each block made of threads; the grid's extent is gridDim.x by gridDim.y]
  16. 16. Direct Parallelization [Diagram: inside the grid (gridDim.x by gridDim.y), the block at (blockIdx.x, blockIdx.y) = (1,1) is marked]
  17. 17. Direct Parallelization [Diagram: inside a block (blockDim.x by blockDim.y), the thread at (threadIdx.x, threadIdx.y) = (0,1) is marked]
  18. 18. Direct Parallelization: GPU_ID = blockDim.x * blockIdx.x + threadIdx.x [Diagram: a grid of 5 blocks with 2 threads each]
  19. 19. Direct Parallelization: GPU_ID = blockDim.x * blockIdx.x + threadIdx.x. Example with 5 blocks of 2 threads each: the thread with blockDim.x = 2, blockIdx.x = 3, threadIdx.x = 1 has GPU_ID = 2 * 3 + 1 = 7. GPU_ID across the grid: 0 1 2 3 4 5 6 7 8 9
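The index formula can be tried out with a tiny kernel in which every thread records its own GPU_ID. This is an illustrative sketch; the names `write_gpu_id` and `collect_gpu_ids` are hypothetical and do not come from the slides.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Each thread computes its global index and stores it at that position.
__global__ void write_gpu_id(int* out, int count) {
    int gpu_id = blockDim.x * blockIdx.x + threadIdx.x;
    if (gpu_id < count) out[gpu_id] = gpu_id;
}

// Hypothetical host helper: launches `blocks` blocks of `threads` threads and
// copies the recorded IDs into host_out (blocks * threads entries).
// Returns 0 on success, -1 on failure.
int collect_gpu_ids(int blocks, int threads, int* host_out) {
    assert(blocks > 0 && threads > 0);
    int count = blocks * threads;
    int* dev_out = NULL;
    if (cudaMalloc((void**)&dev_out, count * sizeof(int)) != cudaSuccess) return -1;
    write_gpu_id<<<blocks, threads>>>(dev_out, count);
    // cudaMemcpy waits for the kernel and also reports earlier launch errors.
    cudaError_t err = cudaMemcpy(host_out, dev_out, count * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return err == cudaSuccess ? 0 : -1;
}
```

Calling `collect_gpu_ids(5, 2, ids)` corresponds to the 5-block, 2-thread layout of the diagram; the entry written by block 3, thread 1 is `ids[7]`.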
  20. 20. Direct Parallelization
      Steps: Declaration of Host Array; Declaration of Device Array; Assignment of CUDA Device; Allocation of Device Array; Copy Array: Host to Device; Determination of Size for Grid & Block; Execution of Kernel; Copy Array: Device to Host; De-allocation of Device Array
      #include <cuda_runtime.h>
      #define IJToIdx(i,j,n) ((j)*(n)+(i))
      …
      double* M; /* host array */
      double* m; /* device array */
      /* similar for V & v & A & a */
      …
      cudaSetDevice(0);
  21. 21. Direct Parallelization (Allocation of Device Array; Copy Array: Host to Device)
      …
      TS=sizeof(double);
      size=n*n*TS;
      cudaMalloc( (void**)&m, size ); /* similar for v & a, with n*TS bytes each */
      …
      cudaMemcpy(m, M, size, cudaMemcpyHostToDevice); /* similar for V & v */
      ...
  22. 22. Direct Parallelization (Determination of Size for Grid & Block)
      Use as many threads as possible, considering:
      1. Memory assessment
      2. Memory alignment
      3. Data flow
      Each output element is one row of the matrix times the vector: a[i] = m[i1]*v[1] + … + m[ij]*v[j] + … + m[in]*v[n]
  23. 23. Direct Parallelization (Determination of Size for Grid & Block; Execution of Kernel)
      …
      My_Dgemv<<<n,1>>>( … ); /* n blocks of 1 thread each */
      …
      __global__ void My_Dgemv( … ){ /* algorithm for MV */ }
  24. 24. Direct Parallelization (Execution of Kernel)
      __global__ void My_Dgemv( … ){
        …
        id=blockIdx.x; /* one block per row of the matrix */
        i=id;
        a[i]=0;
        for(j=0;j<n;j++){
          a[i]=a[i]+m[ IJToIdx(i,j,n) ]*v[j];
        }
      }
  25. 25. Direct Parallelization (Copy Array: Device to Host; De-allocation of Device Array)
      …
      cudaMemcpy(A, a, n*TS, cudaMemcpyDeviceToHost); /* only the result a has to come back */
      ...
      cudaFree(m); /* similar for v & a */
      …
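Put together, the direct-parallelization steps might look like the sketch below. It is a variation on the slides, not the presenter's code: instead of the `<<<n,1>>>` launch it assumes 256 threads per block with a bounds check, in the spirit of "use as many threads as possible". The names `my_dgemv` and `gemv_direct` and the block size are assumptions.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define IJToIdx(i, j, n) ((j) * (n) + (i))  // column-major, as on the slides

// One thread per output row: a[i] = sum over j of m[i][j] * v[j].
__global__ void my_dgemv(int n, const double* m, const double* v, double* a) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= n) return;                      // surplus threads in the last block
    double sum = 0.0;                        // accumulate in a register
    for (int j = 0; j < n; j++) sum += m[IJToIdx(i, j, n)] * v[j];
    a[i] = sum;
}

// Hypothetical host wrapper: A = M * V for column-major n x n M.
// Returns 0 on success, -1 on any CUDA failure.
int gemv_direct(int n, const double* M, const double* V, double* A) {
    assert(n > 0);
    size_t mat = sizeof(double) * n * n, vec = sizeof(double) * n;
    double *m = NULL, *v = NULL, *a = NULL;  // device arrays
    int ok = -1;

    if (cudaSetDevice(0) != cudaSuccess) return -1;
    if (cudaMalloc((void**)&m, mat) == cudaSuccess &&
        cudaMalloc((void**)&v, vec) == cudaSuccess &&
        cudaMalloc((void**)&a, vec) == cudaSuccess &&
        cudaMemcpy(m, M, mat, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(v, V, vec, cudaMemcpyHostToDevice) == cudaSuccess) {
        int threads = 256;                         // assumed block size
        int blocks = (n + threads - 1) / threads;  // round up to cover all rows
        my_dgemv<<<blocks, threads>>>(n, m, v, a);
        // The copy waits for the kernel and surfaces any launch error.
        if (cudaMemcpy(A, a, vec, cudaMemcpyDeviceToHost) == cudaSuccess) ok = 0;
    }
    cudaFree(a);
    cudaFree(v);
    cudaFree(m);
    return ok;
}
```

With column-major storage, neighbouring threads read neighbouring elements of m for each j, which is one reason to map rows to consecutive thread indices.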
  26. 26. Performance [Plot: time (ms) versus dimension of vector]
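The slides do not say how the timings were taken. One common way to obtain a kernel time in milliseconds is with CUDA events, sketched below under that assumption. The `scale` kernel is only a placeholder workload and `time_scale_ms` is a hypothetical name; the kernel being measured would take its place.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Placeholder workload (hypothetical): multiplies each element by s.
__global__ void scale(double* x, int n, double s) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// Returns the GPU time in milliseconds for one launch of `scale` on n
// elements, or -1.0f if something fails.
float time_scale_ms(int n) {
    assert(n > 0);
    double* x = NULL;
    cudaEvent_t start, stop;
    float ms = -1.0f;
    if (cudaMalloc((void**)&x, sizeof(double) * n) != cudaSuccess) return -1.0f;
    cudaMemset(x, 0, sizeof(double) * n);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);                    // mark the start on the GPU timeline
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0);
    cudaEventRecord(stop);                     // mark the end
    if (cudaEventSynchronize(stop) == cudaSuccess)  // wait until `stop` is reached
        cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return ms;
}
```

Repeating the measurement for several values of n would give points for a time-versus-dimension curve like the one on the slide.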
  27. 27. END OF THE REPORT. Thank you for your attention. C.M. WANG, Research Assistant
  28. 28. Color theme (Kuler: copy of stormy orange, https://kuler.adobe.com/Copy-of-Stormy-Orange-color-theme-2828733/): RGB 255,102,0; RGB 255,255,250; RGB 91,96,95; RGB 161,161,148; background RGB 70,70,70
