Cuda 2011


  1. Programming Multi-Core Processors: A Hands-On Approach for Embedded, Mobile, and Distributed Systems Development
  2. Parallel Computing with Compute Unified Device Architecture (CUDA)
     Ghulam Mustafa
  3. Course Outline
     - Parallel computing with CUDA
       - CUDA for multi-core architectures
       - Memory architecture
       - Host-GPU workload partitioning
       - Programming paradigm
       - Programming examples
  4. Agenda for Today
     - Introduction to CUDA
       - Memory architecture
       - Host-GPU workload partitioning and mapping
       - Suitable applications
     - Programming paradigm
     - Lab exercises
       - Hello World
       - Matrix multiplication
       - Numerical computation of pi
       - Parallel sort
  5. General Purpose GPU
     - Highly parallel, multithreaded, many-core processor
     - Very high memory bandwidth
     - More transistors devoted to data processing
     - Data-parallel algorithms leverage GPU attributes
     - Fine-grain SIMD parallelism
     - Low-latency floating-point (FP) computation
     (Pictured: GeForce 8800, Tesla S870, Tesla D870)
  6. GPGPU vs. CPU
  7. GPU Architecture
  8. Memories
  9. Challenge
     - Develop parallel applications that are
       - Transparently scalable
       - Adaptable to an increasing number of cores
     - Solution:
       - Automated parallel software development frameworks
  10. CUDA: Compute Unified Device Architecture
     - Introduced by NVIDIA in November 2006
     - A general-purpose parallel computing architecture with
       - A new parallel programming model
       - A new instruction set architecture
       - Three key abstractions
         - A hierarchy of thread groups
         - Shared memories
         - Barrier synchronization
     - CUDA provides
       - A minimal set of language extensions
       - Fine-grained data and thread parallelism nested within coarse-grained data and task parallelism
  11. Task Decomposition
     - Partition the problem into coarse, independent sub-problems
     - Partition the independent sub-problems into cooperative, finer pieces
     - Threads cooperate when solving each sub-problem
     - Each sub-problem can be scheduled to any available core
     - A compiled CUDA program can execute on any number of cores
       - Only the runtime system needs to know the actual number of cores
  12. Task Decomposition
  13. Benefits
     - Just express the task parallelism
       - Don't worry about the implementation
     - Supports heterogeneous computation
       - Applications use both the CPU and GPU
         - Serial portions run on the CPU
         - Parallel portions are offloaded to the GPU
       - Simultaneous computation on CPU and GPU
         - No memory contention between code assigned to the CPU and code assigned to the GPU
     - CUDA can be applied incrementally to existing applications
     - CUDA-capable GPUs
       - Hundreds of cores
       - Run thousands of computing threads
     - Reduces system memory bus traffic
  14. Kernels
     - Functions written in C for CUDA using the __global__ declaration specifier
     - Executed N times in parallel by N different CUDA threads
     - The number of CUDA threads for each call is specified using the new <<<...>>> syntax

        // Kernel definition
        __global__ void VecAdd(float* A, float* B, float* C) {
            // Device code here
        }

        int main() {
            // Host code here
            VecAdd<<<1, N>>>(A, B, C);   // Kernel invocation: one block of N threads
        }
  15. Threads within a Kernel
     - Each thread of a kernel is given a unique thread ID
       - Accessible within the kernel via the built-in variable threadIdx

        // Kernel definition
        __global__ void VecAdd(float* A, float* B, float* C) {
            int i = threadIdx.x;
            // Each thread performs one pair-wise addition
            C[i] = A[i] + B[i];
        }

        int main() {
            // Kernel invocation with N threads in a single block
            // (allocation and copies omitted; see the host-side sketch below)
            VecAdd<<<1, N>>>(A, B, C);
        }
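     A minimal sketch of the host-side setup that the slide elides, assuming N is a compile-time constant and using the CUDA runtime API (cudaMalloc, cudaMemcpy, cudaFree); the buffer names are illustrative only:

        #define N 256

        int main() {
            size_t size = N * sizeof(float);
            float h_A[N], h_B[N], h_C[N];           // host buffers
            float *d_A, *d_B, *d_C;                 // device buffers
            for (int i = 0; i < N; i++) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

            cudaMalloc((void**)&d_A, size);
            cudaMalloc((void**)&d_B, size);
            cudaMalloc((void**)&d_C, size);

            cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
            cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

            VecAdd<<<1, N>>>(d_A, d_B, d_C);        // one block of N threads

            cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
            cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
            return 0;
        }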
  16. Thread IDs
     - threadIdx is a 3-component vector
     - Threads can be identified using a 1-D, 2-D, or 3-D thread index
     - Allows the formation of 1-D, 2-D, or 3-D thread blocks
       - A natural way to invoke computation across the elements of a vector, matrix, or field
     - For a 2-D block of size (Dx, Dy)
       - The thread ID of a thread at index (x, y) is (x + y * Dx)
     - For a 3-D block of size (Dx, Dy, Dz)
       - The thread ID of a thread at index (x, y, z) is (x + y * Dx + z * Dx * Dy)
     - For a 1-D block, the thread ID and the thread index are the same
     (A short kernel illustrating this flattening follows.)
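     A small kernel illustrating the flattening rule above; the kernel name and output buffer are hypothetical:

        // Each thread of a 2-D block computes its flattened ID: tid = x + y * Dx
        __global__ void showFlatId(int* out) {
            int flatId = threadIdx.x + threadIdx.y * blockDim.x;
            out[flatId] = flatId;          // each thread writes its own flat ID
        }

        // Launch example: a single 4 x 8 block gives flat IDs 0..31
        // dim3 block(4, 8);
        // showFlatId<<<1, block>>>(d_out);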
  17. Thread Synchronization
     - Threads within a block cooperate by sharing data through shared memory (see the sketch below)
     - The intrinsic function __syncthreads() is used as a barrier
       - Acts as a synchronization point
     - Shared memory is expected to be
       - Low-latency memory
       - Near each processor core
       - Like an L1 cache
     - __syncthreads() is expected to be lightweight
     - All threads of a block reside on the same processor core
       - The number of threads per block is therefore restricted by the limited memory resources of that core
       - On current GPUs, a thread block may contain up to 512 threads
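     A minimal sketch, assuming a single block of BLOCK threads, of the shared-memory-plus-barrier pattern described above; the kernel and buffer names are illustrative:

        #define BLOCK 64

        // Each thread loads one element into shared memory, the barrier makes
        // every load visible, and threads then read data loaded by other threads.
        __global__ void reverseInBlock(float* d_in, float* d_out) {
            __shared__ float buf[BLOCK];
            int t = threadIdx.x;
            buf[t] = d_in[t];               // cooperative load
            __syncthreads();                // barrier: all loads complete
            d_out[t] = buf[BLOCK - 1 - t];  // read a value loaded by another thread
        }

        // Launch: reverseInBlock<<<1, BLOCK>>>(d_in, d_out);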
  18. Thread Blocks (1/2)
     - A kernel can be executed by multiple, equally-shaped thread blocks
       - Total number of threads = (threads per block) x (number of blocks)
     - Multiple blocks are organized into a 1-D or 2-D grid of blocks
     - The dimension of the grid is specified by the first parameter of <<<...>>>
     - Each block within the grid can be identified by a 1-D or 2-D index
     - The block index is accessible within the kernel via the built-in variable blockIdx
     - The block dimension is accessible within the kernel via the built-in variable blockDim
  19. Thread Blocks (2/2)
     - Thread blocks can be scheduled in any order across any number of cores
     - Enables programmers to write code that scales with the number of cores
     - The number of thread blocks in a grid is dictated by the size of the data being processed, not by the number of processors in the system (see the 1-D sketch below and the 2-D example on the next slide)
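     A minimal 1-D sketch of this sizing rule, reusing the vector-addition example with an added bounds check; the block size of 256 is illustrative, not part of the original slides:

        // For a vector of N elements, derive the grid size from N, not from the
        // number of physical cores. The guard handles N not divisible by the block size.
        __global__ void VecAddN(float* A, float* B, float* C, int N) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N)
                C[i] = A[i] + B[i];
        }

        // Host side:
        // int threadsPerBlock = 256;
        // int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        // VecAddN<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);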
  20. Example

        // Kernel definition
        __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index along x
            int j = blockIdx.y * blockDim.y + threadIdx.y;   // global index along y
            if (i < N && j < N)
                C[i][j] = A[i][j] + B[i][j];
        }

        int main() {
            // Kernel invocation
            dim3 dimBlock(16, 16);                           // 16 x 16 = 256 threads per block
            dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                         (N + dimBlock.y - 1) / dimBlock.y); // round up so the grid covers all of N
            MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
        }
  21. Memory Hierarchy
     - Local
     - Shared
     - Global
     - Constant
     - Texture
     - Global, constant, and texture memory are persistent across kernel launches by the same application (see the declaration sketch below)
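     A minimal sketch, not taken from the slides, of how these spaces are declared in C for CUDA (texture memory is omitted; all names are illustrative):

        __constant__ float coeffs[16];          // constant memory (set from the host,
                                                // e.g. with cudaMemcpyToSymbol)
        __device__   float table[128];          // global memory, visible to all threads

        __global__ void memorySpaces(const float* d_in, float* d_out) {
            __shared__ float tile[128];         // shared memory, one copy per block
            int t = threadIdx.x;                // assumes a block of at most 128 threads
            float tmp = d_in[t];                // automatic variable: register / local memory
            tile[t] = tmp * coeffs[0];
            __syncthreads();
            d_out[t] = tile[t] + table[t];
        }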
  22. Heterogeneous Programming
  23. Programming Interface
     - Two mutually exclusive interfaces
       - C for CUDA
         - Any source file that contains any of these extensions must be compiled with nvcc
       - CUDA driver API (see the sketch below)
         - A lower-level C API that provides functions to
           - Load kernels as modules of CUDA binary or assembly code
           - Inspect their parameters
           - Launch them
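     A minimal sketch of the driver-API flow described above, assuming the VecAdd kernel from slide 15 has been compiled separately to a file named VecAdd.ptx. The parameter-passing calls shown are the CUDA 2.x style and differ in later toolkits, so treat this as illustrative only (error checking omitted):

        #include <cuda.h>      // driver API header

        #define N 256

        int main() {
            CUdevice dev;  CUcontext ctx;  CUmodule mod;  CUfunction vecAdd;
            CUdeviceptr d_A, d_B, d_C;

            cuInit(0);
            cuDeviceGet(&dev, 0);
            cuCtxCreate(&ctx, 0, dev);
            cuModuleLoad(&mod, "VecAdd.ptx");              // load compiled kernel code
            cuModuleGetFunction(&vecAdd, mod, "VecAdd");   // look the kernel up by name

            cuMemAlloc(&d_A, N * sizeof(float));
            cuMemAlloc(&d_B, N * sizeof(float));
            cuMemAlloc(&d_C, N * sizeof(float));
            // ... fill d_A and d_B with cuMemcpyHtoD ...

            // CUDA 2.x-style launch: set the block shape and parameters explicitly,
            // then launch a 1x1 grid of one N-thread block.
            cuFuncSetBlockShape(vecAdd, N, 1, 1);
            cuParamSetv(vecAdd, 0,                &d_A, sizeof(d_A));
            cuParamSetv(vecAdd, sizeof(d_A),      &d_B, sizeof(d_B));
            cuParamSetv(vecAdd, 2 * sizeof(d_A),  &d_C, sizeof(d_C));
            cuParamSetSize(vecAdd, 3 * sizeof(d_A));
            cuLaunchGrid(vecAdd, 1, 1);

            cuCtxDestroy(ctx);
            return 0;
        }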
  24. nvcc (1/2)
     - A compiler driver
       - Simplifies the compilation of CUDA code
       - Provides simple and familiar command-line options (example invocations below)
     - Invokes a collection of tools that implement the different compilation stages
     - Separates device code from host code
     - Compiles device code into an assembly form (PTX code) or binary form (cubin object)
     - Generates host code as either
       - C code to be compiled with another tool, or
       - Object code directly, by invoking the host compiler during the last compilation stage
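     A few illustrative nvcc invocations (the file names are hypothetical; the flags are standard nvcc options rather than course-specific material):

        $ nvcc -o vecadd vecadd.cu      # compile device and host code, then link an executable
        $ nvcc -ptx vecadd.cu           # emit the device code as PTX assembly
        $ nvcc -cubin vecadd.cu         # emit the device code as a cubin object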
  25. nvcc (2/2)
     - The front end of nvcc processes CUDA source code according to C++ syntax rules
     - Full C++ is supported for the host code
       - As a consequence, void pointers cannot be assigned to non-void pointers without a typecast (example below)
     - Only the C subset of C++ is fully supported for the device code
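     A small illustration of that rule for host code compiled by nvcc; the helper name is hypothetical:

        #include <stdlib.h>

        float* make_buffer(size_t n) {
            /* return malloc(n * sizeof(float)); */       // rejected: implicit void* to float* conversion
            return (float*)malloc(n * sizeof(float));     // accepted: explicit cast
        }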
  26. Device Emulation
     - No debugging facility in CUDA
     - Debugging is done in device emulation mode
     - Programming with CUDA on systems without NVIDIA GPUs also uses this mode
     - Example
        $ nvcc -o hello -deviceemu hello_world.cu
  27. Resources
     - Lecture slides, lab workbook
     - http://developer.nvidia.com/cuda-downloads
     - CUDA documentation
     - CUDA SDK sample projects
     - Books available on the CUDA website
  28. Lab Programming Exercises
  29. List of Lab Exercises
     - Hello World
     - Matrix multiplication
     - Numerical computation of pi
     - Parallel sort
  30. System Configuration
     - Hardware platform
       - x86 with CUDA in device emulation mode
     - Software environment
       - CUDA SDK version 2.2
       - CUDA toolkit version 2.2
     - Programming-paradigm-specific details
       - Data-parallel model
       - Loops are the main target of parallelization
       - Within a grid, threads are synchronized using a barrier (if necessary)
  31. HelloWorld Code

        #include <cuda.h>
        #include <stdio.h>
        #include <stdlib.h>

        __global__ void printhello()
        {
            int thid = blockIdx.x * blockDim.x + threadIdx.x;
            printf("Thread%d: Hello World!\n", thid);   // device-side printf works in emulation mode
        }

        int main()
        {
            printhello<<<5,10>>>();   // 5 blocks of 10 threads = 50 greetings
            return 0;
        }
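     Building and running this exercise with the emulation-mode command from slide 26 (the source file name is assumed):

        $ nvcc -o hello -deviceemu hello_world.cu
        $ ./hello

     This should print fifty lines of the form "Thread<k>: Hello World!", one per thread of the 5 x 10 launch; the order in which threads print is not guaranteed.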
  32. Matrix Multiplication Code (1/5)

        /* Matrices are stored in row-major order:
         * M(row, col) = M.elements[row * M.c + col] */
        #define BLOCK_SZ 2
        #define Xc (2 * BLOCK_SZ)   // columns of X
        #define Xr (3 * BLOCK_SZ)   // rows of X
        #define Yc (2 * BLOCK_SZ)
        #define Yr Xc               // inner dimensions must match
        #define Zc Yc
        #define Zr Xr

        typedef struct Matrix {
            int r, c;
            float* elements;
        } matrix;

        void populate_matrix(matrix*);
        void print_matrix(matrix);
  33. Matrix Multiplication Code (2/5)

        __global__ void matrix_mul_krnl(matrix A, matrix B, matrix C)
        {
            float C_entry = 0;
            // Each thread computes one element of C
            int row = blockIdx.y * blockDim.y + threadIdx.y;
            int col = blockIdx.x * blockDim.x + threadIdx.x;
            int i;
            for (i = 0; i < A.c; i++)
                C_entry += A.elements[row * A.c + i] * B.elements[i * B.c + col];
            C.elements[row * C.c + col] = C_entry;
        }
  34. Matrix Multiplication Code (3/5)

        int main() {
            matrix X, Y, Z;
            X.r = Xr;  Y.r = Yr;  Z.r = Zr;
            X.c = Xc;  Y.c = Yc;  Z.c = Zc;
            printf("C(%d,%d) = A(%d,%d) x B(%d,%d)\n", Z.r, Z.c, X.r, X.c, Y.r, Y.c);

            size_t size_Z = Z.c * Z.r * sizeof(float);
            Z.elements = (float*) malloc(size_Z);
            populate_matrix(&X);
            populate_matrix(&Y);
            print_matrix(X);
            print_matrix(Y);

            // Copy X to the device as d_A
            matrix d_A;
            d_A.c = X.c;
            d_A.r = X.r;
            size_t size_A = X.c * X.r * sizeof(float);
            cudaMalloc((void**)&d_A.elements, size_A);
            cudaMemcpy(d_A.elements, X.elements, size_A, cudaMemcpyHostToDevice);
  35. Matrix Multiplication Code (4/5)

            // Copy Y to the device as d_B
            matrix d_B;  d_B.c = Y.c;  d_B.r = Y.r;
            size_t size_B = Y.c * Y.r * sizeof(float);
            cudaMalloc((void**)&d_B.elements, size_B);
            cudaMemcpy(d_B.elements, Y.elements, size_B, cudaMemcpyHostToDevice);

            // Allocate d_C on the device for the result
            matrix d_C;  d_C.c = Z.c;  d_C.r = Z.r;
            size_t size_C = Z.c * Z.r * sizeof(float);
            cudaMalloc((void**)&d_C.elements, size_C);

            // Matrix dimensions are multiples of BLOCK_SZ, so the grid covers them exactly
            dim3 dimBlock(BLOCK_SZ, BLOCK_SZ);
            dim3 dimGrid(Y.c / dimBlock.x, X.r / dimBlock.y);
            matrix_mul_krnl<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

            cudaMemcpy(Z.elements, d_C.elements, size_C, cudaMemcpyDeviceToHost);
            cudaFree(d_A.elements);
            cudaFree(d_B.elements);  cudaFree(d_C.elements);
            print_matrix(Z);
            free(X.elements);  free(Y.elements);  free(Z.elements);
        }
  36. Matrix Multiplication Code (5/5)

        void populate_matrix(matrix* mat) {
            int dim = mat->c * mat->r;
            size_t sz = dim * sizeof(float);
            mat->elements = (float*) malloc(sz);
            int i;
            for (i = 0; i < dim; i++)
                mat->elements[i] = (float)(rand() % 1000);
        }

        void print_matrix(matrix mat) {
            int i, n = 0, dim;
            dim = mat.c * mat.r;
            for (i = 0; i < dim; i++) {
                if (i == mat.c * n) {    // start a new row
                    printf("\n");
                    n++;
                }
                printf("%0.2f\t", mat.elements[i]);
            }
        }
  37. Computation of pi Code (1/4)

        // Per-launch parameters
        typedef struct PI_data {
            int n;          // total number of intervals
            int PerThrItr;  // intervals handled by each thread
            int nThr;       // total number of threads
        } data;
  38. Computation of pi Code (2/4)

        __global__ void calculate_PI(data d, float* s) {
            float sum, x, w;
            int itr, i, j;
            itr = d.PerThrItr;
            i = blockIdx.x * blockDim.x + threadIdx.x;
            w = 1.0f / (float)d.n;           // width of each interval
            sum = 0.0f;
            if (i < d.nThr) {
                // Midpoint rule on this thread's slice of [0, 1]:
                // integrating 4 / (1 + x^2) over [0, 1] gives pi
                for (j = i * itr; j < (i * itr + itr); j++) {
                    x = w * (j + 0.5f);      // midpoint of interval j
                    sum += 4.0f / (1.0f + x * x);
                }
                s[i] = sum * w;              // partial result for this thread
            }
        }
  39. Computation of pi Code (3/4)

        int main(int argc, char** argv) {
            if (argc < 3) {
                printf("Usage: ./<progname> #iterations #threads\n");
                exit(1);
            }
            data pi_data;  float PI = 0.0f;
            pi_data.n = atoi(argv[1]);
            pi_data.nThr = atoi(argv[2]);
            pi_data.PerThrItr = pi_data.n / pi_data.nThr;   // assumes n is a multiple of nThr

            float *d_sum;  float *h_sum;
            size_t size = pi_data.nThr * sizeof(float);     // one partial sum per thread
            cudaMalloc((void**)&d_sum, size);
            h_sum = (float*) malloc(size);
  40. Computation of pi Code (4/4)

            int threads_per_block = 4;
            int blocks_per_grid;
            blocks_per_grid = (pi_data.nThr + threads_per_block - 1) / threads_per_block;

            calculate_PI<<<blocks_per_grid, threads_per_block>>>(pi_data, d_sum);
            cudaMemcpy(h_sum, d_sum, size, cudaMemcpyDeviceToHost);

            // Reduce the per-thread partial sums on the host
            int i;
            for (i = 0; i < pi_data.nThr; i++)
                PI += h_sum[i];
            printf("Using %d iterations, the value of PI is %f\n", pi_data.n, PI);

            cudaFree(d_sum);
            free(h_sum);
        }
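     An illustrative run, assuming the binary is named pi and choosing an iteration count that divides evenly among the threads:

        $ ./pi 100000 100
        Using 100000 iterations, the value of PI is 3.141593

     The printed value should be close to 3.14159; single-precision accumulation limits the achievable accuracy.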
  41. Bitonic Sort
     - A sorting network is a sorting algorithm in which
       - The sequence of comparisons is not data-dependent
       - This makes it suitable for parallel implementations
     - Bitonic sort is one of the fastest sorting networks
       - O(n log^2 n) comparators
       - Very efficient when sorting a small number of elements
       - Specially designed for parallel machines
     http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
  42. Bitonic Sort
     - All elements
       - Start as sorted single-element subsequences
     - Pairs of elements
       - Are sorted into ascending or descending subsequences
       - Elements whose indices differ in bit 0 (the lowest bit) are compared and exchanged conditionally
       - If bit 1 of the index is 0, the pair is sorted in ascending order
       - If bit 1 of the index is 1, the pair is sorted in descending order

        Index:   0    1    2    3    4    5    6    7
        Binary:  0000 0001 0010 0011 0100 0101 0110 0111
        Pairs:   [0000, 0001] ascending    [0010, 0011] descending
                 [0100, 0101] ascending    [0110, 0111] descending
  43. Bitonic Sort
     - k = bit position that determines the swap direction (ascending or descending order)
     - j = distance between elements to be compared and conditionally swapped
     - i = runs through all the elements
     - ixj = exclusive-or of i and j
       - i.e., the element whose position differs from i only in bit position log2(j)
       - ixj is the pair of i
       - We only compare elements i and ixj if i < ixj
         - To avoid duplicate comparisons
  44. Bitonic Sort

        // Serial sorting network; get(i) returns element i and swap(i, j)
        // exchanges elements i and j of the array being sorted.
        int i, j, k;
        for (k = 2; k <= N; k = 2 * k) {
            for (j = k / 2; j > 0; j = j / 2) {
                for (i = 0; i < N; i++) {
                    int ixj = i ^ j;
                    if ((ixj) > i) {
                        if ((i & k) == 0 && get(i) > get(ixj))
                            swap(i, ixj);
                        if ((i & k) != 0 && get(i) < get(ixj))
                            swap(i, ixj);
                    }
                }
            }
        }
  45. Parallel Sort Code (1/3)

        #define NUM 32

        __device__ inline void swap(int & a, int & b) {
            int tmp = a;
            a = b;
            b = tmp;
        }

        __global__ static void bitonicSort(int * values) {
            extern __shared__ int shared[];
            const unsigned int tid = threadIdx.x;
            // Each thread copies one element into shared memory
            shared[tid] = values[tid];
            __syncthreads();
  46. Parallel Sort Code (2/3)

            // Parallel bitonic sort
            for (unsigned int k = 2; k <= NUM; k *= 2) {
                // Bitonic merge:
                for (unsigned int j = k / 2; j > 0; j /= 2) {
                    unsigned int ixj = tid ^ j;
                    if (ixj > tid) {
                        if ((tid & k) == 0) {
                            if (shared[tid] > shared[ixj])
                                swap(shared[tid], shared[ixj]);
                        }
                        else {
                            if (shared[tid] < shared[ixj])
                                swap(shared[tid], shared[ixj]);
                        }
                    }
                    __syncthreads();
                }
            }
            // Write result.
            values[tid] = shared[tid];
        }
  47. Parallel Sort Code (3/3)

        int main(int argc, char** argv) {
            int values[NUM];
            for (int i = 0; i < NUM; i++)
                values[i] = rand() % 1000;

            int * dvalues;
            cudaMalloc((void**)&dvalues, sizeof(int) * NUM);
            cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);

            // One block of NUM threads; the third launch parameter is the dynamically
            // allocated shared memory size (NUM must be a power of two and fit in one block)
            bitonicSort<<<1, NUM, sizeof(int) * NUM>>>(dvalues);

            cudaMemcpy(values, dvalues, sizeof(int) * NUM, cudaMemcpyDeviceToHost);
            cudaFree(dvalues);

            // Verify the result is in non-decreasing order
            bool passed = true;
            int i;
            for (i = 1; i < NUM; i++) {
                if (values[i-1] > values[i])
                    passed = false;
            }
            printf("Test %s\n", passed ? "PASSED" : "FAILED");
        }
