Performance Analysis: C vs CUDA

Some tests comparing n-queens solutions between the CPU (using C++) and the GPU (using CUDA).


  1. n-Queens Problem: A Comparison Between CPU and GPU Using C++ and CUDA. Vitor Pamplona, vitor@vitorpamplona.com
  2. Goals
       ● Learn CUDA and its limitations
       ● Implement some n-Queens solutions: a CUDA version and a C++ version
       ● Compare performance
       ● Check for possible papers (parallel processing, computer graphics)
  3. The N-by-N Queens Problem (http://en.wikipedia.org/wiki/Eight_queens_puzzle)
  4. Possibilities vs. Solutions (possibilities = N^N placements, one queen per column)

       Board Size   Possibilities                  Solutions
       1            1                              1
       2            4                              0
       3            27                             0
       4            256                            2
       5            3,125                          10
       6            46,656                         4
       7            823,543                        40
       8            16,777,216                     92
       9            387,420,489                    352
       10           10,000,000,000                 724
       11           285,311,670,611                2,680
       12           8,916,100,448,256              14,200
       13           302,875,106,592,253            73,712
       14           11,112,006,825,558,000         365,596
       15           437,893,890,380,859,000        2,279,184
       16           18,446,744,073,709,600,000     14,772,512
       17           827,240,261,886,337,000,000    95,815,104
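     The gap between possibilities and solutions comes from the row/diagonal conflict test that every implementation needs. A minimal sketch of that test (illustrative code, not taken from the slides), assuming the board is stored as board[col] = row so that one queen per column is implicit:

        #include <cstdlib>   // std::abs

        // Returns true if a queen at (col, row) conflicts with none of the
        // queens already placed in columns 0..col-1.
        bool isSafe(const int *board, int col, int row) {
            for (int prev = 0; prev < col; ++prev) {
                if (board[prev] == row) return false;                        // same row
                if (std::abs(board[prev] - row) == col - prev) return false; // same diagonal
            }
            return true;
        }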
  5. Cu... what? Compute Unified Device Architecture
       ● C-style language and compiler
       ● Designed for parallel solutions
       ● Not a graphics API
       ● Runs on current graphics hardware (nVidia GeForce 8+)
       ● Faster transfers between CPU and GPU
       ● Compiler for CPU and GPU
  6.–17. Hardware Architecture (diagram slides). The figures build up the CUDA hardware model step by step: the host side (CPU, cache, host memory) and the device side, where threads grouped into warps run on the GPU's processors; each thread has local memory, shared memory is organized in banks, and the device also offers a 64 kB constant cache, a texture cache optimized for 2D access, and global memory.
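     As a rough illustration of the memory spaces named in these diagrams (a sketch using the standard CUDA qualifiers, not code from the slides):

        __constant__ float coefficients[32];   // constant memory, cached (64 kB total)
        __device__   float globalTable[1024];  // global (device) memory

        // Launched with 32 threads per block; each thread touches every space once.
        __global__ void memorySpaces(float *out) {
            __shared__ float tile[32];          // shared memory, organized in banks
            float localValue;                   // register / local memory, per thread
            tile[threadIdx.x] = coefficients[threadIdx.x] + globalTable[threadIdx.x];
            __syncthreads();
            localValue = tile[threadIdx.x];
            out[blockIdx.x * blockDim.x + threadIdx.x] = localValue;
        }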
  18. Memory Access (diagram slide)
  19. Basics of Programming (diagram slide)
  20.–23. Hardware Architecture (diagram slides)
  24.–28. Libraries and Access (diagram slides). The application reaches the GPU through a layered stack: CUDA Libraries on top of the CUDA Runtime, which sits on the CUDA Driver, with the CPU and GPU below.
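     A small host-side sketch of talking to the runtime layer of that stack (an assumed example, not from the slides; it only queries the installed devices):

        #include <cstdio>
        #include <cuda_runtime.h>

        int main() {
            int deviceCount = 0;
            cudaGetDeviceCount(&deviceCount);          // CUDA Runtime API
            for (int i = 0; i < deviceCount; ++i) {
                cudaDeviceProp prop;
                cudaGetDeviceProperties(&prop, i);
                printf("Device %d: %s, %d multiprocessors\n",
                       i, prop.name, prop.multiProcessorCount);
            }
            return 0;
        }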
  29. Startup
       ● Special Windows/Linux drivers
       ● CUDA Toolkit
       ● CUDA Developer SDK, which includes:
         - API documentation
         - Programming guide
         - Compiler (nvcc)
         - Libraries (CUFFT, CUBLAS)
         - Source code examples
  30. Host Example
       float *pHostData = (float*) malloc(sizeof(float) * 256);
       // fill in the data array...

       // allocate global memory
       float *pInput, *pOutput;
       cudaMalloc((void**) &pInput,  sizeof(float) * 256);
       cudaMalloc((void**) &pOutput, sizeof(float) * 256);

       // host memory to global memory
       cudaMemcpy(pInput, pHostData, sizeof(float) * 256, cudaMemcpyHostToDevice);

       dim3 nDimGrid(1, 1, 1);   // 1 block only
       dim3 nDimBlock(32, 1, 1); // 32 threads per block
       int nSharedMemBytes = sizeof(float) * 32;
       MyKernel<<<nDimGrid, nDimBlock, nSharedMemBytes>>>(pInput, pOutput);

       // global memory to host memory
       cudaMemcpy(pHostData, pOutput, sizeof(float) * 256, cudaMemcpyDeviceToHost);

       free(pHostData);
       cudaFree(pInput);   // device allocations are released with cudaFree, not free
       cudaFree(pOutput);
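     In practice each of those calls can fail silently; a small checking helper (an assumed addition, not part of the original example) makes errors visible:

        #include <cstdio>
        #include <cstdlib>
        #include <cuda_runtime.h>

        // Abort with a readable message whenever a CUDA runtime call fails.
        #define CUDA_CHECK(call)                                            \
            do {                                                            \
                cudaError_t err = (call);                                   \
                if (err != cudaSuccess) {                                   \
                    fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                            cudaGetErrorString(err), __FILE__, __LINE__);   \
                    exit(EXIT_FAILURE);                                     \
                }                                                           \
            } while (0)

        // Usage: CUDA_CHECK(cudaMalloc((void**) &pInput, sizeof(float) * 256));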
  31. Kernel Example
       __global__ void MyKernel(float* pInData, float* pOutData) {
           extern __shared__ float sharedData[];
           const unsigned int tid = threadIdx.x;
           const unsigned int num_threads = blockDim.x;

           // global memory to shared memory
           sharedData[tid] = pInData[tid];
           __syncthreads();

           // do something
           sharedData[tid] = (float) num_threads * sharedData[tid];
           __syncthreads();

           // shared memory to global memory
           pOutData[tid] = sharedData[tid];
       }
  32. Competitors
       ● AMD/ATI Close to Metal (CTM)
       ● RapidMind
       ● Acceleware
       ● PeakStream (unavailable since its acquisition by Google)
       ● BrookGPU
       ● OpenGL/Direct3D + GLSL/HLSL/Cg
       ● BSGP
  33. Back to Work: brute-force implementations
       ● 3 solutions for the CPU:
         - Monothread depth-first recursive
         - Monothread depth-first plain
         - N-threads depth-first plain
  34. Back to Work: brute-force implementations
       ● 3 solutions for the CPU:
         - Monothread depth-first recursive
         - Monothread depth-first plain
         - N-threads depth-first plain
       ● 3 solutions for the GPU:
         - Step-based breadth-first, static memory
         - Step-based breadth-first, dynamic memory
         - Plain depth-first, dynamic memory
  35.–36. Back to Work (agenda slide repeated)
  37. CPU Monothread Depth-first Plain
       ● Optimized implementation
       ● Single thread
       ● Depth-first approach
       ● No recursion, no function calls
       ● Memory buffers :)
       ● Fast, really fast!
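     A rough sketch of such an iterative, non-recursive depth-first counter (illustrative code, not the author's; it reuses the board[col] = row layout and the isSafe helper sketched earlier):

        // board[] doubles as the backtracking stack: board[col] is the row
        // currently being tried in that column.
        long countSolutionsPlain(int n) {
            int board[32];
            long solutions = 0;
            int col = 0;
            board[0] = -1;
            while (col >= 0) {
                ++board[col];                                  // try the next row here
                if (board[col] >= n) { --col; continue; }      // column exhausted: backtrack
                if (!isSafe(board, col, board[col])) continue; // conflict: try another row
                if (col == n - 1) { ++solutions; continue; }   // all columns placed
                board[++col] = -1;                             // descend to the next column
            }
            return solutions;
        }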
  38. Back to Work (agenda slide repeated)
  39. CPU N-threads Depth-first Plain
       ● N threads, where N is the board size
       ● First column filled in the main thread
       ● Create N Linux pthreads, one thread for each line
       ● Each thread processes the remaining N-1 columns
       ● Critical section: solutions++; saveSolution(board);
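     A condensed sketch of this scheme (assumed structure, not the author's exact code; countFromColumn is a hypothetical helper that finishes columns 1..N-1, and here the shared counter is updated once per thread instead of once per solution):

        #include <pthread.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static long solutions = 0;
        static int  N = 14;

        long countFromColumn(int *board, int fromCol, int n); // hypothetical helper:
                                                              // depth-first over columns fromCol..n-1

        // One worker per starting row of the first column.
        void *worker(void *arg) {
            int board[32];
            board[0] = (int)(long) arg;                 // column 0 fixed by the main thread
            long local = countFromColumn(board, 1, N);
            pthread_mutex_lock(&lock);                  // critical section
            solutions += local;
            pthread_mutex_unlock(&lock);
            return NULL;
        }

        // In main(): one pthread per row of the first column, then join them all.
        // for (long r = 0; r < N; ++r) pthread_create(&tid[r], NULL, worker, (void*) r);
        // for (long r = 0; r < N; ++r) pthread_join(tid[r], NULL);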
  40. Back to Work (agenda slide repeated)
  41.–45. GPU Step Breadth-first, Steps 1–3 (diagram slides). Each step takes the current set of partial solutions as input; one thread extends one partial solution with one candidate row for the next column, so every step launches (number of partial solutions) * N threads and its output becomes the input of the next step.
  46. Why a breadth-first solution?
       ● Graphics processors are not Intel/AMD CPUs
         - Slow: 650 MHz
         - The driver can kill time-expensive kernels
       ● Lots of threads: good for the GPU
       ● Easy solution-to-thread mapping by indexes
       ● Fast kernels: good for the GPU
  47. GPU Step Breadth-first
       ● Static memory version
         - Bad: one sort of the output for each step
         - Good for the GPU
       ● Dynamic memory version
         - Bad: synchronized memory access
         - Bad: global last-output index
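     A sketch of what one expansion step of the dynamic-memory variant might look like (an assumed kernel, not the author's code; the global last-output index is advanced with atomicAdd, and every board is stored as n ints with board[col] = row):

        // One thread = one (partial solution, candidate row) pair.
        // inBoards:  numIn boards with 'col' columns already filled
        // outBoards: boards extended by one column; *outCount is the global output index
        __global__ void expandStep(const int *inBoards, int numIn, int col, int n,
                                   int *outBoards, int *outCount) {
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            if (idx >= numIn * n) return;
            int boardId = idx / n;                  // which partial solution to extend
            int row     = idx % n;                  // which row to try in column 'col'
            const int *board = &inBoards[boardId * n];

            for (int prev = 0; prev < col; ++prev) {            // row/diagonal conflict test
                int d = board[prev] - row;
                if (d == 0 || d == col - prev || d == prev - col) return;
            }
            int out = atomicAdd(outCount, 1);                   // claim a slot in the output
            for (int c = 0; c < col; ++c)
                outBoards[out * n + c] = board[c];
            outBoards[out * n + col] = row;
        }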
  48. Back to Work (agenda slide repeated)
  49. Plain Depth-first Dynamic
       ● Best case: N^4 threads
       ● Thread indexes fill the first 4 columns
       ● Depth-first approach for the remaining columns
       ● Synchronized global memory access
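     A sketch of how that N^4 mapping might work (assumed code, not the author's): the grid and block indexes choose the rows of the first four columns, and each thread then runs the same iterative depth-first search over the remaining columns.

        // Launched with an n x n grid and n x n threads per block (n^4 threads), n >= 5.
        __global__ void depthFirstN4(int n, unsigned int *solutions) {
            int board[32];
            board[0] = blockIdx.x;      // row of the queen in column 0
            board[1] = blockIdx.y;      // column 1
            board[2] = threadIdx.x;     // column 2
            board[3] = threadIdx.y;     // column 3

            // Reject prefixes that already conflict.
            for (int c = 1; c < 4; ++c)
                for (int p = 0; p < c; ++p) {
                    int d = board[p] - board[c];
                    if (d == 0 || d == c - p || d == p - c) return;
                }

            // Iterative depth-first search over columns 4..n-1.
            unsigned int count = 0;
            int col = 4;
            board[col] = -1;
            while (col >= 4) {
                ++board[col];
                if (board[col] >= n) { --col; continue; }
                bool ok = true;
                for (int p = 0; p < col; ++p) {
                    int d = board[p] - board[col];
                    if (d == 0 || d == col - p || d == p - col) { ok = false; break; }
                }
                if (!ok) continue;
                if (col == n - 1) { ++count; continue; }
                board[++col] = -1;
            }
            if (count) atomicAdd(solutions, count);   // synchronized global memory access
        }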
  50. Implementations and Threads

       Solution                                   Threads
       GPU-breadth-first static mem               Sol * N
       GPU-breadth-first dynamic mem              Sol * N
       GPU-depth-first 1-Thread                   1
       GPU-depth-first n-Threads                  N
       GPU-depth-first n-grids                    N
       GPU-depth-first n*n-grids                  N*N
       GPU-depth-first n*n-grids*n-threads        N*N*N
       GPU-depth-first n*n-grids*n*n-threads      N*N*N*N
       GPU-depth-first FULL threads               N^N
       CPU-Plain                                  1
       CPU-Recursive                              1
       CPU-Plain-Threads                          N
  51. Test Platforms
       ● CPU: Intel Quad Core at 2.4 GHz, 4 GB RAM, Ubuntu
       ● GPU: GeForce 9600 GT
         - 8 multiprocessors (64 processors) at 650 MHz
         - 512 MB RAM at 900 MHz
         - CUDA 1.0
  52. Results: CPU (chart: CPU-Plain, CPU-Recursive and CPU-Plain-Threads, board sizes 12–14)
  53. Results: GPU, Static vs Dynamic (chart: breadth-first static, breadth-first dynamic and the CPU versions, board sizes 11–12)
  54. Results: Same Number of Threads (chart: depth-first n-Threads, depth-first n-Grids and CPU-Plain-Threads, board sizes 12–13)
  55. Results: Only 1 Thread (chart: depth-first 1-Thread, CPU-Recursive and CPU-Plain, board sizes 10–12)
  56. Results: Dynamic vs Depth (chart: breadth-first dynamic, depth-first n-Threads, depth-first n-Grids and the CPU versions, board size 12)
  57. Results: Depth vs CPU (chart: the depth-first GPU variants and the CPU versions, board size 12)
  58. Results: GPU N^N Solution (chart: depth-first n^n and the CPU versions, board sizes 7–9)
  59. Results: Dynamic, Depth, CPU (chart: breadth-first dynamic, depth-first N*N*N*N and the CPU versions, board sizes 10–13)
  60. Results: Depth vs CPU Threads (chart: depth-first N*N*N*N, CPU-Plain and CPU-Plain-Threads, board sizes 14–16)
  61. Results (board sizes 1–9)

       Solution                     Threads     1    2    3    4    5    6    7     8      9
       GPU-breadth-first static     Sol * N   171  171  171  174  174  174  178   184    220
       GPU-breadth-first dynamic    Sol * N   171  171  171  173  173  173  173   173    174
       GPU-depth-first 1-Thread     1         171  171  171  171  171  171  171   185    227
       GPU-depth-first n-Threads    N         171  171  171  172  172  173  173   175    230
       GPU-depth-first n-grids      N         171  171  171  171  171  173  173   173    177
       GPU-depth-first n*n-grids    N*N       172  172  172  172  172  172  172   172    174
       GPU-depth-first N^3          N^3       171  172  172  172  172  172  172   172    174
       GPU-depth-first N^4          N^4       171  171  171  171  171  171  171   171    171
       GPU-depth-first FULL         N^N       171  171  172  172  172  172  230  1682  11420
       CPU-Plain                    1           2    2    2    2    2    2    2     2      3
       CPU-Recursive                1           2    2    2    2    2    2    2     2      3
       CPU-Plain-Threads            N           2    2    2    2    2    2    2     2      5
  62. Results (board sizes 11–17)

       Solution                     Threads      11     12     13     14     15     16        17
       GPU-breadth-first static     Sol * N    1234   6184    Mem    Mem    Mem    Mem       Mem
       GPU-breadth-first dynamic    Sol * N     218    407   1481   7886    Mem    Mem      Cont
       GPU-depth-first 1-Thread     1          1463   7198      -      -      -      -         -
       GPU-depth-first n-Threads    N           441   1561   7827      -      -      -         -
       GPU-depth-first n-grids      N           301    824   3604      -      -      -         -
       GPU-depth-first n*n-grids    N*N         216    424   1425   7025      -      -         -
       GPU-depth-first N^3          N^3         192    267    661   2937      -      -         -
       GPU-depth-first N^4          N^4         181    199    360   1369   7562  43488  05:38.99
       GPU-depth-first FULL         N^N           -      -      -      -      -      -         -
       CPU-Plain                    1            18     91    502   3020  19685      -         -
       CPU-Recursive                1            35    198   1225   8283  58493      -         -
       CPU-Plain-Threads            N            17     84    290   1393   8578  32010  04:40.95
  63. Conclusions
       ● CUDA is slow
         - Low use of the GPU's graphics resources
         - GLSL, HLSL and Cg are faster
       ● The compiler needs improvements
       ● More documentation on assembly optimization is needed
       ● Unstable: the GPU kills some processes (I don't know why)
       ● Performance depends on the implementation
       ● Good for mixed CPU + GPU solutions
  64. Conclusions
       ● %, * and / are slow
       ● threadIdx and blockIdx are fantastic
       ● __shared__ memory helps
       ● CUDA locks the screen while processing
       ● No inter-process scheduling
       ● Synchronized architecture: think synchronized
  65. Questions? Vitor Pamplona, vitor@vitorpamplona.com
