Performance Analysis: C vs CUDA
Some tests comparing n-queens solutions on the CPU, using C/C++, and on the GPU, using CUDA.

Transcript

  • 1. n-Queens Problem: A Comparison Between CPU and GPU using C++ and CUDA. Vitor Pamplona, vitor@vitorpamplona.com
  • 2. Goals ● Learn CUDA and its limitations ● Implement some n-Queens solutions: a CUDA version and a C++ version ● Compare performance ● Check for possible papers: parallel processing, computer graphics
  • 3. N by N Queens Problem http://en.wikipedia.org/wiki/Eight_queens_puzzle
  • 4. Possibilities vs Solutions

        Board Size | Possibilities               | Solutions
        1          | 1                           | 1
        2          | 4                           | 0
        3          | 27                          | 0
        4          | 256                         | 2
        5          | 3,125                       | 10
        6          | 46,656                      | 4
        7          | 823,543                     | 40
        8          | 16,777,216                  | 92
        9          | 387,420,489                 | 352
        10         | 10,000,000,000              | 724
        11         | 285,311,670,611             | 2,680
        12         | 8,916,100,448,256           | 14,200
        13         | 302,875,106,592,253         | 73,713
        14         | 11,112,006,825,558,000      | 365,596
        15         | 437,893,890,380,859,000     | 2,299,184
        16         | 18,446,744,073,709,600,000  | 14,772,512
        17         | 827,240,261,886,337,000,000 | 95,815,104
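The possibilities column is simply N^N: one queen per column, each free to sit on any of the N rows, and the solutions column counts the placements in which no two queens attack each other. A small host-side C++ sketch (an illustration, not part of the deck) that reproduces both columns for one board size:

      #include <cstdio>
      #include <cstdint>

      // True if the queen in column 'col' (at row board[col]) is not attacked
      // by any queen placed in columns 0..col-1.
      static bool valid(const int* board, int col) {
          for (int c = 0; c < col; ++c) {
              int d = board[col] - board[c];
              if (d == 0 || d == col - c || d == c - col) return false;
          }
          return true;
      }

      int main() {
          const int N = 8;                     // board size
          int board[N] = {0};                  // board[c] = row of the queen in column c
          uint64_t possibilities = 0, solutions = 0;

          // Enumerate all N^N placements (one queen per column) like a base-N counter.
          while (true) {
              ++possibilities;
              bool ok = true;
              for (int c = 1; c < N && ok; ++c) ok = valid(board, c);
              if (ok) ++solutions;

              int c = N - 1;
              while (c >= 0 && ++board[c] == N) board[c--] = 0;
              if (c < 0) break;                // wrapped around: every placement visited
          }
          printf("N=%d possibilities=%llu solutions=%llu\n", N,
                 (unsigned long long) possibilities, (unsigned long long) solutions);
          return 0;                            // N=8: 16,777,216 possibilities, 92 solutions
      }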
  • 5. Cu... what? ● Compute Unified Device Architecture ● C-style language and compiler ● Designed for parallel solutions ● Not a graphics API ● Runs on current graphics hardware: nVidia GeForce 8+ ● Faster transfers between CPU and GPU ● Compiler for CPU and GPU
  • 6-17. Hardware Architecture (diagrams, built up across the slides): the host side is the CPU with its cache and host memory; the device side is the GPU, whose multiprocessors run threads grouped into warps, each thread with its own local memory, each multiprocessor with shared memory organized in banks, a 64 kB constant cache, a texture cache optimized for 2D access, and the global device memory shared by all threads.
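For reference, a minimal kernel sketch (not from the slides) showing how the memory spaces named in these diagrams appear in CUDA source: per-thread local variables, per-block __shared__ memory, the 64 kB __constant__ space, and global memory:

      // Set from the host with cudaMemcpyToSymbol(cBoardSize, &n, sizeof(int)).
      __constant__ int cBoardSize;              // lives behind the 64 kB constant cache

      __global__ void memorySpaces(const float* gIn, float* gOut) {
          // Registers / per-thread local memory: private to each thread.
          float localValue = gIn[threadIdx.x];  // read from global memory

          // Shared memory: one copy per block, spread across the banks.
          __shared__ float tile[32];            // launch with at most 32 threads per block
          tile[threadIdx.x] = localValue;
          __syncthreads();

          // Global memory: visible to every thread and to the host.
          gOut[threadIdx.x] = tile[threadIdx.x] * cBoardSize;
      }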
  • 18-23. Memory Access; Basics of Programming; Hardware Architecture (figure-only slides).
  • 24-28. Libraries and Access (diagram, built up across the slides): the application reaches the GPU through a stack of CUDA Libraries, the CUDA Runtime, and the CUDA Driver, with the CPU on the host side of the stack and the GPU at the bottom.
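As a small illustration of the runtime layer in this stack (my example, not the presentation's), a host program can query the device through the CUDA Runtime API, which in turn talks to the CUDA Driver:

      #include <cstdio>
      #include <cuda_runtime.h>

      int main() {
          int count = 0;
          cudaGetDeviceCount(&count);          // Runtime API call, served by the driver
          for (int i = 0; i < count; ++i) {
              cudaDeviceProp prop;
              cudaGetDeviceProperties(&prop, i);
              printf("Device %d: %s, %d multiprocessors, %zu MB global memory\n",
                     i, prop.name, prop.multiProcessorCount,
                     prop.totalGlobalMem / (1024 * 1024));
          }
          return 0;
      }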
  • 29. Startup ● Special Windows/Linux drivers ● CUDA Toolkit and CUDA Developer SDK, which include: API documentation, programming guide, compiler (nvcc), libraries (CUFFT, CUBLAS), source code examples
  • 30. Host Example

      float *pHostData = (float*) malloc(sizeof(float) * 256);
      // fill in the data array...

      // allocate global memory
      float *pInput, *pOutput;
      cudaMalloc((void**) &pInput, sizeof(float) * 256);
      cudaMalloc((void**) &pOutput, sizeof(float) * 256);

      // host memory to global memory
      cudaMemcpy(pInput, pHostData, sizeof(float) * 256, cudaMemcpyHostToDevice);

      dim3 nDimGrid(1, 1, 1);    // 1 block only
      dim3 nDimBlock(32, 1, 1);  // 32 threads per block
      int nSharedMemBytes = sizeof(float) * 32;
      MyKernel<<<nDimGrid, nDimBlock, nSharedMemBytes>>>(pInput, pOutput);

      // global memory to host memory
      cudaMemcpy(pHostData, pOutput, sizeof(float) * 256, cudaMemcpyDeviceToHost);

      free(pHostData);
      cudaFree(pInput);   // device memory must be released with cudaFree, not free
      cudaFree(pOutput);
  • 31. Kernel Example

      __global__ void MyKernel(float* pInData, float* pOutData) {
          extern __shared__ float sharedData[];
          const unsigned int tid = threadIdx.x;
          const unsigned int num_threads = blockDim.x;

          // global memory to shared memory
          sharedData[tid] = pInData[tid];
          __syncthreads();

          // do something
          sharedData[tid] = (float) num_threads * sharedData[tid];
          __syncthreads();

          // shared memory to global memory
          pOutData[tid] = sharedData[tid];
      }
  • 32. Competitors ● AMD/ATI Close to Metal (CTM) ● RapidMind ● Acceleware ● PeakStream (unavailable since the acquisition by Google) ● BrookGPU ● OpenGL/Direct3D + GLSL/HLSL/Cg ● BSGP
  • 33. Back to Work ● Brute force implementations ● 3 solutions for CPU: monothread depth-first recursive, monothread depth-first plain, N-threads depth-first plain
  • 34. Back to Work ● Brute force implementations ● 3 solutions for CPU: monothread depth-first recursive, monothread depth-first plain, N-threads depth-first plain ● 3 solutions for GPU: step-based breadth-first static memory, step-based breadth-first dynamic memory, plain depth-first dynamic memory
  • 37. CPU Monothread Depth-first Plain ● Optimized implementation ● Single thread ● Depth-first approach ● No recursion, no function calls ● Memory buffers :) ● Fast, really fast!
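A hedged reconstruction of the idea (the original source code is not in the slides): depth-first backtracking driven by an explicit row buffer instead of recursion, so there are no function calls on the hot path:

      #include <cstdio>

      // Iterative depth-first n-queens counter: a plain row buffer, no recursion.
      unsigned long long countSolutions(int n) {
          int row[32];                     // row[c] = candidate row for column c (n <= 32)
          unsigned long long solutions = 0;
          int col = 0;
          row[0] = 0;

          while (col >= 0) {
              // Is row[col] compatible with the queens already placed?
              bool ok = row[col] < n;
              for (int c = 0; ok && c < col; ++c) {
                  int d = row[col] - row[c];
                  if (d == 0 || d == col - c || d == c - col) ok = false;
              }

              if (ok && col == n - 1) { ++solutions; ++row[col]; }  // complete board found
              else if (ok)            { row[++col] = 0; }           // go one column deeper
              else if (row[col] < n)  { ++row[col]; }               // try the next row
              else if (--col >= 0)    { ++row[col]; }               // backtrack
          }
          return solutions;
      }

      int main() {
          printf("%llu\n", countSolutions(8));   // prints 92
          return 0;
      }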
  • 39. CPU N-threads Depth-first Plain ● N threads, where N is the board size ● First column filled in the main thread ● Create N Linux pthreads, one thread for each row ● Each thread processes the remaining N-1 columns ● Critical section: solutions++; saveSolution(board);
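A sketch of this threading scheme (function and variable names are assumptions; the slides' solver is plain rather than recursive, a recursive helper just keeps the example short, and saveSolution(board) is omitted): the main thread fixes the column-0 queen for each worker, each worker explores the remaining N-1 columns, and the shared counter is updated inside a mutex-protected critical section:

      #include <cstdio>
      #include <pthread.h>

      static const int N = 12;                 // board size
      static unsigned long long solutions = 0;
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      // Place columns col..N-1 given the queens already fixed in row[0..col-1].
      static void solve(int* row, int col) {
          if (col == N) {
              pthread_mutex_lock(&lock);       // critical section: solutions++
              ++solutions;
              pthread_mutex_unlock(&lock);
              return;
          }
          for (int r = 0; r < N; ++r) {
              bool ok = true;
              for (int c = 0; ok && c < col; ++c) {
                  int d = r - row[c];
                  if (d == 0 || d == col - c || d == c - col) ok = false;
              }
              if (ok) { row[col] = r; solve(row, col + 1); }
          }
      }

      // One worker per row of the first column.
      static void* worker(void* arg) {
          int row[N];
          row[0] = (int) (long) arg;           // column 0 fixed by the main thread
          solve(row, 1);                       // explore the remaining N-1 columns
          return nullptr;
      }

      int main() {
          pthread_t threads[N];
          for (long r = 0; r < N; ++r)
              pthread_create(&threads[r], nullptr, worker, (void*) r);
          for (int r = 0; r < N; ++r)
              pthread_join(threads[r], nullptr);
          printf("N=%d solutions=%llu\n", N, solutions);   // N=12: 14,200
          return 0;
      }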
  • 41-45. GPU Step Breadth-first (diagrams): step 1 takes the empty board as input and launches N threads, each trying one row for the first column; step 2 takes the surviving partial solutions and launches one thread per (partial solution, candidate row) pair; in general every step launches (number of partial solutions) * N threads, each extending one board by one column.
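A sketch of one expansion step of this scheme (an assumed kernel, not the original): thread i extends partial board i / N with candidate row i % N, so a step with S surviving boards launches S * N threads, and valid extensions are appended to the output through a global counter:

      // One breadth-first step: extend every depth-k partial board by one column.
      // pIn holds numIn boards of k rows each; valid extensions are appended to
      // pOut (k+1 rows each) through the global counter pOutCount.
      __global__ void expandStepDynamic(const int* pIn, int numIn, int k, int n,
                                        int* pOut, unsigned int* pOutCount) {
          int tid = blockIdx.x * blockDim.x + threadIdx.x;
          if (tid >= numIn * n) return;

          int newRow = tid % n;                    // candidate row for column k
          const int* board = pIn + (tid / n) * k;  // the partial board this thread extends

          // Discard the candidate if an already-placed queen attacks it.
          for (int c = 0; c < k; ++c) {
              int d = newRow - board[c];
              if (d == 0 || d == k - c || d == c - k) return;
          }

          // Reserve an output slot and write the extended board.
          unsigned int slot = atomicAdd(pOutCount, 1u);
          int* out = pOut + slot * (k + 1);
          for (int c = 0; c < k; ++c) out[c] = board[c];
          out[k] = newRow;
      }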
  • 46. Why a Breadth-first Solution? ● Graphics processors are not Intel/AMD CPUs ● Slow clock: 650 MHz ● The driver can kill time-expensive kernels ● Lots of threads: good for the GPU ● Easy solution-to-thread mapping by indexes ● Fast kernels: good for the GPU
  • 47. GPU Step Breadth-first ● Static memory version: good for the GPU, but bad in that it needs one sort of the output after each step ● Dynamic memory version: bad in that it needs synchronized memory access and a global last-output index
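The static-memory version writes each extension to its fixed slot tid plus a validity flag and leaves the compaction (the slide's one sort per step) to the host, while the dynamic-memory version shares a single output index among all threads. A host-side sketch of the dynamic variant's driver loop, reusing the expandStepDynamic kernel sketched above (buffer sizing and error checking omitted; names are assumptions):

      // dIn and dOut are device buffers large enough for the widest step; dCount
      // is a single device counter, the slide's "global last output index".
      unsigned int runBreadthFirst(int n, int* dIn, int* dOut, unsigned int* dCount) {
          unsigned int numBoards = 1;                  // a single empty board of depth 0
          for (int k = 0; k < n && numBoards > 0; ++k) {
              unsigned int zero = 0;
              cudaMemcpy(dCount, &zero, sizeof(zero), cudaMemcpyHostToDevice);

              int threads = 256;
              int blocks  = (int) ((numBoards * n + threads - 1) / threads);
              expandStepDynamic<<<blocks, threads>>>(dIn, (int) numBoards, k, n, dOut, dCount);

              // The number of boards written becomes the input size of the next step.
              cudaMemcpy(&numBoards, dCount, sizeof(numBoards), cudaMemcpyDeviceToHost);
              int* tmp = dIn; dIn = dOut; dOut = tmp;  // ping-pong the two buffers
          }
          return numBoards;                            // complete n-queens solutions
      }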
  • 49. Plain Depth-first Dynamic ● Best case: N^4 threads ● Thread indexes fill the first 4 columns ● Depth-first approach ● Synchronized global memory access
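A sketch of that index-to-board mapping (an assumed kernel, not the author's code; it presumes n is at least 5 and at most 16): an n x n grid of blocks with n x n threads each yields N^4 threads, and every thread decodes its block and thread indexes into the first four columns before finishing the board depth-first and bumping a global counter:

      // N^4-thread depth-first kernel: block indexes pick columns 0-1, thread
      // indexes pick columns 2-3.
      // Launch: solveFromFourColumns<<<dim3(n, n), dim3(n, n)>>>(n, dSolutions);
      __global__ void solveFromFourColumns(int n, unsigned int* pSolutions) {
          int row[16];                         // board buffer, so n <= 16 here
          row[0] = blockIdx.x;
          row[1] = blockIdx.y;
          row[2] = threadIdx.x;
          row[3] = threadIdx.y;

          // Threads whose fixed four-column prefix is already invalid simply exit.
          for (int col = 1; col < 4; ++col)
              for (int c = 0; c < col; ++c) {
                  int d = row[col] - row[c];
                  if (d == 0 || d == col - c || d == c - col) return;
              }

          // Finish columns 4..n-1 with an iterative depth-first search.
          int col = 4;
          row[4] = 0;
          while (col >= 4) {
              bool ok = row[col] < n;
              for (int c = 0; ok && c < col; ++c) {
                  int d = row[col] - row[c];
                  if (d == 0 || d == col - c || d == c - col) ok = false;
              }
              if (ok && col == n - 1) { atomicAdd(pSolutions, 1u); ++row[col]; }
              else if (ok)            { row[++col] = 0; }
              else if (row[col] < n)  { ++row[col]; }
              else if (--col >= 4)    { ++row[col]; }   // backtrack
          }
      }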
  • 50. Implementations and Threads

        Solution                              | Threads
        GPU-breadth-first static mem          | Sol * N
        GPU-breadth-first dynamic mem         | Sol * N
        GPU-depth-first 1-Thread              | 1
        GPU-depth-first n-Threads             | N
        GPU-depth-first n-grids               | N
        GPU-depth-first n*n-grids             | N*N
        GPU-depth-first n*n-grids*n-threads   | N*N*N
        GPU-depth-first n*n-grids*n*n-threads | N*N*N*N
        GPU-depth-first FULL threads          | N^N
        CPU-Plain                             | 1
        CPU-Recursive                         | 1
        CPU-Plain-Threads                     | N
  • 51. Test platforms ● CPU: Intel Quad Core at 2.4 GHz, 4 GB RAM, Ubuntu ● GPU: GeForce 9600 GT, 8 multiprocessors (64 processors) at 650 MHz, 512 MB RAM at 900 MHz ● CUDA 1.0
  • 52. Results: CPU (chart) ● Series: CPU-Plain, CPU-Recursive, CPU-Plain-Threads ● Board sizes 12-14
  • 53. Results: GPU: Static vs Dynamic (chart) ● Series: breadth-first static, breadth-first dynamic, CPU-Plain, CPU-Recursive, CPU-Plain-Threads ● Board sizes 11-12
  • 54. Results: Same Number of Threads (chart) ● Series: depth-first n-Threads, depth-first n-Grids, CPU-Plain-Threads ● Board sizes 12-13
  • 55. Results: Only 1 Thread (chart) ● Series: depth-first 1-Thread, CPU-Recursive, CPU-Plain ● Board sizes 10-12
  • 56. Results: Dynamic vs Depth (chart) ● Series: breadth-first dynamic, depth-first n-Threads, depth-first n-Grids, CPU-Plain, CPU-Recursive, CPU-Plain-Threads ● Board size 12
  • 57. Results: Depth vs CPU (chart) ● Series: depth-first n-Threads, depth-first n-Grids, depth-first n*n-grids, depth-first n*n-grids*n-threads, depth-first n*n-grids*n*n-threads, CPU-Plain, CPU-Recursive, CPU-Plain-Threads ● Board size 12
  • 58. Results: GPU N^N solution (chart) ● Series: depth-first n^n, CPU-Plain, CPU-Recursive, CPU-Plain-Threads ● Board sizes 7-9
  • 59. Results: Dynamic, Depth, CPU (chart) ● Series: breadth-first dynamic, depth-first N*N*N*N, CPU-Plain, CPU-Recursive, CPU-Plain-Threads ● Board sizes 10-13
  • 60. Results: Depth vs CPU Threads (chart) ● Series: depth-first N*N*N*N, CPU-Plain, CPU-Plain-Threads ● Board sizes 14-16
  • 61. Results (board sizes 1-9)

        Solution                  | Threads | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8    | 9
        GPU-breadth-first static  | Sol * N | 171 | 171 | 171 | 174 | 174 | 174 | 178 | 184  | 220
        GPU-breadth-first dynamic | Sol * N | 171 | 171 | 171 | 173 | 173 | 173 | 173 | 173  | 174
        GPU-depth-first 1-Thread  | 1       | 171 | 171 | 171 | 171 | 171 | 171 | 171 | 185  | 227
        GPU-depth-first n-Threads | N       | 171 | 171 | 171 | 172 | 172 | 173 | 173 | 175  | 230
        GPU-depth-first n-grids   | N       | 171 | 171 | 171 | 171 | 171 | 173 | 173 | 173  | 177
        GPU-depth-first n*n-grids | N*N     | 172 | 172 | 172 | 172 | 172 | 172 | 172 | 172  | 174
        GPU-depth-first N^3       | N^3     | 171 | 172 | 172 | 172 | 172 | 172 | 172 | 172  | 174
        GPU-depth-first N^4       | N^4     | 171 | 171 | 171 | 171 | 171 | 171 | 171 | 171  | 171
        GPU-depth-first FULL      | N^N     | 171 | 171 | 172 | 172 | 172 | 172 | 230 | 1682 | 11420
        CPU-Plain                 | 1       | 2   | 2   | 2   | 2   | 2   | 2   | 2   | 2    | 3
        CPU-Recursive             | 1       | 2   | 2   | 2   | 2   | 2   | 2   | 2   | 2    | 3
        CPU-Plain-Threads         | N       | 2   | 2   | 2   | 2   | 2   | 2   | 2   | 2    | 5
  • 62. Results (board sizes 11-17)

        Solution                  | Threads | 11   | 12   | 13   | 14   | 15    | 16    | 17
        GPU-breadth-first static  | Sol * N | 1234 | 6184 | Mem  | Mem  | Mem   | Mem   | Mem
        GPU-breadth-first dynamic | Sol * N | 218  | 407  | 1481 | 7886 | Cont  | Mem   | Mem
        GPU-depth-first 1-Thread  | 1       | 1463 | 7198 | -    | -    | -     | -     | -
        GPU-depth-first n-Threads | N       | 441  | 1561 | 7827 | -    | -     | -     | -
        GPU-depth-first n-grids   | N       | 301  | 824  | 3604 | -    | -     | -     | -
        GPU-depth-first n*n-grids | N*N     | 216  | 424  | 1425 | 7025 | -     | -     | -
        GPU-depth-first N^3       | N^3     | 192  | 267  | 661  | 2937 | -     | -     | -
        GPU-depth-first N^4       | N^4     | 181  | 199  | 360  | 1369 | 7562  | 43488 | 05:38.99
        GPU-depth-first FULL      | N^N     | -    | -    | -    | -    | -     | -     | -
        CPU-Plain                 | 1       | 18   | 91   | 502  | 3020 | 19685 | -     | -
        CPU-Recursive             | 1       | 35   | 198  | 1225 | 8283 | 58493 | -     | -
        CPU-Plain-Threads         | N       | 17   | 84   | 290  | 1393 | 8578  | 32010 | 04:40.95
  • 63. Conclusions ● CUDA is slow: low use of the GPU's graphics resources; GLSL, HLSL and Cg are faster ● The compiler needs improvements ● More documentation on assembly optimization is needed ● Unstable: the GPU kills some processes (I don't know why) ● Performance depends on the implementation ● Good for mixed CPU + GPU solutions
  • 64. Conclusions ● %, * and / are slow ● threadIdx and blockIdx are fantastic ● __shared__ memory helps ● CUDA locks the screen while processing: no inter-process scheduling ● Synchronized architecture: think synchronized
  • 65. Questions? Vitor Pamplona, vitor@vitorpamplona.com