PyCuda: How to harness the power of graphics cards in Python applications

Fabrizio Milo



  1. PyCUDA: Harnessing the power of the GPU with Python
  2. Talk Structure: 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  3. Talk Structure: 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  4. WHY A GPU?
  5. APPLICATIONS & DEMOS
  6. Why GPU?
  7. Talk Structure: 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  8. How does it work?
  9. CPU diagram: Control, ALUs, Cache, DRAM
  10. GPU diagram: DRAM
  11. CPU (Control, ALUs, Cache, DRAM) versus GPU (DRAM) diagram
  12. CUDA
  13. Compute Unified Device Architecture
  14. CUDA: a parallel computing architecture for NVIDIA GPUs (DirectX Compute)
  15. Execution Model / CUDA Device Model
  16. EXECUTION MODEL
  17. Thread: the smallest unit of logic
  18. A Block: a group of threads
  19. A Grid: a group of blocks
  20. One block can have many threads
  21. One grid can have many blocks
  22. The hardware: DEVICE MODEL
  23. Scalar Processor
  24. Scalar Processor
  25. Many Scalar Processors
  26. + Register File
  27. + Shared Memory
  28. Multiprocessor
  29. Device
  30. Real example: 10-series architecture. 240 Scalar Processor (SP) cores execute kernel threads; 30 Streaming Multiprocessors (SMs) each contain 8 scalar processors, 1 double-precision unit, and shared memory.
  31. Software-to-hardware mapping: Thread → Scalar Processor; Thread Block → Multiprocessor; Grid → Device
  32. Global Memory
  33. Global Memory
  34. Host – Device: CPU and RAM on the host, Global Memory on the device (diagram)
  35. Host – Multi Device: one CPU/RAM host driving several devices (diagram)
  36. 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  37. Software-to-hardware mapping: Thread → Scalar Processor; Thread Block → Multiprocessor; Grid → Device
  38. Kernel (one thread): __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; }
  39. Kernel (one thread): __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; }
  40. Kernel (one block): __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; }
  41. Kernel (a grid): __global__ void kernel( … ) { const int idx = blockIdx.x * blockDim.x + threadIdx.x; … } (a runnable sketch follows below)
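The grid-indexed kernel on slide 41 only sketches the body; a minimal runnable PyCUDA version of the same pattern might look like the following (the kernel name, the scaling operation, and the sizes are illustrative, not from the slides):

    # Sketch: a grid of blocks, each thread handling one array element.
    import numpy
    import pycuda.autoinit            # initializes CUDA and creates a context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void scale(float *dest, float *src, int n)
    {
        const int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global index
        if (idx < n)                  // guard: the grid may overshoot n
            dest[idx] = 2.0f * src[idx];
    }
    """)
    scale = mod.get_function("scale")

    n = 1000
    src = numpy.random.randn(n).astype(numpy.float32)
    dest = numpy.empty_like(src)

    # 256 threads per block, enough blocks to cover all n elements
    block = (256, 1, 1)
    grid = ((n + block[0] - 1) // block[0], 1)
    scale(cuda.Out(dest), cuda.In(src), numpy.int32(n), block=block, grid=grid)

    assert numpy.allclose(dest, 2 * src)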
  42. How do I program it? The main logic is compiled with GCC into a CPU binary (.bin); the kernel is compiled with NVCC into a GPU binary (.cubin).
  43. How do I program it? (same toolchain diagram: GCC → .bin → CPU, NVCC → .cubin → GPU)
  44. Host – Device: CPU and RAM on the host, Global Memory on the device (diagram)
  45. CPU, RAM, and Global Memory (diagram)
  46. Allocate memory: cudaMalloc( pointer, size )
  47. Copy to device: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction )
  48. Kernel launch: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction ); Kernel<<< #blocks, #threads >>>( *params )
  49. Get back the results: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction ); Kernel<<< #blocks, #threads >>>( *params ); cudaMemcpy( dest, src, size, direction )
  50. Error handling: if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); }
  51. And soon it becomes… if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); } if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); } if (Kernel<<< #blocks, #threads >>>( *params ) != cudaSuccess) { handle_error(); } if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); }
  52. And soon it becomes… the same allocate / copy / launch / copy-back checks, repeated for every single CUDA call.
  54. 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  55. Python + CUDA & Andreas Klöckner = PyCUDA
  56. PyCuda philosophy: provide complete access
  57. PyCuda philosophy: automatically manage resources
  58. PyCuda philosophy: check and report errors
  59. PyCuda philosophy: cross-platform
  60. PyCuda philosophy: allow interactive use
  61. PyCuda philosophy: NumPy integration
  62. NUMPY ARRAY
  63. import numpy; my_array = numpy.array([1,] * 100)   (an array of 100 ones, indices 0 to 99)
  64. my_array[3] = 0   (the element at index 3 is now 0; see the note below)
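One detail worth noting before these arrays are moved to the GPU (this note is an addition, not from the slides): NumPy defaults to 64-bit element types, while the kernels in this talk declare float* parameters, so arrays are normally converted to single precision first.

    import numpy

    my_array = numpy.array([1,] * 100)           # integer dtype by default
    gpu_ready = my_array.astype(numpy.float32)   # match the kernel's float* arguments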
  65. PyCuda: Workflow
  66. PyCuda: Workflow
  67. PyCuda: Workflow
  68. Memory allocation: cuda.mem_alloc( size_bytes )
  69. Memory copy: gpu_mem = cuda.mem_alloc( size_bytes ); cuda.memcpy_htod( gpu_mem, cpu_mem )
  70. Kernel: gpu_mem = cuda.mem_alloc( size_bytes ); cuda.memcpy_htod( gpu_mem, cpu_mem ); SourceModule(""" __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } """)
  71. Kernel launch: mod = SourceModule(""" __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } """); multiply_them = mod.get_function("multiply_them"); multiply_them( *args, block=(30, 64, 1) )   (a complete script follows below)
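Slides 68 through 71 show the individual calls; stitched together, and following the layout of the standard PyCUDA tutorial (the array size of 400 and the random inputs are illustrative), a complete script looks roughly like this:

    # Minimal end-to-end PyCUDA script: allocate, copy in, launch, copy back.
    import numpy
    import pycuda.autoinit            # initializes CUDA and creates a context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    a = numpy.random.randn(400).astype(numpy.float32)   # single precision for float*
    b = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.zeros_like(a)

    # Allocate device memory and copy the inputs host-to-device
    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    dest_gpu = cuda.mem_alloc(dest.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    cuda.memcpy_htod(b_gpu, b)

    mod = SourceModule("""
    __global__ void multiply_them( float *dest, float *a, float *b )
    {
        const int i = threadIdx.x;
        dest[i] = a[i] * b[i];
    }
    """)
    multiply_them = mod.get_function("multiply_them")

    # One block of 400 threads, one thread per element
    multiply_them(dest_gpu, a_gpu, b_gpu, block=(400, 1, 1), grid=(1, 1))

    cuda.memcpy_dtoh(dest, dest_gpu)                     # copy the result back
    assert numpy.allclose(dest, a * b)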
  75. Hello GPU DEMO
  76. GPUARRAY
  77. gpuarray
  78. PyCuda: GpuArray: gpuarray.to_gpu( numpy_array ); numpy_array = gpu_array.get()
  79. PyCuda: GpuArray: gpuarray.to_gpu( numpy_array ); numpy_array = gpu_array.get(); supports +, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product … (see the sketch below)
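As an illustration of the gpuarray interface (a short sketch built on the documented to_gpu / get calls; the 4x4 array and the arithmetic are arbitrary choices):

    import numpy
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray

    a = numpy.random.randn(4, 4).astype(numpy.float32)
    a_gpu = gpuarray.to_gpu(a)          # host -> device copy

    b_gpu = (2 * a_gpu) + 1             # arithmetic runs on the GPU
    print(b_gpu.get())                  # device -> host copy back to NumPy

    assert numpy.allclose(b_gpu.get(), 2 * a + 1)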
  80. PyCuda: GpuArray: ElementWise. from pycuda.elementwise import ElementwiseKernel
  81. PyCuda: GpuArray: ElementWise. lin_comb = ElementwiseKernel( "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]" )
  82. PyCuda: GpuArray: ElementWise. lin_comb = ElementwiseKernel( "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]" ); c_gpu = gpuarray.empty_like(a_gpu); lin_comb(5, a_gpu, 6, b_gpu, c_gpu); assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5   (full sketch below)
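The slide omits the setup of a_gpu, b_gpu and la; a self-contained version, following the ElementwiseKernel example from the PyCUDA documentation (the array length of 50 is arbitrary), is roughly:

    import numpy.linalg as la
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.curandom import rand as curand
    from pycuda.elementwise import ElementwiseKernel

    # z[i] = a*x[i] + b*y[i], generated and compiled once, then reusable
    lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]",
        "linear_combination")

    a_gpu = curand((50,))               # random single-precision data on the GPU
    b_gpu = curand((50,))
    c_gpu = gpuarray.empty_like(a_gpu)

    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
    assert la.norm((c_gpu - (5 * a_gpu + 6 * b_gpu)).get()) < 1e-5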
  83. Meta-Programming: __kernel_template__ = """ __global__ void kernel( args ) { for (int i = 0; i < {{ iterations }}; i++) { {{ operations }} } } """   See for example jinja2
  84. Meta-Programming
  85. Meta-Programming: Generate Source! (a small sketch follows below)
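As a sketch of the idea, combining jinja2's Template API with SourceModule (the loop body, the iteration count, and the kernel name are made up for illustration):

    import numpy
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule
    from jinja2 import Template

    kernel_template = Template("""
    __global__ void kernel(float *data)
    {
        const int i = threadIdx.x;
        for (int n = 0; n < {{ iterations }}; n++) {
            {{ operation }}
        }
    }
    """)

    # Render the template into concrete CUDA C, then compile it as usual
    source = kernel_template.render(iterations=4, operation="data[i] = data[i] * 2.0f;")
    mod = SourceModule(source)
    kernel = mod.get_function("kernel")

    data = numpy.ones(64, dtype=numpy.float32)
    kernel(cuda.InOut(data), block=(64, 1, 1))
    assert numpy.allclose(data, 16.0)   # doubled 4 times: 2**4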
  86. Performance?
  87. Mandelbrot DEMO
  88. PyCuda: Documentation
  89. PyCuda website: http://mathema.tician.de/software/pycuda. License: X Consortium License (no warranty, free for all use). Dependencies: Python 2.4+, numpy, Boost.
  90. In the future… OPENCL
  91. THANK YOU & HAVE FUN!
  92. ?
