PyCuda: How to harness the power of graphics cards in Python applications

Fabrizio Milo
Transcript

  • 1. PyCUDA: Harnessing the power of GPU with Python
  • 2. Talk Structure 1. Why a GPU ? 2. How does it work ? 3. How do I Program it ? 4. Can I Use Python ? PyCon 4 – Florence 2010 – Fabrizio Milo
  • 3. Talk Structure 1. Why a GPU ? 2. How does it work ? 3. How do I Program it ? 4. Can I Use Python ? PyCon 4 – Florence 2010 – Fabrizio Milo
  • 4. WHY A GPU ? PyCon 4 – Florence 2010 – Fabrizio Milo
  • 5. APPLICATIONS & DEMOS PyCon 4 – Florence 2010 – Fabrizio Milo
  • 6. Why GPU? PyCon 4 – Florence 2010 – Fabrizio Milo
  • 7. Talk Structure 1. Why a GPU ? 2. How does it work ? 3. How do I Program it ? 4. Can I Use Python ? PyCon 4 – Florence 2010 – Fabrizio Milo
  • 8. How does it work ? PyCon 4 – Florence 2010 – Fabrizio Milo
  • 9. ALU ALU Control ALU ALU Cache DRAM CPU PyCon 4 – Florence 2010 – Fabrizio Milo
  • 10. DRAM GPU PyCon 4 – Florence 2010 – Fabrizio Milo
  • 11. ALU ALU Control ALU ALU Cache DRAM DRAM CPU GPU PyCon 4 – Florence 2010 – Fabrizio Milo
  • 12. CUDA PyCon 4 – Florence 2010 – Fabrizio Milo
  • 13. Compute Unified Device Architecture PyCon 4 – Florence 2010 – Fabrizio Milo
  • 14. CUDA A Parallel Computing Architecture for NVIDIA GPUs Direct X Compute PyCon 4 – Florence 2010 – Fabrizio Milo
  • 15. Execution Model CUDA Device Model PyCon 4 – Florence 2010 – Fabrizio Milo
  • 16. EXECUTION MODEL PyCon 4 – Florence 2010 – Fabrizio Milo
  • 17. Thread Smallest unit of logic PyCon 4 – Florence 2010 – Fabrizio Milo
  • 18. A Block A Group of Threads PyCon 4 – Florence 2010 – Fabrizio Milo
  • 19. A Grid A Group of Blocks PyCon 4 – Florence 2010 – Fabrizio Milo
  • 20. One Block can have many threads PyCon 4 – Florence 2010 – Fabrizio Milo
  • 21. One Grid can have many blocks PyCon 4 – Florence 2010 – Fabrizio Milo
  • 22. The hardware DEVICE MODEL PyCon 4 – Florence 2010 – Fabrizio Milo
  • 23. Scalar Processor PyCon 4 – Florence 2010 – Fabrizio Milo
  • 24. Scalar Processor PyCon 4 – Florence 2010 – Fabrizio Milo
  • 25. Many Scalar Processors PyCon 4 – Florence 2010 – Fabrizio Milo
  • 26. + Register File PyCon 4 – Florence 2010 – Fabrizio Milo
  • 27. + Shared Memory PyCon 4 – Florence 2010 – Fabrizio Milo
  • 28. Multiprocessor PyCon 4 – Florence 2010 – Fabrizio Milo
  • 29. Device PyCon 4 – Florence 2010 – Fabrizio Milo
  • 30. Real Example: 10-Series Architecture • 240 Scalar Processor (SP) cores execute kernel threads • 30 Streaming Multiprocessors (SMs), each containing 8 scalar processors, 1 double-precision unit and shared memory PyCon 4 – Florence 2010 – Fabrizio Milo
  • 31. Software ↔ Hardware: Thread ↔ Scalar Processor, Thread Block ↔ Multiprocessor, Grid ↔ Device PyCon 4 – Florence 2010 – Fabrizio Milo
  • 32. Global Memory PyCon 4 – Florence 2010 – Fabrizio Milo
  • 33. Global Memory PyCon 4 – Florence 2010 – Fabrizio Milo
  • 34. RAM CPU Global Memory Host - Device PyCon 4 – Florence 2010 – Fabrizio Milo
  • 35. RAM CPU Host – Multi Device PyCon 4 – Florence 2010 – Fabrizio Milo
  • 36. 1. Why a GPU ? 2. How does it work ? 3. How do I Program it ? 4. Can I Use Python ? PyCon 4 – Florence 2010 – Fabrizio Milo
  • 37. Software ↔ Hardware: Thread ↔ Scalar Processor, Thread Block ↔ Multiprocessor, Grid ↔ Device PyCon 4 – Florence 2010 – Fabrizio Milo
  • 38. Kernel __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } Thread PyCon 4 – Florence 2010 – Fabrizio Milo
  • 39. Kernel __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } Thread PyCon 4 – Florence 2010 – Fabrizio Milo
  • 40. Kernel __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } Block PyCon 4 – Florence 2010 – Fabrizio Milo
  • 41. Kernel __global__ void kernel( … ) { const int idx = blockIdx.x * blockDim.x + threadIdx.x; … } Grid PyCon 4 – Florence 2010 – Fabrizio Milo
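    The kernel above computes a grid-wide index from blockIdx, blockDim and threadIdx. A minimal PyCUDA sketch of how such a kernel is compiled and launched from Python follows; the kernel name scale, the array size and the launch configuration are illustrative assumptions, not taken from the slides.

        import numpy as np
        import pycuda.autoinit                      # importing this creates a CUDA context
        import pycuda.driver as drv
        from pycuda.compiler import SourceModule

        mod = SourceModule("""
        __global__ void scale(float *dest, const float *src, float factor, int n)
        {
            const int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global index
            if (idx < n)                    // guard against the last, partial block
                dest[idx] = factor * src[idx];
        }
        """)
        scale = mod.get_function("scale")

        n = 1 << 20
        src = np.random.randn(n).astype(np.float32)
        dest = np.empty_like(src)

        threads = 256
        blocks = (n + threads - 1) // threads       # enough blocks to cover all n elements
        scale(drv.Out(dest), drv.In(src), np.float32(2.0), np.int32(n),
              block=(threads, 1, 1), grid=(blocks, 1))

        assert np.allclose(dest, 2.0 * src)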
  • 42. How do I Program it ? Main Logic → GCC → .bin → CPU, Kernel → NVCC → .cubin → GPU PyCon 4 – Florence 2010 – Fabrizio Milo
  • 43. How do I Program it ? Main Logic → GCC → .bin → CPU, Kernel → NVCC → .cubin → GPU PyCon 4 – Florence 2010 – Fabrizio Milo
  • 44. RAM CPU Global Memory Host - Device PyCon 4 – Florence 2010 – Fabrizio Milo
  • 45. RAM CPU Global Memory PyCon 4 – Florence 2010 – Fabrizio Milo
  • 46. Allocate Memory cudaMalloc( pointer, size ) PyCon 4 – Florence 2010 – Fabrizio Milo
  • 47. Copy to device cudaMalloc( pointer, size ) cudaMemcpy( dest, src, size, direction) PyCon 4 – Florence 2010 – Fabrizio Milo
  • 48. Kernel Launch cudaMalloc( pointer, size ) cudaMemcpy( dest, src, size, direction) Kernel<<< #blocks, #threads >>>( params ) PyCon 4 – Florence 2010 – Fabrizio Milo
  • 49. Get Back the Results cudaMalloc( pointer, size ) cudaMemcpy( dest, src, size, direction) Kernel<<< #blocks, #threads >>>( params ) cudaMemcpy( dest, src, size, direction) PyCon 4 – Florence 2010 – Fabrizio Milo
  • 50. Error Handling if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); } PyCon 4 – Florence 2010 – Fabrizio Milo
  • 51. And soon it becomes … if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); } if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); } if (Kernel<<< #blocks, #threads >>>( params ) != cudaSuccess) { handle_error(); } if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); } PyCon 4 – Florence 2010 – Fabrizio Milo
  • 52. And soon it becomes … (the same cudaMalloc / cudaMemcpy / kernel-launch error checks, repeated over and over for every single call) PyCon 4 – Florence 2010 – Fabrizio Milo
  • 53. PyCon 4 – Florence 2010 – Fabrizio Milo
  • 54. 1. Why a GPU ? 2. How does it work ? 3. How do I Program it ? 4. Can I Use Python ? PyCon 4 – Florence 2010 – Fabrizio Milo
  • 55. + & ANDREAS KLÖCKNER = PYCUDA PyCon 4 – Florence 2010 – Fabrizio Milo
  • 56. PyCuda Philosophy Provide Complete Access PyCon 4 – Florence 2010 – Fabrizio Milo
  • 57. PyCuda Philosophy Automatically Manage Resources PyCon 4 – Florence 2010 – Fabrizio Milo
  • 58. PyCuda Philosophy Check and Report Errors PyCon 4 – Florence 2010 – Fabrizio Milo
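    In plain CUDA C every call returns a status code that has to be checked by hand (slides 50-52); PyCUDA instead raises Python exceptions. A minimal sketch, assuming the oversized request below fails on any device and that PyCUDA surfaces it as pycuda.driver.MemoryError:

        import pycuda.autoinit
        import pycuda.driver as drv

        try:
            drv.mem_alloc(10**15)           # ~1 PB: the underlying CUDA call fails
        except drv.MemoryError as err:      # PyCUDA turns the error code into an exception
            print("allocation failed:", err)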
  • 59. PyCuda Philosophy Cross Platform PyCon 4 – Florence 2010 – Fabrizio Milo
  • 60. PyCuda Philosophy Allow Interactive Use PyCon 4 – Florence 2010 – Fabrizio Milo
  • 61. PyCuda Philosophy NumPy Integration PyCon 4 – Florence 2010 – Fabrizio Milo
  • 62. NUMPY - ARRAY PyCon 4 – Florence 2010 – Fabrizio Milo
  • 63. [A 100-element array of ones, indices 0–99] import numpy my_array = numpy.array([1,] * 100) PyCon 4 – Florence 2010 – Fabrizio Milo
  • 64. [The same array with element 3 set to 0] import numpy my_array = numpy.array([1,] * 100) my_array[3] = 0 PyCon 4 – Florence 2010 – Fabrizio Milo
  • 65. PyCuda: Workflow PyCon 4 – Florence 2010 – Fabrizio Milo
  • 66. PyCuda: Workflow PyCon 4 – Florence 2010 – Fabrizio Milo
  • 67. PyCuda: Workflow PyCon 4 – Florence 2010 – Fabrizio Milo
  • 68. Memory Allocation cuda.mem_alloc( size_bytes ) PyCon 4 – Florence 2010 – Fabrizio Milo
  • 69. Memory Copy gpu_mem = cuda.mem_alloc( size_bytes ) cuda.memcpy_htod( gpu_mem, cpu_mem ) PyCon 4 – Florence 2010 – Fabrizio Milo
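    These two calls mirror cudaMalloc and cudaMemcpy on the host side. A minimal round-trip sketch (the 4x4 float32 array is just an illustration):

        import numpy as np
        import pycuda.autoinit
        import pycuda.driver as cuda

        a = np.random.randn(4, 4).astype(np.float32)   # GPUs prefer float32

        a_gpu = cuda.mem_alloc(a.nbytes)   # allocate raw device memory
        cuda.memcpy_htod(a_gpu, a)         # copy host -> device

        back = np.empty_like(a)
        cuda.memcpy_dtoh(back, a_gpu)      # copy device -> host
        assert np.allclose(a, back)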
  • 70. Kernel gpu_mem = cuda.mem_alloc( size_bytes ) cuda.memcpy_htod( gpu_mem, cpu_mem ) SourceModule(""" __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; }""") PyCon 4 – Florence 2010 – Fabrizio Milo
  • 71. Kernel Launch mod = SourceModule(""" __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; }""") multiply_them = mod.get_function("multiply_them") multiply_them( *args, block=(30, 64, 1)) PyCon 4 – Florence 2010 – Fabrizio Milo
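    Putting slides 68-71 together, a complete, runnable version of the multiply_them example could look like the sketch below; the 400-element arrays and the one-dimensional block size are assumptions, not values from the slides.

        import numpy as np
        import pycuda.autoinit
        import pycuda.driver as drv
        from pycuda.compiler import SourceModule

        mod = SourceModule("""
        __global__ void multiply_them(float *dest, float *a, float *b)
        {
            const int i = threadIdx.x;
            dest[i] = a[i] * b[i];
        }
        """)
        multiply_them = mod.get_function("multiply_them")

        a = np.random.randn(400).astype(np.float32)
        b = np.random.randn(400).astype(np.float32)
        dest = np.zeros_like(a)

        # One block of 400 threads: the kernel indexes with threadIdx.x only.
        multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1))

        assert np.allclose(dest, a * b)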
  • 72. PyCon 4 – Florence 2010 – Fabrizio Milo
  • 73. PyCon 4 – Florence 2010 – Fabrizio Milo
  • 74. PyCon 4 – Florence 2010 – Fabrizio Milo
  • 75. Hello Gpu DEMO PyCon 4 – Florence 2010 – Fabrizio Milo
  • 76. GPUARRAY PyCon 4 – Florence 2010 – Fabrizio Milo
  • 77. gpuarray PyCon 4 – Florence 2010 – Fabrizio Milo
  • 78. PyCuda: GpuArray a_gpu = gpuarray.to_gpu( numpy_array ) numpy_array = a_gpu.get() PyCon 4 – Florence 2010 – Fabrizio Milo
  • 79. PyCuda: GpuArray a_gpu = gpuarray.to_gpu( numpy_array ) numpy_array = a_gpu.get() +, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product … PyCon 4 – Florence 2010 – Fabrizio Milo
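    A short sketch of the gpuarray round trip and of GPU-side arithmetic; the array size and the particular expression are arbitrary choices for illustration.

        import numpy as np
        import pycuda.autoinit
        import pycuda.gpuarray as gpuarray
        import pycuda.cumath as cumath

        a = np.random.randn(1000).astype(np.float32)
        a_gpu = gpuarray.to_gpu(a)          # host -> device

        b_gpu = 2 * a_gpu + 1               # arithmetic executes on the GPU
        c_gpu = cumath.sin(b_gpu)           # elementwise sin, still on the GPU

        assert np.allclose(c_gpu.get(), np.sin(2 * a + 1), atol=1e-5)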
  • 80. PyCuda: GpuArray: ElementWise from pycuda.elementwise import ElementwiseKernel PyCon 4 – Florence 2010 – Fabrizio Milo
  • 81. PyCuda: GpuArray: ElementWise from pycuda.elementwise import ElementwiseKernel lin_comb = ElementwiseKernel( "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]" ) PyCon 4 – Florence 2010 – Fabrizio Milo
  • 82. PyCuda: GpuArray: ElementWise from pycuda.elementwise import ElementwiseKernel lin_comb = ElementwiseKernel( "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]" ) c_gpu = gpuarray.empty_like( a_gpu ) lin_comb( 5, a_gpu, 6, b_gpu, c_gpu ) assert la.norm( (c_gpu - (5*a_gpu + 6*b_gpu)).get() ) < 1e-5 PyCon 4 – Florence 2010 – Fabrizio Milo
  • 83. Meta-Programming __kernel_template__ = """ __global__ void kernel( args ) { for (int i=0; i <= {{ iterations }}; i++){ {{operations}} } }""" See for example jinja2 PyCon 4 – Florence 2010 – Fabrizio Milo
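    As an illustration of the idea, here is a sketch that renders a jinja2 template into CUDA source and compiles it with SourceModule; the kernel name saxpy_repeat, the unrolled operation and the launch sizes are made up for this example.

        import numpy as np
        import pycuda.autoinit
        import pycuda.driver as drv
        from pycuda.compiler import SourceModule
        from jinja2 import Template

        kernel_template = Template("""
        __global__ void saxpy_repeat(float *y, float *x, float a, int n)
        {
            const int idx = blockIdx.x * blockDim.x + threadIdx.x;
            if (idx >= n) return;
            for (int i = 0; i < {{ iterations }}; i++) {
                {{ operation }};
            }
        }
        """)

        # The template is specialized at Python run time, then compiled by nvcc.
        source = kernel_template.render(iterations=4,
                                        operation="y[idx] = a * x[idx] + y[idx]")
        mod = SourceModule(source)
        saxpy_repeat = mod.get_function("saxpy_repeat")

        n = 1024
        x = np.ones(n, dtype=np.float32)
        y = np.zeros(n, dtype=np.float32)
        saxpy_repeat(drv.InOut(y), drv.In(x), np.float32(0.5), np.int32(n),
                     block=(256, 1, 1), grid=(n // 256, 1))
        assert np.allclose(y, 2.0)   # four iterations of y += 0.5 * x with x == 1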
  • 84. Meta-Programming PyCon 4 – Florence 2010 – Fabrizio Milo
  • 85. Meta-Programming Generate Source ! PyCon 4 – Florence 2010 – Fabrizio Milo
  • 86. Performance ? PyCon 4 – Florence 2010 – Fabrizio Milo
  • 87. Mandelbrot DEMO PyCon 4 – Florence 2010 – Fabrizio Milo
  • 88. PyCuda: Documentation PyCon 4 – Florence 2010 – Fabrizio Milo
  • 89. PyCuda Website: http://mathema.tician.de/software/pycuda License: X Consortium License (no warranty, free for all use) Dependencies: Python 2.4+, numpy, Boost PyCon 4 – Florence 2010 – Fabrizio Milo
  • 90. In the Future … OPENCL PyCon 4 – Florence 2010 – Fabrizio Milo
  • 91. THANK YOU & HAVE FUN ! PyCon 4 – Florence 2010 – Fabrizio Milo
  • 92. ? PyCon 4 – Florence 2010 – Fabrizio Milo