PyCuda: How to harness the power of video cards in Python applications
Fabrizio Milo – PyCon 4, Florence 2010

Presentation Transcript

  • PyCUDA: Harnessing the power of GPU with Python
  • Talk Structure: 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  • WHY A GPU?
  • APPLICATIONS & DEMOS
  • Why GPU?
  • Talk Structure: 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  • How does it work?
  • [Diagram: CPU – a few ALUs next to large Control logic, Cache, and DRAM]
  • [Diagram: GPU – DRAM with many simple ALUs]
  • [Diagram: CPU and GPU side by side – ALUs, Control, Cache, DRAM]
  • CUDA
  • Compute Unified Device Architecture
  • CUDA: a parallel computing architecture for NVIDIA GPUs (also exposed through DirectX Compute)
  • CUDA: Execution Model and Device Model
  • EXECUTION MODEL
  • Thread: smallest unit of logic
  • A Block: a group of threads
  • A Grid: a group of blocks
  • One Block can have many threads
  • One Grid can have many blocks
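  • In numbers, a minimal plain-Python sketch (hypothetical sizes, not a CUDA API): a launch configuration is just blocks-per-grid times threads-per-block.

        block = (64, 1, 1)                        # threads per block
        grid = (30, 1)                            # blocks per grid
        threads_per_block = block[0] * block[1] * block[2]
        total_threads = grid[0] * grid[1] * threads_per_block
        print(threads_per_block, total_threads)   # 64 threads per block, 1920 threads in total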
  • DEVICE MODEL: the hardware
  • Scalar Processor
  • Many Scalar Processors
  • + Register File
  • + Shared Memory
  • Multiprocessor
  • Device
  • Real Example: 10-Series Architecture: 240 Scalar Processor (SP) cores execute kernel threads; 30 Streaming Multiprocessors (SMs), each containing 8 scalar processors, 1 double precision unit, and shared memory
  • Software → Hardware: Thread → Scalar Processor; Thread Block → Multiprocessor; Grid → Device
  • Global Memory
  • [Diagram: Host (CPU and RAM) connected to the Device's Global Memory]
  • [Diagram: one Host (CPU and RAM) connected to multiple Devices]
  • 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  • Software → Hardware: Thread → Scalar Processor; Thread Block → Multiprocessor; Grid → Device
  • Kernel (Thread): __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; }
  • Kernel (Block): __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; }
  • Kernel (Grid): __global__ void kernel( … ) { const int idx = blockIdx.x * blockDim.x + threadIdx.x; … }
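  • To see what the grid-level kernel computes, here is a plain-Python/numpy sketch of the work all the threads perform together (illustrative only; on the GPU the loop iterations run in parallel, one per thread):

        import numpy as np

        a = np.random.rand(1024).astype(np.float32)
        b = np.random.rand(1024).astype(np.float32)
        dest = np.empty_like(a)

        block_dim = 256                               # threads per block (hypothetical)
        grid_dim = len(a) // block_dim                # 4 blocks cover the 1024 elements
        for block_idx in range(grid_dim):
            for thread_idx in range(block_dim):
                idx = block_idx * block_dim + thread_idx   # same formula as in the kernel
                dest[idx] = a[idx] * b[idx]                # each thread handles one element

        assert np.allclose(dest, a * b)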
  • How do I Program it? [Diagram: the main logic is compiled by GCC into a .bin that runs on the CPU; the kernel is compiled by NVCC into a .cubin that runs on the GPU]
  • [Diagram: Host (CPU and RAM) and Device (Global Memory): data is copied between the two]
  • Allocate Memory: cudaMalloc( pointer, size )
  • Copy to device: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction )
  • Kernel Launch: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction ); Kernel<<< #blocks, #threads >>>( *params )
  • Get Back the Results: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction ); Kernel<<< #blocks, #threads >>>( *params ); cudaMemcpy( dest, src, size, direction )
  • Error Handling: if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); }
  • And soon it becomes … if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); } if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); } if (Kernel<<< #blocks, #threads >>>( *params ) != cudaSuccess) { handle_error(); } if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); } … and the same checks repeat for every single CUDA call in the program.
  • 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  • Python + CUDA & Andreas Klöckner = PYCUDA
  • PyCuda Philosophy: Provide Complete Access
  • PyCuda Philosophy: Automatically Manage Resources
  • PyCuda Philosophy: Check and Report Errors
  • PyCuda Philosophy: Cross Platform
  • PyCuda Philosophy: Allow Interactive Use
  • PyCuda Philosophy: NumPy Integration (two of these points are sketched just below)
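  • Two of those points in a minimal sketch (assuming a CUDA-capable machine with pycuda installed): importing pycuda.autoinit creates a context and releases it at exit, and a failing CUDA call is reported as a Python exception instead of a status code to check by hand.

        import pycuda.autoinit              # resource management: context created now, cleaned up at exit
        import pycuda.driver as cuda

        try:
            cuda.mem_alloc(10 ** 15)        # deliberately absurd request (about a petabyte)
        except (cuda.Error, MemoryError) as err:
            print("CUDA call failed:", err) # the error is raised, not silently ignored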
  • NUMPY - ARRAY
  • [Diagram: an array of 100 ones, indices 0 … 99]   import numpy; my_array = numpy.array([1,] * 100)
  • [Diagram: the same array with element 3 set to 0]   import numpy; my_array = numpy.array([1,] * 100); my_array[3] = 0
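  • One practical note before the GPU examples: the kernels in this talk take float (single-precision) pointers, so the numpy arrays are usually created as, or converted to, numpy.float32 before being sent to the card. For instance:

        import numpy
        my_array = numpy.array([1,] * 100, dtype=numpy.float32)  # same array as above, but single precision
        my_array[3] = 0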
  • PyCuda: Workflow
  • Memory Allocation: cuda.mem_alloc( size_bytes )
  • Memory Copy: gpu_mem = cuda.mem_alloc( size_bytes ); cuda.memcpy_htod( gpu_mem, cpu_mem )
  • Kernel: gpu_mem = cuda.mem_alloc( size_bytes ); cuda.memcpy_htod( gpu_mem, cpu_mem ); SourceModule(""" __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } """)
  • Kernel Launch: mod = SourceModule(""" __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } """); multiply_them = mod.get_function("multiply_them"); multiply_them( *args, block=(30, 64, 1) )
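  • Putting it all together, a minimal runnable version (close to the introductory example in the PyCUDA documentation; assumes a CUDA-capable GPU with pycuda installed):

        import numpy
        import pycuda.autoinit                    # create a context on the first available device
        import pycuda.driver as cuda
        from pycuda.compiler import SourceModule

        mod = SourceModule("""
        __global__ void multiply_them(float *dest, float *a, float *b)
        {
            const int i = threadIdx.x;
            dest[i] = a[i] * b[i];
        }
        """)
        multiply_them = mod.get_function("multiply_them")

        a = numpy.random.randn(400).astype(numpy.float32)
        b = numpy.random.randn(400).astype(numpy.float32)
        dest = numpy.zeros_like(a)

        a_gpu = cuda.mem_alloc(a.nbytes)          # allocate device memory
        b_gpu = cuda.mem_alloc(b.nbytes)
        dest_gpu = cuda.mem_alloc(dest.nbytes)
        cuda.memcpy_htod(a_gpu, a)                # copy host -> device
        cuda.memcpy_htod(b_gpu, b)

        multiply_them(dest_gpu, a_gpu, b_gpu, block=(400, 1, 1), grid=(1, 1))

        cuda.memcpy_dtoh(dest, dest_gpu)          # copy device -> host
        print(numpy.allclose(dest, a * b))        # True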
  • Hello Gpu DEMO
  • GPUARRAY
  • gpuarray
  • PyCuda: GpuArray   gpuarray.to_gpu( numpy array )   numpy array = gpuarray.get()
  • PyCuda: GpuArray   gpuarray.to_gpu( numpy array )   numpy array = gpuarray.get()   +, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product …
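  • As a short runnable sketch (assuming pycuda is installed and a CUDA device is present), a GpuArray behaves much like a numpy array that lives in GPU memory:

        import numpy
        import pycuda.autoinit
        import pycuda.gpuarray as gpuarray

        a = numpy.random.randn(1000).astype(numpy.float32)
        a_gpu = gpuarray.to_gpu(a)                     # host -> device
        b_gpu = 2 * a_gpu + 1                          # arithmetic executes on the GPU
        print(numpy.allclose(b_gpu.get(), 2 * a + 1))  # .get() copies the result back as a numpy array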
  • PyCuda: GpuArray: ElementWise   from pycuda.elementwise import ElementwiseKernel
  • PyCuda: GpuArray: ElementWise   from pycuda.elementwise import ElementwiseKernel;   lincomb = ElementwiseKernel( "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]" )
  • PyCuda: GpuArray: ElementWise   lincomb = ElementwiseKernel( "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]" );   c_gpu = gpuarray.empty_like( a_gpu );   lincomb( 5, a_gpu, 6, b_gpu, c_gpu );   assert la.norm( (c_gpu - (5*a_gpu + 6*b_gpu)).get() ) < 1e-5
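  • The same linear-combination kernel, filled out so it runs on its own (assuming pycuda and a CUDA device; la is numpy.linalg, as in the slide, and the random inputs come from pycuda.curandom):

        import numpy.linalg as la
        import pycuda.autoinit
        import pycuda.gpuarray as gpuarray
        from pycuda.curandom import rand as curand
        from pycuda.elementwise import ElementwiseKernel

        # One fused element-wise kernel: z = a*x + b*y computed in a single pass.
        lincomb = ElementwiseKernel(
            "float a, float *x, float b, float *y, float *z",
            "z[i] = a*x[i] + b*y[i]")

        a_gpu = curand((50,))                  # random GpuArrays created directly on the device
        b_gpu = curand((50,))
        c_gpu = gpuarray.empty_like(a_gpu)

        lincomb(5, a_gpu, 6, b_gpu, c_gpu)
        assert la.norm((c_gpu - (5 * a_gpu + 6 * b_gpu)).get()) < 1e-5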
  • Meta-Programming:   __kernel_template__ = """ __global__ void kernel( args ) { for (int i = 0; i <= {{ iterations }}; i++) { {{operations}} } } """   See for example jinja2
  • Meta-Programming
  • Meta-Programming Generate Source !
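  • A minimal sketch of this idea with jinja2 (the template and the values filled in are hypothetical; the rendered string would then be passed to SourceModule as usual):

        from jinja2 import Template

        kernel_template = Template("""
        __global__ void kernel(float *data)
        {
            const int idx = blockIdx.x * blockDim.x + threadIdx.x;
            for (int i = 0; i < {{ iterations }}; i++) {
                {{ operation }};
            }
        }
        """)

        # Generate CUDA source with the loop count and body baked in at the Python level.
        source = kernel_template.render(iterations=4, operation="data[idx] = data[idx] * 2.0f")
        print(source)    # ready for pycuda.compiler.SourceModule(source)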
  • Performance?
  • mandelbrot DEMO
  • PyCuda: Documentation
  • PyCuda Website: http://mathema.tician.de/software/pycuda   License: X Consortium License (no warranty, free for all use)   Dependencies: Python 2.4+, numpy, Boost
  • In the Future … OPENCL
  • THANK YOU & HAVE FUN !
  • ?