PyCuda: How to harness the power of graphics cards in Python applications

Fabrizio Milo



  1. PyCUDA: Harnessing the power of the GPU with Python
  2. Talk Structure: 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  3. Talk Structure: 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  4. WHY A GPU?
  5. APPLICATIONS & DEMOS
  6. Why GPU?
  7. Talk Structure: 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  8. How does it work?
  9. CPU diagram: Control, ALUs, Cache, DRAM
  10. GPU diagram: DRAM
  11. CPU (Control, ALUs, Cache, DRAM) versus GPU (DRAM) diagram
  12. CUDA
  13. Compute Unified Device Architecture
  14. CUDA: a parallel computing architecture for NVIDIA GPUs (DirectX Compute)
  15. Execution Model / CUDA Device Model
  16. EXECUTION MODEL
  17. Thread: the smallest unit of logic
  18. A Block: a group of threads
  19. A Grid: a group of blocks
  20. One block can have many threads
  21. One grid can have many blocks
  22. The hardware: DEVICE MODEL
  23. Scalar Processor
  24. Scalar Processor
  25. Many Scalar Processors
  26. + Register File
  27. + Shared Memory
  28. Multiprocessor
  29. Device
  30. Real example: 10-series architecture. 240 Scalar Processor (SP) cores execute kernel threads; 30 Streaming Multiprocessors (SMs) each contain 8 scalar processors, 1 double-precision unit, and shared memory.
  31. Software-to-hardware mapping: Thread → Scalar Processor; Thread Block → Multiprocessor; Grid → Device
  32. Global Memory
  33. Global Memory
  34. Host – Device: CPU and RAM on the host, Global Memory on the device (diagram)
  35. Host – Multi Device: one CPU/RAM host driving several devices (diagram)
  36. 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  37. Software-to-hardware mapping: Thread → Scalar Processor; Thread Block → Multiprocessor; Grid → Device
  38. Kernel (one thread): __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; }
  39. Kernel (one thread): __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; }
  40. Kernel (one block): __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; }
  41. Kernel (a grid): __global__ void kernel( … ) { const int idx = blockIdx.x * blockDim.x + threadIdx.x; … } (a runnable sketch follows below)
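The grid-indexed kernel on slide 41 only sketches the body; a minimal runnable PyCUDA version of the same pattern might look like the following (the kernel name, the scaling operation, and the sizes are illustrative, not from the slides):

    # Sketch: a grid of blocks, each thread handling one array element.
    import numpy
    import pycuda.autoinit            # initializes CUDA and creates a context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void scale(float *dest, float *src, int n)
    {
        const int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global index
        if (idx < n)                  // guard: the grid may overshoot n
            dest[idx] = 2.0f * src[idx];
    }
    """)
    scale = mod.get_function("scale")

    n = 1000
    src = numpy.random.randn(n).astype(numpy.float32)
    dest = numpy.empty_like(src)

    # 256 threads per block, enough blocks to cover all n elements
    block = (256, 1, 1)
    grid = ((n + block[0] - 1) // block[0], 1)
    scale(cuda.Out(dest), cuda.In(src), numpy.int32(n), block=block, grid=grid)

    assert numpy.allclose(dest, 2 * src)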
  42. How do I program it? The main logic is compiled with GCC into a CPU binary (.bin); the kernel is compiled with NVCC into a GPU binary (.cubin).
  43. How do I program it? (same toolchain diagram: GCC → .bin → CPU, NVCC → .cubin → GPU)
  44. Host – Device: CPU and RAM on the host, Global Memory on the device (diagram)
  45. CPU, RAM, and Global Memory (diagram)
  46. Allocate memory: cudaMalloc( pointer, size )
  47. Copy to device: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction )
  48. Kernel launch: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction ); Kernel<<< #blocks, #threads >>>( *params )
  49. Get back the results: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction ); Kernel<<< #blocks, #threads >>>( *params ); cudaMemcpy( dest, src, size, direction )
  50. Error handling: if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); }
  51. And soon it becomes… if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error(); } if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); } if (Kernel<<< #blocks, #threads >>>( *params ) != cudaSuccess) { handle_error(); } if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { handle_error(); }
  52. And soon it becomes… the same allocate / copy / launch / copy-back checks, repeated for every single CUDA call.
  54. 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  55. Python + CUDA & Andreas Klöckner = PyCUDA
  56. PyCuda philosophy: provide complete access
  57. PyCuda philosophy: automatically manage resources
  58. PyCuda philosophy: check and report errors
  59. PyCuda philosophy: cross-platform
  60. PyCuda philosophy: allow interactive use
  61. PyCuda philosophy: NumPy integration
  62. NUMPY ARRAY
  63. import numpy; my_array = numpy.array([1,] * 100)   (an array of 100 ones, indices 0 to 99)
  64. my_array[3] = 0   (the element at index 3 is now 0; see the note below)
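One detail worth noting before these arrays are moved to the GPU (this note is an addition, not from the slides): NumPy defaults to 64-bit element types, while the kernels in this talk declare float* parameters, so arrays are normally converted to single precision first.

    import numpy

    my_array = numpy.array([1,] * 100)           # integer dtype by default
    gpu_ready = my_array.astype(numpy.float32)   # match the kernel's float* arguments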
  65. PyCuda: Workflow
  66. PyCuda: Workflow
  67. PyCuda: Workflow
  68. Memory allocation: cuda.mem_alloc( size_bytes )
  69. Memory copy: gpu_mem = cuda.mem_alloc( size_bytes ); cuda.memcpy_htod( gpu_mem, cpu_mem )
  70. Kernel: gpu_mem = cuda.mem_alloc( size_bytes ); cuda.memcpy_htod( gpu_mem, cpu_mem ); SourceModule(""" __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } """)
  71. Kernel launch: mod = SourceModule(""" __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } """); multiply_them = mod.get_function("multiply_them"); multiply_them( *args, block=(30, 64, 1) )   (a complete script follows below)
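Slides 68 through 71 show the individual calls; stitched together, and following the layout of the standard PyCUDA tutorial (the array size of 400 and the random inputs are illustrative), a complete script looks roughly like this:

    # Minimal end-to-end PyCUDA script: allocate, copy in, launch, copy back.
    import numpy
    import pycuda.autoinit            # initializes CUDA and creates a context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    a = numpy.random.randn(400).astype(numpy.float32)   # single precision for float*
    b = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.zeros_like(a)

    # Allocate device memory and copy the inputs host-to-device
    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    dest_gpu = cuda.mem_alloc(dest.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    cuda.memcpy_htod(b_gpu, b)

    mod = SourceModule("""
    __global__ void multiply_them( float *dest, float *a, float *b )
    {
        const int i = threadIdx.x;
        dest[i] = a[i] * b[i];
    }
    """)
    multiply_them = mod.get_function("multiply_them")

    # One block of 400 threads, one thread per element
    multiply_them(dest_gpu, a_gpu, b_gpu, block=(400, 1, 1), grid=(1, 1))

    cuda.memcpy_dtoh(dest, dest_gpu)                     # copy the result back
    assert numpy.allclose(dest, a * b)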
  75. Hello GPU DEMO
  76. GPUARRAY
  77. gpuarray
  78. PyCuda: GpuArray: gpuarray.to_gpu( numpy_array ); numpy_array = gpu_array.get()
  79. PyCuda: GpuArray: gpuarray.to_gpu( numpy_array ); numpy_array = gpu_array.get(); supports +, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product … (see the sketch below)
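As an illustration of the gpuarray interface (a short sketch built on the documented to_gpu / get calls; the 4x4 array and the arithmetic are arbitrary choices):

    import numpy
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray

    a = numpy.random.randn(4, 4).astype(numpy.float32)
    a_gpu = gpuarray.to_gpu(a)          # host -> device copy

    b_gpu = (2 * a_gpu) + 1             # arithmetic runs on the GPU
    print(b_gpu.get())                  # device -> host copy back to NumPy

    assert numpy.allclose(b_gpu.get(), 2 * a + 1)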
  80. PyCuda: GpuArray: ElementWise. from pycuda.elementwise import ElementwiseKernel
  81. PyCuda: GpuArray: ElementWise. lin_comb = ElementwiseKernel( "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]" )
  82. PyCuda: GpuArray: ElementWise. lin_comb = ElementwiseKernel( "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]" ); c_gpu = gpuarray.empty_like(a_gpu); lin_comb(5, a_gpu, 6, b_gpu, c_gpu); assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5   (full sketch below)
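The slide omits the setup of a_gpu, b_gpu and la; a self-contained version, following the ElementwiseKernel example from the PyCUDA documentation (the array length of 50 is arbitrary), is roughly:

    import numpy.linalg as la
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.curandom import rand as curand
    from pycuda.elementwise import ElementwiseKernel

    # z[i] = a*x[i] + b*y[i], generated and compiled once, then reusable
    lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]",
        "linear_combination")

    a_gpu = curand((50,))               # random single-precision data on the GPU
    b_gpu = curand((50,))
    c_gpu = gpuarray.empty_like(a_gpu)

    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
    assert la.norm((c_gpu - (5 * a_gpu + 6 * b_gpu)).get()) < 1e-5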
  83. Meta-Programming: __kernel_template__ = """ __global__ void kernel( args ) { for (int i = 0; i < {{ iterations }}; i++) { {{ operations }} } } """   See for example jinja2
  84. Meta-Programming
  85. Meta-Programming: Generate Source! (a small sketch follows below)
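As a sketch of the idea, combining jinja2's Template API with SourceModule (the loop body, the iteration count, and the kernel name are made up for illustration):

    import numpy
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule
    from jinja2 import Template

    kernel_template = Template("""
    __global__ void kernel(float *data)
    {
        const int i = threadIdx.x;
        for (int n = 0; n < {{ iterations }}; n++) {
            {{ operation }}
        }
    }
    """)

    # Render the template into concrete CUDA C, then compile it as usual
    source = kernel_template.render(iterations=4, operation="data[i] = data[i] * 2.0f;")
    mod = SourceModule(source)
    kernel = mod.get_function("kernel")

    data = numpy.ones(64, dtype=numpy.float32)
    kernel(cuda.InOut(data), block=(64, 1, 1))
    assert numpy.allclose(data, 16.0)   # doubled 4 times: 2**4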
  86. Performance?
  87. Mandelbrot DEMO
  88. PyCuda: Documentation
  89. PyCuda website: http://mathema.tician.de/software/pycuda. License: X Consortium License (no warranty, free for all use). Dependencies: Python 2.4+, numpy, Boost.
  90. In the future… OPENCL
  91. THANK YOU & HAVE FUN!
  92. ?
