Scripting CUDA
(using python, R and
Ferdinand Jamitzky
Why parallel programming?
End of the free lunch
Moore's law means
no longer faster
processors, only more
of them. But beware!
2 x 3 GHz < 6 GHz
(cache consistency,
multi-threading, etc)
The future is parallel
●Moore's law is still valid
●Number of transistors doubles every 2 years
●Clock speed saturates at 3 to 4 GHz
●multi-core processors vs many-core processors
●grid/cloud computing
(intel 2005)
Supercomputer scaling
Supercomputer: SMP
SMP Machine:
shared memory
typically 10s of cores
threaded programs
bus interconnect
in R:
and inlined code
Example: gvs1
128 GB RAM
16 cores
Example: uv2/3
3,359 GB RAM
2,080 cores
Supercomputer: MPI
Cluster of machines:
distributed memory
typically 100s of cores
message passing interface
infiniband interconnect
in R:
and inlined code
Example: linux MPP cluster
2752 GB RAM
2752 cores
Example: superMUC
340,000 GB RAM
155,656 Intel cores
Supercomputer: GPGPU
Graphics Card:
shared memory
typically 1000s of cores
CUDA or OpenCL
on chip interconnect
in R:
and inlined code
Example: Tesla K20X
2688 Threads
Example: Titan ORNL
262,000 GB RAM
18,688 GPU Cards
50,233,344 Threads
The future is massively parallel
Connection Machine
CM-1 (1983)
12-D Hypercube
65536 1-bit cores
Rmax: 20 GFLOP/s
The future is massively parallel
Blue Gene/P (2007)
3-D Torus or Tree
65536 64-bit cores
(PowerPC 450)
Rmax: 222 TFLOP/s
now: 1 PFLOP/s
294912 cores
Levels of Parallelism
●Node Level (e.g. SuperMUC has approx. 10000 nodes)
each node has 2 sockets
●Socket Level
each socket contains 8 cores
●Core Level
each core has 16 vector registers
●Vector Level (e.g. lxgp1 GPGPU has 480 vector registers)
●Pipeline Level (how many simultaneous pipelines)
●Instruction Level (instructions per cycle)
out of order execution, branch prediction
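As a rough consistency check, the node, socket and core counts quoted on this slide can be multiplied out. This is a minimal sketch that uses only the approximate figures given above; the variable names are illustrative.

```python
# Approximate figures taken from the slide (SuperMUC).
nodes = 10000            # "approx. 10000 nodes"
sockets_per_node = 2
cores_per_socket = 8

cores_per_node = sockets_per_node * cores_per_socket
total_cores = nodes * cores_per_node

print("cores per node:", cores_per_node)
print("total cores (approx.):", total_cores)
```

The product is only approximate because the node count is rounded; it is of the same order as the 155,656 Intel cores listed for SuperMUC earlier.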
Problems: Access Times
Getting data from:
CPU register 1ns
L2 cache 10ns
memory 80 ns
network(IB) 200 ns
GPU(PCIe) 50,000 ns
harddisk 500,000 ns
Getting some food from:
fridge 10s
microwave 100s ~ 2min
pizza service 800s ~ 15min
city mall 2000s ~ 0.5h
mum sends cake 500,000 s ~ 1 week
grown in own garden 5Ms ~ 2months
Amdahl's law
Computing time for N processors
T(N) = T(1)/N + Tserial + Tcomm * N
Acceleration factor:
T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2)
small N: T(1)/T(N) ~ N
large N: T(1)/T(N) ~ 1/N
saturation point!
Amdahl's law III
> Tserial=0.01
> Tcomm=0.001
> N=1:100   # assumed range of processor counts; N is not defined on the slide
> plot(N,type="l")
> lines(N/(1+Tserial*N),col="red")
> lines(N/(1+Tserial*N+Tcomm*N**2),col="green")
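The saturation point can also be located numerically. The sketch below evaluates the acceleration factor with T(1) normalised to 1 and the slide's values Tserial=0.01 and Tcomm=0.001. Setting the derivative of N/(1 + Tserial*N + Tcomm*N^2) to zero gives a maximum at N = 1/sqrt(Tcomm). The function names are illustrative.

```python
import math

def speedup(n, t_serial=0.01, t_comm=0.001):
    """Acceleration factor T(1)/T(N), with T(1) normalised to 1."""
    return n / (1.0 + t_serial * n + t_comm * n ** 2)

def saturation_point(t_comm=0.001):
    """N where the speedup peaks: the derivative vanishes at N = 1/sqrt(Tcomm)."""
    return 1.0 / math.sqrt(t_comm)

if __name__ == "__main__":
    print("saturation near N = %.1f" % saturation_point())
    for n in (1, 10, 32, 100, 1000):
        print(n, round(speedup(n), 2))
```

Beyond that point the communication term dominates and adding processors makes the run slower, which is the 1/N regime noted above.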
How are High-Performance Codes Constructed?
●“Traditional” Construction of High-Performance Codes:
●“Alternative” Construction of High-Performance Codes:
  o Scripting for 'brains'
  o GPUs for 'inner loops'
●Play to the strengths of each programming environment.
Hierarchical architecture of
hardware vs software
●accelerators (gpus, xeon phi)
●in-core vectorisation (avx)
●multicore nodes (qpi, pci bus)
●strongly coupled nodes (infiniband, 10GE)
●weakly coupled clusters (cloud)
●Cuda, intrinsics
●vectorisation pragmas
●workflow middleware
Why Scripting?
Do you:
●want to reuse CUDA code easily (e.g. as a library) ?
●want to dynamically determine whether CUDA is available?
●want to use multi-threading (painlessly)?
●want to use MPI (painlessly)?
●want to use loose coupling (grid computing)?
●want dynamic exception handling and fallbacks?
●want dynamic compilation of CUDA code?
If you answered "yes" to one of these questions, you
should consider a scripting language
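Two of the questions above, run-time detection of CUDA and fallbacks, can be illustrated in a few lines. This is a minimal sketch, assuming PyCUDA as the GPU binding; load_backend and doubled are hypothetical helper names, and the plain-Python branch only stands in for a real CPU implementation.

```python
def load_backend():
    """Return ("gpu", module) if PyCUDA initialises, else ("cpu", None)."""
    try:
        import pycuda.autoinit  # creates a CUDA context on import
        import pycuda.gpuarray as gpuarray
        return "gpu", gpuarray
    except Exception:
        # ImportError if PyCUDA is missing; other errors if no usable device or driver.
        return "cpu", None

def doubled(values):
    """Double a list of floats on the GPU when available, else in plain Python."""
    backend, gpuarray = load_backend()
    if backend == "gpu":
        import numpy
        a_gpu = gpuarray.to_gpu(numpy.asarray(values, dtype=numpy.float32))
        return backend, [float(v) for v in (2 * a_gpu).get()]
    return backend, [2.0 * v for v in values]

backend, result = doubled([1.0, 2.0, 3.0])
print(backend, result)
```

The same script runs unchanged on a login node without a GPU and on a GPU node, which is hard to arrange in a statically compiled CUDA program.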
Parallel Tools in python, R and MATLAB
doMC, doSMP,
pnmath, BLAS
no max cores
MMP massive
doMPI, doRedis
parallel python,
rgpu, gputools
parfor, spmd
max 8 cores
jobs, pmode gpuArray
Scripting CUDA
PGI Fortran NumbaPro pyCUDA rgpu MATLAB
python R
# load matlab module and start command line version
module load cuda
module load matlab/R2011A
matlab -nodesktop
MATLAB gpuArray
●Copy data to GPGPU and return a handle on the object
●All operations on the handle are performed on the GPGPU
●how to compute the GFlop/s
PyCUDA gives you the following advantages:
1. Combining Two Strong Tools
2. Scripting CUDA
3. Run-Time Code Generation
special thanks to A. Klöckner
log in to lxgp1
$ module load python
$ module load cuda
$ module load boost
$ python
Python 2.6.1 (r261:67515, Apr 17 2009, 17:25:25)
[GCC 4.1.2 20070115 (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Simple Example
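import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpu
# assumed input (the right-hand side is missing on the slide): a random 4x4 float32 array
a_gpu = gpu.to_gpu(numpy.random.randn(4, 4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print a_doubled
print a_gpu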
gpuarray class
Meant to look and feel just like numpy.
● gpu(numpy array)
●numpy array = gpuarray.get()
●+, -, ∗, /, fill, sin, exp, rand, basic indexing, norm, inner product
●Mixed types (int32 + float32 = float64)
●print gpuarray for debugging.
●Allows access to raw bits
●Use as kernel arguments, textures, etc.
gpuarray: Elementwise expressions
Avoiding extra store-fetch cycles for elementwise math:
import numpy.linalg as la
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.curandom import rand as curand
a_gpu = curand((50,))
b_gpu = curand((50,))
from pycuda.elementwise import ElementwiseKernel
# one fused kernel computes z = a*x + b*y without intermediate arrays
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")
c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
gpuarray: Reduction made easy
Example: A scalar product calculation
import numpy
import pycuda.autoinit
from pycuda.reduction import ReductionKernel
dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
    reduce_expr="a+b", map_expr="x[i]*y[i]",
    arguments="const float *x, const float *y")
from pycuda.curandom import rand as curand
x = curand((1000*1000), dtype=numpy.float32)
y = curand((1000*1000), dtype=numpy.float32)
x_dot_y = dot(x,y).get()
# CPU reference for comparison
x_dot_y_cpu = numpy.dot(x.get(), y.get())
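The roles of map_expr, reduce_expr and neutral may be easier to see in a CPU-only stand-in. The sketch below is not PyCUDA code: map_reduce is a hypothetical helper that applies the map step element by element and then folds the results, starting from the neutral element.

```python
from functools import reduce
import random

def map_reduce(map_fn, reduce_fn, neutral, *arrays):
    """CPU stand-in for a map-then-reduce kernel over equally long sequences."""
    mapped = (map_fn(*items) for items in zip(*arrays))
    return reduce(reduce_fn, mapped, neutral)

random.seed(0)
x = [random.random() for _ in range(1000)]
y = [random.random() for _ in range(1000)]

# Corresponds to map_expr "x[i]*y[i]", reduce_expr "a+b", neutral "0".
x_dot_y = map_reduce(lambda xi, yi: xi * yi, lambda a, b: a + b, 0.0, x, y)
reference = sum(xi * yi for xi, yi in zip(x, y))
print(abs(x_dot_y - reference) < 1e-9)
```

On the GPU the fold is carried out as a parallel tree rather than left to right, which is why reduce_expr should be associative and why float32 results can differ slightly from the CPU value.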
CUDA Kernels in pyCUDA
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;   // one thread per array element
  dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
# launch one block of 400 threads; drv.In/drv.Out copy the arrays to and from the GPU
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1), grid=(1,1))
print dest-a*b
PyCUDA exposes all of CUDA.
For example:
●Arrays and Textures
●Pagelocked host memory
●Memory transfers (asynchronous, structured)
●Streams and Events
●Device queries
●GL Interop
And furthermore:
●Allow interactive use
●Integrate tightly with numpy
pyCUDA showcase
●Agent-based Models
●Computational Visual Neuroscience
●Discontinuous Galerkin Finite Element PDE Solvers
●Estimating the Entropy of Natural Scenes
●Facial Image Database Search
●Filtered Backprojection for Radar Imaging
●LINGO Chemical Similarities
●Recurrence Diagrams
●Sailfish: Lattice Boltzmann Fluid Dynamics
●Selective Embedded Just In Time Specialization
●Simulation of spiking neural networks
Generate CUDA Kernels using a Just-in-time compiler
from numbapro import cuda
@cuda.jit('void(float32[:], float32[:], float32[:])')
def sum(a, b, result):
    i = cuda.grid(1)   # equals threadIdx.x + blockIdx.x * blockDim.x
    result[i] = a[i] + b[i]
# Invoke like: sum[grid_dim, block_dim](big_input_1, big_input_2, result)
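How grid_dim and block_dim relate to the global index can be checked on the CPU. The sketch below emulates the launch serially in plain Python. launch_config and emulate_sum are hypothetical names and the block size of 256 is an assumption. The bounds guard is not part of the kernel above; it is added here because the last block usually overhangs the array.

```python
def launch_config(n, block_dim=256):
    """Smallest 1-D grid whose threads cover n elements (ceiling division)."""
    grid_dim = (n + block_dim - 1) // block_dim
    return grid_dim, block_dim

def emulate_sum(a, b, block_dim=256):
    """Run the element-wise sum once per emulated thread, serially on the CPU."""
    n = len(a)
    result = [0.0] * n
    grid_dim, block_dim = launch_config(n, block_dim)
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = thread_idx + block_idx * block_dim   # global 1-D thread index
            if i < n:                                # guard: last block may overhang
                result[i] = a[i] + b[i]
    return result

a = [float(k) for k in range(1000)]
b = [1.0] * 1000
print(launch_config(1000))
print(emulate_sum(a, b)[:3])
```

Without the guard, threads in the final block with an index past the end of the array would read and write out of bounds.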
The Language R
R in a nutshell
module load cuda/2.3
module load R/serial/2.13
> x=1:10
> y=x**2
> str(y)
> print(x)
> times2 = function(x) 2*x
> plot(x,y)
= and <- are interchangeable
rgpu: a set of functions for loading data to a GPU and manipulating the
data there:
●sumgpu(x), meangpu(x), gemmgpu(a,b)
●cos, sin,.., +, -, *, /, **, %*%
load the correct R module
$ module load R/serial/2.13
start R
$ R
R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
load rgpu library
> library(rgpu)
> help(package="rgpu")
> rgpudetails()
Data on the GPGPU
ten million random uniform numbers
> x=runif(10000000)
send data to gpu
> exportgpu(x)
do some calculations
> evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x)))
do some timing comparisons (GPU vs CPU):
> system.time(evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x))))
> system.time(sum(sin(x)+cos(x)+tan(x)+exp(x)))
real world examples: gputools
gputools is a package of precompiled CUDA functions for
statistics, linear algebra and machine learning
●gpuCor, gpuAucEstimate
●gpuDist, gpuDistClust, gpuHclust, gpuFastICA
●gpuGlm, gpuLm
●gpuGranger, gpuMi
●gpuMatMult, gpuQr, gpuSvd, gpuSolve
●gpuSvmPredict, gpuSvmTrain
Example: Matrix Inversion
np <- 2000
x <- matrix(runif(np**2), np,np)
Example: Hierarchical Clustering
numVectors <- 5
dimension <- 10
Vectors <- matrix(runif(numVectors*dimension), numVectors, dimension)
distMat <- gpuDist(Vectors, "euclidean")
myClust <- gpuHclust(distMat, "single")
for other examples try:
Fortran 90 Example
program myprog
! simulate harmonic oscillator
integer, parameter :: np=1000, nstep=1000
real :: x(np), v(np), dx(np), dv(np), dt=0.01
integer :: i,j
forall(i=1:np) x(i)=i
forall(i=1:np) v(i)=i
do j=1,nstep
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
end do
print*, " total energy: ",sum(x**2+v**2)
end program
PGI Compiler
log in to lxgp1
$ module load fortran/pgi/11.8
$ pgf90 -o myprog.exe myprog.f90
$ time ./myprog.exe
exercise for you:
●compute MFlop/s (Floating Point Operations: 4 * np * nstep)
●optimize (hint: -Minfo, -fast, -O3)
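For the MFlop/s exercise, the arithmetic is the operation count divided by the elapsed time. The sketch below applies the count of 4 * np * nstep given above to a pure-Python version of the same loop. The function names are illustrative, and nstep is reduced to 100 here only to keep the run short.

```python
import time

def mflops(np_, nstep, seconds):
    """MFlop/s, counting 4 floating-point operations per element and step."""
    return 4.0 * np_ * nstep / seconds / 1.0e6

def oscillator(np_=1000, nstep=100, dt=0.01):
    """Pure-Python version of the Fortran loop; returns the total energy."""
    x = [float(i) for i in range(1, np_ + 1)]
    v = [float(i) for i in range(1, np_ + 1)]
    for _ in range(nstep):
        dx = [vi * dt for vi in v]
        dv = [-xi * dt for xi in x]
        x = [xi + d for xi, d in zip(x, dx)]
        v = [vi + d for vi, d in zip(v, dv)]
    return sum(xi * xi + vi * vi for xi, vi in zip(x, v))

start = time.perf_counter()
energy = oscillator()
elapsed = time.perf_counter() - start
print("total energy:", energy)
print("MFlop/s: %.1f" % mflops(1000, 100, elapsed))
```

For the compiled program, the elapsed time reported by the time command takes the place of the timer used here.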
Fortran 90 Example
program myprog
! simulate harmonic oscillator
integer, parameter :: np=1000, nstep=1000
real :: x(np), v(np), dx(np), dv(np), dt=0.01
integer :: i,j
forall(i=1:np) x(i)=i
forall(i=1:np) v(i)=i
do j=1,nstep
!$acc region
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
!$acc end region
end do
print*, " total energy: ",sum(x**2+v**2)
end program
PGI Compiler accelerator
module load fortran/pgi
pgf90 -ta=nvidia -o myprog.exe myprog.f90
time ./myprog.exe
exercise for you:
●compute MFlop/s (Floating Point Operations: 4 * np * nstep)
●optimize (hint: change acc region)
Use R as scripting language
R can dynamically load shared objects:
these functions can then be called via
.C("fname", args)
.Fortran("fname", args)
R subroutine
subroutine mysub_cuda(x,v,nstep)
! simulate harmonic oscillator
integer, parameter :: np=1000000
real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
integer :: i,j, nstep
forall(i=1:np) x(i)=real(i)/np
forall(i=1:np) v(i)=real(i)/np
do j=1,nstep
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
end do
end subroutine
Compile two versions
don't forget to load the modules!
module unload ccomp fortran
module load ccomp/pgi/11.8
module load fortran/pgi/11.8
module load R/serial/2.13
pgf90 -shared -fPIC -o
pgf90 -ta=nvidia -shared -fPIC -o mysub_cuda.f90
Load and run
Load dynamic libraries
> dyn.load(""), dyn.load(""); np=1000000
> system.time(str(.Fortran("mysub_host",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
total energy: 666667.6633012500
total energy: 667334.6641391169
List of 3
$ x : num [1:1000000] -3.01e-07 -6.03e-07 -9.04e-07 -1.21e-06 -1.51e-06 ...
$ v : num [1:1000000] 1.38e-06 2.76e-06 4.15e-06 5.53e-06 6.91e-06 ...
$ nstep: int 1000
user system elapsed
26.901 0.000 26.900
> system.time(str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
total energy: 666667.6633012500
total energy: 667334.6641391169
List of 3
$ x : num [1:1000000] -3.01e-07 -6.03e-07 -9.04e-07 -1.21e-06 -1.51e-06 ...
$ v : num [1:1000000] 1.38e-06 2.76e-06 4.15e-06 5.53e-06 6.91e-06 ...
$ nstep: int 1000
user system elapsed
0.829 0.000 0.830
Acceleration Factor:
> 26.9/0.83
[1] 32.40964
Matrix Multipl. in FORTRAN
subroutine mmult(a,b,c,np)
integer np
real*8 a(np,np), b(np,np), c(np,np)
integer i,j, k
do k=1, np
forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
end do
end subroutine
Call FORTRAN from R
# compile f90 to shared object library
system("pgf90 -shared -fPIC -o
# dynamically load library
# define multiplication function
mmult.f <- function(a,b,c)
Call FORTRAN binary
a = matrix(numeric(np*np),np,np),
b = matrix(numeric(np*np)+1.,np,np),
c = matrix(numeric(np*np)+1.,np,np)
Exercise: make a plot system-time vs matrix-dimension
PGI accelerator directives
subroutine mmult(a,b,c,np)
integer np
real*8 a(np,np), b(np,np), c(np,np)
integer i,j, k
do k=1, np
!$acc region
forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
!$acc end region
end do
end subroutine
Call FORTRAN from R
# compile f90 to shared object library
system("pgf90 -ta=nvidia -shared -fPIC -o mmult.f90");
# dynamically load library
# define multiplication function
mmult.f <- function(a,b,c)
Compute MFlop/s
)[[3]]," MFlop/s"))
Exercise: Compare MFlop/s vs dimension for serial and
accelerated code
Scripting Parallel Execution
jit pnmath doSNOW doMPI doMC doRedis
hierarchical parallelisation:
- accelerator: rgpu, pnmath, MKL
- intra-node: jit, doMC, MKL
- intra-cluster: SNOW, MPI, pbdMPI
- inter-cluster: Redis, SNOW
foreach package
# new R foreach
alist <-
foreach (i=1:N) %do%
foreach is a function
# old R code
for(i in 1:N)
for is a language construct
multithreading with R
foreach(i=1:N) %do%
# serial execution
# thread execution
MPI with R
foreach(i=1:N) %do%
# serial execution
# MPI execution
# R
> library(doSNOW)
> cl <- makeSOCKcluster(4)
> registerDoSNOW(cl)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
4.864 0.000 4.865
# R
> library(doMC)
> registerDoMC(cores=4)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
9.352 2.652 12.002
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
7.228 7.216 3.296
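The same pattern, one loop body with an interchangeable execution backend, can be sketched in Python with the standard concurrent.futures module. This is an analogy to %do% versus %dopar%, not a port of foreach; task and run are hypothetical names and the workload is a small stand-in.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def task(i):
    """Stand-in for the loop body: a sum of pseudo-random numbers, seeded per task."""
    rng = random.Random(i)
    return sum(rng.random() for _ in range(10000))

def run(n_tasks, executor=None):
    """Serial when executor is None (like %do%), else dispatched to it (like %dopar%)."""
    mapper = map if executor is None else executor.map
    return list(mapper(task, range(1, n_tasks + 1)))

serial = run(10)
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = run(10, pool)
print(serial == parallel)
```

Threads are used here only to keep the sketch portable. For CPU-bound work in CPython a ProcessPoolExecutor would be needed to see a speedup, much as doMC forks separate R processes.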
noSQL databases
Redis is an open source, advanced key-value store. It is often referred
to as a data structure server since keys can contain strings, hashes,
lists, sets and sorted sets.
Clients are available for C, C++, C#, Objective-C, Clojure, Common
Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala,
Smalltalk, Tcl
doRedis / workers
start redis worker:
$ echo "require('doRedis');redisWorker('jobs')" | R
The workers can be distributed over the internet
> startRedisWorkers(100)
# R
> library(doRedis)
> registerDoRedis("jobs")
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
4.864 0.000 4.865
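The worker model behind this setup is a shared job queue that any number of workers pull from. The sketch below imitates that with an in-process queue and threads from the Python standard library. It does not talk to Redis; worker and run_jobs are hypothetical names and squaring a number stands in for the real task.

```python
import queue
import threading

def worker(jobs, results):
    """Pull tasks from the shared queue until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            break
        index, value = item
        results.put((index, value * value))   # hypothetical task: square the input

def run_jobs(values, n_workers=4):
    """Distribute values over n_workers and return the results in input order."""
    values = list(values)
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in enumerate(values):
        jobs.put(item)
    for _ in threads:
        jobs.put(None)                         # one sentinel per worker
    for t in threads:
        t.join()
    collected = [results.get() for _ in values]
    return [value for _, value in sorted(collected)]

print(run_jobs(range(10)))
```

Because workers only need to reach the queue, they can join or leave while jobs are pending, which is what allows the loosely coupled setup described above.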
Using doSNOW and dyn.load with pgifortran:
foreach(i=1:2) %dopar% setwd("~/KURSE/R_cuda")
foreach(i=1:2) %dopar% dyn.load("")
foreach(i=1:4) %dopar%
Big Memory
Logical Setup of Node
without shared memory
Logical Setup of Node
with shared memory
Logical Setup of Node
with file-backed memory
Logical Setup of Node
with network-attached file-backed memory
● shared memory regions for several
processes in SMP
● file-backed arrays for several nodes over
network file systems
x <- as.big.matrix(matrix(runif(1000000), 1000, 1000))
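The idea of a file-backed array can be sketched with Python's standard mmap module: the data lives in a file, and every process that maps the file sees the same bytes. This is an analogy and not the bigmemory API; the file name, array size and element index are assumptions chosen for illustration.

```python
import mmap
import os
import struct
import tempfile

N = 1000                      # number of float64 values (hypothetical size)
ITEM = struct.calcsize("d")   # bytes per value

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "backing.bin")
with open(path, "wb") as f:
    f.truncate(N * ITEM)      # allocate a zero-filled backing file

# First mapping: write one element in place.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as view:
        struct.pack_into("d", view, 42 * ITEM, 3.14)
        view.flush()

# Second mapping (could be another process): read the element back.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as view:
        value = struct.unpack_from("d", view, 42 * ITEM)[0]

print(value)
os.remove(path)
os.rmdir(workdir)
```

Placing the backing file on a network file system extends the same scheme across nodes, at the cost of the much higher access times listed earlier.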

