Gpu workshop cluster universe: scripting cuda

Scripting CUDA
(using python, R and
MATLAB)
Ferdinand Jamitzky
jamitzky@lrz.de
http://goo.gl/nKD8FY

Why parallel programming?
End of the free lunch
Moore's law means
no longer faster
processors, only more
of them. But beware!
2 x 3 GHz < 6 GHz
(cache consistency,
multi-threading, etc)

The future is parallel
●Moore's law is still valid
●Number of transistors doubles every 2 years
●Clock speed saturates at 3 to 4 GHz
●multi-core processors vs many-core processors
●grid/cloud computing
●clusters
●GPGPUs
(intel 2005)

Supercomputer: SMP
SMP Machine:
shared memory
typically 10s of cores
threaded programs
bus interconnect
in R:
library(multicore)
and inlined code
Example: gvs1
128 GB RAM
16 cores
Example: uv2/3
3.359 GB RAM
2.080 cores

Supercomputer: MPI
Cluster of machines:
distributed memory
message passing interface
infiniband interconnect
in R:
library(Rmpi)
and inlined code
Example: linux MPP cluster
2752 GB RAM
2752 cores
Example: superMUC
340,000 GB RAM
155,656 Intel cores

Supercomputer: GPGPU
Graphics Card:
shared memory
CUDA or openCL
on chip interconnect
in R:
library(gputools)
and inlined code
Example: Tesla K20X
6 GB RAM
2688 Threads
Example: Titan ORNL
262.000 GB RAM
18,688 GPU Cards
50,233,344 Threads

The future is massively parallel
Connection Machine
CM-1 (1983)
12-D Hypercube
65536 1-bit cores
(AND, OR, NOT)
Rmax: 20 GFLOP/s

The future is massively parallel
JUGENE
Blue Gene/P (2007)
3-D Torus or Tree
65536 64-bit cores
(PowerPC 450)
Rmax: 222 TFLOP/s
now: 1 PFLOP/s
294912 cores

Levels of Parallelism
●Node Level (e.g. SuperMUC has approx. 10000 nodes)
each node has 2 sockets
●Socket Level
each socket contains 8 cores
●Core Level
each core has 16 vector registers
●Vector Level (e.g. lxgp1 GPGPU has 480 vector registers)
●Pipeline Level (how many simultaneous pipelines)
hyperthreading
●Instruction Level (instructions per cycle)
out of order execution, branch prediction

Problems: Access Times
Getting data from:
CPU register 1ns
L2 cache 10ns
memory 80 ns
network(IB) 200 ns
GPU(PCIe) 50.000 ns
harddisk 500.000 ns
Getting some food from:
fridge 10s
microwave 100s ~ 2min
pizza service 800s ~ 15min
city mall 2000s ~ 0.5h
mum sends cake 500.000 s~1 week
grown in own garden 5Ms ~ 2months

Amdahl's law
Computing time for N processors
T(N) = T(1)/N + Tserial + Tcomm * N
Acceleration factor:
T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2)
small N: T(1)/T(N) ~ N
large N: T(1)/T(N) ~ 1/N
saturation point!

Amdahl's law III
> plot(N,type="l")
> lines(N/(1+0.01*N),col="red")
> lines(N/(1+0.01*N+0.001*N**2),col="green")
> Tserial=0.01
> Tcomm=0.001

How are High-Performance Codes
constructed?
●“Traditional” Construction of High-Performance Codes:
oC/C++/Fortran
oLibraries
●“Alternative” Construction of High-Performance Codes:
oScripting for ‘brains’
oGPUs for ‘inner loops’
●Play to the strengths of each programming environment.

Hierarchical architecture of
hardware vs software
●accelerators (gpus, xeon phi)
●in-core vectorisation (avx)
●multicore nodes (qpi, pci bus)
●strongly coupled nodes (infiniband, 10GE)
●weakly coupled clusters (cloud)
●Cuda, intrinsics
●vectorisation pragmas
●openMP
●MPI
●workflow middleware

Why Scripting?
Do you:
●want to reuse CUDA code easily (e.g. as a library) ?
●want to dynamically determine whether CUDA is available?
●want to use multi-threading (painlessly)?
●want to use MPI (painlessly)?
●want to use loose coupling (grid computing)?
●want dynamic exception handling and fallbacks?
●want dynamic compilation of CUDA code?
If you answered "yes" to one of these questions, you
should consider a scripting language

Parallel Tools in python, R and MATLAB
SMP
multicore
parallelism
doMC, doSMP,
pnmath, BLAS
no max cores
multiprocessing
futures
MMP massive
parallel
processing
doSNOW,
doMPI, doRedis
parallel python,
mpi4py
GPGPU
CUDA
openCL
rgpu, gputools
pyCUDA,
pyOpenCL
parfor, spmd
max 8 cores
jobs, pmode gpuArray
R
python
MATLAB

Scripting CUDA
Compiler
CUDA
Interpreter
PGI Fortran NumbraPro pyCUDA rgpu MATLAB
python R

MATLAB GPU
# load matlab module and start command line version
module load cuda
module load matlab/R2011A
matlab -nodesktop

MATLAB gpuArray
●Copy data to GPGPU and return a handle on the object
●All operations on the handle are performed on the GPGPU
x=rand(100);
gx=gpuArray(x);
●how to compute the GFlop/s
tic;
M=gpuArray(rand(np*1000));
gather(sum(sum(M*M)));
2*np^3/toc

pyCUDA
Gives you the following advantages:
1.Combining Two Strong Tools
2.Scripting CUDA
3.Run-Time Code Generation
http://mathema.tician.de/software/pycuda
special thanks to a.klöckner

pyCUDA @ LRZ
log in to lxgp1
$ module load python
$ module load cuda
$ module load boost
$ python
Python 2.6.1 (r261:67515, Apr 17 2009, 17:25:25)
[GCC 4.1.2 20070115 (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>>

Simple Example
from numpy import *
import pycuda.autoinit
import pycuda.gpuarray as gpu
a_gpu =
gpu.to_gpu(random.randn(4,4).astype(ﬂoat32))
a_doubled = (2∗a_gpu).get()
print a_doubled
print a_gpu

gpuarray class
pycuda.gpuarray:
Meant to look and feel just like numpy.
●gpuarray.to gpu(numpy array)
●numpy array = gpuarray.get()
●+, -, ∗, /, fill, sin, exp, rand, basic indexing, norm, inner product
●Mixed types (int32 + float32 = float64)
●print gpuarray for debugging.
●Allows access to raw bits
●Use as kernel arguments, textures, etc.

gpuarray: Elementwise expressions
Avoiding extra store-fetch cycles for elementwise math:
from pycuda.curandom import rand as curand
a_gpu = curand((50,))
b_gpu = curand((50,))
from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
” float a, float ∗x, float b, float ∗y, float ∗z”,
”z[ i ] = a∗x[i ] + b∗y[i]”)
c_gpu = gpuarray.empty_like (a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
assert la.norm((c_gpu − (5∗a_gpu+6∗b_gpu)).get()) < 1e−5

gpuarray: Reduction made easy
Example: A scalar product calculation
from pycuda.reduction import ReductionKernel
dot = ReductionKernel(dtype_out=numpy.float32, neutral=”0”,
reduce_expr=”a+b”, map_expr=”x[i]∗y[i]”,
arguments=”const float ∗x, const float ∗y”)
from pycuda.curandom import rand as curand
x = curand((1000∗1000), dtype=numpy.float32)
y = curand((1000∗1000), dtype=numpy.float32)
x_dot_y = dot(x,y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get ())

CUDA Kernels in pyCUDA
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{ const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1)
print dest-a*b

Completeness
PyCUDA exposes all of CUDA.
For example:
●Arrays and Textures
●Pagelocked host memory
●Memory transfers (asynchronous, structured)
●Streams and Events
●Device queries
●GL Interop
And furthermore:
●Allow interactive use
●Integrate tightly with numpy

pyCUDA showcase
http://wiki.tiker.net/PyCuda/ShowCase
●Agent-based Models
●Computational Visual Neuroscience
●Discontinuous Galerkin Finite Element PDE Solvers
●Estimating the Entropy of Natural Scenes
●Facial Image Database Search
●Filtered Backprojection for Radar Imaging
●LINGO Chemical Similarities
●Recurrence Diagrams
●Sailfish: Lattice Boltzmann Fluid Dynamics
●Selective Embedded Just In Time Specialization
●Simulation of spiking neural networks

NumbraPro
Generate CUDA Kernels using a Just-in-time compiler
from numbapro import cuda
@cuda.jit('void(float32[:], float32[:], float32[:])')
def sum(a, b, result):
i = cuda.grid(1) # equals to threadIdx.x + blockIdx.x *
blockDim.x
result[i] = a[i] + b[i]
# Invoke like: sum[grid_dim, block_dim](big_input_1, big_input_2,
result_array)

The Language R
http://www.r-project.org/

R in a nutshell
module load cuda/2.3
module load R/serial/2.13
> x=1:10
> y=x**2
> str(y)
> print(x)
> times2 = function(x) 2*x
graphics!
> plot(x,y)
= and <- are interchangable

rgpu
a set of functions for loading data toa gpu and manipulating the
data there:
●exportgpu(x)
●evalgpu(x+y)
●lsgpu()
●rmgpu("x")
●sumgpu(x), meangpu(x), gemmgpu(a,b)
●cos, sin,.., +, -, *, /, **, %*%

Example
load the correct R module
$ module load R/serial/2.13
start R
$ R
R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
load rgpu library
> library(rgpu)
> help(package="rgpu")
> rgpudetails()

Data on the GPGPU
one million random uniform numbers
> x=runif(10000000)
send data to gpu
> exportgpu(x)
do some calculations
> evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x)))
do some timing comparisons (GPU vs CPU):
> system.time(evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x))))
> system.time(sum(sin(x)+cos(x)+tan(x)+exp(x)))

real world examples: gputools
gputools is a package of precompiled CUDA functions for
statistics, linear algebra and machine learning
●chooseGpu
●getGpuId()
●gpuCor, gpuAucEstimate
●gpuDist, gpuDistClust, gpuHclust, gpuFastICA
●gpuGlm, gpuLm
●gpuGranger, gpuMi
●gpuMatMult, gpuQr, gpuSvd, gpuSolve
●gpuLsfit
●gpuSvmPredict, gpuSvmTrain
●gpuTtest

Example: Matrix Inversion
np <- 2000
x <- matrix(runif(np**2), np,np)
system.time(gpuSolve(x))
system.time(solve(x))

Example: Hierarchical Clustering
numVectors <- 5
dimension <- 10
Vectors <- matrix(runif(numVectors*dimension), numVectors,
dimension)
distMat <- gpuDist(Vectors, "euclidean")
myClust <- gpuHclust(distMat, "single")
plot(myClust)
for other examples try:
example(hclust)

Fortran 90 Example
program myprog
! simulate harmonic oscillator
integer, parameter :: np=1000, nstep=1000
real :: x(np), v(np), dx(np), dv(np), dt=0.01
integer :: i,j
forall(i=1:np) x(i)=i
forall(i=1:np) v(i)=i
do j=1,nstep
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
end do
print*, " total energy: ",sum(x**2+v**2)
end program

PGI Compiler
log in to lxgp1
$ module load fortran/pgi/11.8
$ pgf90 -o myprog.exe myprog.f90
$ time ./myprog.exe
exercise for you:
●compute MFlop/s (Floating Point Operations: 4 * np * nstep)
●optimize (hint: -Minfo, -fast, -O3)

Fortran 90 Example
program myprog
integer, parameter :: np=1000, nstep=1000
real :: x(np), v(np), dx(np), dv(np), dt=0.01
integer :: i,j
forall(i=1:np) x(i)=i
forall(i=1:np) v(i)=i
do j=1,nstep
!$acc region
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
!$acc end region
end do
print*, " total energy: ",sum(x**2+v**2)
end program

PGI Compiler accelerator
module load fortran/pgi
pgf90 -ta=nvidia -o myprog.exe myprog.f90
time ./myprog.exe
exercise for you:
●compute MFlop/s (Floating Point Operations: 4 * np * nstep)
●optimize (hint: change acc region)

Use R as scripting language
R can dynamically load shared objects:
dyn.load("lib.so")
these functions can then be called via
.C("fname", args)
.Fortran("fname", args)

R subroutine
subroutine mysub_cuda(x,v,nstep)
integer, parameter :: np=1000000
real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
integer :: i,j, nstep
forall(i=1:np) x(i)=real(i)/np
forall(i=1:np) v(i)=real(i)/np
do j=1,nstep
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
end do
return
end subroutine

Compile two versions
don't forget to load the modules!
module unload ccomp fortran
module load ccomp/pgi/11.8
module load fortran/pgi/11.8
module load R/serial/2.13
pgf90 -shared -fPIC -o mysub_host.so
mysub_host.f90
pgf90 -ta=nvidia -shared -fPIC -o
mysub_cuda.so mysub_cuda.f90

Load and run
Load dynamic libraries
> dyn.load("mysub_host.so"), dyn.load("mysub_cuda.so"); np=1000000
Benchmark
> system.time(str(.Fortran("mysub_host",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
total energy: 666667.6633012500
total energy: 667334.6641391169
List of 3
$ x : num [1:1000000] -3.01e-07 -6.03e-07 -9.04e-07 -1.21e-06 -1.51e-06 ...
$ v : num [1:1000000] 1.38e-06 2.76e-06 4.15e-06 5.53e-06 6.91e-06 ...
$ nstep: int 1000
user system elapsed
26.901 0.000 26.900
> system.time(str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
total energy: 666667.6633012500
total energy: 667334.6641391169
List of 3
$ x : num [1:1000000] -3.01e-07 -6.03e-07 -9.04e-07 -1.21e-06 -1.51e-06 ...
$ v : num [1:1000000] 1.38e-06 2.76e-06 4.15e-06 5.53e-06 6.91e-06 ...
$ nstep: int 1000
user system elapsed
0.829 0.000 0.830
Acceleration Factor:
> 26.9/0.83
[1] 32.40964

Matrix Multipl. in FORTRAN
subroutine mmult(a,b,c,np)
integer np
real*8 a(np,np), b(np,np), c(np,np)
integer i,j, k
do k=1, np
forall(i=1:np, j=1:np) a(i,j) =
a(i,j) + b(i,k)*c(k,j)
end do
return
end subroutine

Call FORTRAN from R
# compile f90 to shared object library
system("pgf90 -shared -fPIC -o mmult.so
mmult.f90");
# dynamically load library
dyn.load("mmult.so")
# define multiplication function
mmult.f <- function(a,b,c)
.Fortran("mmult",a=a,b=b,c=c,
np=as.integer(dim(a)[1]))

Call FORTRAN binary
np=100
system.time(
mmult.f(
a = matrix(numeric(np*np),np,np),
b = matrix(numeric(np*np)+1.,np,np),
c = matrix(numeric(np*np)+1.,np,np)
)
)
Exercise: make a plot system-time vs matrix-dimension

PGI accelerator directives
subroutine mmult(a,b,c,np)
integer np
real*8 a(np,np), b(np,np), c(np,np)
integer i,j, k
do k=1, np
!$acc region
forall(i=1:np, j=1:np) a(i,j) = a(i,j)
+ b(i,k)*c(k,j)
!$acc end region
end do
return
end subroutine

Call FORTRAN from R
# compile f90 to shared object library
system("pgf90 -ta=nvidia -shared -fPIC -o
mmult.so mmult.f90");
# dynamically load library
dyn.load("mmult.so")
# define multiplication function
mmult.f <- function(a,b,c)
.Fortran("mmult",a=a,b=b,c=c,
np=as.integer(dim(a)[1]))

Compute MFlop/s
print(paste(2.*2.*np**3/1000000./system.time(
str(mmult.f(...))
)[[3]]," MFlop/s"))
Exercise: Compare MFlop/s vs dimension for serial and
accelerated code

Scripting Parallel Execution
implicit
R
explicite
jit pnmath doSNOWdoMPIdoMC doRedis
hierarchical parallelisation:
- accelerator: rgpu, pnmath, MKL
- intra-node: jit, doMC, MKL
- intra-cluster: SNOW, MPI, pbdMPI
- inter-cluster: Redis, SNOW
MKLrgpu

foreach package
# new R foreach
library(foreach)
alist <-
foreach (i=1:N) %do%
call(i)
foreach is a function
# old R code
alist=list()
for(i in 1:N)
alist[i]<-call(i)
for is a language
keyword

multithreading with R
library(foreach)
foreach(i=1:N) %do%
{
mmult.f()
}
# serial execution
library(foreach)
library(doMC)
registerDoMC()
foreach(i=1:N)
%dopar%
{
mmult.f()
}
# thread execution

MPI with R
library(foreach)
foreach(i=1:N) %do%
{
mmult.f()
}
# serial execution
library(foreach)
library(doSNOW)
registerDoSNOW()
foreach(i=1:N)
%dopar%
{
mmult.f()
}
# MPI execution

doSNOW
# R
> library(doSNOW)
> cl <- makeSOCKcluster(4)
> registerDoSNOW(cl)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
4.864 0.000 4.865

doMC
# R
> library(doMC)
> registerDoMC(cores=4)
user system elapsed
9.352 2.652 12.002
user system elapsed
7.228 7.216 3.296

noSQL databases
Redis is an open source, advanced key-value store. It is often referred
to as a data structure server since keys can contain strings, hashes,
lists, sets and sorted sets.
http://www.redis.io
Clients are available for C, C++, C#, Objective-C, Clojure, Common
Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R ruby, scala,
smalltalk, tcl

doRedis / workers
start redis worker:
> echo "require('doRedis');redisWorker('jobs')" | R
The workers can be distributed over the internet
> startRedisWorkers(100)

doRedis
# R
> library(doRedis)
> registerDoRedis("jobs")
user system elapsed
15.377 0.928 16.303
user system elapsed
4.864 0.000 4.865

MPI-CUDA with R
Using doSNOW and dyn.load with pgifortran:
library(doSNOW)
cl=makeCluster(c("gvs1","gvs2"),type="SOCK")
registerDoSNOW(cl)
foreach(i=1:2) %dopar% setwd("~/KURSE/R_cuda")
foreach(i=1:2) %dopar% dyn.load("mysub_cuda.so")
system.time(
foreach(i=1:4) %dopar%
str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np)
,
nstep=as.integer(1000))))

Disk
Big Memory
R R
MEM MEM
Logical Setup of Node
without shared memory
R R
MEM
with shared memory
DiskDisk
R R
MEM
with file-backed memory
R R
MEM
with network attached file-
backed memory
Network Network Network

library(bigmemory)
● shared memory regions for several
processes in SMP
● file backed arrays for several node over
network file systems
library(bigmemory)
x <- as.big.matrix(matrix(runif(1000000), 1000, 1000)))
sum(x[1,1:1000])

Gpu workshop cluster universe: scripting cuda

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

Similar to Gpu workshop cluster universe: scripting cuda

Similar to Gpu workshop cluster universe: scripting cuda (20)

Recently uploaded

Recently uploaded (20)

Gpu workshop cluster universe: scripting cuda