Scripting CUDA
(using python, R and
MATLAB)
Ferdinand Jamitzky
jamitzky@lrz.de
http://goo.gl/nKD8FY
Why parallel programming?
End of the free lunch
Moore's law no longer means
faster processors, only more
of them. But beware!
2 x 3 GHz < 6 GHz
(cache consistency,
multi-threading, etc)
The future is parallel
●Moore's law is still valid
●Number of transistors doubles every 2 years
●Clock speed saturates at 3 to 4 GHz
●multi-core processors vs many-core processors
●grid/cloud computing
●clusters
●GPGPUs
(intel 2005)
Supercomputer scaling
Supercomputer: SMP
SMP Machine:
shared memory
typically 10s of cores
threaded programs
bus interconnect
in R:
library(multicore)
and inlined code
Example: gvs1
128 GB RAM
16 cores
Example: uv2/3
3,359 GB RAM
2,080 cores
Supercomputer: MPI
Cluster of machines:
distributed memory
typically 100s of cores
message passing interface
infiniband interconnect
in R:
library(Rmpi)
and inlined code
Example: linux MPP cluster
2752 GB RAM
2752 cores
Example: superMUC
340,000 GB RAM
155,656 Intel cores
Supercomputer: GPGPU
Graphics Card:
shared memory
typically 1000s of cores
CUDA or openCL
on chip interconnect
in R:
library(gputools)
and inlined code
Example: Tesla K20X
6 GB RAM
2688 Threads
Example: Titan ORNL
262,000 GB RAM
18,688 GPU Cards
50,233,344 Threads
The future is massively parallel
Connection Machine
CM-1 (1983)
12-D Hypercube
65536 1-bit cores
(AND, OR, NOT)
Rmax: 20 GFLOP/s
The future is massively parallel
JUGENE
Blue Gene/P (2007)
3-D Torus or Tree
65536 64-bit cores
(PowerPC 450)
Rmax: 222 TFLOP/s
now: 1 PFLOP/s
294912 cores
Levels of Parallelism
●Node Level (e.g. SuperMUC has approx. 10000 nodes)
each node has 2 sockets
●Socket Level
each socket contains 8 cores
●Core Level
each core has 16 vector registers
●Vector Level (e.g. lxgp1 GPGPU has 480 vector registers)
●Pipeline Level (how many simultaneous pipelines)
hyperthreading
●Instruction Level (instructions per cycle)
out of order execution, branch prediction
Problems: Access Times
Getting data from:              Getting some food from:
CPU register       1 ns         fridge                   10 s
L2 cache          10 ns         microwave               100 s  ~ 2 min
memory            80 ns         pizza service           800 s  ~ 15 min
network (IB)     200 ns         city mall             2,000 s  ~ 0.5 h
GPU (PCIe)    50,000 ns         mum sends cake      500,000 s  ~ 1 week
harddisk     500,000 ns         grown in own garden    5 Ms    ~ 2 months
Amdahl's law
Computing time for N processors
T(N) = T(1)/N + Tserial + Tcomm * N
Acceleration factor:
T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2)
small N: T(1)/T(N) ~ N
large N: T(1)/T(N) ~ 1/N
saturation point!
Amdahl's law III
> Tserial=0.01
> Tcomm=0.001
> N=1:100
> plot(N,type="l")
> lines(N/(1+Tserial*N),col="red")
> lines(N/(1+Tserial*N+Tcomm*N**2),col="green")
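The saturation point can be located analytically: the maximum of N/(1 + Tserial*N + Tcomm*N^2) (times measured relative to T(1)) lies at N = sqrt(1/Tcomm). A quick numerical check with the same parameter values:
> S = N/(1+Tserial*N+Tcomm*N**2)
> N[which.max(S)]   # numerical position of the maximum
> sqrt(1/Tcomm)     # analytical saturation point, about 31.6 processors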
How are High-Performance Codes
constructed?
●“Traditional” Construction of High-Performance Codes:
oC/C++/Fortran
oLibraries
●“Alternative” Construction of High-Performance Codes:
oScripting for ‘brains’
oGPUs for ‘inner loops’
●Play to the strengths of each programming environment.
Hierarchical architecture of hardware vs software
●accelerators (gpus, xeon phi): Cuda, intrinsics
●in-core vectorisation (avx): vectorisation pragmas
●multicore nodes (qpi, pci bus): openMP
●strongly coupled nodes (infiniband, 10GE): MPI
●weakly coupled clusters (cloud): workflow middleware
Why Scripting?
Do you:
●want to reuse CUDA code easily (e.g. as a library) ?
●want to dynamically determine whether CUDA is available?
●want to use multi-threading (painlessly)?
●want to use MPI (painlessly)?
●want to use loose coupling (grid computing)?
●want dynamic exception handling and fallbacks?
●want dynamic compilation of CUDA code?
If you answered "yes" to one of these questions, you
should consider a scripting language
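As an illustration of the fallback point, a minimal R sketch; the wrapper function and the fallback to plain solve() are illustrative assumptions and not part of the original slides, while gpuSolve comes from the gputools package used later in this deck:
# hypothetical wrapper: use the GPU if gputools loads, otherwise fall back to the CPU
inv <- function(x) {
  if (require("gputools", quietly=TRUE)) {
    tryCatch(gpuSolve(x),                  # GPU matrix inversion
             error=function(e) solve(x))   # fall back on any CUDA error
  } else {
    solve(x)                               # CPU fallback
  }
}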
Parallel Tools in python, R and MATLAB
SMP (multicore parallelism):
  R: doMC, doSMP, pnmath, BLAS (no max cores)
  python: multiprocessing, futures
  MATLAB: parfor, spmd (max 8 cores)
MPP (massively parallel processing):
  R: doSNOW, doMPI, doRedis
  python: parallel python, mpi4py
  MATLAB: jobs, pmode
GPGPU (CUDA, openCL):
  R: rgpu, gputools
  python: pyCUDA, pyOpenCL
  MATLAB: gpuArray
Scripting CUDA
From compiler to interpreter: PGI Fortran, NumbaPro, pyCUDA, rgpu, MATLAB gpuArray
(covering Fortran, python, R and MATLAB)
MATLAB GPU
# load matlab module and start command line version
module load cuda
module load matlab/R2011A
matlab -nodesktop
MATLAB gpuArray
●Copy data to GPGPU and return a handle on the object
●All operations on the handle are performed on the GPGPU
x=rand(100);
gx=gpuArray(x);
●how to compute the GFlop/s
tic;
M=gpuArray(rand(np*1000));
gather(sum(sum(M*M)));
2*np^3/toc
pyCUDA
Gives you the following advantages:
1.Combining Two Strong Tools
2.Scripting CUDA
3.Run-Time Code Generation
http://mathema.tician.de/software/pycuda
special thanks to a.klöckner
pyCUDA @ LRZ
log in to lxgp1
$ module load python
$ module load cuda
$ module load boost
$ python
Python 2.6.1 (r261:67515, Apr 17 2009, 17:25:25)
[GCC 4.1.2 20070115 (SUSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more
information.
>>>
Simple Example
from numpy import *
import pycuda.autoinit
import pycuda.gpuarray as gpu
a_gpu = gpu.to_gpu(random.randn(4,4).astype(float32))
a_doubled = (2*a_gpu).get()
print a_doubled
print a_gpu
gpuarray class
pycuda.gpuarray:
Meant to look and feel just like numpy.
●gpuarray.to_gpu(numpy_array)
●numpy_array = gpuarray.get()
●+, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product
●Mixed types (int32 + float32 = float64)
●print gpuarray for debugging.
●Allows access to raw bits
●Use as kernel arguments, textures, etc.
gpuarray: Elementwise expressions
Avoiding extra store-fetch cycles for elementwise math:
import numpy.linalg as la
import pycuda.gpuarray as gpuarray
from pycuda.curandom import rand as curand
from pycuda.elementwise import ElementwiseKernel
a_gpu = curand((50,))
b_gpu = curand((50,))
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")
c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
gpuarray: Reduction made easy
Example: A scalar product calculation
import numpy
from pycuda.reduction import ReductionKernel
dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
    reduce_expr="a+b", map_expr="x[i]*y[i]",
    arguments="const float *x, const float *y")
from pycuda.curandom import rand as curand
x = curand((1000*1000), dtype=numpy.float32)
y = curand((1000*1000), dtype=numpy.float32)
x_dot_y = dot(x,y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get())
CUDA Kernels in pyCUDA
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{ const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}""")
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1))
print dest-a*b
Completeness
PyCUDA exposes all of CUDA.
For example:
●Arrays and Textures
●Pagelocked host memory
●Memory transfers (asynchronous, structured)
●Streams and Events
●Device queries
●GL Interop
And furthermore:
●Allow interactive use
●Integrate tightly with numpy
pyCUDA showcase
http://wiki.tiker.net/PyCuda/ShowCase
●Agent-based Models
●Computational Visual Neuroscience
●Discontinuous Galerkin Finite Element PDE Solvers
●Estimating the Entropy of Natural Scenes
●Facial Image Database Search
●Filtered Backprojection for Radar Imaging
●LINGO Chemical Similarities
●Recurrence Diagrams
●Sailfish: Lattice Boltzmann Fluid Dynamics
●Selective Embedded Just In Time Specialization
●Simulation of spiking neural networks
NumbaPro
Generate CUDA Kernels using a Just-in-time compiler
from numbapro import cuda
@cuda.jit('void(float32[:], float32[:], float32[:])')
def sum(a, b, result):
    i = cuda.grid(1)  # equals threadIdx.x + blockIdx.x * blockDim.x
    result[i] = a[i] + b[i]
# Invoke like: sum[grid_dim, block_dim](big_input_1, big_input_2, result_array)
The Language R
http://www.r-project.org/
R in a nutshell
module load cuda/2.3
module load R/serial/2.13
> x=1:10
> y=x**2
> str(y)
> print(x)
> times2 = function(x) 2*x
graphics!
> plot(x,y)
= and <- are interchangeable
rgpu
a set of functions for loading data to a gpu and manipulating the data there:
●exportgpu(x)
●evalgpu(x+y)
●lsgpu()
●rmgpu("x")
●sumgpu(x), meangpu(x), gemmgpu(a,b)
●cos, sin,.., +, -, *, /, **, %*%
Example
load the correct R module
$ module load R/serial/2.13
start R
$ R
R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
load rgpu library
> library(rgpu)
> help(package="rgpu")
> rgpudetails()
Data on the GPGPU
ten million random uniform numbers
> x=runif(10000000)
send data to gpu
> exportgpu(x)
do some calculations
> evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x)))
do some timing comparisons (GPU vs CPU):
> system.time(evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x))))
> system.time(sum(sin(x)+cos(x)+tan(x)+exp(x)))
real world examples: gputools
gputools is a package of precompiled CUDA functions for
statistics, linear algebra and machine learning
●chooseGpu
●getGpuId()
●gpuCor, gpuAucEstimate
●gpuDist, gpuDistClust, gpuHclust, gpuFastICA
●gpuGlm, gpuLm
●gpuGranger, gpuMi
●gpuMatMult, gpuQr, gpuSvd, gpuSolve
●gpuLsfit
●gpuSvmPredict, gpuSvmTrain
●gpuTtest
Example: Matrix Inversion
np <- 2000
x <- matrix(runif(np**2), np,np)
system.time(gpuSolve(x))
system.time(solve(x))
Example: Hierarchical Clustering
numVectors <- 5
dimension <- 10
Vectors <- matrix(runif(numVectors*dimension), numVectors,
dimension)
distMat <- gpuDist(Vectors, "euclidean")
myClust <- gpuHclust(distMat, "single")
plot(myClust)
for other examples try:
example(hclust)
Fortran 90 Example
program myprog
! simulate harmonic oscillator
integer, parameter :: np=1000, nstep=1000
real :: x(np), v(np), dx(np), dv(np), dt=0.01
integer :: i,j
forall(i=1:np) x(i)=i
forall(i=1:np) v(i)=i
do j=1,nstep
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
end do
print*, " total energy: ",sum(x**2+v**2)
end program
PGI Compiler
log in to lxgp1
$ module load fortran/pgi/11.8
$ pgf90 -o myprog.exe myprog.f90
$ time ./myprog.exe
exercise for you:
●compute MFlop/s (Floating Point Operations: 4 * np * nstep)
●optimize (hint: -Minfo, -fast, -O3)
Fortran 90 Example
program myprog
! simulate harmonic oscillator
integer, parameter :: np=1000, nstep=1000
real :: x(np), v(np), dx(np), dv(np), dt=0.01
integer :: i,j
forall(i=1:np) x(i)=i
forall(i=1:np) v(i)=i
do j=1,nstep
!$acc region
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
!$acc end region
end do
print*, " total energy: ",sum(x**2+v**2)
end program
PGI Compiler accelerator
module load fortran/pgi
pgf90 -ta=nvidia -o myprog.exe myprog.f90
time ./myprog.exe
exercise for you:
●compute MFlop/s (Floating Point Operations: 4 * np * nstep)
●optimize (hint: change acc region)
Use R as scripting language
R can dynamically load shared objects:
dyn.load("lib.so")
these functions can then be called via
.C("fname", args)
.Fortran("fname", args)
R subroutine
subroutine mysub_cuda(x,v,nstep)
! simulate harmonic oscillator
integer, parameter :: np=1000000
real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
integer :: i,j, nstep
forall(i=1:np) x(i)=real(i)/np
forall(i=1:np) v(i)=real(i)/np
do j=1,nstep
dx=v*dt; dv=-x*dt
x=x+dx; v=v+dv
end do
return
end subroutine
Compile two versions
don't forget to load the modules!
module unload ccomp fortran
module load ccomp/pgi/11.8
module load fortran/pgi/11.8
module load R/serial/2.13
pgf90 -shared -fPIC -o mysub_host.so mysub_host.f90
pgf90 -ta=nvidia -shared -fPIC -o mysub_cuda.so mysub_cuda.f90
Load and run
Load dynamic libraries
> dyn.load("mysub_host.so"); dyn.load("mysub_cuda.so"); np=1000000
Benchmark
> system.time(str(.Fortran("mysub_host",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
total energy: 666667.6633012500
total energy: 667334.6641391169
List of 3
$ x : num [1:1000000] -3.01e-07 -6.03e-07 -9.04e-07 -1.21e-06 -1.51e-06 ...
$ v : num [1:1000000] 1.38e-06 2.76e-06 4.15e-06 5.53e-06 6.91e-06 ...
$ nstep: int 1000
user system elapsed
26.901 0.000 26.900
> system.time(str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np),nstep=as.integer(1000))))
total energy: 666667.6633012500
total energy: 667334.6641391169
List of 3
$ x : num [1:1000000] -3.01e-07 -6.03e-07 -9.04e-07 -1.21e-06 -1.51e-06 ...
$ v : num [1:1000000] 1.38e-06 2.76e-06 4.15e-06 5.53e-06 6.91e-06 ...
$ nstep: int 1000
user system elapsed
0.829 0.000 0.830
Acceleration Factor:
> 26.9/0.83
[1] 32.40964
Matrix Multipl. in FORTRAN
subroutine mmult(a,b,c,np)
integer np
real*8 a(np,np), b(np,np), c(np,np)
integer i,j, k
do k=1, np
forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
end do
return
end subroutine
Call FORTRAN from R
# compile f90 to shared object library
system("pgf90 -shared -fPIC -o mmult.so
mmult.f90");
# dynamically load library
dyn.load("mmult.so")
# define multiplication function
mmult.f <- function(a,b,c)
.Fortran("mmult",a=a,b=b,c=c,
np=as.integer(dim(a)[1]))
Call FORTRAN binary
np=100
system.time(
mmult.f(
a = matrix(numeric(np*np),np,np),
b = matrix(numeric(np*np)+1.,np,np),
c = matrix(numeric(np*np)+1.,np,np)
)
)
Exercise: make a plot system-time vs matrix-dimension
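One possible solution sketch for this exercise (the range of dimensions is an arbitrary choice):
# time mmult.f for a range of matrix dimensions and plot the result
dims  <- seq(100, 800, by=100)
times <- sapply(dims, function(np) system.time(
  mmult.f(a = matrix(numeric(np*np),np,np),
          b = matrix(numeric(np*np)+1.,np,np),
          c = matrix(numeric(np*np)+1.,np,np)))[["elapsed"]])
plot(dims, times, type="b", xlab="matrix dimension", ylab="elapsed time [s]")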
PGI accelerator directives
subroutine mmult(a,b,c,np)
integer np
real*8 a(np,np), b(np,np), c(np,np)
integer i,j, k
do k=1, np
!$acc region
forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
!$acc end region
end do
return
end subroutine
Call FORTRAN from R
# compile f90 to shared object library
system("pgf90 -ta=nvidia -shared -fPIC -o
mmult.so mmult.f90");
# dynamically load library
dyn.load("mmult.so")
# define multiplication function
mmult.f <- function(a,b,c)
.Fortran("mmult",a=a,b=b,c=c,
np=as.integer(dim(a)[1]))
Compute MFlop/s
print(paste(2.*2.*np**3/1000000./system.time(
str(mmult.f(...))
)[[3]]," MFlop/s"))
Exercise: Compare MFlop/s vs dimension for serial and
accelerated code
Scripting Parallel Execution
from implicit to explicit parallelism in R:
implicit: MKL, rgpu, pnmath, jit
explicit: doMC, doSNOW, doMPI, doRedis
hierarchical parallelisation:
- accelerator: rgpu, pnmath, MKL
- intra-node: jit, doMC, MKL
- intra-cluster: SNOW, MPI, pbdMPI
- inter-cluster: Redis, SNOW
(a sketch combining two of these levels follows below)
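A sketch combining two of these levels, intra-node threading (doMC) with the accelerator (gputools); whether several R workers can share one GPU efficiently depends on the card and driver, so treat this as an illustration, not a recipe:
library(foreach); library(doMC); library(gputools)
registerDoMC(cores=2)
np <- 1000
m  <- matrix(runif(np*np), np, np)
# each worker runs its matrix product on the GPU; the results come back as a list
res <- foreach(i=1:4) %dopar% gpuMatMult(m, m)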
foreach package
# new R foreach
library(foreach)
alist <-
foreach (i=1:N) %do%
call(i)
foreach is a function
# old R code
alist=list()
for(i in 1:N)
alist[i]<-call(i)
for is a language
keyword
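Because foreach is a function, its behaviour can be adjusted through arguments; for example the .combine argument of the foreach package merges the per-iteration results into a vector instead of a list (the loop body here is just an illustration):
library(foreach)
# collect the squares 1, 4, 9, ... into a numeric vector
squares <- foreach(i=1:10, .combine=c) %do% i**2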
multithreading with R
library(foreach)
foreach(i=1:N) %do%
{
mmult.f()
}
# serial execution
library(foreach)
library(doMC)
registerDoMC()
foreach(i=1:N) %dopar%
{
mmult.f()
}
# thread execution
MPI with R
library(foreach)
foreach(i=1:N) %do%
{
mmult.f()
}
# serial execution
library(foreach)
library(doSNOW)
cl <- makeCluster(4, type="MPI")
registerDoSNOW(cl)
foreach(i=1:N) %dopar%
{
mmult.f()
}
# MPI execution
doSNOW
# R
> library(doSNOW)
> cl <- makeSOCKcluster(4)
> registerDoSNOW(cl)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
4.864 0.000 4.865
doMC
# R
> library(doMC)
> registerDoMC(cores=4)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
9.352 2.652 12.002
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
7.228 7.216 3.296
noSQL databases
Redis is an open source, advanced key-value store. It is often referred
to as a data structure server since keys can contain strings, hashes,
lists, sets and sorted sets.
http://www.redis.io
Clients are available for C, C++, C#, Objective-C, Clojure, Common
Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala,
Smalltalk and Tcl.
doRedis / workers
start redis worker:
$ echo "require('doRedis');redisWorker('jobs')" | R
The workers can be distributed over the internet
> startRedisWorkers(100)
doRedis
# R
> library(doRedis)
> registerDoRedis("jobs")
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
user system elapsed
15.377 0.928 16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
user system elapsed
4.864 0.000 4.865
MPI-CUDA with R
Using doSNOW and dyn.load with pgifortran:
library(doSNOW)
cl=makeCluster(c("gvs1","gvs2"),type="SOCK")
registerDoSNOW(cl)
foreach(i=1:2) %dopar% setwd("~/KURSE/R_cuda")
foreach(i=1:2) %dopar% dyn.load("mysub_cuda.so")
system.time(
foreach(i=1:4) %dopar%
str(.Fortran("mysub_cuda", x=numeric(np), v=numeric(np),
nstep=as.integer(1000))))
Big Memory
Logical setup of a node (schematic):
- without shared memory: each R process has its own MEM
- with shared memory: several R processes share one MEM region
- with file-backed memory: the shared MEM region is backed by a file on disk
- with network-attached file-backed memory: the backing file lives on a
  network file system, so R processes on different nodes can share it
library(bigmemory)
● shared memory regions for several processes in SMP
● file-backed arrays for several nodes over network file systems
library(bigmemory)
x <- as.big.matrix(matrix(runif(1000000), 1000, 1000))
sum(x[1,1:1000])
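A sketch of the file-backed variant mentioned above (file names and the path are placeholders); a second R process, possibly on another node that sees the same file system, can attach the matrix via its descriptor file:
library(bigmemory)
# process 1: create a file-backed matrix on a shared file system
x <- filebacked.big.matrix(1000, 1000, type="double",
        backingfile="x.bin", backingpath=".",
        descriptorfile="x.desc")
x[,] <- matrix(runif(1000*1000), 1000, 1000)
# process 2 (possibly on another node): attach and use the same data
y <- attach.big.matrix("x.desc")
sum(y[1, 1:1000])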