7. Outline
1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI
9. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
Why do Scripting for GPUs?
GPUs are everything that scripting
languages are not.
Highly parallel
Very architecture-sensitive
Built for maximum
compute/memory throughput
→ complement each other
CPU: largely restricted to control
tasks (∼1000/sec)
Scripting fast enough
Realize a promise: Use Scripting. . .
from first prototype
to full-scale production code.
slide by Andreas Klöckner (NYU)
12. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive
Why do Scripting for GPUs?
GPUs are everything that scripting
languages are not.
Highly parallel
Very architecture-sensitive
Built for maximum FP/memory
throughput
→ complement each other
CPU: largely restricted to control
tasks (∼1000/sec)
Scripting fast enough
Python + CUDA = PyCUDA
Python + OpenCL = PyOpenCL
slide by Andreas Klöckner (NYU)
13. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive
How are High-Performance Codes constructed?
“Traditional” Construction of
High-Performance Codes:
C/C++/Fortran
Libraries
“Alternative” Construction of
High-Performance Codes:
Scripting for ‘brains’
GPUs for ‘inner loops’
Play to the strengths of each
programming environment.
slide by Andreas Klöckner (NYU)
14. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive
Scripting: Python
One example of a scripting language: Python
Mature
Large and active community
Emphasizes readability
Written in widely-portable C
A ‘multi-paradigm’ language
Rich ecosystem of sci-comp related
software
slide by Andreas Klöckner (NYU)
15. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
Scripting Languages
Python:
is discoverable and interactive.
has comprehensive built-in functionality.
manages resources automatically.
uses run-time typing.
works well for “gluing” lower-level blocks together.
slide by Andreas Klöckner (NYU)
16. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
Scripting: Goals
Scripting languages aim to reduce the load on the programmer:
Reduce required knowledge
Encourage experimentation
Eliminate sources of error
Encourage abstraction wherever possible
Value programmer time over computer time
Think about the tools you use.
Use the right tool for the job.
slide by Andreas Klöckner (NYU)
19. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
Scripting: Speed
Usual answer to the “Speed
Question”:
Hybrid (“mixed”) Code.
Plays to the strengths of each
language.
But: Introduces (some)
complexity.
Observation: GPU code is already hybrid.
Consequence: No added complexity through hybrid code.
slide by Andreas Klöckner (NYU)
20. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive
Whetting your appetite
import pycuda.driver as cuda
import pycuda.autoinit, pycuda.compiler
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
[This is examples/demo.py in the PyCUDA distribution.]
slide by Andreas Klöckner (NYU)
22. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive
Whetting your appetite
mod = pycuda.compiler.SourceModule("""
// Compute kernel
__global__ void twice(float *a)
{
  int idx = threadIdx.x + threadIdx.y*4;
  a[idx] *= 2;
}
""")

func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a
slide by Andreas Klöckner (NYU)
23. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
Whetting your appetite, Part II
Did somebody say “Abstraction is good”?
slide by Andreas Klöckner (NYU)
24. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
Whetting your appetite, Part II
import numpy
import pycuda.autoinit
from pycuda import gpuarray

a_cpu = numpy.random.randn(4,4).astype(numpy.float32)
b_cpu = numpy.random.randn(4,4).astype(numpy.float32)
c_cpu = a_cpu * b_cpu

a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
c_gpu = (a_gpu * b_gpu).get()

print c_cpu - c_gpu
slide by Andreas Klöckner (NYU)
25. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
Remember me?
// trivia
#include <stdio.h>

#define CUDA_CHK(NAME, ARGS) { \
  cudaError_t cuda_err_code = NAME ARGS; \
  if (cuda_err_code != cudaSuccess) { \
    printf("%s failed with code %d\n", #NAME, cuda_err_code); \
    abort(); \
  } \
}
// end

// kernel
__global__ void square_array(float *a, float *b, int n)
{
  int i = (blockIdx.x * blockDim.y + threadIdx.y)
          * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = a[i] * b[i];
}
// end

// main1
int main()
{
  cudaSetDevice(0); // EDIT ME

  const int n = 4096;

  float *a_host = (float *) malloc(n*sizeof(float));
  float *b_host = (float *) malloc(n*sizeof(float));

  float *a_device, *b_device;
  CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
  CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
  // end

  // main2
  for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

  CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
      cudaMemcpyHostToDevice));
  CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
      cudaMemcpyHostToDevice));

  dim3 block_dim(16, 16);
  int block_size = block_dim.x*block_dim.y;
  int n_blocks = (n + block_size-1) / block_size;
  square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
  // end

  // main3
  CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
      cudaMemcpyDeviceToHost));

  for (int i = 0; i < n; i++)
    printf("%.0f ", a_host[i]);
  puts("\n");

  free(a_host);
  CUDA_CHK(cudaFree, (a_device));
}
// end
slide by Andreas Klöckner (NYU)
26. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive
PyCUDA Philosophy
Provide complete access
Automatically manage resources
Provide abstractions
Check for and report errors
automatically
Full documentation
Integrate tightly with numpy
slide by Andreas Klöckner (NYU)
27. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
PyCuda: Workflow
Edit → Run → SourceModule("...") → PyCuda → nvcc → .cubin (cached!) → Upload to GPU → Run on GPU
slide by Andreas Klöckner (NYU)
28. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
Automatic Cleanup
Reachable objects (memory,
streams, . . . ) are never destroyed.
Once unreachable, released at an
unspecified future time.
Scarce resources (memory) can be
explicitly freed. (obj.free())
Correctly deals with multiple
contexts and dependencies.
slide by Andreas Klöckner (NYU)
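[Editor's note: a minimal sketch of the cleanup rules above; the buffer size is arbitrary, mem_alloc() and DeviceAllocation.free() are the standard PyCUDA calls.]

import pycuda.autoinit
import pycuda.driver as cuda

buf = cuda.mem_alloc(1 << 20)   # 1 MB of device memory; released automatically
                                # at some point after 'buf' becomes unreachable
buf.free()                      # scarce resource: free it explicitly, right now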
29. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
gpuarray: Simple Linear Algebra
pycuda.gpuarray:
Meant to look and feel just like numpy.
gpuarray.to_gpu(numpy_array)
numpy_array = gpuarray.get()
No: nd indexing, slicing, etc. (yet!)
Yes: +, -, ∗, /, fill, sin, exp, rand, take, . . .
Random numbers using pycuda.curandom
Mixed types (int32 + float32 = float64)
print gpuarray for debugging.
Memory behind gpuarray available as .gpudata
attribute.
Use as kernel arguments, textures, etc.
slide by Andreas Klöckner (NYU)
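[Editor's note: a small sketch of the last two bullets, passing a gpuarray's .gpudata straight into a hand-written kernel; the kernel and its name are invented for illustration.]

import numpy, pycuda.autoinit
import pycuda.driver as cuda
from pycuda import gpuarray
from pycuda.compiler import SourceModule

a_gpu = gpuarray.to_gpu(numpy.arange(16, dtype=numpy.float32))

mod = SourceModule("""
__global__ void add_one(float *a) { a[threadIdx.x] += 1.0f; }
""")
add_one = mod.get_function("add_one")
add_one(a_gpu.gpudata, block=(16, 1, 1))   # raw device pointer as kernel argument

print a_gpu    # gpuarrays print like numpy arrays, handy for debugging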
30. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive
What’s this “numpy”, anyway?
Numpy: package for large,
multi-dimensional arrays.
Vectors, Matrices, . . .
A+B, sin(A), dot(A,B)
la.solve(A, b), la.eig(A)
cube[:, :, n-k:n+k], cube+5
All much faster than functional equivalents in
Python.
“Python’s MATLAB”:
Basis for SciPy, plotting, . . .
slide by Andreas Klöckner (NYU)
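[Editor's note: a few of the operations listed above, as they look in practice; array shapes are chosen arbitrarily.]

import numpy
import numpy.linalg as la

A = numpy.random.randn(4, 4)
B = numpy.random.randn(4, 4)
b = numpy.random.randn(4)

C = A + B                  # elementwise arithmetic
S = numpy.sin(A)           # ufuncs
P = numpy.dot(A, B)        # matrix product
x = la.solve(A, b)         # dense linear solve
w, V = la.eig(A)           # eigendecomposition

cube = numpy.zeros((8, 8, 8))
sub = cube[:, :, 2:6] + 5  # slicing and broadcasting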
31. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
gpuarray: Elementwise expressions
Avoiding extra store-fetch cycles for elementwise math:
from pycuda.curandom import rand as curand
a_gpu = curand((50,))
b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]")

c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
slide by Andreas Klöckner (NYU)
32. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive
gpuarray: Reduction made easy
Example: A scalar product calculation
from pycuda.reduction import ReductionKernel
dot = ReductionKernel(dtype out=numpy.float32, neutral=”0”,
reduce expr=”a+b”, map expr=”x[i]∗y[i]”,
arguments=”const float ∗x, const float ∗y”)
from pycuda.curandom import rand as curand
x = curand((1000∗1000), dtype=numpy.float32)
y = curand((1000∗1000), dtype=numpy.float32)
x dot y = dot(x, y ). get()
x dot y cpu = numpy.dot(x.get(), y. get ())
slide by Andreas Klöckner (NYU)
33. GPU Scripting PyOpenCL News RTCG Showcase Exciting Developments in GPU-Python
Step 3: Usage
Complex numbers
. . . in GPUArray
. . . in user code
(pycuda-complex.hpp)
If/then/else for GPUArrays
Support for custom device pointers
Smarter device picking/context
creation
PyFFT: FFT for PyOpenCL and
PyCUDA
scikits.cuda: CUFFT, CUBLAS,
CULA
slide by Andreas Klöckner (NYU)
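[Editor's note: a sketch of the complex-number support mentioned above, assuming PyCUDA ≥ 0.94; the array contents are arbitrary.]

import numpy, pycuda.autoinit
from pycuda import gpuarray

a = (numpy.random.randn(8) + 1j*numpy.random.randn(8)).astype(numpy.complex64)
a_gpu = gpuarray.to_gpu(a)

b_gpu = a_gpu * a_gpu          # complex arithmetic on the GPU
print b_gpu.get() - a*a        # should be (close to) zero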
34. GPU Scripting PyOpenCL News RTCG Showcase Exciting Developments in GPU-Python
Sparse Matrix-Vector on the GPU
New feature in 0.94:
Sparse matrix-vector
multiplication
Uses “packeted format”
by Garland and Bell (also
includes parts of their code)
Integrates with scipy.sparse.
Conjugate-gradients solver
included
Deferred convergence
checking
slide by Andreas Klöckner (NYU)
35. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
Kernel Invocation: Automatic Copies
mod = pycuda.driver.SourceModule(
    "__global__ void my_func(float *out, float *in){...}")
func = mod.get_function("my_func")

src = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.empty_like(src)

func(
    cuda.Out(dest),
    cuda.In(src),
    block=(400,1,1))
“InOut” exists, too.
Only for immediate invocation style.
slide by Andreas Klöckner (NYU)
36. GPU Scripting PyOpenCL News RTCG Showcase Exciting Developments in GPU-Python
Step 4: Debugging
New in 0.94.1: Support for CUDA gdb:
$ cuda-gdb --args python -m pycuda.debug demo.py
Automatically:
Sets Compiler flags
Retains source code
Disables compiler cache
slide by Andreas Klöckner (NYU)
37. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
CUDA APIs
CUDA has two Programming Interfaces:
“Runtime”, high-level (libcudart.so, in the “toolkit”)
“Driver”, low-level (libcuda.so, comes with the GPU driver)
The two are mutually exclusive; PyCuda is built on the Driver API.
slide by Andreas Klöckner (NYU)
38. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
Runtime vs. Driver API
Runtime ↔ Driver differences:
Explicit initialization.
Code objects (“Modules”) become programming language
objects.
Texture handling requires slightly more work.
Only needs nvcc for compiling GPU code.
Driver API:
Conceptually cleaner
Less sugar-coating (provide in Python)
Not very different otherwise
slide by Andreas Klöckner (NYU)
39. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
PyCuda: API Tracing
With ./configure --cuda-trace=1:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

mod = cuda.SourceModule("""
__global__ void doublify(float *a)
{
  int idx = threadIdx.x + threadIdx.y*4;
  a[idx] *= 2;
}
""")

func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a

Driver calls traced:
cuInit, cuDeviceGetCount, cuDeviceGet, cuCtxCreate, cuMemAlloc, cuMemcpyHtoD,
cuCtxGetDevice, cuDeviceComputeCapability, cuModuleLoadData, cuModuleGetFunction,
cuFuncSetBlockShape, cuParamSetv, cuParamSetSize, cuLaunchGrid, cuMemcpyDtoH,
cuCtxPopCurrent, cuCtxPushCurrent, cuMemFree, cuCtxPopCurrent, cuCtxPushCurrent,
cuModuleUnload, cuCtxPopCurrent, cuCtxDestroy
slide by Andreas Klöckner (NYU)
40. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive
PyCUDA: Vital Information
http://mathema.tician.de/software/pycuda
Complete documentation
MIT License
(no warranty, free for all use)
Requires: numpy, Python 2.4+
(Win/OS X/Linux)
Support via mailing list
slide by Andreas Klöckner (NYU)
47. GPU Scripting PyOpenCL News RTCG Showcase Writing Code when the most Knowledge is Available
GPU Programming: Implementation Choices
Many difficult questions
Insufficient heuristics
Answers are hardware-specific and
have no lasting value
Proposed Solution: Tune automatically
for hardware at run time, cache tuning
results.
Decrease reliance on knowledge of
hardware internals
Shift emphasis from
tuning results to tuning ideas
slide by Andreas Klöckner (NYU)
48. GPU Scripting PyOpenCL News RTCG Showcase Writing Code when the most Knowledge is Available
Metaprogramming
In GPU scripting,
GPU code does
not need to be
a compile-time
constant.
slide by Andreas Klöckner (NYU)
56. GPU Scripting PyOpenCL News RTCG Showcase Writing Code when the most Knowledge is Available
Metaprogramming
Idea
In GPU scripting, GPU code does not need to be a compile-time constant.
(Key: Code is data, it wants to be reasoned about at run time)
Pipeline: Human or Machine → Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result
Good for code generation: PyCUDA, PyOpenCL
slide by Andreas Klöckner (NYU)
57. GPU Scripting PyOpenCL News RTCG Showcase Writing Code when the most Knowledge is Available
Machine-generated Code
Why machine-generate code?
Automated Tuning
(cf. ATLAS, FFTW)
Data types
Specialize code for given problem
Constants faster than variables
(→ register pressure)
Loop Unrolling
slide by Andreas Klöckner (NYU)
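[Editor's note: a minimal sketch of the "specialize code for a given problem" idea; the FILTER_W parameter and the kernel are invented for illustration, the point is that a run-time value becomes a literal constant in the generated source.]

import numpy, pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

FILTER_W = 5   # known only at run time, but constant for this problem instance

source = """
__global__ void scale(float *a)
{
    // %(filter_w)d is a literal constant in the generated code,
    // so the compiler can fold it and unroll around it
    a[threadIdx.x] *= %(filter_w)d;
}
""" % {"filter_w": FILTER_W}

mod = SourceModule(source)
scale = mod.get_function("scale")

a = numpy.ones(32, dtype=numpy.float32)
scale(cuda.InOut(a), block=(32, 1, 1))
print a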
58. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood
PyCuda: Support for Metaprogramming
Access properties of compiled code:
func.{num_regs, shared_size_bytes, local_size_bytes}
Exact GPU timing via events
Can calculate hardware-dependent MP occupancy
codepy (by Andreas):
Build C syntax trees from Python
Generates readable, indented C
Or use a templating engine (many available, e.g. Cheetah)
slide by Andreas Klöckner (NYU)
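[Editor's note: a short sketch of the first two bullets; the event-based timing pattern and the num_regs/shared_size_bytes attributes are standard PyCUDA, the kernel itself is just a stand-in.]

import numpy, pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void twice(float *a) { a[threadIdx.x] *= 2; }
""")
func = mod.get_function("twice")
print "registers:", func.num_regs, " smem bytes:", func.shared_size_bytes

a = numpy.random.randn(256).astype(numpy.float32)
a_gpu = cuda.to_device(a)

start, end = cuda.Event(), cuda.Event()
start.record()
func(a_gpu, block=(256, 1, 1))
end.record()
end.synchronize()
print "kernel time: %f ms" % start.time_till(end)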
59. Outline
1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI (vision)
63. The Approach
Reverse and Forward Engineering the Brain
REVERSE: Study the Natural System
FORWARD: Build the Artificial System
64. Why is modeling challenging?
The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run
Neural data only provides weak constraints
➡ Lots of parameters – hard to explore
Advice from Dave Cox:
“Don’t run anything that takes longer than a
week to complete, because it will just crash
halfway through anyways (or you’ll discover
a bug) and you’ll never finish your Ph.D.”
70. A Match Made in Heaven
Brains are parallel, GPUs are parallel
≈
Multiple scales of parallelism:
“Embarrassingly” parallel: video
frames, regions
Fine-grained: independent “neurons,”
operating on overlapping inputs
71. A Match Made in Heaven
Images In, Images Out
≈
Image processing particularly well-suited
Excellent Arithmetic Intensity: very
natural to load image patches into
shared memory
Data: 2D / 3D locality
72. Why is modeling challenging?
The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run
Neural data only provides weak constraints
➡ Lots of parameters – hard to explore
79. Two conflicting requirements
The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run (need: FAST)
Neural data only provides weak constraints
➡ Lots of parameters – hard to explore (need: FLEXIBLE)
How to optimize?
82. Fast vs Flexible: what can you do?
- Make your code accessible
- No focus on raw performance
Examples:
MATLAB/CUDA by Jim Mutch (2010)
by John Moore (1995)
83. Fast vs Flexible: what can you do?
- Use standard libraries
(e.g. CUBLAS, CUFFT, Jacket)
- But: “remap” problem to fit?
- Memory issues (not always optimal)
84. Fast vs Flexible: what can you do?
- Fully optimized, by hand
- But for only a few input configurations...
85. Fast vs Flexible: what can you do?
- Focus on flexibility/accessibility first
- But add strong foundations for raw
performance from the beginning
Example:
Theano: Python/C/CUDA
(OpenCL*)
http://deeplearning.net
by James Bergstra & Yoshua Bengio (2010)
89. Meta-programming !
Leave the grunt-programming to the
computer (i.e. auto-tuning like ATLAS or FFTW)
• Dynamically compile specialized versions
of the same kernel for different conditions
• Empirical run-time tuning
• For free: smooth syntactic ugliness: unroll
loops, index un-indexable registers, etc.
91. Meta-programming !
Let the computer generate and find the optimal
code:
• brute-force search with a global objective
• machine-learning approach with local
objectives and hidden variables (advanced)
• e.g. PyCuda makes this easy:
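[Editor's note: a toy sketch of the brute-force approach; the unroll factors and the kernel are invented for illustration, the generate/compile/time/keep-the-best pattern is the point.]

import numpy, pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

a = numpy.random.randn(512).astype(numpy.float32)

template = """
__global__ void scale_add(float *a)
{
    int base = threadIdx.x * %(unroll)d;
    #pragma unroll
    for (int i = 0; i < %(unroll)d; i++)
        a[base + i] = 2.0f * a[base + i] + 1.0f;
}
"""

best = None
for unroll in [1, 2, 4, 8]:                    # candidate code variants
    func = SourceModule(template % {"unroll": unroll}).get_function("scale_add")
    a_gpu = cuda.to_device(a)
    start, end = cuda.Event(), cuda.Event()
    start.record()
    func(a_gpu, block=(512 // unroll, 1, 1))
    end.record()
    end.synchronize()
    t = start.time_till(end)                   # global objective: run time
    if best is None or t < best[0]:
        best = (t, unroll)

print "best unroll factor:", best[1]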
92. Basic GPU Meta-programming System
A Case Study
GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
93. texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];
#define IMUL(a, b) __mul24(a, b)
extern "C" {
(Cheetah template)
#for j in xrange($FILTER_H)
__global__ void convolve_beta_j${j}(float4 *input, float4 *output)
{
#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
__shared__ float shared_in[$INPUT_BLOCK_W][4+1];
// -- input/output offsets
const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
float4 input_v4;
// -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
{
input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
98. Smooth syntactic ugliness
Manipulations that are not easily
accessible in CUDA C code:
• variable-length argument lists
99. Smooth syntactic ugliness
Manipulations that are not easily
accessible in CUDA C code:
• syntax-level code control (e.g. conditionals)
100. Smooth syntactic ugliness
Manipulations that are not easily
accessible in CUDA C code:
• loop unrolling (possibly fine-controlled)
101. Smooth syntactic ugliness
Manipulations that are not easily
accessible in CUDA C code:
• fine-controlled loop unrolling
(...)
v = shared_in[threadIdx.x+0][0];
w = constant[0][0][0];
sum0 += v*w;
w = constant[0][0][1];
sum1 += v*w;
w = constant[0][0][2];
sum2 += v*w;
w = constant[0][0][3];
sum3 += v*w;
v = shared_in[threadIdx.x+1][0];
w = constant[0][1][0];
sum0 += v*w;
w = constant[0][1][1];
sum1 += v*w;
w = constant[0][1][2];
sum2 += v*w;
w = constant[0][1][3];
sum3 += v*w;
v = shared_in[threadIdx.x+2][0];
w = constant[0][2][0];
sum0 += v*w;
w = constant[0][2][1];
sum1 += v*w;
w = constant[0][2][2];
sum2 += v*w;
w = constant[0][2][3];
sum3 += v*w;
v = shared_in[threadIdx.x+3][0];
w = constant[0][3][0];
sum0 += v*w;
w = constant[0][3][1];
sum1 += v*w;
w = constant[0][3][2];
sum2 += v*w;
w = constant[0][3][3];
sum3 += v*w;
v = shared_in[threadIdx.x+0][1];
w = constant[1][0][0];
sum0 += v*w;
w = constant[1][0][1];
sum1 += v*w;
w = constant[1][0][2];
sum2 += v*w;
w = constant[1][0][3];
sum3 += v*w;
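[Editor's note: the unrolled block above is exactly the kind of output a few lines of Python can emit; a minimal sketch follows, with loop bounds and names chosen to match the excerpt rather than taken from the actual generator.]

n_filters = 4
filter_w = 4

lines = []
for d in range(2):                     # filter-depth slice (excerpt shows d = 0, 1)
    for w in range(filter_w):
        lines.append("v = shared_in[threadIdx.x+%d][%d];" % (w, d))
        for f in range(n_filters):
            lines.append("w = constant[%d][%d][%d];" % (d, w, f))
            lines.append("sum%d += v*w;" % f)

print "\n".join(lines)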
103. we are not alone...
[Poster excerpt: “Using GPUs for Signal Correlation”, Daniel A. Mitchell, Lincoln Greenhill, Paul La Plante, and colleagues, Murchison Widefield Array. Its moral: don't trust compilers. Two seemingly “identical” accumulation fragments, one written as a single fused expression and one split into separate a += ... statements, measured at roughly 770 vs. 20 GFLOPS. The MWA imaging figures and the poster's reference list are not reproduced here.]
104. Smooth syntactic ugliness
Manipulations that are not easily
accessible in CUDA C code:
• index un-indexable resources (e.g. regs)
106. Basic GPU Meta-programming System
A Case Study
GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
107. Exploring design decision space more freely
Meta-programming:
• enables efficient learning of the GPU
hardware/software
• allows full exploitation of the GPU
architecture
108. version A vs. version B
conv_kernel_beta_template.cu: the same Cheetah convolution template (shown on slide 93), compiled under two different configurations and disassembled with decuda (by Wladimir J. van der Laan).

version A (constants staged through a register):
...
mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
mov.b32 $r1, c0[$ofs2+0x0008]
mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x000c]
mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
mov.b32 $r1, c0[$ofs2+0x0010]
mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
...

version B (constant-memory operands folded directly into the mad instructions):
...
mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
...

2x faster... Why?
110. Exploring design decision space more freely
When USE_THREAD_PER_FILTER is True
• each thread will access different cmem
locations (in order)
using the decuda disassembler by Wladimir J. van der Laan
(Python-based)
111. Exploring design decision space more freely
When USE_THREAD_PER_FILTER is False
• each thread will access the same cmem
locations (broadcast)
using the decuda disassembler by Wladimir J. van der Laan
(Python-based)
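[Editor's note: a hypothetical sketch of the template switch behind USE_THREAD_PER_FILTER; everything except that flag name is invented for illustration.]

N_FILTERS = 4

def emit_weight_loads(use_thread_per_filter):
    if use_thread_per_filter:
        # one filter per thread: each thread hits a different cmem location
        return ["w = constant[0][0][threadIdx.x];",
                "sum += v*w;"]
    # all filters per thread: every thread reads the same cmem location
    # (served by the constant-cache broadcast), unrolled over filters
    lines = []
    for f in range(N_FILTERS):
        lines.append("w = constant[0][0][%d];" % f)
        lines.append("sum%d += v*w;" % f)
    return lines

print "\n".join(emit_weight_loads(False))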
112. Exploring design decision space more freely
more registers
thread-dependent data movement
vs.
2x faster... Why?
113. Strategy
• intermediate design decisions can be made
explicit
• multiple “forks” in the path can be kept in place
• frees up the developer to revisit past choices
(without incurring a combinatoric explosion of separate pieces of code)
• retesting sets of assumptions can be done
frequently and programmatically from the
“outer” framework of code
114. Toy Example: Matmul
http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah
115. Summary
Meta-programming:
• can assist exploration and manual
optimization
• can de-clutter code
• is easy and flexible with the right tools
(e.g. Python, Py{CUDA,CL}, Cheetah, decuda)
➡ facilitates auto-tuning!
120. Basic GPU Meta-programming System
A Case Study
GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
121. Auto-tuning
The goal is to empirically optimize execution
time given:
• the environment
- hardware (GPU, CPU, Memory, Mobo)
- software (SDK, Compiler suite)
• the data (input dimensions, repetitions, etc.)
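[Editor's note: a small sketch of letting the environment constrain the search, querying the device before enumerating candidate block shapes; the candidate list itself is arbitrary.]

import pycuda.autoinit
import pycuda.driver as cuda

dev = pycuda.autoinit.device
attrs = dev.get_attributes()
max_threads = attrs[cuda.device_attribute.MAX_THREADS_PER_BLOCK]

print dev.name(), "cc", dev.compute_capability(), "max threads/block:", max_threads

candidates = [(bx, by) for bx in (8, 16, 32)
                       for by in (4, 8, 16)
                       if bx * by <= max_threads]
print "block shapes worth timing:", candidates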
126. Optimizing strategy
• Like many operations, filter-bank convolution is
usually “communication bound” on the GPU:
- compute is cheap
- communication is expensive
• We must take advantage of all types of memory:
- explicit: gmem (global), smem (shared), cmem
(constant), tmem (texture)
- implicit: rmem (registers), bmem (bin-code?) *
• Different optimal access patterns
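[Editor's note: a compact sketch touching three of the explicit memory spaces above; the kernel is invented for illustration, but get_global() and memcpy_htod() are the standard PyCUDA way to fill __constant__ memory.]

import numpy, pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__constant__ float coeff[4];                  // cmem: broadcast to all threads

__global__ void smooth(float *a)              // a lives in gmem
{
    __shared__ float buf[64];                 // smem: staged per block
    buf[threadIdx.x] = a[threadIdx.x];
    __syncthreads();
    a[threadIdx.x] = buf[threadIdx.x] * coeff[threadIdx.x % 4];
}
""")

coeff_ptr, _ = mod.get_global("coeff")
cuda.memcpy_htod(coeff_ptr, numpy.array([0.1, 0.2, 0.3, 0.4], dtype=numpy.float32))

a = numpy.ones(64, dtype=numpy.float32)
smooth = mod.get_function("smooth")
smooth(cuda.InOut(a), block=(64, 1, 1))
print a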
139. Summary
• Meta-programming makes developing
high-performance code for GPUs easier
• Fantastic tools exist (e.g. PyCUDA) to help
• Interesting way to explore/learn about
GPUs (hw/sw)
• Coarse auto-tuning yields good results
140. Future
• More Fermi optimizations
(L1 cache, concurrent kernels)
• OpenCL to optimize across vendors
• Smarter auto-tuning techniques (ML)
- (boosted) decision trees
- evolutionary programming strategies