Massively Parallel Computing
                         CS 264 / CSCI E-292
Lecture #6: CUDA Ninja Tricks | March 1st, 2011




                Nicolas Pinto (MIT, Harvard)
                       pinto@mit.edu
GPU “Scripting”, Meta-programming, Auto-tuning
News

During this course, we’ll try to “…” and use existing material, adapted for CS264 ;-)
Today
yey!!
Outline

1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI


Why do Scripting for GPUs?

     GPUs are everything that scripting
     languages are not.
            Highly parallel
            Very architecture-sensitive
            Built for maximum
            compute/memory throughput
     → complement each other
     CPU: largely restricted to control
     tasks (∼1000/sec)
            Scripting fast enough
     Realize a promise: Use Scripting. . .
            from first prototype
            to full-scale production code.


                                     slide by Andreas Klöckner (NYU)
   Nicolas Pinto (MIT) and Andreas Klöckner (Brown), PyCuda Tutorial


Why do Scripting for GPUs?


      GPUs are everything that scripting
      languages are not.
            Highly parallel
            Very architecture-sensitive
            Built for maximum FP/memory
            throughput
      → complement each other
      CPU: largely restricted to control
      tasks (∼1000/sec)
            Scripting fast enough
      Python + CUDA = PyCUDA
      Python + OpenCL = PyOpenCL


                                   slide by Andreas Klöckner (NYU), from “PyCUDA: Even Simpler GPU Programming with Python”


How are High-Performance Codes constructed?



      “Traditional” Construction of
      High-Performance Codes:
            C/C++/Fortran
            Libraries
      “Alternative” Construction of
      High-Performance Codes:
            Scripting for ‘brains’
            GPUs for ‘inner loops’
      Play to the strengths of each
      programming environment.






Scripting: Python


   One example of a scripting language: Python

         Mature
         Large and active community
         Emphasizes readability
         Written in widely-portable C
         A ‘multi-paradigm’ language
         Rich ecosystem of sci-comp related
         software






Scripting Languages



   Python:
         is discoverable and interactive.
         has comprehensive built-in functionality.
         manages resources automatically.
         uses run-time typing.
         works well for “gluing” lower-level blocks together.






Scripting: Goals

   Scripting languages aim to reduce the load on the programmer:
         Reduce required knowledge
         Encourage experimentation
         Eliminate sources of error
         Encourage abstraction wherever possible
         Value programmer time over computer time

    Think about the tools you use.
                                                          Use the right tool for the job.






Scripting: Speed


        Usual answer to the “Speed
        Question”:
        Hybrid (“mixed”) Code.
        Plays to the strengths of each
        language.
        But: Introduces (some)
        complexity.

   Observation: GPU code is already hybrid.

   Consequence: No added complexity through hybrid code.





Whetting your appetite



import pycuda.driver as cuda
import pycuda.autoinit, pycuda.compiler
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)


    [This is examples/demo.py in the PyCUDA distribution.]






Whetting your appetite

mod = pycuda.compiler.SourceModule("""
    __global__ void twice(float *a)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a




Whetting your appetite, Part II




   Did somebody say “Abstraction is good”?






Whetting your appetite, Part II


import numpy
import pycuda.autoinit
from pycuda import gpuarray

a_cpu = numpy.random.randn(4,4).astype(numpy.float32)
b_cpu = numpy.random.randn(4,4).astype(numpy.float32)
c_cpu = a_cpu * b_cpu

a_gpu = gpuarray.to_gpu(a_cpu)
b_gpu = gpuarray.to_gpu(b_cpu)
c_gpu = (a_gpu * b_gpu).get()

print c_cpu - c_gpu






Remember me?

// trivia
#include <stdio.h>

#define CUDA_CHK(NAME, ARGS) { \
  cudaError_t cuda_err_code = NAME ARGS; \
  if (cuda_err_code != cudaSuccess) { \
    printf("%s failed with code %d\n", #NAME, cuda_err_code); \
    abort(); \
  } \
}
// end

// kernel
__global__ void square_array(float *a, float *b, int n)
{
  int i = (blockIdx.x * blockDim.y + threadIdx.y)
    * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = a[i] * b[i];
}
// end

// main1
int main()
{
  cudaSetDevice(0); // EDIT ME

  const int n = 4096;

  float *a_host = (float *) malloc(n*sizeof(float));
  float *b_host = (float *) malloc(n*sizeof(float));

  float *a_device, *b_device;
  CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
  CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
// end

// main2
  for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

  CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float),
      cudaMemcpyHostToDevice));
  CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float),
      cudaMemcpyHostToDevice));

  dim3 block_dim(16, 16);
  int block_size = block_dim.x*block_dim.y;
  int n_blocks = (n + block_size-1) / block_size;
  square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
// end

// main3
  CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float),
      cudaMemcpyDeviceToHost));

  for (int i = 0; i < n; i++)
    printf("%.0f ", a_host[i]);
  puts("\n");

  free(a_host);
  CUDA_CHK(cudaFree, (a_device));
}
// end


PyCUDA Philosophy



                                                Provide complete access
                                                Automatically manage resources
                                                Provide abstractions
                                                Check for and report errors
                                                automatically
                                                Full documentation
                                                Integrate tightly with numpy






PyCuda: Workflow

   Edit → Run → SourceModule("...") → nvcc → .cubin → Upload to GPU → Run on GPU

   PyCuda caches the compiled .cubin, so nvcc is only invoked when the
   kernel source actually changes.


Automatic Cleanup



          Reachable objects (memory,
          streams, . . . ) are never destroyed.
          Once unreachable, released at an
          unspecified future time.
          Scarce resources (memory) can be
          explicitly freed. (obj.free())
          Correctly deals with multiple
          contexts and dependencies.






gpuarray: Simple Linear Algebra

 pycuda.gpuarray:
     Meant to look and feel just like numpy.
           gpuarray.to_gpu(numpy_array)
           numpy_array = gpuarray.get()
      No: nd indexing, slicing, etc. (yet!)
      Yes: +, -, *, /, fill, sin, exp, rand, take, . . .
      Random numbers using pycuda.curandom
      Mixed types (int32 + float32 = float64)
      print gpuarray for debugging.
      Memory behind gpuarray available as .gpudata attribute.
             Use as kernel arguments, textures, etc.




What’s this “numpy”, anyway?

 Numpy: package for large,
 multi-dimensional arrays.
      Vectors, Matrices, . . .
      A+B, sin(A), dot(A,B)
      la.solve(A, b), la.eig(A)
      cube[:, :, n-k:n+k], cube+5
 All much faster than functional equivalents in
 Python.

 “Python’s MATLAB”:
 Basis for SciPy, plotting, . . .



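The bullets above map one-to-one onto numpy calls; a quick sketch (the array values are made up for illustration):

```python
import numpy as np
import numpy.linalg as la

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
B = np.eye(2)
b = np.array([1.0, 2.0])

C = A + B                  # elementwise A+B
S = np.sin(A)              # elementwise sin(A)
d = np.dot(A, b)           # matrix-vector product dot(A,b)
x = la.solve(A, b)         # solve A x = b
w, V = la.eig(A)           # eigenvalues/eigenvectors

cube = np.arange(64.0).reshape(4, 4, 4)
sub = cube[:, :, 1:3] + 5  # slicing like cube[:, :, n-k:n+k], then cube+5
```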


gpuarray: Elementwise expressions

   Avoiding extra store-fetch cycles for elementwise math:
   import numpy.linalg as la
   from pycuda import gpuarray
   from pycuda.curandom import rand as curand
   a_gpu = curand((50,))
   b_gpu = curand((50,))

   from pycuda.elementwise import ElementwiseKernel
   lin_comb = ElementwiseKernel(
           "float a, float *x, float b, float *y, float *z",
           "z[i] = a*x[i] + b*y[i]")

   c_gpu = gpuarray.empty_like(a_gpu)
   lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

   assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5





gpuarray: Reduction made easy


  Example: A scalar product calculation
  import numpy
  from pycuda.reduction import ReductionKernel
  dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
          reduce_expr="a+b", map_expr="x[i]*y[i]",
          arguments="const float *x, const float *y")

  from pycuda.curandom import rand as curand
  x = curand((1000*1000), dtype=numpy.float32)
  y = curand((1000*1000), dtype=numpy.float32)

  x_dot_y = dot(x, y).get()
  x_dot_y_cpu = numpy.dot(x.get(), y.get())






Step 3: Usage

                                                Complex numbers
                                                        . . . in GPUArray
                                                        . . . in user code
                                                        (pycuda-complex.hpp)
                                                If/then/else for GPUArrays
                                                Support for custom device pointers
                                                Smarter device picking/context
                                                creation
                                                PyFFT: FFT for PyOpenCL and
                                                PyCUDA
                                                scikits.cuda: CUFFT, CUBLAS,
                                                CULA




Sparse Matrix-Vector on the GPU


      New feature in 0.94:
      Sparse matrix-vector
      multiplication
      Uses “packeted format”
      by Garland and Bell (also
      includes parts of their code)
      Integrates with scipy.sparse.
      Conjugate-gradients solver
      included
            Deferred convergence
            checking



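The "deferred convergence checking" bullet is easy to see in a plain CPU sketch of conjugate gradients: on the GPU, reading the residual norm back to the host is a synchronization point, so the test is only run every few iterations. This is an illustration in numpy, not PyCUDA's actual solver:

```python
import numpy as np

def cg(A, b, tol=1e-8, check_every=10, max_iter=1000):
    """Conjugate gradients for SPD A, with deferred convergence checks."""
    x = np.zeros_like(b)
    r = b - np.dot(A, x)        # initial residual
    p = r.copy()                # initial search direction
    rs = np.dot(r, r)
    for it in range(1, max_iter + 1):
        Ap = np.dot(A, p)
        alpha = rs / np.dot(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.dot(r, r)
        if rs_new == 0:         # exact-zero guard for this toy sketch
            break
        # Deferred check: only test the residual every `check_every`
        # steps -- on a GPU this avoids a transfer per iteration.
        if it % check_every == 0 and np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

The price is running up to check_every - 1 extra (cheap) iterations past convergence; the gain is far fewer device-to-host round trips.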


Kernel Invocation: Automatic Copies

   mod = pycuda.driver.SourceModule(
       "__global__ void my_func(float *out, float *in){...}")
   func = mod.get_function("my_func")

   src = numpy.random.randn(400).astype(numpy.float32)
   dest = numpy.empty_like(src)

   func(
           cuda.Out(dest),
           cuda.In(src),
           block=(400,1,1))

         “InOut” exists, too.
         Only for immediate invocation style.




Step 4: Debugging


 New in 0.94.1: Support for CUDA gdb:

  $ cuda-gdb --args python -m pycuda.debug demo.py

 Automatically:
      Sets Compiler flags
      Retains source code
      Disables compiler cache






CUDA APIs


    CUDA has two Programming Interfaces:
          “Runtime”, high-level (libcudart.so, in the “toolkit”)
          “Driver”, low-level (libcuda.so, comes with the GPU driver)
    (mutually exclusive)

    C/C++ code talks to the Runtime API, which sits on top of the Driver API;
    PyCuda talks to the Driver API directly. Both paths go through the kernel
    driver to the hardware.






Runtime vs. Driver API


   Runtime ↔ Driver differences:
         Explicit initialization.
         Code objects (“Modules”) become programming language
         objects.
         Texture handling requires slightly more work.
         Only needs nvcc for compiling GPU code.
   Driver API:
         Conceptually cleaner
         Less sugar-coating (provide in Python)
         Not very different otherwise





PyCuda: API Tracing

   With ./configure --cuda-trace=1:
  import pycuda.driver as cuda
  import pycuda.autoinit
  import numpy

  a = numpy.random.randn(4,4).astype(numpy.float32)
  a_gpu = cuda.mem_alloc(a.nbytes)
  cuda.memcpy_htod(a_gpu, a)

  mod = cuda.SourceModule("""
      __global__ void doublify(float *a)
      {
        int idx = threadIdx.x + threadIdx.y*4;
        a[idx] *= 2;
      }
      """)

  func = mod.get_function("doublify")
  func(a_gpu, block=(4,4,1))

  a_doubled = numpy.empty_like(a)
  cuda.memcpy_dtoh(a_doubled, a_gpu)
  print a_doubled
  print a

   Trace output: cuInit, cuDeviceGetCount, cuDeviceGet, cuCtxCreate,
   cuMemAlloc, cuMemcpyHtoD, cuCtxGetDevice, cuDeviceComputeCapability,
   cuModuleLoadData, cuModuleGetFunction, cuFuncSetBlockShape, cuParamSetv,
   cuParamSetSize, cuLaunchGrid, cuMemcpyDtoH, cuCtxPopCurrent,
   cuCtxPushCurrent, cuMemFree, cuCtxPopCurrent, cuCtxPushCurrent,
   cuModuleUnload, cuCtxPopCurrent, cuCtxDestroy


PyCUDA: Vital Information



       http://mathema.tician.de/
       software/pycuda
       Complete documentation
       MIT License
       (no warranty, free for all use)
       Requires: numpy, Python 2.4+
       (Win/OS X/Linux)
       Support via mailing list




Sleepy?
Outline

1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI
... too much?

bank conflicts, coalescing, caching, partition camping, broadcasting,
zero-copy, streams, mixed precision, ...

can’t decide?


GPU Programming: Implementation Choices


          Many difficult questions
          Insufficient heuristics
          Answers are hardware-specific and
          have no lasting value
                               Proposed Solution: Tune automatically
                               for hardware at run time, cache tuning
                               results.
                                      Decrease reliance on knowledge of
                                      hardware internals
                                      Shift emphasis from
                                      tuning results to tuning ideas


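The proposed solution boils down to a few lines: enumerate candidate configurations, time each one on the actual hardware at run time, and cache the winner. A generic pure-Python sketch (the names `autotune` and `run` are made up; in PyCUDA the candidates would be generated kernel variants and the timings would use CUDA events):

```python
import time

_tuning_cache = {}  # best configuration per workload name, found once

def autotune(name, candidates, run):
    """Pick the fastest configuration from `candidates` by timing
    run(cfg) once each; the result is cached under `name`, so the
    search cost is paid only on the first call."""
    if name in _tuning_cache:
        return _tuning_cache[name]
    best_time, best_cfg = float("inf"), None
    for cfg in candidates:
        t0 = time.perf_counter()
        run(cfg)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_time, best_cfg = elapsed, cfg
    _tuning_cache[name] = best_cfg
    return best_cfg
```

A real tuner would average several timing runs and restrict the candidates to configurations that are valid for the device (e.g. block sizes within the register and shared-memory limits).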
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


Metaprogramming


        Idea

                                                            In GPU scripting,
   Python Code
                                                              GPU code does
                                                              not need to be
    GPU Code
                                                              a compile-time
                                                                 constant.
  GPU Compiler

    GPU Binary         Machine
                                                (Key: Code is data–it wants to be
        GPU                                       reasoned about at run time)

       Result


                                   slide by Andreas Klockner (NYU) Simpler GPU Programming with Python
                              Andreas Kl¨ckner
                                         o            PyCUDA: Even
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


Metaprogramming


        Idea
                       Human                                In GPU scripting,
   Python Code
                                                              GPU code does
                                                              not need to be
    GPU Code
                                                              a compile-time
                                                                 constant.
  GPU Compiler

    GPU Binary
                                                (Key: Code is data–it wants to be
        GPU                                       reasoned about at run time)

       Result


                                   slide by Andreas Klockner (NYU) Simpler GPU Programming with Python
                              Andreas Kl¨ckner
                                         o            PyCUDA: Even
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


Metaprogramming


        Idea

                          Good for code           In GPU scripting,
  Python Code
          News            generation                GPU code does
     The
                                                    not need ailabee
                                                              v to bl
    GPU Code              Gener  a t i on           d ge is A
                                               nowlea compile-time
                   e Code               most K
   4 R u n - T i m o d e w h e n th e                  constant.
        Writ
  GPU Compiler   ing C

    GPU Binaryase
        howc
         S
                                                (Key: Code is data–it wants to be
        GPU                                       reasoned about at run time)

       Result


                                   slide by Andreas Klockner (NYU) Simpler GPU Programming with Python
                              Andreas Kl¨ckner
                                         o            PyCUDA: Even
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


Metaprogramming


        Idea

                             Good for code                  In GPUyCUDA
                                                                  P scripting,
   Python Code
                             generation                       GPU code does
                                                              not need to be
    GPU Code
                                                              a compile-time
                                                                 constant.
  GPU Compiler

    GPU Binary
                                                (Key: Code is data–it wants to be
        GPU                                       reasoned about at run time)

       Result


                                   slide by Andreas Klockner (NYU) Simpler GPU Programming with Python
                              Andreas Kl¨ckner
                                         o            PyCUDA: Even
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


Metaprogramming


        Idea

                             Good for code                      PyOp UDA
                                                            In GPUyCenCL
                                                                  P scripting,
   Python Code
                             generation                       GPU code does
                                                              not need to be
    GPU Code
                                                              a compile-time
                                                                 constant.
  GPU Compiler

    GPU Binary
                                                (Key: Code is data–it wants to be
        GPU                                       reasoned about at run time)

       Result


                                   slide by Andreas Klockner (NYU) Simpler GPU Programming with Python
                              Andreas Kl¨ckner
                                         o            PyCUDA: Even
GPU Scripting PyOpenCL News RTCG Showcase       Writing Code when the most Knowledge is Available


Machine-generated Code



  Why machine-generate code?
       Automated Tuning
       (cf. ATLAS, FFTW)
       Data types
       Specialize code for given problem
       Constants faster than variables
       (→ register pressure)
       Loop Unrolling
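To illustrate the last two points (constants faster than variables, loop unrolling), a few lines of Python can generate straight-line GPU code with the loop bound and weights baked in as literals. This sketch is not from the slides; the function name and snippet layout are hypothetical:

```python
# Illustrative sketch: generate an unrolled CUDA C snippet at run time,
# baking the loop bound and the weight values in as compile-time constants.
def unrolled_dot(weights):
    """Emit CUDA C text that accumulates sum += w[i]*x[i], fully unrolled."""
    lines = ["float sum = 0.0f;"]
    for i, w in enumerate(weights):
        # each weight becomes a literal; no loop, no weight array in registers
        lines.append("sum += %.6ff * x[%d];" % (w, i))
    return "\n".join(lines)

src = unrolled_dot([0.5, -1.0, 2.0])
print(src)
```

The emitted text would then be handed to the GPU compiler, which sees only constants and straight-line code.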




                                   slide by Andreas Klöckner (NYU)


PyCuda: Support for Metaprogramming



        Access properties of compiled code:
        func.{num_regs, shared_size_bytes, local_size_bytes}
        Exact GPU timing via events
        Can calculate hardware-dependent MP occupancy
        codepy (by Andreas):
                Build C syntax trees from Python
                Generates readable, indented C
        Or use a templating engine (many available, e.g. Cheetah)




                                     slide by Andreas Klöckner (NYU)
   Nicolas Pinto (MIT) and Andreas Klöckner (Brown)       PyCuda Tutorial
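The templating-engine route can be sketched with the standard library's `string.Template` standing in for Cheetah; the kernel below is a hypothetical example, not from the slides:

```python
# Hedged sketch: the slides use Cheetah; here Python's stdlib
# string.Template plays the same role, specializing a (hypothetical)
# kernel source for one parameter set at run time.
from string import Template

KERNEL = Template("""
__global__ void scale(float *out, const float *in)
{
    const int i = blockIdx.x * $BLOCK_W + threadIdx.x;
    out[i] = ${ALPHA}f * in[i];
}
""")

# Fill in the placeholders -> one specialized CUDA C source per config.
src = KERNEL.substitute(BLOCK_W=128, ALPHA="2.0")
print(src)
```

On a machine with a GPU, `src` could then be passed to `pycuda.compiler.SourceModule(src)` and launched like any hand-written kernel.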
Outline

1. Scripting GPUs with PyCUDA
2. Meta-programming and RTCG
3. Case study in brain-inspired AI (vision)
Motivation
The Problem:
Visual Object Recognition


               fast
               accurate
               tolerant to variations
               effortless
               critical to survival
The Approach
Reverse and Forward Engineering the Brain
The Approach
Reverse and Forward Engineering the Brain




     REVERSE                 FORWARD
       Study                       Build
    Natural System            Artificial System
Why is modeling challenging?

   The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run

   Neural data only provides weak constraints
➡ Lots of parameters – hard to explore


      Advice from Dave Cox:
      “Don’t run anything that takes longer than a
      week to complete, because it will just crash
      halfway through anyways (or you’ll discover
      a bug) and you’ll never finish your Ph.D.”
Why is modeling challenging?

   The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run

   Neural data only provides weak constraints
➡ Lots of parameters – hard to explore
Visual Cortex




                 brain = 20 petaflops!
GPUs (since 2006)




7800 GTX      Monster16GPU   Tesla Cluster
  (2006)         (2008)         (2009)

OpenGL/Cg       CUDA         CUDA/OpenCL
C++/Python      Python          Python
Build your own!
Cell Broadband Engine (since 2007)

         Teraflop Playstation3 clusters:




   DiCarlo Lab / MIT        Cox Lab / Harvard
A Match Made in Heaven
Brains are parallel, GPUs are parallel



                     ≈
   Multiple scales of parallelism:
      “Embarrassingly” parallel: video
     frames, regions
     Fine-grained: independent “neurons,”
     operating on overlapping inputs
A Match Made in Heaven
Images In, Images Out



                    ≈
   Image processing particularly well-suited
    Excellent Arithmetic Intensity: very
    natural to load image patches into
    shared memory
    Data: 2D / 3D locality
Why is modeling challenging?

   The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run

   Neural data only provides weak constraints
➡ Lots of parameters – hard to explore
Fukushima (1980)
LeCun et al. (1989)
Riesenhuber & Poggio (1999)
Serre & Poggio (2007)
Read-out

   [Model diagram: a stack of layers, input → L1 → L2 → L3 → read-out.
   Each layer exposes many tunable parameters: kernel size, number of
   filters, thresh/sat, norm strength, normalization neighborhood, and
   learning options (Rate, Trace, “Temp. Adv.”, “Auto-reset”, ...).]
Two conflicting requirements

   The brain is a massively parallel computer
➡ Big models are paralyzingly slow to run             → need FAST

   Neural data only provides weak constraints
➡ Lots of parameters – hard to explore                → need FLEXIBLE




  How to optimize?
What’s the bottleneck?

        3D Filter bank Convolutions!
Fast vs Flexible: what can you do?


 - Make your code accessible
 - No focus on raw performance

Examples:


               MATLAB/CUDA by Jim Mutch (2010)



                              by John Moore (1995)
Fast vs Flexible: what can you do?




 - Use standard libraries
   (e.g. CUBLAS, CUFFT, Jacket)


 - But: “remap” problem to fit?

 - Memory issues (not always optimal)
Fast vs Flexible: what can you do?

 - Fully optimized, by hand
 - But for only a few input configurations...
Fast vs Flexible: what can you do?


 - Focus on flexibility/accessibility first
 - But add strong foundations for raw
   performance from the beginning

Example:

                              Theano: Python/C/CUDA
                                           (OpenCL*)

http://deeplearning.net
by James Bergstra & Yoshua Bengio (2010)
Our answer?
Meta-programming
       and
   Auto-tuning
What?
Meta-programming !


 Leave the grunt-programming to the
 computer (i.e. auto-tuning like ATLAS or FFTW)
 •   Dynamically compile specialized versions
     of the same kernel for different conditions
 •   Empirical run-time tuning
 •   For free: smooth syntactic ugliness: unroll
     loops, index un-indexable registers, etc.
Meta-programming !


“Instrument” your solutions:
•   Block size
•   Work size
•   Loop unrolling
•   Pre-fetching
•   Spilling
•   etc.
Meta-programming !


 Let the computer generate and find the optimal
 code:
 •   brute-force search with a global objective
 •   machine-learning approach with local
     objectives and hidden variables (advanced)
     •   e.g. PyCuda makes this easy:
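A minimal version of that brute-force search, with a stand-in timing function where real PyCUDA event timing would go (all names here are hypothetical):

```python
# Sketch of brute-force auto-tuning. In a real harness, time_kernel would
# render the template for (block_w, unroll), compile it with PyCUDA, and
# time it with cuda.Event pairs; here a synthetic cost model stands in.
import itertools

def time_kernel(block_w, unroll):
    # stand-in cost: pretend block_w=128, unroll=4 is the empirical optimum
    return abs(block_w - 128) * 0.01 + abs(unroll - 4) * 0.1

candidates = itertools.product([32, 64, 128, 256],   # candidate block widths
                               [1, 2, 4, 8])         # candidate unroll factors

# global objective: keep the configuration with the lowest measured time
best = min(candidates, key=lambda p: time_kernel(*p))
print(best)
```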
Basic GPU Meta-programming System


     GPU Meta-Programming: A Case Study
     in Biologically-Inspired Machine Vision
     [GPU Computing Gems]
     Pinto N, Cox DD
texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

#define IMUL(a, b) __mul24(a, b)
extern "C" {
                                                                 (Cheetah)
#for j in xrange($FILTER_H)

  __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
  {

#set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
    __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
#for i in xrange($LOAD_ITERATIONS)
#if $i==($LOAD_ITERATIONS-1)
    if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
#end if
      {
	   input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
	   shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
	   shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
	   shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
	   shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
      }
#end for
conv_kernel_4x4x4.cu (generated from conv_kernel_template.cu)

#include <stdio.h>

texture<float4, 1, cudaReadModeElementType> tex_float4;
__constant__ float constant[4][4][4];

#define IMUL(a, b) __mul24(a, b)
extern "C" {

  __global__ void convolve_beta_j0(float4 *input, float4 *output)
  {

    __shared__ float shared_in[131][4+1];

    // -- input/output offsets
    const uint in_idx = (blockIdx.y+0)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
    float4 input_v4;

    // -- load input to shared memory
      {
	   input_v4 = tex1Dfetch(tex_float4, in_idx+128*0);
	   shared_in[threadIdx.x+128*0][0] = input_v4.x;
	   shared_in[threadIdx.x+128*0][1] = input_v4.y;
	   shared_in[threadIdx.x+128*0][2] = input_v4.z;
	   shared_in[threadIdx.x+128*0][3] = input_v4.w;
      }
    if((threadIdx.x+128*1)<131)
      {
	   input_v4 = tex1Dfetch(tex_float4, in_idx+128*1);
	   shared_in[threadIdx.x+128*1][0] = input_v4.x;
	   shared_in[threadIdx.x+128*1][1] = input_v4.y;
	   shared_in[threadIdx.x+128*1][2] = input_v4.z;
	   shared_in[threadIdx.x+128*1][3] = input_v4.w;
      }
    __syncthreads();

    // -- compute dot products
    float v, w;

    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;

    v = shared_in[threadIdx.x+0][0];
    w = constant[0][0][0];
    sum0 += v*w;
    w = constant[0][0][1];
    sum1 += v*w;
    w = constant[0][0][2];
    sum2 += v*w;
    w = constant[0][0][3];
    sum3 += v*w;
    v = shared_in[threadIdx.x+1][0];
    w = constant[0][1][0];
    sum0 += v*w;
    w = constant[0][1][1];
    sum1 += v*w;
    w = constant[0][1][2];
    sum2 += v*w;
    w = constant[0][1][3];
    sum3 += v*w;
    v = shared_in[threadIdx.x+2][0];
    w = constant[0][2][0];
    sum0 += v*w;
    w = constant[0][2][1];
    sum1 += v*w;
    ...
conv_kernel_template.cu expands to:

  conv_kernel_4x4x4.cu     20 kB of generated source
  conv_kernel_8x8x4.cu     64 kB of generated source
Benefits?
Smooth syntactic ugliness
Smooth syntactic ugliness
  Manipulations that are not easily
  accessible in CUDA C code:
  • variable-length argument lists
  • syntax-level code control (e.g. conditionals)
  • loop unrolling (possibly fine-controlled)

  For example, fine-controlled loop unrolling generates code like this
  (excerpt):
  v = shared_in[threadIdx.x+0][0];
  w = constant[0][0][0];
  sum0 += v*w;
  w = constant[0][0][1];
  sum1 += v*w;
  w = constant[0][0][2];
  sum2 += v*w;
  w = constant[0][0][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+1][0];
  w = constant[0][1][0];
  sum0 += v*w;
  w = constant[0][1][1];
  sum1 += v*w;
  w = constant[0][1][2];
  sum2 += v*w;
  w = constant[0][1][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+2][0];
  w = constant[0][2][0];
  sum0 += v*w;
  w = constant[0][2][1];
  sum1 += v*w;
  w = constant[0][2][2];
  sum2 += v*w;
  w = constant[0][2][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+3][0];
  w = constant[0][3][0];
  sum0 += v*w;
  w = constant[0][3][1];
  sum1 += v*w;
  w = constant[0][3][2];
  sum2 += v*w;
  w = constant[0][3][3];
  sum3 += v*w;
  v = shared_in[threadIdx.x+0][1];
  w = constant[1][0][0];
  sum0 += v*w;
  w = constant[1][0][1];
  sum1 += v*w;
  w = constant[1][0][2];
  sum2 += v*w;
  w = constant[1][0][3];
  sum3 += v*w;
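The excerpt above is entirely mechanical, which is exactly why it is generated rather than written by hand. A sketch of such a generator (hypothetical helper, mirroring the v/w/sum pattern in the excerpt):

```python
# Sketch: emit the unrolled dot-product block shown above for any
# filter width / filter count. Index order matches the excerpt:
# shared_in[threadIdx.x + i][d] and constant[d][i][n].
def emit_dot_products(filter_w, n_filters, d):
    out = []
    for i in range(filter_w):
        out.append("v = shared_in[threadIdx.x+%d][%d];" % (i, d))
        for n in range(n_filters):
            out.append("w = constant[%d][%d][%d];" % (d, i, n))
            out.append("sum%d += v*w;" % n)
    return "\n".join(out)

code = emit_dot_products(filter_w=4, n_filters=4, d=0)
print(code.count("+= v*w;"))  # number of multiply-accumulate statements
```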
How about #pragma unroll ?
   (why don’t you trust the compiler?)
we are not alone....

   Using GPUs for Signal Correlation
   The Murchison Widefield Array
   Daniel A. Mitchell, with Michael Clark

   Don’t trust compilers: compare these “identical” code fragments

       a += b*c + d*c + e*f + g*h;      770 GFLOPS

       a += b*c;
       a += d*c;                         20 GFLOPS
       a += e*f;
       ...

   [Background figure from the MWA correlation work. Figure 3: on the left,
   an image of the J2107-2526 field produced by integrating 8-second
   snapshots over the entire time interval without blanking; on the right,
   an image of the field after RFI blanking and peeling, along with contours
   of the unpeeled image.]
                                                                          art.
                                                                                          tests, of w
                                                                                                             a += g*h;
                                                            n integral p
                                                will form a  eenhill
                                                   Lincoln Gr
                               Paul La Pla
                                          nte and ces
                                              Referen                                          t Boolard
                                                                                                              a +=
                                                                                                          y, EDGES
                                                                                                                        Memo, 058
                                                                                                                                      , 2010.
                                                                                                                                               R.J. Cappal
                                                                                                                                                           lo, M.F. M
                                                                                                                                                                        orales, and
                                                                                         ics a                                           ale,                             d Topics
                                                                            RFI Statist                                    , C.J. Lonsd                      l of Selecte
                                                   [1] A.E   .E. Rogers,                                     , R.J. Sault                     IE EE Journa
                                                                                              R.B. Wayth                         eld Array,
                                                                                . Greenhill,                      hison Widefi                      ].
                                                                   itchell, L.J                    of the Murc                        07.1912                                 E, 97
                                                    [2] D.A. M                Time Calib
                                                                                           ration
                                                                                                               , [astro-
                                                                                                                             ph/08                               s of the IEE
                                                         S.M. O    rd, Real-                       7 17, 2008                                      , Proceeding
                                                                                     2 (5), 707–                                     n Overview
                          1
              nuary 201
sday, 27 Ja                                                             rocessing,                                     rray: Desig
                                                         in Signal P                                 on Widefield A
                                                                                       he Murchis                        8].                                            , Graphics
                                                                        ale, et al., T                      903.182                                        R.G. Edgar
                                                     [3]  C.J. Lonsd                    [ast  ro-ph/0                                     H. Pfister, and                   Series,
                                                                          506, 2009,                                      ell, K. Dale,                     Conference
                                                           (8), 1497–1                                    , D.A. Mitch                       d Array, ASP
                                                                                            R.B. Wayth                        on Wide-fiel
                                                                               Greenhill,                      the Murchis


     IICS‘2011                                        [4] S.M    . Ord, L.J.             ata Pro  cessing in                                                                 cal
                                                                           Units for D                                                                           Mathemati
                                                            Processing                                                                1 radio pola
                                                                                                                                                     rimetry. I.
                                                                           009.                                              aa d
                                                                                                                       nderstryn20 ing
                                                                                                                                    1
                                                            411, 127, 2                                  .J. Sault, U Janu                 6.
                                                                                   . Breg  man, and R ursday,.,2117, 137–147, 199
                                                                                                                     7
                                                                                                                                                                        alar
                                                                     amaker, J.D                       Th pl. Ser
                                                                                                       up                                                alogue of sc
                                                        [5] J.P. H                        strophys. S                                  ll-co herency an                rophys. Su
                                                                                                                                                                                   ppl.
                                                                            s, Astron. A                                  . IV. The fu                   Astron. Ast
                                                              foundation                                    polarimetry                     ric fidelity,
                                                                                                  g radio               ge and pola
                                                                                                                                      rimet
                                                                                    derstandin
Smooth syntactic ugliness

  Manipulations that are not easily
  accessible in CUDA C code:
  • index un-indexable resources (e.g. regs)
Explore design decision
  space more freely
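As a toy illustration of "indexing the un-indexable" (all names here are hypothetical, not from the deck): registers cannot be indexed from CUDA C at run time, but a script can unroll the indexing at code-generation time, emitting one named register variable per slot and straight-line code that touches each of them.

```python
# Sketch: generate CUDA source in which "register indexing" has been
# unrolled by the script, so the compiler sees only named scalars (r0, r1, ...)
# that it can keep in registers.
def unrolled_reduction(n_regs):
    decls = "\n".join(f"    float r{i} = in[{i}];" for i in range(n_regs))
    summed = " + ".join(f"r{i}" for i in range(n_regs))
    return (
        "__device__ float reduce(const float *in) {\n"
        f"{decls}\n"
        f"    return {summed};\n"
        "}\n"
    )

src = unrolled_reduction(4)
```

The generated source is then handed to the regular CUDA toolchain; varying `n_regs` from the scripting side is one way to explore register pressure.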
Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study
in Biologically-Inspired Machine Vision
[GPU Computing Gems]

Pinto N, Cox DD
Exploring design decision space more freely

  Meta-programming:


  • enables efficient learning of the GPU
    hardware/software


  • allows full exploitation of the GPU
    architecture
conv_kernel_beta_template.cu

  texture<float4, 1, cudaReadModeElementType> tex_float4;
  __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

  #define IMUL(a, b) __mul24(a, b)
  extern "C" {

  #for j in xrange($FILTER_H)
    __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
    {
  #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
      __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

      // -- input/output offsets
      const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
      const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
      float4 input_v4;

      // -- load input to shared memory
  #for i in xrange($LOAD_ITERATIONS)
  #if $i==($LOAD_ITERATIONS-1)
      if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
  #end if
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
        shared_in[threadIdx.x+$BLOCK_W*$i][3] = input_v4.w;
      }
  #end for

version A (disassembled)

  ...
  mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
  mov.b32 $r1, c0[$ofs2+0x0008]
  mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
  mov.b32 $r1, c0[$ofs2+0x000c]
  mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
  mov.b32 $r1, c0[$ofs2+0x0010]
  mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
  ...

version B (disassembled)

  ...
  mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
  mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
  mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
  mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
  mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
  ...

2x faster... Why?

using decuda by Wladimir J. van der Laan
Exploring design decision space more freely

  When USE_THREAD_PER_FILTER is True
  • each thread will access different cmem
     locations (in order)




using the decuda disassembler by Wladimir J. van der Laan
         (Python-based)
Exploring design decision space more freely

  When USE_THREAD_PER_FILTER is False
  • each thread will access the same cmem
     locations (broadcast)




using the decuda disassembler by Wladimir J. van der Laan
         (Python-based)
Exploring design decision space more freely

  version A: more registers,
             thread-dependent data movement

       vs.

  version B: 2x faster... Why?
Strategy


• intermediate design decisions can be made
  explicit


• multiple “forks” in the path can be kept in place

• frees up the developer to revisit past choices
  (without incurring a combinatoric explosion of separate pieces of code)


• retesting sets of assumptions can be done
  frequently and programmatically from the
  “outer” framework of code
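The "forks kept in place" idea above can be sketched with a boolean meta-parameter (hypothetical names; the deck's real templates use Cheetah with PyCUDA): both design choices live in one generator, and retesting the assumption is a single call from the outer framework.

```python
# Minimal sketch: keep both forks of a design decision in one code generator
# and select between them with a meta-parameter, instead of maintaining two
# separate copies of the kernel source.
KERNEL_TEMPLATE = """
__global__ void apply_filters(const float *in, float *out) {{
{body}
}}
"""

def render_kernel(use_thread_per_filter):
    if use_thread_per_filter:
        # fork A: each thread reads a different cmem location (in order)
        body = "    float w = constants[threadIdx.x];"
    else:
        # fork B: every thread reads the same cmem location (broadcast)
        body = "    float w = constants[0];"
    return KERNEL_TEMPLATE.format(body=body)

# Both choices stay available; revisiting the decision is cheap:
src_a = render_kernel(use_thread_per_filter=True)
src_b = render_kernel(use_thread_per_filter=False)
```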
Toy Example: Matmul

http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah
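The linked demo uses Cheetah with PyCUDA; a stdlib-only sketch of the same idea (names hypothetical) shows how tuning parameters become compile-time constants baked into the generated CUDA source, so nvcc can unroll and optimize around them.

```python
from string import Template

# One template, many candidate kernels: the block size is substituted into the
# source text before compilation, so it is a true constant to the compiler.
MATMUL_TEMPLATE = Template("""
#define BLOCK_SIZE $block_size
__global__ void matmul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    /* ... tiled multiply using BLOCK_SIZE as a compile-time constant ... */
}
""")

def render_matmul(block_size):
    return MATMUL_TEMPLATE.substitute(block_size=block_size)

# Candidate sources for the auto-tuner to compile and benchmark:
candidates = {bs: render_matmul(bs) for bs in (8, 16, 32)}
```

With PyCUDA, each rendered source would then be handed to `pycuda.compiler.SourceModule` and timed.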
Summary

 Meta-programming:

 • can assist exploration and manual
   optimization
 • can de-clutter code
 • is easy and flexible with the right tools
   (e.g. Python, Py{CUDA,CL}, Cheetah, decuda)


 ➡ facilitates auto-tuning!
Need a pause?

How to get to the ninja level?

Practice, practice, practice...
Auto-tuning
Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study
in Biologically-Inspired Machine Vision
[GPU Computing Gems]

Pinto N, Cox DD
Auto-tuning


The goal is to empirically optimize execution
time given:


• the environment
 - hardware (GPU, CPU, Memory, Mobo)
 - software (SDK, Compiler suite)


• the data (input dimensions, repetitions, etc.)
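A small sketch (hypothetical names) of what "given the environment and the data" means in practice: every benchmark record should carry a description of both axes, so results remain comparable across machines and runs.

```python
import platform
import sys

# Capture the two axes the slide lists: the environment (hardware/software)
# and the data (input dimensions, repetitions), as one reproducible record.
def describe_run(input_shape, filter_shape, n_repeats):
    return {
        "hardware": {"cpu": platform.processor(), "machine": platform.machine()},
        "software": {"python": sys.version.split()[0], "os": platform.platform()},
        "data": {"input": input_shape, "filters": filter_shape, "repeats": n_repeats},
    }

record = describe_run((256, 256, 8), (64, 9, 9, 8), n_repeats=10)
```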
Basic auto-tuning: pseudo-code (1/3)
                 Filter-bank Convolution / Correlation




                        Scripting, Py{CUDA,CL}




                   NoSQL (CouchDB, MongoDB) ?
Basic auto-tuning: pseudo-code (2/3)




                                       Cheetah,
                                       Jinja, Mako



                                       PyCUDA/CL
Basic auto-tuning: pseudo-code (3/3)




                                   PyCUDA/CL




                                   NoSQL
                                   (CouchDB,
                                   MongoDB)
Optimizing what?
Optimizing strategy


• Like many operations, filter-bank convolution is
  usually “communication bound” on the GPU:
 -   compute is cheap
 -   communication is expensive
• We must take advantage of all types of memory:
 -   explicit: gmem (global), smem (shared), cmem
     (constant), tmem (texture)
 -   implicit: rmem (registers), bmem (bin-code?) *
• Different optimal access patterns
Example: thread gmem output size




                            stupid float4 xyzw trick
Example: multiple smem loads
Example: using texture fetches
Example: register spilling
Example: register pressure (nvcc)
Example: capitalizing on bmem (bin code) ??

                                 multiple versions of
                               the same function with
                                different input offsets

                                input offset in cubin
                                       code?
Results

GPU / SDK     Input        Filter-bank   Meta-prog default (gflops)   Meta-prog auto-tuned (gflops)   Boost
             256x256x8     64x9x9x8      6.710 ± 0.005        36.584 ±   0.023   445.2 %
9600M GT     512x512x4    32x13x13x4     13.606 ± 0.002       35.582 ±   0.003   161.5 %
CUDA3.1     1024x1024x8    16x5x5x8      20.034 ± 0.113       26.084 ±   6.243   30.2 %
            2048x2048x4    4x8x8x4       25.781 ± 0.044       46.945 ±   0.100   82.1 %
             256x256x8     64x9x9x8     104.188 ±   0.051    168.083 ±   0.372   61.3 %
C1060        512x512x4    32x13x13x4    125.739 ±   0.109    234.053 ±   0.266   86.1 %
CUDA2.3     1024x1024x8    16x5x5x8     144.279 ±   0.764    243.697 ±   0.346   68.9 %
            2048x2048x4    4x8x8x4      180.060 ±   0.018    322.328 ±   0.348   79.0 %
             256x256x8     64x9x9x8     123.396 ±   0.016    197.006 ±   0.219   59.7 %
GTX285       512x512x4    32x13x13x4    143.277 ±   0.044    270.206 ±   0.209   88.6 %
CUDA2.3     1024x1024x8    16x5x5x8     148.841 ±   0.465    310.276 ±   0.538   108.5 %
            2048x2048x4    4x8x8x4      205.152 ±   0.015    376.685 ±   0.070   83.6 %
             256x256x8     64x9x9x8     467.631 ± 19.100    471.902 ± 11.419      0.9 %
GTX480       512x512x4    32x13x13x4    834.838 ± 8.275     974.266 ± 3.809      16.7 %
CUDA3.1     1024x1024x8    16x5x5x8     542.808 ± 1.135      614.019 ± 0.904     13.1 %
            2048x2048x4    4x8x8x4      378.165 ± 0.537      806.628 ± 0.168     113.3 %
Analysis
Empirical results...

                             Performance (gflops)

 Q9450 (Matlab/C) [2008]       0.3

 Q9450 (C/SSE) [2008]          9.0

 7900GTX (Cg) [2006]          68.2

 PS3/Cell (C/ASM) [2007]     111.4

 8800GTX (CUDA1.x) [2007]    192.7

 GTX280 (CUDA2.x) [2008]     339.3

 GTX480 (CUDA3.x) [2010]     974.3

                  >1000X speedup is game changing...
Summary



 • Meta-programming makes developing
   high-performing code for GPU easier
 • Fantastic tools exist (e.g. PyCUDA) to help
 • Interesting way to explore/learn about
   GPUs (hw/sw)
 • Coarse auto-tuning yields good results
Future




   • More Fermi optimizations
     (L1 cache, concurrent kernels)


   • OpenCL to optimize across vendors

   • Smarter auto-tuning techniques (ML)
     -   (boosted) decision trees
     -   evolutionary programming strategies
More ?
•   Thu 3/31/11:
    PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU)
•   Tue 3/29/11:
    Algorithm Strategies (W. Hwu, UIUC)
•   Tue 4/5/11:
    Analysis-driven Optimization (C.Wooley, NVIDIA)
•   Thu 4/7/11:
    Irregular Parallelism & Efficient Data Structures (J.Owens, UCDavis)
•   Thu 4/14/11:
    Optimization for Ninjas (D. Merrill, UVirginia)
•   ...
one more thing
           or two...
Life/Code Hacking #2.x
                Speed {listen,read,writ}ing




accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.2b
                                                 Speed writing




accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.2b
                                                 Speed writing




          RSI?


accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.2b
             Speed writing
Life/Code Hacking #2.3
                                                Speed reading




accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.3
                                                        Speed reading

1. Collect many papers, docs, chapters, etc. (100)
2. Skim through them quickly / select (50)
3. Read w/o full understanding / select (25)
4. Read completely w/ full understanding / select (10)
5. Complete mastery + reproduction (5)


        accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.3
                                                                      Speed reading
http://readerssoft.com/speed_reading_obstacles.php




                      accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.3
                                                                      Speed reading
http://readerssoft.com/speed_reading_obstacles.php

                                                     normal reading




                                                        vs.
                                                     speed reading




                      accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.3
                                                Speed reading
         like David Guetta, use one finger !




accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Open CL For Haifa Linux ClubOpen CL For Haifa Linux Club
Open CL For Haifa Linux Club
 
GPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 KeynoteGPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 Keynote
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
E-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUE-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPU
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
Introduction to gpu architecture
Introduction to gpu architectureIntroduction to gpu architecture
Introduction to gpu architecture
 
GPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU ArchitecturesGPUDirect RDMA and Green Multi-GPU Architectures
GPUDirect RDMA and Green Multi-GPU Architectures
 
GPU Programming with Java
GPU Programming with JavaGPU Programming with Java
GPU Programming with Java
 
CS 354 GPU Architecture
CS 354 GPU ArchitectureCS 354 GPU Architecture
CS 354 GPU Architecture
 
Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
 
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel ApplicationsGPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
 

Similar to [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

CUDA DLI Training Courses at GTC 2019
CUDA DLI Training Courses at GTC 2019CUDA DLI Training Courses at GTC 2019
CUDA DLI Training Courses at GTC 2019NVIDIA
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchDirk Petersen
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018NVIDIA
 
Python и программирование GPU (Ивашкевич Глеб)
Python и программирование GPU (Ивашкевич Глеб)Python и программирование GPU (Ивашкевич Глеб)
Python и программирование GPU (Ивашкевич Глеб)IT-Доминанта
 
Euro python2011 High Performance Python
Euro python2011 High Performance PythonEuro python2011 High Performance Python
Euro python2011 High Performance PythonIan Ozsvald
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...AMD Developer Central
 
OpenACC Monthly Highlights- December
OpenACC Monthly Highlights- DecemberOpenACC Monthly Highlights- December
OpenACC Monthly Highlights- DecemberNVIDIA
 
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptxOpenACC and Open Hackathons Monthly Highlights: September 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptxOpenACC
 
Hands On Embedded Linux with BeagleBone Black
Hands On Embedded Linux with BeagleBone BlackHands On Embedded Linux with BeagleBone Black
Hands On Embedded Linux with BeagleBone BlackDaniele Costarella
 
LAS16-108: JerryScript and other scripting languages for IoT
LAS16-108: JerryScript and other scripting languages for IoTLAS16-108: JerryScript and other scripting languages for IoT
LAS16-108: JerryScript and other scripting languages for IoTLinaro
 
GPU Accelerated Deep Learning for CUDNN V2
GPU Accelerated Deep Learning for CUDNN V2GPU Accelerated Deep Learning for CUDNN V2
GPU Accelerated Deep Learning for CUDNN V2NVIDIA
 
Computable content: Notebooks, containers, and data-centric organizational le...
Computable content: Notebooks, containers, and data-centric organizational le...Computable content: Notebooks, containers, and data-centric organizational le...
Computable content: Notebooks, containers, and data-centric organizational le...Domino Data Lab
 
Speeding up Programs with OpenACC in GCC
Speeding up Programs with OpenACC in GCCSpeeding up Programs with OpenACC in GCC
Speeding up Programs with OpenACC in GCCinside-BigData.com
 
Transparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaTransparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaKazuaki Ishizaki
 
Introduction to Python Programming in Civil Engineering
Introduction to Python Programming in Civil EngineeringIntroduction to Python Programming in Civil Engineering
Introduction to Python Programming in Civil EngineeringRushikesh Kolhe
 

Similar to [Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning (20)

CUDA DLI Training Courses at GTC 2019
CUDA DLI Training Courses at GTC 2019CUDA DLI Training Courses at GTC 2019
CUDA DLI Training Courses at GTC 2019
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
 
Python и программирование GPU (Ивашкевич Глеб)
Python и программирование GPU (Ивашкевич Глеб)Python и программирование GPU (Ивашкевич Глеб)
Python и программирование GPU (Ивашкевич Глеб)
 
Os Lamothe
Os LamotheOs Lamothe
Os Lamothe
 
Euro python2011 High Performance Python
Euro python2011 High Performance PythonEuro python2011 High Performance Python
Euro python2011 High Performance Python
 
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
CG-Orientation ppt.pptx
CG-Orientation ppt.pptxCG-Orientation ppt.pptx
CG-Orientation ppt.pptx
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
OpenACC Monthly Highlights- December
OpenACC Monthly Highlights- DecemberOpenACC Monthly Highlights- December
OpenACC Monthly Highlights- December
 
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptxOpenACC and Open Hackathons Monthly Highlights: September 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: September 2022.pptx
 
Hands On Embedded Linux with BeagleBone Black
Hands On Embedded Linux with BeagleBone BlackHands On Embedded Linux with BeagleBone Black
Hands On Embedded Linux with BeagleBone Black
 
LAS16-108: JerryScript and other scripting languages for IoT
LAS16-108: JerryScript and other scripting languages for IoTLAS16-108: JerryScript and other scripting languages for IoT
LAS16-108: JerryScript and other scripting languages for IoT
 
GPU Accelerated Deep Learning for CUDNN V2
GPU Accelerated Deep Learning for CUDNN V2GPU Accelerated Deep Learning for CUDNN V2
GPU Accelerated Deep Learning for CUDNN V2
 
Computable content: Notebooks, containers, and data-centric organizational le...
Computable content: Notebooks, containers, and data-centric organizational le...Computable content: Notebooks, containers, and data-centric organizational le...
Computable content: Notebooks, containers, and data-centric organizational le...
 
2014/07/17 Parallelize computer vision by GPGPU computing
2014/07/17 Parallelize computer vision by GPGPU computing2014/07/17 Parallelize computer vision by GPGPU computing
2014/07/17 Parallelize computer vision by GPGPU computing
 
Speeding up Programs with OpenACC in GCC
Speeding up Programs with OpenACC in GCCSpeeding up Programs with OpenACC in GCC
Speeding up Programs with OpenACC in GCC
 
Transparent GPU Exploitation for Java
Transparent GPU Exploitation for JavaTransparent GPU Exploitation for Java
Transparent GPU Exploitation for Java
 
Introduction to Python Programming in Civil Engineering
Introduction to Python Programming in Civil EngineeringIntroduction to Python Programming in Civil Engineering
Introduction to Python Programming in Civil Engineering
 

More from npinto

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...npinto
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...npinto
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...npinto
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)npinto
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...npinto
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...npinto
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...npinto
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...npinto
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...npinto
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...npinto
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...npinto
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...npinto
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introductionnpinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...npinto
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...npinto
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)npinto
 

More from npinto (20)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
 
[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction[Harvard CS264] 01 - Introduction
[Harvard CS264] 01 - Introduction
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
IAP09 CUDA@MIT 6.963 - Guest Lecture: Out-of-Core Programming with NVIDIA's C...
 
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
IAP09 CUDA@MIT 6.963 - Guest Lecture: CUDA Tricks and High-Performance Comput...
 
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
IAP09 CUDA@MIT 6.963 - Lecture 07: CUDA Advanced #2 (Nicolas Pinto, MIT)
 

Recently uploaded

Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmStan Meyer
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxRosabel UA
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A Beña
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxJanEmmanBrigoli
 
The Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsThe Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsRommel Regala
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataBabyAnnMotar
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxruthvilladarez
 

Recently uploaded (20)

Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptx
 
The Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World PoliticsThe Contemporary World: The Globalization of World Politics
The Contemporary World: The Globalization of World Politics
 
Measures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped dataMeasures of Position DECILES for ungrouped data
Measures of Position DECILES for ungrouped data
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docx
 

[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Auto-tuning

  • 1. Massively Parallel Computing CS 264 / CSCI E-292 Lecture #6: CUDA Ninja Tricks | March 1st, 2011 Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  • 2.
  • 3. Massively Parallel Computing CS 264 / CSCI E-292 Lecture #6: CUDA Ninja Tricks | February 29th, 2011. GPU “Scripting”, Meta-programming, Auto-tuning. Nicolas Pinto (MIT, Harvard) pinto@mit.edu
  • 5. During this course, we’ll try to “ ” and use existing material ;-) (adapted for CS264)
  • 7. Outline 1. Scripting GPUs with PyCUDA 2. Meta-programming and RTCG 3. Case study in brain-inspired AI
  • 9. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood Why do Scripting for GPUs? GPUs are everything that scripting languages are not. Highly parallel Very architecture-sensitive Built for maximum compute/memory throughput → complement each other CPU: largely restricted to control tasks (∼1000/sec) Scripting fast enough Realize a promise: Use Scripting. . . from first prototype to full-scale production code. slide by Andreas Klöckner (NYU) Nicolas Pinto (MIT) and Andreas Klöckner (Brown) PyCuda Tutorial
  • 12. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive Why do Scripting for GPUs? GPUs are everything that scripting languages are not. Highly parallel Very architecture-sensitive Built for maximum FP/memory throughput → complement each other CPU: largely restricted to control tasks (∼1000/sec) Scripting fast enough Python + CUDA = PyCUDA Python + OpenCL = PyOpenCL slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
  • 13. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive How are High-Performance Codes constructed? “Traditional” Construction of High-Performance Codes: C/C++/Fortran Libraries “Alternative” Construction of High-Performance Codes: Scripting for ‘brains’ GPUs for ‘inner loops’ Play to the strengths of each programming environment. slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
  • 14. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive Scripting: Python One example of a scripting language: Python Mature Large and active community Emphasizes readability Written in widely-portable C A ‘multi-paradigm’ language Rich ecosystem of sci-comp related software slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
  • 15. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood Scripting Languages Python: is discoverable and interactive. has comprehensive built-in functionality. manages resources automatically. uses run-time typing. works well for “gluing” lower-level blocks together. o slide by Andreas Klockner (NYU) Nicolas Pinto (MIT) and Andreas Kl¨ckner (Brown) PyCuda Tutorial
  • 16. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood Scripting: Goals Scripting languages aim to reduce the load on the programmer: Reduce required knowledge Encourage experimentation Eliminate sources of error Encourage abstraction wherever possible Value programmer time over computer time Think about the tools you use. Use the right tool for the job. o slide by Andreas Klockner (NYU) Nicolas Pinto (MIT) and Andreas Kl¨ckner (Brown) PyCuda Tutorial
  • 19. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood Scripting: Speed Usual answer to the “Speed Question”: Hybrid (“mixed”) Code. Plays to the strengths of each language. But: Introduces (some) complexity. Observation: GPU code is already hybrid. Consequence: No added complexity through hybrid code. o slide by Andreas Klockner (NYU) Nicolas Pinto (MIT) and Andreas Kl¨ckner (Brown) PyCuda Tutorial
  • 20. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive: Whetting your appetite
    import pycuda.driver as cuda
    import pycuda.autoinit, pycuda.compiler
    import numpy

    a = numpy.random.randn(4,4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    [This is examples/demo.py in the PyCUDA distribution.]
    slide by Andreas Klöckner (NYU)
  • 21–22. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive: Whetting your appetite
    mod = pycuda.compiler.SourceModule("""
    __global__ void twice(float *a)   /* <-- the compute kernel */
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

    func = mod.get_function("twice")
    func(a_gpu, block=(4,4,1))

    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print a_doubled
    print a
    slide by Andreas Klöckner (NYU)
  • 23. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood Whetting your appetite, Part II Did somebody say “Abstraction is good”? o slide by Andreas Klockner (NYU) Nicolas Pinto (MIT) and Andreas Kl¨ckner (Brown) PyCuda Tutorial
  • 24. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood: Whetting your appetite, Part II
    import numpy
    import pycuda.autoinit
    from pycuda import gpuarray

    a_cpu = numpy.random.randn(4,4).astype(numpy.float32)
    b_cpu = numpy.random.randn(4,4).astype(numpy.float32)
    c_cpu = a_cpu * b_cpu

    a_gpu = gpuarray.to_gpu(a_cpu)
    b_gpu = gpuarray.to_gpu(b_cpu)
    c_gpu = (a_gpu * b_gpu).get()

    print c_cpu - c_gpu
    slide by Andreas Klöckner (NYU)
  • 25. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood: Remember me?
    // trivia
    #include <stdio.h>

    #define CUDA_CHK(NAME, ARGS) { \
      cudaError_t cuda_err_code = NAME ARGS; \
      if (cuda_err_code != cudaSuccess) { \
        printf("%s failed with code %d\n", #NAME, cuda_err_code); \
        abort(); \
      } \
    }
    // end

    // kernel
    __global__ void square_array(float *a, float *b, int n)
    {
      int i = (blockIdx.x * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
      if (i < n)
        a[i] = a[i] * b[i];
    }
    // end

    // main1
    int main()
    {
      cudaSetDevice(0); // EDIT ME

      const int n = 4096;

      float *a_host = (float *) malloc(n*sizeof(float));
      float *b_host = (float *) malloc(n*sizeof(float));

      float *a_device, *b_device;
      CUDA_CHK(cudaMalloc, ((void **) &a_device, n*sizeof(float)));
      CUDA_CHK(cudaMalloc, ((void **) &b_device, n*sizeof(float)));
      // end

      // main2
      for (int i = 0; i < n; i++) { a_host[i] = i; b_host[i] = i+1; }

      CUDA_CHK(cudaMemcpy, (a_device, a_host, n*sizeof(float), cudaMemcpyHostToDevice));
      CUDA_CHK(cudaMemcpy, (b_device, b_host, n*sizeof(float), cudaMemcpyHostToDevice));

      dim3 block_dim(16, 16);
      int block_size = block_dim.x*block_dim.y;
      int n_blocks = (n + block_size-1) / block_size;
      square_array<<<n_blocks, block_dim>>>(a_device, b_device, n);
      // end

      // main3
      CUDA_CHK(cudaMemcpy, (a_host, a_device, n*sizeof(float), cudaMemcpyDeviceToHost));

      for (int i = 0; i < n; i++)
        printf("%.0f ", a_host[i]);
      puts("\n");

      free(a_host);
      CUDA_CHK(cudaFree, (a_device));
    }
    // end
    slide by Andreas Klöckner (NYU)
  • 26. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive PyCUDA Philosophy Provide complete access Automatically manage resources Provide abstractions Check for and report errors automatically Full documentation Integrate tightly with numpy slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
  • 27. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood PyCuda: Workflow Edit Cache! Run nvcc .cubin SourceModule("...") Upload to GPU PyCuda Run on GPU o slide by Andreas Klockner (NYU) Nicolas Pinto (MIT) and Andreas Kl¨ckner (Brown) PyCuda Tutorial
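The caching step in this workflow can be sketched in a few lines of plain Python: key the compiled binary by a hash of the kernel source, so editing the source forces a recompile while an unchanged kernel skips nvcc. This is only an illustrative sketch; `cached_compile` and `compile_fn` are hypothetical names, not PyCUDA internals.

```python
import hashlib
import os
import tempfile

def cached_compile(source, compile_fn, cache_dir=None):
    """Compile `source` with `compile_fn`, memoizing the binary on disk.

    The cache key is a hash of the kernel source, so editing the source
    triggers a recompile while repeated runs hit the cache.
    """
    cache_dir = cache_dir or os.path.join(tempfile.gettempdir(), "kernel-cache")
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(source.encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".cubin")
    if os.path.exists(path):
        # cache hit: skip the (slow) compiler invocation entirely
        with open(path, "rb") as f:
            return f.read()
    binary = compile_fn(source)  # e.g. invoke nvcc and read back the .cubin
    with open(path, "wb") as f:
        f.write(binary)
    return binary
```

With PyCUDA, `SourceModule` performs this kind of caching for you; the sketch only shows why the "Cache!" box in the workflow short-circuits the nvcc step.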
  • 28. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood Automatic Cleanup Reachable objects (memory, streams, . . . ) are never destroyed. Once unreachable, released at an unspecified future time. Scarce resources (memory) can be explicitly freed. (obj.free()) Correctly deals with multiple contexts and dependencies. o slide by Andreas Klockner (NYU) Nicolas Pinto (MIT) and Andreas Kl¨ckner (Brown) PyCuda Tutorial
  • 29. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood gpuarray: Simple Linear Algebra pycuda.gpuarray: Meant to look and feel just like numpy. gpuarray.to gpu(numpy array) numpy array = gpuarray.get() No: nd indexing, slicing, etc. (yet!) Yes: +, -, ∗, /, fill, sin, exp, rand, take, . . . Random numbers using pycuda.curandom Mixed types (int32 + float32 = float64) print gpuarray for debugging. Memory behind gpuarray available as .gpudata attribute. Use as kernel arguments, textures, etc. o slide by Andreas Klockner (NYU) Nicolas Pinto (MIT) and Andreas Kl¨ckner (Brown) PyCuda Tutorial
  • 30. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive What’s this “numpy”, anyway? Numpy: package for large, multi-dimensional arrays. Vectors, Matrices, . . . A+B, sin(A), dot(A,B) la.solve(A, b), la.eig(A) cube[:, :, n-k:n+k], cube+5 All much faster than functional equivalents in Python. “Python’s MATLAB”: Basis for SciPy, plotting, . . . slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
  • 31. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood: gpuarray: Elementwise expressions. Avoiding extra store-fetch cycles for elementwise math:
    from pycuda.curandom import rand as curand
    a_gpu = curand((50,))
    b_gpu = curand((50,))

    from pycuda.elementwise import ElementwiseKernel
    lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]")

    c_gpu = gpuarray.empty_like(a_gpu)
    lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

    assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
    slide by Andreas Klöckner (NYU)
  • 32. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive: gpuarray: Reduction made easy. Example: a scalar product calculation
    from pycuda.reduction import ReductionKernel
    dot = ReductionKernel(dtype_out=numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="const float *x, const float *y")

    from pycuda.curandom import rand as curand
    x = curand((1000*1000), dtype=numpy.float32)
    y = curand((1000*1000), dtype=numpy.float32)

    x_dot_y = dot(x, y).get()
    x_dot_y_cpu = numpy.dot(x.get(), y.get())
    slide by Andreas Klöckner (NYU)
  • 33. GPU Scripting PyOpenCL News RTCG Showcase Exciting Developments in GPU-Python Step 3: Usage Complex numbers . . . in GPUArray . . . in user code (pycuda-complex.hpp) If/then/else for GPUArrays Support for custom device pointers Smarter device picking/context creation PyFFT: FFT for PyOpenCL and PyCUDA scikits.cuda: CUFFT, CUBLAS, CULA slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
  • 34. GPU Scripting PyOpenCL News RTCG Showcase Exciting Developments in GPU-Python Sparse Matrix-Vector on the GPU New feature in 0.94: Sparse matrix-vector multiplication Uses “packeted format” by Garland and Bell (also includes parts of their code) Integrates with scipy.sparse. Conjugate-gradients solver included Deferred convergence checking slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
  • 35. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood: Kernel Invocation: Automatic Copies
    mod = pycuda.driver.SourceModule(
        "__global__ void my_func(float *out, float *in){...}")
    my_func = mod.get_function("my_func")

    src = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.empty_like(src)

    my_func(cuda.Out(dest), cuda.In(src), block=(400,1,1))
    “InOut” exists, too. Only for immediate invocation style.
    slide by Andreas Klöckner (NYU)
  • 36. GPU Scripting PyOpenCL News RTCG Showcase Exciting Developments in GPU-Python Step 4: Debugging New in 0.94.1: Support for CUDA gdb: $ cuda-gdb --args python -m pycuda.debug demo.py Automatically: Sets Compiler flags Retains source code Disables compiler cache slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
  • 37. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood CUDA APIs C/C++ Python CUDA has two Programming Interfaces: Runtime API PyCuda “Runtime” high-level (libcudart.so, in the Driver API “toolkit”) “Driver” low-level Kernel Driver (libcuda.so, comes with GPU driver) Hardware (mutually exclusive) o slide by Andreas Klockner (NYU) Nicolas Pinto (MIT) and Andreas Kl¨ckner (Brown) PyCuda Tutorial
  • 38. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood Runtime vs. Driver API Runtime ↔ Driver differences: Explicit initialization. Code objects (“Modules”) become programming language objects. Texture handling requires slightly more work. Only needs nvcc for compiling GPU code. Driver API: Conceptually cleaner Less sugar-coating (provide in Python) Not very different otherwise o slide by Andreas Klockner (NYU) Nicolas Pinto (MIT) and Andreas Kl¨ckner (Brown) PyCuda Tutorial
  • 39. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood: PyCuda: API Tracing. With ./configure --cuda-trace=1:
    import pycuda.driver as cuda
    import pycuda.autoinit
    import numpy

    a = numpy.random.randn(4,4).astype(numpy.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(a_gpu, a)

    mod = cuda.SourceModule("""
    __global__ void doublify(float *a)
    {
      int idx = threadIdx.x + threadIdx.y*4;
      a[idx] *= 2;
    }
    """)

    func = mod.get_function("doublify")
    func(a_gpu, block=(4,4,1))

    a_doubled = numpy.empty_like(a)
    cuda.memcpy_dtoh(a_doubled, a_gpu)
    print a_doubled
    print a

    Trace output: cuInit, cuDeviceGetCount, cuDeviceGet, cuCtxCreate, cuMemAlloc, cuMemcpyHtoD, cuCtxGetDevice, cuDeviceComputeCapability, cuModuleLoadData, cuModuleGetFunction, cuFuncSetBlockShape, cuParamSetv, cuParamSetSize, cuLaunchGrid, cuMemcpyDtoH, cuCtxPopCurrent, cuCtxPushCurrent, cuMemFree, cuCtxPopCurrent, cuCtxPushCurrent, cuModuleUnload, cuCtxPopCurrent, cuCtxDestroy
    slide by Andreas Klöckner (NYU)
  • 40. GPU Scripting PyOpenCL News RTCG Showcase Overview Being Productive PyCUDA: Vital Information http://mathema.tician.de/ software/pycuda Complete documentation MIT License (no warranty, free for all use) Requires: numpy, Python 2.4+ (Win/OS X/Linux) Support via mailing list slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
  • 42.
  • 43. Outline 1. Scripting GPUs with PyCUDA 2. Meta-programming and RTCG 3. Case study in brain-inspired AI
  • 44. ... too much? bank conflicts, coalescing, mixed precision, caching, partition camping, clamping, broadcasting, zero-copy, streams
  • 45. can’t decide?
  • 46. GPU Scripting PyOpenCL News RTCG Showcase Writing Code when the most Knowledge is Available GPU Programming: Implementation Choices Many difficult questions Insufficient heuristics Answers are hardware-specific and have no lasting value slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
  • 47. GPU Scripting PyOpenCL News RTCG Showcase Writing Code when the most Knowledge is Available GPU Programming: Implementation Choices Many difficult questions Insufficient heuristics Answers are hardware-specific and have no lasting value Proposed Solution: Tune automatically for hardware at run time, cache tuning results. Decrease reliance on knowledge of hardware internals Shift emphasis from tuning results to tuning ideas slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
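The "cache tuning results" half of this proposal can be sketched as a small persistent map keyed by device and kernel name, so the brute-force search runs once per machine rather than once per run. The `TuningCache` class below is a hypothetical illustration (plain JSON on disk), not PyCUDA's actual mechanism.

```python
import json
import os

class TuningCache:
    """Persist best-found launch parameters per (device, kernel) so
    empirical auto-tuning runs once per machine, not once per run."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def key(self, device_name, kernel_name):
        # answers are hardware-specific, so the device name is part of the key
        return device_name + "/" + kernel_name

    def get(self, device_name, kernel_name):
        return self.data.get(self.key(device_name, kernel_name))

    def put(self, device_name, kernel_name, params):
        self.data[self.key(device_name, kernel_name)] = params
        with open(self.path, "w") as f:
            json.dump(self.data, f)
```

On a cache miss you run the tuning search and `put` the winner; on the next run, `get` returns the stored parameters immediately.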
  • 48–56. GPU Scripting PyOpenCL News RTCG Showcase: Writing Code when the most Knowledge is Available. Metaprogramming Idea: In GPU scripting, GPU code does not need to be a compile-time constant. (Key: Code is data; it wants to be reasoned about at run time.) Pipeline: Human → Python Code → GPU Code → GPU Compiler → GPU Binary → GPU → Result. Good for Run-Time Code Generation, with PyCUDA and PyOpenCL. slide by Andreas Klöckner (NYU)
  • 57. GPU Scripting PyOpenCL News RTCG Showcase Writing Code when the most Knowledge is Available Machine-generated Code Why machine-generate code? Automated Tuning (cf. ATLAS, FFTW) Data types Specialize code for given problem Constants faster than variables (→ register pressure) Loop Unrolling slide by Andreas Klockner (NYU) Simpler GPU Programming with Python Andreas Kl¨ckner o PyCUDA: Even
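All three points above (specialize for the problem, bake constants into the source, unroll loops) come down to emitting source text at run time. A minimal sketch using plain Python string formatting; the function and kernel names here are made up for illustration, and the string it returns would be handed to something like pycuda.compiler.SourceModule:

```python
def make_axpy_source(n, a):
    """Generate CUDA C for z[i] = a*x[i] + y[i], with the scalar `a`
    and the problem size `n` baked in as compile-time constants, and
    each thread's 4-element loop fully unrolled."""
    body = "\n".join(
        "        z[base + %d] = %sf * x[base + %d] + y[base + %d];"
        % (k, float(a), k, k)
        for k in range(4)  # unroll factor: 4 elements per thread
    )
    return (
        "__global__ void axpy(const float *x, const float *y, float *z)\n"
        "{\n"
        "    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);\n"
        "    if (base + 3 < %d) {\n%s\n    }\n"
        "}\n" % (n, body)
    )
```

A different `a` or `n` yields a freshly compiled specialization: the scalar becomes a literal (no register spent on it) and the bounds check compares against a constant.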
  • 58. Intro GPUs Scripting Hands-on Intro Example Working with PyCuda A peek under the hood PyCuda: Support for Metaprogramming Access properties of compiled code: func.{num regs,shared size bytes,local size bytes} Exact GPU timing via events Can calculate hardware-dependent MP occupancy codepy (by Andreas): Build C syntax trees from Python Generates readable, indented C Or use a templating engine (many available, e.g. Cheetah) o slide by Andreas Klockner (NYU) Nicolas Pinto (MIT) and Andreas Kl¨ckner (Brown) PyCuda Tutorial
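The multiprocessor-occupancy calculation mentioned above can be sketched in pure Python; in PyCUDA the register and shared-memory inputs would come from the real func.num_regs and func.shared_size_bytes attributes. The resource limits below are illustrative round numbers, not values queried from a device:

```python
def mp_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                 max_threads=1024, max_regs=32768, max_smem=49152,
                 max_blocks=8):
    """Rough multiprocessor occupancy: resident threads over the
    hardware maximum, limited by whichever resource runs out first
    (threads, registers, shared memory, or the resident-block cap).
    The default limits are illustrative, not device-queried."""
    by_threads = max_threads // threads_per_block
    by_regs = max_regs // max(1, regs_per_thread * threads_per_block)
    by_smem = max_smem // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_threads, by_regs, by_smem, max_blocks)
    return blocks * threads_per_block / float(max_threads)
```

Feeding a candidate kernel's measured register count into a function like this lets a tuner reject configurations that would tank occupancy before ever timing them.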
  • 59. Outline 1. Scripting GPUs with PyCUDA 2. Meta-programming and RTCG 3. Case study in brain-inspired AI (vision)
  • 61. The Problem: Visual Object Recognition fast accurate tolerant to variations effortless critical to survival
  • 62. The Approach Reverse and Forward Engineering the Brain
  • 63. The Approach Reverse and Forward Engineering the Brain REVERSE FORWARD Study Build Natural System Artificial System
  • 64. Why is modeling challenging? The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run Neural data only provides weak constraints ➡ Lots of parameters – hard to explore Advice from Dave Cox: “Don’t run anything that takes longer than a week to complete, because it will just crash halfway through anyways (or you’ll discover a bug) and you’ll never finish your Ph.D.”
  • 65. Why is modeling challenging? The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run Neural data only provides weak constraints ➡ Lots of parameters – hard to explore
  • 66. Visual Cortex: brain = 20 petaflops!
  • 67. GPUs (since 2006) 7800 GTX Monster16GPU Tesla Cluster (2006) (2008) (2009) OpenGL/Cg CUDA CUDA/OpenCL C++/Python Python Python
  • 68. Build your own!
  • 69. Cell Broadband Engine (since 2007) Teraflop Playstation3 clusters: DiCarlo Lab / MIT Cox Lab / Harvard
  • 70. A Match Made in Heaven Brains are parallel, GPUs are parallel ≈ Multiple scales of parallelism: “Embarrasingly” parallel: video frames, regions Fine-grained: independent “neurons,” operating on overlapping inputs
  • 71. A Match Made in Heaven Images In, Images Out ≈ Image processing particularly well-suited Excellent Arithmetic Intensity: very natural to load image patches into shared memory Data: 2D / 3D locality
  • 72. Why is modeling challenging? The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run Neural data only provides weak constraints ➡ Lots of parameters – hard to explore
  • 74. LeCun et al. (1989)
  • 76. Serre & Poggio (2007)
  • 77. Read-out L3 thresh/sat norm strength normalization Learning neighborhood Rate Trace “Temp. Adv.” “Auto-reset” ... number of filters L2 thresh/sat norm strength Learning normalization neighborhood Rate kernel Trace size “Temp. Adv.” “Auto-reset” ... n. of filters L1 thresh/sat norm strength Learning Rate normalization Trace neighborhood “Temp. Adv.” “Auto-reset” kernel ... size number of filters input kernel size
  • 78. neighborhood Rate Trace “Temp. Adv.” “Auto-reset” ... number of filters L2 thresh/sat norm strength Learning normalization neighborhood Rate kernel Trace size “Temp. Adv.” “Auto-reset” ... n. of filters L1 thresh/sat norm strength Learning Rate normalization Trace neighborhood “Temp. Adv.” “Auto-reset” kernel ... size
  • 79. Two conflicting requirements: The brain is a massively parallel computer ➡ Big models are paralyzingly slow to run (need FAST). Neural data only provides weak constraints ➡ Lots of parameters, hard to explore (need FLEXIBLE). How to optimize?
  • 81. 3D Filterbank Convolutions!
  • 82. Fast vs Flexible: what can you do? - Make your code accessible - No focus on raw performance Examples: MATLAB/CUDA by Jim Mutch (2010) by John Moore (1995)
  • 83. Fast vs Flexible: what can you do? - Use standard libraries (e.g. CUBLAS, CUFFT, Jacket) - But: “remap” problem to fit? - Memory issues (not always optimal)
  • 84. Fast vs Flexible: what can you do? - Fully optimized, by hand - But for only a few input configurations...
  • 85. Fast vs Flexible: what can you do? - Focus on flexibility/accessibility first - But add strong foundations for raw performance from the beginning Example: Python/C/CUDA (OpenCL*) http://deeplearning.net by James Bergstra & Yoshua Bengio (2010)
  • 87. Meta-programming and Auto-tuning
  • 88. What?
  • 89. Meta-programming ! Leave the grunt-programming to the computer (i.e. auto-tuning like ATLAS or FFTW) • Dynamically compile specialized versions of the same kernel for different conditions • Empirical run-time tuning • For free: smooth syntactic ugliness: unroll loops, index un-indexable registers, etc.
  • 90. Meta-programming ! “Instrument” your solutions: • Block size • Work size • Loop unrolling • Pre-fetching • Spilling • etc.
  • 91. Meta-programming ! Let the computer generate and find the optimal code: • brute-force search with a global objective • machine-learning approach with local objectives and hidden variables (advanced) • e.g. PyCuda makes this easy:
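The brute-force search with a global objective described above can be sketched independently of the GPU parts. With PyCUDA, `bench(params)` would render the kernel template for `params`, compile it with SourceModule, and time a launch with cuda.Event pairs; in this hedged sketch it is just a callable, and `candidates`/`autotune` are hypothetical helper names:

```python
import itertools

def candidates(**space):
    """Cartesian product of a parameter space, e.g.
    candidates(block_w=[64, 128, 256], unroll=[1, 2, 4])."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def autotune(space, bench):
    """Brute-force empirical tuning: benchmark every candidate
    configuration and keep the fastest. `bench(params)` must return
    a measured runtime in seconds."""
    best_params, best_time = None, float("inf")
    for params in candidates(**space):
        t = bench(params)
        if t < best_time:
            best_params, best_time = params, t
    return best_params, best_time
```

The global objective is simply "fastest measured launch"; smarter searches (the machine-learning approach the slide mentions) would replace the exhaustive loop with a model over the same parameter space.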
  • 92. Basic GPU Meta-programming System: A Case Study. “GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision” [GPU Computing Gems], Pinto N, Cox DD
  • 93. (Cheetah template for the convolution kernel)
    texture<float4, 1, cudaReadModeElementType> tex_float4;
    __constant__ float constant[$FILTER_D][$FILTER_W][$N_FILTERS];

    #define IMUL(a, b) __mul24(a, b)
    extern "C" {

    #for j in xrange($FILTER_H)
    __global__ void convolve_beta_j${j}(float4 *input, float4 *output)
    {
      #set INPUT_BLOCK_W = $BLOCK_W+$FILTER_W-1
      __shared__ float shared_in[$INPUT_BLOCK_W][4+1];

      // -- input/output offsets
      const uint in_idx = (blockIdx.y+$j)*INPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
      const uint out_idx = blockIdx.y*OUTPUT_W + blockIdx.x*blockDim.x + threadIdx.x;
      float4 input_v4;

      // -- load input to shared memory
      #for i in xrange($LOAD_ITERATIONS)
      #if $i==($LOAD_ITERATIONS-1)
      if((threadIdx.x+$BLOCK_W*$i)<$INPUT_BLOCK_W)
      #end if
      {
        input_v4 = tex1Dfetch(tex_float4, in_idx+$BLOCK_W*$i);
        shared_in[threadIdx.x+$BLOCK_W*$i][0] = input_v4.x;
        shared_in[threadIdx.x+$BLOCK_W*$i][1] = input_v4.y;
        shared_in[threadIdx.x+$BLOCK_W*$i][2] = input_v4.z;
  • 94. conv_kernel_template.cu vs. conv_kernel_4x4x4.cu, shown side by side: the Cheetah template and one generated instantiation, whose fully unrolled dot products accumulate into sum0..sum3 from shared_in and constant[d][w][f]. [Two-column code listing, interleaved beyond recovery in this transcript.]
  • 95. Generated source size for two instantiations of the same template: conv_kernel_4x4x4.cu is 20 kB; conv_kernel_8x8x4.cu is 64 kB.
  • 98. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • variable-length argument lists
  • 99. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • syntax-level code control (e.g. conditionals)
  • 100. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • loop unrolling (possibly fine-controlled)
  • 101. Smooth syntactic ugliness. Manipulations that are not easily accessible in CUDA C code: fine-controlled loop unrolling
    v = shared_in[threadIdx.x+0][0];
    w = constant[0][0][0]; sum0 += v*w;
    w = constant[0][0][1]; sum1 += v*w;
    w = constant[0][0][2]; sum2 += v*w;
    w = constant[0][0][3]; sum3 += v*w;

    v = shared_in[threadIdx.x+1][0];
    w = constant[0][1][0]; sum0 += v*w;
    w = constant[0][1][1]; sum1 += v*w;
    w = constant[0][1][2]; sum2 += v*w;
    w = constant[0][1][3]; sum3 += v*w;

    v = shared_in[threadIdx.x+2][0];
    w = constant[0][2][0]; sum0 += v*w;
    w = constant[0][2][1]; sum1 += v*w;
    w = constant[0][2][2]; sum2 += v*w;
    w = constant[0][2][3]; sum3 += v*w;

    v = shared_in[threadIdx.x+3][0];
    w = constant[0][3][0]; sum0 += v*w;
    w = constant[0][3][1]; sum1 += v*w;
    w = constant[0][3][2]; sum2 += v*w;
    w = constant[0][3][3]; sum3 += v*w;

    v = shared_in[threadIdx.x+0][1];
    w = constant[1][0][0]; sum0 += v*w;
    w = constant[1][0][1]; sum1 += v*w;
    w = constant[1][0][2]; sum2 += v*w;
    w = constant[1][0][3]; sum3 += v*w;
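The unrolled block above is precisely the kind of text a short Python loop can emit. A sketch with plain string building (the deck itself uses a Cheetah template; `unrolled_dot` is a hypothetical helper):

```python
def unrolled_dot(filter_w, n_filters, depth_slice=0):
    """Emit the fine-controlled unrolled inner product shown above:
    for each input column, multiply one shared-memory value against
    every filter's constant and accumulate into per-filter sums."""
    lines = []
    for x in range(filter_w):
        lines.append("v = shared_in[threadIdx.x+%d][%d];" % (x, depth_slice))
        for f in range(n_filters):
            # one accumulator register per filter: sum0, sum1, ...
            lines.append("w = constant[%d][%d][%d]; sum%d += v*w;"
                         % (depth_slice, x, f, f))
    return "\n".join(lines)
```

Because the indices are literals, the "un-indexable registers" sum0..sum3 get addressed without any array, which is the whole point of generating rather than looping.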
  • 102. How about #pragma unroll ? (why don’t you trust the compiler?)
• 103. We are not alone... “Don't trust compilers”: Using GPUs for Signal Correlation (Daniel A. Mitchell; credits include Michael Clark, Paul La Plante, Lincoln Greenhill) — in the Murchison Widefield Array real-time pipeline, compare these “identical” fragments:

    a += b*c + d*c + e*f + g*h;        vs.        a += b*c; a += d*c; a += e*f; a += g*h;

one form compiles to ~20 GFLOPS, the other to ~770 GFLOPS. [The slide reproduces an MWA poster: Figure 3 shows the J2107-2526 field before and after RFI blanking and peeling; references include Rogers et al. (EDGES Memo 058, 2010), Mitchell et al. (IEEE J. Sel. Topics Signal Process. 2(5), 707–717, 2008), Lonsdale et al. (Proc. IEEE 97(8), 1497–1506, 2009), Ord et al. (ASP Conference Series, IICS 2011), and Hamaker et al. (Astron. Astrophys. Suppl. Ser. 117, 137–147, 1996).]
  • 104. Smooth syntactic ugliness Manipulations that are not easily accessible in CUDA C code: • index un-indexable resources (e.g. regs)
  • 105. Explore design decision space more freely
• 106. Basic GPU Meta-programming System — a case study: “GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision” [GPU Computing Gems], Pinto N, Cox DD
  • 107. Exploring design decision space more freely Meta-programming: • enables efficient learning of the GPU hardware/software • allows full exploitation of the GPU architecture
• 108. conv_kernel_beta_template.cu — the same template rendered two ways, disassembled with decuda (by Wladimir J. van der Laan). Version A keeps moving constants into a register before each multiply-add:

    mad.rn.f32 $r4, s[$ofs3+0x0000], $r4, $r1
    mov.b32 $r1, c0[$ofs2+0x0008]
    mad.rn.f32 $r4, s[$ofs3+0x0008], $r1, $r4
    mov.b32 $r1, c0[$ofs2+0x000c]
    mad.rn.f32 $r4, s[$ofs3+0x000c], $r1, $r4
    mov.b32 $r1, c0[$ofs2+0x0010]
    mad.rn.f32 $r4, s[$ofs3+0x0010], $r1, $r4
    ...

Version B folds the constant-memory operands directly into the multiply-adds:

    mad.rn.f32 $r1, s[$ofs1+0x007c], c0[$ofs1+0x0078], $r1
    mad.rn.f32 $r1, s[$ofs2+0x0000], c0[$ofs2+0x007c], $r1
    mad.rn.f32 $r1, s[$ofs2+0x0008], c0[$ofs2+0x0080], $r1
    mad.rn.f32 $r1, s[$ofs2+0x000c], c0[$ofs2+0x0084], $r1
    mad.rn.f32 $r1, s[$ofs2+0x0010], c0[$ofs2+0x0088], $r1
    ...

One is 2x faster... why?
  • 109. Exploring design decision space more freely
  • 110. Exploring design decision space more freely When USE_THREAD_PER_FILTER is True • each thread will access different cmem locations (in order) using the decuda disassembler by Wladimir J. van der Laan (Python-based)
  • 111. Exploring design decision space more freely When USE_THREAD_PER_FILTER is False • each thread will access the same cmem locations (broadcast) using the decuda disassembler by Wladimir J. van der Laan (Python-based)
• 112. Exploring design decision space more freely: more registers vs. thread-dependent data movement — 2x faster... why?
• 113. Strategy • intermediate design decisions can be made explicit • multiple “forks” in the path can be kept in place • frees up the developer to revisit past choices (without incurring a combinatorial explosion of separate pieces of code) • retesting sets of assumptions can be done frequently and programmatically from the “outer” framework of code
• 114. Toy Example: Matmul — http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah
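In the spirit of the DemoMetaMatrixmulCheetah example linked above, the essence is that tile size becomes a generation-time parameter rather than a runtime one. A minimal sketch using Python's built-in string.Template in place of Cheetah; the kernel body is abbreviated and illustrative:

```python
from string import Template

# Sketch: the kernel source is a template; each candidate tile size yields
# a distinct, fully specialized source string that PyCUDA's SourceModule
# could then compile and benchmark. Body abbreviated for illustration.
kernel_tpl = Template("""
#define TILE $tile
__global__ void matmul(float *A, float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    /* ... tiled multiply-accumulate as in the wiki example ... */
}
""")

def render(tile):
    return kernel_tpl.substitute(tile=tile)

# one specialized source per candidate configuration
sources = {tile: render(tile) for tile in (8, 16, 32)}
```

Because TILE is a compile-time constant in each rendered source, the compiler can size shared memory and unroll loops for each candidate, which a single runtime-parameterized kernel cannot offer.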
  • 115. Summary Meta-programming: • can assist exploration and manual optimization • can de-clutter code • is easy and flexible with the right tools (e.g. Python, Py{CUDA,CL}, Cheetah, decuda) ➡ facilitates auto-tuning!
• 117. How to get to the ninja level?
• 118. Practice, practice, practice...
• 120. Basic GPU Meta-programming System — a case study: “GPU Meta-Programming: A Case Study in Biologically-Inspired Machine Vision” [GPU Computing Gems], Pinto N, Cox DD
  • 121. Auto-tuning The goal is to empirically optimize execution time given: • the environment - hardware (GPU, CPU, Memory, Mobo) - software (SDK, Compiler suite) • the data (input dimensions, repetitions, etc.)
  • 122. Basic auto-tuning: pseudo-code (1/3) Filter-bank Convolution / Correlation Scripting, Py{CUDA,CL} NoSQL (CouchDB, MongoDB) ?
  • 123. Basic auto-tuning: pseudo-code (2/3) Cheetah, Jinja, Mako PyCUDA/CL
  • 124. Basic auto-tuning: pseudo-code (3/3) PyCUDA/CL NoSQL (CouchDB, MongoDB)
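The three pseudo-code stages sketched in the slides above (enumerate candidate parameters, render/compile/time each one, persist every measurement and keep the fastest) fit in a few lines of plain Python. Everything here is illustrative: fake_time_kernel is a stand-in cost model for an actual PyCUDA-compiled kernel run, and JSON lines stand in for a CouchDB/MongoDB store.

```python
import itertools
import json

# Stand-in for compiling a rendered kernel with PyCUDA and timing it;
# a real tuner would launch the kernel and measure with CUDA events.
def fake_time_kernel(block_w, unroll):
    return 1.0 / (block_w * unroll) + 0.001 * unroll  # toy cost model

def autotune():
    records = []
    # 1) enumerate the candidate configuration space
    for block_w, unroll in itertools.product((64, 128, 256), (1, 2, 4)):
        # 2) "compile" and time this candidate
        t = fake_time_kernel(block_w, unroll)
        records.append({"block_w": block_w, "unroll": unroll, "time": t})
    # 3) persist every measurement (here: JSON lines instead of a NoSQL DB)
    log = "\n".join(json.dumps(r) for r in records)
    best = min(records, key=lambda r: r["time"])
    return best, log

best, log = autotune()
```

Keeping every record, not just the winner, is what makes the later "smarter auto-tuning" ideas (decision trees, etc.) possible: the stored measurements become training data.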
• 126. Optimizing strategy • Like many operations, filter-bank convolution is usually “communication bound” on the GPU: - compute is cheap - communication is expensive • We must take advantage of all types of memory: - explicit: gmem (global), smem (shared), cmem (constant), tmem (texture) - implicit: rmem (registers), bmem (bin-code?) • Different memory types have different optimal access patterns
• 127. Example: gmem — per-thread output size and the “stupid float4 xyzw trick”
• 132. Example: capitalizing on bmem (bin code)?? — generate multiple versions of the same function, each with a different input offset baked into the cubin code?
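One way to read the "bmem" trick above: instead of passing the input offset as a kernel argument, emit one specialized copy of the kernel per offset, so the constant lands in the binary code itself. A sketch of the generation side (the kernel body, names, and offsets are illustrative, not the lecture's actual values):

```python
# Sketch: emit several copies of the same kernel, each with a different
# input offset baked in at generation time instead of passed at runtime.
def specialized_kernels(offsets):
    srcs = {}
    for ofs in offsets:
        name = "convolve_ofs%d" % ofs
        srcs[name] = (
            "__global__ void %s(float4 *in, float4 *out) {\n"
            "    const unsigned idx = blockIdx.x*blockDim.x + threadIdx.x + %d;\n"
            "    /* ... body identical across all versions ... */\n"
            "}\n" % (name, ofs)
        )
    return srcs

srcs = specialized_kernels((0, 1, 2, 3))
```

The host side then picks the right specialized function at launch time; the cost is larger binary code (bmem), traded for one less register and one less argument per launch.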
• 134. Results:

    GPU / SDK           Input         Filter-bank   Meta-prog default    Meta-prog auto-tuned   Boost
                                                    (gflops)             (gflops)
    9600M GT, CUDA3.1   256x256x8     64x9x9x8        6.710 ±  0.005      36.584 ±  0.023      445.2 %
                        512x512x4     32x13x13x4     13.606 ±  0.002      35.582 ±  0.003      161.5 %
                        1024x1024x8   16x5x5x8       20.034 ±  0.113      26.084 ±  6.243       30.2 %
                        2048x2048x4   4x8x8x4        25.781 ±  0.044      46.945 ±  0.100       82.1 %
    C1060, CUDA2.3      256x256x8     64x9x9x8      104.188 ±  0.051     168.083 ±  0.372       61.3 %
                        512x512x4     32x13x13x4    125.739 ±  0.109     234.053 ±  0.266       86.1 %
                        1024x1024x8   16x5x5x8      144.279 ±  0.764     243.697 ±  0.346       68.9 %
                        2048x2048x4   4x8x8x4       180.060 ±  0.018     322.328 ±  0.348       79.0 %
    GTX285, CUDA2.3     256x256x8     64x9x9x8      123.396 ±  0.016     197.006 ±  0.219       59.7 %
                        512x512x4     32x13x13x4    143.277 ±  0.044     270.206 ±  0.209       88.6 %
                        1024x1024x8   16x5x5x8      148.841 ±  0.465     310.276 ±  0.538      108.5 %
                        2048x2048x4   4x8x8x4       205.152 ±  0.015     376.685 ±  0.070       83.6 %
    GTX480, CUDA3.1     256x256x8     64x9x9x8      467.631 ± 19.100     471.902 ± 11.419        0.9 %
                        512x512x4     32x13x13x4    834.838 ±  8.275     974.266 ±  3.809       16.7 %
                        1024x1024x8   16x5x5x8      542.808 ±  1.135     614.019 ±  0.904       13.1 %
                        2048x2048x4   4x8x8x4       378.165 ±  0.537     806.628 ±  0.168      113.3 %
• 137. Empirical results... Performance (gflops):

    Q9450 (Matlab/C)    [2008]     0.3
    Q9450 (C/SSE)       [2008]     9.0
    7900GTX (Cg)        [2006]    68.2
    PS3/Cell (C/ASM)    [2007]   111.4
    8800GTX (CUDA1.x)   [2007]   192.7
    GTX280 (CUDA2.x)    [2008]   339.3
    GTX480 (CUDA3.x)    [2010]   974.3

    >1000X speedup is game changing...
• 139. Summary • Meta-programming makes developing high-performing code for GPUs easier • Fantastic tools exist (e.g. PyCUDA) to help • Interesting way to explore/learn about GPUs (hw/sw) • Coarse auto-tuning yields good results
  • 140. Future • More fermi optimizations (L1 cache, concurrent kernels) • OpenCL to optimize across vendors • Smarter auto-tuning techniques (ML) - (boosted) decision trees - evolutionary programming strategies
• 141. More? • Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC) • Thu 3/31/11: PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU) • Tue 4/5/11: Analysis-driven Optimization (C. Wooley, NVIDIA) • Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UC Davis) • Thu 4/14/11: Optimization for Ninjas (D. Merill, UVirg) • ...
  • 142. one more thing or two...
  • 143. Life/Code Hacking #2.x Speed {listen,read,writ}ing accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 144. Life/Code Hacking #2.2b Speed writing accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
• 145. Life/Code Hacking #2.2b Speed writing — RSI? accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
• 146. Life/Code Hacking #2.2b Speed writing — RSI? accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 147. Life/Code Hacking #2.2b Speed writing
  • 148. Life/Code Hacking #2.3 Speed reading accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 149. Life/Code Hacking #2.3 Speed reading 1. Collect many papers, docs, chapters, etc. (100) 2. Skim through them quickly / select (50) 3. Read w/o full understanding / select (25) 4. Read completely w/ full understanding / select (10) 5. Complete mastery + reproduction (5) accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 150. Life/Code Hacking #2.3 Speed reading http://readerssoft.com/speed_reading_obstacles.php accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 151. Life/Code Hacking #2.3 Speed reading http://readerssoft.com/speed_reading_obstacles.php normal reading vs. speed reading accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 152. Life/Code Hacking #2.3 Speed reading like David Guetta, use one finger ! accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
  • 153. CO ME