[Harvard CS264] 05 - Advanced-level CUDA Programming

Massively Parallel Computing
CS 264 / CSCI E-292
Lecture #5: Advanced CUDA | February 22nd, 2011

Nicolas Pinto (MIT, Harvard)
pinto@mit.edu

Administrivia
• HW2: out, due Mon 3/14/11 (not Fri 3/11/11)
• Projects: think about it, consult the staff (*),
proposals due ~ Fri 3/25/11
• Guest lectures:
• schedule coming soon
• on Fridays 7.35-9.35pm (March, April) ?

During this course, we’ll try to “ ” and use existing material ;-)
(adapted for CS264)

Outline

1. Hardware Review
2. Memory/Communication Optimizations
3. Threading/Execution Optimizations

10-Series Architecture

240 thread processors execute kernel threads
30 multiprocessors, each contains
8 thread processors
One double-precision unit
Shared memory enables thread cooperation

[Figure: one multiprocessor, containing thread processors, a double-precision unit and shared memory]

© NVIDIA Corporation 2008

Threading Hierarchy
Execution Model (Software -> Hardware)

Thread -> Thread Processor
Threads are executed by thread processors

Thread Block -> Multiprocessor
Thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor - limited by multiprocessor resources (shared memory and register file)

Grid -> Device
A kernel is launched as a grid of thread blocks
Only one kernel can execute on a device at one time

Warps and Half Warps

A thread block consists of 32-thread warps
A warp is executed physically in parallel (SIMD) on a multiprocessor
A half-warp of 16 threads can coordinate global memory accesses into a single transaction

[Figure: a thread block split into warps of 32 threads on a multiprocessor; each warp split into two half-warps of 16 threads accessing global and local memory in device DRAM]


Memory Architecture

[Figure: the host (CPU, chipset, DRAM) connected to the device. The GPU holds several multiprocessors, each with registers and shared memory, plus constant and texture caches. Device DRAM holds local, global, constant and texture memory]


Kernel Memory Access

Per-thread
Registers: on-chip
Local memory: off-chip, uncached

Per-block
Shared memory:
• On-chip, small
• Fast

Per-device
Global memory:
• Off-chip, large
• Uncached
• Persistent across kernel launches
• Kernel I/O

[Figure: timeline with Kernel 0 and Kernel 1 launched in sequence, both using the same global memory]

Global Memory

• Different types of “global memory”
• Linear Memory
• Texture Memory
• Constant Memory

Memory Architecture

Memory     Location   Cached   Access   Scope                    Lifetime
Register   On-chip    N/A      R/W      One thread               Thread
Local      Off-chip   No       R/W      One thread               Thread
Shared     On-chip    N/A      R/W      All threads in a block   Block
Global     Off-chip   No       R/W      All threads + host       Application
Constant   Off-chip   Yes      R        All threads + host       Application
Texture    Off-chip   Yes      R        All threads + host       Application


2. Memory/Communication
Optimizations

Review

2.1 Host/Device Transfer
Optimizations

Review
PC Architecture

[Figure: PC architecture diagram. Labelled components: CPU, Front Side Bus, Northbridge, Memory Bus, DRAM, PCI-Express Bus, Graphics Card, Southbridge, PCI Bus, SATA, Ethernet. Labelled bandwidths: 8 GB/s, 3+ Gb/s, 25+ GB/s, 160+ GB/s to VRAM]
modified from Matthew Bolitho

Review
The PCI-“not-so”-e Bus
• PCIe bus is slow
• Try to minimize/group transfers
• Use pinned memory on host whenever possible
• Try to perform copies asynchronously (e.g. Streams)
• Use “Zero-Copy” when appropriate
• Examples in the SDK (e.g. bandwidthTest)

2.2 Device Memory
Optimizations

Definitions
• gmem: global memory
• smem: shared memory
• tmem: texture memory
• cmem: constant memory
• bmem: binary code (cubin) memory ?!?
(covered next week)

Performance Analysis
e.g. Matrix Transpose

Matrix Transpose

Transpose 2048x2048 matrix of floats
Performed out-of-place
Separate input and output matrices
Use tile of 32x32 elements, block of 32x8 threads
Each thread processes 4 matrix elements
In general tile and block size are fair game for
optimization
Process
Get the right answer
Measure effective bandwidth (relative to theoretical or
reference case)
Address global memory coalescing, shared memory bank
conflicts, and partition camping while repeating above
steps

Theoretical Bandwidth

Device Bandwidth of GTX 280

1107 * 10^6 * (512 / 8) * 2 / 1024^3 = 131.9 GB/s

where 1107 * 10^6 is the memory clock (Hz), 512 / 8 is the memory interface width (bytes), and the factor 2 accounts for DDR

Specs report 141 GB/s
They use a 10^9 B/GB conversion rather than 1024^3
Whichever you use, be consistent
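
The arithmetic on this slide can be reproduced on the host. The following is a minimal C++ sketch, assuming the GTX 280 figures quoted above; the function names and parameters are illustrative and not part of any CUDA API.

```cpp
#include <cassert>

// Peak memory bandwidth in bytes per second.
//   clock_hz       : memory clock in Hz (1107e6 on the GTX 280)
//   bus_width_bits : memory interface width in bits (512 on the GTX 280)
//   ddr_factor     : transfers per clock (2 for DDR)
double theoretical_bandwidth(double clock_hz, int bus_width_bits, int ddr_factor)
{
    assert(bus_width_bits % 8 == 0);   // interface is a whole number of bytes
    return clock_hz * (bus_width_bits / 8) * ddr_factor;
}

// The two unit conventions mentioned on the slide.
double to_binary_gb(double bytes_per_s)  { return bytes_per_s / (1024.0 * 1024.0 * 1024.0); }
double to_decimal_gb(double bytes_per_s) { return bytes_per_s / 1e9; }
```

For the GTX 280, `to_binary_gb(theoretical_bandwidth(1107e6, 512, 2))` evaluates to about 131.96 and `to_decimal_gb(...)` to about 141.7, which is the gap between the computed figure and the one in the specs.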


Effective Bandwidth

Transpose Effective Bandwidth

2048^2 * 4 B/element / 1024^3 * 2 / (time in secs) = GB/s

where 2048^2 * 4 B/element / 1024^3 is the matrix size (in GB) and the factor 2 counts the read and the write
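
As a sketch of how the effective-bandwidth formula might be wrapped in a helper (the name `effective_bandwidth_gbs` and its signature are assumptions made for illustration), using the same 1024^3 convention as the slide:

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s (1024^3 bytes per GB, as on the slide).
// The factor 2 counts one read and one write per matrix element.
double effective_bandwidth_gbs(std::size_t width, std::size_t height,
                               std::size_t bytes_per_element, double seconds)
{
    assert(seconds > 0.0);
    double matrix_bytes = static_cast<double>(width) * height * bytes_per_element;
    return matrix_bytes / (1024.0 * 1024.0 * 1024.0) * 2.0 / seconds;
}
```

A 2048x2048 float matrix is 16 MB, so one transpose moves 1/32 GB; a hypothetical kernel time of 1 ms would therefore correspond to 31.25 GB/s.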

Reference Case - Matrix Copy
Transpose operates on tiles - need better comparison
than raw device bandwidth
Look at effective bandwidth of copy that uses tiles


Matrix Copy Kernel
#define TILE_DIM   32
#define BLOCK_ROWS 8

__global__ void copy(float *odata, float *idata, int width,
                     int height)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index  = xIndex + width*yIndex;

    // each thread copies TILE_DIM / BLOCK_ROWS = 4 elements of the tile
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
        odata[index+i*width] = idata[index+i*width];
    }
}

32x32 tile, 32x8 thread block
idata and odata in global memory
[Figure: elements copied by a half-warp of threads in idata and odata]


Matrix Copy Kernel Timing
Measure elapsed time over loop
Looping/timing done in two ways:
Over kernel launches (nreps = 1)
Includes launch/indexing overhead
Within the kernel over loads/stores (nreps > 1)
Amortizes launch/indexing overhead
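__global__ void copy(float *odata, float* idata, int width,
                     int height, int nreps)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index  = xIndex + width*yIndex;

    // repeat the tile copy nreps times to amortize launch/indexing overhead
    for (int r = 0; r < nreps; r++) {
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index+i*width] = idata[index+i*width];
        }
    }
}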

Naïve Transpose
Similar to copy
Input and output matrices have different indices
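__global__ void transposeNaive(float *odata, float* idata, int width,
                               int height, int nreps)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

    int index_in  = xIndex + width * yIndex;
    int index_out = yIndex + height * xIndex;

    for (int r=0; r < nreps; r++) {
        for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
            // reads walk along a row of idata, writes walk down a column of odata
            odata[index_out+i] = idata[index_in+i*width];
        }
    }
}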
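
The first step of the process is to get the right answer. One way to check the index arithmetic without a GPU is to emulate the kernel serially on the host and compare against a plain transpose. This is only a sketch: the function names are made up for illustration, and the four outer loops stand in for `blockIdx` and `threadIdx`.

```cpp
#include <cassert>
#include <vector>

const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;

// Plain CPU transpose used as the reference answer.
std::vector<float> transpose_reference(const std::vector<float>& in, int width, int height)
{
    std::vector<float> out(in.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            out[x * height + y] = in[y * width + x];
    return out;
}

// Serial emulation of the naive transpose kernel: the loops over by, bx, ty, tx
// play the role of blockIdx.y, blockIdx.x, threadIdx.y and threadIdx.x.
std::vector<float> transpose_naive_emulated(const std::vector<float>& in, int width, int height)
{
    assert(width % TILE_DIM == 0 && height % TILE_DIM == 0);
    std::vector<float> out(in.size());
    for (int by = 0; by < height / TILE_DIM; ++by)
    for (int bx = 0; bx < width / TILE_DIM; ++bx)
    for (int ty = 0; ty < BLOCK_ROWS; ++ty)
    for (int tx = 0; tx < TILE_DIM; ++tx) {
        int xIndex = bx * TILE_DIM + tx;
        int yIndex = by * TILE_DIM + ty;
        int index_in  = xIndex + width * yIndex;
        int index_out = yIndex + height * xIndex;
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            out[index_out + i] = in[index_in + i * width];
    }
    return out;
}
```

Filling a small matrix (for example 64x96) with distinct values and comparing the two results element by element catches most indexing mistakes before any timing is done.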


Effective Bandwidth

Effective Bandwidth (GB/s)
2048x2048, GTX 280

                    Loop over kernel   Loop in kernel
Simple Copy              96.9               81.6
Naïve Transpose           2.2                2.2


Memory Coalescing

GPU memory controller granularity is 64 or 128 bytes
Must also be 64 or 128 byte aligned
Suppose thread loads a float (4 bytes)
Controller loads 64 bytes, throws 60 bytes away

Memory Coalescing

Memory controller actually more intelligent
Consider half-warp (16 threads)
Suppose each thread reads consecutive float
Memory controller will perform one 64 byte load
This is known as coalescing

Make threads read consecutive locations

Coalescing
Global memory access of 32, 64, or 128-bit words by a half-
warp of threads can result in as few as one (or two)
transaction(s) if certain access requirements are met
Depends on compute capability
1.0 and 1.1 have stricter access requirements

Examples – float (32-bit) data

[Figure: global memory divided into 64B aligned segments (16 floats) and 128B aligned segments (32 floats), accessed by a half-warp of threads]

Coalescing
Compute capability 1.0 and 1.1
The k-th thread must access the k-th word in the segment (or the k-th word in 2 contiguous 128B segments for 128-bit words); not all threads need to participate

Coalesces – 1 transaction

Out of sequence – 16 transactions Misaligned – 16 transactions


Memory Coalescing

GT200 has hardware coalescer
Inspects memory requests from each half-warp
Determines minimum set of transactions which are
64 or 128 bytes long
64 or 128 byte aligned

Coalescing
Compute capability 1.2 and higher (e.g. GT200 like the C1060)
Coalescing is achieved for any pattern of addresses that fits into a
segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for
32- and 64-bit words
Smaller transactions may be issued to avoid wasted bandwidth due
to unused words

1 transaction - 64B segment

1 transaction - 128B segment
2 transactions - 64B and 32B segments
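
A rough way to reason about the compute capability 1.2+ rule is to count how many aligned segments a half-warp touches. The C++ sketch below is a simplified model written for illustration: it counts distinct aligned segments of one fixed size and ignores the hardware's ability to shrink a transaction to 64B or 32B. The function names are not from any CUDA API.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Number of distinct aligned segments touched by a set of byte addresses.
int count_segments(const std::vector<std::uint64_t>& byte_addresses,
                   std::uint64_t segment_bytes)
{
    assert(segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t a : byte_addresses)
        segments.insert(a / segment_bytes);
    return static_cast<int>(segments.size());
}

// Byte addresses of the 4-byte floats accessed by a half-warp: thread k
// accesses element base_element + k * stride_elements.
std::vector<std::uint64_t> half_warp_addresses(std::uint64_t base_element,
                                               std::uint64_t stride_elements)
{
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t k = 0; k < 16; ++k)
        addrs.push_back((base_element + k * stride_elements) * 4);
    return addrs;
}
```

With 128B segments, consecutive floats from an aligned base fall in 1 segment, the same access starting at element 24 straddles 2 segments, and a stride of 2048 floats (a column of a 2048-wide matrix) touches 16 separate segments.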


Coalescing
Compute capability 2.0 (Fermi, Tesla C2050)
Memory transactions handled per warp (32 threads)
L1 cache ON:
Issues always 128B segment transactions
caches them in 16kB or 48kB L1 cache per multiprocessor
2 transactions - 2 x 128B segment - but next warp probably only 1 extra transaction, due to L1 cache.

L1 cache OFF:
Issues always 32B segment transactions
E.g. advantage for widely scattered thread accesses
32 transactions - 32 x 32B segments, instead of 32 x 128B segments.

© NVIDIA Corporation 2010

Coalescing Summary

Coalescing dramatically speeds global memory access
Strive for perfect coalescing:
Align starting address (may require padding)
A warp should access within a contiguous region

Coalescing in Transpose

Naïve transpose coalesces reads, but not writes

idata odata

Elements transposed by a half-warp of threads

Q: How to coalesce writes ?

Shared Memory

SMs can access gmem at 80+ GiB/sec
but have hundreds of cycles of latency
Each SM has 16 kiB ‘shared’ memory
Essentially user-managed cache
Speed comparable to registers
Accessible to all threads in a block
Reduces load/stores to device memory

Shared Memory

~Hundred times faster than global memory

Cache data to reduce global memory accesses

Threads can cooperate via shared memory

Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-
coalesceable addressing


A Common Programming Strategy

• Partition data into subsets that fit into shared memory
• Handle each data subset with one thread block
• Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
• Perform the computation on the subset from shared memory
• Copy the result from shared memory back to global memory

Coalescing through shared memory

Access columns of a tile in shared memory to write
contiguous data to global memory
Requires __syncthreads() since threads write data
read by other threads

idata odata
tile

Elements transposed by a half-warp of threads


Coalescing through shared memory
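__global__ void transposeCoalesced(float *odata, float *idata, int width,
                                   int height, int nreps)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + (yIndex)*width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + (yIndex)*height;

    for (int r=0; r < nreps; r++) {
        // coalesced read: rows of idata into rows of the tile
        for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
            tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
        }

        __syncthreads();

        // coalesced write: columns of the tile into rows of odata
        for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
            odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
        }
    }
}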

Effective Bandwidth

2048x2048, GTX 280 (GB/s)

                       Loop over kernel   Loop in kernel
Simple Copy                 96.9               81.6
Shared Memory Copy          80.9               81.1      (uses shared memory tile and __syncthreads())
Naïve Transpose              2.2                2.2
Coalesced Transpose         16.5               17.1


Shared Memory Architecture

Many threads accessing memory
Therefore, memory is divided into banks
Successive 32-bit words assigned to successive banks

Each bank can service one address per cycle
A memory can service as many simultaneous accesses as it has banks

Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized

[Figure: Bank 0, Bank 1, ..., Bank 15]

Shared Memory Banks
Shared memory divided into 16 ‘banks’
Shared memory is (almost) as fast as registers (...)
Exception is in case of bank conflicts

[Figure: successive 4-byte words map to successive banks: words 0 to 15 go to Banks 0 to 15, words 16 to 31 go to Banks 0 to 15 again]

Bank Addressing Examples

No Bank Conflicts: linear addressing, stride == 1
No Bank Conflicts: random 1:1 permutation
2-way Bank Conflicts: linear addressing, stride == 2
8-way Bank Conflicts: linear addressing, stride == 8

[Figure: threads of a half-warp connected to the banks they access in each of the four cases]

Shared memory bank conflicts
Shared memory is ~ as fast as registers if there are no bank
conflicts

warp_serialize profiler signal reflects conflicts

The fast case:
If all threads of a half-warp access different banks, there is no
bank conflict
If all threads of a half-warp read the identical address, there is no
bank conflict (broadcast)

The slow case:
Bank Conflict: multiple threads in the same half-warp access the
same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
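
The cost rule above can be turned into a small host-side calculation. The sketch below is a simplified model for illustration only: it assumes 16 banks of 32-bit words, non-negative word indices, and treats threads that read the same word as one request, which approximates the broadcast case. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Degree of bank conflict for one half-warp: the maximum number of distinct
// 32-bit words requested from any single bank (1 means no conflict).
int conflict_degree(const std::vector<int>& word_indices, int num_banks = 16)
{
    assert(num_banks > 0);
    std::map<int, std::vector<int>> words_per_bank;
    for (int w : word_indices) {
        std::vector<int>& words = words_per_bank[w % num_banks];
        if (std::find(words.begin(), words.end(), w) == words.end())
            words.push_back(w);   // identical addresses are served together
    }
    int worst = 0;
    for (const auto& entry : words_per_bank)
        worst = std::max(worst, static_cast<int>(entry.second.size()));
    return worst;
}

// Word indices for linear addressing: thread k reads word k * stride.
std::vector<int> strided_access(int stride, int num_threads = 16)
{
    std::vector<int> words;
    for (int k = 0; k < num_threads; ++k)
        words.push_back(k * stride);
    return words;
}
```

Under this model a stride of 1 or 3 is conflict-free, a stride of 2 gives a 2-way conflict, a stride of 8 an 8-way conflict, and a stride of 0 (every thread reading the same word) costs a single access.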

Bank Conflicts in Transpose

32x32 shared memory tile of floats
Data in columns k and k+16 are in same bank
16-way bank conflict reading half columns in tile
Q: How to avoid bank conflicts ?
Solution - pad shared memory array
__shared__ float tile[TILE_DIM][TILE_DIM+1];
Data in anti-diagonals are in same bank

[Figure: idata, the shared memory tile, and odata]
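
To see why one extra column removes the conflict, it helps to compute which bank each thread of a half-warp hits when it reads down a column of the tile. A minimal sketch, assuming 16 banks and a row-major tile; `column_read_conflicts` and its `row_pitch` parameter are names chosen for this example.

```cpp
#include <cassert>
#include <vector>

const int TILE_DIM = 32;
const int NUM_BANKS = 16;   // compute capability 1.x

// Worst-case number of threads of a half-warp that hit the same bank when
// thread k reads tile[k][column] and the tile is declared as
// float tile[TILE_DIM][row_pitch].
int column_read_conflicts(int row_pitch, int column)
{
    assert(row_pitch >= TILE_DIM && column >= 0 && column < TILE_DIM);
    std::vector<int> hits(NUM_BANKS, 0);
    int worst = 0;
    for (int k = 0; k < 16; ++k) {
        int word = k * row_pitch + column;   // row-major word offset in the tile
        int bank = word % NUM_BANKS;
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

With `row_pitch = 32` every row starts in the same bank, so all 16 threads collide; with `row_pitch = 33` consecutive rows are shifted by one bank and the 16 reads land in 16 different banks.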


Illustration
Shared Memory: Avoiding Bank Conflicts
• 32x32 SMEM array
• Warp accesses a column:
– 32-way bank conflicts (threads in a warp access the same bank)

[Figure: warps 0, 1, 2, ..., 31 laid out across Bank 0, Bank 1, ..., Bank 31]

Illustration
Shared Memory: Avoiding Bank Conflicts
• Add a column for padding:
– 32x33 SMEM array
• Warp accesses a column:
– 32 different banks, no bank conflicts

[Figure: the same layout with one padding column added]

Effective Bandwidth

2048x2048, GTX 280 (GB/s)

                              Loop over kernel   Loop in kernel
Naïve Transpose                     2.2                2.2
Bank Conflict Free Transpose       16.6               17.2


Partition Camping

Global memory accesses go through partitions
6 partitions on 8-series GPUs, 8 partitions on 10-series
GPUs
Successive 256-byte regions of global memory are
assigned to successive partitions

For best performance:
Simultaneous global memory accesses GPU-wide should
be distributed evenly amongst partitions

Partition Camping occurs when global memory
accesses at an instant use a subset of partitions
Directly analogous to shared memory bank conflicts, but
on a larger scale

Partition Camping in Transpose

Partition width = 256 bytes = 64 floats
Twice width of tile
On GTX280 (8 partitions), data 2KB apart map to
same partition
2048 floats divides evenly by 2KB => columns of matrices
map to same partition

[Figure: tiles in matrices idata and odata, colors = partitions. idata tiles are numbered 0, 1, 2, ... along a row (64, 65, ... on the next row); odata tiles are numbered 0, 1, 2, ... down a column (64, 65, ... in the next column)]

blockId = gridDim.x * blockIdx.y + blockIdx.x
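
The mapping from addresses to partitions can be checked with a few lines of host code. This sketch assumes the matrix starts at a partition-aligned address and that 256-byte regions are assigned to partitions round-robin, as described above; the function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Partition that holds a given byte address, assuming successive 256-byte
// regions are assigned round-robin to the partitions.
int partition_of(std::size_t byte_address, int num_partitions = 8)
{
    assert(num_partitions > 0);
    return static_cast<int>((byte_address / 256) % num_partitions);
}

// Number of distinct partitions touched by the first `rows` elements of
// column 0 in a row-major float matrix with `width_floats` floats per row.
int partitions_in_column(std::size_t width_floats, int rows, int num_partitions = 8)
{
    std::set<int> used;
    for (int r = 0; r < rows; ++r)
        used.insert(partition_of(static_cast<std::size_t>(r) * width_floats * 4,
                                 num_partitions));
    return static_cast<int>(used.size());
}
```

For a width of 2048 floats, each row is 8 KB, a multiple of the 2 KB partition period, so a whole column sits in one partition. Padding the width to 2048 + 64 (or 2048 + 32) floats spreads the same 32 rows over all 8 partitions.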

Partition Camping Solutions

Pad matrices (by two tiles)
In general might be expensive/prohibitive memory-wise
Diagonally reorder blocks
Interpret blockIdx.y as different diagonal slices and
blockIdx.x as distance along a diagonal

[Figure: block IDs in idata and odata after diagonal reordering]

blockId = gridDim.x * blockIdx.y + blockIdx.x

Diagonal Transpose
__global__ void transposeDiagonal(float *odata, float *idata, int width,
                                  int height, int nreps)
{
    // padded tile, as in the bank-conflict-free version
    __shared__ float tile[TILE_DIM][TILE_DIM+1];

    // Add lines to map diagonal to Cartesian coordinates
    int blockIdx_y = blockIdx.x;
    int blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;

    // Replace blockIdx.x with blockIdx_x, blockIdx.y with blockIdx_y
    int xIndex = blockIdx_x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx_y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + (yIndex)*width;

    xIndex = blockIdx_y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx_x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + (yIndex)*height;

    for (int r=0; r < nreps; r++) {
        for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
            tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
        }
        __syncthreads();
        for (int i=0; i<TILE_DIM; i+=BLOCK_ROWS) {
            odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
        }
    }
}

Diagonal Transpose
Previous slide for square matrices (width == height)
More generally:

if (width == height) {
blockIdx_y = blockIdx.x;
blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;
} else {
int bid = blockIdx.x + gridDim.x*blockIdx.y;
blockIdx_y = bid%gridDim.y;
blockIdx_x = ((bid/gridDim.y)+blockIdx_y)%gridDim.x;
}
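
A reordering like this is only safe if every block still ends up on exactly one tile. The host-side sketch below copies the index arithmetic and checks that property for a given grid. It is an illustration under one assumption: the square test is done on the grid dimensions rather than on width and height, which is equivalent when tiles are square. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

struct BlockCoord { int x; int y; };

// Host-side copy of the index arithmetic above. grid_x/grid_y play the role
// of gridDim.x/gridDim.y, bx/by the role of blockIdx.x/blockIdx.y.
BlockCoord diagonal_reorder(int bx, int by, int grid_x, int grid_y)
{
    assert(bx >= 0 && bx < grid_x && by >= 0 && by < grid_y);
    BlockCoord out;
    if (grid_x == grid_y) {
        out.y = bx;
        out.x = (bx + by) % grid_x;
    } else {
        int bid = bx + grid_x * by;
        out.y = bid % grid_y;
        out.x = ((bid / grid_y) + out.y) % grid_x;
    }
    return out;
}

// True if every block of the grid is mapped to a distinct, in-range tile.
bool covers_every_tile_once(int grid_x, int grid_y)
{
    std::set<std::pair<int, int>> seen;
    for (int by = 0; by < grid_y; ++by)
        for (int bx = 0; bx < grid_x; ++bx) {
            BlockCoord c = diagonal_reorder(bx, by, grid_x, grid_y);
            if (c.x < 0 || c.x >= grid_x || c.y < 0 || c.y >= grid_y) return false;
            seen.insert(std::make_pair(c.x, c.y));
        }
    return seen.size() == static_cast<std::size_t>(grid_x) * grid_y;
}
```

In the square case, blocks 0, 1, 2, ... of the first grid row map to tiles (0,0), (1,1), (2,2), ..., so consecutively scheduled blocks walk along a diagonal instead of along a row.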


Effective Bandwidth

2048x2048, GTX 280 (GB/s)

                              Loop over kernel   Loop in kernel
Naïve Transpose                     2.2                2.2
Bank Conflict Free Transpose       16.6               17.2
Diagonal                           69.5               78.3


Order of Optimizations

Larger optimization issues can mask smaller ones
Proper order of some optimization techniques is not known a priori
E.g. partition camping is problem-size dependent

Don’t dismiss an optimization technique as ineffective until you know it was applied at the right time

Naïve (2.2 GB/s) -> Coalescing (16.5 GB/s) -> Bank Conflicts (16.6 GB/s) -> Partition Camping (69.5 GB/s)
Naïve (2.2 GB/s) -> Coalescing (16.5 GB/s) -> Partition Camping (48.8 GB/s) -> Bank Conflicts (69.5 GB/s)

Transpose Summary

Coalescing and shared memory bank conflicts are
small-scale phenomena
Deal with memory accesses within a half-warp
Problem-size independent

Partition camping is a large-scale phenomenon
Deals with simultaneous memory accesses by warps on
different multiprocessors
Problem size dependent
Wouldn’t see in (2048+32)^2 matrix

Coalescing is generally the most critical

SDK Transpose Example:
http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html

Textures in CUDA

Texture is an object for reading data

Benefits:
Data is cached (optimized for 2D locality)
Helpful when coalescing is a problem
Filtering
Linear / bilinear / trilinear
Dedicated hardware
Wrap modes (for “out-of-bounds” addresses)
Clamp to edge / repeat
Addressable in 1D, 2D, or 3D
Using integer or normalized coordinates

Usage:
CPU code binds data to a texture object
Kernel reads data by calling a fetch function

Other goodies
Optional “format conversion”
• {char, short, int, half} (16bit) to float (32bit)
• “for free”
• useful for *mem compression (see later)

Texture Addressing
[Figure: 5x4 texel grid (columns 0 to 4, rows 0 to 3) with sample points (2.5, 0.5) and (1.0, 1.0)]

Wrap: out-of-bounds coordinate is wrapped (modulo arithmetic)
Clamp: out-of-bounds coordinate is replaced with the closest boundary

[Figure: the out-of-bounds coordinate (5.5, 1.5) on the same grid, resolved under wrap and under clamp]


Two CUDA Texture Types

Bound to linear memory
Global memory address is bound to a texture
Only 1D
Integer addressing
No filtering, no addressing modes

Bound to CUDA arrays
CUDA array is bound to a texture
1D, 2D, or 3D
Float addressing (size-based or normalized)
Filtering
Addressing modes (clamping, repeat)

Both:
Return either element type or normalized float

CUDA Texturing Steps
Host (CPU) code:
Allocate/obtain memory (global linear, or CUDA array)
Create a texture reference object
Currently must be at file-scope
Bind the texture reference to memory/array
When done:
Unbind the texture reference, free resources

Device (kernel) code:
Fetch using texture reference
Linear memory textures:
tex1Dfetch()
Array textures:
tex1D() or tex2D() or tex3D()


Constant Memory
• Ideal for coefficients and other data that is read uniformly by warps
• Data is stored in global memory, read through a constant-cache
– __constant__ qualifier in declarations
– Can only be read by GPU kernels
– Limited to 64KB
• Fermi adds uniform accesses:
– Kernel pointer argument qualified with const
– Compiler must determine that all threads in a threadblock will dereference the same address
– No limit on array size, can use any global memory pointer
• Constant cache throughput:
– 32 bits per warp per 2 clocks per multiprocessor
– To be used when all threads in a warp read the same address
• Serializes otherwise

__global__ void kernel( const float *g_a )
{
    ...
    float x = g_a[15];              // uniform
    float y = g_a[blockIdx.x+5];    // uniform
    float z = g_a[threadIdx.x];     // non-uniform
    ...
}

Constant Memory
• Kernel executes 10K threads (320 warps) per SM during its lifetime
• All threads access the same 4B word
• Using GMEM:
– Each warp fetches 32B -> 10KB of bus traffic
– Caching loads potentially worse - 128B line, very likely to be evicted multiple times
• Using constant/uniform access:
– First warp fetches 32 bytes
– All others hit in constant cache -> 32 bytes of bus traffic
• Unlikely to be evicted over kernel lifetime - other loads do not go through this cache

[Figure: addresses from a warp, drawn against a memory address axis from 0 to 448 in 32-byte steps]

Optimizing with Compression
• When all else has been optimized and kernel is limited by the number of bytes needed, consider compression
• Approaches:
– Int: conversion between 8-, 16-, 32-bit integers is 1 instruction (64-bit requires a couple)
– FP: conversion between fp16, fp32, fp64 is one instruction
• fp16 (1s5e10m) is storage only, no math instructions
– Range-based:
• Lower and upper limits are kernel arguments
• Data is an index for interpolation
• Application in practice:
– Clark et al. “Solving Lattice QCD systems of equations using mixed precision solvers on GPUs”
– http://arxiv.org/abs/0911.3191

Accelerating GPU computation through mixed-precision methods

Michael Clark
Harvard-Smithsonian Center for Astrophysics
Harvard University

SC’10

... too much ?

[Word cloud: bank conflicts, coalescing, caching, partition camping, clamping, mixed precision, broadcasting, streams, zero-copy]

Parallel Programming is Hard
(but you’ll pick it up)

3. Threading/Execution
Optimizations

3.1 Exec. Configuration
Optimizations

Occupancy

Thread instructions are executed sequentially, so
executing other warps is the only way to hide
latencies and keep the hardware busy

Occupancy = Number of warps running concurrently
on a multiprocessor divided by maximum number of
warps that can run concurrently

Limited by resource usage:
Registers
Shared memory


Grid/Block Size Heuristics

# of blocks > # of multiprocessors
So all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2
Multiple blocks can run concurrently in a multiprocessor
Blocks that aren’t waiting at a __syncthreads() keep the
hardware busy
Subject to resource availability – registers, shared memory

# of blocks > 100 to scale to future devices
Blocks executed in pipeline fashion
1000 blocks per grid will scale across multiple generations


Register Dependency

Read-after-write register dependency
Instruction’s result can be read ~24 cycles later
Scenarios:
CUDA:                PTX:
x = y + 5;           add.f32 $f3, $f1, $f2
z = x + 3;           add.f32 $f5, $f3, $f4

s_data[0] += 3;      ld.shared.f32 $f3, [$r31+0]
                     add.f32 $f3, $f3, $f4

To completely hide the latency:
Run at least 192 threads (6 warps) per multiprocessor
At least 25% occupancy (1.0/1.1), 18.75% (1.2/1.3)
Threads do not have to belong to the same thread block
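
The numbers on this slide follow from simple arithmetic: a result is readable about 24 cycles after issue, and a simple arithmetic instruction occupies the multiprocessor for 4 cycles per warp (a figure quoted later in this lecture). A minimal sketch of that reasoning, with function names chosen for illustration; the 24 and 32 resident-warp limits are those of compute capability 1.0/1.1 and 1.2/1.3 respectively.

```cpp
#include <cassert>

const int WARP_SIZE = 32;

// Warps needed so that, while one warp waits `latency_cycles` for a register
// result, the other warps keep the multiprocessor busy, each occupying it
// for `issue_cycles` cycles per instruction.
int warps_to_hide_latency(int latency_cycles, int issue_cycles)
{
    assert(latency_cycles > 0 && issue_cycles > 0);
    return (latency_cycles + issue_cycles - 1) / issue_cycles;   // round up
}

// Occupancy = resident warps / maximum resident warps per multiprocessor.
double occupancy(int resident_warps, int max_warps_per_sm)
{
    assert(max_warps_per_sm > 0);
    return static_cast<double>(resident_warps) / max_warps_per_sm;
}
```

`warps_to_hide_latency(24, 4)` gives 6 warps, i.e. 6 * 32 = 192 threads; 6 warps out of 24 is 25% occupancy and 6 out of 32 is 18.75%, matching the figures above.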

Register Pressure
Hide latency by using more threads per SM
Limiting Factors:
Number of registers per kernel
8K/16K per SM, partitioned among concurrent threads
Amount of shared memory
16KB per SM, partitioned among concurrent threadblocks
Compile with --ptxas-options=-v flag
Use --maxrregcount=N flag to NVCC
N = desired maximum registers / kernel
At some point “spilling” into local memory may occur
Reduces performance – local memory is slow
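
How registers and shared memory limit the number of concurrent blocks can be estimated by dividing each per-multiprocessor resource by the per-block demand and taking the smallest result. The sketch below is a deliberately simplified model: it ignores the allocation granularity that the real Occupancy Calculator accounts for, and the struct and function names are made up for this example.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits (example values are given in the note below).
struct SmLimits {
    int registers;      // register file size, in registers
    int shared_bytes;   // shared memory size, in bytes
    int max_threads;    // maximum resident threads
    int max_blocks;     // maximum resident blocks
};

// Concurrent blocks per multiprocessor, ignoring allocation granularity.
int blocks_per_sm(const SmLimits& sm, int threads_per_block,
                  int regs_per_thread, int shared_bytes_per_block)
{
    assert(threads_per_block > 0 && regs_per_thread > 0);
    int by_regs    = sm.registers / (regs_per_thread * threads_per_block);
    int by_threads = sm.max_threads / threads_per_block;
    int by_shared  = shared_bytes_per_block > 0
                   ? sm.shared_bytes / shared_bytes_per_block
                   : sm.max_blocks;   // no shared memory used: not a limit
    return std::min(std::min(by_regs, by_threads),
                    std::min(by_shared, sm.max_blocks));
}
```

With compute capability 1.2/1.3 limits (16K registers, 16 KB shared memory, 1024 threads, 8 blocks), a 256-thread block using 16 registers per thread and 4 KB of shared memory fits 4 times per multiprocessor; doubling the register count to 32 per thread halves that to 2 blocks.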


Occupancy Calculator


Optimizing threads per block
Choose threads per block as a multiple of warp size
Avoid wasting computation on under-populated warps
Want to run as many warps as possible per
multiprocessor (hide latency)
Multiprocessor can run up to 8 blocks at a time

Heuristics
Minimum: 64 threads per block
Only if multiple concurrent blocks
192 or 256 threads a better choice
Usually still enough regs to compile and invoke successfully
This all depends on your computation, so experiment!


Occupancy != Performance

Increasing occupancy does not necessarily increase
performance

BUT …

Low-occupancy multiprocessors cannot adequately
hide latency on memory-bound kernels
(It all comes down to arithmetic intensity and available
parallelism)


Better Performance at Lower Occupancy

Vasily Volkov
UC Berkeley
September 22, 2010

GTC’10

3.2 Instruction
Optimizations

CUDA Instruction Performance

Instruction cycles (per warp) = sum of
Operand read cycles
Instruction execution cycles
Result update cycles

Therefore instruction throughput depends on
Nominal instruction throughput
Memory latency
Memory bandwidth

“Cycle” refers to the multiprocessor clock rate
1.3 GHz on the Tesla C1060, for example


Maximizing Instruction Throughput

Maximize use of high-bandwidth memory
Maximize use of shared memory
Minimize accesses to global memory
Maximize coalescing of global memory accesses

Optimize performance by overlapping memory
accesses with HW computation
High arithmetic intensity programs
i.e. high ratio of math to memory transactions
Many concurrent threads


Arithmetic Instruction Throughput

int and float add, shift, min, max and float mul, mad:
4 cycles per warp
int multiply (*) is by default 32-bit
requires multiple cycles / warp
Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int
multiply

Integer divide and modulo are more expensive
Compiler will convert literal power-of-2 divides to shifts
But we have seen it miss some cases
Be explicit in cases where compiler can’t tell that divisor is
a power of 2!
Useful trick: foo % n == foo & (n-1) if n is a power of 2
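
The power-of-2 trick can be written out explicitly. A small sketch with hypothetical helper names; it uses unsigned operands, since for negative signed values `%` and `&` do not agree.

```cpp
#include <cassert>

// True if n is a power of two (non-zero with exactly one bit set).
bool is_power_of_two(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

// foo % n for a power-of-two n, written with a bitwise AND.
unsigned int mod_pow2(unsigned int foo, unsigned int n)
{
    assert(is_power_of_two(n));
    return foo & (n - 1);
}

// foo / n for a power-of-two n, written with a shift.
unsigned int div_pow2(unsigned int foo, unsigned int n)
{
    assert(is_power_of_two(n));
    unsigned int shift = 0;
    while ((1u << shift) < n) ++shift;   // shift = log2(n)
    return foo >> shift;
}
```

For example, `mod_pow2(37, 8)` and `37 % 8` are both 5, and `div_pow2(37, 8)` and `37 / 8` are both 4.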


Runtime Math Library

There are two types of runtime math operations in
single-precision
__funcf(): direct mapping to hardware ISA
Fast but lower accuracy (see prog. guide for details)
Examples: __sinf(x), __expf(x), __powf(x,y)
funcf() : compile to multiple instructions
Slower but higher accuracy (5 ulp or less)
Examples: sinf(x), expf(x), powf(x,y)

The -use_fast_math compiler option forces every
funcf() to compile to __funcf()


GPU results may not match CPU

Many variables: hardware, compiler, optimization
settings

CPU operations aren’t strictly limited to 0.5 ulp
Sequences of operations can be more accurate due to 80-
bit extended precision ALUs

Floating-point arithmetic is not associative!


FP Math is Not Associative!

In symbolic math, (x+y)+z == x+(y+z)
This is not necessarily true for floating-point addition
Try x = 10^30, y = -10^30 and z = 1 in the above equation

When you parallelize computations, you potentially
change the order of operations

Parallel results may not exactly match sequential
results
This is not specific to GPU or CUDA – inherent part of
parallel execution
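
The effect is easy to reproduce on a CPU in single precision. A minimal sketch, using the values x = 10^30, y = -10^30 and z = 1; the function names are illustrative, and the `volatile` temporaries are there only to keep the compiler from carrying the intermediate result in higher precision.

```cpp
#include <cassert>

// (x + y) + z evaluated left to right in single precision.
float sum_left(float x, float y, float z)
{
    volatile float t = x + y;   // intermediate rounded to float
    return t + z;
}

// x + (y + z): same operands, different grouping.
float sum_right(float x, float y, float z)
{
    volatile float t = y + z;   // z is lost here when |y| is huge
    return x + t;
}

// Aborts if the two groupings agree for the example values.
void check_grouping_example()
{
    assert(sum_left(1e30f, -1e30f, 1.0f) != sum_right(1e30f, -1e30f, 1.0f));
}
```

The left grouping cancels the two large values first and returns 1, while the right grouping absorbs z into -10^30 and returns 0. A parallel reduction that changes the grouping can change the result in the same way.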


Control Flow Instructions
Main performance concern with branching is
divergence
Threads within a single warp take different paths
Different execution paths must be serialized

Avoid divergence when branch condition is a
function of thread ID
Example with divergence:
if (threadIdx.x > 2) { }
Branch granularity < warp size
Example without divergence:
if (threadIdx.x / WARP_SIZE > 2) { }
Branch granularity is a whole multiple of warp size
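
Whether a branch condition diverges can be checked on the host by evaluating it for every thread of a block and counting the warps whose threads disagree. A sketch for a 1D block; the helper names are hypothetical and the `tid` argument stands in for `threadIdx.x`.

```cpp
#include <cassert>

const int WARP_SIZE = 32;

// Number of warps in a 1D block whose threads disagree on `predicate`,
// i.e. warps that would have to execute both sides of the branch.
template <typename Predicate>
int count_divergent_warps(int block_size, Predicate predicate)
{
    assert(block_size > 0);
    int divergent = 0;
    for (int first = 0; first < block_size; first += WARP_SIZE) {
        bool first_result = predicate(first);
        for (int tid = first; tid < first + WARP_SIZE && tid < block_size; ++tid) {
            if (predicate(tid) != first_result) { ++divergent; break; }
        }
    }
    return divergent;
}

// The two branch conditions from the slide, as functions of threadIdx.x.
bool diverging_condition(int tid) { return tid > 2; }
bool uniform_condition(int tid)   { return tid / WARP_SIZE > 2; }
```

For a 128-thread block, the first condition splits warp 0 (threads 0 to 2 go one way, threads 3 to 31 the other), while the second condition is constant within every warp.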


Scared ?
Howwwwww?!
(do I start)

Analysis with Profiler
• Profiler counters:
– instructions_issued, instructions_executed
• Both incremented by 1 per warp
• “issued” includes replays, “executed” does not
– gld_request, gst_request
• Incremented by 1 per warp for each load/store instruction
• Instruction may be counted if it is “predicated out”
– l1_global_load_miss, l1_global_load_hit, global_store_transaction
• Incremented by 1 per L1 line (line is 128B)
– uncached_global_load_transaction
• Incremented by 1 per group of 1, 2, 3, or 4 transactions
• Compare:
– 32 * instructions_issued /* 32 = warp size */
– 128B * (global_store_transaction + l1_global_load_miss)

CUDA Visual Profiler data for memory transfers

Memory transfer type and direction
(D=Device, H=Host, A=cuArray)
e.g. H to D: Host to Device

Synchronous / Asynchronous

Memory transfer size, in bytes

Stream ID


CUDA Visual Profiler data for kernels


CUDA Visual Profiler computed data for kernels
Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate

Global memory read throughput (Gigabytes/second)

Global memory write throughput (Gigabytes/second)

Overall global memory access throughput (Gigabytes/second)

Global memory load efficiency

Global memory store efficiency


CUDA Visual Profiler data analysis views
Views:
Summary table
Kernel table
Memcopy table
Summary plot
GPU Time Height plot
GPU Time Width plot
Profiler counter plot
Profiler table column plot
Multi-device plot
Multi-stream plot
Analyze profiler counters
Analyze kernel occupancy


CUDA Visual Profiler Misc.
Multiple sessions

Compare views for different sessions

Comparison Summary plot

Profiler projects save & load

Import/Export profiler data
(.CSV format)


Scared ?
meh!!!! I don’t like to
profile

Analysis with Modified Source Code
• Time memory-only and math-only versions of the kernel
– Easier for codes that don’t have data-dependent control-flow or addressing
– Gives you good estimates for:
• Time spent accessing memory
• Time spent in executing instructions
• Comparing the times for modified kernels
– Helps decide whether the kernel is mem or math bound
– Shows how well memory operations are overlapped with arithmetic
• Compare the sum of mem-only and math-only times to full-kernel time

Scared ?

I want to believe...

Some Example Scenarios

Memory bound ? Math bound ? Latency bound ?

[Figure: bar charts of mem-only, math-only and full-kernel time for four scenarios]

Memory-bound: good mem-math overlap, latency not a problem (assuming memory throughput is not low compared to HW theory)
Math-bound: good mem-math overlap, latency not a problem (assuming instruction throughput is not low compared to HW theory)
Balanced: good mem-math overlap, latency not a problem (assuming memory/instr throughput is not low compared to HW theory)
Memory and latency bound: poor mem-math overlap, latency is a problem

Argn&%#$... too many optimizations !!!

Parameterize Your Application

Parameterization helps adaptation to different GPUs

GPUs vary in many ways
# of multiprocessors
Memory bandwidth
Shared memory size
Register file size
Max. threads per block

You can even make apps self-tuning (like FFTW and
ATLAS)
“Experiment” mode discovers and saves optimal
configuration


More ?
• Next week:
GPU “Scripting”, Meta-programming, Auto-tuning
• Thu 3/31/11:
PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU)
• Tue 3/29/11:
Algorithm Strategies (W. Hwu, UIUC)
• Tue 4/5/11:
Analysis-driven Optimization (C.Wooley, NVIDIA)
• Thu 4/7/11:
Irregular Parallelism & Efficient Data Structures (J. Owens, UCDavis)
• Thu 4/14/11:
Optimization for Ninjas (D. Merrill, UVirg)
• ...

one more thing
or two...

Life/Code Hacking #2.x
Speed {listen,read,writ}ing

accelerated e-learning (c) / massively parallel {learn,programm}ing (c)

Life/Code Hacking #2.2
Speed writing


Speed writing
http://steve-yegge.blogspot.com/2008/09/programmings-dirtiest-little-secret.html


Speed writing
Typing tutors: gtypist, ktouch, typingweb.com, etc.


Speed writing

Kinesis Advantage (QWERTY/DVORAK)
