7. GPU CACHES
GPU caches are not intended for the same use as CPU caches.
Not aimed at temporal reuse. Much smaller than CPU caches, especially per thread (e.g. Fermi:
48 KB L1, 1536 threads in flight, so only about 32 B of cache per thread, a fraction of a single 128-byte line).
Aimed at spatial reuse. Intended to smooth out some access patterns and to help with spilled
registers and the stack.
Do not build tiling schemes that rely on cache residency: lines are likely to be evicted within the next few accesses.
Use smem for tiling instead: same latency, fully programmable.
L2 is aimed at speeding up atomics and gmem writes.
8. GMEM
Learn your access pattern before thinking about latency hiding, and try not to thrash the
memory bus.
Four general categories of inefficient memory access patterns:
Misaligned (offset) warp addresses
Strided access between threads within a warp
Thread-affine (each thread in a warp accesses a large contiguous region)
Irregular (scattered) addresses
Always be aware of the bytes you actually need versus the bytes you transfer over the bus
11. GMEM: STRIDED
Use smem to correct the access pattern:
1. load gmem -> smem with best coalescing
2. synchronize
3. use
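A minimal sketch of this staging pattern, assuming each thread needs element in[thread * STRIDE] (a strided pattern), a block of 256 threads and STRIDE = 8; kernel name, block size and stride are illustrative, not from the slides:

#define STRIDE 8

__global__ void strided_via_smem(const float* in, float* out, int n)
{
    __shared__ float tile[256 * STRIDE];          // one contiguous chunk per 256-thread block

    int base = blockIdx.x * blockDim.x * STRIDE;  // first float of this block's chunk

    // 1. load gmem -> smem: consecutive threads read consecutive addresses (coalesced)
    for (int i = threadIdx.x; i < blockDim.x * STRIDE; i += blockDim.x)
        if (base + i < n * STRIDE) tile[i] = in[base + i];

    // 2. synchronize
    __syncthreads();

    // 3. use: the strided read now hits smem (bank conflicts here are far cheaper
    //    than strided gmem transactions)
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) out[gid] = tile[threadIdx.x * STRIDE];
}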
12. GMEM: STRIDED
Use warp shuffle to permute elements within a warp
1. load the elements needed by the warp with coalesced accesses
2. permute
3. use
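A sketch of the shuffle variant, assuming each lane needs an interleaved pair (in[2*i], in[2*i+1]), i.e. a stride-2 pattern, and that n is a multiple of the warp size. Kernel and variable names are illustrative; __shfl_sync needs CUDA 9+ (Kepler-era toolkits would use the unsynchronized __shfl instead):

__global__ void pairs_via_shfl(const float* in, float* outx, float* outy, int n)
{
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;
    int warpBase = (gid - lane) * 2;       // first float owned by this warp

    // 1. coalesced: lane i loads floats warpBase+i and warpBase+32+i
    float a = in[warpBase + lane];
    float b = in[warpBase + 32 + lane];

    // 2. permute: the pair for lane i lives at positions 2*i and 2*i+1 of the warp's chunk,
    //    i.e. in register a of lane 2*i (first 16 lanes) or register b of lane 2*i-32 (rest)
    int srcLaneX = (2 * lane) & 31;
    int srcLaneY = (2 * lane + 1) & 31;
    float ax = __shfl_sync(0xffffffff, a, srcLaneX);
    float bx = __shfl_sync(0xffffffff, b, srcLaneX);
    float ay = __shfl_sync(0xffffffff, a, srcLaneY);
    float by = __shfl_sync(0xffffffff, b, srcLaneY);
    float x = (lane < 16) ? ax : bx;
    float y = (lane < 16) ? ay : by;

    // 3. use
    if (gid < n) { outx[gid] = x; outy[gid] = y; }
}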
13. GMEM: STRIDED
Use a proper caching strategy
cg – cache global (skip L1, cache in L2 only)
ldg – load through the read-only (texture) L1 cache
cs – cache streaming (evict-first, for data touched only once)
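A sketch of the hints in code: __ldg needs sm_35+, and the cs variant is expressed as inline PTX; the whole-program default load policy can also be chosen at compile time with -Xptxas -dlcm=cg or -dlcm=ca. Kernel and helper names are illustrative:

__device__ __forceinline__ float load_streaming(const float* ptr)
{
    float val;
    // cs: streaming hint, data is expected to be touched once (evict-first)
    asm("ld.global.cs.f32 %0, [%1];" : "=f"(val) : "l"(ptr));
    return val;
}

__global__ void copy_with_hints(const float* __restrict__ in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = __ldg(in + i);            // ldg: route the load through the read-only (texture) L1
    float b = load_streaming(in + i);   // cs: do not keep the line in cache
    out[i] = a + b;
}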
14. GMEM: THREAD-AFFINE
Each thread accesses a relatively long contiguous memory region
Load big structures using AoS
Each thread loads a contiguous region of data
All threads load the same data
15. GMEM: THREAD-AFFINE
Work distribution
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int threadN = N / (blockDim.x * gridDim.x);   // elements per thread

// contiguous per-thread chunks: each thread walks its own long region (thread-affine)
for (size_t i = tid * threadN; i < (tid + 1) * threadN; ++i)
{
    sum += in[i];
}

// grid-stride loop: consecutive threads read consecutive elements (coalesced)
for (size_t i = tid; i < N; i += blockDim.x * gridDim.x)
{
    sum += in[i];
}
16. UNIFORM LOAD
All threads in a block read the same address (read-only).
The memory operation goes through the 3-level constant cache
Generated by the compiler
Also available as a PTX asm insertion:
__device__ __forceinline__ float __ldu(const float* ptr)
{
    float val;
    asm("ldu.global.f32 %0, [%1];" : "=f"(val) : "l"(ptr));
    return val;
}
17. GMEM: IRREGULAR
Random memory access. Threads in a warp access many lines, strides are irregular.
Improve data locality
Try 2D-local arrays (Morton-ordered)
Use read-only texture L1
Kernel fission to localize the worst case.
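A sketch of a Morton (Z-order) index for a 2D-local array layout: neighbouring (x, y) cells map to nearby addresses, which helps locality for irregular 2D access. part1by1 and morton2d are illustrative helpers, not a CUDA API:

__host__ __device__ inline unsigned int part1by1(unsigned int v)
{
    v &= 0x0000ffff;                       // spread the low 16 bits apart
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

__host__ __device__ inline unsigned int morton2d(unsigned int x, unsigned int y)
{
    return part1by1(x) | (part1by1(y) << 1);   // interleave x and y bits
}

// usage: data[morton2d(x, y)] instead of data[y * width + x]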
18. TEXTURE
Smaller transactions and different caching (dedicated L1, 48 KB, ~104 clock latency)
The cache is not polluted by other GMEM loads; a separate partition per warp scheduler
helps to prevent cache thrashing
Hardware interpolation is possible (note: 9-bit fixed-point interpolation weights)
Hardware handling of out-of-bound access
Kepler improvements:
sm_30+ Bindless textures. No global static variables. Can be used in threaded code
sm_32+ GMEM access through texture cache bypassing interpolation units
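A host-side sketch of an sm_30+ bindless texture: a cudaTextureObject_t created at run time and passed to the kernel as an ordinary argument, with no global texture references. It assumes a pitch-linear float image d_buf of size width x height (ideally allocated with cudaMallocPitch); error checking is omitted:

cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypePitch2D;
resDesc.res.pitch2D.devPtr = d_buf;
resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
resDesc.res.pitch2D.width = width;
resDesc.res.pitch2D.height = height;
resDesc.res.pitch2D.pitchInBytes = width * sizeof(float);

cudaTextureDesc texDesc = {};
texDesc.addressMode[0] = cudaAddressModeClamp;   // hardware out-of-bounds handling
texDesc.addressMode[1] = cudaAddressModeClamp;
texDesc.filterMode = cudaFilterModeLinear;       // hardware interpolation
texDesc.readMode = cudaReadModeElementType;
texDesc.normalizedCoords = 0;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

// in the kernel: float v = tex2D<float>(tex, x + 0.5f, y + 0.5f);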
19. SMEM: BANKING
KEPLER: 32-BIT AND 64-BIT MODES
special case: 2D smem usage (Fermi example)
__shared__ float smem_buffer[32][32 + 1];   // +1 column of padding avoids bank conflicts on column access
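As an illustration of why the extra column helps, a sketch of the classic 32x32 tile transpose with an assumed 32x8 thread block: threads fill the tile by rows and read it by columns, and the padding shifts each row to a different bank so the column reads are conflict-free:

__global__ void transpose32(const float* in, float* out, int width, int height)
{
    __shared__ float tile[32][32 + 1];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    for (int j = 0; j < 32; j += 8)                 // each thread handles 4 rows of the tile
        if (x < width && y + j < height)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;              // transposed block origin
    y = blockIdx.x * 32 + threadIdx.y;
    for (int j = 0; j < 32; j += 8)
        if (x < height && y + j < width)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}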
20. SMEM
The common techniques are:
use smem to improve memory access pattern
use smem for stencil processing
But the gap between smem and math throughput is increasing
Tesla: 16 (32 bit) banks vs 8 thread processors (2:1)
GF100: 32 (32 bit) banks vs 32 thread processors (1:1)
GF104: 32 (32 bit) banks vs 48 thread processors (2:3)
Kepler: 32 (64 bit) banks vs 192 thread processors (1:3)
Max size 48 KB (49152 B); assuming maximum occupancy (64 warps x 32 threads = 2048 threads),
that is 24 bytes per thread.
More intensive memory usage affects occupancy.
21. SMEM (CONT.)
smem + L1 share the same 64 KB. Program-configurable split:
Fermi: 48:16, 16:48
Kepler: 48:16, 16:48, 32:32
cudaDeviceSetCacheConfig(), cudaFuncSetCacheConfig() (see the snippet below)
prefer L1 for kernels with heavy lmem usage (spilling)
prefer smem for stencil kernels
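A minimal sketch of the configuration calls; stencil_kernel and spilling_kernel are illustrative kernel names:

cudaFuncSetCacheConfig(stencil_kernel,  cudaFuncCachePreferShared); // 48 KB smem : 16 KB L1
cudaFuncSetCacheConfig(spilling_kernel, cudaFuncCachePreferL1);     // 16 KB smem : 48 KB L1
cudaDeviceSetCacheConfig(cudaFuncCachePreferEqual);                 // Kepler: 32:32 split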
smem is often used for:
data sharing across the block
intra-block (inter-warp) communication
block-level buffers (for scan or reduction, see the sketch below)
stencil code
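A sketch of a block-level reduction buffer in smem (one of the uses listed above), assuming a power-of-two block of 256 threads:

__global__ void block_sum(const float* in, float* out, int n)
{
    __shared__ float buf[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    buf[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in smem
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];         // one partial sum per block
}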
22. LMEM
Local memory is the analogue of stack memory: call stack, register spilling. Note: local
memory reads and writes are cached in L1.
Registers are used for automatic variables
The volatile keyword forces spilling
Registers do not support indexing: local memory is used for dynamically indexed local arrays
Register spilling leads to more instructions and memory traffic
int a = 42;              // automatic variable: kept in a register
int b[SIZE] = { 0 };     // local array: spilled to lmem when indexed dynamically or when registers run out
23. SPILLING CONTROL
1. Use __launch_bounds__ to help the compiler pick the register budget per thread.
2. Compile with -maxrregcount to cap register usage per thread, forcing register spilling if
needed.
3. Otherwise, by default, high register usage means you run fewer concurrent warps per SM.
__global__ void __launch_bounds__(
    maxThreadsPerBlock, minBlocksPerMultiprocessor) kernel(...)
{
    // ...
}
25. CONTROL FLOW: PROBLEMS
Warp divergence: branching, early loop exit... Inspect SASS to find divergent pieces of
code
Workload is data dependent: code-path depends on input (like classification task)
Too much synchronization logic: intensive use of parallel data structures, lots of
atomics, __syncthreads(), etc.
Resident warps: occupy resources but do nothing
Big blocks: tail effect
26. CONTROL FLOW: SOLUTIONS
Understand your problem. Select best algorithm keeping in mind GPU architecture.
Maximize independent parallelism
The compiler generates branch predication (with -O3) when optimizing if/switch, but only if the
number of instructions is less than or equal to a given threshold: 7 if the condition is likely
to produce many divergent warps, 4 otherwise
Adjust thread block size
Try work queues
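A sketch of a simple work queue in the persistent-threads style: a global counter handed out with atomicAdd, so a thread that finishes a short item immediately grabs the next one instead of idling behind a long, data-dependent code path. process, workCounter and queue_kernel are illustrative names; a real implementation would usually grab work at warp or block granularity to reduce atomic traffic:

__device__ int workCounter = 0;          // reset (e.g. via cudaMemcpyToSymbol) before each launch

__device__ float process(int v)          // placeholder for a data-dependent amount of work
{
    float s = 0.0f;
    for (int k = 0; k < v; ++k) s += rsqrtf((float)(k + 1));
    return s;
}

__global__ void queue_kernel(const int* items, int numItems, float* out)
{
    while (true) {
        int i = atomicAdd(&workCounter, 1);   // grab the next work item
        if (i >= numItems) return;
        out[i] = process(items[i]);
    }
}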
27. KERNEL FUSION AND FISSION
Fusion
Replace a chain of kernel calls with a single fused one (see the sketch after this list)
Helps to save memory reads/writes. Intermediate results can be kept in registers
Enables further ILP optimizations
Kernels should have almost the same access pattern
Fission
Replace one kernel call with a chain
Helps to localize ineffective memory access patterns
Insert small kernels that repack data (e.g. integral image)
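An illustrative sketch of fusion: two elementwise kernels whose intermediate array x is written to gmem and read back are replaced by a single fused kernel that keeps the intermediate value in a register. Kernel names are assumptions:

__global__ void scale_kernel(const float* a, float* x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = s * a[i];
}
__global__ void add_kernel(const float* x, const float* b, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + b[i];
}

// fused: one gmem read of a and b, one write of y, no traffic at all for x
__global__ void scale_add_fused(const float* a, const float* b, float* y, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = s * a[i];       // intermediate result stays in a register
        y[i] = t + b[i];
    }
}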
28. TUNING BLOCK CONFIGURATION
Finding the optimal launch configuration is crucial for best performance. Launch
configuration affects occupancy:
low occupancy prevents full hardware utilization and reduces the ability to hide latency
high occupancy for kernels with large memory demands results in oversubscribed read or
write queues
Experiment to find the configuration (block and grid dimensions, amount of work per
thread) that is optimal for your kernel.
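Besides manual experiments, later toolkits (CUDA 6.5+) expose an occupancy API that can suggest a starting block size; a sketch with an illustrative my_kernel and element count n:

int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, my_kernel, 0, 0);
int gridSize = (n + blockSize - 1) / blockSize;      // enough blocks to cover n elements
// my_kernel<<<gridSize, blockSize>>>(d_in, d_out, n);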
30. FINAL WORDS
Basic CUDA Code Optimizations
use compiler flags
do not try to trick the compiler
use structure of arrays (see the SoA sketch after this list)
improve memory layout
load by cache line
process by row
cache data in registers
re-compute values instead of re-loading
keep data on GPU
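A sketch of the AoS versus SoA layouts behind the points above; type and kernel names are illustrative:

struct ParticleAoS { float x, y, z, w; };            // array of structures
struct ParticlesSoA { float *x, *y, *z, *w; };       // structure of arrays

__global__ void scale_x_aos(ParticleAoS* p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;                          // each lane strides 16 bytes: poor coalescing
}

__global__ void scale_x_soa(ParticlesSoA p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= s;                          // consecutive lanes, consecutive floats: coalesced
}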
31. FINAL WORDS
Conventional parallelization optimizations
use light-weight locking,
... atomics,
... and lock-free code.
minimize locking,
... memory fences,
... and volatile accesses.
32. FINAL WORDS
Conventional architectural optimizations
utilize shared memory,
... constant memory,
... streams,
... thread voting,
... and rsqrtf;
detect compute capability and number of SMs;
tune thread count,
... blocks per SM,
... launch bounds,
and L1 cache/shared memory configuration