3. DEVICE CODE OPTIMIZATION PRINCIPLE
The SIMT architecture makes the GPU an inherently high-latency machine, so
hiding latency is the only GPU-specific optimization principle
Typical latencies for Kepler generation
register writeback: ~10 cycles
L1: ~34 cycles
Texture L1: ~96 cycles
L2: ~160 cycles
Global memory: ~350 cycles
4. PERFORMANCE LIMITERS
Optimize for GPU ≃ Optimize for latency
Factors that prevent latency hiding:
Insufficient parallelism
Inefficient memory accesses
Inefficient control flow
5. THROUGHPUT & LATENCY
Throughput
is how many operations are performed in one cycle
Latency
is how many cycles the pipeline stalls before the next dependent operation can issue
Inventory
is the number of warps in flight, i.e. in the execution stage of the pipeline
6. LITTLE’S LAW
L = λ × W
Inventory (L) = Throughput (λ) × Latency (W)
Example: GPU with 8 operations per clock and 18 clock latency
7. LITTLE’S LAW: FFMA EXAMPLE
Fermi GF100
Throughput: 32 operations per clock (1 warp)
Latency: ~18 clocks
Maximum resident warps per SM: 24
Inventory: 1 × 18 = 18 warps in flight
Kepler GK110
Throughput: 128 (if no ILP) operations per clock (4 warps)
Latency: ~10 clocks
Maximum resident warps per SM: 64
Inventory: 4 × 10 = 40 warps in flight
Maxwell GM204
Throughput: 128 operations per clock (4 warps)
Latency: ~6 clocks
Maximum resident warps per SM: 64
Inventory: 4 × 6 = 24 warps in flight
8. TLP & ILP
Thread Level Parallelism
enabling factors:
sufficient number of warps in flight per SM
limiting factors:
bad launch configuration
resource-consuming kernels
poorly parallelized code
Instruction Level Parallelism
enabling factors:
independent instructions per warp
dual issue capabilities
limiting factors:
structural hazards
data hazards
9. IMPROVING TLP
Occupancy
is the actual number of warps running concurrently on a multiprocessor divided by the
maximum number of warps the hardware can run concurrently
Improve occupancy to achieve better TLP
Modern GPUs can keep up to 64 resident warps belonging to 16 (Kepler) / 32 (Maxwell)
blocks, BUT you need resources for them: registers, shared memory
Kepler has 64K × 32-bit registers and a 32-lane wide warp
65536 registers / 64 warps / 32 threads per warp = 32 registers/thread
10. IMPROVING ILP
Kernel unrolling: process more elements per thread, because operations on different
elements are independent
The device code compiler is fairly good at instruction reordering
Loop unrolling in device code increases the number of independent operations
Other techniques used for increasing ILP on CPUs also apply
__global__ void unrolled(const float *in, float *out)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int totalThreads = blockDim.x * gridDim.x;
    out[tid] = process(in[tid]);
    out[tid + totalThreads] = process(in[tid + totalThreads]);
}

#pragma unroll CONST_EXPRESSION
for (int i = 0; i < N_ITERATIONS; i++) { /* ... */ }
11. ILP ON MODERN GPUS
ILP is a must-have for older architectures, but it still helps to hide pipeline latencies on
modern GPUs
Maxwell: 4 warp schedulers, dual-issue each. 128 compute cores process up to 4 warps
per clock. Compute core utilization: 1.0
Kepler: 4 warp schedulers, dual-issue each. 192 compute cores process up to 6 warps
per clock. If there is no ILP, only 128 of 192 cores are used. Compute core utilization:
≈0.67
Fermi (sm_21): 2 warp schedulers, dual-issue each. 48 compute cores process 3 warps
every 2 clocks. If there is no ILP, only 32 of 48 cores are used. Compute core utilization:
≈0.67