NVIDIA GPU Architecture: From Fermi to Kepler
Ofer Rosenberg
Jan 21st, 2013
Scope
- This presentation covers the main features of the Fermi, Fermi refresh & Kepler architectures
- The overview is done from a compute perspective; as such, graphics features are not discussed (PolyMorph Engine, Raster, ROPs, etc.)
GF100 SM
- SM - Streaming Multiprocessor
- 32 "CUDA cores", organized into two clusters of 16 cores each
  - A warp is 32 threads - two cycles to complete a warp
  - NVIDIA's solution: the ALU clock is double the core clock
- 4 SFUs (accelerate transcendental functions)
- 16 Load/Store units
- Dual warp scheduler - executes two warps concurrently
  - Note the bottlenecks on LD/ST & SFU - an architectural decision
- Each SM can hold up to 48 warps, divided into up to 8 blocks
  - Holds "in-flight" warps to hide latency
  - Typically the number of blocks is lower; for example, 24 warps per block = 2 blocks per SM (worked through in the sketch below)
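To make the occupancy arithmetic concrete, here is a minimal sketch; the kernel, sizes, and launch shape are hypothetical, chosen to match the slide's 24-warps-per-block example:

```cuda
// Hypothetical illustration of the slide's occupancy arithmetic: a block of
// 768 threads is 768/32 = 24 warps, so an SM that holds 48 warps fits 2 blocks.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = tid * 2.0f;  // trivial work; the point is the launch shape
}

int main() {
    const int threadsPerWarp  = 32;   // warp width on all NVIDIA GPUs
    const int threadsPerBlock = 768;  // 768 / 32 = 24 warps per block
    const int maxWarpsPerSM   = 48;   // GF100 limit from the slide

    int warpsPerBlock = threadsPerBlock / threadsPerWarp;  // 24
    int blocksPerSM   = maxWarpsPerSM / warpsPerBlock;     // 2 resident blocks
    printf("%d warps/block -> %d resident blocks per SM\n", warpsPerBlock, blocksPerSM);

    float *d_out;
    cudaMalloc(&d_out, 16 * threadsPerBlock * sizeof(float));
    dummyKernel<<<16, threadsPerBlock>>>(d_out);  // 16 blocks spread over the SMs
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```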
Packing it all together
- GPC - Graphics Processing Cluster
- Four SMs
- Transparent to compute usages
Packing it all together
- Four GPCs
- 768KB L2, shared between the SMs
  - Supports L2-only or L1&L2 caching
- 384-bit GDDR5
- GigaThread scheduler
  - Schedules thread blocks to SMs
  - Concurrent kernel execution - separate kernels per SM (see the stream sketch below)
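Concurrent kernel execution is exposed to the programmer through CUDA streams. A minimal sketch, assuming a Fermi-class device; kernelA/kernelB are hypothetical placeholders:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

int main() {
    float *d_x, *d_y;
    cudaMalloc(&d_x, 256 * sizeof(float));
    cudaMalloc(&d_y, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent kernels in different non-default streams are eligible to
    // run concurrently; the GigaThread scheduler can place each on idle SMs.
    kernelA<<<1, 256, 0, s1>>>(d_x);
    kernelB<<<1, 256, 0, s2>>>(d_y);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```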
Fermi GF104 SM
- Changes from GF100 SM:
  - 48 "CUDA cores", organized into three clusters of 16 cores each
  - 8 SFUs instead of 4
  - The rest remains the same (32K 32-bit registers, 64K L1/Shared, etc.)
- Wait a sec... three clusters, but still scheduling two warps?
- An under-utilization study of GF100 led to a scheduling redesign - next slide...
Instruction Level Parallelism (ILP)
- GF100:
  - Two warp schedulers feed two clusters of cores
  - Memory access or SFU access leads to underutilization of the core clusters
- GF104:
  - Adopt the ILP idea from the CPU world - issue two instructions per clock
  - Add a third cluster for balanced utilization (sketched below)
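At the source level, ILP just means giving each thread independent instructions to overlap. A minimal hypothetical kernel; the compiler and the GF104 dual-dispatch scheduler decide the actual issue order:

```cuda
// Two independent dependency chains inside one thread: a dual-dispatch
// scheduler can issue an instruction from each chain in the same clock
// instead of idling while a single chain waits on its previous result.
__global__ void ilpExample(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / 2) {
        float a = in[2 * i];      // chain 1
        float b = in[2 * i + 1];  // chain 2, independent of chain 1
        a = a * 1.5f + 0.5f;      // FMA on chain 1
        b = b * 2.5f + 1.5f;      // FMA on chain 2, can issue alongside
        out[2 * i]     = a;
        out[2 * i + 1] = b;
    }
}
```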
Meet GK104 SMX
- 192 "CUDA cores", organized into 6 clusters of 32 cores each
- No more "dual-clocked ALU"
- 16 Load/Store units
- 16 SFUs
- 64K 32-bit registers
- Same 64K L1/Shared
- Same dual-issue warp scheduling:
  - Executes 4 warps concurrently
  - Issues two instructions per cycle (per scheduler)
- Each SMX can hold up to 64 warps, divided into up to 16 blocks
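Later CUDA releases let you query these residency limits at runtime. A hedged sketch: cudaOccupancyMaxActiveBlocksPerMultiprocessor shipped in CUDA 6.5, after this presentation, but it reports exactly the warp/block limits discussed here; the `work` kernel is a hypothetical placeholder:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *x) { x[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

int main() {
    const int blockSize = 128;  // 4 warps per block
    int numBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, work, blockSize, 0);
    // On an SMX limited to 64 warps and 16 blocks, 4-warp blocks hit both
    // limits at once: 16 blocks * 4 warps = 64 warps.
    printf("Resident blocks per multiprocessor: %d\n", numBlocks);
    return 0;
}
```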
From GF104 to GK104 - look at half of an SMX
- Same (GF104 SM vs. half of a GK104 SMX):
  - Two warp schedulers
  - Two dispatch units per scheduler
  - 32K register file
  - 6 rows of cores
  - 1 row of load/store units
  - 1 row of SFUs
- Different:
  - On SMX, a row of cores is 16 wide vs. 8 on SM
  - On SMX, a row of SFUs is 16 wide vs. 8 on SM
Packing it all together
- Four GPCs, each with two SMXs
- 512KB L2, shared between the SMXs
- L1 is no longer used for CUDA
- 256-bit GDDR5
- GigaThread scheduler
- Dynamic Parallelism (see the sketch below)
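Dynamic Parallelism lets a kernel launch child kernels from the device, without a host round-trip. A minimal sketch; note that this feature actually requires compute capability 3.5, i.e. the GK110 part discussed later, and separate compilation (e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt):

```cuda
__global__ void child(int *data) {
    data[threadIdx.x] *= 2;  // trivial child work
}

__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        // Device-side launch: the GigaThread engine schedules the child
        // grid just like host-launched work, no host involvement needed.
        child<<<1, 32>>>(data);
        cudaDeviceSynchronize();  // device-side wait for the child grid
    }
}
```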
GK104 vs. GF104
- Kepler has fewer "multiprocessors": 8 vs. 16
  - Less flexible at executing different kernels concurrently
- Each "multiprocessor" is stronger:
  - Issues twice the warps concurrently (4 vs. 2), with twice the core clusters (6 vs. 3)
  - Twice the register file
  - Executes a warp in a single cycle
  - More SFUs
  - 10x faster atomic operations
- But:
  - An SMX holds 64 warps vs. 48 for an SM - less latency hiding per core cluster
  - L1/Shared memory stayed the same size - and is totally bypassed in CUDA/OpenCL (see the shared-memory sketch below)
  - Memory BW did not scale as compute/cores did (192GB/s, same as in GF110)
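Because Kepler's L1 no longer caches global loads, data reuse has to be staged explicitly in shared memory. A minimal hypothetical sketch, assuming 256-thread blocks:

```cuda
// Each block copies its tile from global memory into __shared__ storage once;
// subsequent neighbor accesses hit on-chip shared memory instead of DRAM/L2.
__global__ void tileReuse(const float *in, float *out, int n) {
    __shared__ float tile[256];  // matches the 256-thread block size
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) tile[threadIdx.x] = in[gid];  // one global load per element
    __syncthreads();  // whole block reaches the barrier, no divergence around it
    if (gid < n) {
        int left = (threadIdx.x == 0) ? 0 : threadIdx.x - 1;
        out[gid] = 0.5f * (tile[threadIdx.x] + tile[left]);  // on-chip reuse
    }
}
```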
GK110 SMX
- Tesla only (no GeForce version)
- Very similar to the GK104 SMX
- Additional double-precision units; otherwise the same
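A tiny example of code that exercises those double-precision units; a hypothetical daxpy kernel, not from the presentation. The same source compiles for GK104, but FP64 throughput there is much lower because GK104 has far fewer FP64 units per SMX:

```cuda
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // FP64 fused multiply-add
}
```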