From Fermi to Kepler



The presentation covers NVIDIA GPU architectures: Fermi, Fermi refresh and Kepler

Published in: Technology
  • You're absolutely right - I missed that! Thanks for the correction. I'll upload an updated presentation soon.
  • The drawing is absolutely fine, Rosen. But GTX 680 has 32 LD/ST and 32 SFU units per SMX.
  • Thanks. Glad that you like it.
    As for your question - that's right. It's even visible in the drawing of GK104 SMX which appears on slide 12. This drawing is taken from NVIDIA's GTX680 white paper
  • Nice presentation. Thanks :-)
    BTW, is it correct to say GK104 has 16 LD/ST units and 16 SFUs?



  1. NVIDIA GPU Architecture: From Fermi to Kepler. Ofer Rosenberg, Jan 21st 2013
  2. Scope
     This presentation covers the main features of the Fermi, Fermi refresh, and Kepler architectures. The overview is done from a compute perspective, so graphics features (PolyMorph Engine, Raster, ROPs, etc.) are not discussed.
  3. Quick Numbers
                       GTX 480       GTX 580       GTX 680
     Architecture      GF100         GF110         GK104
     SM / SMX          15            16            8
     CUDA cores        480           512           1536
     Core frequency    700 MHz       772 MHz       1006 MHz
     Compute power     1345 GFLOPS   1581 GFLOPS   3090 GFLOPS
     Memory BW         177.4 GB/s    192.2 GB/s    192.2 GB/s
     Transistors       3.2B          3.0B          3.5B
     Technology        40 nm         40 nm         28 nm
     Power             250 W         244 W         195 W
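The compute-power column follows directly from the core counts and ALU clocks. A quick sanity check in Python, assuming one FMA (2 FLOPs) per core per ALU cycle, and using the Fermi parts' "hot clock" (twice the core clock: 1401 MHz on GTX 480, 1544 MHz on GTX 580), since Kepler's ALUs run at the core clock:

```python
def peak_gflops(cuda_cores, alu_clock_mhz):
    """Peak single-precision GFLOPS: one FMA (2 FLOPs) per core per ALU cycle."""
    return cuda_cores * alu_clock_mhz * 2 / 1000

# Fermi ALUs run at 2x the core clock; Kepler ALUs run at the core clock.
print(round(peak_gflops(480, 1401)))    # GTX 480  -> 1345
print(round(peak_gflops(512, 1544)))    # GTX 580  -> 1581
print(round(peak_gflops(1536, 1006)))   # GTX 680  -> 3090
```

All three results match the table above, which confirms that the "dual-clocked ALU" is already baked into Fermi's GFLOPS figures.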
  4. GF100 SM
     SM - Stream Multiprocessor
     • 32 "CUDA cores", organized into two clusters of 16 cores each
     • A warp is 32 threads, so it takes two cycles to complete a warp
       - NVIDIA's solution: the ALU clock is double the core clock
     • 4 SFUs (accelerate transcendental functions)
     • 16 Load/Store units
     • Dual warp scheduler - executes two warps concurrently
       - Note the bottlenecks on LD/ST & SFU - an architecture decision
     • Each SM can hold up to 48 warps, divided into up to 8 blocks
       - Holds "in-flight" warps to hide latency
       - Typically the number of blocks is lower; for example, 24 warps per block = 2 blocks per SM
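The arithmetic in the last bullet can be sketched as a small helper. This is a simplification: real occupancy is also capped by register-file and shared-memory usage, which this sketch ignores.

```python
def resident_blocks_per_sm(warps_per_block, max_warps=48, max_blocks=8):
    """Blocks that fit on one GF100 SM, limited by warp capacity and block slots.

    Ignores register and shared-memory pressure, which can lower this further.
    """
    return min(max_warps // warps_per_block, max_blocks)

print(resident_blocks_per_sm(24))  # 48 // 24 = 2 blocks, the slide's example
print(resident_blocks_per_sm(4))   # warp capacity would allow 12, but block slots cap it at 8
```

For small blocks the 8-block slot limit bites first, which is why very small thread blocks can underutilize the warp capacity of a Fermi SM.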
  5. Packing it all together
     GPC - Graphics Processing Cluster
     • Four SMs
     • Transparent to compute usages
  6. Packing it all together
     • Four GPCs
     • 768K L2 shared between SMs
       - Supports L2-only or L1 & L2 caching
     • 384-bit GDDR5
     • GigaThread scheduler
       - Schedules thread blocks to SMs
       - Concurrent kernel execution: separate kernels per SM
  7. Fermi GF104 SM
     Changes from GF100 SM:
     • 48 "CUDA cores", organized into three clusters of 16 cores each
     • 8 SFUs instead of 4
     • The rest remains the same (32K 32-bit registers, 64K L1/shared, etc.)
     Wait a sec... three clusters, but still scheduling two warps?
     An under-utilization study of GF100 led to a scheduling redesign - next slide...
  8. Instruction Level Parallelism (ILP)
     GF100:
     • Two warp schedulers feed two clusters of cores
     • Memory access or SFU access leads to underutilization of the core clusters
     GF104:
     • Adopts the ILP idea from the CPU world: issue two instructions per clock
     • Adds a third cluster for balanced utilization
  9. Meet the GK104 SMX
     • 192 "CUDA cores", organized into 6 clusters of 32 cores each
       - No more "dual-clocked ALU"
     • 32 Load/Store units
     • 32 SFUs
     • 64K 32-bit registers
     • Same 64K L1/shared
     • Same dual-issue warp scheduling:
       - Executes 4 warps concurrently
       - Issues two instructions per cycle
     • Each SMX can hold up to 64 warps, divided into up to 16 blocks
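One consequence of the 32-wide clusters: a 32-thread warp now completes in a single core-clock cycle, where Fermi's 16-wide clusters needed the double-pumped ALU clock to keep pace. A back-of-the-envelope comparison of ALU lane width per core clock, expressed in warps (unit counts taken from the slides; this is not a cycle-accurate model):

```python
WARP_SIZE = 32

def warps_of_alu_width_per_core_clock(cuda_cores, alu_clock_multiplier):
    # ALU lanes available per core-clock cycle, expressed in warp-widths
    return cuda_cores * alu_clock_multiplier / WARP_SIZE

print(warps_of_alu_width_per_core_clock(32, 2))   # GF100 SM:  2.0
print(warps_of_alu_width_per_core_clock(48, 2))   # GF104 SM:  3.0
print(warps_of_alu_width_per_core_clock(192, 1))  # GK104 SMX: 6.0
```

The 6-vs-3 ratio here is the same "issues twice the warps" claim that appears on the GK104 vs. GF104 comparison slide below.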
  10. From GF104 to GK104
      Compare half of an SMX to a GF104 SM.
      Same:
      • Two warp schedulers
      • Two dispatch units per scheduler
      • 32K register file
      • 6 rows of cores
      • 1 row of load/store units
      • 1 row of SFUs
      Different:
      • On the SMX, a row of cores is 16 wide vs. 8 on the SM
      • On the SMX, a row of SFUs is 16 wide vs. 8 on the SM
  11. Packing it all together
      • Four GPCs, each with two SMXs
      • 512K L2 shared between SMXs
        - L1 is no longer used for CUDA
      • 256-bit GDDR5
      • GigaThread scheduler
        - Dynamic Parallelism
  12. GK104 vs. GF104
      Kepler has fewer "multiprocessors":
      • 8 vs. 16
      • Less flexible at executing different kernels concurrently
      Each "multiprocessor" is stronger:
      • Issues twice the warps (6 vs. 3)
      • Twice the register file
      • Executes a warp in a single cycle
      • More SFUs
      • 10x faster atomic operations
      But:
      • An SMX holds 64 warps vs. 48 for an SM - less latency hiding per warp cluster
      • L1/shared memory stayed the same size - and is totally bypassed in CUDA/OpenCL
      • Memory BW did not scale as compute/cores did (192 GB/s, same as on GF110)
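The last bullet is the key scaling concern. It can be expressed as the DRAM bandwidth available per FLOP, using the figures from the Quick Numbers slide:

```python
def bytes_per_flop(mem_bw_gb_s, peak_gflops):
    # Bytes of DRAM bandwidth available per single-precision FLOP
    return mem_bw_gb_s / peak_gflops

print(f"GTX 580: {bytes_per_flop(192.2, 1581):.3f} B/FLOP")  # ~0.122
print(f"GTX 680: {bytes_per_flop(192.2, 3090):.3f} B/FLOP")  # ~0.062
```

Per-FLOP bandwidth roughly halved from GF110 to GK104, so a kernel must do about twice as many operations per byte fetched to remain compute-bound on Kepler.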
  13. GK110 SMX
      • Tesla only (no GeForce version)
      • Very similar to the GK104 SMX
      • Additional double-precision units; otherwise the same
  14. GK110
      • Production versions: 14 & 13 SMXs (not 15)
      • Improved device-level scheduling (next slides):
        - Hyper-Q
        - Dynamic Parallelism
  15. Improved scheduling 1 - Hyper-Q
      • Scenario: multiple CPU processes send work to the GPU
      • On Fermi: time division between processes
      • On Kepler: simultaneous processing from multiple processes
  16. Improved scheduling 2 - Dynamic Parallelism
      A new age in GPU programmability: moving from a master-slave pattern to self-feeding
  17. Questions?