GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah

  1. THE AMD GCN ARCHITECTURE: A CRASH COURSE • LAYLA MAH (@MissQuickstep), DEVELOPER TECHNOLOGY ENGINEER – LAYLA.MAH@AMD.COM
  2. AGENDA • Part 1: A Brief History of GPU Evolution • Part 2: Introduction to Graphics Core Next (GCN) • Part 3: Anatomy of a GCN Compute Unit (CU) • Part 4: GCN Shader: Arbitration, Examples & Tips • Part 5: GCN Memory Hierarchy • Part 6: GCN Compute Architecture (ACE) • Part 7: GCN Fixed Function Units (CP, GeometryEngine, Rasterizer, RBE, …) • Part 8: Main Takeaways & Conclusion • Bonus Slides: Tiled Resources, Partially Resident Textures (PRT) [Footer: A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13]
  3. GPU EVOLUTION • 1ST ERA: Fixed Function (3D Geometry Transformation, Lighting) • 2ND ERA: Simple Shaders • 3RD ERA: Graphics Parallel Core (VLIW5, VLIW4) [Diagram: VLIW5 block with Stream Processing Units, FMAD + Special Functions unit, Branch Unit, and General Purpose Registers; VLIW4 block with Stream Processing Units, Branch Unit, and General Purpose Registers]
  4. GPU EVOLUTION [Diagram: same three-era overview as slide 3]
  5. GPU EVOLUTION • 1ST ERA: Fixed Function (Prior to 2002) • Graphics-specific hardware ‒ Texture mapping/filtering ‒ Geometry processing ‒ Rasterization • Transform & Lighting (T&L) Engines ‒ Fixed function lighting equations ‒ Multi-texturing • Dedicated texture and pixel caches • Dot product and scalar multiply-add ‒ Sufficient for basic graphics tasks • No general purpose compute capability
  6. GPU EVOLUTION • 2ND ERA: Simple Shaders [Diagram: Memory Interface, 8 Vertex Pipes, Setup Engine, Pixel Shader Core, 16 Pixel Pipes]
  7. GPU EVOLUTION • 2ND ERA: Simple Shaders (2002-2006): The Rise of Shaders • Graphics Programmability – Direct3D 8/9, OpenGL 2.0 • Floating point processing ‒ IEEE not required ‒ Different precision per IHV: ATI 24-bit full-speed; NV 16-bit full-speed; NV 32-bit half-speed • Specialized shader units for vertex & pixel processing • Added dedicated caches • Shader Models 1.0 - 2.0 ‒ VS and PS are distinct ‒ Minimal Instruction Sets ‒ Limited Instruction Slots ‒ Limited Shader Lengths ‒ No DYNAMIC Flow Control ‒ No Looping Constructs ‒ No Vertex Texture Fetch ‒ No Bitwise Operators ‒ No Native Integer ALU ‒ […] [Diagram: Memory Interface, 8 Vertex Pipes, Setup Engine, Pixel Shader Core, 16 Pixel Pipes]
  8. GPU EVOLUTION [Diagram: same three-era overview as slide 3]
  9. GPU EVOLUTION • 3RD ERA: Graphics Parallel Core: The Rise of The Unified Shader (VLIW-5) • 5-Element Very-Long-Instruction-Word (XYZWT) ‒ Ideal for 4-element Vector and 4x4 Matrix Operations ‒ Vector/Vector math in a single instruction ‒ Plus One Transcendental-Unit function per Instruction ‒ Began with XENOS and utilized from R600 until “Cayman” ‒ Flexible and optimized for Graphics workloads • Single Precision 32-bit IEEE-Compliant Floating Point ALUs • More advanced caching ‒ Instruction, constant, multi-level texture/data, & later: LDS/GDS • More flexible: Unified ALU, Branch Unit, Dynamic Flow Control, Vertex Texture, Geometry Shader, Tessellation Engines, etc.
  10. GPU EVOLUTION [Diagram: same three-era overview as slide 3]
  11. GPU EVOLUTION • 3RD ERA: Graphics Parallel Core: Optimized For Die Area Efficiency (VLIW-4) • 4-Element Very-Long-Instruction-Word (XYZW) ‒ Profiling showed average VLIW utilization was < 3.4/5 ‒ Removed dedicated T-Unit – Optimized die area usage ‒ Each ALU has a smaller LUT ‒ Combined using 3-term Lagrange polynomial interpolation across multiple ALU ‒ Still ideal for 4-element Vector and 4x4 Matrix Operations ‒ Fewer ALU bubbles in transcendental-light code, better utilization ‒ Simplified programming and optimization relative to VLIW-5 • Better optimized for combination of Graphics & Compute ‒ Graphics is still the primary focus, but compute is gaining attention ‒ Multiple dispatch processors & separate command queues • Improved support for DirectCompute™ and OpenCL™
  12. GPU EVOLUTION: VLIW4 SIMD vs. GCN Quad SIMD-16 • VLIW4 SIMD ‒ 64 Single Precision multiply-add (per-clock) ‒ 16 SIMDs × ( 1 VLIW inst × 4 ALU ops ) ‒ 1 VLIW inst containing 4 ALU ops (per-clock) ‒ Needs 4 parallel ALU ops to fill each VLIW inst ‒ Compiler manages register port conflicts ‒ Specialized, complex compiler scheduling ‒ Difficult assembly creation, analysis, and debug ‒ Complicated tool chain support ‒ Careful optimization req. for peak performance • GCN Quad SIMD-16 ‒ 64 Single Precision multiply-add (per-clock) ‒ 4 SIMDs × ( 1 ALU op × 16 threads ) ‒ 4 ALU ops (from different wavefronts) / clock ‒ Needs 4+ wavefronts to keep SIMD lanes full ‒ No register port conflicts ‒ Standard compiler scheduling & optimizations ‒ Simplified assembly creation, analysis, & debug ‒ Simplified tool chain development and support ‒ Stable and predictable performance [Diagram: one VLIW4 SIMD with lanes 0-15 beside GCN SIMD 0-3, each with lanes 0-15]
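
The peak-rate parity on slide 12 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the "3 of 4 slots filled" case is a hypothetical workload, not a figure from the talk.

```python
# Per-clock multiply-add (MAD) throughput of the two layouts on slide 12.
VLIW4_SIMDS = 16          # 16 SIMDs x (1 VLIW inst x 4 ALU ops)
VLIW4_SLOTS = 4
GCN_SIMDS = 4             # 4 SIMDs x (1 ALU op x 16 threads)
GCN_LANES = 16


def vliw4_mads_per_clock(filled_slots=VLIW4_SLOTS):
    """MADs per clock when each VLIW bundle has `filled_slots` of 4 slots used."""
    if not 0 <= filled_slots <= VLIW4_SLOTS:
        raise ValueError("a VLIW4 bundle has between 0 and 4 filled slots")
    return VLIW4_SIMDS * filled_slots


def gcn_mads_per_clock():
    """MADs per clock with all 16 lanes of all 4 SIMDs busy."""
    return GCN_SIMDS * GCN_LANES


print(vliw4_mads_per_clock())    # peak, every bundle full
print(gcn_mads_per_clock())      # peak, every lane busy
print(vliw4_mads_per_clock(3))   # hypothetical: compiler finds only 3 parallel ops
```

Both layouts reach the same peak; the difference the slide stresses is what it takes to get there (4 independent ops per thread for VLIW4 versus 4 or more resident wavefronts for GCN).
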
  13. AMD GRAPHICS CORE NEXT
  14. MANTLE • New low level programming interface for PCs • Designed in collaboration with top game developers • Lightweight driver that allows direct access to GPU hardware • Compatible with DirectX® HLSL for simplified porting • Works with all Graphics Core Next GPUs • Sessions: GS-4112 – Mantle: Empowering 3D Graphics Innovation; Keynote – Johan Andersson, Technical Director, EA; GS-4145 – Oxide on Mantle Adoption (Wed 5:00-5:45) [Diagram: Graphics Applications, Mantle API, Mantle Driver, GCN]
  15. AMD GRAPHICS CORE NEXT ARCHITECTURE: A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING • Faster performance • Higher efficiency • New graphics features • New compute features
  16. AMD GRAPHICS CORE NEXT ARCHITECTURE: A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING • Cutting-edge graphics performance and features • High compute density with multi-tasking • Built for power efficiency • Optimized for heterogeneous computing • Enabling the Heterogeneous System Architecture (HSA) • Amazing scalability and flexibility
  17. AMD GRAPHICS CORE NEXT ARCHITECTURE: A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING • Unlimited Resources & Samplers • All UAV formats can be read/write • Simpler Assembly Language • Simpler Shader Code • Ability to support C/C++ (like) • Architectural support for traps, exceptions & debugging • Ability to share virtual x86-64 address space with CPU cores
  18. AMD GRAPHICS CORE NEXT ARCHITECTURE: A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING • AMD TECHNOLOGY POWERS NEXT-GEN CONSOLES • NEW NEXT-GEN GAME CONSOLES RAISE THE BAR FOR GRAPHICS PERFORMANCE ‒ PERFORMANCE: TFLOPS-CLASS COMPUTE POWER ‒ MEMORY: 16X MORE MEMORY* (*Based on PlayStation 3 512MB vs. PlayStation 4 8192MB GDDR5.)
  19. GCN COMPUTE UNIT: A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING • CU = Basic Building Block of GPU Computational Power • New Instruction Set Architecture ‒ Non-VLIW ‒ Vector unit + scalar co-processor ‒ Distributed programmable scheduler • Each CU can execute instructions from multiple kernels at once • Increased instructions per clock per mm² ‒ High utilization ‒ High throughput ‒ Multi-tasking [Diagram: Branch & Message Unit, Scheduler, Scalar Unit, Vector Units (4x SIMD-16), Texture Filter Units (4), Texture Fetch Load / Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (8KB), L1 Cache (16KB)]
  20. GCN COMPUTE UNIT • Scalar Unit ‒ Fully Programmable ‒ Used for flow control, pointer arithmetic, etc. ‒ Has own GPR pool, scalar data cache, etc. ‒ Shared by all threads of a wavefront ‒ 8KB Scalar General Purpose Registers (SGPR) ‒ 16KB 4-CU Shared R/O L1 Scalar Data Cache • Branch & Message Unit ‒ Executes Branch instructions (as dispatched by Scalar unit) • 4x Vector Units (16-lane SIMD) ‒ CU Total Throughput: 64 Single-Precision (SP) ops/clock ‒ 1 SP (Single-Precision) operation per 4 clocks ‒ 1 DP (Double-Precision) ADD in 8 clocks ‒ 1 DP MUL/FMA/Transcendental per 16 clocks* ‒ 4x64KB Vector Registers (VGPR) • 64KB Local Data Share (LDS) ‒ 2x Larger than D3D11 TGSM Limit (32k/thread group) ‒ 32 banks, with Conflict Resolution ‒ Bandwidth Amplification • 16KB Read/Write L1 Vector Data Cache ‒ Attached to texture units (acts as texture cache) • Hardware Scheduler (up-to 2560 Threads) ‒ Separate decode/issue for: VALU, SALU, SMEM, VMEM, LDS, GDS/EXPORT ‒ + Special Instructions (NOPs, barriers, etc.) and branch instructions ‒ Separate Instruction Decode ‒ 16 Hardware Barriers • 4 Texture Filter Units • 16 Texture Fetch Load/Store Units
  21. GCN COMPUTE UNIT: SIMD SPECIFICS • Each Compute Unit (CU) contains 4 SIMD; each SIMD has: ‒ A 16-lane IEEE-754 vector ALU (VALU) ‒ 64KB of vector register file (VGPR) ‒ Its own 40-bit (48-bit on HSA APUs) Program Counter (PC) ‒ Instruction buffer for 10 wavefronts* • *A wavefront is a group of 64 threads: the size of one logical VGPR
  22. GCN COMPUTE UNIT: SCALAR UNIT SPECIFICS • Fully Programmable Scalar Unit replaces FF Branch Logic • Operations such as JMP [GPR] are now supported • Opens the door to e.g. virtual function calls • Has its own GPR pool and can execute normal ALU code • 64-bit bitwise ops to mask thread execution • 32-bit bitwise and integer arithmetic operations at full-speed • Potential to offload uniform code (Vector ALU → Scalar ALU) • A GCN CU can dispatch 1 scalar op/clock [Diagram: SIMD 0-3, each with lanes 0-15, plus the Scalar Unit]
  23. GCN COMPUTE UNIT: SCALAR UNIT SPECIFICS, CONTINUED • Natively a 64-bit integer ALU • Independent arbitration and instruction decode • One ALU, memory or control flow op per cycle • 512 Scalar GPR per SIMD shared between waves • { SGPRn+1, SGPRn } pair provides a 64-bit register • 4 CU Shared Read Only Scalar Data Cache: 16KB – 64B lines • 4-way assoc, LRU replacement policy • Peak Bandwidth per CU is 16 bytes/cycle [Diagram: Scalar Unit with Scalar Decode, Integer ALU, and 8KB Registers; 4 CU Shared 16KB Scalar R/O L1; R/W L2]
  24. GCN COMPUTE UNIT: BRANCH & MESSAGE UNIT • Independent scalar assist unit to handle special classes of instructions concurrently • Branch ‒ Unconditional Branch (s_branch) ‒ Conditional Branch (s_cbranch_<cond>) ‒ Condition: SCC == 0, SCC == 1, EXEC == 0, EXEC != 0, VCC == 0, VCC != 0 ‒ 16-bit signed immediate dword offset from PC provided • Messages ‒ s_sendmsg ‒ CPU interrupt with optional halt (with shader supplied code and source) ‒ debug message (perf trace data, halt, etc.) ‒ special graphics synchronization messages
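
As a rough illustration of the branch encoding on slide 24, the sketch below computes a branch target from a 16-bit signed dword offset. The slide only says the offset is relative to the PC; treating it as relative to the instruction after the branch (the `+ 4`) is an assumption made here, and the `state` dictionary and function names are hypothetical.

```python
# Conditions listed on the slide, keyed by the mnemonic suffix of s_cbranch_<cond>.
CONDITIONS = {
    "scc0": lambda s: s["scc"] == 0,
    "scc1": lambda s: s["scc"] == 1,
    "execz": lambda s: s["exec"] == 0,
    "execnz": lambda s: s["exec"] != 0,
    "vccz": lambda s: s["vcc"] == 0,
    "vccnz": lambda s: s["vcc"] != 0,
}


def sign_extend_16(value):
    """Interpret the low 16 bits of `value` as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, simm16):
    """Byte address reached by a taken branch at byte address `pc`.

    Assumption: the dword offset is applied to the address of the next
    instruction (pc + 4); the slide does not spell this out.
    """
    return pc + 4 + sign_extend_16(simm16) * 4


def next_pc(pc, simm16, cond, state):
    """PC after an s_cbranch_<cond>; falls through when the condition fails."""
    taken = CONDITIONS[cond](state)
    return branch_target(pc, simm16) if taken else pc + 4


state = {"scc": 0, "vcc": 0, "exec": 0xFFFF_FFFF_FFFF_FFFF}
print(hex(next_pc(0x100, 3, "vccz", state)))    # taken: VCC is zero
print(hex(next_pc(0x100, 3, "execz", state)))   # not taken: EXEC is non-zero
```
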
  25. GCN COMPUTE UNIT: MEMORY SPECIFICS • Each CU has its own dedicated L1 cache and LDS memory ‒ Both global and shared memory atomics are supported • 64KB Local Data Share (LDS) ‒ 32 banks, with conflict resolution • 16KB R/W L1 Vector D-Cache ‒ Vector L1 Read/Write data cache shared with TMU as texture cache • Scalar Unit ‒ 16KB 4-CU Shared R/O Scalar L1 ‒ Scalar L1 Read-Only data cache is shared between 4 neighbor CU • 16 hardware barriers per CU ‒ 16 work group barriers supported per CU • A GCN GPU with 44 CU, such as the AMD Radeon™ R9 290x, can be working on up-to 112,640 work items at a time!
  26. GCN COMPUTE UNIT: SCHEDULER SPECIFICS • Each CU has its own dedicated Scheduler unit • Each CU can have 40 waves in-flight ‒ Each potentially from a different kernel • Scheduler Limits: ‒ Supports up-to 2560 threads per CU (64 threads x 10 waves x 4 SIMD) ‒ 10 wavefronts per SIMD ‒ 40 wavefronts per CU ‒ Limited by available GPR count ‒ Limited by available LDS memory ‒ 16 hardware barriers per CU • All threads within a workgroup are guaranteed to reside on the same CU simultaneously ‒ A set of synchronization primitives and shared memory allow data to be passed between threads in a workgroup • Optimized for throughput – latency is hidden by overlapping execution of wavefronts • A GCN GPU with 44 CU, such as the AMD Radeon™ R9 290x, can be working on up-to 112,640 work items at a time!
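
The scheduler limits on slide 26 multiply out as follows. This is a small sketch of the arithmetic only; the LDS-limited helper and its example workgroup (16KB of LDS, 4 waves per group) are hypothetical illustrations of the "limited by available LDS memory" bullet, not figures from the talk.

```python
THREADS_PER_WAVE = 64
WAVES_PER_SIMD = 10
SIMDS_PER_CU = 4
LDS_BYTES_PER_CU = 64 * 1024


def max_threads_per_cu():
    """64 threads x 10 waves x 4 SIMD, as stated on the slide."""
    return THREADS_PER_WAVE * WAVES_PER_SIMD * SIMDS_PER_CU


def max_work_items(num_cus):
    """Upper bound on work items resident across `num_cus` compute units."""
    return num_cus * max_threads_per_cu()


def waves_limited_by_lds(lds_bytes_per_group, waves_per_group):
    """Resident waves per CU when each workgroup reserves some LDS."""
    wave_cap = WAVES_PER_SIMD * SIMDS_PER_CU          # 40 waves per CU
    if lds_bytes_per_group == 0:
        return wave_cap                               # LDS is not the limiter
    groups = LDS_BYTES_PER_CU // lds_bytes_per_group  # whole groups that fit
    return min(wave_cap, groups * waves_per_group)


print(max_threads_per_cu())                  # threads per CU
print(max_work_items(44))                    # the slide's 44-CU R9 290X figure
print(waves_limited_by_lds(16 * 1024, 4))    # hypothetical 16KB-per-group kernel
```
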
  27. GCN COMPUTE UNIT: SCHEDULER SPECIFICS, ARBITRATION & DECODE • CU is guaranteed to issue instructions for a wave sequentially ‒ Predication & control flow allow any single work-item to take a unique execution path • For a CU, every clock, waves on 1 SIMD are considered for issue ‒ Round-Robin scheduling algorithm • Maximum 5 instructions per cycle ‒ Not including “internal” instructions • Instruction Types: ‒ 1 Vector Arithmetic Logic Unit (VALU) ‒ 1 Scalar ALU or Scalar Memory (SALU)|(SMEM) ‒ 1 Vector Memory (Read/Write/Atomic) (VMEM) ‒ 1 Branch/Message (e.g. s_branch, s_cbranch) ‒ 1 Local Data Share (LDS) ‒ 1 Export or Global Data Share (GDS) ‒ 10 Special/Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio) – [no functional unit] • At most, 1 instruction from each category may be issued • At most, 1 instruction per wave may be issued • Theoretical maximum of 5 instructions per cycle per CU
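
The issue rules on slide 27 can be modelled with a toy arbiter. This is a minimal sketch under stated assumptions: the input format (`next_instr[simd][wave]`) is hypothetical, internal instructions are ignored, and picking waves in index order is a simplification, since the slide does not say how the hardware chooses among eligible waves.

```python
# Issue categories from the slide; "internal" instructions are left out
# because they do not occupy a functional unit.
CATEGORIES = ("VALU", "SALU/SMEM", "VMEM", "BRANCH/MSG", "LDS", "EXPORT/GDS")
MAX_ISSUE_PER_CYCLE = 5
NUM_SIMDS = 4


def issue_cycle(cycle, next_instr):
    """Pick the instructions issued in one clock.

    next_instr[simd][wave] is the category of that wave's next instruction.
    Returns (simd, [(wave, category), ...]).
    """
    simd = cycle % NUM_SIMDS                  # round-robin over the 4 SIMDs
    issued, used = [], set()
    for wave, category in enumerate(next_instr[simd]):
        if len(issued) == MAX_ISSUE_PER_CYCLE:
            break                             # at most 5 instructions per cycle
        if category in used:
            continue                          # at most 1 per category
        used.add(category)
        issued.append((wave, category))       # each wave is visited once: 1 per wave
    return simd, issued


# Seven waves on SIMD 0, nothing resident on the other SIMDs (made-up workload).
waves = [
    ["VALU", "VALU", "SALU/SMEM", "VMEM", "LDS", "EXPORT/GDS", "BRANCH/MSG"],
    [], [], [],
]
simd, issued = issue_cycle(0, waves)
print(simd, issued)
```

Wave 1 is skipped because wave 0 already claimed the VALU slot, and wave 6 is skipped because five instructions have already been issued that clock.
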
  28. GCN COMPUTE UNIT: VECTOR & SCALAR ARBITRATION, HARDWARE VIEW • A GCN Compute Unit can retire 256 SP Vector ALU ops in 4 clocks • Each lane can dispatch 1 SP ALU operation per clock • Each SP ALU operation takes 4 clocks to complete • The scheduler dispatches from a different wavefront each cycle [Diagram: SIMD 0-3, each with lanes 0-15, plus the Scalar Unit]
  29. GCN COMPUTE UNIT: VECTOR & SCALAR ARBITRATION, PROGRAMMER VIEW • A GCN Compute Unit can perform 64 SP Vector ALU ops / clock • Each lane can dispatch 1 SP ALU operation per clock • Each SP ALU operation still takes 4 clocks to complete • But you can PRETEND your code runs 1 op on 64-threads at once [Diagram: lanes 0-15, 16-31, 32-47, 48-63 with wavefronts 0-9 queued, plus the Scalar Unit]
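
The hardware and programmer views on slides 28 and 29 describe the same throughput. The sketch below walks through the cadence; the assumption that work-items 0-15 execute in the first clock, 16-31 in the second, and so on follows the lane grouping in the programmer-view diagram but is not stated explicitly in the text.

```python
WAVE_SIZE = 64
LANES_PER_SIMD = 16
SIMDS_PER_CU = 4


def clocks_per_wave_instruction():
    """Clocks one SIMD needs to push a 64-wide instruction through 16 lanes."""
    return WAVE_SIZE // LANES_PER_SIMD


def work_items_in_clock(clock):
    """Work-items of a wavefront processed in a given clock (assumed order)."""
    phase = clock % clocks_per_wave_instruction()
    start = phase * LANES_PER_SIMD
    return range(start, start + LANES_PER_SIMD)


def valu_ops_retired(clocks):
    """SP vector ALU ops retired per CU with all 4 SIMDs busy."""
    return SIMDS_PER_CU * LANES_PER_SIMD * clocks


print(clocks_per_wave_instruction())       # clocks per wavefront instruction
print(list(work_items_in_clock(1))[:4])    # first few work-items of the second clock
print(valu_ops_retired(4))                 # hardware view: ops in 4 clocks
print(valu_ops_retired(1))                 # programmer view: ops per clock
```
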
  30. GCN VECTOR UNITS: ALU CHARACTERISTICS • FMA (Fused Multiply Add): IEEE 754-2008 precise with all round modes, proper handling of NaN/Inf/Zero and full de-normal support in hardware for SP and DP • MULADD: single cycle issue instruction without truncation, enabling a MULieee followed by ADDieee to be combined with round and normalization after both multiplication and subsequent addition • VCMP: a full set of operations designed to fully implement all the IEEE 754-2008 comparison predicates • IEEE Rounding Modes (Round toward +Infinity, Round toward –Infinity, Round to nearest even, Round toward zero): supported under program control anywhere in the shader. SP and DP modes are controlled separately. • De-normal Programmable Mode: control for SP and DP independently. Separate control for input flush to zero and underflow flush to zero.
  31. GCN VECTOR UNITS: ALU CHARACTERISTICS, CONTINUED • Divide Assist Ops: IEEE 0.5 ULP Division accomplished with macro (SP/DP ~15/41 Instruction Slots, respectively) • FP Conversion Ops: between 16-bit, 32-bit, and 64-bit floats with full IEEE-754 precision and rounding • Exceptions: support in hardware for floating point numbers with software recording and reporting mechanism. Inexact, underflow, overflow, division by zero, de-normal, invalid operation, and integer divide by zero operation • 64-bit Transcendental Approximation: hardware based double precision approximation for reciprocal, reciprocal square root and square root • 24-bit Integer MUL/MULADD/LOGICAL/SPECIAL @ full SP rates ‒ Heavily utilized for integer thread group address calculation ‒ 32-bit integer MUL/MULADD @ DP MUL/FMA rate
  32. GCN SHADER AUTHORING TIPS • GCN has greatly improved branch performance, and it continues to improve ‒ Don’t be afraid to use it! But, remember: use it wisely – improved != free ‒ It’s at its best for highly coherent workloads (where most threads take the same path) • However, the new architecture is more susceptible to register pressure ‒ Using too many registers within a shader can reduce the maximum waves per SIMD! ‒ NOTE: A WAVEFRONT CAN ALLOCATE 104 USER SCALAR REGISTERS AS SEVERAL SCALAR REGISTERS ARE RESERVED FOR ARCHITECTURAL STATE ‒ Take caution with respect to the following: ‒ Excessive nested branching/looping ‒ Loop Unrolling ‒ Variable declarations (especially arrays) ‒ Excessive function calls requiring storing of results
      Max Waves/SIMD   10     9      8      7      6      5      4      3      2      1
      VGPR Count       <=24   28     32     36     40     48     64     84     <=128  >128
      SGPR Count       <=48   56     64     72     84     100    >100
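
The occupancy table on slide 32 appears to follow from simple division of the register files. The sketch below reproduces it under assumptions: 256 VGPRs per lane is derived here from the 64KB per-SIMD register file (64 lanes × 4 bytes), 512 SGPRs per SIMD comes from slide 23, and real hardware allocation granularity is ignored. The function names are illustrative.

```python
VGPRS_PER_LANE = 64 * 1024 // (64 * 4)   # 64KB per SIMD / (64 lanes x 4 bytes) = 256
SGPRS_PER_SIMD = 512
MAX_WAVES_PER_SIMD = 10
MAX_USER_SGPRS = 104                      # per the NOTE on the slide


def waves_by_vgpr(vgprs_per_thread):
    """Waves per SIMD allowed by vector register usage."""
    if not 1 <= vgprs_per_thread <= VGPRS_PER_LANE:
        raise ValueError("VGPR count must be between 1 and 256")
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // vgprs_per_thread)


def waves_by_sgpr(sgprs_per_wave):
    """Waves per SIMD allowed by scalar register usage."""
    if not 1 <= sgprs_per_wave <= MAX_USER_SGPRS:
        raise ValueError("SGPR count must be between 1 and 104")
    return min(MAX_WAVES_PER_SIMD, SGPRS_PER_SIMD // sgprs_per_wave)


def max_waves_per_simd(vgprs_per_thread, sgprs_per_wave):
    """The tighter of the two register limits wins."""
    return min(waves_by_vgpr(vgprs_per_thread), waves_by_sgpr(sgprs_per_wave))


for vgprs in (24, 28, 32, 36, 40, 48, 64, 84, 128, 129):
    print(vgprs, waves_by_vgpr(vgprs))
```

A shader using, say, 40 VGPRs and 56 SGPRs would be limited to 6 waves per SIMD by its vector registers in this model, even though its scalar usage alone would allow 9.
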
  33. GCN SHADER CODE EXAMPLE
      // Registers r0 contains "a", r1 contains "b"
      // Value is returned in r2
        v_cmp_gt_f32     r0,r1            // a > b, establish VCC
        s_mov_b64        s0,exec          // Save current exec mask
        s_and_b64        exec,vcc,exec    // Do "if"
        s_cbranch_vccz   label0           // Branch if all lanes fail
        v_sub_f32        r2,r0,r1         // result = a – b
        v_mul_f32        r2,r2,r0         // result = result * a
      label0:                             // (label placement inferred from the branch above)
        s_andn2_b64      exec,s0,exec     // Do "else" (s0 & !exec)
        s_cbranch_execz  label1           // Branch if all lanes fail
        v_sub_f32        r2,r1,r0         // result = b – a
        v_mul_f32        r2,r2,r1         // result = result * b
      label1:                             // (label placement inferred from the branch above)
        s_mov_b64        exec,s0          // Restore exec mask
      • An alternative to s_cbranch is to use VSKIP to transform VALU instructions into NOPs • s_setvskip – enables or disables VSKIP mode. Requires 1 waitstate after executing. • VSKIP does NOT skip VMEM instructions (Do: branch over superfluous VMEM inst.)
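
To make the exec-mask technique on slide 33 concrete, the sketch below emulates the same if/else on a short list of lanes. It is a simplified model written for this transcript: lanes are list elements, masks are lists of booleans, and the function name and sample inputs are made up.

```python
def masked_if_else(a, b):
    """Emulate `r2 = (a > b) ? (a - b) * a : (b - a) * b` with an exec mask."""
    n = len(a)
    exec_mask = [True] * n
    r2 = [0.0] * n

    # v_cmp_gt_f32: per-lane compare, result lands in VCC
    vcc = [exec_mask[i] and a[i] > b[i] for i in range(n)]
    # s_mov_b64 s0, exec: save the current exec mask
    s0 = list(exec_mask)
    # s_and_b64 exec, vcc, exec: only lanes that passed stay active
    exec_mask = [vcc[i] and exec_mask[i] for i in range(n)]
    # s_cbranch_vccz label0: skip the "if" body when no lane passed
    if any(vcc):
        for i in range(n):
            if exec_mask[i]:
                r2[i] = (a[i] - b[i]) * a[i]

    # s_andn2_b64 exec, s0, exec: activate the lanes that failed the compare
    exec_mask = [s0[i] and not exec_mask[i] for i in range(n)]
    # s_cbranch_execz label1: skip the "else" body when no lane is active
    if any(exec_mask):
        for i in range(n):
            if exec_mask[i]:
                r2[i] = (b[i] - a[i]) * b[i]

    # s_mov_b64 exec, s0: restore the exec mask
    exec_mask = s0
    return r2


print(masked_if_else([3.0, 1.0, 2.0, 5.0], [1.0, 4.0, 2.0, 2.0]))
```

When lanes disagree, both bodies execute with complementary masks, which is why the authoring tips favour coherent branches.
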
  34. GCN MEMORY: CACHE HIERARCHY • 32KB instruction cache (I$) + 16KB scalar data cache (K$) shared per ~4 CUs with L2 backing • Each CU has its own registers and local data share • L1 read/write caches ‒ 64 Bytes per clock L1 bandwidth per CU • L2 read/write cache partitions ‒ 64 Bytes per clock L2 bandwidth per partition • Global Data Share (GDS, 64KB) facilitates synchronization between CUs [Diagram: I$ and K$ blocks, GDS, per-CU L1 caches, L2 partitions, 64-bit Dual Channel Memory Controllers]
  35. GCN MEMORY: VECTOR MEMORY INSTRUCTIONS • VECTOR MEMORY INSTRUCTIONS SUPPORT VARIABLE GRANULARITY FOR ADDRESSES AND DATA, RANGING FROM 32-BIT DATA TO 128-BIT PIXEL QUADS • MUBUF – read from or write/atomic to an un-typed buffer/address ‒ Data type/size is specified by the instruction operation ‒ MUBUF is like C++ reinterpret_cast • MTBUF – read from or write to a typed buffer/address ‒ Data type is specified in the resource constant ‒ MTBUF is like C++ static_cast • MIMG – read/write/atomic operations on elements from an image surface ‒ Image objects (1-4 dimensional addresses and 1-4 dwords of homogeneous data) ‒ Image objects use resource and sampler constants for access and filtering ‒ Utilize TMU for filtering via MIMG • A pointer is a pointer on GCN!
  36. GCN MEMORY: DEVICE FLAT MEMORY INSTRUCTIONS (A GCN POINTER IS A POINTER) • Flat Address Space (“flat”) instructions are new as of Sea Islands (CI) and allow read/write/atomic access to a generic memory address pointer which can resolve to any of the following physical memories: ‒ Global Memory ‒ Scratch (“private”) ‒ LDS (“shared”) ‒ Invalid - MEM_VIOL TrapStatus • Device Flat (Generic) 64b/32b Addressing Support ‒ FLAT instructions support both 64 and 32-bit addressing. The address size is set via a mode register (“PTR32”) and a local copy of the value is stored per wave. ‒ The addresses for the aperture check differ in 32 and 64-bit mode
  37. GCN MEMORY: EXPORT INSTRUCTION & GDS • Exports move data from 1-4 VGPRs to the fixed-function Graphics Pipeline ‒ E.g.: Color (MRT0-7), Depth, Position, and Parameter ‒ Consumed by the Tessellator, Rasterizer, or RBE • Global Shared Memory Ops (Utilize GDS) • The GDS is identical to the LDS, except that it is shared by all CUs, so it acts as an explicit global synchronization point between all wavefronts • The atomic units in the GDS also support ordered count operations
  38. GCN MEMORY: LOCAL DATA SHARE • GCN Local Data Share (LDS) is a 64KB, 32 bank (or 16) Shared Memory • Instruction issue fully decoupled from ALU instructions • Direct mode ‒ Vector Instruction Operand: 32/16/8-bit broadcast value ‒ Graphics Interpolation @ rate, no bank conflicts • Index Mode – Load/Store/Atomic Operations ‒ Bandwidth Amplification, up-to 32 – 32-bit lanes serviced per clock peak ‒ Direct decoupled return to VGPRs ‒ Hardware conflict detection with auto scheduling • Software consistency/coherency for thread groups via hardware barrier • Fast & low power vector load return from R/W L1
  39. GCN MEMORY: LOCAL DATA SHARE, CONTINUED • An LDS bank is 512 entries, each 32-bits wide ‒ A bank can read and write a 32-bit value across an all-to-all crossbar and swizzle unit that includes 32 atomic integer units ‒ This means that several threads can read the same LDS location at the same time for FREE ‒ Writing to the same address from multiple threads also occurs at rate, last thread to write wins (useful e.g. for all threads writing uniform value to still be fast) • Typically, the LDS will coalesce 32 lanes from one SIMD each cycle ‒ One wavefront is serviced completely every 2 cycles ‒ Conflicts automatically detected across 32 lanes from a wavefront and resolved in hardware ‒ An instruction which accesses different elements in the same bank takes additional cycles
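
A rough way to reason about the bank-conflict rules on slide 39 is to count, per bank, how many distinct dwords a group of 32 lanes touches. The sketch below assumes consecutive dwords map to consecutive banks (a common layout, not stated on the slide) and models only the counting, not actual hardware timing; the function names are illustrative.

```python
NUM_BANKS = 32
BANK_WIDTH_BYTES = 4


def bank_of(byte_address):
    """Bank index, assuming consecutive dwords land in consecutive banks."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


def conflict_passes(byte_addresses):
    """Passes needed to service up to 32 lanes.

    Lanes reading the same dword share one access (the slide's "FREE" case);
    different dwords in the same bank are serialized.
    """
    dwords_per_bank = {}
    for address in byte_addresses:
        dword = address // BANK_WIDTH_BYTES
        dwords_per_bank.setdefault(dword % NUM_BANKS, set()).add(dword)
    return max((len(d) for d in dwords_per_bank.values()), default=0)


lanes = range(32)
print(conflict_passes([4 * i for i in lanes]))     # unit stride: every lane its own bank
print(conflict_passes([0 for _ in lanes]))         # all lanes read one location
print(conflict_passes([8 * i for i in lanes]))     # stride of 2 dwords
print(conflict_passes([128 * i for i in lanes]))   # stride of 32 dwords: one bank
```
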
  40. GCN MEMORY: LOCAL DATA SHARE [Block diagram]
  41. GCN MEMORY: LOCAL DATA SHARE, NEW MEMORY OPERATIONS • Remote Atomic Ops with Shared Memory Dual-Source Operands ‒ LDS[Dst] = LDS[addr0] op LDS[addr1]; ‒ Fast remote reduction operations for arithmetic, logical, Min/Max • Read/Write/Conditional Exchange 96b/128b • 32-bit FP Min/Max/Compare Swap
  42. GCN MEMORY: LOCAL DATA SHARE, NEW MEMORY OPERATIONS, CONTINUED • Fast Lane Swizzle Operations ‒ Do not require allocation, no shared memory used ‒ Invalid reads return 0x0 ‒ First mode: each group of four adjacent lanes can fully crossbar data, with the same switch setting for each set of four ‒ Second mode: for each consecutive set of 32 work-items ‒ Swap: 16, 8, 4, 2, 1 ‒ Reverse: 32, 16, 8, 4, 2 ‒ Broadcast: 32, 16, 8, 4, 2
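
The first swizzle mode on slide 42 (a crossbar within every group of four lanes, same setting for each group) can be sketched as a permutation applied quad by quad. The `pattern` argument and function name are hypothetical, and returning 0 for a source lane that does not exist is only one reading of the slide's invalid-read rule.

```python
def quad_swizzle(values, pattern):
    """Apply the same 4-lane permutation to every group of four lanes.

    pattern[k] is the source lane (0-3, within the quad) that lane k reads.
    """
    if len(pattern) != 4 or any(not 0 <= p <= 3 for p in pattern):
        raise ValueError("pattern must hold four lane indices in 0..3")
    out = []
    for lane in range(len(values)):
        quad_base = lane & ~3                        # first lane of this quad
        source = quad_base + pattern[lane & 3]
        # A source lane past the end of the data is treated as an invalid read.
        out.append(values[source] if source < len(values) else 0)
    return out


data = list(range(8))
print(quad_swizzle(data, (1, 0, 3, 2)))   # swap neighbours within each quad
print(quad_swizzle(data, (0, 0, 0, 0)))   # broadcast lane 0 of each quad
```
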
  43. GCN MEMORY: LOCAL DATA SHARE, OPERATION DIAGRAMS [Diagrams: 4 Lane CrossBar; Swap (16, 8, 4, 2, 1); Reverse (16, 8, 4, 2, 1); Broadcast (16, 8, 4, 2, 1); each drawn across lanes 0-63]
  44. GCN MEMORY: READ/WRITE CACHE • Reads and writes cached ‒ Bandwidth amplification ‒ Improved behavior on more memory access patterns ‒ Improved write to read reuse performance • Relaxed consistency memory model ‒ Consistency controls available to control locality of load/store • GPU Coherent ‒ Acquire/Release semantics control data visibility across the machine (GLC bit on load/store) ‒ GCN APUs also have SLC bit to control data visibility to CPU caches ‒ L2 coherent = all CUs can have the same view of data • Global Atomics ‒ Performed in L2 cache (GDS also has global atomics) [Block diagram]
  45. GCN MEMORY: READ/WRITE L1 CACHE ARCHITECTURE • Each CU has its own Vector L1 Data Cache • 16KB L1, 64B lines, 4 sets x 64-way • ~64B/CLK bandwidth per Compute Unit • Write-through – alloc on write (no read) w/ dirty byte mask • Write-through at end of wavefront • Decompression on cache read out • Instruction GLC bit defines cache behavior (GCN APUs also have SLC bit) ‒ GLC = 0: Local caching (full lines left valid); Shader write back invalidate instructions ‒ GLC = 1: Global coherent (hits within wavefront boundaries)
  46. GCN MEMORY: READ/WRITE L2 CACHE ARCHITECTURE • 64-128KB L2 per Memory Controller Channel • Up-to 16 L2 cache partitions • 64B lines, 16-way set associative • ~64B/CLK per channel for L2/L1 bandwidth • Write-back - alloc on write (no read) w/ dirty byte mask • Acquire/Release semantics control data visibility across CUs • L2 Coherent = all CUs can have the same view of data • Remote Atomic Operations ‒ Common Integer Set & Floating Point Min/Max/CmpSwap
  47. GCN MEMORY: BANDWIDTH INFORMATION • Each CU has 64 bytes per cycle of L1 bandwidth ‒ Shared with the GDS • Per L2 there’s 64 bytes of data per cycle as well • Peak Scalar L1 Data Cache Bandwidth per CU is 16 bytes/cycle • Peak I-Cache Bandwidth per CU is 32 bytes/cycle (Optimally 8 instructions) • LDS Peak Bandwidth is 128 bytes of data per cycle via bandwidth amplification • For R9 290x: ‒ That’s nearly 5.5 TB/s of LDS BW, 2.8 TB/s of L1 BW, and 1 TB/s of L2 BW! ‒ 512-bit GDDR5 Main Memory has over 320 GB/sec bandwidth ‒ PCI Express 3.0 x16 bus interface to system (32GBps)
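
The R9 290X totals on slide 47 can be reproduced from the per-clock figures. The sketch below rests on assumptions that are not in the slide text: a 1 GHz engine clock, 16 active L2 partitions, 1 TB counted as 1024 GB, and a 5 Gbps per-pin GDDR5 data rate.

```python
CLOCK_HZ = 1.0e9          # assumption: 1 GHz engine clock
NUM_CUS = 44              # from the slides
L2_PARTITIONS = 16        # assumption: all 16 partitions present
GDDR5_GBPS_PER_PIN = 5.0  # assumption: per-pin data rate
BUS_WIDTH_BITS = 512      # from the slides


def aggregate_gb_per_s(bytes_per_clock, units):
    """Total bandwidth in GB/s for `units` blocks moving data every clock."""
    return bytes_per_clock * units * CLOCK_HZ / 1e9


def to_tb(gb_per_s):
    """Convert using 1 TB = 1024 GB, which matches the slide's rounding."""
    return gb_per_s / 1024


lds_tb = to_tb(aggregate_gb_per_s(128, NUM_CUS))
l1_tb = to_tb(aggregate_gb_per_s(64, NUM_CUS))
l2_tb = to_tb(aggregate_gb_per_s(64, L2_PARTITIONS))
dram_gb = BUS_WIDTH_BITS / 8 * GDDR5_GBPS_PER_PIN

print(f"LDS {lds_tb:.2f} TB/s, L1 {l1_tb:.2f} TB/s, L2 {l2_tb:.2f} TB/s")
print(f"GDDR5 {dram_gb:.0f} GB/s")
```
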
  48. GCN MEMORY: BANDWIDTH & LATENCY TABLES
      Bandwidth        LDS                K$                L1
                       128 bytes / clock  16 bytes / clock  64 bytes / clock
      Latency          LDS                K$                L1
      Resident         Short              Short (1x)        Long (20x)
      Non-Resident     N/A                Medium (10x)      Long (20x)
      Main Takeaways: – LDS is optimized for bandwidth amplification and atomics – K$ is optimized for periodic low-latency reads of small datasets – L1 is optimized for high-bandwidth texture fetches and streaming
  49. GCN MEMORY: L1 TEXTURE CACHE • The memory hierarchy is re-used for graphics • Some dedicated graphics hardware added ‒ Address-gen unit receives 4 texture addr/clock ‒ Calculates 16 sample addr (nearest neighbors) ‒ Reads samples from L1 vector data cache ‒ Decompresses samples in Texture Mapping Unit (TMU) ‒ TMU filters adjacent samples, produces <= 4 interpolated texels/clock ‒ TMU output undergoes format conversion and is written into the vector register file ‒ The format conversion hardware is also used for writing certain formats to memory from graphics shaders [Block diagram]
  50. GCN MEMORY: X86-64 VIRTUAL MEMORY • The GCN cache hierarchy was designed to integrate with x86-64 microprocessors • The GCN virtual memory system can support 4KB pages ‒ Natural mapping granularity for the x86-64 address space ‒ Paves the way for a shared address space in the future ‒ All GCN hardware can already translate requests into x86-64 address space • GCN caches use 64B lines, which is the same size x86-64 processors use • The stage is set for heterogeneous systems to transparently share data between the GPU and CPU through the traditional caching system, without explicit programmer control! [Diagram: AMD A-Series APU]
  51. GCN COMPUTE ARCHITECTURE R9 290X A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
      Metric              | AMD Radeon™ HD 7970 GHz Edition | AMD Radeon™ R9 290X      | Increase
      Geometry Processing | 2.1 billion primitives/sec      | 4 billion primitives/sec | 1.9x
      Compute             | 4.3 TFLOPS                      | 5.6 TFLOPS               | 1.3x
      Texture fill rate   | 134.4 Gtexels/sec               | 176 Gtexels/sec          | 1.3x
      Pixel fill rate     | 33.6 Gpixels/sec                | 64 Gpixels/sec           | 1.9x
      Peak Bandwidth      | 264 GB/sec                      | 320 GB/sec               | 1.2x
      Die area            | 352 mm²                         | 438 mm²                  | 1.24x
      Peak GFLOPS/mm²     | 12.2                            | 12.8                     | 1.05x
  52. GCN COMPUTE ARCHITECTURE SHADER ENGINE A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING  Each GCN GPU can contain up-to 4 Shader Engines ‒ Load balanced with each other ‒ Screen partitioning of pixel assignment  A Shader Engine is a high level organizational unit containing: ‒ 1 Geometry Processor (1 Primitive Per Cycle Throughput) ‒ 1 Rasterizer ‒ 1-16 CUs (Compute Units) ‒ Instruction I$ and constant K$ caches shared by up to 4 CU each ‒ 1-4 RBEs (Render Back Ends) ‒ Up-to 16 – 64b pixels/cycle per Shader Engine ‒ Up-to 8 – 128b pixels/cycle per Shader Engine 54 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  53. GCN COMPUTE ARCHITECTURE R9 290X A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT
      • 44 Compute Units
      • 4 Geometry Processors
      ‒ 4 billion primitives/sec
      • 64 Pixel Output/Clock
      ‒ 64 Gpixels/sec fill rate
      • 1MB L2 Cache
      ‒ Up-to 1 TB/sec L2/L1 bandwidth
      • 512-bit GDDR5 memory interface
      ‒ 320 GB/sec memory bandwidth
      • 6.2 billion transistors
      ‒ 438 mm² on 28nm process node
      ‒ 12.8 GFLOPS/mm²
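The headline R9 290X rates follow from unit counts multiplied by clock. The sketch below reproduces them under assumptions that are not on this slide: a 1 GHz engine clock, 64 ALU lanes per CU, 2 FLOPs per lane per clock (one fused multiply-add), and 4 filtered texels per CU per clock.

```python
# Rough derivation of R9 290X peak rates from unit counts.
# Assumed (not from this slide): 1 GHz clock, 64 lanes per CU,
# 2 FLOPs per lane per clock (FMA), 4 texels per CU per clock.
CLOCK_HZ = 1.0e9
NUM_CUS = 44              # from the slide
LANES_PER_CU = 64         # assumption: 4 SIMDs x 16 lanes
FLOPS_PER_LANE_CLK = 2    # assumption: one FMA counts as 2 FLOPs
TEXELS_PER_CU_CLK = 4     # assumption
PIXELS_PER_CLK = 64       # from the slide
PRIMS_PER_CLK = 4         # from the slide (4 geometry processors)
DIE_AREA_MM2 = 438        # from the slide

peak_gflops = NUM_CUS * LANES_PER_CU * FLOPS_PER_LANE_CLK * CLOCK_HZ / 1e9
texel_rate_g = NUM_CUS * TEXELS_PER_CU_CLK * CLOCK_HZ / 1e9
pixel_rate_g = PIXELS_PER_CLK * CLOCK_HZ / 1e9
prim_rate_g = PRIMS_PER_CLK * CLOCK_HZ / 1e9
gflops_per_mm2 = peak_gflops / DIE_AREA_MM2

print(f"compute : {peak_gflops / 1000:.2f} TFLOPS")
print(f"texels  : {texel_rate_g:.0f} Gtexels/s")
print(f"pixels  : {pixel_rate_g:.0f} Gpixels/s")
print(f"prims   : {prim_rate_g:.0f} Gprims/s")
print(f"density : {gflops_per_mm2:.2f} GFLOPS/mm^2")
```

The density comes out slightly above the slide's 12.8 because the slide appears to divide the rounded 5.6 TFLOPS figure by the die area.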
  54. GCN COMPUTE ARCHITECTURE SEA ISLANDS A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT  8 ASYNCHRONOUS COMPUTE ENGINES (ACE) ‒ Operate in parallel with Graphics CP ‒ Independent scheduling and work item dispatch for efficient multi-tasking ‒ 9 Devices with 64+ Command Queues! ‒ Fast context switching ‒ Exposed in OpenCL™  Dual DMA engines ‒ Can saturate PCIe 3.0 x16 bus bandwidth (16 GB/sec bidirectional) 56 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  55. GCN COMPUTE ARCHITECTURE SEA ISLANDS A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT
      • ACEs are responsible for compute shader scheduling & resource allocation
      • Each ACE fetches commands from cache or memory & forms task queues
      • Tasks have a priority level for scheduling
      ‒ Background → Realtime
      • ACEs dispatch tasks to shader arrays as resources permit
      • Tasks complete out-of-order, tracked by ACE for correctness
      • Every cycle, an ACE can create a workgroup and dispatch one wavefront from the workgroup to the CUs
  56. GCN COMPUTE ARCHITECTURE SEA ISLANDS A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT  ACE are independent ‒ But, can synchronize and communicate via Cache/Memory/GDS  ACE can form task graphs ‒ Individual tasks can have dependencies on one another ‒ Can depend on another ACE ‒ Can depend on part of graphics pipe  ACE can control task switching ‒ Stop and Start tasks and dispatch work to shader engines 58 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  57. GCN COMPUTE ARCHITECTURE SEA ISLANDS A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT  Focus in GPU hardware shifting away from graphics-specific units, towards general-purpose compute units  R9 290x GCN-based ASICs already have 8:1 ACE : CP ratio ‒ CP can dispatch compute ‒ ACE cannot dispatch graphics  If you aren’t writing Compute Shaders, you’re not getting the absolute most out of modern GPUs ‒ Control: LDS, barriers, thread layout, ... 59 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  58. GCN COMPUTE ARCHITECTURE SEA ISLANDS A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT Future Trends:  More Compute Units ‒ ALU outpaces Bandwidth  CPU + GPU Flat Memory ‒ APU + dGPU  Less Fixed Function Graphics ‒ Can you write a Compute-based graphics pipeline? ‒ Start thinking about it…  60 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  59. GCN FIXED FUNCTION ARCHITECTURE GEOMETRY
      [Diagram: four Geometry Processors, each containing a Geometry Assembler, Tessellator and Vertex Assembler]
      [Image from Battlefield 3, EA DICE: tessellation off / on]
      Process and rasterize up to 4 primitives per clock cycle
      Updated Hardware Geometry Units
      – Off-chip buffering improvements
      – Larger parameter and position cache
      • GS + Tessellation is faster than before…
      • However… memory is still the bottleneck!
      – Minimize the number of inputs and outputs for best performance…
      • Small expansions can be done within LDS!
  60. GCN FIXED FUNCTION ARCHITECTURE RASTERIZER
      • We now have 4 Rasterizers on R9 290x (4 triangles x 16 pixels = 64 pixels per clock)
      ‒ Each rasterizer can read in a single triangle per cycle, and write out 16 pixels
      • Caveat: tiny (e.g. sub-pixel) triangles can dramatically reduce efficiency
      • This can cause us to become raster-bound, starving the shader and holding up geometry!
      [Figure: rasterizer efficiency examples]
      ‒ 16 pixels per clock: 100% efficiency
      ‒ 12 pixels per clock: 75% efficiency
      ‒ 28 pixels in 2 clocks vs. 3 pixels in 3 clocks
      ‒ 1 pixel per clock: 6.25% efficiency
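The efficiency percentages on this slide are covered pixels divided by the 16-pixel-per-clock peak of one rasterizer. A minimal sketch of that arithmetic follows; the function name is illustrative only.

```python
# Rasterizer efficiency: pixels actually produced versus the
# 16 pixels per clock one rasterizer can emit at peak (from the slide).
PEAK_PIXELS_PER_CLOCK = 16


def raster_efficiency(pixels_covered, clocks):
    """Fraction of peak pixel throughput achieved over `clocks` cycles."""
    return pixels_covered / (PEAK_PIXELS_PER_CLOCK * clocks)


# One triangle is consumed per clock, so N triangles take N clocks.
cases = [
    ("one triangle covering 16 pixels", 16, 1),
    ("one triangle covering 12 pixels", 12, 1),
    ("two larger triangles, 28 pixels total", 28, 2),
    ("three sub-pixel triangles, 3 pixels total", 3, 3),
]
for label, pixels, clocks in cases:
    print(f"{label}: {raster_efficiency(pixels, clocks):.2%}")
```

The last case shows why sub-pixel triangles hurt so much: each one still costs a full rasterizer clock but yields a single pixel.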
  61. GCN FIXED FUNCTION ARCHITECTURE TESSELLATION + RASTERIZER EFFICIENCY
      [Figure: three tessellation levels]
      ‒ ~13 pixels per clock: 75-90% efficiency
      ‒ ~4 pixels per clock: 18-25% efficiency
      ‒ 1 pixel per clock: 6.25% efficiency
      Over-Tessellation
      • Reduces rasterizer efficiency
      ‒ Extreme Tessellation = 6.25% Efficiency
      • Also impacts ROPs and MSAA efficiency
      ‒ High number of polygon edges to AA
      ‒ Consumes dramatically more bandwidth
      ‒ If nFragments > nSamples, quality will be lost
      ‒ E.g. 16 verts affecting 1 pixel @ 8xMSAA
  62. GCN FIXED FUNCTION ARCHITECTURE TESSELLATION + SHADING EFFICIENCY
      Over-Tessellation
      • Reduces shader efficiency
      • HS, DS and VS run many times for each final image pixel
      ‒ Yet don’t contribute much to final image quality
      • The graphics pipeline is not designed for this abuse!
      • Consider Alternatives:
      ‒ Parallax Occlusion Mapping
      ‒ […]
      [Figure: Shading Passes Per-Pixel (Overshade), scale 1 to 8]
      Image courtesy: Kayvon Fatahalian, “Evolving the Direct3D Pipeline for Real-time Micropolygon Rendering,” from ACM SIGGRAPH 2010 course “Beyond Programmable Shading II”
  63. GCN Tessellation – Best Practices
      • While performance is much improved, it is still a potential bottleneck!
      ‒ Produces a great deal of IO traffic, starving other parts of the pipeline
      • Best performance generally achieved with tessellation factors less than 15!
      Continue to Optimize:
      ‒ Pre-triangulate
      ‒ Distance-adaptive
      ‒ Screen-space adaptive
      ‒ Orientation-adaptive
      ‒ Backface Culling
      ‒ Frustum Culling
      ‒ […]
      [Image: Tessellation OFF / ON]
  64. GCN FIXED FUNCTION ARCHITECTURE RASTERIZER
      • We now have 4 Geometry Processors on R9 290x
      ‒ Overall Primitive Rate = 4 prims per clock (ideal)
      • We now have 4 Rasterizers on R9 290x (4 triangles x 16 pixels = 64 pixels per clock)
      ‒ Each rasterizer can read in a single triangle per cycle, and write out 16 pixels
      • Caveat: tiny (e.g. sub-pixel) triangles can dramatically reduce efficiency
      • This can cause us to become raster-bound, unable to rasterize at peak-rate!
      [Block diagram: Command Processor; 4 Geometry Processors (each with Geometry Assembler, Tessellator, Vertex Assembler); Compute Units; 4 Rasterizers (each with Scan Converter and Hierarchical Z); Render Back-Ends]
  65. GCN FIXED FUNCTION ARCHITECTURE RENDER BACK ENDS
      • Once the pixel fragments in a tile have been shaded, they flow to the Render Back-Ends (RBEs)
      [Diagram: Z/Stencil ROPs, Color ROPs, Depth Cache, Color Cache]
      ‒ 16KB Color Cache
      ‒ Up to 8 color + 16 coverage samples (16x EQAA)
      ‒ 8KB Depth Cache
      ‒ Up to 8 depth samples (8x MSAA)
      ‒ Writes un-cached via memory controllers
      ‒ 64 – 64b pixels per cycle
      ‒ 256 Depth Test (Z) / Stencil Ops per cycle
      • Logic Operations as alternative to Blending
      ‒ Exposed in Direct3D 11.1
      ‒ Also available in OpenGL
      • Dual-Source Color Blending with MRTs
      ‒ Only available in OpenGL *
      There are 16 RBEs on R9 290x
  66. GCN FIXED FUNCTION ARCHITECTURE DEPTH IMPROVEMENTS 24-BIT DEPTH FORMATS ARE INTERNALLY REPRESENTED AS 32-BITS Fast-accept of fully-visible triangles spanning one or more tile If a triangle is fully covering a tile, then cost is only 1 clock/tile  Depth Bounds Test (DBT) Extension ‒Exposed in OpenGL via GL_EXT_depth_bounds_test ‒Exposed in Direct3D 11 via extension 68 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  67. GCN FIXED FUNCTION ARCHITECTURE STENCIL IMPROVEMENTS  GCN has support for new extended stencil ops ‒Only available in OpenGL: GL_AMD_stencil_operation_extended ‒Additional stencil ops: ‒AND, XOR, NOR ‒REPLACE_VALUE_AMD ‒etc. ‒ Also exposes additional stencil op source value ‒ Can be used as an alternative to stencil ref value  Stencil ref and op source value can now be exported from pixel shader ‒Only available in OpenGL: GL_AMD_shader_stencil_value_export 69 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  68. GCN LOW-LEVEL TIPS GPR PRESSURE  GPRs and GPR Pressure  Banks of GCN Vector GPRs (Illustration)  General Purpose Registers (GPR) are a limited resource ‒ Separate banks of GPRs for Vector and Scalar (per SIMD) ‒ Maximum of 256 VGPRS and 512 SGPRS shared across all waves (up-to 10) owned by a SIMD ‒ Organized as 64 words of 32-bits – two adjacent GPR can be combined for 64-bit (4 for 128-bit) ‒ Number of GPRs required by a shader affects SIMD scheduling and execution efficiency ‒ Shader tools can be used to determine how many GPRs are used…  GPR pressure is affected by: ‒ Loop Unrolling ‒ Long lifetime of temporary variables ‒ Nested Dynamic Flow Control instructions ‒ Fetch dependencies (e.g. indexed constants) 70 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
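To make the link between register usage and scheduling concrete, the sketch below estimates how many wavefronts fit on one SIMD for a given per-wave register count. It uses only the limits quoted on this slide (256 VGPRs, 512 SGPRs, up to 10 waves). It is a simplified model: real hardware also applies allocation granularity, per-wave caps and LDS limits, all ignored here, and the function name is hypothetical.

```python
# Simplified wave-occupancy estimate for one GCN SIMD, using the
# register budgets quoted on the slide. Allocation granularity and
# LDS limits are deliberately ignored in this sketch.
MAX_WAVES_PER_SIMD = 10
VGPR_BUDGET = 256   # vector registers shared by all waves on the SIMD
SGPR_BUDGET = 512   # scalar registers shared by all waves on the SIMD


def waves_per_simd(vgprs_per_wave, sgprs_per_wave):
    """Upper bound on resident waves given per-wave register usage."""
    if vgprs_per_wave < 1 or sgprs_per_wave < 1:
        raise ValueError("register counts must be positive")
    by_vgpr = VGPR_BUDGET // vgprs_per_wave
    by_sgpr = SGPR_BUDGET // sgprs_per_wave
    return min(MAX_WAVES_PER_SIMD, by_vgpr, by_sgpr)


for vgprs in (24, 32, 64, 128):
    print(f"{vgprs:3d} VGPRs/wave -> {waves_per_simd(vgprs, 48)} waves/SIMD")
```

Under this model a shader using 64 VGPRs keeps only 4 waves in flight per SIMD, which leaves far fewer waves available to hide memory latency than a shader using 24.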
  69. GCN LOW-LEVEL TIPS TEXTURE FILTERING
      ‒ Point sampling is full-rate on all formats
      ‒ Trilinear filtering costs up to 2x bilinear filtering cost
      ‒ Anisotropic (N taps) costs <= (N x bilinear)
      ‒ Avoid cache thrashing!
      ‒ Use MIPmapping
      ‒ Use Gather() where applicable
      ‒ Exploit neighbouring pixel shader thread/CU locality:
      ‒ Sampling from texels resident on the same CU can have a lower cost
      ‒ Exploit this explicitly by using Compute Shaders
  70. GCN LOW-LEVEL TIPS COLOR OUTPUT  PS Output: Each additional color output increases export cost  Export cost can be more costly than PS execution! ‒ Each (fast) export is equivalent to 64 ALU ops on R9 290X ‒ If shader is export-bound then use “free” ALU for packing instead  Watch out for export-bound cases ‒ E.g. G-Buffer parameter writes ‒ MINIMIZE SHADER INPUTS AND OUTPUTS! ‒ Pack, pack, pack, pack!  Costs of outputting and blending various formats ‒discard/clip allow the shader hardware to skip the rest of the work * Miss “PACK” Man kindly reminds you to “Pack pack pack!”  72 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
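As an illustration of the "pack, pack, pack" advice, the sketch below packs two values as IEEE half-precision floats into one 32-bit word, so four G-buffer channels would fit in 64 bits instead of 128. It is a CPU-side Python illustration of the idea only, not shader code; `pack_half2` and `unpack_half2` are hypothetical helper names, and whether half precision is acceptable depends on the data being stored.

```python
import struct

# Pack two floats as IEEE 754 half-precision values into one 32-bit
# word, trading precision for export bandwidth. Helper names are
# illustrative only.


def pack_half2(x, y):
    """Return a 32-bit int: half(x) in the low 16 bits, half(y) in the high."""
    lo, hi = struct.unpack("<HH", struct.pack("<ee", x, y))
    return lo | (hi << 16)


def unpack_half2(word):
    """Inverse of pack_half2: return the two stored values as floats."""
    return struct.unpack("<ee", struct.pack("<HH", word & 0xFFFF, word >> 16))


packed = pack_half2(0.5, -2.0)
print(hex(packed), unpack_half2(packed))
```

The extra ALU work for packing is the "free" ALU the slide refers to: when a shader is export-bound, those instructions overlap with time that would otherwise be spent waiting on the export.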
  71. GCN MEDIA PROCESSING MEDIA INSTRUCTIONS
      • SAD = Sum of Absolute Differences
      • Critical to video & image processing algorithms
      ‒ Motion detection
      ‒ Gesture recognition
      ‒ Video & image search
      ‒ Stereo depth extraction
      ‒ Computer vision
      • SAD (4x1) and QSAD (4 4x1) instructions
      ‒ New QSAD combines SAD with alignment ops for higher performance and reduced power draw
      ‒ Evaluate up to 256 pixels per CU per clock cycle!
      • Maskable MQSAD instruction
      ‒ Allows background pixels to be ignored
      ‒ Accelerated isolation of moving objects
      • New: 32-bit destination accumulator register
      ‒ SAD/QSAD/MQSAD U32/U16 accumulators with saturation
      [Figure: grid of candidate pixel blocks with their SAD scores; the block with the lowest SAD is the closest match]
      AMD Radeon R9 290x can evaluate 11.26 Terapixels/sec *
      * Peak theoretical performance for 8-bit integer pixels
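For readers unfamiliar with SAD, the sketch below shows the operation itself and how it selects the closest match among candidate blocks, followed by the arithmetic behind the peak throughput figure. The pixel values are made up for illustration, and the 44 CU count and 1 GHz clock are assumptions not stated on this slide.

```python
# Sum of Absolute Differences (SAD) block matching, plus the peak
# throughput arithmetic. Pixel data below is invented for illustration.


def sad(block_a, block_b):
    """Sum of absolute per-pixel differences between equal-sized blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_match(reference, candidates):
    """Return (index, scores); the lowest SAD marks the closest match."""
    scores = [sad(reference, c) for c in candidates]
    return min(range(len(scores)), key=scores.__getitem__), scores


reference = [3, 2, 5, 5]
candidates = [[4, 4, 0, 7], [3, 2, 5, 4], [9, 9, 9, 9]]
index, scores = best_match(reference, candidates)
print(f"scores={scores}, closest candidate={index}")

# Peak rate: 256 pixels per CU per clock (from the slide),
# assuming 44 CUs at 1 GHz.
peak_terapixels = 256 * 44 * 1.0e9 / 1e12
print(f"peak ~{peak_terapixels:.2f} Terapixels/s")
```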
  72. GCN MEDIA PROCESSING VIDEO CODEC ENGINE
      • Video Codec Engine (VCE)
      ‒ Hardware H.264 Compression and Decompression
      ‒ Ultra-low-power, fully fixed-function mode
      ‒ Capable of 1080p @ 60 frames / second
      ‒ Programmable for Ultra High Quality and/or Speed
      ‒ Entropy encoding block fully accessible to software
      ‒ AMD Accelerated Parallel Programming SDK
      ‒ OpenCL™
      ‒ Create hybrid faster-than-real-time encoders!
      ‒ Custom motion estimation
      ‒ Inverse DCT and motion compensation
      ‒ Combine with hardware entropy encoding!
      AMD Radeon R9 290x can compress Realtime+ 1080p H.264
  73. GCN MEDIA PROCESSING AMD TRUEAUDIO
      • Multiple integrated Tensilica HiFi EP Audio DSP cores
      • Dedicated Audio DSP solution for game sound effects
      • Guaranteed real-time performance and service
      • Designed for game audio artists and engineers to take their artistic vision beyond sound production into the realm of sound processing
      • Intended to transform game audio as programmable shaders transformed graphics
  74. GCN MEDIA PROCESSING AMD TRUEAUDIO SPATIALIZATION / 3D AUDIO REVERBS AUDIO/VOICE STREAMS MASTERING LIMITERS 76 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  75. HEAR MORE REALTIME VOICES AND CHANNELS IN A GAME 77 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  76. ENABLES AMAZING DIRECTIONAL AUDIO OVER ANY OUTPUT 78 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  77. CONCLUSIONS GCN ARCHITECTURE TAKEAWAYS ‒GCN offers increased flexibility & efficiency, with reduced complexity! ‒Non-VLIW Architecture improves efficiency while reducing programmer burden ‒Constants/resources are just address + offset now in the hardware ‒UAV/SRV/SUV read/write any format – like CPU C++ reinterpret_cast & static_cast ‒Has virtual memory & GPU flat memory, moving towards CPU + GPU flat memory ‒GCN is designed with a forward-looking focus on Compute ‒Scalar unit for complex dynamic control flow + branch & message unit ‒64KB LDS/CU, 64KB GDS, atomics at every stage, coherent cache hierarchy ‒8 Asynchronous Compute Engines (ACE) for multitasking compute ‒ 8 ACE x 8 HQD (per ACE) = 64 HQD (HQD = Hardware Queue Descriptors) 79 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  78. CONCLUSIONS GCN ARCHITECTURE TAKEAWAYS CONTINUED … ‒GCN generally simplifies your life as a programmer ‒Don’t: fret too much about instruction grouping, or vectorization ‒Do: Think about GPR utilization & LDS usage (impacts max # of wavefronts) ‒Do: Think about thread/CU locality when you structure your algorithm ‒Do: Exploit the low-latency 4-CU Shared 16KB Scalar L1 Data Cache (K$) ‒Do: Pack shader inputs and outputs – aim to be IO/bandwidth thin! ‒ Pack PS exports into non-blended 64-bit format for optimal ROP utilization ‒ But, remember that 32-bit formats still use less bandwidth ‒ Keep geometry (HS, VS, GS, DS) stage IO under 4 float4 (ideally less! ) ‒Unlimited number of addressable constants/resources ‒N constants aren’t free anymore – each consume resources, use sparingly! ‒Compute is the future – exploit its power for GPGPU work & graphics! 80 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  79. THANK YOU 问题? QUESTIONS?  質問がありますか? ^_^ Layla Mah layla.mah@amd.com 81 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  80. BONUS SLIDES 82 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  82. TILED RESOURCES & PARTIALLY RESIDENT TEXTURES MegaTexture in id Tech5
  83. Tiled Resources & Partially Resident Textures – INTRODUCTION Enables application to manage more texture data than can physically fit in a fixed footprint ‒ Known as: Tiled Resources (Direct3D 11.2) and Partially Resident Textures (OpenGL 4.2) ‒ A.k.a. “Virtual texturing“ and “Sparse texturing” The principle behind PRT is that not all texture contents are likely to be needed at any given time ‒ Current render view may only require selected portions of a texture to be resident in memory ‒ Or, only selected MIPMap levels… PRT textures only have a portion of their data mapped into GPU-accessible memory at a time ‒ Texture data can be streamed in on-demand ‒ Texture sizes up-to 32TB (16k x 16k x 8k x 128-bit)  OpenGL extension – GL_AMD_sparse_texture 85 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
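The 32TB maximum quoted on this slide follows directly from the stated dimensions. A short check, using binary units (1 TB = 2^40 bytes here):

```python
# Virtual size of the largest PRT texture quoted on the slide:
# 16k x 16k x 8k texels at 128 bits per texel.


def texture_bytes(width, height, depth, bits_per_texel):
    """Uncompressed size of a single-level volume texture in bytes."""
    return width * height * depth * bits_per_texel // 8


K = 1024
max_bytes = texture_bytes(16 * K, 16 * K, 8 * K, 128)
print(f"{max_bytes / 2**40:.0f} TB")
```

No current video memory comes close to that size, which is the point of the slide: only the tiles actually needed are mapped into physical memory.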
  84. Tiled Resources & Partially Resident Textures – TEXTURE TILES The PRT texture is chunked into 64KB tiles ‒ Fixed memory size ‒ Not dependant on texture type or format Highlighted areas represent texture data that needs highest resolution Chunked texture Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008 86 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13 Texture tiles needing to be resident in GPU memory
  85. Tiled Resources & Partially Resident Textures – TRANSLATION TABLE The GPU virtual memory page table translates 64KB tiles into a resident texture tile pool Texture Map Page Table Texture Tile Pool (Video Memory) (linear storage) 64KB tile Unmapped page entry Mapped page entry Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008 87 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  86. Tiled Resources & Partially Resident Textures – MIP MAPS
      Not all tiles from the texture map are actually resident in video memory
      PRT hardware page table stores virtual → physical mappings
      [Diagram: Texture Map (MIP Levels), Page Table, Texture Tile Pool (Video Memory); legend: 64KB tile, unmapped page entry, mapped page entry]
      Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
  87. Tiled Resources & Partially Resident Textures – TILE MANAGEMENT The Application is responsible for uploading/releasing new PRT tiles! A common scenario is to upload lower MIPMaps to texture tile pool ‒ This allows a full representation of the PRT contents to be resident in memory (albeit at lower resolution) ‒ e.g. MIP LOD 6 and above for 16kx16k 32-bits texture is about 650KB (256x256 resolution) Texture tiles corresponding to higher resolution areas are uploaded by the application as needed ‒ e.g. As camera gets closer to a PRT-textured polygon the requirement for texels:screen pixels ratio increases, thus higher LOD tiles need uploading 89 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  88. Tiled Resources & Partially Resident Textures – “FAILED” FETCH How does the application know which texture tiles to upload? Answer: PRT-specific texture fetch instructions in pixel shader ‒ Return a “Failed” texel fetch condition when sampling a PRT pixel whose tile is currently not in the pool ‒ OpenGL example: int glSparseTexture( gsampler2D sampler, vec2 P, inout gvec4 texel ); This information is then stored in render target or UAV ‒ Texel fetch failed for a given (x, y) tile location ...and then copied to the CPU so that application can upload required tiles App chooses what to render until missing data gets uploaded 90 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  89. Tiled Resources & Partially Resident Textures – “LOD WARNING”
      PRT fetch condition code can also indicate an “LOD Warning”
      The minimum LOD warning is specified by the application on a per texture basis
      ‒ OpenGL example: glTexParameteri( <target>, MIN_WARNING_LOD_AMD, <LOD warning value> );
      If a fetched pixel’s LOD is < the specified LOD warning value then the condition code is returned
      This functionality is typically used to try to predict when higher-resolution MIP levels will be needed
      ‒ E.g. Camera getting closer to PRT-mapped geometry
  90. Tiled Resources & Partially Resident Textures – EXAMPLE USAGE
      1. App allocates PRT (e.g. 16kx16k DXT1) using PRT API
      2. App uploads MIP levels using API calls
      3. Shader fetches PRT data at specified texcoords. Two possibilities:
         3.a. Texel data belongs to a resident (64KB) tile
              - Valid color returned, no error code
         3.b. Texel data points to non-resident tile or specified LOD
              - Error/LOD Warning code returned
              - Shader writes tile location and error code to RT or UAV
      4. App reads RT or UAV and uploads/releases tiles as needed
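The feedback loop in these steps can be sketched as a toy CPU-side simulation. Everything here is hypothetical scaffolding: the class, function names and condition codes stand in for the real PRT API, the page table is a plain dictionary, and the "render target" is a Python set of tile coordinates.

```python
# Toy model of the PRT feedback loop: fetch, record failures, stream tiles.
# All names and codes are hypothetical stand-ins for the real API.
FETCH_OK, FETCH_FAILED = 0, 1


class SparseTexture:
    """Page table mapping tile coordinates to resident tile data."""

    def __init__(self, tiles_x, tiles_y):
        self.tiles_x, self.tiles_y = tiles_x, tiles_y
        self.page_table = {}  # (tx, ty) -> data; missing key = unmapped

    def upload(self, tile, data):
        self.page_table[tile] = data

    def fetch(self, u, v):
        """Return (code, texel, tile) for normalised coords in [0, 1)."""
        tile = (int(u * self.tiles_x), int(v * self.tiles_y))
        if tile not in self.page_table:
            return FETCH_FAILED, None, tile
        return FETCH_OK, self.page_table[tile], tile


def render_pass(texture, sample_coords):
    """Step 3: the 'shader' records tiles whose fetch failed."""
    feedback = set()  # stands in for the RT/UAV written by the shader
    for u, v in sample_coords:
        code, _texel, tile = texture.fetch(u, v)
        if code == FETCH_FAILED:
            feedback.add(tile)
    return feedback


def stream_missing_tiles(texture, feedback):
    """Step 4: the app reads the feedback and uploads the missing tiles."""
    for tile in sorted(feedback):
        texture.upload(tile, f"texels for tile {tile}")


texture = SparseTexture(tiles_x=4, tiles_y=4)
coords = [(0.1, 0.1), (0.6, 0.3), (0.15, 0.2)]
first_pass = render_pass(texture, coords)
stream_missing_tiles(texture, first_pass)
second_pass = render_pass(texture, coords)
print(f"missing after pass 1: {sorted(first_pass)}, after pass 2: {sorted(second_pass)}")
```

In a real renderer the readback is asynchronous, so the app would keep drawing with lower-resolution MIP levels for a few frames while the requested tiles arrive.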
  91. Tiled Resources & Partially Resident Textures – TYPES, FORMATS & DIMENSIONS  All texture types and formats supported ‒1D, 2D, cube, arrays and 3D volume textures ‒All common texture formats ‒ Including compressed formats ‒Maximum dimensions: ‒16k x 16k x 8k x 128-bit textures 93 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  92. Hardware PRT > Software Implementation PRT Ease of implementation • Complexity hidden behind HW & API Full filtering support SW Implementation • Includes anisotropic filtering Full-speed filtering • SW solution requires “manual” filtering • Software anisotropic is very costly Don’t go overboard with PRT allocation! • Page table entry size is 4 DWORDs • Have to be resident in video memory 94 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
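The warning about page table size can be quantified with the numbers given in this deck: 64KB tiles and 4 DWORDs (16 bytes) per page table entry. The sketch below ignores MIP levels and any driver overhead, so treat the results as lower bounds.

```python
# Page table footprint for a PRT allocation: one 16-byte entry
# (4 DWORDs) per 64KB tile. MIP levels are ignored in this sketch.
TILE_BYTES = 64 * 1024
PTE_BYTES = 4 * 4  # 4 DWORDs of 4 bytes each


def page_table_bytes(texture_size_bytes):
    """Bytes of page table needed to cover a texture of the given size."""
    tiles = -(-texture_size_bytes // TILE_BYTES)  # ceiling division
    return tiles * PTE_BYTES


base_16k_rgba8 = 16384 * 16384 * 4        # 16k x 16k, 32 bits per texel
largest_prt = 32 * 2**40                  # the 32TB maximum

print(f"16k x 16k x 32-bit: {page_table_bytes(base_16k_rgba8) // 1024} KB")
print(f"32TB maximum      : {page_table_bytes(largest_prt) // 2**30} GB")
```

A single 16k x 16k texture costs a modest 256 KB of page table, but a maximum-size allocation would need 8 GB of entries alone, all of which must be resident in video memory.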
  95. THE BONUS SLIDES SHADER CODE 97 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  96. SHADER CODE EXAMPLE #2

      float fn0(float a, float b)
      {
          float c = 0.0;
          float d = 0.0;
          for (int i = 0; i < 100; i++)
          {
              if (c > 113.0) break;
              c = c * a + b;
              d = d + 1.0;
          }
          return (d);
      }

      // Registers r0 contains “a”, r1 contains “b”, r2 contains “c”
      // and r3 contains “d”
      // Value is returned in r3
          v_mov_b32       r2, #0.0          // float c = 0.0
          v_mov_b32       r3, #0.0          // float d = 0.0
          s_mov_b64       s0, exec          // Save execution mask (destination first)
          s_mov_b32       s2, #0            // i=0
      label0:
          s_cmp_lt_s32    s2, #100          // i<100
          s_cbranch_sccz  label1            // Exit loop if not true
          v_cmp_le_f32    r2, #113.0        // c > 113.0 (vcc set for lanes that continue)
          s_and_b64       exec, vcc, exec   // Update exec mask on fail
          s_branch_execz  label1            // Exit if no lanes remain active
          v_mul_f32       r2, r2, r0        // c = c*a
          v_add_f32       r2, r2, r1        // c = c+b
          v_add_f32       r3, r3, #1.0      // d = d+1.0
          s_add_s32       s2, s2, #1        // i++
          s_branch        label0            // Jump to start of loop
      label1:
          s_mov_b64       exec, s0          // Restore exec mask
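To see how the execution mask lets lanes leave the loop at different times, the sketch below emulates the shader example above for a small "wavefront" in Python. It is an illustrative model only: a list of booleans stands in for the 64-bit exec mask, the lane count is arbitrary, and arithmetic uses Python floats rather than 32-bit floats.

```python
# Emulation of the loop above across several lanes, with a boolean
# list standing in for the hardware execution mask.


def fn0_scalar(a, b):
    """Reference: the shader function as written, for one thread."""
    c, d = 0.0, 0.0
    for _ in range(100):
        if c > 113.0:
            break
        c = c * a + b
        d = d + 1.0
    return d


def fn0_wavefront(a_lanes, b_lanes):
    """Run all lanes in lockstep, masking off lanes that have exited."""
    n = len(a_lanes)
    c = [0.0] * n
    d = [0.0] * n
    saved_exec = [True] * n            # save execution mask
    exec_mask = list(saved_exec)
    i = 0                              # scalar loop counter, shared by all lanes
    while i < 100:                     # scalar compare and branch
        vcc = [c[lane] <= 113.0 for lane in range(n)]            # vector compare
        exec_mask = [e and v for e, v in zip(exec_mask, vcc)]    # update exec
        if not any(exec_mask):         # leave once no lane is active
            break
        for lane in range(n):          # vector ops only touch active lanes
            if exec_mask[lane]:
                c[lane] = c[lane] * a_lanes[lane] + b_lanes[lane]
                d[lane] = d[lane] + 1.0
        i += 1
    exec_mask = saved_exec             # restore execution mask
    return d


a_lanes = [2.0, 1.0, 0.0]
b_lanes = [1.0, 1.0, 200.0]
result = fn0_wavefront(a_lanes, b_lanes)
print(result)
```

The whole wavefront keeps iterating until the slowest lane finishes, so lanes that exit early sit idle. That is the cost of divergent control flow the deck alludes to when it recommends thinking about thread locality.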
  97. DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners. 100 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13

Editor's Notes

  1. This is now our next era, which we simply called Graphics Core Next. From a graphics standpoint, it delivers cutting-edge features and performance while still being very flexible and scalable, allowing all our Southern Islands parts to leverage the core. GCN delivers an amazing step up in heterogeneous computing, both in terms of a new, simpler and more powerful programming model and in terms of sheer efficiency and performance.
  2. Prior to 2002
     ‒ Graphics specific hardware
     ‒ Texture mapping/filtering
     ‒ Geometry processing
     ‒ Rasterization
     ‒ Dedicated texture and pixel caches
     ‒ Dot product and scalar multiply-add sufficient for basic graphics tasks
     ‒ No general purpose compute capability
     2002 - 2006
     ‒ Graphics-focused programmability
     ‒ DirectX 8/9
     ‒ Floating point processing (IEEE compliance not required)
     ‒ Specialized ALUs for vertex & pixel processing
     ‒ Limited shaders
     ‒ More dedicated caches (vertex, texture, color, depth)
     2007 to Present
     ‒ Unified shader architectures
     ‒ VLIW5: flexible and optimized for graphics workloads
     ‒ VLIW4: simplified and optimized for more general workloads
     ‒ More advanced caching
     ‒ Instruction, constant, multi-level texture/data, local/global data shares
     ‒ Basic general purpose compute
     ‒ CAL, Brook, ATI Stream
     ‒ IEEE compliant floating point math
     ‒ Graphics performance still primary objective
  3. Prior to 2002Graphics specific hardwareTexture mapping/filteringGeometry processingRasterizationDedicated texture and pixel cachesDot product and scalar multiply-add sufficient for basic graphics tasksNo general purpose compute capability2002 - 2006Graphics-focused programmabilityDirectX 8/9Floating point processing (IEEE compliance not required)Specialized ALUs for vertex &amp; pixel processingLimited shadersMore dedicated caches (vertex, texture, color, depth)2007 to PresentUnified shader architectures VLIW5: flexible and optimized for graphics workloadsVLIW4: simplified and optimized for more general workloadsMore advanced cachingInstruction, constant, multi-level texture/data, local/global data sharesBasic general purpose computeCAL, Brook, ATI StreamIEEE compliant floating point mathGraphics performance still primary objective
  4. Prior to 2002Graphics specific hardwareTexture mapping/filteringGeometry processingRasterizationDedicated texture and pixel cachesDot product and scalar multiply-add sufficient for basic graphics tasksNo general purpose compute capability2002 - 2006Graphics-focused programmabilityDirectX 8/9Floating point processing (IEEE compliance not required)Specialized ALUs for vertex &amp; pixel processingLimited shadersMore dedicated caches (vertex, texture, color, depth)2007 to PresentUnified shader architectures VLIW5: flexible and optimized for graphics workloadsVLIW4: simplified and optimized for more general workloadsMore advanced cachingInstruction, constant, multi-level texture/data, local/global data sharesBasic general purpose computeCAL, Brook, ATI StreamIEEE compliant floating point mathGraphics performance still primary objective
  5. Prior to 2002Graphics specific hardwareTexture mapping/filteringGeometry processingRasterizationDedicated texture and pixel cachesDot product and scalar multiply-add sufficient for basic graphics tasksNo general purpose compute capability2002 - 2006Graphics-focused programmabilityDirectX 8/9Floating point processing (IEEE compliance not required)Specialized ALUs for vertex &amp; pixel processingLimited shadersMore dedicated caches (vertex, texture, color, depth)2007 to PresentUnified shader architectures VLIW5: flexible and optimized for graphics workloadsVLIW4: simplified and optimized for more general workloadsMore advanced cachingInstruction, constant, multi-level texture/data, local/global data sharesBasic general purpose computeCAL, Brook, ATI StreamIEEE compliant floating point mathGraphics performance still primary objective
  11. Our VLIW4 and VLIW5 architectures are powerful and continue in our products, but they are certainly not the easiest targets for general purpose programming. The new design offers the same amount of ALU, but the scalar style of programming removes the register and instruction dependencies we had. Chained multiplies, for example, run at peak efficiency, versus ¼ rate on the HD 6900. The port simplification that comes from removing VLIW makes each instruction simple and easy to compile for. The tool chain for this architecture is massively simplified and can be made much more robust, and performance tuning is easier. Finally, this core supports advanced debug features, such as breakpoints and single stepping, that allow for much deeper debugging.
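The ¼-rate figure for chained multiplies can be illustrated with a toy packing model. This is only a sketch under stated assumptions: `pack_bundles` and `utilization` are hypothetical helpers, the greedy packer stands in for a real VLIW compiler, and an issue width of 1 is used as a stand-in for the per-lane scalar-style issue described above. It is not AMD's scheduler.

```python
def pack_bundles(deps, width):
    """Greedily pack instructions into bundles of at most `width` slots.

    deps[i] is the set of instruction indices that instruction i depends on.
    An instruction may only join a bundle if all of its inputs were produced
    by an EARLIER bundle (no forwarding inside a bundle in this toy model).
    """
    done = set()
    remaining = list(range(len(deps)))
    bundles = []
    while remaining:
        bundle = []
        for i in remaining:
            if len(bundle) == width:
                break
            if deps[i] <= done:
                bundle.append(i)
        if not bundle:
            raise ValueError("dependency cycle: nothing can be scheduled")
        bundles.append(bundle)
        done.update(bundle)
        remaining = [i for i in remaining if i not in done]
    return bundles


def utilization(deps, width):
    """Fraction of issue slots that hold useful work."""
    bundles = pack_bundles(deps, width)
    return len(deps) / (len(bundles) * width)


# Eight chained multiplies: each one consumes the previous result.
chain = [set()] + [{i - 1} for i in range(1, 8)]
# Eight independent multiplies, for comparison.
independent = [set() for _ in range(8)]

if __name__ == "__main__":
    print("VLIW4-style, chained:     ", utilization(chain, 4))
    print("VLIW4-style, independent: ", utilization(independent, 4))
    print("width-1 issue, chained:   ", utilization(chain, 1))
```

In this model a dependent chain can fill only one of four slots per bundle, while a width-1 issue model has no slots to waste, which is the effect the notes describe.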
  12. So what is Mantle?
  13. This is now our next era, which we simply called Graphics Core Next. From a graphics standpoint, it delivers cutting-edge features and performance while remaining very flexible and scalable, allowing all our Southern Islands parts to leverage the core. GCN also delivers an amazing step up in heterogeneous computing, both in terms of a new, simpler and more powerful programming model and in terms of sheer efficiency and performance.
  20. Purple: vector instructions. Blue: scalar instructions. Exec = Execution mask register, a 64-bit mask that defines which threads of the wavefront (64 threads) will do the work. It is already set at shader input (e.g. so that only rasterized pixels within a primitive are processed). VCC = Vector Condition Code register, a 64-bit mask with one bit per thread of the wavefront; it is the output of a vector compare instruction. SCC = Scalar Condition Code register, a single bit that is the output of a scalar instruction (for example, a scalar compare). Shader code will be visible in GPUShaderAnalyzer to allow optimizations.
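How the three registers interact on a branch can be sketched with a toy 64-lane model. The helper names below loosely echo ISA mnemonics but are hypothetical Python functions written for illustration, not the hardware's exact semantics.

```python
WAVE_SIZE = 64
FULL_MASK = (1 << WAVE_SIZE) - 1


def v_cmp_gt(values, threshold, exec_mask):
    """Vector compare: one result bit per ACTIVE lane, returned as a VCC mask."""
    vcc = 0
    for lane, value in enumerate(values):
        if (exec_mask >> lane) & 1 and value > threshold:
            vcc |= 1 << lane
    return vcc


def s_and(a, b):
    """Scalar AND of two 64-bit masks: returns (result, SCC).

    SCC is a single bit, modeled here as 1 when the result is non-zero.
    """
    result = a & b
    return result, int(result != 0)


def v_mul(values, factor, exec_mask):
    """Vector multiply: only lanes enabled in EXEC write a new value."""
    return [
        value * factor if (exec_mask >> lane) & 1 else value
        for lane, value in enumerate(values)
    ]


# "if (x > 31) x *= 2;" executed across one wavefront.
values = list(range(WAVE_SIZE))
exec_mask = FULL_MASK                      # all 64 lanes active at entry
vcc = v_cmp_gt(values, 31, exec_mask)      # per-lane condition
exec_mask, scc = s_and(exec_mask, vcc)     # narrow EXEC to the taken lanes
if scc:                                    # skip the body if no lane took it
    values = v_mul(values, 2, exec_mask)
```

The scalar unit only ever sees whole masks and the single SCC bit, so it can decide to skip the body for the entire wavefront; the per-lane decision lives in VCC and EXEC.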
  21. The new cache hierarchy was shown at AFDS, and this core implements the first version of it. It’s a full two-level R/W cache, with 16 KB of L1 per CU and 64 KB per L2 partition. Each CU has 64 bytes per cycle of L1 bandwidth, shared with the global data share (a buffer for sharing data between wavefronts). Each L2 partition delivers 64 bytes per cycle as well. That’s nearly 2 TB/s of L1 bandwidth and 700 GB/s of L2 bandwidth. Nice! Each group of four CUs shares a 32 KB instruction cache and a 16 KB scalar data cache. Coherency is handled at the L2 level, with applications able to keep the physical L2s updated directly from their L1s. Never settle for enough cache bandwidth!
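The aggregate bandwidth figures follow from the per-cycle numbers once a chip configuration is fixed. The sketch below assumes 32 CUs, 12 L2 partitions and a 925 MHz engine clock; those three values are assumptions (they match a Radeon HD 7970 class part) and are not stated in the notes.

```python
BYTES_PER_CYCLE = 64            # per CU for L1, per partition for L2 (from the notes)

# Assumptions, not given in the notes:
NUM_CUS = 32
NUM_L2_PARTITIONS = 12
ENGINE_CLOCK_HZ = 925_000_000


def aggregate_bandwidth(units, bytes_per_cycle, clock_hz):
    """Peak bandwidth in bytes/s when every unit transfers every cycle."""
    return units * bytes_per_cycle * clock_hz


l1_bw = aggregate_bandwidth(NUM_CUS, BYTES_PER_CYCLE, ENGINE_CLOCK_HZ)
l2_bw = aggregate_bandwidth(NUM_L2_PARTITIONS, BYTES_PER_CYCLE, ENGINE_CLOCK_HZ)

if __name__ == "__main__":
    print(f"L1: {l1_bw / 1e12:.2f} TB/s")
    print(f"L2: {l2_bw / 1e9:.0f} GB/s")
```

Under these assumptions the results land close to the "nearly 2 TB/s" and "700 GB/s" quoted above.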
  22. FLAT instruction fields (field, width in bits, description):
     ADDR (8): VGPR which holds the address. For 64-bit addresses, ADDR has the LSBs and ADDR+1 has the MSBs.
     DATA (8): VGPR which holds the first dword of data. Instructions can use 0-4 dwords.
     VDST (8): VGPR destination for data returned to the shader, either from LOADs or Atomics with GLC=1 (return pre-op value).
     SLC (1): System Level Coherent. Used in conjunction with GLC and MTYPE to determine cache policies.
     GLC (1): Global Level Coherent. For Atomics, GLC=1 means return pre-op value, 0 = do not return pre-op value.
     TFE (1): Texel Fail Enable for PRT (Partially Resident Textures). When set, fetch may return a NACK which causes a VGPR write into DST+1 (first GPR after all fetch-dest GPRs).
     (M0) (32): Implied use of M0. M0[16:0] contains the byte-size of the LDS segment; this is used to clamp the final address.
     Opcodes:
     Loads: FLAT_LOAD_UBYTE, FLAT_LOAD_SBYTE, FLAT_LOAD_USHORT, FLAT_LOAD_SSHORT, FLAT_LOAD_DWORD, FLAT_LOAD_DWORDX2, FLAT_LOAD_DWORDX3, FLAT_LOAD_DWORDX4
     Stores: FLAT_STORE_BYTE, FLAT_STORE_SHORT, FLAT_STORE_DWORD, FLAT_STORE_DWORDX2, FLAT_STORE_DWORDX3, FLAT_STORE_DWORDX4
     Atomics: FLAT_ATOMIC_SWAP, FLAT_ATOMIC_CMPSWAP, FLAT_ATOMIC_ADD, FLAT_ATOMIC_SUB, FLAT_ATOMIC_SMIN, FLAT_ATOMIC_UMIN, FLAT_ATOMIC_SMAX, FLAT_ATOMIC_UMAX, FLAT_ATOMIC_AND, FLAT_ATOMIC_OR, FLAT_ATOMIC_XOR, FLAT_ATOMIC_INC, FLAT_ATOMIC_DEC, FLAT_ATOMIC_FCMPSWAP, FLAT_ATOMIC_FMIN, FLAT_ATOMIC_FMAX
     Atomics (X2 variants): FLAT_ATOMIC_SWAP_X2, FLAT_ATOMIC_CMPSWAP_X2, FLAT_ATOMIC_ADD_X2, FLAT_ATOMIC_SUB_X2, FLAT_ATOMIC_SMIN_X2, FLAT_ATOMIC_UMIN_X2, FLAT_ATOMIC_SMAX_X2, FLAT_ATOMIC_UMAX_X2, FLAT_ATOMIC_AND_X2, FLAT_ATOMIC_OR_X2, FLAT_ATOMIC_XOR_X2, FLAT_ATOMIC_INC_X2, FLAT_ATOMIC_DEC_X2, FLAT_ATOMIC_FCMPSWAP_X2, FLAT_ATOMIC_FMIN_X2, FLAT_ATOMIC_FMAX_X2
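Two of the field descriptions above are easy to misread, so here is a minimal sketch of them: the 64-bit address split across ADDR and ADDR+1, and the GLC bit choosing whether an atomic returns the pre-op value. `flat_address` and `flat_atomic_add` are hypothetical helpers, the register file is a plain list, and memory is a dict; nothing here models caches, coherency, or the M0 clamp.

```python
DWORD_MASK = 0xFFFFFFFF


def flat_address(vgprs, addr):
    """Build a 64-bit address: ADDR holds the low dword, ADDR+1 the high dword."""
    low = vgprs[addr] & DWORD_MASK
    high = vgprs[addr + 1] & DWORD_MASK
    return (high << 32) | low


def flat_atomic_add(memory, address, data, glc):
    """32-bit atomic add on a dict-backed memory.

    With glc=True the pre-op value is returned (it would land in VDST);
    with glc=False nothing is returned to the shader.
    """
    pre_op = memory.get(address, 0)
    memory[address] = (pre_op + data) & DWORD_MASK
    return pre_op if glc else None


vgprs = [0x89ABCDEF, 0x00000001]     # v0 = LSBs, v1 = MSBs
address = flat_address(vgprs, 0)

memory = {address: 5}
returned = flat_atomic_add(memory, address, 3, glc=True)
```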
  23. Some stats to illustrate a 20-90% improvement in key metrics for a 24% increase in area.
  25. The HW team has redesigned the GDDR5 memory interface to be smaller and more power efficient. After this redesign, the resulting 512-bit interface and controllers are 20% smaller than the 384-bit interface they replace. The target frequency yields a 20% increase in total accessible bandwidth, for a 50% increase in bandwidth per mm². World-class IP.
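The 50% figure follows directly from the other two percentages. A minimal check, using only the ratios quoted in the notes (the variable names are illustrative):

```python
area_ratio = 1.0 - 0.20        # new interface is 20% smaller
bandwidth_ratio = 1.0 + 0.20   # new interface delivers 20% more bandwidth

# Bandwidth per unit area, relative to the old interface.
density_ratio = bandwidth_ratio / area_ratio

if __name__ == "__main__":
    print(f"bandwidth per mm^2: {density_ratio:.2f}x the old interface")
```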
  31. The R9 290 device is the first GCN part to scale to 4 primitives per clock. Interstage parameter and position storage is provided on chip to enable the necessary in-flight overlap. Each geometry engine provides surface, tessellation, geometry and vertex management and output primitive filtering to drive the four partitioned rasterizers efficiently. For low to mid levels of amplification, the geometry stage has added a driver/compiler-controlled mode that retains interstage data in shared memory to reduce external bandwidth requirements and latency effects, which can as much as double performance in some scenarios. Finally, for tessellation, improvements have been made in staging storage and control to improve overall performance.
  32. I stated earlier that we have our next-generation geometry engines, two of them, in here. This latest generation also significantly improves both tessellation and geometry buffer performance. Lots of changes went in to make this happen; the biggest are listed here. This allows us to get up to 4x the performance of our previous HD 6900 series architecture. Let’s see it.
  35. Pre-tessellate as needed in order to avoid higher tessellation factors.
  37. The R9 290 series provides a massive 64-pixel rasterization capability, with depth and stencil tests for 256 pixels per clock. The render back-end units can drive color writes and blending operations for up to 64 surviving pixels per clock. This capability will move the bottleneck from pixel fill to bandwidth in some scenarios.
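A back-of-the-envelope sketch shows why 64 pixels per clock can become bandwidth-bound. Only the 64 pixels/clock figure comes from the notes; the 1 GHz engine clock, 4 bytes per pixel, and 320 GB/s of memory bandwidth are assumptions chosen for illustration, and the model ignores caching and compression entirely.

```python
PIXELS_PER_CLOCK = 64                 # from the notes

# Assumptions, not given in the notes:
ENGINE_CLOCK_HZ = 1_000_000_000       # 1 GHz engine clock
BYTES_PER_PIXEL = 4                   # 32-bit color target
MEMORY_BANDWIDTH = 320_000_000_000    # bytes/s of DRAM bandwidth


def color_traffic(blending):
    """Peak color traffic in bytes/s at full pixel rate.

    Blending reads the destination and writes it back, so it is modeled
    as two accesses per pixel; a plain write is one.
    """
    accesses = 2 if blending else 1
    return PIXELS_PER_CLOCK * BYTES_PER_PIXEL * ENGINE_CLOCK_HZ * accesses


write_only = color_traffic(blending=False)
blended = color_traffic(blending=True)

if __name__ == "__main__":
    for name, traffic in (("write only", write_only), ("blended", blended)):
        bound = "bandwidth" if traffic > MEMORY_BANDWIDTH else "pixel fill"
        print(f"{name}: {traffic / 1e9:.0f} GB/s, limited by {bound}")
```

With these assumed numbers, plain color writes fit within memory bandwidth while blended writes exceed it, which is one scenario in which the bottleneck shifts as the notes describe.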
  38. Present TrueAudio as the solution to the limitations imposed by today’s PC audio solutions. Emphasize real-time and programmability.
  39. SPATIALIZATION / 3D AUDIO
       ‒ Surround sound with stereo gaming headsets
       ‒ Know exactly where the enemy is
     REVERBS
       ‒ More realistic sound environment
     AUDIO/VOICE STREAMS
       ‒ Fuller sound for games with many scene objects
     MASTERING LIMITERS
       ‒ Reduce developer workload with real-time limiters
  40. Some immediate benefits of TRUEAUDIO – It enables you to hear hundreds more REALTIME VOICES AND AUDIO channels in your game than what is possible on CPUs today
  41. AMD is working with audio plugin developers such as GenAudio to provide an immersive audio experience when integrated into games. Gamers who use stereo headsets (either through USB or audio jacks) will enjoy virtual surround sound, accelerated by AMD TrueAudio technology. This level of integration leads to accurate 3-dimensional audio, since position data is extracted directly from the game. By contrast, headsets with virtual surround sound capability use simple audio expansion algorithms with no knowledge of the game’s environment.
  42. That simplicity has attracted the world’s top game devs. Pick some big ones by name: DICE (BF4), Eidos Montreal (Thief), Irrational Games (BioShock), Crytek (Crysis 3).