GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
THE AMD GCN ARCHITECTURE
A CRASH COURSE
@MissQuickstep
LAYLA MAH – LAYLA.MAH@AMD.COM
DEVELOPER TECHNOLOGY ENGINEER
AGENDA
Part 1: A Brief History of GPU Evolution
Part 2: Introduction to Graphics Core Next (GCN)
Part 3: Anatomy of a GCN Compute Unit (CU)
Part 4: GCN Shader: Arbitration, Examples & Tips
Part 5: GCN Memory Hierarchy
Part 6: GCN Compute Architecture (ACE)
Part 7: GCN Fixed Function Units
(CP, GeometryEngine, Rasterizer, RBE, …)
Part 8: Main Takeaways & Conclusion
Bonus Slides: Tiled Resources, Partially Resident Textures (PRT)
| A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GPU EVOLUTION
[Diagram: three eras of GPU design ‒ 1st Era: Fixed Function (3D geometry transformation, lighting); 2nd Era: Simple Shaders; 3rd Era: Graphics Parallel Core (VLIW5 stream processing units with FMAD + special-function hardware, branch unit, and general purpose registers; later VLIW4).]
GPU EVOLUTION
1ST ERA: FIXED FUNCTION

Prior to 2002
Graphics-specific hardware
‒ Texture mapping/filtering
‒ Transform & Lighting (T&L) Engines
‒ Geometry processing
‒ Rasterization
‒ Fixed function lighting equations
‒ Dot product and scalar multiply-add
‒ Multi-texturing
‒ Dedicated texture and pixel caches
‒ Sufficient for basic graphics tasks
‒ No general purpose compute capability
GPU EVOLUTION
2ND ERA: SIMPLE SHADERS
[Diagram: second-era pipeline ‒ Memory Interface, Setup Engine, 8 Vertex Pipes, 16 Pixel Pipes, Pixel Shader Core.]
GPU EVOLUTION
2ND ERA: SIMPLE SHADERS

2002-2006
Graphics Programmability – The Rise of Shaders
‒ Direct3D 8/9, OpenGL 2.0
‒ Specialized shader units for vertex & pixel processing
‒ Added dedicated caches
‒ Floating point processing
‒ Different precision per IHV; IEEE not required
  ‒ ATI 24-bit full-speed
  ‒ NV 16-bit full-speed
  ‒ NV 32-bit half-speed

Shader Models 1.0 - 2.0
‒ VS and PS are distinct
‒ Minimal Instruction Sets
‒ Limited Instruction Slots
‒ Limited Shader Lengths
‒ No Dynamic Flow Control
‒ No Looping Constructs
‒ No Vertex Texture Fetch
‒ No Bitwise Operators
‒ No Native Integer ALU
‒ […]
GPU EVOLUTION
3RD ERA: GRAPHICS PARALLEL CORE

The Rise of The Unified Shader (VLIW-5)
5-Element Very-Long-Instruction-Word (XYZWT)
‒ Ideal for 4-element Vector and 4x4 Matrix Operations
‒ Vector/Vector math in a single instruction
‒ Plus one Transcendental-Unit function per instruction
‒ Began with XENOS and utilized from R600 until “Cayman”
‒ Flexible and optimized for Graphics workloads

Single Precision 32-bit IEEE-Compliant Floating Point ALUs

More advanced caching
‒ Instruction, constant, multi-level texture/data, & later: LDS/GDS

More flexible: Unified ALU, Branch Unit, Dynamic Flow Control, Vertex Texture, Geometry Shader, Tessellation Engines, etc.
GPU EVOLUTION
3RD ERA: GRAPHICS PARALLEL CORE

Optimized For Die Area Efficiency (VLIW-4)
4-Element Very-Long-Instruction-Word (XYZW)
‒ Profiling showed average VLIW utilization was < 3.4/5
‒ Removed dedicated T-Unit – optimized die area usage
  ‒ Each ALU has a smaller LUT
  ‒ Combined using 3-term Lagrange polynomial interpolation across multiple ALUs
‒ Still ideal for 4-element Vector and 4x4 Matrix Operations
‒ Fewer ALU bubbles in transcendental-light code, better utilization
‒ Simplified programming and optimization relative to VLIW-5
‒ Better optimized for combination of Graphics & Compute
  ‒ Graphics is still the primary focus, but compute is gaining attention
  ‒ Improved support for DirectCompute™ and OpenCL™
  ‒ Multiple dispatch processors & separate command queues
GPU EVOLUTION
VLIW4 SIMD vs. GCN QUAD SIMD-16
[Diagram: one VLIW4 SIMD of 16 lanes vs. four GCN SIMD-16 units (SIMD 0-3, lanes 0-15 each)]

VLIW4 SIMD | GCN Quad SIMD-16
‒ 64 Single Precision multiply-adds (per-clock) | 64 Single Precision multiply-adds (per-clock)
‒ 16 SIMD lanes × ( 1 VLIW inst × 4 ALU ops ) | 4 SIMDs × ( 1 ALU op × 16 threads )
‒ 1 VLIW inst containing 4 ALU ops (per-clock) | 4 ALU ops (from different wavefronts) per clock
‒ Needs 4 parallel ALU ops to fill each VLIW inst | Needs 4+ wavefronts to keep SIMD lanes full
‒ Compiler manages register port conflicts | No register port conflicts
‒ Specialized, complex compiler scheduling | Standard compiler scheduling & optimizations
‒ Difficult assembly creation, analysis, and debug | Simplified assembly creation, analysis, & debug
‒ Complicated tool chain support | Simplified tool chain development and support
‒ Careful optimization required for peak performance | Stable and predictable performance
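The left/right pairs above describe the same 64-op/clock peak reached two different ways. A quick arithmetic sketch of the comparison (plain Python, numbers taken from the slide):

```python
# Peak single-precision ALU ops per clock, per Compute Unit, both designs.

# VLIW4 SIMD: 16 lanes, each executing one 4-op VLIW instruction per clock.
vliw_lanes = 16
ops_per_vliw_inst = 4
vliw_ops_per_clock = vliw_lanes * ops_per_vliw_inst

# GCN quad SIMD-16: 4 SIMDs, each issuing one ALU op across 16 lanes,
# with the 4 ops coming from different wavefronts.
gcn_simds = 4
lanes_per_simd = 16
gcn_ops_per_clock = gcn_simds * lanes_per_simd

# Same peak rate; the parallelism just comes from different places:
# VLIW4 needs 4 independent ops inside ONE thread's instruction stream,
# GCN needs 4+ independent wavefronts resident on the CU.
assert vliw_ops_per_clock == gcn_ops_per_clock == 64
```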
MANTLE
GS-4112 – Mantle: Empowering 3D Graphics Innovation
Keynote – Johan Andersson, Technical Director, EA
GS-4145 – Oxide on Mantle Adoption (Wed 5:00-5:45)

New low-level programming interface for PCs
Designed in collaboration with top game developers
Lightweight driver that allows direct access to GPU hardware
Compatible with DirectX® HLSL for simplified porting
Works with all Graphics Core Next GPUs
[Diagram: Graphics Applications → Mantle API → Mantle Driver → GCN]
AMD GRAPHICS CORE NEXT ARCHITECTURE
A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
Faster performance
Higher efficiency
New graphics features
New compute features
Cutting-edge graphics performance and features
High compute density with multi-tasking
Built for power efficiency
Optimized for heterogeneous computing
Enabling the Heterogeneous System Architecture (HSA)
Amazing scalability and flexibility
Unlimited Resources & Samplers
All UAV formats can be read/write
Simpler Assembly Language
Simpler Shader Code
Ability to support C/C++-like languages
Architectural support for traps, exceptions & debugging
Ability to share virtual x86-64 address space with CPU cores
AMD TECHNOLOGY POWERS NEXT-GEN CONSOLES
New next-gen game consoles raise the bar for graphics performance:
‒ Performance: TFLOPS-class compute power
‒ Memory: 16x more memory*
* Based on PlayStation 3 512MB vs. PlayStation 4 8192MB GDDR5.

GCN COMPUTE UNIT
CU = Basic Building Block of GPU Computational Power
[Diagram: Scheduler, Branch & Message Unit, Scalar Unit, Vector Units (4x SIMD-16), Texture Filter Units (4), Texture Fetch Load/Store Units (16), Vector Registers (4x 64KB), Scalar Registers (8KB), Local Data Share (64KB), L1 Cache (16KB)]

New Instruction Set Architecture
‒ Non-VLIW
‒ Vector unit + scalar co-processor

Distributed programmable scheduler
‒ Each CU can execute instructions from multiple kernels at once
‒ Multi-tasking

Increased instructions per clock per mm2
‒ High utilization
‒ High throughput
GCN COMPUTE UNIT
[CU block diagram, as above]

Hardware Scheduler
‒ Up-to 2560 threads
‒ Separate Instruction Decode
‒ 16 Hardware Barriers

4x Vector Units (SIMD-16)
‒ CU Total Throughput: 64 Single-Precision (SP) ops/clock
‒ 1 SP operation per wavefront per 4 clocks
‒ 1 DP (Double-Precision) ADD in 8 clocks
‒ 1 DP MUL/FMA/Transcendental per 16 clocks*
‒ 4x64KB Vector Registers (VGPR)

Scalar Unit
‒ Fully Programmable; executes branch instructions
  ‒ + Special Instructions (NOPs, barriers, etc.)
‒ Used for flow control, pointer arithmetic, etc.
‒ Has its own GPR pool: 8KB Scalar General Purpose Registers (SGPR)
‒ 16KB 4-CU Shared R/O L1 Scalar Data Cache

Branch & Message Unit

64KB Local Data Share (LDS)
‒ 32 banks, with Conflict Resolution
‒ Bandwidth Amplification
‒ 2x larger than the D3D11 TGSM limit (32KB/thread group)

16KB Read/Write L1 Vector Data Cache
‒ Shared with TMU as Texture Cache

16 Texture Fetch Load/Store Units; 4 Texture Filter Units
GCN COMPUTE UNIT
SIMD SPECIFICS
[CU block diagram, as above]

Each Compute Unit (CU) contains 4 SIMDs; each SIMD has:
‒ A 16-lane IEEE-754 vector ALU (VALU)
‒ 64KB of vector register file (VGPR)
‒ Its own 40-bit (48-bit on HSA APUs) Program Counter (PC)
‒ An instruction buffer for 10 wavefronts*
‒ *A wavefront is a group of 64 threads: the size of one logical VGPR
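The footnote's claim that a wavefront is the size of one logical VGPR follows directly from the register-file arithmetic; a small sketch (plain Python, sizes from the slide):

```python
# A logical VGPR holds one 32-bit value per lane of a 64-thread wavefront.
threads_per_wavefront = 64
bytes_per_lane_value = 4
logical_vgpr_bytes = threads_per_wavefront * bytes_per_lane_value  # 256 B

# Each SIMD's 64KB register file therefore holds 256 logical VGPRs,
# shared among all wavefronts resident on that SIMD.
vgpr_file_bytes = 64 * 1024
logical_vgprs_per_simd = vgpr_file_bytes // logical_vgpr_bytes

# At the full 10 resident wavefronts, each wave can use at most ~25 VGPRs
# (allocation granularity rounds this down in practice).
per_wave_at_full_occupancy = logical_vgprs_per_simd // 10

assert logical_vgprs_per_simd == 256
assert per_wave_at_full_occupancy == 25
```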
GCN COMPUTE UNIT
SCALAR UNIT SPECIFICS
[Diagram: SIMDs 0-3 (16 lanes each) alongside the Scalar Unit]

GCN Scalar Unit
‒ Fully Programmable Scalar Unit replaces fixed-function Branch Logic
‒ Operations such as JMP [GPR] are now supported
  ‒ Opens the door to e.g. virtual function calls
‒ Has its own GPR pool and can execute normal ALU code
  ‒ 64-bit bitwise ops to mask thread execution
  ‒ 32-bit bitwise and integer arithmetic operations at full-speed
‒ Potential to offload uniform code (Vector ALU → Scalar ALU)
‒ A GCN CU can dispatch 1 scalar op/clock
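The 64-bit mask manipulation the scalar unit performs can be modeled with plain integers; a minimal sketch (Python; the mask values are made up for illustration):

```python
# Sketch of the scalar unit's 64-bit execution-mask arithmetic.
WAVE_BITS = (1 << 64) - 1     # one bit per thread of a 64-wide wavefront

vcc = 0x00000000FFFFFFFF      # e.g. lanes 0-31 passed a vector compare
exec_mask = WAVE_BITS         # all 64 lanes currently active

# s_and_b64 exec, vcc, exec -- keep only lanes that passed the compare
exec_mask &= vcc

# s_cbranch_execz -- branch taken only when no lane remains active
branch_taken = (exec_mask == 0)

active_lanes = bin(exec_mask).count("1")
assert active_lanes == 32
assert branch_taken is False
```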
GCN COMPUTE UNIT
SCALAR UNIT SPECIFICS CONTINUED
[Diagram: SIMDs 0-3 and the Scalar Unit (Integer ALU, 8KB Registers, Scalar Decode), 4-CU shared 16KB Scalar R/O L1, R/W L2]

GCN Scalar Unit
‒ Natively a 64-bit integer ALU
‒ Independent arbitration and instruction decode
‒ One ALU, memory, or control flow op per cycle
‒ 512 Scalar GPRs per SIMD, shared between waves
  ‒ { SGPRn+1, SGPRn } pairs provide 64-bit registers
‒ 4-CU Shared Read-Only Scalar Data Cache: 16KB, 64B lines
  ‒ 4-way assoc, LRU replacement policy
  ‒ Peak bandwidth per CU is 16 bytes/cycle
GCN COMPUTE UNIT
BRANCH & MESSAGE UNIT

Independent scalar assist unit to handle special classes of instructions concurrently
‒ Branch
  ‒ Unconditional Branch (s_branch)
  ‒ Conditional Branch (s_cbranch_<cond>)
    ‒ Conditions: SCC == 0, SCC == 1, EXEC == 0, EXEC != 0, VCC == 0, VCC != 0
  ‒ A 16-bit signed immediate dword offset from the PC is provided
‒ Messages
  ‒ s_sendmsg: CPU interrupt with optional halt (with shader-supplied code and source)
  ‒ Debug messages (perf trace data, halt, etc.)
  ‒ Special graphics synchronization messages
GCN COMPUTE UNIT
MEMORY SPECIFICS

Each CU has its own dedicated L1 cache and LDS memory
‒ Both global and shared memory atomics are supported

64KB Local Data Share (LDS)
‒ 32 banks, with conflict resolution
‒ 16 work group barriers supported per CU

16KB R/W L1 Vector Data Cache
‒ Vector L1 Read/Write data cache shared with TMU as texture cache

Scalar Unit
‒ 16KB Scalar L1 Read-Only data cache shared between 4 neighbor CUs

A GCN GPU with 44 CUs, such as the AMD Radeon™ R9 290X, can be working on up-to 112,640 work items at a time!
GCN COMPUTE UNIT
SCHEDULER SPECIFICS

Each CU has its own dedicated Scheduler unit
Each CU can have 40 waves in-flight
‒ Each potentially from a different kernel

Scheduler Limits:
‒ Supports up-to 2560 threads per CU (64 threads x 10 waves x 4 SIMDs)
  ‒ 10 wavefronts per SIMD, 40 wavefronts per CU
  ‒ Limited by available GPR count
  ‒ Limited by available LDS memory
‒ 16 hardware barriers per CU
‒ All threads within a workgroup are guaranteed to reside on the same CU simultaneously
  ‒ A set of synchronization primitives and shared memory allow data to be passed between threads in a workgroup
‒ Optimized for throughput – latency is hidden by overlapping execution of wavefronts

A GCN GPU with 44 CUs, such as the AMD Radeon™ R9 290X, can be working on up-to 112,640 work items at a time!
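The 112,640 figure is just the scheduler limits above multiplied out:

```python
# Scheduler limits multiplied out for a 44-CU part such as the R9 290X.
threads_per_wavefront = 64
wavefronts_per_simd = 10
simds_per_cu = 4

threads_per_cu = threads_per_wavefront * wavefronts_per_simd * simds_per_cu
assert threads_per_cu == 2560            # "2560 threads per CU" on the slide

cus = 44
assert threads_per_cu * cus == 112_640   # in-flight work items, as quoted
```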
GCN COMPUTE UNIT
SCHEDULER SPECIFICS – ARBITRATION & DECODE

A CU is guaranteed to issue instructions for a wave sequentially
‒ Predication & control flow enable any single work-item a unique execution path

For a CU, every clock, waves on 1 SIMD are considered for issue
‒ Round-Robin scheduling algorithm

Maximum 5 instructions per cycle
‒ Not including “internal” instructions

Instruction Types:
‒ 1 Vector Arithmetic Logic Unit (VALU)
‒ 1 Scalar ALU or Scalar Memory (SALU | SMEM)
‒ 1 Vector Memory (Read/Write/Atomic) (VMEM)
‒ 1 Branch/Message (e.g. s_branch, s_cbranch)
‒ 1 Local Data Share (LDS)
‒ 1 Export or Global Data Share (GDS)
‒ 1 Special/Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio) – [no functional unit]

At most, 1 instruction from each category may be issued
At most, 1 instruction per wave may be issued
Theoretical maximum of 5 instructions per cycle per CU
GCN COMPUTE UNIT
VECTOR & SCALAR ARBITRATION – HARDWARE VIEW
[Diagram: SIMDs 0-3 (lanes 0-15 each) and the Scalar Unit]

GCN Hardware View
‒ A GCN Compute Unit can retire 256 SP Vector ALU ops in 4 clocks
‒ Each lane can dispatch 1 SP ALU operation per clock
‒ Each SP ALU operation takes 4 clocks to complete
‒ The scheduler dispatches from a different wavefront each cycle
GCN COMPUTE UNIT
VECTOR & SCALAR ARBITRATION – PROGRAMMER VIEW
[Diagram: one 64-lane view (lanes 0-63) built from wavefronts 0-9 cycling across the SIMDs, plus the Scalar Unit]

GCN Programmer View
‒ A GCN Compute Unit can perform 64 SP Vector ALU ops / clock
‒ Each lane can dispatch 1 SP ALU operation per clock
‒ Each SP ALU operation still takes 4 clocks to complete
‒ But you can PRETEND your code runs 1 op on 64 threads at once
GCN VECTOR UNITS
ALU CHARACTERISTICS

FMA (Fused Multiply-Add): IEEE 754-2008 precise with all round modes, proper handling of NaN/Inf/Zero, and full de-normal support in hardware for SP and DP

MULADD: single-cycle issue instruction without truncation, enabling a MULieee followed by an ADDieee to be combined, with round and normalization after both the multiplication and the subsequent addition

VCMP: a full set of operations designed to fully implement all the IEEE 754-2008 comparison predicates

IEEE Rounding Modes (Round toward +Infinity, Round toward –Infinity, Round to nearest even, Round toward zero) supported under program control anywhere in the shader. SP and DP modes are controlled separately.

De-normal Programmable Mode control for SP and DP independently. Separate controls for input flush to zero and underflow flush to zero.
GCN VECTOR UNITS
ALU CHARACTERISTICS CONTINUED

Divide Assist Ops: IEEE 0.5 ULP division accomplished with a macro (~15/41 instruction slots for SP/DP, respectively)

FP Conversion Ops: between 16-bit, 32-bit, and 64-bit floats with full IEEE-754 precision and rounding

Exceptions: support in hardware for floating point numbers with a software recording and reporting mechanism. Inexact, underflow, overflow, division by zero, de-normal, invalid operation, and integer divide by zero

64-bit Transcendental Approximation: hardware-based double precision approximation for reciprocal, reciprocal square root, and square root

24-bit Integer MUL/MULADD/LOGICAL/SPECIAL @ full SP rates
‒ Heavily utilized for integer thread group address calculation
32-bit integer MUL/MULADD @ DP MUL/FMA rate
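A sketch of why 24-bit multiplies cover the thread-group address math mentioned above (Python; `flat_thread_id` is an illustrative helper, not a GCN intrinsic):

```python
# A flattened global thread id is a multiply-add: gid * group_dim + lid.
# Workgroup counts in real dispatches stay far below 2^24 (~16.7M),
# so the multiply fits the fast 24-bit integer path.
def flat_thread_id(gid, lid, group_dim):
    # gid: workgroup id, lid: local id within the group,
    # group_dim: threads per workgroup
    return gid * group_dim + lid

tid = flat_thread_id(gid=1000, lid=17, group_dim=256)
assert tid == 256_017
assert 1000 * 256 < 2**24   # operands stay within 24-bit multiply range
```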
GCN SHADER AUTHORING TIPS

GCN has greatly improved branch performance, and it continues to improve
‒ Don’t be afraid to use it! But, remember: use it wisely – improved != free
‒ It’s at its best for highly coherent workloads (where most threads take the same path)

However, the new architecture is more susceptible to register pressure
‒ Using too many registers within a shader can reduce the maximum waves per SIMD!
‒ NOTE: a wavefront can allocate at most 104 user scalar registers, as several scalar registers are reserved for architectural state

GCN SGPR Count:   <=48 | 56 | 64 | 72 | 84 | 100 | >100
Max Waves/SIMD:     10 |  9 |  8 |  7 |  6 |   5 |    4

GCN VGPR Count:   <=24 | 28 | 32 | 36 | 40 | 48 | 64 | 84 | <=128 | >128
Max Waves/SIMD:     10 |  9 |  8 |  7 |  6 |  5 |  4 |  3 |     2 |    1

‒ Take caution with respect to the following:
  ‒ Excessive nested branching/looping
  ‒ Loop unrolling
  ‒ Variable declarations (especially arrays)
  ‒ Excessive function calls requiring storing of results
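The occupancy table above can be turned into a lookup; a sketch (Python; `max_waves` is an illustrative helper built from the table's breakpoints):

```python
# Max waves/SIMD as a function of register use, per the table above.
def max_waves(vgprs, sgprs):
    vgpr_limit = [(24, 10), (28, 9), (32, 8), (36, 7), (40, 6),
                  (48, 5), (64, 4), (84, 3), (128, 2)]
    sgpr_limit = [(48, 10), (56, 9), (64, 8), (72, 7), (84, 6), (100, 5)]
    v = next((w for cap, w in vgpr_limit if vgprs <= cap), 1)
    s = next((w for cap, w in sgpr_limit if sgprs <= cap), 4)
    return min(v, s)   # whichever register file runs out first wins

assert max_waves(vgprs=24, sgprs=48) == 10   # fully occupied
assert max_waves(vgprs=50, sgprs=30) == 4    # VGPR-bound
assert max_waves(vgprs=20, sgprs=90) == 5    # SGPR-bound
```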
GCN SHADER CODE EXAMPLE

// Registers: r0 contains “a”, r1 contains “b”
// Value is returned in r2
v_cmp_gt_f32    r0, r1           // a > b, establish VCC
s_mov_b64       s0, exec         // Save current exec mask
s_and_b64       exec, vcc, exec  // Do “if”
s_cbranch_vccz  label0           // Branch if all lanes fail
v_sub_f32       r2, r0, r1       // result = a – b
v_mul_f32       r2, r2, r0       // result = result * a
label0:
s_andn2_b64     exec, s0, exec   // Do “else” (s0 & !exec)
s_cbranch_execz label1           // Branch if all lanes fail
v_sub_f32       r2, r1, r0       // result = b – a
v_mul_f32       r2, r2, r1       // result = result * b
label1:
s_mov_b64       exec, s0         // Restore exec mask

An alternative to s_cbranch is to use VSKIP to transform VALU instructions into NOPs
‒ s_setvskip – enables or disables VSKIP mode. Requires 1 waitstate after executing.
‒ VSKIP does NOT skip VMEM instructions (Do: branch over superfluous VMEM inst.)
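The exec-mask flow in the listing can be emulated lane by lane. A toy Python model of just the masking logic (4 lanes instead of 64; not full GCN semantics — for instance, s_cbranch_vccz actually tests VCC, which here equals the freshly masked exec):

```python
# Emulate the if/else exec-mask sequence for a 4-lane toy "wavefront".
a = [3.0, 1.0, 5.0, 2.0]
b = [2.0, 4.0, 5.0, 9.0]
r = [0.0] * 4

full = 0b1111
vcc = sum(1 << i for i in range(4) if a[i] > b[i])  # v_cmp_gt_f32
s0 = full                                           # s_mov_b64  s0, exec
exec_mask = full & vcc                              # s_and_b64  exec, vcc, exec

if exec_mask:                                       # s_cbranch_vccz not taken
    for i in range(4):
        if exec_mask >> i & 1:
            r[i] = (a[i] - b[i]) * a[i]             # "if" side

exec_mask = s0 & ~exec_mask                         # s_andn2_b64 exec, s0, exec

if exec_mask:                                       # s_cbranch_execz not taken
    for i in range(4):
        if exec_mask >> i & 1:
            r[i] = (b[i] - a[i]) * b[i]             # "else" side

exec_mask = s0                                      # s_mov_b64  exec, s0

assert r == [3.0, 12.0, 0.0, 63.0]   # lane 0 took "if", lanes 1-3 took "else"
```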
GCN MEMORY
CACHE HIERARCHY
[Block diagram summary:]
‒ 32KB instruction cache (I$) + 16KB scalar data cache (K$) shared per ~4 CUs, with L2 backing
‒ Each CU has its own registers and local data share
‒ 64 Bytes per clock L1 bandwidth per CU (L1 read/write caches)
‒ 64 Bytes per clock L2 bandwidth per partition (L2 read/write cache partitions)
‒ 64-bit dual channel memory controllers behind the L2 partitions
‒ Global Data Share (64KB) facilitates synchronization between CUs
GCN MEMORY
VECTOR MEMORY INSTRUCTIONS

Vector memory instructions support variable granularity for addresses and data, ranging from 32-bit data to 128-bit pixel quads. A pointer is a pointer on GCN!

MUBUF – read from or write/atomic to an un-typed buffer/address
‒ Data type/size is specified by the instruction operation
‒ MUBUF is like C++ reinterpret_cast

MTBUF – read from or write to a typed buffer/address
‒ Data type is specified in the resource constant
‒ MTBUF is like C++ static_cast

MIMG – read/write/atomic operations on elements from an image surface
‒ Image objects (1-4 dimensional addresses and 1-4 dwords of homogeneous data)
‒ Image objects use resource and sampler constants for access and filtering
‒ Utilize the TMU for filtering via MIMG
GCN MEMORY
DEVICE FLAT MEMORY INSTRUCTIONS
A GCN POINTER IS A POINTER
FLAT
Flat Address Space (“flat”) instructions are new as of Sea Islands (CI) and
allow read/write/atomic access to a generic memory address pointer which
can resolve to any of the following physical memories:
‒ Global Memory
‒ Scratch (“private”)
‒ LDS (“shared”)
‒ Invalid - MEM_VIOL TrapStatus
Device Flat (Generic) 64b/32b Addressing Support
‒ FLAT instructions support both 64 and 32-bit addressing. The address size is set
via a mode register (“PTR32”) and a local copy of the value is stored per wave.
‒ The addresses for the aperture check differ in 32 and 64-bit mode
38 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN MEMORY
EXPORT INSTRUCTION & GDS

Exports move data from 1-4 VGPRs to the fixed-function Graphics Pipeline
‒ E.g.: Color (MRT0-7), Depth, Position, and Parameter data to the Tessellator, Rasterizer, or RBE

Global Shared Memory Ops (utilize the GDS)
‒ The GDS is identical to the LDS, except that it is shared by all CUs, so it acts as an explicit global synchronization point between all wavefronts
‒ The atomic units in the GDS also support ordered count operations
GCN MEMORY
LOCAL DATA SHARE

GCN Local Data Share (LDS) is a 64KB, 32-bank (or 16) Shared Memory
‒ Instruction issue fully decoupled from ALU instructions

Direct Mode
‒ Vector instruction operand: 32/16/8-bit broadcast value
‒ Graphics interpolation @ rate, no bank conflicts

Index Mode – Load/Store/Atomic Operations
‒ Bandwidth amplification: up-to 32 32-bit lanes serviced per clock peak
‒ Direct decoupled return to VGPRs
‒ Hardware conflict detection with auto scheduling

Software consistency/coherency for thread groups via hardware barrier
Fast & low-power vector load return from R/W L1
GCN MEMORY
LOCAL DATA SHARE CONTINUED

An LDS bank is 512 entries, each 32 bits wide
‒ A bank can read and write a 32-bit value across an all-to-all crossbar and swizzle unit that includes 32 atomic integer units
‒ This means that several threads can read the same LDS location at the same time for FREE
‒ Writing to the same address from multiple threads also occurs at rate; the last thread to write wins (useful e.g. so that all threads writing a uniform value is still fast)

Typically, the LDS will coalesce 32 lanes from one SIMD each cycle
‒ One wavefront is serviced completely every 2 cycles
‒ Conflicts are automatically detected across the 32 lanes from a wavefront and resolved in hardware
‒ An instruction which accesses different elements in the same bank takes additional cycles
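The bank rules above can be sketched as a small cost model (Python; a simplified model assuming bank = (byte address / 4) mod 32, with same-address reads broadcast for free):

```python
from collections import Counter

def lds_cycles(addrs):
    """Worst-case serialization for 32 lanes accessing LDS in one cycle.

    addrs: byte addresses accessed by the 32 lanes.
    Distinct addresses that map to the same bank serialize; lanes reading
    the SAME address are satisfied by a single broadcast.
    """
    banks = Counter((a // 4) % 32 for a in addrs)
    return max(
        len({a for a in addrs if (a // 4) % 32 == b}) for b in banks
    )

# Stride-1 float access: one lane per bank, conflict-free.
assert lds_cycles([4 * i for i in range(32)]) == 1
# Stride-32 floats: all 32 lanes hit bank 0, fully serialized.
assert lds_cycles([128 * i for i in range(32)]) == 32
# All lanes read the same address: broadcast, still a single cycle.
assert lds_cycles([0] * 32) == 1
```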
GCN MEMORY
NEW MEMORY OPERATIONS – LOCAL DATA SHARE

Remote Atomic Ops with Shared Memory Dual-Source Operands
‒ LDS[Dst] = LDS[addr0] op LDS[addr1];
‒ Fast remote reduction operations for arithmetic, logical, Min/Max
Read/Write/Conditional Exchange 96b/128b
32-bit FP Min/Max/Compare-Swap
GCN MEMORY
NEW MEMORY OPERATIONS – LOCAL DATA SHARE CONTINUED

Fast Lane Swizzle Operations
‒ Do not require allocation; no shared memory used
‒ Invalid reads result in a 0x0 return
‒ First mode: each four adjacent lanes can fully crossbar data, with the same switch applied to each set of four
‒ Second mode: for each consecutive set of 32 work-items
  ‒ Swap: 16, 8, 4, 2, 1
  ‒ Reverse: 32, 16, 8, 4, 2
  ‒ Broadcast: 32, 16, 8, 4, 2
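The second-mode patterns can be modeled as lane permutations; a sketch (Python; the helpers are illustrative, with `size` being one of the block widths listed above):

```python
def swap(lanes, size):
    # Exchange adjacent blocks of `size` lanes (size is a power of two).
    return [lanes[i ^ size] for i in range(len(lanes))]

def reverse(lanes, size):
    # Reverse lane order within each block of `size` lanes.
    return [lanes[i - 2 * (i % size) + size - 1] for i in range(len(lanes))]

def broadcast(lanes, size):
    # Replicate the first lane of each block to the whole block.
    return [lanes[(i // size) * size] for i in range(len(lanes))]

lanes = list(range(32))                       # one 32-work-item set
assert swap(lanes, 1)[:4] == [1, 0, 3, 2]     # pairwise swap
assert reverse(lanes, 4)[:4] == [3, 2, 1, 0]  # reverse within groups of 4
assert broadcast(lanes, 8)[:9] == [0] * 8 + [8]
```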
GCN MEMORY
LOCAL DATA SHARE – OPERATION DIAGRAMS
[Diagrams: the 4-lane crossbar, and the Swap, Reverse, and Broadcast patterns at sizes 16/8/4/2/1 across lanes 0-63.]
GCN MEMORY
READ/WRITE CACHE

Reads and writes cached
‒ Bandwidth amplification
‒ Improved behavior on more memory access patterns
‒ Improved write-to-read reuse performance
Relaxed consistency memory model
‒ Consistency controls available to control locality of load/store
GPU Coherent
‒ Acquire/Release semantics control data visibility across the machine (GLC bit on load/store)
‒ GCN APUs also have an SLC bit to control data visibility to CPU caches
‒ L2 coherent = all CUs can have the same view of data
Global Atomics
‒ Performed in L2 cache (GDS also has global atomics)
GCN MEMORY
READ/WRITE L1 CACHE ARCHITECTURE

‒ Each CU has its own Vector L1 Data Cache
‒ 16KB L1, 64B lines, 4 sets x 64-way
‒ ~64B/CLK bandwidth per Compute Unit
‒ Write-through – allocate on write (no read) with dirty byte mask
‒ Write-through at end of wavefront
‒ Decompression on cache read-out
‒ Instruction GLC bit defines cache behavior (GCN APUs also have an SLC bit)
  ‒ GLC = 0: local caching (full lines left valid); shader write-back invalidate instructions
  ‒ GLC = 1: globally coherent (hits within wavefront boundaries)
GCN MEMORY
READ/WRITE L2 CACHE ARCHITECTURE

‒ 64-128KB L2 per Memory Controller Channel
‒ Up-to 16 L2 cache partitions
‒ 64B lines, 16-way set associative
‒ ~64B/CLK per channel for L2/L1 bandwidth
‒ Write-back – allocate on write (no read) with dirty byte mask
‒ Acquire/Release semantics control data visibility across CUs
  ‒ L2 coherent = all CUs can have the same view of data
‒ Remote Atomic Operations
  ‒ Common integer set & floating point Min/Max/CmpSwap
GCN MEMORY
BANDWIDTH INFORMATION

‒ Each CU has 64 bytes per cycle of L1 bandwidth
  ‒ Shared with the GDS
‒ Each L2 partition also provides 64 bytes of data per cycle
‒ Peak Scalar L1 Data Cache bandwidth per CU is 16 bytes/cycle
‒ Peak I-Cache bandwidth per CU is 32 bytes/cycle (optimally 8 instructions)
‒ LDS peak bandwidth is 128 bytes of data per cycle via bandwidth amplification
‒ For the R9 290X:
  ‒ That’s nearly 5.5 TB/s of LDS BW, 2.8 TB/s of L1 BW, and 1 TB/s of L2 BW!
  ‒ 512-bit GDDR5 main memory has over 320 GB/sec bandwidth
  ‒ PCI Express 3.0 x16 bus interface to system (32GB/s)
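The aggregate figures fall out of the per-cycle rates above; a back-of-envelope sketch (Python, assuming the 290X's 1 GHz peak engine clock and 16 L2 partitions — the slide's "nearly 5.5 TB/s" suggests a slightly lower sustained clock):

```python
# Aggregate bandwidths for a 44-CU R9 290X at an assumed 1 GHz clock.
clock_hz = 1.0e9
cus = 44

l1_bw  = cus * 64  * clock_hz   # 64 B/cycle of L1 per CU
lds_bw = cus * 128 * clock_hz   # 128 B/cycle of LDS per CU (amplified)
l2_bw  = 16  * 64  * clock_hz   # 16 partitions x 64 B/cycle

TB = 1e12
assert round(l1_bw / TB, 1) == 2.8    # "2.8 TB/s of L1 BW"
assert round(lds_bw / TB, 1) == 5.6   # "nearly 5.5 TB/s" at sustained clocks
assert round(l2_bw / TB, 1) == 1.0    # "1 TB/s of L2 BW"
```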
GCN MEMORY
BANDWIDTH & LATENCY TABLES

Peak bandwidth:
  LDS: 128 bytes / clock
  K$:   16 bytes / clock
  L1:   64 bytes / clock

Latency:
         Resident     | Non-Resident
  LDS:   Short        | N/A
  K$:    Short (1x)   | Medium (10x)
  L1:    Long (20x)   | Long (20x)

Main Takeaways:
‒ LDS is optimized for bandwidth amplification and atomics
‒ K$ is optimized for periodic low-latency reads of small datasets
‒ L1 is optimized for high-bandwidth texture fetches and streaming
GCN MEMORY
L1 TEXTURE CACHE

The memory hierarchy is re-used for graphics
Some dedicated graphics hardware added
‒ The address-gen unit receives 4 texture addresses/clock
‒ Calculates 16 sample addresses (nearest neighbors)
‒ Reads samples from the L1 vector data cache
‒ Decompresses samples in the Texture Mapping Unit (TMU)
‒ The TMU filters adjacent samples, producing <= 4 interpolated texels/clock
‒ TMU output undergoes format conversion and is written into the vector register file
‒ The format conversion hardware is also used for writing certain formats to memory from graphics shaders
GCN MEMORY
X86-64 VIRTUAL MEMORY

The GCN cache hierarchy was designed to integrate with x86-64 microprocessors
The GCN virtual memory system can support 4KB pages
‒ Natural mapping granularity for the x86-64 address space
‒ Paves the way for a shared address space in the future
‒ All GCN hardware can already translate requests into the x86-64 address space
GCN caches use 64B lines, the same size x86-64 processors use (e.g. in an AMD A-Series APU)

The stage is set for heterogeneous systems to transparently share data between the GPU and CPU through the traditional caching system, without explicit programmer control!
GCN COMPUTE ARCHITECTURE
R9 290X

                     AMD Radeon™ HD 7970 GHz Edition | AMD Radeon™ R9 290X      | Increase
Geometry Processing: 2.1 billion primitives/sec      | 4 billion primitives/sec | 1.9x
Compute:             4.3 TFLOPS                      | 5.6 TFLOPS               | 1.3x
Texture fill rate:   134.4 Gtexels/sec               | 176 Gtexels/sec          | 1.3x
Pixel fill rate:     33.6 Gpixels/sec                | 64 Gpixels/sec           | 1.9x
Peak Bandwidth:      264 GB/sec                      | 320 GB/sec               | 1.2x
Die area:            352 mm2                         | 438 mm2                  | 1.24x
Peak GFLOPS/mm2:     12.2                            | 12.8                     | 1.05x
GCN COMPUTE ARCHITECTURE
SHADER ENGINE
A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
Each GCN GPU can contain up-to 4 Shader Engines
‒ Load balanced with each other
‒ Screen partitioning of pixel assignment
A Shader Engine is a high level organizational unit containing:
‒ 1 Geometry Processor (1 Primitive Per Cycle Throughput)
‒ 1 Rasterizer
‒ 1-16 CUs (Compute Units)
‒ Instruction I$ and constant K$ caches shared by up to 4 CUs each
‒ 1-4 RBEs (Render Back Ends)
‒ Up-to 16 64-bit pixels/cycle per Shader Engine
‒ Up-to 8 128-bit pixels/cycle per Shader Engine
54 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE
R9 290X
A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
GRAPHICS CORE NEXT
44 Compute Units
4 Geometry Processors
‒ 4 billion primitives/sec
64 Pixel Output/Clock
‒ 64 Gpixels/sec fill rate
1MB L2 Cache
‒ Up-to 1 TB/sec L2/L1 bandwidth
512-bit GDDR5 memory interface
‒ 320 GB/sec memory bandwidth
6.2 billion transistors
55 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
‒ 438 mm2 on 28nm process node
‒ 12.8 GFLOPS/mm2
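Those last two bullets can be sanity-checked: 44 CUs x 64 lanes x 2 FLOPs (one FMA) at an assumed ~1 GHz engine clock lands on the quoted ~5.6 TFLOPS and ~12.8 GFLOPS/mm2. A quick back-of-envelope check:

```python
# Back-of-envelope check of the R9 290X figures quoted above
# (engine clock assumed ~1 GHz "up to" boost clock).
CUS = 44
LANES_PER_CU = 64          # 4 SIMDs x 16 lanes
FLOPS_PER_LANE = 2         # one FMA = multiply + add
CLOCK_GHZ = 1.0
DIE_AREA_MM2 = 438

gflops = CUS * LANES_PER_CU * FLOPS_PER_LANE * CLOCK_GHZ  # peak GFLOPS
gflops_per_mm2 = gflops / DIE_AREA_MM2
```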
GCN COMPUTE ARCHITECTURE
SEA ISLANDS
A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
GRAPHICS CORE NEXT
8 ASYNCHRONOUS COMPUTE ENGINES (ACE)
‒ Operate in parallel with Graphics CP
‒ Independent scheduling and work item dispatch
for efficient multi-tasking
‒ 9 Devices with 64+ Command Queues!
‒ Fast context switching
‒ Exposed in OpenCL™
Dual DMA engines
‒ Can saturate PCIe 3.0 x16 bus bandwidth (16
GB/sec bidirectional)
56 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE
SEA ISLANDS
A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
GRAPHICS CORE NEXT
ACEs are responsible for compute shader
scheduling & resource allocation
Each ACE fetches commands from cache or
memory & forms task queues
Tasks have a priority level for scheduling
‒ From background to realtime
ACEs dispatch tasks to shader arrays as
resources permit
Tasks complete out-of-order, tracked by ACE
for correctness
Every cycle, an ACE can create a
workgroup and dispatch one wavefront from
the workgroup to the CUs
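A simplified model of that dispatch rate, assuming the 64-thread GCN wavefront size (a sketch of the claim, not hardware behavior):

```python
WAVEFRONT_SIZE = 64  # threads per GCN wavefront

def wavefronts_per_workgroup(threads: int) -> int:
    """A workgroup of N threads is split into ceil(N/64) wavefronts."""
    return -(-threads // WAVEFRONT_SIZE)

def dispatch_cycles(workgroups: list[int]) -> int:
    """Minimum ACE cycles to dispatch all workgroups, at one
    wavefront per cycle (the simplified per-cycle model above)."""
    return sum(wavefronts_per_workgroup(t) for t in workgroups)
```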
57 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE
SEA ISLANDS
A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
GRAPHICS CORE NEXT
ACEs are independent
‒ But they can synchronize and communicate via Cache/Memory/GDS
ACEs can form task graphs
‒ Individual tasks can have dependencies on one another
‒ Can depend on another ACE
‒ Can depend on part of the graphics pipe
ACEs can control task switching
‒ Stop and start tasks, and dispatch work to shader engines
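The dependency behavior can be sketched as a tiny topological scheduler (illustrative Python, not an ACE API):

```python
# Minimal sketch of an ACE-style task graph: a task runs only after
# all of its dependencies have completed. Names are illustrative.
def run_task_graph(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid completion order for tasks with dependencies."""
    done, order = set(), []
    pending = dict(deps)
    while pending:
        ready = [t for t, d in pending.items() if d <= done]
        if not ready:
            raise ValueError("cycle in task graph")
        for t in sorted(ready):      # deterministic order for the sketch
            order.append(t)
            done.add(t)
            del pending[t]
    return order
```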
58 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE
SEA ISLANDS
A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
GRAPHICS CORE NEXT
Focus in GPU hardware shifting away
from graphics-specific units, towards
general-purpose compute units
R9 290x GCN-based ASICs already
have 8:1 ACE : CP ratio
‒ CP can dispatch compute
‒ ACE cannot dispatch graphics
If you aren’t writing Compute
Shaders, you’re not getting the absolute
most out of modern GPUs
‒ Control: LDS, barriers, thread layout, ...
59 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE
SEA ISLANDS
A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
GRAPHICS CORE NEXT
Future Trends:
More Compute Units
‒ ALU outpaces Bandwidth
CPU + GPU Flat Memory
‒ APU + dGPU
Less Fixed Function Graphics
‒ Can you write a Compute-based
graphics pipeline?
‒ Start thinking about it…
60 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE
GEOMETRY
[Diagram: 4 Geometry Processors, each containing a Geometry Assembler, Tessellator, and Vertex Assembler]
Updated Hardware Geometry Units
– Off-chip buffering improvements
– Larger parameter and position cache
[Images: Tessellation OFF vs. ON]
GS + Tessellation is faster than before…
However… memory is still the bottleneck!
– Minimize the number of inputs and
outputs for best performance…
Small expansions can be done within LDS!
61 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Image from Battlefield 3, EA DICE
Process and rasterize up to 4 primitives per clock cycle
GCN FIXED FUNCTION ARCHITECTURE
RASTERIZER
We now have 4 Rasterizers on R9 290x (4 triangles x 16 pixels = 64 pixels per clock)
‒ Each rasterizer can read in a single triangle per cycle, and write out 16 pixels
Caveat: tiny (e.g. sub-pixel) triangles can dramatically reduce efficiency
This can cause us to become raster-bound, starving the shader and holding up geometry!
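A rough efficiency model, assuming at most one triangle scanned per clock against a 16 pixel/clock rasterizer:

```python
RASTER_WIDTH = 16  # pixels one GCN rasterizer can emit per clock

def raster_efficiency(pixels_per_triangle: float) -> float:
    """Fraction of the 16 pixel/clock raster rate actually used when
    each triangle covers the given number of pixels, given that at
    most one triangle is scanned per clock (simplified model)."""
    return min(pixels_per_triangle, RASTER_WIDTH) / RASTER_WIDTH
```

A sub-pixel triangle wastes 15 of the 16 lanes every clock, which is exactly the 6.25% worst case on the slide.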
[Illustration: 28 pixels rasterized in 2 clocks (one 16-pixel clock at 100% efficiency plus one 12-pixel clock at 75% efficiency) vs. 3 sub-pixel triangles taking 3 clocks for 3 pixels (1 pixel per clock, 6.25% efficiency)]
62 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE
TESSELLATION + RASTERIZER EFFICIENCY
[Illustration: as tessellated triangles shrink, rasterizer throughput falls from ~13 pixels/clock (75-90% efficiency) to ~4 pixels/clock (18-25% efficiency) to 1 pixel/clock (6.25% efficiency)]
Over-Tessellation
Reduces rasterizer efficiency
‒ Extreme Tessellation = 6.25% Efficiency
Also impacts ROPs and MSAA efficiency
‒ High number of polygon edges to AA
  ‒ Consumes dramatically more bandwidth
‒ If nFragments > nSamples, quality will be lost
  ‒ E.g. 16 verts affecting 1 pixel @ 8xMSAA
63 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE
Over-Tessellation
Reduces shader efficiency
HS, DS and VS run many times
for each final image pixel
‒ Yet don’t contribute much
to final image quality
The graphics pipeline is not
designed for this abuse!
TESSELLATION + SHADING EFFICIENCY
[Chart: shading passes per pixel (overshade), scale 1-8]
Consider Alternatives:
‒ Parallax Occlusion Mapping
‒ […]
Image courtesy: Kayvon Fatahalian
“Evolving the Direct3D Pipeline for Real-time Micropolygon Rendering,”
From ACM SIGGRAPH 2010 course: “Beyond Programmable Shading II”
64 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN Tessellation – Best Practices
While performance is much improved, it is still a potential bottleneck!
‒ Produces a great deal of IO traffic, starving other parts of the pipeline
Best performance is generally achieved with tessellation factors less than 15!
Continue to optimize:
‒ Pre-triangulate
‒ Distance-adaptive
‒ Screen-space adaptive
‒ Orientation-adaptive
‒ Backface Culling
‒ Frustum Culling
‒ […]
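Distance-adaptive tessellation, for instance, can be as simple as a clamped falloff. A sketch with hypothetical tuning values (near, far, max_factor):

```python
def distance_adaptive_tess_factor(distance: float,
                                  near: float = 5.0,
                                  far: float = 100.0,
                                  max_factor: float = 15.0) -> float:
    """Scale the tessellation factor down with distance, clamped to
    [1, max_factor]. near/far/max_factor are illustrative tuning
    values; max_factor stays under the ~15 guideline above."""
    if distance <= near:
        return max_factor
    if distance >= far:
        return 1.0
    t = (distance - near) / (far - near)   # 0..1 across the falloff range
    return max_factor + t * (1.0 - max_factor)
```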
65 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
[Images: Tessellation OFF vs. ON]
GCN FIXED FUNCTION ARCHITECTURE
RASTERIZER
We now have 4 Geometry Processors on R9 290x
‒ Overall Primitive Rate = 4 prims per clock (ideal)
We now have 4 Rasterizers on R9 290x (4 triangles x 16 pixels = 64 pixels per clock)
‒ Each rasterizer can read in a single triangle per cycle, and write out 16 pixels
Caveat: tiny (e.g. sub-pixel) triangles can dramatically reduce efficiency
This can cause us to become raster-bound, unable to rasterize at peak-rate!
[Block diagram: the Command Processor feeds 4 Geometry Processors (each with Geometry Assembler, Tessellator, and Vertex Assembler) and the Compute Units; 4 Rasterizers (each with Scan Converter and Hierarchical Z) feed the Render Back-Ends]
66 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE
RENDER BACK ENDS
Once the pixel fragments in a tile have been shaded, they flow to the Render Back-Ends (RBEs)
[Diagram: Z/Stencil ROPs and Color ROPs, backed by the Depth Cache and Color Cache]
‒ 16KB Color Cache
‒ Up to 8 color + 16 coverage samples (16x EQAA)
‒ 8KB Depth Cache
‒ Up to 8 depth samples (8x MSAA)
‒ Writes un-cached via memory controllers
‒ 64 64-bit pixels per cycle
‒ 256 Depth Test (Z) / Stencil Ops per cycle
Logic Operations as alternative to Blending
‒Exposed in Direct3D 11.1
‒Also available in OpenGL
Dual-Source Color Blending with MRTs
‒Only available in OpenGL
* There are 16 RBEs on R9 290x
67 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE
DEPTH IMPROVEMENTS
24-BIT DEPTH FORMATS ARE INTERNALLY REPRESENTED AS 32-BITS
Fast-accept of fully-visible triangles spanning one or more tiles
If a triangle is fully covering a tile, then cost is only 1 clock/tile
Depth Bounds Test (DBT) Extension
‒Exposed in OpenGL via GL_EXT_depth_bounds_test
‒Exposed in Direct3D 11 via extension
68 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE
STENCIL IMPROVEMENTS
GCN has support for new extended stencil ops
‒Only available in OpenGL:
GL_AMD_stencil_operation_extended
‒Additional stencil ops:
‒AND, XOR, NOR
‒REPLACE_VALUE_AMD
‒etc.
‒ Also exposes additional stencil op source value
‒ Can be used as an alternative to stencil ref value
Stencil ref and op source value can now be exported from pixel shader
‒Only available in OpenGL: GL_AMD_shader_stencil_value_export
69 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN LOW-LEVEL TIPS
GPR PRESSURE
GPRs and GPR Pressure
Banks of GCN Vector GPRs (Illustration)
General Purpose Registers (GPR) are a limited resource
‒ Separate banks of GPRs for Vector and Scalar (per SIMD)
‒ Maximum of 256 VGPRs and 512 SGPRs shared across all waves (up-to 10) owned by a SIMD
‒ Organized as 64 words of 32-bits – two adjacent GPRs can be combined for 64-bit (4 for 128-bit)
‒ Number of GPRs required by a shader affects SIMD scheduling and execution efficiency
‒ Shader tools can be used to determine how many GPRs are used…
GPR pressure is affected by:
‒ Loop Unrolling
‒ Long lifetime of temporary variables
‒ Nested Dynamic Flow Control instructions
‒ Fetch dependencies (e.g. indexed constants)
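The VGPR limit translates directly into occupancy. A simplified calculator, ignoring the SGPR and LDS limits and the hardware's VGPR allocation granularity:

```python
MAX_VGPRS = 256   # vector GPRs available per SIMD
MAX_WAVES = 10    # wavefront slots per SIMD

def waves_per_simd(vgprs_per_wave: int) -> int:
    """Wavefronts a SIMD can keep resident given each wave's VGPR
    allocation (simplified: ignores SGPR/LDS limits and allocation
    granularity)."""
    return min(MAX_WAVES, MAX_VGPRS // vgprs_per_wave)
```

Note the cliff: going from 24 to 25 VGPRs, or 32 to 33, drops a whole wavefront slot, which is why shader tools report exact GPR counts.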
70 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN LOW-LEVEL TIPS
TEXTURE FILTERING
‒Point sampling is full-rate on all formats
‒Trilinear filtering costs up to 2x bilinear filtering cost
‒Anisotropic (N taps) costs <= (N x bilinear)
‒Avoid cache thrashing!
‒Use MIPmapping
‒Use Gather() where applicable
‒Exploit neighbouring pixel shader thread/CU locality:
‒ Sampling from texels resident on the same CU can have a lower cost
‒Exploit this explicitly by using Compute Shaders
71 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN LOW-LEVEL TIPS
COLOR OUTPUT
PS Output: Each additional color output increases export cost
Exporting can be more costly than PS execution!
‒ Each (fast) export is equivalent to 64 ALU ops on R9 290X
‒ If shader is export-bound then use “free” ALU for packing instead
Watch out for export-bound cases
‒ E.g. G-Buffer parameter writes
‒ MINIMIZE SHADER INPUTS AND OUTPUTS!
‒ Pack, pack, pack, pack!
[Table: costs of outputting and blending various formats]
‒discard/clip allow the shader hardware to skip the rest of the work
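Packing with "free" ALU can be as simple as squeezing two values into one 32-bit export as halves. A CPU-side Python illustration of the idea (a shader would use f32tof16-style pack instructions instead):

```python
import struct

def pack_half2(a: float, b: float) -> int:
    """Pack two values as 16-bit halves into one 32-bit word - the
    kind of ALU-side packing that can halve export/G-buffer traffic."""
    return int.from_bytes(struct.pack('<2e', a, b), 'little')

def unpack_half2(word: int) -> tuple[float, float]:
    """Recover the two half-precision values from a packed word."""
    a, b = struct.unpack('<2e', word.to_bytes(4, 'little'))
    return a, b
```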
* Miss “PACK” Man kindly reminds you to “Pack pack pack!”
72 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN MEDIA PROCESSING
MEDIA INSTRUCTIONS
SAD = Sum of Absolute Differences
Closest match
Critical to video & image processing algorithms
‒ Motion detection
‒ Gesture recognition
‒ Video & image search
‒ Stereo depth extraction
‒ Computer vision
SAD (4x1) and QSAD (4 4x1) instructions
‒ New QSAD combines SAD with alignment ops for higher
performance and reduced power draw
‒ Evaluate up to 256 pixels per CU per clock cycle!
Maskable MQSAD instruction
‒ Allows background pixels to be ignored
‒ Accelerated isolation of moving objects
New: 32-bit destination accumulator register
‒ SAD/QSAD/MQSAD U32/U16 accumulators with saturation
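The operation itself is trivial to express; the win is that GCN evaluates it in bulk per clock. A reference Python version of SAD, plus a masked (MQSAD-style) variant:

```python
def sad(a: list[int], b: list[int]) -> int:
    """Sum of absolute differences between two equal-length pixel
    rows - the operation the SAD/QSAD instructions evaluate in
    hardware. Lower SAD = closer match."""
    return sum(abs(x - y) for x, y in zip(a, b))

def masked_sad(a: list[int], b: list[int], mask: list[int]) -> int:
    """MQSAD-style SAD that ignores masked-off (background) pixels."""
    return sum(abs(x - y) for x, y, m in zip(a, b, mask) if m)
```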
73 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
[Illustration: 4x1 SAD examples comparing a reference row against candidate pixel rows; the closest match yields the lowest SAD]
AMD Radeon R9 290x can evaluate 11.26 Terapixels/sec *
* Peak theoretical performance for 8-bit integer pixels
GCN MEDIA PROCESSING
VIDEO CODEC ENGINE
Video Codec Engine (VCE)
‒ Hardware H.264 Compression and Decompression
‒ Ultra-low-power, fully fixed-function mode
‒ Capable of 1080p @ 60 frames / second
‒ Programmable for Ultra High Quality and/or Speed
‒ Entropy encoding block fully accessible to software
‒ AMD Accelerated Parallel Programming SDK
‒ OpenCL ™
‒ Create hybrid faster-than-real-time encoders!
‒ Custom motion estimation
‒ Inverse DCT and motion compensation
‒ Combine with hardware entropy encoding!
AMD Radeon R9 290x can compress 1080p H.264 faster than realtime
74 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN MEDIA PROCESSING
AMD TRUEAUDIO
Multiple integrated Tensilica HiFi EP Audio DSP cores
Dedicated Audio DSP solution for game sound effects
Guaranteed real-time performance and service
Designed for game audio artists and engineers to take their artistic vision
beyond sound production into the realm of sound processing
Intended to transform game audio as programmable shaders transformed graphics
75 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN MEDIA PROCESSING
AMD TRUEAUDIO
SPATIALIZATION / 3D AUDIO
REVERBS
AUDIO/VOICE STREAMS
MASTERING LIMITERS
76 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
HEAR MORE REALTIME
VOICES AND CHANNELS
IN A GAME
77 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
CONCLUSIONS
GCN ARCHITECTURE TAKEAWAYS
‒GCN offers increased flexibility & efficiency, with reduced complexity!
‒Non-VLIW Architecture improves efficiency while reducing programmer burden
‒Constants/resources are just address + offset now in the hardware
‒UAV/SRV/SUV read/write any format – like CPU C++ reinterpret_cast & static_cast
‒Has virtual memory & GPU flat memory, moving towards CPU + GPU flat memory
‒GCN is designed with a forward-looking focus on Compute
‒Scalar unit for complex dynamic control flow + branch & message unit
‒64KB LDS/CU, 64KB GDS, atomics at every stage, coherent cache hierarchy
‒8 Asynchronous Compute Engines (ACE) for multitasking compute
‒ 8 ACE x 8 HQD (per ACE) = 64 HQD (HQD = Hardware Queue Descriptors)
79 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
CONCLUSIONS
GCN ARCHITECTURE TAKEAWAYS
CONTINUED …
‒GCN generally simplifies your life as a programmer
‒Don’t: fret too much about instruction grouping, or vectorization
‒Do: Think about GPR utilization & LDS usage (impacts max # of wavefronts)
‒Do: Think about thread/CU locality when you structure your algorithm
‒Do: Exploit the low-latency 4-CU Shared 16KB Scalar L1 Data Cache (K$)
‒Do: Pack shader inputs and outputs – aim to be IO/bandwidth thin!
‒ Pack PS exports into non-blended 64-bit format for optimal ROP utilization
‒ But, remember that 32-bit formats still use less bandwidth
‒ Keep geometry (HS, VS, GS, DS) stage IO under 4 float4 (ideally less!)
‒Unlimited number of addressable constants/resources
‒N constants aren’t free anymore – each consumes resources, use sparingly!
‒Compute is the future – exploit its power for GPGPU work & graphics!
80 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – INTRODUCTION
Enables application to manage more texture data than can physically fit in a fixed footprint
‒ Known as: Tiled Resources (Direct3D 11.2) and Partially Resident Textures (OpenGL 4.2)
‒ A.k.a. “Virtual texturing“ and “Sparse texturing”
The principle behind PRT is that not all texture contents are likely to be needed at any given time
‒ Current render view may only require selected portions of a texture to be resident in memory
‒ Or, only selected MIPMap levels…
PRT textures only have a portion of their data mapped into GPU-accessible memory at a time
‒ Texture data can be streamed in on-demand
‒ Texture sizes up-to 32TB (16k x 16k x 8k x 128-bit)
OpenGL extension – GL_AMD_sparse_texture
85 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – TEXTURE TILES
The PRT texture is chunked into 64KB tiles
‒ Fixed memory size
‒ Not dependent on texture type or format
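Since tiles are a fixed 64KB, the tile count for a mip level follows directly from its byte size. A simplified sketch (the real tile dimensions vary with the format's texel size, which this ignores):

```python
TILE_BYTES = 64 * 1024  # every PRT tile is 64KB, regardless of format

def tiles_for_mip(width: int, height: int, bytes_per_texel: float) -> int:
    """64KB tiles needed to fully map one mip level (simplified:
    assumes levels pack densely into whole tiles)."""
    total = width * height * bytes_per_texel
    return max(1, -int(-total // TILE_BYTES))
```

So a 16k x 16k 32-bit mip 0 alone needs 16384 tiles (1 GB) if fully resident, which is exactly why only a subset is ever mapped.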
Highlighted areas represent
texture data that needs highest
resolution
Chunked texture
Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
86 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Texture tiles needing to be
resident in GPU memory
Tiled Resources & Partially Resident Textures – TRANSLATION TABLE
The GPU virtual memory page table translates 64KB tiles into a resident texture tile pool
[Diagram: pages of the Texture Map are translated through the Page Table into a linearly-stored Texture Tile Pool (video memory); legend: 64KB tile, mapped vs. unmapped page entries]
Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
87 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – MIP MAPS
Not all tiles from the texture map are actually resident in video memory
PRT hardware page table stores virtual-to-physical mappings
[Diagram: MIP levels of the Texture Map are mapped through the Page Table]
Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
88 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
[Diagram: Texture Tile Pool (video memory); legend: 64KB tile, mapped vs. unmapped page entries]
Tiled Resources & Partially Resident Textures – TILE MANAGEMENT
The Application is responsible for uploading/releasing new PRT tiles!
A common scenario is to upload lower MIPMaps to texture tile pool
‒ This allows a full representation of the PRT contents to be resident in memory (albeit at
lower resolution)
‒ e.g. MIP LOD 6 and above for a 16k x 16k 32-bit texture is about 650KB (256x256 resolution)
Texture tiles corresponding to higher resolution areas are uploaded by the application
as needed
‒ e.g. As camera gets closer to a PRT-textured polygon the requirement for texels:screen
pixels ratio increases, thus higher LOD tiles need uploading
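At 64KB tile granularity, the resident-footprint arithmetic looks roughly like this (a simplification: it lands in the same ballpark as the ~650KB figure above, with the exact number depending on how the smallest mip levels are packed):

```python
TILE_BYTES = 64 * 1024  # fixed PRT tile size

def resident_mip_footprint(base: int, first_lod: int,
                           bytes_per_texel: int = 4) -> int:
    """Bytes of tile-pool memory to keep mips first_lod and smaller
    of a square base x base texture resident, assuming each mip
    level occupies whole 64KB tiles (a simplification)."""
    total, size = 0, base >> first_lod
    while size >= 1:
        level_bytes = size * size * bytes_per_texel
        tiles = max(1, -(-level_bytes // TILE_BYTES))
        total += tiles * TILE_BYTES
        size >>= 1
    return total
```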
89 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – “FAILED” FETCH
How does the application know which texture tiles to upload?
Answer: PRT-specific texture fetch instructions in pixel shader
‒ Return a “Failed” texel fetch condition when sampling a PRT pixel whose tile is currently not
in the pool
‒ OpenGL example:
int glSparseTexture( gsampler2D sampler, vec2 P, inout gvec4 texel );
This information is then stored in render target or UAV
‒ Texel fetch failed for a given (x, y) tile location
...and then copied to the CPU so that application can upload required tiles
App chooses what to render until missing data gets uploaded
90 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – “LOD WARNING”
PRT fetch condition code can also indicate an “LOD Warning”
The minimum LOD warning is specified by the application on a per texture basis
‒ OpenGL example:
glTexParameteri( <target>, MIN_WARNING_LOD_AMD, <LOD warning value> );
If a fetched pixel’s LOD is < the specified LOD warning value, then the condition code is returned
This functionality is typically used to try to predict when higher-resolution MIP levels will be needed
‒ E.g. Camera getting closer to PRT-mapped geometry
91 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – EXAMPLE USAGE
1. App allocates PRT (e.g. 16kx16k DXT1) using PRT API
2. App uploads MIP levels using API calls
3. Shader fetches PRT data at specified texcoords
Two possibilities:
3.a. Texel data belongs to a resident (64KB) tile
- Valid color returned, no error code
3.b. Texel data points to a non-resident tile, or its LOD is below the specified warning LOD
- Error/LOD Warning code returned
- Shader writes tile location and error code to RT or UAV
4. App reads RT or UAV and upload/release new tiles as needed
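The four steps above can be sketched end-to-end (pure Python standing in for the GPU side, just to make the control flow visible; all names are illustrative, not a real API):

```python
# Schematic of the PRT feedback loop: sample -> record failures ->
# upload missing tiles -> resample next frame.
def sample(residency: set, tile: tuple):
    """Step 3: a fetch either returns data or reports a failed tile."""
    if tile in residency:
        return ("color", tile)       # 3.a: tile resident, valid data
    return ("fail", tile)            # 3.b: non-resident, error code

def frame(residency: set, requested_tiles: list) -> list:
    """One frame: collect failed fetches (as a shader would write
    them to an RT/UAV), then 'upload' the missing tiles (step 4)."""
    failed = [t for t in requested_tiles
              if sample(residency, t)[0] == "fail"]
    residency.update(failed)         # app maps the missing tiles
    return failed
```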
92 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures –
TYPES, FORMATS & DIMENSIONS
All texture types and formats supported
‒1D, 2D, cube, arrays and 3D volume textures
‒All common texture formats
‒ Including compressed formats
‒Maximum dimensions:
‒16k x 16k x 8k x 128-bit textures
93 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Hardware PRT > Software Implementation
PRT:
• Ease of implementation
  • Complexity hidden behind HW & API
• Full filtering support
  • Includes anisotropic filtering
• Full-speed filtering
SW Implementation:
• Requires “manual” filtering
• Software anisotropic is very costly
Don’t go overboard with PRT allocation!
• Page table entries are 4 DWORDs each
• Page tables have to be resident in video memory
94 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
This is now our next era – we simply called it Graphics Core Next. From a graphics standpoint, it delivers cutting-edge features and performance while still being very flexible and scalable, allowing all our Southern Islands parts to leverage the core. GCN delivers an amazing step up in terms of heterogeneous computing – both in terms of a new, simpler and more powerful programming model, and in terms of sheer efficiency and performance.
Prior to 2002 – Graphics-specific hardware: texture mapping/filtering, geometry processing, rasterization, dedicated texture and pixel caches. Dot product and scalar multiply-add sufficient for basic graphics tasks. No general purpose compute capability.
2002-2006 – Graphics-focused programmability: DirectX 8/9, floating point processing (IEEE compliance not required), specialized ALUs for vertex & pixel processing, limited shaders, more dedicated caches (vertex, texture, color, depth).
2007 to present – Unified shader architectures: VLIW5 (flexible and optimized for graphics workloads), VLIW4 (simplified and optimized for more general workloads). More advanced caching: instruction, constant, multi-level texture/data, local/global data shares. Basic general purpose compute (CAL, Brook, ATI Stream) with IEEE-compliant floating point math. Graphics performance still the primary objective.
Our VLIW4 and VLIW5 architecture is powerful and continues in our products, but it’s certainly not the easiest to program for general-purpose workloads. The new design offers the same amount of ALU, but the scalar-style programming removes all the register and instruction dependencies we had. Chained multiplies, for example, work at peak efficiency, vs. 1/4 rate on HD6900. The port simplification that comes from removing the VLIW makes each instruction simple and easy to compile for. The tool chain to cater to this architecture is massively simplified and can be made much more robust; as well, performance tuning is easier. Finally, this core supports advanced debug features, such as breakpoints and single stepping, that allow for much deeper debug capabilities.
So what is mantle?
This is now our next era – we simply called it Graphics core next. From a graphics standpoint, delivers cutting edge features and performance, while still being very flexible and scalable, allowing for all our southern islands parts to leverage the core. GCN, delivers an amazing step up in terms of heterogeneous computing – both in terms of a new simpler and more powerful programming model, but also in terms of sheer efficiency and performance.
This is now our next era – we simply called it Graphics core next. From a graphics standpoint, delivers cutting edge features and performance, while still being very flexible and scalable, allowing for all our southern islands parts to leverage the core. GCN, delivers an amazing step up in terms of heterogeneous computing – both in terms of a new simpler and more powerful programming model, but also in terms of sheer efficiency and performance.
This is now our next era – we simply called it Graphics core next. From a graphics standpoint, delivers cutting edge features and performance, while still being very flexible and scalable, allowing for all our southern islands parts to leverage the core. GCN, delivers an amazing step up in terms of heterogeneous computing – both in terms of a new simpler and more powerful programming model, but also in terms of sheer efficiency and performance.
This is now our next era – we simply called it Graphics core next. From a graphics standpoint, delivers cutting edge features and performance, while still being very flexible and scalable, allowing for all our southern islands parts to leverage the core. GCN, delivers an amazing step up in terms of heterogeneous computing – both in terms of a new simpler and more powerful programming model, but also in terms of sheer efficiency and performance.
Purple: vector instructions. Blue: scalar instructions.
EXEC = execution mask register; its 64 bits define which threads of the wavefront (64 threads) do the work. It is already set at shader input (e.g. set so that only rasterized pixels within a primitive are processed).
VCC = Vector Condition Code register; holds the per-lane result of a vector compare instruction, and can be used to update EXEC.
SCC = Scalar Condition Code; a single bit produced by scalar instructions, used for scalar branching.
Shader code is visible in GPU ShaderAnalyzer to allow optimizations.
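The EXEC/VCC interplay above can be sketched as a small software model. This is a hedged illustration only: the function names and the usage values are made up for the example, not actual GCN ISA mnemonics.

```python
# Hedged sketch: a 64-bit EXEC mask predicating a vector instruction
# across a 64-thread wavefront. Names are illustrative, not real ISA.
WAVEFRONT_SIZE = 64

def v_add(exec_mask, dst, src_a, src_b):
    """Per-lane add, executed only for lanes whose EXEC bit is set."""
    for lane in range(WAVEFRONT_SIZE):
        if (exec_mask >> lane) & 1:
            dst[lane] = src_a[lane] + src_b[lane]
    return dst

def v_cmp_gt(src_a, src_b):
    """Per-lane compare; the per-lane result bits form the 64-bit VCC."""
    vcc = 0
    for lane in range(WAVEFRONT_SIZE):
        if src_a[lane] > src_b[lane]:
            vcc |= 1 << lane
    return vcc

# Usage: lanes 0..31 start active (e.g. only those pixels were rasterized);
# the compare then narrows EXEC to lanes 16..31.
a = list(range(64))
b = [15] * 64
exec_mask = (1 << 32) - 1   # lanes 0..31 active at shader input
vcc = v_cmp_gt(a, b)        # lanes 16..63 pass the compare
exec_mask &= vcc            # active lanes are now 16..31
```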
The new cache hierarchy was shown at AFDS; this core implements the first version of it. It is a full two-level read/write cache, with 16 KB of L1 per CU and 64 KB per L2 partition. Each CU has 64 bytes per cycle of L1 bandwidth, shared with the global data share (a local buffer for sharing data between wavefronts). Each L2 partition also delivers 64 bytes per cycle. That is nearly 2 TB/s of aggregate L1 bandwidth and roughly 700 GB/s of L2 bandwidth. Nice! Each group of four CUs shares a 32 KB instruction cache and a 16 KB scalar data cache. Coherency is handled at the L2 level, with applications able to keep the physical L2s updated directly with their L1s. Never settle for enough cache bandwidth!
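As a back-of-the-envelope check on the "nearly 2 TB/s" figure: aggregate L1 bandwidth is just CU count times the per-CU port width times the clock. The CU count and clock below are assumptions (roughly a Tahiti-class part); the slide states only the 64 bytes/cycle port and the headline total.

```python
# Hedged arithmetic behind the aggregate L1 bandwidth claim.
BYTES_PER_CYCLE = 64        # per-CU L1 port width, per the notes
num_cus = 32                # assumption: Tahiti-class configuration
clock_hz = 925e6            # assumption: ~925 MHz engine clock

l1_bw = num_cus * BYTES_PER_CYCLE * clock_hz   # bytes per second
print(l1_bw / 1e12)         # ~1.89, i.e. "nearly 2 TB/s"
```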
FLAT instruction fields (field name, width in bits, description):
‒ ADDR (8): VGPR which holds the address. For 64-bit addresses, ADDR has the LSBs and ADDR+1 has the MSBs.
‒ DATA (8): VGPR which holds the first DWORD of data. Instructions can use 0-4 DWORDs.
‒ VDST (8): VGPR destination for data returned to the shader, either from loads or from atomics with GLC=1 (return pre-op value).
‒ SLC (1): System Level Coherent. Used in conjunction with GLC and MTYPE to determine cache policies.
‒ GLC (1): Global Level Coherent. For atomics, GLC=1 means return the pre-op value; GLC=0 means do not return it.
‒ TFE (1): Texel Fail Enable for PRT (Partially Resident Textures). When set, a fetch may return a NACK, which causes a VGPR write into DST+1 (the first GPR after all fetch-dest GPRs).
‒ (M0) (32): Implied use of M0. M0[16:0] contains the byte size of the LDS segment; this is used to clamp the final address.
Opcodes:
‒ Loads: FLAT_LOAD_UBYTE, FLAT_LOAD_SBYTE, FLAT_LOAD_USHORT, FLAT_LOAD_SSHORT, FLAT_LOAD_DWORD, FLAT_LOAD_DWORDX2, FLAT_LOAD_DWORDX3, FLAT_LOAD_DWORDX4
‒ Stores: FLAT_STORE_BYTE, FLAT_STORE_SHORT, FLAT_STORE_DWORD, FLAT_STORE_DWORDX2, FLAT_STORE_DWORDX3, FLAT_STORE_DWORDX4
‒ Atomics (each also in an _X2 variant): FLAT_ATOMIC_SWAP, FLAT_ATOMIC_CMPSWAP, FLAT_ATOMIC_ADD, FLAT_ATOMIC_SUB, FLAT_ATOMIC_SMIN, FLAT_ATOMIC_UMIN, FLAT_ATOMIC_SMAX, FLAT_ATOMIC_UMAX, FLAT_ATOMIC_AND, FLAT_ATOMIC_OR, FLAT_ATOMIC_XOR, FLAT_ATOMIC_INC, FLAT_ATOMIC_DEC, FLAT_ATOMIC_FCMPSWAP, FLAT_ATOMIC_FMIN, FLAT_ATOMIC_FMAX
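The GLC=1 "return pre-op value" behavior on atomics is classic fetch-and-op semantics: the destination register receives the memory value from before the operation. A minimal software model, with a made-up memory dict and helper name purely for illustration:

```python
# Hedged sketch of GLC semantics on a FLAT atomic add.
# 'memory' and 'flat_atomic_add' are illustrative, not real APIs.
memory = {0x1000: 7}

def flat_atomic_add(addr, value, glc):
    """Atomically add 'value' at 'addr'.

    With glc=1 the pre-op value is returned to VDST;
    with glc=0 nothing is returned.
    """
    old = memory[addr]
    memory[addr] = old + value
    return old if glc else None

vdst = flat_atomic_add(0x1000, 5, glc=1)  # VDST receives the old value
# memory[0x1000] now holds the post-op value
```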
Some stats to illustrate a 20-90% improvement in key metrics for a 24% increase in area.
The hardware team has redesigned the GDDR5 memory interface to be smaller and more power efficient. The resulting 512-bit interface and controllers are 20% smaller than the 384-bit interface they replace. The target frequency yields a 20% increase in total accessible bandwidth, for a 50% increase in bandwidth per mm². World-class IP.
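The "50% increase in bandwidth per mm²" follows directly from the two stated ratios: 20% more bandwidth delivered from 20% less area. Worked out:

```python
# Arithmetic behind the bandwidth-per-area claim in the notes.
area_ratio = 0.80        # new 512-bit PHY is 20% smaller
bandwidth_ratio = 1.20   # 20% more total accessible bandwidth

bw_per_mm2_gain = bandwidth_ratio / area_ratio - 1.0
print(round(bw_per_mm2_gain * 100))   # 1.2 / 0.8 = 1.5x, i.e. +50%
```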
The R9 290 is the first GCN device to scale to four primitives per clock. Interstage parameter and position storage is provided on chip to enable the necessary in-flight overlap. Each geometry engine provides surface, tessellation, geometry, and vertex management, plus output primitive filtering, to drive the four partitioned rasterizers efficiently. For low-to-mid levels of amplification, the geometry stage adds a driver/compiler-controlled mode that retains interstage data in shared memory to reduce external bandwidth requirements and latency effects, which can as much as double performance in some scenarios. Finally, for tessellation, improvements have been made to staging storage and control to improve overall performance.
I stated earlier that we have our next-generation geometry engines, two of them in here. This latest generation also improves significantly on both tessellation and geometry buffer performance. Many changes went in to make this happen; the biggest are listed here. This allows us to reach up to 4x the performance of our previous HD 6900 series architecture. Let's see it.
Pre-tessellate as needed in order to avoid higher tessellation factors.
The R9 290 series provides a massive 64-pixel rasterization capability, with 256 pixels of depth and stencil testing per clock. The render back-end units can drive color writes and blending operations for up to 64 surviving pixels per clock. This capability moves the bottleneck from pixel fill to bandwidth in some scenarios.
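Why the bottleneck can shift from fill to bandwidth: 64 color writes per clock already generates memory traffic on the order of the card's total bandwidth, before counting blend reads, depth, or texturing. The clock and surface format below are illustrative assumptions, not figures from the notes.

```python
# Hedged check on the fill-vs-bandwidth claim.
pixels_per_clock = 64          # RBE color write rate, per the notes
bytes_per_pixel = 4            # assumption: RGBA8 color target
clock_hz = 950e6               # assumption: ~950 MHz engine clock

color_write_bw = pixels_per_clock * bytes_per_pixel * clock_hz
print(color_write_bw / 1e9)    # ~243 GB/s of color-write traffic alone
```

With blending, each pixel also needs a destination read, roughly doubling that traffic, which comfortably exceeds a typical GDDR5 board's total memory bandwidth; hence the fill rate can outrun the memory system.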
Present TrueAudio as the solution to the limitations imposed by today's PC audio solutions. Emphasize real-time processing and programmability.
SPATIALIZATION / 3D AUDIO ‒ surround sound with stereo gaming headsets; know exactly where the enemy is.
REVERBS ‒ more realistic sound environments.
AUDIO/VOICE STREAMS ‒ fuller sound for games with many scene objects.
MASTERING LIMITERS ‒ reduce developer workload with real-time limiters.
Some immediate benefits of TrueAudio: it enables you to hear hundreds more real-time voices and audio channels in your game than is possible on CPUs today.
AMD is working with audio plugin developers such as GenAudio to provide an immersive audio experience when integrated into games. Gamers who use stereo headsets (through either USB or audio jacks) will enjoy virtual surround sound, accelerated by AMD TrueAudio technology. This level of integration leads to accurate three-dimensional audio, since position data is extracted directly from the game, whereas headsets with virtual surround sound capability use simple audio expansion algorithms with no knowledge of the game's environment.
That simplicity has attracted the world's top game devs. Some big ones by name: DICE (Battlefield 4), Eidos Montreal (Thief), Irrational Games (BioShock), Crytek (Crysis 3).