GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah

THE AMD GCN ARCHITECTURE
A CRASH COURSE
@MissQuickstep
LAYLA MAH – LAYLA.MAH@AMD.COM
DEVELOPER TECHNOLOGY ENGINEER

AGENDA
 Part 1: A Brief History of GPU Evolution
 Part 2: Introduction to Graphics Core Next (GCN)
 Part 3: Anatomy of a GCN Compute Unit (CU)
 Part 4: GCN Shader: Arbitration, Examples & Tips
 Part 5: GCN Memory Hierarchy
 Part 6: GCN Compute Architecture (ACE)
 Part 7: GCN Fixed Function Units
(CP, GeometryEngine, Rasterizer, RBE, …)

 Part 8: Main Takeaways & Conclusion
 Bonus Slides: Tiled Resources, Partially Resident Textures (PRT)
22

| A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13

GPU EVOLUTION
1ST ERA:
Fixed Function
3D Geometry Transformation

2ND ERA:
Simple Shaders

3RD ERA:
Graphics Parallel Core
VLIW5

FMAD+
Special
Functions

Branch Unit

Stream
Processing Units

General Purpose
Registers

Lighting

VLIW4

General Purpose
Registers

3 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13

Branch Unit

Stream
Processing Units

GPU EVOLUTION
1ST ERA:
Fixed Function

2ND ERA:
Simple Shaders

3RD ERA:
VLIW5

FMAD+
Special
Functions

Branch Unit

Stream
Processing Units

General Purpose
Registers

Lighting

VLIW4

General Purpose
Registers


Branch Unit

Stream
Processing Units

GPU EVOLUTION
1ST ERA:
Fixed Function

2ND ERA:
Simple Shaders

Prior to 2002
 Graphics-specific hardware

3RD ERA:
VLIW5

‒ Texture mapping/filtering
‒ Transform & Lighting (T&L) Engines
‒ Geometry processing
‒ Rasterization
‒ Fixed function lighting equations

Lighting

Stream
Processing Units

FMAD+
Special
Functions

Branch Unit

‒ Multi-texturing

General Purpose
Registers

‒ Dedicated texture and pixel caches VLIW4

‒ Sufficient for basic graphics tasks
Processing Units
‒ No general purpose compute capability General Purpose
Stream

Registers


Branch Unit

 Dot product and scalar multiply-add

GPU EVOLUTION
1ST ERA:
Fixed Function

2ND ERA:
Simple Shaders

Memory Interface

3RD ERA:
VLIW5

Stream
Processing Units

General Purpose
Registers

Setup Engine

Lighting

Pixel Shader Core

VLIW4

Stream
Processing Units
General Purpose
Registers


Branch Unit

16 Pixel Pipes

FMAD+
Special
Functions

Branch Unit

8 Vertex Pipes

GPU EVOLUTION
1ST ERA:
Fixed Function

2002-2006
Graphics Programmability

– Direct3D 8/9, OpenGL 2.0
IEEE not required

Memory Interface
8 Vertex Pipes
Setup Engine

– NV 16-bit full-speed
–
Lighting NV 32-bit half-speed

Pixel Shader Core


 Shader Models 1.0 - 2.0
VLIW5
‒ VS and PS are distinct
‒ Minimal Instruction Sets
‒ Limited Instruction Slots
‒ LimitedGeneral Purpose Lengths
Shader
Registers
‒ No DYNAMIC Flow Control
‒ No Looping Constructs
‒ VLIW4
No Vertex Texture Fetch
‒ No Bitwise Operators
‒ No Native Integer ALU
Stream
Processing Units
‒ […]
FMAD+
Special
Functions

Branch Unit

– Specialized shader units for
vertex & pixel processing
 Added dedicated caches

The Rise of Shaders

Stream
Processing Units

Different precision per IHV
– ATI 24-bit full-speed

3RD ERA:

Branch Unit

– Floating point processing

2ND ERA:
Simple Shaders

16 Pixel Pipes

General Purpose
Registers

GPU EVOLUTION
1ST ERA:
Fixed Function

2ND ERA:
Simple Shaders

3RD ERA:
VLIW5

FMAD+
Special
Functions

Branch Unit

Stream
Processing Units

General Purpose
Registers

Lighting

VLIW4

General Purpose
Registers


Branch Unit

Stream
Processing Units

GPU EVOLUTION
1ST ERA:
Fixed Function

2ND ERA:
Simple Shaders

The Rise of The Unified Shader (VLIW-5)
 5-Element Very-Long-Instruction-Word (XYZWT)

3RD ERA:
VLIW5

‒ Ideal for 4-element Vector and 4x4 Matrix Operations
‒ Vector/Vector math in a single instruction

Stream
Processing Units

‒ Plus One Transcendental-Unit function per Instruction

FMAD+
Special
Functions

Branch Unit

‒ Began with XENOS and utilized from R600 until “Cayman”
‒ Flexible and optimized for Graphics workloads

General Purpose
Registers

 More advanced caching
‒ Instruction, constant, multi-level texture/data, & later: LDS/GDS
Lighting

VLIW4

Stream
Processing Units

 More flexible: Unified ALU, Branch Unit, Dynamic Flow
Control, Vertex Texture, Geometry Shader, Tessellation
Engines, etc.

General Purpose
Registers

Branch Unit

 Single Precision 32-bit IEEE-Compliant Floating Point ALUs

GPU EVOLUTION
1ST ERA:
Fixed Function

2ND ERA:
Simple Shaders

3RD ERA:
VLIW5

FMAD+
Special
Functions

Branch Unit

Stream
Processing Units

General Purpose
Registers

Lighting

VLIW4

General Purpose
Registers


Branch Unit

Stream
Processing Units

GPU EVOLUTION
1ST ERA:
Fixed Function

2ND ERA:
Simple Shaders

Optimized For Die Area Efficiency (VLIW-4)
3D 4-Element Very-Long-Instruction-Word (XYZW)
 Geometry Transformation

3RD ERA:
VLIW5

‒ Profiling showed average VLIW utilization was < 3.4/5
‒ Each ALU has a smaller LUT
‒ Combined using 3-term Lagrange polynomial interpolation across multiple ALU

Stream
Processing Units

‒ Better optimized for combination of Graphics & Compute
‒ Graphics is still the primary focus, but compute is gaining attention
‒ Still ideal for 4-element Vector and 4x4 Matrix Operations
‒ Fewer ALU bubbles in transcendental-light code, better utilization

Lighting

FMAD+
Special
Functions

General Purpose
Registers

VLIW4

‒ Multiple dispatch processors & separate command queues


Stream
Processing Units
General Purpose
Registers

Branch Unit

‒ Simplified programming and optimization relative to VLIW-5

 Improved support for DirectCompute™ and OpenCL™

Branch Unit

‒ Removed dedicated T-Unit – Optimized die area usage

GPU EVOLUTION
LANE 0

LANE 1

LANE 2

LANE 15

SIMD

VLIW4 SIMD

LANE
0 1 2
SIMD 0

15

LANE
0 1 2

15

SIMD 1

LANE
0 1 2

15

SIMD 2

LANE
0 1 2
15
SIMD 3

GCN Quad SIMD-16

 64 Single Precision multiply-add (per-clock)

 64 Single Precision multiply-add (per-clock)

 16 SIMDs × ( 1 VLIW inst × 4 ALU ops )

 4 SIMDs × ( 1 ALU op × 16 threads )

 1 VLIW inst containing 4 ALU ops (per-clock)

 4 ALU ops (from different wavefronts) / clock

 Needs 4 parallel ALU ops to fill each VLIW inst

 Needs 4+ wavefronts to keep SIMD lanes full

 Compiler manages register port conflicts

 No register port conflicts

 Specialized, complex compiler scheduling

 Standard compiler scheduling & optimizations

 Difficult assembly creation, analysis, and debug

 Simplified assembly creation, analysis, & debug

 Complicated tool chain support

 Simplified tool chain development and support

 Careful optimization req. for peak performance

 Stable and predictable performance


AMD

GRAPHICS
CORE
NEXT


 GS-4112 – Mantle: Empowering 3D Graphics Innovation
 Keynote – Johan Andersson, Technical Director, EA
 GS-4145 – Oxide on Mantle Adoption (Wed 5:00-5:45)

 New low level programming
interface for PCs
 Designed in collaboration with
top game developers
 Lightweight driver that allows
direct access to GPU hardware
 Compatible with DirectX® HLSL
for simplified porting

MANTLE
Graphics Applications

Mantle API
Mantle Driver
GCN

Works with all
Graphics Core
Next GPUs

AMD GRAPHICS CORE NEXT ARCHITECTURE

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING

Faster performance
Higher efficiency

New graphics features
New compute features

GRAPHICS CORE NEXT


 Cutting-edge graphics performance and features
 High compute density with multi-tasking

 Built for power efficiency
 Optimized for heterogeneous computing
 Enabling the Heterogeneous System Architecture (HSA)

 Amazing scalability and flexibility


GRAPHICS CORE NEXT


 Unlimited Resources & Samplers
 All UAV formats can be read/write

 Simpler Assembly Language
 Simpler Shader Code
 Ability to support C/C++ (like)

 Architectural support for traps, exceptions & debugging
 Ability to share virtual x86-64 address space with CPU cores

GRAPHICS CORE NEXT


 AMD TECHNOLOGY POWERS NEXT-GEN CONSOLES
NEW NEXT-GEN GAME CONSOLES
RAISE THE BAR FOR GRAPHICS PERFORMANCE

PERFORMANCE
TFLOPS-CLASS
COMPUTE POWER

MEMORY

16X

MORE
MEMORY

*

* Based on PlayStation 3 512MB vs.
PlayStation 4 8192MB GDDR5.

GRAPHICS CORE NEXT

GCN COMPUTE UNIT


GRAPHICS CORE NEXT

 CU = Basic Building Block of GPU Computational Power
Branch &
Message Unit

Vector Units
(4x SIMD-16)

Texture Filter
Units (4)

Texture Fetch
Load / Store
Units (16)



New Instruction Set Architecture

Scalar Unit

‒

Non-VLIW

‒

Vector unit + scalar co-processor

‒

Scheduler

Distributed programmable scheduler



Each CU can execute instructions from
multiple kernels at once



Increased instructions per clock per mm2

‒ High utilization

‒ High throughput
Vector Registers
(4x 64KB)

Local Data Share
(64KB)

Scalar Registers
(8KB)


L1 Cache
(16KB)

‒ Multi-tasking

GCN COMPUTE UNIT
Branch &
Message Unit

Scheduler

Vector Units
(4x SIMD-16)

Scalar Unit

Texture Filter
Units (4)

Texture Fetch
Load / Store
Units (16)

GRAPHICS CORE NEXT
Vector Registers
(4x 64KB)

Local Data Share
(64KB)

Scalar Registers
(8KB)

L1 Cache
(16KB)

 Scalar Unit
4x Vector Units (16-lane SIMD)
 Branch andData Threads) Data Cache
Scheduler Message Unit
64kb Local(2560 L1 Vector
16kb Read/Write Share (LDS)

‒ CU Total Throughput: 64 Single-Precision (SP) ops/clock
Fully Programmable

‒ Executes Branch instructions Limit (32k/thread group)
Separate to texture units (acts as texture cache)
2x Larger decode/issue for:
Attachedthan D3D11 TGSM
‒ 1 SP (Single-Precision) operation per 4 clocks
‒ Shared by all threads of a wavefront
‒ (as dispatched by SMEM, LDS, GDS/E
V
SALU,
‒ 1 DP ALU, V with Conflict Resolution
 4 32 banks, MEM,Units Scalar unit)clocks XPORT
Texture
‒ Used for Filtercontrol, ADD in 8 arithmetic, etc.
(Double-Precision) pointer
‒
flow
‒ + Special Instructions (NOPs, barriers, etc.) and
‒ Bandwidth Amplification
 16 TextureGPR pool, scalar data cache, etc. branch instructions
Fetch Load/Store Units
‒ 1 DP MUL/FMA/Transcendental per 16 clocks*
‒ Has own
‒ 16 Hardware Barriers Decode
‒ Separate Instruction
 4x64KB Vector Registers (VGPR)

 8KB Scalar General Purpose Registers (SGPR)



Scalar Unit
‒

8KB Scalar Registers (SGPR)

‒

16KB 4-CU Shared R/O L1 Scalar Data Cache



Branch & Message Unit



4x Vector Units (SIMD-16)
‒



64KB Local Data Share (LDS)
‒



32 banks, with conflict resolution

16KB Read/Write L1 Vector Data Cache
‒



4x64KB Vector Registers (VGPR)

Shared with TMU as Texture Cache

Hardware Scheduler
‒

Up-to 2560 threads

GCN COMPUTE UNIT
Branch &
Message Unit

Scheduler

SIMD SPECIFICS

Vector Units
(4x SIMD-16)

Scalar Unit

Texture Filter
Units (4)

Texture Fetch
Load / Store
Units (16)

GRAPHICS CORE NEXT
Vector Registers
(4x 64KB)

Local Data Share
(64KB)

Scalar Registers
(8KB)

L1 Cache
(16KB)

 Each Compute Unit (CU) contains 4 SIMD; each SIMD has:
‒ A 16-lane IEEE-754 vector ALU (VALU)
‒ 64KB of vector register file (VGPR)
‒ Its own 40-bit (48-bit on HSA APUs) Program Counter (PC)

‒ Instruction buffer for 10 wavefronts*
‒ *A wavefront is a group of 64 threads: the size of one logical vGPR

GCN COMPUTE UNIT

SCALAR UNIT

LANE
0 1 2
SIMD 0

15

LANE
0 1 2
SIMD 1

15

SPECIFICS …
LANE
0 1 2

15

SIMD 2

LANE
0 1 2
15

Scalar Unit

SIMD 3

GCN Scalar Unit
 Fully Programmable Scalar Unit replaces FF Branch Logic
 Operations such as JMP [GPR] are now supported
 Opens the door to e.g. virtual function calls
 Has its own GPR pool and can execute normal ALU code
 64-bit bitwise ops to mask thread execution
 32-bit bitwise and integer arithmetic operations at full-speed
 Potential to offload uniform code (Vector ALU  Scalar ALU)

 A GCN CU can dispatch 1 scalar op/clock


GCN COMPUTE UNIT

SCALAR UNIT

LANE
0 1 2
SIMD 0

15

LANE
0 1 2

15

CONTINUED …
LANE
0 1 2

SIMD 1

15

SIMD 2

R/W L2

LANE
0 1 2
15

Scalar Unit

SIMD 3

GCN Scalar Unit
 Natively a 64-bit integer ALU
 Independent arbitration and instruction decode
 One ALU, memory or control flow op per cycle
 512 Scalar GPR per SIMD shared between waves
 { SGPRn+1, SGPR } pair provide 64-bit register
 4 CU Shared Read Only Scalar Data Cache: 16KB – 64B lines

 4-way assoc, LRU replacement policy
 Peak Bandwidth per CU is 16 bytes/cycle


4 CU Shared 16KB Scalar R/O L1
Scalar Unit
8KB Registers
Integer ALU

Scalar Decode

GCN COMPUTE UNIT
Branch &
Message Unit

Scheduler

BRANCH & MESSAGE UNIT

Vector Units
(4x SIMD-16)

Scalar Unit

GRAPHICS CORE NEXT

 Independent scalar assist unit to handle special classes of
instructions concurrently
‒ Branch
‒ Unconditional Branch (s_branch)
‒ Conditional Branch (s_cbranch_<cond>)
‒ Condition  SCC == 0, SCC == 1, EXEC == 0, EXEC != 0, VCC == 0, VCC != 0

‒ 16-bit signed immediate dword offset from PC provided

‒ Messages
‒ s_sendmsg  CPU interrupt with optional halt (with shader supplied code and source)
‒ debug message (perf trace data, halt, etc.)
‒ special graphics synchronization messages

 Branch


Unconditional Branch (s_branch)



Conditional Branch (s_cbranch_<cond>)

‒
‒
‒
‒
‒
‒

SCC ==
SCC ==
EXEC ==
EXEC !=
VCC ==
VCC !=

0
1
0
0
0
0

 Messages

‒ s_sendmsg

GCN COMPUTE UNIT
Branch &
Message Unit

Scheduler

MEMORY SPECIFICS

Vector Units
(4x SIMD-16)

Scalar Unit

Texture Filter
Units (4)

Texture Fetch
Load / Store
Units (16)

GRAPHICS CORE NEXT
Vector Registers
(4x 64KB)

Local Data Share
(64KB)

Scalar Registers
(8KB)

L1 Cache
(16KB)

 Each CU has its own dedicated L1 cache and LDS memory
‒ Both global and shared memory atomics are supported



‒ 32 banks, with conflict resolution



‒ Scalar L1 Read-Only data cache is shared between 4 neighbor CU


16KB R/W L1 Vector D-Cache
‒ Shared with TMU as Texture Cache



‒ 16 work group barriers supported per CU
‒ Vector L1 Read/Write data cache shared with TMU as texture cache

64KB Local Data Share (LDS)

Scalar Unit
‒ 16KB 4-CU Shared R/O Scalar L1



16 hardware barriers per CU



A GCN GPU with 44 CU, such as the AMD
Radeon™ R9 290x, can be working on
up-to 112,640 work items at a time!

GCN COMPUTE UNIT
Branch &
Message Unit

Scheduler

SCHEDULER SPECIFICS

Vector Units
(4x SIMD-16)

Scalar Unit

Texture Filter
Units (4)

Texture Fetch
Load / Store
Units (16)

GRAPHICS CORE NEXT
Vector Registers
(4x 64KB)

Local Data Share
(64KB)

Scalar Registers
(8KB)

L1 Cache
(16KB)

 Each CU has its own dedicated Scheduler unit



Each CU can have 40 waves in-flight
‒ Each potentially from a different kernel



Scheduler Limits:

‒ Supports up-to 2560 threads per CU (64 threads x 10 waves x 4 SIMD)

‒ 40 wavefronts per CU

‒ All threads within a workgroup are guaranteed to reside on the same
CU simultaneously

‒ Limited by available GPR count

‒ A set of synchronization primitives and shared memory allow data to
be passed between threads in a workgroup
‒ Optimized for throughput – latency is hidden by overlapping
execution of wavefronts

‒ 10 wavefronts per SIMD
‒ Limited by available LDS memory
‒ 16 hardware barriers per CU


A GCN GPU with 44 CU, such as the
AMD Radeon™ R9 290x, can be
working on up-to 112,640 work items at
a time!

GCN COMPUTE UNIT
Branch &
Message Unit

Scheduler

SCHEDULER SPECIFICS

Vector Units
(4x SIMD-16)

Scalar Unit

ARBITRATION & DECODE

Local Data Share
(64KB)

L1 Cache
(16KB)

GRAPHICS CORE NEXT

 CU is guaranteed to issue instructions for a wave sequentially
‒ Predication & control flow enables any single work-item a
unique execution path
 For a CU, every clock, waves on 1 SIMD are considered for issue
‒ Round-Robin scheduling algorithm



Maximum

5 instructions per cycle

‒ Not including “internal” instructions



Instruction Types:
‒ 1 Vector Arithmetic Logic Unit (VALU)
‒ 1 Scalar ALU or Scalar Memory (SALU)|(SMEM)
‒ 1 Vector Memory (Read/Write/Atomic) (VMEM)
‒ 1 Branch/Message (e.g. s_branch, s_cbranch)
‒ 1 Local Data Share (LDS)

 At most, 1 instruction from each category may be issued
 At most, 1 instruction per wave may be issued
 Theoretical maximum of 5 instructions per cycle per CU

‒ 1 Export or Global Data Share (GDS)
‒ 10 Special/Internal
(s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio) –
[no functional unit]

GCN COMPUTE UNIT

VECTOR & SCALAR ARBITRATION

LANE
0 1 2
SIMD 0

15

LANE
0 1 2
SIMD 1

15

LANE
0 1 2

15

LANE
0 1 2
15

SIMD 2

HARDWARE VIEW
Scalar Unit

SIMD 3

GCN Hardware View
 A GCN Compute Unit can retire 256 SP Vector ALU ops in 4 clocks
 Each lane can dispatch 1 SP ALU operation per clock
 Each SP ALU operation takes 4 clocks to complete
 The scheduler dispatches from a different wavefront each cycle


GCN COMPUTE UNIT

VECTOR & SCALAR ARBITRATION

LANE
0 1 2

15

LANE
16 17 18

31

LANE
32 33 34

47

LANE
48 49 50

PROGRAMMER VIEW
Scalar Unit

63

WAVEFRONT 0

WAVEFRONT 4

WAVEFRONT 1

WAVEFRONT 5

WAVEFRONT 2

WAVEFRONT 6

WAVEFRONT 3

WAVEFRONT 7

WAVEFRONT 0

WAVEFRONT 8

WAVEFRONT 1

WAVEFRONT 9

GCN Programmer View
 A GCN Compute Unit can perform 64 SP Vector ALU ops / clock
 Each lane can dispatch 1 SP ALU operation per clock
 Each SP ALU operation still takes 4 clocks to complete
 But you can PRETEND your code runs 1 op on 64-threads at once


GCN VECTOR UNITS

ALU CHARACTERISTICS

 FMA (Fused Multiply Add), IEEE 754-2008 precise with all round modes, proper handling of
NaN/Inf/Zero and full de-normal support in hardware for SP and DP
 MULADD single cycle issue instruction without truncation, enabling a MULieee followed by
ADDieee to be combined with round and normalization after both multiplication and subsequent
addition
 VCMP A full set of operations designed to fully implement all the IEEE 754-2008 comparison
predicates
 IEEE Rounding Modes (Round toward +Infinity, Round toward –Infinity, Round to nearest
even, Round toward zero) supported under program control anywhere in the shader. SP and DP
modes are controlled separately.
 De-normal Programmable Mode control for SP and DP independently. Separate control for input
flush to zero and underflow flush to zero.

GCN VECTOR UNITS

ALU CHARACTERISTICS

CONTINUED …

 Divide Assist Ops IEEE 0.5 ULP Division accomplished with macro (SP/DP ~15/41 Instruction
Slots, respectively)
 FP Conversion Ops between 16-bit, 32-bit, and 64-bit floats with full IEEE-754 precision and
rounding
 Exceptions Support in hardware for floating point numbers with software recording and reporting
mechanism. Inexact, underflow, overflow, division by zero, de-normal, invalid operation, and
integer divide by zero operation
 64-bit Transcendental Approximation Hardware based double precision approximation for
reciprocal, reciprocal square root and square root
 24-bit Integer MUL/MULADD/LOGICAL/SPECIAL @ full SP rates
‒ Heavily utilized for integer thread group address calculation
‒ 32-bit integer MUL/MULADD @ DP MUL/FMA rate

GCN SHADER AUTHORING TIPS
 GCN has greatly improved branch performance, and it continues to improve
‒ Don’t be afraid to use it! But, remember: use it wisely – improved != free 
‒ It’s at its best for highly coherent workloads (where most threads take the same path)

 However, the new architecture is more susceptible to: register pressure
‒ Using too many registers within a shader can reduce the maximum waves per SIMD! 
‒ NOTE: A WAVEFRONT

CAN ALLOCATE

104 USER SCALAR REGISTERS

AS SEVERAL SCALAR REGISTERS ARE RESERVED FOR ARCHITECTURAL STATE

GCN SGPR Count
VGPR Count

<= 48
<=24

56
28

64 72 84 100
32 36 40 48

> 100 84
64

<= 128

> 128

Max Waves/SIMD

10 

9
9

8
8

4 3
4

2

1

77

66

‒ Take caution with respect to the following:
‒ Excessive nested branching/looping
‒ Loop Unrolling

‒ Variable declarations (especially arrays)
‒ Excessive function calls requiring storing of results

55

GCN SHADER CODE EXAMPLE
// Registers r0 contains “a”, r1 contains “b”
// Value is returned in r2
v_cmp_gt_f32
s_mov_b64
s_and_b64
s_cbranch_vccz
v_sub_f32
v_mul_f32

r0,r1
s0,exec
exec,vcc,exec
label0
r2,r0,r1
r2,r2,r0

//
//
//
//
//
//

a > b, establish VCC
Save current exec mask
Do “if”
Branch if all lanes fail
result = a – b
result = result * a

s_andn2_b64
s_cbranch_execz
v_sub_f32
v_mul_f32

exec,s0,exec
label1
r2,r1,r0
r2,r2,r1

//
//
//
//

Do “else (s0 & !exec)
Branch if all lanes fail
result = b – a
result = result * b

s_mov_b64

exec,s0

// Restore exec mask

 An alternative to s_cbranch, is to use VSKIP to transform VALU into NOPs
 s_setvskip – enables or disables VSKIP mode. Requires 1 waitstate after executing.



VSKIP does NOT skip VMEM instructions (Do: branch over superfluous VMEM inst.)

GCN MEMORY

CACHE HIERARCHY
I$

32KB instruction cache (I$) +
16KB scalar data cache (K$)
shared per ~4 CUs
with L2 backing

K$

I$

K$

Each CU has its own registers
and local data share

64 Bytes per clock
L1 bandwidth per CU

GDS
L1

L1

L1

L1

L1

L1

L1

L1

L1

L1 read/write caches
64 Bytes per clock
L2 bandwidth per partition
L2 read/write cache
partitions


L2

L2

L2

64-bit Dual Channel
Memory Controller

64-bit Dual Channel
Memory Controller

64-bit Dual Channel
Memory Controller

Global Data
Share facilitates
synchronization
between CUs
(64KB)

GCN MEMORY

VECTOR MEMORY INSTRUCTIONS

VECTOR MEMORY INSTRUCTIONS SUPPORT VARIABLE GRANULARITY FOR ADDRESSES AND DATA, RANGING FROM 32-BIT DATA TO 128-BIT PIXEL QUADS

MUBUF – read from or write/atomic to an un-typed buffer/address
‒ Data type/size is specified by the instruction operation

MTBUF – read from or write to a typed buffer/address
‒ Data type is specified in the resource constant

GRAPHICS CORE NEXT



‒ MUBUF is like C++
reinterpret_cast

MIMG – read/write/atomic operations on elements from an image surface
‒ Image objects (1-4 dimensional addresses and 1-4 dwords of homogenous data)
‒ Image objects use resource and sampler constants for access and filtering

A pointer is a pointer on GCN!

‒ MTBUF is like C++
static_cast


Utilize TMU for filtering via MIMG

GCN MEMORY

DEVICE FLAT MEMORY INSTRUCTIONS
A GCN POINTER IS A POINTER

FLAT
 Flat Address Space (“flat”) instructions are new as of Sea Islands (CI) and
allow read/write/atomic access to a generic memory address pointer which
can resolve to any of the following physical memories:
‒ Global Memory
‒ Scratch (“private”)
‒ LDS (“shared”)
‒ Invalid - MEM_VIOL TrapStatus
 Device Flat (Generic) 64b/32b Addressing Support
‒ FLAT instructions support both 64 and 32-bit addressing. The address size is set
via a mode register (“PTR32”) and a local copy of the value is stored per wave.
‒ The addresses for the aperture check differ in 32 and 64-bit mode

GCN MEMORY

EXPORT INSTRUCTION & GDS

 Exports move data from 1-4 VGPRS to the fixed-function Graphics Pipeline
‒ E.g: Color (MRT0-7), Depth, Position, and Parameter  Tessellator, Rasterizer, or RBE

 Global Shared Memory Ops (Utilize GDS)

 The GDS is identical to the LDS, except that it is shared by all CUs, so it acts as an
explicit global synchronization point between all wavefronts
 The atomic units in the GDS also support ordered count operations

GCN MEMORY

LOCAL DATA SHARE

 GCN Local Data Share (LDS) is a 64KB, 32 bank (or 16) Shared Memory
 Instruction issue fully decoupled from ALU instructions
 Direct mode
‒ Vector Instruction Operand  32/16/8-bit broadcast value
‒ Graphics Interpolation @ rate, no bank conflicts

 Index Mode – Load/Store/Atomic Operations
‒ Bandwidth Amplification, up-to 32 – 32-bit lanes serviced per clock peak
‒ Direct decoupled return to VGPRs
‒ Hardware conflict detection with auto scheduling

 Software consistency/coherency for thread groups via hardware barrier
 Fast & low power vector load return from R/W L1

GCN MEMORY

CONTINUED …
LOCAL DATA SHARE

 An LDS bank is 512 entries, each 32-bits wide
‒ A bank can read and write a 32-bit value across an all-to-all crossbar and swizzle unit that
includes 32 atomic integer units
‒ This means that several threads can read the same LDS location at the same time for FREE
‒ Writing to the same address from multiple threads also occurs at rate, last thread to write wins
(useful e.g. for all threads writing uniform value to still be fast)

 Typically, the LDS will coalesce 32 lanes from one SIMD each cycle
‒ One wavefront is serviced completely every 2 cycles
‒ Conflicts automatically detected across 32 lanes from a wavefront and resolved in hardware
‒ An instruction which accesses different elements in the same bank takes additional cycles

BLOCK DIAGRAM
GCN MEMORY

LOCAL DATA SHARE


GCN MEMORY

NEW MEMORY OPERATIONS
LOCAL DATA SHARE

 Remote Atomic Ops with Shared Memory Dual-Source Operands
‒LDS[Dst] = LDS[addr0] op LDS[addr1];
‒ Fast remote reduction operations for arithmetic, logical, Min/Max

 Read/Write/Conditional Exchange 96b/128b
 32-bit FP Min/Max/Compare Swap

GCN MEMORY

NEW MEMORY OPERATIONS
LOCAL DATA SHARE
CONTINUED …

Fast Lane Swizzle Operations
‒Does not require allocation, no shared memory used
‒Invalid read result in 0x0 return
‒First Mode: Each four adjacent lanes can full crossbar data, same switch for each set
of four
‒Second mode: For each consecutive set of 32 work-items
‒ Swap: 16, 8, 4, 2, 1
‒ Reverse: 32, 16, 8, 4, 2
‒ Broadcast: 32, 16, 8, 4, 2

OPERATION DIAGRAMS
GCN MEMORY

LOCAL DATA SHARE
16

4 Lane CrossBar

Reverse

8
4

Lane 0 , 1 ……………………..…31,32……………………………….63

2

1
Lane 0 , 1 ……………………..…31,32……………………………….63

Swap

Broadcast

16

16

8

8

4

4

2

2

1
Lane 0 , 1 ……………………..…31,32……………………………….63

1
Lane 0 , 1 ……………………..…31,32……………………………….63

GCN MEMORY

BLOCK DIAGRAM
READ/WRITE CACHE

 Reads and writes cached
‒ Bandwidth amplification
‒ Improved behavior on more memory access patterns
‒ Improved write to read reuse performance

 Relaxed consistency memory model
‒ Consistency controls available to control locality of load/store

 GPU Coherent
‒ Acquire/Release semantics control data visibility across the machine (GLC bit on load/store)
‒ GCN APUs also have SLC bit to control data visibility to CPU caches

‒ L2 coherent = all CUs can have the same view of data

 Global Atomics
‒ Performed in L2 cache (GDS also has global atomics)

GCN MEMORY

READ/WRITE L1
CACHE ARCHITECTURE

‒ Each CU has its own Vector L1 Data Cache
‒ 16KB L1, 64B lines, 4 sets x 64-way
‒ ~64B/CLK bandwidth per Compute Unit
‒ Write-through – alloc on write (no read) w/dirty byte mask
‒ Write-through at end of wavefront
‒ Decompression on cache read out

‒ Instruction GLC bit defines cache behavior (GCN APUs also have SLC bit)
‒ GLC = 0;
‒ Local caching (full lines left valid)
‒ Shader write back invalidate instructions
‒ GLC = 1;
‒ Global coherent (hits within wavefront boundaries)

GCN MEMORY

READ/WRITE L2
CACHE ARCHITECTURE

‒ 64-128KB L2 per Memory Controller Channel
‒ Up-to 16 L2 cache partitions
‒ 64B lines, 16-way set associative
‒ ~64B/CLK per channel for L2/L1 bandwidth
‒ Write-back - alloc on write (no read) w/ dirty byte mask
‒ Acquire/Release semantics control data visibility across CUs

‒ L2 Coherent = all CUs can have the same view of data
‒ Remote Atomic Operations

‒ Common Integer Set & Floating Point Min/Max/CmpSwap


GCN MEMORY

INFORMATION
BANDWIDTH

‒ Each CU has 64 bytes per cycle of L1 bandwidth
‒ Shared with the GDS

‒ Per L2 there’s 64 bytes of data per cycle as well
‒ Peak Scalar L1 Data Cache Bandwidth per CU is 16 bytes/cycle
‒ Peak I-Cache Bandwidth per CU is 32 bytes/cycle (Optimally 8 instructions)
‒ LDS Peak Bandwidth is 128 bytes of data per cycle via bandwidth amplification
‒ For R9 290x:
‒ That’s nearly 5.5 TB/s of LDS BW, 2.8 TB/s of L1 BW, and 1 TB/s of L2 BW!
‒ 512-bit GDDR5 Main Memory has over 320 GB/sec bandwidth
‒ PCI Express 3.0 x16 bus interface to system (32GBps)

GCN MEMORY

TABLES
BANDWIDTH & LATENCY

LDS

K$

L1

128 bytes / clock

16 bytes / clock

64 bytes / clock

Main Takeaways:
–LDS is optimized for bandwidth amplification and atomics
–K$ is optimized for periodic low-latency reads of small datasets
–L1 is optimized for high-bandwidth texture fetches and streaming

LDS

K$

L1

Resident

Short

Short (1x)

Long (20x)

Non-Resident

N/A

Medium (10x)

Long (20x)


GCN MEMORY

BLOCK DIAGRAM
L1 TEXTURE CACHE

 The memory hierarchy is re-used for graphics
 Some dedicated graphics hardware added
‒ Address-gen unit receives 4 texture addr/clock
‒ Calculates 16 sample addr (nearest neighbors)
‒ Reads samples from L1 vector data cache
‒ Decompresses samples in Texture Mapping Unit (TMU)

‒ TMU filters adjacent samples, produces <= 4 interpolated texels/clock
‒ TMU output undergoes format conversion and is written into the vector register file
‒ The format conversion hardware is also used for writing certain formats to memory from graphics shaders


X86-64
GCN MEMORY
VIRTUAL MEMORY

 The GCN cache hierarchy was designed to integrate with x86-64 microprocessors
 The GCN virtual memory system can support 4KB pages
‒ Natural mapping granularity for the x86-64 address space
‒ Paves the way for a shared address space in the future
‒ All GCN hardware can already translate requests into x86-64 address space

 GCN caches use 64B lines, which is the same size x86-64 processors use
 AMD A-Series APU

 The stage is set for heterogeneous systems to transparently share data between the GPU
and CPU through the traditional caching system, without explicit programmer control!


GCN COMPUTE ARCHITECTURE

R9 290X

AMD Radeon™
HD 7970 GHz Edition

AMD Radeon™
R9 290X

Increase

Geometry Processing

2.1 billion primitives/sec

4 billion primitives/sec

1.9x

Compute

4.3 TFLOPS

5.6 TFLOPS

1.3x

Texture fill rate

134.4 Gtexels/sec

176 Gtexels/sec

1.3x

Pixel fill rate

33.6 Gpixels/sec

64 Gpixels/sec

1.9x

Peak Bandwidth

264 GB/sec

320 GB/sec

1.2x

Die area

352 mm2

438 mm2

1.24x

Peak GFLOPS/mm2

12.2

12.8

1.05x



SHADER ENGINE

 Each GCN GPU can contain up-to 4 Shader Engines
‒ Load balanced with each other
‒ Screen partitioning of pixel assignment

 A Shader Engine is a high level organizational unit containing:
‒ 1 Geometry Processor (1 Primitive Per Cycle Throughput)
‒ 1 Rasterizer
‒ 1-16 CUs (Compute Units)
‒ Instruction I$ and constant K$ caches shared by up to 4 CU each

‒ 1-4 RBEs (Render Back Ends)
‒ Up-to 16 – 64b pixels/cycle per Shader Engine
‒ Up-to 8 – 128b pixels/cycle per Shader Engine


R9 290X


GRAPHICS CORE NEXT

 44 Compute Units

 4 Geometry Processors
‒ 4 billion primitives/sec

 64 Pixel Output/Clock
‒ 64 Gpixels/sec fill rate

 1MB L2 Cache
‒ Up-to 1 TB/sec L2/L1 bandwidth

 512-bit GDDR5 memory interface
‒ 320 GB/sec memory bandwidth

 6.2 billion transistors

‒ 438 mm2 on 28nm process node
‒ 12.8 GFLOPS/mm2


SEA ISLANDS


GRAPHICS CORE NEXT

 8 ASYNCHRONOUS COMPUTE ENGINES (ACE)
‒ Operate in parallel with Graphics CP
‒ Independent scheduling and work item dispatch
for efficient multi-tasking
‒ 9 Devices with 64+ Command Queues!

‒ Fast context switching
‒ Exposed in OpenCL™

 Dual DMA engines
‒ Can saturate PCIe 3.0 x16 bus bandwidth (16
GB/sec bidirectional)



SEA ISLANDS


GRAPHICS CORE NEXT

 ACEs are responsible for compute shader
scheduling & resource allocation
 Each ACE fetches commands from cache or
memory & forms task queues
 Tasks have a priority level for scheduling
‒ Background  Realtime
 ACE dispatch tasks to shader arrays as
resources permit
 Tasks complete out-of-order, tracked by ACE
for correctness
 Every cycle, an ACE can create a
workgroup and dispatch one wavefront from
the workgroup to the CUs


SEA ISLANDS


GRAPHICS CORE NEXT

 ACE are independent
‒ But, can synchronize and communicate
via Cache/Memory/GDS

 ACE can form task graphs
‒ Individual tasks can have
dependencies on one another
‒ Can depend on another ACE
‒ Can depend on part of graphics pipe

 ACE can control task switching
‒ Stop and Start tasks and dispatch
work to shader engines


SEA ISLANDS


GRAPHICS CORE NEXT

 Focus in GPU hardware shifting away
from graphics-specific units, towards
general-purpose compute units
 R9 290x GCN-based ASICs already
have 8:1 ACE : CP ratio
‒ CP can dispatch compute
‒ ACE cannot dispatch graphics
 If you aren’t writing Compute
Shaders, you’re not getting the absolute
most out of modern GPUs
‒ Control: LDS, barriers, thread layout, ...


SEA ISLANDS


GRAPHICS CORE NEXT

Future Trends:
 More Compute Units
‒ ALU outpaces Bandwidth
 CPU + GPU Flat Memory
‒ APU + dGPU
 Less Fixed Function Graphics
‒ Can you write a Compute-based
graphics pipeline?

‒ Start thinking about it… 

GCN FIXED FUNCTION ARCHITECTURE
Geometry Processor
Geometry
Assembler

Tessellator

Vertex
Assembler

GEOMETRY

Geometry Processor
Geometry
Assembler

Tessellator

Vertex
Assembler

Updated Hardware Geometry Units
– Off-chip buffering improvements
– Larger parameter and position cache

Geometry Processor
Geometry
Assembler

Tessellator

Vertex
Assembler

Geometry Processor
Geometry
Assembler

Tessellator

Vertex
Assembler

Tessellation off
on
Tessellation
off

 GS + Tessellation is faster than before…
 However… memory is still the bottleneck!
– Minimize the number of inputs and
outputs for best performance…
 Small expansions can be done within LDS!

Image from Battlefield 3, EA DICE

Process and rasterize up to 4 primitives per clock cycle


RASTERIZER

 We now have 4 Rasterizers on R9 290x (4 triangles x 16 pixels = 64 pixels per clock)
‒ Each rasterizer can read in a single triangle per cycle, and write out 16 pixels

 Caveat: tiny (e.g. sub-pixel) triangles can dramatically reduce efficiency
 This can cause us to become raster-bound, starving the shader and holding up geometry!

12 Pixels Per Clock
75%

Efficiency

100%

Efficiency

16 Pixels Per Clock

28 Pixels in 2 Clocks


vs.

3 Pixels in 3 Clocks

1 Pixel Per Clock 
6.25%

Efficiency


TESSELLATION + RASTERIZER EFFICIENCY
6.25%

75-90%

18-25%
Efficiency

Efficiency

~13 Pixels

~4 Pixels

1 Pixel

Per Clock

Per Clock

Per Clock

Efficiency

Over-Tessellation

 Reduces rasterizer efficiency
‒ Extreme Tessellation = 6.25% Efficiency
 Also impacts ROPs and MSAA efficiency
‒ High number of polygon edges to AA
‒

Consumes dramatically more bandwidth

‒ If nFragments > nSamples, quality will be lost
‒

E.g. 16 verts affecting 1 pixel @ 8xMSAA


Over-Tessellation
 Reduces shader efficiency
 HS, DS and VS run many times
for each final image pixel
‒ Yet don’t contribute much
to final image quality

 The graphics pipeline is not
designed for this abuse!

TESSELLATION + SHADING EFFICIENCY
Shading Passes Per-Pixel (Overshade)

8
7
6
5
4
3
2
1

 Consider Alternatives:
‒ Parallax Occlusion Mapping
‒ […]

 Image courtesy: Kayvon Fatahalian
“Evolving the Direct3D Pipeline for Real-time Micropolygon Rendering,”
From ACM SIGGRAPH 2010 course: “Beyond Programmable Shading II”


GCN Tessellation – Best Practices
 While performance is much improved, it is still a potential bottleneck!
‒ Produces a great deal of IO traffic, starving other parts of the pipeline

 Best performance generlly achieved with tessellation factors less than 15!

Continue to

Optimize:

‒ Pre-triangulate
‒ Distance-adaptive
‒ Screen-space adaptive
‒ Orientation-adaptive
‒ Backface Culling
‒Frustum Culling
‒ […]


Tessellation OFF
ON


RASTERIZER

 We now have 4 Geometry Processors on R9 290x
‒ Overall Primitive Rate = 4 prims per clock (ideal)

 We now have 4 Rasterizers on R9 290x (4 triangles x 16 pixels = 64 pixels per clock)
‒ Each rasterizer can read in a single triangle per cycle, and write out 16 pixels

 Caveat: tiny (e.g. sub-pixel) triangles can dramatically reduce efficiency
 This can cause us to become raster-bound, unable to rasterize at peak-rate!
Command Processor

Geometry Processor
Geometry
Assembler

Tessellator

Vertex
Assembler

Geometry Processor
Geometry
Assembler

Tessellator

Geometry Processor

Vertex
Assembler

Geometry
Assembler

Tessellator

Vertex
Assembler

Geometry Processor
Geometry
Assembler

Tessellator

Vertex
Assembler

Compute Units

Rasterizer
Scan Converter

Hierarchical Z

Render Back-Ends

Rasterizer
Scan Converter

Hierarchical Z

Render Back-Ends


Rasterizer
Scan Converter

Hierarchical Z

Render Back-Ends

Rasterizer
Scan Converter

Hierarchical Z

Render Back-Ends


RENDER BACK ENDS

 Once the pixels fragments in a tile have been
shaded, they flow the Render Back-Ends (RBEs)

Z/Stencil ROPs

Color ROPs

Depth Cache

Color Cache

‒ 16KB Color Cache
‒ Up to 8 color + 16 coverage samples (16x EQAA)

‒ 8KB Depth Cache
‒ Up to 8 depth samples (8x MSAA)

‒ Writes un-cached via memory controllers
‒ 64 – 64B pixels per cycle
‒ 256 Depth Test (Z) / Stencil Ops per cycle

 Logic Operations as alternative to Blending

‒Exposed in Direct3D 11.1
‒Also available in OpenGL
 Dual-Source Color Blending with MRTs

‒Only available in OpenGL
*

There are 16 RBEs on R9 290x


DEPTH IMPROVEMENTS

24-BIT DEPTH FORMATS ARE INTERNALLY REPRESENTED AS 32-BITS

Fast-accept of fully-visible triangles spanning one or more tile
If a triangle is fully covering a tile, then cost is only 1 clock/tile
 Depth Bounds Test (DBT) Extension

‒Exposed in OpenGL via GL_EXT_depth_bounds_test
‒Exposed in Direct3D 11 via extension


STENCIL IMPROVEMENTS

 GCN has support for new extended stencil ops

‒Only available in OpenGL:

GL_AMD_stencil_operation_extended
‒Additional stencil ops:
‒AND, XOR, NOR
‒REPLACE_VALUE_AMD
‒etc.
‒ Also exposes additional stencil op source value
‒ Can be used as an alternative to stencil ref value

 Stencil ref and op source value can now be exported from pixel shader

‒Only available in OpenGL: GL_AMD_shader_stencil_value_export

GCN LOW-LEVEL TIPS

GPR PRESSURE

 GPRs and GPR Pressure
 Banks of GCN Vector GPRs (Illustration)

 General Purpose Registers (GPR) are a limited resource
‒ Separate banks of GPRs for Vector and Scalar (per SIMD)
‒ Maximum of 256 VGPRS and 512 SGPRS shared across all waves (up-to 10) owned by a SIMD
‒ Organized as 64 words of 32-bits – two adjacent GPR can be combined for 64-bit (4 for 128-bit)
‒ Number of GPRs required by a shader affects SIMD scheduling and execution efficiency
‒ Shader tools can be used to determine how many GPRs are used…

 GPR pressure is affected by:
‒ Loop Unrolling
‒ Long lifetime of temporary variables
‒ Nested Dynamic Flow Control instructions
‒ Fetch dependencies (e.g. indexed constants)


GCN LOW-LEVEL TIPS

TEXTURE FILTERING

‒Point sampling is full-rate on all formats
‒Trilinear filtering costs up to 2x bilinear filtering cost
‒Anisotropic (N taps) costs <= (N x bilinear)
‒Avoid cache thrashing!
‒Use MIPmapping
‒Use Gather() where applicable
‒Exploit neighbouring pixel shader threadCU locality:
‒ Sampling from texels resident on the same CU can have a lower cost
‒Exploit this explicitly by using Compute Shaders

GCN LOW-LEVEL TIPS

COLOR OUTPUT

 PS Output: Each additional color output increases export cost
 Export cost can be more costly than PS execution!
‒ Each (fast) export is equivalent to 64 ALU ops on R9 290X
‒ If shader is export-bound then use “free” ALU for packing instead

 Watch out for export-bound cases
‒ E.g. G-Buffer parameter writes
‒ MINIMIZE SHADER INPUTS AND OUTPUTS!
‒ Pack, pack, pack, pack!

 Costs of outputting and blending various formats
‒discard/clip allow the shader hardware to skip the rest of the work
* Miss “PACK” Man kindly reminds you to “Pack pack pack!” 


GCN MEDIA PROCESSING

MEDIA INSTRUCTIONS

 SAD = Sum of Absolute Differences
Closest match

 Critical to video & image processing algorithms
‒ Motion detection
‒ Gesture recognition
‒ Video & image search
‒ Stereo depth extraction
‒ Computer vision

 SAD (4x1) and QSAD (4 4x1) instructions
‒ New QSAD combines SAD with alignment ops for higher
performance and reduced power draw
‒ Evaluate up to 256 pixels per CU per clock cycle!

 Maskable MQSAD instruction
‒ Allows background pixels to be ignored
‒ Accelerated isolation of moving objects

 New: 32-bit destination accumulator register
‒ SAD/QSAD/MQSAD U32/U16 accumulators with saturation

3
2

5

5

4

4

0

7

1

7

5

9

4

1

3

5

5

5

9

3

1

4

4 0
SAD = 7
SAD = 22
2 22 9
5

1

6

7

2 9
SAD = 6
1 59 3
5

2

8

1

1

7

6

8

3

0

4

3

2

9

9

3

0

7

1

1

7 4
SAD = 5
5 58 4
0

8

0

0

2 2
SAD = 2
8 45 3
2

9

9

7

1

6

2

4

0

AMD Radeon R9 290x can evaluate

11.26 Terapixels/sec *
* Peak theoretical performance for 8-bit integer pixels

3


VIDEO CODEC ENGINE

 Video Codec Engine (VCE)
‒ Hardware H.264 Compression and Decompression
‒ Ultra-low-power, fully fixed-function mode
‒ Capable of 1080p @ 60 frames / second

‒ Programmable for Ultra High Quality and or Speed
‒ Entropy encoding block fully accessible to software
‒ AMD Accelerated Parallel Programming SDK
‒ OpenCL ™

‒ Create hybrid faster-than-real-time encoders!
‒ Custom motion estimation
‒ Inverse DCT and motion compensation
‒ Combine with hardware entropy encoding!

AMD Radeon R9 290x can compress

Realtime+ 1080p H.264



AMD TRUEAUDIO

 Multiple integrated Tensilica HiFi EP Audio DSP cores
 Dedicated Audio DSP solution for game sound effects
 Guaranteed real-time performance and service

 Designed for game audio artists and engineers to bring take their artistic vision
beyond sound production into the realm of sound processing
 Intended to transform game audio as programmable shaders transformed graphics


AMD TRUEAUDIO

SPATIALIZATION / 3D AUDIO

REVERBS

AUDIO/VOICE STREAMS

MASTERING LIMITERS


HEAR MORE REALTIME
VOICES AND CHANNELS
IN A GAME


ENABLES AMAZING
DIRECTIONAL AUDIO
OVER ANY OUTPUT


CONCLUSIONS

GCN ARCHITECTURE TAKEAWAYS

‒GCN offers increased flexibility & efficiency, with reduced complexity!
‒Non-VLIW Architecture improves efficiency while reducing programmer burden
‒Constants/resources are just address + offset now in the hardware
‒UAV/SRV/SUV read/write any format – like CPU C++ reinterpret_cast & static_cast
‒Has virtual memory & GPU flat memory, moving towards CPU + GPU flat memory

‒GCN is designed with a forward-looking focus on Compute
‒Scalar unit for complex dynamic control flow + branch & message unit
‒64KB LDS/CU, 64KB GDS, atomics at every stage, coherent cache hierarchy
‒8 Asynchronous Compute Engines (ACE) for multitasking compute
‒ 8 ACE x 8 HQD (per ACE) = 64 HQD (HQD = Hardware Queue Descriptors)


CONCLUSIONS

GCN ARCHITECTURE TAKEAWAYS

CONTINUED …
‒GCN generally simplifies your life as a programmer
‒Don’t: fret too much about instruction grouping, or vectorization
‒Do: Think about GPR utilization & LDS usage (impacts max # of wavefronts)
‒Do: Think about thread/CU locality when you structure your algorithm
‒Do: Exploit the low-latency 4-CU Shared 16KB Scalar L1 Data Cache (K$)
‒Do: Pack shader inputs and outputs – aim to be IO/bandwidth thin!
‒ Pack PS exports into non-blended 64-bit format for optimal ROP utilization
‒ But, remember that 32-bit formats still use less bandwidth
‒ Keep geometry (HS, VS, GS, DS) stage IO under 4 float4 (ideally less! )

‒Unlimited number of addressable constants/resources
‒N constants aren’t free anymore – each consume resources, use sparingly!

‒Compute is the future – exploit its power for GPGPU work & graphics!

THANK YOU
问题？

QUESTIONS? 
質問がありますか？
^_^
Layla Mah
layla.mah@amd.com

BONUS SLIDES


THE BONUS SLIDES


TILED RESOURCES & PARTIALLY RESIDENT TEXTURES

MegaTexture in id Tech5

Tiled Resources & Partially Resident Textures – INTRODUCTION
Enables application to manage more texture data than can physically fit in a fixed footprint
‒ Known as: Tiled Resources (Direct3D 11.2) and Partially Resident Textures (OpenGL 4.2)
‒ A.k.a. “Virtual texturing“ and “Sparse texturing”

The principle behind PRT is that not all texture contents are likely to be needed at any given time
‒ Current render view may only require selected portions of a texture to be resident in memory
‒ Or, only selected MIPMap levels…

PRT textures only have a portion of their data mapped into GPU-accessible memory at a time
‒ Texture data can be streamed in on-demand
‒ Texture sizes up-to 32TB (16k x 16k x 8k x 128-bit)

 OpenGL extension – GL_AMD_sparse_texture

Tiled Resources & Partially Resident Textures – TEXTURE TILES
The PRT texture is chunked into 64KB tiles
‒ Fixed memory size
‒ Not dependant on texture type or format

Highlighted areas represent
texture data that needs highest
resolution

Chunked texture

Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008

Texture tiles needing to be
resident in GPU memory

Tiled Resources & Partially Resident Textures – TRANSLATION TABLE
The GPU virtual memory page table translates 64KB tiles into a resident texture tile pool
Texture Map

Page Table

Texture Tile Pool (Video
Memory)

(linear storage)

64KB tile
Unmapped page entry
Mapped page entry


Tiled Resources & Partially Resident Textures – MIP MAPS
Not all tiles from the texture map are actually resident in video memory
PRT hardware page table stores virtual  physical mappings
Texture Map

Page Table

MIP Levels


Texture Tile Pool (Video
Memory)

64KB tile
Unmapped page entry
Mapped page entry

Tiled Resources & Partially Resident Textures – TILE MANAGEMENT
The Application is responsible for uploading/releasing new PRT tiles!

A common scenario is to upload lower MIPMaps to texture tile pool
‒ This allows a full representation of the PRT contents to be resident in memory (albeit at
lower resolution)
‒ e.g. MIP LOD 6 and above for 16kx16k 32-bits texture is about 650KB (256x256 resolution)

Texture tiles corresponding to higher resolution areas are uploaded by the application
as needed
‒ e.g. As camera gets closer to a PRT-textured polygon the requirement for texels:screen
pixels ratio increases, thus higher LOD tiles need uploading


Tiled Resources & Partially Resident Textures – “FAILED” FETCH
How does the application know which texture tiles to upload?
Answer: PRT-specific texture fetch instructions in pixel shader
‒ Return a “Failed” texel fetch condition when sampling a PRT pixel whose tile is currently not
in the pool
‒ OpenGL example: int

glSparseTexture( gsampler2D sampler,
vec2
P,
inout gvec4 texel );

This information is then stored in render target or UAV
‒ Texel fetch failed for a given (x, y) tile location

...and then copied to the CPU so that application can upload required tiles
App chooses what to render until missing data gets uploaded


Tiled Resources & Partially Resident Textures – “LOD WARNING”
PRT fetch condition code can also indicate an “LOD Warning”
The minimum LOD warning is specified by the application on a per texture basis
‒ OpenGL example:

glTexParameteri(

If a fetched pixel’s LOD is

<target>,
MIN_WARNING_LOD_AMD,
<LOD warning value>
);

< the specified LOD warning value then the condition code is returned

This functionality is typically used to try to predict when higher-resolution MIP levels will be needed
‒ E.g. Camera getting closer to PRT-mapped geometry


Tiled Resources & Partially Resident Textures – EXAMPLE USAGE
1. App allocates PRT (e.g. 16kx16k DXT1) using PRT API
2. App uploads MIP levels using API calls
3. Shader fetches PRT data at specified texcoords
Two possibilities:
3.a. Texel data belongs to a resident (64KB) tile
- Valid color returned, no error code
3.b. Texel data points to non-resident tile or specified LOD
- Error/LOD Warning code returned
- Shader writes tile location and error code to RT or UAV

4. App reads RT or UAV and upload/release new tiles as needed


Tiled Resources & Partially Resident Textures –
TYPES, FORMATS & DIMENSIONS

 All texture types and formats supported
‒1D, 2D, cube, arrays and 3D volume textures

‒All common texture formats
‒ Including compressed formats
‒Maximum dimensions:
‒16k x 16k x 8k x 128-bit textures

Hardware PRT > Software Implementation
PRT
Ease of implementation
• Complexity hidden behind HW & API

Full filtering support

SW Implementation

• Includes anisotropic filtering

Full-speed filtering
• SW solution requires “manual” filtering
• Software anisotropic is very costly

Don’t go overboard with PRT allocation!
• Page table entry size is 4 DWORDs
• Have to be resident in video memory

问题？
QUESTIONS? 
質問がありますか？
^_^
Layla Mah
layla.mah@amd.com

@MissQuickstep

Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other
jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective
owners.
©2013 Advanced Micro Devices, Inc. All rights reserved.

THE BONUS SLIDES
SHADER CODE


SHADER CODE EXAMPLE #2
float fn0(float a,float b)
{
float c = 0.0;
float d = 0.0;
for(int i=0;i<100;i++)
{
if(c>113.0)
break;
c = c * a + b;
d = d + 1.0;
}
return(d);
}

// Registers r0 contains “a”, r1 contains “b”, r2 contains “c”
// and r3 contains “d”
// Value is returned in r3
v_mov_b32
v_mov_b32
s_mov_b64
s_mov_b32
label0:
s_cmp_lt_s32
s_cbranch_sccz
v_cmp_le_f32
s_and_b64
s_branch_execz
v_mul_f32
v_add_f32
v_add_f32
s_add_s32
s_branch
label1:
s_mov_b64


r2, #0.0
r3, #0.0
exec, s0
s2, #0

//
//
//
//

float c = 0.0
float d = 0.0
Save execution mask
i=0

s2, #100
label1
r2, #113.0
exec, vcc, exec
label1
r2, r2, r0
r2, r2, r1
r3, r3, #1.0
s2, s2, #1
label0

//
//
//
//
//
//
//
//
//
//

i<100
Exit loop if not true
c > 113.0
Update exec mask on fail
Exit if all lanes pass
c = c*a
c = c+b
d = d+1.0
i++
Jump to start of loop

exec, s0

// Restore exec mask

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro
Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC).
Other names are for informational purposes only and may be trademarks of their respective owners.

GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah

Similar to GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah (20)

More from AMD Developer Central

More from AMD Developer Central (20)

Recently uploaded

Recently uploaded (20)

GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah

Editor's Notes