THE AMD GCN ARCHITECTURE
A CRASH COURSE
@MissQuickstep
LAYLA MAH – LAYLA.MAH@AMD.COM
DEVELOPER TECHNOLOGY ENGINEER
AGENDA
 Part 1: A Brief History of GPU Evolution
 Part 2: Introduction to Graphics Core Next (GCN)
 Part 3: Anatomy of a GCN Compute Unit (CU)
 Part 4: GCN Shader: Arbitration, Examples & Tips
 Part 5: GCN Memory Hierarchy
 Part 6: GCN Compute Architecture (ACE)
 Part 7: GCN Fixed Function Units
(CP, GeometryEngine, Rasterizer, RBE, …)

 Part 8: Main Takeaways & Conclusion
 Bonus Slides: Tiled Resources, Partially Resident Textures (PRT)
2 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GPU EVOLUTION
1st Era: Fixed Function (3D Geometry Transformation)
2nd Era: Simple Shaders
3rd Era: Graphics Parallel Core
[Slides 3-4: era-overview diagram only: VLIW5/VLIW4 Stream Processing Units with Branch Units, FMAD + Special Functions, and General Purpose Registers]
GPU EVOLUTION
1st Era: Fixed Function (Prior to 2002)
 Graphics-specific hardware
‒ Texture mapping/filtering
‒ Transform & Lighting (T&L) Engines
‒ Geometry processing
‒ Rasterization
‒ Fixed function lighting equations
 Dot product and scalar multiply-add
‒ Multi-texturing
‒ Dedicated texture and pixel caches
‒ Sufficient for basic graphics tasks
‒ No general purpose compute capability
GPU EVOLUTION
2nd Era: Simple Shaders
[Slide 6: block diagram only: Memory Interface, Setup Engine, 8 Vertex Pipes, and a Pixel Shader Core with 16 Pixel Pipes]
GPU EVOLUTION
2nd Era: Simple Shaders (2002-2006)
The Rise of Shaders
 Graphics Programmability – Direct3D 8/9, OpenGL 2.0
‒ Specialized shader units for vertex & pixel processing (8 Vertex Pipes, 16 Pixel Pipes)
‒ Added dedicated caches
‒ Floating point processing, IEEE not required
‒ Different precision per IHV
‒ ATI 24-bit full-speed
‒ NV 16-bit full-speed, NV 32-bit half-speed
 Shader Models 1.0 - 2.0
‒ VS and PS are distinct
‒ Minimal Instruction Sets
‒ Limited Instruction Slots
‒ Limited Shader Lengths
‒ No Dynamic Flow Control
‒ No Looping Constructs
‒ No Vertex Texture Fetch
‒ No Bitwise Operators
‒ No Native Integer ALU
‒ […]
GPU EVOLUTION
3rd Era: Graphics Parallel Core
The Rise of The Unified Shader (VLIW-5)
 5-Element Very-Long-Instruction-Word (XYZWT)
‒ Ideal for 4-element Vector and 4x4 Matrix Operations
‒ Vector/Vector math in a single instruction
‒ Plus One Transcendental-Unit function per Instruction
‒ Began with XENOS and utilized from R600 until “Cayman”
‒ Flexible and optimized for Graphics workloads
 Single Precision 32-bit IEEE-Compliant Floating Point ALUs
 More advanced caching
‒ Instruction, constant, multi-level texture/data, & later: LDS/GDS
 More flexible: Unified ALU, Branch Unit, Dynamic Flow Control, Vertex Texture, Geometry Shader, Tessellation Engines, etc.
GPU EVOLUTION
3rd Era: Graphics Parallel Core
Optimized For Die Area Efficiency (VLIW-4)
 4-Element Very-Long-Instruction-Word (XYZW)
‒ Profiling showed average VLIW utilization was < 3.4/5
‒ Removed dedicated T-Unit – optimized die area usage
‒ Each ALU has a smaller LUT; results are combined using 3-term Lagrange polynomial interpolation across multiple ALUs
‒ Still ideal for 4-element Vector and 4x4 Matrix Operations
‒ Fewer ALU bubbles in transcendental-light code, better utilization
‒ Simplified programming and optimization relative to VLIW-5
 Better optimized for a combination of Graphics & Compute
‒ Graphics is still the primary focus, but compute is gaining attention
 Improved support for DirectCompute™ and OpenCL™
‒ Multiple dispatch processors & separate command queues
GPU EVOLUTION
VLIW4 SIMD vs. GCN Quad SIMD-16

VLIW4 SIMD (Lanes 0-15)                            | GCN Quad SIMD-16 (SIMD 0-3, Lanes 0-15 each)
 64 Single Precision multiply-adds (per-clock)    |  64 Single Precision multiply-adds (per-clock)
 16 SIMDs × ( 1 VLIW inst × 4 ALU ops )           |  4 SIMDs × ( 1 ALU op × 16 threads )
 1 VLIW inst containing 4 ALU ops (per-clock)     |  4 ALU ops (from different wavefronts) / clock
 Needs 4 parallel ALU ops to fill each VLIW inst  |  Needs 4+ wavefronts to keep SIMD lanes full
 Compiler manages register port conflicts         |  No register port conflicts
 Specialized, complex compiler scheduling         |  Standard compiler scheduling & optimizations
 Difficult assembly creation, analysis, and debug |  Simplified assembly creation, analysis, & debug
 Complicated tool chain support                   |  Simplified tool chain development and support
 Careful optimization req. for peak performance   |  Stable and predictable performance
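The comparison above can be checked with a little arithmetic: both designs peak at 64 single-precision ALU ops per clock, but they reach that number differently. A minimal sketch (illustrative only, not vendor code):

```python
# Both designs peak at 64 single-precision ALU ops per clock,
# reached by different factorizations of the hardware.

def vliw4_ops_per_clock(lanes=16, ops_per_vliw_inst=4):
    # Each of 16 VLIW4 lanes issues one 4-slot VLIW instruction per clock;
    # all 4 slots are busy only if the compiler finds 4 independent ops.
    return lanes * ops_per_vliw_inst

def gcn_ops_per_clock(simds=4, lanes_per_simd=16):
    # Each of the 4 SIMD-16s issues one scalar ALU op across 16 lanes,
    # each op drawn from a different wavefront.
    return simds * lanes_per_simd

print(vliw4_ops_per_clock())  # 64
print(gcn_ops_per_clock())    # 64
```

The practical difference is where the parallelism must come from: VLIW4 needs instruction-level parallelism found by the compiler, GCN needs enough wavefronts in flight.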
AMD

GRAPHICS
CORE
NEXT

MANTLE
 New low level programming interface for PCs
 Designed in collaboration with top game developers
 Lightweight driver that allows direct access to GPU hardware
 Compatible with DirectX® HLSL for simplified porting
 Works with all Graphics Core Next GPUs
(Software stack: Graphics Applications → Mantle API → Mantle Driver → GCN)
Related APU13 sessions:
 GS-4112 – Mantle: Empowering 3D Graphics Innovation
 Keynote – Johan Andersson, Technical Director, EA
 GS-4145 – Oxide on Mantle Adoption (Wed 5:00-5:45)
AMD GRAPHICS CORE NEXT ARCHITECTURE

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING

Faster performance
Higher efficiency

New graphics features
New compute features

AMD GRAPHICS CORE NEXT ARCHITECTURE

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
 Cutting-edge graphics performance and features
 High compute density with multi-tasking

 Built for power efficiency
 Optimized for heterogeneous computing
 Enabling the Heterogeneous System Architecture (HSA)

 Amazing scalability and flexibility

AMD GRAPHICS CORE NEXT ARCHITECTURE

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
 Unlimited Resources & Samplers
 All UAV formats can be read/write

 Simpler Assembly Language
 Simpler Shader Code
 Ability to support C/C++ (like)

 Architectural support for traps, exceptions & debugging
 Ability to share virtual x86-64 address space with CPU cores
AMD GRAPHICS CORE NEXT ARCHITECTURE
AMD TECHNOLOGY POWERS NEXT-GEN CONSOLES
 New next-gen game consoles raise the bar for graphics performance
 PERFORMANCE: TFLOPS-class compute power
 MEMORY: 16x more memory*
* Based on PlayStation 3 512MB vs. PlayStation 4 8192MB GDDR5.
GCN COMPUTE UNIT
A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
 CU = Basic Building Block of GPU Computational Power
 CU components: Scheduler, Branch & Message Unit, Scalar Unit, Vector Units (4x SIMD-16), Texture Filter Units (4), Texture Fetch Load/Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (8KB), L1 Cache (16KB)
 New Instruction Set Architecture
‒ Non-VLIW
‒ Vector unit + scalar co-processor
‒ Distributed programmable scheduler
 Each CU can execute instructions from multiple kernels at once
‒ Multi-tasking
 Increased instructions per clock per mm2
‒ High utilization
‒ High throughput
GCN COMPUTE UNIT
 Scheduler (up-to 2560 threads)
‒ Separate decode/issue for: VALU, SALU, SMEM, VMEM, LDS, GDS/EXPORT, branch, and special instructions (NOPs, barriers, etc.)
‒ 16 Hardware Barriers
 Branch & Message Unit
‒ Executes branch instructions (as dispatched by the Scalar Unit)
 Scalar Unit
‒ Fully programmable; used for flow control, pointer arithmetic, etc.
‒ Has its own GPR pool, scalar data cache, etc.
‒ 8KB Scalar General Purpose Registers (SGPR)
‒ 16KB 4-CU Shared R/O L1 Scalar Data Cache
 4x Vector Units (16-lane SIMD)
‒ CU Total Throughput: 64 Single-Precision (SP) ops/clock
‒ 1 SP (Single-Precision) operation per 4 clocks
‒ 1 DP (Double-Precision) ADD in 8 clocks
‒ 1 DP MUL/FMA/Transcendental per 16 clocks*
 4x64KB Vector Registers (VGPR)
 64KB Local Data Share (LDS)
‒ 32 banks, with conflict resolution
‒ 2x larger than the D3D11 TGSM limit (32KB/thread group)
‒ Bandwidth amplification
‒ Shared by all threads of a work group
 16KB Read/Write L1 Vector Data Cache
‒ Attached to the texture units (acts as texture cache)
 16 Texture Fetch Load/Store Units, 4 Texture Filter Units
GCN COMPUTE UNIT
SIMD SPECIFICS
 Each Compute Unit (CU) contains 4 SIMDs; each SIMD has:
‒ A 16-lane IEEE-754 vector ALU (VALU)
‒ 64KB of vector register file (VGPR)
‒ Its own 40-bit (48-bit on HSA APUs) Program Counter (PC)
‒ Instruction buffer for 10 wavefronts*
‒ *A wavefront is a group of 64 threads: the size of one logical VGPR
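The register-file sizes on this slide are self-consistent, which a quick back-of-envelope sketch can confirm (the 256-registers-per-SIMD figure is derived here from the stated 64KB, not quoted from the deck):

```python
# Sanity-check of the SIMD register-file numbers.

LANES_PER_SIMD = 64    # one logical VGPR spans a 64-thread wavefront
BYTES_PER_REG = 4      # 32-bit registers
VGPRS_PER_SIMD = 256   # derived: 64KB / (64 lanes * 4 bytes)

vgpr_file_bytes = VGPRS_PER_SIMD * LANES_PER_SIMD * BYTES_PER_REG
assert vgpr_file_bytes == 64 * 1024   # 64KB per SIMD, as stated

cu_vgpr_bytes = 4 * vgpr_file_bytes   # 4 SIMDs per CU -> "4x 64KB"
print(cu_vgpr_bytes // 1024, "KB")    # 256 KB of VGPRs per CU
```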
GCN COMPUTE UNIT
SCALAR UNIT SPECIFICS
GCN Scalar Unit
 Fully Programmable Scalar Unit replaces fixed-function branch logic
 Operations such as JMP [GPR] are now supported
‒ Opens the door to e.g. virtual function calls
 Has its own GPR pool and can execute normal ALU code
‒ 64-bit bitwise ops to mask thread execution
‒ 32-bit bitwise and integer arithmetic operations at full-speed
 Potential to offload uniform code (Vector ALU → Scalar ALU)
 A GCN CU can dispatch 1 scalar op/clock
GCN COMPUTE UNIT
SCALAR UNIT SPECIFICS CONTINUED
GCN Scalar Unit
 Natively a 64-bit integer ALU
 Independent arbitration and instruction decode
 One ALU, memory, or control flow op per cycle
 512 Scalar GPRs per SIMD, shared between waves
‒ A { SGPRn+1, SGPRn } pair provides a 64-bit register
 4-CU Shared Read-Only Scalar Data Cache (backed by the R/W L2): 16KB, 64B lines
‒ 4-way assoc, LRU replacement policy
‒ Peak bandwidth per CU is 16 bytes/cycle
GCN COMPUTE UNIT
BRANCH & MESSAGE UNIT
 Independent scalar assist unit to handle special classes of instructions concurrently
‒ Branch
‒ Unconditional Branch (s_branch)
‒ Conditional Branch (s_cbranch_<cond>)
‒ Condition ∈ { SCC == 0, SCC == 1, EXEC == 0, EXEC != 0, VCC == 0, VCC != 0 }
‒ 16-bit signed immediate dword offset from PC provided
‒ Messages
‒ s_sendmsg → CPU interrupt with optional halt (with shader supplied code and source)
‒ debug message (perf trace data, halt, etc.)
‒ special graphics synchronization messages
GCN COMPUTE UNIT
MEMORY SPECIFICS
 Each CU has its own dedicated L1 cache and LDS memory
 64KB Local Data Share (LDS)
‒ 32 banks, with conflict resolution
‒ 16 work group barriers supported per CU
 16KB R/W L1 Vector Data Cache
‒ Vector L1 Read/Write data cache is shared with the TMU as texture cache
 Scalar Unit
‒ 16KB 4-CU Shared R/O Scalar L1
‒ Scalar L1 Read-Only data cache is shared between 4 neighbor CUs
 Both global and shared memory atomics are supported
 A GCN GPU with 44 CUs, such as the AMD Radeon™ R9 290X, can be working on up-to 112,640 work items at a time!
GCN COMPUTE UNIT
SCHEDULER SPECIFICS
 Each CU has its own dedicated Scheduler unit
 Each CU can have 40 waves in-flight
‒ Each potentially from a different kernel
 Scheduler limits:
‒ Supports up-to 2560 threads per CU (64 threads x 10 waves x 4 SIMDs)
‒ 10 wavefronts per SIMD, 40 wavefronts per CU
‒ Limited by available GPR count
‒ Limited by available LDS memory
‒ 16 hardware barriers per CU
 All threads within a workgroup are guaranteed to reside on the same CU simultaneously
‒ A set of synchronization primitives and shared memory allow data to be passed between threads in a workgroup
 Optimized for throughput – latency is hidden by overlapping execution of wavefronts
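The scheduler limits above reduce to simple arithmetic; a short sketch reproducing the slide's 2560-thread and 112,640-work-item figures:

```python
# The per-CU occupancy limit, restated as arithmetic.

THREADS_PER_WAVE = 64
WAVES_PER_SIMD = 10
SIMDS_PER_CU = 4

threads_per_cu = THREADS_PER_WAVE * WAVES_PER_SIMD * SIMDS_PER_CU
assert threads_per_cu == 2560

# Scale out to a 44-CU part such as the R9 290X:
print(44 * threads_per_cu)  # 112640 work items in flight
```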
GCN COMPUTE UNIT
SCHEDULER SPECIFICS: ARBITRATION & DECODE
 CU is guaranteed to issue instructions for a wave sequentially
‒ Predication & control flow enable any single work-item a unique execution path
 For a CU, every clock, waves on 1 SIMD are considered for issue
‒ Round-robin scheduling algorithm
 Maximum 5 instructions per cycle
‒ Not including “internal” instructions
 Instruction types:
‒ 1 Vector Arithmetic Logic Unit (VALU)
‒ 1 Scalar ALU or Scalar Memory (SALU | SMEM)
‒ 1 Vector Memory (Read/Write/Atomic) (VMEM)
‒ 1 Branch/Message (e.g. s_branch, s_cbranch)
‒ 1 Local Data Share (LDS)
‒ 1 Export or Global Data Share (GDS)
‒ 1 Special/Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio) – [no functional unit]
 At most, 1 instruction from each category may be issued
 At most, 1 instruction per wave may be issued
 Theoretical maximum of 5 instructions per cycle per CU
GCN COMPUTE UNIT
VECTOR & SCALAR ARBITRATION: HARDWARE VIEW
GCN Hardware View (4x SIMD-16 + Scalar Unit)
 A GCN Compute Unit can retire 256 SP Vector ALU ops in 4 clocks
 Each lane can dispatch 1 SP ALU operation per clock
 Each SP ALU operation takes 4 clocks to complete
 The scheduler dispatches from a different wavefront each cycle
GCN COMPUTE UNIT
VECTOR & SCALAR ARBITRATION: PROGRAMMER VIEW
GCN Programmer View (wavefronts 0-9 spanning lanes 0-63)
 A GCN Compute Unit can perform 64 SP Vector ALU ops / clock
 Each lane can dispatch 1 SP ALU operation per clock
 Each SP ALU operation still takes 4 clocks to complete
 But you can PRETEND your code runs 1 op on 64 threads at once
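The two views reconcile with a few lines of arithmetic: with four wavefronts rotating through each SIMD, the 4-clock latency is fully overlapped, so a 64-wide operation completes every clock in steady state. An illustrative sketch:

```python
# Why the 4-clock VALU latency is invisible to the programmer: the scheduler
# issues a different wavefront to each SIMD every clock, so in steady state
# one 64-wide instruction completes per clock.

SIMDS, LANES, CLOCKS_PER_OP = 4, 16, 4

ops_per_4_clocks = SIMDS * LANES * CLOCKS_PER_OP  # 256 ops retired in 4 clocks
print(ops_per_4_clocks // CLOCKS_PER_OP)          # 64 ops/clock sustained
```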
GCN VECTOR UNITS

ALU CHARACTERISTICS

 FMA (Fused Multiply Add), IEEE 754-2008 precise with all round modes, proper handling of
NaN/Inf/Zero and full de-normal support in hardware for SP and DP
 MULADD single cycle issue instruction without truncation, enabling a MULieee followed by
ADDieee to be combined with round and normalization after both multiplication and subsequent
addition
 VCMP A full set of operations designed to fully implement all the IEEE 754-2008 comparison
predicates
 IEEE Rounding Modes (Round toward +Infinity, Round toward –Infinity, Round to nearest
even, Round toward zero) supported under program control anywhere in the shader. SP and DP
modes are controlled separately.
 De-normal Programmable Mode control for SP and DP independently. Separate control for input
flush to zero and underflow flush to zero.
GCN VECTOR UNITS

ALU CHARACTERISTICS

CONTINUED …

 Divide Assist Ops IEEE 0.5 ULP Division accomplished with macro (SP/DP ~15/41 Instruction
Slots, respectively)
 FP Conversion Ops between 16-bit, 32-bit, and 64-bit floats with full IEEE-754 precision and
rounding
 Exceptions Support in hardware for floating point numbers with software recording and reporting
mechanism. Inexact, underflow, overflow, division by zero, de-normal, invalid operation, and
integer divide by zero operation
 64-bit Transcendental Approximation Hardware based double precision approximation for
reciprocal, reciprocal square root and square root
 24-bit Integer MUL/MULADD/LOGICAL/SPECIAL @ full SP rates
‒ Heavily utilized for integer thread group address calculation
‒ 32-bit integer MUL/MULADD @ DP MUL/FMA rate
GCN SHADER AUTHORING TIPS
 GCN has greatly improved branch performance, and it continues to improve
‒ Don’t be afraid to use it! But, remember: use it wisely – improved != free 
‒ It’s at its best for highly coherent workloads (where most threads take the same path)
 However, the new architecture is more susceptible to register pressure
‒ Using too many registers within a shader can reduce the maximum waves per SIMD! 
‒ NOTE: a wavefront can allocate 104 user scalar registers, as several scalar registers are reserved for architectural state

Max Waves/SIMD | 10    | 9  | 8  | 7  | 6  | 5   | 4     | 3  | 2      | 1
GCN SGPR Count | <= 48 | 56 | 64 | 72 | 84 | 100 | > 100 | –  | –      | –
VGPR Count     | <= 24 | 28 | 32 | 36 | 40 | 48  | 64    | 84 | <= 128 | > 128

‒ Take caution with respect to the following:
‒ Excessive nested branching/looping
‒ Loop unrolling
‒ Variable declarations (especially arrays)
‒ Excessive function calls requiring storing of results
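The occupancy table can be encoded as a small lookup helper. A sketch using the values as printed on the slide (illustrative, not a compiler-accurate allocator):

```python
# Max waves/SIMD as a function of register usage, per the occupancy table.

def max_waves_per_simd(vgprs, sgprs):
    vgpr_limits = [(24, 10), (28, 9), (32, 8), (36, 7), (40, 6),
                   (48, 5), (64, 4), (84, 3), (128, 2)]
    sgpr_limits = [(48, 10), (56, 9), (64, 8), (72, 7), (84, 6), (100, 5)]

    def lookup(count, limits, floor):
        for limit, waves in limits:
            if count <= limit:
                return waves
        return floor

    # Whichever resource is scarcer caps the wave count.
    return min(lookup(vgprs, vgpr_limits, 1),   # > 128 VGPRs -> 1 wave
               lookup(sgprs, sgpr_limits, 4))   # > 100 SGPRs -> 4 waves

print(max_waves_per_simd(vgprs=40, sgprs=60))   # 6 (VGPR-limited)
print(max_waves_per_simd(vgprs=24, sgprs=90))   # 5 (SGPR-limited)
```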
GCN SHADER CODE EXAMPLE
// Register r0 contains “a”, r1 contains “b”
// Value is returned in r2
v_cmp_gt_f32     r0, r1           // a > b, establish VCC
s_mov_b64        s0, exec         // Save current exec mask
s_and_b64        exec, vcc, exec  // Do “if”
s_cbranch_vccz   label0           // Branch if all lanes fail
v_sub_f32        r2, r0, r1       // result = a - b
v_mul_f32        r2, r2, r0       // result = result * a
label0:
s_andn2_b64      exec, s0, exec   // Do “else” (s0 & !exec)
s_cbranch_execz  label1           // Branch if all lanes fail
v_sub_f32        r2, r1, r0       // result = b - a
v_mul_f32        r2, r2, r1       // result = result * b
label1:
s_mov_b64        exec, s0         // Restore exec mask

 An alternative to s_cbranch is to use VSKIP to transform VALU instructions into NOPs
 s_setvskip – enables or disables VSKIP mode. Requires 1 waitstate after executing.
 VSKIP does NOT skip VMEM instructions (Do: branch over superfluous VMEM inst.)
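The exec-mask dance in that sequence can be emulated at the lane level. A sketch using Python integers as 64-bit masks (names mirror the assembly registers; illustrative only):

```python
# Lane-level emulation of the if/else sequence above.
import random

N = 64                                              # one wavefront
a = [random.uniform(-1, 1) for _ in range(N)]       # r0
b = [random.uniform(-1, 1) for _ in range(N)]       # r1
r2 = [0.0] * N

exec_mask = (1 << N) - 1                            # all lanes active
vcc = sum(1 << i for i in range(N) if a[i] > b[i])  # v_cmp_gt_f32

s0 = exec_mask                                      # s_mov_b64 s0, exec
exec_mask &= vcc                                    # s_and_b64 ("if")
for i in range(N):
    if exec_mask >> i & 1:
        r2[i] = (a[i] - b[i]) * a[i]                # then-branch VALU ops

exec_mask = s0 & ~exec_mask                         # s_andn2_b64 ("else")
for i in range(N):
    if exec_mask >> i & 1:
        r2[i] = (b[i] - a[i]) * b[i]                # else-branch VALU ops

exec_mask = s0                                      # restore exec

assert all(r2[i] == ((a[i] - b[i]) * a[i] if a[i] > b[i]
                     else (b[i] - a[i]) * b[i]) for i in range(N))
```

Note that both sides of the branch execute unless *all* lanes agree, which is exactly why coherent control flow is cheap and divergent control flow is not.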
GCN MEMORY
CACHE HIERARCHY
 32KB instruction cache (I$) + 16KB scalar data cache (K$) shared per ~4 CUs, with L2 backing
 Each CU has its own registers and local data share
 Global Data Share (64KB) facilitates synchronization between CUs
 L1 read/write caches: 64 Bytes per clock of L1 bandwidth per CU
 L2 read/write cache partitions: 64 Bytes per clock of L2 bandwidth per partition
 Each L2 partition is paired with a 64-bit dual channel memory controller
GCN MEMORY
VECTOR MEMORY INSTRUCTIONS
A pointer is a pointer on GCN! Vector memory instructions support variable granularity for addresses and data, ranging from 32-bit data to 128-bit pixel quads.
 MUBUF – read from or write/atomic to an un-typed buffer/address
‒ Data type/size is specified by the instruction operation
‒ MUBUF is like C++ reinterpret_cast
 MTBUF – read from or write to a typed buffer/address
‒ Data type is specified in the resource constant
‒ MTBUF is like C++ static_cast
 MIMG – read/write/atomic operations on elements from an image surface
‒ Image objects (1-4 dimensional addresses and 1-4 dwords of homogenous data)
‒ Image objects use resource and sampler constants for access and filtering
‒ Utilize the TMU for filtering via MIMG
GCN MEMORY

DEVICE FLAT MEMORY INSTRUCTIONS: A GCN POINTER IS A POINTER
 Flat Address Space (“flat”) instructions are new as of Sea Islands (CI) and
allow read/write/atomic access to a generic memory address pointer which
can resolve to any of the following physical memories:
‒ Global Memory
‒ Scratch (“private”)
‒ LDS (“shared”)
‒ Invalid - MEM_VIOL TrapStatus
 Device Flat (Generic) 64b/32b Addressing Support
‒ FLAT instructions support both 64 and 32-bit addressing. The address size is set
via a mode register (“PTR32”) and a local copy of the value is stored per wave.
‒ The addresses for the aperture check differ in 32 and 64-bit mode
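Conceptually, resolving a FLAT address is an aperture check. The sketch below illustrates the idea only: the base/size constants are invented for the example, since real aperture ranges are configured by the hardware and OS, not fixed values.

```python
# Illustrative FLAT-address classification against aperture ranges.
# APERTURES below is made up for the example; real values are system-set.

APERTURES = {
    "lds":     (0x1000_0000, 0x0001_0000),  # hypothetical base, size
    "scratch": (0x1001_0000, 0x0001_0000),  # hypothetical base, size
}

def classify_flat_address(addr):
    for space, (base, size) in APERTURES.items():
        if base <= addr < base + size:
            return space
    return "global"   # anything outside the apertures goes to global memory

print(classify_flat_address(0x1000_0040))   # lds
print(classify_flat_address(0x8000_0000))   # global
```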
GCN MEMORY

EXPORT INSTRUCTION & GDS

 Exports move data from 1-4 VGPRs to the fixed-function Graphics Pipeline
‒ E.g.: Color (MRT0-7), Depth, Position, and Parameter → Tessellator, Rasterizer, or RBE
 Global Shared Memory Ops (utilize GDS)
 The GDS is identical to the LDS, except that it is shared by all CUs, so it acts as an explicit global synchronization point between all wavefronts
 The atomic units in the GDS also support ordered count operations
GCN MEMORY

LOCAL DATA SHARE

 GCN Local Data Share (LDS) is a 64KB, 32-bank (or 16) shared memory
 Instruction issue fully decoupled from ALU instructions
 Direct mode
‒ Vector instruction operand → 32/16/8-bit broadcast value
‒ Graphics interpolation @ rate, no bank conflicts
 Index mode – Load/Store/Atomic operations
‒ Bandwidth amplification: up-to 32 32-bit lanes serviced per clock peak
‒ Direct decoupled return to VGPRs
‒ Hardware conflict detection with auto scheduling
 Software consistency/coherency for thread groups via hardware barrier
 Fast & low power vector load return from R/W L1
GCN MEMORY

CONTINUED …
LOCAL DATA SHARE

 An LDS bank is 512 entries, each 32-bits wide
‒ A bank can read and write a 32-bit value across an all-to-all crossbar and swizzle unit that includes 32 atomic integer units
‒ This means that several threads can read the same LDS location at the same time for FREE
‒ Writing to the same address from multiple threads also occurs at rate; the last thread to write wins (useful e.g. for all threads writing a uniform value to still be fast)
 Typically, the LDS will coalesce 32 lanes from one SIMD each cycle
‒ One wavefront is serviced completely every 2 cycles
‒ Conflicts are automatically detected across 32 lanes from a wavefront and resolved in hardware
‒ An instruction which accesses different elements in the same bank takes additional cycles
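Bank-conflict behavior follows from the bank mapping. A sketch assuming bank = word address mod 32 (the natural mapping for 32 banks of 32-bit entries, as implied above; not a cycle-accurate model):

```python
# Estimate how many serialized passes a 32-lane LDS access needs.
from collections import defaultdict

def lds_access_cycles(byte_addrs):
    banks = defaultdict(set)        # bank -> distinct word addresses hit
    for addr in byte_addrs:
        word = addr // 4            # banks are 32 bits wide
        banks[word % 32].add(word)
    # Same word in one bank broadcasts for free; distinct words serialize.
    return max(len(words) for words in banks.values())

print(lds_access_cycles([4 * i for i in range(32)]))    # 1  (no conflicts)
print(lds_access_cycles([0] * 32))                      # 1  (broadcast)
print(lds_access_cycles([128 * i for i in range(32)]))  # 32 (worst case)
```

Stride-1 word access and uniform broadcast are full speed; a 128-byte stride lands every lane in bank 0 and serializes completely.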
GCN MEMORY
LOCAL DATA SHARE: BLOCK DIAGRAM
[Slide 42: LDS block diagram only]
GCN MEMORY

NEW MEMORY OPERATIONS
LOCAL DATA SHARE

 Remote Atomic Ops with Shared Memory Dual-Source Operands
‒ LDS[Dst] = LDS[addr0] op LDS[addr1];
‒ Fast remote reduction operations for arithmetic, logical, Min/Max
 Read/Write/Conditional Exchange 96b/128b
 32-bit FP Min/Max/Compare-Swap
GCN MEMORY

NEW MEMORY OPERATIONS
LOCAL DATA SHARE
CONTINUED …

 Fast Lane Swizzle Operations
‒ Do not require allocation; no shared memory used
‒ Invalid reads result in a 0x0 return
‒ First mode: each four adjacent lanes can fully crossbar data, with the same switch for each set of four
‒ Second mode: for each consecutive set of 32 work-items
‒ Swap: 16, 8, 4, 2, 1
‒ Reverse: 32, 16, 8, 4, 2
‒ Broadcast: 32, 16, 8, 4, 2
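One plausible reading of the second swizzle mode, applied within each group of 32 work-items, is sketched below. This is my interpretation of the slide's Swap/Reverse/Broadcast sizes, not the hardware's exact encoding:

```python
# Possible lane-swizzle semantics within each 32-lane group (illustrative).

def swizzle(lanes, mode, size):
    out = []
    for i in range(len(lanes)):
        base, off = (i // 32) * 32, i % 32
        if mode == "swap":          # exchange adjacent blocks of `size` lanes
            src = off ^ size
        elif mode == "reverse":     # reverse order within each `size`-lane group
            src = (off & ~(size - 1)) | ((size - 1) - (off & (size - 1)))
        elif mode == "broadcast":   # every lane reads its group's first lane
            src = off & ~(size - 1)
        out.append(lanes[base + src])
    return out

lanes = list(range(64))
print(swizzle(lanes, "swap", 16)[:4])        # [16, 17, 18, 19]
print(swizzle(lanes, "reverse", 4)[:4])      # [3, 2, 1, 0]
print(swizzle(lanes, "broadcast", 8)[8:12])  # [8, 8, 8, 8]
```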
GCN MEMORY
LOCAL DATA SHARE: OPERATION DIAGRAMS
[Slide 45: diagrams of the 4-lane crossbar and the Swap (16/8/4/2/1), Reverse (32/16/8/4/2), and Broadcast (32/16/8/4/2) swizzles across lanes 0-63]
GCN MEMORY
READ/WRITE CACHE
 Reads and writes cached
‒ Bandwidth amplification
‒ Improved behavior on more memory access patterns
‒ Improved write-to-read reuse performance
 Relaxed consistency memory model
‒ Consistency controls available to control locality of load/store
 GPU Coherent
‒ Acquire/release semantics control data visibility across the machine (GLC bit on load/store)
‒ GCN APUs also have an SLC bit to control data visibility to CPU caches
‒ L2 coherent = all CUs can have the same view of data
 Global Atomics
‒ Performed in L2 cache (GDS also has global atomics)
GCN MEMORY
READ/WRITE L1 CACHE ARCHITECTURE
 Each CU has its own Vector L1 Data Cache
‒ 16KB L1, 64B lines, 4 sets x 64-way
‒ ~64B/clk bandwidth per Compute Unit
‒ Write-through – allocate on write (no read) w/ dirty byte mask
‒ Write-through at end of wavefront
‒ Decompression on cache read-out
 Instruction GLC bit defines cache behavior (GCN APUs also have an SLC bit)
‒ GLC = 0: local caching (full lines left valid); shader write-back invalidate instructions
‒ GLC = 1: globally coherent (hits within wavefront boundaries)
GCN MEMORY
READ/WRITE L2 CACHE ARCHITECTURE
 64-128KB L2 per memory controller channel
‒ Up-to 16 L2 cache partitions
‒ 64B lines, 16-way set associative
‒ ~64B/clk per channel for L2/L1 bandwidth
‒ Write-back – allocate on write (no read) w/ dirty byte mask
 Acquire/release semantics control data visibility across CUs
‒ L2 coherent = all CUs can have the same view of data
 Remote atomic operations
‒ Common integer set & floating point Min/Max/CmpSwap
GCN MEMORY
BANDWIDTH INFORMATION
 Each CU has 64 bytes per cycle of L1 bandwidth
‒ Shared with the GDS
 Each L2 partition also provides 64 bytes of data per cycle
 Peak Scalar L1 Data Cache bandwidth per CU is 16 bytes/cycle
 Peak I-Cache bandwidth per CU is 32 bytes/cycle (optimally 8 instructions)
 LDS peak bandwidth is 128 bytes of data per cycle via bandwidth amplification
 For the R9 290X:
‒ That’s nearly 5.5 TB/s of LDS BW, 2.8 TB/s of L1 BW, and 1 TB/s of L2 BW!
‒ 512-bit GDDR5 main memory has over 320 GB/sec bandwidth
‒ PCI Express 3.0 x16 bus interface to system (32GB/s)
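Those aggregate figures follow from the per-CU numbers. A back-of-envelope sketch assuming 44 CUs and a ~1 GHz engine clock (exact clocks vary by SKU, so treat the results as approximate):

```python
# Back-of-envelope check of the R9 290X bandwidth claims.

CLOCK_HZ = 1.0e9   # assumed ~1 GHz engine clock
CUS = 44

lds_bw = CUS * 128 * CLOCK_HZ   # 128 B/clk per CU
l1_bw  = CUS * 64 * CLOCK_HZ    # 64 B/clk per CU
l2_bw  = 16 * 64 * CLOCK_HZ     # 16 partitions x 64 B/clk

print(f"LDS {lds_bw/1e12:.2f} TB/s, L1 {l1_bw/1e12:.2f} TB/s, "
      f"L2 {l2_bw/1e12:.2f} TB/s")
# -> roughly 5.6, 2.8, and 1.0 TB/s: the slide's ballpark figures
```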
GCN MEMORY
BANDWIDTH & LATENCY TABLES

Bandwidth    | LDS               | K$               | L1
             | 128 bytes / clock | 16 bytes / clock | 64 bytes / clock

Latency      | LDS   | K$           | L1
Resident     | Short | Short (1x)   | Long (20x)
Non-Resident | N/A   | Medium (10x) | Long (20x)

Main Takeaways:
‒ LDS is optimized for bandwidth amplification and atomics
‒ K$ is optimized for periodic low-latency reads of small datasets
‒ L1 is optimized for high-bandwidth texture fetches and streaming
GCN MEMORY

BLOCK DIAGRAM
L1 TEXTURE CACHE

 The memory hierarchy is re-used for graphics
 Some dedicated graphics hardware added
‒ Address-gen unit receives 4 texture addr/clock
‒ Calculates 16 sample addr (nearest neighbors)
‒ Reads samples from L1 vector data cache
‒ Decompresses samples in Texture Mapping Unit (TMU)

‒ TMU filters adjacent samples, produces <= 4 interpolated texels/clock
‒ TMU output undergoes format conversion and is written into the vector register file
‒ The format conversion hardware is also used for writing certain formats to memory from graphics shaders
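The filtering step the TMU performs on those adjacent samples is ordinary weighted averaging. A minimal scalar sketch of one bilinear tap (single-channel texels; `fx`/`fy` are the sub-texel fractions — names are illustrative):

```python
def bilerp(c00, c10, c01, c11, fx, fy):
    """Weighted average of the 4 nearest texels, the arithmetic behind one
    bilinear tap. fx/fy are the horizontal/vertical sub-texel fractions."""
    top = c00 * (1.0 - fx) + c10 * fx
    bottom = c01 * (1.0 - fx) + c11 * fx
    return top * (1.0 - fy) + bottom * fy
```

Trilinear filtering runs this twice (once per MIP level) and blends, which is why it costs up to 2x bilinear.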

51 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN MEMORY
X86-64 VIRTUAL MEMORY

 The GCN cache hierarchy was designed to integrate with x86-64 microprocessors
 The GCN virtual memory system can support 4KB pages
‒ Natural mapping granularity for the x86-64 address space
‒ Paves the way for a shared address space in the future
‒ All GCN hardware can already translate requests into x86-64 address space

 GCN caches use 64B lines, which is the same size x86-64 processors use
[Image: AMD A-Series APU]

 The stage is set for heterogeneous systems to transparently share data between the GPU
and CPU through the traditional caching system, without explicit programmer control!

52 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE

R9 290X

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
                      AMD Radeon™              AMD Radeon™
                      HD 7970 GHz Edition      R9 290X                    Increase
Geometry Processing   2.1 billion prims/sec    4 billion prims/sec        1.9x
Compute               4.3 TFLOPS               5.6 TFLOPS                 1.3x
Texture fill rate     134.4 Gtexels/sec        176 Gtexels/sec            1.3x
Pixel fill rate       33.6 Gpixels/sec         64 Gpixels/sec             1.9x
Peak Bandwidth        264 GB/sec               320 GB/sec                 1.2x
Die area              352 mm2                  438 mm2                    1.24x
Peak GFLOPS/mm2       12.2                     12.8                       1.05x

53 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE

SHADER ENGINE

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
 Each GCN GPU can contain up-to 4 Shader Engines
‒ Load balanced with each other
‒ Screen partitioning of pixel assignment

 A Shader Engine is a high level organizational unit containing:
‒ 1 Geometry Processor (1 Primitive Per Cycle Throughput)
‒ 1 Rasterizer
‒ 1-16 CUs (Compute Units)
‒ Instruction I$ and constant K$ caches shared by up to 4 CU each

‒ 1-4 RBEs (Render Back Ends)
‒ Up-to 16 – 64b pixels/cycle per Shader Engine
‒ Up-to 8 – 128b pixels/cycle per Shader Engine
54 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE

R9 290X

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING

GRAPHICS CORE NEXT

 44 Compute Units

 4 Geometry Processors
‒ 4 billion primitives/sec

 64 Pixel Output/Clock
‒ 64 Gpixels/sec fill rate

 1MB L2 Cache
‒ Up-to 1 TB/sec L2/L1 bandwidth

 512-bit GDDR5 memory interface
‒ 320 GB/sec memory bandwidth

 6.2 billion transistors
‒ 438 mm2 on 28nm process node
‒ 12.8 GFLOPS/mm2
55 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE

SEA ISLANDS

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING

GRAPHICS CORE NEXT

 8 ASYNCHRONOUS COMPUTE ENGINES (ACE)
‒ Operate in parallel with Graphics CP
‒ Independent scheduling and work item dispatch
for efficient multi-tasking
‒ 9 Devices with 64+ Command Queues!

‒ Fast context switching
‒ Exposed in OpenCL™

 Dual DMA engines
‒ Can saturate PCIe 3.0 x16 bus bandwidth (16
GB/sec bidirectional)

56 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE

SEA ISLANDS

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING

GRAPHICS CORE NEXT

 ACEs are responsible for compute shader
scheduling & resource allocation
 Each ACE fetches commands from cache or
memory & forms task queues
 Tasks have a priority level for scheduling
‒ Background → Realtime
 ACEs dispatch tasks to shader arrays as resources permit
 Tasks complete out-of-order, tracked by the ACE for correctness
 Every cycle, an ACE can create a
workgroup and dispatch one wavefront from
the workgroup to the CUs
57 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE

SEA ISLANDS

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING

GRAPHICS CORE NEXT

 ACEs are independent
‒ But, can synchronize and communicate
via Cache/Memory/GDS

 ACEs can form task graphs
‒ Individual tasks can have
dependencies on one another
‒ Can depend on another ACE
‒ Can depend on part of graphics pipe

 ACEs can control task switching
‒ Stop and Start tasks and dispatch
work to shader engines
58 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE

SEA ISLANDS

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING

GRAPHICS CORE NEXT

 Focus in GPU hardware shifting away
from graphics-specific units, towards
general-purpose compute units
 R9 290x GCN-based ASICs already
have 8:1 ACE : CP ratio
‒ CP can dispatch compute
‒ ACE cannot dispatch graphics
 If you aren’t writing Compute
Shaders, you’re not getting the absolute
most out of modern GPUs
‒ Control: LDS, barriers, thread layout, ...
59 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN COMPUTE ARCHITECTURE

SEA ISLANDS

A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING

GRAPHICS CORE NEXT

Future Trends:
 More Compute Units
‒ ALU outpaces Bandwidth
 CPU + GPU Flat Memory
‒ APU + dGPU
 Less Fixed Function Graphics
‒ Can you write a Compute-based
graphics pipeline?

‒ Start thinking about it… 
60 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE
[Block diagram: 4x Geometry Processor, each containing a Geometry Assembler, Tessellator and Vertex Assembler]

GEOMETRY

Updated Hardware Geometry Units
– Off-chip buffering improvements
– Larger parameter and position cache

[Screenshots: Tessellation off vs. on]

 GS + Tessellation is faster than before…
 However… memory is still the bottleneck!
– Minimize the number of inputs and outputs for best performance…
 Small expansions can be done within LDS!
61 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13

Image from Battlefield 3, EA DICE

Process and rasterize up to 4 primitives per clock cycle
GCN FIXED FUNCTION ARCHITECTURE

RASTERIZER

 We now have 4 Rasterizers on R9 290x (4 triangles x 16 pixels = 64 pixels per clock)
‒ Each rasterizer can read in a single triangle per cycle, and write out 16 pixels

 Caveat: tiny (e.g. sub-pixel) triangles can dramatically reduce efficiency
 This can cause us to become raster-bound, starving the shader and holding up geometry!

[Figure: rasterizer efficiency examples]
‒ 16 pixels per clock = 100% efficiency
‒ 12 pixels per clock = 75% efficiency
‒ 1 pixel per clock = 6.25% efficiency
‒ e.g. one triangle covering 28 pixels rasterizes in 2 clocks, vs. 3 clocks for 3 sub-pixel triangles covering only 3 pixels

62 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
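The percentages above are just pixels produced divided by the scan converter's peak. A one-line sketch (16 pixels/clock per rasterizer, per the slide):

```python
def raster_efficiency(pixels_covered, clocks, pixels_per_clock=16):
    """Fraction of the rasterizer's peak rate (16 pixels/clock) actually
    produced over a span of clocks."""
    return pixels_covered / (clocks * pixels_per_clock)
```

The slide's cases fall out directly: 16 pixels in 1 clock is 100%, 12 in 1 clock is 75%, and a sub-pixel triangle at 1 pixel per clock is 6.25%.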
GCN FIXED FUNCTION ARCHITECTURE

TESSELLATION + RASTERIZER EFFICIENCY

[Figure: increasing tessellation level vs. rasterizer efficiency]
‒ ~13 pixels per clock = 75-90% efficiency
‒ ~4 pixels per clock = 18-25% efficiency
‒ 1 pixel per clock = 6.25% efficiency

Over-Tessellation
 Reduces rasterizer efficiency
‒ Extreme Tessellation = 6.25% Efficiency
 Also impacts ROPs and MSAA efficiency
‒ High number of polygon edges to AA
‒ Consumes dramatically more bandwidth
‒ If nFragments > nSamples, quality will be lost
‒ E.g. 16 verts affecting 1 pixel @ 8xMSAA

63 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE
Over-Tessellation
 Reduces shader efficiency
 HS, DS and VS run many times
for each final image pixel
‒ Yet don’t contribute much
to final image quality

 The graphics pipeline is not
designed for this abuse!

TESSELLATION + SHADING EFFICIENCY

[Figure: per-pixel overshade visualization, color-coded from 1 to 8 shading passes per pixel]
 Consider Alternatives:
‒ Parallax Occlusion Mapping
‒ […]

 Image courtesy: Kayvon Fatahalian
“Evolving the Direct3D Pipeline for Real-time Micropolygon Rendering,”
From ACM SIGGRAPH 2010 course: “Beyond Programmable Shading II”

64 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN Tessellation – Best Practices
 While performance is much improved, it is still a potential bottleneck!
‒ Produces a great deal of IO traffic, starving other parts of the pipeline

 Best performance generally achieved with tessellation factors less than 15!

Continue to Optimize:
‒ Pre-triangulate
‒ Distance-adaptive
‒ Screen-space adaptive
‒ Orientation-adaptive
‒ Backface Culling
‒ Frustum Culling
‒ […]

65 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13

[Screenshots: Tessellation OFF vs. ON]
GCN FIXED FUNCTION ARCHITECTURE

RASTERIZER

 We now have 4 Geometry Processors on R9 290x
‒ Overall Primitive Rate = 4 prims per clock (ideal)

 We now have 4 Rasterizers on R9 290x (4 triangles x 16 pixels = 64 pixels per clock)
‒ Each rasterizer can read in a single triangle per cycle, and write out 16 pixels

 Caveat: tiny (e.g. sub-pixel) triangles can dramatically reduce efficiency
 This can cause us to become raster-bound, unable to rasterize at peak-rate!
[Block diagram: the Command Processor feeds 4 Geometry Processors (each with Geometry Assembler, Tessellator and Vertex Assembler), the Compute Units, and 4 Rasterizers (each with Scan Converter, Hierarchical Z and Render Back-Ends)]

66 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE

RENDER BACK ENDS

 Once the pixel fragments in a tile have been shaded, they flow to the Render Back-Ends (RBEs)*

[Diagram: Z/Stencil ROPs backed by a Depth Cache; Color ROPs backed by a Color Cache]

‒ 16KB Color Cache
‒ Up to 8 color + 16 coverage samples (16x EQAA)

‒ 8KB Depth Cache
‒ Up to 8 depth samples (8x MSAA)

‒ Writes un-cached via memory controllers
‒ 64 – 64b pixels per cycle
‒ 256 Depth Test (Z) / Stencil Ops per cycle

 Logic Operations as alternative to Blending
‒ Exposed in Direct3D 11.1
‒ Also available in OpenGL

 Dual-Source Color Blending with MRTs
‒ Only available in OpenGL

* There are 16 RBEs on R9 290x

67 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE

DEPTH IMPROVEMENTS

24-BIT DEPTH FORMATS ARE INTERNALLY REPRESENTED AS 32-BITS

Fast-accept of fully-visible triangles spanning one or more tiles
If a triangle fully covers a tile, the cost is only 1 clock/tile
 Depth Bounds Test (DBT) Extension

‒Exposed in OpenGL via GL_EXT_depth_bounds_test
‒Exposed in Direct3D 11 via extension
68 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN FIXED FUNCTION ARCHITECTURE

STENCIL IMPROVEMENTS

 GCN has support for new extended stencil ops

‒Only available in OpenGL:

GL_AMD_stencil_operation_extended
‒Additional stencil ops:
‒AND, XOR, NOR
‒REPLACE_VALUE_AMD
‒etc.
‒ Also exposes additional stencil op source value
‒ Can be used as an alternative to stencil ref value

 Stencil ref and op source value can now be exported from pixel shader

‒Only available in OpenGL: GL_AMD_shader_stencil_value_export
69 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN LOW-LEVEL TIPS

GPR PRESSURE

 GPRs and GPR Pressure
 Banks of GCN Vector GPRs (Illustration)

 General Purpose Registers (GPR) are a limited resource
‒ Separate banks of GPRs for Vector and Scalar (per SIMD)
‒ Maximum of 256 VGPRS and 512 SGPRS shared across all waves (up-to 10) owned by a SIMD
‒ Organized as 64 words of 32-bits – two adjacent GPR can be combined for 64-bit (4 for 128-bit)
‒ Number of GPRs required by a shader affects SIMD scheduling and execution efficiency
‒ Shader tools can be used to determine how many GPRs are used…

 GPR pressure is affected by:
‒ Loop Unrolling
‒ Long lifetime of temporary variables
‒ Nested Dynamic Flow Control instructions
‒ Fetch dependencies (e.g. indexed constants)
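Why GPR pressure matters: the per-SIMD register budgets bound how many wavefronts can be resident at once. A back-of-envelope occupancy sketch using the budgets above (it ignores allocation-granule rounding, which varies per ASIC — an assumption):

```python
def max_waves_per_simd(vgprs_per_wave, sgprs_per_wave=0):
    """Upper bound on resident wavefronts per SIMD, given the 256-VGPR and
    512-SGPR budgets and the 10-wave cap. Ignores allocation-granule
    rounding, which real hardware applies per ASIC."""
    waves = 10                                   # hardware wave cap per SIMD
    if vgprs_per_wave:
        waves = min(waves, 256 // vgprs_per_wave)
    if sgprs_per_wave:
        waves = min(waves, 512 // sgprs_per_wave)
    return waves
```

E.g. a shader needing 84 VGPRs per wave caps the SIMD at 3 waves, leaving far less latency-hiding headroom than a 24-VGPR shader at the full 10.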

70 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN LOW-LEVEL TIPS

TEXTURE FILTERING

‒ Point sampling is full-rate on all formats
‒ Trilinear filtering costs up to 2x bilinear filtering cost
‒ Anisotropic (N taps) costs <= (N x bilinear)
‒ Avoid cache thrashing!
‒ Use MIPmapping
‒ Use Gather() where applicable
‒ Exploit neighbouring pixel shader thread/CU locality:
‒ Sampling from texels resident on the same CU can have a lower cost
‒ Exploit this explicitly by using Compute Shaders
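The first three rules of thumb above can be condensed into a tiny cost table (relative best-case cost in bilinear-equivalent cycles; the mode names are illustrative):

```python
def filter_cost(mode, aniso_taps=1):
    """Best-case relative cost per fetch, in bilinear-equivalent cycles:
    point/bilinear are full rate, trilinear costs up to 2x, and N-tap
    anisotropic costs at most N bilinear fetches."""
    if mode == "aniso":
        return aniso_taps
    return {"point": 1, "bilinear": 1, "trilinear": 2}[mode]
```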
71 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN LOW-LEVEL TIPS

COLOR OUTPUT

 PS Output: Each additional color output increases export cost
 Export cost can be more costly than PS execution!
‒ Each (fast) export is equivalent to 64 ALU ops on R9 290X
‒ If shader is export-bound then use “free” ALU for packing instead

 Watch out for export-bound cases
‒ E.g. G-Buffer parameter writes
‒ MINIMIZE SHADER INPUTS AND OUTPUTS!
‒ Pack, pack, pack, pack!

 Costs of outputting and blending various formats
‒discard/clip allow the shader hardware to skip the rest of the work
* Miss “PACK” Man kindly reminds you to “Pack pack pack!” 
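The "use free ALU for packing" advice amounts to squeezing two values into one export channel. A CPU-side sketch of the idea, packing two floats into one 32-bit word as IEEE half-floats (the same shape as HLSL's f32tof16-style packing; function names here are illustrative):

```python
import struct

def pack_two_halfs(a, b):
    """Pack two floats into one 32-bit word as IEEE half-floats: a is the
    low 16 bits, b the high 16 bits."""
    return struct.unpack("<I", struct.pack("<ee", a, b))[0]

def unpack_two_halfs(word):
    """Inverse: recover the two half-precision values from one 32-bit word."""
    return struct.unpack("<ee", struct.pack("<I", word))
```

Two such words carry four half-precision G-Buffer parameters through a single 64-bit export.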

72 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN MEDIA PROCESSING

MEDIA INSTRUCTIONS

 SAD = Sum of Absolute Differences
Closest match

 Critical to video & image processing algorithms
‒ Motion detection
‒ Gesture recognition
‒ Video & image search
‒ Stereo depth extraction
‒ Computer vision

 SAD (4x1) and QSAD (4 4x1) instructions
‒ New QSAD combines SAD with alignment ops for higher
performance and reduced power draw
‒ Evaluate up to 256 pixels per CU per clock cycle!

 Maskable MQSAD instruction
‒ Allows background pixels to be ignored
‒ Accelerated isolation of moving objects

 New: 32-bit destination accumulator register
‒ SAD/QSAD/MQSAD U32/U16 accumulators with saturation
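For reference, the operation itself is simple: per candidate row, sum the absolute per-pixel differences against the reference row, then keep the candidate with the lowest sum. A scalar sketch of what one 4x1 SAD computes (the hardware does this for many rows in parallel):

```python
def sad(candidate, reference):
    """Sum of absolute differences between two equal-length pixel rows;
    one SAD (4x1) instruction computes this for a 4-pixel row per lane."""
    assert len(candidate) == len(reference)
    return sum(abs(a - b) for a, b in zip(candidate, reference))

def best_match(candidates, reference):
    """The closest-match search SAD accelerates: the lowest-SAD candidate."""
    return min(candidates, key=lambda c: sad(c, reference))
```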
73 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13

3
2

5

5

4

4

0

7

1

7

5

9

4

1

3

5

5

5

9

3

1

4

4 0
SAD = 7
SAD = 22
2 22 9
5

1

6

7

2 9
SAD = 6
1 59 3
5

2

8

1

1

7

6

8

3

0

4

3

2

9

9

3

0

7

1

1

7 4
SAD = 5
5 58 4
0

8

0

0

2 2
SAD = 2
8 45 3
2

9

9

7

1

6

2

4

0

AMD Radeon R9 290x can evaluate

11.26 Terapixels/sec *
* Peak theoretical performance for 8-bit integer pixels

3
GCN MEDIA PROCESSING

VIDEO CODEC ENGINE

 Video Codec Engine (VCE)
‒ Hardware H.264 Compression and Decompression
‒ Ultra-low-power, fully fixed-function mode
‒ Capable of 1080p @ 60 frames / second

‒ Programmable for Ultra High Quality and/or Speed
‒ Entropy encoding block fully accessible to software
‒ AMD Accelerated Parallel Programming SDK
‒ OpenCL ™

‒ Create hybrid faster-than-real-time encoders!
‒ Custom motion estimation
‒ Inverse DCT and motion compensation
‒ Combine with hardware entropy encoding!

AMD Radeon R9 290x can compress

Realtime+ 1080p H.264

74 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN MEDIA PROCESSING

AMD TRUEAUDIO

 Multiple integrated Tensilica HiFi EP Audio DSP cores
 Dedicated Audio DSP solution for game sound effects
 Guaranteed real-time performance and service

 Designed for game audio artists and engineers to take their artistic vision
beyond sound production into the realm of sound processing
 Intended to transform game audio as programmable shaders transformed graphics
75 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
GCN MEDIA PROCESSING

AMD TRUEAUDIO

SPATIALIZATION / 3D AUDIO

REVERBS

AUDIO/VOICE STREAMS

MASTERING LIMITERS

76 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
HEAR MORE REALTIME
VOICES AND CHANNELS
IN A GAME

77 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
ENABLES AMAZING
DIRECTIONAL AUDIO
OVER ANY OUTPUT

78 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
CONCLUSIONS

GCN ARCHITECTURE TAKEAWAYS

‒GCN offers increased flexibility & efficiency, with reduced complexity!
‒Non-VLIW Architecture improves efficiency while reducing programmer burden
‒Constants/resources are just address + offset now in the hardware
‒UAV/SRV/SUV read/write any format – like CPU C++ reinterpret_cast & static_cast
‒Has virtual memory & GPU flat memory, moving towards CPU + GPU flat memory

‒GCN is designed with a forward-looking focus on Compute
‒Scalar unit for complex dynamic control flow + branch & message unit
‒64KB LDS/CU, 64KB GDS, atomics at every stage, coherent cache hierarchy
‒8 Asynchronous Compute Engines (ACE) for multitasking compute
‒ 8 ACE x 8 HQD (per ACE) = 64 HQD (HQD = Hardware Queue Descriptors)

79 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
CONCLUSIONS

GCN ARCHITECTURE TAKEAWAYS

CONTINUED …
‒GCN generally simplifies your life as a programmer
‒Don’t: fret too much about instruction grouping, or vectorization
‒Do: Think about GPR utilization & LDS usage (impacts max # of wavefronts)
‒Do: Think about thread/CU locality when you structure your algorithm
‒Do: Exploit the low-latency 4-CU Shared 16KB Scalar L1 Data Cache (K$)
‒Do: Pack shader inputs and outputs – aim to be IO/bandwidth thin!
‒ Pack PS exports into non-blended 64-bit format for optimal ROP utilization
‒ But, remember that 32-bit formats still use less bandwidth
‒ Keep geometry (HS, VS, GS, DS) stage IO under 4 float4 (ideally less! )

‒Unlimited number of addressable constants/resources
‒Constants aren’t free anymore – each consumes resources, use sparingly!

‒Compute is the future – exploit its power for GPGPU work & graphics!
80 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
THANK YOU
QUESTIONS? 
^_^
Layla Mah
layla.mah@amd.com
81 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
BONUS SLIDES

82 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
THE BONUS SLIDES

83 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
TILED RESOURCES & PARTIALLY RESIDENT TEXTURES

MegaTexture in id Tech5
Tiled Resources & Partially Resident Textures – INTRODUCTION
Enables application to manage more texture data than can physically fit in a fixed footprint
‒ Known as: Tiled Resources (Direct3D 11.2) and Partially Resident Textures (OpenGL 4.2)
‒ A.k.a. “Virtual texturing“ and “Sparse texturing”

The principle behind PRT is that not all texture contents are likely to be needed at any given time
‒ Current render view may only require selected portions of a texture to be resident in memory
‒ Or, only selected MIPMap levels…

PRT textures only have a portion of their data mapped into GPU-accessible memory at a time
‒ Texture data can be streamed in on-demand
‒ Texture sizes up-to 32TB (16k x 16k x 8k x 128-bit)

 OpenGL extension – GL_AMD_sparse_texture
85 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – TEXTURE TILES
The PRT texture is chunked into 64KB tiles
‒ Fixed memory size
‒ Not dependent on texture type or format

[Figure: smiley texture chunked into tiles – highlighted areas mark the texture data needing highest resolution; only those tiles need to be resident in GPU memory]

Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
86 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – TRANSLATION TABLE
The GPU virtual memory page table translates 64KB tiles into a resident texture tile pool
[Diagram: Texture Map tiles → Page Table (mapped/unmapped entries) → Texture Tile Pool in video memory (linear storage of 64KB tiles)]
Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
87 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
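The translation step above is just a sparse lookup. A toy model (the key shape and method names are illustrative, not hardware or API names):

```python
class PrtPageTable:
    """Toy virtual->physical tile translation: a (mip, tile_x, tile_y) key
    maps to a slot in a linear pool of 64KB tiles."""
    def __init__(self):
        self._entries = {}

    def map_tile(self, mip, tx, ty, pool_slot):
        """App maps a virtual tile to a slot in the resident tile pool."""
        self._entries[(mip, tx, ty)] = pool_slot

    def translate(self, mip, tx, ty):
        """None models an unmapped entry, i.e. a 'failed' PRT fetch."""
        return self._entries.get((mip, tx, ty))
```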
Tiled Resources & Partially Resident Textures – MIP MAPS
Not all tiles from the texture map are actually resident in video memory
PRT hardware page table stores virtual → physical mappings
[Diagram: Texture Map MIP levels → Page Table (mapped/unmapped entries) → Texture Tile Pool of 64KB tiles in video memory]

Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
88 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – TILE MANAGEMENT
The Application is responsible for uploading/releasing new PRT tiles!

A common scenario is to upload lower MIPMaps to texture tile pool
‒ This allows a full representation of the PRT contents to be resident in memory (albeit at
lower resolution)
‒ e.g. MIP LOD 6 and above for 16kx16k 32-bits texture is about 650KB (256x256 resolution)

Texture tiles corresponding to higher resolution areas are uploaded by the application
as needed
‒ e.g. As camera gets closer to a PRT-textured polygon the requirement for texels:screen
pixels ratio increases, thus higher LOD tiles need uploading

89 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – “FAILED” FETCH
How does the application know which texture tiles to upload?
Answer: PRT-specific texture fetch instructions in pixel shader
‒ Return a “Failed” texel fetch condition when sampling a PRT pixel whose tile is currently not
in the pool
‒ OpenGL example:
int glSparseTexture( gsampler2D sampler, vec2 P, inout gvec4 texel );

This information is then stored in render target or UAV
‒ Texel fetch failed for a given (x, y) tile location

...and then copied to the CPU so that application can upload required tiles
App chooses what to render until missing data gets uploaded

90 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – “LOD WARNING”
PRT fetch condition code can also indicate an “LOD Warning”
The minimum LOD warning is specified by the application on a per-texture basis
‒ OpenGL example:
glTexParameteri( <target>, MIN_WARNING_LOD_AMD, <LOD warning value> );

If a fetched pixel’s LOD is < the specified LOD warning value, then the condition code is returned

This functionality is typically used to try to predict when higher-resolution MIP levels will be needed
‒ E.g. Camera getting closer to PRT-mapped geometry

91 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures – EXAMPLE USAGE
1. App allocates PRT (e.g. 16kx16k DXT1) using PRT API
2. App uploads MIP levels using API calls
3. Shader fetches PRT data at specified texcoords
Two possibilities:
3.a. Texel data belongs to a resident (64KB) tile
- Valid color returned, no error code
3.b. Texel data points to a non-resident tile, or to an LOD below the specified warning LOD
- Error/LOD Warning code returned
- Shader writes tile location and error code to RT or UAV

4. App reads RT or UAV and upload/release new tiles as needed
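Steps 3-4 above form a feedback loop between shader and app. A toy run of that loop (all names here are illustrative):

```python
# Toy residency loop: a fetch either returns texel data or a "missing tile"
# code; the app services the miss by mapping the tile, then the retry hits.
def fetch(page_table, tile, pool):
    slot = page_table.get(tile)
    if slot is None:
        return "FAIL", tile            # 3.b: shader reports the miss
    return "OK", pool[slot]            # 3.a: valid texel data

page_table = {}
pool = {0: "tile texels"}

status, payload = fetch(page_table, (2, 1), pool)
if status == "FAIL":                   # 4: app uploads/maps the needed tile
    page_table[payload] = 0
status, payload = fetch(page_table, (2, 1), pool)
```

Until the upload lands, the app renders from whatever lower-resolution MIP tiles are already resident.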

92 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Tiled Resources & Partially Resident Textures –
TYPES, FORMATS & DIMENSIONS

 All texture types and formats supported
‒1D, 2D, cube, arrays and 3D volume textures

‒All common texture formats
‒ Including compressed formats
‒Maximum dimensions:
‒16k x 16k x 8k x 128-bit textures
93 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
Hardware PRT vs. Software Implementation

Hardware PRT:
• Ease of implementation: complexity hidden behind HW & API
• Full filtering support, including anisotropic filtering
• Full-speed filtering

SW Implementation:
• Requires “manual” filtering
• Software anisotropic is very costly

Don’t go overboard with PRT allocation!
• Page table entry size is 4 DWORDs
• Entries have to be resident in video memory
94 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
94 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
QUESTIONS? 
^_^
Layla Mah
layla.mah@amd.com
95 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13

@MissQuickstep
Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other
jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective
owners.
©2013 Advanced Micro Devices, Inc. All rights reserved.
96 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
THE BONUS SLIDES
SHADER CODE

97 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
SHADER CODE EXAMPLE #2
float fn0(float a,float b)
{
float c = 0.0;
float d = 0.0;
for(int i=0;i<100;i++)
{
if(c>113.0)
break;
c = c * a + b;
d = d + 1.0;
}
return(d);
}

// Registers r0 contains “a”, r1 contains “b”, r2 contains “c”
// and r3 contains “d”
// Value is returned in r3
v_mov_b32       r2, #0.0          // float c = 0.0
v_mov_b32       r3, #0.0          // float d = 0.0
s_mov_b64       s0, exec          // Save execution mask
s_mov_b32       s2, #0            // i=0
label0:
s_cmp_lt_s32    s2, #100          // i<100
s_cbranch_sccz  label1            // Exit loop if not true
v_cmp_le_f32    r2, #113.0        // c > 113.0
s_and_b64       exec, vcc, exec   // Update exec mask on fail
s_branch_execz  label1            // Exit if all lanes pass
v_mul_f32       r2, r2, r0        // c = c*a
v_add_f32       r2, r2, r1        // c = c+b
v_add_f32       r3, r3, #1.0      // d = d+1.0
s_add_s32       s2, s2, #1        // i++
s_branch        label0            // Jump to start of loop
label1:
s_mov_b64       exec, s0          // Restore exec mask

98 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap
changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software
changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD
reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of
such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY
INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE
LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION
CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro
Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC).
Other names are for informational purposes only and may be trademarks of their respective owners.
100 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13

 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrjRoberto Brandao
 
NVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and TransparencyNVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and TransparencyMark Kilgard
 
Create Amazing VFX with the Visual Effect Graph
Create Amazing VFX with the Visual Effect GraphCreate Amazing VFX with the Visual Effect Graph
Create Amazing VFX with the Visual Effect GraphUnity Technologies
 
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyAMD Developer Central
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & FutureOfer Rosenberg
 
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUs
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUsAMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUs
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUsAMD
 
Computação acelerada – a era das ap us roberto brandão, ciência
Computação acelerada – a era das ap us   roberto brandão,  ciênciaComputação acelerada – a era das ap us   roberto brandão,  ciência
Computação acelerada – a era das ap us roberto brandão, ciênciaCampus Party Brasil
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiectureHaris456
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasAMD Developer Central
 
[TGDF 2019] Mali GPU Architecture and Mobile Studio
[TGDF 2019] Mali GPU Architecture and Mobile Studio[TGDF 2019] Mali GPU Architecture and Mobile Studio
[TGDF 2019] Mali GPU Architecture and Mobile StudioOwen Wu
 
IT Platform Selection by Economic Factors and Information Security Requiremen...
IT Platform Selection by Economic Factors and Information Security Requiremen...IT Platform Selection by Economic Factors and Information Security Requiremen...
IT Platform Selection by Economic Factors and Information Security Requiremen...ECLeasing
 
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorAMD Developer Central
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Lablup Inc.
 

Similar to GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah (20)

Direct3D and the Future of Graphics APIs - AMD at GDC14
Direct3D and the Future of Graphics APIs - AMD at GDC14Direct3D and the Future of Graphics APIs - AMD at GDC14
Direct3D and the Future of Graphics APIs - AMD at GDC14
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrj
 
NVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and TransparencyNVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and Transparency
 
Create Amazing VFX with the Visual Effect Graph
Create Amazing VFX with the Visual Effect GraphCreate Amazing VFX with the Visual Effect Graph
Create Amazing VFX with the Visual Effect Graph
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon SelleyPT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
PT-4052, Introduction to AMD Developer Tools, by Yaki Tebeka and Gordon Selley
 
Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & Future
 
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUs
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUsAMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUs
AMD Radeon™ RX 5700 Series 7nm Energy-Efficient High-Performance GPUs
 
Computação acelerada – a era das ap us roberto brandão, ciência
Computação acelerada – a era das ap us   roberto brandão,  ciênciaComputação acelerada – a era das ap us   roberto brandão,  ciência
Computação acelerada – a era das ap us roberto brandão, ciência
 
Introduction to EDA Tools
Introduction to EDA ToolsIntroduction to EDA Tools
Introduction to EDA Tools
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu Das
 
[TGDF 2019] Mali GPU Architecture and Mobile Studio
[TGDF 2019] Mali GPU Architecture and Mobile Studio[TGDF 2019] Mali GPU Architecture and Mobile Studio
[TGDF 2019] Mali GPU Architecture and Mobile Studio
 
IT Platform Selection by Economic Factors and Information Security Requiremen...
IT Platform Selection by Economic Factors and Information Security Requiremen...IT Platform Selection by Economic Factors and Information Security Requiremen...
IT Platform Selection by Economic Factors and Information Security Requiremen...
 
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 

More from AMD Developer Central

Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceAMD Developer Central
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...AMD Developer Central
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozAMD Developer Central
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellAMD Developer Central
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevAMD Developer Central
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...AMD Developer Central
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14AMD Developer Central
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...AMD Developer Central
 
Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14AMD Developer Central
 
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...AMD Developer Central
 

More from AMD Developer Central (20)

Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math Libraries
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
 
DirectGMA on AMD’S FirePro™ GPUS
DirectGMA on AMD’S  FirePro™ GPUSDirectGMA on AMD’S  FirePro™ GPUS
DirectGMA on AMD’S FirePro™ GPUS
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop Intelligence
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
 
Inside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerInside XBox- One, by Martin Fuller
Inside XBox- One, by Martin Fuller
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas Thibieroz
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
 
Inside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerInside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin Fuller
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
 
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
RapidFire - the Easy Route to low Latency Cloud Gaming Solutions - AMD at GDC14
 
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
Mantle and Nitrous - Combining Efficient Engine Design with a modern API - AM...
 
Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14Mantle - Introducing a new API for Graphics - AMD at GDC14
Mantle - Introducing a new API for Graphics - AMD at GDC14
 
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
Keynote (Tony King-Smith) - Silicon? Check. HSA? Check. All done? Wrong! - by...
 

Recently uploaded

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Recently uploaded (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah

  • 1. THE AMD GCN ARCHITECTURE A CRASH COURSE @MissQuickstep LAYLA MAH – LAYLA.MAH@AMD.COM DEVELOPER TECHNOLOGY ENGINEER
  • 2. AGENDA
 Part 1: A Brief History of GPU Evolution
 Part 2: Introduction to Graphics Core Next (GCN)
 Part 3: Anatomy of a GCN Compute Unit (CU)
 Part 4: GCN Shader: Arbitration, Examples & Tips
 Part 5: GCN Memory Hierarchy
 Part 6: GCN Compute Architecture (ACE)
 Part 7: GCN Fixed Function Units (CP, Geometry Engine, Rasterizer, RBE, …)
 Part 8: Main Takeaways & Conclusion
 Bonus Slides: Tiled Resources, Partially Resident Textures (PRT)
  • 3–4. GPU EVOLUTION (section-divider diagram, repeated on slides 3 and 4: 1st era fixed-function 3D geometry transformation and lighting; 2nd era simple shaders; 3rd era graphics parallel core with VLIW5/VLIW4 stream processing units, branch units, and general-purpose registers)
  • 5. GPU EVOLUTION – 1ST ERA: Fixed Function (prior to 2002)
 Graphics-specific hardware
‒ Texture mapping/filtering
‒ Transform & Lighting (T&L) Engines
‒ Geometry processing
‒ Rasterization
‒ Fixed-function lighting equations
‒ Multi-texturing
‒ Dedicated texture and pixel caches
‒ Sufficient for basic graphics tasks
‒ No general-purpose compute capability
 Dot product and scalar multiply-add
  • 6. GPU EVOLUTION (diagram: 2nd-era simple-shader pipeline – memory interface, setup engine, 8 vertex pipes, 16 pixel pipes, pixel shader core)
  • 7. GPU EVOLUTION – 2ND ERA: Simple Shaders (2002–2006)
 The Rise of Shaders
– Graphics programmability: Direct3D 8/9, OpenGL 2.0
– Specialized shader units for vertex & pixel processing (8 vertex pipes, 16 pixel pipes)
– Floating-point processing; IEEE compliance not required
– Different precision per IHV: ATI 24-bit full-speed; NV 16-bit full-speed, 32-bit half-speed
 Added dedicated caches
 Shader Models 1.0–2.0
‒ VS and PS are distinct
‒ Minimal instruction sets
‒ Limited instruction slots
‒ Limited shader lengths
‒ No DYNAMIC flow control
‒ No looping constructs
‒ No vertex texture fetch
‒ No bitwise operators
‒ No native integer ALU
‒ […]
  • 8. GPU EVOLUTION (section-divider diagram, same as slides 3–4)
  • 9. GPU EVOLUTION – 3RD ERA: Graphics Parallel Core
 The Rise of the Unified Shader (VLIW-5)
 5-element Very-Long-Instruction-Word (XYZWT)
‒ Ideal for 4-element vector and 4x4 matrix operations
‒ Vector/vector math in a single instruction
‒ Plus one transcendental-unit (T) function per instruction
‒ Began with XENOS and utilized from R600 until “Cayman”
‒ Flexible and optimized for graphics workloads
 Single-precision 32-bit IEEE-compliant floating-point ALUs
 More advanced caching
‒ Instruction, constant, multi-level texture/data, & later: LDS/GDS
 More flexible: unified ALU, branch unit, dynamic flow control, vertex texture, geometry shader, tessellation engines, etc.
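The VLIW-5 issue model above can be sketched in a toy Python model: one instruction word bundles up to four scalar multiply-adds (the X/Y/Z/W slots) plus one transcendental (the T slot), so a vec4 MAD and a reciprocal can issue together in a single clock. This is an illustrative sketch only; the slot names, register names, and op set are simplified assumptions, not actual ISA behavior.

```python
# Toy model of one VLIW-5 issue slot: 4 FMAD ops (X,Y,Z,W) + 1 transcendental (T).
# Slot layout, register names, and op set are simplified assumptions.
import math

def issue_vliw5(bundle, regs):
    """Execute one VLIW-5 bundle: up to 4 'mad' slots and one 'T' slot."""
    results = dict(regs)
    for slot in ("X", "Y", "Z", "W"):          # the four FMAD ALUs
        op = bundle.get(slot)
        if op:
            a, b, c = (regs[r] for r in op)
            results[slot.lower()] = a * b + c  # fused multiply-add
    t = bundle.get("T")                        # one transcendental per clock
    if t:
        func, src = t
        results["t"] = {"rcp": lambda v: 1.0 / v,
                        "sqrt": math.sqrt,
                        "exp2": lambda v: 2.0 ** v}[func](regs[src])
    return results

# One clock: vec4 MAD (pos = dir * t + origin) plus a reciprocal in the T slot.
regs = {"d0": 1.0, "d1": 2.0, "d2": 3.0, "d3": 4.0,
        "t0": 0.5, "o0": 10.0, "w0": 8.0}
bundle = {"X": ("d0", "t0", "o0"), "Y": ("d1", "t0", "o0"),
          "Z": ("d2", "t0", "o0"), "W": ("d3", "t0", "o0"),
          "T": ("rcp", "w0")}
out = issue_vliw5(bundle, regs)
print(out["x"], out["t"])  # 10.5 0.125
```

The key point the sketch makes: peak throughput requires the compiler to find five co-issuable operations every clock, which is exactly the scheduling burden the later slides describe.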
  • 11. GPU EVOLUTION – 3RD ERA: Graphics Parallel Core – Optimized For Die Area Efficiency (VLIW-4)
 4-Element Very-Long-Instruction-Word (XYZW)
 ‒ Profiling showed average VLIW utilization was < 3.4/5
 ‒ Removed dedicated T-Unit – optimized die area usage
 ‒ Each ALU has a smaller LUT, combined using 3-term Lagrange polynomial interpolation across multiple ALUs
 ‒ Still ideal for 4-element vector and 4x4 matrix operations
 ‒ Fewer ALU bubbles in transcendental-light code, better utilization
 ‒ Simplified programming and optimization relative to VLIW-5
 Better optimized for a combination of graphics & compute
 ‒ Graphics is still the primary focus, but compute is gaining attention
 Improved support for DirectCompute™ and OpenCL™
 ‒ Multiple dispatch processors & separate command queues
 11 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 12. GPU EVOLUTION – VLIW4 SIMD vs. GCN Quad SIMD-16
 VLIW4 SIMD (16 lanes, 1 VLIW instruction containing 4 ALU ops per clock):
 ‒ 64 single-precision multiply-adds per clock: 16 SIMDs × (1 VLIW inst × 4 ALU ops)
 ‒ Needs 4 parallel ALU ops to fill each VLIW instruction
 ‒ Compiler manages register port conflicts
 ‒ Specialized, complex compiler scheduling
 ‒ Difficult assembly creation, analysis, and debug; complicated tool chain support
 ‒ Careful optimization required for peak performance
 GCN Quad SIMD-16 (4 ALU ops, from different wavefronts, per clock):
 ‒ 64 single-precision multiply-adds per clock: 4 SIMDs × (1 ALU op × 16 threads)
 ‒ Needs 4+ wavefronts to keep SIMD lanes full
 ‒ No register port conflicts
 ‒ Standard compiler scheduling & optimizations
 ‒ Simplified assembly creation, analysis, & debug; simplified tool chain development and support
 ‒ Stable and predictable performance
 12 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
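The two issue models above boil down to simple arithmetic. A minimal sketch (Python, illustrative numbers taken from this slide) showing that both designs peak at 64 single-precision multiply-adds per clock while needing different kinds of parallelism to get there:

```python
# Peak single-precision MADs per clock for each design (illustrative model).

def vliw4_peak(simds=16, ops_per_inst=4):
    """VLIW4: 16 SIMDs, each issuing one 4-op VLIW instruction per clock."""
    return simds * ops_per_inst

def gcn_peak(simds=4, lanes=16):
    """GCN: 4 SIMD-16 units, each issuing one ALU op across 16 lanes per clock."""
    return simds * lanes

assert vliw4_peak() == 64
assert gcn_peak() == 64
# The difference: VLIW4 needs 4 *independent* ALU ops inside one thread's
# instruction stream, while GCN needs 4+ independent wavefronts - a much
# easier condition for the compiler and scheduler to satisfy.
```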
  • 13. AMD GRAPHICS CORE NEXT 13 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 14. MANTLE
 New low level programming interface for PCs
 Designed in collaboration with top game developers
 Lightweight driver that allows direct access to GPU hardware
 Compatible with DirectX® HLSL for simplified porting
 Works with all Graphics Core Next GPUs
 [Stack diagram: Graphics Applications → Mantle API → Mantle Driver → GCN]
 Related sessions:
 ‒ GS-4112 – Mantle: Empowering 3D Graphics Innovation
 ‒ Keynote – Johan Andersson, Technical Director, EA
 ‒ GS-4145 – Oxide on Mantle Adoption (Wed 5:00-5:45)
  • 15. AMD GRAPHICS CORE NEXT ARCHITECTURE A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING Faster performance Higher efficiency New graphics features New compute features GRAPHICS CORE NEXT
  • 16. AMD GRAPHICS CORE NEXT ARCHITECTURE A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING  Cutting-edge graphics performance and features  High compute density with multi-tasking  Built for power efficiency  Optimized for heterogeneous computing  Enabling the Heterogeneous System Architecture (HSA)  Amazing scalability and flexibility 16 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13 GRAPHICS CORE NEXT
  • 17. AMD GRAPHICS CORE NEXT ARCHITECTURE A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING  Unlimited Resources & Samplers  All UAV formats can be read/write  Simpler Assembly Language  Simpler Shader Code  Ability to support C/C++ (like)  Architectural support for traps, exceptions & debugging  Ability to share virtual x86-64 address space with CPU cores 17 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13 GRAPHICS CORE NEXT
  • 18. AMD GRAPHICS CORE NEXT ARCHITECTURE A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING  AMD TECHNOLOGY POWERS NEXT-GEN CONSOLES NEW NEXT-GEN GAME CONSOLES RAISE THE BAR FOR GRAPHICS PERFORMANCE PERFORMANCE TFLOPS-CLASS COMPUTE POWER MEMORY 16X MORE MEMORY * * Based on PlayStation 3 512MB vs. PlayStation 4 8192MB GDDR5. GRAPHICS CORE NEXT
  • 19. GCN COMPUTE UNIT – A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING – GRAPHICS CORE NEXT
 CU = Basic Building Block of GPU Computational Power
 New Instruction Set Architecture
 ‒ Non-VLIW
 ‒ Vector unit + scalar co-processor
 ‒ Distributed programmable scheduler
 Each CU can execute instructions from multiple kernels at once
 Increased instructions per clock per mm2
 ‒ High utilization, high throughput, multi-tasking
 [Block diagram: Branch & Message Unit, Scheduler, Scalar Unit, Vector Units (4x SIMD-16), Texture Filter Units (4), Texture Fetch Load / Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (8KB), L1 Cache (16KB)]
 19 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 20. GCN COMPUTE UNIT – GRAPHICS CORE NEXT
 [Block diagram: Branch & Message Unit, Scheduler, Scalar Unit, Vector Units (4x SIMD-16), Texture Filter Units (4), Texture Fetch Load / Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (8KB), L1 Cache (16KB)]
 4x Vector Units (16-lane SIMD)
 ‒ CU total throughput: 64 single-precision (SP) ops/clock
 ‒ 1 SP operation per lane per 4 clocks; 1 DP (double-precision) ADD in 8 clocks; 1 DP MUL/FMA/transcendental per 16 clocks*
 ‒ 4x64KB Vector Registers (VGPR)
 Scalar Unit
 ‒ Fully programmable; executes flow control, pointer arithmetic, branch instructions, etc. (as dispatched by the scheduler)
 ‒ 8KB Scalar General Purpose Registers (SGPR); 16KB 4-CU shared read-only L1 scalar data cache
 Scheduler
 ‒ Separate instruction decode/issue for: VALU, SALU, SMEM, VMEM, LDS, GDS/EXPORT, branch, and special instructions (NOPs, barriers, etc.)
 ‒ Up-to 2560 threads in flight; 16 hardware barriers
 Branch and Message Unit
 64KB Local Data Share (LDS)
 ‒ 32 banks, with conflict resolution; shared by all threads of a wavefront
 ‒ 2x larger than the D3D11 TGSM limit (32KB/thread group); bandwidth amplification
 16KB Read/Write L1 Vector Data Cache
 ‒ Shared with TMU, acts as texture cache
 4 Texture Filter Units; 16 Texture Fetch Load/Store Units
 20 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 21. GCN COMPUTE UNIT Branch & Message Unit Scheduler SIMD SPECIFICS Vector Units (4x SIMD-16) Scalar Unit Texture Filter Units (4) Texture Fetch Load / Store Units (16) GRAPHICS CORE NEXT Vector Registers (4x 64KB) Local Data Share (64KB) Scalar Registers (8KB) L1 Cache (16KB)  Each Compute Unit (CU) contains 4 SIMD; each SIMD has: ‒ A 16-lane IEEE-754 vector ALU (VALU) ‒ 64KB of vector register file (VGPR) ‒ Its own 40-bit (48-bit on HSA APUs) Program Counter (PC) ‒ Instruction buffer for 10 wavefronts* ‒ *A wavefront is a group of 64 threads: the size of one logical vGPR 21 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 22. GCN COMPUTE UNIT SCALAR UNIT LANE 0 1 2 SIMD 0 15 LANE 0 1 2 SIMD 1 15 SPECIFICS … LANE 0 1 2 15 SIMD 2 LANE 0 1 2 15 Scalar Unit SIMD 3 GCN Scalar Unit  Fully Programmable Scalar Unit replaces FF Branch Logic  Operations such as JMP [GPR] are now supported  Opens the door to e.g. virtual function calls  Has its own GPR pool and can execute normal ALU code  64-bit bitwise ops to mask thread execution  32-bit bitwise and integer arithmetic operations at full-speed  Potential to offload uniform code (Vector ALU  Scalar ALU)  A GCN CU can dispatch 1 scalar op/clock 24 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 23. GCN COMPUTE UNIT – SCALAR UNIT CONTINUED …
 Natively a 64-bit integer ALU
 Independent arbitration and instruction decode
 One ALU, memory, or control flow op per cycle
 512 Scalar GPRs per SIMD, shared between waves
 ‒ A { SGPRn+1, SGPRn } pair provides a 64-bit register
 4-CU shared read-only Scalar Data Cache: 16KB, 64B lines
 ‒ 4-way associative, LRU replacement policy
 ‒ Peak bandwidth per CU is 16 bytes/cycle
 [Diagram: SIMD 0-3 + Scalar Unit (8KB registers, integer ALU, scalar decode), 4-CU shared 16KB scalar R/O L1, R/W L2]
 25 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 24. GCN COMPUTE UNIT – BRANCH & MESSAGE UNIT – GRAPHICS CORE NEXT
 Independent scalar assist unit to handle special classes of instructions concurrently
 Branch
 ‒ Unconditional branch (s_branch)
 ‒ Conditional branch (s_cbranch_<cond>)
 ‒ Conditions: SCC == 0, SCC == 1, EXEC == 0, EXEC != 0, VCC == 0, VCC != 0
 ‒ 16-bit signed immediate dword offset from PC provided
 Messages (s_sendmsg)
 ‒ CPU interrupt with optional halt (with shader supplied code and source)
 ‒ Debug message (perf trace data, halt, etc.)
 ‒ Special graphics synchronization messages
 26 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 25. GCN COMPUTE UNIT – MEMORY SPECIFICS – GRAPHICS CORE NEXT
 Each CU has its own dedicated L1 cache and LDS memory
 ‒ Both global and shared memory atomics are supported
 64KB Local Data Share (LDS)
 ‒ 32 banks, with conflict resolution
 ‒ 16 work group barriers supported per CU
 16KB R/W L1 Vector Data Cache
 ‒ Shared with TMU as texture cache
 16KB Scalar L1 read-only data cache, shared between 4 neighboring CUs
 27 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 26. GCN COMPUTE UNIT – SCHEDULER SPECIFICS – GRAPHICS CORE NEXT
 Each CU has its own dedicated scheduler unit
 Each CU can have 40 waves in-flight
 ‒ Each potentially from a different kernel
 Scheduler limits:
 ‒ Supports up-to 2560 threads per CU (64 threads x 10 waves x 4 SIMD)
 ‒ 10 wavefronts per SIMD; 40 wavefronts per CU
 ‒ Limited by available GPR count and by available LDS memory
 ‒ 16 hardware barriers per CU
 All threads within a workgroup are guaranteed to reside on the same CU simultaneously
 ‒ A set of synchronization primitives and shared memory allow data to be passed between threads in a workgroup
 Optimized for throughput – latency is hidden by overlapping execution of wavefronts
 A GCN GPU with 44 CU, such as the AMD Radeon™ R9 290X, can be working on up-to 112,640 work items at a time!
 28 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
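The scheduler limits above are easy to sanity-check. A worked example (Python; numbers come from this slide, and `cus = 44` assumes an R9 290X-class part):

```python
# Threads in flight per CU and per GPU, from the published GCN limits.

WAVE_SIZE = 64        # threads per wavefront
WAVES_PER_SIMD = 10   # instruction-buffer slots per SIMD
SIMDS_PER_CU = 4

threads_per_cu = WAVE_SIZE * WAVES_PER_SIMD * SIMDS_PER_CU
print(threads_per_cu)          # 2560

cus = 44                       # e.g. AMD Radeon R9 290X
print(threads_per_cu * cus)    # 112640 work items in flight
```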
  • 27. GCN COMPUTE UNIT – SCHEDULER: ARBITRATION & DECODE – GRAPHICS CORE NEXT
 A CU is guaranteed to issue instructions for a wave sequentially
 ‒ Predication & control flow enable any single work-item a unique execution path
 For a CU, every clock, waves on 1 SIMD are considered for issue
 ‒ Round-robin scheduling algorithm
 Instruction types:
 ‒ 1 Vector Arithmetic Logic Unit (VALU)
 ‒ 1 Scalar ALU or Scalar Memory (SALU | SMEM)
 ‒ 1 Vector Memory (read/write/atomic) (VMEM)
 ‒ 1 Branch/Message (e.g. s_branch, s_cbranch)
 ‒ 1 Local Data Share (LDS)
 ‒ 1 Export or Global Data Share (GDS)
 ‒ 1 Special/Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio) – [no functional unit]
 At most 1 instruction from each category may be issued
 At most 1 instruction per wave may be issued
 Theoretical maximum of 5 instructions per cycle per CU, not including “internal” instructions
 29 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 28. GCN COMPUTE UNIT VECTOR & SCALAR ARBITRATION LANE 0 1 2 SIMD 0 15 LANE 0 1 2 SIMD 1 15 LANE 0 1 2 15 LANE 0 1 2 15 SIMD 2 HARDWARE VIEW Scalar Unit SIMD 3 GCN Hardware View  A GCN Compute Unit can retire 256 SP Vector ALU ops in 4 clocks  Each lane can dispatch 1 SP ALU operation per clock  Each SP ALU operation takes 4 clocks to complete  The scheduler dispatches from a different wavefront each cycle 30 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 29. GCN COMPUTE UNIT – VECTOR & SCALAR ARBITRATION – PROGRAMMER VIEW
 [Diagram: one 64-thread wavefront (lanes 0-63) spread across the 4 SIMD-16 units + scalar unit; wavefronts 0-9 resident per SIMD]
 A GCN Compute Unit can perform 64 SP Vector ALU ops / clock
 Each lane can dispatch 1 SP ALU operation per clock
 Each SP ALU operation still takes 4 clocks to complete
 But you can PRETEND your code runs 1 op on 64 threads at once
 31 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
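The round-robin cadence described above can be sketched as a toy model (Python; `issued_ops` is a hypothetical helper, not hardware-exact): with 4 or more resident wavefronts, a wave issues again exactly when its previous 4-clock vector op completes, so even back-to-back dependent ops run at full rate.

```python
# Toy model: one issue slot per SIMD, rotated round-robin across resident
# wavefronts each clock. Each vector op has 4 clocks of latency.

from collections import deque

def issued_ops(wavefronts, clocks):
    """Return which wavefront the SIMD issues for on each clock."""
    ring = deque(range(wavefronts))
    schedule = []
    for _ in range(clocks):
        wave = ring[0]      # issue for the wave at the head of the ring
        ring.rotate(-1)     # rotate so a different wave issues next clock
        schedule.append(wave)
    return schedule

# With 4 wavefronts, wave 0 issues again on clock 4 - precisely when its
# op from clock 0 (latency 4) has completed.
print(issued_ops(4, 8))   # [0, 1, 2, 3, 0, 1, 2, 3]
```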
  • 30. GCN VECTOR UNITS ALU CHARACTERISTICS  FMA (Fused Multiply Add), IEEE 754-2008 precise with all round modes, proper handling of NaN/Inf/Zero and full de-normal support in hardware for SP and DP  MULADD single cycle issue instruction without truncation, enabling a MULieee followed by ADDieee to be combined with round and normalization after both multiplication and subsequent addition  VCMP A full set of operations designed to fully implement all the IEEE 754-2008 comparison predicates  IEEE Rounding Modes (Round toward +Infinity, Round toward –Infinity, Round to nearest even, Round toward zero) supported under program control anywhere in the shader. SP and DP modes are controlled separately.  De-normal Programmable Mode control for SP and DP independently. Separate control for input flush to zero and underflow flush to zero. 32 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 31. GCN VECTOR UNITS ALU CHARACTERISTICS CONTINUED …  Divide Assist Ops IEEE 0.5 ULP Division accomplished with macro (SP/DP ~15/41 Instruction Slots, respectively)  FP Conversion Ops between 16-bit, 32-bit, and 64-bit floats with full IEEE-754 precision and rounding  Exceptions Support in hardware for floating point numbers with software recording and reporting mechanism. Inexact, underflow, overflow, division by zero, de-normal, invalid operation, and integer divide by zero operation  64-bit Transcendental Approximation Hardware based double precision approximation for reciprocal, reciprocal square root and square root  24-bit Integer MUL/MULADD/LOGICAL/SPECIAL @ full SP rates ‒ Heavily utilized for integer thread group address calculation ‒ 32-bit integer MUL/MULADD @ DP MUL/FMA rate 33 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 32. GCN SHADER AUTHORING TIPS
 GCN has greatly improved branch performance, and it continues to improve
 ‒ Don’t be afraid to use it! But, remember: use it wisely – improved != free
 ‒ It’s at its best for highly coherent workloads (where most threads take the same path)
 However, the new architecture is more susceptible to register pressure
 ‒ Using too many registers within a shader can reduce the maximum waves per SIMD!
 ‒ Take caution with respect to the following:
 ‒ Excessive nested branching/looping
 ‒ Loop unrolling
 ‒ Variable declarations (especially arrays)
 ‒ Excessive function calls requiring storing of results
 ‒ NOTE: A wavefront can allocate 104 user scalar registers, as several scalar registers are reserved for architectural state

 SGPR Count | VGPR Count | Max Waves/SIMD
 <= 48      | <= 24      | 10
 56         | 28         | 9
 64         | 32         | 8
 72         | 36         | 7
 84         | 40         | 6
 100        | 48         | 5
 > 100      | 64         | 4
            | 84         | 3
            | <= 128     | 2
            | > 128      | 1
 34 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
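The register table above can be folded into a small occupancy helper. A sketch derived from the published GCN limits (256 VGPRs per lane, 512 SGPRs per SIMD, 10-wave cap); real hardware allocates registers at a coarser granularity, so treat this as an approximation, not a driver API:

```python
# Approximate max waves per SIMD for a given shader's register usage.

def max_waves_per_simd(vgprs, sgprs):
    by_vgpr = 256 // max(vgprs, 1)    # 64KB VGPR file = 256 regs per lane
    by_sgpr = 512 // max(sgprs, 1)    # 512 SGPRs shared per SIMD
    return min(10, by_vgpr, by_sgpr)  # hardware cap of 10 waves per SIMD

# Spot-check against the table above:
assert max_waves_per_simd(vgprs=24, sgprs=48) == 10
assert max_waves_per_simd(vgprs=48, sgprs=48) == 5
assert max_waves_per_simd(vgprs=84, sgprs=48) == 3
assert max_waves_per_simd(vgprs=129, sgprs=48) == 1
```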
  • 33. GCN SHADER CODE EXAMPLE
 // Register r0 contains “a”, r1 contains “b”
 // Value is returned in r2
 v_cmp_gt_f32    r0, r1            // a > b, establish VCC
 s_mov_b64       s0, exec          // Save current exec mask
 s_and_b64       exec, vcc, exec   // Do “if” path
 s_cbranch_vccz  label0            // Branch if all lanes fail
 v_sub_f32       r2, r0, r1        // result = a – b
 v_mul_f32       r2, r2, r0        // result = result * a
 label0:
 s_andn2_b64     exec, s0, exec    // Do “else” path (s0 & !exec)
 s_cbranch_execz label1            // Branch if all lanes fail
 v_sub_f32       r2, r1, r0        // result = b – a
 v_mul_f32       r2, r2, r1        // result = result * b
 label1:
 s_mov_b64       exec, s0          // Restore exec mask
 An alternative to s_cbranch is to use VSKIP to transform VALU instructions into NOPs
 ‒ s_setvskip – enables or disables VSKIP mode. Requires 1 waitstate after executing.
 ‒ VSKIP does NOT skip VMEM instructions (Do: branch over superfluous VMEM inst.)
 35 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
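The exec-mask pattern in the listing above can be emulated lane-by-lane in scalar code. A minimal Python model (the function name is mine, not an ISA term): both sides of the branch execute, and the saved mask selects which lanes commit results.

```python
# Emulate per-lane "if (a > b) r = (a-b)*a; else r = (b-a)*b;" the way the
# GCN exec-mask sequence above does it: run both paths, mask the writes.

def exec_mask_select(a, b):
    n = len(a)
    vcc = [a[i] > b[i] for i in range(n)]   # v_cmp_gt_f32: per-lane compare
    result = [0.0] * n
    for i in range(n):                      # "if" path runs with exec = vcc
        if vcc[i]:
            result[i] = (a[i] - b[i]) * a[i]
    for i in range(n):                      # "else" path, exec = saved & ~vcc
        if not vcc[i]:
            result[i] = (b[i] - a[i]) * b[i]
    return result                           # exec mask restored afterwards

print(exec_mask_select([3.0, 1.0], [2.0, 4.0]))   # [3.0, 12.0]
```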
  • 34. GCN MEMORY CACHE HIERARCHY
 32KB instruction cache (I$) + 16KB scalar data cache (K$) shared per ~4 CUs, with L2 backing
 Each CU has its own registers and local data share
 L1 read/write caches – 64 bytes per clock of L1 bandwidth per CU
 L2 read/write cache partitions – 64 bytes per clock of L2 bandwidth per partition
 64-bit dual channel memory controllers
 Global Data Share facilitates synchronization between CUs (64KB)
 36 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 35. GCN MEMORY – VECTOR MEMORY INSTRUCTIONS – GRAPHICS CORE NEXT
 Vector memory instructions support variable granularity for addresses and data, ranging from 32-bit data to 128-bit pixel quads
 A pointer is a pointer on GCN!
 MUBUF – read from, or write/atomic to, an un-typed buffer/address
 ‒ Data type/size is specified by the instruction operation
 ‒ MUBUF is like C++ reinterpret_cast
 MTBUF – read from or write to a typed buffer/address
 ‒ Data type is specified in the resource constant
 ‒ MTBUF is like C++ static_cast
 MIMG – read/write/atomic operations on elements from an image surface
 ‒ Image objects (1-4 dimensional addresses and 1-4 dwords of homogeneous data)
 ‒ Image objects use resource and sampler constants for access and filtering
 ‒ Utilize the TMU for filtering via MIMG
 37 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 36. GCN MEMORY DEVICE FLAT MEMORY INSTRUCTIONS A GCN POINTER IS A POINTER FLAT  Flat Address Space (“flat”) instructions are new as of Sea Islands (CI) and allow read/write/atomic access to a generic memory address pointer which can resolve to any of the following physical memories: ‒ Global Memory ‒ Scratch (“private”) ‒ LDS (“shared”) ‒ Invalid - MEM_VIOL TrapStatus  Device Flat (Generic) 64b/32b Addressing Support ‒ FLAT instructions support both 64 and 32-bit addressing. The address size is set via a mode register (“PTR32”) and a local copy of the value is stored per wave. ‒ The addresses for the aperture check differ in 32 and 64-bit mode 38 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 37. GCN MEMORY EXPORT INSTRUCTION & GDS  Exports move data from 1-4 VGPRS to the fixed-function Graphics Pipeline ‒ E.g: Color (MRT0-7), Depth, Position, and Parameter  Tessellator, Rasterizer, or RBE  Global Shared Memory Ops (Utilize GDS)  The GDS is identical to the LDS, except that it is shared by all CUs, so it acts as an explicit global synchronization point between all wavefronts  The atomic units in the GDS also support ordered count operations 39 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 38. GCN MEMORY LOCAL DATA SHARE  GCN Local Data Share (LDS) is a 64KB, 32 bank (or 16) Shared Memory  Instruction issue fully decoupled from ALU instructions  Direct mode ‒ Vector Instruction Operand  32/16/8-bit broadcast value ‒ Graphics Interpolation @ rate, no bank conflicts  Index Mode – Load/Store/Atomic Operations ‒ Bandwidth Amplification, up-to 32 – 32-bit lanes serviced per clock peak ‒ Direct decoupled return to VGPRs ‒ Hardware conflict detection with auto scheduling  Software consistency/coherency for thread groups via hardware barrier  Fast & low power vector load return from R/W L1 40 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 39. GCN MEMORY CONTINUED … LOCAL DATA SHARE  An LDS bank is 512 entries, each 32-bits wide ‒ A bank can read and write a 32-bit value across an all-to-all crossbar and swizzle unit that includes 32 atomic integer units ‒ This means that several threads can read the same LDS location at the same time for FREE ‒ Writing to the same address from multiple threads also occurs at rate, last thread to write wins (useful e.g. for all threads writing uniform value to still be fast)  Typically, the LDS will coalesce 32 lanes from one SIMD each cycle ‒ One wavefront is serviced completely every 2 cycles ‒ Conflicts automatically detected across 32 lanes from a wavefront and resolved in hardware ‒ An instruction which accesses different elements in the same bank takes additional cycles 41 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
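The bank rules above suggest a simple cost model: an LDS access takes as many cycles as the worst-case number of *distinct addresses* that map to one bank, since same-address reads broadcast for free. A sketch, assuming 32 banks of 32-bit words (an illustrative model, not cycle-exact hardware behavior):

```python
# Estimate LDS access cost for a set of per-lane byte addresses.

def lds_access_cycles(addresses, banks=32, word_bytes=4):
    """Worst-case distinct addresses hitting any single bank."""
    per_bank = {}
    for addr in addresses:
        bank = (addr // word_bytes) % banks
        per_bank.setdefault(bank, set()).add(addr)
    return max(len(addrs) for addrs in per_bank.values())

# Consecutive dwords: one address per bank -> conflict-free, 1 cycle
assert lds_access_cycles([i * 4 for i in range(32)]) == 1
# All lanes read the same address: broadcast, still 1 cycle
assert lds_access_cycles([128] * 32) == 1
# Stride of 32 dwords: every lane hits bank 0 -> fully serialized
assert lds_access_cycles([i * 128 for i in range(32)]) == 32
```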
  • 40. BLOCK DIAGRAM GCN MEMORY LOCAL DATA SHARE 42 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 41. GCN MEMORY NEW MEMORY OPERATIONS LOCAL DATA SHARE  Remote Atomic Ops with Shared Memory Dual-Source Operands ‒LDS[Dst] = LDS[addr0] op LDS[addr1]; ‒ Fast remote reduction operations for arithmetic, logical, Min/Max  Read/Write/Conditional Exchange 96b/128b  32-bit FP Min/Max/Compare Swap 43 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 42. GCN MEMORY – NEW MEMORY OPERATIONS – LOCAL DATA SHARE CONTINUED …
 Fast lane swizzle operations
 ‒ Do not require allocation; no shared memory used
 ‒ Invalid reads result in a 0x0 return
 ‒ First mode: each set of four adjacent lanes can fully crossbar data, with the same switch applied to every set of four
 ‒ Second mode: applied to each consecutive set of 32 work-items
 ‒ Swap: 16, 8, 4, 2, 1
 ‒ Reverse: 32, 16, 8, 4, 2
 ‒ Broadcast: 32, 16, 8, 4, 2
 44 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
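The second swizzle mode can be modeled in a few lines. This is an illustrative reading of the slide (the function name and exact semantics are mine, not the ISA encoding), operating within each consecutive group of 32 work-items:

```python
# Model of swap / reverse / broadcast lane swizzles within 32-lane groups.

def swizzle(values, mode, size):
    out = []
    for base in range(0, len(values), 32):        # each 32-work-item group
        group = values[base:base + 32]
        res = list(group)
        for i in range(len(group)):
            if mode == "swap":                    # swap adjacent size-blocks
                res[i] = group[i ^ size]          # (size is a power of two)
            elif mode == "reverse":               # reverse within each block
                block = (i // size) * size
                res[i] = group[block + size - 1 - (i - block)]
            elif mode == "broadcast":             # copy first lane of block
                res[i] = group[(i // size) * size]
        out.extend(res)
    return out

lanes = list(range(32))
assert swizzle(lanes, "swap", 1)[:4] == [1, 0, 3, 2]
assert swizzle(lanes, "reverse", 4)[:4] == [3, 2, 1, 0]
assert swizzle(lanes, "broadcast", 4)[:8] == [0, 0, 0, 0, 4, 4, 4, 4]
```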
  • 43. GCN MEMORY – LOCAL DATA SHARE – OPERATION DIAGRAMS
 [Diagrams: 4-lane crossbar, plus swap / reverse / broadcast swizzles at sizes 16, 8, 4, 2, 1 across lanes 0-63]
 45 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 44. GCN MEMORY BLOCK DIAGRAM READ/WRITE CACHE  Reads and writes cached ‒ Bandwidth amplification ‒ Improved behavior on more memory access patterns ‒ Improved write to read reuse performance  Relaxed consistency memory model ‒ Consistency controls available to control locality of load/store  GPU Coherent ‒ Acquire/Release semantics control data visibility across the machine (GLC bit on load/store) ‒ GCN APUs also have SLC bit to control data visibility to CPU caches ‒ L2 coherent = all CUs can have the same view of data  Global Atomics ‒ Performed in L2 cache (GDS also has global atomics) 46 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 45. GCN MEMORY READ/WRITE L1 CACHE ARCHITECTURE ‒ Each CU has its own Vector L1 Data Cache ‒ 16KB L1, 64B lines, 4 sets x 64-way ‒ ~64B/CLK bandwidth per Compute Unit ‒ Write-through – alloc on write (no read) w/dirty byte mask ‒ Write-through at end of wavefront ‒ Decompression on cache read out ‒ Instruction GLC bit defines cache behavior (GCN APUs also have SLC bit) ‒ GLC = 0; ‒ Local caching (full lines left valid) ‒ Shader write back invalidate instructions ‒ GLC = 1; ‒ Global coherent (hits within wavefront boundaries) 47 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 46. GCN MEMORY READ/WRITE L2 CACHE ARCHITECTURE ‒ 64-128KB L2 per Memory Controller Channel ‒ Up-to 16 L2 cache partitions ‒ 64B lines, 16-way set associative ‒ ~64B/CLK per channel for L2/L1 bandwidth ‒ Write-back - alloc on write (no read) w/ dirty byte mask ‒ Acquire/Release semantics control data visibility across CUs ‒ L2 Coherent = all CUs can have the same view of data ‒ Remote Atomic Operations ‒ Common Integer Set & Floating Point Min/Max/CmpSwap 48 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 47. GCN MEMORY INFORMATION BANDWIDTH ‒ Each CU has 64 bytes per cycle of L1 bandwidth ‒ Shared with the GDS ‒ Per L2 there’s 64 bytes of data per cycle as well ‒ Peak Scalar L1 Data Cache Bandwidth per CU is 16 bytes/cycle ‒ Peak I-Cache Bandwidth per CU is 32 bytes/cycle (Optimally 8 instructions) ‒ LDS Peak Bandwidth is 128 bytes of data per cycle via bandwidth amplification ‒ For R9 290x: ‒ That’s nearly 5.5 TB/s of LDS BW, 2.8 TB/s of L1 BW, and 1 TB/s of L2 BW! ‒ 512-bit GDDR5 Main Memory has over 320 GB/sec bandwidth ‒ PCI Express 3.0 x16 bus interface to system (32GBps) 49 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
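A quick back-of-the-envelope reproduction of the R9 290X aggregate figures quoted above, assuming 44 CUs, 16 L2 partitions, and a ~1 GHz engine clock (illustrative arithmetic, not a spec sheet):

```python
# Aggregate on-chip bandwidth for an R9 290X-class part.

CLOCK_HZ = 10**9     # ~1 GHz engine clock (assumption)
CUS = 44
L2_PARTITIONS = 16   # one per memory channel on a 512-bit part

lds_bw = 128 * CUS * CLOCK_HZ          # 128 B/clk of LDS per CU
l1_bw  =  64 * CUS * CLOCK_HZ          #  64 B/clk of L1 per CU
l2_bw  =  64 * L2_PARTITIONS * CLOCK_HZ

print(lds_bw / 1e12, "TB/s LDS")   # ~5.6 ("nearly 5.5 TB/s" as quoted)
print(l1_bw / 1e12, "TB/s L1")     # ~2.8
print(l2_bw / 1e12, "TB/s L2")     # ~1.0
```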
  • 48. GCN MEMORY – BANDWIDTH & LATENCY TABLES
 Bandwidth:
 LDS: 128 bytes / clock | K$: 16 bytes / clock | L1: 64 bytes / clock
 Latency:
 Resident:     LDS: short | K$: short (1x)    | L1: long (20x)
 Non-resident: LDS: N/A   | K$: medium (10x)  | L1: long (20x)
 Main takeaways:
 ‒ LDS is optimized for bandwidth amplification and atomics
 ‒ K$ is optimized for periodic low-latency reads of small datasets
 ‒ L1 is optimized for high-bandwidth texture fetches and streaming
 50 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 49. GCN MEMORY BLOCK DIAGRAM L1 TEXTURE CACHE  The memory hierarchy is re-used for graphics  Some dedicated graphics hardware added ‒ Address-gen unit receives 4 texture addr/clock ‒ Calculates 16 sample addr (nearest neighbors) ‒ Reads samples from L1 vector data cache ‒ Decompresses samples in Texture Mapping Unit (TMU) ‒ TMU filters adjacent samples, produces <= 4 interpolated texels/clock ‒ TMU output undergoes format conversion and is written into the vector register file ‒ The format conversion hardware is also used for writing certain formats to memory from graphics shaders 51 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 50. GCN MEMORY – X86-64 VIRTUAL MEMORY
 The GCN cache hierarchy was designed to integrate with x86-64 microprocessors
 The GCN virtual memory system can support 4KB pages
 ‒ Natural mapping granularity for the x86-64 address space
 ‒ Paves the way for a shared address space in the future
 ‒ All GCN hardware can already translate requests into the x86-64 address space
 GCN caches use 64B lines, the same size x86-64 processors use
 The stage is set for heterogeneous systems (e.g. AMD A-Series APUs) to transparently share data between the GPU and CPU through the traditional caching system, without explicit programmer control!
 52 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 51. GCN COMPUTE ARCHITECTURE – R9 290X – A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING
                       AMD Radeon™ HD 7970 GHz Edition | AMD Radeon™ R9 290X      | Increase
 Geometry processing   2.1 billion primitives/sec      | 4 billion primitives/sec | 1.9x
 Compute               4.3 TFLOPS                      | 5.6 TFLOPS               | 1.3x
 Texture fill rate     134.4 Gtexels/sec               | 176 Gtexels/sec          | 1.3x
 Pixel fill rate       33.6 Gpixels/sec                | 64 Gpixels/sec           | 1.9x
 Peak bandwidth        264 GB/sec                      | 320 GB/sec               | 1.2x
 Die area              352 mm2                         | 438 mm2                  | 1.24x
 Peak GFLOPS/mm2       12.2                            | 12.8                     | 1.05x
 53 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
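The compute-density row of the table above follows directly from the compute and die-area rows. A one-line check (Python; the rounding convention is mine):

```python
# Peak compute density: GFLOPS per mm2 of die area.

def gflops_per_mm2(gflops, die_mm2):
    return round(gflops / die_mm2, 1)

assert gflops_per_mm2(4300, 352) == 12.2   # HD 7970 GHz Edition
assert gflops_per_mm2(5600, 438) == 12.8   # R9 290X
```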
  • 52. GCN COMPUTE ARCHITECTURE SHADER ENGINE A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING  Each GCN GPU can contain up-to 4 Shader Engines ‒ Load balanced with each other ‒ Screen partitioning of pixel assignment  A Shader Engine is a high level organizational unit containing: ‒ 1 Geometry Processor (1 Primitive Per Cycle Throughput) ‒ 1 Rasterizer ‒ 1-16 CUs (Compute Units) ‒ Instruction I$ and constant K$ caches shared by up to 4 CU each ‒ 1-4 RBEs (Render Back Ends) ‒ Up-to 16 – 64b pixels/cycle per Shader Engine ‒ Up-to 8 – 128b pixels/cycle per Shader Engine 54 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 53. GCN COMPUTE ARCHITECTURE – R9 290X – A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING – GRAPHICS CORE NEXT
 44 Compute Units
 4 Geometry Processors
 ‒ 4 billion primitives/sec
 64 pixel output/clock
 ‒ 64 Gpixels/sec fill rate
 1MB L2 Cache
 ‒ Up-to 1 TB/sec L2/L1 bandwidth
 512-bit GDDR5 memory interface
 ‒ 320 GB/sec memory bandwidth
 6.2 billion transistors
 ‒ 438 mm2 on 28nm process node
 ‒ 12.8 GFLOPS/mm2
 55 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 54. GCN COMPUTE ARCHITECTURE SEA ISLANDS A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT  8 ASYNCHRONOUS COMPUTE ENGINES (ACE) ‒ Operate in parallel with Graphics CP ‒ Independent scheduling and work item dispatch for efficient multi-tasking ‒ 9 Devices with 64+ Command Queues! ‒ Fast context switching ‒ Exposed in OpenCL™  Dual DMA engines ‒ Can saturate PCIe 3.0 x16 bus bandwidth (16 GB/sec bidirectional) 56 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 55. GCN COMPUTE ARCHITECTURE SEA ISLANDS A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT  ACEs are responsible for compute shader scheduling & resource allocation  Each ACE fetches commands from cache or memory & forms task queues  Tasks have a priority level for scheduling ‒ Background  Realtime  ACE dispatch tasks to shader arrays as resources permit  Tasks complete out-of-order, tracked by ACE for correctness  Every cycle, an ACE can create a workgroup and dispatch one wavefront from the workgroup to the CUs 57 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 56. GCN COMPUTE ARCHITECTURE SEA ISLANDS A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT  ACE are independent ‒ But, can synchronize and communicate via Cache/Memory/GDS  ACE can form task graphs ‒ Individual tasks can have dependencies on one another ‒ Can depend on another ACE ‒ Can depend on part of graphics pipe  ACE can control task switching ‒ Stop and Start tasks and dispatch work to shader engines 58 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 57. GCN COMPUTE ARCHITECTURE SEA ISLANDS A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT  Focus in GPU hardware shifting away from graphics-specific units, towards general-purpose compute units  R9 290x GCN-based ASICs already have 8:1 ACE : CP ratio ‒ CP can dispatch compute ‒ ACE cannot dispatch graphics  If you aren’t writing Compute Shaders, you’re not getting the absolute most out of modern GPUs ‒ Control: LDS, barriers, thread layout, ... 59 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 58. GCN COMPUTE ARCHITECTURE SEA ISLANDS A NEW GPU DESIGN FOR A NEW ERA OF COMPUTING GRAPHICS CORE NEXT Future Trends:  More Compute Units ‒ ALU outpaces Bandwidth  CPU + GPU Flat Memory ‒ APU + dGPU  Less Fixed Function Graphics ‒ Can you write a Compute-based graphics pipeline? ‒ Start thinking about it…  60 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 59. GCN FIXED FUNCTION ARCHITECTURE GEOMETRY  Four Geometry Processors (each with Geometry Assembler, Tessellator, and Vertex Assembler)  Updated hardware geometry units ‒ Off-chip buffering improvements ‒ Larger parameter and position cache  Process and rasterize up to 4 primitives per clock cycle  GS + Tessellation is faster than before…  However… memory is still the bottleneck! ‒ Minimize the number of inputs and outputs for best performance…  Small expansions can be done within LDS! 61 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13 Image from Battlefield 3, EA DICE
  • 60. GCN FIXED FUNCTION ARCHITECTURE RASTERIZER  We now have 4 Rasterizers on R9 290x (4 triangles x 16 pixels = 64 pixels per clock) ‒ Each rasterizer can read in a single triangle per cycle, and write out 16 pixels  Caveat: tiny (e.g. sub-pixel) triangles can dramatically reduce efficiency ‒ 16 pixels per clock = 100% efficiency; 12 pixels per clock = 75%; 28 pixels in 2 clocks vs. three 1-pixel triangles in 3 clocks = 1 pixel per clock = 6.25% efficiency  This can cause us to become raster-bound, starving the shader and holding up geometry! 62 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
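The efficiency figures on this slide can be reproduced with a back-of-the-envelope model (hypothetical helpers, assuming one triangle read per rasterizer per clock and a 16-pixel output width, as described above):

```c
#include <assert.h>

/* Clocks a single rasterizer spends on one triangle: it can write at
 * most 16 of the triangle's covered pixels per clock, and even a
 * sub-pixel triangle occupies it for a full clock. */
int clocks_for_triangle(int covered_pixels)
{
    int clocks = (covered_pixels + 15) / 16;   /* ceil(pixels / 16) */
    return clocks < 1 ? 1 : clocks;
}

/* Efficiency in whole percent against the 16 pixels/clock peak. */
int raster_efficiency_pct(int covered_pixels, int clocks)
{
    return (covered_pixels * 100) / (clocks * 16);
}
```

Plugging in the slide's numbers: 12 pixels in one clock gives 75%, while three 1-pixel triangles over 3 clocks give 6% (6.25% before integer truncation).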
  • 61. GCN FIXED FUNCTION ARCHITECTURE TESSELLATION + RASTERIZER EFFICIENCY  Rasterizer efficiency falls with triangle size: 75-90% efficiency (~13 pixels per clock) → 18-25% (~4 pixels per clock) → 6.25% (1 pixel per clock)  Over-Tessellation reduces rasterizer efficiency ‒ Extreme Tessellation = 6.25% Efficiency  Also impacts ROPs and MSAA efficiency ‒ High number of polygon edges to AA ‒ Consumes dramatically more bandwidth ‒ If nFragments > nSamples, quality will be lost ‒ E.g. 16 verts affecting 1 pixel @ 8xMSAA 63 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 62. GCN FIXED FUNCTION ARCHITECTURE TESSELLATION + SHADING EFFICIENCY  Over-Tessellation reduces shader efficiency  HS, DS and VS run many times for each final image pixel ‒ Yet don’t contribute much to final image quality  The graphics pipeline is not designed for this abuse!  Consider Alternatives: ‒ Parallax Occlusion Mapping ‒ […]  (Chart: shading passes per pixel — overshade — ranging from 1 to 8.) Image courtesy: Kayvon Fatahalian, “Evolving the Direct3D Pipeline for Real-time Micropolygon Rendering,” from ACM SIGGRAPH 2010 course: “Beyond Programmable Shading II” 64 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 63. GCN Tessellation – Best Practices  While performance is much improved, it is still a potential bottleneck! ‒ Produces a great deal of IO traffic, starving other parts of the pipeline  Best performance generally achieved with tessellation factors less than 15!  Continue to Optimize: ‒ Pre-triangulate ‒ Distance-adaptive ‒ Screen-space adaptive ‒ Orientation-adaptive ‒ Backface Culling ‒ Frustum Culling ‒ […] 65 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
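The distance-adaptive scheme in the list above can be sketched as host-side C. This is an illustrative helper, not a prescribed formula — the constants and the clamp to 15 follow the tess-factor guidance on this slide:

```c
#include <assert.h>

/* Distance-adaptive tessellation factor: more triangles up close,
 * fewer far away, clamped into the [1, 15] range that this slide
 * recommends for best performance. Constants are illustrative. */
float tess_factor(float distance, float base_factor)
{
    float d = distance > 1.0f ? distance : 1.0f;  /* avoid divide-by-~0 */
    float f = base_factor / d;
    if (f < 1.0f)  f = 1.0f;    /* never below un-tessellated */
    if (f > 15.0f) f = 15.0f;   /* stay under the sweet spot */
    return f;
}
```

A mesh 32 units away with a base factor of 64 would get factor 2; up close it saturates at 15 rather than over-tessellating.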
  • 64. GCN FIXED FUNCTION ARCHITECTURE RASTERIZER  We now have 4 Geometry Processors on R9 290x ‒ Overall Primitive Rate = 4 prims per clock (ideal)  We now have 4 Rasterizers on R9 290x (4 triangles x 16 pixels = 64 pixels per clock) ‒ Each rasterizer can read in a single triangle per cycle, and write out 16 pixels  Caveat: tiny (e.g. sub-pixel) triangles can dramatically reduce efficiency  This can cause us to become raster-bound, unable to rasterize at peak rate!  (Diagram: Command Processor feeding 4 Geometry Processors — each with Geometry Assembler, Tessellator, Vertex Assembler — into the Compute Units, then 4 Rasterizers — each with Scan Converter and Hierarchical Z — into the Render Back-Ends.) 66 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 65. GCN FIXED FUNCTION ARCHITECTURE RENDER BACK ENDS  Once the pixel fragments in a tile have been shaded, they flow to the Render Back-Ends (RBEs) ‒ There are 16 RBEs on R9 290x  Z/Stencil ROPs and Color ROPs, each with a dedicated cache: ‒ 16KB Color Cache ‒ Up to 8 color + 16 coverage samples (16x EQAA) ‒ 8KB Depth Cache ‒ Up to 8 depth samples (8x MSAA) ‒ Writes un-cached via memory controllers ‒ 64 – 64B pixels per cycle ‒ 256 Depth Test (Z) / Stencil Ops per cycle  Logic Operations as an alternative to Blending ‒ Exposed in Direct3D 11.1 ‒ Also available in OpenGL  Dual-Source Color Blending with MRTs ‒ Only available in OpenGL * 67 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 66. GCN FIXED FUNCTION ARCHITECTURE DEPTH IMPROVEMENTS 24-BIT DEPTH FORMATS ARE INTERNALLY REPRESENTED AS 32-BITS  Fast-accept of fully-visible triangles spanning one or more tiles ‒ If a triangle fully covers a tile, the cost is only 1 clock/tile  Depth Bounds Test (DBT) Extension ‒ Exposed in OpenGL via GL_EXT_depth_bounds_test ‒ Exposed in Direct3D 11 via extension 68 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 67. GCN FIXED FUNCTION ARCHITECTURE STENCIL IMPROVEMENTS  GCN has support for new extended stencil ops ‒Only available in OpenGL: GL_AMD_stencil_operation_extended ‒Additional stencil ops: ‒AND, XOR, NOR ‒REPLACE_VALUE_AMD ‒etc. ‒ Also exposes additional stencil op source value ‒ Can be used as an alternative to stencil ref value  Stencil ref and op source value can now be exported from pixel shader ‒Only available in OpenGL: GL_AMD_shader_stencil_value_export 69 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 68. GCN LOW-LEVEL TIPS GPR PRESSURE  General Purpose Registers (GPRs) are a limited resource (Illustration: banks of GCN Vector GPRs) ‒ Separate banks of GPRs for Vector and Scalar (per SIMD) ‒ Maximum of 256 VGPRs and 512 SGPRs shared across all waves (up to 10) owned by a SIMD ‒ Organized as 64 words of 32 bits – two adjacent GPRs can be combined for 64-bit values (4 for 128-bit) ‒ The number of GPRs required by a shader affects SIMD scheduling and execution efficiency ‒ Shader tools can be used to determine how many GPRs are used…  GPR pressure is affected by: ‒ Loop Unrolling ‒ Long lifetime of temporary variables ‒ Nested Dynamic Flow Control instructions ‒ Fetch dependencies (e.g. indexed constants) 70 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
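The occupancy impact described above can be sketched numerically. This is a simplification that uses only the 256-VGPR file and 10-wave cap from this slide, ignoring SGPR, LDS, and real allocation granularity, which also constrain wave count:

```c
#include <assert.h>

/* Rough occupancy model: a SIMD owns 256 VGPRs and can host at most
 * 10 wavefronts, so a shader's VGPR count caps how many waves can
 * run concurrently (and thus how well latency can be hidden). */
int max_waves_per_simd(int vgprs_per_wave)
{
    if (vgprs_per_wave <= 0)
        return 10;                        /* scheduler cap */
    int by_vgprs = 256 / vgprs_per_wave;  /* whole waves that fit */
    return by_vgprs < 10 ? by_vgprs : 10;
}
```

Under this model a shader using 32 VGPRs allows 8 waves per SIMD, while one using 84 VGPRs drops to 3 — which is why loop unrolling and long-lived temporaries can silently cost occupancy.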
  • 69. GCN LOW-LEVEL TIPS TEXTURE FILTERING ‒ Point sampling is full-rate on all formats ‒ Trilinear filtering costs up to 2x the bilinear filtering cost ‒ Anisotropic (N taps) costs <= N x bilinear ‒ Avoid cache thrashing! ‒ Use MIPmapping ‒ Use Gather() where applicable ‒ Exploit neighbouring pixel shader thread/CU locality: ‒ Sampling from texels resident on the same CU can have a lower cost ‒ Exploit this explicitly by using Compute Shaders 71 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 70. GCN LOW-LEVEL TIPS COLOR OUTPUT  PS Output: each additional color output increases export cost  Exports can be more costly than PS execution! ‒ Each (fast) export is equivalent to 64 ALU ops on R9 290X ‒ If a shader is export-bound then use the “free” ALU for packing instead  Watch out for export-bound cases ‒ E.g. G-Buffer parameter writes ‒ MINIMIZE SHADER INPUTS AND OUTPUTS! ‒ Pack, pack, pack, pack!  Costs of outputting and blending vary by format ‒ discard/clip allow the shader hardware to skip the rest of the work * Miss “PACK” Man kindly reminds you to “Pack pack pack!”  72 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
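The kind of ALU-side packing the slide recommends when exports are the bottleneck looks like this in scalar C (a hypothetical helper; in a shader the same shifts-and-ORs run on the "free" ALU cycles of an export-bound wave):

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 8-bit channels into one 32-bit word (R in the low byte):
 * a few cheap ALU ops traded for one narrow export instead of
 * several wide ones. */
uint32_t pack_rgba8(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint32_t)r | ((uint32_t)g << 8) |
           ((uint32_t)b << 16) | ((uint32_t)a << 24);
}
```
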
  • 71. GCN MEDIA PROCESSING MEDIA INSTRUCTIONS  SAD = Sum of Absolute Differences  Critical to video & image processing algorithms ‒ Motion detection ‒ Gesture recognition ‒ Video & image search ‒ Stereo depth extraction ‒ Computer vision  SAD (4x1) and QSAD (4× 4x1) instructions ‒ New QSAD combines SAD with alignment ops for higher performance and reduced power draw ‒ Evaluate up to 256 pixels per CU per clock cycle!  Maskable MQSAD instruction ‒ Allows background pixels to be ignored ‒ Accelerated isolation of moving objects  New: 32-bit destination accumulator register ‒ SAD/QSAD/MQSAD U32/U16 accumulators with saturation  (Diagram: SAD computed between a reference strip and several candidate strips; the lowest SAD is the closest match.) AMD Radeon R9 290x can evaluate 11.26 Terapixels/sec * * Peak theoretical performance for 8-bit integer pixels 73 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
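For reference, here is what one SAD step computes, as a scalar C sketch (`sad4` is a hypothetical helper, not the instruction encoding — the hardware evaluates many of these strips per clock):

```c
#include <assert.h>
#include <stdlib.h>

/* Sum of Absolute Differences over a 4x1 strip of 8-bit pixels --
 * the comparison one SAD instruction performs in a single step.
 * In block matching, the candidate with the lowest SAD is the
 * closest match. */
unsigned sad4(const unsigned char a[4], const unsigned char b[4])
{
    unsigned sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}
```
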
  • 72. GCN MEDIA PROCESSING VIDEO CODEC ENGINE  Video Codec Engine (VCE) ‒ Hardware H.264 Compression and Decompression ‒ Ultra-low-power, fully fixed-function mode ‒ Capable of 1080p @ 60 frames / second ‒ Programmable for Ultra High Quality and/or Speed ‒ Entropy encoding block fully accessible to software ‒ AMD Accelerated Parallel Processing SDK ‒ OpenCL ™ ‒ Create hybrid faster-than-real-time encoders! ‒ Custom motion estimation ‒ Inverse DCT and motion compensation ‒ Combine with hardware entropy encoding!  AMD Radeon R9 290x can compress Realtime+ 1080p H.264 74 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 73. GCN MEDIA PROCESSING AMD TRUEAUDIO  Multiple integrated Tensilica HiFi EP Audio DSP cores  Dedicated Audio DSP solution for game sound effects  Guaranteed real-time performance and service  Designed for game audio artists and engineers to take their artistic vision beyond sound production into the realm of sound processing  Intended to transform game audio as programmable shaders transformed graphics 75 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 74. GCN MEDIA PROCESSING AMD TRUEAUDIO SPATIALIZATION / 3D AUDIO REVERBS AUDIO/VOICE STREAMS MASTERING LIMITERS 76 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 75. HEAR MORE REALTIME VOICES AND CHANNELS IN A GAME 77 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 76. ENABLES AMAZING DIRECTIONAL AUDIO OVER ANY OUTPUT 78 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 77. CONCLUSIONS GCN ARCHITECTURE TAKEAWAYS ‒GCN offers increased flexibility & efficiency, with reduced complexity! ‒Non-VLIW Architecture improves efficiency while reducing programmer burden ‒Constants/resources are just address + offset now in the hardware ‒UAV/SRV/SUV read/write any format – like CPU C++ reinterpret_cast & static_cast ‒Has virtual memory & GPU flat memory, moving towards CPU + GPU flat memory ‒GCN is designed with a forward-looking focus on Compute ‒Scalar unit for complex dynamic control flow + branch & message unit ‒64KB LDS/CU, 64KB GDS, atomics at every stage, coherent cache hierarchy ‒8 Asynchronous Compute Engines (ACE) for multitasking compute ‒ 8 ACE x 8 HQD (per ACE) = 64 HQD (HQD = Hardware Queue Descriptors) 79 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 78. CONCLUSIONS GCN ARCHITECTURE TAKEAWAYS CONTINUED … ‒GCN generally simplifies your life as a programmer ‒Don’t: fret too much about instruction grouping, or vectorization ‒Do: Think about GPR utilization & LDS usage (impacts max # of wavefronts) ‒Do: Think about thread/CU locality when you structure your algorithm ‒Do: Exploit the low-latency 4-CU Shared 16KB Scalar L1 Data Cache (K$) ‒Do: Pack shader inputs and outputs – aim to be IO/bandwidth thin! ‒ Pack PS exports into non-blended 64-bit formats for optimal ROP utilization ‒ But, remember that 32-bit formats still use less bandwidth ‒ Keep geometry (HS, VS, GS, DS) stage IO under 4 float4 (ideally less!) ‒Unlimited number of addressable constants/resources ‒Constants aren’t free anymore – each consumes resources, so use sparingly! ‒Compute is the future – exploit its power for GPGPU work & graphics! 80 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 79. THANK YOU 问题? QUESTIONS?  質問がありますか? ^_^ Layla Mah layla.mah@amd.com 81 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 80. BONUS SLIDES 82 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 81. THE BONUS SLIDES 83 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 82. TILED RESOURCES & PARTIALLY RESIDENT TEXTURES MegaTexture in id Tech5
  • 83. Tiled Resources & Partially Resident Textures – INTRODUCTION  Enables the application to manage more texture data than can physically fit in a fixed footprint ‒ Known as: Tiled Resources (Direct3D 11.2) and Partially Resident Textures (OpenGL 4.2) ‒ A.k.a. “virtual texturing“ and “sparse texturing”  The principle behind PRT is that not all texture contents are likely to be needed at any given time ‒ The current render view may only require selected portions of a texture to be resident in memory ‒ Or, only selected MIPmap levels…  PRT textures only have a portion of their data mapped into GPU-accessible memory at a time ‒ Texture data can be streamed in on-demand ‒ Texture sizes up to 32TB (16k x 16k x 8k x 128-bit)  OpenGL extension – GL_AMD_sparse_texture 85 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 84. Tiled Resources & Partially Resident Textures – TEXTURE TILES  The PRT texture is chunked into 64KB tiles ‒ Fixed memory size ‒ Not dependent on texture type or format  Highlighted areas represent texture data that needs the highest resolution — the texture tiles that need to be resident in GPU memory  Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008 86 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
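Because the tile size is fixed at 64KB, the number of texels per tile follows directly from the format's bytes-per-texel. A sketch of that arithmetic (a hypothetical helper; the hardware arranges these texels into a 2D footprint, e.g. 128x128 for 32-bit formats, and block-compressed formats pack still more texels per tile):

```c
#include <assert.h>

/* Texels held by one fixed-size 64KB PRT tile, given the format's
 * bytes per texel. The tile's byte size never changes; only its
 * texel footprint does. */
int texels_per_tile(int bytes_per_texel)
{
    return (64 * 1024) / bytes_per_texel;
}
```
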
  • 85. Tiled Resources & Partially Resident Textures – TRANSLATION TABLE The GPU virtual memory page table translates 64KB tiles into a resident texture tile pool Texture Map Page Table Texture Tile Pool (Video Memory) (linear storage) 64KB tile Unmapped page entry Mapped page entry Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008 87 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 86. Tiled Resources & Partially Resident Textures – MIP MAPS  Not all tiles from the texture map are actually resident in video memory  The PRT hardware page table stores virtual → physical mappings  (Diagram: Texture Map MIP levels → Page Table → Texture Tile Pool in video memory; 64KB tiles, with mapped and unmapped page entries.) Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008 88 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 87. Tiled Resources & Partially Resident Textures – TILE MANAGEMENT The Application is responsible for uploading/releasing new PRT tiles! A common scenario is to upload lower MIPMaps to texture tile pool ‒ This allows a full representation of the PRT contents to be resident in memory (albeit at lower resolution) ‒ e.g. MIP LOD 6 and above for 16kx16k 32-bits texture is about 650KB (256x256 resolution) Texture tiles corresponding to higher resolution areas are uploaded by the application as needed ‒ e.g. As camera gets closer to a PRT-textured polygon the requirement for texels:screen pixels ratio increases, thus higher LOD tiles need uploading 89 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 88. Tiled Resources & Partially Resident Textures – “FAILED” FETCH How does the application know which texture tiles to upload? Answer: PRT-specific texture fetch instructions in pixel shader ‒ Return a “Failed” texel fetch condition when sampling a PRT pixel whose tile is currently not in the pool ‒ OpenGL example: int glSparseTexture( gsampler2D sampler, vec2 P, inout gvec4 texel ); This information is then stored in render target or UAV ‒ Texel fetch failed for a given (x, y) tile location ...and then copied to the CPU so that application can upload required tiles App chooses what to render until missing data gets uploaded 90 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
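The residency flow above can be modeled as a toy page-table lookup in C. All names here — `PageTable`, `prt_fetch_tile`, the `-1` miss code — are hypothetical stand-ins for the real hardware page table and fetch condition code:

```c
#include <assert.h>

#define PRT_TILE_NOT_RESIDENT (-1)  /* stand-in for the "failed fetch" code */
#define NUM_TILES 16

/* Toy model of a PRT fetch: a virtual-tile -> pool-slot page table.
 * A lookup either yields the resident pool slot, or reports the miss
 * that the application must service by uploading the tile. */
typedef struct {
    int pool_slot[NUM_TILES];   /* -1 marks an unmapped page entry */
} PageTable;

int prt_fetch_tile(const PageTable *pt, int tile)
{
    int slot = pt->pool_slot[tile];
    return slot < 0 ? PRT_TILE_NOT_RESIDENT : slot;
}
```

In the real pipeline the miss code is written to a render target or UAV and read back by the CPU, as the slide describes; the model only shows the hit/miss decision itself.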
  • 89. Tiled Resources & Partially Resident Textures – “LOD WARNING”  PRT fetch condition code can also indicate an “LOD Warning”  The minimum LOD warning is specified by the application on a per-texture basis ‒ OpenGL example: glTexParameteri( <target>, MIN_WARNING_LOD_AMD, <LOD warning value> );  If a fetched pixel’s LOD is below the specified LOD warning value then the condition code is returned  This functionality is typically used to try to predict when higher-resolution MIP levels will be needed ‒ E.g. Camera getting closer to PRT-mapped geometry 91 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 90. Tiled Resources & Partially Resident Textures – EXAMPLE USAGE 1. App allocates PRT (e.g. 16kx16k DXT1) using PRT API 2. App uploads MIP levels using API calls 3. Shader fetches PRT data at specified texcoords Two possibilities: 3.a. Texel data belongs to a resident (64KB) tile - Valid color returned, no error code 3.b. Texel data points to non-resident tile or specified LOD - Error/LOD Warning code returned - Shader writes tile location and error code to RT or UAV 4. App reads RT or UAV and upload/release new tiles as needed 92 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 91. Tiled Resources & Partially Resident Textures – TYPES, FORMATS & DIMENSIONS  All texture types and formats supported ‒1D, 2D, cube, arrays and 3D volume textures ‒All common texture formats ‒ Including compressed formats ‒Maximum dimensions: ‒16k x 16k x 8k x 128-bit textures 93 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 92. Hardware PRT > Software Implementation  Hardware PRT: • Ease of implementation – complexity hidden behind HW & API • Full filtering support, including anisotropic filtering • Full-speed filtering  Software implementation: • Requires “manual” filtering • Software anisotropic is very costly  Don’t go overboard with PRT allocation! • Page table entries are 4 DWORDs each • Entries have to be resident in video memory 94 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 93. 问题? QUESTIONS?  質問がありますか? ^_^ Layla Mah layla.mah@amd.com 95 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13 @MissQuickstep
  • 94. Trademark Attribution AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2013 Advanced Micro Devices, Inc. All rights reserved. 96 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 95. THE BONUS SLIDES SHADER CODE 97 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 96. SHADER CODE EXAMPLE #2

    float fn0(float a, float b)
    {
        float c = 0.0;
        float d = 0.0;
        for (int i = 0; i < 100; i++) {
            if (c > 113.0) break;
            c = c * a + b;
            d = d + 1.0;
        }
        return d;
    }

    // Registers: r0 contains "a", r1 contains "b", r2 contains "c"
    // and r3 contains "d". Value is returned in r3.
      v_mov_b32      r2, #0.0           // float c = 0.0
      v_mov_b32      r3, #0.0           // float d = 0.0
      s_mov_b64      s0, exec           // Save execution mask
      s_mov_b32      s2, #0             // i = 0
    label0:
      s_cmp_lt_s32   s2, #100           // i < 100
      s_cbranch_sccz label1             // Exit loop if not true
      v_cmp_le_f32   vcc, r2, #113.0    // c <= 113.0 (lanes that keep looping)
      s_and_b64      exec, vcc, exec    // Mask off lanes where c > 113.0
      s_cbranch_execz label1            // Exit if no lanes remain active
      v_mul_f32      r2, r2, r0         // c = c * a
      v_add_f32      r2, r2, r1         // c = c + b
      v_add_f32      r3, r3, #1.0       // d = d + 1.0
      s_add_s32      s2, s2, #1         // i++
      s_branch       label0             // Jump to start of loop
    label1:
      s_mov_b64      exec, s0           // Restore exec mask

  98 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13
  • 97.
  • 98. DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners. 100 | A CRASH COURSE: THE AMD GCN ARCHITECTURE | JANUARY 31, 2014 | APU13

Editor's Notes

  1. This is now our next era – we simply called it Graphics Core Next. From a graphics standpoint, it delivers cutting-edge features and performance, while still being very flexible and scalable, allowing all our Southern Islands parts to leverage the core. GCN delivers an amazing step up in terms of heterogeneous computing – both in terms of a new, simpler and more powerful programming model, and in terms of sheer efficiency and performance.
  2. Prior to 2002: graphics-specific hardware — texture mapping/filtering, geometry processing, rasterization, dedicated texture and pixel caches; dot product and scalar multiply-add sufficient for basic graphics tasks; no general-purpose compute capability. 2002–2006: graphics-focused programmability — DirectX 8/9, floating-point processing (IEEE compliance not required), specialized ALUs for vertex & pixel processing, limited shaders, more dedicated caches (vertex, texture, color, depth). 2007 to present: unified shader architectures — VLIW5, flexible and optimized for graphics workloads; VLIW4, simplified and optimized for more general workloads; more advanced caching (instruction, constant, multi-level texture/data, local/global data shares); basic general-purpose compute (CAL, Brook, ATI Stream); IEEE-compliant floating-point math; graphics performance still the primary objective.
  11. Our VLIW4 and VLIW5 architecture is a powerful architecture that continues in our products, but it’s certainly not the easiest to program for general purpose programming. The new design offers the same amount of ALU, but the scalar-style programming removes all the register and instruction dependencies we had. Chained multiplies, for example, work at peak efficiency, vs ¼ rate on HD6900. The port simplification that comes from removing the VLIW makes each instruction simple and easy to compile for. The tool chain to cater to this architecture is massively simplified and can be made much more robust; as well, performance tuning is easier. Finally, this core supports advanced debug features, such as breakpoints and single stepping, that allow for much deeper debug capabilities.
  12. So what is mantle?
  20. Purple: vector instructions. Blue: scalar instructions. EXEC = execution mask register; defines which threads of the wavefront (64 threads) will do the work. Already set at shader input (e.g. it would be set so that only rasterized pixels within a primitive are processed). VCC = Vector Condition Code register; holds the per-thread result of a vector instruction, and can be used to update EXEC. SCC = Scalar Condition Code; a single bit produced as the output of a scalar instruction. Shader code will be visible in GPU ShaderAnalyzer to allow optimizations.
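The interplay of the registers above can be shown with a toy Python model (a simplified sketch of my own, not full ISA semantics; function names only loosely echo real opcodes): a vector compare produces one VCC bit per active lane, and a scalar AND of EXEC with VCC narrows execution to the branch-taken lanes.

```python
# Toy 64-lane wavefront model: EXEC masks active lanes, VCC holds
# per-lane compare results, SCC is a single scalar condition bit.

WAVE = 64

def v_cmp_gt(exec_mask, a, b):
    """Vector compare: set a VCC bit per active lane where a[i] > b[i]."""
    vcc = 0
    for lane in range(WAVE):
        if (exec_mask >> lane) & 1 and a[lane] > b[lane]:
            vcc |= 1 << lane
    return vcc

def s_and_exec(exec_mask, vcc):
    """Scalar AND of EXEC with VCC; SCC = (result != 0)."""
    new_exec = exec_mask & vcc
    scc = int(new_exec != 0)
    return new_exec, scc

full = (1 << 64) - 1                       # all lanes active at shader input
vcc = v_cmp_gt(full, list(range(64)), [31] * 64)  # lanes 32..63 pass
new_exec, scc = s_and_exec(full, vcc)      # only passing lanes stay enabled
```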
  21. The new cache hierarchy was shown at AFDS; this core implements the first version of it. It's a full two-level read/write cache, with 16 KB of L1 per CU and 64 KB per L2 partition. Each CU has 64 bytes per cycle of L1 bandwidth, shared with the global data share (a local buffer for sharing data between wavefronts). Each L2 partition delivers 64 bytes per cycle as well. That's nearly 2 TB/s of L1 bandwidth and 700 GB/s of L2 bandwidth. Nice! Each group of four CUs shares a 32 KB instruction cache and a 16 KB scalar data cache. Coherency is handled at the L2 level, with the L1s able to keep the physical L2s updated directly. Never settle for enough cache bandwidth!
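The headline bandwidth figures follow from the per-cycle numbers above. This back-of-envelope check assumes an HD 7970 "Tahiti"-class configuration (32 CUs at 925 MHz with 12 L2 partitions) – those counts are my assumptions, not stated on the slide:

```python
# Aggregate cache bandwidth = units * bytes-per-clock * clock.
clock_hz = 925e6                 # assumed engine clock (HD 7970)
cus, l1_bytes_per_clk = 32, 64   # assumed CU count; 64 B/clk L1 per CU
l2_parts, l2_bytes_per_clk = 12, 64  # assumed L2 partition count

l1_bw = cus * l1_bytes_per_clk * clock_hz        # total L1 bandwidth
l2_bw = l2_parts * l2_bytes_per_clk * clock_hz   # total L2 bandwidth

print(l1_bw / 1e12)  # ~1.89 TB/s -> "nearly 2 TB/s"
print(l2_bw / 1e9)   # ~710 GB/s  -> "700 GB/s"
```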
  22. Fields of the FLAT instruction format (name, width in bits, description):
ADDR (8) – VGPR which holds the address. For 64-bit addresses, ADDR has the LSBs and ADDR+1 has the MSBs.
DATA (8) – VGPR which holds the first dword of data. Instructions can use 0-4 dwords.
VDST (8) – VGPR destination for data returned to the shader, either from LOADs or from atomics with GLC=1 (return pre-op value).
SLC (1) – System Level Coherent. Used in conjunction with GLC and MTYPE to determine cache policies.
GLC (1) – Global Level Coherent. For atomics, GLC=1 means return the pre-op value; 0 = do not return the pre-op value.
TFE (1) – Texel Fail Enable for PRT (Partially Resident Textures). When set, a fetch may return a NACK, which causes a VGPR write into DST+1 (the first GPR after all fetch-dest GPRs).
(M0) (32) – Implied use of M0. M0[16:0] contains the byte size of the LDS segment; this is used to clamp the final address.
Opcodes: FLAT_LOAD_UBYTE, FLAT_LOAD_SBYTE, FLAT_LOAD_USHORT, FLAT_LOAD_SSHORT, FLAT_LOAD_DWORD, FLAT_LOAD_DWORDX2, FLAT_LOAD_DWORDX3, FLAT_LOAD_DWORDX4; FLAT_STORE_BYTE, FLAT_STORE_SHORT, FLAT_STORE_DWORD, FLAT_STORE_DWORDX2, FLAT_STORE_DWORDX3, FLAT_STORE_DWORDX4; FLAT_ATOMIC_SWAP, FLAT_ATOMIC_CMPSWAP, FLAT_ATOMIC_ADD, FLAT_ATOMIC_SUB, FLAT_ATOMIC_SMIN, FLAT_ATOMIC_UMIN, FLAT_ATOMIC_SMAX, FLAT_ATOMIC_UMAX, FLAT_ATOMIC_AND, FLAT_ATOMIC_OR, FLAT_ATOMIC_XOR, FLAT_ATOMIC_INC, FLAT_ATOMIC_DEC, FLAT_ATOMIC_FCMPSWAP, FLAT_ATOMIC_FMIN, FLAT_ATOMIC_FMAX (each atomic also has an _X2 variant).
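The ADDR field's 64-bit split can be illustrated with a small helper (a sketch of my own, mirroring the LSBs-in-ADDR, MSBs-in-ADDR+1 convention described above):

```python
# How a 64-bit FLAT address is carried in a VGPR pair:
# VGPR[ADDR] holds the low dword (LSBs), VGPR[ADDR+1] the high dword (MSBs).

def split_flat_addr(addr64):
    lo = addr64 & 0xFFFFFFFF           # goes in VGPR[ADDR]
    hi = (addr64 >> 32) & 0xFFFFFFFF   # goes in VGPR[ADDR+1]
    return lo, hi

def join_flat_addr(lo, hi):
    # Reassemble the 64-bit address from the VGPR pair.
    return (hi << 32) | lo

lo, hi = split_flat_addr(0x123456789ABCDEF0)
print(hex(lo), hex(hi))  # 0x9abcdef0 0x12345678
```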
  23. Some stats to illustrate a 20-90% improvement in key metrics for a 24% increase in area.
  25. The hardware team has redesigned the GDDR5 memory interface to be smaller and more power efficient. The resulting 512-bit interface and controllers are 20% smaller than the 384-bit interface they replace. The target frequency yields a 20% increase in total accessible bandwidth, for a 50% increase in bandwidth per mm². World-class IP.
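The 50% bandwidth-per-mm² figure follows directly from the other two numbers on the slide – 20% more bandwidth delivered by an interface that is 20% smaller:

```python
# Bandwidth density scales as (bandwidth scale) / (area scale).
bw_scale   = 1.20  # +20% total accessible bandwidth
area_scale = 0.80  # -20% area (512-bit new vs 384-bit old interface)

bw_per_mm2_scale = bw_scale / area_scale        # 1.5x density
print(round((bw_per_mm2_scale - 1) * 100))      # -> 50 (% increase)
```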
  31. The R9 290 device is the first GCN part to scale to 4 primitives per clock. Interstage parameter and position storage is provided on chip to enable the necessary in-flight overlap. Each geometry engine provides surface, tessellation, geometry and vertex management, plus output primitive filtering, to drive the four partitioned rasterizers efficiently. For low-to-mid amplification, the geometry stage adds a driver/compiler-controlled mode that retains interstage data in shared memory to decrease external bandwidth requirements and latency effects; this can as much as double performance in some scenarios. Finally, for tessellation, improvements have been made in staging storage and control to improve overall performance.
  32. I stated earlier that we have our next-generation geometry engines – two of them in here. This latest generation also improves significantly on both tessellation and geometry buffer performance. Lots of changes went in to make this happen; the biggest are listed here. This allows us to reach up to 4x the performance of our previous HD 6900 series architecture. Let's see it.
  35. Pre-tessellate as needed in order to avoid higher tessellation factors.
  37. The R9 290 series provides a massive 64-pixel rasterization capability, with 256 pixels of depth and stencil test per clock. The render back-end units can drive color writes and blending operations for up to 64 surviving pixels per clock. This capability will move the bottleneck from pixel fill to bandwidth in some scenarios.
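A rough throughput check of those per-clock numbers, assuming an R9 290-class engine clock of 947 MHz (the clock is my assumption, not on the slide):

```python
# Per-second rates = per-clock capability * engine clock.
clock_hz = 947e6                 # assumed R9 290 engine clock
raster_px_per_clk = 64           # rasterization capability
zs_tests_per_clk  = 256          # depth/stencil tests per clock

print(raster_px_per_clk * clock_hz / 1e9)  # ~60.6 Gpix/s rasterized
print(zs_tests_per_clk  * clock_hz / 1e9)  # ~242 G depth/stencil tests/s
```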
  38. Present TrueAudio as the solution to the limitations imposed by today's PC audio solutions. Emphasize real-time operation and programmability.
  39. SPATIALIZATION / 3D AUDIO – surround sound with stereo gaming headsets; know exactly where the enemy is. REVERBS – more realistic sound environments. AUDIO/VOICE STREAMS – fuller sound for games with many scene objects. MASTERING LIMITERS – reduce developer workload with real-time limiters.
  40. Some immediate benefits of TrueAudio: it enables you to hear hundreds more real-time voices and audio channels in your game than is possible on CPUs today.
  41. AMD is working with audio plugin developers such as GenAudio to provide an immersive audio experience when integrated into games. Gamers who use stereo headsets (through either USB or audio jacks) will enjoy virtual surround sound accelerated by AMD TrueAudio technology. This level of integration leads to accurate three-dimensional audio, since position data is extracted directly from the game – whereas headsets with built-in virtual surround capability use simple audio expansion algorithms with no knowledge of the game's environment.
  42. That simplicity has attracted the world's top game devs. Pick some big ones by name: DICE (BF4), Eidos Montreal (Thief), Irrational Games (BioShock), Crytek (Crysis 3).
  44. Purple: vector instructions. Blue: scalar instructions. EXEC = execution mask register; defines which threads of the wavefront (64 threads) will do the work. Already set at shader input (e.g. it would be set so that only rasterized pixels within a primitive are processed). VCC = Vector Condition Code register; holds the per-thread result of a vector instruction. SCC = Scalar Condition Code; a single bit produced as the output of a scalar instruction.