thu-blake-gdc-2014-final

Achieving the Best Performance with Intel Graphics
Tips, Tricks, and Clever Bits
Blake Taylor, Graphics Software Development and Validation (GSDV)
GDC 2014

Agenda
 Application
 Platform
 IA Graphics
 Sampler
 Fillrate
 Arithmetic Logic
 Geometry
 Scaling Your Game
 Conclusion
2

Application
Performance starts at the top
3

Efficient GPU Programming
4
Making the most of the pipeline!
 Optimizations within the IA software stack
 Application specific
 Generic
 Greatest impact from application optimization
 Meet your friendly AE!
Application
Driver
GPU
Low
High
Scope of Optimizations
Draw
Draw…Frame
Big Picture!
Tooltip!
- Use GPA™ to find and optimize GPU hotspots

Draw Dispatching and Resource Update
Be conscious of memory access patterns
of dispatched operations
 3D / 2D operation scheduling
 State / shader changes
 Resource locality
5
Resource A Resource B
Draw 0 Draw 1
Resource A
Draw 2
Vs.
Resource A Resource A
Draw 0 Draw 1 (2)
Resource B
Draw 2 (1)
Large Surfaces (high latency)
Resource A Resource B
Draw 0 Draw 1
Resource A
Draw 2
Small Surfaces (low latency)
Write to same RT region

Platform
More than just the sum of it’s parts…
6

Platform
7
Graphics is only part of the puzzle
 Unique architecture characteristics
 Power & performance
 Memory hierarchy
 Paired platform
 CPU
 System memory
 Other constraints
 Thermal
 Power
Memory
CPU GPU
Un-core
Package
Display Peripherals
Platform

CPU Optimization
Relationship between CPU / GPU
 CPU or GPU bottleneck
 CPU can limit GPU
 Whaaa?....
8
0
2
4
6
8
10
12
0 5 10 15
Watts
Power (15W Total)
CPU
GPU
Uncore
0
200
400
600
800
1000
1200
1400
0 5 10 15
MHz
Frequency (GPU Peak 1.15Ghz)
CPU
GPU
Tooltip!
- Use VTune™ to find and optimize CPU hotspots

Cache Locality Is King
Optimize memory accesses for both CPU and GPU
 Memory bandwidth bound
 Hierarchy varies with platform
 Optional CPU + GPU Caches
– Last Level Cache (LLC)
– Embedded DRAM (eDRAM)
 GPU
9
GPU Caches
CPU + GPU
GPU
Memory Interface
DRAM
Last Level Cache eDRAM
Varying availability

IA Graphics
This is what you came for right?
10

Architecture
11
Architectural components
 Non-Slice
 Fixed function
– Transformation
– Clipping
 Slice
 Slice common
– Rasterization
– Shader dispatch
– Color back-end
 Sub-slice(s)
– Shader execution
Fixed Function
Slice
Shader Execution
Non-Slice
Sub-Slice
Shader Execution
Sub-Slice
Rasterization … Color back-end
Slice Common
MemoryInterface

Architecture Scaling
12
Scaling Components
 Slice
 Parallel primitive processing
 Sub-slice
 Parallel span processing
Prim 0
Ex. Post Clip Primitive Processing (4x4 pixel spans)
Prim 1
Slice Scaling (1 – N Slices)
…
Sub-Slice Scaling (1 – N Sub-Slices)
…..

Sampler
13
1 Sampler Per Sub-Slice
 Local texture cache (Tex$)
 Backed by common L3$
Fixed Function
Slice
Non-Slice
Sub-Slice
Sub-Slice
Slice Common
MemoryInterface
Tex$Sampler
Tex$Sampler
L3$
0
0.2
0.4
0.6
0.8
1
1 2 4 N
Throughput
Sub-Slices
Ex. Synthetic Throughput vs. Sub-slice Count
Architectural Peak

Sampler Performance
14
Remember Cache Locality? 
 Throughput
 Format
 Sampling pattern
 Poor access pattern
 Increased memory b/w
 Increased latency
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 4 N
Throughput
Sub-Slices
Good
Bad
Ex. Synthetic Throughput Good vs. Bad Access Pattern
Architectural Peak

Texture Compression
15
Utilize as much as possible!
 Offline compression
 Dynamic compression
BC1 Error with BC1 BC7 Error with BC7
Original Surface

Fillrate
16
Per Slice-Common
 Pixel Back-End
 Color Cache (RCC$)
Fixed Function
Slice
Non-Slice
Sub-Slice
Sub-Slice
MemoryInterface
Tex$Sampler
Tex$Sampler
L3$
0
0.2
0.4
0.6
0.8
1
1 2 N
Fillrate
Slices
Ex. Synthetic Fillrate vs. Slice Count
Architectural Peak
Pixel
Back-End
RCC$
Rasterizer
Early Z
Early STC
Depth$
Stencil$
Slice Common

Fillrate Performance
17
Pumping out color
 Throughput
 Format
 Dimension + region
 Other factors
 Rasterization
 Early Z/STC
 Pixel Shader Execution
 Late Z/STC
 Blend function + mode
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 N
Fillrate
Sub-Slices
Non-Blended
Blended
Ex. Synthetic Fillrate Blended vs. Non-Blended
Architectural Peak

18
Surface Format
Select the appropriate format for color range
 Intermediate / final render targets
 Cause
 Higher precision format chosen un-necessarily
 Effect
 Reduced fill rate
 Increased memory bandwidth
HDR (R16G16B16A16)
HDR (R10G10B10A2)

Arithmetic Logic
19
Block Per Sub-Slice
 Execution Units (EUs)
 Instruction Cache (IC$)
Fixed Function
Non-Slice
Sub-Slice
Sub-Slice
MemoryInterface
Tex$Sampler
Tex$Sampler
L3$
0
0.2
0.4
0.6
0.8
1
1 3 6
Performance
Sub-Slices
Ex. Synthetic EU Throughput vs. Sub-Slice
Architectural Peak
Pixel
Back-End
RCC$
Rasterizer
Early Z
Early STC
Depth$
Stencil$
L1
IC$
Data Port
EU
Slice Common
EU EU EU
EU EU EU EU
L1
IC$
Data Port
EU EU EU EU
EU EU EU EU

Arithmetic Logic Performance
20
Algorithmic Complexity
 Control flow
 Math
 Extended math
 Max concurrent registers
0
0.2
0.4
0.6
0.8
1
MAX LRP CMP LOG EXP POW ADD MUL MAD
Performance
Synthetic Relative Performance - EU Operations
0
0.2
0.4
0.6
0.8
1
Performance
Synthetic SIMD8 vs. SIMD16
SIMD16
SIMD8

Shader Optimization
Optimal code based on purpose
 Shader scaling
 The case of the generic shader
 Generation of un-used outputs 
21
vs_3_0
def c17, 2, -1, 0, 1
def c18, 1.44269502, 0.00999999978, -1.44269502, 0
dcl_position v0
dcl_normal v1
dcl_color v2
dcl_position o0
dcl_texcoord o1
dcl_texcoord1 o2
dcl_texcoord2 o3.xyz
dcl_texcoord3 o4.xyz
dcl_color o5
dcl_texcoord4 o6
dcl_texcoord5 o7
dcl_texcoord6 o8.xy
mul r0, c5, v0.y
mad r0, v0.x, c4, r0
mad r0, v0.z, c6, r0
mad o0, v0.w, c7, r0
.. 76 instructions…
mov o7, v2
vs_3_0
dcl_position v0
dcl_position o0
mul r0, c5, v0.y
mad r0, v0.x, c4, r0
mad r0, v0.z, c6, r0
mad o0, v0.w, c7, r0
0
50000
100000
150000
Cycles
Original
Optimized
0
200
400
600
800
1000
Cache Entries
Original
Optimized

Geometry
22
Single Non-Slice
 Fixed Function
 VS
 HS
 TE
 DS
 GS
 SOL
 Clipper
 Setup Front-End
Non-Slice
Sub-Slice
Sub-Slice
MemoryInterface
Tex$Sampler
Tex$Sampler
L3$
Pixel
Back-End
RCC$
Rasterizer
Early Z
Early STC
Depth$
Stencil$
L1
IC$
Data Port
EU
Slice Common
EU EU EU
EU EU EU EU
L1
IC$
Data Port
EU EU EU EU
EU EU EU EU
VF
VS
HS
TE
DS
GS
SOL
CL
SFE
TDG

Optimizing Geometry for Algorithmic Complexity
Optimal definitions for a single piece of
geometry
 Quality scaling with platform
 Purpose
 Lighting, depth, animation…
23
Model with Hard Edges
Model with Soft Edges
Before (Hard Edge) After (Soft Edge)
Duplicate Vertex
Merged Vertex + Normal
Ex. Edge Softening

Optimizing Primitive Ordering
24
Primitive scheduling within a single draw
 Ordering primitives for both locality and latency
 Two cases
 View dependent
 View independent
 Sample example (HDAO10.1)
 Primitive dispatch color coded (green -> red)
 2%-13% performance gain
Original Ordering
View Dependent Ordering

Scaling Your Game
Burn baby burn, heat inferno…
25
Lost Planet 2 : Images courtesy of Capcom

Why do you care?
Wide Range of Platforms + CPU + GPU
 Each with unique performance characteristics
 All of which the user hopes to run your game
 And run it well 
26
Better selling point? more platforms + happy users == more money? $$$ 

How Well Does Your Game Scale?
 Created a game
 Quality settings
27
Tablet
Low
Phone
Ultra Low?
Mobile
Medium
Desktop
Ultra

Memory Bandwidth
28
It’s all about the memory.. baby
 Will vary greatly with platform
 Why do you care?
 Read from memory
 Write to memory
Sharp turn ahead!
Low
High
MemoryBandwidth
Goal
- Establish memory ceiling (budget)

Sampler Throughput
29
Varies with architecture and platform
 Measure all use cases
 Dimension
 Format
 Filtering mode
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0.03 0.25 0.3 6 10 16
Throughput
Memory Footprint (MB)
32bit (Point/Bilinear)
32bit (Trilinear)
Ex. Synthetic Sampler Throughput 32bit Use Cases
Architectural Peak
Goal
- Select optimal format & dimension

Fill Rate
30
Multiple surface types
 Render target
 Format
 Dimension
 Blended / Non-blended
 Depth
 Read +/ Write
 Stencil
 Read +/ Write 0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
RT 32bit RT 32bit
(Blend)
Z Fail Z Pass STC Test
Performance
Synthetic Relative Performance – Fullscreen Primitive
Goal
- Understand relative performance
- Optimal format, dimension, and algorithm

Geometry Throughput
31
Fixed function bandwidth and Arithmetic Logic
 Fixed function
 Clip / Cull
 Rasterization
 Geometry transformation
 ALU
Goal
- Optimal geometry and algorithm 0
0.5
1
MAX LRP CMP LOG EXP POW ADD MUL MAD
Performance
Synthetic Relative Performance - EU Operations
Vs.
Ex. Geometry Selection Low vs. High Throughput

Conclusion
Wrapping it all up in a bow..
32
THE END

Looking Forward
33
Same game for desktop to phone
 Wide array of platforms
 Adaptable quality settings
 Scaling algorithms
 Optimization
Thanks for attending!
And everything in-between...

Questions?
34
Contact Information
 E-mail : robert.b.taylor@intel.com

Ready for More? Look Inside™.
35
Keep in touch with us at GDC and beyond:
• Game Developer Conference
Visit our Intel® booth #1016 in Moscone South
• Intel University Games Showcase
Marriott Marquis Salon 7, Thursday 5:30pm
RSVP at bit.ly/intelgame
• Intel Developer Forum, San Francisco
September 9-11, 2014
intel.com/idf14
• Intel Software Adrenaline
@inteladrenaline
• Intel Developer Zone
software.intel.com
@intelsoftware

Up Next…
37
12:30 – 1:30
Realistic Cloud Rendering using Pixel Synchronization
Presented by:
Egor Yusov - Intel

thu-blake-gdc-2014-final

Recommended

Recommended

More Related Content

What's hot

What's hot (9)

Viewers also liked

Viewers also liked (20)

Similar to thu-blake-gdc-2014-final

Similar to thu-blake-gdc-2014-final (20)

thu-blake-gdc-2014-final