Achieving the Best Performance with Intel Graphics
Tips, Tricks, and Clever Bits
Blake Taylor, Graphics Software Development and Validation (GSDV)
GDC 2014
Agenda
 Application
 Platform
 IA Graphics
 Sampler
 Fillrate
 Arithmetic Logic
 Geometry
 Scaling Your Game
 Conclusion
Application
Performance starts at the top
Efficient GPU Programming
Making the most of the pipeline!
 Optimizations within the IA software stack
 Application specific
 Generic
 Greatest impact from application optimization
 Meet your friendly AE!
Diagram: software stack (Application, Driver, GPU)
Diagram axis: Scope of Optimizations, Low to High (Draw, Draw…Frame, Big Picture!)
Tooltip!
- Use GPA™ to find and optimize GPU hotspots
Draw Dispatching and Resource Update
Be conscious of memory access patterns of dispatched operations
 3D / 2D operation scheduling
 State / shader changes
 Resource locality
Ex. Draw ordering and resource locality
Large Surfaces (high latency):
  Draw 0 (Resource A), Draw 1 (Resource B), Draw 2 (Resource A)
  Vs.
  Draw 0 (Resource A), Draw 1 (2) (Resource A), Draw 2 (1) (Resource B)
Small Surfaces (low latency), write to same RT region:
  Draw 0 (Resource A), Draw 1 (Resource B), Draw 2 (Resource A)
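The reordering idea on this slide can be sketched as a sort over the draw list. This is a minimal illustration in Python, not engine or driver code: the `Draw` record, its `shader` and `resource` fields, and the `count_switches` and `group_by_state` helpers are hypothetical names invented for the example. It also assumes the draws are order-independent (for example opaque geometry), which the slide does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Draw:
    """Hypothetical draw record; the field names are assumptions for illustration."""
    name: str
    shader: str
    resource: str


def count_switches(draws):
    """Count how often the bound shader or resource changes between consecutive draws."""
    switches = 0
    for prev, cur in zip(draws, draws[1:]):
        switches += prev.shader != cur.shader
        switches += prev.resource != cur.resource
    return switches


def group_by_state(draws):
    """Stable sort so draws sharing a shader and resource are dispatched back to back."""
    return sorted(draws, key=lambda d: (d.shader, d.resource))


# Same pattern as the slide: A, B, A becomes A, A, B.
frame = [
    Draw("Draw 0", "lit", "Resource A"),
    Draw("Draw 1", "lit", "Resource B"),
    Draw("Draw 2", "lit", "Resource A"),
]
grouped = group_by_state(frame)
print(count_switches(frame), count_switches(grouped))
```

Whether grouping pays off depends on surface size and latency, as the slide notes, so it is worth measuring both orderings with GPA rather than assuming the sorted one wins.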
Platform
More than just the sum of its parts…
Platform
Graphics is only part of the puzzle
 Unique architecture characteristics
 Power & performance
 Memory hierarchy
 Paired platform
 CPU
 System memory
 Other constraints
 Thermal
 Power
Diagram labels: Platform (Memory, Display, Peripherals) and Package (CPU, GPU, Un-core)
CPU Optimization
Relationship between CPU / GPU
 CPU or GPU bottleneck
 CPU can limit GPU
 Whaaa?....
Chart: Power (15 W Total), in watts, for CPU, GPU, and Uncore
Chart: Frequency (GPU Peak 1.15 GHz), in MHz, for CPU and GPU
Tooltip!
- Use VTune™ to find and optimize CPU hotspots
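One way to read "CPU can limit GPU" is as a shared power budget: whatever the CPU and uncore draw is no longer available to the GPU. The sketch below is a toy model under that assumption only. The function name and the CPU and uncore wattages are made up for illustration, and real power management is far more involved than a subtraction.

```python
def gpu_power_headroom(package_watts, cpu_watts, uncore_watts):
    """Watts left for the GPU in a toy model where all three share one budget."""
    return max(0.0, package_watts - cpu_watts - uncore_watts)


PACKAGE_WATTS = 15.0  # matches the "15W Total" label on the power chart

# Hypothetical CPU loads: a well-optimized frame versus a CPU-heavy one.
light_cpu = gpu_power_headroom(PACKAGE_WATTS, cpu_watts=3.0, uncore_watts=2.0)
heavy_cpu = gpu_power_headroom(PACKAGE_WATTS, cpu_watts=9.0, uncore_watts=2.0)
print(light_cpu, heavy_cpu)
```

In this model, trimming CPU hotspots leaves more of the budget for the GPU, which is the motivation for profiling the CPU side even when the game looks GPU bound.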
Cache Locality Is King
Optimize memory accesses for both CPU and GPU
 Memory bandwidth bound
 Hierarchy varies with platform
 Optional CPU + GPU Caches
– Last Level Cache (LLC)
– Embedded DRAM (eDRAM)
 GPU
Diagram: memory hierarchy from the GPU through GPU Caches, then Last Level Cache and eDRAM (CPU + GPU, varying availability), to the Memory Interface and DRAM
IA Graphics
This is what you came for, right?
Architecture
Architectural components
 Non-Slice
 Fixed function
– Transformation
– Clipping
 Slice
 Slice common
– Rasterization
– Shader dispatch
– Color back-end
 Sub-slice(s)
– Shader execution
Diagram: Non-Slice (Fixed Function); Slice containing Slice Common (Rasterization … Color back-end) and Sub-Slices (Shader Execution); Memory Interface
Architecture Scaling
Scaling Components
 Slice
 Parallel primitive processing
 Sub-slice
 Parallel span processing
Ex. Post Clip Primitive Processing (4x4 pixel spans): Prim 0, Prim 1
Slice Scaling (1 – N Slices)
Sub-Slice Scaling (1 – N Sub-Slices)
Sampler
1 Sampler Per Sub-Slice
 Local texture cache (Tex$)
 Backed by common L3$
Diagram: each Sub-Slice has a Sampler with Tex$, backed by the L3$
Ex. Synthetic Throughput vs. Sub-slice Count (chart: Throughput against 1, 2, 4, N Sub-Slices, relative to Architectural Peak)
Sampler Performance
Remember Cache Locality? :)
 Throughput
 Format
 Sampling pattern
 Poor access pattern
 Increased memory bandwidth
 Increased latency
Ex. Synthetic Throughput Good vs. Bad Access Pattern (chart: Throughput against 1, 2, 4, N Sub-Slices for Good and Bad patterns, relative to Architectural Peak)
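The effect of a poor access pattern can be illustrated with a small cache simulation. This is a simplified sketch, not a model of the actual Tex$: it assumes a row-major (linear) texture, 4-byte texels, 64-byte cache lines, and a tiny 8-line LRU cache, all chosen for the example. Real hardware typically uses tiled surface layouts and much larger caches.

```python
from collections import OrderedDict


def count_misses(accesses, width, texel_bytes=4, line_bytes=64, cache_lines=8):
    """Count misses in a small LRU cache for texel reads from a row-major texture."""
    cache = OrderedDict()
    misses = 0
    for x, y in accesses:
        line = ((y * width + x) * texel_bytes) // line_bytes
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


WIDTH = HEIGHT = 64
row_order = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]
column_order = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]

good = count_misses(row_order, WIDTH)
bad = count_misses(column_order, WIDTH)
print(good, bad)
```

Under these assumptions the row-order walk misses once per cache line, while the column-order walk misses on every read, which is the "increased memory bandwidth, increased latency" outcome the slide warns about.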
Texture Compression
Utilize as much as possible!
 Offline compression
 Dynamic compression
Images: Original Surface; BC1; Error with BC1; BC7; Error with BC7
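To put numbers on the saving, the footprint of a block-compressed texture can be computed from its block size. BC1 stores each 4x4 texel block in 8 bytes and BC7 in 16 bytes. The sketch below compares both against an uncompressed 32-bit RGBA surface. The function names are invented for the example and mip chains are ignored.

```python
BLOCK_BYTES = {"BC1": 8, "BC7": 16}  # bytes per 4x4 texel block
RGBA8_BYTES_PER_TEXEL = 4


def uncompressed_bytes(width, height):
    """Footprint of a single-mip 32-bit RGBA surface."""
    return width * height * RGBA8_BYTES_PER_TEXEL


def compressed_bytes(width, height, fmt):
    """Footprint of a single-mip block-compressed surface (dimensions rounded up to 4)."""
    blocks_x = (width + 3) // 4
    blocks_y = (height + 3) // 4
    return blocks_x * blocks_y * BLOCK_BYTES[fmt]


for fmt in ("BC1", "BC7"):
    ratio = uncompressed_bytes(1024, 1024) / compressed_bytes(1024, 1024, fmt)
    print(fmt, compressed_bytes(1024, 1024, fmt), ratio)
```

A smaller footprint means fewer bytes pulled through the sampler and caches per texel, which is why the slide recommends compressing as much as the error tolerance allows.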
Fillrate
Per Slice-Common
 Pixel Back-End
 Color Cache (RCC$)
Diagram: Slice Common contains the Rasterizer, Early Z and Early STC (Depth$, Stencil$), and the Pixel Back-End with RCC$
Ex. Synthetic Fillrate vs. Slice Count (chart: Fillrate against 1, 2, N Slices, relative to Architectural Peak)
Fillrate Performance
Pumping out color
 Throughput
 Format
 Dimension + region
 Other factors
 Rasterization
 Early Z/STC
 Pixel Shader Execution
 Late Z/STC
 Blend function + mode
Ex. Synthetic Fillrate Blended vs. Non-Blended (chart: Fillrate against 1, 2, N Sub-Slices, relative to Architectural Peak)
Surface Format
Select the appropriate format for color range
 Intermediate / final render targets
 Cause
 Higher precision format chosen unnecessarily
 Effect
 Reduced fill rate
 Increased memory bandwidth
HDR (R16G16B16A16)
HDR (R10G10B10A2)
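The bandwidth cost of the two formats named above follows directly from their bits per pixel: R16G16B16A16 is 64 bits and R10G10B10A2 is 32 bits. The sketch below computes the bytes written for one full-screen pass. The 1920x1080 resolution and the function name are assumptions for the example, and blending reads and overdraw would add to these figures.

```python
BITS_PER_PIXEL = {"R16G16B16A16": 64, "R10G10B10A2": 32}


def write_bytes_per_frame(width, height, fmt):
    """Bytes written when every pixel of the render target is written exactly once."""
    return width * height * BITS_PER_PIXEL[fmt] // 8


wide = write_bytes_per_frame(1920, 1080, "R16G16B16A16")
narrow = write_bytes_per_frame(1920, 1080, "R10G10B10A2")
print(wide, narrow, wide / narrow)
```

If the color range fits in 10 bits per channel, the narrower format halves the bytes written per pass in this simple model.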
Arithmetic Logic
Block Per Sub-Slice
 Execution Units (EUs)
 Instruction Cache (IC$)
Diagram: each Sub-Slice contains Execution Units (EU), an L1 IC$, and a Data Port
Ex. Synthetic EU Throughput vs. Sub-Slice (chart: Performance against 1, 3, 6 Sub-Slices, relative to Architectural Peak)
Arithmetic Logic Performance
Algorithmic Complexity
 Control flow
 Math
 Extended math
 Max concurrent registers
Chart: Synthetic Relative Performance - EU Operations (MAX, LRP, CMP, LOG, EXP, POW, ADD, MUL, MAD)
Chart: Synthetic SIMD8 vs. SIMD16 (Performance)
Shader Optimization
Optimal code based on purpose
 Shader scaling
 The case of the generic shader
 Generation of unused outputs :(
// Original: generic shader that declares and writes many outputs
vs_3_0
def c17, 2, -1, 0, 1
def c18, 1.44269502, 0.00999999978, -1.44269502, 0
dcl_position v0
dcl_normal v1
dcl_color v2
dcl_position o0
dcl_texcoord o1
dcl_texcoord1 o2
dcl_texcoord2 o3.xyz
dcl_texcoord3 o4.xyz
dcl_color o5
dcl_texcoord4 o6
dcl_texcoord5 o7
dcl_texcoord6 o8.xy
mul r0, c5, v0.y
mad r0, v0.x, c4, r0
mad r0, v0.z, c6, r0
mad o0, v0.w, c7, r0
.. 76 instructions…
mov o7, v2
// Optimized: position-only shader
vs_3_0
dcl_position v0
dcl_position o0
mul r0, c5, v0.y
mad r0, v0.x, c4, r0
mad r0, v0.z, c6, r0
mad o0, v0.w, c7, r0
Chart: Cycles, Original vs. Optimized
Chart: Cache Entries, Original vs. Optimized
Geometry
Single Non-Slice
 Fixed Function
 VS
 HS
 TE
 DS
 GS
 SOL
 Clipper
 Setup Front-End
Diagram: Non-Slice geometry stages VF, VS, HS, TE, DS, GS, SOL, CL, SFE, TDG, shown alongside the Slice Common and Sub-Slice blocks
Optimizing Geometry for Algorithmic Complexity
Optimal definitions for a single piece of geometry
 Quality scaling with platform
 Purpose
 Lighting, depth, animation…
Ex. Edge Softening
Before (Hard Edge): Model with Hard Edges, Duplicate Vertex
After (Soft Edge): Model with Soft Edges, Merged Vertex + Normal
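Edge softening as shown here, merging duplicate vertices and their normals, can be sketched as a small welding pass. This is an illustrative Python version under stated assumptions: vertices count as duplicates only when their positions are exactly equal, the merged normal is the normalized sum of the originals, and the `soften_edges` name and its argument layout are invented for the example.

```python
import math


def soften_edges(positions, normals, indices):
    """Merge vertices with identical positions and average their normals."""
    merged_index = {}      # position -> index in the merged vertex list
    merged_positions = []
    summed_normals = []
    remap = []             # old vertex index -> merged vertex index
    for pos, nrm in zip(positions, normals):
        if pos not in merged_index:
            merged_index[pos] = len(merged_positions)
            merged_positions.append(pos)
            summed_normals.append([0.0, 0.0, 0.0])
        target = merged_index[pos]
        for axis in range(3):
            summed_normals[target][axis] += nrm[axis]
        remap.append(target)

    merged_normals = []
    for total in summed_normals:
        length = math.sqrt(sum(c * c for c in total)) or 1.0  # avoid divide by zero
        merged_normals.append(tuple(c / length for c in total))
    return merged_positions, merged_normals, [remap[i] for i in indices]


# Two vertices share a position but carry different face normals (a hard edge).
positions = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
normals = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
new_positions, new_normals, new_indices = soften_edges(positions, normals, [0, 2, 1])
print(len(new_positions), new_normals[0], new_indices)
```

Fewer unique vertices means less vertex data to fetch and shade, at the cost of the hard-edged look, so this fits best on lower quality settings.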
Optimizing Primitive Ordering
Primitive scheduling within a single draw
 Ordering primitives for both locality and latency
 Two cases
 View dependent
 View independent
 Sample example (HDAO10.1)
 Primitive dispatch color coded (green -> red)
 2%-13% performance gain
Original Ordering
View Dependent Ordering
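The slide does not say which ordering was used for the view dependent case. One plausible choice, assumed here purely for illustration, is to dispatch primitives front to back so that nearer triangles are submitted first. The sketch sorts triangles by the depth of their centroid along a view direction. The function name and the data layout are hypothetical.

```python
def order_front_to_back(triangles, view_dir):
    """Sort triangles by centroid depth along view_dir, nearest first (assumed ordering)."""
    def depth(tri):
        centroid = [sum(v[axis] for v in tri) / 3.0 for axis in range(3)]
        return sum(c * d for c, d in zip(centroid, view_dir))
    return sorted(triangles, key=depth)


def flat_triangle(z):
    """Helper for the example: a small triangle lying at depth z."""
    return ((0.0, 0.0, z), (1.0, 0.0, z), (0.0, 1.0, z))


triangles = [flat_triangle(5.0), flat_triangle(1.0), flat_triangle(3.0)]
ordered = order_front_to_back(triangles, view_dir=(0.0, 0.0, 1.0))
print([tri[0][2] for tri in ordered])
```

A view dependent sort has to be redone as the camera moves, so it is worth checking that the measured gain outweighs the CPU cost of sorting.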
Scaling Your Game
Burn baby burn, heat inferno…
Lost Planet 2 : Images courtesy of Capcom
Why do you care?
Wide Range of Platforms + CPU + GPU
 Each with unique performance characteristics
 All of which the user hopes to run your game
 And run it well :)
Better selling point? More platforms + happy users == more money? $$$ :)
How Well Does Your Game Scale?
 Created a game
 Quality settings
Phone: Ultra Low?
Tablet: Low
Mobile: Medium
Desktop: Ultra
Memory Bandwidth
It’s all about the memory.. baby
 Will vary greatly with platform
 Why do you care?
 Read from memory
 Write to memory
Sharp turn ahead!
Diagram axis: Memory Bandwidth, Low to High
Goal
- Establish memory ceiling (budget)
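A memory ceiling can be turned into a per-frame byte budget by dividing the platform's bandwidth by the target frame rate. The sketch below is a rough illustration, not a measurement method: the 25.6 GB/s figure is just the theoretical peak of dual-channel DDR3-1600, the pass names and byte counts are invented, and in practice the GPU shares bandwidth with the CPU and display, so the usable ceiling is lower than the peak.

```python
def bytes_per_frame_budget(bandwidth_gb_per_s, target_fps):
    """Upper bound on bytes that can move per frame at the given bandwidth."""
    return bandwidth_gb_per_s * 1e9 / target_fps


def fits_budget(pass_bytes, budget_bytes):
    """True if the summed read and write traffic of all passes stays under the budget."""
    return sum(pass_bytes.values()) <= budget_bytes


budget = bytes_per_frame_budget(25.6, target_fps=60)

# Hypothetical per-pass traffic in bytes (reads plus writes).
passes = {"shadow": 50_000_000, "gbuffer": 120_000_000, "lighting": 150_000_000}
print(round(budget), fits_budget(passes, budget))
```

Measuring the real traffic of each pass and comparing it against a budget like this shows which quality settings need to scale down on low-bandwidth platforms.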
Sampler Throughput
Varies with architecture and platform
 Measure all use cases
 Dimension
 Format
 Filtering mode
Ex. Synthetic Sampler Throughput 32bit Use Cases (chart: Throughput against Memory Footprint of 0.03, 0.25, 0.3, 6, 10, 16 MB for 32bit Point/Bilinear and 32bit Trilinear, relative to Architectural Peak)
Goal
- Select optimal format & dimension
Fill Rate
Multiple surface types
 Render target
 Format
 Dimension
 Blended / Non-blended
 Depth
 Read and/or Write
 Stencil
 Read and/or Write
Synthetic Relative Performance – Fullscreen Primitive (chart: Performance for RT 32bit, RT 32bit (Blend), Z Fail, Z Pass, STC Test)
Goal
- Understand relative performance
- Optimal format, dimension, and algorithm
Geometry Throughput
Fixed function bandwidth and Arithmetic Logic
 Fixed function
 Clip / Cull
 Rasterization
 Geometry transformation
 ALU
Goal
- Optimal geometry and algorithm
Chart: Synthetic Relative Performance - EU Operations (MAX, LRP, CMP, LOG, EXP, POW, ADD, MUL, MAD)
Ex. Geometry Selection Low vs. High Throughput
Conclusion
Wrapping it all up in a bow..
THE END
Looking Forward
Same game for desktop to phone, and everything in-between...
 Wide array of platforms
 Adaptable quality settings
 Scaling algorithms
 Optimization
Thanks for attending!
Questions?
Contact Information
 E-mail : robert.b.taylor@intel.com
Ready for More? Look Inside™.
Keep in touch with us at GDC and beyond:
• Game Developer Conference
Visit our Intel® booth #1016 in Moscone South
• Intel University Games Showcase
Marriott Marquis Salon 7, Thursday 5:30pm
RSVP at bit.ly/intelgame
• Intel Developer Forum, San Francisco
September 9-11, 2014
intel.com/idf14
• Intel Software Adrenaline
@inteladrenaline
• Intel Developer Zone
software.intel.com
@intelsoftware
Up Next…
12:30 – 1:30
Realistic Cloud Rendering using Pixel Synchronization
Presented by:
Egor Yusov - Intel
