SlideShare a Scribd company logo
1 of 37
Download to read offline
Achieving the Best Performance with Intel Graphics
Tips, Tricks, and Clever Bits
Blake Taylor, Graphics Software Development and Validation (GSDV)
GDC 2014
Agenda
 Application
 Platform
 IA Graphics
 Sampler
 Fillrate
 Arithmetic Logic
 Geometry
 Scaling Your Game
 Conclusion
2
Application
Performance starts at the top
3
Efficient GPU Programming
4
Making the most of the pipeline!
 Optimizations within the IA software stack
 Application specific
 Generic
 Greatest impact from application optimization
 Meet your friendly AE!
Application
Driver
GPU
Low
High
Scope of Optimizations
Draw
Draw…Frame
Big Picture!
Tooltip!
- Use GPA™ to find and optimize GPU hotspots
Draw Dispatching and Resource Update
Be conscious of memory access patterns
of dispatched operations
 3D / 2D operation scheduling
 State / shader changes
 Resource locality
5
Resource A Resource B
Draw 0 Draw 1
Resource A
Draw 2
Vs.
Resource A Resource A
Draw 0 Draw 1 (2)
Resource B
Draw 2 (1)
Large Surfaces (high latency)
Resource A Resource B
Draw 0 Draw 1
Resource A
Draw 2
Small Surfaces (low latency)
Write to same RT region
Platform
More than just the sum of it’s parts…
6
Platform
7
Graphics is only part of the puzzle
 Unique architecture characteristics
 Power & performance
 Memory hierarchy
 Paired platform
 CPU
 System memory
 Other constraints
 Thermal
 Power
Memory
CPU GPU
Un-core
Package
Display Peripherals
Platform
CPU Optimization
Relationship between CPU / GPU
 CPU or GPU bottleneck
 CPU can limit GPU
 Whaaa?....
8
0
2
4
6
8
10
12
0 5 10 15
Watts
Power (15W Total)
CPU
GPU
Uncore
0
200
400
600
800
1000
1200
1400
0 5 10 15
MHz
Frequency (GPU Peak 1.15Ghz)
CPU
GPU
Tooltip!
- Use VTune™ to find and optimize CPU hotspots
Cache Locality Is King
Optimize memory accesses for both CPU and GPU
 Memory bandwidth bound
 Hierarchy varies with platform
 Optional CPU + GPU Caches
– Last Level Cache (LLC)
– Embedded DRAM (eDRAM)
 GPU
9
GPU Caches
CPU + GPU
GPU
Memory Interface
DRAM
Last Level Cache eDRAM
Varying availability
IA Graphics
This is what you came for right?
10
Architecture
11
Architectural components
 Non-Slice
 Fixed function
– Transformation
– Clipping
 Slice
 Slice common
– Rasterization
– Shader dispatch
– Color back-end
 Sub-slice(s)
– Shader execution
Fixed Function
Slice
Shader Execution
Non-Slice
Sub-Slice
Shader Execution
Sub-Slice
Rasterization … Color back-end
Slice Common
MemoryInterface
Architecture Scaling
12
Scaling Components
 Slice
 Parallel primitive processing
 Sub-slice
 Parallel span processing
Prim 0
Ex. Post Clip Primitive Processing (4x4 pixel spans)
Prim 1
Slice Scaling (1 – N Slices)
…
Sub-Slice Scaling (1 – N Sub-Slices)
…..
Sampler
13
1 Sampler Per Sub-Slice
 Local texture cache (Tex$)
 Backed by common L3$
Fixed Function
Slice
Non-Slice
Sub-Slice
Sub-Slice
Slice Common
MemoryInterface
Tex$Sampler
Tex$Sampler
L3$
0
0.2
0.4
0.6
0.8
1
1 2 4 N
Throughput
Sub-Slices
Ex. Synthetic Throughput vs. Sub-slice Count
Architectural Peak
Sampler Performance
14
Remember Cache Locality? 
 Throughput
 Format
 Sampling pattern
 Poor access pattern
 Increased memory b/w
 Increased latency
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 4 N
Throughput
Sub-Slices
Good
Bad
Ex. Synthetic Throughput Good vs. Bad Access Pattern
Architectural Peak
Texture Compression
15
Utilize as much as possible!
 Offline compression
 Dynamic compression
BC1 Error with BC1 BC7 Error with BC7
Original Surface
Fillrate
16
Per Slice-Common
 Pixel Back-End
 Color Cache (RCC$)
Fixed Function
Slice
Non-Slice
Sub-Slice
Sub-Slice
MemoryInterface
Tex$Sampler
Tex$Sampler
L3$
0
0.2
0.4
0.6
0.8
1
1 2 N
Fillrate
Slices
Ex. Synthetic Fillrate vs. Slice Count
Architectural Peak
Pixel
Back-End
RCC$
Rasterizer
Early Z
Early STC
Depth$
Stencil$
Slice Common
Fillrate Performance
17
Pumping out color
 Throughput
 Format
 Dimension + region
 Other factors
 Rasterization
 Early Z/STC
 Pixel Shader Execution
 Late Z/STC
 Blend function + mode
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 N
Fillrate
Sub-Slices
Non-Blended
Blended
Ex. Synthetic Fillrate Blended vs. Non-Blended
Architectural Peak
18
Surface Format
Select the appropriate format for color range
 Intermediate / final render targets
 Cause
 Higher precision format chosen un-necessarily
 Effect
 Reduced fill rate
 Increased memory bandwidth
HDR (R16G16B16A16)
HDR (R10G10B10A2)
Arithmetic Logic
19
Block Per Sub-Slice
 Execution Units (EUs)
 Instruction Cache (IC$)
Fixed Function
Non-Slice
Sub-Slice
Sub-Slice
MemoryInterface
Tex$Sampler
Tex$Sampler
L3$
0
0.2
0.4
0.6
0.8
1
1 3 6
Performance
Sub-Slices
Ex. Synthetic EU Throughput vs. Sub-Slice
Architectural Peak
Pixel
Back-End
RCC$
Rasterizer
Early Z
Early STC
Depth$
Stencil$
L1
IC$
Data Port
EU
Slice Common
EU EU EU
EU EU EU EU
L1
IC$
Data Port
EU EU EU EU
EU EU EU EU
Arithmetic Logic Performance
20
Algorithmic Complexity
 Control flow
 Math
 Extended math
 Max concurrent registers
0
0.2
0.4
0.6
0.8
1
MAX LRP CMP LOG EXP POW ADD MUL MAD
Performance
Synthetic Relative Performance - EU Operations
0
0.2
0.4
0.6
0.8
1
Performance
Synthetic SIMD8 vs. SIMD16
SIMD16
SIMD8
Shader Optimization
Optimal code based on purpose
 Shader scaling
 The case of the generic shader
 Generation of un-used outputs 
21
vs_3_0
def c17, 2, -1, 0, 1
def c18, 1.44269502, 0.00999999978, -1.44269502, 0
dcl_position v0
dcl_normal v1
dcl_color v2
dcl_position o0
dcl_texcoord o1
dcl_texcoord1 o2
dcl_texcoord2 o3.xyz
dcl_texcoord3 o4.xyz
dcl_color o5
dcl_texcoord4 o6
dcl_texcoord5 o7
dcl_texcoord6 o8.xy
mul r0, c5, v0.y
mad r0, v0.x, c4, r0
mad r0, v0.z, c6, r0
mad o0, v0.w, c7, r0
.. 76 instructions…
mov o7, v2
vs_3_0
dcl_position v0
dcl_position o0
mul r0, c5, v0.y
mad r0, v0.x, c4, r0
mad r0, v0.z, c6, r0
mad o0, v0.w, c7, r0
0
50000
100000
150000
Cycles
Original
Optimized
0
200
400
600
800
1000
Cache Entries
Original
Optimized
Geometry
22
Single Non-Slice
 Fixed Function
 VS
 HS
 TE
 DS
 GS
 SOL
 Clipper
 Setup Front-End
Non-Slice
Sub-Slice
Sub-Slice
MemoryInterface
Tex$Sampler
Tex$Sampler
L3$
Pixel
Back-End
RCC$
Rasterizer
Early Z
Early STC
Depth$
Stencil$
L1
IC$
Data Port
EU
Slice Common
EU EU EU
EU EU EU EU
L1
IC$
Data Port
EU EU EU EU
EU EU EU EU
VF
VS
HS
TE
DS
GS
SOL
CL
SFE
TDG
Optimizing Geometry for Algorithmic Complexity
Optimal definitions for a single piece of
geometry
 Quality scaling with platform
 Purpose
 Lighting, depth, animation…
23
Model with Hard Edges
Model with Soft Edges
Before (Hard Edge) After (Soft Edge)
Duplicate Vertex
Merged Vertex + Normal
Ex. Edge Softening
Optimizing Primitive Ordering
24
Primitive scheduling within a single draw
 Ordering primitives for both locality and latency
 Two cases
 View dependent
 View independent
 Sample example (HDAO10.1)
 Primitive dispatch color coded (green -> red)
 2%-13% performance gain
Original Ordering
View Dependent Ordering
Scaling Your Game
Burn baby burn, heat inferno…
25
Lost Planet 2 : Images courtesy of Capcom
Why do you care?
Wide Range of Platforms + CPU + GPU
 Each with unique performance characteristics
 All of which the user hopes to run your game
 And run it well 
26
Better selling point? more platforms + happy users == more money? $$$ 
How Well Does Your Game Scale?
 Created a game
 Quality settings
27
Tablet
Low
Phone
Ultra Low?
Mobile
Medium
Desktop
Ultra
Memory Bandwidth
28
It’s all about the memory.. baby
 Will vary greatly with platform
 Why do you care?
 Read from memory
 Write to memory
Sharp turn ahead!
Low
High
MemoryBandwidth
Goal
- Establish memory ceiling (budget)
Sampler Throughput
29
Varies with architecture and platform
 Measure all use cases
 Dimension
 Format
 Filtering mode
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0.03 0.25 0.3 6 10 16
Throughput
Memory Footprint (MB)
32bit (Point/Bilinear)
32bit (Trilinear)
Ex. Synthetic Sampler Throughput 32bit Use Cases
Architectural Peak
Goal
- Select optimal format & dimension
Fill Rate
30
Multiple surface types
 Render target
 Format
 Dimension
 Blended / Non-blended
 Depth
 Read +/ Write
 Stencil
 Read +/ Write 0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
RT 32bit RT 32bit
(Blend)
Z Fail Z Pass STC Test
Performance
Synthetic Relative Performance – Fullscreen Primitive
Goal
- Understand relative performance
- Optimal format, dimension, and algorithm
Geometry Throughput
31
Fixed function bandwidth and Arithmetic Logic
 Fixed function
 Clip / Cull
 Rasterization
 Geometry transformation
 ALU
Goal
- Optimal geometry and algorithm 0
0.5
1
MAX LRP CMP LOG EXP POW ADD MUL MAD
Performance
Synthetic Relative Performance - EU Operations
Vs.
Ex. Geometry Selection Low vs. High Throughput
Conclusion
Wrapping it all up in a bow..
32
THE END
Looking Forward
33
Same game for desktop to phone
 Wide array of platforms
 Adaptable quality settings
 Scaling algorithms
 Optimization
Thanks for attending!
And everything in-between...
Questions?
34
Contact Information
 E-mail : robert.b.taylor@intel.com
Ready for More? Look Inside™.
35
Keep in touch with us at GDC and beyond:
• Game Developer Conference
Visit our Intel® booth #1016 in Moscone South
• Intel University Games Showcase
Marriott Marquis Salon 7, Thursday 5:30pm
RSVP at bit.ly/intelgame
• Intel Developer Forum, San Francisco
September 9-11, 2014
intel.com/idf14
• Intel Software Adrenaline
@inteladrenaline
• Intel Developer Zone
software.intel.com
@intelsoftware
Up Next…
37
12:30 – 1:30
Realistic Cloud Rendering using Pixel Synchronization
Presented by:
Egor Yusov - Intel

More Related Content

What's hot

Masked Software Occlusion Culling
Masked Software Occlusion CullingMasked Software Occlusion Culling
Masked Software Occlusion CullingIntel® Software
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Johan Andersson
 
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...AMD Developer Central
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect AndromedaElectronic Arts / DICE
 

What's hot (9)

Scope Stack Allocation
Scope Stack AllocationScope Stack Allocation
Scope Stack Allocation
 
Masked Software Occlusion Culling
Masked Software Occlusion CullingMasked Software Occlusion Culling
Masked Software Occlusion Culling
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
 
lecture4 raster details in computer graphics(Computer graphics tutorials)
lecture4 raster details in computer graphics(Computer graphics tutorials)lecture4 raster details in computer graphics(Computer graphics tutorials)
lecture4 raster details in computer graphics(Computer graphics tutorials)
 
Chapter 1
Chapter 1Chapter 1
Chapter 1
 
Rendering Battlefield 4 with Mantle
Rendering Battlefield 4 with MantleRendering Battlefield 4 with Mantle
Rendering Battlefield 4 with Mantle
 
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
4K Checkerboard in Battlefield 1 and Mass Effect Andromeda
 

Viewers also liked

Telecentre Multimedia Academy: project presentation
Telecentre Multimedia Academy: project presentationTelecentre Multimedia Academy: project presentation
Telecentre Multimedia Academy: project presentationTELECENTRE EUROPE
 
Ridgewood, NY Land Use study
Ridgewood, NY Land Use studyRidgewood, NY Land Use study
Ridgewood, NY Land Use studyRaquel Vega
 
Disseminated cryptococcosis in HIV seropositive patient
Disseminated cryptococcosis in HIV seropositive patientDisseminated cryptococcosis in HIV seropositive patient
Disseminated cryptococcosis in HIV seropositive patientAkshay Khumar
 
Forensic Science in Daily Life
Forensic Science in Daily LifeForensic Science in Daily Life
Forensic Science in Daily LifeAkshay Khumar
 
PORTFOLIO Final - LQ Page
PORTFOLIO Final - LQ PagePORTFOLIO Final - LQ Page
PORTFOLIO Final - LQ PageSahar Rahmani
 
Historical poisoning cases in Tamil nadu
Historical poisoning cases in Tamil naduHistorical poisoning cases in Tamil nadu
Historical poisoning cases in Tamil naduAkshay Khumar
 
Adopción Agile en Seriva (2011)
Adopción Agile en Seriva (2011)Adopción Agile en Seriva (2011)
Adopción Agile en Seriva (2011)Johnny Ordóñez
 
Seawind Copenhagen Paper 20150310
Seawind Copenhagen Paper 20150310Seawind Copenhagen Paper 20150310
Seawind Copenhagen Paper 20150310Ray Dackerman
 
Uac user-account-control
Uac user-account-controlUac user-account-control
Uac user-account-controlFebri Evan
 
Rapport CEMO empowerende technologie in de thuiszorg Pilootstudie
Rapport CEMO empowerende technologie in de thuiszorg PilootstudieRapport CEMO empowerende technologie in de thuiszorg Pilootstudie
Rapport CEMO empowerende technologie in de thuiszorg PilootstudieBert Desmet
 
20130214 studiedag CEMO
20130214 studiedag CEMO20130214 studiedag CEMO
20130214 studiedag CEMOBert Desmet
 

Viewers also liked (20)

Telecentre Multimedia Academy: project presentation
Telecentre Multimedia Academy: project presentationTelecentre Multimedia Academy: project presentation
Telecentre Multimedia Academy: project presentation
 
Kanban avec muv on
Kanban avec muv onKanban avec muv on
Kanban avec muv on
 
Module 0
Module 0 Module 0
Module 0
 
Ridgewood, NY Land Use study
Ridgewood, NY Land Use studyRidgewood, NY Land Use study
Ridgewood, NY Land Use study
 
Disseminated cryptococcosis in HIV seropositive patient
Disseminated cryptococcosis in HIV seropositive patientDisseminated cryptococcosis in HIV seropositive patient
Disseminated cryptococcosis in HIV seropositive patient
 
Oral contraceptives
Oral contraceptivesOral contraceptives
Oral contraceptives
 
H1 n1 (swine flu)
H1 n1 (swine flu)H1 n1 (swine flu)
H1 n1 (swine flu)
 
Forensic Science in Daily Life
Forensic Science in Daily LifeForensic Science in Daily Life
Forensic Science in Daily Life
 
PORTFOLIO Final - LQ Page
PORTFOLIO Final - LQ PagePORTFOLIO Final - LQ Page
PORTFOLIO Final - LQ Page
 
H5 n1 (bird flu)
H5 n1 (bird flu)H5 n1 (bird flu)
H5 n1 (bird flu)
 
Final Draft
Final DraftFinal Draft
Final Draft
 
unidad II
unidad IIunidad II
unidad II
 
Historical poisoning cases in Tamil nadu
Historical poisoning cases in Tamil naduHistorical poisoning cases in Tamil nadu
Historical poisoning cases in Tamil nadu
 
Adopción Agile en Seriva (2011)
Adopción Agile en Seriva (2011)Adopción Agile en Seriva (2011)
Adopción Agile en Seriva (2011)
 
Seawind Copenhagen Paper 20150310
Seawind Copenhagen Paper 20150310Seawind Copenhagen Paper 20150310
Seawind Copenhagen Paper 20150310
 
Menstrual disorders
Menstrual disordersMenstrual disorders
Menstrual disorders
 
Uac user-account-control
Uac user-account-controlUac user-account-control
Uac user-account-control
 
Rapport CEMO empowerende technologie in de thuiszorg Pilootstudie
Rapport CEMO empowerende technologie in de thuiszorg PilootstudieRapport CEMO empowerende technologie in de thuiszorg Pilootstudie
Rapport CEMO empowerende technologie in de thuiszorg Pilootstudie
 
20130214 studiedag CEMO
20130214 studiedag CEMO20130214 studiedag CEMO
20130214 studiedag CEMO
 
SGVResume
SGVResumeSGVResume
SGVResume
 

Similar to thu-blake-gdc-2014-final

[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio Owen Wu
 
Next generation graphics programming on xbox 360
Next generation graphics programming on xbox 360Next generation graphics programming on xbox 360
Next generation graphics programming on xbox 360VIKAS SINGH BHADOURIA
 
Introduction To Geometry Shaders
Introduction To Geometry ShadersIntroduction To Geometry Shaders
Introduction To Geometry Shaderspjcozzi
 
Embedded system Design introduction _ Karakola
Embedded system Design introduction _ KarakolaEmbedded system Design introduction _ Karakola
Embedded system Design introduction _ KarakolaJohanAspro
 
Данило Ульянич “C89 OpenGL for ARM microcontrollers on Cortex-M. Basic functi...
Данило Ульянич “C89 OpenGL for ARM microcontrollers on Cortex-M. Basic functi...Данило Ульянич “C89 OpenGL for ARM microcontrollers on Cortex-M. Basic functi...
Данило Ульянич “C89 OpenGL for ARM microcontrollers on Cortex-M. Basic functi...Lviv Startup Club
 
Smedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphicsSmedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphicschangehee lee
 
new_age_graphics_android_x86
new_age_graphics_android_x86new_age_graphics_android_x86
new_age_graphics_android_x86Droidcon Berlin
 
D3 D10 Unleashed New Features And Effects
D3 D10 Unleashed   New Features And EffectsD3 D10 Unleashed   New Features And Effects
D3 D10 Unleashed New Features And EffectsThomas Goddard
 
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorAMD Developer Central
 
Designing C++ portable SIMD support
Designing C++ portable SIMD supportDesigning C++ portable SIMD support
Designing C++ portable SIMD supportJoel Falcou
 
NVIDIA's OpenGL Functionality
NVIDIA's OpenGL FunctionalityNVIDIA's OpenGL Functionality
NVIDIA's OpenGL FunctionalityMark Kilgard
 
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...Johan Andersson
 
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...Hsien-Hsin Sean Lee, Ph.D.
 
0xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp020xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp02chon2010
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Graham Wihlidal
 
Overview of graphics systems.ppt
Overview of graphics systems.pptOverview of graphics systems.ppt
Overview of graphics systems.pptMalleshBettadapura1
 
mloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game developmentmloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game developmentDavid Galeano
 

Similar to thu-blake-gdc-2014-final (20)

[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
 
Next generation graphics programming on xbox 360
Next generation graphics programming on xbox 360Next generation graphics programming on xbox 360
Next generation graphics programming on xbox 360
 
Introduction To Geometry Shaders
Introduction To Geometry ShadersIntroduction To Geometry Shaders
Introduction To Geometry Shaders
 
OpenGL 4 for 2010
OpenGL 4 for 2010OpenGL 4 for 2010
OpenGL 4 for 2010
 
Embedded system Design introduction _ Karakola
Embedded system Design introduction _ KarakolaEmbedded system Design introduction _ Karakola
Embedded system Design introduction _ Karakola
 
Данило Ульянич “C89 OpenGL for ARM microcontrollers on Cortex-M. Basic functi...
Данило Ульянич “C89 OpenGL for ARM microcontrollers on Cortex-M. Basic functi...Данило Ульянич “C89 OpenGL for ARM microcontrollers on Cortex-M. Basic functi...
Данило Ульянич “C89 OpenGL for ARM microcontrollers on Cortex-M. Basic functi...
 
Smedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphicsSmedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphics
 
new_age_graphics_android_x86
new_age_graphics_android_x86new_age_graphics_android_x86
new_age_graphics_android_x86
 
D3 D10 Unleashed New Features And Effects
D3 D10 Unleashed   New Features And EffectsD3 D10 Unleashed   New Features And Effects
D3 D10 Unleashed New Features And Effects
 
Introduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSPIntroduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSP
 
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
 
Designing C++ portable SIMD support
Designing C++ portable SIMD supportDesigning C++ portable SIMD support
Designing C++ portable SIMD support
 
NVIDIA's OpenGL Functionality
NVIDIA's OpenGL FunctionalityNVIDIA's OpenGL Functionality
NVIDIA's OpenGL Functionality
 
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
 
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
 
0xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp020xdroid osdc-2010-100426084937-phpapp02
0xdroid osdc-2010-100426084937-phpapp02
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
 
Overview of graphics systems.ppt
Overview of graphics systems.pptOverview of graphics systems.ppt
Overview of graphics systems.ppt
 
Mod 2 hardware_graphics.pdf
Mod 2 hardware_graphics.pdfMod 2 hardware_graphics.pdf
Mod 2 hardware_graphics.pdf
 
mloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game developmentmloc.js 2014 - JavaScript and the browser as a platform for game development
mloc.js 2014 - JavaScript and the browser as a platform for game development
 

thu-blake-gdc-2014-final

  • 1. Achieving the Best Performance with Intel Graphics Tips, Tricks, and Clever Bits Blake Taylor, Graphics Software Development and Validation (GSDV) GDC 2014
  • 2. Agenda  Application  Platform  IA Graphics  Sampler  Fillrate  Arithmetic Logic  Geometry  Scaling Your Game  Conclusion 2
  • 4. Efficient GPU Programming 4 Making the most of the pipeline!  Optimizations within the IA software stack  Application specific  Generic  Greatest impact from application optimization  Meet your friendly AE! Application Driver GPU Low High Scope of Optimizations Draw Draw…Frame Big Picture! Tooltip! - Use GPA™ to find and optimize GPU hotspots
  • 5. Draw Dispatching and Resource Update Be conscious of memory access patterns of dispatched operations  3D / 2D operation scheduling  State / shader changes  Resource locality 5 Resource A Resource B Draw 0 Draw 1 Resource A Draw 2 Vs. Resource A Resource A Draw 0 Draw 1 (2) Resource B Draw 2 (1) Large Surfaces (high latency) Resource A Resource B Draw 0 Draw 1 Resource A Draw 2 Small Surfaces (low latency) Write to same RT region
  • 6. Platform More than just the sum of it’s parts… 6
  • 7. Platform 7 Graphics is only part of the puzzle  Unique architecture characteristics  Power & performance  Memory hierarchy  Paired platform  CPU  System memory  Other constraints  Thermal  Power Memory CPU GPU Un-core Package Display Peripherals Platform
  • 8. CPU Optimization Relationship between CPU / GPU  CPU or GPU bottleneck  CPU can limit GPU  Whaaa?.... 8 0 2 4 6 8 10 12 0 5 10 15 Watts Power (15W Total) CPU GPU Uncore 0 200 400 600 800 1000 1200 1400 0 5 10 15 MHz Frequency (GPU Peak 1.15Ghz) CPU GPU Tooltip! - Use VTune™ to find and optimize CPU hotspots
  • 9. Cache Locality Is King Optimize memory accesses for both CPU and GPU  Memory bandwidth bound  Hierarchy varies with platform  Optional CPU + GPU Caches – Last Level Cache (LLC) – Embedded DRAM (eDRAM)  GPU 9 GPU Caches CPU + GPU GPU Memory Interface DRAM Last Level Cache eDRAM Varying availability
  • 10. IA Graphics This is what you came for right? 10
  • 11. Architecture 11 Architectural components  Non-Slice  Fixed function – Transformation – Clipping  Slice  Slice common – Rasterization – Shader dispatch – Color back-end  Sub-slice(s) – Shader execution Fixed Function Slice Shader Execution Non-Slice Sub-Slice Shader Execution Sub-Slice Rasterization … Color back-end Slice Common MemoryInterface
  • 12. Architecture Scaling 12 Scaling Components  Slice  Parallel primitive processing  Sub-slice  Parallel span processing Prim 0 Ex. Post Clip Primitive Processing (4x4 pixel spans) Prim 1 Slice Scaling (1 – N Slices) … Sub-Slice Scaling (1 – N Sub-Slices) …..
  • 13. Sampler 13 1 Sampler Per Sub-Slice  Local texture cache (Tex$)  Backed by common L3$ Fixed Function Slice Non-Slice Sub-Slice Sub-Slice Slice Common MemoryInterface Tex$Sampler Tex$Sampler L3$ 0 0.2 0.4 0.6 0.8 1 1 2 4 N Throughput Sub-Slices Ex. Synthetic Throughput vs. Sub-slice Count Architectural Peak
  • 14. Sampler Performance 14 Remember Cache Locality?   Throughput  Format  Sampling pattern  Poor access pattern  Increased memory b/w  Increased latency 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 2 4 N Throughput Sub-Slices Good Bad Ex. Synthetic Throughput Good vs. Bad Access Pattern Architectural Peak
  • 15. Texture Compression 15 Utilize as much as possible!  Offline compression  Dynamic compression BC1 Error with BC1 BC7 Error with BC7 Original Surface
  • 16. Fillrate 16 Per Slice-Common  Pixel Back-End  Color Cache (RCC$) Fixed Function Slice Non-Slice Sub-Slice Sub-Slice MemoryInterface Tex$Sampler Tex$Sampler L3$ 0 0.2 0.4 0.6 0.8 1 1 2 N Fillrate Slices Ex. Synthetic Fillrate vs. Slice Count Architectural Peak Pixel Back-End RCC$ Rasterizer Early Z Early STC Depth$ Stencil$ Slice Common
  • 17. Fillrate Performance 17 Pumping out color  Throughput  Format  Dimension + region  Other factors  Rasterization  Early Z/STC  Pixel Shader Execution  Late Z/STC  Blend function + mode 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 2 N Fillrate Sub-Slices Non-Blended Blended Ex. Synthetic Fillrate Blended vs. Non-Blended Architectural Peak
  • 18. 18 Surface Format Select the appropriate format for color range  Intermediate / final render targets  Cause  Higher precision format chosen un-necessarily  Effect  Reduced fill rate  Increased memory bandwidth HDR (R16G16B16A16) HDR (R10G10B10A2)
  • 19. Arithmetic Logic 19 Block Per Sub-Slice  Execution Units (EUs)  Instruction Cache (IC$) Fixed Function Non-Slice Sub-Slice Sub-Slice MemoryInterface Tex$Sampler Tex$Sampler L3$ 0 0.2 0.4 0.6 0.8 1 1 3 6 Performance Sub-Slices Ex. Synthetic EU Throughput vs. Sub-Slice Architectural Peak Pixel Back-End RCC$ Rasterizer Early Z Early STC Depth$ Stencil$ L1 IC$ Data Port EU Slice Common EU EU EU EU EU EU EU L1 IC$ Data Port EU EU EU EU EU EU EU EU
  • 20. Arithmetic Logic Performance 20 Algorithmic Complexity  Control flow  Math  Extended math  Max concurrent registers 0 0.2 0.4 0.6 0.8 1 MAX LRP CMP LOG EXP POW ADD MUL MAD Performance Synthetic Relative Performance - EU Operations 0 0.2 0.4 0.6 0.8 1 Performance Synthetic SIMD8 vs. SIMD16 SIMD16 SIMD8
  • 21. Shader Optimization Optimal code based on purpose  Shader scaling  The case of the generic shader  Generation of un-used outputs  21 vs_3_0 def c17, 2, -1, 0, 1 def c18, 1.44269502, 0.00999999978, -1.44269502, 0 dcl_position v0 dcl_normal v1 dcl_color v2 dcl_position o0 dcl_texcoord o1 dcl_texcoord1 o2 dcl_texcoord2 o3.xyz dcl_texcoord3 o4.xyz dcl_color o5 dcl_texcoord4 o6 dcl_texcoord5 o7 dcl_texcoord6 o8.xy mul r0, c5, v0.y mad r0, v0.x, c4, r0 mad r0, v0.z, c6, r0 mad o0, v0.w, c7, r0 .. 76 instructions… mov o7, v2 vs_3_0 dcl_position v0 dcl_position o0 mul r0, c5, v0.y mad r0, v0.x, c4, r0 mad r0, v0.z, c6, r0 mad o0, v0.w, c7, r0 0 50000 100000 150000 Cycles Original Optimized 0 200 400 600 800 1000 Cache Entries Original Optimized
  • 22. Geometry 22 Single Non-Slice  Fixed Function  VS  HS  TE  DS  GS  SOL  Clipper  Setup Front-End Non-Slice Sub-Slice Sub-Slice MemoryInterface Tex$Sampler Tex$Sampler L3$ Pixel Back-End RCC$ Rasterizer Early Z Early STC Depth$ Stencil$ L1 IC$ Data Port EU Slice Common EU EU EU EU EU EU EU L1 IC$ Data Port EU EU EU EU EU EU EU EU VF VS HS TE DS GS SOL CL SFE TDG
  • 23. Optimizing Geometry for Algorithmic Complexity Optimal definitions for a single piece of geometry  Quality scaling with platform  Purpose  Lighting, depth, animation… 23 Model with Hard Edges Model with Soft Edges Before (Hard Edge) After (Soft Edge) Duplicate Vertex Merged Vertex + Normal Ex. Edge Softening
  • 24. Optimizing Primitive Ordering 24 Primitive scheduling within a single draw  Ordering primitives for both locality and latency  Two cases  View dependent  View independent  Sample example (HDAO10.1)  Primitive dispatch color coded (green -> red)  2%-13% performance gain Original Ordering View Dependent Ordering
  • 25. Scaling Your Game Burn baby burn, heat inferno… 25 Lost Planet 2 : Images courtesy of Capcom
  • 26. Why do you care? Wide Range of Platforms + CPU + GPU  Each with unique performance characteristics  All of which the user hopes to run your game  And run it well  26 Better selling point? more platforms + happy users == more money? $$$ 
  • 27. How Well Does Your Game Scale?  Created a game  Quality settings 27 Tablet Low Phone Ultra Low? Mobile Medium Desktop Ultra
  • 28. Memory Bandwidth 28 It’s all about the memory.. baby  Will vary greatly with platform  Why do you care?  Read from memory  Write to memory Sharp turn ahead! Low High MemoryBandwidth Goal - Establish memory ceiling (budget)
  • 29. Sampler Throughput 29 Varies with architecture and platform  Measure all use cases  Dimension  Format  Filtering mode 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.03 0.25 0.3 6 10 16 Throughput Memory Footprint (MB) 32bit (Point/Bilinear) 32bit (Trilinear) Ex. Synthetic Sampler Throughput 32bit Use Cases Architectural Peak Goal - Select optimal format & dimension
  • 30. Fill Rate 30 Multiple surface types  Render target  Format  Dimension  Blended / Non-blended  Depth  Read +/ Write  Stencil  Read +/ Write 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 RT 32bit RT 32bit (Blend) Z Fail Z Pass STC Test Performance Synthetic Relative Performance – Fullscreen Primitive Goal - Understand relative performance - Optimal format, dimension, and algorithm
  • 31. Geometry Throughput 31 Fixed function bandwidth and Arithmetic Logic  Fixed function  Clip / Cull  Rasterization  Geometry transformation  ALU Goal - Optimal geometry and algorithm 0 0.5 1 MAX LRP CMP LOG EXP POW ADD MUL MAD Performance Synthetic Relative Performance - EU Operations Vs. Ex. Geometry Selection Low vs. High Throughput
  • 32. Conclusion Wrapping it all up in a bow.. 32 THE END
  • 33. Looking Forward 33 Same game for desktop to phone  Wide array of platforms  Adaptable quality settings  Scaling algorithms  Optimization Thanks for attending! And everything in-between...
  • 34. Questions? 34 Contact Information  E-mail : robert.b.taylor@intel.com
  • 35. Ready for More? Look Inside™. 35 Keep in touch with us at GDC and beyond: • Game Developer Conference Visit our Intel® booth #1016 in Moscone South • Intel University Games Showcase Marriott Marquis Salon 7, Thursday 5:30pm RSVP at bit.ly/intelgame • Intel Developer Forum, San Francisco September 9-11, 2014 intel.com/idf14 • Intel Software Adrenaline @inteladrenaline • Intel Developer Zone software.intel.com @intelsoftware
  • 36.
  • 37. Up Next… 37 12:30 – 1:30 Realistic Cloud Rendering using Pixel Synchronization Presented by: Egor Yusov - Intel