[Unity Forum 2019] Mobile Graphics Optimization Guides

Confidential © 2018 Arm Limited
Owen Wu (owen.wu@arm.com)
Developer Relations Engineer
Mobile Graphics Optimization Guides
Who We Are
Arm Developer Relations Introduction
Who We Are
• Arm provides CPU and GPU IP to silicon vendors
• We don’t make chips
• Clients include Apple, Samsung, Nintendo, Qualcomm, MediaTek, etc.
• SoftBank acquired Arm in 2016
• >95% market share of smartphone CPUs
• >1 billion GPUs shipped in 2018
• We help developers make better games
How We Can Help
• Developer education
• Game issue investigation
• Game performance analysis
• Game performance optimization
• Deep collaboration on Mali GPU
• Game promotion at Arm/partner global events
• Development devices (in the future)
Optimization Guides
Why Optimize
• Retain users with reliable performance, smooth gameplay, and long battery life
• Widen market by ensuring good user experience on a wide range of consumer devices
• Enhance visuals by maximizing rendering effectiveness inside a 2.5 Watt system-wide power budget
Best Practices of Optimization
• Write code with optimization in mind
  • Hardware knowledge
• Check for performance regressions every day
  • Arm Performance Advisor
• Don’t do deep optimization too early
  • Arm Mobile Studio
How To Do Optimization
• Knowledge of hardware
• Gather reliable and stable data
• Identify the bottleneck first
• Find out why the bottleneck happens
• Figure out the solution
• Verify the solution
• Optimize one thing at a time
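To make "gather reliable and stable data" concrete, here is a minimal Python sketch that summarizes captured frame times and flags noisy runs before you compare them. The function name and the 5% coefficient-of-variation threshold are illustrative assumptions, not part of any Arm tool.

```python
import statistics

def summarize_frame_times(frame_ms, max_cv=0.05):
    """Summarize per-frame times (ms) and flag whether the run looks stable.

    max_cv is an illustrative threshold on the coefficient of variation
    (population stdev / mean); tune it for your own test setup.
    """
    mean = statistics.fmean(frame_ms)
    cv = statistics.pstdev(frame_ms) / mean
    return {
        "median_ms": statistics.median(frame_ms),
        "mean_ms": mean,
        "cv": cv,
        "stable": cv <= max_cv,
    }

steady = [16.6, 16.7, 16.5, 16.6, 16.7, 16.6]
hitchy = [16.6, 33.3, 16.6, 16.6, 33.3, 16.6]  # periodic hitches
print(summarize_frame_times(steady)["stable"])  # True
print(summarize_frame_times(hitchy)["stable"])  # False
```

An unstable run (for example, one affected by thermal throttling or background work) should be re-captured rather than used to judge an optimization.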
Basic Hardware Concepts
Pipeline Stages
Application → Graphics Driver → Vertex Shader → Primitive Assembler → Rasterizer → Early Frag Op → Fragment Shader → Late Frag Op → Blending
Outputs: Color, Depth, Stencil
The GPU Is Multi-threaded
• Traditional graphics APIs are single-threaded
• The driver hides the complexity
• The CPU and GPU run in parallel
• Tasks within the GPU also run in parallel
• Optimization goal: keep the GPU as busy as possible
Desktop GPU – Immediate Rendering Mode
[Diagram: Application Draw Calls → GPU (Vertex Shader, Fragment Shader) → Frame Buffer 0 / Frame Buffer 1]
Mobile GPU – Tiled Rendering Mode
[Diagram: Application Draw Calls → GPU (Vertex Shader, Fragment Shader) → Frame Buffer 0 / Frame Buffer 1]
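The core idea of tiled rendering is that geometry is first sorted ("binned") into screen tiles, so each tile can later be shaded entirely in on-chip memory. The Python sketch below is a simplified model of that binning step using a conservative bounding-box test; the 16x16 tile size is an assumption for illustration, and real tilers (including Mali's) use more precise and more elaborate schemes.

```python
def bin_triangles(triangles, width, height, tile=16):
    """Assign each screen-space triangle to every tile its bounding box touches.

    triangles: list of three (x, y) pixel-coordinate vertices each.
    Returns {(tile_x, tile_y): [triangle indices]}.
    """
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    bins = {}
    for i, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen, then convert to tile coordinates.
        x0 = max(int(min(xs)) // tile, 0)
        x1 = min(int(max(xs)) // tile, tiles_x - 1)
        y0 = max(int(min(ys)) // tile, 0)
        y1 = min(int(max(ys)) // tile, tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(i)
    return bins

tris = [[(0, 0), (10, 0), (0, 10)],     # fits inside tile (0, 0)
        [(20, 5), (40, 5), (30, 20)]]   # spans 2x2 tiles
bins = bin_triangles(tris, 64, 32)
print(sorted(bins))  # tiles that will have work in the fragment phase
```

In the fragment phase, the GPU would walk the tiles one at a time, shading only the triangles listed for that tile into tile memory and writing the finished tile out once.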
External Memory / Memory Bandwidth
• Mobile external memory bandwidth is much smaller than desktop bandwidth
• Reading from or writing to external memory consumes a lot of power
• Tile-based rendering reduces the memory bandwidth requirement
• GPU computing power advances faster than memory bandwidth
• A memory bandwidth budget of around 2 GB/s is usually recommended
[Diagram: GPU with on-chip Tile Memory; accesses to External memory take the Slow Path]
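To see how quickly the ~2 GB/s guideline is consumed, a minimal sketch of the arithmetic follows. It assumes a 1920x1080 RGBA8 color buffer (4 bytes per pixel) at 60 FPS and ignores frame buffer compression; the function name is hypothetical.

```python
BUDGET_GBS = 2.0  # the ~2 GB/s guideline quoted on this slide

def fullscreen_traffic_gbs(width, height, bytes_per_pixel, fps, transfers=1):
    """GB/s (1 GB = 1e9 bytes) for `transfers` full-screen reads or writes per frame."""
    return width * height * bytes_per_pixel * fps * transfers / 1e9

one_write = fullscreen_traffic_gbs(1920, 1080, 4, 60)
print(f"{one_write:.3f} GB/s, {one_write / BUDGET_GBS:.0%} of the budget")
# 0.498 GB/s, 25% of the budget
```

A single write of the final image already uses about a quarter of the budget, which is why every extra full-screen read or write (for example, an unnecessary tile load or a stored depth buffer) matters on mobile.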
Cache
• A cache keeps data in fast memory
• Cache size is small
• Keep data small so the cache can hold more of it
• Sequential access increases the cache hit rate
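The effect of access order on hit rate can be shown with a toy direct-mapped cache. This Python sketch assumes 64-byte lines and a 2 KiB cache; it is not a model of any real Mali cache hierarchy, only an illustration of why sequential access helps.

```python
def hit_rate(addresses, line_size=64, num_lines=32):
    """Fraction of byte addresses that hit in a toy direct-mapped cache."""
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        line = addr // line_size
        slot = line % num_lines
        if tags[slot] == line:
            hits += 1
        else:
            tags[slot] = line  # miss: fetch the line, evicting the previous one
    return hits / len(addresses)

W = H = 64  # 64x64 array of 4-byte elements (16 KiB), larger than the 2 KiB cache
ELEM = 4
row_major = [(y * W + x) * ELEM for y in range(H) for x in range(W)]     # sequential
column_major = [(y * W + x) * ELEM for x in range(W) for y in range(H)]  # 256-byte stride
print(hit_rate(row_major))     # 0.9375
print(hit_rate(column_major))  # 0.0
```

Sequential access reuses each fetched 64-byte line for 16 elements, while the strided walk touches a new line on every access and evicts lines before they can be reused.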
Pixel/Fragment/Texel
• A pixel is a single element of the frame buffer
• A fragment is a GPU thread that computes the output for a pixel
• A texel is a single element of a texture
Texture Compression
• ASTC can give better quality than ETC at the same memory size, or the same quality at a smaller memory size.
• ASTC takes longer to encode than ETC, which can slow down game packaging. For this reason, it is better to use it for the final packaging of the game.
• ASTC gives more control over quality by letting you set the block size. There is no single best block size, but 5x5 or 6x6 is generally a good default.
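The block-size trade-off is simple arithmetic: every ASTC block is 128 bits regardless of its footprint, so larger blocks mean fewer bits per pixel. ETC2 uses fixed 4x4 blocks of 64 bits (RGB) or 128 bits (RGBA). A minimal Python sketch for one 1024x1024 mip level (function names are hypothetical):

```python
import math

def astc_bytes(width, height, block_w, block_h):
    """Size of one ASTC mip level: every block is 128 bits (16 bytes)."""
    return math.ceil(width / block_w) * math.ceil(height / block_h) * 16

def etc2_bytes(width, height, has_alpha=False):
    """ETC2 uses 4x4 blocks: 8 bytes for RGB, 16 bytes for RGBA."""
    block_bytes = 16 if has_alpha else 8
    return math.ceil(width / 4) * math.ceil(height / 4) * block_bytes

for bw, bh in [(4, 4), (5, 5), (6, 6), (8, 8)]:
    print(f"ASTC {bw}x{bh}: {128 / (bw * bh):.2f} bpp, "
          f"{astc_bytes(1024, 1024, bw, bh)} bytes")
print("ETC2 RGB:", etc2_bytes(1024, 1024), "bytes")
```

At 6x6 (about 3.56 bits per pixel), ASTC is already smaller than ETC2 RGB (4 bits per pixel); whether the quality is acceptable still has to be checked per asset.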
Texture Color Space
• Use linear color space rendering if your scene uses lighting
• Check the sRGB setting in the texture Inspector window
• Textures that do not store color (such as metallic, roughness, and normal maps) should NOT use the sRGB color space
• Current hardware supports sRGB formats and performs the gamma conversion automatically, at no extra cost
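The conversion that hardware applies when sampling an sRGB texture is the standard sRGB transfer function. This Python sketch reproduces it to show why data textures must not be marked sRGB: a roughness value authored as 0.5 would be read back as roughly 0.214.

```python
def srgb_to_linear(c):
    """sRGB-encoded value in [0, 1] -> linear value (standard sRGB transfer function)."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(l):
    """Linear value in [0, 1] -> sRGB-encoded value."""
    if l <= 0.0031308:
        return l * 12.92
    return 1.055 * l ** (1 / 2.4) - 0.055

# A roughness map authored as 0.5 but wrongly marked sRGB:
print(round(srgb_to_linear(0.5), 3))  # 0.214
```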
Texture Filtering
• Trilinear: like bilinear, but also blends between mipmap levels
• Don’t use trilinear on textures without mipmaps
• This filtering removes the visible transitions between mipmap levels by blending smoothly between them
Texture Filtering
• Anisotropic: makes textures look better when viewed at oblique angles, which is good for ground-level textures
• Higher anisotropic levels cost more
Texture Filtering
• Use bilinear for a balance between performance and visual quality
• Trilinear costs more memory bandwidth than bilinear, so use it selectively
• Bilinear + 2x anisotropic usually looks and performs better than trilinear + 1x anisotropic, so this combination can be a better solution than trilinear
• Keep the anisotropic level low. Use a level higher than 2 only selectively, for critical game assets.
• Higher anisotropic levels cost much more bandwidth and shorten device battery life.
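The bandwidth difference between bilinear and trilinear follows from the textbook definitions: bilinear reads a 2x2 texel footprint from one mip level, and trilinear does that on two adjacent levels and blends the results. The Python sketch below models only texel fetch counts, not the cycle costs of any particular Mali texture unit, and does not model anisotropic probes.

```python
import math

def bilinear(tex, u, v):
    """Bilinearly sample `tex` (a list of rows of floats) at normalized (u, v).

    Uses clamp-to-edge addressing and texel centers at (i + 0.5) / size.
    Returns (value, texels_fetched).
    """
    h, w = len(tex), len(tex[0])
    x, y = u * w - 0.5, v * h - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def texel(i, j):
        return tex[min(max(j, 0), h - 1)][min(max(i, 0), w - 1)]

    top = texel(x0, y0) * (1 - fx) + texel(x0 + 1, y0) * fx
    bottom = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy, 4

def trilinear(mips, u, v, lod):
    """Blend bilinear samples from the two mip levels around `lod`.
    This sketch always samples two levels: 2 x 4 = 8 texel fetches."""
    lo = min(max(math.floor(lod), 0), len(mips) - 1)
    hi = min(lo + 1, len(mips) - 1)
    t = min(max(lod - lo, 0.0), 1.0)
    a, na = bilinear(mips[lo], u, v)
    b, nb = bilinear(mips[hi], u, v)
    return a * (1 - t) + b * t, na + nb

level0 = [[0.0, 1.0], [2.0, 3.0]]
level1 = [[1.5]]  # 1x1 average of level0
print(bilinear(level0, 0.5, 0.5))                  # (1.5, 4)
print(trilinear([level0, level1], 0.5, 0.5, 0.5))  # (1.5, 8)
```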
Best Optimization Practices for Mali GPUs
Reduce Render State Switch
• A render state switch is a very expensive operation
• Render as many primitives as possible in one draw call
• Don’t just check the number of draw calls or batches
• The number of render state switches is also an important metric
• Tris/SetPass (e.g. 95.2K/34) is a more accurate measure
• Batch as many draw calls as possible:
  • Static batching
  • GPU Instancing
  • Dynamic batching
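The idea behind static batching is to merge meshes that share a material into one vertex buffer and one index buffer, offsetting each mesh's indices, so they can be drawn with a single call. The Python sketch below illustrates that idea only; it is not Unity's implementation, and it uses a plain translation as a stand-in for a full object-to-world transform.

```python
def static_batch(meshes):
    """Merge meshes that share one material into a single vertex/index buffer.

    meshes: list of (vertices, indices, offset), where vertices are (x, y, z)
    tuples in object space, indices form a triangle list, and offset is a
    world-space translation (a stand-in for a full transform matrix).
    """
    vertices, indices = [], []
    for verts, idx, (ox, oy, oz) in meshes:
        base = len(vertices)  # this mesh's first vertex in the merged buffer
        vertices.extend((x + ox, y + oy, z + oz) for x, y, z in verts)
        indices.extend(base + i for i in idx)
    return vertices, indices

tri = ([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [0, 1, 2])
merged_v, merged_i = static_batch([(*tri, (0, 0, 0)), (*tri, (5, 0, 0))])
print(len(merged_v), merged_i)  # 6 [0, 1, 2, 3, 4, 5]
```

Because vertices are pre-transformed into world space, the merged buffer costs extra memory, which is the usual trade-off of static batching.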
Reduce Frame Buffer Switch
• Bind each frame buffer object only once
• Make all required draw calls before switching to the next one
• Avoid unnecessary context switches
• Use the Unity Frame Debugger to check
• Use Arm Mobile Studio for API-level checks
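"Bind each frame buffer object only once" amounts to grouping draws by render target. The Python sketch below counts binds and performs a stable grouping; the names are hypothetical, and the reordering is only valid when no draw depends on another target's output produced in between (for example, a shadow map that must be finished before the main pass samples it).

```python
def count_target_binds(draws):
    """Count frame buffer binds for a sequence of (target, draw) pairs."""
    binds, current = 0, None
    for target, _ in draws:
        if target != current:
            binds += 1
            current = target
    return binds

def group_by_target(draws):
    """Reorder draws so each target is bound once, keeping the first-use order
    of targets and the original order of draws within each target."""
    groups = {}
    for target, draw in draws:
        groups.setdefault(target, []).append(draw)  # dicts preserve insertion order
    return [(t, d) for t, ds in groups.items() for d in ds]

draws = [("shadow", "d1"), ("main", "d2"), ("shadow", "d3"), ("main", "d4")]
print(count_target_binds(draws), count_target_binds(group_by_target(draws)))  # 4 2
```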
Clear Frame Buffer Before Rendering
• Unless its attachments are cleared or invalidated, the GPU reads the frame buffer from external memory into tile memory before rendering
• Minimize start-of-tile loads
• The GPU can cheaply initialize tile memory to a clear color value instead
• Ensure that you clear or invalidate all of your attachments at the start of each render pass
• Use the Unity Frame Debugger to check
• Use Arm Mobile Studio for API-level checks
[Screenshot: no clear before switching render target. Bad for performance.]
Reduce Frame Buffer Write
• After rendering, the GPU writes the result from tile memory to external memory
• Minimize end-of-tile stores
• Avoid writing back to external memory whenever possible
• Disable writing to the depth/stencil buffer if the depth/stencil values are not used afterwards
• Use the Unity Frame Debugger to check
• Use Arm Mobile Studio for API-level checks
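Start-of-tile loads and end-of-tile stores can be estimated per render pass. In this Python sketch, the "load" / "clear" / "dont_care" and "store" / "dont_care" actions are hypothetical labels that mirror the idea of Vulkan's attachment load and store operations (in OpenGL ES the equivalents are clearing and invalidating attachments). It assumes a 1080p pass with a 4-byte color and a 4-byte depth/stencil attachment and ignores frame buffer compression.

```python
def pass_traffic_bytes(width, height, attachments):
    """External-memory bytes moved by one render pass on a tile-based GPU.

    attachments: list of (bytes_per_pixel, load_op, store_op), where load_op is
    "load", "clear", or "dont_care" and store_op is "store" or "dont_care".
    Only "load" reads external memory and only "store" writes it; clears and
    don't-care are resolved inside tile memory.
    """
    pixels = width * height
    total = 0
    for bpp, load_op, store_op in attachments:
        if load_op == "load":
            total += pixels * bpp  # start-of-tile read
        if store_op == "store":
            total += pixels * bpp  # end-of-tile write
    return total

naive = [(4, "load", "store"), (4, "load", "store")]        # color + depth/stencil, no clears
tuned = [(4, "clear", "store"), (4, "clear", "dont_care")]  # clear both, keep only color
naive_bytes = pass_traffic_bytes(1920, 1080, naive)
tuned_bytes = pass_traffic_bytes(1920, 1080, tuned)
print(f"{naive_bytes / 2**20:.1f} MiB vs {tuned_bytes / 2**20:.1f} MiB per frame")
# 31.6 MiB vs 7.9 MiB per frame
```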
Avoid Rendering Small Triangles
• The bandwidth and processing cost of a vertex is typically orders
of magnitude higher than the cost of processing a fragment
• Make sure that you get many pixels’ worth of fragment work for each primitive
• Make sure each model creates at least 10-20 fragments per primitive
• Use dynamic mesh level of detail (LOD), using simpler meshes when objects are farther away from the camera
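The 10-20 fragments-per-primitive guideline can drive LOD selection directly. The Python sketch below picks the most detailed LOD that still meets a minimum fragments-per-triangle ratio; it assumes you already have an estimate of the pixels an object covers (for example, from its projected bounds), and the triangle counts are hypothetical. Unity's own LOD Group switches on relative screen height instead, so treat this as an illustration of the ratio, not of Unity's logic.

```python
def choose_lod(covered_pixels, lod_triangle_counts, min_frags=10):
    """Pick the most detailed LOD (lowest index) that still gives at least
    min_frags fragments per triangle; fall back to the simplest LOD."""
    for lod, tris in enumerate(lod_triangle_counts):
        if covered_pixels / tris >= min_frags:
            return lod
    return len(lod_triangle_counts) - 1

lods = [5000, 1500, 400]         # hypothetical triangle counts for LOD0..LOD2
print(choose_lod(200_000, lods))  # 0: 40 fragments per triangle
print(choose_lod(20_000, lods))   # 1: about 13 fragments per triangle
print(choose_lod(2_000, lods))    # 2: only 5 per triangle even at LOD2
```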
Take Advantage of Early-Z
• Many fragments are occluded by other fragments
• Running the fragment shader for occluded fragments wastes GPU power
• Render opaque objects from front to back
• Occluded fragments are rejected early if:
  • The fragment shader doesn’t use discard
  • The fragment shader doesn’t write to the depth buffer
  • Alpha-to-coverage is OFF
• Otherwise, the fragment takes the Late-Z path, which rejects occluded fragments only after the fragment shader has run
[Diagram: Early Frag Op → Fragment Shader → Late Frag Op]
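Why front-to-back order matters can be shown at a single pixel covered by several opaque surfaces. This Python sketch models only a plain early depth test (LESS), not Mali's Forward Pixel Kill, which can additionally remove some already-queued occluded fragments.

```python
def shaded_fragments(depths):
    """Count fragment-shader invocations at one pixel with early-Z (depth test LESS).

    depths: depth of each opaque surface covering the pixel, in draw order
    (smaller = closer). A fragment is shaded only if it passes the depth test
    before shading, as on the early-Z path.
    """
    nearest = float("inf")  # depth buffer cleared to the far value
    shaded = 0
    for d in depths:
        if d < nearest:  # early depth test
            shaded += 1
            nearest = d
    return shaded

layers = [0.2, 0.5, 0.8]  # three overlapping opaque surfaces
print(shaded_fragments(sorted(layers)))                # front to back: 1
print(shaded_fragments(sorted(layers, reverse=True)))  # back to front: 3
```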
Always Use Mipmaps Unless the Camera Is Static
• Mipmapping improves GPU performance
  • Fewer texture cache misses
• Mipmapping also reduces texture aliasing and improves final image quality
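Two numbers explain the mipmap trade-off: a full mip chain costs about one third more memory, and the sampler picks a level of roughly log2 of the texels covered per screen pixel, so neighboring pixels read neighboring texels instead of skipping across the texture. A minimal Python sketch, using the simplified isotropic LOD formula:

```python
import math

def mip_levels(width, height):
    """Number of levels in a full mip chain down to 1x1: floor(log2(max)) + 1."""
    return max(width, height).bit_length()

def mip_chain_bytes(width, height, bytes_per_texel):
    """Total bytes for all mip levels of an uncompressed texture."""
    total, w, h = 0, width, height
    while True:
        total += w * h * bytes_per_texel
        if w == 1 and h == 1:
            break
        w, h = max(w // 2, 1), max(h // 2, 1)
    return total

def mip_lod(texels_per_pixel):
    """Level selected when one screen pixel spans this many texels per axis."""
    return max(0.0, math.log2(texels_per_pixel))

print(mip_levels(1024, 1024))                              # 11
print(mip_chain_bytes(1024, 1024, 4) / (1024 * 1024 * 4))  # ~1.333
print(mip_lod(4.0))                                        # 2.0
```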
Don't Use a Pre-Z Pass
• In a Z-prepass, opaque geometry is drawn twice: first as a depth-only update, then as a color render that uses an EQUAL depth test to reduce redundant fragment processing
• Mali GPUs already include optimizations such as Forward Pixel Kill (FPK) that reduce redundant fragment processing automatically
• The cost of the additional draw calls, vertex shading, and memory bandwidth nearly always outweighs the benefits of a Z-prepass
Shader Floating-point Precision
• Use mediump and highp keywords
• Full FP32 precision is unnecessary for many uses of vertex attribute data
• Keep the data at the minimum precision needed to produce an acceptable final output
• Use FP32 for computing vertex positions only
• Use the lowest possible precision for other attributes
• Don’t use FP32 for everything
• Don’t upload FP32 data into a buffer and then read it as a mediump attribute
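Why positions need FP32 while most other attributes do not can be seen by rounding values to half precision. On many mobile GPUs, including Mali, mediump is typically implemented as FP16, although GLSL ES only specifies a minimum precision. This Python sketch uses the standard `struct` half-float format to round values.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

print(struct.calcsize("<e"), "bytes vs", struct.calcsize("<f"))  # 2 bytes vs 4
print(to_fp16(0.7071))  # normal component: error well below 0.001
print(to_fp16(1000.3))  # world-space position: 1000.5, an error of 0.2 units
```

Normals, colors, and UVs in the 0..1 range keep several significant digits at half the storage, while a large world-space coordinate already loses a fifth of a unit.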
Arm Mobile Studio Introduction
What is in the box?
• Streamline
• Graphics Analyzer
• Mali Offline Compiler (separate download)
• Performance Advisor (closed beta)
Download Arm Mobile Studio: http://developer.arm.com/mobile-studio
Streamline: Performance Analyzer
Mali GPU support
• Analyze and optimize Mali™ GPU graphics and compute workloads
• Accelerate your workflow using built-in analysis templates
Optimize for energy
• Move beyond simple frame time and FPS tracking
• Monitor overall usage of processor cycles and memory bandwidth
Speed up your app
• Find out where the system is spending the most time
• Tune code for cache efficiency
Native code profiling
• Break performance down by function
• View cost alongside disassembly listing
Arm CPU support
• Profile 32-bit and 64-bit apps for ARMv7-A and ARMv8-A cores
• Tune multi-threading for DynamIQ multi-core systems
Application event trace
• Annotate software workloads
• Define logical event channel structure
• Trace cross-channel task dependencies
Tune your rendering
• Identify critical-path GPU shader core resources
• Detect content inefficiency
Streamline
Triage nurse scenarios
[Figure: scenarios shown against a 16.6 ms frame time: Vsync bound, Fragment bound, CPU bound, Serialization problems, Thermally bound]
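A first-pass triage along these lines can be written as a simple heuristic over per-frame timings. In this Python sketch, the inputs (CPU busy time, GPU busy time, and presented frame time) and the 5% tolerance are assumptions for illustration; distinguishing fragment-bound from vertex-bound work needs GPU counters, and thermal throttling needs a trend over time, so neither is covered here.

```python
def triage(cpu_ms, gpu_ms, frame_ms, budget_ms=16.6):
    """Rough first-pass classification of one frame's timing (heuristic only).

    cpu_ms: busy time of the main CPU thread for the frame
    gpu_ms: GPU busy time for the frame
    frame_ms: measured time between presented frames
    """
    if frame_ms <= budget_ms * 1.05:
        return "on budget (likely vsync bound)"
    if cpu_ms > budget_ms and cpu_ms >= gpu_ms:
        return "CPU bound"
    if gpu_ms > budget_ms:
        return "GPU bound"  # next: check fragment vs vertex counters in Streamline
    return "serialization problem: neither processor is full, yet frames are late"

print(triage(6, 10, 16.6))   # on budget (likely vsync bound)
print(triage(25, 10, 25))    # CPU bound
print(triage(8, 30, 30))     # GPU bound
print(triage(10, 10, 20))    # serialization problem: ...
```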
Graphics Analyzer: GPU API Debugger
Shader analysis
• Capture and view all shaders used
• Optimize shader performance using integrated Mali Offline Compiler
Cross platform
• Host support for Windows, macOS, and Linux
• Target support for any Android GPU
Rendering API debug
• Graphics debug for content developers
• Support for all versions of OpenGL ES and Vulkan
Visual analysis views
• Native mode
• Overdraw mode
• Shader map mode
• Fragment count mode
State visibility
• Show API state after every API call
• Trace backwards from point-of-use to API call responsible for state set
Android utility app
• Manage on-device connection
• Select and launch user application
Frame analysis
• Diagnose root causes of rendering errors
• Identify sources of rendering inefficiency
[Screenshot: Graphics Analyzer views: Trace outline, Frame capture, Vertex data, API calls, Statistics, Target state, Shaders, and Textures, Buffers, Uniforms, …]
Mali Offline Compiler: Shader static analysis
Rapid iteration
• Verify impact of shader changes without needing whole application rebuild
Profile for any Mali GPU
• Cost shader code for every Mali GPU without needing hardware
Mali GPU aware
• Support for all actively shipping Mali GPUs
• Cycle counts reflect specific microarchitecture
Control flow aware
• Best case control flow
• Worst case control flow
Syntax verification
• Verify correctness of code changes
• Receive clear error diagnostics for invalid shaders
Critical path analysis
• Identify dominant shader resource
• Target this for optimization!
Register usage
• Work registers
• Uniform registers
• Stack spilling
Mali Offline Compiler
• Use Graphics Analyzer (GA) to capture the API calls and shaders to understand the app’s behavior
• Use the Mali Offline Compiler to profile instruction counts for the arithmetic (ALU), load/store (L/S), and texture (TEX) pipelines
• If a shader needs more registers than are available, the GPU must perform register spilling
• Register spilling causes big inefficiencies and higher load/store utilization

Mali_Offline_Compiler_v4.3.0$ ./malisc --core Mali-T600 --revision 0p0_15dev0 --driver Mali-T600_r4p0-00rel0 --vertex shader-176.vert -V
ARM Mali Offline Shader Compiler v4.3.0 (C) Copyright 2007-2014 ARM Limited. All rights reserved.
Compilation successful. 3 work registers used, 16 uniform registers used, spilling not used.

                A     L/S   T   Total   Bound
Cycles:         9     5     0   14      A
Shortest Path:  4.5   5     0   9.5     L/S
Longest Path:   4.5   5     0   9.5     L/S

Note: The cycles counts do not include possible stalls due to cache misses.
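When checking many shaders, for example in a nightly job, it can help to extract the cycle table from the compiler output. This Python sketch parses the table layout shown in the malisc v4.3.0 sample above; the function name is hypothetical, and newer Mali Offline Compiler releases use a different tool name and output format, so treat the parser as tied to this sample.

```python
def parse_cycle_table(text):
    """Parse the 'A L/S T Total Bound' cycle rows from malisc v4.x output.

    Returns {row_label: {"A", "L/S", "T", "Total": float, "Bound": str}}.
    Lines that do not look like 'Label: a ls t total bound' are skipped.
    """
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) != 5:
            continue
        try:
            a, ls, t, total = (float(f) for f in fields[:4])
        except ValueError:
            continue
        rows[label.strip()] = {"A": a, "L/S": ls, "T": t,
                               "Total": total, "Bound": fields[4]}
    return rows

sample = """\
                A     L/S   T   Total   Bound
Cycles:         9     5     0   14      A
Shortest Path:  4.5   5     0   9.5     L/S
Longest Path:   4.5   5     0   9.5     L/S
"""
table = parse_cycle_table(sample)
print(table["Cycles"]["Bound"])      # A
print(table["Longest Path"]["L/S"])  # 5.0
```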
Performance Advisor: Analysis Reports
Fast workflow
• Integrate data capture and analysis into nightly CI test workflow
• Read results over a nice cup of tea
Caveats …
• Still under development
• Currently in closed beta
Region views
• Split by dynamic behavior
• Split by application annotation
Executive dashboard
• Show high level status summary
• Show status breakdown by regions of interest
Overview charts
• See performance trends over time
• See region splits
Summary reports
• Easy-to-use performance status reports
• Integrated initial root cause analysis
Performance Advisor
• An automated performance triage nurse
• Move beyond simple FPS-based regression tracking
• Perform an automated first-pass analysis
• Generate easy-to-read performance reports
• Route common issues directly to the team to review
• Free up performance experts to focus on the difficult problems
• Integrate into nightly continuous integration
• Catch major issues early
• Detect gradual regressions before they start impacting users
Region-by-region analysis
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
תודה
The Arm trademarks featured in this presentation are registered trademarks or
trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights
reserved. All other marks featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks
Confidential © 2018 Arm Limited
1 of 45

Recommended

[TGDF 2020] Mobile Graphics Best Practices for Artist by
[TGDF 2020] Mobile Graphics Best Practices for Artist[TGDF 2020] Mobile Graphics Best Practices for Artist
[TGDF 2020] Mobile Graphics Best Practices for ArtistOwen Wu
468 views28 slides
Customizing a production pipeline by
Customizing a production pipelineCustomizing a production pipeline
Customizing a production pipelineFelipe Lira
1.9K views44 slides
Speed up your asset imports for big projects - Unite Copenhagen 2019 by
Speed up your asset imports for big projects - Unite Copenhagen 2019Speed up your asset imports for big projects - Unite Copenhagen 2019
Speed up your asset imports for big projects - Unite Copenhagen 2019Unity Technologies
14.3K views47 slides
[TGDF 2019] Mali GPU Architecture and Mobile Studio by
[TGDF 2019] Mali GPU Architecture and Mobile Studio[TGDF 2019] Mali GPU Architecture and Mobile Studio
[TGDF 2019] Mali GPU Architecture and Mobile StudioOwen Wu
968 views70 slides
Unity mobile game performance profiling – using arm mobile studio by
Unity mobile game performance profiling – using arm mobile studioUnity mobile game performance profiling – using arm mobile studio
Unity mobile game performance profiling – using arm mobile studioOwen Wu
714 views32 slides
Hello, DirectCompute by
Hello, DirectComputeHello, DirectCompute
Hello, DirectComputedasyprocta
6.5K views66 slides

More Related Content

What's hot

Unreal Engine Basics 01 - Game Framework by
Unreal Engine Basics 01 - Game FrameworkUnreal Engine Basics 01 - Game Framework
Unreal Engine Basics 01 - Game FrameworkNick Pruehs
738 views53 slides
Rendering AAA-Quality Characters of Project A1 by
Rendering AAA-Quality Characters of Project A1Rendering AAA-Quality Characters of Project A1
Rendering AAA-Quality Characters of Project A1Ki Hyunwoo
19.7K views94 slides
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ... by
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...Johan Andersson
8K views45 slides
Next generation graphics programming on xbox 360 by
Next generation graphics programming on xbox 360Next generation graphics programming on xbox 360
Next generation graphics programming on xbox 360VIKAS SINGH BHADOURIA
1.9K views60 slides
Light prepass by
Light prepassLight prepass
Light prepasschangehee lee
10.2K views42 slides
A Real-time Radiosity Architecture by
A Real-time Radiosity ArchitectureA Real-time Radiosity Architecture
A Real-time Radiosity ArchitectureElectronic Arts / DICE
23.6K views37 slides

What's hot(20)

Unreal Engine Basics 01 - Game Framework by Nick Pruehs
Unreal Engine Basics 01 - Game FrameworkUnreal Engine Basics 01 - Game Framework
Unreal Engine Basics 01 - Game Framework
Nick Pruehs738 views
Rendering AAA-Quality Characters of Project A1 by Ki Hyunwoo
Rendering AAA-Quality Characters of Project A1Rendering AAA-Quality Characters of Project A1
Rendering AAA-Quality Characters of Project A1
Ki Hyunwoo19.7K views
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ... by Johan Andersson
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
Johan Andersson8K views
Killzone Shadow Fall: Creating Art Tools For A New Generation Of Games by Guerrilla
Killzone Shadow Fall: Creating Art Tools For A New Generation Of GamesKillzone Shadow Fall: Creating Art Tools For A New Generation Of Games
Killzone Shadow Fall: Creating Art Tools For A New Generation Of Games
Guerrilla12.5K views
Approaching zero driver overhead by Cass Everitt
Approaching zero driver overheadApproaching zero driver overhead
Approaching zero driver overhead
Cass Everitt370.3K views
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007) by Johan Andersson
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Johan Andersson16.3K views
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio by Owen Wu
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
Owen Wu907 views
Trip down the GPU lane with Machine Learning by Renaldas Zioma
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
Renaldas Zioma13.5K views
Beyond porting by Cass Everitt
Beyond portingBeyond porting
Beyond porting
Cass Everitt69.4K views
Advanced Scenegraph Rendering Pipeline by Narann29
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering Pipeline
Narann291.4K views
Dynamic Resolution and Interlaced Rendering by MartinMueller34
Dynamic Resolution and Interlaced RenderingDynamic Resolution and Interlaced Rendering
Dynamic Resolution and Interlaced Rendering
MartinMueller34109 views
East Coast DevCon 2014: Concurrency & Parallelism in UE4 - Tips for programmi... by Gerke Max Preussner
East Coast DevCon 2014: Concurrency & Parallelism in UE4 - Tips for programmi...East Coast DevCon 2014: Concurrency & Parallelism in UE4 - Tips for programmi...
East Coast DevCon 2014: Concurrency & Parallelism in UE4 - Tips for programmi...
Gerke Max Preussner2.7K views
Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che... by Colin Barré-Brisebois
Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...
Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing... by Johan Andersson
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Johan Andersson19.1K views
Anti-Aliasing Methods in CryENGINE 3 by Tiago Sousa
Anti-Aliasing Methods in CryENGINE 3Anti-Aliasing Methods in CryENGINE 3
Anti-Aliasing Methods in CryENGINE 3
Tiago Sousa7K views
Terrain in Battlefield 3: A Modern, Complete and Scalable System by Electronic Arts / DICE
Terrain in Battlefield 3: A Modern, Complete and Scalable SystemTerrain in Battlefield 3: A Modern, Complete and Scalable System
Terrain in Battlefield 3: A Modern, Complete and Scalable System
Electronic Arts / DICE146.9K views

Similar to [Unity Forum 2019] Mobile Graphics Optimization Guides

Droidcon2013 triangles gangolells_imagination by
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon Berlin
1.5K views35 slides
IoTFuse - Machine Learning at the Edge by
IoTFuse - Machine Learning at the EdgeIoTFuse - Machine Learning at the Edge
IoTFuse - Machine Learning at the EdgeAustin Blackstone
375 views21 slides
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese... by
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...Edge AI and Vision Alliance
1.4K views34 slides
[Unite Seoul 2020] Mobile Graphics Best Practices for Artists by
[Unite Seoul 2020] Mobile Graphics Best Practices for Artists[Unite Seoul 2020] Mobile Graphics Best Practices for Artists
[Unite Seoul 2020] Mobile Graphics Best Practices for ArtistsOwen Wu
160 views30 slides
Optimization in Unity: simple tips for developing with "no surprises" / Anton... by
Optimization in Unity: simple tips for developing with "no surprises" / Anton...Optimization in Unity: simple tips for developing with "no surprises" / Anton...
Optimization in Unity: simple tips for developing with "no surprises" / Anton...DevGAMM Conference
606 views62 slides
Hosting AAA Multiplayer Experiences with Multiplay by
Hosting AAA Multiplayer Experiences with MultiplayHosting AAA Multiplayer Experiences with Multiplay
Hosting AAA Multiplayer Experiences with MultiplayUnity Technologies
2.4K views31 slides

Similar to [Unity Forum 2019] Mobile Graphics Optimization Guides(20)

Droidcon2013 triangles gangolells_imagination by Droidcon Berlin
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imagination
Droidcon Berlin1.5K views
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese... by Edge AI and Vision Alliance
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
[Unite Seoul 2020] Mobile Graphics Best Practices for Artists by Owen Wu
[Unite Seoul 2020] Mobile Graphics Best Practices for Artists[Unite Seoul 2020] Mobile Graphics Best Practices for Artists
[Unite Seoul 2020] Mobile Graphics Best Practices for Artists
Owen Wu160 views
Optimization in Unity: simple tips for developing with "no surprises" / Anton... by DevGAMM Conference
Optimization in Unity: simple tips for developing with "no surprises" / Anton...Optimization in Unity: simple tips for developing with "no surprises" / Anton...
Optimization in Unity: simple tips for developing with "no surprises" / Anton...
DevGAMM Conference606 views
Hosting AAA Multiplayer Experiences with Multiplay by Unity Technologies
Hosting AAA Multiplayer Experiences with MultiplayHosting AAA Multiplayer Experiences with Multiplay
Hosting AAA Multiplayer Experiences with Multiplay
Unity Technologies2.4K views
OpenGL ES and Mobile GPU by Jiansong Chen
OpenGL ES and Mobile GPUOpenGL ES and Mobile GPU
OpenGL ES and Mobile GPU
Jiansong Chen968 views
Live Data: For When Data is Greater than Memory by MemVerge
Live Data: For When Data is Greater than MemoryLive Data: For When Data is Greater than Memory
Live Data: For When Data is Greater than Memory
MemVerge40 views
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall... by Amazon Web Services
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
AWS re:Invent 2016: Deep Learning, 3D Content Rendering, and Massively Parall...
Amazon Web Services1.9K views
Smedberg niklas bringing_aaa_graphics by changehee lee
Smedberg niklas bringing_aaa_graphicsSmedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphics
changehee lee966 views
GDG Cloud Southlake #20:Stefano Doni: Kubernetes performance tuning dilemma: ... by James Anderson
GDG Cloud Southlake #20:Stefano Doni: Kubernetes performance tuning dilemma: ...GDG Cloud Southlake #20:Stefano Doni: Kubernetes performance tuning dilemma: ...
GDG Cloud Southlake #20:Stefano Doni: Kubernetes performance tuning dilemma: ...
James Anderson279 views
“Making Edge AI Inference Programming Easier and Flexible,” a Presentation fr... by Edge AI and Vision Alliance
“Making Edge AI Inference Programming Easier and Flexible,” a Presentation fr...“Making Edge AI Inference Programming Easier and Flexible,” a Presentation fr...
“Making Edge AI Inference Programming Easier and Flexible,” a Presentation fr...
Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28 by Amazon Web Services
Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28
Amazon EC2 deepdive and a sprinkel of AWS Compute | AWS Floor28
Amazon Web Services1.2K views

Recently uploaded

Design_Discover_Develop_Campaign.pptx by
Design_Discover_Develop_Campaign.pptxDesign_Discover_Develop_Campaign.pptx
Design_Discover_Develop_Campaign.pptxShivanshSeth6
45 views20 slides
Pitchbook Repowerlab.pdf by
Pitchbook Repowerlab.pdfPitchbook Repowerlab.pdf
Pitchbook Repowerlab.pdfVictoriaGaleano
5 views12 slides
2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx by
2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx
2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptxlwang78
165 views19 slides
REACTJS.pdf by
REACTJS.pdfREACTJS.pdf
REACTJS.pdfArthyR3
35 views16 slides
Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc... by
Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc...Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc...
Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc...csegroupvn
6 views210 slides
fakenews_DBDA_Mar23.pptx by
fakenews_DBDA_Mar23.pptxfakenews_DBDA_Mar23.pptx
fakenews_DBDA_Mar23.pptxdeepmitra8
16 views34 slides

Recently uploaded(20)

Design_Discover_Develop_Campaign.pptx by ShivanshSeth6
Design_Discover_Develop_Campaign.pptxDesign_Discover_Develop_Campaign.pptx
Design_Discover_Develop_Campaign.pptx
ShivanshSeth645 views
2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx by lwang78
2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx
2023Dec ASU Wang NETR Group Research Focus and Facility Overview.pptx
lwang78165 views
REACTJS.pdf by ArthyR3
REACTJS.pdfREACTJS.pdf
REACTJS.pdf
ArthyR335 views
Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc... by csegroupvn
Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc...Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc...
Design of Structures and Foundations for Vibrating Machines, Arya-ONeill-Pinc...
csegroupvn6 views
fakenews_DBDA_Mar23.pptx by deepmitra8
fakenews_DBDA_Mar23.pptxfakenews_DBDA_Mar23.pptx
fakenews_DBDA_Mar23.pptx
deepmitra816 views
Searching in Data Structure by raghavbirla63
Searching in Data StructureSearching in Data Structure
Searching in Data Structure
raghavbirla6314 views
Update 42 models(Diode/General ) in SPICE PARK(DEC2023) by Tsuyoshi Horigome
Update 42 models(Diode/General ) in SPICE PARK(DEC2023)Update 42 models(Diode/General ) in SPICE PARK(DEC2023)
Update 42 models(Diode/General ) in SPICE PARK(DEC2023)
_MAKRIADI-FOTEINI_diploma thesis.pptx by fotinimakriadi
_MAKRIADI-FOTEINI_diploma thesis.pptx_MAKRIADI-FOTEINI_diploma thesis.pptx
_MAKRIADI-FOTEINI_diploma thesis.pptx
fotinimakriadi10 views
Design of machine elements-UNIT 3.pptx by gopinathcreddy
Design of machine elements-UNIT 3.pptxDesign of machine elements-UNIT 3.pptx
Design of machine elements-UNIT 3.pptx
gopinathcreddy34 views
SUMIT SQL PROJECT SUPERSTORE 1.pptx by Sumit Jadhav
SUMIT SQL PROJECT SUPERSTORE 1.pptxSUMIT SQL PROJECT SUPERSTORE 1.pptx
SUMIT SQL PROJECT SUPERSTORE 1.pptx
Sumit Jadhav 22 views
BCIC - Manufacturing Conclave - Technology-Driven Manufacturing for Growth by Innomantra
BCIC - Manufacturing Conclave -  Technology-Driven Manufacturing for GrowthBCIC - Manufacturing Conclave -  Technology-Driven Manufacturing for Growth
BCIC - Manufacturing Conclave - Technology-Driven Manufacturing for Growth
Innomantra 10 views
MongoDB.pdf by ArthyR3
MongoDB.pdfMongoDB.pdf
MongoDB.pdf
ArthyR349 views
ASSIGNMENTS ON FUZZY LOGIC IN TRAFFIC FLOW.pdf by AlhamduKure
ASSIGNMENTS ON FUZZY LOGIC IN TRAFFIC FLOW.pdfASSIGNMENTS ON FUZZY LOGIC IN TRAFFIC FLOW.pdf
ASSIGNMENTS ON FUZZY LOGIC IN TRAFFIC FLOW.pdf
AlhamduKure6 views

[Unity Forum 2019] Mobile Graphics Optimization Guides

  • 1. Confidential © 2018 Arm Limited Owen Wu (owen.wu@arm.com) Developer Relations Engineer Mobile Graphics Optimization Guides
  • 2. Confidential © 2018 Arm Limited Who We Are Arm Developer Relations Introduction
  • 3. 3 Confidential © 2018 Arm Limited Who We Are • Arm provides CPU & GPU IPs to Silicon Vendor • We don’t make chips • Clients including Apple, Samsung, Nintendo, Qualcomm, MediaTek, etc. • Softbank acquired us in 2016 • >95% market share of smartphone CPU • >1 Billion GPU shipped in 2018 • We help developers to make better game
  • 4. 4 Confidential © 2018 Arm Limited What We Can Help • Developer education • Game issue investigation • Game performance analysis • Game performance optimization • Deep collaboration for Mali GPU • Game promotion on Arm/Partner global events • Development devices (in the future)
  • 5. Confidential © 2018 Arm Limited Optimization Guides
  • 6. 6 Confidential © 2018 Arm Limited Why Optimize with reliable performance, smooth gameplay, and long battery life Retain users by ensuring good user experience on a wide range of consumer devices Widen market by maximizing rendering effectiveness inside a 2.5 Watt system-wide power budget Enhance visuals
  • 7. Best Practices of Optimization
    • Write code with optimization in mind
      • Hardware knowledge
    • Check for performance regressions every day
      • Arm Performance Advisor
    • Don't do deep optimization too early
      • Arm Mobile Studio
  • 8. How To Do Optimization
    • Know the hardware
    • Gather reliable and stable data
    • Identify the bottleneck first
    • Find out why the bottleneck happens
    • Figure out the solution
    • Verify the solution
    • Optimize one thing at a time
  • 9. Basic Hardware Concepts
  • 10. Pipeline Stages
    [Diagram: Application → Graphics Driver → Vertex Shader → Primitive Assembler → Rasterizer → Early Frag Op → Fragment Shader → Late Frag Op → Blending; outputs: Color Output, Depth Output, Stencil Output]
  • 11. GPU Is Multi-threaded
    • Traditional APIs are single-threaded
    • The driver hides the complexity
    • The CPU and GPU run in parallel
    • All tasks on the GPU run in parallel too
    • Optimization goal: keep the GPU as busy as possible
  • 12. Desktop GPU – Immediate Rendering Mode
    [Diagram: Application draw calls → GPU (Vertex Shader → Fragment Shader) → Frame Buffer 0 / Frame Buffer 1]
  • 13. Mobile GPU – Tiled Rendering Mode
    [Diagram: Application draw calls → GPU (Vertex Shader → Fragment Shader) → Frame Buffer 0 / Frame Buffer 1]
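To make the tiled flow concrete, here is a minimal Python sketch of the binning step a tile-based GPU performs before fragment shading: each triangle is assigned to every screen tile its bounding box touches, so each tile can later be shaded entirely in on-chip tile memory and written out once. The helper `bin_triangles` is hypothetical, the 16x16 tile size is an assumption (a common choice on Mali GPUs), and real hardware bins more precisely than a bounding box.

```python
import math
from collections import defaultdict


def bin_triangles(triangles, width, height, tile=16):
    """Conservatively bin triangles into screen tiles by bounding box.

    Hypothetical helper; tile=16 is an assumed tile size.
    Returns {(tile_x, tile_y): [triangle indices]}.
    """
    tiles_x = math.ceil(width / tile)
    tiles_y = math.ceil(height / tile)
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Bounding box in tile units, clamped to the screen.
        x0 = max(0, math.floor(min(xs)) // tile)
        x1 = min(tiles_x - 1, math.floor(max(xs)) // tile)
        y0 = max(0, math.floor(min(ys)) // tile)
        y1 = min(tiles_y - 1, math.floor(max(ys)) // tile)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


# A small triangle and a large one on a 64x32 screen (4x2 tiles).
tris = [
    [(1, 1), (10, 1), (1, 10)],
    [(0, 0), (63, 0), (0, 31)],
]
bins = bin_triangles(tris, 64, 32)
for key in sorted(bins):
    print(key, bins[key])
```

Bounding-box binning is conservative: in the example the large triangle is listed in tile (3, 1) even though it never reaches that tile. That costs some wasted work but never a missing pixel.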
  • 14. External Memory / Memory Bandwidth
    • Mobile external memory bandwidth is much lower than desktop's
    • Reading from or writing to external memory consumes a lot of power
    • Tile-based rendering reduces the memory bandwidth requirement
    • GPU computing power advances faster than memory bandwidth
    • Staying within about 2 GB/s is usually recommended
    [Diagram: GPU ↔ Tile Memory (fast); External Memory (slow path)]
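A rough back-of-the-envelope comparison shows why the tiled approach matters for the bandwidth budget on this slide. The sketch is a deliberately simplified model (an assumption, not a Mali measurement): it ignores caches, framebuffer compression and texture reads, assumes 4-byte color and 4-byte depth, and assumes that an immediate-mode GPU reads and writes depth in external memory for every shaded fragment while a tiled GPU writes only the final color once.

```python
# Simplified, hypothetical bandwidth model: no caches, no framebuffer
# compression, no texture traffic; RGBA8 color and 32-bit depth assumed.
def imr_bytes_per_frame(width, height, overdraw, color_bytes=4, depth_bytes=4):
    """Immediate mode: every shaded fragment writes color and reads + writes depth."""
    fragments = width * height * overdraw
    return fragments * (color_bytes + 2 * depth_bytes)


def tbr_bytes_per_frame(width, height, color_bytes=4):
    """Tiled mode: overdraw and depth testing stay on chip; final color is written once."""
    return width * height * color_bytes


fps = 60
imr = imr_bytes_per_frame(1920, 1080, overdraw=2) * fps
tbr = tbr_bytes_per_frame(1920, 1080) * fps
print(f"IMR: {imr / 1e9:.2f} GB/s, TBR: {tbr / 1e9:.2f} GB/s")  # IMR: 2.99 GB/s, TBR: 0.50 GB/s
```

In this crude model, 1080p at 60 FPS with an average overdraw of 2 already puts the immediate-mode figure above the ~2 GB/s guideline, while the tiled figure uses about a quarter of it.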
  • 15. Cache
    • A cache keeps data in fast memory
    • Cache size is small
    • Keep data small so the cache can hold more of it
    • Sequential access increases the cache hit rate
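A toy cache model illustrates both points on this slide. It assumes a direct-mapped cache with 64-byte lines and 256 lines (16 KiB); real GPU caches differ in size and associativity, so only the trend matters. Reading elements in order hits in the line that the previous miss fetched, and halving the element size doubles the number of elements per line.

```python
# Toy direct-mapped cache (assumed parameters: 64-byte lines, 256 lines).
def hit_rate(addresses, line_size=64, num_lines=256):
    tags = [None] * num_lines
    hits = 0
    for addr in addresses:
        line = addr // line_size
        slot = line % num_lines
        if tags[slot] == line:
            hits += 1
        else:
            tags[slot] = line  # miss: fetch the whole line into the cache
    return hits / len(addresses)


n = 4096
sequential_fp32 = [i * 4 for i in range(n)]  # 4-byte elements, read in order
sequential_fp16 = [i * 2 for i in range(n)]  # same count, half the size
strided = [i * 64 for i in range(n)]         # every access touches a new line

print(hit_rate(sequential_fp32), hit_rate(sequential_fp16), hit_rate(strided))
# 0.9375 0.96875 0.0
```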
  • 16. Pixel / Fragment / Texel
    • A pixel is a single element of the frame buffer
    • A fragment is a GPU thread that outputs a pixel
    • A texel is a single element of a texture
  • 17. Texture Compression
    • ASTC can give better quality than ETC at the same memory size
    • Or the same quality with less memory than ETC
    • ASTC takes longer to encode than ETC and can lengthen the game packaging process, so it is better to use it for the final packaging of the game
    • ASTC offers more control over quality by letting you set the block size. There is no single best default block size, but 5x5 or 6x6 is generally a good default
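The block-size advice follows from the fact that every ASTC block occupies 128 bits regardless of its footprint, so larger blocks mean fewer bits per pixel. ETC2 uses 4x4 blocks of 64 bits (RGB) or 128 bits (RGBA). The helpers below are illustrative, not part of any encoder API, and compute size only; quality still has to be judged visually.

```python
import math

ASTC_BLOCK_BYTES = 16  # every ASTC block is 128 bits, whatever its footprint


def astc_bytes(width, height, block_w, block_h):
    """Compressed size of one mip level with the given ASTC block footprint."""
    blocks = math.ceil(width / block_w) * math.ceil(height / block_h)
    return blocks * ASTC_BLOCK_BYTES


def etc2_bytes(width, height, alpha=False):
    """ETC2 uses 4x4 blocks: 8 bytes for RGB, 16 bytes for RGBA (ETC2 + EAC alpha)."""
    blocks = math.ceil(width / 4) * math.ceil(height / 4)
    return blocks * (16 if alpha else 8)


for bw, bh in [(4, 4), (5, 5), (6, 6), (8, 8)]:
    bpp = 128 / (bw * bh)
    print(f"ASTC {bw}x{bh}: {bpp:.2f} bpp, {astc_bytes(1024, 1024, bw, bh)} bytes")
etc2_bpp = etc2_bytes(1024, 1024) * 8 / (1024 * 1024)
print(f"ETC2 RGB: {etc2_bpp:.2f} bpp, {etc2_bytes(1024, 1024)} bytes")
```

For a 1024x1024 texture, ASTC 6x6 (about 3.56 bpp) is slightly smaller than ETC2 RGB (4 bpp), which is the "same quality with less memory" side of the trade-off; ASTC 4x4 (8 bpp) matches ETC2 RGBA in size.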
  • 18. Texture Color Space
    • Use linear color space rendering if you use lighting
    • Check the sRGB setting in the texture Inspector window
    • Textures that are not processed as color (such as metallic, roughness, and normal maps) should NOT be sampled in sRGB color space
    • Current hardware supports sRGB formats and performs gamma correction automatically, at no extra cost
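The sketch below implements the standard sRGB transfer functions (IEC 61966-2-1), which is the decode the hardware applies for free when a texture is marked sRGB. It also shows why data textures must not be marked sRGB: a normal map stores a zero component as 0.5, and an sRGB decode turns that into roughly 0.214 before the shader ever sees it.

```python
def srgb_to_linear(c):
    """Standard sRGB decode for one channel value in [0, 1]."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(l):
    """Inverse of srgb_to_linear (sRGB encode)."""
    if l <= 0.0031308:
        return l * 12.92
    return 1.055 * l ** (1 / 2.4) - 0.055


# A normal map stores "zero" as 0.5. If the texture is wrongly flagged as
# sRGB, the sampler decodes it before the shader reads it:
stored = 0.5
print(f"{srgb_to_linear(stored):.3f}")  # 0.214, no longer the midpoint
```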
  • 19. Texture Filtering
    • Trilinear: like bilinear, but also blends between mipmap levels
    • Don't use trilinear without mipmaps
    • Trilinear filtering removes the noticeable change between mipmap levels by adding a smooth transition
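Trilinear filtering picks a level of detail from how many texels map onto one screen pixel (λ = log2 ρ in the OpenGL specification), then blends the bilinear results of the two nearest mip levels. A minimal sketch of that selection, with `trilinear_levels` as a hypothetical helper and ρ taken as a given input:

```python
import math


def trilinear_levels(rho, num_levels):
    """Return (lower_level, upper_level, upper_weight) for trilinear filtering.

    rho: texels covered per screen pixel (texture-space scale factor).
    num_levels: mip levels in the texture (1 means no mipmaps).
    """
    lod = math.log2(rho) if rho > 1.0 else 0.0  # magnification samples level 0
    lod = min(lod, float(num_levels - 1))       # clamp to the smallest level
    lower = math.floor(lod)
    upper = min(lower + 1, num_levels - 1)
    return lower, upper, lod - lower            # result = (1 - w) * lower + w * upper


print(trilinear_levels(3.0, 11))  # blends levels 1 and 2
print(trilinear_levels(3.0, 1))   # no mipmaps: (0, 0, 0.0)
```

With a single level both indices are 0, so trilinear filtering has nothing to blend, which is why it only makes sense on mipmapped textures.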
  • 20. Texture Filtering
    • Anisotropic: makes textures look better when viewed at oblique angles, which is good for ground-level textures
    • Higher anisotropic levels cost more
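Anisotropic filtering takes several probes along the longer axis of the pixel's footprint in texture space instead of blurring the whole footprint with a coarser mip level. The sketch follows the approximation suggested in the EXT_texture_filter_anisotropic specification; `anisotropic_probes` is a hypothetical helper, and actual GPUs may implement the details differently.

```python
import math


def anisotropic_probes(dudx, dvdx, dudy, dvdy, max_aniso):
    """Approximate probe count N and LOD for anisotropic filtering.

    Derivatives are in texels per pixel. Follows the approximation suggested
    by EXT_texture_filter_anisotropic; real hardware may differ.
    """
    px = math.hypot(dudx, dvdx)  # footprint length along screen x
    py = math.hypot(dudy, dvdy)  # footprint length along screen y
    p_max, p_min = max(px, py), min(px, py)
    if p_max == 0.0:
        return 1, 0.0
    if p_min == 0.0:
        n = max_aniso
    else:
        n = min(math.ceil(p_max / p_min), max_aniso)
    lod = max(0.0, math.log2(p_max / n))  # each probe covers p_max / n texels
    return n, lod


# Ground texture at a grazing angle: 1 texel wide, 8 texels deep per pixel.
for level in (1, 2, 16):
    print(level, anisotropic_probes(1, 0, 0, 8, level))
```

Without anisotropy the LOD is 3 (blurry); 2x already lowers it to 2, while 16x needs 8 probes. Each probe is a separate filtered lookup, which is why higher levels cost more bandwidth.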
  • 21. Texture Filtering
    • Use bilinear for a balance between performance and visual quality
    • Trilinear costs more memory bandwidth than bilinear and should be used selectively
    • Bilinear + 2x anisotropic usually looks and performs better than trilinear + 1x anisotropic, so this combination can be a better solution than trilinear
    • Keep the anisotropic level low. Use levels higher than 2 very selectively, for critical game assets only
    • Higher anisotropic levels cost a lot more bandwidth and affect device battery life
  • 22. Best Optimization Practices for Mali GPUs
  • 23. Reduce Render State Switches
    • A render state switch is a very expensive operation
    • Render as many primitives as possible in one draw call
    • Don't just check the number of draw calls or batches
      • The number of render state switches is also an important metric
      • Tris/SetPass (e.g. 95.2K/34) is a more accurate measure
    • Batch as many draw calls as possible
      • Static batching
      • GPU instancing
      • Dynamic batching
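Sorting draws so that those sharing a render state are submitted together reduces state switches without changing the draw-call count. A minimal sketch, assuming a hypothetical draw list of (material, mesh) pairs where the material stands in for the render state:

```python
def count_state_switches(draws, state_key):
    """Count how many times the render state changes across a draw list."""
    switches = 0
    previous = object()  # sentinel: the first draw always binds its state
    for draw in draws:
        state = state_key(draw)
        if state != previous:
            switches += 1
            previous = state
    return switches


# Hypothetical draw list in scene order.
draws = [("rock_mat", "rock1"), ("tree_mat", "tree1"), ("rock_mat", "rock2"),
         ("tree_mat", "tree2"), ("rock_mat", "rock3")]


def by_material(draw):
    return draw[0]


print(count_state_switches(draws, by_material))                            # 5
print(count_state_switches(sorted(draws, key=by_material), by_material))  # 2
```

Sorting purely by state can conflict with front-to-back ordering for Early-Z; a common compromise is to sort opaque draws by state first and by depth within each state.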
  • 24. Reduce Frame Buffer Switches
    • Bind each frame buffer object only once
    • Make all required draw calls before switching to the next
    • Avoid unnecessary context switches
    • Use the Unity Frame Debugger to check
    • Use Arm Mobile Studio for API-level checks
  • 25. Clear Frame Buffer Before Rendering
    • Before rendering, the GPU reads the frame buffer from external memory into tile memory (unless the attachment is cleared or invalidated)
    • Minimize start-of-tile loads
    • The tile memory can be cheaply initialized to a clear color value
    • Ensure that you clear or invalidate all of your attachments at the start of each render pass
    • Use the Unity Frame Debugger to check
    • Use Arm Mobile Studio for API-level checks
    [Screenshot caption: No clear before switching render target. Bad for performance.]
  • 26. Reduce Frame Buffer Writes
    • After rendering, the GPU writes the result from tile memory to external memory
    • Minimize end-of-tile stores
    • Avoid writing back to external memory whenever possible
    • Disable writing to the depth/stencil buffer if the depth/stencil values are not used later
    • Use the Unity Frame Debugger to check
    • Use Arm Mobile Studio for API-level checks
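Slides 25 and 26 describe the two ends of a render pass on a tiled GPU: a start-of-tile load unless the attachment is cleared or invalidated, and an end-of-tile store unless it is discarded. The hypothetical model below counts that external traffic per pass. The op names mirror the idea behind Vulkan's VK_ATTACHMENT_LOAD_OP_* / VK_ATTACHMENT_STORE_OP_* values and Unity's RenderBufferLoadAction / RenderBufferStoreAction, but the function itself is only an illustration.

```python
# Hypothetical model of per-render-pass external memory traffic on a tiled GPU.
LOAD_OPS = {"load", "clear", "dont_care"}
STORE_OPS = {"store", "dont_care"}


def pass_traffic_bytes(width, height, attachments):
    """attachments: list of (bytes_per_pixel, load_op, store_op)."""
    pixels = width * height
    total = 0
    for bytes_per_pixel, load_op, store_op in attachments:
        if load_op not in LOAD_OPS or store_op not in STORE_OPS:
            raise ValueError(f"unknown op: {load_op}/{store_op}")
        if load_op == "load":    # start-of-tile read from external memory
            total += pixels * bytes_per_pixel
        if store_op == "store":  # end-of-tile write to external memory
            total += pixels * bytes_per_pixel
    return total


# 1080p pass with RGBA8 color and packed 32-bit depth/stencil (4 bytes each).
careless = [(4, "load", "store"), (4, "load", "store")]
careful = [(4, "clear", "store"), (4, "clear", "dont_care")]
print(pass_traffic_bytes(1920, 1080, careless))  # 33177600
print(pass_traffic_bytes(1920, 1080, careful))   # 8294400
```

Clearing both attachments and not storing depth/stencil cuts this pass from about 33 MB to about 8 MB of external traffic per frame.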
  • 27. Avoid Rendering Small Triangles
    • The bandwidth and processing cost of a vertex is typically orders of magnitude higher than the cost of processing a fragment
    • Make sure that you get many pixels' worth of fragment work for each primitive
    • Make sure each model creates at least 10-20 fragments per primitive
    • Use dynamic mesh level of detail (LOD), using simpler meshes when objects are farther away from the camera
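One way to act on the 10-20 fragments-per-primitive guideline is to choose the LOD mesh from the object's estimated screen coverage. The sketch assumes the covered pixel count is already known (for example from a projected bounding sphere) and uses a hypothetical LOD chain; `pick_lod` is not a Unity API, although Unity's LODGroup component follows a similar screen-size idea.

```python
def pick_lod(covered_pixels, lod_triangle_counts, min_fragments_per_triangle=10):
    """Return the most detailed LOD index that still gives enough fragments per triangle.

    covered_pixels: estimated screen pixels covered by the object (assumed input).
    lod_triangle_counts: triangle counts ordered from most to least detailed.
    """
    for index, triangles in enumerate(lod_triangle_counts):
        if covered_pixels / triangles >= min_fragments_per_triangle:
            return index
    return len(lod_triangle_counts) - 1  # even the simplest mesh is too dense


lods = [20000, 5000, 1000, 200]  # hypothetical LOD chain
for pixels in (500_000, 60_000, 5_000):
    print(pixels, pick_lod(pixels, lods))
```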
  • 28. Take Advantage of Early-Z
    • Many fragments are occluded by other fragments
    • Running the fragment shader for an occluded fragment wastes GPU power
    • Render opaque objects from front to back
    • An occluded fragment will be rejected early if:
      • The fragment shader doesn't use discard
      • The fragment shader doesn't write to the depth buffer
      • Alpha-to-coverage is OFF
    • Otherwise the fragment takes the Late-Z path, which rejects occluded fragments only after the fragment shader has run
    [Diagram: Early Frag Op → Fragment Shader → Late Frag Op]
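A single-pixel simulation shows why front-to-back order matters. The sketch assumes every draw is Early-Z eligible (the three conditions above) and counts how many times the fragment shader would run; it ignores Mali's Forward Pixel Kill, which can remove some of this overdraw on its own.

```python
def shaded_fragments(depths, far=1.0):
    """Count fragment-shader invocations for one pixel with early depth testing.

    depths: depth of each opaque surface covering the pixel, in draw order
    (smaller = closer). Assumes every draw is Early-Z eligible.
    """
    depth_buffer = far
    shaded = 0
    for z in depths:
        if z < depth_buffer:  # passes the early depth test, so the shader runs
            shaded += 1
            depth_buffer = z
        # otherwise the fragment is rejected before the fragment shader
    return shaded


surfaces = [0.5, 0.8, 0.2]                                # scene order
print(shaded_fragments(sorted(surfaces, reverse=True)))  # back to front: 3
print(shaded_fragments(surfaces))                        # unsorted: 2
print(shaded_fragments(sorted(surfaces)))                # front to back: 1
```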
  • 29. Always Use Mipmaps If the Camera Is Not Still
    • Mipmapping improves GPU performance
      • Fewer cache misses
    • Mipmapping also reduces texture aliasing and improves final image quality
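Mipmaps are not free in memory: a full chain adds about one third on top of level 0, because each level has a quarter of the texels of the one above. The helper below (illustrative only) computes the exact total; that extra memory is the price paid for the cache and aliasing benefits on this slide.

```python
def mip_chain_bytes(width, height, bytes_per_texel):
    """Total size of a full mip chain, halving each dimension down to 1x1."""
    total = 0
    while True:
        total += width * height * bytes_per_texel
        if width == 1 and height == 1:
            return total
        width, height = max(1, width // 2), max(1, height // 2)


base = 1024 * 1024 * 4                  # RGBA8, level 0 only
full = mip_chain_bytes(1024, 1024, 4)
print(full, round(full / base, 3))      # 5592404 1.333
```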
  • 30. Don't Use a Pre-Z Pass
    • With a Z-prepass, the opaque geometry is drawn twice: first as a depth-only update, then as a color render that uses an EQUAL depth test to reduce redundant fragment processing
    • Mali GPUs already include optimizations such as Forward Pixel Kill (FPK) that reduce redundant fragment processing automatically
    • The cost of the additional draw calls, vertex shading, and memory bandwidth nearly always outweighs the benefits of a Z-prepass
  • 31. Shader Floating-Point Precision
    • Use the mediump and highp keywords
    • Full FP32 precision is unnecessary for many uses of vertex attribute data
    • Keep the data at the minimum precision needed to produce an acceptable final output
    • Use FP32 only for computing vertex positions
    • Use the lowest possible precision for other attributes
    • Don't use FP32 for everything
    • Don't upload FP32 data into a buffer and then read it as a mediump attribute
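Both the saving and the risk are easy to see with Python's `struct` module, which supports IEEE 754 half precision through the `e` format. The vertex layouts below are hypothetical, and real vertex formats may add padding for alignment.

```python
import struct

# Hypothetical packed vertex layouts ('<' = no padding, 'f' = FP32, 'e' = FP16).
FULL_FP32 = "<3f3f2f"  # position, normal, UV all FP32
MIXED = "<3f3e2e"      # FP32 position, FP16 normal and UV

print(struct.calcsize(FULL_FP32), struct.calcsize(MIXED))  # 32 22


def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(to_fp16(0.123456))  # UV: error stays below 1e-4
print(to_fp16(1000.1))    # position: snaps to 1000.0 (FP16 step is 0.5 here)
```

FP16 keeps a UV coordinate within about 3e-5, but at a coordinate of 1000 its step size is 0.5, which is why positions stay FP32.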
  • 32. Arm Mobile Studio Introduction
  • 33. What Is in the Box?
    • Streamline
    • Graphics Analyzer
    • Mali Offline Compiler (separate download)
    • Performance Advisor (closed beta)
    • Download Arm Mobile Studio: http://developer.arm.com/mobile-studio
  • 34. Streamline Performance Analyzer
    • Mali GPU support
      • Analyze and optimize Mali™ GPU graphics and compute workloads
      • Accelerate your workflow using built-in analysis templates
    • Optimize for energy
      • Move beyond simple frame time and FPS tracking
      • Monitor overall usage of processor cycles and memory bandwidth
    • Speed up your app
      • Find out where the system is spending the most time
      • Tune code for cache efficiency
    • Native code profiling
      • Break performance down by function
      • View cost alongside the disassembly listing
    • Arm CPU support
      • Profile 32-bit and 64-bit apps for ARMv7-A and ARMv8-A cores
      • Tune multi-threading for DynamIQ multi-core systems
    • Application event trace
      • Annotate software workloads
      • Define logical event channel structure
      • Trace cross-channel task dependencies
    • Tune your rendering
      • Identify critical-path GPU shader core resources
      • Detect content inefficiency
  • 35. Streamline
  • 36. Triage Nurse Scenarios
    • Vsync bound
    • Fragment bound
    • CPU bound
    • Serialization problems
    • Thermally bound
    [Chart annotation: 16.6 ms]
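Two of these scenarios, vsync bound and serialization problems, follow from simple arithmetic. The sketch assumes double-buffered vsync, where a frame that misses a refresh waits for the next one; `displayed_frame_ms` is hypothetical and does not model thermal throttling.

```python
import math


def displayed_frame_ms(cpu_ms, gpu_ms, refresh_hz=60, serialized=False):
    """Frame interval under double-buffered vsync (simplified model).

    Pipelined: CPU and GPU work on different frames, so the slower one sets the pace.
    Serialized: each waits for the other, so their times add up.
    The result is rounded up to a whole number of display refreshes.
    """
    period = 1000.0 / refresh_hz
    work = cpu_ms + gpu_ms if serialized else max(cpu_ms, gpu_ms)
    return math.ceil(work / period) * period


print(round(displayed_frame_ms(8, 10), 1))                   # 16.7: fits one refresh
print(round(displayed_frame_ms(8, 10, serialized=True), 1))  # 33.3: misses vsync
print(round(displayed_frame_ms(5, 17), 1))                   # 33.3: GPU just over budget
```

The same 8 ms of CPU work and 10 ms of GPU work fit in one 60 Hz refresh when the two processors overlap, but drop to 30 FPS when they wait for each other.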
  • 37. Graphics Analyzer: GPU API Debugger
    • Shader analysis
      • Capture and view all shaders used
      • Optimize shader performance using the integrated Mali Offline Compiler
    • Cross platform
      • Host support for Windows, macOS, and Linux
      • Target support for any Android GPU
    • Rendering API debug
      • Graphics debug for content developers
      • Support for all versions of OpenGL ES and Vulkan
    • Visual analysis views
      • Native mode
      • Overdraw mode
      • Shader map mode
      • Fragment count mode
    • State visibility
      • Show API state after every API call
      • Trace backwards from the point of use to the API call responsible for setting the state
    • Android utility app
      • Manage on-device connection
      • Select and launch user application
    • Frame analysis
      • Diagnose root causes of rendering errors
      • Identify sources of rendering inefficiency
  • 38. [Screenshot labels: Trace outline, Frame capture, Vertex data, API calls, Statistics, Target state, Shaders, Textures, Buffers, Uniforms, …]
  • 39. Mali Offline Compiler: Shader Static Analysis
    • Rapid iteration
      • Verify the impact of shader changes without needing a whole application rebuild
    • Profile for any Mali GPU
      • Cost shader code for every Mali GPU without needing hardware
    • Mali GPU aware
      • Support for all actively shipping Mali GPUs
      • Cycle counts reflect the specific microarchitecture
    • Critical path analysis
      • Identify the dominant shader resource
      • Target this for optimization!
    • Control flow aware
      • Best case control flow
      • Worst case control flow
    • Syntax verification
      • Verify correctness of code changes
      • Receive clear error diagnostics for invalid shaders
    • Register usage
      • Work registers
      • Uniform registers
      • Stack spilling
  • 40. Mali Offline Compiler
    • Use Graphics Analyzer (GA) to capture the API calls and shaders and understand the application's behavior
    • Use the Mali Offline Compiler to profile instruction counts for the arithmetic (A), load/store (L/S), and texture (T) pipes
    • If a shader needs more registers than are available, the GPU has to perform register spilling
    • Register spilling causes big inefficiencies and higher load/store utilization

        Mali_Offline_Compiler_v4.3.0$ ./malisc --core Mali-T600 --revision 0p0_15dev0 --driver Mali-T600_r4p0-00rel0 --vertex shader-176.vert -V
        ARM Mali Offline Shader Compiler v4.3.0
        (C) Copyright 2007-2014 ARM Limited.
        All rights reserved.
        Compilation successful.
        3 work registers used, 16 uniform registers used, spilling not used.
                           A      L/S    T      Total   Bound
        Cycles:            9      5      0      14      A
        Shortest Path:     4.5    5      0      9.5     L/S
        Longest Path:      4.5    5      0      9.5     L/S
        Note: The cycles counts do not include possible stalls due to cache misses.
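When many captured shaders need checking, each report can be scanned for the dominant pipe. This hypothetical helper parses the row layout of the malisc v4.3.0 report on slide 40 (the sample values are copied from it); other compiler releases may format their reports differently, so treat the parsing as an assumption.

```python
# Sample rows with the same values as the malisc v4.3.0 report on slide 40.
SAMPLE = """\
3 work registers used, 16 uniform registers used, spilling not used.
                   A      L/S    T      Total   Bound
Cycles:            9      5      0      14      A
Shortest Path:     4.5    5      0      9.5     L/S
Longest Path:      4.5    5      0      9.5     L/S
"""


def dominant_unit(report, row="Cycles"):
    """Return (unit, costs) for the pipe with the most cycles in one report row."""
    for line in report.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == row:
            a, ls, t = (float(v) for v in rest.split()[:3])
            costs = {"A": a, "L/S": ls, "T": t}
            return max(costs, key=costs.get), costs
    raise ValueError(f"row {row!r} not found")


print(dominant_unit(SAMPLE))                     # ('A', {'A': 9.0, 'L/S': 5.0, 'T': 0.0})
print(dominant_unit(SAMPLE, "Longest Path")[0])  # L/S
```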
  • 41. Performance Advisor: Analysis Reports
    • Fast workflow
      • Integrate data capture and analysis into the nightly CI test workflow
      • Read results over a nice cup of tea
    • Caveats …
      • Still under development
      • Currently in closed beta
    • Region views
      • Split by dynamic behavior
      • Split by application annotation
    • Executive dashboard
      • Show high-level status summary
      • Show status breakdown by regions of interest
    • Overview charts
      • See performance trends over time
      • See region splits
    • Summary reports
      • Easy-to-use performance status reports
      • Integrated initial root cause analysis
  • 42. Performance Advisor
    • An automated performance triage nurse
    • Move beyond simple FPS-based regression tracking
    • Perform an automated first-pass analysis
    • Generate an easy-to-read performance report
    • Route common issues directly to the team for review
    • Free up performance experts to focus on the difficult problems
    • Integrate into nightly continuous integration
    • Catch major issues early
    • Detect gradual regressions before they start impacting users
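A minimal sketch of the "gradual regression" idea: compare recent nightly frame times against a fixed early baseline rather than only against the previous night, so that many small slowdowns still add up to an alert. The function, thresholds, and data are hypothetical and do not describe how Performance Advisor works internally.

```python
from statistics import median


def is_regression(history_ms, baseline_runs=5, recent_runs=3, tolerance=0.05):
    """Flag a regression when recent frame times exceed the baseline by more than tolerance.

    history_ms: median frame time of each nightly run, oldest first.
    A fixed early baseline lets small nightly slowdowns accumulate until
    they cross the threshold.
    """
    baseline = median(history_ms[:baseline_runs])
    recent = median(history_ms[-recent_runs:])
    return recent > baseline * (1 + tolerance)


stable = [16.0, 16.1, 15.9, 16.0, 16.0, 16.1, 16.0, 15.9]
creeping = [16.0, 16.1, 15.9, 16.0, 16.0, 16.3, 16.6, 16.9, 17.2, 17.5]
print(is_regression(stable), is_regression(creeping))  # False True
```

Each night in `creeping` is only about 2% slower than the one before, so a night-over-night check with the same 5% tolerance would never fire.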
  • 43. Region-by-region analysis
  • 45. The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. www.arm.com/company/policies/trademarks

Editor's Notes

  1. Mobile Studio consists of four component tools, although at the moment only two are in the public tool bundle: Streamline, a system profiler for CPU and GPU performance, and Graphics Analyzer, an API debugger for the OpenGL ES and Vulkan rendering APIs. In addition, we have the Mali Offline Compiler, a syntax checker and static analysis tool for GPU shader programs, which is currently available as a separate download, and Performance Advisor, a new tool that brings automated performance analysis into a continuous integration workflow. Performance Advisor is still in development in a closed beta, but expect to see it join the Studio release early next year.
  2. This is how the tool looks. All of the views are customizable, so you can show only the data you need for each API. The API trace lists every single call that you make to your chosen API, which can easily run into the millions. Dynamic Help is static analysis: our experts have compiled a list of things to watch out for, and it gives you pointers. The Textures and Shaders views list every asset in your application, and we run the shaders through the offline compiler, which makes them easy to sort. The Frame Outline lets you quickly navigate the whole trace to find your problem area fast.
  3. Each defined region has its own analysis section with advice and links to further actions that can be taken. All of this information is packaged into one report, which can be integrated into CI systems or run manually. This reduces the need for technical experts to spend long amounts of time determining why an application has performance issues. It enables teams to move forward, with deeper knowledge of where the application needs attention, and in turn frees up the individual expert to concentrate on other areas.