Confidential © 2018 Arm Limited
Owen Wu (owen.wu@arm.com)
Developer Relations Engineer
Mobile Graphics Optimization Guides
Who We Are
Arm Developer Relations Introduction
Who We Are
• Arm provides CPU and GPU IP to silicon vendors
• We don’t make chips
• Clients include Apple, Samsung, Nintendo, Qualcomm, MediaTek, etc.
• SoftBank acquired us in 2016
• >95% market share of smartphone CPUs
• >1 billion GPUs shipped in 2018
• We help developers make better games
What We Can Help With
• Developer education
• Game issue investigation
• Game performance analysis
• Game performance optimization
• Deep collaboration for Mali GPU
• Game promotion on Arm/Partner global events
• Development devices (in the future)
Optimization Guides
Why Optimize
• Retain users with reliable performance, smooth gameplay, and long battery life
• Widen market by ensuring good user experience on a wide range of consumer devices
• Enhance visuals by maximizing rendering effectiveness inside a 2.5 Watt system-wide power budget
Best Practices of Optimization
• Write code with optimization in mind
• Hardware knowledge
• Check for performance regressions every day
• Arm Performance Advisor
• Don’t do deep optimization too early
• Arm Mobile Studio
How To Do Optimization
• Knowledge of hardware
• Gather reliable and stable data
• Identify the bottleneck first
• Find out why the bottleneck occurs
• Figure out the solution
• Verify the solution
• Optimize one thing at a time
Basic Hardware Concepts
Pipeline Stages
Application → Graphics Driver → Vertex Shader → Primitive Assembler → Rasterizer → Early Frag Op → Fragment Shader → Late Frag Op → Blending → Color Output / Depth Output / Stencil Output
The GPU Is Multi-threaded
• Traditional APIs are single-threaded
• The driver hides the complexity
• The CPU and GPU run in parallel
• Tasks inside the GPU also run in parallel
• Optimization goal – keep the GPU as busy as possible
Desktop GPU – Immediate Rendering Mode
[Diagram labels: Application Draw Calls; GPU: Vertex Shader, Fragment Shader; Frame Buffer 0, Frame Buffer 1]
Mobile GPU – Tiled Rendering Mode
[Diagram labels: Application Draw Calls; GPU: Vertex Shader, Fragment Shader; Frame Buffer 0, Frame Buffer 1]
External Memory / Memory Bandwidth
• External memory bandwidth on mobile is much smaller than on desktop
• Reading from or writing to external memory consumes a lot of power
• Tile-based rendering reduces the memory bandwidth requirement
• GPU computing power advances faster than memory bandwidth
• A bandwidth budget of about 2 GB/s is usually recommended
[Diagram labels: GPU, Tile Memory, External memory, Slow Path]
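To put the 2 GB/s figure in perspective, here is a rough back-of-the-envelope sketch. The resolution, frame rate, and bytes per pixel are illustrative assumptions, not numbers from the slides; only the 2 GB/s budget comes from the text above.

```python
def framebuffer_traffic_gb_per_s(width, height, bytes_per_pixel, fps, passes=1):
    """External-memory traffic for writing a full-screen buffer `passes` times per frame."""
    bytes_per_frame = width * height * bytes_per_pixel * passes
    return bytes_per_frame * fps / 1e9

# Assumed workload: 1080p, 32-bit color, 60 FPS, one full-screen write per frame.
one_pass = framebuffer_traffic_gb_per_s(1920, 1080, 4, 60)
budget_gb_per_s = 2.0  # figure quoted on the slide

print(f"one full-screen write per frame: {one_pass:.3f} GB/s")
print(f"full-screen writes that fit in the budget: {budget_gb_per_s / one_pass:.1f}")
```

Under these assumptions a single full-screen write costs roughly a quarter of the budget, before any texture, geometry, or read traffic is counted.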
Cache
• Keep data in fast memory
• Cache size is small
• Keep data small so the cache can hold more of it
• Sequential access can increase cache hit rate
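The effect of access order on hit rate can be illustrated with a toy direct-mapped cache model. The 64-byte line size and 256-line capacity are assumptions chosen for illustration, not the parameters of any particular Mali cache.

```python
def hit_rate(addresses, line_size=64, num_lines=256):
    """Simulate a direct-mapped cache and return the fraction of accesses that hit."""
    lines = [None] * num_lines
    hits = 0
    for addr in addresses:
        line_addr = addr // line_size      # which memory line holds this byte
        slot = line_addr % num_lines       # where that line lives in the cache
        if lines[slot] == line_addr:
            hits += 1
        else:
            lines[slot] = line_addr        # miss: fetch the line, evicting the old one
    return hits / len(addresses)

n = 4096
sequential = [i * 4 for i in range(n)]     # 4-byte elements read back to back
strided = [i * 256 for i in range(n)]      # every read jumps 256 bytes ahead

print("sequential:", hit_rate(sequential))
print("strided:   ", hit_rate(strided))
```

In this model the sequential walk reuses each fetched line for 15 further reads, while the strided walk touches a new line on every access.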
Pixel/Fragment/Texel
• A pixel is the data at a single position in the frame buffer
• A fragment is a GPU thread that outputs a pixel
• A texel is the data at a single position in a texture
Texture Compression
• ASTC can give better quality than ETC at the same memory size.
• Or the same quality at a smaller memory size than ETC.
• ASTC takes longer to encode than ETC, which can lengthen the game packaging process. Because of this, it is better to use it for the final packaging of the game.
• ASTC gives more control over quality by letting you set the block size. There is no single best block size, but 5x5 or 6x6 is generally a good default.
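The block-size trade-off is easy to quantify, because every ASTC block occupies 128 bits regardless of its footprint. The sketch below compares a few block sizes against a 4 bits-per-texel ETC RGB texture; the 1024x1024 texture size is an arbitrary example and mipmaps are ignored.

```python
import math

ASTC_BLOCK_BYTES = 16  # every ASTC block is 128 bits, whatever its footprint

def astc_bits_per_texel(block_w, block_h):
    return ASTC_BLOCK_BYTES * 8 / (block_w * block_h)

def astc_size_bytes(width, height, block_w, block_h):
    """Size of one mip level; partial blocks at the edges still cost a full block."""
    blocks = math.ceil(width / block_w) * math.ceil(height / block_h)
    return blocks * ASTC_BLOCK_BYTES

def etc_rgb_size_bytes(width, height):
    return width * height * 4 // 8  # ETC RGB stores 4 bits per texel

for bw, bh in [(4, 4), (5, 5), (6, 6), (8, 8)]:
    size = astc_size_bytes(1024, 1024, bw, bh)
    print(f"ASTC {bw}x{bh}: {astc_bits_per_texel(bw, bh):.2f} bpp, {size} bytes")
print(f"ETC RGB: 4.00 bpp, {etc_rgb_size_bytes(1024, 1024)} bytes")
```

Larger blocks spread the same 128 bits over more texels, so 6x6 lands slightly below ETC RGB in size while 5x5 sits slightly above it.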
Texture Color Space
• Use linear color space rendering if using lighting
• Check sRGB in the texture inspector window
• Textures that are not processed as color (such as metallic, roughness, normal map, etc.) should NOT be marked as sRGB.
• Current hardware supports sRGB formats and performs gamma correction automatically for free
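For reference, the conversion that sRGB-capable hardware applies when sampling can be written out in a few lines. This is the standard sRGB transfer function, shown as a sketch to explain why non-color data must not be flagged as sRGB; it is not code from the presentation.

```python
def srgb_to_linear(c):
    """Decode one sRGB-encoded channel value in [0, 1] to linear light."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c):
    """Encode one linear channel value in [0, 1] back to sRGB."""
    if c <= 0.0031308:
        return c * 12.92
    return 1.055 * c ** (1 / 2.4) - 0.055

# A roughness value of 0.5 stored in a texture wrongly flagged as sRGB
# would reach the shader as a much smaller number.
print(round(srgb_to_linear(0.5), 3))
```

A mid-gray 0.5 decodes to roughly 0.21, which is the right result for a color but a large error for a roughness, metallic, or normal-map value.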
Texture Filtering
• Trilinear - like bilinear, but with added blur between mipmap levels
• Don’t use trilinear without mipmaps
• This filtering removes the noticeable change between mipmap levels by adding a smooth transition.
Texture Filtering
• Anisotropic - makes textures look better when viewed at an angle, which is good for ground-level textures
• Higher anisotropic levels cost more
Texture Filtering
• Use bilinear for a balance between performance and visual quality
• Trilinear costs more memory bandwidth than bilinear and needs to be used selectively
• Bilinear + 2x Anisotropic will most of the time look and perform better than Trilinear + 1x Anisotropic, so this combination can be a better solution than Trilinear.
• Keep the anisotropic level low. Using a level higher than 2 should be done very selectively, for critical game assets.
• This is because a higher anisotropic level costs a lot more bandwidth and affects device battery life.
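A simple worst-case model helps compare the filter combinations above. It assumes that bilinear reads 4 texels, trilinear reads 8 (4 from each of two mip levels), and that anisotropic filtering takes up to N such samples at level N. This is a simplification for illustration; real hardware may take fewer samples depending on the viewing angle.

```python
BASE_TEXELS = {"bilinear": 4, "trilinear": 8}  # texels read per filtered sample

def worst_case_texel_fetches(base_filter, max_anisotropy=1):
    """Upper bound on texels read for one texture lookup under the model above."""
    return BASE_TEXELS[base_filter] * max_anisotropy

for base, aniso in [("bilinear", 1), ("bilinear", 2), ("trilinear", 1),
                    ("trilinear", 2), ("bilinear", 8)]:
    print(f"{base} + {aniso}x anisotropic: up to "
          f"{worst_case_texel_fetches(base, aniso)} texels")
```

In this model Bilinear + 2x Anisotropic never reads more than Trilinear + 1x Anisotropic, and it only reaches that bound on surfaces that actually need the extra samples, while the cost grows linearly with the anisotropic level.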
Best Optimization Practices of Mali GPU
Reduce Render State Switch
• A render state switch is a very expensive operation
• Render as many primitives as possible in one draw call
• Don’t just check the number of draw calls or batches
• The number of render state switches is also an important index
• Tris/SetPass (e.g. 95.2K/34) is a more accurate measure
• Batch as many draw calls as possible
• Static batch
• GPU Instancing
• Dynamic batch
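The idea behind counting state switches rather than draw calls can be sketched as follows. The `DrawCall` record and the material names are hypothetical; a real engine would sort on a fuller state key (shader, textures, blend mode, and so on).

```python
from collections import namedtuple

# Hypothetical draw-call record: each material implies one render state setup.
DrawCall = namedtuple("DrawCall", ["mesh", "material"])

def count_state_switches(draws):
    """Count how many times the render state changes while submitting `draws` in order."""
    switches = 0
    current = None
    for draw in draws:
        if draw.material != current:
            switches += 1
            current = draw.material
    return switches

draws = [
    DrawCall("rock1", "stone"), DrawCall("tree1", "bark"),
    DrawCall("rock2", "stone"), DrawCall("tree2", "bark"),
    DrawCall("rock3", "stone"),
]
grouped = sorted(draws, key=lambda d: d.material)  # group draws sharing a state

print("submission order:", count_state_switches(draws))
print("grouped by state:", count_state_switches(grouped))
```

Both lists contain the same five draw calls, yet grouping them by state cuts the switches from five to two, which is why the draw-call count alone is not a sufficient metric.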
Reduce Frame Buffer Switch
• Bind each frame buffer object only once
• Make all required draw calls before switching to the next
• Avoid unnecessary context switches
• Use the Unity Frame Debugger to check
• Use Arm Mobile Studio to do an API-level check
Clear Frame Buffer Before Rendering
• Before rendering, the GPU reads the frame buffer from external memory into tile memory
• Minimize start-of-tile loads
• The GPU can cheaply initialize the tile memory to a clear color value
• Ensure that you clear or invalidate all of your attachments at the start of each render pass
• Use the Unity Frame Debugger to check
• Use Arm Mobile Studio to do an API-level check
[Screenshot caption: No clear before switching render target. Bad for performance.]
Reduce Frame Buffer Write
• After rendering, the GPU writes the result from tile memory to external memory
• Minimize end-of-tile stores
• Avoid writing back to external memory whenever possible
• Disable writing to the depth/stencil buffer if the depth/stencil value is not used
• Use the Unity Frame Debugger to check
• Use Arm Mobile Studio to do an API-level check
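The combined effect of start-of-tile loads and end-of-tile stores can be estimated with a small model. The attachment list, the 4 bytes per pixel for both color and depth/stencil, and the 1080p resolution are assumptions made for this sketch.

```python
def pass_traffic_bytes(width, height, attachments):
    """External-memory traffic of one render pass.

    attachments: list of (bytes_per_pixel, load, store) tuples, where `load`
    means the old contents are read into tile memory and `store` means the
    result is written back to external memory.
    """
    total = 0
    for bytes_per_pixel, load, store in attachments:
        size = width * height * bytes_per_pixel
        if load:
            total += size
        if store:
            total += size
    return total

W, H = 1920, 1080
# No clear at the start, everything written back at the end.
naive = pass_traffic_bytes(W, H, [(4, True, True), (4, True, True)])
# Color cleared instead of loaded; depth/stencil cleared and never stored.
tuned = pass_traffic_bytes(W, H, [(4, False, True), (4, False, False)])

print(f"naive: {naive / 1e6:.1f} MB per pass")
print(f"tuned: {tuned / 1e6:.1f} MB per pass")
```

Under these assumptions the tuned pass moves a quarter of the data, because only the final color store still touches external memory.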
Avoid Rendering Small Triangles
• The bandwidth and processing cost of a vertex is typically orders of magnitude higher than the cost of processing a fragment
• Make sure that you get many pixels' worth of fragment work for each primitive
• Make sure each model creates at least 10-20 fragments per primitive
• Use dynamic mesh level-of-detail, using simpler meshes when objects are further away from the camera
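One way to apply the 10-20 fragments-per-primitive guideline is to pick the level of detail from the object's on-screen coverage. The triangle counts and pixel coverages below are made-up example values, and `pick_lod` is a hypothetical helper rather than an engine API.

```python
def fragments_per_primitive(covered_pixels, triangle_count):
    return covered_pixels / triangle_count

def pick_lod(lod_triangle_counts, covered_pixels, min_fragments=10):
    """Return the most detailed LOD that still meets the fragments-per-primitive target.

    lod_triangle_counts must be ordered from most to least detailed.
    Falls back to the coarsest LOD if none meets the target.
    """
    for triangles in lod_triangle_counts:
        if fragments_per_primitive(covered_pixels, triangles) >= min_fragments:
            return triangles
    return lod_triangle_counts[-1]

lods = [20000, 5000, 1000, 200]  # triangle counts of four hypothetical LOD meshes

print(pick_lod(lods, covered_pixels=100_000))  # object fills a large part of the screen
print(pick_lod(lods, covered_pixels=5_000))    # object is far away
```

With 100,000 covered pixels the 20,000-triangle mesh yields only 5 fragments per triangle, so the 5,000-triangle mesh is chosen; at 5,000 pixels only the coarsest mesh meets the target.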
Take Advantage of Early-Z
• Many fragments are occluded by other fragments
• Running the fragment shader for an occluded fragment wastes GPU power
• Render opaque objects from front to back
• An occluded fragment will be rejected early if
• The fragment shader doesn’t use discard
• The fragment shader doesn’t write a value to the depth buffer
• Alpha-to-coverage is OFF
• Otherwise the fragment takes the Late-Z path, which rejects occluded fragments after the fragment shader has run
[Diagram: Early Frag Op → Fragment Shader → Late Frag Op]
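The benefit of front-to-back ordering can be seen in a one-pixel model of the depth test. This sketch counts fragment shader invocations only; it deliberately ignores other hardware optimizations such as FPK, and the depth values are arbitrary.

```python
def shaded_fragments(depths_in_draw_order, early_z=True):
    """Count fragment shader runs for one pixel covered by several opaque layers.

    Smaller depth means closer to the camera. With early_z=False the shader
    runs before the depth test (the Late-Z path), so every fragment is shaded.
    """
    nearest = float("inf")
    shaded = 0
    for depth in depths_in_draw_order:
        visible = depth < nearest
        if visible or not early_z:
            shaded += 1
        if visible:
            nearest = depth
    return shaded

depths = [0.9, 0.2, 0.5]
front_to_back = shaded_fragments(sorted(depths))
back_to_front = shaded_fragments(sorted(depths, reverse=True))
late_z = shaded_fragments(sorted(depths), early_z=False)

print(front_to_back, back_to_front, late_z)
```

Drawing the nearest layer first lets the early test reject the other two, whereas back-to-front order or a shader that forces Late-Z shades all three.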
Always Use Mipmap If Camera Is Not Still
• Using mipmapping will improve GPU performance
• Fewer cache misses
• Mipmapping also reduces texture aliasing and improves final image quality
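The memory cost of a mip chain is modest, which the following sketch makes concrete by summing the texels of every level down to 1x1. The 1024x1024 size is just an example.

```python
def mip_chain_texels(width, height):
    """Total texel count of a full mip chain, halving each dimension down to 1x1."""
    total = 0
    while True:
        total += width * height
        if width == 1 and height == 1:
            break
        width = max(1, width // 2)
        height = max(1, height // 2)
    return total

base = 1024 * 1024
full = mip_chain_texels(1024, 1024)
print(f"overhead of mipmaps: {100 * (full - base) / base:.1f}%")
```

The full chain adds about one third to the texture's size, in exchange for smaller, more cache-friendly reads whenever the texture is minified.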
Don't Use Pre-Z Pass
• In a pre-Z pass, the opaque geometry is drawn twice: first as a depth-only update, and then as a color render that uses an EQUALS depth test to reduce redundant fragment processing
• Mali GPUs already include optimizations such as FPK (Forward Pixel Kill) that reduce redundant fragment processing automatically
• The cost of the additional draw calls, vertex shading, and memory bandwidth nearly always outweighs the benefits of a Z-prepass
Shader Floating-point Precision
• Use mediump and highp keywords
• Full FP32 precision for vertex attributes is unnecessary for many uses of attribute data
• Keep the data at the minimum precision needed to produce an acceptable final output
• Use FP32 for computing vertex positions only
• Use the lowest possible precision for other attributes
• Don’t always use FP32 for everything
• Don’t upload FP32 data into a buffer and then read it as a mediump attribute
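To get a feel for what reduced precision means, the sketch below rounds values to IEEE 754 half precision (FP16) using the standard library. It assumes mediump is implemented as FP16, which is a common choice but not something the GLSL ES specification guarantees.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Small values such as colors or normals survive well.
print(to_fp16(0.5))
# Large values such as world-space positions lose their low bits.
print(to_fp16(1000.3))
print(to_fp16(4097.0))
```

A value like 0.5 is represented exactly, but 1000.3 becomes 1000.5 and 4097 becomes 4096, which is why positions are kept in FP32 while colors and similar attributes can use lower precision.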
Arm Mobile Studio Introduction
What is in the box?
• Streamline
• Graphics Analyzer
• Mali Offline Compiler (separate download)
• Performance Advisor (closed beta)
Download Arm Mobile Studio: http://developer.arm.com/mobile-studio
Streamline
Performance Analyzer
Mali GPU support
• Analyze and optimize Mali™ GPU graphics and compute workloads
• Accelerate your workflow using built-in analysis templates
Optimize for energy
• Move beyond simple frame time and FPS tracking
• Monitor overall usage of processor cycles and memory bandwidth
Speed up your app
• Find out where the system is spending the most time
• Tune code for cache efficiency
Native code profiling
• Break performance down by function
• View cost alongside disassembly listing
Arm CPU support
• Profile 32-bit and 64-bit apps for ARMv7-A and ARMv8-A cores
• Tune multi-threading for DynamIQ multi-core systems
Application event trace
• Annotate software workloads
• Define logical event channel structure
• Trace cross-channel task dependencies
Tune your rendering
• Identify critical-path GPU shader core resources
• Detect content inefficiency
Streamline
Triage nurse scenarios
[Charts: Vsync bound, Fragment bound, CPU bound, Serialization problems, Thermally bound; marker at 16.6 ms]
Graphics Analyzer
GPU API Debugger
Shader analysis
• Capture and view all shaders used
• Optimize shader performance using integrated Mali Offline Compiler
Cross platform
• Host support for Windows, macOS, and Linux
• Target support for any Android GPU
Rendering API debug
• Graphics debug for content developers
• Support for all versions of OpenGL ES and Vulkan
Visual analysis views
• Native mode
• Overdraw mode
• Shader map mode
• Fragment count mode
State visibility
• Show API state after every API call
• Trace backwards from point-of-use to API call responsible for state set
Android utility app
• Manage on-device connection
• Select and launch user application
Frame analysis
• Diagnose root causes of rendering errors
• Identify sources of rendering inefficiency
[Screenshot labels: Trace outline, Frame capture, Vertex data, API calls, Statistics, Target state, Shaders, Textures, Buffers, Uniforms, …]
Mali Offline Compiler
Shader static analysis
Rapid iteration
• Verify impact of shader changes without needing whole application rebuild
Profile for any Mali GPU
• Cost shader code for every Mali GPU without needing hardware
Mali GPU aware
• Support for all actively shipping Mali GPUs
• Cycle counts reflect specific microarchitecture
Control flow aware
• Best case control flow
• Worst case control flow
Syntax verification
• Verify correctness of code changes
• Receive clear error diagnostics for invalid shaders
Critical path analysis
• Identify dominant shader resource
• Target this for optimization!
Register usage
• Work registers
• Uniform registers
• Stack spilling
Mali Offline Compiler
• Use Graphics Analyzer (GA) to capture the API calls and shaders to understand the application behavior.
• Use the Offline Compiler to profile instruction counts for ALU, L/S, and TEX
• If the shader needs more registers than are available, the GPU needs to perform register spilling
• Register spilling causes big inefficiencies and higher Load/Store utilization
Mali_Offline_Compiler_v4.3.0$ ./malisc --core Mali-T600 --revision 0p0_15dev0 --driver Mali-T600_r4p0-00rel0 --vertex shader-176.vert -V
ARM Mali Offline Shader Compiler v4.3.0 (C) Copyright 2007-2014 ARM Limited. All rights reserved.
Compilation successful. 3 work registers used, 16 uniform registers used, spilling not used.
                 A     L/S   T     Total  Bound
Cycles:          9     5     0     14     A
Shortest Path:   4.5   5     0     9.5    L/S
Longest Path:    4.5   5     0     9.5    L/S
Note: The cycles counts do not include possible stalls due to cache misses.
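In the report above, the Bound column appears to name the pipeline with the highest cycle count in each row. The helper below reproduces that reading for the three rows shown; it is an interpretation of the sample output, not part of the Mali Offline Compiler, and `bound_unit` is a hypothetical name.

```python
def bound_unit(arithmetic, load_store, texture):
    """Return the pipeline with the most cycles: the unit the shader is bound by."""
    cycles = {"A": arithmetic, "L/S": load_store, "T": texture}
    return max(cycles, key=cycles.get)

# (A, L/S, T) cycle counts copied from the report above.
report = {
    "Cycles": (9, 5, 0),
    "Shortest Path": (4.5, 5, 0),
    "Longest Path": (4.5, 5, 0),
}

for row, counts in report.items():
    print(f"{row}: total {sum(counts)}, bound by {bound_unit(*counts)}")
```

The bound unit is the one worth optimizing first: here the arithmetic pipeline dominates the total cycle count, while load/store dominates the shortest and longest paths.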
Performance Advisor
Analysis Reports
Fast workflow
• Integrate data capture and analysis into nightly CI test workflow
• Read results over a nice cup of tea
Caveats …
• Still under development
• Currently in closed beta
Region views
• Split by dynamic behavior
• Split by application annotation
Executive dashboard
• Show high level status summary
• Show status breakdown by regions of interest
Overview charts
• See performance trends over time
• See region splits
Summary reports
• Easy-to-use performance status reports
• Integrated initial root cause analysis
Performance Advisor
• An automated performance triage nurse
• Move beyond simple FPS-based regression tracking
• Perform an automated first pass analysis
• Generate easy-to-read performance reports
• Route common issues directly to the team to review
• Free up performance experts to focus on the difficult problems
• Integrate into nightly continuous integration
• Catch major issues early
• Detect gradual regressions before they start impacting users
Region-by-region analysis
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
‫תודה‬
The Arm trademarks featured in this presentation are registered trademarks or
trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights
reserved. All other marks featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks
 
Rums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdfRums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdfsmsksolar
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesMayuraD1
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 

Recently uploaded (20)

Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stage
 
Rums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdfRums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdf
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 

[Unity Forum 2019] Mobile Graphics Optimization Guides

  • 1. Mobile Graphics Optimization Guides. Owen Wu (owen.wu@arm.com), Developer Relations Engineer. Confidential © 2018 Arm Limited
  • 2. Who We Are: Arm Developer Relations Introduction
  • 3. Who We Are • Arm provides CPU and GPU IP to silicon vendors • We don’t make chips • Clients include Apple, Samsung, Nintendo, Qualcomm, MediaTek, etc. • SoftBank acquired us in 2016 • >95% market share of smartphone CPUs • >1 billion GPUs shipped in 2018 • We help developers make better games
  • 4. What We Can Help With • Developer education • Game issue investigation • Game performance analysis • Game performance optimization • Deep collaboration for Mali GPU • Game promotion at Arm/partner global events • Development devices (in the future)
  • 5. Optimization Guides
  • 6. Why Optimize • Retain users with reliable performance, smooth gameplay, and long battery life • Widen market by ensuring good user experience on a wide range of consumer devices • Enhance visuals by maximizing rendering effectiveness inside a 2.5 Watt system-wide power budget
  • 7. Best Practices of Optimization • Write code with optimization in mind • Hardware knowledge • Check for performance regressions every day • Arm Performance Advisor • Don’t do deep optimization too early • Arm Mobile Studio
  • 8. How To Do Optimization • Know the hardware • Gather reliable and stable data • Identify the bottleneck first • Find out why the bottleneck happens • Figure out the solution • Verify the solution • Optimize one thing at a time
  • 9. Basic Hardware Concepts
  • 10. Pipeline Stages: Application, Graphics Driver, Vertex Shader, Primitive Assembler, Rasterizer, Early Frag Op, Fragment Shader, Late Frag Op, Blending; outputs: Color Output, Depth Output, Stencil Output
  • 11. GPU Is Multi-threaded • Traditional APIs are single-threaded • The driver hides the complexity • The CPU and GPU run in parallel • Tasks inside the GPU also run in parallel • Optimization goal: keep the GPU as busy as possible
  • 12. Desktop GPU: Immediate Rendering Mode (diagram labels: Application, Draw Calls, GPU, Vertex Shader, Fragment Shader, Frame Buffer 0, Frame Buffer 1)
  • 13. Mobile GPU: Tiled Rendering Mode (diagram labels: Application, Draw Calls, GPU, Vertex Shader, Fragment Shader, Frame Buffer 0, Frame Buffer 1)
  • 14. External Memory / Memory Bandwidth • External memory bandwidth on mobile is much smaller than on desktop • Reading from or writing to external memory consumes a lot of power • Tile-based rendering reduces the memory bandwidth requirement • GPU computing power advances faster than memory bandwidth • Usually 2 GB/s is recommended (diagram labels: GPU, Tile Memory, External memory, Slow Path)
  • 15. Cache • Keep data in fast memory • Cache size is small • Keep data small so the cache can hold more of it • Sequential access can increase the cache hit rate
  • 16. Pixel/Fragment/Texel • A pixel is the data at a single position in the frame buffer • A fragment is a GPU thread that outputs a pixel • A texel is the data at a single position in a texture
  • 17. Texture Compression • ASTC can give better quality than ETC at the same memory size • Or the same quality at a smaller memory size than ETC • ASTC takes longer to encode than ETC and may lengthen the game packaging process, so it is better to use it for the final packaging of the game • ASTC gives more control over quality through its block size setting. There is no single best block size, but 5x5 or 6x6 is generally a good default.
  • 18. Texture Color Space • Use linear color space rendering if using lighting • Check sRGB in the texture inspector window • Textures that are not processed as color (such as metallic, roughness, normal maps, etc.) should NOT use the sRGB color space • Current hardware supports sRGB formats and does gamma correction automatically for free
  • 19. Texture Filtering • Trilinear: like bilinear, but with added blur between mipmap levels • Don’t use trilinear without mipmaps • This filtering removes the noticeable change between mipmap levels by adding a smooth transition.
  • 20. Texture Filtering • Anisotropic: makes textures look better when viewed at different angles, which is good for ground-level textures • Higher anisotropic levels cost more
  • 21. Texture Filtering • Use bilinear for a balance between performance and visual quality • Trilinear costs more memory bandwidth than bilinear and needs to be used selectively • Bilinear + 2x anisotropic will usually look and perform better than trilinear + 1x anisotropic, so this combination can be a better solution than trilinear • Keep the anisotropic level low. Use a level higher than 2 very selectively, for critical game assets only, because higher anisotropic levels cost a lot more bandwidth and affect device battery life.
  • 22. Best Optimization Practices for Mali GPU
  • 23. Reduce Render State Switches • A render state switch is a very expensive operation • Render as many primitives as possible in one draw call • Don’t just check the number of draw calls or batches • The number of render state switches is also an important index • Using Tris/SetPass (e.g. 95.2K/34) is more accurate • Batch as many draw calls as possible • Static batching • GPU instancing • Dynamic batching
  • 24. Reduce Frame Buffer Switches • Bind each frame buffer object only once • Make all required draw calls before switching to the next • Avoid unnecessary context switches • Use the Unity Frame Debugger to check • Use Arm Mobile Studio to do API-level checks
  • 25. Clear Frame Buffer Before Rendering • Before rendering, the GPU reads the frame buffer from external memory into tile memory • Minimize start-of-tile loads • The GPU can cheaply initialize the tile memory to a clear color value • Ensure that you clear or invalidate all of your attachments at the start of each render pass • Use the Unity Frame Debugger to check • Use Arm Mobile Studio to do API-level checks • Screenshot caption: no clear before switching render target, which is bad for performance.
  • 26. Reduce Frame Buffer Writes • After rendering, the GPU writes the result from tile memory to external memory • Minimize end-of-tile stores • Avoid writing back to external memory whenever possible • Disable writing to the depth/stencil buffer if the depth/stencil value is not used • Use the Unity Frame Debugger to check • Use Arm Mobile Studio to do API-level checks
  • 27. Avoid Rendering Small Triangles • The bandwidth and processing cost of a vertex is typically orders of magnitude higher than the cost of processing a fragment • Make sure that you get many pixels' worth of fragment work for each primitive • Make sure each model creates at least 10-20 fragments per primitive • Use dynamic mesh level-of-detail, with simpler meshes when objects are further away from the camera
  • 28. Take Advantage of Early-Z • Many fragments are occluded by other fragments • Running the fragment shader for an occluded fragment wastes GPU power • Render opaque objects from front to back • An occluded fragment is rejected early if all of the following hold: the fragment shader doesn’t use discard; the fragment shader doesn’t write a value to the depth buffer; alpha-to-coverage is OFF • Otherwise the fragment takes the Late-Z path, which rejects occluded fragments only after the fragment shader has run (diagram labels: Early Frag Op, Fragment Shader, Late Frag Op)
  • 29. Always Use Mipmaps If the Camera Is Not Still • Mipmapping improves GPU performance • Fewer cache misses • Mipmapping also reduces texture aliasing and improves final image quality
  • 30. Don't Use a Pre-Z Pass • In a Z-prepass, the opaque geometry is drawn twice: first as a depth-only update, and then as a color render which uses an EQUALS depth test to reduce redundant fragment processing • Mali GPUs already include optimizations such as Forward Pixel Kill (FPK) that reduce redundant fragment processing automatically • The cost of the additional draw calls, vertex shading, and memory bandwidth nearly always outweighs the benefits of a Z-prepass
  • 31. Shader Floating-point Precision • Use the mediump and highp keywords • Full FP32 vertex attributes are unnecessary for many uses of attribute data • Keep the data at the minimum precision needed to produce an acceptable final output • Use FP32 for computing vertex positions only • Use the lowest possible precision for other attributes • Don’t use FP32 for everything • Don’t upload FP32 data into a buffer and then read it as a mediump attribute
  • 32. Arm Mobile Studio Introduction
  • 33. What is in the box? • Streamline • Graphics Analyzer • Mali Offline Compiler (separate download) • Performance Advisor (closed beta) • Download Arm Mobile Studio: http://developer.arm.com/mobile-studio
  • 34. Streamline Performance Analyzer • Mali GPU support: analyze and optimize Mali™ GPU graphics and compute workloads; accelerate your workflow using built-in analysis templates • Optimize for energy: move beyond simple frame time and FPS tracking; monitor overall usage of processor cycles and memory bandwidth • Speed up your app: find out where the system is spending the most time; tune code for cache efficiency • Application event trace: annotate software workloads; define logical event channel structure; trace cross-channel task dependencies • Native code profiling: break performance down by function; view cost alongside disassembly listing • Arm CPU support: profile 32-bit and 64-bit apps for ARMv7-A and ARMv8-A cores; tune multi-threading for DynamIQ multi-core systems • Tune your rendering: identify critical-path GPU shader core resources; detect content inefficiency
  • 35. Streamline
  • 36. Triage nurse scenarios • Vsync bound • Fragment bound • CPU bound • Serialization problems • Thermally bound (chart label: 16.6 ms)
  • 37. Graphics Analyzer: GPU API Debugger • Shader analysis: capture and view all shaders used; optimize shader performance using the integrated Mali Offline Compiler • Cross platform: host support for Windows, macOS, and Linux; target support for any Android GPU • Rendering API debug: graphics debug for content developers; support for all versions of OpenGL ES and Vulkan • Android utility app: manage on-device connection; select and launch user application • Visual analysis views: native mode; overdraw mode; shader map mode; fragment count mode • State visibility: show API state after every API call; trace backwards from point-of-use to the API call responsible for the state set • Frame analysis: diagnose root causes of rendering errors; identify sources of rendering inefficiency
  • 38. Graphics Analyzer views: Trace outline, Frame capture, Vertex data, API calls, Statistics, Target state, Shaders, Textures, Buffers, Uniforms, …
  • 39. Mali Offline Compiler: Shader static analysis • Rapid iteration: verify the impact of shader changes without needing a whole application rebuild • Profile for any Mali GPU: cost shader code for every Mali GPU without needing hardware • Mali GPU aware: support for all actively shipping Mali GPUs; cycle counts reflect the specific microarchitecture • Critical path analysis: identify the dominant shader resource; target this for optimization! • Control flow aware: best case control flow; worst case control flow • Syntax verification: verify correctness of code changes; receive clear error diagnostics for invalid shaders • Register usage: work registers; uniform registers; stack spilling
  • 40. Mali Offline Compiler • Use Graphics Analyzer (GA) to capture the API calls and shaders to understand the application's behavior • Use the Offline Compiler to profile instruction counts for ALU, L/S, TEX • If the shader needs more registers than are available, the GPU has to perform register spilling • Register spilling causes big inefficiencies and higher Load/Store utilization • Example session:
      Mali_Offline_Compiler_v4.3.0$ ./malisc --core Mali-T600 --revision 0p0_15dev0 --driver Mali-T600_r4p0-00rel0 --vertex shader-176.vert -V
      ARM Mali Offline Shader Compiler v4.3.0
      (C) Copyright 2007-2014 ARM Limited.
      All rights reserved.
      Compilation successful.
      3 work registers used, 16 uniform registers used, spilling not used.
                       A     L/S   T     Total   Bound
      Cycles:          9     5     0     14      A
      Shortest Path:   4.5   5     0     9.5     L/S
      Longest Path:    4.5   5     0     9.5     L/S
      Note: The cycles counts do not include possible stalls due to cache misses.
  • 41. Performance Advisor: Analysis Reports • Fast workflow: integrate data capture and analysis into a nightly CI test workflow; read results over a nice cup of tea • Caveats: still under development; currently in closed beta • Overview charts: see performance trends over time; see region splits • Region views: split by dynamic behavior; split by application annotation • Executive dashboard: show high-level status summary; show status breakdown by regions of interest • Summary reports: easy-to-use performance status reports; integrated initial root cause analysis
  • 42. Performance Advisor • An automated performance triage nurse • Move beyond simple FPS-based regression tracking • Perform an automated first-pass analysis • Generate easy-to-read performance reports • Route common issues directly to the team to review • Free up performance experts to focus on the difficult problems • Integrate into nightly continuous integration • Catch major issues early • Detect gradual regressions before they start impacting users
  • 43. Region-by-region analysis
  • 45. The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. www.arm.com/company/policies/trademarks
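
Slide 23 suggests tracking Tris/SetPass rather than the raw draw call count. As a rough illustration, the ratio can be computed as below. This is only a sketch: the function name is hypothetical, and the second data point is an invented comparison, not a figure from the slides.

```python
def tris_per_setpass(triangles: float, setpass_calls: int) -> float:
    """Average number of triangles submitted per render state change (SetPass call)."""
    if setpass_calls <= 0:
        raise ValueError("setpass_calls must be positive")
    return triangles / setpass_calls


# Figures quoted on slide 23: 95.2K triangles over 34 SetPass calls.
slide_ratio = tris_per_setpass(95_200, 34)

# Hypothetical frame: same triangle count, ten times as many state changes.
unbatched_ratio = tris_per_setpass(95_200, 340)

print(f"slide: {slide_ratio:.0f} tris/SetPass, unbatched: {unbatched_ratio:.0f} tris/SetPass")
```

A higher ratio means more geometry is drawn per state change, which is the direction that static batching, GPU instancing, and dynamic batching push the number.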

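
The block size trade-off on slide 17 follows from ASTC storing every block in 128 bits, whatever its footprint, so larger blocks spend fewer bits per texel. The sketch below works through that arithmetic; the helper names and the 1024x1024 texture are illustrative assumptions, and mipmaps are ignored.

```python
ASTC_BLOCK_BITS = 128  # every ASTC block is 128 bits, regardless of block footprint


def astc_bits_per_texel(block_w: int, block_h: int) -> float:
    """Bits spent per texel for a given ASTC block footprint."""
    return ASTC_BLOCK_BITS / (block_w * block_h)


def astc_size_bytes(width: int, height: int, block_w: int, block_h: int) -> int:
    """Size of one mip level; partial blocks at the edges still cost a full block."""
    blocks_x = -(-width // block_w)   # ceiling division
    blocks_y = -(-height // block_h)
    return blocks_x * blocks_y * (ASTC_BLOCK_BITS // 8)


for block in [(4, 4), (5, 5), (6, 6), (8, 8)]:
    bpt = astc_bits_per_texel(*block)
    size_kib = astc_size_bytes(1024, 1024, *block) / 1024
    print(f"{block[0]}x{block[1]}: {bpt:.2f} bits/texel, {size_kib:.0f} KiB for 1024x1024")
```

For comparison, ETC2 RGB uses a fixed 4 bits per texel, so a 6x6 ASTC block is slightly smaller than ETC2 RGB while a 5x5 block is slightly larger; which one looks acceptable depends on the texture.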
Editor's Notes

  1. Mobile Studio consists of four component tools, although at the moment only two are actually in the public tool bundle: Streamline, a system profiler for CPU and GPU performance, and Graphics Analyzer, an API debugger for the OpenGL ES and Vulkan rendering APIs. In addition we have Mali Offline Compiler, a syntax checker and static analysis tool for GPU shader programs, which is currently available as a separate download, and Performance Advisor, a new tool which places automated performance analysis into a continuous integration workflow. Performance Advisor is still in development in a closed beta, but expect to see it joining the Studio release early next year.
  2. This is how the tool looks. All of the views are customizable, so you can show only the data you need per API. API Trace lists every single call that you make to your chosen API; this can run into the millions quite easily. Dynamic Help is static analysis: our experts compiled a list of things to watch out for, so it gives you pointers. Textures and Shaders show every single asset in your application, and we run the shaders through the offline compiler, which makes them easily sortable. Frame Outline lets you navigate quickly through the whole trace to find your problem area fast.
  3. Each region defined has its own analysis section, with advice and links to further actions that can be taken. All of this information is packaged into one report, which can be integrated into CI systems or run manually, and which reduces the reliance on technical experts spending long amounts of time determining why applications have performance issues. This enables teams to move forward, empowering them with deeper knowledge to understand where the application needs attention, and in turn frees up the individual expert to concentrate on other areas.
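
The nightly CI integration described in note 3 and on slide 42 can be pictured with a small first-pass check like the one below. This is a sketch under stated assumptions, not Performance Advisor's API or output format: the function, the 5% tolerance, and the sample frame times are all hypothetical, and only the 16.6 ms budget comes from slide 36.

```python
from statistics import median

FRAME_BUDGET_MS = 16.6  # frame budget shown on the triage slide (60 FPS)


def check_nightly(history_ms, latest_ms, tolerance=0.05):
    """Return warnings for the latest nightly average frame time (hypothetical check)."""
    warnings = []
    if latest_ms > FRAME_BUDGET_MS:
        warnings.append("over budget")
    if history_ms:
        # Compare against the median of earlier runs so one noisy night
        # does not move the baseline much.
        baseline = median(history_ms)
        if latest_ms > baseline * (1 + tolerance):
            warnings.append("regression vs baseline")
    return warnings


# Hypothetical nightly averages in milliseconds.
previous_runs = [14.0, 14.1, 14.2]
print(check_nightly(previous_runs, 14.3))  # within tolerance
print(check_nightly(previous_runs, 15.0))  # slower than baseline, still in budget
print(check_nightly(previous_runs, 17.0))  # slower than baseline and over budget
```

A check like this only flags that something changed; finding out why still needs the region-by-region analysis and profiling tools described above.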