Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Consoles

Book of the Dead
Optimizing Performance for
High-End Consoles

Rob Thompson
Console Graphics Programmer
Unity Technologies

• Technical presentation, focussed on graphics optimisation.
• Looking at Xbox One & PlayStation 4.
• Case study using a Scriptable Render Pipeline (SRP) based project.
Presentation Overview

• Real-time rendered short cinematic, released at the start of 2018 to critical acclaim.
• 2018 Webby Award winner.
• Showcase for the capabilities of the High Definition Render Pipeline (HDRP).
• https://unity3d.com/book-of-the-dead
Book of the Dead

• Book of the Dead was created by Unity’s award-winning demo team.
• Also responsible for Adam and The Blacksmith.
The Demo Team

Book of the Dead:
Environment interactive demo

• Allow users to explore Book of the Dead content in an interactive environment.
• Show Book of the Dead quality visuals on hardware people have at home.
• Provide an example Unity project for high end HDRP content.
- All of the script code and assets are now available on the Asset Store.
• Target Xbox One and PlayStation 4.
• 1080p, 30fps or better on PlayStation 4 Pro and Xbox One X.
Objectives

Book of the Dead:
Environment interactive demo
Performance Case Study

• Worst case view for profiling in terms of GPU load.
Sample Scene

• Deferred rendered using High Definition Render Pipeline (HDRP).
• Most artist-authored textures are 1-2k, with a handful at 4k.
• Baked Occlusion and GI.
• Single Dynamic Shadow Casting Directional Light.
• ~2000 batches (draw calls and compute shader dispatches).
• Initially GPU bound on PS4 Pro at ~45ms.
Scene Summary

Controlling The Batch Count
• 1832 batches in this scene.

• Use Occlusion culling.
• Use GPU instancing.
• Dynamic batching is seldom a win on console.
• 4500 batches without instancing, more in other views.

Graphics Jobs
• Both PS4 and Xbox One are multi-core machines.
• Good CPU performance is dependent on using those cores effectively.
• Graphics Jobs are Unity’s mechanism for spreading rendering work across those cores.
• In Unity, find the Graphics Jobs controls under Player Settings -> Other Settings.
• It’s still flagged as experimental!

Graphics Jobs
You should see a performance gain from Graphics Jobs on consoles if you are rendering anything more than a handful of batches.
• Graphics Jobs off is the default.
Legacy Jobs
• DX11 for Xbox One
• Available on PS4
Native Jobs
• DX12 for Xbox One (coming soon)
• Available on PS4

Graphics Jobs
Legacy Jobs
• Takes some pressure off the main thread and moves it onto threads on the other cores.
• The “Render Thread” can still be a bottleneck in large scenes.
Native Jobs
• Distributes the most work across cores.
• Best option for large scenes.
• In 2018.1 and earlier, could put more work onto the main thread, causing a performance regression in comparison to legacy jobs.
• Should always be the best option from 2018.2 onwards.

Performance Investigation
• Undertaken using the platform holders’ tools.
• PIX and Razor are world class, use them.
• Get on to console early in your dev cycle.
• Timings presented here from PS4 Pro.

Initial GPU Frame
• Gbuffer (11ms)
• Motion Vectors (0.25ms)
• SSAO (0.6ms)
• Shadow maps (13.9ms)
• Deferred Lighting (4.9ms)
• Atmospheric Scattering (6.6ms)
• TAA & Post Process (7.6ms)

Initial GPU Frame
[Chart: original GPU time (ms) per pass (Gbuffer, Motion Vectors, SSAO, Shadows, Lighting, Atmospherics, Post); axis 0-50 ms, with 60 FPS and 30 FPS budget markers]
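
As a quick sanity check, the per-pass timings quoted on the Initial GPU Frame slides can be summed and compared against the frame budgets marked on the chart. A minimal Python sketch: the timings are the figures from the slides, while the dictionary and constant names are purely illustrative.

```python
# Per-pass GPU timings (ms) as quoted on the "Initial GPU Frame" slides (PS4 Pro).
initial_passes_ms = {
    "Gbuffer": 11.0,
    "Motion Vectors": 0.25,
    "SSAO": 0.6,
    "Shadow maps": 13.9,
    "Deferred Lighting": 4.9,
    "Atmospheric Scattering": 6.6,
    "TAA & Post Process": 7.6,
}

# Frame budgets in milliseconds for the two targets marked on the chart.
BUDGET_30FPS_MS = 1000.0 / 30.0
BUDGET_60FPS_MS = 1000.0 / 60.0

total_ms = sum(initial_passes_ms.values())
over_budget_ms = total_ms - BUDGET_30FPS_MS

print(f"Total GPU time: {total_ms:.2f} ms")
print(f"30 FPS budget:  {BUDGET_30FPS_MS:.2f} ms")
print(f"Over budget by: {over_budget_ms:.2f} ms")
```

The sum is consistent with the ~45ms figure given in the scene summary, and shows how far the frame has to come down to reach 30fps.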

• Too slow at 11ms.
• Initial GPU profile showed use of GPU tessellation during GBuffer and shadow map passes.
• Tessellation shaders are generally best avoided on consoles.
- Slow in comparison to rendering the equivalent pre-authored assets.
- Should only be used when it solves a visual issue that would be hard or impossible to solve in art.
• So why use tessellation here?
GBuffer Performance

• Tree bark is an ideal use case for tessellated displacement.
• Trees are “hero objects” in our scene.
- Adding extra detail in this manner helps hide LOD transitions on these important assets.
- Same mesh used for LOD0 and LOD1, but the effect of tessellation is dialled back as we transition between the two.
• Decided to stick with tessellation despite the performance issues, as the advantages in this use case were deemed worth the cost.
Tessellation Use

• Too slow at 11ms.
• PIX / Razor analysis showed GPU wave front patterns like that on the right.
• Diagram shows wave front occupancy during a portion of the Gbuffer pass.
• We should see heavy vertex shader (green) and pixel shader (blue) occupancy, as in the image on the left. Instead, the GPU is starved of work.
Gbuffer Performance
[Figure: good wave front occupancy (left) vs. bad wave front occupancy (right)]

Overdraw
• Especially bad on consoles when pixel shaders use discard instructions.
• This causes depth rejection to not be performed until after pixel shaders have run.
• A lot of our objects are “alpha tested”.
Solution: Use a depth pre-pass
• HDRP now always runs a depth pre-pass for alpha tested objects.
• Option provided to pre-pass everything.
- HDRenderPipelineAsset -> Rendering Settings.
• Downside: more batches!
• Be careful of CPU performance when using a pre-pass.
Gbuffer Performance

• Some asset optimisation also carried out during this phase.
• GBuffer creation was at ~11ms.
• Now Depth Prepass + GBuffer creation totals ~6ms.
Gbuffer Performance

GPU Frame after Prepass
[Chart: GPU time (ms) per pass, initial vs. after prepass (Gbuffer & Prepass, Motion Vectors, SSAO, Shadows, Lighting, Atmospherics, Post); axis 0-50 ms, with 60 FPS and 30 FPS budget markers]

• Single shadow-casting directional light.
• 4 shadow map splits.
• 4k x 4k resolution (default for HDRP).
• 32-bit depth.
Shadow Map Generation

• Resolution is almost always the performance-limiting factor when it comes to shadow maps.
• Analysis in Razor and PIX backed this up.
• Most of our draw calls are in the shadow mapping pass.
• Interesting wave front stall at the end of the shadow mapping wave fronts.

• Consoles write to compressed depth buffers.
• This speeds up depth testing significantly.
• However, before the depth buffer can be sampled as a texture it must be decompressed.
• The decompression is our stall, in this case around 0.7ms.
• The stall is bigger for larger 32-bit render targets.
• Can be problematic on large render targets that are updated sporadically.
• On PS4, from script, use PS4.RenderSettings.DisableDepthBufferCompression to experiment with disabling compression on large depth targets that might only be partially written to in any given frame (e.g. atlases).

• The first stage of our atmospheric scattering effect reads the shadow map as an input.
• Initially at 6.6ms.
• Razor and PIX showed that this was significantly bandwidth bound reading from the shadow map.
Shadow Map As Input

• Drop the shadow map resolution to 3k x 3k.
• Change the bit depth to 16-bit.
• The HDRenderPipeline Asset controls this.
• Also need to change the settings on the light.
Shadow Revisions
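
To put the resolution and bit depth change in perspective, the raw size of the shadow map before and after can be estimated. A rough Python sketch, assuming "4k" means 4096 texels and "3k" means 3072 texels per side, treating the shadow map as a single uncompressed surface of the stated size, and ignoring any padding or metadata the hardware adds; the function name is illustrative only.

```python
def shadow_map_bytes(width, height, bits_per_texel):
    """Raw size in bytes of an uncompressed depth surface."""
    return width * height * bits_per_texel // 8

MIB = 1024 * 1024

# Assumption: "4k" = 4096 texels and "3k" = 3072 texels per side.
before_bytes = shadow_map_bytes(4096, 4096, 32)  # original: 4k x 4k, 32-bit
after_bytes = shadow_map_bytes(3072, 3072, 16)   # revised:  3k x 3k, 16-bit

print(f"Before: {before_bytes / MIB:.0f} MiB")
print(f"After:  {after_bytes / MIB:.0f} MiB")
print(f"Reduction factor: {before_bytes / after_bytes:.2f}x")
```

Under these assumptions, any pass that reads the whole shadow map has several times less data to fetch, which is in line with the bandwidth-bound behaviour reported for the atmospheric scattering pass.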

• Repositioned the shadow casting light to get better use of the resolution of the shadow map.
• Only draw the last split on level load.
• Saves batches and GPU time.
• Custom layer culling for shadow maps.
• Shadow map creation 13ms -> 7.9ms
• Lighting pass 4.9ms -> 4.4ms
• Atmospherics 6.6ms -> 4.2ms
Shadow Revisions

GPU Frame after shadow map revision
[Chart: GPU time (ms) per pass, initial vs. after prepass vs. after shadows (Gbuffer & Prepass, Motion Vectors, SSAO, Shadows, Lighting, Atmospherics, Post); axis 0-50 ms, with 60 FPS and 30 FPS budget markers]
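
The before and after figures quoted so far can be tallied to see where the frame stands against the 30fps budget. A minimal Python sketch: the Gbuffer, shadow, lighting and atmospherics timings come from the slides (using the 13.9ms initial shadow figure), while Motion Vectors, SSAO and Post are assumed unchanged because no revised figures are given for them. All names are illustrative.

```python
# Initial per-pass GPU timings (ms) from the "Initial GPU Frame" slides.
initial_ms = {
    "Gbuffer & Prepass": 11.0,
    "Motion Vectors": 0.25,
    "SSAO": 0.6,
    "Shadows": 13.9,
    "Lighting": 4.9,
    "Atmospherics": 6.6,
    "Post": 7.6,
}

# Timings after the depth prepass and shadow revisions.
# Assumption: Motion Vectors, SSAO and Post are unchanged (no revised figures quoted).
revised_ms = dict(initial_ms)
revised_ms.update({
    "Gbuffer & Prepass": 6.0,
    "Shadows": 7.9,
    "Lighting": 4.4,
    "Atmospherics": 4.2,
})

BUDGET_30FPS_MS = 1000.0 / 30.0

for name, before in initial_ms.items():
    after = revised_ms[name]
    print(f"{name:<18} {before:6.2f} -> {after:6.2f} ms (saved {before - after:5.2f})")

initial_total = sum(initial_ms.values())
revised_total = sum(revised_ms.values())
print(f"Total: {initial_total:.2f} -> {revised_total:.2f} ms "
      f"(30 FPS budget {BUDGET_30FPS_MS:.2f} ms)")
```

Under those assumptions the revised total lands just inside the 30fps budget, which leaves little headroom and helps explain the interest in async compute.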

• Under-utilisation of the GPU’s computational potential is common during depth-only rendering (such as shadow map generation).
Async Compute

• Could we make use of these unoccupied wave fronts?
• If our compute shader work has no dependencies on the depth-only rendering that precedes it, then async compute will allow this.
Async Compute

• Compute shader wave fronts mingle with those of the depth pass.
• Saves most if not all of the time spent on the compute work from the total frame time, assuming they have different bottlenecks.
Async Compute

• BOTD uses tile light list gather (part of the lighting pass) and SSAO on async compute.
• Both overlap with the shadow map rendering, where the most “gaps” in our wave front utilisation occur.
• Async Compute is currently PS4 only, coming to DX12 soon.
• Accessible in script through Unity’s Command Buffer interface (not just SRP).
• Look at HDRP or BOTD script code for examples.
Async Compute
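
The potential saving can be pictured with a deliberately simplified timing model: run serially, the two workloads cost the sum of their times; with async compute, some fraction of the compute time is hidden inside the depth-only pass. A Python sketch using the shadow map (7.9ms) and SSAO (0.6ms) figures from the slides. The function, its `hidden_fraction` parameter and the assumption that the compute work is shorter than the graphics pass are illustrative, not a description of how the GPU actually schedules work.

```python
def frame_cost_ms(graphics_ms, compute_ms, hidden_fraction):
    """Combined cost when hidden_fraction of the compute work overlaps the graphics pass.

    hidden_fraction = 0.0 models serial execution; 1.0 models perfect overlap.
    Assumes the compute work is shorter than the graphics pass it overlaps.
    """
    if not 0.0 <= hidden_fraction <= 1.0:
        raise ValueError("hidden_fraction must be between 0 and 1")
    return graphics_ms + compute_ms * (1.0 - hidden_fraction)

SHADOWS_MS = 7.9  # shadow map creation after the revisions
SSAO_MS = 0.6     # SSAO, one of the passes moved to async compute

serial_ms = frame_cost_ms(SHADOWS_MS, SSAO_MS, 0.0)
ideal_ms = frame_cost_ms(SHADOWS_MS, SSAO_MS, 1.0)

print(f"Serial:        {serial_ms:.2f} ms")
print(f"Ideal overlap: {ideal_ms:.2f} ms")
print(f"Best-case saving: {serial_ms - ideal_ms:.2f} ms")
```

In practice the hidden fraction depends on whether the two workloads contend for the same resources, so the best case is an upper bound to verify in PIX or Razor.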

• Can also be used with the legacy renderers.
• Unity automatically creates the fences internally when adding async compute command buffers to lights or cameras.
• Results in your async compute commands being executed at the appropriate light or camera event on the graphics queue.
Async Compute

• Learn the platform holders’ tools (PIX, Razor).
• Get onto console early in your dev cycle.
• Use Graphics Jobs.
• Use GPU Instancing.
• Don’t use Tessellation without good cause.
Key Takeaways

• Consider a depth prepass when using SRP.
• Be careful with shadow map resolution / bit depth.
• Try enabling async compute when using HDRP.
• Consider async compute for any custom compute tasks.
• Book of the Dead: Environment interactive demo is available on the Asset Store now.
Key Takeaways

Thanks To
• The Demo Team.
• Xbox and PlayStation Teams.
• Unity Paris.
• Spotlight Europe.

Visit the
Microsoft & PlayStation booths
Experience the Book of the Dead: Environment interactive demo for yourself
