Deferred Lighting and Post Processing on PLAYSTATION®3

Deferred Lighting and Post Processing on
PLAYSTATION®3
Matt Swoboda
PhyreEngine™ Team
Sony Computer Entertainment Europe (SCEE) R&D

2
Where Are We Now?
• PS3 into its 3rd
year
• Many developers on their 2nd
generation engines
• Solved the basic problems
• SPUs STILL underused
– But it’s improving

3
But..
• GPU now the most common bottleneck
• Usually limited by fragment operations
• Many titles take > 1/3 of their time in post processing
• Most developers want to do even more fragment work
– More / heavier post processing effects
– Better lighting techniques / more lights / softer shadows
– Longer shaders
– Features ported from PC / other console hardware

4
“We fixed the vertex bottleneck..”
• Many possible solutions to improve geometry
performance beyond just “optimising the shader”
– LOD
– Occlusion culling & visibility culling
– Move large vertex operations to SPU, e.g. skinning
– SPU triangle culling

5
What About Pixels?
• Fragment operations / post processing rarely optimised
like geometry operations
– Throw whole operation at the GPU
– Same operation done for every pixel
– Spatial optimization / branching considered too slow
• SPU not considered: “too slow”, “uses too much
bandwidth”

6
SPU pixel processing
• Yes, the SPU is fast enough to process pixels
• Won’t beat the GPU in a brute force race
• GPU specialises in rasterising triangles and sampling
textures – has dedicated hardware
• SPU is a general purpose processor
– Use flexibility to your advantage
– Choose different code branches and fast paths

Post Processing Effects on SPU
A Whirlwind Tour

8
What to do on SPU
• Options:
• Offload whole processes from GPU to SPU
• Or use SPU and GPU together to do one process

9
Depth Of Field Pre-Process
• High quality depth of field requires a long fragment shader
– Read depth samples and colour samples in a kernel / disc
– Check depths against centre pixel depth
– Weight colours by depth check results
• Wasteful for “most” of the screen
– All depth checks pass (out of focus) or all fail (in focus)
– All fail == pass through original buffer
– All pass == use pre-blurred buffer – separable gaussian blur
• Categorise the screen for these cases on SPU

10
Depth Of Field Classification Results
• Post process depth buffer
• Classify by min/max depth
• Green: fully in focus
• Blue: fully out of focus
• Red: neither fully in or out

11
Depth Of Field Pre-process results
• Pre-process only on SPU,
blur operations on GPU
– Goal: minimise overall frame
time and latency
• Large blur w.r.t. depth
• 15 ms+ on GPU alone
• 1.5-2ms on SPU + 3 ms on
GPU

12
Screen Tile Classification
• Categorise the screen using the range of depth values
within a tile
• Powerful technique with many applications
– Full screen effect optimization - DOF, SSAO..
– Soft particles
– Affecting lights
– Occluder information

13
Screen Space Ambient Occlusion (SSAO)
• Generate an ambient occlusion approximation using the
depth buffer alone
• Perform a large kernel-based series of depth
comparisons and sum the results
• Downsample output to ½ size for performance
– Output normals for bilateral upsampling

14
SPU Screen Space Ambient Occlusion
Results
• GPU version: 10ms+
• SPU version: 6ms on 2
SPUs
• Used in “Donkey Trader”
PhyreEngine game
template

16
Deferred Shading Overview
• Rasterise geometry information to multiple “GBuffers”
(geometry buffers)
• Apply lighting and shading in a post process

18
Deferred Lighting on SPU
• The SPU can handle the deferred lighting process
• The GPU renders the geometry to GBuffers
• SPU and GPU execute in parallel
– Total time : max( geometry, lighting )

19
Deferred Lighting on SPU: Implementation (1)
• Process each pixel once
• Work out which lights affect each pixel
• Apply the N affecting lights in a loop
• Process the screen in tiles
• Use classification techniques per tile to optimise

20
• Calculate affecting lights per tile
– Build a frustum around the tile using the min and max depth
values in that tile
– Perform frustum check with each light’s bounding volume
– Compare light direction with tile average normal value
• Choose fast paths based on tile contents
– No lights affect the tile? Use fast path
– Check material values to see if any pixels are marked as lit

21
• Choose whether to process MSAA
per tile
– If no sample pair values differ, light
only one sample from the pair,
otherwise light both samples
separately
– Typically quite few tiles need both
MSAA samples lit
Tiles requiring MSAA

22
Deferred Lighting on SPU: Results
• 3 shadow casting lights, 100
point lights
• 2x MSAA, 720p
– Lighting performed per sample
• Apply tone mapping on SPU
– Virtually free
• Performance: > 60 fps, 3
SPUs for 11ms each
– No MSAA: 2 SPUs for 11ms

23
Deferred Lighting on SPU: Issues
• Potential latency
– Must keep GPU busy while SPU process is running
– Render something else or add a frame of latency
• Main memory requirements
• Shadows
– Requires “random” texture access – not ideal for SPU
– Can render shadows on GPU to a full screen buffer and use it
on SPU

24
Flavours of Deferred Lighting on SPU
• Full deferred render on SPU
– Input all GBuffers, output final composited result
• Light pre-pass render on SPU
– Input normal and depth only; calculate light result; sample in
2nd
geometry pass
• Light tile classification data output?
– SPU outputs information per tile about affecting lights
– Do lighting calculations on GPU

26
Volumetric Lighting
• Also known as “god rays” or “light beams”
• Simulates the effect of light illuminating dust particles in
the air
• Numerous fakes exist
– Artist-placed geometry
– Artist-placed particles
• Better: generate using the shadow map
– Works in a “general case”

27
Volumetric Lighting
• Ray march through the shadow map
– Trace one ray per pixel in screen space
– Sample the depth buffer to determine
the end of the ray
• Sample the shadow map at N points
along the ray
– N ~= 50
– Attenuate and sum up the number of
samples that passed
• Blur and add noise

28
Volumetric Lighting
• Effect is a bit too slow to be practical on GPU: ~5ms
• Do it on SPU instead
• Parallelises with GPU easily
– Result needed late in the render at compositing stage
– Only needs depth and shadow map inputs
• Problem: must randomly sample from the shadow map

29
Texture sampling on SPU
• “Random access” texture sampling is bad for SPU
• It’s bad for GPU, too, but sometimes you just have to do it
• GPU:
– Fast access from texture cache; cache miss is slow
– Dedicated hardware handles lookups, filtering and wrapping
• SPU:
– Fast access from “texture cache” (SPU local memory)
– Slow access on cache miss (DMA from main memory)
– Cache lookups slow (no dedicated hardware)
– Must manually handle filtering and wrapping (again, slow)

30
Texture sampling on SPU
• Either:
– Make the texture entirely fit in SPU local memory
– Problem solved!
– Still inefficient: random accesses reduce register parallelism
• Or
– Write a very good software cache
– Locate potential cache misses early - long before you need the values
– Avoid branches in sampling code

31
Volumetric Lighting on SPU
• Volumetric light result will be blurred
– Don’t need full shadow map accuracy
– No filtering on texture samples needed
• Downsample shadow map from 1024x1024, 32 bit to
256x256, 16 bit
– 128k – fits in SPU local memory
• Fast enough to sample on SPU

32
Volumetric Lighting on SPU: Results
• Takes ~11 ms on 1 SPU

33
Shadow Mapping on SPU (1)
• Needs the full-size shadow map
– 1024x1024x32 bit == 4mb : won’t fit in SPU local memory
– We’ll have to write that “very good software cache”, then
• Pre-process the shadow map on SPU
– Calculate min and max depth for each tile
– Store in a low resolution depth hierarchy map
– Output high resolution shadow map as cache tiles

34
• Software cache with 32 entries
– Each entry is a shadow map tile
– Branchless determination of cache entry index for tile index
• Locate cache misses early
– While detiling depth data – work out required shadow tiles
– Pull in all cache-missed tiles
• Sample shadow map during lighting calculations
– All required shadow tiles are now definitely in cache – lookup is
branchless
• It’s quite slow
– Locate tile in cache per pixel

35
• Optimise via special cases to win back
performance
• Use the low resolution shadow tile map
– Always in SPU local memory
– If pixel shadow z > tile max Z : definitely in shadow
– If pixel shadow z < tile min Z : definitely not in shadow
• Check low resolution map before triggering
cache fetches
• Classify whole screen tiles as in or out of
shadow
– Don’t need to sample high resolution shadow map at
all for those tiles Tiles requiring high resolution shadow samples

37
Conclusion
• New additions to your toolbox:
– Tile-based classification techniques on SPU
– Deferred lighting on SPU
– Texture sampling on SPU
• Rendering is no longer just a GPU problem
– Use general purpose nature of the SPU to your advantage
• Rethink fragment processing optimisation strategies
– Make the GPU work smarter, not harder

38
Conclusion
• Some titles are already using SPU post processing
– Killzone 2
• PhyreEngine™ is here to help
– (If you’re a registered PS3 developer) it’s on DevNet now
– Not just an engine: also a reference
– Comes with full source
– Download it, learn from it, steal bits of the code
– Check out the PhyreEngine™ SPU Post Processing Library

Deferred Lighting and Post Processing on PLAYSTATION®3

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Deferred Lighting and Post Processing on PLAYSTATION®3

Similar to Deferred Lighting and Post Processing on PLAYSTATION®3 (20)

More from Slide_N

More from Slide_N (20)

Recently uploaded

Recently uploaded (20)

Deferred Lighting and Post Processing on PLAYSTATION®3

Editor's Notes