Umbra 3 is occlusion culling middleware that improves rendering performance on mobile devices by processing only visible objects. It uses automatically generated occluder models and parallel rasterization across CPU cores and SIMD units like NEON to quickly determine visibility. Optimizing for NEON includes vectorizing work, narrowing data widths early, and using efficient instructions like shift-and-insert to pack multiple elements simultaneously. An example demonstrates gathering sign bits from a float array using these techniques for a 16x speedup over a naive implementation.
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
1. Antti Hätälä
Umbra 3 Lead programmer
Boosting your ARM mobile 3D rendering performance with Umbra 3
2. INDEX
• Who are we?
• Games
• What is Umbra 3 and occlusion culling
• bringing our system to the PlayStation 4
• experiences and benefits
• lessons learned
3. UMBRA SOFTWARE
Occlusion culling middleware for 3D games
Founded in 2007
14 employees
Based in Helsinki, Finland
Support office in Seattle, WA
Same problem – Different solutions
Mo Money – Mo Problems
“Level artists are there to fill the world with content. Integrating Umbra saved us not only artist time but the time to create and maintain an efficient visibility culling solution. Umbra’s support provides us with the solutions and features that we need.”
“Umbra’s technology is playing an important role in the creation of our next universe, by freeing our artists from the burden of manual markups typically associated with polygon soup.”
5. Occlusion Culling: Why bother?
• Process and render only what's visible
• improved frame rate and rendering performance
• allows you to put more detail into levels and create larger levels
7. Umbra 3 occluder rasterizer
• Determines visible objects fast to save further work on both CPU and GPU
• Rasterizes automatically generated proprietary occluder models on the CPU
• Operates in low resolution, generates conservative (dilated) results
• Rasterization is embarrassingly parallel in nature
  • Parallelize across CPU cores
  • Process multiple pixels/elements in SIMD
• Optimized for SSE, Altivec, Cell and ARM NEON
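As a rough illustration of how a low-resolution occlusion buffer can answer visibility queries conservatively, the sketch below tests an object's screen-space bounding rectangle against per-cell occluder depths. It is not Umbra's API or algorithm. The OcclusionBuffer layout, the rect_may_be_visible name and the depth convention (smaller is nearer, 1.0 means no occluder) are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical low-resolution occlusion buffer (not Umbra's data layout).
 * depth[y * width + x] is an occluder depth assumed to cover the whole cell;
 * smaller is nearer, 1.0f means no occluder was rasterized there. */
typedef struct {
    int width;
    int height;
    const float *depth;
} OcclusionBuffer;

/* Returns false only when every cell touched by the rectangle holds an
 * occluder nearer than the object's nearest depth. Any doubt counts as
 * "may be visible", which keeps the answer conservative. */
bool rect_may_be_visible(const OcclusionBuffer *buf,
                         int x0, int y0, int x1, int y1, float nearest_z)
{
    /* Clamp the inclusive cell rectangle to the buffer. */
    if (x0 < 0) x0 = 0;
    if (y0 < 0) y0 = 0;
    if (x1 >= buf->width) x1 = buf->width - 1;
    if (y1 >= buf->height) y1 = buf->height - 1;
    if (x0 > x1 || y0 > y1)
        return false; /* entirely off screen */

    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (nearest_z <= buf->depth[(size_t)y * (size_t)buf->width + (size_t)x])
                return true; /* object is in front of the occluder in this cell */
    return false;
}
```

The inner loop touches independent cells, so it is the kind of work that can be split across cores or evaluated several cells at a time in SIMD.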
8. NEON overview
• Processing of multiple data elements (2 to 16) in a single instruction
• Separate execution pipeline: can execute in parallel with ARM
• Separate register file: 16 128-bit regs (or 32 64-bit), SP floats or 8-64 bit integers
• Mandatory in Cortex-A8/A12/A15, optional in Cortex-A9
  • For mobile 3D title purposes, it will be there
  • Actual cycle counts will vary: 64-bit vs 128-bit, single vs dual issue, latencies
  • For multi-platform, target A9 and enjoy free benefits on more advanced platforms
• Used in one of three ways
  • Inline assembly
  • Compiler intrinsics
  • Compiler auto-vectorization
• Similar to SSE, Altivec but for best performance you need to know your platform
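To make the "compiler intrinsics" route concrete, here is a minimal sketch that adds two byte arrays with saturation, 16 elements per instruction. The function name add_saturate_u8 is made up for the example. The NEON path is guarded by the compiler's __ARM_NEON macro, and the scalar loop doubles as a fallback on other targets and as the tail handler.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* out[i] = min(a[i] + b[i], 255) for i in [0, n). Hypothetical helper. */
void add_saturate_u8(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* One VQADD.U8 on a quadword register handles 16 bytes at once. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vqaddq_u8(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)a[i] + b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Intrinsics leave register allocation and scheduling to the compiler, which is why the generated code is still worth inspecting.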
9. NEON common best practices
• Collaborate with the compiler, but keep an eye on the output
  • Align your data when possible
  • Inline functions that operate on SIMD values
  • Use __restrict to let the compiler reorder memory accesses
  • Watch for register spilling
• Schedule enough NEON work, even when it might be redundant
  • Moving data from ARM registers to NEON is relatively cheap, moving results back is expensive
  • Hide load/store latencies by interleaving with computation (unroll your loops)
• Never interleave VFP instructions with NEON
  • Means a pipeline flush, tens of cycles of penalty
  • Watch for "s" register use in compiler output
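A small sketch of two of these practices, __restrict and loop unrolling, is shown here. The routine scale_bias_f32 is a hypothetical example and not taken from Umbra. It assumes dst and src never overlap, which is exactly what __restrict promises to the compiler.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* dst[i] = src[i] * scale + bias. Hypothetical helper; dst and src must not overlap. */
void scale_bias_f32(float *__restrict dst, const float *__restrict src,
                    float scale, float bias, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float32x4_t vbias = vdupq_n_f32(bias);
    /* Unrolled by two quadwords: both loads are issued before any
     * arithmetic so their latency can overlap with computation. */
    for (; i + 8 <= n; i += 8) {
        float32x4_t a = vld1q_f32(src + i);
        float32x4_t b = vld1q_f32(src + i + 4);
        a = vmlaq_n_f32(vbias, a, scale); /* vbias + a * scale */
        b = vmlaq_n_f32(vbias, b, scale);
        vst1q_f32(dst + i, a);
        vst1q_f32(dst + i + 4, b);
    }
#endif
    /* Scalar tail, kept outside the vector loop. On ARMv7 the compiler may
     * emit VFP instructions ("s" registers) for this part. */
    for (; i < n; ++i)
        dst[i] = src[i] * scale + bias;
}
```

Whether the unrolling pays off depends on the core, so checking the disassembly for spills and for "s" registers inside the vector loop remains the real test.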
10. NEON optimization tricks
• No penalty from interleaving 2-wide ops with 4-wide ops
  • Cortex-A8/A9 process 64 bits of float data per cycle
  • vget_high_xxx, vget_low_xxx to address quadword halves
• Narrow to 64 bits early
  • 16x4 and 8x8 are also 64 bits; for many operations 32 bits per channel are not needed
  • Even if the CPU can churn out 128 bits per cycle, there are savings to be had in result latency etc.
  • Use VMOVN, or an instruction that combines an operation with narrowing (e.g. VSHRN, VADDHN)
• Careful with your constants
  • VMOV and VMVN can encode lots of useful constants
  • Compilers do a good job of constant encoding, but can’t choose the constants for you
• Killer instructions
  • Shift-and-insert: VSRI, VSLI
  • Byte permute by table lookup: VTBL, VTBX
  • De-interleaving ("gather") load and interleaving ("scatter") store: VLD2-4, VST2-4
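The narrowing and shift-and-insert tricks can be combined in a few lines. The sketch below packs two arrays of 16-bit values into one byte per element pair, keeping the low 4 bits of each. The pack_nibbles name and the data layout are invented for illustration. VMOVN narrows each input to a 64-bit register first, and a single VSLI then merges the two halves without a separate mask and OR.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* out[i] = (low nibble of hi[i]) << 4 | (low nibble of lo[i]). Hypothetical helper. */
void pack_nibbles(const uint16_t *lo, const uint16_t *hi, uint8_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 8 <= n; i += 8) {
        /* Narrow 8x16 bits to 8x8 bits early: the rest runs on 64-bit registers. */
        uint8x8_t l = vmovn_u16(vld1q_u16(lo + i));
        uint8x8_t h = vmovn_u16(vld1q_u16(hi + i));
        /* VSLI #4: shift h left by 4 and insert it into l, keeping l's low 4 bits. */
        vst1_u8(out + i, vsli_n_u8(l, h, 4));
    }
#endif
    /* Scalar tail and non-NEON fallback with the same bit-level result. */
    for (; i < n; ++i)
        out[i] = (uint8_t)((hi[i] << 4) | (lo[i] & 0x0F));
}
```

The insert semantics mean the stale upper bits of lo never need to be masked out explicitly, which saves an instruction per vector.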
11. NEON optimization example
Example routine: gather sign bits of a large array of float values

function gather_signbits(flt_array):
    let output_bitmap = bitmap of size len(flt_array)
    foreach elem in flt_array at index idx:
        if (elem < 0)
            set_bit(output_bitmap, idx)
        else
            clear_bit(output_bitmap, idx)
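The pseudocode above could be written in C roughly as follows. This is a sketch and not the implementation from the talk. It assumes the bitmap stores element idx in bit (idx % 8) of byte (idx / 8), and it uses a compare, two narrowing moves and pairwise adds rather than shift-and-insert. The scalar version serves as the reference, the tail handler and the fallback when NEON is unavailable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Naive reference: one element, one bit at a time.
 * bitmap must hold at least (n + 7) / 8 bytes. */
void gather_signbits_scalar(const float *in, size_t n, uint8_t *bitmap)
{
    memset(bitmap, 0, (n + 7) / 8);
    for (size_t i = 0; i < n; ++i)
        if (in[i] < 0.0f)
            bitmap[i >> 3] |= (uint8_t)(1u << (i & 7));
}

/* Same result, eight floats per iteration when NEON is available. */
void gather_signbits(const float *in, size_t n, uint8_t *bitmap)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    static const uint8_t weights[8] = {1, 2, 4, 8, 16, 32, 64, 128};
    const uint8x8_t w = vld1_u8(weights);
    const float32x4_t zero = vdupq_n_f32(0.0f);
    for (; i + 8 <= n; i += 8) {
        /* Compare gives all-ones (negative) or all-zeros per 32-bit lane. */
        uint32x4_t m0 = vcltq_f32(vld1q_f32(in + i), zero);
        uint32x4_t m1 = vcltq_f32(vld1q_f32(in + i + 4), zero);
        /* Narrow 32 -> 16 -> 8 bits so all eight masks fit in 64 bits. */
        uint16x8_t m16 = vcombine_u16(vmovn_u32(m0), vmovn_u32(m1));
        uint8x8_t bits = vand_u8(vmovn_u16(m16), w);
        /* Three pairwise adds sum the eight weighted lanes into lane 0. */
        bits = vpadd_u8(bits, bits);
        bits = vpadd_u8(bits, bits);
        bits = vpadd_u8(bits, bits);
        bitmap[i >> 3] = vget_lane_u8(bits, 0);
    }
#endif
    /* i is a multiple of 8 here, so the tail starts on a byte boundary. */
    if (i < n)
        gather_signbits_scalar(in + i, n - i, bitmap + (i >> 3));
}
```

This version still moves one byte per iteration out of a NEON lane. Handling more elements per iteration and storing whole vectors would reduce that cost further.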
14. Thank you.
For more on Umbra 3, go to:
umbra3.com
antti@umbrasoftware.com
Follow us on Twitter @umbrasoftware
Editor's Notes
Hello everybody!
My name is Antti Hätälä, I am the tech lead at Umbra software. Thomas Puha, developer relations manager.
Thank you all for coming.
I am here to talk about what we have accomplished with the Umbra 3 visibility system
how the technology is being used to power up some very exciting titles
how it can help your title as well
A little bit of background.
(soap story)
Umbra software is an independent team of computer graphics geeks based in Helsinki, Finland
We have kept going at it since 2007 – and individually even before that.
Throughout the years we’ve been attacking the problem from various angles
Permanent presence in the US
The same with images: only process what you see!
Doing this allows you to add more detail to the visible part of the world in the same frame budget!
In short, better looking games that run faster.
Not all 3D games or environments will significantly benefit from occlusion culling. Examples are games with top-down views, mostly transparent elements or stationary cameras.
Available in Apple processors from iPhone 4 onwards
Android armv7 target requires NEON