GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra

Antwan Hätälä
Umbra 3 Lead programmer
Boosting your ARM mobile 3D rendering performance with Umbra 3

INDEX
• Who are we?
• Games
• What is Umbra 3 and occlusion culling
• Bringing our system to the PlayStation 4
• Experiences and benefits
• Lessons learned

UMBRA SOFTWARE
Occlusion culling middleware for 3D games
Founded in 2007
14 employees
Based in Helsinki, Finland
Support office in Seattle, WA

Same problem – Different solutions
Mo Money – Mo Problems
“Level artists are there to fill the world with content. Integrating Umbra saved us not only artist time but the time to create and maintain an efficient visibility culling solution. Umbra’s support provides us with the solutions and features that we need.”

“Umbra’s technology is playing an important role in the creation of our next universe, by freeing our artists from the burden of manual markups typically associated with polygon soup.”

Occlusion Culling: Why bother?
• Process and render only what’s visible
• Improved frame rate and rendering performance
• Allows you to put more detail into levels and create larger levels

Umbra 3 occluder rasterizer
• Determines visible objects fast to save further work on both CPU and GPU
• Rasterizes automatically generated proprietary occluder models on the CPU
• Operates in low resolution, generates conservative (dilated) results
• Rasterization is embarrassingly parallel in nature
  • Parallelize across CPU cores
  • Process multiple pixels/elements in SIMD
• Optimized for SSE, AltiVec, Cell and ARM NEON
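
To make "low resolution, conservative (dilated)" concrete, here is a minimal sketch of a visibility query against a small occluder depth buffer. It is not Umbra's implementation or API: the function name, the buffer layout, the one-pixel dilation and the "larger depth = farther away" convention are all assumptions chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of Umbra. Assumed conventions:
 * - depth[] is a width*height low-resolution buffer of occluder depths
 * - larger depth means farther from the camera
 * - (x0,y0)-(x1,y1) is an inclusive, in-bounds screen rectangle
 * Returns false only when every covered pixel is closer than the object. */
bool rect_maybe_visible(const float *depth, int width, int height,
                        int x0, int y0, int x1, int y1, float nearest_z)
{
    /* Dilate the query by one pixel (clamped to the buffer) so the coarse
     * resolution errs toward "visible" rather than hiding something. */
    x0 = (x0 > 0) ? x0 - 1 : 0;
    y0 = (y0 > 0) ? y0 - 1 : 0;
    x1 = (x1 < width - 1) ? x1 + 1 : width - 1;
    y1 = (y1 < height - 1) ? y1 + 1 : height - 1;

    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (nearest_z <= depth[y * width + x])
                return true;   /* object may be in front of the occluders here */
    return false;              /* fully behind occluders: safe to cull */
}
```

The important property is the direction of the error: a conservative test may report a hidden object as visible, but never the reverse.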

NEON overview
• Processes multiple data elements (2 to 16) in a single instruction
• Separate execution pipeline: can execute in parallel with the ARM integer pipeline
• Separate register file: 16 128-bit registers (or 32 64-bit), holding single-precision floats or 8- to 64-bit integers
• Mandatory in Cortex-A8/A12/A15, optional in Cortex-A9
  • For mobile 3D title purposes, it will be there
• Actual cycle counts will vary: 64-bit vs 128-bit, single vs dual issue, latencies
  • For multi-platform, target A9 and enjoy free benefits on more advanced platforms
• Used in one of three ways
  • Inline assembly
  • Compiler intrinsics
  • Compiler auto-vectorization
• Similar to SSE and AltiVec, but for best performance you need to know your platform
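
As an illustration of the "compiler intrinsics" route, the sketch below adds two float arrays four lanes at a time. The function name `add_arrays` is made up for this example; the intrinsics (`vld1q_f32`, `vaddq_f32`, `vst1q_f32`) come from `arm_neon.h`. The NEON path is guarded by the compiler's NEON macro so the same file still builds, with the scalar loop only, on other targets.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical example routine: dst[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four single-precision floats per iteration in one q register. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Intrinsics leave register allocation and scheduling to the compiler, which is why checking the generated code remains part of the workflow.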

NEON common best practices
• Collaborate with the compiler, but keep an eye on the output
  • Align your data when possible
  • Inline functions that operate on SIMD values
  • Use __restrict to let the compiler reorder loads and stores
  • Watch for register spilling
• Schedule enough NEON work, even when it might be redundant
  • Loading data into NEON from ARM registers is relatively cheap; moving results back to ARM registers is expensive
  • Hide load/store latencies by interleaving with computation (unroll your loops)
• Never interleave VFP instructions with NEON
  • Means a pipeline flush, tens of cycles of penalty
  • Watch for “s” register use in compiler output
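
Two of these practices, no-alias pointers and unrolling, can be shown in portable C. The sketch below is a hypothetical routine (`scale_bias` is not from the talk) that uses the standard C99 `restrict` qualifier, the portable spelling of the `__restrict` extension named on the slide, and a loop unrolled by four so the loads of one group can be scheduled ahead of the arithmetic.

```c
#include <stddef.h>

/* Hypothetical example: dst[i] = src[i] * scale + bias.
 * restrict promises the compiler that dst and src never overlap,
 * so it is free to hoist loads above earlier stores. */
void scale_bias(float *restrict dst, const float *restrict src,
                float scale, float bias, size_t n)
{
    size_t i = 0;
    /* Unrolled by four: all loads are issued first, then the math,
     * giving the scheduler independent work to hide load latency. */
    for (; i + 4 <= n; i += 4) {
        float s0 = src[i + 0];
        float s1 = src[i + 1];
        float s2 = src[i + 2];
        float s3 = src[i + 3];
        dst[i + 0] = s0 * scale + bias;
        dst[i + 1] = s1 * scale + bias;
        dst[i + 2] = s2 * scale + bias;
        dst[i + 3] = s3 * scale + bias;
    }
    for (; i < n; ++i)          /* remainder */
        dst[i] = src[i] * scale + bias;
}
```

Whether the compiler actually vectorizes or reorders this is target and flag dependent, so the advice to inspect the output still applies.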

NEON optimization tricks
• No penalty from interleaving 2-wide ops with 4-wide ops
  • Cortex-A8/A9 perform 64 bits of float operations per cycle
  • vget_high_xxx, vget_low_xxx to address quadword halves
• Narrow to 64 bits early
  • 16x4 and 8x8 are also 64 bits; for many operations 32 bits per channel are not needed
  • Even if the CPU can churn out 128 bits per cycle, there are savings to be had in result latency etc.
  • Use VMOVN, or an operation coupled with a narrow
• Careful with your constants
  • VMOV and VMVN can encode lots of useful constants
  • Compilers do a good job of constant encoding, but can’t choose the constants for you
• Killer instructions
  • Shift-and-insert: VSRI, VSLI
  • Byte permute by table lookup: VTBL, VTBX
  • Gather load and scatter store (de-interleaving structure access): VLD2-4, VST2-4
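
For readers unfamiliar with shift-and-insert, the following scalar model shows what VSRI and VSLI do to a single 32-bit lane. The helper names `vsri32` and `vsli32` are invented for this sketch, and the shift ranges are restricted slightly (VSRI also allows a shift of 32) to keep the C well defined.

```c
#include <stdint.h>

/* Scalar model of VSRI.32 on one lane, for 1 <= n <= 31:
 * src is shifted right by n and written into dst, while the
 * top n bits of dst are preserved. */
uint32_t vsri32(uint32_t dst, uint32_t src, unsigned n)
{
    uint32_t inserted = UINT32_MAX >> n;   /* bit positions that get replaced */
    return (dst & ~inserted) | (src >> n);
}

/* Scalar model of VSLI.32 on one lane, for 0 <= n <= 31:
 * src is shifted left by n and written into dst, while the
 * low n bits of dst are preserved. */
uint32_t vsli32(uint32_t dst, uint32_t src, unsigned n)
{
    uint32_t inserted = UINT32_MAX << n;   /* bit positions that get replaced */
    return (dst & ~inserted) | (src << n);
}
```

One instruction therefore does the work of a shift, a mask and an OR, which is what makes it useful for packing bit fields.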

NEON optimization example
• Example routine: gather sign bits of a large array of float values

function gather_signbits(flt_array):
    let output_bitmap = bitmap of size len(flt_array)
    foreach elem in flt_array at index idx:
        if (elem < 0)
            set_bit(output_bitmap, idx)
        else
            clear_bit(output_bitmap, idx)
    return output_bitmap
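
A direct scalar C rendering of this pseudocode can serve as the baseline that any optimized version is checked against. The signature and the bitmap layout (bit `idx % 8` of byte `idx / 8`) are assumptions made for this sketch; the slide does not specify them.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference. Assumed layout: element idx maps to bit (idx % 8)
 * of output_bitmap[idx / 8]; the caller provides (len + 7) / 8 bytes. */
void gather_signbits(const float *flt_array, size_t len, uint8_t *output_bitmap)
{
    for (size_t idx = 0; idx < len; ++idx) {
        uint8_t bit = (uint8_t)(1u << (idx % 8));
        if (flt_array[idx] < 0.0f)
            output_bitmap[idx / 8] |= bit;            /* set_bit */
        else
            output_bitmap[idx / 8] &= (uint8_t)~bit;  /* clear_bit */
    }
}
```

One branch and one read-modify-write per element is exactly the pattern the SIMD versions try to remove.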

NEON optimization example: first attempt
• Sufficient unrolling: handle 16 elements in one iteration
• Compare 4 values per instruction
• Bitwise AND for correct bit offsets
• Collapse with vertical OR, then pairwise add

20: add.w r2, r0, #32
24: vld1.64 {d28-d29}, [r0 :128]
28: vld1.64 {d24-d25}, [r2 :128]
2c: add.w r2, r0, #16
30: vclt.f32 q14, q14, #0
34: vld1.64 {d26-d27}, [r2 :128]
38: add.w r2, r0, #48 ; 0x30
3c: vclt.f32 q12, q12, #0
40: vand q14, q8, q14
44: vld1.64 {d30-d31}, [r2 :128]
48: vclt.f32 q13, q13, #0
4c: vand q13, q11, q13
50: vclt.f32 q15, q15, #0
54: vand q12, q10, q12
58: vand q15, q9, q15
5c: vorr q13, q14, q13
60: vorr q12, q12, q15
64: vorr q12, q13, q12
68: vpadd.i32 d24, d24, d25
6c: vpadd.i32 d24, d24, d24
70: vst1.32 {d24[0]}, [r0 :32], r1
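
The data flow of this listing can be followed in a scalar model. The listing does not show the contents of the constant registers q8 to q11, so the sketch assumes they hold one distinct bit per element (bit `idx` for element `idx`); the function name is likewise made up.

```c
#include <stdint.h>

/* Scalar model of one 16-element iteration of the first attempt.
 * Assumption: the AND constants give element idx the bit (1 << idx). */
uint16_t gather_signbits16_compare(const float *x)
{
    uint32_t lanes[4] = {0, 0, 0, 0};

    for (unsigned reg = 0; reg < 4; ++reg) {          /* four q registers */
        for (unsigned lane = 0; lane < 4; ++lane) {   /* four floats each */
            unsigned idx = 4 * reg + lane;
            /* vclt.f32 #0: all ones when negative, else zero */
            uint32_t mask = (x[idx] < 0.0f) ? UINT32_MAX : 0u;
            /* vand with the per-element constant, then vorr across registers */
            lanes[lane] |= mask & (UINT32_C(1) << idx);
        }
    }
    /* Two vpadd.i32 steps: the bits are disjoint, so adding equals OR-ing. */
    return (uint16_t)(lanes[0] + lanes[1] + lanes[2] + lanes[3]);
}
```

Four compares, four ANDs, three ORs and two pairwise adds per 16 elements, plus four 128-bit constants held in registers, is the cost the next version reduces.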

NEON optimization example: shift-and-insert, narrow early
• Compare with zero = extract the sign bit with a shift
• Can shift and combine simultaneously with the VSRI instruction
• Narrow to 16 bits (VMOVN) before proceeding further
  • Half the amount of constants

18: vld1.64 {d18-d19}, [r0 :128]
1c: add.w r3, r0, #16
20: adds r1, #4
22: vshr.u32 q9, q9, #19
26: vld1.64 {d20-d21}, [r3 :128]
2a: add.w r3, r0, #32
2e: vsri.32 q9, q10, #23
32: vld1.64 {d20-d21}, [r3 :128]
36: add.w r3, r0, #48 ; 0x30
3a: vsri.32 q9, q10, #27
3e: vld1.64 {d20-d21}, [r3 :128]
42: vsri.32 q9, q10, #31
46: vmovn.i32 d18, q9
4a: vand d18, d18, d16
4e: vshl.u16 d18, d18, d17
52: vpaddl.u16 d18, d18
56: vpadd.i32 d18, d18, d18
5a: vst1.32 {d18[0]}, [r0 :32], r2
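
The same kind of scalar model helps to see where each sign bit ends up in this version. The constants in d16 and d17 are not shown in the listing; the sketch assumes a mask of 0x1111 per 16-bit lane and per-lane left shifts of 0, 1, 2 and 3. All names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* VSRI.32 on one lane (1 <= n <= 31): keep the top n bits of dst,
 * fill the rest with src >> n. */
static uint32_t vsri32(uint32_t dst, uint32_t src, unsigned n)
{
    uint32_t inserted = UINT32_MAX >> n;
    return (dst & ~inserted) | (src >> n);
}

/* Scalar model of one 16-element iteration of the VSRI version.
 * Assumed constants: d16 = 0x1111 per lane, d17 = shifts {0,1,2,3}. */
uint16_t gather_signbits16_vsri(const float *x)
{
    uint32_t bits[16];
    uint32_t total = 0;

    memcpy(bits, x, sizeof bits);               /* reinterpret floats as raw bits */
    for (unsigned lane = 0; lane < 4; ++lane) {
        uint32_t acc = bits[lane] >> 19;        /* vshr.u32 #19: sign to bit 12 */
        acc = vsri32(acc, bits[4 + lane], 23);  /* sign to bit 8 */
        acc = vsri32(acc, bits[8 + lane], 27);  /* sign to bit 4 */
        acc = vsri32(acc, bits[12 + lane], 31); /* sign to bit 0 */
        uint16_t narrow = (uint16_t)acc;        /* vmovn.i32 */
        narrow &= 0x1111u;                      /* vand: drop exponent leftovers */
        narrow = (uint16_t)(narrow << lane);    /* vshl.u16 by lane index */
        total += narrow;                        /* vpaddl.u16 + vpadd.i32 */
    }
    return (uint16_t)total;
}
```

Under these assumptions element `4*g + j` lands in bit `4*(3 - g) + j`, so the output order differs from a plain "bit idx for element idx" layout. Taking the raw sign bit also treats -0.0 as negative, unlike the `elem < 0` comparison in the pseudocode, which matters only if negative zeros can occur in the input.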

Thank you.
For more on Umbra 3, go to:
umbra3.com
antti@umbrasoftware.com
Follow us on Twitter @umbrasoftware
