CODE VECTORIZATION for mobile devices by Dmitriy Vovk
Hardware
• Typical hardware found in modern mobile devices:
  – ARMv7 instruction set
  – Cortex-A8, Cortex-A9, custom cores (Krait, Swift)
  – 800-1500 MHz
  – 1-4 cores
  – Thumb-2 instruction set
  – VFPv3
  – NEON, optional for Cortex-A9. Nvidia Tegra 2 has no NEON support
NEON
• NEON is a general-purpose SIMD engine designed by ARM for the ARM processor architecture
• 16 registers, each 128 bits wide. Supports operations on 8-, 16-, 32- and 64-bit integers and 32-bit floating-point values
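A 128-bit register holds four 32-bit floats, so one NEON instruction can process four values at once. A minimal sketch of that idea, assuming the array length is a multiple of 4; the function name add_floats is hypothetical, and the scalar branch is only there so the code also builds on targets without NEON:

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Adds two float arrays element-wise.
   Assumption of this sketch: count is a multiple of 4, so no scalar tail loop is needed. */
void add_floats(const float *a, const float *b, float *out, size_t count)
{
    assert(count % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < count; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats into one 128-bit register */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));  /* 4 additions with one instruction */
    }
#else
    /* Scalar fallback for non-NEON targets */
    for (size_t i = 0; i < count; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

The same pattern applies to the integer types, only with more lanes per register (for example, sixteen 8-bit values).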
NEON
• NEON can be used for:
  – Software geometry instancing;
  – Skinning on OpenGL ES 1.1;
  – General vertex processing;
  – Other typical SIMD applications.
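As an illustration of using NEON as a vertex processor, here is a rough sketch that transforms an array of 4-component vertices by a 4x4 matrix. It assumes a column-major matrix (the OpenGL ES convention) and tightly packed x, y, z, w floats; the name transform_vertices is hypothetical, and the scalar branch exists only so the code builds without NEON:

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* out = M * in for each vertex.
   Assumptions: m is column-major, each vertex is 4 packed floats (x, y, z, w). */
void transform_vertices(const float m[16], const float *in, float *out,
                        size_t vertex_count)
{
    assert(m != NULL && in != NULL && out != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Keep the four matrix columns in registers for the whole loop */
    float32x4_t c0 = vld1q_f32(m + 0);
    float32x4_t c1 = vld1q_f32(m + 4);
    float32x4_t c2 = vld1q_f32(m + 8);
    float32x4_t c3 = vld1q_f32(m + 12);
    for (size_t v = 0; v < vertex_count; ++v) {
        const float *p = in + 4 * v;
        float32x4_t r = vmulq_n_f32(c0, p[0]);  /* column0 * x */
        r = vmlaq_n_f32(r, c1, p[1]);           /* + column1 * y */
        r = vmlaq_n_f32(r, c2, p[2]);           /* + column2 * z */
        r = vmlaq_n_f32(r, c3, p[3]);           /* + column3 * w */
        vst1q_f32(out + 4 * v, r);
    }
#else
    /* Scalar fallback for non-NEON targets */
    for (size_t v = 0; v < vertex_count; ++v) {
        const float *p = in + 4 * v;
        for (int row = 0; row < 4; ++row)
            out[4 * v + row] = m[0 + row] * p[0] + m[4 + row] * p[1]
                             + m[8 + row] * p[2] + m[12 + row] * p[3];
    }
#endif
}
```

Software instancing and skinning follow the same shape: transform on the CPU, then submit the already-transformed vertices to the GPU.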
NEON
• Some unified shader architectures, such as the popular Imagination Technologies USSE1 (PowerVR SGX 530-545), are scalar, while NEON is vector by nature. Move your vertex processing from the GPU to the CPU to speed up calculations*
• ???????
• PROFIT!!!111
• *NOTE: this does not apply to USSE2 hardware
NEON
• The weakest side of mobile GPUs is fill rate. Fill rate is quickly killed by blending, and 2D games are heavy on blending. The PowerVR USSE engine does not care whether it is doing vertex or fragment processing. Moving your vertex processing to the CPU (NEON) leaves more room for fragment processing.
NEON
• There are 3 ways to use NEON vectorization in your code:
  1. Intrinsics
  2. Handwritten NEON assembly
  3. Autovectorization by the compiler:
     – LLVM flags: -mllvm -vectorize -mllvm -bb-vectorize-aligned-only
     – GCC flags: -ftree-vectorizer-verbose=4 -mfpu=neon -funsafe-math-optimizations -ftree-vectorize
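For the autovectorization route, the code stays plain C and the compiler decides whether to emit NEON instructions. A minimal sketch of a loop shaped to give the vectorizer a fair chance; the function name scale_and_bias is hypothetical, and whether the loop actually gets vectorized depends on the compiler and the flags used, so check the vectorizer report rather than assume it:

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = src[i] * scale + bias
   restrict promises the compiler that dst and src never overlap,
   which removes one common obstacle to autovectorization. */
void scale_and_bias(float *restrict dst, const float *restrict src,
                    float scale, float bias, size_t count)
{
    assert(dst != NULL && src != NULL);
    /* Simple counted loop, no branches, no function calls in the body */
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i] * scale + bias;
}
```

With GCC, building this file with the flags listed on this slide prints a report saying whether the loop was vectorized.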
Measurements
• Summary:
                        Running time, ms    CPU usage, %
  Intrinsics            2764                19
  Assembly              3664                20
  FPU                   6209                25-28
  FPU autovectorized    5028                22-24
• Intrinsics took about 25% less time than handwritten assembly (2764 ms vs. 3664 ms).
• Note that the speed of intrinsics code varies from compiler to compiler.
NEON
• Intrinsics advantages over assembly:
  – Higher-level code;
  – No need to manage registers;
  – You can vectorize basic blocks and build the solution to each new problem from these blocks. With assembly, you have to solve each new problem from scratch.
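The building-block point can be sketched as follows: a few small wrappers are written once, and a new problem (here, morphing between two vertex sets) is assembled from them. All names (vec4, vec4_madd, vec4_lerp, morph_vertices) are hypothetical, vertices are assumed to be 4 packed floats, and the scalar branch only keeps the sketch buildable without NEON:

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* Basic blocks, NEON version */
typedef float32x4_t vec4;
static inline vec4 vec4_load(const float *p)            { return vld1q_f32(p); }
static inline void vec4_store(float *p, vec4 v)         { vst1q_f32(p, v); }
static inline vec4 vec4_sub(vec4 a, vec4 b)             { return vsubq_f32(a, b); }
static inline vec4 vec4_madd(vec4 acc, vec4 v, float s) { return vmlaq_n_f32(acc, v, s); }
#else
/* Basic blocks, scalar fallback with the same interface */
typedef struct { float f[4]; } vec4;
static inline vec4 vec4_load(const float *p)
{ vec4 v; for (int i = 0; i < 4; ++i) v.f[i] = p[i]; return v; }
static inline void vec4_store(float *p, vec4 v)
{ for (int i = 0; i < 4; ++i) p[i] = v.f[i]; }
static inline vec4 vec4_sub(vec4 a, vec4 b)
{ vec4 r; for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] - b.f[i]; return r; }
static inline vec4 vec4_madd(vec4 acc, vec4 v, float s)
{ vec4 r; for (int i = 0; i < 4; ++i) r.f[i] = acc.f[i] + v.f[i] * s; return r; }
#endif

/* A new block composed from existing ones: a + (b - a) * t */
static inline vec4 vec4_lerp(vec4 a, vec4 b, float t)
{
    return vec4_madd(a, vec4_sub(b, a), t);
}

/* A new problem solved with the blocks: blend two vertex arrays */
void morph_vertices(const float *a, const float *b, float t,
                    float *out, size_t vertex_count)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t v = 0; v < vertex_count; ++v)
        vec4_store(out + 4 * v,
                   vec4_lerp(vec4_load(a + 4 * v), vec4_load(b + 4 * v), t));
}
```

Register allocation and instruction scheduling for every composed function are left to the compiler, which is exactly what handwritten assembly cannot offer.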
NEON
• Assembly advantages over intrinsics:
  – Code generated from intrinsics varies from compiler to compiler, and the difference in speed can be very large. Assembly code is always the same.