Dmitriy Vovk: Code vectorization for mobile platforms
CODE VECTORIZATION for mobile devices by Dmitriy Vovk
Hardware • Typical hardware found in modern mobile devices: – ARMv7 architecture – Cortex-A8, Cortex-A9, or custom cores (Krait, Swift) – 800 to 1500 MHz – 1 to 4 cores – Thumb-2 instruction set – VFPv3 – NEON, optional for Cortex-A9; Nvidia Tegra 2 has no NEON support
NEON • NEON is a general-purpose SIMD engine designed by ARM for the ARM processor architecture • 16 registers, each 128 bits wide. Supports operations on 8-, 16-, 32- and 64-bit integers and 32-bit float values
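To make the register width concrete: one 128-bit register holds four 32-bit floats, so one instruction performs four additions. The following is a minimal sketch, not code from the talk. The function name add_f32 is hypothetical, and the scalar fallback is an assumption added so that the same source also builds on targets without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0
#endif

/* Element-wise add: out[i] = a[i] + b[i] (hypothetical helper). */
void add_f32(float *out, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if USE_NEON
    /* A 128-bit Q register holds four 32-bit floats. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats from b */
        vst1q_f32(out + i, vaddq_f32(va, vb));  /* 4 additions, then store */
    }
#endif
    /* Scalar tail; on targets without NEON this handles the whole array. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The tail loop handles lengths that are not a multiple of four, so callers do not need to pad their arrays.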
NEON • NEON can be used for: – Software geometry instancing; – Skinning on ES 1.1; – As a general vertex processor; – Other typical applications for SIMD.
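As an illustration of the "general vertex processor" use case, here is a minimal sketch of transforming vertices by a 4x4 matrix with NEON intrinsics. It is an assumption about how such code might look, not the author's implementation: the name transform_vertices, the column-major matrix layout and the (x, y, z, w) vertex layout are all chosen for the example, and the scalar branch exists only so the code builds without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0
#endif

/* Transforms `count` vertices (x, y, z, w) by a column-major 4x4 matrix.
 * Hypothetical helper; layout choices are assumptions of this sketch. */
void transform_vertices(float *out, const float *in,
                        const float m[16], size_t count)
{
#if USE_NEON
    /* Keep the four matrix columns in registers for the whole loop. */
    float32x4_t c0 = vld1q_f32(m + 0);
    float32x4_t c1 = vld1q_f32(m + 4);
    float32x4_t c2 = vld1q_f32(m + 8);
    float32x4_t c3 = vld1q_f32(m + 12);
#endif
    for (size_t i = 0; i < count; ++i) {
        /* Read the vertex first so `out` may alias `in`. */
        float x = in[4 * i + 0], y = in[4 * i + 1];
        float z = in[4 * i + 2], w = in[4 * i + 3];
#if USE_NEON
        /* result = c0*x + c1*y + c2*z + c3*w, four lanes at a time. */
        float32x4_t r = vmulq_n_f32(c0, x);
        r = vmlaq_n_f32(r, c1, y);
        r = vmlaq_n_f32(r, c2, z);
        r = vmlaq_n_f32(r, c3, w);
        vst1q_f32(out + 4 * i, r);
#else
        for (int row = 0; row < 4; ++row)
            out[4 * i + row] = m[row] * x + m[4 + row] * y
                             + m[8 + row] * z + m[12 + row] * w;
#endif
    }
}
```

Treating the matrix as four column vectors avoids horizontal adds: each vertex costs one multiply and three multiply-accumulates.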
NEON • Some unified shader architectures, like the popular Imagination Technologies USSE1 (PowerVR SGX 530-545), are scalar, while NEON is vector by nature. Move your vertex processing from the GPU to the CPU to speed up calculations* • ??????? • PROFIT!!!111 • *NOTE: this doesn't apply to USSE2 hardware.
NEON • The weakest side of mobile GPUs is fill rate. Fill rate is quickly killed by blending, and 2D games are heavy on this. The PowerVR USSE engine doesn't care what it processes, vertices or fragments, so moving your vertex processing to the CPU (NEON) leaves more room for fragment processing.
NEON • There are 3 ways to use NEON vectorization in your code: 1. Intrinsics 2. Handwritten NEON assembly 3. Autovectorization by the compiler: -mllvm -vectorize -mllvm -bb-vectorize-aligned-only compiler flags for LLVM; -ftree-vectorizer-verbose=4 -mfpu=neon -funsafe-math-optimizations -ftree-vectorize for GCC
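For the third option, the compiler needs a loop it can prove safe to vectorize. The sketch below shows the kind of plain C loop an autovectorizer can typically handle. The function name saxpy is chosen for the example and is not from the talk. Whether it is actually vectorized depends on the compiler version and flags. With GCC on ARMv7, floating-point loops are only vectorized to NEON when -funsafe-math-optimizations is given, because NEON there is not fully IEEE 754 compliant (denormals are flushed to zero).

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
 * `restrict` promises the compiler that x and y do not overlap, which
 * removes the aliasing concern that often blocks autovectorization. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    /* Simple counted loop, unit stride, no branches or calls in the body. */
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

With the GCC flags listed above, -ftree-vectorizer-verbose=4 prints a report that says whether this loop was vectorized and, if not, why.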
Measurements • Summary:
                        Running time, ms    CPU usage, %
  Intrinsics            2764                19
  Assembly              3664                20
  FPU                   6209                25-28
  FPU autovectorized    5028                22-24
• Intrinsics gave me a 25% speedup over assembly. • Note that the speed of intrinsics code varies from compiler to compiler.
NEON • Intrinsics advantages over assembly: – Higher-level code; – No need to manage registers; – You can vectorize basic blocks and build a solution to every new problem from these blocks. With assembly, by contrast, you have to solve each new problem from scratch.
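The "basic blocks" point can be sketched as a handful of small inline helpers that are then combined to solve different problems. Everything named here (vec4, vec4_load, vec4_store, vec4_scale, vec4_madd, blend2, scale_and_offset) is hypothetical and chosen for illustration. The non-NEON branch is an assumption added so the example builds anywhere.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
typedef float32x4_t vec4;
static inline vec4 vec4_load(const float *p)   { return vld1q_f32(p); }
static inline void vec4_store(float *p, vec4 a) { vst1q_f32(p, a); }
static inline vec4 vec4_scale(vec4 a, float s)  { return vmulq_n_f32(a, s); }
/* acc + a * s, lane by lane */
static inline vec4 vec4_madd(vec4 acc, vec4 a, float s)
{
    return vmlaq_n_f32(acc, a, s);
}
#else
/* Scalar stand-ins with the same interface. */
typedef struct { float f[4]; } vec4;
static inline vec4 vec4_load(const float *p)
{
    vec4 r;
    for (int i = 0; i < 4; ++i) r.f[i] = p[i];
    return r;
}
static inline void vec4_store(float *p, vec4 a)
{
    for (int i = 0; i < 4; ++i) p[i] = a.f[i];
}
static inline vec4 vec4_scale(vec4 a, float s)
{
    for (int i = 0; i < 4; ++i) a.f[i] *= s;
    return a;
}
static inline vec4 vec4_madd(vec4 acc, vec4 a, float s)
{
    for (int i = 0; i < 4; ++i) acc.f[i] += a.f[i] * s;
    return acc;
}
#endif

/* Problem 1: weighted blend of two positions, as in two-bone skinning.
 * out = p0 * w0 + p1 * w1 */
void blend2(float *out, const float *p0, float w0,
            const float *p1, float w1)
{
    vec4 r = vec4_scale(vec4_load(p0), w0);
    r = vec4_madd(r, vec4_load(p1), w1);
    vec4_store(out, r);
}

/* Problem 2: place an instance of a template vertex.
 * out = offset + v * scale, applied to all four components. */
void scale_and_offset(float *out, const float *v,
                      const float *offset, float scale)
{
    vec4 r = vec4_madd(vec4_load(offset), vec4_load(v), scale);
    vec4_store(out, r);
}
```

Both functions reuse the same four helpers and never name a register, which is the trade-off this slide describes: the compiler allocates registers, and each new routine is a recombination of blocks that already exist.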
NEON • Assembly advantages over intrinsics: – Code generated from intrinsics varies from compiler to compiler, which can make a really big difference in speed. Assembly code will always be the same.