OptimizingARM

  1. 1. Optimizing for ARM architectures Jan-Lieuwe Koopmans Engine Software
  2. 2. ARM platforms • Game Boy Advance: ARM7TDMI @ 16.8 MHz (ARMv4) • Nintendo DS: ARM7TDMI @ 33 MHz (ARMv4), ARM946E-S @ 67 MHz (ARMv5) • Nintendo DSi: ARM7TDMI @ 33 MHz (ARMv4), ARM946E-S @ 133 MHz (ARMv5) • Nintendo 3DS: ARM11 MPCore @ 267 MHz (ARMv6K)
  3. 3. ARM platforms • PlayStation Vita: ARM Cortex-A9 MPCore (ARMv7) • Apple iPhone/iPod/iPad: ARM1176JZ(F)-S (ARMv6), ARM Cortex-A8 [Apple A4] (ARMv7), ARM Cortex-A9 [Apple A5] (ARMv7) • Android
  4. 4. Key Features • Multiple instruction sets • ARM (powerful, 4 bytes/instruction) • Thumb (simple, 2 bytes/instruction) • Jazelle (Java™ bytecode execution) • Variable cycle execution • Load/store multiple • Conditional execution • Reduces branching • Barrel shifter • Complex instructions
  5. 5. Key Features • DSP extensions (ARMv5TE, ARMv6, ARMv7) • Single cycle 16x16 and 32x16 MAC • Saturated math • Count Leading Zeroes • Load/store register pairs • SIMD extensions (ARMv6, ARMv7) • Simultaneous computation of 2x16-bit or 4x8-bit operands • Fractional arithmetic • User definable saturation modes (arbitrary word-width) • Dual 16x16 multiply-add/subtract 32x32 fractional MAC • Simultaneous 8/16-bit select operations
  6. 6. --asm Output assembly code as well as object code
  7. 7. --asm Output assembly code as well as object code -S Output assembly code instead of object code
  8. 8. --asm Output assembly code as well as object code -S Output assembly code instead of object code --interleave Interleave source with disassembly (use with --asm or -S) ;;;22 // calculate a point on a quadratic Bezier curve ;;;23 Vector2f math::bezier(const Vector2f& a, const Vector2f& b, const Vector2f& c, const f32 t) 000000 ed9f1a16 VLDR s2,|L5.96| ;;;24 { ;;;25 const f32 tInv = 1 - t; ;;;26 const f32 tInvSq = tInv * tInv; ;;;27 const f32 tSq = t * t; ;;;28 const f32 t2tInv = (t * 2) * tInv; 000004 eddf0a16 VLDR s1,|L5.100| 000008 edd22a00 VLDR s5,[r2,#0] 00000c ee311a40 VSUB.F32 s2,s2,s0 ;25 000010 ee601a20 VMUL.F32 s3,s0,s1 000014 ee200a00 VMUL.F32 s0,s0,s0 ;27 000018 ee610a01 VMUL.F32 s1,s2,s2 ;26 00001c ee211a81 VMUL.F32 s2,s3,s2 . . . 000054 ed801a00 VSTR s2,[r0,#0] 000058 ed800a01 VSTR s0,[r0,#4] ;;;29 ;;;30 return tInvSq * a + t2tInv * b + tSq * c; ;;;31 } 00005c e12fff1e BX lr
  9. 9. Address Opcode Mnemonic Operands 00000000 E0804001 ADD R4,R0,R1 1. Branch instructions 2. Register Load and Store instructions 3. Data processing instructions 4. Coprocessor instructions 5. Status register access instructions
  11. 11. Address Opcode Mnemonic Operands 00000000 E12FFF1E BX LR Branching instructions B Branch BX Branch with exchange (Thumb/ARM) BL Branch with link BLX Branch with link & exchange
  12. 12. Address Opcode Mnemonic Operands 00000000 E1D000F0 LDRSH R0,[R0,#0] Register Load and Store instructions: LDR Load register from memory, STR Store register to memory, LDM Load multiple registers (32-bit aligned!), STM Store multiple registers (32-bit aligned!) Size suffixes: B Byte (8-bit), SB Signed byte (8-bit), H Half word (16-bit), SH Signed half word (16-bit), D Double word (64-bit)
  13. 13. Address Opcode Mnemonic Operands 00000000 E1B030C6 ASRS R3,R6,#1 Data processing instructions MOV Move to register LSL Logical shift left LSR Logical shift right ASR Arithmetic shift right
  14. 14. Address Opcode Mnemonic Operands 00000000 E0854C2C ADD R4,R5,R12,LSR #24 Data processing instructions (arithmetic) ADD Addition ADC Addition with carry SUB Subtraction SBC Subtraction with carry RSB Reverse subtraction RSC Reverse subtraction with carry MUL Multiply MLA Multiply and accumulate
  15. 15. Address Opcode Mnemonic Operands 00000000 E2000003 AND R0,R0,#3 Data processing instructions (logical) AND Logical AND EOR Logical exclusive OR ORR Logical OR MVN Logical NOT BIC Bit clear (combined logical AND NOT)
  16. 16. Address Opcode Mnemonic Operands 00000000 E3560000 CMP R6,#0 Data processing instructions (tests) CMP Compare CMN Compare negative TST Test bits (logical AND) TEQ Test bits (logical EOR)
  17. 17. Status Register • Bits: 31 = N, 30 = Z, 29 = C, 28 = V, 27 = Q, 26..8 = other/reserved, 7 = I, 6 = F, 5 = T, 4..0 = Mode • Condition codes: EQ Equal, NE Not equal, CS Carry, CC No carry, MI Negative, PL Positive or zero, VS Overflow, VC No overflow, HI Higher, LS Lower or same, GE Greater or equal, LT Less than, GT Greater than, LE Less than or equal, AL Always, NV Never • Flags: N = negative, Z = zero, C = carry, V = overflow, Q = saturated • S suffix: data instruction updates CPSR (Current Program Status Register)
  18. 18. Status Register • Bits: 31 = N, 30 = Z, 29 = C, 28 = V, 27 = Q, 26..8 = other/reserved, 7 = I, 6 = F, 5 = T, 4..0 = Mode • Condition codes: EQ Equal, NE Not equal, CS Carry, CC No carry, MI Negative, PL Positive or zero, VS Overflow, VC No overflow, HI Higher, LS Lower or same, GE Greater or equal, LT Less than, GT Greater than, LE Less than or equal, AL Always, NV Never • Flags: N = negative, Z = zero, C = carry, V = overflow, Q = saturated • Examples: ANDS num, num, #1 ADDNE odd, odd, #1 ADDEQ even, even, #1 CMP age, #18 BGE |IsAdult|
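For readers more at home in C++, the two conditional fragments on this slide correspond roughly to the source below. This is only a sketch: the names `Counts`, `classify` and `isAdult` are invented here for illustration and do not come from the slides.

```cpp
// Hypothetical names; only the logic mirrors the slide's assembly.
struct Counts {
    int odd = 0;
    int even = 0;
};

// ANDS num, num, #1 sets the Z flag from (num & 1); ADDNE/ADDEQ then
// perform exactly one of the two increments without any branch.
inline void classify(int num, Counts& counts) {
    if (num & 1) {
        counts.odd += 1;   // ADDNE odd, odd, #1
    } else {
        counts.even += 1;  // ADDEQ even, even, #1
    }
}

// CMP age, #18 followed by BGE |IsAdult| is a signed >= comparison.
inline bool isAdult(int age) {
    return age >= 18;
}
```

Whether a compiler actually emits conditional instructions for such an if/else depends on the compiler and its settings, so the generated assembly is still worth checking.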
  19. 19. for (int i = 0; i < n; ++i) { // ... } int i = 0; while (i < n) { // ... ++i; }
  20. 20. MOV i, #0 ; i = 0 CMP i, n ; i < n? BGE |Done| ; no, done |Loop| ; ... ADD i, i, #1 ; ++i CMP i, n ; i < n? BLT |Loop| ; yes, loop |Done|
  21. 21. MOV i, #0 ; i = 0 CMP i, n ; i < n? BGE |Done| ; no, done |Loop| ; ... ADD i, i, #1 ; ++i CMP i, n ; i < n? BLT |Loop| ; yes, loop |Done| Initial test required, in case n <= 0
  22. 22. int i = 0; do { // ... } while(++i < n); MOV i, #0 ; i = 0 |Loop| ; ... ADD i, i, #1 ; ++i CMP i, n ; i < n? BLT |Loop| ; yes, loop
  23. 23. [Tip!] Use do {} while • Use do-while loops when the initial test isn’t required • Tip: replace initial test with an ‘assert(n > 0)’
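A minimal sketch of this tip, assuming a hypothetical helper `sumFirst` (not from the slides) whose callers guarantee at least one element. The `assert` documents the precondition that the dropped initial test used to cover.

```cpp
#include <cassert>

// Hypothetical helper: sums the first n elements of data.
// Precondition: n > 0, so the loop needs no initial i < n test.
inline int sumFirst(const int* data, int n) {
    assert(n > 0);  // stands in for the test a for-loop would do up front
    int total = 0;
    int i = 0;
    do {
        total += data[i];
    } while (++i < n);
    return total;
}
```

In release builds the `assert` usually compiles to nothing, so the precondition really must hold at every call site.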
  24. 24. [Tip!] Count down loops • Count down in loops where possible int i = n - 1; do { // ... } while(--i >= 0);
  25. 25. Counting down: SUB i, n, #1 ; i = n - 1 |Loop| ; ... SUBS i, i, #1 ; --i >= 0? BPL |Loop| ; yes, loop Counting up: MOV i, #0 ; i = 0 |Loop| ; ... ADD i, i, #1 ; ++i CMP i, n ; i < n? BLT |Loop| ; yes, loop
  26. 26. for (int i = n - 1; i >= 0; --i) { // ... } int i = n - 1; while (i >= 0) { // ... --i; }
  27. 27. SUBS i, n, #1 ; i = n - 1 BMI |Done| |Loop| ; ... SUBS i, i, #1 ; --i >= 0? BPL |Loop| ; yes, loop |Done|
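The two count-down shapes from the last few slides can be put side by side in C++. This is a sketch with invented helper names (`sumCountingDown`, `sumCountingDownChecked`); the first assumes the caller guarantees n > 0, the second keeps the initial test.

```cpp
#include <cassert>

// Hypothetical helper: sums n elements, walking from the last index to 0.
// Precondition: n > 0.
inline int sumCountingDown(const int* data, int n) {
    assert(n > 0);
    int total = 0;
    int i = n - 1;
    do {
        total += data[i];
    } while (--i >= 0);  // the decrement itself sets the flags (SUBS + BPL)
    return total;
}

// Same loop with the initial test kept, for callers that may pass n <= 0.
inline int sumCountingDownChecked(const int* data, int n) {
    int total = 0;
    for (int i = n - 1; i >= 0; --i) {
        total += data[i];
    }
    return total;
}
```

Note that counting down visits the elements in reverse order, which only matters when the loop body depends on iteration order.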
  28. 28. [Tip!] Improve Loop Unrolling Intrinsic Description __promise Allows the compiler to optimize loop unrolling (also improves NEON vectorization) // Promise the compiler that the loop // iteration count is divisible by 16 __promise((n % 16) == 0); for (int i = 0; i < n; i++) { // ... }
  29. 29. Pointer Aliasing • A compiler must assume two pointers could point to the same location. void Object::update(const State& state) { mAge += state.deltaTime; mDelay -= state.deltaTime; }
  30. 30. Pointer Aliasing • A compiler must assume two pointers could point to the same location. void Object::update(const State& state) { this->mAge += state.deltaTime; this->mDelay -= state.deltaTime; }
  31. 31. Pointer Aliasing • A compiler must assume two pointers could point to the same location. LDR r2,[r0,#0] ; load this->mAge LDR r3,[r1,#0] ; load state.deltaTime ; interlock ADD r2,r2,r3 ; mAge += state.deltaTime STR r2,[r0,#0] ; store updated mAge LDR r1,[r1,#0] ; reload state.deltaTime LDR r2,[r0,#4] ; load this->mDelay ; interlock SUB r1,r2,r1 ; mDelay -= state.deltaTime STR r1,[r0,#4] ; store updated mDelay BX lr ; return
  32. 32. Pointer Aliasing • Do not dereference multiple times; cache the value in a local. void Object::update(const State& state) { const int dt = state.deltaTime; mAge += dt; mDelay -= dt; }
  33. 33. Pointer Aliasing • Do not dereference multiple times; cache the value in a local. • Or use __restrict to promise the compiler that a certain pointer does not alias other pointers. void Object::update(const State& state) __restrict // restrict the this pointer { mAge += state.deltaTime; mDelay -= state.deltaTime; }
  34. 34. Pointer Aliasing • Do not dereference multiple times; cache the value in a local. • Or use __restrict to promise the compiler that a pointer does not alias other pointers. • This improves code generation tremendously! LDR r12,[r1,#0] ; load state.deltaTime LDM r0,{r2, r3} ; load mAge, mDelay ADD r2,r2,r12 ; mAge += state.deltaTime SUB r3,r3,r12 ; mDelay -= state.deltaTime STM r0,{r2, r3} ; store mAge, mDelay BX lr ; return
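To see why the compiler has to reload `state.deltaTime`, it helps to construct the aliasing case deliberately. The sketch below uses a hypothetical free-function version of `Object::update` with raw pointers (`Timers`, `updateReload` and `updateCached` are invented names), so that the delta can be made to point into the object being updated.

```cpp
// Hypothetical stand-in for the slide's Object, reduced to two fields.
struct Timers {
    int age;
    int delay;
};

// Reads *deltaTime twice: if deltaTime points into timers, the second
// read sees the value written by the first statement.
inline void updateReload(Timers* timers, const int* deltaTime) {
    timers->age += *deltaTime;
    timers->delay -= *deltaTime;
}

// Reads *deltaTime once into a local, so no reload is needed.
inline void updateCached(Timers* timers, const int* deltaTime) {
    const int dt = *deltaTime;
    timers->age += dt;
    timers->delay -= dt;
}
```

When the pointers do not alias, both versions give the same result. When `deltaTime` points at `timers->age`, they differ, which is exactly the behaviour the compiler must preserve unless the value is cached or the pointer is marked `__restrict`.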
  35. 35. Registers R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 (SP) R14 (LR) R15 (PC) • Sixteen 32-bit general purpose registers • Not many for a load/store architecture • PowerPC and MIPS have 32 • AMD 29000 has 192 (!)
  36. 36. Registers R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 (SP) R14 (LR) R15 (PC) • Sixteen 32-bit general purpose registers • Not many for a load/store architecture • PowerPC and MIPS have 32 • AMD 29000 has 192 (!) • Arguments: R0..R3
  37. 37. Registers R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 (SP) R14 (LR) R15 (PC) • Sixteen 32-bit general purpose registers • Not many for a load/store architecture • PowerPC and MIPS have 32 • AMD 29000 has 192 (!) • Arguments: R0..R3 • Return address: R14 (LR) • Current PC • Current CPU mode (ARM/Thumb)
  38. 38. Registers R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 (SP) R14 (LR) R15 (PC) • Sixteen 32-bit general purpose registers • Not many for a load/store architecture • PowerPC and MIPS have 32 • AMD 29000 has 192 (!) • Arguments: R0..R3 • Return address: R14 (LR) • Return value: R0, R1
  39. 39. [Tip!] Function arguments bool Object::hit(int type, int damage, Object* pSource) { // R0 = this // R1 = type // R2 = damage // R3 = pSource ... // R0 = true return true; } • Do not pass more than four 32-bit (integer) arguments • Non-static class member functions: 3 arguments (the this pointer counts as an argument)
  40. 40. [Tip!] Function arguments s64 dontDoThis(s32 a, s64 b, s32 c) { // R0 = a // R1 = unused (padding) // R2, R3 = b // [SP+0] = c return a + b + c; // R0, R1 = result } • 64-bit arguments require two registers • Must use R0, R1 or R2, R3
  41. 41. [Tip!] Function arguments s64 Object::rememberThis(s64 b, s32 a) { // R0 = this // R1 = unused (padding) // R2, R3 = b // [SP+0] = a return a + b + this->c; // R0, R1 = result } • 64-bit arguments require two registers • Must use R0, R1 or R2, R3 • Member functions: this pointer alert!
  42. 42. [Tip!] Function arguments s64 Object::rememberThis(s32 a, s64 b) { // R0 = this // R1 = a // R2, R3 = b return a + b + this->c; // R0, R1 = result } • 64-bit arguments require two registers • Must use R0, R1 or R2, R3 • Member functions: this pointer alert!
  43. 43. Registers R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 (SP) R14 (LR) R15 (PC) • Sixteen 32-bit general purpose registers • Not many for a load/store architecture • PowerPC and MIPS have 32 • AMD 29000 has 192 (!) • Arguments: R0..R3 • Return address: R14 (LR) • Return value: R0, R1 • 32-bit!
  44. 44. [Tip!] Use 32 bits! • Use 32 bits (or multiples thereof) for: • Arguments • Locals • Return values • When using smaller types, the compiler has to take care of: • Wrap-around • Sign-extension
  45. 45. [Tip!] Use 32 bits! short addRange(short a, short b, short* pData) { short result = 0; do { result += pData[a++]; } while (a <= b); return result; }
  46. 46. [Tip!] Use 32 bits! MOV r3,#0 |Loop| ADD r12,r2,r0,LSL #1 LDRH r12,[r12,#0] ADD r0,r0,#1 LSL r0,r0,#16 ; wrap-around and... ADD r3,r3,r12 ASR r0,r0,#16 ; sign-extend LSL r3,r3,#16 CMP r0,r1 ASR r3,r3,#16 MOVGT r0,r3 BLE |Loop| BX lr
  47. 47. [Tip!] Use 32 bits! MOV r3,#0 |Loop| ADD r12,r2,r0,LSL #1 ADD r0,r0,#1 LDRH r12,[r12,#0] SXTH r0,r0 ; sign-extend halfword CMP r0,r1 ADD r3,r3,r12 SXTH r3,r3 MOVGT r0,r3 BLE |Loop| BX lr ARMv6
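Applying the tip to `addRange` might look like the sketch below. The name `addRange32` is invented here, and the change is not behaviour-preserving in every case: the running total is no longer truncated to 16 bits on each iteration, so results differ from the `short` version once the sum leaves the `short` range.

```cpp
// Same loop as addRange, but with int for the arguments, the local and
// the return value. Only the array keeps its 16-bit element type, so the
// compiler has no wrap-around or sign-extension to emulate for a or result.
inline int addRange32(int a, int b, const short* pData) {
    int result = 0;
    do {
        result += pData[a++];
    } while (a <= b);
    return result;
}
```

As with the original, the do-while form assumes the caller passes a range with at least one element (a <= b).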
  48. 48. Division • ARM has no hardware integer division/modulo! • Avoid non-constant divisors int thousandDividedBy(int d) { return 1000 / d; } MOV r1,r0 MOV r0,#1000 B __aeabi_idivmod VFP alternative? int thousandDividedBy(int d) { return int(1000 / (float)d); } VMOV s0,r0 VLDR s1,|Thousand| VCVT.F32.S32 s0,s0 VDIV.F32 s2,s1,s0 VCVT.S32.F32 s0,s2 VMOV r0,s0 BX lr |Thousand| DCFS 0x447a0000 ; 1000
  49. 49. Division • ARM has no hardware integer division/modulo! • Avoid non-constant divisors • Compiler can optimize constant divisors int dividedByThousand(int d) { return d / 1000; } LDR r1,|DivisionMagic| SMULL r1,r0,r1,r0 ASR r1,r0,#6 SUB r0,r1,r0,ASR #31 BX lr |DivisionMagic| DCD 0x10624dd3 int moduloThree(int d) { return d % 3; } LDR r1,|ModuloMagic| SMULL r2,r1,r1,r0 SUB r1,r1,r1,ASR #31 SUB r1,r1,r1,LSL #2 ADD r0,r0,r1 BX lr |ModuloMagic| DCD 0x55555556
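The multiply-by-magic-constant sequences above can be replayed in portable C++ to see what each instruction contributes. This is a sketch for illustration only; `mulHigh`, `divideByThousand` and `moduloThree` are names invented here, and the code assumes that `>>` on a negative signed value is an arithmetic shift (guaranteed from C++20, and what mainstream compilers do before that).

```cpp
#include <cstdint>

// High 32 bits of the signed 64-bit product, i.e. what SMULL leaves in RdHi.
inline std::int32_t mulHigh(std::int32_t a, std::int32_t b) {
    return static_cast<std::int32_t>((static_cast<std::int64_t>(a) * b) >> 32);
}

// d / 1000 via the slide's magic constant 0x10624dd3 (about 2^38 / 1000).
inline std::int32_t divideByThousand(std::int32_t d) {
    const std::int32_t hi = mulHigh(0x10624dd3, d);
    // ASR #6 finishes the division by 2^38; subtracting (hi >> 31),
    // which is -1 for negative inputs, rounds them toward zero.
    return (hi >> 6) - (hi >> 31);
}

// d % 3 via the slide's magic constant 0x55555556 (about 2^32 / 3).
inline std::int32_t moduloThree(std::int32_t d) {
    const std::int32_t hi = mulHigh(0x55555556, d);
    const std::int32_t q = hi - (hi >> 31);  // q == d / 3
    return d - 3 * q;                        // remainder = d - 3 * quotient
}
```

In normal code there is no reason to write this by hand: with a constant divisor the compiler generates the sequence itself, which is the point of the slide.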
  50. 50. Division • ARM has no hardware integer division/modulo! • Avoid non-constant divisors • Compiler can optimize constant divisors • Especially power of two divisors int dividedByPower2(int d) { return d / 512; } ASR r1,r0,#31 ADD r0,r0,r1,LSR #23 ASR r0,r0,#9 BX lr int moduloPower2(int d) { return d % 4; } ASR r1,r0,#31 ADD r1,r0,r1,LSR #30 BIC r1,r1,#3 SUB r0,r0,r1 BX lr
  51. 51. [Tip!] Signed vs Unsigned int dividedByPower2(int d) { return d / 512; } ASR r1,r0,#31 ADD r0,r0,r1,LSR #23 ASR r0,r0,#9 BX lr int moduloPower2(int d) { return d % 4; } ASR r1,r0,#31 ADD r1,r0,r1,LSR #30 BIC r1,r1,#3 SUB r0,r0,r1 BX lr • Signed division and modulus are more complicated • Exception: -1 >> 1 == -1 (whereas -1 / 2 == 0)
  52. 52. [Tip!] Signed vs Unsigned u32 dividedByPower2(u32 d) { return d / 512U; } LSR r0,r0,#9 BX lr u32 moduloPower2(u32 d) { return d % 4U; } AND r0, r0, #3 BX lr • Signed division and modulus are more complicated • Exception: -1 >> 1 == -1 (whereas -1 / 2 == 0) • Use unsigned types where applicable!
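The extra instructions in the signed version exist to add a rounding bias for negative inputs. The sketch below replays them in C++ next to the unsigned one-liners; the function names are invented here, and the code assumes `>>` on a negative signed value is an arithmetic shift.

```cpp
#include <cstdint>

// Signed d / 512, written the way the slide's three instructions do it.
inline std::int32_t divideBy512Signed(std::int32_t d) {
    // ASR #31: 0 for non-negative d, 0xFFFFFFFF for negative d.
    const std::uint32_t sign = static_cast<std::uint32_t>(d >> 31);
    // LSR #23: 0 for non-negative d, 511 for negative d.
    const std::int32_t bias = static_cast<std::int32_t>(sign >> 23);
    // ASR #9 on the biased value rounds toward zero, as C++ division does.
    return (d + bias) >> 9;
}

// Unsigned d / 512 and d % 4 need no correction: one instruction each.
inline std::uint32_t divideBy512Unsigned(std::uint32_t d) { return d >> 9; }
inline std::uint32_t moduloFourUnsigned(std::uint32_t d) { return d & 3u; }
```

Without the bias, a negative input such as -1 would shift to -1 instead of dividing to 0.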
  55. 55. Interworking • It is possible to switch between the ARM and Thumb instruction sets at run-time. • Bit 0 (the least significant bit) of the branch target address determines the instruction set. • The compiler allows us to switch between instruction sets with #pragma directives. • Only possible to use these in translation units! • Doesn’t work for inline functions. • Doesn’t work for non-specialized template functions.
  56. 56. Switching to Thumb // code16.h // // --- Thumb mode #if defined(__MWERKS__) // Codewarrior # pragma thumb on #elif defined(__ARMCC_VERSION) // ARMCC/RVCT # pragma thumb #else # error "Unknown compiler!" #endif
  57. 57. Switching to ARM // code32.h // // --- ARM mode #if defined(__MWERKS__) // Codewarrior # pragma thumb off #elif defined(__ARMCC_VERSION) // ARMCC/RVCT # pragma arm #else # error "Unknown compiler!" #endif
  58. 58. Switching to default // codereset.h // // --- default mode #if defined(EFFORT_SMALL) #include <code16.h> #else #include <code32.h> #endif
  59. 59. #include <code16.h> Object::Object() { // ... } Object::~Object() { // ... } #include <codereset.h> #include <code32.h> void Object::update(int ticks) { // ... } #include <codereset.h>
  60. 60. ARM vs Thumb Instruction size: ARM 32-bit (4 bytes); Thumb 16-bit (2 bytes). Note: some Thumb branch instructions take 4 bytes.
  61. 61. ARM vs Thumb Instruction size: ARM 32-bit (4 bytes); Thumb 16-bit (2 bytes). int add(int x, int y) { int result = x + y; printf("%d + %d = %d\n", x, y, result); return result; } ARM (size in bytes, then instruction): 4 PUSH {r4,lr} 4 ADD r4,r0,r1 4 MOV r2,r1 4 MOV r1,r0 4 MOV r3,r4 4 ADR r0,|String| 4 BL printf 4 MOV r0,r4 4 POP {r4,pc} Total: 36 |String| DCB "%d + %d = %d\n",0 Thumb (size in bytes, then instruction): 2 PUSH {r4,lr} 2 ADDS r4,r0,r1 2 MOV r2,r1 2 MOV r1,r0 2 MOV r3,r4 2 ADR r0,|String| 4 BL printf 2 MOV r0,r4 2 POP {r4,pc} Total: 20 |String| DCB "%d + %d = %d\n",0 ARM: +16 bytes!
  62. 62. ARM vs Thumb Conditional execution: ARM nearly all instructions; Thumb branch instructions only. bool isLetter(int c) { return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')); } Thumb (11 instructions): MOV r1,r0 SUBS r1,r1,#'A' CMP r1,#'Z' - 'A' BLS |True| SUBS r0,r0,#'a' CMP r0,#'z' - 'a' BHI |False| |True| MOVS r0,#1 BX lr |False| MOVS r0,#0 BX lr ARM (7 instructions including the return): SUB r1,r0,#'A' CMP r1,#'Z' - 'A' SUBHI r0,r0,#'a' CMPHI r0,#'z' - 'a' MOVLS r0,#1 MOVHI r0,#0 BX lr
  63. 63. ARM vs Thumb Conditional execution: ARM nearly all instructions; Thumb branch instructions only. bool isLetter(int c) { return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')); } Thumb: MOV r1,r0 SUBS r1,r1,#'A' CMP r1,#'Z' - 'A' BLS |True| SUBS r0,r0,#'a' CMP r0,#'z' - 'a' BHI |False| |True| MOVS r0,#1 BX lr |False| MOVS r0,#0 BX lr Note: the compiler must ensure type bool is true (1) or false (0). Avoid [implicit] casting to bool when it’s not required, as it’s expensive!
  64. 64. [Tip!] Boolean type Note: the compiler must ensure type bool is true (1) or false (0). Avoid [implicit] casting to bool when it’s not required, as it’s expensive! Returning bool: bool isBitSet(int flags, int bit) { return flags & (1 << bit); } MOVS r2,#1 LSLS r2,r2,r1 TST r2,r0 BEQ |False| MOVS r0,#1 BX lr |False| MOVS r0,#0 BX lr Returning int: int isBitSet(int flags, int bit) { return flags & (1 << bit); } MOV r2,r0 MOVS r0,#1 LSLS r0,r0,r1 ANDS r0,r0,r2 BX lr
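The difference between the two return types is visible from the caller's side as well. In this sketch the two variants are given distinct, invented names (`isBitSetBool`, `isBitSetInt`) so they can coexist in one file.

```cpp
// Returning bool: the masked value must be normalised to exactly 0 or 1,
// which is where the extra compare-and-branch comes from.
inline bool isBitSetBool(int flags, int bit) {
    return (flags & (1 << bit)) != 0;
}

// Returning int: the masked value is handed back unchanged; any nonzero
// result counts as "set" when used in a condition.
inline int isBitSetInt(int flags, int bit) {
    return flags & (1 << bit);
}
```

Both work identically inside `if (...)`, but only the `bool` version is safe to compare against `true` or to store in a one-bit field.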
  65. 65. ARM vs Thumb-2 • Thumb-2 introduced the IT (if-then) instruction • Up to four instructions can be made conditional ARM: SUBS r1,r0,#'A' CMP r1,#'Z' - 'A' SUBHI r0,r0,#'a' CMPHI r0,#'z' - 'a' MOVLS r0,#1 MOVHI r0,#0 BX lr Thumb-2: MOV r1,r0 SUBS r1,r1,#'A' CMP r1,#'Z' - 'A' ITT HI SUBHI r0,r0,#'a' CMPHI r0,#'z' - 'a' ITE LS MOVLS r0,#1 MOVHI r0,#0 BX lr Thumb: MOV r1,r0 SUBS r1,r1,#'A' CMP r1,#'Z' - 'A' BLS |True| SUBS r0,r0,#'a' CMP r0,#'z' - 'a' BHI |False| |True| MOVS r0,#1 BX lr |False| MOVS r0,#0 BX lr
  66. 66. ARM vs Thumb Barrel shifter & ALU: ARM accessible by data instructions; Thumb requires separate instructions. unsigned int reverseBytes(unsigned int x) { return (x << 24) | ((x << 8) & 0x00FF0000) | ((x >> 8) & 0x0000FF00) | ((x >> 24)); } Thumb (13 instructions): MOVS r3,#0xFF LSLS r2,r0,#8 LSLS r3,r3,#16 ANDS r2,r2,r3 LSLS r1,r0,#24 ORRS r1,r1,r2 LSRS r2,r0,#8 ASRS r3,r3,#8 ANDS r2,r2,r3 ORRS r1,r1,r2 LSRS r0,r0,#24 ORRS r0,r0,r1 BX lr ARM (8 instructions): MOV r1,#0xFF,LSL #16 AND r1,r1,r0,LSL #8 MOV r2,#0xFF,LSL #8 ORR r1,r1,r0,LSL #24 AND r2,r2,r0,LSR #8 ORR r1,r1,r2 ORR r0,r1,r0,LSR #24 BX lr
  67. 67. ARM vs Thumb Barrel shifter & ALU: ARM accessible by data instructions; Thumb requires separate instructions. unsigned int reverseBytes(unsigned int x) { return (x << 24) | ((x << 8) & 0x00FF0000) | ((x >> 8) & 0x0000FF00) | ((x >> 24)); } ARMv6: REV r0,r0 BX lr
  68. 68. ARM vs Thumb (feature: ARM / Thumb) • Coprocessor interface: Yes / No • Long Multiply: Yes (ARMv4) / No • Count Leading Zeroes: Yes (ARMv5) / No • Saturated math: Yes (ARMv5) / No • DSP instructions: Yes (ARMv5) / No • SIMD instructions: Yes (ARMv6) / No
  69. 69. Summary: when to use Thumb • Use Thumb for functions which… • do not benefit from the ARM instruction set • are not performance critical (e.g. initialization code) #include <code16.h> void Level::load(const std::string& path) { . . . } #include <codereset.h>
  70. 70. Summary: when to use ARM • Use ARM for functions which… • do benefit from the ARM instruction set • are performance critical (e.g. called from inner loops) #include <code32.h> bool Ray::intersects(const Sphere& s) { . . . } #include <codereset.h>
  71. 71. Intrinsic functions • Allows use of specialized CPU instructions in C/C++ • Compiler can recognize patterns and might utilize such specialized instructions: unsigned int reverseBytes(unsigned int x) { return (x << 24) | ((x << 8) & 0x00FF0000) | ((x >> 8) & 0x0000FF00) | ((x >> 24)); } REV r0,r0 BX lr
  72. 72. Intrinsic functions • Allows use of specialized CPU instructions in C/C++ • Compiler can recognize patterns and might utilize such specialized instructions. • More often the compiler does not. Check compiler output! • Intrinsic functions are compiler specific; read the manual!
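Because intrinsics are compiler specific, one common approach is to hide them behind a small wrapper with a portable fallback. The sketch below is an assumption about how that could look, not something taken from the slides: it uses the GCC/Clang builtin `__builtin_bswap32` where available and otherwise falls back to the shift-and-mask pattern; other compilers would need their own branch.

```cpp
#include <cstdint>

// Portable fallback: the shift-and-mask pattern from the slides.
inline std::uint32_t reverseBytesPortable(std::uint32_t x) {
    return (x << 24) | ((x << 8) & 0x00FF0000u) |
           ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

// Hypothetical wrapper: use a compiler builtin where one is known to
// exist, otherwise rely on the portable pattern.
inline std::uint32_t reverseBytes(std::uint32_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(x);  // GCC/Clang builtin
#else
    return reverseBytesPortable(x);
#endif
}
```

Either way, the advice on the slide still applies: inspect the generated assembly to confirm that a single byte-reverse instruction is actually emitted for the target.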
  73. 73. Useful intrinsics (intrinsic: description) • __breakpoint: stops execution, informs the debugger • __disable_irq: sets the CPSR IRQ mask, returns previous state • __enable_irq: clears the CPSR IRQ mask • __ldrex: atomic reads • __strex: atomic writes
  74. 74. Useful intrinsics (cache) (intrinsic: description) • __pld: preload data • __pldw: preload data for writing • __pli: preload instructions
  75. 75. Useful intrinsics (algorithms) (intrinsic: description) • __usat/__ssat: unsigned/signed saturate (any power of 2) • __clz: count leading zeroes • __rbit: reverse bit order • __rev: reverse byte order
  76. 76. Useful intrinsics (SIMD) (intrinsic: description) • __usad[a]8|16: sum of absolute differences (4x8, 2x16) • __[u][q]add8|16: [saturated] addition (4x8, 2x16) • __[u][q]sub8|16: [saturated] subtraction (4x8, 2x16) • etc. Check: http://infocenter.arm.com/help
  77. 77. Intrinsic functions struct RGBA { union { struct { u8 r, g, b, a; }; u32 c; }; RGBA& operator += (RGBA o) { r += o.r; g += o.g; b += o.b; a += o.a; return *this; } }; struct RGBA { union { struct { u8 r, g, b, a; }; u32 c; }; RGBA& operator += (RGBA o) { c = __uadd8(c, o.c); return *this; } };
  78. 78. Intrinsic functions PUSH {r4} AND r2,r1,r0,ASR #24 ADD r1,r1,r0 AND r1,r1,#0xff ORR r1,r1,r2 LSL r3,r0,#16 LSL r4,r0,#8 LSR r2,r0,#24 LSL r0,r1,#16 LSR r12,r3,#24 ADD r0,r12,r0,LSR #24 BIC r1,r1,#0xff00 LSL r0,r0,#8 AND r0,r0,#0xff00 ORR r0,r0,r1 BIC r1,r0,#0xff0000 LSL r12,r0,#8 LSR r0,r12,#24 ADD r0,r0,r4,LSR #24 POP {r4} LSL r0,r0,#16 AND r0,r0,#0xff0000 ORR r0,r0,r1 BIC r3,r0,#0xff000000 ADD r0,r2,r0,LSR #24 ORR r0,r3,r0,LSL #24 struct RGBA { union { struct { u8 r, g, b, a; }; u32 c; }; RGBA& operator += (RGBA o) { r += o.r; g += o.g; b += o.b; a += o.a; return *this; } };
  79. 79. Intrinsic functions UADD8 r0,r0,r1 struct RGBA { union { struct { u8 r, g, b, a; }; u32 c; }; RGBA& operator += (RGBA o) { c = __uadd8(c, o.c); return *this; } }; Note: I could have demonstrated __uqadd8, which saturates the results to the 8-bit unsigned integer range 0 ≤ x ≤ 2^8 - 1.
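For testing on a machine without these instructions, the per-lane behaviour of UADD8 and UQADD8 can be modelled in plain C++. The functions below are reference models written for this purpose with invented names; they are not the compiler intrinsics, and they ignore the GE flags that UADD8 also sets.

```cpp
#include <cstdint>

// Reference model of UADD8: four independent 8-bit additions, each
// wrapping modulo 256.
inline std::uint32_t uadd8Reference(std::uint32_t a, std::uint32_t b) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const int shift = lane * 8;
        const std::uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        result |= (sum & 0xFFu) << shift;
    }
    return result;
}

// Reference model of UQADD8: same lanes, but each sum saturates at 255.
inline std::uint32_t uqadd8Reference(std::uint32_t a, std::uint32_t b) {
    std::uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const int shift = lane * 8;
        const std::uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        result |= (sum > 0xFFu ? 0xFFu : sum) << shift;
    }
    return result;
}
```

For colour arithmetic the saturating form is usually what is wanted, since a wrapped channel turns a bright pixel dark.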
  80. 80. Feel free to ask…
  81. 81. jan-lieuwe@engine-software.nl
