Optimizing for ARM architectures
Jan-Lieuwe Koopmans
Engine Software
ARM platforms
 GameBoy Advance
   ARM7TDMI @ 16.8 MHz (ARMv4)
 Nintendo DS
   ARM7TDMI @ 33 MHz (ARMv4)
   ARM946E-S @ 67 MHz (ARMv5)
 Nintendo DSi
   ARM7TDMI @ 33 MHz (ARMv4)
   ARM946E-S @ 133 MHz (ARMv5)
 Nintendo 3DS
   ARM11 MPCore @ 267 MHz (ARMv6k)
 PlayStation Vita
   ARM Cortex-A9 MPCore (ARMv7)
 Apple iPhone/iPod/iPad
   ARM1176JZ(F)-S (ARMv6)
   ARM Cortex-A8 [Apple A4] (ARMv7)
   ARM Cortex-A9 [Apple A5] (ARMv7)
 Android
Key Features
 Multiple instruction sets
   ARM (powerful, 4 bytes/instruction)
   Thumb (simple, 2 bytes/instruction)
   Jazelle (Java™ bytecode execution)
 Variable cycle execution
 Load/store multiple
 Conditional execution
   Reduces branching
 Barrel shifter
   Complex instructions
 DSP extensions (ARMv5TE, ARMv6, ARMv7)
   Single cycle 16x16 and 32x16 MAC
   Saturated math
   Count Leading Zeroes
   Load/store register pairs
 SIMD extensions (ARMv6, ARMv7)
   Simultaneous computation of 2x16-bit or 4x8-bit operands
   Fractional arithmetic
   User definable saturation modes (arbitrary word-width)
   Dual 16x16 multiply-add/subtract
   32x32 fractional MAC
   Simultaneous 8/16-bit select operations
--asm Output assembly code as well as object code
-S Output assembly code instead of object code
--interleave Interleave source with disassembly (use with --asm or -S)
;;;22 // calculate a point on a quadratic Bezier curve
;;;23 Vector2f math::bezier(const Vector2f& a, const Vector2f& b, const Vector2f& c, const f32 t)
000000 ed9f1a16 VLDR s2,|L5.96|
;;;24 {
;;;25 const f32 tInv = 1 - t;
;;;26 const f32 tInvSq = tInv * tInv;
;;;27 const f32 tSq = t * t;
;;;28 const f32 t2tInv = (t * 2) * tInv;
000004 eddf0a16 VLDR s1,|L5.100|
000008 edd22a00 VLDR s5,[r2,#0]
00000c ee311a40 VSUB.F32 s2,s2,s0 ;25
000010 ee601a20 VMUL.F32 s3,s0,s1
000014 ee200a00 VMUL.F32 s0,s0,s0 ;27
000018 ee610a01 VMUL.F32 s1,s2,s2 ;26
00001c ee211a81 VMUL.F32 s2,s3,s2
.
.
.
000054 ed801a00 VSTR s2,[r0,#0]
000058 ed800a01 VSTR s0,[r0,#4]
;;;29
;;;30 return tInvSq * a + t2tInv * b + tSq * c;
;;;31 }
00005c e12fff1e BX lr
Address Opcode Mnemonic Operands
00000000 E0804001 ADD R4,R0,R1
1. Branch instructions
2. Register Load and Store instructions
3. Data processing instructions
4. Coprocessor instructions
5. Status register access instructions
Address Opcode Mnemonic Operands
00000000 E12FFF1E BX LR
Branching instructions
B Branch
BX Branch with exchange (Thumb/ARM)
BL Branch with link
BLX Branch with link & exchange
Address Opcode Mnemonic Operands
00000000 E1D000F0 LDRSH R0,[R0,#0]
Register Load and Store instructions
LDR Load register from memory
STR Store register to memory
LDM Load multiple registers (32-bit aligned!)
STM Store multiple registers (32-bit aligned!)
Register Load and Store instructions (size suffixes)
B Byte (8-bit)
SB Signed byte (8-bit)
H Half word (16-bit)
SH Signed half word (16-bit)
D Double word (64-bit)
Address Opcode Mnemonic Operands
00000000 E1B030C6 ASRS R3,R6,#1
Data processing instructions
MOV Move to register
LSL Logical shift left
LSR Logical shift right
ASR Arithmetic shift right
Address Opcode Mnemonic Operands
00000000 E0854C2C ADD R4,R5,R12,LSR #24
Data processing instructions (arithmetic)
ADD Addition
ADC Addition with carry
SUB Subtraction
SBC Subtraction with carry
RSB Reverse subtraction
RSC Reverse subtraction with carry
MUL Multiply
MLA Multiply and accumulate
Address Opcode Mnemonic Operands
00000000 E2000003 AND R0,R0,#3
Data processing instructions (logical)
AND Logical AND
EOR Logical exclusive OR
ORR Logical OR
MVN Logical NOT
BIC Bit clear (combined logical AND NOT)
Address Opcode Mnemonic Operands
00000000 E3560000 CMP R6,#0
Data processing instructions (tests)
CMP Compare
CMN Compare negative
TST Test bits (logical AND)
TEQ Test bits (logical EOR)
Status Register
Bit:   31 30 29 28 27 26..8 7 6 5 4..0
Field: N  Z  C  V  Q  -     I F T Mode
N = negative
Z = zero
C = carry
V = overflow
Q = saturated
 S suffix: data instruction updates CPSR (Current Program Status Register)
Condition codes
EQ Equal
NE Not equal
CS Carry set
CC Carry clear
MI Negative
PL Positive or zero
VS Overflow
VC No overflow
HI Higher (unsigned)
LS Lower or same (unsigned)
GE Greater or equal (signed)
LT Less than (signed)
GT Greater than (signed)
LE Less than or equal (signed)
AL Always
NV Never
ANDS num, num, #1
ADDNE odd, odd, #1
ADDEQ even, even, #1
CMP age, #18
BGE |IsAdult|
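The first fragment above (ANDS followed by ADDNE/ADDEQ) corresponds roughly to the C++ below. This is a minimal sketch: the ParityCount struct and countParity function are hypothetical names standing in for the 'odd' and 'even' registers, and whether a compiler actually emits conditional instructions for it depends on the target and optimization level.

```cpp
#include <cassert>

// Hypothetical counters mirroring the 'odd' and 'even' registers above.
struct ParityCount
{
    int odd = 0;
    int even = 0;
};

// Exactly one counter is incremented per call, without a branch in the source.
inline void countParity(ParityCount& counts, unsigned num)
{
    const int lowBit = static_cast<int>(num & 1u);  // ANDS num, num, #1
    counts.odd  += lowBit;                          // ADDNE odd, odd, #1
    counts.even += 1 - lowBit;                      // ADDEQ even, even, #1
}
```

Check the compiler output (--asm or -S) to see whether the conditional forms are really generated.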
for (int i = 0; i < n; ++i)
{
// ...
}
int i = 0;
while (i < n)
{
// ...
++i;
}
MOV i, #0 ; i = 0
CMP i, n ; i < n?
BGE |Done| ; no, done
|Loop|
; ...
ADD i, i, #1 ; ++i
CMP i, n ; i < n?
BLT |Loop| ; yes, loop
|Done|
Initial test required, in case n <= 0
int i = 0;
do
{
// ...
} while(++i < n);
MOV i, #0 ; i = 0
|Loop|
; ...
ADD i, i, #1 ; ++i
CMP i, n ; i < n?
BLT |Loop| ; yes, loop
[Tip!] Use do {} while
 Use do-while loops when the initial test isn’t required
 Tip: replace initial test with an ‘assert(n > 0)’
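A minimal sketch of this tip, assuming the caller can guarantee a positive count. The function name sumFirst is made up for illustration; the assert documents the precondition that replaces the initial loop test.

```cpp
#include <cassert>

// Sums the first n elements of data. Precondition: n > 0.
inline int sumFirst(const int* data, int n)
{
    assert(n > 0);  // stands in for the initial 'i < n' test
    int total = 0;
    int i = 0;
    do
    {
        total += data[i];
    } while (++i < n);
    return total;
}
```

In release builds (NDEBUG) the assert disappears, so the loop body runs once even for n <= 0; only use this form when that case cannot occur.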
[Tip!] Count down loops
 Count down in loops where possible
int i = n - 1;
do
{
// ...
} while(--i >= 0);
SUB i, n, #1 ; i = n - 1
|Loop|
; ...
SUBS i, i, #1 ; --i >= 0?
BPL |Loop| ; yes, loop
MOV i, #0 ; i = 0
|Loop|
; ...
ADD i, i, #1 ; ++i
CMP i, n ; i < n?
BLT |Loop| ; yes, loop
for (int i = n - 1; i >= 0; --i)
{
// ...
}
int i = n - 1;
while (i >= 0)
{
// ...
--i;
}
SUBS i, n, #1 ; i = n - 1
BMI |Done|
|Loop|
; ...
SUBS i, i, #1 ; --i >= 0?
BPL |Loop| ; yes, loop
|Done|
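As an illustration of a loop that counts down naturally, here is a small backwards search. It is only a sketch: lastIndexOf is a hypothetical helper, and the for loop keeps the initial test so that n <= 0 is handled.

```cpp
#include <cassert>

// Returns the index of the last element equal to value, or -1 if none.
// Counting down lets the loop test reuse the flags set by the decrement.
inline int lastIndexOf(const int* data, int n, int value)
{
    for (int i = n - 1; i >= 0; --i)
    {
        if (data[i] == value)
        {
            return i;
        }
    }
    return -1;
}
```

Counting down visits elements in reverse order, so only convert loops whose result does not depend on the iteration order.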
[Tip!] Improve Loop Unrolling
Intrinsic Description
__promise Allows the compiler to optimize loop unrolling
(also improves NEON vectorization)
// Promise the compiler that the loop
// iteration count is divisible by 16
__promise((n % 16) == 0);
for (int i = 0; i < n; i++)
{
// ...
}
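Because __promise is specific to armcc, shared code usually hides it behind a macro. The sketch below is one possible wrapper: the LOOP_PROMISE macro and scaleAll function are hypothetical names, and the version check assumes that only ARM Compiler 5 and earlier (armcc) provide the intrinsic. On other compilers it falls back to assert, so the promise is at least verified in debug builds.

```cpp
#include <cassert>

// Assumption: __ARMCC_VERSION below 6000000 identifies armcc, which has __promise.
#if defined(__ARMCC_VERSION) && (__ARMCC_VERSION < 6000000)
#  define LOOP_PROMISE(expr) __promise(expr)
#else
#  define LOOP_PROMISE(expr) assert(expr)
#endif

// Scales n floats in place. Precondition: n is a multiple of 16.
inline void scaleAll(float* values, int n, float factor)
{
    LOOP_PROMISE((n % 16) == 0);
    for (int i = 0; i < n; i++)
    {
        values[i] *= factor;
    }
}
```

Breaking the promise is a bug in the caller, so keep the condition easy to audit at each call site.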
Pointer Aliasing
 A compiler must assume two pointers could point to
the same location.
void Object::update(const State& state)
{
mAge += state.deltaTime;
mDelay -= state.deltaTime;
}
void Object::update(const State& state)
{
this->mAge += state.deltaTime;
this->mDelay -= state.deltaTime;
}
LDR r2,[r0,#0] ; load this->mAge
LDR r3,[r1,#0] ; load state.deltaTime
; interlock
ADD r2,r2,r3 ; mAge += state.deltaTime
STR r2,[r0,#0] ; store updated mAge
LDR r1,[r1,#0] ; reload state.deltaTime
LDR r2,[r0,#4] ; load this->mDelay
; interlock
SUB r1,r2,r1 ; mDelay -= state.deltaTime
STR r1,[r0,#4] ; store updated mDelay
BX lr ; return
Pointer Aliasing
 Do not dereference multiple times; cache the value in a
local.
void Object::update(const State& state)
{
const int dt = state.deltaTime;
mAge += dt;
mDelay -= dt;
}
 Or use __restrict to promise the compiler a certain
pointer does not alias other pointers.
void Object::update(const State& state)
__restrict // restrict the this pointer
{
mAge += state.deltaTime;
mDelay -= state.deltaTime;
}
 This improves code generation tremendously!
LDR r12,[r1,#0] ; load state.deltaTime
LDM r0,{r2, r3} ; load mAge, mDelay
ADD r2,r2,r12 ; mAge += state.deltaTime
SUB r3,r3,r12 ; mDelay -= state.deltaTime
STM r0,{r2, r3} ; store mAge, mDelay
BX lr ; return
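To see why the compiler cannot simply skip the reload, consider what happens when the reference really does alias a member. The sketch below uses a hypothetical Timer struct (not the Object/State classes from the slides) with the delta passed as a const int reference.

```cpp
#include <cassert>

struct Timer
{
    int mAge = 0;
    int mDelay = 0;

    // Reads deltaTime twice. If deltaTime refers to mAge, the second read
    // sees the updated value, so the compiler has to reload it.
    void updateReload(const int& deltaTime)
    {
        mAge += deltaTime;
        mDelay -= deltaTime;
    }

    // Reads deltaTime once into a local; no reload is needed afterwards.
    void updateCached(const int& deltaTime)
    {
        const int dt = deltaTime;
        mAge += dt;
        mDelay -= dt;
    }
};
```

Calling updateReload(timer.mAge) with mAge == 5 leaves mDelay at -10, while updateCached(timer.mAge) leaves it at -5. The two functions only behave identically when nothing aliases, which is exactly the promise __restrict makes.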
Registers
R0
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R13 SP
R14 LR
R15 PC
 Sixteen 32-bit general purpose registers
 Not many for a load/store architecture
   PowerPC and MIPS have 32
   AMD 29000 has 192 (!)
 Arguments: R0..R3
 Return address: R14 (LR)
 Current PC
 Current CPU mode (ARM/Thumb)
 Return value: R0, R1
[Tip!] Function arguments
bool Object::hit(int type, int damage, Object* pSource)
{
// R0 = this
// R1 = type
// R2 = damage
// R3 = pSource
...
// R0 = true
return true;
}
 Do not pass more than four 32-bit (integer) arguments
 Non-static class member functions: 3 arguments
(this pointer counts as argument)
[Tip!] Function arguments
s64 dontDoThis(s32 a, s64 b, s32 c)
{
// R0 = a
// R1
// R2, R3 = b
// [SP+0] = c
return a + b + c;
// R0, R1 = result
}
 64-bit arguments require two registers
 Must use R0, R1 or R2, R3
[Tip!] Function arguments
s64 Object::rememberThis(s64 b, s32 a)
{
// R0 = this
// R1
// R2, R3 = b
// [SP+0] = a
return a + b + this->c;
// R0, R1 = result
}
 Member functions: this pointer alert!
[Tip!] Function arguments
s64 Object::rememberThis(s32 a, s64 b)
{
// R0 = this
// R1 = a
// R2, R3 = b
return a + b + this->c;
// R0, R1 = result
}
[Tip!] Use 32 bits!
 Use 32 bits (or multiples thereof) for:
   Arguments
   Locals
   Return values
 When using smaller types, the compiler has to take care of:
   Wrap-around
   Sign-extension
[Tip!] Use 32 bits!
short addRange(short a, short b, short* pData)
{
short result = 0;
do
{
result += pData[a++];
}
while (a <= b);
return result;
}
[Tip!] Use 32 bits!
MOV r3,#0
|Loop|
ADD r12,r2,r0,LSL #1
LDRH r12,[r12,#0]
ADD r0,r0,#1
LSL r0,r0,#16 ; wrap-around and...
ADD r3,r3,r12
ASR r0,r0,#16 ; sign-extend
LSL r3,r3,#16
CMP r0,r1
ASR r3,r3,#16
MOVGT r0,r3
BLE |Loop|
BX lr
[Tip!] Use 32 bits! (ARMv6)
MOV r3,#0
|Loop|
ADD r12,r2,r0,LSL #1
ADD r0,r0,#1
LDRH r12,[r12,#0]
SXTH r0,r0 ; sign-extend halfword
CMP r0,r1
ADD r3,r3,r12
SXTH r3,r3
MOVGT r0,r3
BLE |Loop|
BX lr
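For comparison, a sketch of the same function with 32-bit indices, accumulator and return value. The data stays 16-bit; only the arithmetic is widened, so the per-iteration wrap-around and sign-extension instructions shown above should no longer be needed.

```cpp
#include <cassert>

// Same loop as addRange above, but with int for arguments, locals and result.
inline int addRange(int a, int b, const short* pData)
{
    int result = 0;
    do
    {
        result += pData[a++];
    }
    while (a <= b);
    return result;
}
```

Note that the result no longer wraps at 16 bits, so the two versions differ when the sum exceeds the range of short.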
Division
 ARM has no hardware integer division/modulo!
 Avoid non-constant divisors
int thousandDividedBy(int d)
{
return 1000 / d;
}
MOV r1,r0
MOV r0,#1000
B __aeabi_idivmod
VFP Alternative?
int thousandDividedBy(int d)
{
return int(1000 / (float)d);
}
VMOV s0,r0
VLDR s1,|Thousand|
VCVT.F32.S32 s0,s0
VDIV.F32 s2,s1,s0
VCVT.S32.F32 s0,s2
VMOV r0,s0
BX lr
|Thousand|
DCFS 0x447a0000 ; 1000
 Compiler can optimize constant divisors
int dividedByThousand(int d)
{
return d / 1000;
}
LDR r1,|DivisionMagic|
SMULL r1,r0,r1,r0
ASR r1,r0,#6
SUB r0,r1,r0,ASR #31
BX lr
|DivisionMagic|
DCD 0x10624dd3
int moduloThree(int d)
{
return d % 3;
}
LDR r1,|ModuloMagic|
SMULL r2,r1,r1,r0
SUB r1,r1,r1,ASR #31
SUB r1,r1,r1,LSL #2
ADD r0,r0,r1
BX lr
|ModuloMagic|
DCD 0x55555556
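The two listings above can be mirrored line by line in C++ to see what the magic constants do. This is a sketch, assuming that right-shifting a negative signed value is an arithmetic shift (true for mainstream compilers, guaranteed only from C++20); the function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// d / 1000 via multiply-high: 0x10624dd3 is ceil(2^38 / 1000).
inline std::int32_t divideBy1000(std::int32_t d)
{
    const std::int64_t product = static_cast<std::int64_t>(d) * 0x10624dd3;  // SMULL
    const std::int32_t high = static_cast<std::int32_t>(product >> 32);      // upper word
    return (high >> 6) - (high >> 31);  // ASR #6, then +1 when the input is negative
}

// d % 3 via multiply-high: 0x55555556 is ceil(2^32 / 3).
inline std::int32_t moduloThree(std::int32_t d)
{
    const std::int64_t product = static_cast<std::int64_t>(d) * 0x55555556;  // SMULL
    const std::int32_t high = static_cast<std::int32_t>(product >> 32);
    const std::int32_t quotient = high - (high >> 31);  // round toward zero
    return d - 3 * quotient;                            // remainder = d - 3 * (d / 3)
}
```

There is normally no reason to write this by hand; the point is that a constant divisor costs a multiply and a few shifts instead of a library call.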
 Especially power of two divisors
int dividedByPower2(int d)
{
return d / 512;
}
ASR r1,r0,#31
ADD r0,r0,r1,LSR #23
ASR r0,r0,#9
BX lr
int moduloPower2(int d)
{
return d % 4;
}
ASR r1,r0,#31
ADD r1,r0,r1,LSR #30
BIC r1,r1,#3
SUB r0,r0,r1
BX lr
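The extra instructions in both listings add a bias to negative inputs so that the shift rounds toward zero, as C and C++ division requires. A sketch of the same steps in C++, again assuming arithmetic right shift for negative values (function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// d / 512 for signed d.
inline std::int32_t divideBy512(std::int32_t d)
{
    const std::uint32_t signMask = static_cast<std::uint32_t>(d >> 31);   // ASR #31: 0 or 0xFFFFFFFF
    const std::int32_t bias = static_cast<std::int32_t>(signMask >> 23);  // LSR #23: 0 or 511
    return (d + bias) >> 9;                                               // ASR #9
}

// d % 4 for signed d.
inline std::int32_t moduloFour(std::int32_t d)
{
    const std::uint32_t signMask = static_cast<std::uint32_t>(d >> 31);   // ASR #31
    const std::int32_t bias = static_cast<std::int32_t>(signMask >> 30);  // LSR #30: 0 or 3
    const std::int32_t rounded = (d + bias) & ~3;                         // BIC #3
    return d - rounded;                                                   // SUB
}
```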
[Tip!] Signed vs Unsigned
int dividedByPower2(int d)
{
return d / 512;
}
ASR r1,r0,#31
ADD r0,r0,r1,LSR #23
ASR r0,r0,#9
BX lr
int moduloPower2(int d)
{
return d % 4;
}
ASR r1,r0,#31
ADD r1,r0,r1,LSR #30
BIC r1,r1,#3
SUB r0,r0,r1
BX lr
 Signed division and modulus are more complicated
 Reason: -1 >> 1 == -1, but -1 / 2 == 0
[Tip!] Signed vs Unsigned
 Use unsigned types where applicable!
u32 dividedByPower2(u32 d)
{
return d / 512U;
}
LSR r0,r0,#9
BX lr
u32 moduloPower2(u32 d)
{
return d % 4U;
}
AND r0, r0, #3
BX lr
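A small check makes the difference concrete. The helpers below are hypothetical and only compare a plain shift with a division; for the signed case they assume the usual arithmetic right shift.

```cpp
#include <cassert>

// True when a plain shift gives the same answer as signed division by 512.
inline bool shiftMatchesSignedDivision(int d)
{
    return (d >> 9) == (d / 512);
}

// For unsigned values the shift and the division always agree.
inline bool shiftMatchesUnsignedDivision(unsigned d)
{
    return (d >> 9) == (d / 512u);
}

// Likewise, masking equals modulo for unsigned values.
inline bool maskMatchesUnsignedModulo(unsigned d)
{
    return (d & 3u) == (d % 4u);
}
```

shiftMatchesSignedDivision(-1) is false (the shift gives -1, the division gives 0), which is why the signed version needs the extra correction instructions.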
Interworking
 It is possible to switch between ARM & Thumb instruction sets at run-time.
 Bit 0 of the branch target address determines the instruction set.
 Compiler allows us to switch between instruction sets with #pragma directives.
   Only possible to use these in translation units!
   Doesn’t work for inline functions.
   Doesn’t work for non-specialized template functions.
Switching to Thumb
// code16.h
//
// --- Thumb mode
#if defined(__MWERKS__)
// Codewarrior
# pragma thumb on
#elif defined(__ARMCC_VERSION)
// ARMCC/RVCT
# pragma thumb
#else
# error "Unknown compiler!"
#endif
Switching to ARM
// code32.h
//
// --- ARM mode
#if defined(__MWERKS__)
// Codewarrior
# pragma thumb off
#elif defined(__ARMCC_VERSION)
// ARMCC/RVCT
# pragma arm
#else
# error "Unknown compiler!"
#endif
Switching to default
// codereset.h
//
// --- default mode
#if defined(EFFORT_SMALL)
#include <code16.h>
#else
#include <code32.h>
#endif
#include <code16.h>
Object::Object()
{
// ...
}
Object::~Object()
{
// ...
}
#include <codereset.h>
#include <code32.h>
void Object::update(int ticks)
{
// ...
}
#include <codereset.h>
ARM vs Thumb
                  ARM               Thumb
Instruction size  32-bit (4 bytes)  16-bit (2 bytes)
Note: some Thumb branch instructions take 4 bytes.
int add(int x, int y)
{
int result = x + y;
printf("%d + %d = %d\n", x, y, result);
return result;
}
ARM: 36 bytes (+16 bytes compared to Thumb!)
4 PUSH {r4,lr}
4 ADD r4,r0,r1
4 MOV r2,r1
4 MOV r1,r0
4 MOV r3,r4
4 ADR r0,|String|
4 BL printf
4 MOV r0,r4
4 POP {r4,pc}
36
|String|
DCB "%d + %d = %d\n",0
Thumb: 20 bytes
2 PUSH {r4,lr}
2 ADDS r4,r0,r1
2 MOV r2,r1
2 MOV r1,r0
2 MOV r3,r4
2 ADR r0,|String|
4 BL printf
2 MOV r0,r4
2 POP {r4,pc}
20
|String|
DCB "%d + %d = %d\n",0
ARM vs Thumb
ARM Thumb
Conditional execution Nearly all instructions Branch instructions
bool isLetter(int c)
{
return ((c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z'));
}
Thumb: 11 instructions
MOV r1,r0
SUBS r1,r1,#'A'
CMP r1,#'Z' - 'A'
BLS |True|
SUBS r0,r0,#'a'
CMP r0,#'z' - 'a'
BHI |False|
|True|
MOVS r0,#1
BX lr
|False|
MOVS r0,#0
BX lr
ARM: 7 instructions
SUB r1,r0,#'A'
CMP r1,#'Z' - 'A'
SUBHI r0,r0,#'a'
CMPHI r0,#'z' - 'a'
MOVLS r0,#1
MOVHI r0,#0
BX lr
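Both listings use the same trick: subtract the lower bound, then do a single unsigned comparison (LS/HI) against the range width. Written out explicitly in C++, as a sketch of what the compiler derived from the original source:

```cpp
#include <cassert>

// Range check with one unsigned compare per range: values below the lower
// bound wrap around to very large unsigned numbers and fail the test.
inline bool isLetter(int c)
{
    const unsigned fromUpperA = static_cast<unsigned>(c) - 'A';  // SUB r1,r0,#'A'
    const unsigned fromLowerA = static_cast<unsigned>(c) - 'a';  // SUBHI r0,r0,#'a'
    return fromUpperA <= static_cast<unsigned>('Z' - 'A') ||
           fromLowerA <= static_cast<unsigned>('z' - 'a');
}
```

The compiler already performs this transformation here, so the plain version with four comparisons can stay in the source.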
Note: the compiler must ensure type bool is true
(1) or false (0). Avoid [implicit] casting to bool
when it’s not required, as it’s expensive!
[Tip!] Boolean type
bool isBitSet(int flags, int bit)
{
return flags & (1 << bit);
}
MOVS r2,#1
LSLS r2,r2,r1
TST r2,r0
BEQ |False|
MOVS r0,#1
BX lr
|False|
MOVS r0,#0
BX lr
int isBitSet(int flags, int bit)
{
return flags & (1 << bit);
}
MOV r2,r0
MOVS r0,#1
LSLS r0,r0,r1
ANDS r0,r0,r2
BX lr
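One caveat with the int version is that the result is "zero or non-zero", not "0 or 1". A short sketch with two hypothetical helper names shows the difference:

```cpp
#include <cassert>

// Returns the masked bit itself: zero or some non-zero value. No normalisation.
inline int isBitSetRaw(int flags, int bit)
{
    return flags & (1 << bit);
}

// Returns a real bool: the compiler must normalise the result to 0 or 1.
inline bool isBitSetBool(int flags, int bit)
{
    return (flags & (1 << bit)) != 0;
}
```

Use the raw result only in conditions such as if (isBitSetRaw(flags, 3)); comparing it with 1 or true would give the wrong answer for any bit other than bit 0.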
ARM vs Thumb-2
 Thumb-2 introduced the IT (if-then) instruction
 Up to four instructions can be made conditional
Thumb
MOV r1,r0
SUBS r1,r1,#'A'
CMP r1,#'Z' - 'A'
BLS |True|
SUBS r0,r0,#'a'
CMP r0,#'z' - 'a'
BHI |False|
|True|
MOVS r0,#1
BX lr
|False|
MOVS r0,#0
BX lr
Thumb-2
MOV r1,r0
SUBS r1,r1,#'A'
CMP r1,#'Z' - 'A'
ITT HI
SUBHI r0,r0,#'a'
CMPHI r0,#'z' - 'a'
ITE LS
MOVLS r0,#1
MOVHI r0,#0
BX lr
ARM
SUBS r1,r0,#'A'
CMP r1,#'Z' - 'A'
SUBHI r0,r0,#'a'
CMPHI r0,#'z' - 'a'
MOVLS r0,#1
MOVHI r0,#0
BX lr
ARM vs Thumb
ARM Thumb
Barrel shifter & ALU Accessible by data instructions Requires separate instructions
unsigned int reverseBytes(unsigned int x)
{
return (x << 24) |
((x << 8) & 0x00FF0000) |
((x >> 8) & 0x0000FF00) |
((x >> 24));
}
Thumb: 13 instructions
MOVS r3,#0xFF
LSLS r2,r0,#8
LSLS r3,r3,#16
ANDS r2,r2,r3
LSLS r1,r0,#24
ORRS r1,r1,r2
LSRS r2,r0,#8
ASRS r3,r3,#8
ANDS r2,r2,r3
ORRS r1,r1,r2
LSRS r0,r0,#24
ORRS r0,r0,r1
BX lr
ARM: 8 instructions
MOV r1,#0xFF,LSL #16
AND r1,r1,r0,LSL #8
MOV r2,#0xFF,LSL #8
ORR r1,r1,r0,LSL #24
AND r2,r2,r0,LSR #8
ORR r1,r1,r2
ORR r0,r1,r0,LSR #24
BX lr
ARMv6 (REV instruction)
REV r0,r0
BX lr
ARM vs Thumb
Feature                ARM          Thumb
Coprocessor interface  Yes          No
Long Multiply          Yes (ARMv4)  No
Count Leading Zeroes   Yes (ARMv5)  No
Saturated math         Yes (ARMv5)  No
DSP instructions       Yes (ARMv5)  No
SIMD instructions      Yes (ARMv6)  No
Summary: when to use Thumb
 Use Thumb for functions which…
   do not benefit from the ARM instruction-set
   are not performance critical (e.g. initialization code)
#include <code16.h>
void Level::load(const std::string& path)
{
.
.
.
}
#include <codereset.h>
Summary: when to use ARM
 Use ARM for functions which…
   do benefit from the ARM instruction-set
   are performance critical (e.g. called from inner loops)
#include <code32.h>
bool Ray::intersects(const Sphere& s)
{
.
.
.
}
#include <codereset.h>
Intrinsic functions
 Allows use of specialized CPU instructions in C/C++
 Compiler can recognize patterns and might utilize
such specialized instructions:
unsigned int reverseBytes(unsigned int x)
{
return (x << 24) |
((x << 8) & 0x00FF0000) |
((x >> 8) & 0x0000FF00) |
((x >> 24));
}
REV r0,r0
BX lr
 More often the compiler does not. Check compiler
output!
 Intrinsic functions are compiler specific; read the
manual!
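Since intrinsics differ per compiler, one common approach is to wrap them in a small function with a portable fallback. The sketch below is an assumption about how that could look: it uses __rev on armcc (ARM Compiler 5 and earlier), the __builtin_bswap32 builtin on GCC/Clang, and the shift-and-mask expression from the slides everywhere else.

```cpp
#include <cassert>
#include <cstdint>

inline std::uint32_t reverseBytes(std::uint32_t x)
{
#if defined(__ARMCC_VERSION) && (__ARMCC_VERSION < 6000000)
    return __rev(x);               // armcc intrinsic (assumed available on ARMv6+ targets)
#elif defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap32(x);   // GCC/Clang builtin
#else
    // Portable fallback; the compiler may or may not recognise the pattern.
    return (x << 24) |
           ((x << 8) & 0x00FF0000u) |
           ((x >> 8) & 0x0000FF00u) |
           (x >> 24);
#endif
}
```

Whichever branch is taken, check the generated code to confirm a single REV is emitted on the target.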
Useful intrinsics
Intrinsic Description
__breakpoint Stops execution, informs the debugger
__disable_irq Sets the CPSR irq mask, returns previous state
__enable_irq Clears the CPSR irq mask (enables interrupts)
__ldrex Atomic reads
__strex Atomic writes
Useful intrinsics (cache)
Intrinsic Description
__pld Preload data
__pldw Preload data for writing
__pli Preload instructions
Useful intrinsics (algorithms)
Intrinsic Description
__usat/__ssat Unsigned/signed saturate (any power of 2)
__clz Count leading zeroes
__rbit Reverse bit order
__rev Reverse byte order
Useful intrinsics (SIMD)
Intrinsic Description
__usad[a]8|16 Sum of absolute differences (4x8, 2x16)
__[u][q]add8|16 [Saturated] addition (4x8, 2x16)
__[u][q]sub8|16 [Saturated] subtraction (4x8, 2x16)
etc. Check: http://infocenter.arm.com/help
Intrinsic functions
struct RGBA
{
union
{
struct
{
u8 r, g, b, a;
};
u32 c;
};
RGBA& operator += (RGBA o)
{
r += o.r;
g += o.g;
b += o.b;
a += o.a;
return *this;
}
};
struct RGBA
{
union
{
struct
{
u8 r, g, b, a;
};
u32 c;
};
RGBA& operator += (RGBA o)
{
c = __uadd8(c, o.c);
return *this;
}
};
Intrinsic functions (compiler output without the intrinsic)
PUSH {r4}
AND r2,r1,r0,ASR #24
ADD r1,r1,r0
AND r1,r1,#0xff
ORR r1,r1,r2
LSL r3,r0,#16
LSL r4,r0,#8
LSR r2,r0,#24
LSL r0,r1,#16
LSR r12,r3,#24
ADD r0,r12,r0,LSR #24
BIC r1,r1,#0xff00
LSL r0,r0,#8
AND r0,r0,#0xff00
ORR r0,r0,r1
BIC r1,r0,#0xff0000
LSL r12,r0,#8
LSR r0,r12,#24
ADD r0,r0,r4,LSR #24
POP {r4}
LSL r0,r0,#16
AND r0,r0,#0xff0000
ORR r0,r0,r1
BIC r3,r0,#0xff000000
ADD r0,r2,r0,LSR #24
ORR r0,r3,r0,LSL #24
Intrinsic functions (compiler output with __uadd8)
UADD8 r0,r0,r1
Note:
I could have demonstrated __uqadd8, which saturates the results to the 8-bit unsigned integer range 0 ≤ x ≤ 2^8 - 1.
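For code that must also build on targets without these instructions, the per-byte behaviour can be emulated in portable C++. This is only a sketch with made-up function names; it reproduces the result value of UADD8 and UQADD8, not the GE flags that UADD8 also sets.

```cpp
#include <cassert>
#include <cstdint>

// Four independent 8-bit additions with wrap-around (result of UADD8).
inline std::uint32_t uadd8Portable(std::uint32_t a, std::uint32_t b)
{
    // Add the low 7 bits of each byte; carries cannot cross byte boundaries.
    const std::uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    // Fold in the top bit of each byte without generating a carry.
    return low7 ^ ((a ^ b) & 0x80808080u);
}

// Four independent 8-bit additions, each clamped to 2^8 - 1 (result of UQADD8).
inline std::uint32_t uqadd8Portable(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8)
    {
        std::uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        if (sum > 0xFFu)
        {
            sum = 0xFFu;
        }
        result |= sum << shift;
    }
    return result;
}
```

For colour values the saturating form is usually the one you want, since wrap-around turns an over-bright channel dark.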
Feel free to ask…
jan-lieuwe@engine-software.nl

More Related Content

What's hot

Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Hsien-Hsin Sean Lee, Ph.D.
 
Assembly language
Assembly languageAssembly language
Assembly languagebryle12
 
Lec2 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ILP
Lec2 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ILPLec2 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ILP
Lec2 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ILP
Hsien-Hsin Sean Lee, Ph.D.
 
Maicrocontroller lab basic experiments
Maicrocontroller lab basic experimentsMaicrocontroller lab basic experiments
Maicrocontroller lab basic experiments
Noor Tahasildar
 
Micro Controller lab basic experiments (1)
Micro Controller lab basic experiments (1)Micro Controller lab basic experiments (1)
Micro Controller lab basic experiments (1)
Noor Tahasildar
 
Instruction types
Instruction typesInstruction types
Instruction types
JyotiprakashMishra18
 
96000707 gas-turbine-control
96000707 gas-turbine-control96000707 gas-turbine-control
96000707 gas-turbine-controlMowaten Masry
 
Chp2 introduction to the 68000 microprocessor copy
Chp2 introduction to the 68000 microprocessor   copyChp2 introduction to the 68000 microprocessor   copy
Chp2 introduction to the 68000 microprocessor copymkazree
 
Arithmetic and logical instructions set
Arithmetic and logical instructions setArithmetic and logical instructions set
Arithmetic and logical instructions set
Robert Almazan
 
Basic computer organization design
Basic computer organization designBasic computer organization design
Basic computer organization design
ndasharath
 
Ee443 phase locked loop - presentation - schwappach and brandy
Ee443   phase locked loop - presentation - schwappach and brandyEe443   phase locked loop - presentation - schwappach and brandy
Ee443 phase locked loop - presentation - schwappach and brandyLoren Schwappach
 
Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...
Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...
Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...
Hsien-Hsin Sean Lee, Ph.D.
 
RF Module Design - [Chapter 8] Phase-Locked Loops
RF Module Design - [Chapter 8] Phase-Locked LoopsRF Module Design - [Chapter 8] Phase-Locked Loops
RF Module Design - [Chapter 8] Phase-Locked Loops
Simen Li
 
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
Hsien-Hsin Sean Lee, Ph.D.
 
Buy Embedded Systems Projects Online,Buy B tech Projects Online
Buy Embedded Systems Projects Online,Buy B tech Projects OnlineBuy Embedded Systems Projects Online,Buy B tech Projects Online
Buy Embedded Systems Projects Online,Buy B tech Projects Online
Technogroovy
 
Central processing unit
Central processing unitCentral processing unit
Central processing unit
Heman Pathak
 
Emb day2 8051
Emb day2 8051Emb day2 8051
Emb day2 8051
shivamarya55
 
Cbstartbook
CbstartbookCbstartbook
Cbstartbookcheksxk
 

What's hot (19)

Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
 
Assembly language
Assembly languageAssembly language
Assembly language
 
Lec2 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ILP
Lec2 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ILPLec2 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ILP
Lec2 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ILP
 
Maicrocontroller lab basic experiments
Maicrocontroller lab basic experimentsMaicrocontroller lab basic experiments
Maicrocontroller lab basic experiments
 
Micro Controller lab basic experiments (1)
Micro Controller lab basic experiments (1)Micro Controller lab basic experiments (1)
Micro Controller lab basic experiments (1)
 
Instruction types
Instruction typesInstruction types
Instruction types
 
96000707 gas-turbine-control
96000707 gas-turbine-control96000707 gas-turbine-control
96000707 gas-turbine-control
 
Chp2 introduction to the 68000 microprocessor copy
Chp2 introduction to the 68000 microprocessor   copyChp2 introduction to the 68000 microprocessor   copy
Chp2 introduction to the 68000 microprocessor copy
 
Arithmetic and logical instructions set
Arithmetic and logical instructions setArithmetic and logical instructions set
Arithmetic and logical instructions set
 
Basic computer organization design
Basic computer organization designBasic computer organization design
Basic computer organization design
 
Ee443 phase locked loop - presentation - schwappach and brandy
Ee443   phase locked loop - presentation - schwappach and brandyEe443   phase locked loop - presentation - schwappach and brandy
Ee443 phase locked loop - presentation - schwappach and brandy
 
Hd9
Hd9Hd9
Hd9
 
Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...
Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...
Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...
 
RF Module Design - [Chapter 8] Phase-Locked Loops
RF Module Design - [Chapter 8] Phase-Locked LoopsRF Module Design - [Chapter 8] Phase-Locked Loops
RF Module Design - [Chapter 8] Phase-Locked Loops
 
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
 
Buy Embedded Systems Projects Online,Buy B tech Projects Online
Buy Embedded Systems Projects Online,Buy B tech Projects OnlineBuy Embedded Systems Projects Online,Buy B tech Projects Online
Buy Embedded Systems Projects Online,Buy B tech Projects Online
 
Central processing unit
Central processing unitCentral processing unit
Central processing unit
 
Emb day2 8051
Emb day2 8051Emb day2 8051
Emb day2 8051
 
Cbstartbook
CbstartbookCbstartbook
Cbstartbook
 

Similar to OptimizingARM

Arm teaching material
Arm teaching materialArm teaching material
Arm teaching materialJohn Williams
 
ARM instruction set
ARM instruction  setARM instruction  set
ARM instruction set
Karthik Vivek
 
ARM AAE - Intrustion Sets
ARM AAE - Intrustion SetsARM AAE - Intrustion Sets
ARM AAE - Intrustion Sets
Anh Dung NGUYEN
 
ARM instruction set
ARM instruction  setARM instruction  set
ARM instruction set
Karthik Vivek
 
ARM Architecture Instruction Set
ARM Architecture Instruction SetARM Architecture Instruction Set
ARM Architecture Instruction Set
Dwight Sabio
 
ARM Fundamentals
ARM FundamentalsARM Fundamentals
ARM Fundamentals
guest56d1b781
 
LPC 2148 Instructions Set.ppt
LPC 2148 Instructions Set.pptLPC 2148 Instructions Set.ppt
LPC 2148 Instructions Set.ppt
ProfBadariNathK
 
Arm Cortex material Arm Cortex material3222886.ppt
Arm Cortex material Arm Cortex material3222886.pptArm Cortex material Arm Cortex material3222886.ppt
Arm Cortex material Arm Cortex material3222886.ppt
Manju Badiger
 
digitaldesign-2020-lecture10a-lc3andmips-beforelecture.pdf
digitaldesign-2020-lecture10a-lc3andmips-beforelecture.pdfdigitaldesign-2020-lecture10a-lc3andmips-beforelecture.pdf
digitaldesign-2020-lecture10a-lc3andmips-beforelecture.pdf
Duy-Hieu Bui
 
07-arm_overview.ppt
07-arm_overview.ppt07-arm_overview.ppt
07-arm_overview.ppt
AswathRangaraj1
 
arm-intro.ppt
arm-intro.pptarm-intro.ppt
arm-intro.ppt
MostafaParvin1
 
ARM Introduction
ARM IntroductionARM Introduction
ARM Introduction
Ramasubbu .P
 
07-arm_overview.ppt
OptimizingARM

  • 1. Optimizing for ARM architectures Jan-Lieuwe Koopmans Engine Software
  • 2.
  • 3. ARM platforms
      - GameBoy Advance: ARM7TDMI @ 16.8 MHz (ARMv4)
      - Nintendo DS: ARM7TDMI @ 33 MHz (ARMv4), ARM946E-S @ 67 MHz (ARMv5)
      - Nintendo DSi: ARM7TDMI @ 33 MHz (ARMv4), ARM946E-S @ 133 MHz (ARMv5)
      - Nintendo 3DS: ARM11 MPCore @ 267 MHz (ARMv6K)
  • 4. ARM platforms
      - PlayStation Vita: ARM Cortex-A9 MPCore (ARMv7)
      - Apple iPhone/iPod/iPad: ARM1176JZ(F)-S (ARMv6), ARM Cortex-A8 [Apple A4] (ARMv7), ARM Cortex-A9 [Apple A5] (ARMv7)
      - Android
  • 5. Key Features
      - Multiple instruction sets
          - ARM (powerful, 4 bytes/instruction)
          - Thumb (simple, 2 bytes/instruction)
          - Jazelle (Java™ bytecode execution)
      - Variable cycle execution
      - Load/store multiple
      - Conditional execution
          - Reduces branching
      - Barrel shifter
          - Complex instructions
  • 6. Key Features
      - DSP extensions (ARMv5TE, ARMv6, ARMv7)
          - Single cycle 16x16 and 32x16 MAC
          - Saturated math
          - Count Leading Zeroes
          - Load/store register pairs
      - SIMD extensions (ARMv6, ARMv7)
          - Simultaneous computation of 2x16-bit or 4x8-bit operands
          - Fractional arithmetic
          - User definable saturation modes (arbitrary word-width)
          - Dual 16x16 multiply-add/subtract
          - 32x32 fractional MAC
          - Simultaneous 8/16-bit select operations
  • 7.
  • 8.
  • 11. Compiler options:
        --asm         Output assembly code as well as object code
        -S            Output assembly code instead of object code
        --interleave  Interleave source with disassembly (use with --asm or -S)
      Interleaved output:
        ;;;22 // calculate a point on a quadratic Bezier curve
        ;;;23 Vector2f math::bezier(const Vector2f& a, const Vector2f& b, const Vector2f& c, const f32 t)
        000000 ed9f1a16 VLDR s2,|L5.96|
        ;;;24 {
        ;;;25 const f32 tInv = 1 - t;
        ;;;26 const f32 tInvSq = tInv * tInv;
        ;;;27 const f32 tSq = t * t;
        ;;;28 const f32 t2tInv = (t * 2) * tInv;
        000004 eddf0a16 VLDR s1,|L5.100|
        000008 edd22a00 VLDR s5,[r2,#0]
        00000c ee311a40 VSUB.F32 s2,s2,s0 ;25
        000010 ee601a20 VMUL.F32 s3,s0,s1
        000014 ee200a00 VMUL.F32 s0,s0,s0 ;27
        000018 ee610a01 VMUL.F32 s1,s2,s2 ;26
        00001c ee211a81 VMUL.F32 s2,s3,s2
        . . .
        000054 ed801a00 VSTR s2,[r0,#0]
        000058 ed800a01 VSTR s0,[r0,#4]
        ;;;29
        ;;;30 return tInvSq * a + t2tInv * b + tSq * c;
        ;;;31 }
        00005c e12fff1e BX lr
  • 12.
  • 13. Address Opcode Mnemonic Operands 00000000 E0804001 ADD R4,R0,R1 1. Branch instructions 2. Register Load and Store instructions 3. Data processing instructions 4. Coprocessor instructions 5. Status register access instructions
  • 15. Address Opcode Mnemonic Operands 00000000 E12FFF1E BX LR Branching instructions B Branch BX Branch with exchange (Thumb/ARM) BL Branch with link BLX Branch with link & exchange
  • 16. Address Opcode Mnemonic Operands 00000000 E1D000F0 LDRSH R0,[R0,#0]
      Register Load and Store instructions
        LDR  Load register from memory
        STR  Store register to memory
        LDM  Load multiple registers (32-bit aligned!)
        STM  Store multiple registers (32-bit aligned!)
      Register Load and Store instructions (suffixes)
        B    Byte (8-bit)
        SB   Signed byte (8-bit)
        H    Half word (16-bit)
        SH   Signed half word (16-bit)
        D    Double word (64-bit)
  • 17. Address Opcode Mnemonic Operands 00000000 E1B030C6 ASRS R3,R6,#1 Data processing instructions MOV Move to register LSL Logical shift left LSR Logical shift right ASR Arithmetic shift right
  • 18. Address Opcode Mnemonic Operands 00000000 E0854C2C ADD R4,R5,R12,LSR #24 Data processing instructions (arithmetic) ADD Addition ADC Addition with carry SUB Subtraction SBC Subtraction with carry RSB Reverse subtraction RSC Reverse subtraction with carry MUL Multiply MLA Multiply and accumulate
  • 19. Address Opcode Mnemonic Operands 00000000 E2000003 AND R0,R0,#3 Data processing instructions (logical) AND Logical AND EOR Logical exclusive OR ORR Logical OR MVN Logical NOT BIC Bit clear (combined logical AND NOT)
  • 20. Address Opcode Mnemonic Operands 00000000 E3560000 CMP R6,#0 Data processing instructions (tests) CMP Compare CMN Compare negative TST Test bits (logical AND) TEQ Test bits (logical EOR)
  • 21. Status Register
      Bits: 31 N, 30 Z, 29 C, 28 V, 27 Q, 26..8 (not shown), 7 I, 6 F, 5 T, 4..0 Mode
      Flags: N = negative, Z = zero, C = carry, V = overflow, Q = saturated
      Condition codes: EQ Equal; NE Not equal; CS Carry; CC No carry; MI Negative; PL Positive or zero; VS Overflow; VC No overflow; HI Higher (unsigned); LS Lower or same (unsigned); GE Greater or equal (signed); LT Less than (signed); GT Greater than (signed); LE Less than or equal (signed); AL Always; NV Never
      - S suffix: data instruction updates CPSR (Current Program Status Register)
  • 22. Status Register (bit layout, flags and condition codes as on slide 21), with examples:
        ANDS  num, num, #1
        ADDNE odd, odd, #1
        ADDEQ even, even, #1

        CMP   age, #18
        BGE   |IsAdult|
  • 23.
  • 24. A for loop and its while equivalent:
        for (int i = 0; i < n; ++i) {
            // ...
        }

        int i = 0;
        while (i < n) {
            // ...
            ++i;
        }
  • 25. Generated code:
        MOV i, #0       ; i = 0
        CMP i, n        ; i < n?
        BGE |Done|      ; no, done
      |Loop|
        ; ...
        ADD i, i, #1    ; ++i
        CMP i, n        ; i < n?
        BLT |Loop|      ; yes, loop
      |Done|
  • 26. Generated code:
        MOV i, #0       ; i = 0
        CMP i, n        ; i < n?
        BGE |Done|      ; no, done
      |Loop|
        ; ...
        ADD i, i, #1    ; ++i
        CMP i, n        ; i < n?
        BLT |Loop|      ; yes, loop
      |Done|
      Initial test required, in case n <= 0
  • 27. A do-while loop and its generated code:
        int i = 0;
        do {
            // ...
        } while(++i < n);

        MOV i, #0       ; i = 0
      |Loop|
        ; ...
        ADD i, i, #1    ; ++i
        CMP i, n        ; i < n?
        BLT |Loop|      ; yes, loop
  • 28. [Tip!] Use do {} while
      - Use do-while loops when the initial test isn't required
      - Tip: replace the initial test with an 'assert(n > 0)'
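As a minimal sketch of this tip, the function below uses a do-while loop and documents its precondition with an assert instead of testing before the first iteration. The name `sumValues` and its parameters are hypothetical and only serve as an illustration.

```cpp
#include <cassert>

// Hypothetical helper: sums the first n entries of values.
// The caller guarantees n > 0, so no test is needed before the first pass.
int sumValues(const int* values, int n)
{
    assert(n > 0);          // stands in for the initial loop test (debug builds only)
    int total = 0;
    int i = 0;
    do {
        total += values[i];
    } while (++i < n);      // single test at the bottom of the loop
    return total;
}
```

In release builds the assert disappears, so the function must never be called with n <= 0.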
  • 29. [Tip!] Count down loops
      - Count down in loops where possible
        int i = n - 1;
        do {
            // ...
        } while(--i >= 0);
  • 30. Counting down:
        SUB  i, n, #1   ; i = n - 1
      |Loop|
        ; ...
        SUBS i, i, #1   ; --i >= 0?
        BPL  |Loop|     ; yes, loop
      Counting up:
        MOV  i, #0      ; i = 0
      |Loop|
        ; ...
        ADD  i, i, #1   ; ++i
        CMP  i, n       ; i < n?
        BLT  |Loop|     ; yes, loop
  • 31. Counting down with an initial test:
        for (int i = n - 1; i >= 0; --i) {
            // ...
        }

        int i = n - 1;
        while (i >= 0) {
            // ...
            --i;
        }
  • 32. Generated code:
        SUBS i, n, #1   ; i = n - 1
        BMI  |Done|
      |Loop|
        ; ...
        SUBS i, i, #1   ; --i >= 0?
        BPL  |Loop|     ; yes, loop
      |Done|
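The count-down form with the initial test kept can be sketched as follows. `sumCountingDown` is a hypothetical name; the point is only that the loop condition compares against zero, which the flag-setting decrement already provides, while still being safe for n <= 0.

```cpp
// Hypothetical helper: sums the first n entries of values, walking backwards.
// Safe for n <= 0 because the for loop tests i >= 0 before the first pass.
int sumCountingDown(const int* values, int n)
{
    int total = 0;
    for (int i = n - 1; i >= 0; --i) {
        total += values[i];
    }
    return total;
}
```

The iteration order is reversed, so this only applies when the loop body does not depend on visiting elements front to back.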
  • 33. [Tip!] Improve Loop Unrolling
      Intrinsic    Description
      __promise    Allows the compiler to optimize loop unrolling (also improves NEON vectorization)
        // Promise the compiler that the loop
        // iteration count is divisible by 16
        __promise((n % 16) == 0);
        for (int i = 0; i < n; i++) {
            // ...
        }
  • 34.
  • 35. Pointer Aliasing
      - A compiler must assume two pointers could point to the same location.
        void Object::update(const State& state)
        {
            mAge += state.deltaTime;
            mDelay -= state.deltaTime;
        }
  • 36. Pointer Aliasing
      - A compiler must assume two pointers could point to the same location.
        void Object::update(const State& state)
        {
            this->mAge += state.deltaTime;
            this->mDelay -= state.deltaTime;
        }
  • 37. Pointer Aliasing
      - A compiler must assume two pointers could point to the same location.
        LDR r2,[r0,#0]  ; load this->mAge
        LDR r3,[r1,#0]  ; load state.deltaTime
                        ; interlock
        ADD r2,r2,r3    ; mAge += state.deltaTime
        STR r2,[r0,#0]  ; store updated mAge
        LDR r1,[r1,#0]  ; reload state.deltaTime
        LDR r2,[r0,#4]  ; load this->mDelay
                        ; interlock
        SUB r1,r2,r1    ; mDelay -= state.deltaTime
        STR r1,[r0,#4]  ; store updated mDelay
        BX  lr          ; return
  • 38. Pointer Aliasing
      - Do not dereference multiple times; cache the value in a local.
        void Object::update(const State& state)
        {
            const int dt = state.deltaTime;
            mAge += dt;
            mDelay -= dt;
        }
  • 39. Pointer Aliasing
      - Do not dereference multiple times; cache the value in a local.
      - Or use __restrict to promise the compiler that a pointer does not alias other pointers.
        void Object::update(const State& state) __restrict // restrict the this pointer
        {
            mAge += state.deltaTime;
            mDelay -= state.deltaTime;
        }
  • 40. Pointer Aliasing
      - Do not dereference multiple times; cache the value in a local.
      - Or use __restrict to promise the compiler that a pointer does not alias other pointers.
      - This improves code generation tremendously!
        LDR r12,[r1,#0]  ; load state.deltaTime
        LDM r0,{r2, r3}  ; load mAge, mDelay
        ADD r2,r2,r12    ; mAge += state.deltaTime
        SUB r3,r3,r12    ; mDelay -= state.deltaTime
        STM r0,{r2, r3}  ; store mAge, mDelay
        BX  lr           ; return
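Why the compiler has to reload the value can be seen in a small sketch. The `Timer` struct and both function names are hypothetical, chosen to mirror the `mAge`/`mDelay` example; when the delta pointer happens to point at the object's own field, the two versions give different results, which is exactly the case the compiler cannot rule out on its own.

```cpp
// Hypothetical stand-in for the Object/State pair on the slides.
struct Timer {
    int age;
    int delay;
};

// Reads *dt twice: if dt points into t, the second read sees the updated age.
void updateReloading(Timer& t, const int* dt)
{
    t.age += *dt;
    t.delay -= *dt;
}

// Reads *dt once into a local, as the tip suggests.
void updateCached(Timer& t, const int* dt)
{
    const int delta = *dt;
    t.age += delta;
    t.delay -= delta;
}
```

With `Timer t{5, 100}` and `dt == &t.age`, the reloading version ends with delay 90 and the cached version with delay 95. For non-aliasing arguments both agree, and only then is caching (or `__restrict`) a pure optimization.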
  • 41.
  • 42. Registers: R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 (SP) R14 (LR) R15 (PC)
      - Sixteen 32-bit general purpose registers
      - Not many for a load/store architecture
          - PowerPC and MIPS have 32
          - AMD 29000 has 192 (!)
  • 43. Registers (list and notes as on slide 42)
      - Arguments: R0..R3
  • 44. Registers (list and notes as on slide 42)
      - Arguments: R0..R3
      - Return address: R14 (LR)
          - Current PC
          - Current CPU mode (ARM/Thumb)
  • 45. Registers (list and notes as on slide 42)
      - Arguments: R0..R3
      - Return address: R14 (LR)
      - Return value: R0, R1
  • 46. [Tip!] Function arguments
        bool Object::hit(int type, int damage, Object* pSource)
        {
            // R0 = this
            // R1 = type
            // R2 = damage
            // R3 = pSource
            ...
            // R0 = true
            return true;
        }
      - Do not pass more than four 32-bit (integer) arguments
      - Non-static class member functions: 3 arguments (this pointer counts as argument)
  • 47. [Tip!] Function arguments
        s64 dontDoThis(s32 a, s64 b, s32 c)
        {
            // R0 = a
            // R1 (unused)
            // R2, R3 = b
            // [SP+0] = c
            return a + b + c;
            // R0, R1 = result
        }
      - 64-bit arguments require two registers
      - Must use R0, R1 or R2, R3
  • 48. [Tip!] Function arguments
        s64 Object::rememberThis(s64 b, s32 a)
        {
            // R0 = this
            // R1 (unused)
            // R2, R3 = b
            // [SP+0] = a
            return a + b + this->c;
            // R0, R1 = result
        }
      - 64-bit arguments require two registers
      - Must use R0, R1 or R2, R3
      - Member functions: this pointer alert!
  • 49. [Tip!] Function arguments
        s64 Object::rememberThis(s32 a, s64 b)
        {
            // R0 = this
            // R1 = a
            // R2, R3 = b
            return a + b + this->c;
            // R0, R1 = result
        }
      - 64-bit arguments require two registers
      - Must use R0, R1 or R2, R3
      - Member functions: this pointer alert!
  • 50. Registers: R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 (SP) R14 (LR) R15 (PC)
      - Sixteen 32-bit general purpose registers
      - Not many for a load/store architecture
          - PowerPC and MIPS have 32
          - AMD 29000 has 192 (!)
      - Arguments: R0..R3
      - Return address: R14 (LR)
      - Return value: R0, R1
      - 32-bit!
  • 51. [Tip!] Use 32 bits!
      - Use 32 bits (or multiples thereof) for:
          - Arguments
          - Locals
          - Return values
      - When using smaller types, the compiler has to take care of:
          - Wrap-around
          - Sign-extension
  • 52. [Tip!] Use 32 bits!
        short addRange(short a, short b, short* pData)
        {
            short result = 0;
            do {
                result += pData[a++];
            } while (a <= b);
            return result;
        }
  • 53. [Tip!] Use 32 bits!
        MOV   r3,#0
      |Loop|
        ADD   r12,r2,r0,LSL #1
        LDRH  r12,[r12,#0]
        ADD   r0,r0,#1
        LSL   r0,r0,#16     ; wrap-around and...
        ADD   r3,r3,r12
        ASR   r0,r0,#16     ; sign-extend
        LSL   r3,r3,#16
        CMP   r0,r1
        ASR   r3,r3,#16
        MOVGT r0,r3
        BLE   |Loop|
        BX    lr
  • 54. [Tip!] Use 32 bits! (ARMv6)
        MOV   r3,#0
      |Loop|
        ADD   r12,r2,r0,LSL #1
        ADD   r0,r0,#1
        LDRH  r12,[r12,#0]
        SXTH  r0,r0         ; sign-extend halfword
        CMP   r0,r1
        ADD   r3,r3,r12
        SXTH  r3,r3
        MOVGT r0,r3
        BLE   |Loop|
        BX    lr
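A sketch of the same routine with 32-bit arguments, local and return value, so the compiler no longer needs the extra wrap-around and sign-extension steps on the index and the accumulator. The name `addRange32` is hypothetical; note that the result is no longer truncated to 16 bits, so it only matches the `short` version while the sum fits in a `short`.

```cpp
// Hypothetical 32-bit variant of addRange: the data stays 16-bit in memory,
// but the index, the accumulator and the return value are plain ints.
int addRange32(int a, int b, const short* pData)
{
    int result = 0;
    do {
        result += pData[a++];
    } while (a <= b);
    return result;
}
```

For example, with the data `{1, 2, 3, 4, 5}`, the call `addRange32(1, 3, data)` adds the elements at indices 1 through 3.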
  • 55.
  • 56. Division
      - The ARM cores covered here have no hardware integer division/modulo!
      - Avoid non-constant divisors
        int thousandDividedBy(int d)
        {
            return 1000 / d;
        }

        MOV r1,r0
        MOV r0,#1000
        B   __aeabi_idivmod
      VFP alternative?
        int thousandDividedBy(int d)
        {
            return int(1000 / (float)d);
        }

        VMOV         r0-to-s0: VMOV s0,r0
        VLDR         s1,|Thousand|
        VCVT.F32.S32 s0,s0
        VDIV.F32     s2,s1,s0
        VCVT.S32.F32 s0,s2
        VMOV         r0,s0
        BX           lr
      |Thousand|
        DCFS 0x447a0000 ; 1000
  • 57. Division
      - The ARM cores covered here have no hardware integer division/modulo!
      - Avoid non-constant divisors
      - Compiler can optimize constant divisors
        int dividedByThousand(int d)
        {
            return d / 1000;
        }

        LDR   r1,|DivisionMagic|
        SMULL r1,r0,r1,r0
        ASR   r1,r0,#6
        SUB   r0,r1,r0,ASR #31
        BX    lr
      |DivisionMagic|
        DCD 0x10624dd3

        int moduloThree(int d)
        {
            return d % 3;
        }

        LDR   r1,|ModuloMagic|
        SMULL r2,r1,r1,r0
        SUB   r1,r1,r1,ASR #31
        SUB   r1,r1,r1,LSL #2
        ADD   r0,r0,r1
        BX    lr
      |ModuloMagic|
        DCD 0x55555556
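The magic-number sequences above can be mirrored in C++ to see what they compute. This is a sketch, not the compiler's actual output: the function names are hypothetical, it takes the sign correction from `d` itself rather than from the high word of the product (the two have the same sign in every case where it matters), and it assumes that right-shifting a negative signed value is an arithmetic shift (guaranteed from C++20, implementation-defined but usual before that).

```cpp
#include <cstdint>

// d / 1000 via multiply-high: 0x10624dd3 is ceil(2^38 / 1000).
int32_t divideBy1000(int32_t d)
{
    const int64_t product = static_cast<int64_t>(d) * 0x10624dd3LL; // SMULL
    const int32_t high = static_cast<int32_t>(product >> 32);       // upper 32 bits
    return (high >> 6) - (d >> 31);  // total shift of 38, plus 1 when d is negative
}

// d % 3 via multiply-high: 0x55555556 is (2^32 + 2) / 3.
int32_t moduloThree(int32_t d)
{
    const int64_t product = static_cast<int64_t>(d) * 0x55555556LL; // SMULL
    const int32_t high = static_cast<int32_t>(product >> 32);
    const int32_t quotient = high - (d >> 31);  // d / 3, rounded towards zero
    return d - quotient * 3;                    // remainder keeps the sign of d
}
```

Both functions agree with the built-in `/` and `%` operators, including for negative inputs, which makes them convenient to check against in a unit test.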
  • 58. Division
      - The ARM cores covered here have no hardware integer division/modulo!
      - Avoid non-constant divisors
      - Compiler can optimize constant divisors
      - Especially power of two divisors
        int dividedByPower2(int d)
        {
            return d / 512;
        }

        ASR r1,r0,#31
        ADD r0,r0,r1,LSR #23
        ASR r0,r0,#9
        BX  lr

        int moduloPower2(int d)
        {
            return d % 4;
        }

        ASR r1,r0,#31
        ADD r1,r0,r1,LSR #30
        BIC r1,r1,#3
        SUB r0,r0,r1
        BX  lr
  • 59. [Tip!] Signed vs Unsigned
        int dividedByPower2(int d)
        {
            return d / 512;
        }

        ASR r1,r0,#31
        ADD r0,r0,r1,LSR #23
        ASR r0,r0,#9
        BX  lr

        int moduloPower2(int d)
        {
            return d % 4;
        }

        ASR r1,r0,#31
        ADD r1,r0,r1,LSR #30
        BIC r1,r1,#3
        SUB r0,r0,r1
        BX  lr
      - Signed division and modulus are more complicated
      - Exception: -1 >> 1 == -1
  • 60. [Tip!] Signed vs Unsigned
        u32 dividedByPower2(u32 d)
        {
            return d / 512U;
        }

        LSR r0,r0,#9
        BX  lr

        u32 moduloPower2(u32 d)
        {
            return d % 4U;
        }

        AND r0, r0, #3
        BX  lr
      - Signed division and modulus are more complicated
      - Exception: -1 >> 1 == -1
      - Use unsigned types where applicable!
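The extra instructions in the signed case add a bias so that the shift rounds towards zero, as C/C++ division requires. The sketch below spells that out next to the unsigned versions; the function names are hypothetical, and the signed version assumes an arithmetic right shift for negative values (guaranteed from C++20).

```cpp
#include <cstdint>

// Signed d / 512: add 511 to negative inputs before shifting,
// mirroring ASR #31, ADD ... LSR #23, ASR #9.
int32_t signedDivideBy512(int32_t d)
{
    const uint32_t signMask = static_cast<uint32_t>(d >> 31);   // 0 or 0xFFFFFFFF
    const int32_t bias = static_cast<int32_t>(signMask >> 23);  // 0 or 511
    return (d + bias) >> 9;
}

// Unsigned d / 512: a single logical shift.
uint32_t unsignedDivideBy512(uint32_t d)
{
    return d >> 9;
}

// Unsigned d % 4: a single AND.
uint32_t unsignedModulo4(uint32_t d)
{
    return d & 3u;
}
```

Without the bias, `-1 >> 9` would yield -1 instead of the 0 that `-1 / 512` must produce, which is the exception the slide points out.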
  • 63.
  • 64. Interworking
      - It is possible to switch between ARM & Thumb instruction sets at run-time.
      - The least significant bit (bit 0) of the branch target address determines the instruction set.
      - The compiler allows us to switch between instruction sets with #pragma directives.
      - Only possible to use these in translation units!
          - Doesn't work for inline functions.
          - Doesn't work for non-specialized template functions.
  • 65. Switching to Thumb
        // code16.h
        //
        // --- Thumb mode
        #if defined(__MWERKS__) // Codewarrior
        # pragma thumb on
        #elif defined(__ARMCC_VERSION) // ARMCC/RVCT
        # pragma thumb
        #else
        # error "Unknown compiler!"
        #endif
  • 66. Switching to ARM
        // code32.h
        //
        // --- ARM mode
        #if defined(__MWERKS__) // Codewarrior
        # pragma thumb off
        #elif defined(__ARMCC_VERSION) // ARMCC/RVCT
        # pragma arm
        #else
        # error "Unknown compiler!"
        #endif
  • 67. Switching to default
        // codereset.h
        //
        // --- default mode
        #if defined(EFFORT_SMALL)
        #include <code16.h>
        #else
        #include <code32.h>
        #endif
  • 68. Usage:
        #include <code16.h>
        Object::Object()
        {
            // ...
        }
        Object::~Object()
        {
            // ...
        }
        #include <codereset.h>

        #include <code32.h>
        void Object::update(int ticks)
        {
            // ...
        }
        #include <codereset.h>
  • 69. ARM vs Thumb
      Instruction size: ARM 32-bit (4 bytes), Thumb 16-bit (2 bytes)
      Note: some branch instructions take 4 bytes.
  • 70. ARM vs Thumb
      Instruction size: ARM 32-bit (4 bytes), Thumb 16-bit (2 bytes)
        int add(int x, int y)
        {
            int result = x + y;
            printf("%d + %d = %d\n", x, y, result);
            return result;
        }
      ARM (bytes per instruction, 36 in total, +16 bytes!):
        4  PUSH {r4,lr}
        4  ADD  r4,r0,r1
        4  MOV  r2,r1
        4  MOV  r1,r0
        4  MOV  r3,r4
        4  ADR  r0,|String|
        4  BL   printf
        4  MOV  r0,r4
        4  POP  {r4,pc}
        |String| DCB "%d + %d = %d\n",0
      Thumb (bytes per instruction, 20 in total):
        2  PUSH {r4,lr}
        2  ADDS r4,r0,r1
        2  MOV  r2,r1
        2  MOV  r1,r0
        2  MOV  r3,r4
        2  ADR  r0,|String|
        4  BL   printf
        2  MOV  r0,r4
        2  POP  {r4,pc}
        |String| DCB "%d + %d = %d\n",0
  • 71. ARM vs Thumb
      Conditional execution: ARM nearly all instructions, Thumb branch instructions only
        bool isLetter(int c)
        {
            return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'));
        }
      Thumb (11 instructions):
        MOV  r1,r0
        SUBS r1,r1,#'A'
        CMP  r1,#'Z' - 'A'
        BLS  |True|
        SUBS r0,r0,#'a'
        CMP  r0,#'z' - 'a'
        BHI  |False|
      |True|
        MOVS r0,#1
        BX   lr
      |False|
        MOVS r0,#0
        BX   lr
      ARM (7 instructions):
        SUB   r1,r0,#'A'
        CMP   r1,#'Z' - 'A'
        SUBHI r0,r0,#'a'
        CMPHI r0,#'z' - 'a'
        MOVLS r0,#1
        MOVHI r0,#0
        BX    lr
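Both listings test each range with one subtraction and one unsigned comparison (the LS and HI conditions are unsigned). A sketch of that trick written out in C++, under the hypothetical name `isLetterUnsigned`:

```cpp
// Hypothetical rewrite of isLetter using the unsigned range check the
// compiler applies: (c - lo) as unsigned is <= (hi - lo) only when lo <= c <= hi.
bool isLetterUnsigned(int c)
{
    const unsigned fromUpperA = static_cast<unsigned>(c) - 'A';
    const unsigned fromLowerA = static_cast<unsigned>(c) - 'a';
    return fromUpperA <= static_cast<unsigned>('Z' - 'A')
        || fromLowerA <= static_cast<unsigned>('z' - 'a');
}
```

Values below the start of a range wrap around to large unsigned numbers, so a single comparison rejects both ends of the range.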
  • 72. ARM vs Thumb
      Conditional execution: ARM nearly all instructions, Thumb branch instructions only
        bool isLetter(int c)
        {
            return ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'));
        }

        MOV  r1,r0
        SUBS r1,r1,#'A'
        CMP  r1,#'Z' - 'A'
        BLS  |True|
        SUBS r0,r0,#'a'
        CMP  r0,#'z' - 'a'
        BHI  |False|
      |True|
        MOVS r0,#1
        BX   lr
      |False|
        MOVS r0,#0
        BX   lr
      Note: the compiler must ensure type bool is true (1) or false (0). Avoid [implicit] casting to bool when it's not required, as it's expensive!
  • 73. [Tip!] Boolean type
      Note: the compiler must ensure type bool is true (1) or false (0). Avoid [implicit] casting to bool when it's not required, as it's expensive!
        bool isBitSet(int flags, int bit)
        {
            return flags & (1 << bit);
        }

        MOVS r2,#1
        LSLS r2,r2,r1
        TST  r2,r0
        BEQ  |False|
        MOVS r0,#1
        BX   lr
      |False|
        MOVS r0,#0
        BX   lr

        int isBitSet(int flags, int bit)
        {
            return flags & (1 << bit);
        }

        MOV  r2,r0
        MOVS r0,#1
        LSLS r0,r0,r1
        ANDS r0,r0,r2
        BX   lr
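The difference between the two return types shows up in the returned values. In this sketch (hypothetical names) the `bool` version must normalise its result to 0 or 1, while the `int` version hands back the masked bit unchanged and leaves the zero/non-zero test to the caller.

```cpp
// Returns exactly true or false: the masked value has to be normalised.
bool isBitSetBool(int flags, int bit)
{
    return (flags & (1 << bit)) != 0;
}

// Returns the raw masked value: zero if the bit is clear, non-zero otherwise.
int isBitSetInt(int flags, int bit)
{
    return flags & (1 << bit);
}
```

Callers that only write `if (isBitSetInt(flags, bit))` behave identically, but code that compares the result against 1 or `true` needs the `bool` version.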
  • 74. ARM vs Thumb-2
      - Thumb-2 introduced the IT (if-then) instruction
      - Up to four instructions can be made conditional
      ARM:
        SUBS  r1,r0,#'A'
        CMP   r1,#'Z' - 'A'
        SUBHI r0,r0,#'a'
        CMPHI r0,#'z' - 'a'
        MOVLS r0,#1
        MOVHI r0,#0
        BX    lr
      Thumb-2:
        MOV   r1,r0
        SUBS  r1,r1,#'A'
        CMP   r1,#'Z' - 'A'
        ITT   HI
        SUBHI r0,r0,#'a'
        CMPHI r0,#'z' - 'a'
        ITE   LS
        MOVLS r0,#1
        MOVHI r0,#0
        BX    lr
      Thumb:
        MOV  r1,r0
        SUBS r1,r1,#'A'
        CMP  r1,#'Z' - 'A'
        BLS  |True|
        SUBS r0,r0,#'a'
        CMP  r0,#'z' - 'a'
        BHI  |False|
      |True|
        MOVS r0,#1
        BX   lr
      |False|
        MOVS r0,#0
        BX   lr
  • 75. ARM vs Thumb
      Barrel shifter & ALU: ARM accessible by data instructions, Thumb requires separate instructions
        unsigned int reverseBytes(unsigned int x)
        {
            return (x << 24) |
                   ((x << 8) & 0x00FF0000) |
                   ((x >> 8) & 0x0000FF00) |
                   ((x >> 24));
        }
      Thumb (13 instructions):
        MOVS r3,#0xFF
        LSLS r2,r0,#8
        LSLS r3,r3,#16
        ANDS r2,r2,r3
        LSLS r1,r0,#24
        ORRS r1,r1,r2
        LSRS r2,r0,#8
        ASRS r3,r3,#8
        ANDS r2,r2,r3
        ORRS r1,r1,r2
        LSRS r0,r0,#24
        ORRS r0,r0,r1
        BX   lr
      ARM (8 instructions):
        MOV r1,#0xFF,LSL #16
        AND r1,r1,r0,LSL #8
        MOV r2,#0xFF,LSL #8
        ORR r1,r1,r0,LSL #24
        AND r2,r2,r0,LSR #8
        ORR r1,r1,r2
        ORR r0,r1,r0,LSR #24
        BX  lr
  • 76. ARM vs Thumb
      Barrel shifter & ALU: ARM accessible by data instructions, Thumb requires separate instructions
        unsigned int reverseBytes(unsigned int x)
        {
            return (x << 24) |
                   ((x << 8) & 0x00FF0000) |
                   ((x >> 8) & 0x0000FF00) |
                   ((x >> 24));
        }
      ARMv6:
        REV r0,r0
        BX  lr
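For reference, a self-contained version of `reverseBytes` that can be checked on any host. It uses `uint32_t` rather than `unsigned int` so the width is explicit; that substitution is an assumption made for this sketch, not something the slides specify.

```cpp
#include <cstdint>

// Byte-swaps a 32-bit value: byte 0 moves to byte 3, byte 1 to byte 2, and so on.
constexpr uint32_t reverseBytes(uint32_t x)
{
    return (x << 24) |
           ((x << 8) & 0x00FF0000u) |
           ((x >> 8) & 0x0000FF00u) |
           (x >> 24);
}
```

Applying the function twice returns the original value, which makes a convenient sanity check alongside a known pair such as 0x12345678 and 0x78563412.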
  • 77. ARM vs Thumb
      Feature                 ARM           Thumb
      Coprocessor interface   Yes           No
      Long Multiply           Yes (ARMv4)   No
      Count Leading Zeroes    Yes (ARMv5)   No
      Saturated math          Yes (ARMv5)   No
      DSP instructions        Yes (ARMv5)   No
      SIMD instructions       Yes (ARMv6)   No
  • 78. Summary: when to use Thumb
      - Use Thumb for functions which…
          - do not benefit from the ARM instruction set
          - are not performance critical (e.g. initialization code)
        #include <code16.h>
        void Level::load(const std::string& path)
        {
            . . .
        }
        #include <codereset.h>
  • 79. Summary: when to use ARM
      - Use ARM for functions which…
          - do benefit from the ARM instruction set
          - are performance critical (e.g. called from inner loops)
        #include <code32.h>
        bool Ray::intersects(const Sphere& s)
        {
            . . .
        }
        #include <codereset.h>
  • 80.
  • 81. Intrinsic functions
      - Allows use of specialized CPU instructions in C/C++
      - Compiler can recognize patterns and might utilize such specialized instructions:
        unsigned int reverseBytes(unsigned int x)
        {
            return (x << 24) |
                   ((x << 8) & 0x00FF0000) |
                   ((x >> 8) & 0x0000FF00) |
                   ((x >> 24));
        }

        REV r0,r0
        BX  lr
  • 82. Intrinsic functions
      - Allows use of specialized CPU instructions in C/C++
      - Compiler can recognize patterns and might utilize such specialized instructions.
      - More often the compiler does not. Check compiler output!
      - Intrinsic functions are compiler specific; read the manual!
  • 83. Useful intrinsics
      Intrinsic       Description
      __breakpoint    Stops execution, informs the debugger
      __disable_irq   Sets the CPSR irq mask, returns previous state
      __enable_irq    Resets the CPSR irq mask, returns previous state
      __ldrex         Atomic reads
      __strex         Atomic writes
  • 84. Useful intrinsics (cache)
      Intrinsic   Description
      __pld       Preload data
      __pldw      Preload data for writing
      __pli       Preload instructions
  • 85. Useful intrinsics (algorithms)
      Intrinsic       Description
      __usat/__ssat   Unsigned/signed saturate (any power of 2)
      __clz           Count leading zeroes
      __rbit          Reverse bit order
      __rev           Reverse byte order
  • 86. Useful intrinsics (SIMD)
      Intrinsic          Description
      __usad[a]8|16      Sum of absolute differences (4x8, 2x16)
      __[u][q]add8|16    [Saturated] addition (4x8, 2x16)
      __[u][q]sub8|16    [Saturated] subtraction (4x8, 2x16)
      etc.
      Check: http://infocenter.arm.com/help
  • 87. Intrinsic functions
      Without intrinsic:
        struct RGBA
        {
            union {
                struct { u8 r, g, b, a; };
                u32 c;
            };
            RGBA& operator += (RGBA o)
            {
                r += o.r; g += o.g; b += o.b; a += o.a;
                return *this;
            }
        };
      With intrinsic:
        struct RGBA
        {
            union {
                struct { u8 r, g, b, a; };
                u32 c;
            };
            RGBA& operator += (RGBA o)
            {
                c = __uadd8(c, o.c);
                return *this;
            }
        };
  • 88. Intrinsic functions
      Generated code for the component-wise operator += (r += o.r; g += o.g; b += o.b; a += o.a):
        PUSH {r4}
        AND  r2,r1,r0,ASR #24
        ADD  r1,r1,r0
        AND  r1,r1,#0xff
        ORR  r1,r1,r2
        LSL  r3,r0,#16
        LSL  r4,r0,#8
        LSR  r2,r0,#24
        LSL  r0,r1,#16
        LSR  r12,r3,#24
        ADD  r0,r12,r0,LSR #24
        BIC  r1,r1,#0xff00
        LSL  r0,r0,#8
        AND  r0,r0,#0xff00
        ORR  r0,r0,r1
        BIC  r1,r0,#0xff0000
        LSL  r12,r0,#8
        LSR  r0,r12,#24
        ADD  r0,r0,r4,LSR #24
        POP  {r4}
        LSL  r0,r0,#16
        AND  r0,r0,#0xff0000
        ORR  r0,r0,r1
        BIC  r3,r0,#0xff000000
        ADD  r0,r2,r0,LSR #24
        ORR  r0,r3,r0,LSL #24
  • 89. Intrinsic functions
      Generated code for the __uadd8 version of operator += (c = __uadd8(c, o.c)):
        UADD8 r0,r0,r1
      Note: I could have demonstrated __uqadd8, which saturates the results to the 8-bit unsigned integer range 0 ≤ x ≤ 2^8 - 1.
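To check results on a host without the intrinsics, the per-byte arithmetic can be emulated in portable C++. This is a sketch under stated assumptions: the function names are hypothetical, they are not the ARM intrinsics themselves, and only the result value is modelled (the GE flags that UADD8 also sets are ignored).

```cpp
#include <cstdint>

// Four independent 8-bit additions that wrap modulo 256, like the result of UADD8.
uint32_t addBytesWrapping(uint32_t a, uint32_t b)
{
    // Add the low 7 bits of every byte; no carry can cross a byte boundary.
    const uint32_t low7 = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
    // Fold in the top bit of every byte with XOR, discarding its carry-out.
    return low7 ^ ((a ^ b) & 0x80808080u);
}

// Four independent 8-bit additions that clamp at 255, like the result of UQADD8.
uint32_t addBytesSaturating(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = ((a >> shift) & 0xffu) + ((b >> shift) & 0xffu);
        if (sum > 0xffu) {
            sum = 0xffu;    // saturate instead of wrapping
        }
        result |= sum << shift;
    }
    return result;
}
```

For colour values the saturating form is usually the wanted behaviour, since a wrapped channel turns a bright component dark.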
  • 90. Feel free to ask…