This document discusses optimizing code for ARM architectures. It provides information on various ARM platforms used in devices like the GameBoy Advance, Nintendo DS, Nintendo DSi, Nintendo 3DS, PlayStation Vita, Apple devices and Android. It outlines key features of ARM architectures like multiple instruction sets, variable cycle execution, load/store multiple instructions and DSP/SIMD extensions. It also provides tips for optimizing code like using 32-bit data types, avoiding pointer aliasing, improving loop unrolling and counting down in loops where possible.
4. ARM platforms
PlayStation Vita
ARM Cortex-A9 MPCore (ARMv7)
Apple iPhone/iPod/iPad
ARM1176JZ(F)-S (ARMv6)
ARM Cortex-A8 [Apple A4] (ARMv7)
ARM Cortex-A9 [Apple A5] (ARMv7)
Android
11. --asm Output assembly code as well as object code
-S Output assembly code instead of object code
--interleave Interleave source with disassembly
(use with --asm or -S)
;;;22 // calculate a point on a quadratic Bezier curve
;;;23 Vector2f math::bezier(const Vector2f& a, const Vector2f& b, const Vector2f& c, const f32 t)
000000 ed9f1a16 VLDR s2,|L5.96|
;;;24 {
;;;25 const f32 tInv = 1 - t;
;;;26 const f32 tInvSq = tInv * tInv;
;;;27 const f32 tSq = t * t;
;;;28 const f32 t2tInv = (t * 2) * tInv;
000004 eddf0a16 VLDR s1,|L5.100|
000008 edd22a00 VLDR s5,[r2,#0]
00000c ee311a40 VSUB.F32 s2,s2,s0 ;25
000010 ee601a20 VMUL.F32 s3,s0,s1
000014 ee200a00 VMUL.F32 s0,s0,s0 ;27
000018 ee610a01 VMUL.F32 s1,s2,s2 ;26
00001c ee211a81 VMUL.F32 s2,s3,s2
.
.
.
000054 ed801a00 VSTR s2,[r0,#0]
000058 ed800a01 VSTR s0,[r0,#4]
;;;29
;;;30 return tInvSq * a + t2tInv * b + tSq * c;
;;;31 }
00005c e12fff1e BX lr
12.
13. Address Opcode Mnemonic Operands
00000000 E0804001 ADD R4,R0,R1
1. Branch instructions
2. Register Load and Store instructions
3. Data processing instructions
4. Coprocessor instructions
5. Status register access instructions
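The opcode column can be read field by field. The following is a minimal sketch, assuming the standard ARM (32-bit) data-processing encoding with a register operand and an immediate shift; the struct and function names are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstdint>

// Fields of an ARM data-processing instruction whose second operand
// is a register with an immediate shift (hypothetical helper type).
struct DataProcessingFields
{
    std::uint32_t cond;        // bits 31..28: condition (0xE = AL, always)
    std::uint32_t opcode;      // bits 24..21: operation (0x4 = ADD)
    bool          setsFlags;   // bit 20: S suffix
    std::uint32_t rn;          // bits 19..16: first operand register
    std::uint32_t rd;          // bits 15..12: destination register
    std::uint32_t shiftAmount; // bits 11..7: immediate shift amount
    std::uint32_t shiftType;   // bits 6..5: 0 = LSL, 1 = LSR, 2 = ASR, 3 = ROR
    std::uint32_t rm;          // bits 3..0: second operand register
};

DataProcessingFields decodeDataProcessing(std::uint32_t word)
{
    // This sketch only handles the register-operand form:
    // bits 27..25 must be 000 and bit 4 must be 0.
    assert(((word >> 25) & 0x7u) == 0u);
    assert(((word >> 4) & 0x1u) == 0u);

    DataProcessingFields f;
    f.cond        = (word >> 28) & 0xFu;
    f.opcode      = (word >> 21) & 0xFu;
    f.setsFlags   = ((word >> 20) & 0x1u) != 0u;
    f.rn          = (word >> 16) & 0xFu;
    f.rd          = (word >> 12) & 0xFu;
    f.shiftAmount = (word >> 7) & 0x1Fu;
    f.shiftType   = (word >> 5) & 0x3u;
    f.rm          = word & 0xFu;
    return f;
}
```

Decoding 0xE0804001 this way yields condition AL, opcode ADD, Rd = R4, Rn = R0 and Rm = R1, which matches the mnemonic shown on the slide.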
15. Address Opcode Mnemonic Operands
00000000 E12FFF1E BX LR
Branching instructions
B Branch
BX Branch with exchange (Thumb/ARM)
BL Branch with link
BLX Branch with link & exchange
16. Address Opcode Mnemonic Operands
00000000 E1D000F0 LDRSH R0,[R0,#0]
Register Load and Store instructions
LDR Load register from memory
STR Store register to memory
LDM Load multiple registers (32-bit aligned!)
STM Store multiple registers (32-bit aligned!)
Register Load and Store instructions (size suffixes)
B Byte (8-bit)
SB Signed byte (8-bit)
H Half word (16-bit)
SH Signed half word (16-bit)
D Double word (64-bit)
17. Address Opcode Mnemonic Operands
00000000 E1B030C6 ASRS R3,R6,#1
Data processing instructions
MOV Move to register
LSL Logical shift left
LSR Logical shift right
ASR Arithmetic shift right
18. Address Opcode Mnemonic Operands
00000000 E0854C2C ADD R4,R5,R12,LSR #24
Data processing instructions (arithmetic)
ADD Addition
ADC Addition with carry
SUB Subtraction
SBC Subtraction with carry
RSB Reverse subtraction
RSC Reverse subtraction with carry
MUL Multiply
MLA Multiply and accumulate
19. Address Opcode Mnemonic Operands
00000000 E2000003 AND R0,R0,#3
Data processing instructions (logical)
AND Logical AND
EOR Logical exclusive OR
ORR Logical OR
MVN Logical NOT
BIC Bit clear (combined logical AND NOT)
20. Address Opcode Mnemonic Operands
00000000 E3560000 CMP R6,#0
Data processing instructions (tests)
CMP Compare
CMN Compare negative
TST Test bits (logical AND)
TEQ Test bits (logical EOR)
21. Status Register
31 30 29 28 27 26..8 7 6 5 4..0
N Z C V Q I F T Mode
EQ Equal
NE Not equal
CS Carry
CC No carry
MI Negative
PL Positive
VS Overflow
VC No overflow
HI Higher
LS Lower or same
GE Greater or equal
LT Less than
GT Greater than
LE Less than or equal
AL Always
NV Never
N = negative
Z = zero
C = carry
V = overflow
Q = saturated
S suffix: data instruction updates CPSR
(Current Program Status Register)
22. Status Register
ANDS num, num, #1
ADDNE odd, odd, #1
ADDEQ even, even, #1
CMP age, #18
BGE |IsAdult|
23.
24. for (int i = 0; i < n; ++i)
{
// ...
}
int i = 0;
while (i < n)
{
// ...
++i;
}
25. MOV i, #0 ; i = 0
CMP i, n ; i < n?
BGE |Done| ; no, done
|Loop|
; ...
ADD i, i, #1 ; ++i
CMP i, n ; i < n?
BLT |Loop| ; yes, loop
|Done|
Initial test required, in case n <= 0
27. int i = 0;
do
{
// ...
} while(++i < n);
MOV i, #0 ; i = 0
|Loop|
; ...
ADD i, i, #1 ; ++i
CMP i, n ; i < n?
BLT |Loop| ; yes, loop
28. [Tip!] Use do {} while
Use do-while loops when the initial test isn’t required
Tip: replace initial test with an ‘assert(n > 0)’
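As a minimal sketch of this tip, the function below (a hypothetical example, not taken from the slides) moves the "n > 0" requirement into an assert so that the loop itself needs no initial test.

```cpp
#include <cassert>

// Sums the first n values. The caller guarantees n > 0, so the loop
// body can run before the first comparison.
int sumValues(const int* values, int n)
{
    assert(n > 0); // replaces the initial "i < n" test of a for/while loop

    int sum = 0;
    int i = 0;
    do
    {
        sum += values[i];
    } while (++i < n);
    return sum;
}
```

In release builds the assert compiles away, so the precondition has to hold at every call site.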
29. [Tip!] Count down loops
Count down in loops
where possible
int i = n - 1;
do
{
// ...
} while(--i >= 0);
30. Counting down:
SUB i, n, #1 ; i = n - 1
|Loop|
; ...
SUBS i, i, #1 ; --i >= 0?
BPL |Loop| ; yes, loop
Counting up:
MOV i, #0 ; i = 0
|Loop|
; ...
ADD i, i, #1 ; ++i
CMP i, n ; i < n?
BLT |Loop| ; yes, loop
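The count-down form saves the CMP because SUBS already sets the flags. A minimal C++ sketch of the same idea, using a hypothetical summing function and assuming the order of iteration does not matter:

```cpp
#include <cassert>

// Sums n values while counting down. The decrement and the loop test
// can share one flag-setting instruction (SUBS followed by BPL).
int sumCountingDown(const int* values, int n)
{
    assert(n > 0); // do-while form: the initial test is dropped

    int sum = 0;
    int i = n - 1;
    do
    {
        sum += values[i];
    } while (--i >= 0);
    return sum;
}
```

Whether the compiler actually emits SUBS/BPL depends on the compiler and its settings, so checking the generated assembly is still worthwhile.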
31. for (int i = n - 1; i >= 0; --i)
{
// ...
}
int i = n - 1;
while (i >= 0)
{
// ...
--i;
}
32. SUBS i, n, #1 ; i = n - 1
BMI |Done|
|Loop|
; ...
SUBS i, i, #1 ; --i >= 0?
BPL |Loop| ; yes, loop
|Done|
33. [Tip!] Improve Loop Unrolling
Intrinsic Description
__promise Allows the compiler to optimize loop unrolling
(also improves NEON vectorization)
// Promise the compiler that the loop
// iteration count is divisible by 16
__promise((n % 16) == 0);
for (int i = 0; i < n; i++)
{
// ...
}
34.
35. Pointer Aliasing
A compiler must assume two pointers could point to
the same location.
void Object::update(const State& state)
{
mAge += state.deltaTime;
mDelay -= state.deltaTime;
}
36. Pointer Aliasing
void Object::update(const State& state)
{
this->mAge += state.deltaTime;
this->mDelay -= state.deltaTime;
}
37. Pointer Aliasing
LDR r2,[r0,#0] ; load this->mAge
LDR r3,[r1,#0] ; load state.deltaTime
; interlock
ADD r2,r2,r3 ; mAge += state.deltaTime
STR r2,[r0,#0] ; store updated mAge
LDR r1,[r1,#0] ; reload state.deltaTime
LDR r2,[r0,#4] ; load this->mDelay
; interlock
SUB r1,r2,r1 ; mDelay -= state.deltaTime
STR r1,[r0,#4] ; store updated mDelay
BX lr ; return
38. Pointer Aliasing
Do not dereference multiple times; cache the value in a
local.
void Object::update(const State& state)
{
const int dt = state.deltaTime;
mAge += dt;
mDelay -= dt;
}
39. Pointer Aliasing
Or use __restrict to promise the compiler a certain
pointer does not alias other pointers.
void Object::update(const State& state)
__restrict // restrict the this pointer
{
mAge += state.deltaTime;
mDelay -= state.deltaTime;
}
40. Pointer Aliasing
This improves code generation tremendously!
LDR r12,[r1,#0] ; load state.deltaTime
LDM r0,{r2, r3} ; load mAge, mDelay
ADD r2,r2,r12 ; mAge += state.deltaTime
SUB r3,r3,r12 ; mDelay -= state.deltaTime
STM r0,{r2, r3} ; store mAge, mDelay
BX lr ; return
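A self-contained sketch of the "cache the value in a local" variant follows. The State and Object definitions are assumptions made up for the example (the slides do not show them), and the assert merely documents the no-alias expectation that __restrict would otherwise promise to the compiler.

```cpp
#include <cassert>

// Hypothetical types; the slides only show Object::update().
struct State
{
    int deltaTime;
};

class Object
{
public:
    Object(int age, int delay) : mAge(age), mDelay(delay) {}

    void update(const State& state)
    {
        // Expectation made explicit: state.deltaTime is not one of our members.
        assert(&state.deltaTime != &mAge && &state.deltaTime != &mDelay);

        const int dt = state.deltaTime; // loaded once, kept in a register
        mAge += dt;
        mDelay -= dt;
    }

    int age() const { return mAge; }
    int delay() const { return mDelay; }

private:
    int mAge;
    int mDelay;
};
```

Because dt is a local copy, the store to mAge cannot change it, and the compiler is free to skip the second load.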
44. Registers R0
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R13 SP
R14 LR
R15 PC
Sixteen 32-bit general purpose registers
Not many for a load/store architecture
PowerPC and MIPS have 32
AMD 29000 has 192 (!)
Arguments: R0..R3
Return address: R14 (LR)
Current PC
Current CPU mode (ARM/Thumb)
45. Registers
Arguments: R0..R3
Return address: R14 (LR)
Return value: R0, R1
46. [Tip!] Function arguments
bool Object::hit(int type, int damage, Object* pSource)
{
// R0 = this
// R1 = type
// R2 = damage
// R3 = pSource
...
// R0 = true
return true;
}
Do not pass more than four 32-bit (integer) arguments
Non-static class member functions: 3 arguments
(this pointer counts as argument)
47. [Tip!] Function arguments
s64 dontDoThis(s32 a, s64 b, s32 c)
{
// R0 = a
// R1
// R2, R3 = b
// [SP+0] = c
return a + b + c;
// R0, R1 = result
}
64-bit arguments require two registers
Must use R0, R1 or R2, R3
48. [Tip!] Function arguments
s64 Object::rememberThis(s64 b, s32 a)
{
// R0 = this
// R1
// R2, R3 = b
// [SP+0] = a
return a + b + this->c;
// R0, R1 = result
}
Member functions: this pointer alert!
49. [Tip!] Function arguments
s64 Object::rememberThis(s32 a, s64 b)
{
// R0 = this
// R1 = a
// R2, R3 = b
return a + b + this->c;
// R0, R1 = result
}
50. Registers
Arguments: R0..R3
Return address: R14 (LR)
Return value: R0, R1
32-bit!
51. [Tip!] Use 32 bits!
Use 32 bits (or multiples thereof) for:
Arguments
Locals
Return values
When using smaller types, the compiler has to take care of:
Wrap-around
Sign-extension
52. [Tip!] Use 32 bits!
short addRange(short a, short b, short* pData)
{
short result = 0;
do
{
result += pData[a++];
}
while (a <= b);
return result;
}
53. [Tip!] Use 32 bits!
MOV r3,#0
|Loop|
ADD r12,r2,r0,LSL #1
LDRH r12,[r12,#0]
ADD r0,r0,#1
LSL r0,r0,#16 ; wrap-around and...
ADD r3,r3,r12
ASR r0,r0,#16 ; sign-extend
LSL r3,r3,#16
CMP r0,r1
ASR r3,r3,#16
MOVGT r0,r3
BLE |Loop|
BX lr
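For comparison, here is a sketch of the same function with 32-bit arguments, locals and return value, so that no LSL/ASR pairs are needed for wrap-around and sign-extension. Note that this is an assumption-laden rewrite: widening the result means it no longer wraps at 16 bits, which is only acceptable if the caller never relied on that.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit variant of addRange: only the array elements stay 16-bit.
std::int32_t addRange(std::int32_t a, std::int32_t b, const std::int16_t* pData)
{
    // The do-while form runs at least once, so the range must not be empty.
    assert(pData != nullptr && a <= b);

    std::int32_t result = 0;
    do
    {
        result += pData[a++];
    }
    while (a <= b);
    return result;
}
```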
56. Division
The ARM cores covered here have no hardware integer division/modulo!
Avoid non-constant divisors
int thousandDividedBy(int d)
{
return 1000 / d;
}
MOV r1,r0
MOV r0,#1000
B __aeabi_idivmod
VFP alternative?
int thousandDividedBy(int d)
{
return int(1000 / (float)d);
}
VMOV s0,r0
VLDR s1,|Thousand|
VCVT.F32.S32 s0,s0
VDIV.F32 s2,s1,s0
VCVT.S32.F32 s0,s2
VMOV r0,s0
BX lr
|Thousand|
DCFS 0x447a0000 ; 1000
57. Division
Compiler can optimize constant divisors
int dividedByThousand(int d)
{
return d / 1000;
}
LDR r1,|DivisionMagic|
SMULL r1,r0,r1,r0
ASR r1,r0,#6
SUB r0,r1,r0,ASR #31
BX lr
|DivisionMagic|
DCD 0x10624dd3
int moduloThree(int d)
{
return d % 3;
}
LDR r1,|ModuloMagic|
SMULL r2,r1,r1,r0
SUB r1,r1,r1,ASR #31
SUB r1,r1,r1,LSL #2
ADD r0,r0,r1
BX lr
|ModuloMagic|
DCD 0x55555556
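The two listings above can be mirrored step by step in C++. This is a sketch of what the generated code computes, not a recommendation to write it by hand; it assumes that right-shifting a negative signed value is an arithmetic shift, which holds for the usual ARM compilers but was implementation-defined before C++20.

```cpp
#include <cassert>
#include <cstdint>

// d / 1000 via multiply-high with the magic constant 0x10624dd3.
std::int32_t dividedByThousand(std::int32_t d)
{
    assert((std::int32_t(-1) >> 1) == -1); // arithmetic shift assumed

    const std::int64_t product = std::int64_t(d) * 0x10624dd3; // SMULL
    const std::int32_t high = std::int32_t(product >> 32);     // upper 32 bits
    return (high >> 6) - (high >> 31); // ASR #6, then add 1 if negative
}

// d % 3 via multiply-high with the magic constant 0x55555556.
std::int32_t moduloThree(std::int32_t d)
{
    assert((std::int32_t(-1) >> 1) == -1); // arithmetic shift assumed

    const std::int64_t product = std::int64_t(d) * 0x55555556; // SMULL
    const std::int32_t high = std::int32_t(product >> 32);     // upper 32 bits
    const std::int32_t quotient = high - (high >> 31);         // d / 3
    return d - quotient * 3;                                   // remainder
}
```

Both functions round toward zero like the built-in operators, which is why the sign correction term (high >> 31) is needed.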
58. Division
Especially power of two divisors
int dividedByPower2(int d)
{
return d / 512;
}
ASR r1,r0,#31
ADD r0,r0,r1,LSR #23
ASR r0,r0,#9
BX lr
int moduloPower2(int d)
{
return d % 4;
}
ASR r1,r0,#31
ADD r1,r0,r1,LSR #30
BIC r1,r1,#3
SUB r0,r0,r1
BX lr
59. [Tip!] Signed vs Unsigned
Signed division and modulus are more complicated
Exception: -1 >> 1 == -1, whereas -1 / 2 == 0
60. [Tip!] Signed vs Unsigned
Use unsigned types where applicable!
u32 dividedByPower2(u32 d)
{
return d / 512U;
}
LSR r0,r0,#9
BX lr
u32 moduloPower2(u32 d)
{
return d % 4U;
}
AND r0, r0, #3
BX lr
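To make the difference concrete, the sketch below spells out in C++ what the signed and unsigned listings compute. The function names are hypothetical, and the signed version assumes an arithmetic right shift for negative values (true on the usual ARM compilers, implementation-defined before C++20).

```cpp
#include <cassert>
#include <cstdint>

// Signed d / 512: a plain shift would round toward negative infinity,
// so negative inputs get a bias of 511 first.
std::int32_t signedDividedBy512(std::int32_t d)
{
    assert((std::int32_t(-1) >> 1) == -1); // arithmetic shift assumed

    const std::uint32_t signMask = std::uint32_t(d >> 31);  // ASR #31: 0 or 0xFFFFFFFF
    const std::int32_t bias = std::int32_t(signMask >> 23); // LSR #23: 0 or 511
    return (d + bias) >> 9;                                 // ASR #9
}

// Unsigned d / 512: a single logical shift.
std::uint32_t unsignedDividedBy512(std::uint32_t d)
{
    return d >> 9;
}

// Unsigned d % 4: a single AND.
std::uint32_t unsignedModulo4(std::uint32_t d)
{
    return d & 3u;
}
```

The extra two instructions in the signed version exist only to make the result round toward zero, as C and C++ require.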
63.
64. Interworking
It is possible to switch between ARM & Thumb
instruction sets at run-time.
Bit 0 of the branch target address determines the instruction set (1 = Thumb, 0 = ARM).
Compiler allows us to switch between instruction sets
with #pragma directives.
Only possible to use these in translation units!
Doesn’t work for inline functions.
Doesn’t work for non-specialized template functions.
65. Switching to Thumb
// code16.h
//
// --- Thumb mode
#if defined(__MWERKS__)
// Codewarrior
# pragma thumb on
#elif defined(__ARMCC_VERSION)
// ARMCC/RVCT
# pragma thumb
#else
# error "Unknown compiler!"
#endif
66. Switching to ARM
// code32.h
//
// --- ARM mode
#if defined(__MWERKS__)
// Codewarrior
# pragma thumb off
#elif defined(__ARMCC_VERSION)
// ARMCC/RVCT
# pragma arm
#else
# error "Unknown compiler!"
#endif
69. ARM vs Thumb
ARM Thumb
Instruction size 32-bit (4 bytes) 16-bit (2 bytes)
Note: some branch
instructions take 4 bytes.
70. ARM vs Thumb
ARM Thumb
Instruction size 32-bit (4 bytes) 16-bit (2 bytes)
int add(int x, int y)
{
int result = x + y;
printf("%d + %d = %d\n", x, y, result);
return result;
}
ARM: 36 bytes (+16 bytes!)
4 PUSH {r4,lr}
4 ADD r4,r0,r1
4 MOV r2,r1
4 MOV r1,r0
4 MOV r3,r4
4 ADR r0,|String|
4 BL printf
4 MOV r0,r4
4 POP {r4,pc}
|String|
DCB "%d + %d = %d\n",0
Thumb: 20 bytes
2 PUSH {r4,lr}
2 ADDS r4,r0,r1
2 MOV r2,r1
2 MOV r1,r0
2 MOV r3,r4
2 ADR r0,|String|
4 BL printf
2 MOV r0,r4
2 POP {r4,pc}
|String|
DCB "%d + %d = %d\n",0
71. ARM vs Thumb
ARM Thumb
Conditional execution Nearly all instructions Branch instructions
bool isLetter(int c)
{
return ((c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z'));
}
Thumb (11 instructions):
MOV r1,r0
SUBS r1,r1,#'A'
CMP r1,#'Z' - 'A'
BLS |True|
SUBS r0,r0,#'a'
CMP r0,#'z' - 'a'
BHI |False|
|True|
MOVS r0,#1
BX lr
|False|
MOVS r0,#0
BX lr
ARM (7 instructions):
SUB r1,r0,#'A'
CMP r1,#'Z' - 'A'
SUBHI r0,r0,#'a'
CMPHI r0,#'z' - 'a'
MOVLS r0,#1
MOVHI r0,#0
BX lr
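Both listings rely on the same trick: subtracting the lower bound and doing one unsigned comparison (LS/HI) replaces two signed comparisons per range. A minimal C++ sketch of that trick follows; the reference function and the debug cross-check are additions for illustration only.

```cpp
#include <cassert>

// Straightforward version, as written on the slide.
bool isLetterReference(int c)
{
    return ((c >= 'A' && c <= 'Z') ||
            (c >= 'a' && c <= 'z'));
}

// Range check with one unsigned comparison per range: values below the
// lower bound wrap around to large unsigned numbers and fail the test.
bool isLetter(int c)
{
    const unsigned value = static_cast<unsigned>(c);
    const bool upper = (value - 'A') <= static_cast<unsigned>('Z' - 'A');
    const bool lower = (value - 'a') <= static_cast<unsigned>('z' - 'a');
    const bool result = upper || lower;

    assert(result == isLetterReference(c)); // debug-build cross-check
    return result;
}
```

Compilers typically apply this transformation themselves, so the point is to recognize it in the assembly output rather than to write it by hand.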
72. ARM vs Thumb
Note: the compiler must ensure type bool is true
(1) or false (0). Avoid [implicit] casting to bool
when it’s not required, as it’s expensive!
73. [Tip!] Boolean type
bool isBitSet(int flags, int bit)
{
return flags & (1 << bit);
}
MOVS r2,#1
LSLS r2,r2,r1
TST r2,r0
BEQ |False|
MOVS r0,#1
BX lr
|False|
MOVS r0,#0
BX lr
int isBitSet(int flags, int bit)
{
return flags & (1 << bit);
}
MOV r2,r0
MOVS r0,#1
LSLS r0,r0,r1
ANDS r0,r0,r2
BX lr
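The behavioural difference between the two return types can be sketched as follows. The name testBit for the int-returning variant is hypothetical (chosen so both versions can live in one file), and the asserts restrict the bit index to a range where the shift is well defined for a signed int.

```cpp
#include <cassert>

// Returns exactly true (1) or false (0): the compiler has to normalize
// the masked value, which costs extra instructions.
bool isBitSet(int flags, int bit)
{
    assert(bit >= 0 && bit < 31);
    return (flags & (1 << bit)) != 0;
}

// Returns the masked value itself: zero or non-zero, no normalization.
// Callers may only test the result for zero / non-zero.
int testBit(int flags, int bit)
{
    assert(bit >= 0 && bit < 31);
    return flags & (1 << bit);
}
```

Used directly in a condition such as `if (testBit(flags, 4))`, the int variant behaves the same as the bool one while avoiding the normalization step.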
74. ARM vs Thumb-2
Thumb-2 introduced the IT (if-then) instruction
Up to four instructions can be made conditional
ARM:
SUBS r1,r0,#'A'
CMP r1,#'Z' - 'A'
SUBHI r0,r0,#'a'
CMPHI r0,#'z' - 'a'
MOVLS r0,#1
MOVHI r0,#0
BX lr
Thumb-2:
MOV r1,r0
SUBS r1,r1,#'A'
CMP r1,#'Z' - 'A'
ITT HI
SUBHI r0,r0,#'a'
CMPHI r0,#'z' - 'a'
ITE LS
MOVLS r0,#1
MOVHI r0,#0
BX lr
Thumb:
MOV r1,r0
SUBS r1,r1,#'A'
CMP r1,#'Z' - 'A'
BLS |True|
SUBS r0,r0,#'a'
CMP r0,#'z' - 'a'
BHI |False|
|True|
MOVS r0,#1
BX lr
|False|
MOVS r0,#0
BX lr
75. ARM vs Thumb
ARM Thumb
Barrel shifter & ALU Accessible by data instructions Requires separate instructions
unsigned int reverseBytes(unsigned int x)
{
return (x << 24) |
((x << 8) & 0x00FF0000) |
((x >> 8) & 0x0000FF00) |
((x >> 24));
}
Thumb (13 instructions):
MOVS r3,#0xFF
LSLS r2,r0,#8
LSLS r3,r3,#16
ANDS r2,r2,r3
LSLS r1,r0,#24
ORRS r1,r1,r2
LSRS r2,r0,#8
ASRS r3,r3,#8
ANDS r2,r2,r3
ORRS r1,r1,r2
LSRS r0,r0,#24
ORRS r0,r0,r1
BX lr
ARM (8 instructions):
MOV r1,#0xFF,LSL #16
AND r1,r1,r0,LSL #8
MOV r2,#0xFF,LSL #8
ORR r1,r1,r0,LSL #24
AND r2,r2,r0,LSR #8
ORR r1,r1,r2
ORR r0,r1,r0,LSR #24
BX lr
76. ARM vs Thumb
ARM Thumb
Barrel shifter & ALU Accessible by data instructions Requires separate instructions
unsigned int reverseBytes(unsigned int x)
{
return (x << 24) |
((x << 8) & 0x00FF0000) |
((x >> 8) & 0x0000FF00) |
((x >> 24));
}
REV r0,r0
BX lr
ARMv6
77. ARM vs Thumb
Feature                 ARM           Thumb
Coprocessor interface   Yes           No
Long Multiply           Yes (ARMv4)   No
Count Leading Zeroes    Yes (ARMv5)   No
Saturated math          Yes (ARMv5)   No
DSP instructions        Yes (ARMv5)   No
SIMD instructions       Yes (ARMv6)   No
78. Summary: when to use Thumb
Use Thumb for functions which…
do not benefit from the ARM instruction-set
are not performance critical (e.g. initialization code)
#include <code16.h>
void Level::load(const std::string& path)
{
.
.
.
}
#include <codereset.h>
79. Summary: when to use ARM
Use ARM for functions which…
do benefit from the ARM instruction-set
are performance critical (e.g. called from inner loops)
#include <code32.h>
bool Ray::intersects(const Sphere& s)
{
.
.
.
}
#include <codereset.h>
80.
81. Intrinsic functions
Allows use of specialized CPU instructions in C/C++
Compiler can recognize patterns and might utilize
such specialized instructions:
unsigned int reverseBytes(unsigned int x)
{
return (x << 24) |
((x << 8) & 0x00FF0000) |
((x >> 8) & 0x0000FF00) |
((x >> 24));
}
REV r0,r0
BX lr
82. Intrinsic functions
Allows use of specialized CPU instructions in C/C++
Compiler can recognize patterns and might utilize
such specialized instructions.
More often the compiler does not. Check compiler
output!
Intrinsic functions are compiler specific; read the
manual!
83. Useful intrinsics
Intrinsic Description
__breakpoint Stops execution, informs the debugger
__disable_irq Sets the CPSR irq mask, returns previous state
__enable_irq Resets the CPSR irq mask, returns previous state
__ldrex Atomic reads
__strex Atomic writes
85. Useful intrinsics (algorithms)
Intrinsic Description
__usat/__ssat Unsigned/signed saturate (any power of 2)
__clz Count leading zeroes
__rbit Reverse bit order
__rev Reverse byte order
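As a reading aid for the table above, here are portable C++ reference versions of what these intrinsics compute. They are sketches with hypothetical names, useful as fallbacks or for unit tests on a host machine; the exact intrinsic signatures and operand ranges are compiler specific, and the 1 to 32 bit range for the signed saturate is an assumption based on the SSAT instruction.

```cpp
#include <cassert>
#include <cstdint>

// Like __clz: number of zero bits above the highest set bit (32 for x == 0).
unsigned countLeadingZeros(std::uint32_t x)
{
    unsigned count = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0u && (x & mask) == 0u; mask >>= 1)
    {
        ++count;
    }
    return count;
}

// Like __rev: reverse the order of the four bytes.
std::uint32_t reverseBytes(std::uint32_t x)
{
    return (x << 24) |
           ((x << 8) & 0x00FF0000u) |
           ((x >> 8) & 0x0000FF00u) |
           (x >> 24);
}

// Like __rbit: reverse the order of all 32 bits.
std::uint32_t reverseBits(std::uint32_t x)
{
    std::uint32_t result = 0;
    for (int i = 0; i < 32; ++i)
    {
        result = (result << 1) | (x & 1u);
        x >>= 1;
    }
    return result;
}

// Like __ssat: clamp value to the range of a signed integer with 'bits' bits.
std::int32_t saturateSigned(std::int32_t value, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    const std::int64_t high = (std::int64_t(1) << (bits - 1)) - 1;
    const std::int64_t low = -high - 1;
    if (value > high) return std::int32_t(high);
    if (value < low) return std::int32_t(low);
    return value;
}
```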
86. Useful intrinsics (SIMD)
Intrinsic Description
__usad[a]8|16 Sum of absolute differences (4x8, 2x16)
__[u][q]add8|16 [Saturated] addition (4x8, 2x16)
__[u][q]sub8|16 [Saturated] subtraction (4x8, 2x16)
etc. Check: http://infocenter.arm.com/help
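The SIMD intrinsics treat one 32-bit register as four 8-bit (or two 16-bit) lanes. The sketch below shows, in portable C++ with hypothetical function names, what an unsigned saturated 4x8 addition and a 4x8 sum of absolute differences compute; the real intrinsics do this in a single instruction, and their exact signatures should be taken from the compiler manual.

```cpp
#include <cassert>
#include <cstdint>

// Extracts one of the four 8-bit lanes (lane 0 is the lowest byte).
std::uint32_t lane(std::uint32_t word, unsigned index)
{
    assert(index < 4);
    return (word >> (index * 8)) & 0xFFu;
}

// Lane-wise unsigned addition that clamps at 255 instead of wrapping
// (the operation behind an unsigned saturated 4x8 add).
std::uint32_t saturatedAdd4x8(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t result = 0;
    for (unsigned i = 0; i < 4; ++i)
    {
        std::uint32_t sum = lane(a, i) + lane(b, i);
        if (sum > 0xFFu)
        {
            sum = 0xFFu;
        }
        result |= sum << (i * 8);
    }
    return result;
}

// Sum of the absolute differences of the four byte lanes.
std::uint32_t sumOfAbsoluteDifferences4x8(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t total = 0;
    for (unsigned i = 0; i < 4; ++i)
    {
        const std::uint32_t x = lane(a, i);
        const std::uint32_t y = lane(b, i);
        total += (x > y) ? (x - y) : (y - x);
    }
    return total;
}
```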