OptimizingARM

Optimizing for ARM architectures
Jan-Lieuwe Koopmans
Engine Software

ARM platforms
 GameBoy Advance
 ARM7TDMI @ 16.8 MHz (ARMv4)
 Nintendo DS
 ARM7TDMI @ 33 MHz (ARMv4)
 ARM946E-S @ 67 MHz (ARMv5)
 Nintendo DSi
 ARM7TDMI @ 33 MHz (ARMv4)
 ARM946E-S @ 133 MHz (ARMv5)
 Nintendo 3DS
 ARM11 MPCore @ 267 MHz (ARMv6k)

ARM platforms
 PlayStation Vita
 ARM Cortex-A9 MPCore (ARMv7)
 Apple iPhone/iPod/iPad
 ARM1176JZ(F)-S (ARMv6)
 ARM Cortex-A8 [Apple A4] (ARMv7)
 ARM Cortex-A9 [Apple A5] (ARMv7)
 Android

Key Features
 Multiple instruction sets
 ARM (powerful, 4 bytes/instruction)
 Thumb (simple, 2 bytes/instruction)
 Jazelle (Java™ bytecode execution)
 Variable cycle execution
 Load/store multiple
 Conditional execution
 Reduces branching
 Barrel shifter
 Complex instructions

Key Features
 DSP extensions (ARMv5TE, ARMv6, ARMv7)
 Single cycle 16x16 and 32x16 MAC
 Saturated math
 Count Leading Zeroes
 Load/store register pairs
 SIMD extensions (ARMv6, ARMv7)
 Simultaneous computation of 2x16-bit or 4x8-bit operands
 Fractional arithmetic
 User definable saturation modes (arbitrary word-width)
 Dual 16x16 multiply-add/subtract 32x32 fractional MAC
 Simultaneous 8/16-bit select operations

--asm Output assembly code as well as object code

-S Output assembly code instead of object code

--interleave Interleave source with disassembly
(use with --asm or -S)
;;;22 // calculate a point on a quadratic Bezier curve
;;;23 Vector2f math::bezier(const Vector2f& a, const Vector2f& b, const Vector2f& c, const f32 t)
000000 ed9f1a16 VLDR s2,|L5.96|
;;;24 {
;;;25 const f32 tInv = 1 - t;
;;;26 const f32 tInvSq = tInv * tInv;
;;;27 const f32 tSq = t * t;
;;;28 const f32 t2tInv = (t * 2) * tInv;
000004 eddf0a16 VLDR s1,|L5.100|
000008 edd22a00 VLDR s5,[r2,#0]
00000c ee311a40 VSUB.F32 s2,s2,s0 ;25
000010 ee601a20 VMUL.F32 s3,s0,s1
000014 ee200a00 VMUL.F32 s0,s0,s0 ;27
000018 ee610a01 VMUL.F32 s1,s2,s2 ;26
00001c ee211a81 VMUL.F32 s2,s3,s2
.
.
.
000054 ed801a00 VSTR s2,[r0,#0]
000058 ed800a01 VSTR s0,[r0,#4]
;;;29
;;;30 return tInvSq * a + t2tInv * b + tSq * c;
;;;31 }
00005c e12fff1e BX lr

Address Opcode Mnemonic Operands
00000000 E0804001 ADD R4,R0,R1
1. Branch instructions
2. Register Load and Store instructions
3. Data processing instructions
4. Coprocessor instructions
5. Status register access instructions

00000000 E12FFF1E BX LR
Branching instructions
B Branch
BX Branch with exchange (Thumb/ARM)
BL Branch with link
BLX Branch with link & exchange

00000000 E1D000F0 LDRSH R0,[R0,#0]
Register Load and Store instructions
LDR Load register from memory
STR Store register to memory
LDM Load multiple registers (32-bit aligned!)
STM Store multiple registers (32-bit aligned!)
Register Load and Store instructions (size suffixes)
B Byte (8-bit)
SB Signed byte (8-bit)
H Half word (16-bit)
SH Signed half word (16-bit)
D Double word (64-bit)

00000000 E1B030C6 ASRS R3,R6,#1
Data processing instructions
MOV Move to register
LSL Logical shift left
LSR Logical shift right
ASR Arithmetic shift right

00000000 E0854C2C ADD R4,R5,R12,LSR #24
Data processing instructions (arithmetic)
ADD Addition
ADC Addition with carry
SUB Subtraction
SBC Subtraction with carry
RSB Reverse subtraction
RSC Reverse subtraction with carry
MUL Multiply
MLA Multiply and accumulate

00000000 E2000003 AND R0,R0,#3
Data processing instructions (logical)
AND Logical AND
EOR Logical exclusive OR
ORR Logical OR
MVN Logical NOT
BIC Bit clear (combined logical AND NOT)

00000000 E3560000 CMP R6,#0
Data processing instructions (tests)
CMP Compare
CMN Compare negative
TST Test bits (logical AND)
TEQ Test bits (logical EOR)

Status Register
Bit 31 30 29 28 27 26..8 7 6 5 4..0
Flag N Z C V Q - I F T Mode
EQ Equal
NE Not equal
CS Carry
CC No carry
MI Negative
PL Positive
VS Overflow
VC No overflow
HI Higher
LS Lower or same
GE Greater or equal
LT Less than
GT Greater than
LE Less than or equal
AL Always
NV Never
N = negative
Z = zero
C = carry
V = overflow
Q = saturated
 S suffix: data instruction updates CPSR
(Current Program Status Register)

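ANDS num, num, #1 ; num &= 1 and update the flags (S suffix)
ADDNE odd, odd, #1 ; result != 0: ++odd
ADDEQ even, even, #1 ; result == 0: ++even
CMP age, #18 ; compare age with 18
BGE |IsAdult| ; branch if age >= 18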
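
As a rough C++ counterpart to the two snippets above (a sketch only; `Tally`, `classify` and `isAdult` are names made up for illustration), the conditional adds correspond to source code that needs no branches:

```cpp
#include <cassert>

// Hypothetical counters, standing in for the `odd` and `even` registers.
struct Tally
{
    int odd;
    int even;
};

// Mirrors ANDS / ADDNE / ADDEQ: exactly one counter is incremented.
void classify(int num, Tally& tally)
{
    const int lowBit = num & 1;   // ANDS  num, num, #1
    tally.odd += (lowBit != 0);   // ADDNE odd, odd, #1
    tally.even += (lowBit == 0);  // ADDEQ even, even, #1
}

// Mirrors CMP age, #18 followed by BGE |IsAdult|.
bool isAdult(int age)
{
    assert(age >= 0); // assumption of this sketch: ages are non-negative
    return age >= 18;
}
```

Whether a compiler turns `classify` into conditional instructions rather than branches depends on the compiler and the target; the assembly output is the only reliable check.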

for (int i = 0; i < n; ++i)
{
// ...
}
int i = 0;
while (i < n)
{
// ...
++i;
}

int i = 0;
do
{
// ...
} while(++i < n);
MOV i, #0 ; i = 0
|Loop|
; ...
ADD i, i, #1 ; ++i
CMP i, n ; i < n?
BLT |Loop| ; yes: next iteration

[Tip!] Use do {} while
 Use do-while loops when the initial test isn’t required
 Tip: replace initial test with an ‘assert(n > 0)’

[Tip!] Count down loops
 Count down in loops where possible
int i = n - 1;
do
{
// ...
} while(--i >= 0);

for (int i = n - 1; i >= 0; --i)
{
// ...
}
int i = n - 1;
while (i >= 0)
{
// ...
--i;
}
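
A minimal sketch of both loop tips in C++, assuming a hypothetical helper that sums `n` integers (the function names are not from the talk). Comparing against zero tends to be cheap on ARM because a flag-setting subtract (SUBS) already produces the condition the branch needs, so a separate CMP can disappear; whether a given compiler actually emits that should be checked in its assembly output.

```cpp
#include <cassert>

// do-while, counting up: the initial test is replaced by an assert.
int sumCountingUp(const int* values, int n)
{
    assert(n > 0);
    int total = 0;
    int i = 0;
    do
    {
        total += values[i];
    } while (++i < n);
    return total;
}

// do-while, counting down: the loop condition is a comparison with zero.
int sumCountingDown(const int* values, int n)
{
    assert(n > 0);
    int total = 0;
    int i = n - 1;
    do
    {
        total += values[i];
    } while (--i >= 0);
    return total;
}
```

Both functions visit the same elements, only in opposite order, so the count-down form is a drop-in replacement whenever the iteration order does not matter.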

[Tip!] Improve Loop Unrolling
Intrinsic Description
__promise Allows the compiler to optimize loop unrolling
(also improves NEON vectorization)
// Promise the compiler that the loop
// iteration count is divisible by 16
__promise((n % 16) == 0);
for (int i = 0; i < n; i++)
{
// ...
}

Pointer Aliasing
 A compiler must assume two pointers could point to
the same location.
void Object::update(const State& state)
{
mAge += state.deltaTime;
mDelay -= state.deltaTime;
}

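Pointer Aliasing
 Member access goes through the implicit this pointer, which could alias state:
void Object::update(const State& state)
{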
this->mAge += state.deltaTime;
this->mDelay -= state.deltaTime;
}

Pointer Aliasing
 The store to mAge could change state.deltaTime, so it must be loaded again:
LDR r2,[r0,#0] ; load this->mAge
LDR r3,[r1,#0] ; load state.deltaTime
; interlock
ADD r2,r2,r3 ; mAge += state.deltaTime
STR r2,[r0,#0] ; store updated mAge
LDR r1,[r1,#0] ; reload state.deltaTime
LDR r2,[r0,#4] ; load this->mDelay
; interlock
SUB r1,r2,r1 ; mDelay -= state.deltaTime
STR r1,[r0,#4] ; store updated mDelay
BX lr ; return

Pointer Aliasing
 Do not dereference multiple times; cache the value in a
local.
void Object::update(const State& state)
{
const int dt = state.deltaTime;
mAge += dt;
mDelay -= dt;
}
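
To see why the compiler cannot cache the value on its own, consider a simplified, hypothetical variant of the class in which the delta is passed as a plain `const int&` (the `State` struct is dropped to keep the sketch short, and the member function names are invented). If the caller passes a reference to `mAge` itself, the two versions legitimately produce different results:

```cpp
#include <cassert>

struct Object
{
    int mAge;
    int mDelay;

    // Reads deltaTime twice: when deltaTime aliases mAge, the second read
    // observes the store to mAge, so the compiler has to reload it.
    void updateReloading(const int& deltaTime)
    {
        mAge += deltaTime;
        mDelay -= deltaTime;
    }

    // Reads deltaTime once into a local, which cannot alias anything.
    void updateCached(const int& deltaTime)
    {
        const int dt = deltaTime;
        mAge += dt;
        mDelay -= dt;
    }
};

// Usage: passing mAge itself as deltaTime makes the difference visible.
void demonstrateAliasing()
{
    Object a = { 5, 100 };
    a.updateReloading(a.mAge); // mAge becomes 10, then mDelay = 100 - 10
    assert(a.mAge == 10 && a.mDelay == 90);

    Object b = { 5, 100 };
    b.updateCached(b.mAge);    // dt = 5: mAge becomes 10, mDelay = 100 - 5
    assert(b.mAge == 10 && b.mDelay == 95);
}
```

Caching in a local is therefore a change of meaning in the aliased case, which is exactly why the programmer, not the compiler, has to make that decision.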

Pointer Aliasing
 Or use __restrict to promise the compiler that a
pointer does not alias other pointers.
void Object::update(const State& state) __restrict // restrict the this pointer
{
mAge += state.deltaTime;
mDelay -= state.deltaTime;
}

Pointer Aliasing
 Using __restrict improves code generation tremendously!
LDR r12,[r1,#0] ; load state.deltaTime
LDM r0,{r2, r3} ; load mAge, mDelay
ADD r2,r2,r12 ; mAge += state.deltaTime
SUB r3,r3,r12 ; mDelay -= state.deltaTime
STM r0,{r2, r3} ; store mAge, mDelay
BX lr ; return

Registers
R0
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R13 SP
R14 LR
R15 PC
 Sixteen 32-bit general purpose registers
 Not many for a load/store architecture
 PowerPC and MIPS have 32
 AMD 29000 has 192 (!)

Registers
 Arguments: R0..R3

Registers
 Return address: R14 (LR)
 Current PC
 Current CPU mode (ARM/Thumb)

Registers
 Return value: R0, R1

[Tip!] Function arguments
bool Object::hit(int type, int damage, Object* pSource)
{
// R0 = this
// R1 = type
// R2 = damage
// R3 = pSource
...
// R0 = true
return true;
}
 Do not pass more than four 32-bit (integer) arguments
 Non-static class member functions: 3 arguments
(this pointer counts as argument)

s64 dontDoThis(s32 a, s64 b, s32 c)
{
// R0 = a
// R1 = unused (b must start at an even-numbered register)
// R2, R3 = b
// [SP+0] = c
return a + b + c;
// R0, R1 = result
}
 64-bit arguments require two registers
 Must use an even/odd register pair: R0, R1 or R2, R3

s64 Object::rememberThis(s64 b, s32 a)
{
// R0 = this
// R1 = unused (b must start at an even-numbered register)
// R2, R3 = b
// [SP+0] = a
return a + b + this->c;
// R0, R1 = result
}
 Member functions: this pointer alert!

s64 Object::rememberThis(s32 a, s64 b)
{
// R0 = this
// R1 = a
// R2, R3 = b
return a + b + this->c;
// R0, R1 = result
}
 Member functions: this pointer alert!

Registers
 Return value: R0, R1
 32-bit!

[Tip!] Use 32 bits!
 Use 32 bits (or multiples thereof) for:
 Arguments
 Locals
 Return values
 With smaller types, the compiler has to take care of:
 Wrap-around
 Sign-extension

[Tip!] Use 32 bits!
short addRange(short a, short b, short* pData)
{
short result = 0;
do
{
result += pData[a++];
}
while (a <= b);
return result;
}

[Tip!] Use 32 bits!
MOV r3,#0
|Loop|
ADD r12,r2,r0,LSL #1
LDRH r12,[r12,#0]
ADD r0,r0,#1
LSL r0,r0,#16 ; wrap-around and...
ADD r3,r3,r12
ASR r0,r0,#16 ; sign-extend
LSL r3,r3,#16
CMP r0,r1
ASR r3,r3,#16
MOVGT r0,r3
BLE |Loop|
BX lr

[Tip!] Use 32 bits! (ARMv6)
MOV r3,#0
|Loop|
ADD r12,r2,r0,LSL #1
ADD r0,r0,#1
LDRH r12,[r12,#0]
SXTH r0,r0 ; sign-extend halfword
CMP r0,r1
ADD r3,r3,r12
SXTH r3,r3
MOVGT r0,r3
BLE |Loop|
BX lr
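
Following the tip, a sketch of the same function with 32-bit arguments, local and return value (the name `addRange32` is made up here). Only the array elements stay 16-bit. Note the assumption: this version no longer wraps the running sum to 16 bits, so it matches the original only as long as the sum fits in a `short`.

```cpp
#include <cassert>

// 32-bit index, accumulator and result: no wrap-around or sign-extension
// fix-ups are needed inside the loop.
int addRange32(int a, int b, const short* pData)
{
    assert(a <= b); // do-while: at least one element is always added
    int result = 0;
    do
    {
        result += pData[a++];
    }
    while (a <= b);
    return result;
}
```

With this signature the per-iteration shift pairs (or SXTH instructions) in the listings above have nothing left to do, which is the point of the tip; the actual output still depends on the compiler.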

Division
 ARM has no hardware integer division/modulo!
 Avoid non-constant divisors
int thousandDividedBy(int d)
{
return 1000 / d;
}
MOV r1,r0
MOV r0,#1000
B __aeabi_idivmod
VFP alternative?
int thousandDividedBy(int d)
{
return int(1000 / (float)d);
}
VMOV s0,r0
VLDR s1,|Thousand|
VCVT.F32.S32 s0,s0
VDIV.F32 s2,s1,s0
VCVT.S32.F32 s0,s2
VMOV r0,s0
BX lr
|Thousand|
DCFS 0x447a0000 ; 1000

Division
 Compiler can optimize constant divisors
int dividedByThousand(int d)
{
return d / 1000;
}
LDR r1,|DivisionMagic|
SMULL r1,r0,r1,r0
ASR r1,r0,#6
SUB r0,r1,r0,ASR #31
BX lr
|DivisionMagic|
DCD 0x10624dd3
int moduloThree(int d)
{
return d % 3;
}
LDR r1,|ModuloMagic|
SMULL r2,r1,r1,r0
SUB r1,r1,r1,ASR #31
SUB r1,r1,r1,LSL #2
ADD r0,r0,r1
BX lr
|ModuloMagic|
DCD 0x55555556
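
The multiply-and-shift sequences above can be mirrored in C++. This is a sketch, not what any particular compiler guarantees to emit: the function bodies follow the listings step by step, the magic constants are the ones shown (0x10624dd3 is 2^38 / 1000 rounded up, 0x55555556 is 2^32 / 3 rounded up), and the code assumes that `>>` on a negative signed value is an arithmetic shift, which is implementation-defined before C++20. The helper `mulHigh` and the `ByMultiply` names are invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// High 32 bits of the signed 64-bit product, like the RdHi result of SMULL.
int32_t mulHigh(int32_t a, int32_t b)
{
    return static_cast<int32_t>((static_cast<int64_t>(a) * b) >> 32);
}

// d / 1000 without a divide instruction.
int32_t dividedByThousandByMultiply(int32_t d)
{
    const int32_t hi = mulHigh(d, 0x10624dd3); // SMULL
    return (hi >> 6) - (hi >> 31);             // ASR #6, then add 1 if negative
}

// d % 3 computed as d - 3 * (d / 3).
int32_t moduloThreeByMultiply(int32_t d)
{
    const int32_t hi = mulHigh(d, 0x55555556); // SMULL
    const int32_t quotient = hi - (hi >> 31);  // add 1 if negative
    return d - 3 * quotient;
}

// Usage: compare against the built-in operators over a small range.
void checkAgainstOperators()
{
    for (int32_t d = -5000; d <= 5000; ++d)
    {
        assert(dividedByThousandByMultiply(d) == d / 1000);
        assert(moduloThreeByMultiply(d) == d % 3);
    }
}
```

The correction term `hi >> 31` is -1 for negative inputs and 0 otherwise; subtracting it moves the result from rounding down to rounding toward zero, as C++ division requires.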

Division
 Compiler can optimize constant divisors
 Especially power of two divisors
int dividedByPower2(int d)
{
return d / 512;
}
ASR r1,r0,#31
ADD r0,r0,r1,LSR #23
ASR r0,r0,#9
BX lr
int moduloPower2(int d)
{
return d % 4;
}
ASR r1,r0,#31
ADD r1,r0,r1,LSR #30
BIC r1,r1,#3
SUB r0,r0,r1
BX lr

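[Tip!] Signed vs Unsigned
int dividedByPower2(int d)
{
return d / 512;
}
ASR r1,r0,#31
ADD r0,r0,r1,LSR #23
ASR r0,r0,#9
BX lr
int moduloPower2(int d)
{
return d % 4;
}
ASR r1,r0,#31
ADD r1,r0,r1,LSR #30
BIC r1,r1,#3
SUB r0,r0,r1
BX lr
 Signed division and modulus are more complicated
 Reason: a shift rounds toward negative infinity (-1 >> 1 == -1), while division rounds toward zero (-1 / 2 == 0)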

u32 dividedByPower2(u32 d)
{
return d / 512U;
}
LSR r0,r0,#9
BX lr
u32 moduloPower2(u32 d)
{
return d % 4U;
}
AND r0, r0, #3
BX lr
 Use unsigned types where applicable!
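
The difference between the signed and unsigned cases can be sketched in C++ as follows. The signed versions add a bias (511 or 3) to negative inputs before shifting or masking, which is what the extra ASR/ADD instructions do; the unsigned versions are a single shift or mask. The code assumes 32-bit two's complement integers and an arithmetic `>>` on negative values (implementation-defined before C++20), and the overload set is only an illustration.

```cpp
#include <cassert>
#include <cstdint>

// Signed d / 512: bias negative inputs so the shift rounds toward zero.
int32_t dividedByPower2(int32_t d)
{
    const uint32_t sign = static_cast<uint32_t>(d >> 31);  // 0 or 0xFFFFFFFF
    const int32_t bias = static_cast<int32_t>(sign >> 23); // 0 or 511
    return (d + bias) >> 9;
}

// Signed d % 4: subtract the multiple of 4 that division would remove.
int32_t moduloPower2(int32_t d)
{
    const uint32_t sign = static_cast<uint32_t>(d >> 31);  // 0 or 0xFFFFFFFF
    const int32_t bias = static_cast<int32_t>(sign >> 30); // 0 or 3
    return d - ((d + bias) & ~3);
}

// Unsigned versions: one instruction each (LSR, AND).
uint32_t dividedByPower2(uint32_t d) { return d >> 9; }
uint32_t moduloPower2(uint32_t d) { return d & 3u; }

// Usage: compare against the built-in operators over a small range.
void checkPower2()
{
    for (int32_t d = -2000; d <= 2000; ++d)
    {
        assert(dividedByPower2(d) == d / 512);
        assert(moduloPower2(d) == d % 4);
    }
    for (uint32_t u = 0; u <= 2000u; ++u)
    {
        assert(dividedByPower2(u) == u / 512u);
        assert(moduloPower2(u) == u % 4u);
    }
}
```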


Interworking
 It is possible to switch between ARM & Thumb
instruction sets at run-time.
 Bit 0 of the branch target address selects the instruction set (1 = Thumb, 0 = ARM).
 Compiler allows us to switch between instruction sets
with #pragma directives.
 Only possible to use these in translation units!
 Doesn’t work for inline functions.
 Doesn’t work for non-specialized template functions.

Switching to Thumb
// code16.h
//
// --- Thumb mode
#if defined(__MWERKS__)
// Codewarrior
# pragma thumb on
#elif defined(__ARMCC_VERSION)
// ARMCC/RVCT
# pragma thumb
#else
# error "Unknown compiler!"
#endif

Switching to ARM
// code32.h
//
// --- ARM mode
#if defined(__MWERKS__)
// Codewarrior
# pragma thumb off
#elif defined(__ARMCC_VERSION)
// ARMCC/RVCT
# pragma arm
#else
# error "Unknown compiler!"
#endif

Switching to default
// codereset.h
//
// --- default mode
#if defined(EFFORT_SMALL)
#include <code16.h>
#else
#include <code32.h>
#endif

#include <code16.h>
Object::Object()
{
// ...
}
Object::~Object()
{
// ...
}
#include <codereset.h>
#include <code32.h>
void Object::update(int ticks)
{
// ...
}

ARM vs Thumb
                 ARM               Thumb
Instruction size 32-bit (4 bytes)  16-bit (2 bytes)
Note: some Thumb branch instructions (e.g. BL) take 4 bytes.

ARM vs Thumb
ARM Thumb
Instruction size 32-bit (4 bytes) 16-bit (2 bytes)
int add(int x, int y)
{
int result = x + y;
printf("%d + %d = %d\n", x, y, result);
return result;
}
ARM: 36 bytes (+16 bytes!)
4 PUSH {r4,lr}
4 ADD r4,r0,r1
4 MOV r2,r1
4 MOV r1,r0
4 MOV r3,r4
4 ADR r0,|String|
4 BL printf
4 MOV r0,r4
4 POP {r4,pc}
|String|
DCB "%d + %d = %d\n",0
Thumb: 20 bytes
2 PUSH {r4,lr}
2 ADDS r4,r0,r1
2 MOV r2,r1
2 MOV r1,r0
2 MOV r3,r4
2 ADR r0,|String|
4 BL printf
2 MOV r0,r4
2 POP {r4,pc}
|String|
DCB "%d + %d = %d\n",0

ARM vs Thumb
ARM Thumb
Conditional execution Nearly all instructions Branch instructions
bool isLetter(int c)
{
return ((c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z'));
}
Thumb: 11 instructions
MOV r1,r0
SUBS r1,r1,#'A'
CMP r1,#'Z' - 'A'
BLS |True|
SUBS r0,r0,#'a'
CMP r0,#'z' - 'a'
BHI |False|
|True|
MOVS r0,#1
BX lr
|False|
MOVS r0,#0
BX lr
ARM: 7 instructions
SUB r1,r0,#'A'
CMP r1,#'Z' - 'A'
SUBHI r0,r0,#'a'
CMPHI r0,#'z' - 'a'
MOVLS r0,#1
MOVHI r0,#0
BX lr


[Tip!] Boolean type
bool isBitSet(int flags, int bit)
{
return flags & (1 << bit);
}
MOVS r2,#1
LSLS r2,r2,r1
TST r2,r0
BEQ |False|
MOVS r0,#1
BX lr
|False|
MOVS r0,#0
BX lr
int isBitSet(int flags, int bit)
{
return flags & (1 << bit);
}
MOV r2,r0
MOVS r0,#1
LSLS r0,r0,r1
ANDS r0,r0,r2
BX lr
Note: the compiler must ensure type bool is true
(1) or false (0). Avoid [implicit] casting to bool
when it’s not required, as it’s expensive!
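
A small sketch of what the two return types mean for the caller (the `Bool` and `Int` suffixes are added here only so both versions can coexist in one file). The `bool` version must normalize the result to 0 or 1; the `int` version returns the masked bit as-is, which is still fine for any caller that only tests it for truth.

```cpp
#include <cassert>

// Returns exactly true or false: the compiler has to normalize the mask.
bool isBitSetBool(int flags, int bit)
{
    assert(bit >= 0 && bit < 31); // keep the shift well-defined
    return (flags & (1 << bit)) != 0;
}

// Returns the masked bit itself: zero, or the bit's value (not 1).
int isBitSetInt(int flags, int bit)
{
    assert(bit >= 0 && bit < 31); // keep the shift well-defined
    return flags & (1 << bit);
}
```

Code such as `if (isBitSetInt(flags, 2))` behaves identically with either version; only callers that compare the result against `1` or `true` need the normalized form.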

ARM vs Thumb-2
 Thumb-2 introduced the IT (if-then) instruction
 Up to four instructions can be made conditional
Thumb-2
MOV r1,r0
SUBS r1,r1,#'A'
CMP r1,#'Z' - 'A'
ITT HI
SUBHI r0,r0,#'a'
CMPHI r0,#'z' - 'a'
ITE LS
MOVLS r0,#1
MOVHI r0,#0
BX lr
ARM
SUBS r1,r0,#'A'
CMP r1,#'Z' - 'A'
SUBHI r0,r0,#'a'
CMPHI r0,#'z' - 'a'
MOVLS r0,#1
MOVHI r0,#0
BX lr
Thumb
MOV r1,r0
SUBS r1,r1,#'A'
CMP r1,#'Z' - 'A'
BLS |True|
SUBS r0,r0,#'a'
CMP r0,#'z' - 'a'
BHI |False|
|True|
MOVS r0,#1
BX lr
|False|
MOVS r0,#0
BX lr

ARM vs Thumb
ARM Thumb
Barrel shifter & ALU Accessible by data instructions Requires separate instructions
unsigned int reverseBytes(unsigned int x)
{
return (x << 24) |
((x << 8) & 0x00FF0000) |
((x >> 8) & 0x0000FF00) |
((x >> 24));
}
Thumb: 13 instructions
MOVS r3,#0xFF
LSLS r2,r0,#8
LSLS r3,r3,#16
ANDS r2,r2,r3
LSLS r1,r0,#24
ORRS r1,r1,r2
LSRS r2,r0,#8
ASRS r3,r3,#8
ANDS r2,r2,r3
ORRS r1,r1,r2
LSRS r0,r0,#24
ORRS r0,r0,r1
BX lr
ARM: 8 instructions
MOV r1,#0xFF,LSL #16
AND r1,r1,r0,LSL #8
MOV r2,#0xFF,LSL #8
ORR r1,r1,r0,LSL #24
AND r2,r2,r0,LSR #8
ORR r1,r1,r2
ORR r0,r1,r0,LSR #24
BX lr

ARM vs Thumb
ARM Thumb
Barrel shifter & ALU Accessible by data instructions Requires separate instructions
unsigned int reverseBytes(unsigned int x)
{
return (x << 24) |
((x << 8) & 0x00FF0000) |
((x >> 8) & 0x0000FF00) |
((x >> 24));
}
ARMv6:
REV r0,r0
BX lr

ARM vs Thumb
Feature               ARM          Thumb
Coprocessor interface Yes          No
Long Multiply         Yes (ARMv4)  No
Count Leading Zeroes  Yes (ARMv5)  No
Saturated math        Yes (ARMv5)  No
DSP instructions      Yes (ARMv5)  No
SIMD instructions     Yes (ARMv6)  No

Summary: when to use Thumb
 Use Thumb for functions which…
 do not benefit from the ARM instruction-set
 are not performance critical (e.g. initialization code)
#include <code16.h>
void Level::load(const std::string& path)
{
.
.
.
}

Summary: when to use ARM
 Use ARM for functions which…
 do benefit from the ARM instruction-set
 are performance critical (e.g. called from inner loops)
#include <code32.h>
bool Ray::intersects(const Sphere& s)
{
.
.
.
}

Intrinsic functions
 Allows use of specialized CPU instructions in C/C++
 Compiler can recognize patterns and might utilize
such specialized instructions:
unsigned int reverseBytes(unsigned int x)
{
return (x << 24) |
((x << 8) & 0x00FF0000) |
((x >> 8) & 0x0000FF00) |
((x >> 24));
}
REV r0,r0
BX lr

Intrinsic functions
 More often the compiler does not recognize the pattern. Check the compiler output!
 Intrinsic functions are compiler specific; read the manual!

Useful intrinsics
__breakpoint Stops execution, informs the debugger
__disable_irq Sets the CPSR irq mask, returns previous state
__enable_irq Resets the CPSR irq mask, returns previous state
__ldrex Atomic reads
__strex Atomic writes

Useful intrinsics (cache)
__pld Preload data
__pldw Preload data for writing
__pli Preload instructions

Useful intrinsics (algorithms)
__usat/__ssat Unsigned/signed saturate (any power of 2)
__clz Count leading zeroes
__rbit Reverse bit order
__rev Reverse byte order

Useful intrinsics (SIMD)
__usad[a]8|16 Sum of absolute differences (4x8, 2x16)
__[u][q]add8|16 [Saturated] addition (4x8, 2x16)
__[u][q]sub8|16 [Saturated] subtraction (4x8, 2x16)
etc. Check: http://infocenter.arm.com/help

Intrinsic functions
struct RGBA
{
union
{
struct
{
u8 r, g, b, a;
};
u32 c;
};
RGBA& operator += (RGBA o)
{
r += o.r;
g += o.g;
b += o.b;
a += o.a;
return *this;
}
};
struct RGBA
{
union
{
struct
{
u8 r, g, b, a;
};
u32 c;
};
RGBA& operator += (RGBA o)
{
c = __uadd8(c, o.c);
return *this;
}
};

Intrinsic functions
PUSH {r4}
AND r2,r1,r0,ASR #24
ADD r1,r1,r0
AND r1,r1,#0xff
ORR r1,r1,r2
LSL r3,r0,#16
LSL r4,r0,#8
LSR r2,r0,#24
LSL r0,r1,#16
LSR r12,r3,#24
BIC r1,r1,#0xff00
LSL r0,r0,#8
AND r0,r0,#0xff00
ORR r0,r0,r1
BIC r1,r0,#0xff0000
LSL r12,r0,#8
LSR r0,r12,#24
POP {r4}
LSL r0,r0,#16
AND r0,r0,#0xff0000
ORR r0,r0,r1
BIC r3,r0,#0xff000000
ORR r0,r3,r0,LSL #24
struct RGBA
{
union
{
struct
{
u8 r, g, b, a;
};
u32 c;
};
RGBA& operator += (RGBA o)
{
r += o.r;
g += o.g;
b += o.b;
a += o.a;
return *this;
}
};

Intrinsic functions
UADD8 r0,r0,r1
struct RGBA
{
union
{
struct
{
u8 r, g, b, a;
};
u32 c;
};
{
c = __uadd8(c, o.c);
return *this;
}
};
Note: I could have demonstrated __uqadd8, which saturates the
results to the 8-bit unsigned integer range 0 ≤ x ≤ 2^8 - 1.
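
For readers without an ARM compiler at hand, the per-lane behaviour of the two operations can be emulated in portable C++. This is only a model of the result value, written for illustration: the function names are invented, the GE flags that the real UADD8 instruction also updates are ignored, and the loop is nowhere near as fast as the single instruction.

```cpp
#include <cassert>
#include <cstdint>

// Models the result of UADD8: four independent 8-bit adds that wrap around.
uint32_t uadd8Portable(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane)
    {
        const int shift = lane * 8;
        const uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        result |= (sum & 0xFFu) << shift; // keep the low 8 bits (wrap)
    }
    return result;
}

// Models the result of UQADD8: four independent 8-bit adds clamped to 255.
uint32_t uqadd8Portable(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane)
    {
        const int shift = lane * 8;
        const uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        const uint32_t clamped = (sum > 0xFFu) ? 0xFFu : sum; // saturate
        result |= clamped << shift;
    }
    return result;
}

// Usage: the lane holding 0xFF + 0x01 wraps to 0x00 or saturates to 0xFF.
void demonstrateLanes()
{
    assert(uadd8Portable(0x01FF0203u, 0x01010101u) == 0x02000304u);
    assert(uqadd8Portable(0x01FF0203u, 0x01010101u) == 0x02FF0304u);
}
```

For colour arithmetic such as the RGBA example, the saturating form is usually the one wanted, since a wrapped channel turns a bright value into a dark one.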
