Cranking Floating Point Performance To 11 On The iPhone

Cranking Floating Point
Performance Up To 11
Noel Llopis
Snappy Touch

http://twitter.com/snappytouch
noel@snappytouch.com
http://gamesfromwithin.com

void* p = &s_particles2[0];
asm volatile (
"fldmias %1, {s0} nt"
"mov r1, %0 nt"
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"fldmias r1!, {s8-s13} nt"
"fmacs s8, s16, s0 nt"
"fmuls s16, s16, s1 nt"
"fstmias r2!, {s8-s13} nt"
"subs r3, r3, #1 nt"
"bne 0b nt"
:
: "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
: "r0", "r1", "r2", "r3", "cc", "memory"
);

“Don’t do that bit-
twiddling thing”

twiddling thing”

“Optimize at the
algorithm level”

twiddling thing”

“Optimize at the
algorithm level”

Yes, but key to good performance
is looking at your data and your
target platform

Floating point numbers

• Representation of rational numbers


• 1.2345, -0.8374, 2.0000, 14388439.34, etc


• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE 754 format


• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Single precision: 32 bits


• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Single precision: 32 bits
• Double precision: 64 bits

Why ﬂoating point
performance?

Why ﬂoating point
performance?

• Most games use ﬂoating point numbers for
most of their calculations

Why ﬂoating point
performance?

• Positions, velocities, physics, etc, etc.

Why ﬂoating point
performance?

• Positions, velocities, physics, etc, etc.
• Maybe not so much for regular apps

CPU

• 32-bit RISC ARM 11
• 400-535Mhz

CPU

• 32-bit RISC ARM 11
• 400-535Mhz
• iPhone 2G/3G and iPod
Touch 1st and 2nd gen

CPU (iPhone 3GS)

• Cortex-A8 600MHz

CPU (iPhone 3GS)

• Cortex-A8 600MHz
• More advanced
architecture

CPU

• No ﬂoating point support
in the ARM CPU!!!

How about integer
math?

• No need to do any ﬂoating point
operations

How about integer
math?

operations
• Fully supported in the ARM processor

How about integer
math?

operations
• Fully supported in the ARM processor
• But...

Integer Divide

There is no integer divide

Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it

• You need to represent rational numbers

• Can use a ﬁxed-point library.

• Performs rational arithmetic with integer
values at a reduced range/resolution.

• Performs rational arithmetic with integer
values at a reduced range/resolution.
• Not so great...

Floating point support
• There’s a ﬂoating point
unit

• There’s a ﬂoating point
unit

• Compiled C/C++/ObjC
code uses the VFP unit
for any ﬂoating point
operations.

Sample program
struct Particle
{
float x, y, z;
float vx, vy, vz;
};

Sample program
struct Particle for (int i=0; i<MaxParticles; ++i)
{ {
float x, y, z; Particle& p = s_particles[i];
float vx, vy, vz; p.x += p.vx*dt;
}; p.y += p.vy*dt;
p.z += p.vz*dt;
p.vx *= drag;
p.vy *= drag;
p.vz *= drag;
}

Sample program
struct Particle for (int i=0; i<MaxParticles; ++i)
{ {
float x, y, z; Particle& p = s_particles[i];
float vx, vy, vz; p.x += p.vx*dt;
}; p.y += p.vy*dt;
p.z += p.vz*dt;
p.vx *= drag;
p.vy *= drag;
p.vz *= drag;
}

• 14.1 seconds on an iPod Touch 2nd gen


Trust no one!


Trust no one!

When in doubt, check the
assembly generated

Thumb Mode
• CPU has a special thumb mode.

Thumb Mode
• Less memory, maybe better
performance.

Thumb Mode
performance.
• No ﬂoating point support.

Thumb Mode
performance.
• No ﬂoating point support.
• Every timeitthere’s an fp of
operation, switches out
Thumb, does the fp operation,
and switches back on.

Thumb Mode

• It’s on by default!

Thumb Mode

• It’s on by default!
• Potentiallyoff. wins
turning it
HUGE

Thumb Mode

• Turning off Thumb mode increased
performance in Flower Garden by over 2x

Thumb Mode

• Heavy usage of ﬂoating point operations
though

Thumb Mode

• Heavy usage of ﬂoating point operations
though
• Most games will probably beneﬁt from
turning it off (especially 3D games)

ARM assembly
DISCLAIMER:
I’m not an ARM assembly expert!!!

ARM assembly
DISCLAIMER:
I’m not an ARM assembly expert!!!

Z80!!!

ARM assembly

• Hit the docs

ARM assembly

• Hit the docs
• References included in your USB card

ARM assembly

• Hit the docs
• Or download them from the ARM site

ARM assembly

• Hit the docs
• Or download them from the ARM site
• http://bit.ly/arminfo

ARM assembly

• Reading assembly is a very important skill
for high-performance programming

ARM assembly

• Reading assembly is a very important skill
for high-performance programming
• Writing is more specialized. Most people
don’t need to.

VFP unit
A0
+
B0
=
C0

A1
+
B1
=
C1

VFP unit
A0 A2
+ +
B0 B2
= =
C0 C2

A1
+
B1
=
C1

VFP unit
A0 A2
+ +
B0 B2
= =
C0 C2

A1 A3
+ +
B1 B3
= =
C1 C3

VFP unit
A0 A1 A2 A3

+

VFP unit
A0 A1 A2 A3

+
B0 B1 B2 B3

VFP unit
A0 A1 A2 A3

+
B0 B1 B2 B3

=

VFP unit
A0 A1 A2 A3

+
B0 B1 B2 B3

=
C0 C1 C2 C3

VFP unit
A0 A1 A2 A3

+
B0 B1 B2 B3

=
C0 C1 C2 C3

Sweet! How do we
use the vfp?

Like this!

"fldmias %2, {s8-s23} nt"
"fldmias %1!, {s0-s3} nt"

"fldmias %1!, {s4-s7} nt"

"fstmias %0!, {s24-s27} nt"

Writing vfp assembly

• There are two parts to it


• How to write any assembly in gcc


• How to write any assembly in gcc
• Learning ARM and VPM assembly

vfpmath library

• Already done a lot of work for you

vfpmath library

• http://code.google.com/p/vfpmathlibrary

vfpmath library

• Vector/matrix math

vfpmath library

• Vector/matrix math
• Might not be exactly what you need, but it’s
a great starting point

Assembly in gcc
• Only use it when targeting the device

Assembly in gcc
• Only use it when targeting the device
#include <TargetConditionals.h>
#if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1)
#define USE_VFP
#endif

Assembly in gcc
• The basics

asm (“cmp r2, r1”);

Assembly in gcc
• The basics

asm (“cmp r2, r1”);

http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-
HOWTO.html

Assembly in gcc
• Multiple lines
asm (
“mov r0, #1000nt”
“cmp r2, r1nt”
);

Assembly in gcc
• Accessing C variables
asm (//assembly code
: // output operands
: // input operands
: // clobbered registers
);

Assembly in gcc
: // input operands
);

int src = 19;
int dest = 0;

asm volatile (
"add %0, %1, #42"
: "=r" (dest)
: "r" (src)
:
);

Assembly in gcc
: // input operands
);

int src = 19;
int dest = 0;
%0, %1, etc are the
variables in order
asm volatile (
"add %0, %1, #42"
: "=r" (dest)
: "r" (src)
:
);

Assembly in gcc
int src = 19;
int dest = 0;

asm volatile (
"add r10, %1, #42nt"
"add %0, r10, #33nt"
: "=r" (dest)
: "r" (src)
: "r10"
);

Assembly in gcc
int src = 19;
int dest = 0;

asm volatile (
"add r10, %1, #42nt"
"add %0, r10, #33nt"
: "=r" (dest)
: "r" (src)
: "r10"
);

Clobber register list
are registers used by
the asm block

Assembly in gcc
int src = 19; volatile prevents “optimizations”
int dest = 0;

asm volatile (
"add r10, %1, #42nt"
"add %0, r10, #33nt"
: "=r" (dest)
: "r" (src)
: "r10"
);

Clobber register list
are registers used by
the asm block

VFP asm
Four banks of 8 32-bit registers each

Can address them as single precision
or as doubles

VFP asm

#define VFP_VECTOR_LENGTH(VEC_LENGTH)
"fmrx r0, fpscr nt"
"bic r0, r0, #0x00370000 nt"
"orr r0, r0, #0x000" #VEC_LENGTH "0000 nt"
"fmxr fpscr, r0 nt"

VFP asm

Bank 0 is always scalar!
Operations only work on a single bank
(wrap around possible)

VFP asm
for (int i=0; i<MaxParticles; ++i)
{
Particle& p = s_particles[i];
p.x += p.vx*dt;
p.y += p.vy*dt;
p.z += p.vz*dt;
p.vx *= drag;
p.vy *= drag;
p.vz *= drag;
}

VFP asm
{
Particle& p = s_particles[i]; {
p.x += p.vx*dt; void* p = &s_particles[i];
p.y += p.vy*dt;
asm volatile (
p.z += p.vz*dt;
p.vx *= drag; "fldmias %1, {s0} nt"
p.vy *= drag; "fldmias %2, {s1} nt"
p.vz *= drag; "fldmias %0, {s8-s13} nt"
}
"fstmias %0, {s8-s13} nt"
:
: "r" (p), "r" (&dt), "r" (&drag)
: "r0", "cc", "memory"
);
}

VFP asm
{
p.y += p.vy*dt;
asm volatile (
p.z += p.vz*dt;
}
Was: 5.1 seconds "fstmias %0, {s8-s13} nt"
:
: "r" (p), "r" (&dt), "r" (&drag)
);
}

VFP asm
{
p.y += p.vy*dt;
asm volatile (
p.z += p.vz*dt;
}
Was: 5.1 seconds "fstmias %0, {s8-s13} nt"
:
Now: 2.7 seconds!! : "r" (p), "r" (&dt), "r" (&drag)
);
}

VFP asm
{
void* p = &s_particles[i];
asm volatile (
:
: "r" (p), "r" (&dt), "r" (&drag)
);
}

VFP asm
for (int i=0; i<MaxParticles; ++i) Same every loop!
{
void* p = &s_particles[i];
asm volatile (
:
: "r" (p), "r" (&dt), "r" (&drag)
);
}

VFP asm
void* p = &s_particles[0];
asm volatile (
"mov r1, %0 nt"
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"bne 0b nt"
:
: "r" (p), "r" (&dt), "r" (&drag), "r" (MaxParticles)
: "r0", "r1", "r2", "r3", "cc", "memory"
);

VFP asm
asm volatile (
"fldmias %1, {s0} nt" Was: 2.7 seconds
"mov r1, %0 nt"
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"bne 0b nt"
:
: "r0", "r1", "r2", "r3", "cc", "memory"
);

VFP asm
asm volatile (
"fldmias %1, {s0} nt" Was: 2.7 seconds
"mov r1, %0 nt" Now: 2.7 seconds
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"bne 0b nt"
:
: "r0", "r1", "r2", "r3", "cc", "memory"
);

VFP asm
We can do 8 operations at once. So let’s try doing two
particles in a single operation.

struct Particle2
{
float x0, y0, z0;
float x1, y1, z1;
float vx0, vy0, vz0;
float vx1, vy1, vz1;
};

VFP asm
asm volatile (
"mov r1, %0 nt"
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"bne 0b nt"
:
: "r0", "r1", "r2", "r3", "cc", "memory"
);

VFP asm
asm volatile (

"fldmias %1, {s0}
"fldmias %2, {s1}
nt"
nt" Was: 2.77 seconds
"mov r1, %0 nt"
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"bne 0b nt"
:
: "r0", "r1", "r2", "r3", "cc", "memory"
);

VFP asm
asm volatile (

"fldmias %1, {s0}
"fldmias %2, {s1}
nt"
nt" Was: 2.77 seconds
"mov r1, %0 nt"

"mov r2, %0
"mov r3, %3
nt"
nt" Now: 2.67 seconds
"0: nt"
"bne 0b nt"
:
: "r0", "r1", "r2", "r3", "cc", "memory"
);

VFP asm
What’s the loop/cache overhead?
{
Particle* p = &s_particles[i];
p->x = p->vx;
p->y = p->vy;
p->z = p->vz;
}

VFP asm
{
p->x = p->vx;
p->y = p->vy;
p->z = p->vz;
}

Was: 2.67 seconds

VFP asm
{
p->x = p->vx;
p->y = p->vy;
p->z = p->vz;
}

Was: 2.67 seconds
Now: 2.41 seconds!!!!

Matrix multiply
Straight from vfpmathlib

Matrix multiply

Touch: 0.0379 s

Matrix multiply

Touch: 0.0379 s
Normal: 0.0968 s

Matrix multiply

Touch: 0.0379 s
Normal: 0.0968 s
VFP: 0.0422 s

Matrix multiply

Touch: 0.0379 s
Normal: 0.0968 s
VFP: 0.0422 s

About 2x faster!

Good use of vfp
Something with lots of fp operations in a row

Good use of vfp

• Matrix operations

Good use of vfp

• Particle systems

Good use of vfp

• Skinning

Good use of vfp

• Skinning
• Physics

Good use of vfp

• Skinning
• Physics
• Procedural content generation

Good use of vfp

• Skinning
• Physics
• Procedural content generation
• ....

What about the 3GS?
3G 3GS
Thumb 14.1 14.5
Normal 5.14 4.76
VFP1 2.77 4.53
VFP2 2.77 4.26
VFP3 2.66 3.57
Touch 2.41 0.42

Matrix multiply on 3GS
In ms

Matrix multiply on 3GS
In ms
3G 3GS
Normal 96 82

VFP1 42 90

VFP2 42 75

Touch 38 19

VFP resources
• ARM and VFP reference in your USB drive
• http://aleiby.blogspot.com/2008/12/iphone-
vfp-for-n00bs.html
• http://www.ibiblio.org/gferg/ldp/GCC-
Inline-Assembly-HOWTO.html

More 3GS: NEON

• SIMD coprocessor

More 3GS: NEON

• Floating point and integer

More 3GS: NEON

• Huge potential

More 3GS: NEON

• Huge potential
• Not many examples yet

NEON resources

• Cortex A8 reference in USB drive

NEON resources

• http://gcc.gnu.org/onlinedocs/gcc/ARM-
NEON-Intrinsics.html

NEON resources

• http://gcc.gnu.org/onlinedocs/gcc/ARM-
NEON-Intrinsics.html
• http://code.google.com/p/oolongengine/
source/browse/trunk/Oolong+Engine2/
Math/neonmath

Conclusions
• Turn Thumb mode off NOW

Conclusions
• Expect to get at least 2x performance in
older hardware by using vfp

Conclusions
• Not much difference in 3GS (but it’s fast
already)

Conclusions
• Not much difference in 3GS (but it’s fast
already)
• NEON SIMD tech still unused. Research
that and be the ﬁrst one with the killer 3GS
app!

Thank you!

Noel Llopis
Snappy Touch

http://twitter.com/snappytouch
noel@snappytouch.com
http://gamesfromwithin.com

Cranking Floating Point Performance To 11 On The iPhone

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (13)

Similar to Cranking Floating Point Performance To 11 On The iPhone

Similar to Cranking Floating Point Performance To 11 On The iPhone (20)

Recently uploaded

Recently uploaded (20)

Cranking Floating Point Performance To 11 On The iPhone

Editor's Notes