Cranking Floating Point Performance Up To 11

Noel Llopis
Snappy Touch

http://twitter.com/snappytouch
noel@snappytouch.com
http://gamesfromwithin.com

Floating point numbers

• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE 754 format
• Single precision: 32 bits
• Double precision: 64 bits

Why floating point performance?

• Most games use floating point numbers for most of their calculations
• Positions, velocities, physics, etc.
• Maybe not so much for regular apps

CPU

• 32-bit RISC ARM 11
• 400-535 MHz
• iPhone 2G/3G and iPod Touch 1st and 2nd gen

CPU (iPhone 3GS)

• Cortex-A8 600 MHz
• More advanced architecture

CPU

• No floating point support in the ARM CPU!!!

How about integer math?

• No need to do any floating point operations
• Fully supported in the ARM processor
• But...

Integer Divide

There is no integer divide instruction

Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a fixed-point library.
• Performs rational arithmetic with integer values at a reduced range/resolution.
• Not so great...
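
To make the range/resolution trade-off concrete, here is a minimal sketch of 16.16 fixed-point arithmetic. The Fixed type, the helper names and the 16.16 split are assumptions chosen for illustration; they are not taken from any particular fixed-point library.

```cpp
#include <cstdint>

// Hypothetical 16.16 format: 16 integer bits, 16 fractional bits.
// Range is roughly -32768..32767 with a resolution of 1/65536.
typedef int32_t Fixed;
const int kFracBits = 16;

inline Fixed fixedFromFloat(float f) { return static_cast<Fixed>(f * (1 << kFracBits)); }
inline float fixedToFloat(Fixed x)   { return static_cast<float>(x) / (1 << kFracBits); }

// Addition is a plain integer add.
inline Fixed fixedAdd(Fixed a, Fixed b) { return a + b; }

// Multiplication needs a 64-bit intermediate, then a shift to drop
// the extra 16 fractional bits produced by the product.
inline Fixed fixedMul(Fixed a, Fixed b)
{
    return static_cast<Fixed>((static_cast<int64_t>(a) * b) >> kFracBits);
}
```

Every multiply pays for a widening multiply plus a shift, and values outside the 16-bit integer range overflow silently, which is part of why this is "not so great" for general game math.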

Floating point support

• There’s a floating point unit
• Compiled C/C++/ObjC code uses the VFP unit for any floating point operations.

Sample program
struct Particle
{
    float x, y, z;
    float vx, vy, vz;
};

for (int i=0; i<MaxParticles; ++i)
{
    Particle& p = s_particles[i];
    p.x += p.vx*dt;
    p.y += p.vy*dt;
    p.z += p.vz*dt;
    p.vx *= drag;
    p.vy *= drag;
    p.vz *= drag;
}

• 7.2 seconds on an iPod Touch 2nd gen
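
The slides do not show how the loop was timed, so the following is only a sketch of one way to reproduce this kind of measurement. The function names, the use of std::vector and std::chrono, and the step count are all assumptions, not the talk's actual benchmark.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct Particle
{
    float x, y, z;
    float vx, vy, vz;
};

// One simulation step over every particle: same arithmetic as the sample loop.
void updateParticles(std::vector<Particle>& particles, float dt, float drag)
{
    for (std::size_t i = 0; i < particles.size(); ++i)
    {
        Particle& p = particles[i];
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
        p.vx *= drag;
        p.vy *= drag;
        p.vz *= drag;
    }
}

// Runs `steps` updates and returns the elapsed wall-clock time in seconds.
double timeUpdates(std::vector<Particle>& particles, int steps, float dt, float drag)
{
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    for (int s = 0; s < steps; ++s)
        updateParticles(particles, dt, drag);
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling timeUpdates on the same particle array before and after each change (compiler flags, hand-written assembly) gives numbers that can be compared the way the talk compares them.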


Trust no one!
When in doubt, check the assembly generated

Thumb Mode
• CPU has a special thumb mode.
• Less memory, maybe better performance.
• No floating point support.
• Every time there’s an fp operation, it switches out of Thumb, does the fp operation, and switches back on.

Thumb Mode

• It’s on by default!
• Potentially HUGE wins turning it off.

Thumb Mode

• Turning off Thumb mode increased performance in Flower Garden by over 2x
• Heavy usage of floating point operations though
• Most games will probably benefit from turning it off (especially 3D games)

ARM assembly
DISCLAIMER:
I’m not an ARM assembly expert!!!

Z80!!!

ARM assembly

• Hit the docs
• References included in your USB card
• Or download them from the ARM site
• http://bit.ly/arminfo

ARM assembly

• Reading assembly is a very important skill for high-performance programming
• Writing is more specialized. Most people don’t need to.

VFP unit
A0 + B0 = C0
A1 + B1 = C1
A2 + B2 = C2
A3 + B3 = C3

VFP unit
A0 A1 A2 A3
+
B0 B1 B2 B3
=
C0 C1 C2 C3

Sweet! How do we
use the vfp?

Like this!

"fldmias  %2, {s8-s23}    \n\t"
"fldmias  %1!, {s0-s3}    \n\t"
"fmuls    s24, s8, s0     \n\t"
"fmacs    s24, s12, s1    \n\t"

"fldmias  %1!, {s4-s7}    \n\t"

"fstmias  %0!, {s24-s27}  \n\t"

Writing vfp assembly

• There are two parts to it
• How to write any assembly in gcc
• Learning ARM and VFP assembly

vfpmath library

• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
• Vector/matrix math
• Might not be exactly what you need, but it’s a great starting point

Assembly in gcc
• Only use it when targeting the device
#include <TargetConditionals.h>
#if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1)
#define USE_VFP
#endif
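
One way to use such a guard is to keep a plain C++ path for the simulator and switch to the assembly path only on the device. The sketch below assumes that structure; scale4, scale4_c and scale4_vfp are hypothetical names, and the VFP version is left as a placeholder that forwards to the C++ code.

```cpp
#if defined(__APPLE__)
#include <TargetConditionals.h>
#endif

// Same condition as the guard above, wrapped so it also compiles off Apple platforms.
#if defined(TARGET_OS_IPHONE) && defined(TARGET_IPHONE_SIMULATOR)
#if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1)
#define USE_VFP
#endif
#endif

// Portable reference version: scales four floats in place.
void scale4_c(float* v, float s)
{
    for (int i = 0; i < 4; ++i)
        v[i] *= s;
}

#ifdef USE_VFP
// Placeholder: a real device build would put the VFP inline assembly here.
void scale4_vfp(float* v, float s) { scale4_c(v, s); }
#endif

// Callers use this entry point and never see which path was chosen.
void scale4(float* v, float s)
{
#ifdef USE_VFP
    scale4_vfp(v, s);
#else
    scale4_c(v, s);
#endif
}
```

Keeping the C++ version around also gives a reference to test the assembly against.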

Assembly in gcc
• The basics

asm ("cmp r2, r1");

http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

Assembly in gcc
• Multiple lines
asm (
    "mov r0, #1000\n\t"
    "cmp r2, r1\n\t"
);

Assembly in gcc
• Accessing C variables
asm ( // assembly code
    : // output operands
    : // input operands
    : // clobbered registers
);

Assembly in gcc
int src = 19;
int dest = 0;

asm volatile (
    "add %0, %1, #42"
    : "=r" (dest)
    : "r" (src)
    :
);

%0, %1, etc are the variables in order

Assembly in gcc
int src = 19;
int dest = 0;

asm volatile (
    "add r10, %1, #42\n\t"
    "add %0, r10, #33\n\t"
    : "=r" (dest)
    : "r" (src)
    : "r10"
);

volatile prevents “optimizations”
The clobber list names the registers the asm block modifies
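
The same block can be wrapped in a function with a plain C++ fallback, so the code still builds where the ARM instructions are not available. This is a sketch: addOffsets is a made-up name, and the assembly path assumes a 32-bit ARM target compiled in ARM (not Thumb) mode with GCC-style inline assembly.

```cpp
// Adds 42 and then 33 to src.
int addOffsets(int src)
{
    int dest = 0;
#if defined(__arm__) && defined(__GNUC__) && !defined(__thumb__)
    asm volatile (
        "add r10, %1, #42   \n\t"   // r10 = src + 42
        "add %0, r10, #33   \n\t"   // dest = r10 + 33
        : "=r" (dest)               // %0: output operand
        : "r" (src)                 // %1: input operand
        : "r10"                     // r10 is overwritten, so tell the compiler
    );
#else
    // Fallback for the simulator, Thumb builds and other architectures.
    dest = (src + 42) + 33;
#endif
    return dest;
}
```

With src = 19 both paths produce 94. Leaving "r10" out of the clobber list would let the compiler keep a live value in r10 across the block, which is the kind of bug that only shows up in optimized builds.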

VFP asm
Four banks of 8 32-bit registers each

#define VFP_VECTOR_LENGTH(VEC_LENGTH) \
    "fmrx    r0, fpscr                         \n\t" \
    "bic     r0, r0, #0x00370000               \n\t" \
    "orr     r0, r0, #0x000" #VEC_LENGTH "0000 \n\t" \
    "fmxr    fpscr, r0                         \n\t"

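The VFP_VECTOR_LENGTH macro is a read-modify-write of the FPSCR register: the bic clears the vector length (LEN, bits 16-18) and stride (bits 20-21) fields, and the orr writes the new LEN value. The sketch below does the same bit arithmetic in plain C++ so it can be checked off-device; fpscrWithVectorLength is a hypothetical helper, not part of vfpmathlibrary.

```cpp
#include <cstdint>

// Bits cleared by the "bic": LEN (bits 16-18) and STRIDE (bits 20-21).
const uint32_t kLenStrideMask = 0x00370000u;

// lenField is the raw LEN value, i.e. elements per vector operation minus one,
// so VFP_VECTOR_LENGTH(3) corresponds to vectors of four floats.
inline uint32_t fpscrWithVectorLength(uint32_t fpscr, uint32_t lenField)
{
    return (fpscr & ~kLenStrideMask) | ((lenField & 0x7u) << 16);
}
```

A LEN value of 0 puts the unit back into ordinary scalar mode, which is the state compiled C/C++ code expects, so it should be restored after a block of vector assembly.
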
VFP asm
for (int i=0; i<MaxParticles; ++i)
{
    Particle& p = s_particles[i];
    p.x += p.vx*dt;
    p.y += p.vy*dt;
    p.z += p.vz*dt;
    p.vx *= drag;
    p.vy *= drag;
    p.vz *= drag;
}

VFP asm
for (int i=0; i<MaxParticles; ++i)
{
    Particle* p = &s_particles[i];
    asm volatile (
        "fldmias  %0, {s0-s5}     \n\t"

        "fldmias  %2, {s9-s11}    \n\t"
        "fmacs    s0, s3, s6      \n\t"
        "fstmias  %0, {s0-s5}     \n\t"
        : "=r" (p)
        : "r" (p), "r" (dtArray),
          "r" (dragArray)
        :
    );
}

VFP asm
Was: 2.6 seconds
Now: 1.4 seconds!!

VFP asm
Let’s do 6 operations at once!

struct Particle2
{
float x0, y0, z0;
float x1, y1, z1;
float vx0, vy0, vz0;
float vx1, vy1, vz1;
};

VFP asm
for (int i=0; i<iterations; ++i)
{
Particle2* p = &s_particles2[i];
asm volatile (
: "=r" (p)
: "r" (p), "r" (dtArray), "r" (dragArray)
:
);
}

VFP asm
Was: 1.4 seconds
Now: 1.2 seconds

VFP asm
What’s the loop/cache overhead?
{
    p->x = p->vx;
    p->y = p->vy;
    p->z = p->vz;
}

Was: 1.2 seconds
Now: 1.2 seconds!!!!

Matrix multiply
Straight from vfpmathlib

Matrix multiply

Touch: 0.037919 s
Normal: 0.096855 s
VFP: 0.042216 s

About 2x faster!
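
The benchmark code behind these numbers is not shown in the slides. For reference, a plain C++ 4x4 multiply of the kind a "Normal" baseline would typically use might look like the sketch below; the function name and the column-major layout (as used by OpenGL) are assumptions.

```cpp
// Multiplies two 4x4 column-major matrices: out = a * b.
// Element (row, col) lives at index col * 4 + row.
// out must not alias a or b.
void matrixMultiply4x4(const float* a, const float* b, float* out)
{
    for (int col = 0; col < 4; ++col)
    {
        for (int row = 0; row < 4; ++row)
        {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[k * 4 + row] * b[col * 4 + k];
            out[col * 4 + row] = sum;
        }
    }
}
```

Each output column is four multiply-accumulates over a column of a, which is the pattern that vector-mode fmuls/fmacs instructions handle four floats at a time.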

Good use of vfp
• Matrix operations
• Particle systems
• Skinning
• Physics
• Procedural content generation
• ....

What about the 3GS?
(times in seconds)
          3G     3GS
Thumb     7.2    8.0
Normal    2.6    2.6
VFP1      1.4    1.30
VFP2      1.2    0.64
Touch     1.2    0.18

More 3GS: NEON

• SIMD coprocessor
• Floating point and integer
• Huge potential
• Very little documentation right now :-(

Thank you!

Noel Llopis
Snappy Touch

http://twitter.com/snappytouch
noel@snappytouch.com
http://gamesfromwithin.com
