-
1.
Cranking Floating Point
Performance Up To 11
Noel Llopis
Snappy Touch
http://twitter.com/snappytouch
noel@snappytouch.com
http://gamesfromwithin.com
-
2.
void* p = &s_particles2[0];
asm volatile (
"fldmias %1, {s0} nt"
"fldmias %2, {s1} nt"
"mov r1, %0 nt"
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"fldmias r1!, {s8-s13} nt"
"fldmias r1!, {s16-s21} nt"
"fmacs s8, s16, s0 nt"
"fmuls s16, s16, s1 nt"
"fstmias r2!, {s8-s13} nt"
"fstmias r2!, {s16-s21} nt"
"subs r3, r3, #1 nt"
"bne 0b nt"
:
: "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
: "r0", "r1", "r2", "r3", "cc", "memory"
);
-
3.
“Don’t do that bit-
twiddling thing”
-
4.
“Don’t do that bit-
twiddling thing”
“Optimize at the
algorithm level”
-
5.
“Don’t do that bit-
twiddling thing”
“Optimize at the
algorithm level”
Yes, but key to good performance
is looking at your data and your
target platform
-
6.
Floating Point
Performance
-
7.
Floating point numbers
-
8.
Floating point numbers
• Representation of rational numbers
-
9.
Floating point numbers
• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
-
10.
Floating point numbers
• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE 754 format
-
11.
Floating point numbers
• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE 754 format
• Single precision: 32 bits
-
12.
Floating point numbers
• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE 754 format
• Single precision: 32 bits
• Double precision: 64 bits
-
13.
Floating point numbers
-
14.
Floating point numbers
-
15.
Why floating point
performance?
-
16.
Why floating point
performance?
• Most games use floating point numbers for
most of their calculations
-
17.
Why floating point
performance?
• Most games use floating point numbers for
most of their calculations
• Positions, velocities, physics, etc, etc.
-
18.
Why floating point
performance?
• Most games use floating point numbers for
most of their calculations
• Positions, velocities, physics, etc, etc.
• Maybe not so much for regular apps
-
19.
CPU
-
20.
CPU
• 32-bit RISC ARM 11
-
21.
CPU
• 32-bit RISC ARM 11
• 400-535Mhz
-
22.
CPU
• 32-bit RISC ARM 11
• 400-535Mhz
• iPhone 2G/3G and iPod
Touch 1st and 2nd gen
-
23.
CPU (iPhone 3GS)
-
24.
CPU (iPhone 3GS)
• Cortex-A8 600MHz
-
25.
CPU (iPhone 3GS)
• Cortex-A8 600MHz
• More advanced
architecture
-
26.
CPU
-
27.
CPU
• No floating point support
in the ARM CPU!!!
-
28.
How about integer
math?
-
29.
How about integer
math?
• No need to do any floating point
operations
-
30.
How about integer
math?
• No need to do any floating point
operations
• Fully supported in the ARM processor
-
31.
How about integer
math?
• No need to do any floating point
operations
• Fully supported in the ARM processor
• But...
-
32.
Integer Divide
-
33.
Integer Divide
-
34.
Integer Divide
There is no integer divide
-
35.
Fixed-point arithmetic
-
36.
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
-
37.
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
-
38.
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a fixed-point library.
-
39.
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a fixed-point library.
• Performs rational arithmetic with integer
values at a reduced range/resolution.
-
40.
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a fixed-point library.
• Performs rational arithmetic with integer
values at a reduced range/resolution.
• Not so great...
-
41.
Floating point support
-
42.
Floating point support
• There’s a floating point
unit
-
43.
Floating point support
• There’s a floating point
unit
• Compiled C/C++/ObjC
code uses the VFP unit
for any floating point
operations.
-
44.
Sample program
-
45.
Sample program
struct Particle
{
float x, y, z;
float vx, vy, vz;
};
-
46.
Sample program
struct Particle for (int i=0; i<MaxParticles; ++i)
{ {
float x, y, z; Particle& p = s_particles[i];
float vx, vy, vz; p.x += p.vx*dt;
}; p.y += p.vy*dt;
p.z += p.vz*dt;
p.vx *= drag;
p.vy *= drag;
p.vz *= drag;
}
-
47.
Sample program
struct Particle for (int i=0; i<MaxParticles; ++i)
{ {
float x, y, z; Particle& p = s_particles[i];
float vx, vy, vz; p.x += p.vx*dt;
}; p.y += p.vy*dt;
p.z += p.vz*dt;
p.vx *= drag;
p.vy *= drag;
p.vz *= drag;
}
• 14.1 seconds on an iPod Touch 2nd gen
-
48.
Floating point support
-
49.
Floating point support
Trust no one!
-
50.
Floating point support
Trust no one!
When in doubt, check the
assembly generated
-
51.
Floating point support
-
52.
Thumb Mode
-
53.
Thumb Mode
-
54.
Thumb Mode
• CPU has a special thumb mode.
-
55.
Thumb Mode
• CPU has a special thumb mode.
• Less memory, maybe better
performance.
-
56.
Thumb Mode
• CPU has a special thumb mode.
• Less memory, maybe better
performance.
• No floating point support.
-
57.
Thumb Mode
• CPU has a special thumb mode.
• Less memory, maybe better
performance.
• No floating point support.
• Every timeitthere’s an fp of
operation, switches out
Thumb, does the fp operation,
and switches back on.
-
58.
Thumb Mode
-
59.
Thumb Mode
• It’s on by default!
-
60.
Thumb Mode
• It’s on by default!
• Potentiallyoff. wins
turning it
HUGE
-
61.
Thumb Mode
• It’s on by default!
• Potentiallyoff. wins
turning it
HUGE
-
62.
Thumb Mode
-
63.
Thumb Mode
• Turning off Thumb mode increased
performance in Flower Garden by over 2x
-
64.
Thumb Mode
• Turning off Thumb mode increased
performance in Flower Garden by over 2x
• Heavy usage of floating point operations
though
-
65.
Thumb Mode
• Turning off Thumb mode increased
performance in Flower Garden by over 2x
• Heavy usage of floating point operations
though
• Most games will probably benefit from
turning it off (especially 3D games)
-
66.
5.1 seconds!
-
67.
ARM assembly
DISCLAIMER:
-
68.
ARM assembly
DISCLAIMER:
I’m not an ARM assembly expert!!!
-
69.
ARM assembly
DISCLAIMER:
I’m not an ARM assembly expert!!!
-
70.
ARM assembly
DISCLAIMER:
I’m not an ARM assembly expert!!!
-
71.
ARM assembly
DISCLAIMER:
I’m not an ARM assembly expert!!!
Z80!!!
-
72.
ARM assembly
-
73.
ARM assembly
• Hit the docs
-
74.
ARM assembly
• Hit the docs
• References included in your USB card
-
75.
ARM assembly
• Hit the docs
• References included in your USB card
• Or download them from the ARM site
-
76.
ARM assembly
• Hit the docs
• References included in your USB card
• Or download them from the ARM site
• http://bit.ly/arminfo
-
77.
ARM assembly
-
78.
ARM assembly
• Reading assembly is a very important skill
for high-performance programming
-
79.
ARM assembly
• Reading assembly is a very important skill
for high-performance programming
• Writing is more specialized. Most people
don’t need to.
-
80.
VFP unit
-
81.
VFP unit
A0
-
82.
VFP unit
A0
+
-
83.
VFP unit
A0
+
B0
-
84.
VFP unit
A0
+
B0
=
-
85.
VFP unit
A0
+
B0
=
C0
-
86.
VFP unit
A0
+
B0
=
C0
A1
+
B1
=
C1
-
87.
VFP unit
A0 A2
+ +
B0 B2
= =
C0 C2
A1
+
B1
=
C1
-
88.
VFP unit
A0 A2
+ +
B0 B2
= =
C0 C2
A1 A3
+ +
B1 B3
= =
C1 C3
-
89.
VFP unit
-
90.
VFP unit
A0 A1 A2 A3
-
91.
VFP unit
A0 A1 A2 A3
+
-
92.
VFP unit
A0 A1 A2 A3
+
B0 B1 B2 B3
-
93.
VFP unit
A0 A1 A2 A3
+
B0 B1 B2 B3
=
-
94.
VFP unit
A0 A1 A2 A3
+
B0 B1 B2 B3
=
C0 C1 C2 C3
-
95.
VFP unit
A0 A1 A2 A3
+
B0 B1 B2 B3
=
C0 C1 C2 C3
Sweet! How do we
use the vfp?
-
96.
Like this!
"fldmias %2, {s8-s23} nt"
"fldmias %1!, {s0-s3} nt"
"fmuls s24, s8, s0 nt"
"fmacs s24, s12, s1 nt"
"fldmias %1!, {s4-s7} nt"
"fmacs s24, s16, s2 nt"
"fmacs s24, s20, s3 nt"
"fstmias %0!, {s24-s27} nt"
-
97.
Writing vfp assembly
-
98.
Writing vfp assembly
• There are two parts to it
-
99.
Writing vfp assembly
• There are two parts to it
• How to write any assembly in gcc
-
100.
Writing vfp assembly
• There are two parts to it
• How to write any assembly in gcc
• Learning ARM and VPM assembly
-
101.
vfpmath library
-
102.
vfpmath library
• Already done a lot of work for you
-
103.
vfpmath library
• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
-
104.
vfpmath library
• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
• Vector/matrix math
-
105.
vfpmath library
• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
• Vector/matrix math
• Might not be exactly what you need, but it’s
a great starting point
-
106.
Assembly in gcc
• Only use it when targeting the device
-
107.
Assembly in gcc
• Only use it when targeting the device
#include <TargetConditionals.h>
#if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1)
#define USE_VFP
#endif
-
108.
Assembly in gcc
• The basics
asm (“cmp r2, r1”);
-
109.
Assembly in gcc
• The basics
asm (“cmp r2, r1”);
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-
HOWTO.html
-
110.
Assembly in gcc
• Multiple lines
asm (
“mov r0, #1000nt”
“cmp r2, r1nt”
);
-
111.
Assembly in gcc
• Accessing C variables
asm (//assembly code
: // output operands
: // input operands
: // clobbered registers
);
-
112.
Assembly in gcc
• Accessing C variables
asm (//assembly code
: // output operands
: // input operands
: // clobbered registers
);
int src = 19;
int dest = 0;
asm volatile (
"add %0, %1, #42"
: "=r" (dest)
: "r" (src)
:
);
-
113.
Assembly in gcc
• Accessing C variables
asm (//assembly code
: // output operands
: // input operands
: // clobbered registers
);
int src = 19;
int dest = 0;
%0, %1, etc are the
variables in order
asm volatile (
"add %0, %1, #42"
: "=r" (dest)
: "r" (src)
:
);
-
114.
Assembly in gcc
-
115.
Assembly in gcc
int src = 19;
int dest = 0;
asm volatile (
"add r10, %1, #42nt"
"add %0, r10, #33nt"
: "=r" (dest)
: "r" (src)
: "r10"
);
-
116.
Assembly in gcc
int src = 19;
int dest = 0;
asm volatile (
"add r10, %1, #42nt"
"add %0, r10, #33nt"
: "=r" (dest)
: "r" (src)
: "r10"
);
Clobber register list
are registers used by
the asm block
-
117.
Assembly in gcc
int src = 19; volatile prevents “optimizations”
int dest = 0;
asm volatile (
"add r10, %1, #42nt"
"add %0, r10, #33nt"
: "=r" (dest)
: "r" (src)
: "r10"
);
Clobber register list
are registers used by
the asm block
-
118.
VFP asm
Four banks of 8 32-bit registers each
Can address them as single precision
or as doubles
-
119.
VFP asm
-
120.
VFP asm
#define VFP_VECTOR_LENGTH(VEC_LENGTH)
"fmrx r0, fpscr nt"
"bic r0, r0, #0x00370000 nt"
"orr r0, r0, #0x000" #VEC_LENGTH "0000 nt"
"fmxr fpscr, r0 nt"
-
121.
VFP asm
Bank 0 is always scalar!
Operations only work on a single bank
(wrap around possible)
-
122.
VFP asm
-
123.
VFP asm
-
124.
VFP asm
for (int i=0; i<MaxParticles; ++i)
{
Particle& p = s_particles[i];
p.x += p.vx*dt;
p.y += p.vy*dt;
p.z += p.vz*dt;
p.vx *= drag;
p.vy *= drag;
p.vz *= drag;
}
-
125.
VFP asm
for (int i=0; i<MaxParticles; ++i)
for (int i=0; i<MaxParticles; ++i)
{
Particle& p = s_particles[i]; {
p.x += p.vx*dt; void* p = &s_particles[i];
p.y += p.vy*dt;
asm volatile (
p.z += p.vz*dt;
p.vx *= drag; "fldmias %1, {s0} nt"
p.vy *= drag; "fldmias %2, {s1} nt"
p.vz *= drag; "fldmias %0, {s8-s13} nt"
}
"fmacs s8, s11, s0 nt"
"fmuls s11, s11, s1 nt"
"fstmias %0, {s8-s13} nt"
:
: "r" (p), "r" (&dt), "r" (&drag)
: "r0", "cc", "memory"
);
}
-
126.
VFP asm
for (int i=0; i<MaxParticles; ++i)
for (int i=0; i<MaxParticles; ++i)
{
Particle& p = s_particles[i]; {
p.x += p.vx*dt; void* p = &s_particles[i];
p.y += p.vy*dt;
asm volatile (
p.z += p.vz*dt;
p.vx *= drag; "fldmias %1, {s0} nt"
p.vy *= drag; "fldmias %2, {s1} nt"
p.vz *= drag; "fldmias %0, {s8-s13} nt"
}
"fmacs s8, s11, s0 nt"
"fmuls s11, s11, s1 nt"
Was: 5.1 seconds "fstmias %0, {s8-s13} nt"
:
: "r" (p), "r" (&dt), "r" (&drag)
: "r0", "cc", "memory"
);
}
-
127.
VFP asm
for (int i=0; i<MaxParticles; ++i)
for (int i=0; i<MaxParticles; ++i)
{
Particle& p = s_particles[i]; {
p.x += p.vx*dt; void* p = &s_particles[i];
p.y += p.vy*dt;
asm volatile (
p.z += p.vz*dt;
p.vx *= drag; "fldmias %1, {s0} nt"
p.vy *= drag; "fldmias %2, {s1} nt"
p.vz *= drag; "fldmias %0, {s8-s13} nt"
}
"fmacs s8, s11, s0 nt"
"fmuls s11, s11, s1 nt"
Was: 5.1 seconds "fstmias %0, {s8-s13} nt"
:
Now: 2.7 seconds!! : "r" (p), "r" (&dt), "r" (&drag)
: "r0", "cc", "memory"
);
}
-
128.
VFP asm
for (int i=0; i<MaxParticles; ++i)
{
void* p = &s_particles[i];
asm volatile (
"fldmias %1, {s0} nt"
"fldmias %2, {s1} nt"
"fldmias %0, {s8-s13} nt"
"fmacs s8, s11, s0 nt"
"fmuls s11, s11, s1 nt"
"fstmias %0, {s8-s13} nt"
:
: "r" (p), "r" (&dt), "r" (&drag)
: "r0", "cc", "memory"
);
}
-
129.
VFP asm
for (int i=0; i<MaxParticles; ++i) Same every loop!
{
void* p = &s_particles[i];
asm volatile (
"fldmias %1, {s0} nt"
"fldmias %2, {s1} nt"
"fldmias %0, {s8-s13} nt"
"fmacs s8, s11, s0 nt"
"fmuls s11, s11, s1 nt"
"fstmias %0, {s8-s13} nt"
:
: "r" (p), "r" (&dt), "r" (&drag)
: "r0", "cc", "memory"
);
}
-
130.
VFP asm
void* p = &s_particles[0];
asm volatile (
"fldmias %1, {s0} nt"
"fldmias %2, {s1} nt"
"mov r1, %0 nt"
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"fldmias r1!, {s8-s13} nt"
"fmacs s8, s11, s0 nt"
"fmuls s11, s11, s1 nt"
"fstmias r2!, {s8-s13} nt"
"subs r3, r3, #1 nt"
"bne 0b nt"
:
: "r" (p), "r" (&dt), "r" (&drag), "r" (MaxParticles)
: "r0", "r1", "r2", "r3", "cc", "memory"
);
-
131.
VFP asm
void* p = &s_particles[0];
asm volatile (
"fldmias %1, {s0} nt" Was: 2.7 seconds
"fldmias %2, {s1} nt"
"mov r1, %0 nt"
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"fldmias r1!, {s8-s13} nt"
"fmacs s8, s11, s0 nt"
"fmuls s11, s11, s1 nt"
"fstmias r2!, {s8-s13} nt"
"subs r3, r3, #1 nt"
"bne 0b nt"
:
: "r" (p), "r" (&dt), "r" (&drag), "r" (MaxParticles)
: "r0", "r1", "r2", "r3", "cc", "memory"
);
-
132.
VFP asm
void* p = &s_particles[0];
asm volatile (
"fldmias %1, {s0} nt" Was: 2.7 seconds
"fldmias %2, {s1} nt"
"mov r1, %0 nt" Now: 2.7 seconds
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"fldmias r1!, {s8-s13} nt"
"fmacs s8, s11, s0 nt"
"fmuls s11, s11, s1 nt"
"fstmias r2!, {s8-s13} nt"
"subs r3, r3, #1 nt"
"bne 0b nt"
:
: "r" (p), "r" (&dt), "r" (&drag), "r" (MaxParticles)
: "r0", "r1", "r2", "r3", "cc", "memory"
);
-
133.
VFP asm
We can do 8 operations at once. So let’s try doing two
particles in a single operation.
struct Particle2
{
float x0, y0, z0;
float x1, y1, z1;
float vx0, vy0, vz0;
float vx1, vy1, vz1;
};
-
134.
VFP asm
void* p = &s_particles2[0];
asm volatile (
"fldmias %1, {s0} nt"
"fldmias %2, {s1} nt"
"mov r1, %0 nt"
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"fldmias r1!, {s8-s13} nt"
"fldmias r1!, {s16-s21} nt"
"fmacs s8, s16, s0 nt"
"fmuls s16, s16, s1 nt"
"fstmias r2!, {s8-s13} nt"
"fstmias r2!, {s16-s21} nt"
"subs r3, r3, #1 nt"
"bne 0b nt"
:
: "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
: "r0", "r1", "r2", "r3", "cc", "memory"
);
-
135.
VFP asm
void* p = &s_particles2[0];
asm volatile (
"fldmias %1, {s0}
"fldmias %2, {s1}
nt"
nt" Was: 2.77 seconds
"mov r1, %0 nt"
"mov r2, %0 nt"
"mov r3, %3 nt"
"0: nt"
"fldmias r1!, {s8-s13} nt"
"fldmias r1!, {s16-s21} nt"
"fmacs s8, s16, s0 nt"
"fmuls s16, s16, s1 nt"
"fstmias r2!, {s8-s13} nt"
"fstmias r2!, {s16-s21} nt"
"subs r3, r3, #1 nt"
"bne 0b nt"
:
: "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
: "r0", "r1", "r2", "r3", "cc", "memory"
);
-
136.
VFP asm
void* p = &s_particles2[0];
asm volatile (
"fldmias %1, {s0}
"fldmias %2, {s1}
nt"
nt" Was: 2.77 seconds
"mov r1, %0 nt"
"mov r2, %0
"mov r3, %3
nt"
nt" Now: 2.67 seconds
"0: nt"
"fldmias r1!, {s8-s13} nt"
"fldmias r1!, {s16-s21} nt"
"fmacs s8, s16, s0 nt"
"fmuls s16, s16, s1 nt"
"fstmias r2!, {s8-s13} nt"
"fstmias r2!, {s16-s21} nt"
"subs r3, r3, #1 nt"
"bne 0b nt"
:
: "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
: "r0", "r1", "r2", "r3", "cc", "memory"
);
-
137.
VFP asm
What’s the loop/cache overhead?
for (int i=0; i<MaxParticles; ++i)
{
Particle* p = &s_particles[i];
p->x = p->vx;
p->y = p->vy;
p->z = p->vz;
}
-
138.
VFP asm
What’s the loop/cache overhead?
for (int i=0; i<MaxParticles; ++i)
{
Particle* p = &s_particles[i];
p->x = p->vx;
p->y = p->vy;
p->z = p->vz;
}
Was: 2.67 seconds
-
139.
VFP asm
What’s the loop/cache overhead?
for (int i=0; i<MaxParticles; ++i)
{
Particle* p = &s_particles[i];
p->x = p->vx;
p->y = p->vy;
p->z = p->vz;
}
Was: 2.67 seconds
Now: 2.41 seconds!!!!
-
140.
Matrix multiply
-
141.
Matrix multiply
Straight from vfpmathlib
-
142.
Matrix multiply
Straight from vfpmathlib
Touch: 0.0379 s
-
143.
Matrix multiply
Straight from vfpmathlib
Touch: 0.0379 s
Normal: 0.0968 s
-
144.
Matrix multiply
Straight from vfpmathlib
Touch: 0.0379 s
Normal: 0.0968 s
VFP: 0.0422 s
-
145.
Matrix multiply
Straight from vfpmathlib
Touch: 0.0379 s
Normal: 0.0968 s
VFP: 0.0422 s
About 2x faster!
-
146.
Good use of vfp
-
147.
Good use of vfp
Something with lots of fp operations in a row
-
148.
Good use of vfp
Something with lots of fp operations in a row
• Matrix operations
-
149.
Good use of vfp
Something with lots of fp operations in a row
• Matrix operations
• Particle systems
-
150.
Good use of vfp
Something with lots of fp operations in a row
• Matrix operations
• Particle systems
• Skinning
-
151.
Good use of vfp
Something with lots of fp operations in a row
• Matrix operations
• Particle systems
• Skinning
• Physics
-
152.
Good use of vfp
Something with lots of fp operations in a row
• Matrix operations
• Particle systems
• Skinning
• Physics
• Procedural content generation
-
153.
Good use of vfp
Something with lots of fp operations in a row
• Matrix operations
• Particle systems
• Skinning
• Physics
• Procedural content generation
• ....
-
154.
What about the 3GS?
-
155.
What about the 3GS?
3G 3GS
Thumb 14.1 14.5
Normal 5.14 4.76
VFP1 2.77 4.53
VFP2 2.77 4.26
VFP3 2.66 3.57
Touch 2.41 0.42
-
156.
What about the 3GS?
3G 3GS
Thumb 14.1 14.5
Normal 5.14 4.76
VFP1 2.77 4.53
VFP2 2.77 4.26
VFP3 2.66 3.57
Touch 2.41 0.42
-
157.
What about the 3GS?
3G 3GS
Thumb 14.1 14.5
Normal 5.14 4.76
VFP1 2.77 4.53
VFP2 2.77 4.26
VFP3 2.66 3.57
Touch 2.41 0.42
-
158.
What about the 3GS?
3G 3GS
Thumb 14.1 14.5
Normal 5.14 4.76
VFP1 2.77 4.53
VFP2 2.77 4.26
VFP3 2.66 3.57
Touch 2.41 0.42
-
159.
Matrix multiply on 3GS
In ms
-
160.
Matrix multiply on 3GS
In ms
3G 3GS
Normal 96 82
VFP1 42 90
VFP2 42 75
Touch 38 19
-
161.
Matrix multiply on 3GS
In ms
3G 3GS
Normal 96 82
VFP1 42 90
VFP2 42 75
Touch 38 19
-
162.
Matrix multiply on 3GS
In ms
3G 3GS
Normal 96 82
VFP1 42 90
VFP2 42 75
Touch 38 19
-
163.
Matrix multiply on 3GS
In ms
3G 3GS
Normal 96 82
VFP1 42 90
VFP2 42 75
Touch 38 19
-
164.
VFP resources
• ARM and VFP reference in your USB drive
• http://code.google.com/p/vfpmathlibrary
• http://aleiby.blogspot.com/2008/12/iphone-
vfp-for-n00bs.html
• http://www.ibiblio.org/gferg/ldp/GCC-
Inline-Assembly-HOWTO.html
-
165.
More 3GS: NEON
-
166.
More 3GS: NEON
• SIMD coprocessor
-
167.
More 3GS: NEON
• SIMD coprocessor
• Floating point and integer
-
168.
More 3GS: NEON
• SIMD coprocessor
• Floating point and integer
• Huge potential
-
169.
More 3GS: NEON
• SIMD coprocessor
• Floating point and integer
• Huge potential
• Not many examples yet
-
170.
NEON resources
-
171.
NEON resources
• Cortex A8 reference in USB drive
-
172.
NEON resources
• Cortex A8 reference in USB drive
• http://gcc.gnu.org/onlinedocs/gcc/ARM-
NEON-Intrinsics.html
-
173.
NEON resources
• Cortex A8 reference in USB drive
• http://gcc.gnu.org/onlinedocs/gcc/ARM-
NEON-Intrinsics.html
• http://code.google.com/p/oolongengine/
source/browse/trunk/Oolong+Engine2/
Math/neonmath
-
174.
Conclusions
-
175.
Conclusions
• Turn Thumb mode off NOW
-
176.
Conclusions
• Turn Thumb mode off NOW
• Expect to get at least 2x performance in
older hardware by using vfp
-
177.
Conclusions
• Turn Thumb mode off NOW
• Expect to get at least 2x performance in
older hardware by using vfp
• Not much difference in 3GS (but it’s fast
already)
-
178.
Conclusions
• Turn Thumb mode off NOW
• Expect to get at least 2x performance in
older hardware by using vfp
• Not much difference in 3GS (but it’s fast
already)
• NEON SIMD tech still unused. Research
that and be the first one with the killer 3GS
app!
-
179.
Thank you!
Noel Llopis
Snappy Touch
http://twitter.com/snappytouch
noel@snappytouch.com
http://gamesfromwithin.com
Slides will be up on my web site after the talk
I just wanted to flash that to scare people off :-)
In the last 11 years I&#x2019;ve made games for almost every major platform out there
In the last 11 years I&#x2019;ve made games for almost every major platform out there
In the last 11 years I&#x2019;ve made games for almost every major platform out there
In the last 11 years I&#x2019;ve made games for almost every major platform out there
In the last 11 years I&#x2019;ve made games for almost every major platform out there
In the last 11 years I&#x2019;ve made games for almost every major platform out there
During that time, working on engine technology and trying to achieve maximum performance
Performance always very important
Almost a year ago I started Snappy Touch
Almost a year ago I started Snappy Touch
Almost a year ago I started Snappy Touch
Don&#x2019;t waste time optimizing if you don&#x2019;t have to
Sometimes you need performance to back up design
Also, beware optimizing the best case. Optimize the worst case!
Don&#x2019;t waste time optimizing if you don&#x2019;t have to
Sometimes you need performance to back up design
Also, beware optimizing the best case. Optimize the worst case!
Don&#x2019;t waste time optimizing if you don&#x2019;t have to
Sometimes you need performance to back up design
Also, beware optimizing the best case. Optimize the worst case!
Don&#x2019;t waste time optimizing if you don&#x2019;t have to
Sometimes you need performance to back up design
Also, beware optimizing the best case. Optimize the worst case!
Not necessary to understand the format, but helps a lot to understand source of problems
Single precision (32 bit) vs double (64 bits)
Let&#x2019;s see what we have to work with
Let&#x2019;s see what we have to work with
Let&#x2019;s see what we have to work with
We want to optimize for the old model
We want to optimize for the old model
So plan accordingly!
So plan accordingly!
ACTION: Go to XCode
Simulator vs. device
Release
ACTION: Go to XCode
Simulator vs. device
Release
ACTION: Go to XCode
Simulator vs. device
Release
What&#x2019;s going on in there??
That&#x2019;s more like it!
I&#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
I&#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
I&#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
I&#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
I&#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
I&#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
I&#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
I&#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
I&#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
Single operation
Single operation
Single operation
Single operation
Single operation
Single operation
Need to get down to the metal
Can control how many register to use for each operation.
Up to 8 at once!
Bank 0 is always scalar!!!
Can control how many register to use for each operation.
Up to 8 at once!
Bank 0 is always scalar!!!
Stride
Read the manual for the specific instructions
That didn&#x2019;t help much! What&#x2019;s going on?
That didn&#x2019;t help much! What&#x2019;s going on?
That didn&#x2019;t help much! What&#x2019;s going on?
Barely much of an improvement!
Barely much of an improvement!
VFP almost as fast as just iterating through the loop.
Calculations become almost free!!!
VFP almost as fast as just iterating through the loop.
Calculations become almost free!!!
This is one pipeline
Can squeeze more performance by doing up to 3 operations at the same time
ACTION: Switch to XCode
ACTION: Switch to XCode
ACTION: Switch to XCode
ACTION: Switch to XCode
ACTION: Switch to XCode
The vfp on the 3GS seems to be running at half the speed in comparison to the 3G
That combined with a much faster CPU, makes it pretty useless there
The vfp on the 3GS seems to be running at half the speed in comparison to the 3G
That combined with a much faster CPU, makes it pretty useless there
The vfp on the 3GS seems to be running at half the speed in comparison to the 3G
That combined with a much faster CPU, makes it pretty useless there
The vfp on the 3GS seems to be running at half the speed in comparison to the 3G
That combined with a much faster CPU, makes it pretty useless there
Maybe next year&#x2019;s 360iDev session :-)
Maybe next year&#x2019;s 360iDev session :-)
Maybe next year&#x2019;s 360iDev session :-)
Maybe next year&#x2019;s 360iDev session :-)