Cranking Floating Point
Performance Up To 11 
             Noel Llopis
            Snappy Touch

    http://twitter.com/sn...
void* p = &s_particles2[0];
	   	   	   asm volatile (
	   	   	   	 "fldmias %1, {s0}         nt"
	   	   	   	 "fldmias ...
“Don’t do that bit-
 twiddling thing”
“Don’t do that bit-
 twiddling thing”

“Optimize at the
 algorithm level”
“Don’t do that bit-
                      twiddling thing”

                      “Optimize at the
                       ...
Floating Point
Performance
Floating point numbers
Floating point numbers

• Representation of rational numbers
Floating point numbers

• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
Floating point numbers

• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE ...
Floating point numbers

• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE ...
Floating point numbers

• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE ...
Floating point numbers
Floating point numbers
Why floating point
 performance?
Why floating point
    performance?

• Most games use floating point numbers for
  most of their calculations
Why floating point
     performance?

• Most games use floating point numbers for
  most of their calculations
• Positions, ...
Why floating point
     performance?

• Most games use floating point numbers for
  most of their calculations
• Positions, ...
CPU
CPU

• 32-bit RISC ARM 11
CPU

• 32-bit RISC ARM 11
• 400-535Mhz
CPU

• 32-bit RISC ARM 11
• 400-535Mhz
• iPhone 2G/3G and iPod
  Touch 1st and 2nd gen
CPU (iPhone 3GS)
CPU (iPhone 3GS)


• Cortex-A8 600MHz
CPU (iPhone 3GS)


• Cortex-A8 600MHz
• More advanced
  architecture
CPU
CPU


• No floating point support
  in the ARM CPU!!!
How about integer
     math?
How about integer
        math?

• No need to do any floating point
  operations
How about integer
        math?

• No need to do any floating point
  operations
• Fully supported in the ARM processor
How about integer
        math?

• No need to do any floating point
  operations
• Fully supported in the ARM processor
• B...
Integer Divide
Integer Divide
Integer Divide




There is no integer divide
Fixed-point arithmetic
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a ...
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a ...
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a ...
Floating point support
Floating point support
• There’s a floating point
  unit
Floating point support
• There’s a floating point
  unit

• Compiled C/C++/ObjC
  code uses the VFP unit
  for any floating ...
Sample program
Sample program
	   struct Particle
	   {
	   	 float x, y, z;
	   	 float vx, vy, vz;
	   };
Sample program
	   struct Particle       for (int i=0; i<MaxParticles; ++i)
	   {                     {
	   	 float x, y, ...
Sample program
	   struct Particle       for (int i=0; i<MaxParticles; ++i)
	   {                     {
	   	 float x, y, ...
Floating point support
Floating point support

        Trust no one!
Floating point support

         Trust no one!



 When in doubt, check the
   assembly generated
Floating point support
Thumb Mode
Thumb Mode
Thumb Mode
   • CPU has a special thumb mode.
Thumb Mode
   • CPU has a special thumb mode.
   • Less memory, maybe better
     performance.
Thumb Mode
   • CPU has a special thumb mode.
   • Less memory, maybe better
     performance.
   • No floating point suppo...
Thumb Mode
   • CPU has a special thumb mode.
   • Less memory, maybe better
     performance.
   • No floating point suppo...
Thumb Mode
Thumb Mode

    • It’s on by default!
Thumb Mode

    • It’s on by default!
    • Potentiallyoff. wins
      turning it
                  HUGE
Thumb Mode

    • It’s on by default!
    • Potentiallyoff. wins
      turning it
                  HUGE
Thumb Mode
Thumb Mode

• Turning off Thumb mode increased
  performance in Flower Garden by over 2x
Thumb Mode

• Turning off Thumb mode increased
  performance in Flower Garden by over 2x
• Heavy usage of floating point op...
Thumb Mode

• Turning off Thumb mode increased
  performance in Flower Garden by over 2x
• Heavy usage of floating point op...
5.1 seconds!
ARM assembly
   DISCLAIMER:
ARM assembly
            DISCLAIMER:
I’m not an ARM assembly expert!!!
ARM assembly
            DISCLAIMER:
I’m not an ARM assembly expert!!!
ARM assembly
            DISCLAIMER:
I’m not an ARM assembly expert!!!
ARM assembly
              DISCLAIMER:
I’m not an ARM assembly expert!!!




          Z80!!!
ARM assembly
ARM assembly

• Hit the docs
ARM assembly

• Hit the docs
• References included in your USB card
ARM assembly

• Hit the docs
• References included in your USB card
• Or download them from the ARM site
ARM assembly

• Hit the docs
• References included in your USB card
• Or download them from the ARM site
• http://bit.ly/a...
ARM assembly
ARM assembly

• Reading assembly is a very important skill
  for high-performance programming
ARM assembly

• Reading assembly is a very important skill
  for high-performance programming
• Writing is more specialize...
VFP unit
VFP unit
A0
VFP unit
A0
+
VFP unit
A0
+
B0
VFP unit
A0
+
B0
=
VFP unit
A0
+
B0
=
C0
VFP unit
A0
+
B0
=
C0


A1
+
B1
=
C1
VFP unit
A0   A2
+    +
B0   B2
=    =
C0   C2


A1
+
B1
=
C1
VFP unit
A0   A2
+    +
B0   B2
=    =
C0   C2


A1   A3
+    +
B1   B3
=    =
C1   C3
VFP unit
VFP unit
A0   A1   A2    A3
VFP unit
A0   A1       A2    A3

          +
VFP unit
A0   A1       A2    A3

          +
B0   B1       B2    B3
VFP unit
A0   A1       A2    A3

          +
B0   B1       B2    B3

          =
VFP unit
A0   A1       A2    A3

          +
B0   B1       B2    B3

          =
C0   C1       C2    C3
VFP unit
A0   A1       A2    A3

          +
B0   B1       B2    B3

          =
C0   C1       C2    C3




 Sweet! How do...
Like this!

"fldmias %2, {s8-s23}     nt"
"fldmias %1!, {s0-s3}     nt"
"fmuls s24, s8, s0        nt"
"fmacs s24, s12, s1 ...
Writing vfp assembly
Writing vfp assembly

• There are two parts to it
Writing vfp assembly

• There are two parts to it
 • How to write any assembly in gcc
Writing vfp assembly

• There are two parts to it
 • How to write any assembly in gcc
 • Learning ARM and VPM assembly
vfpmath library
vfpmath library

• Already done a lot of work for you
vfpmath library

• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
vfpmath library

• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
• Vector/matrix math
vfpmath library

• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
• Vector/matrix math
• Mig...
Assembly in gcc
• Only use it when targeting the device
Assembly in gcc
 • Only use it when targeting the device
#include <TargetConditionals.h>
#if (TARGET_IPHONE_SIMULATOR == 0...
Assembly in gcc
• The basics

          asm (“cmp r2, r1”);
Assembly in gcc
    • The basics

                asm (“cmp r2, r1”);




http://www.ibiblio.org/gferg/ldp/GCC-Inline-Asse...
Assembly in gcc
• Multiple lines
            asm (
                “mov r0, #1000nt”
                “cmp r2, r1nt”
      ...
Assembly in gcc
• Accessing C variables
         asm (//assembly code
             : // output operands
             : // ...
Assembly in gcc
• Accessing C variables
             asm (//assembly code
                 : // output operands
          ...
Assembly in gcc
• Accessing C variables
             asm (//assembly code
                 : // output operands
          ...
Assembly in gcc
Assembly in gcc
	   	   int src = 19;
	   	   int dest = 0;
	   	
	   	   asm volatile (
	   	   	 "add r10, %1, #42nt"
	 ...
Assembly in gcc
	   	   int src = 19;
	   	   int dest = 0;
	   	
	   	   asm volatile (
	   	   	 "add r10, %1, #42nt"
	 ...
Assembly in gcc
	   	   int src = 19;      volatile prevents “optimizations”
	   	   int dest = 0;
	   	
	   	   asm volat...
VFP asm
Four banks of 8 32-bit registers each




        Can address them as single precision
                  or as dou...
VFP asm
VFP asm




#define VFP_VECTOR_LENGTH(VEC_LENGTH)
    "fmrx    r0, fpscr                         nt" 
    "bic     r0, r0,...
VFP asm




       Bank 0 is always scalar!
Operations only work on a single bank
       (wrap around possible)
VFP asm
VFP asm
VFP asm
for (int i=0; i<MaxParticles; ++i)
{
    Particle& p = s_particles[i];
    p.x += p.vx*dt;
    p.y += p.vy*dt;
   ...
VFP asm
for (int i=0; i<MaxParticles; ++i)
                                     for (int i=0; i<MaxParticles; ++i)
{
    P...
VFP asm
for (int i=0; i<MaxParticles; ++i)
                                     for (int i=0; i<MaxParticles; ++i)
{
    P...
VFP asm
for (int i=0; i<MaxParticles; ++i)
                                     for (int i=0; i<MaxParticles; ++i)
{
    P...
VFP asm
for (int i=0; i<MaxParticles; ++i)
{
    void* p = &s_particles[i];
    asm volatile (
        "fldmias %1, {s0}  ...
VFP asm
for (int i=0; i<MaxParticles; ++i)      Same every loop!
{
    void* p = &s_particles[i];
    asm volatile (
     ...
VFP asm
	   	   	   void* p = &s_particles[0];
	   	   	   asm volatile (
	   	   	   	 "fldmias %1, {s0}        nt"
	   	...
VFP asm
	   	   	   void* p = &s_particles[0];
	   	   	   asm volatile (
	   	   	   	 "fldmias %1, {s0}        nt"   Was...
VFP asm
	   	   	   void* p = &s_particles[0];
	   	   	   asm volatile (
	   	   	   	 "fldmias %1, {s0}        nt"   Was...
VFP asm
We can do 8 operations at once. So let’s try doing two
           particles in a single operation.

              ...
VFP asm
	   	   	   void* p = &s_particles2[0];
	   	   	   asm volatile (
	   	   	   	 "fldmias %1, {s0}         nt"
	  ...
VFP asm
	   	   	   void* p = &s_particles2[0];
	   	   	   asm volatile (
	
	
    	
    	
        	
        	
           ...
VFP asm
	   	   	   void* p = &s_particles2[0];
	   	   	   asm volatile (
	
	
    	
    	
        	
        	
           ...
VFP asm
    What’s the loop/cache overhead?
	   	   	   for (int i=0; i<MaxParticles; ++i)
	   	   	   {
	   	   	   	 Par...
VFP asm
    What’s the loop/cache overhead?
	   	   	   for (int i=0; i<MaxParticles; ++i)
	   	   	   {
	   	   	   	 Par...
VFP asm
    What’s the loop/cache overhead?
	   	   	   for (int i=0; i<MaxParticles; ++i)
	   	   	   {
	   	   	   	 Par...
Matrix multiply
Matrix multiply
Straight from vfpmathlib
Matrix multiply
Straight from vfpmathlib

Touch: 0.0379 s
Matrix multiply
Straight from vfpmathlib

Touch: 0.0379 s
Normal: 0.0968 s
Matrix multiply
Straight from vfpmathlib

Touch: 0.0379 s
Normal: 0.0968 s
VFP: 0.0422 s
Matrix multiply
Straight from vfpmathlib

Touch: 0.0379 s
Normal: 0.0968 s
VFP: 0.0422 s

     About 2x faster!
Good use of vfp
Good use of vfp
Something with lots of fp operations in a row
Good use of vfp
Something with lots of fp operations in a row

 • Matrix operations
Good use of vfp
Something with lots of fp operations in a row

 • Matrix operations
 • Particle systems
Good use of vfp
Something with lots of fp operations in a row

 • Matrix operations
 • Particle systems
 • Skinning
Good use of vfp
Something with lots of fp operations in a row

 • Matrix operations
 • Particle systems
 • Skinning
 • Phy...
Good use of vfp
Something with lots of fp operations in a row

 • Matrix operations
 • Particle systems
 • Skinning
 • Phy...
Good use of vfp
Something with lots of fp operations in a row

 • Matrix operations
 • Particle systems
 • Skinning
 • Phy...
What about the 3GS?
What about the 3GS?
          3G     3GS
 Thumb    14.1   14.5
 Normal   5.14   4.76
  VFP1    2.77   4.53
  VFP2    2.77 ...
What about the 3GS?
          3G     3GS
 Thumb    14.1   14.5
 Normal   5.14   4.76
  VFP1    2.77   4.53
  VFP2    2.77 ...
What about the 3GS?
          3G     3GS
 Thumb    14.1   14.5
 Normal   5.14   4.76
  VFP1    2.77   4.53
  VFP2    2.77 ...
What about the 3GS?
          3G     3GS
 Thumb    14.1   14.5
 Normal   5.14   4.76
  VFP1    2.77   4.53
  VFP2    2.77 ...
Matrix multiply on 3GS
        In ms
Matrix multiply on 3GS
          In ms
            3G    3GS
 Normal     96    82

  VFP1      42    90

  VFP2      42   ...
Matrix multiply on 3GS
          In ms
            3G    3GS
 Normal     96    82

  VFP1      42    90

  VFP2      42   ...
Matrix multiply on 3GS
          In ms
            3G    3GS
 Normal     96    82

  VFP1      42    90

  VFP2      42   ...
Matrix multiply on 3GS
          In ms
            3G    3GS
 Normal     96    82

  VFP1      42    90

  VFP2      42   ...
VFP resources
• ARM and VFP reference in your USB drive
• http://code.google.com/p/vfpmathlibrary
• http://aleiby.blogspot...
More 3GS: NEON
More 3GS: NEON

• SIMD coprocessor
More 3GS: NEON

• SIMD coprocessor
• Floating point and integer
More 3GS: NEON

• SIMD coprocessor
• Floating point and integer
• Huge potential
More 3GS: NEON

• SIMD coprocessor
• Floating point and integer
• Huge potential
• Not many examples yet
NEON resources
NEON resources

• Cortex A8 reference in USB drive
NEON resources

• Cortex A8 reference in USB drive
• http://gcc.gnu.org/onlinedocs/gcc/ARM-
  NEON-Intrinsics.html
NEON resources

• Cortex A8 reference in USB drive
• http://gcc.gnu.org/onlinedocs/gcc/ARM-
  NEON-Intrinsics.html
• http:...
Conclusions
Conclusions
• Turn Thumb mode off NOW
Conclusions
• Turn Thumb mode off NOW
• Expect to get at least 2x performance in
  older hardware by using vfp
Conclusions
• Turn Thumb mode off NOW
• Expect to get at least 2x performance in
  older hardware by using vfp
• Not much ...
Conclusions
• Turn Thumb mode off NOW
• Expect to get at least 2x performance in
  older hardware by using vfp
• Not much ...
Thank you!


         Noel Llopis
        Snappy Touch

http://twitter.com/snappytouch
   noel@snappytouch.com
 http://gam...
Cranking Floating Point Performance To 11 On The iPhone
Cranking Floating Point Performance To 11 On The iPhone
Cranking Floating Point Performance To 11 On The iPhone
Cranking Floating Point Performance To 11 On The iPhone
Cranking Floating Point Performance To 11 On The iPhone
Cranking Floating Point Performance To 11 On The iPhone
Cranking Floating Point Performance To 11 On The iPhone
Cranking Floating Point Performance To 11 On The iPhone
Cranking Floating Point Performance To 11 On The iPhone
Upcoming SlideShare
Loading in …5
×

Cranking Floating Point Performance To 11 On The iPhone

10,764 views

Published on

360iDev presentation

Published in: Technology, News & Politics
0 Comments
10 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
10,764
On SlideShare
0
From Embeds
0
Number of Embeds
101
Actions
Shares
0
Downloads
160
Comments
0
Likes
10
Embeds 0
No embeds

No notes for slide
  • Slides will be up on my web site after the talk
  • I just wanted to flash that to scare people off :-)
  • In the last 11 years I&amp;#x2019;ve made games for almost every major platform out there
  • In the last 11 years I&amp;#x2019;ve made games for almost every major platform out there
  • In the last 11 years I&amp;#x2019;ve made games for almost every major platform out there
  • In the last 11 years I&amp;#x2019;ve made games for almost every major platform out there
  • In the last 11 years I&amp;#x2019;ve made games for almost every major platform out there
  • In the last 11 years I&amp;#x2019;ve made games for almost every major platform out there
  • During that time, working on engine technology and trying to achieve maximum performance
    Performance always very important
  • Almost a year ago I started Snappy Touch
  • Almost a year ago I started Snappy Touch
  • Almost a year ago I started Snappy Touch
  • Don&amp;#x2019;t waste time optimizing if you don&amp;#x2019;t have to
    Sometimes you need performance to back up design
    Also, beware optimizing the best case. Optimize the worst case!
  • Don&amp;#x2019;t waste time optimizing if you don&amp;#x2019;t have to
    Sometimes you need performance to back up design
    Also, beware optimizing the best case. Optimize the worst case!
  • Don&amp;#x2019;t waste time optimizing if you don&amp;#x2019;t have to
    Sometimes you need performance to back up design
    Also, beware optimizing the best case. Optimize the worst case!
  • Don&amp;#x2019;t waste time optimizing if you don&amp;#x2019;t have to
    Sometimes you need performance to back up design
    Also, beware optimizing the best case. Optimize the worst case!
  • Not necessary to understand the format, but helps a lot to understand source of problems
    Single precision (32 bit) vs double (64 bits)
  • Let&amp;#x2019;s see what we have to work with
  • Let&amp;#x2019;s see what we have to work with
  • Let&amp;#x2019;s see what we have to work with
  • We want to optimize for the old model
  • We want to optimize for the old model
  • So plan accordingly!
  • So plan accordingly!
  • ACTION: Go to XCode
    Simulator vs. device
    Release
  • ACTION: Go to XCode
    Simulator vs. device
    Release
  • ACTION: Go to XCode
    Simulator vs. device
    Release
  • What&amp;#x2019;s going on in there??
  • That&amp;#x2019;s more like it!
  • I&amp;#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
  • I&amp;#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
  • I&amp;#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
  • I&amp;#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
  • I&amp;#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
  • I&amp;#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
  • I&amp;#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
  • I&amp;#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
  • I&amp;#x2019;m saying that so you can see how anybody can get a good understand ing of that, not so that you leave me right now :-)
  • Single operation
  • Single operation
  • Single operation
  • Single operation
  • Single operation
  • Single operation
  • Need to get down to the metal
  • Can control how many register to use for each operation.
    Up to 8 at once!
    Bank 0 is always scalar!!!
  • Can control how many register to use for each operation.
    Up to 8 at once!
    Bank 0 is always scalar!!!
  • Stride
  • Read the manual for the specific instructions
  • That didn&amp;#x2019;t help much! What&amp;#x2019;s going on?
  • That didn&amp;#x2019;t help much! What&amp;#x2019;s going on?
  • That didn&amp;#x2019;t help much! What&amp;#x2019;s going on?
  • Barely much of an improvement!
  • Barely much of an improvement!
  • VFP almost as fast as just iterating through the loop.
    Calculations become almost free!!!
  • VFP almost as fast as just iterating through the loop.
    Calculations become almost free!!!
  • This is one pipeline
    Can squeeze more performance by doing up to 3 operations at the same time
  • ACTION: Switch to XCode
  • ACTION: Switch to XCode
  • ACTION: Switch to XCode
  • ACTION: Switch to XCode
  • ACTION: Switch to XCode
  • The vfp on the 3GS seems to be running at half the speed in comparison to the 3G
    That combined with a much faster CPU, makes it pretty useless there
  • The vfp on the 3GS seems to be running at half the speed in comparison to the 3G
    That combined with a much faster CPU, makes it pretty useless there
  • The vfp on the 3GS seems to be running at half the speed in comparison to the 3G
    That combined with a much faster CPU, makes it pretty useless there
  • The vfp on the 3GS seems to be running at half the speed in comparison to the 3G
    That combined with a much faster CPU, makes it pretty useless there
  • Maybe next year&amp;#x2019;s 360iDev session :-)
  • Maybe next year&amp;#x2019;s 360iDev session :-)
  • Maybe next year&amp;#x2019;s 360iDev session :-)
  • Maybe next year&amp;#x2019;s 360iDev session :-)
  • Cranking Floating Point Performance To 11 On The iPhone

    1. 1. Cranking Floating Point Performance Up To 11  Noel Llopis Snappy Touch http://twitter.com/snappytouch noel@snappytouch.com http://gamesfromwithin.com
    2. 2. void* p = &s_particles2[0]; asm volatile ( "fldmias %1, {s0} nt" "fldmias %2, {s1} nt" "mov r1, %0 nt" "mov r2, %0 nt" "mov r3, %3 nt" "0: nt" "fldmias r1!, {s8-s13} nt" "fldmias r1!, {s16-s21} nt" "fmacs s8, s16, s0 nt" "fmuls s16, s16, s1 nt" "fstmias r2!, {s8-s13} nt" "fstmias r2!, {s16-s21} nt" "subs r3, r3, #1 nt" "bne 0b nt" : : "r" (p), "r" (&dt), "r" (&drag), "r" (iterations) : "r0", "r1", "r2", "r3", "cc", "memory" );
    3. 3. “Don’t do that bit- twiddling thing”
    4. 4. “Don’t do that bit- twiddling thing” “Optimize at the algorithm level”
    5. 5. “Don’t do that bit- twiddling thing” “Optimize at the algorithm level” Yes, but key to good performance is looking at your data and your target platform
    6. 6. Floating Point Performance
    7. 7. Floating point numbers
    8. 8. Floating point numbers • Representation of rational numbers
    9. 9. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc
    10. 10. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format
    11. 11. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format • Single precision: 32 bits
    12. 12. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format • Single precision: 32 bits • Double precision: 64 bits
    13. 13. Floating point numbers
    14. 14. Floating point numbers
    15. 15. Why floating point performance?
    16. 16. Why floating point performance? • Most games use floating point numbers for most of their calculations
    17. 17. Why floating point performance? • Most games use floating point numbers for most of their calculations • Positions, velocities, physics, etc, etc.
    18. 18. Why floating point performance? • Most games use floating point numbers for most of their calculations • Positions, velocities, physics, etc, etc. • Maybe not so much for regular apps
    19. 19. CPU
    20. 20. CPU • 32-bit RISC ARM 11
    21. 21. CPU • 32-bit RISC ARM 11 • 400-535Mhz
    22. 22. CPU • 32-bit RISC ARM 11 • 400-535Mhz • iPhone 2G/3G and iPod Touch 1st and 2nd gen
    23. 23. CPU (iPhone 3GS)
    24. 24. CPU (iPhone 3GS) • Cortex-A8 600MHz
    25. 25. CPU (iPhone 3GS) • Cortex-A8 600MHz • More advanced architecture
    26. 26. CPU
    27. 27. CPU • No floating point support in the ARM CPU!!!
    28. 28. How about integer math?
    29. 29. How about integer math? • No need to do any floating point operations
    30. 30. How about integer math? • No need to do any floating point operations • Fully supported in the ARM processor
    31. 31. How about integer math? • No need to do any floating point operations • Fully supported in the ARM processor • But...
    32. 32. Integer Divide
    33. 33. Integer Divide
    34. 34. Integer Divide There is no integer divide
    35. 35. Fixed-point arithmetic
    36. 36. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it
    37. 37. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers
    38. 38. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a fixed-point library.
    39. 39. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a fixed-point library. • Performs rational arithmetic with integer values at a reduced range/resolution.
    40. 40. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a fixed-point library. • Performs rational arithmetic with integer values at a reduced range/resolution. • Not so great...
    41. 41. Floating point support
    42. 42. Floating point support • There’s a floating point unit
    43. 43. Floating point support • There’s a floating point unit • Compiled C/C++/ObjC code uses the VFP unit for any floating point operations.
    44. 44. Sample program
    45. 45. Sample program struct Particle { float x, y, z; float vx, vy, vz; };
    46. 46. Sample program struct Particle for (int i=0; i<MaxParticles; ++i) { { float x, y, z; Particle& p = s_particles[i]; float vx, vy, vz; p.x += p.vx*dt; }; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; }
    47. 47. Sample program struct Particle for (int i=0; i<MaxParticles; ++i) { { float x, y, z; Particle& p = s_particles[i]; float vx, vy, vz; p.x += p.vx*dt; }; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; } • 14.1 seconds on an iPod Touch 2nd gen
    48. 48. Floating point support
    49. 49. Floating point support Trust no one!
    50. 50. Floating point support Trust no one! When in doubt, check the assembly generated
    51. 51. Floating point support
    52. 52. Thumb Mode
    53. 53. Thumb Mode
    54. 54. Thumb Mode • CPU has a special thumb mode.
    55. 55. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance.
    56. 56. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance. • No floating point support.
    57. 57. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance. • No floating point support. • Every timeitthere’s an fp of operation, switches out Thumb, does the fp operation, and switches back on.
    58. 58. Thumb Mode
    59. 59. Thumb Mode • It’s on by default!
    60. 60. Thumb Mode • It’s on by default! • Potentiallyoff. wins turning it HUGE
    61. 61. Thumb Mode • It’s on by default! • Potentiallyoff. wins turning it HUGE
    62. 62. Thumb Mode
    63. 63. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x
    64. 64. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x • Heavy usage of floating point operations though
    65. 65. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x • Heavy usage of floating point operations though • Most games will probably benefit from turning it off (especially 3D games)
    66. 66. 5.1 seconds!
    67. 67. ARM assembly DISCLAIMER:
    68. 68. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
    69. 69. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
    70. 70. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
    71. 71. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!! Z80!!!
    72. 72. ARM assembly
    73. 73. ARM assembly • Hit the docs
    74. 74. ARM assembly • Hit the docs • References included in your USB card
    75. 75. ARM assembly • Hit the docs • References included in your USB card • Or download them from the ARM site
    76. 76. ARM assembly • Hit the docs • References included in your USB card • Or download them from the ARM site • http://bit.ly/arminfo
    77. 77. ARM assembly
    78. 78. ARM assembly • Reading assembly is a very important skill for high-performance programming
    79. 79. ARM assembly • Reading assembly is a very important skill for high-performance programming • Writing is more specialized. Most people don’t need to.
    80. 80. VFP unit
    81. 81. VFP unit A0
    82. 82. VFP unit A0 +
    83. 83. VFP unit A0 + B0
    84. 84. VFP unit A0 + B0 =
    85. 85. VFP unit A0 + B0 = C0
    86. 86. VFP unit A0 + B0 = C0 A1 + B1 = C1
    87. 87. VFP unit A0 A2 + + B0 B2 = = C0 C2 A1 + B1 = C1
    88. 88. VFP unit A0 A2 + + B0 B2 = = C0 C2 A1 A3 + + B1 B3 = = C1 C3
    89. 89. VFP unit
    90. 90. VFP unit A0 A1 A2 A3
    91. 91. VFP unit A0 A1 A2 A3 +
    92. 92. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3
    93. 93. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 =
    94. 94. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 = C0 C1 C2 C3
    95. 95. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 = C0 C1 C2 C3 Sweet! How do we use the vfp?
    96. 96. Like this! "fldmias %2, {s8-s23} nt" "fldmias %1!, {s0-s3} nt" "fmuls s24, s8, s0 nt" "fmacs s24, s12, s1 nt" "fldmias %1!, {s4-s7} nt" "fmacs s24, s16, s2 nt" "fmacs s24, s20, s3 nt" "fstmias %0!, {s24-s27} nt"
    97. 97. Writing vfp assembly
    98. 98. Writing vfp assembly • There are two parts to it
    99. 99. Writing vfp assembly • There are two parts to it • How to write any assembly in gcc
    100. 100. Writing vfp assembly • There are two parts to it • How to write any assembly in gcc • Learning ARM and VPM assembly
    101. 101. vfpmath library
    102. 102. vfpmath library • Already done a lot of work for you
    103. 103. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary
    104. 104. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary • Vector/matrix math
    105. 105. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary • Vector/matrix math • Might not be exactly what you need, but it’s a great starting point
    106. 106. Assembly in gcc • Only use it when targeting the device
    107. 107. Assembly in gcc • Only use it when targeting the device #include <TargetConditionals.h> #if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1) #define USE_VFP #endif
    108. 108. Assembly in gcc • The basics asm (“cmp r2, r1”);
    109. 109. Assembly in gcc • The basics asm (“cmp r2, r1”); http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly- HOWTO.html
    110. 110. Assembly in gcc • Multiple lines asm ( “mov r0, #1000nt” “cmp r2, r1nt” );
    111. 111. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers );
    112. 112. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers ); int src = 19; int dest = 0; asm volatile ( "add %0, %1, #42" : "=r" (dest) : "r" (src) : );
    113. 113. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers ); int src = 19; int dest = 0; %0, %1, etc are the variables in order asm volatile ( "add %0, %1, #42" : "=r" (dest) : "r" (src) : );
    114. 114. Assembly in gcc
    115. 115. Assembly in gcc int src = 19; int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" );
    116. 116. Assembly in gcc int src = 19; int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" ); Clobber register list are registers used by the asm block
    117. 117. Assembly in gcc int src = 19; volatile prevents “optimizations” int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" ); Clobber register list are registers used by the asm block
    118. 118. VFP asm Four banks of 8 32-bit registers each Can address them as single precision or as doubles
    119. 119. VFP asm
    120. 120. VFP asm #define VFP_VECTOR_LENGTH(VEC_LENGTH) "fmrx r0, fpscr nt" "bic r0, r0, #0x00370000 nt" "orr r0, r0, #0x000" #VEC_LENGTH "0000 nt" "fmxr fpscr, r0 nt"
    121. 121. VFP asm Bank 0 is always scalar! Operations only work on a single bank (wrap around possible)
    122. 122. VFP asm
    123. 123. VFP asm
    124. 124. VFP asm for (int i=0; i<MaxParticles; ++i) { Particle& p = s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; }
    125. 125. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { Particle& p = s_particles[i]; { p.x += p.vx*dt; void* p = &s_particles[i]; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; p.vx *= drag; "fldmias %1, {s0} nt" p.vy *= drag; "fldmias %2, {s1} nt" p.vz *= drag; "fldmias %0, {s8-s13} nt" } "fmacs s8, s11, s0 nt" "fmuls s11, s11, s1 nt" "fstmias %0, {s8-s13} nt" : : "r" (p), "r" (&dt), "r" (&drag) : "r0", "cc", "memory" ); }
    126. 126. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { Particle& p = s_particles[i]; { p.x += p.vx*dt; void* p = &s_particles[i]; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; p.vx *= drag; "fldmias %1, {s0} nt" p.vy *= drag; "fldmias %2, {s1} nt" p.vz *= drag; "fldmias %0, {s8-s13} nt" } "fmacs s8, s11, s0 nt" "fmuls s11, s11, s1 nt" Was: 5.1 seconds "fstmias %0, {s8-s13} nt" : : "r" (p), "r" (&dt), "r" (&drag) : "r0", "cc", "memory" ); }
    127. 127. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { Particle& p = s_particles[i]; { p.x += p.vx*dt; void* p = &s_particles[i]; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; p.vx *= drag; "fldmias %1, {s0} nt" p.vy *= drag; "fldmias %2, {s1} nt" p.vz *= drag; "fldmias %0, {s8-s13} nt" } "fmacs s8, s11, s0 nt" "fmuls s11, s11, s1 nt" Was: 5.1 seconds "fstmias %0, {s8-s13} nt" : Now: 2.7 seconds!! : "r" (p), "r" (&dt), "r" (&drag) : "r0", "cc", "memory" ); }
    128. 128. VFP asm for (int i=0; i<MaxParticles; ++i) { void* p = &s_particles[i]; asm volatile ( "fldmias %1, {s0} nt" "fldmias %2, {s1} nt" "fldmias %0, {s8-s13} nt" "fmacs s8, s11, s0 nt" "fmuls s11, s11, s1 nt" "fstmias %0, {s8-s13} nt" : : "r" (p), "r" (&dt), "r" (&drag) : "r0", "cc", "memory" ); }
    129. 129. VFP asm for (int i=0; i<MaxParticles; ++i) Same every loop! { void* p = &s_particles[i]; asm volatile ( "fldmias %1, {s0} nt" "fldmias %2, {s1} nt" "fldmias %0, {s8-s13} nt" "fmacs s8, s11, s0 nt" "fmuls s11, s11, s1 nt" "fstmias %0, {s8-s13} nt" : : "r" (p), "r" (&dt), "r" (&drag) : "r0", "cc", "memory" ); }
    130. 130. VFP asm void* p = &s_particles[0]; asm volatile ( "fldmias %1, {s0} nt" "fldmias %2, {s1} nt" "mov r1, %0 nt" "mov r2, %0 nt" "mov r3, %3 nt" "0: nt" "fldmias r1!, {s8-s13} nt" "fmacs s8, s11, s0 nt" "fmuls s11, s11, s1 nt" "fstmias r2!, {s8-s13} nt" "subs r3, r3, #1 nt" "bne 0b nt" : : "r" (p), "r" (&dt), "r" (&drag), "r" (MaxParticles) : "r0", "r1", "r2", "r3", "cc", "memory" );
    131. 131. VFP asm void* p = &s_particles[0]; asm volatile ( "fldmias %1, {s0} nt" Was: 2.7 seconds "fldmias %2, {s1} nt" "mov r1, %0 nt" "mov r2, %0 nt" "mov r3, %3 nt" "0: nt" "fldmias r1!, {s8-s13} nt" "fmacs s8, s11, s0 nt" "fmuls s11, s11, s1 nt" "fstmias r2!, {s8-s13} nt" "subs r3, r3, #1 nt" "bne 0b nt" : : "r" (p), "r" (&dt), "r" (&drag), "r" (MaxParticles) : "r0", "r1", "r2", "r3", "cc", "memory" );
    132. 132. VFP asm void* p = &s_particles[0]; asm volatile ( "fldmias %1, {s0} nt" Was: 2.7 seconds "fldmias %2, {s1} nt" "mov r1, %0 nt" Now: 2.7 seconds "mov r2, %0 nt" "mov r3, %3 nt" "0: nt" "fldmias r1!, {s8-s13} nt" "fmacs s8, s11, s0 nt" "fmuls s11, s11, s1 nt" "fstmias r2!, {s8-s13} nt" "subs r3, r3, #1 nt" "bne 0b nt" : : "r" (p), "r" (&dt), "r" (&drag), "r" (MaxParticles) : "r0", "r1", "r2", "r3", "cc", "memory" );
    133. 133. VFP asm We can do 8 operations at once. So let’s try doing two particles in a single operation. struct Particle2 { float x0, y0, z0; float x1, y1, z1; float vx0, vy0, vz0; float vx1, vy1, vz1; };
    134. 134. VFP asm void* p = &s_particles2[0]; asm volatile ( "fldmias %1, {s0} nt" "fldmias %2, {s1} nt" "mov r1, %0 nt" "mov r2, %0 nt" "mov r3, %3 nt" "0: nt" "fldmias r1!, {s8-s13} nt" "fldmias r1!, {s16-s21} nt" "fmacs s8, s16, s0 nt" "fmuls s16, s16, s1 nt" "fstmias r2!, {s8-s13} nt" "fstmias r2!, {s16-s21} nt" "subs r3, r3, #1 nt" "bne 0b nt" : : "r" (p), "r" (&dt), "r" (&drag), "r" (iterations) : "r0", "r1", "r2", "r3", "cc", "memory" );
    135. 135. VFP asm void* p = &s_particles2[0]; asm volatile ( "fldmias %1, {s0} "fldmias %2, {s1} nt" nt" Was: 2.77 seconds "mov r1, %0 nt" "mov r2, %0 nt" "mov r3, %3 nt" "0: nt" "fldmias r1!, {s8-s13} nt" "fldmias r1!, {s16-s21} nt" "fmacs s8, s16, s0 nt" "fmuls s16, s16, s1 nt" "fstmias r2!, {s8-s13} nt" "fstmias r2!, {s16-s21} nt" "subs r3, r3, #1 nt" "bne 0b nt" : : "r" (p), "r" (&dt), "r" (&drag), "r" (iterations) : "r0", "r1", "r2", "r3", "cc", "memory" );
    136. 136. VFP asm void* p = &s_particles2[0]; asm volatile ( "fldmias %1, {s0} "fldmias %2, {s1} nt" nt" Was: 2.77 seconds "mov r1, %0 nt" "mov r2, %0 "mov r3, %3 nt" nt" Now: 2.67 seconds "0: nt" "fldmias r1!, {s8-s13} nt" "fldmias r1!, {s16-s21} nt" "fmacs s8, s16, s0 nt" "fmuls s16, s16, s1 nt" "fstmias r2!, {s8-s13} nt" "fstmias r2!, {s16-s21} nt" "subs r3, r3, #1 nt" "bne 0b nt" : : "r" (p), "r" (&dt), "r" (&drag), "r" (iterations) : "r0", "r1", "r2", "r3", "cc", "memory" );
    137. 137. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; }
    138. 138. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; } Was: 2.67 seconds
    139. 139. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; } Was: 2.67 seconds Now: 2.41 seconds!!!!
    140. 140. Matrix multiply
    141. 141. Matrix multiply Straight from vfpmathlib
    142. 142. Matrix multiply Straight from vfpmathlib Touch: 0.0379 s
    143. 143. Matrix multiply Straight from vfpmathlib Touch: 0.0379 s Normal: 0.0968 s
    144. 144. Matrix multiply Straight from vfpmathlib Touch: 0.0379 s Normal: 0.0968 s VFP: 0.0422 s
    145. 145. Matrix multiply Straight from vfpmathlib Touch: 0.0379 s Normal: 0.0968 s VFP: 0.0422 s About 2x faster!
    146. 146. Good use of vfp
    147. 147. Good use of vfp Something with lots of fp operations in a row
    148. 148. Good use of vfp Something with lots of fp operations in a row • Matrix operations
    149. 149. Good use of vfp Something with lots of fp operations in a row • Matrix operations • Particle systems
    150. 150. Good use of vfp Something with lots of fp operations in a row • Matrix operations • Particle systems • Skinning
    151. 151. Good use of vfp Something with lots of fp operations in a row • Matrix operations • Particle systems • Skinning • Physics
    152. 152. Good use of vfp Something with lots of fp operations in a row • Matrix operations • Particle systems • Skinning • Physics • Procedural content generation
    153. 153. Good use of vfp Something with lots of fp operations in a row • Matrix operations • Particle systems • Skinning • Physics • Procedural content generation • ....
    154. 154. What about the 3GS?
    155. 155. What about the 3GS? 3G 3GS Thumb 14.1 14.5 Normal 5.14 4.76 VFP1 2.77 4.53 VFP2 2.77 4.26 VFP3 2.66 3.57 Touch 2.41 0.42
    156. 156. What about the 3GS? 3G 3GS Thumb 14.1 14.5 Normal 5.14 4.76 VFP1 2.77 4.53 VFP2 2.77 4.26 VFP3 2.66 3.57 Touch 2.41 0.42
    157. 157. What about the 3GS? 3G 3GS Thumb 14.1 14.5 Normal 5.14 4.76 VFP1 2.77 4.53 VFP2 2.77 4.26 VFP3 2.66 3.57 Touch 2.41 0.42
    158. 158. What about the 3GS? 3G 3GS Thumb 14.1 14.5 Normal 5.14 4.76 VFP1 2.77 4.53 VFP2 2.77 4.26 VFP3 2.66 3.57 Touch 2.41 0.42
    159. 159. Matrix multiply on 3GS In ms
    160. 160. Matrix multiply on 3GS In ms 3G 3GS Normal 96 82 VFP1 42 90 VFP2 42 75 Touch 38 19
    161. 161. Matrix multiply on 3GS In ms 3G 3GS Normal 96 82 VFP1 42 90 VFP2 42 75 Touch 38 19
    162. 162. Matrix multiply on 3GS In ms 3G 3GS Normal 96 82 VFP1 42 90 VFP2 42 75 Touch 38 19
    163. 163. Matrix multiply on 3GS In ms 3G 3GS Normal 96 82 VFP1 42 90 VFP2 42 75 Touch 38 19
    164. 164. VFP resources • ARM and VFP reference in your USB drive • http://code.google.com/p/vfpmathlibrary • http://aleiby.blogspot.com/2008/12/iphone- vfp-for-n00bs.html • http://www.ibiblio.org/gferg/ldp/GCC- Inline-Assembly-HOWTO.html
    165. 165. More 3GS: NEON
    166. 166. More 3GS: NEON • SIMD coprocessor
    167. 167. More 3GS: NEON • SIMD coprocessor • Floating point and integer
    168. 168. More 3GS: NEON • SIMD coprocessor • Floating point and integer • Huge potential
    169. 169. More 3GS: NEON • SIMD coprocessor • Floating point and integer • Huge potential • Not many examples yet
    170. 170. NEON resources
    171. 171. NEON resources • Cortex A8 reference in USB drive
    172. 172. NEON resources • Cortex A8 reference in USB drive • http://gcc.gnu.org/onlinedocs/gcc/ARM- NEON-Intrinsics.html
    173. 173. NEON resources • Cortex A8 reference in USB drive • http://gcc.gnu.org/onlinedocs/gcc/ARM- NEON-Intrinsics.html • http://code.google.com/p/oolongengine/ source/browse/trunk/Oolong+Engine2/ Math/neonmath
    174. 174. Conclusions
    175. 175. Conclusions • Turn Thumb mode off NOW
    176. 176. Conclusions • Turn Thumb mode off NOW • Expect to get at least 2x performance in older hardware by using vfp
    177. 177. Conclusions • Turn Thumb mode off NOW • Expect to get at least 2x performance in older hardware by using vfp • Not much difference in 3GS (but it’s fast already)
    178. 178. Conclusions • Turn Thumb mode off NOW • Expect to get at least 2x performance in older hardware by using vfp • Not much difference in 3GS (but it’s fast already) • NEON SIMD tech still unused. Research that and be the first one with the killer 3GS app!
    179. 179. Thank you! Noel Llopis Snappy Touch http://twitter.com/snappytouch noel@snappytouch.com http://gamesfromwithin.com

    ×