0
Cranking Floating Point
Performance Up To 11 
             Noel Llopis
            Snappy Touch

    http://twitter.com/sn...
Floating Point
Performance
Floating point numbers
Floating point numbers

• Representation of rational numbers
Floating point numbers

• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
Floating point numbers

• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE ...
Floating point numbers

• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE ...
Floating point numbers

• Representation of rational numbers
• 1.2345, -0.8374, 2.0000, 14388439.34, etc
• Following IEEE ...
Floating point numbers
Floating point numbers
Why floating point
 performance?
Why floating point
    performance?

• Most games use floating point numbers for
  most of their calculations
Why floating point
     performance?

• Most games use floating point numbers for
  most of their calculations
• Positions, ...
Why floating point
     performance?

• Most games use floating point numbers for
  most of their calculations
• Positions, ...
CPU
CPU

• 32-bit RISC ARM 11
CPU

• 32-bit RISC ARM 11
• 400-535Mhz
CPU

• 32-bit RISC ARM 11
• 400-535Mhz
• iPhone 2G/3G and iPod
  Touch 1st and 2nd gen
CPU (iPhone 3GS)
CPU (iPhone 3GS)


• Cortex-A8 600MHz
CPU (iPhone 3GS)


• Cortex-A8 600MHz
• More advanced
  architecture
CPU
CPU


• No floating point support
  in the ARM CPU!!!
How about integer
     math?
How about integer
        math?

• No need to do any floating point
  operations
How about integer
        math?

• No need to do any floating point
  operations
• Fully supported in the ARM processor
How about integer
        math?

• No need to do any floating point
  operations
• Fully supported in the ARM processor
• B...
Integer Divide
Integer Divide
Integer Divide




There is no integer divide
Fixed-point arithmetic
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a ...
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a ...
Fixed-point arithmetic
• Sometimes integer arithmetic doesn’t cut it
• You need to represent rational numbers
• Can use a ...
Floating point support
Floating point support

•   There’s a floating point unit
Floating point support

•   There’s a floating point unit

•   Compiled C/C++/ObjC
    code uses the VFP unit for
    any fl...
Sample program
Sample program
	   struct Particle
	   {
	   	 float x, y, z;
	   	 float vx, vy, vz;
	   };
Sample program
	   struct Particle       for (int i=0; i<MaxParticles; ++i)
	   {                     {
	   	 float x, y, ...
Sample program
	   struct Particle       for (int i=0; i<MaxParticles; ++i)
	   {                     {
	   	 float x, y, ...
Floating point support
Floating point support

   Trust no one!
Floating point support

   Trust no one!
 When in doubt, check the
   assembly generated
Floating point support
Thumb Mode
Thumb Mode
Thumb Mode
    •   CPU has a special thumb
        mode.
Thumb Mode
    •   CPU has a special thumb
        mode.

    •   Less memory, maybe better
        performance.
Thumb Mode
    •   CPU has a special thumb
        mode.

    •   Less memory, maybe better
        performance.

    •   ...
Thumb Mode
    •   CPU has a special thumb
        mode.

    •   Less memory, maybe better
        performance.

    •   ...
Thumb Mode
Thumb Mode

    • It’s on by default!
Thumb Mode

    • It’s on by default!
    • Potentiallyoff. wins
      turning it
                  HUGE
Thumb Mode

    • It’s on by default!
    • Potentiallyoff. wins
      turning it
                  HUGE
Thumb Mode
Thumb Mode

• Turning off Thumb mode increased
  performance in Flower Garden by over 2x
Thumb Mode

• Turning off Thumb mode increased
  performance in Flower Garden by over 2x
• Heavy usage of floating point op...
Thumb Mode

• Turning off Thumb mode increased
  performance in Flower Garden by over 2x
• Heavy usage of floating point op...
2.6 seconds!
ARM assembly
   DISCLAIMER:
ARM assembly
            DISCLAIMER:
I’m not an ARM assembly expert!!!
ARM assembly
            DISCLAIMER:
I’m not an ARM assembly expert!!!
ARM assembly
            DISCLAIMER:
I’m not an ARM assembly expert!!!
ARM assembly
              DISCLAIMER:
I’m not an ARM assembly expert!!!




          Z80!!!
ARM assembly
ARM assembly

• Hit the docs
ARM assembly

• Hit the docs
• References included in your USB card
ARM assembly

• Hit the docs
• References included in your USB card
• Or download them from the ARM site
ARM assembly

• Hit the docs
• References included in your USB card
• Or download them from the ARM site
• http://bit.ly/a...
ARM assembly
ARM assembly

• Reading assembly is a very important skill
  for high-performance programming
ARM assembly

• Reading assembly is a very important skill
  for high-performance programming
• Writing is more specialize...
VFP unit
VFP unit
A0
VFP unit
A0
+
VFP unit
A0
+
B0
VFP unit
A0
+
B0
=
VFP unit
A0
+
B0
=
C0
VFP unit
A0
+
B0
=
C0


A1
+
B1
=
C1
VFP unit
A0   A2
+    +
B0   B2
=    =
C0   C2


A1
+
B1
=
C1
VFP unit
A0   A2
+    +
B0   B2
=    =
C0   C2


A1   A3
+    +
B1   B3
=    =
C1   C3
VFP unit
VFP unit
A0   A1   A2    A3
VFP unit
A0   A1       A2    A3

          +
VFP unit
A0   A1       A2    A3

          +
B0   B1       B2    B3
VFP unit
A0   A1       A2    A3

          +
B0   B1       B2    B3

          =
VFP unit
A0   A1       A2    A3

          +
B0   B1       B2    B3

          =
C0   C1       C2    C3
VFP unit
A0   A1       A2    A3

          +
B0   B1       B2    B3

          =
C0   C1       C2    C3




 Sweet! How do...
Like this!

"fldmias %2, {s8-s23}     nt"
"fldmias %1!, {s0-s3}     nt"
"fmuls s24, s8, s0        nt"
"fmacs s24, s12, s1 ...
Writing vfp assembly
Writing vfp assembly

• There are two parts to it
Writing vfp assembly

• There are two parts to it
 • How to write any assembly in gcc
Writing vfp assembly

• There are two parts to it
 • How to write any assembly in gcc
 • Learning ARM and VPM assembly
vfpmath library
vfpmath library

• Already done a lot of work for you
vfpmath library

• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
vfpmath library

• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
• Vector/matrix math
vfpmath library

• Already done a lot of work for you
• http://code.google.com/p/vfpmathlibrary
• Vector/matrix math
• Mig...
Assembly in gcc
• Only use it when targeting the device
Assembly in gcc
 • Only use it when targeting the device
#include <TargetConditionals.h>
#if (TARGET_IPHONE_SIMULATOR == 0...
Assembly in gcc
• The basics

          asm (“cmp r2, r1”);
Assembly in gcc
    • The basics

                asm (“cmp r2, r1”);




http://www.ibiblio.org/gferg/ldp/GCC-Inline-Asse...
Assembly in gcc
• Multiple lines
            asm (
                “mov r0, #1000nt”
                “cmp r2, r1nt”
      ...
Assembly in gcc
• Accessing C variables
         asm (//assembly code
             : // output operands
             : // ...
Assembly in gcc
• Accessing C variables
             asm (//assembly code
                 : // output operands
          ...
Assembly in gcc
• Accessing C variables
             asm (//assembly code
                 : // output operands
          ...
Assembly in gcc
Assembly in gcc
	   	   int src = 19;
	   	   int dest = 0;
	   	
	   	   asm volatile (
	   	   	 "add r10, %1, #42nt"
	 ...
Assembly in gcc
	   	   int src = 19;
	   	   int dest = 0;
	   	
	   	   asm volatile (
	   	   	 "add r10, %1, #42nt"
	 ...
Assembly in gcc
	   	   int src = 19;      volatile prevents “optimizations”
	   	   int dest = 0;
	   	
	   	   asm volat...
VFP asm
Four banks of 8 32-bit registers each
VFP asm
Four banks of 8 32-bit registers each




    #define VFP_VECTOR_LENGTH(VEC_LENGTH)
        "fmrx    r0, fpscr    ...
VFP asm
VFP asm
VFP asm
for (int i=0; i<MaxParticles; ++i)
{
    Particle& p = s_particles[i];
    p.x += p.vx*dt;
    p.y += p.vy*dt;
   ...
VFP asm
                            	 	
for (int i=0; i<MaxParticles; ++i)   	 for (int i=0; i<MaxParticles; ++i)
{       ...
VFP asm
                            	 	
for (int i=0; i<MaxParticles; ++i)   	 for (int i=0; i<MaxParticles; ++i)
{       ...
VFP asm
                            	 	
for (int i=0; i<MaxParticles; ++i)   	 for (int i=0; i<MaxParticles; ++i)
{       ...
VFP asm
Let’s do 6 operations at once!

    	   struct Particle2
    	   {
    	   	 float x0, y0, z0;
    	   	 float x1,...
VFP asm
	   	   	   for (int i=0; i<iterations; ++i)
	   	   	   {
	   	   	   	 Particle2* p = &s_particles2[i];
	   	   ...
VFP asm
	   	   	   for (int i=0; i<iterations; ++i)
	   	   	   {
	   	   	   	 Particle2* p = &s_particles2[i];
	   	   ...
VFP asm
	   	   	   for (int i=0; i<iterations; ++i)
	   	   	   {
	   	   	   	 Particle2* p = &s_particles2[i];
	   	   ...
VFP asm
    What’s the loop/cache overhead?
	   	   	   for (int i=0; i<MaxParticles; ++i)
	   	   	   {
	   	   	   	 Par...
VFP asm
    What’s the loop/cache overhead?
	   	   	   for (int i=0; i<MaxParticles; ++i)
	   	   	   {
	   	   	   	 Par...
VFP asm
    What’s the loop/cache overhead?
	   	   	   for (int i=0; i<MaxParticles; ++i)
	   	   	   {
	   	   	   	 Par...
Matrix multiply
Matrix multiply
Straight from vfpmathlib
Matrix multiply
Straight from vfpmathlib

Touch: 0.037919 s
Matrix multiply
Straight from vfpmathlib

Touch: 0.037919 s
Normal: 0.096855 s
Matrix multiply
Straight from vfpmathlib

Touch: 0.037919 s
Normal: 0.096855 s
VFP: 0.042216 s
Matrix multiply
Straight from vfpmathlib

Touch: 0.037919 s
Normal: 0.096855 s
VFP: 0.042216 s

     About 2x faster!
Good use of vfp
Good use of vfp
• Matrix operations
Good use of vfp
• Matrix operations
• Particle systems
Good use of vfp
• Matrix operations
• Particle systems
• Skinning
Good use of vfp
• Matrix operations
• Particle systems
• Skinning
• Physics
Good use of vfp
• Matrix operations
• Particle systems
• Skinning
• Physics
• Procedural content generation
Good use of vfp
• Matrix operations
• Particle systems
• Skinning
• Physics
• Procedural content generation
• ....
What about the 3GS?
What about the 3GS?
          3G    3GS
 Thumb    7.2   8.0

 Normal   2.6   2.6

  VFP1    1.4   1.30

  VFP2    1.2   0....
What about the 3GS?
          3G    3GS
 Thumb    7.2   8.0

 Normal   2.6   2.6

  VFP1    1.4   1.30

  VFP2    1.2   0....
What about the 3GS?
          3G    3GS
 Thumb    7.2   8.0

 Normal   2.6   2.6

  VFP1    1.4   1.30

  VFP2    1.2   0....
What about the 3GS?
          3G    3GS
 Thumb    7.2   8.0

 Normal   2.6   2.6

  VFP1    1.4   1.30

  VFP2    1.2   0....
What about the 3GS?
          3G    3GS
 Thumb    7.2   8.0

 Normal   2.6   2.6

  VFP1    1.4   1.30

  VFP2    1.2   0....
What about the 3GS?
          3G    3GS
 Thumb    7.2   8.0

 Normal   2.6   2.6

  VFP1    1.4   1.30

  VFP2    1.2   0....
What about the 3GS?
          3G    3GS
 Thumb    7.2   8.0

 Normal   2.6   2.6

  VFP1    1.4   1.30

  VFP2    1.2   0....
More 3GS: NEON
More 3GS: NEON

• SIMD coprocessor
More 3GS: NEON

• SIMD coprocessor
• Floating point and integer
More 3GS: NEON

• SIMD coprocessor
• Floating point and integer
• Huge potential
More 3GS: NEON

• SIMD coprocessor
• Floating point and integer
• Huge potential
• Very little documentation right now :-(
Thank you!


         Noel Llopis
        Snappy Touch

http://twitter.com/snappytouch
   noel@snappytouch.com
 http://gam...
Cranking Floating Point Performance Up To 11
Cranking Floating Point Performance Up To 11
Cranking Floating Point Performance Up To 11
Cranking Floating Point Performance Up To 11
Cranking Floating Point Performance Up To 11
Cranking Floating Point Performance Up To 11
Cranking Floating Point Performance Up To 11
Upcoming SlideShare
Loading in...5
×

Cranking Floating Point Performance Up To 11

1,578

Published on

The iPhone has a surprisingly powerful engine under that shiny hood when it comes to floating-point computations. This is something that surprises a lot of programmers because by default, things can slow down a lot whenever any floating point numbers are involved. This session will explain the secrets to unlocking maximum performance for floating point calculations, from the mysteries of Thumb mode, to harnessing the full power of the forgotten vector floating point unit. Stay away from this session if he thought of reading or even (gasp!) writing assembly code scares you.

Published in: Technology, News & Politics
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,578
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
29
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Transcript of "Cranking Floating Point Performance Up To 11"

  1. 1. Cranking Floating Point Performance Up To 11  Noel Llopis Snappy Touch http://twitter.com/snappytouch noel@snappytouch.com http://gamesfromwithin.com
  2. 2. Floating Point Performance
  3. 3. Floating point numbers
  4. 4. Floating point numbers • Representation of rational numbers
  5. 5. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc
  6. 6. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format
  7. 7. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format • Single precision: 32 bits
  8. 8. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format • Single precision: 32 bits • Double precision: 64 bits
  9. 9. Floating point numbers
  10. 10. Floating point numbers
  11. 11. Why floating point performance?
  12. 12. Why floating point performance? • Most games use floating point numbers for most of their calculations
  13. 13. Why floating point performance? • Most games use floating point numbers for most of their calculations • Positions, velocities, physics, etc, etc.
  14. 14. Why floating point performance? • Most games use floating point numbers for most of their calculations • Positions, velocities, physics, etc, etc. • Maybe not so much for regular apps
  15. 15. CPU
  16. 16. CPU • 32-bit RISC ARM 11
  17. 17. CPU • 32-bit RISC ARM 11 • 400-535Mhz
  18. 18. CPU • 32-bit RISC ARM 11 • 400-535Mhz • iPhone 2G/3G and iPod Touch 1st and 2nd gen
  19. 19. CPU (iPhone 3GS)
  20. 20. CPU (iPhone 3GS) • Cortex-A8 600MHz
  21. 21. CPU (iPhone 3GS) • Cortex-A8 600MHz • More advanced architecture
  22. 22. CPU
  23. 23. CPU • No floating point support in the ARM CPU!!!
  24. 24. How about integer math?
  25. 25. How about integer math? • No need to do any floating point operations
  26. 26. How about integer math? • No need to do any floating point operations • Fully supported in the ARM processor
  27. 27. How about integer math? • No need to do any floating point operations • Fully supported in the ARM processor • But...
  28. 28. Integer Divide
  29. 29. Integer Divide
  30. 30. Integer Divide There is no integer divide
  31. 31. Fixed-point arithmetic
  32. 32. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it
  33. 33. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers
  34. 34. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a fixed-point library.
  35. 35. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a fixed-point library. • Performs rational arithmetic with integer values at a reduced range/resolution.
  36. 36. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a fixed-point library. • Performs rational arithmetic with integer values at a reduced range/resolution. • Not so great...
  37. 37. Floating point support
  38. 38. Floating point support • There’s a floating point unit
  39. 39. Floating point support • There’s a floating point unit • Compiled C/C++/ObjC code uses the VFP unit for any floating point operations.
  40. 40. Sample program
  41. 41. Sample program struct Particle { float x, y, z; float vx, vy, vz; };
  42. 42. Sample program struct Particle for (int i=0; i<MaxParticles; ++i) { { float x, y, z; Particle& p = s_particles[i]; float vx, vy, vz; p.x += p.vx*dt; }; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; }
  43. 43. Sample program struct Particle for (int i=0; i<MaxParticles; ++i) { { float x, y, z; Particle& p = s_particles[i]; float vx, vy, vz; p.x += p.vx*dt; }; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; } • 7.2 seconds on an iPod Touch 2nd gen
  44. 44. Floating point support
  45. 45. Floating point support Trust no one!
  46. 46. Floating point support Trust no one! When in doubt, check the assembly generated
  47. 47. Floating point support
  48. 48. Thumb Mode
  49. 49. Thumb Mode
  50. 50. Thumb Mode • CPU has a special thumb mode.
  51. 51. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance.
  52. 52. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance. • No floating point support.
  53. 53. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance. • No floating point support. • Every time there’s an fp operation, it switches out of Thumb, does the fp operation, and switches back on.
  54. 54. Thumb Mode
  55. 55. Thumb Mode • It’s on by default!
  56. 56. Thumb Mode • It’s on by default! • Potentiallyoff. wins turning it HUGE
  57. 57. Thumb Mode • It’s on by default! • Potentiallyoff. wins turning it HUGE
  58. 58. Thumb Mode
  59. 59. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x
  60. 60. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x • Heavy usage of floating point operations though
  61. 61. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x • Heavy usage of floating point operations though • Most games will probably benefit from turning it off (especially 3D games)
  62. 62. 2.6 seconds!
  63. 63. ARM assembly DISCLAIMER:
  64. 64. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
  65. 65. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
  66. 66. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
  67. 67. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!! Z80!!!
  68. 68. ARM assembly
  69. 69. ARM assembly • Hit the docs
  70. 70. ARM assembly • Hit the docs • References included in your USB card
  71. 71. ARM assembly • Hit the docs • References included in your USB card • Or download them from the ARM site
  72. 72. ARM assembly • Hit the docs • References included in your USB card • Or download them from the ARM site • http://bit.ly/arminfo
  73. 73. ARM assembly
  74. 74. ARM assembly • Reading assembly is a very important skill for high-performance programming
  75. 75. ARM assembly • Reading assembly is a very important skill for high-performance programming • Writing is more specialized. Most people don’t need to.
  76. 76. VFP unit
  77. 77. VFP unit A0
  78. 78. VFP unit A0 +
  79. 79. VFP unit A0 + B0
  80. 80. VFP unit A0 + B0 =
  81. 81. VFP unit A0 + B0 = C0
  82. 82. VFP unit A0 + B0 = C0 A1 + B1 = C1
  83. 83. VFP unit A0 A2 + + B0 B2 = = C0 C2 A1 + B1 = C1
  84. 84. VFP unit A0 A2 + + B0 B2 = = C0 C2 A1 A3 + + B1 B3 = = C1 C3
  85. 85. VFP unit
  86. 86. VFP unit A0 A1 A2 A3
  87. 87. VFP unit A0 A1 A2 A3 +
  88. 88. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3
  89. 89. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 =
  90. 90. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 = C0 C1 C2 C3
  91. 91. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 = C0 C1 C2 C3 Sweet! How do we use the vfp?
  92. 92. Like this! "fldmias %2, {s8-s23} nt" "fldmias %1!, {s0-s3} nt" "fmuls s24, s8, s0 nt" "fmacs s24, s12, s1 nt" "fldmias %1!, {s4-s7} nt" "fmacs s24, s16, s2 nt" "fmacs s24, s20, s3 nt" "fstmias %0!, {s24-s27} nt"
  93. 93. Writing vfp assembly
  94. 94. Writing vfp assembly • There are two parts to it
  95. 95. Writing vfp assembly • There are two parts to it • How to write any assembly in gcc
  96. 96. Writing vfp assembly • There are two parts to it • How to write any assembly in gcc • Learning ARM and VPM assembly
  97. 97. vfpmath library
  98. 98. vfpmath library • Already done a lot of work for you
  99. 99. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary
  100. 100. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary • Vector/matrix math
  101. 101. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary • Vector/matrix math • Might not be exactly what you need, but it’s a great starting point
  102. 102. Assembly in gcc • Only use it when targeting the device
  103. 103. Assembly in gcc • Only use it when targeting the device #include <TargetConditionals.h> #if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1) #define USE_VFP #endif
  104. 104. Assembly in gcc • The basics asm (“cmp r2, r1”);
  105. 105. Assembly in gcc • The basics asm (“cmp r2, r1”); http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly- HOWTO.html
  106. 106. Assembly in gcc • Multiple lines asm ( “mov r0, #1000nt” “cmp r2, r1nt” );
  107. 107. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers );
  108. 108. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers ); int src = 19; int dest = 0; asm volatile ( "add %0, %1, #42" : "=r" (dest) : "r" (src) : );
  109. 109. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers ); int src = 19; int dest = 0; %0, %1, etc are the variables in order asm volatile ( "add %0, %1, #42" : "=r" (dest) : "r" (src) : );
  110. 110. Assembly in gcc
  111. 111. Assembly in gcc int src = 19; int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" );
  112. 112. Assembly in gcc int src = 19; int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" ); Clobber register list are registers used by the asm block
  113. 113. Assembly in gcc int src = 19; volatile prevents “optimizations” int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" ); Clobber register list are registers used by the asm block
  114. 114. VFP asm Four banks of 8 32-bit registers each
  115. 115. VFP asm Four banks of 8 32-bit registers each #define VFP_VECTOR_LENGTH(VEC_LENGTH) "fmrx r0, fpscr nt" "bic r0, r0, #0x00370000 nt" "orr r0, r0, #0x000" #VEC_LENGTH "0000 nt" "fmxr fpscr, r0 nt"
  116. 116. VFP asm
  117. 117. VFP asm
  118. 118. VFP asm for (int i=0; i<MaxParticles; ++i) { Particle& p = s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; }
  119. 119. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { { Particle& p = s_particles[i]; Particle* p = &s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; "fldmias %0, {s0-s5} nt" p.vx *= drag; "fldmias %1, {s6-s8} nt" p.vy *= drag; p.vz *= drag; "fldmias %2, {s9-s11} nt" } "fmacs s0, s3, s6 nt" "fmuls s3, s3, s9 nt" "fstmias %0, {s0-s5} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); }
  120. 120. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { { Particle& p = s_particles[i]; Particle* p = &s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; "fldmias %0, {s0-s5} nt" p.vx *= drag; "fldmias %1, {s6-s8} nt" p.vy *= drag; p.vz *= drag; "fldmias %2, {s9-s11} nt" } "fmacs s0, s3, s6 nt" "fmuls s3, s3, s9 nt" Was: 2.6 seconds "fstmias %0, {s0-s5} : "=r" (p) nt" : "r" (p), "r" (dtArray), "r" (dragArray) : ); }
  121. 121. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { { Particle& p = s_particles[i]; Particle* p = &s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; "fldmias %0, {s0-s5} nt" p.vx *= drag; "fldmias %1, {s6-s8} nt" p.vy *= drag; p.vz *= drag; "fldmias %2, {s9-s11} nt" } "fmacs s0, s3, s6 nt" "fmuls s3, s3, s9 nt" Was: 2.6 seconds "fstmias %0, {s0-s5} : "=r" (p) nt" : "r" (p), "r" (dtArray), Now: 1.4 seconds!! "r" (dragArray) : ); }
  122. 122. VFP asm Let’s do 6 operations at once! struct Particle2 { float x0, y0, z0; float x1, y1, z1; float vx0, vy0, vz0; float vx1, vy1, vz1; };
  123. 123. VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} nt" "fldmias %1, {s12-s17} nt" "fldmias %2, {s18-s23} nt" "fmacs s0, s6, s12 nt" "fmuls s6, s6, s18 nt" "fstmias %0, {s0-s11} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); }
  124. 124. VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} nt" "fldmias %1, {s12-s17} nt" "fldmias %2, {s18-s23} nt" "fmacs s0, s6, s12 nt" "fmuls s6, s6, s18 nt" "fstmias %0, {s0-s11} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); } Was: 1.4 seconds
  125. 125. VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} nt" "fldmias %1, {s12-s17} nt" "fldmias %2, {s18-s23} nt" "fmacs s0, s6, s12 nt" "fmuls s6, s6, s18 nt" "fstmias %0, {s0-s11} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); } Was: 1.4 seconds Now: 1.2 seconds
  126. 126. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; }
  127. 127. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; } Was: 1.2 seconds
  128. 128. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; } Was: 1.2 seconds Now: 1.2 seconds!!!!
  129. 129. Matrix multiply
  130. 130. Matrix multiply Straight from vfpmathlib
  131. 131. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s
  132. 132. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s Normal: 0.096855 s
  133. 133. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s Normal: 0.096855 s VFP: 0.042216 s
  134. 134. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s Normal: 0.096855 s VFP: 0.042216 s About 2x faster!
  135. 135. Good use of vfp
  136. 136. Good use of vfp • Matrix operations
  137. 137. Good use of vfp • Matrix operations • Particle systems
  138. 138. Good use of vfp • Matrix operations • Particle systems • Skinning
  139. 139. Good use of vfp • Matrix operations • Particle systems • Skinning • Physics
  140. 140. Good use of vfp • Matrix operations • Particle systems • Skinning • Physics • Procedural content generation
  141. 141. Good use of vfp • Matrix operations • Particle systems • Skinning • Physics • Procedural content generation • ....
  142. 142. What about the 3GS?
  143. 143. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
  144. 144. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
  145. 145. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
  146. 146. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
  147. 147. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
  148. 148. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
  149. 149. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
  150. 150. More 3GS: NEON
  151. 151. More 3GS: NEON • SIMD coprocessor
  152. 152. More 3GS: NEON • SIMD coprocessor • Floating point and integer
  153. 153. More 3GS: NEON • SIMD coprocessor • Floating point and integer • Huge potential
  154. 154. More 3GS: NEON • SIMD coprocessor • Floating point and integer • Huge potential • Very little documentation right now :-(
  155. 155. Thank you! Noel Llopis Snappy Touch http://twitter.com/snappytouch noel@snappytouch.com http://gamesfromwithin.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×