Cranking Floating Point Performance To 11 On The iPhone
360iDev presentation

  • Slides will be up on my web site after the talk
  • I just wanted to flash that to scare people off :-)
  • In the last 11 years I’ve made games for almost every major platform out there
  • During that time, working on engine technology and trying to achieve maximum performance. Performance always very important.
  • Almost a year ago I started Snappy Touch
  • Don’t waste time optimizing if you don’t have to. Sometimes you need performance to back up design. Also, beware optimizing the best case. Optimize the worst case!
  • Not necessary to understand the format, but it helps a lot to understand the source of problems. Single precision (32 bits) vs. double precision (64 bits).
  • Let’s see what we have to work with
  • We want to optimize for the old model
  • So plan accordingly!
  • ACTION: Go to Xcode. Simulator vs. device. Release.
  • What’s going on in there??
  • That’s more like it!
  • I’m saying that so you can see how anybody can get a good understanding of that, not so that you leave me right now :-)
  • Single operation
  • Need to get down to the metal
  • Can control how many registers to use for each operation. Up to 8 at once! Bank 0 is always scalar!!!
  • Stride
  • Read the manual for the specific instructions
  • That didn’t help much! What’s going on?
  • Barely an improvement!
  • VFP almost as fast as just iterating through the loop. Calculations become almost free!!!
  • This is one pipeline. Can squeeze more performance by doing up to 3 operations at the same time.
  • ACTION: Switch to Xcode
  • The vfp on the 3GS seems to be running at half the speed in comparison to the 3G. That, combined with a much faster CPU, makes it pretty useless there.
  • Maybe next year’s 360iDev session :-)

Cranking Floating Point Performance To 11 On The iPhone Presentation Transcript

  • 1. Cranking Floating Point Performance Up To 11  Noel Llopis Snappy Touch http://twitter.com/snappytouch noel@snappytouch.com http://gamesfromwithin.com
  • 2. void* p = &s_particles2[0];
      asm volatile (
          "fldmias %1, {s0}        \n\t"
          "fldmias %2, {s1}        \n\t"
          "mov r1, %0              \n\t"
          "mov r2, %0              \n\t"
          "mov r3, %3              \n\t"
          "0:                      \n\t"
          "fldmias r1!, {s8-s13}   \n\t"
          "fldmias r1!, {s16-s21}  \n\t"
          "fmacs s8, s16, s0       \n\t"
          "fmuls s16, s16, s1      \n\t"
          "fstmias r2!, {s8-s13}   \n\t"
          "fstmias r2!, {s16-s21}  \n\t"
          "subs r3, r3, #1         \n\t"
          "bne 0b                  \n\t"
          :
          : "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
          : "r0", "r1", "r2", "r3", "cc", "memory"
      );
  • 5. “Don’t do that bit-twiddling thing”
      “Optimize at the algorithm level”
      Yes, but key to good performance is looking at your data and your target platform
  • 6. Floating Point Performance
  • 12. Floating point numbers
      • Representation of rational numbers
      • 1.2345, -0.8374, 2.0000, 14388439.34, etc.
      • Following IEEE 754 format
      • Single precision: 32 bits
      • Double precision: 64 bits
  • 13. Floating point numbers
  • 14. Floating point numbers
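The 32-bit single-precision layout mentioned above can be inspected directly. The following is a minimal sketch, not part of the talk: `FloatBits` and `decompose` are hypothetical names, and it assumes `float` is a 32-bit IEEE 754 value, which the `static_assert` checks.

```cpp
#include <cstdint>
#include <cstring>

// The three fields of an IEEE 754 single-precision value.
struct FloatBits {
    std::uint32_t sign;      // 1 bit: 0 for positive, 1 for negative
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t mantissa;  // 23 bits; normal numbers have an implicit leading 1
};

// Hypothetical helper: splits a float into its raw fields.
inline FloatBits decompose(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);  // copy the bytes without aliasing issues
    FloatBits out;
    out.sign = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.mantissa = bits & 0x7FFFFFu;
    return out;
}
```

For example, `decompose(2.0f)` yields sign 0, exponent 128 (that is, 2^1 after removing the bias) and mantissa 0.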
  • 18. Why floating point performance?
      • Most games use floating point numbers for most of their calculations
      • Positions, velocities, physics, etc.
      • Maybe not so much for regular apps
  • 22. CPU
      • 32-bit RISC ARM 11
      • 400-535 MHz
      • iPhone 2G/3G and iPod Touch 1st and 2nd gen
  • 25. CPU (iPhone 3GS)
      • Cortex-A8 600 MHz
      • More advanced architecture
  • 27. CPU
      • No floating point support in the ARM CPU!!!
  • 31. How about integer math?
      • No need to do any floating point operations
      • Fully supported in the ARM processor
      • But...
  • 34. Integer Divide
      There is no integer divide
  • 40. Fixed-point arithmetic
      • Sometimes integer arithmetic doesn’t cut it
      • You need to represent rational numbers
      • Can use a fixed-point library.
      • Performs rational arithmetic with integer values at a reduced range/resolution.
      • Not so great...
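To make the "reduced range/resolution" trade-off concrete, here is a minimal sketch of 16.16 fixed-point arithmetic. It is an illustration only, not the API of any particular fixed-point library: `fixed16`, `to_fixed`, `to_float` and `fixed_mul` are hypothetical names. With 16 fractional bits the resolution is 1/65536 and the range is roughly -32768 to 32768.

```cpp
#include <cstdint>

// 16.16 fixed point: the upper 16 bits hold the integer part,
// the lower 16 bits hold the fraction.
typedef std::int32_t fixed16;
const int kFracBits = 16;

inline fixed16 to_fixed(float v) {
    return static_cast<fixed16>(v * (1 << kFracBits));
}

inline float to_float(fixed16 v) {
    return static_cast<float>(v) / (1 << kFracBits);
}

// Multiply in 64 bits so the intermediate product cannot overflow,
// then shift back down to 16 fractional bits.
// Note: right-shifting a negative value is implementation-defined before C++20;
// mainstream compilers perform an arithmetic shift.
inline fixed16 fixed_mul(fixed16 a, fixed16 b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    return static_cast<fixed16>(wide >> kFracBits);
}
```

Every multiply needs a widening step and a shift, and a divide is worse still, which is part of why the slide calls this approach "not so great".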
  • 43. Floating point support
      • There’s a floating point unit
      • Compiled C/C++/ObjC code uses the VFP unit for any floating point operations.
  • 47. Sample program
      struct Particle
      {
          float x, y, z;
          float vx, vy, vz;
      };

      for (int i=0; i<MaxParticles; ++i)
      {
          Particle& p = s_particles[i];
          p.x += p.vx*dt;
          p.y += p.vy*dt;
          p.z += p.vz*dt;
          p.vx *= drag;
          p.vy *= drag;
          p.vz *= drag;
      }
      • 14.1 seconds on an iPod Touch 2nd gen
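The slides do not show the timing harness behind the 14.1 second figure. A portable sketch of how such a measurement could be set up is below. `update_particles` and `time_updates` are hypothetical names, `std::chrono` stands in for whatever timer the talk used, and the particle count and number of passes are left to the caller because the slides do not state them.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct Particle {
    float x, y, z;
    float vx, vy, vz;
};

// One simulation step, the same arithmetic as the loop on the slide.
inline void update_particles(std::vector<Particle>& particles, float dt, float drag) {
    for (std::size_t i = 0; i < particles.size(); ++i) {
        Particle& p = particles[i];
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
        p.vx *= drag;
        p.vy *= drag;
        p.vz *= drag;
    }
}

// Runs `passes` steps and returns the elapsed wall-clock time in seconds.
inline double time_updates(std::vector<Particle>& particles, int passes, float dt, float drag) {
    typedef std::chrono::steady_clock Clock;
    const Clock::time_point start = Clock::now();
    for (int pass = 0; pass < passes; ++pass) {
        update_particles(particles, dt, drag);
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count();
}
```

Measure a release build on the device itself; simulator timings say nothing about ARM floating point performance.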
  • 50. Floating point support
      Trust no one!
      When in doubt, check the assembly generated
  • 51. Floating point support
  • 57. Thumb Mode
      • CPU has a special Thumb mode.
      • Less memory, maybe better performance.
      • No floating point support.
      • Every time there’s an fp operation, it switches out of Thumb, does the fp operation, and switches back on.
  • 61. Thumb Mode
      • It’s on by default!
      • Potentially HUGE wins turning it off.
  • 65. Thumb Mode
      • Turning off Thumb mode increased performance in Flower Garden by over 2x
      • Heavy usage of floating point operations though
      • Most games will probably benefit from turning it off (especially 3D games)
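In the spirit of "trust no one", a build can report for itself whether Thumb code generation is active. This sketch relies on the `__thumb__` macro that GCC and Clang predefine when targeting Thumb; `compiled_for_thumb` is a hypothetical helper name.

```cpp
// Returns true when this translation unit was compiled for the Thumb
// instruction set, false for ARM mode or for any non-ARM target.
inline bool compiled_for_thumb() {
#if defined(__thumb__)
    return true;
#else
    return false;
#endif
}
```

Logging this once at startup confirms that the project setting actually reached the compiler. The slides show the Xcode option only as a screenshot; in Xcode of that era it is assumed here to be the "Compile for Thumb" build setting (`GCC_THUMB_SUPPORT`), so verify the name in your own project.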
  • 66. 5.1 seconds!
  • 71. ARM assembly
      DISCLAIMER: I’m not an ARM assembly expert!!!
      Z80!!!
  • 76. ARM assembly
      • Hit the docs
      • References included in your USB card
      • Or download them from the ARM site
      • http://bit.ly/arminfo
  • 79. ARM assembly
      • Reading assembly is a very important skill for high-performance programming
      • Writing is more specialized. Most people don’t need to.
  • 88. VFP unit (one operation at a time)
      A0 + B0 = C0
      A1 + B1 = C1
      A2 + B2 = C2
      A3 + B3 = C3
  • 95. VFP unit (one vector operation)
      A0 A1 A2 A3
      +
      B0 B1 B2 B3
      =
      C0 C1 C2 C3
      Sweet! How do we use the vfp?
  • 96. Like this!
      "fldmias %2, {s8-s23}    \n\t"
      "fldmias %1!, {s0-s3}    \n\t"
      "fmuls s24, s8, s0       \n\t"
      "fmacs s24, s12, s1      \n\t"
      "fldmias %1!, {s4-s7}    \n\t"
      "fmacs s24, s16, s2      \n\t"
      "fmacs s24, s20, s3      \n\t"
      "fstmias %0!, {s24-s27}  \n\t"
  • 100. Writing vfp assembly
      • There are two parts to it
      • How to write any assembly in gcc
      • Learning ARM and VFP assembly
  • 105. vfpmath library
      • Already done a lot of work for you
      • http://code.google.com/p/vfpmathlibrary
      • Vector/matrix math
      • Might not be exactly what you need, but it’s a great starting point
  • 107. Assembly in gcc
      • Only use it when targeting the device
          #include <TargetConditionals.h>
          #if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1)
          #define USE_VFP
          #endif
  • 109. Assembly in gcc
      • The basics
          asm ("cmp r2, r1");
      http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
  • 110. Assembly in gcc
      • Multiple lines
          asm (
              "mov r0, #1000 \n\t"
              "cmp r2, r1    \n\t"
          );
  • 113. Assembly in gcc
      • Accessing C variables
          asm (// assembly code
               : // output operands
               : // input operands
               : // clobbered registers
          );
          int src = 19;
          int dest = 0;
          asm volatile (
              "add %0, %1, #42"
              : "=r" (dest)
              : "r" (src)
              :
          );
      %0, %1, etc. are the variables in order
  • 117. Assembly in gcc
          int src = 19;
          int dest = 0;
          asm volatile (
              "add r10, %1, #42 \n\t"
              "add %0, r10, #33 \n\t"
              : "=r" (dest)
              : "r" (src)
              : "r10"
          );
      volatile prevents “optimizations”
      The clobber list names the registers used by the asm block
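The clobber-list example above only assembles for ARM, so it pays to keep a plain C++ fallback behind the same kind of compile-time guard the talk uses for `USE_VFP`. The sketch below is one way to do that; `add_constants` is a hypothetical function name, and the guard uses the compiler-defined `__arm__` and `__thumb__` macros rather than the talk's `TargetConditionals.h` check so that it also builds off-device.

```cpp
// Adds 42 and then 33 to src. Uses the inline assembly from the slide when
// compiling for 32-bit ARM in ARM mode, and plain C++ everywhere else
// (simulator, desktop, Thumb builds).
inline int add_constants(int src) {
    int dest = 0;
#if defined(__arm__) && !defined(__thumb__)
    asm volatile (
        "add r10, %1, #42 \n\t"   // r10 = src + 42
        "add %0, r10, #33 \n\t"   // dest = r10 + 33
        : "=r" (dest)             // output operand %0
        : "r" (src)               // input operand %1
        : "r10"                   // r10 is overwritten by the block
    );
#else
    dest = src + 42 + 33;
#endif
    return dest;
}
```

Having both paths produce the same result makes it easy to unit test the assembly version against the fallback.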
  • 118. VFP asm
      Four banks of 8 32-bit registers each
      Can address them as single precision or as doubles
  • 119. VFP asm
  • 120. VFP asm
      #define VFP_VECTOR_LENGTH(VEC_LENGTH) "fmrx r0, fpscr \n\t" \
          "bic r0, r0, #0x00370000 \n\t" \
          "orr r0, r0, #0x000" #VEC_LENGTH "0000 \n\t" \
          "fmxr fpscr, r0 \n\t"
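The macro above reads FPSCR, clears the vector length and stride fields with the mask `0x00370000`, and ORs the new length into bits 16 to 18. The same bit manipulation in portable C++ looks like the sketch below; `with_vector_length` is a hypothetical name. As documented for the VFP FPSCR, the LEN field stores the vector length minus one, so an argument of 2 selects three-element vectors.

```cpp
#include <cstdint>

// LEN occupies bits 16-18 and STRIDE occupies bits 20-21 of FPSCR.
const std::uint32_t kLenStrideMask = 0x00370000u;

// Returns fpscr with STRIDE cleared and LEN set to lenField
// (lenField = vector length - 1, so 0..7).
inline std::uint32_t with_vector_length(std::uint32_t fpscr, std::uint32_t lenField) {
    return (fpscr & ~kLenStrideMask) | ((lenField & 0x7u) << 16);
}
```

Because the setting is global state, code that raises the vector length should put it back to scalar (field value 0) before returning to compiler-generated floating point code.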
  • 121. VFP asm
      Bank 0 is always scalar!
      Operations only work on a single bank (wrap around possible)
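The bank rules are easier to follow with a small software model. The sketch below is a simplified, hypothetical emulation (`VfpModel` is not a real API): it models only single-precision `fmacs` and `fmuls` with a stride of 1, treats a destination in bank 0 as scalar, reuses a bank 0 second operand as a scalar across the whole vector, and wraps register indices within their bank.

```cpp
#include <array>

// Simplified model of the VFP single-precision register file.
struct VfpModel {
    std::array<float, 32> s;  // s0..s31: four banks of eight registers
    int length;               // active vector length, 1..8 (set through FPSCR on hardware)

    VfpModel() : s(), length(1) {}

    static int bank(int reg) { return reg / 8; }

    // Next register in the same bank, wrapping from the last back to the first.
    static int next(int reg) { return (reg & ~7) | ((reg + 1) & 7); }

    // fmacs Sd, Sn, Sm : Sd += Sn * Sm
    void fmacs(int d, int n, int m) { apply(d, n, m, true); }

    // fmuls Sd, Sn, Sm : Sd = Sn * Sm
    void fmuls(int d, int n, int m) { apply(d, n, m, false); }

private:
    void apply(int d, int n, int m, bool accumulate) {
        const int count = (bank(d) == 0) ? 1 : length;  // bank 0 destination: scalar op
        const bool scalarM = (bank(m) == 0);            // bank 0 operand: one scalar reused
        for (int i = 0; i < count; ++i) {
            const float product = s[n] * s[m];
            s[d] = accumulate ? s[d] + product : product;
            d = next(d);
            n = next(n);
            if (!scalarM) m = next(m);
        }
    }
};
```

With `length` set to 3, `fmacs(8, 11, 0)` performs `s8 += s11*s0`, `s9 += s12*s0` and `s10 += s13*s0` in one instruction, which is exactly how the particle loop in the talk updates x, y and z from vx, vy and vz with `dt` held in s0.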
  • 122. VFP asm
  • 123. VFP asm
  • 124. VFP asm
      for (int i=0; i<MaxParticles; ++i)
      {
          Particle& p = s_particles[i];
          p.x += p.vx*dt;
          p.y += p.vy*dt;
          p.z += p.vz*dt;
          p.vx *= drag;
          p.vy *= drag;
          p.vz *= drag;
      }
  • 127. VFP asm
      Before (C++):
      for (int i=0; i<MaxParticles; ++i)
      {
          Particle& p = s_particles[i];
          p.x += p.vx*dt;
          p.y += p.vy*dt;
          p.z += p.vz*dt;
          p.vx *= drag;
          p.vy *= drag;
          p.vz *= drag;
      }
      After (VFP):
      for (int i=0; i<MaxParticles; ++i)
      {
          void* p = &s_particles[i];
          asm volatile (
              "fldmias %1, {s0}      \n\t"
              "fldmias %2, {s1}      \n\t"
              "fldmias %0, {s8-s13}  \n\t"
              "fmacs s8, s11, s0     \n\t"
              "fmuls s11, s11, s1    \n\t"
              "fstmias %0, {s8-s13}  \n\t"
              :
              : "r" (p), "r" (&dt), "r" (&drag)
              : "r0", "cc", "memory"
          );
      }
      Was: 5.1 seconds
      Now: 2.7 seconds!!
  • 129. VFP asm
      for (int i=0; i<MaxParticles; ++i)
      {
          void* p = &s_particles[i];
          asm volatile (
              "fldmias %1, {s0}      \n\t"    <- Same every loop!
              "fldmias %2, {s1}      \n\t"    <- Same every loop!
              "fldmias %0, {s8-s13}  \n\t"
              "fmacs s8, s11, s0     \n\t"
              "fmuls s11, s11, s1    \n\t"
              "fstmias %0, {s8-s13}  \n\t"
              :
              : "r" (p), "r" (&dt), "r" (&drag)
              : "r0", "cc", "memory"
          );
      }
  • 132. VFP asm
      void* p = &s_particles[0];
      asm volatile (
          "fldmias %1, {s0}       \n\t"
          "fldmias %2, {s1}       \n\t"
          "mov r1, %0             \n\t"
          "mov r2, %0             \n\t"
          "mov r3, %3             \n\t"
          "0:                     \n\t"
          "fldmias r1!, {s8-s13}  \n\t"
          "fmacs s8, s11, s0      \n\t"
          "fmuls s11, s11, s1     \n\t"
          "fstmias r2!, {s8-s13}  \n\t"
          "subs r3, r3, #1        \n\t"
          "bne 0b                 \n\t"
          :
          : "r" (p), "r" (&dt), "r" (&drag), "r" (MaxParticles)
          : "r0", "r1", "r2", "r3", "cc", "memory"
      );
      Was: 2.7 seconds
      Now: 2.7 seconds
  • 133. VFP asm
      We can do 8 operations at once. So let’s try doing two particles in a single operation.
      struct Particle2
      {
          float x0, y0, z0;
          float x1, y1, z1;
          float vx0, vy0, vz0;
          float vx1, vy1, vz1;
      };
  • 136. VFP asm
      void* p = &s_particles2[0];
      asm volatile (
          "fldmias %1, {s0}        \n\t"
          "fldmias %2, {s1}        \n\t"
          "mov r1, %0              \n\t"
          "mov r2, %0              \n\t"
          "mov r3, %3              \n\t"
          "0:                      \n\t"
          "fldmias r1!, {s8-s13}   \n\t"
          "fldmias r1!, {s16-s21}  \n\t"
          "fmacs s8, s16, s0       \n\t"
          "fmuls s16, s16, s1      \n\t"
          "fstmias r2!, {s8-s13}   \n\t"
          "fstmias r2!, {s16-s21}  \n\t"
          "subs r3, r3, #1         \n\t"
          "bne 0b                  \n\t"
          :
          : "r" (p), "r" (&dt), "r" (&drag), "r" (iterations)
          : "r0", "r1", "r2", "r3", "cc", "memory"
      );
      Was: 2.77 seconds
      Now: 2.67 seconds
  • 139. VFP asm
      What’s the loop/cache overhead?
      for (int i=0; i<MaxParticles; ++i)
      {
          Particle* p = &s_particles[i];
          p->x = p->vx;
          p->y = p->vy;
          p->z = p->vz;
      }
      Was: 2.67 seconds
      Now: 2.41 seconds!!!!
  • 145. Matrix multiply
      Straight from vfpmathlib
      Touch: 0.0379 s
      Normal: 0.0968 s
      VFP: 0.0422 s
      About 2x faster!
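The "Normal" figure above is a plain compiled matrix multiply, whose source is not on the slides. A sketch of such a baseline is shown here, assuming 4x4 matrices stored as 16 floats in column-major order (the layout the vfpmathlib-style assembly earlier appears to use, loading four floats per column). `matrix4_multiply` is a hypothetical name.

```cpp
// out = a * b for 4x4 column-major matrices stored as 16 floats.
// Element (row, col) lives at index col * 4 + row.
// out must not alias a or b.
inline void matrix4_multiply(const float* a, const float* b, float* out) {
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a[k * 4 + row] * b[col * 4 + k];
            }
            out[col * 4 + row] = sum;
        }
    }
}
```

Each output column is a sum of four scaled columns of `a`, which is why the VFP version maps so well onto one `fmuls` followed by three `fmacs` with a vector length of 4.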
  • 153. Good use of vfp
      Something with lots of fp operations in a row
      • Matrix operations
      • Particle systems
      • Skinning
      • Physics
      • Procedural content generation
      • ....
  • 158. What about the 3GS? (times in seconds)
                  3G      3GS
      Thumb       14.1    14.5
      Normal      5.14    4.76
      VFP1        2.77    4.53
      VFP2        2.77    4.26
      VFP3        2.66    3.57
      Touch       2.41    0.42
  • 163. Matrix multiply on 3GS (times in ms)
                  3G      3GS
      Normal      96      82
      VFP1        42      90
      VFP2        42      75
      Touch       38      19
  • 164. VFP resources
      • ARM and VFP reference in your USB drive
      • http://code.google.com/p/vfpmathlibrary
      • http://aleiby.blogspot.com/2008/12/iphone-vfp-for-n00bs.html
      • http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
  • 169. More 3GS: NEON
      • SIMD coprocessor
      • Floating point and integer
      • Huge potential
      • Not many examples yet
  • 173. NEON resources
      • Cortex A8 reference in USB drive
      • http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
      • http://code.google.com/p/oolongengine/source/browse/trunk/Oolong+Engine2/Math/neonmath
  • 178. Conclusions
      • Turn Thumb mode off NOW
      • Expect to get about 2x performance on older hardware by using vfp
      • Not much difference on the 3GS (but it’s fast already)
      • NEON SIMD tech still unused. Research that and be the first one with the killer 3GS app!
  • 179. Thank you! Noel Llopis Snappy Touch http://twitter.com/snappytouch noel@snappytouch.com http://gamesfromwithin.com