Cranking Floating Point Performance Up To 11

Loading...

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

0 comments

Post a comment

    Post a comment
    Embed Video
    Edit your comment Cancel

    Favorites, Groups & Events

    Cranking Floating Point Performance Up To 11 - Presentation Transcript

    1. Cranking Floating Point Performance Up To 11  Noel Llopis Snappy Touch http://twitter.com/snappytouch noel@snappytouch.com http://gamesfromwithin.com
    2. Floating Point Performance
    3. Floating point numbers
    4. Floating point numbers • Representation of rational numbers
    5. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc
    6. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format
    7. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format • Single precision: 32 bits
    8. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format • Single precision: 32 bits • Double precision: 64 bits
    9. Floating point numbers
    10. Floating point numbers
    11. Why floating point performance?
    12. Why floating point performance? • Most games use floating point numbers for most of their calculations
    13. Why floating point performance? • Most games use floating point numbers for most of their calculations • Positions, velocities, physics, etc, etc.
    14. Why floating point performance? • Most games use floating point numbers for most of their calculations • Positions, velocities, physics, etc, etc. • Maybe not so much for regular apps
    15. CPU
    16. CPU • 32-bit RISC ARM 11
    17. CPU • 32-bit RISC ARM 11 • 400-535Mhz
    18. CPU • 32-bit RISC ARM 11 • 400-535Mhz • iPhone 2G/3G and iPod Touch 1st and 2nd gen
    19. CPU (iPhone 3GS)
    20. CPU (iPhone 3GS) • Cortex-A8 600MHz
    21. CPU (iPhone 3GS) • Cortex-A8 600MHz • More advanced architecture
    22. CPU
    23. CPU • No floating point support in the ARM CPU!!!
    24. How about integer math?
    25. How about integer math? • No need to do any floating point operations
    26. How about integer math? • No need to do any floating point operations • Fully supported in the ARM processor
    27. How about integer math? • No need to do any floating point operations • Fully supported in the ARM processor • But...
    28. Integer Divide
    29. Integer Divide
    30. Integer Divide There is no integer divide
    31. Fixed-point arithmetic
    32. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it
    33. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers
    34. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a fixed-point library.
    35. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a fixed-point library. • Performs rational arithmetic with integer values at a reduced range/resolution.
    36. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a fixed-point library. • Performs rational arithmetic with integer values at a reduced range/resolution. • Not so great...
    37. Floating point support
    38. Floating point support • There’s a floating point unit
    39. Floating point support • There’s a floating point unit • Compiled C/C++/ObjC code uses the VFP unit for any floating point operations.
    40. Sample program
    41. Sample program struct Particle { float x, y, z; float vx, vy, vz; };
    42. Sample program struct Particle for (int i=0; i<MaxParticles; ++i) { { float x, y, z; Particle& p = s_particles[i]; float vx, vy, vz; p.x += p.vx*dt; }; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; }
    43. Sample program struct Particle for (int i=0; i<MaxParticles; ++i) { { float x, y, z; Particle& p = s_particles[i]; float vx, vy, vz; p.x += p.vx*dt; }; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; } • 7.2 seconds on an iPod Touch 2nd gen
    44. Floating point support
    45. Floating point support Trust no one!
    46. Floating point support Trust no one! When in doubt, check the assembly generated
    47. Floating point support
    48. Thumb Mode
    49. Thumb Mode
    50. Thumb Mode • CPU has a special thumb mode.
    51. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance.
    52. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance. • No floating point support.
    53. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance. • No floating point support. • Every time there’s an fp operation, it switches out of Thumb, does the fp operation, and switches back on.
    54. Thumb Mode
    55. Thumb Mode • It’s on by default!
    56. Thumb Mode • It’s on by default! • Potentiallyoff. wins turning it HUGE
    57. Thumb Mode • It’s on by default! • Potentiallyoff. wins turning it HUGE
    58. Thumb Mode
    59. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x
    60. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x • Heavy usage of floating point operations though
    61. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x • Heavy usage of floating point operations though • Most games will probably benefit from turning it off (especially 3D games)
    62. 2.6 seconds!
    63. ARM assembly DISCLAIMER:
    64. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
    65. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
    66. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
    67. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!! Z80!!!
    68. ARM assembly
    69. ARM assembly • Hit the docs
    70. ARM assembly • Hit the docs • References included in your USB card
    71. ARM assembly • Hit the docs • References included in your USB card • Or download them from the ARM site
    72. ARM assembly • Hit the docs • References included in your USB card • Or download them from the ARM site • http://bit.ly/arminfo
    73. ARM assembly
    74. ARM assembly • Reading assembly is a very important skill for high-performance programming
    75. ARM assembly • Reading assembly is a very important skill for high-performance programming • Writing is more specialized. Most people don’t need to.
    76. VFP unit
    77. VFP unit A0
    78. VFP unit A0 +
    79. VFP unit A0 + B0
    80. VFP unit A0 + B0 =
    81. VFP unit A0 + B0 = C0
    82. VFP unit A0 + B0 = C0 A1 + B1 = C1
    83. VFP unit A0 A2 + + B0 B2 = = C0 C2 A1 + B1 = C1
    84. VFP unit A0 A2 + + B0 B2 = = C0 C2 A1 A3 + + B1 B3 = = C1 C3
    85. VFP unit
    86. VFP unit A0 A1 A2 A3
    87. VFP unit A0 A1 A2 A3 +
    88. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3
    89. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 =
    90. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 = C0 C1 C2 C3
    91. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 = C0 C1 C2 C3 Sweet! How do we use the vfp?
    92. Like this! "fldmias %2, {s8-s23} nt" "fldmias %1!, {s0-s3} nt" "fmuls s24, s8, s0 nt" "fmacs s24, s12, s1 nt" "fldmias %1!, {s4-s7} nt" "fmacs s24, s16, s2 nt" "fmacs s24, s20, s3 nt" "fstmias %0!, {s24-s27} nt"
    93. Writing vfp assembly
    94. Writing vfp assembly • There are two parts to it
    95. Writing vfp assembly • There are two parts to it • How to write any assembly in gcc
    96. Writing vfp assembly • There are two parts to it • How to write any assembly in gcc • Learning ARM and VPM assembly
    97. vfpmath library
    98. vfpmath library • Already done a lot of work for you
    99. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary
    100. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary • Vector/matrix math
    101. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary • Vector/matrix math • Might not be exactly what you need, but it’s a great starting point
    102. Assembly in gcc • Only use it when targeting the device
    103. Assembly in gcc • Only use it when targeting the device #include <TargetConditionals.h> #if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1) #define USE_VFP #endif
    104. Assembly in gcc • The basics asm (“cmp r2, r1”);
    105. Assembly in gcc • The basics asm (“cmp r2, r1”); http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly- HOWTO.html
    106. Assembly in gcc • Multiple lines asm ( “mov r0, #1000nt” “cmp r2, r1nt” );
    107. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers );
    108. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers ); int src = 19; int dest = 0; asm volatile ( "add %0, %1, #42" : "=r" (dest) : "r" (src) : );
    109. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers ); int src = 19; int dest = 0; %0, %1, etc are the variables in order asm volatile ( "add %0, %1, #42" : "=r" (dest) : "r" (src) : );
    110. Assembly in gcc
    111. Assembly in gcc int src = 19; int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" );
    112. Assembly in gcc int src = 19; int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" ); Clobber register list are registers used by the asm block
    113. Assembly in gcc int src = 19; volatile prevents “optimizations” int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" ); Clobber register list are registers used by the asm block
    114. VFP asm Four banks of 8 32-bit registers each
    115. VFP asm Four banks of 8 32-bit registers each #define VFP_VECTOR_LENGTH(VEC_LENGTH) "fmrx r0, fpscr nt" "bic r0, r0, #0x00370000 nt" "orr r0, r0, #0x000" #VEC_LENGTH "0000 nt" "fmxr fpscr, r0 nt"
    116. VFP asm
    117. VFP asm
    118. VFP asm for (int i=0; i<MaxParticles; ++i) { Particle& p = s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; }
    119. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { { Particle& p = s_particles[i]; Particle* p = &s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; "fldmias %0, {s0-s5} nt" p.vx *= drag; "fldmias %1, {s6-s8} nt" p.vy *= drag; p.vz *= drag; "fldmias %2, {s9-s11} nt" } "fmacs s0, s3, s6 nt" "fmuls s3, s3, s9 nt" "fstmias %0, {s0-s5} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); }
    120. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { { Particle& p = s_particles[i]; Particle* p = &s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; "fldmias %0, {s0-s5} nt" p.vx *= drag; "fldmias %1, {s6-s8} nt" p.vy *= drag; p.vz *= drag; "fldmias %2, {s9-s11} nt" } "fmacs s0, s3, s6 nt" "fmuls s3, s3, s9 nt" Was: 2.6 seconds "fstmias %0, {s0-s5} : "=r" (p) nt" : "r" (p), "r" (dtArray), "r" (dragArray) : ); }
    121. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { { Particle& p = s_particles[i]; Particle* p = &s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; "fldmias %0, {s0-s5} nt" p.vx *= drag; "fldmias %1, {s6-s8} nt" p.vy *= drag; p.vz *= drag; "fldmias %2, {s9-s11} nt" } "fmacs s0, s3, s6 nt" "fmuls s3, s3, s9 nt" Was: 2.6 seconds "fstmias %0, {s0-s5} : "=r" (p) nt" : "r" (p), "r" (dtArray), Now: 1.4 seconds!! "r" (dragArray) : ); }
    122. VFP asm Let’s do 6 operations at once! struct Particle2 { float x0, y0, z0; float x1, y1, z1; float vx0, vy0, vz0; float vx1, vy1, vz1; };
    123. VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} nt" "fldmias %1, {s12-s17} nt" "fldmias %2, {s18-s23} nt" "fmacs s0, s6, s12 nt" "fmuls s6, s6, s18 nt" "fstmias %0, {s0-s11} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); }
    124. VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} nt" "fldmias %1, {s12-s17} nt" "fldmias %2, {s18-s23} nt" "fmacs s0, s6, s12 nt" "fmuls s6, s6, s18 nt" "fstmias %0, {s0-s11} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); } Was: 1.4 seconds
    125. VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} nt" "fldmias %1, {s12-s17} nt" "fldmias %2, {s18-s23} nt" "fmacs s0, s6, s12 nt" "fmuls s6, s6, s18 nt" "fstmias %0, {s0-s11} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); } Was: 1.4 seconds Now: 1.2 seconds
    126. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; }
    127. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; } Was: 1.2 seconds
    128. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; } Was: 1.2 seconds Now: 1.2 seconds!!!!
    129. Matrix multiply
    130. Matrix multiply Straight from vfpmathlib
    131. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s
    132. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s Normal: 0.096855 s
    133. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s Normal: 0.096855 s VFP: 0.042216 s
    134. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s Normal: 0.096855 s VFP: 0.042216 s About 2x faster!
    135. Good use of vfp
    136. Good use of vfp • Matrix operations
    137. Good use of vfp • Matrix operations • Particle systems
    138. Good use of vfp • Matrix operations • Particle systems • Skinning
    139. Good use of vfp • Matrix operations • Particle systems • Skinning • Physics
    140. Good use of vfp • Matrix operations • Particle systems • Skinning • Physics • Procedural content generation
    141. Good use of vfp • Matrix operations • Particle systems • Skinning • Physics • Procedural content generation • ....
    142. What about the 3GS?
    143. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
    144. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
    145. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
    146. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
    147. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
    148. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
    149. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
    150. More 3GS: NEON
    151. More 3GS: NEON • SIMD coprocessor
    152. More 3GS: NEON • SIMD coprocessor • Floating point and integer
    153. More 3GS: NEON • SIMD coprocessor • Floating point and integer • Huge potential
    154. More 3GS: NEON • SIMD coprocessor • Floating point and integer • Huge potential • Very little documentation right now :-(
    155. Thank you! Noel Llopis Snappy Touch http://twitter.com/snappytouch noel@snappytouch.com http://gamesfromwithin.com
    SlideShare Zeitgeist 2009

    + John WilkerJohn Wilker Nominate

    custom

    328 views, 0 favs, 0 embeds more stats

    The iPhone has a surprisingly powerful engine under more

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 328
      • 328 on SlideShare
      • 0 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 4
    Most viewed embeds

    more

    All embeds

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories