0
Upcoming SlideShare
×

# Cranking Floating Point Performance Up To 11

1,578

Published on

The iPhone has a surprisingly powerful engine under that shiny hood when it comes to floating-point computations. This is something that surprises a lot of programmers because by default, things can slow down a lot whenever any floating point numbers are involved. This session will explain the secrets to unlocking maximum performance for floating point calculations, from the mysteries of Thumb mode, to harnessing the full power of the forgotten vector floating point unit. Stay away from this session if he thought of reading or even (gasp!) writing assembly code scares you.

Published in: Technology, News & Politics
1 Like
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

Views
Total Views
1,578
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
29
0
Likes
1
Embeds 0
No embeds

No notes for slide

### Transcript of "Cranking Floating Point Performance Up To 11"

1. 1. Cranking Floating Point Performance Up To 11  Noel Llopis Snappy Touch http://twitter.com/snappytouch noel@snappytouch.com http://gamesfromwithin.com
2. 2. Floating Point Performance
3. 3. Floating point numbers
4. 4. Floating point numbers • Representation of rational numbers
5. 5. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc
6. 6. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format
7. 7. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format • Single precision: 32 bits
8. 8. Floating point numbers • Representation of rational numbers • 1.2345, -0.8374, 2.0000, 14388439.34, etc • Following IEEE 754 format • Single precision: 32 bits • Double precision: 64 bits
9. 9. Floating point numbers
10. 10. Floating point numbers
11. 11. Why ﬂoating point performance?
12. 12. Why ﬂoating point performance? • Most games use ﬂoating point numbers for most of their calculations
13. 13. Why ﬂoating point performance? • Most games use ﬂoating point numbers for most of their calculations • Positions, velocities, physics, etc, etc.
14. 14. Why ﬂoating point performance? • Most games use ﬂoating point numbers for most of their calculations • Positions, velocities, physics, etc, etc. • Maybe not so much for regular apps
15. 15. CPU
16. 16. CPU • 32-bit RISC ARM 11
17. 17. CPU • 32-bit RISC ARM 11 • 400-535Mhz
18. 18. CPU • 32-bit RISC ARM 11 • 400-535Mhz • iPhone 2G/3G and iPod Touch 1st and 2nd gen
19. 19. CPU (iPhone 3GS)
20. 20. CPU (iPhone 3GS) • Cortex-A8 600MHz
21. 21. CPU (iPhone 3GS) • Cortex-A8 600MHz • More advanced architecture
22. 22. CPU
23. 23. CPU • No ﬂoating point support in the ARM CPU!!!
24. 24. How about integer math?
25. 25. How about integer math? • No need to do any ﬂoating point operations
26. 26. How about integer math? • No need to do any ﬂoating point operations • Fully supported in the ARM processor
27. 27. How about integer math? • No need to do any ﬂoating point operations • Fully supported in the ARM processor • But...
28. 28. Integer Divide
29. 29. Integer Divide
30. 30. Integer Divide There is no integer divide
31. 31. Fixed-point arithmetic
32. 32. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it
33. 33. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers
34. 34. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a ﬁxed-point library.
35. 35. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a ﬁxed-point library. • Performs rational arithmetic with integer values at a reduced range/resolution.
36. 36. Fixed-point arithmetic • Sometimes integer arithmetic doesn’t cut it • You need to represent rational numbers • Can use a ﬁxed-point library. • Performs rational arithmetic with integer values at a reduced range/resolution. • Not so great...
37. 37. Floating point support
38. 38. Floating point support • There’s a ﬂoating point unit
39. 39. Floating point support • There’s a ﬂoating point unit • Compiled C/C++/ObjC code uses the VFP unit for any ﬂoating point operations.
40. 40. Sample program
41. 41. Sample program struct Particle { float x, y, z; float vx, vy, vz; };
42. 42. Sample program struct Particle for (int i=0; i<MaxParticles; ++i) { { float x, y, z; Particle& p = s_particles[i]; float vx, vy, vz; p.x += p.vx*dt; }; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; }
43. 43. Sample program struct Particle for (int i=0; i<MaxParticles; ++i) { { float x, y, z; Particle& p = s_particles[i]; float vx, vy, vz; p.x += p.vx*dt; }; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; } • 7.2 seconds on an iPod Touch 2nd gen
44. 44. Floating point support
45. 45. Floating point support Trust no one!
46. 46. Floating point support Trust no one! When in doubt, check the assembly generated
47. 47. Floating point support
48. 48. Thumb Mode
49. 49. Thumb Mode
50. 50. Thumb Mode • CPU has a special thumb mode.
51. 51. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance.
52. 52. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance. • No ﬂoating point support.
53. 53. Thumb Mode • CPU has a special thumb mode. • Less memory, maybe better performance. • No ﬂoating point support. • Every time there’s an fp operation, it switches out of Thumb, does the fp operation, and switches back on.
54. 54. Thumb Mode
55. 55. Thumb Mode • It’s on by default!
56. 56. Thumb Mode • It’s on by default! • Potentiallyoff. wins turning it HUGE
57. 57. Thumb Mode • It’s on by default! • Potentiallyoff. wins turning it HUGE
58. 58. Thumb Mode
59. 59. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x
60. 60. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x • Heavy usage of ﬂoating point operations though
61. 61. Thumb Mode • Turning off Thumb mode increased performance in Flower Garden by over 2x • Heavy usage of ﬂoating point operations though • Most games will probably beneﬁt from turning it off (especially 3D games)
62. 62. 2.6 seconds!
63. 63. ARM assembly DISCLAIMER:
64. 64. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
65. 65. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
66. 66. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!!
67. 67. ARM assembly DISCLAIMER: I’m not an ARM assembly expert!!! Z80!!!
68. 68. ARM assembly
69. 69. ARM assembly • Hit the docs
70. 70. ARM assembly • Hit the docs • References included in your USB card
71. 71. ARM assembly • Hit the docs • References included in your USB card • Or download them from the ARM site
72. 72. ARM assembly • Hit the docs • References included in your USB card • Or download them from the ARM site • http://bit.ly/arminfo
73. 73. ARM assembly
74. 74. ARM assembly • Reading assembly is a very important skill for high-performance programming
75. 75. ARM assembly • Reading assembly is a very important skill for high-performance programming • Writing is more specialized. Most people don’t need to.
76. 76. VFP unit
77. 77. VFP unit A0
78. 78. VFP unit A0 +
79. 79. VFP unit A0 + B0
80. 80. VFP unit A0 + B0 =
81. 81. VFP unit A0 + B0 = C0
82. 82. VFP unit A0 + B0 = C0 A1 + B1 = C1
83. 83. VFP unit A0 A2 + + B0 B2 = = C0 C2 A1 + B1 = C1
84. 84. VFP unit A0 A2 + + B0 B2 = = C0 C2 A1 A3 + + B1 B3 = = C1 C3
85. 85. VFP unit
86. 86. VFP unit A0 A1 A2 A3
87. 87. VFP unit A0 A1 A2 A3 +
88. 88. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3
89. 89. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 =
90. 90. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 = C0 C1 C2 C3
91. 91. VFP unit A0 A1 A2 A3 + B0 B1 B2 B3 = C0 C1 C2 C3 Sweet! How do we use the vfp?
92. 92. Like this! "fldmias %2, {s8-s23} nt" "fldmias %1!, {s0-s3} nt" "fmuls s24, s8, s0 nt" "fmacs s24, s12, s1 nt" "fldmias %1!, {s4-s7} nt" "fmacs s24, s16, s2 nt" "fmacs s24, s20, s3 nt" "fstmias %0!, {s24-s27} nt"
93. 93. Writing vfp assembly
94. 94. Writing vfp assembly • There are two parts to it
95. 95. Writing vfp assembly • There are two parts to it • How to write any assembly in gcc
96. 96. Writing vfp assembly • There are two parts to it • How to write any assembly in gcc • Learning ARM and VPM assembly
97. 97. vfpmath library
98. 98. vfpmath library • Already done a lot of work for you
99. 99. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary
100. 100. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary • Vector/matrix math
101. 101. vfpmath library • Already done a lot of work for you • http://code.google.com/p/vfpmathlibrary • Vector/matrix math • Might not be exactly what you need, but it’s a great starting point
102. 102. Assembly in gcc • Only use it when targeting the device
103. 103. Assembly in gcc • Only use it when targeting the device #include <TargetConditionals.h> #if (TARGET_IPHONE_SIMULATOR == 0) && (TARGET_OS_IPHONE == 1) #define USE_VFP #endif
104. 104. Assembly in gcc • The basics asm (“cmp r2, r1”);
105. 105. Assembly in gcc • The basics asm (“cmp r2, r1”); http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly- HOWTO.html
106. 106. Assembly in gcc • Multiple lines asm ( “mov r0, #1000nt” “cmp r2, r1nt” );
107. 107. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers );
108. 108. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers ); int src = 19; int dest = 0; asm volatile ( "add %0, %1, #42" : "=r" (dest) : "r" (src) : );
109. 109. Assembly in gcc • Accessing C variables asm (//assembly code : // output operands : // input operands : // clobbered registers ); int src = 19; int dest = 0; %0, %1, etc are the variables in order asm volatile ( "add %0, %1, #42" : "=r" (dest) : "r" (src) : );
110. 110. Assembly in gcc
111. 111. Assembly in gcc int src = 19; int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" );
112. 112. Assembly in gcc int src = 19; int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" ); Clobber register list are registers used by the asm block
113. 113. Assembly in gcc int src = 19; volatile prevents “optimizations” int dest = 0; asm volatile ( "add r10, %1, #42nt" "add %0, r10, #33nt" : "=r" (dest) : "r" (src) : "r10" ); Clobber register list are registers used by the asm block
114. 114. VFP asm Four banks of 8 32-bit registers each
115. 115. VFP asm Four banks of 8 32-bit registers each #define VFP_VECTOR_LENGTH(VEC_LENGTH) "fmrx r0, fpscr nt" "bic r0, r0, #0x00370000 nt" "orr r0, r0, #0x000" #VEC_LENGTH "0000 nt" "fmxr fpscr, r0 nt"
116. 116. VFP asm
117. 117. VFP asm
118. 118. VFP asm for (int i=0; i<MaxParticles; ++i) { Particle& p = s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; p.z += p.vz*dt; p.vx *= drag; p.vy *= drag; p.vz *= drag; }
119. 119. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { { Particle& p = s_particles[i]; Particle* p = &s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; "fldmias %0, {s0-s5} nt" p.vx *= drag; "fldmias %1, {s6-s8} nt" p.vy *= drag; p.vz *= drag; "fldmias %2, {s9-s11} nt" } "fmacs s0, s3, s6 nt" "fmuls s3, s3, s9 nt" "fstmias %0, {s0-s5} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); }
120. 120. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { { Particle& p = s_particles[i]; Particle* p = &s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; "fldmias %0, {s0-s5} nt" p.vx *= drag; "fldmias %1, {s6-s8} nt" p.vy *= drag; p.vz *= drag; "fldmias %2, {s9-s11} nt" } "fmacs s0, s3, s6 nt" "fmuls s3, s3, s9 nt" Was: 2.6 seconds "fstmias %0, {s0-s5} : "=r" (p) nt" : "r" (p), "r" (dtArray), "r" (dragArray) : ); }
121. 121. VFP asm for (int i=0; i<MaxParticles; ++i) for (int i=0; i<MaxParticles; ++i) { { Particle& p = s_particles[i]; Particle* p = &s_particles[i]; p.x += p.vx*dt; p.y += p.vy*dt; asm volatile ( p.z += p.vz*dt; "fldmias %0, {s0-s5} nt" p.vx *= drag; "fldmias %1, {s6-s8} nt" p.vy *= drag; p.vz *= drag; "fldmias %2, {s9-s11} nt" } "fmacs s0, s3, s6 nt" "fmuls s3, s3, s9 nt" Was: 2.6 seconds "fstmias %0, {s0-s5} : "=r" (p) nt" : "r" (p), "r" (dtArray), Now: 1.4 seconds!! "r" (dragArray) : ); }
122. 122. VFP asm Let’s do 6 operations at once! struct Particle2 { float x0, y0, z0; float x1, y1, z1; float vx0, vy0, vz0; float vx1, vy1, vz1; };
123. 123. VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} nt" "fldmias %1, {s12-s17} nt" "fldmias %2, {s18-s23} nt" "fmacs s0, s6, s12 nt" "fmuls s6, s6, s18 nt" "fstmias %0, {s0-s11} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); }
124. 124. VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} nt" "fldmias %1, {s12-s17} nt" "fldmias %2, {s18-s23} nt" "fmacs s0, s6, s12 nt" "fmuls s6, s6, s18 nt" "fstmias %0, {s0-s11} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); } Was: 1.4 seconds
125. 125. VFP asm for (int i=0; i<iterations; ++i) { Particle2* p = &s_particles2[i]; asm volatile ( "fldmias %0, {s0-s11} nt" "fldmias %1, {s12-s17} nt" "fldmias %2, {s18-s23} nt" "fmacs s0, s6, s12 nt" "fmuls s6, s6, s18 nt" "fstmias %0, {s0-s11} nt" : "=r" (p) : "r" (p), "r" (dtArray), "r" (dragArray) : ); } Was: 1.4 seconds Now: 1.2 seconds
126. 126. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; }
127. 127. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; } Was: 1.2 seconds
128. 128. VFP asm What’s the loop/cache overhead? for (int i=0; i<MaxParticles; ++i) { Particle* p = &s_particles[i]; p->x = p->vx; p->y = p->vy; p->z = p->vz; } Was: 1.2 seconds Now: 1.2 seconds!!!!
129. 129. Matrix multiply
130. 130. Matrix multiply Straight from vfpmathlib
131. 131. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s
132. 132. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s Normal: 0.096855 s
133. 133. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s Normal: 0.096855 s VFP: 0.042216 s
134. 134. Matrix multiply Straight from vfpmathlib Touch: 0.037919 s Normal: 0.096855 s VFP: 0.042216 s About 2x faster!
135. 135. Good use of vfp
136. 136. Good use of vfp • Matrix operations
137. 137. Good use of vfp • Matrix operations • Particle systems
138. 138. Good use of vfp • Matrix operations • Particle systems • Skinning
139. 139. Good use of vfp • Matrix operations • Particle systems • Skinning • Physics
140. 140. Good use of vfp • Matrix operations • Particle systems • Skinning • Physics • Procedural content generation
141. 141. Good use of vfp • Matrix operations • Particle systems • Skinning • Physics • Procedural content generation • ....
142. 142. What about the 3GS?
143. 143. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
144. 144. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
145. 145. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
146. 146. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
147. 147. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
148. 148. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
149. 149. What about the 3GS? 3G 3GS Thumb 7.2 8.0 Normal 2.6 2.6 VFP1 1.4 1.30 VFP2 1.2 0.64 Touch 1.2 0.18
150. 150. More 3GS: NEON
151. 151. More 3GS: NEON • SIMD coprocessor
152. 152. More 3GS: NEON • SIMD coprocessor • Floating point and integer
153. 153. More 3GS: NEON • SIMD coprocessor • Floating point and integer • Huge potential
154. 154. More 3GS: NEON • SIMD coprocessor • Floating point and integer • Huge potential • Very little documentation right now :-(
155. 155. Thank you! Noel Llopis Snappy Touch http://twitter.com/snappytouch noel@snappytouch.com http://gamesfromwithin.com
1. #### A particular slide catching your eye?

Clipping is a handy way to collect important slides you want to go back to later.