© Copyright Khronos Group, 2004 - Page 1



  1. 1. The challenge of migration : desktop to handheld Phil Atkin Product Manager 3D Graphics September 2004
  2. 2. Topics <ul><li>Overview </li></ul><ul><ul><li>Definitions </li></ul></ul><ul><ul><ul><li>What does ‘desktop’ mean? </li></ul></ul></ul><ul><ul><ul><li>What does ‘handheld’ mean? </li></ul></ul></ul><ul><ul><li>Challenges </li></ul></ul><ul><ul><ul><li>Management of 3D resources </li></ul></ul></ul><ul><ul><ul><li>Management of CPU resources </li></ul></ul></ul><ul><ul><li>Case study </li></ul></ul><ul><ul><ul><li>Realities of porting a desktop 3D framework to handheld </li></ul></ul></ul><ul><ul><ul><li>Demonstrations (Intel / Intrinsyc Carbonado) </li></ul></ul></ul><ul><ul><ul><li>Performance (PowerBook vs. Carbonado) </li></ul></ul></ul><ul><ul><li>Conclusions </li></ul></ul>
  3. 3. Desktop vs. handheld systems <ul><li>Desktop system </li></ul><ul><ul><li>CPU + GPU + 3D API </li></ul></ul><ul><ul><ul><li>Powerful - 1GHz up to >3GHz CPU with SIMD floating-point </li></ul></ul></ul><ul><ul><ul><li>Big caches </li></ul></ul></ul><ul><ul><ul><li>Minimum ‘Free3D’ chipset </li></ul></ul></ul><ul><ul><ul><li>Maximum GeForce 6800 / Radeon X800 </li></ul></ul></ul><ul><ul><ul><li>OpenGL 1.5 transitioning to OpenGL 2.0 </li></ul></ul></ul><ul><li>Handheld system (PowerVR 3D) </li></ul><ul><ul><li>CPU + GPU + 3D API </li></ul></ul><ul><ul><ul><li>CPU ranges from 100MHz to 500+MHz </li></ul></ul></ul><ul><ul><ul><li>Small caches </li></ul></ul></ul><ul><ul><ul><li>CPU may or may not have FP capability </li></ul></ul></ul><ul><ul><ul><li>Minimum MBX Lite no VGP - 1M tris, 100M pixels </li></ul></ul></ul><ul><ul><ul><li>Maximum MBX VGP - 4M tris, 350M pixels, free AA </li></ul></ul></ul><ul><ul><ul><li>OpenGL ES 1.0 transitioning to OpenGL ES 1.1 </li></ul></ul></ul>
  4. 4. Handheld 3D <ul><ul><li>Delivering accelerated handheld 3D is all about power management </li></ul></ul><ul><ul><li>All chip vendors have access to similar process technologies </li></ul></ul><ul><ul><ul><li>Leads to similar power / MHz </li></ul></ul></ul><ul><ul><ul><li>Leads to similar performance / mW </li></ul></ul></ul><ul><ul><li>All system vendors have access to similar battery technologies </li></ul></ul><ul><ul><ul><li>Leads to similar ‘talk time / game-time’ per recharge </li></ul></ul></ul><ul><ul><li>Some architectures have clear power/performance advantages </li></ul></ul><ul><ul><ul><li>Tile-based rendering, on-die framebuffers - minimize data passing between chips </li></ul></ul></ul><ul><ul><li>These factors lead to a relatively narrow spectrum of capabilities </li></ul></ul><ul><ul><li>Low-end and high-end systems only differ by 3-4x </li></ul></ul><ul><ul><li>Admittedly PowerVR sets a high baseline, but the generalization holds </li></ul></ul>
  5. 5. Observations <ul><li>Even low-end handheld 3D accelerators will offer excellent performance </li></ul><ul><ul><li>On par with 2nd / 3rd generation desktop accelerators </li></ul></ul><ul><ul><li>Efficient API is in place and standardized </li></ul></ul><ul><ul><li>Hence the path from the driver to the hardware is sorted - but … </li></ul></ul><ul><li>What about the path from the application to the driver? </li></ul><ul><ul><li>How to structure application code to keep hardware busy? </li></ul></ul><ul><li>Despite relatively narrow spectrum of 3D capabilities </li></ul><ul><ul><li>Potential for extremely large disparity between systems </li></ul></ul><ul><ul><li>Floating point-less CPU, rasterizer-only 3D </li></ul></ul><ul><ul><li>Very high performance CPU / FPU, vertex-programmable 3D </li></ul></ul><ul><li>How to develop or port with such a spread of computational capabilities? </li></ul>
  6. 6. The challenge <ul><li>Management of 3D capabilities is not the challenge </li></ul><ul><ul><li>The usual techniques learned in the desktop space can be used </li></ul></ul><ul><ul><li>Resolution / triangle count / texture filtering / AA quality </li></ul></ul><ul><li>Management of CPU resources is the challenge </li></ul><ul><ul><li>Lowering vertex counts to GPU will inherently lower CPU load </li></ul></ul><ul><ul><li>But the problem is far bigger in scope than just this </li></ul></ul><ul><ul><li>The data type float is essentially unavailable at the low end </li></ul></ul><ul><li>Platform CPUs have such diverse capabilities - either </li></ul><ul><ul><li>Stratify in software, code explicitly to each market stratum </li></ul></ul><ul><ul><li>Or code in a floating-point agnostic manner </li></ul></ul><ul><li>The latter is achievable and allows a single code base across platforms </li></ul>
  7. 7. Why bother porting to an FPU-less platform? <ul><li>Consider the following 3 likely classes of handheld device </li></ul><ul><ul><li>Class A </li></ul></ul><ul><ul><ul><li>High-performance CPU, FPU, GPU with vertex processing </li></ul></ul></ul><ul><ul><li>Class B </li></ul></ul><ul><ul><ul><li>High-performance CPU, GPU with vertex processing </li></ul></ul></ul><ul><ul><li>Class C </li></ul></ul><ul><ul><ul><li>CPU, rasterizer </li></ul></ul></ul><ul><ul><li>Classes B and C will likely be smaller die, lower cost </li></ul></ul><ul><ul><li>Will likely ship in higher volumes </li></ul></ul><ul><ul><li>If so - </li></ul></ul><ul><ul><ul><li>will offer more revenue opportunities for software vendors </li></ul></ul></ul><ul><ul><ul><li>yet platforms do not have floating-point capability </li></ul></ul></ul><ul><ul><li>But a Class A device may win out </li></ul></ul><ul><ul><li>Software vendors must cover all the bases to guarantee success </li></ul></ul>
  8. 8. Why not just make everything fixed point? <ul><li>Because your desktop platform </li></ul><ul><ul><li>Will be faster in floating-point </li></ul></ul><ul><ul><li>Does not have fixed-point OpenGL ES entrypoints! </li></ul></ul><ul><li>If you really need </li></ul><ul><ul><li>The same code base to run on desktop and handheld </li></ul></ul><ul><ul><li>High performance on all classes of handheld systems </li></ul></ul><ul><li>You need to abstract out your numeric format </li></ul><ul><li>C++ class, build-time switchable from 16.16 to float </li></ul>
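A minimal sketch of the "build-time switchable" number class this slide calls for. The class name `Real`, the macro `REAL_FIXED` and the member names are assumptions made for illustration; the deck does not show the author's actual interface.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical build switch: compile with -DREAL_FIXED to get 16.16 fixed
// point; leave it undefined to get plain float.
class Real {
public:
#ifdef REAL_FIXED
    typedef int32_t Storage;
    static Real fromFloat(float f) { return Real(static_cast<Storage>(f * 65536.0f)); }
    float toFloat() const { return static_cast<float>(raw_) / 65536.0f; }
    Real operator*(Real o) const {
        // Widen to 64 bits so the intermediate product cannot overflow.
        return Real(static_cast<Storage>((static_cast<int64_t>(raw_) * o.raw_) / 65536));
    }
    Real operator/(Real o) const {
        assert(o.raw_ != 0);
        return Real(static_cast<Storage>((static_cast<int64_t>(raw_) * 65536) / o.raw_));
    }
#else
    typedef float Storage;
    static Real fromFloat(float f) { return Real(f); }
    float toFloat() const { return raw_; }
    Real operator*(Real o) const { return Real(raw_ * o.raw_); }
    Real operator/(Real o) const { return Real(raw_ / o.raw_); }
#endif
    Real() : raw_(0) {}
    // Addition, subtraction and comparison are identical in both formats.
    Real operator+(Real o) const { return Real(raw_ + o.raw_); }
    Real operator-(Real o) const { return Real(raw_ - o.raw_); }
    bool operator<(Real o) const { return raw_ < o.raw_; }
    Storage raw() const { return raw_; }

private:
    explicit Real(Storage r) : raw_(r) {}
    Storage raw_;
};
```

Application code written against `Real` compiles unchanged in either mode, so it can be tested on the desktop in both formats before any cross-compilation.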
  9. 9. Porting desktop software - 4 step program <ul><ul><li>Observations </li></ul></ul><ul><ul><ul><li>Debugging on a handheld is no fun </li></ul></ul></ul><ul><ul><ul><li>The porting process needs to be derisked as much as possible </li></ul></ul></ul><ul><ul><ul><li>Strive to get as close as possible to the handheld codebase without leaving the desktop </li></ul></ul></ul><ul><ul><ul><li>Code extremely defensively - make no assumptions regarding performance </li></ul></ul></ul><ul><ul><li>‘Portification’ </li></ul></ul><ul><ul><ul><li>Yes, I know it’s not a real word… </li></ul></ul></ul><ul><ul><ul><li>The process of preparing for the port without actually executing on it </li></ul></ul></ul><ul><ul><li>Step 1 - implement the abstracted real number class </li></ul></ul><ul><ul><li>Step 2 - portify 3D code </li></ul></ul><ul><ul><li>Step 3 - portify application code </li></ul></ul><ul><ul><li>Step 4 - do the port </li></ul></ul>
  10. 10. Step 1 - implement real number class <ul><ul><li>C++ operators for +-*/ and type conversion </li></ul></ul><ul><ul><li>Note ARM does not have a divide instruction </li></ul></ul><ul><ul><ul><li>Recommendation - normalize / reciprocate / multiply / denormalize </li></ul></ul></ul><ul><ul><ul><li>ARM does have a normalize instruction - CLZ </li></ul></ul></ul><ul><ul><li>Functions for common but expensive operations </li></ul></ul><ul><ul><ul><li>E.g. implement your own sqrt and trig </li></ul></ul></ul><ul><ul><ul><li>Why - because you may wish to sidestep glRotate() etc. </li></ul></ul></ul><ul><ul><li>These functions will of course work in fixed or float </li></ul></ul><ul><ul><li>Hence testability on desktop is high and immediate </li></ul></ul>
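The normalize / reciprocate / multiply / denormalize recipe above can be sketched as follows for unsigned 16.16 values. Everything here is an illustrative assumption: `clz32` is a portable loop standing in for the ARM CLZ instruction, the reciprocal uses a linear first guess refined by Newton-Raphson (not necessarily the method the author used), and the function names are hypothetical. The only wide operation is a 32x32 to 64-bit multiply; no divide instruction is needed.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for the ARM CLZ instruction (x must be non-zero).
inline int clz32(uint32_t x) {
    int n = 0;
    while ((x & 0x80000000u) == 0) { x <<= 1; ++n; }
    return n;
}

// Reciprocal of a normalized mantissa m = n / 2^32, with m in [0.5, 1).
// Returns 1/m in Q2.30: linear first guess, then three Newton-Raphson steps.
inline uint32_t recipNormalized(uint32_t n) {
    // r0 = 48/17 - (32/17) * m, with both constants pre-scaled to Q2.30.
    uint32_t r = 3031741621u -
        static_cast<uint32_t>((static_cast<uint64_t>(n) * 2021161080u) >> 32);
    for (int i = 0; i < 3; ++i) {
        // m * r in Q2.30 (close to 1.0)
        uint32_t mr = static_cast<uint32_t>((static_cast<uint64_t>(n) * r) >> 32);
        // 2 - m * r in Q2.30
        uint32_t t = 0x80000000u - mr;
        // r = r * (2 - m * r), back in Q2.30
        r = static_cast<uint32_t>((static_cast<uint64_t>(r) * t) >> 30);
    }
    return r;
}

// Approximate unsigned 16.16 divide a / d.
// The caller must ensure d != 0 and that the quotient fits in 32 bits.
inline uint32_t fixDiv(uint32_t a, uint32_t d) {
    assert(d != 0);
    const int s = clz32(d);                           // normalize
    const uint32_t r = recipNormalized(d << s);       // reciprocate
    const uint64_t p = static_cast<uint64_t>(a) * r;  // multiply
    return static_cast<uint32_t>(p >> (46 - s));      // denormalize
}
```

The result is an approximation that can differ from the exact quotient in the lowest bit or two, which is usually acceptable for graphics work; signed operands would need the sign stripped before and restored after the call.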
  11. 11. Step 2 - portify 3D code <ul><ul><li>Isolate your 3D code if not already done </li></ul></ul><ul><ul><ul><li>Minimize #include <gl/gl.h> </li></ul></ul></ul><ul><ul><li>Modify 3D code so it is OpenGL / OpenGL ES agnostic </li></ul></ul><ul><ul><li>Modify it so it is floating point / fixed point agnostic </li></ul></ul><ul><ul><li>And obviously modify your data too </li></ul></ul><ul><ul><li>Make your world representable by 16.16 </li></ul></ul>
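One way to keep the 3D code API-agnostic is a thin wrapper that is the only place allowed to include a GL header. The sketch below is an assumption about how such a layer might look: the macros `RENDER_BACKEND_GLES` and `RENDER_BACKEND_GL` and the names `Scalar`, `toScalar`, `rendererTranslate` and `placeObject` are all hypothetical. With neither macro defined it falls back to a NULL renderer that only counts calls, which is also handy for timing the application without the GPU.

```cpp
#if defined(RENDER_BACKEND_GLES)
  // OpenGL ES 1.x: use the fixed-point entry point.
  #include <GLES/gl.h>
  typedef GLfixed Scalar;  // 16.16 raw value
  inline Scalar toScalar(float f) { return static_cast<Scalar>(f * 65536.0f); }
  inline void rendererTranslate(Scalar x, Scalar y, Scalar z) { glTranslatex(x, y, z); }
#elif defined(RENDER_BACKEND_GL)
  // Desktop OpenGL: use the float entry point.
  #include <GL/gl.h>
  typedef GLfloat Scalar;
  inline Scalar toScalar(float f) { return f; }
  inline void rendererTranslate(Scalar x, Scalar y, Scalar z) { glTranslatef(x, y, z); }
#else
  // NULL renderer: no GL header at all, just count the calls.
  typedef float Scalar;
  static int g_nullCalls = 0;
  inline Scalar toScalar(float f) { return f; }
  inline void rendererTranslate(Scalar, Scalar, Scalar) { ++g_nullCalls; }
#endif

// Everything above this wrapper is written once, against Scalar only.
inline void placeObject(float x, float y, float z) {
    rendererTranslate(toScalar(x), toScalar(y), toScalar(z));
}
```

Keeping the `#if` blocks in one small module is what makes figures like "only 800 lines can see the GL header" achievable.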
  12. 12. Step 3 - portify application code <ul><ul><li>Work out what maths absolutely must be floating-point </li></ul></ul><ul><ul><li>Replace everything else with real number class </li></ul></ul><ul><ul><li>But be really careful - for example </li></ul></ul><ul><ul><ul><li>Really common case - distance between 2 points - Pythagoras </li></ul></ul></ul><ul><ul><ul><li>Squaring those numbers will blow up for almost all cases </li></ul></ul></ul><ul><ul><ul><li>Code defensively - implement a ‘radius’ function that will not blow up </li></ul></ul></ul><ul><ul><li>OK, you could keep this example as floats </li></ul></ul><ul><ul><ul><li>But floats are so very expensive without FPU </li></ul></ul></ul><ul><ul><ul><li>It’s a common operation, and it’s easy to get it right in fixed-point </li></ul></ul></ul><ul><ul><li>Remember - conservation of CPU cycles is the challenge </li></ul></ul><ul><ul><ul><li>The hardware developers and Khronos have taken care of the 3D </li></ul></ul></ul><ul><ul><ul><li>CPU cycles are precious, conserve them </li></ul></ul></ul>
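To illustrate the Pythagoras hazard: in 16.16 a plain 32-bit square overflows as soon as a coordinate difference exceeds roughly 181 units. The sketch below is one possible "radius" function that does not blow up; it accumulates the squares in 64 bits and takes an integer square root. This is an assumption for illustration (the function names are hypothetical, and the author may have used a different approach, such as pre-scaling the inputs).

```cpp
#include <cstdint>

// Bit-by-bit integer square root of a 64-bit value (result rounds down).
inline uint32_t isqrt64(uint64_t x) {
    uint64_t res = 0;
    uint64_t bit = static_cast<uint64_t>(1) << 62;  // highest power of four
    while (bit > x) bit >>= 2;
    while (bit != 0) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return static_cast<uint32_t>(res);
}

// Square of a raw 16.16 value, held in 64 bits (scaled by 2^32).
inline uint64_t square64(int32_t v) {
    const int64_t w = v;
    return static_cast<uint64_t>(w * w);
}

// Length of the vector (dx, dy, dz), inputs and result in raw 16.16.
// Three squares of 32-bit values always fit in an unsigned 64-bit sum.
inline uint32_t radius(int32_t dx, int32_t dy, int32_t dz) {
    return isqrt64(square64(dx) + square64(dy) + square64(dz));
}
```

For example, a difference of (300, 400, 0) yields exactly 500 here, whereas squaring 300 directly in 16.16 already exceeds the representable range.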
  13. 13. Step 4 - port to the handheld platform <ul><li>This step is really easy if the last 3 went well ... </li></ul><ul><ul><li>Take cross-compiler </li></ul></ul><ul><ul><li>Turn on all the #ifdefs you prepared earlier </li></ul></ul><ul><ul><li>Type ‘make’ </li></ul></ul><ul><ul><li>Or under Embedded Visual C++ hit F7 </li></ul></ul><ul><li>It will just work. Trust me, it will. </li></ul>
  14. 14. Case study - the Mobile Scene Graph <ul><li>Framework for 3D applications </li></ul><ul><ul><li>Initial implementation - desktop </li></ul></ul><ul><ul><ul><li>Interactive landscape, architecture and garden design review </li></ul></ul></ul><ul><ul><ul><li>Straightforward design </li></ul></ul></ul><ul><ul><ul><ul><li>Classic app + cull + draw, frustum culling </li></ul></ul></ul></ul><ul><ul><ul><ul><li>C++, STL, polymorphic, RTTI </li></ul></ul></ul></ul><ul><ul><ul><li>Target platform PowerBook G3 500MHz / OpenGL / glut </li></ul></ul></ul><ul><ul><li>Transitioned into </li></ul></ul><ul><ul><ul><li>Desktop - interactive landscape, architecture and garden design review </li></ul></ul></ul><ul><ul><ul><li>Handheld - experimental testbed for OpenGL ES rendering </li></ul></ul></ul><ul><ul><ul><li>Target platforms </li></ul></ul></ul><ul><ul><ul><ul><li>PowerBook G3 500MHz / OpenGL 1.4 / glut </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Intel / Intrinsyc Carbonado / OpenGL ES 1.0 / egl </li></ul></ul></ul></ul><ul><ul><li>Great opportunity to take on a port </li></ul></ul><ul><ul><ul><li>Aiming for 100% application source code compatibility </li></ul></ul></ul><ul><ul><ul><li>Aiming to deliver highest possible performance on desktop and handheld </li></ul></ul></ul>
  15. 15. MSG Implementation details <ul><ul><li>‘MSGReal’ </li></ul></ul><ul><ul><ul><li>Build-time switchable float or OpenGL ES 16.16 fixed point </li></ul></ul></ul><ul><ul><ul><li>C++ operators provide +-*/ and common type conversions </li></ul></ul></ul><ul><ul><ul><li>Functions provide trig, sqrt / recipsqrt </li></ul></ul></ul><ul><ul><ul><li>All expensive operations implemented by piecewise quadratics </li></ul></ul></ul><ul><ul><li>Additional 4.12 ‘MSGShortFix’ type </li></ul></ul><ul><ul><ul><li>Intermediate product fits into 32 bits, no double-length maths </li></ul></ul></ul><ul><ul><ul><li>Superbright unclamped colour accumulation </li></ul></ul></ul><ul><ul><ul><li>Reflection-mapping via quadratic approximation without overflow </li></ul></ul></ul><ul><ul><li>Only 2 internal functions use floating-point </li></ul></ul><ul><ul><ul><li>Plane fitter for frustum construction </li></ul></ul></ul><ul><ul><ul><li>Determinant calculation in matrix inverter </li></ul></ul></ul>
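A rough sketch of why a 4.12 type avoids double-length maths: two 16-bit operands multiply into at most 32 bits, so no 64-bit intermediate is needed. The struct and function names below are hypothetical and are not the MSGShortFix interface, which the deck does not show.

```cpp
#include <cstdint>

// Hypothetical 4.12 value: 16 bits, 12 fractional, range [-8, 8).
struct ShortFix {
    int16_t raw;
};

inline ShortFix shortFixFromFloat(float f) {
    ShortFix r;
    r.raw = static_cast<int16_t>(f * 4096.0f);
    return r;
}

inline float shortFixToFloat(ShortFix v) {
    return static_cast<float>(v.raw) / 4096.0f;
}

// 16 x 16 -> 32-bit product; a single 32-bit multiply is enough.
// The caller must keep the result inside [-8, 8).
inline ShortFix shortFixMul(ShortFix a, ShortFix b) {
    const int32_t p = static_cast<int32_t>(a.raw) * b.raw;
    ShortFix r;
    r.raw = static_cast<int16_t>(p / 4096);
    return r;
}

// Unclamped add: colour sums may exceed 1.0 ("superbright") as long as
// they stay below 8.0.
inline ShortFix shortFixAdd(ShortFix a, ShortFix b) {
    ShortFix r;
    r.raw = static_cast<int16_t>(a.raw + b.raw);
    return r;
}
```

The three integer bits of headroom are what allow several light contributions to be summed past 1.0 and clamped only once at the end.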
  16. 16. Porting realities - timescales <ul><li>Approximately 3 man-months of portification </li></ul><ul><ul><li>Difficult to measure accurately </li></ul></ul><ul><ul><li>Coding was in progress as portification began </li></ul></ul><ul><li>Approximately 20,000 lines of code </li></ul><ul><ul><li>Only 800 lines can see <gl/gl.h> </li></ul></ul><ul><ul><li>Just 8 #ifdefs in this module </li></ul></ul><ul><ul><li>i.e. if this is representative, the portification process is manageable </li></ul></ul><ul><li>2 evening porting sessions </li></ul><ul><ul><li>Just 6 hours at the desk from ‘move code onto PC’ to ‘run on handheld’ </li></ul></ul><ul><ul><li>… and one evening should have been enough </li></ul></ul><ul><li>Then performance tuning </li></ul><ul><ul><li>Anticipated >30Hz was only 15-20Hz </li></ul></ul><ul><ul><li>Now tuned up to >40Hz with no change in geometric load </li></ul></ul>
  17. 17. Porting realities - gotchas <ul><li>Handheld specific </li></ul><ul><ul><li>Performance not linear with clock for a variety of reasons </li></ul></ul><ul><ul><ul><li>e.g. caching behaviour, driver behaviour, architectural </li></ul></ul></ul><ul><ul><li>Limited container class and template support </li></ul></ul><ul><ul><li>Some C++ operations will hurt more than you expect </li></ul></ul><ul><ul><ul><li>Very slow RTTI </li></ul></ul></ul><ul><ul><ul><li>STL list operations sort(), push_back(), pop_front() proved surprisingly expensive </li></ul></ul></ul><ul><li>3D gotchas </li></ul><ul><ul><li>Unanticipated differences in behaviour </li></ul></ul><ul><ul><ul><li>E.g. multiple strips from single pointer setup – multiple TnL on Carbonado </li></ul></ul></ul><ul><ul><ul><li>Would benefit from glMultiDrawElements </li></ul></ul></ul><ul><ul><li>Short tristrip performance </li></ul></ul><ul><ul><ul><li>Would benefit from glMultiDrawElements!! </li></ul></ul></ul><ul><ul><li>Best performance - glDrawElements(GL_TRIANGLES) </li></ul></ul><ul><ul><li>Fixed-point to integer conversion in OpenGL ES interface </li></ul></ul>
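Since many short strips performed poorly and a single indexed triangle list performed best, one plausible workaround (an assumption, not something the deck describes) is to flatten the strips into one index buffer up front and submit it with a single glDrawElements(GL_TRIANGLES) call. The helper name below is hypothetical; it touches only index data, so it can be tested on the desktop without a GL context.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append the triangles of one triangle strip to an indexed triangle list.
// Every second triangle has its first two indices swapped so that all
// triangles keep the same winding order.
inline void appendStripAsTriangles(const std::vector<uint16_t>& strip,
                                   std::vector<uint16_t>& out) {
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        if (i % 2 == 0) {
            out.push_back(strip[i]);
            out.push_back(strip[i + 1]);
        } else {
            out.push_back(strip[i + 1]);
            out.push_back(strip[i]);
        }
        out.push_back(strip[i + 2]);
    }
}
```

Calling this once per strip at load time yields a single 16-bit index array; the trade-off is a larger index buffer, since a strip of n vertices expands to 3(n - 2) indices.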
  18. 18. Demonstrations <ul><li>MSGRefMap - arithmetic performance test </li></ul><ul><ul><li>Single object, reflection mapped </li></ul></ul><ul><ul><ul><li>Cull time virtually zero </li></ul></ul></ul><ul><ul><ul><li>Virtually all cycles spent in reflection-map code </li></ul></ul></ul><ul><ul><ul><li>This is fixed-point on all platforms </li></ul></ul></ul><ul><ul><ul><li>16-bit skybox textures </li></ul></ul></ul><ul><li>MSGHurricane - frustum-culling test </li></ul><ul><ul><li>2048 objects in hierarchical terrain </li></ul></ul><ul><ul><ul><li>unlit, 8-bit luminance texture </li></ul></ul></ul><ul><ul><li>7 animated aircraft </li></ul></ul><ul><ul><ul><li>lit with 2 lights </li></ul></ul></ul><ul><ul><ul><li>16-bit aircraft texture </li></ul></ul></ul><ul><ul><ul><li>16-bit skybox textures </li></ul></ul></ul>
  19. 19. Performance <ul><li>MSGRefMap </li></ul><ul><ul><li>PowerBook floating point </li></ul></ul><ul><ul><ul><li>OpenGL renderer - 116 Hz </li></ul></ul></ul><ul><ul><ul><li>NULL renderer - 1360 Hz </li></ul></ul></ul><ul><ul><li>PowerBook fixed point </li></ul></ul><ul><ul><ul><li>NULL renderer - 1620 Hz </li></ul></ul></ul><ul><ul><li>Carbonado fixed point </li></ul></ul><ul><ul><ul><li>OpenGL ES renderer - 35.9 Hz </li></ul></ul></ul><ul><ul><ul><li>NULL renderer - 668.4 Hz </li></ul></ul></ul><ul><ul><li>Carbonado floating point </li></ul></ul><ul><ul><ul><li>NULL renderer - 101.2 Hz </li></ul></ul></ul><ul><li>MSGHurricane </li></ul><ul><ul><li>PowerBook floating point </li></ul></ul><ul><ul><ul><li>OpenGL renderer - 122 Hz </li></ul></ul></ul><ul><ul><ul><li>NULL renderer - 1890 Hz </li></ul></ul></ul><ul><ul><li>PowerBook fixed point </li></ul></ul><ul><ul><ul><li>NULL renderer - 960 Hz </li></ul></ul></ul><ul><ul><li>Carbonado fixed point </li></ul></ul><ul><ul><ul><li>OpenGL ES renderer - 34.6 Hz </li></ul></ul></ul><ul><ul><ul><li>NULL renderer - 271.5 Hz </li></ul></ul></ul><ul><ul><li>Carbonado floating point </li></ul></ul><ul><ul><ul><li>NULL renderer - 46.25 Hz </li></ul></ul></ul><ul><ul><li>Fixed-point code averages 6x faster than FP emulation </li></ul></ul><ul><ul><ul><li>Despite data structure traversal and other non-arithmetic code </li></ul></ul></ul><ul><ul><ul><li>Despite fixed point reflection-mapping code in floating point version </li></ul></ul></ul><ul><ul><ul><li>This is a fast CPU, yet it is too slow in FP emulation running MSGHurricane </li></ul></ul></ul>
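The "averages 6x" figure can be checked from the Carbonado NULL-renderer rates listed above, taking the ratio of fixed-point to floating-point frame rate for each demo (MSGRefMap first, then MSGHurricane) and averaging the two:

```latex
\frac{668.4}{101.2} \approx 6.60, \qquad
\frac{271.5}{46.25} \approx 5.87, \qquad
\frac{6.60 + 5.87}{2} \approx 6.2
```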
  20. 20. Last word on performance <ul><li>The missing case - </li></ul><ul><ul><li>Floating point application code </li></ul></ul><ul><ul><li>Fixed point framework / middleware </li></ul></ul><ul><ul><li>Estimated by isolating application cycles on Carbonado </li></ul></ul><ul><ul><ul><li>Time spent in application = 11% of frame time (NULL renderer) </li></ul></ul></ul><ul><ul><li>MSGHurricane </li></ul></ul><ul><ul><ul><li>Fixed point frame time = 0.0037 sec </li></ul></ul></ul><ul><ul><ul><li>Floating point frame time = 0.021 sec </li></ul></ul></ul><ul><ul><ul><li>Mixed-mode frame = (89% * 0.0037) + (11% * 0.021) = 0.011 sec </li></ul></ul></ul><ul><ul><ul><li>Estimated 88Hz mixed-mode rate </li></ul></ul></ul><ul><ul><li>Within 33 ms budget </li></ul></ul><ul><ul><li>But scale processor back to 150MHz and it becomes too slow again </li></ul></ul><ul><ul><li>And this is just a demo - just splines, no physics, no gameplay </li></ul></ul><ul><ul><li>Floating-point emulation is just too slow for even the simplest case </li></ul></ul>
  21. 21. Conclusions <ul><ul><li>The software migration process can be relatively painless </li></ul></ul><ul><ul><li>Source code should be ‘portified’ - i.e. made </li></ul></ul><ul><ul><ul><li>3D API agnostic </li></ul></ul></ul><ul><ul><ul><ul><li>Isolate and encapsulate your 3D API interactions </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Structure desktop code to be OpenGL ES friendly </li></ul></ul></ul></ul><ul><ul><ul><li>Floating point agnostic </li></ul></ul></ul><ul><ul><ul><ul><li>Abstract out your real number format </li></ul></ul></ul></ul><ul><ul><ul><ul><li>At minimum in middleware layer </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Ideally allow fixed-point from application down to hardware </li></ul></ul></ul></ul><ul><ul><li>You can do all this from the safety of your workstation </li></ul></ul><ul><ul><ul><li>No handheld platform debugging until project is mature </li></ul></ul></ul><ul><ul><ul><li>MSG ported to Carbonado in 2 evenings with just printf </li></ul></ul></ul><ul><ul><li>And if you get it right </li></ul></ul><ul><ul><ul><li>It will just port and just work - but may require some tuning </li></ul></ul></ul><ul><ul><ul><li>Performance will be high across platforms </li></ul></ul></ul><ul><ul><ul><li>Resulting software will be highly portable and reusable </li></ul></ul></ul>