
Optimizing Games for Mobiles

A set of mobile game optimization best practices. This presentation extensively covers the PowerVR series of GPUs from Imagination Technologies and iOS; however, the majority of the recommendations apply to other GPUs and mobile operating systems as well.



  1. Optimising games for mobiles, by Dmytro Vovk
  2. Mobile GPU architectures • There are three major mobile GPU architectures on the market: • IMR (Immediate Mode Renderer) • TBR (Tile Based Renderer) • TBDR (Tile Based Deferred Renderer)
  3. IMR • Renders anything sent to the GPU immediately, making no assumptions about what will be submitted next. • The application has to sort opaque geometry front to back. • It is basically brute force. • Nvidia, AMD.
  4. TBR • Improves on IMR, but is still an immediate-mode renderer at heart. • Bandwidth is a precious resource on mobiles, and TBR tries to reduce data transfers as much as possible. • Your geometry is split into tiles and then processed per tile. Each tile has a small amount of on-chip memory for colour and depth/stencil buffers, so there is no need for transfers from/to system memory. • Qualcomm Adreno, ARM Mali.
  5. TBDR • It is deferred, i.e. all the graphics is drawn somewhat later. • And this is where all the magic happens! • The GPU is aware of context: it knows what is going to be drawn in the future, and this allows it to employ some awesome optimisations and to reduce power consumption, bandwidth and fillrate. • Imagination PowerVR.
  7. What you might know • Batch, Batch, Batch! http://ce.u- amminger)/papers/BatchBatchBatch.pdf • Render from one thread only • Avoid synchronisations: 1. glFlush/glFinish; 2. Querying GL state; 3. Accessing render targets;
  9. What you might know • PowerVR features pixel-perfect HSR (Hidden Surface Removal); Adreno and ARM Mali do not. • But you still need to sort transparent geometry! • Avoid doing alpha test. Use alpha blend instead.
  10. What you might not know • HSR still requires vertices to be processed! • …thus don't forget to cull your geometry on the CPU! • Prefer stencil test over scissor test. – Stencil test is performed in hardware on PowerVR GPUs. – The stencil mask is stored in fast on-chip memory. – A stencil can be of any shape, in contrast to the rectangular scissor.
  11. What you might not know • Why no alpha test?! o Alpha test/discard requires the fragment shader to run before visibility for the current fragment can be determined. This removes the benefits of HSR. o Even more: if the shader code contains discard, then any geometry rendered with this shader will suffer from alpha-test drawbacks. Even if the keyword is behind a condition, USSE (PVR's shader engine) assumes that the condition may be hit. o Move discard into a separate shader. o Draw opaque geometry first, then alpha-tested geometry, and alpha-blended geometry at the end.
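The opaque → alpha-tested → alpha-blended ordering above can be sketched as a sort key. This is a minimal illustration, not code from the deck: the `Draw` record, `pass` encoding and `sort_draws` helper are all hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical draw-call record. 'pass' encodes the slide's ordering:
   0 = opaque, 1 = alpha-tested, 2 = alpha-blended. */
typedef struct { int pass; float depth; } Draw;

/* Opaque and alpha-tested draws go front to back (near first) so HSR /
   early depth rejection can skip hidden fragments; blended draws go back
   to front, which blending needs for correct results. */
static int draw_cmp(const void* a, const void* b) {
    const Draw* x = (const Draw*)a;
    const Draw* y = (const Draw*)b;
    if (x->pass != y->pass) return x->pass - y->pass;
    if (x->pass == 2)  /* blended: far first */
        return (y->depth > x->depth) - (y->depth < x->depth);
    return (x->depth > y->depth) - (x->depth < y->depth);  /* near first */
}

void sort_draws(Draw* draws, size_t count) {
    qsort(draws, count, sizeof(Draw), draw_cmp);
}
```

A real renderer would usually submit the three passes separately rather than sort one mixed list, but the comparator captures the same ordering rule.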
  12. What you might know • Bandwidth matters 1. Use a constant colour per object instead of per vertex 2. Simplify your models. Use smaller data types. 3. Use indexed triangles or non-indexed triangle strips 4. Use VBOs instead of client arrays 5. Use VAOs
  13. What you might not know • VBO allocations are aligned to the 4 KB page size. That means your small buffer for just a couple of triangles will still occupy 4 KB in memory; a large number of small VBOs can fragment and waste your memory.
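A back-of-the-envelope sketch of that rounding, assuming the 4 KB granularity the slide describes (the helper name is mine):

```c
#define VBO_PAGE_SIZE 4096u  /* allocation granularity claimed on the slide */

/* Round a requested VBO size up to the page granularity, i.e. the memory
   the allocation actually occupies. A 72-byte buffer still costs 4 KB. */
unsigned vbo_allocated_bytes(unsigned requested) {
    return (requested + VBO_PAGE_SIZE - 1u) / VBO_PAGE_SIZE * VBO_PAGE_SIZE;
}
```

This is why packing many small meshes into one shared VBO tends to waste less memory than one VBO per mesh.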
  14. What you might not know • Updating your VBO data each frame: 1. glBufferSubData: if it is used to update a big part of the original data, it will harm performance. Try to avoid updates to buffers that are currently in use. 2. glBufferData: it is OK to completely overwrite the original data. The old data will be orphaned by the driver and new data storage will be allocated. 3. glMapBuffer with a triple-buffered VBO is the preferred way to update your data. • EXT_map_buffer_range (iOS 6+ only), when you need to update only a subset of a buffer object.
  15. What you might not know

```c
int bufferID = 0;

// Initialization: allocate storage for the 3 VBOs only, do not upload data yet
for (int i = 0; i < 3; ++i)
{
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
    glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW);
}

// ...

// Each frame: map the current buffer, fill it, then advance to the next one
glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
// update data here
glUnmapBufferOES(GL_ARRAY_BUFFER);

++bufferID;
if (bufferID == 3) // cycling through 3 buffers
{
    bufferID = 0;
}
```
  16. What you might not know • This scheme gives you the best performance possible: no blocking of the CPU or GPU, no redundant memcpy operations, and lower CPU load, at the cost of some extra memory (note that you need no extra temporary buffer to store your data before sending it to the VBO). This is ideal for dynamic batching of sprites.

    update(1), draw(1), gpuworking(..............)
    update(2), draw(2), gpuworking(..............)
    update(3), draw(3), gpuworking(..............)
  17. What you might not know • The float type is native to the GPU • …that means any other type will be converted to float by USSE • …resulting in a few additional cycles • Thus it is your choice of tradeoff between bandwidth/storage and additional cycles
  18. What you might know • Use interleaved vertex data – Align each vertex attribute to a 4-byte boundary
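One way to satisfy both bullets is to let the struct layout do the work. A minimal sketch, assuming a position + colour + texcoord vertex (the `Vertex` layout is my example, not one from the deck):

```c
#include <stddef.h>
#include <stdint.h>

/* One interleaved vertex. Each attribute starts on a 4-byte boundary:
   the 4 colour bytes and the 2x uint16_t texcoords each fill exactly
   one 4-byte slot, so no padding is needed and the stride stays small. */
typedef struct {
    float    pos[3];   /* offset 0,  12 bytes */
    uint8_t  rgba[4];  /* offset 12,  4 bytes (normalized colour)   */
    uint16_t uv[2];    /* offset 16,  4 bytes (e.g. 16-bit texcoords) */
} Vertex;

/* Compile-time checks that the alignment rules actually hold. */
_Static_assert(sizeof(Vertex) == 20,      "no hidden padding expected");
_Static_assert(sizeof(Vertex) % 4 == 0,   "stride must stay 4-byte aligned");
```

With this layout, one `glVertexAttribPointer` call per attribute uses `sizeof(Vertex)` as the stride and `offsetof(Vertex, ...)` as the offset.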
  19. What you might not know • If you don't align your data, the driver will do it for you instead… • …resulting in slower performance.
  20. What you might not know • PowerVR SGX 5XT GPUs have a vertex cache for the last 12 vertex indices. Optimise your indexed geometry for this cache size. • PowerVR Series 6 (XT) has 16k of vertex cache • Take a look at optimisers that use Tom Forsyth's algorithm: s/fast_vert_cache_opt.html
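To see how well an index order fits such a cache, you can count misses against a simple FIFO model. A sketch under stated assumptions: the function name is mine, and a plain FIFO is only an approximation of real post-transform cache behaviour.

```c
#include <stddef.h>

/* Count vertex-cache misses (i.e. vertices that actually get shaded) for
   a given index stream, modelling the cache as a FIFO of 'cache_size'
   recent indices. cache_size must be <= 64 in this sketch; 12 matches
   the SGX 5XT figure on the slide. */
int count_cache_misses(const unsigned short* indices, size_t n, int cache_size) {
    unsigned short cache[64];
    int head = 0, filled = 0, misses = 0;
    for (size_t i = 0; i < n; ++i) {
        int hit = 0;
        for (int c = 0; c < filled; ++c)
            if (cache[c] == indices[i]) { hit = 1; break; }
        if (!hit) {
            ++misses;
            cache[head] = indices[i];      /* FIFO replacement */
            head = (head + 1) % cache_size;
            if (filled < cache_size) ++filled;
        }
    }
    return misses;
}
```

Dividing misses by the triangle count gives the ACMR metric that Forsyth-style optimisers minimise.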
  21. What you might know • Split your vertex data into two parts: 1. A static VBO: the one that will never change 2. A dynamic VBO: the one that needs to be updated frequently • Split your vertex data into a few VBOs when a few meshes share the same set of attributes
  23. What you might know • Bandwidth matters 1. Use lower-precision formats: RGBA4444, RGBA5551 2. Use PVRTC compressed textures 3. Use atlases 4. Use mipmaps. They improve texture cache efficiency and quality.
  24. What you might not know • Avoid the RGB8 format: texture data has to be aligned, so the driver will pad RGB8 to RGBA8. • Try to replace it with RGB565
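Converting to RGB565 is a few shifts per pixel. A minimal sketch (the helper name is mine): 5 bits of red, 6 of green, 5 of blue, packed into one 16-bit value.

```c
#include <stdint.h>

/* Pack an 8-bit-per-channel colour into RGB565. Green keeps the extra
   bit because the eye is most sensitive to it. The low bits of each
   channel are simply truncated; real converters often dither instead. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b) {
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

The resulting buffer is uploaded with `GL_RGB` / `GL_UNSIGNED_SHORT_5_6_5`, at half the size of RGBA8.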
  25. What you might not know • Why PVRTC? 1. PVRTC provides great compression, resulting in smaller texture size, improved caching, saved bandwidth and decreased power consumption 2. PVRTC stores pixel data in the GPU's native order, i.e. BGRA instead of RGBA, in blocks optimised for the data access pattern.
  26. What you might not know • It doesn't matter whether your textures are in RGBA or BGRA format: the driver will still do internal processing on the texture data to improve memory access locality and cache efficiency.
  27. What you might not know • On PVR 6 (XT), the driver will reserve memory for both the texture and its mipmap chain, but it will commit memory only for mip level 0. • If you decide to generate mipmaps, the driver will commit the pages reserved for the mip chain. • That's expected behaviour.
  28. What you might not know • On PVR 5/5MP (tested on iOS 4 – 7.1.1), the driver will ALWAYS commit memory for mipmaps, regardless of whether you requested to create them or not. • That means you'll waste 33% of memory! • In most cases you don't need mipmaps for 2D games, but you are forced to pay this overhead. • That's too bad for 2D games. However, there is one workaround: make your textures NPOT (non-power-of-two).
  29. What you might not know • Luckily, there is a solution to this problem. • Core OpenGL ES 2.0 doesn't support mipmaps for NPOT (non-power-of-two) textures, so if you make your textures NPOT, you will not pay this memory overhead.
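Where the 33% figure comes from: each mip level is a quarter of the previous one, so the chain below level 0 sums to roughly a third of the base. A quick sketch for a square power-of-two texture (the helper name is mine):

```c
/* Total bytes in a full mip chain for a square power-of-two texture with
   edge length 'base' and 'bpp' bytes per pixel. Levels halve each step:
   base^2 * bpp * (1 + 1/4 + 1/16 + ...) ~= base^2 * bpp * 4/3, i.e. the
   mips below level 0 add about 33% on top of the base level. */
unsigned mip_chain_bytes(unsigned base, unsigned bpp) {
    unsigned total = 0;
    for (unsigned s = base; s >= 1; s /= 2)
        total += s * s * bpp;
    return total;
}
```

For a 1024x1024 RGBA8 texture the base level is 4,194,304 bytes and the extra mips cost another 1,398,100 bytes, which is the overhead the previous slide says you always pay on PVR 5.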
  30. What you might not know • Interesting notes: • The glTexImage2D driver implementation has a function CheckFastPath. When you upload a POT texture you hit this fast path; NPOT textures skip it. • When you upload a lot of textures, your VRAM gets fragmented, so the driver will remap memory: it will create one big buffer for a few small textures and move them into that buffer
  31. What you might not know • Let's take a look at the texture upload process. • The usual way to do this: 1. Load the texture into a temporary buffer in RAM 2. Decode the texture if it is stored in a compressed file format (JPG/PNG) 3. Feed this buffer to glTexImage2D 4. Draw! • Looks simple, but is it the fastest way?
  32. What you might not know • …NO!

```c
void* buf = malloc(TEXTURE_SIZE); // 4 MB for an RGBA8 1024x1024 texture
LoadTexture(textureName);         // decode the file contents into buf
glBindTexture(GL_TEXTURE_2D, textureID);
// buf is copied into an internal buffer created by the driver (that's obvious)
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, buf);
free(buf); // the buffer can be freed immediately after glTexImage2D
// the driver will do some additional work to fully upload the texture
// the first time it is actually used!
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
```

• A lot of redundant work!
  33. What you might not know • The Jedi way to upload textures:

```c
int fileHandle = open(filename, O_RDONLY);
// Map the file instead of reading it: pages are loaded lazily on access
void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0);
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, ptr);
// the driver will do some additional work to fully upload the texture
// the first time it is actually used!
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
munmap(ptr, TEXTURE_SIZE);
```

• File mapping does not copy your file data into RAM! The file data is loaded page by page as it is accessed. • Thus we eliminated one redundant copy, dramatically decreased texture upload time and reduced memory fragmentation
  34. What you might not know • Keep in mind that textures are finally wired only when they are used for the first time. So draw them off-screen immediately after glTexImage2D; otherwise the first frame will take too long to render, and it will be nearly impossible to track down the cause.
  35. What you might not know • NPOT textures work only with the GL_CLAMP_TO_EDGE wrap mode • POT textures are preferable; they give you the best performance possible • Use NPOT textures with dimensions that are multiples of 32 pixels for best performance • The driver will pad your NPOT texture data to match the closest POT size.
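The padding rule in the last bullet is just "round each dimension up to the next power of two". A minimal sketch (function name is mine):

```c
#include <stdint.h>

/* Smallest power of two at or above x: the size the driver would pad an
   NPOT texture dimension to, per the slide. next_pot(640) == 1024, so a
   640-wide NPOT texture can silently cost as much as a 1024-wide one. */
uint32_t next_pot(uint32_t x) {
    uint32_t p = 1;
    while (p < x) p <<= 1;
    return p;
}
```

This is why NPOT sizes just above a power of two are the worst case: almost half of the padded storage is wasted.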
  36. What you might not know • Prefer OES_texture_half_float to OES_texture_float • Texture reads fetch only 32 bits per texel, thus an RGBA float texture will result in 4 texture reads
  37. What you might not know • Always use glClear at the beginning of the frame… • …and EXT_discard_framebuffer at the end. • PVR GPUs have a fast on-chip depth/stencil buffer for each tile. If you forget to clear/discard the depth buffer, it will be written back from on-chip memory to system memory
  38. What you might know • Prefer multitexturing to multiple passes • Configure texture parameters before feeding image data to the driver
  40. What you might know • Be wise with precision hints • Avoid branching • Eliminate loops • Avoid discard; if you must use it, place the discard instruction as early as possible to avoid useless computations
  41. What you might not know • Code inside a dynamic branch (where the condition is a non-constant value) will be executed anyway, and its results are then thrown away if the condition is false
  42. What you might not know • highp represents a 32-bit floating point value • mediump represents a 16-bit floating point value in the range [-65520, 65520] • lowp is a 10-bit fixed point value in the range [-2, 2] with a step of 1/256 • Try to give the same precision to all your operands, because conversion takes time
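To get a feel for what the lowp description means, you can emulate it on the CPU. A sketch, assuming the slide's figures (range [-2, 2], step 1/256, i.e. a signed value with 8 fractional bits); the function name is mine and real GPU rounding behaviour may differ:

```c
/* Emulate the slide's lowp format: clamp to [-2, 2], then snap to the
   1/256 grid. Anything finer than 1/256 is lost, which is why lowp is
   fine for colours but unusable for, say, texture coordinates on large
   textures. */
float lowp_quantize(float v) {
    if (v >  2.0f) v =  2.0f;
    if (v < -2.0f) v = -2.0f;
    int fixed = (int)(v * 256.0f);  /* truncate to the representable grid */
    return fixed / 256.0f;
}
```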
  43. What you might not know • highp values are calculated on the scalar processor only on USSE1 (that's PVR 5):

```glsl
highp vec4 v1, v2;
highp float s1, s2;

// The scalar processor executes v1 * s1 (4 operations), and then that
// result is multiplied by s2 on the scalar processor again (4 more
// operations): 8 in total
v2 = (v1 * s1) * s2;

// s1 * s2 is 1 operation on the scalar processor; result * v1 is
// 4 operations: 5 in total
v2 = v1 * (s1 * s2);
```
  45. What you might know • A typical CPU found in mobile devices: 1. ARMv7/ARMv8 architecture 2. Cortex AX/Krait/Swift or Cyclone 3. Up to 2300 MHz 4. Up to 8 cores 5. Thumb-2 instruction set
  46. What you might not know • ARMv7 has no hardware support for integer division • VFPv3/VFPv4 FPU • NEON SIMD engine • Unaligned access is done in software on Cortex A8. That means it is a hundred times slower • Cortex A8 is an in-order CPU; Cortex A9 and later are out-of-order
  47. What you might not know • Cortex A9 and later cores have a full VFPv3 FPU, while Cortex A8 has VFPLite. That means float operations take 1 cycle on A9 and 10 cycles on A8!
  48. What you might not know • NEON: 16 registers, each 128 bits wide. Supports operations on 8-, 16-, 32- and 64-bit integers and 32-bit float values • NEON can be used for: – Software geometry instancing; – Skinning; – As a general vertex processor; – Other typical applications for SIMD.
  49. What you might not know • There are 3 ways to use the NEON engine in your code: 1. Intrinsics (e.g. via GLKMath) 2. Handwritten NEON assembly 3. Autovectorization. Add -mllvm -vectorize -mllvm -bb-vectorize-aligned-only to Other C/C++ Flags in your project settings and you are ready to go.
  50. 50. What you might not know • Intrinsics:
  51. 51. What you might not know • Assembly:
  52. What you might not know • Summary:

                          Running time, ms    CPU usage, %
      Intrinsics                2764               19
      Assembly                  3664               20
      FPU                       6209              25-28
      FPU autovectorized        5028              22-24

  • Intrinsics got me a 25% speedup over assembly. • Note that the speed of code generated from intrinsics will vary from compiler to compiler. Modern compilers are really good at this.
  53. What you might not know • Intrinsics advantages over assembly: – Higher-level code; – Much simpler; – No need to manage registers; – You can vectorize basic blocks and build a solution for every new problem from these blocks, in contrast to assembly, where you have to solve each new problem from scratch.
  54. What you might not know • Assembly advantages over intrinsics: – Code generated from intrinsics varies from compiler to compiler and can give you a really big difference in speed. Assembly code will always perform the same.
  55. What you might not know

```c
__attribute__((always_inline))
void Matrix4ByVec4(const float32x4x4_t* __restrict__ mat,
                   const float32x4_t* __restrict__ vec,
                   float32x4_t* __restrict__ result)
{
    (*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
}
```
  56. What you might not know

```c
__attribute__((always_inline))
void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1,
                      const float32x4x4_t* __restrict__ m2,
                      float32x4x4_t* __restrict__ r)
{
    (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
    (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
    (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
    (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
}
```
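For reference (and for testing a NEON port on any machine), the same column-major 4x4 multiply can be written as portable scalar C. This is my sketch, not from the deck; it assumes matrices stored as 16 floats, column by column, in the OpenGL convention.

```c
/* Portable scalar reference for r = m1 * m2 with column-major 4x4
   matrices: element (row, col) of a matrix m lives at m[col * 4 + row].
   Each result column is a linear combination of m1's columns weighted by
   the corresponding column of m2, which is exactly what the NEON
   vmulq_n/vmlaq_n sequence above computes. */
void matrix4_by_matrix4_ref(const float m1[16], const float m2[16], float r[16])
{
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row) {
            float acc = 0.0f;
            for (int k = 0; k < 4; ++k)
                acc += m1[k * 4 + row] * m2[col * 4 + k];
            r[col * 4 + row] = acc;
        }
}
```

Comparing the NEON output against this reference on a few random matrices is a cheap way to catch lane-index mistakes in the intrinsics version.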
  57. What you might not know

```c
__asm__ volatile
(
    "vldmia %6, { q0-q3 }  \n\t"
    "vldmia %0, { q8-q11 } \n\t"
    "vmul.f32 q12, q8, d0[0]  \n\t"
    "vmul.f32 q13, q8, d2[0]  \n\t"
    "vmul.f32 q14, q8, d4[0]  \n\t"
    "vmul.f32 q15, q8, d6[0]  \n\t"
    "vmla.f32 q12, q9, d0[1]  \n\t"
    "vmla.f32 q13, q9, d2[1]  \n\t"
    "vmla.f32 q14, q9, d4[1]  \n\t"
    "vmla.f32 q15, q9, d6[1]  \n\t"
    "vmla.f32 q12, q10, d1[0] \n\t"
    "vmla.f32 q13, q10, d3[0] \n\t"
    "vmla.f32 q14, q10, d5[0] \n\t"
    "vmla.f32 q15, q10, d7[0] \n\t"
    "vmla.f32 q12, q11, d1[1] \n\t"
    "vmla.f32 q13, q11, d3[1] \n\t"
    "vmla.f32 q14, q11, d5[1] \n\t"
    "vmla.f32 q15, q11, d7[1] \n\t"
    "vldmia %1, { q0-q3 } \n\t"
    "vmul.f32 q8, q12, d0[0]  \n\t"
    "vmul.f32 q9, q12, d2[0]  \n\t"
    "vmul.f32 q10, q12, d4[0] \n\t"
    "vmul.f32 q11, q12, d6[0] \n\t"
    "vmla.f32 q8, q13, d0[1]  \n\t"
    "vmla.f32 q8, q14, d1[0]  \n\t"
    "vmla.f32 q8, q15, d1[1]  \n\t"
    "vmla.f32 q9, q13, d2[1]  \n\t"
    "vmla.f32 q9, q14, d3[0]  \n\t"
    "vmla.f32 q9, q15, d3[1]  \n\t"
    "vmla.f32 q10, q13, d4[1] \n\t"
    "vmla.f32 q10, q14, d5[0] \n\t"
    "vmla.f32 q10, q15, d5[1] \n\t"
    "vmla.f32 q11, q13, d6[1] \n\t"
    "vmla.f32 q11, q14, d7[0] \n\t"
    "vmla.f32 q11, q15, d7[1] \n\t"
    "vstmia %2, { q8 }  \n\t"
    "vstmia %3, { q9 }  \n\t"
    "vstmia %4, { q10 } \n\t"
    "vstmia %5, { q11 }"
    :
    : "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)
    : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
);
```
  58. What you might not know • For a detailed explanation of intrinsics/assembly see: c=/com.arm.doc.dui0491e/CIHJBEFE.html