Optimizing Games for Mobiles

by Dmytro Vovk

Mobile GPU architectures
• There are 3 major mobile GPU architectures
on the market:
• IMR (Immediate Mode Renderer)
• TBR (Tile Based Renderer)
• TBDR (Tile Based Deferred Renderer)

IMR
• Renders anything sent to the GPU
immediately. It makes no assumption about
what is going to be submitted next.
• The application has to sort opaque geometry front
to back.
• It’s basically a brute-force approach.
• Nvidia, AMD.

TBR
• Improves on IMR, but is still an IMR.
• Bandwidth is a precious resource on mobiles,
and TBR tries to reduce data transfers as much
as possible.
• Your geometry is split into tiles and then
processed per tile. Each tile has a small amount of
memory for the colour and depth/stencil buffers, so
there is no need for transfers from/to
system memory.
• Qualcomm Adreno, ARM Mali

TBDR
• It is deferred, i.e. all the graphics are drawn
at some later point.
• And this is where all the magic happens!
• The GPU is aware of context: it knows what is
going to be drawn in the future, and this allows it
to employ some awesome optimisations that
reduce power consumption, bandwidth and
fillrate.
• Imagination PowerVR.

What you might know
• Batch, Batch, Batch!
http://ce.u-sys.org/Veranstaltungen/Interaktive%20Computergraphik%20(Stamminger)/papers/BatchBatchBatch.pdf
• Render from one thread only
• Avoid synchronisations:
1. glFlush/glFinish;
2. Querying GL states;
3. Accessing render targets;

What you might know
• PowerVR performs pixel-perfect HSR (Hidden Surface Removal);
Adreno and Mali do not feature this.
• But you still need to sort transparent geometry!
• Avoid alpha test. Use alpha blend
instead.

What you might not know
• HSR still requires vertices to be processed!
• …thus don’t forget to cull your geometry on
CPU!
• Prefer Stencil Test before Scissor.
– Stencil test is performed in hardware on PowerVR
GPUs.
– Stencil mask is stored in fast on-chip memory
– Stencil can be of any form in contrast to the
rectangular Scissor
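
Since HSR does not save any vertex work, the CPU-side culling mentioned above can be as simple as testing a bounding sphere against the six frustum planes. The following is a minimal sketch, not code from the talk: the Vec3 and Plane types, the sphere_visible name and the plane convention (normals pointing into the frustum) are all assumptions made for this example.

```c
/* Hypothetical helper types for this sketch. */
typedef struct { float x, y, z; } Vec3;

/* Plane in the form n.p + d = 0, with the normal pointing INTO the frustum,
 * so points inside the frustum give n.p + d >= 0 for every plane. */
typedef struct { float nx, ny, nz, d; } Plane;

/* Returns 1 if the bounding sphere may be visible, 0 if it lies completely
 * outside at least one plane and the object can be skipped on the CPU. */
static int sphere_visible(const Plane planes[6], Vec3 centre, float radius)
{
    for (int i = 0; i < 6; ++i) {
        float dist = planes[i].nx * centre.x
                   + planes[i].ny * centre.y
                   + planes[i].nz * centre.z
                   + planes[i].d;
        if (dist < -radius)
            return 0; /* entirely behind this plane */
    }
    return 1;
}
```

The test is conservative: a sphere near a frustum corner can be reported as visible although it is not, which only costs a little extra vertex work.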

• Why no alpha test?!
o Alpha test/discard requires the fragment shader to run before visibility for
the current fragment can be determined. This removes the benefits of HSR.
o Even more! If shader code contains discard, then any geometry rendered
with this shader will suffer from the alpha test drawbacks. Even if this keyword
is under a condition, USSE (PVR’s shader engine) assumes that this
condition may be hit.
o Move discard into a separate shader.
o Draw opaque geometry first, then alpha-tested geometry, and alpha-blended geometry at the end.

What you might know
• Bandwidth matters
1. Use constant colour per object, instead of per
vertex
2. Simplify your models. Use smaller data types.
3. Use indexed triangles or non-indexed triangle
strips
4. Use VBO instead of client arrays
5. Use VAO

• VBO allocations are aligned to the 4KB page size.
That means your small buffer for just a
couple of triangles will occupy 4KB of
memory. A large number of small VBOs can
fragment and waste your memory.
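
To make the cost concrete, the arithmetic can be sketched as below. This assumes the driver simply rounds every allocation up to whole 4KB pages, as the slide states; the function names and the VBO_PAGE_SIZE constant are illustrative only.

```c
#include <stddef.h>

/* Page size taken from the slide; an assumption about the driver. */
#define VBO_PAGE_SIZE 4096u

/* Bytes actually occupied by a VBO of `requested` bytes when every
 * allocation is rounded up to a whole number of pages. */
static size_t vbo_committed_bytes(size_t requested)
{
    if (requested == 0)
        return 0;
    return (requested + VBO_PAGE_SIZE - 1) / VBO_PAGE_SIZE * VBO_PAGE_SIZE;
}

/* Bytes lost to padding when `count` buffers of `each` bytes are allocated
 * separately instead of being packed into one shared buffer. */
static size_t vbo_wasted_bytes(size_t count, size_t each)
{
    size_t separate = count * vbo_committed_bytes(each);
    size_t packed = vbo_committed_bytes(count * each);
    return separate - packed;
}
```

For example, 100 separate buffers of 72 bytes each (two triangles with 12-byte positions) occupy 409600 bytes under this model, while one shared buffer holding the same data occupies 8192 bytes.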

• Updating your VBO data each frame:
1. glBufferSubData. If it is used to update a big part of the
original data, it will harm performance. Try to avoid
updating buffers that are currently in use.
2. glBufferData. It’s OK to completely overwrite the original
data. The old data will be orphaned by the driver and new
data storage will be allocated.
3. glMapBuffer with a triple-buffered VBO is the preferred way
to update your data.
• EXT_map_buffer_range (iOS 6+ only), when you need to
update only a subset of a buffer object.

int bufferID = 0; // index of the VBO to write this frame
for (int i = 0; i < 3; ++i) // allocate storage for 3 VBOs only, do not upload any data
{
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
    // bufferSize is an assumed name: the size in bytes of one frame's vertex data
    glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW);
}
//...
// every frame:
glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
// update data here by writing through ptr
glUnmapBufferOES(GL_ARRAY_BUFFER);
bufferID = (bufferID + 1) % 3; // cycle through the 3 buffers

• This scheme will give you the best performance
possible: no blocking of the CPU or GPU, no
redundant memcpy operations and lower CPU load, but
extra memory is used (note that you will not need an
extra temporary buffer to store your data before
sending it to the VBO). This is ideal for dynamic
batching of sprites.
update(1), draw(1), gpuworking(..............)

• Float is the native type of the GPU
• …that means any other type will be converted
to float by USSE
• …resulting in a few additional cycles
• Thus it’s your choice of tradeoff between
bandwidth/storage and additional cycles
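
As an illustration of the storage side of this tradeoff, a component known to lie in [-1, 1] (a normal, for instance) can be stored as a 16-bit signed normalised integer instead of a 32-bit float, halving its size. The helpers below are a sketch with made-up names; the decode formula shown is one common convention, and the exact rule the GPU applies to normalised attributes can differ slightly between GL versions.

```c
#include <stdint.h>

/* Quantise a float in [-1, 1] to a signed normalised 16-bit integer.
 * Values outside the range are clamped. */
static int16_t pack_snorm16(float v)
{
    if (v > 1.0f)
        v = 1.0f;
    if (v < -1.0f)
        v = -1.0f;
    float scaled = v * 32767.0f;
    /* Round to nearest by adding half a step before truncating. */
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* Convert back to float; this is the extra work done on the GPU side. */
static float unpack_snorm16(int16_t s)
{
    return (float)s / 32767.0f;
}
```

The round trip loses at most half a quantisation step, about 1.5e-5, which is usually invisible for normals and colours.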

What you might know
• Use interleaved vertex data
– Align each vertex attribute to a 4-byte boundary

• If you don’t align your data, the driver will do it
for you.
• …resulting in slower performance.
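
One way to keep an interleaved layout honest is to describe it as a struct and let the compiler check the offsets. The layout below is only an example (the Vertex type and its fields are assumptions, not the author's format): every attribute starts on a 4-byte boundary and an explicit padding byte keeps the stride a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Example interleaved vertex; sizeof(Vertex) is the stride that would be
 * passed to glVertexAttribPointer for each attribute. */
typedef struct {
    float   position[3]; /* offset 0,  12 bytes */
    int16_t uv[2];       /* offset 12, 4 bytes  */
    uint8_t colour[4];   /* offset 16, 4 bytes  */
    int8_t  normal[3];   /* offset 20, 3 bytes  */
    int8_t  pad;         /* offset 23, keeps the stride a multiple of 4 */
} Vertex;

/* Compile-time checks: fail the build if any attribute is misaligned. */
_Static_assert(offsetof(Vertex, position) % 4 == 0, "position misaligned");
_Static_assert(offsetof(Vertex, uv) % 4 == 0, "uv misaligned");
_Static_assert(offsetof(Vertex, colour) % 4 == 0, "colour misaligned");
_Static_assert(offsetof(Vertex, normal) % 4 == 0, "normal misaligned");
_Static_assert(sizeof(Vertex) % 4 == 0, "stride is not a multiple of 4");
```

This vertex takes 24 bytes; the same attributes stored as floats throughout would take 48.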

• The PowerVR SGX 5XT GPU series has a vertex
cache for the last 12 vertex indices. Optimise your
indexed geometry for this cache size.
• PowerVR Series 6 (XT) has 16k of vertex cache
• Take a look at optimisers that use Tom
Forsyth’s algorithm
http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
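
Before and after running such an optimiser it helps to measure how cache-friendly an index buffer is. The sketch below counts cache misses (that is, vertex shader runs) for a 12-entry cache. It models the cache as a simple FIFO, which is an assumption made for illustration; the real replacement policy of the hardware is not described in the slides, and the names are made up.

```c
#include <stddef.h>
#include <stdint.h>

#define VCACHE_SIZE 12 /* cache size quoted for SGX 5XT */

/* Returns how many indices miss a FIFO cache of VCACHE_SIZE entries.
 * Dividing the result by the triangle count gives the usual ACMR metric
 * (average cache miss ratio); lower is better. */
static size_t count_vertex_cache_misses(const uint16_t *indices, size_t count)
{
    uint16_t cache[VCACHE_SIZE];
    size_t used = 0;   /* number of valid entries */
    size_t next = 0;   /* slot that will be overwritten next (the oldest) */
    size_t misses = 0;

    for (size_t i = 0; i < count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (!hit) {
            cache[next] = indices[i];
            next = (next + 1) % VCACHE_SIZE;
            if (used < VCACHE_SIZE)
                ++used;
            ++misses;
        }
    }
    return misses;
}
```

A quad drawn as two indexed triangles (0,1,2, 2,1,3) gives 4 misses for 6 indices; an index order that revisits a vertex only after 12 other vertices have passed through the cache pays for that vertex again.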

What you might know
• Split your vertex data into two parts:
1. Static VBO: the one that will never be changed
2. Dynamic VBO: the one that needs to be
updated frequently
• Split your vertex data into several VBOs when several
meshes share the same set of attributes

What you might know
• Bandwidth matters
1. Use lower precision formats - RGBA4444,
RGBA5551
2. Use PVRTC compressed textures
3. Use atlases
4. Use mipmaps. They improve texture cache
efficiency and quality.

• Avoid the RGB8 format: texture data has to be
aligned, so the driver will pad RGB8 to RGBA8.
• Try to replace it with RGB565
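
If the source art is 24-bit RGB, the conversion to RGB565 can be done offline or at load time. A minimal sketch follows (function names are illustrative); it simply drops the low bits of each channel and packs red into the top 5 bits, which matches the layout used with GL_UNSIGNED_SHORT_5_6_5. No dithering is applied, so smooth gradients may show banding.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack one 8-bit-per-channel pixel into 16-bit RGB565
 * (red in bits 15..11, green in bits 10..5, blue in bits 4..0). */
static uint16_t rgb888_to_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Convert a tightly packed RGB8 image (3 bytes per pixel) to RGB565. */
static void convert_rgb888_to_rgb565(const uint8_t *src, uint16_t *dst,
                                     size_t pixels)
{
    for (size_t i = 0; i < pixels; ++i)
        dst[i] = rgb888_to_rgb565(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
}
```

The result uses 2 bytes per pixel instead of the 4 bytes that a padded RGB8 texture would occupy.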

• Why PVRTC?
1. PVRTC provides great compression, resulting in
smaller texture size, improved cache efficiency, saved
bandwidth and decreased power consumption
2. PVRTC stores pixel data in the GPU’s native order, i.e.
BGRA instead of RGBA, in blocks optimised for the
data access pattern.

• It doesn’t matter whether your textures are in
RGBA or BGRA format: the driver will still do
internal processing on the texture data to
improve memory access locality and cache
efficiency.

• On PVR 6 (XT) the driver will reserve memory for both
the texture and its mip map chain, but it will commit
memory only for mip level 0.
• If you decide to generate mip maps, the driver will
commit the pages reserved for the mip chain.
• That is expected.

• On PVR 5/5MP (tested on iOS versions 4 – 7.1.1)
the driver will ALWAYS commit memory for mip maps,
regardless of whether you requested them or
not.
• That means you’ll waste 33% of memory!
• In most cases you don’t need mip maps for 2D
games, but you are forced to pay this overhead.
• That’s too bad for 2D games. However, there is one
workaround: make your textures NPOT (non-power
of two).
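
The 33% figure follows from the size of a full mip chain: each level has a quarter of the texels of the previous one, so the levels below level 0 add up to roughly one third of it. A small sketch of that arithmetic (the function name is made up, and tightly packed levels without any driver padding are assumed):

```c
#include <stddef.h>

/* Total bytes of a texture plus its complete mip chain down to 1x1. */
static size_t mip_chain_bytes(unsigned width, unsigned height,
                              unsigned bytes_per_pixel)
{
    size_t total = 0;
    for (;;) {
        total += (size_t)width * height * bytes_per_pixel;
        if (width == 1 && height == 1)
            break;
        if (width > 1)
            width /= 2;
        if (height > 1)
            height /= 2;
    }
    return total;
}
```

For a 1024x1024 RGBA8 texture, level 0 takes 4194304 bytes and the full chain takes 5592404 bytes, so the extra mip levels cost 1398100 bytes, which is one third of level 0.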

• Luckily, there is one solution to this problem.
• Core OpenGL ES 2.0 doesn’t support mip maps
for NPoT (non-power-of-two) textures, so if
you make your textures NPoT, you will
not pay this memory overhead.

• Interesting notes:
• The glTexImage2D driver implementation has a
function CheckFastPath. When you upload a
PoT texture you’ll hit this fast path. NPoT
textures miss it.
• When you upload a lot of textures your
VRAM gets fragmented, so the driver will
remap memory, i.e. it will create one big
buffer for a few small textures and will move
them to that buffer

• Let’s take a look at the texture upload process.
• The usual way to do this:
1. Load the texture into a temporary buffer in RAM
1.1 Decode the texture if it is stored in a compressed file format
– JPG/PNG
2. Feed this buffer to glTexImage2D
3. Draw!
• Looks simple, but is it the fastest way?

• …NO!
void* buf = malloc(TEXTURE_SIZE); // 4 MB for an RGBA8 1024x1024 texture
LoadTexture(textureName, buf); // application helper (signature assumed): reads and decodes the file into buf
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, buf);
// buf is copied into an internal buffer created by the driver
free(buf); // the buffer can be freed immediately after glTexImage2D
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
// the driver will do some additional work to fully upload the texture the first time it is actually used!
• A lot of redundant work!

• Jedi way to upload textures:
int fileHandle = open(filename, O_RDONLY);
void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); // file mapping
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
// the driver will do some additional work to fully upload the texture the first time it is actually used!
munmap(ptr, TEXTURE_SIZE);
close(fileHandle);
• File mapping does not copy your file data into RAM up front! It
loads the file data page by page as it is accessed.
• Thus we eliminate one redundant copy, dramatically
decrease texture upload time and decrease memory
fragmentation
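
The snippet above omits error handling. A slightly more defensive version of the mapping step could look like the sketch below; it is plain POSIX and makes no GL calls. The MappedFile type and the function names are assumptions for this example, and the file is assumed to contain raw texel data that can be handed to glTexImage2D as is.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for a read-only file mapping. */
typedef struct {
    void  *data;
    size_t size;
} MappedFile;

/* Maps the whole file read-only. Returns 0 on success, -1 on failure. */
static int map_file_readonly(const char *path, MappedFile *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    out->data = p;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Releases the mapping; call this once the texture has been uploaded. */
static void unmap_file(MappedFile *f)
{
    munmap(f->data, f->size);
    f->data = NULL;
    f->size = 0;
}
```

After a successful call, the data pointer can be passed to glTexImage2D in place of a malloc'ed buffer, and unmap_file is called once the upload has been issued.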

• Keep in mind that textures are finally wired only
when they are used for the first time. So draw them
off screen immediately after glTexImage2D;
otherwise it will take too long to render the
first frame, and it will be nearly impossible to
track down the cause.

• NPOT textures work only with the
GL_CLAMP_TO_EDGE wrap mode
• POT textures are preferable; they give you the best
performance possible
• Use NPOT textures with dimensions that are multiples of
32 pixels for best performance
• The driver will pad the data of your NPOT texture to
match the size of the closest POT values.
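
To estimate what this padding costs, each dimension can be rounded up to the next power of two. The helpers below are a sketch that assumes the driver pads upwards and stores texels tightly; the names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two that is >= v (returns 1 for v == 0). */
static uint32_t next_pot(uint32_t v)
{
    if (v == 0)
        return 1;
    --v;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}

/* 1 if the dimension follows the "multiple of 32 pixels" advice above. */
static int is_multiple_of_32(uint32_t v)
{
    return (v & 31u) == 0;
}

/* Bytes occupied if both dimensions are padded up to a power of two. */
static size_t padded_texture_bytes(uint32_t width, uint32_t height,
                                   uint32_t bytes_per_pixel)
{
    return (size_t)next_pot(width) * next_pot(height) * bytes_per_pixel;
}
```

Under these assumptions a 640x480 RGBA8 texture holds 1228800 bytes of texel data but would be padded to 1024x512, that is 2097152 bytes.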

• Prefer OES_texture_half_float over
OES_texture_float
• Texture reads fetch only 32 bits per texel, thus an RGBA float
texture will result in 4 texture reads

• Always use glClear at the beginning of the
frame…
• … and EXT_discard_framebuffer at the end.
• PVR GPUs have a fast on-chip
depth/stencil buffer for each tile. If you forget
to clear/discard the depth buffer, it will be
copied between the on-chip memory and system memory

What you might know
• Prefer multi texturing instead of multiple
passes
• Configure texture parameters before feeding
image data to driver

What you might know
• Be wise with precision hints
• Avoid branching
• Eliminate loops
• Do not use discard. If you have to, place the discard instruction as
early as possible to avoid useless
computations

• Code inside a dynamic branch (where the condition is
a non-constant value) will be executed anyway,
and then its result will be thrown away if the condition is
false

• highp – represents a 32-bit floating point value
• mediump – represents a 16-bit floating point
value in the range [-65520, 65520]
• lowp – 10-bit fixed point values in the range [-2,
2] with a step of 1/256
• Try to give the same precision to all your
operands, because conversion takes some time
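
To get a feel for what lowp can and cannot represent, its behaviour can be emulated on the CPU. The sketch below follows the numbers on the slide (step of 1/256, range starting at -2) and is not a description of the actual hardware format. Note that 10 bits give 1024 codes, so with a step of 1/256 the largest representable value is 2 - 1/256; the function name is made up.

```c
/* Emulates storing a value as lowp: clamp to the representable range and
 * snap to the nearest multiple of 1/256. */
static float quantize_lowp(float v)
{
    const float step = 1.0f / 256.0f;
    const float max_value = 2.0f - step; /* 1024 codes starting at -2 */

    if (v > max_value)
        v = max_value;
    if (v < -2.0f)
        v = -2.0f;

    float scaled = v * 256.0f;
    /* Round to nearest by adding half a step before truncating. */
    int steps = (int)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
    return (float)steps / 256.0f;
}
```

Colours in [0, 1] survive this well, since 8-bit colour has the same step, but a value such as 0.001 collapses to 0 and anything above 2 is clamped, which is why lowp is a poor fit for texture coordinates or positions.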

• On USSE1 (that’s PVR 5) highp values are calculated on a scalar
processor only:
highp vec4 v1, v2;
highp float s1, s2;
v2 = (v1 * s1) * s2;
// the scalar processor executes v1 * s1 (4 operations), and then this result is multiplied by s2 on
// the scalar processor again (4 additional operations)
v2 = v1 * (s1 * s2);
// s1 * s2 is 1 operation on the scalar processor; result * v1 is 4 operations on the scalar processor

What you might know
• Typical CPU found in mobile devices:
1. ARMv7/ARMv8 architecture
2. Cortex AX/Krait/Swift or Cyclone
3. Up to 2300 MHz
4. Up to 8 cores
5. Thumb-2 instruction set

• Baseline ARMv7 has no hardware support for integer
division (it is an optional extension that only some later cores implement)
• VFPv3, VFPv4 FPU
• NEON SIMD engine
• Unaligned access is done in software on Cortex
A8. That means it is a hundred times slower
• Cortex A8 is an in-order CPU. Cortex A9+ are out
of order
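
Two of these points translate directly into coding habits: avoid integer division in hot loops where a shift or mask will do, and never read multi-byte values through a misaligned pointer. The helpers below are a small sketch of both ideas; the names are illustrative and the power-of-two restriction is a deliberate simplification.

```c
#include <stdint.h>
#include <string.h>

/* i modulo n for a power-of-two n, without a division instruction.
 * Handy for ring buffers whose size is chosen to be a power of two. */
static uint32_t wrap_pow2(uint32_t i, uint32_t n)
{
    return i & (n - 1u);
}

/* i divided by 2^shift, again without a division instruction. */
static uint32_t div_pow2(uint32_t i, unsigned shift)
{
    return i >> shift;
}

/* Reads a 32-bit value from an address that may be unaligned.
 * memcpy lets the compiler emit whatever load sequence is safe on the
 * target instead of a single misaligned word load. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Compilers already turn division by a constant into multiplications and shifts, so this matters mostly when the divisor is only known at run time.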

• The Cortex A9+ core has a full VFPv3 FPU, while
Cortex A8 has a VFPLite. That means that float
operations take 1 cycle on A9 and 10 cycles on
A8!

• NEON – 16 registers, each 128 bits wide.
Supports operations on 8-, 16-, 32- and 64-bit
integers and 32-bit float values
• NEON can be used for:
– Software geometry instancing;
– Skinning;
– As a general vertex processor;
– Other, typical, applications for SIMD.
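
As an example of NEON acting as a general vertex processor, the sketch below transforms an array of 4-component points by a column-major 4x4 matrix. It is an illustration rather than the author's code: the function name and data layout are assumptions, and a scalar fallback is included so that the same source also builds on targets without NEON.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* out[i] = m * in[i] for `count` points of 4 floats each.
 * m is a column-major 4x4 matrix; in and out must not overlap. */
static void transform_points(const float m[16], const float *in,
                             float *out, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        const float *v = in + 4 * i;
        float *r = out + 4 * i;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* Accumulate one matrix column per instruction, 4 lanes at a time. */
        float32x4_t acc = vmulq_n_f32(vld1q_f32(m + 0), v[0]);
        acc = vmlaq_n_f32(acc, vld1q_f32(m + 4), v[1]);
        acc = vmlaq_n_f32(acc, vld1q_f32(m + 8), v[2]);
        acc = vmlaq_n_f32(acc, vld1q_f32(m + 12), v[3]);
        vst1q_f32(r, acc);
#else
        /* Scalar fallback computing the same result row by row. */
        for (int row = 0; row < 4; ++row)
            r[row] = m[row] * v[0] + m[4 + row] * v[1]
                   + m[8 + row] * v[2] + m[12 + row] * v[3];
#endif
    }
}
```

The same loop structure works for software instancing (one matrix per instance) and is the inner step of skinning, where several such products are blended by bone weights.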

• There are 3 ways to use the NEON engine in your
code:
1. Intrinsics
1.1 GLKMath
2. Handwritten NEON assembly
3. Autovectorization. Add -mllvm -vectorize -mllvm -bb-vectorize-aligned-only to Other C/C++
Flags in project settings and you are ready to go.

• Intrinsics:

• Assembly:

• Summary:
                     Running time, ms   CPU usage, %
Intrinsics           2764               19
Assembly             3664               20
FPU                  6209               25-28
FPU autovectorized   5028               22-24
• Intrinsics got me a 25% speedup over assembly.
• Note that the speed of code generated from
intrinsics will vary from compiler to compiler.
Modern compilers are really good at this.

• Intrinsics advantages over assembly:
– Higher level code;
– Much simpler;
– No need to manage registers;
– You can vectorize basic blocks and build a
solution for every new problem from these
blocks. With assembly, in contrast, you have to
solve each new problem from scratch;

• Assembly advantages over intrinsics:
– Code generated from intrinsics varies from
compiler to compiler, and the difference in speed
can be really big. Assembly code will
always be the same.

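static inline __attribute__((always_inline)) void Matrix4ByVec4(const
float32x4x4_t* __restrict__ mat, const float32x4_t* __restrict__
vec, float32x4_t* __restrict__ result)
{
    // result = mat * vec, accumulated one matrix column at a time
    (*result) = vmulq_n_f32((*mat).val[0], vgetq_lane_f32(*vec, 0));
    (*result) = vmlaq_n_f32((*result), (*mat).val[1], vgetq_lane_f32(*vec, 1));
    (*result) = vmlaq_n_f32((*result), (*mat).val[2], vgetq_lane_f32(*vec, 2));
    (*result) = vmlaq_n_f32((*result), (*mat).val[3], vgetq_lane_f32(*vec, 3));
}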

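static inline __attribute__((always_inline)) void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1, const float32x4x4_t* __restrict__ m2,
float32x4x4_t* __restrict__ r)
{
#ifdef INTRINSICS
    // column i of the result is m1 multiplied by column i of m2
    for (int i = 0; i < 4; ++i)
    {
        (*r).val[i] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[i], 0));
        (*r).val[i] = vmlaq_n_f32((*r).val[i], (*m1).val[1], vgetq_lane_f32((*m2).val[i], 1));
        (*r).val[i] = vmlaq_n_f32((*r).val[i], (*m1).val[2], vgetq_lane_f32((*m2).val[i], 2));
        (*r).val[i] = vmlaq_n_f32((*r).val[i], (*m1).val[3], vgetq_lane_f32((*m2).val[i], 3));
    }
#endif
}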

__asm__ volatile
(
    "vldmia %6, { q0-q3 }\n\t"
    "vldmia %0, { q8-q11 }\n\t"
    "vmul.f32 q12, q8, d0[0]\n\t"
    "vmul.f32 q13, q8, d2[0]\n\t"
    "vmul.f32 q14, q8, d4[0]\n\t"
    "vmul.f32 q15, q8, d6[0]\n\t"
    "vmla.f32 q12, q9, d0[1]\n\t"
    "vmla.f32 q13, q9, d2[1]\n\t"
    "vmla.f32 q14, q9, d4[1]\n\t"
    "vmla.f32 q15, q9, d6[1]\n\t"
    "vmla.f32 q12, q10, d1[0]\n\t"
    "vmla.f32 q13, q10, d3[0]\n\t"
    "vmla.f32 q14, q10, d5[0]\n\t"
    "vmla.f32 q15, q10, d7[0]\n\t"
    "vmla.f32 q12, q11, d1[1]\n\t"
    "vmla.f32 q13, q11, d3[1]\n\t"
    "vmla.f32 q14, q11, d5[1]\n\t"
    "vmla.f32 q15, q11, d7[1]\n\t"
    "vldmia %1, { q0-q3 }\n\t"
    "vmul.f32 q8, q12, d0[0]\n\t"
    "vmul.f32 q9, q12, d2[0]\n\t"
    "vmul.f32 q10, q12, d4[0]\n\t"
    "vmul.f32 q11, q12, d6[0]\n\t"
    "vmla.f32 q8, q13, d0[1]\n\t"
    "vmla.f32 q8, q14, d1[0]\n\t"
    "vmla.f32 q8, q15, d1[1]\n\t"
    "vmla.f32 q9, q13, d2[1]\n\t"
    "vmla.f32 q9, q14, d3[0]\n\t"
    "vmla.f32 q9, q15, d3[1]\n\t"
    "vmla.f32 q10, q13, d4[1]\n\t"
    "vmla.f32 q10, q14, d5[0]\n\t"
    "vmla.f32 q10, q15, d5[1]\n\t"
    "vmla.f32 q11, q13, d6[1]\n\t"
    "vmla.f32 q11, q14, d7[0]\n\t"
    "vmla.f32 q11, q15, d7[1]\n\t"
    "vstmia %2, { q8 }\n\t"
    "vstmia %3, { q9 }\n\t"  // stores to v2 and v3 were missing on the slide;
    "vstmia %4, { q10 }\n\t" // restored here so all four results are written
    "vstmia %5, { q11 }"
    :
    : "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)
    : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
);

• For a detailed explanation of
intrinsics/assembly see:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491e/CIHJBEFE.html
