PPU Optimisation Lesson

Engineer Learning Session #1
Optimisation Tips for PowerPC

Ben Hanke
April 16th 2009

Established Common Sense
 “90% of execution time is spent in 10% of the code”
 “Programmers are really bad at knowing what really needs
optimizing so should be guided by profiling”
 “Why should I bother optimizing X? It only runs N times per
frame”
 “Processors are really fast these days so it doesn’t matter”
 “The compiler can optimize that better than I can”

Alternative View
 The compiler isn’t always as smart as you think it is.
 Really bad things can happen in the most innocent looking code
because of huge penalties inherent in the architecture.
 A generally sub-optimal code base will nickel-and-dime you for a
big chunk of your frame rate.
 It’s easier to write more efficient code up front than to go and find
it all later with a profiler.

PPU Hardware Threads
 We have two of them running on alternate cycles
 If one stalls, other thread runs
 Not multi-core:
Shared execution units
Shared access to memory and cache
Most registers duplicated
 Ideal usage:
Threads filling in stalls for each other without thrashing cache

PS3 Cache
 Level 1 Instruction Cache
32 KB 2-way set associative, 128 byte cache line
 Level 1 Data Cache
32 KB 4-way set associative, 128 byte cache line
Write-through to L2
 Level 2 Data and Instruction Cache
512 KB 8-way set associative
Write-back
128 byte cache line

Cache Miss Penalties
 L1 cache miss = 40 cycles
 L2 cache miss = ~1000 cycles!
 In other words, random reads from memory are excruciatingly
expensive!
 Reading data with large strides – very bad
 Consider smaller data structures, or group data that will be read
together
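
A minimal sketch of the grouping idea, assuming a hypothetical particle type (none of these names come from the slides). The fields read every frame live in one tightly packed array, so a linear pass uses every byte of each 128 byte cache line it pulls in, while rarely read data sits in a separate array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example types, for illustration only.
struct ParticleHot          // read every frame: 8 bytes, 16 fit in one cache line
{
    float posY;
    float velY;
};

struct ParticleCold         // rarely read: kept out of the hot loop's cache lines
{
    char debugName[ 64 ];
    int  spawnFrame;
};

struct ParticleSystem
{
    std::vector<ParticleHot>  hot;   // contiguous, walked linearly
    std::vector<ParticleCold> cold;  // same index as 'hot'
};

// Walks only the hot array, so no cold data is dragged through the cache.
inline void Integrate( ParticleSystem& ps, float dt )
{
    for ( std::size_t i = 0; i < ps.hot.size(); ++i )
    {
        ps.hot[ i ].posY += ps.hot[ i ].velY * dt;
    }
}
```

Whether the split pays off depends on how often the cold fields really are touched, so it is worth confirming with a profiler.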

Virtual Functions
 What happens when you call a virtual function?
 What does this code do?
virtual void Update() {}
 May touch cache line at vtable address unnecessarily
 Consider batching by type for better iCache pattern
 If you know the type, maybe you don’t need the virtual – save
touching memory to read the function address
 Even better – maybe the data you actually want to manipulate
can be kept close together in memory?
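
A minimal sketch of batching by type, assuming two hypothetical enemy classes that are not in the slides. Each concrete type gets its own contiguous array and a non-virtual Update(), so the loop never reads a vtable and runs the same code over adjacent data.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical enemy types, for illustration only.
struct Grunt
{
    float health;
    void Update() { health -= 1.0f; }   // non-virtual: no vtable read
};

struct Sniper
{
    float aimTime;
    void Update() { aimTime += 0.5f; }  // non-virtual: no vtable read
};

// One contiguous array per concrete type instead of one array of base pointers.
struct EnemyBatches
{
    std::vector<Grunt>  grunts;
    std::vector<Sniper> snipers;

    void UpdateAll()
    {
        // Each loop stays in one function's code and walks memory linearly.
        for ( std::size_t i = 0; i < grunts.size(); ++i )  { grunts[ i ].Update(); }
        for ( std::size_t i = 0; i < snipers.size(); ++i ) { snipers[ i ].Update(); }
    }
};
```

The trade-off is that adding a new type means adding a new array and loop, which is only worthwhile for types that exist in large numbers.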

Spot the Stall

int SlowFunction(int & a, int & b)
{
a = 1;
b = 2;
return a + b;
}

Method 1

Q: When will this work and when won’t it?

inline int FastFunction(int & a, int & b)
{
a = 1;
b = 2;
return a + b;
}

Method 2

int FastFunction(int * __restrict a,
int * __restrict b)
{
*a = 1;
*b = 2;
return *a + *b; // we promise that a != b
}

__restrict Keyword

 __restrict only works with pointers, not references (which
sucks).
 Under strict aliasing rules the compiler only has to assume aliasing
between pointers to compatible types (char pointers can alias anything).

 Can be applied to the implicit this pointer in member functions.
 Put it after the closing parenthesis of the parameter list (where
const would go).
 Stops the compiler worrying that you passed a class data member
to the function.
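
A minimal sketch of __restrict on the implicit this pointer, using a hypothetical Accumulator class. The trailing-qualifier form shown here is the GCC spelling; other compilers may differ, so treat the syntax as an assumption to check against your toolchain.

```cpp
// Hypothetical class, for illustration only.
class Accumulator
{
public:
    Accumulator() : m_Total( 0 ) {}

    // The trailing __restrict qualifies 'this'; the parameter is restrict-
    // qualified too. Together they promise that 'values' never points at
    // m_Total, so the running total does not have to be re-read from memory
    // after every store.
    void AddAll( const int* __restrict values, int count ) __restrict
    {
        for ( int i = 0; i < count; ++i )
        {
            m_Total += values[ i ];
        }
    }

    int Total() const { return m_Total; }

private:
    int m_Total;
};
```

Breaking the promise (for example passing the address of a data member) is undefined behaviour, so this belongs only on functions whose callers you control.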

Load-Hit-Store

 What is it?

 Write followed by read
 PPU store queue

 Average latency 40 cycles

 Snoop bits 52 through 61 (implications?)

True LHS
 Type casting between register files:
float floatValue = (float)intValue;
float posY = vPosition.X();

 Data member as loop counter
while( m_Counter-- ) {}

 Aliasing:
void swap( int & a, int & b ) { int t = a; a = b; b = t; }

Workaround: Data member as loop counter

int counter = m_Counter; // load into register
while( counter-- ) // register will decrement
{
doSomething();
}
m_Counter = 0; // Store to memory just once

Workarounds
 Keep data in the same domain as much as possible
 Reorder your writes and reads to allow space for latency
 Consider using word flags instead of many packed bools
Load flags word into register
Perform logical bitwise operations on flags in register
Store new flags value

e.g. m_Flags = ( initialFlags & ~kCanSeePlayer ) |
newcanSeeFlag | kAngry;
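
A minimal sketch of the flags-word pattern. The flag names follow the slide, but the bit positions, the BaddyFlags holder and the rule that seeing the player also sets kAngry are assumptions made for illustration.

```cpp
#include <cstdint>

// Flag names follow the slides; the bit positions are assumptions.
const std::uint32_t kAlive        = 1u << 0;
const std::uint32_t kCanSeePlayer = 1u << 1;
const std::uint32_t kAngry        = 1u << 2;
const std::uint32_t kWearingHat   = 1u << 3;

// Hypothetical holder for the flags word.
struct BaddyFlags
{
    std::uint32_t m_Flags;

    // canSee must be 0 or 1. Seeing the player also sets kAngry.
    void UpdateVisibility( std::uint32_t canSee )
    {
        const std::uint32_t initialFlags = m_Flags;   // single load
        const std::uint32_t seeMask = 0u - canSee;    // 0 or all ones
        const std::uint32_t newBits = seeMask & ( kCanSeePlayer | kAngry );
        m_Flags = ( initialFlags & ~kCanSeePlayer ) | newBits;  // single store
    }
};
```

All the logic happens on register values between one load and one store, so there is no store followed by a dependent load of a neighbouring byte.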

False LHS Case
 Store queue snooping only compares bits 52 through 61
 So a false LHS occurs if you read from an address while a different
item in the queue has the same value of addr & 0xFFC.
 Writing to and reading from memory 4KB apart
 Writing to and reading from memory on sub-word boundary, e.g.
packed short, byte, bool
 Write a bool then read a nearby one -> ~40 cycle stall

Example
struct BaddyState
{
bool m_bAlive;
bool m_bCanSeePlayer;
bool m_bAngry;
bool m_bWearingHat;
}; // 4 bytes

Where might we stall?
if ( m_bAlive )
{
m_bCanSeePlayer = LineOfSightCheck( this, player );
if ( m_bCanSeePlayer && !m_bAngry )
{
m_bAngry = true;
StartShootingPlayer();
}
}

Workaround
if ( m_bAlive ) // load, compare
{
const bool bAngry = m_bAngry; // load
const bool bCanSeePlayer = LineOfSightCheck( this, player );
m_bCanSeePlayer = bCanSeePlayer; // store
if (bCanSeePlayer && !bAngry ) // compare registers
{
m_bAngry = true; // store
StartShootingPlayer();
}
}

Loop + Singleton Gotcha
What happens here?

for( int i = 0; i < enemyCount; ++i )
{
EnemyManager::Get().DispatchEnemy();
}

Workaround

EnemyManager & enemyManager = EnemyManager::Get();
for( int i = 0; i < enemyCount; ++i )
{
enemyManager.DispatchEnemy();
}

Branch Hints
 Branch mis-prediction can hurt your performance
 24-cycle penalty to flush the instruction queue
 If you know a certain branch is rarely taken, you can use a static
branch hint, e.g.
if ( __builtin_expect( bResult, 0 ) )

 Far better to eliminate the branch!
Use logical bitwise operations to mask results
This is far easier and more applicable in SIMD code
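
A minimal sketch of masking results instead of branching, for scalar integers. SelectMask and MinBranchless are hypothetical names, and whether the compiler really emits branch-free code for them should be checked in the disassembly.

```cpp
#include <cstdint>

// Returns a if cond is true, else b, by building an all-ones or all-zero mask.
inline std::int32_t SelectMask( bool cond, std::int32_t a, std::int32_t b )
{
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>( cond );
    const std::uint32_t ua = static_cast<std::uint32_t>( a );
    const std::uint32_t ub = static_cast<std::uint32_t>( b );
    return static_cast<std::int32_t>( ( ua & mask ) | ( ub & ~mask ) );
}

// Integer min built on the same mask.
inline std::int32_t MinBranchless( std::int32_t a, std::int32_t b )
{
    return SelectMask( a < b, a, b );
}
```

Both inputs are always evaluated, so this only suits cases where computing the unused side is cheap and has no side effects.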

Floating Point Branch Elimination
 __fsel, __fsels – Floating point select

// Branching version
float min( float a, float b )
{
return ( a < b ) ? a : b;
}
// Branch-free version: __fsels( x, y, z ) returns y if x >= 0, else z
float min( float a, float b )
{
return __fsels( a - b, b, a );
}
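
A portable reference for the select semantics, useful for testing off-target. FselReference is a hypothetical stand-in that uses an ordinary conditional, so it documents what the intrinsic computes rather than how fast it runs.

```cpp
// Hypothetical stand-in for the floating point select:
// returns a if x >= 0, otherwise b (including when x is NaN).
inline float FselReference( float x, float a, float b )
{
    return ( x >= 0.0f ) ? a : b;
}

// min and max expressed through the select, mirroring the slide's pattern.
inline float MinViaSelect( float a, float b )
{
    return FselReference( a - b, b, a );   // a >= b picks b, else a
}

inline float MaxViaSelect( float a, float b )
{
    return FselReference( a - b, a, b );   // a >= b picks a, else b
}
```

On target, swapping FselReference for the real intrinsic should keep the results identical for ordinary finite inputs.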

Microcoded Instructions
 Single instruction -> several, fetched from ROM, pipeline bubble
 Common example of one to avoid: shift by a variable amount (shift
count held in a register).
int a = b << c;
 Minimum 11 cycle latency.
 If you know range of values, can be better to switch to a fixed shift!
switch( c )
{
case 1: a = b << 1; break;
default: break;
}

Loop Unrolling
 Why unroll loops?
Fewer branches
Better concurrency, instruction pipelining, hide latency
More opportunities for compiler to optimise
Only works if code can actually be interleaved, e.g. inline functions, no
inter-iteration dependencies
 How many times is enough?
On average about 4 – 6 times works well
Best for loops where num iterations is known up front
 Need to think about spare - iterationCount % unrollCount

Picking up the Spare
 If you can, artificially pad your data with safe values to keep as multiple of
unroll count. In this example, you might process up to 3 dummy items in
the worst case.

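for ( int i = 0; i < numElements; i += 4 )
{
InlineTransformation( pElements[ i+0 ] );
InlineTransformation( pElements[ i+1 ] );
InlineTransformation( pElements[ i+2 ] );
InlineTransformation( pElements[ i+3 ] );
}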

 If you can’t pad, run the unrolled loop up to numElements & ~3 instead
and finish the remaining 0 to 3 items in a second loop.
for ( ; i < numElements; ++i )
{
InlineTransformation( pElements[ i ] );
}

Alternative method (pros and cons - one branch but longer code generated).

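switch( numElements & 3 )
{
case 3: InlineTransformation( pElements[ i+2 ] ); // fall through
case 2: InlineTransformation( pElements[ i+1 ] ); // fall through
case 1: InlineTransformation( pElements[ i+0 ] );
case 0: break;
}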
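
Putting the pieces together, a minimal sketch of a 4x unrolled loop with a second loop picking up the spare. The body of InlineTransformation and the TransformAll wrapper are hypothetical stand-ins; only the loop structure comes from the slides.

```cpp
// Hypothetical stand-in for the per-element work.
inline void InlineTransformation( float& x )
{
    x = x * 2.0f + 1.0f;
}

// Processes every element: blocks of four first, then the 0 to 3 left over.
inline void TransformAll( float* pElements, int numElements )
{
    const int unrolledEnd = numElements & ~3;   // largest multiple of 4
    int i = 0;
    for ( ; i < unrolledEnd; i += 4 )
    {
        InlineTransformation( pElements[ i+0 ] );
        InlineTransformation( pElements[ i+1 ] );
        InlineTransformation( pElements[ i+2 ] );
        InlineTransformation( pElements[ i+3 ] );
    }
    for ( ; i < numElements; ++i )              // the spare
    {
        InlineTransformation( pElements[ i ] );
    }
}
```

Declaring i outside the first loop is what lets the second loop carry on from where the unrolled one stopped.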

Loop unrolled… now what?
 If you unrolled your loop 4 times, you might be able to use SIMD
 Use AltiVec intrinsics – align your data to 16 bytes
 128-bit registers: operate on four 32-bit values in parallel
 Most SIMD instructions have 1 cycle throughput
 Consider using SOA data instead of AOS
AOS: Arrays of interleaved posX, posY, posZ structures
SOA: A structure of arrays for each field dimensioned for all elements
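
A minimal sketch of the two layouts, with hypothetical type and function names. It uses std::vector for brevity, which does not guarantee the 16 byte alignment AltiVec needs, so real SIMD code would assume an aligned allocator instead.

```cpp
#include <cstddef>
#include <vector>

// AOS: one structure per element, fields interleaved in memory.
struct PositionAOS
{
    float posX, posY, posZ;
};

// SOA: one array per field, each dimensioned for all elements.
struct PositionsSOA
{
    std::vector<float> posX;
    std::vector<float> posY;
    std::vector<float> posZ;
};

// Hypothetical helper: converts an AOS array into the SOA layout.
inline PositionsSOA ToSOA( const std::vector<PositionAOS>& aos )
{
    PositionsSOA soa;
    for ( std::size_t i = 0; i < aos.size(); ++i )
    {
        soa.posX.push_back( aos[ i ].posX );
        soa.posY.push_back( aos[ i ].posY );
        soa.posZ.push_back( aos[ i ].posZ );
    }
    return soa;
}

// Touches only posY: four consecutive floats would fill one 128-bit register.
inline void MoveUp( PositionsSOA& soa, float dy )
{
    for ( std::size_t i = 0; i < soa.posY.size(); ++i )
    {
        soa.posY[ i ] += dy;
    }
}
```

With AOS the same update would stride over posX and posZ for every element, so each vector load would carry values the operation never uses.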
