Engineer Learning Session #1: Optimisation Tips for PowerPC, Ben Hanke, April 16th 2009
PowerPC Optimization Tips

PPU Optimisation Lesson

  1. Engineer Learning Session #1: Optimisation Tips for PowerPC
     Ben Hanke, April 16th 2009
  2. PlayStation 3 PPU
  3. Optimized PlayStation 3 PPU
  4. Consoles with PowerPC
  5. Established Common Sense
     • “90% of execution time is spent in 10% of the code”
     • “Programmers are really bad at knowing what really needs optimizing, so they should be guided by profiling”
     • “Why should I bother optimizing X? It only runs N times per frame”
     • “Processors are really fast these days so it doesn’t matter”
     • “The compiler can optimize that better than I can”
  6. Alternative View
     • The compiler isn’t always as smart as you think it is.
     • Really bad things can happen in the most innocent-looking code because of huge penalties inherent in the architecture.
     • A generally sub-optimal code base will nickel-and-dime you for a big chunk of your frame rate.
     • It’s easier to write more efficient code up front than to go and find it all later with a profiler.
  7. PPU Hardware Threads
     • We have two of them, running on alternate cycles
     • If one stalls, the other thread runs
     • Not multi-core:
         • Shared execution units
         • Shared access to memory and cache
         • Most registers duplicated
     • Ideal usage: threads filling in stalls for each other without thrashing the cache
  8. PS3 Cache
     • Level 1 Instruction Cache
         • 32 KB, 2-way set associative, 128-byte cache line
     • Level 1 Data Cache
         • 32 KB, 4-way set associative, 128-byte cache line
         • Write-through to L2
     • Level 2 Data and Instruction Cache
         • 512 KB, 8-way set associative
         • Write-back
         • 128-byte cache line
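The cache figures above imply how many sets each cache has, since sets = capacity / (ways × line size). The sketch below works that arithmetic through. The names `CacheGeometry`, `SetCount` and `SetIndex` are illustrative and not from the slides, "KB" is assumed to mean 1024 bytes, and the simple modulo set indexing is an assumption about how addresses map to sets.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative description of one cache level (not from the slides).
struct CacheGeometry
{
    std::size_t capacityBytes;
    std::size_t ways;
    std::size_t lineBytes;
};

// Number of sets = capacity / (ways * line size).
inline std::size_t SetCount(const CacheGeometry& cache)
{
    assert(cache.ways != 0 && cache.lineBytes != 0);
    return cache.capacityBytes / (cache.ways * cache.lineBytes);
}

// Set an address falls into, assuming plain modulo indexing on the line number.
inline std::size_t SetIndex(const CacheGeometry& cache, std::size_t address)
{
    return (address / cache.lineBytes) % SetCount(cache);
}

// Figures taken from the slide above.
const CacheGeometry kL1Instruction = { 32 * 1024, 2, 128 };
const CacheGeometry kL1Data        = { 32 * 1024, 4, 128 };
const CacheGeometry kL2            = { 512 * 1024, 8, 128 };
```

Under these assumptions the L1 data cache has 64 sets, so addresses 8 KB apart compete for the same four ways. That is one reason large power-of-two strides are unfriendly to the cache.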
  9. Cache Miss Penalties
     • L1 cache miss = 40 cycles
     • L2 cache miss = ~1000 cycles!
     • In other words, random reads from memory are excruciatingly expensive!
     • Reading data with large strides is very bad
     • Consider smaller data structures, or group data that will be read together
  10. Virtual Functions
     • What happens when you call a virtual function?
     • What does this code do?
         virtual void Update() {}
     • May touch the cache line at the vtable address unnecessarily
     • Consider batching by type for a better iCache pattern
     • If you know the type, maybe you don’t need the virtual: save touching memory to read the function address
     • Even better: maybe the data you actually want to manipulate can be kept close together in memory?
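One possible reading of "batch by type" is to store each concrete type in its own contiguous array and update it with a direct, non-virtual call. The sketch below assumes hypothetical `Bullet`, `Particle` and `World` types, none of which appear in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical game objects with plain (non-virtual) Update functions.
struct Bullet
{
    float x;
    float speed;
    void Update(float dt) { x += speed * dt; }
};

struct Particle
{
    float age;
    void Update(float dt) { age += dt; }
};

// Each type lives in its own contiguous array, so one loop runs the same
// code over neighbouring data and no vtable has to be read to find Update.
struct World
{
    std::vector<Bullet>   bullets;
    std::vector<Particle> particles;

    void UpdateAll(float dt)
    {
        assert(dt >= 0.0f);
        for (std::size_t i = 0; i < bullets.size(); ++i)
            bullets[i].Update(dt);
        for (std::size_t i = 0; i < particles.size(); ++i)
            particles[i].Update(dt);
    }
};
```

The trade-off is that adding a type now means adding an array and a loop, so this fits best where the set of hot types is small and known.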
  11. Data Hazards Ahead
  12. Spot the Stall
         int SlowFunction(int & a, int & b)
         {
             a = 1;
             b = 2;
             return a + b;
         }
  13. Method 1
     Q: When will this work and when won’t it?
         inline int FastFunction(int & a, int & b)
         {
             a = 1;
             b = 2;
             return a + b;
         }
  14. Method 2
         int FastFunction(int * __restrict a,
                          int * __restrict b)
         {
             *a = 1;
             *b = 2;
             return *a + *b; // we promise that a != b
         }
  15. __restrict Keyword
     • __restrict only works with pointers, not references (which sucks).
     • Aliasing only applies to identical types.
     • Can be applied to the implicit this pointer in member functions.
         • Put it after the closing parenthesis of the parameter list.
         • Stops the compiler worrying that you passed a class data member to the function.
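As a sketch of the member-function case, the example below uses the GCC/Clang extension syntax, where `__restrict` written after the parameter list qualifies the implicit this pointer. The `Accumulator` class is hypothetical, and the PS3 toolchains may spell or place the qualifier differently.

```cpp
#include <cassert>

// Hypothetical class, used only to show where the qualifier goes.
class Accumulator
{
public:
    Accumulator() : m_Total(0) {}

    // The trailing __restrict applies to `this`: we promise that `values`
    // never points at m_Total, so the compiler is free to keep the running
    // total in a register instead of storing and reloading it each iteration.
    void AddAll(const int* __restrict values, int count) __restrict
    {
        assert(count >= 0);
        for (int i = 0; i < count; ++i)
            m_Total += values[i];
    }

    int Total() const { return m_Total; }

private:
    int m_Total;
};
```

Breaking the promise (for example, passing a pointer to the object's own member) is undefined behaviour, so the qualifier belongs only where the caller can guarantee it.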
  16. Load-Hit-Store
     • What is it?
         • Write followed by read
         • PPU store queue
         • Average latency 40 cycles
         • Snoop bits 52 through 61 (implications?)
  17. True LHS
     • Type casting between register files:
         float floatValue = (float)intValue;
         float posY = vPosition.X();
     • Data member as loop counter:
         while( m_Counter-- ) {}
     • Aliasing:
         void swap( int & a, int & b ) { int t = a; a = b; b = t; }
  18. Workaround: Data member as loop counter
         int counter = m_Counter; // load into register
         while( counter-- ) // register will decrement
         {
             doSomething();
         }
         m_Counter = 0; // Store to memory just once
  19. Workarounds
     • Keep data in the same register domain (integer, float or vector) as much as possible
     • Reorder your writes and reads to allow space for latency
     • Consider using word flags instead of many packed bools
         • Load flags word into register
         • Perform logical bitwise operations on flags in register
         • Store new flags value
       e.g. m_Flags = ( initialFlags & ~kCanSeePlayer ) | newcanSeeFlag | kAngry;
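The load, modify, store pattern for a flags word might look like the sketch below. The flag names mirror the bools used elsewhere in the slides, but the bit values, the `Baddy` struct and the `UpdateBaddy` function are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit assignments for the state flags.
const std::uint32_t kAlive        = 1u << 0;
const std::uint32_t kCanSeePlayer = 1u << 1;
const std::uint32_t kAngry        = 1u << 2;
const std::uint32_t kWearingHat   = 1u << 3;

struct Baddy
{
    std::uint32_t m_Flags;
};

// Returns true if the baddy became angry during this update.
inline bool UpdateBaddy(Baddy& baddy, bool canSeePlayerNow)
{
    const std::uint32_t initialFlags = baddy.m_Flags;  // one load
    if ((initialFlags & kAlive) == 0)
        return false;

    // All decisions below work on values already held in registers.
    const std::uint32_t newCanSeeFlag = canSeePlayerNow ? kCanSeePlayer : 0u;
    const bool becameAngry = canSeePlayerNow && (initialFlags & kAngry) == 0;

    std::uint32_t flags = (initialFlags & ~kCanSeePlayer) | newCanSeeFlag;
    if (becameAngry)
        flags |= kAngry;

    baddy.m_Flags = flags;  // one store
    return becameAngry;
}
```

Because the word is read once and written once, no later read in the function has to wait on an earlier store to the same word.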
  20. False LHS Case
     • Store queue snooping only compares bits 52 through 61
     • So a false LHS occurs if you read from an address while a different item in the queue matches addr & 0xFFC:
         • Writing to and reading from addresses a multiple of 4 KB apart
         • Writing to and reading from memory on a sub-word boundary, e.g. packed short, byte, bool
     • Write a bool then read a nearby one -> ~40 cycle stall
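The comparison described above can be modelled as a check on the masked address bits. This is only a model of the rule as the slide states it, not a verified description of the hardware. The names `kSnoopMask` and `StoreQueueMayMatch` are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// Bits 52 through 61 in IBM (most-significant-first) numbering of a 64-bit
// address correspond to the mask 0xFFC quoted on the slide.
const std::uintptr_t kSnoopMask = 0xFFC;

// True if a load from loadAddress would look like a hit on a queued store
// to storeAddress under the slide's rule, whether or not the two accesses
// really overlap.
inline bool StoreQueueMayMatch(std::uintptr_t storeAddress,
                               std::uintptr_t loadAddress)
{
    return ((storeAddress ^ loadAddress) & kSnoopMask) == 0;
}
```

Two bools packed into the same word match because the low two address bits are ignored, and two addresses exactly 4 KB apart match because bits above the mask are ignored.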
  21. Example
         struct BaddyState
         {
             bool m_bAlive;
             bool m_bCanSeePlayer;
             bool m_bAngry;
             bool m_bWearingHat;
         }; // 4 bytes
  22. Where might we stall?
         if ( m_bAlive )
         {
             m_bCanSeePlayer = LineOfSightCheck( this, player );
             if ( m_bCanSeePlayer && !m_bAngry )
             {
                 m_bAngry = true;
                 StartShootingPlayer();
             }
         }
  23. Workaround
         if ( m_bAlive ) // load, compare
         {
             const bool bAngry = m_bAngry; // load
             const bool bCanSeePlayer = LineOfSightCheck( this, player );
             m_bCanSeePlayer = bCanSeePlayer; // store
             if ( bCanSeePlayer && !bAngry ) // compare registers
             {
                 m_bAngry = true; // store
                 StartShootingPlayer();
             }
         }
  24. Loop + Singleton Gotcha
     What happens here?
         for( int i = 0; i < enemyCount; ++i )
         {
             EnemyManager::Get().DispatchEnemy();
         }
  25. Workaround
         EnemyManager & enemyManager = EnemyManager::Get();
         for( int i = 0; i < enemyCount; ++i )
         {
             enemyManager.DispatchEnemy();
         }
  26. Branch Hints
     • Branch mis-prediction can hurt your performance
     • 24-cycle penalty to flush the instruction queue
     • If you know a certain branch is rarely taken, you can use a static branch hint, e.g.
         if ( __builtin_expect( bResult, 0 ) )
     • Far better to eliminate the branch!
         • Use logical bitwise operations to mask results
         • This is far easier and more applicable in SIMD code
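A scalar sketch of "use logical bitwise operations to mask results" follows: the condition is turned into an all-ones or all-zeros mask and the result is picked with AND/OR rather than an if. `SelectMasked` and `MinUnsigned` are illustrative names, and whether the compiler actually emits branch-free code for the comparison depends on the target and compiler.

```cpp
#include <cassert>
#include <cstdint>

// Returns a when condition is true, b otherwise, with no branch on condition.
inline std::uint32_t SelectMasked(bool condition, std::uint32_t a, std::uint32_t b)
{
    // 0 - 1 wraps to 0xFFFFFFFF (all ones); 0 - 0 stays 0 (all zeros).
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(condition);
    return (a & mask) | (b & ~mask);
}

// Minimum of two unsigned values built on the masked select.
inline std::uint32_t MinUnsigned(std::uint32_t a, std::uint32_t b)
{
    return SelectMasked(a < b, a, b);
}
```

The same idea carries over to SIMD code, where a vector compare produces the mask directly for every lane.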
  27. Floating Point Branch Elimination
     • __fsel, __fsels: floating point select
         // Branching version
         float min( float a, float b )
         {
             return ( a < b ) ? a : b;
         }
         // Branch-free version: __fsels( x, y, z ) yields y when x >= 0, else z
         float min( float a, float b )
         {
             return __fsels( a - b, b, a );
         }
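To make the select semantics concrete without the PowerPC intrinsic, the sketch below uses a portable stand-in, `FselReference`, which returns its second argument when the first is greater than or equal to zero and its third otherwise. The stand-in itself compiles to a branch or a select depending on the compiler, so it only demonstrates the logic. All function names here are invented for the example.

```cpp
#include <cassert>

// Portable reference for the select: y when x >= 0, otherwise z.
inline float FselReference(float x, float y, float z)
{
    return (x >= 0.0f) ? y : z;
}

// a - b >= 0 means b is the smaller (or equal) value.
inline float MinFsel(float a, float b)
{
    return FselReference(a - b, b, a);
}

// a - b >= 0 means a is the larger (or equal) value.
inline float MaxFsel(float a, float b)
{
    return FselReference(a - b, a, b);
}

// Clamp built from the two selects above; assumes lo <= hi.
inline float ClampFsel(float value, float lo, float hi)
{
    assert(lo <= hi);
    return MinFsel(MaxFsel(value, lo), hi);
}
```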
  28. Microcoded Instructions
     • Single instruction -> several, fetched from ROM, pipeline bubble
     • Common example of one to avoid: variable shift (shift amount held in a register).
         int a = b << c;
     • Minimum 11 cycle latency.
     • If you know the range of values, it can be better to switch to a fixed shift!
         switch( c )
         {
             case 1: a = b << 1; break;
             case 2: a = b << 2; break;
             case 3: a = b << 3; break;
             default: break;
         }
  29. Loop Unrolling
     • Why unroll loops?
         • Fewer branches
         • Better concurrency, instruction pipelining, hide latency
         • More opportunities for the compiler to optimise
         • Only works if code can actually be interleaved, e.g. inline functions, no inter-iteration dependencies
     • How many times is enough?
         • On average about 4 to 6 times works well
         • Best for loops where the number of iterations is known up front
         • Need to think about the spare: iterationCount % unrollCount
  30. Picking up the Spare
     • If you can, artificially pad your data with safe values to keep it a multiple of the unroll count.
     • In this example, you might process up to 3 dummy items in the worst case.
         for ( int i = 0; i < numElements; i += 4 )
         {
             InlineTransformation( pElements[ i+0 ] );
             InlineTransformation( pElements[ i+1 ] );
             InlineTransformation( pElements[ i+2 ] );
             InlineTransformation( pElements[ i+3 ] );
         }
  31. Picking up the Spare
     • If you can’t pad, run the unrolled loop up to numElements & ~3 instead, then run to completion in a second loop.
         for ( ; i < numElements; ++i ) // i carries on from the unrolled loop
         {
             InlineTransformation( pElements[ i ] );
         }
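Put together, the unrolled loop and the clean-up loop might look like the sketch below. The body of `InlineTransformation` (doubling a float) is a placeholder, and `TransformAll` is a name invented for the example.

```cpp
#include <cassert>

// Placeholder for the real per-element work.
inline void InlineTransformation(float& element)
{
    element *= 2.0f;
}

inline void TransformAll(float* pElements, int numElements)
{
    assert(numElements >= 0);

    // Largest multiple of 4 that is not greater than numElements.
    const int unrolledCount = numElements & ~3;

    int i = 0;
    for ( ; i < unrolledCount; i += 4 )
    {
        InlineTransformation( pElements[ i+0 ] );
        InlineTransformation( pElements[ i+1 ] );
        InlineTransformation( pElements[ i+2 ] );
        InlineTransformation( pElements[ i+3 ] );
    }

    // Pick up the 0 to 3 spare elements.
    for ( ; i < numElements; ++i )
    {
        InlineTransformation( pElements[ i ] );
    }
}
```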
  32. Picking up the Spare
     Alternative method (pros and cons: one branch but longer code generated).
         switch( numElements & 3 )
         {
             case 3: InlineTransformation( pElements[ i+2 ] ); // fall through
             case 2: InlineTransformation( pElements[ i+1 ] ); // fall through
             case 1: InlineTransformation( pElements[ i+0 ] ); // fall through
             case 0: break;
         }
  33. Loop unrolled… now what?
     • If you unrolled your loop 4 times, you might be able to use SIMD
     • Use AltiVec intrinsics, and align your data to 16 bytes
     • 128-bit registers: operate on four 32-bit values in parallel
     • Most SIMD instructions have 1 cycle throughput
     • Consider using SOA data instead of AOS
         • AOS: arrays of interleaved posX, posY, posZ structures
         • SOA: a structure of arrays, one for each field, dimensioned for all elements
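The two layouts can be sketched in plain C++ as follows. The type and function names are invented for the example, and no AltiVec intrinsics are used: the point is only that the SOA form leaves each field contiguous, so four consecutive posX values could be loaded into one 128-bit register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AOS: one structure per element, fields interleaved in memory.
struct PositionAOS
{
    float posX;
    float posY;
    float posZ;
};

// SOA: one array per field, each dimensioned for all elements.
struct PositionsSOA
{
    std::vector<float> posX;
    std::vector<float> posY;
    std::vector<float> posZ;
};

// Converts an AOS array into the SOA layout.
inline PositionsSOA ToSOA(const std::vector<PositionAOS>& elements)
{
    PositionsSOA result;
    result.posX.reserve(elements.size());
    result.posY.reserve(elements.size());
    result.posZ.reserve(elements.size());
    for (std::size_t i = 0; i < elements.size(); ++i)
    {
        result.posX.push_back(elements[i].posX);
        result.posY.push_back(elements[i].posY);
        result.posZ.push_back(elements[i].posZ);
    }
    return result;
}

// Touches only the posX array; posY and posZ never enter the cache.
inline void TranslateX(PositionsSOA& positions, float dx)
{
    assert(positions.posX.size() == positions.posY.size());
    for (std::size_t i = 0; i < positions.posX.size(); ++i)
        positions.posX[i] += dx;
}
```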
  34. Example: FloatToHalf
  35. Example: FloatToHalf4
  36. Questions?
