
PPU Optimisation Lesson

PowerPC Optimization Tips


  1. Engineer Learning Session #1: Optimisation Tips for PowerPC. Ben Hanke, April 16th 2009.
  2. Playstation 3 PPU
  3. Optimized Playstation 3 PPU
  4. Consoles with PowerPC
  5. Established Common Sense: “90% of execution time is spent in 10% of the code.” “Programmers are really bad at knowing what really needs optimizing, so they should be guided by profiling.” “Why should I bother optimizing X? It only runs N times per frame.” “Processors are really fast these days, so it doesn’t matter.” “The compiler can optimize that better than I can.”
  6. Alternative View: The compiler isn’t always as smart as you think it is. Really bad things can happen in the most innocent-looking code because of huge penalties inherent in the architecture. A generally sub-optimal code base will nickel-and-dime you for a big chunk of your frame rate. It’s easier to write more efficient code up front than to go and find it all later with a profiler.
  7. PPU Hardware Threads: We have two of them, running on alternate cycles. If one stalls, the other thread runs. Not multi-core: shared execution units, shared access to memory and cache, most registers duplicated. Ideal usage: threads filling in stalls for each other without thrashing the cache.
  8. PS3 Cache: Level 1 instruction cache: 32 KB, 2-way set associative, 128-byte cache line. Level 1 data cache: 32 KB, 4-way set associative, 128-byte cache line, write-through to L2. Level 2 data and instruction cache: 512 KB, 8-way set associative, write-back, 128-byte cache line.
  9. Cache Miss Penalties: L1 cache miss = 40 cycles. L2 cache miss = ~1000 cycles! In other words, random reads from memory are excruciatingly expensive. Reading data with large strides is very bad. Consider smaller data structures, or group data that will be read together.
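The "group data that will be read together" advice can be sketched as hot/cold splitting. This is an illustrative example with invented names (FatParticle, SumAgesFat, SumAgesSplit), not code from the deck: a small hot field buried in a large struct forces a large stride, while packing the hot field into its own array gives unit stride.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "fat" object: the loop below only needs 'age', but each
// element spans a full 128-byte cache line, so every read touches a new line.
struct FatParticle {
    float age;          // hot field
    float cold[31];     // position history, colour, etc. (cold data)
};

float SumAgesFat(const std::vector<FatParticle>& particles) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < particles.size(); ++i)
        sum += particles[i].age;   // 128-byte stride: ~1 cache line per element
    return sum;
}

// Hot field split into its own tightly packed array: a 128-byte line now
// holds 32 ages instead of 1, so the same loop touches 32x fewer lines.
float SumAgesSplit(const std::vector<float>& ages) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < ages.size(); ++i)
        sum += ages[i];            // 4-byte stride
    return sum;
}
```

Both functions compute the same result; only the memory traffic differs.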
  10. Virtual Functions: What happens when you call a virtual function? What does this code do? virtual void Update() {} It may touch the cache line at the vtable address unnecessarily. Consider batching by type for a better i-cache pattern. If you know the type, maybe you don’t need the virtual call, saving a trip to memory to read the function address. Even better: maybe the data you actually want to manipulate can be kept close together in memory.
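The batching-by-type idea can be sketched as follows. The types and the World container are invented for illustration; the point is that one homogeneous array per concrete type replaces a heterogeneous list updated through virtual calls, so there are no vtable loads and each loop runs the same (inlinable) code over contiguous data.

```cpp
#include <cassert>
#include <vector>

// Two concrete types with plain, non-virtual update methods.
struct Soldier { int health = 100; void Update() { --health; } };
struct Turret  { int ammo   = 50;  void Update() { --ammo;   } };

// Hypothetical container: entities batched by type instead of a single
// std::vector<Entity*> walked through virtual Update() calls.
struct World {
    std::vector<Soldier> soldiers;
    std::vector<Turret>  turrets;

    void Update() {
        // Each loop executes identical code over contiguous storage:
        // no per-object vtable read, good i-cache and d-cache behaviour.
        for (Soldier& s : soldiers) s.Update();
        for (Turret& t : turrets)   t.Update();
    }
};
```

The trade-off is flexibility: adding a type means adding an array and a loop, rather than just another subclass.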
  11. Data Hazards Ahead
  12. Spot the Stall:
      int SlowFunction( int & a, int & b )
      {
          a = 1;
          b = 2;
          return a + b;
      }
  13. Method 1. Q: When will this work and when won’t it?
      inline int FastFunction( int & a, int & b )
      {
          a = 1;
          b = 2;
          return a + b;
      }
  14. Method 2:
      int FastFunction( int * __restrict a, int * __restrict b )
      {
          *a = 1;
          *b = 2;
          return *a + *b; // we promise that a != b
      }
  15. __restrict Keyword: __restrict only works with pointers, not references (which sucks). Aliasing only applies to identical types. It can be applied to the implicit this pointer in member functions: put it after the closing parenthesis of the parameter list. This stops the compiler worrying that you passed a class data member to the function.
  16. Load-Hit-Store: What is it? A write followed soon after by a read of the same address, which has to wait on the PPU store queue. Average latency: 40 cycles. The queue snoops only address bits 52 through 61 (implications?)
  17. True LHS:
      Type casting between register files:
          float floatValue = (float)intValue;
          float posY = vPosition.X();
      Data member as loop counter:
          while( m_Counter-- ) {}
      Aliasing:
          void swap( int & a, int & b ) { int t = a; a = b; b = t; }
  18. Workaround: Data member as loop counter
      int counter = m_Counter;  // load into register
      while( counter-- )        // register will decrement
      {
          doSomething();
      }
      m_Counter = 0;            // store to memory just once
  19. Workarounds: Keep data in the same domain as much as possible. Reorder your writes and reads to allow space for latency. Consider using a word of flags instead of many packed bools: load the flags word into a register, perform logical bitwise operations on the flags in the register, then store the new flags value, e.g. m_Flags = ( initialFlags & ~kCanSeePlayer ) | newCanSeeFlag | kAngry;
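The flags-word workaround can be sketched in full. The enum values and UpdateFlags helper are invented for illustration; the shape matches the m_Flags expression above: one load, register-only bitwise ops, one store.

```cpp
#include <cassert>
#include <cstdint>

// All of a baddy's booleans packed into one 32-bit word.
enum BaddyFlags : std::uint32_t {
    kAlive        = 1u << 0,
    kCanSeePlayer = 1u << 1,
    kAngry        = 1u << 2,
};

// Caller loads m_Flags once, this mutates it in a register,
// caller stores the result once: no per-bool loads and stores,
// so no sub-word load-hit-store traffic.
std::uint32_t UpdateFlags(std::uint32_t flags, bool canSeePlayer) {
    std::uint32_t newCanSeeFlag = canSeePlayer ? kCanSeePlayer : 0u;
    return (flags & ~kCanSeePlayer) | newCanSeeFlag | kAngry;
}
```

Compare with the BaddyState struct of four packed bools on the next slides, where each member write and read is a separate sub-word memory access.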
  20. False LHS Case: Store queue snooping only compares address bits 52 through 61, so a false LHS occurs if you read from an address while a different item in the queue matches addr & 0xFFC. Examples: writing to and reading from memory 4 KB apart; writing to and reading from memory on a sub-word boundary, e.g. packed short, byte, bool. Write a bool then read a nearby one -> ~40 cycle stall.
  21. Example:
      struct BaddyState
      {
          bool m_bAlive;
          bool m_bCanSeePlayer;
          bool m_bAngry;
          bool m_bWearingHat;
      }; // 4 bytes
  22. Where might we stall?
      if ( m_bAlive )
      {
          m_bCanSeePlayer = LineOfSightCheck( this, player );
          if ( m_bCanSeePlayer && !m_bAngry )
          {
              m_bAngry = true;
              StartShootingPlayer();
          }
      }
  23. Workaround:
      if ( m_bAlive )                        // load, compare
      {
          const bool bAngry = m_bAngry;      // load
          const bool bCanSeePlayer = LineOfSightCheck( this, player );
          m_bCanSeePlayer = bCanSeePlayer;   // store
          if ( bCanSeePlayer && !bAngry )    // compare registers
          {
              m_bAngry = true;               // store
              StartShootingPlayer();
          }
      }
  24. Loop + Singleton Gotcha. What happens here?
      for( int i = 0; i < enemyCount; ++i )
      {
          EnemyManager::Get().DispatchEnemy();
      }
  25. Workaround:
      EnemyManager & enemyManager = EnemyManager::Get();
      for( int i = 0; i < enemyCount; ++i )
      {
          enemyManager.DispatchEnemy();
      }
  26. Branch Hints: Branch mis-prediction can hurt your performance: a 24-cycle penalty to flush the instruction queue. If you know a certain branch is rarely taken, you can use a static branch hint, e.g. if ( __builtin_expect( bResult, 0 ) ). Far better to eliminate the branch entirely: use logical bitwise operations to mask results. This is far easier and more applicable in SIMD code.
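The "mask results" idea for scalar integers can be sketched like this (the function name is invented; the slides don't give a scalar example): build an all-ones or all-zeros mask from the condition, then blend the two candidate results. There is no branch, so there is nothing to mispredict.

```cpp
#include <cassert>
#include <cstdint>

// Branchless select: returns a if cond is true, else b.
std::int32_t SelectBranchless(bool cond, std::int32_t a, std::int32_t b) {
    // cond == true  -> mask = 0xFFFFFFFF; cond == false -> mask = 0.
    std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    return static_cast<std::int32_t>(
        (static_cast<std::uint32_t>(a) &  mask) |
        (static_cast<std::uint32_t>(b) & ~mask));
}
```

This is exactly the pattern SIMD select intrinsics (e.g. AltiVec vec_sel) apply lane-wise, which is why it maps so naturally onto vector code.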
  27. Floating Point Branch Elimination: __fsel, __fsels (floating point select).
      float min( float a, float b ) { return ( a < b ) ? a : b; }
      float min( float a, float b ) { return __fsels( a - b, b, a ); }
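__fsels is a PPU intrinsic, so as a sanity check the rewrite above can be modelled portably. fsel selects its second operand when the first is >= 0, and its third otherwise; the model below (names invented) uses an ordinary conditional, it only verifies the arithmetic, whereas the real instruction is a single branchless select.

```cpp
#include <cassert>

// Portable reference model of fsel semantics:
// returns c if x >= 0.0, else b. (The hardware instruction is branchless;
// this model is only for checking the logic off-target.)
float fsel_model(float x, float c, float b) {
    return (x >= 0.0f) ? c : b;
}

// The slide's rewrite of min(): a - b >= 0 means a >= b, so pick b.
float min_fsel(float a, float b) {
    return fsel_model(a - b, b, a);
}
```

Note the usual fsel caveats apply on real hardware: a - b can overflow or lose precision for extreme inputs, and NaN handling differs from the branching version.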
  28. Microcoded Instructions: A single instruction expands to several fetched from ROM, creating a pipeline bubble. A common example to avoid is shifting by a variable amount: int a = b << c; has a minimum 11-cycle latency. If you know the range of values, it can be better to switch to fixed shifts:
      switch( c )
      {
          case 1: a = b << 1; break;
          case 2: a = b << 2; break;
          case 3: a = b << 3; break;
          default: break;
      }
  29. Loop Unrolling: Why unroll loops? Fewer branches; better concurrency, instruction pipelining, and latency hiding; more opportunities for the compiler to optimise. It only works if the code can actually be interleaved, e.g. inline functions with no inter-iteration dependencies. How many times is enough? On average, about 4 to 6 times works well. Best for loops where the number of iterations is known up front. You need to think about the spare: iterationCount % unrollCount.
  30. Picking up the Spare: If you can, artificially pad your data with safe values to keep the count a multiple of the unroll count. In this example, you might process up to 3 dummy items in the worst case.
      for ( int i = 0; i < numElements; i += 4 )
      {
          InlineTransformation( pElements[ i+0 ] );
          InlineTransformation( pElements[ i+1 ] );
          InlineTransformation( pElements[ i+2 ] );
          InlineTransformation( pElements[ i+3 ] );
      }
  31. Picking up the Spare: If you can’t pad, run the unrolled loop for numElements & ~3 iterations instead, then run to completion in a second loop.
      for ( ; i < numElements; ++i )
      {
          InlineTransformation( pElements[ i ] );
      }
  32. Picking up the Spare: Alternative method (pros and cons: only one branch, but longer code is generated; note the cases deliberately fall through).
      switch( numElements & 3 )
      {
          case 3: InlineTransformation( pElements[ i+2 ] );
          case 2: InlineTransformation( pElements[ i+1 ] );
          case 1: InlineTransformation( pElements[ i+0 ] );
          case 0: break;
      }
  33. Loop unrolled… now what? If you unrolled your loop 4 times, you might be able to use SIMD. Use AltiVec intrinsics and align your data to 16 bytes. The 128-bit registers operate on 4 32-bit values in parallel, and most SIMD instructions have 1-cycle throughput. Consider using SOA data instead of AOS. AOS: an array of structures with interleaved posX, posY, posZ fields. SOA: a structure of arrays, one array per field, dimensioned for all elements.
  34. Example: FloatToHalf
  35. Example: FloatToHalf4
  36. Questions?
