Engineer Learning Session #1
Optimisation Tips for PowerPC

Ben Hanke
April 16th 2009
Playstation 3 PPU
Optimized Playstation 3 PPU
Consoles with PowerPC
Established Common Sense
   •  “90% of execution time is spent in 10% of the code”
   •  “Programmers are really bad at knowing what really needs
      optimizing so should be guided by profiling”
   •  “Why should I bother optimizing X? It only runs N times per
      frame”
   •  “Processors are really fast these days so it doesn’t matter”
   •  “The compiler can optimize that better than I can”
Alternative View
   •  The compiler isn’t always as smart as you think it is.
   •  Really bad things can happen in the most innocent-looking code
      because of huge penalties inherent in the architecture.
   •  A generally sub-optimal code base will nickel-and-dime you for a
      big chunk of your frame rate.
   •  It’s easier to write efficient code up front than to go and find
      all the slow spots later with a profiler.
PPU Hardware Threads
   •  We have two of them running on alternate cycles
   •  If one stalls, the other thread runs
   •  Not multi-core:
         Shared execution units
         Shared access to memory and cache
         Most registers duplicated
   •  Ideal usage:
         Threads filling in stalls for each other without thrashing the cache
PS3 Cache
   •  Level 1 Instruction Cache
         32 KB 2-way set associative, 128 byte cache line
   •  Level 1 Data Cache
         32 KB 4-way set associative, 128 byte cache line
         Write-through to L2
   •  Level 2 Data and Instruction Cache
         512 KB 8-way set associative
         Write-back
         128 byte cache line
Cache Miss Penalties
   •  L1 cache miss = 40 cycles
   •  L2 cache miss = ~1000 cycles!
   •  In other words, random reads from memory are excruciatingly
      expensive!
   •  Reading data with large strides: very bad
   •  Consider smaller data structures, or group data that will be read
      together
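A minimal sketch of the last bullet, splitting a structure so that the per-frame loop only touches the data it needs. All names here (EnemyFat, EnemyHot, EnemyCold, SumHealth) are hypothetical and not from the original slides; the field sizes are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical "fat" layout: the per-frame loop only needs health, but each
// element drags the rarely used fields through the cache with it.
struct EnemyFat
{
    float         health;
    char          debugName[64];
    std::uint32_t spawnFlags;
    float         patrolPath[32];
};

// Hypothetical split layout: hot data is packed contiguously, cold data
// lives in a parallel array at the same index.
struct EnemyHot  { float health; };
struct EnemyCold { char debugName[64]; std::uint32_t spawnFlags; float patrolPath[32]; };

// Walks only the hot array, so consecutive elements share cache lines
// instead of each one sitting on a line of its own.
inline float SumHealth(const std::vector<EnemyHot>& hot)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < hot.size(); ++i)
        total += hot[i].health;
    return total;
}
```

With a 128 byte cache line, one line holds 32 EnemyHot entries, whereas every EnemyFat entry is larger than a whole line.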
Virtual Functions
   •  What happens when you call a virtual function?
   •  What does this code do?
            virtual void Update() {}
   •  May touch cache line at vtable address unnecessarily
   •  Consider batching by type for better iCache pattern
   •  If you know the type, maybe you don’t need the virtual: save
      touching memory to read the function address
   •  Even better: maybe the data you actually want to manipulate
      can be kept close together in memory?
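One possible shape for "batching by type", sketched under the assumption that the concrete type is known at the call site. Particle, ParticleBatch and UpdateAll are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical concrete type with a non-virtual Update: there is no vtable
// pointer, so the object is just its data.
struct Particle
{
    float y;
    float velocityY;

    void Update(float dt) { y += velocityY * dt; }
};

// One batch per concrete type. The loop calls the same function on every
// iteration and walks memory linearly.
struct ParticleBatch
{
    std::vector<Particle> items;

    void UpdateAll(float dt)
    {
        for (std::size_t i = 0; i < items.size(); ++i)
            items[i].Update(dt);   // direct call, no function address to load
    }
};
```

The trade-off is one container per type instead of a single list of base-class pointers.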
Data Hazards Ahead
Spot the Stall


int SlowFunction(int & a, int & b)
{
   a = 1;
   b = 2;
   return a + b;
}
Method 1


Q: When will this work and when won’t it?

inline int FastFunction(int & a, int & b)
{
   a = 1;
   b = 2;
   return a + b;
}
Method 2


int FastFunction(int * __restrict a,
                 int * __restrict b)
{
   *a = 1;
   *b = 2;
   return *a + *b; // we promise that a != b
}
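__restrict Keyword

   •  __restrict is documented for pointers; whether it can be applied to
      references depends on the compiler (GCC, for one, accepts restricted
      references), so check yours.
   •  The compiler only has to assume aliasing between pointers to the same
      type. Under strict aliasing rules, pointers to different types are
      already assumed not to alias (char pointers excepted).
   •  Can be applied to the implicit this pointer in member functions.
         Put it after the closing parenthesis of the parameter list.
         Stops the compiler worrying that you passed a class data member
         to the function.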
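A small sketch of where the qualifier goes on a member function. This uses the __restrict compiler extension as spelled by GCC and Clang; it is not standard C++, and the Accumulator class is a hypothetical example rather than code from the talk.

```cpp
// Hypothetical class used only to show where the qualifier goes.
class Accumulator
{
public:
    Accumulator() : m_Total(0) {}

    // __restrict after the parameter list applies to the implicit this
    // pointer; on the parameter it applies to pOut. Together they promise
    // the compiler that pOut does not point at m_Total.
    void AddTwice(int value, int* __restrict pOut) __restrict
    {
        m_Total += value;
        *pOut = m_Total;     // store through pOut ...
        m_Total += value;    // ... so m_Total need not be reloaded here
    }

    int Total() const { return m_Total; }

private:
    int m_Total;
};
```

The promise is the caller's to keep: passing a pointer to the object's own member would break it and give undefined results.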
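Load-Hit-Store

   •  What is it?
   •  A write followed shortly by a read of the same address
   •  The read has to wait for the write to clear the PPU store queue
   •  Average latency 40 cycles
   •  The store queue snoop compares bits 52 through 61 only (implications?)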
True LHS
   •  Type casting between register files (the value travels via memory):
         float floatValue = (float)intValue;
         float posY = vPosition.X();

   •  Data member as loop counter:
         while( m_Counter-- ) {}

   •  Aliasing:
         void swap( int & a, int & b ) { int t = a; a = b; b = t; }
Workaround: Data member as loop counter

int counter = m_Counter; // load into register
while( counter-- ) // register will decrement
{
    doSomething();
}
m_Counter = counter; // Store to memory just once (same final value as the original loop)
Workarounds
   •  Keep data in the same domain (integer, float or vector registers) as much as possible
   •  Reorder your writes and reads to allow space for latency
   •  Consider using word flags instead of many packed bools
         Load flags word into register
         Perform logical bitwise operations on flags in register
         Store new flags value

      e.g. m_Flags = ( initialFlags & ~kCanSeePlayer ) |
                     newCanSeeFlag | kAngry;
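The flags-word pattern could look roughly like this. The enum values and the UpdateFlags helper are assumptions made for illustration; only kCanSeePlayer and kAngry are named on the slide, and the expression mirrors the slide's example, which sets kAngry unconditionally.

```cpp
#include <cstdint>

// Hypothetical flag bits; kCanSeePlayer and kAngry follow the slide's names.
enum BaddyFlags
{
    kAlive        = 1u << 0,
    kCanSeePlayer = 1u << 1,
    kAngry        = 1u << 2,
    kWearingHat   = 1u << 3
};

// Takes the current flags word and returns the new one. All the work is done
// on a value held in a register; the caller does one load and one store.
inline std::uint32_t UpdateFlags(std::uint32_t initialFlags, bool canSeePlayer)
{
    // Convert the bool into its flag bit (0 or kCanSeePlayer) without a branch.
    const std::uint32_t newCanSeeFlag =
        static_cast<std::uint32_t>(canSeePlayer) * kCanSeePlayer;

    return (initialFlags & ~static_cast<std::uint32_t>(kCanSeePlayer))
         | newCanSeeFlag
         | kAngry;
}
```

A caller would write something like m_Flags = UpdateFlags( m_Flags, bCanSee ); so the member is read once and written once.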
False LHS Case
   •  Store queue snooping only compares bits 52 through 61 of the address
   •  So a false LHS occurs if you read from an address while a different item
      in the queue matches addr & 0xFFC.
   •  Writing to and reading from addresses a multiple of 4 KB apart
   •  Writing to and reading from memory on a sub-word boundary, e.g.
      packed short, byte, bool
   •  Write a bool then read a nearby one -> ~40 cycle stall
Example
struct BaddyState
{
    bool m_bAlive;
    bool m_bCanSeePlayer;
    bool m_bAngry;
    bool m_bWearingHat;
}; // 4 bytes
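To see why this layout is a problem, the addr & 0xFFC comparison described earlier can be modelled in a few lines. SnoopBitsMatch is a hypothetical helper, not a hardware or SDK function; the struct is repeated so the sketch stands alone.

```cpp
#include <cstddef>
#include <cstdint>

// Same layout as the struct above, repeated so this sketch is self-contained.
struct BaddyState
{
    bool m_bAlive;
    bool m_bCanSeePlayer;
    bool m_bAngry;
    bool m_bWearingHat;
};

// Hypothetical model of the snoop comparison: two addresses "match" when
// they agree in the bits selected by addr & 0xFFC.
inline bool SnoopBitsMatch(std::uintptr_t storeAddr, std::uintptr_t loadAddr)
{
    const std::uintptr_t kSnoopMask = 0xFFC;
    return (storeAddr & kSnoopMask) == (loadAddr & kSnoopMask);
}
```

All four bools sit in the same 4-byte word, so a store to any one of them matches a following load of any other under this comparison. Addresses exactly 4 KB apart match for the same reason.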
Where might we stall?
if ( m_bAlive )
{
    m_bCanSeePlayer = LineOfSightCheck( this, player );
    if ( m_bCanSeePlayer && !m_bAngry )
    {
        m_bAngry = true;
        StartShootingPlayer();
    }
}
Workaround
if ( m_bAlive ) // load, compare
{
    const bool bAngry = m_bAngry; // load
    const bool bCanSeePlayer = LineOfSightCheck( this, player );
    m_bCanSeePlayer = bCanSeePlayer; // store
    if (bCanSeePlayer && !bAngry ) // compare registers
    {
        m_bAngry = true; // store
        StartShootingPlayer();
    }
}
Loop + Singleton Gotcha
What happens here?

 for( int i = 0; i < enemyCount; ++i )
 {
     EnemyManager::Get().DispatchEnemy();
 }
Workaround

EnemyManager & enemyManager = EnemyManager::Get();
for( int i = 0; i < enemyCount; ++i )
{
    enemyManager.DispatchEnemy();
}
Branch Hints
   •  Branch mis-prediction can hurt your performance
   •  24 cycle penalty to flush the instruction queue
   •  If you know a certain branch is rarely taken, you can use a static
      branch hint, e.g.
         if ( __builtin_expect( bResult, 0 ) )

   •  Far better to eliminate the branch!
         Use logical bitwise operations to mask results
         This is far easier and more applicable in SIMD code
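As an illustration of masking results with bitwise operations, here is a scalar integer select. SelectMasked is a hypothetical helper; whether it beats a plain ternary depends on the compiler and target, so treat it as a sketch of the idea rather than a guaranteed win.

```cpp
#include <cstdint>

// Returns a when condition is true, otherwise b, using a mask instead of a
// conditional jump in the source.
inline std::uint32_t SelectMasked(bool condition, std::uint32_t a, std::uint32_t b)
{
    // All ones when condition is true, all zeros when it is false.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(condition);
    return (a & mask) | (b & ~mask);
}
```

The same mask-and-merge shape is what a SIMD compare followed by a select does for four lanes at once.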
Floating Point Branch Elimination
   •  __fsel, __fsels: Floating point select

    // Branching version
    float min( float a, float b )
    {
        return ( a < b ) ? a : b;
    }
    // Branch-free version: __fsels( x, y, z ) yields y when x >= 0, otherwise z
    float min( float a, float b )
    {
        return __fsels( a - b, b, a );
    }
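The select semantics can be checked off-target with a reference model. FselRef below is a hypothetical stand-in written with a ternary so it compiles anywhere; on the PPU the intrinsic would take its place. MaxSel and ClampSel are further illustrative uses, not from the slides.

```cpp
// Reference model of floating point select: returns b when a >= 0, else c.
// A NaN in a fails the comparison and therefore selects c.
inline float FselRef(float a, float b, float c)
{
    return (a >= 0.0f) ? b : c;
}

// min: when a - b >= 0, b is the smaller (or equal) value.
inline float MinSel(float a, float b) { return FselRef(a - b, b, a); }

// max: when a - b >= 0, a is the larger (or equal) value.
inline float MaxSel(float a, float b) { return FselRef(a - b, a, b); }

// clamp built from the two selects above; assumes lo <= hi.
inline float ClampSel(float x, float lo, float hi)
{
    return MinSel(MaxSel(x, lo), hi);
}
```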
Microcoded Instructions
   •  Single instruction -> several, fetched from ROM, pipeline bubble
   •  Common example of one to avoid: shift by a variable amount.
          int a = b << c;
   •  Minimum 11 cycle latency.
   •  If you know the range of values, it can be better to switch to a fixed (immediate) shift!
       switch( c )
       {
            case 1: a = b << 1; break;
            case 2: a = b << 2; break;
            case 3: a = b << 3; break;
            default: break;
       }
Loop Unrolling
   •  Why unroll loops?
         Fewer branches
         Better concurrency, instruction pipelining, hide latency
         More opportunities for compiler to optimise
         Only works if code can actually be interleaved, e.g. inline functions, no
         inter-iteration dependencies
   •  How many times is enough?
         On average about 4 to 6 times works well
         Best for loops where the number of iterations is known up front
   •  Need to think about the spare: iterationCount % unrollCount
Picking up the Spare
   •  If you can, artificially pad your data with safe values to keep it a multiple of the
      unroll count. In this example, you might process up to 3 dummy items in
      the worst case.

     for ( int i = 0; i < numElements; i += 4 )
     {
          InlineTransformation( pElements[ i+0 ] );
          InlineTransformation( pElements[ i+1 ] );
          InlineTransformation( pElements[ i+2 ] );
          InlineTransformation( pElements[ i+3 ] );
     }
Picking up the Spare
   •  If you can’t pad, run the unrolled loop up to numElements & ~3 instead and run to
      completion in a second loop.
    for ( ; i < numElements; ++i )
    {
          InlineTransformation( pElements[ i ] );
    }
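Putting the unrolled loop and the clean-up loop together might look like this. InlineTransformation here is a hypothetical stand-in that doubles a float, and TransformAll is an illustrative wrapper; numElements is assumed to be non-negative.

```cpp
// Hypothetical stand-in for the slide's InlineTransformation.
inline void InlineTransformation(float& element) { element *= 2.0f; }

inline void TransformAll(float* pElements, int numElements)
{
    // Largest multiple of 4 that does not exceed numElements.
    const int unrolledEnd = numElements & ~3;

    int i = 0;
    for ( ; i < unrolledEnd; i += 4 )      // main loop, unrolled 4 times
    {
        InlineTransformation( pElements[ i+0 ] );
        InlineTransformation( pElements[ i+1 ] );
        InlineTransformation( pElements[ i+2 ] );
        InlineTransformation( pElements[ i+3 ] );
    }
    for ( ; i < numElements; ++i )         // picks up the 0 to 3 spare elements
    {
        InlineTransformation( pElements[ i ] );
    }
}
```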
Picking up the Spare
Alternative method (pros and cons - one branch but longer code generated).


  switch( numElements & 3 )
 {
 case 3: InlineTransformation( pElements[ i+2 ] );
 case 2: InlineTransformation( pElements[ i+1 ] );
 case 1: InlineTransformation( pElements[ i+0 ] );
 case 0: break;
 }
Loop unrolled… now what?
   •  If you unrolled your loop 4 times, you might be able to use SIMD
   •  Use AltiVec intrinsics; align your data to 16 bytes
   •  128 bit registers: operate on four 32-bit values in parallel
   •  Most SIMD instructions have 1 cycle throughput
   •  Consider using SOA data instead of AOS
         AOS: Arrays of interleaved posX, posY, posZ structures
         SOA: A structure of arrays for each field dimensioned for all elements
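The two layouts, sketched in plain C++ rather than AltiVec intrinsics so the example compiles on any machine. PositionAOS, PositionsSOA and TranslateY are hypothetical names; the element count is assumed to be padded to a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AOS: one structure per element, fields interleaved in memory.
struct PositionAOS { float posX, posY, posZ; };

// SOA: one array per field, each dimensioned for all elements.
struct PositionsSOA
{
    std::vector<float> posX;
    std::vector<float> posY;
    std::vector<float> posZ;
};

// Adds dy to every posY. Each group of four consecutive floats is the amount
// of data one 128 bit register would hold.
inline void TranslateY(PositionsSOA& positions, float dy)
{
    const std::size_t count = positions.posY.size();
    assert(count % 4 == 0);   // assumption: data padded to the unroll count

    for (std::size_t i = 0; i < count; i += 4)
    {
        positions.posY[i + 0] += dy;
        positions.posY[i + 1] += dy;
        positions.posY[i + 2] += dy;
        positions.posY[i + 3] += dy;
    }
}
```

With the AOS layout the same four posY values would be 12 bytes apart and would have to be gathered before they could share a register.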
Example: FloatToHalf
Example: FloatToHalf4
Questions?

PPU Optimisation Lesson

  • 1. Engineer Learning Session #1 Optimisation Tips for PowerPC Ben Hanke April 16th 2009
  • 5. Established Common Sense
      • “90% of execution time is spent in 10% of the code”
      • “Programmers are really bad at knowing what really needs optimizing so should be guided by profiling”
      • “Why should I bother optimizing X? It only runs N times per frame”
      • “Processors are really fast these days so it doesn’t matter”
      • “The compiler can optimize that better than I can”
  • 6. Alternative View
      • The compiler isn’t always as smart as you think it is.
      • Really bad things can happen in the most innocent-looking code because of huge penalties inherent in the architecture.
      • A generally sub-optimal code base will nickel-and-dime you for a big chunk of your frame rate.
      • It’s easier to write more efficient code up front than to go and find it all later with a profiler.
  • 7. PPU Hardware Threads
      • We have two of them, running on alternate cycles
      • If one stalls, the other thread runs
      • Not multi-core:
          • Shared execution units
          • Shared access to memory and cache
          • Most registers duplicated
      • Ideal usage:
          • Threads filling in stalls for each other without thrashing cache
  • 8. PS3 Cache
      • Level 1 Instruction Cache
          • 32 KB, 2-way set associative, 128-byte cache line
      • Level 1 Data Cache
          • 32 KB, 4-way set associative, 128-byte cache line
          • Write-through to L2
      • Level 2 Data and Instruction Cache
          • 512 KB, 8-way set associative
          • Write-back
          • 128-byte cache line
  • 9. Cache Miss Penalties
      • L1 cache miss = 40 cycles
      • L2 cache miss = ~1000 cycles!
      • In other words, random reads from memory are excruciatingly expensive!
      • Reading data with large strides: very bad
      • Consider smaller data structures, or group data that will be read together
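To put a rough number on the stride point, the sketch below estimates how many 128-byte cache lines a loop touches when it reads one small field from each record. `LinesTouched` is a hypothetical helper, not something from the talk, and it assumes the first record is cache-line aligned and the field never straddles a line.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLineBytes = 128;  // PPU cache line size quoted on the PS3 Cache slide

// Rough count of cache lines touched when reading one small field from each of
// `count` records laid out `strideBytes` apart.
// Assumptions: first record is cache-line aligned, field never straddles a line.
inline std::size_t LinesTouched(std::size_t count, std::size_t strideBytes) {
    assert(strideBytes > 0);
    if (count == 0) {
        return 0;
    }
    if (strideBytes >= kCacheLineBytes) {
        return count;  // every read lands on its own cache line
    }
    // The last field starts at (count - 1) * strideBytes; count the lines up to it.
    return ((count - 1) * strideBytes) / kCacheLineBytes + 1;
}
```

Under these assumptions, sweeping 1000 tightly packed floats touches 32 lines, while reading one float out of each of 1000 records that are 128 bytes apart touches 1000 lines, each one a potential miss.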
  • 10. Virtual Functions
      • What happens when you call a virtual function?
      • What does this code do?
            virtual void Update() {}
      • May touch the cache line at the vtable address unnecessarily
      • Consider batching by type for a better iCache pattern
      • If you know the type, maybe you don’t need the virtual: this saves touching memory to read the function address
      • Even better: maybe the data you actually want to manipulate can be kept close together in memory?
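One way to picture the batching-by-type suggestion is to keep each concrete type in its own contiguous array and call a plain, non-virtual update on each. The `Bullet`, `Enemy`, and `World` types below are hypothetical and only illustrate the layout; they are not code from the talk.

```cpp
#include <vector>

// Hypothetical concrete types, stored by value rather than behind a base pointer.
struct Bullet { float x, y, vx, vy; };
struct Enemy  { float x, y; int health; };

// Non-virtual updates: the call target is known at compile time, so no vtable
// has to be read and the calls can be inlined.
inline void Update(Bullet& b, float dt) {
    b.x += b.vx * dt;
    b.y += b.vy * dt;
}

inline void Update(Enemy& e, float /*dt*/) {
    if (e.health < 0) {
        e.health = 0;  // clamp health so it never goes negative
    }
}

struct World {
    std::vector<Bullet> bullets;  // all bullets contiguous in memory
    std::vector<Enemy>  enemies;  // all enemies contiguous in memory

    // Each loop runs the same code over densely packed data of a single type.
    void UpdateAll(float dt) {
        for (Bullet& b : bullets) { Update(b, dt); }
        for (Enemy& e : enemies)  { Update(e, dt); }
    }
};
```

The trade-off is that adding a new type means adding a new array and loop, which is usually acceptable for the handful of types that dominate a frame.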
  • 12. Spot the Stall
        int SlowFunction(int & a, int & b)
        {
            a = 1;
            b = 2;
            return a + b;
        }
  • 13. Method 1
      • Q: When will this work and when won’t it?
        inline int FastFunction(int & a, int & b)
        {
            a = 1;
            b = 2;
            return a + b;
        }
  • 14. Method 2
        int FastFunction(int * __restrict a, int * __restrict b)
        {
            *a = 1;
            *b = 2;
            return *a + *b; // we promise that a != b
        }
  • 15. __restrict Keyword
      • __restrict only works with pointers, not references (which sucks).
      • Aliasing only applies to identical types.
      • Can be applied to the implicit this pointer in member functions.
      • Put it after the closing parenthesis of the parameter list.
      • Stops the compiler worrying that you passed a class data member to the function.
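The reason the compiler cannot simply return 3 from `SlowFunction` is that both references may name the same `int`, in which case the correct answer is 4. The sketch below repeats the two functions from the slides and adds a hypothetical `Accumulator` class to show where `__restrict` goes for the implicit `this` pointer. The member-function form is assumed to be the GCC/Clang extension syntax; other compilers may differ.

```cpp
// From the "Spot the Stall" slide: a and b may refer to the same int, so the
// compiler has to re-read a from memory after the store to b.
int SlowFunction(int& a, int& b) {
    a = 1;
    b = 2;
    return a + b;  // 3 normally, but 4 when a and b alias
}

// From the "Method 2" slide: __restrict promises the pointers never alias, so
// the compiler is free to compute the result without touching memory again.
int FastFunction(int* __restrict a, int* __restrict b) {
    *a = 1;
    *b = 2;
    return *a + *b;
}

// Hypothetical class (not from the talk) with __restrict on the implicit this
// pointer: the keyword follows the parameter list (GCC/Clang extension).
struct Accumulator {
    int m_Total;

    void AddTwice(const int* value) __restrict {
        m_Total += *value;  // caller promises value does not point at m_Total
        m_Total += *value;
    }
};
```

Calling `FastFunction` with two pointers to the same object breaks the promise and is undefined behaviour, so the keyword belongs only where the caller can guarantee distinct objects.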
  • 16. Load-Hit-Store
      • What is it?
      • Write followed by read
      • PPU store queue
      • Average latency 40 cycles
      • Snoop bits 52 through 61 (implications?)
  • 17. True LHS
      • Type casting between register files:
            float floatValue = (float)intValue;
            float posY = vPosition.X();
      • Data member as loop counter:
            while( m_Counter-- ) {}
      • Aliasing:
            void swap( int & a, int & b ) { int t = a; a = b; b = t; }
  • 18. Workaround: Data member as loop counter
        int counter = m_Counter; // load into register
        while( counter-- )       // register will decrement
        {
            doSomething();
        }
        m_Counter = 0;           // store to memory just once
  • 19. Workarounds
      • Keep data in the same domain as much as possible
      • Reorder your writes and reads to allow space for latency
      • Consider using word flags instead of many packed bools
          • Load flags word into register
          • Perform logical bitwise operations on flags in register
          • Store new flags value, e.g.
            m_Flags = ( initialFlags & ~kCanSeePlayer ) | newcanSeeFlag | kAngry;
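The word-flags idea can be sketched as a single load, a few register-only bitwise operations, and a single store. The bit assignments and the `UpdateSightFlags` helper below are hypothetical; only the flag names `kCanSeePlayer` and `kAngry` come from the slide.

```cpp
#include <cstdint>

// Hypothetical bit assignments for the four baddy state flags.
constexpr std::uint32_t kAlive        = 1u << 0;
constexpr std::uint32_t kCanSeePlayer = 1u << 1;
constexpr std::uint32_t kAngry        = 1u << 2;
constexpr std::uint32_t kWearingHat   = 1u << 3;

// Takes the flags word that was loaded once, clears and re-sets the
// can-see-player bit, and makes the baddy angry whenever the player is seen.
// Everything here operates on register values; the caller stores the result once.
constexpr std::uint32_t UpdateSightFlags(std::uint32_t initialFlags, bool canSeePlayer) {
    const std::uint32_t seeBit   = canSeePlayer ? kCanSeePlayer : 0u;
    const std::uint32_t angryBit = canSeePlayer ? kAngry : 0u;
    return (initialFlags & ~kCanSeePlayer) | seeBit | angryBit;
}
```

A caller would write something like `m_Flags = UpdateSightFlags(m_Flags, sawPlayer);`, which is one load and one store regardless of how many flags change.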
  • 20. False LHS Case
      • Store queue snooping only compares bits 52 through 61 of the address
      • So a false LHS occurs if you read from an address while a different item in the queue matches addr & 0xFFC.
      • Writing to and reading from memory 4 KB apart
      • Writing to and reading from memory on a sub-word boundary, e.g. packed short, byte, bool
      • Write a bool then read a nearby one -> ~40 cycle stall
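The address comparison described above can be written down directly. `SnoopBitsMatch` is a hypothetical helper for reasoning about data layouts, assuming the `0xFFC` mask quoted on the slide; it is not a hardware API.

```cpp
#include <cstdint>

// Mask for the address bits the store queue compares (value taken from the slide).
constexpr std::uintptr_t kSnoopMask = 0xFFC;

// Returns true when a load from loadAddr would be treated as hitting a queued
// store to storeAddr. If the two addresses are actually different, that match
// is a false load-hit-store.
constexpr bool SnoopBitsMatch(std::uintptr_t storeAddr, std::uintptr_t loadAddr) {
    return (storeAddr & kSnoopMask) == (loadAddr & kSnoopMask);
}
```

Two addresses exactly 4 KB apart match, as do two different bytes inside the same aligned 4-byte word, while neighbouring words do not.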
  • 21. Example
        struct BaddyState
        {
            bool m_bAlive;
            bool m_bCanSeePlayer;
            bool m_bAngry;
            bool m_bWearingHat;
        }; // 4 bytes
  • 22. Where might we stall?
        if ( m_bAlive )
        {
            m_bCanSeePlayer = LineOfSightCheck( this, player );
            if ( m_bCanSeePlayer && !m_bAngry )
            {
                m_bAngry = true;
                StartShootingPlayer();
            }
        }
  • 23. Workaround
        if ( m_bAlive ) // load, compare
        {
            const bool bAngry = m_bAngry; // load
            const bool bCanSeePlayer = LineOfSightCheck( this, player );
            m_bCanSeePlayer = bCanSeePlayer; // store
            if ( bCanSeePlayer && !bAngry ) // compare registers
            {
                m_bAngry = true; // store
                StartShootingPlayer();
            }
        }
  • 24. Loop + Singleton Gotcha
      • What happens here?
        for( int i = 0; i < enemyCount; ++i )
        {
            EnemyManager::Get().DispatchEnemy();
        }
  • 25. Workaround
        EnemyManager & enemyManager = EnemyManager::Get();
        for( int i = 0; i < enemyCount; ++i )
        {
            enemyManager.DispatchEnemy();
        }
  • 26. Branch Hints
      • Branch misprediction can hurt your performance
      • 24-cycle penalty to flush the instruction queue
      • If you know a certain branch is rarely taken, you can use a static branch hint, e.g.
            if ( __builtin_expect( bResult, 0 ) )
      • Far better to eliminate the branch!
          • Use logical bitwise operations to mask results
          • This is far easier and more applicable in SIMD code
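Masking results with bitwise operations can look like the following scalar sketch. `SelectMask` and `ClampToZero` are hypothetical helpers, and whether the comparison itself compiles without a branch depends on the compiler and target, so the generated code is worth checking.

```cpp
#include <cstdint>

// Returns a when cond is true and b otherwise, using a mask instead of a branch.
inline std::int32_t SelectMask(bool cond, std::int32_t a, std::int32_t b) {
    // 0 - 1 wraps to all ones in unsigned arithmetic; 0 - 0 stays zero.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    const std::uint32_t bits = (static_cast<std::uint32_t>(a) & mask) |
                               (static_cast<std::uint32_t>(b) & ~mask);
    return static_cast<std::int32_t>(bits);
}

// Example use: clamp negative values to zero without an if statement.
inline std::int32_t ClampToZero(std::int32_t value) {
    return SelectMask(value > 0, value, 0);
}
```

The same pattern carries over to SIMD code, where a vector compare produces the mask for every lane at once.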
  • 27. Floating Point Branch Elimination
      • __fsel, __fsels: floating point select
        // branching version
        float min( float a, float b )
        {
            return ( a < b ) ? a : b;
        }
        // branch-free version
        float min( float a, float b )
        {
            return __fsels( a - b, b, a );
        }
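For readers without a PowerPC toolchain, the select semantics can be modelled in portable C++: the first operand is compared against zero, the second is returned when it is greater than or equal to zero, and the third otherwise. `FselReference`, `MinViaFsel`, and `MaxViaFsel` are hypothetical stand-ins that document the behaviour; they are not the intrinsic and may still compile to a branch.

```cpp
// Portable model of floating point select: y when x >= 0, otherwise z.
// A NaN in x fails the comparison, so z is returned in that case.
inline float FselReference(float x, float y, float z) {
    return (x >= 0.0f) ? y : z;
}

// min(a, b): when a - b >= 0, b is the smaller (or equal) value.
inline float MinViaFsel(float a, float b) {
    return FselReference(a - b, b, a);
}

// max(a, b): same test with the two results swapped.
inline float MaxViaFsel(float a, float b) {
    return FselReference(a - b, a, b);
}
```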
  • 28. Microcoded Instructions
      • Single instruction -> several, fetched from ROM, pipeline bubble
      • Common example of one to avoid: shift by a variable amount (not a compile-time immediate).
            int a = b << c;
      • Minimum 11 cycle latency.
      • If you know the range of values, it can be better to switch to a fixed shift!
        switch( c )
        {
            case 1: a = b << 1; break;
            case 2: a = b << 2; break;
            case 3: a = b << 3; break;
            default: break;
        }
  • 29. Loop Unrolling
      • Why unroll loops?
          • Fewer branches
          • Better concurrency, instruction pipelining, hide latency
          • More opportunities for the compiler to optimise
          • Only works if code can actually be interleaved, e.g. inline functions, no inter-iteration dependencies
      • How many times is enough?
          • On average about 4 to 6 times works well
          • Best for loops where the number of iterations is known up front
      • Need to think about the spare: iterationCount % unrollCount
  • 30. Picking up the Spare
      • If you can, artificially pad your data with safe values to keep it a multiple of the unroll count. In this example, you might process up to 3 dummy items in the worst case.
        for ( int i = 0; i < numElements; i += 4 )
        {
            InlineTransformation( pElements[ i+0 ] );
            InlineTransformation( pElements[ i+1 ] );
            InlineTransformation( pElements[ i+2 ] );
            InlineTransformation( pElements[ i+3 ] );
        }
  • 31. Picking up the Spare
      • If you can’t pad, run the unrolled loop up to numElements & ~3 instead and finish the remaining elements in a second loop.
        for ( ; i < numElements; ++i )
        {
            InlineTransformation( pElements[ i ] );
        }
  • 32. Picking up the Spare
      • Alternative method (pros and cons: one branch but longer code generated).
        switch( numElements & 3 )
        {
            case 3: InlineTransformation( pElements[ i+2 ] );
            case 2: InlineTransformation( pElements[ i+1 ] );
            case 1: InlineTransformation( pElements[ i+0 ] );
            case 0: break;
        }
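Putting the unrolled loop and the second clean-up loop together gives something like the sketch below. `TransformAll` is a hypothetical wrapper, and the body of `InlineTransformation` is a placeholder chosen so the result is easy to check; the talk does not say what the transformation does.

```cpp
#include <cstddef>

// Placeholder transformation (assumption): any small inlinable function with
// no dependency between iterations would do.
inline void InlineTransformation(float& element) {
    element = element * 2.0f + 1.0f;
}

inline void TransformAll(float* pElements, std::size_t numElements) {
    // Largest multiple of 4 that does not exceed numElements.
    const std::size_t unrolledEnd = numElements & ~static_cast<std::size_t>(3);

    std::size_t i = 0;
    // Main loop, unrolled four times.
    for (; i < unrolledEnd; i += 4) {
        InlineTransformation(pElements[i + 0]);
        InlineTransformation(pElements[i + 1]);
        InlineTransformation(pElements[i + 2]);
        InlineTransformation(pElements[i + 3]);
    }
    // Second loop picks up the 0 to 3 spare elements.
    for (; i < numElements; ++i) {
        InlineTransformation(pElements[i]);
    }
}
```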
  • 33. Loop unrolled… now what?
      • If you unrolled your loop 4 times, you might be able to use SIMD
      • Use AltiVec intrinsics and align your data to 16 bytes
      • 128-bit registers: operate on four 32-bit values in parallel
      • Most SIMD instructions have 1 cycle throughput
      • Consider using SOA data instead of AOS
          • AOS: Arrays of interleaved posX, posY, posZ structures
          • SOA: A structure of arrays for each field dimensioned for all elements
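The AOS versus SOA distinction is easier to see in code. The sketch below is plain scalar C++ rather than AltiVec, so it runs anywhere; the type names, `ToSOA`, and `TranslateY` are hypothetical. Each group of four consecutive floats in the unrolled loop is what a single 128-bit vector operation would cover, assuming 16-byte aligned storage.

```cpp
#include <cstddef>
#include <vector>

// AOS: one interleaved structure per element.
struct PositionAOS {
    float posX, posY, posZ;
};

// SOA: one array per field, each dimensioned for all elements.
struct PositionsSOA {
    std::vector<float> posX, posY, posZ;
};

// Hypothetical converter from the interleaved layout to separate arrays.
inline PositionsSOA ToSOA(const std::vector<PositionAOS>& aos) {
    PositionsSOA soa;
    soa.posX.reserve(aos.size());
    soa.posY.reserve(aos.size());
    soa.posZ.reserve(aos.size());
    for (const PositionAOS& p : aos) {
        soa.posX.push_back(p.posX);
        soa.posY.push_back(p.posY);
        soa.posZ.push_back(p.posZ);
    }
    return soa;
}

// Adds dy to every posY, four contiguous values per iteration, then finishes
// any spare elements one at a time.
inline void TranslateY(PositionsSOA& soa, float dy) {
    float* y = soa.posY.data();
    const std::size_t count = soa.posY.size();
    const std::size_t unrolledEnd = count & ~static_cast<std::size_t>(3);

    std::size_t i = 0;
    for (; i < unrolledEnd; i += 4) {
        y[i + 0] += dy;
        y[i + 1] += dy;
        y[i + 2] += dy;
        y[i + 3] += dy;
    }
    for (; i < count; ++i) {
        y[i] += dy;
    }
}
```

With the AOS layout the same operation would read every third float, which wastes two thirds of each cache line and leaves nothing contiguous to load into a vector register.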