Optimizing Lua for Consoles
Allan J. Murphy
Senior Software Design Engineer
Advanced Technology Group
Microsoft
Introduction
  • What do I know about Lua?
    • Part of Microsoft's ATG group
    • Performance reviews, developer visits
    • Working with actual title performance
    • Including Lua loading from light to very heavy
About Console Cores
Lua Usage
  • Lua is commonly used in console games
    • Low memory footprint
    • Lightweight processing
    • Many sets of bindings to C++
  • 360 and PS3 have 3.2 GHz CPUs
    • Lua should run just fine, right?
    • Sadly, like most other converted code, not so
Lua Performance
  • Is performance a problem?
  • Level of Lua usage in console games varies
    • Depends on genre of game, in part
    • Sparse use, e.g. complex AI behaviors only: could be a couple of milliseconds
    • Highly integrated, all the way down into the engine and renderer: could be the major bound on CPU frame rate
  • Lua is not always easily parallelizable
    • Or at least, parallel implementations are uncommon
  • So yes, Lua performance really is important
Performance of Ported Code
  • Code ported to the PS3 or 360 CPU may surprise
    • And not in a good way
    • A naïve 360 port can be 10x slower than Windows
    • Lua tasks can be in that range
  • But the processor is 3.2 GHz, so why the slowdown?
    • CPU cores are cut down to reduce cost
    • Memory system is lower spec
    • Cheap slow memory, smaller caches, no L3
Console Performance Penalties
In-Order Penalties
  • Where is code penalized?
  • Memory access
    • L2 cache miss
  • CPU core missing out-of-order execution hardware
    • Load-Hit-Store
    • Branch mispredict
    • Expensive instructions
L2 Cache Miss
  • Memory is slow
    • An L2 miss is 610 cycles
    • An L2 hit is 40 cycles
    • An L1 hit is 5 cycles
    • A factor of about 15 between an L2 hit and an L2 miss
  • Cache line is 128 bytes
    • Double the typical x86 line size, so each miss loads twice as much
    • Easy to waste memory throughput
  • Poor memory use is heavily penalized
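The latencies above lend themselves to a back-of-envelope comparison. The sketch below is a deliberately crude model, not a measurement: it assumes cold caches and no prefetching, and the helper names `scattered_cost` and `packed_cost` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Latencies and line size as quoted on the slide. */
enum {
    L1_HIT_CYCLES    = 5,
    L2_MISS_CYCLES   = 610,
    CACHE_LINE_BYTES = 128
};

/* Hypothetical model: every item sits on its own cold cache line,
   so every read is an L2 miss. */
uint64_t scattered_cost(uint64_t count)
{
    return count * L2_MISS_CYCLES;
}

/* Hypothetical model: items are packed back to back. The first touch of
   each line misses L2; every other read on that line is an L1 hit. */
uint64_t packed_cost(uint64_t count, uint64_t item_bytes)
{
    assert(item_bytes > 0 && item_bytes <= CACHE_LINE_BYTES);
    uint64_t per_line = CACHE_LINE_BYTES / item_bytes;
    uint64_t lines    = (count + per_line - 1) / per_line;
    return lines * L2_MISS_CYCLES + (count - lines) * L1_HIT_CYCLES;
}
```

For 1024 four-byte items this model gives 624,640 cycles scattered versus 24,480 cycles packed, which is why the later slides concentrate on data layout.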
Load-Hit-Store (LHS)
  • LHS occurs when the CPU stores to a memory address and then loads from it very shortly after
    • In-order hardware is unable to alter instruction flow to avoid it
    • No store-forwarding hardware in the CPU
    • No instructions for moving data between register sets
  • LHS is most often caused in code by:
    • Changing register set, e.g. casts, combining math types
    • Parameters passed by reference
    • Pointer aliasing
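The "changing register set" case can be illustrated with a loop counter. This is a minimal sketch with hypothetical function names; whether a given compiler actually emits the store and reload is an assumption that needs checking in the generated code.

```c
#include <assert.h>
#include <stddef.h>

/* LHS-prone on the cores described above: i lives in an integer register,
   so (float)i has to be stored to memory and reloaded into a float
   register on every iteration. */
void ramp_cast(float *out, size_t n, float step)
{
    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        out[i] = (float)i * step;
}

/* Same result for small n, but the counter used in the math stays in the
   float register set, so no int-to-float transfer is needed. */
void ramp_float(float *out, size_t n, float step)
{
    assert(out != NULL || n == 0);
    float fi = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        out[i] = fi * step;
        fi += 1.0f;
    }
}
```

The two functions agree only while the counter is exactly representable as a float (below 2^24), which is one reason this kind of rewrite needs care.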
Branch Mispredict
  • Branches prevent the compiler from scheduling around penalties
    • Given the other penalties, this can be very important
  • Mispredicting a branch on console is costly
  • A mispredict causes the CPU to:
    • Discard instructions it has fetched, thinking it needed them
    • Pay a 23-24 cycle penalty while the correct instructions are fetched
  • Branch prediction normally does a good job
    • But in some cases this penalty can be high
How Does This Affect the Lua VM?
How Does This Affect the Lua VM?
  • Console CPU cores penalize Lua in several ways:
    • LHS on data handling
    • L2 miss on table access
    • L2 miss on garbage collection and free list maintenance
    • Branch mispredict on VM main loop
  • Interesting aside
    • Work to avoid in-order core issues and L2 miss improves performance on out-of-order cores anyway
Data Handling, LHS & Memory Access
Data Handling, LHS & Memory Access
  • Lua keeps all basic types internally as a union
    • 4 byte value represents bool, pointer, numeric data…
    • Type field
    • Results in a 64 bit structure
  • Issues
    • Enum has only 9 values, but is stored in 32 bits
    • No way to pass this structure in registers
    • Pass value as int, LHS when you need float, and vice versa
    • Storing on stack incurs extra instructions and memory access
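The layout described above can be sketched as follows. This is an illustration, not Lua's actual header: `TValue32` and its helpers are hypothetical names, the number type is assumed to be a 32-bit float, and a `uint32_t` stands in for a 32-bit console pointer so the size also holds on a 64-bit host.

```c
#include <assert.h>
#include <stdint.h>

/* The nine basic type tags (values 0..8). */
typedef enum {
    T_NIL, T_BOOLEAN, T_LIGHTUSERDATA, T_NUMBER, T_STRING,
    T_TABLE, T_FUNCTION, T_USERDATA, T_THREAD
} TypeTag;

/* 4-byte payload: bool, pointer or number share the same storage. */
typedef union {
    uint32_t gc;  /* assumption: stand-in for a 32-bit pointer */
    int32_t  b;   /* boolean */
    float    n;   /* assumption: number configured as float */
} Value32;

/* Payload plus a 32-bit tag: 64 bits in total. */
typedef struct {
    Value32 value;
    int32_t tt;   /* only 9 of the 2^32 possible values are used */
} TValue32;

TValue32 make_number(float n)
{
    TValue32 v;
    v.value.n = n;
    v.tt = T_NUMBER;
    return v;
}

/* Reading through a pointer: the value has to come from memory, which is
   where the LHS risk arises if it was written just before. */
float read_number(const TValue32 *v)
{
    assert(v != 0 && v->tt == T_NUMBER);
    return v->value.n;
}
```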
Data Handling, LHS & Memory Access
  • Not a very easy problem to solve elegantly
  • Poor solution: just bear the cost
    • Doesn't seem good enough on a performance-starved CPU
  • Unpalatable solution: don't use a union
    • Pass int and float parts through registers at all times
    • Solves memory and LHS issues
    • Not very pretty though
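One way to read the "unpalatable solution" is that each consumer receives the tag, the integer part and the float part as separate scalar arguments, so each can travel in a register of the matching class. The sketch below is an assumption about what that might look like; the function names and tag constants are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag constants for this sketch only. */
enum { TAG_NIL = 0, TAG_BOOLEAN = 1, TAG_NUMBER = 3 };

/* Needs only the integer side: tag and integer payload arrive as ints. */
int is_truthy(int32_t tag, int32_t ipart)
{
    if (tag == TAG_NIL)
        return 0;
    if (tag == TAG_BOOLEAN)
        return ipart != 0;
    return 1;  /* every other type counts as true */
}

/* Needs the float side: the numeric payloads arrive as floats, so no
   int-to-float move through memory is required inside the function. */
float add_numbers(int32_t tag_a, float fa, int32_t tag_b, float fb)
{
    assert(tag_a == TAG_NUMBER && tag_b == TAG_NUMBER);
    return fa + fb;
}
```

The cost is visible in the signatures: every value now occupies two or three parameters, which is the "not very pretty" part.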
getTable() & L2 Miss
getTable() & L2 Miss
  • Much of Lua's data is stored in tables
    • Even simple field access goes through the table system
  • Some sequentially indexed data goes through separate small array storage
  • Commonly, value lookup is done via the hash table
getTable() & L2 Miss
  [Diagram: a LuaTable struct branches to either the Array Part (TValue entries) or the Hash Table (Key & TValue nodes linked by nextPtr); four points along the lookup path are marked "L2 Miss".]
getTable() & L2 Miss
  • Likely several L2 misses just to get to the value
  • Several possible improvements
  • Abandon the small sequential array
    • Saves space, which improves caching
    • We don't have the large caches and fast memory of a desktop
    • Drops the branching and logic for handling the small array
    • Main hash table works for the sequential case anyway
    • Focus effort on optimizing one mechanism, not two
getTable() & L2 Miss
  • Compact the hash table to improve L2 performance
    • Store table of 2 entries since typical list depth is 1.2
  • Make the hash table contiguous
    • Drop next pointers
    • Store types as 4 bits, packed separately from the values
    • Bulk together in groups of 28, i.e. one cache line in size
    • Drops data size by 62.5%; L2 misses should drop similarly
  • Make the hash collision mechanism just advance in the array
    • Collisions should be much less expensive
    • Means the hash function can be simpler, i.e. faster
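A sketch of the two ideas above, under stated assumptions. The group of 28 works out as 28 x 4 bytes of payload plus 28 x 4 bits of tags, i.e. 112 + 14 = 126 bytes, which fits a 128-byte line. One possible reading of the 62.5% figure is a 12-byte chained entry (8-byte value plus 4-byte next pointer) shrinking to 4.5 bytes; that reading is a guess. All type and function names here are hypothetical, and the probe table uses plain integer keys for brevity.

```c
#include <assert.h>
#include <stdint.h>

enum { GROUP_SLOTS = 28, TABLE_SLOTS = 64 /* power of two */ };

/* One cache line: 28 payloads (112 B) + 28 packed 4-bit tags (14 B) + pad. */
typedef struct {
    uint32_t payload[GROUP_SLOTS];
    uint8_t  tags[GROUP_SLOTS / 2];  /* two tags per byte */
    uint8_t  pad[2];
} SlotGroup;

unsigned group_get_tag(const SlotGroup *g, unsigned slot)
{
    assert(slot < GROUP_SLOTS);
    uint8_t byte = g->tags[slot / 2];
    return (slot & 1u) ? (unsigned)(byte >> 4) : (unsigned)(byte & 0x0Fu);
}

void group_set_tag(SlotGroup *g, unsigned slot, unsigned tag)
{
    assert(slot < GROUP_SLOTS && tag < 16u);
    uint8_t *byte = &g->tags[slot / 2];
    if (slot & 1u)
        *byte = (uint8_t)((*byte & 0x0Fu) | (tag << 4));
    else
        *byte = (uint8_t)((*byte & 0xF0u) | tag);
}

/* Contiguous table without next pointers: a collision just advances to
   the following slot (linear probing). */
typedef struct {
    uint32_t keys[TABLE_SLOTS];
    uint32_t vals[TABLE_SLOTS];
    uint8_t  used[TABLE_SLOTS];
} ProbeTable;

int probe_put(ProbeTable *t, uint32_t key, uint32_t val)
{
    for (unsigned i = 0; i < TABLE_SLOTS; ++i) {
        unsigned s = (key + i) & (TABLE_SLOTS - 1u);
        if (!t->used[s] || t->keys[s] == key) {
            t->used[s] = 1;
            t->keys[s] = key;
            t->vals[s] = val;
            return 1;
        }
    }
    return 0;  /* table full */
}

int probe_get(const ProbeTable *t, uint32_t key, uint32_t *out)
{
    for (unsigned i = 0; i < TABLE_SLOTS; ++i) {
        unsigned s = (key + i) & (TABLE_SLOTS - 1u);
        if (!t->used[s])
            return 0;  /* empty slot ends the probe sequence */
        if (t->keys[s] == key) {
            *out = t->vals[s];
            return 1;
        }
    }
    return 0;
}
```

Deletion is left out: with linear probing it needs tombstones or re-insertion, which a real table would have to handle.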
Garbage Collection & L2 Miss
Garbage Collection & L2 Miss
  • Default garbage collector
    • Works via a mark and sweep system
  • On console, this is very expensive
    • Each free block record examined incurs an L2 miss, i.e. 610 cycles
    • Typically only a flag per block record is examined
    • But the L2 miss loads a whole 128 byte cache line
    • Throughput is wasted; the loaded data is unused
    • L2 miss massively dominates total time
Garbage Collection & L2 Miss
  • Consider supporting the collector with a custom block allocator
    • Histogram allocation requests
    • Tune block allocator sizes to spikes in the histogram
  • Block allocator…
    • Keeps a bitmask of allocated chunks
    • Chunks are fixed size
    • A good allocator size is a multiple of 1024 records: at one bit per record, the bitmask is 128 bytes, i.e. one L2 cache line
    • Reduces memory fragmentation
    • When full, falls back to the normal allocator
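A minimal sketch of such a bitmask block allocator, assuming one pool of 1024 fixed-size chunks so that the whole occupancy mask is 1024 bits, i.e. one 128-byte cache line. The names `Pool`, `pool_alloc` and `pool_free`, and the 32-byte chunk size, are hypothetical; a real pool would pick the chunk size from the allocation histogram.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { POOL_CHUNKS = 1024, CHUNK_BYTES = 32, MASK_WORDS = POOL_CHUNKS / 32 };

typedef struct {
    uint32_t      used[MASK_WORDS];  /* 1024 bits = 128 bytes of mask */
    unsigned char storage[POOL_CHUNKS * CHUNK_BYTES];
} Pool;

/* Returns a chunk, or NULL if the request is too big or the pool is full;
   on NULL the caller falls back to the normal allocator. */
void *pool_alloc(Pool *p, size_t bytes)
{
    assert(p != NULL);
    if (bytes > CHUNK_BYTES)
        return NULL;
    for (unsigned w = 0; w < MASK_WORDS; ++w) {
        if (p->used[w] == 0xFFFFFFFFu)
            continue;  /* all 32 chunks in this word are taken */
        for (unsigned b = 0; b < 32; ++b) {
            if (!(p->used[w] & (1u << b))) {
                p->used[w] |= 1u << b;
                return p->storage + ((size_t)w * 32 + b) * CHUNK_BYTES;
            }
        }
    }
    return NULL;
}

/* Returns 1 if ptr belonged to this pool and was released, 0 otherwise
   (in which case the caller hands it to the normal allocator's free). */
int pool_free(Pool *p, void *ptr)
{
    assert(p != NULL);
    uintptr_t base = (uintptr_t)p->storage;
    uintptr_t addr = (uintptr_t)ptr;
    if (addr < base || addr >= base + sizeof p->storage)
        return 0;
    size_t idx = (size_t)(addr - base) / CHUNK_BYTES;
    p->used[idx / 32] &= ~(1u << (idx % 32));
    return 1;
}
```

Checking whether a chunk is live touches only the mask, so a scan over all 1024 records stays within a single cache line instead of missing once per record.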
Branch Mispredict & Lua VM
Branch Mispredict & Lua VM
  • Lua is typically interpreted on consoles
    • No JITting, since the security model forbids executing data
    • Precompiled code is possible, but has some disadvantages
  • VM main loop typically does:
    • Pick up opcode
    • Jump through huge switch to the code that executes the opcode
    • Pick up data required by opcode
    • Execute
    • Back to top
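The loop described above has the following general shape. This is a toy stack machine with hypothetical opcodes, written only to show the fetch, switch, execute, repeat structure; it is not Lua's instruction set, and Lua's own VM is register-based.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT } OpCode;

typedef struct {
    uint8_t op;   /* opcode, unknown until loaded from memory */
    int32_t arg;  /* immediate operand, used by OP_PUSH only */
} Instr;

int32_t vm_run(const Instr *code)
{
    int32_t stack[16];
    int sp = 0;

    for (const Instr *pc = code;; ++pc) {  /* back to top */
        switch (pc->op) {                  /* pick up opcode, indirect jump */
        case OP_PUSH:
            assert(sp < 16);
            stack[sp++] = pc->arg;         /* pick up data, execute */
            break;
        case OP_ADD:
            assert(sp >= 2);
            --sp;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            --sp;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            return sp > 0 ? stack[sp - 1] : 0;
        default:
            assert(0 && "unknown opcode");
            return 0;
        }
    }
}
```

Every iteration passes through the same `switch`, so a single indirect branch has to predict the target for every opcode in the program.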
Branch Mispredict & Lua VM
  • Problem…
  • The VM loop is mispredict-mungous
    • Switch statement is implemented using the bctr instruction
    • Loads an unknown & unpredictable value from memory (the opcode)
    • Then branches on it
  • Simple branch prediction hardware on the core:
    • Has 6 bit global history and a 2 bit prediction scheme
    • Doesn't have much of a chance in this case
  • Mispredict penalty grows linearly with opcode count
Branch Mispredict & Lua VM
  • There are many code perturbations that seem hopeful
    • Tree of ifs derived from popularity of opcodes
    • 'Direct threading'
    • Preloading the ctr register
  • Sadly, the best route is to branch less
    • Statistical analysis of opcode sequences
    • For example, 35% of opcode pairs are getTable-getTable
    • Idea: build super-opcode processing which drops branches
    • Remove other branches on opcode
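The super-opcode idea can be sketched on a toy stack machine: a frequent pair of opcodes is fused into one, so the pair costs one dispatch branch instead of two. Everything here is hypothetical (the opcodes, `fuse_push_add`, the dispatch counter); the toy VM has no jumps, whereas a real fusing pass must not merge across a jump target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { OP_PUSH, OP_ADD, OP_PUSH_ADD /* fused pair */, OP_HALT } OpCode;

typedef struct {
    uint8_t op;
    int32_t arg;
} Instr;

/* Rewrites each PUSH immediately followed by ADD into one PUSH_ADD.
   `out` needs room for n instructions; returns the new length. */
size_t fuse_push_add(const Instr *in, size_t n, Instr *out)
{
    size_t w = 0;
    for (size_t i = 0; i < n; ++i) {
        if (in[i].op == OP_PUSH && i + 1 < n && in[i + 1].op == OP_ADD) {
            out[w].op  = OP_PUSH_ADD;
            out[w].arg = in[i].arg;
            ++w;
            ++i;  /* skip the ADD that was folded in */
        } else {
            out[w++] = in[i];
        }
    }
    return w;
}

/* Runs the program and counts how many times the switch was dispatched. */
int32_t vm_run(const Instr *code, unsigned *dispatches)
{
    int32_t stack[16];
    int sp = 0;

    *dispatches = 0;
    for (const Instr *pc = code;; ++pc) {
        ++*dispatches;
        switch (pc->op) {
        case OP_PUSH:
            assert(sp < 16);
            stack[sp++] = pc->arg;
            break;
        case OP_ADD:
            assert(sp >= 2);
            --sp;
            stack[sp - 1] += stack[sp];
            break;
        case OP_PUSH_ADD:  /* same effect as PUSH then ADD, one branch */
            assert(sp >= 1);
            stack[sp - 1] += pc->arg;
            break;
        case OP_HALT:
            return sp > 0 ? stack[sp - 1] : 0;
        default:
            assert(0 && "unknown opcode");
            return 0;
        }
    }
}
```

On a six-instruction program with two PUSH/ADD pairs, fusing leaves four instructions and therefore four dispatches, with the same result.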
Summary
Summary
  • Console cores and memory punish Lua performance
    • Four areas mentioned above
    • But other smaller areas too
  • LHS, branch mispredict and L2 miss are your enemy
    • In particular, L2 miss is never to be underestimated
  • Improving performance requires care and thought
    • But there are gains to be found