Optimizing Lua For Consoles - Allan Murphy (Microsoft)



  1. 1.
  2. 2. Optimizing Lua for Consoles<br />Allan J. Murphy<br />Senior Software Design Engineer<br />Advanced Technology Group<br />Microsoft<br />
  3. 3. Introduction<br />What do I know about Lua?<br />Part of Microsoft’s ATG group<br />Performance reviews, developer visits<br />Working with actual title performance<br />Including Lua loading from light to very heavy<br />
  4. 4. About Console Cores<br />
  5. 5. Lua Usage<br />Lua is commonly used in console games<br />Low memory footprint<br />Lightweight processing<br />Many sets of bindings to C++<br />360 and PS3 have 3.2 GHz CPUs<br />Lua should run just fine, right?<br />Sadly, like most other converted code, not so<br />
  6. 6. Lua Performance<br />Is performance a problem?<br />Level of Lua usage in console games varies<br />Depends on genre of game, in part<br />Sparse use – e.g. complex AI behaviors only<br />Could be a couple of milliseconds<br />Highly integrated – all the way down into the engine and renderer<br />Could be the major bound on frame rate on CPU<br />Lua is not always easily parallelizable<br />Or at least, parallel implementations are uncommon<br />So yes, Lua performance really is important<br />
  7. 7. Performance of Ported Code<br />Code ported to PS3 or 360 CPU may surprise<br />And not in a good way<br />A naïve 360 port can be 10x slower than Windows<br />Lua tasks can be in that range<br />But the processor is 3.2 GHz, why the slowdown?<br />CPU cores are cut down to reduce cost<br />Memory system is lower spec<br />Cheap slow memory, smaller caches, no L3<br />
  8. 8. Console Performance Penalties<br />
  9. 9. In-Order Penalties<br />Where is code penalized?<br />Memory access<br />L2 cache miss<br />CPU core missing out-of-order execution hardware<br />Load Hit Store<br />Branch mispredict<br />Expensive instructions<br />
  10. 10. L2 Cache Miss<br />Memory is slow<br />An L2 miss is 610 cycles<br />An L2 hit is 40 cycles<br />An L1 hit is 5 cycles<br />Factor of 15 difference between L2 hit and miss<br />Cache line is 128 bytes<br />Typically loading double the line size of x86<br />Easy to waste memory throughput<br />Poor memory use heavily penalized<br />
  11. 11. Load-Hit-Store (LHS)<br />LHS occurs when the CPU stores to a memory address…<br /> … then loads from it very shortly after<br />In-order hardware unable to alter instruction flow to avoid<br />No store-forwarding hardware in CPU<br />No instructions for moving data between register sets<br />LHS most often caused in code by:<br />Changing register set, e.g. casts, combining math types<br />Parameters passed by reference<br />Pointer aliasing<br />
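The "parameters passed by reference" and "pointer aliasing" cases can be illustrated with a minimal C sketch. The function names are hypothetical, and whether a given compiler really emits a store followed by a load in the first version depends on its alias analysis, so treat this as an illustration of the pattern rather than a measured result.

```c
#include <assert.h>
#include <stddef.h>

/* LHS-prone shape: the running total lives behind a pointer. Because
   `values` might alias `total`, the compiler may have to store *total on
   every iteration and load it straight back on the next one. */
void sum_via_pointer(const float *values, size_t count, float *total)
{
    assert(values != NULL && total != NULL);
    *total = 0.0f;
    for (size_t i = 0; i < count; ++i)
        *total += values[i];
}

/* Friendlier shape: accumulate in a local, which can stay in a float
   register for the whole loop, and hand the result back by value. */
float sum_in_register(const float *values, size_t count)
{
    assert(values != NULL);
    float total = 0.0f;
    for (size_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}
```

Both functions compute the same sum; the second simply avoids round-tripping the accumulator through memory.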
  12. 12. Branch Mispredict<br />Branches prevent compiler scheduling around penalties<br />Given other penalties, this can be very important<br />Mispredicting a branch on console is costly<br />Mispredict causes CPU to:<br />Discard instructions it has fetched, thinking it needed them<br />23-24 cycle penalty as correct instructions fetched<br />Branch prediction normally does a good job<br />But in some cases this penalty can be high<br />
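One common way to sidestep an unpredictable branch is to compute both outcomes and select between them with a mask. The sketch below is a generic illustration, not something taken from the slides; the function names are hypothetical, and a good compiler may already produce branch-free code for the first version.

```c
#include <assert.h>
#include <stdint.h>

/* Branchy form: may compile to a conditional branch that the core has
   to predict. */
int32_t max_branchy(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branch-free form: mask is all ones when a > b and zero otherwise, so
   the result is selected with bitwise operations only. */
int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);
    return (a & mask) | (b & ~mask);
}

/* Quick self-check that the two forms agree on a handful of values. */
void check_max_equivalence(void)
{
    static const int32_t samples[] = { -7, -1, 0, 1, 42, INT32_MIN, INT32_MAX };
    const int count = (int)(sizeof samples / sizeof samples[0]);
    for (int i = 0; i < count; ++i)
        for (int j = 0; j < count; ++j)
            assert(max_branchy(samples[i], samples[j]) ==
                   max_branchless(samples[i], samples[j]));
}
```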
  13. 13. How Does This Affect the Lua VM? <br />
  14. 14. How Does This Affect the Lua VM? <br />Console CPU cores penalize Lua in several ways:<br />LHS on data handling<br />L2 miss on table access<br />L2 miss on garbage collection and free list maintenance<br />Branch mispredict on VM main loop<br />Interesting aside<br />Work to avoid in-order core issues and L2 miss…<br /> … improves performance on out-of-order cores anyway<br />
  15. 15. Data Handling, LHS & Memory Access<br />
  16. 16. Data Handling, LHS & Memory Access<br />Lua keeps all basic types internally as a union<br />4-byte value represents bool, pointer, numeric data…<br />Type field<br />Results in 64-bit structure<br />Issues<br />Enum has only 9 values, but is stored in 32 bits<br />No way to pass this structure in registers<br />Pass value as int, LHS when you need float, and vice versa<br />Storing on stack incurs extra instructions and memory access<br />
  17. 17. Data Handling, LHS & Memory Access<br />Not a very easy problem to solve elegantly<br />Poor solution:<br />…Just bear the cost<br />Doesn’t seem good enough on performance starved CPU<br />Unpalatable solution:<br />…Don’t use union<br />Pass int and float parts through registers at all times<br />Solves memory and LHS issues<br />Not very pretty though<br />
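The two options on these slides can be sketched in C. The types below are a simplified stand-in for Lua's tagged value, not the real `TValue` definition: the names, the reduced tag set, and the use of a 32-bit integer in place of a pointer are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of type tags; the slides mention 9 values. */
enum { T_NIL, T_BOOLEAN, T_NUMBER, T_POINTER };

/* 4-byte payload that is reinterpreted according to the tag. */
typedef union {
    int32_t  b;   /* boolean */
    float    n;   /* number  */
    uint32_t p;   /* stand-in for a 32-bit pointer */
} SketchPayload;

/* 4-byte payload + 4-byte tag = the 64-bit structure from the slides. */
typedef struct {
    SketchPayload value;
    int32_t       type;   /* a full 32 bits for a handful of tags */
} SketchValue;

/* "Just bear the cost": operands arrive through memory as structs. */
float add_boxed(const SketchValue *a, const SketchValue *b)
{
    assert(a->type == T_NUMBER && b->type == T_NUMBER);
    return a->value.n + b->value.n;
}

/* "Don't use union": the integer part (tag) and the float part travel
   as separate arguments, so each can stay in its own register set. */
float add_split(int32_t a_type, float a_num, int32_t b_type, float b_num)
{
    assert(a_type == T_NUMBER && b_type == T_NUMBER);
    return a_num + b_num;
}
```

The split form doubles the argument count of every function that handles a value, which is presumably what the slide means by "not very pretty".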
  18. 18. getTable() & L2 Miss<br />
  19. 19. getTable() & L2 Miss<br />Much of Lua’s data stored in tables<br />Even simple field access goes through table system<br />For some sequentially indexed data…<br /> … goes through separate small array storage<br />Commonly…<br /> …value lookup done via hash table<br />
  20. 20. getTable() & L2 Miss<br />[Diagram: the Lua Table struct branches to an Array Part (a run of TValue entries) and a Hash Table (nodes holding Key & TValue plus a nextPtr to the next node); the hops from the struct into each part and along the nextPtr chain are labelled L2 Miss]<br />
  21. 21. getTable() & L2 Miss<br />Likely several L2 misses just to get to value<br />Several possible improvements<br />Abandon small sequential array<br />Save space, which improves caching<br />We don’t have the large caches and fast memory of a desktop<br />Drop branching and logic for handling small array<br />Main hash table works for sequential case anyway<br />Focus effort on optimizing one mechanism, not two<br />
  22. 22. getTable() & L2 Miss<br />Compact hash table to improve L2 performance<br />Store table of 2 entries since typical list depth is 1.2<br />Make hash table contiguous<br />Drop next pointers<br />Store types as 4 bits, packed separately from values<br />Bulk together in groups of 28, i.e. one cache line in size<br />Drops data size by 62.5%, L2 miss should drop similarly<br />Make hash collision mechanism just advance in array<br />Collision should be much less expensive<br />Means hash function can be simpler, i.e. faster<br />
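A minimal sketch of the compacted layout, assuming 32-bit integer keys, a fixed power-of-two slot count, and no deletion. All names and sizes here are hypothetical, and the grouping into 28-entry cache-line blocks from the slide is not reproduced. It shows the three ideas that are: contiguous storage with no next pointers, 4-bit type tags packed apart from the values, and collisions resolved by simply advancing in the array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT 32u   /* hypothetical size; must be a power of two */
#define TAG_EMPTY  0u    /* tag 0 marks an unused slot */

typedef struct {
    uint8_t  tags[SLOT_COUNT / 2];   /* two 4-bit type tags per byte */
    uint32_t keys[SLOT_COUNT];
    uint32_t values[SLOT_COUNT];     /* 4-byte payloads */
} CompactTable;

static unsigned tag_get(const CompactTable *t, uint32_t slot)
{
    uint8_t byte = t->tags[slot >> 1];
    return (slot & 1u) ? (unsigned)(byte >> 4) : (unsigned)(byte & 0x0Fu);
}

static void tag_set(CompactTable *t, uint32_t slot, unsigned tag)
{
    uint8_t *byte = &t->tags[slot >> 1];
    if (slot & 1u)
        *byte = (uint8_t)((*byte & 0x0Fu) | (tag << 4));
    else
        *byte = (uint8_t)((*byte & 0xF0u) | tag);
}

void table_init(CompactTable *t)
{
    memset(t, 0, sizeof *t);   /* every tag becomes TAG_EMPTY */
}

/* Inserts or overwrites; returns 1 on success, 0 if the table is full. */
int table_set(CompactTable *t, uint32_t key, unsigned tag, uint32_t value)
{
    assert(tag >= 1u && tag <= 15u);
    uint32_t slot = key & (SLOT_COUNT - 1u);       /* trivial hash */
    for (uint32_t probe = 0; probe < SLOT_COUNT; ++probe) {
        if (tag_get(t, slot) == TAG_EMPTY || t->keys[slot] == key) {
            tag_set(t, slot, tag);
            t->keys[slot] = key;
            t->values[slot] = value;
            return 1;
        }
        slot = (slot + 1u) & (SLOT_COUNT - 1u);    /* collision: advance */
    }
    return 0;
}

/* Returns 1 and fills *tag / *value if the key is present, else 0. */
int table_get(const CompactTable *t, uint32_t key,
              unsigned *tag, uint32_t *value)
{
    uint32_t slot = key & (SLOT_COUNT - 1u);
    for (uint32_t probe = 0; probe < SLOT_COUNT; ++probe) {
        unsigned existing = tag_get(t, slot);
        if (existing == TAG_EMPTY)
            return 0;                              /* end of probe run */
        if (t->keys[slot] == key) {
            *tag = existing;
            *value = t->values[slot];
            return 1;
        }
        slot = (slot + 1u) & (SLOT_COUNT - 1u);
    }
    return 0;
}
```

A colliding key lands in the neighbouring slot, which is likely to sit on the cache line that has already been fetched, rather than behind a pointer to a separately allocated node.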
  23. 23. Garbage Collection & L2 Miss<br />
  24. 24. Garbage Collection & L2 Miss<br />Default garbage collector<br />Works via mark and sweep system<br />On console, this is very expensive<br />Each free block record examined incurs an L2 miss, i.e. 610 cycles<br />Typically only a flag per block record examined<br />But L2 miss loads 128 byte cache line<br />Throughput is wasted, loaded data is unused<br />L2 miss massively dominates total time<br />
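A rough worked example of that cost, using the 610-cycle miss, 128-byte line and 3.2 GHz clock quoted in the slides. The record count of 10,000, the assumption that every record misses L2, and the 1-byte flag size are hypothetical values chosen only to make the arithmetic concrete.

```latex
\begin{align*}
\text{sweep cost} &\approx N_{\text{records}} \times 610\ \text{cycles}
  = 10\,000 \times 610 = 6.1 \times 10^{6}\ \text{cycles} \\
\text{time at } 3.2\ \text{GHz} &= \frac{6.1 \times 10^{6}\ \text{cycles}}{3.2 \times 10^{9}\ \text{cycles/s}}
  \approx 1.9\ \text{ms} \\
\text{useful fraction of each line} &= \frac{1\ \text{byte}}{128\ \text{bytes}} \approx 0.8\%
\end{align*}
```

Under these assumptions a single pass over the records would take roughly 1.9 ms while using less than 1% of the data it pulls in.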
  25. 25. Garbage Collection & L2 Miss<br />Consider supporting with custom block allocator<br />Histogram allocation requests<br />Tune block allocator sizes to spikes in histogram<br />Block allocator…<br />Keeps a bitmask of allocated chunks<br />Chunks are fixed size<br />Good allocator size is a multiple of 1024 records, so that a one-bit-per-chunk mask fills whole 128-byte L2 cache lines<br />Reduces memory fragmentation<br />When full, falls back to normal allocator<br />
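A minimal sketch of such a block allocator, assuming one pool of 1024 fixed-size chunks whose 1024-bit occupancy mask occupies 128 bytes. The 32-byte chunk size, the names, and the use of `malloc` as the "normal allocator" are hypothetical; a real pool would pick sizes from the allocation histogram and would also need to handle alignment, which this sketch ignores.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE  32u                  /* hypothetical histogram spike */
#define CHUNK_COUNT 1024u                /* 1024 mask bits = 128 bytes   */
#define MASK_WORDS  (CHUNK_COUNT / 32u)

typedef struct {
    uint32_t      used[MASK_WORDS];      /* one bit per chunk */
    unsigned char storage[CHUNK_COUNT * CHUNK_SIZE];
} BlockPool;

void pool_init(BlockPool *p)
{
    memset(p->used, 0, sizeof p->used);
}

/* True if ptr points into this pool's storage. */
int pool_owns(const BlockPool *p, const void *ptr)
{
    uintptr_t base = (uintptr_t)p->storage;
    uintptr_t addr = (uintptr_t)ptr;
    return addr >= base && addr < base + sizeof p->storage;
}

void *pool_alloc(BlockPool *p, size_t size)
{
    if (size <= CHUNK_SIZE) {
        for (uint32_t w = 0; w < MASK_WORDS; ++w) {
            if (p->used[w] == 0xFFFFFFFFu)
                continue;                          /* word is full */
            for (uint32_t b = 0; b < 32u; ++b) {
                if (!(p->used[w] & (1u << b))) {
                    p->used[w] |= 1u << b;
                    return p->storage + (size_t)(w * 32u + b) * CHUNK_SIZE;
                }
            }
        }
    }
    return malloc(size);   /* too big or pool full: normal allocator */
}

void pool_free(BlockPool *p, void *ptr)
{
    if (ptr == NULL)
        return;
    if (!pool_owns(p, ptr)) {
        free(ptr);                                 /* came from malloc */
        return;
    }
    size_t index = (size_t)((unsigned char *)ptr - p->storage) / CHUNK_SIZE;
    assert(p->used[index / 32u] & (1u << (index % 32u)));   /* double free */
    p->used[index / 32u] &= ~(1u << (index % 32u));
}
```

Finding or releasing a chunk only touches the mask, so bookkeeping for all 1024 chunks stays within 128 bytes instead of being spread across per-block headers.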
  26. 26. Branch Mispredict & Lua VM<br />
  27. 27. Branch Mispredict & Lua VM<br />Lua is typically interpreted on consoles<br />No JITting since security model forbids executing on data<br />Precompiled code possible, but some disadvantages<br />VM main loop typically does:<br />Pick up opcode<br />Jump through huge switch to code to execute opcode<br />Pick up data required by opcode<br />Execute<br />Back to top<br />
  28. 28. Branch Mispredict & Lua VM<br />Problem…<br />The VM loop is mispredict-mungous<br />Switch statement is implemented using bctr instruction<br />Loads unknown & unpredictable value from memory (opcode)<br />Then branches on it<br />Simple branch prediction hardware on core:<br />Has 6-bit global history and 2-bit prediction scheme<br />Doesn’t have much of a chance in this case<br />Mispredict penalty grows linearly with opcode count<br />
  29. 29. Branch Mispredict & Lua VM<br />There are many code perturbations that seem hopeful<br />Tree of ifs derived from popularity of opcodes<br />‘direct threading’<br />Preloading ctr register<br />Sadly, the best route is to branch less<br />Statistical analysis of opcode sequences<br />For example, 35% of opcode pairs are getTable-getTable<br />Idea: build super-opcode processing which drops branches<br />Remove other branches on opcode<br />
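The super-opcode idea can be shown on a toy interpreter. Everything below is hypothetical: the three-opcode accumulator machine, the `OP_GET_GET` fused opcode, and the `fuse_pairs` pass are illustrative and bear no relation to the real Lua bytecode format. The toy also has no jumps; a fusing pass over real bytecode would have to leave jump targets intact.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode set for an accumulator machine. */
enum { OP_GET, OP_GET_GET, OP_HALT };

/* Runs `code` until OP_HALT. OP_GET replaces the accumulator with
   table[acc]; OP_GET_GET does two chained lookups in one dispatch.
   *dispatches counts trips through the switch, i.e. the number of
   hard-to-predict indirect branches taken. */
int run(const int *code, const int *table, size_t table_len,
        int acc, int *dispatches)
{
    *dispatches = 0;
    for (;;) {
        int op = *code++;                 /* pick up opcode */
        ++*dispatches;
        switch (op) {                     /* the dispatch branch */
        case OP_GET:
            assert(acc >= 0 && (size_t)acc < table_len);
            acc = table[acc];
            break;
        case OP_GET_GET:                  /* super-opcode: no branch between */
            assert(acc >= 0 && (size_t)acc < table_len);
            acc = table[acc];
            assert(acc >= 0 && (size_t)acc < table_len);
            acc = table[acc];
            break;
        case OP_HALT:
            return acc;
        default:
            assert(!"unknown opcode");
            return -1;
        }
    }
}

/* Rewrites each adjacent OP_GET, OP_GET pair into one OP_GET_GET.
   `in` must end with OP_HALT; `out` needs room for as many entries as
   `in`. Returns the number of opcodes written. */
size_t fuse_pairs(const int *in, int *out)
{
    size_t n = 0;
    for (size_t i = 0; ; ++i) {
        if (in[i] == OP_GET && in[i + 1] == OP_GET) {
            out[n++] = OP_GET_GET;
            ++i;                          /* skip the second OP_GET */
        } else {
            out[n++] = in[i];
            if (in[i] == OP_HALT)
                return n;
        }
    }
}
```

For a program of three `OP_GET`s followed by `OP_HALT`, the fused version computes the same result with three dispatches instead of four; with pairs as frequent as the slide reports, the saving in dispatch branches would be correspondingly larger.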
  30. 30. Summary<br />
  31. 31. Summary<br />Console cores and memory punish Lua performance<br />Four areas mentioned above<br />But other smaller areas too<br />LHS, branch mispredict and L2 miss are your enemy<br />In particular, L2 miss is never to be underestimated<br />Improving performance requires care and thought<br />But there are gains to be found<br />