Optimizing Lua for Consoles
Allan J. Murphy
Senior Software Design Engineer
Advanced Technology Group, Microsoft

Introduction
- What do I know about Lua?
  - Part of Microsoft's ATG
  - Performance reviews, developer visits
  - Working with actual title performance
  - Including titles whose Lua load ranges from light to very heavy

About Console Cores

Lua Usage
- Lua is commonly used in console games
  - Low memory footprint
  - Lightweight processing
  - Many sets of bindings to C++
- 360 and PS3 have 3.2 GHz CPUs
  - Lua should run just fine, right?
  - Sadly, like most other ported code, it does not

Lua Performance
- Is performance a problem?
- Level of Lua usage in console games varies
  - Depends on genre of game, in part
  - Sparse use, e.g. complex AI behaviors only
    - Could be a couple of milliseconds
  - Highly integrated, all the way down into the engine and renderer
    - Could be the major bound on frame rate on CPU
- Lua is not always easily parallelizable
  - Or at least, parallel implementations are uncommon
- So yes, Lua performance really is important

Performance of Ported Code
- Code ported to PS3 or 360 CPU may surprise
  - And not in a good way
- A naïve 360 port can be 10x slower than Windows
  - Lua tasks can be in that range
- But the processor is 3.2 GHz, so why the slowdown?
  - CPU cores are cut down to reduce cost
  - Memory system is lower spec
    - Cheap slow memory, smaller caches, no L3

Console Performance Penalties

In-Order Penalties
- Where is code penalized?
- Memory access
  - L2 cache miss
- CPU core missing out-of-order execution hardware
  - Load-Hit-Store
  - Branch mispredict
  - Expensive instructions

L2 Cache Miss
- Memory is slow
  - An L2 miss is 610 cycles
  - An L2 hit is 40 cycles
  - An L1 hit is 5 cycles
  - Factor of 15 difference between L2 hit and miss
- Cache line is 128 bytes
  - Typically double the line size of x86
  - Easy to waste memory throughput
- Poor memory use is heavily penalized
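
To put the cycle counts above into frame-time terms, here is a small arithmetic sketch. It assumes the 3.2 GHz clock quoted earlier and assumes misses are fully serialized (no overlap); the function names are made up for illustration.

```c
#include <stdint.h>

/* Figures quoted on the slide. */
enum {
    L2_MISS_CYCLES   = 610,
    L2_HIT_CYCLES    = 40,
    L1_HIT_CYCLES    = 5,
    CACHE_LINE_BYTES = 128
};

/* Assumption: 3.2 GHz core clock, as quoted for 360 and PS3. */
static const double CLOCK_HZ = 3.2e9;

/* Nanoseconds spent waiting on one L2 miss. */
double l2_miss_ns(void) {
    return L2_MISS_CYCLES / CLOCK_HZ * 1e9;
}

/* How many back-to-back L2 misses fit in a budget given in milliseconds.
   Assumes the misses do not overlap with each other or with useful work. */
uint64_t misses_per_budget(double budget_ms) {
    double cycles = budget_ms * 1e-3 * CLOCK_HZ;
    return (uint64_t)(cycles / L2_MISS_CYCLES);
}

/* Fraction of a fetched cache line that is actually used when the code
   reads only `used_bytes` of it. */
double line_utilization(unsigned used_bytes) {
    return (double)used_bytes / CACHE_LINE_BYTES;
}
```

Under these assumptions one miss costs about 190 ns, so a little over 5,000 serialized misses consume a full millisecond, and reading a single 4-byte field per line uses about 3% of the memory traffic paid for.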

Load-Hit-Store (LHS)
- LHS occurs when the CPU stores to a memory address…
  - … then loads from it very shortly after
- In-order hardware is unable to alter instruction flow to avoid it
  - No store-forwarding hardware in CPU
  - No instructions for moving data between register sets
- LHS is most often caused in code by:
  - Changing register set, e.g. casts, combining math types
  - Parameters passed by reference
  - Pointer aliasing
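
The three causes listed above can be illustrated with a short C sketch. The function names are hypothetical, and whether a given compiler actually emits a store followed by a reload depends on the compiler and flags; the comments describe the typical case rather than a guarantee.

```c
#include <stddef.h>

/* LHS-prone: `total` is passed by reference and may alias `values`, so the
   compiler generally has to store *total to memory and load it back on
   every iteration. */
void sum_aliased(const float *values, size_t count, float *total) {
    *total = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        *total += values[i];
    }
}

/* Friendlier: accumulate in a local that can live in a register and
   return the result by value, so there is no store to reload. */
float sum_local(const float *values, size_t count) {
    float acc = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        acc += values[i];
    }
    return acc;
}

/* Register-set change: the int-to-float cast moves a value from the integer
   registers to the floating-point registers. On the cores described above
   there is no direct move, so the value travels through memory. */
float scale_index(int index, float step) {
    return (float)index * step;
}
```

Both sum functions return the same result; the second simply gives the compiler no reason to round-trip the accumulator through memory.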

Branch Mispredict
- Branches prevent the compiler from scheduling around penalties
  - Given other penalties, this can be very important
- Mispredicting a branch on console is costly
- Mispredict causes CPU to:
  - Discard instructions it has fetched, thinking it needed them
  - Pay a 23-24 cycle penalty while the correct instructions are fetched
- Branch prediction normally does a good job
  - But in some cases this penalty can be high

How Does This Affect the Lua VM?
- Console CPU cores penalize Lua in several ways:
  - LHS on data handling
  - L2 miss on table access
  - L2 miss on garbage collection and free list maintenance
  - Branch mispredict on VM main loop
- Interesting aside
  - Work to avoid in-order core issues and L2 miss…
  - … improves performance on out-of-order cores anyway

Data Handling, LHS & Memory Access
- Lua keeps all basic types internally as a union
  - 4-byte value represents bool, pointer, numeric data…
  - Type field
  - Results in a 64-bit structure
- Issues
  - Enum has only 9 values, but is stored in 32 bits
  - No way to pass this structure in registers
  - Pass value as int, LHS when you need float, and vice versa
  - Storing on stack incurs extra instructions and memory access
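
The layout described above can be sketched as follows. This is modelled loosely on Lua 5.1's value representation but is not copied from it: the type names are invented, numbers are assumed to be `float`, and pointers are represented by `uint32_t` stand-ins so the sizes match a 32-bit console target even when compiled on a 64-bit host.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical type tags; only a handful of the 9 are shown. */
enum { SK_TNIL = 0, SK_TBOOLEAN = 1, SK_TLIGHTUSERDATA = 2, SK_TNUMBER = 3 };

/* 4-byte payload shared by every basic type. */
typedef union {
    uint32_t gc;  /* stand-in for a 32-bit pointer to a collectable object */
    uint32_t p;   /* stand-in for a 32-bit light userdata pointer */
    float    n;   /* number, assuming a float build */
    int32_t  b;   /* boolean */
} ValueSketch;

/* Payload plus type tag: 64 bits in total. */
typedef struct {
    ValueSketch value;
    int32_t     tt;  /* 9 possible values, yet it occupies 32 bits */
} TValueSketch;

size_t tvalue_sketch_size(void) {
    return sizeof(TValueSketch);
}

/* Reading a number: test the integer tag, then use the same 4 bytes as a
   float. Data that was handled on the integer side is needed on the
   floating-point side, which is the register-set change that causes LHS. */
float number_or_zero(const TValueSketch *v) {
    return v->tt == SK_TNUMBER ? v->value.n : 0.0f;
}
```

Half of each 8-byte value is spent on a tag that needs only 4 bits, which is the waste the later table slides try to recover.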

Data Handling, LHS & Memory Access (continued)
- Not a very easy problem to solve elegantly
- Poor solution: just bear the cost
  - Doesn't seem good enough on a performance-starved CPU
- Unpalatable solution: don't use the union
  - Pass int and float parts through registers at all times
  - Solves memory and LHS issues
  - Not very pretty though
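
One possible reading of the "don't use the union" option is to pass a value as separate scalar arguments, so the tag and integer payload stay in integer registers and the numeric payload stays in a floating-point register. The sketch below is an interpretation, not the author's code, and every name in it is hypothetical.

```c
/* Hypothetical tags for the split representation. */
typedef enum { SPLIT_NIL, SPLIT_BOOL, SPLIT_POINTER, SPLIT_NUMBER } SplitTag;

/* Lua truthiness: only nil and false are false. This touches the tag and
   the integer part only, so nothing crosses into the float registers. */
int split_is_truthy(SplitTag tag, unsigned ipart) {
    if (tag == SPLIT_NIL) {
        return 0;
    }
    if (tag == SPLIT_BOOL) {
        return ipart != 0;
    }
    return 1;
}

/* Numeric read: the float part arrives in a floating-point register and
   is returned in one, with no reinterpretation of integer bits. */
float split_number_or(SplitTag tag, float fpart, float fallback) {
    return tag == SPLIT_NUMBER ? fpart : fallback;
}
```

The cost is visible in the signatures: every function that takes a value now takes three parameters, which is presumably why the slide calls this route unpalatable.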

getTable() & L2 Miss
- Much of Lua's data is stored in tables
  - Even simple field access goes through the table system
- For some sequentially indexed data…
  - … access goes through separate small array storage
- Commonly…
  - … value lookup is done via hash table

[Diagram: the Lua Table struct points to an array part (a run of TValues) and to a hash table whose nodes each hold a key and TValue plus a nextPtr. Four points on the lookup path are labelled "L2 Miss" and the choice between array part and hash table is labelled "Branch".]

getTable() & L2 Miss (continued)
- Likely several L2 misses just to get to a value
- Several possible improvements
- Abandon the small sequential array
  - Saves space, which improves caching
    - We don't have the large caches and fast memory of a desktop
  - Drops the branching and logic for handling the small array
  - Main hash table works for the sequential case anyway
  - Focus effort on optimizing one mechanism, not two

getTable() & L2 Miss (continued)
- Compact hash table to improve L2 performance
  - Store table of 2 entries since typical list depth is 1.2
- Make hash table contiguous
  - Drop next pointers
  - Store types as 4 bits, packed separately from values
  - Bulk together in groups of 28, i.e. one cache line in size
  - Drops data size by 62.5%; L2 misses should drop similarly
- Make hash collision mechanism just advance in array
  - Collision should be much less expensive
  - Means hash function can be simpler, i.e. faster
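
A minimal sketch of the contiguous layout described above: 28 four-byte payloads plus 28 four-bit tags come to 126 bytes, padded here to one 128-byte line, and a collision simply advances to the next slot. The names, the use of plain `uint32_t` keys, the trivial modulo hash, and the lack of deletion and resizing are all simplifying assumptions rather than the talk's actual design.

```c
#include <stdint.h>
#include <string.h>

enum { GROUP_SLOTS = 28, TAG_EMPTY = 0, TAG_NUMBER = 3 };

/* One cache line: 112 bytes of payloads, 14 bytes of packed tags, 2 pad. */
typedef struct {
    uint32_t payload[GROUP_SLOTS];
    uint8_t  tags[GROUP_SLOTS / 2];  /* two 4-bit tags per byte */
    uint8_t  pad[2];
} SlotGroup;

unsigned group_get_tag(const SlotGroup *g, unsigned slot) {
    uint8_t byte = g->tags[slot / 2];
    return (slot & 1) ? (unsigned)(byte >> 4) : (unsigned)(byte & 0x0F);
}

void group_set_tag(SlotGroup *g, unsigned slot, unsigned tag) {
    uint8_t *byte = &g->tags[slot / 2];
    if (slot & 1) {
        *byte = (uint8_t)((*byte & 0x0F) | ((tag & 0x0F) << 4));
    } else {
        *byte = (uint8_t)((*byte & 0xF0) | (tag & 0x0F));
    }
}

enum { GROUPS = 2, CAPACITY = GROUPS * GROUP_SLOTS };

/* Keys and values live in parallel contiguous groups; no next pointers. */
typedef struct {
    SlotGroup keys[GROUPS];
    SlotGroup vals[GROUPS];
} FlatTable;

void table_init(FlatTable *t) {
    memset(t, 0, sizeof *t);  /* every tag becomes TAG_EMPTY */
}

/* Insert or overwrite. Returns 1 on success, 0 if the table is full. */
int table_set(FlatTable *t, uint32_t key, uint32_t val) {
    unsigned start = key % CAPACITY;  /* deliberately simple hash */
    for (unsigned probe = 0; probe < CAPACITY; ++probe) {
        unsigned i = (start + probe) % CAPACITY;  /* collision: just advance */
        unsigned g = i / GROUP_SLOTS, s = i % GROUP_SLOTS;
        SlotGroup *kg = &t->keys[g];
        if (group_get_tag(kg, s) == TAG_EMPTY || kg->payload[s] == key) {
            kg->payload[s] = key;
            group_set_tag(kg, s, TAG_NUMBER);
            t->vals[g].payload[s] = val;
            group_set_tag(&t->vals[g], s, TAG_NUMBER);
            return 1;
        }
    }
    return 0;
}

/* Look up a key. Returns 1 and writes *out if found, otherwise 0. */
int table_get(const FlatTable *t, uint32_t key, uint32_t *out) {
    unsigned start = key % CAPACITY;
    for (unsigned probe = 0; probe < CAPACITY; ++probe) {
        unsigned i = (start + probe) % CAPACITY;
        unsigned g = i / GROUP_SLOTS, s = i % GROUP_SLOTS;
        const SlotGroup *kg = &t->keys[g];
        if (group_get_tag(kg, s) == TAG_EMPTY) {
            return 0;  /* reached a hole: the key is absent */
        }
        if (kg->payload[s] == key) {
            *out = t->vals[g].payload[s];
            return 1;
        }
    }
    return 0;
}
```

Because a colliding key lands in the neighbouring slot, it usually sits in the cache line that the first probe already fetched, instead of behind a pointer to a separately allocated node.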

Garbage Collection & L2 Miss
- Default garbage collector
  - Works via mark and sweep system
- On console, this is very expensive
  - Each free block record examined incurs an L2 miss, i.e. 610 cycles
  - Typically only a flag per block record is examined
  - But an L2 miss loads a 128-byte cache line
  - Throughput is wasted, loaded data is unused
- L2 miss massively dominates total time

Garbage Collection & L2 Miss (continued)
- Consider supporting with a custom block allocator
  - Histogram allocation requests
  - Tune block allocator sizes to spikes in histogram
- Block allocator…
  - Keeps a bitmask of allocated chunks
  - Chunks are fixed size
  - Good allocator size is a multiple of 1024 records: a 1024-bit mask is 128 bytes, the L2 cache line size
  - Reduces memory fragmentation
  - When full, falls back to normal allocator
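
The block allocator described above might look roughly like this. It is a sketch under assumptions: the 32-byte chunk size stands in for whatever the allocation histogram suggests, `malloc` stands in for the "normal allocator", and all names are hypothetical. Chunks are only 4-byte aligned in this form.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum {
    CHUNK_COUNT = 1024,  /* 1024 one-bit records = 128 bytes of mask */
    CHUNK_BYTES = 32,    /* assumed size, to be tuned from a histogram */
    WORD_BITS   = 32,
    MASK_WORDS  = CHUNK_COUNT / WORD_BITS
};

typedef struct {
    uint32_t      mask[MASK_WORDS];  /* bit set = chunk in use */
    unsigned char storage[CHUNK_COUNT * CHUNK_BYTES];
} BlockPool;

void pool_init(BlockPool *p) {
    memset(p->mask, 0, sizeof p->mask);
}

/* True if ptr points into this pool's chunk storage. */
int pool_owns(const BlockPool *p, const void *ptr) {
    uintptr_t lo = (uintptr_t)p->storage;
    uintptr_t at = (uintptr_t)ptr;
    return at >= lo && at < lo + sizeof p->storage;
}

void *pool_alloc(BlockPool *p, size_t size) {
    if (size <= CHUNK_BYTES) {
        /* Scanning the whole mask touches a single 128-byte cache line. */
        for (unsigned w = 0; w < MASK_WORDS; ++w) {
            if (p->mask[w] == 0xFFFFFFFFu) {
                continue;  /* all 32 chunks in this word are taken */
            }
            for (unsigned b = 0; b < WORD_BITS; ++b) {
                if (!(p->mask[w] & (1u << b))) {
                    p->mask[w] |= 1u << b;
                    return p->storage + (size_t)(w * WORD_BITS + b) * CHUNK_BYTES;
                }
            }
        }
    }
    return malloc(size);  /* pool full or request too large: fall back */
}

void pool_free(BlockPool *p, void *ptr) {
    if (ptr == NULL) {
        return;
    }
    if (pool_owns(p, ptr)) {
        size_t index = (size_t)((unsigned char *)ptr - p->storage) / CHUNK_BYTES;
        p->mask[index / WORD_BITS] &= ~(1u << (index % WORD_BITS));
    } else {
        free(ptr);
    }
}
```

Finding a free chunk reads bits that are packed together, so the search stays within one cache line rather than paying a miss per free-list record.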

Branch Mispredict & Lua VM
- Lua is typically interpreted on consoles
  - No JITting since the security model forbids executing data
  - Precompiled code is possible, but has some disadvantages
- VM main loop typically does:
  - Pick up opcode
  - Jump through huge switch to code to execute opcode
  - Pick up data required by opcode
  - Execute
  - Back to top
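
The loop steps above map onto a switch-based interpreter like the toy one below. The opcode set and instruction format are invented for illustration and are much smaller than Lua's.

```c
#include <stdint.h>

/* Hypothetical opcodes; not Lua's instruction set. */
typedef enum { OP_LOADK, OP_ADD, OP_MUL, OP_HALT } OpCode;

/* One instruction: opcode plus three small operands. */
typedef struct {
    uint8_t op, a, b, c;
} Instr;

/* Runs until OP_HALT (or an unknown opcode) and returns register 0. */
int32_t vm_run(const Instr *code, int32_t *regs) {
    const Instr *pc = code;
    for (;;) {
        Instr i = *pc++;   /* pick up opcode */
        switch (i.op) {    /* jump through the switch: one indirect branch
                              per instruction, target known only once the
                              opcode has been loaded */
        case OP_LOADK:     /* pick up operands and execute */
            regs[i.a] = i.b;
            break;
        case OP_ADD:
            regs[i.a] = regs[i.b] + regs[i.c];
            break;
        case OP_MUL:
            regs[i.a] = regs[i.b] * regs[i.c];
            break;
        case OP_HALT:
        default:
            return regs[0];
        }
        /* back to top */
    }
}
```

Every instruction, however cheap, pays for one trip through the switch, which is why the dispatch branch matters so much to overall interpreter speed.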

Branch Mispredict & Lua VM (continued)
- Problem…
  - The VM loop is mispredict-mungous
- Switch statement is implemented using the bctr instruction
  - Loads unknown & unpredictable value from memory (opcode)
  - Then branches on it
- Simple branch prediction hardware on core:
  - Has 6-bit global history and 2-bit prediction scheme
  - Doesn't have much of a chance in this case
- Mispredict penalty grows linearly with opcode count

Branch Mispredict & Lua VM (continued)
- There are many code perturbations that seem hopeful
  - Tree of ifs derived from popularity of opcodes
  - "Direct threading"
  - Preloading ctr register
- Sadly, the best route is to branch less
  - Statistical analysis of opcode sequences
    - For example, 35% of opcode pairs are getTable-getTable
  - Idea: build super-opcode processing which drops branches
  - Remove other branches on opcode
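
The "branch less" idea can be sketched in two steps: count adjacent opcode pairs to find candidates, then fuse a frequent pair into one super-opcode so two operations cost a single dispatch. The opcodes below are operand-free toys with invented names; a real fusion pass would also have to carry operands and avoid fusing across jump targets.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical opcodes; OP_GETGET is the fused super-opcode. */
enum { OP_GET, OP_ADD, OP_HALT, OP_GETGET, OP_COUNT };

/* Step 1: histogram of adjacent opcode pairs in a bytecode stream. */
void count_pairs(const uint8_t *ops, size_t n,
                 unsigned counts[OP_COUNT][OP_COUNT]) {
    for (size_t i = 0; i + 1 < n; ++i) {
        counts[ops[i]][ops[i + 1]]++;
    }
}

/* Step 2: rewrite each GET,GET pair as one GETGET. Returns the new length;
   `out` must have room for n opcodes. */
size_t fuse_get_pairs(const uint8_t *in, size_t n, uint8_t *out) {
    size_t w = 0;
    for (size_t i = 0; i < n;) {
        if (i + 1 < n && in[i] == OP_GET && in[i + 1] == OP_GET) {
            out[w++] = OP_GETGET;
            i += 2;
        } else {
            out[w++] = in[i++];
        }
    }
    return w;
}

/* Toy interpreter: GET adds the next data item to a running sum, ADD adds 1.
   Returns how many times the dispatch switch was taken. */
unsigned run_counting(const uint8_t *ops, size_t n, const int *data,
                      int *sum_out) {
    unsigned dispatches = 0;
    size_t d = 0;
    int sum = 0;
    for (size_t pc = 0; pc < n; ++pc) {
        ++dispatches;
        switch (ops[pc]) {
        case OP_GET:
            sum += data[d++];
            break;
        case OP_GETGET:  /* two gets behind one dispatch branch */
            sum += data[d];
            sum += data[d + 1];
            d += 2;
            break;
        case OP_ADD:
            sum += 1;
            break;
        case OP_HALT:
            *sum_out = sum;
            return dispatches;
        default:
            break;
        }
    }
    *sum_out = sum;
    return dispatches;
}
```

On a stream where get-get pairs are common, the fused program computes the same result with fewer trips through the switch, and each trip avoided is a potential mispredict avoided.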

Summary
- Console cores and memory punish Lua performance
  - Four areas mentioned above
  - But other smaller areas too
- LHS, branch mispredict and L2 miss are your enemy
  - In particular, L2 miss is never to be underestimated
- Improving performance requires care and thought
  - But there are gains to be found
  • 1.
  • 2. Optimizing Lua for Consoles Allan J. Murphy Senior Software Design Engineer Advanced Technology Group Microsoft
  • 3. Introduction What do I know about Lua? Part of Microsoft’s ATG group Performance reviews, developer visits Working with actual title performance Including Lua loading from light to very heavy
  • 5. Lua Usage Lua is commonly used in console games Low memory footprint Lightweight processing Many sets of bindings to C++ 360 and PS3 have 3.2Ghz CPUs Lua should run just fine, right? Sadly, like most other converted code, not so
  • 6. Lua Performance Is performance a problem? Level of Lua usage in console games varies Depends on genre of game, in part Sparse use – e.g. complex AI behaviors only Could be a couple of milliseconds Highly integrated – all the way down into the engine and renderer Could be the major bound on frame rate on CPU Lua is not always easily parallelizable Or at least, parallel implementations are uncommon So yes, Lua performance really is important
  • 7. Performance of Ported Code Code ported to PS3 or 360 CPU may surprise And not in a good way 360 naïve port can be 10x slower than Windows Lua tasks can be in that range But the processor is 3.2Ghz, why the slowdown? CPU cores are cut down to reduce cost Memory system lower spec Cheap slow memory, smaller caches, no L3
  • 9. In-Order Penalties Where is code penalized? Memory access L2 cache miss CPU core missing out-of-order execution hardware Load Hit Store Branch mispredict Expensive instructions
  • 10. L2 Cache Miss Memory is slow An L2 miss is 610 cycles An L2 hit is 40 cycles An L1 hit is 5 cycles Factor of 15 difference between L2 hit and miss Cache line is 128 bytes Typically loading double the line size of x86 Easy to waste memory throughput Poor memory use heavily penalized
  • 11. Load-Hit-Store (LHS) LHS occurs when the CPU stores to a memory address… … then loads from it very shortly after In-order hardware unable to alter instruction flow to avoid No store-forwarding hardware in CPU No instructions for moving data between register sets LHS most often caused in code by: Changing register set, eg casts, combining math types Parameters passed by reference Pointer aliasing
  • 12. Branch Mispredict: Branches prevent compiler scheduling around penalties; given the other penalties, this can be very important. Mispredicting a branch on console is costly. A mispredict causes the CPU to discard instructions it has fetched, thinking it needed them, with a 23-24 cycle penalty as the correct instructions are fetched. Branch prediction normally does a good job, but in some cases this penalty can be high.
  • 13. How Does This Affect the Lua VM?
  • 14. How Does This Affect the Lua VM? Console CPU cores penalize Lua in several ways: LHS on data handling; L2 miss on table access; L2 miss on garbage collection and free list maintenance; branch mispredict on the VM main loop. Interesting aside: work to avoid in-order core issues and L2 miss improves performance on out-of-order cores anyway.
  • 15. Data Handling, LHS & Memory Access
  • 16. Data Handling, LHS & Memory Access: Lua keeps all basic types internally as a union. A 4-byte value represents bool, pointer, numeric data, and so on; adding the type field results in a 64-bit structure. Issues: the enum has only 9 values, but is stored in 32 bits. There is no way to pass this structure in registers: pass the value as int and you get an LHS when you need float, and vice versa. Storing on the stack incurs extra instructions and memory access.
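A minimal sketch of the layout described on this slide, assuming a 32-bit float as the number type and a 32-bit handle standing in for a pointer. The type and tag names are hypothetical and are not taken from the Lua source.

```c
#include <stdint.h>

/* Hypothetical tag values; the slide says the real enum has 9 values. */
typedef enum { TAG_NIL, TAG_BOOL, TAG_NUMBER, TAG_OBJECT } ValueTag;

/* 4-byte payload: one slot reused for every basic type. */
typedef union {
    int32_t  b;       /* boolean */
    float    n;       /* numeric data, assuming a 32-bit float number type */
    uint32_t handle;  /* stand-in for a 32-bit pointer */
} Payload;

/* Payload plus a type field held in a full 32 bits: 64 bits in total. */
typedef struct {
    Payload  value;
    uint32_t tag;
} TaggedValue;

/* Reads the numeric payload. The struct arrives through memory, so the
   float is loaded from wherever the caller last stored the value. */
float tagged_as_number(const TaggedValue *v)
{
    return v->tag == TAG_NUMBER ? v->value.n : 0.0f;
}
```

If the caller has just written the payload as an integer and the callee immediately reads it as a float, that is the store-then-load pattern the slide warns about.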
  • 17. Data Handling, LHS & Memory Access: Not a very easy problem to solve elegantly. Poor solution: just bear the cost, which doesn’t seem good enough on a performance-starved CPU. Unpalatable solution: don’t use a union, and pass the int and float parts through registers at all times. This solves the memory and LHS issues, but is not very pretty.
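One way to picture the "unpalatable" option is a function that takes the tag, integer part and numeric part as three separate scalars. This is only a sketch under that assumption; the function name and tag constants are hypothetical.

```c
#include <stdint.h>

enum { SPLIT_TAG_BOOL = 1, SPLIT_TAG_NUMBER = 2 };  /* hypothetical tags */

/* The value is passed as three scalars instead of a union in memory.
   The tag and integer part can travel in integer registers and the
   numeric part in a floating-point register, so no store and reload is
   needed to switch register sets. */
float split_to_number(uint32_t tag, int32_t int_part, float num_part)
{
    if (tag == SPLIT_TAG_NUMBER)
        return num_part;
    if (tag == SPLIT_TAG_BOOL)
        return int_part ? 1.0f : 0.0f;  /* constants, no int-to-float convert */
    return 0.0f;
}
```

The cost is visible in the signature: every function that handles a value now carries extra parameters, which is the "not very pretty" part.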
  • 19. getTable() & L2 Miss: Much of Lua’s data is stored in tables; even simple field access goes through the table system. Some sequentially indexed data goes through separate small array storage. Commonly, value lookup is done via the hash table.
  • 20. getTable() & L2 Miss [Diagram: a Lua Table struct with an Array Part (TValue entries) and a Hash Table of chained nodes (Key & TValue, nextPtr), a Branch between the two paths, and L2 Miss marked at four points.]
  • 21. getTable() & L2 Miss: Likely several L2 misses just to get to a value. Several possible improvements. Abandon the small sequential array: this saves space, which improves caching (we don’t have the large caches and fast memory of a desktop), and drops the branching and logic for handling the small array. The main hash table works for the sequential case anyway, so focus effort on optimizing one mechanism, not two.
  • 22. getTable() & L2 Miss: Compact the hash table to improve L2 performance. Store table of 2 entries since typical list depth is 1.2. Make the hash table contiguous and drop next pointers. Store types as 4 bits, packed separately from the values. Bulk together in groups of 28, i.e. one cache line in size. This drops data size by 62.5%, and L2 misses should drop similarly. Make the hash collision mechanism just advance in the array: collisions should be much less expensive, which means the hash function can be simpler, i.e. faster.
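The sketch below shows one possible reading of this layout: 28 four-byte slots plus 28 four-bit tags per 128-byte group, no next pointers, and collisions resolved by advancing to the next entry. The slides do not say how keys are laid out, so pairing a key slot with a value slot (14 pairs per group) is an assumption, as are all names, the modulo hash and the fixed table size. Deletion is omitted.

```c
#include <stdint.h>
#include <string.h>

#define SLOTS_PER_GROUP 28u                     /* 28 x 4 B + 28 x 4 bit = 126 B */
#define PAIRS_PER_GROUP (SLOTS_PER_GROUP / 2u)  /* assumed: key slot + value slot */
#define GROUP_COUNT     4u                      /* arbitrary size for the sketch */
#define TABLE_PAIRS     (GROUP_COUNT * PAIRS_PER_GROUP)
#define TAG_EMPTY       0u                      /* hypothetical "unused" tag */
#define TAG_INT         1u                      /* hypothetical integer tag */

typedef struct {
    uint32_t payload[SLOTS_PER_GROUP];    /* 112 bytes of 4-byte values */
    uint8_t  tags[SLOTS_PER_GROUP / 2u];  /* 14 bytes: two 4-bit tags per byte */
    uint8_t  pad[2];                      /* pads the group to 128 bytes */
} SlotGroup;

typedef struct { SlotGroup groups[GROUP_COUNT]; } CompactTable;

static unsigned get_tag(const SlotGroup *g, unsigned slot)
{
    unsigned byte = g->tags[slot / 2u];
    return (slot & 1u) ? (byte >> 4) : (byte & 0x0Fu);
}

static void set_tag(SlotGroup *g, unsigned slot, unsigned tag)
{
    unsigned byte = g->tags[slot / 2u];
    byte = (slot & 1u) ? ((byte & 0x0Fu) | ((tag & 0x0Fu) << 4))
                       : ((byte & 0xF0u) | (tag & 0x0Fu));
    g->tags[slot / 2u] = (uint8_t)byte;
}

void compact_init(CompactTable *t) { memset(t, 0, sizeof *t); }

/* Inserts or updates an integer-keyed entry; returns 0 if the table is full. */
int compact_set(CompactTable *t, uint32_t key, uint32_t value, unsigned value_tag)
{
    unsigned start = key % TABLE_PAIRS;               /* deliberately simple hash */
    for (unsigned probe = 0; probe < TABLE_PAIRS; ++probe) {
        unsigned pair = (start + probe) % TABLE_PAIRS;  /* collision: just advance */
        SlotGroup *g = &t->groups[pair / PAIRS_PER_GROUP];
        unsigned key_slot = (pair % PAIRS_PER_GROUP) * 2u;
        if (get_tag(g, key_slot) == TAG_EMPTY || g->payload[key_slot] == key) {
            g->payload[key_slot] = key;
            g->payload[key_slot + 1u] = value;
            set_tag(g, key_slot, TAG_INT);
            set_tag(g, key_slot + 1u, value_tag);
            return 1;
        }
    }
    return 0;
}

/* Looks up a key; returns 1 and writes *value_out when found. */
int compact_get(const CompactTable *t, uint32_t key, uint32_t *value_out)
{
    unsigned start = key % TABLE_PAIRS;
    for (unsigned probe = 0; probe < TABLE_PAIRS; ++probe) {
        unsigned pair = (start + probe) % TABLE_PAIRS;
        const SlotGroup *g = &t->groups[pair / PAIRS_PER_GROUP];
        unsigned key_slot = (pair % PAIRS_PER_GROUP) * 2u;
        if (get_tag(g, key_slot) == TAG_EMPTY)
            return 0;                                  /* end of probe run */
        if (g->payload[key_slot] == key) {
            *value_out = g->payload[key_slot + 1u];
            return 1;
        }
    }
    return 0;
}
```

Because a colliding key lands in the neighbouring entry, a short probe run usually stays within the cache line that the first probe already loaded.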
  • 24. Garbage Collection & L2 Miss: The default garbage collector works via a mark-and-sweep system. On console, this is very expensive: each free block record examined incurs an L2 miss, i.e. 610 cycles. Typically only a flag per block record is examined, but the L2 miss loads a 128-byte cache line, so throughput is wasted and the loaded data is unused. L2 miss massively dominates total time.
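To put rough numbers on this, the helpers below estimate the cost of a sweep under the simplifying assumption that every record examined costs exactly one L2 miss. The function names are hypothetical; the 610-cycle and 128-byte figures are the ones quoted in the slides.

```c
#include <stdint.h>

enum { L2_MISS_CYCLES = 610, CACHE_LINE_BYTES = 128 };

/* Cycles spent if every record examined costs one L2 miss. */
uint64_t sweep_miss_cycles(uint64_t records)
{
    return records * L2_MISS_CYCLES;
}

/* Converts a cycle count to milliseconds at the given clock rate in Hz. */
double cycles_to_ms(uint64_t cycles, double clock_hz)
{
    return (double)cycles / clock_hz * 1000.0;
}

/* Fraction of each loaded cache line that is actually used. */
double line_utilization(unsigned bytes_used)
{
    return (double)bytes_used / CACHE_LINE_BYTES;
}
```

For an illustrative count of 10,000 records, `sweep_miss_cycles` gives 6,100,000 cycles, which is about 1.9 ms at 3.2 GHz, and a one-byte flag uses 1/128 of each line fetched.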
  • 25. Garbage Collection & L2 Miss: Consider supporting with a custom block allocator. Histogram the allocation requests and tune block allocator sizes to the spikes in the histogram. The block allocator keeps a bitmask of allocated chunks; chunks are fixed size. A good allocator size is a multiple of 1024 records, since a 1024-bit mask is 128 bytes, the L2 cache line size. It reduces memory fragmentation and, when full, falls back to the normal allocator.
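A minimal sketch of such a block allocator, assuming a single pool of 1024 fixed-size chunks so that the bitmask occupies exactly 128 bytes. The chunk size, the names and the use of `malloc` as the "normal allocator" are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_COUNT 1024u  /* 1024 mask bits = 128 bytes = one cache line */
#define CHUNK_SIZE  32u    /* hypothetical; pick from the allocation histogram */

typedef struct {
    uint32_t mask[CHUNK_COUNT / 32u];  /* 1 bit per chunk, 1 = allocated */
    _Alignas(16) unsigned char storage[CHUNK_COUNT * CHUNK_SIZE];
} BlockPool;

void pool_init(BlockPool *p) { memset(p->mask, 0, sizeof p->mask); }

/* Returns a free chunk, or falls back to malloc when the request is too
   large or the pool is full. */
void *pool_alloc(BlockPool *p, size_t size)
{
    if (size <= CHUNK_SIZE) {
        for (unsigned w = 0; w < CHUNK_COUNT / 32u; ++w) {
            if (p->mask[w] == 0xFFFFFFFFu)
                continue;                      /* all 32 chunks in use */
            for (unsigned b = 0; b < 32u; ++b) {
                if (!(p->mask[w] & (1u << b))) {
                    p->mask[w] |= 1u << b;
                    return p->storage + (size_t)(w * 32u + b) * CHUNK_SIZE;
                }
            }
        }
    }
    return malloc(size);
}

/* Frees a pointer from either the pool or the fallback allocator. */
void pool_free(BlockPool *p, void *ptr)
{
    uintptr_t addr = (uintptr_t)ptr;
    uintptr_t base = (uintptr_t)p->storage;
    if (addr >= base && addr < base + sizeof p->storage) {
        size_t index = (addr - base) / CHUNK_SIZE;
        p->mask[index / 32u] &= ~(1u << (index % 32u));
    } else {
        free(ptr);
    }
}
```

Finding a free chunk, or checking whether one is live, touches only the 128-byte mask rather than one record per block.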
  • 27. Branch Mispredict & Lua VM: Lua is typically interpreted on consoles. No JITting, since the security model forbids executing data; precompiled code is possible, but has some disadvantages. The VM main loop typically does: pick up opcode; jump through huge switch to the code that executes the opcode; pick up data required by the opcode; execute; back to top.
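The loop described above can be sketched with a toy register machine. The three opcodes and the instruction format are hypothetical and far simpler than Lua's; the point is only the shape of the loop: fetch, switch, execute, repeat.

```c
#include <stdint.h>

typedef enum { OP_LOADK, OP_ADD, OP_HALT } Opcode;  /* hypothetical toy opcodes */

typedef struct {
    uint8_t op;  /* opcode */
    uint8_t a;   /* destination register */
    int16_t b;   /* constant for OP_LOADK, source register for OP_ADD */
} Instr;

/* Executes instructions until OP_HALT and returns register `a` of the halt. */
int32_t run(const Instr *code, int32_t *regs)
{
    for (const Instr *pc = code; ; ++pc) {       /* back to top */
        switch (pc->op) {                        /* jump on a value just loaded */
        case OP_LOADK: regs[pc->a] = pc->b;         break;
        case OP_ADD:   regs[pc->a] += regs[pc->b];  break;
        case OP_HALT:  return regs[pc->a];
        default:       return 0;                 /* unknown opcode: stop */
        }
    }
}
```

Every iteration ends in a single indirect jump whose target depends on the opcode that was just fetched from memory.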
  • 28. Branch Mispredict & Lua VM: Problem: the VM loop is mispredict-mungous. The switch statement is implemented using the bctr instruction: it loads an unknown and unpredictable value from memory (the opcode), then branches on it. The simple branch prediction hardware on the core has a 6-bit global history and a 2-bit prediction scheme, and doesn’t have much of a chance in this case. Mispredict penalty grows linearly with opcode count.
  • 29. Branch Mispredict & Lua VM: There are many code perturbations that seem hopeful: a tree of ifs derived from the popularity of opcodes; ‘direct threading’; preloading the ctr register. Sadly, the best route is to branch less. Do statistical analysis of opcode sequences: for example, 35% of opcode pairs are getTable-getTable. Idea: build super-opcode processing which drops branches, and remove other branches on opcode.
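A hedged sketch of the super-opcode idea on a toy bytecode: a peephole pass rewrites adjacent `OP_GET` pairs into a single fused `OP_GET_GET`, so the interpreter dispatches once instead of twice. The opcodes, the array-indexing stand-in for a table lookup and the function names are hypothetical; a real pass would also need to respect jump targets.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { OP_GET, OP_GET_GET, OP_HALT } Opcode;  /* hypothetical opcodes */

/* Rewrites each adjacent OP_GET pair as one OP_GET_GET; returns new length.
   `out` must have room for at least `n` bytes. */
size_t fuse_pairs(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t w = 0;
    for (size_t i = 0; i < n; ++i) {
        if (in[i] == OP_GET && i + 1 < n && in[i + 1] == OP_GET) {
            out[w++] = OP_GET_GET;
            ++i;                      /* consume the second OP_GET too */
        } else {
            out[w++] = in[i];
        }
    }
    return w;
}

/* Runs the code; *dispatches counts the trips through the switch. */
uint32_t run(const uint8_t *code, const uint32_t *mem, uint32_t acc,
             unsigned *dispatches)
{
    *dispatches = 0;
    for (;;) {
        ++*dispatches;
        switch (*code++) {
        case OP_GET:     acc = mem[acc];       break;
        case OP_GET_GET: acc = mem[mem[acc]];  break;  /* two lookups, one jump */
        default:         return acc;           /* OP_HALT or unknown */
        }
    }
}
```

Running the same program before and after `fuse_pairs` yields the same result with fewer dispatches, which is where the saved indirect branches come from.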
  • 31. Summary: Console cores and memory punish Lua performance. Four areas are mentioned above, but there are other smaller areas too. LHS, branch mispredict and L2 miss are your enemy; in particular, L2 miss is never to be underestimated. Improving performance requires care and thought, but there are gains to be found.