This document discusses strategies for coding games to take advantage of multiple processor cores. It notes that future CPUs will be predominantly multi-core, and that good design is critical for effective multithreading. It provides examples of common tasks that can be multithreaded, such as file decompression and rendering. It also discusses synchronization techniques, managing threads, and profiling multithreaded applications.
1. Coding for Multiple Cores
Bruce Dawson & Chuck Walbourn
Programmers
Game Technology Group
2. Why multi-threading/multi-core?
Clock rates are stagnant
Future CPUs will be predominantly multi-thread/
multi-core
Xbox 360 has 3 cores
PS3 will be multi-core
>70% of PC sales will be multi-core by end of 2006
Most Windows Vista systems will be multi-core
Two performance possibilities:
Single-threaded? Minimal performance growth
Multi-threaded? Exponential performance growth
3. Design for Multithreading
Good design is critical
Bad multithreading can be worse than no
multithreading
Deadlocks, synchronization bugs, poor
performance, etc.
5. Good Multithreading
Game Thread
Main Thread
Physics
Rendering Thread
Animation/
Skinning
Particle Systems
Networking
File I/O
6. Another Paradigm: Cascades
Thread 1: Input
Thread 2: Physics
Thread 3: AI
Thread 4: Rendering
Thread 5: Present
(Frames 1-4 flow through the cascade, one stage per thread.)
Advantages:
Synchronization points are few and well-defined
Disadvantages:
Increases latency (for constant frame rate)
Needs simple (one-way) data flow
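The cascade idea can be sketched in portable C++ (std::thread and a mutex-guarded queue stand in for the platform thread APIs on these slides; the names StageQueue and runCascade are illustrative): two stages connected by a single handoff point, so the only synchronization is the queue push/pop.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Single-producer/single-consumer handoff between two cascade stages.
template <typename T>
class StageQueue {
public:
    void push(T v) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(v));
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

// Two-stage cascade: a "physics" stage feeds a "render" stage.
// Each frame is handed off exactly once, so synchronization points
// are few and well-defined, as the slide describes.
std::vector<int> runCascade(int frames) {
    StageQueue<int> toRender;
    std::vector<int> rendered;
    std::thread physics([&] {
        for (int f = 0; f < frames; ++f)
            toRender.push(f * 2);                    // pretend physics result
    });
    std::thread render([&] {
        for (int f = 0; f < frames; ++f)
            rendered.push_back(toRender.pop() + 1);  // pretend draw work
    });
    physics.join();
    render.join();
    return rendered;
}
```

Because the handoff is one-way and ordered, results are deterministic even though the two stages run concurrently; the cost is one frame of added latency per stage, exactly the disadvantage noted above.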
8. File Decompression
Most common CPU heavy thread on the
Xbox 360
Easy to multithread
Allows use of aggressive compression to
improve load times
Don’t throw a thread at a problem better
solved by offline processing
Texture compression, file packing, etc.
9. Rendering
Separate update and render threads
Rendering on multiple threads
(D3DCREATE_MULTITHREADED) works poorly
Exception: Xbox 360 command buffers
Special case of cascades paradigm
Pass render state from update to render
With constant workload gives same latency,
better frame rate
With increased workload gives same frame rate,
worse latency
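Passing render state from update to render can be sketched as a one-slot mailbox (a simplified stand-in for the two render-description buffers described later in the Kameo case study; RenderState and DoubleBuffer are illustrative names, using portable C++ rather than platform APIs):

```cpp
#include <condition_variable>
#include <mutex>

// Everything the render thread needs for one frame.
struct RenderState { int frame; /* draw calls, matrices, ... */ };

// Update thread publishes a completed frame's state; render thread
// acquires it. Update can immediately start writing the next frame.
class DoubleBuffer {
public:
    // Called by the update thread when a frame's state is complete.
    void publish(const RenderState& s) {
        std::lock_guard<std::mutex> lock(m_);
        pending_ = s;
        hasPending_ = true;
        cv_.notify_one();
    }
    // Called by the render thread at the start of each frame.
    RenderState acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return hasPending_; });
        hasPending_ = false;
        return pending_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    RenderState pending_{};
    bool hasPending_ = false;
};
```

With a constant workload this keeps latency the same while letting update and render overlap, which is where the frame-rate win comes from.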
10. Graphics Fluff
Extra graphics that don't affect gameplay
Procedurally generated animating cloud textures
Cloth simulations
Dynamic ambient occlusion
Procedurally generated vegetation, etc.
Extra particles, better particle physics, etc.
Easy to synchronize
Potentially expensive, but if the core is
otherwise idle...?
11. Physics?
Could cascade from update to physics to
rendering
Makes use of three threads
May be too much latency
Could run physics on many threads
Uses many threads while doing physics
May leave threads mostly idle elsewhere
13. How Many Threads?
No more than one CPU intensive software
thread per core
3-6 on Xbox 360
1-? on PC (1-4 for now, need to query)
Too many busy threads adds complexity,
and lowers performance
Context switches are not free
Can have many non-CPU intensive
threads
I/O threads that block, or intermittent tasks
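On PC the core count must be queried at runtime; a minimal sketch using standard C++ (the slide predates C++11, so std::thread::hardware_concurrency is a portable substitute for GetSystemInfo):

```cpp
#include <algorithm>
#include <thread>

// Budget for CPU-intensive threads: at most one per core, per the
// slide's advice. hardware_concurrency() may return 0 if the count
// cannot be determined, so clamp to at least 1.
unsigned heavyThreadBudget() {
    unsigned cores = std::thread::hardware_concurrency();
    return std::max(1u, cores);
}
```

Non-CPU-intensive threads (blocking I/O, intermittent tasks) are not counted against this budget.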
14. Simultaneous Multi-Threading
Be careful with Simultaneous Multi-Threading (SMT) threads
Not the same as double the number of cores
Can give a small perf boost
Can cause a perf drop
Can avoid scheduler latency
Ideally one heavy thread per core plus
some additional intermittent threads
15. Case Study: Kameo (Xbox 360)
Started single threaded
Rendering was taking half of time—put on
separate thread
Two render-description buffers created to
communicate from update to render
Linear read/write access for best cache usage
Doesn't copy const data
File I/O and decompress on other threads
17. Case Study: Kameo (Xbox 360)
Core  Thread  Software threads
0     0       Game update
0     1       File I/O
1     0       Rendering
1     1       (idle)
2     0       XAudio
2     1       File decompression
Total usage was ~2.2-2.5 cores
18. Case Study: Project Gotham Racing
Core  Thread  Software threads
0     0       Update, physics, rendering, UI
0     1       Audio update, networking
1     0       Crowd update, texture decompression
1     1       Texture decompression
2     0       XAudio
2     1       (idle)
Total usage was ~2.0-3.0 cores
19. Managing Your Threads
Creating threads
Synchronizing
Terminating
Don't use TerminateThread()
Bad idea on Windows: leaves the process in an
indeterminate state, doesn't allow clean-up, etc.
Unavailable on Xbox 360
Instead return from your thread function, or call
ExitThread
20. Creating Threads Poorly

const int stackSize = 0;   // Stack size of zero means inherit parent's stack size
// CreateThread doesn't initialize the C runtime.
HANDLE hThread = CreateThread(0, stackSize,
                              ThreadFunctionBad, 0, 0, 0);
// Do work on main thread here.
// Busy waiting is bad!
for (;;) { // Wait for child thread to complete
    DWORD exitCode;
    GetExitCodeThread(hThread, &exitCode);
    if (exitCode != STILL_ACTIVE)
        break;
}
// Don't forget to close hThread when done with it.
...

DWORD __stdcall ThreadFunctionBad(void* data)
{
#ifdef WIN32
    // Be careful with thread affinities on Windows.
    SetThreadAffinityMask(GetCurrentThread(), 8);
#endif
    // Do child thread work here.
    return 0;
}
21. Creating Threads Well

const int stackSize = 65536;   // Specify stack size on Xbox 360
// _beginthreadex initializes the CRT.
HANDLE hThread = (HANDLE)_beginthreadex(0, stackSize,
                                        ThreadFunction, 0, 0, 0);
// Do work on main thread here.
// The correct way to wait for a thread to exit:
WaitForSingleObject(hThread, INFINITE);
// Don't forget to close this when done with it.
CloseHandle(hThread);
...

unsigned __stdcall ThreadFunction(void* data)
{
#ifdef XBOX
    // Thread affinities must be specified on Xbox 360: you must
    // explicitly assign software threads to hardware threads.
    XSetThreadProcessor(GetCurrentThread(), 2);
#endif
    // Do child thread work here.
    return 0;
}
22. Alternative: OpenMP
Available in VC++ 2005
Simple way to parallelize loops and some
other constructs
Works best on long symmetric tasks—
particles?
Game tasks are short—16.6 ms
Many game tasks are not symmetric
OpenMP is nice, but not ideal
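A minimal sketch of the kind of loop OpenMP handles well (a long, symmetric particle update; the function name and data layout are illustrative):

```cpp
#include <vector>

// Integrate particle positions with an OpenMP parallel-for. Each
// iteration is independent, so OpenMP can split the loop across cores.
// Without /openmp (or -fopenmp) the pragma is simply ignored and the
// loop runs serially, which makes this easy to adopt incrementally.
std::vector<float> integrateParticles(std::vector<float> pos,
                                      const std::vector<float>& vel,
                                      float dt) {
    #pragma omp parallel for
    for (int i = 0; i < (int)pos.size(); ++i)
        pos[i] += vel[i] * dt;
    return pos;
}
```

Short or asymmetric game tasks see little benefit because the per-loop fork/join overhead is significant relative to a 16.6 ms frame.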
23. Available Synchronization Objects
Events
Semaphores
Mutexes
Critical Sections
Don't use SuspendThread()
Some titles have used this for synchronization
Can easily lead to deadlocks
Interacts badly with Visual Studio debugger
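A critical-section-style lock in a scoped wrapper (shown here with portable std::mutex as a stand-in for Win32 EnterCriticalSection/LeaveCriticalSection; the Counter class is illustrative) keeps the hold time short and makes unlock automatic:

```cpp
#include <mutex>

// Shared counter protected by a critical section. The scoped
// lock_guard releases the mutex automatically, even on early
// return or exception, so the lock can never be leaked.
class Counter {
public:
    void add(int n) {
        std::lock_guard<std::mutex> lock(m_);  // hold briefly
        value_ += n;
    }
    int value() const {
        std::lock_guard<std::mutex> lock(m_);
        return value_;
    }
private:
    mutable std::mutex m_;
    int value_ = 0;
};
```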
26. Lockless programming
Trendy technique to use clever programming to
share resources without locking
Includes InterlockedXXX(), lockless
message passing, Double Checked Locking, etc.
Very hard to get right:
Compiler can reorder instructions
CPU can reorder instructions
CPU can reorder reads and writes
Not as fast as avoiding synchronization entirely
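An InterlockedXXX-style lockless counter, sketched with std::atomic (the portable analogue of InterlockedIncrement; countLockless is an illustrative name):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Many threads increment one shared counter with no lock: the
// atomic fetch_add is the entire synchronization. Relaxed ordering
// is sufficient here because no other data is published through
// the counter.
int countLockless(int threads, int perThread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < perThread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool)
        th.join();
    return counter.load();
}
```

Note this is the easy case; anything that publishes *other* data through an atomic flag needs the acquire/release ordering discussed on the next slide.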
27. Lockless Messages: Buggy
void SendMessage(void* input) {
    // Wait for the message to be 'empty'.
    while (g_msg.filled)
        ;
    memcpy(g_msg.data, input, MESSAGESIZE);
    g_msg.filled = true;
}

void GetMessage() {
    // Wait for the message to be 'filled'.
    while (!g_msg.filled)
        ;
    memcpy(localMsg.data, g_msg.data, MESSAGESIZE);
    g_msg.filled = false;
}
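The code above is buggy for exactly the reasons the previous slide lists: the compiler or CPU may reorder the memcpy relative to the write of `filled`, so the reader can observe the flag before the data. One way to fix that specific reordering is to make the flag an atomic with release/acquire ordering (shown in portable C++11; the Fixed-suffixed names are illustrative, and real code would also want to avoid the busy-waits):

```cpp
#include <atomic>
#include <cstring>

const int MESSAGESIZE = 64;

struct Message {
    char data[MESSAGESIZE];
    std::atomic<bool> filled{false};
};
Message g_msg;

void SendMessageFixed(const void* input) {
    // Wait for the message slot to be 'empty'.
    while (g_msg.filled.load(std::memory_order_acquire))
        ;
    std::memcpy(g_msg.data, input, MESSAGESIZE);
    // Release store: the memcpy above cannot be reordered past this,
    // so a reader that sees filled==true also sees the data.
    g_msg.filled.store(true, std::memory_order_release);
}

void GetMessageFixed(void* output) {
    // Acquire load: the memcpy below cannot be reordered before this.
    while (!g_msg.filled.load(std::memory_order_acquire))
        ;
    std::memcpy(output, g_msg.data, MESSAGESIZE);
    g_msg.filled.store(false, std::memory_order_release);
}
```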
28. Synchronization tips/costs:
Synchronization is moderately expensive
when there is no contention
Hundreds to thousands of cycles
Synchronization can be arbitrarily
expensive when there is contention!
Goals:
Synchronize rarely
Hold locks briefly
Minimize shared data
29. Beware hidden synchronization:
Allocations are (generally) a synch point
Consider per-thread heaps with no locking
HEAP_NO_SERIALIZE flag avoids lock on Win32
heaps
Consider custom single-purpose allocators
Consider avoiding memory allocations!
Avoid synch in in-house profilers
D3DCREATE_MULTITHREADED causes
synchronization on almost every Direct3D
call
30. Threading File I/O & Decompression
First: use large reads and asynchronous
I/O
Then: consider compression to accelerate
loading
Don't do format conversions etc. that are better
done at build time!
Have resource proxies to allow rendering
to continue
31. File I/O Implementation Details
vector<Resource*> g_resources;
Worst design: decompressor locks g_resources while
decompressing
Better design: decompressor adds resources to vector
after decompressing
Still requires renderer to synch on every resource access
Best design: two Resource* vectors
Renderer has private vector, no locking required
Decompressor uses the shared vector, syncs when adding a new Resource*
Renderer moves Resource* from shared to private vector once
per frame
32. Profiling multi-threaded apps
Need thread-aware profilers
Profiling may hide many synchronization stalls
Home-grown spin locks make profiling harder
Consider instrumenting calls to synchronization
functions
Don't use locks in instrumentation—use TLS variables to
store results
Windows: Intel VTune, AMD CodeAnalyst, and
the Visual Studio Team System Profiler
Xbox 360: PIX, XbPerfView, etc.
34. Naming Threads
typedef struct tagTHREADNAME_INFO {
    DWORD dwType;     // must be 0x1000
    LPCSTR szName;    // pointer to name (in user addr space)
    DWORD dwThreadID; // thread ID (-1=caller thread)
    DWORD dwFlags;    // reserved for future use, must be zero
} THREADNAME_INFO;

void SetThreadName(DWORD dwThreadID, LPCSTR szThreadName) {
    THREADNAME_INFO info;
    info.dwType = 0x1000;
    info.szName = szThreadName;
    info.dwThreadID = dwThreadID;
    info.dwFlags = 0;
    __try {
        RaiseException(0x406D1388, 0, sizeof(info)/sizeof(DWORD),
                       (DWORD*)&info);
    }
    __except(EXCEPTION_CONTINUE_EXECUTION) {
    }
}
SetThreadName(-1, "Main thread");
35. Other Ideas
Debugging tips for MT
Visual Studio does support multi-threaded debugging
Use threads window
Use @hwthread in watch window on Xbox 360
KD and WinDBG support multi-threaded debugging
Thread Local Storage (TLS)
__declspec(thread) declares per-thread variables
But doesn't work in dynamically loaded DLLs
TlsAlloc is less efficient, less convenient, but works in
dynamically loaded DLLs
36. Windows tips
Avoid using D3DCREATE_MULTITHREADED
It’s easy, it works, it’s really really slow
Best to do all calls to Direct3D from a single
thread
Could pass off locked resource pointers to a
queue for a loading thread to work with
Test on multiple machines and
configurations
Single-core, SMT (i.e. Hyper-Threading), Dual-core,
Intel and AMD chips, Multi-socket multicore
(4+ cores)
37. Windows API features
WaitForMultipleObjects
Obviously better than a series of
WaitForSingleObject calls
The OS is highly optimized around multithreading
and event-based blocking
I/O Completion Ports
Very efficient way to have the OS assign a pool of
worker threads to incoming I/O requests
Useful construct for implementing a game server
38. SMT versus Multicore
OS returns the number of logical processors in
GetSystemInfo(), so a count of 2 could mean an
SMT machine with only 1 physical core—or
2 real cores
Detailed Win32 APIs exposing this
distinction not available until Windows XP
x64, Windows Server 2003 SP1, Windows
Vista, etc.
GetLogicalProcessorInformation()
For now you have to use CPUID, as detailed
by Intel and AMD, to parse this out…
39. Timing with Multiple Cores
RDTSC is not always synced between cores!
As your thread moves from core to core, results of RDTSC
counter deltas may be nonsense
CPU frequency itself can change at run-time
through SpeedStep-style power-management technologies
See Power Management APIs for more information
Best thing to do is use Win32 API
QueryPerformanceCounter /
QueryPerformanceFrequency
See DirectX SDK article Game Timing and
Multiple Cores
40. Thread Micromanagement
Use SetThreadAffinityMask with
caution!
May be useful for assigning ‘heavy’ work threads
This mask is technically a hint, not a commitment
RDTSC-based instrumenting will require locking
the game threads to a single core
Otherwise let the Windows scheduler do the right
thing
CreateDevice/Reset might have a side-effect
on the calling thread’s affinity with software vertex
processing enabled
41. Thread Micromanagement (cont)
Be careful about boosting thread priority
If the priority is too high, you could cause the
system to hang and become unresponsive
If the priority is too low, the thread may starve
42. DLLs and Multithreading
DllMain for every DLL is informed of
thread creation/destruction
For some DLLs this is required to initialize TLS
For many this is a waste of time, so call
DisableThreadLibraryCalls() from your
DllMain during process creation
(DLL_PROCESS_ATTACH)
The OS serializes access to the entry point
This means threads created during DllMain
won’t start for a while, so don’t wait on them in the
DLL startup
43. Resources
Multithreading Applications in Win32, Jim Beveridge &
Robert Weiner, Addison-Wesley, 1997
Multiprocessor Considerations for Kernel-Mode Drivers
http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/MP_issues.doc
Determining Logical Processors per Physical Processor
http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/knowledgebase/43842.htm
GetLogicalProcessorInformation
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/getlogicalprocessorinformation.asp
Double checked locking
http://en.wikipedia.org/wiki/Double-checked_locking
44. Resources
GDC 2006 Presentations
http://msdn.com/directx/presentations
DirectX Developer Center
http://msdn.com/directx
XNA Developer Center
http://msdn.com/xna
Xbox Developer Center (Registered Devs Only)
https://xds.xbox.com
XNA, DirectX, XACT Forums
http://msdn.com/directx/forums
Email addresses
directx@microsoft.com (DirectX Feedback)
xboxds@microsoft.com (Xbox Developers Only)
xna@microsoft.com (XNA Feedback)
Good afternoon. My name is Bruce Dawson, and this is Chuck Walbourn. We work together in the Microsoft Game Technology Group. My specialty is Xbox 360, and Chuck's specialty is Windows. We're here today to give you some thoughts on how to take advantage of multi-core processors.
This topic is important because the free ride is over...
Per hardware thread performance is stagnant, but processor improvement continues
>70% figure applies to servers (>85%), desktop, and laptops—everything
Celeron dual-core!
Moore's law lives, but since we can't increase single-processor clocks, the growing transistor budget goes into more cores...
More performance requires multi-threading
Multi-core penetration figures:
http://cache-www.intel.com/cd/00/00/23/54/235413_235413.pdf
Why this talk?
Multi-threading is hard—to get benefit you need to plan for it, and you will hit subtle bugs.
Effective multi-threading can be really hard. You may hit problems where threading is actually hurting performance.
Done properly—huge benefits
Good multi-threading always starts with good design.
Haphazard design
Start with one thread, it spawns a couple more
Then they spawn a couple more
Then you start adding communication between threads
And more communication between threads
And still more communication between threads
Then you add synchronization points, where threads need data from other threads or shared resources
End result: a lot of your threads spend a lot of time waiting, you need a lot of synchronization objects, you’re prone to resource contention and synch bugs
Start with main thread, look for major tasks
Split out into Game/Rendering
Add synch points… other than at those points, both threads can run independently
Look for additional parallelizable tasks… physics might be a good candidate
Synch points before and after
Break out other parallelizable tasks
Look for tasks that can run independently of main threads… service requests
Add communication but keep it to a minimum
Also, total time is gated by the largest chunk. Probably not well-suited for games
Can work if you have very few stages. At 30 Hz, the added latency is intolerable
Update loop should generally be single threaded. May be able to pull out some parts, like path-finding, but synchronization concerns limit your options.
File I/O is something that is often put on a separate thread. This can avoid stalls that asynchronous I/O can't always hide.
Normally file I/O is not CPU heavy. That can change now.
File read/write is cheap, but spare threads allow use of aggressive compression
Rendering is usually quite expensive. D3D overhead adds up, and so do scene-traversal costs
Limited number of primitives per second (on modern Windows machines, we recommend expecting about 300 draws per frame for 60 FPS)
Simple in theory: double-buffer all state that affects rendering. Sometimes complicated in practice.
Synchronize once per frame
Graphics fluff is a good candidate because it has few interactions with other data. May not need to run at same frame-rate as game.
Some games are spending 100% of a core on cloth animation. "That's crazy!", or is it brilliant? The main loop of your game may be impossible to multi-thread, in which case the other threads will sit idle unless you add new features.
On PC, graphics fluff can be dropped on single-core machines without affecting game-play. Can be replaced with cheaper alternatives.
This diagram does show good multithreading, but probably not perfect. It relies on spawning extra threads for physics, animation, and particle systems.
It could turn out that this system demands ten hardware threads at some times, and two hardware threads at others. Ideally you should try to have the same number of CPU heavy threads running at all times.
Amdahl&apos;s law—speeding up part of your calculations just leaves the remainder as the single-threaded bottleneck
Middle-ware needs to be flexible enough to adapt to the needs of different games. Physics may be allowed one core—or not.
It is reasonable to have additional threads that are not CPU intensive—blocking on I/O
Segue: one thread per hardware thread, or one per core?
SMT means that two hardware threads are sharing execution resources. They share L1 caches and execution units, but have independent register sets.
If first thread is under utilizing these resources (too many dependency stalls) then another thread can share the resources and total throughput increases.
If first thread is heavily utilizing these resources (well scheduled code) then SMT can't help much.
Cache is often a problem—L1 is small, and two threads may fight over it. Worst case, adding a second thread may reduce performance.
How to tell? Measure. Easy on Xbox 360, trickier on PC.
Scheduler latency is when you have a thread that is ready to run but the OS waits for the current scheduling quantum to expire before running the thread. If you put a thread on its own hardware thread—even just an SMT thread—then it can wake up faster. This works well if you have a thread that mostly sleeps but needs to wake quickly on demand.
There can be multiple threads per core, multiple cores per chip, multiple chips per socket, and multiple sockets per computer. Identifying shared L1 caches can help with decisions about how many processors to use.
The non-uniformity of hardware threads is one reason why setting thread affinity is problematic on PC.
Now, some examples.
Almost finished on Xbox in August 2004—then moved to Xbox 360
Mostly single-threaded game
CPU usage split was 51/49 for update/render—perfect
3-MB buffer to describe rendering (not always filled), took ~1-2 ms to fill buffer, ~33 ms to render
Decompression thread saved space on DVD and improved load times, cost was some spare CPU cycles. Actually two threads for file I/O—one for reading, one for decompressing, because some calls can block for ~0.5s doing directory lookups
First the update thread fills buffer 0. The render thread is idle.
Then the update thread fills buffer 1. While it is doing this the render thread can run, reading from buffer 0.
Then the threads swap buffers.
This process continues (go back and forth with the arrow keys).
Multi-threading was added very late—~6 months before launch—but it worked
This shows the distribution of threads to cores and hardware threads. Note that one hardware thread is unused. That's okay—it ensures that rendering runs at top speed.
There were a few other threads (audio processing, etc.) but not many—roughly one CPU intensive thread per core
Cores 0 and 1 were ~80-99% utilized, and core 2 was typically 50% utilized, for total CPU usage of ~2.2-2.5 cores, or ~7-8 GHz
This title is also on Xbox 360.
Things to notice: rendering on same thread as update. Two decompression threads. One unused thread, to leave all cycles to audio.
Audio was a problem in this title. The update thread and crowd update threads both need to trigger sounds, which required grabbing a critical section that the Audio update thread was often holding.
Things to point out:
_beginthreadex is required on Windows to ensure that the CRT is initialized with any TLS required.
Optional on Xbox 360 (can use CreateThread, only difference is return type of thread-creation function and thread function)
Specifying the stack size is important on Xbox 360 to avoid wasting memory. It should generally be a multiple of 64 KB.
Waiting on the thread handle is how you tell when a thread has terminated—don't busy wait for this!
Return value is a thread handle, must be closed when not needed anymore.
Thread affinity is completely manual on Xbox 360. Generally best to let the OS do it on Windows, unless you really know what you're doing. Can easily reduce performance by poor understanding of processor topology (overusing two hardware threads on one core, while leaving a thread idle), or by poor interactions with other processes—your threads unable to run despite having idle hardware threads.
Thread creation is expensive. Don't do it often. If a thread is temporarily unneeded, leave it waiting on an event or semaphore.
This code specifies the stack size, uses _beginthreadex, properly waits for the child to terminate, closes the handle, and specifies the thread affinity on Xbox 360 but not Win32. Perfection!
If you don't need the handle, close it immediately.
Some areas where OpenMP has been used include:
Particles
Skinning
Physics
Usually minimal benefit, due to limited scope
This guarantees that ManipulateSharedData() is only executed by one thread at a time.
But, mutexes are not the cheapest option...
Critical sections are much cheaper. On Xbox 360 and on Windows they run roughly 20x faster.
Two restrictions: cannot be used between processes, and cannot be used with WaitForMultipleObjects
Mutexes are kernel objects, so they require a kernel transition, whereas critical sections are user-space objects. Mutexes are more robust in the face of thread death.
CRITICAL_SECTION is a good optimization, but... the key optimization is: don't synchronize too often
x86 CPUs can reorder reads
Xbox 360 CPUs can reorder reads and writes—despite being in-order CPUs
g_msg.filled must be marked volatile or else both loops will tend to spin forever.
Even then, with many compiler/platform pairs there is nothing to stop the write to g_msg.filled from being reordered. In SendMessage the write to g_msg.filled might be visible before the write to g_msg.data is visible.
Similarly, in GetMessage the reads of the message data might come from L2 before the read of g_msg.filled.
Both types of reordering can happen on the Xbox 360 CPU, and on many compilers.
Different hardware threads don't talk to each other directly—they talk to shared memory/shared L2. Thus, if you prevent reordering in SendMessage that guarantees that the writes get to L2 in order. However, you still have to separately guarantee that reads come from L2 in order in GetMessage.
Crucial observation: Lockless programming can be fast, but it is still a type of synchronization, and is more expensive than no synchronization
Particularly tricky prior to VS 2005—poorly defined guarantees from volatile
Particularly tricky on Xbox 360—volatile and InterlockedXxx semantics are slightly different and don't prevent CPU reordering of reads and writes—need explicit memory barriers.
Requiring exclusive access to a popular resource can make multi-threading a complex way of doing single-threading on multiple threads
Ideally you want to use synchronization primitives to guarantee multiple threads won't modify resources simultaneously, while designing so that they generally won't anyway.
Sometimes it is worth doing a short spin-lock on resources that are likely to be held for only a short time. InitializeCriticalSectionAndSpinCount supports this.
g_resources holds a list of pointers to all loaded resources. It is referenced frequently by the render thread as it needs meshes, textures, shaders, etc.
The load thread needs to make resources available once they are loaded.
If the decompression thread locks g_resources while it decompresses, or while it does file I/O, then the render thread may be locked out for long periods.
If g_resources is shared at all, then every reference by the render thread requires synchronization, wasting time on acquiring and releasing locks.
Best design is two (or more) vectors, to insulate threads from each other. Private data is good.
Anecdote about profile capture completely hiding critical issue (code was waiting on GPU, but only when not profiling. Same thing happened waiting on load thread)
I actually saw a title that had instrumented a ton of functions but then stored the results to a shared array, using critical sections to guard it. About 90% of their synchronization was in the profile functions.
Synchronization stalls are hard to locate
Use Timing Capture on Xbox 360 to visualize threading behavior
Add instrumentation to make visualization easier
This wacky trick makes the name available for Visual Studio, WinDBG, etc, on Xbox 360 and on Windows. It also makes the name available to some other tools, like the PIX timing capture.
The VS screenshot of a Windows app shows just two threads (named), and the VS screenshot of an Xbox 360 app shows... more.
If your multi-threaded code is not tested on multi-proc systems, it will fail!
Mention that this is complicated by the fact that some early releases of processor drivers have bugs where QPC/QPF relies on RDTSC and therefore exhibits the problem. The fix is to get the latest processor driver from AMD website.