This document discusses strategies for coding games to take advantage of multiple processor cores. It notes that future CPUs will be predominantly multi-core, and that good design is critical for effective multithreading. It provides examples of common tasks that can be multithreaded, such as file decompression and rendering. It also discusses synchronization techniques, managing threads, and profiling multithreaded applications.
1. Coding for Multiple Cores
Bruce Dawson & Chuck Walbourn
Programmers
Game Technology Group
2. Why multi-threading/multi-core?
Clock rates are stagnant
Future CPUs will be predominantly multi-thread/
multi-core
Xbox 360 has 3 cores
PS3 will be multi-core
>70% of PC sales will be multi-core by end of 2006
Most Windows Vista systems will be multi-core
Two performance possibilities:
Single-threaded? Minimal performance growth
Multi-threaded? Exponential performance growth
3. Design for Multithreading
Good design is critical
Bad multithreading can be worse than no
multithreading
Deadlocks, synchronization bugs, poor
performance, etc.
5. Good Multithreading
Game Thread
Main Thread
Physics
Rendering Thread
Animation/
Skinning
Particle Systems
Networking
File I/O
6. Another Paradigm: Cascades
Thread 1: Input
Thread 2: Physics
Thread 3: AI
Thread 4: Rendering
Thread 5: Present
(Frames 1-4 flow through the cascade, one stage per thread.)
Advantages:
Synchronization points are few and well-defined
Disadvantages:
Increases latency (for constant frame rate)
Needs simple (one-way) data flow
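The cascade idea can be sketched in portable C++ (std::thread and a mutex-guarded queue stand in for the platform thread APIs on these slides; the names StageQueue and runCascade are illustrative): two stages connected by a single handoff point, so the only synchronization is the queue push/pop.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Single-producer/single-consumer handoff between two cascade stages.
template <typename T>
class StageQueue {
public:
    void push(T v) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(v));
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

// Two-stage cascade: a "physics" stage feeds a "render" stage.
// Each frame is handed off exactly once, so synchronization points
// are few and well-defined, as the slide describes.
std::vector<int> runCascade(int frames) {
    StageQueue<int> toRender;
    std::vector<int> rendered;
    std::thread physics([&] {
        for (int f = 0; f < frames; ++f)
            toRender.push(f * 2);                    // pretend physics result
    });
    std::thread render([&] {
        for (int f = 0; f < frames; ++f)
            rendered.push_back(toRender.pop() + 1);  // pretend draw work
    });
    physics.join();
    render.join();
    return rendered;
}
```

Because the handoff is one-way and ordered, results are deterministic even though the two stages run concurrently; the cost is one frame of added latency per stage, exactly the disadvantage noted above.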
8. File Decompression
Most common CPU heavy thread on the
Xbox 360
Easy to multithread
Allows use of aggressive compression to
improve load times
Don’t throw a thread at a problem better
solved by offline processing
Texture compression, file packing, etc.
9. Rendering
Separate update and render threads
Rendering on multiple threads
(D3DCREATE_MULTITHREADED) works poorly
Exception: Xbox 360 command buffers
Special case of cascades paradigm
Pass render state from update to render
With constant workload gives same latency,
better frame rate
With increased workload gives same frame rate,
worse latency
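Passing render state from update to render can be sketched as a one-slot mailbox (a simplified stand-in for the two render-description buffers described later in the Kameo case study; RenderState and DoubleBuffer are illustrative names, using portable C++ rather than platform APIs):

```cpp
#include <condition_variable>
#include <mutex>

// Everything the render thread needs for one frame.
struct RenderState { int frame; /* draw calls, matrices, ... */ };

// Update thread publishes a completed frame's state; render thread
// acquires it. Update can immediately start writing the next frame.
class DoubleBuffer {
public:
    // Called by the update thread when a frame's state is complete.
    void publish(const RenderState& s) {
        std::lock_guard<std::mutex> lock(m_);
        pending_ = s;
        hasPending_ = true;
        cv_.notify_one();
    }
    // Called by the render thread at the start of each frame.
    RenderState acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return hasPending_; });
        hasPending_ = false;
        return pending_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    RenderState pending_{};
    bool hasPending_ = false;
};
```

With a constant workload this keeps latency the same while letting update and render overlap, which is where the frame-rate win comes from.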
10. Graphics Fluff
Extra graphics that don't affect gameplay
Procedurally generated animating cloud textures
Cloth simulations
Dynamic ambient occlusion
Procedurally generated vegetation, etc.
Extra particles, better particle physics, etc.
Easy to synchronize
Potentially expensive, but if the core is
otherwise idle...?
11. Physics?
Could cascade from update to physics to
rendering
Makes use of three threads
May be too much latency
Could run physics on many threads
Uses many threads while doing physics
May leave threads mostly idle elsewhere
13. How Many Threads?
No more than one CPU intensive software
thread per core
3-6 on Xbox 360
1-? on PC (1-4 for now, need to query)
Too many busy threads adds complexity,
and lowers performance
Context switches are not free
Can have many non-CPU intensive
threads
I/O threads that block, or intermittent tasks
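On PC the core count must be queried at runtime; a minimal sketch using standard C++ (the slide predates C++11, so std::thread::hardware_concurrency is a portable substitute for GetSystemInfo):

```cpp
#include <algorithm>
#include <thread>

// Budget for CPU-intensive threads: at most one per core, per the
// slide's advice. hardware_concurrency() may return 0 if the count
// cannot be determined, so clamp to at least 1.
unsigned heavyThreadBudget() {
    unsigned cores = std::thread::hardware_concurrency();
    return std::max(1u, cores);
}
```

Non-CPU-intensive threads (blocking I/O, intermittent tasks) are not counted against this budget.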
14. Simultaneous Multi-Threading
Be careful with Simultaneous Multi-Threading (SMT) threads
Not the same as double the number of cores
Can give a small perf boost
Can cause a perf drop
Can avoid scheduler latency
Ideally one heavy thread per core plus
some additional intermittent threads
15. Case Study: Kameo (Xbox 360)
Started single threaded
Rendering was taking half of time—put on
separate thread
Two render-description buffers created to
communicate from update to render
Linear read/write access for best cache usage
Doesn't copy const data
File I/O and decompress on other threads
17. Case Study: Kameo (Xbox 360)
Core  Thread  Software threads
0     0       Game update
0     1       File I/O
1     0       Rendering
1     1       (idle)
2     0       XAudio
2     1       File decompression
Total usage was ~2.2-2.5 cores
18. Case Study: Project Gotham Racing
Core  Thread  Software threads
0     0       Update, physics, rendering, UI
0     1       Audio update, networking
1     0       Crowd update, texture decompression
1     1       Texture decompression
2     0       XAudio
2     1       (idle)
Total usage was ~2.0-3.0 cores
19. Managing Your Threads
Creating threads
Synchronizing
Terminating
Don't use TerminateThread()
Bad idea on Windows: leaves the process in an
indeterminate state, doesn't allow clean-up, etc.
Unavailable on Xbox 360
Instead return from your thread function, or call
ExitThread
20. Creating Threads Poorly

const int stackSize = 0;   // Stack size of zero means inherit parent's stack size
// CreateThread doesn't initialize the C runtime.
HANDLE hThread = CreateThread(0, stackSize,
                              ThreadFunctionBad, 0, 0, 0);
// Do work on main thread here.
// Busy waiting is bad!
for (;;) { // Wait for child thread to complete
    DWORD exitCode;
    GetExitCodeThread(hThread, &exitCode);
    if (exitCode != STILL_ACTIVE)
        break;
}
// Don't forget to close hThread when done with it.
...

DWORD __stdcall ThreadFunctionBad(void* data)
{
#ifdef WIN32
    // Be careful with thread affinities on Windows.
    SetThreadAffinityMask(GetCurrentThread(), 8);
#endif
    // Do child thread work here.
    return 0;
}
21. Creating Threads Well

const int stackSize = 65536;   // Specify stack size on Xbox 360
// _beginthreadex initializes the CRT.
HANDLE hThread = (HANDLE)_beginthreadex(0, stackSize,
                                        ThreadFunction, 0, 0, 0);
// Do work on main thread here.
// The correct way to wait for a thread to exit:
WaitForSingleObject(hThread, INFINITE);
// Don't forget to close this when done with it.
CloseHandle(hThread);
...

unsigned __stdcall ThreadFunction(void* data)
{
#ifdef XBOX
    // Thread affinities must be specified on Xbox 360: you must
    // explicitly assign software threads to hardware threads.
    XSetThreadProcessor(GetCurrentThread(), 2);
#endif
    // Do child thread work here.
    return 0;
}
22. Alternative: OpenMP
Available in VC++ 2005
Simple way to parallelize loops and some
other constructs
Works best on long symmetric tasks—
particles?
Game tasks are short—16.6 ms
Many game tasks are not symmetric
OpenMP is nice, but not ideal
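A minimal sketch of the kind of loop OpenMP handles well (a long, symmetric particle update; the function name and data layout are illustrative):

```cpp
#include <vector>

// Integrate particle positions with an OpenMP parallel-for. Each
// iteration is independent, so OpenMP can split the loop across cores.
// Without /openmp (or -fopenmp) the pragma is simply ignored and the
// loop runs serially, which makes this easy to adopt incrementally.
std::vector<float> integrateParticles(std::vector<float> pos,
                                      const std::vector<float>& vel,
                                      float dt) {
    #pragma omp parallel for
    for (int i = 0; i < (int)pos.size(); ++i)
        pos[i] += vel[i] * dt;
    return pos;
}
```

Short or asymmetric game tasks see little benefit because the per-loop fork/join overhead is significant relative to a 16.6 ms frame.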
23. Available Synchronization Objects
Events
Semaphores
Mutexes
Critical Sections
Don't use SuspendThread()
Some titles have used this for synchronization
Can easily lead to deadlocks
Interacts badly with Visual Studio debugger
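A critical-section-style lock in a scoped wrapper (shown here with portable std::mutex as a stand-in for Win32 EnterCriticalSection/LeaveCriticalSection; the Counter class is illustrative) keeps the hold time short and makes unlock automatic:

```cpp
#include <mutex>

// Shared counter protected by a critical section. The scoped
// lock_guard releases the mutex automatically, even on early
// return or exception, so the lock can never be leaked.
class Counter {
public:
    void add(int n) {
        std::lock_guard<std::mutex> lock(m_);  // hold briefly
        value_ += n;
    }
    int value() const {
        std::lock_guard<std::mutex> lock(m_);
        return value_;
    }
private:
    mutable std::mutex m_;
    int value_ = 0;
};
```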
26. Lockless programming
Trendy technique to use clever programming to
share resources without locking
Includes InterlockedXXX(), lockless
message passing, Double Checked Locking, etc.
Very hard to get right:
Compiler can reorder instructions
CPU can reorder instructions
CPU can reorder reads and writes
Not as fast as avoiding synchronization entirely
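An InterlockedXXX-style lockless counter, sketched with std::atomic (the portable analogue of InterlockedIncrement; countLockless is an illustrative name):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Many threads increment one shared counter with no lock: the
// atomic fetch_add is the entire synchronization. Relaxed ordering
// is sufficient here because no other data is published through
// the counter.
int countLockless(int threads, int perThread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < perThread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool)
        th.join();
    return counter.load();
}
```

Note this is the easy case; anything that publishes *other* data through an atomic flag needs the acquire/release ordering discussed on the next slide.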
27. Lockless Messages: Buggy
void SendMessage(void* input) {
    // Wait for the message to be 'empty'.
    while (g_msg.filled)
        ;
    memcpy(g_msg.data, input, MESSAGESIZE);
    g_msg.filled = true;
}

void GetMessage() {
    // Wait for the message to be 'filled'.
    while (!g_msg.filled)
        ;
    memcpy(localMsg.data, g_msg.data, MESSAGESIZE);
    g_msg.filled = false;
}
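The code above is buggy for exactly the reasons the previous slide lists: the compiler or CPU may reorder the memcpy relative to the write of `filled`, so the reader can observe the flag before the data. One way to fix that specific reordering is to make the flag an atomic with release/acquire ordering (shown in portable C++11; the Fixed-suffixed names are illustrative, and real code would also want to avoid the busy-waits):

```cpp
#include <atomic>
#include <cstring>

const int MESSAGESIZE = 64;

struct Message {
    char data[MESSAGESIZE];
    std::atomic<bool> filled{false};
};
Message g_msg;

void SendMessageFixed(const void* input) {
    // Wait for the message slot to be 'empty'.
    while (g_msg.filled.load(std::memory_order_acquire))
        ;
    std::memcpy(g_msg.data, input, MESSAGESIZE);
    // Release store: the memcpy above cannot be reordered past this,
    // so a reader that sees filled==true also sees the data.
    g_msg.filled.store(true, std::memory_order_release);
}

void GetMessageFixed(void* output) {
    // Acquire load: the memcpy below cannot be reordered before this.
    while (!g_msg.filled.load(std::memory_order_acquire))
        ;
    std::memcpy(output, g_msg.data, MESSAGESIZE);
    g_msg.filled.store(false, std::memory_order_release);
}
```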
28. Synchronization tips/costs:
Synchronization is moderately expensive
when there is no contention
Hundreds to thousands of cycles
Synchronization can be arbitrarily
expensive when there is contention!
Goals:
Synchronize rarely
Hold locks briefly
Minimize shared data
29. Beware hidden synchronization:
Allocations are (generally) a synch point
Consider per-thread heaps with no locking
HEAP_NO_SERIALIZE flag avoids lock on Win32
heaps
Consider custom single-purpose allocators
Consider avoiding memory allocations!
Avoid synch in in-house profilers
D3DCREATE_MULTITHREADED causes
synchronization on almost every Direct3D
call
30. Threading File I/O & Decompression
First: use large reads and asynchronous
I/O
Then: consider compression to accelerate
loading
Don't do format conversions etc. that are better
done at build time!
Have resource proxies to allow rendering
to continue
31. File I/O Implementation Details
vector<Resource*> g_resources;
Worst design: decompressor locks g_resources while
decompressing
Better design: decompressor adds resources to vector
after decompressing
Still requires renderer to synch on every resource access
Best design: two Resource* vectors
Renderer has private vector, no locking required
Decompressor uses the shared vector, syncs when adding a new Resource*
Renderer moves Resource* from shared to private vector once
per frame
32. Profiling multi-threaded apps
Need thread-aware profilers
Profiling may hide many synchronization stalls
Home-grown spin locks make profiling harder
Consider instrumenting calls to synchronization
functions
Don't use locks in instrumentation—use TLS variables to
store results
Windows: Intel VTune, AMD CodeAnalyst, and
the Visual Studio Team System Profiler
Xbox 360: PIX, XbPerfView, etc.
34. Naming Threads
typedef struct tagTHREADNAME_INFO {
    DWORD dwType;     // must be 0x1000
    LPCSTR szName;    // pointer to name (in user addr space)
    DWORD dwThreadID; // thread ID (-1=caller thread)
    DWORD dwFlags;    // reserved for future use, must be zero
} THREADNAME_INFO;

void SetThreadName(DWORD dwThreadID, LPCSTR szThreadName) {
    THREADNAME_INFO info;
    info.dwType = 0x1000;
    info.szName = szThreadName;
    info.dwThreadID = dwThreadID;
    info.dwFlags = 0;
    __try {
        RaiseException(0x406D1388, 0, sizeof(info)/sizeof(DWORD),
                       (DWORD*)&info);
    }
    __except(EXCEPTION_CONTINUE_EXECUTION) {
    }
}
SetThreadName(-1, "Main thread");
35. Other Ideas
Debugging tips for MT
Visual Studio does support multi-threaded debugging
Use threads window
Use @hwthread in watch window on Xbox 360
KD and WinDBG support multi-threaded debugging
Thread Local Storage (TLS)
__declspec(thread) declares per-thread variables
But doesn't work in dynamically loaded DLLs
TlsAlloc is less efficient, less convenient, but works in
dynamically loaded DLLs
36. Windows tips
Avoid using D3DCREATE_MULTITHREADED
It’s easy, it works, it’s really really slow
Best to do all calls to Direct3D from a single
thread
Could pass off locked resource pointers to a
queue for a loading thread to work with
Test on multiple machines and
configurations
Single-core, SMT (i.e. Hyper-Threading), Dual-core,
Intel and AMD chips, Multi-socket multicore
(4+ cores)
37. Windows API features
WaitForMultipleObjects
Obviously better than a series of
WaitForSingleObject calls
The OS is highly optimized around multithreading
and event-based blocking
I/O Completion Ports
Very efficient way to have the OS assign a pool of
worker threads to incoming I/O requests
Useful construct for implementing a game server
38. SMT versus Multicore
OS returns the number of logical processors in
GetSystemInfo(), so a count of 2 could mean an
SMT machine with only 1 physical core—or
2 real cores
Detailed Win32 APIs exposing this
distinction not available until Windows XP
x64, Windows Server 2003 SP1, Windows
Vista, etc.
GetLogicalProcessorInformation()
For now you have to use CPUID, as detailed
by Intel and AMD, to parse this out…
39. Timing with Multiple Cores
RDTSC is not always synced between cores!
As your thread moves from core to core, results of RDTSC
counter deltas may be nonsense
CPU frequency itself can change at run-time
through SpeedStep-style power-management technologies
See Power Management APIs for more information
Best thing to do is use Win32 API
QueryPerformanceCounter /
QueryPerformanceFrequency
See DirectX SDK article Game Timing and
Multiple Cores
40. Thread Micromanagement
Use SetThreadAffinityMask with
caution!
May be useful for assigning ‘heavy’ work threads
This mask is technically a hint, not a commitment
RDTSC-based instrumenting will require locking
the game threads to a single core
Otherwise let the Windows scheduler do the right
thing
CreateDevice/Reset might have a side-effect
on the calling thread’s affinity with software vertex
processing enabled
41. Thread Micromanagement (cont)
Be careful about boosting thread priority
If the priority is too high, you could cause the
system to hang and become unresponsive
If the priority is too low, the thread may starve
42. DLLs and Multithreading
DllMain for every DLL is informed of
thread creation/destruction
For some DLLs this is required to initialize TLS
For many this is a waste of time, so call
DisableThreadLibraryCalls() from your
DllMain during process creation
(DLL_PROCESS_ATTACH)
The OS serializes access to the entry point
This means threads created during DllMain
won’t start for a while, so don’t wait on them in the
DLL startup
43. Resources
Multithreading Applications in Win32, Jim Beveridge &
Robert Weiner, Addison-Wesley, 1997
Multiprocessor Considerations for Kernel-Mode Drivers
http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/MP_issues.doc
Determining Logical Processors per Physical Processor
http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/knowledgebase/43842.htm
GetLogicalProcessorInformation
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/getlogicalprocessorinformation.asp
Double checked locking
http://en.wikipedia.org/wiki/Double-checked_locking
44. Resources
GDC 2006 Presentations
http://msdn.com/directx/presentations
DirectX Developer Center
http://msdn.com/directx
XNA Developer Center
http://msdn.com/xna
Xbox Developer Center (Registered Devs Only)
https://xds.xbox.com
XNA, DirectX, XACT Forums
http://msdn.com/directx/forums
Email addresses
directx@microsoft.com (DirectX Feedback)
xboxds@microsoft.com (Xbox Developers Only)
xna@microsoft.com (XNA Feedback)
Good afternoon. My name is Bruce Dawson, and this is Chuck Walbourn. We work together in the Microsoft Game Technology Group. My specialty is Xbox 360, and Chuck's specialty is Windows. We're here today to give you some thoughts on how to take advantage of multi-core processors.
This topic is important because the free ride is over...
Per hardware thread performance is stagnant, but processor improvement continues
>70% figure applies to servers (>85%), desktop, and laptops—everything
Celeron dual-core!
Moore's law lives, but since we can't increase single-processor clocks, the growing transistor budget goes into more cores...
More performance requires multi-threading
Multi-core penetration figures:
http://cache-www.intel.com/cd/00/00/23/54/235413_235413.pdf
Why this talk?
Multi-threading is hard—to get benefit you need to plan for it, and you will hit subtle bugs.
Effective multi-threading can be really hard. You may hit problems where threading is actually hurting performance.
Done properly—huge benefits
Good multi-threading always starts with good design.
Haphazard design
Start with one thread, it spawns a couple more
Then they spawn a couple more
Then you start adding communication between threads
And more communication between threads
And still more communication between threads
Then you add synchronization points, where threads need data from other threads or shared resources
End result: a lot of your threads spend a lot of time waiting, you need a lot of synchronization objects, you’re prone to resource contention and synch bugs
Start with main thread, look for major tasks
Split out into Game/Rendering
Add synch points… other than at those points, both threads can run independently
Look for additional parallelizable tasks… physics might be a good candidate
Synch points before and after
Break out other parallelizable tasks
Look for tasks that can run independently of main threads… service requests
Add communication but keep it to a minimum
Also, total time is gated by the largest chunk. Probably not well-suited for games
Can work if you have very few stages. At 30 Hz, the added latency is intolerable
Update loop should generally be single threaded. May be able to pull out some parts, like path-finding, but synchronization concerns limit your options.
File I/O is something that is often put on a separate thread. This can avoid stalls that asynchronous I/O can't always hide.
Normally file I/O is not CPU heavy. That can change now.
File read/write is cheap, but spare threads allow use of aggressive compression
Rendering is usually quite expensive. D3D overhead adds up, and so do scene-traversal costs
Limited number of primitives per second (on modern Windows machines, we recommend expecting about 300 draws per frame for 60 FPS)
Simple in theory: double-buffer all state that affects rendering. Sometimes complicated in practice.
Synchronize once per frame
Graphics fluff is a good candidate because it has few interactions with other data. May not need to run at same frame-rate as game.
Some games are spending 100% of a core on cloth animation. "That's crazy!", or is it brilliant? The main loop of your game may be impossible to multi-thread, in which case the other threads will sit idle unless you add new features.
On PC, graphics fluff can be dropped on single-core machines without affecting game-play. Can be replaced with cheaper alternatives.
This diagram does show good multithreading, but probably not perfect. It relies on spawning extra threads for physics, animation, and particle systems.
It could turn out that this system demands ten hardware threads at some times, and two hardware threads at others. Ideally you should try to have the same number of CPU heavy threads running at all times.
Amdahl&apos;s law—speeding up part of your calculations just leaves the remainder as the single-threaded bottleneck
Middle-ware needs to be flexible enough to adapt to the needs of different games. Physics may be allowed one core—or not.
It is reasonable to have additional threads that are not CPU intensive—blocking on I/O
Segue: one thread per hardware thread, or one per core?
SMT means that two hardware threads are sharing execution resources. They share L1 caches and execution units, but have independent register sets.
If first thread is under utilizing these resources (too many dependency stalls) then another thread can share the resources and total throughput increases.
If first thread is heavily utilizing these resources (well scheduled code) then SMT can't help much.
Cache is often a problem—L1 is small, and two threads may fight over it. Worst case, adding a second thread may reduce performance.
How to tell? Measure. Easy on Xbox 360, trickier on PC.
Scheduler latency is when you have a thread that is ready to run but the OS waits for the current scheduling quantum to expire before running the thread. If you put a thread on its own hardware thread—even just an SMT thread—then it can wake up faster. This works well if you have a thread that mostly sleeps but needs to wake quickly on demand.
There can be multiple threads per core, multiple cores per chip, multiple chips per socket, and multiple sockets per computer. Identifying shared L1 caches can help with decisions about how many processors to use.
The non-uniformity of hardware threads is one reason why setting thread affinity is problematic on PC.
Now, some examples.
Almost finished on Xbox in August 2004—then moved to Xbox 360
Mostly single-threaded game
CPU usage split was 51/49 for update/render—perfect
3-MB buffer to describe rendering (not always filled), took ~1-2 ms to fill buffer, ~33 ms to render
Decompression thread saved space on DVD and improved load times, cost was some spare CPU cycles. Actually two threads for file I/O—one for reading, one for decompressing, because some calls can block for ~0.5s doing directory lookups
First the update thread fills buffer 0. The render thread is idle.
Then the update thread fills buffer 1. While it is doing this the render thread can run, reading from buffer 0.
Then the threads swap buffers.
This process continues (go back and forth with the arrow keys).
Multi-threading was added very late—~6 months before launch—but it worked
This shows the distribution of threads to cores and hardware threads. Note that one hardware thread is unused. That's okay—it ensures that rendering runs at top speed.
There were a few other threads (audio processing, etc.) but not many—roughly one CPU intensive thread per core
Cores 0 and 1 were ~80-99% utilized, and core 2 was typically 50% utilized, for total CPU usage of ~2.2-2.5 cores, or ~7-8 GHz
This title is also on Xbox 360.
Things to notice: rendering on same thread as update. Two decompression threads. One unused thread, to leave all cycles to audio.
Audio was a problem in this title. The update thread and crowd update threads both need to trigger sounds, which required grabbing a critical section that the Audio update thread was often holding.
Things to point out:
_beginthreadex is required on Windows to ensure that the CRT is initialized with any TLS required.
Optional on Xbox 360 (can use CreateThread, only difference is return type of thread-creation function and thread function)
Specifying the stack size is important on Xbox 360 to avoid wasting memory. It should generally be a multiple of 64 KB.
Waiting on the thread handle is how you tell when a thread has terminated—don't busy wait for this!
Return value is a thread handle, must be closed when not needed anymore.
Thread affinity is completely manual on Xbox 360. Generally best to let the OS do it on Windows, unless you really know what you're doing. Can easily reduce performance by poor understanding of processor topology (overusing two hardware threads on one core, while leaving a thread idle), or by poor interactions with other processes—your threads unable to run despite having idle hardware threads.
Thread creation is expensive. Don't do it often. If a thread is temporarily unneeded, leave it waiting on an event or semaphore.
This code specifies the stack size, uses _beginthreadex, properly waits for the child to terminate, closes the handle, and specifies the thread affinity on Xbox 360 but not Win32. Perfection!
If you don't need the handle, close it immediately.
Some areas where OpenMP has been used include:
Particles
Skinning
Physics
Usually minimal benefit, due to limited scope
This guarantees that ManipulateSharedData() is only executed by one thread at a time.
But, mutexes are not the cheapest option...
Critical sections are much cheaper. On Xbox 360 and on Windows they run roughly 20x faster.
Two restrictions: cannot be used between processes, and cannot be used with WaitForMultipleObjects
Mutexes are kernel objects, so they require a kernel transition, whereas critical sections are user-space objects. Mutexes are more robust in the face of thread death.
CRITICAL_SECTION is a good optimization, but... the key optimization is: don't synchronize too often
x86 CPUs can reorder reads
Xbox 360 CPUs can reorder reads and writes—despite being in-order CPUs
g_msg.filled must be marked volatile or else both loops will tend to spin forever.
Even then, with many compiler/platform pairs there is nothing to stop the write to g_msg.filled from being reordered. In SendMessage the write to g_msg.filled might be visible before the write to g_msg.data is visible.
Similarly, in GetMessage the reads of the message data might come from L2 before the read of g_msg.filled.
Both types of reordering can happen on the Xbox 360 CPU, and on many compilers.
Different hardware threads don't talk to each other directly—they talk to shared memory/shared L2. Thus, if you prevent reordering in SendMessage that guarantees that the writes get to L2 in order. However, you still have to separately guarantee that reads come from L2 in order in GetMessage.
Crucial observation: Lockless programming can be fast, but it is still a type of synchronization, and is more expensive than no synchronization
Particularly tricky prior to VS 2005—poorly defined guarantees from volatile
Particularly tricky on Xbox 360—volatile and InterlockedXxx semantics are slightly different and don't prevent CPU reordering of reads and writes—need explicit memory barriers.
Requiring exclusive access to a popular resource can make multi-threading a complex way of doing single-threading on multiple threads
Ideally you want to use synchronization primitives to guarantee multiple threads won't modify resources simultaneously, while designing so that they generally won't anyway.
Sometimes it is worth doing a short spin-lock on resources that are likely to be held for only a short time. InitializeCriticalSectionAndSpinCount supports this.
g_resources holds a list of pointers to all loaded resources. It is referenced frequently by the render thread as it needs meshes, textures, shaders, etc.
The load thread needs to make resources available once they are loaded.
If the decompression thread locks g_resources while it decompresses, or while it does file I/O, then the render thread may be locked out for long periods.
If g_resources is shared at all, then every reference by the render thread requires synchronization, wasting time on acquiring and releasing locks.
Best design is two (or more) vectors, to insulate threads from each other. Private data is good.
Anecdote about profile capture completely hiding critical issue (code was waiting on GPU, but only when not profiling. Same thing happened waiting on load thread)
I actually saw a title that had instrumented a ton of functions but then stored the results to a shared array, using critical sections to guard it. About 90% of their synchronization was in the profile functions.
Synchronization stalls are hard to locate
Use Timing Capture on Xbox 360 to visualize threading behavior
Add instrumentation to make visualization easier
This wacky trick makes the name available for Visual Studio, WinDBG, etc, on Xbox 360 and on Windows. It also makes the name available to some other tools, like the PIX timing capture.
The VS screenshot of a Windows app shows just two threads (named), and the VS screenshot of an Xbox 360 app shows... more.
If your multi-threaded code is not tested on multi-proc systems, it will fail!
Mention that this is complicated by the fact that some early releases of processor drivers have bugs where QPC/QPF relies on RDTSC and therefore exhibits the problem. The fix is to get the latest processor driver from AMD website.