Multithreading is a huge topic and this article is really just giving an introduction to many terms and
concepts. It is aimed at the threading newcomer and intends to highlight some of the problems
associated with writing multithreaded applications.
Firstly I'll cover the basics on threads – what are they, and how do they relate to processes. Next
there are some guidelines of where threads can be useful, and when not to use them. Then I'll
explain the “primitives”, the underlying building blocks provided by the operating system kernel to
allow multithreading programs to be written.
Following that I cover some concepts that do not at first seem intuitive: memory barriers – what
they are and why they are needed; race conditions – with specific examples of deadlock and
livelock; and priority inversion – where a low priority task gets more processor time than a high
Lastly a look at the where the world of threading is headed.
Tasks, processes and threads
In the beginning there was synchronous execution. A list of commands were executed in order, one
at a time. As time progressed, more and more was expected from an operating system.
Multitasking allowed multiple programs to appear to be running at the same time, while in actual
fact the CPU was still only capable of running one thing at a time.
In order to switch computation from one task to another there would be a context switch. The
execution details of one task were saved out of the registers in the CPU, and details of another task
were moved in, and execution of the new task would resume. Multitasking appeared in two forms:
cooperative and pre-emptive. With cooperative multitasking, it is up to the application to relinquish
control, whereas with pre-emptive multitasking the OS kernel will interrupt the task “mid flow” and
switch it out. The primary problem with cooperative multitasking is that it only takes one badly
behaved process to disrupt the entire system.
Processes can be thought of as tasks that have certain isolation guarantees with respect to memory
and resources. Threads are created by a process – from its “main thread”. Created threads share
memory and resources with the thread that created it.
Shared access to memory and resources combined with pre-emptive multitasking is where the
complexity of multithreaded programming comes in.
When to use threads
Unfortunately there is no black and white answer for this question. There are several important
questions that you should ask yourself before diving into multiple threads:
Are shared resources really needed?
Is the extra complexity worth the effort?
There are many cases where it seems desirable to have separate threads – particularly when there
may be clean separation in processing. An important question here is “should they be separate
threads or separate processes?” Do both tasks really need access to shared resources or is there just
information being passed from one to the other? Two single threaded programs are often better
than one multithreaded program due to the following points:
Single threaded code is easier to write – no resource synchronization needed
Single threaded programs are easier to debug
Single threaded programs are easier to maintain – no threading knowledge needed
Separate programs are easier to test – each can be tested in isolation
Threading adds complexity to often already complex situations. A good rule of thumb is “Is the
complexity that is added through the use of threads significantly reducing the complexity of the
problem?” If the answer is “NO” then you may want to strongly reconsider the use of threads.
However there may be other overriding factors that push strongly towards a threaded solution:
Complete processor utilisation
By their nature GUI programs are event driven. For much of the time, the programs are waiting for
the user to click or type something. GUI programs are normally written with a main event loop
where the application needs to respond to “desktop events” such as repaint or move as well as user
events. When applications fail to respond to these events in a timely manner, the operating system
often marks the processes as non-responsive and the application can appear frozen.
There are often situations in writing an application where long running functions are needed (for
some arbitrary definition of long) – for example: reading or writing files, running a spell check, or
encoding some data. In order to have a responsive GUI for the application it is not desirable to have
these functions running in the same thread as the main event processing loop.
Complete processor utilisation
Some classes of problems lend themselves to being easily broken into independent parts. Some
classes of problems need to be solved as soon as possible. Some classes of problems are purely
calculation intensive where the problem is not waiting on other results or information from other
sources. The intersection of these three sets of problems are candidates for a solution where all the
processors in a machine could be fully utilised. In situations where separate processes are not
viable, having a number of threads to do the processing will provide a time benefit. One of the
tricks is not to have more processing threads than effective CPUs (effective as a single
hyperthreading CPU can appear as two). The reason for this is that if all the threads are CPU bound
(which means the only thing slowing the calculation of the results is the speed of the processor),
then ideally there should be one thread assigned to each effective CPU. If there are more
calculation threads than CPUs then more context switching occurs as the active threads compete
with each other for the processors. Every context switch has an overhead which slows down the
results generation even more.
Using threads in this way are often referred to as “worker threads”. A point to consider when
dealing with worker threads is when to create them. There are always overheads involved when
creating and destroying threads. The amount of overhead varies from platform to platform. It is a
common idiom to use a collection of threads that are started prior to any work being available for
them. These threads then wait for work to do. The collection of threads are often referred to as a
“thread pool” or “work crew”. Since the thread generating the work does not pass it on to a specific
thread, there needs to be some abstraction between the threads generating the work and the threads
processing the work. This is often implemented using a queue. The work generating threads put
work on the queue, and the workers take work from the queue. In situations such as this there are
normally two types of work: those that have feedback – results passed back to the work generators;
and those that do not – for example the worker threads save results directly to a database. How the
feedback from the worker threads is handled depends greatly on the problem – suffice to say there
are many ways to do it.
Whether on the client or the server side, threads are often used when communicating over a
network. On the client side threads are used when requesting a response that may take the server
some time to respond to – especially if the client side application has more to do than just wait for
the response such as responding to GUI events.
Threads are more common on the server side where the server may be listening on a socket. When
a client connects a socket descriptor is generated that represents the connection between the client
and the server. The socket descriptor can then be used independently of waiting for more
connections. In order for the server application to remain responsive to client connections, the
actual work with the socket descriptors is normally handled by a different thread than the thread
where the program is accepting connections.
Thread pools are frequently used to handle server side execution where the threads wait for the
primary thread to accept a client connection. The socket descriptor is then passed to an idle worker
thread in the thread pool. This thread then handles the conversation with the client and performs the
necessary actions to produce results for the client. Once the conversation with the client is finished,
the worker thread releases the socket descriptor and goes back into a waiting state.
Providing an API to start, stop and wait for threads is really just the tip of a much bigger iceberg of
necessary functions. The simplest of these functions are atomic operations. These are functions,
provided by the OS kernel, that are guaranteed to complete before the task is switched out.
As well as atomic additions and subtractions, there are also functions that perform more than one
“action” such as “decrement and test” and “compare and swap”.
Decrement and test subtracts one from the atomic variable and returns true if the atomic
variable is now zero.
Compare and swap takes three parameters, the atomic variable, the expected current value,
and the new value. If the atomic variable is the expected value then it is updated to the new
value and returns true. If the atomic variable is not the expected value, then the expected
current value is updated to what the atomic variable currently holds and the function returns
These atomic operations are then used to build more complex primitives.
In order to provide some isolation guarantees for multiple threads within one process, the operating
system provides a way to define mutual exclusion blocks. The primitive that provides the interface
to these blocks is often referred to as a mutex (mut-ual ex-clusion). The key to how the mutex
works is in the acquiring and releasing. When a mutex is acquired, it is effectively marked as being
owned by the thread that acquired it. If another thread attempts to acquire the mutex while it is still
owned by a different thread, the acquiring thread blocks waiting for the mutex to be released.
When the mutex is released it is marked as “unowned”. In the case where there is another thread
waiting, it is effectively woken up and given ownership of the mutex. Some care needs to be taken
acquiring mutex ownership as some implementations will not check for self ownership before
waiting, so it is possible for a single thread to wait for itself to relinquish ownership – not a
situation you really want to be in.
Any non-trivial data structure uses multiple member variables to define the structure's state. When
the data structure is being modified it is possible that the thread that is updating the data structure
could be switched out. Another thread that wants to access the shared data structure could be
switched in. Without the use of mutual exclusion blocks it is possible that the data structure's state
is inconsistent. Once this happens the program is often in the world of undefined behaviour.
Consider the following example:
There is a container of Foo objects called foo.
A mutex to protect foo access called m
acquire mutex m
bar = reference to a value in foo
release mutex m
acquire mutex m
release mutex m
What is protecting bar? It is possible that thread 2 may modify the value of bar while thread 1 is
trying to use it. Protecting memory access between threads is almost always more complex than it
first looks. One approach is to put mutexes around everything, however this does little more than
serialise the application and prohibits the concurrency that is intended through the use of threads.
Multiple threads with extraneous mutexes often perform worse than a single threaded approach due
to the additional overhead of context switches.
The next two primitives are more for synchronization than mutual exclusion. They are events and
semaphores. Events and semaphores can be used to initiate actions on threads in response to things
happening on a different thread of execution. Unfortunately there is no agreed standard on what an
event is, and the two most common platforms (Win32 and posix) handle things quite differently.
Without going into too much detail, you can have threads waiting on an event. These threads block
until the event occurs. When the event is signalled (from a non-blocking thread obviously) one or
more threads (implementation dependant) will wake up and continue executing.
An example of using an event would be a client-server system where the client is able to cancel
long running calls. One way of doing this is to have the thread that is communicating with the
client not actually do the work, but instead get another thread to do the work and be told when the
work is complete. The way in which the servicing thread is informed of complete work is through
the use of an event.
gets client request
passes on to worker thread
waits on event
gets work for client
processes for a long time
signals event when complete
wakes on event signal
returns result to client
What advantage does this give over just one thread? Implemented this way allows the possibility of
having either time-outs for long work or client driven cancellation. For client driven cancellation it
may work something like this:
gets client requesting
passes on to worker thread
waits on event
gets work for client
processes for a long time
client cancels original request
sets cancelled flag
wakes on event signal
returns cancelled request
What happens to the worker thread depends on the problem. The worker thread may be allowed to
run to completion and discard results or it maybe be possible to interrupt the worker thread prior to
For time-outs instead of having a client connection initiating the cancellation it is some “watchdog”
thread whose job it is to observe running work and cancel work that runs overtime.
A semaphore can be thought of as a counter that can be waited on. A semaphore is initialised with
a non-negative number. When a thread wants to acquire a semaphore, this value is checked. If the
value is positive, then the value is decremented and execution continues. If the value is zero, then
the thread blocks. The semaphore value is incremented by a non-blocked thread calling the “signal”
or “release” semaphore function. If there is a waiting thread, it will awaken, decrement the value
and continue executing.
Semaphores are often used where there are multiple copies of a resource to be managed. For
example a database connection pool. Instead of initiating a connection every time the database
needs to be accessed, a number of connections are created “ahead of time” and await use. One way
of managing this pool of connections is to create a semaphore with an initial count of the number of
available connections. When a thread requests a connection it does so by acquiring the semaphore.
If the value of the semaphore is non-zero the semaphore is decremented which means there is a
database connection available. When the thread is finished with the connection, the connection is
returned to the pool and the semaphore is incremented. If the semaphore is zero when a connection
is requested, the thread blocks until the semaphore is incremented. When a connection is returned
to the pool and the semaphore incremented a thread waiting on the semaphore will awaken –
decrementing the semaphore in the process, allowing it to get the connection.
Advances in processor efficiency have led to many tricks and optimisations that the processor can
do in order to speed up memory access. Grouping together memory reads and memory writes,
along with block reads and writes are some of the tricks that are done to make access faster. The
reordering of memory reads and writes are a special case of the more general optimisation referred
to as “execution out of order”. The commands are grouped in such a way that any grouping that is
done is transparent to a single thread of execution. The problem occurs when there is more than
one thread sharing the memory that is being read or written to in the optimised manner.
Suppose we have two threads as illustrated by this pseudo code:
x = 0, y = 0
while x == 0 loop
y = 42
It might seem reasonable to expect the print statement from Thread 1 to show the value 42, however
it is possible that the processor may store the value for x before y, and hence the printed value may
A memory barrier is a general term that is used to refer to processor instructions that inhibit the
“optimal” reordering of memory access. A memory barrier is used to make sure that all memory
loads and saves before the barrier happen before any loads and saves after the barrier. The actual
behaviour of memory barriers varies from platform to platform.
Many synchronization primitives have implicit memory barriers associated with them. For example
it is usual for a mutex to have a memory barrier defined at the point of acquiring and at the point of
As well as processor execution out of order, there are also issues with compiler optimisations.
What a compiler may do depends on the memory model for the language that is being used. The
more thread aware languages provide keywords which will stop the compiler making certain
assumptions, for example the volatile keyword in Java.
A race condition is a catch-all term for situations where the timing of the code execution can impact
the result. The definition provided by the Free On-Line Dictionary Of Computing  is
“Anomalous behavior due to unexpected critical dependence on the relative timing of events”.
Many of the well known problems associated with multithreaded code are types of race conditions.
A large number of race conditions can be solved through the use of synchronization primitives.
However overly liberal use of synchronization primitives will have a performance impact and can
also bring a different type of problem into the picture.
Probably the best known result of a race condition is deadlock. A deadlock is where two or more
threads are waiting for resources that the other waiting threads own - so each waiting thread waits
There are four necessary conditions for deadlocks to occur. These were initially defined in a 1971
paper by Coffman, Elphick, and Shoshani .
1) Tasks claim exclusive control of the resources they require (“mutual exclusion” condition).
2) Tasks hold resources already allocated to them while waiting for additional resources (“wait
3) Resources cannot be forcibly removed from the tasks holding them until the resources are
used to completion (“no pre-emption” condition).
4) A circular chain of tasks exists, such that each task holds one or more resources that are
being requested by the next task in the chain (“circular wait” condition).
Having these conditions in the code does not necessarily mean that deadlocks will occur, just that
they could, and hence the relative timing of events in different threads can lead to this race
There are several strategies for avoiding deadlocks, the simplest is to avoid the circular chain. This
is normally achieved by specifying a particular order in which locks must be acquired if multiple
locks are needed.
A less well known problem is that of livelock. A livelock is where two or more threads
continuously execute but make no progress. For example two threads are both doing calculations
while some value is true, but they are doing it in such a way that each invalidates the other's test,
causing both threads to continue for ever.
A real world example of live lock is two people walking towards each other in a corridor. Each
person trying to be polite steps to the side, however they just end up mirroring the other persons
movement and end up moving from side to side. Luckily we don't normally end up stuck forever in
Priority inversion is more of a problem for real-time systems, but it can occur on any system where
threads or processes can have priorities assigned to them. This ends up being a scheduling problem
and occurs when a low priority task has ownership of a resource that a high priority task needs.
In these situations the low priority task ends up taking precedence over the high priority task. This
can be complicated even more by there being another task running with a priority between the high
and low task. The medium priority task takes precedence over the low priority task which delays
even longer the releasing of the resource that the high priority task needs.
In many cases, priority inversion passes without much notice, but it can cause serious problems, as
in the case with the “Mars Pathfinder”  where the craft was resetting the computer frequently.
Where to from here?
We are now in the situation where multi-core processors are becoming commonplace. In order to
write software that will make full use of new hardware, multithreading is going to be used more and
Languages that have defined memory models are considering them with respect to changes in
hardware technology. Languages that don't, like C++, are looking to define them. A particular
interest is the possibility of adding language constructs that will allow the automatic use of threads
to speed up loops.
Another area of research that is likely to see more interest is lock-free data structures and wait-free
algorithms. Lock-free data structures are possible to write by using the atomic operations supplied
by the operating system. However due to the more inherent complexities of non-trivial data
structures, they are hard to get right.
Wait-free algorithms are of particular interest to real time systems due to the fact that if a high
priority task is never blocked, then priority inversion does not happen.
If you are looking at using multiple threads first stop and think. Are they really necessary? Is the
added complexity from threading going to reduce the complexity of the problem? Would separate
processes be better?
If after careful consideration you decide that they are in fact going to help, then go forward with
care. Protect your shared data structures with mutual exclusion blocks or cunningly-devised lock-
free algorithms. Avoid circular dependencies on mutexes, and use the correct synchronisation
methods for the problem.