Multi core programming
● Most modern computers and even cell phones
have more than one physical device that can run
code.
● These devices, whether you call them cores or
CPUs, can operate at the same time.
● The idea was to speed up processing when
making the transistor smaller became too
difficult.
● Add more cores and you are done, right?!?
Adding CPUs is not everything...
● There are resources which are shared between CPUs, and
as you add CPUs these resources become bottlenecks and
start giving you headaches...
● This means that there is a very real limit to the number of
CPUs you can just add.
● In addition each CPU is very expensive in terms of power
consumption and other internal infrastructure (caches etc)
● This makes you want to share this infrastructure between
CPUs.
● In comes multi-core.
Multi Core vs Multi CPU
● The terms multi-core and multi-CPU are heavily
abused and used with various meanings.
● We will use the following meanings:
● A multi-CPU machine is a machine with more
than one execution unit where little is shared.
● A multi-core machine is a machine where more
is shared between the CPUs.
● This definition is not very precise.
There is always sharing
● CPUs always share some things...
● Main memory is a good example – if both CPUs
try to access main memory at the same time
one of them is delayed a little.
● Multi core CPUs share more than just main
memory: various caches, parts of the execution
pipeline, interrupt controller and more.
● What exactly is shared depends on the
specific CPU model.
Why should I care about Multi core?
● The main reason is performance.
● If two CPUs that you are running on share resources,
then they can, under some conditions, delay
one another.
● This could be costly.
● Knowing what could be delayed would mean
that you can program differently and avoid the
delays.
Types of sharing we don't care about
● There are other types of sharing we don't care
about, like voltage controllers and physical sockets.
● These do not cause any delays of significance.
● In general, the more sharing there is, the more the
CPUs influence one another and the more
potential delays can be created, and therefore the
more room we have for optimisation.
How do I know how many cores I have?
● Read the manual of your motherboard?!?
● $ cat /proc/cpuinfo | grep processor
● Number of configured processors: $ getconf _NPROCESSORS_CONF
● Number of online processors: $ getconf _NPROCESSORS_ONLN
● $ ls /sys/devices/system/cpu/
● Programmatically: sysconf(3)
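A minimal sketch of the sysconf(3) route (the _SC_NPROCESSORS_* names are glibc extensions):

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      /* processors configured on the system vs. currently online */
      long conf = sysconf(_SC_NPROCESSORS_CONF);
      long onln = sysconf(_SC_NPROCESSORS_ONLN);
      printf("configured: %ld, online: %ld\n", conf, onln);
      return 0;
  }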
Number of cores is not constant
● Linux supports hot plugging of CPUs on various
architectures.
● Most user space programmers don't care too
much about this.
● Adding a CPU is not a big problem.
● Removing a CPU works by moving load from a
running core to other cores.
● Of course, you can't remove the last core.
What is the maximum number of
CPUs I can run on?
● Sometimes you need access to this information in order to hold
data per CPU.
● you can ask how many processors are currently online but new
processors may be added at runtime.
● The operating system kernel is compiled with a maximum number
of CPUs in its data structures. You can use that number in your
code as well.
● you can get that from the kernel headers package.
● or you can read /sys/devices/system/cpu/kernel_max
How multicore vs multi-CPU affects programming
● Multicore CPUs will see each other's memory
changes sooner.
● Multicore CPUs will slow each other down more,
since they share more resources.
● Memory communication between multicore
processes will be faster
● In general, our intuition about parallel machines
matches multi-CPU machines more closely
than multi-core machines.
Multicore and Multi-CPU are not mutually exclusive
● In today's architectures it is common to find a
single system which is both multi-core and
multi-CPU.
● A single system could have 4 cores, packed as
a 2x2 multi-core system.
● This means that the two pairs of CPUs share
little while each pair shares a lot.
● You need to know about all of this – this is
called the CPU topology.
Does anyone look at the CPU topology?
● Yes, the OS kernel knows about it and uses this
information for better scheduling and other
decisions.
● You can use this information too, e.g. from
/proc/cpuinfo and /sys/devices/system/cpu/.
● You can do that at compile time to produce
various versions for various topologies, using a
config file per topology, or even dynamically at
run time.
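For example, on a typical Linux system (exact paths may vary by kernel version):

  $ cat /sys/devices/system/cpu/cpu0/topology/core_id
  $ cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
  $ lscpu    # summarizes sockets, cores, threads and caches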
A little bit about timing...
● In order to appreciate performance benefits we need to
know a little about timings. The following are only ballpark
figures:
● Single machine instruction without memory access or
that only accesses the cache: nanoseconds...
● Single machine instruction which accesses main
memory: hundreds of nanoseconds
● Single OS system call: microsecond
● Read data from physical hard drive: milliseconds
What are pipelines in the CPU?
● A clock in a CPU is simply an oscillator crystal, ticking
at a constant rate, that synchronizes the operation
of all the parts of the CPU.
● On a steady clock CPU the clock cycle also has to be
longer than the longest executing instruction.
● This means that it is difficult to increase the
clock rate. Even if most instructions are very
short the longest one determines the clock rate.
● One solution was to simplify instructions: break
down complicated instructions into simple ones.
● Now the longest instruction is much shorter and
the CPU clock could be increased.
● But now you need to fetch more instructions and
have a bigger instruction cache.
● Compilers also work harder because even simple
operations require generation of several machine
instructions - and object files become bigger.
● A different solution was to keep instructions complex but break
them down inside the CPU.
● Each instruction will be broken up to several stages.
● This way several instructions can be executed at the same time,
because every stage is executed separately.
● Clock rate can now be increased to match the time it takes the
most complicated stage of the most complicated instruction to
complete.
● Compilers don't have to work hard.
● Because increasing the clock became hard, engineers decided on
a pipeline approach to increasing the CPU speed.
How do pipelines affect programming?
● Because of pipelining the CPU has to predict
branches, and mispredictions are costly.
● We can tell the CPU how likely/unlikely our
branch is using hints to the compiler.
● Code which is very irregular (noisy and hard
to predict) will be slower because it will confuse
the branch prediction in the CPU.
● See example...
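A minimal sketch of such a hint using GCC's __builtin_expect (the likely/unlikely macro names are a common convention, not part of the compiler):

  /* tell GCC which way the branch usually goes */
  #define likely(x)   __builtin_expect(!!(x), 1)
  #define unlikely(x) __builtin_expect(!!(x), 0)

  long sum_nonnegative(const long *v, long n)
  {
      long sum = 0;
      for (long i = 0; i < n; i++) {
          if (unlikely(v[i] < 0))  /* we expect almost no negative inputs */
              continue;
          sum += v[i];
      }
      return sum;
  }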
What are cache lines?
● when the CPU fetches memory it does not fetch it
one address at a time.
● instead it fetches a full "cache line".
● Typical cache lines are 16 to 128 bytes; the exact size
is platform dependent.
● For example, on current Intel processors cache
lines are 64 bytes.
● A cache line should not be confused with a memory
page, which is a much bigger construct (typically 4096 bytes).
How to get the cache line size?
● On the command line:
● $ getconf LEVEL1_DCACHE_LINESIZE
● $ cat /proc/cpuinfo | grep "clflush size"
● $ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
● Or use sysconf(3).
● Or hardcode it in your build process.
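A minimal sketch of the sysconf(3) route (the _SC_LEVEL1_DCACHE_LINESIZE name is a glibc extension):

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      /* 0 or -1 means the value is not available on this system */
      long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
      printf("L1 dcache line size: %ld bytes\n", line);
      return 0;
  }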
How do you pad a structure to have
the size rounded to cache line size?
● Do it at compile time (of course).
● Add a char[n] member to your class/struct/...
● Or use special compiler padding syntax.
● Sometimes, this is counterproductive: when two
of these structures could fit into a single cache
line, for instance. In these cases laying out the
memory of the entire data structure is a better
approach.
● Don't do this without thinking about it!
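A sketch of both approaches, assuming a 64-byte cache line (the attribute is GCC/Clang syntax):

  #define CACHE_LINE 64  /* assumed; obtain it as shown earlier */

  /* option 1: explicit char[n] padding */
  struct padded {
      long value;
      char pad[CACHE_LINE - sizeof(long)];
  };

  /* option 2: compiler padding/alignment syntax */
  struct aligned {
      long value;
  } __attribute__((aligned(CACHE_LINE)));

Both structures end up sizeof == 64 with these assumptions.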
How pipeline organization in a core affects programming
● Example: A CPU has two pipelines: one for
integer computation and one for floating point.
● Then it is better to have integer instructions
interleaved with floating point ones (they will be
executed in parallel).
● The problem is that this type of advice is very
hard to follow in practice.
● And the compiler is supposed to be able to do
that for you anyway...
How pipeline organization in a core affects programming (cont)
● The CPU rearranges the order in which it does
loads and stores, as a side effect of pipeline
organization.
● The semantics of your code remain the same, the
order does not matter.
● Usually you don't care about this and should leave
it alone.
● Except in cases where you are writing a
synchronization mechanism or talking to hardware.
● In that case you would use a machine barrier to
prevent this reordering (will be explained later).
The problems of memory
● Why do CPUs need a memory cache at all? The idea is
that not all memory operates at the same speed.
● There are, and always will be, better technologies for
faster memory, and older technologies for slower memory.
● There is a huge monetary difference between these
types of memory.
● Why didn't they make the entire airplane out of the stuff
that makes up the black box? ...:)
● So can we use the fact that we have fast and
expensive memory as opposed to slower and less
expensive memory to speed up the computer? Yes.
Symmetric vs non-symmetric memory access
● Access to main memory is not uniform for all CPUs or
cores in the system
● When the access is the same (there are various
techniques to do that) then the system is called UMA
– Uniform Memory Access.
● When the system is not symmetric (e.g. main
memory is connected to CPU0 and CPU1 has to ask
CPU0 for access) then the system is called NUMA –
Non Uniform Memory Access.
● There are other types of NUMA systems as well
(more on that later)
The CPU cache
● A relatively small area of memory within the CPU that is made out
of the fast, expensive memory, and which stores a copy of some
addresses accessed by the CPU.
● The algorithm that chooses which addresses to store in the cache
varies across different CPUs:
some store the last addresses accessed, some store the most
heavily used ones.
● The idea is to reduce the AVERAGE time it takes to access a given
address.
● Rests upon the assumption that software is not chaotic and tends
to reuse the same addresses most of the time.
● If you access memory randomly you will break this assumption
and will thrash the cache (discussed later).
Types of caches
● In reality CPUs have more than one type of
cache:
● DCACHE: data cache used for data access
(read and write)
● ICACHE: instruction cache (used to cache
execute only memory which is your code)
● TLB: Translation Lookaside Buffer. Speeds up
virtual to physical address translation.
Number of caches
● In reality CPUs usually have more than one
level of cache.
● As you get away from the core caches become
bigger and cache lines become bigger.
● But at the same time access becomes slower.
● Until we reach main memory, which is the biggest
and slowest.
● These are called L1, L2, L3, ...
Thrashing
● Thrashing: using many resources to do little work. Bad.
● A piece of software allocates a huge amount of RAM and then
accesses addresses at random, touching each address only once.
● A piece of software works, at any given point in time, with so much RAM that the
same address is never accessed twice.
● Two threads or processes that share a cache: each one alone would run without
cache misses, but together they produce lots of cache misses.
● In all of these cases the cache is used (it is in hardware after all) but it is not
effective.
● The cache is only useful if you have low cache miss rates.
Coherent vs non coherent platforms
● Coherency is a very important subject.
● A perfectly coherent platform is one where all
components of the system (IO devices, all cores and
CPUS) see memory in exactly the same way.
● Coherency is a hardware feature which can be
controlled by software (special machine instructions).
● Most real systems are incoherent in some ways but not
in others.
● Some systems can be used in both coherent and
non-coherent modes.
Coherency - Example
● Intel is said to be a cache coherent platform.
● This means that two processors see the same memory.
● If one processor puts data in its cache then the other processor will
respect that and will fetch the data from that cache.
● This slows down both CPUs a little.
● But it makes programming a whole lot easier.
● But the same platform (Intel) is not coherent between CPU and
peripheral devices.
● A special instruction needs to be used to make sure that the CPU sees
data put into main memory by a peripheral, and the reverse.
Coherency and memory reordering
● Coherency has nothing to do with memory
reordering.
● A platform can be coherent but still have
memory reordering (Intel is an example). A
platform can also be non-coherent and have no
memory reordering (older RISC systems).
● In theory different instructions should deal with
memory reordering and coherency but in
practice one instruction usually does both.
Optimizing CPU cache utilization
● First find the CPU cache line size
● Then define the most common structures to fit
inside a cache line.
● Another idea is to split commonly used data from
uncommonly used data. This will make sure that
the commonly used data will fit into one cache line.
● You can pad your structures to a cache line size
using compiler hints.
● See example....
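A sketch of the hot/cold split idea; the struct and its field names are invented for illustration:

  /* hot fields: touched on every operation, kept together so they
     land in one cache line */
  struct conn_hot {
      int    fd;
      int    state;
      long   bytes_pending;
      struct conn_cold *cold;  /* rarely followed */
  };

  /* cold fields: only touched during setup or logging */
  struct conn_cold {
      char peer_name[256];
      long created_at;
  };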
CPU counters
● Modern CPUs hold counters for all of the above:
● TLB misses, data cache misses, instruction cache
misses, branch prediction misses, instructions
executed and much much more
● some of the counters are only accessible from
kernel mode and most from any mode.
● Writing to them is usually restricted to kernel
mode.
● In order to access them directly you need to write
assembly code, which is non-portable.
CPU counters - problems
● The problem is counting them right, since these
counters are not per thread or per process: the
CPU doesn't know about threads or
processes.
● If you are the only process running then there is no
problem (although you are counting the OS as
well).
● If not, then better solutions are required.
CPU counters – best practices
● Writing your own code in assembly is hard.
Most people don't do that.
● On Linux, use perf.
● Use a library like PAPI to give you the CPU
counters in a portable way.
● See demo and example...
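For example (./myprog is a placeholder for your binary):

  $ perf stat ./myprog                               # cycles, instructions, branches...
  $ perf stat -e cache-misses,branch-misses ./myprog
  $ perf record ./myprog && perf report              # profile, then browse hot spots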
What is NUMA?
● On very large systems, using a single main memory
bank becomes the main bottleneck.
● On such systems the memory is divided into banks,
where each bank is attached to a set of CPUs.
● Example: 4 memory banks of 4GB each, where
each bank is attached to a dual-core processor.
● The banks are connected via a bus to allow each
processor to access any address.
● Access to the local bank is, of course, much faster
than access to a remote bank.
NUMA on Linux
● The Linux kernel knows about NUMA and reads
the NUMA topology at boot time.
● It schedules processes on processors according
to memory affinity.
● It tries to allocate memory for a process on the
memory bank closest to that process.
● Once a thread is running Linux will try very hard
not to migrate it to a core that is on a different
bank since that will cause large memory latency.
NUMA syscalls on Linux
● There are various system calls on Linux that allow
a programmer to control a process's properties in relation to
NUMA:
● mbind allows a programmer to bind memory to a
bank.
● migrate_pages and move_pages move pages from
bank to bank.
● set_mempolicy and get_mempolicy allow control of
the kernel policy regarding NUMA.
● /proc/[pid]/numa_maps shows the NUMA mapping
for a certain process.
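The numactl(8) tool wraps these facilities; for example (./myprog is a placeholder):

  $ numactl --hardware                              # show banks (nodes), their CPUs and memory
  $ numactl --cpunodebind=0 --membind=0 ./myprog    # run pinned to node 0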
Memory policies in programming
● When there is a chance of two threads on two cores
sharing the same cache line reduce it by moving
memory apart or by padding structures.
● When the reverse is true and a flat area of memory is
accessed only by one thread on one core try to make
memory as small as possible.
● When you have frequently used memory next to
infrequently used memory, move them apart (a redesign of
your data structures may be required).
● Measure, measure, measure
● See example....
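A sketch of the first policy: padding per-thread counters so two cores never write to the same cache line (64 bytes assumed):

  #define CACHE_LINE 64  /* assumed; measure on your platform */

  struct per_thread {
      long counter;
      char pad[CACHE_LINE - sizeof(long)];  /* keep neighbours off this line */
  };

  static struct per_thread counters[2];

  static void *worker(void *arg)
  {
      struct per_thread *mine = arg;
      for (long i = 0; i < 100000000; i++)
          mine->counter++;  /* without pad, both cores fight over one line */
      return NULL;
  }

Run one worker thread per element of counters; removing the pad member typically makes the same program measurably slower.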
Concepts and definitions
● In order to deal with programming for multi-core
or even time sharing machines we have to use
various tools and concepts.
● We will now go over these concepts.
● Some of these are low level mechanisms
supplied either by the OS or libraries.
● Some are high level concepts which set a
framework for the programmer to think in.
Atomic operations
● A set of CPU instructions intended to aid low
level programmers in writing synchronization
primitives.
● These instructions can be used from user
space, so you can use them too.
● Either to gain performance benefits in various
situations...
● Or in order to write synchronization primitives of
your own.
Atomic operations (cont)
● Usually the compiler vendor wraps these primitives
in built-in functions (intrinsics).
● You can still write assembly code if you want.
● In GCC and the Intel compiler, see the documentation for
the __sync_* builtins.
● New C++11 APIs (std::atomic) also exist in some compilers.
● Intel Threading Building Blocks has a nice atomic<T>
template.
● These operations are fast (nanosecond time).
Atomic ops : what do they do?
● Guarantee that other CPUs and HW do not see wrong
intermediate “junk” values.
● Guarantee that the CPU caches for this variable will be
synchronized (only needed on platforms that do not synchronize
caches automatically).
● Guarantee that the HW will see the change.
● Prevent reordering of instructions around the atomic
operation.
● Prevent a race on the same operation by two or more cores.
● See example...
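A minimal sketch with the GCC __sync builtins (C11 <stdatomic.h> provides equivalents):

  static long hits;

  void record_hit(void)
  {
      /* atomic read-modify-write: no torn values, no lost increments */
      __sync_fetch_and_add(&hits, 1);
  }

  int try_claim(int *flag)
  {
      /* set *flag to 1 only if it is still 0; nonzero return = success */
      return __sync_bool_compare_and_swap(flag, 0, 1);
  }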
Mutexes
● Definition: an OS/library based construct
enabling two pieces of code to avoid entering a
predefined piece(s) of code at the same time.
● How are mutexes implemented?
● The simplest way is a bit inside the OS kernel,
backed up by a test-and-set CPU machine
instruction.
● This means that in order to lock the mutex you
have to go to the OS (microsecond range).
The problems of mutexes...
● A good design is one in which mutexes are
unlocked most of the time
● Otherwise these mutexes serialize all cores.
● This means that we waste the trip to the OS
kernel because we don't need the OS to put us
to sleep or to wake us up.
● Can't we just lock the mutex from user space?
Enter futex / critical section
● A futex is just like a mutex with the exception
that the state of the lock is shared by all
concurrent processes in user space.
● The test-and-set is executed in user space and
only if it fails (the lock is locked) then we go to
the OS to put the current thread to sleep.
● The reverse applies when leaving the critical
section: if no one is waiting the trip to the OS
kernel is avoided.
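On Linux, pthread mutexes are built on futexes, so plain pthread code already gets this behavior; a minimal sketch:

  #include <pthread.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static long shared_counter;

  void bump(void)
  {
      pthread_mutex_lock(&lock);   /* uncontended: user space only; contended: futex(2) sleep */
      shared_counter++;
      pthread_mutex_unlock(&lock); /* enters the kernel only if someone is waiting */
  }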
Futex robustness
● The problem with this idea is robustness.
● What happens if a process dies holding a
futex?
● In a multi-threaded program the whole program
dies (a thread cannot die alone) so we don't care.
● In a multi-process program we must release all
locks held at time of death.
● This is referred to as “futex robustness”.
● In Linux, futexes can be made robust (see
pthread_mutexattr_setrobust).
Semaphores
● Semaphore is a very overloaded concept.
● Sometimes it is used in place of mutex.
● Sometimes it is used with a meaning of
“number of permits”
● Sometimes it is used as a mutex which has a
name and therefore can be used across multiple
processes, as opposed to only within one multi-threaded process.
● Linux has two kinds of these semaphores: SysV
and POSIX.
Readers/writer locks
● A resource protected by a lock is usually
protected because a write and read at the same
time would create problems.
● Multiple reads are not a problem and do not
need a lock at all.
● If we have many readers compared to writers,
we could gain performance by allowing many
readers into the critical section and not allowing
any writer in at the same time.
● A writer would enter the critical section alone.
● If we only have writers then we serialize all of
them and therefore gain no benefit from this
type of lock.
● How is a readers/writer lock implemented?
Using a regular type of lock.
● In Linux it can even be shared between
processes.
● See: pthread_rwlock_*
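A minimal sketch:

  #include <pthread.h>

  static pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
  static int table[1024];

  int read_entry(int i)
  {
      pthread_rwlock_rdlock(&rwl);   /* many readers may hold the lock at once */
      int v = table[i];
      pthread_rwlock_unlock(&rwl);
      return v;
  }

  void write_entry(int i, int v)
  {
      pthread_rwlock_wrlock(&rwl);   /* a writer enters alone */
      table[i] = v;
      pthread_rwlock_unlock(&rwl);
  }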
Spin locks
● We are talking about user space spin locks here (do not
confuse with kernel space spin locks).
● A spin lock is a lock that spins (busy waits) until a test-and-set
succeeds.
● This is efficient when the operation waited upon is very short
compared to the trip to the OS to lock the structure.
● On the other hand if the lock is not contended then it has the
same speed as a futex.
● In the contended case it may be problematic if the two
contending threads are on the same core. That is why spin locks are
usually used for threads running on different cores.
● See pthread_spin_* on Linux...
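A minimal sketch:

  #include <pthread.h>

  static pthread_spinlock_t sl;
  static long shared;

  void setup(void)
  {
      pthread_spin_init(&sl, PTHREAD_PROCESS_PRIVATE);
  }

  void bump(void)
  {
      pthread_spin_lock(&sl);    /* busy-waits: keep the critical section tiny */
      shared++;
      pthread_spin_unlock(&sl);
  }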
Lock free algorithms
● Are a class of algorithms that require no locks
during their operation.
● They usually require atomic operations instead.
● Examples are a lock free circular buffer, lock
free hash table and more.
● Sometimes these algorithms rely on spinning for
a short while.
● Usually quite a pain to implement correctly.
● But you can use a library instead...
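As an illustration, the push side of a lock-free stack using compare-and-swap (pop is much trickier due to the ABA problem, which is one reason to prefer a library):

  struct node { struct node *next; int value; };

  static struct node *top;   /* stack head */

  void push(struct node *n)
  {
      struct node *old;
      do {
          old = top;
          n->next = old;
          /* retry if another core changed top between the read and the CAS */
      } while (!__sync_bool_compare_and_swap(&top, old, n));
  }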
Dead locks
● A situation in which no work is done because everyone
is waiting on a lock and so no one is releasing any
locks.
● Example: A takes lock “a”, B takes lock “b”, A waits for
lock “b”, B waits for lock “a”.
● Solutions: code carefully, or always acquire locks in the
same order.
● There are all kinds of deadlock detectors available
that have overhead but can be used at development
time and removed in production.
● Dead lock is called dead because all threads are
sleeping when it occurs.
Live locks
● A situation where no work is getting done
although all threads are alive.
● This happens when the threads all yield to one
another instead of acquiring the lock...
● … and no thread actually acquires the lock.
● Since most code doesn't yield this almost never
happens in practice.
● See sched_yield(2) in Linux...
Memory sharing
● Having one piece of memory accessible by
multiple running entities (threads or processes).
● Multiple threads share all memory by default.
● Multiple processes share no memory by
default.
● In Linux there are two types of shared memory:
SYSV IPC and POSIX.
● And there are libraries like ACE and BOOST to wrap
these APIs.
Memory sharing (cont)
● If you share data in the shared memory then
you have to either:
● Have a piece of data per process and then you
don't need to protect it.
● Have multiple processes access the same data and
then protect it via all the mechanisms described
earlier.
● Use a specialized lock like readers/writer lock.
● Use a specialized framework like RCU.
● Use a lock free algorithm.
Memory sharing (cont)
● In Linux most synchronization primitives can be
placed into shared memory.
● This means you can use a futex, spinlock,
readers/writer lock or an atomic variable to
synchronize processes.
● That is the reason that futexes in Linux were
implemented as robust futexes.
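A sketch: a process-shared mutex placed in POSIX shared memory ("/demo" is an arbitrary name; error handling omitted; link with -pthread, and with -lrt on older glibc):

  #include <fcntl.h>
  #include <pthread.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = shm_open("/demo", O_CREAT | O_RDWR, 0600);
      ftruncate(fd, sizeof(pthread_mutex_t));
      pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);

      pthread_mutexattr_t a;
      pthread_mutexattr_init(&a);
      pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
      pthread_mutex_init(m, &a);   /* one process initializes; others just use it */

      pthread_mutex_lock(m);
      /* ... touch data in the shared mapping ... */
      pthread_mutex_unlock(m);
      return 0;
  }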
Message queues
● A mechanism by which two processes or
threads could pass messages one to the other.
● If used for multi-threading it is handled by the
threading library, in user space.
● If used between processes it is an OS
managed resource, is limited, and has efficiency
issues, since messages are copied from the source
process to the OS and then from the OS to the target
process.
Message queues (cont...)
● In a multi-threaded environment message
queues are used to pass pointers to messages
around, and serve more as a synchronization
mechanism.
● This is because all threads share memory.
● ACE and BOOST both implement queues.
Barriers
● A synchronization primitive that has a number
associated with it.
● When a thread arrives at a barrier it goes to sleep.
● When enough threads arrive at the barrier (the number
specified at the barrier creation) the barrier is “broken”
and all the threads continue to run.
● Typical use: real time application time slice work.
● Linux: see pthread_barrier_*
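A minimal sketch with four worker threads:

  #include <pthread.h>

  #define NTHREADS 4
  static pthread_barrier_t bar;

  static void *worker(void *arg)
  {
      /* ... compute this thread's slice of the work ... */
      pthread_barrier_wait(&bar);  /* sleep until all NTHREADS arrive */
      /* ... every slice is now done; continue with the next step ... */
      return NULL;
  }

  int main(void)
  {
      pthread_t t[NTHREADS];
      pthread_barrier_init(&bar, NULL, NTHREADS);
      for (int i = 0; i < NTHREADS; i++)
          pthread_create(&t[i], NULL, worker, NULL);
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(t[i], NULL);
      pthread_barrier_destroy(&bar);
      return 0;
  }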
Condition variables
● A synchronization primitive enabling one thread to sleep
waiting for a software defined condition to occur.
● Another thread may change the software state and signal
sleeping threads to check the condition.
● A condition has a mutex associated with it that must be
grabbed before using it.
● In Linux: see pthread_cond_*
● There are ACE and BOOST wrappers.
● See example...
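A minimal sketch:

  #include <pthread.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
  static int ready;

  void wait_until_ready(void)
  {
      pthread_mutex_lock(&m);
      while (!ready)                 /* re-check: wakeups may be spurious */
          pthread_cond_wait(&c, &m); /* atomically releases m and sleeps */
      pthread_mutex_unlock(&m);
  }

  void announce_ready(void)
  {
      pthread_mutex_lock(&m);
      ready = 1;                     /* change the software state... */
      pthread_cond_signal(&c);       /* ...then wake a sleeping waiter */
      pthread_mutex_unlock(&m);
  }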
● RCU: Read/Copy/Update
● RCU is a framework for thinking about accessing
a shared resource.
● The idea is not to block either readers or writers
in accessing the resource.
● Instead, writers get their own copy.
● The new copy gets substituted atomically.
● Readers who started reading the old copy
continue to read the old copy.
● New readers will read the new copy.
RCU (cont)
● The problem with RCU is the costly copy
operation for writers.
● Usually implemented in heavy systems where
the cost of the read or write operation is heavy
and outweighs the cost of the copy.
● A good example is a database system.
Machine memory barrier
● Is a specialized machine instruction that tells the CPU you are running
on not to reorder reads or writes around it.
● In theory there are all kinds of barriers: read, write, read and write, io
and CPU, only CPU, etc.
● In practice every hardware platform implements only a subset of those.
● In user space it is rarely used directly, because authors of synchronization
mechanisms use them and the end programmer just uses those
mechanisms.
● In user space Linux: __sync_synchronize() is a full machine barrier
along with all other atomic operations.
● In kernel (or in writing device drivers in user space) it is used heavily to
communicate correctly with hardware (mb(),rmb(),wmb())
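A sketch of the classic publish pattern (note that the reader side needs a matching barrier between loading flag and data):

  int data;
  int flag;

  void publish(int v)
  {
      data = v;
      __sync_synchronize();  /* full barrier: the store to data is visible... */
      flag = 1;              /* ...before any other core can see flag == 1 */
  }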
Compiler barrier
● A compiler barrier is a compile time instruction to the compiler
(much like #pragma) that tells the compiler not
to assume values of memory around the
barrier.
● See example...
● It is better to use a compiler barrier than the
volatile keyword for writing synchronization
mechanisms.
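A sketch, using GCC inline-assembly syntax (this emits no machine instruction; it only constrains the compiler):

  #define compiler_barrier() __asm__ __volatile__("" ::: "memory")

  extern int stop;   /* set by another thread or a signal handler */

  void spin_until_stopped(void)
  {
      while (!stop)
          compiler_barrier();  /* forces a fresh load of stop each iteration */
  }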
Volatile considered harmful
● Because it doesn't allow caching of the value, even within a
single function.
● Because it disables use of CPU caches on
various platforms (not on Intel).
● Basically, because it is overkill.
● Better to use the compiler barrier mentioned
above.
References
● Multicore Application Programming: For
Windows, Linux, and Oracle Solaris, Darryl Gove.
● The Art of Multiprocessor Programming,
Maurice Herlihy and Nir Shavit.
● What Every Programmer Should Know About
Memory, Ulrich Drepper.
● The Paul E. McKenney book on RCU.