Multi core programming
Mark Veltzer
mark@veltzer.net
Introduction
● Most modern computers and even cell phones
have more than one physical device that can run
code.
● These devices, whether you call them cores or
CPUs, can operate at the same time.
● The idea was to speed up processing when
making the transistor smaller became too
difficult.
● Add more cores and you are done, right?!?
Wrong...
Adding CPUs is not everything...
● There are resources which are shared between CPUs, and
as you add CPUs these resources become bottlenecks and
start giving you headaches...
● This means that there is a very real limit to the number of
CPUs you can just add.
● In addition, each CPU is very expensive in terms of power
consumption and other internal infrastructure (caches, etc.).
● This makes you want to share this infrastructure between
CPUs.
● In comes multi-core.
Multi Core vs Multi CPU
● The terms multi core and multi CPU are highly
abused and used with various meanings.
● We will use the following meanings:
● A multi CPU machine is a machine with more
than one execution unit where little is shared
between CPUs.
● A multi core machine is a machine where more
is shared between CPUs.
● This definition is not very precise.
There is always sharing
● CPUs always share some things...
● Main memory is a good example – if both CPUs
try to access main memory at the same time
one of them is delayed a little.
● Multi core CPUs share more than just main
memory: various caches, parts of the execution
pipeline, interrupt controller and more.
● What exactly is shared depends on the
architecture.
Why should I care about Multi core?
● The main reason is performance.
● If two CPUs that you are running on share resources
then they could, under some conditions, delay each
other.
● This could be costly.
● Knowing what could be delayed would mean
that you can program differently and avoid the
delay.
Types of sharing we don't care
about
● There are other types of sharing we don't care
about like voltage controllers, physical sockets
and more.
● These do not cause any delays of significance.
● In general, the more sharing there is, the more mutual
influence the CPUs have on each other, the more
potential delays can be created, and the more room we
have for optimisation.
How do I know how many cores I
have?
● Read the manual of your motherboard?!?
● $ cat /proc/cpuinfo | grep processor
● number of configured processors: $ getconf
_NPROCESSORS_CONF
● number of online processors: $ getconf
_NPROCESSORS_ONLN
● $ cat /sys/devices/system/cpu/{present,offline,online}
● Programmatically: sysconf(3)
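As a minimal sketch (glibc-specific: _SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN are extensions, not plain POSIX):

    /* count CPUs from C via sysconf(3) */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long conf = sysconf(_SC_NPROCESSORS_CONF); /* configured processors */
        long onln = sysconf(_SC_NPROCESSORS_ONLN); /* currently online */
        printf("configured: %ld, online: %ld\n", conf, onln);
        return 0;
    }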
Number of cores is not constant
● Linux supports hot plugging of CPUs on various
architectures.
● Most user space programmers don't care too
much about this.
● Adding a CPU is not a big problem.
● Removing a CPU works by moving load from a
running core to other cores.
● Of course, you can't remove the last core.
What is the maximum number of
CPUs I can run on?
● Sometimes you need access to this information in order to hold
data per CPU.
● You can ask how many processors are currently online, but new
processors may be added at runtime.
● The operating system kernel is compiled with a maximum number
of CPUs in its data structures. You can use that number in your
code as well.
● You can get that from the kernel headers package.
● or you can read /sys/devices/system/cpu/kernel_max
How multicore vs multi CPU affects
programming...
● Multicore CPUs will see memory changes
earlier.
● Multicore will slow each other more since they
share resources.
● Memory communication between multicore
processes will be faster.
● In general, our intuition about parallel machines
is more akin to multi CPU machines than to multi
core machines.
Multicore and Multi-CPU are not
mutually exclusive...
● In today's architectures it is common to find a
single system which has both multi core and
multi CPU.
● A single system could have 4 cores, packed as
a 2x2 multi-core system.
● This means that the two pairs of CPUs share
little while each pair shares a lot.
● You need to know about all of this – this is
called the CPU topology.
Does anyone look at the CPU
topology?
● Yes, the OS kernel knows about it and uses this
information for better scheduling and other
uses.
● You can use the information from:
/sys/devices/system/cpu/cpu[n]/topology/*
● You can do that at compile time to produce
various versions for various topologies, using a
config file per topology or even dynamically at
run time...
A little bit about timing...
● In order to appreciate performance benefits we need to
know a little about timings. The following are only
ballpark figures...
● Single machine instruction without memory access or
that only accesses the cache: nanoseconds...
● Single machine instruction which accesses main
memory: hundreds of nanoseconds
● Single OS system call: microsecond
● Read data from physical hard drive: milliseconds
What are pipe lines in the CPU?
● A clock in a CPU is simply an oscillator crystal
running at a constant rate that synchronizes the
operation of all the parts of the CPU.
● On a steady-clock CPU the clock cycle also has to
be longer than the longest executing instruction
(bummer!)
● This means that it is difficult to increase the
clock rate. Even if most instructions are very
short the longest one determines the clock rate.
RISC
● One solution was to simplify instructions: break
down complicated instructions into simple ones.
● Now the longest instruction is much shorter and
the CPU clock could be increased.
● But now you need to fetch more instructions and
have a bigger instruction cache.
● Compilers also work harder because even simple
operations require generation of several machine
instructions - and object files become bigger.
Pipe lines
● A different solution was to keep instructions complex but break
them down inside the CPU.
● Each instruction will be broken up to several stages.
● This way several instructions could be executed at the same time
because every stage will be executed separately
● Clock rate can now be increased to match the time it takes the
most complicated stage of the most complicated instruction to
execute.
● Compilers don't have to work hard.
● Because increasing the clock rate became hard, engineers
settled on the pipeline approach to increasing CPU speed.
How do pipelines affect programming?
● The CPU misses branch predictions as a result of
the pipelining process.
● We need to tell the CPU how likely/unlikely our
branch is using hints to the compiler.
● Code which is very irregular (noisy and hard
to predict) will be slower because it will confuse
the branch prediction in the CPU.
● See example...
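A minimal sketch of such hints with GCC's __builtin_expect; the likely()/unlikely() macro names follow the Linux kernel convention but are our own wrappers here:

    #include <stdio.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int process(int err) {
        if (unlikely(err)) {            /* hint: the error path is rare */
            fprintf(stderr, "error %d\n", err);
            return -1;
        }
        return 0;                       /* hint: the hot path falls through */
    }

The compiler lays out the hot path contiguously so the CPU's branch prediction tends to favor it.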
What are cache lines?
● when the CPU fetches memory it does not fetch it
one address at a time.
● instead it fetches a full "cache line".
● typical cache lines are 16 to 128 bytes and are
usually aligned.
● for example, on current Intel processors cache
lines are 64 bytes.
● Cache line should not be confused with memory
page which is a much bigger construct (4096 bytes
on Intel).
How to get the cache line?
● On the command line:
● $ getconf LEVEL1_DCACHE_LINESIZE
● $ cat /proc/cpuinfo | grep "clflush size"
● $ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
● Or use sysconf(3).
● Or hardcode it in your build process.
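A sketch of the sysconf(3) route from C; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and may report 0 on systems that do not expose it:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        printf("L1 dcache line size: %ld bytes\n", line);
        return 0;
    }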
How do you pad a structure to have
the size rounded to cache line size?
● Do it at compile time (of course).
● Add a char[n] member to your class/struct/...
● Or use special compiler padding syntax.
● Sometimes this is counterproductive, for instance
when two of these structures could fit into a single
cache line. In these cases laying out the memory of
the entire data structure is a better strategy.
● Don't do this without thinking about it!
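A minimal sketch of both approaches, assuming a 64-byte line (verify it with the methods shown earlier):

    #define CACHE_LINE 64

    /* explicit padding member */
    struct counter_padded {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    /* or GCC/Clang alignment syntax: size is rounded up to 64 */
    struct counter_aligned {
        long value;
    } __attribute__((aligned(CACHE_LINE)));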
How does pipeline organization in a core
affect programming?
● Example: A CPU has two pipelines: one for
integer computation and one for floating point.
● Then it is better to have integer instructions
interleaved with floating point ones (they will be
executed concurrently).
● The problem is that this type of advice is very
hard to follow in practice.
● And the compiler is supposed to be able to do
that for you anyway...
How does pipeline organization in a core
affect programming?
● The CPU rearranges the order in which it does
loads and stores as a side effect of pipelining.
● The semantics of your code remain the same; the
reordering only happens where the order does not matter.
● Usually you don't care about this and should leave
this alone.
● Except in cases where you are writing a
synchronization mechanism or talking to hardware.
● In that case you would use a machine barrier to
prevent this reordering (will be explained later).
The problems of memory
● Why do CPUs need a memory cache at all? The idea is
that not all memory acts at the same speed.
● There are, and always will be, better technologies for
faster memory, and old technologies for slower memory.
● There is a huge monetary difference between these
types of memory.
● Why didn't they make the entire airplane out of the stuff
that makes up the black box? ...:)
● So can we use the fact that we have fast and
expensive memory as opposed to slower and less
expensive memory to speed up the computer? Yes.
Symmetric vs non symmetric
memory
● Access to main memory is not uniform for all CPUs or
cores in the system
● When the access is the same (there are various
techniques to do that) then the system is called UMA
– Uniform Memory Access.
● When the system is not symmetric (e.g. main
memory is connected to CPU0 and CPU1 has to ask
CPU0 for access) then the system is called NUMA –
Non Uniform Memory Access.
● There are other types of NUMA systems as well
(more on that later)
The CPU cache
● A relatively small area of memory within the CPU that is made out
of the fast expensive memory and which stores a copy of some
addresses accessed by the CPU.
● The algorithm that chooses which addresses to store in the cache
varies across different CPUs.
● Some store the last addresses accessed, some store the most
heavily used ones.
● The idea is to reduce the AVERAGE time it takes to access a given
memory address.
● Rests upon the assumption that software is not chaotic and tends
to reuse the same addresses most of the time.
● If you access memory randomly you will break this assumption
and will thrash the cache (discussed later).
Types of caches
● In reality CPUs have more than one type of
cache:
● DCACHE: data cache used for data access
(read and write)
● ICACHE: instruction cache (used to cache
execute only memory which is your code)
● TLB: Translation Lookaside Buffer. Speeds up
virtual to physical address translation.
Number of caches
● In reality CPUs usually have more than one
cache.
● As you get farther from the core, caches become
bigger and cache lines become bigger.
● But at the same time access becomes slower.
● Until we reach main memory which is biggest
and slowest.
● These are called L1, L2, L3, ...
Cache thrashing...
● Thrashing: using many resources to do little work. Bad.
● A piece of software allocates a huge amount of RAM and then accesses
each address one time, in random order.
● A piece of software works at any given point in time with so much RAM that the
same address is never accessed twice.
● Two threads or processes share a cache. Each one alone would run without
cache misses, but together they produce lots of cache misses.
● In all of these cases the cache is used (it is in hardware after all) but it is not
useful.
● The cache is only useful if you have low cache miss rates.
Coherent vs non coherent platforms
● Coherency is a very important subject.
● A perfectly coherent platform is one where all
components of the system (IO devices, all cores and
CPUs) see memory in exactly the same way.
● Coherency is a hardware feature which can be
controlled by software (special machine instructions).
● Most real systems are incoherent in some ways but not
in others.
● Some systems could be used in both coherent and
incoherent ways.
Coherency - Example
● Intel is said to be a cache coherent platform.
● This means that two processors see the same memory.
● If one processor puts data in its cache then the other processor will
respect that and will fetch the data from that cache.
● This slows down both CPUs a little.
● But it makes programming a whole lot easier.
● But the same platform (Intel) is not coherent between the CPU and
peripherals.
● A special instruction needs to be used to make sure that the CPU sees
data put into main memory by a peripheral, and the reverse.
Coherency and memory reordering
● Coherency has nothing to do with memory
reordering.
● A platform can be coherent but still have
memory reordering (Intel is an example). A
platform can also be non-coherent and have no
memory reordering (older RISC systems).
● In theory different instructions should deal with
memory reordering and coherency but in
practice one instruction usually does both.
Optimizing CPU cache utilization
● First find the CPU cache line size
● Then define the most common structures to fit
inside a cache line.
● Another idea is to split commonly used data from
uncommonly used data. This will make sure that
the commonly used data will fit into one cache line.
● You can pad your structures to a cache line size
using compiler hints.
● See example....
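A sketch of the hot/cold split, with illustrative field names and an assumed 64-byte line:

    /* hot fields: touched on every request, kept on their own line */
    struct stats_hot {
        long hits;
        long misses;
    } __attribute__((aligned(64)));

    /* cold fields: touched only when printing reports */
    struct stats_cold {
        char description[256];
        long created_at;
    };

Before the split, every access to the hot counters would drag the rarely-used description into the cache with them.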
CPU counters
● Modern CPUs hold counters for all of the above:
TLB misses, data cache misses, instruction cache
misses, branch prediction misses, instructions
executed and much much more.
● Some of the counters are accessible only from
kernel mode; most can be read from any mode.
● Writing to them is usually restricted to kernel
mode.
● In order to access them directly you need to write
assembly code, which is non-portable.
CPU counters - problems
● The problem is counting them right, since these
counters are not per thread or process; the CPU
doesn't know about threads or processes.
● If you are the only process running then there is no
problem (although you are counting the OS as
well).
● If not, then better solutions are required.
CPU counters – best practices
● Writing your own code in assembly is hard.
Most people don't do that.
● On Linux use perf.
● Or use a library like PAPI to give you the CPU
counters involved.
● See demo and example...
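For example (./myprog is a placeholder; exact event names vary per CPU, so check perf list first):

    $ perf list
    $ perf stat -e instructions,cache-misses,branch-misses ./myprog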
What is NUMA?
● On very large systems using a single main memory
bank becomes the main bottleneck.
● On such systems the memory is divided into banks
where each bank is attached to a set of CPUs.
● Example: 4 memory banks of 4GB each, where
each bank is attached to a dual core processor.
● The banks are connected via a bus to allow each
processor to access any address.
● Access to the local bank is, of course, much faster
than to access a remote bank.
NUMA on Linux
● The Linux kernel knows about NUMA and reads
the NUMA topology at boot time.
● It schedules processes on processors according
to memory affinity.
● It tries to allocate memory for a process on the
memory bank closest to that process.
● Once a thread is running Linux will try very hard
not to migrate it to a core that is on a different
bank since that will cause large memory latency.
NUMA syscalls on Linux
● There are various system calls on Linux that allow
a programmer to control a process's properties in
relation to NUMA.
● mbind allows a programmer to bind memory to a
NUMA node.
● migrate_pages and move_pages move pages from
bank to bank.
● set_mempolicy and get_mempolicy allow control of
the kernel policy regarding NUMA.
● /proc/[pid]/numa_maps shows the NUMA mapping
for a certain process.
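As a sketch, the libnuma library (link with -lnuma) wraps these syscalls with a friendlier API; the node number below is illustrative:

    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        size_t size = 4096;
        void *buf = numa_alloc_onnode(size, 0);  /* place memory on node 0 */
        /* ... use buf from threads running near node 0 ... */
        numa_free(buf, size);
        return 0;
    }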
Memory policies in programming
● When there is a chance of two threads on two cores
sharing the same cache line reduce it by moving
memory apart or by padding structures.
● When the reverse is true and a flat area of memory is
accessed only by one thread on one core try to make
memory as small as possible.
● When you have frequently used memory next to
infrequently used memory, move them apart (a
redesign of your data structures may be required).
● Measure, measure, measure
● See example....
Core concepts for the course
Concepts and definitions
● In order to deal with programming for multi-core
or even time sharing machines we have to use
various tools and concepts.
● We will now go over these concepts.
● Some of these are low level mechanisms
supplied either by the OS or libraries.
● Some are high level concepts which set a
framework for the programmer to think in.
Atomic operations
● A set of CPU instructions intended to aid low
level programmers in writing synchronization
primitives.
● These instructions can be used from user
space so you can use them too.
● Either to gain performance benefits in various
conditions.
● Or in order to write synchronization primitives of
your own.
Atomic operations (cont)
● Usually the compiler vendor wraps these primitives
for you.
● You can still write assembly code if you want.
● In GCC and Intel compiler see documentation for
__sync_*
● New C++11 APIs also exist in some compilers
(__atomic_*).
● Intel Threading Building Blocks has a nice atomic<T>
template.
● These operations are fast (nanosecond time).
Atomic ops : what do they do?
● Guarantee that other CPUs and HW do not see wrong
intermediate “junk” values.
● Guarantee that the CPU caches for this variable will be
synchronized (only for platforms that do not synchronize
the caches).
● Guarantee that the HW will see the change.
● Prevent reordering of instructions around the atomic
operation.
● Prevent race of the same op by two or more cores.
● See example...
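A minimal sketch of the __sync_* builtins mentioned above:

    /* a counter that many threads may increment without a lock */
    static long counter;

    void hit(void) {
        __sync_fetch_and_add(&counter, 1);       /* atomic add, full barrier */
    }

    /* atomically take a flag: set it to 1 only if it was 0 */
    int try_take(int *flag) {
        return __sync_bool_compare_and_swap(flag, 0, 1);
    }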
Mutex
● Definition: an OS/library based construct
enabling two pieces of code to avoid entering
predefined piece(s) of code at the same time.
● How are mutexes implemented?
● The simplest way is a bit inside the OS kernel
backed up by a test-and-set CPU machine
instruction.
● This means that in order to lock the mutex you
have to go to the OS (microsecond range).
The problems of mutexes...
● A good design is one in which mutexes are
unlocked most of the time
● Otherwise these mutexes serialize all cores.
● This means that we waste the trip to the OS
kernel because we don't need the OS to put us
to sleep or to wake us up.
● Can't we just lock the mutex from user space?
Enter futex / critical section
● A futex is just like a mutex with the exception
that the state of the lock is shared by all
concurrent processes in user space.
● The test-and-set is executed in user space, and
only if it fails (the lock is already locked) do we go
to the OS to put the current thread to sleep.
● The reverse applies when leaving the critical
section: if no one is waiting the trip to the OS
kernel is avoided.
● The problem with this idea is robustness.
Futex robustness
● What happens if a process dies holding a
futex?
● In a multi-threaded program the whole program
dies (a thread cannot die) so we don't care.
● In a multi-process program we must release all
locks held at time of death.
● This is referred to as “futex robustness”.
● In Linux all futexes are robust.
Semaphores
● Semaphore is a very overloaded concept.
● Sometimes it is used in place of mutex.
● Sometimes it is used with the meaning of
"number of permits".
● Sometimes it is used as a mutex which has a
name and therefore can be used with multiple
processes as opposed to multiple threads.
● Linux has two kinds of these semaphores: SysV
and POSIX.
Readers/Writer lock
● A resource protected by a lock is usually
protected because a write and read at the same
time would create problems.
● Multiple reads are not a problem and do not
need a lock at all.
● If we have many readers compared to writers
we can gain performance by allowing many
readers into the critical section while not allowing
writers.
● A writer would enter the critical section alone.
Readers/Writer lock
● If we only have writers then we serialize all of
them and therefore gain no benefit from this
type of lock.
● How is a readers/writer lock implemented?
Using a regular type of lock.
● In Linux it can even be shared between
processes.
● See: pthread_rwlock_*
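A minimal sketch: many concurrent readers, one exclusive writer:

    #include <pthread.h>

    static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
    static int shared_value;

    int read_value(void) {
        pthread_rwlock_rdlock(&lock);   /* many readers can hold this */
        int v = shared_value;
        pthread_rwlock_unlock(&lock);
        return v;
    }

    void write_value(int v) {
        pthread_rwlock_wrlock(&lock);   /* the writer enters alone */
        shared_value = v;
        pthread_rwlock_unlock(&lock);
    }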
Spin lock
● We are talking about user space spin locks here (do not
confuse with kernel space spin locks).
● It is a lock that spins – busy waits – until a test-and-set
operation succeeds.
● This is efficient when the operation waited upon is very short
compared to the trip to the OS to lock the structure.
● On the other hand if the lock is not contended then it has the
same speed as a futex.
● In the contended case it may be problematic if the two
contending threads are on the same core. That is why spin
locks are usually used for threads running on different cores.
● See pthread_spin_* on Linux...
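A minimal sketch for a very short critical section:

    #include <pthread.h>

    static pthread_spinlock_t lock;
    static long total;

    void init_lock(void) {
        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    }

    void add(long n) {
        pthread_spin_lock(&lock);       /* busy-waits instead of sleeping */
        total += n;                     /* keep the protected work tiny */
        pthread_spin_unlock(&lock);
    }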
Lock free algorithms
● Are a class of algorithms that require no locks
during their operation.
● They usually require atomic operations instead.
● Examples are a lock free circular buffer, lock
free hash table and more.
● Sometimes these algorithms rely on spinning for
a short while.
● Usually quite a pain to implement correctly.
● But you can use a library instead...
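As a taste, a sketch of one lock-free building block: pushing onto a linked stack with compare-and-swap. Popping safely is much harder (the ABA problem), which is why a library is recommended:

    struct node { struct node *next; int value; };
    static struct node *head;

    void push(struct node *n) {
        struct node *old;
        do {
            old = head;
            n->next = old;
            /* retry until no other core changed head under us */
        } while (!__sync_bool_compare_and_swap(&head, old, n));
    }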
Dead lock
● A situation in which no work is done because everyone
is waiting on a lock and so no one is releasing any
lock...
● Example: A takes “a” lock, B takes “b” lock, A waits for
“b” lock, B waits for “a” lock.
● Solutions: code carefully or always acquire locks in the
same order.
● There are all kinds of dead lock detectors available
that have overhead but can be used at development
time and removed in production.
● Dead lock is called dead because all threads are
sleeping when it occurs.
Live lock
● A situation where no work is getting done
although all threads are live.
● This happens when all threads yield one to the
other instead of acquiring the lock...
● ...and no thread actually acquires the lock.
● Since most code doesn't yield this almost never
happens in practice.
● See sched_yield(2) in Linux...
Memory sharing
● Having one piece of memory accessible by
multiple running entities (threads or processes).
● Multiple threads share all memory by default.
● Multiple processes share no memory by
default.
● In Linux there are two types of shared memory:
SYSV IPC and POSIX.
● And libraries like ACE and BOOST to wrap
them...
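A minimal POSIX shared memory sketch (the name "/demo_shm" is illustrative; link with -lrt on older glibc):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void *map_shared(size_t size) {
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return NULL;
        if (ftruncate(fd, size) < 0) {
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping survives the close */
        return p == MAP_FAILED ? NULL : p;
    }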
Memory sharing (cont)
● If you share data in the shared memory then
you have to either:
● Have a piece of data per process and then you
don't need to protect it.
● Have multiple processes access the same data and
then protect it via all the mechanisms described
before.
● Use a specialized lock like readers/writer lock.
● Use a specialized framework like RCU.
● Use a lock free algorithm.
Memory sharing (cont)
● In Linux most synchronization primitives can be
placed into shared memory.
● This means you can use a futex, spinlock,
readers/writer lock or an atomic variable to
synchronize processes.
● That is the reason that futexes in Linux were
implemented as robust futexes.
Message queues
● A mechanism by which two processes or
threads could pass messages one to the other.
● If used for multi-processing then it is handled by
the OS.
● It is then an OS managed resource, is limited, and
has efficiency issues since messages are copied
from the source process to the OS and then back
from the OS to the target process.
Message queues (cont...)
● In a multi-threaded environment message
queues are used to pass pointers to messages
around, and serve more as a synchronization
mechanism.
● This is because all threads share memory.
● ACE and BOOST both implement queues.
Barriers
● A synchronization primitive that has a number
associated with it.
● When a thread arrives at a barrier it goes to sleep.
● When enough threads arrive at the barrier (the number
specified at the barrier creation) the barrier is “broken”
and all the threads continue to run.
● Typical use: real time application time slice work.
● Linux: see pthread_barrier_*
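A minimal sketch: NUM_WORKERS threads synchronize at the end of every time slice:

    #include <pthread.h>

    #define NUM_WORKERS 4
    static pthread_barrier_t barrier;

    void init_barrier(void) {
        pthread_barrier_init(&barrier, NULL, NUM_WORKERS);
    }

    void *worker(void *arg) {
        (void)arg;
        for (;;) {
            /* ... do this slice's share of the work ... */
            pthread_barrier_wait(&barrier); /* sleep until all have arrived */
        }
        return NULL;
    }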
Condition
● A synchronization primitive enabling one thread to sleep
waiting for a software defined condition to occur.
● Another thread may change the software state and signal
sleeping threads to check the condition.
● A condition has a mutex associated with it that must be
grabbed before using it.
● In Linux: see pthread_cond_*
● There are ACE and BOOST wrappers.
● See example...
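A minimal sketch of the classic pattern; note the condition is always re-checked in a loop under the mutex (wakeups may be spurious):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t c = PTHREAD_COND_INITIALIZER;
    static int ready;

    void wait_for_ready(void) {
        pthread_mutex_lock(&m);
        while (!ready)
            pthread_cond_wait(&c, &m);  /* atomically unlocks m and sleeps */
        pthread_mutex_unlock(&m);
    }

    void signal_ready(void) {
        pthread_mutex_lock(&m);
        ready = 1;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }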
RCU
● RCU: Read/Copy/Update
● RCU is a framework for thinking about accessing
a shared resource.
● The idea is not to block either readers or writers
in accessing the resource.
● Instead, writers get their own copy.
● The new copy gets substituted atomically.
● Readers who started reading the old copy
continue to read the old copy.
RCU (continued)
● New readers will read the new copy.
● The problem with RCU is the costly copy
operation for writers.
● Usually implemented in heavy systems where
the cost of the read or write operation is heavy
and outweighs the cost of the copy.
● A good example is a database system.
Machine memory barrier
● A specialized machine instruction that tells the CPU you are running
on not to reorder reads or writes around it.
● In theory there are all kinds of barriers: read, write, read and write, io
and CPU, only CPU, etc.
● In practice every hardware platform implements only a subset of those.
● In user space it is rarely used because authors of synchronization
mechanisms use them and the end programmer just uses the
synchronization mechanisms.
● In user space Linux: __sync_synchronize() is a full machine barrier
along with all other atomic operations.
● In kernel (or in writing device drivers in user space) it is used heavily to
communicate correctly with hardware (mb(),rmb(),wmb())
Compiler barrier
● A compile-time instruction to the compiler
(much like #pragma) that tells the compiler not
to assume values of memory around it.
● See example...
● It is better to use a compiler barrier than the
volatile keyword for writing synchronization
primitives.
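A sketch contrasting the two in GCC; the empty asm is the standard compiler barrier idiom and emits no machine instructions:

    #define compiler_barrier() asm volatile("" ::: "memory")

    void example(void) {
        compiler_barrier();     /* compiler: do not cache memory across this */
        __sync_synchronize();   /* CPU: full machine barrier as well */
    }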
Volatile considered harmful
● Because it doesn't allow caching, even within a
critical section.
● Because it disables use of CPU caches on
various platforms (not on Intel).
● Basically because it is overkill.
● Better use the compiler barrier mentioned
before...
References
● Multicore Application Programming, For
Windows, Linux, and Oracle Solaris. Darryl
Gove.
● The Art of Multiprocessor Programming,
Maurice Herlihy and Nir Shavit.
● What Every Programmer Should Know About
Memory, Ulrich Drepper,
http://www.akkadia.org/drepper/cpumemory.pdf
● The Paul E. McKenney book on RCU.
Synchronization mechanisms
● Nothing
● Atomics
● Lockless Algos/DS
● RCU/COW
● Readers/Writer lock
● Spinlock
● Futex
● Mutex
● Semaphore

More Related Content

What's hot

Asus WL500gP USB Serial
Asus WL500gP USB SerialAsus WL500gP USB Serial
Asus WL500gP USB Serial
guestac21b8
 

What's hot (20)

The Shell Game Part 2: What are your shell choices?
The Shell Game Part 2: What are your shell choices?The Shell Game Part 2: What are your shell choices?
The Shell Game Part 2: What are your shell choices?
 
LSA2 - 02 Namespaces
LSA2 - 02  NamespacesLSA2 - 02  Namespaces
LSA2 - 02 Namespaces
 
Asus WL500gP USB Serial
Asus WL500gP USB SerialAsus WL500gP USB Serial
Asus WL500gP USB Serial
 
Linux Kernel Init Process
Linux Kernel Init ProcessLinux Kernel Init Process
Linux Kernel Init Process
 
Kernel Recipes 2016 -
Kernel Recipes 2016 - Kernel Recipes 2016 -
Kernel Recipes 2016 -
 
Many Cores Java - Session One: Threads and Threads
Many Cores Java - Session One: Threads and ThreadsMany Cores Java - Session One: Threads and Threads
Many Cores Java - Session One: Threads and Threads
 
Threads and Threads
Threads and ThreadsThreads and Threads
Threads and Threads
 
Live migrating a container: pros, cons and gotchas -- Pavel Emelyanov
Live migrating a container: pros, cons and gotchas -- Pavel EmelyanovLive migrating a container: pros, cons and gotchas -- Pavel Emelyanov
Live migrating a container: pros, cons and gotchas -- Pavel Emelyanov
 
Linux boot process
Linux boot processLinux boot process
Linux boot process
 
Tips on High Performance Server Programming
Tips on High Performance Server ProgrammingTips on High Performance Server Programming
Tips on High Performance Server Programming
 
Introduction to systemd
Introduction to systemdIntroduction to systemd
Introduction to systemd
 
epoll() - The I/O Hero
epoll() - The I/O Heroepoll() - The I/O Hero
epoll() - The I/O Hero
 
Linux NUMA & Databases: Perils and Opportunities
Linux NUMA & Databases: Perils and OpportunitiesLinux NUMA & Databases: Perils and Opportunities
Linux NUMA & Databases: Perils and Opportunities
 
Basic of Systemd
Basic of SystemdBasic of Systemd
Basic of Systemd
 
Amepad lt(tmpfs)
Amepad lt(tmpfs)Amepad lt(tmpfs)
Amepad lt(tmpfs)
 
Kvm optimizations
Kvm optimizationsKvm optimizations
Kvm optimizations
 
System ctl file that is created at time of installation
System ctl file that is created at time of installationSystem ctl file that is created at time of installation
System ctl file that is created at time of installation
 
Clear cache memory
Clear cache memoryClear cache memory
Clear cache memory
 
Introduction to HSA
Introduction to HSAIntroduction to HSA
Introduction to HSA
 
Backing Up Data with MMS
Backing Up Data with MMSBacking Up Data with MMS
Backing Up Data with MMS
 

Viewers also liked (7)

parallelization strategy
parallelization strategyparallelization strategy
parallelization strategy
 
الديسلكسيا العسر القرائي
الديسلكسيا العسر القرائيالديسلكسيا العسر القرائي
الديسلكسيا العسر القرائي
 
Parallel programming
Parallel programmingParallel programming
Parallel programming
 
Introduction to multi core
Introduction to multi coreIntroduction to multi core
Introduction to multi core
 
NUMA overview
NUMA overviewNUMA overview
NUMA overview
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecture
 
Multi core processors
Multi core processorsMulti core processors
Multi core processors
 

Similar to Multicore

Cloud Computing-UNIT 1 claud computing basics
Cloud Computing-UNIT 1 claud computing basicsCloud Computing-UNIT 1 claud computing basics
Cloud Computing-UNIT 1 claud computing basics
moeincanada007
 
Asynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NETAsynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NET
ssusere19c741
 

Similar to Multicore (20)

Optimizing Linux Servers
Optimizing Linux ServersOptimizing Linux Servers
Optimizing Linux Servers
 
Unix operating system basics
Unix operating system basicsUnix operating system basics
Unix operating system basics
 
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farmKernel Recipes 2016 - Speeding up development by setting up a kernel build farm
Kernel Recipes 2016 - Speeding up development by setting up a kernel build farm
 
CS6401 Operating Systems
CS6401 Operating SystemsCS6401 Operating Systems
CS6401 Operating Systems
 
linux monitoring and performance tunning
linux monitoring and performance tunning linux monitoring and performance tunning
linux monitoring and performance tunning
 
Uni Processor Architecture
Uni Processor ArchitectureUni Processor Architecture
Uni Processor Architecture
 
Operating System.pdf
Operating System.pdfOperating System.pdf
Operating System.pdf
 
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUSAVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
 
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUSAVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS
 
Ch4 memory management
Ch4 memory managementCh4 memory management
Ch4 memory management
 
Group 3
Group 3Group 3
Group 3
 
Computers for Scenarios
Computers for ScenariosComputers for Scenarios
Computers for Scenarios
 
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptxParallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
Parallel Processing & Pipelining in Computer Architecture_Prof.Sumalatha.pptx
 
MK Sistem Operasi.pdf
MK Sistem Operasi.pdfMK Sistem Operasi.pdf
MK Sistem Operasi.pdf
 
Cloud Computing-UNIT 1 claud computing basics
Cloud Computing-UNIT 1 claud computing basicsCloud Computing-UNIT 1 claud computing basics
Cloud Computing-UNIT 1 claud computing basics
 
CA presentation of multicore processor
CA presentation of multicore processorCA presentation of multicore processor
CA presentation of multicore processor
 
module4.ppt
module4.pptmodule4.ppt
module4.ppt
 
Nbvtalkatjntuvizianagaram
NbvtalkatjntuvizianagaramNbvtalkatjntuvizianagaram
Nbvtalkatjntuvizianagaram
 
Asynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NETAsynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NET
 
Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8Parallel and Distributed Computing Chapter 8
Parallel and Distributed Computing Chapter 8
 

More from Mark Veltzer (11)

Gcc
GccGcc
Gcc
 
Gcc opt
Gcc optGcc opt
Gcc opt
 
Linux io
Linux ioLinux io
Linux io
 
Linux monitoring
Linux monitoringLinux monitoring
Linux monitoring
 
Linux multiplexing
Linux multiplexingLinux multiplexing
Linux multiplexing
 
Linux tmpfs
Linux tmpfsLinux tmpfs
Linux tmpfs
 
Pthreads linux
Pthreads linuxPthreads linux
Pthreads linux
 
Realtime
RealtimeRealtime
Realtime
 
Streams
StreamsStreams
Streams
 
Volatile
VolatileVolatile
Volatile
 
Effective cplusplus
Effective cplusplusEffective cplusplus
Effective cplusplus
 

Recently uploaded

Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
FIDO Alliance
 
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial IntelligenceRevolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
Precisely
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
panagenda
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
FIDO Alliance
 

Recently uploaded (20)

State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial IntelligenceRevolutionizing SAP® Processes with Automation and Artificial Intelligence
Revolutionizing SAP® Processes with Automation and Artificial Intelligence
 
Generative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfGenerative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdf
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdfFrisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
Frisco Automating Purchase Orders with MuleSoft IDP- May 10th, 2024.pptx.pdf
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 

Multicore

  • 1. Multi core programming Mark Veltzer mark@veltzer.net
  • 2. Introduction ● Most modern computers and even cell phones have more than one physical device that can run code. ● These devices, whether you call them cores or CPUs can operate at the same time. ● The idea was to speed up processing when making the transistor smaller became too difficult. ● Add more cores and you are done, right?!? Wrong...
  • 3. Adding CPUs is not everything... ● There are resources which are shared between CPUs and as you add CPUs these resources become bottle necks and start giving you headaches... ● This means that there is a very real limit to the number of CPUs you can just add. ● In addition each CPU is very expensive in terms of power consumption and other internal infrastructure (caches etc) ● This makes you want to share this infrastructure between CPUs. ● In comes multi-core.
  • 4. Multi Core vs Multi CPU ● the terms multi core and multi CPU are highly abused and used with various meaning. ● We will use the following meanings: ● A multi CPU machine is a machine with more than one execution unit where little is shared between CPUs. ● A multi core machine is a machine where more is shared between CPUs. ● This definition is not very precise.
  • 5. There is always sharing ● CPUs always share somethings... ● Main memory is a good example – if both CPUs try to access main memory at the same time one of them is delayed a little. ● Multi core CPUs share more than just main memory: various caches, parts of the execution pipeline, interrupt controller and more. ● What exactly is shared depends on the architecture.
  • 6. Why should I care about Multi core? ● The main reason is performance. ● If two CPUs that you are running on share stuff then they could, at some conditions, cause delays one for the other. ● This could be costly. ● Knowing what could be delayed would mean that you can program differently and avoid the delay.
  • 7. Types of sharing we don't care about ● There are other types of sharing we don't care about like voltage controllers, physical sockets and more. ● These do not cause any delays of significance. ● In general the more sharing the more mutual influence the CPUs have one on the other and the more potential delays could be created, the more room we have for optimisation.
  • 8. How do I know how many cores I have? ● read the manual of your mother board?!? ● $ cat /proc/cpuinfo | grep processor ● number of configured processors: $ getconf _NPROCESSORS_CONF ● number of online processors: $ getconf _NPROCESSORS_ONLN ● $ cat /sys/devices/system/cpu/ {present,offline,online} ● Programatically: sysconf(3)
  • 9. Number of cores is not constant ● Linux supports hot plugging of CPUs on various architectures. ● Most user space programmers don't care too much about this. ● Adding a CPU is not a big problem. ● Removing a CPU works by moving load from a running core to other cores. ● Of course, you can't remove the last core.
  • 10. What is the maximum number of CPUs I can run on? ● Sometimes you need access to this information in order to hold data per CPU. ● you can ask how many processors are currently online but new processors may be added at runtime. ● the operating system kernel is compiled to a maximum number of CPUs in it's data structures. You can use that number in your code as well. ● you can get that from the kernel headers package. ● or you can read /sys/devices/system/cpu/kernel_max
  • 11. How multicore vs multi CPU effects programming... ● Multicore CPUs will see memory changes earlier. ● Multicore will slow each other more since they share resources. ● Memory communication between multicore processes will be faster ● In general our intuition about multi – CPU machines is more akin to multi CPU machines than to multi core machines.
  • 12. Multicore and Multi-CPU are not mutually exclusive... ● In todays architectures it is common to find a single system which has both multi core and multi CPU. ● A single system could have 4 cores, packed as a 2x2 multi-core system. ● This means that the two pairs of CPUs share little while each pair shares a lot. ● You need to know about all of this – this is called the CPU topology.
  • 13. Does anyone look at the CPU topology? ● Yes, the OS kernel knows about it and uses this information for better scheduling and other uses. ● You can use the information from: /sys/devices/system/cpu/cpu[n]/topology/* ● You can do that at compile time to produce various versions for various topologies, using a config file per topology or even dynamically at run time...
  • 14. A little bit about timing... ● In order to appreciate performance benefits we need to know a little about timings. The following are only ball park figures... ● Single machine instruction without memory access or that only accesses the cache: nanoseconds... ● Single machine instruction which accesses main memory: hundreds of nanoseconds ● Single OS system call: microsecond ● Read data from physical hard drive: miliseconds
  • 15. What are pipe lines in the CPU? ● A clock in a CPU is simply a oscillator crystal at a constant rate that synchronizes the operation of all the parts of the CPU. ● On a steady clock CPU the clock also has to be longer than the longest executing instruction (bummer!) ● This means that it is difficult to increase the clock rate. Even if most instructions are very short the longest one determines the clock rate.
  • 16. RISC ● One solution was to simpify instructions: break down complicated instructions into simple ones. ● Now the longest instruction is much shorter and the CPU clock could be increased. ● But now you need to fetch more instructions and have a bigger instruction cache. ● Compilers also work harder because even simple operations require generation of several machine instructions - and object files become bigger.
  • 17. Pipe lines ● A different solution was to keep instructions complex but break them down inside the cpu. ● Each instruction will be broken up to several stages. ● This way several instructions could be executed at the same time because every stage will be executed separately ● Clock rate can now be increased to match the time it takes the most complicated stage of the most complicated instruction to execute. ● Compilers don't have to work hard. ● Because increasing the clock became hard engineers decided on a pipe line approach to increasing the CPU speed.
  • 18. How pipelines effect programming? ● the CPU misses branch prediction as a result of the pipelining process. ● we need to tell the CPU how likely/unlikey our branch is using hints to the compiler. ● code which is very un regular (noisy and hard to predict) will be slower because it will confuse the branch prediction in the CPU. ● See example...
  • 19. What are cache lines? ● when the CPU fetches memory it does not fetch it one address at a time. ● instead it fetches a full "cache line". ● typical cache lines are 16 to 128 bytes and are usually aligned. ● for example, on current intel processors cache lines are 64 bytes. ● Cache line should not be confused with memory page which is a much bigger construct (4096 bytes on Intel).
  • 20. How to get the cache line? ● On the command line: ● $ getconf LEVEL1_DCACHE_LINESIZE ● $ cat /proc/cpuinfo | grep "clflush size" ● $ cat /sys/devices/system/cpu/cpu0/cache/index0/co herency_line_size ● Or use sysconf(3). ● Or hardcode in you build process.
  • 21. How do you pad a structure to have the size rounded to cache line size? ● Do it at compile time (of course). ● Add a char[n] member to your class/struct/... ● Or use special compiler padding syntax. ● Sometimes, this is counter productive when two of these structures could fit into a single cache line for instance. In these cases laying out memory of the entire data structure is a better strategy. ● Dont do this without thinking about it!
  • 22. How pipeline organization in a core effects programming? ● Example: A CPU has two pipelines: one for integer computation and one for floating point. ● Then it is better to have integer instructions interleaved with floating point ones (they will be executed concurrently). ● The problem is that this type of advice is very hard to follow in practice. ● And the compiler is supposed to be able to do that for you anyway...
  • 23. How pipeline organization in a core effects programming? ● The CPU rearranges the order in which it does loads and stores as a side effect of pipeline ● The semantics of your code remain the same, the order does not matter. ● Usually you don't care about this and should leave this alone. ● Except in cases where you are writing a synchronization mechanism or talking to hardware. ● In that case you would use a machine barrier to prevent this reordering (will be explained later).
  • 24. The problems of memory ● Why do CPUs need a memory cache at all? The idea is that not all memory acts in the same speed ● There are, and always will be, better technologies for faster memory, and old technologies for slower memory ● There is a huge monetary difference between these types of memory. ● Why didn't they make the entire Airplane out the stuff that makes up the black box? ...:) ● So can we use the fact that we have fast and expensive memory as opposed to slower and less expensive memory to speed up the computer? Yes.
  • 25. Symmetric vs non symmetric memory ● Access to main memory is not uniform for all CPUs or cores in the system ● When the access is the same (there are various techniques to do that) then the system is called UMA – Uniform Memory Access. ● When the system is not symmetric (e.g. main memory is connected to CPU0 and CPU1 has to ask CPU0 for access) then the system is called NUMA – Non Uniform Memory Access. ● There are other types of NUMA systems as well (more on that later)
  • 26. The CPU cache ● A relatively small area of memory within the CPU that is made out of the fast expensive memory and which stores a copy of some addresses accessed by the CPU. ● The algorithm that chooses which addresses to store in the cache is varied across different CPUs. ● some store the last addresses accessed, some store the most heavily used ones. ● The idea is to reduce the AVERAGE time it take to access a given memory address. ● Rests upon the assumption that software is not chaotic and tends to reuse the same addresses most of the time. ● If you access memory randomly you will break this assumption and will thrash the cash (discussed later).
  • 27. Types of caches ● In reality CPUs have more than one type of cache: ● DCACHE: data cache used for data access (read and write) ● ICACHE: instruction cache (used to cache execute only memory which is your code) ● TLB: Translation Lookaside Buffer. Speeds up virtual to physical address translation.
  • 28. Number of caches ● In reality CPUs usually have more than one cache. ● As you get away from the core caches become bigger and cache lines become bigger. ● But at the same time access becomes slower. ● Until we reach main memory which is biggest and slowest. ● These are called L1, L2, L3, ...
  • 29. Cache thrashing... ● thrashing: using many resources to do little work. bad. ● A piece of software allocates huge amount of ram and then accesses address by random accessing each address one time. ● A piece of software at any given point in time with so much ram that the same address is never accessed twice. ● Two threads or processes that share a cache. Each one would run without cache misses but both produce lots of cache misses. ● In all of these cases the cache is used (it is in hardware after all) but it is not useful. ● The cache is only useful if you have low cache miss rates.
  • 30. Coherent vs non coherent platforms ● Coherency is a very important subject. ● A perfectly coherent platform is one where all components of the system (IO devices, all cores and CPUS) see memory in exactly the same way. ● Coherency is a hardware feature which can be controlled by software (special machine instructions). ● Most real systems are incoherent in some ways but not in others. ● Some systems could be used in both coherent and incoherent ways.
  • 31. Coherency - Example ● Intel is said to be a cache coherent platform. ● This means that two processors see the same memory. ● If one processor puts data in cache then the other processor will respect that and will fetch the data from that cache. ● This slows down the both CPUs a little. ● But it makes programming a whole lot easier. ● But the same platform (Intel) - is not coherent between CPU and peripherals. ● A special instruction needs to be used to make sure that CPU sees data put into main memory by peripheral and the reverse
  • 32. Coherency and memory reordering ● Coherency has nothing to do with memory reordering. ● A platform can be coherent but still have memory reordering (Intel is an example). A platform can also be non-coherent and have no memory reordering (older RISC systems). ● In theory different instructions should deal with memory reordering and coherency but in practice one instruction usually does both.
  • 33. Optimizing CPU cache utilization ● First find the CPU cache line size ● Then define the most common structures to fit inside a cache line. ● Another idea is to split commonly used data from uncommonly used data. This will make sure that the commonly used data will fit into one cache line. ● You can pad your structures to a cache line size using compiler hints. ● See example....
  • 34. CPU counters ● modern CPUs hold counters for all of the above ● TLB misses, data cache misses, instruction cache misses, branch prediction misses, instructions executed and much much more ● some of the counters are only accessible from kernel mode and most from any mode. ● Writing to them is usually restricted to kernel mode. ● in order to access them you need to write assembly code which is non portable.
  • 35. CPU counters - problems ● the problem is counting them right since these counters are not per thread or process because the CPU doesn't know about threads or processes. ● If you are the only process running then no problem (although you are counting the OS as well). ● if not then better solutions are required
  • 36. CPU counters – best practices ● Writing your own code in assembly is hard. Most people don't do that. ● on Linux use perf ● use a library like PAPI to give you the CPU counters involved. ● see demo and example...
  • 37. What is NUMA? ● On very large system using a single main memory bank becomes the main bottle neck. ● On such systems the memory is divided into banks where each bank is attached a set of CPUs. ● Example: 4 memory banks of 4GB each where each bank is attached a Dual Core processor. ● The banks are connected via a bus to allow each processor to access any address. ● Access to the local bank is, of course, much faster than to access a remote bank.
  • 38. NUMA on Linux ● The Linux kernel knows about NUMA and reads the NUMA topology at boot time. ● It schedules processes on processors according to memory affinity. ● It tries to allocate memory for a process on the memory bank closest to that process. ● Once a thread is running Linux will try very hard not to migrate it to a core that is on a different bank since that will cause large memory latency.
  • 39. NUMA syscalls on Linux ● There are various system calls on Linux that allow a programmer to control it's properties in relation to NUMA. ● mbind allows a programmer to bind memory to a NUMA nodes. ● migrate_pages and move_pages move pages from bank to bank. ● set_mempolicy and get_mempolicy allow control of the kernel policy regarding NUMA. ● /proc/[pid]/numa_maps shows the NUMA mapping for a certain process.
  • 40. Memory policies in programming ● When there is a chance of two threads on two cores sharing the same cache line reduce it by moving memory apart or by padding structures. ● When the reverse is true and a flat area of memory is accessed only by one thread on one core try to make memory as small as possible. ● When you have frequently used memory next to non frequency used memory move them apart (redesign of your data structures may be required) ● Measure, measure, measure ● See example....
  • 41. Core concepts for the course
  • 42. Concepts and definitions ● In order to deal with programming for multi-core or even time sharing machines we have to use various tools and concepts. ● We will now go over these concepts. ● Some of these are low level mechanisms supplied either by the OS or libraries. ● Some are high level concepts which set a framework for the programmer to think in.
  • 43. Atomic operations ● A set of CPU instructions intended to aid low level programmers in writing synchronization primitives. ● These instructions can be used from user space so you can use them too. ● Either to gain performance benefits in various conditions. ● Or in order to write synchronization primitives of your own.
• 44. Atomic operations (cont) ● Usually the compiler vendor wraps these primitives for you. ● You can still write assembly code if you want. ● In GCC and the Intel compiler see the documentation for __sync_*. ● Newer builtins matching the C++11 memory model (__atomic_*) also exist in recent compilers. ● Intel Threading Building Blocks has a nice atomic<T> template. ● These operations are fast (nanosecond time).
• 45. Atomic ops : what do they do? ● Guarantee that other CPUs and HW do not see wrong intermediate “junk” values. ● Guarantee that the CPU caches for this variable will be synchronized (relevant only on platforms that do not keep their caches coherent automatically). ● Guarantee that the HW will see the change. ● Prevent reordering of instructions around the atomic operation. ● Prevent races when two or more cores perform the same operation. ● See example...
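A minimal sketch with the GCC builtins named above: two threads bump one counter with a plain increment (racy) and another with __atomic_fetch_add, which never loses an update. __ATOMIC_SEQ_CST is the conservative ordering choice:

    #include <pthread.h>
    #include <stdio.h>

    static long racy = 0;
    static long atomic = 0;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            racy++;  /* load+add+store: updates can be lost */
            __atomic_fetch_add(&atomic, 1, __ATOMIC_SEQ_CST);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* racy is usually < 2000000; atomic is exactly 2000000 */
        printf("racy=%ld atomic=%ld\n", racy, atomic);
        return 0;
    }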
• 46. Mutex ● Definition: an OS/library-based construct that prevents two pieces of code from entering a predefined piece (or pieces) of code at the same time. ● How are mutexes implemented? ● The simplest way is a bit inside the OS kernel backed by a test-and-set CPU machine instruction. ● This means that in order to lock the mutex you have to go to the OS (microsecond range).
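The canonical pthread mutex pattern on Linux (where glibc in fact builds these on top of futexes, discussed next):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);   /* may sleep in the kernel if contended */
            counter++;                   /* critical section */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter=%ld\n", counter);  /* always 2000000 */
        return 0;
    }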
• 47. The problems of mutexes... ● A good design is one in which mutexes are unlocked most of the time. ● Otherwise the mutexes serialize all cores. ● In the uncontended case this means we waste the trip to the OS kernel, because we don't need the OS to put us to sleep or to wake us up. ● Can't we just lock the mutex from user space?
• 48. Enter futex / critical section ● A futex is just like a mutex except that the state of the lock lives in user-space memory, which can be shared by all participating processes. ● The test-and-set is executed in user space, and only if it fails (the lock is locked) do we go to the OS to put the current thread to sleep. ● The reverse applies when leaving the critical section: if no one is waiting, the trip to the OS kernel is avoided. ● The problem with this idea is robustness.
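A deliberately simplified sketch of the idea, not production code (a real implementation, see Drepper's “Futexes Are Tricky”, also avoids the wake syscall when nobody is waiting). The lock word lives in user space; the kernel is entered only on contention:

    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int lock_word = 0;  /* 0 = free, 1 = taken; user-space state */

    static void futex_lock(int *w) {
        /* Test-and-set in user space; fast path never enters the kernel. */
        while (__atomic_exchange_n(w, 1, __ATOMIC_ACQUIRE) != 0)
            /* Lock was taken: ask the kernel to sleep while *w == 1. */
            syscall(SYS_futex, w, FUTEX_WAIT, 1, NULL, NULL, 0);
    }

    static void futex_unlock(int *w) {
        __atomic_store_n(w, 0, __ATOMIC_RELEASE);
        /* Simplification: wake unconditionally, even if nobody waits. */
        syscall(SYS_futex, w, FUTEX_WAKE, 1, NULL, NULL, 0);
    }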
• 49. Futex robustness ● What happens if a process dies holding a futex? ● In a multi-threaded program a crash takes down the whole program (a single thread cannot die alone), so we don't care. ● In a multi-process program we must release all locks held at the time of death. ● This is referred to as “futex robustness”. ● In Linux futexes can be made robust: the kernel tracks held locks via set_robust_list(2), and pthread mutexes opt in with the PTHREAD_MUTEX_ROBUST attribute.
• 50. Semaphores ● Semaphore is a very overloaded concept. ● Sometimes it is used in place of mutex. ● Sometimes it is used with the meaning of “number of permits”. ● Sometimes it is used as a mutex which has a name and can therefore be used across multiple processes, as opposed to only threads. ● Linux has two kinds of these semaphores: SysV and POSIX.
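A named POSIX semaphore sketch showing the “has a name, so multiple processes can share it” usage; the name “/demo_sem” is illustrative (compile with -pthread):

    #include <fcntl.h>
    #include <semaphore.h>
    #include <stdio.h>

    int main(void) {
        /* Any process that opens "/demo_sem" gets the same object.
         * Initial value 1 makes it behave like a mutex. */
        sem_t *sem = sem_open("/demo_sem", O_CREAT, 0644, 1);
        if (sem == SEM_FAILED) { perror("sem_open"); return 1; }

        sem_wait(sem);               /* take a permit (may block) */
        printf("in critical section\n");
        sem_post(sem);               /* return the permit */

        sem_close(sem);
        sem_unlink("/demo_sem");     /* remove the name when done */
        return 0;
    }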
• 51. Readers/Writer lock ● A resource protected by a lock is usually protected because a write and a read at the same time would create problems. ● Multiple concurrent reads do not conflict with each other and, on their own, need no lock at all. ● If we have many readers compared to writers we can gain performance by allowing many readers into the critical section while keeping writers out. ● A writer enters the critical section alone.
  • 52. Readers/Writer lock ● If we only have writers then we serialize all of them and therefore gain no benefit from this type of lock. ● How is a readers/writer lock implemented? Using a regular type of lock. ● In Linux it can even be shared between processes. ● See: pthread_rwlock_*
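The pthread flavour named on the slide; for cross-process use it can be initialized with pthread_rwlockattr_setpshared. A minimal sketch:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
    static int shared_value = 0;

    static void reader(void) {
        pthread_rwlock_rdlock(&rwlock);   /* many readers may hold this at once */
        printf("read %d\n", shared_value);
        pthread_rwlock_unlock(&rwlock);
    }

    static void writer(int v) {
        pthread_rwlock_wrlock(&rwlock);   /* writer enters alone */
        shared_value = v;
        pthread_rwlock_unlock(&rwlock);
    }

    int main(void) {
        writer(42);
        reader();
        return 0;
    }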
• 53. Spin lock ● We are talking about user space spin locks here (do not confuse with kernel space spin locks). ● A lock that spins – busy waits – until a test-and-set operation succeeds. ● This is efficient when the operation waited upon is very short compared to the trip to the OS to lock the structure. ● On the other hand, if the lock is not contended then it has the same speed as a futex. ● In the contended case it may be problematic if the two contending threads are on the same core. That is why spin locks are usually used for threads running on different cores. ● See pthread_spin_* on Linux...
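The pthread spin lock API in its simplest form:

    #include <pthread.h>

    int main(void) {
        pthread_spinlock_t spin;
        /* PTHREAD_PROCESS_PRIVATE: not shared with other processes. */
        pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);

        pthread_spin_lock(&spin);    /* busy-waits instead of sleeping */
        /* ...very short critical section... */
        pthread_spin_unlock(&spin);

        pthread_spin_destroy(&spin);
        return 0;
    }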
  • 54. Lock free algorithms ● Are a class of algorithms that require no locks during their operation. ● They usually require atomic operations instead. ● Examples are a lock free circular buffer, lock free hash table and more. ● Sometimes these algorithms rely on spinning for a short while. ● Usually quite a pain to implement correctly. ● But you can use a library instead...
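A sketch of the simplest member of this family, the lock-free circular buffer mentioned above, in its single-producer/single-consumer form using only C11 atomics (assumptions: exactly one producer thread, one consumer thread, power-of-two size):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define SIZE 1024  /* must be a power of two */

    struct spsc {
        _Atomic size_t head;  /* written only by the producer */
        _Atomic size_t tail;  /* written only by the consumer */
        int buf[SIZE];
    };

    bool spsc_push(struct spsc *q, int v) {
        size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
        if (head - tail == SIZE)
            return false;                       /* full */
        q->buf[head % SIZE] = v;
        /* Release: the store to buf[] becomes visible before the new head. */
        atomic_store_explicit(&q->head, head + 1, memory_order_release);
        return true;
    }

    bool spsc_pop(struct spsc *q, int *v) {
        size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail == head)
            return false;                       /* empty */
        *v = q->buf[tail % SIZE];
        atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
        return true;
    }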
  • 55. Dead lock ● A situation in which no work is done because everyone is waiting on a lock and so no one is releasing any lock... ● Example: A takes “a” lock, B takes “b” lock, A waits for “b” lock, B waits for “a” lock. ● Solutions: code carefully or always acquire locks in the same order. ● There are all kinds of dead lock detectors available that have overhead but can be used at development time and removed in production. ● Dead lock is called dead because all threads are sleeping when it occurs.
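One common way to implement the “always acquire locks in the same order” rule is to impose a global order on the locks themselves, for example by address. A minimal sketch:

    #include <pthread.h>
    #include <stdint.h>

    /* Acquire two mutexes in a globally consistent order (by address),
     * so no two threads can ever lock them in opposite orders. */
    static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
        if ((uintptr_t)a < (uintptr_t)b) {
            pthread_mutex_lock(a);
            pthread_mutex_lock(b);
        } else {
            pthread_mutex_lock(b);
            pthread_mutex_lock(a);
        }
    }

    static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
        pthread_mutex_unlock(a);   /* unlock order does not matter */
        pthread_mutex_unlock(b);
    }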
• 56. Live lock ● A situation where no work is getting done although all threads are alive. ● This happens when all threads yield to one another instead of acquiring the lock... ● ...and no thread actually acquires the lock. ● Since most code doesn't yield, this almost never happens in practice. ● See sched_yield(2) in Linux...
  • 57. Memory sharing ● Having one piece of memory accessible by multiple running entities (threads or processes). ● Multiple threads share all memory by default. ● Multiple processes share no memory by default. ● In Linux there are two types of shared memory: SYSV IPC and POSIX. ● And libraries like ACE and BOOST to wrap them...
• 58. Memory sharing (cont) ● If you keep data in the shared memory then you have to either: ● Have a piece of data per process, in which case you don't need to protect it. ● Or have multiple processes access the same data and protect it via all the mechanisms described before. ● Use a specialized lock like a readers/writer lock. ● Use a specialized framework like RCU. ● Use a lock free algorithm.
• 59. Memory sharing (cont) ● In Linux most synchronization primitives can be placed into shared memory. ● This means you can use a futex, spinlock, readers/writer lock or an atomic variable to synchronize processes. ● That is the reason futex robustness was implemented in Linux.
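A sketch combining the last two slides: a process-shared pthread mutex placed in POSIX shared memory. The name “/demo_shm” is illustrative, error handling is omitted, and real code must coordinate which process initializes the mutex (compile with -pthread, plus -lrt on older glibc):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0644);
        ftruncate(fd, sizeof(pthread_mutex_t));
        pthread_mutex_t *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);

        /* The first process marks the mutex as process-shared. */
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);

        pthread_mutex_lock(m);
        printf("pid %d in cross-process critical section\n", getpid());
        pthread_mutex_unlock(m);
        return 0;
    }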
• 60. Message queues ● A mechanism by which two processes or threads can pass messages to one another. ● If used between processes it is an OS-managed resource: it is limited in size and has efficiency issues, since each message is copied from the source process to the OS and then from the OS to the target process.
• 61. Message queues (cont...) ● In a multi-threaded environment message queues are used to pass pointers to messages around, serving more as a synchronization mechanism. ● This is because all threads share memory. ● ACE and BOOST both implement queues.
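A minimal POSIX message queue sketch; the name “/demo_mq” is illustrative (link with -lrt on older glibc):

    #include <fcntl.h>
    #include <mqueue.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* NULL attributes = system defaults for depth and message size. */
        mqd_t mq = mq_open("/demo_mq", O_CREAT | O_RDWR, 0644, NULL);
        if (mq == (mqd_t)-1) { perror("mq_open"); return 1; }

        const char *msg = "hello";
        mq_send(mq, msg, strlen(msg) + 1, 0 /* priority */);

        char buf[8192];  /* must be >= mq_msgsize (8192 is a common default) */
        unsigned prio;
        ssize_t n = mq_receive(mq, buf, sizeof(buf), &prio);
        printf("received %zd bytes: %s\n", n, buf);

        mq_close(mq);
        mq_unlink("/demo_mq");
        return 0;
    }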
• 62. Barriers ● A synchronization primitive that has a number associated with it. ● When a thread arrives at a barrier it goes to sleep. ● When enough threads arrive at the barrier (the number specified at barrier creation) the barrier is “broken” and all the threads continue to run. ● Typical use: real-time applications that divide their work into time slices. ● Linux: see pthread_barrier_*
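A minimal pthread barrier sketch: four threads finish phase 1, sleep at the barrier, and all start phase 2 together:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    static pthread_barrier_t barrier;

    static void *worker(void *arg) {
        long id = (long)arg;
        printf("thread %ld: phase 1 done\n", id);
        /* Sleep here until all NTHREADS threads have arrived. */
        pthread_barrier_wait(&barrier);
        printf("thread %ld: phase 2 starts\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }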
• 63. Condition ● A synchronization primitive enabling one thread to sleep while waiting for a software-defined condition to occur. ● Another thread may change the software state and signal sleeping threads to re-check the condition. ● A condition has a mutex associated with it that must be held while using it. ● In Linux: see pthread_cond_* ● There are ACE and BOOST wrappers. ● See example...
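The standard pthread condition pattern: the waiter always re-checks the predicate in a loop (wakeups can be spurious), and the signaler changes the state under the mutex before signaling:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static int ready = 0;   /* the software-defined condition */

    static void *waiter(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!ready)                       /* re-check: wakeups can be spurious */
            pthread_cond_wait(&cond, &lock); /* releases the lock while sleeping */
        printf("condition met\n");
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, waiter, NULL);

        pthread_mutex_lock(&lock);
        ready = 1;                           /* change the state... */
        pthread_cond_signal(&cond);          /* ...and tell sleepers to re-check */
        pthread_mutex_unlock(&lock);

        pthread_join(t, NULL);
        return 0;
    }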
• 64. RCU ● RCU: Read/Copy/Update ● RCU is a framework for thinking about accessing a shared resource. ● The idea is not to block either readers or writers in accessing the resource. ● Instead, writers get their own copy. ● The new copy gets substituted atomically. ● Readers who started reading the old copy continue to read the old copy.
• 65. RCU (continued) ● New readers will read the new copy. ● The problem with RCU is the costly copy operation for writers. ● It is usually used in heavyweight systems where read and write operations are expensive enough that the cost of the copy is outweighed. ● A good example is a database system.
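A copy-on-write flavour of the idea, sketched with C11 atomics. Readers just load the current pointer; a writer builds a fresh copy and publishes it with one atomic store. Safe reclamation of the old copy (the grace period) is the hard part that real RCU implementations such as liburcu provide, and it is deliberately omitted here; the struct and function names are illustrative:

    #include <stdatomic.h>
    #include <stdlib.h>

    struct config { int timeout; int retries; };

    /* The currently published copy; writer_update() must run once
     * before any reader, otherwise readers see NULL. */
    static _Atomic(struct config *) current;

    const struct config *reader_get(void) {
        /* No lock: just read the latest published pointer. */
        return atomic_load_explicit(&current, memory_order_acquire);
    }

    void writer_update(int timeout, int retries) {
        struct config *fresh = malloc(sizeof(*fresh));
        fresh->timeout = timeout;
        fresh->retries = retries;
        /* Publish atomically; readers see either the old or the new
         * copy, never a half-written one. */
        struct config *old =
            atomic_exchange_explicit(&current, fresh, memory_order_acq_rel);
        (void)old;  /* leaked here: freeing it needs a grace period */
    }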
• 66. Machine memory barrier ● A specialized machine instruction that tells the CPU you are running on not to reorder reads or writes around it. ● In theory there are all kinds of barriers: read, write, read-and-write, I/O and CPU, CPU-only, etc. ● In practice every hardware platform implements only a subset of those. ● In user space it is rarely used directly, because the authors of synchronization mechanisms use them and the end programmer just uses the synchronization mechanisms. ● In user space Linux: __sync_synchronize() is a full machine barrier, along with all the other atomic operations. ● In the kernel (or when writing device drivers in user space) it is used heavily to communicate correctly with hardware (mb(), rmb(), wmb()).
• 67. Compiler barrier ● A compile-time instruction to the compiler (much like a #pragma) that tells the compiler not to assume values of memory around the instruction. ● See example below... ● It is better to use a compiler barrier than the volatile keyword when writing synchronization primitives.
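The classic GCC compiler barrier: it emits no machine code at all, but forbids the compiler from caching memory values in registers across it or moving memory accesses around it. A minimal sketch:

    #define barrier() __asm__ __volatile__("" ::: "memory")

    static int flag;  /* set by another thread or a signal handler */

    void spin_until_set(void) {
        while (!flag)
            barrier();  /* without this the compiler may hoist the load
                           of 'flag' out of the loop and spin forever */
    }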
• 68. Volatile considered harmful ● Because it forbids the compiler from caching the value in a register, even within a critical section where that would be safe. ● Because it disables use of CPU caches on various platforms (not on Intel). ● Basically, because it is overkill. ● Better to use the compiler barrier mentioned before...
• 69. References ● Multicore Application Programming: For Windows, Linux, and Oracle Solaris, Darryl Gove. ● The Art of Multiprocessor Programming, Maurice Herlihy and Nir Shavit. ● What Every Programmer Should Know About Memory, Ulrich Drepper, http://www.akkadia.org/drepper/cpumemory.pdf ● The Paul E. McKenney book on RCU.
  • 70. Synchronization mechanisms ● Nothing ● Atomics ● Lockless Algos/DS ● RCU/COW ● Readers/Writer lock ● Spinlock ● Futex ● Mutex ● Semaphore