Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.



Published on

Published in: Technology
  • Be the first to comment


  1. 1. Multi core programming Mark Veltzer
  2. 2. Introduction ● Most modern computers and even cell phones have more than one physical device that can run code. ● These devices, whether you call them cores or CPUs can operate at the same time. ● The idea was to speed up processing when making the transistor smaller became too difficult. ● Add more cores and you are done, right?!? Wrong...
  3. 3. Adding CPUs is not everything... ● There are resources which are shared between CPUs and as you add CPUs these resources become bottle necks and start giving you headaches... ● This means that there is a very real limit to the number of CPUs you can just add. ● In addition each CPU is very expensive in terms of power consumption and other internal infrastructure (caches etc) ● This makes you want to share this infrastructure between CPUs. ● In comes multi-core.
  4. 4. Multi Core vs Multi CPU ● the terms multi core and multi CPU are highly abused and used with various meaning. ● We will use the following meanings: ● A multi CPU machine is a machine with more than one execution unit where little is shared between CPUs. ● A multi core machine is a machine where more is shared between CPUs. ● This definition is not very precise.
  5. 5. There is always sharing ● CPUs always share somethings... ● Main memory is a good example – if both CPUs try to access main memory at the same time one of them is delayed a little. ● Multi core CPUs share more than just main memory: various caches, parts of the execution pipeline, interrupt controller and more. ● What exactly is shared depends on the architecture.
  6. 6. Why should I care about Multi core? ● The main reason is performance. ● If two CPUs that you are running on share stuff then they could, at some conditions, cause delays one for the other. ● This could be costly. ● Knowing what could be delayed would mean that you can program differently and avoid the delay.
  7. 7. Types of sharing we don't care about ● There are other types of sharing we don't care about like voltage controllers, physical sockets and more. ● These do not cause any delays of significance. ● In general the more sharing the more mutual influence the CPUs have one on the other and the more potential delays could be created, the more room we have for optimisation.
  8. 8. How do I know how many cores I have? ● read the manual of your mother board?!? ● $ cat /proc/cpuinfo | grep processor ● number of configured processors: $ getconf _NPROCESSORS_CONF ● number of online processors: $ getconf _NPROCESSORS_ONLN ● $ cat /sys/devices/system/cpu/ {present,offline,online} ● Programatically: sysconf(3)
  9. 9. Number of cores is not constant ● Linux supports hot plugging of CPUs on various architectures. ● Most user space programmers don't care too much about this. ● Adding a CPU is not a big problem. ● Removing a CPU works by moving load from a running core to other cores. ● Of course, you can't remove the last core.
  10. 10. What is the maximum number of CPUs I can run on? ● Sometimes you need access to this information in order to hold data per CPU. ● you can ask how many processors are currently online but new processors may be added at runtime. ● the operating system kernel is compiled to a maximum number of CPUs in it's data structures. You can use that number in your code as well. ● you can get that from the kernel headers package. ● or you can read /sys/devices/system/cpu/kernel_max
  11. 11. How multicore vs multi CPU effects programming... ● Multicore CPUs will see memory changes earlier. ● Multicore will slow each other more since they share resources. ● Memory communication between multicore processes will be faster ● In general our intuition about multi – CPU machines is more akin to multi CPU machines than to multi core machines.
  12. 12. Multicore and Multi-CPU are not mutually exclusive... ● In todays architectures it is common to find a single system which has both multi core and multi CPU. ● A single system could have 4 cores, packed as a 2x2 multi-core system. ● This means that the two pairs of CPUs share little while each pair shares a lot. ● You need to know about all of this – this is called the CPU topology.
  13. 13. Does anyone look at the CPU topology? ● Yes, the OS kernel knows about it and uses this information for better scheduling and other uses. ● You can use the information from: /sys/devices/system/cpu/cpu[n]/topology/* ● You can do that at compile time to produce various versions for various topologies, using a config file per topology or even dynamically at run time...
  14. 14. A little bit about timing... ● In order to appreciate performance benefits we need to know a little about timings. The following are only ball park figures... ● Single machine instruction without memory access or that only accesses the cache: nanoseconds... ● Single machine instruction which accesses main memory: hundreds of nanoseconds ● Single OS system call: microsecond ● Read data from physical hard drive: miliseconds
  15. 15. What are pipe lines in the CPU? ● A clock in a CPU is simply a oscillator crystal at a constant rate that synchronizes the operation of all the parts of the CPU. ● On a steady clock CPU the clock also has to be longer than the longest executing instruction (bummer!) ● This means that it is difficult to increase the clock rate. Even if most instructions are very short the longest one determines the clock rate.
  16. 16. RISC ● One solution was to simpify instructions: break down complicated instructions into simple ones. ● Now the longest instruction is much shorter and the CPU clock could be increased. ● But now you need to fetch more instructions and have a bigger instruction cache. ● Compilers also work harder because even simple operations require generation of several machine instructions - and object files become bigger.
  17. 17. Pipe lines ● A different solution was to keep instructions complex but break them down inside the cpu. ● Each instruction will be broken up to several stages. ● This way several instructions could be executed at the same time because every stage will be executed separately ● Clock rate can now be increased to match the time it takes the most complicated stage of the most complicated instruction to execute. ● Compilers don't have to work hard. ● Because increasing the clock became hard engineers decided on a pipe line approach to increasing the CPU speed.
  18. 18. How pipelines effect programming? ● the CPU misses branch prediction as a result of the pipelining process. ● we need to tell the CPU how likely/unlikey our branch is using hints to the compiler. ● code which is very un regular (noisy and hard to predict) will be slower because it will confuse the branch prediction in the CPU. ● See example...
  19. 19. What are cache lines? ● when the CPU fetches memory it does not fetch it one address at a time. ● instead it fetches a full "cache line". ● typical cache lines are 16 to 128 bytes and are usually aligned. ● for example, on current intel processors cache lines are 64 bytes. ● Cache line should not be confused with memory page which is a much bigger construct (4096 bytes on Intel).
  20. 20. How to get the cache line? ● On the command line: ● $ getconf LEVEL1_DCACHE_LINESIZE ● $ cat /proc/cpuinfo | grep "clflush size" ● $ cat /sys/devices/system/cpu/cpu0/cache/index0/co herency_line_size ● Or use sysconf(3). ● Or hardcode in you build process.
  21. 21. How do you pad a structure to have the size rounded to cache line size? ● Do it at compile time (of course). ● Add a char[n] member to your class/struct/... ● Or use special compiler padding syntax. ● Sometimes, this is counter productive when two of these structures could fit into a single cache line for instance. In these cases laying out memory of the entire data structure is a better strategy. ● Dont do this without thinking about it!
  22. 22. How pipeline organization in a core effects programming? ● Example: A CPU has two pipelines: one for integer computation and one for floating point. ● Then it is better to have integer instructions interleaved with floating point ones (they will be executed concurrently). ● The problem is that this type of advice is very hard to follow in practice. ● And the compiler is supposed to be able to do that for you anyway...
  23. 23. How pipeline organization in a core effects programming? ● The CPU rearranges the order in which it does loads and stores as a side effect of pipeline ● The semantics of your code remain the same, the order does not matter. ● Usually you don't care about this and should leave this alone. ● Except in cases where you are writing a synchronization mechanism or talking to hardware. ● In that case you would use a machine barrier to prevent this reordering (will be explained later).
  24. 24. The problems of memory ● Why do CPUs need a memory cache at all? The idea is that not all memory acts in the same speed ● There are, and always will be, better technologies for faster memory, and old technologies for slower memory ● There is a huge monetary difference between these types of memory. ● Why didn't they make the entire Airplane out the stuff that makes up the black box? ...:) ● So can we use the fact that we have fast and expensive memory as opposed to slower and less expensive memory to speed up the computer? Yes.
  25. 25. Symmetric vs non symmetric memory ● Access to main memory is not uniform for all CPUs or cores in the system ● When the access is the same (there are various techniques to do that) then the system is called UMA – Uniform Memory Access. ● When the system is not symmetric (e.g. main memory is connected to CPU0 and CPU1 has to ask CPU0 for access) then the system is called NUMA – Non Uniform Memory Access. ● There are other types of NUMA systems as well (more on that later)
  26. 26. The CPU cache ● A relatively small area of memory within the CPU that is made out of the fast expensive memory and which stores a copy of some addresses accessed by the CPU. ● The algorithm that chooses which addresses to store in the cache is varied across different CPUs. ● some store the last addresses accessed, some store the most heavily used ones. ● The idea is to reduce the AVERAGE time it take to access a given memory address. ● Rests upon the assumption that software is not chaotic and tends to reuse the same addresses most of the time. ● If you access memory randomly you will break this assumption and will thrash the cash (discussed later).
  27. 27. Types of caches ● In reality CPUs have more than one type of cache: ● DCACHE: data cache used for data access (read and write) ● ICACHE: instruction cache (used to cache execute only memory which is your code) ● TLB: Translation Lookaside Buffer. Speeds up virtual to physical address translation.
  28. 28. Number of caches ● In reality CPUs usually have more than one cache. ● As you get away from the core caches become bigger and cache lines become bigger. ● But at the same time access becomes slower. ● Until we reach main memory which is biggest and slowest. ● These are called L1, L2, L3, ...
  29. 29. Cache thrashing... ● thrashing: using many resources to do little work. bad. ● A piece of software allocates huge amount of ram and then accesses address by random accessing each address one time. ● A piece of software at any given point in time with so much ram that the same address is never accessed twice. ● Two threads or processes that share a cache. Each one would run without cache misses but both produce lots of cache misses. ● In all of these cases the cache is used (it is in hardware after all) but it is not useful. ● The cache is only useful if you have low cache miss rates.
  30. 30. Coherent vs non coherent platforms ● Coherency is a very important subject. ● A perfectly coherent platform is one where all components of the system (IO devices, all cores and CPUS) see memory in exactly the same way. ● Coherency is a hardware feature which can be controlled by software (special machine instructions). ● Most real systems are incoherent in some ways but not in others. ● Some systems could be used in both coherent and incoherent ways.
  31. 31. Coherency - Example ● Intel is said to be a cache coherent platform. ● This means that two processors see the same memory. ● If one processor puts data in cache then the other processor will respect that and will fetch the data from that cache. ● This slows down the both CPUs a little. ● But it makes programming a whole lot easier. ● But the same platform (Intel) - is not coherent between CPU and peripherals. ● A special instruction needs to be used to make sure that CPU sees data put into main memory by peripheral and the reverse
  32. 32. Coherency and memory reordering ● Coherency has nothing to do with memory reordering. ● A platform can be coherent but still have memory reordering (Intel is an example). A platform can also be non-coherent and have no memory reordering (older RISC systems). ● In theory different instructions should deal with memory reordering and coherency but in practice one instruction usually does both.
  33. 33. Optimizing CPU cache utilization ● First find the CPU cache line size ● Then define the most common structures to fit inside a cache line. ● Another idea is to split commonly used data from uncommonly used data. This will make sure that the commonly used data will fit into one cache line. ● You can pad your structures to a cache line size using compiler hints. ● See example....
  34. 34. CPU counters ● modern CPUs hold counters for all of the above ● TLB misses, data cache misses, instruction cache misses, branch prediction misses, instructions executed and much much more ● some of the counters are only accessible from kernel mode and most from any mode. ● Writing to them is usually restricted to kernel mode. ● in order to access them you need to write assembly code which is non portable.
  35. 35. CPU counters - problems ● the problem is counting them right since these counters are not per thread or process because the CPU doesn't know about threads or processes. ● If you are the only process running then no problem (although you are counting the OS as well). ● if not then better solutions are required
  36. 36. CPU counters – best practices ● Writing your own code in assembly is hard. Most people don't do that. ● on Linux use perf ● use a library like PAPI to give you the CPU counters involved. ● see demo and example...
  37. 37. What is NUMA? ● On very large system using a single main memory bank becomes the main bottle neck. ● On such systems the memory is divided into banks where each bank is attached a set of CPUs. ● Example: 4 memory banks of 4GB each where each bank is attached a Dual Core processor. ● The banks are connected via a bus to allow each processor to access any address. ● Access to the local bank is, of course, much faster than to access a remote bank.
  38. 38. NUMA on Linux ● The Linux kernel knows about NUMA and reads the NUMA topology at boot time. ● It schedules processes on processors according to memory affinity. ● It tries to allocate memory for a process on the memory bank closest to that process. ● Once a thread is running Linux will try very hard not to migrate it to a core that is on a different bank since that will cause large memory latency.
  39. 39. NUMA syscalls on Linux ● There are various system calls on Linux that allow a programmer to control it's properties in relation to NUMA. ● mbind allows a programmer to bind memory to a NUMA nodes. ● migrate_pages and move_pages move pages from bank to bank. ● set_mempolicy and get_mempolicy allow control of the kernel policy regarding NUMA. ● /proc/[pid]/numa_maps shows the NUMA mapping for a certain process.
  40. 40. Memory policies in programming ● When there is a chance of two threads on two cores sharing the same cache line reduce it by moving memory apart or by padding structures. ● When the reverse is true and a flat area of memory is accessed only by one thread on one core try to make memory as small as possible. ● When you have frequently used memory next to non frequency used memory move them apart (redesign of your data structures may be required) ● Measure, measure, measure ● See example....
  41. 41. Core concepts for the course
  42. 42. Concepts and definitions ● In order to deal with programming for multi-core or even time sharing machines we have to use various tools and concepts. ● We will now go over these concepts. ● Some of these are low level mechanisms supplied either by the OS or libraries. ● Some are high level concepts which set a framework for the programmer to think in.
  43. 43. Atomic operations ● A set of CPU instructions intended to aid low level programmers in writing synchronization primitives. ● These instructions can be used from user space so you can use them too. ● Either to gain performance benefits in various conditions. ● Or in order to write synchronization primitives of your own.
  44. 44. Atomic operations (cont) ● Usually the compiler vendor wraps these primitives for you. ● You can still write assembly code if you want. ● In GCC and Intel compiler see documentation for __sync_* ● New C++ 11 APIs also exist in some compilers (__atomic_*). ● Intel threading building blocks has a nice atomic<T> template. ● These operations are fast (nanosecond time).
  45. 45. Atomic ops : what do they do? ● Guarantee that other CPUs and HW do not see wrong intermediate “junk” values. ● Guarantee that the CPU caches for this variable will be synchronized (only for platforms that do not synchronize the caches). ● Guarantee that the HW will see the change. ● Prevent reordering of instructions around the atomic operation. ● Prevent race of the same op by two or more cores. ● See example...
  46. 46. Mutex ● definition: an OS/library based construct enabling two pieces of code to avoid entering a predefined piece(s) of code at the same time. ● How are mutexes implemented? ● The simplest way is a bit inside the OS kernel backed up by a test-and-set CPU machine instruction. ● This means that in order to lock the mutex you have to go to the OS (microsecond range).
  47. 47. The problems of mutexes... ● A good design is one in which mutexes are unlocked most of the time ● Otherwise these mutexes serialize all cores. ● This means that we waste the trip to the OS kernel because we don't need the OS to put us to sleep or to wake us up. ● Can't we just lock the mutex from user space?
  48. 48. Enter futex / critical section ● A futex is just like a mutex with the exception that the state of the lock is shared by all concurrent processes in user space. ● The test-and-set is executed in user space and only if it fails (the lock is locked) then we go to the OS to put the current thread to sleep. ● The reverse applies when leaving the critical section: if no one is waiting the trip to the OS kernel is avoided. ● The problem with this idea is robustness.
  49. 49. Futex robustness ● What happens if a process dies holding a futex? ● In a multi-threaded program the whole program dies (a thread cannot die) so we don't care. ● In a multi-process program we must release all locks held at time of death. ● This is referred to as “futex robustness”. ● In Linux all futexes are robust.
  50. 50. Semaphores ● Semaphore is a very overloaded concept. ● Sometimes it is used in place of mutex. ● Sometimes it is used with a meaning of “number of permits” ● Sometimes it is used as mutex which has a name and therefor could be used with multiple processes as opposed to multi-threading. ● Linux has two kinds of these semaphores: sysv and POSIX.
  51. 51. Readers/Writer lock ● A resource protected by a lock is usually protected because a write and read at the same time would create problems. ● Multiple reads are not a problem and do not need a lock at all. ● If we have many readers compared to writes we could gain performance by allowing many readers into the critical section and not allowing writers. ● A writer would enter the critical section alone
  52. 52. Readers/Writer lock ● If we only have writers then we serialize all of them and therefore gain no benefit from this type of lock. ● How is a readers/writer lock implemented? Using a regular type of lock. ● In Linux it can even be shared between processes. ● See: pthread_rwlock_*
  53. 53. Spin lock ● We are talking about user space spin locks here (do not confuse with kernel space spin locks). ● Is a lock that spins – busy waits – until a test-and-set operation succeeds. ● This is efficient when the operation waited upon is very short compared to the trip to the OS to lock the structure. ● On the other hand if the lock is not contended then it has the same speed as a futex. ● In the contended case it may be problematic if the two contending thread are on the same core. That is why they are usually used for threads running on different cores. ● See pthread_spin_* on Linux...
  54. 54. Lock free algorithms ● Are a class of algorithms that require no locks during their operation. ● They usually require atomic operations instead. ● Examples are a lock free circular buffer, lock free hash table and more. ● Sometimes these algorithms rely on spinning for a short while. ● Usually quite a pain to implement correctly. ● But you can use a library instead...
  55. 55. Dead lock ● A situation in which no work is done because everyone is waiting on a lock and so no one is releasing any lock... ● Example: A takes “a” lock, B takes “b” lock, A waits for “b” lock, B waits for “a” lock. ● Solutions: code carefully or always acquire locks in the same order. ● There are all kinds of dead lock detectors available that have overhead but can be used at development time and removed in production. ● Dead lock is called dead because all threads are sleeping when it occurs.
  56. 56. Live lock ● A situation where no work is getting done although all threads are live. ● This happens when all threads yield on to the other instead of acquiring the lock. ● … and no thread actually acquires the lock ● Since most code doesn't yield this almost never happens in practice. ● See sched_yield(2) in Linux...
  57. 57. Memory sharing ● Having one piece of memory accessible by multiple running entities (threads or processes). ● Multiple threads share all memory by default. ● Multiple processes share no memory by default. ● In Linux there are two types of shared memory: SYSV IPC and POSIX. ● And libraries like ACE and BOOST to wrap them...
  58. 58. Memory sharing (cont) ● If you share data in the shared memory then you have to either: ● Have a piece of data per process and then you don't need to protect it. ● Have multiple processes access the same data and then protect it via all the mechanisms describe before ● Use a specialized lock like readers/writer lock. ● Use a specialized framework like RCU. ● Use a lock free algorithm.
  59. 59. Memory sharing (cont) ● In Linux most synchronization primitives can be placed into shared memory. ● This means you can use a futex, spinlock, reader/writers lock or an atomic variable to synchronize processes. ● That is the reason that futexes in Linux were implemented as robust futexes.
  60. 60. Message queues ● A mechanism by which two processes or threads could pass messages one to the other. ● If used for multi-processing then handled by the OS. ● If used between processes it is an OS managed resource, is limited and has efficiency issues since messages are copied from source process to OS and then back from OS to target process.
  61. 61. Message queues (cont...) ● In a multi-threaded environment message queues are used to pass pointers to messages around and more as a synchronization mechanism. ● This is because all threads share memory. ● ACE and BOOST both implement queues.
  62. 62. Barriers ● A synchronization primitive that has a number associated with it. ● When a thread arrives at a barrier it goes to sleep. ● When enough threads arrive at the barrier (the number specified at the barrier creation) the barrier is “broken” and all the threads continue to run. ● Typical use: real time application time slice work. ● Linux: see pthread_barrier_*
  63. 63. Condition ● A synchronization primitive enabling one thread to sleep waiting for a software defined condition to occur. ● Another thread may change the software state and signal sleeping threads to check the condition. ● A condition has a mutex associated with it that must be grabbed before using it. ● In Linux: see pthread_cond_* ● There are ACE and BOOST wrappers. ● See example...
  64. 64. RCU ● RCU: Read/Copy/Update ● RCU is framework for thinking about accessing a shared resource. ● The idea is not to block either readers or writers in accessing the resource. ● Instead, writers get their own copy. ● The new copy gets substituted atomically. ● Readers who started reading the old copy continue to read the old copy.
  65. 65. RCU (continued) ● New readers will read the new copy. ● The problem with RCU is the costly copy operation for writers. ● Usually implemented in heavy systems where the cost of the read or write operation is heavy and outweighs the cost of the copy. ● A good example is a database system.
  66. 66. Machine memory barrier ● Is a specialized machine instruction that tells the CPU you are running on not to reorder reads or writes around it. ● In theory there are all kinds of barriers: read, write, read and write, io and CPU, only CPU, etc. ● In practice every hardware platform implements only a subset of those. ● In user space it is rarely used because authors of synchronization mechanisms use them and the end programmer just uses the synchronization mechanisms. ● In user space Linux: __sync_synchronize() is a full machine barrier along with all other atomic operations. ● In kernel (or in writing device drivers in user space) it is used heavily to communicate correctly with hardware (mb(),rmb(),wmb())
  67. 67. Compiler barrier ● Is a compile time instruction to the compiler (much like #pragma) that tells the compiler not to assume values of memory around the instruction. ● See example... ● It is better to use a compiler barrier than the volatile keyword for writing synchronization primitives.
  68. 68. Volatile considered harmful ● Because it doesn't allow caching, even within a critical section. ● Because it disables use of CPU caches on various platforms (not on Intel). ● Basically because it is an overkill. ● Better use the compiler barrier mentioned before...
  69. 69. References ● Multicore Application Programming, For Windows, Linux, and Oracle Solaris. Darryl Gove. ● The Art of Multiprocessor Programming, Maurice Herlihy and Nir Shavit. ● What Every Programmer Should Know About Memory, Ulrich Drepper, ● The Paul E. McKenney book on RCU.
  70. 70. Synchronization mechanisms ● Nothing ● Atomics ● Lockless Algos/DS ● RCU/COW ● Readers/Writer lock ● Spinlock ● Futex ● Mutex ● Semaphore