Multi-core architectures



  1. Multi-core architectures
     Jernej Barbic, 15-213, Spring 2007 (May 3, 2007)
  2. Single-core computer
  3. Single-core CPU chip
     [diagram: a CPU chip with its single core highlighted]
  4. Multi-core architectures
     • This lecture is about a new trend in computer architecture: replicating multiple processor cores on a single die.
     [diagram: multi-core CPU chip with Core 1, Core 2, Core 3, Core 4]
  5. Multi-core CPU chip
     • The cores fit on a single processor socket
     • Also called CMP (Chip Multi-Processor)
  6. The cores run in parallel
     [diagram: thread 1 through thread 4, one per core]
  7. Within each core, threads are time-sliced (just like on a uniprocessor)
     [diagram: several threads multiplexed on each core]
  8. Interaction with the Operating System
     • The OS perceives each core as a separate processor
     • The OS scheduler maps threads/processes to different cores
     • Most major OSes support multi-core today: Windows, Linux, Mac OS X, …
  9. Why multi-core?
     • Difficult to push single-core clock frequencies even higher
     • Deeply pipelined circuits:
       - heat problems
       - speed-of-light problems
       - difficult design and verification
       - large design teams necessary
       - server farms need expensive air-conditioning
     • Many new applications are multithreaded
     • General trend in computer architecture (shift towards more parallelism)
  10. Instruction-level parallelism
     • Parallelism at the machine-instruction level
     • The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
     • Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
  11. Thread-level parallelism (TLP)
     • This is parallelism on a coarser scale
     • A server can serve each client in a separate thread (Web server, database server)
     • A computer game can do AI, graphics, and physics in three separate threads
     • Single-core superscalar processors cannot fully exploit TLP
     • Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
  12. General context: multiprocessors
     • A multiprocessor is any computer with several processors
     • SIMD
       - Single instruction, multiple data
       - Modern graphics cards
     • MIMD
       - Multiple instructions, multiple data
     [photo: Lemieux cluster, Pittsburgh Supercomputing Center]
  13. Multiprocessor memory types
     • Shared memory: one (large) common memory shared by all processors
     • Distributed memory: each processor has its own (small) local memory, whose contents are not replicated anywhere else
  14. A multi-core processor is a special kind of multiprocessor: all processors are on the same chip
     • Multi-core processors are MIMD: different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data)
     • Multi-core is a shared-memory multiprocessor: all cores share the same memory
  15. What applications benefit from multi-core?
     • Database servers
     • Web servers (Web commerce)
     • Compilers
     • Multimedia applications
     • Scientific applications, CAD/CAM
     • In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
     Each can run on its own core
  16. More examples
     • Editing a photo while recording a TV show through a digital video recorder
     • Downloading software while running an anti-virus program
     • “Anything that can be threaded today will map efficiently to multi-core”
     • BUT: some applications are difficult to parallelize
  17. A technique complementary to multi-core: simultaneous multithreading
     • Problem addressed: the processor pipeline can get stalled:
       - waiting for the result of a long floating-point (or integer) operation
       - waiting for data to arrive from memory
     • Meanwhile, other execution units sit unused
     [diagram: pipeline resources: BTB and I-TLB, decoder, trace cache, uCode ROM, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, L2 cache and control, bus. Source: Intel]
  18. Simultaneous multithreading (SMT)
     • Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core
     • Weaves together multiple “threads” on the same core
     • Example: if one thread is waiting for a floating-point operation to complete, another thread can use the integer units
  19. Without SMT, only a single thread can run at any given time
     [diagram: pipeline occupied by Thread 1 (floating point) only]
  20. Without SMT, only a single thread can run at any given time
     [diagram: pipeline occupied by Thread 2 (integer operation) only]
  21. SMT processor: both threads can run concurrently
     [diagram: Thread 1 (floating point) and Thread 2 (integer operation) sharing the pipeline]
  22. But: they can’t simultaneously use the same functional unit
     [diagram: Thread 1 and Thread 2 both targeting the integer unit, marked IMPOSSIBLE]
     This scenario is impossible with SMT on a single core (assuming a single integer unit)
  23. SMT is not a “true” parallel processor
     • Enables better threading (e.g. up to 30% improvement)
     • The OS and applications perceive each simultaneous thread as a separate “virtual processor”
     • The chip has only a single copy of each resource
     • Compare to multi-core: each core has its own copy of resources
  24. Multi-core: threads can run on separate cores
     [diagram: two full pipelines; Thread 1 on one core, Thread 2 on the other]
  25. Multi-core: threads can run on separate cores
     [diagram: two full pipelines; Thread 3 on one core, Thread 4 on the other]
  26. Combining multi-core and SMT
     • Cores can be SMT-enabled (or not)
     • The different combinations:
       - Single-core, non-SMT: standard uniprocessor
       - Single-core, with SMT
       - Multi-core, non-SMT
       - Multi-core, with SMT: our fish machines
     • The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
     • Intel calls them “hyper-threads”
  27. SMT dual-core: all four threads can run concurrently
     [diagram: two SMT pipelines running threads 1 through 4 concurrently]
  28. Comparison: multi-core vs. SMT
     • Advantages/disadvantages?
  29. Comparison: multi-core vs. SMT
     • Multi-core:
       - Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
       - However, great with thread-level parallelism
     • SMT:
       - Can have one large and fast superscalar core
       - Great performance on a single thread
       - Mostly still only exploits instruction-level parallelism
  30. The memory hierarchy
     • If simultaneous multithreading only:
       - all caches shared
     • Multi-core chips:
       - L1 caches private
       - L2 caches private in some architectures and shared in others
     • Memory is always shared
  31. “Fish” machines
     • Dual-core Intel Xeon processors
     • Each core is hyper-threaded
     • Private L1 caches
     • Shared L2 caches
     [diagram: cores 0 and 1, each with hyper-threads and a private L1 cache, sharing an L2 cache and memory]
  32. Designs with private L2 caches
     [diagram 1: both L1 and L2 are private; examples: AMD Opteron, AMD Athlon, Intel Pentium D]
     [diagram 2: private L1 and L2 plus a shared L3 cache; example: Intel Itanium 2]
  33. Private vs. shared caches?
     • Advantages/disadvantages?
  34. Private vs. shared caches
     • Advantages of private:
       - Closer to the core, so faster access
       - Reduces contention
     • Advantages of shared:
       - Threads on different cores can share the same cache data
       - More cache space is available if a single (or a few) high-performance thread runs on the system
  35. The cache coherence problem
     • Since we have private caches: how do we keep the data consistent across caches?
     • Each core should perceive the memory as a monolithic array, shared by all the cores
  36. The cache coherence problem
     Suppose variable x initially contains 15213.
     [diagram: four cores, each with one or more levels of cache; main memory: x=15213]
  37. The cache coherence problem
     Core 1 reads x.
     [diagram: Core 1’s cache: x=15213; main memory: x=15213]
  38. The cache coherence problem
     Core 2 reads x.
     [diagram: Core 1’s cache: x=15213; Core 2’s cache: x=15213; main memory: x=15213]
  39. The cache coherence problem
     Core 1 writes to x, setting it to 21660 (assuming write-through caches).
     [diagram: Core 1’s cache: x=21660; Core 2’s cache: x=15213; main memory: x=21660]
  40. The cache coherence problem
     Core 2 attempts to read x… and gets a stale copy.
     [diagram: Core 1’s cache: x=21660; Core 2’s cache: x=15213 (stale); main memory: x=21660]
  41. Solutions for cache coherence
     • This is a general problem with multiprocessors, not limited just to multi-core
     • There exist many solution algorithms, coherence protocols, etc.
     • A simple solution: invalidation-based protocol with snooping
  42. Inter-core bus
     [diagram: four cores with caches, connected to main memory via an inter-core bus]
  43. Invalidation protocol with snooping
     • Invalidation: if a core writes to a data item, all other copies of this data item in other caches are invalidated
     • Snooping: all cores continuously “snoop” (monitor) the bus connecting the cores
  44. The cache coherence problem
     Revisited: cores 1 and 2 have both read x.
     [diagram: Core 1’s cache: x=15213; Core 2’s cache: x=15213; main memory: x=15213]
  45. The cache coherence problem
     Core 1 writes to x, setting it to 21660 (assuming write-through caches). It sends an invalidation request over the inter-core bus, and Core 2’s copy is INVALIDATED.
     [diagram: Core 1’s cache: x=21660; Core 2’s cache: invalidated; main memory: x=21660]
  46. The cache coherence problem
     After invalidation:
     [diagram: Core 1’s cache: x=21660; Core 2’s cache: empty; main memory: x=21660]
  47. The cache coherence problem
     Core 2 reads x. Its cache misses, and it loads the new copy.
     [diagram: Core 1’s cache: x=21660; Core 2’s cache: x=21660; main memory: x=21660]
  48. Alternative to the invalidation protocol: the update protocol
     Core 1 writes x=21660 (assuming write-through caches). It broadcasts the updated value over the inter-core bus, and Core 2’s copy is UPDATED.
     [diagram: Core 1’s cache: x=21660; Core 2’s cache: x=21660; main memory: x=21660]
  49. Which do you think is better? Invalidation or update?
  50. Invalidation vs. update
     • Multiple writes to the same location:
       - invalidation: only the first write generates bus traffic
       - update: must broadcast each write (which includes the new variable value)
     • Invalidation generally performs better: it generates less bus traffic
  51. Invalidation protocols
     • This was just the basic invalidation protocol
     • More sophisticated protocols use extra cache state bits
     • MSI, MESI (Modified, Exclusive, Shared, Invalid)
  52. Programming for multi-core
     • Programmers must use threads or processes
     • Spread the workload across multiple cores
     • Write parallel algorithms
     • The OS will map threads/processes to cores
  53. Thread safety is very important
     • Pre-emptive context switching: a context switch can happen AT ANY TIME
     • True concurrency, not just uniprocessor time-slicing
     • Concurrency bugs are exposed much faster with multi-core
  54. However: need to use synchronization even if only time-slicing on a uniprocessor

         int counter = 0;

         void thread1() {
             int temp1 = counter;
             counter = temp1 + 1;
         }

         void thread2() {
             int temp2 = counter;
             counter = temp2 + 1;
         }

  55. Need to use synchronization even if only time-slicing on a uniprocessor

         temp1 = counter;               temp1 = counter;
         counter = temp1 + 1;           temp2 = counter;
         temp2 = counter;               counter = temp1 + 1;
         counter = temp2 + 1;           counter = temp2 + 1;

         gives counter = 2              gives counter = 1
  56. Assigning threads to the cores
     • Each thread/process has an affinity mask
     • The affinity mask specifies which cores the thread is allowed to run on
     • Different threads can have different masks
     • Affinities are inherited across fork()
  57. Affinity masks are bit vectors
     • Example: 4-way multi-core, without SMT
       mask 1101 (bits for core 3, core 2, core 1, core 0)
     • The process/thread is allowed to run on cores 0, 2, and 3, but not on core 1
  58. Affinity masks when multi-core and SMT are combined
     • Separate bits for each simultaneous thread
     • Example: 4-way multi-core, 2 threads per core
       mask 11 00 10 11 (thread 1, thread 0 bit pairs for core 3, core 2, core 1, core 0)
     • Core 2 can’t run the process
     • Core 1 can only use one simultaneous thread
  59. Default affinities
     • The default affinity mask is all 1s: all threads can run on all processors
     • Then the OS scheduler decides which threads run on which core
     • The OS scheduler detects skewed workloads, migrating threads to less busy processors
  60. Process migration is costly
     • Need to restart the execution pipeline
     • Cached data is invalidated
     • The OS scheduler tries to avoid migration as much as possible: it tends to keep a thread on the same core
     • This is called soft affinity
  61. Hard affinities
     • The programmer can prescribe her own affinities (hard affinities)
     • Rule of thumb: use the default scheduler unless there is a good reason not to
  62. When to set your own affinities
     • Two (or more) threads share data structures in memory:
       - map them to the same core so that they can share the cache
     • Real-time threads. Example: a thread running a robot controller:
       - must not be context-switched, or else the robot can go unstable
       - dedicate an entire core just to this thread
     [image source: Sensable.com]
  63. Kernel scheduler API

         #include <sched.h>
         int sched_getaffinity(pid_t pid, unsigned int len, unsigned long *mask);

     • Retrieves the current affinity mask of process ‘pid’ and stores it into the space pointed to by ‘mask’
     • ‘len’ is the system word size: sizeof(unsigned long)
  64. Kernel scheduler API

         #include <sched.h>
         int sched_setaffinity(pid_t pid, unsigned int len, unsigned long *mask);

     • Sets the current affinity mask of process ‘pid’ to *mask
     • ‘len’ is the system word size: sizeof(unsigned long)
     • To query the affinity of a running process:

         [barbic@bonito ~]$ taskset -p 3935
         pid 3935's current affinity mask: f
  65. Windows Task Manager
     [screenshot: per-core CPU usage graphs for core 1 and core 2]
  66. Legal licensing issues
     • Will software vendors charge a separate license for each core, or only a single license per chip?
     • Microsoft, Red Hat Linux, and SuSE Linux will license their OSes per chip, not per core
  67. Conclusion
     • Multi-core chips are an important new trend in computer architecture
     • Several new multi-core chips are in design phases
     • Parallel programming techniques are likely to gain importance
