  • This could be a class question… A study of parallelism in instruction and data streams called for by instructions at the most constrained component of the machine…
  • 1st green bullet: access allowed if it has the correct access rights/permissions
  • Maybe make this a class example?
  • Maybe make this a class example? Shaded parts are different. Maybe that’s the class question… what parts are different?
  • Maybe have the class come up with these… The update protocol must work on individual words; the invalidation protocol can work on cache blocks (2nd item). Bus and memory bandwidth are usually the commodity most in demand, so invalidation is usually the weapon of choice.
  • Intro: a little bit more about communicating b/t the two various types…
  • Turn the last bullet into a class question: what are some other factors that could affect these metrics? Which ones? How?
  • Lecture19-ParallelProcessing.ppt

    1. 1. CS 2200 – Lecture 19 Parallel Processing (Lectures based on the work of Jay Brockman, Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy, Ken MacKenzie, Richard Murphy, and Michael Niemier)
    2. 2. Our Road Map [Roadmap: Processor → Networking → Parallel Systems → I/O Subsystem → Memory Hierarchy]
    3. 3. The Next Step <ul><li>Create more powerful computers simply by interconnecting many small computers </li></ul><ul><ul><li>Should be scalable </li></ul></ul><ul><ul><li>Should be fault tolerant </li></ul></ul><ul><ul><li>More economical </li></ul></ul><ul><li>Multiprocessors </li></ul><ul><ul><li>High throughput running independent tasks </li></ul></ul><ul><li>Parallel Processing </li></ul><ul><ul><li>Single program on multiple processors </li></ul></ul>
    4. 4. Key Questions <ul><li>How do parallel processors share data? </li></ul><ul><li>How do parallel processors communicate? </li></ul><ul><li>How many processors? </li></ul>
    5. 5. Today: Parallelism vs. Parallelism <ul><li>Uni: </li></ul><ul><li>Pipelined </li></ul><ul><li>Superscalar </li></ul><ul><li>VLIW/”EPIC” </li></ul><ul><li>SMP (“Symmetric”) </li></ul><ul><li>Distributed </li></ul>[Diagram: ILP organizations (single processor + memory) vs. TLP organizations (SMP sharing memory over a bus, and distributed processor+memory nodes on a network)]
    6. 6. Flynn’s taxonomy <ul><li>Single instruction stream, single data stream (SISD) </li></ul><ul><ul><li>Essentially, this is a uniprocessor </li></ul></ul><ul><li>Single instruction stream, multiple data streams (SIMD) </li></ul><ul><ul><li>Same instruction executed by multiple processors with different data streams </li></ul></ul><ul><ul><li>Each processor has own data memory, but 1 instruction memory and control processor to fetch/dispatch instructions </li></ul></ul>
    7. 7. Flynn’s Taxonomy <ul><li>Multiple instruction streams, single data streams (MISD) </li></ul><ul><ul><li>Can anyone think of a good application for this machine? </li></ul></ul><ul><li>Multiple instruction streams, multiple data streams (MIMD) </li></ul><ul><ul><li>Each processor fetches its own instructions and operates on its own data </li></ul></ul>
    8. 8. A history… <ul><li>From a parallel perspective, many early processors = SIMD </li></ul><ul><ul><li>In recent past, MIMD most common multiprocessor arch. </li></ul></ul><ul><li>Why MIMD? </li></ul><ul><ul><li>Often MIMD machines made of “off-the-shelf” components </li></ul></ul><ul><ul><ul><li>Usually means flexibility – could be used as a single-user machine or multi-programmed machine </li></ul></ul></ul>
    9. 9. A history… <ul><li>MIMD machines can be further sub-divided </li></ul><ul><ul><li>Centralized shared-memory architectures </li></ul></ul><ul><ul><ul><li>Multiple processors share a single centralized memory and interconnect to it via a bus </li></ul></ul></ul><ul><ul><ul><ul><li>works best with smaller # of processors </li></ul></ul></ul></ul><ul><ul><ul><li>B/c of centralization/uniform access time – sometimes called Uniform Memory Access </li></ul></ul></ul><ul><ul><li>Physically distributed memory </li></ul></ul><ul><ul><ul><li>Almost a must for larger processor counts; else bandwidth a problem </li></ul></ul></ul>
    10. 10. Ok, so we introduced the two kinds of parallel computer architectures that we’re going to talk about. We’ll come back to them soon enough. But 1 st , we’ll talk about why parallel processing is a good thing…
    11. 11. Parallel Computers <ul><li>Definition: “ A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast .” </li></ul><ul><ul><li>Almasi and Gottlieb, Highly Parallel Computing , 1989 </li></ul></ul><ul><li>Questions about parallel computers: </li></ul><ul><ul><li>How large a collection? </li></ul></ul><ul><ul><li>How powerful are processing elements? </li></ul></ul><ul><ul><li>How do they cooperate and communicate? </li></ul></ul><ul><ul><li>How is data transmitted? </li></ul></ul><ul><ul><li>What type of interconnection? </li></ul></ul><ul><ul><li>What are HW and SW primitives for programmer? </li></ul></ul><ul><ul><li>Does it translate into performance? </li></ul></ul>(i.e. things you should have some understanding of after class today)
    12. 12. The Plan <ul><li>Applications (problem space) </li></ul><ul><li>Key hardware issues </li></ul><ul><ul><li>shared memory: how to keep caches coherent </li></ul></ul><ul><ul><li>message passing: low-cost communication </li></ul></ul><ul><ul><ul><li>* See Board (intro. to cache coherency) </li></ul></ul></ul>
    13. 13. Current Practice <ul><li>Some success w/MPPs (Massively Parallel Processors) </li></ul><ul><ul><li>dense matrix scientific computing (Petroleum, Automotive, Aeronautics, Pharmaceuticals) </li></ul></ul><ul><ul><li>file servers, databases, web search engines </li></ul></ul><ul><ul><li>entertainment/graphics </li></ul></ul><ul><li>Small-scale machines: DELL WORKSTATION 530 </li></ul><ul><ul><li>1.7GHz Intel Pentium® IV (in Minitower) </li></ul></ul><ul><ul><li>512 MB RDRAM memory, 40GB disk, 20X CD, 19” monitor, Quadro2Pro Graphics card, RedHat Linux, 3yrs service </li></ul></ul><ul><ul><li>$2,760; for 2nd processor, add $515 </li></ul></ul><ul><ul><li>(Can also chain these together) </li></ul></ul>
    14. 14. Parallel Architecture <ul><li>Parallel Architecture extends traditional computer architecture with a communication architecture </li></ul><ul><ul><li>Programming model (SW view) </li></ul></ul><ul><ul><li>Abstractions (HW/SW interface) </li></ul></ul><ul><ul><li>Implementation to realize abstraction efficiently </li></ul></ul><ul><li>Historically, implementations have been tied to programming models but that is changing. </li></ul>
    15. 15. Parallel Applications <ul><li>Throughput-oriented (want many answers) </li></ul><ul><ul><li>multiprogramming </li></ul></ul><ul><ul><li>databases, web servers </li></ul></ul><ul><li>Latency oriented (want one answer, fast) </li></ul><ul><ul><li>“ Grand Challenge” problems: </li></ul></ul><ul><ul><ul><li>See: http://www.nhse.org/grand_challenge.html </li></ul></ul></ul><ul><ul><ul><li>See: http://www.research.att.com/~dsj/nsflist.html </li></ul></ul></ul><ul><ul><li>global climate model </li></ul></ul><ul><ul><li>human genome </li></ul></ul><ul><ul><li>quantum chromodynamics </li></ul></ul><ul><ul><li>combustion model </li></ul></ul><ul><ul><li>cognition </li></ul></ul>
    16. 16. Programming <ul><li>As contrasted to instruction level parallelism which may be largely ignored by the programmer... </li></ul><ul><li>Writing efficient multiprocessor programs is hard. </li></ul><ul><ul><li>Wizards write programs with sequential interface (e.g. Databases, file servers, CAD) </li></ul></ul><ul><ul><li>Communications overhead becomes a factor </li></ul></ul><ul><ul><li>Requires a lot of knowledge of the hardware!!! </li></ul></ul>
    17. 17. Speedup metric for performance on latency-sensitive applications <ul><li>Time(1) / Time(P) for P processors </li></ul><ul><ul><li>note: must use the best sequential algorithm for Time(1) -- the parallel algorithm may be different. </li></ul></ul>[Plot: speedup vs. # processors, 1–64 on both axes; “linear” speedup is the ideal; typical curves roll off with some # of processors; occasionally see “superlinear”… why?]
    18. 18. Speedup Challenge <ul><li>To get full benefit of parallelism need to be able to parallelize the entire program! </li></ul><ul><li>Amdahl’s Law </li></ul><ul><ul><li>Time_after = (Time_affected / Improvement) + Time_unaffected </li></ul></ul><ul><ul><li>Example: We want 100 times speedup with 100 processors </li></ul></ul><ul><ul><li>Time_unaffected = 0!!! </li></ul></ul><ul><ul><li>(see board notes for this worked out) </li></ul></ul>
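The arithmetic behind this slide can be sketched in a few lines of Python (a sketch for illustration; the function name and the sample fractions are ours, not the slides'):

```python
def amdahl_speedup(parallel_fraction, processors):
    """Overall speedup when only `parallel_fraction` of the runtime
    can be spread across `processors` (Amdahl's Law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)

# Even with 99% of the program parallelized, 100 processors
# give nowhere near 100x:
print(amdahl_speedup(0.99, 100))   # ~50.25
# Only a fully parallel program (Time_unaffected = 0) reaches 100x:
print(amdahl_speedup(1.0, 100))    # 100.0
```

Note how quickly the serial fraction dominates: with even 10% serial code, no number of processors can exceed 10x.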
    19. 19. Hardware: Two Main Variations <ul><li>Shared-Memory </li></ul><ul><ul><li>may be physically shared or only logically shared </li></ul></ul><ul><ul><li>“communication” is implicit in loads and stores </li></ul></ul><ul><li>Message-Passing </li></ul><ul><ul><li>must add explicit communication </li></ul></ul>
    20. 20. Shared-Memory Hardware (1) Hardware and programming model don’t have to match, but this is the mental model for shared-memory programming <ul><li>Memory: centralized with uniform access time (“ UMA ”) and bus interconnect, I/O </li></ul><ul><li>Examples: Dell Workstation 530, Sun Enterprise, SGI Challenge </li></ul><ul><li>typical: </li></ul><ul><li>1 cycle to local cache </li></ul><ul><li>20 cycles to remote cache </li></ul><ul><li>100 cycles to memory </li></ul>[Diagram: two processors sharing one memory over a bus]
    21. 21. Sharing Data (another view) [Diagram: Uniform Memory Access (UMA) / Symmetric Multiprocessor (SMP): several processor + cache pairs sharing a single memory]
    22. 22. Shared-Memory Hardware (2) <ul><li>Variation: memory is not centralized. Called non-uniform access time (“ NUMA ”) </li></ul><ul><li>Shared memory accesses are converted into a messaging protocol (usually by HW) </li></ul><ul><li>Examples: DASH/Alewife/FLASH (academic), SGI Origin, Compaq GS320, Sequent (IBM) NUMA-Q </li></ul>[Diagram: two nodes, each pairing a processor with its own memory]
    23. 23. Sharing Data (another view) [Diagram: Non-Uniform Memory Access (NUMA): four nodes, each with 4 CPUs, a cache, memory, and I/O, connected by channels]
    24. 24. More on distributed memory <ul><li>Distributing memory among processing nodes has 2 pluses: </li></ul><ul><ul><li>It’s a great way to save some bandwidth </li></ul></ul><ul><ul><ul><li>With memory distributed at nodes, most accesses are to local memory within a particular node </li></ul></ul></ul><ul><ul><ul><li>No need for bus communication </li></ul></ul></ul><ul><ul><li>Reduces latency for accesses to local memory </li></ul></ul><ul><li>It also has 1 big minus! </li></ul><ul><ul><li>Have to communicate among various processors </li></ul></ul><ul><ul><ul><li>Leads to a higher latency for inter-node communication </li></ul></ul></ul><ul><ul><ul><li>Also need bandwidth to actually handle communication </li></ul></ul></ul>
    25. 25. Message Passing Model <ul><li>Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations </li></ul><ul><ul><li>Essentially NUMA but integrated at I/O devices instead of at the memory system </li></ul></ul><ul><li>Send specifies local buffer + receiving process on remote computer </li></ul>
    26. 26. Message Passing Model <ul><li>Receive specifies sending process on remote computer + local buffer to place data </li></ul><ul><ul><li>Usually send includes process tag and receive has rule on tag: match 1, match any </li></ul></ul><ul><ul><li>Synch : when send completes, when buffer free, when request accepted, receive waits for send </li></ul></ul><ul><li>Send+receive => memory-memory copy, where each supplies a local address, AND does pairwise synchronization! </li></ul>
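The send/receive pairing above can be sketched with Python threads and queues standing in for nodes and the interconnect (all names here are illustrative, not a real message-passing API):

```python
import threading, queue

# One inbox per "node"; send names the destination, and receive
# blocks until a matching message arrives -- the pairwise
# synchronization described on the slide. Tags let the receiver
# match a specific sender ("match 1") or take anything ("match any").
inboxes = {"A": queue.Queue(), "B": queue.Queue()}

def send(dest, tag, data):
    inboxes[dest].put((tag, data))

def receive(me, want_tag=None):
    while True:
        tag, data = inboxes[me].get()   # blocks: receive waits for send
        if want_tag is None or tag == want_tag:
            return data                 # memory-to-memory copy completes

result = []
def node_b():
    result.append(receive("B", want_tag="from-A"))

t = threading.Thread(target=node_b)
t.start()
send("B", "from-A", [1, 2, 3])          # explicit communication
t.join()
print(result[0])   # [1, 2, 3]
```

A real implementation would also handle non-matching tags and buffer management; this sketch simply drops unmatched messages.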
    27. 27. Two terms: multicomputers vs. multiprocessors
    28. 28. Communicating between nodes <ul><li>One way to communicate b/t processors treats physically separate memories as “1 big memory” </li></ul><ul><ul><li>(i.e. “1 big logically shared address space”) </li></ul></ul><ul><ul><ul><li>Any processor can make a memory reference to any memory location – even if it’s at a different node </li></ul></ul></ul><ul><ul><li>Machines are called “ distributed shared memory ”(DSM) </li></ul></ul><ul><ul><ul><li>Same physical address on two processors refers to the same location in memory </li></ul></ul></ul><ul><li>Another method involves private address spaces </li></ul><ul><ul><li>Memories are logically disjoint; cannot be addressed by a remote processor </li></ul></ul><ul><ul><ul><li>Same physical address on two processors refers to two different locations in memory </li></ul></ul></ul><ul><li>These are “multicomputers” </li></ul>
    29. 29. Multicomputer [Diagram: two nodes (processor + cache A, processor + cache B), each with its own memory, connected by an interconnect]
    30. 30. Multiprocessor [Diagram: “Symmetric” Multiprocessor or SMP – caches A and B sharing a single memory]
    31. 31. But both can have a cache coherency problem… [Diagram: A and B both read X = 0 from memory; A then writes X = 1 in its own cache while B still sees X = 0 – oops!]
    32. 32. Simplest Coherence Strategy: Enforce Exactly One Copy [Diagram: the same read/write sequence on X, but with only one cached copy of X allowed at a time]
    33. 33. Exactly One Copy [Diagram: two-state machine per cache line – INVALID → VALID on a read or write (invalidate other copies); VALID → INVALID on replacement or invalidation; VALID loops on more reads or writes] <ul><li>Maintain a “lock” per cache line </li></ul><ul><li>Invalidate other caches on a read/write </li></ul><ul><li>Easy on a bus: “snoop” bus for transactions </li></ul>
    34. 34. Exactly One Copy <ul><li>Works, but performance is crummy. </li></ul><ul><li>Suppose we all just want to read the same memory location </li></ul><ul><ul><li>one lousy global variable: “n” the size of the problem, written once at the start of the program and read thereafter </li></ul></ul>Permit multiple readers (readers/writer lock per cache line)
    35. 35. Cache consistency (i.e. how do we avoid the previous “protocol”?)
    36. 36. Multiprocessor Cache Coherency <ul><li>Means values in cache and memory are consistent or that we know they are different and can act accordingly </li></ul><ul><li>Considered to be a good thing. </li></ul><ul><li>Becomes more difficult with multiple processors and multiple caches! </li></ul><ul><li>Popular technique: Snooping! </li></ul><ul><ul><li>Write-invalidate </li></ul></ul><ul><ul><li>Write-update </li></ul></ul>
    37. 37. Cache coherence protocols <ul><li>Directory Based: </li></ul><ul><ul><li>Whether or not some physical memory location is shared or not is recorded in 1 central location </li></ul></ul><ul><ul><ul><li>Called “the directory” </li></ul></ul></ul><ul><li>Snooping: </li></ul><ul><ul><li>Every cache w/entries from centralized main memory also has a particular block’s “sharing status” </li></ul></ul><ul><ul><li>No centralized state kept </li></ul></ul><ul><ul><li>Caches connected to shared memory bus </li></ul></ul><ul><ul><ul><li>If there is bus traffic, caches check (or “snoop”) to see if they have the block being transferred on bus </li></ul></ul></ul><ul><ul><li>Main focus of upcoming discussion </li></ul></ul>
    38. 38. Side note: Snoopy Cache [Diagram: cache line holding state, tag, and data, sitting between the CPU and the bus] CPU references check cache tags (as usual); cache misses are filled from memory (as usual); plus: other reads/writes on the bus must check tags, too, and possibly invalidate
    39. 39. Maintaining the coherence requirement <ul><li>1 way – make sure processor has exclusive access to a data word before it’s written </li></ul><ul><ul><li>Called the “ write invalidate protocol ” </li></ul></ul><ul><ul><ul><li>Will actually invalidate other copies of the data word on a write </li></ul></ul></ul><ul><ul><ul><li>Most common for both snooping and directory schemes </li></ul></ul></ul>
    40. 40. Maintaining the coherence requirement <ul><li>What if 2 processors try to write at the same time? </li></ul><ul><ul><li>The short answer: one of them will get “permission” to write first: </li></ul></ul><ul><ul><ul><li>The other’s copy will be invalidated, </li></ul></ul></ul><ul><ul><ul><li>Then it’ll get a new copy of the data with updated value </li></ul></ul></ul><ul><ul><ul><li>Then it can get permission and write… </li></ul></ul></ul><ul><ul><li>Probably more on “how” later, but briefly… </li></ul></ul><ul><ul><ul><li>Caches snoop on the bus, so they’ll detect a “request to write”; so whichever machine gets to the bus 1st goes 1st </li></ul></ul></ul>
    41. 41. Write invalidate example <ul><li>Assumes neither cache had value/location X in it 1st </li></ul><ul><li>When the 2nd miss by B occurs, CPU A responds with the value, canceling the response from memory. </li></ul><ul><li>B’s cache and the memory contents of X are both updated </li></ul><ul><li>Typical and simple… </li></ul>
    | Processor activity    | Bus activity       | Contents of CPU A’s cache | Contents of CPU B’s cache | Contents of memory location X |
    | CPU A reads X         | Cache miss for X   | 0                         |                           | 0                             |
    | CPU B reads X         | Cache miss for X   | 0                         | 0                         | 0                             |
    | CPU A writes a 1 to X | Invalidation for X | 1                         |                           | 0                             |
    | CPU B reads X         | Cache miss for X   | 1                         | 1                         | 1                             |
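The sequence on this slide can be replayed with a toy simulation (ours, not the slides'), assuming write-back caches where a dirty owner supplies data on a miss:

```python
# Toy write-invalidate simulation: two write-back caches over one memory.
memory = {"X": 0}
caches = {"A": {}, "B": {}}
dirty = {"A": set(), "B": set()}

def read(cpu, addr):
    if addr not in caches[cpu]:                  # cache miss: snoop the bus
        for other in caches:
            if other != cpu and addr in dirty[other]:
                # dirty owner responds with the value, canceling the
                # response from memory; memory is updated as well
                memory[addr] = caches[other][addr]
                dirty[other].discard(addr)
        caches[cpu][addr] = memory[addr]
    return caches[cpu][addr]

def write(cpu, addr, value):
    for other in caches:                         # invalidate other copies
        if other != cpu:
            caches[other].pop(addr, None)
            dirty[other].discard(addr)
    caches[cpu][addr] = value                    # write-back: memory stale
    dirty[cpu].add(addr)

read("A", "X")            # A caches 0
read("B", "X")            # B caches 0
write("A", "X", 1)        # B's copy invalidated; memory still holds 0
print(memory["X"])        # 0
print(read("B", "X"))     # 1  (A supplies the value; memory updated)
print(memory["X"])        # 1
```

The final three lines reproduce the last row of the table: B's miss is serviced by A's dirty copy, and both B's cache and memory end up holding 1.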
    42. 42. Maintaining the cache coherency requirement <ul><li>Alternative to write invalidate – update all cached copies of a data item when the item is written … </li></ul><ul><ul><li>Called a “ write update/broadcast protocol ” </li></ul></ul><ul><ul><li>One problem – bandwidth issues could quickly get out of hand </li></ul></ul><ul><ul><ul><li>Solution: track whether or not a word in the cache is shared (i.e. contained in another cache) </li></ul></ul></ul><ul><ul><ul><li>If the word is not shared, there’s no need to broadcast on a write… </li></ul></ul></ul>
    43. 43. Write update example <ul><li>Assumes neither cache had value/location X in it 1st </li></ul><ul><li>CPU and memory contents show value after processor and bus activity both completed </li></ul><ul><li>When CPU A broadcasts the write, cache in CPU B and memory location X are updated </li></ul>(Shaded parts are different than before)
    | Processor activity    | Bus activity         | Contents of CPU A’s cache | Contents of CPU B’s cache | Contents of memory location X |
    | CPU A reads X         | Cache miss for X     | 0                         |                           | 0                             |
    | CPU B reads X         | Cache miss for X     | 0                         | 0                         | 0                             |
    | CPU A writes a 1 to X | Write broadcast of X | 1                         | 1                         | 1                             |
    | CPU B reads X         |                      | 1                         | 1                         | 1                             |
    44. 44. Comparing write update/write invalidate <ul><li>What if there are multiple writes and no intermediate reads to the same word? </li></ul><ul><ul><li>With update protocol , multiple write broadcasts required </li></ul></ul><ul><ul><li>With invalidation protocol , only one invalidation </li></ul></ul><ul><li>Writing to multiword cache blocks </li></ul><ul><ul><li>With update protocol , each word written in a cache block requires a write broadcast </li></ul></ul><ul><ul><li>With invalidation protocol , only the 1st write to any word needs to generate an invalidate </li></ul></ul>
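The first comparison above can be made concrete with a toy count of bus transactions (illustrative only; it assumes another cache initially holds the block):

```python
# Bus traffic for N consecutive writes to the same word by one
# processor, with no intervening reads from other processors.
def bus_transactions(protocol, n_writes):
    traffic = 0
    others_have_copy = True          # assume another cache holds the block
    for _ in range(n_writes):
        if protocol == "update":
            traffic += 1             # every write must be broadcast
        elif protocol == "invalidate" and others_have_copy:
            traffic += 1             # only the first write invalidates
            others_have_copy = False # after that, the writer owns the block
    return traffic

print(bus_transactions("update", 10))      # 10 broadcasts
print(bus_transactions("invalidate", 10))  # 1 invalidation
```

This is the bandwidth argument from the speaker notes: since bus and memory bandwidth are usually the commodity most in demand, invalidation is usually the weapon of choice.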
    45. 45. Comparing write update/write invalidate <ul><li>What about delays between writing and reading? </li></ul><ul><ul><li>With update protocol delay b/t writing a word on one processor and reading a word in another usually less </li></ul></ul><ul><ul><ul><li>Written data is immediately updated in reader’s cache </li></ul></ul></ul><ul><ul><li>With invalidation protocol , the reader is invalidated and must later re-read, stalling </li></ul></ul>
    46. 46. See example…
    47. 47. Messages vs. Shared Memory? <ul><li>Shared Memory </li></ul><ul><ul><li>As a programming model , shared memory is considered “easier” </li></ul></ul><ul><ul><li>automatic caching is good for dynamic/irregular problems </li></ul></ul><ul><li>Message Passing </li></ul><ul><ul><li>As a programming model , messages are the most portable </li></ul></ul><ul><ul><li>Right Thing for static/regular problems </li></ul></ul><ul><ul><li>BW ++, latency --, no concept of caching </li></ul></ul><ul><li>Model == implementation? </li></ul><ul><ul><li>not necessarily... </li></ul></ul>
    48. 48. More on address spaces… (i.e. 1 shared memory vs. distributed, multiple memories)
    49. 49. Communicating between nodes <ul><li>In a shared address space… </li></ul><ul><ul><li>Data could be implicitly transferred with just a load or a store instruction </li></ul></ul><ul><ul><ul><li>Ex. Machine X executes Load $5, 0($4). 0($4) actually stored in the memory of Machine Y. </li></ul></ul></ul>
    50. 50. Communicating between nodes <ul><li>With private/multiple address spaces… </li></ul><ul><ul><li>Communication of data done by explicitly passing messages among processors </li></ul></ul><ul><ul><ul><li>Usually based on Remote Procedure Call (RPC) protocol </li></ul></ul></ul><ul><ul><ul><li>Is a synchronous transfer – i.e. requesting machine waits for a reply before continuing </li></ul></ul></ul><ul><ul><ul><li>This is OS stuff – no more detail here </li></ul></ul></ul><ul><ul><li>Could also have the “writer” initiate data transfers… </li></ul></ul><ul><ul><ul><li>Done in hopes that a node will be a “soon to be” consumer </li></ul></ul></ul><ul><ul><ul><li>Often done asynchronously; sender process can continue right away </li></ul></ul></ul>
    51. 51. Performance metrics <ul><li>3 performance metrics critical for communication: </li></ul><ul><li>(1) Communication bandwidth: </li></ul><ul><ul><li>Usually limited by processor, memory, and interconnection bandwidths </li></ul></ul><ul><ul><ul><li>Not by some aspect of communication mechanism </li></ul></ul></ul><ul><ul><li>Often “occupancy” can be a limiting factor. </li></ul></ul><ul><ul><ul><li>When communication occurs, resources w/in nodes are tied up or “occupied” – prevents other outgoing communication </li></ul></ul></ul><ul><ul><ul><li>If occupancy incurred for each word of a message, sets a limit on communication bandwidth </li></ul></ul></ul><ul><ul><ul><ul><li>(often lower than what network or memory system can provide) </li></ul></ul></ul></ul>
    52. 52. Performance metrics <ul><li>(2) Communication latency: </li></ul><ul><ul><li>Latency includes </li></ul></ul><ul><ul><ul><li>Transport latency (function of interconnection network) </li></ul></ul></ul><ul><ul><ul><li>SW/HW overheads (from sending/receiving messages) </li></ul></ul></ul><ul><ul><ul><ul><li>Largely determined by communication mechanism and its implementation </li></ul></ul></ul></ul><ul><ul><li>Latency must be hidden!!! </li></ul></ul><ul><ul><ul><li>Else, processor might just spend lots of time waiting for messages… </li></ul></ul></ul>
    53. 53. Performance metrics <ul><li>(3) Hiding communication latency: </li></ul><ul><ul><li>Ideally we want to mask latency of waiting for communication, etc. </li></ul></ul><ul><ul><ul><li>This might be done by overlapping communication with other, independent computations </li></ul></ul></ul><ul><ul><ul><li>Or maybe 2 independent messages could be sent at once? </li></ul></ul></ul><ul><ul><li>Quantifying how well a multiprocessor configuration can do this is “this metric” </li></ul></ul><ul><ul><li>Often this burden is placed to some degree on the SW and the programmer </li></ul></ul><ul><ul><li>Also, this metric is heavily application dependent </li></ul></ul>
    54. 54. Performance metrics <ul><li>All of these metrics are actually affected by application type, data sizes, communication patterns, etc. </li></ul>
    55. 55. Advantages and disadvantages <ul><li>What’s good about shared memory? What’s bad about it? </li></ul><ul><li>What’s good about message-passing? What’s bad about it? </li></ul><ul><ul><li>Note: message passing implies distributed memory </li></ul></ul>
    56. 56. Advantages and disadvantages <ul><li>Shared memory – good: </li></ul><ul><ul><li>Compatibility with well-understood mechanisms in use in centralized multiprocessors – which used shared memory </li></ul></ul><ul><ul><li>It’s easy to program </li></ul></ul><ul><ul><ul><li>Especially if communication patterns are complex </li></ul></ul></ul><ul><ul><ul><ul><li>Easier just to do a load/store operation and not worry about where the data might be (i.e. on another node with DSM) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>But, you also take a big time performance hit… </li></ul></ul></ul></ul><ul><ul><li>Smaller messages are more efficient w/shared memory </li></ul></ul><ul><ul><ul><li>Might communicate via memory mapping instead of going through OS </li></ul></ul></ul><ul><ul><ul><ul><li>(like we’d have to do for a remote procedure call) </li></ul></ul></ul></ul>
    57. 57. Advantages and disadvantages <ul><li>Shared memory – good (continued): </li></ul><ul><ul><li>Caching can be controlled by the hardware </li></ul></ul><ul><ul><ul><li>Reduces the frequency of remote communication by supporting automatic caching of all data </li></ul></ul></ul><ul><li>Message-passing – good: </li></ul><ul><ul><li>The HW is lots simpler </li></ul></ul><ul><ul><ul><li>Especially by comparison with a scalable shared-memory implementation that supports coherent caching of data </li></ul></ul></ul><ul><ul><li>Communication is explicit </li></ul></ul><ul><ul><ul><li>Forces programmers/compiler writers to think about it and make it efficient </li></ul></ul></ul><ul><ul><ul><li>This could be a bad thing too FYI… </li></ul></ul></ul>
    58. 58. More detail on cache coherency protocols with some examples…
    59. 59. More on centralized shared memory <ul><li>It’s worth studying the various ramifications of a centralized shared memory machine </li></ul><ul><ul><li>(and there are lots of them) </li></ul></ul><ul><ul><li>Later we’ll look at distributed shared memory… </li></ul></ul><ul><li>When studying memory hierarchies we saw… </li></ul><ul><ul><li>…cache structures can substantially reduce memory bandwidth demands of a processor </li></ul></ul><ul><ul><ul><li>Multiple processors may be able to share the same memory </li></ul></ul></ul>
    60. 60. More on centralized shared memory <ul><li>Centralized shared memory supports private/shared data </li></ul><ul><ul><li>If 1 processor in a multiprocessor network operates on private data, caching, etc. are handled just as in uniprocessors </li></ul></ul><ul><ul><li>But if shared data is cached there can be multiple copies and multiple updates </li></ul></ul><ul><ul><ul><li>Good b/c it reduces required memory bandwidth; bad because we now must worry about cache coherence </li></ul></ul></ul>
    61. 61. Cache coherence – why it’s a problem <ul><li>Assumes that neither cache had value/location X in it 1st </li></ul><ul><li>Both a write-through cache and a write-back cache will encounter this problem </li></ul><ul><li>If B reads the value of X after Time 3, it will get 1 which is the wrong value! </li></ul>
    | Time | Event                 | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X |
    | 0    |                       |                          |                          | 1                              |
    | 1    | CPU A reads X         | 1                        |                          | 1                              |
    | 2    | CPU B reads X         | 1                        | 1                        | 1                              |
    | 3    | CPU A stores 0 into X | 0                        | 1                        | 0                              |
    62. 62. Coherence in shared memory programs <ul><li>Must have coherence and consistency </li></ul><ul><li>Memory system coherent if: </li></ul><ul><ul><li>Program order preserved (always true in uniprocessor) </li></ul></ul><ul><ul><ul><li>Say we have a read by processor P of location X </li></ul></ul></ul><ul><ul><ul><li>Before the read processor P wrote something to location X </li></ul></ul></ul><ul><ul><ul><li>In the interim, no other processor has written to X </li></ul></ul></ul><ul><ul><ul><li>A read to X should always return the value written by P </li></ul></ul></ul><ul><ul><li>A coherent view of memory is provided </li></ul></ul><ul><ul><ul><li>1st, processor A writes something to memory location X </li></ul></ul></ul><ul><ul><ul><li>Then, processor B tries to read from memory location X </li></ul></ul></ul><ul><ul><ul><li>Processor B should get the value written by processor A assuming… </li></ul></ul></ul><ul><ul><ul><ul><li>Enough time has passed b/t the two events </li></ul></ul></ul></ul><ul><ul><ul><ul><li>No other writes to X have occurred in the interim </li></ul></ul></ul></ul>
    63. 63. Coherence in shared memory programs (continued) <ul><li>Memory system coherent if: (continued) </li></ul><ul><ul><li>Writes to same location are serialized </li></ul></ul><ul><ul><ul><li>Two writes to the same location by any two processors are seen in the same order by all processors </li></ul></ul></ul><ul><ul><ul><li>Ex. Values of A and B are written to memory location X </li></ul></ul></ul><ul><ul><ul><ul><li>Processors can’t read the value of B and then later read the value of A </li></ul></ul></ul></ul><ul><ul><ul><li>If writes not serialized… </li></ul></ul></ul><ul><ul><ul><ul><li>One processor might see the write of processor P2 to location X 1st </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Then, it might later see a write to location X by processor P1 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>(P1 actually wrote X before P2) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Value of P1 could be maintained indefinitely even though it was overwritten </li></ul></ul></ul></ul>
    64. 64. Coherence/consistency <ul><li>Coherence and consistency are complementary </li></ul><ul><ul><li>Coherence defines actions of reads and writes to same memory location </li></ul></ul><ul><ul><li>Consistency defines actions of reads and writes with regard to accesses of other memory locations </li></ul></ul><ul><li>Assumption for the following discussion: </li></ul><ul><ul><li>Write does not complete until all processors have seen effect of write </li></ul></ul><ul><ul><li>Processor does not change order of any write with any other memory accesses </li></ul></ul><ul><ul><ul><li>Not exactly the case for either one really…but more later… </li></ul></ul></ul>
    65. 65. Caches in coherent multiprocessors <ul><li>In multiprocessors, caches at individual nodes help w/ performance </li></ul><ul><ul><li>Usually by providing properties of “migration” and “replication” </li></ul></ul><ul><ul><li>Migration: </li></ul></ul><ul><ul><ul><li>Instead of going to centralized memory for each reference, data word will “migrate” to a cache at a node </li></ul></ul></ul><ul><ul><ul><li>Reduces latency </li></ul></ul></ul><ul><ul><li>Replication: </li></ul></ul><ul><ul><ul><li>If data simultaneously read by two different nodes, copy is made at each node </li></ul></ul></ul><ul><ul><ul><li>Reduces access latency and contention for shared item </li></ul></ul></ul><ul><li>Supporting these require cache coherence protocols </li></ul><ul><ul><li>Really, we need to keep track of shared blocks… </li></ul></ul>
    66. 66. Detail about snooping
    67. 67. Implementing protocols <ul><li>We’ll focus on the invalidation protocol … </li></ul><ul><ul><li>And start with a generic template for invalidation… </li></ul></ul><ul><li>To perform an invalidate… </li></ul><ul><ul><li>Processor must acquire bus access </li></ul></ul><ul><ul><li>Broadcast the address to be invalidated on the bus </li></ul></ul><ul><ul><li>Processors connected to bus “snoop” on addresses </li></ul></ul><ul><ul><li>If address on bus is in processor’s cache, data invalidated </li></ul></ul><ul><ul><ul><li>Serialization of accesses enforces serialization of writes… </li></ul></ul></ul><ul><ul><ul><li>When 2 processors compete to write to the same location, 1 gets access to the bus 1 st </li></ul></ul></ul>
    68. 68. It’s not THAT easy though… <ul><li>What happens on a cache miss? </li></ul><ul><ul><li>With a write through cache, no problem </li></ul></ul><ul><ul><ul><li>Data is always in main memory </li></ul></ul></ul><ul><ul><ul><li>In shared memory machine, every cache write would go back to main memory – bad, bad, bad for bandwidth! </li></ul></ul></ul><ul><ul><li>What about write back caches though? </li></ul></ul><ul><ul><ul><li>Much harder. </li></ul></ul></ul><ul><ul><ul><li>Most recent value of data could be in a cache instead of memory </li></ul></ul></ul><ul><li>How to handle write back caches? </li></ul><ul><ul><li>Snoop. </li></ul></ul><ul><ul><li>Each processor snoops every address placed on the bus </li></ul></ul><ul><ul><li>If a processor has a dirty copy of requested cache block, it responds to read request, and memory request is cancelled </li></ul></ul>
    69. 69. Specifics of snooping <ul><li>Normal cache tags can be used </li></ul><ul><li>Existing valid bit makes it easy to invalidate </li></ul><ul><li>What about read misses? </li></ul><ul><ul><li>Easy to handle too; rely on snooping capability </li></ul></ul><ul><li>What about writes? </li></ul><ul><ul><li>We’d like to know if any other copies of the block are cached </li></ul></ul><ul><ul><ul><li>If they’re NOT, we can save bus bandwidth </li></ul></ul></ul><ul><ul><li>Can add extra bit of state to solve this problem – state bit… </li></ul></ul><ul><ul><ul><li>Tells us if block is shared, if we must generate an invalidate </li></ul></ul></ul><ul><ul><ul><li>When write to a block in shared state happens, cache generates invalidation and marks block as “private” </li></ul></ul></ul><ul><ul><ul><li>No other invalidations sent by that processor for that block… </li></ul></ul></ul>
    70. 70. Specifics of snooping <ul><li>When invalidation sent, state of owner’s (processor with sole copy of cache block) cache block is changed from shared to unshared (or exclusive) </li></ul><ul><ul><li>If another processor later requests cache block, state must be made shared again </li></ul></ul><ul><ul><li>Snooping cache also sees any misses </li></ul></ul><ul><ul><ul><li>Knows when exclusive cache block has been requested by another processor and state should be made shared </li></ul></ul></ul>
    71. 71. Specifics of snooping <ul><li>More overhead… </li></ul><ul><ul><li>Every bus transaction would have to check cache-addr. tags </li></ul></ul><ul><ul><ul><li>Could easily overwhelm normal CPU cache accesses </li></ul></ul></ul><ul><ul><li>Solutions: </li></ul></ul><ul><ul><ul><li>Duplicate the tags – snooping/CPU accesses can go on in parallel </li></ul></ul></ul><ul><ul><ul><li>Employ a multi-level cache with inclusion </li></ul></ul></ul><ul><ul><ul><ul><li>Everything in the L1 cache also in L2; snooping checks L2, CPU L1 </li></ul></ul></ul></ul>
    72. 72. An example protocol <ul><li>Bus-based protocol usually implemented with a finite state machine controller in each node </li></ul><ul><ul><li>Controller responds to requests from processor & bus </li></ul></ul><ul><ul><ul><li>Changes the state of the selected cache block and uses the bus to access data or invalidate it </li></ul></ul></ul><ul><li>An example protocol (which we’ll go through an example of) </li></ul>
    | Request    | Source    | Function                                                           |
    | Read hit   | Processor | Read data in cache                                                 |
    | Write hit  | Processor | Write data in cache                                                |
    | Read miss  | Bus       | Request data from cache or memory                                  |
    | Write miss | Bus       | Request data from cache or memory; perform any needed invalidates  |
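Such a controller is often described as a per-block finite state machine. Below is a minimal sketch of a simplified MSI-style machine (the state and event names are the common textbook ones, introduced here for illustration; the slide's protocol may differ in detail):

```python
# Simplified MSI cache-block state machine for a snooping controller.
INVALID, SHARED, MODIFIED = "I", "S", "M"

def next_state(state, event):
    """event is one of: 'proc_read', 'proc_write',
    'bus_read_miss', 'bus_write_miss' (the last two are snooped)."""
    if event == "proc_read":
        return SHARED if state == INVALID else state  # fill on miss
    if event == "proc_write":
        return MODIFIED                    # gain exclusive ownership
    if event == "bus_read_miss":
        # another CPU wants the block: downgrade and supply data
        return SHARED if state == MODIFIED else state
    if event == "bus_write_miss":
        return INVALID                     # another writer invalidates us
    raise ValueError(event)

s = INVALID
s = next_state(s, "proc_read")      # I -> S
s = next_state(s, "proc_write")     # S -> M (invalidate sent on bus)
s = next_state(s, "bus_read_miss")  # M -> S (another CPU reads the block)
print(s)   # S
```

The "shared vs. exclusive" state bit discussed a few slides back corresponds to the S/M distinction here: a write from S generates one invalidation, and subsequent writes from M generate none.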