CPE 806: ADVANCED COMPUTER ARCHITECTURE
1
MULTIPROCESSORS
Multiprocessing: Flynn’s Taxonomy
2
• Flynn’s Taxonomy of Parallel Machines
– How many Instruction (I) streams?
– How many Data (D) streams?
• Flynn classified architectures in terms of streams of data
and instructions:
– Stream of Instructions (SI): sequence of instructions executed by
the computer.
– Stream of Data (SD): sequence of data including input, temporary
or partial results referenced by instructions.
Flynn’s Taxonomy….
• Computer architectures are characterized by the
multiplicity of hardware to serve instruction and data
streams.
1. Single Instruction Single Data (SISD)
2. Single Instruction Multiple Data (SIMD)
3. Multiple Instruction Multiple Data (MIMD)
4. Multiple Instruction Single Data (MISD)
3
Flynn’s Taxonomy: SISD
• SISD: Single I Stream, Single D Stream
– A uniprocessor von Neumann computer
4
[Figure: SISD organization: a control unit sends a single instruction stream to one processor (P), which exchanges a single data stream with memory (M) and I/O.]
SIMD
• SIMD: Single I, Multiple D Streams
– Each “processor” works on its own data
– But all execute the same instructions in lockstep
– E.g. a vector processor or MMX
• Consists of 2 parts
– A front-end Von Neumann computer
– A processor array: connected to the memory bus of the front end
5
[Figure: SIMD organization: a control unit broadcasts one instruction stream (the program loaded from the front end) to processors P1..Pn, each operating on its own data stream from its memory M1..Mn, with data loaded from the front end.]
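To make the lockstep idea concrete, here is a minimal sketch (using NumPy, an assumption not made in the slides) that contrasts a scalar, SISD-style loop with a SIMD-style vectorized operation in which one add is applied to many data elements at once:

```python
import numpy as np

# SISD style: one instruction stream operates on one data element at a time.
def scalar_add(a, b):
    out = [0.0] * len(a)
    for i in range(len(a)):       # each iteration issues its own add
        out[i] = a[i] + b[i]
    return out

# SIMD style: a single vector add is applied to all elements in lockstep
# (NumPy dispatches to vectorized machine code for the whole array).
a = np.arange(8, dtype=np.float64)
b = np.ones(8, dtype=np.float64)
print(scalar_add(a, b))           # element-by-element
print(a + b)                      # one "instruction", many data elements
```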
SIMD Architecture
6
[Figure: two SIMD organizations, each with a control unit, processors P1..Pn, memories M1..Mn, and an interconnection network.]
Scheme 1: Each processor has its own local memory.
Scheme 2: Processor and memory modules communicate with each other via the interconnection network.
SIMD: Shared Memory and Not Shared
7
[Figure: SIMD with shared memory (processors P1..Pn access a shared memory through an interconnection network under one control unit) versus SIMD without shared memory (each Pi paired with its own Mi, program and data loaded from the front end).]
MIMD
• MIMD: Multiple I, Multiple D Streams
– Made of multiple processors and multiple memory modules connected together via
some interconnection network
– Each processor executes its own instructions and operates on its own data
– This is your typical off-the-shelf multiprocessor
(made using a bunch of “normal” processors)
– Includes multi-core processors
• 2 broad classes:
– Shared memory
– Message passing
8
[Figure: MIMD organization: each control unit i issues its own instruction stream to processor Pi, which operates on its own data stream from memory Mi.]
MIMD Multiprocessors
9
[Figure: centralized shared-memory and distributed-memory MIMD organizations.]
• Multiprocessors → computers consisting of tightly coupled processors whose coordination and
usage are typically controlled by a single operating system and that share memory through a
shared address space.
• Such systems exploit thread-level parallelism through two different software models:
– Parallel processing
– Request-level processing
Flynn’s Taxonomy: MISD
• MISD: Multiple I, Single D Stream
– No commercial processor has been built with this organization.
10
Introduction
• Goal: connecting multiple computers to get higher
performance
– Multiprocessors
– Scalability, availability, power efficiency
• Job-level (process-level) parallelism
– High throughput for independent jobs
• Parallel processing program
– Single program run on multiple processors
• Multicore microprocessors
– Chips with multiple processors (cores)
11
Multiprocessors
• Why do we need multiprocessors?
– Uniprocessor speed keeps improving
• There are limits to how far ILP can be increased
– But there are things that need even more speed
• Wait for a few years for Moore’s law to catch up?
• Or use multiple processors and do it now?
– Need for more computing power
• Data intensive applications
• Utility computing requires powerful processors
• Multiprocessor software problem
– Most code is sequential (for uniprocessors)
• Much easier to write and debug
– Parallel code required for effective and efficient utilization of all cores
• But correct parallel code is very, very difficult to write
– Efficient and correct is even harder
– Debugging even more difficult (Heisenbugs)
12
Multiprocessors
• The main argument for using multiprocessors is to
create powerful computers by simply connecting
multiple processors.
– A multiprocessor is expected to reach faster speed than
the fastest single-processor system.
– More cost-effective.
• A multiprocessor consisting of a number of single processors is
expected to be more cost-effective than building a high-performance
single processor.
– Fault tolerance.
• If a processor fails, the remaining processors should be able to
provide continued service, albeit with degraded performance.
13
Two Models for Communication and
Memory Architecture
1. Communication occurs by explicitly passing messages among the
processors:
– message-passing multiprocessors
2. Communication occurs through a shared address space (via loads
and stores):
– shared memory multiprocessors either
• UMA (Uniform Memory Access time) for shared address, centralized memory MP
• NUMA (Non Uniform Memory Access time multiprocessor) for shared address,
distributed memory MP
• In the past, there was confusion about whether “sharing” means sharing physical
memory (symmetric MP) or sharing the address space
14
Symmetric Shared-Memory Architectures
• From multiple boards on a shared bus to multiple
processors inside a single chip
• Caches hold both
– Private data, used by a single processor
– Shared data, used by multiple processors
15
Important ideas
• Technology drives the solutions.
– Multi-cores have altered the game!!
– Thread-level parallelism (TLP) vs ILP.
• Computing and communication deeply intertwined.
– Write serialization exploits broadcast communication on the
interconnection network or the bus connecting L1, L2, and L3 caches for
cache coherence.
• Access to data located at the fastest memory level greatly
improves the performance.
• Caches are critical for performance but create new problems
– Cache coherence protocols:
1. Cache snooping → traditional multiprocessors
2. Directory based → multi-core processors
16
Review of basic concepts
• Cache → smaller, faster memory which stores copies of the data from frequently
used main memory locations.
• Cache writing policies
– write-through → every write to the cache causes a write to main memory.
– write-back → writes are not immediately mirrored to main memory.
• Locations written are marked dirty and written back to main memory only when that data is evicted from the cache.
• A read miss may require two memory accesses: write the dirty location to memory and read the new location from memory.
• Caches are organized in blocks or cache lines.
• Cache blocks consist of
– Tag → contains (part of) the address of the actual data fetched from main memory
– Data block
– Flags → dirty bit, shared bit, …
• Broadcast networks → all nodes share a communication medium and hear all
messages transmitted, e.g., a bus.
17
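To tie the write-back policy, dirty bit, and write-back-on-eviction together, here is a toy one-block sketch (illustrative only; the class and variable names are invented for this example):

```python
# Minimal write-back cache model for a single block, illustrating the
# dirty bit and write-back on eviction described above (illustrative only).
class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory          # dict: address -> value
        self.addr = None              # address currently cached
        self.value = None
        self.dirty = False

    def _evict(self):
        if self.addr is not None and self.dirty:
            self.memory[self.addr] = self.value   # write back only if dirty
        self.addr, self.value, self.dirty = None, None, False

    def read(self, addr):
        if self.addr != addr:                     # miss: may need two accesses
            self._evict()
            self.addr, self.value = addr, self.memory[addr]
        return self.value

    def write(self, addr, value):
        if self.addr != addr:
            self._evict()
            self.addr = addr
        self.value, self.dirty = value, True      # not mirrored to memory yet

memory = {0: 10, 1: 20}
c = WriteBackCache(memory)
c.write(0, 99)
print(memory[0])   # still 10: the write has not reached memory yet
c.read(1)          # evicts block 0, forcing the write-back
print(memory[0])   # now 99
```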
Cache Coherence and Consistency
• Coherence
– Reads by any processor must return the most recently written value
– Writes to the same location by any two processors are seen in the same
order by all processors
– Coherence defines the behaviour of reads and writes to the same location.
• Consistency
– A read returns the last value written
– If a processor writes location A followed by location B, any processor
that sees the new value of B must also see the new value of A
– Consistency defines behaviour of reads and writes to other locations
18
Thread-level parallelism (TLP)
• Distribute the workload among a set of concurrently running threads.
• Uses the MIMD model → multiple program counters
• Targeted for tightly-coupled shared-memory multiprocessors
• To be effective need n threads for n processors.
• Amount of computation assigned to each thread = grain size
– Threads can be used for data-level parallelism, but the overheads may outweigh
the benefit
• Speedup
– Maximum speedup with n processors is n; embarrassingly parallel
– The actual speedup depends on the ratio of parallel versus sequential portion of a
program according to Amdahl’s law.
19
TLP and ILP
• The costs for exploiting ILP are prohibitive in terms of silicon
area and of power consumption.
• Multicore processors have altered the game
– Shifted the burden for keeping the processor busy from the hardware
and architects to application developers and programmers.
– Shift from ILP to TLP
• Large-scale multiprocessors are not a large market; they have
been replaced by clusters of multicore systems.
20
Multi-core processors
• Cores are now the building blocks of chips.
• Intel offers a family of processors based on the Nehalem
architecture with a different number of cores and L3
caches
21
MIMD Multiprocessors
• Centralized Shared Memory
• Distributed Memory
22
Centralized-Memory Machines
• Also called “Symmetric Multiprocessors” (SMP) or
“Uniform Memory Access” (UMA) machines
– All memory locations have similar latencies
– Data sharing through memory reads/writes
– P1 can write data to a physical address A,
P2 can then read physical address A to get that data
• Caching data
– reduces the access time but demands cache coherence
• Two distinct data states
– Global state → defined by the data in main memory
– Local state → defined by the data in local caches
• In multi-core processors the L3 cache is shared; the L1 and
L2 caches are private
• Problem: memory contention
– All processors share the one memory
– Memory bandwidth becomes the bottleneck
– Used only for smaller machines
• Most often 2, 4, or 8 processors
23
[Figure: basic structure of a centralized shared-memory multiprocessor: processors, each with one or more levels of private cache, share a cache, main memory, and the I/O system over a common interconnect.]
Shared Memory Pros and Cons
• Pros
– Communication happens automatically
– More natural way of programming
• Easier to write correct programs and gradually optimize them
– No need to manually distribute data
(but can help if you do)
• Cons
– Needs more hardware support
– Easy to write correct, but inefficient programs
(remote accesses look the same as local ones)
24
MIMD: Distributed-Memory Machines
• Two kinds
– Distributed Shared-Memory (DSM)
• All processors can address all memory
locations
• Data sharing like in SMP
• Also called NUMA (non-uniform
memory access)
• Latencies of different memory locations
can differ
(local access faster than remote access)
– Message-Passing
• A processor can directly address only
local memory
• To communicate with other processors,
must explicitly send/receive messages
• Also called multicomputers or clusters
• Most accesses local, so less
memory contention (can scale to
well over 1000 processors)
25
[Figure: basic architecture of a distributed-memory multiprocessor: nodes consisting of a multicore processor plus caches, local memory, and I/O, connected by an interconnection network.]
Distributed Shared-Memory Multiprocessor…
• Two major benefits:
– It is a cost-effective way to scale the memory bandwidth
if most of the accesses are to local memory in the node.
– It reduces the latency for accesses to the local memory.
• Two key disadvantages:
– Communicating data between processors becomes more
complex.
– It requires more effort in the software to take advantage
of the increased memory bandwidth afforded by
distributed memories
26
Message-Passing Machines
• A cluster of computers
– Each with its own processor and memory
– An interconnect to pass messages between them
– Producer-Consumer Scenario:
• P1 produces data D, uses a SEND to send it to P2
• The network routes the message to P2
• P2 then calls a RECEIVE to get the message
– Two types of send primitives
• Synchronous: P1 stops until P2 confirms receipt of message
• Asynchronous: P1 sends its message and continues
– Standard libraries for message passing:
Most common is MPI – Message Passing Interface
27
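A minimal sketch of the producer-consumer SEND/RECEIVE pattern above. MPI is the standard library in practice; to keep the sketch self-contained it uses Python's multiprocessing pipes, which expose similar blocking send/receive semantics:

```python
from multiprocessing import Process, Pipe

def producer(conn):
    data = [1, 2, 3]          # P1 produces data D
    conn.send(data)           # SEND: message handed to the "network"
    conn.close()

def consumer(conn):
    data = conn.recv()        # RECEIVE: blocks until the message arrives
    print("P2 received:", data)

if __name__ == "__main__":
    p1_end, p2_end = Pipe()
    p1 = Process(target=producer, args=(p1_end,))
    p2 = Process(target=consumer, args=(p2_end,))
    p1.start(); p2.start()
    p1.join();  p2.join()
```

Here send() returns as soon as the message is handed off (asynchronous in the slide's terms), while recv() blocks until the data arrives.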
Communication Performance
• Metrics for Communication Performance
– Communication Bandwidth
– Communication Latency
• Sender overhead + transfer time + receiver overhead
– Communication latency hiding
• Characterizing Applications
– Communication to Computation Ratio
• Work done vs. bytes sent over network
• Example: 146 bytes per 1000 instructions
28
Message Passing Pros and Cons
• Pros
– Simpler and cheaper hardware
– Explicit communication makes programmers aware of costly
(communication) operations
• Cons
– Explicit communication is painful to program
– Requires manual optimization
• If you want a variable to be local and accessible via LD/ST, you must
declare it as such
• If other processes need to read or write this variable, you must explicitly
code the needed sends and receives to do this
29
Parallel Processing Performance
• Challenges of Parallel Processing:
– First challenge is % of program inherently sequential
– Suppose 80x speedup from 100 processors. What fraction of
original program can be sequential?
• (a) 10% (b) 5% (c) 1% (d) <1%
– Assume that the program operates in only two modes:
• Parallel with all processors fully used (enhanced mode)
• Serial with only one processor in use
30
Amdahl’s Law provides the solution
31
Speedup = 1 / ((1 − Fraction_parallel) + Fraction_parallel/100) = 80
⇒ (1 − Fraction_parallel) + Fraction_parallel/100 = 1/80 = 0.0125
⇒ Fraction_parallel ≈ 0.9975, so at most about 0.25% of the original execution time can be sequential.
The sequential part can limit speedup.
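A short sketch of the Amdahl's-law calculation above: given the target speedup and the processor count, it solves for the parallel fraction and reports how much of the original time may remain sequential.

```python
def required_parallel_fraction(target_speedup, n_processors):
    # Amdahl's law: speedup = 1 / ((1 - f) + f / n)  =>  solve for f
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_processors)

f = required_parallel_fraction(80, 100)
print(f"parallel fraction   = {f:.4f}")       # ~0.9975
print(f"sequential fraction = {1 - f:.4%}")   # ~0.25% of original time
```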
Second Challenge:
Long Latency to Remote Memory
• Suppose 32 CPU MP, 2GHz, 200 ns to handle reference to a
remote memory, all local accesses hit memory hierarchy
and base CPI is 0.5. (Remote request cost = 200/0.5 = 400
clock cycles.)
• What is performance impact if 0.2% instructions involve
remote access?
– (a) 1.5X (b) 2.0X (c) 2.5X
32
CPI Equation
• CPI = Base CPI + Remote request rate x Remote request cost
• Cycle time = 1 / 2 GHz = 0.5 ns
• CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3
• A multiprocessor with no remote communication is 1.3/0.5 = 2.6 times faster than
one in which 0.2% of instructions involve a remote access
• In practice, the performance analysis is much more complex, since
– Some fraction of the non-communication references will miss in the local
hierarchy
– Remote access time does not have a single constant value.
33
Remote request cost = Remote access cost / Cycle time = 200 ns / 0.5 ns = 400 cycles
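The same CPI calculation as a small script, so the numbers above can be reproduced or varied (the parameters are those of the example: 0.5 base CPI, 0.2% remote references, 200 ns remote access time, 2 GHz clock):

```python
def effective_cpi(base_cpi, remote_rate, remote_ns, clock_ghz):
    cycle_ns = 1.0 / clock_ghz                 # 2 GHz -> 0.5 ns
    remote_cost = remote_ns / cycle_ns         # 200 ns / 0.5 ns = 400 cycles
    return base_cpi + remote_rate * remote_cost

cpi = effective_cpi(base_cpi=0.5, remote_rate=0.002, remote_ns=200, clock_ghz=2)
print(cpi)              # 0.5 + 0.002 * 400 = 1.3
print(cpi / 0.5)        # the all-local machine is 2.6x faster
```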
Challenge: Scaling Example
• Suppose you want to perform two sums:
– one is a sum of two scalar variables and
– one is a matrix sum of a pair of two-dimensional arrays, size 1000 by 1000.
What speedup do you get with 1000 processors?
• Solution:
– If we assume performance is a function of the time for an addition, t , then
there is 1 addition that does not benefit from parallel processors and
1,000,000 additions that do.
– If the time before (for a single processor) is: 1,000,000t + 1t = 1,000,001t
• Execution time after improvement
34
Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected
Challenge: Scalar and Matrix Addition
35
Execution time after improvement = 1,000,000t / 1,000 + 1t = 1001t

Speedup = 1,000,001t / 1001t ≈ 999
Even if the sequential portion expanded to 100 sums of scalar variables versus
one sum of a pair of 1000 by 1000 arrays, the speedup would still be 909.
Scaling Example...
• What if matrix size is 100 x 100?
– Single processor: Time= (10 + 10000) x tadd
– 10 processors
• Time = 10 x tadd + 10000/10 x tadd = 1010 x tadd
• Speedup = 10010/1010 = 9.9 (99% of potential)
– 100 processors
• Time = 10 x tadd + 10000/100 x tadd = 110 x tadd
• Speedup = 10010/110 = 91 (91% of potential)
Assuming load balanced
36
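The scalar-plus-matrix speedups above can be reproduced in a few lines (times are in units of t_add, and perfect load balance is assumed, as in the slide):

```python
def speedup(n_scalar, n_matrix, processors):
    # Scalar sums stay sequential; matrix additions divide evenly (load balanced).
    t_before = n_scalar + n_matrix
    t_after = n_scalar + n_matrix / processors
    return t_before / t_after

print(round(speedup(1, 1_000_000, 1000), 1))   # ~999  (1000x1000 matrix)
print(round(speedup(10, 10_000, 10), 1))       # ~9.9  (100x100 matrix, 10 procs)
print(round(speedup(10, 10_000, 100), 1))      # ~91   (100x100 matrix, 100 procs)
```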
Symmetric Shared-Memory Architectures
Cache Coherence Problem
• Shared memory easy with no caches
– P1 writes, P2 can read
– Only one copy of data exists (in memory)
• Caches store their own copies of the data
– Those copies can easily get inconsistent
– Classic example: adding to a sum
• P1 loads allSum, adds its mySum, stores new allSum
• P1’s cache now has dirty data, but memory not updated
• P2 loads allSum from memory, adds its mySum, stores allSum
• P2’s cache also has dirty data
• Eventually P1 and P2’s cached data will go to memory
• Regardless of write-back order, the final value ends up wrong
37
Cache Coherence Problem…
38
[Figure: P1 and P2 each add their mySum to a cached copy of allSum (reaching 5 and 12 respectively) while main memory still holds allSum = 0.]
Processors reading allSum from main memory may see a very stale value.
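A tiny simulation of the stale-allSum scenario above (variable names are illustrative): each processor loads allSum into its cache, adds its own mySum, and writes back later; whichever write-back arrives last, one update is lost.

```python
# Each processor caches allSum, adds its own mySum, and writes back later.
memory = {"allSum": 0}

p1_copy = memory["allSum"]        # P1 loads allSum (0) into its cache
p2_copy = memory["allSum"]        # P2 loads the same value
p1_copy += 5                      # P1 adds mySum1 = 5 in its cache (dirty)
p2_copy += 7                      # P2 adds mySum2 = 7 in its cache (dirty)

memory["allSum"] = p1_copy        # P1's write-back
memory["allSum"] = p2_copy        # P2's write-back overwrites it
print(memory["allSum"])           # 7, not the correct 12: one update is lost
```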
Cache Coherence Definition
• A memory system is coherent if
1. Preserve Program Order: A read R from address X on processor
P1 returns the value written by the most recent write W to X on
P1 if no other processor has written to X between W and R.
– This property simply preserves program order; we expect this property
to be true even in uniprocessors.
39
Figure: The cache coherence problem for a single memory location (X), read and written by two processors (A and B).

Time | Event | Cache contents for processor A | Cache contents for processor B | Memory contents for location X
0 | | | | 1
1 | Processor A reads X | 1 | | 1
2 | Processor B reads X | 1 | 1 | 1
3 | Processor A stores 0 into X | 0 | 1 | 0
Cache Coherence Definition…
2. Coherent view of Memory: If P1 writes to X and P2 reads X after
a sufficient time, and there are no other writes to X in between,
P2’s read returns the value written by P1’s write.
• The second property defines the notion of what it means to have a
coherent view of memory:
– If a processor could continuously read an old data value, we would clearly
say that memory was incoherent.
3. Write Serialization: Writes to the same location are serialized.
Two writes to location X are seen in the same order by all
processors.
• For example, if the values 1 and then 2 are written to a location,
processors can never read the value of the location as 2 and then later
read it as 1.
40
Coherence defines behaviour of reads and writes to the same location
Write Consistency
For now assume
1. A write does not complete (and allow the next write to
occur) until all processors have seen the effect of that write
2. The processor does not change the order of any write with
respect to any other memory access
if a processor writes location A followed by location B,
any processor that sees the new value of B must also see
the new value of A
• These restrictions allow the processor to reorder reads, but
forces the processor to finish writes in program order
41
Consistency defines behaviour of reads and writes to other locations
Basic Schemes for Enforcing Coherence
• Migration – data can be moved to a local cache and used
there in a transparent fashion
– Reduces both latency to access shared data that is allocated
remotely and bandwidth demand on the shared memory
• Replication – for reading shared data simultaneously, since
caches make a copy of data in local cache
– Reduces both latency of access and contention for read shared
data
42
Maintaining Cache Coherence
• Hardware schemes
– Shared Caches
• Trivially enforces coherence
• Not scalable (L1 cache quickly becomes a bottleneck)
– Snooping
• Needs a broadcast network (like a bus) to enforce coherence
• Each cache that has a block tracks its sharing state on its own
– Directory
• Can enforce coherence even with a point-to-point network
• A block has just one place where its full sharing state is kept
– All information about the blocks is kept in the directory
• SMP: one centralised directory is provided in the outermost cache for multi-
core systems
• DSM: Directory is distributed. Each node maintains its own directory which
tracks the sharing information of every cache line in the node
43
Maintaining Cache Coherence:
Two Classes of Protocols in Use
Cache coherence Protocols
• Directory based
– The sharing status of a block of physical memory is kept in just one
location, called the directory;
– Directory-based coherence has slightly higher implementation
overhead than snooping, but it can scale to larger processor counts.
• The Sun T1 design uses directories, albeit with a central physical memory.
• Snooping
– Every cache that has a copy of the data from a block of physical
memory also has a copy of the sharing status of the block, but no
centralized state is kept.
– The caches are all accessible via some broadcast medium (a bus or
switch), and all cache controllers monitor or snoop on the medium
• To determine whether or not they have a copy of a block that is requested on a
bus or switch access.
44
Snoopy Cache-Coherence Protocols
• The cache controller “snoops” all transactions on the shared
medium (bus or switch)
– a transaction is relevant if it is for a block the cache contains
– take action to ensure coherence
• invalidate, update, or supply value
– the action depends on the state of the block and the protocol
• Either get exclusive access before a write via write invalidate, or
update all copies on a write
45
[Figure: snoopy cache coherence: caches for processors P1..Pn and I/O devices sit on a shared bus with memory; each cache controller snoops bus transactions (state, address, data) while also serving cache-memory transactions from its own processor.]
Communication between private
and shared caches
• Multi-core processor → a bus connects private L1 and L2
instruction (I) and data (D) caches to the shared L3 cache.
• To invalidate a cached item the processor changing the
value must
– first acquire the bus and
– then place the address of the item to be invalidated on the bus.
• DSM → locating the current value of an item is harder with
– write-back caches
• because the current value of the item can be in the local cache of another
processor rather than in memory.
46
Snooping Protocol
• Typically used for bus-based (SMP) multiprocessors
– Serialization on the bus used to maintain coherence property 3
• Two flavors
– Write-update (write broadcast)
• A write to shared data is broadcast to update all copies
• All subsequent reads will return the new written value (property 2)
• All see the writes in the order of broadcasts
One bus == one order seen by all (property 3)
– Write-invalidate
• Write to shared data forces invalidation of all other cached copies
• Subsequent reads miss and fetch new value (property 2)
• Writes ordered by invalidations on the bus (property 3)
47
Write Invalidate: Example
• Write invalidate → on a write, invalidate all other copies.
– Used in modern microprocessors
– Example: processors A and B both read item X (write-back caches, read misses).
Once A writes X, it invalidates B's cached copy of X
48
Processor activity | Bus activity | Contents of processor A's cache | Contents of processor B's cache | Contents of memory location X
 | | | | 0
Processor A reads X | Cache miss for X | 0 | | 0
Processor B reads X | Cache miss for X | 0 | 0 | 0
Processor A writes a 1 to X | Invalidation for X | 1 | | 0
Processor B reads X | Cache miss for X | 1 | 1 | 1
For a write, we require that the writing processor have exclusive access,
preventing any other processor from being able to write simultaneously.
An example of an invalidation protocol working on a snooping bus for a single cache block (X) with
write-back caches.
Write Invalidate: Example...
• An example of an invalidation protocol working on a snooping bus for a single
cache block (X) with write-back caches
• We assume that neither cache initially holds X and that the value of X in
memory is 0.
• The CPU and memory contents show the value after the processor and bus
activity have both completed.
• A blank indicates no activity or no copy cached.
• When the second miss by B occurs, CPU A responds with the value cancelling
the response from memory.
– In addition, both the contents of B’s cache and the memory contents of X are updated.
• This update of memory, which occurs when a block becomes shared, is typical
in most protocols and simplifies the protocol.
49
Example: Write-through Invalidate
50
• Must invalidate before step 3
• Write update uses more broadcast-medium bandwidth
⇒ all recent MPUs use write invalidate
Exclusive access ensures that no other readable or writable copies of
a data item exist when the write occurs
Write Update: Example
• An example of a write update or broadcast protocol working on a
snooping bus for a single cache block (X) with write-back caches.
• We assume that neither cache initially holds X and that the value of
X in memory is 0.
51
Processor activity | Bus activity | Contents of processor A's cache | Contents of processor B's cache | Contents of memory location X
 | | | | 0
Processor A reads X | Cache miss for X | 0 | | 0
Processor B reads X | Cache miss for X | 0 | 0 | 0
Processor A writes a 1 to X | Write broadcast of X | 1 | 1 | 1
Processor B reads X | No bus activity | 1 | 1 | 1
Write Update: Example...
• The CPU and memory contents show the value after the
processor and bus activity have both completed.
• A blank indicates no activity or no copy cached.
• When CPU A broadcasts the write, both the cache in CPU B
and the memory location of X are updated.
• In the second read, processor B finds the updated value of
X and therefore there is no bus activity.
52
Update vs. Invalidate
• A burst of writes by a processor to one address
– Update: each sends an update
– Invalidate: possibly only the first invalidation is sent
• Writes to different words of a block
– Update: update sent for each word
– Invalidate: possibly only the first invalidation is sent
• Producer-consumer communication latency
– Update: producer sends an update,
• consumer reads new value from its cache
– Invalidate: producer invalidates consumer’s copy,
• consumer’s read misses and has to request the block
• Which is better depends on application
– But write-invalidate is simpler and implemented in most MP-capable
processors today.
53
Implementation of cache Invalidate
• The key to implementing an invalidate protocol in a
multicore is
– the use of the bus, or another broadcast medium, to perform
invalidates.
– All processors snoop on the bus.
• To invalidate the processor changing an item
– acquires the bus and
– broadcasts the address to be invalidated on the bus.
• If two processors attempt to change an item at the same time, the
bus arbiter allows access to only one of them.
– All coherence schemes require some method of serializing accesses
to the same cache block, either by serializing access to the
communication medium or another shared structure.
54
Implementation of cache Invalidate…
• How to find the most recent value of a data item
– Write-through cache → the value is in memory, but write buffers could
complicate the scenario.
– Write-back cache → a harder problem; the item could be in the private
cache of another processor.
• A block of cache has extra state bits
– Valid bit – indicates if the block is valid or not
– Dirty bit - indicates if the block has been modified
– Shared bit – cache block is shared with other processors
• If a processor finds that it has a dirty copy of the requested
cache block, it provides that cache block in response to the read
request and causes the memory (or L3) access to be aborted.
55
Implementation of cache Invalidate…
• When a write to a block in the shared state occurs,
– the cache generates an invalidation on the bus and marks the
block as exclusive.
– No further invalidations will be sent by that core for that block.
– The core with the sole copy of a cache block is normally
called the owner of the cache block.
• When an invalidation is sent,
– the state of the owner’s cache block is changed from shared
to unshared (or exclusive).
– If another processor later requests this cache block, the state
must be made shared again
56
Locate up-to-date copy of data
• For a write-through cache
– Get up-to-date copy from memory (Since all written data are
always sent to the memory, from which the most recent values of
a data item can always be fetched.)
– Write through simpler if enough memory bandwidth is available
– Use of write through simplifies the implementation of cache
coherence.
• For a write-back cache
– Most recent copy can be in a cache rather than in memory
– The problem of finding the most recent data value is harder
57
Locate up-to-date copy of data…
• Write-back caches can use the same snooping scheme both for
cache misses and for writes:
– Each processor snoops every address placed on the bus.
– If a processor has dirty copy of the requested cache block, it provides it
in response to the read request and aborts the memory access.
– Complexity comes from having to retrieve the cache block from a
processor’s cache, which can take longer than retrieving it from the
shared memory if the processors are in separate chips.
• Write-back needs lower memory bandwidth
– ⇒ Support larger numbers of faster processors
– ⇒ Most multiprocessors use write-back
58
Cache Resources for Write-Back Snooping
• Normal cache tags can be used for snooping
• Valid bit for each block makes invalidation easy
• Read misses easy since rely on snooping
• Writes → need to know whether any other copies of the block are cached
– No other copies → no need to place the write on the bus in a write-back cache
(reduces both the time taken by the write and the required bandwidth)
– Other copies → need to place an invalidate on the bus
59
[Figure: cache address fields: the block address (tag + index) and the block offset.]
Cache Resources for Write-Back Snooping…
• To track whether a cache block is shared, add extra state bit
associated with each cache block, like valid bit and dirty bit
– Write to a shared block ⇒ need to generate an invalidation on the bus
and mark the state of the block as exclusive.
– Otherwise, no further invalidations will be sent by that processor for
that block
– The processor with the sole copy of a cache block is normally called the
owner of the cache block
– When invalidation is sent, the state of the owner’s cache block is
changed from shared to exclusive.
– If another processor later requests this cache block, the state must be
made shared again.
60
Cache Behaviour in Response to Bus
• Every bus transaction must check the cache address
tags
– could potentially interfere with processor cache accesses
• A way to reduce interference is to duplicate the tags
– One set for cache accesses, one set for bus accesses
• The interference can also be reduced in a multilevel
cache by directing the snoop request to the L2 cache
– Since L2 less heavily used than L1 (the processor uses only
when it has a miss in the L1 cache)
⇒ Every entry in the L1 cache must be present in the L2
cache, called the inclusion property
– If Snoop gets a hit in L2 cache, then it must arbitrate for the L1
cache to update the state and possibly retrieve the data,
which usually requires a stall of the processor
61
Example: Write Back MSI Snooping Protocol
• Snooping coherence protocol is usually implemented by
incorporating a finite‐state controller in each node
• There is only one finite-state machine per cache, with stimuli coming
either from the attached processor or from the bus
• Logically, think of a separate controller associated with each cache
block
– That is, snooping operations or cache requests for different blocks can
proceed independently
• In implementations, a single controller allows multiple operations to
distinct blocks to proceed in interleaved fashion
– that is, one operation may be initiated before another is completed, even
though only one cache access or one bus access is allowed at a time
62
Example: Write Back MSI Snooping Protocol…
• Processor only observes state of memory system by issuing memory
operations
• Assume bus transactions and memory operations are atomic and a
one‐level cache
– all phases of one bus transaction complete before next one starts
– processor waits for memory operation to complete before issuing next
– with one‐level cache, assume invalidations applied during bus transaction
• All writes go to bus + atomicity
– Writes serialized by order in which they appear on bus (bus order) => invalidations applied to
caches in bus order
• How to insert reads in this order?
– Important since processors see writes through reads, so determines whether write
serialization is satisfied
– But read hits may happen independently and do not appear on bus or enter directly in bus
order
63
Example: Write Back MSI Snooping Protocol
• Invalidation protocol, write‐back cache
– Snoops every address on bus
– If it has a dirty copy of requested block, provides that block in response to the
read request and aborts the memory access
• State of block B in cache C can be
– Invalid: B is not cached in C
• To read or write, must make a request on the bus
– Modified: B is dirty in C
• C has the block, no other cache has the block, and C must update memory when it displaces B
• Can read or write B without going to the bus
– Shared: B is clean in C
• C has the block, other caches have the block, and C need not update memory when it
displaces B
• Can read B without going to bus
• To write, must send an upgrade request to the bus
• Read misses: cause all caches to snoop bus
• Writes to clean blocks are treated as misses
64
note that the modified state implies that the block is
exclusive
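Below is a compact, illustrative sketch of the per-block finite-state controller just described (Invalid / Shared / Modified, write-back, write-invalidate). It is not a full implementation: there is a single block, bus transactions are modeled as direct calls to the other caches' snoop methods, and replacement write-backs are ignored.

```python
# Minimal MSI (Invalid/Shared/Modified) controller for one cache block.
INVALID, SHARED, MODIFIED = "I", "S", "M"

class Cache:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.state, self.value = INVALID, None
        bus.caches.append(self)

    # ---- processor-side requests ----
    def read(self):
        if self.state == INVALID:                 # read miss -> bus read
            self.value = self.bus.read_miss(self)
            self.state = SHARED
        return self.value                         # S or M: read hit

    def write(self, value):
        if self.state != MODIFIED:                # need exclusive ownership
            self.bus.invalidate_others(self)
            if self.state == INVALID:
                self.bus.read_miss(self)          # fetch the block first
        self.state, self.value = MODIFIED, value  # block now dirty

    # ---- bus-side (snooped) requests ----
    def snoop_read(self):
        if self.state == MODIFIED:                # supply dirty data,
            self.bus.memory["X"] = self.value     # update memory,
            self.state = SHARED                   # and drop to Shared
            return self.value
        return None

    def snoop_invalidate(self):
        if self.state == MODIFIED:                # write back before losing it
            self.bus.memory["X"] = self.value
        self.state = INVALID

class Bus:
    def __init__(self):
        self.caches, self.memory = [], {"X": 0}

    def read_miss(self, requester):
        for c in self.caches:
            if c is not requester:
                supplied = c.snoop_read()
                if supplied is not None:
                    return supplied               # cache-to-cache transfer
        return self.memory["X"]

    def invalidate_others(self, requester):
        for c in self.caches:
            if c is not requester:
                c.snoop_invalidate()

bus = Bus()
a, b = Cache("A", bus), Cache("B", bus)
a.write(10)          # A: Modified, memory stale
print(b.read())      # 10 -- A supplies the block, both now Shared
b.write(20)          # invalidates A's copy; B: Modified
print(a.read())      # 20 -- a coherent view is maintained
```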
Write‐Back State Machine ‐ Processor Request
65
Transition Arcs: The stimulus causing a state change is shown on the transition
arcs in Blue
Bus actions generated as part of the state transition are shown on the transition
arc in Bold.
Write‐Back State Machine ‐ Processor Request…
66
Finite-state transition diagram for a single private cache block
using a write invalidation protocol and a write-back cache
[Figure: CPU-request state transitions for a write-back, write-invalidate cache block. States: Invalid, Shared (read only), Exclusive (read/write). A CPU read miss places a read miss on the bus (moving to Shared); a CPU write or write miss places a write miss on the bus (moving to Exclusive), possibly writing back the displaced block; read hits in Shared or Exclusive, and write hits in Exclusive, cause no bus activity.]
• Any transition to the Exclusive
state (which is required for a
processor to write to the block)
requires an invalidate or write
miss to be placed on the bus,
 causing all local caches to make the
block invalid.
 In addition, if some other local
cache had the block in Exclusive
state, that local cache generates a
write-back, which supplies the
block containing the desired
address.
Cache block States:
 Invalid
 Shared and
 Exclusive (Modified)
Cache state transitions
based on requests from CPU
Write‐Back State Machine ‐ Bus Request
67
Finite-state transition diagram for a single private cache block using a
write invalidation protocol and a write-back cache
• If a read miss occurs on the bus to a
block in the exclusive state,
 the local cache with the exclusive copy
changes its state to shared.
[Figure: bus-request state transitions for the same cache block. A write miss or an invalidate for this block moves it to Invalid; a read miss on the bus for a block held Exclusive forces a write-back (aborting the memory access) and a transition to Shared.]
Request | Source | State of addressed cache block | Type of cache action | Function and explanation
Read miss | Bus | Shared | No action | Allow shared cache or memory to service the read miss
Read miss | Bus | Modified | Coherence | Attempt to share data: place cache block on bus and change state to shared
Invalidate | Bus | Shared | Coherence | Attempt to write shared block; invalidate the block
Write miss | Bus | Shared | Coherence | Attempt to write shared block; invalidate the cache block
Write miss | Bus | Modified | Coherence | Attempt to write block that is exclusive elsewhere; write back the block and make its state invalid in the local cache
Cache state transitions
based on requests from Bus
Combined Cache Coherence State Diagram for both
Processor and Bus Requests
68
[Figure: combined cache coherence state diagram (Invalid, Shared, Exclusive) showing both the CPU-induced and the bus-induced transitions for a single cache block.]
Transition arcs:
 Local-processor-induced transitions in black
 Bus-activity-induced transitions in blue
 Actions taken on a transition in red
Example
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
69
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | | | |
P1: Read A1 | | | |
P2: Read A1 | | | |
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
Example: Step 1
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
70
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | Excl, A1, 10 | | WrMs, P1, A1 |
P1: Read A1 | | | |
P2: Read A1 | | | |
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
Example: Step 2
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
71
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | Excl, A1, 10 | | WrMs, P1, A1 |
P1: Read A1 | Excl, A1, 10 | | |
P2: Read A1 | | | |
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
Example: Step 3
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
72
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | Excl, A1, 10 | | WrMs, P1, A1 |
P1: Read A1 | Excl, A1, 10 | | |
P2: Read A1 | | Shar., A1 | RdMs, P2, A1 | A1, 10
 | Shar., A1, 10 | | WrBk, P1, A1, 10 | A1, 10
 | | Shar., A1, 10 | RdDa, P2, A1, 10 | A1, 10
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
Example: Step 4
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
73
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | Excl, A1, 10 | | WrMs, P1, A1 |
P1: Read A1 | Excl, A1, 10 | | |
P2: Read A1 | | Shar., A1 | RdMs, P2, A1 | A1, 10
 | Shar., A1, 10 | | WrBk, P1, A1, 10 | A1, 10
 | | Shar., A1, 10 | RdDa, P2, A1, 10 | A1, 10
P2: Write 20 to A1 | Inv. | Excl., A1, 20 | Inv., P2, A1 | A1, 10
P2: Write 40 to A2 | | | |
Example: Step 5
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
74
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | Excl, A1, 10 | | WrMs, P1, A1 |
P1: Read A1 | Excl, A1, 10 | | |
P2: Read A1 | | Shar., A1 | RdMs, P2, A1 | A1, 10
 | Shar., A1, 10 | | WrBk, P1, A1, 10 | A1, 10
 | | Shar., A1, 10 | RdDa, P2, A1, 10 | A1, 10
P2: Write 20 to A1 | Inv. | Excl., A1, 20 | Inv., P2, A1 | A1, 10
P2: Write 40 to A2 | | | WrBk, P2, A1, 20 | A1, 20
 | | Excl., A2, 40 | WrMs, P2, A2 | A1, 20
Limitations in Symmetric Shared-Memory
Multiprocessors and Snooping Protocols
• As the number of processors in a multiprocessor grows, or as the memory demands of each
processor grow, any centralized resource in the system can become a bottleneck.
• As processors have increased in speed in the last few years, the number of processors that
can be supported on a single bus or by using a single physical memory unit has fallen
• A single memory cannot accommodate all CPUs
– Solution: multiple memory banks
• Bus-based multiprocessor, bus must support both coherence traffic & normal memory
traffic
– Multiple buses or interconnection networks (cross bar or small point-to-point)
75
In such designs, the memory system
can be configured into multiple physical
banks, so as to boost the effective
memory bandwidth while retaining
uniform access time to memory
[Figure: multiprocessor with the memory system organized into multiple banks: processors with one or more levels of cache connect through an interconnection network to several memory modules and the I/O system.]
Cache Performance
• Cache performance is combination of
1. Behaviour of uniprocessor cache miss traffic
2. Traffic caused by communication
• Results in invalidations and subsequent cache misses
– Changing the processor count, cache size, and block size can
affect these two components of the miss rate in different ways.
• Uniprocessor miss rate:
– Can be broken down into:
• Compulsory,
• Capacity and
• Conflict misses
76
Cache Performance…
• Compulsory miss:
– The very first access to a block cannot be in the cache.
• Capacity miss:
– The cache cannot contain all the blocks needed during execution
of a program, capacity miss will occur because of blocks being
discarded and later retrieved.
• Conflict miss:
– If the block placement strategy is set associative or direct
mapped, conflict misses will occur because a block may be
discarded and later retrieved if too many blocks map to its set.
77
Coherency Misses
The misses that arise from inter-processor communication, which are
often called coherence misses, can be broken into two separate
sources.
1. True sharing misses arise from the communication of data
through the cache coherence mechanism
– In an invalidation-based protocol, the 1st write to a shared block causes an
invalidation to establish ownership of the block.
– When another processor attempts to read a modified word in the cache block,
a miss occurs and the block is transferred.
2. False sharing misses arise when a block is invalidated because some
word in the block, other than the one being read, is written into
– The invalidation does not cause a new value to be communicated, but only causes
an extra cache miss
– The block is shared, but no word in the block is actually shared → the miss would not
occur if the block size were 1 word
78
Example: True v. False Sharing v. Hit?
• Assume x1 and x2 in same cache block and in shared state
• P1 and P2 both read x1 and x2 before.
79
Time | P1 | P2 | True, False, Hit? Why?
1 | Write x1 | | True miss: invalidate x1 in P2
2 | | Read x2 | False miss: x1 irrelevant to P2
3 | Write x1 | | False miss: x1 irrelevant to P2
4 | | Write x2 | False miss: x1 irrelevant to P2
5 | Read x2 | | True miss: invalidate x2 in P1
Classifications by Time Step
1. This event is a true sharing miss, since x1 was read by P2 and needs to be
invalidated from P2.
2. This event is a false sharing miss, since x2 was invalidated by the write of x1 in
P1, but that value of x1 is not used in P2.
3. This event is a false sharing miss, since the block containing x1 is marked shared
due to the read in P2, but P2 did not read x1. The cache block containing x1 will
be in the shared state after the read by P2; a write miss is required to obtain
exclusive access to the block. In some protocols this will be handled as an
upgrade request, which generates a bus invalidate, but does not transfer the
cache block.
4. This event is a false sharing miss for the same reason as step 3.
5. This event is a true sharing miss, since the value being read was written by P2.
80
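The five-step classification above can be reproduced mechanically. The sketch below is a simplification (one block holding the two words x1 and x2, a write-invalidate protocol, and both caches starting with the block Shared after having read both words, as the example assumes): a miss is labeled a true sharing miss only if the word actually being accessed is the one involved in the communication, otherwise it is a false sharing miss.

```python
class Copy:
    def __init__(self):
        self.state = "S"                  # I / S / M
        self.touched = {"x1", "x2"}       # words used while holding this copy
        self.missed_writes = set()        # words the peer wrote while we were invalid

def access(me, peer, op, word):
    if op == "read":
        if me.state != "I":
            kind = "hit"
        else:                             # coherence read miss
            kind = ("true sharing miss" if word in me.missed_writes
                    else "false sharing miss")
            me.missed_writes.clear(); me.touched = set()
            if peer.state == "M":
                peer.state = "S"          # owner downgrades, supplies the block
            me.state = "S"
    else:                                 # write
        if me.state == "M":
            kind = "hit"
        elif me.state == "S":             # upgrade: peer's valid copy is destroyed
            kind = ("true sharing miss" if peer.state != "I" and word in peer.touched
                    else "false sharing miss")
        else:                             # write miss on an invalidated block
            kind = ("true sharing miss" if word in me.missed_writes
                    else "false sharing miss")
            me.missed_writes.clear(); me.touched = set()
        if peer.state != "I":
            peer.state, peer.touched = "I", set()
        peer.missed_writes.add(word)      # peer may later miss on this word
        me.state = "M"
    me.touched.add(word)
    return kind

p1, p2 = Copy(), Copy()
trace = [(p1, p2, "write", "x1"), (p2, p1, "read", "x2"),
         (p1, p2, "write", "x1"), (p2, p1, "write", "x2"),
         (p1, p2, "read", "x2")]
for step, (me, peer, op, word) in enumerate(trace, 1):
    print(step, op, word, "->", access(me, peer, op, word))
# Output matches the table: true, false, false, false, true sharing misses.
```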
Cache to Cache transfers
• Problem
– P1 has block B in M state
– P2 wants to read B, puts a RdReq on bus
– If P1 does nothing, memory will supply the data to P2
– What does P1 do?
• Solution 1: abort/retry
– P1 cancels P2’s request, issues a write back
– P2 later retries RdReq and gets data from memory
– Too slow (two memory latencies to move data from P1 to P2)
• Solution 2: intervention
– P1 indicates it will supply the data (“intervention” bus signal)
– Memory sees that, does not supply the data, and waits for P1’s data
– P1 starts sending the data on the bus, memory is updated
– P2 snoops the transfer during the write-back and gets the block
81
Cache to Cache transfers…
• Intervention works if some cache has data in M state
– Nobody else has the correct data, clear who supplies the
data
• What if a cache has the requested data in S state?
– There might be others who have it; who should supply the
data?
– Solution 1: let memory supply the data
– Solution 2: whoever wins arbitration supplies the data
– Solution 3: A separate state similar to S that indicates there
are maybe others who have the block in S state, but if
anybody asks for the data we should supply it
82
Extensions to the Basic Coherence Protocol
• We have just considered a coherence protocol
with 3 states: Modified, Shared, Invalid (MSI)
• There are many extensions of MSI
– With additional states and transactions, which
optimise certain behaviours, possibly resulting in
improved performance.
• Two of the most common extensions are: MESI
and MOESI
83
MESI (Modified, Exclusive, Shared & Invalid)
• MESI adds the state Exclusive (E) to the basic MSI protocol.
• Exclusive indicates when a cache block is resident only in a single cache but is
clean
• If a block is in the E state, it can be written without generating any invalidates,
which optimizes the case where a block is read by a single cache before being
written by that same cache.
• Of course, when a read miss to a block in the E state occurs, the block must be
changed to the S state to maintain coherence.
– Because all subsequent accesses are snooped, it is possible to maintain the accuracy of this state.
– In particular, if another processor issues a read miss, the state is changed from exclusive to shared
• Pros of adding E state:
– subsequent write to a block in the exclusive state by the same core need not acquire bus access or
generate an invalidate, since the block is known to be exclusively in this local cache; the processor
merely changes the state to modified.
• The Intel i7 uses a variant of a MESI protocol, called MESIF, which adds a state
(Forward) to designate which sharing processor should respond to a request.
– It is designed to enhance performance in distributed memory organizations.
84
MOESI
(Modified, Owned, Exclusive, Shared & Invalid)
• MOESI adds the state Owned to the MESI protocol to indicate that the associated block is
owned by that cache and out-of-date in memory.
• In MSI and MESI protocols, when there is an attempt to share a block in the Modified state,
the state is changed to Shared (in both the original and newly sharing cache), and the block
must be written back to memory.
• In a MOESI protocol, the block can be changed from the Modified to Owned state in the
original cache without writing it to memory.
• Other caches, which are newly sharing the block, keep the block in the Shared state; the O
state, which only the original cache holds, indicates that the main memory copy is out of
date and that the designated cache is the owner.
• The owner of the block must supply it on a miss, since memory is not up to date and must
write the block back to memory if it is replaced.
• The AMD Opteron uses the MOESI protocol.
85
Directory-Based Coherence Protocol
• Typically in distributed shared memory
• For every local memory block, local directory
has an entry
• Directory entry indicates
–Who has cached copies of the block
–In what state do they have the block
86
Distributed-Memory Multiprocessor with the
directories added to each Node
87
Directory-Based Cache Coherence Protocols:
The Basics
• Just as with a snooping protocol, there are two primary operations that a directory
protocol must implement:
– handling a read miss and
– handling a write to a shared, clean cache block.
• Handling a write miss to a block that is currently shared is a simple combination of
these two.
• To implement these operations, a directory must track the state of each cache
block.
• In a simple protocol, these states could be the following:
– Shared—One or more nodes have the block cached, and the value in memory is up to date
(as well as in all the caches).
– Uncached—No node has a copy of the cache block.
– Modified—Exactly one node has a copy of the cache block, and it has written the block, so
the memory copy is out of date. The processor is called the owner of the block.
88
Basic Directory Scheme
• Each entry has
– One dirty bit (1 if there is a dirty cached copy)
– A presence vector (1 bit for each node) that tells which nodes may
have cached copies
• All misses sent to block’s home
• Directory performs needed coherence actions
• Eventually, directory responds with data
89
Read Miss
• Processor Pk has a read miss on block B, sends
request to home node of the block
• Directory controller
– Finds entry for B, checks D bit
– If D=0
• Read memory and send data back, set P[k]
– If D=1
• Request block from processor whose P bit is 1
• When block arrives, update memory, clear D bit,
send block to Pk and set P[k]
90
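A sketch of the read-miss handling just described, under simplifying assumptions (a single directory entry with a dirty bit and a presence vector; the class and variable names are invented for the illustration):

```python
# Directory entry for one memory block at its home node (illustrative).
class DirectoryEntry:
    def __init__(self, n_nodes, value=0):
        self.dirty = False                    # True if there is one dirty cached copy
        self.presence = [False] * n_nodes     # presence bit per node
        self.memory_value = value

def handle_read_miss(entry, caches, requester):
    """Pk has a read miss on this block and sends the request to the home node."""
    if not entry.dirty:
        # D = 0: memory is up to date; read it and remember the new sharer.
        entry.presence[requester] = True
        return entry.memory_value
    # D = 1: fetch the block from the single owner, update memory,
    # clear the dirty bit, and add the requester as a sharer.
    owner = entry.presence.index(True)
    entry.memory_value = caches[owner]        # owner supplies the dirty block
    entry.dirty = False
    entry.presence[requester] = True
    return entry.memory_value

caches = {2: 42}                              # node 2 holds a dirty copy (value 42)
entry = DirectoryEntry(n_nodes=4, value=7)
entry.dirty, entry.presence[2] = True, True

print(handle_read_miss(entry, caches, requester=0))   # 42 (fetched from node 2)
print(entry.dirty, entry.presence)                    # False [True, False, True, False]
```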
Directory Operation
• Network controller connected to each bus
– A proxy for remote caches and memories
• Requests for remote addresses forwarded to home,
responses from home placed on the bus
• Requests from home placed on the bus,
cache responses sent back to home node
• Each cache still has its own coherence state
– Directory is there just to avoid broadcasts
and order accesses to each location
• Simplest scheme:
If access A1 to block B still not fully processed by directory
when A2 arrives, A2 waits in a queue until A1 is done
91
92
MULTIPROCESSOR INTERCONNECTION NETWORKS
Multiprocessor Interconnection Networks
• Multiprocessors interconnection networks (INs) can be classified based on
a number of criteria. These include
– (1) mode of operation (synchronous versus asynchronous),
– (2) control strategy (centralized versus decentralized),
– (3) switching technique (circuit versus packet), and
– (4) topology (static versus dynamic).
• Mode of Operation
– According to the mode of operation, INs are classified as synchronous versus asynchronous.
• In synchronous mode of operation, a single global clock is used by all components in the system such
that the whole system is operating in a lock–step manner.
• Asynchronous mode of operation, on the other hand, does not require a global clock. Handshaking
signals are used instead in order to coordinate the operation of asynchronous systems.
– While synchronous systems tend to be slower compared to asynchronous systems, they are
race and hazard-free.
93
Multiprocessor Interconnection Networks…
• Control Strategy
– According to the control strategy, INs can be classified as centralized versus
decentralized.
• In centralized control systems, a single central control unit is used to oversee and
control the operation of the components of the system.
• In decentralized control, the control function is distributed among different
components in the system.
– The function and reliability of the central control unit can become the bottleneck
in a centralized control system. While the crossbar is a centralized system, the
multistage interconnection networks are decentralized.
• Switching Techniques
– Interconnection networks can be classified according to the switching
mechanism as circuit versus packet switching networks.
• In the circuit switching mechanism, a complete path has to be established prior to the
start of communication between a source and a destination. The established path will
remain in existence during the whole communication period.
• In a packet switching mechanism, communication between a source and destination
takes place via messages that are divided into smaller entities, called packets. On their
way to the destination, packets can be sent from one node to another in a store-and-
forward manner until they reach their destination.
– While packet switching tends to use the network resources more efficiently
compared to circuit switching, it suffers from variable packet delays.
94
Multiprocessor Interconnection Networks…
• Topology
– An interconnection network topology is a mapping function from the set of
processors and memories onto the same set of processors and memories.
In other words, the topology describes how to connect processors and
memories to other processors and memories.
– A fully connected topology, for example, is a mapping in which each
processor is connected to all other processors in the computer.
– A ring topology is a mapping that connects processor k to its neighbours,
processors (k – 1) and (k + 1).
– In general, interconnection networks can be classified as
• static versus dynamic networks.
– In static networks, direct fixed links are established among nodes to form a
fixed network, while
– In dynamic networks, connections are established as needed.
– Switching elements are used to establish connections among inputs and
outputs.
– Depending on the switch settings, different interconnections can be
established.
– Nearly all multiprocessor systems can be distinguished by their interconnection network topology.
95
Interconnection networks for
Shared Memory and Message Passing Systems.
• Shared memory
– Shared memory systems can be designed using bus-based or switch-based
INs.
– The simplest IN for shared memory systems is the bus. However, the bus
may get saturated if multiple processors are trying to access the shared
memory (via the bus) simultaneously.
– A typical bus-based design uses caches to solve the bus contention
problem.
– Other shared memory designs rely on switches for interconnection.
• For example, a crossbar switch can be used to connect multiple processors to
multiple memory modules.
96
[Figure: shared-memory interconnection networks: (a) bus-based and (b) switch-based.]
[Figure: single-bus and multiple-bus systems.]
Interconnection networks for
Shared Memory and Message Passing Systems…
• Message passing INs
– Message passing INs can be divided into static and dynamic.
• Static networks form all connections when the system is designed rather than when the
connection is needed. In a static network, messages must be routed along established links.
• Dynamic INs establish a connection between two or more nodes on the fly as messages are
routed along the links. The number of hops in a path from source to destination node is equal
to the number of point-to-point links a message must traverse to reach its destination.
– In either static or dynamic networks, a single message may have to hop through
intermediate processors on its way to its destination.
• Therefore, the ultimate performance of an interconnection network is greatly influenced by
the number of hops taken to traverse the network.
97
Figure Examples of static topologies.
Interconnection networks for
Shared Memory and Message Passing Systems…
98
Figure: Example dynamic INs: (a) single-stage, (b) multistage, and (c) crossbar switch.
• The single-stage interconnection network of Figure (a) is a simple
dynamic network that connects each of the inputs on the left side to
some, but not all, outputs on the right side through a single layer of
binary switches represented by the rectangles.
 The binary switches can direct the message on the left-side input to
one of two possible outputs on the right side.
Interconnection networks for
Shared Memory and Message Passing Systems…
• Figure (b). The Omega MIN (Multistage Interconnection Network) connects eight
sources to eight destinations.
– The connection from the source 010 to the destination 010 is shown as a bold path
– These are dynamic INs because the connection is made on the fly, as needed.
– In order to connect a source to a destination, we simply use a function of the bits of the
source and destination addresses as instructions for dynamically selecting a path through the
switches.
– For example, to connect source 111 to destination 001 in the omega network,
• the switches in the first and second stage must be set to connect to the upper output port,
• while the switch at the third stage must be set to connect to the lower output port (001).
• In general, when using k × k switches, an Omega MIN with N input-output ports requires
at least log_k N stages, each of which contains N/k switches, for a total of (N/k)(log_k N)
switches.
• Figure (c): the crossbar switch provides a path from any input or source to any other
output or destination by simply selecting a direction on the fly.
– To connect row 111 to column 001 requires only one binary switch at the intersection of the
111 input line and the 001 output line to be set.
• The crossbar switch clearly uses more binary switching components;
– for example, N^2 components are needed to connect N × N source/destination pairs.
99
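The destination-tag routing rule described above (the bits of the destination address select the output port at each stage) can be sketched for an 8 x 8 Omega network built from 2 x 2 switches; the link pattern between stages is assumed to be the perfect shuffle, which is the usual Omega construction:

```python
# Destination-tag routing through an 8x8 Omega network of 2x2 switches.
# At each of the log2(N) stages: apply the perfect shuffle, then let the
# switch send the message to its upper (0) or lower (1) output according
# to the next destination-address bit (illustrative sketch).

def omega_route(src, dst, n=8):
    bits = n.bit_length() - 1                     # log2(N) stages
    pos = src
    path = [pos]                                  # port index after each stage
    for stage in range(bits):
        # Perfect shuffle: rotate the port index left by one bit.
        pos = ((pos << 1) | (pos >> (bits - 1))) & (n - 1)
        # Switch setting: low bit becomes the destination bit for this stage
        # (MSB first) -> upper output if 0, lower output if 1.
        dst_bit = (dst >> (bits - 1 - stage)) & 1
        pos = (pos & ~1) | dst_bit
        path.append(pos)
    return path

print(omega_route(0b111, 0b001))   # [7, 6, 4, 1]: upper, upper, lower, as in the slide
print(omega_route(0b010, 0b010))   # the bold 010 -> 010 path from the figure
```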
Pros and Cons of Crossbar Switch and Omega MIN
• Pros
– Crossbar switch has potential for speed. In one clock, a connection can be made
between source and destination.
– The diameter of the crossbar is one.
• (Note: Diameter, D, of a network having N nodes is defined as the maximum shortest paths
between any two nodes in the network.)
– The omega MIN, on the other hand requires log N clocks to make a connection.
• The diameter of the omega MIN is therefore log N.
• Cons
– Both Crossbar Switch and Omega MIN networks limit the number of alternate paths
between any source/destination pair.
– This leads to limited fault tolerance and network traffic congestion.
– If the single path between pairs becomes faulty, that pair cannot communicate.
– If two pairs attempt to communicate at the same time along a shared path, one pair
must wait for the other.
• This is called blocking, and such MINs are called blocking networks.
• A network that can handle all possible connections without blocking is called a nonblocking
network.
100
Example Problem
• Example:
– Compute the cost of interconnecting 4096 nodes using a single crossbar switch
relative to doing so using a MIN built from 2 × 2, 4 × 4, and 16 × 16 switches.
Consider separately the relative cost of the unidirectional links and the relative
cost of the switches. Switch cost is assumed to grow quadratically with the
number of input (alternatively, output) ports, k, for k × k switches.
• Solution:
– The switch cost of the network when using a single crossbar is proportional to 4096².
– The unidirectional link cost is 8192, which accounts for the set of links from the
end nodes to the crossbar and also from the crossbar back to the end nodes.
– When using a MIN with k × k switches, the cost of each switch is proportional to k², but there are (4096/k) logk 4096 switches in total.
– Likewise, there are logk 4096 stages, each with N unidirectional links out of the switches, plus N links from the end nodes into the MIN.
– Therefore, the relative costs of the crossbar with respect to each MIN are given by the following:
101
Example Problem…
102
Relative cost (2 × 2) switches = 4096² / (2² × 4096/2 × log2 4096) = 170
Relative cost (4 × 4) switches = 4096² / (4² × 4096/4 × log4 4096) = 170
Relative cost (16 × 16) switches = 4096² / (16² × 4096/16 × log16 4096) = 85
Relative cost (2 × 2) links = 8192 / (4096 × (log2 4096 + 1)) = 2/13 ≈ 0.1538
Relative cost (4 × 4) links = 8192 / (4096 × (log4 4096 + 1)) = 2/7 ≈ 0.2857
Relative cost (16 × 16) links = 8192 / (4096 × (log16 4096 + 1)) = 2/4 = 0.5
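These figures can be reproduced with a few lines of Python. The snippet below is only a sketch of the cost model used in the example (quadratic switch cost, N links per stage plus N end-node links); the slide's values of 170, 170, and 85 are these switch-cost ratios rounded down to whole numbers.

```python
import math

def relative_costs(N: int, k: int):
    """Relative switch and link cost of an N x N crossbar versus an Omega MIN
    built from k x k switches, using the cost model of the example above."""
    stages = round(math.log(N, k))                  # log_k N stages
    min_switch_cost = (k ** 2) * (N // k) * stages  # k^2 per switch, N/k switches per stage
    min_link_cost = N * (stages + 1)                # N links per stage plus N end-node links
    return (N ** 2) / min_switch_cost, (2 * N) / min_link_cost

for k in (2, 4, 16):
    sw, ln = relative_costs(4096, k)
    print(f"{k:2d} x {k:<2d} switches: switch cost ratio = {sw:.1f}, link cost ratio = {ln:.4f}")
#  2 x 2  switches: switch cost ratio = 170.7, link cost ratio = 0.1538
#  4 x 4  switches: switch cost ratio = 170.7, link cost ratio = 0.2857
# 16 x 16 switches: switch cost ratio = 85.3, link cost ratio = 0.5000
```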
Example Problem…
• Conclusion
– In all cases, the single crossbar has much higher switch cost than the
MINs.
– The most dramatic reduction in switch cost comes from the MINs built from the smallest (but most numerous) switches; it is interesting that the MINs with 2 × 2 and 4 × 4 switches yield the same relative switch cost.
– The relative link cost of the crossbar is lower than that of the MINs, but by less than an order of magnitude in all cases.
– We must keep in mind that end node links are different from switch
links in their length and packaging requirements, so they usually have
different associated costs.
– Despite the lower link cost, the crossbar has higher overall relative cost.
103
Performance Comparison of some Dynamic INs
• In the table below, m represents the number of buses used, while N represents the number of processors (memory modules) or the number of inputs/outputs of the network.
104
Network Delay Cost (Complexity)
Bus O(N) O(1)
Multiple-bus O(mN) O(m)
Multistage INs (MINs) O(log N) O(N log N)
Table Performance Comparison of Some Dynamic INs
Performance Comparison of some Static
INs.
• The table below shows a performance comparison among a
number of static INs.
– In this table, the degree of a node, d, is defined as the number of links (channels) incident on the node; the degree of a network is the maximum node degree.
– The diameter of a network is defined as the maximum, over all pairs of nodes, of the length of the shortest path between them.
105
Network Degree Diameter Cost (No. of links)
Linear array 2 N – 1 N – 1
Binary tree 3 2(⌈log2 N⌉ – 1) N – 1
n-cube log2 N log2 N nN/2
2D-mesh 4 2(n – 1) 2(N – n)
(Here n = log2 N for the n-cube and n = √N for an n × n 2D-mesh.)
Table Performance Characteristics of Static INs
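A quick sanity check of the table is to evaluate the formulas for a concrete machine size. The sketch below is my own illustration (with n taken as log2 N for the hypercube and √N for the mesh, as noted above), printing the metrics for N = 64 nodes.

```python
import math

def static_in_metrics(topology: str, N: int):
    """(degree, diameter, number of links) for the static topologies in the
    table above; n = log2(N) for the hypercube, n = sqrt(N) for the 2D mesh."""
    n_cube = round(math.log2(N))   # hypercube dimension
    n_mesh = round(math.sqrt(N))   # side length of the 2D mesh
    return {
        "linear array": (2, N - 1, N - 1),
        "binary tree":  (3, 2 * (math.ceil(math.log2(N)) - 1), N - 1),
        "n-cube":       (n_cube, n_cube, n_cube * N // 2),
        "2D mesh":      (4, 2 * (n_mesh - 1), 2 * (N - n_mesh)),
    }[topology]

for topo in ("linear array", "binary tree", "n-cube", "2D mesh"):
    degree, diameter, links = static_in_metrics(topo, 64)
    print(f"{topo:12s}: degree={degree:2d}, diameter={diameter:3d}, links={links:3d}")
# linear array: degree= 2, diameter= 63, links= 63
# binary tree : degree= 3, diameter= 10, links= 63
# n-cube      : degree= 6, diameter=  6, links=192
# 2D mesh     : degree= 4, diameter= 14, links=112
```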
Multiprocessor.pptx

  • 1. CPE 806: ADVANCED COMPUTER ARCHITECTURE 1 MULTIPROCESSORS
  • 2. Multiprocessing: Flynn’s Taxonomy 2 • Flynn’s Taxonomy of Parallel Machines – How many Instruction (I) streams? – How many Data (D) streams? • Flynn’s classified architectures in terms of streams of data and instructions; – Stream of Instructions (SI): sequence of instructions executed by the computer. – Stream of Data (SD): sequence of data including input, temporary or partial results referenced by instructions.
  • 3. Flynn’s Taxonomy…. • Computer architectures are characterized by the multiplicity of hardware to serve instruction and data streams. 1. Single Instruction Single Data (SISD) 2. Single Instruction Multiple Data (SIMD) 3. Multiple Instruction Multiple Data (MIMD) 4. Multiple Instruction Single Data (MISD) 3
  • 4. Flynn’s Taxonomy: SISD • SISD: Single I Stream, Single D Stream – A uniprocessor von Neumann computers 4 Control Unit Processor (P) Memory (M) I/O Data Stream Instruction Stream Instruction Stream
  • 5. SIMD • SIMD: Single I, Multiple D Streams – Each “processor” works on its own data – But all execute the same instrs in lockstep – E.g. a vector processor or MMX • Consists of 2 parts – A front-end Von Neumann computer – A processor array: connected to the memory bus of the front end 5 Control Unit P1 M1 Pn Mn Instruction Stream Program loaded from front end Data Stream Data Stream Data loaded from front end
  • 6. SIMD Architecture 6 P1 P2 P3 Pn-1 Pn M1 M2 M3 Mn-1 Mn Interconnection Network Control Unit Scheme 1 Each processor has its own local memory P1 P2 P3 Pn-1 Pn M1 M2 M3 Mn-1 Mn Interconnection Network Control Unit Scheme 2 Processor and memory modules communicate with each other via interconnection network
  • 7. SIMD: Shared Memory and Not Shared 7 P1 P2 P3 Pn-1 Pn Shared Memory Interconnection Network Control Unit Control Unit P1 M1 Pn Mn Instruction Stream Program loaded from front end Data Stream Data Stream Data loaded from front end
  • 8. MIMD • MIMD: Multiple I, Multiple D Streams – Made of multiple processors and multiple memory modules connected together via some interconnection network – Each processor executes its own instructions and operates on its own data – This is your typical off-the-shelf multiprocessor (made using a bunch of “normal” processors) – Includes multi-core processors • 2 broad classes: – Shared memory – Message passing 8 Control Unit-1 P1 M1 Data Stream Instruction Stream Control Unit-n Pn Mn Data Stream Instruction Stream Instruction Stream Instruction Stream
  • 9. MIMD Multiprocessors 9 Centralized Shared Memory Distributed Memory • Multiprocessors  computers consisting of tightly coupled processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space. • Such systems exploit thread-level parallelism through two different software models. • Parallel processing • Request-level processing
  • 10. Flynn’s Taxonomy: MISD • MISD: Multiple I, Single D Stream – No processor has been produced using this taxonomy. 10
  • 11. Introduction • Goal: connecting multiple computers to get higher performance – Multiprocessors – Scalability, availability, power efficiency • Job-level (process-level) parallelism – High throughput for independent jobs • Parallel processing program – Single program run on multiple processors • Multicore microprocessors – Chips with multiple processors (cores) 11
  • 12. Multiprocessors • Why do we need multiprocessors? – Uniprocessor speed keeps improving • There are limits to which ILP can be increased – But there are things that need even more speed • Wait for a few years for Moore’s law to catch up? • Or use multiple processors and do it now? – Need for more computing power • Data intensive applications • Utility computing requires powerful processors • Multiprocessor software problem – Most code is sequential (for uniprocessors) • Much easier to write and debug – Parallel code required for effective and efficient utilization of all cores • But Correct parallel code very, very difficult to write – Efficient and correct is even harder – Debugging even more difficult (Heisenbugs) 12
  • 13. Multiprocessors • The main argument for using multiprocessors is to create powerful computers by simply connecting multiple processors. – A multiprocessor is expected to reach faster speed than the fastest single-processor system. – More cost-effective. • A multiprocessor consisting of a number of single processors is expected to be than building a high-performance single processor. – Fault tolerance. • If a processor fails, the remaining processors should be able to provide continued service, albeit with degraded performance. 13
  • 14. Two Models for Communication and Memory Architecture 1. Communication occurs by explicitly passing messages among the processors: – message-passing multiprocessors 2. Communication occurs through a shared address space (via loads and stores): – shared memory multiprocessors either • UMA (Uniform Memory Access time) for shared address, centralized memory MP • NUMA (Non Uniform Memory Access time multiprocessor) for shared address, distributed memory MP • In past, confusion whether “sharing”means sharing physical memory (Symmetric MP) or sharing address space 14
  • 15. Symmetric Shared-Memory Architectures • From multiple boards on a shared bus to multiple processors inside a single chip • Caches both – Private data are used by a single processor – Shared data are used by multiple processors 15
  • 16. Important ideas • Technology drives the solutions. – Multi-cores have altered the game!! – Thread-level parallelism (TLP) vs ILP. • Computing and communication deeply intertwined. – Write serialization exploits broadcast communication on the interconnection network or the bus connecting L1, L2, and L3 caches for cache coherence. • Access to data located at the fastest memory level greatly improves the performance. • Caches are critical for performance but create new problems – Cache coherence protocols: 1. Cache snooping  traditional multiprocessor 2. Directory based  multi-core processors 16
  • 17. Review of basic concepts • Cache  smaller, faster memory which stores copies of the data from frequently used main memory locations. • Cache writing policies – write-through  every write to the cache causes a write to main memory. – write-back  writes are not immediately mirrored to main memory. • Locations written are marked dirty and written back to the main memory only when that data is evicted from the cache. • A read miss may require two memory accesses: write the dirty location to memory and read new location from memory. • Caches are organized in blocks or cache lines. • Cache blocks consist of – Tag  contains (part of) address of actual data fetched from main memory – Data block – Flags  dirty bit, shared bit, • Broadcast networks  all nodes share a communication media and hear all messages transmitted, e.g., bus. 17
  • 18. Cache Coherence and Consistency • Coherence – Reads by any processor must return the most recently written value – Writes to the same location by any two processors are seen in the same order by all processors – Coherence defines behaviour of reads and writes to the same location, • Consistency – A read returns the last value written – If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A – Consistency defines behaviour of reads and writes to other locations 18
  • 19. Thread-level parallelism (TLP) • Distribute the workload among a set of concurrently running threads. • Uses MIMD model  multiple program counters • Targeted for tightly-coupled shared-memory multiprocessors • To be effective need n threads for n processors. • Amount of computation assigned to each thread = grain size – Threads can be used for data-level parallelism, but the overheads may outweigh the benefit • Speedup – Maximum speedup with n processors is n; embarrassingly parallel – The actual speedup depends on the ratio of parallel versus sequential portion of a program according to Amdahl’s law. 19
  • 20. TLP and ILP • The costs for exploiting ILP are prohibitive in terms of silicon area and of power consumption. • Multicore processor have altered the game – Shifted the burden for keeping the processor busy from the hardware and architects to application developers and programmers. – Shift from ILP to TLP • Large-scale multiprocessors are not a large market, they have been replaced by clusters of multicore systems. 20
  • 21. Multi-core processors • Cores are now the building blocks of chips. • Intel offers a family of processors based on the Nehalem architecture with a different number of cores and L3 caches 21
  • 22. MIMD Multiprocessors • Centralized Shared Memory • Distributed Memory 22
  • 23. Centralized-Memory Machines • Also “Symmetric Multiprocessors” (SMP) “Uniform Memory Access” (UMA) – All memory locations have similar latencies – Data sharing through memory reads/writes – P1 can write data to a physical address A, P2 can then read physical address A to get that data • Caching data – reduces the access time but demands cache coherence • Two distinct data states – Global state  defined by the data in main memory – Local state  defined by the data in local caches • In multi-core L3 cache is shared; L1 and L2 caches are private • Problem: Memory Contention – All processor share the one memory – Memory bandwidth becomes bottleneck – Used only for smaller machines • Most often 2,4, or 8 processors 23 1 or more levels of cache 1 or more levels of cache 1 or more levels of cache 1 or more levels of cache Processor Processor Processor Processor Shared cache Main Memory I/O system Private caches
  • 24. Shared Memory Pros and Cons • Pros – Communication happens automatically – More natural way of programming • Easier to write correct programs and gradually optimize them – No need to manually distribute data (but can help if you do) • Cons – Needs more hardware support – Easy to write correct, but inefficient programs (remote accesses look the same as local ones) 24
  • 25. MIMD: Distributed-Memory Machines • Two kinds – Distributed Shared-Memory (DSM) • All processors can address all memory locations • Data sharing like in SMP • Also called NUMA (non-uniform memory access) • Latencies of different memory locations can differ (local access faster than remote access) – Message-Passing • A processor can directly address only local memory • To communicate with other processors, must explicitly send/receive messages • Also called multicomputers or clusters • Most accesses local, so less memory contention (can scale to well over 1000 processors) 25 Multicore Processor + Cache Interconnection Network I/O Memory Memory Memory Memory I/O Memory Multicore Processor + Caches Multicore Processor + Cache Multicore Processor + Cache Multicore Processor + Caches Memory Multicore Processor + Caches Memory Multicore Processor + Caches Memory Multicore Processor + Caches I/O I/O I/O I/O I/O I/O
  • 26. Distributed Shared-Memory Multiprocessor… • Two major benefits: – It is a cost-effective way to scale the memory bandwidth if most of the accesses are to local memory in the node. – It reduces the latency for accesses to the local memory. • Two key disadvantages: – Communicating data between processors becomes more complex. – It requires more effort in the software to take advantage of the increased memory bandwidth afforded by distributed memories 26
  • 27. Message-Passing Machines • A cluster of computers – Each with its own processor and memory – An interconnect to pass messages between them – Producer-Consumer Scenario: • P1 produces data D, uses a SEND to send it to P2 • The network routes the message to P2 • P2 then calls a RECEIVE to get the message – Two types of send primitives • Synchronous: P1 stops until P2 confirms receipt of message • Asynchronous: P1 sends its message and continues – Standard libraries for message passing: Most common is MPI – Message Passing Interface 27
  • 28. Communication Performance • Metrics for Communication Performance – Communication Bandwidth – Communication Latency • Sender overhead + transfer time + receiver overhead – Communication latency hiding • Characterizing Applications – Communication to Computation Ratio • Work done vs. bytes sent over network • Example: 146 bytes per 1000 instructions 28
  • 29. Message Passing Pros and Cons • Pros – Simpler and cheaper hardware – Explicit communication makes programmers aware of costly (communication) operations • Cons – Explicit communication is painful to program – Requires manual optimization • If you want a variable to be local and accessible via LD/ST, you must declare it as such • If other processes need to read or write this variable, you must explicitly code the needed sends and receives to do this 29
  • 30. Parallel Processing Performance • Challenges of Parallel Processing: – First challenge is % of program inherently sequential – Suppose 80x speedup from 100 processors. What fraction of original program can be sequential? • (a) 10% (b) 5% (c) 1% (d) <1% – Assume that the program operates in only two modes: • Parallel with all processors fully used (enhanced mode) • Serial with only one processor in use 30 Amdahl’s Law Provides solution
  • 31. Amdahl’s Law Provides solution 31 Need sequential part to be 0.0125% of original time. Sequential part can limit speedup
  • 32. Second Challenge: Long Latency to Remote Memory • Suppose 32 CPU MP, 2GHz, 200 ns to handle reference to a remote memory, all local accesses hit memory hierarchy and base CPI is 0.5. (Remote request cost = 200/0.5 = 400 clock cycles.) • What is performance impact if 0.2% instructions involve remote access? – (a) 1.5X (b) 2.0X (c) 2.5X 32
  • 33. CPI Equation • CPI = Base CPI + Remote request rate x Remote request cost • Cycle time = • CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3 • No communication is 1.3/0.5 or 2.6 faster than 0.2% instructions involved in remote access • In practice, the performance analysis is much more complex, since – Some fraction of the non-communication references will miss in the local hierarchy – Remote access time does not have a single constant value. 33 1 Cycle Cycle = 1 2 GHz = 0.5 ns Remote access cost Cycle Time 200 ns 0.5 ns = = 400 • Remote request cost =
  • 34. Challenge: Scaling Example • Suppose you want to perform two sums: – one is a sum of two scalar variables and – one is a matrix sum of a pair of two-dimensional arrays, size 1000 by 1000. What speedup do you get with 1000 processors? • Solution: – If we assume performance is a function of the time for an addition, t , then there is 1 addition that does not benefit from parallel processors and 1,000,000 additions that do. – If the time before (for single processor) is: 1,000,000t + 1t = 1000,001t • Execution time after improvement 34 Execution time affected by improvement Amount of improvement + Execution time unaffected Execution time after = improvement
  • 35. Challenge: Scalar and Matrix Addition 35 Execution time after improvement = 1,000,000t 1,000 + 1t = 1001t Speedup is then Speedup = 1,000,001t 1,001t = 999 Even if the sequential portion expanded to 100 sums of scalar variables versus one sum of a pair of 1000 by 1000 arrays, the speedup would still be 909.
  • 36. Scaling Example... • What if matrix size is 100 x 100? – Single processor: Time= (10 + 10000) x tadd – 10 processors • Time = 10 x tadd + 10000/10 x tadd = 1010 x tadd • Speedup = 10010/1010 = 9.9 (99% of potential) – 100 processors • Time = 10 x tadd + 10000/100 x tadd = 110 x tadd • Speedup = 10010/110 = 9.1 (91% of potential) Assuming load balanced 36
  • 37. Symmetric Shared-Memory Architectures Cache Coherence Problem • Shared memory easy with no caches – P1 writes, P2 can read – Only one copy of data exists (in memory) • Caches store their own copies of the data – Those copies can easily get inconsistent – Classic example: adding to a sum • P1 loads allSum, adds its mySum, stores new allSum • P1’s cache now has dirty data, but memory not updated • P2 loads allSum from memory, adds its mySum, stores allSum • P2’s cache also has dirty data • Eventually P1 and P2’s cached data will go to memory • Regardless of write-back order, the final value ends up wrong 37
  • 38. Cache Coherence Problem… 38 P1 P2 Memory Allsum: 0 Allsum: 5 Allsum:12 1 Allsum: Allsum + mysum2 (12) Allsum: Allsum + mysum1 (5) 2 All Processes accessing main memory may see very stale value Alllsum:
  • 39. Cache Coherence Definition • A memory system is coherent if 1. Preserve Program Order: A read R from address X on processor P1 returns the value written by the most recent write W to X on P1 if no other processor has written to X between W and R. 1. This property simply preserves program order—we expect this property to be true even in uniprocessors. 39 Figure The cache coherence problem for a single memory location (X), read and written by two processors (A and B). Time Event Cache contents for processor A Cache contents for processor B Memory contents for location X 0 1 1 Processor A reads X 1 1 2 Processor B reads X 1 1 1 3 Processor A stores 0 into X 0 1 0
  • 40. Cache Coherence Definition… 2. Coherent view of Memory: If P1 writes to X and P2 reads X after a sufficient time, and there are no other writes to X in between, P2’s read returns the value written by P1’s write. • The second property defines the notion of what it means to have a coherent view of memory: – If a processor could continuously read an old data value, we would clearly say that memory was incoherent. 3. Write Serialization: Writes to the same location are serialized. Two writes to location X are seen in the same order by all processors. • For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1. 40 Coherence defines behaviour of reads and writes to the same location
  • 41. Write Consistency For now assume 1. A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write 2. The processor does not change the order of any write with respect to any other memory access if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A • These restrictions allow the processor to reorder reads, but forces the processor to finish writes in program order 41 Consistency defines behaviour of reads and writes to other locations
  • 42. Basic Schemes for Enforcing Coherence • Migration – data can be moved to a local cache and used there in a transparent fashion – Reduces both latency to access shared data that is allocated remotely and bandwidth demand on the shared memory • Replication – for reading shared data simultaneously, since caches make a copy of data in local cache – Reduces both latency of access and contention for read shared data 42
  • 43. Maintaining Cache Coherence • Hardware schemes – Shared Caches • Trivially enforces coherence • Not scalable (L1 cache quickly becomes a bottleneck) – Snooping • Needs a broadcast network (like a bus) to enforce coherence • Each cache that has a block tracks its sharing state on its own – Directory • Can enforce coherence even with a point-to-point network • A block has just one place where its full sharing state is kept – All information about the blocks is kept in the directory • SMP: one centralised directory is provided in the outermost cache for multi- core systems • DSM: Directory is distributed. Each node maintains its own directory which tracks the sharing information of every cache line in the node 43
  • 44. Maintaining Cache Coherence: Two Classes of Protocols in Use Cache coherence Protocols • Directory based – The sharing status of a block of physical memory is kept in just one location, called the directory; – Directory-based coherence has slightly higher implementation overhead than snooping, but it can scale to larger processor counts. • The Sun T1 design uses directories, albeit with a central physical memory. • Snooping – Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, but no centralized state is kept. – The caches are all accessible via some broadcast medium (a bus or switch), and all cache controllers monitor or snoop on the medium • To determine whether or not they have a copy of a block that is requested on a bus or switch access. 44
  • 45. Snoopy Cache-Coherence Protocols • Cache Controller “snoops”all transactions on the shared medium (bus or switch) – relevant transaction if for a block it contains – take action to ensure coherence • » invalidate, update, or supply value – depends on state of the block and the protocol • Either get exclusive access before write via write invalidate or update all copies on write 45 Pn P1 Memory cache Bus snoop I/O devices cache Cache-memory transaction Data State Address
  • 46. Communication between private and shared caches • Multi-core processor  a bus connects private L1 and L2 instruction (I) and data (D) caches to the shared L3 cache. • To invalidate a cached item the processor changing the value must – first acquire the bus and – then place the address of the item to be invalidated on the bus. • DSM  Locating the value of an item is harder for – write-back caches • because the current value of the item can be in the local caches of another processor. 46
  • 47. Snooping Protocol • Typically used for bus-based (SMP) multiprocessors – Serialization on the bus used to maintain coherence property 3 • Two flavors – Write-update (write broadcast) • A write to shared data is broadcast to update all copies • All subsequent reads will return the new written value (property 2) • All see the writes in the order of broadcasts One bus == one order seen by all (property 3) – Write-invalidate • Write to shared data forces invalidation of all other cached copies • Subsequent reads miss and fetch new value (property 2) • Writes ordered by invalidations on the bus (property 3) 47
  • 48. Write Invalidate: Example • Write invalidate  on write, invalidate all other copies. – Used in modern microprocessors – Example: a write-back cache during read misses of item X, processors A and B. Once A writes X it invalidates the B’s cache copy of X 48 Processor activity Bus activity Contents of processor A’s cache Contents of processor B’s cache Contents of memory location X 0 Processor A reads X Cache miss for X 0 0 Processor B reads X Cache miss for X 0 0 0 Processor A writes a 1 to X Invalidation for X 1 0 Processor B reads X Cache miss for X 1 1 1 For a write, we require that the writing processor have exclusive access, preventing any other processor from being able to write simultaneously. An example of an invalidation protocol working on a snooping bus for a single cache block (X) with write-back caches.
  • 49. Write Invalidate: Example... • An example of an invalidation protocol working on a snooping bus for a single cache block (X) with write-back caches • We assume that neither cache initially holds X and that the value of X in memory is 0. • The CPU and memory contents show the value after the processor and bus activity have both completed. • A blank indicates no activity or no copy cached. • When the second miss by B occurs, CPU A responds with the value cancelling the response from memory. – In addition, both the contents of B’s cache and the memory contents of X are updated. • This update of memory, which occurs when a block becomes shared, is typical in most protocols and simplifies the protocol. 49
  • 50. Example: Write-through Invalidate 50 • Must invalidate before step 3 • Write update uses more broadcast medium bandwidth all recent MPUs use write invalidate Exclusive access ensures that no other readable or writable copies of an data exist when the write occurs
  • 51. Write Update: Example • An example of a write update or broadcast protocol working on a snooping bus for a single cache block (X) with write-back caches. • We assume that neither cache initially holds X and that the value of X in memory is 0. 51 Processor activity Bus activity Contents of processor A’s cache Contents of processor B’s cache Contents of memory location X 0 Processor A reads X Cache miss for X 0 0 Processor B reads X Cache miss for X 0 0 0 Processor A writes a 1 to X Write Broadcast of X 1 1 1 Processor B reads X No bus activity 1 1 1
  • 52. Write Update: Example... • The CPU and memory contents show the value after the processor and bus activity have both completed. • A blank indicates no activity or no copy cached. • When CPU A broadcasts the write, both the cache in CPU B and the memory location of X are updated. • In the second read, processor B finds the updated value of X and therefore there is no bus activity. 52
  • 53. Update vs. Invalidate • A burst of writes by a processor to one address – Update: each sends an update – Invalidate: possibly only the first invalidation is sent • Writes to different words of a block – Update: update sent for each word – Invalidate: possibly only the first invalidation is sent • Producer-consumer communication latency – Update: producer sends an update, • consumer reads new value from its cache – Invalidate: producer invalidates consumer’s copy, • consumer’s read misses and has to request the block • Which is better depends on application – But write-invalidate is simpler and implemented in most MP-capable processors today. 53
  • 54. Implementation of cache Invalidate • The key to implementing an invalidate protocol in a multicore is – the use of the bus, or another broadcast medium, to perform invalidates. – All processors snoop on the bus. • To invalidate the processor changing an item – acquires the bus and – broadcasts the address to be invalidated on the bus. • If two processors attempt to change at the same time the bus arbitrator allows access to only one of them. – All coherence schemes require some method of serializing accesses to the same cache block, either by serializing access to the communication medium or another shared structure. 54
  • 55. Implementation of cache Invalidate… • How to find the most recent value of a data item – Write-through cache  the value is in memory but write buffers could complicate the scenario. – Write-back cache  harder problem, the item could be in the private cache of another processor. • A block of cache has extra state bits – Valid bit – indicates if the block is valid or not – Dirty bit - indicates if the block has been modified – Shared bit – cache block is shared with other processors • If a processor finds that it has a dirty copy of the requested cache block, it provides that cache block in response to the read request and causes the memory (or L3) access to be aborted. 55
  • 56. Implementation of cache Invalidate… • When a write to a block in the shared state occurs, – the cache generates an invalidation on the bus and marks the block as exclusive. – No further invalidations will be sent by that core for that block. – The core with the sole copy of a cache block is normally called the owner of the cache block. • When an invalidation is sent, – the state of the owner’s cache block is changed from shared to unshared (or exclusive). – If another processor later requests this cache block, the state must be made shared again 56
  • 57. Locate up-to-date copy of data • For a write-through cache – Get up-to-date copy from memory (Since all written data are always sent to the memory, from which the most recent values of a data item can always be fetched.) – Write through simpler if enough memory bandwidth is available – Use of write through simplifies the implementation of cache coherence. • For a write-back cache – Most recent copy can be in a cache rather than in memory – The problem of finding the most recent data value is harder 57
  • 58. Locate up-to-date copy of data… • Write-back caches can use the same snooping scheme both for cache misses and for writes: – Each processor snoops every address placed on the bus. – If a processor has dirty copy of the requested cache block, it provides it in response to the read request and aborts the memory access. – Complexity comes from having to retrieve the cache block from a processor’s cache, which can take longer than retrieving it from the shared memory if the processors are in separate chips. • Write-back needs lower memory bandwidth – ⇒ Support larger numbers of faster processors – ⇒ Most multiprocessors use write-back 58
  • 59. Cache Resources for Write-Back Snooping • Normal cache tags can be used for snooping • Valid bit for each block makes invalidation easy • Read misses easy since rely on snooping • Writes Need to know if whether any other copies of the block are cached – No other copies No need to place write on bus in a write-back cache (reduce both the time taken by the write and the required bandwidth) – Other copies Need to place invalidate on bus 59 Index Block Address Tag Block Offset
  • 60. Cache Resources for Write-Back Snooping… • To track whether a cache block is shared, add extra state bit associated with each cache block, like valid bit and dirty bit – Write to shared block ⇒ Need to generates an invalidation on the bus and marks the state of the block as exclusive. – Otherwise, no further invalidations will be sent by that processor for that block – The processor with the sole copy of a cache block is normally called the owner of the cache block – When invalidation is sent, the state of the owner’s cache block is changed from shared to exclusive. – If another processor later requests this cache block, the state must be made shared again. 60
  • 61. Cache Behaviour in Response to Bus • Every bus transaction must check the cache address tags – could potentially interfere with processor cache accesses • A way to reduce interference is to duplicate tags – One set for caches access, one set for bus accesses • The interference can also be reduced in a multilevel cache by directing the snoop request to the L2 cache – Since L2 less heavily used than L1 (the processor uses only when it has a miss in the L1 cache) ⇒ Every entry in the L1 cache must be present in the L2 cache, called the inclusion property – If Snoop gets a hit in L2 cache, then it must arbitrate for the L1 cache to update the state and possibly retrieve the data, which usually requires a stall of the processor 61
  • 62. Example: Write Back MSI Snooping Protocol • Snooping coherence protocol is usually implemented by incorporating a finite‐state controller in each node • There is only one finite-state machine per cache, with stimuli coming either from the attached processor or from the bus • Logically, think of a separate controller associated with each cache block – That is, snooping operations or cache requests for different blocks can proceed independently • In implementations, a single controller allows multiple operations to distinct blocks to proceed in interleaved fashion – that is, one operation may be initiated before another is completed, even through only one cache access or one bus access is allowed at time 62
  • 63. Example: Write Back MSI Snooping Protocol… • Processor only observes state of memory system by issuing memory operations • Assume bus transactions and memory operations are atomic and a one‐level cache – all phases of one bus transaction complete before next one starts – processor waits for memory operation to complete before issuing next – with one‐level cache, assume invalidations applied during bus transaction • All writes go to bus + atomicity – Writes serialized by order in which they appear on bus (bus order) => invalidations applied to caches in bus order • How to insert reads in this order? – Important since processors see writes through reads, so determines whether write serialization is satisfied – But read hits may happen independently and do not appear on bus or enter directly in bus order 63
  • 64. Example: Write Back MSI Snooping Protocol • Invalidation protocol, write‐back cache – Snoops every address on bus – If it has a dirty copy of requested block, provides that block in response to the read request and aborts the memory access • State of block B in cache C can be – Invalid: B is not cached in C • To read or write, must make a request on the bus – Modified: B is dirty in C • C has the block, no other cache has the block, and C must update memory when it displaces B • Can read or write B without going to the bus – Shared: B is clean in C • C has the block, other caches have the block, and C need not update memory when it displaces B • Can read B without going to bus • To write, must send an upgrade request to the bus • Read misses: cause all caches to snoop bus • Writes to clean blocks are treated as misses 64 note that the modified state implies that the block is exclusive
  • 65. Write‐Back State Machine ‐ Processor Request 65 Transition Arcs: The stimulus causing a state change is shown on the transition arcs in Blue Bus actions generated as part of the state transition are shown on the transition arc in Bold.
  • 66. Write‐Back State Machine ‐ Processor Request… 66 Finite-state transition diagram for a single private cache block using a write invalidation protocol and a write-back cache Invalid Exclusive (read/write) Shared (read only) CPU read hit CPU read miss Place read miss on bus CPU read CPU write CPU write hit CPU read hit CPU write miss Place read miss on bus Place write miss on bus Write-back cache block Place write miss on bus • Any transition to the Exclusive state (which is required for a processor to write to the block) requires an invalidate or write miss to be placed on the bus,  causing all local caches to make the block invalid.  In addition, if some other local cache had the block in Exclusive state, that local cache generates a write-back, which supplies the block containing the desired address. Cache block States:  Invalid  Shared and  Exclusive (Modified) Cache state transitions based on requests from CPU
  • 67. Write‐Back State Machine ‐ Bus Request 67 Finite-state transition diagram for a single private cache block using a write invalidation protocol and a write-back cache • If a read miss occurs on the bus to a block in the exclusive state,  the local cache with the exclusive copy changes its state to shared. Invalid Exclusive (read/write) Shared (read only) CPU read miss write miss for this block Write miss For this block Invalidate for this block Write-back; block abort memory access Request Source State of addressed cache block Type of Cache action Function and explanation Read miss Bus Shared No action Allow shared cache or memory to service read miss Read miss Bus Modified Coherence Attempt to share data: place cache block on bus and change state to shared. invalidate Bus Shared Coherence Attempt to write shared block; invalidate the block Write miss Bus Shared Coherence Attempt to write shared block; invalidate the cache block Write miss Bus Modified Coherence Attempt to write block that is exclusive elsewhere; write-back the block and make its state invalid in the local cache Cache state transitions based on requests from Bus
  • 68. Combined Cache Coherence State Diagram for both Processor and Bus Requests 68 Invalid Exclusive (read/write) Shared (read only) CPU read hit CPU read miss Place read miss on bus CPU write CPU write hit CPU read hit CPU write miss Write miss for block CPU read Place read miss on bus Place write miss on bus Write-back cache block Place write miss on bus Invalidate for this block Write miss for this block Write-back block Transition Arcs:  Local Processor induced transition in Black  Bus activities induced transition in Blu  Activities on transition in Red
  • 69. Example • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 69 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl P1 Read Ai P2 Read A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Bus Memory
  • 70. Example: Step 1 • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 70 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Processor Addr Value Addr Value P1: Write 10 to A1 Excl A1 10 WrMs P1 A1 P1 Read Ai P2 Read A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Bus Memory • Active arrow =
  • 71. Example: Step 2 • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 71 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl A1 10 WrMs P1 A1 P1 Read A1 Excl A1 10 P2 Read A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Bus Memory • Active arrow =
  • 72. Example: Step 3 • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 72 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl A1 10 WrMs P1 A1 P1 Read A1 Excl A1 10 P2 Read A1 Shar. A1 RdMs P2 A1 A1 10 Shar A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Bus Memory • Active arrow =
  • 73. Example: Step 4 • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 73 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl A1 10 WrMs P1 A1 P1 Read A1 Excl A1 10 P2 Read A1 Shar. A1 RdMs P2 A1 A1 10 Shar A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 Inv. Excl. A1 20 Inv. P2 A1 A1 10 P2: Write 40 to A2 Processor 1 Processor 2 Bus Memory • Active arrow =
  • 74. Example: Step 5 • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 74 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl A1 10 WrMs P1 A1 P1 Read A1 Excl A1 10 P2 Read A1 Shar. A1 RdMs P2 A1 A1 10 Shar A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 Inv. Excl. A1 20 Inv. P2 A1 A1 10 P2: Write 40 to A2 WrBk P2 A1 20 A1 20 Excl. A2 40 WrMs P2 A2 A1 20 Processor 1 Processor 2 Bus Memory • Active arrow =
  • 75. Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols • As the number of processors in a multiprocessor grows, or as the memory demands of each processor grow, any centralized resource in the system can become a bottleneck. • As processors have increased in speed in the last few years, the number of processors that can be supported on a single bus or by using a single physical memory unit has fallen • Single memory accommodate all CPUs – Multiple memory banks • Bus-based multiprocessor, bus must support both coherence traffic & normal memory traffic – Multiple buses or interconnection networks (cross bar or small point-to-point) 75 InIn such designs, the memory system can be configured into multiple physical banks, so as to boost the effective memory bandwidth while retaining uniform access time to memory 1 or more levels of cache 1 or more levels of cache 1 or more levels of cache 1 or more levels of cache Processor Processor Processor Processor Interconnection Network Memory I/O system Memory Memory Memory
  • 76. Cache Performance • Cache performance is combination of 1. Behaviour of uniprocessor cache miss traffic 2. Traffic caused by communication • Results in invalidations and subsequent cache misses – Changing the processor count, cache size, and block size can affect these two components of the miss rate in different ways. • Uniprocessor miss rate: – Can be broken down into: • Compulsory, • Capacity and • Conflict misses 76
  • 77. Cache Performance… • Compulsory miss: – The very first access to a block cannot be in the cache. • Capacity miss: – The cache cannot contain all the blocks needed during execution of a program, capacity miss will occur because of blocks being discarded and later retrieved. • Conflict miss: – If the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. 77
  • 78. Coherency Misses The misses that arise from inter-processor communication, which are often called coherence misses, can be broken into two separate sources. 1. True sharing misses arise from the communication of data through the cache coherence mechanism – In an invalidation-based protocol, the 1st write to a shared block causes an invalidation to establish ownership of the block. – When another processor attempts to read a modified word in the cache block, a miss occurs and the block is transferred. 2. False sharing misses when a block is invalidated because some word in the block, other than the one being read, is written into – Invalidation does not cause a new value to be communicated, but only causes an extra cache miss – Block is shared, but no word in block is actually shared  miss would not occur if block size were 1 word 78
  • 79. Example: True v. False Sharing v. Hit? • Assume x1 and x2 in same cache block and in shared state • P1 and P2 both read x1 and x2 before. 79
  Time | P1       | P2       | True, False, Hit? Why?
  1    | Write x1 |          | True miss: invalidate x1 in P2
  2    |          | Read x2  | False miss: x1 irrelevant to P2
  3    | Write x1 |          | False miss: x1 irrelevant to P2
  4    |          | Write x2 | False miss: x1 irrelevant to P2
  5    | Read x2  |          | True miss: invalidate x2 in P1
  • 80. Classifications by Time Step 1. This event is a true sharing miss, since x1 was read by P2 and needs to be invalidated from P2. 2. This event is a false sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2. 3. This event is a false sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1. The cache block containing x1 will be in the shared state after the read by P2; a write miss is required to obtain exclusive access to the block. In some protocols this will be handled as an upgrade request, which generates a bus invalidate, but does not transfer the cache block. 4. This event is a false sharing miss for the same reason as step 3. 5. This event is a true sharing miss, since the value being read was written by P2. 80
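  Steps 2 to 4 are misses only because x1 and x2 happen to share a block. A common software-side remedy is to pad shared data so that independently written words fall on different cache lines. The sketch below is purely illustrative (the 64-byte line size, the struct layout, and the field names are my assumptions, not from the slides); it only computes which cache line each field would occupy.

```python
import ctypes

LINE = 64                                  # assumed cache-line size in bytes

class Packed(ctypes.Structure):
    # x1 and x2 end up in the same 64-byte line -> false sharing is possible
    _fields_ = [("x1", ctypes.c_int64), ("x2", ctypes.c_int64)]

class Padded(ctypes.Structure):
    # padding pushes x2 into the next line -> writes to x1 no longer invalidate x2
    _fields_ = [("x1", ctypes.c_int64),
                ("_pad", ctypes.c_char * (LINE - ctypes.sizeof(ctypes.c_int64))),
                ("x2", ctypes.c_int64)]

def line_of(struct_type, field_name):
    return getattr(struct_type, field_name).offset // LINE

for t in (Packed, Padded):
    print(t.__name__, "x1 in line", line_of(t, "x1"), "- x2 in line", line_of(t, "x2"))
```

  With the padded layout, a write to x1 by P1 no longer invalidates the line holding x2 in P2, so the false sharing misses in steps 2 to 4 disappear, at the cost of some extra memory.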
  • 81. Cache to Cache transfers • Problem – P1 has block B in M state – P2 wants to read B, puts a RdReq on bus – If P1 does nothing, memory will supply the data to P2 – What does P1 do? • Solution 1: abort/retry – P1 cancels P2’s request, issues a write back – P2 later retries RdReq and gets data from memory – Too slow (two memory latencies to move data from P1 to P2) • Solution 2: intervention – P1 indicates it will supply the data (“intervention” bus signal) – Memory sees that, does not supply the data, and waits for P1’s data – P1 starts sending the data on the bus, memory is updated – P2 snoops the transfer during the write-back and gets the block 81
  • 82. Cache to Cache transfers… • Intervention works if some cache has the data in M state – Nobody else has the correct data, so it is clear who supplies it • What if a cache has the requested data in S state? – There might be others who also have it; who should supply the data? – Solution 1: let memory supply the data – Solution 2: whoever wins arbitration supplies the data – Solution 3: a separate state, similar to S, indicating that others may also hold the block in S state, but that this cache should supply the data if anybody asks for it 82
  • 83. Extensions to the Basic Coherence Protocol • We have just considered a coherence protocol with 3 states: Modified, Shared, Invalid (MSI) • There are many extensions of MSI – With additional states and transactions, which optimise certain behaviours, possibly resulting in improved performance. • Two of the most common extensions are: MESI and MOESI 83
  • 84. MESI (Modified, Exclusive, shared & Invalid) • MESI adds the state Exclusive (E) to the basic MSI protocol. • Exclusive indicates when a cache block is resident only in a single cache but is clean • If a block is in the E state, it can be written without generating any invalidates, which optimizes the case where a block is read by a single cache before being written by that same cache. • Of course, when a read miss to a block in the E state occurs, the block must be changed to the S state to maintain coherence. – Because all subsequent accesses are snooped, it is possible to maintain the accuracy of this state. – In particular, if another processor issues a read miss, the state is changed from exclusive to shared • Pros of adding E state: – subsequent write to a block in the exclusive state by the same core need not acquire bus access or generate an invalidate, since the block is known to be exclusively in this local cache; the processor merely changes the state to modified. • The Intel i7 uses a variant of a MESI protocol, called MESIF, which adds a state (Forward) to designate which sharing processor should respond to a request. – It is designed to enhance performance in distributed memory organizations. 84
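  As a rough illustration of how the E state removes the extra invalidate, the table-driven sketch below encodes typical MESI transitions for a single cache line. It is my own summary, not the lecture's code, and the exact events and actions differ between real implementations.

```python
# Typical MESI transitions for one cache line, keyed by (current state, event).
MESI = {
    ("I", "read_miss_no_sharers"): ("E", "fetch block from memory"),
    ("I", "read_miss_sharers"):    ("S", "fetch block; other copies stay S"),
    ("I", "write_miss"):           ("M", "fetch block and invalidate other copies"),
    ("E", "processor_read"):       ("E", "hit, no bus traffic"),
    ("E", "processor_write"):      ("M", "hit, no invalidate needed"),   # the benefit of E
    ("E", "bus_read_miss"):        ("S", "another cache now shares the block"),
    ("E", "bus_write_miss"):       ("I", "drop copy, no write-back needed"),
    ("S", "processor_read"):       ("S", "hit"),
    ("S", "processor_write"):      ("M", "send bus invalidate (upgrade)"),
    ("S", "bus_invalidate"):       ("I", "drop copy"),
    ("M", "processor_read"):       ("M", "hit"),
    ("M", "processor_write"):      ("M", "hit"),
    ("M", "bus_read_miss"):        ("S", "write back / supply the block"),
    ("M", "bus_write_miss"):       ("I", "write back, then drop copy"),
}

# A block read by one core and then written by that same core never generates
# an invalidate, because the read left it in E rather than S:
state = "I"
for event in ("read_miss_no_sharers", "processor_write"):
    state, action = MESI[(state, event)]
    print(f"{event:>22} -> {state}  ({action})")
```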
  • 85. MOESI (Modified, Owned, Exclusive, Shared & Invalid) • MOESI adds the state Owned to the MESI protocol to indicate that the associated block is owned by that cache and out-of-date in memory. • In MSI and MESI protocols, when there is an attempt to share a block in the Modified state, the state is changed to Shared (in both the original and newly sharing cache), and the block must be written back to memory. • In a MOESI protocol, the block can be changed from the Modified to Owned state in the original cache without writing it to memory. • Other caches, which are newly sharing the block, keep the block in the Shared state; the O state, which only the original cache holds, indicates that the main memory copy is out of date and that the designated cache is the owner. • The owner of the block must supply it on a miss, since memory is not up to date and must write the block back to memory if it is replaced. • The AMD Opteron uses the MOESI protocol. 85
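  A minimal sketch (assumed behaviour, summarising the text above) of the difference the O state makes when another node's read miss finds the block Modified locally:

```python
def on_bus_read_to_modified(protocol):
    """What the owning cache does when it snoops a read miss to its M block."""
    if protocol == "MESI":
        return {"owner's new state": "S", "write back to memory now": True,
                "later misses supplied by": "memory"}
    if protocol == "MOESI":
        return {"owner's new state": "O", "write back to memory now": False,
                "later misses supplied by": "the owning cache"}

for p in ("MESI", "MOESI"):
    print(p, on_bus_read_to_modified(p))
```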
  • 86. Directory-Based Coherence Protocol • Typically in distributed shared memory • For every local memory block, local directory has an entry • Directory entry indicates –Who has cached copies of the block –In what state do they have the block 86
  • 87. Distributed-Memory Multiprocessor with the directories added to each Node 87
  • 88. Directory-Based Cache Coherence Protocols: The Basics • Just as with a snooping protocol, there are two primary operations that a directory protocol must implement: – handling a read miss and – handling a write to a shared, clean cache block. • Handling a write miss to a block that is currently shared is a simple combination of these two. • To implement these operations, a directory must track the state of each cache block. • In a simple protocol, these states could be the following: – Shared—One or more nodes have the block cached, and the value in memory is up to date (as well as in all the caches). – Uncached—No node has a copy of the cache block. – Modified—Exactly one node has a copy of the cache block, and it has written the block, so the memory copy is out of date. The processor is called the owner of the block. 88
  • 89. Basic Directory Scheme • Each entry has – One dirty bit (1 if there is a dirty cached copy) – A presence vector (1 bit for each node) that tells which nodes may have cached copies • All misses are sent to the block’s home node • The directory performs the needed coherence actions • Eventually, the directory responds with data 89
  • 90. Read Miss • Processor Pk has a read miss on block B, sends request to home node of the block • Directory controller – Finds entry for B, checks D bit – If D=0 • Read memory and send data back, set P[k] – If D=1 • Request block from processor whose P bit is 1 • When block arrives, update memory, clear D bit, send block to Pk and set P[k] 90
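  A compact sketch of this read-miss handling is shown below. It is illustrative Python of my own: the entry layout follows the dirty-bit-plus-presence-vector scheme from the previous slide, and the messaging helpers are stand-ins rather than a real protocol implementation.

```python
class DirEntry:
    def __init__(self, n_nodes):
        self.dirty = False                   # D bit: some cache holds a modified copy
        self.presence = [False] * n_nodes    # P[i]: node i may have a cached copy

N_NODES = 4
directory = {}                               # block address -> DirEntry (at the home node)
memory = {"B": 111}                          # home node's copy of each local block

def fetch_from_owner(owner, block):
    # Stand-in for the message that retrieves the modified copy from the owner.
    print(f"  fetch modified copy of {block} from node {owner}")
    return 999                               # pretend this is the owner's value

def send_data(node, block, data):
    # Stand-in for the reply message carrying the block to the requestor.
    print(f"  send {block} = {data} to node {node}")

def read_miss(k, block):
    entry = directory.setdefault(block, DirEntry(N_NODES))
    if not entry.dirty:
        # D = 0: memory is up to date, reply directly with the home copy.
        send_data(k, block, memory[block])
    else:
        # D = 1: get the block from the single owner, update memory, clear D.
        owner = entry.presence.index(True)
        data = fetch_from_owner(owner, block)
        memory[block] = data
        entry.dirty = False
        send_data(k, block, data)
    entry.presence[k] = True                 # node k now holds a shared copy

read_miss(2, "B")        # clean block: served from memory
directory["B"].dirty = True
directory["B"].presence = [True, False, False, False]   # pretend node 0 modified it
read_miss(3, "B")        # dirty block: fetched from node 0 first
```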
  • 91. Directory Operation • Network controller connected to each bus – A proxy for remote caches and memories • Requests for remote addresses forwarded to home, responses from home placed on the bus • Requests from home placed on the bus, cache responses sent back to home node • Each cache still has its own coherence state – Directory is there just to avoid broadcasts and order accesses to each location • Simplest scheme: If access A1 to block B still not fully processed by directory when A2 arrives, A2 waits in a queue until A1 is done 91
  • 93. Multiprocessor Interconnection Networks • Multiprocessor interconnection networks (INs) can be classified based on a number of criteria. These include – (1) mode of operation (synchronous versus asynchronous), – (2) control strategy (centralized versus decentralized), – (3) switching technique (circuit versus packet), and – (4) topology (static versus dynamic). • Mode of Operation – According to the mode of operation, INs are classified as synchronous versus asynchronous. • In the synchronous mode of operation, a single global clock is used by all components in the system, so that the whole system operates in a lockstep manner. • The asynchronous mode of operation, on the other hand, does not require a global clock; handshaking signals are used instead to coordinate the operation of asynchronous systems. – While synchronous systems tend to be slower than asynchronous systems, they are race- and hazard-free. 93
  • 94. Multiprocessor Interconnection Networks… • Control Strategy – According to the control strategy, INs can be classified as centralized versus decentralized. • In centralized control systems, a single central control unit is used to oversee and control the operation of the components of the system. • In decentralized control, the control function is distributed among different components in the system. – The function and reliability of the central control unit can become the bottleneck in a centralized control system. While the crossbar is a centralized system, the multistage interconnection networks are decentralized. • Switching Techniques – Interconnection networks can be classified according to the switching mechanism as circuit versus packet switching networks. • In the circuit switching mechanism, a complete path has to be established prior to the start of communication between a source and a destination. The established path remains in existence during the whole communication period. • In a packet switching mechanism, communication between a source and a destination takes place via messages that are divided into smaller entities, called packets. On their way to the destination, packets are sent from one node to another in a store-and-forward manner until they reach their destination. – While packet switching tends to use the network resources more efficiently than circuit switching, it suffers from variable packet delays. 94
  • 95. Multiprocessor Interconnection Networks… • Topology – An interconnection network topology is a mapping function from the set of processors and memories onto the same set of processors and memories. In other words, the topology describes how to connect processors and memories to other processors and memories. – A fully connected topology, for example, is a mapping in which each processor is connected to all other processors in the computer. – A ring topology is a mapping that connects processor k to its neighbours, processors (k – 1) and (k + 1). – In general, interconnection networks can be classified as • static versus dynamic networks. – In static networks, direct fixed links are established among nodes to form a fixed network, while – in dynamic networks, connections are established as needed. – Switching elements are used to establish connections among inputs and outputs. – Depending on the switch settings, different interconnections can be established. – Nearly all multiprocessor systems can be distinguished by their interconnection network topology. 95
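  As a toy illustration of such a mapping function, the sketch below lists the neighbours produced by the ring and fully connected mappings described above (the size N = 8 is an arbitrary choice of mine, not from the slides).

```python
N = 8   # arbitrary example size

def ring_neighbours(k, n=N):
    # Ring mapping: node k connects to (k - 1) and (k + 1), modulo n so the ends wrap.
    return ((k - 1) % n, (k + 1) % n)

def fully_connected_neighbours(k, n=N):
    # Fully connected mapping: node k links to every other node.
    return tuple(j for j in range(n) if j != k)

print(ring_neighbours(0))              # (7, 1)
print(fully_connected_neighbours(0))   # (1, 2, 3, 4, 5, 6, 7)
```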
  • 96. Interconnection networks for Shared Memory and Message Passing Systems. • Shared memory – Shared memory systems can be designed using bus-based or switch-based INs. – The simplest IN for shared memory systems is the bus. However, the bus may get saturated if multiple processors are trying to access the shared memory (via the bus) simultaneously. – A typical bus-based design uses caches to reduce bus contention. – Other shared memory designs rely on switches for interconnection. • For example, a crossbar switch can be used to connect multiple processors to multiple memory modules. 96 Figure: Shared memory interconnection networks: (a) bus-based and (b) switch-based. Figure: Single bus and multiple bus systems.
  • 97. Interconnection networks for Shared Memory and Message Passing Systems… • Message passing INs – Message passing INs can be divided into static and dynamic. • Static networks form all connections when the system is designed rather than when the connection is needed. In a static network, messages must be routed along established links. • Dynamic INs establish a connection between two or more nodes on the fly as messages are routed along the links. The number of hops in a path from source to destination node is equal to the number of point-to-point links a message must traverse to reach its destination. – In either static or dynamic networks, a single message may have to hop through intermediate processors on its way to its destination. • Therefore, the ultimate performance of an interconnection network is greatly influenced by the number of hops taken to traverse the network. 97 Figure Examples of static topologies.
  • 98. Interconnection networks for Shared Memory and Message Passing Systems… 98 (a) (b) (c) Figure Example dynamic INs: (a) single-stage, (b) multistage, and (c) crossbar switch. • The single-stage interconnection network of Figure (a) is a simple dynamic network that connects each of the inputs on the left side to some, but not all, outputs on the right side through a single layer of binary switches represented by the rectangles.  The binary switches can direct the message on the left-side input to one of two possible outputs on the right side.
  • 99. Interconnection networks for Shared Memory and Message Passing Systems… • Figure (b). The Omega MIN (Multistage Interconnection Network) connects eight sources to eight destinations. – The connection from the source 010 to the destination 010 is shown as a bold path – These are dynamic INs because the connection is made on the fly, as needed. – In order to connect a source to a destination, we simply use a function of the bits of the source and destination addresses as instructions for dynamically selecting a path through the switches. – For example, to connect source 111 to destination 001 in the omega network, • the switches in the first and second stage must be set to connect to the upper output port, • while the switch at the third stage must be set to connect to the lower output port (001). • In general, when using k × k switches, an Omega MIN with N input-output ports requires at least logk N stages, each of which contains N/k switches, for a total of (N/k)(logk N) switches. • Figure (c) Crossbar Switch provides a path from any input or source to any other output or destination by simply selecting a direction on the fly. – To connect row 111 to column 001 requires only one binary switch at the intersection of the 111 input line and 001 output line to be set. • The crossbar switch clearly uses more binary switching components; – for example, N^2 components are needed to connect N × N source/destination pairs. 99
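  A small sketch of this destination-tag routing rule and of the switch-count formula (illustrative Python of my own; the 0 = upper / 1 = lower convention follows the 8 × 8 example above):

```python
from math import log, log2

def omega_route(dest, n_ports):
    # At stage i, bit i of the destination address (most significant bit first)
    # selects the switch output: 0 -> upper, 1 -> lower. n_ports is a power of two.
    n_stages = int(log2(n_ports))
    return ["upper" if b == "0" else "lower" for b in format(dest, f"0{n_stages}b")]

print(omega_route(0b001, 8))        # ['upper', 'upper', 'lower'], e.g. source 111 -> dest 001

def omega_switch_count(n_ports, k):
    # An N-port omega MIN built from k x k switches: log_k N stages of N/k switches each.
    n_stages = round(log(n_ports, k))       # n_ports assumed to be a power of k
    return n_stages * (n_ports // k)

print(omega_switch_count(4096, 2))  # 12 stages x 2048 switches = 24576
```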
  • 100. Pros and Cons of Crossbar Switch and Omega MIN • Pros – Crossbar switch has potential for speed. In one clock, a connection can be made between source and destination. – The diameter of the crossbar is one. • (Note: The diameter, D, of a network having N nodes is defined as the maximum, over all pairs of nodes, of the shortest path between them.) – The omega MIN, on the other hand, requires log N clocks to make a connection. • The diameter of the omega MIN is therefore log N. • Cons – Both Crossbar Switch and Omega MIN networks limit the number of alternate paths between any source/destination pair. – This leads to limited fault tolerance and network traffic congestion. – If the single path between a pair becomes faulty, that pair cannot communicate. – If two pairs attempt to communicate at the same time along a shared path, one pair must wait for the other. • This is called blocking, and such MINs are called blocking networks. • A network that can handle all possible connections without blocking is called a nonblocking network. 100
  • 101. Example Problem • Example: – Compute the cost of interconnecting 4096 nodes using a single crossbar switch relative to doing so using a MIN built from 2 × 2, 4 × 4, and 16 × 16 switches. Consider separately the relative cost of the unidirectional links and the relative cost of the switches. Switch cost is assumed to grow quadratically with the number of input (alternatively, output) ports, k, for k × k switches. • Solution: – The switch cost of the network when using a single crossbar is proportional to 4096^2. – The unidirectional link cost is 8192, which accounts for the set of links from the end nodes to the crossbar and also from the crossbar back to the end nodes. – When using a MIN with k × k switches, the cost of each switch is proportional to k^2, but there are (4096/k)(logk 4096) total switches. – Likewise, there are (logk 4096) stages of N unidirectional links per stage from the switches plus N links to the MIN from the end nodes. – Therefore, the relative costs of the crossbar with respect to each MIN are given by the following: 101
  • 102. Example Problem… 102
  Relative cost (2 × 2) switches  = 4096^2 / (2^2 × (4096/2) × log2 4096)  ≈ 170
  Relative cost (4 × 4) switches  = 4096^2 / (4^2 × (4096/4) × log4 4096)  ≈ 170
  Relative cost (16 × 16) switches = 4096^2 / (16^2 × (4096/16) × log16 4096) ≈ 85
  Relative cost (2 × 2) links  = 8192 / (4096 × (log2 4096 + 1))  = 2/13 ≈ 0.1538
  Relative cost (4 × 4) links  = 8192 / (4096 × (log4 4096 + 1))  = 2/7 ≈ 0.2857
  Relative cost (16 × 16) links = 8192 / (4096 × (log16 4096 + 1)) = 2/4 = 0.5
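  These figures can be checked with a few lines of Python (illustrative only; the exact ratios 170.7, 170.7 and 85.3 are rounded to 170, 170 and 85 above).

```python
from math import log

N = 4096
crossbar_switch_cost = N ** 2
crossbar_link_cost = 2 * N            # 8192 unidirectional links

for k in (2, 4, 16):
    stages = log(N, k)                                # log_k 4096
    min_switch_cost = k ** 2 * (N / k) * stages
    min_link_cost = N * (stages + 1)
    print(f"{k:>2} x {k:<2}  switch ratio = {crossbar_switch_cost / min_switch_cost:6.1f}"
          f"   link ratio = {crossbar_link_cost / min_link_cost:.4f}")
```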
  • 103. Example Problem… • Conclusion – In all cases, the single crossbar has much higher switch cost than the MINs. – The most dramatic reduction in cost comes from the MIN composed of the smallest switches (and hence the largest number of them), but it is interesting to see that the MINs with 2 × 2 and 4 × 4 switches yield the same relative switch cost. – The relative link cost of the crossbar is lower than that of the MINs, but by less than an order of magnitude in all cases. – We must keep in mind that end node links are different from switch links in their length and packaging requirements, so they usually have different associated costs. – Despite the lower link cost, the crossbar has higher overall relative cost. 103
  • 104. Performance Comparison of some Dynamic INs • In the table below, m represents the number of multiple buses used, while N represents the number of processors (memory modules) or input/output of the network. 104
  Table: Performance Comparison of Some Dynamic INs
  Network               | Delay    | Cost (Complexity)
  Bus                   | O(N)     | O(1)
  Multiple-bus          | O(mN)    | O(m)
  Multistage INs (MINs) | O(log N) | O(N log N)
  • 105. Performance Comparison of some Static INs • The table below shows a performance comparison among a number of static INs.  In this table, the degree of a network is defined as the maximum number of links (channels) connected to any node in the network; the degree of a node, d, is the number of channels incident on that node.  The diameter of a network is defined as the maximum, over all pairs of nodes, of the shortest path between them. 105
  Table: Performance Characteristics of Static INs
  Network      | Degree | Diameter         | Cost (No. of links)
  Linear array | 2      | N – 1            | N – 1
  Binary tree  | 3      | 2(⌈log2 N⌉ – 1)  | N – 1
  n-cube       | log2 N | log2 N           | nN/2
  2D-mesh      | 4      | 2(n – 1)         | 2(N – n)
  (For the n-cube, n = log2 N; for the 2D mesh, n = √N.)
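  The formulas in this table can be evaluated for a concrete machine size; the sketch below uses N = 64 (an arbitrary choice of mine), so the 2D mesh is 8 × 8 and the n-cube has 6 dimensions.

```python
from math import ceil, log2, sqrt

N = 64
n_mesh = int(sqrt(N))          # side of the square 2D mesh
n_cube = int(log2(N))          # dimension of the hypercube

rows = [
    ("Linear array", 2,      N - 1,                    N - 1),
    ("Binary tree",  3,      2 * (ceil(log2(N)) - 1),  N - 1),
    ("n-cube",       n_cube, n_cube,                   n_cube * N // 2),
    ("2D-mesh",      4,      2 * (n_mesh - 1),         2 * (N - n_mesh)),
]

print(f"{'Network':<13}{'Degree':>8}{'Diameter':>10}{'Links':>8}")
for name, degree, diameter, links in rows:
    print(f"{name:<13}{degree:>8}{diameter:>10}{links:>8}")
```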
