CPE 806: ADVANCED COMPUTER ARCHITECTURE
1
MULTIPROCESSORS
Multiprocessing: Flynn’s Taxonomy
2
• Flynn’s Taxonomy of Parallel Machines
– How many Instruction (I) streams?
– How many Data (D) streams?
• Flynn classified architectures in terms of streams of data
and instructions:
– Stream of Instructions (SI): sequence of instructions executed by
the computer.
– Stream of Data (SD): sequence of data including input, temporary
or partial results referenced by instructions.
Flynn’s Taxonomy….
• Computer architectures are characterized by the
multiplicity of hardware to serve instruction and data
streams.
1. Single Instruction Single Data (SISD)
2. Single Instruction Multiple Data (SIMD)
3. Multiple Instruction Multiple Data (MIMD)
4. Multiple Instruction Single Data (MISD)
3
Flynn’s Taxonomy: SISD
• SISD: Single I Stream, Single D Stream
– A uniprocessor von Neumann computer
4
[Figure: SISD organization: a control unit sends a single instruction stream to one processor (P), which exchanges a single data stream with memory (M) and I/O.]
SIMD
• SIMD: Single I, Multiple D Streams
– Each “processor” works on its own data
– But all execute the same instructions in lockstep
– E.g. a vector processor or MMX
• Consists of 2 parts
– A front-end Von Neumann computer
– A processor array: connected to the memory bus of the front end
5
[Figure: SIMD organization: a control unit broadcasts one instruction stream (the program loaded from the front end) to processors P1..Pn, each operating on its own data stream from its memory M1..Mn, with data loaded from the front end.]
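To make the lockstep idea concrete, here is a minimal sketch (using NumPy, an assumption not made in the slides) that contrasts a scalar, SISD-style loop with a SIMD-style vectorized operation in which one add is applied to many data elements at once:

```python
import numpy as np

# SISD style: one instruction stream operates on one data element at a time.
def scalar_add(a, b):
    out = [0.0] * len(a)
    for i in range(len(a)):       # each iteration issues its own add
        out[i] = a[i] + b[i]
    return out

# SIMD style: a single vector add is applied to all elements in lockstep
# (NumPy dispatches to vectorized machine code for the whole array).
a = np.arange(8, dtype=np.float64)
b = np.ones(8, dtype=np.float64)
print(scalar_add(a, b))           # element-by-element
print(a + b)                      # one "instruction", many data elements
```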
SIMD Architecture
6
[Figure: two SIMD organizations, each with a control unit, processors P1..Pn, memories M1..Mn, and an interconnection network.]
Scheme 1: Each processor has its own local memory.
Scheme 2: Processor and memory modules communicate with each other via the interconnection network.
SIMD: Shared Memory and Not Shared
7
[Figure: SIMD with shared memory (processors P1..Pn access a shared memory through an interconnection network under one control unit) versus SIMD without shared memory (each Pi paired with its own Mi, program and data loaded from the front end).]
MIMD
• MIMD: Multiple I, Multiple D Streams
– Made of multiple processors and multiple memory modules connected together via
some interconnection network
– Each processor executes its own instructions and operates on its own data
– This is your typical off-the-shelf multiprocessor
(made using a bunch of “normal” processors)
– Includes multi-core processors
• 2 broad classes:
– Shared memory
– Message passing
8
[Figure: MIMD organization: each control unit i issues its own instruction stream to processor Pi, which operates on its own data stream from memory Mi.]
MIMD Multiprocessors
9
[Figure: centralized shared-memory and distributed-memory MIMD organizations.]
• Multiprocessors → computers consisting of tightly coupled processors whose coordination and
usage are typically controlled by a single operating system and that share memory through a
shared address space.
• Such systems exploit thread-level parallelism through two different software models:
– Parallel processing
– Request-level processing
Flynn’s Taxonomy: MISD
• MISD: Multiple I, Single D Stream
– No commercial processor has been built with this organization.
10
Introduction
• Goal: connecting multiple computers to get higher
performance
– Multiprocessors
– Scalability, availability, power efficiency
• Job-level (process-level) parallelism
– High throughput for independent jobs
• Parallel processing program
– Single program run on multiple processors
• Multicore microprocessors
– Chips with multiple processors (cores)
11
Multiprocessors
• Why do we need multiprocessors?
– Uniprocessor speed keeps improving
• There are limits to how far ILP can be increased
– But there are things that need even more speed
• Wait for a few years for Moore’s law to catch up?
• Or use multiple processors and do it now?
– Need for more computing power
• Data intensive applications
• Utility computing requires powerful processors
• Multiprocessor software problem
– Most code is sequential (for uniprocessors)
• Much easier to write and debug
– Parallel code required for effective and efficient utilization of all cores
• But correct parallel code is very, very difficult to write
– Efficient and correct is even harder
– Debugging even more difficult (Heisenbugs)
12
Multiprocessors
• The main argument for using multiprocessors is to
create powerful computers by simply connecting
multiple processors.
– A multiprocessor is expected to reach faster speed than
the fastest single-processor system.
– More cost-effective.
• A multiprocessor consisting of a number of single processors is
expected to be more cost-effective than building a high-performance
single processor.
– Fault tolerance.
• If a processor fails, the remaining processors should be able to
provide continued service, albeit with degraded performance.
13
Two Models for Communication and
Memory Architecture
1. Communication occurs by explicitly passing messages among the
processors:
– message-passing multiprocessors
2. Communication occurs through a shared address space (via loads
and stores):
– shared memory multiprocessors either
• UMA (Uniform Memory Access time) for shared address, centralized memory MP
• NUMA (Non Uniform Memory Access time multiprocessor) for shared address,
distributed memory MP
• In the past, there was confusion about whether “sharing” means sharing physical
memory (symmetric MP) or sharing the address space
14
Symmetric Shared-Memory Architectures
• From multiple boards on a shared bus to multiple
processors inside a single chip
• Caches hold both
– Private data, used by a single processor
– Shared data, used by multiple processors
15
Important ideas
• Technology drives the solutions.
– Multi-cores have altered the game!!
– Thread-level parallelism (TLP) vs ILP.
• Computing and communication deeply intertwined.
– Write serialization exploits broadcast communication on the
interconnection network or the bus connecting L1, L2, and L3 caches for
cache coherence.
• Access to data located at the fastest memory level greatly
improves the performance.
• Caches are critical for performance but create new problems
– Cache coherence protocols:
1. Cache snooping → traditional multiprocessors
2. Directory based → multi-core processors
16
Review of basic concepts
• Cache → smaller, faster memory which stores copies of the data from frequently
used main memory locations.
• Cache writing policies
– write-through → every write to the cache causes a write to main memory.
– write-back → writes are not immediately mirrored to main memory.
• Locations written are marked dirty and written back to main memory only when that data is evicted from the cache.
• A read miss may require two memory accesses: write the dirty location to memory and read the new location from memory.
• Caches are organized in blocks or cache lines.
• Cache blocks consist of
– Tag → contains (part of) the address of the actual data fetched from main memory
– Data block
– Flags → dirty bit, shared bit, …
• Broadcast networks → all nodes share a communication medium and hear all
messages transmitted, e.g., a bus.
17
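To tie the write-back policy, dirty bit, and write-back-on-eviction together, here is a toy one-block sketch (illustrative only; the class and variable names are invented for this example):

```python
# Minimal write-back cache model for a single block, illustrating the
# dirty bit and write-back on eviction described above (illustrative only).
class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory          # dict: address -> value
        self.addr = None              # address currently cached
        self.value = None
        self.dirty = False

    def _evict(self):
        if self.addr is not None and self.dirty:
            self.memory[self.addr] = self.value   # write back only if dirty
        self.addr, self.value, self.dirty = None, None, False

    def read(self, addr):
        if self.addr != addr:                     # miss: may need two accesses
            self._evict()
            self.addr, self.value = addr, self.memory[addr]
        return self.value

    def write(self, addr, value):
        if self.addr != addr:
            self._evict()
            self.addr = addr
        self.value, self.dirty = value, True      # not mirrored to memory yet

memory = {0: 10, 1: 20}
c = WriteBackCache(memory)
c.write(0, 99)
print(memory[0])   # still 10: the write has not reached memory yet
c.read(1)          # evicts block 0, forcing the write-back
print(memory[0])   # now 99
```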
Cache Coherence and Consistency
• Coherence
– Reads by any processor must return the most recently written value
– Writes to the same location by any two processors are seen in the same
order by all processors
– Coherence defines the behaviour of reads and writes to the same location.
• Consistency
– A read returns the last value written
– If a processor writes location A followed by location B, any processor
that sees the new value of B must also see the new value of A
– Consistency defines behaviour of reads and writes to other locations
18
Thread-level parallelism (TLP)
• Distribute the workload among a set of concurrently running threads.
• Uses the MIMD model → multiple program counters
• Targeted for tightly-coupled shared-memory multiprocessors
• To be effective need n threads for n processors.
• Amount of computation assigned to each thread = grain size
– Threads can be used for data-level parallelism, but the overheads may outweigh
the benefit
• Speedup
– Maximum speedup with n processors is n; embarrassingly parallel
– The actual speedup depends on the ratio of parallel versus sequential portion of a
program according to Amdahl’s law.
19
TLP and ILP
• The costs for exploiting ILP are prohibitive in terms of silicon
area and of power consumption.
• Multicore processors have altered the game
– Shifted the burden for keeping the processor busy from the hardware
and architects to application developers and programmers.
– Shift from ILP to TLP
• Large-scale multiprocessors are not a large market; they have
been replaced by clusters of multicore systems.
20
Multi-core processors
• Cores are now the building blocks of chips.
• Intel offers a family of processors based on the Nehalem
architecture with a different number of cores and L3
caches
21
MIMD Multiprocessors
• Centralized Shared Memory
• Distributed Memory
22
Centralized-Memory Machines
• Also called “Symmetric Multiprocessors” (SMP) or
“Uniform Memory Access” (UMA) machines
– All memory locations have similar latencies
– Data sharing through memory reads/writes
– P1 can write data to a physical address A,
P2 can then read physical address A to get that data
• Caching data
– reduces the access time but demands cache coherence
• Two distinct data states
– Global state → defined by the data in main memory
– Local state → defined by the data in local caches
• In multi-core processors the L3 cache is shared; the L1 and
L2 caches are private
• Problem: memory contention
– All processors share the one memory
– Memory bandwidth becomes the bottleneck
– Used only for smaller machines
• Most often 2, 4, or 8 processors
23
[Figure: basic structure of a centralized shared-memory multiprocessor: processors, each with one or more levels of private cache, share a cache, main memory, and the I/O system over a common interconnect.]
Shared Memory Pros and Cons
• Pros
– Communication happens automatically
– More natural way of programming
• Easier to write correct programs and gradually optimize them
– No need to manually distribute data
(but can help if you do)
• Cons
– Needs more hardware support
– Easy to write correct, but inefficient programs
(remote accesses look the same as local ones)
24
MIMD: Distributed-Memory Machines
• Two kinds
– Distributed Shared-Memory (DSM)
• All processors can address all memory
locations
• Data sharing like in SMP
• Also called NUMA (non-uniform
memory access)
• Latencies of different memory locations
can differ
(local access faster than remote access)
– Message-Passing
• A processor can directly address only
local memory
• To communicate with other processors,
must explicitly send/receive messages
• Also called multicomputers or clusters
• Most accesses local, so less
memory contention (can scale to
well over 1000 processors)
25
[Figure: basic architecture of a distributed-memory multiprocessor: nodes consisting of a multicore processor plus caches, local memory, and I/O, connected by an interconnection network.]
Distributed Shared-Memory Multiprocessor…
• Two major benefits:
– It is a cost-effective way to scale the memory bandwidth
if most of the accesses are to local memory in the node.
– It reduces the latency for accesses to the local memory.
• Two key disadvantages:
– Communicating data between processors becomes more
complex.
– It requires more effort in the software to take advantage
of the increased memory bandwidth afforded by
distributed memories
26
Message-Passing Machines
• A cluster of computers
– Each with its own processor and memory
– An interconnect to pass messages between them
– Producer-Consumer Scenario:
• P1 produces data D, uses a SEND to send it to P2
• The network routes the message to P2
• P2 then calls a RECEIVE to get the message
– Two types of send primitives
• Synchronous: P1 stops until P2 confirms receipt of message
• Asynchronous: P1 sends its message and continues
– Standard libraries for message passing:
Most common is MPI – Message Passing Interface
27
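A minimal sketch of the producer-consumer SEND/RECEIVE pattern above. MPI is the standard library in practice; to keep the sketch self-contained it uses Python's multiprocessing pipes, which expose similar blocking send/receive semantics:

```python
from multiprocessing import Process, Pipe

def producer(conn):
    data = [1, 2, 3]          # P1 produces data D
    conn.send(data)           # SEND: message handed to the "network"
    conn.close()

def consumer(conn):
    data = conn.recv()        # RECEIVE: blocks until the message arrives
    print("P2 received:", data)

if __name__ == "__main__":
    p1_end, p2_end = Pipe()
    p1 = Process(target=producer, args=(p1_end,))
    p2 = Process(target=consumer, args=(p2_end,))
    p1.start(); p2.start()
    p1.join();  p2.join()
```

Here send() returns as soon as the message is handed off (asynchronous in the slide's terms), while recv() blocks until the data arrives.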
Communication Performance
• Metrics for Communication Performance
– Communication Bandwidth
– Communication Latency
• Sender overhead + transfer time + receiver overhead
– Communication latency hiding
• Characterizing Applications
– Communication to Computation Ratio
• Work done vs. bytes sent over network
• Example: 146 bytes per 1000 instructions
28
Message Passing Pros and Cons
• Pros
– Simpler and cheaper hardware
– Explicit communication makes programmers aware of costly
(communication) operations
• Cons
– Explicit communication is painful to program
– Requires manual optimization
• If you want a variable to be local and accessible via LD/ST, you must
declare it as such
• If other processes need to read or write this variable, you must explicitly
code the needed sends and receives to do this
29
Parallel Processing Performance
• Challenges of Parallel Processing:
– First challenge is % of program inherently sequential
– Suppose 80x speedup from 100 processors. What fraction of
original program can be sequential?
• (a) 10% (b) 5% (c) 1% (d) <1%
– Assume that the program operates in only two modes:
• Parallel with all processors fully used (enhanced mode)
• Serial with only one processor in use
30
Amdahl’s Law provides the solution
31
Speedup = 1 / ((1 − Fraction_parallel) + Fraction_parallel/100) = 80
⇒ (1 − Fraction_parallel) + Fraction_parallel/100 = 1/80 = 0.0125
⇒ Fraction_parallel ≈ 0.9975, so at most about 0.25% of the original execution time can be sequential.
The sequential part can limit speedup.
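A short sketch of the Amdahl's-law calculation above: given the target speedup and the processor count, it solves for the parallel fraction and reports how much of the original time may remain sequential.

```python
def required_parallel_fraction(target_speedup, n_processors):
    # Amdahl's law: speedup = 1 / ((1 - f) + f / n)  =>  solve for f
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_processors)

f = required_parallel_fraction(80, 100)
print(f"parallel fraction   = {f:.4f}")       # ~0.9975
print(f"sequential fraction = {1 - f:.4%}")   # ~0.25% of original time
```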
Second Challenge:
Long Latency to Remote Memory
• Suppose 32 CPU MP, 2GHz, 200 ns to handle reference to a
remote memory, all local accesses hit memory hierarchy
and base CPI is 0.5. (Remote request cost = 200/0.5 = 400
clock cycles.)
• What is performance impact if 0.2% instructions involve
remote access?
– (a) 1.5X (b) 2.0X (c) 2.5X
32
CPI Equation
• CPI = Base CPI + Remote request rate x Remote request cost
• Cycle time = 1 / 2 GHz = 0.5 ns
• CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3
• A multiprocessor with no remote communication is 1.3/0.5 = 2.6 times faster than
one in which 0.2% of instructions involve a remote access
• In practice, the performance analysis is much more complex, since
– Some fraction of the non-communication references will miss in the local
hierarchy
– Remote access time does not have a single constant value.
33
Remote request cost = Remote access cost / Cycle time = 200 ns / 0.5 ns = 400 cycles
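The same CPI calculation as a small script, so the numbers above can be reproduced or varied (the parameters are those of the example: 0.5 base CPI, 0.2% remote references, 200 ns remote access time, 2 GHz clock):

```python
def effective_cpi(base_cpi, remote_rate, remote_ns, clock_ghz):
    cycle_ns = 1.0 / clock_ghz                 # 2 GHz -> 0.5 ns
    remote_cost = remote_ns / cycle_ns         # 200 ns / 0.5 ns = 400 cycles
    return base_cpi + remote_rate * remote_cost

cpi = effective_cpi(base_cpi=0.5, remote_rate=0.002, remote_ns=200, clock_ghz=2)
print(cpi)              # 0.5 + 0.002 * 400 = 1.3
print(cpi / 0.5)        # the all-local machine is 2.6x faster
```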
Challenge: Scaling Example
• Suppose you want to perform two sums:
– one is a sum of two scalar variables and
– one is a matrix sum of a pair of two-dimensional arrays, size 1000 by 1000.
What speedup do you get with 1000 processors?
• Solution:
– If we assume performance is a function of the time for an addition, t , then
there is 1 addition that does not benefit from parallel processors and
1,000,000 additions that do.
– If the time before (for a single processor) is: 1,000,000t + 1t = 1,000,001t
• Execution time after improvement
34
Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected
Challenge: Scalar and Matrix Addition
35
Execution time after improvement = 1,000,000t / 1,000 + 1t = 1001t

Speedup = 1,000,001t / 1001t ≈ 999
Even if the sequential portion expanded to 100 sums of scalar variables versus
one sum of a pair of 1000 by 1000 arrays, the speedup would still be 909.
Scaling Example...
• What if matrix size is 100 x 100?
– Single processor: Time= (10 + 10000) x tadd
– 10 processors
• Time = 10 x tadd + 10000/10 x tadd = 1010 x tadd
• Speedup = 10010/1010 = 9.9 (99% of potential)
– 100 processors
• Time = 10 x tadd + 10000/100 x tadd = 110 x tadd
• Speedup = 10010/110 = 91 (91% of potential)
Assuming load balanced
36
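The scalar-plus-matrix speedups above can be reproduced in a few lines (times are in units of t_add, and perfect load balance is assumed, as in the slide):

```python
def speedup(n_scalar, n_matrix, processors):
    # Scalar sums stay sequential; matrix additions divide evenly (load balanced).
    t_before = n_scalar + n_matrix
    t_after = n_scalar + n_matrix / processors
    return t_before / t_after

print(round(speedup(1, 1_000_000, 1000), 1))   # ~999  (1000x1000 matrix)
print(round(speedup(10, 10_000, 10), 1))       # ~9.9  (100x100 matrix, 10 procs)
print(round(speedup(10, 10_000, 100), 1))      # ~91   (100x100 matrix, 100 procs)
```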
Symmetric Shared-Memory Architectures
Cache Coherence Problem
• Shared memory easy with no caches
– P1 writes, P2 can read
– Only one copy of data exists (in memory)
• Caches store their own copies of the data
– Those copies can easily get inconsistent
– Classic example: adding to a sum
• P1 loads allSum, adds its mySum, stores new allSum
• P1’s cache now has dirty data, but memory not updated
• P2 loads allSum from memory, adds its mySum, stores allSum
• P2’s cache also has dirty data
• Eventually P1 and P2’s cached data will go to memory
• Regardless of write-back order, the final value ends up wrong
37
Cache Coherence Problem…
38
[Figure: P1 and P2 each add their mySum to a cached copy of allSum (reaching 5 and 12 respectively) while main memory still holds allSum = 0.]
Processors reading allSum from main memory may see a very stale value.
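A tiny simulation of the stale-allSum scenario above (variable names are illustrative): each processor loads allSum into its cache, adds its own mySum, and writes back later; whichever write-back arrives last, one update is lost.

```python
# Each processor caches allSum, adds its own mySum, and writes back later.
memory = {"allSum": 0}

p1_copy = memory["allSum"]        # P1 loads allSum (0) into its cache
p2_copy = memory["allSum"]        # P2 loads the same value
p1_copy += 5                      # P1 adds mySum1 = 5 in its cache (dirty)
p2_copy += 7                      # P2 adds mySum2 = 7 in its cache (dirty)

memory["allSum"] = p1_copy        # P1's write-back
memory["allSum"] = p2_copy        # P2's write-back overwrites it
print(memory["allSum"])           # 7, not the correct 12: one update is lost
```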
Cache Coherence Definition
• A memory system is coherent if
1. Preserve Program Order: A read R from address X on processor
P1 returns the value written by the most recent write W to X on
P1 if no other processor has written to X between W and R.
– This property simply preserves program order; we expect this property
to be true even in uniprocessors.
39
Figure: The cache coherence problem for a single memory location (X), read and written by two processors (A and B).

Time | Event | Cache contents for processor A | Cache contents for processor B | Memory contents for location X
0 | | | | 1
1 | Processor A reads X | 1 | | 1
2 | Processor B reads X | 1 | 1 | 1
3 | Processor A stores 0 into X | 0 | 1 | 0
Cache Coherence Definition…
2. Coherent view of Memory: If P1 writes to X and P2 reads X after
a sufficient time, and there are no other writes to X in between,
P2’s read returns the value written by P1’s write.
• The second property defines the notion of what it means to have a
coherent view of memory:
– If a processor could continuously read an old data value, we would clearly
say that memory was incoherent.
3. Write Serialization: Writes to the same location are serialized.
Two writes to location X are seen in the same order by all
processors.
• For example, if the values 1 and then 2 are written to a location,
processors can never read the value of the location as 2 and then later
read it as 1.
40
Coherence defines behaviour of reads and writes to the same location
Write Consistency
For now assume
1. A write does not complete (and allow the next write to
occur) until all processors have seen the effect of that write
2. The processor does not change the order of any write with
respect to any other memory access
if a processor writes location A followed by location B,
any processor that sees the new value of B must also see
the new value of A
• These restrictions allow the processor to reorder reads, but
forces the processor to finish writes in program order
41
Consistency defines behaviour of reads and writes to other locations
Basic Schemes for Enforcing Coherence
• Migration – data can be moved to a local cache and used
there in a transparent fashion
– Reduces both latency to access shared data that is allocated
remotely and bandwidth demand on the shared memory
• Replication – for reading shared data simultaneously, since
caches make a copy of data in local cache
– Reduces both latency of access and contention for read shared
data
42
Maintaining Cache Coherence
• Hardware schemes
– Shared Caches
• Trivially enforces coherence
• Not scalable (L1 cache quickly becomes a bottleneck)
– Snooping
• Needs a broadcast network (like a bus) to enforce coherence
• Each cache that has a block tracks its sharing state on its own
– Directory
• Can enforce coherence even with a point-to-point network
• A block has just one place where its full sharing state is kept
– All information about the blocks is kept in the directory
• SMP: one centralised directory is provided in the outermost cache for multi-
core systems
• DSM: Directory is distributed. Each node maintains its own directory which
tracks the sharing information of every cache line in the node
43
Maintaining Cache Coherence:
Two Classes of Protocols in Use
Cache coherence Protocols
• Directory based
– The sharing status of a block of physical memory is kept in just one
location, called the directory;
– Directory-based coherence has slightly higher implementation
overhead than snooping, but it can scale to larger processor counts.
• The Sun T1 design uses directories, albeit with a central physical memory.
• Snooping
– Every cache that has a copy of the data from a block of physical
memory also has a copy of the sharing status of the block, but no
centralized state is kept.
– The caches are all accessible via some broadcast medium (a bus or
switch), and all cache controllers monitor or snoop on the medium
• To determine whether or not they have a copy of a block that is requested on a
bus or switch access.
44
Snoopy Cache-Coherence Protocols
• The cache controller “snoops” all transactions on the shared
medium (bus or switch)
– a transaction is relevant if it is for a block the cache contains
– take action to ensure coherence
• invalidate, update, or supply value
– the action depends on the state of the block and the protocol
• Either get exclusive access before a write via write invalidate, or
update all copies on a write
45
[Figure: snoopy cache coherence: caches for processors P1..Pn and I/O devices sit on a shared bus with memory; each cache controller snoops bus transactions (state, address, data) while also serving cache-memory transactions from its own processor.]
Communication between private
and shared caches
• Multi-core processor → a bus connects private L1 and L2
instruction (I) and data (D) caches to the shared L3 cache.
• To invalidate a cached item the processor changing the
value must
– first acquire the bus and
– then place the address of the item to be invalidated on the bus.
• DSM → locating the current value of an item is harder with
– write-back caches
• because the current value of the item can be in the local cache of another
processor rather than in memory.
46
Snooping Protocol
• Typically used for bus-based (SMP) multiprocessors
– Serialization on the bus used to maintain coherence property 3
• Two flavors
– Write-update (write broadcast)
• A write to shared data is broadcast to update all copies
• All subsequent reads will return the new written value (property 2)
• All see the writes in the order of broadcasts
One bus == one order seen by all (property 3)
– Write-invalidate
• Write to shared data forces invalidation of all other cached copies
• Subsequent reads miss and fetch new value (property 2)
• Writes ordered by invalidations on the bus (property 3)
47
Write Invalidate: Example
• Write invalidate → on a write, invalidate all other copies.
– Used in modern microprocessors
– Example: processors A and B both read item X (write-back caches, read misses).
Once A writes X, it invalidates B's cached copy of X
48
Processor activity | Bus activity | Contents of processor A's cache | Contents of processor B's cache | Contents of memory location X
 | | | | 0
Processor A reads X | Cache miss for X | 0 | | 0
Processor B reads X | Cache miss for X | 0 | 0 | 0
Processor A writes a 1 to X | Invalidation for X | 1 | | 0
Processor B reads X | Cache miss for X | 1 | 1 | 1
For a write, we require that the writing processor have exclusive access,
preventing any other processor from being able to write simultaneously.
An example of an invalidation protocol working on a snooping bus for a single cache block (X) with
write-back caches.
Write Invalidate: Example...
• An example of an invalidation protocol working on a snooping bus for a single
cache block (X) with write-back caches
• We assume that neither cache initially holds X and that the value of X in
memory is 0.
• The CPU and memory contents show the value after the processor and bus
activity have both completed.
• A blank indicates no activity or no copy cached.
• When the second miss by B occurs, CPU A responds with the value cancelling
the response from memory.
– In addition, both the contents of B’s cache and the memory contents of X are updated.
• This update of memory, which occurs when a block becomes shared, is typical
in most protocols and simplifies the protocol.
49
Example: Write-through Invalidate
50
• Must invalidate before step 3
• Write update uses more broadcast-medium bandwidth
⇒ all recent MPUs use write invalidate
Exclusive access ensures that no other readable or writable copies of
a data item exist when the write occurs
Write Update: Example
• An example of a write update or broadcast protocol working on a
snooping bus for a single cache block (X) with write-back caches.
• We assume that neither cache initially holds X and that the value of
X in memory is 0.
51
Processor activity | Bus activity | Contents of processor A's cache | Contents of processor B's cache | Contents of memory location X
 | | | | 0
Processor A reads X | Cache miss for X | 0 | | 0
Processor B reads X | Cache miss for X | 0 | 0 | 0
Processor A writes a 1 to X | Write broadcast of X | 1 | 1 | 1
Processor B reads X | No bus activity | 1 | 1 | 1
Write Update: Example...
• The CPU and memory contents show the value after the
processor and bus activity have both completed.
• A blank indicates no activity or no copy cached.
• When CPU A broadcasts the write, both the cache in CPU B
and the memory location of X are updated.
• In the second read, processor B finds the updated value of
X and therefore there is no bus activity.
52
Update vs. Invalidate
• A burst of writes by a processor to one address
– Update: each sends an update
– Invalidate: possibly only the first invalidation is sent
• Writes to different words of a block
– Update: update sent for each word
– Invalidate: possibly only the first invalidation is sent
• Producer-consumer communication latency
– Update: producer sends an update,
• consumer reads new value from its cache
– Invalidate: producer invalidates consumer’s copy,
• consumer’s read misses and has to request the block
• Which is better depends on application
– But write-invalidate is simpler and implemented in most MP-capable
processors today.
53
Implementation of cache Invalidate
• The key to implementing an invalidate protocol in a
multicore is
– the use of the bus, or another broadcast medium, to perform
invalidates.
– All processors snoop on the bus.
• To invalidate the processor changing an item
– acquires the bus and
– broadcasts the address to be invalidated on the bus.
• If two processors attempt to change an item at the same time, the
bus arbiter allows access to only one of them.
– All coherence schemes require some method of serializing accesses
to the same cache block, either by serializing access to the
communication medium or another shared structure.
54
Implementation of cache Invalidate…
• How to find the most recent value of a data item
– Write-through cache → the value is in memory, but write buffers could
complicate the scenario.
– Write-back cache → a harder problem; the item could be in the private
cache of another processor.
• A block of cache has extra state bits
– Valid bit – indicates if the block is valid or not
– Dirty bit - indicates if the block has been modified
– Shared bit – cache block is shared with other processors
• If a processor finds that it has a dirty copy of the requested
cache block, it provides that cache block in response to the read
request and causes the memory (or L3) access to be aborted.
55
Implementation of cache Invalidate…
• When a write to a block in the shared state occurs,
– the cache generates an invalidation on the bus and marks the
block as exclusive.
– No further invalidations will be sent by that core for that block.
– The core with the sole copy of a cache block is normally
called the owner of the cache block.
• When an invalidation is sent,
– the state of the owner’s cache block is changed from shared
to unshared (or exclusive).
– If another processor later requests this cache block, the state
must be made shared again
56
Locate up-to-date copy of data
• For a write-through cache
– Get up-to-date copy from memory (Since all written data are
always sent to the memory, from which the most recent values of
a data item can always be fetched.)
– Write through simpler if enough memory bandwidth is available
– Use of write through simplifies the implementation of cache
coherence.
• For a write-back cache
– Most recent copy can be in a cache rather than in memory
– The problem of finding the most recent data value is harder
57
Locate up-to-date copy of data…
• Write-back caches can use the same snooping scheme both for
cache misses and for writes:
– Each processor snoops every address placed on the bus.
– If a processor has dirty copy of the requested cache block, it provides it
in response to the read request and aborts the memory access.
– Complexity comes from having to retrieve the cache block from a
processor’s cache, which can take longer than retrieving it from the
shared memory if the processors are in separate chips.
• Write-back needs lower memory bandwidth
– ⇒ Support larger numbers of faster processors
– ⇒ Most multiprocessors use write-back
58
Cache Resources for Write-Back Snooping
• Normal cache tags can be used for snooping
• Valid bit for each block makes invalidation easy
• Read misses easy since rely on snooping
• Writes → need to know whether any other copies of the block are cached
– No other copies → no need to place the write on the bus in a write-back cache
(reduces both the time taken by the write and the required bandwidth)
– Other copies → need to place an invalidate on the bus
59
[Figure: cache address fields: the block address (tag + index) and the block offset.]
Cache Resources for Write-Back Snooping…
• To track whether a cache block is shared, add extra state bit
associated with each cache block, like valid bit and dirty bit
– Write to a shared block ⇒ need to generate an invalidation on the bus
and mark the state of the block as exclusive.
– Otherwise, no further invalidations will be sent by that processor for
that block
– The processor with the sole copy of a cache block is normally called the
owner of the cache block
– When invalidation is sent, the state of the owner’s cache block is
changed from shared to exclusive.
– If another processor later requests this cache block, the state must be
made shared again.
60
Cache Behaviour in Response to Bus
• Every bus transaction must check the cache address
tags
– could potentially interfere with processor cache accesses
• A way to reduce interference is to duplicate the tags
– One set for cache accesses, one set for bus accesses
• The interference can also be reduced in a multilevel
cache by directing the snoop request to the L2 cache
– Since L2 less heavily used than L1 (the processor uses only
when it has a miss in the L1 cache)
⇒ Every entry in the L1 cache must be present in the L2
cache, called the inclusion property
– If Snoop gets a hit in L2 cache, then it must arbitrate for the L1
cache to update the state and possibly retrieve the data,
which usually requires a stall of the processor
61
Example: Write Back MSI Snooping Protocol
• Snooping coherence protocol is usually implemented by
incorporating a finite‐state controller in each node
• There is only one finite-state machine per cache, with stimuli coming
either from the attached processor or from the bus
• Logically, think of a separate controller associated with each cache
block
– That is, snooping operations or cache requests for different blocks can
proceed independently
• In implementations, a single controller allows multiple operations to
distinct blocks to proceed in interleaved fashion
– that is, one operation may be initiated before another is completed, even
though only one cache access or one bus access is allowed at a time
62
Example: Write Back MSI Snooping Protocol…
• Processor only observes state of memory system by issuing memory
operations
• Assume bus transactions and memory operations are atomic and a
one‐level cache
– all phases of one bus transaction complete before next one starts
– processor waits for memory operation to complete before issuing next
– with one‐level cache, assume invalidations applied during bus transaction
• All writes go to bus + atomicity
– Writes serialized by order in which they appear on bus (bus order) => invalidations applied to
caches in bus order
• How to insert reads in this order?
– Important since processors see writes through reads, so determines whether write
serialization is satisfied
– But read hits may happen independently and do not appear on bus or enter directly in bus
order
63
Example: Write Back MSI Snooping Protocol
• Invalidation protocol, write‐back cache
– Snoops every address on bus
– If it has a dirty copy of requested block, provides that block in response to the
read request and aborts the memory access
• State of block B in cache C can be
– Invalid: B is not cached in C
• To read or write, must make a request on the bus
– Modified: B is dirty in C
• C has the block, no other cache has the block, and C must update memory when it displaces B
• Can read or write B without going to the bus
– Shared: B is clean in C
• C has the block, other caches have the block, and C need not update memory when it
displaces B
• Can read B without going to bus
• To write, must send an upgrade request to the bus
• Read misses: cause all caches to snoop bus
• Writes to clean blocks are treated as misses
64
note that the modified state implies that the block is
exclusive
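Below is a compact, illustrative sketch of the per-block finite-state controller just described (Invalid / Shared / Modified, write-back, write-invalidate). It is not a full implementation: there is a single block, bus transactions are modeled as direct calls to the other caches' snoop methods, and replacement write-backs are ignored.

```python
# Minimal MSI (Invalid/Shared/Modified) controller for one cache block.
INVALID, SHARED, MODIFIED = "I", "S", "M"

class Cache:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.state, self.value = INVALID, None
        bus.caches.append(self)

    # ---- processor-side requests ----
    def read(self):
        if self.state == INVALID:                 # read miss -> bus read
            self.value = self.bus.read_miss(self)
            self.state = SHARED
        return self.value                         # S or M: read hit

    def write(self, value):
        if self.state != MODIFIED:                # need exclusive ownership
            self.bus.invalidate_others(self)
            if self.state == INVALID:
                self.bus.read_miss(self)          # fetch the block first
        self.state, self.value = MODIFIED, value  # block now dirty

    # ---- bus-side (snooped) requests ----
    def snoop_read(self):
        if self.state == MODIFIED:                # supply dirty data,
            self.bus.memory["X"] = self.value     # update memory,
            self.state = SHARED                   # and drop to Shared
            return self.value
        return None

    def snoop_invalidate(self):
        if self.state == MODIFIED:                # write back before losing it
            self.bus.memory["X"] = self.value
        self.state = INVALID

class Bus:
    def __init__(self):
        self.caches, self.memory = [], {"X": 0}

    def read_miss(self, requester):
        for c in self.caches:
            if c is not requester:
                supplied = c.snoop_read()
                if supplied is not None:
                    return supplied               # cache-to-cache transfer
        return self.memory["X"]

    def invalidate_others(self, requester):
        for c in self.caches:
            if c is not requester:
                c.snoop_invalidate()

bus = Bus()
a, b = Cache("A", bus), Cache("B", bus)
a.write(10)          # A: Modified, memory stale
print(b.read())      # 10 -- A supplies the block, both now Shared
b.write(20)          # invalidates A's copy; B: Modified
print(a.read())      # 20 -- a coherent view is maintained
```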
Write‐Back State Machine ‐ Processor Request
65
Transition Arcs: The stimulus causing a state change is shown on the transition
arcs in Blue
Bus actions generated as part of the state transition are shown on the transition
arc in Bold.
Write‐Back State Machine ‐ Processor Request…
66
Finite-state transition diagram for a single private cache block
using a write invalidation protocol and a write-back cache
[Figure: CPU-request state transitions for a write-back, write-invalidate cache block. States: Invalid, Shared (read only), Exclusive (read/write). A CPU read miss places a read miss on the bus (moving to Shared); a CPU write or write miss places a write miss on the bus (moving to Exclusive), possibly writing back the displaced block; read hits in Shared or Exclusive, and write hits in Exclusive, cause no bus activity.]
• Any transition to the Exclusive
state (which is required for a
processor to write to the block)
requires an invalidate or write
miss to be placed on the bus,
 causing all local caches to make the
block invalid.
 In addition, if some other local
cache had the block in Exclusive
state, that local cache generates a
write-back, which supplies the
block containing the desired
address.
Cache block States:
 Invalid
 Shared and
 Exclusive (Modified)
Cache state transitions
based on requests from CPU
Write‐Back State Machine ‐ Bus Request
67
Finite-state transition diagram for a single private cache block using a
write invalidation protocol and a write-back cache
• If a read miss occurs on the bus to a
block in the exclusive state,
 the local cache with the exclusive copy
changes its state to shared.
[Figure: bus-request state transitions for the same cache block. A write miss or an invalidate for this block moves it to Invalid; a read miss on the bus for a block held Exclusive forces a write-back (aborting the memory access) and a transition to Shared.]
Request | Source | State of addressed cache block | Type of cache action | Function and explanation
Read miss | Bus | Shared | No action | Allow shared cache or memory to service the read miss
Read miss | Bus | Modified | Coherence | Attempt to share data: place cache block on bus and change state to shared
Invalidate | Bus | Shared | Coherence | Attempt to write shared block; invalidate the block
Write miss | Bus | Shared | Coherence | Attempt to write shared block; invalidate the cache block
Write miss | Bus | Modified | Coherence | Attempt to write block that is exclusive elsewhere; write back the block and make its state invalid in the local cache
Cache state transitions
based on requests from Bus
Combined Cache Coherence State Diagram for both
Processor and Bus Requests
68
[Figure: combined cache coherence state diagram (Invalid, Shared, Exclusive) showing both the CPU-induced and the bus-induced transitions for a single cache block.]
Transition arcs:
 Local-processor-induced transitions in black
 Bus-activity-induced transitions in blue
 Actions taken on a transition in red
Example
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
69
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | | | |
P1: Read A1 | | | |
P2: Read A1 | | | |
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
Example: Step 1
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
70
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | Excl, A1, 10 | | WrMs, P1, A1 |
P1: Read A1 | | | |
P2: Read A1 | | | |
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
Example: Step 2
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
71
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | Excl, A1, 10 | | WrMs, P1, A1 |
P1: Read A1 | Excl, A1, 10 | | |
P2: Read A1 | | | |
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
Example: Step 3
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
72
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | Excl, A1, 10 | | WrMs, P1, A1 |
P1: Read A1 | Excl, A1, 10 | | |
P2: Read A1 | | Shar., A1 | RdMs, P2, A1 | A1, 10
 | Shar., A1, 10 | | WrBk, P1, A1, 10 | A1, 10
 | | Shar., A1, 10 | RdDa, P2, A1, 10 | A1, 10
P2: Write 20 to A1 | | | |
P2: Write 40 to A2 | | | |
Example: Step 4
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
73
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | Excl, A1, 10 | | WrMs, P1, A1 |
P1: Read A1 | Excl, A1, 10 | | |
P2: Read A1 | | Shar., A1 | RdMs, P2, A1 | A1, 10
 | Shar., A1, 10 | | WrBk, P1, A1, 10 | A1, 10
 | | Shar., A1, 10 | RdDa, P2, A1, 10 | A1, 10
P2: Write 20 to A1 | Inv. | Excl., A1, 20 | Inv., P2, A1 | A1, 10
P2: Write 40 to A2 | | | |
Example: Step 5
• Assume:
– initial cache state is invalid and
– addresses A1 and A2 map to same cache block,
• but A1 != A2
74
Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Memory (Addr, Value)
P1: Write 10 to A1 | Excl, A1, 10 | | WrMs, P1, A1 |
P1: Read A1 | Excl, A1, 10 | | |
P2: Read A1 | | Shar., A1 | RdMs, P2, A1 | A1, 10
 | Shar., A1, 10 | | WrBk, P1, A1, 10 | A1, 10
 | | Shar., A1, 10 | RdDa, P2, A1, 10 | A1, 10
P2: Write 20 to A1 | Inv. | Excl., A1, 20 | Inv., P2, A1 | A1, 10
P2: Write 40 to A2 | | | WrBk, P2, A1, 20 | A1, 20
 | | Excl., A2, 40 | WrMs, P2, A2 | A1, 20
Limitations in Symmetric Shared-Memory
Multiprocessors and Snooping Protocols
• As the number of processors in a multiprocessor grows, or as the memory demands of each
processor grow, any centralized resource in the system can become a bottleneck.
• As processors have increased in speed in the last few years, the number of processors that
can be supported on a single bus or by using a single physical memory unit has fallen
• A single memory cannot accommodate all CPUs
– Solution: multiple memory banks
• Bus-based multiprocessor, bus must support both coherence traffic & normal memory
traffic
– Multiple buses or interconnection networks (cross bar or small point-to-point)
75
In such designs, the memory system
can be configured into multiple physical
banks, so as to boost the effective
memory bandwidth while retaining
uniform access time to memory
[Figure: multiprocessor with the memory system organized into multiple banks: processors with one or more levels of cache connect through an interconnection network to several memory modules and the I/O system.]
Cache Performance
• Cache performance is combination of
1. Behaviour of uniprocessor cache miss traffic
2. Traffic caused by communication
• Results in invalidations and subsequent cache misses
– Changing the processor count, cache size, and block size can
affect these two components of the miss rate in different ways.
• Uniprocessor miss rate:
– Can be broken down into:
• Compulsory,
• Capacity and
• Conflict misses
76
Cache Performance…
• Compulsory miss:
– The very first access to a block cannot be in the cache.
• Capacity miss:
– The cache cannot contain all the blocks needed during execution
of a program, capacity miss will occur because of blocks being
discarded and later retrieved.
• Conflict miss:
– If the block placement strategy is set associative or direct
mapped, conflict misses will occur because a block may be
discarded and later retrieved if too many blocks map to its set.
77
Coherency Misses
The misses that arise from inter-processor communication, which are
often called coherence misses, can be broken into two separate
sources.
1. True sharing misses arise from the communication of data
through the cache coherence mechanism
– In an invalidation-based protocol, the 1st write to a shared block causes an
invalidation to establish ownership of the block.
– When another processor attempts to read a modified word in the cache block,
a miss occurs and the block is transferred.
2. False sharing misses arise when a block is invalidated because some
word in the block, other than the one being read, is written into
– The invalidation does not cause a new value to be communicated, but only causes
an extra cache miss
– The block is shared, but no word in the block is actually shared → the miss would not
occur if the block size were 1 word
78
Example: True v. False Sharing v. Hit?
• Assume x1 and x2 in same cache block and in shared state
• P1 and P2 both read x1 and x2 before.
79
Time | P1 | P2 | True, False, Hit? Why?
1 | Write x1 | | True miss: invalidate x1 in P2
2 | | Read x2 | False miss: x1 irrelevant to P2
3 | Write x1 | | False miss: x1 irrelevant to P2
4 | | Write x2 | False miss: x1 irrelevant to P2
5 | Read x2 | | True miss: invalidate x2 in P1
Classifications by Time Step
1. This event is a true sharing miss, since x1 was read by P2 and needs to be
invalidated from P2.
2. This event is a false sharing miss, since x2 was invalidated by the write of x1 in
P1, but that value of x1 is not used in P2.
3. This event is a false sharing miss, since the block containing x1 is marked shared
due to the read in P2, but P2 did not read x1. The cache block containing x1 will
be in the shared state after the read by P2; a write miss is required to obtain
exclusive access to the block. In some protocols this will be handled as an
upgrade request, which generates a bus invalidate, but does not transfer the
cache block.
4. This event is a false sharing miss for the same reason as step 3.
5. This event is a true sharing miss, since the value being read was written by P2.
80
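The five-step classification above can be reproduced mechanically. The sketch below is a simplification (one block holding the two words x1 and x2, a write-invalidate protocol, and both caches starting with the block Shared after having read both words, as the example assumes): a miss is labeled a true sharing miss only if the word actually being accessed is the one involved in the communication, otherwise it is a false sharing miss.

```python
class Copy:
    def __init__(self):
        self.state = "S"                  # I / S / M
        self.touched = {"x1", "x2"}       # words used while holding this copy
        self.missed_writes = set()        # words the peer wrote while we were invalid

def access(me, peer, op, word):
    if op == "read":
        if me.state != "I":
            kind = "hit"
        else:                             # coherence read miss
            kind = ("true sharing miss" if word in me.missed_writes
                    else "false sharing miss")
            me.missed_writes.clear(); me.touched = set()
            if peer.state == "M":
                peer.state = "S"          # owner downgrades, supplies the block
            me.state = "S"
    else:                                 # write
        if me.state == "M":
            kind = "hit"
        elif me.state == "S":             # upgrade: peer's valid copy is destroyed
            kind = ("true sharing miss" if peer.state != "I" and word in peer.touched
                    else "false sharing miss")
        else:                             # write miss on an invalidated block
            kind = ("true sharing miss" if word in me.missed_writes
                    else "false sharing miss")
            me.missed_writes.clear(); me.touched = set()
        if peer.state != "I":
            peer.state, peer.touched = "I", set()
        peer.missed_writes.add(word)      # peer may later miss on this word
        me.state = "M"
    me.touched.add(word)
    return kind

p1, p2 = Copy(), Copy()
trace = [(p1, p2, "write", "x1"), (p2, p1, "read", "x2"),
         (p1, p2, "write", "x1"), (p2, p1, "write", "x2"),
         (p1, p2, "read", "x2")]
for step, (me, peer, op, word) in enumerate(trace, 1):
    print(step, op, word, "->", access(me, peer, op, word))
# Output matches the table: true, false, false, false, true sharing misses.
```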
Cache to Cache transfers
• Problem
– P1 has block B in M state
– P2 wants to read B, puts a RdReq on bus
– If P1 does nothing, memory will supply the data to P2
– What does P1 do?
• Solution 1: abort/retry
– P1 cancels P2’s request, issues a write back
– P2 later retries RdReq and gets data from memory
– Too slow (two memory latencies to move data from P1 to P2)
• Solution 2: intervention
– P1 indicates it will supply the data (“intervention” bus signal)
– Memory sees that, does not supply the data, and waits for P1’s data
– P1 starts sending the data on the bus, memory is updated
– P2 snoops the transfer during the write-back and gets the block
81
Cache to Cache transfers…
• Intervention works if some cache has data in M state
– Nobody else has the correct data, clear who supplies the
data
• What if a cache has the requested data in S state?
– There might be others who have it; who should supply the
data?
– Solution 1: let memory supply the data
– Solution 2: whoever wins arbitration supplies the data
– Solution 3: A separate state similar to S that indicates there
are maybe others who have the block in S state, but if
anybody asks for the data we should supply it
82
Extensions to the Basic Coherence Protocol
• We have just considered a coherence protocol
with 3 states: Modified, Shared, Invalid (MSI)
• There are many extensions of MSI
– With additional states and transactions, which
optimise certain behaviours, possibly resulting in
improved performance.
• Two of the most common extensions are: MESI
and MOESI
83
MESI (Modified, Exclusive, Shared & Invalid)
• MESI adds the state Exclusive (E) to the basic MSI protocol.
• Exclusive indicates when a cache block is resident only in a single cache but is
clean
• If a block is in the E state, it can be written without generating any invalidates,
which optimizes the case where a block is read by a single cache before being
written by that same cache.
• Of course, when a read miss to a block in the E state occurs, the block must be
changed to the S state to maintain coherence.
– Because all subsequent accesses are snooped, it is possible to maintain the accuracy of this state.
– In particular, if another processor issues a read miss, the state is changed from exclusive to shared
• Pros of adding E state:
– subsequent write to a block in the exclusive state by the same core need not acquire bus access or
generate an invalidate, since the block is known to be exclusively in this local cache; the processor
merely changes the state to modified.
• The Intel i7 uses a variant of a MESI protocol, called MESIF, which adds a state
(Forward) to designate which sharing processor should respond to a request.
– It is designed to enhance performance in distributed memory organizations.
84
MOESI
(Modified, Owned, Exclusive, Shared & Invalid)
• MOESI adds the state Owned to the MESI protocol to indicate that the associated block is
owned by that cache and out-of-date in memory.
• In MSI and MESI protocols, when there is an attempt to share a block in the Modified state,
the state is changed to Shared (in both the original and newly sharing cache), and the block
must be written back to memory.
• In a MOESI protocol, the block can be changed from the Modified to Owned state in the
original cache without writing it to memory.
• Other caches, which are newly sharing the block, keep the block in the Shared state; the O
state, which only the original cache holds, indicates that the main memory copy is out of
date and that the designated cache is the owner.
• The owner of the block must supply it on a miss, since memory is not up to date and must
write the block back to memory if it is replaced.
• The AMD Opteron uses the MOESI protocol.
85
Directory-Based Coherence Protocol
• Typically in distributed shared memory
• For every local memory block, local directory
has an entry
• Directory entry indicates
–Who has cached copies of the block
–In what state do they have the block
86
Distributed-Memory Multiprocessor with the
directories added to each Node
87
Directory-Based Cache Coherence Protocols:
The Basics
• Just as with a snooping protocol, there are two primary operations that a directory
protocol must implement:
– handling a read miss and
– handling a write to a shared, clean cache block.
• Handling a write miss to a block that is currently shared is a simple combination of
these two.
• To implement these operations, a directory must track the state of each cache
block.
• In a simple protocol, these states could be the following:
– Shared—One or more nodes have the block cached, and the value in memory is up to date
(as well as in all the caches).
– Uncached—No node has a copy of the cache block.
– Modified—Exactly one node has a copy of the cache block, and it has written the block, so
the memory copy is out of date. The processor is called the owner of the block.
88
Basic Directory Scheme
• Each entry has
– One dirty bit (1 if there is a dirty cached copy)
– A presence vector (1 bit for each node) that tells which nodes may
have cached copies
• All misses sent to block’s home
• Directory performs needed coherence actions
• Eventually, directory responds with data
89
Read Miss
• Processor Pk has a read miss on block B, sends
request to home node of the block
• Directory controller
– Finds entry for B, checks D bit
– If D=0
• Read memory and send data back, set P[k]
– If D=1
• Request block from processor whose P bit is 1
• When block arrives, update memory, clear D bit,
send block to Pk and set P[k]
90
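A sketch of the read-miss handling just described, under simplifying assumptions (a single directory entry with a dirty bit and a presence vector; the class and variable names are invented for the illustration):

```python
# Directory entry for one memory block at its home node (illustrative).
class DirectoryEntry:
    def __init__(self, n_nodes, value=0):
        self.dirty = False                    # True if there is one dirty cached copy
        self.presence = [False] * n_nodes     # presence bit per node
        self.memory_value = value

def handle_read_miss(entry, caches, requester):
    """Pk has a read miss on this block and sends the request to the home node."""
    if not entry.dirty:
        # D = 0: memory is up to date; read it and remember the new sharer.
        entry.presence[requester] = True
        return entry.memory_value
    # D = 1: fetch the block from the single owner, update memory,
    # clear the dirty bit, and add the requester as a sharer.
    owner = entry.presence.index(True)
    entry.memory_value = caches[owner]        # owner supplies the dirty block
    entry.dirty = False
    entry.presence[requester] = True
    return entry.memory_value

caches = {2: 42}                              # node 2 holds a dirty copy (value 42)
entry = DirectoryEntry(n_nodes=4, value=7)
entry.dirty, entry.presence[2] = True, True

print(handle_read_miss(entry, caches, requester=0))   # 42 (fetched from node 2)
print(entry.dirty, entry.presence)                    # False [True, False, True, False]
```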
Directory Operation
• Network controller connected to each bus
– A proxy for remote caches and memories
• Requests for remote addresses forwarded to home,
responses from home placed on the bus
• Requests from home placed on the bus,
cache responses sent back to home node
• Each cache still has its own coherence state
– Directory is there just to avoid broadcasts
and order accesses to each location
• Simplest scheme:
If access A1 to block B still not fully processed by directory
when A2 arrives, A2 waits in a queue until A1 is done
91
92
MULTIPROCESSOR INTERCONNECTION NETWORKS
Multiprocessor Interconnection Networks
• Multiprocessors interconnection networks (INs) can be classified based on
a number of criteria. These include
– (1) mode of operation (synchronous versus asynchronous),
– (2) control strategy (centralized versus decentralized),
– (3) switching technique (circuit versus packet), and
– (4) topology (static versus dynamic).
• Mode of Operation
– According to the mode of operation, INs are classified as synchronous versus asynchronous.
• In synchronous mode of operation, a single global clock is used by all components in the system such
that the whole system is operating in a lock–step manner.
• Asynchronous mode of operation, on the other hand, does not require a global clock. Handshaking
signals are used instead in order to coordinate the operation of asynchronous systems.
– While synchronous systems tend to be slower compared to asynchronous systems, they are
race and hazard-free.
93
Multiprocessor Interconnection Networks…
• Control Strategy
– According to the control strategy, INs can be classified as centralized versus
decentralized.
• In centralized control systems, a single central control unit is used to oversee and
control the operation of the components of the system.
• In decentralized control, the control function is distributed among different
components in the system.
– The function and reliability of the central control unit can become the bottleneck
in a centralized control system. While the crossbar is a centralized system, the
multistage interconnection networks are decentralized.
• Switching Techniques
– Interconnection networks can be classified according to the switching
mechanism as circuit versus packet switching networks.
• In the circuit switching mechanism, a complete path has to be established prior to the
start of communication between a source and a destination. The established path will
remain in existence during the whole communication period.
• In a packet switching mechanism, communication between a source and destination
takes place via messages that are divided into smaller entities, called packets. On their
way to the destination, packets can be sent from one node to another in a store-and-
forward manner until they reach their destination.
– While packet switching tends to use the network resources more efficiently
compared to circuit switching, it suffers from variable packet delays.
94
Multiprocessor Interconnection Networks…
• Topology
– An interconnection network topology is a mapping function from the set of
processors and memories onto the same set of processors and memories.
In other words, the topology describes how to connect processors and
memories to other processors and memories.
– A fully connected topology, for example, is a mapping in which each
processor is connected to all other processors in the computer.
– A ring topology is a mapping that connects processor k to its neighbours,
processors (k – 1) and (k + 1).
– In general, interconnection networks can be classified as
• static versus dynamic networks.
– In static networks, direct fixed links are established among nodes to form a
fixed network, while
– In dynamic networks, connections are established as needed.
– Switching elements are used to establish connections among inputs and
outputs.
– Depending on the switch settings, different interconnections can be
established.
– Nearly all multiprocessor systems can be distinguished by their interconnection network topology.
95
Interconnection networks for
Shared Memory and Message Passing Systems.
• Shared memory
– Shared memory systems can be designed using bus-based or switch-based
INs.
– The simplest IN for shared memory systems is the bus. However, the bus
may get saturated if multiple processors are trying to access the shared
memory (via the bus) simultaneously.
– A typical bus-based design uses caches to solve the bus contention
problem.
– Other shared memory designs rely on switches for interconnection.
• For example, a crossbar switch can be used to connect multiple processors to
multiple memory modules.
96
[Figure: shared-memory interconnection networks: (a) bus-based and (b) switch-based.]
[Figure: single-bus and multiple-bus systems.]
Interconnection networks for
Shared Memory and Message Passing Systems…
• Message passing INs
– Message passing INs can be divided into static and dynamic.
• Static networks form all connections when the system is designed rather than when the
connection is needed. In a static network, messages must be routed along established links.
• Dynamic INs establish a connection between two or more nodes on the fly as messages are
routed along the links. The number of hops in a path from source to destination node is equal
to the number of point-to-point links a message must traverse to reach its destination.
– In either static or dynamic networks, a single message may have to hop through
intermediate processors on its way to its destination.
• Therefore, the ultimate performance of an interconnection network is greatly influenced by
the number of hops taken to traverse the network.
97
Figure Examples of static topologies.
Interconnection networks for
Shared Memory and Message Passing Systems…
98
Figure: Example dynamic INs: (a) single-stage, (b) multistage, and (c) crossbar switch.
• The single-stage interconnection network of Figure (a) is a simple
dynamic network that connects each of the inputs on the left side to
some, but not all, outputs on the right side through a single layer of
binary switches represented by the rectangles.
 The binary switches can direct the message on the left-side input to
one of two possible outputs on the right side.
Interconnection networks for
Shared Memory and Message Passing Systems…
• Figure (b). The Omega MIN (Multistage Interconnection Network) connects eight
sources to eight destinations.
– The connection from the source 010 to the destination 010 is shown as a bold path
– These are dynamic INs because the connection is made on the fly, as needed.
– In order to connect a source to a destination, we simply use a function of the bits of the
source and destination addresses as instructions for dynamically selecting a path through the
switches.
– For example, to connect source 111 to destination 001 in the omega network,
• the switches in the first and second stage must be set to connect to the upper output port,
• while the switch at the third stage must be set to connect to the lower output port (001).
• In general, when using k × k switches, an Omega MIN with N input-output ports requires
at least log_k N stages, each of which contains N/k switches, for a total of (N/k)(log_k N)
switches.
• Figure (c): the crossbar switch provides a path from any input or source to any other
output or destination by simply selecting a direction on the fly.
– To connect row 111 to column 001 requires only one binary switch at the intersection of the
111 input line and the 001 output line to be set.
• The crossbar switch clearly uses more binary switching components;
– for example, N^2 components are needed to connect N × N source/destination pairs.
99
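The destination-tag routing rule described above (the bits of the destination address select the output port at each stage) can be sketched for an 8 x 8 Omega network built from 2 x 2 switches; the link pattern between stages is assumed to be the perfect shuffle, which is the usual Omega construction:

```python
# Destination-tag routing through an 8x8 Omega network of 2x2 switches.
# At each of the log2(N) stages: apply the perfect shuffle, then let the
# switch send the message to its upper (0) or lower (1) output according
# to the next destination-address bit (illustrative sketch).

def omega_route(src, dst, n=8):
    bits = n.bit_length() - 1                     # log2(N) stages
    pos = src
    path = [pos]                                  # port index after each stage
    for stage in range(bits):
        # Perfect shuffle: rotate the port index left by one bit.
        pos = ((pos << 1) | (pos >> (bits - 1))) & (n - 1)
        # Switch setting: low bit becomes the destination bit for this stage
        # (MSB first) -> upper output if 0, lower output if 1.
        dst_bit = (dst >> (bits - 1 - stage)) & 1
        pos = (pos & ~1) | dst_bit
        path.append(pos)
    return path

print(omega_route(0b111, 0b001))   # [7, 6, 4, 1]: upper, upper, lower, as in the slide
print(omega_route(0b010, 0b010))   # the bold 010 -> 010 path from the figure
```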
Pros and Cons of Crossbar Switch and Omega MIN
• Pros
– Crossbar switch has potential for speed. In one clock, a connection can be made
between source and destination.
– The diameter of the crossbar is one.
• (Note: Diameter, D, of a network having N nodes is defined as the maximum shortest paths
between any two nodes in the network.)
– The omega MIN, on the other hand requires log N clocks to make a connection.
• The diameter of the omega MIN is therefore log N.
• Cons
– Both Crossbar Switch and Omega MIN networks limit the number of alternate paths
between any source/destination pair.
– This leads to limited fault tolerance and network traffic congestion.
– If the single path between pairs becomes faulty, that pair cannot communicate.
– If two pairs attempt to communicate at the same time along a shared path, one pair
must wait for the other.
• This is called blocking, and such MINs are called blocking networks.
• A network that can handle all possible connections without blocking is called a nonblocking
network.
100
Example Problem
• Example:
– Compute the cost of interconnecting 4096 nodes using a single crossbar switch
relative to doing so using a MIN built from 2 × 2, 4 × 4, and 16 × 16 switches.
Consider separately the relative cost of the unidirectional links and the relative
cost of the switches. Switch cost is assumed to grow quadratically with the
number of input (alternatively, output) ports, k, for k × k switches.
• Solution:
– The switch cost of the network when using a single crossbar is proportional to 4096².
– The unidirectional link cost is 8192, which accounts for the set of links from the
end nodes to the crossbar and also from the crossbar back to the end nodes.
– When using a MIN with k × k switches, the cost of each switch is proportional to k², but there are (4096/k) logk 4096 switches in total.
– Likewise, there are logk 4096 stages, each with N unidirectional links out of the switches, plus N links from the end nodes into the MIN.
– Therefore, the relative costs of the crossbar with respect to each MIN are given by the following:
101
Example Problem…
102
Relative cost (2 × 2) switches = 4096² / (2² × 4096/2 × log2 4096) = 170
Relative cost (4 × 4) switches = 4096² / (4² × 4096/4 × log4 4096) = 170
Relative cost (16 × 16) switches = 4096² / (16² × 4096/16 × log16 4096) = 85
Relative cost (2 × 2) links = 8192 / (4096 × (log2 4096 + 1)) = 2/13 ≈ 0.1538
Relative cost (4 × 4) links = 8192 / (4096 × (log4 4096 + 1)) = 2/7 ≈ 0.2857
Relative cost (16 × 16) links = 8192 / (4096 × (log16 4096 + 1)) = 2/4 = 0.5
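These figures can be reproduced with a few lines of Python. The snippet below is only a sketch of the cost model used in the example (quadratic switch cost, N links per stage plus N end-node links); the slide's values of 170, 170, and 85 are these switch-cost ratios rounded down to whole numbers.

```python
import math

def relative_costs(N: int, k: int):
    """Relative switch and link cost of an N x N crossbar versus an Omega MIN
    built from k x k switches, using the cost model of the example above."""
    stages = round(math.log(N, k))                  # log_k N stages
    min_switch_cost = (k ** 2) * (N // k) * stages  # k^2 per switch, N/k switches per stage
    min_link_cost = N * (stages + 1)                # N links per stage plus N end-node links
    return (N ** 2) / min_switch_cost, (2 * N) / min_link_cost

for k in (2, 4, 16):
    sw, ln = relative_costs(4096, k)
    print(f"{k:2d} x {k:<2d} switches: switch cost ratio = {sw:.1f}, link cost ratio = {ln:.4f}")
#  2 x 2  switches: switch cost ratio = 170.7, link cost ratio = 0.1538
#  4 x 4  switches: switch cost ratio = 170.7, link cost ratio = 0.2857
# 16 x 16 switches: switch cost ratio = 85.3, link cost ratio = 0.5000
```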
Example Problem…
• Conclusion
– In all cases, the single crossbar has much higher switch cost than the
MINs.
– The most dramatic reduction in switch cost comes from the MINs built from the smallest (but most numerous) switches; it is interesting that the MINs with 2 × 2 and 4 × 4 switches yield the same relative switch cost.
– The relative link cost of the crossbar is lower than that of the MINs, but by less than an order of magnitude in all cases.
– We must keep in mind that end node links are different from switch
links in their length and packaging requirements, so they usually have
different associated costs.
– Despite the lower link cost, the crossbar has higher overall relative cost.
103
Performance Comparison of some Dynamic INs
• In the table below, m represents the number of buses used, while N represents the number of processors (memory modules) or the number of inputs/outputs of the network.
104
Network Delay Cost (Complexity)
Bus O(N) O(1)
Multiple-bus O(mN) O(m)
Multistage INs (MINs) O(log N) O(N log N)
Table Performance Comparison of Some Dynamic INs
Performance Comparison of some Static
INs.
• The table below shows a performance comparison among a
number of static INs.
– In this table, the degree of a node, d, is defined as the number of links (channels) incident on the node; the degree of a network is the maximum node degree.
– The diameter of a network is defined as the maximum, over all pairs of nodes, of the length of the shortest path between them.
105
Network Degree Diameter Cost (No. of links)
Linear array 2 N – 1 N – 1
Binary tree 3 2(⌈log2 N⌉ – 1) N – 1
n-cube log2 N log2 N nN/2
2D-mesh 4 2(n – 1) 2(N – n)
(Here n = log2 N for the n-cube and n = √N for an n × n 2D-mesh.)
Table Performance Characteristics of Static INs
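A quick sanity check of the table is to evaluate the formulas for a concrete machine size. The sketch below is my own illustration (with n taken as log2 N for the hypercube and √N for the mesh, as noted above), printing the metrics for N = 64 nodes.

```python
import math

def static_in_metrics(topology: str, N: int):
    """(degree, diameter, number of links) for the static topologies in the
    table above; n = log2(N) for the hypercube, n = sqrt(N) for the 2D mesh."""
    n_cube = round(math.log2(N))   # hypercube dimension
    n_mesh = round(math.sqrt(N))   # side length of the 2D mesh
    return {
        "linear array": (2, N - 1, N - 1),
        "binary tree":  (3, 2 * (math.ceil(math.log2(N)) - 1), N - 1),
        "n-cube":       (n_cube, n_cube, n_cube * N // 2),
        "2D mesh":      (4, 2 * (n_mesh - 1), 2 * (N - n_mesh)),
    }[topology]

for topo in ("linear array", "binary tree", "n-cube", "2D mesh"):
    degree, diameter, links = static_in_metrics(topo, 64)
    print(f"{topo:12s}: degree={degree:2d}, diameter={diameter:3d}, links={links:3d}")
# linear array: degree= 2, diameter= 63, links= 63
# binary tree : degree= 3, diameter= 10, links= 63
# n-cube      : degree= 6, diameter=  6, links=192
# 2D mesh     : degree= 4, diameter= 14, links=112
```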
Multiprocessor.pptx

  • 1. CPE 806: ADVANCED COMPUTER ARCHITECTURE 1 MULTIPROCESSORS
  • 2. Multiprocessing: Flynn’s Taxonomy 2 • Flynn’s Taxonomy of Parallel Machines – How many Instruction (I) streams? – How many Data (D) streams? • Flynn’s classified architectures in terms of streams of data and instructions; – Stream of Instructions (SI): sequence of instructions executed by the computer. – Stream of Data (SD): sequence of data including input, temporary or partial results referenced by instructions.
  • 3. Flynn’s Taxonomy…. • Computer architectures are characterized by the multiplicity of hardware to serve instruction and data streams. 1. Single Instruction Single Data (SISD) 2. Single Instruction Multiple Data (SIMD) 3. Multiple Instruction Multiple Data (MIMD) 4. Multiple Instruction Single Data (MISD) 3
  • 4. Flynn’s Taxonomy: SISD • SISD: Single I Stream, Single D Stream – A uniprocessor von Neumann computers 4 Control Unit Processor (P) Memory (M) I/O Data Stream Instruction Stream Instruction Stream
  • 5. SIMD • SIMD: Single I, Multiple D Streams – Each “processor” works on its own data – But all execute the same instrs in lockstep – E.g. a vector processor or MMX • Consists of 2 parts – A front-end Von Neumann computer – A processor array: connected to the memory bus of the front end 5 Control Unit P1 M1 Pn Mn Instruction Stream Program loaded from front end Data Stream Data Stream Data loaded from front end
  • 6. SIMD Architecture 6 P1 P2 P3 Pn-1 Pn M1 M2 M3 Mn-1 Mn Interconnection Network Control Unit Scheme 1 Each processor has its own local memory P1 P2 P3 Pn-1 Pn M1 M2 M3 Mn-1 Mn Interconnection Network Control Unit Scheme 2 Processor and memory modules communicate with each other via interconnection network
  • 7. SIMD: Shared Memory and Not Shared 7 P1 P2 P3 Pn-1 Pn Shared Memory Interconnection Network Control Unit Control Unit P1 M1 Pn Mn Instruction Stream Program loaded from front end Data Stream Data Stream Data loaded from front end
  • 8. MIMD • MIMD: Multiple I, Multiple D Streams – Made of multiple processors and multiple memory modules connected together via some interconnection network – Each processor executes its own instructions and operates on its own data – This is your typical off-the-shelf multiprocessor (made using a bunch of “normal” processors) – Includes multi-core processors • 2 broad classes: – Shared memory – Message passing 8 Control Unit-1 P1 M1 Data Stream Instruction Stream Control Unit-n Pn Mn Data Stream Instruction Stream Instruction Stream Instruction Stream
  • 9. MIMD Multiprocessors 9 Centralized Shared Memory Distributed Memory • Multiprocessors  computers consisting of tightly coupled processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space. • Such systems exploit thread-level parallelism through two different software models. • Parallel processing • Request-level processing
  • 10. Flynn’s Taxonomy: MISD • MISD: Multiple I, Single D Stream – No processor has been produced using this taxonomy. 10
  • 11. Introduction • Goal: connecting multiple computers to get higher performance – Multiprocessors – Scalability, availability, power efficiency • Job-level (process-level) parallelism – High throughput for independent jobs • Parallel processing program – Single program run on multiple processors • Multicore microprocessors – Chips with multiple processors (cores) 11
  • 12. Multiprocessors • Why do we need multiprocessors? – Uniprocessor speed keeps improving • There are limits to which ILP can be increased – But there are things that need even more speed • Wait for a few years for Moore’s law to catch up? • Or use multiple processors and do it now? – Need for more computing power • Data intensive applications • Utility computing requires powerful processors • Multiprocessor software problem – Most code is sequential (for uniprocessors) • Much easier to write and debug – Parallel code required for effective and efficient utilization of all cores • But Correct parallel code very, very difficult to write – Efficient and correct is even harder – Debugging even more difficult (Heisenbugs) 12
  • 13. Multiprocessors • The main argument for using multiprocessors is to create powerful computers by simply connecting multiple processors. – A multiprocessor is expected to reach faster speed than the fastest single-processor system. – More cost-effective. • A multiprocessor consisting of a number of single processors is expected to be than building a high-performance single processor. – Fault tolerance. • If a processor fails, the remaining processors should be able to provide continued service, albeit with degraded performance. 13
  • 14. Two Models for Communication and Memory Architecture 1. Communication occurs by explicitly passing messages among the processors: – message-passing multiprocessors 2. Communication occurs through a shared address space (via loads and stores): – shared memory multiprocessors either • UMA (Uniform Memory Access time) for shared address, centralized memory MP • NUMA (Non Uniform Memory Access time multiprocessor) for shared address, distributed memory MP • In past, confusion whether “sharing”means sharing physical memory (Symmetric MP) or sharing address space 14
  • 15. Symmetric Shared-Memory Architectures • From multiple boards on a shared bus to multiple processors inside a single chip • Caches both – Private data are used by a single processor – Shared data are used by multiple processors 15
  • 16. Important ideas • Technology drives the solutions. – Multi-cores have altered the game!! – Thread-level parallelism (TLP) vs ILP. • Computing and communication deeply intertwined. – Write serialization exploits broadcast communication on the interconnection network or the bus connecting L1, L2, and L3 caches for cache coherence. • Access to data located at the fastest memory level greatly improves the performance. • Caches are critical for performance but create new problems – Cache coherence protocols: 1. Cache snooping  traditional multiprocessor 2. Directory based  multi-core processors 16
  • 17. Review of basic concepts • Cache  smaller, faster memory which stores copies of the data from frequently used main memory locations. • Cache writing policies – write-through  every write to the cache causes a write to main memory. – write-back  writes are not immediately mirrored to main memory. • Locations written are marked dirty and written back to the main memory only when that data is evicted from the cache. • A read miss may require two memory accesses: write the dirty location to memory and read new location from memory. • Caches are organized in blocks or cache lines. • Cache blocks consist of – Tag  contains (part of) address of actual data fetched from main memory – Data block – Flags  dirty bit, shared bit, • Broadcast networks  all nodes share a communication media and hear all messages transmitted, e.g., bus. 17
  • 18. Cache Coherence and Consistency • Coherence – Reads by any processor must return the most recently written value – Writes to the same location by any two processors are seen in the same order by all processors – Coherence defines behaviour of reads and writes to the same location, • Consistency – A read returns the last value written – If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A – Consistency defines behaviour of reads and writes to other locations 18
  • 19. Thread-level parallelism (TLP) • Distribute the workload among a set of concurrently running threads. • Uses MIMD model  multiple program counters • Targeted for tightly-coupled shared-memory multiprocessors • To be effective need n threads for n processors. • Amount of computation assigned to each thread = grain size – Threads can be used for data-level parallelism, but the overheads may outweigh the benefit • Speedup – Maximum speedup with n processors is n; embarrassingly parallel – The actual speedup depends on the ratio of parallel versus sequential portion of a program according to Amdahl’s law. 19
  • 20. TLP and ILP • The costs for exploiting ILP are prohibitive in terms of silicon area and of power consumption. • Multicore processor have altered the game – Shifted the burden for keeping the processor busy from the hardware and architects to application developers and programmers. – Shift from ILP to TLP • Large-scale multiprocessors are not a large market, they have been replaced by clusters of multicore systems. 20
  • 21. Multi-core processors • Cores are now the building blocks of chips. • Intel offers a family of processors based on the Nehalem architecture with a different number of cores and L3 caches 21
  • 22. MIMD Multiprocessors • Centralized Shared Memory • Distributed Memory 22
  • 23. Centralized-Memory Machines • Also “Symmetric Multiprocessors” (SMP) “Uniform Memory Access” (UMA) – All memory locations have similar latencies – Data sharing through memory reads/writes – P1 can write data to a physical address A, P2 can then read physical address A to get that data • Caching data – reduces the access time but demands cache coherence • Two distinct data states – Global state  defined by the data in main memory – Local state  defined by the data in local caches • In multi-core L3 cache is shared; L1 and L2 caches are private • Problem: Memory Contention – All processor share the one memory – Memory bandwidth becomes bottleneck – Used only for smaller machines • Most often 2,4, or 8 processors 23 1 or more levels of cache 1 or more levels of cache 1 or more levels of cache 1 or more levels of cache Processor Processor Processor Processor Shared cache Main Memory I/O system Private caches
  • 24. Shared Memory Pros and Cons • Pros – Communication happens automatically – More natural way of programming • Easier to write correct programs and gradually optimize them – No need to manually distribute data (but can help if you do) • Cons – Needs more hardware support – Easy to write correct, but inefficient programs (remote accesses look the same as local ones) 24
  • 25. MIMD: Distributed-Memory Machines • Two kinds – Distributed Shared-Memory (DSM) • All processors can address all memory locations • Data sharing like in SMP • Also called NUMA (non-uniform memory access) • Latencies of different memory locations can differ (local access faster than remote access) – Message-Passing • A processor can directly address only local memory • To communicate with other processors, must explicitly send/receive messages • Also called multicomputers or clusters • Most accesses local, so less memory contention (can scale to well over 1000 processors) 25 Multicore Processor + Cache Interconnection Network I/O Memory Memory Memory Memory I/O Memory Multicore Processor + Caches Multicore Processor + Cache Multicore Processor + Cache Multicore Processor + Caches Memory Multicore Processor + Caches Memory Multicore Processor + Caches Memory Multicore Processor + Caches I/O I/O I/O I/O I/O I/O
  • 26. Distributed Shared-Memory Multiprocessor… • Two major benefits: – It is a cost-effective way to scale the memory bandwidth if most of the accesses are to local memory in the node. – It reduces the latency for accesses to the local memory. • Two key disadvantages: – Communicating data between processors becomes more complex. – It requires more effort in the software to take advantage of the increased memory bandwidth afforded by distributed memories 26
  • 27. Message-Passing Machines • A cluster of computers – Each with its own processor and memory – An interconnect to pass messages between them – Producer-Consumer Scenario: • P1 produces data D, uses a SEND to send it to P2 • The network routes the message to P2 • P2 then calls a RECEIVE to get the message – Two types of send primitives • Synchronous: P1 stops until P2 confirms receipt of message • Asynchronous: P1 sends its message and continues – Standard libraries for message passing: Most common is MPI – Message Passing Interface 27
  • 28. Communication Performance • Metrics for Communication Performance – Communication Bandwidth – Communication Latency • Sender overhead + transfer time + receiver overhead – Communication latency hiding • Characterizing Applications – Communication to Computation Ratio • Work done vs. bytes sent over network • Example: 146 bytes per 1000 instructions 28
  • 29. Message Passing Pros and Cons • Pros – Simpler and cheaper hardware – Explicit communication makes programmers aware of costly (communication) operations • Cons – Explicit communication is painful to program – Requires manual optimization • If you want a variable to be local and accessible via LD/ST, you must declare it as such • If other processes need to read or write this variable, you must explicitly code the needed sends and receives to do this 29
  • 30. Parallel Processing Performance • Challenges of Parallel Processing: – First challenge is % of program inherently sequential – Suppose 80x speedup from 100 processors. What fraction of original program can be sequential? • (a) 10% (b) 5% (c) 1% (d) <1% – Assume that the program operates in only two modes: • Parallel with all processors fully used (enhanced mode) • Serial with only one processor in use 30 Amdahl’s Law Provides solution
  • 31. Amdahl’s Law Provides solution 31 Need sequential part to be 0.0125% of original time. Sequential part can limit speedup
  • 32. Second Challenge: Long Latency to Remote Memory • Suppose 32 CPU MP, 2GHz, 200 ns to handle reference to a remote memory, all local accesses hit memory hierarchy and base CPI is 0.5. (Remote request cost = 200/0.5 = 400 clock cycles.) • What is performance impact if 0.2% instructions involve remote access? – (a) 1.5X (b) 2.0X (c) 2.5X 32
  • 33. CPI Equation • CPI = Base CPI + Remote request rate x Remote request cost • Cycle time = • CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3 • No communication is 1.3/0.5 or 2.6 faster than 0.2% instructions involved in remote access • In practice, the performance analysis is much more complex, since – Some fraction of the non-communication references will miss in the local hierarchy – Remote access time does not have a single constant value. 33 1 Cycle Cycle = 1 2 GHz = 0.5 ns Remote access cost Cycle Time 200 ns 0.5 ns = = 400 • Remote request cost =
  • 34. Challenge: Scaling Example • Suppose you want to perform two sums: – one is a sum of two scalar variables and – one is a matrix sum of a pair of two-dimensional arrays, size 1000 by 1000. What speedup do you get with 1000 processors? • Solution: – If we assume performance is a function of the time for an addition, t , then there is 1 addition that does not benefit from parallel processors and 1,000,000 additions that do. – If the time before (for single processor) is: 1,000,000t + 1t = 1000,001t • Execution time after improvement 34 Execution time affected by improvement Amount of improvement + Execution time unaffected Execution time after = improvement
  • 35. Challenge: Scalar and Matrix Addition 35 Execution time after improvement = 1,000,000t 1,000 + 1t = 1001t Speedup is then Speedup = 1,000,001t 1,001t = 999 Even if the sequential portion expanded to 100 sums of scalar variables versus one sum of a pair of 1000 by 1000 arrays, the speedup would still be 909.
  • 36. Scaling Example... • What if matrix size is 100 x 100? – Single processor: Time= (10 + 10000) x tadd – 10 processors • Time = 10 x tadd + 10000/10 x tadd = 1010 x tadd • Speedup = 10010/1010 = 9.9 (99% of potential) – 100 processors • Time = 10 x tadd + 10000/100 x tadd = 110 x tadd • Speedup = 10010/110 = 9.1 (91% of potential) Assuming load balanced 36
  • 37. Symmetric Shared-Memory Architectures Cache Coherence Problem • Shared memory easy with no caches – P1 writes, P2 can read – Only one copy of data exists (in memory) • Caches store their own copies of the data – Those copies can easily get inconsistent – Classic example: adding to a sum • P1 loads allSum, adds its mySum, stores new allSum • P1’s cache now has dirty data, but memory not updated • P2 loads allSum from memory, adds its mySum, stores allSum • P2’s cache also has dirty data • Eventually P1 and P2’s cached data will go to memory • Regardless of write-back order, the final value ends up wrong 37
  • 38. Cache Coherence Problem… 38 P1 P2 Memory Allsum: 0 Allsum: 5 Allsum:12 1 Allsum: Allsum + mysum2 (12) Allsum: Allsum + mysum1 (5) 2 All Processes accessing main memory may see very stale value Alllsum:
  • 39. Cache Coherence Definition • A memory system is coherent if 1. Preserve Program Order: A read R from address X on processor P1 returns the value written by the most recent write W to X on P1 if no other processor has written to X between W and R. 1. This property simply preserves program order—we expect this property to be true even in uniprocessors. 39 Figure The cache coherence problem for a single memory location (X), read and written by two processors (A and B). Time Event Cache contents for processor A Cache contents for processor B Memory contents for location X 0 1 1 Processor A reads X 1 1 2 Processor B reads X 1 1 1 3 Processor A stores 0 into X 0 1 0
  • 40. Cache Coherence Definition… 2. Coherent view of Memory: If P1 writes to X and P2 reads X after a sufficient time, and there are no other writes to X in between, P2’s read returns the value written by P1’s write. • The second property defines the notion of what it means to have a coherent view of memory: – If a processor could continuously read an old data value, we would clearly say that memory was incoherent. 3. Write Serialization: Writes to the same location are serialized. Two writes to location X are seen in the same order by all processors. • For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1. 40 Coherence defines behaviour of reads and writes to the same location
  • 41. Write Consistency For now assume 1. A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write 2. The processor does not change the order of any write with respect to any other memory access if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A • These restrictions allow the processor to reorder reads, but forces the processor to finish writes in program order 41 Consistency defines behaviour of reads and writes to other locations
  • 42. Basic Schemes for Enforcing Coherence • Migration – data can be moved to a local cache and used there in a transparent fashion – Reduces both latency to access shared data that is allocated remotely and bandwidth demand on the shared memory • Replication – for reading shared data simultaneously, since caches make a copy of data in local cache – Reduces both latency of access and contention for read shared data 42
  • 43. Maintaining Cache Coherence • Hardware schemes – Shared Caches • Trivially enforces coherence • Not scalable (L1 cache quickly becomes a bottleneck) – Snooping • Needs a broadcast network (like a bus) to enforce coherence • Each cache that has a block tracks its sharing state on its own – Directory • Can enforce coherence even with a point-to-point network • A block has just one place where its full sharing state is kept – All information about the blocks is kept in the directory • SMP: one centralised directory is provided in the outermost cache for multi- core systems • DSM: Directory is distributed. Each node maintains its own directory which tracks the sharing information of every cache line in the node 43
  • 44. Maintaining Cache Coherence: Two Classes of Protocols in Use Cache coherence Protocols • Directory based – The sharing status of a block of physical memory is kept in just one location, called the directory; – Directory-based coherence has slightly higher implementation overhead than snooping, but it can scale to larger processor counts. • The Sun T1 design uses directories, albeit with a central physical memory. • Snooping – Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, but no centralized state is kept. – The caches are all accessible via some broadcast medium (a bus or switch), and all cache controllers monitor or snoop on the medium • To determine whether or not they have a copy of a block that is requested on a bus or switch access. 44
  • 45. Snoopy Cache-Coherence Protocols • Cache Controller “snoops”all transactions on the shared medium (bus or switch) – relevant transaction if for a block it contains – take action to ensure coherence • » invalidate, update, or supply value – depends on state of the block and the protocol • Either get exclusive access before write via write invalidate or update all copies on write 45 Pn P1 Memory cache Bus snoop I/O devices cache Cache-memory transaction Data State Address
  • 46. Communication between private and shared caches • Multi-core processor  a bus connects private L1 and L2 instruction (I) and data (D) caches to the shared L3 cache. • To invalidate a cached item the processor changing the value must – first acquire the bus and – then place the address of the item to be invalidated on the bus. • DSM  Locating the value of an item is harder for – write-back caches • because the current value of the item can be in the local caches of another processor. 46
  • 47. Snooping Protocol • Typically used for bus-based (SMP) multiprocessors – Serialization on the bus used to maintain coherence property 3 • Two flavors – Write-update (write broadcast) • A write to shared data is broadcast to update all copies • All subsequent reads will return the new written value (property 2) • All see the writes in the order of broadcasts One bus == one order seen by all (property 3) – Write-invalidate • Write to shared data forces invalidation of all other cached copies • Subsequent reads miss and fetch new value (property 2) • Writes ordered by invalidations on the bus (property 3) 47
  • 48. Write Invalidate: Example • Write invalidate  on write, invalidate all other copies. – Used in modern microprocessors – Example: a write-back cache during read misses of item X, processors A and B. Once A writes X it invalidates the B’s cache copy of X 48 Processor activity Bus activity Contents of processor A’s cache Contents of processor B’s cache Contents of memory location X 0 Processor A reads X Cache miss for X 0 0 Processor B reads X Cache miss for X 0 0 0 Processor A writes a 1 to X Invalidation for X 1 0 Processor B reads X Cache miss for X 1 1 1 For a write, we require that the writing processor have exclusive access, preventing any other processor from being able to write simultaneously. An example of an invalidation protocol working on a snooping bus for a single cache block (X) with write-back caches.
  • 49. Write Invalidate: Example... • An example of an invalidation protocol working on a snooping bus for a single cache block (X) with write-back caches • We assume that neither cache initially holds X and that the value of X in memory is 0. • The CPU and memory contents show the value after the processor and bus activity have both completed. • A blank indicates no activity or no copy cached. • When the second miss by B occurs, CPU A responds with the value cancelling the response from memory. – In addition, both the contents of B’s cache and the memory contents of X are updated. • This update of memory, which occurs when a block becomes shared, is typical in most protocols and simplifies the protocol. 49
  • 50. Example: Write-through Invalidate 50 • Must invalidate before step 3 • Write update uses more broadcast medium bandwidth all recent MPUs use write invalidate Exclusive access ensures that no other readable or writable copies of an data exist when the write occurs
  • 51. Write Update: Example • An example of a write update or broadcast protocol working on a snooping bus for a single cache block (X) with write-back caches. • We assume that neither cache initially holds X and that the value of X in memory is 0. 51 Processor activity Bus activity Contents of processor A’s cache Contents of processor B’s cache Contents of memory location X 0 Processor A reads X Cache miss for X 0 0 Processor B reads X Cache miss for X 0 0 0 Processor A writes a 1 to X Write Broadcast of X 1 1 1 Processor B reads X No bus activity 1 1 1
  • 52. Write Update: Example... • The CPU and memory contents show the value after the processor and bus activity have both completed. • A blank indicates no activity or no copy cached. • When CPU A broadcasts the write, both the cache in CPU B and the memory location of X are updated. • In the second read, processor B finds the updated value of X and therefore there is no bus activity. 52
  • 53. Update vs. Invalidate • A burst of writes by a processor to one address – Update: each sends an update – Invalidate: possibly only the first invalidation is sent • Writes to different words of a block – Update: update sent for each word – Invalidate: possibly only the first invalidation is sent • Producer-consumer communication latency – Update: producer sends an update, • consumer reads new value from its cache – Invalidate: producer invalidates consumer’s copy, • consumer’s read misses and has to request the block • Which is better depends on application – But write-invalidate is simpler and implemented in most MP-capable processors today. 53
  • 54. Implementation of cache Invalidate • The key to implementing an invalidate protocol in a multicore is – the use of the bus, or another broadcast medium, to perform invalidates. – All processors snoop on the bus. • To invalidate the processor changing an item – acquires the bus and – broadcasts the address to be invalidated on the bus. • If two processors attempt to change at the same time the bus arbitrator allows access to only one of them. – All coherence schemes require some method of serializing accesses to the same cache block, either by serializing access to the communication medium or another shared structure. 54
  • 55. Implementation of cache Invalidate… • How to find the most recent value of a data item – Write-through cache  the value is in memory but write buffers could complicate the scenario. – Write-back cache  harder problem, the item could be in the private cache of another processor. • A block of cache has extra state bits – Valid bit – indicates if the block is valid or not – Dirty bit - indicates if the block has been modified – Shared bit – cache block is shared with other processors • If a processor finds that it has a dirty copy of the requested cache block, it provides that cache block in response to the read request and causes the memory (or L3) access to be aborted. 55
  • 56. Implementation of cache Invalidate… • When a write to a block in the shared state occurs, – the cache generates an invalidation on the bus and marks the block as exclusive. – No further invalidations will be sent by that core for that block. – The core with the sole copy of a cache block is normally called the owner of the cache block. • When an invalidation is sent, – the state of the owner’s cache block is changed from shared to unshared (or exclusive). – If another processor later requests this cache block, the state must be made shared again 56
  • 57. Locate up-to-date copy of data • For a write-through cache – Get up-to-date copy from memory (Since all written data are always sent to the memory, from which the most recent values of a data item can always be fetched.) – Write through simpler if enough memory bandwidth is available – Use of write through simplifies the implementation of cache coherence. • For a write-back cache – Most recent copy can be in a cache rather than in memory – The problem of finding the most recent data value is harder 57
  • 58. Locate up-to-date copy of data… • Write-back caches can use the same snooping scheme both for cache misses and for writes: – Each processor snoops every address placed on the bus. – If a processor has dirty copy of the requested cache block, it provides it in response to the read request and aborts the memory access. – Complexity comes from having to retrieve the cache block from a processor’s cache, which can take longer than retrieving it from the shared memory if the processors are in separate chips. • Write-back needs lower memory bandwidth – ⇒ Support larger numbers of faster processors – ⇒ Most multiprocessors use write-back 58
  • 59. Cache Resources for Write-Back Snooping • Normal cache tags can be used for snooping • Valid bit for each block makes invalidation easy • Read misses easy since rely on snooping • Writes Need to know if whether any other copies of the block are cached – No other copies No need to place write on bus in a write-back cache (reduce both the time taken by the write and the required bandwidth) – Other copies Need to place invalidate on bus 59 Index Block Address Tag Block Offset
  • 60. Cache Resources for Write-Back Snooping… • To track whether a cache block is shared, add extra state bit associated with each cache block, like valid bit and dirty bit – Write to shared block ⇒ Need to generates an invalidation on the bus and marks the state of the block as exclusive. – Otherwise, no further invalidations will be sent by that processor for that block – The processor with the sole copy of a cache block is normally called the owner of the cache block – When invalidation is sent, the state of the owner’s cache block is changed from shared to exclusive. – If another processor later requests this cache block, the state must be made shared again. 60
  • 61. Cache Behaviour in Response to Bus • Every bus transaction must check the cache address tags – could potentially interfere with processor cache accesses • A way to reduce interference is to duplicate tags – One set for caches access, one set for bus accesses • The interference can also be reduced in a multilevel cache by directing the snoop request to the L2 cache – Since L2 less heavily used than L1 (the processor uses only when it has a miss in the L1 cache) ⇒ Every entry in the L1 cache must be present in the L2 cache, called the inclusion property – If Snoop gets a hit in L2 cache, then it must arbitrate for the L1 cache to update the state and possibly retrieve the data, which usually requires a stall of the processor 61
  • 62. Example: Write Back MSI Snooping Protocol • Snooping coherence protocol is usually implemented by incorporating a finite‐state controller in each node • There is only one finite-state machine per cache, with stimuli coming either from the attached processor or from the bus • Logically, think of a separate controller associated with each cache block – That is, snooping operations or cache requests for different blocks can proceed independently • In implementations, a single controller allows multiple operations to distinct blocks to proceed in interleaved fashion – that is, one operation may be initiated before another is completed, even through only one cache access or one bus access is allowed at time 62
  • 63. Example: Write Back MSI Snooping Protocol… • Processor only observes state of memory system by issuing memory operations • Assume bus transactions and memory operations are atomic and a one‐level cache – all phases of one bus transaction complete before next one starts – processor waits for memory operation to complete before issuing next – with one‐level cache, assume invalidations applied during bus transaction • All writes go to bus + atomicity – Writes serialized by order in which they appear on bus (bus order) => invalidations applied to caches in bus order • How to insert reads in this order? – Important since processors see writes through reads, so determines whether write serialization is satisfied – But read hits may happen independently and do not appear on bus or enter directly in bus order 63
  • 64. Example: Write Back MSI Snooping Protocol • Invalidation protocol, write‐back cache – Snoops every address on bus – If it has a dirty copy of requested block, provides that block in response to the read request and aborts the memory access • State of block B in cache C can be – Invalid: B is not cached in C • To read or write, must make a request on the bus – Modified: B is dirty in C • C has the block, no other cache has the block, and C must update memory when it displaces B • Can read or write B without going to the bus – Shared: B is clean in C • C has the block, other caches have the block, and C need not update memory when it displaces B • Can read B without going to bus • To write, must send an upgrade request to the bus • Read misses: cause all caches to snoop bus • Writes to clean blocks are treated as misses 64 note that the modified state implies that the block is exclusive
  • 65. Write‐Back State Machine ‐ Processor Request 65 Transition Arcs: The stimulus causing a state change is shown on the transition arcs in Blue Bus actions generated as part of the state transition are shown on the transition arc in Bold.
  • 66. Write‐Back State Machine ‐ Processor Request… 66 Finite-state transition diagram for a single private cache block using a write invalidation protocol and a write-back cache Invalid Exclusive (read/write) Shared (read only) CPU read hit CPU read miss Place read miss on bus CPU read CPU write CPU write hit CPU read hit CPU write miss Place read miss on bus Place write miss on bus Write-back cache block Place write miss on bus • Any transition to the Exclusive state (which is required for a processor to write to the block) requires an invalidate or write miss to be placed on the bus,  causing all local caches to make the block invalid.  In addition, if some other local cache had the block in Exclusive state, that local cache generates a write-back, which supplies the block containing the desired address. Cache block States:  Invalid  Shared and  Exclusive (Modified) Cache state transitions based on requests from CPU
  • 67. Write‐Back State Machine ‐ Bus Request 67 Finite-state transition diagram for a single private cache block using a write invalidation protocol and a write-back cache • If a read miss occurs on the bus to a block in the exclusive state,  the local cache with the exclusive copy changes its state to shared. Invalid Exclusive (read/write) Shared (read only) CPU read miss write miss for this block Write miss For this block Invalidate for this block Write-back; block abort memory access Request Source State of addressed cache block Type of Cache action Function and explanation Read miss Bus Shared No action Allow shared cache or memory to service read miss Read miss Bus Modified Coherence Attempt to share data: place cache block on bus and change state to shared. invalidate Bus Shared Coherence Attempt to write shared block; invalidate the block Write miss Bus Shared Coherence Attempt to write shared block; invalidate the cache block Write miss Bus Modified Coherence Attempt to write block that is exclusive elsewhere; write-back the block and make its state invalid in the local cache Cache state transitions based on requests from Bus
  • 68. Combined Cache Coherence State Diagram for both Processor and Bus Requests 68 Invalid Exclusive (read/write) Shared (read only) CPU read hit CPU read miss Place read miss on bus CPU write CPU write hit CPU read hit CPU write miss Write miss for block CPU read Place read miss on bus Place write miss on bus Write-back cache block Place write miss on bus Invalidate for this block Write miss for this block Write-back block Transition Arcs:  Local Processor induced transition in Black  Bus activities induced transition in Blu  Activities on transition in Red
  • 69. Example • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 69 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl P1 Read Ai P2 Read A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Bus Memory
  • 70. Example: Step 1 • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 70 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Processor Addr Value Addr Value P1: Write 10 to A1 Excl A1 10 WrMs P1 A1 P1 Read Ai P2 Read A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Bus Memory • Active arrow =
  • 71. Example: Step 2 • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 71 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl A1 10 WrMs P1 A1 P1 Read A1 Excl A1 10 P2 Read A1 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Bus Memory • Active arrow =
  • 72. Example: Step 3 • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 72 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl A1 10 WrMs P1 A1 P1 Read A1 Excl A1 10 P2 Read A1 Shar. A1 RdMs P2 A1 A1 10 Shar A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 P2: Write 40 to A2 Processor 1 Processor 2 Bus Memory • Active arrow =
  • 73. Example: Step 4 • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 73 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl A1 10 WrMs P1 A1 P1 Read A1 Excl A1 10 P2 Read A1 Shar. A1 RdMs P2 A1 A1 10 Shar A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 Inv. Excl. A1 20 Inv. P2 A1 A1 10 P2: Write 40 to A2 Processor 1 Processor 2 Bus Memory • Active arrow =
  • 74. Example: Step 5 • Assume: – initial cache state is invalid and – addresses A1 and A2 map to same cache block, • but A1 != A2 74 P1 P2 Bus Memory Step State Addr Value State Addr Value Action Proc. Addr Value Addr Value P1: Write 10 to A1 Excl A1 10 WrMs P1 A1 P1 Read A1 Excl A1 10 P2 Read A1 Shar. A1 RdMs P2 A1 A1 10 Shar A1 10 WrBk P1 A1 10 A1 10 Shar. A1 10 RdDa P2 A1 10 A1 10 P2: Write 20 to A1 Inv. Excl. A1 20 Inv. P2 A1 A1 10 P2: Write 40 to A2 WrBk P2 A1 20 A1 20 Excl. A2 40 WrMs P2 A2 A1 20 Processor 1 Processor 2 Bus Memory • Active arrow =
  • 75. Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols • As the number of processors in a multiprocessor grows, or as the memory demands of each processor grow, any centralized resource in the system can become a bottleneck. • As processors have increased in speed in the last few years, the number of processors that can be supported on a single bus or by using a single physical memory unit has fallen • Single memory accommodate all CPUs – Multiple memory banks • Bus-based multiprocessor, bus must support both coherence traffic & normal memory traffic – Multiple buses or interconnection networks (cross bar or small point-to-point) 75 InIn such designs, the memory system can be configured into multiple physical banks, so as to boost the effective memory bandwidth while retaining uniform access time to memory 1 or more levels of cache 1 or more levels of cache 1 or more levels of cache 1 or more levels of cache Processor Processor Processor Processor Interconnection Network Memory I/O system Memory Memory Memory
  • 76. Cache Performance • Cache performance is combination of 1. Behaviour of uniprocessor cache miss traffic 2. Traffic caused by communication • Results in invalidations and subsequent cache misses – Changing the processor count, cache size, and block size can affect these two components of the miss rate in different ways. • Uniprocessor miss rate: – Can be broken down into: • Compulsory, • Capacity and • Conflict misses 76
  • 77. Cache Performance… • Compulsory miss: – The very first access to a block cannot be in the cache. • Capacity miss: – The cache cannot contain all the blocks needed during execution of a program, capacity miss will occur because of blocks being discarded and later retrieved. • Conflict miss: – If the block placement strategy is set associative or direct mapped, conflict misses will occur because a block may be discarded and later retrieved if too many blocks map to its set. 77
  • 78. Coherency Misses The misses that arise from inter-processor communication, which are often called coherence misses, can be broken into two separate sources. 1. True sharing misses arise from the communication of data through the cache coherence mechanism – In an invalidation-based protocol, the 1st write to a shared block causes an invalidation to establish ownership of the block. – When another processor attempts to read a modified word in the cache block, a miss occurs and the block is transferred. 2. False sharing misses when a block is invalidated because some word in the block, other than the one being read, is written into – Invalidation does not cause a new value to be communicated, but only causes an extra cache miss – Block is shared, but no word in block is actually shared  miss would not occur if block size were 1 word 78
  • 79. Example: True v. False Sharing v. Hit? • Assume x1 and x2 in same cache block and in shared state • P1 and P2 both read x1 and x2 before. 79
  Time | P1       | P2       | True, False, Hit? Why?
  1    | Write x1 |          | True miss: invalidate x1 in P2
  2    |          | Read x2  | False miss: x1 irrelevant to P2
  3    | Write x1 |          | False miss: x1 irrelevant to P2
  4    |          | Write x2 | False miss: x1 irrelevant to P2
  5    | Read x2  |          | True miss: invalidate x2 in P1
  • 80. Classifications by Time Step 1. This event is a true sharing miss, since x1 was read by P2 and needs to be invalidated from P2. 2. This event is a false sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2. 3. This event is a false sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1. The cache block containing x1 will be in the shared state after the read by P2; a write miss is required to obtain exclusive access to the block. In some protocols this will be handled as an upgrade request, which generates a bus invalidate, but does not transfer the cache block. 4. This event is a false sharing miss for the same reason as step 3. 5. This event is a true sharing miss, since the value being read was written by P2. 80
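  Steps 2 to 4 are misses only because x1 and x2 happen to share a block. A common software-side remedy is to pad shared data so that independently written words fall on different cache lines. The sketch below is purely illustrative (the 64-byte line size, the struct layout, and the field names are my assumptions, not from the slides); it only computes which cache line each field would occupy.

```python
import ctypes

LINE = 64                                  # assumed cache-line size in bytes

class Packed(ctypes.Structure):
    # x1 and x2 end up in the same 64-byte line -> false sharing is possible
    _fields_ = [("x1", ctypes.c_int64), ("x2", ctypes.c_int64)]

class Padded(ctypes.Structure):
    # padding pushes x2 into the next line -> writes to x1 no longer invalidate x2
    _fields_ = [("x1", ctypes.c_int64),
                ("_pad", ctypes.c_char * (LINE - ctypes.sizeof(ctypes.c_int64))),
                ("x2", ctypes.c_int64)]

def line_of(struct_type, field_name):
    return getattr(struct_type, field_name).offset // LINE

for t in (Packed, Padded):
    print(t.__name__, "x1 in line", line_of(t, "x1"), "- x2 in line", line_of(t, "x2"))
```

  With the padded layout, a write to x1 by P1 no longer invalidates the line holding x2 in P2, so the false sharing misses in steps 2 to 4 disappear, at the cost of some extra memory.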
  • 81. Cache to Cache transfers • Problem – P1 has block B in M state – P2 wants to read B, puts a RdReq on bus – If P1 does nothing, memory will supply the data to P2 – What does P1 do? • Solution 1: abort/retry – P1 cancels P2’s request, issues a write back – P2 later retries RdReq and gets data from memory – Too slow (two memory latencies to move data from P1 to P2) • Solution 2: intervention – P1 indicates it will supply the data (“intervention” bus signal) – Memory sees that, does not supply the data, and waits for P1’s data – P1 starts sending the data on the bus, memory is updated – P2 snoops the transfer during the write-back and gets the block 81
  • 82. Cache to Cache transfers… • Intervention works if some cache has the data in M state – Nobody else has the correct data, so it is clear who supplies it • What if a cache has the requested data in S state? – There might be others who also have it; who should supply the data? – Solution 1: let memory supply the data – Solution 2: whoever wins arbitration supplies the data – Solution 3: a separate state, similar to S, indicating that others may also hold the block in S state, but that this cache should supply the data if anybody asks for it 82
  • 83. Extensions to the Basic Coherence Protocol • We have just considered a coherence protocol with 3 states: Modified, Shared, Invalid (MSI) • There are many extensions of MSI – With additional states and transactions, which optimise certain behaviours, possibly resulting in improved performance. • Two of the most common extensions are: MESI and MOESI 83
  • 84. MESI (Modified, Exclusive, shared & Invalid) • MESI adds the state Exclusive (E) to the basic MSI protocol. • Exclusive indicates when a cache block is resident only in a single cache but is clean • If a block is in the E state, it can be written without generating any invalidates, which optimizes the case where a block is read by a single cache before being written by that same cache. • Of course, when a read miss to a block in the E state occurs, the block must be changed to the S state to maintain coherence. – Because all subsequent accesses are snooped, it is possible to maintain the accuracy of this state. – In particular, if another processor issues a read miss, the state is changed from exclusive to shared • Pros of adding E state: – subsequent write to a block in the exclusive state by the same core need not acquire bus access or generate an invalidate, since the block is known to be exclusively in this local cache; the processor merely changes the state to modified. • The Intel i7 uses a variant of a MESI protocol, called MESIF, which adds a state (Forward) to designate which sharing processor should respond to a request. – It is designed to enhance performance in distributed memory organizations. 84
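  As a rough illustration of how the E state removes the extra invalidate, the table-driven sketch below encodes typical MESI transitions for a single cache line. It is my own summary, not the lecture's code, and the exact events and actions differ between real implementations.

```python
# Typical MESI transitions for one cache line, keyed by (current state, event).
MESI = {
    ("I", "read_miss_no_sharers"): ("E", "fetch block from memory"),
    ("I", "read_miss_sharers"):    ("S", "fetch block; other copies stay S"),
    ("I", "write_miss"):           ("M", "fetch block and invalidate other copies"),
    ("E", "processor_read"):       ("E", "hit, no bus traffic"),
    ("E", "processor_write"):      ("M", "hit, no invalidate needed"),   # the benefit of E
    ("E", "bus_read_miss"):        ("S", "another cache now shares the block"),
    ("E", "bus_write_miss"):       ("I", "drop copy, no write-back needed"),
    ("S", "processor_read"):       ("S", "hit"),
    ("S", "processor_write"):      ("M", "send bus invalidate (upgrade)"),
    ("S", "bus_invalidate"):       ("I", "drop copy"),
    ("M", "processor_read"):       ("M", "hit"),
    ("M", "processor_write"):      ("M", "hit"),
    ("M", "bus_read_miss"):        ("S", "write back / supply the block"),
    ("M", "bus_write_miss"):       ("I", "write back, then drop copy"),
}

# A block read by one core and then written by that same core never generates
# an invalidate, because the read left it in E rather than S:
state = "I"
for event in ("read_miss_no_sharers", "processor_write"):
    state, action = MESI[(state, event)]
    print(f"{event:>22} -> {state}  ({action})")
```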
  • 85. MOESI (Modified, Owned, Exclusive, Shared & Invalid) • MOESI adds the state Owned to the MESI protocol to indicate that the associated block is owned by that cache and out-of-date in memory. • In MSI and MESI protocols, when there is an attempt to share a block in the Modified state, the state is changed to Shared (in both the original and newly sharing cache), and the block must be written back to memory. • In a MOESI protocol, the block can be changed from the Modified to Owned state in the original cache without writing it to memory. • Other caches, which are newly sharing the block, keep the block in the Shared state; the O state, which only the original cache holds, indicates that the main memory copy is out of date and that the designated cache is the owner. • The owner of the block must supply it on a miss, since memory is not up to date and must write the block back to memory if it is replaced. • The AMD Opteron uses the MOESI protocol. 85
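  A minimal sketch (assumed behaviour, summarising the text above) of the difference the O state makes when another node's read miss finds the block Modified locally:

```python
def on_bus_read_to_modified(protocol):
    """What the owning cache does when it snoops a read miss to its M block."""
    if protocol == "MESI":
        return {"owner's new state": "S", "write back to memory now": True,
                "later misses supplied by": "memory"}
    if protocol == "MOESI":
        return {"owner's new state": "O", "write back to memory now": False,
                "later misses supplied by": "the owning cache"}

for p in ("MESI", "MOESI"):
    print(p, on_bus_read_to_modified(p))
```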
  • 86. Directory-Based Coherence Protocol • Typically in distributed shared memory • For every local memory block, local directory has an entry • Directory entry indicates –Who has cached copies of the block –In what state do they have the block 86
  • 87. Distributed-Memory Multiprocessor with the directories added to each Node 87
  • 88. Directory-Based Cache Coherence Protocols: The Basics • Just as with a snooping protocol, there are two primary operations that a directory protocol must implement: – handling a read miss and – handling a write to a shared, clean cache block. • Handling a write miss to a block that is currently shared is a simple combination of these two. • To implement these operations, a directory must track the state of each cache block. • In a simple protocol, these states could be the following: – Shared—One or more nodes have the block cached, and the value in memory is up to date (as well as in all the caches). – Uncached—No node has a copy of the cache block. – Modified—Exactly one node has a copy of the cache block, and it has written the block, so the memory copy is out of date. The processor is called the owner of the block. 88
  • 89. Basic Directory Scheme • Each entry has – One dirty bit (1 if there is a dirty cached copy) – A presence vector (1 bit for each node) that tells which nodes may have cached copies • All misses are sent to the block’s home node • The directory performs the needed coherence actions • Eventually, the directory responds with data 89
  • 90. Read Miss • Processor Pk has a read miss on block B, sends request to home node of the block • Directory controller – Finds entry for B, checks D bit – If D=0 • Read memory and send data back, set P[k] – If D=1 • Request block from processor whose P bit is 1 • When block arrives, update memory, clear D bit, send block to Pk and set P[k] 90
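  A compact sketch of this read-miss handling is shown below. It is illustrative Python of my own: the entry layout follows the dirty-bit-plus-presence-vector scheme from the previous slide, and the messaging helpers are stand-ins rather than a real protocol implementation.

```python
class DirEntry:
    def __init__(self, n_nodes):
        self.dirty = False                   # D bit: some cache holds a modified copy
        self.presence = [False] * n_nodes    # P[i]: node i may have a cached copy

N_NODES = 4
directory = {}                               # block address -> DirEntry (at the home node)
memory = {"B": 111}                          # home node's copy of each local block

def fetch_from_owner(owner, block):
    # Stand-in for the message that retrieves the modified copy from the owner.
    print(f"  fetch modified copy of {block} from node {owner}")
    return 999                               # pretend this is the owner's value

def send_data(node, block, data):
    # Stand-in for the reply message carrying the block to the requestor.
    print(f"  send {block} = {data} to node {node}")

def read_miss(k, block):
    entry = directory.setdefault(block, DirEntry(N_NODES))
    if not entry.dirty:
        # D = 0: memory is up to date, reply directly with the home copy.
        send_data(k, block, memory[block])
    else:
        # D = 1: get the block from the single owner, update memory, clear D.
        owner = entry.presence.index(True)
        data = fetch_from_owner(owner, block)
        memory[block] = data
        entry.dirty = False
        send_data(k, block, data)
    entry.presence[k] = True                 # node k now holds a shared copy

read_miss(2, "B")        # clean block: served from memory
directory["B"].dirty = True
directory["B"].presence = [True, False, False, False]   # pretend node 0 modified it
read_miss(3, "B")        # dirty block: fetched from node 0 first
```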
  • 91. Directory Operation • Network controller connected to each bus – A proxy for remote caches and memories • Requests for remote addresses forwarded to home, responses from home placed on the bus • Requests from home placed on the bus, cache responses sent back to home node • Each cache still has its own coherence state – Directory is there just to avoid broadcasts and order accesses to each location • Simplest scheme: If access A1 to block B still not fully processed by directory when A2 arrives, A2 waits in a queue until A1 is done 91
  • 93. Multiprocessor Interconnection Networks • Multiprocessor interconnection networks (INs) can be classified based on a number of criteria. These include – (1) mode of operation (synchronous versus asynchronous), – (2) control strategy (centralized versus decentralized), – (3) switching technique (circuit versus packet), and – (4) topology (static versus dynamic). • Mode of Operation – According to the mode of operation, INs are classified as synchronous versus asynchronous. • In the synchronous mode of operation, a single global clock is used by all components in the system, so that the whole system operates in a lockstep manner. • The asynchronous mode of operation, on the other hand, does not require a global clock; handshaking signals are used instead to coordinate the operation of asynchronous systems. – While synchronous systems tend to be slower than asynchronous systems, they are race- and hazard-free. 93
  • 94. Multiprocessor Interconnection Networks… • Control Strategy – According to the control strategy, INs can be classified as centralized versus decentralized. • In centralized control systems, a single central control unit is used to oversee and control the operation of the components of the system. • In decentralized control, the control function is distributed among different components in the system. – The function and reliability of the central control unit can become the bottleneck in a centralized control system. While the crossbar is a centralized system, the multistage interconnection networks are decentralized. • Switching Techniques – Interconnection networks can be classified according to the switching mechanism as circuit versus packet switching networks. • In the circuit switching mechanism, a complete path has to be established prior to the start of communication between a source and a destination. The established path remains in existence during the whole communication period. • In a packet switching mechanism, communication between a source and a destination takes place via messages that are divided into smaller entities, called packets. On their way to the destination, packets are sent from one node to another in a store-and-forward manner until they reach their destination. – While packet switching tends to use the network resources more efficiently than circuit switching, it suffers from variable packet delays. 94
  • 95. Multiprocessor Interconnection Networks… • Topology – An interconnection network topology is a mapping function from the set of processors and memories onto the same set of processors and memories. In other words, the topology describes how to connect processors and memories to other processors and memories. – A fully connected topology, for example, is a mapping in which each processor is connected to all other processors in the computer. – A ring topology is a mapping that connects processor k to its neighbours, processors (k – 1) and (k + 1). – In general, interconnection networks can be classified as • static versus dynamic networks. – In static networks, direct fixed links are established among nodes to form a fixed network, while – in dynamic networks, connections are established as needed. – Switching elements are used to establish connections among inputs and outputs. – Depending on the switch settings, different interconnections can be established. – Nearly all multiprocessor systems can be distinguished by their interconnection network topology. 95
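  As a toy illustration of such a mapping function, the sketch below lists the neighbours produced by the ring and fully connected mappings described above (the size N = 8 is an arbitrary choice of mine, not from the slides).

```python
N = 8   # arbitrary example size

def ring_neighbours(k, n=N):
    # Ring mapping: node k connects to (k - 1) and (k + 1), modulo n so the ends wrap.
    return ((k - 1) % n, (k + 1) % n)

def fully_connected_neighbours(k, n=N):
    # Fully connected mapping: node k links to every other node.
    return tuple(j for j in range(n) if j != k)

print(ring_neighbours(0))              # (7, 1)
print(fully_connected_neighbours(0))   # (1, 2, 3, 4, 5, 6, 7)
```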
  • 96. Interconnection networks for Shared Memory and Message Passing Systems. • Shared memory – Shared memory systems can be designed using bus-based or switch-based INs. – The simplest IN for shared memory systems is the bus. However, the bus may get saturated if multiple processors are trying to access the shared memory (via the bus) simultaneously. – A typical bus-based design uses caches to reduce bus contention. – Other shared memory designs rely on switches for interconnection. • For example, a crossbar switch can be used to connect multiple processors to multiple memory modules. 96 Figure: Shared memory interconnection networks: (a) bus-based and (b) switch-based. Figure: Single bus and multiple bus systems.
  • 97. Interconnection networks for Shared Memory and Message Passing Systems… • Message passing INs – Message passing INs can be divided into static and dynamic. • Static networks form all connections when the system is designed rather than when the connection is needed. In a static network, messages must be routed along established links. • Dynamic INs establish a connection between two or more nodes on the fly as messages are routed along the links. The number of hops in a path from source to destination node is equal to the number of point-to-point links a message must traverse to reach its destination. – In either static or dynamic networks, a single message may have to hop through intermediate processors on its way to its destination. • Therefore, the ultimate performance of an interconnection network is greatly influenced by the number of hops taken to traverse the network. 97 Figure Examples of static topologies.
  • 98. Interconnection networks for Shared Memory and Message Passing Systems… 98 (a) (b) (c) Figure Example dynamic INs: (a) single-stage, (b) multistage, and (c) crossbar switch. • The single-stage interconnection network of Figure (a) is a simple dynamic network that connects each of the inputs on the left side to some, but not all, outputs on the right side through a single layer of binary switches represented by the rectangles.  The binary switches can direct the message on the left-side input to one of two possible outputs on the right side.
  • 99. Interconnection networks for Shared Memory and Message Passing Systems… • Figure (b). The Omega MIN (Multistage Interconnection Network) connects eight sources to eight destinations. – The connection from the source 010 to the destination 010 is shown as a bold path – These are dynamic INs because the connection is made on the fly, as needed. – In order to connect a source to a destination, we simply use a function of the bits of the source and destination addresses as instructions for dynamically selecting a path through the switches. – For example, to connect source 111 to destination 001 in the omega network, • the switches in the first and second stage must be set to connect to the upper output port, • while the switch at the third stage must be set to connect to the lower output port (001). • In general, when using k × k switches, an Omega MIN with N input-output ports requires at least logk N stages, each of which contains N/k switches, for a total of (N/k)(logk N) switches. • Figure (c) Crossbar Switch provides a path from any input or source to any other output or destination by simply selecting a direction on the fly. – To connect row 111 to column 001 requires only one binary switch at the intersection of the 111 input line and 001 output line to be set. • The crossbar switch clearly uses more binary switching components; – for example, N^2 components are needed to connect N × N source/destination pairs. 99
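  A small sketch of this destination-tag routing rule and of the switch-count formula (illustrative Python of my own; the 0 = upper / 1 = lower convention follows the 8 × 8 example above):

```python
from math import log, log2

def omega_route(dest, n_ports):
    # At stage i, bit i of the destination address (most significant bit first)
    # selects the switch output: 0 -> upper, 1 -> lower. n_ports is a power of two.
    n_stages = int(log2(n_ports))
    return ["upper" if b == "0" else "lower" for b in format(dest, f"0{n_stages}b")]

print(omega_route(0b001, 8))        # ['upper', 'upper', 'lower'], e.g. source 111 -> dest 001

def omega_switch_count(n_ports, k):
    # An N-port omega MIN built from k x k switches: log_k N stages of N/k switches each.
    n_stages = round(log(n_ports, k))       # n_ports assumed to be a power of k
    return n_stages * (n_ports // k)

print(omega_switch_count(4096, 2))  # 12 stages x 2048 switches = 24576
```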
  • 100. Pros and Cons of Crossbar Switch and Omega MIN • Pros – Crossbar switch has potential for speed. In one clock, a connection can be made between source and destination. – The diameter of the crossbar is one. • (Note: The diameter, D, of a network having N nodes is defined as the maximum, over all pairs of nodes, of the shortest path between them.) – The omega MIN, on the other hand, requires log N clocks to make a connection. • The diameter of the omega MIN is therefore log N. • Cons – Both Crossbar Switch and Omega MIN networks limit the number of alternate paths between any source/destination pair. – This leads to limited fault tolerance and network traffic congestion. – If the single path between a pair becomes faulty, that pair cannot communicate. – If two pairs attempt to communicate at the same time along a shared path, one pair must wait for the other. • This is called blocking, and such MINs are called blocking networks. • A network that can handle all possible connections without blocking is called a nonblocking network. 100
  • 101. Example Problem • Example: – Compute the cost of interconnecting 4096 nodes using a single crossbar switch relative to doing so using a MIN built from 2 × 2, 4 × 4, and 16 × 16 switches. Consider separately the relative cost of the unidirectional links and the relative cost of the switches. Switch cost is assumed to grow quadratically with the number of input (alternatively, output) ports, k, for k × k switches. • Solution: – The switch cost of the network when using a single crossbar is proportional to 4096^2. – The unidirectional link cost is 8192, which accounts for the set of links from the end nodes to the crossbar and also from the crossbar back to the end nodes. – When using a MIN with k × k switches, the cost of each switch is proportional to k^2, but there are (4096/k)(logk 4096) total switches. – Likewise, there are (logk 4096) stages of N unidirectional links per stage from the switches plus N links to the MIN from the end nodes. – Therefore, the relative costs of the crossbar with respect to each MIN are given by the following: 101
  • 102. Example Problem… 102
  Relative cost (2 × 2) switches  = 4096^2 / (2^2 × (4096/2) × log2 4096)  ≈ 170
  Relative cost (4 × 4) switches  = 4096^2 / (4^2 × (4096/4) × log4 4096)  ≈ 170
  Relative cost (16 × 16) switches = 4096^2 / (16^2 × (4096/16) × log16 4096) ≈ 85
  Relative cost (2 × 2) links  = 8192 / (4096 × (log2 4096 + 1))  = 2/13 ≈ 0.1538
  Relative cost (4 × 4) links  = 8192 / (4096 × (log4 4096 + 1))  = 2/7 ≈ 0.2857
  Relative cost (16 × 16) links = 8192 / (4096 × (log16 4096 + 1)) = 2/4 = 0.5
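  These figures can be checked with a few lines of Python (illustrative only; the exact ratios 170.7, 170.7 and 85.3 are rounded to 170, 170 and 85 above).

```python
from math import log

N = 4096
crossbar_switch_cost = N ** 2
crossbar_link_cost = 2 * N            # 8192 unidirectional links

for k in (2, 4, 16):
    stages = log(N, k)                                # log_k 4096
    min_switch_cost = k ** 2 * (N / k) * stages
    min_link_cost = N * (stages + 1)
    print(f"{k:>2} x {k:<2}  switch ratio = {crossbar_switch_cost / min_switch_cost:6.1f}"
          f"   link ratio = {crossbar_link_cost / min_link_cost:.4f}")
```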
  • 103. Example Problem… • Conclusion – In all cases, the single crossbar has much higher switch cost than the MINs. – The most dramatic reduction in cost comes from the MIN composed of the smallest switches (and hence the largest number of them), but it is interesting to see that the MINs with 2 × 2 and 4 × 4 switches yield the same relative switch cost. – The relative link cost of the crossbar is lower than that of the MINs, but by less than an order of magnitude in all cases. – We must keep in mind that end node links are different from switch links in their length and packaging requirements, so they usually have different associated costs. – Despite the lower link cost, the crossbar has higher overall relative cost. 103
  • 104. Performance Comparison of some Dynamic INs • In the table below, m represents the number of multiple buses used, while N represents the number of processors (memory modules) or input/output of the network. 104
  Table: Performance Comparison of Some Dynamic INs
  Network               | Delay    | Cost (Complexity)
  Bus                   | O(N)     | O(1)
  Multiple-bus          | O(mN)    | O(m)
  Multistage INs (MINs) | O(log N) | O(N log N)
  • 105. Performance Comparison of some Static INs • The table below shows a performance comparison among a number of static INs.  In this table, the degree of a network is defined as the maximum number of links (channels) connected to any node in the network; the degree of a node, d, is the number of channels incident on that node.  The diameter of a network is defined as the maximum, over all pairs of nodes, of the shortest path between them. 105
  Table: Performance Characteristics of Static INs
  Network      | Degree | Diameter         | Cost (No. of links)
  Linear array | 2      | N – 1            | N – 1
  Binary tree  | 3      | 2(⌈log2 N⌉ – 1)  | N – 1
  n-cube       | log2 N | log2 N           | nN/2
  2D-mesh      | 4      | 2(n – 1)         | 2(N – n)
  (For the n-cube, n = log2 N; for the 2D mesh, n = √N.)
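  The formulas in this table can be evaluated for a concrete machine size; the sketch below uses N = 64 (an arbitrary choice of mine), so the 2D mesh is 8 × 8 and the n-cube has 6 dimensions.

```python
from math import ceil, log2, sqrt

N = 64
n_mesh = int(sqrt(N))          # side of the square 2D mesh
n_cube = int(log2(N))          # dimension of the hypercube

rows = [
    ("Linear array", 2,      N - 1,                    N - 1),
    ("Binary tree",  3,      2 * (ceil(log2(N)) - 1),  N - 1),
    ("n-cube",       n_cube, n_cube,                   n_cube * N // 2),
    ("2D-mesh",      4,      2 * (n_mesh - 1),         2 * (N - n_mesh)),
]

print(f"{'Network':<13}{'Degree':>8}{'Diameter':>10}{'Links':>8}")
for name, degree, diameter, links in rows:
    print(f"{name:<13}{degree:>8}{diameter:>10}{links:>8}")
```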
