Introduction to High Performance Computer Architecture
Introduction to Multiprocessors
Mr. SUBHASIS DASH
SCHOOL OF COMPUTER ENGINEERING
KIIT UNIVERSITY, BHUBANESWAR
Introduction
● Initial computer performance improvements
came from use of:
– Innovative manufacturing techniques.
● In later years,
– Most improvements came from exploitation of ILP.
– Both software and hardware techniques are being
used.
– Pipelining, dynamic instruction scheduling, out-of-order
execution, VLIW, vector processing, etc.
● ILP now appears fully exploited:
– Further performance improvements from ILP
appear limited.
Thread and Process-Level Parallelism
● The way to achieve higher performance:
– Of late, the focus has been on exploiting
thread- and process-level parallelism.
● Exploit parallelism existing across
multiple processes or threads:
– Cannot be exploited by any ILP processor.
● Consider a banking application:
– Individual transactions can be executed in
parallel.
Processes versus Threads
● Processes:
– A process is a program in execution.
– An application normally consists of
multiple processes.
● Threads:
– A process consists of one or more
threads.
– Threads belonging to the same process
share data and code space.
Single and Multithreaded Processes
[Figure: a single-threaded process vs. a multithreaded process.]
How can Threads be Created?
● By using any of the popular thread libraries:
– POSIX Pthreads
– Win32 threads
– Java threads, etc.
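To make this concrete, here is a minimal POSIX Pthreads example in C (the worker function, thread count, and names are illustrative choices, not from the slides):

#include <pthread.h>
#include <stdio.h>

/* Work done by each thread: here it just reports its id. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    int ids[4];

    /* Create four threads; they share this process's code and data. */
    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    /* Wait for all threads to finish. */
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

(Compile with gcc -pthread; Win32 threads and Java threads offer analogous create/join calls.)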
User Threads
● Thread management is done in user space.
● User threads are supported and
managed without kernel support.
– Invisible to the kernel.
– If one thread blocks, the entire
process blocks.
– Limited benefits of threading.
Kernel Threads
● Kernel threads supported and
managed directly by the OS.
– Kernel creates Light Weight Processes
(LWPs).
● Most modern OSs support kernel
threads:
– Windows XP/2000
– Solaris
– Linux
– Mac OS, etc.
Benefits of Threading
● Responsiveness:
– Threads share code and data.
– Thread creation and switching are therefore
much more efficient than for processes.
● As an example in Solaris:
– Creating a thread is about 30x less costly
than creating a process.
– Context switching is about 5x faster than
for processes.
Benefits of Threading
cont…
● Truly concurrent execution:
– Possible with processors
supporting concurrent execution
of threads: SMP, multi-core,
SMT (hyper-threading), etc.
A Few Thread Examples
● Independent threads occur
naturally in several applications:
– Web server: different HTTP
requests are the threads.
– File server
– Name server
– Banking: independent transactions
– Desktop applications: file loading,
display, computations, etc. can be
threads.
Reflection on Threading
● To think of it:
– Threading is inherent to any
server application.
● Threads are also easily
identifiable in traditional
applications:
– Banking, scientific computations,
etc.
Thread-level Parallelism --- Cons cont…
● Threads with severe
dependencies:
– May make multithreading an
exercise in futility.
● Also not as “programmer
friendly” as ILP.
Thread Vs. Process-Level Parallelism
● Threads are lightweight (or fine-
grained):
– Threads share address space, data, files etc.
– Even when the extent of data sharing and
synchronization is low, exploiting thread-level
parallelism is meaningful only when
communication latency is low.
– Consequently, shared memory architectures
(UMA) are a popular way to exploit thread-
level parallelism.
A Broad Classification of
Computers
● Shared-memory multiprocessors
– Also called UMA
● Distributed memory computers
– Also called NUMA:
● Distributed Shared-memory (DSM)
architectures
● Clusters
● Grids, etc.
UMA vs. NUMA Computers
[Figure: (a) UMA model: processors P1…Pn, each with a private cache, share a single main memory over a common bus; memory access latency is on the order of 100s of ns. (b) NUMA model: each processor P1…Pn has a private cache and a local main memory, and the nodes are connected by a network; the network latency is labeled as several milliseconds to seconds.]
Distributed Memory
Computers
● Distributed memory computers use:
– Message Passing Model
● Explicit message send and receive
instructions have to be written by the
programmer.
– Send: specifies local buffer + receiving
process (id) on remote computer (address).
– Receive: specifies sending process on
remote computer + local buffer to place
data.
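As an illustrative sketch, the send/receive pairing looks like this in C with MPI, one widely used message-passing library (the slides do not prescribe a specific API; the buffer contents and tag here are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */

    if (rank == 0) {
        value = 42;
        /* Send: local buffer + id of the receiving process. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive: id of the sending process + local buffer for the data. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

(Compile with mpicc and run with mpirun -np 2.)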
Advantages of Message-Passing Communication
● Hardware for communication and
synchronization is much simpler:
– Compared to communication in a shared memory
model.
● Explicit communication:
– Programs simpler to understand, helps to reduce
maintenance and development costs.
● Synchronization is implicit:
– Naturally associated with sending/receiving
messages.
– Easier to debug.
Disadvantages of Message-Passing Communication
● Programmer has to write explicit
message passing constructs.
– Also, precisely identify the
processes (or threads) with which
communication is to occur.
● Explicit calls to operating
system:
– Higher overhead.
DSM
● Physically separate memories are
accessed as one logical address space.
● Processors running on a multi-
computer system share their memory.
– Implemented by operating system.
● DSM multiprocessors are NUMA:
– Access time depends on the exact
location of the data.
Distributed Shared-Memory
Architecture (DSM)
● Underlying mechanism is message
passing:
– Shared memory convenience provided to
the programmer by the operating system.
– Basically, an operating system facility
takes care of message passing implicitly.
● Advantage of DSM:
– Ease of programming
Disadvantage of DSM
● High communication cost:
– A program not specifically optimized
for DSM by the programmer will
perform extremely poorly.
– Data (variables) accessed by specific
program segments have to be
collocated.
– Useful only for process-level (coarse-
grained) parallelism.
High Performance Computer Architecture
Symmetric Multiprocessors (SMPs)
Mr. SUBHASIS DASH
SCHOOL OF COMPUTER ENGINEERING
KIIT UNIVERSITY, BHUBANESWAR
Symmetric Multiprocessors
(SMPs)
● SMPs are a popular shared memory
multiprocessor architecture:
– Processors share Memory and I/O
– Bus based: access time for all memory locations is
equal --- “Symmetric MP”
[Figure: four processors, each with a private cache, on a shared bus to main memory and the I/O system.]
SMPs: Some Insights
● In any multiprocessor, main memory
access is a bottleneck:
– Multilevel caches reduce the memory demand
of a processor.
– Multilevel caches in fact make it possible for
more than one processor to meaningfully
share the memory bus.
– Hence multilevel caches are a must in a
multiprocessor!
Different SMP
Organizations
● Processor and cache on separate
extension boards (1980s):
– Plugged onto the backplane.
● Integrated on the main board (1990s):
– 4 or 6 processors placed per board.
● Integrated on the same chip (multi-core)
(2000s):
– Dual core (IBM, Intel, AMD)
– Quad core
Pros of SMPs
● Ease of programming:
– Especially when communication
patterns are complex or vary
dynamically during execution.
Cons of SMPs
● As the number of processors increases,
contention for the bus increases.
– Scalability of the SMP model is restricted.
– One way out may be to use switches
(crossbar, multistage networks, etc.)
instead of a bus.
– Switches set up parallel point-to-point
connections.
– Again, switches are not without
disadvantages: they make implementing
cache coherence difficult.
Why Multicores?
● Can you recollect the constraints on
further increase in circuit complexity:
– Clock skew and temperature.
● Use of more complex techniques to
improve single-thread performance is
limited.
● Any additional transistors have to be
used in a different core.
Why Multicores?
Cont…
● Multiple cores on the same
physical packaging:
– Execute different threads.
– Switched off, if no thread to
execute (power saving).
– Dual core, quad core, etc.
Cache Organizations for
Multicores
● L1 caches are always private to a core
● L2 caches can be private or shared:
– Which is better?
[Figure: two organizations for four cores P1 to P4: on the left, each core has a private L1 and a private L2; on the right, each core has a private L1 and all four cores share one L2.]
L2 Organizations
● Advantages of a shared L2 cache:
– Efficient dynamic use of space by each core
– Data shared by multiple cores is not
replicated.
– Every block has a fixed “home” – hence, easy
to find the latest copy.
● Advantages of a private L2 cache:
– Quick access to private L2
– Private bus to private L2, less contention.
An Important Problem with
Shared-Memory: Coherence
● When shared data are cached:
– These are replicated in multiple
caches.
– The data in the caches of different
processors may become inconsistent.
● How to enforce cache coherency?
– How does a processor learn of changes in
the caches of other processors?
The Cache Coherency
Problem
[Figure: three processors P1, P2, P3 with private caches share a memory location u, which initially holds 5.
1. P1 reads u and caches u = 5.
2. P3 reads u and caches u = 5.
3. P3 writes u = 7; with a write-back cache, memory still holds u = 5.
4, 5. P1 and P2 subsequently read u.]
What value will P1 and P2 read?
Cache Coherence Solutions
(Protocols)
● The key to maintaining cache coherence:
– Track the sharing state of every
data block.
● Based on this idea, the following can be
an overall solution:
– Dynamically recognize any potential
inconsistency at run-time and carry out
preventive action.
Basic Idea Behind Cache
Coherency Protocols
[Figure: four processors, each with a private cache, on a shared bus to main memory and the I/O system.]
Pros and Cons of the
Solution
● Pro:
– Consistency maintenance becomes
transparent to programmers,
compilers, as well as to the
operating system.
● Con:
– Increased hardware complexity.
Two Important Cache
Coherency Protocols
● Snooping protocol:
– Each cache “snoops” the bus to find out
which data is being used by whom.
● Directory-based protocol:
– Keep track of the sharing state of each
data block using a directory.
– A directory is a centralized register for
all memory blocks.
– Allows coherency protocol to avoid
broadcasts.
Snoopy and Directory-Based Protocols
[Figure: four processors, each with a private cache, on a shared bus to main memory and the I/O system.]
Snooping vs. Directory-based Protocols
● Snooping protocol reduces memory
traffic.
– More efficient.
● Snooping protocol requires broadcasts:
– Can meaningfully be implemented only when
there is a shared bus.
– Even when there is a shared bus, scalability
is a problem.
– Some workarounds have been tried: the Sun
Enterprise server has up to 4 buses.
Snooping Protocol
● As soon as a request for any data block
by a processor is put out on the bus:
– Other processors “snoop” to check if they
have a copy and respond accordingly.
● Works well with bus interconnection:
– All transmissions on a bus are essentially
broadcast:
● Snooping is therefore effortless.
– Dominates almost all small-scale machines.
Categories of Snoopy
Protocols
● Essentially two types:
– Write Invalidate Protocol
– Write Broadcast Protocol
● Write invalidate protocol:
– When one processor writes to its cache, all
other processors having a copy of that
data block invalidate that block.
● Write broadcast:
– When one processor writes to its cache, all
other processors having a copy of that
data block update that block with the
most recently written value.
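A small sketch of how a snooping cache might react to a write by another processor under each policy (the types and names here are assumptions for illustration, not from the slides):

#include <string.h>

/* One cache line: a valid bit plus its data. */
typedef struct { int valid; unsigned char data[64]; } CacheLine;

/* Write-invalidate: the snooping cache drops its copy; the next local
   access misses and fetches the new value. */
void snoop_write_invalidate(CacheLine *line) {
    line->valid = 0;
}

/* Write-broadcast (update): the snooping cache overwrites its copy with
   the newly written value carried on the bus, so it stays valid. */
void snoop_write_update(CacheLine *line, const unsigned char *bus_data) {
    memcpy(line->data, bus_data, sizeof line->data);
}

The trade-off discussed below follows directly: invalidation costs one bus transaction per block, while updating costs one per write.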
Write Invalidate Vs.
Write Update Protocols
[Figure: four processors, each with a private cache, on a shared bus to main memory and the I/O system.]
Write Invalidate Protocol
● Handling a write to shared data:
– An invalidate command is sent on the bus ---
all caches snoop and invalidate any copies
they have.
● Handling a read miss:
– Write-through: memory is always up-to-date.
– Write-back: snooping finds most recent
copy.
Write Invalidate in Write-Through Caches
● Simple implementation.
● Writes:● Writes:
– Write to shared data: broadcast on bus,
processors snoop, and update any copies.
– Read miss: memory is always up-to-date.
● Concurrent writes:
– Write serialization automatically achieved
since bus serializes requests.
– Bus provides the basic arbitration support.
Write Invalidate versus
Broadcast cont…
● Invalidate exploits spatial locality:
– Only one bus transaction for any
number of writes to the same block.
– Obviously, more efficient.
● Broadcast has lower latency for
writes and reads:
– As compared to invalidate.
High Performance Computer Architecture
Cache Coherence Protocols
Mr. SUBHASIS DASH
SCHOOL OF COMPUTER ENGINEERING
KIIT UNIVERSITY, BHUBANESWAR
An Example Snoopy
Protocol
● Assume:
– Invalidation protocol, write-back cache.
● Each block of memory is in one of the
following states:
– Shared: Clean in all caches and up-to-date
in memory, block can be read.
– Exclusive: cache has the only copy; it is
writeable, and dirty.
– Invalid: data present in the block is obsolete and
cannot be used.
Implementation of the
Snooping Protocol
● A cache controller at every processor
would implement the protocol:
– Has to perform specific actions:
● When the local processor requests certain
things.
● Also, certain actions are required when certain
addresses appear on the bus.
– The exact actions of the cache controller
depend on the state of the cache block.
– Two FSMs can show the different types of
actions to be performed by a controller.
Snoopy-Cache State Machine-I
● State machine considering only CPU requests, for each cache block:
– Invalid → Shared (read only): CPU read (place read miss on bus).
– Invalid → Exclusive (read/write): CPU write (place write miss on bus).
– Shared → Shared: CPU read hit; CPU read miss (place read miss on bus).
– Shared → Exclusive: CPU write (place write miss on bus).
– Exclusive → Exclusive: CPU read hit; CPU write hit; CPU write miss (write back cache block, place write miss on bus).
– Exclusive → Shared: CPU read miss (write back block, place read miss on bus).
Snoopy-Cache State Machine-II
● State machine considering only bus requests, for each cache block:
– Shared → Invalid: write miss for this block.
– Exclusive → Shared: read miss for this block (write back block; abort memory access).
– Exclusive → Invalid: write miss for this block (write back block; abort memory access).
Combined Snoopy-Cache State Machine
● State machine considering both CPU requests and bus requests, for each cache block:
– Invalid → Shared: CPU read (place read miss on bus).
– Invalid → Exclusive: CPU write (place write miss on bus).
– Shared → Shared: CPU read hit; CPU read miss (place read miss on bus).
– Shared → Exclusive: CPU write (place write miss on bus).
– Shared → Invalid: write miss for this block.
– Exclusive → Exclusive: CPU read hit; CPU write hit; CPU write miss (write back cache block, place write miss on bus).
– Exclusive → Shared: CPU read miss (write back block, place read miss on bus); bus read miss for this block (write back block; abort memory access).
– Exclusive → Invalid: bus write miss for this block (write back block; abort memory access).
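Read as controller logic, the combined FSM can be sketched in C as follows (state, event, and helper names are assumptions for illustration; the Exclusive-state CPU write miss, i.e. replacement of a dirty block, is omitted for brevity):

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

typedef enum {
    CPU_READ_HIT, CPU_READ_MISS, CPU_WRITE,
    BUS_READ_MISS, BUS_WRITE_MISS
} Event;

/* Bus-side actions, declared so the sketch is self-contained. */
void place_read_miss_on_bus(void);
void place_write_miss_on_bus(void);
void write_back_block(void);

/* Next state of one cache block, with the bus actions the FSM requires. */
BlockState next_state(BlockState s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ_MISS) { place_read_miss_on_bus();  return SHARED; }
        if (e == CPU_WRITE)     { place_write_miss_on_bus(); return EXCLUSIVE; }
        return INVALID;                /* bus events leave an invalid block alone */
    case SHARED:
        if (e == CPU_READ_MISS) { place_read_miss_on_bus();  return SHARED; }
        if (e == CPU_WRITE)     { place_write_miss_on_bus(); return EXCLUSIVE; }
        if (e == BUS_WRITE_MISS) return INVALID;  /* another cache is writing */
        return SHARED;                 /* read hits and bus read misses: no change */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  { write_back_block(); return SHARED; }
        if (e == BUS_WRITE_MISS) { write_back_block(); return INVALID; }
        return EXCLUSIVE;              /* CPU read/write hits stay exclusive */
    }
    return s;
}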
Directory-based Solution
● In NUMA computers:
– Messages have long latency.
– Also, broadcast is inefficient --- all
messages have explicit responses.
● Main memory controller to keep track of:
– Which processors have cached copies
of which memory locations.
● On a write,
– Only need to inform users, not everyone
● On a dirty read,
– Forward to owner
Directory Protocol
● Three states as in Snoopy Protocol
– Shared: 1 or more processors have the data;
memory is up-to-date.
– Uncached: No processor has the block.
– Exclusive: 1 processor (owner) has the block.
● In addition to cache state,
– Must track which processors have the data
when in the shared state.
– Usually implemented using a bit vector:
bit i is 1 if processor i has a copy.
Directory Behavior
● On a read:
– Uncached:
● give (exclusive) copy to requester
● record owner
– Exclusive or shared:
● send share message to current exclusive
owner
● record owner
● return value
– Exclusive dirty:
● forward read request to exclusive owner.
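A compact sketch in C of the directory state plus the read-miss handling just described (all names are assumptions for illustration; up to 64 processors, so the sharing set fits in one word):

#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } DirState;

/* One directory entry per memory block. */
typedef struct {
    DirState state;
    uint64_t sharers;   /* bit vector: bit p set => processor p has a copy */
} DirEntry;

/* Message-sending helpers, declared so the sketch is self-contained. */
void send_data_value_reply(int p);
void send_fetch(int owner);
int  owner_of(const DirEntry *e);   /* the single sharer in exclusive state */

/* Directory reaction to a read miss from processor p. */
void dir_read_miss(DirEntry *e, int p) {
    switch (e->state) {
    case DIR_UNCACHED:                /* give copy, record owner */
        e->sharers = 1ULL << p;
        send_data_value_reply(p);
        e->state = DIR_SHARED;
        break;
    case DIR_SHARED:                  /* record new sharer, return value */
        e->sharers |= 1ULL << p;
        send_data_value_reply(p);
        break;
    case DIR_EXCLUSIVE:               /* fetch from owner, then share */
        send_fetch(owner_of(e));
        e->sharers |= 1ULL << p;
        send_data_value_reply(p);
        e->state = DIR_SHARED;
        break;
    }
}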
Directory Behavior
● On Write
– Send invalidate messages to all
hosts caching values.
● On Write-Thru/Write-back
– Update value.
CPU-Cache State Machine
● State machine for CPU requests, for each memory block (a block is Invalid, i.e. uncached, when it lives only in memory):
– Invalid → Shared (read only): CPU read (send read-miss message to home directory).
– Invalid → Exclusive (read/write): CPU write (send write-miss message to home directory).
– Shared → Shared: CPU read hit.
– Shared → Invalid: invalidate, or miss due to address conflict.
– Shared → Exclusive: CPU write (send write-miss message to home directory).
– Exclusive → Exclusive: CPU read hit; CPU write hit.
– Exclusive → Shared: fetch (send data write-back message to home directory).
– Exclusive → Invalid: fetch/invalidate, or miss due to address conflict (send data write-back message to home directory).
State Transition Diagram
for the Directory
● Tracks all copies of each memory block.
● Same states as the transition diagram
for an individual cache.
● Memory controller actions:
– Update of directory state.
– Send messages to satisfy requests.
– Also indicates an action that updates the
sharing set, Sharers, as well as sending a
message.
Directory State Machine
● State machine for directory requests, for each memory block (Uncached state if the block lives only in memory):
– Uncached → Shared (read only): read miss (Sharers = {P}; send Data Value Reply).
– Uncached → Exclusive (read/write): write miss (Sharers = {P}; send Data Value Reply).
– Shared → Shared: read miss (Sharers += {P}; send Data Value Reply).
– Shared → Exclusive: write miss (send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply).
– Exclusive → Uncached: data write back (Sharers = {}; write back block).
– Exclusive → Shared: read miss (Sharers += {P}; send Fetch to owner; send Data Value Reply msg to remote cache; write back block).
– Exclusive → Exclusive: write miss (send Fetch/Invalidate to owner; Sharers = {P}; send Data Value Reply msg to remote cache).
