Introduction to High Performance Computer Architecture
Introduction to Multiprocessors
Mr. SUBHASIS DASH
SCHOOL OF COMPUTER ENGINEERING
KIIT UNIVERSITY, BHUBANESWAR
Introduction
● Initial computer performance improvements
came from use of:
– Innovative manufacturing techniques.
● In later years,
– Most improvements came from exploitation of ILP.
– Both software and hardware techniques are being
used.
– Pipelining, dynamic instruction scheduling, out-of-order
execution, VLIW, vector processing, etc.
● ILP now appears fully exploited:
– Further performance improvements from ILP
appear limited.
Thread and Process-Level Parallelism
● The way to achieve higher performance:
– Of late, the focus has been on exploiting
thread- and process-level parallelism.
● Exploit parallelism existing across
multiple processes or threads:
– Cannot be exploited by any ILP processor.
● Consider a banking application:
– Individual transactions can be executed in
parallel.
Processes versus Threads
● Processes:
– A process is a program in execution.
– An application normally consists of
multiple processes.
● Threads:
– A process consists of one or more
threads.
– Threads belonging to the same process
share data and code space.
Single and Multithreaded Processes
[Figure: a single-threaded process vs. a multithreaded process.]
How can Threads be Created?
● By using any of the popular thread libraries:
– POSIX Pthreads
– Win32 threads
– Java threads, etc.
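To make this concrete, here is a minimal POSIX Pthreads example in C (the worker function, thread count, and names are illustrative choices, not from the slides):

#include <pthread.h>
#include <stdio.h>

/* Work done by each thread: here it just reports its id. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    int ids[4];

    /* Create four threads; they share this process's code and data. */
    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    /* Wait for all threads to finish. */
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

(Compile with gcc -pthread; Win32 threads and Java threads offer analogous create/join calls.)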
User Threads
● Thread management is done in user space.
● User threads are supported and
managed without kernel support.
– Invisible to the kernel.
– If one thread blocks, the entire
process blocks.
– Limited benefits of threading.
Kernel Threads
● Kernel threads supported and
managed directly by the OS.
– Kernel creates Light Weight Processes
(LWPs).
● Most modern OSs support kernel
threads:
– Windows XP/2000
– Solaris
– Linux
– Mac OS, etc.
Benefits of Threading
● Responsiveness:
– Threads share code and data.
– Thread creation and switching are therefore
much more efficient than for processes.
● As an example in Solaris:
– Creating a thread is about 30x less costly
than creating a process.
– Context switching is about 5x faster than
for processes.
Benefits of Threading
cont…
● Truly concurrent execution:
– Possible with processors
supporting concurrent execution
of threads: SMP, multi-core,
SMT (hyper-threading), etc.
A Few Thread Examples
● Independent threads occur
naturally in several applications:
– Web server: different HTTP
requests are the threads.
– File server
– Name server
– Banking: independent transactions
– Desktop applications: file loading,
display, computations, etc. can be
threads.
Reflection on Threading
● To think of it:
– Threading is inherent to any
server application.
● Threads are also easily
identifiable in traditional
applications:
– Banking, scientific computations,
etc.
Thread-level Parallelism --- Cons cont…
● Threads with severe
dependencies:
– May make multithreading an
exercise in futility.
● Also not as “programmer
friendly” as ILP.
Thread Vs. Process-Level Parallelism
● Threads are lightweight (or fine-
grained):
– Threads share address space, data, files etc.
– Even when the extent of data sharing and
synchronization is low, exploiting thread-level
parallelism is meaningful only when
communication latency is low.
– Consequently, shared memory architectures
(UMA) are a popular way to exploit thread-
level parallelism.
A Broad Classification of
Computers
● Shared-memory multiprocessors
– Also called UMA
● Distributed memory computers
– Also called NUMA:
● Distributed Shared-memory (DSM)
architectures
● Clusters
● Grids, etc.
UMA vs. NUMA Computers
[Figure: (a) UMA model: processors P1…Pn, each with a private cache, share a single main memory over a common bus; memory access latency is on the order of 100s of ns. (b) NUMA model: each processor P1…Pn has a private cache and a local main memory, and the nodes are connected by a network; the network latency is labeled as several milliseconds to seconds.]
Distributed Memory
Computers
● Distributed memory computers use:
– Message Passing Model
● Explicit message send and receive
instructions have to be written by the
programmer.
– Send: specifies local buffer + receiving
process (id) on remote computer (address).
– Receive: specifies sending process on
remote computer + local buffer to place
data.
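As an illustrative sketch, the send/receive pairing looks like this in C with MPI, one widely used message-passing library (the slides do not prescribe a specific API; the buffer contents and tag here are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */

    if (rank == 0) {
        value = 42;
        /* Send: local buffer + id of the receiving process. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive: id of the sending process + local buffer for the data. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

(Compile with mpicc and run with mpirun -np 2.)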
Advantages of Message-Passing Communication
● Hardware for communication and
synchronization is much simpler:
– Compared to communication in a shared memory
model.
● Explicit communication:
– Programs simpler to understand, helps to reduce
maintenance and development costs.
● Synchronization is implicit:
– Naturally associated with sending/receiving
messages.
– Easier to debug.
Disadvantages of Message-Passing Communication
● Programmer has to write explicit
message passing constructs.
– Also, precisely identify the
processes (or threads) with which
communication is to occur.
● Explicit calls to operating
system:
– Higher overhead.
DSM
● Physically separate memories are
accessed as one logical address space.
● Processors running on a multi-
computer system share their memory.
– Implemented by operating system.
● DSM multiprocessors are NUMA:
– Access time depends on the exact
location of the data.
Distributed Shared-Memory
Architecture (DSM)
● Underlying mechanism is message
passing:
– Shared memory convenience provided to
the programmer by the operating system.
– Basically, an operating system facility
takes care of message passing implicitly.
● Advantage of DSM:
– Ease of programming
Disadvantage of DSM
● High communication cost:
– A program not specifically optimized
for DSM by the programmer will
perform extremely poorly.
– Data (variables) accessed by specific
program segments have to be
collocated.
– Useful only for process-level (coarse-
grained) parallelism.
High Performance Computer Architecture
Symmetric Multiprocessors (SMPs)
Mr. SUBHASIS DASH
SCHOOL OF COMPUTER ENGINEERING
KIIT UNIVERSITY, BHUBANESWAR
Symmetric Multiprocessors
(SMPs)
● SMPs are a popular shared memory
multiprocessor architecture:
– Processors share Memory and I/O
– Bus based: access time for all memory locations is
equal --- “Symmetric MP”
[Figure: four processors, each with a private cache, on a shared bus to main memory and the I/O system.]
SMPs: Some Insights
● In any multiprocessor, main memory
access is a bottleneck:
– Multilevel caches reduce the memory demand
of a processor.
– Multilevel caches in fact make it possible for
more than one processor to meaningfully
share the memory bus.
– Hence multilevel caches are a must in a
multiprocessor!
Different SMP
Organizations
● Processor and cache on separate
extension boards (1980s):
– Plugged onto the backplane.
● Integrated on the main board (1990s):
– 4 or 6 processors placed per board.
● Integrated on the same chip (multi-core)
(2000s):
– Dual core (IBM, Intel, AMD)
– Quad core
Pros of SMPs
● Ease of programming:
– Especially when communication
patterns are complex or vary
dynamically during execution.
Cons of SMPs
● As the number of processors increases,
contention for the bus increases.
– Scalability of the SMP model is restricted.
– One way out may be to use switches
(crossbar, multistage networks, etc.)
instead of a bus.
– Switches set up parallel point-to-point
connections.
– Again, switches are not without
disadvantages: they make implementing
cache coherence difficult.
Why Multicores?
● Can you recollect the constraints on
further increase in circuit complexity:
– Clock skew and temperature.
● Use of more complex techniques to
improve single-thread performance is
limited.
● Any additional transistors have to be
used in a different core.
Why Multicores?
Cont…
● Multiple cores on the same
physical packaging:
– Execute different threads.
– Switched off, if no thread to
execute (power saving).
– Dual core, quad core, etc.
Cache Organizations for
Multicores
● L1 caches are always private to a core
● L2 caches can be private or shared:
– Which is better?
[Figure: two organizations for four cores P1 to P4: on the left, each core has a private L1 and a private L2; on the right, each core has a private L1 and all four cores share one L2.]
L2 Organizations
● Advantages of a shared L2 cache:
– Efficient dynamic use of space by each core
– Data shared by multiple cores is not
replicated.
– Every block has a fixed “home” – hence, easy
to find the latest copy.
● Advantages of a private L2 cache:
– Quick access to private L2
– Private bus to private L2, less contention.
An Important Problem with
Shared-Memory: Coherence
● When shared data are cached:
– These are replicated in multiple
caches.
– The data in the caches of different
processors may become inconsistent.
● How to enforce cache coherency?
– How does a processor learn of changes in
the caches of other processors?
The Cache Coherency
Problem
[Figure: three processors P1, P2, P3 with private caches share a memory location u, which initially holds 5.
1. P1 reads u and caches u = 5.
2. P3 reads u and caches u = 5.
3. P3 writes u = 7; with a write-back cache, memory still holds u = 5.
4, 5. P1 and P2 subsequently read u.]
What value will P1 and P2 read?
Cache Coherence Solutions
(Protocols)
● The key to maintaining cache coherence:
– Track the sharing state of every
data block.
● Based on this idea, the following can be
an overall solution:
– Dynamically recognize any potential
inconsistency at run-time and carry out
preventive action.
Basic Idea Behind Cache
Coherency Protocols
[Figure: four processors, each with a private cache, on a shared bus to main memory and the I/O system.]
Pros and Cons of the
Solution
● Pro:
– Consistency maintenance becomes
transparent to programmers,
compilers, as well as to the
operating system.
● Con:
– Increased hardware complexity.
Two Important Cache
Coherency Protocols
● Snooping protocol:
– Each cache “snoops” the bus to find out
which data is being used by whom.
● Directory-based protocol:
– Keep track of the sharing state of each
data block using a directory.
– A directory is a centralized register for
all memory blocks.
– Allows coherency protocol to avoid
broadcasts.
Snoopy and Directory-Based Protocols
[Figure: four processors, each with a private cache, on a shared bus to main memory and the I/O system.]
Snooping vs. Directory-based Protocols
● Snooping protocol reduces memory
traffic.
– More efficient.
● Snooping protocol requires broadcasts:
– Can meaningfully be implemented only when
there is a shared bus.
– Even when there is a shared bus, scalability
is a problem.
– Some workarounds have been tried: the Sun
Enterprise server has up to 4 buses.
Snooping Protocol
● As soon as a request for any data block
by a processor is put out on the bus:
– Other processors “snoop” to check if they
have a copy and respond accordingly.
● Works well with bus interconnection:
– All transmissions on a bus are essentially
broadcast:
● Snooping is therefore effortless.
– Dominates almost all small-scale machines.
Categories of Snoopy
Protocols
● Essentially two types:
– Write Invalidate Protocol
– Write Broadcast Protocol
● Write invalidate protocol:
– When one processor writes to its cache, all
other processors having a copy of that
data block invalidate that block.
● Write broadcast:
– When one processor writes to its cache, all
other processors having a copy of that
data block update that block with the
most recently written value.
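A small sketch of how a snooping cache might react to a write by another processor under each policy (the types and names here are assumptions for illustration, not from the slides):

#include <string.h>

/* One cache line: a valid bit plus its data. */
typedef struct { int valid; unsigned char data[64]; } CacheLine;

/* Write-invalidate: the snooping cache drops its copy; the next local
   access misses and fetches the new value. */
void snoop_write_invalidate(CacheLine *line) {
    line->valid = 0;
}

/* Write-broadcast (update): the snooping cache overwrites its copy with
   the newly written value carried on the bus, so it stays valid. */
void snoop_write_update(CacheLine *line, const unsigned char *bus_data) {
    memcpy(line->data, bus_data, sizeof line->data);
}

The trade-off discussed below follows directly: invalidation costs one bus transaction per block, while updating costs one per write.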
Write Invalidate Vs.
Write Update Protocols
[Figure: four processors, each with a private cache, on a shared bus to main memory and the I/O system.]
Write Invalidate Protocol
● Handling a write to shared data:
– An invalidate command is sent on the bus ---
all caches snoop and invalidate any copies
they have.
● Handling a read miss:
– Write-through: memory is always up-to-date.
– Write-back: snooping finds most recent
copy.
Write Invalidate in Write-Through Caches
● Simple implementation.
● Writes:● Writes:
– Write to shared data: broadcast on bus,
processors snoop, and update any copies.
– Read miss: memory is always up-to-date.
● Concurrent writes:
– Write serialization automatically achieved
since bus serializes requests.
– Bus provides the basic arbitration support.
Write Invalidate versus
Broadcast cont…
● Invalidate exploits spatial locality:
– Only one bus transaction for any
number of writes to the same block.
– Obviously, more efficient.
● Broadcast has lower latency for
writes and reads:
– As compared to invalidate.
High Performance Computer Architecture
Cache Coherence Protocols
Mr. SUBHASIS DASH
SCHOOL OF COMPUTER ENGINEERING
KIIT UNIVERSITY, BHUBANESWAR
An Example Snoopy
Protocol
● Assume:
– Invalidation protocol, write-back cache.
● Each block of memory is in one of the
following states:
– Shared: Clean in all caches and up-to-date
in memory, block can be read.
– Exclusive: cache has the only copy; it is
writeable, and dirty.
– Invalid: data present in the block is obsolete and
cannot be used.
Implementation of the
Snooping Protocol
● A cache controller at every processor
would implement the protocol:
– Has to perform specific actions:
● When the local processor requests certain
things.
● Also, certain actions are required when certain
addresses appear on the bus.
– The exact actions of the cache controller
depend on the state of the cache block.
– Two FSMs can show the different types of
actions to be performed by a controller.
Snoopy-Cache State Machine-I
● State machine considering only CPU requests, for each cache block:
– Invalid → Shared (read only): CPU read (place read miss on bus).
– Invalid → Exclusive (read/write): CPU write (place write miss on bus).
– Shared → Shared: CPU read hit; CPU read miss (place read miss on bus).
– Shared → Exclusive: CPU write (place write miss on bus).
– Exclusive → Exclusive: CPU read hit; CPU write hit; CPU write miss (write back cache block, place write miss on bus).
– Exclusive → Shared: CPU read miss (write back block, place read miss on bus).
Snoopy-Cache State Machine-II
● State machine considering only bus requests, for each cache block:
– Shared → Invalid: write miss for this block.
– Exclusive → Shared: read miss for this block (write back block; abort memory access).
– Exclusive → Invalid: write miss for this block (write back block; abort memory access).
Combined Snoopy-Cache State Machine
● State machine considering both CPU requests and bus requests, for each cache block:
– Invalid → Shared: CPU read (place read miss on bus).
– Invalid → Exclusive: CPU write (place write miss on bus).
– Shared → Shared: CPU read hit; CPU read miss (place read miss on bus).
– Shared → Exclusive: CPU write (place write miss on bus).
– Shared → Invalid: write miss for this block.
– Exclusive → Exclusive: CPU read hit; CPU write hit; CPU write miss (write back cache block, place write miss on bus).
– Exclusive → Shared: CPU read miss (write back block, place read miss on bus); bus read miss for this block (write back block; abort memory access).
– Exclusive → Invalid: bus write miss for this block (write back block; abort memory access).
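Read as controller logic, the combined FSM can be sketched in C as follows (state, event, and helper names are assumptions for illustration; the Exclusive-state CPU write miss, i.e. replacement of a dirty block, is omitted for brevity):

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

typedef enum {
    CPU_READ_HIT, CPU_READ_MISS, CPU_WRITE,
    BUS_READ_MISS, BUS_WRITE_MISS
} Event;

/* Bus-side actions, declared so the sketch is self-contained. */
void place_read_miss_on_bus(void);
void place_write_miss_on_bus(void);
void write_back_block(void);

/* Next state of one cache block, with the bus actions the FSM requires. */
BlockState next_state(BlockState s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ_MISS) { place_read_miss_on_bus();  return SHARED; }
        if (e == CPU_WRITE)     { place_write_miss_on_bus(); return EXCLUSIVE; }
        return INVALID;                /* bus events leave an invalid block alone */
    case SHARED:
        if (e == CPU_READ_MISS) { place_read_miss_on_bus();  return SHARED; }
        if (e == CPU_WRITE)     { place_write_miss_on_bus(); return EXCLUSIVE; }
        if (e == BUS_WRITE_MISS) return INVALID;  /* another cache is writing */
        return SHARED;                 /* read hits and bus read misses: no change */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  { write_back_block(); return SHARED; }
        if (e == BUS_WRITE_MISS) { write_back_block(); return INVALID; }
        return EXCLUSIVE;              /* CPU read/write hits stay exclusive */
    }
    return s;
}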
Directory-based Solution
● In NUMA computers:
– Messages have long latency.
– Also, broadcast is inefficient --- all
messages have explicit responses.
● Main memory controller to keep track of:
– Which processors have cached copies
of which memory locations.
● On a write,
– Only need to inform users, not everyone
● On a dirty read,
– Forward to owner
Directory Protocol
● Three states as in Snoopy Protocol
– Shared: 1 or more processors have the data;
memory is up-to-date.
– Uncached: No processor has the block.
– Exclusive: 1 processor (owner) has the block.
● In addition to cache state,
– Must track which processors have the data
when in the shared state.
– Usually implemented using a bit vector:
bit i is 1 if processor i has a copy.
Directory Behavior
● On a read:
– Uncached:
● give (exclusive) copy to requester
● record owner
– Exclusive or shared:
● send share message to current exclusive
owner
● record owner
● return value
– Exclusive dirty:
● forward read request to exclusive owner.
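A compact sketch in C of the directory state plus the read-miss handling just described (all names are assumptions for illustration; up to 64 processors, so the sharing set fits in one word):

#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } DirState;

/* One directory entry per memory block. */
typedef struct {
    DirState state;
    uint64_t sharers;   /* bit vector: bit p set => processor p has a copy */
} DirEntry;

/* Message-sending helpers, declared so the sketch is self-contained. */
void send_data_value_reply(int p);
void send_fetch(int owner);
int  owner_of(const DirEntry *e);   /* the single sharer in exclusive state */

/* Directory reaction to a read miss from processor p. */
void dir_read_miss(DirEntry *e, int p) {
    switch (e->state) {
    case DIR_UNCACHED:                /* give copy, record owner */
        e->sharers = 1ULL << p;
        send_data_value_reply(p);
        e->state = DIR_SHARED;
        break;
    case DIR_SHARED:                  /* record new sharer, return value */
        e->sharers |= 1ULL << p;
        send_data_value_reply(p);
        break;
    case DIR_EXCLUSIVE:               /* fetch from owner, then share */
        send_fetch(owner_of(e));
        e->sharers |= 1ULL << p;
        send_data_value_reply(p);
        e->state = DIR_SHARED;
        break;
    }
}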
Directory Behavior
● On Write
– Send invalidate messages to all
hosts caching values.
● On Write-Thru/Write-back
– Update value.
CPU-Cache State Machine
● State machine for CPU requests, for each memory block (a block is Invalid, i.e. uncached, when it lives only in memory):
– Invalid → Shared (read only): CPU read (send read-miss message to home directory).
– Invalid → Exclusive (read/write): CPU write (send write-miss message to home directory).
– Shared → Shared: CPU read hit.
– Shared → Invalid: invalidate, or miss due to address conflict.
– Shared → Exclusive: CPU write (send write-miss message to home directory).
– Exclusive → Exclusive: CPU read hit; CPU write hit.
– Exclusive → Shared: fetch (send data write-back message to home directory).
– Exclusive → Invalid: fetch/invalidate, or miss due to address conflict (send data write-back message to home directory).
State Transition Diagram
for the Directory
● Tracks all copies of each memory block.
● Same states as the transition diagram
for an individual cache.
● Memory controller actions:
– Update of directory state.
– Send messages to satisfy requests.
– Also indicates an action that updates the
sharing set, Sharers, as well as sending a
message.
Directory State Machine
● State machine for directory requests, for each memory block (Uncached state if the block lives only in memory):
– Uncached → Shared (read only): read miss (Sharers = {P}; send Data Value Reply).
– Uncached → Exclusive (read/write): write miss (Sharers = {P}; send Data Value Reply).
– Shared → Shared: read miss (Sharers += {P}; send Data Value Reply).
– Shared → Exclusive: write miss (send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply).
– Exclusive → Uncached: data write back (Sharers = {}; write back block).
– Exclusive → Shared: read miss (Sharers += {P}; send Fetch to owner; send Data Value Reply msg to remote cache; write back block).
– Exclusive → Exclusive: write miss (send Fetch/Invalidate to owner; Sharers = {P}; send Data Value Reply msg to remote cache).
