2. Introduction to Parallel Processing
Parallel processing, one form of multiprocessing, is a situation in which two or more processors operate in unison
It is a method used to improve performance in a computer system
When two or more CPUs execute instructions simultaneously, the system is performing parallel processing
The Processor Manager has to coordinate the activity of each processor as well as synchronize cooperative interaction among the CPUs
3. Introduction to Parallel Processing
There are two primary benefits to parallel processing systems:
Increased reliability
The availability of more than one CPU
If one processor fails, the others can continue to operate and absorb the load
Not simple to implement
o The system must be carefully designed so that
• The failing processor can inform the other processors to take over
• The OS must reconstruct its resource allocation strategies so the remaining processors don’t become overloaded
4. Introduction to Parallel Processing
There are two primary benefits to parallel processing systems:
Increased throughput due to faster processing
The increased processing speed is achieved because instructions can sometimes be processed in parallel, two or more at a time, in one of several ways:
o Some systems allocate a CPU to each program or job (see the sketch after this list)
o Others allocate a CPU to each working set or parts of it
o Others subdivide individual instructions so that each subdivision can be processed simultaneously
• Concurrent programming
Increased flexibility brings increased complexity
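A minimal sketch of the first approach, allocating a CPU to each job, assuming a POSIX system (fork and wait are standard POSIX calls; the array and the half-and-half split are illustrative): the OS is free to schedule the parent and child processes on different CPUs, so the two halves of the work can run at the same time.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sum part of the array; each process is a separately schedulable job
   that the OS may place on its own CPU. */
static long sum(const int *a, int lo, int hi)
{
    long s = 0;
    for (int i = lo; i < hi; i++)
        s += a[i];
    return s;
}

int main(void)
{
    int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    pid_t pid = fork();               /* create a second job (process) */
    if (pid == 0) {                   /* child: sums the lower half    */
        printf("child sum:  %ld\n", sum(data, 0, 4));
        return 0;
    }
    printf("parent sum: %ld\n", sum(data, 4, 8));  /* upper half */
    wait(NULL);                       /* wait for the child job */
    return 0;
}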
5. Tightly Coupled - SMP
Processors share memory
Communicate via that shared memory
Symmetric multiprocessor (SMP)
Share a single memory or pool of memory
Shared bus to access memory
Memory access time to a given area of memory is approximately the same for each processor
6. Tightly Coupled - NUMA
Nonuniform memory access
Access times to different regions of memory may differ
7. Loosely Coupled - Clusters
Interconnection of a collection of independent uniprocessors or SMPs
Communication among computers via fixed paths or via a network facility
8. Flynn’s Classification
Flynn’s classification is based on instruction and data processing
A computer is classified by whether it processes a single instruction at a time or multiple instructions simultaneously, and whether it operates on one or multiple data sets
9. Flynn’s Classification
Single instruction, single data stream (SISD)
Single instruction, multiple data stream (SIMD)
Multiple instruction, single data stream (MISD)
Multiple instruction, multiple data stream (MIMD)
10. Single Instruction, Single Data Stream - SISD
Single processor
Single instruction stream
Data stored in a single memory
No instruction parallelism
No data parallelism
Uniprocessor
12. Single Instruction, Multiple Data Stream - SIMD
A single machine instruction controls the simultaneous execution of a number of processing elements
Each processing element has an associated data memory
Each instruction is executed on a different set of data by different processors (i.e., data-level parallelism)
Vector and array processors
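A minimal sketch of data-level parallelism in C (the function and data are illustrative; modern compilers such as gcc or clang, with optimization enabled, can map this loop onto SIMD instructions, so one machine instruction operates on several array elements at once):

#include <stdio.h>
#include <stddef.h>

/* The same operation applied across many data elements: a natural
   candidate for SIMD (vector) execution. */
static void vec_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    vec_add(a, b, c, 4);
    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}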
14. Multiple Instruction, Single Data Stream - MISD
A sequence of data is transmitted to a set of processors
Each processor executes a different instruction sequence on the same data set
Never been implemented
16. Multiple Instruction, Multiple Data Stream - MIMD
A set of processors simultaneously execute different instruction sequences on different sets of data
An MIMD-based computer system can
Use shared memory or
Work with distributed memory
SMPs, clusters and NUMA systems
20. MIMD - Overview
General-purpose processors
Each can process all instructions necessary
Further classified by method of processor communication
21. Concepts of Superscalar Architecture
Execution of two or more instructions simultaneously, in different pipelines
Throughput of a superscalar processor is greater than that of a pipelined scalar processor
May use RISC (PowerPC) or CISC (Pentium 4) architectures
Instruction execution sequencing may be
Static (during compilation) or
Dynamic (at run time)
23. Requirements of Superscalar Architecture
1. More than one instruction should be fetched at a time
2. Decoding logic should check whether instructions are independent and hence executable simultaneously
3. A sufficient number of execution units, preferably pipelined
4. The cycle time for each pipeline stage should match the cycle times of the fetching and decoding logic
24. Superscalar Pipelines
Time spent in each stage of a pipeline is the same for all stages and is determined by the slowest stage
An instruction takes 200 ns using the pipeline shown in Fig. 13.1
Superscalar pipeline (Fig. 13.2)
Fetch and execute multiple instructions in each clock cycle
Multiple instructions are issued to multiple functional units in each clock cycle
Issue instructions that are ready for issue, out of order (OoO)
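A quick worked illustration (the per-stage figure is an assumption, chosen to be consistent with the 200 ns total above): with five stages and the slowest stage taking 40 ns, the clock period is 40 ns, so one instruction needs 5 × 40 ns = 200 ns to traverse the pipeline; once the pipeline is full, a scalar pipeline completes one instruction every 40 ns, while a two-way superscalar pipeline completes two.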
25. Superscalar Pipelines (cont.)
Adding more pipelines improves productivity
Execution units work independently
Fig. 13.3: the first instruction takes 5 clock cycles, and thereafter two instructions are completed in every clock cycle
27. Superscalar Pipelines (cont.)
Two-way superscalar processor
Pipelined integer and floating-point units
Integer unit: 5 stages
Floating-point unit: 7 stages
Two instructions are completed in every clock cycle from the 7th clock onward
Difficult to achieve due to dependencies and branch instructions
29. Out-of-Order (OoO) in a Nutshell
Execute instructions based on the “data flow” graph (rather than program order)
Improvement in performance
Still need to keep the semantics of the original program (i.e., results rearranged into correct order)
Retirement unit or commit unit
30. Dynamic Scheduling
Consider the following instruction sequence
ADD R2, R3
SUB R4, R2
MUL R5, R6
In a simple scalar processor the pipeline is stalled while executing SUB, which also holds up the MUL instruction
In a superscalar processor with dynamic scheduling, MUL can be executed out of order in another pipeline
Instructions are issued in order but executed out of order
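A C analogue of the sequence above (variable names are illustrative): SUB has a true (read-after-write) dependence on ADD’s result, but MUL touches neither, so a dynamically scheduled processor can execute it while SUB waits.

#include <stdio.h>

int main(void)
{
    int r2 = 1, r3 = 2, r4 = 8, r5 = 3, r6 = 4;

    r2 = r2 + r3;   /* ADD R2, R3                                    */
    r4 = r4 - r2;   /* SUB R4, R2: depends on the new r2, must wait  */
    r5 = r5 * r6;   /* MUL R5, R6: independent, may run out of order */

    printf("%d %d %d\n", r2, r4, r5);   /* 3 5 12 */
    return 0;
}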
33. Out-of-Order Execution
Advantages: better performance!
Exploit instruction-level parallelism (ILP)
Hide latencies (e.g., an L1 data cache miss)
Disadvantages:
Hardware is much more complex than that of in-order processors
Can compilers do this work?
In a very limited way: they can only statically schedule instructions (VLIW)
Compilers lack runtime information
Conditional branch direction (→ the compiler is limited to basic blocks)
Data values, which may affect calculation time and control
Cache miss / hit
34. Speculative Execution
In branching, it is not known which instruction should be next until the condition is evaluated
A simple processor stalls
An advanced processor speculates
Speculative execution: execute control-dependent instructions even when we are not sure they should be executed
If the assumption goes wrong, the results are dropped
With branch prediction, we speculate on the outcome of the branches and execute the program as if our guesses were correct
Misprediction: hardware undo (rollback and recovery)
35. Speculative Execution (cont.)
Common implementation
Fetch/decode instructions from the predicted execution path
Instructions can execute as soon as their operands become ready
Instructions can graduate and commit to memory only once it is certain they should have been executed
An instruction commits only when all previous (in-order) instructions have committed ⇒ instructions commit in order
Instructions on a mispredicted execution path are flushed
36. Speculative Execution (cont.)
Consider the following source code
IF I = J THEN
M = M + 1
ELSE
M = M - 1
Addition or subtraction?
Branch prediction is based on previous information
Improvement using a Branch Target Buffer (BTB) cache
For the same branch, it guesses the present path to be followed based on earlier results
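The same branch written in C (a sketch; the wrapper function is illustrative): until i == j is evaluated, the hardware does not know whether the add or the subtract comes next, so a predictor guesses from this branch’s history and the pipeline runs ahead down the guessed path.

#include <stdio.h>

static int update(int i, int j, int m)
{
    if (i == j)
        m = m + 1;   /* one speculative path                         */
    else
        m = m - 1;   /* on a misprediction, speculative results are
                        rolled back and the other path is executed   */
    return m;
}

int main(void)
{
    printf("%d %d\n", update(3, 3, 10), update(3, 4, 10));   /* 11 9 */
    return 0;
}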
37. Increasing Performance
Processor performance can be measured by the rate at which it executes instructions
MIPS rate = f * IPC
f: processor clock frequency, in MHz
IPC: average instructions per cycle
Increase performance by increasing the clock frequency and increasing the number of instructions that complete during a cycle
May be reaching a limit
Complexity
Power consumption
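For example (the numbers are illustrative): a processor clocked at f = 2,000 MHz averaging IPC = 1.5 instructions per cycle delivers a MIPS rate of 2,000 × 1.5 = 3,000 MIPS.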
38. Multithreaded Processor
Instruction stream divided into smaller streams (threads)
Executed in parallel
Wide variety of multithreading designs
39. Definitions of Threads and Processes
A thread in multithreaded processors may or may not be the same as a software thread
Process
An instance of a program running on a computer
Resource ownership
Virtual address space to hold the process image
Scheduling/execution
Process switch
An operation that switches the processor from one process to another
Thread: a dispatchable unit of work within a process
Includes a processor context (which includes the program counter and stack pointer) and a data area for a stack
A thread executes sequentially
Interruptible: the processor can turn to another thread
Thread switch
Switching the processor between threads within the same process
Typically less costly than a process switch (see the sketch below)
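A minimal POSIX threads sketch of these definitions (compile with -pthread; the counter and loop bounds are illustrative): both threads live inside one process and share its address space, the global counter, while each has its own stack and processor context, which is why switching between them costs less than a process switch.

#include <pthread.h>
#include <stdio.h>

int counter = 0;                        /* shared: one address space */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {    /* each thread runs sequentially */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;                   /* two dispatchable units of work */
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);  /* 2000 */
    return 0;
}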
40. Implicit and Explicit Multithreading
All commercial processors and most experimental ones use explicit multithreading
Concurrently execute instructions from different explicit threads
Interleave instructions from different threads on shared pipelines, or execute in parallel on parallel pipelines
Implicit multithreading is the concurrent execution of multiple threads extracted from a single sequential program
Implicit threads are defined statically by the compiler or dynamically by the hardware
41. Approaches to Explicit Multithreading
Interleaved multithreading
Fine-grained
The processor deals with two or more thread contexts at a time
Switching threads at each clock cycle
If a thread is blocked, it is skipped
Blocked multithreading
Coarse-grained
A thread is executed until an event causes a delay
E.g., a cache miss
Effective on an in-order processor
Avoids pipeline stalls
Simultaneous multithreading (SMT)
Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
Chip multiprocessing
The processor is replicated on a single chip
Each processor handles separate threads
42. Scalar Processor Approaches
Single-threaded scalar
Simple pipeline
No multithreading
Interleaved multithreaded scalar
Easiest multithreading to implement
Switch threads at each clock cycle
Pipeline stages kept close to fully occupied
Hardware needs to switch thread context between cycles
Blocked multithreaded scalar
A thread is executed until a latency event occurs that would stop the pipeline
The processor then switches to another thread
44. Multiple Instruction Issue Processors (1)
Superscalar
No multithreading
Interleaved multithreading superscalar
Each cycle, as many instructions as possible are issued from a single thread
Delays due to thread switches are eliminated
The number of instructions issued in a cycle is limited by dependencies
Blocked multithreaded superscalar
Instructions from one thread
Blocked multithreading used
46. Multiple Instruction Issue Processors (2)
Very long instruction word (VLIW)
E.g., IA-64
Multiple instructions in a single word
Typically constructed by the compiler
Operations that may be executed in parallel are placed in the same word
May pad with no-ops
Interleaved multithreading VLIW
Similar efficiencies to interleaved multithreading on a superscalar architecture
Blocked multithreaded VLIW
Similar efficiencies to blocked multithreading on a superscalar architecture
48. Parallel, Simultaneous Execution of Multiple Threads
Simultaneous multithreading
Issue multiple instructions at a time
One thread may fill all horizontal slots
Instructions from two or more threads may be issued
With enough threads, can issue the maximum number of instructions on each cycle
Chip multiprocessor
Multiple processors
Each is a two-issue superscalar processor
Each processor is assigned a thread
Can issue up to two instructions per cycle per thread
50. Examples
Some Pentium 4 processors
Intel calls it hyperthreading
SMT with support for two threads
A single multithreaded processor appears logically as two processors
IBM Power5
High-end PowerPC
Combines chip multiprocessing with SMT
The chip has two separate processors
Each supports two threads concurrently using SMT
51. VLIW Architecture
Aims at speeding up computation by exploiting instruction-level parallelism
Follows static scheduling
The compiler groups several operations into a very long instruction word (VLIW)
Same hardware core as superscalar processors: multiple execution units (EUs) working in parallel to execute operations in one clock cycle
An instruction consists of multiple operations; typical word lengths range from 128 bits to 1 Kbit
All operations in an instruction are executed in lock-step mode
Relies on the compiler to find parallelism and schedule dependency-free program code
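For illustration, assume a hypothetical 4-slot VLIW machine with an integer unit, a floating-point unit, a load/store unit and a branch unit. To execute z = a + b and w = c * d together with a memory load in one cycle, the compiler packs a single long word [IADD | FMUL | LOAD | NOP], padding the unused branch slot with a no-op; all four slots then issue and execute in lock step.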
52. VLIW Architecture (cont.)
VLIW offers a plan of execution (POE) created statically during compilation
A VLIW processor consists of a set of functional units such as adders, multipliers, branch units, etc.
Delivers the POE via the instruction set
Very simple control logic, no dynamic scheduling
Superscalar
Increasing the number of functional units results in complex instruction-scheduling hardware
5 or 6 instructions dispatched per cycle
VLIW architecture
The compiler selects instructions without dependencies and joins them into very long instructions
56. VLIW Characteristics
Only RISC-like operation support
Short cycle times
Flexible: can implement any FU mixture
Extensible
Tight inter-FU connectivity required
Large instructions (up to 1024 bits)
Not binary compatible!!!
But good compilers exist
57. VLIW: Merits
Reduced hardware complexity
Tasks such as decoding, data-dependency detection, instruction issue, etc. become simple
Potentially higher clock rate
Higher degree of parallelism with global program information
No need for register renaming
Less power consumption
58. VLIW: Demerits
Object code requires more memory
Compiler development is an involved and time-consuming process
Compilation is a slow process
The compiler needs in-depth knowledge of the hardware details of its processor
A new version of a VLIW processor cannot recognize the object code of an old VLIW processor
Inefficient for object-oriented and event-driven programs
59. Superscalar and VLIW
VLIW computers
Multiflow, Cydrome (old)
Intel i860 (based on a single chip)
Kartsev’s M10, M13 and Elbrus-3 (popular)
61. Data Flow Computing
Instruction execution is driven by data availability
Not guided by a program counter (PC)
Instructions are not ordered
No need for shared memory: data is held directly inside instructions
Instructions are activated by the availability of data tokens
Programs are represented by directed graphs
62. Data Flow Graph
The graph shows the flow of data
Each instruction consists of
An operator
One or more operands
One or more destinations to which the result will be sent
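For example, for z = (a + b) × (c − d), the graph has an add node fed by tokens a and b, a subtract node fed by c and d, and a multiply node whose operands are the results of the other two; the add and subtract fire as soon as their tokens arrive, in either order or in parallel, and the multiply fires only when both results are available.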
63. Control Flow vs Data Flow
Von Neumann or control-flow computing model:
Uses a program counter to sequence the execution of instructions
Uses shared memory to hold instructions and data
Data dependency and synchronization issues restrict parallel processing
Can be made parallel using special parallel control operators such as FORK and JOIN
Dataflow model:
Execution is driven only by the availability of operands!
No PC and no globally updatable store
The two features of the von Neumann model that become bottlenecks in exploiting parallelism are missing
64. Static Dataflow
Combines control and data into a template, like a reservation station
A presence bit indicates whether operands are available/ready
After instruction execution, the corresponding presence bits are set
Fig. (a) Static dataflow computer
Fig. (b) Opcode structure
65. Dynamic Dataflow
Separates data tokens and control
Tagged token: a labeled packet of information
Allows multiple iterations to be simultaneously active, with shared control (instructions) and separate data tokens
An operation is held by matching the token’s tag in the matching store via associative search
If there is no match, make an entry and wait for the partner
When there is a match, fetch the corresponding instruction from program memory and execute it
Fig. (c) Dynamic dataflow computer
Fig. (d) Opcode structure
66. Dataflow: Advantages/Disadvantages
Advantages:
No program counter
Data-driven
Execution is inhibited only by true data dependences
Stateless / side-effect free
Enhances parallelism
Disadvantages:
No program counter leads to very long fetch/execute latency
Spatial locality in instruction fetch is hard to exploit
Requires matching (e.g., via associative compares)
No shared data structures
No pointers into data structures (which would imply state)
67. Multicore Organization
Number of core processors on the chip
Number of levels of cache on the chip
Amount of shared cache
The next slide shows examples of each organization:
(a) ARM11 MPCore
(b) AMD Opteron
(c) Intel Core Duo
(d) Intel Core i7
69. Advantages of Shared L2 Cache
Constructive interference reduces the overall miss rate
Data shared by multiple cores is not replicated at this cache level
With proper frame-replacement algorithms, the amount of shared cache dedicated to each core is dynamic
Threads with less locality can have more cache
Easy inter-process communication through shared memory
Cache coherency is confined to L1
A dedicated L2 cache gives each core more rapid access
Good for threads with strong locality
A shared L3 cache may also improve performance
70. Individual Core Architecture
Intel Core Duo uses superscalar cores
Intel Core i7 uses simultaneous multithreading (SMT)
Scales up the number of threads supported
4 SMT cores, each supporting 4 threads, appear to the OS as 16 cores (see the sketch below)
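A small sketch of how this looks from software (assuming Linux or BSD; _SC_NPROCESSORS_ONLN is a widely available extension rather than strict POSIX): the OS counts logical processors, so a 4-core chip with 4 SMT threads per core reports 16.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Logical processors currently online: physical cores multiplied
       by SMT threads per core, as presented to the OS. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors: %ld\n", n);
    return 0;
}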
71. Intel x86 Multicore Organization - Core Duo (1)
2006
Two x86 superscalar cores, shared L2 cache
Dedicated L1 cache per core
32 KB instruction and 32 KB data
Thermal control unit per core
Manages chip heat dissipation
Maximizes performance within thermal constraints
Improved ergonomics
Advanced Programmable Interrupt Controller (APIC)
Inter-processor interrupts between cores
Routes interrupts to the appropriate core
Includes a timer so the OS can interrupt a core
72. Intel x86 Multicore Organization - Core Duo (2)
Power management logic
Monitors thermal conditions and CPU activity
Adjusts voltage and power consumption
Can switch individual logic subsystems
2 MB shared L2 cache
Dynamic allocation
MESI support for the L1 caches
Extended to support multiple Core Duo chips in an SMP configuration
L2 data is shared between the local cores or external
Bus interface
74. Intel x86 Multicore Organization - Core i7
November 2008
Four x86 SMT processors
Dedicated L2, shared L3 cache
Speculative prefetch for caches
On-chip DDR3 memory controller
Three 8-byte channels (192 bits) giving 32 GB/s
No front-side bus
QuickPath Interconnect (QPI)
Cache-coherent point-to-point link
High-speed communication between processor chips
6.4G transfers per second, 16 bits per transfer
Dedicated bidirectional pairs
Total bandwidth 25.6 GB/s