2. Introduction to Parallel Processing
Parallel processing, one form of multiprocessing, is a situation in which two or more processors operate in unison
It is a method used to improve performance in a computer system
When two or more CPUs execute instructions simultaneously, the system is performing parallel processing
The Processor Manager has to coordinate the activity of each processor as well as synchronize cooperative interaction among the CPUs
3. Introduction to Parallel Processing
There are two primary benefits to parallel processing systems:
Increased reliability
The availability of more than one CPU
If one processor fails, the others can continue to operate and absorb the load
Not simple to implement
o The system must be carefully designed so that
• The failing processor can inform the other processors to take over
• The OS must reconstruct its resource allocation strategies so the remaining processors don’t become overloaded
4. Introduction to Parallel Processing
There are two primary benefits to parallel processing systems:
Increased throughput due to faster processing
The increased processing speed is achieved because instructions can sometimes be processed in parallel, two or more at a time, in one of several ways:
o Some systems allocate a CPU to each program or job (see the sketch after this list)
o Others allocate a CPU to each working set or parts of it
o Others subdivide individual instructions so that each subdivision can be processed simultaneously
• Concurrent programming
Increased flexibility brings increased complexity
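A minimal sketch of the first approach, allocating a CPU to each job, assuming a POSIX system (fork and wait are standard POSIX calls; the array and the half-and-half split are illustrative): the OS is free to schedule the parent and child processes on different CPUs, so the two halves of the work can run at the same time.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sum part of the array; each process is a separately schedulable job
   that the OS may place on its own CPU. */
static long sum(const int *a, int lo, int hi)
{
    long s = 0;
    for (int i = lo; i < hi; i++)
        s += a[i];
    return s;
}

int main(void)
{
    int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    pid_t pid = fork();               /* create a second job (process) */
    if (pid == 0) {                   /* child: sums the lower half    */
        printf("child sum:  %ld\n", sum(data, 0, 4));
        return 0;
    }
    printf("parent sum: %ld\n", sum(data, 4, 8));  /* upper half */
    wait(NULL);                       /* wait for the child job */
    return 0;
}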
5. Tightly Coupled - SMP
Processors share memory
Communicate via that shared memory
Symmetric multiprocessor (SMP)
Share a single memory or pool of memory
Shared bus to access memory
Memory access time to a given area of memory is approximately the same for each processor
6. Tightly Coupled - NUMA
Nonuniform memory access
Access times to different regions of memory may differ
7. Loosely Coupled - Clusters
Interconnection of a collection of independent uniprocessors or SMPs
Communication among computers via fixed paths or via a network facility
8. Flynn’s Classification
Flynn’s classification is based on instruction and data processing
A computer is classified by whether it processes a single instruction at a time or multiple instructions simultaneously, and whether it operates on one or multiple data sets
9. Flynn’s Classification
Single instruction, single data stream (SISD)
Single instruction, multiple data stream (SIMD)
Multiple instruction, single data stream (MISD)
Multiple instruction, multiple data stream (MIMD)
10. Single Instruction, Single Data Stream - SISD
Single processor
Single instruction stream
Data stored in a single memory
No instruction parallelism
No data parallelism
Uniprocessor
12. Single Instruction, Multiple Data Stream - SIMD
A single machine instruction controls the simultaneous execution of a number of processing elements
Each processing element has an associated data memory
Each instruction is executed on a different set of data by different processors (i.e., data-level parallelism)
Vector and array processors
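A minimal sketch of data-level parallelism in C (the function and data are illustrative; modern compilers such as gcc or clang, with optimization enabled, can map this loop onto SIMD instructions, so one machine instruction operates on several array elements at once):

#include <stdio.h>
#include <stddef.h>

/* The same operation applied across many data elements: a natural
   candidate for SIMD (vector) execution. */
static void vec_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    vec_add(a, b, c, 4);
    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}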
14. Multiple Instruction, Single Data Stream - MISD
A sequence of data is transmitted to a set of processors
Each processor executes a different instruction sequence on the same data set
Never been implemented
16. Multiple Instruction, Multiple Data Stream - MIMD
A set of processors simultaneously execute different instruction sequences on different sets of data
An MIMD-based computer system can
Use shared memory or
Work with distributed memory
SMPs, clusters and NUMA systems
20. MIMD - Overview
General-purpose processors
Each can process all instructions necessary
Further classified by method of processor communication
21. Concepts of Superscalar Architecture
Execution of two or more instructions simultaneously, in different pipelines
Throughput of a superscalar processor is greater than that of a pipelined scalar processor
May use RISC (PowerPC) or CISC (Pentium 4) architectures
Instruction execution sequencing may be
Static (during compilation) or
Dynamic (at run time)
23. Requirements of Superscalar Architecture
1. More than one instruction should be fetched at a time
2. Decoding logic should check whether instructions are independent and hence executable simultaneously
3. A sufficient number of execution units, preferably pipelined
4. The cycle time for each pipeline stage should match the cycle times of the fetching and decoding logic
24. Superscalar Pipelines
Time spent in each stage of a pipeline is the same for all stages and is determined by the slowest stage
An instruction takes 200 ns using the pipeline shown in Fig. 13.1
Superscalar pipeline (Fig. 13.2)
Fetch and execute multiple instructions in each clock cycle
Multiple instructions are issued to multiple functional units in each clock cycle
Issue instructions that are ready for issue, out of order (OoO)
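A quick worked illustration (the per-stage figure is an assumption, chosen to be consistent with the 200 ns total above): with five stages and the slowest stage taking 40 ns, the clock period is 40 ns, so one instruction needs 5 × 40 ns = 200 ns to traverse the pipeline; once the pipeline is full, a scalar pipeline completes one instruction every 40 ns, while a two-way superscalar pipeline completes two.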
25. Superscalar Pipelines (cont.)
Adding more pipelines improves productivity
Execution units work independently
Fig. 13.3: the first instruction takes 5 clock cycles, and thereafter two instructions are completed in every clock cycle
27. Superscalar Pipelines (cont.)
Two-way superscalar processor
Pipelined integer and floating-point units
Integer unit: 5 stages
Floating-point unit: 7 stages
Two instructions are completed in every clock cycle from the 7th clock onward
Difficult to achieve due to dependencies and branch instructions
29. Out-of-Order (OoO) in a Nutshell
Execute instructions based on the “data flow” graph (rather than program order)
Improvement in performance
Still need to keep the semantics of the original program (i.e., results rearranged into correct order)
Retirement unit or commit unit
30. Dynamic Scheduling
Consider the following instruction sequence
ADD R2, R3
SUB R4, R2
MUL R5, R6
In a simple scalar processor the pipeline is stalled while executing SUB, which also holds up the MUL instruction
In a superscalar processor with dynamic scheduling, MUL can be executed out of order in another pipeline
Instructions are issued in order but executed out of order
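A C analogue of the sequence above (variable names are illustrative): SUB has a true (read-after-write) dependence on ADD’s result, but MUL touches neither, so a dynamically scheduled processor can execute it while SUB waits.

#include <stdio.h>

int main(void)
{
    int r2 = 1, r3 = 2, r4 = 8, r5 = 3, r6 = 4;

    r2 = r2 + r3;   /* ADD R2, R3                                    */
    r4 = r4 - r2;   /* SUB R4, R2: depends on the new r2, must wait  */
    r5 = r5 * r6;   /* MUL R5, R6: independent, may run out of order */

    printf("%d %d %d\n", r2, r4, r5);   /* 3 5 12 */
    return 0;
}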
33. Out-of-Order Execution
Advantages: better performance!
Exploit instruction-level parallelism (ILP)
Hide latencies (e.g., an L1 data cache miss)
Disadvantages:
Hardware is much more complex than that of in-order processors
Can compilers do this work?
In a very limited way: they can only statically schedule instructions (VLIW)
Compilers lack runtime information
Conditional branch direction (→ the compiler is limited to basic blocks)
Data values, which may affect calculation time and control
Cache miss / hit
34. Speculative Execution
In branching, it is not known which instruction should be next until the condition is evaluated
A simple processor stalls
An advanced processor speculates
Speculative execution: execute control-dependent instructions even when we are not sure they should be executed
If the assumption goes wrong, the results are dropped
With branch prediction, we speculate on the outcome of the branches and execute the program as if our guesses were correct
Misprediction: hardware undo (rollback and recovery)
35. Speculative Execution (cont.)
Common implementation
Fetch/decode instructions from the predicted execution path
Instructions can execute as soon as their operands become ready
Instructions can graduate and commit to memory only once it is certain they should have been executed
An instruction commits only when all previous (in-order) instructions have committed ⇒ instructions commit in order
Instructions on a mispredicted execution path are flushed
36. Speculative Execution (cont.)
Consider the following source code
IF I = J THEN
M = M + 1
ELSE
M = M - 1
Addition or subtraction?
Branch prediction is based on previous information
Improvement using a Branch Target Buffer (BTB) cache
For the same branch, it guesses the present path to be followed based on earlier results
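The same branch written in C (a sketch; the wrapper function is illustrative): until i == j is evaluated, the hardware does not know whether the add or the subtract comes next, so a predictor guesses from this branch’s history and the pipeline runs ahead down the guessed path.

#include <stdio.h>

static int update(int i, int j, int m)
{
    if (i == j)
        m = m + 1;   /* one speculative path                         */
    else
        m = m - 1;   /* on a misprediction, speculative results are
                        rolled back and the other path is executed   */
    return m;
}

int main(void)
{
    printf("%d %d\n", update(3, 3, 10), update(3, 4, 10));   /* 11 9 */
    return 0;
}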
37. Increasing Performance
Processor performance can be measured by the rate at which it executes instructions
MIPS rate = f * IPC
f: processor clock frequency, in MHz
IPC: average instructions per cycle
Increase performance by increasing the clock frequency and increasing the number of instructions that complete during a cycle
May be reaching a limit
Complexity
Power consumption
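For example (the numbers are illustrative): a processor clocked at f = 2,000 MHz averaging IPC = 1.5 instructions per cycle delivers a MIPS rate of 2,000 × 1.5 = 3,000 MIPS.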
38. Multithreaded Processor
Instruction stream divided into smaller streams (threads)
Executed in parallel
Wide variety of multithreading designs
39. Definitions of Threads and Processes
A thread in multithreaded processors may or may not be the same as a software thread
Process
An instance of a program running on a computer
Resource ownership
Virtual address space to hold the process image
Scheduling/execution
Process switch
An operation that switches the processor from one process to another
Thread: a dispatchable unit of work within a process
Includes a processor context (which includes the program counter and stack pointer) and a data area for a stack
A thread executes sequentially
Interruptible: the processor can turn to another thread
Thread switch
Switching the processor between threads within the same process
Typically less costly than a process switch (see the sketch below)
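A minimal POSIX threads sketch of these definitions (compile with -pthread; the counter and loop bounds are illustrative): both threads live inside one process and share its address space, the global counter, while each has its own stack and processor context, which is why switching between them costs less than a process switch.

#include <pthread.h>
#include <stdio.h>

int counter = 0;                        /* shared: one address space */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {    /* each thread runs sequentially */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;                   /* two dispatchable units of work */
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);  /* 2000 */
    return 0;
}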
40. Implicit and Explicit Multithreading
All commercial processors and most experimental ones use explicit multithreading
Concurrently execute instructions from different explicit threads
Interleave instructions from different threads on shared pipelines, or execute in parallel on parallel pipelines
Implicit multithreading is the concurrent execution of multiple threads extracted from a single sequential program
Implicit threads are defined statically by the compiler or dynamically by the hardware
41. Approaches to Explicit Multithreading
Interleaved multithreading
Fine-grained
The processor deals with two or more thread contexts at a time
Switching threads at each clock cycle
If a thread is blocked, it is skipped
Blocked multithreading
Coarse-grained
A thread is executed until an event causes a delay
E.g., a cache miss
Effective on an in-order processor
Avoids pipeline stalls
Simultaneous multithreading (SMT)
Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
Chip multiprocessing
The processor is replicated on a single chip
Each processor handles separate threads
42. Scalar Processor Approaches
Single-threaded scalar
Simple pipeline
No multithreading
Interleaved multithreaded scalar
Easiest multithreading to implement
Switch threads at each clock cycle
Pipeline stages kept close to fully occupied
Hardware needs to switch thread context between cycles
Blocked multithreaded scalar
A thread is executed until a latency event occurs that would stop the pipeline
The processor then switches to another thread
44. Multiple Instruction Issue Processors (1)
Superscalar
No multithreading
Interleaved multithreading superscalar
Each cycle, as many instructions as possible are issued from a single thread
Delays due to thread switches are eliminated
The number of instructions issued in a cycle is limited by dependencies
Blocked multithreaded superscalar
Instructions from one thread
Blocked multithreading used
46. Multiple Instruction Issue Processors (2)
Very long instruction word (VLIW)
E.g., IA-64
Multiple instructions in a single word
Typically constructed by the compiler
Operations that may be executed in parallel are placed in the same word
May pad with no-ops
Interleaved multithreading VLIW
Similar efficiencies to interleaved multithreading on a superscalar architecture
Blocked multithreaded VLIW
Similar efficiencies to blocked multithreading on a superscalar architecture
48. Parallel, Simultaneous Execution of Multiple Threads
Simultaneous multithreading
Issue multiple instructions at a time
One thread may fill all horizontal slots
Instructions from two or more threads may be issued
With enough threads, can issue the maximum number of instructions on each cycle
Chip multiprocessor
Multiple processors
Each is a two-issue superscalar processor
Each processor is assigned a thread
Can issue up to two instructions per cycle per thread
50. Examples
Some Pentium 4 processors
Intel calls it hyperthreading
SMT with support for two threads
A single multithreaded processor appears logically as two processors
IBM Power5
High-end PowerPC
Combines chip multiprocessing with SMT
The chip has two separate processors
Each supports two threads concurrently using SMT
51. VLIW Architecture
Aims at speeding up computation by exploiting instruction-level parallelism
Follows static scheduling
The compiler groups several operations into a very long instruction word (VLIW)
Same hardware core as superscalar processors: multiple execution units (EUs) working in parallel to execute operations in one clock cycle
An instruction consists of multiple operations; typical word lengths range from 128 bits to 1 Kbit
All operations in an instruction are executed in lock-step mode
Relies on the compiler to find parallelism and schedule dependency-free program code
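For illustration, assume a hypothetical 4-slot VLIW machine with an integer unit, a floating-point unit, a load/store unit and a branch unit. To execute z = a + b and w = c * d together with a memory load in one cycle, the compiler packs a single long word [IADD | FMUL | LOAD | NOP], padding the unused branch slot with a no-op; all four slots then issue and execute in lock step.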
52. VLIW Architecture (cont.)
VLIW offers a plan of execution (POE) created statically during compilation
A VLIW processor consists of a set of functional units such as adders, multipliers, branch units, etc.
Delivers the POE via the instruction set
Very simple control logic, no dynamic scheduling
Superscalar
Increasing the number of functional units results in complex instruction-scheduling hardware
5 or 6 instructions dispatched per cycle
VLIW architecture
The compiler selects instructions without dependencies and joins them into very long instructions
56. VLIW Characteristics
Only RISC-like operation support
Short cycle times
Flexible: can implement any FU mixture
Extensible
Tight inter-FU connectivity required
Large instructions (up to 1024 bits)
Not binary compatible!!!
But good compilers exist
57. VLIW: Merits
Reduced hardware complexity
Tasks such as decoding, data-dependency detection, instruction issue, etc. become simple
Potentially higher clock rate
Higher degree of parallelism with global program information
No need for register renaming
Less power consumption
58. VLIW: Demerits
Object code requires more memory
Compiler development is an involved and time-consuming process
Compilation is a slow process
The compiler needs in-depth knowledge of the hardware details of its processor
A new version of a VLIW processor cannot recognize the object code of an old VLIW processor
Inefficient for object-oriented and event-driven programs
59. Superscalar and VLIW
VLIW computers
Multiflow, Cydrome (old)
Intel i860 (based on a single chip)
Kartsev’s M10, M13 and Elbrus-3 (popular)
61. Data Flow Computing
Instruction execution is driven by data availability
Not guided by a program counter (PC)
Instructions are not ordered
No need for shared memory: data is held directly inside instructions
Instructions are activated by the availability of data tokens
Programs are represented by directed graphs
62. Data Flow Graph
The graph shows the flow of data
Each instruction consists of
An operator
One or more operands
One or more destinations to which the result will be sent
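For example, for z = (a + b) × (c − d), the graph has an add node fed by tokens a and b, a subtract node fed by c and d, and a multiply node whose operands are the results of the other two; the add and subtract fire as soon as their tokens arrive, in either order or in parallel, and the multiply fires only when both results are available.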
63. Control Flow vs Data Flow
Von Neumann or control-flow computing model:
Uses a program counter to sequence the execution of instructions
Uses shared memory to hold instructions and data
Data dependency and synchronization issues restrict parallel processing
Can be made parallel using special parallel control operators such as FORK and JOIN
Dataflow model:
Execution is driven only by the availability of operands!
No PC and no globally updatable store
The two features of the von Neumann model that become bottlenecks in exploiting parallelism are missing
64. Static Dataflow
Combines control and data into a template, like a reservation station
A presence bit indicates whether operands are available/ready
After instruction execution, the corresponding presence bits are set
Fig. (a) Static dataflow computer
Fig. (b) Opcode structure
65. Dynamic Dataflow
Separates data tokens and control
Tagged token: a labeled packet of information
Allows multiple iterations to be simultaneously active, with shared control (instructions) and separate data tokens
An operation is held by matching the token’s tag in the matching store via associative search
If there is no match, make an entry and wait for the partner
When there is a match, fetch the corresponding instruction from program memory and execute it
Fig. (c) Dynamic dataflow computer
Fig. (d) Opcode structure
66. Dataflow: Advantages/Disadvantages
Advantages:
No program counter
Data-driven
Execution is inhibited only by true data dependences
Stateless / side-effect free
Enhances parallelism
Disadvantages:
No program counter leads to very long fetch/execute latency
Spatial locality in instruction fetch is hard to exploit
Requires matching (e.g., via associative compares)
No shared data structures
No pointers into data structures (which would imply state)
67. Multicore Organization
Number of core processors on the chip
Number of levels of cache on the chip
Amount of shared cache
The next slide shows examples of each organization:
(a) ARM11 MPCore
(b) AMD Opteron
(c) Intel Core Duo
(d) Intel Core i7
69. Advantages of Shared L2 Cache
Constructive interference reduces the overall miss rate
Data shared by multiple cores is not replicated at this cache level
With proper frame-replacement algorithms, the amount of shared cache dedicated to each core is dynamic
Threads with less locality can have more cache
Easy inter-process communication through shared memory
Cache coherency is confined to L1
A dedicated L2 cache gives each core more rapid access
Good for threads with strong locality
A shared L3 cache may also improve performance
70. Individual Core Architecture
Intel Core Duo uses superscalar cores
Intel Core i7 uses simultaneous multithreading (SMT)
Scales up the number of threads supported
4 SMT cores, each supporting 4 threads, appear to the OS as 16 cores (see the sketch below)
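A small sketch of how this looks from software (assuming Linux or BSD; _SC_NPROCESSORS_ONLN is a widely available extension rather than strict POSIX): the OS counts logical processors, so a 4-core chip with 4 SMT threads per core reports 16.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Logical processors currently online: physical cores multiplied
       by SMT threads per core, as presented to the OS. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors: %ld\n", n);
    return 0;
}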
71. Intel x86 Multicore Organization - Core Duo (1)
2006
Two x86 superscalar cores, shared L2 cache
Dedicated L1 cache per core
32 KB instruction and 32 KB data
Thermal control unit per core
Manages chip heat dissipation
Maximizes performance within thermal constraints
Improved ergonomics
Advanced Programmable Interrupt Controller (APIC)
Inter-processor interrupts between cores
Routes interrupts to the appropriate core
Includes a timer so the OS can interrupt a core
72. Intel x86 Multicore Organization - Core Duo (2)
Power management logic
Monitors thermal conditions and CPU activity
Adjusts voltage and power consumption
Can switch individual logic subsystems
2 MB shared L2 cache
Dynamic allocation
MESI support for the L1 caches
Extended to support multiple Core Duo chips in an SMP configuration
L2 data is shared between the local cores or external
Bus interface
74. Intel x86 Multicore Organization - Core i7
November 2008
Four x86 SMT processors
Dedicated L2, shared L3 cache
Speculative prefetch for caches
On-chip DDR3 memory controller
Three 8-byte channels (192 bits) giving 32 GB/s
No front-side bus
QuickPath Interconnect (QPI)
Cache-coherent point-to-point link
High-speed communication between processor chips
6.4G transfers per second, 16 bits per transfer
Dedicated bidirectional pairs
Total bandwidth 25.6 GB/s