Advanced Processor Principles
By Prof. Vinit Raut
Introduction to Parallel Processing
 Parallel processing, one form of multiprocessing, is a situation in which two or more processors operate in unison
 It is a method used to improve performance in a computer system
 When two or more CPUs are executing instructions simultaneously, the system is performing parallel processing
 The Processor Manager has to coordinate the activity of each processor as well as synchronize cooperative interaction among the CPUs
Introduction to Parallel Processing
 There are two primary benefits to parallel processing systems:
 Increased reliability
 The availability of more than one CPU
 If one processor fails, the others can continue to operate and absorb the load
 Not simple to implement
o The system must be carefully designed so that
• The failing processor can inform the other processors to take over
• The OS must reconstruct its resource allocation strategies so the remaining processors don’t become overloaded
Introduction to Parallel Processing
 There are two primary benefits to parallel processing systems:
 Increased throughput due to fast processing
 The processing speed is achieved because sometimes instructions can be processed in parallel, two or more at a time, in one of several ways:
o Some systems allocate a CPU to each program or job
o Others allocate a CPU to each working set or parts of it
o Others subdivide individual instructions so that each subdivision can be processed simultaneously
• Concurrent programming
 Increased flexibility brings increased complexity
Tightly Coupled - SMP
 Processors share memory
 Communicate via that shared memory
 Symmetric Multiprocessor (SMP)
 Share single memory or pool
 Shared bus to access memory
 Memory access time to given area of memory is approximately the same for each processor
Tightly Coupled - NUMA
 Nonuniform memory access
 Access times to different regions of memory may differ
Loosely Coupled - Clusters
 Interconnection of a collection of independent uniprocessors or SMPs
 Communication among computers via fixed paths or via a network facility
Flynn’s Classification
 It is based on the number of instruction streams and data streams a machine processes.
 A computer is classified by whether it processes a single instruction at a time or multiple instructions simultaneously, and whether it operates on one or multiple data sets.
Flynn’s Classification
 Single instruction, single data stream - SISD
 Single instruction, multiple data stream - SIMD
 Multiple instruction, single data stream - MISD
 Multiple instruction, multiple data stream - MIMD
Single Instruction, Single Data Stream - SISD
 Single processor
 Single instruction stream
 Data stored in single memory
 No instruction parallelism
 No data parallelism
 Uni-processor
Parallel Organizations - SISD
Single Instruction, Multiple Data Stream - SIMD
 Single machine instruction
 Controls simultaneous execution
 Number of processing elements
 Each processing element has associated data memory
 Each instruction executed on a different set of data by different processors (i.e., data-level parallelism)
 Vector and array processors
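As a minimal sketch of this idea (illustrative C; on a real SIMD machine the loop body would be one vector instruction applied across the processing elements):

#include <stddef.h>

/* One logical operation over many data elements: every iteration applies
   the same add to different data, which is exactly the data-level
   parallelism a SIMD machine exploits in hardware. */
void vector_add(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* same instruction, different data */
}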
Parallel Organizations - SIMD
Multiple Instruction, Single Data Stream - MISD
 Sequence of data
 Transmitted to a set of processors
 Each processor executes a different instruction sequence on the same data set
 Never been implemented
Parallel Organizations - MISD
Multiple Instruction, Multiple Data Stream - MIMD
 Set of processors
 Simultaneously execute different instruction sequences
 Different sets of data
 MIMD-based computer systems can
 Use shared memory or
 Work with distributed memory
 SMPs, clusters and NUMA systems
Parallel Organizations - MIMD Shared Memory
Parallel Organizations - MIMD Distributed Memory
Taxonomy of Parallel Processor Architectures
MIMD - Overview
 General purpose processors
 Each can process all instructions necessary
 Further classified by method of processor communication
Concepts of Superscalar Architecture
 Execution of two or more instructions simultaneously, in different pipelines
 Throughput of a superscalar processor is greater than that of a pipelined scalar processor
 May use RISC (PowerPC) or CISC (Pentium 4) architectures
 Instruction execution sequencing may be
 Static (during compilation) or
 Dynamic (at run time)
Superscalar vs Pipelined Scalar
Requirements of Superscalar Architecture
1. More than one instruction should be fetched at a time
2. Decoding logic should check whether instructions are independent and hence executable simultaneously
3. Sufficient number of execution units, preferably pipelined
4. Cycle time for each pipelined stage should match the cycle times of the fetching and decoding logic
Superscalar Pipelines
 Time spent in each stage of the pipeline is the same for all stages and determined by the slowest stage
 An instruction takes 200 ns using the pipeline shown in Fig. 13.1
 Superscalar Pipeline (Fig. 13.2)
 Fetch and execute multiple instructions in each clock cycle
 Multiple instructions are issued to multiple functional units in each clock cycle
 Instructions that are ready are issued Out-of-Order (OoO)
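As a worked example (the five 40 ns stages are an assumption consistent with the 200 ns figure above): the first instruction completes after 5 × 40 = 200 ns; once the pipeline is full, a scalar pipeline completes one instruction every 40 ns, while a two-way superscalar pipeline completes two per 40 ns cycle, i.e., one instruction every 20 ns on average.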
Superscalar Pipelines (cont.)
 More pipelines are added – improves productivity
 Execution units work independently
 Fig. 13.3 - the first instruction takes 5 clock cycles and thereafter two instructions get completed every clock cycle
Superscalar Pipelines (cont.)
 Timing diagram for the system shown in Fig. 13.3, assuming no hazards
Superscalar Pipelines (cont.)
 Two-way superscalar processor
 Pipelined integer and floating-point units
 Integer unit: 5 stages
 Floating-point unit: 7 stages
 Two instructions get completed every clock from the 7th clock onwards
 Difficult to achieve due to dependencies and branch instructions
Superscalar Techniques
 Operand forwarding
 Delayed branching
 Dynamic scheduling
 Register renaming
 Branch prediction
 Multiple instruction issue
 Speculative execution
 Loop unrolling (see the sketch after this list)
 Software pipelining
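A minimal loop-unrolling sketch (illustrative C, not from the slides): the compiler exposes four independent additions per iteration, giving a multi-issue processor more instructions to execute in parallel.

void add_arrays(float *a, const float *b, int n) {
    int i;
    /* unrolled by 4: the four statements have no mutual dependencies,
       so a superscalar processor may issue them simultaneously */
    for (i = 0; i + 3 < n; i += 4) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
    for (; i < n; i++)   /* cleanup loop for the leftover elements */
        a[i] += b[i];
}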
Out-of-Order (OoO) in a Nutshell
 Execute instructions based on the “data flow” graph (rather than program order)
 Improvement in performance
 Still need to keep the semantics of the original program (i.e. results rearranged into correct order)
 Retirement unit or commit unit
Dynamic Scheduling
 Consider the following instruction sequence
ADD R2, R3
SUB R4, R2
MUL R5, R6
 In a simple scalar processor the pipeline is stalled while executing SUB (it must wait for the R2 result of ADD), which also holds up the MUL instruction
 In a superscalar processor with dynamic scheduling, MUL can be executed out-of-order in another pipeline
 Instructions are issued in-order but executed out-of-order
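A commented version of the same sequence (two-operand form assumed, with the first operand as the destination) makes the dependence explicit:

ADD R2, R3   ; R2 <- R2 + R3 (produces R2)
SUB R4, R2   ; R4 <- R4 - R2 (reads R2: RAW dependence on ADD, must wait)
MUL R5, R6   ; R5 <- R5 * R6 (no dependence: free to execute out of order)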
OoO general scheme
Superscalar Basics: Data Flow Analysis
 Example:
Out-of-Order Execution
 Advantages: Better performance!
 Exploit Instruction Level Parallelism (ILP)
 Hide latencies (e.g., L1 data cache miss)
 Disadvantages:
 HW is much more complex than that of in-order processors
 Can compilers do this work?
 In a very limited way – can only statically schedule instructions (VLIW)
 Compilers lack runtime information
 Conditional branch direction (→ compiler limited to basic blocks)
 Data values, which may affect calculation time and control
 Cache miss / hit
Speculative Execution
 In branching, the next instruction is not known until the condition is evaluated
 A simple processor stalls
 Advanced processors speculate
 Speculative execution - execute control-dependent instructions even when we are not sure if they should be executed
 If the assumption goes wrong, results are dropped
 With branch prediction, we speculate on the outcome of the branches and execute the program as if our guesses were correct
 Misprediction – hardware undo (rollback and recovery)
Speculative Execution (cont.)
 Common Implementation
 Fetch/decode instructions from the predicted execution path
 Instructions can execute as soon as their operands become ready
 Instructions can graduate and commit to memory only once it is certain they should have been executed
 An instruction commits only when all previous (in-order) instructions have committed ⇒ instructions commit in-order
 Instructions on a mis-predicted execution path are flushed
Speculative Execution (cont.)
 Consider the following source code
IF I=J THEN
M=M+1
ELSE
M=M-1
 Addition or subtraction?
 Branch prediction based on previous information
 Improvement using a Branch Target Buffer (BTB) cache
 For the same branch, it guesses the path to be followed based on earlier results
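A minimal sketch of one common prediction scheme, a table of 2-bit saturating counters indexed by branch address (the table size and indexing are illustrative assumptions, not a specific BTB design):

#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 1024
/* 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken */
static uint8_t counters[TABLE_SIZE];

static unsigned slot(uint32_t pc) { return (pc >> 2) % TABLE_SIZE; }

bool predict_taken(uint32_t pc) {
    return counters[slot(pc)] >= 2;        /* guess from past behavior */
}

void train(uint32_t pc, bool taken) {
    uint8_t *c = &counters[slot(pc)];
    if (taken && *c < 3) (*c)++;           /* saturate at strongly taken */
    else if (!taken && *c > 0) (*c)--;     /* saturate at strongly not taken */
}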
Increasing Performance
 Processor performance can be measured by the rate at which it executes instructions
 MIPS rate = f * IPC
 f is the processor clock frequency, in MHz
 IPC is the average number of instructions completed per cycle
 Increase performance by increasing the clock frequency and increasing the number of instructions that complete during a cycle
 May be reaching limit
 Complexity
 Power consumption
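As a worked example (figures assumed purely for illustration): a processor clocked at f = 500 MHz that completes an average of IPC = 2 instructions per cycle delivers 500 × 2 = 1000 MIPS.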
Multithreaded Processor
 Instruction stream divided into smaller streams (threads)
 Executed in parallel
 Wide variety of multithreading designs
Definitions of Threads and Processes
 Threads in multithreaded processors may or may not be the same as software threads
 Process
 An instance of a program running on a computer
 Resource ownership
 Virtual address space to hold the process image
 Scheduling/execution
 Process switch
 An operation that switches the processor from one process to another
 Thread: dispatchable unit of work within a process
 Includes processor context (which includes the program counter and stack pointer) and data area for stack
 Thread executes sequentially
 Interruptible: processor can turn to another thread
 Thread switch
 Switching processor between threads within the same process
 Typically less costly than a process switch
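A minimal sketch of threads as dispatchable units within one process (POSIX threads assumed for illustration): both threads share the process’s address space, so they see the same global variable.

#include <pthread.h>
#include <stdio.h>

static int shared = 0;                        /* one copy, visible to all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);                /* synchronize cooperative access */
    shared++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);  /* each thread: own PC and stack */
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%d\n", shared);                   /* prints 2 */
    return 0;
}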
Implicit and Explicit Multithreading
 All commercial processors and most experimental ones use explicit multithreading
 Concurrently execute instructions from different explicit threads
 Interleave instructions from different threads on shared pipelines or parallel execution on parallel pipelines
 Implicit multithreading is concurrent execution of multiple threads extracted from a single sequential program
 Implicit threads defined statically by the compiler or dynamically by hardware
Approaches to Explicit Multithreading
 Interleaved Multithreading
 Fine-grained
 Processor deals with two or more thread contexts at a time
 Switching thread at each clock cycle
 If a thread is blocked it is skipped
 Blocked Multithreading
 Coarse-grained
 Thread executed until an event causes delay
 E.g. cache miss
 Effective on in-order processor
 Avoids pipeline stall
 Simultaneous Multithreading (SMT)
 Instructions simultaneously issued from multiple threads to execution units of a superscalar processor
 Chip multiprocessing
 Processor is replicated on a single chip
 Each processor handles separate threads
Scalar Processor Approaches
 Single-threaded scalar
 Simple pipeline
 No multithreading
 Interleaved multithreaded scalar
 Easiest multithreading to implement
 Switch threads at each clock cycle
 Pipeline stages kept close to fully occupied
 Hardware needs to switch thread context between cycles
 Blocked multithreaded scalar
 Thread executed until latency event occurs
 Would stop pipeline
 Processor switches to another thread
Scalar Diagrams
Multiple Instruction Issue Processors (1)
 Superscalar
 No multithreading
 Interleaved multithreading superscalar
 Each cycle, as many instructions as possible issued from a single thread
 Delays due to thread switches eliminated
 Number of instructions issued in a cycle limited by dependencies
 Blocked multithreaded superscalar
 Instructions from one thread
 Blocked multithreading used
Multiple Instruction Issue Diagram (1)
Multiple Instruction Issue Processors (2)
 Very long instruction word (VLIW)
 E.g. IA-64
 Multiple instructions in a single word
 Typically constructed by the compiler
 Operations that may be executed in parallel placed in the same word
 May pad with no-ops
 Interleaved multithreading VLIW
 Similar efficiencies to interleaved multithreading on superscalar architecture
 Blocked multithreaded VLIW
 Similar efficiencies to blocked multithreading on superscalar architecture
Multiple Instruction Issue Diagram (2)
Parallel, Simultaneous Execution of Multiple Threads
 Simultaneous multithreading
 Issue multiple instructions at a time
 One thread may fill all horizontal slots
 Instructions from two or more threads may be issued
 With enough threads, can issue maximum number of instructions on each cycle
 Chip multiprocessor
 Multiple processors
 Each has a two-issue superscalar processor
 Each processor is assigned a thread
 Can issue up to two instructions per cycle per thread
Parallel Diagram
Examples
 Some Pentium 4
 Intel calls it Hyper-Threading
 SMT with support for two threads
 Single multithreaded processor, logically two processors
 IBM Power5
 High-end PowerPC
 Combines chip multiprocessing with SMT
 Chip has two separate processors
 Each supporting two threads concurrently using SMT
VLIW Architecture
 Aims at speeding up computation by exploiting instruction-level parallelism
 Follows static scheduling
 Compiler groups several operations into a very long instruction word (VLIW)
 Same hardware core as superscalar processors, having multiple execution units (EUs) working in parallel, to execute operations in one clock cycle
 An instruction consists of multiple operations; typical word length from 128 bits to 1 Kbits
 All operations in an instruction are executed in lock-step mode
 Relies on the compiler to find parallelism and schedule dependency-free program code
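A minimal sketch of what one very long instruction word could contain (the slot layout and widths are illustrative assumptions, not a real ISA):

#include <stdint.h>

/* One VLIW instruction: several independent operations packed by the
   compiler, one per functional-unit slot, all issued in lock step.
   Empty slots are padded with no-ops. */
typedef struct {
    uint32_t int_alu_op;     /* slot for an integer ALU operation   */
    uint32_t fp_op;          /* slot for a floating-point operation */
    uint32_t load_store_op;  /* slot for a memory operation         */
    uint32_t branch_op;      /* slot for a branch (or a no-op pad)  */
} vliw_word;                 /* 128 bits total in this sketch       */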
VLIW Architecture (cont.)
 VLIW offers a Plan Of Execution (POE) created statically during compilation
 A VLIW processor consists of a set of functional units like adders, multipliers, branch units, etc.
 Delivers the POE via the instruction set
 Very simple control logic, no dynamic scheduling
 Superscalar
 Increasing the no. of functional units results in complex instruction scheduling hardware
 5 or 6 instructions dispatched per cycle
 VLIW architecture
 Compiler selects instructions without dependencies and joins them as very long instructions
VLIW Architecture (cont.)
VLIW Processor Organization
VLIW vs Superscalar
VLIW Characteristics
 Only RISC-like operation support
 Short cycle times
 Flexible: can implement any FU mixture
 Extensible
 Tight inter-FU connectivity required
 Large instructions (up to 1024 bits)
 Not binary compatible !!!
 But good compilers exist
VLIW: Merits
 Reduce hardware complexity
 Tasks such as decoding, data dependency detection, instruction issue, etc. become simple
 Potentially higher clock rate
 Higher degree of parallelism with global program information
 No need for register renaming
 Less power consumption
VLIW: Demerits
 Object code requires more memory
 Compiler development is an involved and time-consuming process
 Compilation is a slow process
 Compiler needs in-depth knowledge of the hardware details of its processor
 A new version of a VLIW processor cannot recognize the object code of an old VLIW processor
 Inefficient for object-oriented and event-driven programs
Superscalar and VLIW
 VLIW computers
 Multiflow, Cydrome (old)
 Intel i860 (based on single chip)
 Kartsev’s M10, M13 and Elbrus-3 (popular)
Superscalar and VLIW (cont.)
Data Flow Computing
 Instruction execution is driven by data availability
 Not guided by a program counter (PC)
 Instructions are not ordered
 No need of shared memory – data held directly inside instructions
 Instructions are activated on availability of data tokens
 Represented by directed graphs
Data Flow Graph
 Graph shows flow of data
 Each instruction consists of
 An operator
 One or more operands
 One or more destinations where the result will be sent
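As an illustrative example (the expression is assumed for the sketch), x = (a + b) * (a - b) becomes a graph in which the + and - nodes fire as soon as the tokens for a and b arrive, and the * node fires once both partial results are available:

a, b --> (+) --> sum --+
                       +--> (*) --> x
a, b --> (-) --> diff -+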
Control Flow vs Data Flow
 Von Neumann or control flow computing model:
 Uses a program counter to sequence execution of instructions
 Uses shared memory to hold instructions and data
 Data dependency and synchronization issues restrict parallel processing
 Can be made parallel using special parallel control operators like FORK and JOIN
 Dataflow model:
 The execution is driven only by the availability of operands!
 No PC and no global updateable store
 The two features of the von Neumann model that become bottlenecks in exploiting parallelism are missing
Static Dataflow
 Combine control and data into a template like a reservation station
 Presence bits indicate whether operands are available/ready
 After instruction execution the corresponding presence bits are set
Fig. (a) Static dataflow computer
Fig. (b) Opcode structure
Dynamic Dataflow
 Separate data tokens and control
 Tagged token: labeled packet of information
 Allows multiple iterations to be simultaneously active with shared control (instruction) and separate data tokens
 Tokens are matched by tag in the matching store via associative search
 If no match, make an entry and wait for the partner
 When there is a match, fetch the corresponding instruction from program memory and execute
Fig. (c) Dynamic dataflow computer
Fig. (d) Opcode structure
Dataflow: Advantages/Disadvantages
 Advantages:
 No program counter
 Data-driven
 Execution inhibited only by true data dependences
 Stateless / side-effect free
 Enhances parallelism
 Disadvantages:
 No program counter leads to very long fetch/execute latency
 Spatial locality in instruction fetch is hard to exploit
 Requires matching (e.g., via associative compares)
 No shared data structures
 No pointers into data structures (implies state)
Multicore Organization
 Number of core processors on chip
 Number of levels of cache on chip
 Amount of shared cache
 Next slide examples of each organization:
 (a) ARM11 MPCore
 (b) AMD Opteron
 (c) Intel Core Duo
 (d) Intel Core i7
Multicore Organization Alternatives
Advantages of Shared L2 Cache
 Constructive interference reduces overall miss rate
 Data shared by multiple cores not replicated at cache level
 With proper frame replacement algorithms the mean amount of shared cache dedicated to each core is dynamic
 Threads with less locality can have more cache
 Easy inter-process communication through shared memory
 Cache coherency confined to L1
 Dedicated L2 cache gives each core more rapid access
 Good for threads with strong locality
 Shared L3 cache may also improve performance
Individual Core Architecture
 Intel Core Duo uses superscalar cores
 Intel Core i7 uses simultaneous multithreading (SMT)
 Scales up the number of threads supported
 4 SMT cores, each supporting 4 threads, appear as 16 logical cores
Intel x86 Multicore Organization - Core Duo (1)
 2006
 Two x86 superscalar cores, shared L2 cache
 Dedicated L1 cache per core
 32KB instruction and 32KB data
 Thermal control unit per core
 Manages chip heat dissipation
 Maximize performance within constraints
 Improved ergonomics
 Advanced Programmable Interrupt Controller (APIC)
 Interprocessor interrupts between cores
 Routes interrupts to appropriate core
 Includes timer so OS can interrupt core
Intel x86 Multicore Organization - Core Duo (2)
 Power Management Logic
 Monitors thermal conditions and CPU activity
 Adjusts voltage and power consumption
 Can switch individual logic subsystems on or off
 2MB shared L2 cache
 Dynamic allocation
 MESI support for L1 caches
 Extended to support multiple Core Duo chips in SMP
 L2 data shared between local cores or external
 Bus interface
Intel Core Duo Block Diagram
Intel x86 Multicore Organization - Core i7
 November 2008
 Four x86 SMT processors
 Dedicated L2, shared L3 cache
 Speculative pre-fetch for caches
 On-chip DDR3 memory controller
 Three 8-byte channels (192 bits) giving 32GB/s
 No front side bus
 QuickPath Interconnect (QPI)
 Cache-coherent point-to-point link
 High-speed communications between processor chips
 6.4G transfers per second, 16 bits per transfer
 Dedicated bi-directional pairs
 Total bandwidth 25.6GB/s
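As a worked check of the quoted figures: 6.4 GT/s × 16 bits (2 bytes) per transfer = 12.8 GB/s in each direction, and the dedicated bidirectional pairs double this to 2 × 12.8 = 25.6 GB/s total.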
Intel Core i7 Block Diagram

More Related Content

What's hot

Multiprocessor Systems
Multiprocessor SystemsMultiprocessor Systems
Multiprocessor Systemsvampugani
 
Parallel computing
Parallel computingParallel computing
Parallel computingVinay Gupta
 
Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) A B Shinde
 
Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture Haris456
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performanceSyed Zaid Irshad
 
Computer architecture multi processor
Computer architecture multi processorComputer architecture multi processor
Computer architecture multi processorMazin Alwaaly
 
Stored program concept
Stored program conceptStored program concept
Stored program conceptgaurav jain
 
Distributed Operating System
Distributed Operating SystemDistributed Operating System
Distributed Operating SystemSanthiNivas
 
Parallel processing (simd and mimd)
Parallel processing (simd and mimd)Parallel processing (simd and mimd)
Parallel processing (simd and mimd)Bhavik Vashi
 
distributed Computing system model
distributed Computing system modeldistributed Computing system model
distributed Computing system modelHarshad Umredkar
 
Flynns classification
Flynns classificationFlynns classification
Flynns classificationYasir Khan
 
Lecture 2
Lecture 2Lecture 2
Lecture 2Mr SMAK
 
Heterogeneous computing
Heterogeneous computingHeterogeneous computing
Heterogeneous computingRashid Ansari
 
Dichotomy of parallel computing platforms
Dichotomy of parallel computing platformsDichotomy of parallel computing platforms
Dichotomy of parallel computing platformsSyed Zaid Irshad
 

What's hot (20)

Multiprocessor Systems
Multiprocessor SystemsMultiprocessor Systems
Multiprocessor Systems
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
 
Parallel computing
Parallel computingParallel computing
Parallel computing
 
Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism)
 
Multithreading computer architecture
 Multithreading computer architecture  Multithreading computer architecture
Multithreading computer architecture
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performance
 
Computer architecture multi processor
Computer architecture multi processorComputer architecture multi processor
Computer architecture multi processor
 
Stored program concept
Stored program conceptStored program concept
Stored program concept
 
Distributed Operating System
Distributed Operating SystemDistributed Operating System
Distributed Operating System
 
Parallel processing (simd and mimd)
Parallel processing (simd and mimd)Parallel processing (simd and mimd)
Parallel processing (simd and mimd)
 
distributed Computing system model
distributed Computing system modeldistributed Computing system model
distributed Computing system model
 
Flynns classification
Flynns classificationFlynns classification
Flynns classification
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
Parallel processing and pipelining
Parallel processing and pipeliningParallel processing and pipelining
Parallel processing and pipelining
 
Heterogeneous computing
Heterogeneous computingHeterogeneous computing
Heterogeneous computing
 
Parallel computing(1)
Parallel computing(1)Parallel computing(1)
Parallel computing(1)
 
Amdahl`s law -Processor performance
Amdahl`s law -Processor performanceAmdahl`s law -Processor performance
Amdahl`s law -Processor performance
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
Dichotomy of parallel computing platforms
Dichotomy of parallel computing platformsDichotomy of parallel computing platforms
Dichotomy of parallel computing platforms
 

Similar to Advanced processor Principles

Similar to Advanced processor Principles (20)

unit 4.pptx
unit 4.pptxunit 4.pptx
unit 4.pptx
 
unit 4.pptx
unit 4.pptxunit 4.pptx
unit 4.pptx
 
Parallel processing Concepts
Parallel processing ConceptsParallel processing Concepts
Parallel processing Concepts
 
Module2 MultiThreads.ppt
Module2 MultiThreads.pptModule2 MultiThreads.ppt
Module2 MultiThreads.ppt
 
Chapter 10
Chapter 10Chapter 10
Chapter 10
 
parallel-processing.ppt
parallel-processing.pptparallel-processing.ppt
parallel-processing.ppt
 
18 parallel processing
18 parallel processing18 parallel processing
18 parallel processing
 
Advanced processor principles
Advanced processor principlesAdvanced processor principles
Advanced processor principles
 
parallel processing.ppt
parallel processing.pptparallel processing.ppt
parallel processing.ppt
 
chapter-18-parallel-processing-multiprocessing (1).ppt
chapter-18-parallel-processing-multiprocessing (1).pptchapter-18-parallel-processing-multiprocessing (1).ppt
chapter-18-parallel-processing-multiprocessing (1).ppt
 
Parallel Processing Presentation2
Parallel Processing Presentation2Parallel Processing Presentation2
Parallel Processing Presentation2
 
Classification of Parallel Computers.pptx
Classification of Parallel Computers.pptxClassification of Parallel Computers.pptx
Classification of Parallel Computers.pptx
 
Parallel Processing
Parallel ProcessingParallel Processing
Parallel Processing
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 
Os
OsOs
Os
 
Operating System Lecture 4
Operating System Lecture 4Operating System Lecture 4
Operating System Lecture 4
 
Parallel Processing.pptx
Parallel Processing.pptxParallel Processing.pptx
Parallel Processing.pptx
 
M7_L1_PPT.computer organization and archi
M7_L1_PPT.computer organization and archiM7_L1_PPT.computer organization and archi
M7_L1_PPT.computer organization and archi
 
Parallel Processing (Part 2)
Parallel Processing (Part 2)Parallel Processing (Part 2)
Parallel Processing (Part 2)
 
Parallel and Distributed Computing chapter 3
Parallel and Distributed Computing chapter 3Parallel and Distributed Computing chapter 3
Parallel and Distributed Computing chapter 3
 

Recently uploaded

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 

Recently uploaded (20)

(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 

Advanced processor Principles

  • 2. Introduction to Parallel Processing  Parallel processing, one form of multiprocessing, is a situation in which one or more processors operate in unison  It is a method used to improve performance in a computer system  When two or more CPUs are executing instructions simultaneously, it is performing parallel processing  The Processor Manager has to coordinate the activity of each processor as well as synchronize cooperative interaction among the CPUs  Parallel processing, one form of multiprocessing, is a situation in which one or more processors operate in unison  It is a method used to improve performance in a computer system  When two or more CPUs are executing instructions simultaneously, it is performing parallel processing  The Processor Manager has to coordinate the activity of each processor as well as synchronize cooperative interaction among the CPUs
  • 3. Introduction to Parallel Processing  There are two primary benefits to parallel processing systems:  Increased reliability  The availability of more than one CPU  If one processor fails then, the others can continue to operate and absorb the load  Not simple to implement o The system must be carefully designed so that • The failing processor can inform the other processors to take over • The OS must reconstruct its resource allocation strategies so the remaining processors don’t become overloaded  There are two primary benefits to parallel processing systems:  Increased reliability  The availability of more than one CPU  If one processor fails then, the others can continue to operate and absorb the load  Not simple to implement o The system must be carefully designed so that • The failing processor can inform the other processors to take over • The OS must reconstruct its resource allocation strategies so the remaining processors don’t become overloaded
  • 4. Introduction to Parallel Processing  There are two primary benefits to parallel processing systems:  Increased throughput due to fast processing  The processing speed is achieved because sometimes instructions can be processed in parallel, two or more at a time in one of several ways: o Some systems allocate a CPU to each program or job o Others allocate CPU to each working set or parts of it o Others subdivide individual instructions so that each subdivision can be processed simultaneously • Concurrent programming  Increased flexibility brings increased complexity  There are two primary benefits to parallel processing systems:  Increased throughput due to fast processing  The processing speed is achieved because sometimes instructions can be processed in parallel, two or more at a time in one of several ways: o Some systems allocate a CPU to each program or job o Others allocate CPU to each working set or parts of it o Others subdivide individual instructions so that each subdivision can be processed simultaneously • Concurrent programming  Increased flexibility brings increased complexity
  • 5. Tightly Coupled - SMP
 Processors share memory
 Communicate via that shared memory
 Symmetric Multiprocessor (SMP)
 Share single memory or pool
 Shared bus to access memory
 Memory access time to given area of memory is approximately the same for each processor
  • 6. Tightly Coupled - NUMA
 Nonuniform memory access
 Access times to different regions of memory may differ
  • 7. Loosely Coupled - Clusters
 Interconnection of a collection of independent uniprocessors or SMPs
 Communication among computers via fixed paths or via a network facility
  • 8. Flynn’s Classification
 It is based on instruction and data processing.
 A computer is classified by whether it processes a single instruction at a time or multiple instructions simultaneously, and whether it operates on one or multiple data sets.
  • 9. Flynn’s Classification
 Single instruction, single data stream - SISD
 Single instruction, multiple data stream - SIMD
 Multiple instruction, single data stream - MISD
 Multiple instruction, multiple data stream - MIMD
  • 10. Single Instruction, Single Data Stream - SISD
 Single processor
 Single instruction stream
 Data stored in single memory
 No instruction parallelism
 No data parallelism
 Uni-processor
  • 12. Single Instruction, Multiple Data Stream - SIMD
 Single machine instruction
 Controls simultaneous execution
 Number of processing elements
 Each processing element has associated data memory
 Each instruction executed on different set of data by different processors (i.e. data-level parallelism)
 Vector and array processors
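As a concrete illustration of data-level parallelism, the sketch below uses x86 SSE intrinsics, where a single _mm_add_ps instruction adds four float lanes at once. The arrays and sizes are illustrative; any SSE-capable x86 compiler should accept it.

/* SIMD sketch: one vector instruction operates on four data lanes. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    /* Single instruction, multiple data: each iteration issues one
     * vector add that processes four floats simultaneously. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);   /* prints: 9 9 9 9 9 9 9 9 */
    printf("\n");
    return 0;
}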
  • 14. Multiple Instruction, Single Data Stream - MISD
 Sequence of data
 Transmitted to set of processors
 Each processor executes different instruction sequence on same data set
 Never been implemented
  • 16. Multiple Instruction, Multiple Data Stream - MIMD
 Set of processors
 Simultaneously execute different instruction sequences
 Different sets of data
 MIMD based computer system can
 Use shared memory or
 Work with distributed memory
 SMPs, clusters and NUMA systems
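A minimal MIMD sketch using POSIX threads: two threads simultaneously run different instruction streams on different data, communicating through shared memory as in an SMP. All names and values are illustrative.

/* MIMD sketch: different code, different data, running in parallel. */
#include <pthread.h>
#include <stdio.h>

static void *count_evens(void *arg) {      /* instruction stream 1 */
    int *v = arg, n = 0;
    for (int i = 0; i < 8; i++)
        if (v[i] % 2 == 0) n++;
    printf("evens: %d\n", n);
    return NULL;
}

static void *sum_floats(void *arg) {       /* instruction stream 2 */
    float *v = arg, s = 0;
    for (int i = 0; i < 4; i++)
        s += v[i];
    printf("sum: %.1f\n", s);
    return NULL;
}

int main(void) {
    int   ints[8]   = {1, 2, 3, 4, 5, 6, 7, 8};
    float floats[4] = {0.5f, 1.5f, 2.5f, 3.5f};
    pthread_t t1, t2;

    /* Different instruction sequences on different data sets:
     * the defining MIMD property. */
    pthread_create(&t1, NULL, count_evens, ints);
    pthread_create(&t2, NULL, sum_floats, floats);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}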
  • 17. Parallel Organizations - MIMD Shared Memory
  • 18. Parallel Organizations - MIMD Distributed Memory
  • 19. Taxonomy of Parallel Processor Architectures
  • 20. MIMD - Overview
 General purpose processors
 Each can process all instructions necessary
 Further classified by method of processor communication
  • 21. Concepts of Superscalar Architecture
 Execution of two or more instructions simultaneously, in different pipelines
 Throughput of a superscalar processor is greater than that of a pipelined scalar processor
 May use RISC (PowerPC) or CISC (Pentium 4) architectures
 Instruction execution sequencing may be
 Static (during compilation) or
 Dynamic (at run time)
  • 23. Requirements of Superscalar Architecture
1. More than one instruction should be fetched at a time
2. Decoding logic should check whether instructions are independent, and hence executable simultaneously
3. Sufficient number of execution units, preferably pipelined
4. Cycle time for each pipelined stage should match the cycle times for the fetching and decoding logic
  • 24. Superscalar Pipelines
 Time spent in each stage of the pipeline is the same for all stages and is determined by the slowest stage
 An instruction takes 200 ns using the pipeline shown in Fig. 13.1
 Superscalar Pipeline (Fig. 13.2)
 Fetch and execute multiple instructions in each clock cycle
 Multiple instructions are issued to multiple functional units in each clock cycle
 Issue instructions which are ready for issue, Out-of-Order (OoO)
  • 25. Superscalar Pipelines (cont.)
 Adding more pipelines improves throughput
 Execution units work independently
 Fig. 13.3: the first instruction takes 5 clock cycles, and thereafter two instructions complete every clock cycle
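A back-of-the-envelope cycle count for an ideal k-way superscalar pipeline with no hazards, matching the slide's example: with s = 5 stages and k = 2 issue slots, the first results appear after 5 cycles and k more instructions complete every cycle thereafter. The formula and function name are illustrative, not from the slides.

/* Idealized cycle count: first k instructions finish at cycle s,
 * the remaining n-k drain at k per cycle (ceiling division). */
#include <stdio.h>

static long cycles(long n, long s, long k) {
    if (n <= k) return s;
    return s + (n - k + k - 1) / k;
}

int main(void) {
    printf("100 instrs, 5-stage scalar: %ld cycles\n", cycles(100, 5, 1));  /* 104 */
    printf("100 instrs, 5-stage 2-way : %ld cycles\n", cycles(100, 5, 2));  /*  54 */
    return 0;
}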
  • 26. Superscalar Pipelines (cont.)
 Timing diagram for the system shown in Fig. 13.3, assuming no hazards
  • 27. Superscalar Pipelines (cont.)
 Two-way superscalar processor
 Pipelined integer and floating point units
 Integer unit: 5 stages
 Floating point unit: 7 stages
 Two instructions complete in every clock from the 7th clock onwards
 Difficult to achieve due to dependencies and branch instructions
  • 28. Superscalar Techniques
 Operand forwarding
 Delayed branching
 Dynamic scheduling
 Register renaming
 Branch prediction
 Multiple instruction issue
 Speculative execution
 Loop unrolling (sketched below)
 Software pipelining
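Of the techniques above, loop unrolling is the easiest to show in source code. A minimal sketch follows: unrolling by 4 exposes four independent adds per iteration that a multi-issue processor can schedule in parallel. The unroll factor and array names are illustrative.

/* Loop unrolling sketch: fewer branches, more independent work
 * per iteration for a superscalar scheduler to exploit. */
#include <stdio.h>

#define N 1024
static float a[N], b[N], c[N];

void add_rolled(void) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

void add_unrolled(void) {
    /* The four statements in the body have no dependences on each
     * other, so a multi-issue processor can issue them together. */
    for (int i = 0; i < N; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
    add_unrolled();
    printf("c[10] = %.0f\n", c[10]);   /* 30 */
    return 0;
}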
  • 29. Out-of-Order (OoO) in a nutshell
 Execute instructions based on the “data flow” graph (rather than program order)
 Improvement in performance
 Still need to keep the semantics of the original program (i.e. results rearranged in correct order)
 Retirement unit or commit unit
  • 30. Dynamic Scheduling
 Consider the following instruction sequence:
ADD R2, R3
SUB R4, R2
MUL R5, R6
 In a simple scalar processor, the pipeline is stalled while executing SUB, which also holds up the MUL instruction
 In a superscalar processor with dynamic scheduling, MUL can be executed out-of-order in another pipeline
 Instructions are issued in-order but executed out-of-order
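A C-level view of the same dependence pattern, assuming two-operand semantics where ADD R2,R3 means R2 = R2 + R3. The variables stand in for registers; values are illustrative.

#include <stdio.h>

int main(void) {
    int r2 = 1, r3 = 2, r4 = 10, r5 = 3, r6 = 4;

    r2 = r2 + r3;   /* ADD R2,R3 : produces the new R2              */
    r4 = r4 - r2;   /* SUB R4,R2 : must wait for the new R2 (RAW)   */
    r5 = r5 * r6;   /* MUL R5,R6 : independent of both, so dynamic  */
                    /* scheduling can execute it out of order in    */
                    /* another pipeline                             */

    printf("r4=%d r5=%d\n", r4, r5);   /* r4=7 r5=12 */
    return 0;
}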
  • 32. Superscalar basics: Data flow analysis  Example:
  • 33. Out-of-Order Execution
 Advantages: Better performance!
 Exploit Instruction Level Parallelism (ILP)
 Hide latencies (e.g., L1 data cache miss)
 Disadvantages:
 HW is much more complex than that of in-order processors
 Can compilers do this work?
 In a very limited way – can only statically schedule instructions (VLIW)
 Compilers lack runtime information
 Conditional branch direction (→ compiler limited to basic blocks)
 Data values, which may affect calculation time and control
 Cache miss / hit
  • 34. Speculative Execution
 In branching, it is not known which instruction should execute next until the condition is evaluated
 A simple processor stalls
 An advanced processor speculates
 Speculative execution: execute control-dependent instructions even when we are not sure if they should be executed
 If the assumption goes wrong, the results are dropped
 With branch prediction, we speculate on the outcome of branches and execute the program as if our guesses were correct
 Misprediction: hardware undo (rollback and recovery)
  • 35. Speculative Execution (cont.)
 Common Implementation
 Fetch/Decode instructions from the predicted execution path
 Instructions can execute as soon as their operands become ready
 Instructions can graduate and commit to memory only once it is certain they should have been executed
 An instruction commits only when all previous (in-order) instructions have committed ⇒ instructions commit in-order
 Instructions on a mis-predicted execution path are flushed
  • 36. Speculative Execution (cont.)
 Consider the following source code:
IF I=J THEN M=M+1 ELSE M=M-1
 Addition or subtraction?
 Branch prediction based on previous information
 Improvement using a Branch Target Buffer (BTB) cache
 For a branch seen before, the processor guesses the path to be followed based on the earlier outcome
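The slide's branch in C, with a branch-free rewrite for contrast. When the hardware predicts the I == J outcome correctly, it can speculatively execute the chosen arm before the compare resolves; the branchless form removes the guess entirely. Variable names follow the slide; the values are illustrative.

#include <stdio.h>

int main(void) {
    int I = 5, J = 5, M = 100;

    if (I == J)      /* predictor guesses taken/not-taken from the */
        M = M + 1;   /* branch history held in the BTB             */
    else
        M = M - 1;

    /* Branchless equivalent: nothing to mispredict.  Compilers
     * often lower this to a conditional move. */
    int M2 = 100;
    M2 += (I == J) ? 1 : -1;

    printf("M=%d M2=%d\n", M, M2);   /* M=101 M2=101 */
    return 0;
}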
  • 37. Increasing Performance
 Processor performance can be measured by the rate at which it executes instructions
 MIPS rate = f * IPC
 f is the processor clock frequency, in MHz
 IPC is the average number of instructions executed per cycle
 Increase performance by increasing the clock frequency and increasing the number of instructions that complete during a cycle
 May be reaching a limit
 Complexity
 Power consumption
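A worked instance of the slide's formula; the frequency and IPC values are illustrative, not from the slides.

#include <stdio.h>

int main(void) {
    double f_mhz = 2000.0;   /* 2 GHz clock = 2000 MHz           */
    double ipc   = 1.5;      /* average instructions per cycle   */

    double mips = f_mhz * ipc;   /* 2000 MHz * 1.5 = 3000 MIPS   */
    printf("MIPS rate = %.0f\n", mips);
    return 0;
}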
  • 38. Multithreaded Processor
 Instruction stream divided into smaller streams (threads)
 Executed in parallel
 Wide variety of multithreading designs
  • 39. Definitions of Threads and Processes
 Threads in multithreaded processors may or may not be the same as software threads
 Process
 An instance of a program running on a computer
 Resource ownership
 Virtual address space to hold process image
 Scheduling/execution
 Process switch
 An operation that switches the processor from one process to another
 Thread: dispatchable unit of work within a process
 Includes processor context (which includes the program counter and stack pointer) and data area for stack
 Thread executes sequentially
 Interruptible: processor can turn to another thread
 Thread switch
 Switching the processor between threads within the same process
 Typically less costly than a process switch
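A minimal sketch contrasting the two units: a thread created with pthread_create shares its process's address space, while a child process created with fork gets a private copy. POSIX-only; compile with cc -pthread. All names are illustrative.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared = 0;

static void *thread_body(void *arg) {
    (void)arg;
    shared = 42;          /* visible to main: same address space */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after thread:  shared = %d\n", shared);   /* 42 */

    pid_t pid = fork();
    if (pid == 0) {       /* child process: private copy of 'shared' */
        shared = 7;
        exit(0);
    }
    wait(NULL);
    printf("after process: shared = %d\n", shared);   /* still 42 */
    return 0;
}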
  • 40. Implicit and Explicit Multithreading
 All commercial processors and most experimental ones use explicit multithreading
 Concurrently execute instructions from different explicit threads
 Interleave instructions from different threads on shared pipelines or parallel execution on parallel pipelines
 Implicit multithreading is concurrent execution of multiple threads extracted from a single sequential program
 Implicit threads defined statically by compiler or dynamically by hardware
  • 41. Approaches to Explicit Multithreading
 Interleaved Multithreading
 Fine-grained
 Processor deals with two or more thread contexts at a time
 Switching thread at each clock cycle
 If thread is blocked it is skipped
 Blocked Multithreading
 Coarse-grained
 Thread executed until event causes delay
 E.g. Cache miss
 Effective on in-order processor
 Avoids pipeline stall
 Simultaneous Multithreading (SMT)
 Instructions simultaneously issued from multiple threads to execution units of superscalar processor
 Chip multiprocessing
 Processor is replicated on a single chip
 Each processor handles separate threads
  • 42. Scalar Processor Approaches
 Single-threaded scalar
 Simple pipeline
 No multithreading
 Interleaved multithreaded scalar
 Easiest multithreading to implement
 Switch threads at each clock cycle
 Pipeline stages kept close to fully occupied
 Hardware needs to switch thread context between cycles
 Blocked multithreaded scalar
 Thread executed until latency event occurs
 Would stop pipeline
 Processor switches to another thread
  • 44. Multiple Instruction Issue Processors (1)
 Superscalar
 No multithreading
 Interleaved multithreading superscalar
 Each cycle, as many instructions as possible issued from a single thread
 Delays due to thread switches eliminated
 Number of instructions issued in a cycle limited by dependencies
 Blocked multithreaded superscalar
 Instructions from one thread
 Blocked multithreading used
  • 46. Multiple Instruction Issue Processors (2)
 Very long instruction word (VLIW)
 E.g. IA-64
 Multiple instructions in single word
 Typically constructed by compiler
 Operations that may be executed in parallel in same word
 May pad with no-ops
 Interleaved multithreading VLIW
 Similar efficiencies to interleaved multithreading on superscalar architecture
 Blocked multithreaded VLIW
 Similar efficiencies to blocked multithreading on superscalar architecture
  • 48. Parallel, Simultaneous Execution of Multiple Threads
 Simultaneous multithreading
 Issue multiple instructions at a time
 One thread may fill all horizontal slots
 Instructions from two or more threads may be issued
 With enough threads, can issue maximum number of instructions on each cycle
 Chip multiprocessor
 Multiple processors
 Each has two-issue superscalar processor
 Each processor is assigned a thread
 Can issue up to two instructions per cycle per thread
  • 50. Examples
 Some Pentium 4
 Intel calls it hyperthreading
 SMT with support for two threads
 Single multithreaded processor, logically two processors
 IBM Power5
 High-end PowerPC
 Combines chip multiprocessing with SMT
 Chip has two separate processors
 Each supporting two threads concurrently using SMT
  • 51. VLIW Architecture
 Aims at speeding up computation by exploiting instruction-level parallelism
 Follows static scheduling
 Compiler groups several operations into a very long instruction word (VLIW)
 Same hardware core as superscalar processors, having multiple execution units (EUs) working in parallel, to execute operations in one clock cycle
 An instruction consists of multiple operations; typical word length from 128 bits to 1 Kbits
 All operations in an instruction are executed in lock-step mode
 Relies on the compiler to find parallelism and schedule dependency-free program code
  • 52. VLIW Architecture (cont.)
 VLIW offers a Plan Of Execution (POE) created statically during compilation
 A VLIW processor consists of a set of functional units like adders, multipliers, branch units, etc.
 Delivers the POE via the instruction set
 Very simple control logic, no dynamic scheduling
 Superscalar
 Increasing the number of functional units results in complex instruction scheduling hardware
 5 or 6 instructions dispatched per cycle
 VLIW architecture
 Compiler selects instructions without dependencies and joins them as very long instructions
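A toy data-structure model of a VLIW instruction word, not any real ISA: the compiler statically fills one slot per functional unit, pads unused slots with no-ops, and all slots issue in lock-step. Every name here is illustrative.

/* Illustrative VLIW word: one slot per functional unit. */
#include <stdio.h>

typedef enum { NOP, ADD, MUL, LOAD, BRANCH } Op;

typedef struct {
    Op int_slot;   /* integer ALU  */
    Op mul_slot;   /* multiplier   */
    Op mem_slot;   /* load/store   */
    Op br_slot;    /* branch unit  */
} VLIWWord;

int main(void) {
    /* Compiler-built plan of execution: independent operations are
     * grouped into one word; a dependence forces a new word. */
    VLIWWord program[] = {
        { ADD, MUL, LOAD, NOP    },  /* 3 independent ops issue together */
        { ADD, NOP, NOP,  NOP    },  /* depends on word 0: padded        */
        { NOP, NOP, NOP,  BRANCH },
    };
    int words = (int)(sizeof program / sizeof program[0]);
    printf("issued %d words, 4 slots each\n", words);
    return 0;
}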
  • 56. VLIW Characteristics
 Only RISC-like operation support
 Short cycle times
 Flexible: can implement any FU mixture
 Extensible
 Tight inter-FU connectivity required
 Large instructions (up to 1024 bits)
 Not binary compatible !!!
 But good compilers exist
  • 57. VLIW: Merits
 Reduced hardware complexity
 Tasks such as decoding, data dependency detection, instruction issue, etc. become simple
 Potentially higher clock rate
 Higher degree of parallelism with global program information
 No need for register renaming
 Less power consumption
  • 58. VLIW: Demerits
 Object code requires more memory
 Compiler development is an involved and time-consuming process
 Compilation is slow
 The compiler needs in-depth knowledge of the hardware details of its processor
 A new version of a VLIW processor cannot recognize the object code of an old VLIW processor
 Inefficient for object-oriented and event-driven programs
  • 59. Superscalar and VLIW
 VLIW computers
 Multiflow, Cydrome (old)
 Intel i860 (based on single chip)
 Kartsev’s M-10, M-13 and Elbrus-3 (popular)
  • 61. Data Flow Computing
 Instruction execution is driven by data availability
 Not guided by a program counter (PC)
 Instructions are not ordered
 No need for shared memory: data held directly inside instructions
 Instructions are activated on availability of data tokens
 Represented by directed graphs
  • 62. Data Flow Graph
 Graph shows flow of data
 Each instruction consists of
 An operator
 One or more operands
 One or more destinations where the result will be sent
  • 63. Control Flow vs Data Flow
 Von Neumann or control flow computing model:
 Uses a program counter to sequence execution of instructions
 Uses shared memory to hold instructions and data
 Data dependency and synchronization issues restrict parallel processing
 Can be made parallel using special parallel control operators like FORK and JOIN
 Dataflow model:
 Execution is driven only by the availability of operands!
 No PC and no global updateable store
 The two features of the von Neumann model that become bottlenecks in exploiting parallelism are missing
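A toy dataflow evaluation of x = (a+b) * (c-d), a minimal sketch: each node fires as soon as both operand slots are present, not in program order, so the ADD and SUB nodes are independent and could fire simultaneously. All structures and names are illustrative.

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    char op;           /* '+', '-', '*'             */
    int  operand[2];
    bool present[2];   /* the slide's presence bits */
} Node;

/* Deliver a data token to a node; report whether it can now fire. */
static bool deliver(Node *n, int slot, int value) {
    n->operand[slot] = value;
    n->present[slot] = true;
    return n->present[0] && n->present[1];
}

static int fire(Node *n) {
    int a = n->operand[0], b = n->operand[1];
    return n->op == '+' ? a + b : n->op == '-' ? a - b : a * b;
}

int main(void) {
    Node add = { '+', {0, 0}, {false, false} };
    Node sub = { '-', {0, 0}, {false, false} };
    Node mul = { '*', {0, 0}, {false, false} };

    /* Tokens arrive; ADD and SUB have no mutual dependence. */
    deliver(&add, 0, 2); deliver(&add, 1, 3);   /* a=2, b=3 */
    deliver(&sub, 0, 7); deliver(&sub, 1, 4);   /* c=7, d=4 */

    /* MUL fires only when both upstream results are available. */
    deliver(&mul, 0, fire(&add));
    if (deliver(&mul, 1, fire(&sub)))
        printf("x = %d\n", fire(&mul));         /* (2+3)*(7-4) = 15 */
    return 0;
}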
  • 64. Static Dataflow
 Combines control and data into a template, like a reservation station
 A presence bit indicates whether operands are available/ready or not
 After instruction execution, the corresponding presence bits are set
Fig. (a) Static dataflow computer
Fig. (b) Opcode structure
  • 65. Dynamic Dataflow
 Separate data tokens and control
 Tagged token: labeled packet of information
 Allows multiple iterations to be simultaneously active with shared control (instructions) and separate data tokens
 The operation is held by matching the token’s tag in the matching store via associative search
 If no match, make an entry and wait for the partner
 When there is a match, fetch the corresponding instruction from program memory and execute
Fig. (c) Dynamic dataflow computer
Fig. (d) Opcode structure
  • 66. Dataflow: Advantages/Disadvantages
 Advantages:
 No program counter
 Data-driven
 Execution inhibited only by true data dependences
 Stateless / side-effect free
 Enhances parallelism
 Disadvantages:
 No program counter leads to very long fetch/execute latency
 Spatial locality in instruction fetch is hard to exploit
 Requires matching (e.g., via associative compares)
 No shared data structures
 No pointers into data structures (implies state)
  • 67. Multicore Organization
 Number of core processors on chip
 Number of levels of cache on chip
 Amount of shared cache
 The next slide shows examples of each organization:
 (a) ARM11 MPCore
 (b) AMD Opteron
 (c) Intel Core Duo
 (d) Intel Core i7
  • 69. Advantages of Shared L2 Cache
 Constructive interference reduces overall miss rate
 Data shared by multiple cores not replicated at cache level
 With proper frame replacement algorithms, the mean amount of shared cache dedicated to each core is dynamic
 Threads with less locality can have more cache
 Easy inter-process communication through shared memory
 Cache coherency confined to L1
 Dedicated L2 cache gives each core more rapid access
 Good for threads with strong locality
 Shared L3 cache may also improve performance
  • 70. Individual Core Architecture
 Intel Core Duo uses superscalar cores
 Intel Core i7 uses simultaneous multithreading (SMT)
 Scales up the number of threads supported
 4 SMT cores, each supporting 4 threads, appear as 16 cores
  • 71. Intel x86 Multicore Organization - Core Duo (1)
 2006
 Two x86 superscalar cores, shared L2 cache
 Dedicated L1 cache per core
 32KB instruction and 32KB data
 Thermal control unit per core
 Manages chip heat dissipation
 Maximize performance within constraints
 Improved ergonomics
 Advanced Programmable Interrupt Controller (APIC)
 Inter-process interrupts between cores
 Routes interrupts to appropriate core
 Includes timer so OS can interrupt core
  • 72. Intel x86 Multicore Organization - Core Duo (2)
 Power Management Logic
 Monitors thermal conditions and CPU activity
 Adjusts voltage and power consumption
 Can switch individual logic subsystems on or off
 2MB shared L2 cache
 Dynamic allocation
 MESI support for L1 caches
 Extended to support multiple Core Duo chips in SMP
 L2 data shared between local cores or external
 Bus interface
  • 73. Intel Core Duo Block Diagram
  • 74. Intel x86 Multicore Organization - Core i7
 November 2008
 Four x86 SMT processors
 Dedicated L2, shared L3 cache
 Speculative pre-fetch for caches
 On-chip DDR3 memory controller
 Three 8-byte channels (192 bits) giving 32GB/s
 No front side bus
 QuickPath Interconnect (QPI)
 Cache-coherent point-to-point link
 High speed communications between processor chips
 6.4G transfers per second, 16 bits per transfer
 Dedicated bi-directional pairs
 Total bandwidth 25.6GB/s
  • 75. Intel Core i7 Block Diagram