Topics Left
• Superscalar machines
• IA64 / EPIC architecture
• Multithreading (explicit and implicit)
• Multicore Machines
• Clusters
• Parallel Processors
• Hardware implementation vs microprogramming
Chapter 14
Superscalar Processors
• Definition of Superscalar
• Design Issues:
- Instruction Issue Policy
- Register renaming
- Machine parallelism
- Branch Prediction
- Execution
• Pentium 4 example
What is Superscalar?
• “Common” instructions (arithmetic, load/store, conditional branch)
can be executed independently.
• Equally applicable to RISC & CISC, but more straightforward in
RISC machines.
• The order of execution is usually assisted by the compiler.
A Superscalar machine executes multiple independent
instructions in parallel.
They are pipelined as well.
Example of Superscalar Organization
• 2 Integer ALU pipelines,
• 2 FP ALU pipelines,
• 1 memory pipeline (?)
Superscalar v Superpipelined
Limitations of Superscalar
• Dependent upon:
- Instruction level parallelism possible
- Compiler based optimization
- Hardware support
• Limited by
— Data dependency
— Procedural dependency
— Resource conflicts
(Recall) True Data Dependency
(Must W before R)
ADD r1, r2      r1 + r2 → r1
MOVE r3, r1     r1 → r3
• Can fetch and decode second instruction in parallel with
first
LOAD r1, X      X (memory) → r1
MOVE r3, r1     r1 → r3
• Can NOT execute second instruction until first is
finished
Second instruction is dependent on first (R after W)
(Recall) Antidependency (Must R before W)
ADD R4, R3, 1   R3 + 1 → R4
ADD R3, R5, 1   R5 + 1 → R3
• Cannot complete the second instruction before the first has
read R3
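The two recalled cases can be checked mechanically. The sketch below is only an illustration: the function name `classify_dependency` and the (writes, reads) encoding of each instruction are assumptions made for this example, not something from the slides. It also reports output dependencies (W after W), which come up later under register renaming.

```python
def classify_dependency(first, second):
    """Return the hazards that force `second` to stay ordered after `first`.

    Each instruction is a hypothetical (writes, reads) pair of register sets.
    """
    first_writes, first_reads = first
    second_writes, second_reads = second
    hazards = set()
    if first_writes & second_reads:
        hazards.add("RAW")  # true data dependency: must write before read
    if first_reads & second_writes:
        hazards.add("WAR")  # antidependency: must read before write
    if first_writes & second_writes:
        hazards.add("WAW")  # output dependency: writes must stay in order
    return hazards


# ADD r1, r2 (r1 + r2 -> r1) followed by MOVE r3, r1 (r1 -> r3)
add_r1 = ({"r1"}, {"r1", "r2"})
move_r3 = ({"r3"}, {"r1"})

# ADD R4, R3, 1 (R3 + 1 -> R4) followed by ADD R3, R5, 1 (R5 + 1 -> R3)
add_r4 = ({"R4"}, {"R3"})
add_r3 = ({"R3"}, {"R5"})

print(classify_dependency(add_r1, move_r3))  # {'RAW'}
print(classify_dependency(add_r4, add_r3))   # {'WAR'}
```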
(Recall) Procedural Dependency
• Can’t execute instructions after a branch in parallel
with instructions before a branch, because?
Note: Also, if instruction length is not fixed,
instructions have to be decoded to find out how many
fetches are needed
(recall) Resource Conflict
• Two or more instructions requiring access to the
same resource at the same time
— e.g. two arithmetic instructions need the ALU
• Solution - Can possibly duplicate resources
— e.g. have two arithmetic units
Effect of Dependencies on Superscalar Operation
Notes:
1) Superscalar operation is doubly impacted by a stall.
2) CISC machines typically have variable-length instructions, which must be at least
partially decoded before the next one can be fetched – not good for superscalar operation
Instruction-level Parallelism – degree of
• Consider:
LOAD R1, R2
ADD R3, 1
ADD R4, R2
These can be handled in parallel.
• Consider:
ADD R3, 1
ADD R4, R3
STO (R4), R0
These cannot be handled in parallel.
The “degree” of instruction-level parallelism is determined by the
number of instructions that can be executed in parallel without
stalling for dependencies
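One rough way to see the difference between the two sequences is to compute how many sequential steps the dependencies force. This is a sketch under stated assumptions: the register read/write sets below are an interpretation of the slide's two-operand instructions (for example, ADD R3, 1 is taken to read and write R3), and `dependency_depth` is a name invented for this example. Resource limits are ignored.

```python
def conflicts(earlier, later):
    """True if `later` must wait for `earlier` (RAW, WAR, or WAW on a register)."""
    e_writes, e_reads = earlier
    l_writes, l_reads = later
    return bool(e_writes & l_reads or e_reads & l_writes or e_writes & l_writes)


def dependency_depth(program):
    """Minimum number of sequential steps needed, given unlimited hardware."""
    levels = []
    for i, ins in enumerate(program):
        # An instruction runs one step after the latest instruction it conflicts with.
        prior = (levels[j] for j in range(i) if conflicts(program[j], ins))
        levels.append(1 + max(prior, default=0))
    return max(levels, default=0)


# LOAD R1, R2 / ADD R3, 1 / ADD R4, R2  (assumed read/write sets)
parallel = [({"R1"}, {"R2"}), ({"R3"}, {"R3"}), ({"R4"}, {"R4", "R2"})]

# ADD R3, 1 / ADD R4, R3 / STO (R4), R0  (the store writes memory, not a register)
serial = [({"R3"}, {"R3"}), ({"R4"}, {"R4", "R3"}), (set(), {"R4", "R0"})]

print(dependency_depth(parallel))  # 1: all three can go in the same step
print(dependency_depth(serial))    # 3: each instruction waits for the previous one
```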
Instruction Issue Policies
• Order in which instructions are fetched
• Order in which instructions are executed
• Order in which instructions update registers and
memory values (order of completion)
Standard Categories:
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion
In-Order Issue -- In-Order Completion
Issue instructions in the order they occur:
• Not very efficient
• Instructions must stall if necessary (and stalling in
superpipelining is expensive)
In-Order Issue -- In-Order Completion
(Example)
Assume:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
In-Order Issue -- Out-of-Order Completion
(Example)
How does this affect interrupts?
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
Out-of-Order Issue -- Out-of-Order Completion
• Decouple decode pipeline from execution pipeline
• Can continue to fetch and decode until the “window”
is full
• When a functional unit becomes available an
instruction can be executed (usually as close to
program order as possible)
• Since instructions have been decoded, processor can
look ahead
Out-of-Order Issue -- Out-of-Order Completion
(Example)
Note: I5 depends upon I4, but I6 does not
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon value produced by I4
• I5 & I6 conflict for a functional unit
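The effect of the issue policy on this six-instruction example can be approximated with a toy simulator. This is a simplified sketch, not the model behind the slide's figure: it tracks only the execute stage (no fetch, decode, or write-back limits), assumes an issue width of 2, and the unit names "A" to "D" are hypothetical labels chosen so that I3/I4 and I5/I6 conflict as stated. Cycle counts therefore will not match the figure, but the relative ordering does show I6 moving ahead of I5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str          # hypothetical functional-unit label
    latency: int = 1   # execute cycles
    deps: tuple = ()   # names of instructions whose results are needed


PROGRAM = [
    Instr("I1", "A", latency=2),
    Instr("I2", "B"),
    Instr("I3", "C"),
    Instr("I4", "C"),                 # conflicts with I3 for unit C
    Instr("I5", "D", deps=("I4",)),   # needs the value produced by I4
    Instr("I6", "D"),                 # conflicts with I5 for unit D
]


def simulate(program, out_of_order, width=2):
    """Return (issue cycle per instruction, cycle in which the last one finishes)."""
    finish, unit_busy_until, issue_cycle = {}, {}, {}
    pending = list(program)
    cycle = 0
    while pending:
        cycle += 1
        issued = 0
        for ins in list(pending):
            if issued == width:
                break
            ready = all(finish.get(d, float("inf")) < cycle for d in ins.deps)
            free = unit_busy_until.get(ins.unit, 0) < cycle
            if ready and free:
                issue_cycle[ins.name] = cycle
                finish[ins.name] = cycle + ins.latency - 1
                unit_busy_until[ins.unit] = finish[ins.name]
                pending.remove(ins)
                issued += 1
            elif not out_of_order:
                break  # in-order issue: a stalled instruction blocks everything behind it
    return issue_cycle, max(finish.values())


print(simulate(PROGRAM, out_of_order=False)[1])  # 5
print(simulate(PROGRAM, out_of_order=True)[1])   # 4
```

With out-of-order issue, I6 is issued in cycle 2 while I5 is still waiting for I4, which is where the saved cycle comes from in this model.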
Register Renaming
to avoid hazards
• Output and antidependencies occur because register
contents may not reflect the correct ordering from the
program
• Can require a pipeline stall
• One solution: Allocate Registers dynamically
(renaming registers)
Register Renaming example
Add R3, R3, R5 R3b:=R3a + R5a (I1)
Add R4, R3, 1 R4b:=R3b + 1 (I2)
Add R3, R5, 1 R3c:=R5a + 1 (I3)
Add R7, R3, R4 R7b:=R3c + R4b (I4)
• A register name without a subscript refers to the logical register in the
instruction
• A name with a subscript refers to the hardware register allocated:
R3a R3b R3c
Note: Allocating R3c for I3 avoids the antidependency on I2
and the output dependency on I1
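The renaming in this example can be reproduced with a few lines of code. This is a minimal sketch, assuming each logical register starts out mapped to an "a" version and every write allocates the next letter; the `rename` function and the (destination, sources) encoding are invented for illustration and say nothing about how real hardware stores its mapping table.

```python
def rename(program):
    """Rename registers so each write gets a fresh hardware register.

    `program` is a list of (dest, sources) pairs using logical names like "R3";
    non-register operands (immediates such as "1") pass through unchanged.
    """
    version = {}  # logical register -> index of its current hardware copy

    def current(reg):
        return reg + chr(ord("a") + version.setdefault(reg, 0))

    renamed = []
    for dest, sources in program:
        # Sources are read from the current mapping before the destination is remapped.
        new_sources = tuple(current(s) if s.startswith("R") else s for s in sources)
        version[dest] = version.setdefault(dest, 0) + 1
        renamed.append((current(dest), new_sources))
    return renamed


program = [
    ("R3", ("R3", "R5")),  # I1: Add R3, R3, R5
    ("R4", ("R3", "1")),   # I2: Add R4, R3, 1
    ("R3", ("R5", "1")),   # I3: Add R3, R5, 1
    ("R7", ("R3", "R4")),  # I4: Add R7, R3, R4
]

for dest, sources in rename(program):
    print(dest, ":=", " + ".join(sources))
# R3b := R3a + R5a
# R4b := R3b + 1
# R3c := R5a + 1
# R7b := R3c + R4b
```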
Recapping: Machine Parallelism Support
• Duplication of Resources
• Out of order issue hardware
• Windowing to decouple execution from decode
• Register Renaming capability
Speedups of Machine Organizations
(Without Procedural Dependencies)
• Not worth duplication of functional units without register renaming
• Need instruction window large enough (more than 8, probably not more than 32)
Branch Prediction in Superscalar Machines
• Delayed branch not used much. Why?
Multiple instructions need to execute in the delay slot.
This leads to much complexity in recovery.
• Branch prediction should be used - Branch history is
very useful
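As an illustration of why branch history helps, here is a sketch of a 2-bit saturating-counter predictor. The slides do not say which history scheme is meant, so this particular scheme, the class name `TwoBitPredictor`, and the loop-branch workload are all assumptions for the example.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, pc):
        return self.counters.get(pc, 0) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 0)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop-closing branch: taken 9 times, then falls through; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
print(count_mispredictions(TwoBitPredictor(), 0x400, outcomes))  # 4
```

Two of the four misses are warm-up; after that only the loop exit is mispredicted, because a single not-taken outcome does not flip the prediction.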
View of Superscalar Execution
Committing or Retiring Instructions
Results need to be put into order (commit or retire)
• Results sometimes must be held in temporary storage
until it is certain they can be placed in “permanent”
storage.
(either committed/retired or flushed)
• Temporary storage requires regular clean up –
overhead – done in hardware.
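The idea of holding results until they can be committed in order can be sketched with a small buffer model. This is an illustrative assumption, not a description of any particular processor: the `ReorderBuffer` class and its method names are invented here, and flushing is left out.

```python
from collections import OrderedDict


class ReorderBuffer:
    """Instructions enter in program order, complete in any order, retire in order."""

    def __init__(self):
        self.entries = OrderedDict()  # instruction name -> has it completed?

    def allocate(self, name):
        self.entries[name] = False

    def complete(self, name):
        self.entries[name] = True

    def retire(self):
        """Commit completed instructions from the head; stop at the first unfinished one."""
        retired = []
        while self.entries:
            name, done = next(iter(self.entries.items()))
            if not done:
                break
            self.entries.popitem(last=False)
            retired.append(name)
        return retired


rob = ReorderBuffer()
for name in ("I1", "I2", "I3", "I4"):
    rob.allocate(name)

rob.complete("I3")
rob.complete("I2")
print(rob.retire())  # []: I1 has not finished, so nothing behind it may commit
rob.complete("I1")
print(rob.retire())  # ['I1', 'I2', 'I3']
```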
Superscalar Hardware Support
• Facilities to simultaneously fetch multiple
instructions
• Logic to determine true dependencies involving
register values, and mechanisms to communicate
these values
• Mechanisms to initiate multiple instructions in
parallel
• Resources for parallel execution of multiple
instructions
• Mechanisms for committing process state in correct
order
Example: Pentium 4
A Superscalar CISC Machine
Pentium 4 alternate view
Pentium 4 pipeline
20 stages!
a) Generation of Micro-ops (stages 1 & 2)
• Using the Branch Target Buffer and Instruction Translation
Lookaside Buffer, the x86 instructions are fetched 64 bytes at a
time from the L2 cache
• The instruction boundaries are determined and instructions
decoded into 1-4 118-bit RISC micro-ops
• Micro-ops are stored in the trace cache
b) Trace cache next instruction pointer (stage 3)
• The Trace Cache Branch Target Buffer contains dynamically
gathered history information (4 bit tag)
• If target is not in BTB
- Branch not PC relative: predict branch taken if it is a return,
predict not taken otherwise
- For PC relative backward conditional branches, predict taken,
otherwise not taken
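The fallback rules used when the target is not in the BTB amount to a small decision function. The sketch below simply encodes the two rules as stated on this slide; the function name `static_predict` and its boolean parameters are made up for the example.

```python
def static_predict(pc_relative, is_return=False, backward=False, conditional=False):
    """Static prediction used when the branch has no BTB entry. True means "taken"."""
    if not pc_relative:
        # Not PC relative: taken only if it is a return.
        return is_return
    # PC relative: taken only for backward conditional branches (typically loops).
    return backward and conditional


print(static_predict(pc_relative=False, is_return=True))                   # True
print(static_predict(pc_relative=False))                                   # False
print(static_predict(pc_relative=True, backward=True, conditional=True))   # True
print(static_predict(pc_relative=True, backward=False, conditional=True))  # False
```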
c) Trace Cache fetch (stage 4)
• Orders micro-ops in program-ordered sequences called traces
• These are fetched in order, subject to branch prediction
• Some instructions require many micro-ops (complex CISC instructions).
These micro-op sequences are coded into the ROM and fetched from the ROM
d) Drive (stage 5)
• Delivers instructions from the Trace Cache to the
Rename/Allocator module for reordering
e) Allocate: register renaming (stages 6, 7, & 8)
• Allocates resources for execution (3 micro-ops arrive per clock cycle):
- Each micro-op is allocated to a slot in the 126 position circular Reorder Buffer (ROB) which
tracks progress of the micro-ops.
Buffer entries include:
- State – scheduled, dispatched, completed, ready for retire
- Address that generated the micro-op
- Operation
- Alias register assigned in place of one of the 16 architectural registers
(128 alias registers available), to remove false (anti- and output) data dependencies
• The micro-ops are dispatched out of order as resources are available
• Allocates an entry to one of the 2 scheduler queues - memory access or not
f) Micro-op queuing (stage 9)
• Micro-ops are loaded into one of 2 queues:
- one for memory operations
- one for non memory operations
• Each queue operates on a FIFO policy
g) Micro-op scheduling
(stages 10, 11, & 12)
• The 2 schedulers retrieve micro-ops based upon having all
the operands ready and dispatch them to an available unit (up
to 6 per clock cycle)
• If two micro-ops need the same unit, they are dispatched
in sequence.
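One scheduling cycle over the two queues might be sketched as follows. This is a loose illustration only: the tuple layout (name, unit, source registers), the unit labels, and the choice to scan each queue from the front while skipping micro-ops that are not ready are assumptions made for this example, not details of the Pentium 4 schedulers.

```python
from collections import deque


def schedule_cycle(mem_queue, other_queue, ready_regs, free_units, max_dispatch=6):
    """Dispatch micro-ops for one clock cycle; returns the names dispatched.

    A micro-op is dispatched only if all its source operands are ready and its
    unit is still free this cycle; otherwise it stays queued in its original order.
    """
    dispatched = []
    for queue in (mem_queue, other_queue):
        waiting = deque()
        while queue:
            uop = queue.popleft()
            name, unit, sources = uop
            operands_ready = all(s in ready_regs for s in sources)
            if len(dispatched) < max_dispatch and unit in free_units and operands_ready:
                free_units.remove(unit)  # a unit accepts one micro-op per cycle
                dispatched.append(name)
            else:
                waiting.append(uop)
        queue.extend(waiting)
    return dispatched


mem_queue = deque([("ld1", "mem", ("r1",)), ("ld2", "mem", ("r2",))])
other_queue = deque([("add1", "alu0", ("r9",)),
                     ("add2", "alu0", ("r1",)),
                     ("add3", "alu1", ("r2",))])

print(schedule_cycle(mem_queue, other_queue, {"r1", "r2"}, {"mem", "alu0", "alu1"}))
# ['ld1', 'add2', 'add3']
```

Here ld2 waits a cycle because ld1 took the memory unit, and add1 waits because its operand r9 is not ready yet.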
h) Dispatch
(stages 13 & 14)
i) Register file
(stages 15 & 16)
j) Execute: flags
(stages 17 & 18)
• The register files are the sources for pending fixed-point and
floating-point (FP) operations
• A separate stage is used to compute the flags
k) Branch check
(stage 19)
l) Branch check results
(stage 20)
• Checks flags and compares results with predictions
• If the branch prediction was wrong:
- all incorrect micro-ops must be flushed (don’t want to be wrong!)
- the correct branch destination is provided to the Branch Predictor
- the pipeline is restarted from the new target address

13_Superscalar.ppt
