Out-of-Order Execution
Out-of-order execution is an approach used in high-performance
microprocessors: an instruction begins execution as soon as its
operands are ready. Although instructions are issued in order, they
can proceed out of order with respect to each other. The processor
executes instructions in the order in which their data or operands become
available rather than in the original program order. This keeps the
processor from sitting idle while data is retrieved for the next
instruction in the program, which improves performance.
In other words, a processor with multiple execution units may complete
instructions in an order different from program order.
The first machine to use out-of-order execution was the CDC 6600 (1964), which
used a scoreboard to resolve conflicts. In 1967 IBM introduced
Tomasulo’s Algorithm, which supports full out-of-order execution.
In-Order Processing:
1. The processor retrieves program instructions from its memory.
2. If the input operands are available in registers, the instruction is sent
to the execution unit.
3. If an operand is unavailable during the current clock cycle (for example,
because it is still being fetched from memory), the processor stalls until
the operand becomes available.
4. The instruction is then executed by the appropriate execution unit.
5. After execution, the execution unit writes the result back to
the register file.
Out-of-Order Processing:
1. The processor retrieves program instructions from its memory.
2. Instructions are sent (dispatched) to an instruction queue, also called an
instruction buffer or reservation stations.
3. An instruction waits in the queue until its input operands are available. It
does not have to wait for its turn: as soon as its operands are available, it can
leave the queue for execution, even ahead of older instructions.
4. The instruction is sent to the appropriate execution unit and executed.
5. The results are queued.
6. Once all older instructions have written their results back to the register file,
the current result is written back as well. This is called the graduation or
retire stage.
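The difference between the two processing styles can be sketched with a toy timing model. This is only an illustration under simplifying assumptions that are not in the slides: one instruction starts per cycle, execution units are unlimited, and the `Instr` type, the `simulate` function and the three-instruction program are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int  # cycles from start of execution until the result is usable


def simulate(program, out_of_order):
    """Return {instruction name: cycle in which it starts executing}."""
    # Link each source register to the most recent older instruction writing it.
    last_writer = {}
    producers = []
    for i, ins in enumerate(program):
        producers.append([last_writer[r] for r in ins.srcs if r in last_writer])
        last_writer[ins.dest] = i

    done_at = {}   # instruction index -> cycle its result becomes usable
    start = {}
    waiting = list(range(len(program)))
    cycle = 0
    while waiting:
        # In-order: only the oldest waiting instruction may start.
        # Out-of-order: any waiting instruction may start, oldest first.
        candidates = waiting if out_of_order else waiting[:1]
        for i in candidates:
            if all(p in done_at and done_at[p] <= cycle for p in producers[i]):
                start[program[i].name] = cycle
                done_at[i] = cycle + program[i].latency
                waiting.remove(i)
                break
        cycle += 1
    return start


program = [
    Instr("LOAD", "R1", (), 4),           # slow memory access
    Instr("ADD", "R2", ("R1",), 1),       # depends on the load
    Instr("MUL", "R3", ("R4", "R5"), 1),  # independent of both
]
in_order = simulate(program, out_of_order=False)
ooo = simulate(program, out_of_order=True)
print(in_order)  # MUL waits behind the stalled ADD
print(ooo)       # MUL starts while ADD is still waiting for the load
```

In the in-order run the independent MUL is held up behind the stalled ADD; in the out-of-order run it starts as soon as its own operands are ready.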
Out-of-Order Execution Scheme
Steps:
1) Fetch and decode in order
a) Multiple instructions are fetched and decoded in parallel.
b) The instructions are stored in reservation stations.
2) If operands are ready, send instructions to functional units for execution.
If operands are not ready, listen to the bypass network and wait for them.
a) Operands must be ready.
b) Resources must be ready.
3) After execution
a) Broadcast the computed values on the bypass network.
b) Signal all dependent instructions that the data is ready.
c) Store the results in the reorder buffer (completion buffer).
d) Mark executed instructions in the reorder buffer as “Completed”.
4) Commit instructions in order
a) An instruction can commit only when all previous instructions have
committed.
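The in-order commit rule in step 4 can be sketched as a small queue. This is a minimal sketch, not a description of any real processor: the `ReorderBuffer` class, its method names and the instruction labels I1 to I3 are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Entries are kept in program order; the oldest instruction is at the left."""

    def __init__(self):
        self.entries = deque()

    def allocate(self, name):
        entry = {"name": name, "completed": False, "value": None}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        # Execution may finish in any order; only the entry is marked here.
        entry["value"] = value
        entry["completed"] = True

    def commit(self):
        """Retire completed entries from the head, stopping at the first incomplete one."""
        committed = []
        while self.entries and self.entries[0]["completed"]:
            committed.append(self.entries.popleft()["name"])
        return committed


rob = ReorderBuffer()
i1, i2, i3 = rob.allocate("I1"), rob.allocate("I2"), rob.allocate("I3")

rob.complete(i3, 30)   # youngest instruction finishes first
first = rob.commit()   # nothing can commit: I1 is still outstanding
rob.complete(i1, 10)
second = rob.commit()  # I1 commits; I2 blocks I3
rob.complete(i2, 20)
third = rob.commit()   # I2 and I3 commit together, in program order
print(first, second, third)
```

Even though I3 finishes first, it only becomes architecturally visible after I1 and I2 have committed.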
Out-of-Order Execution Pipeline
Stages:
1) Fetch
• Branch prediction
2) Decode
• Register renaming
3) Reservation Stations (RS)
a) Instructions wait for inputs.
b) Instructions wait for functional units.
4) Execute
• Functional units
5) Bypass Network
• Broadcasts all calculated values back to the reservation stations.
6) Reorder Buffer
• De-speculates execution, mostly by committing instructions in order.
Speculative Execution
During out-of-order execution, the processor may arrive at a branch whose
condition depends on values that preceding instructions have not yet
computed. The processor then has two choices: wait for the preceding
instructions to complete and incur a large performance penalty, or
speculate by guessing the outcome of the conditional branch. After the
guess is made, the current register state is stored as a checkpoint, and
subsequent instructions start to execute. Once the instructions on which
the branch condition depends have finished, the guess is validated. If the
guess was wrong, all pending speculative instructions are abandoned so
that they have no visible effect, the program state is reverted to the
checkpoint, and execution of the correct path begins.
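The checkpoint-and-rollback idea can be sketched in a few lines. This is an illustrative model only: the `run_branch` function, the register names and the representation of each path as a list of (destination, value) writes are assumptions made for the example.

```python
def run_branch(regs, predicted_taken, actual_taken, taken_path, fallthrough_path):
    """Return (final_regs, mispredicted). Paths are lists of (dest, value) writes."""

    def apply(path, state):
        for dest, value in path:
            state[dest] = value

    checkpoint = dict(regs)   # snapshot of the register state before speculating
    state = dict(regs)
    # Speculatively execute the predicted path.
    apply(taken_path if predicted_taken else fallthrough_path, state)

    mispredicted = predicted_taken != actual_taken
    if mispredicted:
        state = dict(checkpoint)   # squash: discard all speculative writes
        apply(taken_path if actual_taken else fallthrough_path, state)
    return state, mispredicted


regs = {"R1": 1}
taken = [("R2", 5)]
fallthrough = [("R3", 7)]

wrong_guess = run_branch(regs, True, False, taken, fallthrough)
right_guess = run_branch(regs, False, False, taken, fallthrough)
print(wrong_guess)
print(right_guess)
```

Both calls end in the same register state; the mispredicted one simply pays for a rollback first, and the speculative write to R2 never becomes visible.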
OOO Execution – P6 Processor
[Figures not recovered: P6 processor slides titled “In-Order Execution” and “Out-of-Order Execution”]
Dynamic Scheduling
Dynamic scheduling handles cases where dependences are unknown at
compile time. In a statically scheduled pipeline, if instruction j depends on
a long-running instruction i that is currently executing, then j and all
instructions after j must stall until i finishes and j can execute.
With dynamic scheduling, the hardware rearranges instruction execution to
reduce stalls while maintaining data flow and exception behavior.
Out-of-order execution introduces the possibility of WAR and WAW
hazards. Out-of-order completion also creates major complications in
handling exceptions.
Dynamic scheduling with out-of-order completion:
A major limitation of simple pipelining techniques is that they all use
in-order instruction issue and execution: instructions are issued in
program order, and if an instruction is stalled in the pipeline, no later
instructions can proceed. In a dynamically scheduled pipeline, all
instructions pass through the issue stage in order (in-order issue);
however, they can be stalled or bypass each other in the second stage
(read operands) and thus enter execution out of order.
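The three hazard types that matter once instructions can be reordered (RAW, WAR, WAW) can be told apart by comparing the registers two instructions read and write. The helper below is a hypothetical sketch; the `hazards` function and the (destination, sources) tuple encoding are assumptions for illustration.

```python
def hazards(first, second):
    """Classify hazards between two instructions; `first` is earlier in program order.

    Each instruction is a (dest, sources) tuple of register names.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 in srcs2:
        found.add("RAW")   # second reads what first writes (true dependence)
    if dest2 in srcs1:
        found.add("WAR")   # second overwrites what first still has to read
    if dest1 == dest2:
        found.add("WAW")   # both write the same register
    return found


multd = ("F4", ("F2", "F2"))   # MULTD F4,F2,F2
addd = ("F2", ("F0", "F6"))    # ADDD  F2,F0,F6
subd = ("F4", ("F2", "F0"))    # hypothetical SUBD F4,F2,F0

print(hazards(multd, addd))
print(hazards(addd, subd))
print(hazards(multd, subd))
```

Only RAW is a true data dependence; WAR and WAW are name dependences that arise from reusing register names.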
OOO Execution Techniques
Out-of-order execution can be achieved through two techniques:
1) Scoreboarding: a technique that allows instructions
to execute out of order when there are sufficient resources and no data
dependences; it is named after the CDC 6600 scoreboard.
2) Tomasulo’s Algorithm
Tomasulo’s Algorithm
Tomasulo's algorithm is a computer architecture hardware algorithm for
dynamic scheduling of instructions that allows out-of-order execution and
enables more efficient use of multiple execution units. It was developed
by Robert Tomasulo at IBM in 1967 and was first implemented in the
floating-point unit of the IBM System/360 Model 91.
Robert Tomasulo received the Eckert–Mauchly Award in 1997 for his work
on the algorithm.
Tomasulo’s Algorithm : Register Renaming
Tomasulo’s algorithm uses Register Renaming to eliminate output and
anti-dependencies, i.e. WAW and WAR hazards. Tomasulo's algorithm
implements register renaming through the use of what are called
Reservation Stations. Reservation stations are buffers which fetch and
store instruction operands as soon as they are available.
Example:
MULTD F4,F2,F2
ADDD F2,F0,F6
This code contains an anti-dependence, since the first instruction reads from F2
and the second instruction writes to F2 (a WAR hazard). However, there is no
true data dependence, as the code below shows (assuming F8 is unused):
MULTD F4,F2,F2
ADDD F8,F0,F6
The anti-dependence is removed without changing the semantics of the code
simply by changing the destination F2 to, say, F8. This is what register renaming
is: it allows the hardware to detect a name dependence and eliminate it by storing
the result of an instruction somewhere else.
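The renaming step can be sketched with an explicit mapping table. Note the assumption: this sketch renames onto a pool of physical registers named P0, P1, and so on, which is a common textbook formulation rather than the reservation-station tags that Tomasulo's algorithm itself uses. The `rename` function and the physical register names are hypothetical.

```python
def rename(program, arch_regs, num_physical):
    """Give every destination a fresh physical register.

    program: list of (op, dest, src1, src2) using architectural register names.
    Returns the same instructions expressed in physical register names.
    """
    mapping = {reg: f"P{i}" for i, reg in enumerate(arch_regs)}
    next_free = len(arch_regs)
    renamed = []
    for op, dest, src1, src2 in program:
        # Read the current mappings of the sources BEFORE remapping the destination.
        phys1, phys2 = mapping[src1], mapping[src2]
        if next_free >= num_physical:
            raise RuntimeError("no free physical register (a real design would stall)")
        mapping[dest] = f"P{next_free}"
        next_free += 1
        renamed.append((op, mapping[dest], phys1, phys2))
    return renamed


program = [
    ("MULTD", "F4", "F2", "F2"),
    ("ADDD", "F2", "F0", "F6"),   # WAR on F2 with the MULTD above
]
renamed = rename(program, ["F0", "F2", "F4", "F6"], num_physical=8)
for instr in renamed:
    print(instr)
```

After renaming, MULTD reads P1 while ADDD writes P5, so the two instructions no longer share a register name and can execute in either order.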
Tomasulo’s Algorithm - III
Tomasulo-based MIPS Processor:
[Figure not recovered: Tomasulo-based MIPS processor]
Tomasulo’s Algorithm - IV
Tomasulo-based MIPS Processor:
1) Instructions are sent from the instruction unit into the instruction queue,
from which they are issued in FIFO order.
2) The reservation stations hold the operation and the actual operands, as
well as information used for detecting and resolving hazards.
3) Load buffers have three functions: hold the components of the effective
address until it is computed, track outstanding loads that are waiting on
memory, and hold the results of completed loads that are waiting for the CDB.
4) Store buffers have three functions: hold the components of the effective
address until it is computed, hold the destination memory addresses of outstanding
stores that are waiting for the data value to store, and hold the address and
value to store until the memory unit is available.
5) All results from either the FP units or the load unit are put on the CDB, which
goes to the FP register file as well as to the reservation stations and store
buffers.
6) The FP adders implement addition and subtraction, and the FP multipliers
do multiplication and division.
Tomasulo’s Algorithm - V
Algorithm Steps:
1) Issue (I): Get the next instruction from the head of the instruction
queue. If there is a matching reservation station that is empty, issue the
instruction to the station with the operand values that are currently in the
registers; for operands that are not yet available, record which unit will
produce them. If no matching station is empty, the instruction stalls. This
step renames registers.
2) Execute (EX): If an operand is not yet available, monitor the common
data bus (CDB) while waiting for it to be computed, and place it into the
reservation station when it arrives. When all the operands are available,
execute the operation at the corresponding functional unit.
3) Write result (WB): When the result is available, write it on the CDB
and from there into the registers and into any reservation stations
(including store buffers) waiting for this result. Stores also write data to
memory during this step: when both the address and data value are
available, they are sent to the memory unit and the store completes.
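The three steps can be sketched as a very small model of reservation stations and the CDB. It is a simplification under stated assumptions: there are no load or store buffers, no cycle timing, and the caller chooses which ready station finishes next (standing in for different functional-unit latencies). The `Station` and `TomasuloSketch` classes and their method names are hypothetical.

```python
import operator

OPS = {"ADDD": operator.add, "MULTD": operator.mul}


class Station:
    """One reservation station: operand values (vj, vk) or producer tags (qj, qk)."""

    def __init__(self, tag):
        self.tag = tag
        self.busy = False
        self.op = None
        self.vj = self.vk = None   # operand values, once known
        self.qj = self.qk = None   # tags of stations still producing operands

    def ready(self):
        return self.busy and self.qj is None and self.qk is None


class TomasuloSketch:
    def __init__(self, regs, num_stations=3):
        self.regs = dict(regs)
        self.reg_status = {}       # register -> tag of the station that will write it
        self.stations = [Station(f"RS{i}") for i in range(num_stations)]

    def _read(self, reg):
        """Return (value, None) if the register is up to date, else (None, tag)."""
        tag = self.reg_status.get(reg)
        return (self.regs[reg], None) if tag is None else (None, tag)

    def issue(self, op, dest, src1, src2):
        station = next((s for s in self.stations if not s.busy), None)
        if station is None:
            return None            # structural hazard: no free station, stall
        station.busy, station.op = True, op
        station.vj, station.qj = self._read(src1)
        station.vk, station.qk = self._read(src2)
        self.reg_status[dest] = station.tag   # renaming: dest now names this station
        return station.tag

    def finish(self, tag):
        """Execute the named station if it is ready, then broadcast on the CDB."""
        station = next(s for s in self.stations if s.tag == tag)
        if not station.ready():
            return None            # still waiting for an operand
        result = OPS[station.op](station.vj, station.vk)
        for s in self.stations:    # CDB broadcast to waiting stations
            if s.qj == tag:
                s.vj, s.qj = result, None
            if s.qk == tag:
                s.vk, s.qk = result, None
        for reg in [r for r, t in self.reg_status.items() if t == tag]:
            self.regs[reg] = result
            del self.reg_status[reg]
        station.busy = False       # free the station
        return result


cpu = TomasuloSketch({"F0": 1.0, "F2": 2.0, "F4": 0.0, "F6": 4.0})
mul = cpu.issue("MULTD", "F4", "F2", "F2")   # captures F2 = 2.0 at issue
add = cpu.issue("ADDD", "F2", "F0", "F6")    # will overwrite F2 (WAR with MULTD)
dep = cpu.issue("ADDD", "F6", "F4", "F2")    # waits on both earlier results

blocked = cpu.finish(dep)   # not ready yet
cpu.finish(add)             # completes before the older MULTD
cpu.finish(mul)             # still uses the old F2 value captured at issue
cpu.finish(dep)
print(cpu.regs)
```

Because MULTD copied the value of F2 into its station at issue, the younger ADDD can write F2 first without corrupting the multiply, which is the WAR hazard being removed by renaming.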
Tomasulo’s Algorithm - VI
Tomasulo’s Pipeline
New pipeline structure: IF, DS, IS, EX, WB
• DS (dispatch)
Check availability of an RS, allocate the RS, copy ready values and the tags of
non-ready operands to the RS (stall if no RS is available).
• IS (issue)
Check whether the operands are ready, then execute (if not ready, wait and
monitor the CDB).
• WB (writeback)
Check whether the CDB is available, broadcast the result, write the register,
free the RS (wait if the CDB is unavailable).

Out of Order Execution (operating System).ppt
