Understanding pipelining:
Pipelining is running multiple stages of the same process in parallel in a way that efficiently uses
all the available hardware while respecting the dependencies of each stage upon the previous
stages. In the laundry example, the stages are washing, drying, and folding. By starting a wash
stage as soon as the previous wash stage is moved to the dryer, the idle time of the washer is
minimized. Notice that the wash stage takes less time than the dry stage, so the washer must
sit idle until the dryer finishes and can accept the next load: the steady state throughput of the pipeline is limited by
the slowest stage in the pipeline. This can be mitigated by breaking up the bottleneck stage into
smaller sub-stages. For those less concerned with laundry-based examples, consider a video
game. The CPU computes the keyboard/mouse input each frame and moves the camera
accordingly, then the GPU takes that information and actually renders the scene; meanwhile, the
CPU has already begun calculating what's going to happen in the next frame.
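To make the arithmetic concrete, here is a minimal Python sketch comparing the serial and
pipelined laundry schedules. The stage durations are invented for illustration, and it uses the
usual textbook assumption that each stage holds one load at a time.

# Stage durations in minutes (made-up numbers for illustration).
STAGES = {"wash": 30, "dry": 60, "fold": 15}

def serial_time(n_loads):
    # Each load runs start to finish before the next one begins.
    return n_loads * sum(STAGES.values())

def pipelined_time(n_loads):
    # The first load pays the full latency; after that, a load finishes
    # every max(stage) minutes -- the dryer is the bottleneck stage.
    return sum(STAGES.values()) + (n_loads - 1) * max(STAGES.values())

for n in (1, 4, 10):
    print(n, "loads:", serial_time(n), "min serial,", pipelined_time(n), "min pipelined")

For 10 loads this gives 1050 minutes serial versus 645 pipelined, and the gap grows with the
number of loads: in the limit, each extra load costs only the 60-minute dryer stage.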
How pipelining is done:
In class, we mentioned that interpreting each computer instruction is a four-step process: fetching
the instruction, decoding it and reading the registers, executing it, and recording the results. Each
instruction still takes 4 cycles to complete, but if we can sustain a throughput of one instruction
per cycle, then on average we finish $n$ instructions every $n$ cycles. To accomplish
this, we can split up an instruction's work into the 4 different steps so that other pieces of
hardware work to decode, execute, and record results while the CPU performs the fetch. The
latency to process each instruction is fixed at 4 cycles, so by processing a new instruction every
cycle, after four cycles, one instruction has been completed and three are "in progress" (they're
in the pipeline). After many cycles the steady state throughput approaches one completed
instruction every cycle.
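A quick way to see the pipeline fill is to print which stage each instruction occupies in each
cycle. This sketch assumes an ideal four-stage pipeline with no stalls; F, D, E, and W abbreviate
fetch, decode, execute, and write results.

# Print a cycle-by-cycle occupancy diagram for an ideal 4-stage pipeline.
STAGES = ["F", "D", "E", "W"]

def pipeline_diagram(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1
    for i in range(n_instructions):
        # Instruction i enters the pipeline in cycle i + 1.
        row = [". "] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name + " "
        print(f"i{i + 1}: " + "".join(row))

pipeline_diagram(5)

The diagonal output shows that from the fourth cycle onward one instruction completes per
cycle, which is the steady state described above.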
An assembly line in an auto manufacturing plant is another good example of a pipelined process.
There are many steps in the assembly of the car, each of which is assigned a stage in the pipeline.
Typically the depth of these pipelines is very large: cars are pretty complex, so there need to be a
lot of stages in the assembly line. The more stages, the longer it takes to crank the system up to a
steady state. The greater the depth, the more costly it is to recover from a mistake: a branch
misprediction in an instruction pipeline would be like getting one of the steps wrong in the
assembly line: all the cars affected would have to go back to the beginning of the assembly line
and be processed again.
OnLive example [real-time]:
OnLive is a company that allows gamers to play video games in the cloud. The games are run on
one of the company's server farms, and video of the game is sent back to your computer. The
idea is that even the lamest of computers can run the most computationally intensive games because all the
computer does is send your joystick input over the internet and display the frames it gets back.
Of course, no one wants to play a game with a noticeably low framerate. We're going to
demonstrate how OnLive could deliver a reasonable experience. For our purposes, we'll assume
that OnLive uses a four step process: the user's computer sends over the input to the server
(10ms), the server tells the game about the user's input and then compresses the resulting game
frame (15ms), the compressed video is sent back to the user (60ms) where it is then
decompressed and displayed (15ms). Note that OnLive doesn't share its data, so these numbers
are contrived.
The latency of this process is 100ms (10+15+60+15). This means that there will always be a
tenth of a second lag from when you perform an action to when you see it affect things on the
screen.
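A short sketch of the arithmetic, using the contrived stage times above. The frames-per-second
figure follows from the earlier rule that steady-state throughput is set by the slowest stage, and
it assumes multiple frames really can overlap on the 60ms network hop; the next paragraph
discusses when that assumption breaks down.

# Hypothetical OnLive stage times in ms (the text notes these are contrived).
stages = {"send input": 10, "simulate and compress": 15,
          "send video": 60, "decompress and display": 15}

latency_ms = sum(stages.values())       # 100 ms input-to-screen lag
bottleneck_ms = max(stages.values())    # 60 ms: shipping the frame back
steady_fps = 1000 / bottleneck_ms       # about 16.7 frames per second

print(latency_ms, bottleneck_ms, round(steady_fps, 1))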
Communication between different parts of a machine is not particularly easy to manage, since it
often occurs in bursts: a huge demand on the communication framework followed by a period of
very little activity. Communication can be sped up by pipelining, however. We do not
necessarily have to wait for one message to be delivered before we send the next, so messages
can overlap in flight. Often, though, the rate at which we can send messages is much faster than
the rate at which data moves through the slowest part of the system, so pipelining only helps to
an extent: in the long run, our communication is limited by the slowest part of the system.
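Here is a rough model of that trade-off, with invented timings: stop-and-wait pays the full
transit time for every message, while pipelined delivery pays it once and then settles to the
rate of the slowest component.

# Illustrative numbers: injecting a message takes `send` ms, the slowest
# component handles one message every `bottleneck` ms, and end-to-end
# transit takes `transit` ms.
send, bottleneck, transit = 1, 5, 20

def stop_and_wait(n):
    # Wait for each message to be delivered before sending the next.
    return n * transit

def pipelined(n):
    # Messages overlap in flight: the first pays full transit, then one
    # arrives every max(send, bottleneck) ms -- the slowest part wins.
    return transit + (n - 1) * max(send, bottleneck)

print(stop_and_wait(100), pipelined(100))  # 2000 vs. 515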
Data hazards:
Data hazards occur when instructions that exhibit data dependence modify data in different
stages of a pipeline. Ignoring potential data hazards can result in race conditions (also termed
race hazards). There are three situations in which a data hazard can occur:
read after write (RAW), a true dependency
write after read (WAR), an anti-dependency
write after write (WAW), an output dependency
Consider two instructions i1 and i2, with i1 occurring before i2 in program order.
Read after write (RAW):
(i2 tries to read a source before i1 writes to it) A read after write (RAW) data hazard refers to a
situation where an instruction refers to a result that has not yet been calculated or retrieved. This
can occur because even though an instruction is executed after a prior instruction, the prior
instruction has been processed only partly through the pipeline.
For example:
i1. R2 <- R1 + R3
i2. R4 <- R2 + R3
The first instruction is calculating a value to be saved in register R2, and the second is going to
use this value to compute a result for register R4. However, in a pipeline, when the operands are
fetched for the second instruction, the result of the first will not yet have been written back, and
hence a data hazard occurs.
A data dependency occurs with instruction i2, as it is dependent on the completion of instruction
i1.
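The timing can be checked directly. Using the five-stage pipeline from the forwarding section
below (IF ID EX MEM WB), and assuming registers are read in ID and written in WB, i2's read
of R2 lands two cycles before i1's write of it.

# Cycle in which a given stage runs, for an instruction issued at `start`
# in a classic 5-stage pipeline, one stage per cycle, no stalls.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_cycle(start, stage):
    return start + STAGES.index(stage)

i1_writes_R2 = stage_cycle(start=1, stage="WB")  # cycle 5
i2_reads_R2 = stage_cycle(start=2, stage="ID")   # cycle 3

# The read happens before the write: without a stall or forwarding,
# i2 would compute with a stale R2.
print(i2_reads_R2 < i1_writes_R2)  # True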
Write after write (WAW):
(i2 tries to write an operand before it is written by i1) A write after write (WAW) data hazard
may occur in a concurrent execution environment.
For example:
i1. R2 <- R4 + R7
i2. R2 <- R1 + R3
The write back (WB) of i2 must be delayed until after i1 has written R2; otherwise i1's stale
result would be the one later instructions see.
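A toy model makes the danger visible: replay the two writes to R2 in each order against a
dictionary standing in for the register file. The register values are invented for illustration.

# i1. R2 <- R4 + R7 and i2. R2 <- R1 + R3, with made-up starting values.
regs = {"R1": 1, "R3": 2, "R4": 10, "R7": 20}
i1_result = regs["R4"] + regs["R7"]  # 30
i2_result = regs["R1"] + regs["R3"]  # 3

# Correct program order: i1 writes first, so i2's value survives.
regs["R2"] = i1_result
regs["R2"] = i2_result
print(regs["R2"])  # 3, the value later instructions should see

# WAW hazard: i1's delayed write back lands after i2's.
regs["R2"] = i2_result
regs["R2"] = i1_result
print(regs["R2"])  # 30 -- i1's stale result wrongly survives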
Structural hazards:
A structural hazard occurs when a part of the processor's hardware is needed by two or more
instructions at the same time. A canonical example is a single memory unit that is accessed both
in the fetch stage where an instruction is retrieved from memory, and the memory stage where
data is written and/or read from memory.[3] They can often be resolved by separating the
component into orthogonal units (such as separate caches) or bubbling the pipeline.
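As a sketch of the fetch-versus-memory conflict, the following counts the cycles in which an
instruction in IF and an earlier instruction in MEM would both need a single shared memory
port. It pessimistically assumes every instruction touches memory in its MEM stage; each
conflict would cost a bubble unless the memory is split into separate instruction and data caches.

# Count single-port memory conflicts in an ideal 5-stage pipeline.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def port_conflicts(n_instructions):
    conflicts = 0
    for i in range(n_instructions):
        fetch_cycle = i + 1  # instruction i is fetched in this cycle
        for j in range(i):   # every earlier instruction
            mem_cycle = j + 1 + STAGES.index("MEM")
            if mem_cycle == fetch_cycle:  # both stages want the one port
                conflicts += 1
    return conflicts

# From the 4th instruction on, every fetch collides with a MEM access.
print(port_conflicts(10))  # 7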
Control hazards (branch hazards):
Further information: Branch (computer science)
Branching hazards (also termed control hazards) occur with branches. On many instruction
pipeline microarchitectures, the processor will not know the outcome of the branch when it needs
to insert a new instruction into the pipeline.
Forwarding:
The data hazard introduced by the following sequence of instructions can be solved with a
simple hardware technique called forwarding.
              Cycle:  1    2    3    4    5    6    7
ADD R1,R2,R3          IF   ID   EX   MEM  WB
SUB R4,R5,R1               IF   ID   EX   MEM  WB
AND R6,R1,R7                    IF   ID   EX   MEM  WB
The key insight in forwarding is that the result is not really needed by SUB until after the ADD
actually produces it. The only problem is to make it available for SUB when it needs it.
If the result can be moved from where the ADD produces it (EX/MEM register), to where the
SUB needs it (ALU input latch), then the need for a stall can be avoided.
Using this observation, forwarding works as follows:
The ALU result from the EX/MEM register is always fed back to the ALU input latches.
If the forwarding hardware detects that the previous ALU operation has written the register
corresponding to the source for the current ALU operation, control logic selects the forwarded
result as the ALU input rather than the value read from the register file.
Forwarding of results to the ALU requires the addition of three extra inputs on each ALU
multiplexer and the addition of three paths to the new inputs.
The paths correspond to a forwarding of:
(a) the ALU output at the end of EX,
(b) the ALU output at the end of MEM, and
(c) the memory output at the end of MEM.
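To tie the pieces together, here is a simplified Python sketch of the selection logic for one ALU
input. It is not hardware, the field names are illustrative, and it collapses the three paths into the
two pipeline registers they read from (EX/MEM and MEM/WB), checking the newest result first.

# Pick the value for one ALU input, preferring forwarded results.
def alu_input(src, regfile, ex_mem, mem_wb):
    # Path (a): ALU output of the instruction just ahead, in EX/MEM.
    if ex_mem and ex_mem["dest"] == src:
        return ex_mem["value"]
    # Paths (b) and (c): ALU or memory output two ahead, in MEM/WB.
    if mem_wb and mem_wb["dest"] == src:
        return mem_wb["value"]
    # No hazard: use the value read from the register file.
    return regfile[src]

regfile = {"R1": 0, "R2": 4, "R3": 6}
ex_mem = {"dest": "R1", "value": 10}  # ADD R1,R2,R3 just produced 10
print(alu_input("R1", regfile, ex_mem, None))  # 10, not the stale 0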