Pipeline
Pipelining means using all units of a system at the same time.
Each unit waits for the output of the preceding unit if that unit slows down.
Pipelining increases the throughput of the system. The units are the same as in the sequential case; the gain comes only from
their simultaneous use.
The word pipeline is borrowed from assembly-line operation in industry.
The same concept is applied in a processor.
Say a processor consists of two units: IF (instruction fetch) and EX (execution).
Sequential working
clk       1     2     3     4
IF unit   IF1   idle  IF2   idle
EX unit   idle  EX1   idle  EX2
Only one unit is used at a time. Two instructions are executed in 4 clocks.
Pipeline working
inst/clk  1     2     3     4     5
1         IF1   EX1
2               IF2   EX2
3                     IF3   EX3
4                           IF4   EX4
Four instructions are executed in 5 clocks. All stages are used simultaneously.
If there are n stages and each stage takes time t, then N instructions take:
  sequential processor: N*n*t
  pipeline: (n + N - 1)*t
(The first instruction takes n*t and the remaining N - 1 take (N - 1)*t, because after the first one each clock gives an output.)
Performance enhancement = sequential / pipeline = N*n / (n + N - 1)
For N >> n, the maximum performance increase is n.
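The timing formulas above can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and do not come from the notes.

```python
def sequential_time(N, n, t=1):
    """Time for N instructions on a sequential machine with n stages of t each."""
    return N * n * t


def pipeline_time(N, n, t=1):
    """First instruction takes n*t; each of the remaining N - 1 adds one more t."""
    return (n + N - 1) * t


def speedup(N, n):
    """Ratio sequential / pipeline = N*n / (n + N - 1); t cancels out."""
    return sequential_time(N, n) / pipeline_time(N, n)


if __name__ == "__main__":
    # Two-stage example from the text: 4 instructions through IF and EX.
    print(sequential_time(4, 2), pipeline_time(4, 2))
    # The speedup approaches n as N grows.
    for N in (4, 100, 10_000):
        print(N, round(speedup(N, 2), 3))
```

For the two-stage example, 4 instructions need 8 clocks sequentially and 5 clocks in the pipeline, which agrees with the tables above.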
If the stages do not all have the same clock period, the faster stages wait for the slower ones, and this affects performance.
The clock period of the longest stage is used for performance calculation.
In a pipeline, inter-stage buffers store the output of each stage for the next stage.
The cycle time of a pipeline is calculated as
t(p) = t(s)/n + buffer latency, where t(s) is the time needed to execute one instruction on a sequential machine and n is the number
of stages.
Total pipeline depth in terms of time = t(p)*n + (n - 1)*buffer latency
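As a rough illustration of the cycle-time formula, the sketch below plugs in made-up numbers (40 ns sequential time, 4 stages, 1 ns buffer latency). The values and function names are assumptions chosen for the example only.

```python
def cycle_time(t_s, n, buffer_latency):
    """t(p) = t(s)/n + buffer latency, as stated in the text."""
    return t_s / n + buffer_latency


def pipelined_total(N, n, t_s, buffer_latency):
    """Time for N instructions: (n + N - 1) pipeline cycles of length t(p)."""
    return (n + N - 1) * cycle_time(t_s, n, buffer_latency)


if __name__ == "__main__":
    t_s, n, latency = 40.0, 4, 1.0        # assumed values, in ns
    print("cycle time:", cycle_time(t_s, n, latency))
    print("100 instructions, sequential:", 100 * t_s)
    print("100 instructions, pipelined :", pipelined_total(100, n, t_s, latency))
```

The buffer latency is paid on every cycle, so the real gain is a little below the ideal factor n.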
Consider a four-stage pipeline: IF (instruction fetch), ID (instruction decode and read operands),
EX (execute instruction) and W (write results to memory).
Hazards
Any condition that causes the pipeline to wait for operands, instructions or other resources is called
a pipeline hazard. The resulting wait is called a stall or bubble. It degrades the performance of the pipeline.
There are three types of hazards: data hazards, control hazards and structural hazards.
Structural hazard: occurs when two instructions require access to the same hardware resource at the same time.
Example: access to a common memory by two instructions, where the first writes and the second tries to fetch an
instruction.
clk   1     2     3     4     5     6     7     8
I1    IF1   ID    EX    W
I2          IF2   ID    EX    W
I3                IF3   ID    EX    W
I4                      stall IF4   ID    EX    W
At clock 4, instruction I1 accesses memory for a write, and in the same clock instruction I4 accesses the same
memory for an instruction fetch. The pipeline stalls, so I4 starts in clock 5.
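The clash between a write and a fetch can be found mechanically. The sketch below is only an illustration: it assumes an unstalled pipeline in which instruction i is fetched in clock i, and that every instruction uses the shared memory in its last stage (in the example above only I1 is said to write). The function name is hypothetical.

```python
def memory_conflicts(num_instructions, n_stages=4):
    """List (clock, writer, fetcher) where a W stage and an IF stage coincide.

    Assumes an ideal pipeline: instruction i is fetched in clock i and
    reaches its last stage (W) in clock i + n_stages - 1.
    """
    conflicts = []
    for writer in range(1, num_instructions + 1):
        w_clock = writer + n_stages - 1
        fetcher = w_clock                  # instruction fetched in that same clock
        if fetcher <= num_instructions:
            conflicts.append((w_clock, writer, fetcher))
    return conflicts


if __name__ == "__main__":
    for clock, writer, fetcher in memory_conflicts(4):
        print(f"clock {clock}: I{writer} writes while I{fetcher} is fetched")
```

For four instructions the only clash is in clock 4, between I1 and I4, as in the example.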
Data hazards: occur when the data (operands) required by the next instruction have not yet been produced by the
previous instruction.
If the next instruction does not wait for the operand's latest value, the program behaves like a concurrent
program.
Example: a=10, b=15, c=0
  c=a+b   (1)
  d=b+c   (2)
Output
  sequential program: c=25, d=40
  concurrent program: c=25, d=15
In a sequential program the next instruction waits for its operand, which results in a bubble in the pipeline.
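The two outputs can be reproduced with a small sketch. The second function imitates an instruction that reads c before statement (1) has written it; both function names are made up for this illustration.

```python
def run_in_order(a, b, c):
    """Statement (2) waits for statement (1) and sees the new c."""
    c = a + b          # (1)
    d = b + c          # (2)
    return c, d


def run_without_waiting(a, b, c):
    """Statement (2) reads its operand before (1) has written it."""
    old_c = c          # stale value picked up by (2)
    c = a + b          # (1)
    d = b + old_c      # (2) computed from the stale c
    return c, d


if __name__ == "__main__":
    print(run_in_order(10, 15, 0))
    print(run_without_waiting(10, 15, 0))
```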
Example
ADD R2,R3,R4 (R4 = R2 + R3)
SUB R5,R4,R6 (R6 = R5 - R4)
clk   1     2     3     4     5     6     7     8     9     10
Add   IF1   ID    EX    W
Sub         IF2   ID    ID    ID    EX    W
I3                IF3   IF    IF    ID    EX    W
I4                      stall stall IF4   ID    EX    W
I5                                        IF5   ID    EX    W
(A repeated ID or IF entry means the instruction is stalled in that stage.)
R4 is available after clock 4 (after the W stage of ADD).
In clock 3 the value of R4 is not yet available, so the pipeline stalls through clock 4 and SUB is decoded in clock 5.
Solution
Software solution: either the compiler provides it or you need to modify the code.
In the above case two consecutive instructions are operand dependent, so two NOPs are put between them.
clk   1     2     3     4     5     6     7     8     9
Add   IF1   ID    EX    W
Nop         IF2   ID    EX    W
Nop               IF3   ID    EX    W
Sub                     IF4   ID    EX    W
I5                            IF5   ID    EX    W
In the 5th clock R4 is available and is decoded correctly by the SUB instruction.
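The NOP padding a compiler would do can be sketched as follows. This is an illustrative routine, not a real compiler pass: it assumes the four-stage pipeline without a forward path, where a consumer must sit at least three slots after its producer. The tuple layout (text, registers read, register written) is a convention chosen for this example.

```python
NOP = ("NOP", (), None)


def insert_nops(program, min_distance=3):
    """Pad the program so each consumer is at least min_distance slots
    after the instruction that writes one of its source registers."""
    out = []
    for text, reads, writes in program:
        needed = 0
        # Look back over the already padded stream for close producers.
        for back in range(1, min_distance):
            if back > len(out):
                break
            prev_writes = out[-back][2]
            if prev_writes is not None and prev_writes in reads:
                needed = max(needed, min_distance - back)
        out.extend([NOP] * needed)
        out.append((text, reads, writes))
    return out


if __name__ == "__main__":
    program = [
        ("ADD R2,R3,R4", ("R2", "R3"), "R4"),
        ("SUB R5,R4,R6", ("R5", "R4"), "R6"),
    ]
    for text, _, _ in insert_nops(program):
        print(text)
```

For the ADD/SUB pair this yields ADD, NOP, NOP, SUB, the same sequence as in the table above.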
clk  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19
1    IF1 ID  EX  W
2        IF2 ID  EX  W
3            IF3 ID  ID  ID  EX  W
4                IF4 IF  IF  ID  ID  ID  EX  W
5                            IF5 IF  IF  ID  EX  W
6                                        IF6 ID  EX  W
7                                            IF7 ID  ID  ID  EX  W
8                                                IF8 IF  IF  ID  ID  ID  EX  W
Recoding the above example
     Equivalent code    Remarks on pipeline behaviour
1    LOAD b,R1
2    LOAD c,R2
3    LOAD e,R4
4    LOAD f,R5
5    ADD R1,R2,R3
6    SUB R4,R5,R6       R5 available in 8th clock
7    STA R3,a           ID available in 9th clock
8    STA R6,d           R6 available in 11th clock
Pipeline
clk  1   2   3   4   5   6   7   8   9   10  11  12  13
1    IF1 ID  EX  W
2        IF2 ID  EX  W
3            IF3 ID  EX  W
4                IF4 ID  EX  W
5                    IF5 ID  EX  W
6                        IF6 ID  ID  EX  W
7                            IF7 IF  ID  EX  W
8                                    IF8 ID  ID  EX  W
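The stall patterns in the tables above can be reproduced with a small scheduler. This is a sketch under stated assumptions, not a model of any real processor: four stages, in-order issue, no forward path, and an instruction may leave ID only in a clock after every producer of its source registers has finished W. The tuple layout and all names are chosen for this illustration.

```python
def schedule(program):
    """Return one dict per instruction with the clocks of its stages.

    program: list of (text, registers_read, register_written_or_None).
    """
    rows = []
    last_write = {}                      # register -> clock of the W that writes it
    for text, reads, writes in program:
        if rows:
            prev = rows[-1]
            if_start = prev["id_start"]  # IF is free once prev has entered ID
            id_start = max(if_start + 1, prev["id_end"] + 1)
        else:
            if_start, id_start = 1, 2
        # Earliest clock in which all source registers have been written.
        ready = max((last_write.get(r, 0) + 1 for r in reads), default=0)
        id_end = max(id_start, ready)
        row = {"text": text, "if_start": if_start, "id_start": id_start,
               "id_end": id_end, "ex": id_end + 1, "w": id_end + 2}
        if writes is not None:
            last_write[writes] = row["w"]
        rows.append(row)
    return rows


original = [
    ("LOAD b,R1", (), "R1"),
    ("LOAD c,R2", (), "R2"),
    ("ADD R1,R2,R3", ("R1", "R2"), "R3"),
    ("STA R3,a", ("R3",), None),
    ("LOAD e,R4", (), "R4"),
    ("LOAD f,R5", (), "R5"),
    ("SUB R4,R5,R6", ("R4", "R5"), "R6"),
    ("STA R6,d", ("R6",), None),
]
# Reordered as in the recoded listing: all loads first, then ADD, SUB, stores.
reordered = [original[i] for i in (0, 1, 4, 5, 2, 6, 3, 7)]

if __name__ == "__main__":
    for name, prog in (("original", original), ("reordered", reordered)):
        print(name, "finishes in clock", schedule(prog)[-1]["w"])
```

Under this model the original order finishes in clock 19 and the reordered code in clock 13, matching the two pipeline tables.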
Hardware solution: with more internal circuitry, intermediate results are fed back to earlier stages for
the following instructions. The next instruction therefore does not have to wait until its operands are available after the last stage.
This path is called the forward path.
[Diagram: forward path]
The value of the expression is made available to the EX stage through the forward path and a mux.
So the instruction does not stall in the ID stage, as the required value of the expression is available in the EX stage.
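The role of the mux can be illustrated in a few lines. The sketch below is a simplification with assumed register values: the result of ADD R2,R3,R4 is held in a latch at the EX output and has not yet been written to the register file when SUB R5,R4,R6 executes. The names `read_operand` and `ex_latch` are hypothetical.

```python
def read_operand(reg, regfile, ex_latch):
    """Forwarding mux: use the value in the EX output latch when it holds
    a newer result for reg, otherwise read the register file."""
    if ex_latch is not None and ex_latch[0] == reg:
        return ex_latch[1]
    return regfile[reg]


# Assumed initial register contents for the illustration.
regfile = {"R2": 10, "R3": 15, "R4": 0, "R5": 40, "R6": 0}

# ADD R2,R3,R4 has left EX; its result is in the latch, not yet written back.
ex_latch = ("R4", regfile["R2"] + regfile["R3"])

# SUB R5,R4,R6 executes in the next clock.
with_forwarding = (read_operand("R5", regfile, ex_latch)
                   - read_operand("R4", regfile, ex_latch))
without_forwarding = regfile["R5"] - regfile["R4"]   # uses the stale R4

if __name__ == "__main__":
    print("with forward path   :", with_forwarding)
    print("without forward path:", without_forwarding)
```

With the forward path SUB sees R4 = 25 and produces 15; reading the register file directly would use the stale R4 = 0.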
     Equivalent code    Remarks on pipeline behaviour
1    LOAD b,R1
2    LOAD c,R2
3    ADD R1,R2,R3       R1, R2 available in 5th clock in EX through forward path
4    STA R3,a           R3 available in 6th clock
5    LOAD e,R4
6    LOAD f,R5
7    SUB R4,R5,R6       R5 available in 9th clock
8    STA R6,d           R6 available in 10th clock
Pipeline
clk  1   2   3   4   5   6   7   8   9   10  11
1    IF1 ID  EX  W
2        IF2 ID  EX  W
3            IF3 ID  EX  W
4                IF4 ID  EX  W
5                    IF5 ID  EX  W
6                        IF6 ID  EX  W
7                            IF7 ID  EX  W
8                                IF8 ID  EX  W
Control hazard: occurs when the pipeline waits for the next instruction, which is not available for one of two reasons:
a cache miss, or a branch instruction, in which case the PC is modified.
On a cache miss the IF stage of the pipeline waits for the expected instruction.
For a branch instruction, the branch decision is taken in either the ID or the EX unit, and the modified PC is
available later. By that time the IF stage has fetched the next instructions in sequence, which are partially executed
when the branch decision is taken. The partially executed instructions are therefore flushed from the pipeline,
and the pipeline reads the instruction from the next (branched) location.
This degrades the performance of the pipeline.
Example
1 ADD R1,R2,R3
2 JMP xx
3 INC R2
4 ADD R5,R1,R3
......
xx MOV R1,R3
clk   1     2     3     4     5     6
1     IF1   ID    EX    W
2           IF2   ID    EX    W
3                 IF3   ID    EX
4                       IF4   ID
5                             IF5
xx                                  IFxx
The value of the new PC is assumed to be available in the 5th clock. In the 6th clock the IF stage fetches from location xx.
The sequential instructions 3 to xx-1 are not required, so they are flushed from the pipeline
(assuming their partial execution does not cause any side effect). This period is the delay slot. The output
of the branched instruction takes n clocks to appear.
This costs extra clocks and is known as the branch penalty.
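The cost of a taken branch can be put into numbers. The sketch below uses the example above (JMP fetched in clock 2, new PC known in clock 5). The second function is a common textbook approximation that simply adds the penalty to the ideal pipeline time; it is an assumption and is not stated in the notes.

```python
def branch_penalty(branch_fetch_clock, new_pc_clock):
    """Clocks lost per taken branch: the wrong-path fetches made after the
    branch itself is fetched, up to and including the clock in which the
    new PC becomes known."""
    return new_pc_clock - branch_fetch_clock


def total_clocks(N, n, taken_branches, penalty):
    """Ideal pipeline time (n + N - 1) plus the penalty for each taken branch."""
    return (n + N - 1) + taken_branches * penalty


if __name__ == "__main__":
    penalty = branch_penalty(branch_fetch_clock=2, new_pc_clock=5)
    print("penalty per taken branch:", penalty)
    # Assumed workload: 100 instructions, 4 stages, 10 taken branches.
    print("total clocks:", total_clocks(100, 4, 10, penalty))
```

In the example the three wrongly fetched instructions (3, 4 and 5) correspond to a penalty of 3 clocks.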
Branch instructions can be conditional or unconditional.
Conditional branch instruction: if the condition is true, the pipeline is flushed and the instruction at the branch address
enters the pipeline; if the condition fails, the pipeline continues unaffected.
Unconditional branch instruction: the pipeline is always flushed.
3 SHL R1
4 NOP
5 ADD R4,R3,R1
6 ......
(With a forward path, the changed PC and the flag based on R2 are available after the EX stage through the forward path.)
clk   1     2     3     4     5     6     7
1     IF1   ID    EX    W
2           IF2   ID    EX    W
3                 IF3   ID    EX    W
4                       IF4   ID    EX    W
5                             IFxx
In clock 5, the fetched instruction is either 1 (R2 not 0) or 5 (R2 = 0). As there is no other instruction to
reorder, a NOP is inserted between SHL and ADD.
(In the case of an unconditional branch instruction, the modified PC is calculated in the ID stage.)
To reduce the branch penalty, predict whether or not a particular branch will be taken.
Static branch prediction: no past history is used.
Dynamic branch prediction: the past history of branching is included in the prediction.
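The difference between the two kinds of prediction can be shown with two very simple predictors. Both are assumed examples, not schemes named in the notes: the static one always predicts "taken", and the dynamic one predicts whatever the branch did last time (a 1-bit history).

```python
def static_predict(history):
    """Static: ignores history. Assumed policy: always predict taken."""
    return True


def last_outcome_predict(history):
    """Dynamic (1-bit): predict the previous outcome of this branch."""
    return history[-1] if history else True


def accuracy(predict, outcomes):
    """Fraction of outcomes that the predictor guesses correctly."""
    history, correct = [], 0
    for actual in outcomes:
        if predict(history) == actual:
            correct += 1
        history.append(actual)
    return correct / len(outcomes)


if __name__ == "__main__":
    # Assumed behaviour: a branch that is not taken 9 times, then taken once.
    outcomes = [False] * 9 + [True]
    print("static :", accuracy(static_predict, outcomes))
    print("dynamic:", accuracy(last_outcome_predict, outcomes))
```

On this made-up pattern the history-based predictor is right far more often, because it adapts after the first wrong guess.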
Limitations
Some stages of a pipeline are inherently more complex and take longer than others; the slowest stage time
is taken as the time of each pipeline stage. Faster stages are therefore idle for part of every clock, so the overall
execution time of a single instruction gets longer. This also affects the completion rate of the different stages.
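The effect of unbalanced stages can be quantified with a short sketch. The stage times below are assumed values for illustration and the function names are hypothetical.

```python
def stage_idle_fractions(stage_times):
    """Fraction of each clock period a stage is idle when the clock is
    set by the slowest stage."""
    clock = max(stage_times)
    return [(clock - t) / clock for t in stage_times]


def instruction_latency(stage_times):
    """Time for one instruction to pass through all stages at that clock."""
    return len(stage_times) * max(stage_times)


if __name__ == "__main__":
    stage_times = [5, 5, 10, 5]           # assumed IF, ID, EX, W times in ns
    print("idle fractions:", stage_idle_fractions(stage_times))
    print("latency in pipeline :", instruction_latency(stage_times))
    print("sum of stage times  :", sum(stage_times))
```

With these numbers one instruction needs 40 ns in the pipeline although the stage work adds up to only 25 ns, because three of the four stages sit idle for half of every clock.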