Pipeline

Pipelining keeps all units of a system busy at the same time: while one unit works on one item, the preceding unit already works on the next item.
Each unit waits for the output of the preceding unit, so a slow unit slows down the units after it.
Pipelining increases the throughput of the system. The units are the same as in the sequential case; the gain comes only from
their simultaneous use.
The word pipeline is borrowed from assembly-line operation in industry.
The same concept is applied in a processor.
Say a processor consists of two units: IF (instruction fetch) and EX (execution).
Sequential working
inst/clk   1     2     3     4
1.   IF    IF    idle
     EX    idle  EX
2.   IF                IF    idle
     EX                idle  EX
Only one unit is used at a time. Two instructions are executed in 4 clocks.
Pipeline working
inst/clk   1     2     3     4     5
1.         IF1   EX1
2.               IF2   EX2
3.                     IF3   EX3
4.                           IF4   EX4
Four instructions are executed in 5 clocks. All stages are used simultaneously.
If there are n stages and each stage takes time t,
N instructions will take
sequential processor:   N*n*t
pipelined processor:    (n+N-1)*t
(the first instruction takes n*t and the remaining N-1 take (N-1)*t, because after that each clock gives one output)
performance enhancement = sequential / pipeline = N*n/(n+N-1)
For N >> n, the maximum performance increase approaches n.
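The timing formulas above can be checked numerically. The following is a minimal sketch; the function names are chosen here for illustration and the stage time is taken as t = 1 clock.

```python
def sequential_time(N, n, t):
    """Time for N instructions when the n stages are used one after another."""
    return N * n * t


def pipeline_time(N, n, t):
    """First instruction takes n*t; each of the remaining N-1 adds one t."""
    return (n + N - 1) * t


def speedup(N, n):
    """Ratio sequential / pipeline; the stage time t cancels out."""
    return (N * n) / (n + N - 1)


# Two-stage IF/EX example from the text: 4 instructions, t = 1 clock.
print(sequential_time(4, 2, 1))  # 8
print(pipeline_time(4, 2, 1))    # 5
# For a very long instruction stream the ratio approaches n.
print(speedup(1_000_000, 4))
```

With N much larger than n the printed ratio is just under 4, which is the stage count.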
If the stages do not all take the same time, the faster stages wait for the slowest one, and this affects performance.
The time of the longest stage is taken as the clock period for performance calculation.
In a pipeline, inter-stage buffers are used to store the output of the previous stage.
The cycle time of a pipeline is calculated as
t(p) = t(s)/n + buffer latency,    where t(s) is the time needed to execute one instruction on a sequential machine and n is the number
of stages.
Total pipeline depth in terms of time = t(p)*n + (n-1)*buffer latency
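As a small worked example of the cycle-time formulas, the sketch below plugs in assumed numbers (t(s) = 40, n = 4, buffer latency = 1, and a made-up list of unequal stage times). None of these values come from the text; they only illustrate the arithmetic.

```python
def cycle_time(t_s, n, buffer_latency):
    """t(p) = t(s)/n + buffer latency."""
    return t_s / n + buffer_latency


def pipeline_depth_time(t_p, n, buffer_latency):
    """Total depth in time, as given in the text: t(p)*n + (n-1)*buffer latency."""
    return t_p * n + (n - 1) * buffer_latency


def clock_period(stage_times):
    """With unequal stages, the longest stage sets the clock period."""
    return max(stage_times)


t_p = cycle_time(40.0, 4, 1.0)            # assumed example values
print(t_p)                                # 11.0
print(pipeline_depth_time(t_p, 4, 1.0))   # 47.0
print(clock_period([8, 12, 10, 10]))      # 12
```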
Consider a four-stage pipeline: IF (instruction fetch), ID (instruction decode and read operands),
EX (execute instruction), W (write results).
Hazards
Any condition that causes the pipeline to wait for operands, instructions or other resources is called a
pipeline hazard. The resulting wait is called a stall or bubble. It degrades the performance of the pipeline.
There are three types of hazards: data hazards, control hazards and structural hazards.
Structural hazard: occurs when two instructions require access to the same h/w resource at the same time.

Example: access to a common memory by two instructions, where the first writes and the second tries to fetch an
instruction.
clk  1    2    3    4      5    6    7    8
I1   IF1  ID   EX   W
I2        IF2  ID   EX     W
I3             IF3  ID     EX   W
I4                  stall  IF4  ID   EX   W
At clock 4, instruction I1 accesses memory for its write and in the same clock instruction I4 needs the same
memory for its instruction fetch. The pipeline stalls, so I4 starts in clock 5.
Data hazard: occurs when data (operands) required by the next instruction have not yet been produced by the
previous instruction.
If the next instruction does not wait for the latest value of the operand, the programme behaves like a concurrent
programme.
Example: a=10, b=15, c=0
                 output
c=a+b  (1)       sequential program: c=25, d=40
d=b+c  (2)       concurrent program: c=25, d=15
In a sequential program the next instruction waits for the operand, which results in a bubble in the pipeline.
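The two outputs in the example can be reproduced directly. This is only an illustration of the stale-operand effect: the variable old_c is introduced here to stand for the value of c that statement (2) would read if it did not wait for statement (1).

```python
a, b, c = 10, 15, 0

# Sequential behaviour: statement (2) sees the c written by statement (1).
c_seq = a + b
d_seq = b + c_seq

# Hazard ignored: statement (2) reads c before (1) has written it back,
# so it still sees the old value c = 0.
old_c = c
c_haz = a + b
d_haz = b + old_c

print(c_seq, d_seq)  # 25 40
print(c_haz, d_haz)  # 25 15
```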
Example
ADD R2,R3,R4     (R4=R2+R3)
SUB R5,R4,R6     (R6=R5-R4)
clk  1    2    3      4      5      6    7    8    9    10
Add  IF1  ID   EX     W
Sub       IF2  ID(s)  ID(s)  ID     EX   W
I3             IF3    IF(s)  IF(s)  ID   EX   W
I4                    stall  stall  IF4  ID   EX   W
I5                                       IF5  ID   EX   W
(s) = the instruction is held (stalled) in that stage.
R4 is available only after clock 4 (after the W stage of ADD).
In clock 3 the value of R4 is not available, so the pipeline stalls until clock 4 and SUB is decoded in clock 5.
Solution
The software solution is either provided by the compiler or requires you to modify the code yourself.
In the above case two consecutive instructions are operand dependent, so two NOPs are put between them.

clk  1    2    3    4    5    6    7    8
Add  IF1  ID   EX   W
Nop       IF2  ID   EX   W
Nop            IF3  ID   EX   W
Sub                 IF4  ID   EX   W
I5                       IF5  ID   EX   W
In the 5th clock R4 is available and is read correctly by the SUB instruction in its ID stage.

The compiler can also reorder the program code without changing its logic, which gives better performance than
a stalled pipeline.
Types of data dependency between instructions
RAR (read after read): both instructions read the same operand. It does not cause a data hazard.
Example
ADD R3,R2,R1     (R1=R3+R2)
SUB R3,R5,R4     (R4=R3-R5)
RAW (read after write): the next instruction reads an operand which is yet to be written (modified) by the
previous one. It causes a data hazard.
ADD R3,R2,R1     (R1=R3+R2)
SUB R3,R1,R4     (R4=R3-R1)
The previous instruction has yet to write the modified R1, and the next instruction needs R1. The pipeline cannot decode/read
R1, which results in a pipeline stall.
WAR (write after read): the next instruction writes an operand which is read by the previous one. It does
not cause a data hazard in the normal case, as the pipeline does not permit it. But if there is a conditional data path
between different stages of the pipeline, it can result in a data hazard.
ADD R3,R2,R1     (R1=R3+R2)
SUB R5,R6,R2     (R2=R5-R6)
Suppose ADD refers to R2 in clock 6. By this time the SUB instruction has changed R2, so the previous instruction
refers to an operand changed by the next instruction.
clk  1    2    3    4    5    6    7
Add  IF   ID   EX   M    M    M    W
Sub       IF   ID   EX   W
WAW (write after write): the next instruction writes an operand which is yet to be written by the previous one.
It does not cause a data hazard in the normal case. But if there is a conditional data path between different
stages of the pipeline, it can result in a data hazard.
ADD R3,R2,R1     (R1=R3+R2)
SUB R5,R6,R1     (R1=R5-R6)
ADD writes the modified R1 in clock 7. By this time the SUB instruction has already written R1, so the previous instruction
overwrites the operand written by the next instruction.
clk  1    2    3    4    5    6    7
Add  IF   ID   EX   M    M    M    W
Sub       IF   ID   EX   W
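The four cases above can be told apart mechanically from the registers each instruction reads and writes. The sketch below is one possible way to do it; the tuple layout (sources, destination) and the function name classify are assumptions made for this illustration, following the notes' operand order of two sources then destination.

```python
def classify(first, second):
    """Return the dependency kinds between `first` and the later `second`.

    Each instruction is given as (sources, destination).
    """
    src1, dst1 = first
    src2, dst2 = second
    kinds = []
    if dst1 in src2:
        kinds.append("RAW")   # second reads what first writes
    if dst2 in src1:
        kinds.append("WAR")   # second writes what first reads
    if dst1 == dst2:
        kinds.append("WAW")   # both write the same register
    if set(src1) & set(src2):
        kinds.append("RAR")   # both read a common register
    return kinds


add = (("R3", "R2"), "R1")                   # ADD R3,R2,R1
print(classify(add, (("R3", "R5"), "R4")))   # ['RAR']
print(classify(add, (("R3", "R1"), "R4")))   # ['RAW', 'RAR']
print(classify(add, (("R5", "R6"), "R2")))   # ['WAR']
print(classify(add, (("R5", "R6"), "R1")))   # ['WAW']
```

The RAW example also reports RAR because both instructions read R3; only the RAW part forces a stall.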
Examples of RAW
a=b+c
d=e-f
Equivalent code      Remarks on pipeline behaviour for the code
1 LOAD b,R1
2 LOAD c,R2
3 ADD R1,R2,R3       R2 available in 6th clock
4 STA R3,a           ID available in 7th clock and R3 available in 9th clock
5 LOAD e,R4          ID available in 10th clock
6 LOAD f,R5
7 SUB R4,R5,R6       R5 available in 14th clock
8 STA R6,d           ID available in 15th clock and R6 available in 17th clock
Pipeline
   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19
1  IF1 ID  EX  W
2      IF2 ID  EX  W
3          IF3 ID  ID  ID  EX  W
4              IF4 IF  IF  ID  ID  ID  EX  W
5                          IF5 IF  IF  ID  EX  W
6                                      IF6 ID  EX  W
7                                          IF7 ID  ID  ID  EX  W
8                                              IF8 IF  IF  ID  ID  ID  EX  W
Recoding the above example
1 LOAD b,R1
2 LOAD c,R2
3 LOAD e,R4
4 LOAD f,R5
5 ADD R1,R2,R3
6 SUB R4,R5,R6       R5 available in 8th clock
7 STA R3,a           ID available in 9th clock
8 STA R6,d           R6 available in 11th clock
Pipeline
   1   2   3   4   5   6   7   8   9   10  11  12  13
1  IF1 ID  EX  W
2      IF2 ID  EX  W
3          IF3 ID  EX  W
4              IF4 ID  EX  W
5                  IF5 ID  EX  W
6                      IF6 ID  ID  EX  W
7                          IF7 IF  ID  EX  W
8                                  IF8 ID  ID  EX  W
Hardware solution: with more internal circuitry, intermediate results are fed back to earlier stages for
the next instructions. Thus the next instruction does not have to wait until the operand has passed the last stage.
This path is called the forward path.
[Diagram: forward path into the EX stage through a mux]
The value of the expression is made available to the EX stage through the forward path and a mux.
So the instruction does not stall in the ID stage, as the required value of the expression is available in the EX stage.
1 LOAD b,R1
2 LOAD c,R2
3 ADD R1,R2,R3       R1,R2 available in 5th clock in EX through forward path
4 STA R3,a           R3 available in 6th clock
5 LOAD e,R4
6 LOAD f,R5
7 SUB R4,R5,R6       R4,R5 available in 9th clock in EX through forward path
8 STA R6,d           R6 available in 10th clock

Pipeline
   1   2   3   4   5   6   7   8   9   10  11
1  IF1 ID  EX  W
2      IF2 ID  EX  W
3          IF3 ID  EX  W
4              IF4 ID  EX  W
5                  IF5 ID  EX  W
6                      IF6 ID  EX  W
7                          IF7 ID  EX  W
8                              IF8 ID  EX  W
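The three pipeline tables above (original code, recoded code, and forward path) can be reproduced with a small timing model. The sketch below is an assumption-laden simplification, not a real processor model: it assumes the 4 stages IF, ID, EX, W with one instruction per stage, that without forwarding a register written in W at clock k can be read in ID from clock k+1, and that with the forward path operands never hold an instruction in ID. The function name simulate and the (destination, sources) tuples are chosen here for illustration.

```python
def simulate(program, forwarding=False):
    """Return the clock in which the last instruction leaves the W stage.

    program: list of (destination or None, tuple of source registers).
    """
    ready = {}          # register -> first clock in which ID may read it
    prev_id_start = 0   # clock in which the previous instruction entered ID
    prev_id_done = 0    # last clock the previous instruction spent in ID
    w = 0
    for dst, srcs in program:
        # IF becomes free when the previous instruction moves on to ID.
        if_start = prev_id_start if prev_id_start else 1
        # ID becomes free one clock after the previous instruction left it.
        id_start = max(if_start + 1, prev_id_done + 1)
        id_done = id_start
        if not forwarding:
            # Stay in ID until every source register has been written back.
            for reg in srcs:
                id_done = max(id_done, ready.get(reg, 0))
        ex = id_done + 1
        w = ex + 1
        if dst is not None:
            ready[dst] = w + 1
        prev_id_start, prev_id_done = id_start, id_done
    return w


original = [("R1", ()), ("R2", ()), ("R3", ("R1", "R2")), (None, ("R3",)),
            ("R4", ()), ("R5", ()), ("R6", ("R4", "R5")), (None, ("R6",))]
recoded = [("R1", ()), ("R2", ()), ("R4", ()), ("R5", ()),
           ("R3", ("R1", "R2")), ("R6", ("R4", "R5")),
           (None, ("R3",)), (None, ("R6",))]

print(simulate(original))                   # 19
print(simulate(recoded))                    # 13
print(simulate(original, forwarding=True))  # 11
```

The three results match the last occupied clock column of the three tables: 19 clocks with stalls, 13 after reordering, and 11 with the forward path.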
Control hazard: occurs when the pipeline waits for the next instruction, which is not available for one of two reasons:
a cache miss, or a branch instruction, in which case the PC is modified.
On a cache miss the IF stage of the pipeline waits for the expected instruction.
For a branch instruction, the branch decision is taken in either the ID or the EX unit and the modified PC is
available later. By that time the IF stage has fetched the next instructions in sequence, which are partially executed
when the branch decision is taken. Therefore the partially executed instructions are flushed from the pipeline
and the pipeline reads the instruction from the branch target location.
This degrades the performance of the pipeline.
Example
1 ADD R1,R2,R3
2 JMP xx
3 INC R2
4 ADD R5,R1,R3
......
xx MOV R1,R3
clk  1    2    3    4    5    6
1    IF1  ID   EX   W
2         IF2  ID   EX   W
3              IF3  ID   EX
4                   IF4  ID
5                        IF5
xx                            IFxx
The value of the new PC is assumed to be available in the 5th clock. The IF stage fetches from location xx in the 6th clock.
Since the sequential instructions 3 to xx-1 are not required, they are flushed from the pipeline
(assuming their partial execution does not result in any side effect). This period is the delay slot. The output
of the instruction at the branch target will take n clocks.
This causes a clock penalty known as the branch penalty.
Branch instructions can be of conditional or unconditional type.
Conditional branch instruction: if the condition is true, the pipeline is flushed and the instruction at the branch address
enters the pipeline; if the condition fails, the pipeline continues unaffected.
Unconditional branch instruction: the pipeline is always flushed.

The performance of the pipeline depends on the number of branch instructions and, for conditional branch
instructions, on the probability that the condition is true.
Simple analysis
Let there be N instructions in a programme, let p be the probability that an instruction is a branch instruction and q the
probability that a branch succeeds (is taken).
Total number of cycles needed to execute = N*p*q*n + N*(1-p) + N*p*(1-q)
average no. of clocks/inst. = 1 + p*q*(n-1)   for an n-stage pipeline processor.
average no. of clocks/inst. = n   for a non-pipeline processor.
Performance improvement = n / (1 + p*q*(n-1))
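The analysis can be evaluated for sample numbers. In the sketch below the values n = 4, p = 0.2 and q = 0.5 are assumed purely for illustration, and the performance improvement is taken as the ratio of the non-pipeline to the pipeline clocks per instruction.

```python
def total_cycles(N, n, p, q):
    """Taken branches cost n clocks each; all other instructions cost 1."""
    return N * p * q * n + N * (1 - p) + N * p * (1 - q)


def avg_clocks_per_inst(n, p, q):
    """total_cycles / N simplifies to 1 + p*q*(n-1)."""
    return 1 + p * q * (n - 1)


def improvement(n, p, q):
    """Non-pipeline clocks per instruction (n) over the pipeline average."""
    return n / avg_clocks_per_inst(n, p, q)


N, n, p, q = 1000, 4, 0.2, 0.5   # assumed example values
print(total_cycles(N, n, p, q))        # about 1300 clocks
print(avg_clocks_per_inst(n, p, q))    # about 1.3
print(improvement(n, p, q))            # about 3.08, below the ideal of 4
```

With no taken branches (p*q = 0) the improvement returns to the ideal value n.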
Branch delay slot: the instructions at the locations following a branch instruction (the delay slot) are always fetched and
partially executed before the branching decision is made.
To minimize the performance loss, this delay slot is put to use; this is called delayed branch.
Arrange the code such that the instructions in the delay slot need to be executed whether the branch is taken or not, and so that this
does not change the logic of the programme. It is also possible to put a NOP in the delay slot.
Example
1   SHL R1
2   DCR R2
3   JNZ 1            flag based on R2 value available after W stage
4   ADD R4,R3,R1
5   .......
(No forward path; changed PC available after W stage)
clk 1   2   3   4   5   6   7   8   9
1   IF1 ID  EX  W
2       IF2 ID  EX  W
3           IF3 ID  ID  ID  EX  W
4               IF4 IF  IF  ID  EX
5                           IF5 ID
6                               IF6
xx                                  IFxx
In clock 9, the fetched instruction is either 1 (R2 not 0) or 4 (R2=0). As a simple rule all intermediate
instructions are flushed from the pipeline.
(With forward path; changed PC and flag based on R2 available after EX stage through the forward path)
clk 1   2   3   4   5   6
1   IF1 ID  EX  W
2       IF2 ID  EX  W
3           IF3 ID  EX  W
4               IF4 ID
5                   IF5
xx                      IFxx
In clock 6, the fetched instruction is either 1 (R2 not 0) or 4 (R2=0). As a simple rule all intermediate
instructions are flushed from the pipeline.
Recode the example
1   DCR R2
2   JNZ 1            flag based on R2 value available after EX stage
3   SHL R1
4   NOP
5   ADD R4,R3,R1
6   ......
(With forward path; changed PC and flag based on R2 available after EX stage through the forward path)
clk 1   2   3   4   5   6   7
1   IF1 ID  EX  W
2       IF2 ID  EX  W
3           IF3 ID  EX  W
4               IF4 ID  EX  W
1/5                 IFxx
In clock 5, the fetched instruction is either 1 (R2 not 0) or 5 (R2=0). As there is no other instruction to
reorder, a NOP is inserted between SHL and ADD.
(In the case of an unconditional branch instruction, the modified PC is calculated in the ID stage.)

To reduce the branch penalty, predict whether or not a particular branch will be taken.
Static branch prediction: no past history is used.
Dynamic branch prediction: the past history of branching is included in the prediction.
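The difference between the two kinds of prediction can be illustrated by counting mispredictions on a sequence of branch outcomes. The sketch below is hypothetical: "always predict not taken" is used as one possible static rule and "predict the last outcome" as one possible dynamic rule; neither is stated in these notes, and the outcome sequence (a loop branch taken 9 times, then not taken) is made up.

```python
def static_not_taken(outcomes):
    """Static rule: always predict 'not taken'; no history is used."""
    return sum(1 for taken in outcomes if taken)


def dynamic_last_outcome(outcomes, initial=False):
    """Dynamic rule: predict whatever this branch did the last time."""
    mispredictions = 0
    prediction = initial
    for taken in outcomes:
        if taken != prediction:
            mispredictions += 1
        prediction = taken   # remember the most recent outcome
    return mispredictions


loop_branch = [True] * 9 + [False]        # assumed outcome history
print(static_not_taken(loop_branch))      # 9
print(dynamic_last_outcome(loop_branch))  # 2
```

Each misprediction costs one pipeline flush, so fewer mispredictions means a smaller total branch penalty.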

Limitations
Some stages of a pipeline are inherently more complex and take longer than others; the time of the slowest stage
is taken as the time of each pipeline stage. So the faster stages are idle for part of each clock, and the overall
execution time of a single instruction gets longer. This also affects the completion rate of the different stages.
