Pipelining – An Introduction
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by John L.
Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
Outline
 Introduction
 Pipeline hazards
 Implementing pipelines
2
Pipelining – It’s Natural!
 Laundry example
 Amal, Bimal, Chamal, & Dinal
each have one load of clothes
to wash, dry, & fold
 Washer takes 30 minutes
 Dryer takes 40 minutes
 Folder takes 20 minutes
3
Sequential Laundry
 Sequential laundry takes 6 hours for 4 loads
 If they learned pipelining, how long would laundry take?
[Figure: sequential laundry timeline – loads A, B, C, & D in task order vs. time from 6 PM to midnight; each load uses the washer (30 min), dryer (40 min), & folder (20 min) back to back]
4
Pipelined Laundry – Start Work ASAP
 Pipelined laundry takes 3.5 hours for 4 loads
[Figure: pipelined laundry timeline – loads A, B, C, & D overlapped in task order vs. time from 6 PM; after the first 30-min wash, the 40-min dryer stage paces the line, & the last fold finishes at 9:30 PM]
5
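A quick check of the two totals: one load takes 30 + 40 + 20 = 90 minutes, so 4 sequential loads take 4 × 90 = 360 minutes = 6 hours. Pipelined, the 40-minute dryer is the slowest stage & paces the line, so 4 loads take 30 + 4 × 40 + 20 = 210 minutes = 3.5 hours, finishing at 9:30 PM.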
Pipelining Lessons
 Multiple tasks operating
simultaneously
 Improve throughput of entire
workload
 Potential speedup = number of pipeline stages
 Pipelining doesn’t reduce
latency of a single task
 Pipeline rate limited by
slowest pipeline stage
 Unbalanced lengths of pipe stages reduce speedup
 Time to fill the pipeline & time to drain it reduce speedup
[Figure: the pipelined laundry timeline from the previous slide, repeated for reference]
MIPS Architecture
 MIPS – Microprocessor without Interlocked
Pipeline Stages
 RISC Architecture
 Implementations are primarily used in
embedded systems
 Windows CE devices, routers, & video game
consoles (PlayStation 2)
 32 & 64-bit instructions
 ADD, SUB, OR, AND – 32-bit
 DADD, DSUB, OR, AND – 64-bit
7
5 Steps of MIPS Datapath
8
5 Steps of MIPS Datapath (Cont.)
 Instruction Fetch (IF)
 Fetch current instruction & update PC
 Instruction Decode/register fetch cycle (ID)
 Decode instruction
 Read registers
 Decoding & register reads happen in parallel (register specifiers sit in fixed fields of the instruction)
 Do the equality test on the registers for a possible branch
 Compute the possible branch-target address; store it in the PC if the branch is taken
 Execution & effective address calculation (EX)
 ALU
 Memory references
9
5 Steps of MIPS Datapath (Cont.)
 Memory Access (MEM)
 Read/write using effective address calculated in EX
stage
 Load & Store
 Write-Back cycle (WB)
 Write the result into the register file
10
Instruction Set Processor Controller
11
Pipelining Speedup
 If each pipeline stage is balanced,
Time per instruction on a pipelined processor = Time per instruction on unpipelined processor / Number of pipeline stages
Or
Speedup = Average instruction time unpipelined / Average instruction time pipelined
        = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)
12
Example
 Unpipelined CPU with a 1 ns clock cycle. ALU operations & branches take 4 cycles, & memory operations take 5 cycles. The relative mix is 50%, 10%, & 40%, respectively. What is the average instruction execution time?
 4 × 0.5 + 4 × 0.1 + 5 × 0.4 = 4.4 ns
 Due to clock skew & setup time, pipelining adds 0.1 ns of overhead to the clock. What is the speedup?
 The pipelined clock cycle is 1 + 0.1 = 1.1 ns
 Speedup = Average instruction time unpipelined / Average instruction time pipelined = 4.4 / 1.1 = 4
13
ARM Cortex-A53 Pipeline
14
Source: www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55/4
Pipelining Isn’t That Easy
 Hazards prevent next instruction from executing
during its designated clock cycle
 Reduce performance from ideal speedup
1. Structural hazards
 Hardware can’t support a given combination of
instructions
2. Data hazards
 Instruction depends on result of a prior instruction still in
pipeline
3. Control hazards
 Caused by branches & other instructions that change PC
15
Hazards
 When a hazard occurs, the pipeline may need to be stalled
 To let the hazard clear, instructions issued before the stalled instruction must be allowed to proceed while the stalled one(s) are delayed
 We assume all instructions issued after a stalled instruction are also delayed (no new instructions are fetched during the stall)
16
Pipelining Speedup
CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
Speedup = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)
        = (CPI unpipelined / (Ideal CPI + Stall cycles per instruction)) × (Clock cycle unpipelined / Clock cycle pipelined)
 With Ideal CPI = 1 (& taking the unpipelined CPI as 1 as well)
Speedup = (1 / (1 + Stall cycles per instruction)) × (Clock cycle unpipelined / Clock cycle pipelined)
17
Structural Hazards
 When some combination of instructions can’t be accommodated because of a resource conflict
 E.g., Shared data & instruction bus
18
Shared Data & Instruction Bus
19
Stall is called a “pipeline bubble” or just a “bubble”
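The figure is not reproduced here; a sketch of the conflict with a single shared memory port, using the 5-stage timing above:

Cycle:      1    2    3    4      5    6    7    8    9
Load        IF   ID   EX   MEM    WB
Instr i+1        IF   ID   EX     MEM  WB
Instr i+2             IF   ID     EX   MEM  WB
Instr i+3                  stall  IF   ID   EX   MEM  WB

In cycle 4 the load's data access & instruction i+3's instruction fetch both need the single memory port, so i+3 is delayed by one cycle (the bubble).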
Example – Dual-Port vs. Single-Port
 Machine A – Dual ported memory (Harvard
Architecture)
 Machine B – Single ported memory, but its
pipelined implementation has a 1.05× faster
clock rate
 Ideal CPI = 1 for both
 Loads are 40% of instructions executed
 Which machine is faster?
20
Example (Cont.)
 SpeedupA = Pipeline depth / (1 + 0) × (Clock unpipelined / Clock pipelined) = Pipeline depth
 SpeedupB = Pipeline depth / (1 + 0.4 × 1) × (Clock unpipelined / (Clock unpipelined / 1.05))
          = (Pipeline depth / 1.4) × 1.05
          = 0.75 × Pipeline depth
 SpeedupA / SpeedupB = Pipeline depth / (0.75 × Pipeline depth) = 1.33
 A is 1.33x faster
21
Shared Register File
22
Register file is written in the 1st half of the clock cycle & read in the 2nd half
Data Hazards
 Consider the following code
DADD R1, R2, R3
DSUB R4, R1, R3
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
 Any problem?
23
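A reading of the code above, assuming no forwarding & the split-cycle register file from the Shared Register File slide (write in the 1st half, read in the 2nd half):

DADD R1, R2, R3    ; writes R1 in its WB stage (cycle 5)
DSUB R4, R1, R3    ; reads R1 in ID (cycle 3) – gets the stale value: RAW hazard
AND  R6, R1, R7    ; reads R1 in ID (cycle 4) – still before the write: RAW hazard
OR   R8, R1, R9    ; reads R1 in ID (cycle 5) – same cycle as the write, safe with the split-cycle register file
XOR  R10, R1, R11  ; reads R1 in ID (cycle 6) – after the write, no hazard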
Data Hazards (Cont.)
24
Generic Data Hazards
 3 generic data hazards
 Read After Write (RAW)
 Write After Read (WAR)
 Write After Write (WAW)
25
Read After Write (RAW)
 Instruction J tries to read operand before Instruction I
writes it
 Caused by a dependence
 This hazard results from an actual need for
communication
26
Write After Read (WAR)
 Instruction J writes an operand before Instruction I reads it
 Called an anti-dependence by compiler developers
 This results from reuse of the register name R1 (illustrated below)
 Can’t happen in the MIPS 5-stage pipeline because
 All instructions take 5 stages
 Reads are always in stage 2
 Writes are always in stage 5
27
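An illustrative anti-dependence (the slide's original snippet isn't reproduced; registers are illustrative):

DSUB R4, R1, R3    ; I: reads R1
DADD R1, R2, R3    ; J: writes R1 – becomes a WAR hazard only if J's write could occur before I's read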
Write After Write (WAW)
 Instruction J writes operand before Instruction I writes it
 Called an output dependence by compiler developers
 This also results from reuse of the register name R1 (see the sketch below)
 Can’t happen in MIPS 5 stage pipeline because
 All instructions take 5 stages
 Writes are always in stage 5
 WAR & WAW can happen in more complicated pipelines
28
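An illustrative output dependence (again, registers are illustrative):

LD   R1, 0(R2)     ; I: writes R1
DADD R1, R3, R4    ; J: also writes R1 – becomes a WAW hazard only if the writes could complete out of order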
Pipeline Registers
29
Forwarding to Avoid Data Hazard
30
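The figure isn't reproduced here; the idea is that the ALU result is fed back from the EX/MEM & MEM/WB pipeline registers to the ALU inputs, so a dependent instruction doesn't wait for WB. For the earlier example (a sketch):

DADD R1, R2, R3    ; R1 available at the end of EX
DSUB R4, R1, R3    ; R1 forwarded from the EX/MEM pipeline register
AND  R6, R1, R7    ; R1 forwarded from the MEM/WB pipeline register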
Hardware Changes for Forwarding
31
Need additional hardware to detect & resolve hazard
Another Example
32
Data Hazard Even with Forwarding
33
Data Hazard Even with Forwarding
(Cont.)
34
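The figures aren't reproduced; the case they illustrate is a load followed immediately by a use of its result, which forwarding alone can't fix (a sketch):

LD   R1, 0(R2)     ; R1 available only at the end of MEM
DSUB R4, R1, R5    ; needs R1 at the start of its EX, one cycle too early – 1-cycle stall (load interlock)
AND  R6, R1, R7    ; after the stall, R1 can be forwarded in time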
Software Scheduling to Avoid Load
Hazards
 Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d, e, & f are in memory (one possible schedule is sketched below)
35
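One possible schedule, as a sketch: the register names & the assumption of a 1-cycle load delay are illustrative, not taken from the slides. The loads are hoisted so that no load result is used by the instruction immediately after the load.

LD   Rb, b        ; load b
LD   Rc, c        ; load c
LD   Re, e        ; hoisted load of e fills the slot after the load of c
DADD Ra, Rb, Rc   ; a = b + c (Rc now safely available)
LD   Rf, f        ; load f
SD   Ra, a        ; store a (fills the slot after the load of f)
DSUB Rd, Re, Rf   ; d = e - f
SD   Rd, d        ; store d

This ordering avoids load-use stalls for both statements.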
Control Hazards
 What to do with the 3 instructions fetched between the branch & the PC update?
 How do you handle them?
 Where is the commit point?
36
Branch Stall Impact
 If CPI = 1, 30% branch, Stall 3 cycles
 New CPI = 1 + 3 × 0.3 = 1.9
 2-part solution
 Determine whether the branch is taken or not sooner, and
 Compute the taken-branch target address earlier
 MIPS branch tests if register = 0 or ≠0
 MIPS solution
 Move Zero test to ID/RF stage
 Adder to calculate new PC in ID/RF stage
 1 clock cycle penalty for branch versus 3
37
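With the same numbers, reducing the branch penalty from 3 cycles to 1 gives CPI = 1 + 1 × 0.3 = 1.3 instead of 1.9, roughly a 1.9 / 1.3 ≈ 1.46× improvement.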
Pipelined MIPS Datapath
38
Branch Hazard Alternatives
1. Stall until branch direction is clear
2. Predict Branch Not Taken
 Execute successor instructions in sequence
 Squash instructions in pipeline, if branch actually taken
 47% MIPS branches not taken on average
 PC+4 already calculated, so use it to get next instruction
3. Predict Branch Taken
 53% MIPS branches taken on average
 But haven’t calculated branch target address in MIPS
 MIPS still incurs 1 cycle branch penalty
 Other machines: branch target known before outcome
39
Branch Hazard Alternatives (Cont.)
4. Delayed Branch
 Define branch to take place after a following instruction
 1 slot delay allows proper decision & branch target
address in 5 stage pipeline
 MIPS uses this
40
Scheduling (Filling) Branch Delay
Slots
41
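The figure (the usual three scheduling schemes) isn't reproduced: A fills the slot with an independent instruction from before the branch, B fills it from the branch target, & C from the fall-through path. A sketch of scheme A with illustrative code:

Before scheduling:
DADD R1, R2, R3
BEQZ R4, L1        ; delay slot would otherwise hold a NOP

After scheduling:
BEQZ R4, L1
DADD R1, R2, R3    ; moved into the delay slot; independent of the branch condition, so it executes whether or not the branch is taken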
Scheduling Branch Delay Slots (Cont.)
 A is the best choice, fills delay slot & reduces
instruction count (IC)
 In B, the SUB instruction may need to be copied, increasing instruction count (IC)
 In B & C, must be okay to execute SUB when
branch fails
42
Delayed Branch
 Compiler effectiveness for single branch delay
slot
 Fills about 60% of branch delay slots
 About 80% of instructions executed in branch delay
slots useful in computation
 About 50% (60% x 80%) of slots usefully filled
 Delayed Branch downside
 With deeper pipelines & multiple issue, branch delay
grows & need more than 1 delay slot
 Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches – possible with additional hardware
43
Problems with Pipelining
 Exceptions
 An unusual event happens to an instruction during its execution
 E.g., divide by zero, undefined opcode
 Interrupts
 Hardware signal to switch processor to a new instruction stream
 E.g., a sound card interrupts when it needs more audio output
samples
 The exception or interrupt must appear to occur between 2 instructions (Ii & Ii+1)
 Effect of all instructions up to & including Ii is totally complete
 No effect of any instruction after Ii has taken place
 Interrupt (exception) handler either aborts the program or restarts at instruction Ii+1
44
Summary
 Hazards reduce performance
1. Structural hazards
2. Data hazards
3. Control hazards
 Control hazards are the hardest
 Will look at more options next time
45


Editor's Notes

  • #7 Example for pipeline flush: Dryer crashed with an oil leak – wash & dry again. Folding table collapsed with all clothes on the floor – wash, dry, & fold again
  • #12 RR – Register-Register, RI – Register-Immediate
  • #13 From Processor Performance Equation
  • #14 Clock skew (TSkew) is the difference in the clock signal's arrival time at two sequentially-adjacent registers
  • #19 Stall is called a “pipeline bubble” or just a “bubble”
  • #30 Instructions in different stages of the pipeline should not interfere with one another
  • #37 3 cycle loss because PC update is in 4th cycle