Pipelining – An Introduction
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by John L.
Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
Outline
 Introduction
 Pipeline hazards
 Implementing pipelines
2
Pipelining – It’s Natural!
 Laundry example
 Amal, Bimal, Chamal, & Dinal
each have one load of clothes
to wash, dry, & fold
 Washer takes 30 minutes
 Dryer takes 40 minutes
 Folder takes 20 minutes
3
Sequential Laundry
 Sequential laundry takes 6 hours for 4 loads
 If they learned pipelining, how long would laundry take?
[Figure: sequential laundry timeline – loads A, B, C, & D in task order vs. time from 6 PM to midnight; each load uses the washer (30 min), dryer (40 min), & folder (20 min) back to back]
4
Pipelined Laundry – Start Work ASAP
 Pipelined laundry takes 3.5 hours for 4 loads
[Figure: pipelined laundry timeline – loads A, B, C, & D overlapped in task order vs. time from 6 PM; after the first 30-min wash, the 40-min dryer stage paces the line, & the last fold finishes at 9:30 PM]
5
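A quick check of the two totals: one load takes 30 + 40 + 20 = 90 minutes, so 4 sequential loads take 4 × 90 = 360 minutes = 6 hours. Pipelined, the 40-minute dryer is the slowest stage & paces the line, so 4 loads take 30 + 4 × 40 + 20 = 210 minutes = 3.5 hours, finishing at 9:30 PM.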
Pipelining Lessons
 Multiple tasks operating
simultaneously
 Improve throughput of entire
workload
 Potential speedup = number of pipeline stages
 Pipelining doesn’t reduce
latency of a single task
 Pipeline rate limited by
slowest pipeline stage
 Unbalanced lengths of pipe stages reduce speedup
 Time to fill the pipeline & time to drain it reduce speedup
[Figure: the pipelined laundry timeline from the previous slide, repeated for reference]
MIPS Architecture
 MIPS – Microprocessor without Interlocked
Pipeline Stages
 RISC Architecture
 Implementations are primarily used in
embedded systems
 Windows CE devices, routers, & video game
consoles (PlayStation 2)
 32 & 64-bit instructions
 ADD, SUB, OR, AND – 32-bit
 DADD, DSUB, OR, AND – 64-bit
7
5 Steps of MIPS Datapath
8
5 Steps of MIPS Datapath (Cont.)
 Instruction Fetch (IF)
 Fetch current instruction & update PC
 Instruction Decode/register fetch cycle (ID)
 Decode instruction
 Read registers
 Decoding & register reads happen in parallel (register specifiers sit in fixed fields of the instruction)
 Do the equality test on the registers for a possible branch
 Compute the possible branch-target address; store it in the PC if the branch is taken
 Execution & effective address calculation (EX)
 ALU
 Memory references
9
5 Steps of MIPS Datapath (Cont.)
 Memory Access (MEM)
 Read/write using effective address calculated in EX
stage
 Load & Store
 Write-Back cycle (WB)
 Write the result into the register file
10
Instruction Set Processor Controller
11
Pipelining Speedup
 If each pipeline stage is balanced,
Time per instruction on a pipelined processor = Time per instruction on unpipelined processor / Number of pipeline stages
Or
Speedup = Average instruction time unpipelined / Average instruction time pipelined
        = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)
12
Example
 Unpipelined CPU with a 1 ns clock cycle. ALU operations & branches take 4 cycles, & memory operations take 5 cycles. The relative mix is 50%, 10%, & 40%, respectively. What is the average instruction execution time?
 4 × 0.5 + 4 × 0.1 + 5 × 0.4 = 4.4 ns
 Due to clock skew & setup time, pipelining adds 0.1 ns of overhead to the clock. What is the speedup?
 The pipelined clock cycle is 1 + 0.1 = 1.1 ns
 Speedup = Average instruction time unpipelined / Average instruction time pipelined = 4.4 / 1.1 = 4
13
ARM Cortex-A53 Pipeline
14
Source: www.anandtech.com/show/11441/dynamiq-and-arms-new-cpus-cortex-a75-a55/4
Pipelining Isn’t That Easy
 Hazards prevent next instruction from executing
during its designated clock cycle
 Reduce performance from ideal speedup
1. Structural hazards
 Hardware can’t support a given combination of
instructions
2. Data hazards
 Instruction depends on result of a prior instruction still in
pipeline
3. Control hazards
 Caused by branches & other instructions that change PC
15
Hazards
 When a hazard occurs, the pipeline may need to be stalled
 To let the hazard clear, instructions issued before the stalled instruction must be allowed to proceed while the stalled one(s) are delayed
 We assume all instructions issued after a stalled instruction are also delayed (no new instructions are fetched during the stall)
16
Pipelining Speedup
CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
Speedup = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)
        = (CPI unpipelined / (Ideal CPI + Stall cycles per instruction)) × (Clock cycle unpipelined / Clock cycle pipelined)
 With Ideal CPI = 1 (& taking the unpipelined CPI as 1 as well)
Speedup = (1 / (1 + Stall cycles per instruction)) × (Clock cycle unpipelined / Clock cycle pipelined)
17
Structural Hazards
 When some combination of instructions can’t be accommodated because of a resource conflict
 E.g., Shared data & instruction bus
18
Shared Data & Instruction Bus
19
Stall is called a “pipeline bubble” or just a “bubble”
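The figure is not reproduced here; a sketch of the conflict with a single shared memory port, using the 5-stage timing above:

Cycle:      1    2    3    4      5    6    7    8    9
Load        IF   ID   EX   MEM    WB
Instr i+1        IF   ID   EX     MEM  WB
Instr i+2             IF   ID     EX   MEM  WB
Instr i+3                  stall  IF   ID   EX   MEM  WB

In cycle 4 the load's data access & instruction i+3's instruction fetch both need the single memory port, so i+3 is delayed by one cycle (the bubble).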
Example – Dual-Port vs. Single-Port
 Machine A – Dual ported memory (Harvard
Architecture)
 Machine B – Single ported memory, but its
pipelined implementation has a 1.05× faster
clock rate
 Ideal CPI = 1 for both
 Loads are 40% of instructions executed
 Which machine is faster?
20
Example (Cont.)
 SpeedupA = Pipeline depth / (1 + 0) × (Clock unpipelined / Clock pipelined) = Pipeline depth
 SpeedupB = Pipeline depth / (1 + 0.4 × 1) × (Clock unpipelined / (Clock unpipelined / 1.05))
          = (Pipeline depth / 1.4) × 1.05
          = 0.75 × Pipeline depth
 SpeedupA / SpeedupB = Pipeline depth / (0.75 × Pipeline depth) = 1.33
 A is 1.33x faster
21
Shared Register File
22
Register file is written in the 1st half of the clock cycle & read in the 2nd half
Data Hazards
 Consider the following code
DADD R1, R2, R3
DSUB R4, R1, R3
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
 Any problem?
23
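A reading of the code above, assuming no forwarding & the split-cycle register file from the Shared Register File slide (write in the 1st half, read in the 2nd half):

DADD R1, R2, R3    ; writes R1 in its WB stage (cycle 5)
DSUB R4, R1, R3    ; reads R1 in ID (cycle 3) – gets the stale value: RAW hazard
AND  R6, R1, R7    ; reads R1 in ID (cycle 4) – still before the write: RAW hazard
OR   R8, R1, R9    ; reads R1 in ID (cycle 5) – same cycle as the write, safe with the split-cycle register file
XOR  R10, R1, R11  ; reads R1 in ID (cycle 6) – after the write, no hazard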
Data Hazards (Cont.)
24
Generic Data Hazards
 3 generic data hazards
 Read After Write (RAW)
 Write After Read (WAR)
 Write After Write (WAW)
25
Read After Write (RAW)
 Instruction J tries to read operand before Instruction I
writes it
 Caused by a dependence
 This hazard results from an actual need for
communication
26
Write After Read (WAR)
 Instruction J writes an operand before Instruction I reads it
 Called an anti-dependence by compiler developers
 This results from reuse of the register name R1 (illustrated below)
 Can’t happen in the MIPS 5-stage pipeline because
 All instructions take 5 stages
 Reads are always in stage 2
 Writes are always in stage 5
27
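An illustrative anti-dependence (the slide's original snippet isn't reproduced; registers are illustrative):

DSUB R4, R1, R3    ; I: reads R1
DADD R1, R2, R3    ; J: writes R1 – becomes a WAR hazard only if J's write could occur before I's read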
Write After Write (WAW)
 Instruction J writes operand before Instruction I writes it
 Called an output dependence by compiler developers
 This also results from reuse of the register name R1 (see the sketch below)
 Can’t happen in MIPS 5 stage pipeline because
 All instructions take 5 stages
 Writes are always in stage 5
 WAR & WAW can happen in more complicated pipelines
28
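An illustrative output dependence (again, registers are illustrative):

LD   R1, 0(R2)     ; I: writes R1
DADD R1, R3, R4    ; J: also writes R1 – becomes a WAW hazard only if the writes could complete out of order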
Pipeline Registers
29
Forwarding to Avoid Data Hazard
30
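The figure isn't reproduced here; the idea is that the ALU result is fed back from the EX/MEM & MEM/WB pipeline registers to the ALU inputs, so a dependent instruction doesn't wait for WB. For the earlier example (a sketch):

DADD R1, R2, R3    ; R1 available at the end of EX
DSUB R4, R1, R3    ; R1 forwarded from the EX/MEM pipeline register
AND  R6, R1, R7    ; R1 forwarded from the MEM/WB pipeline register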
Hardware Changes for Forwarding
31
Need additional hardware to detect & resolve hazard
Another Example
32
Data Hazard Even with Forwarding
33
Data Hazard Even with Forwarding
(Cont.)
34
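The figures aren't reproduced; the case they illustrate is a load followed immediately by a use of its result, which forwarding alone can't fix (a sketch):

LD   R1, 0(R2)     ; R1 available only at the end of MEM
DSUB R4, R1, R5    ; needs R1 at the start of its EX, one cycle too early – 1-cycle stall (load interlock)
AND  R6, R1, R7    ; after the stall, R1 can be forwarded in time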
Software Scheduling to Avoid Load
Hazards
 Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d, e, & f are in memory (one possible schedule is sketched below)
35
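One possible schedule, as a sketch: the register names & the assumption of a 1-cycle load delay are illustrative, not taken from the slides. The loads are hoisted so that no load result is used by the instruction immediately after the load.

LD   Rb, b        ; load b
LD   Rc, c        ; load c
LD   Re, e        ; hoisted load of e fills the slot after the load of c
DADD Ra, Rb, Rc   ; a = b + c (Rc now safely available)
LD   Rf, f        ; load f
SD   Ra, a        ; store a (fills the slot after the load of f)
DSUB Rd, Re, Rf   ; d = e - f
SD   Rd, d        ; store d

This ordering avoids load-use stalls for both statements.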
Control Hazards
 What to do with the 3 instructions fetched between the branch & the PC update?
 How do you handle them?
 Where is the commit point?
36
Branch Stall Impact
 If CPI = 1, 30% branch, Stall 3 cycles
 New CPI = 1 + 3 × 0.3 = 1.9
 2-part solution
 Determine whether the branch is taken or not sooner, and
 Compute the taken-branch target address earlier
 MIPS branch tests if register = 0 or ≠0
 MIPS solution
 Move Zero test to ID/RF stage
 Adder to calculate new PC in ID/RF stage
 1 clock cycle penalty for branch versus 3
37
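With the same numbers, reducing the branch penalty from 3 cycles to 1 gives CPI = 1 + 1 × 0.3 = 1.3 instead of 1.9, roughly a 1.9 / 1.3 ≈ 1.46× improvement.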
Pipelined MIPS Datapath
38
Branch Hazard Alternatives
1. Stall until branch direction is clear
2. Predict Branch Not Taken
 Execute successor instructions in sequence
 Squash instructions in pipeline, if branch actually taken
 47% MIPS branches not taken on average
 PC+4 already calculated, so use it to get next instruction
3. Predict Branch Taken
 53% MIPS branches taken on average
 But haven’t calculated branch target address in MIPS
 MIPS still incurs 1 cycle branch penalty
 Other machines: branch target known before outcome
39
Branch Hazard Alternatives (Cont.)
4. Delayed Branch
 Define branch to take place after a following instruction
 1 slot delay allows proper decision & branch target
address in 5 stage pipeline
 MIPS uses this
40
Scheduling (Filling) Branch Delay
Slots
41
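The figure (the usual three scheduling schemes) isn't reproduced: A fills the slot with an independent instruction from before the branch, B fills it from the branch target, & C from the fall-through path. A sketch of scheme A with illustrative code:

Before scheduling:
DADD R1, R2, R3
BEQZ R4, L1        ; delay slot would otherwise hold a NOP

After scheduling:
BEQZ R4, L1
DADD R1, R2, R3    ; moved into the delay slot; independent of the branch condition, so it executes whether or not the branch is taken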
Scheduling Branch Delay Slots (Cont.)
 A is the best choice, fills delay slot & reduces
instruction count (IC)
 In B, the SUB instruction may need to be copied, increasing instruction count (IC)
 In B & C, must be okay to execute SUB when
branch fails
42
Delayed Branch
 Compiler effectiveness for single branch delay
slot
 Fills about 60% of branch delay slots
 About 80% of instructions executed in branch delay
slots useful in computation
 About 50% (60% x 80%) of slots usefully filled
 Delayed Branch downside
 With deeper pipelines & multiple issue, branch delay
grows & need more than 1 delay slot
 Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches – possible with additional hardware
43
Problems with Pipelining
 Exceptions
 An unusual event happens to an instruction during its execution
 E.g., divide by zero, undefined opcode
 Interrupts
 Hardware signal to switch processor to a new instruction stream
 E.g., a sound card interrupts when it needs more audio output
samples
 The exception or interrupt must appear to occur between 2 instructions (Ii & Ii+1)
 Effect of all instructions up to & including Ii is totally complete
 No effect of any instruction after Ii has taken place
 Interrupt (exception) handler either aborts the program or restarts at instruction Ii+1
44
Summary
 Hazards reduce performance
1. Structural hazards
2. Data hazards
3. Control hazards
 Control hazards are the hardest
 Will look at more options next time
45


Editor's Notes

  • #7 Example for pipeline flush: Dryer crashed with an oil leak – wash & dry again. Folding table collapsed with all clothes on the floor – wash, dry, & fold again
  • #12 RR – Register-Register, RI – Register-Immediate
  • #13 From Processor Performance Equation
  • #14 Clock skew (TSkew) is the difference in the clock signal's arrival time at two sequentially-adjacent registers
  • #19 Stall is called a “pipeline bubble” or just a “bubble”
  • #30 Instructions in different stages of the pipeline should not interfere with one another
  • #37 3 cycle loss because PC update is in 4th cycle