Section 6
Instruction-Level Parallelism
Topics:
Pipelining
Superscalar processors
VLIW architecture
Instruction level parallelism
Overview
Modern processors apply techniques for executing several instructions in parallel to enhance computing power.
The potential of executing machine instructions in parallel is called instruction
level parallelism (ILP).
Remember: execution of one instruction is broken into several steps
Pipelining:
Different steps of multiple instructions are executed simultaneously.
Concurrent execution:
The same steps of multiple machine instructions may be executed
simultaneously.
Requires multiple functional units
Techniques: Superscalar
VLIW (very long instruction word)
Pipelining: principle
Principle:
The execution of a machine instruction is divided into several steps - called pipeline stages - taking nearly the same execution time. These stages may be executed in parallel.
Example MIPS: 5 pipeline stages
1. IF: instruction fetch
2. ID: instruction decode and register file read
3. EX: execution / memory address calculation
4. MEM: data memory access
5. WB: result write back
Executing two 5-step instructions (e.g. lw) without pipelining:

Instruction 1: S1 S2 S3 S4 S5
Instruction 2:                S1 S2 S3 S4 S5
Clock cycle:    1  2  3  4  5  6  7  8  9 10

Executing 6 instructions using pipelining:

Instruction 1: S1 S2 S3 S4 S5
Instruction 2:    S1 S2 S3 S4 S5
Instruction 3:       S1 S2 S3 S4 S5
Instruction 4:          S1 S2 S3 S4 S5
Instruction 5:             S1 S2 S3 S4 S5
Instruction 6:                S1 S2 S3 S4 S5
Clock cycle:    1  2  3  4  5  6  7  8  9 10
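The timing diagrams above follow a simple formula: with k pipeline stages and n instructions, an ideal pipeline needs k + n - 1 cycles instead of n * k. A minimal sketch in plain Python (illustrative only, assuming no hazards or stalls):

```python
def cycles_unpipelined(n_instructions, n_stages):
    # Each instruction runs all stages to completion before the next starts.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    # The first instruction takes n_stages cycles; every further instruction
    # completes one cycle later (ideal pipeline, no hazards or stalls).
    return n_stages + n_instructions - 1

# The examples from the slide: 2 lw instructions without pipelining,
# 6 instructions with pipelining, 5 MIPS stages each.
print(cycles_unpipelined(2, 5))  # 10 cycles
print(cycles_pipelined(6, 5))    # 10 cycles
```

Both examples happen to take 10 cycles, which is exactly why the slide pairs them: the pipelined machine finishes three times as many instructions in the same time.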
In this chapter we will design a pipelined MIPS datapath for the following
instructions: lw, sw, add, sub, and, or, slt, beq
Situations may occur where two instructions cannot be executed in the
pipeline right after each other!
Example: The non-pipelined multi-cycle CPU has a shared ALU for
1. executing arithmetic/logical instructions
2. incrementing the PC
Structural hazard: Two instructions wish to use a certain hardware
component in the same clock cycle leading to a
resource conflict.
For RISC instruction sets structural hazards can often be resolved by
additional hardware.
Additional hardware required:
1. Permit incrementing PC and executing arithmetic/logical instructions
concurrently:
use separate adder for incrementing PC
2. Permit reading next instruction and reading/writing data from/to memory:
divide memory into instruction memory and data memory (Harvard architecture)
3. Permit executing an arithmetic/logical instruction (which uses the ALU in the 3rd cycle) followed by
a branch (which calculates the branch target in the 2nd cycle):
use separate adder for branch address calculation
In general, duplicating hardware components leads to fewer and/or smaller multiplexers.
MIPS datapath (without pipelining)
[Figure: single-cycle MIPS datapath - PC, instruction memory, register file, sign extend, ALU and data memory, with the five stages IF, ID, EX, MEM and WB marked. Datapath for executing one instruction per clock cycle (single-cycle implementation).]
Pipelined MIPS datapath additionally requires
Pipeline registers:
• Store all the data occurring at the end of one pipeline stage that are
required as input data in the next stage
• Divide datapath into pipeline stages
• Replace temporary datapath registers of non-pipelined multi-cycle
implementation, e.g.:
ALU target register T replaced by pipeline register EX/MEM
Instruction register IR replaced by pipeline register IF/ID
Pipelined MIPS datapath
[Figure: pipelined MIPS datapath - the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB separate the stages IF, ID, EX, MEM and WB.]
Executing an instruction, phase 1: instruction fetch
[Figure: instruction fetch of lw $t0, 32($s3) - the instruction is read from instruction memory and stored in the IF/ID register; the PC is incremented by 4.]
Executing an instruction, phase 2: instruction decode
[Figure: instruction decode of lw $t0, 32($s3) - the register file is read and the immediate is sign extended; the results are stored in the ID/EX register.]
Executing an instruction, phase 3: execution
[Figure: execution of lw $t0, 32($s3) - the ALU adds the base register and the sign-extended offset; the address is stored in the EX/MEM register.]
Executing an instruction, phase 4: memory access
[Figure: memory access of lw $t0, 32($s3) - the data memory is read at the calculated address; the value is stored in the MEM/WB register.]
Executing an instruction, phase 5: write back
[Figure: write back of lw $t0, 32($s3) - the loaded value is written to the register file.]

BUG! The load instruction writes its result into the wrong register: the register number used belongs to the instruction that has just been fed into the pipeline!
Revised hardware
Solution: keep the register number and pass it along to the last stage
⇒ 5 additional bits for each of the last 3 pipeline registers

[Figure: revised pipelined datapath - the write-register number is carried through the ID/EX, EX/MEM and MEM/WB registers and fed back to the register file's write-register input.]
Control for pipelined MIPS processor
General Approach:
In stage ID, create all control signals which are needed for an
instruction in subsequent stages (EX, MEM, WB) and store them in the
ID/EX pipeline register.
Then, in each clock cycle hand over control signals to the next stage
using the corresponding pipeline registers.
Which signals are required in which stage?
We can divide the control signals into 5 groups corresponding to the
pipeline stages where they are needed.
1. Instruction fetch:
Instruction memory is read and PC is written in every clock cycle ⇒ no control
signals required!
2. Instruction decode / register file read:
The same operations are performed in every clock cycle ⇒ no control signals
required!
3. Execute / address calculation:
ALUop and ALUsrc (as described in Chapter 5), RegDst (selects rd or rt as target)
4. Memory access:
MemRead and MemWrite (control data memory): set by lw, sw
Branch (PC will be reloaded if condition is fulfilled): set by beq
PCsrc is determined from Branch and zero (from ALU, condition is fulfilled if set)
5. Write back:
MemtoReg (send either ALU result or memory value to register file)
RegWrite (register file write enable)
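The hand-over of control signals through the pipeline registers can be sketched as follows. This is illustrative Python, not hardware: the signal names follow the slide, while the dictionary structure and the two chosen opcodes are assumptions.

```python
def control_signals(opcode):
    # Signal groups produced in the ID stage (a sketch covering two opcodes;
    # the values for lw and beq follow the description in the text).
    table = {
        "lw":  {"EX":  {"ALUSrc": 1, "ALUOp": "add", "RegDst": 0},
                "MEM": {"MemRead": 1, "MemWrite": 0, "Branch": 0},
                "WB":  {"MemtoReg": 1, "RegWrite": 1}},
        "beq": {"EX":  {"ALUSrc": 0, "ALUOp": "sub", "RegDst": 0},
                "MEM": {"MemRead": 0, "MemWrite": 0, "Branch": 1},
                "WB":  {"MemtoReg": 0, "RegWrite": 0}},
    }
    return table[opcode]

# Hand-over: each clock cycle the bundle moves to the next pipeline register,
# dropping the group that was consumed in the stage just passed.
id_ex = control_signals("lw")                  # all three groups in ID/EX
ex_mem = {k: id_ex[k] for k in ("MEM", "WB")}  # EX group consumed
mem_wb = {k: ex_mem[k] for k in ("WB",)}       # MEM group consumed

print(mem_wb["WB"]["RegWrite"])  # 1 - lw writes the register file
```

The point of the sketch is the shrinking bundle: ID/EX carries EX+MEM+WB, EX/MEM carries MEM+WB, and MEM/WB carries only the WB group, exactly as the general approach above describes.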
Pipelined MIPS datapath and control
[Figure: pipelined MIPS datapath with control - the control unit in ID generates the EX, MEM and WB signal groups, which are handed on through the ID/EX, EX/MEM and MEM/WB registers; the ALU control, the RegDst multiplexer and the PCSrc logic are driven from these groups.]
Example

Consider the following program, assuming the initial register contents
$1 = 23, $2 = 10, $3 = 3, $5 = 7, $6 = 3:

sub $2, $1, $3     # $2 = 23 - 3 = 20
and $12, $2, $5    # $12 = 20 and 7 = 4
or  $13, $6, $2    # $13 = 3 or 20 = 23
add $14, $2, $2    # $14 = 20 + 20 = 40
sw  $15, 100($2)   # store $15 to address 100 + 20
Data dependences and hazards

[Figure: pipeline diagram of the example program. sub $2, $1, $3 writes $2 to the register file only in cycle 5, so the following instructions read the old value $2 = 10: $12 = 10 and 7 = 2, $13 = 3 or 10 = 11, $14 = 10 + 10 = 20. Data dependence leading to error (hazard)!]

Consider in the following only data hazards for register-register-type instructions.
Dependences
Consider an instruction a that precedes an instruction b in program order:
A data dependence between a and b occurs when a writes into a register
that will be read by b.
An antidependence between a and b occurs when b writes into a register
that is read by a.
An output dependence between a and b occurs when both a and b write into the same register.
A data hazard is created whenever the overlapping (pipelined) execution
of a and b would change the order of access to the operands which are
involved in the dependency.
Data hazards
Consider an instruction a that precedes an instruction b in program order:
Depending on the type of the dependence between a and b the following
hazards may occur:
RAW: read after write
b reads a source before a writes it, so b incorrectly gets the old value.
WAR: write after read
b writes an operand before it is read by a, so a incorrectly gets the new
value.
WAW: write after write
b writes an operand before it is written by a, leaving the wrong result in
the target register.
In the following we consider only data hazards for R-type instructions
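The three hazard types can be detected mechanically by comparing register numbers. A small sketch in plain Python; the tuple representation of an instruction is an assumption for illustration, not MIPS syntax:

```python
def hazards(a, b):
    """Classify hazards between instruction a and the later instruction b.

    Each instruction is a tuple (dest, sources) of register names.
    Returns the subset of {"RAW", "WAR", "WAW"} that applies.
    """
    a_dest, a_srcs = a
    b_dest, b_srcs = b
    found = set()
    if a_dest in b_srcs:   # b reads what a writes -> read after write
        found.add("RAW")
    if b_dest in a_srcs:   # b writes what a reads -> write after read
        found.add("WAR")
    if a_dest == b_dest:   # both write the same register -> write after write
        found.add("WAW")
    return found

# sub $2, $1, $3 followed by and $12, $2, $5: a RAW hazard on $2.
print(hazards(("$2", ["$1", "$3"]), ("$12", ["$2", "$5"])))  # {'RAW'}
```

Note that these are dependences in the program; whether they become hazards depends on the pipeline overlap, as stated above.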
Software solution for resolving data hazards
Compiler resolves all data hazards:
• Test machine language program for potential data hazards
• Eliminate them by inserting NOP – instructions (no operation)
Example:
sub $2, $1, $3
nop
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Modern processors are able to detect data hazards during program
execution by analyzing the register numbers of the instructions using
additional control logic!
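The compiler pass described above can be sketched as follows. Illustrative Python under a stated assumption: a result becomes visible only three instructions after its producer (no forwarding), matching the three NOPs in the example.

```python
NOP = ("nop", None, [])
DELAY = 3  # a write becomes readable 3 instructions later (no forwarding)

def insert_nops(program):
    """program: list of (name, dest, sources). Returns a NOP-padded list."""
    out = []
    for instr in program:
        _, _, sources = instr
        # Look at the last DELAY emitted instructions, nearest first; if one
        # of them writes one of our sources, pad until it is DELAY back.
        for back, prev in enumerate(reversed(out[-DELAY:]), start=1):
            if prev[1] is not None and prev[1] in sources:
                out.extend([NOP] * (DELAY - back + 1))
                break
        out.append(instr)
    return out

prog = [
    ("sub", "$2", ["$1", "$3"]),
    ("and", "$12", ["$2", "$5"]),
    ("or",  "$13", ["$6", "$2"]),
    ("add", "$14", ["$2", "$2"]),
    ("sw",  None,  ["$15", "$2"]),
]
padded = insert_nops(prog)
print(sum(1 for i in padded if i[0] == "nop"))  # 3 NOPs, all after sub
```

Only the and instruction triggers padding: once three NOPs separate it from sub, the later readers of $2 are already far enough away.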
Hardware solution for resolving data hazards
[Figure: pipeline diagram of the example program - the value of $2 produced by sub is already available in the EX/MEM pipeline register at the end of cycle 3, before it is written back to the register file.]
Data required by subsequent instructions exists already in pipeline register!
Register file: If a register is read and written in the same clock cycle, send
new data to data output!
MIPS datapath using forwarding

Forwarding:
The ALU may read its operands from each of the pipeline registers. The correct operands are selected by multiplexers that are controlled by an additional control unit: the forwarding unit.

The forwarding unit gets as input:
• the register operand numbers of the instruction in the EX stage
• the target register numbers of the instructions in the MEM and WB stages
• control signals indicating the type of the instructions in the MEM and WB stages

Register numbers are stored and moved forward in the pipeline registers.

For reasons of clarity the hardware structure shown on the following slide has been simplified: the adder for branch target calculation, the ALU input for address calculation and the address input of the data memory are missing.
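The selection logic of the forwarding unit boils down to two priority checks per ALU input. A sketch in plain Python; the register-field names follow the slide, while the function shape and return strings are assumptions:

```python
def forward_a(ex_rs, ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite):
    """Select the source of the ALU's first operand in the EX stage.

    Returns "EX/MEM" or "MEM/WB" when a newer value must be forwarded,
    otherwise "ID/EX" (the value read from the register file).
    """
    # EX hazard: the instruction one ahead writes the register we read.
    # Register 0 is excluded because $0 is hard-wired to zero in MIPS.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == ex_rs:
        return "EX/MEM"
    # MEM hazard: the instruction two ahead writes it (lower priority,
    # because the nearer instruction holds the more recent value).
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == ex_rs:
        return "MEM/WB"
    return "ID/EX"

# sub $2,... directly followed by and $12, $2, $5:
# $2 is forwarded from the EX/MEM register.
print(forward_a(2, 2, True, 0, False))  # EX/MEM
```

Checking EX/MEM before MEM/WB implements the priority mentioned above: when both older instructions write the same register, the newer result wins.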
MIPS datapath using forwarding
[Figure: MIPS datapath with forwarding - additional multiplexers in front of the ALU select between the ID/EX register values and results forwarded from EX/MEM and MEM/WB; the forwarding unit compares EX/MEM.RegisterRd and MEM/WB.RegisterRd with the source register numbers of the instruction in EX.]
• Data hazards may be resolved if the operands read by the instruction in the EX stage are already stored in one of the pipeline registers!
• Now consider the following program:

lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

The and instruction requires $2 at the beginning of the 3rd stage (4th cycle), BUT the value for $2 is stored in a pipeline register only at the end of stage 4 of lw (4th cycle)
⇒ this hazard cannot be resolved by forwarding!

We have to stall the pipeline for combinations of a load followed by an instruction that reads its result!
Additional hardware for detecting hazards and stalling the pipeline: the hazard detection unit.
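The load-use check performed by the hazard detection unit in the ID stage is a single comparison. A sketch in plain Python; the field names follow the textbook-style pipeline register notation used above and are assumptions here:

```python
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use hazard check in the ID stage.

    Stall when the instruction in EX is a load (MemRead set) and its
    target register is a source of the instruction currently in ID.
    """
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) in EX while and $4, $2, $5 is in ID: stall one cycle.
print(must_stall(True, 2, 2, 5))   # True
# lw $2 followed by slt $1, $6, $7: no stall needed.
print(must_stall(True, 2, 6, 7))   # False
```

When the check fires, the hardware keeps PC and IF/ID unchanged and injects a bubble, as described on the next slide.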
Illustration
[Figure: pipeline diagram of the load-use sequence - lw $2, 20($1) delivers $2 only at the end of its MEM stage (cycle 4), too late to forward it to the EX stage of and $4, $2, $5 in the same cycle.]
Stalling the pipeline
[Figure: pipeline diagram with a one-cycle stall (bubble) between lw $2, 20($1) and and $4, $2, $5; the dependent instruction repeats its ID stage while the bubble moves down the pipeline.]

Stalling the pipeline means to repeat all actions from the previous clock cycle in the corresponding stages.
PC and IF/ID register must be prevented from being overwritten.
Control Hazards
Consider the following program:
beq $1, $3, L0 # PC relative addressing
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
...
L0: lw $4, 50($14)
Efficient pipelining: one instruction is fetched at every clock cycle
BUT: Which instruction has to be executed after the branch?
Control (or branch) hazard:
We start executing instructions before we know whether they really belong to the program flow!
Strategy: assume branch not taken
For every branch we assume that it is not taken and begin executing the subsequent instructions at $pc+4, $pc+8 and $pc+12. (The ALU is used to compute the branch address.)

[Figure: pipeline diagram - beq $1, $3, 7 is followed into the pipeline by and $12, $2, $5, or $13, $6, $2 and add $14, $2, $2 before the branch outcome is known.]
Assume Branch not taken (continued)
However, if the branch is taken we have to discard all instructions from the pipeline!
[Figure: the branch is taken - the three instructions fetched after beq are discarded from the pipeline, and lw $4, 50($7) is fetched from the branch target in cycle 5.]
In CC5 all data calculated in the stages ID, EX and MEM have to be marked as invalid!
⇒ Set control signals for writing of memory and register file to zero
Reducing the delay of branches
The earlier we know whether a branch will be taken, the fewer
instructions need to be flushed from the pipeline!
1. Calculate the branch target address in the ID stage using a separate adder.
2. Test the branch condition in the ID stage using an additional comparator. The comparator is faster than the ALU, so it can be integrated into the ID phase!
⇒ Only one instruction needs to be flushed.

[Figure: an equality comparator (sketched here with 8 bits) tests a = b.]
[Figure: pipelined datapath with reduced branch delay - a separate adder computes the branch target and a comparator (= ?) tests the branch condition, both in the ID stage; PCSrc can redirect the PC already at the end of ID.]
Delayed Branches
Delayed branching:
No instructions are flushed from the pipeline. An instruction
following immediately after a branch is always executed.
Programming strategy:
Place an instruction that originally preceded the branch and is not affected by it immediately after the branch (the branch delay slot).
If no suitable instruction is found, place a NOP there.
Typically the compiler/assembler will fill about 50% of all delay slots with
useful instructions.
Processors with several functional units
The times required for executing two arithmetic instructions may differ
significantly depending on the type of the instruction:
• Integer addition faster than floating point addition
• Addition much faster than multiplication/division
Making the cycle time long enough so that the slowest instruction can be
executed in one cycle would slow down the processor dramatically!
Solution:
• Distribute the EX stage of complex operations over several clock cycles
• Use several functional units in the EX stage
⇒ Allows several instructions to be executed in parallel!
Extending the MIPS pipeline to handle multicycle
floating point operations
MIPS implementation with floating point (FP) instructions (MIPS R4000):
• 1 Integer unit: used for load/store, integer ALU operations and branches
• 1 Multiplier for integer and FP numbers
• 1 Adder for FP addition and subtraction
• 1 Divider for FP and integer numbers
Extended MIPS pipeline

MIPS pipeline with multiple functional units (FUs):

[Figure: IF and ID are shared; the EX stage is replicated as an integer unit, an FP/integer multiplier, an FP adder and an FP divider, all feeding into the common MEM and WB stages.]

FU    Execution time (cycles)   Structure
INT   1                         not pipelined
MUL   7                         pipelined
ADD   4                         pipelined
DIV   25                        not pipelined

Out-of-order completion possible!
[Figure: the FP/integer multiplier is pipelined into stages M1-M7 and the FP adder into stages A1-A4; the FP/integer divider (DIV) is not pipelined; the integer EX stage, MEM and WB complete the pipeline.]
Extended MIPS pipeline
Separate register file for storing FP operands:
• FP registers f0 – f31
• FP instructions operate on FP registers
• Integer instructions operate on integer registers
• Exception: FP load/store: address in integer register, data in FP register
+ no increase in number of bits needed for addressing registers
+ simplifies hazard detection
+ read/write integer and FP operands at the same time
+ no increase in complexity of multiplexers/decoders (speed!)
- Additional moves are necessary for copying data from FP registers to integer registers and vice versa
• FP operands may be 32 or 64 bit wide
One 64 bit operand occupies a pair of FP registers (e.g. f0 and f1)
64 bit path from/to memory to speed up double precision load/store
Structural Hazard: functional unit
Example: floating point operations (the .d suffix indicates 64-bit floating point operations)

Div.d $f0, $f2, $f4
Mul.d $f4, $f6, $f4
Div.d $f8, $f8, $f14
Add.d $f10, $f4, $f8

Cycle               1  2  3   4  5  6  7  8  9  10  11  12
Div.d $f0,$f2,$f4   IF ID DIV ----------------------------------
Mul.d $f4,$f6,$f4      IF ID  M1 M2 M3 M4 M5 M6 M7  MEM WB
Div.d $f8,$f8,$f14        IF  ID stall...
Add.d $f10,$f4,$f8            IF stall...

Functional units which are not pipelined and which require more than one clock cycle for execution may create structural hazards! The instruction has to be stalled in the ID stage!
Structural Hazard: write back

Example:

Cycle               1  2  3  4  5   6   7  8  9  10  11
Mul.d $f0,$f4,$f6   IF ID M1 M2 M3  M4  M5 M6 M7 MEM WB
Add   $r0,$r2,$r3      IF ID EX MEM WB
Add   $r3,$r0,$r0         IF ID EX  MEM WB
Add.d $f2,$f4,$f6            IF ID  A1  A2 A3 A4 MEM WB
Sw    $r3,0($r2)                IF  ID  EX MEM WB
Sw    $r0,4($r2)                    IF  ID EX MEM WB
L.d   $f2,0($r2)                        IF ID EX MEM WB

Structural hazard: three instructions (Mul.d, Add.d and L.d) wish to write their results to the FP register file in cycle 11!

Solution: track the use of the write port of the register file in the ID stage by using a shift register. If a structural hazard would occur, the instruction in the ID stage is stalled for one cycle.
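The shift-register tracking of the write port can be sketched as follows. Illustrative Python; the bit-vector representation and the class interface are assumptions:

```python
class WritePortTracker:
    """One bit per future cycle: 1 = FP write port already reserved."""

    def __init__(self, horizon=16):
        self.reserved = [0] * horizon

    def tick(self):
        # Shift left each clock cycle: slot 0 is the current cycle.
        self.reserved = self.reserved[1:] + [0]

    def try_reserve(self, cycles_until_wb):
        """Reserve the write port; return False (stall) if it is taken."""
        if self.reserved[cycles_until_wb]:
            return False
        self.reserved[cycles_until_wb] = 1
        return True

tracker = WritePortTracker()
print(tracker.try_reserve(6))  # True  - first instruction books its WB slot
print(tracker.try_reserve(6))  # False - same target cycle: stall in ID
tracker.tick()                 # one cycle later the reservation is 5 ahead
print(tracker.try_reserve(6))  # True  - the stalled instruction retries
```

After the stall, the retrying instruction's write back naturally lands one cycle after the first reservation, which is exactly the resolved schedule shown on the next slide.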
Example: resolved structural hazard

Cycle               1  2  3  4  5   6     7  8  9  10  11  12
Mul.d $f0,$f4,$f6   IF ID M1 M2 M3  M4    M5 M6 M7 MEM WB
Add   $r0,$r2,$r3      IF ID EX MEM WB
Add   $r3,$r0,$r0         IF ID EX  MEM   WB
Add.d $f2,$f4,$f6            IF ID  stall A1 A2 A3 A4  MEM WB
Sw    $r3,0($r2)                IF  stall ID EX MEM WB
Sw    $r0,4($r2)                          IF ID EX MEM WB
L.d   $f2,0($r2)                             IF ID stall EX MEM WB
WAW Hazards

Example:

Cycle               1  2  3  4  5   6  7  8  9  10 11
Mul.d $f0,$f4,$f6   IF ID M1 M2 M3  M4 M5 M6 M7 MEM WB
Add   $r0,$r2,$r3      IF ID EX MEM WB
Add.d $f0,$f4,$f6         IF ID A1  A2 A3 A4 MEM WB

WAW hazard: Add.d writes $f0 before Mul.d does. Out-of-order completion may lead to WAW hazards!

Solution: stall the Add.d instruction in the ID stage

Cycle               1  2  3  4  5   6     7     8  9  10 11  12
Mul.d $f0,$f4,$f6   IF ID M1 M2 M3  M4    M5    M6 M7 MEM WB
Add   $r0,$r2,$r3      IF ID EX MEM WB
Add.d $f0,$f4,$f6         IF ID     stall stall A1 A2 A3  A4  MEM WB

⇒ The hazard detection logic detects all hazards in the ID stage and resolves them by stalling the corresponding instruction.
Extended MIPS pipeline
Instruction execution:
1. Fetch
2. Decode:
1. Check for structural hazards: Wait until the required FU is not busy and make sure
the register write port is available when it will be needed
2. Check for RAW data hazards: Wait until source registers are not listed as
destination register of any instruction in M1-M6, A1 – A3, DIV or a load in EX
Optimization: e.g. if the division is in its final clock cycle, its result may be forwarded to the requesting FU in the following cycle.
3. Check for WAW data hazards: Determine if any instruction in A1 - A4, M1 - M7,
DIV, has the same destination as this instruction. If so, stall instruction for the
number of clock cycles being necessary
Simplification: Since WAW hazards are rare, stall instruction until no other
instruction in the pipeline has the same destination
3. Execute
4. Memory Access
5. Write Back
Dynamic Branch Prediction
Assume branch not taken is a crude form of branch prediction.
Typically it fails in 50% of all cases.
In processors with multiple functional units deep pipelines are used. This
may lead to large branch delays if a branch is predicted the wrong way!
⇒ we need more accurate methods for predicting branches!
Idea: dynamic branch prediction
predict branches using the program’s past behaviour.
Branch prediction buffer or branch history table
Small memory addressed by the lower bits of the instruction address,
contains a flag indicating whether the branch has been taken or not.
This flag is set or reset at each branch.
For loops the hit rate may be improved by using two bits for branch prediction: a prediction must be wrong twice before it is changed.

[Figure: 2-bit prediction scheme - the states 11 and 10 predict taken, 01 and 00 predict not taken; a correct prediction strengthens the current state, a wrong prediction moves one state toward the opposite prediction.]
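The 2-bit scheme is a small saturating counter per branch. A sketch in plain Python; the encoding follows the state diagram (3 = 11 and 2 = 10 predict taken, 1 = 01 and 0 = 00 predict not taken), the class interface is an assumption:

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: states 3,2 predict taken; 1,0 not taken."""

    def __init__(self, state=3):
        self.state = state  # start in "strongly taken" (11)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]   # a loop branch taken 9 times, then exit
hits = sum(1 for taken in outcomes
           if (p.predict() == taken) or p.update(taken))
```

With a single prediction bit the loop-exit mispredict would flip the stored bit, causing a second mispredict on the next entry into the loop; the 2-bit counter only weakens from "strongly taken" to "weakly taken" and keeps predicting taken.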
Branch Target Buffer
Observation: the target address (calculated from PC and offset) for a particular
branch remains constant during program execution.
Idea: store the branch target addresses in a lookup table: branch target buffer
[Figure: branch target buffer - a table indexed by the PC holding the addresses of branch instructions, their branch target addresses and a predicted taken/untaken flag; a comparator (= ?) checks whether the current PC matches a stored branch address and steers the fetch control.]
A branch target buffer in combination with correct branch prediction allows branches to be executed without stalling the pipeline!
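A branch target buffer can be sketched as a small direct-mapped table indexed by the low PC bits. Illustrative Python; the table size, entry layout and method names are assumptions:

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: low PC bits select the entry; tag = full branch PC."""

    def __init__(self, size=256):
        self.size = size
        self.entries = {}  # index -> (branch_pc, target_pc, predict_taken)

    def lookup(self, pc):
        entry = self.entries.get(pc % self.size)
        if entry and entry[0] == pc:      # the "= ?" comparator in the figure
            _, target, predict_taken = entry
            return target if predict_taken else pc + 4
        return pc + 4                     # not a known branch: fall through

    def update(self, pc, target, taken):
        # After the branch resolves, remember its target and outcome.
        self.entries[pc % self.size] = (pc, target, taken)

btb = BranchTargetBuffer()
btb.update(0x40, target=0x100, taken=True)
print(hex(btb.lookup(0x40)))  # 0x100 - fetch continues at the stored target
print(hex(btb.lookup(0x44)))  # 0x48  - no entry, fetch pc + 4
```

Because the lookup needs only the PC, it can run during IF, which is what lets a correctly predicted branch cost zero stall cycles.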
Dynamic Scheduling
Static scheduling:
Execution is started in the order in which the instructions have been fetched.
(e.g., in the order which the compiler has determined).
If a data dependence occurs that cannot be resolved by forwarding, the pipeline is stalled (starting with the instruction that waits for a result). No new instructions are fetched until the dependence is cleared.
Idea: the hardware rearranges instruction execution dynamically to reduce stalls
⇒ Dynamic Scheduling
Dynamic Scheduling takes structural hazards and data hazards into consideration!
To avoid that an instruction stalled by a data hazard delays all subsequent instructions, the ID stage is split into two stages:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards remain, then read operands
Leads to out-of-order execution and out-of-order completion
Out-of-order execution
Out-of-order execution may lead to WAR hazards!
Example: Floating point operations
Div.d $f0, $f2, $f4
Add.d $f10,$f0, $f8
Mul.d $f8, $f8, $f14
Add.d needs to be stalled because of RAW hazard.
Mul.d may be started,
BUT: if mul.d completes before add.d reads its operands, add.d
will read the wrong value in f8!
The control logic deciding when an instruction is executed has to detect
and resolve hazards!
Score Board
Dynamic Scheduling with a Score Board
Goal: maintain an execution rate of one instruction per clock cycle by
executing an instruction as early as possible
If an instruction needs to be stalled because of a data hazard other
instructions can be issued and executed.
⇒ We have to analyze the program flow for hazards!
Scoreboard:
• Detects structural hazards and data hazards
• Determines when an instruction may read operands and when it is
executed
• Determines when an instruction can write its result into the destination
register
Dynamic Scheduling with a Score Board
In the following we will consider dynamic scheduling only for arithmetic
instructions – no MEM access-phase necessary.
4 stages (replace ID, EX, WB stage of standard MIPS pipeline):
1. Issue: If
• a functional unit (FU) for the instruction is free (resolve structural hazards)
and
• no other active instruction has the same destination register (resolve WAW hazards)
the score board issues the instruction to the FU and updates its internal data
structure.
If a hazard exists, the issue stage stalls. Subsequent instructions are written into
a buffer between instruction fetch and issue. If this buffer is filled then the
instruction fetch stage stalls.
2. Read operands: When all operands are available the score board tells the FU to
read its operands and to begin execution (may lead to out of order execution).
A source operand is available when no active instruction issued earlier is going
to write it (resolve RAW hazards).
Dynamic Scheduling with a Score Board
3. Execution: The FU executes the instruction (may take several clock
cycles). When the result is ready the FU notifies the scoreboard that it has
completed execution.
4. Write result: When an FU announces the completion of an execution the
scoreboard checks for WAR hazards. If no such hazard exists the result
can be written to the destination register. A WAR hazard occurs when
there is an instruction preceding the completing instruction that
• has not read its operands yet and
• one of these operands is the same register as the destination register of
the completing instruction.
Score Boarding does not use forwarding!
If no WAR hazard occurs the result is written to the destination register
during the clock cycle following the execution. (we do not have to wait for
a statically assigned WB stage that may be several cycles away).
Example

A MIPS processor with dynamic scheduling using a score board, with the following functional units (not pipelined) in the datapath:
- 1 integer unit: for load/store, integer ALU operations and branches
- 2 multipliers for FP numbers
- 1 adder for FP addition/subtraction
- 1 divider for FP numbers

MIPS program with floating point instructions (64 bit):

L.d   $f6, 34($r2)
L.d   $f2, 45($r3)
Mul.d $f0, $f2, $f4
Sub.d $f8, $f2, $f6
Div.d $f10, $f0, $f6
Add.d $f6, $f8, $f2

Assumptions: the EX phase for double precision takes 2 cycles for load and add, 10 cycles for mult, and 40 cycles for div.
MIPS with a Score Board
[Figure: MIPS with a score board - the score board exchanges control/status information with the functional units (integer unit, FP adder, two FP multipliers, FP divider), which are connected to the register file by data busses.]
Components of the Score Board

The score board consists of three parts containing the following data:

1. Instruction status: indicates for each instruction which of the four steps the instruction is in.
2. FU status: indicates for each FU its state:
   busy: FU busy or not
   Op: operation to perform (e.g. add or subtract)
   fi: destination register
   fj, fk: source registers
   Qj, Qk: functional units writing the source registers fj and fk
   Rj, Rk: flags indicating whether fj and fk are ready to be read but have not been read yet; set to "no" after the operands have been read.
3. Result register status: indicates for each register whether a FU is going to write it and which FU this will be.
Components of the Score Board (example snapshot)

Instruction status:
Instruction            Issue  Read operands  Execution complete  Write result
L.d   $f6, 34($r2)     √      √              √                   √
L.d   $f2, 45($r3)     √      √              √
Mul.d $f0, $f2, $f4    √
Sub.d $f8, $f2, $f6    √
Div.d $f10, $f0, $f6   √
Add.d $f6, $f8, $f2

Functional unit status:
Name     Busy  Op    fi   fj   fk   Qj       Qk   Rj   Rk
integer  yes   load  f2   r3        0             no
mult1    yes   mult  f0   f2   f4   integer  0    no   yes
mult2    no
add      yes   sub   f8   f2   f6   integer  0    no   yes
divide   yes   div   f10  f0   f6   mult1    0    no   yes

Result register status:
      f0     f2       f4  f6  f8   f10     f12 ... f30
FU    mult1  integer  0   0   add  divide  0       0

(double precision floating point numbers ⇒ each occupies a pair of 32-bit registers)
Bookkeeping in the Score Board
When an instruction has passed through one step the score board is updated.
FU: FU used by instruction     fi[FU], fj[FU], fk[FU]: destination/source registers of FU
d: destination register        Rj[FU], Rk[FU]: s1, s2 ready?
s1, s2: source registers       Qj[FU], Qk[FU]: FUs producing s1 and s2
op: type of operation          Result[d]: FU that will write register d
                               Op[FU]: operation which FU will execute
Slide 6-58

Instruction status    Wait until                        Bookkeeping
Issue                 Busy[FU] = no and Result[d] = 0   Busy[FU] := yes; Op[FU] := op; Result[d] := FU;
                      (no other FU has d as             fi[FU] := d; fj[FU] := s1; fk[FU] := s2;
                      destination register)             Qj := Result[s1]; Qk := Result[s2];
                                                        if Qj = 0 then Rj := yes else Rj := no;
                                                        if Qk = 0 then Rk := yes else Rk := no
Read operands         Rj = yes and Rk = yes             Rj := no; Rk := no; Qj := 0; Qk := 0
Execution complete    Functional unit done
Write results         for all FUs f:                    ∀f(if Qj[f] = FU then Rj[f] := yes);
                      (fj[f] ≠ fi[FU] or Rj[f] = no)    ∀f(if Qk[f] = FU then Rk[f] := yes);
                      and                               Result[fi[FU]] := 0; Busy[FU] := no
                      (fk[f] ≠ fi[FU] or Rk[f] = no)
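The Issue row of this table can be sketched as a small Python model (a simplified software sketch of the hardware bookkeeping, not part of the slides; dictionary keys mirror the field names above, and None plays the role of the zero value):

```python
def make_scoreboard(fus):
    """One entry per FU for each field of the FU status table."""
    sb = {k: {fu: None for fu in fus}
          for k in ("op", "fi", "fj", "fk", "qj", "qk", "rj", "rk")}
    sb["busy"] = {fu: False for fu in fus}
    sb["result"] = {}  # result register status: register -> producing FU
    return sb

def try_issue(sb, fu, op, d, s1, s2):
    """Issue step: proceed only when Busy[FU] = no and Result[d] = 0."""
    if sb["busy"][fu] or d in sb["result"]:
        return False  # structural hazard or pending WAW: stall
    sb["busy"][fu] = True
    sb["op"][fu] = op
    sb["fi"][fu], sb["fj"][fu], sb["fk"][fu] = d, s1, s2
    sb["qj"][fu] = sb["result"].get(s1)  # FU producing s1 (None = ready)
    sb["qk"][fu] = sb["result"].get(s2)
    sb["rj"][fu] = sb["qj"][fu] is None  # Rj := yes iff Qj = 0
    sb["rk"][fu] = sb["qk"][fu] is None
    sb["result"][d] = fu
    return True

# Mirror the snapshot on slide 6-57: the integer unit is loading f2,
# then Mul.d $f0, $f2, $f4 is issued to mult1.
sb = make_scoreboard(["integer", "mult1", "mult2", "add", "divide"])
sb["busy"]["integer"] = True
sb["result"]["f2"] = "integer"
try_issue(sb, "mult1", "mult", "f0", "f2", "f4")
```

After this issue, mult1 waits with Qj = integer and Rk = yes, matching the FU status table above.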
Instruction level parallelism
Bookkeeping in the Score Board
Comment for step write results:
∀f (fj[f] ≠ fi[FU] or Rj[f] = no)
“Rj[f] = no” means that the instruction which is now active at f will not read the
current contents of source register fj
a) either because the operation has already been executed and currently waits for
permission to write, or
b) because the required source operand must still be computed and the current
instruction is waiting for that.
Slide 6-59
In the first case register fj is overwritten since the previous contents are no longer
needed. In the second case the register is overwritten since this will provide the
expected operand.
Rj[f] = yes means that the instruction being active at f still requires the current
content of the register specified by fj[f].
Instruction level parallelism
Dynamic Scheduling: Tomasulo‘s Schema
Are there further possibilities for eliminating stalls resulting from hazards?
RAW hazard: No way - we have to wait until all operands are calculated!
WAR hazard and WAW hazard:
Example:
div.d $f0, $f2, $f4
add.d $f6, $f0, $f8     (RAW hazard for f0)
sub.d $f8, $f10, $f14   (WAR hazard for f8)
mul.d $f6, $f10, $f8    (WAW hazard for f6, RAW hazard for f8)
Slide 6-60
Idea: Register renaming
Rename destination registers of instructions in a way that prevents instructions
being executed out-of-order from overwriting operands being still required by
other instructions ⇒ Tomasulo‘s scheme or Tomasulo‘s algorithm
Observation: WAR and WAW hazard could have been avoided by compiler!
Instruction level parallelism
Register renaming
Example (continued):
Assume we have two temporary registers S and T.
replace f6 in add.d by a temporary register S and
replace f8 in sub.d and mul.d by a temporary register T:
div.d $f0, $f2, $f4
add.d $S, $f0, $f8
sub.d $T, $f10, $f14
mul.d $f6, $f10, $T
Slide 6-61
Replace target registers affected by a WAW or a WAR hazard by
temporary registers and modify subsequent instructions reading
these registers appropriately.
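This renaming rule can be sketched as a small Python pass (an SSA-style sketch that renames every destination to a fresh temporary, slightly more aggressive than the slide, which renames only the hazard-affected registers; T0, T1, … are hypothetical temporary names):

```python
def rename(instrs):
    """Give every destination a fresh temporary and make later reads use
    the newest name; WAR and WAW hazards disappear because no register
    is ever written twice."""
    mapping, out, n = {}, [], 0
    for op, dst, *srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]  # read the latest name
        tmp = f"T{n}"
        n += 1
        mapping[dst] = tmp  # later reads of dst now see tmp
        out.append((op, tmp, *srcs))
    return out

# The example from slide 6-60:
code = [("div.d", "f0", "f2", "f4"),
        ("add.d", "f6", "f0", "f8"),
        ("sub.d", "f8", "f10", "f14"),
        ("mul.d", "f6", "f10", "f8")]
renamed = rename(code)
```

As on the slide, sub.d and mul.d now communicate through a temporary, so sub.d can no longer overwrite an operand of add.d.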
Instruction level parallelism
Reservation Station
Temporary registers are part of reservation stations:
• Buffer the operands for instructions waiting for execution.
If an operand is not yet calculated the corresponding reservation station
contains the number of the reservation station which will deliver the result.
• Renaming of register numbers for pending operands to the names of the
reservation stations; this is done during instruction issue.
Slide 6-62
• Information about the availability of the operands stored in a reservation
station determines when the corresponding instruction can be executed.
• As results become available they are sent from the FUs directly to the
waiting reservation stations over the common data bus (CDB).
• When successive writes to a register overlap in execution, only the result of
the instruction issued last is used to update the register.
⇒ resolves WAR/WAW hazards
Instruction level parallelism
Tomasulo‘s algorithm
MIPS floating point unit using Tomasulo‘s algorithm
[Block diagram: an instruction queue (FIFO) is fed from the instruction unit;
FP operations are issued to reservation stations (3 in front of the FP adders,
2 in front of the FP multipliers/dividers), load/store operations to load
buffers and store buffers; an address unit connects to memory; operand buses
feed the FUs from the FP registers; all results are broadcast on the Common
Data Bus (CDB).]
Slide 6-63
Instruction level parallelism
Tomasulo‘s algorithm - stages
Steps in execution of an FP instruction:
1. Issue:
Get next instruction from the head of the instruction queue and issue it to a
matching reservation station that is empty.
Load/store buffers storing data/addresses coming from and going to memory
behave similarly to reservation stations for arithmetic units.
Slide 6-64
Operands available in registers?
yes: hand over values to reservation station
no: hand over names of those reservation stations that are calculating
the values
Buffering operands resolves WAR hazards!
If no matching reservation station is empty there is a structural hazard.
⇒ instruction stalls until a station is freed
Instruction level parallelism
Tomasulo‘s algorithm - stages
2. Execution
1. If one or more of the operands are not available, monitor the CDB.
2. If an operand becomes available place it in the waiting reservation station(s).
3. Wait until all operands for an instruction are available, then start execution
⇒ resolves RAW hazards
In case of stores: Execution may start (address calculation) even if the data to
be stored is not available yet. The address calculation unit is
occupied during address calculation only.
Slide 6-65
3. Write result
1. When the result is available, send it to the CDB.
2. From the CDB it is sent directly to waiting reservation stations (and store
buffers).
Only if the instruction is the last one issued that writes to a certain target
register is the result also written to the target register ⇒ avoids WAW hazards
Instruction level parallelism
Reservation stations
Each reservation station has the following fields:
Op: Type of the operation to perform (e.g. add or subtract)
Qj, Qk: Names of the reservation stations containing the instructions calculating
the operands. Zero values indicate that the operands are already available.
Vj, Vk: Values of the source operands
Busy: Flag indicating that this station/buffer is already occupied.
Slide 6-66
Each load/store buffer has an additional field:
A: Initially the immediate field of the address is stored there; after address
calculation the effective address is stored there.
For each register of the register file there is one field
Qi: Name of the reservation station containing the instruction issued last
that calculates the result for this register. A zero value indicates that
no active instruction is calculating a result for that register.
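These fields, and the CDB broadcast that fills them, can be modeled in a few lines of Python (a sketch with hypothetical names; the empty string plays the role of the zero value meaning "operand available", and only the Qi tags are tracked, not the register values themselves):

```python
from dataclasses import dataclass

@dataclass
class Station:
    busy: bool = False
    op: str = ""
    qj: str = ""   # producing station for operand j ("" = Vj valid)
    qk: str = ""
    vj: float = 0.0
    vk: float = 0.0

def broadcast(stations, qi, producer, value):
    """Write-result step: (producer, value) appears on the CDB and every
    snooping reservation station grabs it."""
    for st in stations.values():
        if st.qj == producer:
            st.vj, st.qj = value, ""
        if st.qk == producer:
            st.vk, st.qk = value, ""
    # Qi: only a register still naming this producer is updated; a
    # register renamed to a newer station ignores the broadcast.
    for reg, q in list(qi.items()):
        if q == producer:
            qi[reg] = ""

# Two stations wait for load2, as in the tables on slide 6-67:
stations = {"add1": Station(True, "sub", qj="load2", vk=2.0),
            "mult1": Station(True, "mul", qj="load2", vk=4.0)}
qi = {"f2": "load2", "f0": "mult1"}
broadcast(stations, qi, "load2", 1.5)
```

One broadcast wakes up every waiting station at once, which is exactly why a single CDB suffices.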
Instruction level parallelism
Tomasulo‘s method: information tables
Instruction status
Instruction              Issue  Execute  Write result
L.d $f6, 34($r2)           √       √         √
L.d $f2, 45($r3)           √       √
Mul.d $f0, $f2, $f4        √
Sub.d $f8, $f2, $f6        √
Div.d $f10, $f0, $f6       √
Add.d $f6, $f8, $f2        √
Slide 6-67

Reservation stations
Name   Busy  Op    Vj  Vk                Qj     Qk     A
load1  no
load2  yes   load                                      45+Regs[r3]
add1   yes   sub       Mem[34+Regs[r2]]  load2
add2   yes   add                         add1   load2
add3   no
mult1  yes   mul       Regs[f4]          load2
mult2  yes   div       Mem[34+Regs[r2]]  mult1

Register status
Register:  f0     f2     f4  f6    f8    f10    f12 … f30
Qi:        mult1  load2  0   add2  add1  mult2  0      0
Instruction level parallelism
Dynamic Scheduling: Data hazards through memory
A load and a store instruction may only be done in a different order if
they access different addresses! (RAW/WAR hazard!)
Two stores sharing the same data memory address may not be done in
different order! (WAW hazard!)
Load: read memory only if there is no uncompleted store which has
been issued earlier and which shares the same data memory
address with the load.
Slide 6-68
Store: write data only if there are no uncompleted loads and stores
issued earlier using the same data memory address as the
store.
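A minimal check for these two rules, assuming the effective addresses of all earlier uncompleted loads and stores are already known (hypothetical helpers, not from the slides):

```python
def load_may_read(addr, pending_stores):
    """A load may access memory only if no earlier uncompleted store
    targets the same address (RAW hazard through memory)."""
    return addr not in pending_stores

def store_may_write(addr, pending_loads, pending_stores):
    """A store may write only if no earlier uncompleted load (WAR) or
    earlier uncompleted store (WAW) uses the same address."""
    return addr not in pending_loads and addr not in pending_stores

# An earlier store to 0x100 is still pending:
load_may_read(0x100, {0x100})   # blocked by the pending store
load_may_read(0x108, {0x100})   # different address: may proceed
```

Real hardware performs this comparison against the address fields of all older load/store buffer entries.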
Instruction level parallelism
Dynamic Scheduling: Instructions following branches
It may take many clock cycles until we know whether a branch has been
predicted correctly or not!
1. Instructions issued after a branch may complete before the branch does.
⇒ The write-back stage of these instructions has to be stalled until we know
whether the prediction has been correct or not!
Slide 6-69
2. Exceptions:
We have to ensure that exactly the same exceptions are handled as in the
case where the pipeline had used in-order execution and no
branch prediction!
Simple solution:
Instructions following a branch are only issued. Execution starts only after
the branch prediction has turned out to be correct.
⇒ Can reduce the efficiency of a dynamically scheduled pipeline
dramatically!
Instruction level parallelism
Speculative execution
Write result stage is split into two stages:
3. Write results:
• Instructions are executed as operands become available. Results are written into
a reorder buffer (ROB).
• For each active instruction there is one entry in the ROB. Their order corresponds
to the order in which the instructions have been issued.
⇒ The head of the ROB contains the result of the active instruction
issued first.
Subsequent instructions can read their operands from the ROB.
Slide 6-70
• Writes going to register file and memory are delayed until branch predictions turn
out to be correct.
4. Commit:
• When an instruction that writes to memory or register file reaches the head of
the ROB its result is written. An exception is handled now if necessary!
• If the head of the ROB contains an incorrectly predicted branch the ROB is
flushed
⇒ results calculated by instructions following the branch are discarded!
The ROB restores the initial order of instructions: in-order commitment.
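In-order commitment with flushing can be sketched as follows (a toy ROB; the entry layout and the "mispredicted" marker are invented for illustration):

```python
from collections import deque

def commit(rob, regs):
    """Retire finished instructions strictly from the head of the ROB.
    A mispredicted branch at the head flushes all younger entries, so
    speculative results never reach the register file."""
    while rob:
        kind, dest, value, ready = rob[0]
        if not ready:
            break               # head still executing: commit stalls
        rob.popleft()
        if kind == "branch":
            if value == "mispredicted":
                rob.clear()     # discard results of younger instructions
                break
        elif kind == "reg":
            regs[dest] = value  # architectural write happens in order

rob = deque([("reg", "f6", 1.5, True),
             ("branch", None, "mispredicted", True),
             ("reg", "f8", 2.5, True)])  # speculative, must be discarded
regs = {}
commit(rob, regs)
```

The write to f8 was already sitting in the ROB, yet it never reaches `regs`: committing in issue order is what makes discarding speculative work safe.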
Instruction level parallelism
Speculative Execution
MIPS FP unit using Tomasulo‘s algorithm and reorder buffer
[Block diagram: as on slide 6-63, extended by a ROB between the CDB and the
FP registers; load/store operations use store-data, store-address and
load/store buffer paths; an instruction queue (FIFO) is fed from the
instruction unit; reservation stations (3 for the FP adders, 2 for the FP
multipliers/dividers) and the address unit connect to memory; results are
broadcast on the Common Data Bus (CDB).]
Slide 6-71
Instruction level parallelism
Multiple Issue Processor
Using multiple FUs, dynamic scheduling, branch prediction and
speculation makes it possible to achieve a CPI of nearly one.
CPI < 1 is not possible because we issue only one instruction per clock
cycle!
Further speedup:
Slide 6-72
Issue multiple instructions in one clock cycle (up to 8 in practice)
⇒ CPI < 1 possible!
The sets of instructions being issued in parallel are called
instruction packets or issue packets.
Instruction level parallelism
Multiple Issue Processors

Multiple Issue Processor:
• Superscalar processors: instruction packets generated by hardware;
dynamic scheduling (hardware)
• VLIW (very long instruction word) processors: instruction packets
generated by compiler; static scheduling (compiler)
Slide 6-73
Instruction level parallelism
Overview
Name           Issue    Hazard       Scheduling     Distinguishing    Examples
                        detection                   characteristics
superscalar    dynamic  hardware     static         in-order          Sun UltraSPARC
(static)                             (compiler)     execution         II/III
superscalar    dynamic  hardware     dynamic        out-of-order      IBM PowerPC
(dynamic)                                           execution
superscalar    dynamic  hardware     dynamic with   out-of-order      Pentium III/4,
(speculative)                        speculation    execution with    MIPS R10K,
                                                    speculation       Alpha 21264,
                                                                      HP PA 8500,
                                                                      IBM RS64III
VLIW           static   software     static         no hazards        Trimedia, i860
                        (compiler)   (compiler)     between issue
                                                    packets
Slide 6-74
Instruction level parallelism
Statically scheduled superscalar Processors
Example: dual-issue static superscalar processor
In one clock cycle we can issue
• one integer instruction (including load/store, branches, integer ALU
operations) and
• one arithmetic FP instruction
Slide 6-75
Only slight extensions of the hardware are necessary compared to a
single-issue implementation with two FUs.
Typical for high-end embedded processors.
Instruction level parallelism
Statically scheduled Dual Issue Pipeline
Instruction type       Pipeline stages
Integer instruction    IF  ID  EX  MEM WB
FP instruction         IF  ID  EX  EX  EX  WB
Integer instruction        IF  ID  EX  MEM WB
FP instruction             IF  ID  EX  EX  EX  WB
Integer instruction            IF  ID  EX  MEM WB
FP instruction                 IF  ID  EX  EX  EX  WB
Slide 6-76
CPI of 0.5 possible!
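The CPI limit of 0.5 follows directly from the timing: with one integer+FP pair entering per cycle and a 6-stage FP path, n pairs need n + 5 cycles, assuming no hazards or stalls (a quick sanity check in Python):

```python
def dual_issue_cpi(n_pairs, depth=6):
    """Cycles per instruction for n_pairs packets in the static
    dual-issue pipeline: one packet enters per cycle, plus depth - 1
    extra cycles to drain the last FP instruction (IF ID EX EX EX WB)."""
    cycles = n_pairs + depth - 1
    return cycles / (2 * n_pairs)

dual_issue_cpi(3)      # the 3 packets shown above
dual_issue_cpi(1000)   # approaches 0.5 for long hazard-free runs
```

The pipeline fill/drain overhead is amortized over the run, so the CPI only approaches 0.5 asymptotically.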
Instruction level parallelism
Multiple Issue Pipeline
In order to enable multiple issues per clock cycle we must be able to fetch
multiple instructions per cycle also!
Example: 4-way issue processor
Fetches instructions stored at PC, PC+4, PC+8, PC+12 from memory
⇒ Wide bus to instruction memory required!
Slide 6-77
Problem: what if one of these instructions is a branch?
1. Reading branch target buffer and accessing instruction memory in one
clock cycle would increase cycle time.
2. If n instructions of the packet are allowed to be a branch we would
have to look up n instructions in the branch target buffer in parallel!
Typical simplification: single issue for branches
Instruction level parallelism
Multiple issue with dynamic pipelining (Tomasulo)
Example: Superscalar processor with:
• Dual issue (single issue for branches)
• Dynamic Tomasulo scheduling (no speculation, i.e. execution of instructions following a
branch must be delayed until the branch condition is evaluated)
• One FP unit
• One FU for integer instructions, load/stores and branch condition testing
• Separate FU for branch address calculation
• Several reservation stations/load-store buffers for each FU: load/stores occupy the
FU only during address calculation, branches only during condition testing;
stores are allowed to execute even if the data to be stored is not available yet
Slide 6-78

Loop: l.d   $f0, 0($r1);    # f0 := array element
      add.d $f4, $f0, $f2;  # add f2 to f0
      s.d   $f4, 0($r1);    # store result
      addi  $r1, $r1, -8;   # decrement pointer
      bne   $r1, $r2, LOOP; # repeat loop if r1 ≠ r2

Latency: Number of cycles from the beginning of the execution step to the
moment when the result is available on the CDB
Integer operations: 1 cycle
Load: 2 cycles (1 in EX stage + 1 in MEM stage)
FP operation: 3 cycles (in EX stage)
Instruction level parallelism
Multiple issue with dynamic pipelining
Iteration  Instruction           Issues  Executes  Memory     Write   Comment
number                           at      at        access at  CDB at
1          L.d $f0, 0($r1)       1       2         3          4       First issue
1          Add.d $f4, $f0, $f2   1       5-7                  8       Wait for l.d
1          S.d $f4, 0($r1)       2       3         9                  Wait for add.d
1          addi $r1, $r1, -8     2       4                    5       Wait for ALU
1          bne $r1, $r2, LOOP    3       6                            Wait for addi
2          L.d $f0, 0($r1)       4       7         8          9       Wait for bne
2          Add.d $f4, $f0, $f2   4       10-12                13      Wait for l.d
2          S.d $f4, 0($r1)       5       8         14                 Wait for add.d
2          addi $r1, $r1, -8     5       9                    10      Wait for ALU
2          bne $r1, $r2, LOOP    6       11                           Wait for addi
3          L.d $f0, 0($r1)       7       12        13         14      Wait for bne
3          Add.d $f4, $f0, $f2   7       15-17                18      Wait for l.d
3          S.d $f4, 0($r1)       8       13        19                 Wait for add.d
3          addi $r1, $r1, -8     8       14                   15      Wait for ALU
3          bne $r1, $r2, LOOP    9       16                           Wait for addi
Slide 6-79
Instruction level parallelism
Resource usage
Clock  Integer unit  FP unit    Data memory  CDB
cycle
2      1 / l.d
3      1 / s.d                  1 / l.d
4      1 / addi                              1 / l.d
5                    1 / add.d               1 / addi
6      1 / bne       1 / add.d
7      2 / l.d       1 / add.d
8      2 / s.d                  2 / l.d      1 / add.d
9      2 / addi                 1 / s.d      2 / l.d
10                   2 / add.d               2 / addi
11     2 / bne       2 / add.d
12     3 / l.d       2 / add.d
13     3 / s.d                  3 / l.d      2 / add.d
14     3 / addi                 2 / s.d      3 / l.d
15                   3 / add.d               3 / addi
16     3 / bne       3 / add.d
17                   3 / add.d
18                                           3 / add.d
19                              3 / s.d
20
Slide 6-80
Instruction level parallelism
Example
CPI significantly greater than 0.5:
Problem: Integer unit used for memory address calculation, for
incrementing pointer and for condition test
⇒ branch execution is delayed by one cycle
Possible solution: additional integer FU
Slide 6-81
Problem: The execution step of an instruction following a branch has to
be delayed until the branch is executed
Possible solution: use speculative execution
Example: Dual-issue processor with speculative execution
In order to achieve a CPI < 1 we must allow two instructions to commit in
parallel!
⇒ More buses required
Instruction level parallelism
Compiler techniques
Observation: If branch prediction is perfect then loops are unrolled
automatically by the hardware. Operations that belong to
different iterations of the loop overlap.
Loops may be unrolled in advance by the compiler also!
⇒ Improves performance for processors without speculative execution
Loop before unrolling:
Loop: lw   $t0, 0($s1);
      add  $t0, $t0, $s2;
      sw   $t0, 0($s1);
      addi $s1, $s1, -4;
      bne  $s1, $zero, LOOP;
Slide 6-82
Loop after unrolling (register renaming done by the compiler!):
Loop: addi $s1, $s1, -16;
      lw   $t0, 16($s1);
      add  $t0, $t0, $s2;
      sw   $t0, 16($s1);
      lw   $t1, 12($s1);
      add  $t1, $t1, $s2;
      sw   $t1, 12($s1);
      lw   $t2, 8($s1);
      add  $t2, $t2, $s2;
      sw   $t2, 8($s1);
      lw   $t3, 4($s1);
      add  $t3, $t3, $s2;
      sw   $t3, 4($s1);
      bne  $s1, $zero, LOOP;
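The transformation above is mechanical enough to sketch as a generator (hypothetical helper, not part of the slides; it reproduces the 4-way unrolled body, with the single addi hoisted to the front and the offsets rewritten):

```python
def unroll(factor, step=4):
    """Unroll the add-$s2-to-each-element loop `factor` times: one
    pointer decrement up front, then lw/add/sw triples whose offsets
    address the elements relative to the already-decremented pointer."""
    body = [f"addi $s1, $s1, -{factor * step};"]
    for i in range(factor):
        off = (factor - i) * step  # 16, 12, 8, 4 for factor = 4
        t = f"$t{i}"               # a fresh register per copy of the body
        body += [f"lw {t}, {off}($s1);",
                 f"add {t}, {t}, $s2;",
                 f"sw {t}, {off}($s1);"]
    body.append("bne $s1, $zero, LOOP;")
    return body

for line in unroll(4):
    print(line)
```

Using a different register per copy of the body is exactly the compile-time register renaming the slide points out: without it the copies would be serialized by WAR/WAW hazards on $t0.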
Instruction level parallelism
Summary
Superscalar processors determine during program execution how many
instructions are issued in one clock cycle.
Statically scheduled:
• Must detect dependences in instruction packets and resolve them by
inserting stalls
Slide 6-83
• Needs assistance of the compiler for achieving a high degree of
parallelism.
• Simple hardware
Dynamically scheduled:
• Requires less assistance from the compiler
• Hardware is much more complex
Instruction level parallelism
Static Multiple Issue – VLIW approach
For highly superscalar processors the hardware becomes very complex.
Idea: let the compiler do as much work as possible!
VLIW approach: used for digital signal processing (DSP)
The compiler groups instructions with no dependences between them, which may
be executed in parallel, into a „very long instruction word“ (VLIW).
Slide 6-84
⇒ no hardware for hazard detection and scheduling necessary
Does the program contain enough parallelism?
The compiler has to find enough parallelism for using the full capacity of
all functional units!
local scheduling: scheduling inside lists of instructions without branches
(= basic blocks)
global scheduling: scheduling over several basic blocks
Instruction level parallelism
For VLIW processors one instruction must contain all operations that are
executed in parallel explicitly. Therefore VLIW processors are sometimes
also called EPICs (explicitly parallel instruction computers).
Example
Consider a VLIW processor with:
• 2 FUs for memory access (2 cycles for EX)
• 2 FUs for FP operations (pipelined, 3 cycles for EX)
• 1 FU for integer operations and branches (1 cycle)
Create a schedule for 7 iterations using loop
unrolling. Branches have zero latency.
Slide 6-85

Loop before unrolling:
Loop: lw.d  $f0, 0($r1);
      add.d $f4, $f0, $f2;
      sw.d  $f4, 0($r1);
      addi  $r1, $r1, -8;
      bne   $r1, $r2, LOOP;

After unrolling:
Loop: lw.d  $f0, 0($r1);
      add.d $f4, $f0, $f2;
      sw.d  $f4, 0($r1);
      lw.d  $f6, -8($r1);
      add.d $f8, $f6, $f2;
      sw.d  $f8, -8($r1);
      lw.d  $f10, -16($r1);
      add.d $f12, $f10, $f2;
      sw.d  $f12, -16($r1);
      lw.d  $f14, -24($r1);
      add.d $f16, $f14, $f2;
      sw.d  $f16, -24($r1);
      …
      addi  $r1, $r1, -56;
      bne   $r1, $r2, LOOP;
Instruction level parallelism
Static Multiple Issue – VLIW approach
Memory unit 1       Memory unit 2       FP unit 1           FP unit 2           Integer unit
lw.d $f0,0($r1)     lw.d $f6,-8($r1)
lw.d $f10,-16($r1)  lw.d $f14,-24($r1)
lw.d $f18,-32($r1)  lw.d $f22,-40($r1)  add $f4,$f0,$f2     add $f8,$f6,$f2
lw.d $f26,-48($r1)                      add $f12,$f10,$f2   add $f16,$f14,$f2
                                        add $f20,$f18,$f2   add $f24,$f22,$f2
sw.d $f4,0($r1)     sw.d $f8,-8($r1)    add $f28,$f26,$f2
sw.d $f12,-16($r1)  sw.d $f16,-24($r1)                                          addi $r1,$r1,-56
sw.d $f20,24($r1)   sw.d $f24,16($r1)
sw.d $f28,8($r1)                                                                bne $r1,$r2,Loop
Slide 6-86
Each row corresponds to a VLIW instruction
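The slot-assignment part of building such words can be sketched greedily (illustration only; it packs operations into words until a unit class runs out of slots and, unlike the hand schedule above, ignores latencies and data dependences):

```python
def pack_vliw(ops, slots=None):
    """ops is a list of (operation, unit_class) pairs; each long
    instruction word offers 2 memory, 2 FP and 1 integer slot."""
    slots = slots or {"mem": 2, "fp": 2, "int": 1}
    words, current, used = [], [], dict.fromkeys(slots, 0)
    for op, unit in ops:
        if used[unit] == slots[unit]:  # word is full for this unit class
            words.append(current)      # emit it and start the next word
            current, used = [], dict.fromkeys(slots, 0)
        current.append(op)
        used[unit] += 1
    if current:
        words.append(current)
    return words

ops = [("lw.d $f0,0($r1)", "mem"), ("lw.d $f6,-8($r1)", "mem"),
       ("lw.d $f10,-16($r1)", "mem"), ("add.d $f4,$f0,$f2", "fp"),
       ("addi $r1,$r1,-8", "int")]
words = pack_vliw(ops)
```

A real VLIW compiler combines this bundling with dependence and latency analysis, which is why the table above delays the adds by two cycles after their loads.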

More Related Content

What's hot

Chapter 4 The Processor
Chapter 4 The ProcessorChapter 4 The Processor
Chapter 4 The Processorguest4f73554
 
Complex instruction set computer ppt
Complex instruction set computer pptComplex instruction set computer ppt
Complex instruction set computer pptVenkatesh Pensalwar
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressHeiko Joerg Schick
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlowkaran saini
 
Pipelining, processors, risc and cisc
Pipelining, processors, risc and ciscPipelining, processors, risc and cisc
Pipelining, processors, risc and ciscMark Gibbs
 
Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) A B Shinde
 
pipeline and vector processing
pipeline and vector processingpipeline and vector processing
pipeline and vector processingAcad
 
Introduction to Assembly Language
Introduction to Assembly Language Introduction to Assembly Language
Introduction to Assembly Language ApekshaShinde6
 
Risc cisc Difference
Risc cisc DifferenceRisc cisc Difference
Risc cisc DifferenceSehrish Asif
 
Memory interleaving and superscalar processor
Memory interleaving and superscalar processorMemory interleaving and superscalar processor
Memory interleaving and superscalar processorsshwetasrivastava
 
INSTRUCTION PIPELINING
INSTRUCTION PIPELININGINSTRUCTION PIPELINING
INSTRUCTION PIPELININGrubysistec
 
Performance Characterization of the Pentium Pro Processor
Performance Characterization of the Pentium Pro ProcessorPerformance Characterization of the Pentium Pro Processor
Performance Characterization of the Pentium Pro ProcessorDileep Bhandarkar
 
Chapter 2 instructions language of the computer
Chapter 2 instructions language of the computerChapter 2 instructions language of the computer
Chapter 2 instructions language of the computerBATMUNHMUNHZAYA
 

What's hot (20)

Chapter 4 The Processor
Chapter 4 The ProcessorChapter 4 The Processor
Chapter 4 The Processor
 
Complex instruction set computer ppt
Complex instruction set computer pptComplex instruction set computer ppt
Complex instruction set computer ppt
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI Express
 
Risc and cisc eugene clewlow
Risc and cisc   eugene clewlowRisc and cisc   eugene clewlow
Risc and cisc eugene clewlow
 
Pipelining, processors, risc and cisc
Pipelining, processors, risc and ciscPipelining, processors, risc and cisc
Pipelining, processors, risc and cisc
 
Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism)
 
pipeline and vector processing
pipeline and vector processingpipeline and vector processing
pipeline and vector processing
 
Introduction to Assembly Language
Introduction to Assembly Language Introduction to Assembly Language
Introduction to Assembly Language
 
Array Processor
Array ProcessorArray Processor
Array Processor
 
Risc cisc Difference
Risc cisc DifferenceRisc cisc Difference
Risc cisc Difference
 
Memory interleaving and superscalar processor
Memory interleaving and superscalar processorMemory interleaving and superscalar processor
Memory interleaving and superscalar processor
 
INSTRUCTION PIPELINING
INSTRUCTION PIPELININGINSTRUCTION PIPELINING
INSTRUCTION PIPELINING
 
Different addressing mode and risc, cisc microprocessor
Different addressing mode and risc, cisc microprocessorDifferent addressing mode and risc, cisc microprocessor
Different addressing mode and risc, cisc microprocessor
 
pipelining
pipeliningpipelining
pipelining
 
Performance Characterization of the Pentium Pro Processor
Performance Characterization of the Pentium Pro ProcessorPerformance Characterization of the Pentium Pro Processor
Performance Characterization of the Pentium Pro Processor
 
Risc & cisk
Risc & ciskRisc & cisk
Risc & cisk
 
Unit 2
Unit 2Unit 2
Unit 2
 
DMA
DMADMA
DMA
 
Chapter 2 instructions language of the computer
Chapter 2 instructions language of the computerChapter 2 instructions language of the computer
Chapter 2 instructions language of the computer
 
Risc and cisc
Risc and ciscRisc and cisc
Risc and cisc
 

Similar to Arch 1112-6

Ch2 embedded processors-i
Ch2 embedded processors-iCh2 embedded processors-i
Ch2 embedded processors-iAnkit Shah
 
03. top level view of computer function &amp; interconnection
03. top level view of computer function &amp; interconnection03. top level view of computer function &amp; interconnection
03. top level view of computer function &amp; interconnectionnoman yasin
 
Pipeline hazard
Pipeline hazardPipeline hazard
Pipeline hazardAJAL A J
 
4bit pc report[cse 08-section-b2_group-02]
4bit pc report[cse 08-section-b2_group-02]4bit pc report[cse 08-section-b2_group-02]
4bit pc report[cse 08-section-b2_group-02]shibbirtanvin
 
4bit PC report
4bit PC report4bit PC report
4bit PC reporttanvin
 
Computer architecture instruction formats
Computer architecture instruction formatsComputer architecture instruction formats
Computer architecture instruction formatsMazin Alwaaly
 
multi cycle in microprocessor 8086 sy B-tech
multi cycle  in microprocessor 8086 sy B-techmulti cycle  in microprocessor 8086 sy B-tech
multi cycle in microprocessor 8086 sy B-techRushikeshThorat24
 
ELEC3300_09-DMA.pdf
ELEC3300_09-DMA.pdfELEC3300_09-DMA.pdf
ELEC3300_09-DMA.pdfKwunHokChong
 
IntroductionCPU performance factorsInstruction countDeterm.docx
IntroductionCPU performance factorsInstruction countDeterm.docxIntroductionCPU performance factorsInstruction countDeterm.docx
IntroductionCPU performance factorsInstruction countDeterm.docxnormanibarber20063
 
Operating system
Operating systemOperating system
Operating systemraj732723
 
Data Manipulation
Data ManipulationData Manipulation
Data ManipulationAsfi Bhai
 
Describr the features of pentium microppr
Describr the features of pentium micropprDescribr the features of pentium microppr
Describr the features of pentium microppredwardkiwalabye1
 

Similar to Arch 1112-6 (20)

Bc0040
Bc0040Bc0040
Bc0040
 
Ch2 embedded processors-i
Ch2 embedded processors-iCh2 embedded processors-i
Ch2 embedded processors-i
 
CA UNIT III.pptx
CA UNIT III.pptxCA UNIT III.pptx
CA UNIT III.pptx
 
CISC & RISC Architecture
CISC & RISC Architecture CISC & RISC Architecture
CISC & RISC Architecture
 
Unit 5-lecture 5
Unit 5-lecture 5Unit 5-lecture 5
Unit 5-lecture 5
 
03. top level view of computer function &amp; interconnection
03. top level view of computer function &amp; interconnection03. top level view of computer function &amp; interconnection
03. top level view of computer function &amp; interconnection
 
Processor Basics
Processor BasicsProcessor Basics
Processor Basics
 
Pipeline hazard
Pipeline hazardPipeline hazard
Pipeline hazard
 
4bit pc report[cse 08-section-b2_group-02]
4bit pc report[cse 08-section-b2_group-02]4bit pc report[cse 08-section-b2_group-02]
4bit pc report[cse 08-section-b2_group-02]
 
4bit PC report
4bit PC report4bit PC report
4bit PC report
 
Computer architecture instruction formats
Computer architecture instruction formatsComputer architecture instruction formats
Computer architecture instruction formats
 
multi cycle in microprocessor 8086 sy B-tech
multi cycle  in microprocessor 8086 sy B-techmulti cycle  in microprocessor 8086 sy B-tech
multi cycle in microprocessor 8086 sy B-tech
 
pipelining
pipeliningpipelining
pipelining
 
Highridge ISA
Highridge ISAHighridge ISA
Highridge ISA
 
ELEC3300_09-DMA.pdf
ELEC3300_09-DMA.pdfELEC3300_09-DMA.pdf
ELEC3300_09-DMA.pdf
 
IntroductionCPU performance factorsInstruction countDeterm.docx
IntroductionCPU performance factorsInstruction countDeterm.docxIntroductionCPU performance factorsInstruction countDeterm.docx
IntroductionCPU performance factorsInstruction countDeterm.docx
 
Operating system
Operating systemOperating system
Operating system
 
Data Manipulation
Data ManipulationData Manipulation
Data Manipulation
 
Assembly p1
Assembly p1Assembly p1
Assembly p1
 
Describr the features of pentium microppr
Describr the features of pentium micropprDescribr the features of pentium microppr
Describr the features of pentium microppr
 

Arch 1112-6

  • 1. Section 6 Instruction-Level Parallelism Topics: Pipelining Superscalar processors VLIW architecture Instruction level parallelism Overview Modern processors apply techniques for executing several instructions in parallel to enhance the computing power. The potential of executing machine instructions in parallel is called instruction level parallelism (ILP). Remember: execution of one instruction is broken into several steps Pipelining: Slide 6-2 Pipelining: Different steps of multiple instructions are executed simultaneously. Concurrent execution: The same steps of multiple machine instructions may be executed simultaneously. Requires multiple functional units Techniques: Superscalar VLIW (very long instruction word)
  • 2. Instruction level parallelism Pipelining: principle Principle: The execution of a machine instruction is divided into several steps – called pipeline stages - taking nearly the same execution time. These stages may be executed in parallel. Example MIPS: 5 pipeline stages Slide 6-3 1. IF: instruction fetch 2. ID: instruction decode and register file read 3. EX: execution / memory address calculation 4. MEM: data memory access 5. WB: result write back Instruction level parallelism Pipelining: principle Executing 6 instructions using pipelining Executing two 5-step instructions (e.g. lw) without pipelining Instruction 1: S1 S2 S3 S4 S5 Instruction 2: S1 S2 S3 S4 S5 Clock cycle: 1 2 3 4 5 6 7 8 9 10 Slide 6-4 Instruction 1: S1 S2 S3 S4 S5 Instruction 2: S1 S2 S3 S4 S5 Instruction 3: S1 S2 S3 S4 S5 Instruction 4: S1 S2 S3 S4 S5 Instruction 5: S1 S2 S3 S4 S5 Instruction 6: S1 S2 S3 S4 S5 Clock cycle: 1 2 3 4 5 6 7 8 9 10
  • 3. Instruction level parallelism In this chapter we will design a pipelined MIPS datapath for the following instructions: lw, sw, add, sub, and, or, slt, beq Situations may occur where two instructions cannot be executed in the pipeline right after each other! Example: Non-pipelined multi-cycle CPU has shared ALU for 1. executing arithmetic/logical instructions Pipelined MIPS datapath Slide 6-5 1. executing arithmetic/logical instructions 2. incrementing PC Structural hazard: Two instructions wish to use a certain hardware component in the same clock cycle leading to a resource conflict. For RISC instruction sets structural hazards can often be resolved by additional hardware. Instruction level parallelism Additional Hardware being required: 1. Permit incrementing PC and executing arithmetic/logical instructions concurrently: use separate adder for incrementing PC 2. Permit reading next instruction and reading/writing data from/to memory: divide memory into instruction memory and data memory (Harvard architecture) Pipelined MIPS datapath Slide 6-6 3. Permit executing an arithmetic/logical instruction (uses ALU in 3. cycle) followed by a branch (calculates branch target in 2. cycle): use separate adder for branch address calculation Duplicating hardware components in general leads to less and/or smaller multiplexers.
  • 4. Instruction level parallelism MIPS datapath (without pipelining) 0 1 shift left 2 Add result Add result 4 IF: Instruction fetch ID: Instruction de- code / register read EX: Execute / address calculation MEM: Memory access WB: Write back Slide 6-7 address instruction instruction memory register file read register1 read register2 write register write data read data 1 read data 2 ALU zero result address write data data memory read data PC 0 1 1 0 sign extend 16 32Datapath for executing one instruction per clock: single cycle implementation Instruction level parallelism Pipelined MIPS datapath additionally requires Pipeline registers: • Store all the data occurring at the end of one pipeline stage that are required as input data in the next stage • Divide datapath into pipeline stages Pipelined MIPS datapath Slide 6-8 • Divide datapath into pipeline stages • Replace temporary datapath registers of non-pipelined multi-cycle implementation, e.g.: ALU target register T replaced by pipeline register EX/MEM Instruction register IR replaced by pipeline register IF/ID
• 5. Instruction level parallelism Pipelined MIPS datapath [datapath diagram, Slide 6-9: stages IF through WB separated by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB] Instruction level parallelism Executing an instruction, phase 1: instruction fetch, e.g. lw $t0, 32($s3) [datapath diagram, Slide 6-10: IF stage active]
• 6. Instruction level parallelism Executing an instruction, phase 2: instruction decode, e.g. lw $t0, 32($s3) [datapath diagram, Slide 6-11: ID stage active] Instruction level parallelism Executing an instruction, phase 3: execution, e.g. lw $t0, 32($s3) [datapath diagram, Slide 6-12: EX stage active]
• 7. Instruction level parallelism Executing an instruction, phase 4: memory access, e.g. lw $t0, 32($s3) [datapath diagram, Slide 6-13: MEM stage active] Instruction level parallelism Executing an instruction, phase 5: write back [datapath diagram, Slide 6-14: WB stage active] BUG!! The LOAD instruction writes its result into the wrong register: the register number used belongs to the instruction that has just been fed into the pipeline!
  • 8. Instruction level parallelism Revised hardware 0 1 Add result 4 IF/ID ID/EX EX/MEM MEM/WB Solution: keep register number and pass it to the last stage ⇒ 5 additional bits for each of the last 3 pipeline registers shift left 2 Add result Slide 6-15 address instruction instruction memory register file read register1 read register2 write register write data read data 1 read data 2 ALU result address write data data memory read data PC 0 1 1 0 sign extend 16 32 Instruction level parallelism Control for pipelined MIPS processor General Approach: In stage ID, create all control signals which are needed for an instruction in subsequent stages (EX, MEM, WB) and store them in the ID/EX pipeline register. Then, in each clock cycle hand over control signals to the next stage using the corresponding pipeline registers. Slide 6-16 Which signals are required in which stage? We can divide the control signals into 5 groups corresponding to the pipeline stages where they are needed.
  • 9. Instruction level parallelism Control for pipelined MIPS processor 1. Instruction fetch: Instruction memory is read and PC is written in every clock cycle ⇒ no control signals required! 2. Instruction decode / register file read: The same operations are performed in every clock cycle ⇒ no control signals required! 3. Execute / address calculation: ALUop and ALUsrc (as described in Chapter 5), RegDst (use rd or rt as target) Slide 6-17 ALUop and ALUsrc (as described in Chapter 5), RegDst (use rd or rt as target) 4. Memory access: MemRead and MemWrite (control data memory): set by lw,sw Branch (PC will be reloaded if condition is fulfilled): set by beq PCsrc is determined from Branch and zero (from ALU, condition is fulfilled if set) 5. Write back: MemtoReg (send either ALU result or memory value to register file) RegWrite (register file write enable) Instruction level parallelism Pipelined MIPS data path and control 0 1 Add result 4 IF/ID ID/EX EX/MEM MEM/WB WB M EX WB M WB Control MemWrite PCSrc RegWrite Branch shift left 2 Add result Slide 6-18 address instruction instruction memory register file read register1 read register2 write register write data read data 1 read data 2 ALU zero result address write data data memory read data PC 0 1 1 0 sign extend 16 32 MemtoReg ALU Control 6 instr. [15-0] instr. [20-16] instr. [15-11] 0 1 RegDst ALUOp ALUSrc MemWrite MemRead Branch
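The per-stage grouping above can be sketched in code. The following is a hedged Python sketch: the signal values follow the usual textbook single-cycle control encoding, and the bundling into dictionaries is our own illustration of how ID creates all control signals and hands them down through the pipeline registers.

```python
# Control bundles as the ID stage would write them into ID/EX (sketch; values
# follow the common textbook encoding, not necessarily this course's tables).
CONTROL = {
    "lw":  {"EX":  {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
            "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
            "WB":  {"RegWrite": 1, "MemtoReg": 1}},
    "beq": {"EX":  {"RegDst": 0, "ALUOp": 0b01, "ALUSrc": 0},  # RegDst is a don't-care
            "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
            "WB":  {"RegWrite": 0, "MemtoReg": 0}},
}

# Each clock cycle a stage consumes its group and passes the rest on:
id_ex = CONTROL["lw"]                                 # created in ID
ex_mem = {"MEM": id_ex["MEM"], "WB": id_ex["WB"]}     # EX consumed the EX group
mem_wb = {"WB": ex_mem["WB"]}                         # MEM consumed the MEM group
print(mem_wb["WB"]["RegWrite"])                       # 1
```

This mirrors why IF and ID themselves need no stored control signals: only the EX, MEM and WB groups travel through the pipeline registers.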
• 10. Instruction level parallelism Consider the following program sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) Example # $2 = 23-3 = 20 # $12 = 20 and 7 = 4 # $13 = 3 or 20 = 23 # $14 = 20+20 = 40 # save $15 to 100(20) Slide 6-19 Assume the following initial register contents: $1 = 23 $2 = 10 $3 = 3 $5 = 7 $6 = 3 Instruction level parallelism Data dependences and hazards [pipeline diagram, Slide 6-20: each instruction passes through IM, Reg, ALU, DM, Reg in consecutive cycles] Without forwarding, the instructions following sub read the stale value $2 = 10: $2 = 23-3 = 20, but $12 = 10 and 7 = 2, $13 = 3 or 10 = 11, $14 = 10+10 = 20 Data dependence leading to error (hazard)! Consider in the following only data hazards for register-register-type instructions
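The two outcomes on the slide can be re-run in a few lines of Python, a sketch of our own: first the intended results (and/or/add see the new $2 = 20), then what a pipeline without forwarding actually reads (the stale $2 = 10).

```python
# Sketch: emulate the example program on a register dictionary.
def run(regs):
    r = dict(regs)
    r[2] = r[1] - r[3]          # sub $2, $1, $3
    return {12: r[2] & r[5],    # and $12, $2, $5
            13: r[6] | r[2],    # or  $13, $6, $2
            14: r[2] + r[2]}    # add $14, $2, $2

initial = {1: 23, 2: 10, 3: 3, 5: 7, 6: 3}
print(run(initial))             # {12: 4, 13: 23, 14: 40}

stale = initial                 # without forwarding, and/or still see $2 = 10
wrong = {12: stale[2] & stale[5], 13: stale[6] | stale[2]}
print(wrong)                    # {12: 2, 13: 11}
```

The mismatch between the two printouts is exactly the data hazard the slide points at.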
  • 11. Instruction level parallelism Dependences Consider an instruction a that precedes an instruction b in program order: A data dependence between a and b occurs when a writes into a register that will be read by b. An antidependence between a and b occurs when b writes into a register that is read by a. An output dependence between a and b occurs when both, a and b Slide 6-21 An output dependence between a and b occurs when both, a and b write into the same register. A data hazard is created whenever the overlapping (pipelined) execution of a and b would change the order of access to the operands which are involved in the dependency. Instruction level parallelism Data hazards Consider an instruction a that precedes an instruction b in program order: Depending on the type of the dependence between a and b the following hazards may occur: RAW: read after write b reads a source before a writes it, so b incorrectly gets the old value. Slide 6-22 WAR: write after read b writes an operand before it is read by a, so a incorrectly gets the new value. WAW: write after write b writes an operand before it is written by a, leaving the wrong result in the target register. In the following we consider only data hazards for R-type instructions
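The three dependence types can be expressed as a small classifier. A sketch under our own representation: an instruction is a pair (destination register, set of source registers); the function reports which dependences exist between an earlier instruction a and a later instruction b.

```python
# Sketch: classify dependences between a (earlier) and b (later) in program order.
def hazards(a, b):
    a_dst, a_src = a
    b_dst, b_src = b
    h = set()
    if a_dst is not None and a_dst in b_src:
        h.add("RAW")   # b reads a's result -> may read it too early
    if b_dst is not None and b_dst in a_src:
        h.add("WAR")   # b overwrites an operand before a has read it
    if a_dst is not None and a_dst == b_dst:
        h.add("WAW")   # b's result may later be overwritten by the slower a
    return h

# sub $2,$1,$3 then and $12,$2,$5 -> RAW on $2
print(hazards((2, {1, 3}), (12, {2, 5})))      # {'RAW'}
# add.d $f10,$f0,$f8 then mul.d $f8,$f8,$f14 -> WAR on $f8
print(hazards((10, {0, 8}), (8, {8, 14})))     # {'WAR'}
```

Whether a dependence actually becomes a hazard then depends on the pipeline timing, as the slide notes.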
• 12. Instruction level parallelism Software solution for resolving data hazards Compiler resolves all data hazards: • Test the machine language program for potential data hazards • Eliminate them by inserting NOP instructions (no operation) Example: sub $2, $1, $3 nop nop nop Slide 6-23 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) Modern processors are able to detect data hazards during program execution by analyzing the register numbers of the instructions using additional control logic! Instruction level parallelism Hardware solution for resolving data hazards [pipeline diagram, Slide 6-24: sub $2,$1,$3 followed by and $12,$2,$5, or $13,$6,$2, add $14,$2,$2, sw $15,100($2), each passing through IM, Reg, ALU, DM, Reg] Data required by subsequent instructions already exists in a pipeline register! Register file: If a register is read and written in the same clock cycle, send the new data to the data output!
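The compiler strategy above can be sketched directly. Assumptions made here (ours, not the slides'): an instruction is (destination, {sources}), and a source must not be written by any of the previous k = 3 instructions, matching a 5-stage pipeline without forwarding.

```python
# Sketch: insert NOPs until no instruction reads a register written by one of
# the k previous instructions in the emitted schedule.
NOP = (None, frozenset())

def insert_nops(prog, k=3):
    out = []
    for instr in prog:
        dst, srcs = instr
        while any(d is not None and d in srcs for d, _ in out[-k:]):
            out.append(NOP)
        out.append(instr)
    return out

prog = [(2, {1, 3}),      # sub $2, $1, $3
        (12, {2, 5}),     # and $12, $2, $5
        (13, {6, 2}),     # or  $13, $6, $2
        (14, {2}),        # add $14, $2, $2
        (None, {15, 2})]  # sw  $15, 100($2)
sched = insert_nops(prog)
print(sched.count(NOP))   # 3 NOPs, all directly after sub, as on the slide
```

Only sub's consumers need padding here; once the three NOPs are in, the later readers of $2 are far enough behind the write.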
  • 13. Instruction level parallelism MIPS datapath using forwarding Forwarding unit gets as input: Forwarding: ALU may read operands from each of the pipeline registers. The correct operands are selected by multiplexers that are controlled by an additional control unit: forwarding unit Slide 6-25 • Register operand numbers of instruction in EX stage • Target register number of instructions being in MEM and WB stage • Control signals indicating type of instructions being in MEM and WB stage Register numbers are stored and moved forward in the pipeline registers For reasons of clarity the hardware structure shown on the following slide has been simplified. Adder for branch target calculation, ALU input for address calculation and address input of data memory are missing. Instruction level parallelism MIPS datapath using forwarding IF/ID ID/EX EX/MEM MEM/WB WB M EX WB M WB Control MemWrite For R-type instruction Slide 6-26 address instruction instruction memory register file read register1 read register2 write register write data read data 1 read data 2 ALU write data data memory read data PC 0 1 2 1 0 MemtoReg Forwarding Unit IF/ID.RegisterRs MemWrite MemRead 0 1 2 IF/ID.RegisterRd IF/ID.RegisterRt EX/MEM.RegisterRd MEM/WB.RegisterRd
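The forwarding unit's decision for one ALU input can be written as two comparisons. A minimal sketch, using the customary textbook signal names (ex_mem_rd etc. are our naming, not necessarily the slides'); the returned value is the select code of the ALU-input multiplexer.

```python
# Sketch: choose the ALU operand source for the register number read in EX.
def forward_select(id_ex_rs, ex_mem_regwrite, ex_mem_rd,
                   mem_wb_regwrite, mem_wb_rd):
    # 2: forward from EX/MEM -- the most recent producer wins
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return 2
    # 1: forward from MEM/WB
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return 1
    return 0  # 0: take the value read from the register file

# and $12,$2,$5 right after sub $2,$1,$3: $2 is taken from EX/MEM
print(forward_select(2, True, 2, False, 0))  # 2
```

The `!= 0` guard keeps register $zero from ever being "forwarded", and checking EX/MEM before MEM/WB implements the rule that the newest result has priority.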
  • 14. Instruction level parallelism • Data hazards may be resolved if the operands being read by the instruction in the EX stage are already stored in one of the pipeline registers! • Now consider the following program: lw $2, 20($1) and $4, $2, $5 Forwarding AND instruction requires $2 at the beginning of the 3. stage (4. cycle) Slide 6-27 and $4, $2, $5 or $8, $2, $6 add $9, $4, $2 slt $1, $6, $7 BUT: value for $2 is stored in a pipeline register at the end of stage 4 of LW (4. cycle) ⇒ hazard may not be resolved by forwarding We have to stall the pipeline for combinations of a load followed by an instruction that reads its result! Additional hardware for detecting hazards and stalling the pipeline: Hazard detection unit Instruction level parallelism Illustration Reg IM Reg IM CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 Time (in clock cycles) lw $2, 20($1) Program execution order (in instructions) and $4, $2, $5 CC 7 CC 8 CC 9 DM Reg Reg DM Slide 6-28 Reg IM Reg DM Reg IM DM Reg IM DM Reg or $8, $2, $6 add $9, $4, $2 slt $1, $6, $7 Reg
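The load-use case is the one hazard forwarding cannot hide, so the hazard detection unit must stall. A sketch of its condition, with the usual textbook signal names (our choice of naming):

```python
# Sketch: stall when the instruction in EX is a load whose target register is
# read by the instruction currently in ID.
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) followed by and $4, $2, $5 -> stall one cycle
print(must_stall(True, 2, 2, 5))   # True
print(must_stall(False, 2, 2, 5))  # False (not a load, forwarding suffices)
```

When the condition holds, PC and IF/ID are frozen and a bubble is injected, exactly the behavior illustrated on the next slide.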
  • 15. Instruction level parallelism Stalling the pipeline lw $2, 20($1) Program execution order (in instructions) and $4, $2, $5 Reg IM Reg IM DM CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 Time (in clock cycles) CC 7 CC 8 CC 9 CC 10 DM Reg RegReg Slide 6-29 Stalling the pipeline means to repeat all actions from the previous clock cycle in the corresponding stages. PC and IF/ID register must be prevented from being overwritten. or $8, $2, $6 add $9, $4, $2 slt $1, $6, $7 Reg IM Reg DM RegIM IM DM Reg IM DM Reg Reg bubble Instruction level parallelism Control Hazards Consider the following program: beq $1, $3, L0 # PC relative addressing and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 ... Slide 6-30 L0: lw $4, 50($14) Efficient pipelining: one instruction is fetched at every clock cycle BUT: Which instruction has to be executed after the branch? Control (or branch) hazard : We start executing instructions before we know whether they are really part of the program flow!
  • 16. Instruction level parallelism CC 1 Time (in clock cy cle s) be q $1, $3, 7 Progra m exe cution order (in instru ctio ns) IM Re g D M Reg C C 2 C C 3 C C 4 C C 5 C C 6 C C 7 C C 8 C C 9 Strategy: Assume branch not taken For every branch we assume that it is not taken and we begin executing the subsequent instructions $pc+4, $pc+8 and $pc+12. (ALU is used to compute branch address) Slide 6-31 Reg be q $1, $3, 7 IM Re g IM D M IM D M IM D M D M Reg Reg Reg Reg Reg and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 Reg Instruction level parallelism Assume Branch not taken (continued) However, if the branch is taken we have to discard all instructions from the pipeline! CC 1 Time (in clock cycle s) beq $1, $3, 7 Program execution order (in instructions) IM Reg IM DM DM Reg Reg Regand $12, $2, $5 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 Slide 6-32 Reg Reg IM DM IM DM IM DM DM Reg Reg Reg Reg RegIM and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 lw $4, 50($7) Reg discard!discard! In CC5 all data calculated in the stages ID, EX and MEM have to be marked as invalid! ⇒ Set control signals for writing of memory and register file to zero
• 17. Instruction level parallelism Reducing the delay of branches The earlier we know whether a branch will be taken, the fewer instructions need to be flushed from the pipeline! 1. Calculate the branch target address in the ID stage using a separate adder 2. Test the condition in the ID stage using an additional comparator [diagram, Slide 6-33: 8-bit equality comparator, a = b?] The comparator is faster than the ALU, so it can be integrated into the ID phase! ⇒ Only one instruction needs to be flushed Instruction level parallelism Reducing the delay of branches [datapath diagram, Slide 6-34: branch adder and equality comparator moved into the ID stage]
  • 18. Instruction level parallelism Delayed Branches Delayed branching: No instructions are flushed from the pipeline. An instruction following immediately after a branch is always executed. Programming strategy: Slide 6-35 Place an instruction originally preceding the branch and not affected by it immediately after the branch (=branch delay slot ). If no suitable instruction is found place a NOP there. Typically the compiler/assembler will fill about 50% of all delay slots with useful instructions. Instruction level parallelism Processors with several functional units The times required for executing two arithmetic instructions may differ significantly depending on the type of the instruction: • Integer addition faster than floating point addition • Addition much faster than multiplication/division Making the cycle time long enough so that the slowest instruction can be executed in one cycle would slow down the processor dramatically! Slide 6-36 Solution: • Distribute the EX stage of complex operations over several clock cycles • Use several functional units in the EX stage ⇒ Allows to execute several instructions in parallel!
  • 19. Instruction level parallelism Extending the MIPS pipeline to handle multicycle floating point operations MIPS implementation with floating point (FP) instructions (MIPS R4000): • 1 Integer unit: used for load/store, integer ALU operations and branches • 1 Multiplier for integer and FP numbers Slide 6-37 • 1 Adder for FP addition and subtraction • 1 Divider for FP and integer numbers Instruction level parallelism Extended MIPS pipeline EX integer unit EX FP/int multiply MIPS pipeline with multiple functional units (FUs) Slide 6-38 IF ID MEM WB multiply EX FP add EX FP divide FU Execution time Structure INT 1 Not pipelined MUL 7 Pipelined ADD 4 Pipelined DIV 25 Not pipelined Out of order completion possible!
  • 20. Instruction level parallelism Extended MIPS pipeline MIPS pipeline with multiple functional units (FUs) M1 M2 M3 M4 M5 M6 M7 EX Integer unit FP/integer multiplier Slide 6-39 IF ID A1 A2 A3 A4 DIV FP adder FP/integer divider MEM WB Instruction level parallelism Extended MIPS pipeline Separate register file for storing FP operands: • FP registers f0 – f31 • FP instructions operate on FP registers • Integer instructions operate on integer registers • Exception: FP load/store: address in integer register, data in FP register + no increase in number of bits needed for addressing registers + simplifies hazard detection Slide 6-40 + simplifies hazard detection + read/write integer and FP operands at the same time + no increase in complexity of multiplexers/decoders (speed!) - Additional moves for copying data from FP registers to integer register and vice versa necessary • FP operands may be 32 or 64 bit wide One 64 bit operand occupies a pair of FP registers (e.g. f0 and f1) 64 bit path from/to memory to speed up double precision load/store
• 21. Instruction level parallelism Structural Hazard: functional unit Example: Floating point operations Div.d $f0,$f2,$f4 Mul.d $f4,$f6,$f4 Div.d $f8,$f8,$f14 Add.d $f10,$f4,$f8 The *.d extension indicates 64 bit floating point operations Slide 6-41 Cycle 1 2 3 4 5 6 7 8 9 10 11 12 Div.d $f0,$f2,$f4 IF ID DIV----------------------------------------------------- Mul.d $f4,$f6,$f4 IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB Div.d $f8,$f8,$f14 IF ID stall... Add.d $f10,$f4,$f8 IF stall... Functional units which are not pipelined and which require more than one clock cycle for execution may create structural hazards! The instruction has to be stalled in the ID stage! Instruction level parallelism Structural Hazard: write back Example: Cycle 1 2 3 4 5 6 7 8 9 10 11 Mul.d $f0,$f4,$f6 IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB Add $r0,$r2,$r3 IF ID EX MEM WB Add $r3,$r0,$r0 IF ID EX MEM WB Add.d $f2,$f4,$f6 IF ID A1 A2 A3 A4 MEM WB Sw $r3,0($r2) IF ID EX MEM WB Slide 6-42 Sw $r0,4($r2) IF ID EX MEM WB L.d $f2,0($r2) IF ID EX MEM WB Structural Hazard: 3 instructions wish to write their results to the FP register file in the same cycle! Solution: Track use of the write port of the register file in the ID stage by using a shift register. If a structural hazard would occur, the instruction in the ID stage is stalled for one cycle
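The shift-register bookkeeping for the write port can be sketched as follows. This is our own structuring, not the slides' implementation: one boolean per future cycle, where bit i means "the write port is already booked i cycles from now".

```python
# Sketch: reserve the register-file write port in the ID stage.
def try_reserve(shiftreg, latency):
    """latency = cycles from now until this instruction reaches WB."""
    if shiftreg[latency]:
        return False          # slot taken -> structural hazard, stall in ID
    shiftreg[latency] = True
    return True

def tick(shiftreg):
    shiftreg.pop(0)           # one cycle passes: everything moves one slot closer
    shiftreg.append(False)

sr = [False] * 12
print(try_reserve(sr, 9))     # a mul.d books WB nine cycles out -> True
print(try_reserve(sr, 9))     # another instruction wanting the same cycle -> False
tick(sr)
print(try_reserve(sr, 8))     # one cycle later the same WB cycle is slot 8 -> False
```

Stalling the loser in ID for one cycle, as the slide prescribes, just means retrying the reservation with latency unchanged on the next clock.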
• 22. Instruction level parallelism Structural Hazard: write back Example: resolved structural hazard Cycle 1 2 3 4 5 6 7 8 9 10 11 12 Mul.d $f0,$f4,$f6 IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB Add $r0,$r2,$r3 IF ID EX MEM WB Add $r3,$r0,$r0 IF ID EX MEM WB Add.d $f2,$f4,$f6 IF ID stall A1 A2 A3 A4 MEM WB Sw $r3,0($r2) IF stall ID EX MEM WB Slide 6-43 Sw $r0,4($r2) IF ID EX MEM WB L.d $f2,0($r2) IF ID stall EX MEM WB Instruction level parallelism WAW-Hazards Example: Cycle 1 2 3 4 5 6 7 8 9 10 11 12 Mul.d $f0,$f4,$f6 IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB Add $r0,$r2,$r3 IF ID EX MEM WB Add.d $f0,$f4,$f6 IF ID A1 A2 A3 A4 MEM WB WAW-Hazard: Add.d writes f0 before Mul.d does. Out-of-order completion may lead to WAW hazards! Slide 6-44 Solution: Stall the Add.d instruction in the ID stage Cycle 1 2 3 4 5 6 7 8 9 10 11 12 Mul.d $f0,$f4,$f6 IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB Add $r0,$r2,$r3 IF ID EX MEM WB Add.d $f0,$f4,$f6 IF ID stall stall A1 A2 A3 A4 MEM WB ⇒ Hazard detection logic detects all hazards in the ID stage and resolves them by stalling the corresponding instruction
  • 23. Instruction level parallelism Extended MIPS pipeline Instruction execution: 1. Fetch 2. Decode: 1. Check for structural hazards: Wait until the required FU is not busy and make sure the register write port is available when it will be needed 2. Check for RAW data hazards: Wait until source registers are not listed as destination register of any instruction in M1-M6, A1 – A3, DIV or a load in EX Optimization: e.g.: if the division is in the final clock cycle its result may be Slide 6-45 Optimization: e.g.: if the division is in the final clock cycle its result may be forwarded to the requesting FU in the cycle following. 3. Check for WAW data hazards: Determine if any instruction in A1 - A4, M1 - M7, DIV, has the same destination as this instruction. If so, stall instruction for the number of clock cycles being necessary Simplification: Since WAW hazards are rare, stall instruction until no other instruction in the pipeline has the same destination 3. Execute 4. Memory Access 5. Write Back Instruction level parallelism Dynamic Branch Prediction Assume branch not taken is a crude form of branch prediction. Typically it fails in 50% of all cases. In processors with multiple functional units deep pipelines are used. This may lead to large branch delays if a branch is predicted the wrong way! ⇒ we need more accurate methods for predicting branches! Slide 6-46 Idea: dynamic branch prediction predict branches using the program’s past behaviour. Branch prediction buffer or branch history table Small memory addressed by the lower bits of the instruction address, contains a flag indicating whether the branch has been taken or not. This flag is set or reset at each branch.
  • 24. Instruction level parallelism Dynamic Branch Prediction For loops the hit rate may be improved by using two bits for branch prediction. A prediction must be wrong twice before it is changed. Predict taken Predict taken WrongCorrect Slide 6-47 10 11 Predict not taken 01 Predict not taken 00 Correct Wrong Correct Wrong Correct Wrong 2 bit prediction scheme Instruction level parallelism Branch Target Buffer Observation: the target address (calculated from PC and offset) for a particular branch remains constant during program execution. Idea: store the branch target addresses in a lookup table: branch target buffer Addresses of branch instructions Branch target addresses Predicted taken or untaken Slide 6-48 PC Instruction memory = ? control Branch target buffer in combination with correct branch prediction allows to execute branches without stalling the pipeline!
  • 25. Instruction level parallelism Dynamic Scheduling Static scheduling: Execution is started in the order in which the instructions have been fetched. (e.g., in the order which the compiler has determined). If a data dependence occurs that can not be resolved by forwarding, the pipeline is stalled (starting with the instruction that waits for a result). No new instructions are fetched until the dependence is cleared. Idea:Hardware rearranges instruction executions dynamically to reduce stalls ⇒ Dynamic Scheduling Slide 6-49 ⇒ Dynamic Scheduling Dynamic Scheduling takes structural hazards and data hazards into consideration! To avoid that an instruction is stalled because a data hazard delays all subsequent instructions, the ID stage is spilt into two stages: 1. Issue: Decode instructions, check for structural hazards 2. Read operands: Wait until no data hazards, then read operands Leads to out-of-order execution and out-of-order completion Instruction level parallelism Out-of-order execution Out-of-order execution may lead to WAR hazards! Example: Floating point operations Div.d $f0, $f2, $f4 Add.d $f10,$f0, $f8 Mul.d $f8, $f8, $f14 Slide 6-50 Add.d needs to be stalled because of RAW hazard. Mul.d may be started, BUT: if mul.d completes before add.d reads its operands, add.d will read the wrong value in f8! The control logic deciding when an instruction is executed has to detect and resolve hazards!
  • 26. Instruction level parallelism Score Board Dynamic Scheduling with a Score Board Goal: maintain an execution rate of one instruction per clock cycle by executing an instruction as early as possible If an instruction needs to be stalled because of a data hazard other instructions can be issued and executed. ⇒ We have to analyze the program flow for hazards! Slide 6-51 ⇒ We have to analyze the program flow for hazards! Scoreboard: • Detects structural hazards and data hazards • Determines when an instruction may read operands and when it is executed • Determines when an instruction can write its result into the destination register Instruction level parallelism Dynamic Scheduling with a Score Board In the following we will consider dynamic scheduling only for arithmetic instructions – no MEM access-phase necessary. 4 stages (replace ID, EX, WB stage of standard MIPS pipeline): 1. Issue: If • a functional unit (FU) for the instruction is free (resolve structural hazards) and • no other active instruction has the same destination register (resolve WAW Slide 6-52 hazards) the score board issues the instruction to the FU and updates its internal data structure. If a hazard exists, the issue stage stalls. Subsequent instructions are written into a buffer between instruction fetch and issue. If this buffer is filled then the instruction fetch stage stalls. 2. Read operands: When all operands are available the score board tells the FU to read its operands and to begin execution (may lead to out of order execution). A source operand is available when no active instruction issued earlier is going to write it (resolve RAW hazards).
  • 27. Instruction level parallelism Dynamic Scheduling with a Score Board 3. Execution: The FU executes the instruction (may take several clock cycles). When the result is ready the FU notifies the scoreboard that it has completed execution. 4. Write result: When an FU announces the completion of an execution the scoreboard checks for WAR hazards. If no such hazard exists the result can be written to the destination register. A WAR hazard occurs when there is an instruction preceding the completing instruction that Slide 6-53 • has not read its operands yet and • one of these operands is the same register as the destination register of the completing instruction. Score Boarding does not use forwarding! If no WAR hazard occurs the result is written to the destination register during the clock cycle following the execution. (we do not have to wait for a statically assigned WB stage that may be several cycles away). Instruction level parallelism Example MIPS processor with dynamic scheduling using a score board with the following functional units (not pipelined) in the datapath: - 1 Integer unit: for load/store, integer ALU operations and branches - 2 Multiplier for FP numbers - 1 Adder for FP addition/subtraction Slide 6-54 - 1 Divider FP numbers MIPS program with floating point instructions (64 Bit): L.d $f6, 34 ($r2) L.d $f2, 45 ($r3) Mul.d $f0, $f2, $f4 Sub.d $f8, $f2, $f6 Div.d $f10, $f0, $f6 Add.d $f6, $f8, $f2 Assumptions: EX phase for double precision takes: 2 cycles for load and add 10 cycles for mult 40 cycles for div
• 28. Instruction level parallelism MIPS with a Score Board [structure diagram, Slide 6-55: two FP multipliers, an FP divider, an FP adder and an integer unit connected to the register file via data busses; the score board exchanges control/status signals with all units] Instruction level parallelism Components of the Score Board The score board consists of three parts containing the following data: 1. Instruction status: indicates for each instruction which of the four steps the instruction is in. 2. FU status: indicates for each FU its state: Slide 6-56 busy: FU busy or not OP: Operation to perform (e.g. add or subtract) fi: Destination register fj, fk: Source registers Qj, Qk: Functional units writing the source registers fj and fk Rj, Rk: Flags indicating whether fj and fk are ready to be read but have not been read yet; set to “no” after the operands have been read. 3. Result register status: indicates for each register whether an FU is going to write it and which FU this will be.
  • 29. Instruction level parallelism Components of the Score Boards Instruction Issue Read operands Execution complete Write result L.d $f6, 34 ($r2) √ √ √ √ L.d $f2, 45 ($r3) √ √ √ Mul.d $f0, $f2, $f4 √ Sub.d $f8, $f2, $f6 √ Div.d $f10, $f0, $f6 √ Add.d $f6, $f8, $f2 Instruction status Slide 6-57 Name Busy Op fi fj fk Qj Qk Rj Rk integer yes load f2 r3 0 no mult1 yes mult f0 f2 f4 integer 0 no yes mult2 no add yes sub f8 f2 f6 integer 0 no yes divide yes div f10 f0 f6 mult1 0 no yes f0 f2 f4 f6 f8 f10 f12 f30 FU mult1 integer 0 0 add divide 0 0 Functional unit status Result register status (double precision floating point numbers number ⇒ allocate two 32 bit registers) Instruction level parallelism Bookkeeping in the Score Board Instruction status Wait until Bookkeeping Issue Busy[FU] = no Busy[FU] := yes; Op[FU] := op; Result[d] := FU; When an instruction has passed through one step the score board is updated. FU: FU used by instruction fi[FU], fj[FU], fk[FU]: destination/source registers of FU d: destination register Rj[FU], Rk[FU]: s1, s2 ready? s1, s2:source registers Qj[FU], Qk[FU]: FUs producing s1 and s2 op: type of operation Result[d]: FU that will write register d Op[FU]: operation which FU will execute Slide 6-58 Issue Busy[FU] = no and Result[d] = 0 (no other FU has d as destination register) Busy[FU] := yes; Op[FU] := op; Result[d] := FU; fi[FU] := d; fj[FU] := s1; fk[FU] := s2; Qj := Result[s1]; Qk := Result[s2]; if Qj = 0 then Rj := yes; else Rj := no if Qk = 0 then Rk := yes; else Rk := no Read operands Rj = yes and Rk = yes Rj := no; Rk := no; Qj := 0; Qk := 0 Execution Functional unit done Write results ∀f((fj[f] ≠ fi[FU] or Rj[f] = no) and (fk[f] ≠ fi[FU] or Rk[f] = no)) ∀f(if Qj[f] = FU then Rj[f] := yes); ∀f(if Qk[f] = FU then Rk[f] := yes); Result[fi[FU]] := 0; Busy[FU] := nofor all FUs
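The issue row of the bookkeeping table above translates almost literally into code. A sketch with the scoreboard's structures simplified to Python dicts (the field names follow the table; everything else is our own scaffolding):

```python
# Sketch: the scoreboard's issue step.
def can_issue(fu, dest, busy, result):
    # Busy[FU] = no (no structural hazard) and Result[d] = 0 (no WAW hazard)
    return not busy[fu] and result.get(dest, 0) == 0

def issue(fu, op, dest, s1, s2, busy, result, station):
    busy[fu] = True
    station[fu] = {"op": op, "fi": dest,
                   "qj": result.get(s1, 0),   # FU that will produce s1, or 0
                   "qk": result.get(s2, 0)}   # FU that will produce s2, or 0
    result[dest] = fu

busy = {"mult1": False}
result, station = {}, {}
if can_issue("mult1", "f0", busy, result):     # Mul.d $f0, $f2, $f4
    issue("mult1", "mul", "f0", "f2", "f4", busy, result, station)
print(result["f0"], station["mult1"]["qj"])    # mult1 0
```

Qj = Qk = 0 here because no earlier active instruction writes f2 or f4, so the read-operands step could proceed immediately; with a pending producer, the station would instead record that FU's name and wait.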
  • 30. Instruction level parallelism Bookkeeping in the Score Board Comment for step write results: ∀f (fj[f] ≠ fi[FU] or Rj[f] = no) „Rj[f] = no“ means, that the instruction which is now active at f will not read the current contents of source register fj a) either, since the operation has already been executed and currently waits for permission to write, or Slide 6-59 b) since the required source operand must still be computed and the current instruction is waiting for that. In the first case register fj is overwritten since the previous contents are no longer needed. In the second case the register is overwritten since this will provide the expected operand. Ri[f] = yes means that the instruction being active at f still requires the current content of the register specified by fi. Instruction level parallelism Dynamic Scheduling: Tomasulo‘s Schema Example: div.d $f0, $f2, $f4 add.d $f6, $f0, $f8 sub.d $f8, $f10, $f14 Are there further possibilities for eliminating stalls resulting from hazards? RAW hazard: No way - we have to wait until all operands are calculated! WAR hazard and WAW hazard: RAW hazard for f0 WAR hazard for f8 Slide 6-60 sub.d $f8, $f10, $f14 mul.d $f6, $f10, $f8 WAR hazard for f8 WAW hazard for f6, RAW hazard for f8 Idea: Register renaming Rename destination registers of instructions in a way that prevents instructions being executed out-of-order from overwriting operands being still required by other instructions ⇒ Tomasulo‘s scheme or Tomasulo‘s algorithm Observation: WAR and WAW hazard could have been avoided by compiler!
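The renaming idea can be made concrete in a few lines. Note one deliberate simplification: the slide renames only the registers involved in WAR/WAW hazards, while this sketch renames every destination to a fresh tag, which is what Tomasulo's reservation stations do in effect; the tag names are ours.

```python
# Sketch: full register renaming over a straight-line FP instruction sequence.
def rename(prog):
    latest = {}                              # architectural reg -> newest tag
    fresh = iter(["S", "T", "U", "V"])       # pool of temporary names
    out = []
    for op, dst, s1, s2 in prog:
        s1, s2 = latest.get(s1, s1), latest.get(s2, s2)  # read newest version
        tag = next(fresh)                    # every write gets a fresh name
        latest[dst] = tag
        out.append((op, tag, s1, s2))
    return out

prog = [("div.d", "f0", "f2", "f4"),
        ("add.d", "f6", "f0", "f8"),
        ("sub.d", "f8", "f10", "f14"),
        ("mul.d", "f6", "f10", "f8")]
for instr in rename(prog):
    print(instr)
# mul.d now reads sub.d's tag for f8 (WAR gone) and the two writes to f6
# get distinct tags (WAW gone); only the true RAW dependences remain.
```

Only the result of the last write to an architectural register (here the tag V for f6) would be committed back to the register file, matching the rule on the reservation-station slide.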
  • 31. Instruction level parallelism Register renaming Example (continued): Assume we have two temporary registers S and T. replace f6 in add.d by a temporary register S and replace f8 in sub.d and mul.d by a temporary register T: div.d $f0, $f2, $f4 add.d $S, $f0, $f8 Slide 6-61 add.d $S, $f0, $f8 sub.d $T, $f10, $f14 mul.d $f6, $f10, $T Replace target registers affected by a WAW or a WAR hazard by temporary registers and modify subsequent instructions reading these registers appropriately. Instruction level parallelism Reservation Station Temporary registers are part of reservation stations : • Buffer the operands for instructions waiting for execution. If an operand is not yet calculated the corresponding reservation station contains the number of the reservation station which will deliver the result. • Renaming of register numbers for pending operands to the names of the reservation stations, this is done during instruction issue. Slide 6-62 • Information about the availability of the operands stored in a reservation station determines when the corresponding instruction can be executed. • As results become available they are sent directly from the reservation stations to the waiting FU over the common data bus (CDB) • When successive writes to a register overlap in execution, only the result of the instruction being issued at last is used to update the register. ⇒ resolves WAR/WAW hazards
• 32. Instruction level parallelism
Tomasulo‘s Algorithm
[Figure: MIPS floating-point unit using Tomasulo‘s algorithm – an instruction queue (FIFO) fed from the instruction unit; the FP register file; load buffers and store buffers connected to memory via an address unit; reservation stations (3 for the FP adders, 2 for the FP multipliers/dividers); operand buses; results broadcast on the common data bus (CDB).]
Slide 6-63
Instruction level parallelism
Tomasulo‘s Algorithm – Stages
Steps in the execution of an FP instruction:
1. Issue: Get the next instruction from the head of the instruction queue and issue it to a matching reservation station that is empty. Load/store buffers, which hold data/addresses coming from and going to memory, behave like reservation stations for the arithmetic units.
Operands available in registers?
yes: hand over the values to the reservation station
no: hand over the names of those reservation stations that are calculating the values
Buffering operands resolves WAR hazards!
If no matching reservation station is empty there is a structural hazard ⇒ the instruction stalls until a station is freed.
Slide 6-64
• 33. Instruction level parallelism
Tomasulo‘s Algorithm – Stages
2. Execute:
1. If one or more of the operands are not available, monitor the CDB.
2. When an operand becomes available, place it in the waiting reservation station(s).
3. When all operands of an instruction are available, start execution ⇒ resolves RAW hazards.
In case of stores: execution (address calculation) may start even if the data to be stored is not available yet. The address calculation unit is occupied during address calculation only.
3. Write result:
1. When the result is available, send it to the CDB.
2. From the CDB it goes directly to the waiting reservation stations (and store buffers).
Only if the instruction is the last-issued one writing to a certain target register is the result also written to that register ⇒ avoids WAW hazards.
Slide 6-65
Instruction level parallelism
Reservation Stations
Each reservation station has the following fields:
Op: type of the operation to perform (e.g. add or subtract)
Qj, Qk: names of the reservation stations containing the instructions calculating the operands; zero values indicate that the operands are already available
Vj, Vk: values of the source operands
Busy: flag indicating that this station/buffer is occupied
Each load/store buffer has an additional field:
A: initially holds the immediate field of the address; after address calculation it holds the effective address
For each register of the register file there is one field:
Qi: name of the reservation station containing the last-issued instruction that calculates the result for this register; a zero value indicates that no active instruction is calculating a result for that register.
Slide 6-66
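The issue step and the reservation-station fields above can be sketched together in a few lines. The field names (Vj, Vk, Qj, Qk, Busy) and the register-status table Qi follow the slides; the dictionary representation and the `Name` field are illustrative assumptions, and `None` stands in for the "zero" value.

```python
def issue(op, dest, src1, src2, regs, Qi, station):
    """Fill a free reservation station for one FP instruction (issue stage)."""
    for src, v, q in ((src1, "Vj", "Qj"), (src2, "Vk", "Qk")):
        if Qi.get(src):                # operand still being computed:
            station[q] = Qi[src]       # remember the producing station's name
            station[v] = None
        else:                          # operand available in the register file
            station[v] = regs[src]
            station[q] = None          # None ("zero") = operand present
    station["Op"] = op
    station["Busy"] = True
    Qi[dest] = station["Name"]         # this station now produces `dest`
    return station

regs = {"f2": 2.0, "f4": 4.0, "f8": 8.0}
Qi = {}
mult1 = issue("div.d", "f0", "f2", "f4", regs, Qi, {"Name": "mult1"})
add1 = issue("add.d", "f6", "f0", "f8", regs, Qi, {"Name": "add1"})
```

In the second issue the value of f8 is copied into Vk immediately (which is why a later write to f8 cannot cause a WAR hazard), while the pending f0 is recorded as Qj = "mult1".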
• 34. Instruction level parallelism
Tomasulo‘s Method: Information Tables

Instruction status:
| Instruction | Issue | Execute | Write result |
|---|---|---|---|
| l.d $f6, 34($r2) | √ | √ | √ |
| l.d $f2, 45($r3) | √ | √ | |
| mul.d $f0, $f2, $f4 | √ | | |
| sub.d $f8, $f2, $f6 | √ | | |
| div.d $f10, $f0, $f6 | √ | | |
| add.d $f6, $f8, $f2 | √ | | |

Reservation stations:
| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| load1 | no | | | | | | |
| load2 | yes | load | | | | | 45+Regs[r3] |
| add1 | yes | sub | | Mem[34+Regs[r2]] | load2 | | |
| add2 | yes | add | | | add1 | load2 | |
| add3 | no | | | | | | |
| mult1 | yes | mul | | Regs[f4] | load2 | | |
| mult2 | yes | div | | Mem[34+Regs[r2]] | mult1 | | |

Register status:
| Register | f0 | f2 | f4 | f6 | f8 | f10 | f12 | … | f30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | mult1 | load2 | 0 | add2 | add1 | mult2 | 0 | … | 0 |

Slide 6-67
Instruction level parallelism
Dynamic Scheduling: Data Hazards through Memory
A load and a store instruction may only be reordered if they access different addresses! (RAW/WAR hazard!)
Two stores sharing the same data memory address may not be reordered! (WAW hazard!)
Load: read memory only if there is no uncompleted store that was issued earlier and shares the same data memory address with the load.
Store: write data only if there are no uncompleted loads or stores issued earlier that use the same data memory address as the store.
Slide 6-68
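The two memory-ordering rules above amount to a simple disambiguation check. A minimal sketch, assuming the earlier uncompleted loads/stores are tracked as sets of effective addresses (the data-structure choice is illustrative, not the hardware's):

```python
def load_may_execute(addr, earlier_stores):
    """A load may read memory only if no earlier uncompleted store
    targets the same address (would be a RAW hazard through memory)."""
    return addr not in earlier_stores

def store_may_execute(addr, earlier_loads, earlier_stores):
    """A store may write only if no earlier uncompleted load (WAR) or
    store (WAW) uses the same address."""
    return addr not in earlier_loads and addr not in earlier_stores
```

Loads and stores to distinct addresses pass both checks and may complete out of order.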
• 35. Instruction level parallelism
Dynamic Scheduling: Instructions Following Branches
It may take many clock cycles until we know whether a branch has been predicted correctly!
1. Instructions issued after a branch may complete before it.
⇒ The write-back stage of these instructions has to be stalled until we know whether the prediction was correct!
2. Exceptions: we have to ensure that exactly the same exceptions are handled as if the pipeline had used in-order execution and no branch prediction!
Simple solution: instructions following a branch are only issued; execution starts only after the branch prediction has turned out to be correct.
⇒ Can reduce the efficiency of a dynamically scheduled pipeline dramatically!
Slide 6-69
Instruction level parallelism
Speculative Execution
The write-result stage is split into two stages:
3. Write results:
• Instructions are executed as operands become available. Results are written into a reorder buffer (ROB).
• For each active instruction there is one entry in the ROB. Their order corresponds to the order in which the instructions were issued ⇒ the head of the ROB contains the result of the active instruction issued first.
• Subsequent instructions can read their operands from the ROB.
• Writes going to the register file and to memory are delayed until branch predictions turn out to be correct.
4. Commit:
• When an instruction that writes to memory or to the register file reaches the head of the ROB, its result is written. An exception is handled now if necessary!
• If the head of the ROB contains an incorrectly predicted branch, the ROB is flushed ⇒ results calculated by instructions following the branch are discarded!
The ROB restores the original order of instructions: in-order commitment.
Slide 6-70
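The commit stage above can be sketched as retiring entries strictly from the ROB head. The entry fields (`type`, `done`, `mispredicted`, …) are illustrative assumptions for the example; the key point is that architectural state is updated only at the head, and a mispredicted branch at the head squashes everything younger.

```python
from collections import deque

def commit(rob, regs):
    """Retire finished entries from the ROB head, in issue order."""
    while rob and rob[0]["done"]:
        head = rob.popleft()
        if head["type"] == "branch" and head["mispredicted"]:
            rob.clear()           # discard all younger, speculative results
            break
        if head["type"] == "reg":
            # architectural register state is updated only here, at commit
            regs[head["dest"]] = head["value"]

rob = deque([
    {"type": "reg", "dest": "f0", "value": 1.5, "done": True},
    {"type": "branch", "mispredicted": True, "done": True},
    {"type": "reg", "dest": "f4", "value": 9.9, "done": True},  # squashed
])
regs = {}
commit(rob, regs)
```

Here the first result commits, the mispredicted branch flushes the ROB, and the third (speculative) result never reaches the register file.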
• 36. Instruction level parallelism
Speculative Execution
[Figure: MIPS FP unit using Tomasulo‘s algorithm extended by a reorder buffer (ROB) – instruction queue (FIFO) fed from the instruction unit; FP registers; load/store buffers with address unit and store-address/store-data paths to memory; reservation stations (3 for the FP adders, 2 for the FP multipliers/dividers); operand buses; results broadcast on the common data bus (CDB).]
Slide 6-71
Instruction level parallelism
Multiple Issue Processors
Using multiple FUs, dynamic scheduling, branch prediction and speculation allows a CPI of nearly one to be achieved.
CPI < 1 is not possible because only one instruction is issued per clock cycle!
Further speedup: issue multiple instructions in one clock cycle (up to 8 in practice) ⇒ CPI < 1 becomes possible!
The sets of instructions issued in parallel are called instruction packets or issue packets.
Slide 6-72
• 37. Instruction level parallelism
Multiple Issue Processors
Multiple issue processors fall into two classes:
• Superscalar processors – instruction packets are generated by hardware, with either dynamic scheduling (hardware) or static scheduling (compiler)
• VLIW (very long instruction word) processors – instruction packets are generated by the compiler
Slide 6-73
Instruction level parallelism
Overview

| Name | Issue | Hazard detection | Scheduling | Distinguishing characteristics | Examples |
|---|---|---|---|---|---|
| superscalar (static) | dynamic | hardware | static (compiler) | in-order execution | Sun UltraSPARC II/III |
| superscalar (dynamic) | dynamic | hardware | dynamic | out-of-order execution | IBM Power PC |
| superscalar (speculative) | dynamic | hardware | dynamic with speculation | out-of-order execution with speculation | Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64III |
| VLIW | static | software (compiler) | static (compiler) | no hazards between issue packets | Trimedia, i860 |

Slide 6-74
• 38. Instruction level parallelism
Statically Scheduled Superscalar Processors
Example: dual-issue static superscalar processor
In one clock cycle we can issue
• one integer instruction (including load/store, branches and integer ALU operations) and
• one arithmetic FP instruction.
Only slight extensions of the hardware are necessary compared to a single-issue implementation with two FUs.
Typical for high-end embedded processors.
Slide 6-75
Instruction level parallelism
Statically Scheduled Dual-Issue Pipeline
Each issue packet consists of one integer and one FP instruction; each subsequent packet starts one clock cycle later:
Integer instruction: IF ID EX MEM WB
FP instruction: IF ID EX EX EX WB
Integer instruction: IF ID EX MEM WB
FP instruction: IF ID EX EX EX WB
Integer instruction: IF ID EX MEM WB
FP instruction: IF ID EX EX EX WB
Integer instruction: IF ID EX MEM WB
FP instruction: IF ID EX EX EX WB
A CPI of 0.5 is possible!
Slide 6-76
• 39. Instruction level parallelism
Multiple Issue Pipeline
To enable multiple issues per clock cycle we must also be able to fetch multiple instructions per cycle!
Example: a 4-way issue processor fetches the instructions stored at PC, PC+4, PC+8, PC+12 from memory ⇒ a wide bus to instruction memory is required!
Problem: what if one of these instructions is a branch?
1. Reading the branch target buffer and accessing instruction memory in one clock cycle would increase the cycle time.
2. If n instructions of the packet are allowed to be branches we would have to look up n instructions in the branch target buffer in parallel!
Typical simplification: single issue for branches.
Slide 6-77
Instruction level parallelism
Multiple Issue with Dynamic Pipelining (Tomasulo)
Example: superscalar processor with
• dual issue (single issue for branches)
• dynamic Tomasulo scheduling (no speculation, i.e. execution of instructions following a branch must be delayed until the branch condition is evaluated)
• one FP unit
• one FU for integer instructions, load/stores and branch condition testing
• a separate FU for branch address calculation
• several reservation stations/load-store buffers per FU: load/stores occupy the FU only during address calculation, branches only during condition testing; stores are allowed to execute even if the data to be stored is not available yet.
Loop: l.d $f0, 0($r1) # f0 := array element
add.d $f4, $f0, $f2 # add f2 to f0
s.d $f4, 0($r1) # store result
addi $r1, $r1, -8 # decrement pointer
bne $r1, $r2, Loop # repeat loop if r1 ≠ r2
Latency (number of cycles from the beginning of the execution step to the moment the result is available on the CDB):
integer operations: 1 cycle
load: 2 cycles (1 in EX stage + 1 in MEM stage)
FP operation: 3 cycles (in EX stage)
Slide 6-78
• 40. Instruction level parallelism
Multiple Issue with Dynamic Pipelining

| Iteration | Instruction | Issues at | Executes at | Memory access at | Write CDB at | Comment |
|---|---|---|---|---|---|---|
| 1 | l.d $f0, 0($r1) | 1 | 2 | 3 | 4 | first issue |
| 1 | add.d $f4, $f0, $f2 | 1 | 5–7 | | 8 | wait for l.d |
| 1 | s.d $f4, 0($r1) | 2 | 3 | 9 | | wait for add.d |
| 1 | addi $r1, $r1, -8 | 2 | 4 | | 5 | wait for ALU |
| 1 | bne $r1, $r2, Loop | 3 | 6 | | | wait for addi |
| 2 | l.d $f0, 0($r1) | 4 | 7 | 8 | 9 | wait for bne |
| 2 | add.d $f4, $f0, $f2 | 4 | 10–12 | | 13 | wait for l.d |
| 2 | s.d $f4, 0($r1) | 5 | 8 | 14 | | wait for add.d |
| 2 | addi $r1, $r1, -8 | 5 | 9 | | 10 | wait for ALU |
| 2 | bne $r1, $r2, Loop | 6 | 11 | | | wait for addi |
| 3 | l.d $f0, 0($r1) | 7 | 12 | 13 | 14 | wait for bne |
| 3 | add.d $f4, $f0, $f2 | 7 | 15–17 | | 18 | wait for l.d |
| 3 | s.d $f4, 0($r1) | 8 | 13 | 19 | | wait for add.d |
| 3 | addi $r1, $r1, -8 | 8 | 14 | | 15 | wait for ALU |
| 3 | bne $r1, $r2, Loop | 9 | 16 | | | wait for addi |

Slide 6-79
Instruction level parallelism
Resource Usage

| Clock cycle | Integer unit | FP unit | Data memory | CDB |
|---|---|---|---|---|
| 2 | 1 / l.d | | | |
| 3 | 1 / s.d | | 1 / l.d | |
| 4 | 1 / addi | | | 1 / l.d |
| 5 | | 1 / add.d | | 1 / addi |
| 6 | 1 / bne | 1 / add.d | | |
| 7 | 2 / l.d | 1 / add.d | | |
| 8 | 2 / s.d | | 2 / l.d | 1 / add.d |
| 9 | 2 / addi | | 1 / s.d | 2 / l.d |
| 10 | | 2 / add.d | | 2 / addi |
| 11 | 2 / bne | 2 / add.d | | |
| 12 | 3 / l.d | 2 / add.d | | |
| 13 | 3 / s.d | | 3 / l.d | 2 / add.d |
| 14 | 3 / addi | | 2 / s.d | 3 / l.d |
| 15 | | 3 / add.d | | 3 / addi |
| 16 | 3 / bne | 3 / add.d | | |
| 17 | | 3 / add.d | | |
| 18 | | | | 3 / add.d |
| 19 | | | 3 / s.d | |
| 20 | | | | |

Slide 6-80
• 41. Instruction level parallelism
Example
The CPI is significantly greater than 0.5. Problems:
1. The integer unit is used for memory address calculation, for incrementing the pointer and for the condition test ⇒ branch execution is delayed by one cycle.
Possible solution: an additional integer FU.
2. The execution step of an instruction following a branch has to be delayed until the branch is executed.
Possible solution: speculative execution.
Example: dual-issue processor with speculative execution. To achieve a CPI < 1 we must allow two instructions to commit in parallel ⇒ more buses required.
Slide 6-81
Instruction level parallelism
Compiler Techniques
Observation: if branch prediction is perfect, loops are unrolled automatically by the hardware – operations that belong to different iterations of the loop overlap.
Loops may also be unrolled in advance by the compiler!
⇒ Improves performance for processors without speculative execution.
Loop before unrolling:
Loop: lw $t0, 0($s1)
add $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
bne $s1, $zero, Loop
Loop after unrolling (by a factor of 4):
Loop: addi $s1, $s1, -16
lw $t0, 16($s1)
add $t0, $t0, $s2
sw $t0, 16($s1)
lw $t1, 12($s1)
add $t1, $t1, $s2
sw $t1, 12($s1)
lw $t2, 8($s1)
add $t2, $t2, $s2
sw $t2, 8($s1)
lw $t3, 4($s1)
add $t3, $t3, $s2
sw $t3, 4($s1)
bne $s1, $zero, Loop
Register renaming ($t0 … $t3) is done by the compiler!
Slide 6-82
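The 4× unrolling above can be illustrated with a Python analogue of the same array loop (add a scalar to every element, walking backwards). This is only a sketch of the transformation, not MIPS semantics: four copies of the body per trip, distinct elements playing the role of the renamed registers $t0–$t3, and a single index update per trip replacing the four addi instructions.

```python
def add_scalar_rolled(a, s):
    """One element per iteration, like the loop before unrolling."""
    i = len(a)
    while i > 0:
        a[i - 1] += s
        i -= 1
    return a

def add_scalar_unrolled(a, s):
    """Four copies of the body per iteration, like the loop after unrolling."""
    i = len(a)
    while i >= 4:
        a[i - 1] += s       # four independent "registers" per trip,
        a[i - 2] += s       # so a scheduler can overlap these operations
        a[i - 3] += s
        a[i - 4] += s
        i -= 4              # single induction-variable update per trip
    while i > 0:            # epilogue for leftover elements
        a[i - 1] += s
        i -= 1
    return a
```

Note the epilogue loop: a real compiler needs it too whenever the trip count is not a multiple of the unroll factor, a detail the slide's 4× example leaves implicit.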
• 42. Instruction level parallelism
Summary
Superscalar processors determine during program execution how many instructions are issued in one clock cycle.
Statically scheduled:
• must detect dependences in instruction packets and resolve them by inserting stalls
• needs assistance from the compiler to achieve a high amount of parallelism
• simple hardware
Dynamically scheduled:
• requires less assistance from the compiler
• hardware is much more complex
Slide 6-83
Instruction level parallelism
Static Multiple Issue – The VLIW Approach
For highly superscalar processors the hardware becomes very complex.
Idea: let the compiler do as much work as possible!
VLIW approach (used for digital signal processing, DSP): the compiler groups instructions with no dependences between them that may be executed in parallel into a „very long instruction word“ (VLIW).
⇒ No hardware for hazard detection and scheduling is necessary.
Does the program contain enough parallelism? The compiler has to find enough parallelism to use the full capacity of all functional units!
Local scheduling: scheduling inside sequences of instructions without branches (= basic blocks).
Global scheduling: scheduling across several basic blocks.
Slide 6-84
• 43. Instruction level parallelism
Example
For VLIW processors one instruction must explicitly contain all operations that are executed in parallel. Therefore VLIW processors are sometimes also called EPICs (explicitly parallel instruction computers).
Consider a VLIW processor with:
• 2 FUs for memory access (2 cycles for EX)
• 2 FUs for FP operations (pipelined, 3 cycles for EX)
• 1 FU for integer operations and branches (1 cycle).
Create a schedule for 7 iterations using loop unrolling. Branches have zero latency.
Loop before unrolling:
Loop: lw.d $f0, 0($r1)
add.d $f4, $f0, $f2
sw.d $f4, 0($r1)
addi $r1, $r1, -8
bne $r1, $r2, Loop
After unrolling:
Loop: lw.d $f0, 0($r1)
add.d $f4, $f0, $f2
sw.d $f4, 0($r1)
lw.d $f6, -8($r1)
add.d $f8, $f6, $f2
sw.d $f8, -8($r1)
lw.d $f10, -16($r1)
add.d $f12, $f10, $f2
sw.d $f12, -16($r1)
lw.d $f14, -24($r1)
add.d $f16, $f14, $f2
sw.d $f16, -24($r1)
…
addi $r1, $r1, -56
bne $r1, $r2, Loop
Slide 6-85
Instruction level parallelism
Static Multiple Issue – The VLIW Approach
Each row corresponds to one VLIW instruction:

| Memory unit 1 | Memory unit 2 | FP unit 1 | FP unit 2 | Integer unit |
|---|---|---|---|---|
| lw.d $f0,0($r1) | lw.d $f6,-8($r1) | | | |
| lw.d $f10,-16($r1) | lw.d $f14,-24($r1) | | | |
| lw.d $f18,-32($r1) | lw.d $f22,-40($r1) | add $f4,$f0,$f2 | add $f8,$f6,$f2 | |
| lw.d $f26,-48($r1) | | add $f12,$f10,$f2 | add $f16,$f14,$f2 | |
| | | add $f20,$f18,$f2 | add $f24,$f22,$f2 | |
| sw.d $f4,0($r1) | sw.d $f8,-8($r1) | add $f28,$f26,$f2 | | |
| sw.d $f12,-16($r1) | sw.d $f16,-24($r1) | | | addi $r1,$r1,-56 |
| sw.d $f20,24($r1) | sw.d $f24,16($r1) | | | |
| sw.d $f28,8($r1) | | | | bne $r1,$r2,Loop |

Slide 6-86
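The slot-assignment idea behind the VLIW schedule above can be sketched in a few lines. This is a deliberately simplified greedy packer, not a real VLIW compiler: it honors only the per-unit slot limits from the slide (2 memory, 2 FP, 1 integer per word) and ignores dependences and latencies, which a real scheduler must also respect.

```python
# Slot limits per VLIW word, matching the example machine on this slide.
SLOTS = {"mem": 2, "fp": 2, "int": 1}

def pack(ops):
    """ops: list of (unit, text) pairs. Greedily place each operation into
    the earliest word that still has a free slot of the required unit."""
    words = []
    for unit, text in ops:
        for word in words:
            if sum(1 for u, _ in word if u == unit) < SLOTS[unit]:
                word.append((unit, text))
                break
        else:                              # no word had a free slot:
            words.append([(unit, text)])   # start a new VLIW word
    return words

ops = [
    ("mem", "lw.d $f0,0($r1)"),
    ("mem", "lw.d $f6,-8($r1)"),
    ("mem", "lw.d $f10,-16($r1)"),
    ("fp", "add $f4,$f0,$f2"),
    ("int", "addi $r1,$r1,-8"),
]
words = pack(ops)
```

With these five operations the packer fills one word with two loads, one add and the addi, and spills the third load into a second word, illustrating why the compiler must find enough independent operations to keep all slots busy.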