Cs718min1 2008soln View

CSL 718
Architecture of High Performance Systems
Minor Test I Solution
2008

1. Consider the following architectural changes in a non-pipelined
processor that has a clock period of T ns and executes N
instructions to run a particular benchmark, with an average of C
cycles per instruction.
i) A new instruction is introduced which replaces a sequence of
operations occurring at several places in that benchmark.
ii) Pipelining is introduced.
iii) The stage with maximum propagation delay is split into two
stages.
For each of these changes, indicate how N, T, C, T*C, and
N*T*C are likely to change, giving reasons. If the new
instruction in i) is able to replace 75% of the instructions executed,
what is the upper bound on the possible performance improvement
from this change?

Solution:
i) N will decrease because multiple instructions are being
replaced by a single instruction. T and/or C are likely to go up
because the new instruction performs a more complex task,
which needs either more cycles or longer cycles that
accommodate more work.

Assuming that the CPI of the instructions replaced and that of
the instructions not replaced is the same, 25% of the execution
time remains unaffected. Suppose the remaining 75% of the
execution time is reduced by a factor k by using the new
instruction. Then the overall speedup is

$$\text{Speedup} = \frac{1}{0.25 + \dfrac{0.75}{k}}$$

As k grows without bound, this approaches 1/0.25 = 4, so the
speedup can be at most 4.
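
The bound can be checked numerically. The following is a minimal Python sketch; the function name `overall_speedup` and the sample values of k are illustrative assumptions, not part of the original solution.

```python
def overall_speedup(k, affected=0.75):
    """Overall speedup when a fraction `affected` of the execution
    time is reduced by a factor k and the rest is unchanged."""
    return 1.0 / ((1.0 - affected) + affected / k)


# Sample values of k (assumed for illustration only).
for k in (1, 2, 10, 1000):
    print(f"k = {k:5d}  speedup = {overall_speedup(k):.3f}")
```

However large k becomes, the result stays below 4, because the unaffected 25% of the execution time never shrinks.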

ii) Pipelining will lead to overlapped execution of instructions.
Therefore, C will decrease. Pipelining will ideally tend to make
C = 1, but because of hazards, it would usually be more than 1.
If pipeline stages correspond to the original break-up of
instructions into cycles, T will remain unchanged. N will
certainly remain unchanged as there is no change in the
instruction set.

iii) The stage with maximum propagation delay determines the
clock period T. Therefore, if this stage is split into two stages, T
will decrease (provided that there was no other stage with the
same propagation delay). This would also introduce an
additional cycle for the affected instructions. Therefore, C will
go up. N will remain unchanged as there is no change in the
instruction set.
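
The combined effect on T*C and N*T*C can be sketched with the relation execution time = N * C * T. All numbers in the Python sketch below are hypothetical, chosen only to show the direction of each change; they do not come from the question.

```python
def execution_time_ns(n, c, t_ns):
    """Total execution time = N * C * T (T in ns)."""
    return n * c * t_ns


# Hypothetical values, for illustration only.
base = execution_time_ns(n=1000, c=5.0, t_ns=10.0)       # non-pipelined
pipelined = execution_time_ns(n=1000, c=1.2, t_ns=10.0)  # ii) C falls towards 1, T unchanged
split = execution_time_ns(n=1000, c=1.3, t_ns=6.0)       # iii) T falls, C rises slightly

print(base, pipelined, split)
```

With these assumed values T*C falls in both ii) and iii). In general, iii) is a net gain only if the reduction in T outweighs the increase in C.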

2. A processor has a non-linear pipeline with 4 stages A, B, C and
D. Each instruction goes through the stages in the following
order: A B C B A D C. Find the bounds on the maximum
instruction throughput in a static hazard-free schedule.

Solution:
The reservation table for this pipeline is as follows.

Stage | 1 | 2 | 3 | 4 | 5 | 6 | 7
A     | X |   |   |   | X |   |
B     |   | X |   | X |   |   |
C     |   |   | X |   |   |   | X
D     |   |   |   |   |   | X |

Intervals (latencies) which cause collision are:
Row A – 4, Row B – 2, Row C – 4, Row D – none.
Therefore, the initial collision vector (bits for latencies 6 down
to 1) is 001010.

No. of 1’s in the initial collision vector = 2.
Therefore, minimum average latency ≤ 2+1 = 3
That is, maximum instruction throughput ≥ 1/3 instructions per
cycle.

Maximum number of checks in a row of the reservation table = 2
Therefore, minimum average latency ≥ 2
That is, maximum instruction throughput ≤ 1/2 instructions per
cycle.
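
The forbidden latencies, the collision vector and both bounds can be derived mechanically from the stage order. The following Python sketch is one way to do it; the helper names `forbidden_latencies` and `constant_cycle_ok` are assumptions made for illustration.

```python
def forbidden_latencies(order):
    """Return (forbidden latencies, stage -> list of time steps used)."""
    uses = {}
    for t, stage in enumerate(order):
        uses.setdefault(stage, []).append(t)
    forbidden = set()
    for times in uses.values():
        for i in range(len(times)):
            for j in range(i + 1, len(times)):
                forbidden.add(times[j] - times[i])
    return forbidden, uses


def constant_cycle_ok(latency, forbidden, max_latency):
    """True if initiating every `latency` cycles never collides."""
    return all(m not in forbidden
               for m in range(latency, max_latency + 1, latency))


order = "ABCBADC"                      # stage sequence from the question
forbidden, uses = forbidden_latencies(order)
n = len(order) - 1                     # largest latency that can collide

# Collision vector written for latencies n down to 1.
cv = "".join("1" if lat in forbidden else "0" for lat in range(n, 0, -1))

mal_upper = cv.count("1") + 1                         # number of 1's + 1
mal_lower = max(len(times) for times in uses.values())  # max marks in a row

print(sorted(forbidden), cv, mal_lower, mal_upper)
print(constant_cycle_ok(3, forbidden, n))
```

For this pipeline a constant latency of 3 is collision free, so an average latency of 3 (throughput 1/3) is achievable with a simple repeating schedule.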

3. Compute the number of cycles lost due to a branch hazard in a
pipelined processor with 5 stages – instruction fetch (IF), decode
(D), execute (EX), memory access (M) and write back (WB).
Assume that in a branch instruction, decision-making as well as
address calculation are completed in EX stage and also assume
that branches are taken 70% of the time. Consider the
following cases –
i) there is no delayed branch and no branch prediction,
ii) there is one delayed branch slot which is filled with a useful
instruction,
iii) branch is statically predicted to be taken,
iv) there is a branch target address buffer which is looked up in the
IF stage itself and a hit (or miss) in this buffer (assume 80% hit) is
used for predicting the branch to be taken (or not taken).

Solution: Instruction N is the branch instruction and T is the target
instruction. Instructions wrongly started and abandoned are shown
in red and those executed correctly are shown in green. Time slots
in which an instruction is stalled are shown as ██.
i) No delayed branch slot, no branch prediction
(a) branch not taken
N        IF|D |EX
N+1         IF|██|D |EX|M |WB
N+2            ██|IF|D |EX|M |WB
delay = 1
(b) branch taken
N        IF|D |EX
N+1/T       IF|██|IF|D |EX|M |WB
T+1            ██|██|IF|D |EX|M |WB
delay = 2
Average delay = 1*0.3 + 2*0.7 = 1.7

ii) One delayed branch slot, filled with useful instruction N+1

(a) branch not taken
N        IF|D |EX
N+1         IF|D |EX|M |WB
delay = 1

(b) branch taken
N        IF|D |EX
N+1         IF|D |EX|M |WB
T              ██|IF|D |EX|M |WB
T+1               ██|IF|D |EX|M |WB
delay = 1

Average delay = 1*0.3 + 1*0.7 = 1.0

iii) Branch statically predicted to be taken

(a) branch not taken (prediction incorrect)
N        IF|D |EX
N+1/T       IF|██|IF
delay = 1

(b) branch taken (prediction correct)
N        IF|D |EX
N+1/T       IF|██|IF|D |EX|M |WB
T+1            ██|██|IF|D |EX|M |WB
delay = 2
Average delay = 1*0.3 + 2*0.7 = 1.7
Here branch prediction offers no advantage, because target address
calculation and decision making are happening in the same stage.

iv) Branch target address buffer with 80% hit

(a) hit and branch not taken (prediction incorrect)
N        IF|D |EX
T/N+1       IF|D |IF|D |EX|M |WB
T+1/N+2        IF|██|IF|D |EX|M |WB
delay = 2

(b) hit and branch taken (prediction correct)
N        IF|D |EX
T           IF|D |EX|M |WB
T+1            IF|D |EX|M |WB
delay = 0

(c) miss and branch not taken (prediction correct)
N        IF|D |EX
N+1         IF|D |EX|M |WB
N+2            IF|D |EX|M |WB
delay = 0

(d) miss and branch taken (prediction incorrect)
N        IF|D |EX
N+1/T       IF|D |IF|D |EX|M |WB
N+2/T+1        IF|██|IF|D |EX|M |WB
delay = 2

Average delay = 0.8*(2*0.3 + 0*0.7) + 0.2*(0*0.3 + 2*0.7) =
0.8*0.6 + 0.2*1.4 = 0.76
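
The four averages are expected values over the taken/not-taken outcomes (and, in case iv, over buffer hit/miss). A short Python sketch recomputing them from the per-case delays above follows; the names `P_TAKEN`, `P_HIT` and `expected_delay` are illustrative assumptions.

```python
P_TAKEN = 0.7   # branches taken 70% of the time (from the question)
P_HIT = 0.8     # branch target buffer hit rate (from the question)


def expected_delay(delay_not_taken, delay_taken, p_taken=P_TAKEN):
    """Average delay in cycles over not-taken and taken branches."""
    return delay_not_taken * (1 - p_taken) + delay_taken * p_taken


case_i = expected_delay(1, 2)     # no delay slot, no prediction
case_ii = expected_delay(1, 1)    # one useful delay slot
case_iii = expected_delay(1, 2)   # statically predicted taken
# Case iv: a hit predicts taken, a miss predicts not taken.
case_iv = (P_HIT * expected_delay(2, 0)
           + (1 - P_HIT) * expected_delay(0, 2))

for name, value in [("i", case_i), ("ii", case_ii),
                    ("iii", case_iii), ("iv", case_iv)]:
    print(f"case {name}: {value:.2f} cycles")
```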

4. A processor with dynamic scheduling and issue bound operand
fetch has 3 execution units – one LOAD/STORE unit, one
ADD/SUB unit and one MUL/DIV unit. It has a reservation
station with 1 slot per execution unit and a single register file.
Starting with the following instruction sequence in the instruction
fetch buffer and empty reservation stations, for each instruction
find the cycle in which it will be issued and the cycle in which it
will write result.
Assume out of order issue and out of order execution.

Instruction sequence:
load R6, 34(R12)
load R2, 45(R13)
mul R0, R2, R4
sub R8, R2, R6
div R10, R0, R6
add R6, R8, R2

Execute cycles taken by different instructions are:
LOAD/STORE : 2
ADD/SUB : 1
MUL : 2
DIV : 4

Solution:
The following chart shows the execution of the given instruction
sequence cycle by cycle. The stages of instruction execution are
annotated as follows:
IF Instruction fetch
D Decode and issue
EX1 Execute in LOAD/STORE unit
EX2 Execute in ADD/SUB unit
EX3 Execute in MUL/DIV unit
WB Write back into register file and reservation stations
Instr ⇓ / cycle no. ⇒
cycle |   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |13 |14 |15 |16
load  |IF | D |EX1|EX1|WB |
load  |IF | • | • | • | D |EX1|EX1|WB |
mul   |IF | D | ○ | ○ | ○ | ○ | ○ | ○ |EX3|EX3|WB |
sub   |IF | D | ○ | ○ | ○ | ○ | ○ | ○ |EX2|WB |
div   |IF | • | • | • | • | • | • | • | • | • | D | ○ |EX3|EX3|EX3|EX3|WB
add   |IF | • | • | • | • | • | • | • | • | D | ○ |EX2|WB |

Cycles in which an instruction is waiting for a reservation station
are marked as • and the cycles in which an instruction is waiting for
one or more operands are marked as ○. As seen in the time chart,
the issue and write back cycles for various instructions are as
follows.
Instruction       | issue cycle | write back cycle
load R6, 34(R12)  | 1           | 4
load R2, 45(R13)  | 4           | 7
mul R0, R2, R4    | 1           | 10
sub R8, R2, R6    | 1           | 9
div R10, R0, R6   | 10          | 16
add R6, R8, R2    | 9           | 12
