Unit 2 (contd.) and Unit 3 (voice-over PPT)
1. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-1
Chapter 3
Instruction-Level Parallelism
and Its Dynamic Exploitation
• Unit 2 contd…
• Unit 3
» Dr Reeja S R
» CSE Dept
» Dayananda Sagar University - SOE
2. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-2
Instruction Level Parallelism: Concepts and
Challenges
• Instruction-level parallelism (ILP)
– The potential of overlapping the execution of multiple
instructions is called instruction-level parallelism.
3. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-3
Techniques to Reduce Pipeline CPI
• Recall,
– Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW
stalls + WAR stalls + WAW stalls + Control stalls.
– Exploiting instruction-level parallelism reduces the number of
stalls.
– How to find ILP:
• Dynamically locate ILP by hardware
• Statically locate ILP by software
– Techniques that affect CPI (fig. 3.1 on page 173).
4. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-4
ILP Within and Across a Basic Block
• ILP within a basic block
– If the branch frequency is 15%~25%, there are only about 4 to 7
instructions within a basic block (roughly 1/branch frequency).
This implies that we must exploit ILP across basic blocks.
• Loop-Level Parallelism (ILP across basic blocks)
– Exploit parallelism among iterations of a loop.
5. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-5
Loop-Level Parallelism
– Parallelism among iterations of a loop.
• Example: for(I=1; I<=100; I++)
X[I]=X[I]+Y[I];
– Each iteration of the loop can overlap with any other iteration in
this example.
– Techniques for converting loop-level parallelism into ILP
• Loop unrolling
• Use of vector instructions (Appendix G)
– LOAD X; LOAD Y; ADD X, Y; STORE X
– Originally used in mainframes and supercomputers.
– Died away due to the effective use of pipelining in desktop and
server processors.
– Seeing a renaissance in graphics, DSP, and multimedia
applications.
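The loop-unrolling technique listed above can be illustrated with a minimal Python sketch of the slide's X[I] = X[I] + Y[I] loop (the names X and Y come from the slide; the unroll factor of 4 is an illustrative assumption):

```python
# Loop unrolling sketch: the slide's loop X[I] = X[I] + Y[I], I = 1..100,
# rewritten with an unroll factor of 4 so that four independent adds are
# exposed per iteration.
def add_unrolled(X, Y, n=100, factor=4):
    i = 0
    # main unrolled loop: four independent element-wise adds per trip
    while i + factor <= n:
        X[i]     += Y[i]
        X[i + 1] += Y[i + 1]
        X[i + 2] += Y[i + 2]
        X[i + 3] += Y[i + 3]
        i += factor
    # cleanup loop for leftover iterations when n % factor != 0
    while i < n:
        X[i] += Y[i]
        i += 1
    return X
```

Each group of four adds is independent, so a scheduler can overlap them; the cleanup loop handles trip counts that are not a multiple of the unroll factor.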
6. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-6
Data Dependence and Hazards
• To exploit ILP, instructions must be free of dependences
• A dependence indicates the possibility of a hazard,
– Determines the order in which results must be calculated, and
– Sets an upper bound on how much parallelism can possibly be
exploited.
• Overcome the limitation of dependence on ILP by
– Maintaining the dependence but avoiding a hazard,
– Eliminating a dependence by transforming the code.
• Dependence types
– Data dependence
• Creating RAW, WAR, and WAW hazards
– Name dependences
– Control dependences
7. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-7
Name Dependence
– Name dependences
• Occurs when two instructions use the same register or memory
location, called a name, but no data flows between the instructions
through that name.
– Two types of name dependences:
• Antidependence: Occurs when instruction j writes a register or
memory location that instruction i reads and instruction i is
executed first.
• Output dependence: Occurs when instruction i and instruction j
write the same register or memory location.
– Register renaming can be employed to eliminate name
dependences
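As a rough illustration of how register renaming removes name dependences, here is a minimal Python sketch (not the hardware mechanism itself; the instruction-tuple encoding and physical-register naming are assumptions for illustration):

```python
# Minimal register-renaming sketch: each instruction is (dest, src1, src2).
# Every write gets a fresh physical register, so WAR and WAW name
# dependences disappear and only true RAW dependences remain.
def rename(instrs):
    mapping = {}               # architectural -> current physical register
    fresh = iter(range(1000))  # supply of fresh physical register numbers
    out = []
    for dest, s1, s2 in instrs:
        # sources read the latest physical name (true dependence preserved)
        p1 = mapping.get(s1, s1)
        p2 = mapping.get(s2, s2)
        # destination gets a brand-new physical register (kills WAR/WAW)
        pd = f"p{next(fresh)}"
        mapping[dest] = pd
        out.append((pd, p1, p2))
    return out
```

After renaming, two writes to the same architectural register (an output dependence) land in different physical registers and can be reordered freely.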
8. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-8
Control Dependence
• A control dependence determines the ordering of an
instruction with respect to a branch instruction.
– Example: S1 is control dependent on p1 but not on p2;
S2 is control dependent on p2 but not on p1.
if p1 {
S1;
};
if p2 {
S2;
};
9. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-9
Two Constraints Imposed by Control
Dependences
– An instruction that is control dependent on a branch
cannot be moved before the branch so that its execution is
no longer controlled by the branch.
– An instruction that is not control dependent on a branch
cannot be moved after the branch so that its execution is
controlled by the branch.
10. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-10
How the Simple Pipeline in Appendix A
Preserves Control Dependence
– Instructions execute in order.
– Detection of control or branch hazards ensures that an
instruction that is control dependent on a branch is not
executed until the branch direction is known.
11. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-11
Can We Violate Control Dependence?
• Yes, we can
– If we can ensure that violating a control dependence does not
make the program incorrect, control dependence is not a critical
property that must be preserved.
– Instead, the two properties critical to program correctness,
namely exception behavior and data flow, are normally preserved
by maintaining both data and control dependences.
12. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-12
Preserving Exception Behavior
– Preserving the exception behavior means that any changes
in the ordering of instruction execution must not change
how exceptions are raised in the program.
• Often this is relaxed to mean that the reordering of instruction
execution must not cause any new exceptions in the program.
• Example
DADDU R2, R3, R4
BEQZ R2, L1
LW R1, 0(R2)
L1: …
What happens if LW is moved before BEQZ and causes a memory
exception while the branch is taken?
13. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-13
Preserving Data Flow
– The actual flows of data among instructions that produce
results and those that consume them must be preserved.
– Branch makes data flow dynamic (i.e., coming from
multiple points).
– Example
DADDU R1, R2, R3
BEQZ R4, L
DSUBU R1, R5, R6
L: …
OR R7, R1, R8
– “Preserving data flow” means that if branch is not taken, the value
of R1 computed by DSUBU is used by OR, otherwise, the value
of R1 computed by DADDU is used.
14. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-14
Speculation
• Check whether an instruction can be executed in violation
of a control dependence while still preserving the
exception behavior and the data flow.
• Example
DADDU R1, R2, R3
BEQZ R12, skipnext
DSUBU R4, R5, R6
DADDU R5, R4, R9
Skipnext: OR R7, R8, R9
– What about moving DSUBU before BEQZ if R4 were not
used on the taken path?
15. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-15
Overcoming Data Hazards with Dynamic
Scheduling
– Basic idea:
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
– SUB.D is stalled, yet it is not data dependent on anything.
– The major limitation of the pipelines introduced so far is
in-order issue of instructions.
– Allowing SUB.D to execute by dynamically scheduling the
instructions creates out-of-order execution, and thus
out-of-order completion.
16. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-16
Advantages and Problems of Dynamic
Scheduling
– Advantages
• Enables handling of cases where dependences are unknown at
compile time (e.g., when memory references are involved).
• Simplifies the compiler.
• Allows code compiled with one pipeline in mind to run
efficiently on a different pipeline.
– Problems
• It creates WAR and WAW hazards.
• It complicates exception handling due to out-of-order completion,
creating imprecise exceptions.
– The processor state when an exception is raised does not look
exactly as if the instructions were executed sequentially in strict
program order.
17. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-17
Support Dynamic Scheduling for the Simple
Five-Stage Pipeline
• Divide the ID stage into the following two stages:
– Issue: Decode instructions and check for structural
hazards.
– Read operands: Wait until no data hazards, then read
operands.
18. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-18
Dynamic Scheduling Algorithms
• Algorithms
– Scoreboarding, originated from CDC 6600 (Appendix A).
• Effective when there are sufficient resources and no data
dependence.
– Tomasulo's algorithm, originated from the IBM 360/91.
• Both algorithms can be applied to pipelining or
multi-functional units implementations.
19. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-19
Dynamic Scheduling Using Tomasulo’s
Approach
• Combine key elements of the scoreboarding scheme
with register renaming.
– Track when operands are available to minimize RAW hazards.
– Use register renaming to minimize WAR and WAW hazards.
21. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-21
Basic Architecture for Tomasulo’s Approach
22. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-22
Basic Ideas
– A reservation station (RS) fetches and buffers an operand
as soon as it is available.
– Pending instructions designate the RS that will provide
their inputs.
– When successive writes to a register appear, only the last
one is actually used to update the register.
– As instructions are issued, the register specifiers for
pending operands are renamed to the names of the RS, i.e.,
register renaming
• The functionality of register renaming is provided by
– The reservation stations (RS), which buffer the operands of
instructions waiting to issue.
– The issue logic
• Since there can be more RSs than real registers, the technique can
eliminate hazards that could not be eliminated by a compiler
23. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-23
What Does a Reservation Station Actually Hold?
– Instructions that have been issued and are awaiting
execution at a functional unit.
– The operands if available, otherwise, the source of the
operands.
– The information needed to control the execution of the
instruction at the unit.
– The load buffers and store buffers hold data or addresses
coming from and going to memory.
24. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-24
Steps in Tomasulo’s Approach
– Issue……get instruction from instruction queue
• Get an instruction from the floating-point queue. If it is a floating-point
operation, issue it if there is an empty RS, and send the operands to the
RS if they are in the registers. If it is a load or store, it can be issued if
there is an available buffer. If no hardware resource is available, the
instruction stalls.
– Execute…..operate on operands
• If one or more operands are not yet available, monitor the CDB to obtain
the required operands. When both operands are available, the instruction
is executed. This step checks for RAW hazards.
– Write result…..finish execution (WB)
• When the result is available, write it on the CDB and from there into the
registers and any RS waiting for this result.
– Commit…..update register or memory with the ROB result (speculative version only)
• When an instruction reaches the head of the ROB and its result is present,
update the register with the result (or store to memory) and remove the
instruction from the ROB.
• If an incorrectly predicted branch reaches the head of the ROB, flush the
ROB and restart at the correct successor of the branch.
• The above steps differ from Scoreboarding in the following
three aspects:
25. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-25
Data Structures
– Data structures used to detect and eliminate hazards are
attached to the RS, the register file, and the load and store
buffers.
• Every entry contains a tag field. The tags are essentially
names for an extended set of virtual registers used in renaming.
• In this example, the tag is a four-bit quantity that denotes one of
the five RSs or one of the six load buffers.
26. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-26
Fields in Data Structures
– Each RS has seven fields
• Op: The operation to perform.
• Qj, Qk: The RS that produces the corresponding source operand.
• Vj, Vk: The value of the source operands.
• Busy: Indicates the RS and its corresponding functional unit are
occupied.
• A: Used to hold information for memory address calculation for a
load or store.
– The register file and store buffer each have a field, Qi:
• Qi:The number of the RS that contains the operation whose result
should be stored into this register or into memory.
– The load and store buffers each require a busy field. The
store buffer has a field A, which holds the result of the
effective address.
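The seven RS fields just listed can be summarized as a small data structure; this Python sketch simply mirrors the slide's field names (the types are illustrative assumptions):

```python
# The seven reservation-station fields from the slide, as a data-structure
# sketch. Field names follow the slide; types are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    Op: Optional[str] = None    # operation to perform
    Qj: Optional[str] = None    # RS producing the first source (None if Vj valid)
    Qk: Optional[str] = None    # RS producing the second source (None if Vk valid)
    Vj: Optional[float] = None  # value of the first source operand
    Vk: Optional[float] = None  # value of the second source operand
    Busy: bool = False          # RS and its functional unit are occupied
    A: Optional[int] = None     # address information for a load or store
```

An instruction is ready to execute exactly when both Qj and Qk are None, i.e., when both values have arrived in Vj and Vk.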
27. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-27
Example for Tomasulo’s Approach (1)
• Fig. 3.3 on page 190
28. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-28
Example for Tomasulo’s Approach (1)
• Fig. 3.4 on page 192
29. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-29
Advantage of Tomasulo’s Approach over
Scoreboarding
• The distribution of hazard detection over the RSs
• The elimination of stalls for WAW and WAR
hazards
30. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-30
Tomasulo’s Algorithm: A Loop-Based
Example
– By using reservation stations, a loop can be dynamically
unrolled. Assume the following loop has been issued in
two successive iterations, but none of the floating-point
loads/stores or operations has completed (fig. 3.6 on page
194).
31. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-31
A Loop-Based Example
32. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-32
Dynamic Disambiguation of Addresses
– If the load address matches a store-buffer address, the load
must wait until the store buffer receives its value; the load can
then take the value from the buffer or from memory. Because the
addresses differ here, this check allows the load in the second
iteration in fig. 3.6 to complete earlier than the store in the
first iteration.
– The key components for enhancing ILP in Tomasulo's
algorithm are dynamic scheduling, register renaming, and
dynamic memory disambiguation.
33. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-33
Reducing Branch Costs with Dynamic
Hardware Prediction
– Dynamic hardware branch prediction
• The prediction will change if the branch changes its behavior
while the program is running.
• The effectiveness of a branch prediction scheme depends
– not only on the accuracy,
– but also on the cost of a branch when the branch is correct and
when the prediction is not incorrect.
• The branch penalties depend on
– the structure of the pipeline,
– the type of predictor, and
– the strategies used for recovering from misprediction.
34. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-34
Basic Branch Prediction and Branch-
Prediction Buffer
– A branch prediction buffer is a small memory indexed by
the lower portion of the address of the branch instruction.
The memory contains bits that say whether the branch
was recently taken or not.
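Indexing by the lower portion of the branch address can be sketched as follows (a 4096-entry buffer and 4-byte instruction alignment are assumptions for illustration):

```python
# Branch-prediction buffer indexing sketch: the table is indexed by the
# low-order bits of the branch instruction's address.
ENTRIES = 4096  # illustrative buffer size

def bpb_index(pc):
    # drop the 2 alignment bits, then keep log2(ENTRIES) low-order bits
    return (pc >> 2) & (ENTRIES - 1)
```

Because only the low-order bits are used, two branches whose addresses differ only in the high-order bits alias to the same entry.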
35. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-35
The Simple One-Bit Prediction Scheme
– If a prediction is correct, the prediction bit is unchanged;
otherwise, it is inverted.
• Example on page 197 (misprediction rate 20%)
Correct? Prediction Instruction Taken/untaken
Y(es) T(aken) I1 T(aken)
Y T I2 T
…
Y T I9 T
N T I10 U(ntaken)
N U I11 T
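The example above can be reproduced with a few lines of Python: the pattern is a loop branch taken nine times then not taken, and in steady state the one-bit predictor mispredicts twice per loop (once on loop exit, once on re-entry), giving the 20% rate even though the branch is taken 90% of the time:

```python
# One-bit predictor simulation for the slide's example.
def one_bit_mispredictions(outcomes, init=True):
    pred, miss = init, 0
    for taken in outcomes:
        if pred != taken:
            miss += 1
            pred = taken   # flip the single prediction bit on a miss
    return miss

# loop branch pattern from the slide: taken 9 times, then not taken
loop = [True] * 9 + [False]
```

Running three loop passes gives 1 + 2 + 2 = 5 mispredictions: one in the first pass (exit only, since the bit starts at taken), then two per pass afterwards.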
36. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-36
Two-Bit Branch Prediction Scheme
– A prediction must miss twice before it is changed.
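A two-bit saturating counter can be sketched in Python to show this "miss twice before changing" behavior; states 0-1 predict not taken and 2-3 predict taken (the state encoding is an illustrative convention):

```python
# Two-bit saturating-counter predictor: a prediction must miss twice
# in a row before it flips.
def two_bit_run(outcomes, state=3):
    miss = 0
    for taken in outcomes:
        if (state >= 2) != taken:   # states 2-3 predict taken
            miss += 1
        # saturating increment on taken, decrement on not taken
        state = min(3, state + 1) if taken else max(0, state - 1)
    return miss
```

On the same loop pattern as the one-bit example (taken nine times, then not taken), the two-bit predictor misses only once per loop pass: the single not-taken outcome drops the counter from 3 to 2, which still predicts taken on re-entry.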
37. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-37
Accuracy of Two-Bit Branch Prediction Buffer (1)
• With 4096 entries
38. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-38
Accuracy of Two-Bit Branch Prediction Buffer (2)
39. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-39
Correlating Branch Predictor (1)
– Basic concept: the behavior of a branch depends on other branches.
if (aa == 2)
aa = 0;
if (bb == 2)
bb = 0;
if (aa != bb) {
• Equivalent code fragment
DADDIU R3, R1, #-2
BNEZ R3, L1 ; branch b1 (aa != 2)
DADD R1, R0, R0 ; aa = 0
L1: DADDIU R3, R2, #-2
BNEZ R3, L2 ; branch b2 (bb != 2)
DADD R2, R0, R0 ; bb = 0
L2: DSUB R3, R1, R2 ; R3 = aa-bb
BEQZ R3, L3 ; branch b3 (aa == bb)
• Branches b1 and b2 not taken implies b3 will be taken.
40. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-40
Correlating Branch Predictor (2)
• Branch predictors that use the behavior of other branches to
make a prediction are called correlating predictors or two-level
predictors.
– Consider the following simplified code fragment,
if (d == 0)
d = 1;
if (d == 1)
» Equivalent code fragment is
BNEZ R1, L1 ; branch b1 (d != 0)
DADDIU R1, R0, #1
L1: DADDIU R3, R1, #-1
BNEZ R3, L2 ; branch b2 (d != 1)
…
L2:
» If b1 is not taken, b2 will not be taken.
41. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-41
Possible Execution Sequence
• Fig. 3.10 on page 202
42. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-42
If Done by One-Bit Predictor
• Fig. 3.11 on page 202
43. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-43
One-Bit Predictor with One-Bit Correlation, i.e.,
(1,1) Predictor
• The first bit is the prediction if the last branch executed was
not taken; the second bit is the prediction if the last branch
executed was taken.
• The four possible combinations
– Fig. 3.12 on page 203
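A minimal simulation of a (1,1) predictor (a sketch; the per-branch table keyed by a branch name is an illustrative simplification) shows that, on the alternating d = 2, 0 pattern from the previous slides, only the two warm-up predictions miss:

```python
# (1,1) correlating predictor sketch: each branch keeps two 1-bit
# predictions, selected by whether the last branch executed was taken
# (a 1-bit global history).
def run_1_1(outcomes):
    bits = {}      # branch -> [pred_if_last_not_taken, pred_if_last_taken]
    last = False   # global history: was the last branch taken?
    miss = 0
    for branch, taken in outcomes:
        entry = bits.setdefault(branch, [False, False])
        if entry[last] != taken:
            miss += 1
            entry[last] = taken   # update only the selected 1-bit predictor
        last = taken
    return miss
```

With b1 and b2 alternating taken/not-taken together, every prediction after the first occurrence of each (branch, history) pair is correct, so the miss count stays at 2 no matter how many times the pattern repeats.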
44. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-44
If Done by (1,1) Predictor
• Fig. 3.13 on page 203
45. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-45
General Correlating Predictor
– An (m, n) predictor uses the behavior of the last m branches
to choose from 2^m branch predictors, each of which is an
n-bit predictor for a single branch.
• Examples on page 205.
46. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-46
A (2,2) Predictor
47. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-47
Comparison of Two-Bit Predictors
48. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-48
Tournament Predictors
• Actively combining local and global predictors.
• It is the most popular form of multilevel branch
predictor.
• A multilevel branch predictor uses several levels of
branch-prediction tables together with an algorithm
for choosing among the multiple predictors.
49. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-49
State Transition Diagram of a Tournament
Predictor
• 0/0: predictor1 is wrong/predictor2 is wrong
• 1/1: predictor1 is correct/predictor2 is correct
50. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-50
The Fraction of Predictions Done by Local
Predictor
51. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-51
Mis-prediction Rate for Three Predictors
52. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-52
Integrated Instruction Fetch Units
• Perform the following functions
– Integrated branch prediction
– Instruction pre-fetch
– Instruction memory access and buffering
53. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-53
Hardware-Based Speculation (Unit 2)
– Why?
• To resolve control dependence to increase ILP
– Overcoming control dependence is done by
• speculating on the outcome of branches and executing the
program as if our guesses were correct.
– Three key ideas combined in hardware-based speculation:
• Dynamic branch prediction,
• Speculative execution, and
• Dynamic scheduling.
– Instruction commit
• When an instruction is no longer speculative, we allow it to update
the register file or memory
– The key idea behind implementing speculations is to allow
instructions to execute out of order but to force them to
commit in order.
54. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-54
Extend the Tomasulo’s Approach to Support
Speculation
– Separate the bypassing of results among instructions from
the actual completion of an instruction.
• By doing so, the result of an instruction can be used by other
instructions without allowing the instruction to perform any
irrecoverable update, until the instruction is no longer speculative.
– A reorder buffer is employed to pass results among
instructions that may be speculated.
• The reorder buffer holds the results of an instruction between the
time the operation associated with the instruction completes and
the time the instruction commits.
• The store buffers in the original Tomasulo’s algorithm are
integrated into reorder buffer.
• The renaming function of reservation station (RS) is replaced by
the reorder buffer. Thus, a result is usually tagged by using the
reorder buffer entry number.
55. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-55
Data Structure for Reorder Buffer
– Each entry in the reorder buffer contains four fields:
• The instruction type field indicates whether the instruction is a
branch, store, or a register operation.
• The destination field supplies the register number or the memory
address, where the instruction result should be written.
• The value field is used to hold the value of the instruction results
until the instruction commits.
• Busy (ready) field
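The four ROB fields can be mirrored in a small data-structure sketch (field types are illustrative assumptions):

```python
# The four reorder-buffer fields from the slide as a sketch data structure.
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ROBEntry:
    itype: str = ""                          # branch, store, or register operation
    dest: Optional[Union[int, str]] = None   # register number or memory address
    value: Optional[float] = None            # result, held until commit
    ready: bool = False                      # result present (busy/ready field)
```

An entry commits only when it reaches the head of the buffer and its ready flag is set, which is what forces in-order commit.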
56. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-56
The Four Steps in Instruction Execution (1)
– Issue (dispatch)
• Issue an FP instruction if there is an empty RS and an empty slot
in the reorder buffer, send the operands to the RS if they are in
the registers or the reorder buffer, and update the control entries
to indicate the buffers are in use.
• The number of the reorder buffer allocated for the result is also
sent to the RS for tagging the result sent on the CDB.
• If all RSs are full or the reorder buffer is full, instruction
issue stalls.
– Execute
• The CDB is monitored for not-yet-available operands. When both
operands are available at a RS, execute the operation. This step
checks for RAW hazards.
57. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-57
The Four Steps in Instruction Execution (2)
– Write result
• When the result is available, write it on the CDB and then into
the reorder buffer and to any RSs waiting for this result. Mark
the RS as available.
– Commit
• When an instruction, other than a branch with incorrect
prediction, reaches the head of the reorder buffer and its result is
in the buffer, update the register with the result.
• When a branch with incorrect prediction, indicating a wrong
speculation, reaches the head of the reorder buffer, the reorder
buffer is flushed and execution is restarted at the correct successor
of the branch. If the branch was correctly predicted, the branch is
finished.
58. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-58
The Architecture with Speculation
59. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-59
Exception Handling
• Exceptions are handled by not recognizing them until the
instruction is ready to commit. Thus maintaining
precise exceptions is straightforward.
60. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-60
Example for Hardware-Based Speculation
• Fig. 3.30 on page 230
61. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-61
Comparison of with and without Speculation (1)
• Fig. 3.33 on page 236
62. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-62
Comparison of with and without Speculation (2)
• Fig. 3.34 on page 237
63. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-63
Studies of the Limitations of ILP
• Ideal hardware model
– With infinite number of physical registers for renaming
– With perfect branch prediction
– With perfect jump prediction
– Memory address alias
• All memory addresses are known exactly and a load can be moved
before a store provided that the addresses are not identical.
– Enough functional units
• The ILP limitation in the ideal hardware model is
due to data dependence.
64. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-64
Unit 3
ILP-2
Exploiting ILP Using Multiple Issue
and Static Scheduling
65. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-65
Taking Advantage of More ILP with Multiple
Issues
– How can we reduce the CPI to less than one?
• Multiple issues
– Allow multiple instructions to issue in a clock cycle.
– Multiple-issue processors:
• Superscalar processor
– Issues a varying number of instructions per clock cycle and may be
either statically scheduled by a compiler or dynamically
scheduled using techniques based on scoreboarding and
Tomasulo's algorithm.
• Very long instruction word (VLIW) processor
– Issues a fixed number of instructions formatted either as one large
instruction or as a fixed instruction packet. Inherently scheduled
by a compiler.
66. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-66
Statically Versus Dynamically Scheduled
Superscalar Processors
• Statically scheduled superscalar
– Instructions are issued in order and executed in order
– All pipeline hazards are checked at issue time
– Issues a varying number of instructions per clock cycle
• Dynamically scheduled superscalar
– Allow out-of-order execution.
67. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-67
Five Primary Approaches for Multiple-Issue
Processors
• Fig. 3.23 on page 216
68. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-68
A Statically Scheduled Superscalar MIPS
Processor
– Dual issue: one integer and one floating-point operation
• Some restrictions:
– The second instruction can be issued only if the first instruction
can be issued.
– If the second instruction depends on the first, it cannot
be issued.
• Influence on load dependency: Waste 3 instruction issuing slots.
– Waste one instruction issuing slot at current clock cycle.
– Waste two instruction issuing slots at next clock cycle.
• Influence on branch dependency: Waste 2 or 3 instruction issuing
slots.
– Depends on whether a branch must be the first instruction.
69. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-69
Dual-Issue Superscalar Pipeline in Operation
• Fig. 3.24
70. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-70
Possible Hazards
– Note: the integer operation may be a floating-point load,
move, or store.
– Possible (new) hazards:
• Structural hazard
– Occurs when an FP move, store, or load is paired with an FP
instruction and the FP register file does not provide enough read or
write ports.
• WAW, and
• RAW hazards
– Dependence on the instructions issued in the same clock cycle.
71. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-71
Exploiting ILP Using Dynamic Scheduling,
Multiple Issue, and Speculation
Multiple Instruction Issue with Dynamic Scheduling
– To support dual issue of instructions
• Separate data structures (for the reservation stations) for the
integer and floating-point registers are employed (this still prevents
issuing an FP move or load that depends on the FP instruction).
• Pipeline the issue stage so that it runs at twice the basic
clock rate. The first half issues the dependent move or load, while
the second half issues the floating-point instruction.
72. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-72
A Scenario of Dual Issues
73. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-73
Resource Usage Table
74. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-74
Factors Limiting the Performance of a Dual-Issue
Processor
– Limitations in multiple-issue processors
• Inherent limitations of ILP in programs
– It is difficult to find enough independent instructions to
keep the FP units fully utilized.
• The amount of overhead per loop iteration is very high
– Two out of five instructions (DADDIU and BNE)
• Control hazard
75. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-75
Advanced Techniques for Instruction
Delivery and Speculation (Unit 3)
• Branch-Target Buffers
– A branch-prediction cache that stores the predicted address of
the next instruction after a branch is called a branch-target buffer
or branch-target cache.
77. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-77
The steps involved in handling an
instruction with a branch-target buffer.
78. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-78
Penalties for Each Individual Situation
• Performance of branch-target buffer (Example on page 211).
• One variation of branch-target buffer:
– Store one or more target instructions instead of the predicted
address.
79. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-79
Return Address Predictor
– Optimizes indirect jumps, especially procedure calls
and returns
• Problem
– The accuracy of predicting return address by branch-target buffer
can be low if the procedure is called from multiple sites and the
calls from one site are not clustered in time.
• Solution
– Use a stack to buffer the most recent return addresses, pushing a
return address on the stack at a call and popping one off at a
return.
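The stack-based solution can be sketched in a few lines of Python (the fixed depth of 8 is an assumption; real return-address predictors use small circular buffers):

```python
# Return-address stack sketch: push the return address at a call,
# pop the predicted target at a return, as the slide describes.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)        # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def ret(self):
        # predicted target of the return; None if the stack is empty
        return self.stack.pop() if self.stack else None
```

Because each return pops the address pushed by its matching call, the prediction stays accurate even when a procedure is called from many different sites, which is exactly the case where the branch-target buffer does poorly.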
80. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-80
Prediction Accuracy
81. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-81
The Intel Pentium 4