Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-1
Chapter 3
Instruction-Level Parallelism
and Its Dynamic Exploitation
• Unit 2 contd…
• Unit 3
» Dr Reeja S R
» CSE Dept
» Dayananda Sagar University - SOE
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-2
Instruction Level Parallelism: Concepts and
Challenges
• Instruction-level parallelism (ILP)
– The potential of overlapping the execution of multiple
instructions is called instruction-level parallelism.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-3
Techniques to Reduce Pipeline CPI
• Recall,
– Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW
stalls + WAR stalls + WAW stalls + Control stalls.
– Exploiting instruction-level parallelism reduces the number of
stalls.
– How to find ILP
• Locate ILP dynamically, in hardware
• Locate ILP statically, in software (by the compiler)
– Techniques that affect CPI (fig. 3.1 on page 173).
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-4
ILP Within and Across a Basic Block
• ILP within a basic block
– If the branch frequency is 15%~25%, there are only about 4~7
instructions within a basic block. This implies that we
must exploit ILP across basic blocks.
• Loop-Level Parallelism (ILP across basic blocks)
– Exploit parallelism among the iterations of a loop.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-5
Loop-Level Parallelism
– Parallelism among iterations of a loop.
• Example: for(I=1; I<=100; I++)
X[I]=X[I]+Y[I];
– Each iteration of the loop can overlap with any other iteration in
this example.
– Techniques converting the loop-level parallelism into ILP
• Loop unrolling
• Use of vector instructions (Appendix G)
– LOAD X; LOAD Y; ADD X, Y; STORE X
– Originally used in mainframes and supercomputers.
– Died away due to the effective use of pipelining in desktop and
server processors.
– Seeing a renaissance for use in graphics, DSP, and multimedia
applications.
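A minimal C sketch of loop unrolling for the loop above (assuming the
trip count of 100 is a multiple of the unroll factor); the four
statements in the unrolled body are independent of one another, so the
hardware or the scheduler can overlap them:

void add_arrays(double X[101], double Y[101])
{
    /* original loop: for (I = 1; I <= 100; I++) X[I] = X[I] + Y[I]; */
    /* unrolled by 4: the four statements below are independent      */
    for (int I = 1; I <= 100; I += 4) {
        X[I]     = X[I]     + Y[I];
        X[I + 1] = X[I + 1] + Y[I + 1];
        X[I + 2] = X[I + 2] + Y[I + 2];
        X[I + 3] = X[I + 3] + Y[I + 3];
    }
}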
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-6
Data Dependence and Hazards
• To have ILP, instructions should have no dependence
• A dependence indicates the possibility of a hazard,
– Determines the order in which results must be calculated, and
– Sets an upper bound on how much parallelism can possibly be
exploited.
• Overcome the limitation of dependence on ILP by
– Maintaining the dependence but avoiding a hazard,
– Eliminating a dependence by transforming the code.
• Dependence types
– Data dependence
• Creating RAW, WAR, and WAW hazards
– Name dependences
– Control dependences
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-7
Name Dependence
– Name dependences
• Occurs when two instructions use the same register or memory
location, called a name, but there is no flow of data between the
instructions associated with that name.
– Two types of name dependences:
• Antidependence: Occurs when instruction j writes a register or
memory location that instruction i reads and instruction i is
executed first.
• Output dependence: Occurs when instruction i and instruction j
write the same register or memory location.
– Register renaming can be employed to eliminate name
dependences
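A small C fragment (hypothetical variable names) illustrating both kinds of
name dependence at the source level and how renaming the reused name removes
them; real hardware applies the same idea to registers:

void name_dependence_example(void)
{
    double a = 1.0, b = 2.0, c = 3.0, d = 4.0, t, u;

    t = a + b;   /* instruction i: reads a                                   */
    a = c * d;   /* instruction j: writes a -> antidependence (WAR) with i   */
    u = a + 1.0; /* uses j's value of a                                      */
    a = t - c;   /* instruction k: writes a again -> output dependence (WAW) */

    /* renaming j's destination to a new name a1 removes both name dependences */
    double a1;
    t  = a + b;
    a1 = c * d;
    u  = a1 + 1.0;
    a  = t - c;
    (void)u;     /* keep the compiler quiet about the unused value           */
}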
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-8
Control Dependence
• A control dependence determines the ordering of an
instruction with respect to a branch instruction.
– Example: S1 is control dependent on p1, but not on p2.
if p1 {
S1;
};
if p2 {
S1;
};
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-9
Two Constraints Imposed by Control
Dependences
– An instruction that is control dependent on a branch
cannot be moved before the branch so that its execution is
no longer controlled by the branch.
– An instruction that is not control dependent on a branch
cannot be moved after the branch so that its execution is
controlled by the branch.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-10
How the Simple Pipeline in Appendix A
Preserves Control Dependence
– Instructions execute in order.
– Detection of control or branch hazards ensures that an
instruction that is control dependent on a branch is not
executed until the branch direction is known.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-11
Can We Violate Control Dependence?
• Yes, we can
– If we can ensure that violating a control dependence will not make
the program incorrect, then the control dependence itself is not the
critical property that must be preserved.
– Instead, the two properties critical to program correctness are the
exception behavior and the data flow, which are normally preserved by
maintaining both data and control dependences.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-12
Preserving Exception Behavior
– Preserving the exception behavior means that any changes
in the ordering of instruction execution must not change
how exceptions are raised in the program.
• Often this is relaxed to mean that the reordering of instruction
execution must not cause any new exceptions in the program.
• Example
DADDU R2, R3, R4
BEQZ R2, L1
LW R1, 0(R2)
L1: …
What if LW is moved before BEQZ and the load raises a memory
exception when the branch is taken? The original program would never
execute the load in that case, so the reordering could introduce a new
exception.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-13
Preserving Data Flow
– The actual flows of data among instructions that produce
results and those that consume them must be preserved.
– Branch makes data flow dynamic (i.e., coming from
multiple points).
– Example
DADDU R1, R2, R3
BEQZ R4, L
DSUBU R1, R5, R6
L: …
OR R7, R1, R8
– “Preserving data flow” means that if the branch is not taken, the value
of R1 computed by DSUBU is used by OR; otherwise, the value
of R1 computed by DADDU is used.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-14
Speculation
• Determine whether an instruction can be executed in violation of a
control dependence while still preserving the
exception behavior and the data flow.
• Example
DADDU R1, R2, R3
BEQZ R12, skipnext
DSUBU R4, R5, R6
DADDU R5, R4, R9
Skipnext: OR R7, R8, R9
– How about moving DSUBU before BEQZ if R4 were not
used in the taken path?
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-15
Overcoming Data Hazards with Dynamic
Scheduling
– Basic idea:
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
– SUB.D is stalled, even though it is not data dependent on anything.
– The major limitation of the pipelines introduced so far is in-order
issue of instructions.
– Allowing SUB.D to execute by dynamically scheduling the
instructions creates out-of-order execution, and thus
out-of-order completion.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-16
Advantages and Problems of Dynamic
Scheduling
– Advantages
• Enables handling of cases where dependences are unknown at
compile time (e.g., when memory references are involved).
• Simplifies the compiler.
• Allows code that was compiled with one pipeline in mind to run
efficiently on a different pipeline.
– Problems
• It creates WAR and WAW hazards.
• It complicates exception handling due to out-of-order completion,
creating imprecise exceptions.
– The processor state when an exception is raised does not look
exactly as if the instructions were executed sequentially in strict
program order.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-17
Support Dynamic Scheduling for the Simple
Five-Stage Pipeline
• Divide the ID stage into the following two stages:
– Issue: Decode instructions and check for structural
hazards.
– Read operands: Wait until no data hazards, then read
operands.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-18
Dynamic Scheduling Algorithms
• Algorithms
– Scoreboarding, originated from the CDC 6600 (Appendix A).
• Effective when there are sufficient resources and no data
dependences.
– Tomasulo's algorithm, originated from the IBM 360/91.
• Both algorithms can be applied to pipelined or
multiple-functional-unit implementations.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-19
Dynamic Scheduling Using Tomasulo’s
Approach
• Combine key elements of the scoreboarding scheme
with register renaming.
– Track when operands are available to minimize RAW hazards.
– Use register renaming to minimize WAR and WAW hazards.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-20
Concept of Register Renaming
• Code before renaming
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8
• Code after renaming
DIV.D F0, F2, F4
ADD.D S, F0, F8
S.D S, 0(R1)
SUB.D T, F10, F14
MUL.D F6, F10, T
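A minimal sketch, in C with made-up names, of the bookkeeping behind this
renaming: a map from each architectural register to the reservation station
that will produce its newest value, so later readers wait on a tag instead
of reading a stale register:

#define NUM_REGS 32

/* rename_map[r] == 0: register r has its value in the register file;
   otherwise it gives the reservation station that will produce the value */
static int rename_map[NUM_REGS];

/* called when an instruction writing register 'dest' is issued to RS 'rs' */
void rename_dest(int dest, int rs) {
    rename_map[dest] = rs;
}

/* called when RS 'rs' broadcasts its result for register 'dest';
   only the most recent writer of the register clears the mapping */
void result_written(int dest, int rs) {
    if (rename_map[dest] == rs)
        rename_map[dest] = 0;
}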
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-21
Basic Architecture for Tomasulo’s Approach
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-22
Basic Ideas
– A reservation station (RS) fetches and buffers an operand
as soon as it is available.
– Pending instructions designate the RS that will provide
their inputs.
– When successive writes to a register appear, only the last
one is actually used to update the register.
– As instructions are issued, the register specifiers for
pending operands are renamed to the names of the RS, i.e.,
register renaming
• The functionality of register renaming is provided by
– The reservation stations (RS), which buffer the operands of
instructions waiting to issue.
– The issue logic
• Since there can be more RSs than real registers, the technique can
eliminate hazards that could not be eliminated by a compiler.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-23
What Does a Reservation Station Actually Hold?
– Instructions that have been issued and are awaiting
execution at a functional unit.
– The operands if available, otherwise, the source of the
operands.
– The information needed to control the execution of the
instruction at the unit.
– The load buffers and store buffers hold data or addresses
coming from and going to memory.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-24
Steps in Tomasulo’s Approach
– Issue……get instruction from instruction queue
• Get an instruction from the floating point queue. If it is a floating point
operation, issue it if there is an empty RS, and send the operands to the
RS if they are in the registers. If it is a load or store , it can be issued if
there is an available buffer. If the hardware resource is not available, the
instruction stalls.
– Execute…..operate on the operands
• If one or more operands are not yet available, monitor the CDB for
the required operands. When both operands are available, the instruction
is executed. This step checks for RAW hazards.
– Write result …..finish execution(WB)
• When the result is available, write it on the CDB and from there into the
registers and any RS waiting for this result.
- Commit…..Update register or memory with ROB result
• When an instruction reaches the head of the ROB and its result is present,
update the register with the result or store to memory, and remove the instruction from the ROB.
• If an incorrectly predicted branch reaches the head of ROB, flush the
ROB and restart at correct successor of branch.
• The above steps differ from Scoreboarding in the following
three aspects:
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-25
Data Structures
– Data structures used to detect and eliminate hazards are
attached to the RS, the register file, and the load and store
buffers.
• Each of these structures has a tag field per entry. The tags are essentially
names for an extended set of virtual registers used in renaming.
• In this example, the tag is a four-bit quantity that denotes one of
the five RSs or one of the six load buffers.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-26
Fields in Data Structures
– Each RS has seven fields:
• Op: The operation to perform.
• Qj, Qk: The RS that produces the corresponding source operand.
• Vj, Vk: The value of the source operands.
• Busy: Indicates the RS and its corresponding functional unit are
occupied.
• A: Used to hold information for memory address calculation for a
load or store.
– The register file and store buffer each have a field, Qi:
• Qi:The number of the RS that contains the operation whose result
should be stored into this register or into memory.
– The load and store buffers each require a busy field. The
store buffer has a field A, which holds the result of the
effective address.
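The fields above can be summarized with a small C struct (a sketch; the
types and widths are illustrative, not from the text):

typedef struct {
    int    busy;     /* this RS and its functional unit are occupied       */
    int    op;       /* operation to perform on vj and vk                  */
    int    qj, qk;   /* tags of the RSs producing the source operands      */
                     /* (0 means the value is already present in vj/vk)    */
    double vj, vk;   /* values of the source operands, once available      */
    long   a;        /* address information for a load or store            */
} ReservationStation;

typedef struct {
    int    qi;       /* RS whose result should be written to this register */
                     /* (0 means no pending write)                          */
    double value;
} RegisterStatus;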
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-27
Example for Tomasulo’s Approach (1)
• Fig. 3.3 on page 190
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-28
Example for Tomasulo’s Approach (2)
• Fig. 3.4 on page 192
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-29
Advantage of Tomasulo’s Approach over
Scoreboarding
• The distribution of hazard detection over the RSs
• The elimination of stalls for WAW and WAR
hazards
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-30
Tomasulo’s Algorithm: A Loop-Based
Example
– By using reservation stations, a loop can be dynamically
unrolled. Assume the following loop has been issued in
two successive iterations, but none of the floating-point
loads, stores, or operations has completed (fig. 3.6 on page
194).
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-31
A Loop-Based Example
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-32
Dynamic Disambiguation of Addresses
– If the load address matches a store-buffer address, we
must stop and wait until the store buffer gets its value; we
can then use that value or get it from memory. Because the
addresses differ here, this check allows the load in the second
iteration of fig. 3.6 to complete earlier than the store in the first
iteration.
– The key components for enhancing ILP in Tomasulo's
algorithm are dynamic scheduling, register renaming, and
dynamic memory disambiguation.
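A sketch of the disambiguation check itself (hypothetical structures; a real
implementation would also respect the program order of the stores): a load
whose effective address matches a busy store-buffer entry must wait for, or
forward from, that store; otherwise it may access memory early:

#define NUM_STORE_BUFFERS 6

typedef struct {
    int    busy;
    long   addr;       /* effective address of the pending store           */
    int    has_value;  /* store data has already arrived                   */
    double value;
} StoreBuffer;

/* returns 1 if the load may proceed; *forwarded is set when the value can
   be taken directly from a completed store-buffer entry                   */
int load_may_proceed(const StoreBuffer sb[], long load_addr,
                     int *forwarded, double *value)
{
    *forwarded = 0;
    for (int i = 0; i < NUM_STORE_BUFFERS; i++) {
        if (sb[i].busy && sb[i].addr == load_addr) {
            if (!sb[i].has_value)
                return 0;          /* conflict: wait for the store          */
            *forwarded = 1;        /* RAW through memory: forward the value */
            *value = sb[i].value;
            return 1;
        }
    }
    return 1;                      /* no conflict: access memory now        */
}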
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-33
Reducing Branch Costs with Dynamic
Hardware Prediction
– Dynamic hardware branch prediction
• The prediction will change if the branch changes its behavior
while the program is running.
• The effectiveness of a branch prediction scheme depends
– not only on the accuracy,
– but also on the cost of a branch when the prediction is correct and
when it is incorrect.
• The branch penalties depend on
– the structure of the pipeline,
– the type of predictor, and
– the strategies used for recovering from misprediction.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-34
Basic Branch Prediction and Branch-
Prediction Buffer
– A branch prediction buffer is a small memory indexed by
the lower portion of the address of the branch instruction.
The memory contains bits that say whether the branch
was recently taken or not.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-35
The Simple One-Bit Prediction Scheme
– If a prediction is correct, the prediction bit remains unchanged;
otherwise, it is inverted.
• Example on page 197 (mis-prediction rate 20%)
Correct? Prediction Instruction Taken/untaken
Y(es) T(aken) I1 T(aken)
Y T I2 T
…
Y T I9 T
N T I10 U(ntaken)
N U I11 T
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-36
Two-Bit Branch Prediction Scheme
– A prediction must miss twice before it is changed.
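A minimal C sketch of one entry of such a predictor, kept as a 2-bit
saturating counter (values 0-1 predict not taken, 2-3 predict taken), which
gives exactly the miss-twice-before-changing behavior:

/* counter: 0 = strongly not taken, 1 = weakly not taken,
            2 = weakly taken,       3 = strongly taken     */
int predict_taken(unsigned counter) {
    return counter >= 2;
}

unsigned update_counter(unsigned counter, int taken) {
    if (taken)
        return counter < 3 ? counter + 1 : 3;   /* saturate at 3 */
    return counter > 0 ? counter - 1 : 0;       /* saturate at 0 */
}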
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-37
Accuracy of Two-Bit Branch Prediction Buffer (1)
• With 4096 entries
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-38
Accuracy of Two-Bit Branch Prediction Buffer (2)
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-39
Correlating Branch Predictor (1)
– Basic concept: the behavior of a branch depends on other branches.
if (aa == 2)
aa = 0;
if (bb == 2)
bb = 0;
if (aa != bb) {
• Equivalent code fragment
DSUBI R3, R1, #2
BNEZ R3, L1 ; branch b1 (aa != 2)
DADD R1, R0, R0 ; aa = 0
L1: DSUBI R3, R2, #2
BNEZ R3, L2 ; branch b2 (bb != 2)
DADD R2, R0, R0 ; bb = 0
L2: DSUB R3, R1, R2 ; R3 = aa-bb
BEQZ R3, L3 ; branch b3 (aa == bb)
• Branches b1 and b2 not taken implies b3 will be taken.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-40
Correlating Branch Predictor (2)
• Branch predictors that use the behavior of other branches to
make a prediction are called correlating predictors or two-level
predictors.
– Consider the following simplified code fragment,
if (d == 0)
d = 1;
if (d == 1)
» Equivalent code fragment is
BNEZ R1, L1 ; branch b1 (d != 0)
DADDIU R1, R0, #1
L1: DADDIU R3, R1, # -1
BNEZ R3, L2 ; branch b2 (d != 1)
…
L2:
» If b1 is not taken, b2 will not be taken.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-41
Possible Execution Sequence
• Fig. 3.10 on page 202
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-42
If Done by One-Bit Predictor
• Fig. 3.11 on page 202
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-43
One-Bit Predictor with One-Bit Correlation, i.e.,
(1,1) Predictor
• The first bit is the prediction used if the last branch executed was
not taken, and the second bit is the prediction used if the last
branch executed was taken.
• The four possible combinations
– Fig. 3.12 on page 203
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-44
If Done by (1,1) Predictor
• Fig. 3.13 on page 203
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-45
General Correlating Predictor
– An (m, n) predictor uses the behavior of the last m branches
to choose from 2^m branch predictors, each of which is an
n-bit predictor for a single branch.
• Examples on page 205.
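As an illustrative size calculation (numbers chosen here, not necessarily those of the textbook example): an (m, n) predictor with E entries selected by the branch address holds 2^m × n × E prediction bits, so a (2,2) predictor with 1K entries needs 2^2 × 2 × 1024 = 8K bits, the same storage as a simple 2-bit predictor with 4K entries.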
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-46
A (2,2) Predictor
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-47
Comparison of Two-Bit Predictors
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-48
Tournament Predictors
• Actively combining local and global predictors.
• It is the most popular form of multilevel branch
predictor.
• A multilevel branch predictor uses several levels of
branch-prediction tables together with an algorithm
for choosing among the multiple predictors.
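A sketch (illustrative C, assuming predictor 1 is the local predictor and
predictor 2 the global one) of the usual selection mechanism: a 2-bit chooser
per entry picks which prediction to use and is nudged toward whichever
predictor was right when the two disagree; the 0/0 and 1/1 cases on the next
slide leave it unchanged:

/* chooser: 0-1 prefer the local predictor, 2-3 prefer the global one */
int select_prediction(unsigned chooser, int local_pred, int global_pred) {
    return chooser >= 2 ? global_pred : local_pred;
}

unsigned update_chooser(unsigned chooser, int local_correct, int global_correct) {
    if (local_correct == global_correct)
        return chooser;                        /* 0/0 or 1/1: no change */
    if (global_correct)
        return chooser < 3 ? chooser + 1 : 3;  /* move toward the global */
    return chooser > 0 ? chooser - 1 : 0;      /* move toward the local  */
}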
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-49
State Transition Diagram of a Tournament
Predictor
• 0/0: predictor1 is wrong/predictor2 is wrong
• 1/1: predictor1 is correct/predictor2 is correct
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-50
The Fraction of Predictions Done by Local
Predictor
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-51
Mis-prediction Rate for Three Predictors
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-52
Integrated Instruction Fetch Units
• Perform the following functions
– Integrated branch prediction
– Instruction pre-fetch
– Instruction memory access and buffering
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-53
Hardware-Based Speculation(Unit 2)
– Why?
• To resolve control dependence to increase ILP
– Overcoming control dependence is done by
• Speculating on the outcome of branches and executing the
program as if our guesses were correct.
– Three key ideas combined in hardware-based speculation:
• Dynamic branch prediction,
• Speculative execution, and
• Dynamic scheduling.
– Instruction commit
• When an instruction is no longer speculative, we allow it to update
the register file or memory
– The key idea behind implementing speculations is to allow
instructions to execute out of order but to force them to
commit in order.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-54
Extend Tomasulo’s Approach to Support
Speculation
– Separate the bypassing of results among instructions from
the actual completion of an instruction.
• By doing so, the result of an instruction can be used by other
instructions without allowing the instruction to perform any
irrecoverable update until it is no longer speculative.
– A reorder buffer is employed to pass results among
instructions that may be speculated.
• The reorder buffer holds the results of an instruction between the
time the operation associated with the instruction completes and
the time the instruction commits.
• The store buffers of the original Tomasulo's algorithm are
integrated into the reorder buffer.
• The renaming function of reservation station (RS) is replaced by
the reorder buffer. Thus, a result is usually tagged by using the
reorder buffer entry number.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-55
Data Structure for Reorder Buffer
– Each entry in the reorder buffer contains four fields:
• The instruction type field indicates whether the instruction is a
branch, store, or a register operation.
• The destination field supplies the register number or the memory
address, where the instruction result should be written.
• The value field is used to hold the value of the instruction results
until the instruction commits.
• Busy (ready) field
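The four fields map onto a small C struct (a sketch; the types are
illustrative):

typedef enum { ROB_BRANCH, ROB_STORE, ROB_REG_OP } InstrType;

typedef struct {
    InstrType type;    /* branch, store, or register operation            */
    long      dest;    /* destination register number or memory address   */
    double    value;   /* result held here until the instruction commits  */
    int       ready;   /* busy/ready: the result has been computed        */
} ROBEntry;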
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-56
The Four Steps in Instruction Execution (1)
– Issue (dispatch)
• Issue an FP instruction if there is an empty RS and an empty slot
in the reorder buffer, send the operands to the RS if they are in
the registers or the reorder buffer, and update the control entries
to indicate the buffers are in use.
• The number of the reorder buffer allocated for the result is also
sent to the RS for tagging the result sent on the CDB.
• If either all RSs are full or the reorder buffer is full, instruction
issue is stalled.
– Execute
• The CDB is monitored for not-yet-available operands. When both
operands are available at a RS, execute the operation. This step
checks for RAW hazards.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-57
The Four Steps in Instruction Execution (2)
– Write result
• When the result is available, write it on the CDB and then into
the reorder buffer and to any RSs waiting for this result. Mark
the RS as available.
– Commit
• When an instruction, other than a branch with incorrect
prediction, reaches the head of the reorder buffer and its result is
in the buffer, update the register with the result.
• When a branch with incorrect prediction, indicating a wrong
speculation, reaches the head of the reorder buffer, the reorder
buffer is flushed and execution is restarted at the correct successor
of the branch. If the branch was correctly predicted, the branch is
finished.
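A sketch of the in-order commit logic implied by these two rules, reusing the
ROBEntry struct sketched earlier; the helper routines are hypothetical and
stand in for machine-specific actions:

/* hypothetical helpers, not defined here */
int  branch_mispredicted(const ROBEntry *e);
void flush_rob_and_restart(const ROBEntry *e);
void write_register(long reg, double value);
void write_memory(long addr, double value);

/* commit the instruction at the head of the reorder buffer, if possible */
void try_commit(ROBEntry rob[], int *head, int size)
{
    ROBEntry *e = &rob[*head];
    if (!e->ready)
        return;                             /* head not finished: wait        */

    if (e->type == ROB_BRANCH && branch_mispredicted(e)) {
        flush_rob_and_restart(e);           /* wrong speculation: flush and   */
        return;                             /* refetch the correct successor  */
    }
    if (e->type == ROB_STORE)
        write_memory(e->dest, e->value);    /* memory updated only at commit  */
    else if (e->type == ROB_REG_OP)
        write_register(e->dest, e->value);  /* register updated only at commit */

    *head = (*head + 1) % size;             /* in-order retirement            */
}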
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-58
The Architecture with Speculation
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-59
Exception Handling
• Exceptions are handled by not recognizing the
exception until the instruction that caused it is ready to commit.
Thus maintaining precise exceptions is straightforward.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-60
Example for Hardware-Based Speculation
• Fig. 3.30 on page 230
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-61
Comparison of with and without Speculation (1)
• Fig. 3.33 on page 236
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-62
Comparison of with and without Speculation (2)
• Fig. 3.34 on page 237
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-63
Studies of the Limitations of ILP
• Ideal hardware model
– With an infinite number of physical registers for renaming
– With perfect branch prediction
– With perfect jump prediction
– With perfect memory address alias analysis
• All memory addresses are known exactly, and a load can be moved
before a store provided that the addresses are not identical.
– Enough functional units
• The ILP limitation in the ideal hardware model is
due to data dependence.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-64
Unit 3
ILP-2
Exploiting ILP Using Multiple Issue
and Static Scheduling
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-65
Taking Advantage of More ILP with Multiple
Issues
– How can we reduce the CPI to less than one?
• Multiple issues
– Allow multiple instructions to issue in a clock cycle.
– Multiple-issue processors:
• Superscalar processor
– Issues a varying number of instructions per clock cycle and may be
either statically scheduled by a compiler or dynamically
scheduled using techniques based on scoreboarding and
Tomasulo’s algorithm.
• Very long instruction word (VLIW) processor
– Issues a fixed number of instructions formatted either as one large
instruction or as a fixed instruction packet. Inherently scheduled
by a compiler.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-66
Statically Versus Dynamically Scheduled
Superscalar Processors
• Statically scheduled superscalar
– Instructions are issued in order and are executed in order
– All pipeline hazards are checked at issue time
– Dynamically issue a varying number of instructions
• Dynamically scheduled superscalar
– Allow out-of-order execution.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-67
Five Primary Approaches for Multiple-Issue
Processors
• Fig. 3.23 on page 216
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-68
A Statically Scheduled Superscalar MIPS
Processor
– Dual issue: one integer and one floating-point operation
• Some restrictions:
– The second instruction can be issued only if the first instruction can
be issued.
– If the second instruction depends on the first instruction, it cannot
be issued.
• Influence of a load dependency: wastes 3 instruction issue slots.
– Wastes one instruction issue slot in the current clock cycle.
– Wastes two instruction issue slots in the next clock cycle.
• Influence of a branch dependency: wastes 2 or 3 instruction issue
slots.
– Depends on whether a branch must be the first instruction.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-69
Dual-Issue Superscalar Pipeline in Operation
• Fig. 3.24
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-70
Possible Hazards
– Note: the integer operations may be floating-point loads,
moves, and stores.
– Possible hazards (new):
• Structural hazard
– Occurs when an FP move, store, or load is paired with an FP
instruction and the FP register file does not provide enough read or
write ports.
• WAW, and
• RAW hazards
– Dependences between the instructions issued in the same clock cycle.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-71
Exploiting ILP Using Dynamic Scheduling,
Multiple Issue, and Speculation
Multiple Instruction Issue with Dynamic Scheduling
– To support dual issues of instructions
• Separate data structures (reservation stations) for the
integer and floating-point registers are employed (this still prevents
issuing an FP move or load together with a dependent FP instruction).
• Pipeline the issue stage so that it runs at twice the basic
clock rate: the first half issues the load or move on which the FP
instruction depends, while the second half issues the floating-point instruction.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-72
A Scenario of Dual Issues
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-73
Resource Usage Table
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-74
Factors Limiting the Performance of the Dual-Issue
Processor
– Limitations in multiple-issue processors
• Inherent limitations of ILP in programs
– It is difficult to find a large number of independent instructions to
keep the FP units fully utilized.
• The amount of overhead per loop iteration is very high
– Two out of five instructions (DADDIU and BNE)
• Control hazard
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-75
Advanced Techniques for Instruction
Delivery and Speculation (Unit3)
• Branch-Target Buffers
– A branch-prediction cache that stores the predicted address of
the next instruction after a branch is called a branch-target buffer
or branch-target cache.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-76
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-77
The steps involved in handling an
instruction with a branch-target buffer.
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-78
Penalties for Each Individual Situation
• Performance of branch-target buffer (Example on page 211).
• One variation of branch-target buffer:
– Store one or more target instructions instead of the predicted
address.
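As an illustrative calculation with assumed numbers (the hit rate, accuracy, taken fraction, and penalties are chosen for this sketch, not taken from the text): with a 90% buffer hit rate, 90% prediction accuracy for branches found in the buffer, 60% of branches taken, and a 2-cycle penalty both for a misprediction and for a taken branch that misses in the buffer, the average penalty is (0.9 × 0.1 × 2) + (0.1 × 0.6 × 2) = 0.18 + 0.12 = 0.30 clock cycles per branch.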
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-79
Return Address Predictor
– Predicts the targets of indirect jumps, which arise especially
from procedure calls and returns
• Problem
– The accuracy of predicting return address by branch-target buffer
can be low if the procedure is called from multiple sites and the
calls from one site are not clustered in time.
• Solution
– Use a stack to buffer the most recent return addresses, pushing a
return address on the stack at a call and popping one off at a
return.
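A minimal C sketch of such a return-address stack (fixed depth; on overflow
the oldest entries are simply overwritten, which costs only prediction
accuracy, never correctness):

#define RAS_DEPTH 16

static long ras[RAS_DEPTH];
static int  ras_top = 0;            /* index of the next free slot          */

void ras_push(long return_addr)     /* on a call: remember the return point */
{
    ras[ras_top] = return_addr;
    ras_top = (ras_top + 1) % RAS_DEPTH;
}

long ras_pop(void)                  /* on a return: the predicted target    */
{
    ras_top = (ras_top - 1 + RAS_DEPTH) % RAS_DEPTH;
    return ras[ras_top];
}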
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-80
Prediction Accuracy
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-81
The Intel Pentium 4
Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-82
More Related Content

What's hot

Pipelining powerpoint presentation
Pipelining powerpoint presentationPipelining powerpoint presentation
Pipelining powerpoint presentationbhavanadonthi
 
Performance Enhancement with Pipelining
Performance Enhancement with PipeliningPerformance Enhancement with Pipelining
Performance Enhancement with PipeliningAneesh Raveendran
 
Pipelinig hazardous
Pipelinig hazardousPipelinig hazardous
Pipelinig hazardousjasscheema
 
Instruction pipeline: Computer Architecture
Instruction pipeline: Computer ArchitectureInstruction pipeline: Computer Architecture
Instruction pipeline: Computer ArchitectureInteX Research Lab
 
pipeline and pipeline hazards
pipeline and pipeline hazards pipeline and pipeline hazards
pipeline and pipeline hazards Bharti Khemani
 
Pipelining
PipeliningPipelining
PipeliningAJAL A J
 
Chapter 04 the processor
Chapter 04   the processorChapter 04   the processor
Chapter 04 the processorBảo Hoang
 
INSTRUCTION LEVEL PARALLALISM
INSTRUCTION LEVEL PARALLALISMINSTRUCTION LEVEL PARALLALISM
INSTRUCTION LEVEL PARALLALISMKamran Ashraf
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Dr.K. Thirunadana Sikamani
 
Pipelining , structural hazards
Pipelining , structural hazardsPipelining , structural hazards
Pipelining , structural hazardsMunaam Munawar
 
Computer Organozation
Computer OrganozationComputer Organozation
Computer OrganozationAabha Tiwari
 
Pipeline processing - Computer Architecture
Pipeline processing - Computer Architecture Pipeline processing - Computer Architecture
Pipeline processing - Computer Architecture S. Hasnain Raza
 
Pipelining of Processors
Pipelining of ProcessorsPipelining of Processors
Pipelining of ProcessorsGaditek
 

What's hot (20)

Instruction Pipelining
Instruction PipeliningInstruction Pipelining
Instruction Pipelining
 
Pipelining powerpoint presentation
Pipelining powerpoint presentationPipelining powerpoint presentation
Pipelining powerpoint presentation
 
pipelining
pipeliningpipelining
pipelining
 
Performance Enhancement with Pipelining
Performance Enhancement with PipeliningPerformance Enhancement with Pipelining
Performance Enhancement with Pipelining
 
Pipelinig hazardous
Pipelinig hazardousPipelinig hazardous
Pipelinig hazardous
 
Instruction pipeline: Computer Architecture
Instruction pipeline: Computer ArchitectureInstruction pipeline: Computer Architecture
Instruction pipeline: Computer Architecture
 
3 Pipelining
3 Pipelining3 Pipelining
3 Pipelining
 
Presentation on risc pipeline
Presentation on risc pipelinePresentation on risc pipeline
Presentation on risc pipeline
 
pipeline and pipeline hazards
pipeline and pipeline hazards pipeline and pipeline hazards
pipeline and pipeline hazards
 
Pipelining
PipeliningPipelining
Pipelining
 
Chapter 04 the processor
Chapter 04   the processorChapter 04   the processor
Chapter 04 the processor
 
pipelining
pipeliningpipelining
pipelining
 
INSTRUCTION LEVEL PARALLALISM
INSTRUCTION LEVEL PARALLALISMINSTRUCTION LEVEL PARALLALISM
INSTRUCTION LEVEL PARALLALISM
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
 
Pipelining , structural hazards
Pipelining , structural hazardsPipelining , structural hazards
Pipelining , structural hazards
 
CISC & RISC Architecture
CISC & RISC Architecture CISC & RISC Architecture
CISC & RISC Architecture
 
1.prallelism
1.prallelism1.prallelism
1.prallelism
 
Computer Organozation
Computer OrganozationComputer Organozation
Computer Organozation
 
Pipeline processing - Computer Architecture
Pipeline processing - Computer Architecture Pipeline processing - Computer Architecture
Pipeline processing - Computer Architecture
 
Pipelining of Processors
Pipelining of ProcessorsPipelining of Processors
Pipelining of Processors
 

Similar to Unit 2 contd. and( unit 3 voice over ppt)

CALecture3Module1.ppt
CALecture3Module1.pptCALecture3Module1.ppt
CALecture3Module1.pptBeeMUcz
 
Topic2a ss pipelines
Topic2a ss pipelinesTopic2a ss pipelines
Topic2a ss pipelinesturki_09
 
2. ILP Processors.ppt
2. ILP Processors.ppt2. ILP Processors.ppt
2. ILP Processors.pptShifaZahra7
 
Instruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesInstruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesDilum Bandara
 
Advanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPAdvanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPA B Shinde
 
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Databricks
 
UNIT 3 - General Purpose Processors
UNIT 3 - General Purpose ProcessorsUNIT 3 - General Purpose Processors
UNIT 3 - General Purpose ProcessorsButtaRajasekhar2
 
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...IDES Editor
 
Instruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsInstruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsJose Pinilla
 
Pipelining And Vector Processing
Pipelining And Vector ProcessingPipelining And Vector Processing
Pipelining And Vector ProcessingTheInnocentTuber
 
Tsn nfinn-control-and-config-0515-v03
Tsn nfinn-control-and-config-0515-v03Tsn nfinn-control-and-config-0515-v03
Tsn nfinn-control-and-config-0515-v03Jörgen Gade
 
Pipeline & Nonpipeline Processor
Pipeline & Nonpipeline ProcessorPipeline & Nonpipeline Processor
Pipeline & Nonpipeline ProcessorSmit Shah
 
Automating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency SpreadsAutomating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency SpreadsScyllaDB
 

Similar to Unit 2 contd. and( unit 3 voice over ppt) (20)

CALecture3Module1.ppt
CALecture3Module1.pptCALecture3Module1.ppt
CALecture3Module1.ppt
 
CA UNIT III.pptx
CA UNIT III.pptxCA UNIT III.pptx
CA UNIT III.pptx
 
Topic2a ss pipelines
Topic2a ss pipelinesTopic2a ss pipelines
Topic2a ss pipelines
 
2. ILP Processors.ppt
2. ILP Processors.ppt2. ILP Processors.ppt
2. ILP Processors.ppt
 
Instruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesInstruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler Techniques
 
Advanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPAdvanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILP
 
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
 
UNIT 3 - General Purpose Processors
UNIT 3 - General Purpose ProcessorsUNIT 3 - General Purpose Processors
UNIT 3 - General Purpose Processors
 
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
 
Instruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) LimitationsInstruction Level Parallelism (ILP) Limitations
Instruction Level Parallelism (ILP) Limitations
 
Arm
ArmArm
Arm
 
ES-CH5.ppt
ES-CH5.pptES-CH5.ppt
ES-CH5.ppt
 
IS.pptx
IS.pptxIS.pptx
IS.pptx
 
Pipelining And Vector Processing
Pipelining And Vector ProcessingPipelining And Vector Processing
Pipelining And Vector Processing
 
Tsn nfinn-control-and-config-0515-v03
Tsn nfinn-control-and-config-0515-v03Tsn nfinn-control-and-config-0515-v03
Tsn nfinn-control-and-config-0515-v03
 
Assembly p1
Assembly p1Assembly p1
Assembly p1
 
Lec1 final
Lec1 finalLec1 final
Lec1 final
 
Lect08
Lect08Lect08
Lect08
 
Pipeline & Nonpipeline Processor
Pipeline & Nonpipeline ProcessorPipeline & Nonpipeline Processor
Pipeline & Nonpipeline Processor
 
Automating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency SpreadsAutomating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency Spreads
 

More from Dr Reeja S R

Fundamentals of data network
Fundamentals of data networkFundamentals of data network
Fundamentals of data networkDr Reeja S R
 
Module ii continued
Module ii continuedModule ii continued
Module ii continuedDr Reeja S R
 
Sa unit-2-three-vignets
Sa unit-2-three-vignetsSa unit-2-three-vignets
Sa unit-2-three-vignetsDr Reeja S R
 
Architectural styles 3
Architectural styles   3Architectural styles   3
Architectural styles 3Dr Reeja S R
 
Architectural styles 2
Architectural styles   2Architectural styles   2
Architectural styles 2Dr Reeja S R
 
Architectural styles class 1
Architectural  styles class 1Architectural  styles class 1
Architectural styles class 1Dr Reeja S R
 
Importance of software architecture 1
Importance of software architecture 1Importance of software architecture 1
Importance of software architecture 1Dr Reeja S R
 
Architecture business cycle ( abc )
Architecture business cycle ( abc )Architecture business cycle ( abc )
Architecture business cycle ( abc )Dr Reeja S R
 
Architectural structures and views
Architectural structures and viewsArchitectural structures and views
Architectural structures and viewsDr Reeja S R
 
Software Architecture
Software ArchitectureSoftware Architecture
Software ArchitectureDr Reeja S R
 

More from Dr Reeja S R (17)

Fundamentals of data network
Fundamentals of data networkFundamentals of data network
Fundamentals of data network
 
Module iv
Module ivModule iv
Module iv
 
Module ii continued
Module ii continuedModule ii continued
Module ii continued
 
Module ii
Module iiModule ii
Module ii
 
Sa unit-2-three-vignets
Sa unit-2-three-vignetsSa unit-2-three-vignets
Sa unit-2-three-vignets
 
Case study 4
Case study 4Case study 4
Case study 4
 
Case study 3
Case study 3Case study 3
Case study 3
 
Case study 2
Case study 2Case study 2
Case study 2
 
Case study 1
Case study 1Case study 1
Case study 1
 
Architectural styles 3
Architectural styles   3Architectural styles   3
Architectural styles 3
 
Architectural styles 2
Architectural styles   2Architectural styles   2
Architectural styles 2
 
Architectural styles class 1
Architectural  styles class 1Architectural  styles class 1
Architectural styles class 1
 
Importance of software architecture 1
Importance of software architecture 1Importance of software architecture 1
Importance of software architecture 1
 
Ch2
Ch2Ch2
Ch2
 
Architecture business cycle ( abc )
Architecture business cycle ( abc )Architecture business cycle ( abc )
Architecture business cycle ( abc )
 
Architectural structures and views
Architectural structures and viewsArchitectural structures and views
Architectural structures and views
 
Software Architecture
Software ArchitectureSoftware Architecture
Software Architecture
 

Recently uploaded

Top Rated Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated  Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...Top Rated  Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...Call Girls in Nagpur High Profile
 
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Naicy mandal
 
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样qaffana
 
Develop Keyboard Skill.pptx er power point
Develop Keyboard Skill.pptx er power pointDevelop Keyboard Skill.pptx er power point
Develop Keyboard Skill.pptx er power pointGetawu
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Deira Dubai Escorts +0561951007 Escort Service in Dubai by Dubai Escort Girls
Deira Dubai Escorts +0561951007 Escort Service in Dubai by Dubai Escort GirlsDeira Dubai Escorts +0561951007 Escort Service in Dubai by Dubai Escort Girls
Deira Dubai Escorts +0561951007 Escort Service in Dubai by Dubai Escort GirlsEscorts Call Girls
 
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Book Sex Workers Available Pune Call Girls Yerwada 6297143586 Call Hot India...
Book Sex Workers Available Pune Call Girls Yerwada  6297143586 Call Hot India...Book Sex Workers Available Pune Call Girls Yerwada  6297143586 Call Hot India...
Book Sex Workers Available Pune Call Girls Yerwada 6297143586 Call Hot India...Call Girls in Nagpur High Profile
 
Top Rated Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Call Girls in Nagpur High Profile
 
Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...
Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...
Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...MOHANI PANDEY
 
9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...Pooja Nehwal
 
Top Rated Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated  Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...Top Rated  Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...Call Girls in Nagpur High Profile
 
CALL GIRLS IN Saket 83778-77756 | Escort Service In DELHI NcR
CALL GIRLS IN Saket 83778-77756 | Escort Service In DELHI NcRCALL GIRLS IN Saket 83778-77756 | Escort Service In DELHI NcR
CALL GIRLS IN Saket 83778-77756 | Escort Service In DELHI NcRdollysharma2066
 
Kothanur Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Bang...
Kothanur Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Bang...Kothanur Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Bang...
Kothanur Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Bang...amitlee9823
 
Vip Mumbai Call Girls Andheri East Call On 9920725232 With Body to body massa...
Vip Mumbai Call Girls Andheri East Call On 9920725232 With Body to body massa...Vip Mumbai Call Girls Andheri East Call On 9920725232 With Body to body massa...
Vip Mumbai Call Girls Andheri East Call On 9920725232 With Body to body massa...amitlee9823
 
Call Girls Banashankari Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Banashankari Just Call 👗 7737669865 👗 Top Class Call Girl Service ...Call Girls Banashankari Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Banashankari Just Call 👗 7737669865 👗 Top Class Call Girl Service ...amitlee9823
 

Recently uploaded (20)

Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Top Rated Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated  Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...Top Rated  Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
 
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
 
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样
哪里办理美国宾夕法尼亚州立大学毕业证(本硕)psu成绩单原版一模一样
 
Develop Keyboard Skill.pptx er power point
Develop Keyboard Skill.pptx er power pointDevelop Keyboard Skill.pptx er power point
Develop Keyboard Skill.pptx er power point
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
 
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
 
Deira Dubai Escorts +0561951007 Escort Service in Dubai by Dubai Escort Girls
Deira Dubai Escorts +0561951007 Escort Service in Dubai by Dubai Escort GirlsDeira Dubai Escorts +0561951007 Escort Service in Dubai by Dubai Escort Girls
Deira Dubai Escorts +0561951007 Escort Service in Dubai by Dubai Escort Girls
 
CHEAP Call Girls in Mayapuri (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Mayapuri  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Mayapuri  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Mayapuri (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
🔝 9953056974🔝 Delhi Call Girls in Ajmeri Gate
🔝 9953056974🔝 Delhi Call Girls in Ajmeri Gate🔝 9953056974🔝 Delhi Call Girls in Ajmeri Gate
🔝 9953056974🔝 Delhi Call Girls in Ajmeri Gate
 
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
 
Book Sex Workers Available Pune Call Girls Yerwada 6297143586 Call Hot India...
Book Sex Workers Available Pune Call Girls Yerwada  6297143586 Call Hot India...Book Sex Workers Available Pune Call Girls Yerwada  6297143586 Call Hot India...
Book Sex Workers Available Pune Call Girls Yerwada 6297143586 Call Hot India...
 
Top Rated Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Ravet ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
 
Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...
Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...
Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...
 
9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...
 
Top Rated Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated  Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...Top Rated  Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
Top Rated Pune Call Girls Shirwal ⟟ 6297143586 ⟟ Call Me For Genuine Sex Ser...
 
CALL GIRLS IN Saket 83778-77756 | Escort Service In DELHI NcR
CALL GIRLS IN Saket 83778-77756 | Escort Service In DELHI NcRCALL GIRLS IN Saket 83778-77756 | Escort Service In DELHI NcR
CALL GIRLS IN Saket 83778-77756 | Escort Service In DELHI NcR
 
Kothanur Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Bang...
Kothanur Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Bang...Kothanur Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Bang...
Kothanur Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Bang...
 
Vip Mumbai Call Girls Andheri East Call On 9920725232 With Body to body massa...
Vip Mumbai Call Girls Andheri East Call On 9920725232 With Body to body massa...Vip Mumbai Call Girls Andheri East Call On 9920725232 With Body to body massa...
Vip Mumbai Call Girls Andheri East Call On 9920725232 With Body to body massa...
 
Call Girls Banashankari Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Banashankari Just Call 👗 7737669865 👗 Top Class Call Girl Service ...Call Girls Banashankari Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Banashankari Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
 

Unit 2 contd. and( unit 3 voice over ppt)

  • 1. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-1 Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation • Unit 2 contd… • Unit 3 » Dr Reeja S R » CSE Dept » Dayananda Sagar University - SOE
  • 2. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-2 Instruction Level Parallelism: Concepts and Challenges • Instruction-level parallelism(ILP) – The potential of overlapping the execution of multiple instructions is called instruction-level parallelism.
  • 3. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-3 Techniques to Reduce Pipeline CPI • Recall, – Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls +Control stalls. – Instruction-level parallelism is to reduce the number of stalls. – How to find out ILP • Dynamically locate ILP by hardware • Statically locate ILP by software – Techniques that affect CPI (fig. 3.1 on page 173).
  • 4. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-4 ILP Within and Across a Basic Block • ILP within a basic block – If the branch frequency is 15%~25%, there are only 4 ~ 7 instructions within a basic block. This implies that we must exploit ILP across a basic block. • Loop-Level Parallelism(ILP across a basic block) – Exploit parallelism among iteration of a loop.
  • 5. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-5 Loop-Level Parallelism – Parallelism among iterations of a loop. • Example: for(I=1; I<=100; I++) X[I]=X[I]+Y[I]; – Each iteration of the loop can overlap with any other iteration in this example. – Techniques converting the loop-level parallelism into ILP • Loop unrolling • Use of vector instructions (Appendix G) – LOAD X; LOAD Y; ADD X, Y; STORE X – Originally used in mainframe and supercomputers. – Die away due to the effective use of pipelining in desktop and server processors – See a renaissance for use in graphics, DSP, and multimedia applications
  • 6. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-6 Data Dependence and Hazards • To have ILP, instructions should have no dependence • A dependence indicates the possibility of a hazard, – Determines the order in which results must be calculated, and – Sets an upper bound on how much parallelism can possibly be exploited. • Overcome the limitation of dependence on ILP by – Maintaining the dependence but avoiding a hazard, – Eliminating a dependence by transforming the code. • Dependence types – Data dependence • Creating RAW, WAR, and WAW hazards – Name dependences – Control dependences
  • 7. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-7 Name Dependence – Name dependences • Occurs when two instructions use the same register or memory location, called a name, but no data flow between the instructions with that name. – Two types of name dependences: • Antidependence: Occur when instruction j writes a register or memory location that instruction i reads and instruction i is executed first. • Output dependence: Occur when instruction i and instruction j write the same register or memory location. – Register renaming can be employed to eliminate name dependences
  • 8. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-8 Control Dependence • A control dependence determines the ordering of an instruction with respect to a branch instruction. – Example: S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. if p1 { S1; }; if p2 { S2; };
  • 9. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-9 Two Constraints Imposed by Control Dependences – An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. – An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.
  • 10. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-10 How the Simple Pipeline in Appendix A Preserves Control Dependence – Instructions execute in order. – Detection of control or branch hazards ensures that an instruction that is control dependent on a branch is not executed until the branch direction is known.
  • 11. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-11 Can We Violate Control Dependence? • Yes, we can – If we can ensure that violating the control dependence will not affect program correctness, control dependence is not a critical property that must be preserved. – Instead, the two properties critical to program correctness, the exception behavior and the data flow, are what must normally be preserved by maintaining data and control dependences
  • 12. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-12 Preserving Exception Behavior – Preserving the exception behavior means that any changes in the ordering of instruction execution must not change how exceptions are raised in the program. • Often this is relaxed to mean that the reordering of instruction execution must not cause any new exceptions in the program. • Example DADDU R2, R3, R4 BEQZ R2, L1 LW R1, 0(R2) L1: … What happens if LW is moved before BEQZ and the load raises a memory exception when the branch is taken?
  • 13. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-13 Preserving Data Flow – The actual flow of data among instructions that produce results and those that consume them must be preserved. – A branch makes the data flow dynamic (i.e., a value may come from multiple points). – Example DADDU R1, R2, R3 BEQZ R4, L DSUBU R1, R5, R6 L: … OR R7, R1, R8 – “Preserving data flow” means that if the branch is not taken, the value of R1 computed by DSUBU is used by OR; otherwise, the value of R1 computed by DADDU is used.
  • 14. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-14 Speculation • Check whether an instruction can be executed in violation of a control dependence while still preserving the exception behavior and the data flow. • Example DADDU R1, R2, R3 BEQZ R12, skipnext DSUBU R4, R5, R6 DADDU R5, R4, R9 skipnext: OR R7, R8, R9 – What about moving DSUBU before BEQZ if R4 were not used on the taken path?
  • 15. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-15 Overcoming Data Hazards with Dynamic Scheduling – Basic idea: DIV.D F0, F2, F4 ADD.D F10, F0, F8 SUB.D F12, F8, F14 – SUB.D is stalled, but it is not data dependent on anything in the pipeline. – The major limitation of the pipeline introduced so far is in-order issuing of instructions. – Allowing SUB.D to execute by dynamically scheduling the instructions creates out-of-order execution, and thus out-of-order completion.
  • 16. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-16 Advantages and Problems of Dynamic Scheduling – Advantages • Enables handling some cases where dependences are unknown at compile time (e.g., those involving memory references). • Simplifies the compiler. • Allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline. – Problems • It creates WAR and WAW hazards. • It complicates exception handling due to out-of-order completion; it creates imprecise exceptions. – The processor state when an exception is raised does not look exactly as if the instructions were executed sequentially in strict program order.
  • 17. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-17 Support Dynamic Scheduling for the Simple Five-Stage Pipeline • Divide the ID stage into the following two stages: – Issue: Decode instructions and check for structural hazards. – Read operands: Wait until no data hazards, then read operands.
  • 18. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-18 Dynamic Scheduling Algorithms • Algorithms – Scoreboarding, originated in the CDC 6600 (Appendix A). • Effective when there are sufficient resources and no data dependences. – Tomasulo’s algorithm, originated in the IBM 360/91. • Both algorithms can be applied to pipelined or multiple-functional-unit implementations.
  • 19. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-19 Dynamic Scheduling Using Tomasulo’s Approach • Combine key elements of the scoreboarding scheme with register renaming. – Track when operands are available to minimize RAW hazards. – Use register renaming to minimize WAR and WAW hazards.
  • 20. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-20 Concept of Register Renaming • Code before renaming DIV.D F0, F2, F4 ADD.D F6, F0, F8 S.D F6, 0(R1) SUB.D F8, F10, F14 MUL.D F6, F10, F8 • Code after renaming DIV.D F0, F2, F4 ADD.D S, F0, F8 S.D S, 0(R1) SUB.D T, F10, F14 MUL.D F6, F10, T
  • 21. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-21 Basic Architecture for Tomasulo’s Approach
  • 22. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-22 Basic Ideas – A reservation station (RS) fetches and buffers an operand as soon as it is available. – Pending instructions designate the RS that will provide their inputs. – When successive writes to a register appear, only the last one is actually used to update the register. – As instructions are issued, the register specifiers for pending operands are renamed to the names of the RSs, i.e., register renaming • The functionality of register renaming is provided by – The reservation stations (RSs), which buffer the operands of instructions waiting to issue. – The issue logic • Since there can be more RSs than real registers, the technique can eliminate hazards that could not be eliminated by a compiler
  • 23. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-23 What Does a Reservation Station Actually Hold? – Instructions that have been issued and are awaiting execution at a functional unit. – The operands if available; otherwise, the source of the operands. – The information needed to control the execution of the instruction at the unit. – The load buffers and store buffers hold data or addresses coming from and going to memory.
  • 24. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-24 Steps in Tomasulo’s Approach – Issue … get an instruction from the instruction queue • Get an instruction from the floating-point queue. If it is a floating-point operation, issue it if there is an empty RS, and send the operands to the RS if they are in the registers. If it is a load or store, it can be issued if there is an available buffer. If the hardware resource is not available, the instruction stalls. – Execute … operate on the operands • If one or more operands are not yet available, monitor the CDB to obtain the required operands. When both operands are available, the instruction is executed. This step checks for RAW hazards. – Write result … finish execution (WB) • When the result is available, write it on the CDB and from there into the registers and any RS waiting for this result. – Commit … update the register or memory with the ROB result • When an instruction reaches the head of the ROB and its result is present, update the register with the result or store to memory, and remove the instruction from the ROB. • If an incorrectly predicted branch reaches the head of the ROB, flush the ROB and restart at the correct successor of the branch. • The above steps differ from scoreboarding in the following three aspects:
  • 25. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-25 Data Structures – Data structures used to detect and eliminate hazards are attached to the RS, the register file, and the load and store buffers. • Everything contains a tag field per entry. The tags are essentially names for an extended set of virtual registers used in renaming. • In this example, the tag is a four-bit quantity that denotes one of the five RSs or one of the six load buffers.
  • 26. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-26 Fields in Data Structures – Each RS has seven fields (a C sketch follows) • Op: The operation to perform. • Qj, Qk: The RS that produces the corresponding source operand. • Vj, Vk: The values of the source operands. • Busy: Indicates that the RS and its corresponding functional unit are occupied. • A: Used to hold information for memory-address calculation for a load or store. – The register file and store buffer each have a field, Qi: • Qi: The number of the RS that contains the operation whose result should be stored into this register or into memory. – The load and store buffers each require a busy field. The store buffer has a field A, which holds the effective address.
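The fields listed above can be written down as C structures. The field names follow the slide, while the counts, types, and widths are illustrative assumptions rather than the book's exact encoding.

  #define NUM_RS 5          /* five reservation stations, as in the running example */

  typedef struct {
      int    busy;          /* Busy: RS and its functional unit are occupied        */
      char   op[8];         /* Op: operation to perform on Vj and Vk                */
      int    qj, qk;        /* Qj, Qk: RS producing each source operand
                               (0 = value already available in Vj/Vk)               */
      double vj, vk;        /* Vj, Vk: values of the source operands                */
      long   a;             /* A: immediate / effective-address info (load/store)   */
  } ReservationStation;

  typedef struct {
      int    qi;            /* Qi: RS whose result will be written here (0 = none)  */
      double value;         /* current register contents                            */
  } RegisterEntry;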
  • 27. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-27 Example for Tomasulo’s Approach (1) • Fig. 3.3 on page 190
  • 28. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-28 Example for Tomasulo’s Approach (2) • Fig. 3.4 on page 192
  • 29. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-29 Advantage of Tomasulo’s Approach over Scoreboarding • The distribution of hazard detection over the RSs • The elimination of stalls for WAW and WAR hazards
  • 30. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-30 Tomasulo’s Algorithm: A Loop-Based Example – By using reservation stations, a loop can be dynamically unrolled. Assume the instructions of the following loop have been issued for two successive iterations, but none of the floating-point loads/stores or operations has completed (fig. 3.6 on page 194).
  • 31. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-31 A Loop-Based Example
  • 32. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-32 Dynamic Disambiguation of Addresses – If the load address matches a store-buffer address, we must stop and wait until the store buffer gets its value; that value can then be forwarded to the load. If there is no match, the load can get the value from memory. This allows the load in the second iteration in fig. 3.6 to complete earlier than the store in the first iteration. – The key components for enhancing ILP in Tomasulo’s algorithm are dynamic scheduling, register renaming, and dynamic memory disambiguation.
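A rough sketch, under our own simplifying assumptions (a fixed number of store-buffer entries and a flat address space), of the address check a load performs against the store buffers:

  #define NUM_STORE_BUFFERS 3     /* illustrative count */

  typedef struct {
      int    busy;
      long   a;                   /* effective address of the pending store */
      int    has_value;           /* store data already available?          */
      double value;
  } StoreBuffer;

  /* Returns 1 and forwards the value if the load can be satisfied from a
     pending store, 0 if it may go to memory, -1 if it must wait. */
  int disambiguate_load(long load_addr, StoreBuffer sb[], double *out)
  {
      for (int i = 0; i < NUM_STORE_BUFFERS; i++) {
          if (sb[i].busy && sb[i].a == load_addr) {
              if (!sb[i].has_value)
                  return -1;        /* conflict: wait for the store's value */
              *out = sb[i].value;   /* forward the value to the load        */
              return 1;
          }
      }
      return 0;                     /* no address match: access memory      */
  }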
  • 33. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-33 Reducing Branch Costs with Dynamic Hardware Prediction – Dynamic hardware branch prediction • The prediction will change if the branch changes its behavior while the program is running. • The effectiveness of a branch prediction scheme depends – not only on the accuracy, – but also on the cost of a branch when the prediction is correct and when it is incorrect. • The branch penalties depend on – the structure of the pipeline, – the type of predictor, and – the strategies used for recovering from misprediction.
  • 34. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-34 Basic Branch Prediction and Branch- Prediction Buffer – A branch prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction. The memory contains bits that say whether the branch was recently taken or not.
  • 35. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-35 The Simple One-Bit Prediction Scheme – If a prediction is correct, the prediction bit is unchanged; otherwise, it is inverted. • Example on page 197 (misprediction rate 20%)
  Correct?   Prediction   Instruction   Taken/untaken
  Y(es)      T(aken)      I1            T(aken)
  Y          T            I2            T
  …          …            …             …
  Y          T            I9            T
  N          T            I10           U(ntaken)
  N          U            I11           T
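The 20% figure can be reproduced with a tiny C simulation. The branch pattern (taken nine times, then not taken once, repeated) and the initial predictor state are assumptions chosen so that the steady-state behavior of the example holds from the first pass: the predictor mispredicts twice per ten executions.

  #include <stdio.h>

  int main(void)
  {
      int predict_taken = 0;        /* one-bit state; start "not taken" so the
                                       steady-state pattern holds from pass 1  */
      int mispredicts = 0, total = 0;

      for (int rep = 0; rep < 10; rep++) {        /* 10 passes of the loop      */
          for (int i = 1; i <= 10; i++) {         /* taken 9x, then untaken 1x  */
              int taken = (i < 10);
              if (predict_taken != taken)
                  mispredicts++;
              predict_taken = taken;              /* 1-bit update               */
              total++;
          }
      }
      printf("1-bit misprediction rate = %d/%d\n", mispredicts, total); /* 20/100 */
      return 0;
  }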
  • 36. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-36 Two-Bit Branch Prediction Scheme – A prediction must miss twice before it is changed.
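A minimal C sketch of the 2-bit saturating-counter scheme, run on the same taken-9/untaken-1 pattern as the previous example; starting in the strongly-taken state is our assumption. Because a prediction must miss twice before it flips, the loop-exit misprediction no longer causes a second miss at the top of the next pass, and the rate drops to about 10%.

  #include <stdio.h>

  /* 2-bit counter: 0,1 predict not taken; 2,3 predict taken. */
  static int predict(int c)           { return c >= 2; }
  static int update(int c, int taken) { return taken ? (c < 3 ? c + 1 : 3)
                                                     : (c > 0 ? c - 1 : 0); }

  int main(void)
  {
      int counter = 3;                /* start strongly taken                  */
      int miss = 0, total = 0;
      for (int rep = 0; rep < 10; rep++)
          for (int i = 1; i <= 10; i++) {
              int taken = (i < 10);   /* same taken-9 / untaken-1 pattern      */
              if (predict(counter) != taken) miss++;
              counter = update(counter, taken);
              total++;
          }
      printf("2-bit mispredictions: %d/%d\n", miss, total);        /* 10/100  */
      return 0;
  }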
  • 37. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-37 Accuracy of Two-Bit Branch Prediction Buffer (1) • With 4096 entries
  • 38. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-38 Accuracy of Two-Bit Branch Prediction Buffer (2)
  • 39. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-39 Correlating Branch Predictor (1) – Basic concept: the behavior of a branch depends on other branches. if (aa == 2) aa = 0; if (bb == 2) bb = 0; if (aa != bb) { • Equivalent code fragment
  DSUBI R3, R1, #2
  BNEZ  R3, L1        ; branch b1 (aa != 2)
  DADD  R1, R0, R0    ; aa = 0
  L1: DSUBI R3, R2, #2
  BNEZ  R3, L2        ; branch b2 (bb != 2)
  DADD  R2, R0, R0    ; bb = 0
  L2: DSUB R3, R1, R2 ; R3 = aa - bb
  BEQZ  R3, L3        ; branch b3 (aa == bb)
• Branches b1 and b2 both not taken implies b3 will be taken.
  • 40. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-40 Correlating Branch Predictor (2) • Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors. – Consider the following simplified code fragment: if (d == 0) d = 1; if (d == 1) » The equivalent code fragment is
  BNEZ   R1, L1       ; branch b1 (d != 0)
  DADDIU R1, R0, #1
  L1: DADDIU R3, R1, #-1
  BNEZ   R3, L2       ; branch b2 (d != 1)
  …
  L2:
» If b1 is not taken, b2 will not be taken.
  • 41. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-41 Possible Execution Sequence • Fig. 3.10 on page 202
  • 42. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-42 If Done by One-Bit Predictor • Fig. 3.11 on page 202
  • 43. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-43 One-Bit Predictor with One-Bit Correlation, i.e., (1,1) Predictor • The first bit is the prediction used if the last branch executed was not taken, and the second bit is the prediction used if the last branch was taken. • The four possible combinations – Fig. 3.12 on page 203
  • 44. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-44 If Done by (1,1) Predictor • Fig. 3.13 on page 203
  • 45. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-45 General Correlating Predictor – An (m, n) predictor uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch. • Examples on page 205.
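A compact C sketch of how an (m, n) predictor can be indexed, here with m = 2 and n = 2. The table size, the use of low PC bits as the index, and the exact way the global history selects among the 2^m tables are illustrative assumptions, not the book's concrete design.

  #include <stdint.h>

  #define M       2                 /* bits of global branch history             */
  #define ENTRIES 1024              /* per-history prediction table size         */

  static uint8_t  table[1 << M][ENTRIES];   /* each entry: a 2-bit counter (n=2) */
  static unsigned history;                  /* outcomes of the last M branches   */

  int predict_branch(uint32_t branch_pc)
  {
      unsigned idx = (branch_pc >> 2) & (ENTRIES - 1);   /* low PC bits          */
      return table[history & ((1 << M) - 1)][idx] >= 2;  /* pick one of the 2^M
                                                            n-bit predictors     */
  }

  void update_branch(uint32_t branch_pc, int taken)
  {
      unsigned idx = (branch_pc >> 2) & (ENTRIES - 1);
      uint8_t *c = &table[history & ((1 << M) - 1)][idx];
      if (taken)  { if (*c < 3) (*c)++; }                 /* saturating update   */
      else        { if (*c > 0) (*c)--; }
      history = ((history << 1) | (taken & 1)) & ((1 << M) - 1);  /* shift in    */
  }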
  • 46. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-46 A (2,2) Predictor
  • 47. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-47 Comparison of Two-Bit Predictors
  • 48. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-48 Tournament Predictors • Adaptively combine local and global predictors. • It is the most popular form of multilevel branch predictor. • A multilevel branch predictor uses several levels of branch-prediction tables together with an algorithm for choosing among the multiple predictors.
  • 49. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-49 State Transition Diagram of a Tournament Predictor • 0/0: predictor 1 is wrong / predictor 2 is wrong • 1/1: predictor 1 is correct / predictor 2 is correct
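One common way to realize the selector whose states are labeled above is a 2-bit saturating counter per entry that moves toward whichever predictor was correct when the two disagree. The C sketch below is our assumption of such a realization, not the exact design in the figure; the labels in the comments mirror the 0/0 through 1/1 pairs on the slide.

  /* Chooser: 0,1 -> use predictor 1 (e.g., local); 2,3 -> use predictor 2
     (e.g., global). Updated only when the two predictors disagree.         */
  int choose(int chooser) { return chooser >= 2; /* 1 = use predictor 2 */ }

  int update_chooser(int chooser, int p1_correct, int p2_correct)
  {
      if (p1_correct == p2_correct)        /* 0/0 or 1/1: no information    */
          return chooser;
      if (p2_correct)                      /* 0/1: move toward predictor 2  */
          return chooser < 3 ? chooser + 1 : 3;
      return chooser > 0 ? chooser - 1 : 0; /* 1/0: move toward predictor 1 */
  }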
  • 50. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-50 The Fraction of Predictions Done by Local Predictor
  • 51. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-51 Mis-prediction Rate for Three Predictors
  • 52. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-52 Integrated Instruction Fetch Units • Perform the following functions – Integrated branch prediction – Instruction pre-fetch – Instruction memory access and buffering
  • 53. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-53 Hardware-Based Speculation (Unit 2) – Why? • To resolve control dependences to increase ILP – Overcoming control dependence is done by • Speculating on the outcome of branches and executing the program as if our guesses were correct. – Three key ideas combined in hardware-based speculation: • Dynamic branch prediction, • Speculative execution, and • Dynamic scheduling. – Instruction commit • When an instruction is no longer speculative, we allow it to update the register file or memory – The key idea behind implementing speculation is to allow instructions to execute out of order but to force them to commit in order.
  • 54. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-54 Extending Tomasulo’s Approach to Support Speculation – Separate the bypassing of results among instructions from the actual completion of an instruction. • By doing so, the result of an instruction can be used by other instructions without allowing the instruction to perform any irrecoverable update until it is no longer speculative. – A reorder buffer (ROB) is employed to pass results among instructions that may be speculated. • The reorder buffer holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits. • The store buffers in the original Tomasulo algorithm are integrated into the reorder buffer. • The renaming function of the reservation stations (RSs) is replaced by the reorder buffer. Thus, a result is usually tagged with the reorder buffer entry number.
  • 55. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-55 Data Structure for the Reorder Buffer – Each entry in the reorder buffer contains four fields: • The instruction type field indicates whether the instruction is a branch, a store, or a register operation. • The destination field supplies the register number or the memory address where the instruction result should be written. • The value field is used to hold the value of the instruction result until the instruction commits. • Busy (ready) field
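The four fields can be written down as a C structure. The enum names and types are illustrative assumptions (for example, a real design would not hold a memory address in a plain int, and the value field would be wide enough for any result).

  typedef enum { ROB_BRANCH, ROB_STORE, ROB_REG_OP } RobType;

  typedef struct {
      RobType type;       /* instruction type: branch, store, or register op */
      int     dest;       /* destination register number or, for a store,
                             the memory address (simplified to an int here)  */
      double  value;      /* result held until the instruction commits       */
      int     ready;      /* busy/ready: result has been produced            */
  } RobEntry;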
  • 56. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-56 The Four Steps in Instruction Execution (1) – Issue (dispatch) • Issue an FP instruction if there is an empty RS and an empty slot in the reorder buffer, send the operands to the RS if they are in the registers or the reorder buffer, and update the control entries to indicate the buffers are in use. • The number of the reorder buffer entry allocated for the result is also sent to the RS for tagging the result sent on the CDB. • If either all RSs are full or the reorder buffer is full, instruction issue is stalled. – Execute • The CDB is monitored for not-yet-available operands. When both operands are available at an RS, execute the operation. This step checks for RAW hazards.
  • 57. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-57 The Four Steps in Instruction Execution (2) – Write result • When the result is available, write it on the CDB and then into the reorder buffer and to any RSs waiting for this result. Mark the RS as available. – Commit • When an instruction, other than a branch with incorrect prediction, reaches the head of the reorder buffer and its result is in the buffer, update the register with the result. • When a branch with incorrect prediction, indicating a wrong speculation, reaches the head of the reorder buffer, the reorder buffer is flushed and execution is restarted at the correct successor of the branch. If the branch was correctly predicted, the branch is finished.
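Continuing the RobEntry sketch from the note after slide 55, a rough C outline of the commit step: entries leave the reorder buffer strictly in order from the head. The function and parameter names are ours, the memory write is omitted, and, as a stated simplification, a nonzero value field on a branch entry marks a misprediction.

  void commit_step(RobEntry rob[], int *head, int size,
                   double regs[], int *flush_pipeline)
  {
      RobEntry *e = &rob[*head];
      if (!e->ready)
          return;                       /* head not finished: nothing commits    */

      if (e->type == ROB_REG_OP) {
          regs[e->dest] = e->value;     /* in-order update of architected state  */
      } else if (e->type == ROB_STORE) {
          /* write e->value to memory at address e->dest (omitted in this sketch) */
      } else {                          /* branch */
          if (e->value != 0.0)          /* simplification: nonzero value marks a
                                           mispredicted branch                    */
              *flush_pipeline = 1;      /* flush the ROB, restart at the correct
                                           successor of the branch                */
      }
      *head = (*head + 1) % size;       /* free the entry */
  }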
  • 58. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-58 The Architecture with Speculation
  • 59. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-59 Exception Handling • Exceptions are handled by not recognizing the exception until the instruction that caused it is ready to commit. Maintaining precise exceptions is therefore straightforward.
  • 60. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-60 Example for Hardware-Based Speculation • Fig. 3.30 on page 230
  • 61. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-61 Comparison of with and without Speculation (1) • Fig. 3.33 on page 236
  • 62. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-62 Comparison of with and without Speculation (2) • Fig. 3.34 on page 237
  • 63. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-63 Studies of the Limitations of ILP • Ideal hardware model – With an infinite number of physical registers for renaming – With perfect branch prediction – With perfect jump prediction – Perfect memory-address alias analysis • All memory addresses are known exactly, and a load can be moved before a store provided that the addresses are not identical. – Enough functional units • The ILP limitation in the ideal hardware model is due to data dependences.
  • 64. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-64 Unit 3 ILP-2 Exploiting ILP Using Multiple Issue and Static Scheduling
  • 65. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-65 Taking Advantage of More ILP with Multiple Issue – How can we reduce the CPI to less than one? • Multiple issue – Allow multiple instructions to issue in a clock cycle. – Multiple-issue processors: • Superscalar processor – Issues a varying number of instructions per clock cycle; may be either statically scheduled by a compiler or dynamically scheduled using techniques based on scoreboarding and Tomasulo’s algorithm. • Very long instruction word (VLIW) processor – Issues a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet. Inherently statically scheduled by the compiler.
  • 66. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-66 Statically Versus Dynamically Scheduled Superscalar Processors • Statically scheduled superscalar – Instructions are issued in order and are executed in order – All pipeline hazards are checked at issue time – Issues a varying number of instructions per clock cycle, determined dynamically at issue time • Dynamically scheduled superscalar – Allows out-of-order execution.
  • 67. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-67 Five Primary Approaches for Multiple-Issue Processors • Fig. 3.23 on page 216
  • 68. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-68 A Statically Scheduled Superscalar MIPS Processor – Dual issue: one integer and one floating-point operation • Some restrictions: – The second instruction can be issued only if the first instruction can be issued. – If the second instruction depends on the first instruction, it cannot be issued. • Influence of a load dependency: wastes 3 instruction-issue slots. – Wastes one instruction-issue slot in the current clock cycle. – Wastes two instruction-issue slots in the next clock cycle. • Influence of a branch dependency: wastes 2 or 3 instruction-issue slots. – Depends on whether a branch must be the first instruction.
  • 69. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-69 Dual-Issue Superscalar Pipeline in Operation • Fig. 3.24
  • 70. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-70 Possible Hazards – Note: the integer operations may be a floating-point load, move, or store. – Possible hazards (new): • Structural hazard – Occurs when an FP move, store, or load is paired with an FP instruction and the FP register file does not provide enough read or write ports. • WAW, and • RAW hazards – Dependence between the instructions issued in the same clock cycle.
  • 71. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-71 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation Multiple Instruction Issue with Dynamic Scheduling – To support dual issue of instructions • Separate data structures (reservation stations) for the integer and floating-point registers are employed (issuing an FP move or load that is dependent on the FP instruction in the same cycle is still prevented). • Pipeline the issue stage so that it runs twice as fast as the basic clock rate. The first half issues the dependent move or load, while the second half issues the floating-point instruction.
  • 72. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-72 A Scenario of Dual Issues
  • 73. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-73 Resource Usage Table
  • 74. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-74 Factors Limiting the Performance of the Dual-Issue Processor – Limitations in multiple-issue processors • Inherent limitations of ILP in programs – It is difficult to find a large number of independent instructions to keep the FP units fully utilized. • The amount of overhead per loop iteration is very high – Two out of five instructions (DADDIU and BNE) • Control hazards
  • 75. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-75 Advanced Techniques for Instruction Delivery and Speculation (Unit 3) • Branch-Target Buffers – A branch-prediction cache that stores the predicted address of the next instruction after a branch is called a branch-target buffer or branch-target cache.
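A minimal C sketch of a branch-target buffer lookup performed with the fetch PC; the table size, the direct-mapped organization, and the use of the full PC as the tag are illustrative assumptions, not a specific machine's design.

  #include <stdint.h>

  #define BTB_ENTRIES 512            /* illustrative size */

  typedef struct {
      int      valid;
      uint32_t tag;                  /* PC of the branch stored in this entry */
      uint32_t target;               /* predicted next PC if taken            */
  } BtbEntry;

  static BtbEntry btb[BTB_ENTRIES];

  /* Looked up with the fetch PC, in parallel with instruction fetch.
     Returns the predicted next PC: the stored target on a hit (predicted-taken
     branch), or the sequential fall-through address on a miss.               */
  uint32_t btb_next_pc(uint32_t pc)
  {
      BtbEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];
      if (e->valid && e->tag == pc)
          return e->target;          /* hit: fetch from the predicted target  */
      return pc + 4;                 /* miss: fetch the next sequential word  */
  }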
  • 76. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-76
  • 77. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-77 The steps involved in handling an instruction with a branch-target buffer.
  • 78. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-78 Penalties for Each Individual Situation • Performance of the branch-target buffer (Example on page 211). • One variation of the branch-target buffer: – Store one or more target instructions instead of the predicted address.
  • 79. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-79 Return Address Predictor – Optimizes the prediction of indirect jumps, especially procedure calls and returns • Problem – The accuracy of predicting return addresses with a branch-target buffer can be low if the procedure is called from multiple sites and the calls from one site are not clustered in time. • Solution – Use a stack to buffer the most recent return addresses, pushing a return address on the stack at a call and popping one off at a return.
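A small C sketch of such a return-address stack; the depth and the overflow policy (simply dropping the push) are our assumptions for illustration.

  #define RAS_DEPTH 16               /* illustrative depth */

  static unsigned long ras[RAS_DEPTH];
  static int ras_top;                /* number of valid entries */

  void ras_push(unsigned long return_addr)      /* on a call  */
  {
      if (ras_top < RAS_DEPTH)
          ras[ras_top++] = return_addr;
      /* on overflow a real design typically overwrites the oldest entry;
         this sketch simply drops the push */
  }

  unsigned long ras_pop(unsigned long fallback) /* on a return */
  {
      if (ras_top > 0)
          return ras[--ras_top];     /* predicted return address          */
      return fallback;               /* empty: fall back to, e.g., the BTB */
  }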
  • 80. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-80 Prediction Accuracy
  • 81. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-81 The Intel Pentium 4
  • 82. Rung-Bin LinChapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation 3-82