4. Early Computers
First Generation (1940-1956): Vacuum Tubes
Second Generation (1956-1963): Transistors
Third Generation (1964-1971): Integrated Circuits
Fourth Generation (1971-Present): Microprocessors
Fifth Generation (Present and Beyond): Artificial Intelligence
5. Fundamentals of Computer
Computer technology has made incredible progress over the last 60
years.
This improvement has come:
From advances in the technology used to build computers and
From innovation in computer design.
During the first 25 years, both forces made a major contribution,
delivering performance improvement of about 25% per year.
From the late 1970s, after the emergence of the microprocessor, the rate
of improvement rose to roughly 35% growth per year in performance.
6. Fundamentals of Computer
Two significant changes in the computer marketplace made it easier to
succeed commercially with a new architecture:
The virtual elimination of assembly language programming reduced the
need for object-code compatibility.
The creation of standardized, vendor-independent operating systems,
such as UNIX and Linux, lowered the cost of bringing out a new
architecture.
These conditions led to a new set of architectures with simpler
instructions, called RISC (Reduced Instruction Set Computer)
architectures, in the early 1980s.
The RISC-based machines focused on instruction-level parallelism and
the use of caches.
The RISC-based computers significantly raised performance.
8. Classes of Computers
The 1980s: desktop computers emerged
(based on microprocessors)
Personal computers and workstations
The 1990s: emergence of the Internet and the World Wide Web
(WWW).
Cell phones were introduced around 2000, with rapid improvements in
functionality and sales.
More recent applications use embedded computers.
9. Classes of Computers
Desktop Computing:
Desktop computing spans from low-end systems to heavily configured
high-end workstations.
The desktop market tends to be driven to optimize price-performance.
Desktop computers are widely used for applications and benchmarking.
10. Classes of Computers
Servers:
Servers are used to provide larger-scale and more reliable file and
computing services.
Consider the servers running Google, taking orders for Cisco, or
running auctions on eBay. Failure of such server systems is far more
catastrophic than failure of a single desktop, since these servers must
operate 24 × 7.
11. Classes of Computers
Servers:
Servers are designed for efficient throughput, measured in
transactions per minute or Web pages served per second.
Supercomputers, the most expensive computers, emphasize
floating-point performance.
Clusters of desktop computers have largely overtaken this class of
computer.
12. Classes of Computers
Clusters/Warehouse-Scale Computers
The growth of Software as a Service (SaaS) for applications like
search, social networking, video sharing, multiplayer games, online
shopping, and so on has led to the growth of a class of computers
called clusters.
Clusters are collections of desktop computers or servers
connected by local area networks to act as a single larger computer.
Each node runs its own operating system, and nodes communicate
using a networking protocol.
The largest of the clusters are called warehouse-scale computers
(WSCs), in that they are designed so that tens of thousands of servers
can act as one.
13. Classes of Computers
Embedded Computers
Embedded computers are the fastest growing computer market.
They range from microwaves, washing machines, printers, networking
switches, and the simple embedded microprocessors in cars, to
handheld digital devices such as cell phones and smart cards, to video
games and digital set-top boxes.
Embedded applications must minimize memory and minimize power.
14. Classes of Parallelism and Parallel Architectures
Parallelism at multiple levels is now the driving force of computer
design across all four classes of computers.
There are basically two kinds of parallelism in applications:
1. Data-Level Parallelism (DLP): Arises because there are many data
items that can be operated on at the same time.
2. Task-Level Parallelism (TLP): Arises because tasks of work are
created that can operate independently and largely in parallel.
15. Classes of Parallelism and Parallel Architectures
Data parallelism:
Consider a 2-processor system (CPUs A and B) in a parallel
environment, and suppose we wish to perform a task on some data 'd'.
It is possible to tell CPU A to do that task on one part of 'd' and CPU
B on another part simultaneously, thereby reducing the duration of the
execution.
The data can be assigned using conditional statements.
As a specific example, consider adding two matrices.
In a data parallel implementation, CPU A could add all elements from
the top half of the matrices, while CPU B could add all elements from
the bottom half of the matrices.
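The top-half/bottom-half split described above can be sketched in Python (the function names are illustrative; note that CPython threads show the structure of the split rather than true CPU parallelism, because of the global interpreter lock):

```python
from concurrent.futures import ThreadPoolExecutor

def add_rows(a, b, rows):
    # Each worker adds its assigned slice of rows, element by element.
    return [[a[i][j] + b[i][j] for j in range(len(a[i]))] for i in rows]

def parallel_matrix_add(a, b):
    n = len(a)
    top, bottom = range(0, n // 2), range(n // 2, n)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # "CPU A" takes the top half, "CPU B" the bottom half.
        upper = pool.submit(add_rows, a, b, top)
        lower = pool.submit(add_rows, a, b, bottom)
        return upper.result() + lower.result()
```

The same partitioning idea carries over directly to process pools or to real multiprocessor code.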
16. Classes of Parallelism and Parallel Architectures
Task parallelism:
Task parallelism (function parallelism or control parallelism) is a
form of parallelization of computer code across multiple processors in
parallel computing environments.
Task parallelism focuses on distributing execution processes
(threads) across different parallel computing nodes.
In a multiprocessor system, task parallelism is achieved when each
processor executes a different thread (or process) on the same or
different data.
17. Classes of Parallelism and Parallel Architectures
Task parallelism
As a simple example, if we are running code on a 2-processor system
(CPUs "a" and "b") in a parallel environment and we wish to do
tasks "A" and "B":
It is possible to tell CPU "a" to do task "A" and CPU "b" to do task "B"
simultaneously, thereby reducing the runtime of the execution.
The tasks can be assigned using conditional statements.
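A minimal Python sketch of this scheme (the two task bodies are made-up placeholders; CPython threads illustrate the structure, not true simultaneous execution):

```python
import threading

results = {}

def task_a():
    # One unit of work, independent of task_b.
    results["A"] = sum(range(1000))

def task_b():
    # A completely unrelated unit of work.
    results["B"] = max(x * x for x in range(100))

t1 = threading.Thread(target=task_a)   # "CPU a" runs task A
t2 = threading.Thread(target=task_b)   # "CPU b" runs task B
t1.start(); t2.start()
t1.join(); t2.join()
```

Because the two tasks share no data, no synchronization beyond the final joins is needed.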
18. Defining Computer Architecture
The computer designer faces the problem of maximizing performance
while staying within cost, power, and availability constraints.
This task covers instruction set design, functional organization, logic
design, and implementation.
The implementation may encompass integrated circuit design,
packaging, power, and cooling.
19. Defining Computer Architecture
Instruction Set Architecture:
Class of ISA —
Nearly all recent ISAs are classified as general-purpose register
architectures, where the operands are either registers or memory
locations.
All recent ISAs have a load-store architecture.
Memory addressing —
All desktops and servers use byte addressing to access memory
operands.
20. Defining Computer Architecture
Instruction Set Architecture:
Addressing modes — specify how to name registers and constant
operands, and how to compute the address of a memory object.
1. MIPS addressing modes are Register, Immediate (for constants),
and Displacement.
2. The 80x86 supports those three plus three variations of
displacement.
Types and sizes of operands — Like most ISAs, MIPS and 80x86
support operand sizes of 8, 16, 32, and 64 bits, plus IEEE 754
floating point in 32-bit and 64-bit formats.
MIPS (Microprocessor without Interlocked Pipeline Stages) is basically a RISC architecture.
21. Defining Computer Architecture
Instruction Set Architecture:
Operations — The categories of operations are data transfer,
arithmetic/logical, control, and floating point.
Control flow instructions — All ISAs support conditional branches,
unconditional jumps, procedure calls, and returns.
Encoding an ISA — There are two basic choices of encoding:
fixed length and variable length.
All MIPS instructions are 32 bits long.
Variable-length instructions can take less space than fixed-length
instructions.
22. Trends in Technology
Integrated circuit logic technology —
Transistor density increases by about 35% per year. Increases in die
size are less predictable and slower. The combined effect is a growth
rate in transistor count on a chip of about 40% to 55% per year.
Semiconductor DRAM — Capacity increases by about 40% per year,
doubling roughly every two years.
Magnetic disk technology — Prior to 1990, density increased by
about 30% per year, doubling in three years. It then rose to 60% per
year, and reached 100% per year in 1996. Since 2004, it has dropped
back to about 30% per year.
Network technology — Network performance depends on the
performance of switches and of the transmission system.
23. Trends in Power in Integrated
Circuits
First, power must be brought in and distributed around the chip.
Second, power is dissipated as heat and must be removed.
For CMOS chips, the dominant energy consumption is in switching
transistors, also called dynamic power.
The power required per transistor is proportional to the product of the
load capacitance of the transistor, the square of the voltage, and the
frequency of switching:
Power_dynamic = 1/2 × Capacitive load × Voltage² × Frequency switched
Mobile devices care about battery life more than power, so energy is
the proper metric, measured in joules:
Energy_dynamic = Capacitive load × Voltage²
24. Trends in Power in Integrated
Circuits
Example: Some microprocessors today are designed to have
adjustable voltage, so that a 15% reduction in voltage may result in a
15% reduction in frequency. What would be the impact on dynamic
power?
Answer: Since the capacitance is unchanged, the answer is the ratio
of the voltages and frequencies:
Power_new / Power_old
= ((Voltage × 0.85)² × (Frequency switched × 0.85)) / (Voltage² × Frequency switched)
= 0.85³ ≈ 0.61,
thereby reducing power to about 60% of the original.
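The arithmetic can be checked with a couple of lines of Python (the helper name is made up for illustration):

```python
def dynamic_power_ratio(voltage_scale, frequency_scale):
    # Power_dynamic is proportional to Capacitive load × Voltage² × Frequency,
    # so with capacitance unchanged the ratio is voltage_scale² × frequency_scale.
    return voltage_scale ** 2 * frequency_scale

# 15% reduction in both voltage and frequency.
ratio = dynamic_power_ratio(0.85, 0.85)   # 0.85³ ≈ 0.61
```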
25. Trends in Power in Integrated
Circuits
The increase in the number of transistors switching, and the
frequency of switching, dominates the decrease in load capacitance
and voltage, leading to an overall growth in power consumption and
energy.
Power is now the major limitation, therefore most microprocessors
today turn off the clock of inactive modules to save energy and dynamic
power e.g. if no floating-point instructions are executing, the clock of the
floating-point unit is disabled.
Although dynamic power is the primary source of power dissipation in
CMOS, static power is becoming an important issue because leakage
current flows even when a transistor is off.
Static power is calculated by:
Power_static = Current_static × Voltage
26. Trends in Cost
Although there are computers where cost tends to be less important,
specifically supercomputers, cost is a central concern for most systems.
In the past 20 years:
Using technology improvements to lower cost while increasing
performance has been a major theme in the computer industry.
Yet an understanding of cost and its factors is essential for
designers to make intelligent decisions about whether or not a new
feature should be included in designs.
27. Trends in Cost:
The Impact of Time, Volume, and Commodification
Cost of a manufactured computer component decreases over time
even without major improvements in the implementation technology.
One example is that the price per megabyte of DRAM has dropped
over the long term by 40% per year.
Volume is a second key factor in determining cost.
Increasing volume affects cost in several ways.
First, it decreases the time needed to get down the learning curve,
which is partly proportional to the number of systems manufactured.
Second, volume decreases cost, since it increases purchasing and
manufacturing efficiency.
28. Trends in Cost:
Cost of an Integrated Circuit:
Although the costs of integrated circuits have dropped exponentially, the
basic process of silicon manufacture is unchanged.
The cost of a packaged integrated circuit is:
Cost of integrated circuit
= (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
where
Cost of die = Cost of wafer / (Dies per wafer × Die yield)
29. Trends in Cost:
Example: Find the number of dies per 300 mm (30 cm) wafer for a die
that is 1.5 cm on a side.
Answer: The die area is 2.25 cm². Thus
Dies per wafer
= (π × (Wafer diameter / 2)²) / Die area − (π × Wafer diameter) / sqrt(2 × Die area)
= (π × (30 / 2)²) / 2.25 − (π × 30) / sqrt(2 × 2.25)
= 706.9 / 2.25 − 94.2 / 2.12 = 270
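The same arithmetic in a few lines of Python (function name is illustrative):

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    # First term: usable wafer area divided by die area.
    # Second term: dies lost around the circular edge of the wafer.
    radius = wafer_diameter_cm / 2
    return (math.pi * radius ** 2 / die_area_cm2
            - math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2))

n_dies = round(dies_per_wafer(30, 2.25))   # 300 mm wafer, 1.5 cm × 1.5 cm die -> 270
```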
30. Trends in Cost:
However, this only gives the maximum number of dies per wafer.
The critical question is: What is the fraction of good dies on a wafer,
called the die yield?
Assuming that the defects are randomly distributed over the wafer and
that yield is inversely proportional to the complexity of the fabrication
process, the die yield is
Die yield = Wafer yield × (1 + (Defects per unit area × Die area) / α)^(−α)
where α is a parameter that measures manufacturing complexity.
31. Trends in Cost:
Example:
Find the die yield for dies that are 1.5 cm on a side and 1.0 cm on a
side, assuming a defect density of 0.4 per cm² and α = 4.
Answer:
The total die areas are 2.25 cm² and 1.00 cm². For the larger die, the
yield is
Die yield = (1 + (0.4 × 2.25) / 4.0)^(−4) = 0.44
For the smaller die, it is
Die yield = (1 + (0.4 × 1.00) / 4.0)^(−4) = 0.68
That is, less than half of all the large dies are good, but more than
two-thirds of the small dies are good.
33. Measuring Performance
When we say one computer is faster than another, what do we
mean?
The user of a desktop computer may say a computer is faster when a
program runs in less time,
While an Amazon.com administrator may say a computer is faster
when it completes more transactions per hour.
The computer user is interested in reducing response time — the
time between the start and the completion of an event (execution time).
The administrator of a large data processing center may be
interested in increasing throughput — the total amount of work done
in a given time.
34. Measuring Performance
Consider two different computers, X and Y.
The phrase “X is faster than Y” is used to mean that the response
time or execution time is lower on X than on Y for the given task.
In particular, "X is n times faster than Y" will mean
Execution time_Y / Execution time_X = n
Since execution time is the reciprocal of performance, the following
relationship holds:
n = Execution time_Y / Execution time_X = Performance_X / Performance_Y
The phrase "the throughput of X is 1.3 times higher than Y" signifies that
the number of tasks completed per unit time on computer X is 1.3 times
the number completed on Y.
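The relationship above is simple enough to state as code (illustrative helper name):

```python
def times_faster(time_x, time_y):
    # "X is n times faster than Y" means n = Execution time_Y / Execution time_X,
    # which equals Performance_X / Performance_Y.
    return time_y / time_x

# X finishes a task in 10 s, Y in 15 s: X is 1.5 times faster.
n = times_faster(10.0, 15.0)
```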
35. Measuring Performance
Execution time can be defined in different ways:
wall-clock time,
response time, or
elapsed time, all of which name the latency to complete a task.
The response time seen by the user is the elapsed time of the
program, not the CPU time.
To evaluate a new system, users would simply compare the
execution time of their workloads.
36. Reporting Performance
Reporting Performance Results
Reported performance measurements should be reproducible:
list everything another experimenter would need to duplicate the
results.
A SPEC (Standard Performance Evaluation Corporation)
(www.spec.org) benchmark report requires an extensive description of
the computer and the compiler flags, as well as the publication of both
the baseline and optimized results.
37. Reporting Performance
Reporting Performance Results
In addition to hardware, software, and baseline tuning parameter
descriptions, a SPEC report contains the actual performance times,
shown both in tabular form and as a graph.
A TPC (Transaction Processing Council) benchmark report is even
more complete.
These reports are excellent sources for finding the real cost of
computing systems.
38. Summarizing Performance
Summarizing Performance Results
A straightforward approach to computing a summary result would be
to compare the arithmetic means of the execution times of the
programs in the suite.
An alternative would be to add a weighting factor to each benchmark
and use the weighted arithmetic mean as the single number to
summarize performance.
Each company might have its own set of weights.
39. Summarizing Performance
Summarizing Performance Results
Rather than pick weights, we could normalize execution times to a
reference computer by dividing the time on the reference computer by
the time on the computer being rated.
SPEC uses this approach, calling the ratio the SPECRatio.
For example, suppose that the SPECRatio of computer A on a
benchmark was 1.25 times higher than that of computer B; then you
would know:
1.25 = SPECRatio_A / SPECRatio_B
= (Execution time_reference / Execution time_A) / (Execution time_reference / Execution time_B)
= Execution time_B / Execution time_A = Performance_A / Performance_B
40. Summarizing Performance
Because a SPECRatio is a ratio rather than an absolute execution time,
the mean must be computed using the geometric mean.
(Since SPECRatios have no units, comparing SPECRatios
arithmetically is meaningless.)
The formula is
Geometric mean = (sample_1 × sample_2 × … × sample_n)^(1/n)
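The geometric mean is a one-liner in Python (illustrative function name; `math.prod` requires Python 3.8+):

```python
import math

def geometric_mean(ratios):
    # Geometric mean = (product of the n SPECRatios) ** (1/n)
    return math.prod(ratios) ** (1 / len(ratios))

# For ratios 2.0 and 8.0 the geometric mean is sqrt(16) = 4.0,
# while the arithmetic mean would be 5.0.
gm = geometric_mean([2.0, 8.0])
```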
41. Summarizing Performance
Example: Show that the ratio of the geometric means is equal to the
geometric mean of the performance ratios, and that the choice of
reference computer for the SPECRatio does not matter.
Answer: Assume two computers A and B and a set of SPECRatios for
each. Then
Geometric mean_A / Geometric mean_B
= geometric mean of (SPECRatio A_i / SPECRatio B_i)
= geometric mean of (Execution time B_i / Execution time A_i)
= geometric mean of (Performance A_i / Performance B_i)
since the reference execution time cancels in each ratio.
That is, the ratio of the geometric means of the SPECRatios of A and B is
the geometric mean of the performance ratios of A to B, and the choice of
reference computer is irrelevant.
43. Pipelining (Concept)
Let's consider the example of washing a car:
Suppose washing, drying, and polishing a car require 30 minutes each.
To wash, dry, and polish:
1 car will take 1.5 hrs
4 cars will need (1.5 hrs × 4) 6 hours…
Suppose instead:
After washing the first car, it is sent for drying; at the same time the
second car is taken for washing.
Washing of the second car and drying of the first car are done
simultaneously, and finish at the same time.
When washing of the second car and drying of the first car are over, the
first car is sent for polishing, the washed car is sent for drying, and the
third car is taken for washing.
The total time to complete all three operations for 4 cars is ___.
44. What is Pipelining?
In computing, a pipeline is a set of data processing elements
connected in series, so that the output of one element is the input
of the next one.
Pipelining is an implementation technique whereby multiple
instructions are overlapped in execution;
The elements of a pipeline are often executed in parallel or in time-
sliced fashion.
Today, pipelining is the key implementation technique used to make
fast CPUs.
45. Pipelining Types
Buffered, Synchronous pipelines:
Conventional microprocessors are synchronous circuits that use
buffered, synchronous pipelines.
In these pipelines, "pipeline registers" are inserted in-between
pipeline stages, and are clocked synchronously.
Buffered, Asynchronous pipelines:
Asynchronous pipelines are used in asynchronous circuits, and have
their pipeline registers clocked asynchronously.
They use a request/acknowledge system, wherein each stage can
detect when it's finished.
46. Pipelining Types
Unbuffered pipelines:
Unbuffered pipelines, called "wave pipelines", do not have registers in-
between pipeline stages.
Instead, the delays in the pipeline are "balanced" so that, for each
stage, the difference between the first stabilized output data and the last
is minimized.
47. Pipelining
Because all stages proceed at the same time, the length of a processor
cycle is determined by the time required for the slowest pipe stage.
In a computer, this processor cycle is usually 1 clock cycle (sometimes
it is 2, rarely more).
The pipeline designer's goal is to balance the length of each pipeline
stage.
If the stages are perfectly balanced, then the time per instruction on the
pipelined processor is
Time per instruction (pipelined)
= Time per instruction (unpipelined) / Number of pipe stages
48. Implementation of a RISC Instruction Set
How is a RISC instruction set implemented without pipelining?
Every RISC instruction in this subset takes at most 5 clock cycles.
This basic implementation can be extended to a pipelined version,
resulting in a much lower CPI.
The unpipelined implementation is not the most economical or the
highest-performance implementation.
Implementing the instruction set requires the introduction of several
temporary registers that are not part of the architecture.
49. Implementation of a RISC Instruction Set
Every instruction in this RISC subset can be implemented in at most 5
clock cycles. The 5 clock cycles are as follows.
1. Instruction fetch cycle (IF):
Send the program counter (PC) to memory and fetch the current
instruction from memory. Update the PC to the next sequential PC by
adding 4 (since each instruction is 4 bytes) to the PC.
2. Instruction decode/register fetch cycle (ID):
Decode the instruction and read the registers corresponding to register
source specifiers from the register file.
Decoding is done in parallel with reading registers, which is possible
because the register specifiers are at a fixed location in a RISC
architecture. This technique is known as fixed-field decoding.
50. Implementation of a RISC Instruction Set
Every instruction in this RISC subset can be implemented in at most 5
clock cycles. The 5 clock cycles are as follows.
3. Execution/effective address cycle (EX):
The ALU operates on the operands prepared in the prior cycle,
performing one of three functions:
Memory reference: the ALU adds the base register and the offset to
form the effective address.
Register-Register ALU instruction: the ALU performs the operation
specified by the opcode on the values read from the register file.
Register-Immediate ALU instruction: the ALU performs the operation
specified by the opcode on the first value read from the register file
and the sign-extended immediate.
In a load-store architecture the effective address and execution cycles
can be combined into a single clock cycle.
51. Implementation of a RISC Instruction Set
Every instruction in this RISC subset can be implemented in at most 5
clock cycles. The 5 clock cycles are as follows.
4. Memory access (MEM):
If the instruction is a load, memory does a read using the effective
address computed in the previous cycle. If it is a store, then the memory
writes the data from the second register read from the register file using
the effective address.
5. Write-back cycle (WB):
Register-Register ALU instruction or Load instruction: Write the result
into the register file, whether it comes from the memory system (for a
load) or from the ALU (for an ALU instruction).
52. Simple RISC pipeline
IF = instruction fetch,
ID = instruction decode,
EX = execution,
MEM = memory access, and
WB = write back.
53. Simple RISC pipeline
Pipeline as a series of data paths shifted in time
IM = instruction memory,
DM = data memory, and
CC = clock cycle
54. Performance Issues in Pipelining
Pipelining increases the CPU instruction throughput — the number
of instructions completed per unit of time — but it does not reduce the
execution time of an individual instruction.
In fact, it usually slightly increases the execution time of each
instruction due to overhead in the control of the pipeline.
The increase in instruction throughput means that a program runs
faster and has lower total execution time, even though no single
instruction runs faster.
55. Performance Issues in Pipelining
Imbalance among the pipeline stages reduces performance.
Pipeline overhead arises from the combination of pipeline register
delay and clock skew.
The pipeline registers add setup time (the time that a register input must
be stable before the clock signal arrives).
Clock skew also contributes to the lower limit on the clock cycle.
Once the clock cycle is as small as the sum of the clock skew and
latch overhead, no further pipelining is useful,
since there is no time left in the cycle for useful work.
56. Performance Issues in Pipelining
Example: Consider the unpipelined processor in the previous section.
Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU
operations and branches and 5 cycles for memory operations. Assume
that the relative frequencies of these operations are 40%, 20%, and
40%, respectively. Suppose that due to clock skew and setup, pipelining
the processor adds 0.2 ns of overhead to the clock. Ignoring any latency
impact, how much speedup in the instruction execution rate will we gain
from a pipeline?
Answer: The average instruction execution time on the unpipelined
processor is
Average instruction execution time = Clock cycle × Average CPI
= 1 ns × ((40% + 20%) × 4 + 40% × 5)
= 1 ns × 4.4
= 4.4 ns
57. Performance Issues in Pipelining
In the pipelined implementation, the clock must run at the speed of
the slowest stage plus overhead, which will be 1 + 0.2 or 1.2 ns; this
is the average instruction execution time.
Thus, the speedup from pipelining is
Speedup = Average instruction time unpipelined / Average instruction time pipelined
= 4.4 ns / 1.2 ns = 3.7 times
The 0.2 ns overhead essentially establishes a limit on the effectiveness of
pipelining. If the overhead is not affected by changes in the clock cycle,
Amdahl’s Law tells us that the overhead limits the speedup.
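The example above can be checked with a small Python helper (names and argument layout are illustrative):

```python
def pipeline_speedup(cycle_ns, cycles_by_class, frequencies, overhead_ns):
    # Unpipelined: average instruction time = clock cycle × average CPI.
    avg_unpipelined = cycle_ns * sum(
        cycles * freq for cycles, freq in zip(cycles_by_class, frequencies)
    )
    # Pipelined: each instruction takes one clock cycle, stretched by overhead.
    pipelined = cycle_ns + overhead_ns
    return avg_unpipelined / pipelined

# ALU ops (40%) and branches (20%) take 4 cycles, memory ops (40%) take 5;
# 1 ns base cycle plus 0.2 ns skew/setup overhead.
speedup = pipeline_speedup(1.0, [4, 4, 5], [0.40, 0.20, 0.40], 0.2)   # about 3.7
```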
58. Pipeline Hazards
There are situations, called hazards, that prevent the next
instruction in the instruction stream from executing during its
designated clock cycle.
Hazards reduce the performance gained from pipelining.
There are three classes of hazards:
1. Structural hazards arise from resource conflicts when the hardware
cannot support all possible combinations of instructions simultaneously
in overlapped execution.
2. Data hazards arise when an instruction depends on the results of a
previous instruction, because of overlapping of instructions.
3. Control hazards arise from the pipelining of branches and other
instructions that change the PC.
59. Pipeline Hazards
Hazards in pipelines can make it necessary to stall (stop, halt, or pause)
the pipeline.
Avoiding a hazard often requires that some instructions in the pipeline
be allowed to proceed while others are delayed.
When an instruction is stalled, all instructions issued later than the
stalled instruction are also stalled.
Instructions issued earlier than the stalled instruction must continue,
otherwise the hazard will never clear.
60. Pipeline Hazards
Structural Hazards
If some combination of instructions cannot be accommodated because
of resource conflicts, the processor is said to have a structural
hazard.
Structural hazards arise:
1. When some functional unit is not fully pipelined.
2. When some resource has not been duplicated enough to allow
all combinations of instructions in the pipeline to execute.
For example, a processor may have only one register-file write port, but
the pipeline might want to perform two writes in a clock cycle.
This will generate a structural hazard.
61. Pipeline Hazards
Structural Hazards
When an instruction encounters this hazard, the pipeline will stall
one of the instructions until the required unit is available.
Such stalls will increase the CPI from its usual ideal value of 1.
To resolve this hazard, we need to stall the pipeline for 1 clock
cycle. A stall is commonly called a pipeline bubble, or just bubble.
The effect of the pipeline bubble is to occupy the resources
for that instruction slot as it travels through the pipeline.
62. Pipeline Hazards
Data Hazards
A major effect of pipelining is to change the relative timing of instructions
by overlapping their execution.
This overlap introduces data and control hazards.
Data hazards occur when the pipeline changes the order of read/write
accesses to operands so that the order differs from the order seen by
sequentially executing instructions on an unpipelined processor.
63. Pipeline Hazards
Data Hazards
Consider the pipelined execution of these instructions:
DADD R1,R2,R3
DSUB R4,R1,R5
AND R6,R1,R7
OR R8,R1,R9
XOR R10,R1,R11
All the instructions after the DADD use the result of the DADD
instruction.
The DADD instruction writes the value of R1 in the WB pipe stage, but
the DSUB instruction reads the value during its ID stage.
This problem is called a data hazard. Unless precautions are taken to
prevent it, the DSUB instruction will read the wrong value and try to use
it.
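As a toy Python sketch of the hazard analysis above: assuming the split-cycle register file described in these slides (writes happen in the first half of WB, reads in the second half of ID), an instruction that reads a register only 1 or 2 slots after the instruction that writes it gets stale data. The helper and the tuple encoding are made up for illustration:

```python
def raw_hazards(instrs):
    # Each instruction is (name, destination register, source registers).
    # In the classic 5-stage pipeline without forwarding, a write in WB
    # (stage 5 of instruction i) is visible to a read in ID (stage 2 of
    # instruction i + k) only when k >= 3, so gaps of 1 or 2 hazard.
    hazards = []
    for i, (_, dest, _) in enumerate(instrs):
        for k in (1, 2):
            if i + k < len(instrs) and dest in instrs[i + k][2]:
                hazards.append((instrs[i][0], instrs[i + k][0]))
    return hazards

program = [
    ("DADD", "R1",  ("R2", "R3")),
    ("DSUB", "R4",  ("R1", "R5")),
    ("AND",  "R6",  ("R1", "R7")),
    ("OR",   "R8",  ("R1", "R9")),
    ("XOR",  "R10", ("R1", "R11")),
]
```

Running `raw_hazards(program)` flags exactly the DSUB and AND reads of R1, matching the slide's analysis: OR and XOR read late enough to be safe.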
64. Pipeline Hazards
Data Hazards
If an interrupt occurs between the DADD and DSUB instructions, then the
WB stage of the DADD will complete, and the value of R1 at that point
will be the result of the DADD.
The AND instruction is also affected by this hazard. The AND instruction
that reads the registers during clock cycle 4 will receive the wrong
results.
The XOR instruction operates properly because its register read occurs
in clock cycle 6, after the register write.
The OR instruction also operates without incurring a hazard.
65. Pipeline Hazards
Minimizing Data Hazard Stalls by Forwarding
The problem of data hazard can be solved with a simple hardware
technique called forwarding (also called bypassing and sometimes
short-circuiting).
The result is not actually needed by the DSUB until the DADD produces it.
If the result can be moved from the pipeline register where the DADD
stores it to where the DSUB needs it, then the need for a stall can be
avoided.
66. Pipeline Hazards
Minimizing Data Hazard Stalls by Forwarding
Using this observation, forwarding works as follows:
1. The ALU result from both the EX/MEM and MEM/WB pipeline registers
is always fed back to the ALU inputs.
2. If the forwarding hardware detects that the previous ALU operation has
written to the register corresponding to a source for the current ALU
operation, control logic selects the forwarded result.
67. Pipeline Hazards
Branch Hazards
Control hazards can cause a greater performance loss for the MIPS
pipeline than data hazards.
When a branch is executed, it may or may not change the PC to
something other than its current value plus 4.
If the branch is not taken, then repeating the IF stage is
unnecessary, since the correct instruction was already fetched.
One stall cycle for every branch will yield a performance loss of
10% to 30%, depending on the branch frequency.
68. Pipeline Hazards
Reducing Pipeline Branch Penalties
The software can try to minimize the branch penalty using
knowledge of the hardware scheme and of branch behavior.
The simplest scheme to handle branches is to freeze or flush the
pipeline, holding or deleting any instructions after the branch until
the branch destination is known.
It is simple for both hardware and software.
69. Pipeline Hazards
Reducing Pipeline Branch Penalties
Treat every branch as not taken.
(Allow the hardware to continue as if the branch were not
executed).
In the simple five-stage pipeline, this predicted untaken scheme is
implemented by continuing to fetch instructions as if the branch
were a normal instruction.
If the branch is taken, however, we need to turn the fetched
instruction into a no-op and restart the fetch at the target address.
70. Pipeline Hazards
Reducing Pipeline Branch Penalties
An alternative scheme is to treat every branch as taken.
As soon as the branch is decoded and the target address is
computed, we assume the branch to be taken and begin fetching
and executing at the target address (location).
In processors with more powerful (and hence slower) branch conditions,
the branch target is known before the branch outcome, and a
predicted-taken scheme might make sense.
Another scheme used in some processors is called the delayed branch.
This technique was heavily used in early RISC processors.
71. Performance of Pipelines with Stalls
A stall causes the pipeline performance to degrade from the ideal
performance.
Equation for finding the actual speedup from pipelining, starting from the
earlier formula:
Speedup from pipelining
= Average instruction time unpipelined / Average instruction time pipelined
= (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)
72. Performance of Pipelines with Stalls
Pipelining can be thought of as decreasing either the CPI or the clock
cycle time.
The ideal CPI on a pipelined processor is almost always 1.
Hence, we can compute the pipelined CPI:
CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
= 1 + Pipeline stall clock cycles per instruction
If we ignore the cycle time overhead of pipelining and assume the
stages are perfectly balanced, then the cycle time of the two processors
can be equal, leading to
Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
73. Performance of Pipelines with Stalls
If all instructions take the same number of cycles, which must also
equal the number of pipeline stages, then the unpipelined CPI is equal to
the depth of the pipeline, leading to
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
If there are no pipeline stalls, then pipelining can improve performance
by the depth of the pipeline.
Alternatively, if pipelining improves the clock cycle time, we can
compare the clock cycles of the unpipelined and pipelined processors.
This leads to
Speedup = (1 / (1 + Pipeline stall cycles per instruction))
× (Clock cycle unpipelined / Clock cycle pipelined)
74. Performance of Pipelines with Stalls
In cases where the pipe stages are perfectly balanced and there is
no overhead, the clock cycle on the pipelined processor is smaller than
the clock cycle of the unpipelined processor by a factor equal to the
pipeline depth:
Clock cycle pipelined = Clock cycle unpipelined / Pipeline depth
This leads to the following:
Speedup = (1 / (1 + Pipeline stall cycles per instruction)) × Pipeline depth
79. Instruction-Level Parallelism
All processors since about 1985 use pipelining to overlap the execution
of instructions and improve performance.
This potential overlap among instructions is called instruction-level
parallelism (ILP), since the instructions can be evaluated in parallel.
80. Instruction-Level Parallelism
Instruction-level parallelism (ILP) is a measure of how many of the
operations in a computer program can be performed simultaneously.
Consider the following program:
For Example:
1. e = a + b
2. f = c + d
3. g = e * f
Here, operation 3 depends on the results of operations 1 and 2, so
it cannot be calculated until both of them are completed. Operations
1 and 2 do not depend on any other operation, so they can be
calculated simultaneously.
If each operation is completed in one unit of time, then the three
instructions can be completed in two units of time, giving an ILP of 3/2.
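The 3/2 figure can be computed mechanically from a dependence graph. A small Python sketch (the `ilp` helper and the dictionary encoding of dependences are made up for illustration):

```python
def ilp(deps, n_ops=None):
    # deps maps each operation to the list of operations whose results it needs.
    # An operation's level = 1 + max level of its dependencies, so the
    # maximum level is the schedule length (critical-path length) and
    # ILP = (number of operations) / (schedule length).
    levels = {}
    def level(op):
        if op not in levels:
            levels[op] = 1 + max((level(d) for d in deps[op]), default=0)
        return levels[op]
    length = max(level(op) for op in deps)
    return len(deps) / length

# e = a + b; f = c + d; g = e * f  (the slide's example)
program = {"e": [], "f": [], "g": ["e", "f"]}
```

Here `ilp(program)` evaluates to 3/2: three operations scheduled in two time units.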
81. Instruction-Level Parallelism
Instruction-level parallelism (ILP):
Ordinary programs are written and executed sequentially.
ILP allows the compiler and the processor to overlap the execution
of multiple instructions or even to change the order in which
instructions are executed.
How much ILP exists in programs is very application specific.
In certain fields, such as graphics and scientific computing, the amount
can be very large.
However, other fields, such as cryptography, exhibit much less parallelism.
82. Instruction-Level Parallelism
There are two largely separable approaches to exploiting ILP:
(1) an approach that relies on hardware to help discover and exploit
the parallelism dynamically, and
(2) an approach that relies on software technology to find parallelism
statically at compile time.
83. Instruction-Level Parallelism
The value of the CPI for a pipelined processor is the sum of the base
CPI and all contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls
+ Control stalls
The ideal pipeline CPI is a measure of the maximum performance
attainable by the implementation.
84. Instruction-Level Parallelism
The simplest and most common way to increase the ILP is to exploit
parallelism among iterations of a loop, which is called loop-level
parallelism.
Example: A loop that adds two 1000-element arrays:
for (i=0; i<=999; i=i+1)
x[i] = x[i] + y[i];
Every iteration of the loop can overlap with any other iteration.
Such techniques work by unrolling the loop either statically by the
compiler or dynamically by the hardware.
85. Instruction-Level Parallelism
An important alternative method for exploiting loop-level parallelism
is the use of SIMD in both vector processors and Graphics Processing
Units (GPUs).
A SIMD instruction exploits data-level parallelism by operating on a
small to moderate number of data items in parallel.
86. SIMD
SIMD machines are capable of applying the exact same instruction
stream to multiple streams of data simultaneously.
This type of architecture is perfectly suited to achieving very high
processing rates.
88. ILP Challenges
Determining how one instruction depends on another is critical, and
determining how much parallelism exists in a program and how that
parallelism can be exploited is a major problem.
If two instructions are independent, they can execute
simultaneously in a pipeline, provided that the pipeline has sufficient
resources (and hence no structural hazards exist).
If two instructions are dependent, they are not parallel and must be
executed in order.
89. ILP Challenges
There are three different types of dependences:
Data dependences (also called true data dependences),
Name dependences, and
Control dependences.
An instruction j is data dependent on instruction i if either of the
following conditions holds:
Instruction i produces a result that may be used by instruction j, or
Instruction j is data dependent on instruction k, and instruction k
is data dependent on instruction i.
90. ILP Challenges
Data Dependences
For example, consider the following MIPS code sequence that
increments a vector of values in memory (starting at 0(R1) and with the
last element at 8(R2) ) by a scalar in register F2.
(For simplicity, ignore the effects of delayed branches.)
Loop: L.D F0,0(R1) ;F0 = array element
ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
DADDUI R1,R1,#-8 ;decrement pointer by 8 bytes
BNE R1,R2,Loop ;branch if R1 != R2
Here, the instructions form dependence chains: ADD.D depends on
L.D, S.D depends on ADD.D, and BNE depends on DADDUI.
91. ILP Challenges
Data Dependences
A data dependence conveys three things:
(1) the possibility of a hazard,
(2) the order in which results must be calculated, and
(3) an upper bound on how much parallelism can be exploited.
A data dependence can be overcome in two different ways:
(1) maintaining the dependence but avoiding a hazard, and
(2) eliminating the dependence by transforming the code.
Scheduling the code is the primary method used to avoid a hazard
without altering a dependence, and such scheduling can be done both
by the compiler and by the hardware.
92. ILP Challenges
Name Dependences
A name dependence occurs when two instructions use the same
register or memory location, called a name.
There are two types of name dependences between an instruction i
that precedes instruction j in program order:
1. An antidependence: When instruction j writes a register or memory
location that instruction i reads.
2. An output dependence: When instruction i and instruction j write the
same register or memory location.
93. ILP Challenges
Name Dependences
Because a name dependence is not a true dependence, instructions
involved in a name dependence can execute simultaneously or be
reordered.
If the name (register number or memory location) used in the
instructions is changed, then the instructions do not conflict.
This renaming can be more easily done for register operands, where it
is called register renaming.
Register renaming can be done either statically by a compiler or
dynamically by the hardware.
94. ILP Challenges
Data Hazards
A hazard exists whenever there is a name or data dependence
between instructions.
Normally, we must preserve program order.
The goal of both software and hardware techniques is to exploit
parallelism by preserving program order only where it affects the
outcome of the program.
Data hazards may be classified as one of three types, depending on
the order of read and write accesses in the instructions.
95. ILP Challenges
Data Hazards
The possible data hazards are
■ RAW (Read After Write) — j tries to read a source before i writes it,
so j incorrectly gets the old value. This hazard is the most common type
and corresponds to a true data dependence.
■ WAW (Write After Write) — This hazard corresponds to an output
dependence. WAW hazards are present only in pipelines that write in
more than one pipe stage.
■ WAR (Write After Read) — This hazard arises from an
antidependence. WAR hazards cannot occur in most static issue
pipelines. A WAR hazard occurs when some instructions write results
early in the pipeline and other instructions read a source late in the
pipeline.
Note that the RAR (Read After Read) case is not a hazard.
96. ILP Challenges
Control Dependences
A control dependence determines the ordering of an instruction, i,
with respect to a branch instruction so that instruction i is executed in
correct program order.
An example of a control dependence is the dependence of the statements
in the "then" part of an "if" statement on the branch.
97. ILP Challenges
Control Dependences
Consider, for example, the code:
if (p1) {
S1;
}
if (p2) {
S2;
}
Here, S1 is control dependent on p1, and S2 is control
dependent on p2 but not on p1.
In general, two constraints are imposed by control
dependences:
1. An instruction that is control dependent on a branch cannot
be moved before the branch.
2. An instruction that is not control dependent on a branch
cannot be moved after the branch.
For example, we cannot take a statement before the if statement
and move it into the then portion.
98. Overcoming Data Hazards
A simple statically scheduled pipeline fetches an instruction and issues
it, unless there is a data dependence.
If there is a data dependence, then the hazard detection hardware
stalls the pipeline starting with the instruction that uses the result.
No new instructions are fetched or issued until the dependence is
cleared.
99. Dynamic Scheduling
In dynamic scheduling, the hardware rearranges the instruction
execution to reduce the stalls while maintaining data flow.
Dynamic scheduling offers several advantages.
First, it allows code that was compiled with one pipeline in mind to
run efficiently on a different pipeline, eliminating the need to
recompile for a different microarchitecture.
Second, it enables handling of cases where dependences are
unknown at compile time.
Third, and most important, it allows the processor to tolerate
unpredictable delays, such as cache misses.
100. Dynamic Scheduling: The Idea
For example, consider this code:
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
The SUB.D instruction cannot execute because the dependence of
ADD.D on DIV.D causes the pipeline to stall; yet, SUB.D is not data
dependent on anything in the pipeline.
This hazard creates a performance limitation that can be eliminated by
not requiring instructions to execute in program order.
101. Dynamic Scheduling:
The Idea
In the classic five-stage pipeline, both structural and data hazards
could be checked during instruction decode (ID).
To allow us to begin executing the SUB.D in the above example,
We must separate the issue process into two parts:
- checking for any structural hazards and
- waiting for the absence of a data hazard.
Thus, we still use in-order instruction issue, but we want an instruction
to begin execution as soon as its data operands are available.
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
102. Dynamic Scheduling
To understand how register renaming eliminates WAR and WAW
hazards, consider the following example code sequence:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8
There are two antidependences: between the ADD.D and the SUB.D
and between the S.D and the MUL.D.
There is also an output dependence between the ADD.D and MUL.D.
There are also three true data dependences: between the DIV.D and
the ADD.D, between the SUB.D and the MUL.D, and between the
ADD.D and the S.D.
103. Dynamic Scheduling
These name dependences can all be eliminated by register
renaming.
For simplicity, assume the existence of two temporary registers, S
and T. Using S and T, the sequence can be rewritten without any
name dependences:
Original:
DIV.D F0,F2,F4
ADD.D F6,F0,F8
S.D F6,0(R1)
SUB.D F8,F10,F14
MUL.D F6,F10,F8
Renamed:
DIV.D F0,F2,F4
ADD.D S,F0,F8
S.D S,0(R1)
SUB.D T,F10,F14
MUL.D F6,F10,T
In addition, any subsequent uses of F8 must be replaced by the register
T. In this code segment, the renaming process can be done statically by
the compiler.
104. Speculation
Exploiting more parallelism requires that we overcome the
limitation of control dependence.
Overcoming control dependence is done by speculating
(guessing) on the outcome of branches and executing the program
as if our guesses were correct.
With speculation, we fetch, issue, and execute instructions as if our
branch predictions were always correct; dynamic scheduling alone
only fetches and issues such instructions.
Hardware speculation extends the ideas of dynamic scheduling.
105. Hardware-Based Speculation
Hardware-based speculation combines three key ideas:
(1) Dynamic branch prediction to choose which instructions to execute,
(2) Speculation to allow the execution of instructions before the
control dependences are resolved, and
(3) Dynamic scheduling to deal with the scheduling of different
combinations of basic blocks.
Hardware-based speculation follows the predicted flow of data values to
choose when to execute instructions.
106. Hardware-Based Speculation
Speculation allows instructions to execute out of order but forces
them to commit in order.
Adding this commit phase to the instruction execution sequence
requires an additional set of hardware buffers.
This hardware buffer, the reorder buffer (ROB), is also used to pass
results among instructions that may be speculated.
The ROB supplies operands in the interval between completion of
instruction execution and instruction commit.
107. Hardware-Based Speculation
Each entry in the ROB contains four fields:
- the instruction type,
- the destination field,
- the value field, and
- the ready field.
The instruction type field indicates whether the instruction is a branch
(and has no destination result), a store (which has a memory address
destination), or a register operation.
The destination field supplies the register number (for loads and ALU
operations) or the memory address (for stores).
The value field is used to hold the value of the instruction result until
the instruction commits.
The ready field indicates that the instruction has completed execution,
and the value is ready.
109. Hardware-Based Speculation
The hardware structure of the processor including the ROB is shown in
the figure.
The ROB subsumes the store buffers, and the renaming function of the
reservation stations is replaced by the ROB, with results tagged by ROB
entry number.
This tagging requires that the ROB entry assigned to an instruction be
tracked in the reservation stations.
110. Hardware-Based Speculation
Here are the four steps involved in instruction execution:
1. Issue —
Get an instruction from the instruction queue.
Issue the instruction if there is an empty reservation station and an
empty slot in the ROB.
If either all reservation stations are full or the ROB is full, instruction
issue is stalled until both have available entries.
2. Execute —
If one or more of the operands is not yet available, monitor the CDB
(Common Data Bus) while waiting for the register to be computed.
This step checks for RAW hazards. When both operands are available
at a reservation station, execute the operation.
Instructions may take multiple clock cycles in this stage.
111. Hardware-Based Speculation
Here are the four steps involved in instruction execution:
3. Write result—
When the result is available, write it on the CDB (Common Data Bus)
and from the CDB into the ROB, as well as to any reservation stations
waiting for this result.
4. Commit —
This is the final stage of completing an instruction, after which only its
result remains. (The commit phase is also called "completion" or
"graduation.")
112. Hardware-Based Speculation
There are three different sequences of actions at commit:
The normal commit case occurs when an instruction reaches the head
of the ROB and its result is present in the buffer.
Committing a store is similar except that memory is updated rather
than a result register.
When a branch with incorrect prediction reaches the head of the
ROB, it indicates that the speculation was wrong. The ROB is flushed
and execution is restarted at the correct successor of the branch. If the
branch was correctly predicted, the branch is finished.
113. ILP Using Multiple Issue
To improve performance further, we would like to decrease the CPI
below 1, but the CPI cannot be reduced below 1 if we issue only one
instruction every clock cycle.
The goal of multiple-issue processors is to allow multiple
instructions to issue in a clock cycle.
Multiple-issue processors come in three major flavors:
1. Statically scheduled superscalar processors
2. VLIW (very long instruction word) processors
3. Dynamically scheduled superscalar processors