Advanced Techniques
For Exploiting ILP
Mr. A. B. Shinde
Assistant Professor,
Electronics Engineering,
P.V.P.I.T., Budhgaon
Contents…
 Compiler techniques for exposing ILP
 Limitations on ILP for realizable processors
 Hardware versus software speculation
Pipelining
 A pipeline is a set of data processing elements connected in
series, so that the output of one element is the input of the next one.
 The elements of a pipeline are often executed in parallel.
Parallel Computing
 Parallel computing is a form of computation in which many
calculations are carried out simultaneously ("in parallel").
 There are several different forms of parallel computing:
 Bit-level,
 Instruction level,
 Data, and
 Task parallelism.
Instruction Level Parallelism (ILP)
 A computer program is a stream of instructions executed by
a processor.
 These instructions can be re-ordered and combined into groups
which are then executed in parallel without changing the result of the
program.
Instruction Level Parallelism (ILP)
In ordinary programs, instructions are executed in the order specified by
the programmer.
 How much ILP exists in programs is very application specific.
In certain fields, such as graphics and scientific computing, the
amount can be very large.
However, workloads such as cryptography exhibit much less
parallelism.
Pipeline Scheduling
 The straightforward MIPS code, not scheduled for the pipeline, looks
like this:
Loop: L.D F0,0(R1) ;F0=array element
ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
DADDUI R1,R1,#-8 ;decrement pointer by
;8 bytes
BNE R1,R2,Loop ;branch R1!=R2
 Let’s see how well this loop will run when it is scheduled on a simple
pipeline for MIPS.
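For orientation, the MIPS loop above adds a scalar to every element of an array. A minimal C sketch of such a source loop (an assumption for illustration; the names x, n, and s are not from the slides):
/* Hypothetical source loop: add the scalar s (held in F2) to every
   element of the double array x.  R1 plays the role of the element
   pointer; R2 marks where the loop stops. */
void add_scalar(double *x, long n, double s)
{
    for (long i = n - 1; i >= 0; i--)
        x[i] = x[i] + s;   /* one L.D / ADD.D / S.D per iteration */
}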
Pipeline Scheduling
 Example: Show how the loop would look on MIPS, both scheduled and
unscheduled, including any stalls or idle clock cycles.
 Answer: Without any scheduling, the loop will execute as follows,
taking 9 cycles:
Clock cycle issued
Loop: L.D F0,0(R1) 1
stall 2
ADD.D F4,F0,F2 3
stall 4
stall 5
S.D F4,0(R1) 6
DADDUI R1,R1,#-8 7
stall 8
BNE R1,R2,Loop 9
Pipeline Scheduling
 We can schedule the loop to obtain only two stalls and reduce the
time to 7 cycles:
Clock cycles
Loop: L.D F0,0(R1) 1
DADDUI R1,R1,#-8 2
ADD.D F4,F0,F2 3
Stall 4
Stall 5
S.D F4,8(R1) 6
BNE R1,R2,Loop 7
 The stalls after ADD.D are for use by the S.D.
For comparison, the unscheduled code (9 cycles): Clock cycles
Loop: L.D F0,0(R1) 1
stall 2
ADD.D F4,F0,F2 3
stall 4
stall 5
S.D F4,0(R1) 6
DADDUI R1,R1,#-8 7
stall 8
BNE R1,R2,Loop 9
Loop Unrolling
 In the previous example, we complete one loop iteration and store
back one array element every 7 clock cycles…
but the actual work of operating on the array element takes
just 3 (the load, add, and store) of those 7 clock cycles.
 The remaining 4 clock cycles consist of loop overhead—the
DADDUI and BNE—and two stalls.
 To eliminate these 4 clock cycles we need to get more operations
relative to the number of overhead instructions.
 A simple scheme for increasing the number of instructions relative to
the branch and overhead instructions is loop unrolling.
 Unrolling simply replicates the loop body multiple times.
Loop Unrolling
Code with loop unrolling (unrolled four times):
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1) ;drop DADDUI & BNE
L.D F6,-8(R1)
ADD.D F8,F6,F2
S.D F8,-8(R1) ;drop DADDUI & BNE
L.D F10,-16(R1)
ADD.D F12,F10,F2
S.D F12,-16(R1) ;drop DADDUI & BNE
L.D F14,-24(R1)
ADD.D F16,F14,F2
S.D F16,-24(R1)
DADDUI R1,R1,#-32
BNE R1,R2,Loop
Code without loop unrolling (the scheduled loop from the previous slide):
Loop: L.D F0,0(R1)
DADDUI R1,R1,#-8
ADD.D F4,F0,F2
Stall
Stall
S.D F4,8(R1)
BNE R1,R2,Loop
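For readers more comfortable with source code, here is a C-level sketch of the same transformation. It is an illustration only, assuming the same array-update loop as before; the names are invented, and a real compiler must also handle trip counts that are not a multiple of four, which the slides' example sidesteps.
/* The loop body replicated four times: one index update and one
   branch test now cover four elements instead of one. */
void add_scalar_unrolled(double *x, long n, double s)
{
    long i;
    for (i = n - 1; i >= 3; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
    for (; i >= 0; i--)          /* clean-up loop for any leftover elements */
        x[i] = x[i] + s;
}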
Loop Unrolling
 Loop unrolling can also be used to improve scheduling: because it
eliminates the branch, it allows instructions from different iterations to be
scheduled together.
 If we simply replicated the instructions when we unrolled the loop,
the resulting use of the same registers could prevent us from
effectively scheduling the loop.
 Thus, effective unrolling increases the number of registers required.
Loop Unrolling and Scheduling:
Summary
 To obtain the final unrolled code we need to make the following
decisions and transformations:
 Determine that unrolling the loop would be useful by finding that the
loop iterations were independent.
 Use different registers to avoid unnecessary constraints that would be
forced by using the same registers for different computations.
 Eliminate the extra test and branch instructions and adjust the loop
termination.
 Determine that the loads and stores in the unrolled loop can be
interchanged, if loads and stores from different iterations are
independent.
 Schedule the code, preserving any dependences needed to produce the
same result as the original code (a C-level sketch of the result follows this list).
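As an illustration of these decisions at the source level, here is a hedged C sketch of the unrolled loop after "register renaming" (distinct temporaries) and scheduling (loads grouped before adds, adds before stores). The names are invented and n is assumed to be a multiple of four:
/* Unrolled, "renamed" (distinct temporaries t0..t3 instead of reusing
   one register), and scheduled: all loads first, then the independent
   adds, then the stores, as in the scheduled MIPS version. */
void add_scalar_unrolled_scheduled(double *x, long n, double s)
{
    for (long i = n - 4; i >= 0; i -= 4) {   /* n assumed a multiple of 4 */
        double t0 = x[i + 3];
        double t1 = x[i + 2];
        double t2 = x[i + 1];
        double t3 = x[i];
        t0 += s;                 /* the four adds are independent, so the */
        t1 += s;                 /* pipeline can overlap them without stalls */
        t2 += s;
        t3 += s;
        x[i + 3] = t0;
        x[i + 2] = t1;
        x[i + 1] = t2;
        x[i]     = t3;
    }
}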
Loop Unrolling and Scheduling:
Summary
 There are three different types of limits to the gains that can be
achieved by loop unrolling:
1. A decrease in the amount of overhead with each unroll,
2. Code size limitations, and
3. Compiler limitations.
Let’s consider the question of loop overhead first.
When we unrolled the loop four times, it generated sufficient
parallelism among the instructions that the loop could be scheduled
with no stall cycles.
In the previous example, of the 14 clock cycles, only 2 were loop
overhead: the DADDUI and the BNE.
Loop Unrolling and Scheduling:
Summary
 A second limit to unrolling is the growth in code size.
 A factor often more important than code size is the potential shortfall
in registers that is created by aggressive unrolling and scheduling.
 The transformed code is theoretically faster, but it may generate a
shortage of registers (register pressure).
Loop Unrolling and Scheduling:
Summary
 Loop unrolling is a simple but useful method for increasing the size
of straight-line code fragments that can be scheduled effectively.
 This transformation is useful in a variety of processors, from simple
pipelines to multiple-issue processors.
Limitations of ILP
Limitations of ILP
 Exploiting ILP to increase performance began with the first pipelined
processors in the 1960s.
 In the 1980s and 1990s, these techniques were used to achieve rapid
performance improvements.
 To keep enhancing performance at a rate comparable to that of
integrated circuit technology, the critical question is:
what is needed to exploit more ILP? The answer is crucial to both computer
designers and compiler writers.
Limitations of ILP
 To know what actually limits ILP…
… we first need to define an ideal processor.
 An ideal processor is one where all constraints on ILP are
removed.
 The only limits on ILP in an ideal processor are those imposed by the
actual data flows through either registers or memory.
Ideal Processor
 The assumptions made for an ideal or perfect processor are as
follows:
1. Register renaming
There are an infinite number of virtual registers available; hence
all WAW and WAR hazards are avoided, and an unbounded number of
instructions can begin execution simultaneously.
2. Branch prediction
Branch prediction is perfect. All conditional branches are predicted
exactly.
3. Jump prediction
All jumps are perfectly predicted.
Ideal Processor
 The assumptions made for an ideal or perfect processor are as
follows:
4. Memory address analysis
All memory addresses are known exactly, and a load can be moved
before a store provided that the addresses are not identical.
This implements perfect address analysis.
5. Perfect caches
All memory accesses take 1 clock cycle.
Ideal Processor
 Assumptions 2 and 3 eliminate all control dependences.
 Assumptions 1 and 4 eliminate all but the true data dependences.
 These four assumptions mean that any instruction in the program’s
execution can be scheduled on the cycle immediately following the
execution of the predecessor on which it depends.
 Under these assumptions, it is possible for the last dynamically executed
instruction in the program to be scheduled on the very first cycle.
Ideal Processor
 How close could a dynamically scheduled, speculative processor
come to the ideal processor?
To answer this question, consider what the perfect processor must do:
1. Look arbitrarily far ahead to find a set of instructions to issue,
predicting all branches perfectly.
2. Rename all registers used to avoid WAR and WAW hazards.
3. Determine whether there are any data dependences among the
instructions; if so, rename accordingly.
4. Determine if any memory dependences exist among the issuing
instructions and handle them appropriately.
5. Provide enough replicated functional units to allow all the ready
instructions to issue (no structural hazards).
Ideal Processor
 For example, to determine whether n issuing instructions have any
register dependences among them, assuming all instructions are
register-register and the total number of registers is unbounded,
requires
2(n − 1) + 2(n − 2) + … + 2 = n^2 − n comparisons.
Thus, issuing only 50 instructions requires 2450 comparisons.
This cost obviously limits the number of instructions that can be
considered for issue at once.
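As a sanity check on that count, a small C sketch (an assumption, not from the slides) that tallies the pairwise register comparisons for n register-register instructions, each with one destination and two sources:
#include <stdio.h>

/* Each later instruction's two source registers must be compared
   against the destination of every earlier instruction:
   2*1 + 2*2 + ... + 2*(n-1) = n*n - n comparisons in total. */
static long dependence_comparisons(long n)
{
    long total = 0;
    for (long i = 1; i < n; i++)
        total += 2 * i;
    return total;              /* equals n*n - n */
}

int main(void)
{
    printf("%ld\n", dependence_comparisons(50));   /* prints 2450 */
    return 0;
}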
Limitations on ILP for Realizable
Processors
 The limitations are divided into two classes:
 Limitations that arise even for the perfect speculative processor,
and
 Limitations that arise for one or more realistic models.
Limitations on ILP for Realizable
Processors
 The most important limitations that apply even to the perfect model are
1. WAW and WAR hazards through memory
Register renaming eliminates WAW and WAR hazards through registers,
but not WAW and WAR hazards through memory.
A normal procedure reuses the memory locations of a previous
procedure on the stack, and this can lead to WAW and WAR hazards.
Limitations on ILP for Realizable
Processors
 The most important limitations that apply even to the perfect model are
2. Unnecessary dependences
With infinite numbers of registers, all but true register data
dependences are removed.
There are, however, dependences arising from either recurrences or code
generation conventions that introduce unnecessary data dependences.
Code generation conventions introduce unneeded dependences, in
particular the use of return address registers and a register for the stack
pointer (which is incremented and decremented in the call/return
sequence).
Limitations on ILP for Realizable
Processors
 The most important limitations that apply even to the perfect model are
3. Overcoming the data flow limit
If value prediction worked with high accuracy, it could overcome the
data flow limit.
In practice, however, it is difficult to achieve a significant enhancement in ILP
using such a prediction scheme.
Limitations on ILP for Realizable
Processors
 For a less-than-perfect processor, several ideas have been
proposed that could expose more ILP.
 To speculate along multiple paths: This idea was discussed by Lam
and Wilson [1992]. By speculating on multiple paths, the cost of
incorrect recovery is reduced and more parallelism can be
exposed.
 Wall [1993] provides data for speculating in both directions on up to
eight branches.
Of the two paths, one will eventually be thrown away.
Every commercial design has instead devoted additional hardware to
better speculation on the correct path.
Hardware Vs Software Speculation
Hardware Vs Software Speculation
 To speculate extensively, we must be able to disambiguate (clear
the ambiguity) memory references.
 This is difficult to do at compile time for integer
programs that contain pointers.
 In a hardware-based scheme, dynamic run-time disambiguation of
memory addresses is done using Tomasulo’s algorithm.
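To make the difficulty concrete, here is a tiny C example (illustrative; the function and variable names are invented). At compile time the compiler cannot tell whether a and b alias, so it cannot move the load above the store; hardware with dynamic disambiguation simply compares the two addresses at run time.
/* If a and b happen to point to the same location, the load of *b must
   observe the value just stored through a, so the compiler cannot
   reorder the two accesses without proving they never alias. */
double store_then_load(double *a, double *b, double v)
{
    *a = v;          /* store */
    return *b;       /* load: may alias the store above */
}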
Hardware Vs Software Speculation
 Hardware-based speculation works better when control flow is
unpredictable, and
hardware-based branch prediction is superior to software-based
branch prediction done at compile time.
For example:
a good static predictor has a misprediction rate of about 16% for four
major integer SPEC92 programs, while a hardware predictor has a
misprediction rate of under 10%. This matters because speculated
instructions may slow down the computation when the prediction is incorrect.
Hardware Vs Software Speculation
 Hardware-based speculation maintains a completely precise
exception model even for speculated instructions.
 Hardware-based speculation does not require compensation or
book-keeping code, which is needed by software speculation
mechanisms.
 Compiler-based approaches may benefit from the ability to see
further ahead in the code sequence, resulting in better code scheduling than
a purely hardware-driven approach.
Hardware Vs Software Speculation
 Hardware-based speculation with dynamic scheduling does not
require different code sequences to achieve good performance for
different implementations of an architecture.
 On the other hand, more recent explicitly parallel architectures
(such as IA-64) have added flexibility that reduces the hardware
dependence inherent in a code sequence.
Hardware Vs Software Speculation
 The major disadvantage of supporting speculation in hardware is
the complexity and additional hardware resources required.
 Some designers have tried to combine the dynamic and compiler-based
approaches to achieve the best of each.
 For example:
If conditional moves are combined with register renaming, a slight
side effect appears: a conditional move that is annulled (its condition is false)
must still copy a value to the destination register, since that register was
renamed earlier.
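A hedged, C-style sketch of this side effect (illustrative only, not code from the slides): because the destination has been renamed to a fresh physical register, the conditional move must write that register even when its condition is false, copying the previous value instead of doing nothing.
/* p_old: previous value of the architectural destination,
   p_new: the freshly renamed physical register that later
   instructions will read.  Even when cond is false, p_new must be
   written, so the "annulled" move still copies p_old into it. */
double conditional_move(int cond, double p_old, double src)
{
    double p_new = cond ? src : p_old;
    return p_new;
}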
Thank You…
shindesir.pvp@gmail.com
(This Presentation is Published Only for Educational Purpose)