Instruction Level Parallelism –
Compiler Techniques
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by
John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan
Kaufmann Publishers
Outline
- Instruction Level Parallelism (ILP)
- Compiler techniques to increase ILP
  - Loop unrolling
  - Static branch prediction
- Hardware techniques to increase ILP (next topic)
  - Dynamic branch prediction
  - Tomasulo algorithm
  - Multithreading
Source: www.cse.wustl.edu/~jain/cse567-11/ftp/multcore/
Instruction Level Parallelism
- Overlap execution of instructions to improve performance
- 2 approaches
  1. Rely on software techniques to find parallelism statically, at compile time
     - E.g., Itanium 2, ARM Cortex-A8
  2. Rely on hardware to help discover & exploit parallelism dynamically
     - E.g., Pentium 4, AMD Opteron, IBM Power
Techniques to Increase ILP
- Software
  - Branch prediction
  - Register renaming
  - Loop unrolling
  - Vector instructions
- Hardware
  - Instruction pipelining
  - Register renaming
  - Branch prediction
  - Superscalars & VLIW
  - Out-of-order execution
  - Speculative execution
Instruction Level Parallelism
- Basic Block (BB) ILP is quite small
  - BB – straight-line code sequence with no branches in except at the entry & no branches out except at the exit
  - Average dynamic branch frequency is 15% to 25%
  - So only 4 to 7 instructions execute between a pair of branches
  - Also, instructions in a BB are likely to depend on each other
Loop Level Parallelism
for (i = 1; i <= 1000; i = i+1)
    x[i] = x[i] + y[i];
- Each iteration is independent
- Exploit loop-level parallelism by "unrolling" the loop, either
  1. Statically, via loop unrolling by the compiler
     - Another way is vectors & GPUs, to be covered later
  2. Dynamically, via branch prediction
- Determining instruction dependence is critical to loop-level parallelism
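To make "each iteration is independent" concrete, the sketch below contrasts the slide's loop with a hypothetical counter-example that has a loop-carried dependence. The function names are illustrative and not from the slides, indices start at 0 rather than 1, and x and y are assumed not to overlap.

```c
#include <stddef.h>

/* Independent iterations: element i depends only on x[i] and y[i],
   so iterations may be reordered or overlapped.
   Assumes x and y do not overlap in memory. */
void add_arrays(double *x, const double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + y[i];
}

/* Hypothetical counter-example (not from the slides):
   iteration i reads x[i-1], which iteration i-1 wrote,
   so the iterations must execute in order. */
void prefix_sum(double *x, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] = x[i] + x[i - 1];
}
```

Only the first loop can be unrolled and rescheduled freely; in the second, every iteration carries a true data dependence on the previous one.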
Dependencies
- Data
- Name
- Control
Data Dependences
- Instruction J tries to read an operand before instruction I writes it
- Caused by a data dependence
  - This hazard results from an actual need for communication – true dependence (Read After Write, RAW)
- Dependences are transitive: instruction K depends on instruction J, & J depends on I
  - K → J → I
- Is add r1, r1, r2 a dependence?
Data Dependences (Cont.)
- Program order must be preserved
- Dependences are a property of programs
- Data dependences
  - Indicate the potential for a hazard
    - But the actual hazard & the length of any stall are properties of the pipeline
  - Determine the order in which results must be calculated
  - Set an upper bound on how much parallelism can possibly be exploited
- Goal
  - Exploit parallelism by preserving program order only where it affects the outcome of the program
Name Dependences
- When 2 instructions use the same register or memory location (called a name), but there is no flow of data between the instructions associated with that name
  - The dependence exists only because the same name is used
2 Types of Name Dependence
1. Antidependence
   - Instruction J writes an operand before instruction I reads it
   - Problem caused by reuse of the name r1
   - Write After Read (WAR) hazard
   - Not a problem in the MIPS 5-stage pipeline (registers are read in ID & written in WB, in program order)
2 Types of Name Dependence (Cont.)
2. Output dependence
   - Instruction J writes an operand before instruction I writes it
   - Problem caused by reuse of the name r1
   - Write After Write (WAW) hazard
   - Not a problem in the MIPS 5-stage pipeline (all register writes happen in WB, in program order)
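The three hazard classes (RAW, WAR, WAW) can be told apart mechanically by comparing register names. The following is a minimal sketch, assuming a hypothetical three-operand instruction record; `Instr`, `DepKind` and `classify` are illustrative names, and intervening instructions that overwrite a register are ignored.

```c
/* Hypothetical three-operand instruction: dest = src1 op src2,
   with registers identified by number. */
typedef struct {
    int dest;
    int src1;
    int src2;
} Instr;

typedef enum { DEP_NONE = 0, DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 } DepKind;

/* Returns a bit mask of the dependences of the later instruction j
   on the earlier instruction i (i precedes j in program order). */
int classify(Instr i, Instr j)
{
    int kinds = DEP_NONE;
    if (j.src1 == i.dest || j.src2 == i.dest)
        kinds |= DEP_RAW;   /* true data dependence: j reads what i writes */
    if (j.dest == i.src1 || j.dest == i.src2)
        kinds |= DEP_WAR;   /* antidependence: j overwrites what i reads */
    if (j.dest == i.dest)
        kinds |= DEP_WAW;   /* output dependence: both write the same name */
    return kinds;
}
```

Only the RAW case reflects a real flow of data; the other two disappear if the later instruction is given a different destination name.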
Name Dependences Solution – Register Renaming
- How to overcome these?
  - Rename registers, either by the compiler or by hardware
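One way to picture renaming is as a table that maps each architectural register to its most recent fresh name. The sketch below is illustrative only: it assumes straight-line code, a hypothetical `Instr` record, and an unlimited supply of new register numbers, which neither a real compiler nor real hardware has.

```c
#include <assert.h>

#define NUM_ARCH_REGS 32

/* Hypothetical three-operand instruction: dest = src1 op src2. */
typedef struct { int dest, src1, src2; } Instr;

/* Renames a straight-line sequence in place.
   Sources are translated through the current map; every destination
   gets a fresh name, so no name is written twice. */
void rename_registers(Instr *code, int n)
{
    int map[NUM_ARCH_REGS];
    int next_fresh = NUM_ARCH_REGS;          /* fresh names start after the architectural ones */

    for (int r = 0; r < NUM_ARCH_REGS; r++)
        map[r] = r;                          /* initially each register maps to itself */

    for (int k = 0; k < n; k++) {
        assert(code[k].dest >= 0 && code[k].dest < NUM_ARCH_REGS);
        assert(code[k].src1 >= 0 && code[k].src1 < NUM_ARCH_REGS);
        assert(code[k].src2 >= 0 && code[k].src2 < NUM_ARCH_REGS);
        code[k].src1 = map[code[k].src1];    /* read the latest version of each source */
        code[k].src2 = map[code[k].src2];
        map[code[k].dest] = next_fresh;      /* later readers see the new name */
        code[k].dest = next_fresh++;
    }
}
```

After renaming, true (RAW) dependences are still visible through the translated sources, while WAR and WAW conflicts on a reused name are gone because each write has its own destination.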
Control Dependencies
if p1 {
    …
    S1;
    …
}
if p2 {
    …
    S2;
    …
}
- S1 is control dependent on p1, & S2 is control dependent on p2 but not on p1
- An instruction that is control dependent on a branch can't be moved before the branch, & an instruction that isn't control dependent on a branch can't be moved after it
Control Dependencies (Cont.)
- Control dependence need not always be preserved
  - Can execute instructions that shouldn't have been executed, thereby violating control dependences
  - Only if this can be done without affecting the correctness of the program
- 2 properties critical to program correctness
  1. Data flow
  2. Exception behavior
Data Flow
- Actual flow of data among instructions that produce results & those that consume them
- Branches make the flow dynamic; they determine which instruction is the supplier of data
      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R5,R6
L:    …
      OR    R7,R1,R8
- Does OR depend on DADDU or DSUBU?
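The answer depends on the branch: if BEQZ is taken, OR consumes the value from DADDU, otherwise the value from DSUBU. A C rendering of the sequence makes this visible. It is a sketch in which function parameters stand in for registers, and the elided code at L is assumed not to modify R1.

```c
#include <stdint.h>

/* C rendering of the slide's sequence; parameters stand in for registers.
   Returns the value written to R7. */
uint64_t data_flow(uint64_t r2, uint64_t r3, uint64_t r4,
                   uint64_t r5, uint64_t r6, uint64_t r8)
{
    uint64_t r1 = r2 + r3;      /* DADDU R1,R2,R3 */
    if (r4 != 0)                /* BEQZ R4,L skips the next instruction when R4 == 0 */
        r1 = r5 - r6;           /* DSUBU R1,R5,R6 */
    return r1 | r8;             /* OR R7,R1,R8 */
}
```

Preserving data dependences alone is therefore not enough; the control dependence on BEQZ decides which producer OR actually sees.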
Exception Behaviour
- Any change in instruction execution order mustn't change how exceptions are raised in the program → no new exceptions
      DADDU R2,R3,R4
      BEQZ  R2,L1
      LW    R1,0(R2)
L1:   ....
- Can we move LW before BEQZ?
  - No data dependence between BEQZ & LW
  - Control dependence?
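In C terms the sequence is a guarded load, sketched below with illustrative names. Hoisting the load above the test would dereference a zero address whenever the branch would have been taken, which could raise a memory protection exception that the original program never raises.

```c
#include <stddef.h>

/* Guarded load: the load executes only when the address is non-zero.
   `fallback` stands in for whatever R1 held before the sequence. */
int guarded_load(const int *p, int fallback)
{
    int r1 = fallback;
    if (p != NULL)      /* BEQZ R2,L1 branches around the load when R2 == 0 */
        r1 = *p;        /* LW R1,0(R2) */
    return r1;
}
```

So the control dependence matters here even though no data dependence links the branch and the load.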
Compiler Techniques
- Consider the following code
for (i = 999; i >= 0; i = i-1)
    x[i] = x[i] + s;
- Consider the following latencies
- Write the program in assembly
  - To simplify, assume 8 is the lowest address
Compiler Techniques (Cont.)
Loop: LD     F0,0(R1)
      ADDD   F4,F0,F2
      SD     F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop
- R – integer registers
- F – floating point registers
- R1 – highest address of array
- F2 – s
- F4 – result of computation
- DADDUI – decrement pointer by 8 bytes
- BNE – branch if not equal
Without scheduling, the loop executes as follows:
Loop: LD     F0,0(R1)
      stall
      ADDD   F4,F0,F2
      stall
      stall
      SD     F4,0(R1)
      DADDUI R1,R1,#-8
      stall (assume int load latency 1)
      BNE    R1,R2,Loop
- Need 9 clock cycles
Revised Code – Pipeline Scheduling
Loop: LD     F0,0(R1)
      DADDUI R1,R1,#-8
      ADDD   F4,F0,F2
      stall
      stall
      SD     F4,8(R1)      ;offset 8 because R1 is already decremented
      BNE    R1,R2,Loop
- Need 7 clock cycles
  - 3 for execution (LD, ADDD, SD)
  - 4 for loop overhead (DADDUI, BNE, 2 stalls)
- How to make it even faster?
Solution – Loop Unrolling
- Unroll by a factor of 4 – assume the number of elements is divisible by 4
- Eliminate unnecessary instructions
Loop: LD     F0,0(R1)
      ADDD   F4,F0,F2
      SD     F4,0(R1)      ;drop DADDUI & BNE
      LD     F6,-8(R1)
      ADDD   F8,F6,F2
      SD     F8,-8(R1)     ;drop DADDUI & BNE
      LD     F10,-16(R1)
      ADDD   F12,F10,F2
      SD     F12,-16(R1)   ;drop DADDUI & BNE
      LD     F14,-24(R1)
      ADDD   F16,F14,F2
      SD     F16,-24(R1)
      DADDUI R1,R1,#-32    ;4 x 8
      BNE    R1,R2,Loop
- Stalls: 1 cycle after each LD, 2 cycles after each ADDD, 1 cycle after DADDUI
- 27 clock cycles (14 instructions + 13 stalls), 6.75 per original iteration
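At the source level, the same transformation looks roughly like the sketch below. The function names are illustrative, the array is indexed from 0 rather than addressed through R1, and, as on the slide, the element count is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element per iteration, highest index first. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = n; i-- > 0; )
        x[i] = x[i] + s;
}

/* Unrolled by 4: one index update and one loop test per 4 elements.
   Assumes n is a multiple of 4. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    for (size_t i = n; i >= 4; i -= 4) {
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
        x[i - 4] = x[i - 4] + s;
    }
}
```

Both functions produce the same array; the unrolled one simply pays the loop overhead a quarter as often.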
Revised Code – Pipeline Scheduling
Loop: LD     F0,0(R1)
      LD     F6,-8(R1)
      LD     F10,-16(R1)
      LD     F14,-24(R1)
      ADDD   F4,F0,F2
      ADDD   F8,F6,F2
      ADDD   F12,F10,F2
      ADDD   F16,F14,F2
      SD     F4,0(R1)
      SD     F8,-8(R1)
      DADDUI R1,R1,#-32
      SD     F12,16(R1)    ;16 = -16 + 32
      SD     F16,8(R1)     ;8 = -24 + 32
      BNE    R1,R2,Loop
- 14 clock cycles, 3.5 per original iteration
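The cycle counts quoted for the four versions of the loop all follow from the same tally: one cycle per instruction plus one per stall, divided by the number of array elements handled per pass. A small sketch of that arithmetic, with illustrative function names and assuming a single-issue in-order pipeline:

```c
/* Cycles for one pass through a loop body: one issue slot per
   instruction plus the stall cycles. */
int loop_cycles(int instructions, int stalls)
{
    return instructions + stalls;
}

/* Cycles per array element when one pass handles `elements` elements. */
double cycles_per_element(int instructions, int stalls, int elements)
{
    return (double)loop_cycles(instructions, stalls) / elements;
}
```

With the slides' numbers: the original loop is 5 instructions + 4 stalls = 9, the scheduled loop is 5 + 2 = 7, the unrolled loop is (14 + 13) / 4 = 6.75, and the unrolled and scheduled loop is (14 + 0) / 4 = 3.5.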
Loop Unrolling
- Usually the upper bound of the loop isn't known
- Suppose it is n
- Unroll the loop to make k copies of the body
- Generate a pair of consecutive loops
  - 1st executes n mod k times & has a body that is the original loop
  - 2nd is the unrolled body surrounded by an outer loop that iterates n/k times
- For large n, most of the execution time is spent in the unrolled loop
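The pair of loops can be sketched in C as follows, with k fixed at 4 for illustration. The function name is hypothetical, and the array is traversed upwards for simplicity rather than from the highest address down.

```c
#include <stddef.h>

/* Adds s to each of the n elements of x, unrolled with k = 4.
   Works for any n, including n < 4. */
void add_scalar_any_n(double *x, size_t n, double s)
{
    size_t i = 0;
    size_t leftover = n % 4;

    /* 1st loop: n mod k iterations of the original body. */
    for (; i < leftover; i++)
        x[i] = x[i] + s;

    /* 2nd loop: n / k iterations of the unrolled body. */
    for (; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

After the first loop, the number of remaining elements is a multiple of 4, so the unrolled loop never runs past the end of the array.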
Conditions for Unrolling Loops
- Loop unrolling requires understanding of
  - How 1 instruction depends on another
  - How instructions can be changed or reordered given the dependences
- 5 loop unrolling decisions
  1. Determine that unrolling the loop is useful by finding that the loop iterations are independent
  2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
Conditions for Unrolling Loops (Cont.)
  3. Eliminate the extra test & branch instructions & adjust the loop termination & iteration code
  4. Determine that the loads & stores in the unrolled loop can be interchanged by observing that loads & stores from different iterations are independent
     - This transformation requires analyzing the memory addresses & finding that they don't refer to the same address
  5. Schedule the code, preserving any dependences needed to yield the same result as the original code
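Decisions 2, 4 and 5 can be seen together in a source-level sketch of an unrolled and rescheduled body: four distinct temporaries play the role of distinct registers, and all loads are moved ahead of all stores. This is illustrative only; the function name is hypothetical, n is assumed to be a multiple of 4, and the reordering is legal because x[i] through x[i+3] are four different addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Unrolled by 4 with loads, adds and stores grouped.
   Assumes n is a multiple of 4. */
void add_scalar_scheduled4(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* Loads first, each into its own temporary (decision 2). */
        double t0 = x[i];
        double t1 = x[i + 1];
        double t2 = x[i + 2];
        double t3 = x[i + 3];

        /* Independent additions. */
        t0 += s;
        t1 += s;
        t2 += s;
        t3 += s;

        /* Stores last; safe to move past the later loads because the
           four addresses are distinct (decision 4). */
        x[i]     = t0;
        x[i + 1] = t1;
        x[i + 2] = t2;
        x[i + 3] = t3;
    }
}
```

If a store and a later load could refer to the same address, moving the load above the store would change the result, which is why the address analysis in decision 4 is required.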
Limits of Loop Unrolling
- Decrease in the amount of overhead amortized with each extra unrolling
  - Amdahl's Law
- Growth in code size
  - Larger loops increase the instruction cache miss rate
- Register pressure
  - Potential shortfall in registers created by aggressive unrolling & scheduling
- Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction
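The first limit can be illustrated with the running example. Assume, as a simplification, that scheduling hides every stall (the slides show this for an unroll factor of 4; for larger factors it is an assumption here). Each element then costs 3 instructions (LD, ADDD, SD) and each pass costs 2 overhead instructions (DADDUI, BNE), so the cycles per element for unroll factor k, written C(k) as an illustrative symbol, are:

```latex
\[
  C(k) = \frac{3k + 2}{k} = 3 + \frac{2}{k},
  \qquad C(4) = 3.5, \quad C(8) = 3.25, \quad \lim_{k \to \infty} C(k) = 3
\]
```

Going from k = 4 to k = 8 doubles the size of the loop body but saves only 0.25 cycles per element, since the overhead being amortized is already a small fraction of the total.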

More Related Content

Similar to Instruction Level Parallelism – Compiler Techniques

ARM - Advance RISC Machine
ARM - Advance RISC MachineARM - Advance RISC Machine
ARM - Advance RISC MachineEdutechLearners
 
Cs2253 coa-2marks-2013
Cs2253 coa-2marks-2013Cs2253 coa-2marks-2013
Cs2253 coa-2marks-2013Buvana Buvana
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesMarina Kolpakova
 
Instruction Set Architecture
Instruction Set ArchitectureInstruction Set Architecture
Instruction Set ArchitectureDilum Bandara
 
A Lightweight Instruction Scheduling Algorithm For Just In Time Compiler
A Lightweight Instruction Scheduling Algorithm For Just In Time CompilerA Lightweight Instruction Scheduling Algorithm For Just In Time Compiler
A Lightweight Instruction Scheduling Algorithm For Just In Time Compilerkeanumit
 
Lect05 Prog Model
Lect05 Prog ModelLect05 Prog Model
Lect05 Prog Modelanoosdomain
 
Microprocessor Microcontroller Interview & Viva Question.pdf
Microprocessor Microcontroller Interview & Viva Question.pdfMicroprocessor Microcontroller Interview & Viva Question.pdf
Microprocessor Microcontroller Interview & Viva Question.pdfEngineering Funda
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Dr.K. Thirunadana Sikamani
 
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptxCA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptxtrupeace
 
CALecture3Module1.ppt
CALecture3Module1.pptCALecture3Module1.ppt
CALecture3Module1.pptBeeMUcz
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Jorisimec.archive
 
07 processor basics
07 processor basics07 processor basics
07 processor basicsMurali M
 
Lec 12-15 mips instruction set processor
Lec 12-15 mips instruction set processorLec 12-15 mips instruction set processor
Lec 12-15 mips instruction set processorMayank Roy
 

Similar to Instruction Level Parallelism – Compiler Techniques (20)

ARM - Advance RISC Machine
ARM - Advance RISC MachineARM - Advance RISC Machine
ARM - Advance RISC Machine
 
arm
armarm
arm
 
Cs2253 coa-2marks-2013
Cs2253 coa-2marks-2013Cs2253 coa-2marks-2013
Cs2253 coa-2marks-2013
 
Chapter 3
Chapter 3Chapter 3
Chapter 3
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Computer Architecture Assignment Help
Computer Architecture Assignment HelpComputer Architecture Assignment Help
Computer Architecture Assignment Help
 
Instruction Set Architecture
Instruction Set ArchitectureInstruction Set Architecture
Instruction Set Architecture
 
A Lightweight Instruction Scheduling Algorithm For Just In Time Compiler
A Lightweight Instruction Scheduling Algorithm For Just In Time CompilerA Lightweight Instruction Scheduling Algorithm For Just In Time Compiler
A Lightweight Instruction Scheduling Algorithm For Just In Time Compiler
 
Lect05 Prog Model
Lect05 Prog ModelLect05 Prog Model
Lect05 Prog Model
 
Patt patelch04
Patt patelch04Patt patelch04
Patt patelch04
 
Microprocessor Microcontroller Interview & Viva Question.pdf
Microprocessor Microcontroller Interview & Viva Question.pdfMicroprocessor Microcontroller Interview & Viva Question.pdf
Microprocessor Microcontroller Interview & Viva Question.pdf
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
 
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptxCA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
CA-Lec4-RISCV-Instructions-1aaaaaaaaaa.pptx
 
CALecture3Module1.ppt
CALecture3Module1.pptCALecture3Module1.ppt
CALecture3Module1.ppt
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris
 
Embedded system - introduction to arm7
Embedded system -  introduction to arm7Embedded system -  introduction to arm7
Embedded system - introduction to arm7
 
07 processor basics
07 processor basics07 processor basics
07 processor basics
 
Lec 12-15 mips instruction set processor
Lec 12-15 mips instruction set processorLec 12-15 mips instruction set processor
Lec 12-15 mips instruction set processor
 
Arm architecture overview
Arm architecture overviewArm architecture overview
Arm architecture overview
 

More from Dilum Bandara

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningDilum Bandara
 
Time Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeTime Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeDilum Bandara
 
Introduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCAIntroduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCADilum Bandara
 
Introduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsIntroduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsDilum Bandara
 
Introduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresIntroduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresDilum Bandara
 
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixHard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixDilum Bandara
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopDilum Bandara
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsDilum Bandara
 
Introduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersIntroduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersDilum Bandara
 
Introduction to Thread Level Parallelism
Introduction to Thread Level ParallelismIntroduction to Thread Level Parallelism
Introduction to Thread Level ParallelismDilum Bandara
 
CPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesCPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesDilum Bandara
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsDilum Bandara
 
Instruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesInstruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesDilum Bandara
 
CPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionCPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionDilum Bandara
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
High Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPHigh Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPDilum Bandara
 
Introduction to Content Delivery Networks
Introduction to Content Delivery NetworksIntroduction to Content Delivery Networks
Introduction to Content Delivery NetworksDilum Bandara
 
Peer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingPeer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingDilum Bandara
 
Wired Broadband Communication
Wired Broadband CommunicationWired Broadband Communication
Wired Broadband CommunicationDilum Bandara
 

More from Dilum Bandara (20)

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Time Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in PracticeTime Series Analysis and Forecasting in Practice
Time Series Analysis and Forecasting in Practice
 
Introduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCAIntroduction to Dimension Reduction with PCA
Introduction to Dimension Reduction with PCA
 
Introduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive AnalyticsIntroduction to Descriptive & Predictive Analytics
Introduction to Descriptive & Predictive Analytics
 
Introduction to Concurrent Data Structures
Introduction to Concurrent Data StructuresIntroduction to Concurrent Data Structures
Introduction to Concurrent Data Structures
 
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-MatrixHard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
Hard to Paralelize Problems: Matrix-Vector and Matrix-Matrix
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel Problems
 
Introduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale ComputersIntroduction to Warehouse-Scale Computers
Introduction to Warehouse-Scale Computers
 
Introduction to Thread Level Parallelism
Introduction to Thread Level ParallelismIntroduction to Thread Level Parallelism
Introduction to Thread Level Parallelism
 
CPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching TechniquesCPU Memory Hierarchy and Caching Techniques
CPU Memory Hierarchy and Caching Techniques
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
 
Instruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware TechniquesInstruction Level Parallelism – Hardware Techniques
Instruction Level Parallelism – Hardware Techniques
 
CPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An IntroductionCPU Pipelining and Hazards - An Introduction
CPU Pipelining and Hazards - An Introduction
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
High Performance Networking with Advanced TCP
High Performance Networking with Advanced TCPHigh Performance Networking with Advanced TCP
High Performance Networking with Advanced TCP
 
Introduction to Content Delivery Networks
Introduction to Content Delivery NetworksIntroduction to Content Delivery Networks
Introduction to Content Delivery Networks
 
Peer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and StreamingPeer-to-Peer Networking Systems and Streaming
Peer-to-Peer Networking Systems and Streaming
 
Mobile Services
Mobile ServicesMobile Services
Mobile Services
 
Wired Broadband Communication
Wired Broadband CommunicationWired Broadband Communication
Wired Broadband Communication
 

Recently uploaded

AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101vincent683379
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsStefano
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!Memoori
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
Your enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jYour enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jNeo4j
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfFIDO Alliance
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsLeah Henrickson
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentationyogeshlabana357357
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyUXDXConf
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Hiroshi SHIBATA
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandIES VE
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctBrainSell Technologies
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastUXDXConf
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfUK Journal
 

Recently uploaded (20)

AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Your enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jYour enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4j
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
 

Instruction Level Parallelism – Compiler Techniques

  • 1. Instruction Level Parallelism – Compiler Techniques CS4342 Advanced Computer Architecture Dilum Bandara Dilum.Bandara@uom.lk Slides adapted from “Computer Architecture, A Quantitative Approach” by John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan Kaufmann Publishers
  • 2. Outline  Instruction Level Parallelism (ILP)  Compiler techniques to increase ILP  Loop Unrolling  Static Branch Prediction  Hardware techniques to increase ILP (next Topic)  Dynamic Branch Prediction  Tomasulo Algorithm  Multithreading 2
  • 3. Source: www.cse.wustl.edu/~jain/cse567-11/ftp/multcore/ Instruction Level Parallelism  Overlap execution of instructions to improve performance  2 approaches 1. Rely on software techniques to find parallelism, statically at compile-time  E.g., Itanium 2, ARM Cortex-A9 2. Rely on hardware to help discover & exploit parallelism dynamically  E.g., Pentium 4, AMD Opteron, IBM Power 3
  • 4. Techniques to ILP  Software  Branch prediction  Register renaming  Loop unrolling  Vector instructions  Hardware  Instruction pipelining  Register renaming  Branch prediction  Superscalars & VLIW  Out-of-order execution  Speculative execution 4
  • 5. Instruction Level Parallelism  Basic Block (BB) ILP is quite small  BB – straight-line code sequence with no branches in except to entry & no branches out except at exit  Average dynamic branch frequency 15% to 25%  4 to 7 instructions execute between a pair of branches  Also, instructions in BB likely to depend on each other 5
  • 6. Loop Level Parallelism for(i = 1; i <= 1000; i = i+1) x[i] = x[i] + y[i];  Each iteration is independent  Exploit loop-level parallelism by “unrolling” the loop, either 1. Statically, via loop unrolling by compiler  Another way is vectors & GPUs, to be covered later 2. Dynamically, via branch prediction  Determining instruction dependence is critical to loop-level parallelism 6
  • 8. Data Dependences  Instruction J tries to read operand before Instruction I writes it  Caused by a data dependence  Read After Write (RAW) hazard  This hazard results from an actual need for communication – True dependence  Dependences chain: instruction K depends on instruction J, & J depends on I  K → J → I  Is add r1, r1, r2 a dependence? 8
  • 9. Data Dependences (Cont.)  Program order must be preserved  Dependences are a property of programs  Data dependencies  Indicates potential for a hazard  But actual hazard & length of any stall is a property of the pipeline  Determines order in which results must be calculated  Sets an upper bound on how much parallelism can possibly be exploited  Goal  Exploit parallelism by preserving program order only where it affects outcome of the program 9
  • 10. Name Dependences  When 2 instructions use same register or memory location (called a name), but no flow of data between instructions associated with that name  Because of use of same name 10
  • 11. 2 Types of Name Dependence 1. Antidependence  Instruction J writes operand before instruction I reads it  Problem caused by use of name r1  Write After Read (WAR) hazard  Not a problem in MIPS 5-stage pipeline 11
  • 12. 2 Types of Name Dependence (Cont.) 2. Output dependence  Instruction J writes operand before instruction I writes it  Problem caused by reuse of name r1  Write After Write (WAW) hazard  Not a problem in MIPS 5-stage pipeline 12
  • 13. Name Dependences Solution – Register Renaming  How to overcome these?  Rename registers, either by compiler or by hardware 13
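A minimal sketch of renaming, written in C rather than assembly, with local variables standing in for registers. The function names and the variables `t`, `t1`, `t2` are hypothetical and only illustrate the idea: once the reused name is split in two, the antidependence and output dependence disappear and only the true dependences constrain the order.

```c
/* Before renaming: the name t is reused for two unrelated values. */
int reuse_name(int a, int b, int d, int e, int *c_out) {
    int t;
    t = a + b;        /* I1: writes t */
    *c_out = t * 2;   /* I2: reads t (true dependence on I1) */
    t = d - e;        /* I3: writes t (antidependence on I2, output dependence on I1) */
    return t + 1;     /* I4: reads t (true dependence on I3) */
}

/* After renaming t into t1 and t2: only true dependences remain,
   so I3 and I4 can legally be moved ahead of I1 and I2. */
int renamed(int a, int b, int d, int e, int *c_out) {
    int t2 = d - e;   /* I3, moved up */
    int f  = t2 + 1;  /* I4 */
    int t1 = a + b;   /* I1 */
    *c_out = t1 * 2;  /* I2 */
    return f;
}
```

Both functions return the same value and store the same result through `c_out`, although the second one executes the instructions in a different order.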
  • 14. Control Dependencies if p1 { … S1; … }; if p2 { … S2; … }  S1 is control dependent on p1, & S2 is control dependent on p2 but not on p1  An instruction that is control dependent on a branch can’t be moved before the branch, & an instruction that isn’t control dependent on it can’t be moved after it 14
  • 15. Control Dependencies (Cont.)  Control dependence need not be preserved  Can execute instructions that shouldn’t have been executed, thereby violating control dependences  Only if this doesn’t affect correctness of program  2 properties critical to program correctness 1. Data flow 2. Exception behavior 15
  • 16. Data Flow  Actual flow of data among instructions that produce results & those that consume them  Branches make flow dynamic, determine which instruction is supplier of data DADDU R1,R2,R3 BEQZ R4,L DSUBU R1,R5,R6 L: … OR R7,R1,R8  OR depends on DADDU or DSUBU? 16
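The data-flow question above can be sketched in C. This is an illustration only: the function name `data_flow` is hypothetical, the parameters mirror the register names on the slide, and any branch delay slot is ignored. The value that OR consumes comes from DADDU when the branch is taken (R4 == 0) and from DSUBU otherwise, so preserving data dependences alone is not enough; the control dependence decides which producer supplies R1.

```c
#include <stdint.h>

/* uint64_t models the unsigned 64-bit arithmetic of DADDU/DSUBU. */
uint64_t data_flow(uint64_t r2, uint64_t r3, uint64_t r4,
                   uint64_t r5, uint64_t r6, uint64_t r8) {
    uint64_t r1 = r2 + r3;   /* DADDU R1,R2,R3 */
    if (r4 != 0) {           /* BEQZ R4,L skips DSUBU when R4 == 0 */
        r1 = r5 - r6;        /* DSUBU R1,R5,R6 */
    }
    return r1 | r8;          /* L: OR R7,R1,R8 */
}
```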
  • 17. Exception Behaviour  Any changes in instruction execution order mustn’t change how exceptions are raised in program  no new exceptions DADDU R2,R3,R4 BEQZ R2,L1 LW R1,0(R2) L1: ....  Can we move LW before BEQZ?  No data dependence  Control dependence? 17
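A rough C analogue of the LW example, assuming R2 holds an address that may be zero. The function name `load_if_nonzero` is hypothetical. The load is guarded by the test, just as LW is guarded by BEQZ; hoisting the load above the test would dereference an address the original program never touches when R2 is zero, which could raise a memory exception that did not exist before.

```c
#include <stddef.h>

/* r2 plays the role of the address in R2; r1 is the previous value of R1. */
int load_if_nonzero(const int *r2, int r1) {
    if (r2 != NULL) {   /* BEQZ R2,L1 skips the load when R2 == 0 */
        r1 = *r2;       /* LW R1,0(R2): only safe under the guard */
    }
    return r1;          /* L1: */
}
```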
  • 18. Compiler Techniques  Consider following code for (i = 999; i >= 0; i = i-1) x[i] = x[i] + s;  Consider following latencies  Write program in Assembly  To simplify, assume 8 is lowest address 18
  • 19. Compiler Techniques (Cont.)
      Loop: LD     F0,0(R1)
            ADDD   F4,F0,F2
            SD     F4,0(R1)
            DADDUI R1,R1,#-8
            BNE    R1,R2,Loop
       R – integer registers  F – floating point registers  R1 – highest address of array  F2 – s  F4 – result of computation  DADDUI – decrement pointer by 8 bytes  BNE – branch not equal
      Same loop with stalls shown:
      Loop: LD     F0,0(R1)
            stall
            ADDD   F4,F0,F2
            stall
            stall
            SD     F4,0(R1)
            DADDUI R1,R1,#-8
            stall                ;assume int load latency 1
            BNE    R1,R2,Loop
       Need 9 clock cycles
  • 20. Revised Code – Pipeline Scheduling
      Loop: LD     F0,0(R1)
            DADDUI R1,R1,#-8
            ADDD   F4,F0,F2
            stall
            stall
            SD     F4,8(R1)
            BNE    R1,R2,Loop
       Need 7 clock cycles  3 for execution (LD, ADDD, SD)  4 for loop overhead  How to make it even faster?
  • 21. Solution – Loop Unrolling  Unroll by a factor of 4 – Assume number of elements is divisible by 4  Eliminate unnecessary instructions
      Loop: LD     F0,0(R1)
            ADDD   F4,F0,F2
            SD     F4,0(R1)      ;drop DADDUI & BNE
            LD     F6,-8(R1)
            ADDD   F8,F6,F2
            SD     F8,-8(R1)     ;drop DADDUI & BNE
            LD     F10,-16(R1)
            ADDD   F12,F10,F2
            SD     F12,-16(R1)   ;drop DADDUI & BNE
            LD     F14,-24(R1)
            ADDD   F16,F14,F2
            SD     F16,-24(R1)
            DADDUI R1,R1,#-32    ; 4 x 8
            BNE    R1,R2,Loop
       Stalls: 1 cycle after each LD, 2 cycles after each ADDD, 1 cycle after DADDUI  14 instructions + 13 stall cycles = 27 clock cycles, 6.75 per iteration
  • 22. Revised Code – Pipeline Scheduling
      Loop: LD     F0,0(R1)
            LD     F6,-8(R1)
            LD     F10,-16(R1)
            LD     F14,-24(R1)
            ADDD   F4,F0,F2
            ADDD   F8,F6,F2
            ADDD   F12,F10,F2
            ADDD   F16,F14,F2
            SD     F4,0(R1)
            SD     F8,-8(R1)
            DADDUI R1,R1,#-32
            SD     F12,16(R1)
            SD     F16,8(R1)
            BNE    R1,R2,Loop
       14 clock cycles, 3.5 per iteration
  • 23. Loop Unrolling  Usually don’t know upper bound of loop  Suppose it is n  Unroll loop to make k copies of the body  Generate a pair of consecutive loops  1st executes n mod k times & has a body that is the original loop  2nd unrolled body surrounded by an outer loop that iterates n/k times  For large n, most of the execution time spent in unrolled loop 23
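The pair of consecutive loops can be sketched in C for the running example x[i] = x[i] + s. This is a hand-written illustration with k fixed at 4, not compiler output; the function name `add_scalar_unrolled` and the constant `K` are hypothetical.

```c
#include <stddef.h>

/* Adds s to every element of x[0..n-1], unrolled by K = 4. */
void add_scalar_unrolled(double *x, size_t n, double s) {
    enum { K = 4 };
    size_t i = 0;
    size_t rem = n % K;

    /* 1st loop: original body, executes n mod k times. */
    for (; i < rem; i++) {
        x[i] = x[i] + s;
    }

    /* 2nd loop: unrolled body, iterates n / k times.
       The remaining element count is now a multiple of K. */
    for (; i < n; i += K) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

For n = 10 the first loop handles 2 elements and the unrolled loop runs twice; for n smaller than 4 only the first loop does any work.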
  • 24. Conditions for Unrolling Loops  Loop unrolling requires understanding of  How 1 instruction depends on another  How instructions can be changed or reordered given dependences  5 loop unrolling decisions 1. Determine that unrolling the loop is useful by finding that loop iterations are independent 2. Use different registers to avoid unnecessary constraints forced by using same registers for different computations 24
  • 25. Conditions for Unrolling Loops (Cont.) 3. Eliminate extra test & branch instructions & adjust loop termination & iteration code 4. Determine that loads & stores in unrolled loop can be interchanged by observing that loads & stores from different iterations are independent  Transformation requires analyzing memory addresses & finding that they don’t refer to the same address 5. Schedule code, preserving any dependences needed to yield the same result as original code 25
  • 26. Limits of Loop Unrolling  Decrease in amount of overhead amortized with each extra unrolling  Amdahl’s Law  Growth in code size  Larger loops increase the instruction cache miss rate  Register pressure  Potential shortfall in registers created by aggressive unrolling & scheduling  Loop unrolling reduces impact of branches on pipeline; another way is branch prediction 26
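The diminishing return from each extra unrolling can be seen with a small idealized model. It assumes, as in the scheduled 4-way example (14 cycles), that a loop unrolled k times (k >= 2) issues k loads, k adds, k stores, one DADDUI and one BNE with no remaining stalls. The function name `cycles_per_element` is hypothetical, and the model ignores code size, instruction cache misses and register pressure.

```c
/* Idealized cycles per array element for the scheduled loop unrolled
   k times (k >= 2): 3k useful instructions + 2 overhead instructions. */
double cycles_per_element(int k) {
    return (3.0 * k + 2.0) / k;
}
```

Under this model k = 4 gives 3.5 cycles per element and k = 8 gives 3.25, so doubling the code size removes only 0.25 cycles per element; the cost can never drop below the 3 cycles of useful work.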

Editor's Notes

  1. Here we don’t consider structural hazards
  2. BEQZ – branch if equal to 0 What if memory address is illegal?