Instruction Scheduling
Sourabh Kumar
ECE
Pipeline…
A pipeline is the mechanism a RISC processor uses to execute instructions.
Using a pipeline speeds up execution by fetching the next instruction while
other instructions are being decoded and executed.
The ARM pipeline has not processed an instruction until it passes completely
through the execute stage.
[Figures: three-stage pipeline; five-stage pipeline]
Pipeline Execution…
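Instruction address   pc      pc-4     pc-8   pc-12   pc-16
Action                Fetch   Decode   ALU    LS1     LS2

Fetch: fetches from memory the instruction at address pc.
Decode: decodes the instruction fetched in the previous cycle and reads the
operands from the register bank, unless they are available through a forwarding path.
ALU: executes the instruction decoded in the previous cycle, typically computing
the result of a data processing operation or the address for a load, store, or branch.
LS1: loads or stores the data specified by a load or store instruction such as LDR.
LS2: extracts and zero- or sign-extends the data loaded by a byte or halfword
load such as LDRB or LDRH.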
Execution Cycles…
The following rules summarize the cycle timings for common instructions:
The time taken to execute an instruction depends on the pipeline
implementation.
Instructions that are conditional on the value of the ARM condition codes
in the CPSR take one cycle if the condition is not met.
If the condition is met, the rules on the next slide apply:
Execution Cycles…
ALU operations such as addition, subtraction, and logical operations take one
cycle. This includes a shift by an immediate value.
A register-specified shift adds one cycle, for two cycles in total.
Load instructions that load N 32-bit words, such as LDR and LDM, take N cycles to
issue, but the last word loaded is not available on the following cycle.
Load instructions that load 16-bit or 8-bit data, such as LDRH and LDRB, take one
cycle to issue, but the loaded value is not available on the following two cycles.
Branch instructions take three cycles.
Store instructions that store N values take N cycles. An STM of a single value is
exceptional, taking two cycles.
Multiply instructions take a varying number of cycles, depending on the value of the
second operand in the product.
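The timing rules above can be written down as a small lookup, which is handy when tallying a loop by hand. This is only a sketch of the rules as stated on the slide: the names OpClass and issue_cycles are hypothetical, multiplies are left out because the slide gives no fixed figure for them, and result latencies (interlocks) are not modeled.

```c
/* Instruction classes from the timing rules (names are illustrative). */
typedef enum {
    OP_ALU,            /* add, subtract, logical ops, shift by immediate */
    OP_ALU_REG_SHIFT,  /* ALU op with a register-specified shift */
    OP_LOAD_WORDS,     /* LDR or LDM loading n 32-bit words */
    OP_LOAD_NARROW,    /* 8-bit or 16-bit loads such as LDRB */
    OP_BRANCH,         /* branch instructions */
    OP_STORE,          /* STR, or STM storing more than one value */
    OP_STM_SINGLE      /* STM of exactly one value (the exception) */
} OpClass;

/*
 * Returns the number of cycles the instruction takes to issue.
 * n is the number of words transferred (ignored where it does not apply).
 * condition_met is zero when a conditional instruction fails its condition.
 */
int issue_cycles(OpClass op, int n, int condition_met)
{
    if (!condition_met)
        return 1;               /* a failed condition always costs one cycle */

    switch (op) {
    case OP_ALU:           return 1;
    case OP_ALU_REG_SHIFT: return 2;
    case OP_LOAD_WORDS:    return n;
    case OP_LOAD_NARROW:   return 1;
    case OP_BRANCH:        return 3;
    case OP_STORE:         return n;
    case OP_STM_SINGLE:    return 2;
    }
    return -1;                  /* unknown class */
}
```

For example, issue_cycles(OP_BRANCH, 0, 1) gives 3 for a taken branch, while issue_cycles(OP_BRANCH, 0, 0) gives 1 for a conditional branch whose condition fails.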
Pipeline Interlock…
If an instruction requires the result of a previous instruction that is not yet
available, the processor stalls. This is called a pipeline hazard or pipeline interlock.
Example with no interlock:
    ADD r0, r0, r1
    ADD r0, r0, r2
This instruction pair takes two cycles.
The ALU calculates r0 + r1 in one cycle, so the result is available for the
ALU to calculate r0 + r2 in the second cycle.
Pipeline Interlock…
Example with a one-cycle interlock:
    LDR r1, [r2, #4]
    ADD r0, r0, r1
This instruction pair takes three cycles.
In the first cycle the ALU calculates the address r2 + 4 while the ADD
instruction is decoded in parallel.
The ADD cannot proceed on the second cycle because the load instruction
has not yet loaded the value of r1.
Pipeline Interlock…
Pipeline   Fetch   Decode   ALU   LS1    LS2
Cycle 1    …       ADD      LDR   …
Cycle 2    …       …        ADD   LDR    …
Cycle 3    …       …        ADD   ----   LDR

The processor executes the ADD in the ALU on the third cycle.
• The processor stalls the ADD instruction for one cycle in the ALU stage of the
pipeline while the load instruction completes the LS1 stage.
• Because the LDR instruction proceeds down the pipeline while the ADD instruction
is stalled, a gap opens up between them. This gap is sometimes called a pipeline bubble.
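The two examples can be reproduced with a very small in-order timing model. This is a sketch under stated assumptions, not a description of the real hardware: each instruction occupies the ALU stage for its issue cycles, and a result becomes usable a fixed number of extra cycles later (0 for ALU results, 1 for a word load). The type Insn and the functions count_cycles, no_interlock_pair, and load_use_pair are hypothetical names.

```c
#include <stddef.h>

/* One instruction in the simplified model. Registers are r0..r15. */
typedef struct {
    int dest;     /* register written, or -1 if none */
    int src1;     /* first register read, or -1 if none */
    int src2;     /* second register read, or -1 if none */
    int issue;    /* cycles spent issuing the instruction */
    int latency;  /* extra cycles before dest is usable (assumed values) */
} Insn;

/* Total cycles for a straight-line sequence, including interlock stalls. */
int count_cycles(const Insn *code, size_t n)
{
    int ready[16] = {0};  /* cycle count at which each register is usable */
    int cycle = 0;        /* cycles elapsed so far */

    for (size_t i = 0; i < n; i++) {
        int start = cycle;

        /* Stall until every source register has its value. */
        if (code[i].src1 >= 0 && ready[code[i].src1] > start)
            start = ready[code[i].src1];
        if (code[i].src2 >= 0 && ready[code[i].src2] > start)
            start = ready[code[i].src2];

        cycle = start + code[i].issue;
        if (code[i].dest >= 0)
            ready[code[i].dest] = cycle + code[i].latency;
    }
    return cycle;
}

/* ADD r0, r0, r1 ; ADD r0, r0, r2 */
int no_interlock_pair(void)
{
    const Insn code[] = {
        { 0, 0, 1, 1, 0 },
        { 0, 0, 2, 1, 0 },
    };
    return count_cycles(code, 2);
}

/* LDR r1, [r2, #4] ; ADD r0, r0, r1 */
int load_use_pair(void)
{
    const Insn code[] = {
        { 1, 2, -1, 1, 1 },  /* word load: result ready one cycle late */
        { 0, 0, 1, 1, 0 },
    };
    return count_cycles(code, 2);
}
```

Under these assumptions no_interlock_pair() yields two cycles and load_use_pair() yields three, matching the two slides. The model ignores base-register write-back and anything beyond a single destination register.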
Why does a branch instruction take three cycles?
        MOV r1, #1
        B   case1
        AND r0, r0, r1
        EOR r2, r2, r3
        ...
case1
        SUB r0, r0, r1
The MOV instruction executes on the first cycle.
On the second cycle, the branch instruction calculates the destination address.
This causes the core to flush the pipeline and refill it using the new pc value.
The refill takes two cycles. Finally, the SUB instruction executes normally.
Why does a branch instruction take three cycles?
Pipeline   Fetch   Decode   ALU    LS1    LS2
Cycle 1    AND     B        MOV    …
Cycle 2    EOR     AND      B      MOV    …
Cycle 3    SUB     ----     ----   B      MOV
Cycle 4    …       SUB      ----   ----   B
Cycle 5    …       …        SUB    ----   ----

• The pipeline drops the two instructions following the branch (AND and EOR)
when the branch is taken.
Scheduling of load instructions…
Load instructions occur frequently in compiled code, accounting for
approximately one third of all instructions.
Careful scheduling of load instructions so that pipeline stalls don’t occur can
improve performance.
The compiler attempts to schedule the code as best it can, but the aliasing
problems of C limit the available optimizations.
The compiler cannot move a load instruction before a store instruction
unless it is certain that the two pointers used do not point to the same
address.
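A short C sketch shows why possible aliasing blocks the reordering. The function names add_twice and add_twice_noalias are hypothetical; restrict is the standard C99 qualifier by which the programmer promises that the two pointers do not overlap.

```c
/*
 * x and y may point to the same int. After the first store to *x the
 * compiler must load *y again, because the store may have changed it,
 * so the second load cannot be moved above the first store.
 */
void add_twice(int *x, const int *y)
{
    *x += *y;
    *x += *y;
}

/*
 * With restrict the caller guarantees x and y never alias, so the
 * compiler is free to load *y once, early, and reuse the value.
 * Calling this with x == y is undefined behavior.
 */
void add_twice_noalias(int *restrict x, const int *restrict y)
{
    *x += *y;
    *x += *y;
}
```

When the pointers are distinct and both values start at 1, either function leaves 3 in *x. When add_twice is called with x and y pointing to the same variable holding 1, the result is 4, which is exactly the case the compiler has to preserve.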
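C PROGRAM:
void str_tolower(char *out, char *in)
{
    unsigned int c;

    do
    {
        c = *(in++);                /* load the next character */
        if (c >= 'A' && c <= 'Z')
        {
            c = c + ('a' - 'A');    /* convert upper case to lower case */
        }
        *(out++) = (char)c;         /* store the result, including the final 0 */
    }
    while (c);
}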
Scheduling of load instructions…
[Figure: *in points to the string "ABCD\0" (character codes 65 66 67 68);
*out receives "abcd\0" (character codes 97 98 99 100).]
Scheduling of load instructions…
str_tolower
        LDRB    r2,[r1],#1      ; c = *(in++)
        SUB     r3,r2,#0x41     ; r3 = c - 'A'
        CMP     r3,#0x19        ; if (c - 'A' <= 'Z' - 'A')
        ADDLS   r2,r2,#0x20     ;   c += 'a' - 'A'
        STRB    r2,[r0],#1      ; *(out++) = (char)c
        CMP     r2,#0           ; if (c != 0)
        BNE     str_tolower     ;   goto str_tolower
        MOV     pc,r14          ; return
The SUB instruction uses the value of c directly after the LDRB instruction
that loads c, so the ARM pipeline stalls for two cycles.
Overall, each pass through the loop (one character) takes 11 cycles.
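The 11-cycle figure can be checked by tallying the loop one instruction at a time. The sum below is a sketch that assumes the timings from the earlier slides: one cycle each for the byte load, the ALU operations, and the single-byte store, a two-cycle load-use stall, and three cycles for the taken branch.

```latex
\underbrace{1}_{\text{LDRB}}
+ \underbrace{2}_{\text{stall}}
+ \underbrace{1}_{\text{SUB}}
+ \underbrace{1}_{\text{CMP}}
+ \underbrace{1}_{\text{ADDLS}}
+ \underbrace{1}_{\text{STRB}}
+ \underbrace{1}_{\text{CMP}}
+ \underbrace{3}_{\text{BNE}}
= 11 \text{ cycles per character}
```

The ADDLS counts as one cycle whether or not its condition is met, so the total does not depend on the character being converted.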
• There are two ways to restructure the algorithm in assembly to avoid the
stall cycles: load scheduling by preloading and load scheduling by unrolling.
Load scheduling by preloading…
Preloading loads the data required for a loop iteration at the end of the
previous iteration, rather than at the beginning of the current one.
Since iteration i loads the data for iteration i+1, the first and last
iterations always need special handling. For the first iteration, insert an
extra load before the loop. For the last iteration, take care not to read any
data; conditional execution is an effective way to do this.
This reduces the loop from 11 cycles per character to 9 cycles per character
on an ARM, a speedup of 11/9 ≈ 1.22 times.
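The loop shape that preloading produces can be sketched in C. This is not the assembly version, and a C compiler is free to reorder the loads itself, so the sketch only illustrates the structure: one extra load before the loop, and a conditional load at the end of each iteration. The function name str_tolower_preload and the flag more are hypothetical.

```c
/* C-level sketch of the preloaded loop structure. */
void str_tolower_preload(char *out, const char *in)
{
    /* Extra load before the loop: fetches the data for the first iteration. */
    unsigned int c = (unsigned char)*in++;
    int more;

    do {
        /* Unsigned compare: true only for 'A'..'Z'. */
        if (c - 'A' <= (unsigned int)('Z' - 'A'))
            c += 'a' - 'A';

        *out++ = (char)c;

        /* Preload for the next iteration, skipped after the terminator
           so that nothing is read past the end of the string. */
        more = (c != 0);
        if (more)
            c = (unsigned char)*in++;
    } while (more);
}
```

The conditional load plays the role that conditional execution plays in assembly: the load for iteration i+1 is issued at the end of iteration i, so the loop-back branch sits between the load and the first use of its result.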
Load scheduling by unrolling…
Unroll and interleave the body of the loop.
For example, we can process three iterations together.
When the result of an operation from iteration i is not ready, we can perform
an operation from iteration i+1 that does not depend on it.
This loop is the most efficient implementation we've looked at so far. It
requires seven cycles per character on ARM, a speedup of 11/7 ≈ 1.57 times
over the original str_tolower.
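A C-level sketch of the unrolled structure, three characters per pass, is shown below. It is not the assembly version and only illustrates the idea of issuing all three loads before any result is used. It rests on one loudly stated assumption: the input buffer must be readable for up to two bytes past the terminating 0, because all three loads happen before the terminator is checked. The names str_tolower_unrolled and lower1 are hypothetical.

```c
/* Convert one character code to lower case (unsigned range check). */
static unsigned int lower1(unsigned int c)
{
    return (c - 'A' <= (unsigned int)('Z' - 'A')) ? c + ('a' - 'A') : c;
}

/*
 * Sketch of the loop unrolled three times.
 * ASSUMPTION: in[] is readable for two bytes beyond its terminating 0.
 * Nothing is written to out[] beyond the terminator.
 */
void str_tolower_unrolled(char *out, const char *in)
{
    unsigned int c0, c1, c2;

    do {
        /* Issue the three loads together, before using any result. */
        c0 = (unsigned char)in[0];
        c1 = (unsigned char)in[1];
        c2 = (unsigned char)in[2];
        in += 3;

        /* Three independent conversions that can be interleaved. */
        c0 = lower1(c0);
        c1 = lower1(c1);
        c2 = lower1(c2);

        /* Store in order, stopping as soon as the terminator is written. */
        *out++ = (char)c0;
        if (c0 == 0)
            break;
        *out++ = (char)c1;
        if (c1 == 0)
            break;
        *out++ = (char)c2;
    } while (c2 != 0);
}
```

A caller can satisfy the read assumption by keeping the string in an array with at least two spare bytes after the terminator, for example char in[8] = "ABCD".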
Load scheduling by unrolling…
• More than doubles the code size
• Only efficient for large data sizes