The document discusses pipeline hazards including structural, data, and control hazards. It provides details on how each hazard can occur in a 5-stage pipeline and techniques to resolve them, including forwarding, stalling, and compiler scheduling. Data hazards are classified as RAW, WAW, and WAR. Control hazards from branches are reduced by computing the branch target and outcome earlier in the ID phase to minimize stalls.
This document outlines a presentation on pipelining and data hazards in microprocessors. It begins with rules for participant questions and outlines the topics to be covered: what is pipelining, types of pipeline hazards, data hazards and their types, and solutions to data hazards. It then defines pipelining as executing subsequent instructions before prior ones complete. The hazard types covered are control, data, and structural hazards. Data hazards occur if an instruction uses a value before it is ready, and their types are RAW, WAR, and WAW. Solutions involve forwarding newer register values to bypass stale values in the pipeline and prevent hazards.
The document summarizes different types of pipeline hazards that can occur in a processor pipeline: structural hazards which occur due to limited hardware resources and prevent certain combinations of instructions from executing simultaneously; data hazards which occur when instructions depend on results of previous instructions in a way exposed by pipelining; and control hazards which occur due to pipelining of branches whose target may not be known until later in the pipeline. It describes techniques for handling these hazards such as forwarding, stalling, and instruction scheduling to minimize performance impacts.
RAR (Read After Read) is not considered a data hazard because it does not change the order of memory accesses or introduce incorrect results. Multiple instructions can safely read the same register without interfering with each other. The three types of data hazards that can occur are RAW (Read After Write), WAR (Write After Read), and WAW (Write After Write) which all involve write operations that could potentially overwrite data before it is read.
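The RAW/WAW/WAR taxonomy above can be sketched in code. The following Python fragment is an illustrative model only; the instruction representation (sets of registers read and written) and the register names are assumptions made for this sketch, not taken from the document.

```python
# An illustrative classifier for the data-hazard taxonomy; the
# instruction representation (sets of registers read and written)
# is an assumption made for this sketch.

def classify_hazard(first, second):
    """Data-hazard type between two instructions issued in this order."""
    if first["writes"] & second["reads"]:
        return "RAW"   # second reads a value first has not yet written back
    if first["writes"] & second["writes"]:
        return "WAW"   # both write the same register, so order matters
    if first["reads"] & second["writes"]:
        return "WAR"   # second overwrites a register first still needs
    return None        # reads-only overlap (RAR) or no overlap: no hazard

add_i = {"reads": {"r2", "r3"}, "writes": {"r1"}}
sub_i = {"reads": {"r1", "r4"}, "writes": {"r5"}}
print(classify_hazard(add_i, sub_i))  # RAW: sub_i reads r1 written by add_i
```

Note that a pair overlapping only in reads falls through to `None`, matching the point above that RAR is not a hazard.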
There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated cycle.
There are three classes of hazards:
Structural hazard
Data hazard
Branch hazard
The document discusses instruction pipelining in CPUs. It explains that instruction pipelining achieves greater CPU performance by overlapping the execution of multiple instructions. It describes the different stages in a basic two-stage pipeline as fetch and execute. It then discusses how further dividing the pipeline into more stages, such as six stages for fetch, decode, calculate, fetch operands, execute, and writeback, can provide even higher performance. However, it notes conditional branches can reduce efficiency since the next instruction is unknown until the branch is resolved. Various techniques to handle branches like branch prediction, prefetching the target, and delayed branches are described to improve pipeline performance.
This document discusses instruction pipelining and various techniques used to handle hazards that can occur in pipelined processors. It describes a 6-stage instruction pipeline consisting of fetch, decode, calculate operands, fetch operands, execute, and write-back stages. There are three main types of hazards: resource hazards which occur when instructions need the same processor resource; data hazards which occur when an instruction uses a value not yet ready from a previous instruction; and control hazards which occur when the program flow changes due to a branch. Solutions to handle these hazards include branch prediction, delayed branching, pipeline stalls, and maintaining multiple instruction streams.
Pipelining is a technique where multiple instructions are overlapped during execution by dividing the instruction cycle into stages connected in a pipeline structure. There are two main types of pipelines - instruction pipelines which overlap the fetch, decode, and execute phases of instructions to improve throughput, and arithmetic pipelines used for floating point and fixed point operations. Pipeline conflicts can occur due to timing variations, data hazards, branching, interrupts, or data dependency which reduce the pipeline's performance. The main advantages of pipelining are reduced cycle time, increased throughput, and improved reliability, while the main disadvantages are increased complexity, cost, and instruction latency.
The document discusses parallel processing and pipelining. It defines parallel processing as performing concurrent data processing to achieve faster execution. This can be done by having multiple ALUs that can execute instructions simultaneously. The document then discusses Flynn's classification of computer architectures based on instruction and data streams. It describes single instruction single data (SISD), multiple instruction single data (MISD), and multiple instruction multiple data (MIMD) architectures. The document then defines pipelining as decomposing processes into sub-operations that flow through pipeline stages. It provides examples of arithmetic and instruction pipelines, describing the stages in each.
This document discusses instruction pipelining in computer processors. It begins by defining pipelining and explaining how it works like an assembly line to increase throughput. It then discusses different types of pipelines and introduces the MIPS instruction pipeline as an example. The document goes on to explain different types of pipeline hazards like structural hazards, control hazards, and data hazards. It provides examples of how to detect and resolve these hazards through techniques like forwarding, stalling, predicting, and delayed branching. Key concepts covered include pipeline registers, control signals, forwarding units, and branch prediction buffers.
Pipeline Hazards can be classified into three types: structural hazards caused by hardware resource conflicts, data hazards caused when an instruction depends on the results of a previous instruction, and control hazards from conditional branches. Structural hazards arise from limited hardware resources like register files and memory ports. Data hazards include RAW, WAW, and WAR and are resolved by stalling or forwarding. Forwarding minimizes stalls by directly connecting new values to the next stage.
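The stalls that forwarding saves can be illustrated with a toy model. The timing assumptions below (classic five-stage IF/ID/EX/MEM/WB pipeline, register file written in the first half of a cycle and read in the second, EX-to-EX and MEM-to-EX forwarding paths) are standard textbook values, not taken from the document.

```python
# A toy stall counter for a back-to-back RAW pair in a classic
# five-stage pipeline (IF ID EX MEM WB). Assumes the textbook timing:
# register file written in the first half of WB and read in the
# second half of ID, with EX->EX and MEM->EX forwarding paths.

def stalls_between(producer_kind, forwarding):
    """Bubbles inserted before a dependent instruction can proceed."""
    if forwarding:
        # An ALU result is forwarded EX->EX: no bubble.
        # A load's result exists only after MEM: one bubble (load-use).
        return 1 if producer_kind == "load" else 0
    # Without forwarding, the consumer must wait for the producer's WB:
    # two bubbles, whatever kind of instruction produced the value.
    return 2

print(stalls_between("alu", True),   # 0
      stalls_between("load", True),  # 1
      stalls_between("alu", False))  # 2
```

The load-use case shows why forwarding minimizes but cannot always eliminate stalls: the loaded value simply does not exist before the MEM stage.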
Pipelining is a speed-up technique where multiple instructions are overlapped in execution on a processor. It is an important topic in Computer Architecture.
These slides relate the problem to a real-life scenario to make the concept easier to understand, and show the major inner mechanism.
The document discusses pipelining in computer processors. It describes how pipelining can increase throughput by overlapping the execution of multiple instructions. It discusses the basic pipeline stages for a RISC instruction set, including fetch, decode, execute, memory access, and writeback. It also describes several types of pipeline hazards that can occur, such as structural hazards caused by resource conflicts, data hazards when instructions depend on previous results, and control hazards with branches. Forwarding techniques are presented to help address data hazards.
The document summarizes the RISC pipeline architecture. It discusses the five stages of the classic RISC pipeline: instruction fetch, instruction decode, execute, memory access, and writeback. Each stage is involved in processing one instruction at a time through the pipeline. The instruction fetch stage retrieves instructions from the instruction cache. The decode stage decodes the instruction and computes branch targets. The execute stage performs arithmetic and logical operations. The memory access stage handles data memory access. Finally, the writeback stage writes results back to registers. The document also discusses hazards like structural, data, and control hazards that can occur in pipelines.
The document discusses the structure and function of the central processing unit (CPU). It covers the following key points:
The CPU must fetch, decode, and process instructions, fetching any required data. It uses registers for temporary storage and processing, including general purpose, data, address, and condition code registers. Different CPU designs vary in the number and functions of registers, which are the top level of the memory hierarchy.
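The fetch-decode-process cycle described above can be sketched as a minimal interpreter loop. The three-instruction set (LOAD, ADD, HALT) and the register names below are invented for illustration and are not from the document.

```python
# A minimal fetch-decode-execute loop; the three-instruction set
# (LOAD, ADD, HALT) and the register names are invented for
# illustration and are not from the document.

def run_program(program):
    regs = {"r0": 0, "r1": 0}   # general-purpose registers
    pc = 0                      # program counter
    while True:
        op, *args = program[pc]   # fetch and decode
        pc += 1
        if op == "LOAD":          # execute: load an immediate value
            regs[args[0]] = args[1]
        elif op == "ADD":         # execute: register-register add
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "HALT":
            return regs

prog = [("LOAD", "r0", 2), ("LOAD", "r1", 3),
        ("ADD", "r0", "r0", "r1"), ("HALT",)]
print(run_program(prog)["r0"])  # 5
```

The loop makes the division of labor concrete: the program counter and instruction fetch on one side, registers for temporary storage and processing on the other.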
Interstage buffer B1 feeds the Decode stage with a newly-fetched instruction.
Interstage buffer B2 feeds the Compute stage with the two operands.
Interstage buffer B3 holds the result of the ALU operation.
Interstage buffer B4 feeds the Write stage with a value to be written into the register file.
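The behavior of the interstage buffers B1 through B4 can be modeled as registers that each latch the output of the preceding stage on every clock edge. The toy simulation below is an assumption-laden sketch (the instruction names and cycle model are illustrative), not the document's own design.

```python
# A toy model of the interstage buffers B1..B4 described above: on each
# clock edge every buffer latches the output of the stage before it, so
# several instructions are in flight at once. Instruction names are
# illustrative.

from collections import deque

def simulate(program, cycles):
    buffers = {"B1": None, "B2": None, "B3": None, "B4": None}
    stream = deque(program)
    trace = []
    for _ in range(cycles):
        # Shift from the back so each value advances exactly one stage.
        buffers["B4"] = buffers["B3"]  # ALU result -> Write stage input
        buffers["B3"] = buffers["B2"]  # operands -> ALU result slot
        buffers["B2"] = buffers["B1"]  # fetched instruction -> Compute
        buffers["B1"] = stream.popleft() if stream else None  # new fetch
        trace.append(dict(buffers))
    return trace

trace = simulate(["I1", "I2", "I3"], 4)
print(trace[3]["B4"])  # I1 reaches the Write stage input on cycle 4
```

Updating the buffers back-to-front mirrors how edge-triggered interstage registers all capture their inputs simultaneously.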
The document discusses pipelining in computer processors. It explains that pipelining allows for overlapping execution of multiple instructions to improve processor throughput. An analogy is drawn to an assembly line in laundry - non-pipelined execution is like completing an entire load sequentially, while pipelined is like having different stages of multiple loads occurring in parallel. Pipelining is achieved by breaking instruction execution into discrete stages, such as fetch, decode, execute, memory, and writeback. This allows new instructions to enter the pipeline before previous ones have finished, improving instruction completion rate.
This document discusses pipelining in microprocessors. It describes how pipelining works by dividing instruction processing into stages - fetch, decode, execute, memory, and write back. This allows subsequent instructions to begin processing before previous instructions have finished, improving processor efficiency. The document provides estimated timing for each stage and notes advantages like quicker execution for large programs, while disadvantages include added hardware and potential pipeline hazards disrupting smooth execution. It then gives examples of how four instructions would progress through each stage in a pipelined versus linear fashion.
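The pipelined-versus-linear progression described above reduces to a standard cycle count. The formulas below assume k equal-length stages, one cycle per stage, and no hazards, which is an idealization of the per-stage timing the document estimates.

```python
# Back-of-envelope cycle counts, assuming k equal-length stages, one
# cycle per stage, and no hazards (an idealisation of the timing
# estimates discussed above).

def sequential_cycles(n, k):
    return n * k            # each instruction occupies all k stages alone

def pipelined_cycles(n, k):
    return k + (n - 1)      # fill the pipe, then one completion per cycle

n, k = 4, 5                 # four instructions, five stages
print(sequential_cycles(n, k), pipelined_cycles(n, k))  # 20 8
```

For the four-instruction example above, the pipeline finishes in 8 cycles instead of 20; as n grows the speedup approaches the stage count k.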
This document discusses instruction-level parallelism (ILP), which refers to executing multiple instructions simultaneously in a program. It describes different types of parallel instructions that do not depend on each other, such as at the bit, instruction, loop, and thread levels. The document provides an example to illustrate ILP and explains that compilers and processors aim to maximize ILP. It outlines several ILP techniques used in microarchitecture, including instruction pipelining, superscalar, out-of-order execution, register renaming, speculative execution, and branch prediction. Pipelining and superscalar processing are explained in more detail.
This document discusses the history and characteristics of CISC and RISC architectures. It describes how CISC architectures were developed in the 1950s-1970s to address hardware limitations at the time by allowing instructions to perform multiple operations. RISC architectures emerged in the late 1970s-1980s as hardware improved, focusing on simpler instructions that could be executed faster through pipelining. Common RISC and CISC processors used commercially are also outlined.
The document discusses parallelism and techniques to improve computer performance through parallel execution. It describes instruction level parallelism (ILP) where multiple instructions can be executed simultaneously through techniques like pipelining and superscalar processing. It also discusses processor level parallelism using multiple processors or processor cores to concurrently execute different tasks or threads.
This document discusses instruction pipelining in processors. It begins with an introduction that defines pipelining as breaking down operations into sequential stages that can overlap execution. An example is given of pipelining laundry tasks to complete work more efficiently. The document then explains how instruction execution in computers lends itself to pipelining by separating tasks like fetch, decode, and execute into distinct stages. A six-stage instruction pipeline and timing diagram are presented. Advantages of pipelining include more efficient use of resources and faster execution for large programs. However, pipeline hazards like structural, data, and control hazards can cause problems if not addressed properly.
The document provides an overview of pipelining in computer processors. It discusses how pipelining works by dividing processor operations like fetch, decode, execute, memory, and write-back into discrete stages that can overlap, improving throughput. Key points made include:
- Pipelining allows multiple instructions to be in different stages of completion at the same time, improving instruction throughput.
- The document uses an example of a sequential laundry process versus a pipelined laundry process to illustrate how pipelining improves efficiency.
- It describes the five main stages of a RISC instruction set pipeline - fetch, decode, execute, memory, and write-back - and the work done in and the data passed between each stage.
This document discusses instruction level parallelism and techniques for exploiting it. It covers topics like pipelining, instruction dependencies, hazards, and approaches to overcoming limitations on parallelism both through dynamic scheduling in hardware and through static transformations by compilers. Key limitations to parallelism discussed are branches, dependencies between instructions, and pipeline stalls caused by dependencies. The document provides an overview of these core computer architecture concepts.
This document discusses the implementation of a basic MIPS processor including building the datapath, control implementation, pipelining, and handling hazards. It describes the MIPS instruction set and 5-stage pipeline. The datapath is built from components like registers, ALUs, and adders. Control signals are designed for different instructions. Pipelining is implemented using techniques like forwarding and branch prediction to handle data and control hazards between stages. Exceptions are handled using status registers or vectored interrupts.
This document summarizes the key components and organization of superscalar processor pipelines. It discusses how superscalar processors can execute multiple instructions per cycle by exploiting instruction-level parallelism. The document outlines the major stages in a superscalar pipeline including instruction fetch, decode, dispatch, execution, completion, and retirement. It also discusses limiting factors like structural hazards from resource conflicts, data hazards from dependencies between instructions, and control hazards from branches.
This document provides an introduction to instruction-level parallel (ILP) processors. It discusses how ILP processors improve performance by executing multiple instructions in parallel through techniques like pipelining and superscalar execution. It also covers dependencies between instructions like data dependencies, control dependencies, and resource dependencies that limit parallelism. The document discusses approaches for instruction scheduling used by compilers and processors to detect and resolve dependencies to expose more instruction-level parallelism. It notes that while ILP processors can provide significant speedups for scientific programs, dependencies limit speedups for general-purpose programs to around 2-4 times.
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G... (Databricks)
In this talk, we evaluate training of deep recurrent neural networks with half-precision floats on Pascal and Volta GPUs. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow and CUDA-aware MPI to enable execution across multiple GPU nodes and making use of high-speed interconnects. We introduce a learning rate schedule facilitating neural network convergence at up to O(100) workers.
Strong scaling tests performed on GPU clusters show linear runtime scaling and logarithmic communication time scaling for both single and mixed precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions. Half-precision significantly reduces memory and network bandwidth, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving a comparable test set performance as single precision.
This document discusses general-purpose processors. It begins by introducing general-purpose processors and their basic architecture, which consists of a control unit and datapath that is designed to perform a variety of computation tasks. It then describes the operations of loading, storing, and arithmetic/logical operations that can be performed by the datapath. Subsequent sections provide more details on the control unit and how it sequences operations, instruction cycles, architectural considerations like bit-width and clock frequency, and techniques for improving performance like pipelining and superscalar execution. The document concludes with sections on assembly-level instructions and programmer considerations.
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr... (IDES Editor)
In this paper, we have proposed a novel architectural technique which can be used to boost performance of modern day processors. It is especially useful in certain code constructs like small loops and try-catch blocks. The technique is aimed at improving performance by reducing the number of instructions that need to enter the pipeline itself. We also demonstrate its working in a scalar pipelined soft-core processor developed by us. Lastly, we present how a superscalar microprocessor can take advantage of this technique and increase its performance.
This document discusses instruction-level parallelism (ILP) limitations. It covers ILP background using a MIPS example, hardware models that were studied including register renaming and branch/jump prediction assumptions. A study of ILP limitations found diminishing returns with larger window sizes and realizable processors are limited by complexity and power constraints. Simultaneous multithreading was explored as a technique to improve ILP but has its own design challenges. Today, x86 and ARM processors employ various ILP optimizations within pipeline constraints.
The document provides an overview of ARM basics including:
- The different CPU modes including user, fast interrupt, interrupt, supervisor, abort, undefined, and system modes.
- The banked registers that are used across modes including the program counter, stack pointer, link register, and current program status register.
- How pipelining works to improve processor throughput by dividing instructions into fetch, decode, and execute stages.
- The different types of exceptions including interrupts, undefined instructions, and how the processor switches modes and handles exceptions.
This document summarizes key topics from Chapter 5 of a book on designing embedded systems with PIC microcontrollers, including:
- Visualizing programs with flow diagrams and state diagrams
- Using program branching, subroutines, and delays
- Implementing logical instructions and look-up tables
- Optimizing assembler code and using advanced simulator features like breakpoints and timing measurements
The document discusses instruction scheduling on the ARM9TDMI processor. It describes the ARM9TDMI pipeline and how the timing of instructions depends on dependencies between stages. Two methods for scheduling load instructions are presented: preloading, where data for the next loop iteration is loaded at the end of the current loop, and unrolling the loop to interleave operations from different iterations. Unrolling achieves the best performance of 7 cycles per character compared to 11 cycles without scheduling.
The document discusses parallel processing and pipelining techniques in computer organization. It covers topics like parallel processing concepts and classifications, pipelining concepts and how it increases computational speed, arithmetic and instruction pipelining, handling pipeline hazards like data dependencies and branches. The key advantages of pipelining include decomposing tasks into sequential sub-operations that can complete concurrently, improving throughput and achieving speedup close to the number of pipeline stages when the number of tasks is large.
This document discusses control and configuration models for time-sensitive networking (TSN). It proposes an "all-managed object" (MO) model using a central network controller and management information bases/yang models rather than the alternative "ISIS-MRP" model proposed previously. The all-MO model requires less protocols, processing, and memory usage compared to ISIS-MRP. It also avoids needing to propagate all stream descriptions and paths via intermediate systems in the network. However, both models face scaling challenges with large networks.
Pipelining is a technique used in microprocessors to overlap the execution of multiple instructions by dividing instruction execution into discrete stages. It allows the next instruction to begin executing before the previous one has finished. The pipeline is divided into segments that perform discrete operations concurrently. This improves processor throughput by allowing new instructions to enter the pipeline every clock cycle.
Superscalar and VLIW architectures can exploit instruction-level parallelism (ILP) by processing multiple instructions simultaneously. There are two main approaches: superscalar processors fetch and execute independent instructions in parallel using dependency checking, while very long instruction word (VLIW) architectures rely on compilers to group independent instructions into single long instructions. List scheduling and trace scheduling are algorithms used to schedule instructions for ILP. Trace scheduling works by identifying common code traces and scheduling basic blocks within the trace together.
The document discusses various program control flow instructions used in programmable logic controllers (PLCs), including:
- Jump (JMP) and Label (LBL) instructions allow skipping portions of ladder logic to optimize scan time or create loops.
- Jump to Subroutine (JSR), Subroutine (SBR) and Return (RET) instructions allow executing reusable code blocks called subroutines.
- Master Control Reset (MCR) instructions create zones that reset non-retentive outputs when inactive, reducing scan time.
- Temporary End (TND) and Suspend (SUS) instructions halt or pause ladder logic execution for debugging purposes.
This chapter provides an introduction to Computer networks and covers fundamental topics like data, information to the definition of communication and computer networks.
A digital to analog converter (DAC) accepts a binary input and produces a proportional analog output signal. A 4-bit DAC has 4 digital inputs representing the 4 bits, with the most significant bit (MSB) as d0 and least significant bit (LSB) as d3. The output voltage v0 is plotted against all possible 16 input combinations. An inverted R/2R ladder DAC uses identical resistors and voltage scaling instead of resistor scaling and a common reference used in a binary-weighted resistor DAC. It uses additional series resistors between nodes for voltage dropping. In a 3-bit R/2R ladder DAC, the binary input 001 connects switches to ground or the inverting op
The document discusses Module II of Op-AMPIC 741 and focuses on the effect of input bias current. It examines how input bias current can affect the comparator function of an operational amplifier.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive function. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms.
The document describes a 5-level hierarchical process control system with the bottom three levels being object-oriented for process measurement, supervision, and management. The top two levels integrate conventional databases for plant and corporate management functions. It also discusses using a rule-based system with an interpretation engine that executes rules based on the current working memory state, and a blackboard model with independent knowledge modules that operate within hierarchical levels.
The document discusses desired features of mobile robotics architectures, including supporting deliberate and reactive behaviors, operating with uncertain environments, accounting for potential dangers, and providing flexibility for experimentation and reconfiguration. It proposes two solutions: implicit invocation using exception handling, wiretapping and monitors; and the blackboard approach which uses modules like the captain, map navigator, lookout and pilot to resolve uncertainties on a shared blackboard. The document also includes a case study on cruise control maintaining constant vehicle speed.
The document discusses various architectures for mobile robotics systems. It describes four solutions: 1) A control loop that initiates actions and monitors consequences to adjust plans. 2) A layered architecture with control routines, sensor analysis, world modeling, navigation, and planning layers. 3) An implicit invocation task control architecture using exceptions, wiretapping, and monitors. 4) A blackboard architecture with components interacting via a shared repository to resolve uncertainties.
The document discusses the development of a reusable software architecture for new oscilloscope products. It describes problems with previous approaches that led to little code reuse and custom products. Three solutions are evaluated:
1) An object-oriented approach had drawbacks like no overall model and confusion over object interactions.
2) A layered approach was not a good fit as the layer boundaries conflicted with needed function interactions.
3) A pipes and filters approach modeled oscilloscope functions as incremental data transformers but did not enable user interaction.
The best solution was a modified pipes and filters architecture that formed the basis for the next generation of oscilloscopes.
This document describes 4 solutions to designing a KWIC (Key Word In Context) index system:
1. Main program with subroutines and shared data
2. Abstract data types
3. Implicit invocation
4. Pipes and filters
It evaluates each solution based on 5 criteria: ability to change the overall algorithm, changes to data representation, extending functionality, efficient use of space and time, and reusability of code. The implicit invocation approach supports changing the overall algorithm most flexibly while pipes and filters best supports reusability but is less efficient with space.
Process control architectures aim to maintain specified properties of process outputs at reference values. Open loop systems run without monitoring outputs, while closed loop systems automatically achieve and maintain desired outputs by comparing them to actual outputs via feedback loops. Feed forward loops anticipate future effects by measuring other variables.
The document discusses three architectural styles: layered systems, repositories, and interpreters. Layered systems use components arranged in hierarchical layers with each layer acting as a server to the layer above and client to the layer below. Repositories use a central data structure and independent components that operate on the data. Interpreters take a program written in one language and interpret it to another language using four components: the program, interpretation engine, program data, and engine state.
This document discusses several architectural styles for building software systems. It describes pipes and filters, data abstraction, layered systems, implicit invocation, and repositories. Pipes and filters involve chaining components together where the output of one component feeds the input of the next. Data abstraction focuses on separating an abstraction from its implementation. Layered systems organize functionality into hierarchical layers. Implicit invocation uses events to trigger the execution of registered component procedures. Repositories provide shared data storage across components.
The document discusses the importance of architecture in software systems. It notes that architecture provides a common language for stakeholders, helps make early design decisions, and defines transferable abstractions. It then lists several key benefits of architecture, including constraining implementation, dictating organizational structure, enabling or inhibiting quality attributes, allowing prediction of qualities without full development, and easing management of change. Finally, it discusses how architecture can promote reuse across systems and product lines, composition using external elements, restricted design choices, template-based development, and serve as a basis for training.
The document discusses software architecture and its key elements. It defines software architecture as the structure of a system, including the software elements, their visible properties, and relationships. The document notes that architecture provides the overall structure, divides functionality into pieces with data flow between them, and can be a reference model mapped to software elements and their data flows that implement the model's functionality.
The document discusses the definition of software architecture as comprising the structure of a system including its software elements, externally visible properties, and relationships between elements. It notes that software architecture is influenced by technical, business, and social factors such as immediate business investments, long-term infrastructure investments, strategic investments, and current environmental standards. The architecture affects the developing organization's structure and goals as well as customer requirements for future systems, and the development process can influence the architect's experience on subsequent projects and potentially the software engineering culture. The key steps in architecting a system are outlined as creating the business case, understanding requirements, creating/selecting an architecture, documenting/communicating it, analyzing/evaluating it, implementing based on the architecture
An architecture is very complicated and involves three types of decisions: how the system is structured as code units, how it is structured as runtime components and interactions, and how it relates to non-software elements. The document discusses several common architectural structures, including decomposition, uses, layered, class/generalization, process, concurrency, shared data/repository, client-server, deployment, implementation, and work assignment structures. It also discusses Kruchten's four views of logical, process, development, and physical.
This document discusses what makes a good software architecture and provides guidelines. It states that an architecture should:
1. Be evaluated based on how well it meets its stated purpose and quality attributes, not on being inherently good or bad.
2. Follow basic guidelines such as being well-documented, analyzed for performance, and designed to facilitate incremental implementation.
3. Feature well-defined modules and interfaces, achieve quality attributes using known tactics, be independent of specific tools, and separate data producers from consumers.
Unit 2 contd. and (Unit 3 voice over ppt)
1. Rung-Bin Lin, Chapter 3: Instruction-Level Parallelism and Its Dynamic Exploitation, 3-1
Chapter 3
Instruction-Level Parallelism
and Its Dynamic Exploitation
• Unit 2 contd…
• Unit 3
» Dr Reeja S R
» CSE Dept
» Dayananda Sagar University - SOE
Instruction Level Parallelism: Concepts and
Challenges
• Instruction-level parallelism (ILP)
– The potential of overlapping the execution of multiple
instructions is called instruction-level parallelism.
Techniques to Reduce Pipeline CPI
• Recall,
– Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW
stalls + WAR stalls + WAW stalls +Control stalls.
– Exploiting instruction-level parallelism reduces the number of
stalls.
– How to find ILP:
• Dynamically locate ILP by hardware
• Statically locate ILP by software
– Techniques that affect CPI (fig. 3.1 on page 173).
ILP Within and Across a Basic Block
• ILP within a basic block
– If the branch frequency is 15%~25%, there are only about 4 to 7
instructions within a basic block. This implies that we
must exploit ILP across basic blocks.
• Loop-Level Parallelism (ILP across basic blocks)
– Exploit parallelism among iterations of a loop.
Loop-Level Parallelism
– Parallelism among iterations of a loop.
• Example: for(I=1; I<=100; I++)
X[I]=X[I]+Y[I];
– Each iteration of the loop can overlap with any other iteration in
this example.
– Techniques converting the loop-level parallelism into ILP
• Loop unrolling
• Use of vector instructions (Appendix G)
– LOAD X; LOAD Y; ADD X, Y; STORE X
– Originally used in mainframes and supercomputers.
– Died away due to the effective use of pipelining in desktop and
server processors.
– Seeing a renaissance in graphics, DSP, and multimedia
applications.
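As a concrete illustration of loop unrolling, the loop above can be rewritten in C. This is a hand-written sketch of the transformation, not code from the text; the unroll factor of 4 and the 0-based indexing are arbitrary choices.

```c
#define N 100

/* Original loop: one add per iteration; iterations are independent. */
void add_rolled(double *x, const double *y) {
    for (int i = 0; i < N; i++)
        x[i] = x[i] + y[i];
}

/* Unrolled by 4: four independent adds per iteration give the
 * scheduler more instructions to overlap, at the cost of code size. */
void add_unrolled(double *x, const double *y) {
    int i;
    for (i = 0; i + 3 < N; i += 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }
    for (; i < N; i++)          /* clean-up loop if N % 4 != 0 */
        x[i] = x[i] + y[i];
}
```

The four adds in the unrolled body are mutually independent, so a scheduler can overlap them; the clean-up loop handles trip counts that are not a multiple of the unroll factor.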
Data Dependence and Hazards
• To exploit ILP, instructions must have no dependences
• A dependence indicates the possibility of a hazard,
– Determines the order in which results must be calculated, and
– Sets an upper bound on how much parallelism can possibly be
exploited.
• Overcome the limitation of dependence on ILP by
– Maintaining the dependence but avoiding a hazard,
– Eliminating a dependence by transforming the code.
• Dependence types
– Data dependence
• Creating RAW, WAR, and WAW hazards
– Name dependences
– Control dependences
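The three hazard types created by data dependences can be made concrete with a small classifier in C. This is a sketch under assumed conventions: each instruction names one destination register and up to two sources, with -1 meaning unused, and `hazard(i, j)` assumes i precedes j in program order.

```c
#include <stdbool.h>

/* Registers an instruction writes and reads, e.g. ADD rd, rs, rt. */
typedef struct {
    int dest;      /* register written, or -1 if none */
    int src[2];    /* registers read, -1 if unused    */
} Instr;

static bool reads(const Instr *in, int reg) {
    return reg >= 0 && (in->src[0] == reg || in->src[1] == reg);
}

/* Classify the hazard that the ordering i-before-j must preserve. */
const char *hazard(const Instr *i, const Instr *j) {
    if (reads(j, i->dest))                  return "RAW";  /* true dependence */
    if (i->dest >= 0 && i->dest == j->dest) return "WAW";  /* output dep.     */
    if (reads(i, j->dest))                  return "WAR";  /* antidependence  */
    return "none";
}
```

RAW is checked first because a true dependence dominates: if j both reads and overwrites i's result, the read is what constrains the ordering.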
Name Dependence
– Name dependences
• Occurs when two instructions use the same register or memory
location, called a name, but no data flows between the instructions
through that name.
– Two types of name dependences:
• Antidependence: Occurs when instruction j writes a register or
memory location that instruction i reads and instruction i is
executed first.
• Output dependence: Occurs when instruction i and instruction j
write the same register or memory location.
– Register renaming can be employed to eliminate name
dependences
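Name dependences and their removal by renaming can be seen at source level. The following is a hypothetical C fragment: the variable `t` is a reused name with no data flowing through it between its two uses, and giving the second use a fresh name leaves only true dependences.

```c
/* Name dependences in straight-line code. 't' is reused as a name,
 * so its second write has a WAW dependence with the first write and
 * a WAR dependence with the intervening read. */
double with_name_deps(double a, double b, double c, double d) {
    double t;
    t = a + b;          /* i: writes t              */
    double u = t * c;   /*    reads t (true dep.)   */
    t = c + d;          /* j: writes t again        */
    return u + t;
}

/* After renaming, the second use of 't' gets a fresh name 't2';
 * only the true (RAW) dependences remain, so the two sums can be
 * scheduled in either order or in parallel. */
double renamed(double a, double b, double c, double d) {
    double t1 = a + b;
    double u  = t1 * c;
    double t2 = c + d;
    return u + t2;
}
```

This is exactly what hardware register renaming does dynamically: each write gets a fresh physical register, so only data flow constrains the schedule.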
Control Dependence
• A control dependence determines the ordering of an
instruction with respect to a branch instruction.
– Example: S1 is control dependent on p1, but not on p2.
if p1 {
S1;
};
if p2 {
S1;
};
Two Constraints Imposed by Control
Dependences
– An instruction that is control dependent on a branch
cannot be moved before the branch so that its execution is
no longer controlled by the branch.
– An instruction that is not control dependent on a branch
cannot be moved after the branch so that its execution is
controlled by the branch.
How the Simple Pipeline in Appendix A
Preserves Control Dependence
– Instructions execute in order.
– Detection of control or branch hazards ensures that an
instruction that is control dependent on a branch is not
executed until the branch direction is known.
Can We Violate Control Dependence?
• Yes, we can
– If we can ensure that violating a control dependence does not make
the program incorrect, control dependence is not a critical
property that must be preserved.
– What must be preserved are the exception behavior and the data flow,
the two properties critical to program correctness; maintaining data
and control dependences normally preserves both.
Preserving Exception Behavior
– Preserving the exception behavior means that any changes
in the ordering of instruction execution must not change
how exceptions are raised in the program.
• Often this is relaxed to mean that the reordering of instruction
execution must not cause any new exceptions in the program.
• Example
DADDU R2, R3, R4
BEQZ R2, L1
LW R1, 0(R2)
L1: …
What happens if LW is moved before BEQZ and a memory
exception occurs while the branch is taken?
Preserving Data Flow
– The actual flows of data among instructions that produce
results and those that consume them must be preserved.
– Branches make data flow dynamic (i.e., a value may come from
multiple points).
– Example
DADDU R1, R2, R3
BEQZ R4, L
DSUBU R1, R5, R6
L: …
OR R7, R1, R8
– “Preserving data flow” means that if the branch is not taken, the value
of R1 computed by DSUBU is used by OR; otherwise, the value
of R1 computed by DADDU is used.
Speculation
• Check whether an instruction can be executed in violation of a
control dependence while still preserving the exception behavior
and the data flow.
• Example
DADDU R1, R2, R3
BEQZ R12, skipnext
DSUBU R4, R5, R6
DADDU R5, R4, R9
Skipnext: OR R7, R8, R9
– What about moving DSUBU before BEQZ, given that R4 is not
used in the taken path?
Overcoming Data Hazards with Dynamic
Scheduling
– Basic idea:
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
– SUB.D is stalled, but it is not data dependent on anything.
– The major limitation of the pipeline introduced so far is the
in-order issuing of instructions.
– Allowing SUB.D to execute by dynamically scheduling the
instructions creates out-of-order execution, and thus
out-of-order completion.
Advantages and Problems of Dynamic
Scheduling
– Advantages
• Enable handling of cases where dependences are unknown at
compile time (e.g., when memory references are involved).
• Simplify the compiler.
• Allow code that was compiled with one pipeline in mind to run
efficiently on a different pipeline.
– Problems
• It creates WAR and WAW hazards.
• It complicates exception handling due to out-of-order completion,
creating imprecise exceptions.
– The processor state when an exception is raised does not look
exactly as if the instructions were executed sequentially in strict
program order.
Support Dynamic Scheduling for the Simple
Five-Stage Pipeline
• Divide the ID stage into the following two stages:
– Issue: Decode instructions and check for structural
hazards.
– Read operands: Wait until no data hazards, then read
operands.
Dynamic Scheduling Algorithms
• Algorithms
– Scoreboarding, originated from CDC 6600 (Appendix A).
• Effective when there are sufficient resources and no data
dependences.
– Tomasulo's algorithm, originated from the IBM 360/91.
• Both algorithms can be applied to pipelined or
multiple-functional-unit implementations.
Dynamic Scheduling Using Tomasulo’s
Approach
• Combine key elements of the scoreboarding scheme
with register renaming.
– Track the availability of operands to minimize RAW stalls.
– Use register renaming to minimize WAR and WAW hazards.
Basic Architecture for Tomasulo’s Approach
Basic Ideas
– A reservation station (RS) fetches and buffers an operand
as soon as it is available.
– Pending instructions designate the RS that will provide
their inputs.
– When successive writes to a register appear, only the last
one is actually used to update the register.
– As instructions are issued, the register specifiers for
pending operands are renamed to the names of the RS, i.e.,
register renaming
• The functionality of register renaming is provided by
– The reservation stations (RS), which buffer the operands of
instructions waiting to issue.
– The issue logic
• Since there can be more RSs than real registers, the technique can
eliminate hazards that a compiler could not.
What Does a Reservation Station Actually Hold?
– Instructions that have been issued and are awaiting
execution at a functional unit.
– The operands if available, otherwise, the source of the
operands.
– The information needed to control the execution of the
instruction at the unit.
– The load buffers and store buffers hold data or addresses
coming from and going to memory.
Steps in Tomasulo’s Approach
– Issue……get instruction from instruction queue
• Get an instruction from the floating point queue. If it is a floating point
operation, issue it if there is an empty RS, and send the operands to the
RS if they are in the registers. If it is a load or store, it can be issued if
there is an available buffer. If the hardware resource is not available, the
instruction stalls.
– Execute…..operate on the operands
• If one or more operands are not yet available, monitor the CDB to obtain
the required operands. When both operands are available, the instruction
is executed. This step checks for RAW hazards.
– Write result …..finish execution(WB)
• When the result is available, write it on the CDB and from there into the
registers and any RS waiting for this result.
- Commit…..update register or memory with the ROB result
• When the instruction reaches the head of the ROB and its result is
present, update the register with the result (or store to memory) and
remove the instruction from the ROB.
• If an incorrectly predicted branch reaches the head of the ROB, flush
the ROB and restart at the correct successor of the branch.
• The above steps differ from Scoreboarding in the following
three aspects:
Data Structures
– Data structures used to detect and eliminate hazards are
attached to the RS, the register file, and the load and store
buffers.
• Each structure has a tag field per entry. The tags are essentially
names for an extended set of virtual registers used in renaming.
• In this example, the tag is a four-bit quantity that denotes one of
the five RSs or one of the six load buffers.
Fields in Data Structures
– Each RS has seven fields
• Op: The operation to perform.
• Qj, Qk: The RS that produces the corresponding source operand.
• Vj, Vk: The value of the source operands.
• Busy: Indicates the RS and its corresponding functional unit are
occupied.
• A: Used to hold information for memory address calculation for a
load or store.
– The register file and store buffer each have a field, Qi:
• Qi: The number of the RS that contains the operation whose result
should be stored into this register or into memory.
– The load and store buffers each require a busy field. The
store buffer has a field A, which holds the result of the
effective address.
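The fields above map directly onto a C struct. This is a sketch, with the convention (an assumption here) that a tag of 0 means the operand value is already available; `cdb_broadcast` shows how the write-result step fills in waiting operands.

```c
typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_DIV, OP_LOAD, OP_STORE } Op;

/* Reservation-station entry. A tag of 0 in qj/qk means the operand
 * value is already in vj/vk; a nonzero tag names the RS or load
 * buffer that will produce it. */
typedef struct {
    Op     op;      /* operation to perform                          */
    int    qj, qk;  /* producing RS for each source (0 = available)  */
    double vj, vk;  /* source operand values, valid when q* == 0     */
    int    busy;    /* RS and its functional unit are occupied       */
    long   a;       /* effective-address info for loads/stores       */
} ReservationStation;

/* Register-status entry: qi names the RS whose result will be
 * written to this register (0 = none pending). */
typedef struct {
    int    qi;
    double value;
} RegisterStatus;

/* Write result: when a result appears on the CDB with a given tag,
 * every waiting RS captures the value and clears its tag. */
void cdb_broadcast(ReservationStation *rs, int n, int tag, double value) {
    for (int i = 0; i < n; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].qj == tag) { rs[i].vj = value; rs[i].qj = 0; }
        if (rs[i].qk == tag) { rs[i].vk = value; rs[i].qk = 0; }
    }
}
```

Because every consumer snoops the CDB, the producer's result reaches all waiting stations in one broadcast, which is what makes hazard detection distributed rather than centralized as in scoreboarding.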
Example for Tomasulo’s Approach (1)
• Fig. 3.3 on page 190
Example for Tomasulo’s Approach (2)
• Fig. 3.4 on page 192
Advantage of Tomasulo’s Approach over
Scoreboarding
• The distribution of hazard detection over the RSs
• The elimination of stalls for WAW and WAR
hazards
Tomasulo’s Algorithm: A Loop-Based
Example
– By using reservation stations, a loop can be dynamically
unrolled. Assume the following loop has been issued in
two successive iterations, but none of the floating-point
loads, stores, or operations has completed (fig. 3.6 on page
194).
A Loop-Based Example
Dynamic Disambiguation of Addresses
– If the load address matches a store-buffer address, we
must stop and wait until the store buffer gets its value; we
can then access it or get the value from memory. Otherwise
the load can proceed, which allows the load operation in the
second iteration in fig. 3.6 to complete earlier than the store
operation in the first iteration.
– The key components for enhancing ILP in Tomasulo
algorithm are dynamic scheduling, register renaming and
dynamic memory disambiguation.
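The address check described above can be sketched as a C function over the store buffers. Two simplifying assumptions in this sketch: only effective addresses are compared, and at most one pending store targets a given address.

```c
#include <stdbool.h>

typedef struct {
    bool   busy;       /* buffer holds a pending store           */
    long   addr;       /* effective address of the pending store */
    bool   has_value;  /* store data has arrived                 */
    double value;
} StoreBuffer;

/* A load may bypass earlier stores only if no busy store buffer holds
 * the same effective address; on a match it must wait for (or forward)
 * the store's value. Returns true if the load can proceed now, writing
 * the loaded value through *out (forwarded or read from memory). */
bool try_load(const StoreBuffer *sb, int n, long addr,
              const double *memory, double *out) {
    for (int i = 0; i < n; i++) {
        if (sb[i].busy && sb[i].addr == addr) {
            if (!sb[i].has_value)
                return false;       /* must stall until data arrives */
            *out = sb[i].value;     /* store-to-load forwarding      */
            return true;
        }
    }
    *out = memory[addr];            /* no conflict: read memory      */
    return true;
}
```

The non-matching case is what lets the second-iteration load in fig. 3.6 slip past the still-pending first-iteration store.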
Reducing Branch Costs with Dynamic
Hardware Prediction
– Dynamic hardware branch prediction
• The prediction will change if the branch changes its behavior
while the program is running.
• The effectiveness of a branch prediction scheme depends
– not only on the accuracy,
– but also on the cost of a branch when the prediction is correct and
when it is incorrect.
• The branch penalties depend on
– the structure of the pipeline,
– the type of predictor, and
– the strategies used for recovering from misprediction.
Basic Branch Prediction and Branch-
Prediction Buffer
– A branch prediction buffer is a small memory indexed by
the lower portion of the address of the branch instruction.
The memory contains bits that say whether the branch
was recently taken or not.
The Simple One-Bit Prediction Scheme
– If a prediction is correct, the prediction bit remains unchanged;
otherwise, it is inverted.
• Example on page 197 (mis-prediction rate 20%)
Correct? Prediction Instruction Taken/untaken
Y(es) T(aken) I1 T(aken)
Y T I2 T
…
Y T I9 T
N T I10 U(ntaken)
N U I11 T
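The 20% figure in this example can be reproduced by simulating the one-bit scheme on a branch that is taken nine times out of ten, as in the table above; the run length of 100 loop passes below is an arbitrary choice for illustration.

```c
/* One-bit predictor: predict the branch's last outcome; on a
 * misprediction, flip the bit to the actual outcome. */
int one_bit_mispredicts(const int *outcomes, int n) {
    int state = 1;                  /* start predicting taken */
    int miss = 0;
    for (int i = 0; i < n; i++) {
        if (state != outcomes[i]) { /* misprediction */
            miss++;
            state = outcomes[i];    /* flip the prediction bit */
        }
    }
    return miss;
}
```

In steady state the predictor misses twice per pass through the loop, once on loop exit and once on re-entry, giving roughly a 20% misprediction rate even though the branch is taken 90% of the time.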
Two-Bit Branch Prediction Scheme
– A prediction must miss twice before it is changed.
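A two-bit saturating counter implements "must miss twice before changing". This sketch uses the usual 0..3 encoding (an assumption matching common practice), predicting taken at counts of 2 and above; on the same taken-nine-of-ten pattern as the one-bit example it halves the steady-state misses to one per loop pass.

```c
/* Two-bit saturating counter: 0,1 predict not taken; 2,3 predict taken.
 * The counter moves one step toward the actual outcome, so a single
 * anomalous outcome cannot flip a strongly biased prediction. */
int two_bit_mispredicts(const int *outcomes, int n) {
    int counter = 3;                /* start at strongly taken */
    int miss = 0;
    for (int i = 0; i < n; i++) {
        int predict_taken = (counter >= 2);
        if (predict_taken != outcomes[i])
            miss++;
        if (outcomes[i]) { if (counter < 3) counter++; }
        else             { if (counter > 0) counter--; }
    }
    return miss;
}
```

The single not-taken outcome at loop exit drops the counter only from 3 to 2, so the predictor still predicts taken on re-entry and avoids the second miss that the one-bit scheme pays.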
Accuracy of Two-Bit Branch Prediction Buffer (1)
• With 4096 entries
38.
Accuracy of Two-Bit Branch Prediction Buffer (2)
39.
Correlating Branch Predictor (1)
– Basic concept: the behavior of a branch depends on other branches.
if (aa == 2)
aa = 0;
if (bb == 2)
bb = 0;
if (aa != bb) {
• Equivalent code fragment
DSUBI R3, R1, #2
BNEZ R3, L1 ; branch b1 (aa != 2)
DADD R1, R0, R0 ; aa = 0
L1: DSUBI R3, R2, #2
BNEZ R3, L2 ; branch b2 (bb != 2)
DADD R2, R0, R0 ; bb = 0
L2: DSUB R3, R1, R2 ; R3 = aa-bb
BEQZ R3, L3 ; branch b3 (aa == bb)
• Branches b1 and b2 not taken implies b3 will be taken.
40.
Correlating Branch Predictor (2)
• Branch predictors that use the behavior of other branches to
make a prediction are called correlating predictors or two-level
predictors.
– Consider the following simplified code fragment,
if (d == 0)
d = 1;
if (d == 1)
» Equivalent code fragment is
BNEZ R1, L1 ; branch b1 (d != 0)
DADDIU R1, R0, #1
L1: DADDIU R3, R1, #-1
BNEZ R3, L2 ; branch b2 (d != 1)
…
L2:
» If b1 is not taken, b2 will not be taken.
41.
Possible Execution Sequence
• Fig. 3.10 on page 202
42.
If Done by One-Bit Predictor
• Fig. 3.11 on page 202
43.
One-Bit Predictor with One-Bit Correlation, i.e.,
(1,1) Predictor
• The first bit is the prediction used if the last branch executed in
the program was not taken; the second bit is the prediction used
if that branch was taken.
• The four possible combinations
– Fig. 3.12 on page 203
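Under the assumptions stated on this slide (class and function names here are mine), the (1,1) predictor can be sketched as a per-branch pair of 1-bit predictions, selected by the outcome of the last branch executed anywhere in the program. On the alternating d = 2, 0, 2, 0, … sequence of the page-202 example it settles after the first pass, where a plain 1-bit predictor would miss every branch:

```python
# Sketch (assumption): per-branch pair of 1-bit predictions; the outcome
# of the globally last branch selects which bit is used.
class OneOne:
    def __init__(self):
        self.table = {}   # pc -> [pred if last untaken, pred if last taken]
        self.last = False
    def predict(self, pc):
        return self.table.setdefault(pc, [False, False])[self.last]
    def update(self, pc, taken):
        self.table.setdefault(pc, [False, False])[self.last] = taken
        self.last = taken

def run(pred, d_values):
    misses = 0
    for d in d_values:   # page-202 fragment: b1 tests d != 0, b2 tests d != 1
        for pc, taken in (('b1', d != 0), ('b2', (1 if d == 0 else d) != 1)):
            if pred.predict(pc) != taken:
                misses += 1
            pred.update(pc, taken)
    return misses

print(run(OneOne(), [2, 0] * 10))  # 2: only the very first pass mispredicts
```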
44.
If Done by (1,1) Predictor
• Fig. 3.13 on page 203
45.
General Correlating Predictor
– An (m, n) predictor uses the behavior of the last m branches to
choose from 2^m branch predictors, each of which is an n-bit
predictor for a single branch.
• Examples on page 205.
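A generic (m, n) predictor can be sketched as follows (the table size and simple PC-indexing are my assumptions): an m-bit global history register selects one of 2^m n-bit saturating counters in the entry indexed by the low bits of the branch address:

```python
# Sketch (assumption): (m, n) predictor — m-bit global history picks one
# of 2**m n-bit saturating counters per branch-table entry.
class MN:
    def __init__(self, m, n, entries=4096):
        self.m, self.n = m, n
        self.history = 0
        self.table = [[(1 << n) // 2] * (1 << m) for _ in range(entries)]
    def predict(self, pc):
        return self.table[pc % len(self.table)][self.history] >= (1 << self.n) // 2
    def update(self, pc, taken):
        row = self.table[pc % len(self.table)]
        c = row[self.history]
        row[self.history] = min((1 << self.n) - 1, c + 1) if taken else max(0, c - 1)
        self.history = ((self.history << 1) | taken) & ((1 << self.m) - 1)

p = MN(2, 2)               # the (2,2) predictor of the next slide
for _ in range(8):
    p.update(0x40, True)
print(p.predict(0x40))     # True (repetition drives counters toward "taken")
```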
46.
A (2,2) Predictor
47.
Comparison of Two-Bit Predictors
48.
Tournament Predictors
• Adaptively combining local and global predictors.
• It is the most popular form of multilevel branch
predictor.
• A multilevel branch predictor uses several levels of
branch-prediction tables together with an algorithm
for choosing among the multiple predictors.
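One common realization of the choosing algorithm (a sketch under my own naming, mirroring the state diagram on the next slide) is a 2-bit saturating chooser per entry that moves toward whichever predictor was right when exactly one of the two was:

```python
# Sketch (assumption): 2-bit chooser; >= 2 means "trust the local
# predictor". It only moves when exactly one predictor was correct.
def choose(chooser, local_pred, global_pred):
    return local_pred if chooser >= 2 else global_pred

def update_chooser(chooser, local_ok, global_ok):
    if local_ok and not global_ok:
        return min(3, chooser + 1)    # local proved better: move toward it
    if global_ok and not local_ok:
        return max(0, chooser - 1)    # global proved better
    return chooser                    # 0/0 or 1/1: both wrong or both right

print(update_chooser(2, True, False))  # 3: now strongly prefers local
```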
49.
State Transition Diagram of a Tournament
Predictor
• 0/0: predictor1 is wrong/predictor2 is wrong
• 1/1: predictor1 is correct/predictor2 is correct
50.
The Fraction of Predictions Done by Local
Predictor
51.
Mis-prediction Rate for Three Predictors
52.
Integrated Instruction Fetch Units
• Perform the following functions
– Integrated branch prediction
– Instruction pre-fetch
– Instruction memory access and buffering
53.
Hardware-Based Speculation (Unit 2)
– Why?
• To resolve control dependence to increase ILP
– Overcoming control dependence is done by
• Speculating on the outcome of branches and executing the
program as if our guess is correct.
– Three key ideas combined in hardware-based speculation:
• Dynamic branch prediction,
• Speculative execution, and
• Dynamic scheduling.
– Instruction commit
• When an instruction is no longer speculative, we allow it to update
the register file or memory
– The key idea behind implementing speculations is to allow
instructions to execute out of order but to force them to
commit in order.
54.
Extend the Tomasulo’s Approach to Support
Speculation
– Separate the bypassing of results among instructions from
the actual completion of an instruction.
• By doing so, the result of an instruction can be used by other
instructions without allowing the instruction to perform any
irrecoverable update until the instruction is no longer speculative.
– A reorder buffer is employed to pass results among
instructions that may be speculated.
• The reorder buffer holds the results of an instruction between the
time the operation associated with the instruction completes and
the time the instruction commits.
• The store buffers in the original Tomasulo’s algorithm are
integrated into the reorder buffer.
• The renaming function of the reservation stations (RSs) is replaced
by the reorder buffer. Thus, a result is usually tagged with the
reorder buffer entry number.
55.
Data Structure for Reorder Buffer
– Each entry in the reorder buffer contains four fields:
• The instruction type field indicates whether the instruction is a
branch, a store, or a register operation.
• The destination field supplies the register number or the memory
address where the instruction result should be written.
• The value field holds the value of the instruction result
until the instruction commits.
• The busy (ready) field indicates whether the result is available yet.
56.
The Four Steps in Instruction Execution (1)
– Issue (dispatch)
• Issue an FP instruction if there is an empty RS and an empty slot
in the reorder buffer, send the operands to the RS if they are in
the registers or the reorder buffer, and update the control entries
to indicate the buffers are in use.
• The number of the reorder buffer allocated for the result is also
sent to the RS for tagging the result sent on the CDB.
• If either all RSs are full or the reorder buffer is full, instruction
issue is stalled.
– Execute
• The CDB is monitored for not-yet-available operands. When both
operands are available at a RS, execute the operation. This step
checks for RAW hazards.
57.
The Four Steps in Instruction Execution (2)
– Write result
• When the result is available, write it on the CDB and then into
the reorder buffer and to any RSs waiting for this result. Mark
the RS as available.
– Commit
• When an instruction, other than a branch with incorrect
prediction, reaches the head of the reorder buffer and its result is
in the buffer, update the register with the result.
• When a branch with incorrect prediction, indicating a wrong
speculation, reaches the head of the reorder buffer, the reorder
buffer is flushed and execution is restarted at the correct successor
of the branch. If the branch was correctly predicted, the branch is
finished.
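The commit step can be sketched as follows (entry fields mirror slide 55; the `mispredicted`/`target` fields, names, and flush behavior are illustrative assumptions, not the book's code):

```python
# Sketch (assumption): in-order commit at the head of the reorder buffer.
from collections import deque
from dataclasses import dataclass

@dataclass
class ROBEntry:
    itype: str            # 'branch', 'store', or 'register'
    dest: object = None   # register number or memory address
    value: object = None  # result, once available
    ready: bool = False
    mispredicted: bool = False
    target: int = 0       # correct successor PC for a mispredicted branch

def commit(rob, regs, mem):
    while rob and rob[0].ready:        # head must be ready: in-order commit
        head = rob.popleft()
        if head.itype == 'branch' and head.mispredicted:
            rob.clear()                # wrong speculation: flush everything
            return head.target         # restart fetch at the correct successor
        if head.itype == 'register':
            regs[head.dest] = head.value
        elif head.itype == 'store':
            mem[head.dest] = head.value
    return None                        # no redirect needed

rob = deque([ROBEntry('register', dest=1, value=42, ready=True),
             ROBEntry('branch', ready=True, mispredicted=True, target=0x200),
             ROBEntry('register', dest=2, value=7, ready=True)])
regs, mem = {}, {}
redirect = commit(rob, regs, mem)
print(redirect, regs)  # 512 {1: 42} — R2, behind the branch, never commits
```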
58.
The Architecture with Speculation
59.
Exception Handling
• Exceptions are handled by not recognizing the
exception until the instruction that caused it is ready to commit.
Thus, maintaining precise exceptions is straightforward.
60.
Example for Hardware-Based Speculation
• Fig. 3.30 on page 230
61.
Comparison of with and without Speculation (1)
• Fig. 3.33 on page 236
62.
Comparison of with and without Speculation (2)
• Fig. 3.34 on page 237
63.
Studies of the Limitations of ILP
• Ideal hardware model
– With infinite number of physical registers for renaming
– With perfect branch prediction
– With perfect jump prediction
– Perfect memory-address alias analysis
• All memory addresses are known exactly and a load can be moved
before a store provided that the addresses are not identical.
– Enough functional units
• The ILP limitation in the ideal hardware model is
due to data dependence.
64.
Unit 3
ILP-2
Exploiting ILP Using Multiple Issue
and Static Scheduling
65.
Taking Advantage of More ILP with Multiple
Issues
– How can we reduce the CPI to less than one?
• Multiple issues
– Allow multiple instructions to issue in a clock cycle.
– Multiple-issue processors:
• Superscalar processor
– Issues a varying number of instructions per clock cycle and may be
either statically scheduled by a compiler or dynamically
scheduled using techniques based on scoreboarding and
Tomasulo’s algorithm.
• Very long instruction word (VLIW) processor
– Issues a fixed number of instructions formatted either as one large
instruction or as a fixed instruction packet. Inherently scheduled
by a compiler.
66.
Statically Versus Dynamically Scheduled
Superscalar Processors
• Statically scheduled superscalar
– Instructions are issued in order and are executed in order
– All pipeline hazards are checked at issue time
– Issues a varying number of instructions per clock cycle
• Dynamically scheduled superscalar
– Allow out-of-order execution.
67.
Five Primary Approaches for Multiple-Issue
Processors
• Fig. 3.23 on page 216
68.
A Statically Scheduled Superscalar MIPS
Processor
– Dual issue: one integer and one floating-point operation
• Some restrictions:
– The second instruction can be issued only if the first instruction can
be issued.
– If the second instruction depends on the first instruction, it cannot
be issued.
• Influence of a load dependency: wastes 3 instruction-issue slots.
– Wastes one instruction-issue slot in the current clock cycle.
– Wastes two instruction-issue slots in the next clock cycle.
• Influence of a branch dependency: wastes 2 or 3 instruction-issue
slots.
– Depends on whether a branch must be the first instruction.
69.
Dual-Issue Superscalar Pipeline in Operation
• Fig. 3.24
70.
Possible Hazards
– Note: the integer operation may be a floating-point load,
move, or store.
– Possible hazards (new):
• Structural hazard
– Occur when an FP move, store, or load is paired with an FP
instruction and the FP register file does not provide enough read or
write ports.
• WAW, and
• RAW hazards
– Dependences among the instructions issued in the same clock cycle.
71.
Exploiting ILP Using Dynamic Scheduling,
Multiple Issue, and Speculation
Multiple Instruction Issue with Dynamic Scheduling
– To support dual issue of instructions
• Separate data structures (reservation stations) for the
integer and floating-point registers are employed (this alone still
prevents issuing an FP move or load together with a dependent FP
instruction).
• Pipeline the issue stage so that it runs at twice the basic
clock rate. The first half issues the move or load, while the second
half issues the floating-point instruction that may depend on it.
72.
A Scenario of Dual Issues
73.
Resource Usage Table
74.
Factors Limiting the Performance of the Dual-Issue
Processor
– Limitations in multiple-issue processors
• Inherent limitations of ILP in programs
– Difficult to find a large number of independent instructions to
keep the FP units fully utilized.
• The amount of overhead per loop iteration is very high
– Two out of five instructions (DADDIU and BNE)
• Control hazard
75.
Advanced Techniques for Instruction
Delivery and Speculation (Unit 3)
• Branch-Target Buffers
– A branch-prediction cache that stores the predicted address for
the next instruction after a branch is called a branch-target buffer
or branch-target cache.
77.
The steps involved in handling an
instruction with a branch-target buffer.
78.
Penalties for Each Individual Situation
• Performance of branch-target buffer (Example on page 211).
• One variation of branch-target buffer:
– Store one or more target instructions instead of the predicted
address.
79.
Return Address Predictor
– Optimizes indirect jumps, especially for procedure calls
and returns
• Problem
– The accuracy of predicting return addresses with a branch-target
buffer can be low if the procedure is called from multiple sites and
the calls from one site are not clustered in time.
• Solution
– Use a stack to buffer the most recent return addresses, pushing a
return address on the stack at a call and popping one off at a
return.
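The solution can be sketched as follows (the depth and the class name are my assumptions; real designs use a small fixed-depth buffer):

```python
# Sketch (assumption): bounded return-address stack; a call pushes its
# return address, a return pops the prediction. Overflow discards the
# oldest entry, so very deep call chains can mispredict on the way out.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []
    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overwrite the oldest entry
        self.stack.append(return_addr)
    def on_return(self):
        return self.stack.pop() if self.stack else None  # None: no prediction

ras = ReturnAddressStack(depth=2)
ras.on_call(0x100)
ras.on_call(0x200)
print(ras.on_return(), ras.on_return())  # 512 256: LIFO order
```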
80.
Prediction Accuracy
81.
The Intel Pentium 4