This document discusses instruction-level parallelism techniques used in modern processors. It describes pipelining, in which the execution of an instruction is broken into stages so that different instructions can occupy different stages concurrently. Its running example is a MIPS processor pipeline with five stages: instruction fetch, decode, execute, memory access, and write back. It also covers the hazards that can occur when instructions execute in parallel, and solutions such as inserting NOP instructions, forwarding operands between pipeline stages, and stalling or flushing the pipeline.
Section 6
Instruction-Level Parallelism

Topics:
• Pipelining
• Superscalar processors
• VLIW architecture

Overview

Modern processors apply techniques for executing several instructions in parallel to enhance their computing power. The potential for executing machine instructions in parallel is called instruction-level parallelism (ILP).

Remember: the execution of one instruction is broken into several steps. This gives two ways to exploit ILP:

Pipelining: Different steps of multiple instructions are executed simultaneously.

Concurrent execution: The same steps of multiple machine instructions may be executed simultaneously. This requires multiple functional units. Techniques: superscalar and VLIW (very long instruction word) architectures.
Pipelining: principle

Principle: The execution of a machine instruction is divided into several steps, called pipeline stages, each taking nearly the same execution time. These stages may be executed in parallel.

Example MIPS: 5 pipeline stages
1. IF: instruction fetch
2. ID: instruction decode and register file read
3. EX: execution / memory address calculation
4. MEM: data memory access
5. WB: result write back
Pipelining: principle

Executing two 5-step instructions (e.g. lw) without pipelining:

Clock cycle:    1  2  3  4  5  6  7  8  9  10
Instruction 1:  S1 S2 S3 S4 S5
Instruction 2:                 S1 S2 S3 S4 S5

Executing 6 instructions using pipelining:

Clock cycle:    1  2  3  4  5  6  7  8  9  10
Instruction 1:  S1 S2 S3 S4 S5
Instruction 2:     S1 S2 S3 S4 S5
Instruction 3:        S1 S2 S3 S4 S5
Instruction 4:           S1 S2 S3 S4 S5
Instruction 5:              S1 S2 S3 S4 S5
Instruction 6:                 S1 S2 S3 S4 S5
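The chart shows why pipelining helps: without pipelining, n instructions of k steps take n · k cycles, while a full pipeline needs only k + (n - 1) cycles. A minimal sketch of this arithmetic (plain Python, function names chosen here for illustration):

    def cycles_unpipelined(n_instructions, n_stages):
        # Each instruction must finish all stages before the next one starts.
        return n_instructions * n_stages

    def cycles_pipelined(n_instructions, n_stages):
        # The first instruction fills the pipeline (n_stages cycles);
        # afterwards one instruction completes per cycle.
        return n_stages + (n_instructions - 1)

    # The 6-instruction, 5-stage example from the chart above:
    print(cycles_unpipelined(6, 5))  # 30
    print(cycles_pipelined(6, 5))    # 10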
Pipelined MIPS datapath

In this chapter we will design a pipelined MIPS datapath for the following instructions: lw, sw, add, sub, and, or, slt, beq.

Situations may occur where two instructions cannot be executed in the pipeline right after each other!

Example: The non-pipelined multi-cycle CPU has a shared ALU for
1. executing arithmetic/logical instructions
2. incrementing the PC

Structural hazard: Two instructions wish to use a certain hardware component in the same clock cycle, leading to a resource conflict.

For RISC instruction sets, structural hazards can often be resolved by additional hardware.
Pipelined MIPS datapath

Additional hardware required:
1. Permit incrementing the PC and executing arithmetic/logical instructions concurrently: use a separate adder for incrementing the PC.
2. Permit reading the next instruction while reading/writing data from/to memory: divide memory into instruction memory and data memory (Harvard architecture).
3. Permit executing an arithmetic/logical instruction (uses the ALU in its 3rd cycle) followed by a branch (calculates its branch target in its 2nd cycle): use a separate adder for branch address calculation.

Duplicating hardware components in general also leads to fewer and/or smaller multiplexers.
MIPS datapath (without pipelining)

[Figure: single-cycle MIPS datapath, executing one instruction per clock cycle. It is divided into the sections IF (instruction fetch), ID (instruction decode / register read), EX (execute / address calculation), MEM (memory access), and WB (write back), and contains the PC, instruction memory, register file, sign-extend unit (16 to 32 bits), ALU, data memory, separate adders for PC + 4 and the branch target (shift left 2), and multiplexers.]
Pipelined MIPS datapath

The pipelined MIPS datapath additionally requires pipeline registers, which:
• store all the data occurring at the end of one pipeline stage that are required as input data in the next stage
• divide the datapath into pipeline stages
• replace the temporary datapath registers of the non-pipelined multi-cycle implementation, e.g. the ALU target register T is replaced by the pipeline register EX/MEM, and the instruction register IR is replaced by the pipeline register IF/ID
Pipelined MIPS datapath

[Figure: pipelined MIPS datapath. The single-cycle datapath is cut into the five stages IF (instruction fetch), ID (instruction decode / register read), EX (execute / address calculation), MEM (memory access), and WB (write back) by the four pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]
Executing an instruction, phase 1: instruction fetch

[Figure: the pipelined datapath with the IF stage highlighted for a load, e.g. lw $t0, 32($s3). The instruction is read from instruction memory at the address in the PC and written into the IF/ID pipeline register.]
Executing an instruction, phase 2: instruction decode

[Figure: the pipelined datapath with the ID stage highlighted for lw $t0, 32($s3). The instruction is decoded, the source registers are read from the register file, the 16-bit offset is sign-extended to 32 bits, and the results are written into the ID/EX pipeline register.]
Executing an instruction, phase 3: execution

[Figure: the pipelined datapath with the EX stage highlighted for lw $t0, 32($s3). The ALU adds the register value and the sign-extended offset to form the effective memory address, which is written into the EX/MEM pipeline register.]
Executing an instruction, phase 4: memory access

[Figure: the pipelined datapath with the MEM stage highlighted for lw $t0, 32($s3). The data memory is read at the calculated address and the read data is written into the MEM/WB pipeline register.]
Executing an instruction, phase 5: write back

[Figure: the pipelined datapath with the WB stage highlighted for lw. The value read from memory is written back to the register file.]

BUG!! The load instruction writes its result into the wrong register: the register number used for the write comes from the instruction that has just been fed into the pipeline!
Revised hardware

Solution: keep the target register number and pass it along to the last stage
⇒ 5 additional bits in each of the last 3 pipeline registers.

[Figure: revised pipelined datapath. The write register number travels through the ID/EX, EX/MEM, and MEM/WB pipeline registers and is fed back from MEM/WB to the register file's write register input.]
Control for the pipelined MIPS processor

General approach: In stage ID, create all control signals that an instruction needs in the subsequent stages (EX, MEM, WB) and store them in the ID/EX pipeline register. Then, in each clock cycle, hand the control signals over to the next stage via the corresponding pipeline registers.

Which signals are required in which stage? We can divide the control signals into 5 groups corresponding to the pipeline stages where they are needed.
Control for the pipelined MIPS processor

1. Instruction fetch: Instruction memory is read and the PC is written in every clock cycle ⇒ no control signals required!
2. Instruction decode / register file read: The same operations are performed in every clock cycle ⇒ no control signals required!
3. Execute / address calculation: ALUOp and ALUSrc (as described in Chapter 5); RegDst (use rd or rt as target).
4. Memory access: MemRead and MemWrite (control the data memory), set by lw/sw; Branch (the PC will be reloaded if the condition is fulfilled), set by beq. PCSrc is determined from Branch and zero (from the ALU; the condition is fulfilled if zero is set).
5. Write back: MemtoReg (send either the ALU result or the memory value to the register file); RegWrite (register file write enable).
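As an illustration of this staging, here is a minimal sketch (plain Python; the signal names follow the slides, the dictionary representation is invented for illustration) of how the control signals created in ID ride along in the pipeline registers:

    # Control signals generated in ID, grouped by the stage that consumes them.
    ex_ctrl  = {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0}
    mem_ctrl = {"Branch": 0, "MemRead": 0, "MemWrite": 0}
    wb_ctrl  = {"RegWrite": 1, "MemtoReg": 0}

    # In ID, all three groups are stored in the ID/EX register ...
    id_ex = {"EX": ex_ctrl, "MEM": mem_ctrl, "WB": wb_ctrl}

    # ... in EX, the EX group is consumed and the rest moves to EX/MEM ...
    ex_mem = {"MEM": id_ex["MEM"], "WB": id_ex["WB"]}

    # ... in MEM, the MEM group is consumed and only WB moves on to MEM/WB.
    mem_wb = {"WB": ex_mem["WB"]}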
Pipelined MIPS datapath and control

[Figure: pipelined MIPS datapath with control. A control unit in the ID stage decodes the instruction and writes the control signal groups WB, M, and EX into the ID/EX register; the groups are handed on through EX/MEM and MEM/WB. Shown signals: RegDst, ALUOp, ALUSrc (via the ALU control, using instr. [15-0], [20-16], [15-11]); Branch, MemRead, MemWrite, PCSrc; RegWrite, MemtoReg.]
Example

Consider the following program:

sub $2, $1, $3      # $2 = 23 - 3 = 20
and $12, $2, $5     # $12 = 20 and 7 = 4
or  $13, $6, $2     # $13 = 3 or 20 = 23
add $14, $2, $2     # $14 = 20 + 20 = 40
sw  $15, 100($2)    # store $15 to 100(20)

Assume the following initial register contents:
$1 = 23, $2 = 10, $3 = 3, $5 = 7, $6 = 3
Data dependences and hazards

Consider in the following only data hazards for register-register-type instructions.

[Figure: pipelined execution of the example program with initial values $1 = 23, $2 = 10, $3 = 3, $5 = 7, $6 = 3. sub $2, $1, $3 computes $2 = 23 - 3 = 20, but the register file is only updated in the WB stage. The following instructions therefore read the old value $2 = 10: and $12, $2, $5 yields $12 = 10 and 7 = 2, or $13, $6, $2 yields $13 = 3 or 10 = 11, and add $14, $2, $2 yields $14 = 10 + 10 = 20 instead of the intended results. Data dependence leading to error (hazard)!]
Dependences

Consider an instruction a that precedes an instruction b in program order:

A data dependence between a and b occurs when a writes into a register that will be read by b.

An antidependence between a and b occurs when b writes into a register that is read by a.

An output dependence between a and b occurs when both a and b write into the same register.

A data hazard is created whenever the overlapping (pipelined) execution of a and b would change the order of access to the operands involved in the dependence.
Data hazards

Consider an instruction a that precedes an instruction b in program order. Depending on the type of the dependence between a and b, the following hazards may occur:

RAW (read after write): b reads a source before a writes it, so b incorrectly gets the old value.

WAR (write after read): b writes an operand before it is read by a, so a incorrectly gets the new value.

WAW (write after write): b writes an operand before it is written by a, leaving the wrong result in the target register.

In the following we consider only data hazards for R-type instructions.
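To make the three cases concrete, here is a minimal sketch (plain Python; the instruction representation is invented for illustration) that classifies the hazards possible between two R-type instructions:

    def classify_hazards(a, b):
        """a precedes b; each is (dest_reg, src_regs). Returns possible hazards."""
        a_dst, a_src = a
        b_dst, b_src = b
        hazards = []
        if a_dst in b_src:            # b reads what a writes
            hazards.append("RAW")
        if b_dst in a_src:            # b writes what a reads
            hazards.append("WAR")
        if a_dst == b_dst:            # both write the same register
            hazards.append("WAW")
        return hazards

    # sub $2, $1, $3 followed by and $12, $2, $5:
    print(classify_hazards((2, (1, 3)), (12, (2, 5))))  # ['RAW']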
Software solution for resolving data hazards

The compiler resolves all data hazards:
• Test the machine language program for potential data hazards.
• Eliminate them by inserting NOP instructions (no operation).

Example:

sub $2, $1, $3
nop
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Modern processors are able to detect data hazards during program execution by analyzing the register numbers of the instructions, using additional control logic!
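A minimal sketch of what such a compiler pass might do (plain Python; the instruction representation is invented for illustration, and the fixed distance of 3 matches the example above rather than a per-stage analysis):

    def insert_nops(program, distance=3):
        """program: list of (dest_reg, src_regs). Pad RAW hazards with NOPs."""
        NOP = (None, ())
        out = []
        for dest, srcs in program:
            # Look at the last `distance` emitted instructions, nearest first:
            # if one of them writes a register we read, pad with NOPs until
            # it is `distance` instructions away.
            for i, (d, _) in enumerate(reversed(out[-distance:])):
                if d is not None and d in srcs:
                    out.extend([NOP] * (distance - i))
                    break
            out.append((dest, srcs))
        return out

    # Reproduces the example: three NOPs between sub and and, none elsewhere.
    prog = [(2, (1, 3)), (12, (2, 5)), (13, (6, 2)), (14, (2, 2)), (None, (15, 2))]
    print(insert_nops(prog))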
Hardware solution for resolving data hazards

[Figure: pipelined execution of the example program (sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2)). The value of $2 computed by sub is available in a pipeline register from the end of its EX stage, before any of the following instructions needs it.]

The data required by subsequent instructions already exists in a pipeline register!

Register file: If a register is read and written in the same clock cycle, send the new data to the data output!
MIPS datapath using forwarding

Forwarding: The ALU may read its operands from each of the pipeline registers. The correct operands are selected by multiplexers that are controlled by an additional control unit: the forwarding unit.

The forwarding unit gets as input:
• the register operand numbers of the instruction in the EX stage
• the target register numbers of the instructions in the MEM and WB stages
• control signals indicating the type of the instructions in the MEM and WB stages

Register numbers are stored and moved forward in the pipeline registers.

For reasons of clarity, the hardware structure shown on the following slide has been simplified: the adder for branch target calculation, the ALU input for address calculation, and the address input of the data memory are missing.
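A minimal sketch of the selection logic such a forwarding unit implements for one ALU operand (plain Python; modeled on the textbook formulation for R-type instructions, with invented encodings: 0 selects the register file value, 1 the EX/MEM result, 2 the MEM/WB result):

    def forward_select(rs_ex, ex_mem, mem_wb):
        """Choose the source of one ALU operand for the instruction in EX.
        rs_ex: source register number of the instruction in EX.
        ex_mem / mem_wb: (reg_write, rd) of the instructions in MEM and WB."""
        if ex_mem[0] and ex_mem[1] != 0 and ex_mem[1] == rs_ex:
            return 1          # forward ALU result from EX/MEM (most recent)
        if mem_wb[0] and mem_wb[1] != 0 and mem_wb[1] == rs_ex:
            return 2          # forward write-back value from MEM/WB
        return 0              # no hazard: use the register file value

    # and $12, $2, $5 directly after sub $2, $1, $3: $2 comes from EX/MEM.
    print(forward_select(2, (True, 2), (False, 0)))  # 1

The $0 check matters because register $0 is hard-wired to zero in MIPS and must never be forwarded.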
MIPS datapath using forwarding

[Figure: simplified pipelined MIPS datapath with forwarding, for R-type instructions. Multiplexers in front of the ALU inputs select between the values read from the register file and the contents of the EX/MEM and MEM/WB pipeline registers. The forwarding unit controls these multiplexers by comparing the source register numbers of the instruction in the EX stage with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]
Forwarding

Data hazards may be resolved if the operands to be read by the instruction in the EX stage are already stored in one of the pipeline registers!

Now consider the following program:

lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

The AND instruction requires $2 at the beginning of its 3rd stage (4th cycle). BUT: the value for $2 is stored in a pipeline register only at the end of the 4th stage of LW (4th cycle) ⇒ this hazard cannot be resolved by forwarding.

We have to stall the pipeline for combinations of a load followed by an instruction that reads its result!

Additional hardware for detecting hazards and stalling the pipeline: the hazard detection unit.
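A minimal sketch of the load-use check performed by such a hazard detection unit (plain Python; modeled on the textbook condition, with field names following the pipeline register notation of the slides):

    def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
        """Stall if the instruction in EX is a load (MemRead set) whose target
        register is a source of the instruction currently being decoded."""
        return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

    # lw $2, 20($1) in EX, and $4, $2, $5 in ID -> stall one cycle.
    print(must_stall(True, 2, 2, 5))   # True
    # On a stall: keep the PC and IF/ID unchanged and insert a bubble
    # (all-zero control signals) into ID/EX.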
Illustration

[Figure: pipelined execution of the load-use example over clock cycles CC 1 to CC 9. lw $2, 20($1) delivers its data from memory at the end of CC 4, but and $4, $2, $5 already needs $2 at the beginning of its EX stage in CC 4; forwarding alone cannot bridge this backwards in time.]
Stalling the pipeline

[Figure: the same program with a one-cycle stall (bubble) inserted between lw and and, over clock cycles CC 1 to CC 10. After the bubble, the loaded value can be forwarded to the EX stage of and.]

Stalling the pipeline means repeating all actions from the previous clock cycle in the corresponding stages. The PC and the IF/ID register must be prevented from being overwritten.
Control hazards

Consider the following program:

beq $1, $3, L0      # PC-relative addressing
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
...
L0: lw $4, 50($14)

Efficient pipelining requires that one instruction is fetched in every clock cycle. BUT: which instruction has to be executed after the branch?

Control (or branch) hazard: We start executing instructions before we know whether they are really part of the program flow!
Strategy: assume branch not taken

For every branch we assume that it is not taken, and we begin executing the subsequent instructions at $pc+4, $pc+8, and $pc+12. (The ALU is used to compute the branch address.)

[Figure: pipelined execution of beq $1, $3, 7 followed by and $12, $2, $5, or $13, $6, $2, and add $14, $2, $2 over clock cycles CC 1 to CC 9; the instructions after the branch are fetched and started speculatively.]
Assume branch not taken (continued)

However, if the branch is taken we have to discard all speculatively started instructions from the pipeline!

[Figure: beq $1, $3, 7 turns out to be taken; and $12, $2, $5, or $13, $6, $2, and add $14, $2, $2 are discarded, and lw $4, 50($7) is fetched as the branch target.]

In CC 5, all data calculated in the stages ID, EX, and MEM have to be marked as invalid ⇒ set the control signals for writing memory and the register file to zero.
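A minimal sketch of this flush (plain Python; the pipeline register representation is invented for illustration, while the mechanism of zeroing the write-enable signals follows the slide):

    def flush_on_taken_branch(if_id, id_ex, ex_mem):
        """Turn the speculatively started instructions into bubbles by
        clearing every control signal that could change machine state."""
        for stage_reg in (if_id, id_ex, ex_mem):
            stage_reg["RegWrite"] = 0
            stage_reg["MemWrite"] = 0
        # The next fetch then uses the branch target address instead of PC + 4.

    if_id, id_ex, ex_mem = ({"RegWrite": 1, "MemWrite": 0} for _ in range(3))
    flush_on_taken_branch(if_id, id_ex, ex_mem)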
Reducing the delay of branches

The earlier we know whether a branch will be taken, the fewer instructions need to be flushed from the pipeline!

1. Calculate the branch target address already in the ID stage, using a separate adder.
2. Test the branch condition in the ID stage, using an additional comparator. The comparator is faster than the ALU, so it can be integrated into the ID phase!

⇒ Only one instruction needs to be flushed.

[Figure: an 8-bit comparator built from per-bit equality tests, computing a = b?]
Reducing the delay of branches
Slide 6-34
[Figure: pipelined datapath (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) with branch handling moved into the ID stage. A separate adder computes the branch target from PC+4 and the sign-extended 16-bit offset (instr. [15-0]) shifted left by 2; an equality comparator (= ?) on the two register file read ports tests the branch condition; the PCSrc multiplexer selects between PC+4 and the branch target. Control signals shown: RegWrite, MemWrite, MemRead, MemtoReg, RegDst, ALUOp, ALUSrc.]
Delayed Branches
Slide 6-35
Delayed branching:
No instructions are flushed from the pipeline. The instruction immediately following a branch is always executed.
Programming strategy:
Place an instruction that originally preceded the branch and is not affected by it immediately after the branch (= branch delay slot). For example, an instruction that computes a value used only after the branch and does not influence the branch condition can be moved into the slot.
If no suitable instruction is found, place a NOP there.
Typically the compiler/assembler fills about 50% of all delay slots with useful instructions.
Processors with several functional units
The times required for executing two arithmetic instructions may differ
significantly depending on the type of the instruction:
• Integer addition faster than floating point addition
• Addition much faster than multiplication/division
Making the cycle time long enough so that the slowest instruction can be
executed in one cycle would slow down the processor dramatically!
Slide 6-36
Solution:
• Distribute the EX stage of complex operations over several clock cycles
• Use several functional units in the EX stage
⇒ This allows several instructions to be executed in parallel!
Extending the MIPS pipeline to handle multicycle
floating point operations
MIPS implementation with floating point (FP) instructions (MIPS R4000):
Slide 6-37
• 1 integer unit: used for load/store, integer ALU operations and branches
• 1 multiplier for integer and FP numbers
• 1 adder for FP addition and subtraction
• 1 divider for FP and integer numbers
Extended MIPS pipeline
Slide 6-38
MIPS pipeline with multiple functional units (FUs):
[Figure: the IF and ID stages feed four parallel EX units (integer unit, FP/integer multiplier, FP adder, FP divider), which all lead into the shared MEM and WB stages.]

  FU   | Execution time (cycles) | Structure
  INT  | 1                       | not pipelined
  MUL  | 7                       | pipelined
  ADD  | 4                       | pipelined
  DIV  | 25                      | not pipelined

Out-of-order completion is possible!
Extended MIPS pipeline
Slide 6-39
MIPS pipeline with multiple functional units (FUs):
[Figure: the same pipeline with the FUs shown stage by stage. Integer unit: a single EX stage; FP/integer multiplier: pipeline stages M1–M7; FP adder: pipeline stages A1–A4; FP/integer divider: one non-pipelined DIV stage. All paths share IF, ID, MEM and WB.]
Extended MIPS pipeline
Slide 6-40
Separate register file for storing FP operands:
• FP registers f0 – f31
• FP instructions operate on FP registers
• Integer instructions operate on integer registers
• Exception: FP load/store: address in an integer register, data in an FP register
+ no increase in the number of bits needed for addressing registers
+ simplifies hazard detection
+ integer and FP operands can be read/written at the same time
+ no increase in the complexity of multiplexers/decoders (speed!)
− additional move instructions are necessary for copying data from FP registers to integer registers and vice versa
• FP operands may be 32 or 64 bits wide:
  one 64-bit operand occupies a pair of FP registers (e.g. f0 and f1);
  a 64-bit path from/to memory speeds up double-precision loads/stores
Structural Hazard: functional unit
Slide 6-41
Example: floating point operations (the .d suffix indicates 64-bit floating point operations)

  Cycle                1  2  3   4  5  6  7  8  9  10  11  12
  Div.d $f0,$f2,$f4    IF ID DIV --------------------------------
  Mul.d $f4,$f6,$f4       IF ID  M1 M2 M3 M4 M5 M6 M7 MEM WB
  Div.d $f8,$f8,$f14         IF  ID stall ...
  Add.d $f10,$f4,$f8             IF stall ...

Functional units which are not pipelined and which require more than one clock cycle for execution may create structural hazards!
The second Div.d has to be stalled in the ID stage until the divider becomes free.
Structural Hazard: write back
Slide 6-42
Example:

  Cycle               1  2  3  4  5  6  7  8  9  10  11
  Mul.d $f0,$f4,$f6   IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
  Add $r0,$r2,$r3        IF ID EX MEM WB
  Add $r3,$r0,$r0           IF ID EX MEM WB
  Add.d $f2,$f4,$f6            IF ID A1 A2 A3 A4  MEM WB
  Sw $r3,0($r2)                   IF ID EX MEM WB
  Sw $r0,4($r2)                      IF ID EX MEM WB
  L.d $f2,0($r2)                        IF ID EX  MEM WB

Structural hazard: three instructions (Mul.d, Add.d and L.d) wish to write their results to the FP register file in the same cycle (cycle 11)!
Solution: track the use of the register file write port in the ID stage using a shift register. If a structural hazard would occur, the instruction currently in the ID stage is stalled for one cycle.
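The shift register mentioned in the solution can be sketched in C as follows; HORIZON and the function names are our own assumptions:

  #include <stdbool.h>
  #include <string.h>

  /* One flag per future cycle: wb_reserved[i] means the FP register file
   * write port is already taken i cycles from now. */
  #define HORIZON 16
  static bool wb_reserved[HORIZON];

  /* Called in ID: try to reserve the write port 'latency' cycles ahead.
   * Returns false on a structural hazard, i.e. the instruction must stall. */
  bool reserve_write_port(int latency)
  {
      if (wb_reserved[latency])
          return false;
      wb_reserved[latency] = true;
      return true;
  }

  /* Called at every clock edge: shift the reservation window by one cycle. */
  void clock_tick(void)
  {
      memmove(&wb_reserved[0], &wb_reserved[1], (HORIZON - 1) * sizeof(bool));
      wb_reserved[HORIZON - 1] = false;
  }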
Structural Hazard: write back
Slide 6-43
Example: resolved structural hazard

  Cycle               1  2  3  4  5  6  7  8  9  10  11  12
  Mul.d $f0,$f4,$f6   IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
  Add $r0,$r2,$r3        IF ID EX MEM WB
  Add $r3,$r0,$r0           IF ID EX MEM WB
  Add.d $f2,$f4,$f6            IF ID stall A1 A2 A3 A4 MEM WB
  Sw $r3,0($r2)                   IF stall ID EX MEM WB
  Sw $r0,4($r2)                            IF ID EX MEM WB
  L.d $f2,0($r2)                              IF ID stall EX MEM WB
WAW Hazards
Slide 6-44
Example:

  Cycle               1  2  3  4  5  6  7  8  9  10  11
  Mul.d $f0,$f4,$f6   IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
  Add $r0,$r2,$r3        IF ID EX MEM WB
  Add.d $f0,$f4,$f6         IF ID A1 A2 A3 A4 MEM WB

WAW hazard: Add.d writes f0 (in cycle 10) before Mul.d does (in cycle 11). Out-of-order completion may lead to WAW hazards!
Solution: stall the Add.d instruction in the ID stage:

  Cycle               1  2  3  4  5  6  7  8  9  10  11  12
  Mul.d $f0,$f4,$f6   IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
  Add $r0,$r2,$r3        IF ID EX MEM WB
  Add.d $f0,$f4,$f6         IF ID stall stall A1 A2 A3 A4 MEM WB

⇒ The hazard detection logic detects all hazards in the ID stage and resolves them by stalling the corresponding instruction.
Extended MIPS pipeline
Slide 6-45
Instruction execution:
1. Fetch
2. Decode:
   1. Check for structural hazards: wait until the required FU is not busy, and make sure the register write port will be available when it is needed.
   2. Check for RAW data hazards: wait until the source registers are not listed as destination registers of any instruction in M1–M6, A1–A3 or DIV, or of a load in EX.
      Optimization: e.g., if the division is in its final clock cycle, its result may be forwarded to the requesting FU in the following cycle.
   3. Check for WAW data hazards: determine whether any instruction in A1–A4, M1–M7 or DIV has the same destination register as this instruction. If so, stall the instruction for the necessary number of clock cycles.
      Simplification: since WAW hazards are rare, stall the instruction until no other instruction in the pipeline has the same destination.
3. Execute
4. Memory access
5. Write back
(A condensed sketch of the decode-stage checks follows below.)
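The following C fragment is that sketch. It deliberately ignores the per-stage details (M1–M6 vs. A1–A3 etc.) and simply scans all active instructions; every name in it is hypothetical:

  #include <stdbool.h>

  struct active { bool busy; int dest; };   /* one entry per occupied FU stage */

  /* May the instruction with destination d and sources s1, s2 leave ID? */
  bool may_leave_id(const struct active *a, int n,
                    bool fu_free, bool write_port_free,
                    int d, int s1, int s2)
  {
      if (!fu_free || !write_port_free)         /* structural hazards */
          return false;
      for (int i = 0; i < n; i++) {
          if (!a[i].busy) continue;
          if (a[i].dest == s1 || a[i].dest == s2)
              return false;                     /* RAW hazard */
          if (a[i].dest == d)
              return false;                     /* WAW hazard (simplified rule) */
      }
      return true;
  }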
Dynamic Branch Prediction
"Assume branch not taken" is a crude form of branch prediction; typically it fails in 50% of all cases.
Processors with multiple functional units use deep pipelines, which leads to large branch delays when a branch is predicted the wrong way!
⇒ we need more accurate methods for predicting branches!
Slide 6-46
Idea: dynamic branch prediction, i.e. predict branches using the program's past behaviour.
Branch prediction buffer (or branch history table):
a small memory addressed by the lower bits of the instruction address; it contains a flag indicating whether the branch was taken the last time or not.
This flag is set or reset each time the branch is executed.
Dynamic Branch Prediction
Slide 6-47
For loops the hit rate may be improved by using two bits for branch prediction. A prediction must be wrong twice before it is changed.
[Figure: the 2-bit prediction scheme as a four-state automaton. States 11 and 10 predict taken, states 01 and 00 predict not taken; a correct prediction moves towards (or stays in) the strong state on the same side, a wrong prediction moves one state towards the opposite prediction.]
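The automaton is exactly a 2-bit saturating counter. A minimal C sketch of one entry of the prediction buffer (the variable and function names are ours):

  #include <stdbool.h>

  /* States 3 (11) and 2 (10) predict taken; 1 (01) and 0 (00) predict not taken. */
  static unsigned counter = 3;

  bool predict_taken(void) { return counter >= 2; }

  /* Called once the branch outcome is known. */
  void train(bool taken)
  {
      if (taken && counter < 3)  counter++;
      if (!taken && counter > 0) counter--;
  }

Starting in state 11, a single not-taken outcome only moves the counter to 10, which still predicts taken; only a second misprediction changes the prediction.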
Branch Target Buffer
Slide 6-48
Observation: the target address (calculated from the PC and the offset) of a particular branch remains constant during program execution.
Idea: store the branch target addresses in a lookup table, the branch target buffer:
[Figure: a table holding, per entry, the address of a branch instruction, the corresponding branch target address, and a taken/untaken prediction. The table is read with the PC in parallel to the instruction memory access; a comparator (= ?) checks whether the PC matches the stored branch address and steers the fetch control.]
A branch target buffer in combination with correct branch prediction allows branches to be executed without stalling the pipeline!
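A branch target buffer can be sketched as a small direct-mapped table in C; the size and all field names are our assumptions:

  #include <stdbool.h>
  #include <stdint.h>

  #define BTB_ENTRIES 512

  struct btb_entry {
      uint32_t branch_pc;      /* address of the branch instruction */
      uint32_t target;         /* stored branch target address      */
      bool     predict_taken;  /* prediction bit                    */
      bool     valid;
  };
  static struct btb_entry btb[BTB_ENTRIES];

  /* During IF: on a BTB hit with a taken prediction, fetch from the
   * stored target in the next cycle instead of PC + 4. */
  uint32_t next_pc(uint32_t pc)
  {
      const struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
      if (e->valid && e->branch_pc == pc && e->predict_taken)
          return e->target;
      return pc + 4;
  }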
Dynamic Scheduling
Slide 6-49
Static scheduling:
Execution is started in the order in which the instructions have been fetched (e.g., in the order the compiler has determined).
If a data dependence occurs that cannot be resolved by forwarding, the pipeline is stalled (starting with the instruction that waits for a result). No new instructions are fetched until the dependence is cleared.
Idea: the hardware rearranges instruction execution dynamically to reduce stalls
⇒ Dynamic Scheduling
Dynamic scheduling takes structural hazards and data hazards into consideration!
To prevent an instruction that is stalled by a data hazard from delaying all subsequent instructions, the ID stage is split into two stages:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards remain, then read the operands
This leads to out-of-order execution and out-of-order completion.
Out-of-order execution
Slide 6-50
Out-of-order execution may lead to WAR hazards!
Example: floating point operations

  Div.d $f0, $f2, $f4
  Add.d $f10, $f0, $f8
  Mul.d $f8, $f8, $f14

Add.d needs to be stalled because of the RAW hazard on $f0.
Mul.d may be started,
BUT: if Mul.d completes before Add.d has read its operands, Add.d will read the wrong value in f8!
The control logic deciding when an instruction is executed has to detect and resolve such hazards!
Score Board
Dynamic Scheduling with a Score Board
Slide 6-51
Goal: maintain an execution rate of one instruction per clock cycle by executing each instruction as early as possible.
If an instruction needs to be stalled because of a data hazard, other instructions can still be issued and executed.
⇒ We have to analyze the program flow for hazards!
Scoreboard:
• Detects structural hazards and data hazards
• Determines when an instruction may read its operands and when it may start executing
• Determines when an instruction can write its result into the destination register
Dynamic Scheduling with a Score Board
Slide 6-52
In the following we consider dynamic scheduling only for arithmetic instructions – no MEM access phase is necessary.
4 stages (replacing the ID, EX and WB stages of the standard MIPS pipeline):
1. Issue: If
   • a functional unit (FU) for the instruction is free (resolves structural hazards) and
   • no other active instruction has the same destination register (resolves WAW hazards),
   the scoreboard issues the instruction to the FU and updates its internal data structure.
   If a hazard exists, the issue stage stalls. Subsequent instructions are written into a buffer between instruction fetch and issue; if this buffer is full, the instruction fetch stage stalls.
2. Read operands: When all operands are available, the scoreboard tells the FU to read its operands and to begin execution (this may lead to out-of-order execution).
   A source operand is available when no active instruction issued earlier is going to write it (resolves RAW hazards).
Dynamic Scheduling with a Score Board
Slide 6-53
3. Execution: The FU executes the instruction (this may take several clock cycles). When the result is ready, the FU notifies the scoreboard that it has completed execution.
4. Write result: When an FU announces the completion of an execution, the scoreboard checks for WAR hazards. If no such hazard exists, the result can be written to the destination register. A WAR hazard occurs when there is an instruction preceding the completing instruction that
   • has not read its operands yet, and
   • has one of these operands in the same register as the destination register of the completing instruction.
Scoreboarding does not use forwarding!
If no WAR hazard occurs, the result is written to the destination register during the clock cycle following the execution (we do not have to wait for a statically assigned WB stage that may be several cycles away).
Example
Slide 6-54
MIPS processor with dynamic scheduling using a scoreboard, with the following (not pipelined) functional units in the datapath:
- 1 integer unit: for load/store, integer ALU operations and branches
- 2 multipliers for FP numbers
- 1 adder for FP addition/subtraction
- 1 divider for FP numbers

MIPS program with floating point instructions (64 bit):

  L.d   $f6, 34($r2)
  L.d   $f2, 45($r3)
  Mul.d $f0, $f2, $f4
  Sub.d $f8, $f2, $f6
  Div.d $f10, $f0, $f6
  Add.d $f6, $f8, $f2

Assumptions: the EX phase for double precision takes
  2 cycles for load and add,
  10 cycles for mult,
  40 cycles for div.
MIPS with a Score Board
Slide 6-55
[Figure: datapath with the register file, two FP multipliers, an FP divider, an FP adder and an integer unit connected by data buses. The scoreboard exchanges control/status information with the registers and with all functional units.]
Components of the Score Board
Slide 6-56
The scoreboard consists of three parts containing the following data:
1. Instruction status: indicates for each instruction which of the four steps it is in.
2. FU status: indicates for each FU its state:
   Busy: FU busy or not
   Op: operation to perform (e.g. add or subtract)
   fi: destination register
   fj, fk: source registers
   Qj, Qk: functional units writing the source registers fj and fk
   Rj, Rk: flags indicating whether fj and fk are ready to be read but have not been read yet; they are set to "no" after the operands have been read.
3. Result register status: indicates for each register whether an FU is going to write it, and which FU this will be.
(A data structure sketch follows below.)
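These three parts map directly onto data structures. A C sketch that follows the slide's field names; the array sizes and types are our assumptions:

  /* Scoreboard data structures (sketch) */
  enum step { ISSUE, READ_OPERANDS, EXEC_COMPLETE, WRITE_RESULT };

  struct fu_status {
      int busy;            /* FU busy or not                           */
      int op;              /* operation to perform                     */
      int fi;              /* destination register                     */
      int fj, fk;          /* source registers                         */
      int qj, qk;          /* FUs writing fj and fk (0 = none)         */
      int rj, rk;          /* fj/fk ready to be read but not read yet? */
  };

  struct scoreboard {
      enum step        instr_status[64];  /* 1. per-instruction progress */
      struct fu_status fu[8];             /* 2. per-FU state; FUs are    */
                                          /*    numbered from 1, 0 = none */
      int              result[32];        /* 3. FU that will write each   */
                                          /*    register (0 = none)       */
  };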
Components of the Score Board
Slide 6-57

Instruction status:

  Instruction           Issue  Read operands  Execution complete  Write result
  L.d $f6, 34($r2)      √      √              √                   √
  L.d $f2, 45($r3)      √      √              √
  Mul.d $f0, $f2, $f4   √
  Sub.d $f8, $f2, $f6   √
  Div.d $f10, $f0, $f6  √
  Add.d $f6, $f8, $f2

Functional unit status:

  Name     Busy  Op    fi   fj  fk  Qj       Qk  Rj  Rk
  integer  yes   load  f2   r3      0            no
  mult1    yes   mult  f0   f2  f4  integer  0   no  yes
  mult2    no
  add      yes   sub   f8   f2  f6  integer  0   no  yes
  divide   yes   div   f10  f0  f6  mult1    0   no  yes

Result register status:

       f0     f2       f4  f6  f8   f10     f12  ...  f30
  FU   mult1  integer  0   0   add  divide  0         0

(Double precision floating point numbers ⇒ each number occupies two 32-bit registers.)
Bookkeeping in the Score Board
Slide 6-58
When an instruction has passed through one step, the scoreboard is updated.
Notation:
  FU: FU used by the instruction       fi[FU], fj[FU], fk[FU]: destination/source registers of FU
  d: destination register              Rj[FU], Rk[FU]: are s1, s2 ready?
  s1, s2: source registers             Qj[FU], Qk[FU]: FUs producing s1 and s2
  op: type of operation                Result[d]: FU that will write register d
                                       Op[FU]: operation which FU will execute

  Instruction status | Wait until | Bookkeeping
  Issue | Busy[FU] = no and Result[d] = 0 (no other FU has d as destination register) | Busy[FU] := yes; Op[FU] := op; Result[d] := FU; fi[FU] := d; fj[FU] := s1; fk[FU] := s2; Qj := Result[s1]; Qk := Result[s2]; if Qj = 0 then Rj := yes else Rj := no; if Qk = 0 then Rk := yes else Rk := no
  Read operands | Rj = yes and Rk = yes | Rj := no; Rk := no; Qj := 0; Qk := 0
  Execution | functional unit done |
  Write results | for all FUs f: (fj[f] ≠ fi[FU] or Rj[f] = no) and (fk[f] ≠ fi[FU] or Rk[f] = no) | for all f: if Qj[f] = FU then Rj[f] := yes; for all f: if Qk[f] = FU then Rk[f] := yes; Result[fi[FU]] := 0; Busy[FU] := no
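The first two table rows translate almost literally into code. A sketch using the scoreboard structures from the earlier slide (recall that FUs are numbered from 1 so that 0 can mean "none"):

  /* Issue step: wait-condition and bookkeeping of the first table row. */
  int try_issue(struct scoreboard *sb, int fu, int op, int d, int s1, int s2)
  {
      if (sb->fu[fu].busy || sb->result[d] != 0)
          return 0;                 /* wait: FU busy or WAW hazard        */
      int qj = sb->result[s1];      /* FUs producing s1, s2 (0 = ready);  */
      int qk = sb->result[s2];      /* read before Result[d] is updated   */
      sb->fu[fu].busy = 1;  sb->fu[fu].op = op;
      sb->fu[fu].fi = d;  sb->fu[fu].fj = s1;  sb->fu[fu].fk = s2;
      sb->fu[fu].qj = qj;  sb->fu[fu].qk = qk;
      sb->fu[fu].rj = (qj == 0);  sb->fu[fu].rk = (qk == 0);
      sb->result[d] = fu;
      return 1;
  }

  /* Read operands step: both operands ready, then clear the flags. */
  int try_read_operands(struct scoreboard *sb, int fu)
  {
      if (!sb->fu[fu].rj || !sb->fu[fu].rk)
          return 0;
      sb->fu[fu].rj = 0;  sb->fu[fu].rk = 0;
      sb->fu[fu].qj = 0;  sb->fu[fu].qk = 0;
      return 1;
  }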
Bookkeeping in the Score Board
Slide 6-59
Comment on the step write results:
∀f: (fj[f] ≠ fi[FU] or Rj[f] = no)
"Rj[f] = no" means that the instruction currently active at FU f will not read the current contents of its source register fj:
a) either because the operation has already been executed and currently waits for permission to write, or
b) because the required source operand must still be computed and the instruction is waiting for it.
In the first case register fj may be overwritten, since its previous contents are no longer needed. In the second case it may be overwritten, since this will provide the expected operand.
"Rj[f] = yes", in contrast, means that the instruction active at f still requires the current content of the register specified by fj.
Dynamic Scheduling: Tomasulo's Scheme
Slide 6-60
Are there further possibilities for eliminating stalls resulting from hazards?
RAW hazards: no way – we have to wait until all operands are calculated!
WAR hazards and WAW hazards, however, can be eliminated.
Example:

  div.d $f0, $f2, $f4     # RAW hazard for f0 (add.d reads f0)
  add.d $f6, $f0, $f8
  sub.d $f8, $f10, $f14   # WAR hazard for f8 (add.d still has to read f8)
  mul.d $f6, $f10, $f8    # WAW hazard for f6, RAW hazard for f8

Idea: Register renaming
Rename the destination registers of instructions in a way that prevents instructions executed out of order from overwriting operands still required by other instructions ⇒ Tomasulo's scheme or Tomasulo's algorithm
Observation: the WAR and WAW hazards could also have been avoided by the compiler!
Register renaming
Slide 6-61
Example (continued):
Assume we have two temporary registers S and T.
Replace f6 in add.d by the temporary register S, and
replace f8 in sub.d and mul.d by the temporary register T:

  div.d $f0, $f2, $f4
  add.d $S, $f0, $f8
  sub.d $T, $f10, $f14
  mul.d $f6, $f10, $T

Replace target registers affected by a WAW or a WAR hazard by temporary registers and modify subsequent instructions reading these registers appropriately.
Reservation Stations
Slide 6-62
The temporary registers are part of reservation stations:
• They buffer the operands of instructions waiting for execution. If an operand has not been calculated yet, the reservation station instead contains the number of the reservation station that will deliver the result.
• Register numbers of pending operands are renamed to the names of reservation stations; this is done during instruction issue.
• The information about the availability of the operands stored in a reservation station determines when the corresponding instruction can be executed.
• As results become available, they are sent directly from the reservation stations to the waiting FUs over the common data bus (CDB).
• When successive writes to a register overlap in execution, only the result of the instruction issued last is used to update the register.
⇒ resolves WAR/WAW hazards
Tomasulo's algorithm
Slide 6-63
MIPS floating point unit using Tomasulo's algorithm:
[Figure: an instruction queue (FIFO) fed from the instruction unit issues load/store operations to an address unit with load buffers and store buffers, and FP operations to reservation stations (three in front of the FP adders, two in front of the FP multipliers/dividers). Operand buses connect the FP registers to the reservation stations; all results are broadcast on the common data bus (CDB), which feeds the FP registers, the reservation stations and the store buffers.]
Tomasulo's algorithm – stages
Slide 6-64
Steps in the execution of an FP instruction:
1. Issue:
Get the next instruction from the head of the instruction queue and issue it to a matching reservation station that is empty.
(Load/store buffers, which hold data/addresses coming from and going to memory, behave like the reservation stations of the arithmetic units.)
Are the operands available in the registers?
  yes: hand over the values to the reservation station
  no: hand over the names of the reservation stations that will calculate the values
Buffering operands resolves WAR hazards!
If no matching reservation station is empty, there is a structural hazard
⇒ the instruction stalls until a station is freed.
Tomasulo's algorithm – stages
Slide 6-65
2. Execution:
  1. If one or more of the operands are not available yet, monitor the CDB.
  2. When an operand becomes available, place it in the waiting reservation station(s).
  3. When all operands of an instruction are available, start execution
     ⇒ resolves RAW hazards
  In the case of stores: execution (address calculation) may start even if the data to be stored is not available yet; the address calculation unit is occupied during the address calculation only.
3. Write result:
  1. When the result is available, send it to the CDB.
  2. From the CDB it is sent directly to the waiting reservation stations (and store buffers).
  The result is written to the target register only if the instruction is the last-issued instruction writing to that register ⇒ avoids WAW hazards
Reservation stations
Slide 6-66
Each reservation station has the following fields:
  Op: type of the operation to perform (e.g. add or subtract)
  Qj, Qk: names of the reservation stations containing the instructions that will calculate the operands; zero values indicate that the operands are already available
  Vj, Vk: values of the source operands
  Busy: flag indicating that this station/buffer is already occupied
Each load/store buffer has one additional field:
  A: initially holds the immediate field of the instruction; after address calculation it holds the effective address
For each register of the register file there is one field:
  Qi: name of the reservation station containing the last-issued instruction that calculates the result for this register. A zero value indicates that no active instruction is calculating a result for that register.
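In code form, again following the slide's field names (a sketch; the types and sizes are our assumptions):

  /* One reservation station / load-store buffer entry */
  struct rstation {
      int    busy;    /* station already occupied?                          */
      int    op;      /* operation to perform                               */
      int    qj, qk;  /* producing stations; 0 = operand already in vj / vk */
      double vj, vk;  /* values of the source operands                      */
      int    a;       /* load/store buffers only: immediate, then address   */
  };

  /* Register status: producing reservation station per FP register,
   * 0 = no active instruction writes this register. */
  static int qi[32];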
Tomasulo's method: information tables
Slide 6-67

Instruction status:

  Instruction           Issue  Execute  Write result
  L.d $f6, 34($r2)      √      √        √
  L.d $f2, 45($r3)      √      √
  Mul.d $f0, $f2, $f4   √
  Sub.d $f8, $f2, $f6   √
  Div.d $f10, $f0, $f6  √
  Add.d $f6, $f8, $f2   √

Reservation stations:

  Name   Busy  Op    Vj  Vk                Qj     Qk     A
  load1  no
  load2  yes   load                                      45+Regs[r3]
  add1   yes   sub       Mem[34+Regs[r2]]  load2
  add2   yes   add                         add1   load2
  add3   no
  mult1  yes   mul       Regs[f4]          load2
  mult2  yes   div       Mem[34+Regs[r2]]  mult1

Register status:

  Register:  f0     f2     f4  f6    f8    f10    f12  ...  f30
  Qi:        mult1  load2  0   add2  add1  mult2  0         0
Dynamic Scheduling: Data hazards through memory
Slide 6-68
A load and a store instruction may only be executed in a different order if they access different addresses (RAW/WAR hazard!).
Two stores sharing the same data memory address may not be executed in a different order (WAW hazard!).
Load: read memory only if there is no uncompleted store which was issued earlier and which shares its data memory address with the load.
Store: write data only if there are no uncompleted loads or stores issued earlier that use the same data memory address as the store.
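The load rule can be sketched as a scan over the earlier, still uncompleted stores; the structures and names below are hypothetical:

  #include <stdbool.h>
  #include <stdint.h>

  struct pending_store { bool completed; uint32_t addr; };

  /* A load may read memory only if no earlier uncompleted store
   * uses the same data memory address. */
  bool load_may_proceed(const struct pending_store *earlier, int n,
                        uint32_t load_addr)
  {
      for (int i = 0; i < n; i++)
          if (!earlier[i].completed && earlier[i].addr == load_addr)
              return false;
      return true;
  }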
Dynamic Scheduling: Instructions following branches
Slide 6-69
It may take many clock cycles until we know whether a branch has been predicted correctly or not!
1. Instructions issued after a branch may complete before it.
⇒ The write back stage of these instructions has to be stalled until we know whether the prediction was correct!
2. Exceptions:
We have to ensure that exactly the same exceptions are handled as in the case where the pipeline used in-order execution and no branch prediction!
Simple solution:
Instructions following a branch are only issued; their execution starts only after the branch prediction has turned out to be correct.
⇒ This can reduce the efficiency of a dynamically scheduled pipeline dramatically!
Speculative execution
Slide 6-70
The write result stage is split into two stages:
3. Write results:
• Instructions are executed as operands become available. Results are written into a reorder buffer (ROB).
• For each active instruction there is one entry in the ROB. The order of the entries corresponds to the order in which the instructions were issued
  ⇒ the head of the ROB contains the result of the active instruction issued first.
  Subsequent instructions can read their operands from the ROB.
• Writes going to the register file and to memory are delayed until the branch predictions turn out to be correct.
4. Commit:
• When an instruction that writes to memory or the register file reaches the head of the ROB, its result is written. Exceptions are handled at this point if necessary!
• If the head of the ROB contains an incorrectly predicted branch, the ROB is flushed
  ⇒ the results calculated by instructions following the branch are discarded!
The ROB restores the original order of the instructions: in-order commitment.
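A sketch of the commit step in C; the ROB layout and all names are our assumptions:

  #include <stdbool.h>

  struct rob_entry {
      bool ready;         /* result already delivered to the ROB?    */
      bool is_branch;
      bool mispredicted;  /* known once the branch has been resolved */
      int  dest;          /* destination register                    */
      long value;         /* result waiting to be committed          */
  };

  #define ROB_SIZE 32
  static struct rob_entry rob[ROB_SIZE];
  static int head, count;
  static long regs[32];

  void commit(void)
  {
      if (count == 0 || !rob[head].ready)
          return;                      /* head instruction not finished yet */
      if (rob[head].is_branch && rob[head].mispredicted) {
          head = 0;  count = 0;        /* flush: discard all speculative results;
                                          fetch restarts at the correct target */
          return;
      }
      regs[rob[head].dest] = rob[head].value;  /* in-order update; exceptions
                                                  would be handled here */
      head = (head + 1) % ROB_SIZE;
      count--;
  }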
Speculative Execution
Slide 6-71
MIPS FP unit using Tomasulo's algorithm and a reorder buffer:
[Figure: the Tomasulo datapath of slide 6-63 extended by a reorder buffer (ROB). The instruction queue (FIFO) issues load/store operations to the address unit with load/store buffers and FP operations to the reservation stations in front of the FP adders and FP multipliers/dividers. Results on the common data bus (CDB) now also go to the ROB; store data and store addresses are buffered, and the FP registers and memory are updated only at commit, from the head of the ROB.]
Multiple Issue Processors
Slide 6-72
Using multiple FUs, dynamic scheduling, branch prediction and speculation makes it possible to achieve a CPI of nearly one.
CPI < 1 is not possible this way, because we issue only one instruction per clock cycle!
Further speedup:
Issue multiple instructions in one clock cycle (up to 8 in practice)
⇒ CPI < 1 becomes possible!
The sets of instructions issued in parallel are called instruction packets or issue packets.
Multiple Issue Processors
Slide 6-73
Multiple issue processors fall into two classes:
• Superscalar processors: instruction packets are generated by hardware ⇒ dynamic scheduling (hardware)
• VLIW (very long instruction word) processors: instruction packets are generated by the compiler ⇒ static scheduling (compiler)
Overview
Slide 6-74

  Name                      | Issue   | Hazard detection    | Scheduling               | Distinguishing characteristics          | Examples
  superscalar (static)      | dynamic | hardware            | static (compiler)        | in-order execution                      | Sun UltraSPARC II/III
  superscalar (dynamic)     | dynamic | hardware            | dynamic                  | out-of-order execution                  | IBM PowerPC
  superscalar (speculative) | dynamic | hardware            | dynamic with speculation | out-of-order execution with speculation | Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64III
  VLIW                      | static  | software (compiler) | static (compiler)        | no hazards between issue packets        | Trimedia, i860
Statically scheduled superscalar Processors
Example: dual-issue static superscalar processor
In one clock cycle we can issue
• one integer instruction (including load/store, branches and integer ALU operations) and
• one arithmetic FP instruction
Slide 6-75
Only slight extensions of the hardware are necessary compared to a single-issue implementation with two FUs.
This organization is typical for high-end embedded processors.
Statically scheduled Dual Issue Pipeline
Slide 6-76

  Instruction type     | Pipeline stages
  Integer instruction  | IF ID EX MEM WB
  FP instruction       | IF ID EX EX  EX  WB
  Integer instruction  |    IF ID EX  MEM WB
  FP instruction       |    IF ID EX  EX  EX  WB
  Integer instruction  |       IF ID  EX  MEM WB
  FP instruction       |       IF ID  EX  EX  EX WB
  Integer instruction  |          IF  ID  EX  MEM WB
  FP instruction       |          IF  ID  EX  EX  EX WB

One integer and one FP instruction are issued together in every cycle ⇒ a CPI of 0.5 is possible!
Multiple Issue Pipeline
Slide 6-77
In order to enable multiple issue per clock cycle we must also be able to fetch multiple instructions per cycle!
Example: a 4-way issue processor fetches the instructions stored at PC, PC+4, PC+8 and PC+12 from memory
⇒ a wide bus to the instruction memory is required!
Problem: what if one of these instructions is a branch?
1. Reading the branch target buffer and accessing the instruction memory in one clock cycle would increase the cycle time.
2. If n instructions of the packet are allowed to be branches, we would have to look up n entries in the branch target buffer in parallel!
Typical simplification: single issue for branches
Multiple issue with dynamic pipelining (Tomasulo)
Slide 6-78
Example: superscalar processor with:
• Dual issue (single issue for branches)
• Dynamic Tomasulo scheduling (no speculation, i.e. the execution of instructions following a branch must be delayed until the branch condition is evaluated)
• One FP unit
• One FU for integer instructions, load/stores and branch condition testing
• A separate FU for branch address calculation
• Several reservation stations/load-store buffers for each FU: load/stores occupy the FU only during address calculation, branches only during condition testing; stores are allowed to execute even if the data to be stored is not available yet

  Loop: l.d   $f0, 0($r1)     # f0 := array element
        add.d $f4, $f0, $f2   # add f2 to f0
        s.d   $f4, 0($r1)     # store result
        addi  $r1, $r1, -8    # decrement pointer
        bne   $r1, $r2, Loop  # repeat loop if r1 ≠ r2

Latency: number of cycles from the beginning of the execution step to the moment when the result is available on the CDB:
  Integer operations: 1 cycle
  Load: 2 cycles (1 in EX stage + 1 in MEM stage)
  FP operation: 3 cycles (in EX stage)
Example
Slide 6-81
The CPI is significantly greater than 0.5:
Problem: the integer unit is used for memory address calculation, for incrementing the pointer and for the condition test
⇒ branch execution is delayed by one cycle.
Possible solution: an additional integer FU
Problem: the execution step of an instruction following a branch has to be delayed until the branch is executed.
Possible solution: use speculative execution
Example: dual-issue processor with speculative execution
In order to achieve a CPI < 1 we must allow two instructions to commit in parallel!
⇒ More buses are required
Compiler techniques
Slide 6-82
Observation: if branch prediction is perfect, then loops are unrolled automatically by the hardware: operations that belong to different iterations of the loop overlap.
Loops may also be unrolled in advance by the compiler!
⇒ This improves performance for processors without speculative execution.

Loop before unrolling:

  Loop: lw   $t0, 0($s1)
        add  $t0, $t0, $s2
        sw   $t0, 0($s1)
        addi $s1, $s1, -4
        bne  $s1, $zero, Loop

Loop after unrolling (by a factor of 4):

  Loop: addi $s1, $s1, -16
        lw   $t0, 16($s1)
        add  $t0, $t0, $s2
        sw   $t0, 16($s1)
        lw   $t1, 12($s1)
        add  $t1, $t1, $s2
        sw   $t1, 12($s1)
        lw   $t2, 8($s1)
        add  $t2, $t2, $s2
        sw   $t2, 8($s1)
        lw   $t3, 4($s1)
        add  $t3, $t3, $s2
        sw   $t3, 4($s1)
        bne  $s1, $zero, Loop

The register renaming ($t0 … $t3) is done by the compiler!
Summary
Slide 6-83
Superscalar processors determine during program execution how many instructions are issued in one clock cycle.
Statically scheduled:
• Must detect dependences within instruction packets and resolve them by inserting stalls
• Need assistance from the compiler to achieve a high degree of parallelism
• Simple hardware
Dynamically scheduled:
• Require less assistance from the compiler
• Much more complex hardware
Static Multiple Issue – the VLIW approach
Slide 6-84
For highly superscalar processors the hardware becomes very complex.
Idea: let the compiler do as much work as possible!
The VLIW approach is used, for example, in digital signal processing (DSP):
the compiler groups instructions with no dependences between them, which may therefore be executed in parallel, into a "very long instruction word" (VLIW)
⇒ no hardware for hazard detection and scheduling is necessary.
Does the program contain enough parallelism?
The compiler has to find enough parallelism to use the full capacity of all functional units!
  local scheduling: scheduling inside sequences of instructions without branches (= basic blocks)
  global scheduling: scheduling across several basic blocks
Example
Slide 6-85
For VLIW processors one instruction must explicitly contain all operations that are executed in parallel. Therefore VLIW processors are sometimes also called EPICs (explicitly parallel instruction computers).
Consider a VLIW processor with:
• 2 FUs for memory access (2 cycles for EX)
• 2 FUs for FP operations (pipelined, 3 cycles for EX)
• 1 FU for integer operations and branches (1 cycle)
Task: create a schedule for 7 iterations using loop unrolling. Branches have zero latency.

Original loop:

  Loop: lw.d  $f0, 0($r1)
        add.d $f4, $f0, $f2
        sw.d  $f4, 0($r1)
        addi  $r1, $r1, -8
        bne   $r1, $r2, LOOP

Unrolled loop:

  Loop: lw.d  $f0, 0($r1)
        add.d $f4, $f0, $f2
        sw.d  $f4, 0($r1)
        lw.d  $f6, -8($r1)
        add.d $f8, $f6, $f2
        sw.d  $f8, -8($r1)
        lw.d  $f10, -16($r1)
        add.d $f12, $f10, $f2
        sw.d  $f12, -16($r1)
        lw.d  $f14, -24($r1)
        add.d $f16, $f14, $f2
        sw.d  $f16, -24($r1)
        …
        addi  $r1, $r1, -56
        bne   $r1, $r2, LOOP
Static Multiple Issue – the VLIW approach
Slide 6-86
Each row corresponds to one VLIW instruction:

  Memory unit 1       | Memory unit 2       | FP unit 1          | FP unit 2          | Integer unit
  lw.d $f0,0($r1)     | lw.d $f6,-8($r1)    |                    |                    |
  lw.d $f10,-16($r1)  | lw.d $f14,-24($r1)  |                    |                    |
  lw.d $f18,-32($r1)  | lw.d $f22,-40($r1)  | add $f4,$f0,$f2    | add $f8,$f6,$f2    |
  lw.d $f26,-48($r1)  |                     | add $f12,$f10,$f2  | add $f16,$f14,$f2  |
                      |                     | add $f20,$f18,$f2  | add $f24,$f22,$f2  |
  sw.d $f4,0($r1)     | sw.d $f8,-8($r1)    | add $f28,$f26,$f2  |                    |
  sw.d $f12,-16($r1)  | sw.d $f16,-24($r1)  |                    |                    | addi $r1,$r1,-56
  sw.d $f20,24($r1)   | sw.d $f24,16($r1)   |                    |                    |
  sw.d $f28,8($r1)    |                     |                    |                    | bne $r1,$r2,Loop