Instruction-level parallelism (ILP) aims to improve performance by overlapping the execution of instructions. There are two main approaches: 1) relying on hardware to dynamically discover and exploit parallelism, and 2) relying on software to statically find parallelism at compile-time. Exploiting ILP across multiple basic blocks is needed to achieve substantial performance gains, as basic block ILP is typically small due to frequent branches. Data dependencies between instructions limit the amount of parallelism that can be exploited, as true dependencies must be preserved to maintain program correctness. Hardware and software aim to exploit parallelism while preserving program order where it affects the program outcome.
Chapter 3 instruction level parallelism and its exploitation
Computer Architecture (計算機結構)
Lecture 3: Instruction-Level Parallelism and Its Exploitation (Chapter 2 in the textbook)
Ping-Liang Lai (賴秉樑)

Outline
2.1 Instruction-Level Parallelism: Concepts and Challenges
2.2 Basic Compiler Techniques for Exposing ILP
2.3 Reducing Branch Costs with Prediction
2.4 Overcoming Data Hazards with Dynamic Scheduling
2.5 Dynamic Scheduling: Examples and the Algorithm
2.6 Hardware-Based Speculation
2.7 Exploiting ILP Using Multiple Issue and Static Scheduling
2.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue and Speculation
2.1 ILP: Concepts and Challenges
Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance.
Two approaches to exploit ILP:
1) Rely on hardware to help discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power);
2) Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2).
Pipelining review:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

Instruction-Level Parallelism (ILP)
Basic block (BB) ILP is quite small.
A BB is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
Average dynamic branch frequency is 15% to 25%, so 3 to 6 instructions execute between a pair of branches; in addition, the instructions in a BB are likely to depend on each other.
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (ILP → LLP).
Loop-level parallelism: exploit parallelism among iterations of a loop, e.g., add two arrays:
for (i=1; i<=1000; i=i+1)
    x[i] = x[i] + y[i];
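As a side illustration of loop-level parallelism, below is a minimal C sketch (not from the slides) of the matrix-add case the bullet alludes to; the function name matrix_add and the size N are placeholders. The point is simply that no iteration reads a value written by another iteration, so all iterations are independent and could in principle run in parallel.

#define N 1000

/* Element-wise addition of two N x N matrices.
 * Every (i, j) iteration touches only c[i][j], a[i][j], b[i][j],
 * so the N*N iterations are mutually independent (loop-level parallelism). */
void matrix_add(double c[N][N], const double a[N][N], const double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];
}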
Data Dependence and Hazards
Three kinds of dependences: data dependences (true data dependences), name dependences, and control dependences.
Instruction j is data dependent on instruction i if either:
1. Instruction i produces a result that may be used by instruction j (i → j), or
2. Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (i → k → j, a dependence chain).
For example, a MIPS code sequence:
Loop: L.D    F0, 0(R1)     ;F0=array element
      ADD.D  F4, F0, F2    ;add scalar in F2
      S.D    F4, 0(R1)     ;store result
      DADDUI R1, R1, #-8   ;decrement pointer, 8 bytes (per DW)
      BNE    R1, R2, Loop  ;branch if R1!=R2

Data Dependence
Floating-point data part:
Loop: L.D    F0, 0(R1)     ;F0=array element
      ADD.D  F4, F0, F2    ;add scalar in F2
      S.D    F4, 0(R1)     ;store result
Integer data part:
      DADDUI R1, R1, #-8   ;decrement pointer, 8 bytes (per DW)
      BNE    R1, R2, Loop  ;branch if R1!=R2
† This type is called a Write After Read (WAR) hazard.
Data Dependence and Hazards
InstrJ is data dependent (a true dependence) on InstrI if:
1) InstrJ tries to read an operand before InstrI writes it, e.g.,
   I: add r1,r2,r3
   J: sub r4,r1,r3
2) or InstrJ is data dependent on InstrK, which is dependent on InstrI.
If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped.
Data dependence in the instruction sequence → data dependence in the source code → the effect of the original data dependence must be preserved.
If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard.

ILP and Data Dependences, Hazards
HW/SW must preserve program order: the order in which instructions would execute if run sequentially, as determined by the original source program.
Dependences are a property of programs.
The presence of a dependence indicates the potential for a hazard, but whether an actual hazard occurs, and the length of any stall, is a property of the pipeline.
Importance of the data dependences:
1) Indicates the possibility of a hazard;
2) Determines the order in which results must be calculated;
3) Sets an upper bound on how much parallelism can possibly be exploited.
HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program.
Name Dependence #1: Anti-dependence
Name dependence: two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are two versions of name dependence (WAR and WAW).
InstrJ writes an operand before InstrI reads it:
   I: sub r4,r1,r3
   J: add r1,r2,r3
   K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers; it results from reuse of the name "r1".
If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.

Name Dependence #2: Output dependence
InstrJ writes an operand before InstrI writes it:
   I: sub r1,r4,r3
   J: add r1,r2,r3
   K: mul r6,r1,r7
Called an "output dependence" by compiler writers; this also results from reuse of the name "r1".
If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict.
Register renaming resolves name dependences for registers, either by the compiler or by hardware.
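To make the renaming idea concrete, here is a hypothetical C analogue (not from the slides); the variables r1, r2, r3, r4, r6, r7 and r1b simply stand in for registers, mirroring the sub/add/mul sequence above.

/* Before renaming: the name r1 is reused, creating an anti-dependence
 * (I reads r1, J writes r1) and forcing I and J to stay ordered,
 * even though no data flows from I to J. */
double before_renaming(double r1, double r2, double r3, double r7)
{
    double r4, r6;
    r4 = r1 - r3;      /* I: sub r4, r1, r3              */
    r1 = r2 + r3;      /* J: add r1, r2, r3 (reuses r1)  */
    r6 = r1 * r7;      /* K: mul r6, r1, r7              */
    return r4 + r6;
}

/* After renaming: J and K use a fresh name r1b, so only the true (RAW)
 * dependence J -> K remains, and I can overlap with J. */
double after_renaming(double r1, double r2, double r3, double r7)
{
    double r4, r6, r1b;
    r4  = r1 - r3;     /* I                               */
    r1b = r2 + r3;     /* J: writes the renamed register  */
    r6  = r1b * r7;    /* K: reads the renamed register   */
    return r4 + r6;
}

Both functions compute the same result, which is exactly the property register renaming must preserve.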
Control Dependencies
Every instruction is control dependent on some set of branches, and, in general, these control dependences must be preserved to preserve program order.
if p1 {
    S1;
};
if p2 {
    S2;
}
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Control Dependence
Two constraints are imposed by control dependence:
1. An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch;
2. An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch.
Control dependence need not always be preserved:
We are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program.
Instead, the two properties critical to program correctness are:
1. Exception behavior, and
2. Data flow.
Exception Behavior
Preserving exception behavior: any change in instruction execution order must not change how exceptions are raised in the program (⇒ no new exceptions).
Example:
    DADDU R2,R3,R4
    BEQZ  R2,L1
    LW    R1,0(R2)
L1: ...
† Assume branches are not delayed.
What is the problem with moving LW before BEQZ? If the branch is taken (R2 is zero), the hoisted load would access address 0 and could raise a memory-protection exception that the original program never raises.

Data Flow
Data flow: the actual flow of data values among instructions that produce results and those that consume them.
Branches make the flow dynamic; they determine which instruction is the supplier of data.
Example:
    DADDU R1, R2, R3
    BEQZ  R4, L
    DSUBU R1, R5, R6
L:  ...
    OR    R7, R1, R8
Does the R1 read by OR depend on DADDU or DSUBU? It depends on whether the branch is taken, so the data flow must be preserved on execution.
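A C analogue (not from the slides) of the exception-behavior example may help; guarded_load is a made-up name, and the pointer test plays the role of BEQZ.

/* The guarded load corresponds to BEQZ R2,L1 / LW R1,0(R2):
 * the load executes only when the pointer is non-zero. */
int guarded_load(int *p)
{
    int v = 0;
    if (p != 0)        /* BEQZ R2, L1 : skip the load when the pointer is 0 */
        v = *p;        /* LW R1, 0(R2)                                      */
    return v;
}

/* If the load were hoisted above the test, e.g.
 *     int v = *p;  if (p != 0) ...
 * then a null p would fault -- an exception the original program never
 * raises.  That is why reordering must not change exception behavior. */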
2.2 Basic Compiler Techniques for Exposing ILP
This code adds a scalar to a vector:
for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
Assume the following latencies for all examples, and ignore delayed branches in these examples.

Figure 2.2 Latencies of FP operations used in this chapter.
Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0
FP Loop: Where are the Hazards?
First translate into MIPS code (to simplify, assume 8 is the lowest address):
Loop: L.D    F0,0(R1)   ;F0=vector element
      ADD.D  F4,F0,F2   ;add scalar from F2
      S.D    0(R1),F4   ;store result
      DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
      BNEZ   R1,Loop    ;branch R1!=zero

FP Loop Showing Stalls
Example 3-1 (p. 76): Show how the loop would look on MIPS, both scheduled and unscheduled, including any stalls or idle clock cycles. Schedule for delays from floating-point operations, but remember that we are ignoring delayed branches.
Answer:
1. Loop: L.D    F0,0(R1)   ;F0=vector element
2.       stall
3.       ADD.D  F4,F0,F2   ;add scalar in F2
4.       stall
5.       stall
6.       S.D    0(R1),F4   ;store result
7.       DADDUI R1,R1,-8   ;decrement pointer 8B (DW)
8.       stall              ;assumes can't forward to branch
9.       BNEZ   R1,Loop    ;branch R1!=zero
† 9 clock cycles: can we rewrite the code to minimize stalls?
Revised FP Loop Minimizing Stalls
1. Loop: L.D    F0,0(R1)
2.       DADDUI R1,R1,-8
3.       ADD.D  F4,F0,F2
4.       stall
5.       stall
6.       S.D    8(R1),F4   ;altered offset when DADDUI is moved up
7.       BNEZ   R1,Loop
Swap DADDUI and S.D by changing the address of S.D (the FP latencies of Figure 2.2 still apply).
† 7 clock cycles, but just 3 are for execution (L.D, ADD.D, S.D) and 4 are loop overhead; how can we make it faster?

Unroll Loop Four Times (straightforward way)
The numbers below are the clock cycles in which each instruction issues; missing numbers are stall cycles.
1.  Loop: L.D    F0,0(R1)
3.        ADD.D  F4,F0,F2
6.        S.D    0(R1),F4      ;drop DADDUI & BNEZ
7.        L.D    F6,-8(R1)
9.        ADD.D  F8,F6,F2
12.       S.D    -8(R1),F8     ;drop DADDUI & BNEZ
13.       L.D    F10,-16(R1)
15.       ADD.D  F12,F10,F2
18.       S.D    -16(R1),F12   ;drop DADDUI & BNEZ
19.       L.D    F14,-24(R1)
21.       ADD.D  F16,F14,F2
24.       S.D    -24(R1),F16
25.       DADDUI R1,R1,#-32    ;alter to 4*8
26.       BNEZ   R1,LOOP
27 clock cycles, or 6.75 per iteration (assumes the trip count in R1 is a multiple of 4).
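For readers who prefer the source level, here is a hedged C sketch (not from the slides) of the same unroll-by-four transformation of the scalar-add loop; add_scalar and add_scalar_unroll4 are made-up names, and x is assumed to have valid indices 1..n to match the slide's 1-based loop.

/* Original loop: add the scalar s to each of x[n] .. x[1]. */
void add_scalar(double *x, double s, int n)
{
    for (int i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Unrolled four times (assumes n is a multiple of 4, as the slide does).
 * One copy of the loop overhead (index update + branch) now amortizes
 * over four element updates, and the four updates are independent. */
void add_scalar_unroll4(double *x, double s, int n)
{
    for (int i = n; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}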
Unrolled Loop Detail
We do not usually know the upper bound of the loop.
Suppose it is n, and we would like to unroll the loop to make k copies of the body.
Instead of a single unrolled loop, we generate a pair of consecutive loops (see the sketch after this slide):
The 1st executes (n mod k) times and has a body that is the original loop;
The 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times.
For large values of n, most of the execution time will be spent in the unrolled loop.

Unrolled Loop That Minimizes Stalls
1.  Loop: L.D    F0, 0(R1)
2.        L.D    F6, -8(R1)
3.        L.D    F10, -16(R1)
4.        L.D    F14, -24(R1)
5.        ADD.D  F4, F0, F2
6.        ADD.D  F8, F6, F2
7.        ADD.D  F12, F10, F2
8.        ADD.D  F16, F14, F2
9.        S.D    0(R1), F4
10.       S.D    -8(R1), F8
11.       S.D    -16(R1), F12
12.       DSUBUI R1, R1, #32
13.       S.D    8(R1), F16   ; 8-32 = -24
14.       BNEZ   R1, LOOP
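The prologue-plus-unrolled-loop strategy just described can be sketched in C as follows (not from the slides; add_scalar_any_n is a made-up name and k = 4 is an illustrative choice).

/* When the trip count n is not known to be a multiple of k, emit a pair of
 * loops: a remainder loop that runs (n mod k) times with the original body,
 * followed by the unrolled loop that runs n/k times.  Here k = 4. */
void add_scalar_any_n(double *x, double s, int n)
{
    int i = n;

    /* 1st loop: executes (n mod 4) times, body identical to the original. */
    for (int r = n % 4; r > 0; r = r - 1, i = i - 1)
        x[i] = x[i] + s;

    /* 2nd loop: the unrolled body, iterating n/4 times. */
    for (; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}

For large n, almost all elements are handled by the second loop, which is why the remainder loop costs little.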
5 Loop Unrolling Decisions
Unrolling requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1. Determine that unrolling the loop would be useful by finding that the loop iterations are independent (except for the loop maintenance code);
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations;
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code;
4. Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent;
   this transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
5. Schedule the code, preserving any dependences needed to yield the same result as the original code.

Limits to Loop Unrolling
Three limits to loop unrolling:
1. Decrease in the amount of overhead amortized with each extra unrolling: Amdahl's Law.
2. Growth in code size: for larger loops, the concern is that it increases the instruction cache miss rate.
3. Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling.
   If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of unrolling.
Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction, discussed in Section 2.3: Reducing Branch Costs with Prediction.
2.3 Reducing Branch Costs with Prediction
Because of the need to enforce control dependences through branch hazards and stalls, branches hurt pipeline performance.
Solution 1: loop unrolling;
Solution 2: predict how branches will behave.
SW/HW technology:
SW: static branch prediction, performed statically at compile time;
HW: dynamic branch prediction, performed dynamically by the hardware at execution time.
Static Branch Prediction
Appendix A showed how to schedule code around delayed branches.
To reorder code around branches, we need to predict branches statically at compile time.
The simplest scheme is to predict every branch as taken.
Average misprediction rate = untaken branch frequency = 34% for SPEC; unfortunately, it ranges from not very accurate (59%) to highly accurate (9%).
A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run.
[Figure 2.3: per-benchmark misprediction rates on SPEC92; integer programs average 15% and floating-point programs average 9%.]

Dynamic Branch Prediction
Why does prediction work?
The underlying algorithm has regularities;
The data being operated on has regularities;
The instruction sequence has redundancies that are artifacts of the way humans and compilers think about problems.
Is dynamic branch prediction better than static branch prediction?
It seems to be: there are a small number of important branches in programs that have dynamic behavior.
8. Dynamic Branch Prediction
   Performance = f(accuracy, cost of misprediction)
   Branch History Table (also called a Branch Prediction Buffer): the lower bits of
   the branch PC address index a table of 1-bit values.
      Each bit says whether the branch was recently taken or not;
      there is no address check.
   Problem: in a loop, a 1-bit BHT will cause two mispredictions per pass through
   the loop (even when the branch averages 9 taken in 10 iterations before exit):
      the end-of-loop case, when the loop exits instead of looping as before;
      the first time through the loop on the next pass through the code, when it
      predicts exit instead of looping.

Basic Branch Prediction Buffers
   a.k.a. Branch History Table (BHT): a small direct-mapped cache of T/NT bits.
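To make the two mispredictions concrete, here is a worked pass (an illustration,
assuming the 1-bit predictor has already settled to "taken" and the loop branch is
taken nine times and then falls through on the tenth execution):

   Branch execution:    1-9        10 (loop exit)   1 of next pass   2-9 of next pass
   Actual outcome:      taken      not taken        taken            taken
   1-bit prediction:    taken      taken            not taken        taken
   Result:              correct    mispredict       mispredict       correct

So a branch that is taken 90% of the time is predicted correctly only 80% of the time
by the 1-bit predictor.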
Dynamic Branch Prediction
   Solution: a 2-bit scheme that changes the prediction only after getting a
   misprediction twice.
   [State diagram: states 11 and 10 predict taken, states 01 and 00 predict not
   taken; a taken (T) outcome moves the counter toward 11, a not-taken (NT) outcome
   moves it toward 00. Red: stop, not taken; blue: go, taken.]
   This adds hysteresis to the decision-making process.

2-bit Scheme Accuracy
   A misprediction occurs because either:
      the guess for that branch was wrong, or
      the branch history of the wrong branch was obtained when indexing the table.
   Measured with a 4,096-entry table:
   [Figure 2.5 The result of the 2-bit scheme in SPEC89: per-benchmark misprediction
   rates of roughly 18%, 12%, 10%, 9%, 9%, 5%, 5%, 1%, and 0%; the integer benchmarks
   mispredict noticeably more often than the floating-point benchmarks.]
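A minimal C sketch of such a 2-bit saturating-counter BHT (illustrative; the table
size matches the 4,096 entries above, but the indexing and the all-zero initial state
are assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define BHT_ENTRIES 4096            /* 4,096-entry table, as in Figure 2.5 */

    /* 2-bit saturating counters: 0,1 = predict not taken; 2,3 = predict taken. */
    static uint8_t bht[BHT_ENTRIES];    /* static storage: starts at 0 (not taken) */

    static unsigned bht_index(uint32_t pc)
    {
        return (pc >> 2) & (BHT_ENTRIES - 1);   /* low-order bits of the branch PC */
    }

    bool predict(uint32_t pc)
    {
        return bht[bht_index(pc)] >= 2;         /* high bit of the counter */
    }

    void update(uint32_t pc, bool taken)
    {
        uint8_t *c = &bht[bht_index(pc)];
        if (taken)  { if (*c < 3) (*c)++; }     /* move toward "strongly taken"     */
        else        { if (*c > 0) (*c)--; }     /* move toward "strongly not taken" */
    }

Because two consecutive mispredictions are needed to flip the high bit, the loop
branch of the earlier example is now mispredicted only once per pass.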
9. 2-bit Scheme Accuracy
   In Figure 2.5, the accuracy of the predictors for integer programs, which
   typically also have higher branch frequencies, is lower than for the
   loop-intensive scientific programs.
   Two ways to attack this problem:
      a larger buffer size;
      increasing the accuracy of the scheme used for each prediction.
   However, simply increasing the number of bits per predictor without changing the
   predictor structure has little impact.
   Single branch predictor vs. correlating branch predictors.

Improve Prediction Strategy by Correlating Branches
   Consider the worst case for the 2-bit predictor (aa in R1, bb in R2):

      if (aa == 2)
          aa = 0;
      if (bb == 2)
          bb = 0;
      if (aa != bb) {

   which compiles (on MIPS) to:

              DSUBUI  R3, R1, #2
              BNEZ    R3, L1          ; first branch  (taken if aa != 2)
              DADD    R1, R0, R0      ; aa = 0
      L1:     DSUBUI  R3, R2, #2
              BNEZ    R3, L2          ; second branch (taken if bb != 2)
              DADD    R2, R0, R0      ; bb = 0
      L2:     DSUBU   R3, R1, R2
              BEQZ    R3, L3          ; third branch  (taken if aa == bb)

   If the first two branches both fall through (so aa and bb are both set to 0), the
   third branch will always be taken: it is determined by the outcome of the previous
   two branches. Single-level predictors can never capture this case.
   Correlating predictors, or 2-level predictors:
      Correlation = what happened on the last branch.
         » Note that the last (correlator) branch may not always be the same branch.
      Predictor = which way to go.
         » 4 possibilities: which way the last branch went chooses the prediction:
           (last taken, last not taken) × (predict taken, predict not taken).
Correlated Branch Prediction
   Idea: record the m most recently executed branches as taken or not taken, and use
   that pattern to select the proper n-bit branch history table.
   In general, an (m, n) predictor records the last m branches to select among 2^m
   history tables, each with n-bit counters.
      Thus, the old 2-bit BHT is a (0, 2) predictor.
   Global Branch History: an m-bit shift register keeping the T/NT status of the
   last m branches.
   Each entry in the table has 2^m n-bit predictors.
   Total bits for the (m, n) BHT prediction buffer:
      Total_memory_bits = 2^m × n × 2^p
      - 2^m banks of memory are selected by the global branch history (which is just
        a shift register), e.g. a column address;
      - p bits of the branch address select the row;
      - the n predictor bits in the selected entry make the decision.

Correlating Branches: a (2, 2) predictor
   The behavior of the most recent 2 branches selects among four predictions for the
   next branch, and only that prediction is updated.
   [Diagram: 4 low-order bits of the branch address select a row of the buffer; the
   2-bit global branch history selects one of the four 2-bit per-branch predictors
   in that row, which supplies the prediction.]
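Two illustrative additions (not from the slides). First, a worked use of the size
formula for the buffer in the diagram above, with m = 2 history bits, n = 2-bit
counters, and p = 4 address bits: 2^2 × 2 × 2^4 = 128 bits. Second, a minimal C
sketch of a (2, 2) correlating predictor; the sizes and zero initialization are
assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define P_BITS   4                          /* p address bits -> 2^p rows    */
    #define M_BITS   2                          /* m bits of global history      */
    #define ROWS     (1u << P_BITS)
    #define HISTS    (1u << M_BITS)

    /* 2^p rows x 2^m columns of 2-bit saturating counters (the n = 2 case). */
    static uint8_t corr_table[ROWS][HISTS];     /* starts at 0: predict not taken */
    static unsigned ghr;                        /* global history shift register  */

    bool predict_corr(uint32_t pc)
    {
        unsigned row = (pc >> 2) & (ROWS - 1);  /* p low-order bits of the PC     */
        unsigned col = ghr & (HISTS - 1);       /* outcomes of the last m branches */
        return corr_table[row][col] >= 2;
    }

    void update_corr(uint32_t pc, bool taken)
    {
        unsigned row = (pc >> 2) & (ROWS - 1);
        unsigned col = ghr & (HISTS - 1);
        uint8_t *c = &corr_table[row][col];
        if (taken)  { if (*c < 3) (*c)++; }     /* update only the selected counter */
        else        { if (*c > 0) (*c)--; }
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (HISTS - 1);  /* shift in outcome */
    }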
10. Example of Correlating Branch Predictors
   Source code (d is in R1):

      if (d == 0)        ; branch b1 not taken
          d = 1;
      else               ; branch b1 taken
          ...
      if (d == 1)        ; branch b2 not taken
          ...
      else               ; branch b2 taken
          ...

   Compiled MIPS code:

              BNEZ    R1, L1          ; branch b1 (taken if d != 0)
              DADDIU  R1, R0, #1      ; d == 0, so d = 1
      L1:     DADDIU  R3, R1, #-1
              BNEZ    R3, L2          ; branch b2 (taken if d != 1)
              ...
      L2:     ...

   Key observation: if b1 is not taken, then b2 will be not taken.

Example: Multiple Consequent Branches
   1-bit predictor: consider d alternating between 2 and 0. All the branches are
   mispredicted (see the trace below).
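A worked trace for the 1-bit predictors of b1 and b2 (both assumed to start out
predicting not taken):

   d value:           2      0      2      0
   b1 prediction:     NT     T      NT     T
   b1 actual:         T      NT     T      NT
   b2 prediction:     NT     T      NT     T
   b2 actual:         T      NT     T      NT

Every prediction is wrong: each 1-bit predictor always lags one occurrence behind the
alternating pattern.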
Example: Multiple Consequent Branches
   Same source code as on the previous slide, with d alternating between 2 and 0.
   2-bit prediction entry: the prediction used if the last branch was not taken, and
   the prediction used if the last branch was taken.
   (1,1) predictor - a 1-bit predictor with 1 bit of correlation: the last branch
   (either taken or not taken) decides which of the two prediction bits is consulted
   and updated (see the trace after this slide).

Accuracy of Different Schemes
   [Figure: frequency of mispredictions on SPEC89 benchmarks (nasa7, matrix300,
   tomcatv, doducd, fpppp, expresso, eqntott, li, gcc, spice) for three predictors:
   a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2)
   BHT. Rates range from 0% to about 18%; the 4,096-entry and unlimited-entry 2-bit
   BHTs behave almost identically, while the 1,024-entry (2,2) predictor is generally
   as accurate or better.]
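For comparison, a worked trace of the (1,1) predictor on the same sequence (each
entry is written as "prediction if last branch not taken / prediction if last branch
taken"; all entries assumed to start as NT/NT):

   d    b1 entry   b1 pred.    b1 actual   b2 entry   b2 pred.    b2 actual
   2    NT/NT      NT (miss)   T           NT/NT      NT (miss)   T
   0    T/NT       NT          NT          NT/T       NT          NT
   2    T/NT       T           T           NT/T       T           T
   0    T/NT       NT          NT          NT/T       NT          NT

Only the first occurrence of each branch is mispredicted; the single correlation bit
is enough to capture that b2 follows b1 and that the pattern alternates.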
11. Outline
   2.1 Instruction-Level Parallelism: Concepts and Challenges
   2.2 Basic Compiler Techniques for Exposing ILP
   2.3 Reducing Branch Costs with Prediction
   2.4 Overcoming Data Hazards with Dynamic Scheduling
   2.5 Dynamic Scheduling: Examples and the Algorithm
   2.6 Hardware-Based Speculation
   2.7 Exploiting ILP Using Multiple Issue and Static Scheduling
   2.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation

Advantages of Dynamic Scheduling
   Dynamic scheduling: the hardware rearranges the instruction execution to reduce
   stalls while maintaining data flow and exception behavior.
      It handles cases in which dependences are unknown at compile time.
      It allows the processor to tolerate unpredictable delays, such as cache misses,
      by executing other code while waiting for the miss to resolve.
      It allows code compiled for one pipeline to run efficiently on a different
      pipeline.
      It simplifies the compiler.
   Hardware speculation, a technique with significant performance advantages, builds
   on dynamic scheduling (next lecture).
HW Schemes: Instruction Parallelism
   Key idea: allow instructions behind a stall to proceed.

      DIVD  F0,  F2, F4
      ADDD  F10, F0, F8      ; stalls waiting for F0 (true dependence on DIVD)
      SUBD  F12, F8, F14     ; independent, can proceed past the stalled ADDD

   This enables out-of-order execution and allows out-of-order completion
   (e.g., SUBD finishes before ADDD).
   In a dynamically scheduled pipeline, all instructions still pass through the issue
   stage in order (in-order issue).
   We distinguish when an instruction begins execution and when it completes
   execution; between these two times, the instruction is in execution.
   Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder.

Dynamic Scheduling Step 1
   The simple pipeline had one stage to check both structural and data hazards:
   Instruction Decode (ID), also called Instruction Issue.
   Split the ID stage of the simple 5-stage pipeline into 2 stages:
      Issue: decode instructions, check for structural hazards.
      Read operands: wait until there are no data hazards, then read the operands.
12. Tomasulo Algorithm
   Control & buffers are distributed with the Function Units (FUs):
      FU buffers are called "reservation stations" (RS); they hold pending operands.
   Registers in instructions are replaced by values or by pointers to reservation
   stations; this is called register renaming.
      Renaming avoids WAR and WAW hazards.
      There are more reservation stations than registers, so the hardware can do
      optimizations compilers can't.
   Results go from the RSs to the FUs, not through the registers, over a Common Data
   Bus (CDB) that broadcasts results to all FUs.
   RAW hazards are avoided by executing an instruction only when its operands are
   available.
   Loads and stores are treated as FUs with RSs as well.
   Integer instructions can go past branches (predict taken), allowing FP operations
   beyond the basic block into the FP queue.

Tomasulo Organization
   [Diagram: the FP Op Queue and the FP Registers feed the reservation stations.
   Load buffers (Load1-Load6) bring data from memory; store buffers send results to
   memory. Reservation stations Add1-Add3 feed the FP adders and Mult1-Mult2 feed the
   FP multipliers; all results are broadcast on the Common Data Bus.]
Reservation Station Components
   Op: the operation to perform in the unit (e.g., + or -)
   Vj, Vk: the values of the source operands
      Store buffers have a V field: the result to be stored.
   Qj, Qk: the reservation stations producing the source operands (the values to be
   written)
      Note: Qj, Qk = 0 => ready.
      Store buffers only have Qi, for the RS producing the result.
   Busy: indicates that the reservation station or FU is busy.
   Register result status: indicates which functional unit will write each register,
   if one exists; blank when no pending instruction will write that register.

Three Stages of Tomasulo Algorithm
   1. Issue: get an instruction from the FP Op Queue.
      If a reservation station is free (no structural hazard), control issues the
      instruction and sends the operands (renaming the registers).
   2. Execute: operate on the operands (EX).
      When both operands are ready, execute; if not ready, watch the Common Data Bus
      for the result.
   3. Write result: finish execution (WB).
      Write on the Common Data Bus to all awaiting units; mark the reservation
      station available.
   Normal data bus: data + destination (a "go to" bus).
   Common data bus: data + source (a "come from" bus);
      64 bits of data + 4 bits of functional-unit source address;
      a waiting unit captures the value if the source matches the functional unit it
      expects to produce the result;
      the bus does the broadcast.
   Example speeds: 3 clocks for FP +,-; 10 for *; 40 clocks for /.
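As a rough illustration only (not the textbook's specification), the fields above map
naturally onto a small C structure; the enum names, field types, and the RS_NONE
encoding below are assumptions:

    #include <stdbool.h>

    typedef enum { OP_NONE, OP_ADD, OP_SUB, OP_MUL, OP_DIV, OP_LOAD, OP_STORE } Op;

    #define RS_NONE 0       /* Qj/Qk = 0 means the operand value is already in Vj/Vk */

    typedef struct {
        bool     busy;      /* station (and its FU) currently in use                  */
        Op       op;        /* operation to perform in the unit, e.g. + or -          */
        double   vj, vk;    /* values of the source operands (valid when Qj/Qk == 0)  */
        unsigned qj, qk;    /* ids of the reservation stations producing the operands */
        unsigned addr;      /* effective address, used only by load/store buffers     */
    } ReservationStation;

    /* Register result status: which reservation station (if any) will write each
     * FP register; RS_NONE means no pending instruction will write it. */
    static unsigned reg_status[32];

The cycle-by-cycle example that follows is essentially a trace of how these fields
are filled in and cleared.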
13. Outline
   2.1 Instruction-Level Parallelism: Concepts and Challenges
   2.2 Basic Compiler Techniques for Exposing ILP
   2.3 Reducing Branch Costs with Prediction
   2.4 Overcoming Data Hazards with Dynamic Scheduling
   2.5 Dynamic Scheduling: Examples and the Algorithm
   2.6 Hardware-Based Speculation
   2.7 Exploiting ILP Using Multiple Issue and Static Scheduling
   2.8 Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation

Tomasulo Example (clock cycle 0)
   Instruction stream of six instructions; 3 load buffers, 3 FP adder reservation
   stations, 2 FP multiplier reservation stations. The Time field counts down the
   remaining execution cycles for each reservation station, and Clock is the cycle
   counter.

   Instruction status:         Issue   ExecComp   WriteResult
      LD    F6, 34+R2
      LD    F2, 45+R3
      MULTD F0, F2, F4
      SUBD  F8, F6, F2
      DIVD  F10, F0, F6
      ADDD  F6, F8, F2

   Load buffers:          Load1 No;  Load2 No;  Load3 No
   Reservation stations:  Add1 No;  Add2 No;  Add3 No;  Mult1 No;  Mult2 No
   Register result status (clock 0):  F0 ... F30 all empty
Tomasulo Example, Cycle 1
   Instruction status:         Issue   ExecComp   WriteResult
      LD    F6, 34+R2             1
   Load buffers:          Load1 Yes, address 34+R2;  Load2 No;  Load3 No
   Reservation stations:  Add1 No;  Add2 No;  Add3 No;  Mult1 No;  Mult2 No
   Register result status (clock 1):  F6: Load1

Tomasulo Example, Cycle 2
   Instruction status:         Issue   ExecComp   WriteResult
      LD    F6, 34+R2             1
      LD    F2, 45+R3             2
   Load buffers:          Load1 Yes, address 34+R2;  Load2 Yes, address 45+R3;
                          Load3 No
   Reservation stations:  Add1 No;  Add2 No;  Add3 No;  Mult1 No;  Mult2 No
   Register result status (clock 2):  F2: Load2;  F6: Load1
14. Tomasulo Example, Cycle 3
   Instruction status:         Issue   ExecComp   WriteResult
      LD    F6, 34+R2             1        3
      LD    F2, 45+R3             2
      MULTD F0, F2, F4            3
   Load buffers:          Load1 Yes, address 34+R2;  Load2 Yes, address 45+R3;
                          Load3 No
   Reservation stations:  Mult1 Yes, MULTD, Vk = R(F4), Qj = Load2;
                          Add1, Add2, Add3, Mult2 No
   Register result status (clock 3):  F0: Mult1;  F2: Load2;  F6: Load1
   • Note: register names are removed ("renamed") in the reservation stations;
     MULTD has issued.
   • Load1 is completing; what is waiting for Load1?

Tomasulo Example, Cycle 4
   Instruction status:         Issue   ExecComp   WriteResult
      LD    F6, 34+R2             1        3           4
      LD    F2, 45+R3             2        4
      MULTD F0, F2, F4            3
      SUBD  F8, F6, F2            4
   Load buffers:          Load1 No;  Load2 Yes, address 45+R3;  Load3 No
   Reservation stations:  Add1  Yes, SUBD,  Vj = M(A1), Qk = Load2;
                          Mult1 Yes, MULTD, Vk = R(F4), Qj = Load2;
                          Add2, Add3, Mult2 No
   Register result status (clock 4):  F0: Mult1;  F2: Load2;  F6: M(A1);  F8: Add1
   • Load2 is completing; what is waiting for Load2?
Tomasulo Example, Cycle 5
   Instruction status:         Issue   ExecComp   WriteResult
      LD    F6, 34+R2             1        3           4
      LD    F2, 45+R3             2        4           5
      MULTD F0, F2, F4            3
      SUBD  F8, F6, F2            4
      DIVD  F10, F0, F6           5
   Load buffers:          Load1 No;  Load2 No;  Load3 No
   Reservation stations:  Add1  Yes (Time 2),  SUBD,  Vj = M(A1), Vk = M(A2);
                          Mult1 Yes (Time 10), MULTD, Vj = M(A2), Vk = R(F4);
                          Mult2 Yes,           DIVD,  Vk = M(A1), Qj = Mult1;
                          Add2, Add3 No
   Register result status (clock 5):  F0: Mult1;  F2: M(A2);  F6: M(A1);  F8: Add1;
   F10: Mult2
   • The Time counters start counting down for Add1 and Mult1.

Tomasulo Example, Cycle 6
   Instruction status:         Issue   ExecComp   WriteResult
      LD    F6, 34+R2             1        3           4
      LD    F2, 45+R3             2        4           5
      MULTD F0, F2, F4            3
      SUBD  F8, F6, F2            4
      DIVD  F10, F0, F6           5
      ADDD  F6, F8, F2            6
   Load buffers:          Load1 No;  Load2 No;  Load3 No
   Reservation stations:  Add1  Yes (Time 1),  SUBD,  Vj = M(A1), Vk = M(A2);
                          Add2  Yes,           ADDD,  Vk = M(A2), Qj = Add1;
                          Mult1 Yes (Time 9),  MULTD, Vj = M(A2), Vk = R(F4);
                          Mult2 Yes,           DIVD,  Vk = M(A1), Qj = Mult1;
                          Add3 No
   Register result status (clock 6):  F0: Mult1;  F2: M(A2);  F6: Add2;  F8: Add1;
   F10: Mult2
   • Issue ADDD here despite the name dependency on F6? Yes: DIVD has already copied
     the old value of F6, M(A1), into its reservation station, so register renaming
     removes the WAR hazard.