This document summarizes a lecture on dynamic scheduling and the Tomasulo algorithm. It begins with an overview of dynamic scheduling and out-of-order execution. It then describes the Tomasulo algorithm used in IBM's 360/91 floating point unit, which introduced reservation stations, register renaming, and a common data bus to enable out-of-order execution while maintaining in-order retirement. Examples are provided to illustrate how the algorithm handles register dependencies like RAW, WAR, and WAW.
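The renaming idea described above can be sketched in a few lines. This is an illustrative model only (not the 360/91's actual tag-based hardware): every destination write is given a fresh physical register via a register alias table, which removes WAR and WAW hazards while preserving true RAW dependencies.

```python
# Minimal register-renaming sketch (illustrative, not the exact 360/91 scheme).
# Each destination write gets a fresh physical register, so WAR and WAW
# hazards between instructions disappear; only true RAW dependencies remain.

def rename(instructions):
    """instructions: list of (dest, src1, src2) architectural register names."""
    rat = {}          # register alias table: architectural -> physical
    next_phys = 0
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the *current* mapping (true RAW dependence is preserved).
        s1 = rat.get(src1, src1)
        s2 = rat.get(src2, src2)
        # Destination gets a brand-new physical register (kills WAR/WAW).
        p = f"p{next_phys}"
        next_phys += 1
        rat[dest] = p
        renamed.append((p, s1, s2))
    return renamed

# WAW on F0 and WAR on F2 in the architectural code:
prog = [("F0", "F2", "F4"),   # F0 <- F2 op F4
        ("F2", "F6", "F8"),   # writes F2 (WAR with instr 0)
        ("F0", "F2", "F4")]   # writes F0 again (WAW), reads renamed F2 (RAW)
print(rename(prog))
# -> [('p0', 'F2', 'F4'), ('p1', 'F6', 'F8'), ('p2', 'p1', 'F4')]
```

After renaming, only the RAW edge from the second instruction's write of F2 (now p1) to the third instruction's read remains, so the first two instructions can execute in either order.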
Lec12 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- P6, Netbur...Hsien-Hsin Sean Lee, Ph.D.
The document summarizes key aspects of the P6 microarchitecture used in processors like the Pentium Pro, Pentium II, and Pentium III. It describes the system architecture with separate front-side and back-side buses. It then details the instruction fetch, decode, register renaming, out-of-order execution, memory handling, and retirement stages of the processor pipeline. Diagrams illustrate the branch prediction, reservation stations, reorder buffer, and memory order buffer components that enable speculative and out-of-order execution in the P6.
Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...Hsien-Hsin Sean Lee, Ph.D.
This document discusses techniques for improving instruction fetch throughput in superscalar processors. It begins by explaining that fetch throughput defines the maximum performance and that superscalar processors need to supply more than one instruction per cycle. It then describes some challenges to high bandwidth instruction fetching including misaligned instructions, changes in control flow, and memory latency/bandwidth limitations. The document proceeds to discuss specific techniques like aligned fetching, split cache line access, predication, collapsing buffers, trace caches, and issues related to indexing and redundancy in trace caches.
This chapter discusses different instruction set architecture (ISA) designs including CISC, RISC, VLIW, and EPIC. CISC ISAs like x86 have complex, variable-length instructions and rely on microcode to simplify compilation, while RISC ISAs like MIPS have fixed-length, load-store instructions and require more compiler effort. Modern processors use RISC-like designs internally even if they support CISC ISAs. VLIW and EPIC ISAs rely on the compiler to schedule instructions across functional units at compile-time rather than using dynamic scheduling. While VLIW was popular for DSPs, most general-purpose processors today use superscalar out-of-order execution.
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Hsien-Hsin Sean Lee, Ph.D.
This document discusses DRAM and storage systems. It begins by describing the basic DRAM cell and how DRAM is organized into banks, rows, and columns. It then covers DRAM operation including refreshing and different DRAM standards. The document also discusses disk organization with platters, tracks, and sectors. It provides details on disk access times and reliability techniques like RAID levels 0 through 6 which use data mirroring, striping, and error correction codes.
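The parity-based RAID levels mentioned above rest on a simple XOR property: the parity block is the XOR of the data blocks, so any single lost block equals the XOR of the survivors. A hedged sketch (block layout simplified; real RAID-5 rotates parity across disks):

```python
# RAID-5-style parity sketch: parity = XOR of the data blocks, so any one
# lost block can be rebuilt by XOR-ing the remaining blocks with the parity.

def parity(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks on three disks
p = parity(data)                      # parity block on a fourth disk

# Disk 1 fails: rebuild its block from the remaining data plus parity.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
```

The same identity explains why RAID-5 survives exactly one disk failure: with two blocks missing, the XOR equation has two unknowns.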
This document discusses multithreading and multicore processors. It begins by explaining that instruction level parallelism is difficult to achieve for a single program, but that thread level parallelism exists when running multiple threads or programs simultaneously. It then covers different multithreading paradigms including coarse-grained and fine-grained multithreading as well as challenges with context switching. The document also discusses techniques for multicore processors including cache sharing and instruction fetching policies. It provides examples of commercial multicore chips and research prototypes.
Lec5 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Branch Pred...Hsien-Hsin Sean Lee, Ph.D.
This document discusses branch prediction in computer architecture. It begins by explaining what information is predicted for branches - the direction and target. It then categorizes different types of branches and discusses the costs of branch misprediction. Various branch prediction techniques are presented, starting with simple 1-bit and 2-bit predictors, and progressing to more advanced correlating and global history predictors. The goal of branch prediction is to reduce penalties from mispredicted branches by speculatively executing the predicted path.
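The 2-bit predictor mentioned above can be modeled directly: a saturating counter with four states, where the two "strong" states require two consecutive mispredictions to flip the prediction. A minimal sketch (single branch, no history or tables):

```python
# Minimal 2-bit saturating-counter branch predictor.
# States 0,1 predict not-taken; states 2,3 predict taken. A single
# misprediction cannot flip a strong state, which is the whole point.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # start at strongly not-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, True, False, True, True]  # e.g. a loop exit mid-run
results = []
for t in outcomes:
    results.append(p.predict() == t)
    p.update(t)
print(results)
# -> [False, False, True, False, True, True]
```

Note the recovery after the single not-taken outcome: the counter drops from 3 to 2 but still predicts taken, so only one misprediction is paid instead of two (as a 1-bit predictor would).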
This document discusses pipelining in computer architecture through a five-stage pipelined datapath example. It includes:
1) An overview of the five stages - instruction fetch, instruction decode, execute, memory, and writeback;
2) An example of how the lw (load word) instruction progresses through each stage;
3) How pipeline control signals are added to coordinate execution across stages.
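The throughput benefit of the five-stage pipeline described above follows from simple arithmetic: with no stalls, n instructions finish in k + (n - 1) cycles on a k-stage pipeline, because the first instruction takes k cycles and then one retires per cycle. A small illustration:

```python
# Ideal k-stage pipeline timing: the first instruction takes k cycles to
# drain through all stages, then one instruction completes per cycle.

def pipeline_cycles(n_instructions, stages=5):
    return stages + (n_instructions - 1)

print(pipeline_cycles(1))     # 5   (a single lw walks through all 5 stages)
print(pipeline_cycles(100))   # 104, versus 500 cycles unpipelined
```

Real pipelines fall short of this ideal exactly when the hazards and stalls discussed in these lectures intervene.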
Lec18 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- In...Hsien-Hsin Sean Lee, Ph.D.
This document provides an overview of instruction set architecture (ISA). It begins by breaking down a computing problem into levels including the algorithm, programming, compiler/assembler, system architecture, and machine implementation levels. It then defines ISA as the interface between hardware and low-level software, describing it as an abstraction that is independent of a specific implementation. The document outlines ISA design principles and provides examples of RISC and CISC instruction formats and encodings. It also gives examples of commercial ISAs like x86 and ARM and provides context on the philosophy between RISC and CISC designs.
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...Hsien-Hsin Sean Lee, Ph.D.
The document summarizes memory and programmable logic. It describes different types of memory including RAM, ROM, SRAM and DRAM. It explains how memory is addressed and organized hierarchically using decoders. The document also covers concepts like endianness and memory allocation. Finally, it introduces different types of programmable logic devices including PLA, PAL and ROM and provides examples of their usage and programming.
Static scheduling machines like VLIW move instruction scheduling from hardware to compiler by packing multiple operations into single instructions. This simplifies hardware but requires compilers to explicitly represent dependencies. The Intel Itanium ISA follows the VLIW philosophy, using instruction bundles containing up to 3 operations. Itanium 2 improved on the original design with an 8-stage pipeline and register renaming to boost performance. Itanium supports both control and data speculation via techniques like speculative loads and the Advanced Load Address Table.
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...Hsien-Hsin Sean Lee, Ph.D.
The document discusses program control in computer engineering. It describes how a program counter (PC) controls instruction execution by incrementing to the next instruction address after each execution. The PC can be modified to change the normal sequential execution flow through mechanisms like jumps, branches, and subroutine calls and returns. Subroutines use a stack to save registers and pass arguments. Recursive subroutines additionally push the calling address and argument onto the stack to preserve their values across recursive activations.
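The stack discipline described above is what makes recursion work: each activation pushes its own argument (and return point), so values survive across nested calls. A toy model with an explicit stack in place of hardware call/return:

```python
# Toy model of subroutine linkage: pushing an argument per "call" and
# popping per "return" gives each recursive activation its own storage.

def factorial_with_explicit_stack(n):
    stack = []            # each entry stands in for one activation record
    result = 1
    while n > 1:
        stack.append(n)   # "push argument" before the recursive call
        n -= 1
    while stack:
        result *= stack.pop()  # "return": pop and resume the caller
    return result

print(factorial_with_explicit_stack(5))  # 120
```

The last-pushed value is popped first, matching the innermost call returning first.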
This lecture discusses instruction-level parallelism (ILP) in computer architecture. ILP refers to executing multiple instructions simultaneously by exploiting parallelism between independent instructions. The lecture covers different techniques for improving ILP, including deeper pipelining, superscalar execution, exploiting basic block parallelism, control and data speculation, register renaming, and dynamic memory disambiguation. It discusses factors that limit ILP such as dependencies, limited instruction windows, and control flow. The goal of ILP techniques is to issue and complete as many instructions as possible during each clock cycle to improve processor throughput and performance.
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...Hsien-Hsin Sean Lee, Ph.D.
The document discusses dynamic scheduling in modern out-of-order processors. It describes how register renaming is used to avoid false dependencies and allow instructions to execute out-of-order. The reorder buffer (ROB) is used to support precise interrupts by buffering instruction results and allowing the processor state to be reconstructed sequentially. The ROB also enables precise recovery of the architectural state when speculative execution goes down a mispredicted branch path.
Lec10 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part2Hsien-Hsin Sean Lee, Ph.D.
This document summarizes a lecture on advanced computer architecture and memory hierarchy design techniques. It discusses cache penalty reduction techniques like victim caches and prefetching. It also covers virtual memory and address translation using page tables and translation lookaside buffers. Different cache organizations are described, including virtually and physically indexed caches. Solutions to problems like aliases and synonyms in virtual caches are also summarized.
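The address-translation mechanics summarized above reduce to splitting a virtual address into a page number and an offset. A hedged sketch with 4 KiB pages, using a dict as a stand-in for the real page-table structure (a TLB would simply cache these VPN-to-frame entries):

```python
# Virtual-to-physical translation sketch with 4 KiB pages: the low 12 bits
# are the page offset, the rest is the virtual page number (VPN).

PAGE_SIZE = 4096

def translate(vaddr, page_table):
    vpn = vaddr // PAGE_SIZE
    offset = vaddr % PAGE_SIZE
    frame = page_table[vpn]          # a miss here is a page fault in reality
    return frame * PAGE_SIZE + offset

page_table = {0: 7, 1: 3}            # VPN 0 -> frame 7, VPN 1 -> frame 3
print(hex(translate(0x1ABC, page_table)))   # VPN 1, offset 0xABC -> 0x3abc
```

Note that the offset passes through unchanged; only the page number is remapped, which is why page size fixes the index/offset split in virtually indexed caches.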
The document summarizes key points about the 8086 microprocessor architecture and assembly language programming. It discusses the 8086 architecture including its registers, data path, and parallel execution units. It also provides examples of assembly language programs using loops and addressing modes. Finally, it outlines the topics to be covered in the next class, including a summary of 8085/8086/i386 architectures and assembly programming basics, as well as an introduction to device interfacing.
The document discusses fundamentals of assembly language including instruction execution and addressing, directives, procedures, data types, arithmetic instructions, and a practice problem. It explains that instructions are translated to object code while directives control assembly but generate no machine code. Common directives include TITLE, STACK, DATA, CODE, and PROC. Data can be defined using directives like DB, DW, DD, DQ. Instructions like ADD, SUB, MUL, DIV perform arithmetic calculations on registers and memory.
Pragmatic Optimization in Modern Programming - Demystifying the CompilerMarina Kolpakova
This document discusses compiler optimizations. It begins with an outline of topics including compilation trajectory, intermediate languages, optimization levels, and optimization techniques. It then provides more details on each phase of compilation, how compilers use intermediate representations to perform optimizations, and specific optimizations like common subexpression elimination, constant propagation, and instruction scheduling.
Pragmatic Optimization in Modern Programming - Mastering Compiler OptimizationsMarina Kolpakova
This document discusses various compiler optimizations including constant folding, hoisting loop invariant code, scalarization, loop unswitching, peeling and sentinels, strength reduction, loop induction variable elimination, and auto-vectorization. It provides code examples and the generated assembly for each optimization. It explains that many optimizations are performed by compilers automatically at high optimization levels, while some more advanced optimizations like loop peeling and sentinels require manual intervention.
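Strength reduction combined with induction-variable elimination, two of the optimizations listed above, can be shown by hand. Compilers apply this automatically at higher optimization levels; the Python below is only a before/after illustration of the transformation, not how a compiler represents it:

```python
# Strength reduction: the per-iteration multiply i*stride is replaced by a
# running addition on an induction variable. Same result, cheaper operation.

def sum_strided_naive(a, stride, n):
    total = 0
    for i in range(n):
        total += a[i * stride]      # multiply on every iteration
    return total

def sum_strided_reduced(a, stride, n):
    total, idx = 0, 0
    for _ in range(n):
        total += a[idx]
        idx += stride               # strength-reduced: add instead of multiply
    return total

a = list(range(100))
assert sum_strided_naive(a, 3, 10) == sum_strided_reduced(a, 3, 10)
```

In generated assembly the payoff is an `add` replacing an `imul`/address-scaling computation in the loop body.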
The document discusses the ARM instruction set. It begins by defining the instruction set and describing the translation flow through the compiler, assembly, and object-code stages. It then describes various types of instructions like data processing, data transfer, and control flow instructions. The rest of the document provides details on ARM characteristics, registers, conditional execution, addressing modes, and examples of instructions for common operations.
Pipelining allows the execution of multiple instructions simultaneously by dividing instruction processing into discrete stages. This improves processor throughput. Common pipeline stages include fetch, decode, execute, and writeback. Pipelining is used in modern CPUs but hazards like structural, data, and branch hazards can cause the pipeline to stall, degrading performance. Coprocessors are secondary processors that supplement the main CPU by accelerating tasks like floating-point arithmetic, graphics processing, and encryption. The Intel 8087 was an early numeric coprocessor that offloaded floating-point operations from the 8086/8088 CPU.
Code GPU with CUDA - Device code optimization principleMarina Kolpakova
This document discusses optimization principles for GPU device code. The key principle is hiding latency by maximizing parallelism through high thread-level parallelism (TLP) and instruction-level parallelism (ILP). TLP can be improved through high occupancy, which means running many threads concurrently. ILP can be improved through techniques like kernel unrolling and loop unrolling to increase independent instructions per warp. Together, high TLP and ILP are needed to keep the GPU pipeline fully utilized and latency well hidden.
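The latency-hiding requirement above is Little's law in disguise: the number of in-flight operations must be at least latency times issue rate, and that parallelism can come from more warps (TLP) or more independent instructions per warp (ILP). A back-of-envelope calculator with illustrative numbers (not tied to any specific GPU generation):

```python
# Little's law for latency hiding: in-flight ops >= latency x issue rate.
# The required warp count drops as per-warp ILP rises.

def warps_needed(latency_cycles, issue_per_cycle, ilp_per_warp):
    in_flight_needed = latency_cycles * issue_per_cycle
    return -(-in_flight_needed // ilp_per_warp)   # ceiling division

# Assumed 20-cycle arithmetic latency, 1 instruction issued per cycle:
print(warps_needed(20, 1, 1))   # 20 warps needed with 1 independent op each
print(warps_needed(20, 1, 4))   # 5 warps suffice if unrolling gives ILP of 4
```

This is why the document treats occupancy and unrolling as interchangeable levers: both raise the in-flight count.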
Pragmatic optimization in modern programming - modern computer architecture c...Marina Kolpakova
There are three key aspects of computer architecture: instruction set architecture, microarchitecture, and hardware design. Modern architectures aim to either hide latency or maximize throughput. Reduced instruction set computers (RISC) became popular due to simpler decoding and pipelining allowing higher clock speeds. While complex instruction set computers (CISC) focused on code density, RISC architectures are now dominant due to their efficiency. Very long instruction word (VLIW) and vector processors targeted specialized workloads but their concepts influence modern designs. Load-store RISC architectures with fixed-width instructions and minimal addressing modes provide an optimal balance between performance and efficiency.
This talk is all about the Berkeley Packet Filters (BPF) and their uses in Linux.
Agenda:
* What is a BPF and why do we need it?
* Writing custom BPFs
* Notes on BPF implementation in the kernel
* Usage examples: SOCKET_FILTER & seccomp
Speaker:
Kfir Gollan, senior embedded software developer, Linux kernel hacker and software team leader.
The document discusses assembly language and its relationship to computer architecture and programming. It covers the different views of computer design including the programmer's view through instruction set architecture and the logic designer's view through machine organization. It also summarizes how a high-level language program is converted into executable files through compilation, assembly, and linking.
Jennifer Rexford
Professor
Princeton University
Plenaries Session
ONS2015: http://bit.ly/ons2015sd
ONS Inspire! Webinars: http://bit.ly/oiw-sd
Watch the talk (video) on ONS Content Archives: http://bit.ly/ons-archives-sd
This document summarizes key aspects of GPU hardware and the SIMT (Single Instruction Multiple Thread) architecture used in NVIDIA GPUs. It describes the evolution of NVIDIA GPU hardware, the differences between latency-oriented CPUs and throughput-oriented GPUs, how SIMT combines SIMD and threading, warp scheduling, divergence and convergence, predicated and conditional execution.
This document discusses BPF (Berkeley Packet Filter), a mechanism for filtering network packets on Linux. BPF allows defining filters using an instruction set that is executed against packets to determine whether to accept or drop them. The document provides an overview of how BPF works, demonstrating simple BPF filters, and discusses using BPF for packet filtering and other applications like seccomp.
GPU memory includes on-chip (registers, shared memory) and off-chip (global, constant, texture, local) memory. Global memory access from a warp is most efficient when coalesced to fit a cache line. The memory hierarchy includes L1 and L2 caches with different characteristics like size and latency. Memory requests travel through this hierarchy, with coalesced requests requiring fewer cache lines. Special memory types like constant and texture memory are optimized for uniform access patterns.
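The coalescing effect described above is easy to quantify: count the distinct cache lines a warp's addresses fall into. A small sketch assuming 32 threads, 4-byte loads, and 128-byte lines:

```python
# Cache lines touched by one warp of 32 threads doing 4-byte loads.
# Unit-stride (coalesced) access fits one 128-byte line; a 128-byte stride
# touches a separate line per thread and wastes 97% of fetched bandwidth.

LINE = 128

def lines_touched(addresses):
    return len({a // LINE for a in addresses})

coalesced = [tid * 4 for tid in range(32)]      # thread i reads word i
strided   = [tid * 128 for tid in range(32)]    # one full line per thread
print(lines_touched(coalesced))   # 1
print(lines_touched(strided))     # 32
```

The memory system must move 32x the data for the strided pattern to deliver the same 128 useful bytes.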
Lec20 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Da...Hsien-Hsin Sean Lee, Ph.D.
The document describes the datapath and microcode control of a simple processor. It includes the following:
- The datapath components including register file, ALU, logical/shift units, and memory.
- How the datapath is controlled by microcode stored in a memory. Each instruction is mapped to a sequence of microinstructions that generate control signals.
- Examples of microcode control sequences that perform operations like memory load/store, arithmetic, and copying data between memory locations.
Lec6 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Can...Hsien-Hsin Sean Lee, Ph.D.
This document discusses canonical (standard) forms for Boolean functions, including sum-of-products (SOP) and product-of-sums (POS) forms. It defines key concepts like Boolean variables, literals, product/sum terms, minterms/maxterms. It also provides examples of converting between a Boolean expression and its canonical SOP and POS forms, and how the canonical SOP and POS forms are complements of each other.
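The canonical SOP construction above can be mechanized: enumerate every input row, and each row where the function is 1 contributes one minterm containing every literal. A sketch for a hypothetical three-variable function f(x, y, z) = x OR (y AND NOT z):

```python
# Build the canonical sum-of-products form by enumerating minterms:
# every truth-table row with f = 1 yields one product of all the literals
# (primed when the variable is 0 in that row).

from itertools import product

def minterms(f, names):
    terms = []
    for bits in product([0, 1], repeat=len(names)):
        if f(*bits):
            lits = [n if b else n + "'" for n, b in zip(names, bits)]
            terms.append("".join(lits))
    return terms

f = lambda x, y, z: x or (y and not z)
print(" + ".join(minterms(f, ["x", "y", "z"])))
# -> x'yz' + xy'z' + xy'z + xyz' + xyz
```

The rows where f = 0 would, by the complement relationship the document describes, give the maxterms of the canonical POS form.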
Lec10 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Mu...Hsien-Hsin Sean Lee, Ph.D.
This document provides a summary of a lecture on building blocks for combinational logic, including timing diagrams, multiplexers, and demultiplexers. Timing diagrams are used to describe the functionality of logic circuits over time as a series of waveforms. Multiplexers are used to select one of several input signals to pass to the output based on selection bits. Demultiplexers are the inverse, distributing an input signal to one of several outputs based on the selection bits. Transmission gates can be used to implement multiplexers with enable inputs to disable the output.
Lec8 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Qui...Hsien-Hsin Sean Lee, Ph.D.
The Quine-McCluskey method is a systematic technique for minimizing Boolean functions expressed in sum-of-products form. It involves (1) finding all prime implicants of the function, (2) constructing a prime implicant table with prime implicants as columns and minterms as rows, and (3) selecting the minimum set of prime implicants that cover all minterms to obtain the minimal SOP expression. The method can be applied to Boolean functions with any number of inputs to derive optimized logic circuit designs.
Lec7 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Kar...Hsien-Hsin Sean Lee, Ph.D.
This document provides an overview of Karnaugh maps and their use in simplifying Boolean functions. It defines key concepts like Hamming distance, unit-distance codes, Gray codes, implicants, prime implicants, essential and non-essential prime implicants. Examples are given to show how to identify implicants on a K-map and use them to find a minimum sum-of-products or product-of-sums expression. The use of don't care conditions to simplify expressions is also demonstrated. Finally, it discusses extending K-maps to multiple variables using stacked or composite maps.
Lec2 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Num...Hsien-Hsin Sean Lee, Ph.D.
This document discusses number systems and binary arithmetic. It begins by explaining decimal and binary number representation, including place value and how to derive numbers in different bases. It then covers counting in binary, octal, and base-22 systems. Next, it discusses representing negative numbers using sign-magnitude, one's complement, and two's complement methods. Finally, it demonstrates binary addition and computation for both unsigned and signed numbers using two's complement.
Lec14 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Se...Hsien-Hsin Sean Lee, Ph.D.
This document discusses sequential logic circuits, which differ from combinational logic circuits in that they have state information stored in memory elements like flip-flops. The output of a sequential circuit depends on both the inputs and the present state. Common types of sequential elements discussed include SR latches, D latches, and edge-triggered flip-flops. Master-slave configurations are used to avoid problems with transparency in latches. Dual-phase non-overlapping clocks are introduced to control the transfer of data between latches in a flip-flop.
Lec1 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- IntroHsien-Hsin Sean Lee, Ph.D.
This document provides an overview of ECE2030 Introduction to Computer Engineering course. It introduces the instructor, Prof. Hsien-Hsin Sean Lee, and lists course details such as materials, assignments, exams, grading policy. It outlines the course objectives which include digital design principles like number systems, Boolean algebra, combinational and sequential logic. The focus will be on levels below system architecture like microarchitecture, gates and transistor levels.
This document discusses computer architecture performance, including metrics like execution time, throughput, and instructions per cycle (IPC). It provides examples of calculating the cycles per instruction (CPI) for different instruction types and evaluating potential design changes based on their impact on CPI and overall performance. The principles of locality and Amdahl's Law, which states that speedups from parallelism are limited by the serial fraction of a program, are also covered.
Lec9 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Com...Hsien-Hsin Sean Lee, Ph.D.
This document discusses combinational logic and mixed logic design. It begins by defining combinational circuits as those whose outputs are determined immediately by the current input combination, without any internal storage. The document then presents an example of designing a 9-input odd function hierarchically using 3-input odd functions. Mixed logic design is introduced to allow implementing combinational logic using only NAND gates, only NOR gates, or both. DeMorgan's laws are used to convert gate types by adding or removing bubbles on the inputs. Several examples showcase designing logic circuits using only NAND gates or only NOR gates.
Lec3 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- CMO...Hsien-Hsin Sean Lee, Ph.D.
1) The document discusses switches and CMOS transistors. It describes how a basic switch works and analogizes it to a transistor.
2) It then covers transistor characteristics like cut-off, linear, and saturation regions. Threshold voltage and factors affecting it are also discussed.
3) Switch logic is examined for switches in series (AND function) and parallel (OR function).
4) CMOS transistors are introduced, including nMOS and pMOS types. A transmission gate using both is described for transmitting signals without degradation.
Lec13 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Sh...Hsien-Hsin Sean Lee, Ph.D.
This lecture discusses building blocks for combinational logic, specifically shifters and multipliers. It introduces basic shifting operations including left/right shifts and logical/arithmetic shifts. It then describes implementations of 4-bit logical and arithmetic shifters using multiplexers. Barrel shifters that can shift multiple bits simultaneously are also covered. For multiplication, it explains unsigned and signed binary multiplication through examples. Implementations of 2-bit, 4-bit and carry-save multipliers are shown using full adders as building blocks.
Lec16 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Fi...Hsien-Hsin Sean Lee, Ph.D.
This document discusses finite state machines (FSMs) including Mealy and Moore machines. It provides examples of state diagrams for Mealy and Moore machines and describes how to design the logic circuits for an FSM from its state table. Key steps include generating Boolean functions for outputs and next states, simplifying the functions, and creating logic gates for the outputs, next state logic, and current state registers. An example of a vending machine FSM is also presented with its state diagram and logic circuit design.
Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...Hsien-Hsin Sean Lee, Ph.D.
This document summarizes key concepts in combinational logic building blocks including adders, subtractors, and parity checkers. It describes half adders, full adders, ripple carry adders, carry lookahead adders, subtraction using 2's complement, and even parity generation and detection. The document discusses issues like carry propagation delay in ripple carry adders and improved delay in carry lookahead adders. It also covers overflow/underflow detection in signed arithmetic and examples of parity error detection.
Lec4 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- CMOSHsien-Hsin Sean Lee, Ph.D.
1. The document describes CMOS inverters and how to construct CMOS networks for basic logic gates like NAND, NOR, and XOR from pull-up and pull-down networks.
2. It provides a systematic method for drawing the CMOS network from a Boolean equation by first constructing either the pull-up network or pull-down network based on the equation.
3. Examples are given to demonstrate how to apply the method to draw CMOS networks for equations with multiple variables like XOR, XNOR, and complex equations with nested terms.
Lec15 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Re...Hsien-Hsin Sean Lee, Ph.D.
This document discusses different types of digital logic components including registers, toggle flip-flops, and counters. It describes how registers can be constructed from cascaded flip-flops and how they can be read from and written to. Toggle flip-flops are introduced which toggle their output each clock cycle depending on enable signals. Finally, several types of counters are overviewed such as ripple counters, synchronous counters, modulo-N counters, and BCD counters.
Lec11 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- De...Hsien-Hsin Sean Lee, Ph.D.
This document discusses different types of decoders and encoders used in digital logic circuits. It begins by explaining 1-to-2 line, N-to-M line, and 2-to-4 line decoders. It then describes decoders with enables and how to implement logic functions using decoders. The document also covers BCD-to-7 segment decoders and designing the outputs individually. Finally, it summarizes M-to-N line encoders and provides examples of 4-to-2 and 8-to-3 encoders as well as priority encoders.
Lec5 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Boo...Hsien-Hsin Sean Lee, Ph.D.
This document summarizes key concepts from a lecture on Boolean algebra:
- Boolean algebra uses binary variables and logic operations like OR, AND, XOR to represent Boolean functions as equations or truth tables.
- Fundamental operators in Boolean algebra include NOT, OR, AND.
- Boolean operations can be represented as truth tables showing all combinations of 1s and 0s.
- Order of operations and identities like DeMorgan's law allow simplifying Boolean expressions and functions.
- The duality principle states that swapping 1s and 0s and AND/OR preserves validity of Boolean equations.
Lec9 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part 1Hsien-Hsin Sean Lee, Ph.D.
This document summarizes key topics about memory hierarchy design, including:
1) It explains why caches work due to the principle of locality, where programs tend to access a small portion of memory addresses at a time, exhibiting both temporal and spatial locality.
2) It describes the different levels of a memory hierarchy from fastest/smallest (registers and caches) to slowest/largest (main memory and disk). Caches aim to bridge the speed gap between fast processors and slower main memory.
3) The performance of a memory hierarchy is characterized by average memory access time (AMAT), which is affected by hit time, miss rate, and miss penalty. Multi-level caches can reduce miss penalties and improve A
This document discusses designing combinational logic circuits using static complementary CMOS design. It explains how to construct static CMOS circuits for logic gates like NAND and NOR by using pull-up and pull-down networks of PMOS and NMOS transistors respectively. Issues related to pass-transistor design like noise margins and static power consumption are also covered. The document provides details on implementing various logic functions using pass-transistor logic and differential pass-transistor logic. It discusses solutions to overcome the disadvantages of pass-transistor logic like level restoration and use of multiple threshold transistors.
Oracle Architecture document discusses:
1. The cost of an Oracle Enterprise Edition license is $47,500 per processor.
2. It provides an overview of key Oracle components like the instance, database, listener and cost based optimizer.
3. It demonstrates how to start an Oracle instance, check active processes, mount and open a database, and query it locally and remotely after starting the listener.
This document discusses pipelining the MIPS architecture to improve its performance. It begins with a review of MIPS instruction formats and the single-cycle datapath. It then introduces pipeline registers to break the execution of each instruction into multiple stages, including fetch, decode/register fetch, execute, memory, and write-back stages. This allows the clock period to be reduced while maintaining a cycles per instruction of 1. Key aspects covered include avoiding resource conflicts between stages, handling different instruction types, and the "iron law" relating instructions, cycles and time in determining processor performance.
This document provides information on various debugging and profiling tools that can be used for Ruby including:
- lsof to list open files for a process
- strace to trace system calls and signals
- tcpdump to dump network traffic
- google perftools profiler for CPU profiling
- pprof to analyze profiling data
It also discusses how some of these tools have helped identify specific performance issues with Ruby like excessive calls to sigprocmask and memcpy calls slowing down EventMachine with threads.
Automate Oracle database patches and upgrades using Fleet Provisioning and Pa...Nelson Calero
Each new version of the Oracle database includes improvements in the upgrade and patching utilities, forcing us to update our procedures to incorporate these changes.
The Fleet Provisioning & Patching (FPP, formerly RHP) utility, together with the change in its licensing announced at OOW 2019 that makes it free in RAC, now makes it possible to centrally manage the software life cycle.
This presentation shows examples of how to use FPP and different configuration options.
The document provides an overview of the ARM architecture, including:
- ARM was founded in 1990 and licenses its processor core intellectual property to design partners.
- The ARM instruction set includes 32-bit ARM and 16-bit Thumb instructions. ARM supports different processor modes like user mode, IRQ mode, and FIQ mode.
- Popular ARM processors include ARM7 and Cortex-M series. ARM licenses its IP to semiconductor companies who integrate the cores into various end products.
The document describes how to debug a kernel crash by recording the full kernel panic text using techniques like configuring a serial console, using the netconsole kernel feature, or manually dumping memory on a virtual machine. It also explains how to use the crash analysis tool to examine the crash dump, including getting a backtrace, disassembling instructions, and viewing the kernel log.
This document discusses using static tracing and dynamic probes to debug DPDK applications. It provides an overview of tools like DPDK-PROCINFO, DPDK-PDUMP, LTTNG, and user probes. Screenshots demonstrate using eBPF binaries with DPDK to enable dynamic debugging of applications in production environments where other debugging techniques are not possible. Future areas to explore include developing user probes similar to dynamic tracing and integrating eBPF event data with tools like VTune.
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
BPF (Berkeley Packet Filter) has evolved from a limited virtual machine for efficient packet filtering to a new type of software called extended BPF. Extended BPF allows for custom, efficient, and production-safe performance analysis tools and observability programs to be run in the Linux kernel through BPF. It enables new event-based applications running as BPF programs attached to various kernel events like kprobes, uprobes, tracepoints, sockets, and more. Major companies like Facebook, Google, and Netflix are using BPF programs for tasks like intrusion detection, container security, firewalling, and observability with over 150,000 AWS instances running BPF programs. BPF provides a new program model and security features compared
This document discusses the crash reporting mechanism in Tizen. It describes the crash client, which handles crash signals and generates crash reports. It covers Samsung's crash-work-sdk and Intel's corewatcher crash clients. It also discusses the crash server that receives reports and the CrashDB web interface. Finally, it mentions crash reason location algorithms.
Linux 4.x Tracing: Performance Analysis with bcc/BPFBrendan Gregg
Talk about bcc/eBPF for SCALE15x (2017) by Brendan Gregg. "BPF (Berkeley Packet Filter) has been enhanced in the Linux 4.x series and now powers a large collection of performance analysis and observability tools ready for you to use, included in the bcc (BPF Complier Collection) open source project. BPF nowadays can do system tracing, software defined networks, and kernel fast path: much more than just filtering packets! This talk will focus on the bcc/BPF tools for performance analysis, which make use of other built in Linux capabilities: dynamic tracing (kprobes and uprobes) and static tracing (tracepoints and USDT). There are now bcc tools for measuring latency distributions for file system I/O and run queue latency, printing details of storage device I/O and TCP retransmits, investigating blocked stack traces and memory leaks, and a whole lot more. These lead to performance wins large and small, especially when instrumenting areas that previously had zero visibility. Tracing superpowers have finally arrived, built in to Linux."
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
Talk by Brendan Gregg for OSSNA 2017. "Advanced performance observability and debugging have arrived built into the Linux 4.x series, thanks to enhancements to Berkeley Packet Filter (BPF, or eBPF) and the repurposing of its sandboxed virtual machine to provide programmatic capabilities to system tracing. Netflix has been investigating its use for new observability tools, monitoring, security uses, and more. This talk will be a dive deep on these new tracing, observability, and debugging capabilities, which sooner or later will be available to everyone who uses Linux. Whether you’re doing analysis over an ssh session, or via a monitoring GUI, BPF can be used to provide an efficient, custom, and deep level of detail into system and application performance.
This talk will also demonstrate the new open source tools that have been developed, which make use of kernel- and user-level dynamic tracing (kprobes and uprobes), and kernel- and user-level static tracing (tracepoints). These tools provide new insights for file system and storage performance, CPU scheduler performance, TCP performance, and a whole lot more. This is a major turning point for Linux systems engineering, as custom advanced performance instrumentation can be used safely in production environments, powering a new generation of tools and visualizations."
This document provides an introduction to the ARM-7 microprocessor architecture. It describes key features of the ARM7TDMI including its 32-bit RISC instruction set, 3-stage pipeline, 37 registers including separate registers for different processor modes, and low power consumption. The document also compares RISC and CISC architectures and summarizes the different versions of the ARM architecture.
The document discusses exploring the x64 architecture, covering topics such as the x64 application binary interface, memory layout differences between x86 and x64, API hooking and code injection techniques for x64, and differences in system calls between x86 and x64. It provides an overview of key technical details and concepts for developers working with x64 platforms.
The document provides an overview of the 8085 microprocessor architecture and its assembly language programming. It discusses the 8085 block diagram, instruction set, sample assembly programs, use of counters and delays, stack and subroutines. It then introduces the 8086 and x86 architectures. The next class will cover the 8086 architecture in more detail, advanced 32-bit architectures, the x86 programming model, 8086 assembly language programming, and x86 assemblers.
The document provides an overview of the 8085 microprocessor architecture and its assembly language programming. It discusses the 8085 block diagram, instruction set, sample assembly programs, use of counters and delays, stack and subroutines. It then introduces the 8086 and x86 architectures. The next class will cover the 8086 architecture in more detail, advanced 32-bit architectures, the x86 programming model, 8086 assembly language programming, and x86 assemblers.
This document provides an overview of segment routing and how it can enable SDN 2.0. Segment routing uses source routing by encoding a path as a segment list in packet headers. It provides a simple stateless forwarding model and enables edge intelligence with the ability for controllers to program segment lists. Use cases discussed include traffic engineering, fast reroute, and integration with SDN controllers using protocols like PCEP to establish segment routing paths without signaling in the core network.
This document discusses porting, scaling, and optimizing applications on Cray XT systems. It covers topics such as choosing compilers, profiling and debugging applications at scale, understanding CPU affinity, and improvements in the Cray Message Passing Toolkit (MPT). The document provides guidance on leveraging different compilers, collecting performance data using hardware counters and CrayPAT, understanding MPI process binding, and enhancements in MPT 4.0 related to MPI standards support and communication optimizations.
Here are some useful GDB commands for debugging:
- break <function> - Set a breakpoint at a function
- break <file:line> - Set a breakpoint at a line in a file
- run - Start program execution
- next/n - Step over to next line, stepping over function calls
- step/s - Step into function calls
- finish - Step out of current function
- print/p <variable> - Print value of a variable
- backtrace/bt - Print the call stack
- info breakpoints/ib - List breakpoints
- delete <breakpoint#> - Delete a breakpoint
- layout src - Switch layout to source code view
- layout asm - Switch layout
Similar to Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1 (20)
"IOS 18 CONTROL CENTRE REVAMP STREAMLINED IPHONE SHUTDOWN MADE EASIER"Emmanuel Onwumere
In iOS 18, Apple has introduced a significant revamp to the Control Centre, making it more intuitive and user-friendly. One of the standout features is a quicker and more accessible way to shut down your iPhone. This enhancement aims to streamline the user experience, allowing for faster access to essential functions. Discover how iOS 18's redesigned Control Centre can simplify your daily interactions with your iPhone, bringing convenience right at your fingertips.
"IOS 18 CONTROL CENTRE REVAMP STREAMLINED IPHONE SHUTDOWN MADE EASIER"
Lec7 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Scheduling part 1
1. ECE 4100/6100
Advanced Computer Architecture
Lecture 7 Dynamic Scheduling (I)
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
3. Data Flow Execution Model
• To exploit maximal ILP
• An instruction can be executed immediately after
– All source operands are ready
– Execution unit available
– Destination is ready (to be written)
4. Dynamic Scheduling
• Exploit ILP at run-time
• Execute instructions out-of-order by a restricted data flow
execution model (still use PC!)
• Hardware will
– Maintain true dependency (data flow manner)
– Maintain exception behavior
– Find ILP within an Instruction Window (pool)
• Need an accurate branch predictor
• Pros
– Scalable performance: allows code to be compiled on one platform,
but also run efficiently on another
– Handle cases where dependency is unknown at compile-time
• Cons
– Hardware complexity (main argument from the VLIW/EPIC camp)
6. OOO Execution
• OOO execution ≡ out-of-order completion
• OOO execution ≠ out-of-order retirement (commit)
• No (speculative) instruction allowed to retire until it
is confirmed on the right path
• Fetch, decode, issue (i.e., front-end) are still done in
the program order
7. CDC 6600 Scoreboard Algorithm
• Enable OOO Execution to address long-latency FP
instructions
• Use scoreboard tables to track
– Functional unit status
– Register update status
• Issue and execute instructions whenever
– No structural hazard
– No data hazard
• Cons
– Stop issue when WAW is detected
– Stop writeback when WAR is detected
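The scoreboard's issue rules can be sketched in a few lines. This is a simplified model, not the actual CDC 6600 control logic; the function name, table shapes, and sample state are all illustrative.

```python
def can_issue(fu, dest, fu_status, reg_result):
    """fu_status: {fu: busy?}; reg_result: {reg: fu that will write it}."""
    if fu_status.get(fu, False):
        return False  # structural hazard: the functional unit is busy
    if dest in reg_result:
        return False  # WAW: another unit already owns this destination
    return True

# Hypothetical machine state for illustration.
fu_status = {"Mult1": True, "Add": False}
reg_result = {"F4": "Mult1"}            # Mult1 has a pending write to F4

print(can_issue("Add", "F8", fu_status, reg_result))    # True
print(can_issue("Add", "F4", fu_status, reg_result))    # False (WAW stall)
print(can_issue("Mult1", "F2", fu_status, reg_result))  # False (FU busy)
```

Note that the scoreboard stalls the *whole issue stage* on these conditions — the WAW limitation that Tomasulo's renaming later removes.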
8. CDC6600 Scoreboard
[Diagram: the functional units (Integer, two FP Mults, FP Divide, FP Add) exchange operands with the registers and memory over data buses; the SCOREBOARD controls all units and registers over a control/status bus.]

FU Status Table:
  FU     Busy  Op    Dest  Src1  Src2  Dep1   Dep2
  Int    1     Load  F1    R3    --    --     --
  Mult1  1     Mult  F0    F1    F4    Int    --
  Mult2  0     --    --    --    --    --     --
  Add    1     Sub   F8    F6    F1    --     Int
  Div    1     Div   F2    F0    F6    Mult1  --

Register Update Table:
  Reg:  F0     F1   F2   ..  ..  ..  F31
  FU:   Mult1  Int  Div  ..  ..  ..  xxx
9. IBM 360
• IBM 360 introduced
– 8-bit = 1 byte
– 32-bit = 1 word
– Byte-addressable memory
– Differentiate an “architecture” from an “implementation”
• IBM 360/91 FPU about 3 years after CDC 6600 (1966-7)
• Tomasulo algorithm
– Dynamic scheduling
– Register renaming
10. Tomasulo Algorithm
• Goal: High Performance without special compilers
– Dynamic scheduling done completely by HW
– Such machines are generally called “superscalar” processors, as opposed to “VLIW” or “EPIC”
• Differences between the IBM 360 and CDC 6600 ISAs
– IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
• Makes WAW and WAR much worse
– IBM has 4 FP registers vs. 8 in the CDC 6600
• With fewer architectural registers, the compiler has less room for good register allocation
– IBM has memory-to-register operations
• Why study it? It led to the Pentium Pro/II/III/4, Core, Alpha 21264, MIPS R10000, HP 8000, and PowerPC 604
11. IBM 360/91 FPU w/ Tomasulo Algorithm
• Goal: do not stall floating-point instructions despite their long latency
– Two functional units: FP Add and FP Mult/Div
– The 360/91 FPU is not pipelined
• Three new mechanisms
– Reservation Stations (RS)
• 3 in FP Add, 2 in FP Mult/Div
• The register name is discarded when an instruction issues to a reservation station
– Tags
• A 4-bit tag names one of the 11 possible sources (5 RSs + 6 FLBs for loads)
• A tag is written for each unavailable source operand, identifying the RS or FLB that will produce its value
• New tag assignment (renaming) eliminates false dependencies
– Common Data Bus (CDB), driven by
• 11 sources: 5 RS + 6 FLB
• 17 destinations: 2×5 RS + 3 SDB + 4 FLR
12. Basic Principles
• Do not rely on a centralized register file!
• RS fetches and buffers an operand as soon as it is
available via CDB
– Eliminating the need to get it from a register (No WAR)
– Data Flow execution model
• Pending instructions designate the RS that will
provide their input (renaming and maintain RAW)
• Due to in-order issue, the register status table
always keeps the latest write (No WAW issue)
13. Key Representation
• Op   Operation to perform in the unit
• Vj   Value of source 1 (called SINK in the 360/91)
• Vk   Value of source 2 (called SOURCE in the 360/91)
• Qj   The RS (tag) that will produce source 1
• Qk   The RS (tag) that will produce source 2
• A    Holds info for memory address generation for a load or store
• Qi   The RS whose value should be stored into the register
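The fields above map naturally onto a small record type. This is a sketch for illustration, not the 360/91 hardware; `None` plays the role of the empty tag (operand value already available).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RSEntry:
    """One reservation-station entry with the fields named on the slide."""
    op: Optional[str] = None      # Op: operation to perform
    vj: Optional[float] = None    # Vj: value of source 1 (SINK)
    vk: Optional[float] = None    # Vk: value of source 2 (SOURCE)
    qj: Optional[int] = None      # Qj: RS tag that will produce source 1
    qk: Optional[int] = None      # Qk: RS tag that will produce source 2
    a: Optional[int] = None       # A: address info for a load/store
    busy: bool = False

    def ready(self) -> bool:
        # An entry may dispatch once both operands have been captured.
        return self.busy and self.qj is None and self.qk is None

add = RSEntry(op="ADD", vj=6.0, vk=10.0, busy=True)
waiting = RSEntry(op="ADD", vj=6.0, qk=1, busy=True)  # source 2 pending
print(add.ready(), waiting.ready())  # True False
```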
14. IBM 360/91 FPU w/ Tomasulo Algorithm
[Diagram: instructions flow from the FP operation stack (FLOS) and the 6 FP load buffers (FLB, filled from memory) into the reservation stations — 3 feeding the FP Adder and 2 feeding the FP Mult/Div unit. The FP registers (FLR) and the 3 store data buffers (SDB, draining to memory) sit alongside, and all results are broadcast on the Common Data Bus (CDB).]
15. IBM 360/91 FPU w/ Tomasulo Algorithm
[Same diagram, annotated with the tags and other info kept in each structure: every reservation station holds Tag (Qj), Sink (Vj), Tag (Qk), Source (Vk), and control bits; each FLR entry holds a tag (Qi) plus control; the FLB entries likewise carry tags and control.]
16. RAW Example:  i: R2 ← R0 + R4 (2 clks)
                  j: R8 ← R0 + R2 (2 clks)
(In the tables below, a tag of 0 marks an operand whose value is already present.)

Cycle #0:
  Adder RS 1-3 and Mult/Div RS 4-5: empty
  FLR:  F0 = 6.0   F2 = 3.5   F4 = 10.0   F8 = 7.8

Cycle #1 (issue i):
  Adder RS1:  Sink = <tag 0, 6.0>   Source = <tag 0, 10.0>
  FLR:  F2 = busy, Tag = RS1

Cycle #2 (issue j):
  Adder RS2:  Sink = <tag 0, 6.0>   Source = <tag RS1, --->
  FLR:  F2 = busy, Tag = RS1   F8 = busy, Tag = RS2

17. RAW Example (cont.)
Cycle #3: Adder broadcasts tag and result: CDB_a = <RS1, 16.0>
  Adder RS2:  Sink = <tag 0, 6.0>   Source = <tag 0, 16.0>
  FLR:  F2 = 16.0 (busy cleared)   F8 = busy, Tag = RS2

Cycle #5: Adder broadcasts tag and result: CDB_a = <RS2, 22.0>
  All RS empty
  FLR:  F0 = 6.0   F2 = 16.0   F4 = 10.0   F8 = 22.0
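The RAW example above can be replayed with a toy issue/broadcast model. This is a minimal sketch under the slides' register values; `issue`, `broadcast`, and the table names are illustrative, not the 360/91's actual control.

```python
flr = {"R0": 6.0, "R2": 3.5, "R4": 10.0, "R8": 7.8}  # FP registers
reg_tag = {}   # register -> RS tag that will produce its next value
rs = {}        # RS tag -> operand fields

def issue(tag, dest, src1, src2):
    entry = {"dest": dest, "vj": None, "vk": None, "qj": None, "qk": None}
    for v, q, src in (("vj", "qj", src1), ("vk", "qk", src2)):
        if src in reg_tag:
            entry[q] = reg_tag[src]   # operand pending: record producer tag
        else:
            entry[v] = flr[src]       # operand ready: capture the value now
    rs[tag] = entry
    reg_tag[dest] = tag               # rename dest to this RS

def broadcast(tag, value):            # CDB: tag + result go to everyone
    rs.pop(tag, None)                 # completing RS frees its slot
    for e in rs.values():
        if e["qj"] == tag:
            e["vj"], e["qj"] = value, None
        if e["qk"] == tag:
            e["vk"], e["qk"] = value, None
    for reg, t in list(reg_tag.items()):
        if t == tag:                  # FLR still expects this tag
            flr[reg] = value
            del reg_tag[reg]

issue("RS1", "R2", "R0", "R4")        # i: captures 6.0 and 10.0
issue("RS2", "R8", "R0", "R2")        # j: waits on RS1 for R2
broadcast("RS1", rs["RS1"]["vj"] + rs["RS1"]["vk"])  # CDB_a = <RS1, 16.0>
broadcast("RS2", rs["RS2"]["vj"] + rs["RS2"]["vk"])  # CDB_a = <RS2, 22.0>
print(flr["R2"], flr["R8"])           # 16.0 22.0
```

Note how j never reads R2 from the register file: it waits on the tag RS1 and snoops the value off the CDB, exactly the data-flow behavior described on slide 12.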
18. WAR Example:  i: R4 ← R0 x R8 (3)
                  j: R0 ← R4 x R2 (3)
                  k: R2 ← R2 + R8 (2)

    Cycle #0:
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1                                       0               6.0
      2                                       2               3.5
      3                                       4              10.0
    Mult/Div RS                               8               7.8
      4
      5

    Cycle #1: Issue i
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1                                       0               6.0
      2                                       2               3.5
      3                                       4    1     4    ---
    Mult/Div RS                               8               7.8
      4              0   6.0    0    7.8
      5

    Cycle #2: Issue j
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1                                       0    1     5    ---
      2                                       2               3.5
      3                                       4    1     4    ---
    Mult/Div RS                               8               7.8
      4              0   6.0    0    7.8
      5              4   ---    0    3.5
19. WAR Example:  i: R4 ← R0 x R8 (3)
                  j: R0 ← R4 x R2 (3)
                  k: R2 ← R2 + R8 (2)

    Cycle #3: Issue k
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1              0   3.5    0    7.8      0    1     5    ---
      2                                       2    1     1    ---
      3                                       4    1     4    ---
    Mult/Div RS                               8               7.8
      4              0   6.0    0    7.8
      5              4   ---    0    3.5

    Cycle #4: Broadcasts CDB_m=<RS4,46.8>
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1              0   3.5    0    7.8      0    1     5    ---
      2                                       2    1     1    ---
      3                                       4              46.8
    Mult/Div RS                               8               7.8
      4
      5              0   46.8   0    3.5

    Cycle #5: Broadcasts CDB_a=<RS1,11.3>
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1                                       0    1     5    ---
      2                                       2              11.3
      3                                       4              46.8
    Mult/Div RS                               8               7.8
      4
      5              0   46.8   0    3.5
20. WAR Example:  i: R4 ← R0 x R8 (3)
                  j: R0 ← R4 x R2 (3)
                  k: R2 ← R2 + R8 (2)

    Cycle #7: Broadcasts CDB_m=<RS5,163.8>
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1                                       0             163.8
      2                                       2              11.3
      3                                       4              46.8
    Mult/Div RS                               8               7.8
      4
      5
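The WAR hazard here is j reading R2 while k later writes it. It is harmless because j copies R2's value into its reservation station at issue (cycle 2), so k's write at cycle 5 cannot affect what j computes. A tiny illustrative sketch (plain Python, values taken from the tables above):

```python
# WAR hazard resolved by copying the operand at issue time.
R2 = 3.5
rs5_src = R2              # cycle 2: j issues; its Source (Vk) captures 3.5
R2 = 3.5 + 7.8            # cycle 5: k completes, R2 <- 11.3
result_j = 46.8 * rs5_src # cycle 7: j multiplies by the captured 3.5, not 11.3
print(round(result_j, 1)) # 163.8, matching CDB_m = <RS5, 163.8>
```

Without this copy-at-issue rule, j would have to wait for its read of R2 before k could issue, serializing the three instructions.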
21. WAW Example:  i: R4 ← R0 x R8 (3)
                  j: R2 ← R0 + R4 (2)
                  k: R4 ← R0 + R8 (2)

    Cycle #0:
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1                                       0               6.0
      2                                       2               3.5
      3                                       4              10.0
    Mult/Div RS                               8               7.8
      4
      5

    Cycle #1: Issue i
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1                                       0               6.0
      2                                       2               3.5
      3                                       4    1     4    ---
    Mult/Div RS                               8               7.8
      4              0   6.0    0    7.8
      5

    Cycle #2: Issue j
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1              0   6.0    4    ---      0               6.0
      2                                       2    1     1    ---
      3                                       4    1     4    ---
    Mult/Div RS                               8               7.8
      4              0   6.0    0    7.8
      5
22. WAW Example:  i: R4 ← R0 x R8 (3)
                  j: R2 ← R0 + R4 (2)
                  k: R4 ← R0 + R8 (2)

    Cycle #3: Issue k
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1              0   6.0    4    ---      0               6.0
      2              0   6.0    0    7.8      2    1     1    ---
      3                                       4    1     2    ---
    Mult/Div RS                               8               7.8
      4              0   6.0    0    7.8
      5

    Cycle #4: Broadcasts CDB_m=<RS4,46.8>
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1              0   6.0    0   46.8      0               6.0
      2              0   6.0    0    7.8      2    1     1    ---
      3                                       4    1     2    ---
    Mult/Div RS                               8               7.8
      4
      5

    Cycle #5: Broadcasts CDB_a=<RS2,13.8>
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1              0   6.0    0   46.8      0               6.0
      2                                       2    1     1    ---
      3                                       4              13.8
    Mult/Div RS                               8               7.8
      4
      5
23. WAW Example:  i: R4 ← R0 x R8 (3)
                  j: R2 ← R0 + R4 (2)
                  k: R4 ← R0 + R8 (2)

    Cycle #6: Broadcasts CDB_a=<RS1,52.8>
    Adder RS        Tag  Sink  Tag  Src      FLR  Busy  Tag  Data
      1                                       0               6.0
      2                                       2              52.8
      3                                       4              13.8
    Mult/Div RS                               8               7.8
      4
      5
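The WAW hazard here is i and k both writing R4. It is resolved by the FLR tag check on each broadcast: when k issues at cycle 3, the FLR entry for R4 is retagged from RS4 to RS2, so i's later broadcast no longer matches and only k's result lands in R4. A minimal illustrative sketch:

```python
# FLR tag check on CDB broadcast resolves the WAW hazard on R4.
flr_r4 = {"busy": True, "tag": 2, "data": None}  # cycle 3: k retags R4 to RS2

def flr_broadcast(entry, tag, result):
    # The register captures a broadcast only if the tags still match.
    if entry["busy"] and entry["tag"] == tag:
        entry.update(busy=False, tag=0, data=result)

flr_broadcast(flr_r4, 4, 46.8)  # cycle 4: CDB_m=<RS4,46.8> -- stale tag, ignored
flr_broadcast(flr_r4, 2, 13.8)  # cycle 5: CDB_a=<RS2,13.8> -- tags match, accepted
print(flr_r4["data"])           # 13.8
```

Note that i's value 46.8 is not lost: j captured the RS4 tag at issue and still picks it up off the CDB at cycle 4, as the tables above show.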
24. Tomasulo Example (H&P Text)

    Instruction status:
    Instruction  j    k    Issue  Exec Comp  Write Result
    LD    F6   34+  R2
    LD    F2   45+  R3
    MULTD F0   F2   F4
    SUBD  F8   F6   F2
    DIVD  F10  F0   F6
    ADDD  F6   F8   F2

    Load buffers:
    Name   Busy  Address
    Load1  No
    Load2  No
    Load3  No

    Reservation Stations:
    Time  Name   Busy  Op  Vj(S1)  Vk(S2)  Qj(RS)  Qk(RS)
          Add1   No
          Add2   No
          Add3   No
          Mult1  No
          Mult2  No

    Register result status:
    Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
      0    Qi

    Note: these are RS; we have only "one FU" for each type (MUL, ADD,
    LD). We reduce the Load buffers from 6 to 3 for simplicity. The SDB
    is not shown either.
47. Issues in the Tomasulo Algorithm
• Can the CDB run at high speed?
• Precise exception support
• Speculative instructions
  – Branch prediction enlarges the instruction window
  – How to roll back when a branch is mispredicted?
Editor's Notes
Cycle 2 of LD is “Read Operand” which is not shown.