Chapter 4: The Processor

Cheng-Jung Tsai, Assistant Professor, Department of Mathematics, National Changhua University of Education, Taiwan, R.O.C.
Email: [email_address]
Modified from the instructor’s teaching material provided by Elsevier Inc. Copyright 2009

Outline

4.1 Introduction

A basic MIPS implementation

An overview of the implementation: Instruction Execution

FIGURE 4.1 An abstract view of the implementation of the MIPS subset. This figure shows the major functional units and the major connections between them. All instructions start by using the program counter to supply the instruction address to the instruction memory. After the instruction is fetched, the register operands used by an instruction are specified by fields of that instruction. Once the register operands have been fetched, they can be operated on by the ALU to compute a memory address (for a load or store), to compute an arithmetic result (for an integer arithmetic-logical instruction), or to perform a comparison (for a branch). If the instruction is an arithmetic-logical instruction, the result from the ALU must be written to a register. If the operation is a load or store, the ALU result is used as an address to either store a value from the registers or load a value from memory into the registers. The result from the ALU or memory is written back into the register file. Branches require the use of the ALU output to determine the next instruction address, which comes either from the ALU (where the PC and branch offset are summed) or from an adder that increments the current PC by 4. The thick lines interconnecting the functional units represent buses, which consist of multiple signals.
Slide annotation: branch on equal, beq $1,$2,25: if ($1 == $2) go to PC+4+100.

CPU Overview & Multiplexers

Multiplexors in Appendix C
A two-input multiplexor built from AND and OR gates, with inputs A and B, select S and output C:
S = 0: C = A
S = 1: C = B
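The select behaviour above (S = 0 gives C = A, S = 1 gives C = B) can be sketched with the same AND/OR structure. This is a minimal illustration only; the function names `mux2` and `mux2_word` are made up for this example and do not come from the slides.

```python
def mux2(a: int, b: int, s: int) -> int:
    """1-bit 2-to-1 multiplexor: C = (A AND NOT S) OR (B AND S)."""
    not_s = 1 - s
    return (a & not_s) | (b & s)


def mux2_word(a: int, b: int, s: int, width: int = 32) -> int:
    """The same selection applied to every bit of a width-bit bus."""
    mask = (1 << width) - 1
    sel = mask if s else 0          # replicate the select line across the bus
    return (a & ~sel & mask) | (b & sel)


# Print the truth table of the 1-bit multiplexor.
for s in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            print(f"S={s} A={a} B={b} -> C={mux2(a, b, s)}")
```

In the datapath the word-wide form is the one that matters: each "Mux" in the figures selects between two 32-bit buses under a single control line.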

FIGURE 4.2 The basic implementation of the MIPS subset, including the necessary multiplexors and control lines. The top multiplexor (“Mux”) controls what value replaces the PC (PC + 4 or the branch destination address); the multiplexor is controlled by the gate that “ANDs” together the Zero output of the ALU and a control signal that indicates that the instruction is a branch. The middle multiplexor, whose output returns to the register file, is used to steer the output of the ALU (in the case of an arithmetic-logical instruction) or the output of the data memory (in the case of a load) for writing into the register file. Finally, the bottommost multiplexor is used to determine whether the second ALU input is from the registers (for an arithmetic-logical instruction or a branch) or from the offset field of the instruction (for a load or store). The added control lines are straightforward and determine the operation performed at the ALU, whether the data memory should read or write, and whether the registers should perform a write operation. The control lines are shown in color to make them easier to see.

4.2 Logic Design Conventions

Recall from Computer Science: logic operations at the bit level
Truth table: a truth table defines the value of the output for each possible combination of inputs. Example shown: exclusive-or.

Combinational Elements
Figure labels: a gate with inputs A, B and output Y; a multiplexor (Mux) with inputs I0, I1, select S and output Y; an adder (+) with inputs A, B and output Y; an ALU with inputs A, B, function select F and output Y.

Sequential Elements
Figure labels: a register with data input D, clock Clk and output Q, and a timing diagram of Clk, D and Q.

Clocking Methodology
An edge-triggered methodology allows a state element to be read and written in the same clock cycle without creating a race that could lead to indeterminate data values.

4.3 Building a Datapath

Datapath element for each instruction

FIGURE 4.6 A portion of the datapath used for fetching instructions and incrementing the program counter. Annotations: the PC is a 32-bit register; the adder increments it by 4 for the next instruction.

Two elements needed to implement R-type ALU operations
R-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
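As a rough illustration of how the R-type fields are pulled out of a 32-bit instruction word, the sketch below shifts and masks each field. The helper name `decode_r_type` is hypothetical; the example word encodes add $t0, $s1, $s2 using the standard MIPS register numbers ($t0 = 8, $s1 = 17, $s2 = 18).

```python
def decode_r_type(instr: int) -> dict:
    """Split a 32-bit MIPS R-type word into its six fields."""
    return {
        "op":    (instr >> 26) & 0x3F,  # bits 31:26, 6 bits
        "rs":    (instr >> 21) & 0x1F,  # bits 25:21, 5 bits
        "rt":    (instr >> 16) & 0x1F,  # bits 20:16, 5 bits
        "rd":    (instr >> 11) & 0x1F,  # bits 15:11, 5 bits
        "shamt": (instr >> 6) & 0x1F,   # bits 10:6, 5 bits
        "funct": instr & 0x3F,          # bits 5:0, 6 bits
    }


# add $t0, $s1, $s2 -> op=0, rs=17, rt=18, rd=8, shamt=0, funct=0x20
word = (0 << 26) | (17 << 21) | (18 << 16) | (8 << 11) | (0 << 6) | 0x20
print(decode_r_type(word))
```

The rs and rt fields feed the two read ports of the register file, and rd selects the write port, which is why each is exactly 5 bits wide.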

Two elements needed to implement R-type ALU operations
Annotations: register numbers are 5 bits wide to specify one of the 32 registers; data lines are 32 bits wide. The operation to be performed by the ALU is controlled with the ALU operation signal, which is 4 bits wide. We will use the Zero detection output of the ALU shortly to implement branches in this simple implementation.

Two extra units needed to implement Load/Store
I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | constant or address (16 bits)
The sign-extension unit has a 16-bit input that is sign-extended into a 32-bit result appearing on the output.
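The sign-extension unit described above can be sketched in a few lines: the 16-bit input is copied to the low half of the result and bit 15 is replicated into the upper 16 bits. The function name `sign_extend16` is chosen for this example only.

```python
def sign_extend16(value: int) -> int:
    """Sign-extend a 16-bit field to a 32-bit pattern (returned as an unsigned int)."""
    value &= 0xFFFF
    if value & 0x8000:              # sign bit set: fill the upper half with ones
        return value | 0xFFFF0000
    return value                    # sign bit clear: upper half stays zero


print(hex(sign_extend16(0x0019)))
print(hex(sign_extend16(0xFFFC)))
```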

Recall: 3 useful shortcuts for 2’s complement numbers

Datapath for Branch Instructions
Format shown on the slide: op (6 bits) | address (26 bits)

FIGURE 4.9: The datapath for a branch. Annotations: the shift-left-2 unit just re-routes wires; sign extension replicates the sign-bit wire; the result selects PC+4 or the branch target.
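A small sketch of the branch target calculation may help here: the 16-bit offset is sign-extended, shifted left by 2 and added to PC+4, matching the earlier annotation that beq $1,$2,25 goes to PC+4+100. The function names are illustrative, not taken from the slides.

```python
MASK32 = 0xFFFFFFFF


def branch_target(pc: int, offset16: int) -> int:
    """Target = (PC + 4) + (sign-extended offset << 2), wrapped to 32 bits."""
    imm = offset16 & 0xFFFF
    if imm & 0x8000:                 # replicate the sign bit into the upper half
        imm |= 0xFFFF0000
    return (pc + 4 + (imm << 2)) & MASK32


def next_pc(pc: int, offset16: int, taken: bool) -> int:
    """The top multiplexor: branch target if taken, otherwise PC + 4."""
    return branch_target(pc, offset16) if taken else (pc + 4) & MASK32


print(hex(branch_target(0x1000, 25)))      # offset 25 words = 100 bytes past PC+4
print(hex(branch_target(0x1000, 0xFFFF)))  # offset -1 word
```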

Creating a Single Datapath: Composing the Elements

Example in p.313: building a datapath: R-Type/Load/Store Datapath

Figure 4.11 in p.315: Full Datapath. Annotations: PC+4 or branch target; arithmetic or address calculation; load or ALU result. Take 5 minutes to become familiar with this datapath, and try the “Check Yourself” on p.315. What is missing in this figure? The control unit.

4.4 A Simple Implementation Scheme: The ALU Control
ALU control lines    Function
0000                 AND
0001                 OR
0010                 add
0110                 subtract
0111                 set-on-less-than
1100                 NOR
How is the 4-bit ALU control input above generated? See the next slide.
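The table of ALU control lines can be read as a small dispatch function. The following is only a behavioural sketch on 32-bit values (the helper names `alu` and `to_signed` are invented for this example); it also returns the Zero output that the branch logic uses.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's complement number."""
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control: int, a: int, b: int) -> tuple:
    """Return (result, zero) for the six operations in the table above."""
    if control == 0b0000:
        result = a & b                                       # AND
    elif control == 0b0001:
        result = a | b                                       # OR
    elif control == 0b0010:
        result = (a + b) & MASK32                            # add
    elif control == 0b0110:
        result = (a - b) & MASK32                            # subtract
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0     # set-on-less-than
    elif control == 0b1100:
        result = ~(a | b) & MASK32                           # NOR
    else:
        raise ValueError(f"unsupported ALU control {control:04b}")
    return result, int(result == 0)


print(alu(0b0010, 2, 3))
print(alu(0b0110, 7, 7))
```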

Generate the 4-bit ALU Control input

FIGURE 4.13: The truth table for the 4 ALU control bits
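The truth table itself did not survive in this transcript, so the sketch below assumes the usual textbook convention: a 2-bit ALUOp from the main control (00 for load/store, 01 for beq, 10 for R-type) combined with the 6-bit funct field. Both the ALUOp encoding and the function name `alu_control` are assumptions for illustration.

```python
# Standard MIPS funct codes for the R-type instructions in the subset.
FUNCT_TO_ALU = {
    0b100000: 0b0010,  # add -> add
    0b100010: 0b0110,  # sub -> subtract
    0b100100: 0b0000,  # and -> AND
    0b100101: 0b0001,  # or  -> OR
    0b101010: 0b0111,  # slt -> set-on-less-than
}


def alu_control(alu_op: int, funct: int) -> int:
    """Map (ALUOp, funct) to the 4-bit ALU control input."""
    if alu_op == 0b00:        # lw / sw: add base register and offset
        return 0b0010
    if alu_op == 0b01:        # beq: subtract and test the Zero output
        return 0b0110
    if alu_op == 0b10:        # R-type: the funct field decides
        return FUNCT_TO_ALU[funct]
    raise ValueError("unsupported ALUOp")


print(format(alu_control(0b10, 0b101010), "04b"))
```

Splitting the decoding into two levels keeps the main control small: it only has to distinguish instruction classes, while this second level looks at funct.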

Designing the Main Control Unit
Annotations on the instruction formats: opcode; rs is always read; rt is read, except for load; the destination register is written for R-type and load.

FIGURE 4.15 The datapath with all necessary multiplexors and all control lines identified. Annotations: “used to decide which operation is performed”; register fields rs, rt, rd.

FIGURE 4.16 & 4.18: The effect of each of the seven control signals

FIGURE 4.17: The simple datapath with the control unit. The PCSrc control line should be set if the instruction is branch on equal and the Zero output of the ALU is true. Annotations: “used to decide which operation is performed”; “used to decide if a branch is taken”; instruction fields op, rs, rt, rd, address or f-code.
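To make the role of the control unit concrete, here is a sketch of the main control as a lookup table keyed by opcode, plus the AND gate that produces PCSrc. The signal settings follow the standard textbook table for R-format, lw, sw and beq (the slide's own table is not preserved in this transcript, so treat the values as an assumption); `None` marks a don't-care.

```python
# Opcode -> control signals (None = don't care).
CONTROL = {
    0b000000: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),        # R-format
    0b100011: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),        # lw
    0b101011: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),        # sw
    0b000100: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),        # beq
}


def pc_src(opcode: int, zero: int) -> int:
    """PCSrc = Branch AND Zero: select the branch target only for a taken beq."""
    return int(bool(CONTROL[opcode]["Branch"]) and bool(zero))


print(pc_src(0b000100, 1), pc_src(0b000100, 0), pc_src(0b100011, 1))
```

Seven of the entries are single-bit signals and ALUOp is two bits wide, which gives nine control lines in total.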

FIGURE 4.22: Implementation of the truth table

Implementing Main Control in Appendix D

Operation of the Datapath for an R-type instruction: add
R-type format: op (bits 31:26, 6 bits) | rs (25:21, 5 bits) | rt (20:16, 5 bits) | rd (15:11, 5 bits) | shamt (10:6, 5 bits) | funct (5:0, 6 bits)

Fig. 4.19 in p.324 Operation of Datapath: add. Instruction Fetch at Start of Add

Operation of Datapath: add. Instruction Decode of Add

Operation of Datapath: add. ALU Operation during Add

Operation of Datapath: add. Write Back at the End of Add

Figure 4.19: Operation of Datapath in textbook: add

Figure 4.20: Datapath Operation for I-type instruction: lw

Figure 4.21: Datapath Operation for I-type instruction: beq

Implementing Jumps
J-type format: opcode 000010 (bits 31:26) | address, 26-bit immediate (bits 25:0)
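As a hedged sketch of how the jump address is formed in MIPS: the upper 4 bits of PC+4 are concatenated with the 26-bit address field and two zero bits. The function name `jump_target` and the example addresses are illustrative only.

```python
def jump_target(pc: int, instr: int) -> int:
    """Jump target = (PC + 4)[31:28] : address[25:0] : 00."""
    address = instr & 0x03FFFFFF            # 26-bit address field, bits 25:0
    upper = (pc + 4) & 0xF0000000           # top 4 bits of the incremented PC
    return upper | (address << 2)           # shift left 2 for word alignment


# j with address field 0x400, fetched from PC = 0x10000000
instr = (0b000010 << 26) | 0x0000400
print(hex(jump_target(0x10000000, instr)))
```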

Figure 4.24: Datapath With Jumps Added

Why a Single-Cycle Implementation Is Not Used Today?

Performance of Single-Cycle Machine - Example in p.315 of the 3rd ed.

Performance of Single-Cycle Machine - Example in p.315 of the 3rd ed.
Instruction class    Functional units used
R-Format             Instruction Fetch, Register Access, ALU, Register Access
Load Word            Instruction Fetch, Register Access, ALU, Memory Access, Register Access
Store Word           Instruction Fetch, Register Access, ALU, Memory Access
Branch               Instruction Fetch, Register Access, ALU
Jump                 Instruction Fetch

Performance of Single-Cycle Machine - Example in p.315 of the 3rd ed.
Functional unit delays: memory units 2 ns; ALU and adders 2 ns; register file (read or write) 1 ns.
Instruction class    Instruction memory    Register read    ALU operation    Data memory    Register write    Total
R-Format             2                     1                2                0              1                 6 ns
Load Word            2                     1                2                2              1                 8 ns
Store Word           2                     1                2                2              -                 7 ns
Branch               2                     1                2                -              -                 5 ns
Jump                 2                     -                -                -              -                 2 ns
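The totals in the table follow directly from the unit delays. A short sketch (dictionary and function names are made up for this example) recomputes them and shows why the single-cycle clock period must match the slowest class, the load.

```python
# Delays in ns, as given on the slide.
UNIT_NS = {"instr_mem": 2, "reg_read": 1, "alu": 2, "data_mem": 2, "reg_write": 1}

# Functional units on the critical path of each instruction class.
USES = {
    "R-Format":   ["instr_mem", "reg_read", "alu", "reg_write"],
    "Load Word":  ["instr_mem", "reg_read", "alu", "data_mem", "reg_write"],
    "Store Word": ["instr_mem", "reg_read", "alu", "data_mem"],
    "Branch":     ["instr_mem", "reg_read", "alu"],
    "Jump":       ["instr_mem"],
}


def latency_ns(instr_class: str) -> int:
    """Sum the delays of the units the instruction class passes through."""
    return sum(UNIT_NS[unit] for unit in USES[instr_class])


latencies = {cls: latency_ns(cls) for cls in USES}
clock_period_ns = max(latencies.values())   # one fixed cycle must fit the slowest class
print(latencies, clock_period_ns)
```

Every instruction then takes the full clock period, even a jump that only needs 2 ns, which is the inefficiency the following slides address.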

Performance of Single-Cycle Machine - Example in p.315 of the 3rd ed.

4.5 An Overview of Pipelining

Pipelining Lessons
Figure: timeline from 6 PM to 9 PM with tasks A, B, C, D in task order; segment lengths 30, 40, 40, 40, 40, 20.

MIPS Pipeline

Pipeline Performance: Example in P.333
Instr       Instr fetch    Register read    ALU op    Memory access    Register write    Total time
lw          200ps          100ps            200ps     200ps            100ps             800ps
sw          200ps          100ps            200ps     200ps            -                 700ps
R-format    200ps          100ps            200ps     -                100ps             600ps
beq         200ps          100ps            200ps     -                -                 500ps

Pipeline Performance: Fig. 4.27 in P.333. Single-cycle (Tc = 800ps) versus pipelined (Tc = 200ps). Unbalanced stage lengths reduce the speedup of the pipeline.
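The two clock periods can be derived from the stage times in the example: the single-cycle period is the longest instruction total, while the pipelined period is the longest single stage. The sketch below (names are illustrative) also estimates total time for n instructions, assuming an ideal five-stage pipeline with no stalls.

```python
# Stage times in ps: fetch, register read, ALU, memory access, register write.
STAGE_PS = {
    "lw":       [200, 100, 200, 200, 100],
    "sw":       [200, 100, 200, 200, 0],
    "R-format": [200, 100, 200, 0, 100],
    "beq":      [200, 100, 200, 0, 0],
}

SINGLE_CYCLE_TC = max(sum(t) for t in STAGE_PS.values())   # slowest instruction
PIPELINED_TC = max(max(t) for t in STAGE_PS.values())      # slowest stage


def single_cycle_time(n: int) -> int:
    """n instructions, one long cycle each."""
    return n * SINGLE_CYCLE_TC


def pipelined_time(n: int, stages: int = 5) -> int:
    """Fill the pipeline once, then complete one instruction per cycle (no stalls)."""
    return (stages + n - 1) * PIPELINED_TC


for n in (3, 1_000_000):
    print(n, single_cycle_time(n), pipelined_time(n),
          round(single_cycle_time(n) / pipelined_time(n), 2))
```

With balanced stages the speedup would approach the number of stages (5); here it approaches 800/200 = 4 because the 100 ps register stages are padded out to the 200 ps cycle.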

Pipeline Speedup

Pipelining and ISA Design

Hazards

Structure Hazards
Annotations: lw $4, 400($s0); stall

Structural Hazard: Single Memory
Figure: pipeline diagram (time versus instruction order) for Load, Instr 1, Instr 2, Instr 3 and Instr 4, each passing through Mem, Reg, ALU, Mem, Reg with a single shared memory.

Data Hazards
Figure legend: one shading means write, the other means read. Annotations: three bubbles; the dependent instruction needs $s0.

Solution for data hazard

Load-Use Data Hazard

Compiler: Code Scheduling to Avoid Stalls - example in p.338
Original order (13 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    (stall)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    (stall)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
Reordered (11 cycles):
    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
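The cycle counts above can be reproduced with a small model: a five-stage pipeline with forwarding, where the only stall is one bubble when an instruction uses the result of the load immediately before it. This is a simplified sketch (the parser only handles the lw/sw/three-register forms in this example, and all names are invented here).

```python
import re


def parse(line: str):
    """Return (opcode, destination register or None, list of source registers)."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "lw":
        return op, regs[0], [regs[1]]      # lw rt, off(rs): writes rt, reads rs
    if op == "sw":
        return op, None, regs              # sw reads both the data and base registers
    return op, regs[0], regs[1:]           # add rd, rs, rt


def count_cycles(program, stages: int = 5) -> int:
    """Cycles = instructions + pipeline fill + load-use stalls."""
    stalls = 0
    prev = None
    for line in program:
        op, dest, srcs = parse(line)
        if prev is not None and prev[0] == "lw" and prev[1] in srcs:
            stalls += 1                    # value not available until after MEM
        prev = (op, dest, srcs)
    return len(program) + (stages - 1) + stalls


original = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2",
            "sw $t3, 12($t0)", "lw $t4, 8($t0)", "add $t5, $t1, $t4",
            "sw $t5, 16($t0)"]
reordered = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)",
             "add $t3, $t1, $t2", "sw $t3, 12($t0)", "add $t5, $t1, $t4",
             "sw $t5, 16($t0)"]
print(count_cycles(original), count_cycles(reordered))
```

Moving the third load up gives each add an unrelated instruction between it and the load it depends on, so both stalls disappear.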

Control Hazards (Branch Hazards)

Solution 1: Stall on Branch

Solution 2: Branch Prediction

More-Realistic Branch Prediction

MIPS with Predict Not Taken Prediction correct Prediction incorrect

Solution 3: Delayed Branch Solution

Pipeline Summary

4.6 Pipelined Datapath and Control

Why all instructions have the same number of stages
Figure: clock cycles 1 to 9 with the sequence R-type, R-type, Load, R-type, R-type. R-type uses four stages (Ifetch, Reg/Dec, Exec, Wr) and Load uses five (Ifetch, Reg/Dec, Exec, Mem, Wr). Oops! We have a problem!

Important Observation for pipeline
Load:   1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Wr

MIPS Pipelined Datapath. Right-to-left flow leads to hazards.
FIGURE 4.33 Each step of the instruction can be mapped onto the datapath from left to right. The only exceptions (right-to-left) are the update of the PC, which can lead to control hazards, and the write-back step, which sends either the ALU result or the data from memory to the left to be written into the register file and can lead to data hazards.

FIGURE 4.35 Pipeline registers. The pipeline registers are labeled by the stages that they separate; for example, the first is labeled IF/ID. The registers must be wide enough to store all the data corresponding to the lines that go through them. For example, the IF/ID register must be 64 bits wide, because it must hold both the 32-bit instruction fetched from memory and the incremented 32-bit PC address. We will expand these registers over the course of this chapter, but for now the other three pipeline registers contain 128, 97, and 64 bits, respectively.
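The widths quoted in the caption can be checked by adding up the values each register has to carry. The field breakdown for ID/EX, EX/MEM and MEM/WB below is an assumption based on the usual drawing of this datapath (it is not spelled out in the caption), and the field names are invented for the sketch.

```python
# Assumed contents of each pipeline register, in bits, before control
# signals and the write-register number are added.
PIPELINE_REGS = {
    "IF/ID":  {"instruction": 32, "pc_plus_4": 32},
    "ID/EX":  {"pc_plus_4": 32, "read_data_1": 32, "read_data_2": 32,
               "sign_ext_imm": 32},
    "EX/MEM": {"branch_target": 32, "zero": 1, "alu_result": 32,
               "read_data_2": 32},
    "MEM/WB": {"mem_read_data": 32, "alu_result": 32},
}

widths = {name: sum(fields.values()) for name, fields in PIPELINE_REGS.items()}
print(widths)
```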

Graphically representing pipelines

Considering load
Load: Cycle 1 IF, Cycle 2 ID, Cycle 3 Exe, Cycle 4 Mem, Cycle 5 Wr

Single-clock-cycle diagrams
Figure 4.36 IF for Load, Store, … Annotations: instruction read, PC+4; highlighting means read.

ID for Load, Store, … FIGURE 4.36 Although the load needs only the top register in stage 2, the processor doesn’t know what instruction is being decoded, so it sign-extends the 16-bit constant and reads both registers into the ID/EX pipeline register.

EX for Load FIGURE 4.37 The register is added to the sign-extended immediate, and the sum is placed in the EX/MEM pipeline register.

MEM for Load FIGURE 4.38 Data memory is read using the address in the EX/MEM pipeline registers, and the data is placed in the MEM/WB pipeline register.

WB for Load FIGURE 4.38 Next, data is read from the MEM/WB pipeline register and written into the register file in the middle of the datapath. Note: there is a bug in this design that is repaired in Figure 4.41. Who will supply this address?

FIGURE 4.41 Corrected Datapath for Load (I-type). The write register number now comes from the MEM/WB pipeline register along with the data. The register number is passed from the ID pipe stage until it reaches the MEM/WB pipeline register, adding 5 more bits to the last three pipeline registers.

FIGURE 4.39 EX for Store (I-type). Unlike the third stage of the load instruction, the second register value is loaded into the EX/MEM pipeline register to be used in the next stage.

FIGURE 4.40 MEM for Store. In the fourth stage, the data is written into data memory for the store. Note that the data comes from the EX/MEM pipeline register and that nothing is changed in the MEM/WB pipeline register.

FIGURE 4.40 WB for Store. Once the data is written in memory, there is nothing left for the store instruction to do, so nothing happens in stage 5.

Figure 4.43 in p.357 Multi-Cycle Pipeline Diagram

Figure 4.44 in p.357 Multi-Cycle Pipeline Diagram

Figure 4.45 in p.358 Single-Cycle Pipeline Diagram

FIGURE 4.46 Pipelined Control (simplified version). The pipelined datapath of Figure 4.41 with the control signals identified. This datapath borrows the control logic for PC source, register destination number, and ALU control from Section 4.4. Note that we now need the 6-bit function code of the instruction in the EX stage as input to ALU control, so these bits must also be included in the ID/EX pipeline register.

Recall: FIGURE 4.17: The simple datapath with the control unit. The PCsrc control line should be set if the instruction is branch on equal and the Zero output of ALU is true. Used to decide which operation is performed rs rt rd op address or f-code Used to decide if a branch is taken

Pipelined Control
FIGURE 4.50 The control lines for the final three stages. Note that four of the nine control lines are used in the EX phase, with the remaining five control lines passed on to the EX/MEM pipeline register extended to hold the control lines; three are used during the MEM stage, and the last two are passed to MEM/WB for use in the WB stage.

Pipelined Control (cont.)
Figure: the Main Control in the ID stage generates ExtOp, ALUOp, RegDst, ALUSrc, Branch, MemWr, MemtoReg and RegWr. ExtOp, ALUOp, RegDst and ALUSrc are consumed in EX (held in the ID/Ex register); Branch and MemWr are consumed in MEM (held through the Ex/MEM register); MemtoReg and RegWr are consumed in WB (held through the MEM/WB register).

FIGURE 4.51 The pipelined datapath of Figure 4.46 with the control signals connected to the control portions of the pipeline registers. The control values for the last three stages are created during the ID stage and then placed in the ID/EX pipeline register. The control lines for each pipe stage are used, and the remaining control lines are then passed to the next pipeline stage.
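The "use some, pass the rest on" scheme can be sketched as follows. The grouping of four EX, three MEM and two WB lines comes from the Figure 4.50 caption; the specific signal names and the lw values follow the usual textbook naming and are assumptions here, as is the helper `split_by_stage`.

```python
EX_SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc")    # used in EX
MEM_SIGNALS = ("Branch", "MemRead", "MemWrite")          # used in MEM
WB_SIGNALS = ("RegWrite", "MemtoReg")                    # used in WB


def split_by_stage(control: dict):
    """Return the control fields held in ID/EX, EX/MEM and MEM/WB."""
    id_ex = dict(control)                                           # all nine lines
    ex_mem = {k: v for k, v in id_ex.items()
              if k in MEM_SIGNALS + WB_SIGNALS}                     # EX lines dropped
    mem_wb = {k: v for k, v in ex_mem.items() if k in WB_SIGNALS}   # MEM lines dropped
    return id_ex, ex_mem, mem_wb


# Control values for lw, generated once in the ID stage.
lw_control = dict(RegDst=0, ALUOp1=0, ALUOp0=0, ALUSrc=1,
                  Branch=0, MemRead=1, MemWrite=0,
                  RegWrite=1, MemtoReg=1)

id_ex, ex_mem, mem_wb = split_by_stage(lw_control)
print(len(id_ex), len(ex_mem), len(mem_wb))
```

Because the control values travel with the instruction, each stage can act on a different instruction's signals in the same clock cycle.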

Supplement: Pipelined control (Write the control signals) - Let’s Try it Out

4.7 Data Hazards: Forwarding vs. Stalling
Data Hazards in ALU Instructions

Dependencies & Forwarding

Detecting the Need to Forward
Annotations: forward from the EX/MEM pipeline register; forward from the MEM/WB pipeline register.

Recall: the example in slide 108. Detecting the Need to Forward

Detecting the Need to Forward

FIGURE 4.54 Forwarding paths hardware. On the top are the ALU and pipeline registers before adding forwarding. On the bottom, the multiplexors have been expanded to add the forwarding paths, and we show the forwarding unit. This figure is a stylized drawing, however, leaving out details from the full datapath such as the sign extension hardware. Note that the ID/EX.RegisterRt field is shown twice, once to connect to the mux and once to the forwarding unit, but it is a single signal.

FIGURE 4.55 The control values for the forwarding multiplexors

Forwarding Conditions

More complicated Data Hazard: Double Data Hazard

Revised Forwarding Condition for double data hazard
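The slide text for the conditions is not preserved here, so the following sketch uses the standard textbook formulation: forward from EX/MEM when it writes a non-zero register matching the source, otherwise from MEM/WB under the same test. Checking EX/MEM first is what implements the revised condition for a double data hazard (the most recent result wins). Parameter names mirror the pipeline register fields; the function name is invented.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_reg_write, ex_mem_rd,
                    mem_wb_reg_write, mem_wb_rd):
    """Return (ForwardA, ForwardB): 0b00 register file, 0b10 EX/MEM, 0b01 MEM/WB."""

    def select(src):
        # EX hazard: the result of the immediately preceding instruction.
        if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src:
            return 0b10
        # MEM hazard: only when there is no EX hazard on the same operand.
        if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src:
            return 0b01
        return 0b00

    return select(id_ex_rs), select(id_ex_rt)


# Double data hazard: add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4.
# For the third add (rs = $1, rt = $4), both older instructions write $1.
print(forwarding_unit(1, 4, True, 1, True, 1))
```

The test against register 0 matters because $zero must always read as 0, even if an earlier instruction names it as a destination.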

FIGURE 4.56 The datapath modified to resolve hazards via forwarding

Load-Use Data Hazard Need to stall for one cycle

Load-Use Hazard Detection

How to Stall the Pipeline
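A minimal sketch of the load-use check, assuming the standard textbook condition: stall when the instruction in EX is a load (ID/EX.MemRead) whose destination rt matches either source register of the instruction in ID. The returned field names (`pc_write`, `if_id_write`, `zero_controls`) are hypothetical labels for the three actions that freeze the pipeline and insert a bubble.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Decide whether to stall for one cycle on a load-use hazard."""
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "stall": stall,
        "pc_write": not stall,       # hold the PC so the same instruction is refetched
        "if_id_write": not stall,    # hold IF/ID so the same instruction is decoded again
        "zero_controls": stall,      # force ID/EX control values to 0 (a bubble, i.e. nop)
    }


# lw $2, 20($1) in EX, followed by and $4, $2, $5 in ID
print(hazard_detection(True, 2, 2, 5))
```

After the one-cycle stall the loaded value is in MEM/WB, so the forwarding unit can supply it to the dependent instruction.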

Stall/Bubble in the Pipeline Stall inserted here

Stall/Bubble in the Pipeline Or, more accurately…

Datapath with Hazard Detection

Stalls and Performance

4.8 Control Hazards
Branch Hazards
Annotation: flush these instructions (set control values to 0).

Reducing Branch Delay

Data Hazards for Branches
Instructions shown: …; add $4, $5, $6; add $1, $2, $3; beq $1, $4, target. Each passes through IF ID EX MEM WB.

Data Hazards for Branches
Instructions shown: add $4, $5, $6; lw $1, addr; beq $1, $4, target. The beq is stalled for one cycle: IF ID ID EX MEM WB.

Data Hazards for Branches
Instructions shown: lw $1, addr; beq $1, $0, target. The beq is stalled for two cycles: IF ID ID ID EX MEM WB.

Dynamic Branch Prediction

1-Bit Predictor: Shortcoming
    outer: …
           …
    inner: …
           …
           beq …, …, inner
           …
           beq …, …, outer

2-Bit Predictor
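The shortcoming of the 1-bit scheme on an inner-loop branch can be seen with a short simulation. The 2-bit predictor is modelled here as a saturating counter, one common way to realise the idea that a prediction must be wrong twice before it changes; the loop counts (10 inner iterations, repeated 10 times) and starting states are arbitrary choices for this sketch.

```python
def one_bit_mispredictions(outcomes, state=False):
    """Predict the same direction as the last outcome of this branch."""
    misses = 0
    for taken in outcomes:
        if taken != state:
            misses += 1
        state = taken                  # remember only the most recent outcome
    return misses


def two_bit_mispredictions(outcomes, counter=0):
    """2-bit saturating counter (0..3): predict taken when the counter is 2 or 3."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Inner-loop branch: taken 9 times, then not taken on loop exit; repeated
# for 10 iterations of the outer loop.
outcomes = ([True] * 9 + [False]) * 10
print(one_bit_mispredictions(outcomes), two_bit_mispredictions(outcomes))
```

In steady state the 1-bit predictor misses twice per pass through the inner loop (on exit and again on re-entry), while the 2-bit counter misses only on exit.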

Calculating the Branch Target

4.9 Exceptions
Exceptions and Interrupts

Handling Exceptions

An Alternate Mechanism

Handler Actions

Exceptions in a Pipeline

Exception Properties

Exception Example

Multiple Exceptions

Imprecise Exceptions

4.10 Parallelism and Advanced Instruction Level Parallelism
Instruction-Level Parallelism (ILP)

Multiple Issue

Speculation

Compiler/Hardware Speculation

Speculation and Exceptions

Static Multiple Issue

Scheduling Static Multiple Issue

MIPS with Static Dual Issue
Address    Instruction type    Pipeline stages
n          ALU/branch          IF ID EX MEM WB
n + 4      Load/store          IF ID EX MEM WB
n + 8      ALU/branch          IF ID EX MEM WB
n + 12     Load/store          IF ID EX MEM WB
n + 16     ALU/branch          IF ID EX MEM WB
n + 20     Load/store          IF ID EX MEM WB

Hazards in the Dual-Issue MIPS

Scheduling Example
    Loop: lw   $t0, 0($s1)        # $t0 = array element
          addu $t0, $t0, $s2      # add scalar in $s2
          sw   $t0, 0($s1)        # store result
          addi $s1, $s1, -4       # decrement pointer
          bne  $s1, $zero, Loop   # branch if $s1 != 0
Schedule for the dual-issue MIPS:
          ALU/branch              Load/store          cycle
    Loop: nop                     lw $t0, 0($s1)      1
          addi $s1, $s1, -4       nop                 2
          addu $t0, $t0, $s2      nop                 3
          bne  $s1, $zero, Loop   sw $t0, 4($s1)      4

Loop Unrolling

Loop Unrolling Example
          ALU/branch              Load/store          cycle
    Loop: addi $s1, $s1, -16      lw $t0, 0($s1)      1
          nop                     lw $t1, 12($s1)     2
          addu $t0, $t0, $s2      lw $t2, 8($s1)      3
          addu $t1, $t1, $s2      lw $t3, 4($s1)      4
          addu $t2, $t2, $s2      sw $t0, 16($s1)     5
          addu $t3, $t3, $s2      sw $t1, 12($s1)     6
          nop                     sw $t2, 8($s1)      7
          bne  $s1, $zero, Loop   sw $t3, 4($s1)      8
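One way to compare the two dual-issue schedules is instructions per cycle (IPC): count the useful (non-nop) instructions and divide by the number of issue cycles. The sketch below only records opcode names per slot; the variable and function names are made up for this example.

```python
def ipc(schedule) -> float:
    """Useful instructions per cycle for a list of (ALU/branch, load/store) slots."""
    useful = sum(1 for pair in schedule for slot in pair if slot != "nop")
    return useful / len(schedule)


# Original loop scheduled on the dual-issue pipeline (4 cycles).
single_iteration = [("nop", "lw"), ("addi", "nop"), ("addu", "nop"), ("bne", "sw")]

# Loop unrolled four times (8 cycles).
unrolled = [("addi", "lw"), ("nop", "lw"), ("addu", "lw"), ("addu", "lw"),
            ("addu", "sw"), ("addu", "sw"), ("nop", "sw"), ("bne", "sw")]

print(ipc(single_iteration), ipc(unrolled))
```

Unrolling gives the scheduler independent loads, adds and stores from four iterations to fill slots that were nops, so the result is closer to the dual-issue peak of 2.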

Dynamic Multiple Issue

Dynamic Pipeline Scheduling

Dynamically Scheduled CPU
Figure annotations: preserves dependencies; hold pending operands; results also sent to any waiting reservation stations; reorder buffer for register writes; can supply operands for issued instructions.

Register Renaming

Speculation

Why Do Dynamic Scheduling?

Does Multiple Issue Work?

Power Efficiency
Microprocessor    Year    Clock Rate    Pipeline Stages    Issue width    Out-of-order/Speculation    Cores    Power
i486              1989    25MHz         5                  1              No                          1        5W
Pentium           1993    66MHz         5                  2              No                          1        10W
Pentium Pro       1997    200MHz        10                 3              Yes                         1        29W
P4 Willamette     2001    2000MHz       22                 3              Yes                         1        75W
P4 Prescott       2004    3600MHz       31                 3              Yes                         1        103W
Core              2006    2930MHz       14                 4              Yes                         2        75W
UltraSparc III    2003    1950MHz       14                 4              No                          1        90W
UltraSparc T1     2005    1200MHz       6                  1              No                          8        70W

4.11 Real Stuff: The AMD Opteron X4 (Barcelona) Pipeline
The Opteron X4 Microarchitecture
Annotation: 72 physical registers.

The Opteron X4 Pipeline Flow

4.13 Fallacies and Pitfalls
Fallacies

Pitfalls

4.14 Concluding Remarks
