0
CS 211: Computer Architecture     Instructor: Prof. Bhagi Narahari       Dept. of Computer Science              Course URL...
How to improve performance? • Recall performance is function of    – CPI: cycles per instruction    – Clock cycle    – Ins...
How to improve performance? • First step is to apply concept of pipelining to   the instruction execution process    – Ove...
Pipeline Approach to ImproveSystem Performance • Analogous to fluid flow in pipelines and   assembly line in factories • D...
Instruction Pipeline • Instruction execution process lends itself   naturally to pipelining    – overlap the subtasks of i...
Linear Pipeline                         3          ProcessorLinear pipeline processes a sequence ofsubtasks with linear pr...
Synchronous Pipeline                         4All transfers simultaneousOne task or operation enters the pipeline per cycl...
Time Space Utilization of Pipeline                         Full pipeline after 4 cycles S3                 T1         T2  ...
Asynchronous                                      5            PipelineTransfers performed when individual processors are ...
Pipeline Clock and                           6             Timing                      Si                  Si+1           ...
Speedup and                                             7             Efficiencyk-stage pipeline processes n tasks in k + ...
10             Efficiency and Throughput Efficiency of the k-stages pipeline :                         Sk             n   ...
Pipeline Performance: Example•   Task has 4 subtasks with time: t1=60, t2=50, t3=90, and t4=80 ns    (nanoseconds)•   latc...
Non-linear pipelines and pipelinecontrol algorithms • Can have non-linear path in pipeline…    – How to schedule instructi...
Non-linear Dynamic Pipelines •   Multiple processors (k-stages) as linear pipeline •   Variable functions of individual pr...
Reservation Tables • Reservation table : displays time-space flow of data   through the pipeline analogous to opcode of pi...
14Reservation Tables (Examples)                                CS211 17
Latency Analysis • Latency : the number of clock cycles   between two initiations of the pipeline • Collision : an attempt...
16          Collisions (Example)1    2     3    4      5      6        7      8       9      10x1        x2          x3   ...
17                                  Latency Cycle1    2    3    4    5    6   7    8    9 10 11 12 13 14 15 16 17 18x1    ...
18              Collision Free SchedulingGoal : to find the shortest average latencyLengths : for reservation table with n...
19                     Collision Vector                          Reservation Table                 x                      ...
Back to our focus: ComputerPipelines • Execute billions of instructions, so   throughput is what matters • MIPS desirable ...
Designing a PipelinedProcessor• Go back and examine your datapath and control  diagram• associated resources with states• ...
5 Steps of MIPS DatapathWhat do we need to do to pipeline the process ?            Instruction           Instr. Decode    ...
5 Steps of MIPS/DLX Datapath           Instruction            Instr. Decode                  Execute                    Me...
Graphically RepresentingPipelines• Can help with answering questions like:   – how many cycles does it take to execute thi...
Visualizing Pipelining                          Time (clock cycles)       Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 ...
Conventional Pipelined Execution  Representation TimeIFetch Dcd     Exec   Mem    WB       IFetch Dcd     Exec   Mem    WB...
Single Cycle, Multiple Cycle, vs.Clk      PipelineCycle 1           Cycle 2Single Cycle Implementation:                   ...
The Five Stages of    Load Cycle 1 Cycle 2 Cycle 3        Cycle 4 Cycle 5         Load Ifetch   Reg/Dec   Exec    Mem     ...
The Four Stages of R-  type Cycle 1 Cycle 2 Cycle 3 Cycle 4      R-type Ifetch   Reg/Dec   Exec   Wr• Ifetch: Instruction ...
Pipelining the R-type and Load        Instruction 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8          Cycle 1 Cycle 2 Cycle...
Important•   Observation    Each functional unit can only be used once per  instruction• Each functional unit must be used...
Solution 1: Insert “Bubble” into the        Pipeline2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9         Cycl...
Solution 2: Delay R-type’s Write by •    One Cycle     Delay R-type’s register write by one cycle:        – Now R-type ins...
WhyPipeline?• Suppose we execute 100 instructions• Single Cycle Machine   – 45 ns/cycle x 1 CPI x 100 inst = 4500 ns• Mult...
Why Pipeline? Because the resources     are there!   Time (clock cycles)                         ALU I            Im   Reg...
Problems with Pipelineprocessors?• Limits to pipelining: Hazards prevent next instruction from  executing during its desig...
Back to our old friend: CPU timeequation • Recall equation for CPU time • So what are we doing by pipelining the instructi...
Speed Up Equation for Pipelining CPIpipelined = Ideal CPI + Average Stall cycles per Inst            Ideal CPI × Pipeline ...
One Memory Port/Structural Hazards      Time (clock cycles)         Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle ...
One Memory Port/Structural Hazards      Time (clock cycles)         Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle ...
Example: Dual-port vs. Single-port• Machine A: Dual ported memory (“Harvard  Architecture”)• Machine B: Single ported memo...
Example… • Machine A: Dual ported memory (“Harvard   Architecture”) • Machine B: Single ported memory, but its pipelined  ...
Data Dependencies • True dependencies and False dependencies   – false implies we can remove the dependency   – true impli...
Three Generic Data Hazards• Read After Write (RAW)  InstrJ tries to read operand before InstrI writes it            I: add...
RAW Dependency • Example program (a) with two instructions    – i1: load r1, a;    – i2: add r2, r1,r1; • Program (b) with...
Three Generic Data Hazards• Write After Read (WAR)  InstrJ writes operand before InstrI reads it             I: sub r4,r1,...
Three Generic Data Hazards• Write After Write (WAW)  InstrJ writes operand before InstrI writes it.• Called an “output dep...
WAR and WAW Dependency• Example program (a):  – i1: mul r1, r2, r3;  – i2: add r2, r4, r5;• Example program (b):  – i1: mu...
What to do with WAR and WAW ? • Problem:    – i1: mul r1, r2, r3;    – i2: add r2, r4, r5; • Is this really a dependence/h...
What to do with WAR and WAW • Solution: Rename Registers   – i1: mul r1, r2, r3;   – i2: add r6, r4, r5; • Register renami...
Hazard Detection in H/W • Suppose instruction i is about to be issued and a   predecessor instruction j is in the instruct...
Hazard Detection in   Hardware• A RAW hazard exists on register ρ if ρ ∈ Rregs( i ) ∩ Wregs( j )   – Keep a record of pend...
Internal Forwarding: Getting rid ofsome hazards • In some cases the data needed by the next   instruction at the ALU stage...
Data Hazard on R1      Time (clock cycles)                       IF ID/RF EX               MEM      WBI                   ...
Forwarding to Avoid Data Hazard                   Time (clock cycles) I n                                      ALU     add...
Internal Forwarding ofInstructions • Forward result from ALU/Execute unit to   execute unit in next stage • Also can be us...
Internal Data                                        38                     Forwarding                         Store-load ...
Internal Data                                             39                    Forwarding                        Load-loa...
Internal Data                                      40                   Forwarding                         Store-store for...
HW Change for Forwarding NextPC                        mux    Registers                                                   ...
What about memory       operations?in the same stage,º If instructions are initiated in order and  operations always occur...
Data Hazard Even with Forwarding     Time (clock cycles)I                                      ALU     lw r1, 0(r2) Ifetch...
Data Hazard Even with Forwarding         Time (clock cycles)In     lw r1, 0(r2)                                          A...
What can we (S/W) do?                        CS211 67
Software Scheduling to Avoid LoadHazards  Try producing fast code for        a = b + c;        d = e – f;  assuming a, b, ...
Control Hazards: Branches• Instruction flow   – Stream of instructions processed by Inst. Fetch unit   – Speed of “input f...
Control Hazard on Branches   Three Stage Stall                                         ALU10: beq r1,r3,36     Ifetch    R...
Branch Stall Impact• If CPI = 1, 30% branch,       Stall 3 cycles => new CPI = 1.9!• Two part solution:  – Determine branc...
Pipelined MIPS (DLX) DatapathInstruction   Instr. Decode    Execute          Memory      Write  Fetch        Reg. Fetch   ...
Four Branch Hazard Alternatives#1: Stall until branch direction is clear – flushing pipe#2: Predict Branch Not Taken    – ...
Four Branch Hazard Alternatives#4: Delayed Branch   – Define branch to take place AFTER a following instruction     branch...
CS211 75
Delayed Branch• Where to get instructions to fill branch delay slot?   –   Before branch instruction   –   From the target...
Evaluating Branch Alternatives  P ip e lin e s p e e d u p =                     P ip e lin e d e p th                    ...
Branch Prediction based onhistory • Can we use history of branch behaviour to predict   branch outcome ? • Simplest scheme...
Summary :  Control and Pipelining• Just overlap tasks; easy if tasks are independent• Speed Up ≤ Pipeline Depth; if ideal ...
Summary #1/2:•Pipelining it easy  What makes   – all instructions are the same length   – just a few instruction formats  ...
Introduction to ILP • What is ILP?    – Processor and Compiler design techniques that      speed up execution by causing i...
Compiler vs. Processor               Compiler                                                                Hardware    F...
Upcoming SlideShare
Loading in...5
×

Pipeline

1,625

Published on

Published in: Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,625
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
61
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • This presentation contains some slides about data security
  • Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than the multiple cycle implementation. But may be more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction so the last part of the cycle here is wasted.
  • As shown here, each of these five steps will take one clock cycle to complete. And in pipeline terminology, each step is referred to as one stage of the pipeline.
  • Let’s take a look at the R-type instructions. The R-type instruction does NOT access data memory so it only takes four clock cycles, or in our new pipeline terminology, four stages to complete. The Ifetch and Reg/Dec stages are identical to the Load instructions. Well they have to be because at this point, we do not know we have a R-type instruction yet. Instead of calculating the effective address during the Exec stage, the R-type instruction will use the ALU to operate on the register operands. The result of this ALU operation is written back to the register file during the Wr back stage.
  • What happened if we try to pipeline the R-type instructions with the Load instructions? Well, we have a problem here!!! We end up having two instructions trying to write to the register file at the same time! Why do we have this problem (the write “bubble”)?
  • The first solution is to insert a “bubble” into the pipeline AFTER the load instruction to push back every instruction after the load that are already in the pipeline by one cycle. At the same time, the bubble will delay the Instruction Fetch of the instruction that is about to enter the pipeline by one cycle. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution also has a negative impact on performance. Notice that due to the “extra” stage (Mem) Load instruction has, we will not have one instruction finishes every cycle (points to Cycle 5). Consequently, a mix of load and R-type instruction will NOT have an average CPI of 1 because in effect, the Load instruction has an effective CPI of 2. So this is not that hot an idea Let’s try something else.
  • Well one thing we can do is to add a “Nop” stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file’s witer port at its 5th stage so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also gets back to having one instruction completes per cycle. This is kind of like promoting socialism: by making each individual R-type instruction takes 5 cycles instead of 4 cycles to finish, our overall performance is actually better off. The reason for this higher performance is that we end up having a more efficient pipeline.
  • Transcript of "Pipeline"

    1. 1. CS 211: Computer Architecture Instructor: Prof. Bhagi Narahari Dept. of Computer Science Course URL: www.seas.gwu.edu/~narahari/cs211/ CS211 1
    2. 2. How to improve performance? • Recall performance is function of – CPI: cycles per instruction – Clock cycle – Instruction count • Reducing any of the 3 factors will lead to improved performance CS211 2
    3. 3. How to improve performance? • First step is to apply concept of pipelining to the instruction execution process – Overlap computations • What does this do? – Decrease clock cycle – Decrease effective CPU time compared to original clock cycle • Appendix A of Textbook – Also parts of Chapter 2 CS211 3
    4. 4. Pipeline Approach to ImproveSystem Performance • Analogous to fluid flow in pipelines and assembly line in factories • Divide process into “stages” and send tasks into a pipeline – Overlap computations of different tasks by operating on them concurrently in different stages CS211 4
    5. 5. Instruction Pipeline • Instruction execution process lends itself naturally to pipelining – overlap the subtasks of instruction fetch, decode and execute CS211 5
    6. 6. Linear Pipeline 3 ProcessorLinear pipeline processes a sequence ofsubtasks with linear precedenceAt a higher level - Sequence of processorsData flowing in streams from stage S1 to thefinal stage SkControl of data flow : synchronous orasynchronous S1 S2 • • • • Sk CS211 6
    7. 7. Synchronous Pipeline 4All transfers simultaneousOne task or operation enters the pipeline per cycleProcessors reservation table : diagonal CS211 7
    8. 8. Time Space Utilization of Pipeline Full pipeline after 4 cycles S3 T1 T2 T1 S2 T2 T3 T1 T2 T3 T4 S1 1 2 3 4 Time (in pipeline cycles) CS211 8
    9. 9. Asynchronous 5 PipelineTransfers performed when individual processors are readyHandshaking protocol between processorsMainly used in multiprocessor systems with message-passing CS211 9
    10. 10. Pipeline Clock and 6 Timing Si Si+1 τ τm dClock cycle of the pipeline : τLatch delay : d τ = max {τ m } + dPipeline frequency : f f=1/τ CS211 10
    11. 11. Speedup and 7 Efficiencyk-stage pipeline processes n tasks in k + (n-1) clockcycles:k cycles for the first task and n-1 cyclesfor the remaining n-1 tasksTotal time to process n tasks Tk = [ k + (n-1)] τFor the non-pipelined processor T1 = n k τSpeedup factor T1 nkτ nk Sk = = [ k + (n-1)] τ = k + (n-1) Tk CS211 11
    12. 12. 10 Efficiency and Throughput Efficiency of the k-stages pipeline : Sk n Ek = = k k + (n-1)Pipeline throughput (the number of tasks per unit time) :note equivalence to IPC n nf Hk = = [ k + (n-1)] τ k + (n-1) CS211 12
    13. 13. Pipeline Performance: Example• Task has 4 subtasks with time: t1=60, t2=50, t3=90, and t4=80 ns (nanoseconds)• latch delay = 10• Pipeline cycle time = 90+10 = 100 ns• For non-pipelined execution – time = 60+50+90+80 = 280 ns• Speedup for above case is: 280/100 = 2.8 !!• Pipeline Time for 1000 tasks = 1000 + 4-1= 1003*100 ns• Sequential time = 1000*280ns• Throughput= 1000/1003• What is the problem here ?• How to improve performance ? CS211 13
    14. 14. Non-linear pipelines and pipelinecontrol algorithms • Can have non-linear path in pipeline… – How to schedule instructions so they do no conflict for resources • How does one control the pipeline at the microarchitecture level – How to build a scheduler in hardware ? – How much time does scheduler have to make decision ? CS211 14
    15. 15. Non-linear Dynamic Pipelines • Multiple processors (k-stages) as linear pipeline • Variable functions of individual processors • Functions may be dynamically assigned • Feedforward and feedback connections CS211 15
    16. 16. Reservation Tables • Reservation table : displays time-space flow of data through the pipeline analogous to opcode of pipeline – Not diagonal, as in linear pipelines • Multiple reservation tables for different functions – Functions may be dynamically assigned • Feedforward and feedback connections • The number of columns in the reservation table : evaluation time of a given function CS211 16
    17. 17. 14Reservation Tables (Examples) CS211 17
    18. 18. Latency Analysis • Latency : the number of clock cycles between two initiations of the pipeline • Collision : an attempt by two initiations to use the same pipeline stage at the same time • Some latencies cause collision, some not CS211 18
    19. 19. 16 Collisions (Example)1 2 3 4 5 6 7 8 9 10x1 x2 x3 x1 x4 x1x2 x2x3 x1 x1x2 x2x3 x3x4 x4 x1 x1x2 x1x2x4 x2x3x4 Latency = 2 CS211 19
    20. 20. 17 Latency Cycle1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18x1 x1 x2 x1 x2 x3 x2 x3 x1 x1 x2 x2 x3 x3 x1 x1 x1 x2 x2 x2 x3 x3 Cycle Cycle Latency cycle : the sequence of initiations which has repetitive subsequence and without collisions Latency sequence length : the number of time intervals within the cycle Average latency : the sum of all latencies divided by the number of latencies along the cycle CS211 20
    21. 21. 18 Collision Free SchedulingGoal : to find the shortest average latencyLengths : for reservation table with n columns, maximum forbidden latency is m <= n – 1, and permissible latency p is 1 <= p <= m – 1Ideal case : p = 1 (static pipeline)Collision vector : C = (CmCm-1 . . .C2C1) [ Ci = 1 if latency i causes collision ] [ Ci = 0 for permissible latencies ] CS211 21
    22. 22. 19 Collision Vector Reservation Table x x x x x xx1 x1 x2 x2 x1 x1 x2 x2 x1 x1 x2 x2Value X1 Value X2 C = (? ? . . . ? ?) CS211 22
    23. 23. Back to our focus: ComputerPipelines • Execute billions of instructions, so throughput is what matters • MIPS desirable features: – all instructions same length, – registers located in same place in instruction format, – memory operands only in loads or stores CS211 23
    24. 24. Designing a PipelinedProcessor• Go back and examine your datapath and control diagram• associated resources with states• ensure that flows do not conflict, or figure out how to resolve• assert control in appropriate stage CS211 24
    25. 25. 5 Steps of MIPS DatapathWhat do we need to do to pipeline the process ? Instruction Instr. Decode Execute Memory Write Fetch Reg. Fetch Addr. Calc Access Back Next PC MUX Next SEQ PC Adder 4 RS1 Zero? MUX MUX RS2 Address Memory Reg File Inst ALU Memory RD L Data M MUX D Sign Imm Extend WB Data CS211 25
    26. 26. 5 Steps of MIPS/DLX Datapath Instruction Instr. Decode Execute Memory Write Fetch Reg. Fetch Addr. Calc Access Back Next PC MUX Next SEQ PC Next SEQ PC Adder 4 RS1 Zero? MUX MUX MEM/WB Address Memory EX/MEM RS2 Reg File ID/EX IF/ID ALU Memory Data MUX WB Data Sign Extend Imm RD RD RD• Data stationary control – local decode for each instruction phase / pipeline stage CS211 26
    27. 27. Graphically RepresentingPipelines• Can help with answering questions like: – how many cycles does it take to execute this code? – what is the ALU doing during cycle 4? – use this representation to help understand datapaths CS211 27
    28. 28. Visualizing Pipelining Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7I ALUn Ifetch Reg DMem Regstr. ALU Ifetch Reg DMem RegOr ALU Ifetch Reg DMem Regder ALU Ifetch Reg DMem Reg CS211 28
    29. 29. Conventional Pipelined Execution Representation TimeIFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WBProgram Flow IFetch Dcd Exec Mem WB CS211 29
    30. 30. Single Cycle, Multiple Cycle, vs.Clk PipelineCycle 1 Cycle 2Single Cycle Implementation: Load Store Waste Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10ClkMultiple Cycle Implementation: Load Store R-type Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem IfetchPipeline Implementation:Load Ifetch Reg Exec Mem Wr Store Ifetch Reg Exec Mem Wr R-type Ifetch Reg Exec Mem Wr CS211 30
    31. 31. The Five Stages of Load Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load Ifetch Reg/Dec Exec Mem Wr• Ifetch: Instruction Fetch – Fetch the instruction from the Instruction Memory• Reg/Dec: Registers Fetch and Instruction Decode• Exec: Calculate the memory address• Mem: Read the data from the Data Memory• Wr: Write the data back to the register file CS211 31
    32. 32. The Four Stages of R- type Cycle 1 Cycle 2 Cycle 3 Cycle 4 R-type Ifetch Reg/Dec Exec Wr• Ifetch: Instruction Fetch – Fetch the instruction from the Instruction Memory• Reg/Dec: Registers Fetch and Instruction Decode• Exec: – ALU operates on the two register operands – Update PC• Wr: Write the ALU output back to the register file CS211 32
    33. 33. Pipelining the R-type and Load Instruction 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 1 Cycle 2 Cycle Cycle 9ClockR-type Ifetch Reg/Dec Exec Wr Ops! We have a problem! R-type Ifetch Reg/Dec Exec Wr Load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Exec Wr • We have pipeline conflict or structural hazard: – Two instructions try to write to the register file at the same time! – Only one write port CS211 33
    34. 34. Important• Observation Each functional unit can only be used once per instruction• Each functional unit must be used at the same stage for all instructions: – Load uses Register File’s Write Port during its 5th stage 1 2 3 4 5 Load Ifetch Reg/Dec Exec Mem Wr – R-type uses Register File’s Write Port during its 4th stage 1 2 3 4 R-type Ifetch Reg/Dec Exec Wr° 2 ways to solve this pipeline hazard. CS211 34
    35. 35. Solution 1: Insert “Bubble” into the Pipeline2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 1 CycleClock Ifetch Reg/Dec Exec Wr Load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Pipeline Exec Wr R-type Ifetch Bubble Reg/Dec Exec Wr Ifetch Reg/Dec Exec • Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle – The control logic can be complex. – Lose instruction fetch and issue opportunity. • No instruction is started in Cycle 6! CS211 35
    36. 36. Solution 2: Delay R-type’s Write by • One Cycle Delay R-type’s register write by one cycle: – Now R-type instructions also use Reg File’s write port at Stage 5 – Mem stage is a NOOP stage: nothing is being done. 1 2 3 4 5 R-type Ifetch Reg/Dec Exec Mem Wr Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9ClockR-type Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr Load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr CS211 36
    37. 37. WhyPipeline?• Suppose we execute 100 instructions• Single Cycle Machine – 45 ns/cycle x 1 CPI x 100 inst = 4500 ns• Multicycle Machine – 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns• Ideal pipelined machine – 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns CS211 37
    38. 38. Why Pipeline? Because the resources are there! Time (clock cycles) ALU I Im Reg Dm Regn Inst 0s ALU t Inst 1 Im Reg Dm Regr. ALUO Inst 2 Im Reg Dm Regrd Inst 3 ALU Im Reg Dm Reger Inst 4 ALU Im Reg Dm Reg CS211 38
    39. 39. Problems with Pipelineprocessors?• Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle and introduce stall cycles which increase CPI – Structural hazards: HW cannot support this combination of instructions - two dogs fighting for the same bone – Data hazards: Instruction depends on result of prior instruction still in the pipeline » Data dependencies – Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps). » Control dependencies• Can always resolve hazards by stalling• More stall cycles = more CPU time = less performance – Increase performance = decrease stall cycles CS211 39
    40. 40. Back to our old friend: CPU timeequation • Recall equation for CPU time • So what are we doing by pipelining the instruction execution process ? – Clock ? – Instruction Count ? – CPI ? » How is CPI effected by the various hazards ? CS211 40
    41. 41. Speed Up Equation for Pipelining CPIpipelined = Ideal CPI + Average Stall cycles per Inst Ideal CPI × Pipeline depth Cycle Timeunpipelined Speedup = × Ideal CPI + Pipeline stall CPI Cycle TimepipelinedFor simple RISC pipeline, CPI = 1: Pipeline depth Cycle Timeunpipelined Speedup = × 1 + Pipeline stall CPI Cycle Timepipelined CS211 41
    42. 42. One Memory Port/Structural Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7I Load Ifetch ALU Reg DMem Regns ALU Regt Instr 1 Ifetch Reg DMemr. ALU Ifetch Reg DMem Reg Instr 2Or ALU Ifetch Reg DMem Regd Instr 3er ALU Ifetch Reg DMem Reg Instr 4 CS211 42
    43. 43. One Memory Port/Structural Hazards Time (clock cycles) Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7I Load Ifetch ALU Reg DMem Regns ALU Regt Instr 1 Ifetch Reg DMemr. ALU Ifetch Reg DMem Reg Instr 2Or Stall Bubble Bubble Bubble Bubble Bubbleder ALU Ifetch Reg DMem Reg Instr 3 CS211 43
    44. 44. Example: Dual-port vs. Single-port• Machine A: Dual ported memory (“Harvard Architecture”)• Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate• Ideal CPI = 1 for both• Note - Loads will cause stalls of 1 cycle• Recall our friend: – CPU = IC*CPI*Clk » CPI= ideal CPI + stalls CS211 44
    45. 45. Example… • Machine A: Dual ported memory (“Harvard Architecture”) • Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate • Ideal CPI = 1 for both • Loads are 40% of instructions executed SpeedUpA = Pipe. Depth/(1 + 0) x (clockunpipe/clockpipe) = Pipeline Depth SpeedUpB = Pipe. Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05) = (Pipe. Depth/1.4) x 1.05 = 0.75 x Pipe. Depth SpeedUpA / SpeedUpB = Pipe. Depth/(0.75 x Pipe. Depth) = 1.33 • Machine A is 1.33 times faster CS211 45
    46. 46. Data Dependencies • True dependencies and False dependencies – false implies we can remove the dependency – true implies we are stuck with it! • Three types of data dependencies defined in terms of how succeeding instruction depends on preceding instruction – RAW: Read after Write or Flow dependency – WAR: Write after Read or anti-dependency – WAW: Write after Write CS211 46
    47. 47. Three Generic Data Hazards• Read After Write (RAW) InstrJ tries to read operand before InstrI writes it I: add r1,r2,r3 J: sub r4,r1,r3• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. CS211 47
    48. 48. RAW Dependency • Example program (a) with two instructions – i1: load r1, a; – i2: add r2, r1,r1; • Program (b) with two instructions – i1: mul r1, r4, r5; – i2: add r2, r1, r1; • Both cases we cannot read in i2 until i1 has completed writing the result – In (a) this is due to load-use dependency – In (b) this is due to define-use dependency CS211 48
    49. 49. Three Generic Data Hazards• Write After Read (WAR) InstrJ writes operand before InstrI reads it I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.• Can’t happen in MIPS 5 stage pipeline because: – All instructions take 5 stages, and – Reads are always in stage 2, and – Writes are always in stage 5 CS211 49
    50. 50. Three Generic Data Hazards• Write After Write (WAW) InstrJ writes operand before InstrI writes it.• Called an “output dependence” by compiler writers This also results from the reuse of name “r1”.• Can’t happen in MIPS 5 stage pipeline because: – All instructions take 5 stages, and – Writes are always in stage 5• Will see WAR and WAW in later more complicated pipes I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 CS211 50
    51. 51. WAR and WAW Dependency• Example program (a): – i1: mul r1, r2, r3; – i2: add r2, r4, r5;• Example program (b): – i1: mul r1, r2, r3; – i2: add r1, r4, r5;• both cases we have dependence between i1 and i2 – in (a) due to r2 must be read before it is written into – in (b) due to r1 must be written by i2 after it has been written into by i1 CS211 51
    52. 52. What to do with WAR and WAW ? • Problem: – i1: mul r1, r2, r3; – i2: add r2, r4, r5; • Is this really a dependence/hazard ? CS211 52
    53. 53. What to do with WAR and WAW • Solution: Rename Registers – i1: mul r1, r2, r3; – i2: add r6, r4, r5; • Register renaming can solve many of these false dependencies – note the role that the compiler plays in this – specifically, the register allocation process--i.e., the process that assigns registers to variables CS211 53
    54. 54. Hazard Detection in H/W • Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline • How to detect and store potential hazard information – Note that hazards in machine code are based on register usage – Keep track of results in registers and their usage » Constructing a register data flow graph • For each instruction i construct set of Read registers and Write registers – Rregs(i) is set of registers that instruction i reads from – Wregs(i) is set of registers that instruction i writes to – Use these to define the 3 types of data hazards CS211 54
    55. 55. Hazard Detection in Hardware• A RAW hazard exists on register ρ if ρ ∈ Rregs( i ) ∩ Wregs( j ) – Keep a record of pending writes (for insts in the pipe) and compare with operand regs of current instruction. – When instruction issues, reserve its result register. – When on operation completes, remove its write reservation.• A WAW hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Wregs( j )• A WAR hazard exists on register ρ if ρ ∈ Wregs( i ) ∩ Rregs( j ) CS211 55
    56. 56. Internal Forwarding: Getting rid ofsome hazards • In some cases the data needed by the next instruction at the ALU stage has been computed by the ALU (or some stage defining it) but has not been written back to the registers • Can we “forward” this result by bypassing stages ? CS211 56
    57. 57. Data Hazard on R1 Time (clock cycles) IF ID/RF EX MEM WBI ALU Reg add r1,r2,r3 Ifetch Reg DMemnst ALU Ifetch Reg DMem Reg sub r4,r1,r3r. ALUO Ifetch Reg DMem Reg and r6,r1,r7rd ALU Ifetch Reg DMem Rege or r8,r1,r9r ALU Ifetch Reg DMem Reg xor r10,r1,r11 CS211 57
    58. 58. Forwarding to Avoid Data Hazard Time (clock cycles) I n ALU add r1,r2,r3 Ifetch Reg DMem Reg s tr. ALU sub r4,r1,r3 Ifetch Reg DMem RegOr ALU Ifetch Reg DMem Regd and r6,r1,r7er ALU Ifetch Reg DMem Reg or r8,r1,r9 ALU Ifetch Reg DMem Reg xor r10,r1,r11 CS211 58
    59. 59. Internal Forwarding ofInstructions • Forward result from ALU/Execute unit to execute unit in next stage • Also can be used in cases of memory access • in some cases, operand fetched from memory has been computed previously by the program – can we “forward” this result to a later stage thus avoiding an extra read from memory ? – Who does this ? • Internal forwarding cases – Stage i to Stage i+k in pipeline – store-load forwarding – load-store forwarding – store-store forwarding CS211 59
    60. 60. Internal Data 38 Forwarding Store-load forwarding Memory Memory M M Access Unit Access UnitR1 R2 R1 R2 STO M,R1 LD R2,M STO M,R1 MOVE R2,R1 CS211 60
    61. 61. Internal Data 39 Forwarding Load-load forwarding Memory Memory M M Access Unit Access UnitR1 R2 R1 R2 LD R1,M LD R2,M LD R1,M MOVE R2,R1 CS211 61
    62. 62. Internal Data 40 Forwarding Store-store forwarding Memory Memory M M Access Unit Access UnitR1 R2 R1 R2 STO M,R2 STO M, R1 STO M,R2 CS211 62
    63. 63. HW Change for Forwarding NextPC mux Registers MEM/WR EX/MEM ALU ID/EX Data mux Memory muxImmediate CS211 63
    64. 64. What about memory operations?in the same stage,º If instructions are initiated in order and operations always occur op Rd Ra Rb there can be no hazards between memory operations!º What does delaying WB on arithmetic operations cost? – cycles ? op Rd Ra Rb A B – hardware ?º What about data dependence on loads? R1 <- R4 + R5 R2 <- Mem[ R2 + I ] Rd D R R3 <- R2 + R1 ⇒ “Delayed Loads” Memº Can recognize this in decode stage and Rd introduce bubble while stalling fetch stage Tº Tricky situation: to reg R1 <- Mem[ R2 + I ] file Mem[R3+34] <- R1 Handle with bypass in memory stage! CS211 64
    65. 65. Data Hazard Even with Forwarding Time (clock cycles)I ALU lw r1, 0(r2) Ifetch Reg DMem Regnst ALU Ifetch Reg DMem Reg sub r4,r1,r6r.O ALU Ifetch Reg DMem Reg and r6,r1,r7rde ALU Ifetch Reg DMem Regr or r8,r1,r9 CS211 65
    66. 66. Data Hazard Even with Forwarding Time (clock cycles)In lw r1, 0(r2) ALU Ifetch Reg DMem Regstr. ALU sub r4,r1,r6 Ifetch Reg Bubble DMem RegOrd Bubble ALU Ifetch Reg DMem Rege and r6,r1,r7r ALU Bubble Ifetch Reg DMem or r8,r1,r9 CS211 66
    67. 67. What can we (S/W) do? CS211 67
    68. 68. Software Scheduling to Avoid LoadHazards Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: Fast code: LW Rb,b LW Rb,b LW Rc,c LW Rc,c ADD Ra,Rb,Rc LW Re,e SW a,Ra ADD Ra,Rb,Rc LW Re,e LW Rf,f LW Rf,f SW a,Ra SUB Rd,Re,Rf SUB Rd,Re,Rf CS211 68 SW d,Rd SW d,Rd
    69. 69. Control Hazards: Branches• Instruction flow – Stream of instructions processed by Inst. Fetch unit – Speed of “input flow” puts bound on rate of outputs generated• Branch instruction affects instruction flow – Do not know next instruction to be executed until branch outcome known• When we hit a branch instruction – Need to compute target address (where to branch) – Resolution of branch condition (true or false) – Might need to ‘flush’ pipeline if other instructions have been fetched for execution CS211 69
    70. 70. Control Hazard on Branches Three Stage Stall ALU10: beq r1,r3,36 Ifetch Reg DMem Reg ALU Ifetch Reg DMem Reg14: and r2,r3,r5 ALU Ifetch Reg DMem Reg18: or r6,r1,r7 ALU Ifetch Reg DMem Reg22: add r8,r1,r9 ALU Reg36: xor r10,r1,r11 Ifetch Reg DMem CS211 70
    71. 71. Branch Stall Impact• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!• Two part solution: – Determine branch taken or not sooner, AND – Compute taken branch address earlier• MIPS branch tests if register = 0 or ≠ 0• MIPS Solution: – Move Zero test to ID/RF stage – Adder to calculate new PC in ID/RF stage – 1 clock cycle penalty for branch versus 3 CS211 71
    72. 72. Pipelined MIPS (DLX) DatapathInstruction Instr. Decode Execute Memory Write Fetch Reg. Fetch Addr. Calc. Access Back This is the correct 1 cycle latency implementation! CS211 72
    73. 73. Four Branch Hazard Alternatives#1: Stall until branch direction is clear – flushing pipe#2: Predict Branch Not Taken – Execute successor instructions in sequence – “Squash” instructions in pipeline if branch actually taken – Advantage of late pipeline state update – 47% DLX branches not taken on average – PC+4 already calculated, so use it to get next instruction#3: Predict Branch Taken – 53% DLX branches taken on average – But haven’t calculated branch target address in DLX » DLX still incurs 1 cycle branch penalty » Other machines: branch target known before outcome CS211 73
    74. 74. Four Branch Hazard Alternatives#4: Delayed Branch – Define branch to take place AFTER a following instruction branch instruction sequential successor1 sequential successor2 ........ sequential successorn Branch delay of length n branch target if taken – 1 slot delay allows proper decision and branch target address in 5 stage pipeline – DLX uses this CS211 74
    75. 75. CS211 75
    76. 76. Delayed Branch• Where to get instructions to fill branch delay slot? – Before branch instruction – From the target address: only valuable when branch taken – From fall through: only valuable when branch not taken – Cancelling branches allow more slots to be filled• Compiler effectiveness for single branch delay slot: – Fills about 60% of branch delay slots – About 80% of instructions executed in branch delay slots useful in computation – About 50% (60% x 80%) of slots usefully filled• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar) CS211 76
    77. 77. Evaluating Branch Alternatives P ip e lin e s p e e d u p = P ip e lin e d e p th 1 + B r a n c h f r e q u e n c y × B r a n c h p e n a ltyScheduling Branch CPI speedup v. speedup v. scheme penalty unpipelined stallStall pipeline 3 1.42 3.5 1.0Predict taken 1 1.14 4.4 1.26Predict not taken 1 1.09 4.5 1.29Delayed branch 0.5 1.07 4.6 1.31Conditional & Unconditional = 14%, 65% change PC CS211 77
    78. 78. Branch Prediction based onhistory • Can we use history of branch behaviour to predict branch outcome ? • Simplest scheme: use 1 bit of “history” – Set bit to Predict Taken (T) or Predict Not-taken (NT) – Pipeline checks bit value and predicts » If incorrect then need to invalidate instruction – Actual outcome used to set the bit value • Example: let initial value = T, actual outcome of branches is- NT, T,T,NT,T,T – Predictions are: T, NT,T,T,NT,T » 3 wrong (in red), 3 correct = 50% accuracy • In general, can have k-bit predictors: more when we cover superscalar processors. CS211 78
    79. 79. Summary : Control and Pipelining• Just overlap tasks; easy if tasks are independent• Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then: Pipeline depth Cycle Timeunpipelined Speedup = × 1 + Pipeline stall CPI Cycle Timepipelined• Hazards limit performance on computers: – Structural: need more HW resources – Data (RAW,WAR,WAW): need forwarding, compiler scheduling – Control: delayed branch, prediction CS211 79
    80. 80. Summary #1/2:•Pipelining it easy What makes – all instructions are the same length – just a few instruction formats – memory operands appear only in loads and stores• What makes it hard? HAZARDS! – structural hazards: suppose we had only one memory – control hazards: need to worry about branch instructions – data hazards: an instruction depends on a previous instruction• Pipelines pass control information down the pipe just as data moves down pipe• Forwarding/Stalls handled by local control• Exceptions stop the pipeline CS211 80
    81. 81. Introduction to ILP • What is ILP? – Processor and Compiler design techniques that speed up execution by causing individual machine operations to execute in parallel • ILP is transparent to the user – Multiple operations executed in parallel even though the system is handed a single program written with a sequential processor in mind • Same execution hardware as a normal RISC machine – May be more than one of any given type of hardware CS211 81
    82. 82. Compiler vs. Processor Compiler Hardware Frontend and Optimizer Superscalar Determine Dependences Dataflow Determine Dependences Determine Independences Indep. Arch. Determine IndependencesBind Operations to Function Units Bind Operations to Function Units VLIW Bind Transports to Busses Bind Transports to Busses TTA Execute B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel: History overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993. CS211 82
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×