Computer Architecture
計算機結構

Lecture 3
Instruction-Level Parallelism and Its Exploitation
(Chapter 2 in textbook)

Ping-Liang Lai (賴秉樑)


Outline

2.1  Instruction-Level Parallelism: Concepts and Challenges
2.2  Basic Compiler Techniques for Exposing ILP
2.3  Reducing Branch Costs with Prediction
2.4  Overcoming Data Hazards with Dynamic Scheduling
2.5  Dynamic Scheduling: Examples and the Algorithm
2.6  Hardware-Based Speculation
2.7  Exploiting ILP Using Multiple Issue and Static Scheduling
2.8  Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation
2.1 ILP: Concepts and Challenges

Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance.
Two approaches to exploit ILP:
1) Rely on hardware to discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power), and
2) Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2).
Pipelining review:
   Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls


Instruction-Level Parallelism (ILP)

ILP within a basic block (BB) is quite small.
   BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit;
   Average dynamic branch frequency is 15% to 25%;
      » 3 to 6 instructions execute between a pair of branches.
   Plus, instructions in a BB are likely to depend on each other.
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (ILP → LLP).
   Loop-Level Parallelism (LLP): exploit parallelism among iterations of a loop.
   E.g., add two 1000-element arrays:

              for (i=1; i<=1000; i=i+1)
                  x[i] = x[i] + y[i];
Data Dependence and Hazards

Three types of dependences: data dependences (true data dependences), name dependences, and control dependences.
Instruction j is data dependent on instruction i if:
1. Instruction i produces a result that may be used by instruction j (i → j), or
2. Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (i → k → j, a dependence chain).
For example, a MIPS code sequence:

     Loop:     L.D      F0, 0(R1)     ;F0=array element
               ADD.D    F4, F0, F2    ;add scalar in F2
               S.D      F4, 0(R1)     ;store result
               DADDUI   R1, R1, #-8   ;decrement pointer 8 bytes (per DW)
               BNE      R1, R2, Loop  ;branch R1!=R2


Data Dependence

Floating-point data part:

     Loop:     L.D      F0, 0(R1)     ;F0=array element
               ADD.D    F4, F0, F2    ;add scalar in F2
               S.D      F4, 0(R1)     ;store result

Integer data part:

               DADDUI   R1, R1, #-8   ;decrement pointer
                                      ;8 bytes (per DW)
               BNE      R1, R2, Loop  ;branch R1!=R2

† When this type of dependence causes a pipeline hazard, it is a Read After Write (RAW) hazard.
Data Dependence and Hazards

InstrJ is data dependent (also called truly dependent) on InstrI if:
1) InstrJ tries to read an operand before InstrI writes it;

               I:    add r1,r2,r3
               J:    sub r4,r1,r3

2) Or InstrJ is data dependent on InstrK, which is dependent on InstrI.
If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped.
A data dependence in the instruction sequence reflects a data dependence in the source code, and the effect of the original data dependence must be preserved.
If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard.


ILP and Data Dependencies, Hazards

HW/SW must preserve program order: the order in which instructions would execute if executed sequentially, as determined by the original source program.
   Dependences are a property of programs.
The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline.
Importance of the data dependences:
1) Indicates the possibility of a hazard;
2) Determines the order in which results must be calculated;
3) Sets an upper bound on how much parallelism can possibly be exploited.
HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program.
Name Dependence #1: Anti-dependence

Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; 2 versions of name dependence (WAR and WAW).
InstrJ writes an operand before InstrI reads it:

               I:    sub r4,r1,r3
               J:    add r1,r2,r3
               K:    mul r6,r1,r7

   Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.


Name Dependence #2: Output dependence

InstrJ writes an operand before InstrI writes it:

               I:    sub r1,r4,r3
               J:    add r1,r2,r3
               K:    mul r6,r1,r7

Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict.
   Register renaming resolves name dependences for registers;
   Either by the compiler or by HW.
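As a concrete illustration of the last point, the following C sketch (hypothetical and source-level, not from the textbook) removes the anti-dependence above by renaming: variables r1, r2, ... stand in for registers, and s1b plays the role of a renamed register.

     #include <stdio.h>

     int main(void) {
         int r1 = 5, r2 = 7, r3 = 2, r7 = 3;

         /* Original sequence: I reads r1, J overwrites r1 (WAR on r1), K reads the new r1. */
         int r4 = r1 - r3;       /* I: sub r4,r1,r3                 */
         r1     = r2 + r3;       /* J: add r1,r2,r3  -- WAR with I  */
         int r6 = r1 * r7;       /* K: mul r6,r1,r7  -- RAW with J  */

         /* Renamed sequence: same data flow, but J and K use the fresh name s1b,
          * so I and J no longer share a name and may be freely overlapped. */
         int s1 = 5, s2 = 7, s3 = 2, s7 = 3;
         int s4  = s1 - s3;      /* I                        */
         int s1b = s2 + s3;      /* J writes the fresh name  */
         int s6  = s1b * s7;     /* K reads the fresh name   */

         printf("%d %d / %d %d\n", r4, r6, s4, s6);   /* prints "3 27 / 3 27" */
         return 0;
     }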




Control Dependencies

Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order.

              if p1 {
                  S1;
              };
              if p2 {
                  S2;
              }

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.


Control Dependence

Two constraints imposed by control dependence:
1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch;
2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.
Control dependence need not always be preserved.
   We are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program.
Instead, the 2 properties critical to program correctness are:
1. Exception behavior, and
2. Data flow.
Exception Behavior

Preserving exception behavior:
   Any change in instruction execution order must not change how exceptions are raised in the program (⇒ no new exceptions).
Example:

              DADDU    R2,R3,R4
              BEQZ     R2,L1
              LW       R1,0(R2)
     L1:

     † Assume branches are not delayed.

Problem with moving LW before BEQZ? If the branch is taken, the hoisted load could raise a memory protection exception that the original program would never cause.


Data Flow

Data flow: the actual flow of data values among instructions that produce results and those that consume them.
   Branches make the flow dynamic; they determine which instruction is the supplier of a datum.
Example:

              DADDU    R1, R2, R3
              BEQZ     R4, L
              DSUBU    R1, R5, R6
     L:       ...
              OR       R7, R1, R8

Does the R1 used by OR come from DADDU or DSUBU? It depends on whether the branch is taken.
   Must preserve data flow on execution.
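A C rendering of the data-flow example may make the point clearer (a hypothetical sketch; register names become variables, and the initial values chosen are purely illustrative).

     #include <stdio.h>

     int main(void) {
         int R2 = 4, R3 = 6, R4 = 0, R5 = 9, R6 = 1, R8 = 0x10;

         int R1 = R2 + R3;        /* DADDU R1, R2, R3                     */
         if (!(R4 == 0)) {        /* BEQZ R4, L: skip DSUBU when R4 == 0  */
             R1 = R5 - R6;        /* DSUBU R1, R5, R6                     */
         }
         /* L: */
         int R7 = R1 | R8;        /* OR R7, R1, R8 consumes whichever R1 reaches it */

         printf("R7 = %d\n", R7); /* with R4 == 0, the DADDU result (10) flows to OR */
         return 0;
     }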




Outline

2.1  Instruction-Level Parallelism: Concepts and Challenges
2.2  Basic Compiler Techniques for Exposing ILP
2.3  Reducing Branch Costs with Prediction
2.4  Overcoming Data Hazards with Dynamic Scheduling
2.5  Dynamic Scheduling: Examples and the Algorithm
2.6  Hardware-Based Speculation
2.7  Exploiting ILP Using Multiple Issue and Static Scheduling
2.8  Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation


2.2 Basic Compiler Techniques for Exposing ILP

This code adds a scalar to a vector:

     for (i=1000; i>0; i=i-1)
         x[i] = x[i] + s;

Assume the following latencies for all examples.
   Ignore the delayed branch in these examples.

     Instruction producing result    Instruction using result    Latency in cycles
     FP ALU op                       Another FP ALU op           3
     FP ALU op                       Store double                2
     Load double                     FP ALU op                   1
     Load double                     Store double                0

     Figure 2.2 Latencies of FP operations used in this chapter.
FP Loop: Where are the Hazards?

First translate into MIPS code.
   To simplify, assume 8 is the lowest address.

     Loop:     L.D      F0,0(R1)    ;F0=vector element
               ADD.D    F4,F0,F2    ;add scalar from F2
               S.D      0(R1),F4    ;store result
               DADDUI   R1,R1,-8    ;decrement pointer 8B (DW)
               BNEZ     R1,Loop     ;branch R1!=zero


FP Loop Showing Stalls

Example 3-1 (p.76): Show how the loop would look on MIPS, both scheduled and unscheduled, including any stalls or idle clock cycles. Schedule for delays from floating-point operations, but remember that we are ignoring delayed branches.

Answer:
1.   Loop:  L.D     F0,0(R1)    ;F0=vector element
2.          stall
3.          ADD.D   F4,F0,F2    ;add scalar in F2
4.          stall
5.          stall
6.          S.D     0(R1),F4    ;store result
7.          DADDUI  R1,R1,-8    ;decrement pointer 8B (DW)
8.          stall               ;assumes can't forward to branch
9.          BNEZ    R1,Loop     ;branch R1!=zero

† 9 clock cycles: Rewrite the code to minimize stalls?




Revised FP Loop Minimizing Stalls

1.   Loop:  L.D     F0,0(R1)
2.          DADDUI  R1,R1,-8
3.          ADD.D   F4,F0,F2
4.          stall
5.          stall
6.          S.D     8(R1),F4    ;altered offset when DADDUI is moved up
7.          BNEZ    R1,Loop

Swap DADDUI and S.D by changing the address of S.D.

     Instruction producing result    Instruction using result    Latency in cycles
     FP ALU op                       Another FP ALU op           3
     FP ALU op                       Store double                2
     Load double                     FP ALU op                   1
     Load double                     Store double                0

† 7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead; how to make it faster?


Unroll Loop Four Times

1.   Loop:  L.D     F0,0(R1)
3.          ADD.D   F4,F0,F2
6.          S.D     0(R1),F4       ;drop DADDUI & BNEZ
7.          L.D     F6,-8(R1)
9.          ADD.D   F8,F6,F2
12.         S.D     -8(R1),F8      ;drop DADDUI & BNEZ
13.         L.D     F10,-16(R1)
15.         ADD.D   F12,F10,F2
18.         S.D     -16(R1),F12    ;drop DADDUI & BNEZ
19.         L.D     F14,-24(R1)
21.         ADD.D   F16,F14,F2
24.         S.D     -24(R1),F16
25.         DADDUI  R1,R1,#-32     ;alter to 4*8
26.         BNEZ    R1,LOOP

27 clock cycles, or 6.75 per iteration of the original loop (assumes the number of loop iterations is a multiple of 4).
Unrolled Loop Detail

We do not usually know the upper bound of the loop.
Suppose it is n, and we would like to unroll the loop to make k copies of the body.
Instead of a single unrolled loop, we generate a pair of consecutive loops:
   The 1st executes (n mod k) times and has a body that is the original loop;
   The 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times.
For large values of n, most of the execution time will be spent in the unrolled loop (see the C sketch after the next slide).


Unrolled Loop That Minimizes Stalls

1.   Loop:  L.D     F0, 0(R1)
2.          L.D     F6, -8(R1)
3.          L.D     F10, -16(R1)
4.          L.D     F14, -24(R1)
5.          ADD.D   F4, F0, F2
6.          ADD.D   F8, F6, F2
7.          ADD.D   F12, F10, F2
8.          ADD.D   F16, F14, F2
9.          S.D     0(R1), F4
10.         S.D     -8(R1), F8
11.         S.D     -16(R1), F12
12.         DSUBUI  R1, R1, #32
13.         S.D     8(R1), F16    ; 8-32 = -24
14.         BNEZ    R1, LOOP
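The pair-of-loops strategy from the "Unrolled Loop Detail" slide might look like this in C (a minimal sketch for an unknown trip count n and unroll factor k = 4; the function name and array layout are illustrative, not from the textbook).

     /* Add a scalar s to every element of x[0..n-1], unrolled by k = 4. */
     void add_scalar(double *x, double s, int n) {
         int i = 0;

         /* 1st loop: executes (n mod k) times with the original body. */
         for (; i < n % 4; i++)
             x[i] = x[i] + s;

         /* 2nd loop: the body unrolled k = 4 times, iterating n/k times. */
         for (; i < n; i += 4) {
             x[i]     = x[i]     + s;
             x[i + 1] = x[i + 1] + s;
             x[i + 2] = x[i + 2] + s;
             x[i + 3] = x[i + 3] + s;
         }
     }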




5 Loop Unrolling Decisions

Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1. Determine that loop unrolling is useful by finding that loop iterations are independent (except for maintenance code);
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations;
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code;
4. Determine that loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent;
   » This transformation requires analyzing memory addresses and finding that they do not refer to the same address.
5. Schedule the code, preserving any dependences needed to yield the same result as the original code.


Limits to Loop Unrolling

3 limits to loop unrolling:
1. Decrease in the amount of overhead amortized with each extra unrolling.
      Amdahl's Law.
2. Growth in code size.
      For larger loops, the concern is that it increases the instruction cache miss rate.
3. Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling.
      If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of unrolling.
Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction.
   We discuss it in Section 2.3: Reducing Branch Costs with Prediction.
Outline

2.1  Instruction-Level Parallelism: Concepts and Challenges
2.2  Basic Compiler Techniques for Exposing ILP
2.3  Reducing Branch Costs with Prediction
2.4  Overcoming Data Hazards with Dynamic Scheduling
2.5  Dynamic Scheduling: Examples and the Algorithm
2.6  Hardware-Based Speculation
2.7  Exploiting ILP Using Multiple Issue and Static Scheduling
2.8  Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation


2.3 Reducing Branch Costs with Prediction

Because of the need to enforce control dependences through branch hazards and stalls, branches will hurt pipeline performance.
   Solution 1: loop unrolling;
   Solution 2: predict how branches will behave.
SW/HW technology:
   SW: Static Branch Prediction, done statically at compile time;
   HW: Dynamic Branch Prediction, done dynamically by the hardware at execution time.
Static Branch Prediction

Appendix A showed scheduling code around delayed branches.
To reorder code around branches, we need to predict branches statically at compile time.
The simplest scheme is to predict a branch as taken.
   Average misprediction = untaken branch frequency = 34% for SPEC. Unfortunately, the accuracy varies widely, from a 59% misprediction rate down to 9%.
A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run.

     Figure 2.3 Misprediction rate on SPEC92 for a profile-based predictor (integer programs average 15%, floating-point programs average 9%).


Dynamic Branch Prediction

Why does prediction work?
   The underlying algorithm has regularities;
   The data being operated on has regularities;
   The instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems.
Is dynamic branch prediction better than static branch prediction?
   It seems to be;
   There are a small number of important branches in programs which have dynamic behavior.
Dynamic Branch Prediction

Performance = f(accuracy, cost of misprediction).
Branch History Table (also called a Branch Prediction Buffer): the lower bits of the PC address index a table of 1-bit values.
   Says whether the branch was recently taken or not;
   No address check.
Problem: in a loop, a 1-bit BHT will cause two mispredictions (even for a loop branch that is taken 9 times in 10 before exit):
   The end-of-loop case, when it exits instead of looping as before;
   The first time through the loop on the next pass through the code, when it predicts exit instead of looping.


Basic Branch Prediction Buffers

a.k.a. Branch History Table (BHT): a small direct-mapped cache of T/NT bits.
Dynamic Branch Prediction

Solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions.

     State machine (T = taken, NT = not taken):
        11 Predict Taken:      T → 11,  NT → 10
        10 Predict Taken:      T → 11,  NT → 00
        01 Predict Not Taken:  T → 11,  NT → 00
        00 Predict Not Taken:  T → 01,  NT → 00

   Red: stop, not taken; Blue: go, taken.
   Adds hysteresis to the decision-making process.


2-bit Scheme Accuracy

Mispredictions occur because either:
   The guess was wrong for that branch;
   The history of the wrong branch was obtained when indexing the table.
4,096-entry table:

     Figure 2.5 The result of the 2-bit scheme with a 4,096-entry buffer on SPEC89: misprediction rates range from 18% down to 0%, and are higher for the integer programs than for the floating-point programs.
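A minimal C sketch of a 2-bit branch history table is shown below. It uses the common saturating-counter variant of the 2-bit scheme (its transitions differ slightly from the state diagram above); the 4,096-entry size and the PC-based indexing are illustrative assumptions.

     #include <stdint.h>

     #define BHT_ENTRIES 4096                      /* assumed table size */

     /* Counter values 0,1 predict not taken; 2,3 predict taken. */
     static uint8_t bht[BHT_ENTRIES];

     static int predict_taken(uint32_t pc) {
         return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
     }

     static void train(uint32_t pc, int taken) {
         uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
         if (taken)  { if (*c < 3) (*c)++; }       /* move toward strongly taken     */
         else        { if (*c > 0) (*c)--; }       /* move toward strongly not taken */
         /* Hysteresis: a single surprise (e.g., the loop-exit iteration) only
          * weakens the prediction; two consecutive mispredictions flip it. */
     }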
2-bit Scheme Accuracy

In Figure 2.5, the accuracy of the predictors for integer programs, which typically also have higher branch frequencies, is lower than for the loop-intensive scientific programs.
Two ways to attack this problem:
   Larger buffer size;
   Increasing the accuracy of the scheme used for each prediction.
However, simply increasing the number of bits per predictor without changing the predictor structure has little impact.
   Single branch predictor vs. correlating branch predictors.


Improve Prediction Strategy By Correlating Branches

Consider the worst case for the 2-bit predictor:

     if (aa==2)
         aa=0;
     if (bb==2)
         bb=0;
     if (aa != bb) {

   If the first 2 branches are not taken, the 3rd will always be taken.

In MIPS code:

              DSUBUI  R3, R1, #2
              BNEZ    R3, L1          ;branch b1
              DADD    R1, R0, R0
     L1:      DSUBUI  R3, R2, #2
              BNEZ    R3, L2          ;branch b2
              DADD    R2, R0, R0
     L2:      DSUBU   R3, R1, R2
              BEQZ    R3, L3          ;branch b3: based on the outcome of the previous 2 branches

   † Single-level predictors can never capture this case.
Correlating predictors (or 2-level predictors):
   Correlation = what happened on the last branch.
      » Note that the last correlator branch may not always be the same branch.
   Predictor = which way to go.
      » 4 possibilities: which way the last one went chooses the prediction.
         − (Last-taken, last-not-taken) × (predict-taken, predict-not-taken)
Correlated Branch Prediction

Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table.
In general, an (m, n) predictor means recording the last m branches to select among 2^m history tables, each with n-bit counters.
   Thus, the old 2-bit BHT is a (0, 2) predictor.
Global branch history: an m-bit shift register keeping the T/NT status of the last m branches.
Each entry in the table has 2^m n-bit predictors.
Total bits for the (m, n) BHT prediction buffer:

     Total_memory_bits = 2^m × n × 2^p

   2^m banks of memory selected by the global branch history (which is just a shift register), e.g., a column address;
   Use p bits of the branch address to select the row;
   Get the n predictor bits in the entry to make the decision.


Correlating Branches

(2, 2) predictor:
   The behavior of the most recent 2 branches selects among four predictions for the next branch, updating just that prediction.

     Figure: a (2, 2) predictor. The low-order 4 bits of the branch address select a row of 2-bit per-branch predictors; the 2-bit global branch history selects which of the four columns supplies the prediction.
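A minimal C sketch of a (2, 2) correlating predictor along these lines is shown below; the sizes (m = 2 history bits, p = 4 branch-address bits) are illustrative assumptions chosen to match the Total_memory_bits formula, and the counters are 2-bit saturating counters.

     #include <stdint.h>

     #define M 2                               /* global history bits (assumed)       */
     #define P 4                               /* branch-address index bits (assumed) */

     /* 2^m banks x 2^p rows of 2-bit counters: total bits = 2^m * n * 2^p with n = 2. */
     static uint8_t table[1 << M][1 << P];
     static uint8_t ghist;                     /* m-bit global history shift register */

     static int predict_taken(uint32_t pc) {
         return table[ghist][(pc >> 2) & ((1 << P) - 1)] >= 2;
     }

     static void train(uint32_t pc, int taken) {
         uint8_t *c = &table[ghist][(pc >> 2) & ((1 << P) - 1)];
         if (taken) { if (*c < 3) (*c)++; } else { if (*c > 0) (*c)--; }
         ghist = (uint8_t)(((ghist << 1) | (taken ? 1 : 0)) & ((1 << M) - 1));
     }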
Example of Correlating Branch Predictors

     if (d==0)                      BNEZ    R1, L1          ;branch b1 (taken if d!=0)
         d = 1;                     DADDIU  R1, R0, #1      ;d==0, so d=1
     if (d==1)             L1:      DADDIU  R3, R1, #-1
         ...                        BNEZ    R3, L2          ;branch b2 (taken if d!=1)
                           ...
                           L2:


Example: Multiple Consequent Branches

     if (d == 0)        ;branch b1: not taken when d == 0, taken otherwise
         d = 1;
     if (d == 1)        ;branch b2: not taken when d == 1, taken otherwise
         ...

If b1 is not taken, then b2 will be not taken.
1-bit predictor: consider d alternating between 2 and 0; all branches are mispredicted.
Example: Multiple Consequent Branches

 if (d == 0)        ;not taken
    d = 1;
 else               ;taken
 if (d == 1)        ;not taken
    ...
 else               ;taken

 2-bit prediction: one prediction used if the last branch was not taken, and one used if
 the last branch was taken.
 (1,1) predictor: a 1-bit predictor with 1 bit of correlation; the last branch (either
 taken or not taken) decides which prediction bit is consulted and updated (a small C
 simulation follows this slide).

Accuracy of Different Schemes

 [Chart: frequency of mispredictions (y-axis, 0% to 20%) on SPEC89 benchmarks (nasa7,
 matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for three schemes:
 a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT.
 The (2,2) predictor is generally the most accurate, despite using no more prediction
 bits than the 4,096-entry 2-bit BHT.]

                                                            Computer Architecture Ch3-39                    Computer Architecture Ch3-40
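The claim that the 1-bit predictors miss every branch here, while one bit of correlation removes almost all misses, can be checked with a small simulation. The sketch below hard-codes the two branches b1 (taken when d != 0) and b2 (taken when d != 1) and alternates d between 2 and 0; the variable names and the all-not-taken initial state are assumptions:

```c
#include <stdio.h>

/* The example's two branches: b1 is "BNEZ d" (taken when d != 0) and b2 is
   "BNEZ d-1" (taken when d != 1).  With d alternating 2, 0, 2, 0, ... the
   actual outcomes are T,T, NT,NT, T,T, ...                                  */
int main(void)
{
    int p1bit[2]    = {0, 0};           /* one 1-bit predictor per branch (0 = not taken)   */
    int pcorr[2][2] = {{0, 0}, {0, 0}}; /* (1,1): per-branch bit chosen by the last outcome */
    int hist = 0;                       /* 1-bit global history: last branch's outcome      */
    int miss1 = 0, misscorr = 0, total = 0;

    for (int iter = 0; iter < 10; iter++) {
        int d = (iter % 2 == 0) ? 2 : 0;
        for (int b = 0; b < 2; b++) {                /* b = 0 -> b1, b = 1 -> b2        */
            int taken = (b == 0) ? (d != 0) : (d != 1);

            if (p1bit[b] != taken) miss1++;          /* 1-bit: predict the last outcome */
            p1bit[b] = taken;

            if (pcorr[b][hist] != taken) misscorr++; /* (1,1): bit selected by history  */
            pcorr[b][hist] = taken;
            hist = taken;

            if (b == 0 && d == 0) d = 1;             /* the "d = 1;" between b1 and b2  */
            total++;
        }
    }
    printf("1-bit predictors: %d/%d mispredicted\n", miss1, total);
    printf("(1,1) predictor : %d/%d mispredicted\n", misscorr, total);
    return 0;
}
```

Running it reports 20/20 mispredictions for the per-branch 1-bit predictors and only 2/20 (the first iteration) for the (1,1) scheme, which is the behavior the slide describes.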
Outline

2.1  Instruction-Level Parallelism: Concepts and Challenges
2.2  Basic Compiler Techniques for Exposing ILP
2.3  Reducing Branch Costs with Prediction
2.4  Overcoming Data Hazards with Dynamic Scheduling
2.5  Dynamic Scheduling: Examples and the Algorithm
2.6  Hardware-Based Speculation
2.7  Exploiting ILP Using Multiple Issue and Static Scheduling
2.8  Exploiting ILP Using Dynamic Scheduling, Multiple Issue and Speculation

Advantages of Dynamic Scheduling

Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while
maintaining data flow and exception behavior.
It handles cases where dependences are unknown at compile time.
   It allows the processor to tolerate unpredictable delays, such as cache misses, by
   executing other code while waiting for the miss to resolve.
It allows code compiled for one pipeline to run efficiently on a different pipeline.
It simplifies the compiler.
Hardware speculation, a technique with significant performance advantages, builds on
dynamic scheduling (next lecture).

                                                            Computer Architecture Ch3-41                    Computer Architecture Ch3-42




HW Schemes: Instruction Parallelism

Key idea: allow instructions behind a stall to proceed (a small C sketch follows this slide).
               DIVD F0, F2, F4
               ADDD F10, F0, F8
               SUBD F12, F8, F14
Enables out-of-order execution and allows out-of-order completion (e.g., SUBD).
   In a dynamically scheduled pipeline, all instructions still pass through the issue
   stage in order (in-order issue).
Will distinguish when an instruction begins execution and when it completes execution;
between those 2 times, the instruction is in execution.
Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder.

Dynamic Scheduling Step 1

The simple pipeline had 1 stage to check both structural and data hazards: Instruction
Decode (ID), also called Instruction Issue.
Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
   Issue: decode instructions, check for structural hazards.
   Read operands: wait until no data hazards, then read operands.

                                                            Computer Architecture Ch3-43                    Computer Architecture Ch3-44
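A minimal way to see why SUBD can finish ahead of the stalled ADDD is to model in-order issue with a per-register "value ready at cycle" table; the data structure, the one-instruction-per-cycle issue rate, and the latencies below are illustrative assumptions, not the lecture's hardware:

```c
#include <stdio.h>

/* Tiny in-order-issue / out-of-order-completion model: one instruction issues
   per cycle in program order, but it starts executing only once none of its
   source registers is still pending, so the independent SUBD overtakes the
   ADDD that is stalled behind the long DIVD.                                 */
typedef struct { const char *name; int dst, src1, src2, latency; } Instr;

int main(void)
{
    Instr prog[] = {
        { "DIVD F0,F2,F4",    0,  2,  4, 40 },
        { "ADDD F10,F0,F8",  10,  0,  8,  3 },
        { "SUBD F12,F8,F14", 12,  8, 14,  3 },
    };
    int n = (int)(sizeof prog / sizeof prog[0]);
    int ready_at[32] = {0};   /* cycle at which each F register's value is available */

    for (int i = 0, issue = 1; i < n; i++, issue++) {
        int start = issue;                              /* in-order issue          */
        if (ready_at[prog[i].src1] > start) start = ready_at[prog[i].src1];
        if (ready_at[prog[i].src2] > start) start = ready_at[prog[i].src2];
        int done = start + prog[i].latency;             /* out-of-order completion */
        ready_at[prog[i].dst] = done;
        printf("%-16s issues %2d  starts %2d  completes %2d\n",
               prog[i].name, issue, start, done);
    }
    return 0;
}
```

In this model SUBD completes around cycle 6 while ADDD must wait out the 40-cycle DIVD; the Tomasulo hardware on the following slides implements the same ready check with reservation stations instead of a compiler-visible table.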
Tomasulo Algorithm

Control and buffers distributed with the Function Units (FU)
   FU buffers called "reservation stations"; they hold pending operands.
Registers in instructions replaced by values or pointers to reservation stations (RS);
called register renaming;
   Renaming avoids WAR and WAW hazards;
   More reservation stations than registers, so it can do optimizations compilers can't.
Results go to the FUs from the RSs, not through registers, over a Common Data Bus that
broadcasts results to all FUs
   Avoids RAW hazards by executing an instruction only when its operands are available.
Loads and Stores treated as FUs with RSs as well.
Integer instructions can go past branches (predict taken), allowing FP ops beyond the
basic block in the FP queue.

Tomasulo Organization

[Figure: the FP Op Queue and memory feed six load buffers and a set of store buffers;
three FP adder reservation stations (Add1-Add3) and two FP multiplier reservation
stations (Mult1-Mult2) sit in front of the FP adders and FP multipliers; results return
to the FP registers, the store buffers (to memory), and any waiting reservation stations
over the Common Data Bus.]

                                                            Computer Architecture Ch3-45                    Computer Architecture Ch3-46




Reservation Station Components

Op: operation to perform in the unit (e.g., + or –).
Vj, Vk: values of the source operands.
   Store buffers have a V field for the result to be stored.
Qj, Qk: reservation stations producing the source registers (value to be written).
   Note: Qj, Qk = 0 => ready.
   Store buffers only have Qi, for the RS producing the result.
Busy: indicates the reservation station or FU is busy.
Register result status: indicates which functional unit will write each register, if one
exists. Blank when no pending instruction will write that register.
(A small C sketch of these structures and of the write-result broadcast follows this slide.)

Three Stages of Tomasulo Algorithm

1. Issue: get an instruction from the FP Op Queue.
      If a reservation station is free (no structural hazard), control issues the
      instruction and sends the operands (renaming the registers).
2. Execute: operate on the operands (EX).
      When both operands are ready, execute; if not ready, watch the Common Data Bus for
      the result.
3. Write result: finish execution (WB).
      Write on the Common Data Bus to all awaiting units; mark the reservation station
      available.
   Normal data bus: data + destination ("go to" bus).
   Common data bus: data + source ("come from" bus)
      64 bits of data + 4 bits of functional-unit source address;
      Write if it matches the expected functional unit (the one producing the result);
      Does the broadcast.
   Example speed: 3 clocks for FP +, –; 10 for ×; 40 clocks for ÷.

                                                            Computer Architecture Ch3-47                    Computer Architecture Ch3-48
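Putting the reservation-station fields and the write-result stage together, here is a minimal C sketch; the field names follow the slide's Op/Vj/Vk/Qj/Qk/Busy terminology, while the sizes, the tag encoding, and the `cdb_broadcast` helper are illustrative assumptions:

```c
#include <stdio.h>

#define NUM_RS   5          /* Add1-Add3, Mult1-Mult2                          */
#define NUM_REGS 32

/* One reservation station entry.  Qj/Qk == 0 means the operand value is
   already in Vj/Vk; otherwise it names the RS that will produce it
   (register renaming).                                                        */
typedef struct {
    int    busy;
    char   op;              /* '+', '-', '*', '/'                              */
    double vj, vk;          /* operand values, valid when qj/qk == 0           */
    int    qj, qk;          /* 1-based RS tags of the pending producers        */
} RS;

static RS     rs[NUM_RS + 1];        /* index 0 unused so that tag 0 = "ready" */
static int    reg_status[NUM_REGS];  /* register result status: RS tag or 0    */
static double regfile[NUM_REGS];

/* Write-result stage: the finishing RS puts (tag, value) on the Common Data
   Bus; every waiting RS and every renamed register captures it.              */
static void cdb_broadcast(int tag, double value)
{
    for (int i = 1; i <= NUM_RS; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].qj == tag) { rs[i].vj = value; rs[i].qj = 0; }
        if (rs[i].qk == tag) { rs[i].vk = value; rs[i].qk = 0; }
    }
    for (int r = 0; r < NUM_REGS; r++)
        if (reg_status[r] == tag) { regfile[r] = value; reg_status[r] = 0; }
    rs[tag].busy = 0;                /* mark the reservation station free      */
}

/* Execute may begin only when both operands are present (RAW respected).     */
static int ready_to_execute(int tag)
{
    return rs[tag].busy && rs[tag].qj == 0 && rs[tag].qk == 0;
}

int main(void)
{
    /* ADDD F6,F8,F2 issued into Add2 (tag 2) while F8 is still being produced
       by SUBD in Add1 (tag 1); F2 is already available as a value.            */
    rs[1] = (RS){ .busy = 1, .op = '-', .vj = 1.0, .vk = 2.0 };
    rs[2] = (RS){ .busy = 1, .op = '+', .qj = 1, .vk = 3.0 };
    reg_status[8] = 1;               /* F8 will come from Add1                 */
    reg_status[6] = 2;               /* F6 will come from Add2                 */

    printf("Add2 ready? %d\n", ready_to_execute(2));   /* 0: waiting on Add1   */
    cdb_broadcast(1, 1.0 - 2.0);                       /* SUBD writes result   */
    printf("Add2 ready? %d, Vj=%.1f\n", ready_to_execute(2), rs[2].vj);
    return 0;
}
```

Encoding "ready" as tag 0 matches the slide's "Qj, Qk = 0 => ready" convention, and the single broadcast loop is the "come from" bus: every consumer waiting on the producing reservation station's tag captures the value in the same step.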
Outline

2.1  Instruction-Level Parallelism: Concepts and Challenges
2.2  Basic Compiler Techniques for Exposing ILP
2.3  Reducing Branch Costs with Prediction
2.4  Overcoming Data Hazards with Dynamic Scheduling
2.5  Dynamic Scheduling: Examples and the Algorithm
2.6  Hardware-Based Speculation
2.7  Exploiting ILP Using Multiple Issue and Static Scheduling
2.8  Exploiting ILP Using Dynamic Scheduling, Multiple Issue and Speculation

Tomasulo Example

Instruction status (the instruction stream):
  Instruction            Issue   Exec Comp   Write Result
  LD     F6   34+   R2
  LD     F2   45+   R3
  MULTD  F0   F2    F4
  SUBD   F8   F6    F2
  DIVD   F10  F0    F6
  ADDD   F6   F8    F2

Load buffers (3): Load1 No; Load2 No; Load3 No.
Reservation stations (3 FP adder R.S., 2 FP mult R.S.):
  Time  Name   Busy  Op    Vj    Vk    Qj    Qk
        Add1   No
        Add2   No
        Add3   No
        Mult1  No
        Mult2  No
Register result status (clock 0): all FU entries empty (F0, F2, F4, F6, F8, F10, F12, ..., F30).

("Time" counts down the remaining FU latency; "Clock" is the clock-cycle counter.)

                                                            Computer Architecture Ch3-49




Tomasulo Example Cycle 1

Instruction status:
  Instruction            Issue   Exec Comp   Write Result
  LD     F6   34+   R2     1
  LD     F2   45+   R3
  MULTD  F0   F2    F4
  SUBD   F8   F6    F2
  DIVD   F10  F0    F6
  ADDD   F6   F8    F2

Load buffers: Load1 Yes 34+R2; Load2 No; Load3 No.
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 all not busy.
Register result status (clock 1): F6=Load1.

Tomasulo Example Cycle 2

Instruction status:
  Instruction            Issue   Exec Comp   Write Result
  LD     F6   34+   R2     1
  LD     F2   45+   R3     2
  MULTD  F0   F2    F4
  SUBD   F8   F6    F2
  DIVD   F10  F0    F6
  ADDD   F6   F8    F2

Load buffers: Load1 Yes 34+R2; Load2 Yes 45+R3; Load3 No.
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 all not busy.
Register result status (clock 2): F2=Load2, F6=Load1.
Tomasulo Example Cycle 3

Instruction status:
  Instruction            Issue   Exec Comp   Write Result
  LD     F6   34+   R2     1        3
  LD     F2   45+   R3     2
  MULTD  F0   F2    F4     3
  SUBD   F8   F6    F2
  DIVD   F10  F0    F6
  ADDD   F6   F8    F2

Load buffers: Load1 Yes 34+R2; Load2 Yes 45+R3; Load3 No.
Reservation stations:
  Mult1  Yes  MULTD   Vk=R(F4)  Qj=Load2
  (Add1, Add2, Add3, Mult2 not busy)
Register result status (clock 3): F0=Mult1, F2=Load2, F6=Load1.

 • Note: register names are removed ("renamed") in the reservation stations; MULTD issued.
 • Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4

Instruction status:
  Instruction            Issue   Exec Comp   Write Result
  LD     F6   34+   R2     1        3            4
  LD     F2   45+   R3     2        4
  MULTD  F0   F2    F4     3
  SUBD   F8   F6    F2     4
  DIVD   F10  F0    F6
  ADDD   F6   F8    F2

Load buffers: Load1 No; Load2 Yes 45+R3; Load3 No.
Reservation stations:
  Add1   Yes  SUBD    Vj=M(A1)  Qk=Load2
  Mult1  Yes  MULTD   Vk=R(F4)  Qj=Load2
  (Add2, Add3, Mult2 not busy)
Register result status (clock 4): F0=Mult1, F2=Load2, F6=M(A1), F8=Add1.

 • Load2 completing; what is waiting for Load2?




Tomasulo Example Cycle 5

Instruction status:
  Instruction            Issue   Exec Comp   Write Result
  LD     F6   34+   R2     1        3            4
  LD     F2   45+   R3     2        4            5
  MULTD  F0   F2    F4     3
  SUBD   F8   F6    F2     4
  DIVD   F10  F0    F6     5
  ADDD   F6   F8    F2

Load buffers: Load1 No; Load2 No; Load3 No.
Reservation stations:
  Time 2    Add1   Yes  SUBD    Vj=M(A1)  Vk=M(A2)
  Time 10   Mult1  Yes  MULTD   Vj=M(A2)  Vk=R(F4)
            Mult2  Yes  DIVD    Vk=M(A1)  Qj=Mult1
  (Add2, Add3 not busy)
Register result status (clock 5): F0=Mult1, F2=M(A2), F6=M(A1), F8=Add1, F10=Mult2.

 • The timer starts counting down for Add1 and Mult1.

Tomasulo Example Cycle 6

Instruction status:
  Instruction            Issue   Exec Comp   Write Result
  LD     F6   34+   R2     1        3            4
  LD     F2   45+   R3     2        4            5
  MULTD  F0   F2    F4     3
  SUBD   F8   F6    F2     4
  DIVD   F10  F0    F6     5
  ADDD   F6   F8    F2     6

Load buffers: Load1 No; Load2 No; Load3 No.
Reservation stations:
  Time 1    Add1   Yes  SUBD    Vj=M(A1)  Vk=M(A2)
            Add2   Yes  ADDD    Vk=M(A2)  Qj=Add1
  Time 9    Mult1  Yes  MULTD   Vj=M(A2)  Vk=R(F4)
            Mult2  Yes  DIVD    Vk=M(A1)  Qj=Mult1
  (Add3 not busy)
Register result status (clock 6): F0=Mult1, F2=M(A2), F6=Add2, F8=Add1, F10=Mult2.

 • Issue ADDD here despite the name dependence on F6?
  • 1. Outline 2.1 Instruction-Level Parallelism: Concepts and Challenges Computer Architecture 2.2 Basic Compiler Techniques for Exposing ILP 計算機結構 2.3 Reducing Branch Costs with Prediction 2.4 Overcoming Data Hazards with Dynamic Scheduling Lecture 3 2.5 Dynamic Scheduling: Examples and the Algorithm Instruction-Level Parallelism 2.6 26 Hardware-Based Speculation H d B dS l ti 2.7 Exploiting ILP Using Multiple Issue and Static Scheduling and Its Exploitation p 2.8 28 Exploiting ILP Using Dynamic Scheduling, Multiple Issue Scheduling (Chapter 2 in textbook) and Speculation Ping-Liang Lai (賴秉樑) Computer Architecture Ch3-1 Computer Architecture Ch3-2 2.1 ILP: Concepts and Challenges Instruction-Level Parallelism (ILP) Instruction-Level Parallelism (ILP): overlap the execution of Basic Block (BB) ILP is quite small instructions to improve performance. BB: a straight-line code sequence with no branches in except to the entry 2 approaches to exploit ILP and, no branches out except at the exit; 1) Rely on hardware to help discover and exploit the parallelism dynamically Average d A dynamic branch frequency 15% to 25%; i b hf t 25% (e.g., Pentium 4, AMD Opteron, IBM Power), and » 3 to 6 instructions execute between a pair of branches. 2) Rely on software technology to find parallelism statically at compile-time parallelism, Plus instructions in BB likely to depend on each other. (e.g., Itanium 2) To obtain substantial performance enhancements, we must Pipelining Review exploit ILP across multiple basic blocks. (ILP → LLP) p p ( ) Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls Loop-Level Parallelism: to exploit parallelism among iterations of a loop. + Control Stalls E.g., add two matrixes. for (i=1; i<=1000; i=i+1) x[i] = x[i] + y[ ] [] [ ] y[i]; Computer Architecture Ch3-3 Computer Architecture Ch3-4
  • 2. Data Dependence and Hazards Data Dependence Three data dependence: data dependences (true data Floating-point data part dependences), name dependences, and control dependences. 1. Instruction i produces a result that may be used by instruction j (i → j), or Loop: L.D F0, 0(R1) ;F0=array element 2. 2 Instruction j i data dependent on instruction k and i i is d d d i i k, d instruction k i data i is d ADD.D ADD D F4, F0, F4 F0 F2 ;add scalar in dd l i dependent on instruction i (i → k → j, dependence chain). S.D F4, 0(R1) ;store result For example, a MIPS code sequence example Integer data part Loop: L.D F0, 0(R1) ;F0=array element ADD.D F4, F0, F2 ;add scalar in DADDUI R1, R1, #-8 ;decrement pointer S.D F4, 0(R1) ;store result ;8 bytes (per DW) DADDUI R1, R1 # 8 R1 R1, #-8 ;decrement pointer 8 bytes d t i t b t BNE R1, R2, Loop ;branch R1!=R2 BNE R1, R2, Loop ;branch R1!=R2 † This type is called a Write After Read (WAR) hazard. Computer Architecture Ch3-5 Computer Architecture Ch3-6 Data Dependence and Hazards ILP and Data Dependencies, Hazards InstrJ is data dependent (aka true dependence) on InstrI: HW/SW must preserve program order: order instructions would 1) InstrJ tries to read operand before InstrI writes it; execute in if executed sequentially as determined by original source program. I: add r1 r2 r3 r1,r2,r3 Dependences are a property of programs. d f J: sub r4,r1,r3 Presence of dependence indicates potential for a hazard, but 2) Or InstrJ is data dependent on InstrK which is dependent on InstrI. actual hazard and length of any stall is property of the pipeline pipeline. If two instructions are data dependent, they cannot execute Importance of the data dependencies. simultaneously or be completely overlapped. y p y pp 1) Indicates the possibility of a hazard; Data dependence in instruction sequence → data dependence in 2) Determines order in which results must be calculated; source code → effect of original data dependence must be 3) Sets an upper bound on how much parallelism can possibly be exploited. preserved. HW/SW goal: exploit parallelism by preserving program order If data dependence caused a hazard in pipeline, called a Read only where it affects the outcome of the program. After Write (RAW) hazard. Computer Architecture Ch3-7 Computer Architecture Ch3-8
• 3.
Name Dependence #1: Anti-dependence
Name dependence: two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are two versions of name dependence (WAR and WAW).
Anti-dependence: InstrJ writes an operand before InstrI reads it.
      I: sub r4,r1,r3
      J: add r1,r2,r3
      K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. It results from reuse of the name "r1".
If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.

Name Dependence #2: Output dependence
Output dependence: InstrJ writes an operand before InstrI writes it.
      I: sub r1,r4,r3
      J: add r1,r2,r3
      K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from reuse of the name "r1".
If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so that the instructions do not conflict.
Register renaming resolves name dependences for registers, either by the compiler or by hardware.

Control Dependences
Every instruction is control dependent on some set of branches, and, in general, these control dependences must be preserved to preserve program order.
      if p1 {
         S1;
      };
      if p2 {
         S2;
      }
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Control Dependence
Two constraints imposed by control dependence:
1. An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch;
2. An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution is controlled by the branch.
Control dependence need not always be preserved: we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program.
Instead, the two properties critical to program correctness are:
1. Exception behavior, and
2. Data flow.
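A minimal C sketch of how renaming removes the WAR name conflict in the anti-dependence example above while keeping the true dependences (the variable names r1b etc. are my own):
   #include <stdio.h>

   int main(void) {
       int r2 = 4, r3 = 2, r4, r6, r7 = 3;
       /* Before renaming: r1 is reused, creating a WAR conflict between I and J.  */
       /*   I: sub r4,r1,r3    J: add r1,r2,r3    K: mul r6,r1,r7                  */
       /* After renaming J's and K's r1 to a new name r1b, only the true (RAW)     */
       /* dependence J -> K remains, so I and J no longer conflict.                */
       int r1  = 10;          /* old value of r1, read by I       */
       r4      = r1 - r3;     /* I: reads the old r1              */
       int r1b = r2 + r3;     /* J: writes the renamed r1b        */
       r6      = r1b * r7;    /* K: reads r1b (true RAW dep on J) */
       printf("%d %d\n", r4, r6);
       return 0;
   }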
• 4.
Exception Behavior
Preserving exception behavior: any change in instruction execution order must not change how exceptions are raised in the program (⇒ no new exceptions).
Example (assume branches are not delayed):
      DADDU R2, R3, R4
      BEQZ  R2, L1
      LW    R1, 0(R2)
   L1: ...
What is the problem with moving the LW before the BEQZ?

Data Flow
Data flow: the actual flow of data values among instructions that produce results and those that consume them.
Branches make the flow dynamic: they determine which instruction is the supplier of data.
Example:
      DADDU R1, R2, R3
      BEQZ  R4, L
      DSUBU R1, R5, R6
   L:  ...
      OR    R7, R1, R8
Does the R1 of the OR depend on the DADDU or on the DSUBU? We must preserve data flow during execution.

Outline (sections 2.1–2.8, repeated)

2.2 Basic Compiler Techniques for Exposing ILP
This code adds a scalar to a vector:
   for (i=1000; i>0; i=i-1)
      x[i] = x[i] + s;
Assume the following latencies for all examples, and ignore the delayed branch in these examples.
   Instruction producing result | Instruction using result | Latency in clock cycles
   FP ALU op                    | Another FP ALU op        | 3
   FP ALU op                    | Store double             | 2
   Load double                  | FP ALU op                | 1
   Load double                  | Store double             | 0
Figure 2.2 Latencies of FP operations used in this chapter.
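To make Figure 2.2 easier to reuse when reasoning about stalls, here is one possible way to encode the latencies in C (the enum and function names are my own, and producer/consumer pairs not listed in the figure are assumed here to have zero latency):
   #include <stdio.h>

   typedef enum { FP_ALU, LOAD_D, STORE_D } InstrClass;

   /* Latency in clock cycles between a producing and a consuming instruction,
      following Figure 2.2.                                                    */
   static int latency(InstrClass producer, InstrClass consumer) {
       if (producer == FP_ALU && consumer == FP_ALU)  return 3;
       if (producer == FP_ALU && consumer == STORE_D) return 2;
       if (producer == LOAD_D && consumer == FP_ALU)  return 1;
       return 0;   /* e.g., load double -> store double */
   }

   int main(void) {
       /* L.D feeds ADD.D and ADD.D feeds S.D: 1 + 2 = 3 FP-related stall cycles. */
       printf("%d\n", latency(LOAD_D, FP_ALU) + latency(FP_ALU, STORE_D));
       return 0;
   }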
• 5.
FP Loop: Where are the Hazards?
First translate the loop into MIPS code. To simplify, assume 8 is the lowest address.
   Loop: L.D    F0, 0(R1)    ;F0=vector element
         ADD.D  F4, F0, F2   ;add scalar from F2
         S.D    0(R1), F4    ;store result
         DADDUI R1, R1, -8   ;decrement pointer 8B (DW)
         BNEZ   R1, Loop     ;branch R1!=zero

FP Loop Showing Stalls
Example 3-1 (p.76): Show how the loop would look on MIPS, both scheduled and unscheduled, including any stalls or idle clock cycles. Schedule for delays from floating-point operations, but remember that we are ignoring delayed branches.
Answer:
   1. Loop: L.D    F0, 0(R1)    ;F0=vector element
   2.       stall
   3.       ADD.D  F4, F0, F2   ;add scalar in F2
   4.       stall
   5.       stall
   6.       S.D    0(R1), F4    ;store result
   7.       DADDUI R1, R1, -8   ;decrement pointer 8B (DW)
   8.       stall                ;assumes can't forward to branch
   9.       BNEZ   R1, Loop     ;branch R1!=zero
† 9 clock cycles per iteration. Can we rewrite the code to minimize stalls?

Revised FP Loop Minimizing Stalls
   1. Loop: L.D    F0, 0(R1)
   2.       DADDUI R1, R1, -8
   3.       ADD.D  F4, F0, F2
   4.       stall
   5.       stall
   6.       S.D    8(R1), F4    ;altered offset because DADDUI was moved up
   7.       BNEZ   R1, Loop
Swap DADDUI and S.D by changing the address of the S.D.
(Latencies as in Figure 2.2: FP ALU op → another FP ALU op, 3; FP ALU op → store double, 2; load double → FP ALU op, 1; load double → store double, 0.)
† 7 clock cycles per iteration, but just 3 are for execution (L.D, ADD.D, S.D) and 4 are loop overhead. How can we make it faster?

Unroll Loop Four Times
   1.  Loop: L.D    F0, 0(R1)
   3.        ADD.D  F4, F0, F2
   6.        S.D    0(R1), F4     ;drop DADDUI & BNEZ
   7.        L.D    F6, -8(R1)
   9.        ADD.D  F8, F6, F2
   12.       S.D    -8(R1), F8    ;drop DADDUI & BNEZ
   13.       L.D    F10, -16(R1)
   15.       ADD.D  F12, F10, F2
   18.       S.D    -16(R1), F12  ;drop DADDUI & BNEZ
   19.       L.D    F14, -24(R1)
   21.       ADD.D  F16, F14, F2
   24.       S.D    -24(R1), F16
   25.       DADDUI R1, R1, #-32  ;alter to 4*8
   26.       BNEZ   R1, LOOP
† 27 clock cycles, or 6.75 per iteration (assumes the number of loop iterations is a multiple of 4).
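The same unrolling can be viewed at the source level. A minimal C sketch (the function name is mine; like the slide, it assumes the iteration count is a multiple of 4, and x is indexed 1..n as in the original loop):
   #include <stddef.h>
   #include <stdio.h>

   /* Add scalar s to each element of x[1..n], unrolled by 4 (n must be a multiple of 4). */
   void add_scalar_unrolled4(double *x, size_t n, double s) {
       for (size_t i = n; i > 0; i -= 4) {
           x[i]     = x[i]     + s;   /* copy 1 */
           x[i - 1] = x[i - 1] + s;   /* copy 2 */
           x[i - 2] = x[i - 2] + s;   /* copy 3 */
           x[i - 3] = x[i - 3] + s;   /* copy 4 */
       }
   }

   int main(void) {
       double x[9] = {0};                   /* x[1..8] used, 1-based as in the slide */
       add_scalar_unrolled4(x, 8, 2.5);
       printf("%f %f\n", x[1], x[8]);
       return 0;
   }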
• 6.
Unrolled Loop Detail
We do not usually know the upper bound of the loop. Suppose it is n, and we would like to unroll the loop to make k copies of the body.
Instead of a single unrolled loop, we generate a pair of consecutive loops (a C sketch of this split appears below):
  The 1st executes (n mod k) times and has a body that is the original loop;
  The 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times.
For large values of n, most of the execution time will be spent in the unrolled loop.

Unrolled Loop That Minimizes Stalls
   1.  Loop: L.D    F0, 0(R1)
   2.        L.D    F6, -8(R1)
   3.        L.D    F10, -16(R1)
   4.        L.D    F14, -24(R1)
   5.        ADD.D  F4, F0, F2
   6.        ADD.D  F8, F6, F2
   7.        ADD.D  F12, F10, F2
   8.        ADD.D  F16, F14, F2
   9.        S.D    0(R1), F4
   10.       S.D    -8(R1), F8
   11.       S.D    -16(R1), F12
   12.       DSUBUI R1, R1, #32
   13.       S.D    8(R1), F16   ;8-32 = -24
   14.       BNEZ   R1, LOOP

5 Loop Unrolling Decisions
Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1. Determine that loop unrolling is useful by finding that the loop iterations are independent (except for loop maintenance code);
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations;
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code;
4. Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent;
   » This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
5. Schedule the code, preserving any dependences needed to yield the same result as the original code.

Limits to Loop Unrolling
1. Decrease in the amount of overhead amortized with each extra unrolling: Amdahl's Law.
2. Growth in code size: for larger loops, the concern is that it increases the instruction cache miss rate.
3. Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling.
   If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of unrolling.
Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction.
We discuss it in Section 2.3: Reducing Branch Costs with Prediction.
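As a concrete illustration of the "pair of consecutive loops" idea from the Unrolled Loop Detail slide above, here is a minimal C sketch for an unknown n and unroll factor k = 4 (the function and variable names are mine, not from the slides):
   #include <stddef.h>
   #include <stdio.h>

   /* x[1..n] += s, using an (n mod 4) prologue loop followed by a loop unrolled 4x. */
   void add_scalar_split(double *x, size_t n, double s) {
       size_t i = n;
       /* 1st loop: executes (n mod 4) times; its body is the original loop. */
       for (size_t r = n % 4; r > 0; r--, i--)
           x[i] = x[i] + s;
       /* 2nd loop: the unrolled body, iterating n/4 times. */
       for (; i > 0; i -= 4) {
           x[i]     += s;
           x[i - 1] += s;
           x[i - 2] += s;
           x[i - 3] += s;
       }
   }

   int main(void) {
       double x[11] = {0};                 /* x[1..10], so n = 10 is not a multiple of 4 */
       add_scalar_split(x, 10, 1.0);
       printf("%f %f\n", x[1], x[10]);
       return 0;
   }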
• 7.
Outline (sections 2.1–2.8, repeated)

2.3 Reducing Branch Costs with Prediction
Because of the need to enforce control dependences through branch hazards and stalls, branches will hurt pipeline performance.
Solution 1: loop unrolling;
Solution 2: predict how branches will behave.
SW/HW technology:
  SW: static branch prediction, performed statically at compile time;
  HW: dynamic branch prediction, performed dynamically by the hardware at execution time.

Static Branch Prediction
Appendix A showed how to schedule code around a delayed branch. To reorder code around branches, we need to predict branches statically at compile time.
The simplest scheme is to predict a branch as taken.
Average misprediction rate = untaken branch frequency = 34% for SPEC. Unfortunately, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%).
[Figure 2.3 Misprediction rate for the predict-taken scheme on SPEC92: integer programs average 15%, floating-point programs average 9%, with individual benchmarks ranging from about 4% to 22%.]
A more accurate scheme predicts branches using profile information collected from earlier runs and modifies the prediction based on the last run.

Dynamic Branch Prediction
Why does prediction work?
  The underlying algorithm has regularities;
  The data being operated on has regularities;
  The instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems.
Is dynamic branch prediction better than static branch prediction?
  It seems to be: there are a small number of important branches in programs which have dynamic behavior.
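A back-of-the-envelope illustration of why misprediction rate matters (only the 34% figure comes from the slide; the branch frequency and penalty are assumed): if 20% of instructions are branches, each misprediction costs 1 stall cycle, and 34% of branches are mispredicted, the added CPI is 0.20 × 0.34 × 1 ≈ 0.07; with a 3-cycle penalty the same misprediction rate would add about 0.20 to the CPI, which is why better predictors pay off as pipelines get deeper.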
• 8.
Dynamic Branch Prediction
Performance = f(accuracy, cost of misprediction).
Branch History Table (also called a Branch Prediction Buffer): the lower bits of the PC address index a table of 1-bit values.
  The bit says whether the branch was recently taken or not;
  There is no address check.
Problem: in a loop, a 1-bit BHT causes two mispredictions even for a branch that is taken 9 out of 10 iterations before exit:
  At the end of the loop, when it exits instead of looping as before;
  The first time through the loop the next time the code is reached, when it predicts exit instead of looping.

Basic Branch Prediction Buffers
a.k.a. Branch History Table (BHT): a small direct-mapped cache of T/NT prediction bits.

Dynamic Branch Prediction: 2-bit Scheme
Solution: a 2-bit scheme in which the prediction changes only after two consecutive mispredictions.
[State diagram: states 11 and 10 predict taken, states 01 and 00 predict not taken; a taken outcome (T) moves the counter toward 11, a not-taken outcome (NT) moves it toward 00. Red: stop, not taken; blue: go, taken.]
This adds hysteresis to the decision-making process.

2-bit Scheme Accuracy
We mispredict because either:
  We made the wrong guess for that branch, or
  We got the branch history of the wrong branch when indexing the table (4,096-entry table).
[Figure 2.5 Misprediction rate of a 4,096-entry 2-bit prediction buffer on SPEC89: integer benchmarks (eqntott, espresso, gcc, li) show higher rates, up to about 18%, while floating-point benchmarks (nasa7, matrix300, doduc, spice, fpppp) are mostly below 10%.]
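A minimal C sketch of the 2-bit saturating-counter predictor described above (the table size matches Figure 2.5, but the function names and the toy driver are my own, not from the slides):
   #include <stdint.h>
   #include <stdbool.h>
   #include <stdio.h>

   #define BHT_ENTRIES 4096                  /* 4,096-entry table, as in Figure 2.5 */

   static uint8_t bht[BHT_ENTRIES];          /* 2-bit counters: 0,1 = not taken; 2,3 = taken */

   /* Index with low-order bits of the branch PC; no tag/address check. */
   static unsigned index_of(uint32_t pc) { return (pc >> 2) & (BHT_ENTRIES - 1); }

   bool predict(uint32_t pc) {
       return bht[index_of(pc)] >= 2;        /* states 10 and 11 predict taken */
   }

   void update(uint32_t pc, bool taken) {
       uint8_t *c = &bht[index_of(pc)];
       if (taken)  { if (*c < 3) (*c)++; }   /* saturate at 11 */
       else        { if (*c > 0) (*c)--; }   /* saturate at 00 */
   }

   int main(void) {
       /* Toy driver: a loop branch taken 9 times then not taken, run twice. */
       int mispredicts = 0;
       for (int pass = 0; pass < 2; pass++)
           for (int i = 0; i < 10; i++) {
               bool taken = (i < 9);
               if (predict(0x400100) != taken) mispredicts++;
               update(0x400100, taken);
           }
       printf("mispredictions: %d\n", mispredicts);  /* total over both passes */
       return 0;
   }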
• 9.
2-bit Scheme Accuracy (cont.)
In Figure 2.5, the accuracy of the predictors for integer programs, which typically also have higher branch frequencies, is lower than for the loop-intensive scientific programs.
Two ways to attack this problem:
  Increase the buffer size;
  Increase the accuracy of the scheme we use for each prediction.
However, simply increasing the number of bits per predictor without changing the predictor structure has little impact.
Single branch predictor vs. correlating branch predictors.

Improve Prediction Strategy by Correlating Branches
Consider the worst case for the 2-bit predictor:
   if (aa==2)
      aa=0;
   if (bb==2)
      bb=0;
   if (aa != bb) { ... }
In MIPS (with aa in R1 and bb in R2):
        DSUBUI R3, R1, #2
        BNEZ   R3, L1       ;branch b1 (aa!=2)
        DADD   R1, R0, R0   ;aa==2, so aa=0
   L1:  DSUBUI R3, R2, #2
        BNEZ   R3, L2       ;branch b2 (bb!=2)
        DADD   R2, R0, R0   ;bb==2, so bb=0
   L2:  DSUBU  R3, R1, R2
        BEQZ   R3, L3       ;branch b3 (aa==bb)
If the first two branches are not taken, then the third will always be taken.
† Single-level predictors can never capture this case.
Correlating predictors (2-level predictors):
  Correlation = what happened on the last branch.
    » Note that the last correlating branch may not always be the same branch.
  Predictor = which way to go.
    » 4 possibilities: which way the last branch went chooses the prediction:
      (last taken, last not taken) × (predict taken, predict not taken).
So the prediction for branch b3 can be based on the outcomes of the previous two branches.

Correlated Branch Prediction
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table.
In general, an (m, n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters.
  Thus, the old 2-bit BHT is a (0, 2) predictor.
Global branch history: an m-bit shift register keeping the T/NT status of the last m branches.
Each entry in the table has 2^m n-bit predictors.
Total bits for the (m, n) BHT prediction buffer:
   Total memory bits = 2^m × n × 2^p
  2^m banks of memory are selected by the global branch history (which is just a shift register), e.g., as a column address;
  Use p bits of the branch address to select the row;
  Get the n predictor bits in the entry to make the decision.

Correlating Branches: a (2, 2) predictor
The behavior of the most recent 2 branches selects between four predictions for the next branch, updating just that prediction.
[Diagram: 4 low-order bits of the branch address select a row of 2-bit-per-branch predictors; the 2-bit global branch history selects which of the four predictors in that row supplies the prediction.]
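For example, plugging assumed sizes into the formula above (my numbers, for illustration only): a (2, 2) predictor that uses p = 4 bits of the branch address needs 2^2 × 2 × 2^4 = 128 bits of prediction storage, while the 4,096-entry 2-bit buffer of Figure 2.5, viewed as a (0, 2) predictor, holds 2^0 × 2 × 4096 = 8,192 bits.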
• 10.
Example of Correlating Branch Predictors
   if (d==0)
      d = 1;
   if (d==1)
      ...
In MIPS (with d in R1):
        BNEZ   R1, L1       ;branch b1 (d!=0)
        DADDIU R1, R0, #1   ;d==0, so d=1
   L1:  DADDIU R3, R1, #-1
        BNEZ   R3, L2       ;branch b2 (d!=1)
        ...
   L2:
If b1 is not taken, then b2 will be not taken.

Example: Multiple Consequent Branches
Branch behavior: b1 is not taken if d==0 (taken otherwise); b2 is not taken if d==1 (taken otherwise).
1-bit predictor: if d alternates between 2 and 0, all branches are mispredicted.
(1,1) predictor: a 1-bit predictor with 1 bit of correlation. Each entry holds two prediction bits: the prediction used if the last branch was not taken, and the prediction used if the last branch was taken. The outcome of the last branch (taken or not taken) decides which of the two prediction bits is consulted and updated.

Accuracy of Different Schemes
[Figure: Frequency of mispredictions on SPEC89 benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for three schemes: a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT; the misprediction rates shown range from 0% to 18%.]
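A minimal C sketch of the (1,1) predictor just described, with one global history bit selecting one of two 1-bit predictions per entry (the table size, names, and toy usage are mine, not from the slides):
   #include <stdint.h>
   #include <stdbool.h>

   #define ENTRIES 1024

   static uint8_t table[ENTRIES][2];   /* [entry][last outcome] -> 1-bit prediction */
   static unsigned last_outcome = 0;   /* 1-bit global history: was the last branch taken? */

   static unsigned idx(uint32_t pc) { return (pc >> 2) & (ENTRIES - 1); }

   bool predict_1_1(uint32_t pc) {
       /* The last branch's outcome selects which of the two bits is consulted. */
       return table[idx(pc)][last_outcome] != 0;
   }

   void update_1_1(uint32_t pc, bool taken) {
       table[idx(pc)][last_outcome] = taken ? 1 : 0;  /* update only the selected bit */
       last_outcome = taken ? 1 : 0;                  /* shift in the new outcome     */
   }

   int main(void) {
       /* Toy usage: record one taken branch, then ask for a prediction. */
       update_1_1(0x40, true);
       return predict_1_1(0x40) ? 0 : 1;
   }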
• 11.
Outline (sections 2.1–2.8, repeated)

Advantages of Dynamic Scheduling
Dynamic scheduling: the hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior.
  It handles cases where dependences are unknown at compile time.
  It allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve.
  It allows code compiled for one pipeline to run efficiently on a different pipeline.
  It simplifies the compiler.
Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (next lecture).

HW Schemes: Instruction Parallelism
Key idea: allow instructions behind a stall to proceed.
   DIVD F0, F2, F4
   ADDD F10, F0, F8
   SUBD F12, F8, F14
This enables out-of-order execution and allows out-of-order completion (e.g., SUBD).
In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue).
We will distinguish when an instruction begins execution and when it completes execution; between those 2 times, the instruction is in execution.
Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder.

Dynamic Scheduling Step 1
The simple pipeline had one stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue.
Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
  Issue: decode instructions, check for structural hazards.
  Read operands: wait until there are no data hazards, then read operands.
• 12.
Tomasulo Algorithm
Control & buffers are distributed with the Function Units (FUs):
  FU buffers are called "reservation stations" (RS); they hold pending operands.
Registers in instructions are replaced by values or by pointers to reservation stations; this is called register renaming:
  Renaming avoids WAR and WAW hazards;
  There are more reservation stations than registers, so the hardware can do optimizations compilers can't.
Results go to the FUs from the RSs, not through the registers, over a Common Data Bus that broadcasts results to all FUs.
RAW hazards are avoided by executing an instruction only when its operands are available.
Loads and stores are treated as FUs with RSs as well.
Integer instructions can go past branches (predicted taken), allowing FP operations beyond the basic block in the FP queue.

Tomasulo Organization
[Diagram: instructions come from the FP Op Queue to the FP registers and reservation stations; load buffers (Load1–Load6) bring data from memory and store buffers send results to memory; reservation stations Add1–Add3 feed the FP adders and Mult1–Mult2 feed the FP multipliers; all units are connected by the Common Data Bus.]

Reservation Station Components
Op: the operation to perform in the unit (e.g., + or –).
Vj, Vk: the values of the source operands.
  Store buffers have a V field: the result to be stored.
Qj, Qk: the reservation stations producing the source registers (the value to be written).
  Note: Qj, Qk = 0 ⇒ ready.
  Store buffers only have Qi, for the RS producing the result.
Busy: indicates that the reservation station or FU is busy.
Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register.

Three Stages of the Tomasulo Algorithm
1. Issue: get an instruction from the FP Op Queue.
   If a reservation station is free (no structural hazard), the control issues the instruction and sends the operands (renaming the registers).
2. Execute: operate on the operands (EX).
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB).
   Write the result on the Common Data Bus to all awaiting units; mark the reservation station available.
A normal data bus carries data + destination (a "go to" bus); the Common Data Bus carries data + source (a "come from" bus):
  64 bits of data + 4 bits of functional-unit source address;
  A unit writes the value if it matches the expected functional unit (the producer of the result);
  The CDB does the broadcast.
Example latencies: 3 clocks for FP add/subtract, 10 for multiply, 40 clocks for divide.
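A minimal C sketch of the reservation-station bookkeeping fields just listed (the types, enum, and numbering are my own simplification, not a full Tomasulo implementation). An entry may begin execution only when both qj and qk are 0, matching the "execute when both operands are ready" rule above:
   #include <stdbool.h>

   typedef enum { OP_NONE, OP_ADD, OP_SUB, OP_MULT, OP_DIV } FPOp;

   /* One reservation station entry, mirroring the fields on the slide. */
   typedef struct {
       bool   busy;       /* station (or its FU) currently in use                    */
       FPOp   op;         /* operation to perform (e.g., + or -)                     */
       double vj, vk;     /* values of the source operands, once available           */
       int    qj, qk;     /* IDs of the RSs producing the operands; 0 => value ready */
   } ReservationStation;

   /* Register result status: which RS (if any) will write each FP register;
      0 means no pending instruction will write that register.               */
   static int register_status[32];

   int main(void) {
       /* Toy usage: an RS (given id 3 here, purely illustrative) holds a SUBD
          whose two operands are still being produced by RS 1 and RS 2.       */
       ReservationStation add1 = { .busy = true, .op = OP_SUB, .qj = 1, .qk = 2 };
       register_status[8] = 3;                      /* F8 will be written by RS 3 */
       return (add1.busy && add1.qj != 0) ? 1 : 0;  /* not yet ready to execute   */
   }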
• 13.
Outline (sections 2.1–2.8, repeated)

Tomasulo Example
The example issues the following instruction stream, tracking three tables each clock cycle: the instruction status (Issue, Exec Comp, Write Result), the reservation stations (Time, Name, Busy, Op, Vj, Vk, Qj, Qk, plus load buffers Load1–Load3 with Busy/Address), and the register result status (which FU will write F0, F2, F4, F6, F8, F10, F12, ..., F30). There are 3 load buffers, 3 FP adder reservation stations (Add1–Add3), and 2 FP multiplier reservation stations (Mult1–Mult2); a clock-cycle counter advances each step, and the Time column counts down the remaining execution cycles in each FU.
   LD    F6, 34+R2
   LD    F2, 45+R3
   MULTD F0, F2, F4
   SUBD  F8, F6, F2
   DIVD  F10, F0, F6
   ADDD  F6, F8, F2

Tomasulo Example, Cycle 1
[Table snapshot: the first LD issues; Load1 becomes busy with address 34+R2; the register result status for F6 points to Load1.]

Tomasulo Example, Cycle 2
[Table snapshot: the second LD issues; Load2 becomes busy with address 45+R3; F2 now points to Load2 and F6 still points to Load1.]
• 14.
Tomasulo Example, Cycle 3
[Table snapshot: MULTD issues to Mult1 with Vk = R(F4) and Qj = Load2; F0 points to Mult1; Load1 completes execution.]
• Note: register names are removed ("renamed") in the reservation stations; MULTD has issued.
• Load1 is completing; what is waiting for Load1?

Tomasulo Example, Cycle 4
[Table snapshot: Load1 writes its result M(A1); SUBD issues to Add1 with Vj = M(A1) and Qk = Load2; F6 now holds M(A1) and F8 points to Add1; Load2 completes execution.]
• Load2 is completing; what is waiting for Load2?

Tomasulo Example, Cycle 5
[Table snapshot: Load2 writes its result M(A2); DIVD issues to Mult2 with Vk = M(A1) and Qj = Mult1; Add1 now has both operands M(A1) and M(A2); the Time column shows 2 cycles remaining for Add1 and 10 for Mult1; F10 points to Mult2.]
• The timer starts counting down for Add1.

Tomasulo Example, Cycle 6
[Table snapshot: ADDD issues to Add2 with Vk = M(A2) and Qj = Add1; the register result status for F6 now points to Add2; the Time column shows 1 cycle remaining for Add1 and 9 for Mult1.]
• Issue ADDD here despite the name dependence on F6?