Fall 2012




            Thanks to Prof. Kim
• Discuss Lab 3

• Dealing with Branches

• Mid semester survey
• Most branches are biased

• Interference in PHT entries
  – Constructive (T+T, or N+N)
  – Destructive (T+N, or N+T)

• Agree predictor
  – Check whether a branch agrees with its
    bias direction (most entries will agree)
  – Reduces destructive interference in PHT
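
The agree idea can be sketched in C as follows. This is only a minimal model under assumptions that are not in the slides: a 4-entry PHT of 2-bit counters, a PHT index supplied directly by the caller, and a bias bit passed in per branch. All names (`agree_predict`, `agree_update`, `agree_demo_mispredicts`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define PHT_SIZE 4  /* assumed size, kept tiny so branches alias */

/* 2-bit saturating counters: a value >= 2 means "agree with the bias". */
static unsigned char pht[PHT_SIZE];

void agree_reset(void) {
    for (int i = 0; i < PHT_SIZE; i++)
        pht[i] = 2;  /* start at weakly "agree" */
}

/* The PHT entry says whether to follow the branch's bias bit or not. */
bool agree_predict(unsigned idx, bool bias_taken) {
    assert(idx < PHT_SIZE);
    bool agree = pht[idx] >= 2;
    return agree ? bias_taken : !bias_taken;
}

/* Train toward "agree" when the outcome matches the bias, else away. */
void agree_update(unsigned idx, bool bias_taken, bool taken) {
    assert(idx < PHT_SIZE);
    if (taken == bias_taken) {
        if (pht[idx] < 3) pht[idx]++;
    } else {
        if (pht[idx] > 0) pht[idx]--;
    }
}

/* Two branches with opposite bias share PHT entry 0 (a T+N collision). */
int agree_demo_mispredicts(int rounds) {
    int miss = 0;
    agree_reset();
    for (int r = 0; r < rounds; r++) {
        /* Branch X: always taken, bias bit = taken. */
        if (agree_predict(0, true) != true) miss++;
        agree_update(0, true, true);
        /* Branch Y: never taken, bias bit = not taken. */
        if (agree_predict(0, false) != false) miss++;
        agree_update(0, false, false);
    }
    return miss;
}
```

Both branches push the shared counter toward "agree", so in this model a T+N collision behaves like constructive interference.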
• Instructions are predicated
  -> Depending on the predicate value, the
     instruction is valid or becomes a no-op.

  (p) add R1 = R2 + R3

      p         Effect of (p) add R1 = R2 + R3
      TRUE      R1 <- R2 + R3
      FALSE     No-op
if (a == 0) {         Set p        (p = (a == 0))
    b = 1;            (p)  b = 1
}                     (!p) b = 0
else {
    b = 0;
}
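(normal branch code)                  (predicated code)

if (cond) {
    b = 0;
}
else {
    b = 1;
}

Control flow: A branches to C         Control flow: A, B, C, D run
(taken) or B (not taken);             in sequence, with no branch.
both paths join at D.

A:    p1 = (cond)                     A:    p1 = (cond)
      branch p1, TARGET
B:    mov b, 1                        B:    (!p1) mov b, 1
      jmp JOIN
C:  TARGET:                           C:    (p1) mov b, 0
      mov b, 0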
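
The semantics of the predicated column can be modeled in plain C. This is a sketch, not real predicated hardware: `pred_mov` is a hypothetical helper that emulates a single guarded move with a bit mask, and it assumes two's-complement integers so that `-1` is all ones. A compiler is free to turn either version into branches or conditional moves.

```c
#include <stdbool.h>

/* Branching version: mirrors the "normal branch code" column. */
int select_branch(bool cond) {
    int b;
    if (cond)
        b = 0;
    else
        b = 1;
    return b;
}

/* Emulates one predicated move: *dst = src only when pred is true.
   Assumes two's complement, so -(int)true is an all-ones mask. */
static void pred_mov(bool pred, int *dst, int src) {
    int mask = -(int)pred;
    *dst = (src & mask) | (*dst & ~mask);
}

/* Predicated version: both moves are always issued, and exactly one
   of them takes effect. */
int select_predicated(bool cond) {
    int b = 0;             /* overwritten by one of the two moves */
    bool p1 = cond;        /* p1 = (cond)        */
    pred_mov(!p1, &b, 1);  /* (!p1) mov b, 1     */
    pred_mov(p1, &b, 0);   /* (p1)  mov b, 0     */
    return b;
}
```

The control dependence on `cond` has become a data dependence on `p1`: there is nothing left to mispredict, but both moves occupy machine resources.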
• Eliminate branch mispredictions
  – Convert control dependency to data
    dependency
• Increase compiler’s optimization
  opportunities
  – Trace scheduling, bigger basic blocks,
    instruction re-ordering
  – SIMD (Nvidia G80), vector processing
• More machine resources
  – Fetch more instructions
  – Occupy useful resources (ROB, scheduler, ...)
• ISA should support predicated execution
  – (ISA), predicate registers
  – x86: CMOV (conditional move)
• In OOO, supporting predicated execution is
  harder
  – Three input sources
  – Dependent instructions cannot be executed
    until the predicate value is known.
• Conditional move
  – The simplest form of predicated execution
  – Destination must be a register; there is no
    conditional store to memory
  – E.g., CMOVA r16, r/m16 (move if CF=0 and
    ZF=0)
• Full predication support
  – Only IA-64 (later lecture)
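
To make the CMOVA condition concrete, here is a small C model of the flag logic. It is only an illustration: `flags_t`, `cmp_u16`, `cmova_u16` and `max_u16` are hypothetical names, and the model covers just the CF and ZF flags that an unsigned compare would set.

```c
#include <stdbool.h>
#include <stdint.h>

/* The two flags that CMOVA looks at. */
typedef struct {
    bool cf;  /* carry (borrow) flag */
    bool zf;  /* zero flag */
} flags_t;

/* Models CMP a, b for unsigned operands: computes a - b,
   setting CF on borrow (a < b) and ZF on equality. */
flags_t cmp_u16(uint16_t a, uint16_t b) {
    flags_t f = { .cf = a < b, .zf = a == b };
    return f;
}

/* Models CMOVA dst, src: dst receives src only if CF=0 and ZF=0,
   otherwise dst keeps its old value. */
uint16_t cmova_u16(flags_t f, uint16_t dst, uint16_t src) {
    return (!f.cf && !f.zf) ? src : dst;
}

/* Unsigned max without an if/else: CMP a, b ; CMOVA result, a. */
uint16_t max_u16(uint16_t a, uint16_t b) {
    uint16_t result = b;
    return cmova_u16(cmp_u16(a, b), result, a);
}
```

Note that the old value of the destination is an input to the operation, which is one reason a conditional move needs more source operands than an ordinary move.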
• When to use predicated execution?
  – Hard to predict?
  – Short branches?
  – Compiler optimization benefit?
• Who should decide it?
• Applicable to all branches?
  – Loop, function calls, indirect branches …
• Transforms an M-iteration loop into
  a loop with M/N iterations
    – We say that the loop has been unrolled N
      times
  Original loop:

    for(i=0;i<100;i++)
      a[i]*=2;

  Unrolled 4 times:

    for(i=0;i<100;i+=4){
      a[i]*=2;
      a[i+1]*=2;
      a[i+2]*=2;
      a[i+3]*=2;
    }

  Some compilers can do this (gcc -funroll-loops),
  or you can do it manually (as above).
• Less loop overhead
  Original loop:

    for(i=0;i<100;i++)
      a[i] += 2;

  Unrolled 4 times:

    for(i=0;i<100;i+=4){
      a[i]   += 2;
      a[i+1] += 2;
      a[i+2] += 2;
      a[i+3] += 2;
    }

  How many branches?

    Fewer branch predictions,
    fewer instructions
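
One way to answer "How many branches?" is to count them. The sketch below assumes the loop is tested at the bottom with a single conditional branch per trip (a BLT-style back edge) and that the trip count is a multiple of the unroll factor; `count_loop_branches` is a hypothetical helper, not part of the lecture.

```c
#include <assert.h>

/* Counts executed loop-closing branches for a bottom-tested loop
   over n elements whose body is unrolled `unroll` times. */
int count_loop_branches(int n, int unroll) {
    assert(n > 0 && unroll > 0 && n % unroll == 0);
    int branches = 0;
    int i = 0;
    do {
        /* `unroll` copies of the loop body would execute here */
        i += unroll;
        branches++;  /* one conditional branch per trip */
    } while (i < n);
    return branches;
}
```

Under these assumptions the 100-element loop executes 100 branches, and the version unrolled 4 times executes 25. The index increments drop by the same factor.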
• Allows better scheduling of instructions

  Original loop, three successive iterations:

       R2 = R3 * #4
       R2 = R2 + #a
       R1 = LOAD 0[R2]
       R1 = R1 + #2
       STORE R1  0[R2]
       R3 = R3 + 1
       BLT R3, 100, #top

       R2 = R3 * #4
       R2 = R2 + #a
       R1 = LOAD 0[R2]
       R1 = R1 + #2
       STORE R1  0[R2]
       R3 = R3 + 1
       BLT R3, 100, #top

       R2 = R3 * #4
       R2 = R2 + #a
       R1 = LOAD 0[R2]
       R1 = R1 + #2
       STORE R1  0[R2]
       R3 = R3 + 1
       BLT R3, 100, #top

  Unrolled 4 times, one iteration:

       R2 = R3 * #4
       R2 = R2 + #a
       R1 = LOAD 0[R2]
       R1 = R1 + #2
       STORE R1  0[R2]
       R1 = LOAD 4[R2]
       R1 = R1 + #2
       STORE R1  4[R2]
       R1 = LOAD 8[R2]
       R1 = R1 + #2
       STORE R1  8[R2]
       R1 = LOAD 12[R2]
       R1 = R1 + #2
       STORE R1  12[R2]
       R3 = R3 + 4
       BLT R3, 100, #top
• Get rid of small loops
  Original loop:

    for(i=0;i<4;i++)
      a[i]*=2;

  Fully unrolled:

    a[0]*=2;
    a[1]*=2;
    a[2]*=2;
    a[3]*=2;

  Loop form (blocks for(0), for(1), for(2), for(3)):
    Difficult to schedule/hoist instructions from a
    bottom block to a top block due to branches

  Fully unrolled form:
    Easier: no branches in the way
• Code size is larger (code bloat)
• What if M is not a multiple of N?
  – Or if M is not known at compile time?
  – Or if it is a while loop?

  Original loop:

    for(i=0;i<j;i++)
      a[i]*=2;

  Unrolled 4 times, with a cleanup loop for the
  remaining j%4 iterations:

    j1=j-j%4;
    for(i=0;i<j1;i+=4){
      a[i]*=2;
      a[i+1]*=2;
      a[i+2]*=2;
      a[i+3]*=2;
    }
    for(i=j1;i<j;i++)
      a[i]*=2;
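
The cleanup-loop pattern can be checked against the rolled loop for arbitrary trip counts. The following is a sketch in C; the function names and the 16-element test buffer in `unroll_matches` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: the original rolled loop. */
void scale_rolled(int *a, size_t j) {
    for (size_t i = 0; i < j; i++)
        a[i] *= 2;
}

/* Unrolled by 4, plus a cleanup loop for the j % 4 leftover elements. */
void scale_unrolled4(int *a, size_t j) {
    size_t j1 = j - j % 4;  /* largest multiple of 4 that is <= j */
    size_t i;
    for (i = 0; i < j1; i += 4) {
        a[i]     *= 2;
        a[i + 1] *= 2;
        a[i + 2] *= 2;
        a[i + 3] *= 2;
    }
    for (i = j1; i < j; i++)  /* at most 3 iterations */
        a[i] *= 2;
}

/* Returns 1 if both versions give the same result for the first n
   elements of a 16-element buffer and leave the rest untouched. */
int unroll_matches(size_t n) {
    int x[16], y[16];
    assert(n <= 16);
    for (size_t k = 0; k < 16; k++)
        x[k] = y[k] = (int)k + 1;
    scale_rolled(x, n);
    scale_unrolled4(y, n);
    for (size_t k = 0; k < 16; k++)
        if (x[k] != y[k])
            return 0;
    return 1;
}
```

The same structure works when the trip count is only known at run time, at the price of one extra loop and its branches.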
