Predication

Fall 2012

Thanks to Prof. Kim

• Discuss Lab 3

• Dealing with Branches

• Mid semester survey

• Most branches are biased

• Interference in PHT entries
– Constructive (T+T, or N+N)
– Destructive (T+N, or N+T)

• Agree predictor: check whether each branch agrees with its
bias direction (most entries will agree)
– Reduces destructive interference in the PHT
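
A minimal sketch of the agree idea in C, under stated assumptions: the table size, the gshare-style index function `pht_index`, and all function names are hypothetical and chosen only for illustration. How the bias bit is chosen and stored is left to the caller. The point shown is that the PHT counters record "agrees with bias" rather than "taken".

```c
#include <stdint.h>

#define PHT_BITS 10                      /* assumed table size: 1024 entries */
#define PHT_SIZE (1u << PHT_BITS)

/* 2-bit saturating counters; values 2..3 mean "agrees with the bias bit". */
static uint8_t pht[PHT_SIZE];

/* Reset every counter to "weakly agree". */
void agree_init(void) {
    for (unsigned i = 0; i < PHT_SIZE; i++)
        pht[i] = 2;
}

/* Hypothetical index function: low PC bits XOR global history. */
static unsigned pht_index(uint32_t pc, uint32_t history) {
    return (pc ^ history) & (PHT_SIZE - 1);
}

/* Predict taken (1) or not taken (0); bias must be 0 or 1. */
int agree_predict(uint32_t pc, uint32_t history, int bias) {
    int agrees = pht[pht_index(pc, history)] >= 2;
    return agrees ? bias : !bias;
}

/* Train on the real outcome: the counter tracks agreement, not direction. */
void agree_update(uint32_t pc, uint32_t history, int bias, int taken) {
    uint8_t *c = &pht[pht_index(pc, history)];
    if (taken == bias) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

With this scheme, a mostly-taken branch and a mostly-not-taken branch that alias to the same entry both push the counter toward "agree", so a T+N conflict becomes constructive.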

• Instructions are predicated
-> Depending on the predicate value, the
instruction either executes normally or becomes a no-op.

(p) add R1 = R2 + R3

p        R1 = R2 + R3
TRUE     R1 <- R2 + R3
FALSE    No-op

(source code)              (predicated code)
if ( a == 0 ) {            Set p
    b = 1;                 (p)  b = 1
}
else {
    b = 0;                 (!p) b = 0
}
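
The same if-conversion can be sketched in C. This is only an illustration: the function names are hypothetical, and a compiler is free to emit branches for either version. The second function mirrors the predicated form, where both assignments are issued and the predicate decides which one takes effect.

```c
/* Branching version: control dependence on (a == 0). */
int select_branch(int a) {
    int b;
    if (a == 0) {
        b = 1;
    } else {
        b = 0;
    }
    return b;
}

/* If-converted version: b now has a data dependence on p. */
int select_predicated(int a) {
    int p = (a == 0);     /* Set p */
    int b = -1;           /* placeholder; always overwritten below */
    b = p  ? 1 : b;       /* (p)  b = 1 */
    b = !p ? 0 : b;       /* (!p) b = 0 */
    return b;
}
```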

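if (cond) {
    b = 0;
}
else {
    b = 1;
}

Control flow (normal branch code): A branches on cond (T -> C, N -> B);
B and C both continue to D.
Control flow (predicated code): A, B, C, D execute in sequence.

      (normal branch code)        (predicated code)
A     p1 = (cond)                 p1 = (cond)
      branch p1, TARGET
B     mov b, 1                    (!p1) mov b, 1
      jmp JOIN
C     TARGET:
      mov b, 0                    (p1) mov b, 0
D     JOIN: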


• Eliminate branch mispredictions
– Convert control dependency to data
dependency
• Increase compiler’s optimization
opportunities
– Trace scheduling, bigger basic blocks,
instruction re-ordering
– SIMD (Nvidia G80), vector processing

• More machine resources
– Fetch more instructions
– Occupy useful resources (ROB, scheduler, ...)
• ISA should support predicated execution
– ISA-level predicate registers
– x86: CMOV (conditional move)
• In OOO, supporting predicated execution is
harder
– Three input sources
– Dependent instructions cannot be executed until the
predicate value is known.

• Conditional move
– The simplest form of predicated execution
– The destination must be a register; it cannot
conditionally write to memory
– E.g., CMOVA r16, r/m16 (move if CF=0 and ZF=0)
• Full predication support
– Only IA-64 (later lecture)
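
A small C sketch of what conditional move can and cannot express. The function names are hypothetical, and whether a compiler actually emits CMOVcc for the select is an assumption about code generation, not a guarantee.

```c
/* Unsigned max: a register-to-register select, the kind of code a
 * compiler may turn into CMP + CMOVcc on x86 (assumption). */
unsigned max_u(unsigned a, unsigned b) {
    return (a > b) ? a : b;
}

/* A conditional store has no direct conditional-move form, so it is
 * rewritten as load, select, unconditional store. Unlike the branchy
 * `if (p) *dst = v;`, this always reads and writes *dst. */
void store_if(int p, int *dst, int v) {
    int old = *dst;
    *dst = p ? v : old;
}
```

The second function changes the memory behavior (an extra load and an unconditional store), which is one reason conditional move alone is weaker than full predication.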

• When to use predicated execution?
– Hard to predict?
– Short branches?
– Compiler optimization benefit?
• Who should decide it?
• Applicable to all branches?
– Loop, function calls, indirect branches …

• Transforms an M-iteration loop into
a loop with M/N iterations
– We say that the loop has been unrolled N
times
Original:
for(i=0;i<100;i++)
    a[i]*=2;

Unrolled 4 times:
for(i=0;i<100;i+=4){
    a[i]*=2;
    a[i+1]*=2;
    a[i+2]*=2;
    a[i+3]*=2;
}

Some compilers can do this (gcc -funroll-loops)
Or you can do it manually (above)
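
A compilable version of the manual unrolling, as a sketch. The function names and the `N_ELEMS` macro are hypothetical; the trip count of 100 comes from the slide and is a multiple of the unroll factor 4, so no cleanup loop is needed here.

```c
#define N_ELEMS 100   /* trip count from the slide; a multiple of 4 */

/* Original loop: one element per iteration. */
void scale_rolled(int a[N_ELEMS]) {
    for (int i = 0; i < N_ELEMS; i++)
        a[i] *= 2;
}

/* Unrolled 4 times: four elements per iteration, 25 iterations. */
void scale_unrolled4(int a[N_ELEMS]) {
    for (int i = 0; i < N_ELEMS; i += 4) {
        a[i]     *= 2;
        a[i + 1] *= 2;
        a[i + 2] *= 2;
        a[i + 3] *= 2;
    }
}
```

Both functions should leave the array in the same state; only the number of loop iterations differs.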

• Less loop overhead
Original:
for(i=0;i<100;i++)
    a[i] += 2;

Unrolled 4 times:
for(i=0;i<100;i+=4){
    a[i] += 2;
    a[i+1] += 2;
    a[i+2] += 2;
    a[i+3] += 2;
}

How many branches?

Fewer branch predictions,
fewer instructions
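
One way to work out the answer, as a sketch. The helper names are hypothetical. The instruction count assumes the instruction mix of the slide's pseudo-assembly: 4 overhead instructions per iteration (two for address computation, one index increment, one branch) plus 3 per element (load, add, store).

```c
/* Loop-closing branches executed for `trips` elements unrolled by
 * `factor` (assumes factor > 0 and trips divisible by factor). */
int loop_branches(int trips, int factor) {
    int branches = 0;
    for (int i = 0; i < trips; i += factor)
        branches++;       /* one conditional branch per pass over the body */
    return branches;
}

/* Dynamic instruction count under the assumed mix:
 * 4 overhead instructions per iteration + 3 per element. */
int dynamic_insts(int trips, int factor) {
    return (trips / factor) * (4 + 3 * factor);
}
```

For the 100-element loop this gives 100 branches and 700 instructions when rolled, versus 25 branches and 400 instructions when unrolled 4 times.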

• Allows better scheduling of instructions

Original loop body (repeats every iteration):
R2 = R3 * #4
R2 = R2 + #a
R1 = LOAD 0[R2]
R1 = R1 + #2
STORE R1 -> 0[R2]
R3 = R3 + 1
BLT R3, 100, #top

Unrolled loop body:
R2 = R3 * #4
R2 = R2 + #a
R1 = LOAD 0[R2]
R1 = R1 + #2
STORE R1 -> 0[R2]
R1 = LOAD 4[R2]
R1 = R1 + #2
STORE R1 -> 4[R2]
R1 = LOAD 8[R2]
R1 = R1 + #2
STORE R1 -> 8[R2]
R1 = LOAD 12[R2]
R1 = R1 + #2
STORE R1 -> 12[R2]
R3 = R3 + 4
BLT R3, 100, #top

• Get rid of small loops
Original:
for(i=0;i<4;i++)
    a[i]*=2;

Fully unrolled:
a[0]*=2;
a[1]*=2;
a[2]*=2;
a[3]*=2;

Original: blocks for(0), for(1), for(2), for(3)
Difficult to schedule/hoist insts from bottom block to
top block due to branches

Fully unrolled:
Easier: no branches in the way

• Instruction size is larger (code bloat)
• What if M is not a multiple of N?
– Or if M is not known at compile time?
– Or if it is a while loop?
Original:
for(i=0;i<j;i++)
    a[i]*=2;

Unrolled 4 times, with a cleanup loop for the remainder:
j1=j-j%4;
for(i=0;i<j1;i+=4){
    a[i]*=2;
    a[i+1]*=2;
    a[i+2]*=2;
    a[i+3]*=2;
}
for(i=j1;i<j;i++)
    a[i]*=2;
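
The remainder handling can be wrapped in a function for a trip count known only at run time. This is a sketch: the function name is hypothetical and it assumes `j >= 0`.

```c
/* Double the first j elements of a, unrolled 4 times.
 * Assumes j >= 0; j need not be a multiple of 4. */
void scale_unrolled_any(int *a, int j) {
    int j1 = j - j % 4;           /* largest multiple of 4 that is <= j */
    int i;
    for (i = 0; i < j1; i += 4) { /* unrolled main loop */
        a[i]     *= 2;
        a[i + 1] *= 2;
        a[i + 2] *= 2;
        a[i + 3] *= 2;
    }
    for (i = j1; i < j; i++)      /* cleanup loop: at most 3 iterations */
        a[i] *= 2;
}
```

When `j` is less than 4, `j1` is 0, the main loop is skipped, and the cleanup loop does all the work.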
