LLVM Vectorization Prototype Design

Changwoo Min (multics69@gmail.com)
2010/06/23

Project Goal

• Design and implement a prototype-level autovectorizer in LLVM
• Understand LLVM and gain hands-on experience with it
• Implement a simple analysis phase in LLVM
• Implement a simple transform phase in LLVM

Vector Support in LLVM

• LLVM supports vector types and vector operations at the IR level
• Vector-typed IR is lowered to MMX/SSE instructions on the IA-32 architecture

Source statement (vector stride = 1):

a[i] = b[i] + c[i]

LLVM IR (vector type, vector operation):

%pb = getelementptr [32 x i32]* @b, i32 0, i32 %i
%vb = bitcast i32* %pb to <8 x i32>*
%pc = getelementptr [32 x i32]* @c, i32 0, i32 %i
%vc = bitcast i32* %pc to <8 x i32>*
%vb_i = load <8 x i32>* %vb, align 32
%vc_i = load <8 x i32>* %vc, align 32
%va_i = add nsw <8 x i32> %vb_i, %vc_i
%pa = getelementptr [32 x i32]* @a, i32 0, i32 %i
%va = bitcast i32* %pa to <8 x i32>*
store <8 x i32> %va_i, <8 x i32>* %va, align 32

Generated assembly (SSE code generation):

movaps b (,%eax,0), %xmm0
paddd c (,%eax,0), %xmm0
movaps %xmm0, a (,%eax,0)

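Vectorization: what is it?

Original loop:

int a[259], b[259], c[259];

for (i = 0; i < 259; ++i) {
    a[i] = b[i+1] + c[i];
}

Vectorized loop (vectorization factor 8), followed by a scalar loop for the remainder:

int a[259], b[259], c[259];

for (i = 0; i < 256; i += 8) {    /* 256 = 259 - (259 % 8) */
    a[i:i+8] = b[i+1:i+9] + c[i:i+8];
}

if (i != 259) {
    for (; i < 259; ++i) {
        a[i] = b[i+1] + c[i];
    }
}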
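
The transformation above can be emulated in portable C. This is a minimal sketch, not the prototype's output: the name add_strip_mined, the VF macro, and the inner lane loop (which stands in for a single vector instruction) are illustrative assumptions. Note that b must hold n + 1 elements, because the loop reads b[i+1].

```c
#include <assert.h>
#include <stddef.h>

#define VF 8  /* vectorization factor; assumed value matching the slide */

/* Computes a[i] = b[i+1] + c[i] for 0 <= i < n.
 * b must hold at least n + 1 elements.
 * Returns the number of elements handled by the scalar remainder loop. */
size_t add_strip_mined(int *a, const int *b, const int *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);

    size_t i = 0;
    size_t main_end = n - (n % VF);  /* largest multiple of VF that is <= n */

    /* Main loop: each outer iteration stands in for one vector operation. */
    for (; i < main_end; i += VF) {
        for (size_t lane = 0; lane < VF; ++lane) {
            a[i + lane] = b[i + lane + 1] + c[i + lane];
        }
    }

    /* Remainder (epilogue) loop: leftover elements, one at a time. */
    size_t remainder = n - i;
    for (; i < n; ++i) {
        a[i] = b[i + 1] + c[i];
    }
    return remainder;
}
```

For n = 259 the main loop covers the first 256 elements and the remainder loop handles the last 3.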


Vectorization, big idea

1. Find a loop
   • Use existing LLVM infrastructure
2. Is it vectorizable?
   • Is it a countable loop?
   • Are there any unvectorizable instructions?
   • Loop-independent dependence?
   • Loop-carried dependence?
3. If so, vectorize it
   • Change array type to vector type
   • Type casting
   • Alignment
   • Handle the remainder, if any

Find a loop
• Implement “LoopVectorizer” as one of the transform passes
  • Inherit from the LoopPass class, which is invoked for every loop.
  • The PassManager, which is the parent of the LoopPass manager, handles integration with other LLVM passes.
• Ask the PassManager to hand over loops in a form more canonical than a natural loop
  • LoopSimplify form
    • Entry block, exit block, latch block
    • Single backedge
  • Countable loop that increments by one
• Slide diagram (top to bottom): PassManager, LoopPass, LoopVectorize

Is it vectorizable? (1/3)
• Loop type test
  • Innermost loop
  • Countable loop
    • for(i=0;i<100;++i) → OK
    • for(;*p!=NULL;++p) → NOK
  • Long enough to vectorize
    • for(i=0;i<3;++i) → NOK
    • The iteration count should be larger than the vectorization factor.
• Are there any unvectorizable IR instructions?
  • Function call → NOK
  • Stack allocation → NOK
  • Operation on a scalar value other than the loop induction variable → NOK
• The stride of pointer/array accesses should be one.
  • a[i] → OK, a[2*i] → NOK
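
The checks on this slide can be summarized as one predicate. The sketch below is only an illustration: struct loop_summary and its fields are hypothetical stand-ins for facts that the prototype would obtain from LLVM analyses, and passes_loop_tests is not a function from the prototype.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a loop; the field names are illustrative only. */
struct loop_summary {
    bool is_innermost;         /* contains no nested loop */
    bool is_countable;         /* trip count is known before the loop runs */
    unsigned long trip_count;  /* number of iterations */
    bool has_call;             /* body contains a function call */
    bool has_stack_alloc;      /* body contains a stack allocation */
    bool has_scalar_op;        /* scalar operation other than on the induction variable */
    long stride;               /* stride of every array/pointer access */
};

/* Returns true when the loop passes every test listed on the slide. */
bool passes_loop_tests(const struct loop_summary *l, unsigned long vf)
{
    assert(l != NULL && vf > 0);

    if (!l->is_innermost || !l->is_countable)
        return false;
    if (l->trip_count <= vf)  /* must be longer than the vectorization factor */
        return false;
    if (l->has_call || l->has_stack_alloc || l->has_scalar_op)
        return false;
    return l->stride == 1;    /* a[i] is OK, a[2*i] is not */
}
```

With vf = 8, a countable innermost loop of 100 iterations with stride-one accesses passes, while the same loop with 3 iterations or with stride 2 is rejected.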

Is it vectorizable? (2/3)
• Collect the array/pointer variables used on the LHS and RHS

a[i] = b[i+1] + c[i];

LHS = {a[i]}, RHS = {b[i+1], c[i]}

Is it vectorizable? (3/3)
• Data dependence testing between LHS and RHS

foreach member W in LHS
    foreach member R in LHS U RHS
        if R is alias of W
            if there is data dependence between W and R
                “It is not vectorizable.”
“Ok, it is vectorizable”

• Dependence testing
  • The strides of W and R are one.
  • We only check whether W and R collide WITHIN the vectorization factor, by subtracting their base coefficients (LC and RC).
  • W[i+LC] → R[i+RC]
  • If |LC-RC| < vectorization factor, there will be a collision.
    → Not vectorizable
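
The distance rule can be written as a small predicate. This is a hedged sketch: the name collides_within_vf is hypothetical. It also makes one assumption that the slide does not state: a distance of 0 (the same element read and written in one iteration, as in a[i] = a[i] + c[i]) is treated as safe, because each vector lane then touches only its own element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Returns true when a write W[i+lc] and an access R[i+rc] to the same
 * array may collide within one vector of vf elements.
 * Assumption (not stated on the slide): distance 0 is treated as safe. */
bool collides_within_vf(long lc, long rc, long vf)
{
    assert(vf > 0);
    long distance = labs(lc - rc);
    return distance != 0 && distance < vf;
}
```

For example, a[i+1] = a[i] + c[i] gives LC = 1 and RC = 0, so collides_within_vf(1, 0, 8) reports a collision and the loop is rejected. A distance of 8 or more with VF = 8 is accepted.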

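If so, vectorize it (1/5)
• Idea: split the original loop into a vectorized loop body, an epilogue preheader, and an epilogue loop

Original loop:

int a[259], b[259], c[259];

for (i = 0; i < 259; ++i) {
    a[i] = b[i+1] + c[i];
}

Transformed loop:

int a[259], b[259], c[259];

/* Vectorized Loop Body */
for (i = 0; i < 256; i += 8) {
    a[i:i+8] = b[i+1:i+9] + c[i:i+8];
}

/* Epilogue Preheader: check if there are remainders */
if (i != 259) {
    /* Epilogue Loop: loop body for the remainder */
    for (; i < 259; ++i) {
        a[i] = b[i+1] + c[i];
    }
}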

If so, vectorize it (2/5)
• Vectorize the loop body
1. Insert a bitcast instruction after every getelementptr instruction
2. Replace uses of the getelementptr with the bitcast
   • If the user is a load or store instruction, set its alignment constraint.
3. Construct the set of instructions that require type casting from array/pointer type to vector type
   • Maximal use set of the getelementptr
   • Type cast the instructions in the type-casting set to vector type
4. Change the increment of the induction variable to the vectorization factor
5. Change the destination of the loop exit to the epilogue preheader

• Calculate alignment
  • Assumes that the base address is 32-byte aligned.
  • Only checks whether the induction variable expression breaks that alignment.
    • a[0] → 32-byte aligned
    • a[i] → 32-byte aligned
    • a[i+1] → 4-byte aligned
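
The alignment rule above can be sketched as a function of the constant offset added to the induction variable. The names access_alignment and BASE_ALIGN are hypothetical. The sketch assumes, as the slide does, a 32-byte-aligned base address and an induction variable that advances in whole 32-byte vectors. Treating any offset that is a whole multiple of 32 bytes as aligned goes slightly beyond the slide, which only lists offsets 0 and 1.

```c
#include <assert.h>

#define BASE_ALIGN 32u  /* assumed base-address alignment in bytes, per the slide */

/* Alignment in bytes to attach to a vector load/store of base[i + offset],
 * where base is BASE_ALIGN-aligned and i advances in whole vectors.
 * Falls back to the element size when the offset breaks the alignment. */
unsigned access_alignment(long offset, unsigned elem_size)
{
    assert(elem_size > 0);
    long byte_offset = offset * (long)elem_size;
    return (byte_offset % (long)BASE_ALIGN == 0) ? BASE_ALIGN : elem_size;
}
```

With 4-byte int elements, access_alignment(0, 4) yields 32 (the a[i] case) and access_alignment(1, 4) yields 4 (the a[i+1] case). These match the align 32 and align 4 annotations that the prototype places on its vector loads and stores.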

If so, vectorize it (3/5)
• Vectorized loop body

bb1: ; preds = %bb1, %bb.nph
%i.03 = phi i32 [ 0, %bb.nph ], [ %6, %bb1 ] ; <i32> [#uses=5]
%scevgep = getelementptr [259 x i32]* @a, i32 0, i32 %i.03 ; <i32*> [#uses=1]
%0 = bitcast i32* %scevgep to <8 x i32>* ; <<8 x i32>*> [#uses=1]
%scevgep4 = getelementptr [259 x i32]* @c, i32 0, i32 %i.03 ; <i32*> [#uses=1]
%1 = bitcast i32* %scevgep4 to <8 x i32>* ; <<8 x i32>*> [#uses=1]
%tmp = add i32 %i.03, 1 ; <i32> [#uses=1]
%scevgep5 = getelementptr [259 x i32]* @b, i32 0, i32 %tmp ; <i32*> [#uses=1]
%2 = bitcast i32* %scevgep5 to <8 x i32>* ; <<8 x i32>*> [#uses=1]
%3 = load <8 x i32>* %2, align 4 ; <<8 x i32>> [#uses=1]
%4 = load <8 x i32>* %1, align 32 ; <<8 x i32>> [#uses=1]
%5 = add nsw <8 x i32> %4, %3 ; <<8 x i32>> [#uses=1]
store <8 x i32> %5, <8 x i32>* %0, align 32
%6 = add i32 %i.03, 8 ; <i32> [#uses=2]
%exitcond = icmp eq i32 %6, 256 ; <i1> [#uses=1]
br i1 %exitcond, label %bb1.preheader, label %bb

If so, vectorize it (4/5)
• Generate the epilogue preheader
  • If there are remainders, jump to the epilogue loop.

bb1.preheader: ; preds = %bb1
%7 = shl i32 %i.03, 0 ; <i32> [#uses=2]
%8 = icmp eq i32 %7, 259 ; <i1> [#uses=1]
br i1 %8, label %return, label %bb1.epilogue
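
The constants 256 and 259 in the IR follow from splitting the trip count by the vectorization factor. The following sketch shows that arithmetic; struct trip_split and split_trip_count are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Result of splitting a loop's trip count for vectorization. */
struct trip_split {
    unsigned long vector_end;  /* first index NOT covered by the vector loop */
    unsigned long remainder;   /* iterations left for the epilogue loop */
};

/* Splits trip_count into a part divisible by vf and a scalar remainder. */
struct trip_split split_trip_count(unsigned long trip_count, unsigned long vf)
{
    assert(vf > 0);
    struct trip_split s;
    s.remainder = trip_count % vf;
    s.vector_end = trip_count - s.remainder;
    return s;
}
```

For a trip count of 259 and a vectorization factor of 8, vector_end is 256 (the exit bound of the vectorized loop) and remainder is 3, so the epilogue loop is needed. When the remainder is 0, the preheader can branch straight to the return block.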

If so, vectorize it (5/5)
• Generate the epilogue loop for the remainder
1. Clone the original loop body
2. Update all uses to refer to the cloned values
3. Update the phi of the induction variable
4. Update the branch targets

bb1.epilogue: ; preds = %bb1.epilogue, %bb1.preheader
%9 = phi i32 [ %7, %bb1.preheader ], [ %17, %bb1.epilogue ] ; <i32> [#uses=4]
%10 = getelementptr [259 x i32]* @a, i32 0, i32 %9 ; <i32*> [#uses=1]
%11 = getelementptr [259 x i32]* @c, i32 0, i32 %9 ; <i32*> [#uses=1]
%12 = add i32 %9, 1 ; <i32> [#uses=1]
%13 = getelementptr [259 x i32]* @b, i32 0, i32 %12 ; <i32*> [#uses=1]
%14 = load i32* %13, align 4 ; <i32> [#uses=1]
%15 = load i32* %11, align 4 ; <i32> [#uses=1]
%16 = add nsw i32 %15, %14 ; <i32> [#uses=1]
store i32 %16, i32* %10, align 4
%17 = add i32 %9, 1 ; <i32> [#uses=2]
%18 = icmp eq i32 %17, 259 ; <i1> [#uses=1]
br i1 %18, label %return, label %bb1.epilogue

Generated Code

# Vectorized Loop Body
.LBB1_1: # %bb1
# =>This Inner Loop Header: Depth=1
    movups b+1088(,%eax,4), %xmm0
    paddd c+1084(,%eax,4), %xmm0
    movups b+1072(,%eax,4), %xmm1
    paddd c+1068(,%eax,4), %xmm1
    movaps %xmm1, a+1068(,%eax,4)
    movaps %xmm0, a+1084(,%eax,4)
    addl $8, %eax
    cmpl $-11, %eax
    jne .LBB1_1

# Epilogue Preheader
# BB#2: # %bb1.preheader
    testl %eax, %eax
    je .LBB1_5
# BB#3: # %bb1.preheader.bb1.epilogue_crit_edge
    movl $-44, %eax
    .align 16, 0x90

# Epilogue Loop
.LBB1_4: # %bb1.epilogue
# =>This Inner Loop Header: Depth=1
    movl c+1036(%eax), %ecx
    addl b+1040(%eax), %ecx
    movl %ecx, a+1036(%eax)
    addl $4, %eax
    jne .LBB1_4
.LBB1_5: # %return
    ret

Experiment Environment
• CPU
  • Intel i5 2.67GHz
• OS
  • Ubuntu 10.04
• LLVM
  • LLVM 2.7 (released 2010/04/27)
  • LLVM-GCC front end 4.2
• GCC
  • GCC 4.4.3 (Ubuntu 10.04 Canonical version)

Performance Comparison: aligned access
• a[i] = b[i] + c[i];

[Bar chart: normalized execution time (y-axis, 0 to 1.6) for element types char, short, int, and float (x-axis). Series: GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect (VF=4), LLVM Vect (VF=8). VF = vectorization factor.]

Performance Comparison: unaligned access
• For integer type

[Bar chart: normalized execution time (y-axis, 0 to 1.6) for GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect (VF=4), and LLVM Vect (VF=8) (x-axis). Series: a[i]=b[i]+c[i], a[i]=b[i+1]+c[i], a[i]=b[i+1]+c[i+1], a[i+1]=b[i+1]+c[i+1]. VF = vectorization factor.]

Conclusion
• Implemented a prototype-level LLVM vectorizer with
  • Data dependence analysis
  • Loop transformation and vectorization
  • Alignment testing
  • Type conversion
• Used a variety of LLVM infrastructure
  • Pass Manager, Loop Pass Manager, LoopSimplify form, alias analysis, IndVars, SCEV, etc.
• Its performance is quite promising
  • In most cases, it is better than the GCC tree vectorizer.
• However, the following are required to extend its coverage
  • Dependence testing needs to be extended to support multidimensional arrays
    • W[i][j][k+LC] → R[i][j][k+RC]
  • More sophisticated alignment calculation is required
    • It may need to collaborate with code generation.
    • Is there an efficient way to calculate alignment in a multidimensional array?
      • a[i][j][k]
  • Do we need to support loops whose body has more than one basic block?
