2. Project Goal
Design and implement a prototype-level autovectorizer in LLVM
  Understand LLVM and get hands-on experience with it
  Implement a simple analysis phase in LLVM
  Implement a simple transform phase in LLVM
3. Vector Support in LLVM
Supports vector types and their operations at the IR level
Lowers vector types to MMX/SSE instructions on the IA-32 architecture
C source (vector stride = 1):
  a[i] = b[i] + c[i]
LLVM IR (vector type, vector operation):
  %pb = getelementptr [32 x i32]* @b, i32 0, i32 %i
  %vb = bitcast i32* %pb to <8 x i32>*
  %pc = getelementptr [32 x i32]* @c, i32 0, i32 %i
  %vc = bitcast i32* %pc to <8 x i32>*
  %vb_i = load <8 x i32>* %vb, align 32
  %vc_i = load <8 x i32>* %vc, align 32
  %va_i = add nsw <8 x i32> %vb_i, %vc_i
  %pa = getelementptr [32 x i32]* @a, i32 0, i32 %i
  %va = bitcast i32* %pa to <8 x i32>*
  store <8 x i32> %va_i, <8 x i32>* %va, align 32
SSE code generation:
  movaps b(,%eax,4), %xmm0
  paddd c(,%eax,4), %xmm0
  movaps %xmm0, a(,%eax,4)
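4. Vectorization: what is it?
Original loop:
  int a[259], b[259], c[259];
  for (i = 0; i < 259; ++i) {
    a[i] = b[i+1] + c[i];
  }
Vectorized loop (vectorization factor 8) with a scalar loop for the remainder:
  int a[259], b[259], c[259];
  for (i = 0; i < 256; i += 8) {    /* 256 = 259 - 259 % 8 */
    a[i:i+8] = b[i+1:i+9] + c[i:i+8];
  }
  if (i != 259) {
    for (; i < 259; ++i) {
      a[i] = b[i+1] + c[i];
    }
  }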
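The transformation can be checked in plain C. The sketch below is only an illustration: the function names are made up here, the inner `lane` loop stands in for one 8-wide vector operation, and `b` is assumed to hold `n + 1` elements so that `b[i+1]` stays in bounds.

```c
#include <stddef.h>

#define N  259   /* array length used on the slide */
#define VF 8     /* vectorization factor used on the slide */

/* Reference scalar loop: a[i] = b[i+1] + c[i]. b must hold n + 1 elements. */
void add_scalar(int *a, const int *b, const int *c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i + 1] + c[i];
}

/* Same computation, split into a VF-wide main loop and a scalar epilogue. */
void add_strip_mined(int *a, const int *b, const int *c, size_t n) {
    size_t i = 0;
    size_t vec_end = n - n % VF;          /* 256 when n == 259 */

    for (; i < vec_end; i += VF)          /* each pass models one vector operation */
        for (size_t lane = 0; lane < VF; ++lane)
            a[i + lane] = b[i + lane + 1] + c[i + lane];

    for (; i < n; ++i)                    /* epilogue: the n % VF leftover elements */
        a[i] = b[i + 1] + c[i];
}
```

Both functions should produce the same `a` for any `n`; when `n` is a multiple of `VF` the epilogue loop simply runs zero times.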
7. Vectorization, big idea
Find a loop
• Use existing LLVM infrastructure
Is it vectorizable?
• Is it a countable loop?
• Are there any unvectorizable instructions?
• Loop-independent dependence?
• Loop-carried dependence?
If so (yes), vectorize it
• Change array type to vector type
• Type casting
• Alignment
• Handle remainder if any
8. Find a loop
Implement the “LoopVectorizer” pass as one of the transform passes
  Inherit the LoopPass class, which is invoked for every loop.
  The PassManager, which is the parent of the LoopPass manager, handles integration with other LLVM passes.
Ask the PassManager to hand over a loop in a more canonical form than a natural loop
  LoopSimplify form
    Entry block, exit block, latch block
    Single backedge
  Countable loop that increments by one
(Diagram: PassManager, LoopPass, LoopSimplify form, LoopVectorize)
9. Is it vectorizable? (1/3)
Loop type test
  Innermost loop
  Countable loop
    for(i=0;i<100;++i)  OK
    for(;*p!=NULL;++p)  NOK
  Long enough to vectorize
    for(i=0;i<3;++i)  NOK
    The iteration count should be larger than the vectorization factor.
Are there any unvectorizable IR instructions?
  Function call  NOK
  Stack allocation  NOK
  Operation on a scalar value other than the loop induction variable  NOK
The stride of each pointer/array access should be one.
  a[i] OK, a[2*i] NOK
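The checks on this slide can be read as a single predicate. The sketch below is hypothetical: `struct loop_info` and its field names are invented for illustration and do not correspond to any LLVM data structure; a real pass would derive these facts from the IR.

```c
#include <stdbool.h>

/* Hypothetical summary of a loop; the field names are illustrative only. */
struct loop_info {
    bool innermost;          /* no loop nested inside */
    bool countable;          /* trip count is known before the loop starts */
    long trip_count;         /* meaningful only when countable */
    bool has_call;           /* body contains a function call */
    bool has_stack_alloc;    /* body contains a stack allocation */
    bool has_scalar_update;  /* body operates on a scalar other than the induction variable */
    long stride;             /* index stride per iteration: a[i] is 1, a[2*i] is 2 */
};

/* Returns true when the loop passes every test listed on the slide. */
bool passes_loop_tests(const struct loop_info *l, long vf) {
    if (!l->innermost || !l->countable)
        return false;
    if (l->trip_count <= vf)             /* must be longer than the vectorization factor */
        return false;
    if (l->has_call || l->has_stack_alloc || l->has_scalar_update)
        return false;
    return l->stride == 1;               /* only unit-stride accesses are accepted */
}
```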
10. Is it vectorizable? (2/3)
Collect the array/pointer references used on the LHS and RHS
  a[i] = b[i+1] + c[i];
  LHS = {a[i]}, RHS = {b[i+1], c[i]}
11. Is it vectorizable? (3/3)
Data dependence testing between LHS and RHS
  foreach member W in LHS
    foreach member R in LHS ∪ RHS
      if R is an alias of W
        if there is a data dependence between W and R
          “It is not vectorizable.”
  “OK, it is vectorizable.”
Dependence testing
  The strides of W and R are one.
  We only check whether W and R collide WITHIN the vectorization factor, by subtracting their constant offsets.
    W[i+LC]  R[i+RC]
    If |LC-RC| < vectorization factor, there is a collision: not vectorizable.
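A minimal sketch of this test in C follows. It is not the prototype's code: `struct ref`, `collides`, and `is_vectorizable` are names invented here, aliasing is modeled simply as two references sharing the same base pointer, and skipping the comparison of W with itself is an assumption. The rule is applied literally, so a same-array pair at distance 0 (for example `a[i] = a[i] + 1`) is also rejected, which is conservative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* A stride-one reference base[i + offset].
 * Equal base pointers model "R is an alias of W". */
struct ref {
    const void *base;
    long offset;
};

/* Slide rule: W[i+LC] and R[i+RC] collide when |LC - RC| < vectorization factor. */
static bool collides(const struct ref *w, const struct ref *r, long vf) {
    return w->base == r->base && labs(w->offset - r->offset) < vf;
}

bool is_vectorizable(const struct ref *lhs, size_t n_lhs,
                     const struct ref *rhs, size_t n_rhs, long vf) {
    for (size_t w = 0; w < n_lhs; ++w) {
        for (size_t r = 0; r < n_lhs; ++r)        /* R drawn from LHS */
            if (r != w && collides(&lhs[w], &lhs[r], vf))  /* assumption: skip W itself */
                return false;
        for (size_t r = 0; r < n_rhs; ++r)        /* R drawn from RHS */
            if (collides(&lhs[w], &rhs[r], vf))
                return false;
    }
    return true;
}
```

For `a[i] = b[i+1] + c[i]` with three distinct arrays, no pair aliases, so the loop is accepted. Replacing `b[i+1]` with `a[i+1]` gives a distance of 1, which is below a vectorization factor of 8, so the loop is rejected.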
12. If so, vectorize it (1/5)
Idea
Original loop:
  int a[259], b[259], c[259];
  for (i = 0; i < 259; ++i) {
    a[i] = b[i+1] + c[i];
  }
Transformed loop:
  int a[259], b[259], c[259];
  for (i = 0; i < 256; i += 8) {        /* Vectorized Loop Body */
    a[i:i+8] = b[i+1:i+9] + c[i:i+8];
  }
  if (i != 259) {                       /* Epilogue Preheader: check if there are remainders */
    for (; i < 259; ++i) {              /* Epilogue Loop: loop body for the remainder */
      a[i] = b[i+1] + c[i];
    }
  }
Control flow: Vectorized Loop Body, then Epilogue Preheader, then Epilogue Loop
13. If so, vectorize it (2/5)
Vectorize loop body
  1. Insert a bitcast instruction after every getelementptr instruction
  2. Replace uses of each getelementptr with its bitcast
     If the user is a load or store instruction, set its alignment constraint.
  3. Construct the set of instructions that require type casting from array/pointer type to vector type
     Maximal use set of the getelementptr
     Cast the instructions in that set to vector type
  4. Change the increment of the induction variable to the vectorization factor
  5. Redirect the loop exit to the epilogue preheader
Calculate alignment
  Assumes the base address is 32-byte aligned.
  Only checks whether the offset from the induction variable breaks that alignment.
    a[0]    32-byte aligned
    a[i]    32-byte aligned
    a[i+1]  4-byte aligned
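The alignment rule can be sketched as a small C helper. This is an illustration under the slide's assumptions, not the prototype's code: `access_alignment` and `BASE_ALIGN` are names invented here, the array base is taken to be 32-byte aligned, and `i` is assumed to advance in steps whose byte size is a multiple of 32 (8 elements of 4 bytes). Falling back to the element size for every other offset is a conservative choice.

```c
#include <stddef.h>

#define BASE_ALIGN 32   /* assumed alignment of the array base, in bytes */

/* Alignment, in bytes, to place on a vector load/store of a[i + offset],
 * where elem_size is the element size in bytes. */
unsigned access_alignment(long offset, size_t elem_size) {
    long byte_offset = offset * (long)elem_size;
    if (byte_offset % BASE_ALIGN == 0)
        return BASE_ALIGN;           /* a[0], a[i]: still 32-byte aligned */
    return (unsigned)elem_size;      /* a[i+1]: only element alignment is guaranteed */
}
```

With 4-byte `int` elements this gives 32 for `a[i]` and 4 for `a[i+1]`, matching the `align 32` and `align 4` annotations on the vector loads and stores.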
14. If so, vectorize it (3/5)
Vectorized Loop Body
bb1: ; preds = %bb1, %bb.nph
%i.03 = phi i32 [ 0, %bb.nph ], [ %6, %bb1 ] ; <i32> [#uses=5]
%scevgep = getelementptr [259 x i32]* @a, i32 0, i32 %i.03 ; <i32*> [#uses=1]
%0 = bitcast i32* %scevgep to <8 x i32>* ; <<8 x i32>*> [#uses=1]
%scevgep4 = getelementptr [259 x i32]* @c, i32 0, i32 %i.03 ; <i32*> [#uses=1]
%1 = bitcast i32* %scevgep4 to <8 x i32>* ; <<8 x i32>*> [#uses=1]
%tmp = add i32 %i.03, 1 ; <i32> [#uses=1]
%scevgep5 = getelementptr [259 x i32]* @b, i32 0, i32 %tmp ; <i32*> [#uses=1]
%2 = bitcast i32* %scevgep5 to <8 x i32>* ; <<8 x i32>*> [#uses=1]
%3 = load <8 x i32>* %2, align 4 ; <<8 x i32>> [#uses=1]
%4 = load <8 x i32>* %1, align 32 ; <<8 x i32>> [#uses=1]
%5 = add nsw <8 x i32> %4, %3 ; <<8 x i32>> [#uses=1]
store <8 x i32> %5, <8 x i32>* %0, align 32
%6 = add i32 %i.03, 8 ; <i32> [#uses=2]
%exitcond = icmp eq i32 %6, 256 ; <i1> [#uses=1]
br i1 %exitcond, label %bb1.preheader, label %bb1
15. If so, vectorize it (4/5)
Generate epilogue preheader
If there are remainders, jump to epilogue loop.
bb1.preheader: ; preds = %bb1
%7 = shl i32 %i.03, 0 ; <i32> [#uses=2]
%8 = icmp eq i32 %7, 259 ; <i1> [#uses=1]
br i1 %8, label %return, label %bb1.epilogue
21. Conclusion
Implemented a prototype-level LLVM vectorizer with
  Data dependence analysis
  Loop transformation and vectorization
  Alignment testing
  Type conversion
Uses a variety of LLVM infrastructure
  PassManager, LoopPass manager, LoopSimplify form, alias analysis, IndVars, SCEV, etc.
Its performance is quite promising
  In most cases, it is better than GCC's tree vectorizer.
But the following are required to extend its coverage
  Dependence testing must be extended to support multidimensional arrays
    W[i][j][k+LC]  R[i][j][k+RC]
  More sophisticated alignment calculation is required
    It may need to collaborate with code generation.
    Is there an efficient way to calculate alignment in a multidimensional array?
      a[i][j][k]
  Do we need to support loops whose body has more than one basic block?