Changwoo Min (multics69@gmail.com)
                        2010/06/23
Autovectorization in llvm

  1. Changwoo Min (multics69@gmail.com), 2010/06/23
  2. Project Goal
     • Design and implement a prototype-level autovectorizer in LLVM
        • Understand LLVM and gain hands-on experience with it
        • Implement a simple analysis phase in LLVM
        • Implement a simple transform phase in LLVM
  3. Vector Support in LLVM
     • LLVM supports vector types and vector operations at the IR level
     • The backend lowers vector types to MMX/SSE instructions on the IA-32 architecture

     Source statement: a[i] = b[i] + c[i]

     LLVM IR (vector stride = 1; vector type, vector operation):

         ; b[i]
         %pb = getelementptr [32 x i32]* @b, i32 0, i32 %i
         %vb = bitcast i32* %pb to <8 x i32>*
         ; c[i]
         %pc = getelementptr [32 x i32]* @c, i32 0, i32 %i
         %vc = bitcast i32* %pc to <8 x i32>*
         ; b[i] + c[i]
         %vb_i = load <8 x i32>* %vb, align 32
         %vc_i = load <8 x i32>* %vc, align 32
         %va_i = add nsw <8 x i32> %vb_i, %vc_i
         ; a[i] = ...
         %pa = getelementptr [32 x i32]* @a, i32 0, i32 %i
         %va = bitcast i32* %pa to <8 x i32>*
         store <8 x i32> %va_i, <8 x i32>* %va, align 32

     SSE code generation:

         movaps b (,%eax,0), %xmm0
         paddd c (,%eax,0), %xmm0
         movaps %xmm0, a (,%eax,0)
  4. Vectorization, what it is?

     Original loop:

         int a[259], b[259], c[259];

         for(i=0;i<259;++i) {
           a[i] = b[i+1] + c[i];
         }

     Vectorized loop:

         int a[259], b[259], c[259];

         /* 256 = 259 rounded down to a multiple of 8 */
         for(i=0;i<256; i+=8) {
           a[i:i+8] = b[i+1:i+9] + c[i:i+8];
         }
         if (i!=259) {
           for(;i<259;++i) {
             a[i] = b[i+1] + c[i];
           }
         }
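The slice notation above is not valid C, so the sketch below spells the same split out in plain C. It is an illustration only, not the prototype's code: the names `VF`, `add_scalar`, and `add_strip_mined` are ours, the inner `lane` loop stands in for a single vector operation, and `b` is assumed to hold `n + 1` elements because the loop reads `b[i+1]`.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 8 };  /* vectorization factor; 8 matches the slide, the name is ours */

/* Scalar reference: a[i] = b[i+1] + c[i] for 0 <= i < n.
   Assumption: b holds at least n + 1 elements. */
void add_scalar(int *a, const int *b, const int *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i + 1] + c[i];
}

/* Same computation split into a main loop over whole blocks of VF
   elements and an epilogue loop for the remaining n % VF elements. */
void add_strip_mined(int *a, const int *b, const int *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t main_end = n - n % VF;  /* n rounded down to a multiple of VF */
    size_t i = 0;

    for (; i < main_end; i += VF)
        for (size_t lane = 0; lane < VF; ++lane)  /* one vector add in the real thing */
            a[i + lane] = b[i + lane + 1] + c[i + lane];

    for (; i < n; ++i)  /* epilogue: scalar loop for the remainder */
        a[i] = b[i + 1] + c[i];
}
```

With `n = 259` the main loop covers indices 0 to 255 and the epilogue handles the last three.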
  7. Vectorization, big idea
     • Find a loop
        • Use existing LLVM infrastructure
     • Is it vectorizable?
        • Is it a countable loop?
        • Are there any unvectorizable instructions?
        • Loop-independent dependence?
        • Loop-carried dependence?
     • If so (yes), vectorize it
        • Change array type to vector type
        • Type casting
        • Alignment
        • Handle remainder if any
  8. Find a loop
     • Implement "LoopVectorizer" pass as one of the transform passes
        • It inherits the LoopPass class, which is invoked for every loop.
        • PassManager, the parent of the LoopPass manager, handles integration with other LLVM passes.
     • Ask PassManager to hand over each loop in a more canonical form than a natural loop
        • LoopSimplify form
           • Entry block, exit block, latch block
           • Single backedge
        • Countable loop that increments by one
     Diagram: PassManager → LoopPass → LoopVectorize
  9. Is it vectorizable? (1/3)
     • Loop type test
        • Innermost loop
        • Countable loop
           • for(i=0;i<100;++i) → OK
           • for(;*p!=NULL;++p) → NOK (not OK)
        • Long enough to vectorize
           • for(i=0;i<3;++i) → NOK
           • The iteration count must exceed the vectorization factor.
     • Are there any unvectorizable IR instructions?
        • Function call → NOK
        • Stack allocation → NOK
        • Operation on a scalar value other than the loop induction variable → NOK
     • The stride of every pointer/array access must be one.
        • a[i] → OK, a[2*i] → NOK
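The checks on this slide can be read as one predicate over a summary of the loop. The sketch below is a hypothetical illustration: `struct loop_info` and `is_candidate` are names we made up, and the prototype derives these facts from LLVM IR rather than from a struct like this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-loop summary; not an LLVM data structure. */
struct loop_info {
    bool innermost;
    bool countable;
    long trip_count;       /* meaningful only when countable */
    bool has_call;         /* function call in the body */
    bool has_stack_alloc;  /* stack allocation in the body */
    bool has_scalar_op;    /* scalar operation other than on the induction variable */
    long stride;           /* index step per iteration: 1 for a[i], 2 for a[2*i] */
};

/* Returns true when the loop passes every test listed on the slide. */
bool is_candidate(const struct loop_info *l, long vf)
{
    assert(l != 0 && vf > 0);
    if (!l->innermost || !l->countable)
        return false;
    if (l->trip_count <= vf)  /* iteration count must exceed the vectorization factor */
        return false;
    if (l->has_call || l->has_stack_alloc || l->has_scalar_op)
        return false;
    return l->stride == 1;
}
```

A 100-iteration innermost loop over `a[i]` with no calls passes for a vectorization factor of 8, while the same loop with 3 iterations or with `a[2*i]` does not.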
  10. Is it vectorizable? (2/3)
     • Collect the array/pointer variables used on the LHS and the RHS

         a[i] = b[i+1] + c[i];

         LHS = {a[i]}, RHS = {b[i+1], c[i]}
  11. Is it vectorizable? (3/3)
     • Data dependence testing between LHS and RHS

         foreach member W in LHS
             foreach member R in LHS U RHS
                 if R is alias of W
                     if there is data dependence between W and R
                         "It is not vectorizable."
         "Ok, it is vectorizable"

     • Dependence testing
        • The strides of W and R are both one.
        • We only check whether W and R collide within one vectorization factor, by subtracting their base coefficients.
           • W[i+LC] vs. R[i+RC]
           • If |LC-RC| < vectorization factor, there will be a collision. → Not vectorizable
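The collision rule above fits in a single comparison. The sketch below is an assumption-laden illustration: the function name `collides` is ours, and it takes for granted that W and R already alias and both have stride one. Taken literally, the rule also reports a collision when LC equals RC; the slides do not say how the prototype treats that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* W[i+lc] is written and R[i+rc] is accessed in the same loop body.
   Returns true when the slide's rule |LC-RC| < VF reports a collision. */
bool collides(long lc, long rc, long vf)
{
    assert(vf > 0);
    return labs(lc - rc) < vf;
}
```

For a vectorization factor of 8, offsets 0 and 1 collide, while offsets 0 and 8 do not.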
  12. If so, vectorize it (1/5)
     • Idea

     Original loop:

         int a[259], b[259], c[259];

         for(i=0;i<259;++i) {
           a[i] = b[i+1] + c[i];
         }

     Vectorized loop:

         int a[259], b[259], c[259];

         /* Vectorized Loop Body (256 = 259 rounded down to a multiple of 8) */
         for(i=0;i<256; i+=8) {
           a[i:i+8] = b[i+1:i+9] + c[i:i+8];
         }
         /* Epilogue Preheader: check if there are remainders */
         if (i!=259) {
           /* Epilogue Loop: loop body for the remainder */
           for(;i<259;++i) {
             a[i] = b[i+1] + c[i];
           }
         }
  13. If so, vectorize it (2/5)
     • Vectorize the loop body
        1. Insert a bitcast instruction after every getelementptr instruction.
        2. Replace uses of each getelementptr with its bitcast.
           • If the user is a load or store instruction, set its alignment constraint.
        3. Construct the set of instructions that require type casting from array/pointer type to vector type.
           • Maximal use set of getelementptr
           • Cast the instructions in the type-casting set to vector type.
        4. Change the increment of the induction variable to the vectorization factor.
        5. Redirect the loop exit to the epilogue preheader.
     • Calculate alignment
        • The base address is assumed to be 32-byte aligned.
        • Only check whether the induction variable expression breaks that alignment.
           • a[0] → 32-byte aligned
           • a[i] → 32-byte aligned
           • a[i+1] → 4-byte aligned
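One way to reproduce the three alignment examples above is to look at the byte offset that the constant index adds. This is a sketch under stated assumptions, not the prototype's algorithm: the name `access_alignment` is ours, the array base is assumed aligned to `base_align` (a power of two), and `i` times the element size is assumed to stay a multiple of `base_align`, which holds for 4-byte elements when `i` advances by 8. The intermediate results it can return (8 or 16 bytes) go beyond what the slide states.

```c
#include <assert.h>
#include <stddef.h>

/* Largest power-of-two alignment, in bytes, that &a[i + offset] keeps
   under the assumptions described above. */
size_t access_alignment(long offset, size_t elem_size, size_t base_align)
{
    assert(elem_size > 0 && base_align > 0);
    assert((base_align & (base_align - 1)) == 0);  /* power of two */

    unsigned long bytes = (unsigned long)(offset < 0 ? -offset : offset) * elem_size;
    size_t align = base_align;
    while (align > 1 && bytes % align != 0)
        align /= 2;  /* halve until the byte offset is a multiple of it */
    return align;
}
```

With 4-byte elements and a 32-byte aligned base, offset 0 yields 32 and offset 1 yields 4, matching `a[i]` and `a[i+1]` on the slide.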
  14. If so, vectorize it (3/5)
     • Vectorized loop body

         bb1:                                              ; preds = %bb1, %bb.nph
           %i.03 = phi i32 [ 0, %bb.nph ], [ %6, %bb1 ]    ; <i32> [#uses=5]
           %scevgep = getelementptr [259 x i32]* @a, i32 0, i32 %i.03   ; <i32*> [#uses=1]
           %0 = bitcast i32* %scevgep to <8 x i32>*        ; <<8 x i32>*> [#uses=1]
           %scevgep4 = getelementptr [259 x i32]* @c, i32 0, i32 %i.03  ; <i32*> [#uses=1]
           %1 = bitcast i32* %scevgep4 to <8 x i32>*       ; <<8 x i32>*> [#uses=1]
           %tmp = add i32 %i.03, 1                         ; <i32> [#uses=1]
           %scevgep5 = getelementptr [259 x i32]* @b, i32 0, i32 %tmp   ; <i32*> [#uses=1]
           %2 = bitcast i32* %scevgep5 to <8 x i32>*       ; <<8 x i32>*> [#uses=1]
           %3 = load <8 x i32>* %2, align 4                ; <<8 x i32>> [#uses=1]
           %4 = load <8 x i32>* %1, align 32               ; <<8 x i32>> [#uses=1]
           %5 = add nsw <8 x i32> %4, %3                   ; <<8 x i32>> [#uses=1]
           store <8 x i32> %5, <8 x i32>* %0, align 32
           %6 = add i32 %i.03, 8                           ; <i32> [#uses=2]
           %exitcond = icmp eq i32 %6, 256                 ; <i1> [#uses=1]
           br i1 %exitcond, label %bb1.preheader, label %bb
  15. If so, vectorize it (4/5)
     • Generate the epilogue preheader
        • If there are remainders, jump to the epilogue loop.

         bb1.preheader:                                    ; preds = %bb1
           %7 = shl i32 %i.03, 0                           ; <i32> [#uses=2]
           %8 = icmp eq i32 %7, 259                        ; <i1> [#uses=1]
           br i1 %8, label %return, label %bb1.epilogue
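The constants 256 and 259 in the two IR listings follow from splitting the trip count by the vectorization factor. The helper below is a hypothetical illustration of that arithmetic; `struct trip_split` and `split_trip_count` are our names, not part of the prototype.

```c
#include <assert.h>

struct trip_split {
    long main_end;   /* first index not covered by the vectorized loop */
    long remainder;  /* iterations left for the epilogue loop */
};

/* Splits a trip count into the part handled in blocks of vf
   and the leftover part handled one element at a time. */
struct trip_split split_trip_count(long trip_count, long vf)
{
    assert(trip_count >= 0 && vf > 0);
    struct trip_split s;
    s.remainder = trip_count % vf;
    s.main_end = trip_count - s.remainder;
    return s;
}
```

For a trip count of 259 and a factor of 8 this gives `main_end = 256` and `remainder = 3`, so the epilogue loop is needed. When the remainder is zero the preheader can branch straight to the return block.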
  16. If so, vectorize it (5/5)
     • Generate the epilogue loop for the remainder
        1. Clone the original loop body.
        2. Update all the uses to refer to the cloned instructions.
        3. Update the phi of the induction variable.
        4. Update the branch target.

         bb1.epilogue:                                     ; preds = %bb1.epilogue, %bb1.preheader
           %9 = phi i32 [ %7, %bb1.preheader ], [ %17, %bb1.epilogue ]  ; <i32> [#uses=4]
           %10 = getelementptr [259 x i32]* @a, i32 0, i32 %9   ; <i32*> [#uses=1]
           %11 = getelementptr [259 x i32]* @c, i32 0, i32 %9   ; <i32*> [#uses=1]
           %12 = add i32 %9, 1                             ; <i32> [#uses=1]
           %13 = getelementptr [259 x i32]* @b, i32 0, i32 %12  ; <i32*> [#uses=1]
           %14 = load i32* %13, align 4                    ; <i32> [#uses=1]
           %15 = load i32* %11, align 4                    ; <i32> [#uses=1]
           %16 = add nsw i32 %15, %14                      ; <i32> [#uses=1]
           store i32 %16, i32* %10, align 4
           %17 = add i32 %9, 1                             ; <i32> [#uses=2]
           %18 = icmp eq i32 %17, 259                      ; <i1> [#uses=1]
           br i1 %18, label %return, label %bb1.epilogue
  17. Generated Code

         # Vectorized Loop Body
         .LBB1_1:                # %bb1
                                 # =>This Inner Loop Header: Depth=1
             movups b+1088(,%eax,4), %xmm0
             paddd  c+1084(,%eax,4), %xmm0
             movups b+1072(,%eax,4), %xmm1
             paddd  c+1068(,%eax,4), %xmm1
             movaps %xmm1, a+1068(,%eax,4)
             movaps %xmm0, a+1084(,%eax,4)
             addl   $8, %eax
             cmpl   $-11, %eax
             jne    .LBB1_1
         # Epilogue Preheader
         # BB#2:                 # %bb1.preheader
             testl  %eax, %eax
             je     .LBB1_5
         # BB#3:                 # %bb1.preheader.bb1.epilogue_crit_edge
             movl   $-44, %eax
             .align 16, 0x90
         # Epilogue Loop
         .LBB1_4:                # %bb1.epilogue
                                 # =>This Inner Loop Header: Depth=1
             movl   c+1036(%eax), %ecx
             addl   b+1040(%eax), %ecx
             movl   %ecx, a+1036(%eax)
             addl   $4, %eax
             jne    .LBB1_4
         .LBB1_5:                # %return
             ret
  18. Experiment Environment
     • CPU
        • Intel i5 2.67GHz
     • OS
        • Ubuntu 10.04
     • LLVM
        • LLVM 2.7 (released 04/27/2010)
        • LLVM-GCC Front End 4.2
     • GCC
        • GCC 4.4.3 (Ubuntu 10.04 Canonical version)
  19. Performance Comparison: aligned access
     • a[i] = b[i] + c[i];
     [Bar chart. Y-axis: normalized execution time (0 to 1.6). X-axis: element type (char, short, int, float). Series: GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect (VF=4), LLVM Vect (VF=8). VF = vectorization factor.]
  20. Performance Comparison: unaligned access
     • For integer type
     [Bar chart. Y-axis: normalized execution time (0 to 1.6). X-axis: GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect (VF=4), LLVM Vect (VF=8). Series: a[i]=b[i]+c[i], a[i]=b[i+1]+c[i], a[i]=b[i+1]+c[i+1], a[i+1]=b[i+1]+c[i+1]. VF = vectorization factor.]
  21. Conclusion
     • Implemented a prototype-level LLVM vectorizer with
        • Data dependence analysis
        • Loop transformation and vectorization
        • Alignment testing
        • Type conversion
     • It uses a variety of LLVM infrastructure
        • Pass Manager, Loop Pass Manager, LoopSimplify form, alias analysis, IndVars, SCEV, etc.
     • Its performance is quite promising
        • In most cases, it is better than GCC's tree vectorizer.
     • However, the following are required to extend its coverage
        • Dependence testing must be extended to support multi-dimensional arrays
           • W[i][j][k+LC] vs. R[i][j][k+RC]
        • More sophisticated alignment calculation is required
           • It may need to collaborate with code generation.
           • Is there an efficient way to calculate alignment in a multi-dimensional array?
              • a[i][j][k]
        • Do we need to support loops whose body has more than one basic block?
