Autovectorization in llvm
Published in: Technology
Transcript

  • 1. Changwoo Min (multics69@gmail.com) 2010/06/23
  • 2. Project Goal
      • Design and implement a prototype-level autovectorizer in LLVM
      • Understand LLVM and get hands-on experience with it
      • Implement a simple analysis phase in LLVM
      • Implement a simple transform phase in LLVM
  • 3. Vector Support in LLVM
      • Supports vector types and their operations at the IR level
      • Lowers vector types to MMX/SSE instructions on the IA32 architecture
      • Example: b[i] + c[i] = a[i] (vector stride = 1)
      • Vector type and vector operation in IR:
            %pb = getelementptr [32 x i32]* @b, i32 0, i32 %i
            %vb = bitcast i32* %pb to <8 x i32>*
            %pc = getelementptr [32 x i32]* @c, i32 0, i32 %i
            %vc = bitcast i32* %pc to <8 x i32>*
            %vb_i = load <8 x i32>* %vb, align 32
            %vc_i = load <8 x i32>* %vc, align 32
            %va_i = add nsw <8 x i32> %vb_i, %vc_i
            %pa = getelementptr [32 x i32]* @a, i32 0, i32 %i
            %va = bitcast i32* %pa to <8 x i32>*
            store <8 x i32> %va_i, <8 x i32>* %va, align 32
      • SSE code generation (schematic):
            movaps b(,%eax,4), %xmm0
            paddd  c(,%eax,4), %xmm0
            movaps %xmm0, a(,%eax,4)
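The vector type and element-wise add on this slide can also be tried from C. The following is a minimal sketch, assuming a GCC- or Clang-compatible compiler (the `vector_size` attribute is a compiler extension, not standard C) and a 32-bit `int`. The names `v8si`, `add8`, and `add8_check` are made up for this illustration and do not come from the slides.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang vector extension: eight 32-bit ints in one 32-byte value. */
typedef int v8si __attribute__((vector_size(32)));

/* Computes a[0..7] = b[0..7] + c[0..7] with one element-wise vector add. */
void add8(int *a, const int *b, const int *c) {
    v8si vb, vc, va;
    assert(a && b && c);
    memcpy(&vb, b, sizeof vb);   /* load that is safe for unaligned pointers */
    memcpy(&vc, c, sizeof vc);
    va = vb + vc;                /* plays the role of `add <8 x i32>` in the IR */
    memcpy(a, &va, sizeof va);   /* store that is safe for unaligned pointers */
}

/* Hypothetical self-check: returns 1 when add8 matches the scalar result. */
int add8_check(void) {
    int a[8], b[8], c[8];
    for (int i = 0; i < 8; ++i) {
        b[i] = i;
        c[i] = 10 * i;
    }
    add8(a, b, c);
    for (int i = 0; i < 8; ++i)
        if (a[i] != 11 * i)
            return 0;
    return 1;
}
```

The `memcpy` calls avoid any alignment requirement on the pointers; the IR on the slide instead casts the pointer and relies on the stated `align 32`.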
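  • 4. Vectorization: what is it?
      • Original loop:
            int a[259], b[259], c[259];
            for (i = 0; i < 259; ++i) {
                a[i] = b[i+1] + c[i];
            }
      • Vectorized loop (vectorization factor 8) with a scalar loop for the remainder:
            int a[259], b[259], c[259];
            for (i = 0; i + 8 <= 259; i += 8) {
                a[i:i+8] = b[i+1:i+9] + c[i:i+8];
            }
            if (i != 259) {
                for (; i < 259; ++i) {
                    a[i] = b[i+1] + c[i];
                }
            }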
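The transformation on the "what is it" slides can be checked in plain C by strip-mining the loop by hand. This is a minimal sketch, not the pass itself: the inner `k` loop stands in for one 8-wide vector operation, `b` is assumed to hold `n + 1` elements so that `b[i+1]` stays in bounds, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 8 };  /* vectorization factor used in the slides */

/* Reference scalar loop: a[i] = b[i+1] + c[i]; b must hold n + 1 elements. */
void add_scalar(int *a, const int *b, const int *c, size_t n) {
    assert(a && b && c);
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i + 1] + c[i];
}

/* Strip-mined form: full blocks of VF elements, then a scalar remainder loop. */
void add_strip_mined(int *a, const int *b, const int *c, size_t n) {
    size_t i = 0;
    assert(a && b && c);
    for (; i + VF <= n; i += VF) {
        /* Stands in for a[i:i+8] = b[i+1:i+9] + c[i:i+8]. */
        for (size_t k = 0; k < VF; ++k)
            a[i + k] = b[i + k + 1] + c[i + k];
    }
    for (; i < n; ++i)  /* epilogue: the n % VF leftover iterations */
        a[i] = b[i + 1] + c[i];
}

/* Hypothetical self-check: returns 1 when both forms agree for n = 259. */
int strip_mined_matches_scalar(void) {
    enum { N = 259 };
    static int a1[N], a2[N], b[N + 1], c[N];
    for (int i = 0; i < N; ++i) {
        b[i] = 3 * i;
        c[i] = i - 7;
    }
    b[N] = 3 * N;
    add_scalar(a1, b, c, N);
    add_strip_mined(a2, b, c, N);
    for (int i = 0; i < N; ++i)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```

With n = 259 the blocked loop covers indices 0 to 255 and the remainder loop handles the last three.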
  • 7. Vectorization, big idea
      • Find a loop
          • Use existing LLVM infrastructure
      • Is it vectorizable?
          • Is it a countable loop?
          • Are there any unvectorizable instructions?
          • Loop-independent dependence?
          • Loop-carried dependence?
      • If so, vectorize it
          • Change array type to vector type
          • Type casting
          • Alignment
          • Handle the remainder, if any
  • 8. Find a loop
      • Implement the "LoopVectorizer" pass as one of the transform passes
          • Inherit from the LoopPass class, which is invoked for every loop.
          • PassManager, which is the parent of the LoopPass manager, deals with integrating other LLVM passes.
          • Diagram: PassManager → LoopPass → LoopVectorize
      • Ask PassManager to hand over a loop in a more canonical form than a natural loop
          • LoopSimplify form
              • Entry block, exit block, latch block
              • Single backedge
          • Countable loop which increments by one
  • 9. Is it vectorizable? (1/3)
      • Loop type test
          • Innermost loop
          • Countable loop
              • for(i=0;i<100;++i) → OK
              • for(;*p!=NULL;++p) → NOK
          • Long enough to vectorize
              • for(i=0;i<3;++i) → NOK
              • The iteration count should be larger than the vectorization factor.
      • Are there any unvectorizable IR instructions?
          • Function call → NOK
          • Stack allocation → NOK
          • Operation on a scalar value other than the loop induction variable → NOK
      • The stride of the pointer/array should be one.
          • a[i] → OK, a[2*i] → NOK
  • 10. Is it vectorizable? (2/3)
      • Collect array/pointer variables used in LHS and RHS
            a[i] = b[i+1] + c[i];
            LHS = {a[i]}, RHS = {b[i+1], c[i]}
  • 11. Is it vectorizable? (3/3)
      • Data dependence testing between LHS and RHS
            foreach member W in LHS
                foreach member R in LHS ∪ RHS
                    if R is an alias of W
                        if there is a data dependence between W and R
                            "It is not vectorizable."
            "OK, it is vectorizable."
      • Dependence testing
          • The strides of W and R are one.
          • We only check whether W and R will collide WITHIN the vectorization factor, by subtracting their base coefficients.
          • For a write W[i+LC] and a read R[i+RC]: if |LC-RC| < vectorization factor, there will be a collision → not vectorizable
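The nested-loop test above can be sketched in C as follows. This is an illustration under stated assumptions, not the pass's actual code: `struct access`, `collides`, `is_vectorizable`, and `same_array_case` are hypothetical names, aliasing is modelled simply as "same array id", and a distance of zero (for example a[i] = a[i] + 1) is assumed to be safe, which the slide's inequality does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record for a unit-stride access base[i + offset]. */
struct access {
    int base;     /* id of the underlying array; equal ids are assumed to alias */
    long offset;  /* constant added to the induction variable (LC or RC) */
};

/* True when W and R may touch the same element within one block of vf iterations.
   Assumption: identical offsets (distance 0) are treated as safe. */
bool collides(struct access w, struct access r, long vf) {
    assert(vf > 0);
    if (w.base != r.base)
        return false;                     /* no alias, no dependence */
    long distance = labs(w.offset - r.offset);
    return distance != 0 && distance < vf;
}

/* Tests every write in LHS against every access in LHS and RHS. */
bool is_vectorizable(const struct access *lhs, size_t n_lhs,
                     const struct access *rhs, size_t n_rhs, long vf) {
    for (size_t w = 0; w < n_lhs; ++w) {
        for (size_t r = 0; r < n_lhs; ++r)
            if (collides(lhs[w], lhs[r], vf))
                return false;
        for (size_t r = 0; r < n_rhs; ++r)
            if (collides(lhs[w], rhs[r], vf))
                return false;
    }
    return true;
}

/* Hypothetical helper modelling a[i + lc] = a[i + rc] + b[i]. */
bool same_array_case(long lc, long rc, long vf) {
    struct access lhs[] = { { 0, lc } };
    struct access rhs[] = { { 0, rc }, { 1, 0 } };
    return is_vectorizable(lhs, 1, rhs, 2, vf);
}
```

For example, a[i] = a[i+1] + b[i] has distance 1 and is rejected for a vectorization factor of 8, while a[i] = a[i+8] + b[i] has distance 8 and passes this test.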
  • 12. If so, vectorize it (1/5)
      • Idea
          • Original loop (Loop Body):
                int a[259], b[259], c[259];
                for (i = 0; i < 259; ++i) {
                    a[i] = b[i+1] + c[i];
                }
          • Transformed loop:
                int a[259], b[259], c[259];
                for (i = 0; i + 8 <= 259; i += 8) {
                    a[i:i+8] = b[i+1:i+9] + c[i:i+8];
                }
                if (i != 259) {
                    for (; i < 259; ++i) {
                        a[i] = b[i+1] + c[i];
                    }
                }
          • Resulting blocks: Vectorized Loop Body → Epilogue Preheader (check if there are remainders) → Epilogue Loop (loop for the remainder)
  • 13. If so, vectorize it (2/5)
      • Vectorize the loop body
          1. Insert a bitcast instruction after every getelementptr instruction
          2. Replace uses of the getelementptr with uses of the bitcast
              • If the use is a load or store instruction, set its alignment constraint.
          3. Construct the set of instructions which require type casting from array/pointer type to vector type
              • Maximal use set of the getelementptr
              • Type cast the instructions in the type casting set to vector type
          4. Change the increment of the induction variable to the vectorization factor
          5. Change the destination of the loop exit to the epilogue preheader
      • Calculate alignment
          • It assumes the base address is 32-byte aligned.
          • It only checks whether the induction variable breaks this alignment.
          • a[0] → 32-byte aligned
          • a[i] → 32-byte aligned
          • a[i+1] → 4-byte aligned
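One way to reproduce the three alignment results above is to take the largest power of two that divides the base alignment, the per-iteration step in bytes, and the constant offset in bytes. This is a sketch that generalizes slightly beyond the slide, which only distinguishes the 32-byte and 4-byte cases. The function names and the parameter order are assumptions made for this example.

```c
#include <assert.h>

/* Largest power of two dividing x, for x > 0 (the lowest set bit). */
static unsigned long low_bit(unsigned long x) {
    return x & (~x + 1UL);
}

/* Guaranteed alignment in bytes of the access a[i + offset], where the base
   of a is base_align-aligned and i advances by vf elements per iteration,
   starting from 0. */
unsigned long access_alignment(unsigned long base_align, unsigned long elem_size,
                               unsigned long vf, unsigned long offset) {
    assert(base_align > 0 && (base_align & (base_align - 1)) == 0);
    assert(elem_size > 0 && vf > 0);
    /* The lowest set bit of the OR is the smallest of the three lowest set
       bits, i.e. the weakest of the three alignment guarantees. */
    return low_bit(base_align | (vf * elem_size) | (offset * elem_size));
}
```

With a 32-byte aligned base, 4-byte elements, and a vectorization factor of 8, offset 0 gives 32 and offset 1 gives 4, matching a[i] and a[i+1] on the slide.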
  • 14. If so, vectorize it (3/5)
      • Vectorized loop body
            bb1:                                                        ; preds = %bb1, %bb.nph
              %i.03 = phi i32 [ 0, %bb.nph ], [ %6, %bb1 ]              ; <i32> [#uses=5]
              %scevgep = getelementptr [259 x i32]* @a, i32 0, i32 %i.03   ; <i32*> [#uses=1]
              %0 = bitcast i32* %scevgep to <8 x i32>*                  ; <<8 x i32>*> [#uses=1]
              %scevgep4 = getelementptr [259 x i32]* @c, i32 0, i32 %i.03  ; <i32*> [#uses=1]
              %1 = bitcast i32* %scevgep4 to <8 x i32>*                 ; <<8 x i32>*> [#uses=1]
              %tmp = add i32 %i.03, 1                                   ; <i32> [#uses=1]
              %scevgep5 = getelementptr [259 x i32]* @b, i32 0, i32 %tmp   ; <i32*> [#uses=1]
              %2 = bitcast i32* %scevgep5 to <8 x i32>*                 ; <<8 x i32>*> [#uses=1]
              %3 = load <8 x i32>* %2, align 4                          ; <<8 x i32>> [#uses=1]
              %4 = load <8 x i32>* %1, align 32                         ; <<8 x i32>> [#uses=1]
              %5 = add nsw <8 x i32> %4, %3                             ; <<8 x i32>> [#uses=1]
              store <8 x i32> %5, <8 x i32>* %0, align 32
              %6 = add i32 %i.03, 8                                     ; <i32> [#uses=2]
              %exitcond = icmp eq i32 %6, 256                           ; <i1> [#uses=1]
              br i1 %exitcond, label %bb1.preheader, label %bb
  • 15. If so, vectorize it (4/5)
      • Generate the epilogue preheader
          • If there are remainders, jump to the epilogue loop.
            bb1.preheader:                                              ; preds = %bb1
              %7 = shl i32 %i.03, 0                                     ; <i32> [#uses=2]
              %8 = icmp eq i32 %7, 259                                  ; <i1> [#uses=1]
              br i1 %8, label %return, label %bb1.epilogue
  • 16. If so, vectorize it (5/5)
      • Generate the epilogue loop for the remainder
          1. Clone the original loop body
          2. Update all the uses to refer to the cloned values
          3. Update the phi of the induction variable
          4. Update the branch target
            bb1.epilogue:                                               ; preds = %bb1.epilogue, %bb1.preheader
              %9 = phi i32 [ %7, %bb1.preheader ], [ %17, %bb1.epilogue ]  ; <i32> [#uses=4]
              %10 = getelementptr [259 x i32]* @a, i32 0, i32 %9        ; <i32*> [#uses=1]
              %11 = getelementptr [259 x i32]* @c, i32 0, i32 %9        ; <i32*> [#uses=1]
              %12 = add i32 %9, 1                                       ; <i32> [#uses=1]
              %13 = getelementptr [259 x i32]* @b, i32 0, i32 %12       ; <i32*> [#uses=1]
              %14 = load i32* %13, align 4                              ; <i32> [#uses=1]
              %15 = load i32* %11, align 4                              ; <i32> [#uses=1]
              %16 = add nsw i32 %15, %14                                ; <i32> [#uses=1]
              store i32 %16, i32* %10, align 4
              %17 = add i32 %9, 1                                       ; <i32> [#uses=2]
              %18 = icmp eq i32 %17, 259                                ; <i1> [#uses=1]
              br i1 %18, label %return, label %bb1.epilogue
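The constants 256 and 259 in the IR above follow from splitting the trip count into a part covered by full vectors and a remainder. A minimal sketch of that arithmetic, with the hypothetical names `struct split` and `split_trip_count`:

```c
#include <assert.h>

/* Hypothetical result type: where the vector loop stops and what is left. */
struct split {
    long vector_end;  /* first index not covered by the vector loop */
    long remainder;   /* iterations left for the epilogue loop */
};

/* Splits n iterations into full blocks of vf plus a scalar remainder. */
struct split split_trip_count(long n, long vf) {
    struct split s;
    assert(n >= 0 && vf > 0);
    s.remainder = n % vf;
    s.vector_end = n - s.remainder;
    return s;
}
```

For n = 259 and a vectorization factor of 8 this gives a vector loop bound of 256 and three remainder iterations; when the remainder is zero, the epilogue preheader can branch straight to the return block.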
  • 17. Generated Code
      • Vectorized loop body
            .LBB1_1:                        # %bb1
                                            # =>This Inner Loop Header: Depth=1
                movups  b+1088(,%eax,4), %xmm0
                paddd   c+1084(,%eax,4), %xmm0
                movups  b+1072(,%eax,4), %xmm1
                paddd   c+1068(,%eax,4), %xmm1
                movaps  %xmm1, a+1068(,%eax,4)
                movaps  %xmm0, a+1084(,%eax,4)
                addl    $8, %eax
                cmpl    $-11, %eax
                jne     .LBB1_1
      • Epilogue preheader
            # BB#2:                         # %bb1.preheader
                testl   %eax, %eax
                je      .LBB1_5
            # BB#3:                         # %bb1.preheader.bb1.epilogue_crit_edge
                movl    $-44, %eax
                .align  16, 0x90
      • Epilogue loop
            .LBB1_4:                        # %bb1.epilogue
                                            # =>This Inner Loop Header: Depth=1
                movl    c+1036(%eax), %ecx
                addl    b+1040(%eax), %ecx
                movl    %ecx, a+1036(%eax)
                addl    $4, %eax
                jne     .LBB1_4
            .LBB1_5:                        # %return
                ret
  • 18. Experiment Environment
      • CPU: Intel i5 2.67GHz
      • OS: Ubuntu 10.04
      • LLVM
          • LLVM 2.7 (released 04/27/2010)
          • LLVM-GCC Front End 4.2
      • GCC: GCC 4.4.3 (Ubuntu 10.04 Canonical version)
  • 19. Performance Comparison: aligned access
      • Benchmark: a[i] = b[i] + c[i];
      • [Bar chart: normalized execution time (y-axis, 0 to 1.6) for element types char, short, int, and float. Series: GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect(VF=4), LLVM Vect(VF=8). VF = Vectorization Factor.]
  • 20. Performance Comparison: unaligned access
      • For integer type
      • [Bar chart: normalized execution time (y-axis, 0 to 1.6) for GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect(VF=4), and LLVM Vect(VF=8). Series: a[i]=b[i]+c[i], a[i]=b[i+1]+c[i], a[i]=b[i+1]+c[i+1], a[i+1]=b[i+1]+c[i+1]. VF = Vectorization Factor.]
  • 21. Conclusion
      • Implemented a prototype-level LLVM vectorizer with
          • Data dependence analysis
          • Loop transformation and vectorization
          • Alignment testing
          • Type conversion
      • Uses a variety of LLVM infrastructure
          • Pass Manager, Loop Pass Manager, LoopSimplify form, alias analysis, IndVars, SCEV, etc.
      • Its performance is quite promising
          • In most cases, it is better than GCC's tree vectorizer.
      • But the following are required to extend its coverage
          • Dependence testing needs to be extended to support multi-dimensional arrays
              • W[i][j][k+LC] and R[i][j][k+RC]
          • A more sophisticated alignment calculation is required
              • It may need to collaborate with code generation.
              • Do we have an efficient way to calculate alignment in a multi-dimensional array?
              • a[i][j][k]
          • Do we need to support loops whose body has more than one basic block?