Changwoo Min (multics69@gmail.com)
                        2010/06/23
Project Goal

 Design and implement a prototype-level autovectorizer in LLVM
    Understand LLVM and get hands-on experience with it
    Implement a simple analysis phase in LLVM
    Implement a simple transform phase in LLVM

Vector Support in LLVM

 Supports vector types and their operations at the IR level
 Lowers vector types to MMX/SSE instructions on the
   IA32 architecture

 Example: a[i] = b[i] + c[i] (vector stride = 1)

 Vector type and vector operations in IR:
     %pb = getelementptr [32 x i32]* @b, i32 0, i32 %i
     %vb = bitcast i32* %pb to <8 x i32>*
     %pc = getelementptr [32 x i32]* @c, i32 0, i32 %i
     %vc = bitcast i32* %pc to <8 x i32>*
     %vb_i = load <8 x i32>* %vb, align 32
     %vc_i = load <8 x i32>* %vc, align 32
     %va_i = add nsw <8 x i32> %vb_i, %vc_i
     %pa = getelementptr [32 x i32]* @a, i32 0, i32 %i
     %va = bitcast i32* %pa to <8 x i32>*
     store <8 x i32> %va_i, <8 x i32>* %va, align 32

 SSE code generation:
     movaps   b (,%eax,0), %xmm0
     paddd    c (,%eax,0), %xmm0
     movaps   %xmm0, a (,%eax,0)

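The vector type and vector add on this slide can also be written at the C level. The following is a minimal sketch, assuming a GCC or Clang compiler, since `vector_size` is a compiler extension and not standard C. The names `v8i32` and `add_block8` are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Eight 32-bit ints in one value, comparable to LLVM's <8 x i32>.
 * vector_size is a GCC/Clang extension, not standard C. */
typedef int v8i32 __attribute__((vector_size(32)));

/* a[i:i+8] = b[i:i+8] + c[i:i+8] for one block of eight elements. */
void add_block8(int *a, const int *b, const int *c, size_t i) {
    assert(a != NULL && b != NULL && c != NULL);
    v8i32 vb, vc, va;
    /* memcpy avoids any alignment assumption about the arrays. */
    memcpy(&vb, b + i, sizeof vb);
    memcpy(&vc, c + i, sizeof vc);
    va = vb + vc;                    /* element-wise vector add */
    memcpy(a + i, &va, sizeof va);
}
```

Calling `add_block8(a, b, c, 8)` fills a[8] through a[15] and leaves the rest of `a` untouched. How the compiler lowers the add depends on the target and the flags used.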
Vectorization, what is it?

 Original loop:
   int a[259], b[259], c[259]

   for(i=0;i<259;++i) {
     a[i] = b[i+1] + c[i];
   }

 Vectorized loop, followed by a scalar loop for the remainder:
   int a[259], b[259], c[259]

   for(i=0;i+8<=259; i+=8) {
     a[i:i+8] = b[i+1:i+9] + c[i:i+8];
   }

   if (i!=259) {
     for(;i<259;++i) {
       a[i] = b[i+1] + c[i];
     }
   }

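The transformation can be checked in plain C by writing the vector step as an inner loop over eight lanes. This is a sketch under assumptions, not the prototype's output: the function names are hypothetical, and unlike the slide, `b` is expected to hold one extra element so that `b[i+1]` stays in bounds.

```c
#include <assert.h>
#include <stddef.h>

#define VF ((size_t)8)  /* vectorization factor used on the slides */

/* Scalar reference: a[i] = b[i+1] + c[i] for i in [0, n).
 * b must hold at least n + 1 elements. */
void add_scalar(int *a, const int *b, const int *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i + 1] + c[i];
}

/* Strip-mined form: blocks of VF elements, then a scalar epilogue. */
void add_strip_mined(int *a, const int *b, const int *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
    /* Main loop: runs only while a full block of VF elements remains. */
    for (; i + VF <= n; i += VF) {
        /* Stands in for one vector add: a[i:i+8] = b[i+1:i+9] + c[i:i+8]. */
        for (size_t lane = 0; lane < VF; ++lane)
            a[i + lane] = b[i + 1 + lane] + c[i + lane];
    }
    /* Epilogue: the n % VF leftover iterations (3 when n = 259). */
    for (; i < n; ++i)
        a[i] = b[i + 1] + c[i];
}
```

For n = 259 the main loop covers indices 0 through 255 and the epilogue handles 256 through 258, so both functions produce the same array.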
Vectorization, big idea

 1. Find a loop
    • Use existing LLVM infrastructure

 2. Is it vectorizable?
    • Is it a countable loop?
    • Are there any unvectorizable instructions?
    • Is there a loop-independent dependence?
    • Is there a loop-carried dependence?

 3. If yes, vectorize it
    • Change array type to vector type
    • Type casting
    • Alignment
    • Handle remainder, if any

Find a loop
 Implement “LoopVectorizer” pass as one of
  the transform passes
    Inherit from the LoopPass class, which is invoked for
     every loop.
    PassManager, the parent of the LoopPass
     manager, handles integration with other LLVM
     passes.
    Pass hierarchy: PassManager > LoopPass > LoopVectorize
 Ask PassManager to hand over a loop that
  is in a more canonical form than a natural loop
    LoopSimplify form
    Entry block, exit block, latch block
    Single backedge
    Countable loop which increments by one

Is it vectorizable? (1/3)
 Loop type test
    Innermost loop
    Countable loop
        for(i=0;i<100;++i) -> OK
        for(;*p!=NULL;++p) -> NOK
    Long enough to vectorize
        for(i=0;i<3;++i) -> NOK
        Iteration count should be larger than the vectorization factor.
    Are there any unvectorizable IR instructions?
        Function call -> NOK
        Stack allocation -> NOK
        Operation on a scalar value other than the loop induction variable
          -> NOK
        Stride of pointer/array accesses should be one.
            a[i] -> OK, a[2*i] -> NOK

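The checklist above can be read as a single predicate over facts gathered for each loop. The sketch below assumes a hand-filled summary struct. `loop_summary` and `passes_loop_type_test` are hypothetical names, and a real pass would obtain these facts from LLVM analyses instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-loop summary; in a real pass these facts would come
 * from LLVM analyses rather than a hand-filled struct. */
struct loop_summary {
    bool is_innermost;
    bool is_countable;       /* trip count is known before the loop runs */
    long trip_count;         /* meaningful only when is_countable is true */
    bool has_call;           /* function call in the body */
    bool has_stack_alloc;    /* stack allocation in the body */
    bool has_scalar_op;      /* scalar op other than the induction variable */
    long stride;             /* a[i] has stride 1, a[2*i] has stride 2 */
};

/* Returns true when the loop passes every check listed on the slide. */
bool passes_loop_type_test(const struct loop_summary *l, long vf) {
    assert(l != NULL && vf > 0);
    if (!l->is_innermost || !l->is_countable) return false;
    if (l->trip_count <= vf) return false;   /* not long enough */
    if (l->has_call || l->has_stack_alloc || l->has_scalar_op) return false;
    return l->stride == 1;
}
```

With a vectorization factor of 8, a loop of 100 iterations and stride one passes, while a loop of 3 iterations or an access such as a[2*i] is rejected.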
Is it vectorizable? (2/3)
 Collect array/pointer variables used in LHS and RHS
    Example: a[i] = b[i+1] + c[i];
    LHS = {a[i]}, RHS = {b[i+1], c[i]}

Is it vectorizable? (3/3)
 Data dependence testing between LHS and RHS

     foreach member W in LHS
         foreach member R in LHS U RHS
             if R is an alias of W
               if there is a data dependence between W and R
                 “It is not vectorizable.”
     “Ok, it is vectorizable”

 Dependence testing
    Strides of W and R are one.
    We only check whether W and R will collide WITHIN the vectorization
     factor by subtracting their constant offsets.
        W[i+LC] <-> R[i+RC]
        If |LC-RC| < vectorization factor, there will be a collision.
          -> Not vectorizable

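The pseudocode and the distance test can be sketched in C as follows. Two things here are assumptions rather than statements from the slide: aliasing is modeled by comparing a hypothetical `array_id`, and a distance of zero (the same element in the same iteration) is treated as safe, since otherwise every write would collide with itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* One stride-one access of the form array[i + offset]. */
struct access {
    int array_id;   /* hypothetical: equal ids stand for "R is an alias of W" */
    long offset;    /* the constant LC or RC */
};

/* W[i+lc] against R[i+rc]: collision if they meet within one vector. */
bool may_collide(long lc, long rc, long vf) {
    assert(vf > 0);
    long dist = labs(lc - rc);
    /* Assumption (not on the slide): distance 0 is the same element in
     * the same iteration and stays safe after vectorization. */
    if (dist == 0) return false;
    return dist < vf;
}

/* foreach W in LHS, foreach R in LHS U RHS, as in the pseudocode. */
bool is_vectorizable(const struct access *lhs, size_t n_lhs,
                     const struct access *rhs, size_t n_rhs, long vf) {
    for (size_t w = 0; w < n_lhs; ++w) {
        for (size_t r = 0; r < n_lhs + n_rhs; ++r) {
            const struct access *R = r < n_lhs ? &lhs[r] : &rhs[r - n_lhs];
            if (R->array_id == lhs[w].array_id &&
                may_collide(lhs[w].offset, R->offset, vf))
                return false;   /* "It is not vectorizable." */
        }
    }
    return true;                /* "Ok, it is vectorizable" */
}
```

For a[i] = b[i+1] + c[i] no access aliases the write, so the loop is accepted. For a[i] = a[i-1] + c[i] the distance is 1, which is below a vectorization factor of 8, so it is rejected.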
If so, vectorize it (1/5)
 Idea
  Original loop:
    int a[259], b[259], c[259]

    for(i=0;i<259;++i) {
      a[i] = b[i+1] + c[i];
    }

  Vectorized loop with epilogue:
    int a[259], b[259], c[259]

    for(i=0;i+8<=259; i+=8) {
      a[i:i+8] = b[i+1:i+9] + c[i:i+8];
    }

    if (i!=259) {
      for(;i<259;++i) {
        a[i] = b[i+1] + c[i];
      }
    }

 The original loop body becomes three blocks
    Vectorized loop body
    Epilogue preheader: check if there are remainders
    Epilogue loop: scalar loop for the remainder

If so, vectorize it (2/5)
 Vectorize Loop Body
    1. Insert a bitcast instruction after every getelementptr instruction
    2. Replace uses of each getelementptr with its bitcast
        If the use is a load or store instruction, set its alignment constraint.
    3. Construct the set of instructions which require type casting from
       array/pointer type to vector type
        Maximal use set of getelementptr
        Cast the instructions in the type casting set to vector type
    4. Change the increment of the induction variable to the vectorization factor
    5. Change the destination of the loop exit to the epilogue preheader

 Calculate alignment
    It assumes the base address is 32-byte aligned.
    Only check whether the induction variable breaks that alignment.
        a[0] -> 32-byte aligned
        a[i] -> 32-byte aligned
        a[i+1] -> 4-byte aligned

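The alignment rule above can be sketched as a small function. It assumes, as the slide does, that the array base is 32-byte aligned, and additionally that the induction variable only takes values that keep a[i] on a 32-byte boundary (multiples of 8 for 4-byte elements). The function name and the conservative fallback to the element size are assumptions of this sketch.

```c
#include <assert.h>

/* Alignment in bytes to put on a vector load/store of base[i + offset]. */
unsigned access_alignment(long offset, unsigned elem_size, unsigned base_align) {
    assert(elem_size > 0 && base_align > 0);
    long byte_offset = offset * (long)elem_size;
    /* A constant offset that is a whole multiple of the base alignment
     * keeps the access aligned. */
    if (byte_offset % (long)base_align == 0)
        return base_align;
    /* Otherwise fall back to the element size, which is always safe. */
    return elem_size;
}
```

With 4-byte elements and a 32-byte base alignment, offset 0 (a[i]) gives 32 and offset 1 (a[i+1]) gives 4, matching the `align 32` and `align 4` annotations in the vectorized IR.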
If so, vectorize it (3/5)
 Vectorized Loop Body

  bb1:                           ; preds = %bb1, %bb.nph
   %i.03 = phi i32 [ 0, %bb.nph ], [ %6, %bb1 ] ; <i32> [#uses=5]
   %scevgep = getelementptr [259 x i32]* @a, i32 0, i32 %i.03 ; <i32*> [#uses=1]
   %0 = bitcast i32* %scevgep to <8 x i32>*            ; <<8 x i32>*> [#uses=1]
   %scevgep4 = getelementptr [259 x i32]* @c, i32 0, i32 %i.03 ; <i32*> [#uses=1]
   %1 = bitcast i32* %scevgep4 to <8 x i32>*           ; <<8 x i32>*> [#uses=1]
   %tmp = add i32 %i.03, 1                 ; <i32> [#uses=1]
   %scevgep5 = getelementptr [259 x i32]* @b, i32 0, i32 %tmp ; <i32*> [#uses=1]
   %2 = bitcast i32* %scevgep5 to <8 x i32>*            ; <<8 x i32>*> [#uses=1]
   %3 = load <8 x i32>* %2, align 4             ; <<8 x i32>> [#uses=1]
   %4 = load <8 x i32>* %1, align 32            ; <<8 x i32>> [#uses=1]
   %5 = add nsw <8 x i32> %4, %3                 ; <<8 x i32>> [#uses=1]
   store <8 x i32> %5, <8 x i32>* %0, align 32
   %6 = add i32 %i.03, 8                 ; <i32> [#uses=2]
   %exitcond = icmp eq i32 %6, 256                ; <i1> [#uses=1]
   br i1 %exitcond, label %bb1.preheader, label %bb1


If so, vectorize it (4/5)

 Generate epilogue preheader
    If there are remainders, jump to epilogue loop.

   bb1.preheader:                     ; preds = %bb1
    %7 = shl i32 %i.03, 0               ; <i32> [#uses=2]
    %8 = icmp eq i32 %7, 259                ; <i1> [#uses=1]
    br i1 %8, label %return, label %bb1.epilogue
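
The constants 256 and 259 in the listings follow from the trip count and the vectorization factor. A minimal sketch of that arithmetic, with hypothetical helper names, is given below. Note that in the preheader listing %7 is computed from %i.03 rather than %6, so the epilogue there appears to start a few elements earlier than this split. The sketch shows the split described on the Idea slide.

```c
#include <assert.h>

/* First index not covered by the vector loop:
 * n rounded down to a multiple of vf. */
long vector_loop_end(long n, long vf) {
    assert(n >= 0 && vf > 0);
    return n - n % vf;
}

/* Number of iterations left for the scalar epilogue loop. */
long epilogue_iterations(long n, long vf) {
    return n - vector_loop_end(n, vf);
}
```

For n = 259 and a vectorization factor of 8, the vector loop ends at index 256 and 3 iterations remain. When n is a multiple of the factor, no epilogue iterations are needed and the preheader can branch straight to the return block.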




If so, vectorize it (5/5)
 Generate epilogue loop for remainder
   1. Clone original loop body
   2. Update all uses to refer to the cloned instructions
   3. Update phi of induction variable
   4. Update branch target

   bb1.epilogue:                      ; preds = %bb1.epilogue, %bb1.preheader
    %9 = phi i32 [ %7, %bb1.preheader ], [ %17, %bb1.epilogue ] ; <i32> [#uses=4]
    %10 = getelementptr [259 x i32]* @a, i32 0, i32 %9 ; <i32*> [#uses=1]
    %11 = getelementptr [259 x i32]* @c, i32 0, i32 %9 ; <i32*> [#uses=1]
    %12 = add i32 %9, 1                  ; <i32> [#uses=1]
    %13 = getelementptr [259 x i32]* @b, i32 0, i32 %12 ; <i32*> [#uses=1]
    %14 = load i32* %13, align 4             ; <i32> [#uses=1]
    %15 = load i32* %11, align 4            ; <i32> [#uses=1]
    %16 = add nsw i32 %15, %14                 ; <i32> [#uses=1]
    store i32 %16, i32* %10, align 4
    %17 = add i32 %9, 1                  ; <i32> [#uses=2]
    %18 = icmp eq i32 %17, 259                ; <i1> [#uses=1]
    br i1 %18, label %return, label %bb1.epilogue

Generated Code
 # Vectorized Loop Body
 .LBB1_1:               # %bb1
                     # =>This Inner Loop Header: Depth=1
            movups       b+1088(,%eax,4), %xmm0
            paddd        c+1084(,%eax,4), %xmm0
            movups       b+1072(,%eax,4), %xmm1
            paddd        c+1068(,%eax,4), %xmm1
            movaps       %xmm1, a+1068(,%eax,4)
            movaps       %xmm0, a+1084(,%eax,4)
            addl         $8, %eax
            cmpl         $-11, %eax
            jne          .LBB1_1
 # Epilogue Preheader
 # BB#2:                # %bb1.preheader
            testl        %eax, %eax
            je           .LBB1_5
 # BB#3:                # %bb1.preheader.bb1.epilogue_crit_edge
            movl         $-44, %eax
            .align       16, 0x90
 # Epilogue Loop
 .LBB1_4:               # %bb1.epilogue
                     # =>This Inner Loop Header: Depth=1
            movl         c+1036(%eax), %ecx
            addl         b+1040(%eax), %ecx
            movl         %ecx, a+1036(%eax)
            addl         $4, %eax
            jne          .LBB1_4
 .LBB1_5:               # %return
            ret

Experiment Environment
 CPU
    Intel i5 2.67GHz

 OS
    Ubuntu 10.04

 LLVM
    LLVM 2.7 (released on 04/27/2010)
    LLVM-GCC Front End 4.2

 GCC
    GCC 4.4.3 (Ubuntu 10.04 Canonical version)

Performance Comparison: aligned access
 a[i] = b[i] + c[i];
 [Bar chart: normalized execution time (y-axis, 0 to 1.6) for char, short, int, and float element types. Series: GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect(VF=4), LLVM Vect(VF=8). VF = Vectorization Factor.]

Performance Comparison: unaligned access
 For integer type
 [Bar chart: normalized execution time (y-axis, 0 to 1.6) for GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect(VF=4), and LLVM Vect(VF=8). Series: a[i]=b[i]+c[i], a[i]=b[i+1]+c[i], a[i]=b[i+1]+c[i+1], a[i+1]=b[i+1]+c[i+1]. VF = Vectorization Factor.]

Conclusion
 Implemented a prototype-level LLVM vectorizer with
     Data dependence analysis
     Loop transformation and vectorization
     Alignment testing
     Type conversion
 Used a variety of LLVM infrastructure
     Pass Manager, Loop Pass Manager, LoopSimplify form, alias analysis,
      IndVars, SCEV, etc.
 Its performance is quite promising
     In most cases, it is better than GCC's tree vectorizer.
 But the following are required to extend its coverage
     Dependence testing needs to be extended to support multidimensional arrays
         W[i][j][k+LC] <-> R[i][j][k+RC]
     More sophisticated alignment calculation is required
         It may need to collaborate with code generation.
         Is there an efficient way to calculate alignment in a multidimensional array?
            a[i][j][k]
     Do we need to support loops whose body has more than one basic block?

Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

LLVM Vectorization Prototype Design

  • 2. Project Goal
      • Design and implement a prototype-level autovectorizer in LLVM
          • Understand LLVM and get hands-on experience with it
          • Implement a simple analysis phase in LLVM
          • Implement a simple transform phase in LLVM
  • 3. Vector Support in LLVM
      • LLVM supports vector types and operations on them at the IR level
      • Vector types are lowered to MMX/SSE instructions on the IA-32 architecture

      Source (vector stride = 1):
          a[i] = b[i] + c[i]

      LLVM IR (vector type, vector operation):
          %pb = getelementptr [32 x i32]* @b, i32 0, i32 %i
          %vb = bitcast i32* %pb to <8 x i32>*
          %pc = getelementptr [32 x i32]* @c, i32 0, i32 %i
          %vc = bitcast i32* %pc to <8 x i32>*
          %vb_i = load <8 x i32>* %vb, align 32
          %vc_i = load <8 x i32>* %vc, align 32
          %va_i = add nsw <8 x i32> %vb_i, %vc_i
          %pa = getelementptr [32 x i32]* @a, i32 0, i32 %i
          %va = bitcast i32* %pa to <8 x i32>*
          store <8 x i32> %va_i, <8 x i32>* %va, align 32

      Generated code (SSE code generation):
          movaps b (,%eax,0), %xmm0
          paddd  c (,%eax,0), %xmm0
          movaps %xmm0, a (,%eax,0)
  • 4. Vectorization: what is it?

      Original loop:
          int a[259], b[259], c[259];

          for (i = 0; i < 259; ++i) {
            a[i] = b[i+1] + c[i];
          }

      Vectorized loop (vectorization factor 8) with a scalar loop for the remainder:
          int a[259], b[259], c[259];

          for (i = 0; i + 8 <= 259; i += 8) {
            a[i:i+8] = b[i+1:i+9] + c[i:i+8];
          }

          if (i != 259) {
            for (; i < 259; ++i) {
              a[i] = b[i+1] + c[i];
            }
          }
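The transformation on this slide can be tried out by hand in plain C. The sketch below is only an illustration under stated assumptions: the names `scalar_add` and `strip_mined_add` are invented here, the inner fixed-length loop stands in for one vector load/add/store, and `b` is assumed to hold one extra element so that `b[i+1]` stays in bounds (the slide declares all three arrays with 259 elements).

```c
#include <assert.h>
#include <stddef.h>

#define N  259   /* array length used on the slide */
#define VF 8     /* vectorization factor: elements handled per vector */

/* Original scalar loop. Assumption: b holds N + 1 elements. */
void scalar_add(int *a, const int *b, const int *c)
{
    for (size_t i = 0; i < N; ++i)
        a[i] = b[i + 1] + c[i];
}

/* Hand-written equivalent of the vectorized loop plus its epilogue. */
void strip_mined_add(int *a, const int *b, const int *c)
{
    size_t i = 0;

    /* Main loop: each trip covers VF elements. The inner loop stands in
       for a single vector operation a[i:i+8] = b[i+1:i+9] + c[i:i+8]. */
    for (; i + VF <= N; i += VF)
        for (size_t lane = 0; lane < VF; ++lane)
            a[i + lane] = b[i + lane + 1] + c[i + lane];

    /* Epilogue: executes only when N is not a multiple of VF. */
    for (; i < N; ++i)
        a[i] = b[i + 1] + c[i];
}
```

With N = 259 and VF = 8, the main loop stops at i = 256 and the epilogue handles the last three elements.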
  • 7. Vectorization, big idea
      1. Find a loop
          • Use existing LLVM infrastructure
      2. Is it vectorizable?
          • Is it a countable loop?
          • Are there any unvectorizable instructions?
          • Is there a loop-independent dependence?
          • Is there a loop-carried dependence?
      3. If so, vectorize it
          • Change array type to vector type
          • Type casting
          • Alignment
          • Handle remainder, if any
  • 8. Find a loop
      • Implement the "LoopVectorizer" pass as one of the transform passes
          • Inherit from the LoopPass class, which is invoked for every loop.
          • The PassManager, which is the parent of the loop pass manager, deals with integrating other LLVM passes.
      • Ask the PassManager to hand over a loop in a more canonical form than a natural loop
          • LoopSimplify form
              • Entry block, exit block, latch block
              • Single backedge
          • Countable loop whose induction variable increments by one
      • Pass hierarchy: PassManager → LoopPass → LoopVectorize
  • 9. Is it vectorizable? (1/3)
      • Loop type test
          • Innermost loop
          • Countable loop
              • for(i=0;i<100;++i) → OK
              • for(;*p!=NULL;++p) → NOK
          • Long enough to vectorize
              • for(i=0;i<3;++i) → NOK
              • The iteration count should be larger than the vectorization factor.
      • Are there any unvectorizable IR instructions?
          • Function call → NOK
          • Stack allocation → NOK
          • Operation on a scalar value other than the loop induction variable → NOK
      • Stride of pointer/array accesses should be one.
          • a[i] → OK, a[2*i] → NOK
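The checklist above amounts to a single predicate over facts gathered about a loop. The following C sketch is a hypothetical illustration only: `struct loop_summary` and `passes_loop_tests` are names invented here, and a real pass would obtain these facts from LLVM's loop analyses rather than from a hand-filled struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one loop (not an LLVM data structure). */
struct loop_summary {
    bool innermost;        /* contains no nested loop */
    bool countable;        /* trip count is known before the loop runs */
    unsigned trip_count;   /* meaningful only when countable is true */
    bool has_call;         /* body contains a function call */
    bool has_stack_alloc;  /* body contains a stack allocation */
    bool has_scalar_op;    /* scalar operation other than the induction variable */
    int stride;            /* index stride of the array accesses, 1 for a[i] */
};

/* Returns true when the loop passes every test listed on the slide. */
bool passes_loop_tests(const struct loop_summary *l, unsigned vf)
{
    assert(vf > 0);
    if (!l->innermost || !l->countable)
        return false;
    if (l->trip_count <= vf)          /* must be longer than the vectorization factor */
        return false;
    if (l->has_call || l->has_stack_alloc || l->has_scalar_op)
        return false;
    return l->stride == 1;            /* a[i] is OK, a[2*i] is not */
}
```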
  • 10. Is it vectorizable? (2/3)
      • Collect array/pointer variables used in LHS and RHS
            a[i] = b[i+1] + c[i];
        LHS = {a[i]}, RHS = {b[i+1], c[i]}
  • 11. Is it vectorizable? (3/3)
      • Data dependence testing between LHS and RHS
            foreach member W in LHS
              foreach member R in LHS ∪ RHS
                if R is alias of W
                  if there is data dependence between W and R
                    "It is not vectorizable."
            "OK, it is vectorizable."
      • Dependence testing
          • Strides of W and R are one.
          • We only check whether W and R collide WITHIN the vectorization factor, by subtracting their base coefficients (the constants LC and RC).
              • W[i+LC] ↔ R[i+RC]
              • If |LC-RC| < vectorization factor, there will be a collision. → Not vectorizable
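The pseudocode and the |LC-RC| rule can be sketched in C as follows. This is an illustration under assumptions, not the pass's actual code: `struct access`, `collides`, and `is_vectorizable` are invented names, aliasing is reduced to comparing a hypothetical `base_id`, and a distance of 0 is treated as safe (an assumption made here, since otherwise every write would collide with itself when R ranges over LHS).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* A stride-one access base[i + offset]. */
struct access {
    int base_id;  /* accesses with equal ids are treated as aliases (assumption) */
    int offset;   /* the constant LC or RC */
};

/* True when W[i+LC] and R[i+RC] alias and fall within one vector of vf
   elements. Distance 0 is treated as safe (assumption, see lead-in). */
static bool collides(struct access w, struct access r, int vf)
{
    int distance = abs(w.offset - r.offset);
    return w.base_id == r.base_id && distance != 0 && distance < vf;
}

/* Mirrors the slide's nested loops: every W in LHS against every R in LHS ∪ RHS. */
bool is_vectorizable(const struct access *lhs, size_t n_lhs,
                     const struct access *rhs, size_t n_rhs, int vf)
{
    assert(vf > 0);
    for (size_t w = 0; w < n_lhs; ++w) {
        for (size_t r = 0; r < n_lhs; ++r)
            if (collides(lhs[w], lhs[r], vf))
                return false;
        for (size_t r = 0; r < n_rhs; ++r)
            if (collides(lhs[w], rhs[r], vf))
                return false;
    }
    return true;
}
```

For a[i] = b[i+1] + c[i] no pair aliases, so the loop is accepted; a[i] = a[i+1] + c[i] with a vectorization factor of 8 is rejected because the distance is 1.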
  • 12. If so, vectorize it (1/5)
      • Idea

      Original loop:
          int a[259], b[259], c[259];

          for (i = 0; i < 259; ++i) {            /* loop body */
            a[i] = b[i+1] + c[i];
          }

      Vectorized loop:
          int a[259], b[259], c[259];

          for (i = 0; i + 8 <= 259; i += 8) {    /* vectorized loop body */
            a[i:i+8] = b[i+1:i+9] + c[i:i+8];
          }

          if (i != 259) {                        /* epilogue preheader: check if there are remainders */
            for (; i < 259; ++i) {               /* epilogue loop: loop for the remainder */
              a[i] = b[i+1] + c[i];
            }
          }
  • 13. If so, vectorize it (2/5)
      • Vectorize the loop body
          1. Insert a bitcast instruction after every getelementptr instruction
          2. Replace uses of the getelementptr with the bitcast
              • If the user is a load or store instruction, set its alignment constraint.
          3. Construct the set of instructions that require type casting from array/pointer type to vector type
              • Maximal use set of the getelementptr
              • Cast the instructions in the type-casting set to vector type
          4. Change the increment of the induction variable to the vectorization factor
          5. Change the destination of the loop exit to the epilogue preheader
      • Calculate alignment
          • It assumes the base address is 32-byte aligned.
          • It only checks whether the induction variable expression breaks that alignment.
              • a[0] → 32-byte aligned
              • a[i] → 32-byte aligned
              • a[i+1] → 4-byte aligned
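The alignment rule on this slide can be expressed as a small calculation. The C sketch below is an illustration only: `access_alignment` is an invented name, and it relies on the slide's assumptions that the base address is 32-byte aligned and that `i` itself advances in steps that keep `a[i]` 32-byte aligned (for example 8 elements of 4 bytes).

```c
#include <assert.h>

/* Alignment in bytes that can be claimed for &base[i + offset], given a
   32-byte-aligned base and an index i that preserves 32-byte alignment.
   The result is the largest power of two (at most 32) that divides the
   constant byte offset. */
unsigned access_alignment(int offset, unsigned elem_size)
{
    assert(elem_size > 0);
    unsigned long magnitude = offset < 0 ? (unsigned long)(-(long)offset)
                                         : (unsigned long)offset;
    unsigned long byte_offset = magnitude * elem_size;
    unsigned align = 32;
    while (align > 1 && byte_offset % align != 0)
        align /= 2;
    return align;
}
```

For 4-byte `int` elements this gives 32 for a[i] and 4 for a[i+1], matching the `align 32` and `align 4` annotations on the vector loads and stores.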
  • 14. If so, vectorize it (3/5)
      • Vectorized loop body

      bb1:  ; preds = %bb1, %bb.nph
        %i.03 = phi i32 [ 0, %bb.nph ], [ %6, %bb1 ]  ; <i32> [#uses=5]
        %scevgep = getelementptr [259 x i32]* @a, i32 0, i32 %i.03  ; <i32*> [#uses=1]
        %0 = bitcast i32* %scevgep to <8 x i32>*  ; <<8 x i32>*> [#uses=1]
        %scevgep4 = getelementptr [259 x i32]* @c, i32 0, i32 %i.03  ; <i32*> [#uses=1]
        %1 = bitcast i32* %scevgep4 to <8 x i32>*  ; <<8 x i32>*> [#uses=1]
        %tmp = add i32 %i.03, 1  ; <i32> [#uses=1]
        %scevgep5 = getelementptr [259 x i32]* @b, i32 0, i32 %tmp  ; <i32*> [#uses=1]
        %2 = bitcast i32* %scevgep5 to <8 x i32>*  ; <<8 x i32>*> [#uses=1]
        %3 = load <8 x i32>* %2, align 4  ; <<8 x i32>> [#uses=1]
        %4 = load <8 x i32>* %1, align 32  ; <<8 x i32>> [#uses=1]
        %5 = add nsw <8 x i32> %4, %3  ; <<8 x i32>> [#uses=1]
        store <8 x i32> %5, <8 x i32>* %0, align 32
        %6 = add i32 %i.03, 8  ; <i32> [#uses=2]
        %exitcond = icmp eq i32 %6, 256  ; <i1> [#uses=1]
        br i1 %exitcond, label %bb1.preheader, label %bb
  • 15. If so, vectorize it (4/5)
      • Generate the epilogue preheader
          • If there are remainders, jump to the epilogue loop.

      bb1.preheader:  ; preds = %bb1
        %7 = shl i32 %i.03, 0  ; <i32> [#uses=2]
        %8 = icmp eq i32 %7, 259  ; <i1> [#uses=1]
        br i1 %8, label %return, label %bb1.epilogue
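The constants 256 (vector loop exit) and 259 (preheader check) in the IR follow from splitting the trip count by the vectorization factor. A minimal C sketch of that arithmetic, with the invented names `struct trip_split` and `split_trip_count`, assuming the loop starts at 0 and steps by one:

```c
#include <assert.h>

/* How a trip count n divides between the vector loop and the epilogue. */
struct trip_split {
    unsigned vector_end;  /* first index not covered by the vector loop */
    unsigned remainder;   /* iterations left for the epilogue loop */
};

struct trip_split split_trip_count(unsigned n, unsigned vf)
{
    assert(vf > 0);
    struct trip_split s;
    s.remainder = n % vf;             /* 259 % 8 == 3 */
    s.vector_end = n - s.remainder;   /* 259 - 3 == 256 */
    return s;
}
```

The epilogue loop is needed exactly when `remainder` is nonzero, which is what the preheader's comparison against 259 decides at run time.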
  • 16. If so, vectorize it (5/5)
      • Generate the epilogue loop for the remainder
          1. Clone the original loop body
          2. Update all uses to refer to the cloned values
          3. Update the phi of the induction variable
          4. Update the branch target

      bb1.epilogue:  ; preds = %bb1.epilogue, %bb1.preheader
        %9 = phi i32 [ %7, %bb1.preheader ], [ %17, %bb1.epilogue ]  ; <i32> [#uses=4]
        %10 = getelementptr [259 x i32]* @a, i32 0, i32 %9  ; <i32*> [#uses=1]
        %11 = getelementptr [259 x i32]* @c, i32 0, i32 %9  ; <i32*> [#uses=1]
        %12 = add i32 %9, 1  ; <i32> [#uses=1]
        %13 = getelementptr [259 x i32]* @b, i32 0, i32 %12  ; <i32*> [#uses=1]
        %14 = load i32* %13, align 4  ; <i32> [#uses=1]
        %15 = load i32* %11, align 4  ; <i32> [#uses=1]
        %16 = add nsw i32 %15, %14  ; <i32> [#uses=1]
        store i32 %16, i32* %10, align 4
        %17 = add i32 %9, 1  ; <i32> [#uses=2]
        %18 = icmp eq i32 %17, 259  ; <i1> [#uses=1]
        br i1 %18, label %return, label %bb1.epilogue
  • 17. Generated Code

      # Vectorized Loop Body
      .LBB1_1:                # %bb1
                              # =>This Inner Loop Header: Depth=1
          movups  b+1088(,%eax,4), %xmm0
          paddd   c+1084(,%eax,4), %xmm0
          movups  b+1072(,%eax,4), %xmm1
          paddd   c+1068(,%eax,4), %xmm1
          movaps  %xmm1, a+1068(,%eax,4)
          movaps  %xmm0, a+1084(,%eax,4)
          addl    $8, %eax
          cmpl    $-11, %eax
          jne     .LBB1_1

      # Epilogue Preheader
      # BB#2:                 # %bb1.preheader
          testl   %eax, %eax
          je      .LBB1_5
      # BB#3:                 # %bb1.preheader.bb1.epilogue_crit_edge
          movl    $-44, %eax
          .align  16, 0x90

      # Epilogue Loop
      .LBB1_4:                # %bb1.epilogue
                              # =>This Inner Loop Header: Depth=1
          movl    c+1036(%eax), %ecx
          addl    b+1040(%eax), %ecx
          movl    %ecx, a+1036(%eax)
          addl    $4, %eax
          jne     .LBB1_4
      .LBB1_5:                # %return
          ret
  • 18. Experiment Environment
      • CPU
          • Intel i5 2.67GHz
      • OS
          • Ubuntu 10.04
      • LLVM
          • LLVM 2.7 (released 04/27/2010)
          • LLVM-GCC front end 4.2
      • GCC
          • GCC 4.4.3 (Ubuntu 10.04 Canonical version)
  • 19. Performance Comparison: aligned access
      • a[i] = b[i] + c[i];
      • [Figure: bar chart of normalized execution time by element type (char, short, int, float). Series: GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect (VF=4), LLVM Vect (VF=8). VF = vectorization factor.]
  • 20. Performance Comparison: unaligned access
      • For integer type
      • [Figure: bar chart of normalized execution time by compiler configuration (GCC Vect, GCC No-Vect, LLVM No-Vect, LLVM Vect (VF=4), LLVM Vect (VF=8)). Series: a[i]=b[i]+c[i], a[i]=b[i+1]+c[i], a[i]=b[i+1]+c[i+1], a[i+1]=b[i+1]+c[i+1]. VF = vectorization factor.]
  • 21. Conclusion
      • Implemented a prototype-level LLVM vectorizer with
          • Data dependence analysis
          • Loop transformation and vectorization
          • Alignment testing
          • Type conversion
      • It uses a variety of LLVM infrastructure
          • PassManager, loop pass manager, LoopSimplify form, alias analysis, IndVars, SCEV, etc.
      • Its performance is quite promising
          • In most cases, it is better than GCC's tree vectorizer.
      • But the following are required to extend its coverage
          • Dependence testing needs to be extended to support multidimensional arrays
              • W[i][j][k+LC] ↔ R[i][j][k+RC]
          • A more sophisticated alignment calculation is required
              • It may need to collaborate with code generation.
              • Is there an efficient way to calculate alignment in a multidimensional array?
                  • a[i][j][k]
          • Do we need to support loops whose body has more than one basic block?