EXPERIENCE WITH PARTIAL
SIMDIZATION IN OPEN64 COMPILER
USING DYNAMIC PROGRAMMING

Dibyendu Das, Soham Sundar Chakraborty, Michael Lai
AMD

{dibyendu.das, soham.chakraborty, michael.lai}@amd.com
Open64 Workshop, June 15th, 2012, Beijing
AGENDA

• Introduction to Partial Simdization
• Overview of Barik-Zhao-Sarkar Dynamic Programming-based algorithm
• Motivating Example – hot loop in gromacs
• Our Approach
• Initial Results
• Conclusion and Future Work




2 | Experience with Partial Simdization in Open64 Compiler Using Dynamic Programming | June 15th, 2012 | Open64 Workshop 2012 | Public
PARTIAL SIMDIZATION

• Effectively utilize and exploit the SIMD (vector) ISA of modern x86 microprocessors
• Traditional loop-based vectorization adopts an all-or-nothing approach to SIMDizing loops, frequently
  employing fission or other techniques to separate the non-vectorizable part from the vectorizable part
• Partial simdization packs isomorphic instructions into SIMD instructions where possible
   – Most of the opportunities are still loop-based
   – Not all instructions need to be packed
• SLP (superword-level parallelism) [Larsen et al.] is one of the best-known partial simdization
  techniques
• A simple example of partial simdization:




PARTIAL SIMDIZATION - EXAMPLE
   Original scalar loop:

      for (it = 0; it < n; it++) {
        i = a[it] + b[it];
        j = c[it] + d[it];
        k = e[it] + f[it];
        l = g[it] + h[it];
        m = i + j;
        n = k + l;
        res[it] = m + n;
      }

   Unrolled by 2:

      for (it = 0; it < n; it=it+2) {
        i1 = a[it] + b[it];
        j1 = c[it] + d[it];
        k1 = e[it] + f[it];
        l1 = g[it] + h[it];
        m1 = i1 + j1;
        n1 = k1 + l1;
        res[it] = m1 + n1;
        i2 = a[it+1] + b[it+1];
        j2 = c[it+1] + d[it+1];
        k2 = e[it+1] + f[it+1];
        l2 = g[it+1] + h[it+1];
        m2 = i2 + j2;
        n2 = k2 + l2;
        res[it+1] = m2 + n2;
      }

   Loop-vectorized after unrolling (2-wide vectors across iterations):

      for (it = 0; it < n; it=it+2) {
        va = <a[it], a[it+1]>
        vb = <b[it], b[it+1]>
        vc = <c[it], c[it+1]>
        vd = <d[it], d[it+1]>
        ve = <e[it], e[it+1]>
        vf = <f[it], f[it+1]>
        vg = <g[it], g[it+1]>
        vh = <h[it], h[it+1]>
        <i1,i2> = va+vb
        <j1,j2> = vc+vd
        <k1,k2> = ve+vf
        <l1,l2> = vg+vh
        <m1,m2> = <i1,i2>+<j1,j2>
        <n1,n2> = <k1,k2>+<l1,l2>
        <res[it], res[it+1]> = <m1,m2> + <n1,n2>
      }

   Partially simdized (packing within a single iteration):

      for (it = 0; it < n; it++) {
        // simd
        vt1 = pack< a[it], e[it] >
        vt2 = pack< b[it], f[it] >
        vt3 = pack< c[it], g[it] >
        vt4 = pack< d[it], h[it] >
        vt1 = vt1 + vt2;   // <i,k>
        vt3 = vt3 + vt4;   // <j,l>
        vt1 = vt1 + vt3;   // <m,n>
        // scalar
        m,n = unpack vt1
        res[it] = m+n;
      }

   Example courtesy: “SIMD Defragmenter: Efficient ILP
   Realization on Data-parallel Architectures”,
   Y. Park et al., ASPLOS 2012
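
   To make the equivalence concrete, here is a minimal Python emulation of the scalar loop and the partially simdized loop. It is only a sketch: 2-wide vectors are modelled as tuples, and pack, unpack and vadd are hypothetical helpers standing in for real SIMD instructions.

```python
# Emulation only: a 2-wide "vector" is a plain tuple.
# pack/unpack/vadd are hypothetical helpers, not real SIMD intrinsics.

def pack(x, y):
    return (x, y)

def unpack(v):
    return v[0], v[1]

def vadd(u, v):
    # Lane-wise add of two 2-wide vectors.
    return (u[0] + v[0], u[1] + v[1])

def scalar_loop(a, b, c, d, e, f, g, h):
    res = []
    for it in range(len(a)):
        i = a[it] + b[it]
        j = c[it] + d[it]
        k = e[it] + f[it]
        l = g[it] + h[it]
        res.append((i + j) + (k + l))
    return res

def partial_simd_loop(a, b, c, d, e, f, g, h):
    res = []
    for it in range(len(a)):
        # SIMD part: three vector adds replace six scalar adds.
        vt1 = vadd(pack(a[it], e[it]), pack(b[it], f[it]))  # <i, k>
        vt3 = vadd(pack(c[it], g[it]), pack(d[it], h[it]))  # <j, l>
        vt1 = vadd(vt1, vt3)                                # <m, n>
        # Scalar part: the final add is not packed.
        m, n = unpack(vt1)
        res.append(m + n)
    return res

# Eight small integer arrays a..h (arbitrary test data).
arrays = [[x * 10 + it for it in range(4)] for x in range(1, 9)]
same = scalar_loop(*arrays) == partial_simd_loop(*arrays)
```

   Only the last add stays scalar, which is why the loop counts as partially rather than fully simdized.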
OVERVIEW OF BARIK-ZHAO-SARKAR (BZS) ALGORITHM

• Dynamic programming based approach
   – A Data Dependence Graph is built for each basic block
   – k-tiles are built; each tile consists of k isomorphic, independent operations
   – Two sets of costs are built
        ▪ Scalar cost of each node n
        ▪ Vector/SIMD cost of k-tiles
   – [Pass 1] The dependence graph is traversed from source to sink, building scalar and vector costs using dynamic
     programming rules
   – [Pass 2] Best costs are determined, deciding whether each k-tile should be SIMDized or not
        ▪ Use a scalar cost vs. vector cost comparison for this purpose
   – [Pass 3] Generate SIMD + scalar code




BARIK-ZHAO-SARKAR (BZS) ALGORITHM - ILLUSTRATED
• Various combinations of packing (assume 2-tiles) are possible
   – <ld a[it], ld b[it]> OR <ld a[it], ld c[it]> OR <ld a[it], ld d[it]> OR <ld a[it], ld e[it]> OR <ld a[it], ld f[it]> …
   – Which is better? Depends on whether <+1,+2> OR <+1,+3> OR <+1,+4> is better, and so on
   – vcost(<+1,+2>) = vcost(<ld a[it], ld c[it]>) + vcost(<ld b[it], ld d[it]>) OR
   – vcost(<+1,+2>) = vcost(<ld a[it], ld b[it]>) + vcost(<ld c[it], ld d[it]>) + unpcost(a,b) + unpcost(c,d) +
     pcost(a,c) + pcost(b,d)    (Dynamic Programming)
In actuality,
   Total Scalar Cost = 20
   Total Vector Cost = 21
   DON’T SIMDIZE

[Figure: dependence DAG of the example loop body. The loads ld a[it] … ld h[it] feed four adds (+1 to +4) whose results are stored as st i, st j, st k, st l; two further adds combine those results, and a final add feeds st r[it].]
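
The two alternatives for vcost(<+1,+2>) and the final scalar-versus-vector decision can be sketched as follows. This is an illustration only: the per-tile load costs and the pack/unpack costs are made-up placeholders, not values from the BZS paper or from the slide, and the function names are hypothetical.

```python
# All cost values below are assumed placeholders for illustration.
UNPACK_COST = 1   # assumed cost of unpcost(x, y)
PACK_COST = 1     # assumed cost of pcost(x, y)

def vcost_add_tile(load_vcost, unp=UNPACK_COST, p=PACK_COST):
    """Return (cost, option) for the tile <+1, +2>.

    load_vcost maps a pair of array names to the assumed vector cost
    of loading that pair as one 2-tile.
    """
    # Option 1: the operand tiles <a,c> and <b,d> are loaded directly
    # in the lane order the two adds need.
    direct = load_vcost[("a", "c")] + load_vcost[("b", "d")]
    # Option 2: load <a,b> and <c,d>, unpack both, then repack
    # as <a,c> and <b,d>.
    repacked = (load_vcost[("a", "b")] + load_vcost[("c", "d")]
                + 2 * unp + 2 * p)
    return min((direct, "direct"), (repacked, "repack"))

def should_simdize(total_scalar_cost, total_vector_cost):
    # SIMDize only when the vector version is strictly cheaper.
    return total_vector_cost < total_scalar_cost

uniform = {("a", "c"): 4, ("b", "d"): 4, ("a", "b"): 4, ("c", "d"): 4}
cheap_ab = {("a", "c"): 4, ("b", "d"): 4, ("a", "b"): 1, ("c", "d"): 1}

best_uniform = vcost_add_tile(uniform)     # (8, "direct")
best_cheap_ab = vcost_add_tile(cheap_ab)   # (6, "repack")
slide_decision = should_simdize(20, 21)    # False, as on the slide
```

Dynamic programming keeps the cheaper option per tile, so the cost of a tile higher in the graph is built from the best costs already computed for its operand tiles.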

A MOTIVATING EXAMPLE – HOT LOOP IN GROMACS




OUR APPROACH

• Phase I
  – Initialization (Preparatory phase)
        ▪ Build the Data Dependence Graph (still basic-block oriented), only for INNER loops,
          preferably HOT
        ▪ Topological sort and create levels (provision to do ALAP or ASAP)
           – Differs from the BZS approach, which is ready-list based
           – We statically build all the possible tiles, to control compile cost and for ease of design

[Figure: WHIRL dependence DAG for part of the gromacs loop. Node labels, from sources to sink: LDID ix1 / iy1 / iz3 and ILOAD pos[j3] / pos[j3+1] / pos[j3+8]; STID jx1 / jy1 / jz3; LDID jx1 / jy1 / jz3; SUB (x3); STID dx11 / dy11 / dz33; LDID dx11, dy11, dz33 (each twice); MPY (x3); ADD (x2); STID rsq11.]
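
The level creation step can be illustrated with a small sketch. It assumes the DAG is given as a node list plus (producer, consumer) edges; asap_levels and alap_levels are hypothetical names, and the tiny DAG below (including the node ld_c) is invented for the example rather than taken from gromacs.

```python
from collections import defaultdict

def asap_levels(nodes, edges):
    """ASAP: level = length of the longest path from any source."""
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    level = {}

    def visit(n):
        if n not in level:
            level[n] = 1 + max((visit(p) for p in preds[n]), default=-1)
        return level[n]

    for n in nodes:
        visit(n)
    return level

def alap_levels(nodes, edges):
    """ALAP: push every node as close to the sink as dependences allow."""
    succs = defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
    depth = {}  # longest path from the node to any sink

    def visit(n):
        if n not in depth:
            depth[n] = 1 + max((visit(s) for s in succs[n]), default=-1)
        return depth[n]

    for n in nodes:
        visit(n)
    max_level = max(depth.values())
    return {n: max_level - depth[n] for n in nodes}

# Invented DAG: (ld_ix - ld_jx) squared, plus a late operand ld_c.
nodes = ["ld_ix", "ld_jx", "sub", "mpy", "ld_c", "add"]
edges = [("ld_ix", "sub"), ("ld_jx", "sub"), ("sub", "mpy"),
         ("mpy", "add"), ("ld_c", "add")]

asap = asap_levels(nodes, edges)
alap = alap_levels(nodes, edges)
```

Here ld_c sits at level 0 under ASAP but at level 2 under ALAP, so the choice of schedule changes which nodes share a level and can therefore be considered for the same tile.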




OUR APPROACH
• Phase II
   – Tiling
        ▪ Pack isomorphic operations as k-tiles
        ▪ In BZS, k = SIMD width / scalar op width, e.g. for SIMD width = 128 bits and scalar = int (32 bits), k = 4
        ▪ Not the best mechanism to pick k
        ▪ Iterative process, k = 2, 3, 4 …
        ▪ Traverse through every level in topological order
            – Pick k nodes with isomorphic ops, create k-tile <op1, op2, …, opk>
            – Associate a dummy cost with the tile
            – Not all isomorphic nodes at the same level are checked for packing; that is too computationally intensive
            – Use heuristics, e.g. nodes with a common parent/grandparent are not packed, OR pack nodes which appear as the left
              child, right child etc.
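
A minimal sketch of per-level tile enumeration follows. Several things here are assumptions made for illustration: isomorphism is approximated by equal opcode only, the pruning heuristic (reject nodes that feed a common node) is one possible reading of the heuristics listed above, and all function and node names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def bzs_tile_size(simd_bits, scalar_bits):
    """BZS-style fixed k: SIMD width divided by scalar operand width."""
    return simd_bits // scalar_bits

def k_tiles_for_level(level_nodes, opcode, k, allow=lambda tile: True):
    """Enumerate candidate k-tiles among same-opcode nodes of one level."""
    by_op = defaultdict(list)
    for n in level_nodes:
        by_op[opcode[n]].append(n)
    tiles = []
    for op in sorted(by_op):
        for tile in combinations(by_op[op], k):
            if allow(tile):          # heuristic pruning hook
                tiles.append(tile)
    return tiles

def no_shared_consumer(consumers):
    """Assumed heuristic: reject tiles whose nodes feed a common node."""
    def allow(tile):
        return all(not (consumers.get(a, set()) & consumers.get(b, set()))
                   for a, b in combinations(tile, 2))
    return allow

# Invented level: three SUB nodes and one MPY node.
level = ["sub_x", "sub_y", "sub_z", "mpy_w"]
opcode = {"sub_x": "SUB", "sub_y": "SUB", "sub_z": "SUB", "mpy_w": "MPY"}
consumers = {"sub_x": {"add_1"}, "sub_y": {"add_1"}, "sub_z": {"add_2"}}

all_pairs = k_tiles_for_level(level, opcode, 2)
triples = k_tiles_for_level(level, opcode, 3)
pruned = k_tiles_for_level(level, opcode, 2, no_shared_consumer(consumers))
```

Without pruning, n isomorphic nodes yield C(n, k) candidate tiles per level, which is why explicit tile creation is expensive and heuristics are needed. Iterating k = 2, 3, 4 … simply means calling the enumeration once per k.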




OUR APPROACH
• Phase III
    – Dynamic Programming
         ▪ Traverse from Level = 0 (source) to Level = maxLevel (sink)
             – Build all the scalar costs, scost(n), for every node n
             – Build the simd costs of the k-tiles based on dynamic programming rules
             – For each tile choose the best vector cost option at any point, store the configuration
    – Pick Best Costs/Configuration
         ▪ Traverse from Level = maxLevel (sink) to Level = 0
             – Pick k-tiles and choose those where vcost(tile) < Σ scost(node that is part of the tile)
             – The non-chosen tiles are assumed to be scalar operations
             – The chosen tiles are the SIMD ops

Cost rules (from the slide's diagrams):
    – Scalar node n with scalar children:
          scost(n) = scost(lchild(n)) + scost(rchild(n)) + op(n)
    – Scalar node n whose children lchild(n) and rchild(n) belong to tiles t1 and t2:
          scost(n) = vcost(t1) + vcost(t2) + unpcost(t1) + unpcost(t2) + op(n)
    – Tile t = <o1, o2, …, ok> with operand tiles t1 = <o11, o12, …, o1k> and t2 = <o21, o22, …, o2k>:
          vcost(t) = vcost(t1) + vcost(t2) + op(t)
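
The two traversals can be sketched on the earlier eight-load example. This is a simplified illustration under stated assumptions, not the Open64 implementation: all cost values are invented, candidate tiles are assumed disjoint, a leaf tile is costed as scalar loads plus one pack, and the scalar rule with unpack costs is left out. The names scalar_costs, vector_costs and pick_tiles are hypothetical.

```python
SCALAR_OP = {"ld": 1, "add": 2}   # assumed scalar op costs
VECTOR_OP = {"add": 2}            # assumed cost of one vector op per tile
PACK = 1                          # assumed cost of packing scalars into a vector

def scalar_costs(dag, order):
    """scost(n) = op(n) + sum of scost over operands; order is source to sink."""
    scost = {}
    for n in order:
        op, operands = dag[n]
        scost[n] = SCALAR_OP[op] + sum(scost[c] for c in operands)
    return scost

def vector_costs(dag, tiles_by_level, scost):
    """Source-to-sink pass; config[t] records operand tiles kept as vectors."""
    vcost, config = {}, {}
    for level in tiles_by_level:
        for t in level:
            op, operands = dag[t[0]]
            if not operands:      # leaf tile: scalar loads, then one pack
                vcost[t], config[t] = sum(scost[n] for n in t) + PACK, []
                continue
            total, used = VECTOR_OP[op], []
            for j in range(len(operands)):
                operand_tile = tuple(dag[n][1][j] for n in t)
                packed = sum(scost[c] for c in operand_tile) + PACK
                if vcost.get(operand_tile, float("inf")) <= packed:
                    total += vcost[operand_tile]   # reuse the vector operand
                    used.append(operand_tile)
                else:
                    total += packed                # compute scalar, then pack
            vcost[t], config[t] = total, used
    return vcost, config

def pick_tiles(tiles_by_level, scost, vcost, config):
    """Sink-to-source pass: keep profitable tiles and their stored operands."""
    chosen = set()
    for level in reversed(tiles_by_level):
        for t in level:
            if t in chosen or vcost[t] < sum(scost[n] for n in t):
                chosen.add(t)
                chosen.update(config[t])
    return chosen

dag = {x: ("ld", []) for x in "abcdefgh"}
dag.update({"i": ("add", ["a", "b"]), "j": ("add", ["c", "d"]),
            "k": ("add", ["e", "f"]), "l": ("add", ["g", "h"]),
            "m": ("add", ["i", "j"]), "n": ("add", ["k", "l"]),
            "r": ("add", ["m", "n"])})
order = list("abcdefgh") + list("ijkl") + ["m", "n", "r"]
tiles_by_level = [[("a", "e"), ("b", "f"), ("c", "g"), ("d", "h")],
                  [("i", "k"), ("j", "l")],
                  [("m", "n")]]

scost = scalar_costs(dag, order)
vcost, config = vector_costs(dag, tiles_by_level, scost)
chosen = pick_tiles(tiles_by_level, scost, vcost, config)
```

With these invented costs, the tile <i,k> is not profitable on its own (8 versus 8), but it is still selected because the profitable tile <m,n> recorded it in its configuration. That is the reason the configuration is stored during the first traversal.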
OUR APPROACH
• Phase IV
  – Code Generation
        ▪ Generate WHIRL vector types for SIMD tiles
        ▪ Generate unpacking/packing instructions at the boundary of non-SIMD to SIMD and SIMD to non-SIMD
          code generation
           – Use SIMD intrinsics for packing/unpacking
        ▪ Case 1 – General Case
        ▪ Case 2 – Handling LDID/ILOAD tiles (also packing)
        ▪ Case 3 – Handling STID/ISTORE tiles (also unpacking)
        ▪ Arbitrary packing/unpacking not allowed

[Figures: diagrams for Case 1, Case 2 and Case 3 (not recoverable)]
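
Where packing and unpacking are needed can be sketched as a scan over the dependence edges. This is an assumption-laden illustration: boundary_ops is a hypothetical helper, the DAG is invented, and the real phase works on WHIRL nodes and restricts which packings are allowed.

```python
def boundary_ops(edges, chosen_tiles):
    """Find edges that cross between scalar code and SIMD tiles.

    edges are (producer, consumer) pairs. A scalar producer feeding a
    SIMD consumer needs a pack; a SIMD producer feeding a scalar
    consumer needs an unpack.
    """
    in_simd = {n for tile in chosen_tiles for n in tile}
    packs, unpacks = set(), set()
    for producer, consumer in edges:
        if producer not in in_simd and consumer in in_simd:
            packs.add((producer, consumer))
        elif producer in in_simd and consumer not in in_simd:
            unpacks.add((producer, consumer))
    return sorted(packs), sorted(unpacks)

# Invented example: loads a, e stay scalar; <i,k> and <m,n> are SIMD tiles;
# the final node r is scalar.
edges = [("a", "i"), ("e", "k"), ("i", "m"), ("k", "n"),
         ("m", "r"), ("n", "r")]
packs, unpacks = boundary_ops(edges, [("i", "k"), ("m", "n")])
```

Edges between two SIMD tiles, such as i to m, need no conversion, which is what makes longer chains of tiles profitable.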




APPLYSIMD ALGORITHM

Phase-I
 - Create the dependence DAG among operations from the data dependence graph.
 - Create topological levels for the DAG nodes.

Phase-II
 - At each level
   * for each operator having n DAG nodes
        > create tile combinations, where k is the tile size.

Phase-III
 - Compute scalar and vector costs for all tiles as per the cost model, following a dynamic programming
   approach.
 - Choose the best set of tiles with minimum cost for maximal benefit.

Phase-IV
 - Generate the simdized WHIRL code, from which the code generator generates simdized code in a
   subsequent compiler phase.

Input: WN - the WHIRL Tree of the loop
Output: Simdized WHIRL Tree

Procedure ApplySimd() {
    // Phase-I
    /* Initialization: create the DAG from the WHIRL using data flow analysis. */
    _dag = CreateDAG();
    /* level creation in DAG: topological ordering of the DAG nodes. */
    _setOfLevels = CreateLevelsForDAG();

    // Phase-II
    /* possible vector tiles are created for each level. */
    _setOfTiles = CreateVectorKTilesForLevels();

    // Phase-III
    /* scalar and vector costs are computed according to
       the defined cost function using dynamic programming */
    _setOfTiles< scost, vcost > = FindKTileSimpleBestCost();
    /* choose the best set of tiles for maximal benefits */
    _chosenSetOfTiles< scost, vcost > = FindBestVectorScalarKtiles();

    // Phase-IV
    /* generate the simdized code by transforming the WHIRL */
    _simdizedWHIRL = GenerateVectorScalarCodeKTile();
}


OUR APPROACH – MAJOR DIFFERENCES WITH BZS

• Level-based DAG approach to create tiles
     – More restricted than BZS, which relies on a 'ready list', but helps control compile time
• Implemented in the LNO
     – A higher-level implementation lets us reason about the simdization itself rather than bother with
       low-level details
     – Easy vector type support in WHIRL, which can be lowered
• Iterative 'k'-tiling as opposed to a rigid selection of 'k'
     – Helps with the natural structure of the loop



INITIAL RESULTS & EXPERIMENTS - GROMACS(INL1130)

• Inl1130's innermost loop is the hottest, at 70%+
• Loop body demonstrates 3-way parallelism, ideally suited
  for 3-tiling
• Several complexities beyond the implementation outlined
    – Simdizing reductions
    – Supertiling (identify some special patterns for better
      code generation)
    – LiveIn, LiveOut code generation

   #Nodes    #Levels    #Nodes Simdized    #Simdized tiles    %Simdized
   763       43         678                226                88.8
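
As a quick consistency check on the table, the reported figures can be recomputed directly. The variable names are only for this sketch.

```python
# Figures copied from the results table.
nodes = 763
simdized_nodes = 678
simdized_tiles = 226

# Average number of nodes packed into each SIMD tile.
nodes_per_tile = simdized_nodes / simdized_tiles

# Share of DAG nodes that ended up inside a SIMD tile, in percent.
pct_simdized = 100 * simdized_nodes / nodes
```

The ratio is exactly 3 nodes per tile, which matches the 3-tiling noted for this loop, and the percentage comes to just under 88.9, which the table lists as 88.8.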

INITIAL RESULTS & EXPERIMENTS

• Implemented in the Open64 4.5.1 AMD release
• Enabled under a special flag -LNO:simd=3 at -O3
  and above
• ApplySimd is invoked in LNO, after regular
  vectorization fails
• LNO/simd.cxx – code home
• Compile-time intensive due to explicit tile creation
• Our main targets are gromacs (inl1130) (and namd,
  WIP)
• Not available in open64.net yet
• Targets AVX, though 256-bit registers are not utilized yet

[Chart: %speedup, y-axis from 0 to 30; bar values not recoverable]




CONCLUSION & FUTURE WORK

• Illustrates the efficacy of the dynamic programming approach
• First top-of-the-line compiler to effectively simdize gromacs with such huge gains
    – Some efforts do simdize gromacs but use old legacy compilers like SUIF
• Can be extended to CFGs
• Iterative "k"-tiling factor decided based on the percentage simdized
• The underlying graph problem is identifying subgraph isomorphism (hard). Refer to the SIMD Defragmenter paper
  by Mahlke et al. (see References)
    – Should we look at disjoint clusters before tiling?
• Still a lot of work
    – Challenges in reducing compile-time overhead
    – Code generation too tied to the ISA (virtual vectors may help, see Zaks et al.)
    – Scatter/gather operations may make things attractive (Haswell, future AMD processors)




REFERENCES

• Efficient Selection of Vector Instructions using Dynamic Programming. Rajkishore Barik, Jisheng
  Zhao, Vivek Sarkar (Rice University), MICRO 2010.
• Efficient SIMD Code Generation for Irregular Kernels. Seonggun Kim and Hwansoo Han
  (Sungkyunkwan University), PPoPP 2012.
• A Compiler Framework for Extracting Superword Level Parallelism. Jun Liu, Yuanrui Zhang, Ohyoung
  Jang, Wei Ding, Mahmut Kandemir (The Pennsylvania State University), PLDI 2012.
• SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures. Yongjun Park (University
  of Michigan, Ann Arbor), Sangwon Seo (University of Michigan, Ann Arbor), Hyunchul Park (Intel), Hyoun
  Kyu Cho (University of Michigan, Ann Arbor) and Scott Mahlke (University of Michigan, Ann
  Arbor), ASPLOS 2012.




Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions.
Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.
©2011 Advanced Micro Devices, Inc. All rights reserved.




18 | Experience with Partial Simdization in Open64 Compiler Using Dynamic Programming | June 15th, 2012 | Open64 Workshop 2012 | Public

More Related Content

What's hot

NCCU CPDA Lecture 12 Attribute Based Encryption
NCCU CPDA Lecture 12 Attribute Based EncryptionNCCU CPDA Lecture 12 Attribute Based Encryption
NCCU CPDA Lecture 12 Attribute Based Encryption
National Chengchi University
 
The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)
theijes
 
Positive and negative solutions of a boundary value problem for a fractional ...
Positive and negative solutions of a boundary value problem for a fractional ...Positive and negative solutions of a boundary value problem for a fractional ...
Positive and negative solutions of a boundary value problem for a fractional ...
journal ijrtem
 
Appendix of downlink coverage probability in heterogeneous cellular networks ...
Appendix of downlink coverage probability in heterogeneous cellular networks ...Appendix of downlink coverage probability in heterogeneous cellular networks ...
Appendix of downlink coverage probability in heterogeneous cellular networks ...
Cora Li
 
Appendix of heterogeneous cellular network user distribution model
Appendix of heterogeneous cellular network user distribution modelAppendix of heterogeneous cellular network user distribution model
Appendix of heterogeneous cellular network user distribution model
Cora Li
 
Active Attacks on DH Key Exchange
Active Attacks on DH Key ExchangeActive Attacks on DH Key Exchange
Active Attacks on DH Key Exchange
Dharmalingam Ganesan
 
LSGAN - SIMPle(Simple Idea Meaningful Performance Level up)
LSGAN - SIMPle(Simple Idea Meaningful Performance Level up)LSGAN - SIMPle(Simple Idea Meaningful Performance Level up)
LSGAN - SIMPle(Simple Idea Meaningful Performance Level up)
Hansol Kang
 
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
Alex Pruden
 
ILP Based Approach for Input Vector Controlled (IVC) Toggle Maximization in C...
ILP Based Approach for Input Vector Controlled (IVC) Toggle Maximization in C...ILP Based Approach for Input Vector Controlled (IVC) Toggle Maximization in C...
ILP Based Approach for Input Vector Controlled (IVC) Toggle Maximization in C...
Deepak Malani
 
A verifiable random function with short proofs and keys
A verifiable random function with short proofs and keysA verifiable random function with short proofs and keys
A verifiable random function with short proofs and keys
Aleksandr Yampolskiy
 
A Note on Correlated Topic Models
A Note on Correlated Topic ModelsA Note on Correlated Topic Models
A Note on Correlated Topic Models
Tomonari Masada
 
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Hansol Kang
 
IRJET- On Certain Subclasses of Univalent Functions: An Application
IRJET- On Certain Subclasses of Univalent Functions: An ApplicationIRJET- On Certain Subclasses of Univalent Functions: An Application
IRJET- On Certain Subclasses of Univalent Functions: An Application
IRJET Journal
 
Skiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracingSkiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracing
zukun
 
ZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their ApplicationsZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their Applications
Alex Pruden
 
ICME 2013
ICME 2013ICME 2013
ICME 2013
nlab_utokyo
 
On the Secrecy of RSA Private Keys
On the Secrecy of RSA Private KeysOn the Secrecy of RSA Private Keys
On the Secrecy of RSA Private Keys
Dharmalingam Ganesan
 
CppConcurrencyInAction - Chapter07
CppConcurrencyInAction - Chapter07CppConcurrencyInAction - Chapter07
CppConcurrencyInAction - Chapter07
DooSeon Choi
 
Microsoft Word Hw#1
Microsoft Word   Hw#1Microsoft Word   Hw#1
Microsoft Word Hw#1
kkkseld
 
On the Zeros of Complex Polynomials
On the Zeros of Complex PolynomialsOn the Zeros of Complex Polynomials

What's hot (20)

NCCU CPDA Lecture 12 Attribute Based Encryption
NCCU CPDA Lecture 12 Attribute Based EncryptionNCCU CPDA Lecture 12 Attribute Based Encryption
NCCU CPDA Lecture 12 Attribute Based Encryption
 
The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)
 
Positive and negative solutions of a boundary value problem for a fractional ...
Positive and negative solutions of a boundary value problem for a fractional ...Positive and negative solutions of a boundary value problem for a fractional ...
Positive and negative solutions of a boundary value problem for a fractional ...
 
Appendix of downlink coverage probability in heterogeneous cellular networks ...
Appendix of downlink coverage probability in heterogeneous cellular networks ...Appendix of downlink coverage probability in heterogeneous cellular networks ...
Appendix of downlink coverage probability in heterogeneous cellular networks ...
 
Appendix of heterogeneous cellular network user distribution model
Appendix of heterogeneous cellular network user distribution modelAppendix of heterogeneous cellular network user distribution model
Appendix of heterogeneous cellular network user distribution model
 
Active Attacks on DH Key Exchange
Active Attacks on DH Key ExchangeActive Attacks on DH Key Exchange
Active Attacks on DH Key Exchange
 
LSGAN - SIMPle(Simple Idea Meaningful Performance Level up)
LSGAN - SIMPle(Simple Idea Meaningful Performance Level up)LSGAN - SIMPle(Simple Idea Meaningful Performance Level up)
LSGAN - SIMPle(Simple Idea Meaningful Performance Level up)
 
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
zkStudyClub: PLONKUP & Reinforced Concrete [Luke Pearson, Joshua Fitzgerald, ...
 
ILP Based Approach for Input Vector Controlled (IVC) Toggle Maximization in C...
ILP Based Approach for Input Vector Controlled (IVC) Toggle Maximization in C...ILP Based Approach for Input Vector Controlled (IVC) Toggle Maximization in C...
ILP Based Approach for Input Vector Controlled (IVC) Toggle Maximization in C...
 
A verifiable random function with short proofs and keys
A verifiable random function with short proofs and keysA verifiable random function with short proofs and keys
A verifiable random function with short proofs and keys
 
A Note on Correlated Topic Models
A Note on Correlated Topic ModelsA Note on Correlated Topic Models
A Note on Correlated Topic Models
 
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
Photo-realistic Single Image Super-resolution using a Generative Adversarial ...
 
IRJET- On Certain Subclasses of Univalent Functions: An Application
IRJET- On Certain Subclasses of Univalent Functions: An ApplicationIRJET- On Certain Subclasses of Univalent Functions: An Application
IRJET- On Certain Subclasses of Univalent Functions: An Application
 
Skiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracingSkiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracing
 
ZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their ApplicationsZK Study Club: Sumcheck Arguments and Their Applications
ZK Study Club: Sumcheck Arguments and Their Applications
 
ICME 2013
ICME 2013ICME 2013
ICME 2013
 
On the Secrecy of RSA Private Keys
On the Secrecy of RSA Private KeysOn the Secrecy of RSA Private Keys
On the Secrecy of RSA Private Keys
 
CppConcurrencyInAction - Chapter07
CppConcurrencyInAction - Chapter07CppConcurrencyInAction - Chapter07
CppConcurrencyInAction - Chapter07
 
Microsoft Word Hw#1
Microsoft Word   Hw#1Microsoft Word   Hw#1
Microsoft Word Hw#1
 
On the Zeros of Complex Polynomials
On the Zeros of Complex PolynomialsOn the Zeros of Complex Polynomials
On the Zeros of Complex Polynomials
 

Viewers also liked

MY DATA
MY  DATAMY  DATA
MY DATA
Lelly Lupy
 
04 panorama do novo testamento completa
Psimd open64 workshop_2012-new

  • 1. EXPERIENCE WITH PARTIAL SIMDIZATION IN OPEN64 COMPILER USING DYNAMIC PROGRAMMING
      Dibyendu Das, Soham Sundar Chakraborty, Michael Lai, AMD
      {dibyendu.das, soham.chakraborty, michael.lai}@amd.com
      Open64 Workshop, June 15th, 2012, Beijing
  • 2. AGENDA
      - Introduction to Partial Simdization
      - Overview of Barik-Zhao-Sarkar Dynamic Programming-based algorithm
      - Motivating Example – hot loop in gromacs
      - Our Approach
      - Initial Results
      - Conclusion and Future Work
  • 3. PARTIAL SIMDIZATION
      - Effectively utilize and exploit the SIMD (vector) ISA of modern x86 microprocessors
      - Traditional loop-based vectorization adopts an all-or-nothing approach to SIMDizing loops, frequently employing fission or other techniques to separate the non-vectorizable part from the vectorizable part
      - Partial simdization packs isomorphic instructions into SIMD instructions (if possible)
          – Most of the opportunities are still loop-based
          – Not all instructions need to be packed
      - SLP (superword-level parallelism) [Larsen et al.] is one of the best-known partial simdization techniques
      - A simple example of partial simdization follows on the next slide
  • 4. PARTIAL SIMDIZATION - EXAMPLE
      Original loop:
          for (it = 0; it < n; it++) {
            i = a[it] + b[it];
            j = c[it] + d[it];
            k = e[it] + f[it];
            l = g[it] + h[it];
            m = i + j;
            n = k + l;
            res[it] = m + n;
          }
      Unrolled by two:
          for (it = 0; it < n; it=it+2) {
            i1 = a[it] + b[it];
            j1 = c[it] + d[it];
            k1 = e[it] + f[it];
            l1 = g[it] + h[it];
            m1 = i1 + j1;
            n1 = k1 + l1;
            res[it] = m1 + n1;
            i2 = a[it+1] + b[it+1];
            j2 = c[it+1] + d[it+1];
            k2 = e[it+1] + f[it+1];
            l2 = g[it+1] + h[it+1];
            m2 = i2 + j2;
            n2 = k2 + l2;
            res[it+1] = m2 + n2;
          }
      SIMDized across the two unrolled iterations:
          for (it = 0; it < n; it=it+2) {
            va = <a[it], a[it+1]>
            vb = <b[it], b[it+1]>
            vc = <c[it], c[it+1]>
            vd = <d[it], d[it+1]>
            ve = <e[it], e[it+1]>
            vf = <f[it], f[it+1]>
            vg = <g[it], g[it+1]>
            vh = <h[it], h[it+1]>
            <i1,i2> = va+vb
            <j1,j2> = vc+vd
            <k1,k2> = ve+vf
            <l1,l2> = vg+vh
            <m1,m2> = <i1,i2>+<j1,j2>
            <n1,n2> = <k1,k2>+<l1,l2>
            <res[it], res[it+1]> = <m1,m2> + <n1,n2>
          }
      Partially simdized within a single iteration (simd part followed by scalar part):
          for (it = 0; it < n; it++) {
            vt1 = pack< a[it], e[it] >      // simd
            vt2 = pack< b[it], f[it] >
            vt3 = pack< c[it], g[it] >
            vt4 = pack< d[it], h[it] >
            vt1 = vt1 + vt2;                // <i,k>
            vt3 = vt3 + vt4;                // <j,l>
            vt1 = vt1 + vt3;                // <m,n>
            m,n = unpack vt1                // scalar
            res[it] = m+n;
          }
      Example courtesy: "SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures", Y. Park et al., ASPLOS 2012
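The equivalence of the original loop and the partially simdized loop on slide 4 can be checked with a small sketch. It uses Python as executable pseudocode, not compiler output: a 2-lane vector is modelled as a tuple, and `pack`, `vadd`, `scalar_loop` and `partial_simd_loop` are hypothetical names chosen for illustration.

```python
def scalar_loop(a, b, c, d, e, f, g, h):
    """Original loop: seven scalar adds per iteration."""
    res = []
    for it in range(len(a)):
        i = a[it] + b[it]
        j = c[it] + d[it]
        k = e[it] + f[it]
        l = g[it] + h[it]
        m = i + j
        n = k + l
        res.append(m + n)
    return res


def pack(x, y):
    """Model a 2-lane vector as a tuple (assumption for illustration)."""
    return (x, y)


def vadd(u, v):
    """Lane-wise add of two 2-lane vectors."""
    return (u[0] + v[0], u[1] + v[1])


def partial_simd_loop(a, b, c, d, e, f, g, h):
    """Partially simdized loop: three vector adds plus one scalar add."""
    res = []
    for it in range(len(a)):
        vt1 = pack(a[it], e[it])
        vt2 = pack(b[it], f[it])
        vt3 = pack(c[it], g[it])
        vt4 = pack(d[it], h[it])
        vt1 = vadd(vt1, vt2)   # <i, k>
        vt3 = vadd(vt3, vt4)   # <j, l>
        vt1 = vadd(vt1, vt3)   # <m, n>
        m, n = vt1             # unpack
        res.append(m + n)      # the final add stays scalar
    return res
```

Both functions return the same list for any eight equal-length inputs. The simdized form replaces six of the seven scalar adds with three 2-lane adds, at the price of four packs and one unpack per iteration.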
  • 5. OVERVIEW OF BARIK-ZHAO-SARKAR (BZS) ALGORITHM
      - Dynamic Programming based approach
          – Data Dependence Graph is built for each basic block
          – k-tiles are built; each tile consists of k isomorphic independent operations
          – Two sets of costs are built
              - Scalar cost of each node n
              - Vector/SIMD cost of k-tiles
          – [Pass 1] Dependence graph is traversed from source to sink, building scalar and vector costs using dynamic programming rules
          – [Pass 2] Best costs are determined, deciding whether each k-tile should be SIMDized or not
              - Uses a scalar cost vs. vector cost comparison for this purpose
          – [Pass 3] Generate SIMD+scalar code
  • 6. BARIK-ZHAO-SARKAR (BZS) ALGORITHM - ILLUSTRATED
      - Various combinations of packing (assume 2-tiles) are possible
          – <ld a[it], ld b[it]> OR <ld a[it], ld c[it]> OR <ld a[it], ld d[it]> OR <ld a[it], ld e[it]> OR <ld a[it], ld f[it]> …
          – Which is better? Depends on whether <+1,+2> OR <+1,+3> OR <+1,+4> is better, and so on
          – vcost(<+1,+2>) = vcost(<ld a[it], ld c[it]>) + vcost(<ld b[it], ld d[it]>) OR
          – vcost(<+1,+2>) = vcost(<ld a[it], ld b[it]>) + vcost(<ld c[it], ld d[it]>) + unpcost(a,b) + unpcost(c,d) + pcost(a,c) + pcost(b,d) (Dynamic Programming)
      - In actuality: Total Scalar Cost = 20, Total Vector Cost = 21, so DON'T SIMDIZE
      [Figure: dependence graph of the loop body. Loads ld a[it] … ld h[it] feed the adds +1 … +4, whose results are stored (st i, st j, st k, st l) and combined by three further adds into st r[it].]
  • 7. A MOTIVATING EXAMPLE – HOT LOOP IN GROMACS
  • 8. OUR APPROACH
      - Phase I – Initialization (Preparatory phase)
          - Build the Data Dependence Graph (still basic-block oriented), only for INNER loops, preferably HOT
          - Topological sort and create levels (provision to do ALAP or ASAP)
              – Differs from the BZS approach, which is ready-list based
              – We statically build all the possible tiles to control compile cost and for ease of design
      [Figure: WHIRL dependence DAG, top to bottom: LDID ix1 / iy1 / iz3 and ILOAD pos[j3] / pos[j3+1] / pos[j3+8]; STID jx1 / jy1 / jz3; LDID jx1 / jy1 / jz3; SUB (three); STID dx11 / dy11 / dz33; LDID dx11, dy11, dz33 (each twice); MPY (three); ADD (two); STID rsq11.]
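The level creation in Phase I can be sketched as follows, assuming ASAP placement: a node's level is one more than the deepest node it depends on. This is an illustrative Python model, not the Open64 implementation; `asap_levels`, `group_by_level` and the node names in `PREDS` are hypothetical, loosely modelled on two of the three lanes in the slide's figure.

```python
def asap_levels(preds):
    """Return {node: level}; preds maps each node to the nodes it depends on."""
    level = {}

    def visit(n):
        if n not in level:
            # Sources get level 0; others sit one level below their deepest input.
            level[n] = 1 + max((visit(p) for p in preds[n]), default=-1)
        return level[n]

    for n in preds:
        visit(n)
    return level


def group_by_level(level):
    """Return the nodes of each level, ordered from source (0) to sink."""
    groups = {}
    for n, lv in level.items():
        groups.setdefault(lv, []).append(n)
    return [groups[lv] for lv in sorted(groups)]


# Hypothetical two-lane fragment: (ix1 - jx1)^2 + (iy1 - jy1)^2
PREDS = {
    "ld_ix1": [], "ld_jx1": [], "ld_iy1": [], "ld_jy1": [],
    "sub_x": ["ld_ix1", "ld_jx1"],
    "sub_y": ["ld_iy1", "ld_jy1"],
    "mpy_x": ["sub_x"],
    "mpy_y": ["sub_y"],
    "add": ["mpy_x", "mpy_y"],
}
LEVELS = group_by_level(asap_levels(PREDS))
```

Because a dependence always forces a strictly greater level, two nodes in the same level are independent of each other, which is what makes a level a natural pool of packing candidates.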
  • 9. OUR APPROACH
      - Phase II – Tiling
          - Pack isomorphic operations as k-tiles
          - In BZS, k = SIMD width / scalar op width; e.g. for SIMD width = 128 and scalar = int, k = 4
          - That is not the best mechanism to pick k
          - Iterative process instead, k = 2, 3, 4 …
          - Traverse every level in topological order
              – Pick k nodes with isomorphic ops and create a k-tile <op1, op2, …, opk>
              – Associate a dummy cost with the tile
              – Not all isomorphic nodes at the same level are checked for packing; that is too computationally intensive
              – Use heuristics, e.g. nodes with a common parent/grandparent are not packed, OR pack nodes which appear as the left child, right child, etc.
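A minimal sketch of tile creation for one level is given below. It enumerates every k-subset of same-opcode nodes, which is exactly the exhaustive search the slide calls too expensive, so it should be read as the unpruned baseline that the heuristics cut down. The names `bzs_k`, `k_tiles_for_level` and the opcode map are hypothetical.

```python
from itertools import combinations


def bzs_k(simd_bits, scalar_bits):
    """Fixed tile size as described for BZS: SIMD width / scalar op width."""
    return simd_bits // scalar_bits


def k_tiles_for_level(level_nodes, opcode, k):
    """All k-subsets of nodes in one level that share an opcode.

    Exhaustive on purpose; a real implementation would prune with heuristics.
    """
    by_op = {}
    for n in level_nodes:
        by_op.setdefault(opcode[n], []).append(n)
    tiles = []
    for nodes in by_op.values():
        tiles.extend(combinations(nodes, k))
    return tiles


# Hypothetical level with three SUB nodes and one LDID node.
OPCODE = {"sub_x": "SUB", "sub_y": "SUB", "sub_z": "SUB", "ld_a": "LDID"}
TILES_2 = k_tiles_for_level(list(OPCODE), OPCODE, 2)
TILES_3 = k_tiles_for_level(list(OPCODE), OPCODE, 3)
```

With three isomorphic SUB nodes, k = 2 yields three candidate tiles while k = 3 yields a single tile covering all of them, which hints at why iterating over k can match the loop's natural structure better than a fixed k = 4.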
  • 10. OUR APPROACH
      - Phase III – Dynamic Programming
          - Traverse from Level = 0 (source) to Level = maxLevel (sink)
              – Build all the scalar costs, scost(n), for every node n
              – Build the SIMD costs of the k-tiles based on dynamic programming rules
              – For each tile choose the best vector cost option at any point, and store the configuration
          - Cost rules
              – scost(n) = scost(lchild(n)) + scost(rchild(n)) + op(n)   (children computed as scalars)
              – scost(n) = vcost(t1) + vcost(t2) + unpcost(t1) + unpcost(t2) + op(n)   (children taken from tiles t1 and t2 and unpacked)
              – vcost(t) = vcost(t1) + vcost(t2) + op(t)   (t1, t2 are the operand tiles of tile t)
      - Pick Best Costs/Configuration
          - Traverse from Level = maxLevel (sink) to Level = 0
              – Pick k-tiles and choose those where vcost(tile) < Σ scost(node that is part of the tile)
              – The non-chosen tiles are assumed to be scalar operations
              – The chosen tiles are the SIMD ops
      [Figure: node n with children lchild(n) and rchild(n), either scalar or belonging to tiles t1 and t2; tile t = <o1, o2, …, ok> with operand tiles t1 = <o11, o12, …, o1k> and t2 = <o21, o22, …, o2k>.]
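The first and third cost rules and the selection test can be sketched on the slide 4 example. This is a simplified model under loud assumptions: every operation costs 1, the cost of a leaf tile of loads (`leaf_cost`) is a free parameter, the mixed rule with unpcost is left out, and the expression is treated as a tree so shared subexpressions would be counted twice. The function names are hypothetical.

```python
def scost(node, children, op_cost=1):
    """scost(n) = scost(lchild(n)) + scost(rchild(n)) + op(n); leaves cost op(n)."""
    return op_cost + sum(scost(c, children, op_cost) for c in children.get(node, ()))


def vcost(tile, children, leaf_cost, op_cost=1):
    """vcost(t) = vcost(t1) + vcost(t2) + op(t), with t1/t2 the operand tiles."""
    kids = [children.get(n, ()) for n in tile]
    if not any(kids):
        return leaf_cost                  # tile of loads: assumed packing cost
    operand_tiles = zip(*kids)            # t1 = left children, t2 = right children
    return op_cost + sum(vcost(t, children, leaf_cost, op_cost) for t in operand_tiles)


def should_simdize(tile, children, leaf_cost):
    """Selection rule: vcost(tile) < sum of scost over the tile's nodes."""
    return vcost(tile, children, leaf_cost) < sum(scost(n, children) for n in tile)


# Slide 4 loop body: m = (a+b) + (c+d), n = (e+f) + (g+h)
CHILDREN = {
    "i": ("a", "b"), "j": ("c", "d"), "k": ("e", "f"), "l": ("g", "h"),
    "m": ("i", "j"), "n": ("k", "l"),
}
```

With these assumed unit costs, the tile <m, n> has a scalar cost of 14. If gathering two non-adjacent loads into a vector costs 3 (two loads plus a pack), its vector cost is 15 and the tile is rejected. If a leaf tile costs only 1, the vector cost drops to 7 and the tile is chosen. The outcome therefore hinges on the packing cost, in the same spirit as the 20 versus 21 comparison on slide 6, although the numbers here come from a different, made-up cost model.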
  • 11. OUR APPROACH
      - Phase IV – Code Generation
          - Generate WHIRL vector types for SIMD tiles
          - Generate unpacking/packing instructions at the boundary of non-SIMD to SIMD and SIMD to non-SIMD code generation
              – Use SIMD intrinsics for packing/unpacking
          - Case 1 – General case
          - Case 2 – Handling LDID/ILOAD tiles (also packing)
          - Case 3 – Handling STID/ISTORE tiles (also unpacking)
          - Arbitrary packing/unpacking not allowed
      [Figure: code generation diagrams labelled Case 1, Case 2 and Case 3.]
  • 12. APPLYSIMD ALGORITHM
      Input: WN - the WHIRL tree of the loop
      Output: Simdized WHIRL tree

          Procedure ApplySimd() {
            // Phase-I
            /* Initialization: create the DAG from the WHIRL using data flow analysis. */
            _dag = CreateDAG();
            /* Level creation in DAG: topological ordering of the DAG nodes. */
            _setOfLevels = CreateLevelsForDAG();

            // Phase-II
            /* Possible vector tiles are created for each level. */
            _setOfTiles = CreateVectorKTilesForLevels();

            // Phase-III
            /* Scalar and vector costs are computed according to
               the defined cost function using dynamic programming. */
            _setOfTiles< scost, vcost > = FindKTileSimpleBestCost();
            /* Choose the best set of tiles for maximal benefits. */
            _chosenSetOfTiles< scost, vcost > = FindBestVectorScalarKtiles();

            // Phase-IV
            /* Generate the simdized code by transforming the WHIRL. */
            _simdizedWHIRL = GenerateVectorScalarCodeKTile();
          }

      Phase-I
          - Create the dependence DAG among operations from the data dependence graph.
          - Create topological levels for the DAG nodes.
      Phase-II
          - At each level, for each operator having n DAG nodes, create tile combos, where k is the tile size.
      Phase-III
          - Compute scalar and vector costs for all tiles as per the cost model, following the dynamic programming approach.
          - Choose the best set of tiles with minimum cost for maximal benefit.
      Phase-IV
          - Generate the simdized WHIRL code, from which the code generator produces simdized code in a subsequent compiler phase.
  • 13. OUR APPROACH – MAJOR DIFFERENCES WITH BZS
      - Level-based DAG approach to create tiles
          – More restricted than BZS, which relies on a 'ready list', but helps control compile time
      - Implemented in the LNO
          – A higher-level implementation lets us reason about the simdization itself rather than about low-level details
          – Easy vector type support in WHIRL, which can be lowered
      - Iterative 'k'-tiling as opposed to a rigid selection of 'k'
          – Helps with the natural structure of the loop
  • 14. INITIAL RESULTS & EXPERIMENTS - GROMACS (INL1130)
      - Inl1130's innermost loop is the hottest, at 70%+
      - The loop body demonstrates 3-way parallelism, ideally suited for 3-tiling
      - Several complexities over the implementation outlined
          – Simdizing reductions
          – Supertiling (identify some special patterns for better code generation)
          – LiveIn, LiveOut code generation

      #Nodes   #Levels   #Nodes Simdized   #Simdized tiles   %Simdized
      763      43        678               226               88.8
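The table's figures are mutually consistent if every chosen tile is a 3-tile, which is an assumption suggested by the "3-tiling" remark rather than something the slide states. A quick check:

```python
nodes_total = 763
simdized_tiles = 226
k = 3  # assumed: all chosen tiles are 3-tiles

nodes_simdized = simdized_tiles * k
percent_simdized = 100 * nodes_simdized / nodes_total
```

Under that assumption the 226 tiles cover exactly the 678 simdized nodes, and the ratio to 763 total nodes falls between 88.8 and 88.9 percent, in line with the reported 88.8.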
  • 15. INITIAL RESULTS & EXPERIMENTS
      - Implemented in the Open64 4.5.1 AMD release
      - Enabled under a special flag -LNO:simd=3 at -O3 and above
      - ApplySimd is invoked in LNO, after regular vectorization fails
      - LNO/simd.cxx – code home
      - Compile-time intensive due to explicit tile creation
      - Our main targets are gromacs (inl1130) (and namd, WIP)
      - Not available in open64.net yet
      - Targets AVX, though 256-bit registers are not utilized yet
      [Chart: %speedup, vertical axis from 0 to 30.]
  • 16. CONCLUSION & FUTURE WORK
      - Illustrates the efficacy of the dynamic programming approach
      - First top-of-the-line compiler to effectively simdize gromacs with such huge gains
          – Some efforts do simdize gromacs, but they use legacy compilers such as SUIF
      - Can be extended to CFGs
      - Iterative "k"-tiling factor is decided on the percentage simdized
      - The underlying graph problem is identifying subgraph isomorphism (hard). Refer to the SIMD Defragmenter paper by Park, Mahlke et al. (credited on slide 4 and listed in the references)
          – Should we look at disjoint clusters before tiling?
      - Still a lot of work
          – Challenges in reducing compile-time overhead
          – Code generation is too tied to the ISA (virtual vectors may help, see Zaks et al.)
          – Scatter/gather operations may make things attractive (Haswell, future AMD processors)
  • 17. REFERENCES
      - Efficient Selection of Vector Instructions using Dynamic Programming. Rajkishore Barik, Jisheng Zhao, Vivek Sarkar (Rice University), MICRO 2010.
      - Efficient SIMD Code Generation for Irregular Kernels. Seonggun Kim and Hwansoo Han (Sungkyunkwan University), PPoPP 2012.
      - A Compiler Framework for Extracting Superword Level Parallelism. Jun Liu, Yuanrui Zhang, Ohyoung Jang, Wei Ding, Mahmut Kandemir (The Pennsylvania State University), PLDI 2012.
      - SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures. Yongjun Park (University of Michigan, Ann Arbor), Sangwon Seo (University of Michigan, Ann Arbor), Hyunchul Park (Intel), Hyoun Kyu Cho (University of Michigan, Ann Arbor) and Scott Mahlke (University of Michigan, Ann Arbor), ASPLOS 2012.
  • 18. Trademark Attribution
      AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. ©2011 Advanced Micro Devices, Inc. All rights reserved.