Auto Tuning
Hemanth and Siddharth
     UT Austin
Basics
What is Auto Tuning?
● Several Definitions
   ○ First result on Wikipedia - "Auto-Tune is an audio
     processor created by Antares Audio Technologies"


● A Definition
  ○ Autotuning is an automatic process for selecting one
      out of several possible solutions to a computational
      problem.

● Techniques used by:
   ○ Library generators, Compilers and Runtime systems
Possible Versions of a Solution
● The solutions may differ in the
  ○ algorithm (quicksort vs selection sort)
  ○ implementation (loop unroll).

● The versions may result from
  ○ transformations (unroll, tile, interchange)

● The versions could be generated by
  ○ programmer manually (coding or directives)
  ○ compiler automatically
Motivation
■ Increasing diversity of computing platforms
■ New influences on the execution of parallel
  applications
  ○ Hierarchical structure
  ○ Heterogeneity of the processors
■ Design efficient software that takes full
  advantage of such systems
■ Solving a target problem by using a single
  algorithm is not always efficient everywhere
First Ideas
● Poly-Algorithms
    ○ (1969) John Rice (Purdue): "A polyalgorithm for the automatic
      solution of nonlinear equations"

● Profiling and feedback assisted compilation
    ○ (1982) S. Graham et al.: gprof
    ○ (1991) P. Chang et al.: "Using profile information to assist classic
      code optimizations"

● Code generation
    ○ (1989) J. Johnson et al.: "A methodology for designing, modifying,
      and implementing Fourier Transform algorithms on various
      architectures."
    ○ (1992) M. Covell et al.: "Computer-aided algorithm design and
      arrangement"
Context: High Performance Libraries
● Linear Algebra
  ○ BLAS, LAPACK, ScaLAPACK
● Signal/Image Processing
  ○ Vector Signal Image Processing Library (VSIPL)
● Distributed/Parallel Systems
  ○ Message Passing Interface (MPI)
● Can we implement libraries:
  ○ automatically and portably
  ○ incorporating platform-specific features
  ○ matching performance of hand-tuned
    implementations leveraging compiler technology
  ○ using domain-specific knowledge
AutoTuning
● 2 phase scheme for producing automatically
  tuned code

● Given: Program; inputs; machine

● Step1: Identify and generate a space of
  candidate implementations

● Step2: Select the fastest one using empirical
  modeling and/or automated experiments
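The two-phase scheme above can be sketched in C as follows. This is a minimal illustration, not the method of any particular tool: the candidates `sum_simple` and `sum_unrolled4`, the `time_candidate` helper and the `pick_fastest` selector are hypothetical names invented for this sketch, and CPU time from `clock()` stands in for a proper benchmark harness.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* A candidate implementation: sums n doubles. */
typedef double (*sum_fn)(const double *x, size_t n);

/* Step 1, candidate A: straightforward loop. */
double sum_simple(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* Step 1, candidate B: unrolled by 4 with a cleanup loop. */
double sum_unrolled4(const double *x, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]; s1 += x[i + 1]; s2 += x[i + 2]; s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}

/* Step 2a: CPU seconds spent running one candidate `reps` times. */
double time_candidate(sum_fn f, const double *x, size_t n, int reps) {
    volatile double sink = 0.0;  /* keeps the calls from being optimized away */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) sink += f(x, n);
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Step 2b: index of the smallest measured time (first one wins ties). */
size_t pick_fastest(const double *times, size_t count) {
    assert(count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; i++)
        if (times[i] < times[best]) best = i;
    return best;
}
```

A driver would call `time_candidate` once per candidate on representative input, store the results in an array and keep the candidate whose index `pick_fastest` returns.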
Why not let the compiler worry?
● General Purpose
  ○ whereas Library generators can focus on specific
    problems

● Engineering
  ○ Hard to modify a production compiler and its effects
    are global

● Analysis
  ○ Limited access to relevant run-time information
  ○ Over specified dependencies
  ○ Correctness Criteria
Compiler vs. AutoTuner
                        Compiler                    AutoTuner
Input                   General-purpose             Specification including problem size,
                        source code                 machine parameters and problem-specific
                                                    transformations
Output                  Low-level machine code      Mostly high-level source (e.g. C code)
Time to generate        Short (unless feedback/     Usually long (depends on search space)
                        profiling enabled)
Select implementation   Mostly static analysis      Automated empirical models and
                        (rarely feedback tuning)    experiments
Some AutoTuning Projects

● Linear Algebra
  ○ Portable High-Performance ANSI C
    ■ PHiPAC
  ○ Automatically Tuned Linear Algebra Software
    ■ ATLAS

● Signal and Image Processing
  ○ Fastest Fourier Transform in the West
    ■ FFTW
  ○ SPIRAL
PHiPAC
Traditional Approach
Hand Tuned Libraries
PHiPAC (1997)
● Developing Portable High-Performance
  matrix vector libraries in ANSI C
● Parametrized C-code Generator
  ○ produces code according to certain
    guidelines
● Auto Tune the code
● Exhaustive search over all parameters
● Claim: achieve over 90% of peak-perf and
PHiPAC Approach
Generate Optimized C Code
PHiPAC Approach
Parameters are Architecture Specific
Efficient Code Generation
● Studied several ANSI C compilers and
  determined that it is best to:

● Rely on compilers for:
  ○ Register allocation
  ○ Instruction selection and scheduling

● Manually perform:
  ○ register/cache blocking
  ○ loop unrolling
  ○ software pipelining, etc.
Local variables to explicitly remove false
dependencies
● Before:
    a[i]   = b[i] + c;
    a[i+1] = b[i+1] * d;
● After:
    float f1, f2;
    f1 = b[i]; f2 = b[i+1];
    a[i]   = f1 + c;
    a[i+1] = f2 * d;

The compiler may not assume &a[i] != &b[i+1],
so it is forced to store a[i] before
loading b[i+1] (pointer aliasing)
False Dependencies




              After Removing Dependency
Exploit Multiple Registers

● Explicitly keep values in local variables
  ○ Reduces memory bandwidth
  ○ Otherwise the compiler would reload the fil values on every
    iteration (potential aliasing with res)

● Before:
    while(...) {
      *res++ = fil[0] * sig[0]
             + fil[1] * sig[1];
      sig++;
    }
● After:
    float f0 = fil[0];
    float f1 = fil[1];
    while(...) {
      *res++ = f0 * sig[0]
             + f1 * sig[1];
      sig++;
    }
Minimize pointer updates by striding with
constant offsets

● Before:
    f0 = *r8; r8 += 4;
    f1 = *r8; r8 += 4;
    f2 = *r8; r8 += 4;
● After:
    f0 = r8[0];
    f1 = r8[4];
    f2 = r8[8];
    r8 += 12;

Compilers can fold a constant index into the
(register + offset) addressing mode.
Minimize branches, avoid magnitude
compares
● Branches are costly
  ○ Unroll loops
  ○ Use do{} while(); loops to avoid loop
    head branches
● Using == and != is cheaper
● Before:
    for(i = 0, a = start_ptr;
        i < ARRAY_SIZE;
        i++, a++) {
      ...
    }
● After:
    end_ptr = &a[ARRAY_SIZE];
    do {
      ...
      a++;
    } while (a != end_ptr);
Explicitly unroll loops

● Instruction level parallelism
● Before:
    while(...) {
      *res++ = fil[0] * sig[0]
             + fil[1] * sig[1];
      sig++;
    }
● After:
    float f0, f1, s0, s1, s2;
    f0 = fil[0]; f1 = fil[1];
    s0 = sig[0]; s1 = sig[1]; s2 = sig[2];
    *res++ = (f0*s0) + (f1*s1);
    do {
      sig += 3;
      s0 = sig[0];
      res[0] = f0*s1 + f1*s2;
      s1 = sig[1];
      res[1] = f0*s2 + f1*s0;
      s2 = sig[2];
      res[2] = f0*s0 + f1*s1;
      res += 3;
    } while(...);
Other Guidelines
● Balance Instruction Mix
  ○ Interleave 1 FP multiply (FPM), 1 FP add (FPA) and
    1-2 FP loads or stores
● Increase Locality
  ○ Arrange code to have unit-stride memory
    accesses and try to reuse data in cache
● Convert Integer multiplies to adds
  ○ * and / are slower than +
Matrix Multiply Generators
● Produce C code with PHiPAC guidelines
● C = α op(A) op(B) + β C
  ○ MxK, KxN and MxN matrices
  ○ op(X) is either X or transpose(X)

● mm_cgen and mm_lgen
    ○ Core (register blocking)
    ○ Level (higher level cache blocking)

● mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...
Blocked MMM
for (i=0; i<M; i+=M0)
 for (j=0; j<N; j+=N0)
  for (l=0; l<K; l+=K0)
   /* multiply one M0 x K0 block of a by one K0 x N0 block of b */
   for (r=i; r<i+M0; r++)
    for (s=j; s<j+N0; s++)
     for (t=l; t<l+K0; t++)
      c[r][s] += a[r][t] * b[t][s];
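The blocked loop nest above can be made self-contained as in the sketch below. The function name `mmm_blocked`, the row-major flat-array layout and the clipping of partial blocks at the matrix edge are assumptions of this sketch; the generated PHiPAC code handles fringes differently and is far more aggressively unrolled.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes; used to clip blocks at the matrix edge. */
size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B with A (M x K), B (K x N), C (M x N), all row-major.
 * M0, K0, N0 are the block sizes; they need not divide M, K, N. */
void mmm_blocked(size_t M, size_t K, size_t N,
                 size_t M0, size_t K0, size_t N0,
                 const double *A, const double *B, double *C) {
    assert(M0 > 0 && K0 > 0 && N0 > 0);
    for (size_t i = 0; i < M; i += M0)
        for (size_t j = 0; j < N; j += N0)
            for (size_t l = 0; l < K; l += K0)
                /* one M0 x K0 block of A times one K0 x N0 block of B */
                for (size_t r = i; r < min_sz(i + M0, M); r++)
                    for (size_t s = j; s < min_sz(j + N0, N); s++) {
                        double acc = C[r * N + s];
                        for (size_t t = l; t < min_sz(l + K0, K); t++)
                            acc += A[r * K + t] * B[t * N + s];
                        C[r * N + s] = acc;
                    }
}
```

Note that the inner loops start at `i`, `j` and `l` respectively, so each block of C is updated from the matching blocks of A and B.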
Code Generator
 $ mm_gen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ]

  (M0 K0 N0, M1 K1 N1)  ->  mm_gen  ->  Optimized C
Usage and Options
Usage: mm_cgen [OPTIONS]
● Semantics options:
    ○ -op[ABC] [N|T] : [ABC] matrix op. Normal|Transpose
    ○ -no_fringes : don't generate M, K, or N register-block
      fringes

● Optimization options:
    ○ -l0/l1 M0/M1 K0/K1 N0/N1 : register (L0)/cache (L1)
      blocking parameters
    ○ -sp [1|2lm|2ma|3] : software pipelining options
Contd.
● Precision options:
   ○ prec/sprec/aprec/dprec [single|double|ldouble] :
     Precision (source, accumulator, destination)

● Misc. options:
   ○ file name : Write to file 'name'
   ○ routine_name name : Name of routines
Optimal Block Sizes
Use the search.pl script
Optimal Block Sizes
● Naive brute force search

● For Register Parameters
   ○ NR/4 <= M0*N0 <= NR ; NR is max regs
   ○ 1 <= K0 <= K0max ; K0max = 20 (tunable)

● Benchmark all squares M = K = N = D
  ○ D runs over 2^x, 3^x, 10^x and all primes
  ○ 3D^2 fits in L1 cache
Contd.
● For L1 blocking parameters
● The square case (D x D)
● Search the neighborhood centered at 3D^2 = L1
● Set the values of M1, K1, N1 to ϕ D/M0
   ○ where ϕ ∈ { 0.25, 0.5, 1.0, 1.5, 2.0 }
   ○ D = sqrt(L1/3)
   ○ 125 combinations
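The 125 combinations come from choosing one of the five ϕ values independently for each of M1, K1 and N1. The sketch below enumerates that grid. Several details are assumptions made for illustration only: each block size is scaled by its own register block (M1 from M0, K1 from K0, N1 from N0), L1 is given as a count of matrix elements, results are rounded to the nearest integer with a floor of 1, and the names `square_dim`, `l1_block` and `l1_candidates` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int m1, k1, n1; } l1_block;

/* Largest d with 3*d*d <= l1_elems, i.e. D = floor(sqrt(L1 / 3)). */
int square_dim(long l1_elems) {
    int d = 0;
    while (3L * (d + 1) * (d + 1) <= l1_elems) d++;
    return d;
}

/* Scales D by phi, divides by a register block size, rounds to nearest,
 * and never returns less than 1 (rounding rule is an assumption). */
int scaled_block(double phi, int d, int reg_block) {
    int v = (int)(phi * d / reg_block + 0.5);
    return v < 1 ? 1 : v;
}

/* Fills out[] (capacity >= 125) with the 5 x 5 x 5 candidate grid and
 * returns the number of candidates written. */
size_t l1_candidates(long l1_elems, int m0, int k0, int n0, l1_block *out) {
    static const double phi[5] = {0.25, 0.5, 1.0, 1.5, 2.0};
    assert(m0 > 0 && k0 > 0 && n0 > 0);
    int d = square_dim(l1_elems);
    size_t count = 0;
    for (int a = 0; a < 5; a++)
        for (int b = 0; b < 5; b++)
            for (int c = 0; c < 5; c++) {
                out[count].m1 = scaled_block(phi[a], d, m0);
                out[count].k1 = scaled_block(phi[b], d, k0);
                out[count].n1 = scaled_block(phi[c], d, n0);
                count++;
            }
    return count;
}
```

Each candidate would then be passed to the code generator and benchmarked; the sketch only produces the list.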
Naive Brute Force ?
● Search takes too long

● Generates very lengthy code

● Very slow under full optimization

● Need a better search strategy
Smarter Search
● Majority of the computation is performed in
  register blocked code
● Benchmark only in multiples of register block
  size
● Search space of M0, N0, K0 is not reduced
  ○ Prioritize neighborhood of the best ones found
  ○ {M0-1, M0, M0+1} etc.
● Terminate after reaching acceptable
  efficiency
Evaluation
Single Precision MMM (100 MHz SGI
Indigo R4k)




Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
Double Precision MMM (HP 712/80i)




Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
There is no Golden Hammer
Strengths:
● Automatic search for optimal parameters
● Produces portable ANSI C code
Weaknesses:
● Focus on uniprocessor machines
● No support for vector-based CPUs
● No control over instruction scheduling
Further Information
● http://www.icsi.berkeley.edu/~bilmes/phipac/

● http://www.inf.ethz.ch/personal/markusp/teaching/252-2600-ETH-fall11/slides/01-Dietiker.pdf
ATLAS
Siddharth Subramanian
ATLAS
● Automatically Tuned Linear Algebra
  Software
● Generates optimized BLAS library
● C and Fortran77
● Provides implementation for BLAS levels 1,2
  and 3.
● We will focus on Matrix-Matrix-Multiply
  (MMM)
Naive MMM
● C = A * B using 3 for-loops
● Dimensions of A, B and C are NxK, KxM and
  NxM respectively.
Optimization for L1 cache
● Matrix divided into NB x NB blocks
● Each block is called mini-MMM
● Optimization parameter NB is chosen such
  that each mini-MMM fits in cache
Optimization for L1 cache
Optimization for register file
● Mini-MMMs are further decomposed into micro-MMMs
● Each micro-MMM multiplies an MU x 1 sub-matrix of A by a
  1 x NU sub-matrix of B and accumulates the result into an
  MU x NU sub-matrix of C
● Here MU and NU are the optimization parameters
● Necessary condition: MU + NU + MU*NU <= NR,
  where NR = number of floating point registers
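A hand-written instance of this register tiling for MU = NU = 2 is sketched below. The function name `mini_mmm`, the row-major layout and the fixed tile size are assumptions for illustration; ATLAS generates such code for whatever MU and NU the tuning step selects. With MU = NU = 2 the tile needs MU + NU + MU*NU = 8 values live at once, which is the quantity the register constraint bounds.

```c
#include <assert.h>
#include <stddef.h>

enum { MU = 2, NU = 2 };  /* register tile; example values, not tuned */

/* C += A * B for one NB x NB mini-MMM (row-major, NB divisible by MU, NU).
 * Each (i, j) step is a micro-MMM: an MU x NU tile of C is held in local
 * variables while MU x 1 slices of A meet 1 x NU slices of B. */
void mini_mmm(size_t NB, const double *A, const double *B, double *C) {
    assert(NB % MU == 0 && NB % NU == 0);
    for (size_t j = 0; j < NB; j += NU)
        for (size_t i = 0; i < NB; i += MU) {
            /* MU*NU accumulators for the C tile */
            double c00 = C[i * NB + j],       c01 = C[i * NB + j + 1];
            double c10 = C[(i + 1) * NB + j], c11 = C[(i + 1) * NB + j + 1];
            for (size_t k = 0; k < NB; k++) {
                /* MU values of A and NU values of B */
                double a0 = A[i * NB + k], a1 = A[(i + 1) * NB + k];
                double b0 = B[k * NB + j], b1 = B[k * NB + j + 1];
                /* outer product: MU*NU multiply-adds */
                c00 += a0 * b0; c01 += a0 * b1;
                c10 += a1 * b0; c11 += a1 * b1;
            }
            C[i * NB + j] = c00;       C[i * NB + j + 1] = c01;
            C[(i + 1) * NB + j] = c10; C[(i + 1) * NB + j + 1] = c11;
        }
}
```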
Mini and Micro- MMM
Code
Pipeline scheduling
The 2 innermost loops (i'' and j'') are unrolled,
to create interleaved multiply and add
statements
Exploits instruction-level parallelism
● If there is fused multiply-add, then these 2
  operations can be executed together
● The optimization parameter FMA tells
  the code generator whether this facility is available
Pipeline scheduling
● MU + NU loads and stores
● MU * NU additions and multiplications
● Latency of operations might stall the pipeline
● Solution : Interleave the operations such that
  dependent operations are separated by a
  particular distance (What would that be?)
● This is governed by another optimization
  parameter - LS
Pipeline scheduling

● Inject MU + NU loads of A and B
● Loads divided into:
  ○ Initial fetch (IF)
  ○ Blocks of other load operations (NF)
Loop Unrolling
● KU is the optimization parameter that
  controls loop unrolling
● Constrained by the capacity of instruction
  cache
● Should be neither too small (wastes the cache)
  nor too big (overflows the instruction cache)
Other Optimizations


● Copying tiles of A is done in the beginning of
  outermost loop. These tiles are fully reused
  in each iteration of j loop
● Copying jth vertical panel of B -- done before
  beginning of i loop.
● Copying tile (i,j) of C just before the "k" loop
  starts
Other optimizations
● Choosing loop order:
  ○ if N < M then JIK loop order (so that A
    completely fits into L2 cache)
  ○ else if M < N then IJK loop order
Other optimizations
● Copying A, B, C for smaller matrices might
  be an overhead
● Non-copying versions are generated with
  optimization parameter NCNB
● This version is used if:
  ○ M * N * K is less than a threshold
  ○ at least 1 dimension of 1 of the matrices is
    smaller than 3 * NCNB
Estimating parameters
● Orthogonal search is used for optimizing
  parameters.
● It is a heuristic, and finds approximate
  solutions
● No guarantee of an optimal solution
● It needs these details:
  ○ In what order are the parameters optimized?
  ○ Possible solution range for each parameter
  ○ Reference value used for parameter k during
    optimization of parameters 1 to k-1
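Orthogonal search tunes one parameter at a time while the others stay fixed, which the sketch below shows with a generic cost callback. The names `cost_fn`, `orthogonal_search` and `toy_cost` are hypothetical, and the toy cost function merely stands in for generating and timing real code; this is not the ATLAS search module.

```c
#include <assert.h>
#include <stddef.h>

/* Cost of one parameter setting; lower is better. In a real tuner this
 * would generate, run and time the code (hypothetical callback). */
typedef double (*cost_fn)(const int *params, size_t n);

/* Optimizes params[0..n-1] in index order. On entry params[] holds the
 * reference values; parameter k is searched over lo[k]..hi[k] while the
 * others keep their current (already tuned or reference) values. */
void orthogonal_search(int *params, size_t n,
                       const int *lo, const int *hi, cost_fn cost) {
    for (size_t k = 0; k < n; k++) {
        assert(lo[k] <= hi[k]);
        int best_v = params[k];
        double best_c = cost(params, n);
        for (int v = lo[k]; v <= hi[k]; v++) {
            params[k] = v;
            double c = cost(params, n);
            if (c < best_c) { best_c = c; best_v = v; }
        }
        params[k] = best_v;  /* freeze before moving to the next one */
    }
}

/* Toy separable cost with its minimum at (3, 5); illustration only. */
double toy_cost(const int *p, size_t n) {
    (void)n;
    return (double)((p[0] - 3) * (p[0] - 3) + (p[1] - 5) * (p[1] - 5));
}
```

Because each parameter is visited once, the number of evaluations grows with the sum of the range sizes rather than their product. The price is that interactions between parameters can be missed, which is why the result is only approximate.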
Summary of Parameters
Estimating Machine Parameters

Machine parameters are measured:
● C1 - Size of L1 data cache
● NR - Number of floating point registers
● FMA - Availability of fused multiply-add
● LS - Amount of separation between
  dependent multiply and add instructions
Estimating parameters

Optimization sequence
● NB
● MU and NU
● KU
● LS
● IF, NF
● NCNB
Finding NB

● Generates values in the range:

  16 <= NB <= min(80, sqrt(C1))

  where C1 = size of L1 data cache
Finding MU and NU

● All combinations that satisfy:
   ○ MU * NU + MU + NU + LS <= NR

● NB was obtained earlier
Finding LS and IF, NF

LS
● Tries values in interval [1, 6]
● Boundary value fixed based on experiments
● LS must divide MU * NU * KU (instruction scheduling)

● IF: searched in the interval [2, MU + NU]
● NF: searched in the interval [1, MU + NU - IF]
Finding NCNB


● Searches in the range [NB : -4 : 4]

● Terminates search when performance drops
  by 20% of the best found solution
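The descending search with early termination can be sketched as below. The interpretation of the stopping rule (stop once a candidate is more than 20% below the best performance seen so far), the `perf_fn` callback and the `toy_perf` curve are assumptions made for this sketch.

```c
#include <assert.h>

/* Measured performance (e.g. MFLOPS) for a block size; higher is better.
 * Hypothetical callback standing in for a real benchmark run. */
typedef double (*perf_fn)(int block);

/* Tries NB, NB-4, ..., down to 4; stops early once a candidate is more
 * than 20% slower than the best seen so far. Returns the best block size. */
int search_ncnb(int nb, perf_fn perf) {
    assert(nb >= 4);
    int best = nb;
    double best_perf = perf(nb);
    for (int b = nb - 4; b >= 4; b -= 4) {
        double p = perf(b);
        if (p > best_perf) { best_perf = p; best = b; }
        else if (p < 0.8 * best_perf) break;
    }
    return best;
}

/* Toy performance curve peaking at block size 32 (illustration only). */
double toy_perf(int block) {
    int d = block - 32;
    return 1000.0 - (double)(d * d);
}
```

With the toy curve and NB = 48, the search climbs to 32, keeps going while performance stays within 20% of that peak, and stops at 16.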
Is Search Really Necessary?
Finding KU


● Constrained by instruction cache
● Values between 4 and NB/2 are tried

● Special values 1 and NB are also considered
Empirical Optimization
● Estimation of optimal values is the key
    ○ Compilers use analytical models
    ○ Library generators (e.g. ATLAS) use search
● Empirical Search:
    ○ Get a version of the program for each combination of
      parameters
    ○ Execute it on the target machine and measure
      performance
    ○ Select the one that performs best
    ○ Increased installation time!!
● How is the search space bounded?
    ○ The hardware parameters
Yotov et al.
● Realised that most optimizations used in
  the ATLAS code generator are already known to
  compilers.
  ○ cache tiling, register tiling, etc.
● Replaced the search module with a
  parameter estimator based on standard
  analytical models
● Code generator is not modified
  ○ Any performance change is solely due to
    differently chosen parameters
ATLAS Architecture
Analysis
● Results indicated that a simple and intuitive
  model is able to estimate near-optimal
  values for the parameters

● Focus on the ATLAS generated code

● Notations:
   ○ ATLAS CGw/S - Code Generator with Search
   ○ ATLAS Model - Modified ATLAS (no search)
   ○ ATLAS Unleashed - Hand-written code may be used
     along with predefined architecture defaults for the
     parameter values to produce the library.
Model-Based Optimization

● Requires more machine parameters than
  original ATLAS
  ○ No Search!!
● Empirical optimizers:
  ○ Approximate values of machine params are okay
  ○ Only used to bound the search space
● Model-based Optimizers:
  ○ Need accurate values
  ○ Developed a tool called X-RAY to accurately
    measure them
Hardware Parameters
● C1, B1: the capacity and the line size of the
  L1 data cache
● CI: the capacity of the L1 instruction cache
● LX: hardware latency of the floating-point
  multiply instruction
● |ALU_FP|: number of floating-point functional
  units
● NR: the number of floating-point registers
● FMA: the availability of a fused multiply-add
  instruction
Estimating NB
● Consider L1 cache - Fully Associative,
  Optimal replacement, Unit line size

● Working set of mini-MMM loop has 3 blocks
  of NB x NB
                3 * NB^2 <= C1
● In the innermost loop, an element of C, once
  computed, is not used again. Similarly, only 1
  column of B is needed in cache.
              NB^2 + NB + 1 <= C1
Refined Estimate of NB


● Correcting for non-unit line size

        ceil(NB^2/B1) + ceil(NB/B1) + 1 <= C1/B1
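The largest NB that satisfies the line-size-corrected inequality can be found with a short loop, as in the sketch below. Measuring C1 and B1 in matrix elements, and the names `ceil_div` and `estimate_nb`, are assumptions of this sketch.

```c
#include <assert.h>

/* Ceiling of a / b for positive integers. */
long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Largest NB with ceil(NB^2/B1) + ceil(NB/B1) + 1 <= C1/B1.
 * c1: L1 data cache capacity, b1: line size, both in matrix elements.
 * With b1 == 1 this reduces to NB^2 + NB + 1 <= C1. */
int estimate_nb(long c1, long b1) {
    assert(b1 > 0 && c1 >= b1);
    long lines = c1 / b1;
    int nb = 0;
    /* the left-hand side never decreases with NB, so stop at first failure */
    while (ceil_div((long)(nb + 1) * (nb + 1), b1) +
           ceil_div(nb + 1, b1) + 1 <= lines)
        nb++;
    return nb;
}
```

For example, a cache holding 4096 elements with unit line size gives NB = 63, since 63^2 + 63 + 1 = 4033 fits and 64^2 + 64 + 1 = 4161 does not.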
Further Refinement
● Estimated NB may not be a multiple of MU and
  NU
● This might cause fractional register tiles and
  extra clean up
● Avoid this by choosing a proper NB
● ATLAS needs NB to be an even integer
● So NB is trimmed to a multiple of MU, NU and 2
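One way to satisfy all three divisibility requirements at once is to round NB down to a multiple of lcm(MU, NU, 2), as sketched below. Rounding down (so that the block still fits in cache) is an assumption of this sketch, as are the helper names.

```c
#include <assert.h>

/* Greatest common divisor of two positive integers (Euclid). */
int gcd_int(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* Least common multiple of two positive integers. */
int lcm_int(int a, int b) { return a / gcd_int(a, b) * b; }

/* Rounds nb down to a multiple of lcm(mu, nu, 2). Returns 0 when nb is
 * smaller than that multiple; the caller must handle this case. */
int trim_nb(int nb, int mu, int nu) {
    assert(nb >= 0 && mu > 0 && nu > 0);
    int step = lcm_int(lcm_int(mu, nu), 2);
    return nb / step * step;
}
```

For instance, NB = 63 with MU = 4 and NU = 3 is trimmed to 60, the largest multiple of 12 that does not exceed 63.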
Estimating MU and NU

● View register file as a software cache
  ○ that is fully associative
  ○ unit line size
  ○ capacity = # registers, NR

● ATLAS performs outer products of (MU x 1)
  and (1 x NU) vectors for register tiling
Contd.
● ATLAS allocates MU elements for A, NU
  elements for B, and MU*NU elements for C
● Also need LS registers to store temp values
  of multiplications to make use of pipelining
● So we have:
      MU*NU + NU + MU + LS <= NR
● The LS calculation is shown later and NR is known,
  so the only unknowns are MU and NU.
Estimation Scheme
● Let MU = NU = u. Solve MU*NU + NU + MU + LS <= NR for u

● Let MU = max (u, 1). Solve for NU

● Let NU = max (NU, 1)

● <MU,NU> = <max (MU,NU), min (MU,NU)>
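The four steps can be written out directly, as in the sketch below. Taking u as the largest integer that satisfies the inequality is an assumption of this sketch, and `reg_tile` and `estimate_mu_nu` are hypothetical names.

```c
#include <assert.h>

typedef struct { int mu, nu; } reg_tile;

/* Estimates the register tile from NR (number of FP registers) and LS,
 * using the constraint MU*NU + NU + MU + LS <= NR. */
reg_tile estimate_mu_nu(int nr, int ls) {
    assert(nr > 0 && ls >= 0);
    /* Step 1: largest integer u with u*u + 2*u + ls <= nr. */
    int u = 0;
    while ((u + 1) * (u + 1) + 2 * (u + 1) + ls <= nr) u++;
    /* Step 2: fix MU and solve the constraint for NU. */
    int mu = u > 1 ? u : 1;
    int nu = (nr - ls - mu) / (mu + 1);
    /* Step 3: NU is at least 1. */
    if (nu < 1) nu = 1;
    /* Step 4: order the pair so that MU >= NU. */
    reg_tile t;
    t.mu = mu > nu ? mu : nu;
    t.nu = mu > nu ? nu : mu;
    return t;
}
```

With NR = 32 and LS = 4 this yields a 4 x 4 tile (16 + 4 + 4 + 4 = 28 registers); with NR = 8 and LS = 1 it yields 3 x 1, which uses all 8 registers.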
Estimating KU

● Not limited by the size of the register file
● Limited by the size of I-Cache
● Unroll the innermost loop within the size
  constraints of instruction cache
● Avoid micro-MMM code cleanup
   ○ Trim KU so that it divides NB
   ○ Usually, KU = NB on most machines
Estimating LS

● Skew factor that the ATLAS code generator
  uses to schedule dependent multiplication
  and addition operations for the CPU pipeline
● LS independent multiplications and LS-1
  independent additions between mul_i and the
  corresponding add_i should at least hide the
  latency of multiplication.
Estimating LS

● LX = latency of multiplication
● 2 * LS - 1 independent instructions hide this
  latency
● So, 2 * LS - 1 >= LX
● There may be multiple floating point units:
        (2 * LS - 1) / |ALU_FP| >= LX
● Solution for LS:
        LS = ceil((LX * |ALU_FP| + 1) / 2)
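Solving (2 * LS - 1) / |ALU_FP| >= LX for the smallest integer LS gives LS = ceil((LX * |ALU_FP| + 1) / 2). The sketch below computes that value with integer arithmetic; the function and parameter names are hypothetical.

```c
#include <assert.h>

/* Smallest LS with (2*LS - 1) / n_fp_units >= lx.
 * lx: multiply latency in cycles; n_fp_units: number of FP units.
 * For an integer x, ceil((x + 1) / 2) equals (x + 2) / 2 in integer math. */
int estimate_ls(int lx, int n_fp_units) {
    assert(lx > 0 && n_fp_units > 0);
    return (lx * n_fp_units + 2) / 2;
}
```

For a multiply latency of 4 cycles and one floating-point unit this gives LS = 3: five independent instructions cover the 4-cycle latency, whereas LS = 2 would supply only three.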
Summary
1. Estimate FMA
2. Estimate LS:
        LS = ceil((LX * |ALU_FP| + 1) / 2)
3. Estimate MU and NU
   MU*NU + NU + MU + LS <= NR
   Set MU = NU = u. Solve for u
   MU = max(1, u). Solve for NU
   NU = max(NU, 1). If MU < NU swap MU and NU
4. Estimate NB
        ceil(NB^2/B1) + ceil(NB/B1) + 1 <= C1/B1
     ○ Trim NB to be a multiple of 2, MU and NU
5. Estimate KU
     ○ Constrained by I-cache.
     ○ Make KU divide NB
6. Estimate NF, IF
     ○ IF = 2, NF = 2
Experimental Results
Conclusions
● On all machines (other than Itanium), the
  model-based codes performed almost as well as
  the global-search-based codes
● Finding parameters with models is much faster
  than searching
● Might be difficult to implement analytical
  methods in compilers
  ○ This model is focused on only 1 application

More Related Content

What's hot

Verilog lab manual (ECAD and VLSI Lab)
Verilog lab manual (ECAD and VLSI Lab)Verilog lab manual (ECAD and VLSI Lab)
Verilog lab manual (ECAD and VLSI Lab)Dr. Swaminathan Kathirvel
Β 
Discrete Logarithmic Problem- Basis of Elliptic Curve Cryptosystems
Discrete Logarithmic Problem- Basis of Elliptic Curve CryptosystemsDiscrete Logarithmic Problem- Basis of Elliptic Curve Cryptosystems
Discrete Logarithmic Problem- Basis of Elliptic Curve CryptosystemsNIT Sikkim
Β 
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul PillaiA look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul PillaiCysinfo Cyber Security Community
Β 
The Inner Secrets of Compilers
The Inner Secrets of CompilersThe Inner Secrets of Compilers
The Inner Secrets of CompilersIT MegaMeet
Β 
TMPA-2017: Static Checking of Array Objects in JavaScript
TMPA-2017: Static Checking of Array Objects in JavaScriptTMPA-2017: Static Checking of Array Objects in JavaScript
TMPA-2017: Static Checking of Array Objects in JavaScriptIosif Itkin
Β 
Lecture 3 insertion sort and complexity analysis
Lecture 3   insertion sort and complexity analysisLecture 3   insertion sort and complexity analysis
Lecture 3 insertion sort and complexity analysisjayavignesh86
Β 
VLSI experiments II
VLSI experiments IIVLSI experiments II
VLSI experiments IIGouthaman V
Β 
Q4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerQ4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerLinaro
Β 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsLinaro
Β 
Common Crypto Pitfalls
Common Crypto PitfallsCommon Crypto Pitfalls
Common Crypto PitfallsAmirali Sanatinia
Β 
(chapter 8) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 8) A Concise and Practical Introduction to Programming Algorithms in...(chapter 8) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 8) A Concise and Practical Introduction to Programming Algorithms in...Frank Nielsen
Β 
Liszt los alamos national laboratory Aug 2011
Liszt los alamos national laboratory Aug 2011Liszt los alamos national laboratory Aug 2011
Liszt los alamos national laboratory Aug 2011Ed Dodds
Β 
Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...
Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...
Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...Hsien-Hsin Sean Lee, Ph.D.
Β 
Unsupervised program synthesis
Unsupervised program synthesisUnsupervised program synthesis
Unsupervised program synthesisAmrith Krishna
Β 
Introduction to digital logic
Introduction to digital logicIntroduction to digital logic
Introduction to digital logicKamal Acharya
Β 
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...NECST Lab @ Politecnico di Milano
Β 

What's hot (19)

Verilog lab manual (ECAD and VLSI Lab)
Verilog lab manual (ECAD and VLSI Lab)Verilog lab manual (ECAD and VLSI Lab)
Verilog lab manual (ECAD and VLSI Lab)
Β 
Discrete Logarithmic Problem- Basis of Elliptic Curve Cryptosystems
Discrete Logarithmic Problem- Basis of Elliptic Curve CryptosystemsDiscrete Logarithmic Problem- Basis of Elliptic Curve Cryptosystems
Discrete Logarithmic Problem- Basis of Elliptic Curve Cryptosystems
Β 
Verilog tutorial
Verilog tutorialVerilog tutorial
Verilog tutorial
Β 
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul PillaiA look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
Β 
The Inner Secrets of Compilers
The Inner Secrets of CompilersThe Inner Secrets of Compilers
The Inner Secrets of Compilers
Β 
TMPA-2017: Static Checking of Array Objects in JavaScript
TMPA-2017: Static Checking of Array Objects in JavaScriptTMPA-2017: Static Checking of Array Objects in JavaScript
TMPA-2017: Static Checking of Array Objects in JavaScript
Β 
Lecture 3 insertion sort and complexity analysis
Lecture 3   insertion sort and complexity analysisLecture 3   insertion sort and complexity analysis
Lecture 3 insertion sort and complexity analysis
Β 
VLSI experiments II
VLSI experiments IIVLSI experiments II
VLSI experiments II
Β 
Cs2251 daa
Cs2251 daaCs2251 daa
Cs2251 daa
Β 
Q4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerQ4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-Vectorizer
Β 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON Intrinsics
Β 
C046051216
C046051216C046051216
C046051216
Β 
Common Crypto Pitfalls
Common Crypto PitfallsCommon Crypto Pitfalls
Common Crypto Pitfalls
Β 
(chapter 8) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 8) A Concise and Practical Introduction to Programming Algorithms in...(chapter 8) A Concise and Practical Introduction to Programming Algorithms in...
(chapter 8) A Concise and Practical Introduction to Programming Algorithms in...
Β 
Liszt los alamos national laboratory Aug 2011
Liszt los alamos national laboratory Aug 2011Liszt los alamos national laboratory Aug 2011
Liszt los alamos national laboratory Aug 2011
Β 
Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...
Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...
Lec12 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Ad...
Β 
Unsupervised program synthesis
Unsupervised program synthesisUnsupervised program synthesis
Unsupervised program synthesis
Β 
Introduction to digital logic
Introduction to digital logicIntroduction to digital logic
Introduction to digital logic
Β 
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
OXiGen: Automated FPGA design flow from C applications to dataflow kernels - ...
Β 

Viewers also liked

Auto-Mirror in Reverse engineering
Auto-Mirror in Reverse engineering  Auto-Mirror in Reverse engineering
Auto-Mirror in Reverse engineering PSH Mechanical Design
Β 
Projet 1200 ΧžΧ“Χ€Χ‘Χͺ Χ©Χ•ΧœΧ—Χ Χ™Χͺ ΧœΧ™Χ™Χ¦Χ•Χ¨ ΧͺΧ›Χ©Χ™Χ˜Χ™Χ
Projet 1200 ΧžΧ“Χ€Χ‘Χͺ Χ©Χ•ΧœΧ—Χ Χ™Χͺ ΧœΧ™Χ™Χ¦Χ•Χ¨ ΧͺΧ›Χ©Χ™Χ˜Χ™Χ Projet 1200 ΧžΧ“Χ€Χ‘Χͺ Χ©Χ•ΧœΧ—Χ Χ™Χͺ ΧœΧ™Χ™Χ¦Χ•Χ¨ ΧͺΧ›Χ©Χ™Χ˜Χ™Χ
Projet 1200 ΧžΧ“Χ€Χ‘Χͺ Χ©Χ•ΧœΧ—Χ Χ™Χͺ ΧœΧ™Χ™Χ¦Χ•Χ¨ ΧͺΧ›Χ©Χ™Χ˜Χ™Χ Caliber_Engineering
Β 
Casting (Part II)
Casting (Part II)Casting (Part II)
Casting (Part II)elm0011
Β 
Advanced Skills for Professionals (Administrative)
Advanced Skills for Professionals (Administrative)Advanced Skills for Professionals (Administrative)
Advanced Skills for Professionals (Administrative)Marius FAILLOT DEVARRE
Β 
Parts washer
Parts washerParts washer
Parts washercoleelijah
Β 
Automotive casting part in reverse engineering
Automotive casting part in reverse engineeringAutomotive casting part in reverse engineering
Automotive casting part in reverse engineeringPSH Mechanical Design
Β 
Content Management for Web Designers
Content Management for Web DesignersContent Management for Web Designers
Content Management for Web DesignersReuben Jackson
Β 
IT Advance for Post Foundation
IT Advance for Post FoundationIT Advance for Post Foundation
IT Advance for Post FoundationVTC
Β 
Investment casting details
Investment casting detailsInvestment casting details
Investment casting detailsAuto Design Online
Β 
Product categort shenzhen advanced titanium
Product categort shenzhen advanced titaniumProduct categort shenzhen advanced titanium
Product categort shenzhen advanced titaniumIvy gtg
Β 
Mawea Profile Presentation Slides 2011 Hidden
Mawea Profile Presentation Slides 2011 HiddenMawea Profile Presentation Slides 2011 Hidden
Mawea Profile Presentation Slides 2011 Hiddenevebby526
Β 
Toyota trunk Class A in Alias design
Toyota trunk  Class A in Alias designToyota trunk  Class A in Alias design
Toyota trunk Class A in Alias designPSH Mechanical Design
Β 
GFMI AEROSPACE DIVISION LINE CARD 4JUN10
GFMI AEROSPACE DIVISION LINE CARD 4JUN10GFMI AEROSPACE DIVISION LINE CARD 4JUN10
GFMI AEROSPACE DIVISION LINE CARD 4JUN10CATHERINEM1_
Β 
Five Steps to Optimize Casting and Eliminate Defects
Five Steps to Optimize Casting and Eliminate DefectsFive Steps to Optimize Casting and Eliminate Defects
Five Steps to Optimize Casting and Eliminate DefectsDesign World
Β 
Rinine Engineering
Rinine EngineeringRinine Engineering
Rinine EngineeringHussain M T
Β 

Viewers also liked (19)

Auto-Mirror in Reverse engineering
Auto-Mirror in Reverse engineering  Auto-Mirror in Reverse engineering
Auto-Mirror in Reverse engineering
Β 
Projet 1200 ΧžΧ“Χ€Χ‘Χͺ Χ©Χ•ΧœΧ—Χ Χ™Χͺ ΧœΧ™Χ™Χ¦Χ•Χ¨ ΧͺΧ›Χ©Χ™Χ˜Χ™Χ
Projet 1200 ΧžΧ“Χ€Χ‘Χͺ Χ©Χ•ΧœΧ—Χ Χ™Χͺ ΧœΧ™Χ™Χ¦Χ•Χ¨ ΧͺΧ›Χ©Χ™Χ˜Χ™Χ Projet 1200 ΧžΧ“Χ€Χ‘Χͺ Χ©Χ•ΧœΧ—Χ Χ™Χͺ ΧœΧ™Χ™Χ¦Χ•Χ¨ ΧͺΧ›Χ©Χ™Χ˜Χ™Χ
Projet 1200 ΧžΧ“Χ€Χ‘Χͺ Χ©Χ•ΧœΧ—Χ Χ™Χͺ ΧœΧ™Χ™Χ¦Χ•Χ¨ ΧͺΧ›Χ©Χ™Χ˜Χ™Χ
Β 
Catia Part07
Catia Part07Catia Part07
Catia Part07
Β 
Casting (Part II)
Casting (Part II)Casting (Part II)
Casting (Part II)
Β 
final.portfolio
final.portfoliofinal.portfolio
final.portfolio
Β 
Advanced Skills for Professionals (Administrative)
Advanced Skills for Professionals (Administrative)Advanced Skills for Professionals (Administrative)
Advanced Skills for Professionals (Administrative)
Β 
Parts washer
Parts washerParts washer
Parts washer
Β 
Automotive casting part in reverse engineering
Automotive casting part in reverse engineeringAutomotive casting part in reverse engineering
Automotive casting part in reverse engineering
Β 
Content Management for Web Designers
Content Management for Web DesignersContent Management for Web Designers
Content Management for Web Designers
Β 
IT Advance for Post Foundation
IT Advance for Post FoundationIT Advance for Post Foundation
IT Advance for Post Foundation
Β 
Investment casting details
Investment casting detailsInvestment casting details
Investment casting details
Β 
Product categort shenzhen advanced titanium
Product categort shenzhen advanced titaniumProduct categort shenzhen advanced titanium
Product categort shenzhen advanced titanium
Β 
Mawea Profile Presentation Slides 2011 Hidden
Mawea Profile Presentation Slides 2011 HiddenMawea Profile Presentation Slides 2011 Hidden
Mawea Profile Presentation Slides 2011 Hidden
Β 
Toyota trunk Class A in Alias design
Toyota trunk  Class A in Alias designToyota trunk  Class A in Alias design
Toyota trunk Class A in Alias design
Β 
GFMI AEROSPACE DIVISION LINE CARD 4JUN10
GFMI AEROSPACE DIVISION LINE CARD 4JUN10GFMI AEROSPACE DIVISION LINE CARD 4JUN10
GFMI AEROSPACE DIVISION LINE CARD 4JUN10
Β 
Five Steps to Optimize Casting and Eliminate Defects
Five Steps to Optimize Casting and Eliminate DefectsFive Steps to Optimize Casting and Eliminate Defects
Five Steps to Optimize Casting and Eliminate Defects
Β 
Mould in Reverse Engineering
Mould in Reverse Engineering Mould in Reverse Engineering
Mould in Reverse Engineering
Β 
Rinine Engineering
Rinine EngineeringRinine Engineering
Rinine Engineering
Β 
apostila de catia
apostila de catiaapostila de catia
apostila de catia
Β 

Auto Tuning Basics

  • 1. Auto Tuning Hemanth and Siddharth UT Austin
  • 3. What is Auto Tuning? ● Several Definitions ○ First result on Wikipedia: "Auto-Tune is an audio processor created by Antares Audio Technologies" ● A Definition ○ Autotuning is an automatic process for selecting one out of several possible solutions to a computational problem. ● Techniques used by: ○ Library generators, compilers, and runtime systems
  • 4. Possible Versions of a Solution ● The solutions may differ in the ○ algorithm (quicksort vs. selection sort) ○ implementation (loop unrolling) ● The versions may result from ○ transformations (unroll, tile, interchange) ● The versions could be generated by ○ the programmer manually (coding or directives) ○ the compiler automatically
  • 5. Motivation ■ Increasing diversity of computation supports ■ New influences on the execution of parallel applications ○ Hierarchical structure ○ Heterogeneity of the processors ■ Design efficient software that takes full advantage of such systems ■ Solving a target problem by using a single algorithm is not always efficient everywhere
  • 6. First Ideas ● Poly-Algorithms ○ (1969) John Rice (Purdue): "A polyalgorithm for the automatic solution of nonlinear equations" ● Profiling and feedback-assisted compilation ○ (1982) S. Graham et al.: gprof ○ (1991) P. Chang et al.: "Using profile information to assist classic code optimizations" ● Code generation ○ (1989) J. Johnson et al.: "A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures." ○ (1992) M. Covell et al.: "Computer-aided algorithm design and arrangement"
  • 7. Context: High Performance Libraries ● Linear Algebra ○ BLAS, LAPACK, ScaLAPACK ● Signal/Image Processing ○ Vector Signal Image Processing Library (VSIPL) ● Distributed/Parallel Systems ○ Message Passing Interface (MPI) ● Can we implement libraries: ○ automatically and portably ○ incorporating platform-specific features ○ matching performance of hand-tuned implementations leveraging compiler technology ○ using domain-specific knowledge
  • 8. AutoTuning ● Two-phase scheme for producing automatically tuned code ● Given: program, inputs, machine ● Step 1: Identify and generate a space of candidate implementations ● Step 2: Select the fastest one using empirical modeling and/or automated experiments
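Step 2 of the scheme can be sketched in C as a loop that measures every candidate and keeps the cheapest. This is only an illustration under stated assumptions: the names `measure_fn`, `select_fastest` and `table_cost` are hypothetical, and the cost callback stands in for a real timing run on the target machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: returns the measured cost (for example,
 * seconds) of running candidate number `id`. Smaller is better. */
typedef double (*measure_fn)(size_t id, void *ctx);

/* Measure each of the n candidates once and return the index of the
 * cheapest one. Ties are resolved in favour of the earlier candidate. */
size_t select_fastest(size_t n, measure_fn measure, void *ctx)
{
    assert(n > 0 && measure != NULL);
    size_t best = 0;
    double best_cost = measure(0, ctx);
    for (size_t id = 1; id < n; ++id) {
        double cost = measure(id, ctx);
        if (cost < best_cost) {
            best_cost = cost;
            best = id;
        }
    }
    return best;
}

/* Stand-in for a real timing run: looks the cost up in a table. */
double table_cost(size_t id, void *ctx)
{
    const double *costs = ctx;
    return costs[id];
}
```

In a real tuner the callback would build and time each generated variant; keeping measurement behind a callback makes the selection logic easy to check in isolation.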
  • 9. Why not let the compiler worry? ● General purpose ○ whereas library generators can focus on specific problems ● Engineering ○ Hard to modify a production compiler, and its effects are global ● Analysis ○ Limited access to relevant run-time information ○ Over-specified dependencies ○ Correctness criteria
  • 10. Compiler vs. AutoTuner
      |                       | Compiler                                         | AutoTuner                                                                                        |
      | Input                 | General-purpose source code                      | Specification including problem size, machine parameters and problem-specific transformations   |
      | Output                | Low-level machine code                           | Mostly high-level source (e.g., C code)                                                          |
      | Time to generate      | Short (unless feedback/profiling enabled)        | Usually long (depends on search space)                                                           |
      | Select implementation | Mostly static analysis (rarely feedback tuning)  | Automated empirical models and experiments                                                       |
  • 11. Some AutoTuning Projects ● Linear Algebra ○ Portable High-Performance ANSI C ■ PHiPAC ○ Automatically Tuned Linear Algebra Software ■ ATLAS ● Signal and Image Processing ○ Fastest Fourier Transform in the West ■ FFTW ○ SPIRAL
  • 14. PHiPAC (1997) ● Developing portable high-performance matrix-vector libraries in ANSI C ● Parametrized C code generator ○ produces code according to certain guidelines ● Auto-tune the code ● Exhaustive search over all parameters ● Claim: achieve over 90% of peak-perf and
  • 16. PHiPAC Approach Parameters are Architecture Specific
  • 17. Efficient Code Generation ● Studied several ANSI C compilers and determined that it is best to ● Rely on compilers for: ○ Register allocation ○ Instruction selection and scheduling ● Manually perform: ○ register/cache blocking ○ loop unrolling ○ software pipelining, etc.
  • 18. Local variables to explicitly remove false dependencies
      Before:
          a[i]   = b[i]   + c;
          a[i+1] = b[i+1] * d;
      After:
          float f1, f2;
          f1 = b[i];
          f2 = b[i+1];
          a[i]   = f1 + c;
          a[i+1] = f2 * d;
      The compiler cannot assume &a[i] != &b[i+1], so it is forced to store a[i] before loading b[i+1] (pointer aliasing).
  • 19. False Dependencies After Removing Dependency
  • 20. Exploit Multiple Registers ● Explicitly keep values in local variables ○ Reduces memory bandwidth ○ Otherwise the compiler would reload the fil values on every iteration (potential aliasing with res)
      Before:
          while (...) {
              *res++ = fil[0] * sig[0]
                     + fil[1] * sig[1];
              sig++;
          }
      After:
          float f0 = fil[0];
          float f1 = fil[1];
          while (...) {
              *res++ = f0 * sig[0]
                     + f1 * sig[1];
              sig++;
          }
  • 21. Minimize pointer updates by striding with constant offsets
      Before:
          f0 = *r8; r8 += 4;
          f1 = *r8; r8 += 4;
          f2 = *r8; r8 += 4;
      After:
          f0 = r8[0];
          f1 = r8[4];
          f2 = r8[8];
          r8 += 12;
      Compilers can fold a constant index into the (register + offset) addressing mode.
  • 22. Minimize branches, avoid magnitude compares ● Branches are costly ○ Unroll loops ○ Use do {} while (); loops to avoid loop-head branches ● Using == and != is cheaper
      Before:
          for (i = 0, a = start_ptr; i < ARRAY_SIZE; i++, a++) {
              ...
          }
      After:
          end_ptr = &a[ARRAY_SIZE];
          do {
              ...
              a++;
          } while (a != end_ptr);
  • 23. Explicitly unroll loops ● Instruction-level parallelism
      Before:
          while (...) {
              *res++ = fil[0] * sig[0]
                     + fil[1] * sig[1];
              sig++;
          }
      After:
          float f0, f1, s0, s1;
          f0 = fil[0]; f1 = fil[1];
          s0 = sig[0]; s1 = sig[1];
          *res++ = (f0*s0) + (f1*s1);
          do {
              sig++;
              s0 = sig[0];
              res[0] = f0*s1 + f1*s2;
              s1 = sig[1];
              res[1] = f0*s2 + f1*s0;
              res += 2;
          } while (...);
  • 24. Other Guidelines ● Balance instruction mix ○ Interleave 1 FPM, 1 FPA and 1-2 FP loads or stores ● Increase locality ○ Arrange code to have unit-stride memory accesses and try to reuse data in cache ● Convert integer multiplies to adds ○ * and / are slower than +
  • 25. Matrix Multiply Generators ● Produce C code following the PHiPAC guidelines ● C = α·op(A)·op(B) + β·C ○ MxK, KxN and MxN matrices ○ op(X) is either X or transpose(X) ● mm_cgen and mm_lgen ○ Core (register blocking) ○ Level (higher-level cache blocking) ● mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...
  • 26. Blocked MMM
          for (i = 0; i < M; i += M0)
            for (j = 0; j < N; j += N0)
              for (l = 0; l < K; l += K0)
                for (r = i; r < i + M0; r++)
                  for (s = j; s < j + N0; s++)
                    for (t = l; t < l + K0; t++)
                      c[r][s] += a[r][t] * b[t][s];
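The blocked loop nest can be checked against the plain triple loop with a small self-contained sketch. The sizes and block sizes below are illustrative assumptions, not values from the slides, and the block sizes are chosen to divide the matrix dimensions so that no fringe code is needed.

```c
#include <assert.h>

enum { M = 8, N = 8, K = 8 };     /* matrix sizes (illustrative) */
enum { M0 = 2, N0 = 4, K0 = 2 };  /* block sizes; must divide M, N, K here */

/* Reference triple loop: c += a * b. */
void mmm_naive(double c[M][N], double a[M][K], double b[K][N])
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int l = 0; l < K; l++)
                c[i][j] += a[i][l] * b[l][j];
}

/* Blocked version: the three outer loops walk over M0 x N0 x K0 blocks,
 * the three inner loops multiply within one block. The inner loops
 * start at i, j and l respectively. */
void mmm_blocked(double c[M][N], double a[M][K], double b[K][N])
{
    for (int i = 0; i < M; i += M0)
        for (int j = 0; j < N; j += N0)
            for (int l = 0; l < K; l += K0)
                for (int r = i; r < i + M0; r++)
                    for (int s = j; s < j + N0; s++)
                        for (int t = l; t < l + K0; t++)
                            c[r][s] += a[r][t] * b[t][s];
}
```

Both functions accumulate into c, so the caller is expected to zero it first. With small integer-valued inputs the two results match exactly, because every partial sum is representable without rounding.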
  • 27. Code Generator $ mm_gen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ] (Diagram: M0, K0, N0 and M1, K1, N1 → mm_gen → optimized C)
  • 28. Usage and Options Usage: mm_cgen [OPTIONS] ● Semantics options: ○ -op[ABC] [N|T] : [ABC] matrix op. Normal|Transpose ○ -no_fringes : don't generate M, K, or N reg block fringes ● Optimization options: ○ -l0/l1 M0/M1 K0/K1 N0/N1 : register (L0)/cache (L1) blocking parameters ○ -sp [1|2lm|2ma|3] : software pipelining options
  • 29. Contd. ● Precision options: ○ prec/sprec/aprec/dprec [single|double|ldouble] : Precision (source, accumulator, destination) ● Misc. options: ○ file name : Write to file 'name' ○ routine_name name : Name of routines
  • 30. Optimal Block Sizes Use the search.pl script
  • 31. Optimal Block Sizes ● Naive brute-force search ● For register parameters ○ NR/4 <= M0*N0 <= NR; NR is the maximum number of registers ○ 1 <= K0 <= K0max; K0max = 20 (tunable) ● Benchmark all squares M = K = N = D ○ D runs over 2^x, 3^x, 10^x and all primes ○ 3D² fits in L1 cache
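To get a feel for the size of the register-parameter search space, the bounds NR/4 <= M0*N0 <= NR and 1 <= K0 <= K0max can be enumerated directly. The function name `count_register_candidates` is hypothetical, and the lower bound is applied as 4*M0*N0 >= NR (no integer rounding of NR/4), which is an assumption about how the bound is meant.

```c
#include <assert.h>

/* Count (M0, N0, K0) triples that satisfy
 *   NR/4 <= M0*N0 <= NR   and   1 <= K0 <= k0_max.
 * A real search would generate and benchmark one kernel per triple. */
int count_register_candidates(int nr, int k0_max)
{
    assert(nr >= 1 && k0_max >= 1);
    int pairs = 0;
    for (int m0 = 1; m0 <= nr; m0++)
        for (int n0 = 1; m0 * n0 <= nr; n0++)
            if (4 * m0 * n0 >= nr)   /* M0*N0 >= NR/4 without rounding */
                pairs++;
    /* K0 is independent of M0 and N0, so it multiplies the count. */
    return pairs * k0_max;
}
```

For NR = 32 and K0max = 20 this gives 103 (M0, N0) pairs and 2060 triples, each of which would have to be generated, compiled and timed.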
  • 32. Contd. ● For L1 blocking parameters ● The square case (D x D) ● Search the neighborhood centered at 3D² = L1 ● Set the values of M1, K1, N1 to ϕD/M0 ○ where ϕ ∈ { 0.25, 0.5, 1.0, 1.5, 2.0 } ○ D = sqrt(L1/3) ○ 125 combinations
  • 33. Naive Brute Force? ● Search takes too long ● Generates very lengthy code ● Very slow under full optimization ● Need a better search strategy
  • 34. Smarter Search ● Majority of the computation is performed in register-blocked code ● Benchmark only in multiples of the register block size ● Search space of M0, N0, K0 is not reduced ○ Prioritize the neighborhood of the best ones found ○ {M0-1, M0, M0+1} etc. ● Terminate after reaching acceptable efficiency
  • 36. Single Precision MMM (100 MHz SGI Indigo R4k) Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
  • 37. Double Precision MMM (HP 712/80i) Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
  • 38. There is no Golden Hammer ● Strengths: ○ Automatic search for optimal parameters ○ Produces portable ANSI C code ● Weaknesses: ○ Focus on uniprocessor machines ○ No support for vector-based CPUs ○ No control over instruction scheduling
  • 39. Further Information ● http://www.icsi.berkeley.edu/~bilmes/phipac/ ● http://www.inf.ethz.ch/personal/markusp/teaching/252-2600-ETH-fall11/slides/01-Dietiker.pdf
  • 41. ATLAS ● Automatically Tuned Linear Algebra Software ● Generates optimized BLAS library ● C and Fortran77 ● Provides implementation for BLAS levels 1,2 and 3. ● We will focus on Matrix-Matrix-Multiply (MMM)
  • 42. Naive MMM ● C = A * B using 3 for-loops ● Dimensions of A, B and C are NxK, KxM and NxM respectively.
  • 43. Optimization for L1 cache ● Matrix divided into NB x NB blocks ● Each block is called mini-MMM ● Optimization parameter NB is chosen such that each mini-MMM fits in cache
  • 45. Optimization for register file ● Mini-MMMs are further decomposed into micro-MMMs ● Each micro-MMM multiplies an MU x 1 sub-matrix of A by a 1 x NU sub-matrix of B and accumulates the result into an MU x NU sub-matrix of C ● Here MU and NU are the optimization parameters ● Necessary condition: MU + NU + MU*NU <= NR ● where NR = number of floating-point registers
  • 47. Code
  • 48. Pipeline scheduling ● The 2 innermost loops (i'' and j'') are unrolled to create interleaved multiply and add statements ● Exploits instruction-level parallelism ● If there is a fused multiply-add, then these 2 operations can be executed together ● The optimization parameter FMA tells the code generator whether this facility is available
  • 49. Pipeline scheduling ● MU + NU loads and stores ● MU * NU additions and multiplications ● Latency of operations might stall the pipeline ● Solution : Interleave the operations such that dependent operations are separated by a particular distance (What would that be?) ● This is governed by another optimization parameter - LS
  • 50. Pipeline scheduling ● Inject MU + NU loads of A and B ● Loads divided into: ○ Initial fetch (IF) ○ Blocks of other load operations (NF)
  • 51. Loop Unrolling ● KU is the optimization parameter that controls loop unrolling ● Constrained by the capacity of the instruction cache ● Should be neither too small (wastes cache) nor too big (overflows the instruction cache)
  • 52. Other Optimizations ● Copying tiles of A is done in the beginning of outermost loop. These tiles are fully reused in each iteration of j loop ● Copying jth vertical panel of B -- done before beginning of i loop. ● Copying tile (i,j) of C just before the "k" loop starts
  • 53. Other optimizations ● Choosing loop order: ○ if N < M then JIK loop order (so that A completely fits into L2 cache) ○ else if M < N then IJK loop order
  • 54. Other optimizations ● Copying A, B, C for smaller matrices might be an overhead ● Non-copying versions are generated with optimization parameter NCNB ● This version is used if: ○ M * N * K is less than a threshold ○ at least 1 dimension of 1 of the matrices is smaller than 3 * NCNB
  • 55. Estimating parameters ● Orthogonal search is used for optimizing parameters ● It is a heuristic and finds approximate solutions ● No guarantee of an optimal solution ● It needs these details: ○ In what order are the parameters optimized? ○ Possible solution range for each parameter ○ Reference value used for parameter k during optimization of parameters 1 to k-1
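A minimal C sketch of an orthogonal (one-parameter-at-a-time) search, assuming a cost callback where lower is better. The types `param` and `cost_fn`, the function `orthogonal_search` and the toy cost are all hypothetical names for illustration; this is not ATLAS's actual search code.

```c
#include <assert.h>
#include <stddef.h>

/* One tunable parameter: candidate values lo, lo+step, ..., up to hi.
 * `value` starts as the reference value and ends as the chosen one. */
typedef struct {
    int lo, hi, step;
    int value;
} param;

/* Hypothetical cost callback: lower is better (for example, run time). */
typedef double (*cost_fn)(const param *p, size_t n);

/* Optimize parameters in array order. While p[k] is being tuned,
 * earlier parameters keep their tuned values and later ones keep
 * their reference values. */
void orthogonal_search(param *p, size_t n, cost_fn cost)
{
    assert(p != NULL && cost != NULL);
    for (size_t k = 0; k < n; k++) {
        assert(p[k].step > 0 && p[k].lo <= p[k].hi);
        int best = p[k].value;
        double best_cost = cost(p, n);
        for (int v = p[k].lo; v <= p[k].hi; v += p[k].step) {
            p[k].value = v;
            double c = cost(p, n);
            if (c < best_cost) {
                best_cost = c;
                best = v;
            }
        }
        p[k].value = best;
    }
}

/* Toy cost with its minimum at (40, 4); stands in for a benchmark run. */
double toy_cost(const param *p, size_t n)
{
    (void)n;
    double x = p[0].value - 40.0;
    double y = p[1].value - 4.0;
    return x * x + y * y;
}
```

The number of evaluations grows with the sum of the range sizes rather than their product, which is why the result is only approximate: when parameters interact strongly, the order and the reference values decide which point the search reaches.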
  • 57. Estimating Machine Parameters Machine parameters are measured: ● C1 - Size of L1 data cache ● NR - Number of floating point registers ● FMA - Availability of fused multiply-add ● LS - Amount of separation between dependent multiply and add instructions
  • 58. Estimating parameters Optimization sequence ● NB ● MU and NU ● KU ● LS ● IF, NF ● NCNB
  • 59. Finding NB ● Generates values in the range 16 <= NB <= min(80, √C1), where C1 = size of the L1 data cache
  • 60. Finding MU and NU ● All combinations that satisfy: ○ MU * NU + MU + NU + LS <= NR ● NB was obtained earlier
  • 61. Finding LS and IF, NF ● LS: ○ Tries values in the interval [1, 6] ○ Boundary value fixed based on experiments ○ Divides MU * NU * KU (instruction scheduling) ● IF: searches in the interval [2, MU + NU] ● NF: searches in the interval [1, MU + NU - IF]
  • 62. Finding NCNB ● Searches in the range [NB : -4 : 4] ● Terminates search when performance drops by 20% of the best found solution
  • 63. Is Search Really Necessary?
  • 64. Finding KU ● Constrained by instruction cache ● Values between 4 and NB/2 are tried ● Special values 1 and NB are also considered
  • 65. Empirical Optimization ● Estimation of optimal values is the key ○ Compilers use analytical models ○ Library generators (e.g., ATLAS) use search ● Empirical search: ○ Get a version of the program for each combination of parameters ○ Execute it on the target machine and measure performance ○ Select the one that performs best ○ Increased installation time!! ● How is the search space bounded? ○ The hardware parameters
  • 66. Yotov et al. ● Realised that most optimizations used in the ATLAS code generator are already known to compilers ○ cache tiling, register tiling, etc. ● Replaced the search module with a parameter estimator based on standard analytical models ● Code generator is not modified ○ Any performance change is solely due to differently chosen parameters
  • 68. Analysis ● Results indicated that a simple and intuitive model is able to estimate near-optimal values for the parameters ● Focus on the ATLAS-generated code ● Notations: ○ ATLAS CGw/S - Code Generator with Search ○ ATLAS Model - Modified ATLAS (no search) ○ ATLAS Unleashed - Hand-written code may be used along with predefined architecture defaults for the parameter values to produce the library.
  • 69. Model-Based Optimization ● Requires more machine parameters than original ATLAS ○ No search!! ● Empirical optimizers: ○ Approximate values of machine parameters are okay ○ Only used to bound the search space ● Model-based optimizers: ○ Need accurate values ○ Developed a tool called X-RAY to accurately measure them
  • 70. Hardware Parameters ● C1,B1: the capacity and the line size of the L1 data cache ● CI : The capacity of the L1 instruction cache ● Lx: hardware latency of the floating-point multiply instruction ● |ALUFP |: number of floating-point functional units ● NR: the number of floating-point registers ● FMA: the availability of a fused multiply-add instruction
  • 71. Estimating NB ● Consider L1 cache - Fully Associative, Optimal replacement, Unit line size ● Working set of mini-MMM loop has 3 blocks of NB x NB 3 NB2 <= C1 ● In the inner most loop (C), element once computed is not used again. Similarly only 1 column of B is needed in cache. NB2 + NB + 1 <= C1
  • 72. Refined Estimate of NB ● Correcting for non-unit line size: $\lceil N_B^2 / B_1 \rceil + \lceil N_B / B_1 \rceil + 1 \le C_1 / B_1$
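A minimal sketch of how the refined inequality could be solved for the largest admissible NB. It assumes C1 and B1 are both expressed in matrix elements; the function name `estimate_nb` is introduced here for illustration only.

```python
def estimate_nb(c1, b1):
    """Largest NB with ceil(NB^2/B1) + ceil(NB/B1) + 1 <= C1/B1.

    c1: L1 data cache capacity in elements (assumption: elements, not bytes)
    b1: L1 line size in elements
    """
    def ceil_div(a, b):
        return -(-a // b)

    # The left-hand side is an integer, so comparing against floor(C1/B1)
    # is equivalent to comparing against C1/B1.
    lines = c1 // b1
    nb = 0
    # The left-hand side never decreases as NB grows, so a linear scan suffices.
    while ceil_div((nb + 1) ** 2, b1) + ceil_div(nb + 1, b1) + 1 <= lines:
        nb += 1
    return nb
```

With `b1 = 1` the function reduces to the unit-line-size bound NB^2 + NB + 1 <= C1.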
  • 73. Further Refinement ● Estimated NB may not be a multiple of MU and NU ● This might cause fractional register tiles and extra cleanup ● Avoid this by choosing a proper NB ● ATLAS needs NB to be an even integer ● So NB is trimmed to a multiple of 2, MU and NU
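One way to do the trimming is sketched below. The slide's own formula did not survive in this transcript, so rounding NB down to the nearest multiple of lcm(MU, NU, 2) is an assumption made here; rounding down is chosen because a smaller block still satisfies the cache-capacity bound. `trim_nb` is a hypothetical helper name.

```python
import math
from functools import reduce


def trim_nb(nb, mu, nu):
    """Round NB down to a multiple of MU, NU and 2 (assumed trimming rule)."""
    def lcm(a, b):
        return a * b // math.gcd(a, b)

    step = reduce(lcm, (mu, nu, 2))
    return (nb // step) * step
```

For example, with NB = 63, MU = 4 and NU = 3 the step is 12 and the trimmed block size is 60.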
  • 74. Estimating MU and NU ● View the register file as a software cache ○ Fully associative ○ Unit line size ○ Capacity = number of registers, NR ● ATLAS performs outer products of (MU x 1) and (1 x NU) vectors for register tiling
  • 75. Contd. ● ATLAS allocates MU elements for A, NU elements for B, and MU*NU elements for C ● Also need LS registers to store temporary values of multiplications to make use of pipelining ● So we have: (MU x NU) + NU + MU + LS <= NR ● The LS calculation will be shown later and NR is known, so the only unknowns are MU and NU
  • 76. Estimation Scheme ● Let MU = NU = u. Solve the previous inequality for u ● Let MU = max(u, 1). Solve for NU ● Let NU = max(NU, 1) ● <MU, NU> = <max(MU, NU), min(MU, NU)>
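The scheme can be sketched as follows, assuming "solve" means taking the largest integer that satisfies MU*NU + NU + MU + LS <= NR. With MU = NU = u the inequality becomes (u + 1)^2 <= NR - LS + 1. The function name `estimate_mu_nu` is introduced here for illustration.

```python
import math


def estimate_mu_nu(nr, ls):
    """Follow the four steps of the estimation scheme with integer solutions."""
    # Step 1: MU = NU = u  =>  (u + 1)^2 <= NR - LS + 1
    u = math.isqrt(max(nr - ls + 1, 0)) - 1
    # Step 2: fix MU, then solve MU*NU + NU + MU + LS <= NR for NU
    mu = max(u, 1)
    nu = (nr - ls - mu) // (mu + 1)
    # Step 3: never go below a 1 x 1 register tile
    nu = max(nu, 1)
    # Step 4: order the pair so that MU >= NU
    return max(mu, nu), min(mu, nu)
```

With NR = 8 and LS = 1, step 1 gives u = 1, step 2 gives NU = 3, and the final swap yields the tile <3, 1>, which uses exactly 3 + 1 + 3 + 1 = 8 registers.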
  • 77. Estimating KU ● Not limited by the size of the register file ● Limited by the size of the I-cache ● Unroll the innermost loop within the size constraints of the instruction cache ● Avoid micro-MMM code cleanup ○ Trim KU so that it divides NB ○ Usually, KU = NB on most machines
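A small sketch of the trimming step only. The slides do not say how the instruction-cache limit is turned into an unroll bound, so `max_unroll` is a hypothetical input standing in for that limit; the function simply returns the largest divisor of NB that does not exceed it.

```python
def trim_ku(nb, max_unroll):
    """Largest KU <= max_unroll that divides NB (NB >= 1 assumed).

    max_unroll is a hypothetical upper bound derived from the I-cache size.
    """
    ku = max(1, min(nb, max_unroll))
    while nb % ku != 0:
        ku -= 1
    return ku
```

When the bound is at least NB the result is KU = NB, matching the usual case noted on the slide.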
  • 78. Estimating LS ● Skew factor that the ATLAS code generator uses to schedule dependent multiplication and addition operations for the CPU pipeline ● The LS independent multiplications and LS-1 independent additions between mul_i and the corresponding add_i should at least hide the latency of the multiplication
  • 79. Estimating LS ● LX = latency of multiplication ● 2 * LS - 1 independent instructions hide this latency ● So, $2 L_S - 1 \ge L_X$ ● There may be multiple floating-point units: $(2 L_S - 1) / |ALU_{FP}| \ge L_X$ ● Solution for LS: $L_S = \lceil (L_X \cdot |ALU_{FP}| + 1) / 2 \rceil$
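Rearranging (2 * LS - 1) / |ALUFP| >= LX gives LS >= (LX * |ALUFP| + 1) / 2, so the smallest integer skew is the ceiling of that value. A minimal sketch, with `estimate_ls` as an illustrative name:

```python
def estimate_ls(lx, alu_fp):
    """Smallest integer LS with (2*LS - 1) / alu_fp >= lx.

    lx: latency of the floating-point multiply, in cycles
    alu_fp: number of floating-point functional units
    """
    n = lx * alu_fp + 1
    return (n + 1) // 2  # integer ceiling of n / 2
```

For a multiply latency of 4 cycles, one functional unit gives LS = 3 and two units give LS = 5.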
  • 80. Summary 1. Estimate FMA 2. Estimate LS: $L_S = \lceil (L_X \cdot |ALU_{FP}| + 1) / 2 \rceil$ 3. Estimate MU and NU: $M_U N_U + N_U + M_U + L_S \le N_R$ ○ Set MU = NU = u. Solve for u ○ MU = max(1, u). Solve for NU ○ NU = max(NU, 1) ○ If MU < NU, swap MU and NU 4. Estimate NB: $\lceil N_B^2 / B_1 \rceil + \lceil N_B / B_1 \rceil + 1 \le C_1 / B_1$ ○ Trim NB to be a multiple of 2, MU and NU 5. Estimate KU ○ Constrained by the I-cache ○ Make KU divide NB 6. Estimate NF, IF ○ IF = 2, NF = 2
  • 82. Conclusions ● On all machines other than Itanium, the model-based code performed almost as well as code tuned by global search ● Finding parameters with models is much faster than searching ● Might be difficult to implement analytical methods in compilers ○ This model is focused on only 1 application