Auto Tuning
Hemanth Kumar Mantri and Siddharth Subramanian
Graduate Teaching Assistants
UT Austin
Basics
What is Auto Tuning?
● Several Definitions
   ○ First result on Wikipedia: "Auto-Tune is an audio
     processor created by Antares Audio Technologies"


● A Definition
  ○ Autotuning is an automatic process for selecting one
      out of several possible solutions to a computational
      problem.


● Techniques used by:
   ○ Library generators, Compilers and Runtime systems
Possible Versions of a Solution
● The solutions may differ in the
  ○ algorithm (quicksort vs selection sort)
  ○ implementation (e.g., loop unrolling)

● The versions may result from
  ○ transformations (unroll, tile, interchange)

● The versions could be generated by
  ○ programmer manually (coding or directives)
   ○ compiler automatically
Motivation
■ Increasing diversity of computing platforms
■ New influences on the execution of parallel
  applications
  ○ Hierarchical structure
  ○ Heterogeneity of the processors
■ Design efficient software that takes full
  advantage of such systems
■ Solving a target problem by using a single
  algorithm is not always efficient everywhere
First Ideas
● Poly-Algorithms
    ○   (1969) John Rice (Purdue): "A polyalgorithm for the automatic
        solution of nonlinear equations"


●   Profiling and feedback assisted compilation
    ○   (1982) S. Graham et al.: gprof
    ○   (1991) P. Chang et al.: "Using profile information to assist classic
        code optimizations"


●   Code generation
    ○   (1989) J. Johnson et al.: “A methodology for designing, modifying,
        and implementing Fourier Transform algorithms on various
        architectures.”
    ○   (1992) M. Covell et al.: “Computer-aided algorithm design and
        arrangement”
Context: High Performance Libraries
● Linear Algebra
   ○ BLAS, LAPACK, ScaLAPACK
● Signal/Image Processing
  ○ Vector Signal Image Processing Library (VSIPL)
● Distributed/Parallel Systems
  ○ Message Passing Interface (MPI)
● Can we implement libraries:
  ○ Automatically and Portably
  ○ Incorporating platform-specific features
  ○ matching the performance of hand-tuned
     implementations while leveraging compiler technology
   ○ using domain-specific knowledge
AutoTuning
● A two-phase scheme for producing automatically
  tuned code

● Given: a program, its inputs, and a target machine

● Step 1: Identify and generate a space of
  candidate implementations

● Step 2: Select the fastest one using empirical
  modeling and/or automated experiments
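As a toy sketch of this two-phase scheme (all names and candidates below are illustrative, not from any real autotuner): generate a small space of candidate implementations, time each one on the target machine, and keep the fastest.

```c
#include <time.h>

/* Step 1: a tiny candidate space -- two dot-product implementations
   that differ only in unroll factor. */
double dot_u1(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

double dot_u4(const double *a, const double *b, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {       /* unrolled by 4 */
        s0 += a[i]   * b[i];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++) s += a[i] * b[i];   /* fringe loop */
    return s;
}

typedef double (*dot_fn)(const double *, const double *, int);

/* Step 2: empirically select the fastest candidate on this machine. */
int pick_fastest(dot_fn cand[], int ncand, const double *a,
                 const double *b, int n) {
    int best = 0;
    double best_t = 1e30;
    for (int c = 0; c < ncand; c++) {
        volatile double sink = 0;          /* keep the work alive */
        clock_t t0 = clock();
        for (int rep = 0; rep < 1000; rep++)
            sink += cand[c](a, b, n);
        double t = (double)(clock() - t0);
        if (t < best_t) { best_t = t; best = c; }
    }
    return best;
}
```

Real autotuners such as PHiPAC and ATLAS do the same thing at much larger scale: the candidate space is generated by a parameterized code generator, and the timing runs drive a search over its parameters.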
Why not let the compiler worry?
● General Purpose
  ○ whereas Library generators can focus on specific
    problems


● Engineering
  ○ Hard to modify a production compiler and its effects
    are global


● Analysis
  ○ Limited access to relevant run-time information
  ○ Over-specified dependencies
  ○ Correctness Criteria
Compiler vs AutoTuner

                 Compiler                 AutoTuner
Input            General-purpose          Specification including
                 source code              problem size, machine
                                          parameters and
                                          problem-specific
                                          transformations

Output           Low-level machine        Mostly high-level
                 code                     source (e.g., C code)

Time to          Short (unless            Usually long (depends
generate         feedback/profiling       on search space)
                 is enabled)

Implementation   Mostly static analysis   Automated empirical
selection        (rarely feedback         models and
                 tuning)                  experiments
Some AutoTuning Projects

● Linear Algebra
  ○ Portable High-Performance ANSI C
     ■ PHiPAC
  ○ Automatically Tuned Linear Algebra Software
    ■ ATLAS


● Signal and Image Processing
  ○ Fast Fourier Transformations of the West
    ■ FFTW
  ○ SPIRAL
PHiPAC
Traditional Approach: Hand-Tuned Libraries
PHiPAC (1997)
● Developing Portable High-Performance
  matrix vector libraries in ANSI C
● Parametrized C-code Generator
  ○ produces code according to certain
     guidelines
● Auto Tune the code
● Exhaustive search over all parameters
● Claim: achieves over 90% of peak performance,
  sometimes exceeding vendor-supplied libraries
PHiPAC Approach
Generate Optimized C Code
PHiPAC Approach
Parameters are Architecture Specific
Efficient Code Generation
● Studied several ANSI C Compilers and
  determined that it is best to

● Rely on Compilers for:
  ○ Register allocation
  ○ Instruction selection and Scheduling


● Manually perform:
  ○ register/cache blocking
  ○ loop unrolling
  ○ software pipe-lining, etc
Local Variables to explicitly remove false
dependencies

● Before:
    a[i]   = b[i] + c;
    a[i+1] = b[i+1] * d;

● After:
    float f1, f2;
    f1 = b[i]; f2 = b[i+1];
    a[i]   = f1 + c;
    a[i+1] = f2 * d;

The compiler may not assume &a[i] != &b[i+1],
and so is forced to store a[i] before loading
b[i+1] (pointer aliasing).
False Dependencies
(figure: generated code before and after removing the dependency)
Exploit Multiple Registers

● Explicitly keep values in local variables
  ○ Reduces memory bandwidth
  ○ compiler would otherwise reload fil values on
    every iteration (potential aliasing with res)

● Before:
    while (...) {
      *res++ = fil[0] * sig[0]
             + fil[1] * sig[1];
      sig++;
    }

● After:
    float f0 = fil[0];
    float f1 = fil[1];
    while (...) {
      *res++ = f0 * sig[0]
             + f1 * sig[1];
      sig++;
    }
Minimize pointer updates by striding with
constant offsets

● Before:
    f0 = *r8; r8 += 4;
    f1 = *r8; r8 += 4;
    f2 = *r8; r8 += 4;

● After:
    f0 = r8[0];
    f1 = r8[4];
    f2 = r8[8];
    r8 += 12;

Compilers can fold the constant index into a
(register + offset) addressing mode.
Minimize branches, avoid magnitude
compares

● Branches are costly
  ○ Unroll loops
  ○ Use do {} while(); loops to avoid loop
    head branches
● Using == and != is cheaper than < or <=

● Before:
    for (i = 0, a = start_ptr;
         i < ARRAY_SIZE;
         i++, a++) {
      ...
    }

● After:
    end_ptr = &a[ARRAY_SIZE];
    do {
      ...
      a++;
    } while (a != end_ptr);
Explicitly unroll loops

● Exposes instruction-level parallelism

● Before:
    while (...) {
      *res++ = fil[0] * sig[0]
             + fil[1] * sig[1];
      sig++;
    }

● After (unrolled by 2):
    float f0, f1, s0, s1, s2;
    f0 = fil[0]; f1 = fil[1];
    s0 = sig[0]; s1 = sig[1];
    do {
      s2 = sig[2];
      res[0] = f0*s0 + f1*s1;
      res[1] = f0*s1 + f1*s2;
      res += 2; sig += 2;
      s0 = sig[0]; s1 = sig[1];
    } while (...);
Other Guidelines
● Balance the instruction mix
  ○ Interleave 1 FP multiply, 1 FP add, and 1-2 FP
     loads or stores
● Increase Locality
  ○ Arrange code to have unit-stride memory
     accesses and try to reuse data in cache
● Convert Integer multiplies to adds
  ○ * and / are slower than +
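The last guideline is classic strength reduction; a minimal sketch (function names are illustrative), replacing a per-iteration index multiply by a running offset:

```c
/* Before: an integer multiply on every iteration to index row i. */
int sum_first_col_mul(const int *m, int nrows, int ncols) {
    int s = 0;
    for (int i = 0; i < nrows; i++)
        s += m[i * ncols];        /* multiply each time around */
    return s;
}

/* After: strength-reduced -- the multiply becomes a running add. */
int sum_first_col_add(const int *m, int nrows, int ncols) {
    int s = 0, off = 0;
    for (int i = 0; i < nrows; i++, off += ncols)
        s += m[off];              /* only additions in the loop */
    return s;
}
```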
Matrix Multiply Generators
● Produce C code with PHiPAC guidelines
● C = αop(A)op(B) + βC
  ○ MxK, KxN and MxN matrices
  ○ op(X) is either X or transpose(X)

● mm_cgen and mm_lgen
    ○ Core (register blocking)
    ○ Level (higher level cache blocking)


●   mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...
Blocked MMM
for (i=0; i<M; i+=M0)
 for (j=0; j<N; j+=N0)
  for (l=0; l<K; l+=K0)

   for (r=i; r<i+M0; r++)
    for (s=j; s<j+N0; s++)
     for (t=l; t<l+K0; t++)
      c[r][s] += a[r][t] * b[t][s];
Code Generator
 $ mm_gen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ]




  (M0 K0 N0, M1 K1 N1) → mm_gen → Optimized C
Usage and Options
Usage: mm_cgen [OPTIONS]
● Semantics options:
    ○ -op[ABC] [N|T] : [ABC] matrix op. Normal|Transpose
    ○ -no_fringes : don’t generate an M,K, or N reg block
      fringes


●   Optimization options:
    ○ -l0/l1 M0/M1 K0/K1 N0/N1 : register (L0)/Cache (L1)
      blocking parameters
    ○ -sp [1|2lm|2ma|3] : software pipelining options
Contd.
● Precision options:
   ○ prec/sprec/aprec/dprec [single|double|ldouble] :
     Precision (source, accumulator, destination)


● Misc. options:
  ○ file name : Write to file ’name’
   ○ routine_name name : Name of routines
Optimal Block Sizes
Use the search.pl script
Optimal Block Sizes
● Naive brute force search

● For Register Parameters
   ○ NR/4 <= M0·N0 <= NR ; NR is the max number of registers
   ○ 1 <= K0 <= K0max ; K0max = 20 (tunable)


● Benchmark all squares M = K = N = D
  ○ D runs over 2x, 3x, 10x and all primes
  ○ such that 3D² fits in the L1 cache
Contd.
● For the L1 blocking parameters
● The square case (D x D)
● Search the neighborhood centered at 3D² = L1
● Set the values of M1, K1, N1 to ϕ·D/M0
   ○ where ϕ ∈ { 0.25, 0.5, 1.0, 1.5, 2.0 }
   ○ D = sqrt(L1/3)
   ○ 125 combinations
Naive Brute Force ?
● Search takes too long

● Generates very lengthy code

● Very slow under full optimization

● Need a better search strategy
Smarter Search
● Majority of the computation is performed in
  register blocked code
● Benchmark only in multiples of register block
  size
● Search space of M0, N0, K0 is not reduced
  ○ Prioritize neighborhood of the best ones found
  ○ {M0-1, M0, M0+1} etc.
● Terminate after reaching acceptable
  efficiency
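The smarter search above can be sketched as greedy neighborhood refinement; `toy_cost` below is a stand-in for an actual benchmark run of the register-blocked code (in reality the cost would be measured execution time):

```c
typedef double (*cost_fn)(int m0, int n0, int k0);

/* Greedy neighborhood refinement: repeatedly probe {p-1, p, p+1} in each
   dimension around the best (M0, N0, K0) found so far, stopping when no
   neighbor improves the measured cost. */
void refine(int *m0, int *n0, int *k0, cost_fn cost)
{
    int improved = 1;
    while (improved) {
        improved = 0;
        double best = cost(*m0, *n0, *k0);
        for (int dm = -1; dm <= 1; dm++)
            for (int dn = -1; dn <= 1; dn++)
                for (int dk = -1; dk <= 1; dk++) {
                    int m = *m0 + dm, n = *n0 + dn, k = *k0 + dk;
                    if (m < 1 || n < 1 || k < 1)
                        continue;          /* block sizes stay positive */
                    double c = cost(m, n, k);
                    if (c < best) {
                        best = c;
                        *m0 = m; *n0 = n; *k0 = k;
                        improved = 1;
                    }
                }
    }
}

/* Stand-in cost with a single optimum at (4, 2, 8), for demonstration. */
double toy_cost(int m, int n, int k)
{
    return (m - 4) * (m - 4) + (n - 2) * (n - 2) + (k - 8) * (k - 8);
}
```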
Evaluation
Single Precision MMM (100 MHz SGI
Indigo R4k)




Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
Double Precision MMM (HP 712/80i)




Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
There is no Golden Hammer

Strengths:
● Automatic search for optimal parameters
● Produces portable ANSI C code

Weaknesses:
● Focus on uniprocessor machines
● No support for vector-based CPUs
● No control over instruction scheduling
Further Information
● http://www.icsi.berkeley.edu/~bilmes/phipac/

● http://www.inf.ethz.ch/personal/markusp/teaching/252-2600-ETH-fall11/slides/01-Dietiker.pdf
ATLAS
Siddharth Subramanian
ATLAS
● Automatically Tuned Linear Algebra
  Software
● Generates optimized BLAS library
● C and Fortran77
● Provides implementation for BLAS levels 1,2
  and 3.
● We will focus on Matrix-Matrix-Multiply
  (MMM)
Naive MMM
● C = A * B using 3 for-loops
● Dimensions of A, B and C are NxK, KxM and
  NxM respectively.
Optimization for L1 cache
● Matrix divided into NB x NB blocks
● Each block is called mini-MMM
● Optimization parameter NB is chosen such
  that each mini-MMM fits in cache
Optimization for L1 cache
Optimization for register file
● Mini-MMMs are further decomposed into
  micro-MMMs
● Multiplies MU x 1 sub-matrix of A by 1 x NU sub-
  matrix of B and accumulates the result into MU x
  NU sub-matrix of C
● Here MU and NU are the optimization parameters
● Necessary condition : MU + NU + MU*NU <= NR
● where NR = no. of floating point registers
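A hand-written sketch of what one MU = 2, NU = 2 micro-MMM looks like (row-major layouts and the function name are assumptions here; ATLAS generates code of this shape rather than shipping this exact routine):

```c
/* One micro-MMM: a 2x1 sliver of A times a 1x2 sliver of B accumulated
   into a 2x2 tile of C held in scalars (the "registers").
   A is 2xK with row stride lda, B is Kx2 with row stride ldb,
   C is 2x2 with row stride ldc; all row-major. */
void micro_mmm_2x2(const double *A, const double *B, double *C,
                   int K, int lda, int ldb, int ldc)
{
    double c00 = C[0],   c01 = C[1];
    double c10 = C[ldc], c11 = C[ldc + 1];
    for (int k = 0; k < K; k++) {
        double a0 = A[k];              /* MU = 2 elements of A */
        double a1 = A[lda + k];
        double b0 = B[k * ldb];        /* NU = 2 elements of B */
        double b1 = B[k * ldb + 1];
        c00 += a0 * b0;  c01 += a0 * b1;   /* MU*NU = 4 accumulations */
        c10 += a1 * b0;  c11 += a1 * b1;
    }
    C[0]   = c00;  C[1]       = c01;
    C[ldc] = c10;  C[ldc + 1] = c11;
}
```

The register-pressure condition above counts exactly these scalars: MU + NU loaded operands plus MU·NU accumulators must fit in the NR floating-point registers.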
Mini and Micro-MMM
Code
Pipeline scheduling
The two innermost loops (i'' and j'') are unrolled
to create interleaved multiply and add
statements
Exploits instruction-level parallelism
● If there is a fused multiply-add, then these 2
  operations can be executed together
● The optimization parameter FMA tells the code
  generator whether this facility is available
Pipeline scheduling
● MU + NU loads and stores
● MU * NU additions and multiplications
● Latency of operations might stall the pipeline
● Solution : Interleave the operations such that
  dependent operations are separated by a
  particular distance (What would that be?)
● This is governed by another optimization
  parameter - LS
Pipeline scheduling

● Inject MU + NU loads of A and B
● Loads divided into:
  ○ Initial fetch (IF)
  ○ Blocks of other load operations (NF)
Loop Unrolling
● KU is the optimization parameter that
  controls loop unrolling
● Constrained by the capacity of instruction
  cache
● Should not be so small (wastage of cache)
  or so big (overflow of instruction cache)
Other Optimizations


● Copying tiles of A is done in the beginning of
  outermost loop. These tiles are fully reused
  in each iteration of j loop
● Copying jth vertical panel of B -- done before
  beginning of i loop.
● Copying tile (i,j) of C just before the "k" loop
  starts
Other optimizations
● Choosing loop order:

  ○ if N < M then JIK loop order (so that A

     completely fits into L2 cache)

  ○ else if M < N then IJK loop order
Other optimizations
● Copying A, B, C for smaller matrices might
  be an overhead
● Non-copying versions are generated with
  optimization parameter NCNB
● This version used if:
  ○ M * N * K is less than a threshold
  ○ at least 1 dimension of 1 of the matrices is
     smaller than 3 * NCNB
Estimating parameters
● Orthogonal search is used for optimizing
  parameters.
● It is a heuristic, and finds approximate
  solutions
● No guarantee of optimized solution
● It needs these details:
  ○ Optimized in what order?
  ○ Possible solution range for parameters
  ○ reference value used for parameter k during
     optimization of 1 to k-1
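Orthogonal search itself is easy to sketch: optimize one parameter at a time, in a fixed order, scanning its range while the others stay at their current (reference) values. The objective below is a toy stand-in for a benchmark run:

```c
typedef double (*objective)(const int *params, int n);

/* Orthogonal (line) search: tune params[0..n-1] one at a time, in order,
   scanning each over [lo[i], hi[i]] while the rest stay fixed.  Heuristic:
   optimal only if the parameters do not interact. */
void orthogonal_search(int *params, const int *lo, const int *hi,
                       int n, objective f)
{
    for (int i = 0; i < n; i++) {
        int best_v = params[i];
        double best_c = 1e300;
        for (int v = lo[i]; v <= hi[i]; v++) {
            params[i] = v;
            double c = f(params, n);
            if (c < best_c) { best_c = c; best_v = v; }
        }
        params[i] = best_v;   /* freeze before tuning the next parameter */
    }
}

/* Toy separable objective with optimum at (3, 7), for demonstration. */
double toy_objective(const int *p, int n)
{
    (void)n;
    return (p[0] - 3) * (p[0] - 3) + (p[1] - 7) * (p[1] - 7);
}
```

Because each line search fixes its parameter before the next one starts, the order of optimization and the reference values matter, which is exactly why ATLAS must specify them.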
Summary of Parameters
Estimating Machine Parameters

Machine parameters are measured:
● C1 - Size of L1 data cache
● NR - Number of floating point registers
● FMA - Availability of fused multiply-add
● LS - Amount of separation between
  dependent multiply and add instructions
Estimating parameters

Optimization sequence
● NB
● MU and NU
● KU
● LS
● IF, NF
● NCNB
Finding NB

● Generates values in range :

  16 <= NB <= min(80, √C1)


  where C1 = size of L1 data cache
Finding MU and NU

● All combinations that satisfy:

   ○ MU * NU + MU + NU + LS <= NR


● NB was obtained earlier
Finding LS and IF, NF

LS
● Tries values in interval [1, 6]
● Boundary value fixed based on experiments
● LS must divide MU * NU * KU (instruction scheduling)

● IF: Searches for IF in the interval [2, MU + NU]
● NF in the interval [1, MU + NU - IF]
Finding NCNB


● Searches in the range [NB : -4 : 4]

● Terminates search when performance drops
  by 20% of the best found solution
Is Search Really Necessary?
Finding KU


● Constrained by instruction cache
● Values between 4 and NB/2 are tried

● Special values 1 and NB are also considered
Empirical Optimization
● Estimation of optimal values is the key
    ○ Compilers use Analytical models
    ○ Library Generators (eg: ATLAS) use search
● Empirical Search:
    ○ Get a version of program for each combination of
      parameters
    ○ Execute it on the target machine and measure
      performance
    ○ Select the one that performs best
    ○ Increased installation time!!
●   How is the search space bounded?
    ○ The hardware parameters
Yotov et al.
● Realised that most optimizations used in the
  ATLAS code generator are already known to
  compilers
  ○ cache tiling, register tiling, etc.
● Replaced the search module with a
  parameter estimator based on standard
  analytical models
● Code generator is not modified
  ○ Any performance change is solely based on
    differently chosen parameters
ATLAS Architecture
Analysis
● Results indicated that a simple and intuitive
  model is able to estimate near-optimal
  values for the parameters

● Focus on the ATLAS generated code

● Notations:
   ○ ATLAS CGw/S - Code Generator with Search
   ○ ATLAS Model - Modified Atlas (No search)
   ○ Atlas Unleashed - Hand written code may be used
     along with predefined architecture defaults for the
     parameter values to produce the library.
Model-Based Optimization

● Requires more machine parameters than
  original ATLAS
  ○ No Search!!
● Empirical optimizers:
  ○ Approximate values of machine params are okay
  ○ Only used to bound the search space
● Model-based Optimizers:
  ○ Need accurate values
  ○ Developed a tool called X-RAY to accurately
    measure them
Hardware Parameters
● C1,B1: the capacity and the line size of the
  L1 data cache
● CI : The capacity of the L1 instruction cache
● Lx: hardware latency of the floating-point
  multiply instruction
● |ALUFP |: number of floating-point functional
  units
● NR: the number of floating-point registers
● FMA: the availability of a fused multiply-add
  instruction
Estimating NB
● Consider an ideal L1 cache: fully associative,
  optimal replacement, unit line size

● The working set of the mini-MMM loop has 3 blocks
  of NB x NB:
                3·NB² <= C1
● In the innermost loop, an element of C, once
  computed, is not used again; similarly only 1
  column of B is needed in cache:
              NB² + NB + 1 <= C1
Refined Estimate of NB


● Correcting for non-unit line size

        ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
Further Refinement
● Estimated NB may not be multiple of MU and
  NU
● This might cause fractional register tiles and
  extra clean up
● Avoid this by choosing proper NB
● ATLAS needs NB to be an even integer
● So, choose the largest NB that satisfies the
  cache inequality and is even and a multiple of
  MU and NU
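One way to sketch the refined choice (the exact trimming rule below is a plausible reading of the model, not ATLAS source code): find the largest NB meeting the cache inequality, then round down until it is even and divisible by MU and NU.

```c
/* Largest NB with ceil(NB^2/B1) + ceil(NB/B1) + 1 <= C1/B1 that is also
   even and a multiple of MU and NU.  c1 = L1 capacity in elements,
   b1 = line size in elements. */
int trim_nb(int c1, int b1, int mu, int nu)
{
    int nb;
    for (nb = 16; ; nb++) {   /* grow until the inequality first fails */
        long lhs = (nb * (long)nb + b1 - 1) / b1   /* ceil(NB^2/B1) */
                 + (nb + b1 - 1) / b1              /* ceil(NB/B1)   */
                 + 1;
        if (lhs > c1 / b1)
            break;
    }
    nb--;                      /* largest NB meeting the inequality */
    while (nb > 0 && (nb % 2 || nb % mu || nb % nu))
        nb--;                  /* trim to avoid fractional register tiles */
    return nb;
}
```

For example, with C1 = 4096 elements, B1 = 4, MU = 4, NU = 2, the cache bound gives NB = 63, which trims down to 60.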
Estimating MU and NU

● View register file as a software cache
  ○ that is fully associative
  ○ unit line size
  ○ capacity = # registers, NR


● ATLAS performs outer products of (MU x 1)
  and (1 x NU) vectors for register tiling
Contd.
● ATLAS allocates MU elements for A, NU
  elements for B, and MU*NU elements for C
● Also need LS registers to store temp values
  of multiplications to make use of pipelining
● So we have:
      (MU x NU) + NU + MU + LS <= NR
LS calculation will be shown later, NR is known.
Only unknowns are MU and NU.
Estimation Scheme
● Let MU = NU = u. Solve prev inequality for u

● Let MU = max (u, 1). Solve for NU

● Let NU = max (NU, 1)

● <MU,NU> = <max (MU,NU) ,min (MU,NU)>
Estimating KU

● Not limited by the size of the register file
● Limited by the size of I-Cache
● Unroll the innermost loop within the size
  constraints of instruction cache
● Avoid micro-MMM code cleanup
   ○ Trim KU so that it divides NB

   ○ Usually, KU = NB in most machines
Estimating LS

● Skew factor that ATLAS code generator
  uses to schedule dependent multiplication
  and addition operations for CPU Pipeline
● LS independent multiplications and LS-1
  independent additions between muli and
  corresponding addi should at least hide the
  latency of multiplication.
Estimating LS

● LX = latency of multiplication
● 2·LS - 1 independent instructions hide this
  latency
● So, 2·LS - 1 >= LX
● There may be multiple floating point units:
        (2·LS - 1) / |ALUFP| >= LX
● Solving for LS:
        LS = ⌈(LX·|ALUFP| + 1) / 2⌉
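The resulting closed form can be computed directly (hypothetical helper; lx is the multiply latency, n_alu the number of FP units):

```c
/* LS from the pipeline model (2*LS - 1)/|ALU_FP| >= Lx, i.e. the
   smallest LS with LS >= (Lx * nALU + 1) / 2:
   LS = ceil((lx * n_alu + 1) / 2). */
int estimate_ls(int lx, int n_alu)
{
    /* integer ceil of (lx*n_alu + 1)/2 */
    return (lx * n_alu + 2) / 2;
}
```

E.g. a 4-cycle multiply on one FP unit gives LS = 3: the 2·3 − 1 = 5 independent operations in flight cover the 4-cycle latency.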
Summary
1. Estimate FMA
2. Estimate LS:
     ○   LS = ⌈(LX·|ALUFP| + 1) / 2⌉
3. Estimate MU and NU
     ○   MU·NU + NU + MU + LS <= NR
     ○   Set MU = NU = u; solve for u
     ○   MU = max(1, u); solve for NU
     ○   NU = max(NU, 1); if MU < NU, swap MU and NU
4. Estimate NB
     ○   ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
     ○   Trim NB to be a multiple of 2, MU and NU
5. Estimate KU
     ○   Constrained by I-cache
     ○   Make KU divide NB
6. Estimate IF, NF
     ○   IF = 2, NF = 2
Experimental Results
Conclusions
● In all machines (other than Itanium), the
  codes performed almost as well as global
  search based codes
● Models to find parameters are much faster
● Might be difficult to implement analytical
  methods in compilers
  ○ This model is focused on only 1 application
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueShapeBlue
163 views54 slides
Cencora Executive Symposium by
Cencora Executive SymposiumCencora Executive Symposium
Cencora Executive Symposiummarketingcommunicati21
139 views14 slides

Recently uploaded(20)

Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue158 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue94 views
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ by ShapeBlue
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericConfidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
ShapeBlue88 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu365 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue163 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10126 views
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... by ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue117 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue253 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue123 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker50 views
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue103 views
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue by ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlueElevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
ShapeBlue179 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue140 views
Digital Personal Data Protection (DPDP) Practical Approach For CISOs by Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash153 views
Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty62 views
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
ShapeBlue93 views
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue154 views
The Role of Patterns in the Era of Large Language Models by Yunyao Li
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language Models
Yunyao Li80 views

Auto Tuning

Context: High Performance Libraries
● Linear Algebra
   ○ BLAS, LAPACK, ScaLAPACK
● Signal/Image Processing
   ○ Vector Signal Image Processing Library (VSIPL)
● Distributed/Parallel Systems
   ○ Message Passing Interface (MPI)
● Can we implement libraries:
   ○ automatically and portably
   ○ incorporating platform-specific features
   ○ matching the performance of hand-tuned implementations by leveraging compiler technology
   ○ using domain-specific knowledge
AutoTuning
● Two-phase scheme for producing automatically tuned code
● Given: a program, its inputs and a machine
● Step 1: identify and generate a space of candidate implementations
● Step 2: select the fastest one using empirical modeling and/or automated experiments
Why not let the compiler worry?
● General purpose
   ○ whereas library generators can focus on specific problems
● Engineering
   ○ hard to modify a production compiler, and its effects are global
● Analysis
   ○ limited access to relevant run-time information
   ○ over-specified dependencies
   ○ correctness criteria

Compiler vs AutoTuner
● Input
   ○ Compiler: general-purpose source code
   ○ AutoTuner: a specification, including problem size, machine parameters and problem-specific transformations
● Output
   ○ Compiler: low-level machine code
   ○ AutoTuner: mostly high-level source (e.g. C code)
● Time to generate
   ○ Compiler: short (unless feedback/profiling is enabled)
   ○ AutoTuner: usually long (depends on the search space)
● Selecting an implementation
   ○ Compiler: mostly static analysis (rarely feedback tuning)
   ○ AutoTuner: automated empirical models and experiments

Some AutoTuning Projects
● Linear Algebra
   ○ PHiPAC: Portable High-Performance ANSI C
   ○ ATLAS: Automatically Tuned Linear Algebra Software
● Signal and Image Processing
   ○ FFTW: the Fastest Fourier Transform in the West
   ○ SPIRAL

PHiPAC (1997)
● Develops portable high-performance matrix-vector libraries in ANSI C
● Parametrized C-code generator
   ○ produces code according to certain guidelines
● Auto-tunes the code
● Exhaustive search over all parameters
● Claim: achieves over 90% of peak performance
PHiPAC Approach
● Parameters are architecture specific

Efficient Code Generation
● Studied several ANSI C compilers and determined the following division of labor
● Rely on the compiler for:
   ○ register allocation
   ○ instruction selection and scheduling
● Manually perform:
   ○ register/cache blocking
   ○ loop unrolling
   ○ software pipelining, etc.
Use Local Variables to Explicitly Remove False Dependencies

Before:
    a[i] = b[i] + c;
    a[i+1] = b[i+1] * d;

After:
    float f1, f2;
    f1 = b[i];
    f2 = b[i+1];
    a[i] = f1 + c;
    a[i+1] = f2 * d;

The compiler cannot assume &a[i] != &b[i+1], so it is forced to store a[i] before loading b[i+1] (pointer aliasing).
False Dependencies
(Figure: generated code before and after removing the dependency.)
Exploit Multiple Registers
● Explicitly keep values in local variables
   ○ reduces memory bandwidth
   ○ otherwise the compiler would reload the fil values on every iteration (potential aliasing with res)

Before:
    while(...) {
        *res++ = fil[0] * sig[0]
               + fil[1] * sig[1];
        signal++;
    }

After:
    float f0 = fil[0];
    float f1 = fil[1];
    while(...) {
        *res++ = f0 * sig[0]
               + f1 * sig[1];
        signal++;
    }
Minimize Pointer Updates by Striding with Constant Offsets

Before:
    f0 = *r8; r8 += 4;
    f1 = *r8; r8 += 4;
    f2 = *r8; r8 += 4;

After:
    f0 = r8[0];
    f1 = r8[4];
    f2 = r8[8];
    r8 += 12;

Compilers can fold the constant index into a (register + offset) addressing mode.

Minimize Branches, Avoid Magnitude Compares
● Branches are costly
   ○ unroll loops
   ○ use do {} while(); loops to avoid loop-head branches
● Using == and != is cheaper

Before:
    for (i = 0, a = start_ptr;
         i < ARRAY_SIZE;
         i++, a++) {
        ...
    }

After:
    end_ptr = &a[ARRAY_SIZE];
    do {
        ...
        a++;
    } while (a != end_ptr);
Explicitly Unroll Loops
● Instruction-level parallelism

Before:
    while(...) {
        *res++ = fil[0] * sig[0]
               + fil[1] * sig[1];
        signal++;
    }

After:
    float f0, f1, s0, s1, s2;
    f0 = fil[0]; f1 = fil[1];
    s0 = sig[0]; s1 = sig[1];
    *res++ = (f0 * s0) + (f1 * s1);
    do {
        signal++;
        s0 = sig[0];
        res[0] = f0*s1 + f1*s2;
        s1 = sig[1];
        res[1] = f0*s2 + f1*s0;
        res += 2;
    } while (...);
Other Guidelines
● Balance the instruction mix
   ○ interleave 1 FP multiply, 1 FP add and 1-2 FP loads or stores
● Increase locality
   ○ arrange code to have unit-stride memory accesses and try to reuse data in cache
● Convert integer multiplies to adds
   ○ * and / are slower than +
Matrix Multiply Generators
● Produce C code following the PHiPAC guidelines
● C = α·op(A)·op(B) + β·C
   ○ M×K, K×N and M×N matrices
   ○ op(X) is either X or transpose(X)
● mm_cgen and mm_lgen
   ○ core (register blocking)
   ○ level (higher-level cache blocking)
● mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...

Blocked MMM

    for (i = 0; i < M; i += M0)
      for (j = 0; j < N; j += N0)
        for (l = 0; l < K; l += K0)
          for (r = i; r < i + M0; r++)
            for (s = j; s < j + N0; s++)
              for (t = l; t < l + K0; t++)
                c[r][s] += a[r][t] * b[t][s];
Code Generator

    $ mm_cgen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ]

(The blocking parameters M0, K0, N0 and optionally M1, K1, N1 are fed to mm_cgen, which emits optimized C.)

Usage and Options
Usage: mm_cgen [OPTIONS]
● Semantics options:
   ○ -op[ABC] [N|T] : [ABC] matrix op, Normal | Transpose
   ○ -no_fringes : don't generate M, K, or N register-block fringes
● Optimization options:
   ○ -l0/-l1 M0/M1 K0/K1 N0/N1 : register (L0) / cache (L1) blocking parameters
   ○ -sp [1|2lm|2ma|3] : software pipelining options

Contd.
● Precision options:
   ○ -prec/-sprec/-aprec/-dprec [single|double|ldouble] : precision (source, accumulator, destination)
● Misc. options:
   ○ -file name : write to file 'name'
   ○ -routine_name name : name of the routines

Optimal Block Sizes
● Use the search.pl script

Optimal Block Sizes
● Naive brute-force search
● For the register parameters:
   ○ NR/4 <= M0·N0 <= NR, where NR is the maximum number of registers
   ○ 1 <= K0 <= K0max; K0max = 20 (tunable)
● Benchmark all squares M = K = N = D
   ○ D runs over 2x, 3x, 10x and all primes
   ○ such that 3·D² fits in the L1 cache
Contd.
● For the L1 blocking parameters
● The square case (D × D)
● Search the neighborhood centered at 3·D² = L1
● Set the values of M1, K1, N1 to φ·D/M0 (and analogously for K0, N0)
   ○ where φ ∈ { 0.25, 0.5, 1.0, 1.5, 2.0 }
   ○ D = sqrt(L1/3)
   ○ 125 combinations

Naive Brute Force?
● The search takes too long
● Generates very lengthy code
● Very slow under full optimization
● Need a better search strategy

Smarter Search
● The majority of the computation is performed in register-blocked code
● Benchmark only in multiples of the register block size
● The search space of M0, N0, K0 is not reduced
   ○ prioritize the neighborhood of the best ones found
   ○ {M0-1, M0, M0+1}, etc.
● Terminate after reaching acceptable efficiency

Single Precision MMM (100 MHz SGI Indigo R4k)
Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

Double Precision MMM (HP 712/80i)
Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

There is no Golden Hammer
Strengths:
● Automatic search for optimal parameters
● Produces portable ANSI C code
Weaknesses:
● Focus on uniprocessor machines
● No support for vector-based CPUs
● No control over instruction scheduling

Further Information
● http://www.icsi.berkeley.edu/~bilmes/phipac/
● http://www.inf.ethz.ch/personal/markusp/teaching/252-2600-ETH-fall11/slides/01-Dietiker.pdf
ATLAS
● Automatically Tuned Linear Algebra Software
● Generates an optimized BLAS library
● C and Fortran77
● Provides implementations of BLAS levels 1, 2 and 3
● We will focus on Matrix-Matrix Multiply (MMM)

Naive MMM
● C = A * B using 3 for-loops
● The dimensions of A, B and C are N×K, K×M and N×M respectively

Optimization for the L1 Cache
● The matrix is divided into NB × NB blocks
● Each block is called a mini-MMM
● The optimization parameter NB is chosen so that each mini-MMM fits in cache

Optimization for the Register File
● Mini-MMMs are further decomposed into micro-MMMs
● A micro-MMM multiplies an MU × 1 sub-matrix of A by a 1 × NU sub-matrix of B and accumulates the result into an MU × NU sub-matrix of C
● MU and NU are the optimization parameters
● Necessary condition: MU + NU + MU·NU <= NR
   ○ where NR = number of floating-point registers

Code (figure)
Pipeline Scheduling
● The 2 innermost loops (i'' and j'') are unrolled to create interleaved multiply and add statements
   ○ exploits instruction-level parallelism
● If there is a fused multiply-add, these 2 operations can be executed together
● The optimization parameter FMA indicates to the code generator whether this facility is available
Pipeline Scheduling
● MU + NU loads and stores
● MU · NU additions and multiplications
● The latency of these operations might stall the pipeline
● Solution: interleave the operations so that dependent operations are separated by a particular distance (what should that distance be?)
● This is governed by another optimization parameter, LS

Pipeline Scheduling
● Inject the MU + NU loads of A and B
● The loads are divided into:
   ○ an initial fetch (IF)
   ○ blocks of further load operations (NF)

Loop Unrolling
● KU is the optimization parameter that controls loop unrolling
● Constrained by the capacity of the instruction cache
● Should be neither too small (wasting the cache) nor too big (overflowing the instruction cache)

Other Optimizations
● Tiles of A are copied at the beginning of the outermost loop; these tiles are fully reused in each iteration of the j loop
● The jth vertical panel of B is copied before the beginning of the i loop
● Tile (i,j) of C is copied just before the k loop starts

Other Optimizations
● Choosing the loop order:
   ○ if N < M, use the JIK loop order (so that A completely fits into the L2 cache)
   ○ else if M < N, use the IJK loop order

Other Optimizations
● Copying A, B and C can be an overhead for smaller matrices
● Non-copying versions are generated, governed by the optimization parameter NCNB
● The non-copying version is used if:
   ○ M · N · K is less than a threshold
   ○ at least 1 dimension of 1 of the matrices is smaller than 3 · NCNB

Estimating Parameters
● Orthogonal search is used to optimize the parameters
● It is a heuristic and finds approximate solutions
   ○ no guarantee of an optimal solution
● It needs these details:
   ○ in what order are the parameters optimized?
   ○ the range of possible values for each parameter
   ○ the reference value used for parameter k during the optimization of parameters 1 to k-1
Estimating Machine Parameters
● The measured machine parameters are:
   ○ C1 : size of the L1 data cache
   ○ NR : number of floating-point registers
   ○ FMA : availability of a fused multiply-add
   ○ LS : amount of separation between dependent multiply and add instructions

Estimating Parameters
● Optimization sequence:
   ○ NB
   ○ MU and NU
   ○ KU
   ○ LS
   ○ IF, NF
   ○ NCNB

Finding NB
● Generate values in the range 16 <= NB <= min(80, √C1)
   ○ where C1 = size of the L1 data cache

Finding MU and NU
● Try all combinations that satisfy:
   ○ MU · NU + MU + NU + LS <= NR
● NB was obtained earlier

Finding LS, IF and NF
● LS
   ○ try values in the interval [1, 6]
   ○ the boundary value was fixed based on experiments
   ○ LS must divide MU · NU · KU (instruction scheduling)
● IF: search in the interval [2, MU + NU]
● NF: search in the interval [1, MU + NU - IF]

Finding NCNB
● Search in the range [NB : -4 : 4]
● Terminate the search when performance drops by 20% from the best solution found
Is Search Really Necessary?

Finding KU
● Constrained by the instruction cache
● Values between 4 and NB/2 are tried
● The special values 1 and NB are also considered

Empirical Optimization
● Estimating the optimal values is the key
   ○ compilers use analytical models
   ○ library generators (e.g. ATLAS) use search
● Empirical search:
   ○ generate a version of the program for each combination of parameters
   ○ execute each on the target machine and measure its performance
   ○ select the one that performs best
   ○ increased installation time!
● How is the search space bounded?
   ○ by the hardware parameters

Yotov et al.
● Realized that most optimizations used in the ATLAS code generator are already known to compilers
   ○ cache tiling, register tiling, etc.
● Replaced the search module with a parameter estimator based on standard analytical models
● The code generator is not modified
   ○ any performance change is due solely to differently chosen parameters

Analysis
● The results indicated that a simple, intuitive model can estimate near-optimal values for the parameters
● Focus on the ATLAS-generated code
● Notation:
   ○ ATLAS CGw/S : code generator with search
   ○ ATLAS Model : modified ATLAS (no search)
   ○ ATLAS Unleashed : hand-written code may be used, along with predefined architecture defaults for the parameter values, to produce the library

Model-Based Optimization
● Requires more machine parameters than the original ATLAS
   ○ no search!
● Empirical optimizers:
   ○ approximate values of the machine parameters are okay
   ○ they are only used to bound the search space
● Model-based optimizers:
   ○ need accurate values
   ○ the authors developed a tool called X-RAY to measure them accurately

Hardware Parameters
● C1, B1 : the capacity and the line size of the L1 data cache
● CI : the capacity of the L1 instruction cache
● Lx : the hardware latency of the floating-point multiply instruction
● |ALUFP| : the number of floating-point functional units
● NR : the number of floating-point registers
● FMA : the availability of a fused multiply-add instruction

Estimating NB
● Consider an L1 cache that is fully associative, with optimal replacement and unit line size
● The working set of the mini-MMM loop has 3 blocks of size NB × NB:
   3·NB² <= C1
● In the innermost loop, an element of C, once computed, is not used again; similarly, only 1 column of B is needed in cache:
   NB² + NB + 1 <= C1

Refined Estimate of NB
● Correcting for a non-unit line size B1:
   ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
Further Refinement
● The estimated NB may not be a multiple of MU and NU
● This causes fractional register tiles and extra clean-up code
● Avoid this by choosing a suitable NB
● ATLAS needs NB to be an even integer
● So NB is trimmed down to the largest even multiple of MU and NU that still satisfies the cache constraint
Estimating MU and NU
● View the register file as a software cache
   ○ that is fully associative
   ○ with unit line size
   ○ with capacity = the number of registers, NR
● ATLAS performs outer products of (MU × 1) and (1 × NU) vectors for register tiling

Contd.
● ATLAS allocates MU elements for A, NU elements for B and MU·NU elements for C
● LS registers are also needed to hold temporary results of the multiplications, to make use of pipelining
● So we have:
   (MU × NU) + NU + MU + LS <= NR
● The LS calculation is shown later; NR is known; the only unknowns are MU and NU

Estimation Scheme
● Let MU = NU = u; solve the previous inequality for u
● Let MU = max(u, 1); solve for NU
● Let NU = max(NU, 1)
● ⟨MU, NU⟩ = ⟨max(MU, NU), min(MU, NU)⟩
Estimating KU
● Not limited by the size of the register file
● Limited by the size of the instruction cache
● Unroll the innermost loop within the size constraints of the instruction cache
● Avoid micro-MMM clean-up code
   ○ trim KU so that it divides NB
   ○ usually KU = NB on most machines

Estimating LS
● The skew factor that the ATLAS code generator uses to schedule dependent multiply and add operations for the CPU pipeline
● The LS independent multiplications and LS - 1 independent additions between mult_i and the corresponding add_i should at least hide the latency of the multiplication

Estimating LS
● Lx = latency of multiplication
● 2·LS - 1 independent instructions hide this latency
● So, 2·LS - 1 >= Lx
● There may be multiple floating-point units:
   (2·LS - 1) / |ALUFP| >= Lx
● Solving for LS:
   LS = ⌈(Lx · |ALUFP| + 1) / 2⌉
Summary
1. Estimate FMA
2. Estimate LS
3. Estimate MU and NU
   ○ MU·NU + NU + MU + LS <= NR
   ○ set MU = NU = u and solve for u
   ○ MU = max(1, u); solve for NU
   ○ NU = max(NU, 1); if MU < NU, swap MU and NU
4. Estimate NB
   ○ ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
   ○ trim NB to be a multiple of 2, MU and NU
5. Estimate KU
   ○ constrained by the I-cache
   ○ make KU divide NB
6. Estimate IF, NF
   ○ IF = 2, NF = 2
Conclusions
● On all machines other than Itanium, the model-based codes performed almost as well as the global-search-based codes
● Models find the parameters much faster
● It might be difficult to implement these analytical methods in compilers
   ○ this model is focused on only 1 application