Auto Tuning
Hemanth Kumar Mantri and Siddharth Subramanian
Graduate Teaching Assistants
UT Austin
Basics
What is Auto Tuning?
● Several Definitions
   ○ First result on Wikipedia: "Auto-Tune is an audio
     processor created by Antares Audio Technologies"


● A Definition
  ○ Autotuning is an automatic process for selecting one
      out of several possible solutions to a computational
      problem.


● Techniques used by:
   ○ Library generators, Compilers and Runtime systems
Possible Versions of a Solution
● The solutions may differ in the
  ○ algorithm (quicksort vs selection sort)
  ○ implementation (e.g., loop unrolling)

● The versions may result from
  ○ transformations (unroll, tile, interchange)

● The versions could be generated by
  ○ programmer manually (coding or directives)
   ○ compiler automatically
Motivation
■ Increasing diversity of computing platforms
■ New influences on the execution of parallel
  applications
  ○ Hierarchical structure
  ○ Heterogeneity of the processors
■ Design efficient software that takes full
  advantage of such systems
■ Solving a target problem by using a single
  algorithm is not always efficient everywhere
First Ideas
● Poly-Algorithms
    ○   (1969) John Rice (Purdue): "A polyalgorithm for the automatic
        solution of nonlinear equations"


●   Profiling and feedback assisted compilation
    ○   (1982) S. Graham et al.: gprof
    ○   (1991) P. Chang et al.: "Using profile information to assist classic
        code optimizations"


●   Code generation
    ○   (1989) J. Johnson et al.: “A methodology for designing, modifying,
        and implementing Fourier Transform algorithms on various
        architectures.”
    ○   (1992) M. Covell et al.: “Computer-aided algorithm design and
        arrangement”
Context: High Performance Libraries
● Linear Algebra
   ○ BLAS, LAPACK, ScaLAPACK
● Signal/Image Processing
  ○ Vector Signal Image Processing Library (VSIPL)
● Distributed/Parallel Systems
  ○ Message Passing Interface (MPI)
● Can we implement libraries:
  ○ Automatically and Portably
  ○ Incorporating platform-specific features
  ○ matching the performance of hand-tuned
     implementations while leveraging compiler technology
   ○ using domain-specific knowledge
AutoTuning
● A two-phase scheme for producing automatically
  tuned code

● Given: a program, its inputs, and a target machine

● Step 1: Identify and generate a space of
  candidate implementations

● Step 2: Select the fastest one using empirical
  modeling and/or automated experiments
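As a toy sketch of this two-phase scheme (all names and candidates below are illustrative, not from any real autotuner): generate a small space of candidate implementations, time each one on the target machine, and keep the fastest.

```c
#include <time.h>

/* Step 1: a tiny candidate space -- two dot-product implementations
   that differ only in unroll factor. */
double dot_u1(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

double dot_u4(const double *a, const double *b, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {       /* unrolled by 4 */
        s0 += a[i]   * b[i];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    double s = s0 + s1 + s2 + s3;
    for (; i < n; i++) s += a[i] * b[i];   /* fringe loop */
    return s;
}

typedef double (*dot_fn)(const double *, const double *, int);

/* Step 2: empirically select the fastest candidate on this machine. */
int pick_fastest(dot_fn cand[], int ncand, const double *a,
                 const double *b, int n) {
    int best = 0;
    double best_t = 1e30;
    for (int c = 0; c < ncand; c++) {
        volatile double sink = 0;          /* keep the work alive */
        clock_t t0 = clock();
        for (int rep = 0; rep < 1000; rep++)
            sink += cand[c](a, b, n);
        double t = (double)(clock() - t0);
        if (t < best_t) { best_t = t; best = c; }
    }
    return best;
}
```

Real autotuners such as PHiPAC and ATLAS do the same thing at much larger scale: the candidate space is generated by a parameterized code generator, and the timing runs drive a search over its parameters.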
Why not let the compiler worry?
● General Purpose
  ○ whereas Library generators can focus on specific
    problems


● Engineering
  ○ Hard to modify a production compiler and its effects
    are global


● Analysis
  ○ Limited access to relevant run-time information
  ○ Over-specified dependencies
  ○ Correctness Criteria
Compiler vs AutoTuner

                 Compiler                 AutoTuner
Input            General-purpose          Specification including
                 source code              problem size, machine
                                          parameters and
                                          problem-specific
                                          transformations

Output           Low-level machine        Mostly high-level
                 code                     source (e.g., C code)

Time to          Short (unless            Usually long (depends
generate         feedback/profiling       on search space)
                 is enabled)

Implementation   Mostly static analysis   Automated empirical
selection        (rarely feedback         models and
                 tuning)                  experiments
Some AutoTuning Projects

● Linear Algebra
  ○ Portable High-Performance ANSI C
     ■ PHiPAC
  ○ Automatically Tuned Linear Algebra Software
    ■ ATLAS


● Signal and Image Processing
  ○ Fast Fourier Transformations of the West
    ■ FFTW
  ○ SPIRAL
PHiPAC
Traditional Approach: Hand-Tuned Libraries
PHiPAC (1997)
● Developing Portable High-Performance
  matrix vector libraries in ANSI C
● Parametrized C-code Generator
  ○ produces code according to certain
     guidelines
● Auto Tune the code
● Exhaustive search over all parameters
● Claim: achieves over 90% of peak performance,
  sometimes exceeding vendor-supplied libraries
PHiPAC Approach
Generate Optimized C Code
PHiPAC Approach
Parameters are Architecture Specific
Efficient Code Generation
● Studied several ANSI C Compilers and
  determined that it is best to

● Rely on Compilers for:
  ○ Register allocation
  ○ Instruction selection and Scheduling


● Manually perform:
  ○ register/cache blocking
  ○ loop unrolling
  ○ software pipe-lining, etc
Local Variables to explicitly remove false
dependencies

● Before:
    a[i]   = b[i] + c;
    a[i+1] = b[i+1] * d;

● After:
    float f1, f2;
    f1 = b[i]; f2 = b[i+1];
    a[i]   = f1 + c;
    a[i+1] = f2 * d;

The compiler may not assume &a[i] != &b[i+1],
and so is forced to store a[i] before loading
b[i+1] (pointer aliasing).
False Dependencies
(figure: generated code before and after removing the dependency)
Exploit Multiple Registers

● Explicitly keep values in local variables
  ○ Reduces memory bandwidth
  ○ compiler would otherwise reload fil values on
    every iteration (potential aliasing with res)

● Before:
    while (...) {
      *res++ = fil[0] * sig[0]
             + fil[1] * sig[1];
      sig++;
    }

● After:
    float f0 = fil[0];
    float f1 = fil[1];
    while (...) {
      *res++ = f0 * sig[0]
             + f1 * sig[1];
      sig++;
    }
Minimize pointer updates by striding with
constant offsets

● Before:
    f0 = *r8; r8 += 4;
    f1 = *r8; r8 += 4;
    f2 = *r8; r8 += 4;

● After:
    f0 = r8[0];
    f1 = r8[4];
    f2 = r8[8];
    r8 += 12;

Compilers can fold the constant index into a
(register + offset) addressing mode.
Minimize branches, avoid magnitude
compares

● Branches are costly
  ○ Unroll loops
  ○ Use do {} while(); loops to avoid loop
    head branches
● Using == and != is cheaper than < or <=

● Before:
    for (i = 0, a = start_ptr;
         i < ARRAY_SIZE;
         i++, a++) {
      ...
    }

● After:
    end_ptr = &a[ARRAY_SIZE];
    do {
      ...
      a++;
    } while (a != end_ptr);
Explicitly unroll loops

● Exposes instruction-level parallelism

● Before:
    while (...) {
      *res++ = fil[0] * sig[0]
             + fil[1] * sig[1];
      sig++;
    }

● After (unrolled by 2):
    float f0, f1, s0, s1, s2;
    f0 = fil[0]; f1 = fil[1];
    s0 = sig[0]; s1 = sig[1];
    do {
      s2 = sig[2];
      res[0] = f0*s0 + f1*s1;
      res[1] = f0*s1 + f1*s2;
      res += 2; sig += 2;
      s0 = sig[0]; s1 = sig[1];
    } while (...);
Other Guidelines
● Balance the instruction mix
  ○ Interleave 1 FP multiply, 1 FP add, and 1-2 FP
     loads or stores
● Increase Locality
  ○ Arrange code to have unit-stride memory
     accesses and try to reuse data in cache
● Convert Integer multiplies to adds
  ○ * and / are slower than +
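The last guideline is classic strength reduction; a minimal sketch (function names are illustrative), replacing a per-iteration index multiply by a running offset:

```c
/* Before: an integer multiply on every iteration to index row i. */
int sum_first_col_mul(const int *m, int nrows, int ncols) {
    int s = 0;
    for (int i = 0; i < nrows; i++)
        s += m[i * ncols];        /* multiply each time around */
    return s;
}

/* After: strength-reduced -- the multiply becomes a running add. */
int sum_first_col_add(const int *m, int nrows, int ncols) {
    int s = 0, off = 0;
    for (int i = 0; i < nrows; i++, off += ncols)
        s += m[off];              /* only additions in the loop */
    return s;
}
```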
Matrix Multiply Generators
● Produce C code with PHiPAC guidelines
● C = αop(A)op(B) + βC
  ○ MxK, KxN and MxN matrices
  ○ op(X) is either X or transpose(X)

● mm_cgen and mm_lgen
    ○ Core (register blocking)
    ○ Level (higher level cache blocking)


●   mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...
Blocked MMM
for (i=0; i<M; i+=M0)
 for (j=0; j<N; j+=N0)
  for (l=0; l<K; l+=K0)

   for (r=i; r<i+M0; r++)
    for (s=j; s<j+N0; s++)
     for (t=l; t<l+K0; t++)
      c[r][s] += a[r][t] * b[t][s];
Code Generator
 $ mm_gen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ]




  (M0 K0 N0, M1 K1 N1) → mm_gen → Optimized C
Usage and Options
Usage: mm_cgen [OPTIONS]
● Semantics options:
    ○ -op[ABC] [N|T] : [ABC] matrix op. Normal|Transpose
    ○ -no_fringes : don’t generate an M,K, or N reg block
      fringes


●   Optimization options:
    ○ -l0/l1 M0/M1 K0/K1 N0/N1 : register (L0)/Cache (L1)
      blocking parameters
    ○ -sp [1|2lm|2ma|3] : software pipelining options
Contd.
● Precision options:
   ○ prec/sprec/aprec/dprec [single|double|ldouble] :
     Precision (source, accumulator, destination)


● Misc. options:
  ○ file name : Write to file ’name’
   ○ routine_name name : Name of routines
Optimal Block Sizes
Use the search.pl script
Optimal Block Sizes
● Naive brute force search

● For Register Parameters
   ○ NR/4 <= M0·N0 <= NR ; NR is the max number of registers
   ○ 1 <= K0 <= K0max ; K0max = 20 (tunable)


● Benchmark all squares M = K = N = D
  ○ D runs over 2x, 3x, 10x and all primes
  ○ such that 3D² fits in the L1 cache
Contd.
● For the L1 blocking parameters
● The square case (D x D)
● Search the neighborhood centered at 3D² = L1
● Set the values of M1, K1, N1 to ϕ·D/M0
   ○ where ϕ ∈ { 0.25, 0.5, 1.0, 1.5, 2.0 }
   ○ D = sqrt(L1/3)
   ○ 125 combinations
Naive Brute Force ?
● Search takes too long

● Generates very lengthy code

● Very slow under full optimization

● Need a better search strategy
Smarter Search
● Majority of the computation is performed in
  register blocked code
● Benchmark only in multiples of register block
  size
● Search space of M0, N0, K0 is not reduced
  ○ Prioritize neighborhood of the best ones found
  ○ {M0-1, M0, M0+1} etc.
● Terminate after reaching acceptable
  efficiency
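The smarter search above can be sketched as greedy neighborhood refinement; `toy_cost` below is a stand-in for an actual benchmark run of the register-blocked code (in reality the cost would be measured execution time):

```c
typedef double (*cost_fn)(int m0, int n0, int k0);

/* Greedy neighborhood refinement: repeatedly probe {p-1, p, p+1} in each
   dimension around the best (M0, N0, K0) found so far, stopping when no
   neighbor improves the measured cost. */
void refine(int *m0, int *n0, int *k0, cost_fn cost)
{
    int improved = 1;
    while (improved) {
        improved = 0;
        double best = cost(*m0, *n0, *k0);
        for (int dm = -1; dm <= 1; dm++)
            for (int dn = -1; dn <= 1; dn++)
                for (int dk = -1; dk <= 1; dk++) {
                    int m = *m0 + dm, n = *n0 + dn, k = *k0 + dk;
                    if (m < 1 || n < 1 || k < 1)
                        continue;          /* block sizes stay positive */
                    double c = cost(m, n, k);
                    if (c < best) {
                        best = c;
                        *m0 = m; *n0 = n; *k0 = k;
                        improved = 1;
                    }
                }
    }
}

/* Stand-in cost with a single optimum at (4, 2, 8), for demonstration. */
double toy_cost(int m, int n, int k)
{
    return (m - 4) * (m - 4) + (n - 2) * (n - 2) + (k - 8) * (k - 8);
}
```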
Evaluation
Single Precision MMM (100 MHz SGI
Indigo R4k)




Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
Double Precision MMM (HP 712/80i)




Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
There is no Golden Hammer

Strengths:
● Automatic search for optimal parameters
● Produces portable ANSI C code

Weaknesses:
● Focus on uniprocessor machines
● No support for vector-based CPUs
● No control over instruction scheduling
Further Information
● http://www.icsi.berkeley.edu/~bilmes/phipac/

● http://www.inf.ethz.ch/personal/markusp/teaching/252-2600-ETH-fall11/slides/01-Dietiker.pdf
ATLAS
Siddharth Subramanian
ATLAS
● Automatically Tuned Linear Algebra
  Software
● Generates optimized BLAS library
● C and Fortran77
● Provides implementation for BLAS levels 1,2
  and 3.
● We will focus on Matrix-Matrix-Multiply
  (MMM)
Naive MMM
● C = A * B using 3 for-loops
● Dimensions of A, B and C are NxK, KxM and
  NxM respectively.
Optimization for L1 cache
● Matrix divided into NB x NB blocks
● Each block is called mini-MMM
● Optimization parameter NB is chosen such
  that each mini-MMM fits in cache
Optimization for L1 cache
Optimization for register file
● Mini-MMMs are further decomposed into
  micro-MMMs
● Multiplies MU x 1 sub-matrix of A by 1 x NU sub-
  matrix of B and accumulates the result into MU x
  NU sub-matrix of C
● Here MU and NU are the optimization parameters
● Necessary condition : MU + NU + MU*NU <= NR
● where NR = no. of floating point registers
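A hand-written sketch of what one MU = 2, NU = 2 micro-MMM looks like (row-major layouts and the function name are assumptions here; ATLAS generates code of this shape rather than shipping this exact routine):

```c
/* One micro-MMM: a 2x1 sliver of A times a 1x2 sliver of B accumulated
   into a 2x2 tile of C held in scalars (the "registers").
   A is 2xK with row stride lda, B is Kx2 with row stride ldb,
   C is 2x2 with row stride ldc; all row-major. */
void micro_mmm_2x2(const double *A, const double *B, double *C,
                   int K, int lda, int ldb, int ldc)
{
    double c00 = C[0],   c01 = C[1];
    double c10 = C[ldc], c11 = C[ldc + 1];
    for (int k = 0; k < K; k++) {
        double a0 = A[k];              /* MU = 2 elements of A */
        double a1 = A[lda + k];
        double b0 = B[k * ldb];        /* NU = 2 elements of B */
        double b1 = B[k * ldb + 1];
        c00 += a0 * b0;  c01 += a0 * b1;   /* MU*NU = 4 accumulations */
        c10 += a1 * b0;  c11 += a1 * b1;
    }
    C[0]   = c00;  C[1]       = c01;
    C[ldc] = c10;  C[ldc + 1] = c11;
}
```

The register-pressure condition above counts exactly these scalars: MU + NU loaded operands plus MU·NU accumulators must fit in the NR floating-point registers.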
Mini and Micro-MMM
Code
Pipeline scheduling
The two innermost loops (i'' and j'') are unrolled
to create interleaved multiply and add
statements
Exploits instruction-level parallelism
● If there is a fused multiply-add, then these 2
  operations can be executed together
● The optimization parameter FMA tells the code
  generator whether this facility is available
Pipeline scheduling
● MU + NU loads and stores
● MU * NU additions and multiplications
● Latency of operations might stall the pipeline
● Solution : Interleave the operations such that
  dependent operations are separated by a
  particular distance (What would that be?)
● This is governed by another optimization
  parameter - LS
Pipeline scheduling

● Inject MU + NU loads of A and B
● Loads divided into:
  ○ Initial fetch (IF)
  ○ Blocks of other load operations (NF)
Loop Unrolling
● KU is the optimization parameter that
  controls loop unrolling
● Constrained by the capacity of instruction
  cache
● Should not be so small (wastage of cache)
  or so big (overflow of instruction cache)
Other Optimizations


● Copying tiles of A is done in the beginning of
  outermost loop. These tiles are fully reused
  in each iteration of j loop
● Copying jth vertical panel of B -- done before
  beginning of i loop.
● Copying tile (i,j) of C just before the "k" loop
  starts
Other optimizations
● Choosing loop order:

  ○ if N < M then JIK loop order (so that A

     completely fits into L2 cache)

  ○ else if M < N then IJK loop order
Other optimizations
● Copying A, B, C for smaller matrices might
  be an overhead
● Non-copying versions are generated with
  optimization parameter NCNB
● This version used if:
  ○ M * N * K is less than a threshold
  ○ at least 1 dimension of 1 of the matrices is
     smaller than 3 * NCNB
Estimating parameters
● Orthogonal search is used for optimizing
  parameters.
● It is a heuristic, and finds approximate
  solutions
● No guarantee of optimized solution
● It needs these details:
  ○ Optimized in what order?
  ○ Possible solution range for parameters
  ○ reference value used for parameter k during
     optimization of 1 to k-1
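Orthogonal search itself is easy to sketch: optimize one parameter at a time, in a fixed order, scanning its range while the others stay at their current (reference) values. The objective below is a toy stand-in for a benchmark run:

```c
typedef double (*objective)(const int *params, int n);

/* Orthogonal (line) search: tune params[0..n-1] one at a time, in order,
   scanning each over [lo[i], hi[i]] while the rest stay fixed.  Heuristic:
   optimal only if the parameters do not interact. */
void orthogonal_search(int *params, const int *lo, const int *hi,
                       int n, objective f)
{
    for (int i = 0; i < n; i++) {
        int best_v = params[i];
        double best_c = 1e300;
        for (int v = lo[i]; v <= hi[i]; v++) {
            params[i] = v;
            double c = f(params, n);
            if (c < best_c) { best_c = c; best_v = v; }
        }
        params[i] = best_v;   /* freeze before tuning the next parameter */
    }
}

/* Toy separable objective with optimum at (3, 7), for demonstration. */
double toy_objective(const int *p, int n)
{
    (void)n;
    return (p[0] - 3) * (p[0] - 3) + (p[1] - 7) * (p[1] - 7);
}
```

Because each line search fixes its parameter before the next one starts, the order of optimization and the reference values matter, which is exactly why ATLAS must specify them.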
Summary of Parameters
Estimating Machine Parameters

Machine parameters are measured:
● C1 - Size of L1 data cache
● NR - Number of floating point registers
● FMA - Availability of fused multiply-add
● LS - Amount of separation between
  dependent multiply and add instructions
Estimating parameters

Optimization sequence
● NB
● MU and NU
● KU
● LS
● IF, NF
● NCNB
Finding NB

● Generates values in range :

  16 <= NB <= min(80, √C1)


  where C1 = size of L1 data cache
Finding MU and NU

● All combinations that satisfy:

   ○ MU * NU + MU + NU + LS <= NR


● NB was obtained earlier
Finding LS and IF, NF

LS
● Tries values in interval [1, 6]
● Boundary value fixed based on experiments
● LS must divide MU * NU * KU (instruction scheduling)

● IF: Searches for IF in the interval [2, MU + NU]
● NF in the interval [1, MU + NU - IF]
Finding NCNB


● Searches in the range [NB : -4 : 4]

● Terminates search when performance drops
  by 20% of the best found solution
Is Search Really Necessary?
Finding KU


● Constrained by instruction cache
● Values between 4 and NB/2 are tried

● Special values 1 and NB are also considered
Empirical Optimization
● Estimation of optimal values is the key
    ○ Compilers use Analytical models
    ○ Library Generators (eg: ATLAS) use search
● Empirical Search:
    ○ Get a version of program for each combination of
      parameters
    ○ Execute it on the target machine and measure
      performance
    ○ Select the one that performs best
    ○ Increased installation time!!
●   How is the search space bounded?
    ○ The hardware parameters
Yotov et al.
● Realised that most optimizations used in the
  ATLAS code generator are already known to
  compilers
  ○ cache tiling, register tiling, etc.
● Replaced the search module with a
  parameter estimator based on standard
  analytical models
● Code generator is not modified
  ○ Any performance change is solely based on
    differently chosen parameters
ATLAS Architecture
Analysis
● Results indicated that a simple and intuitive
  model is able to estimate near-optimal
  values for the parameters

● Focus on the ATLAS generated code

● Notations:
   ○ ATLAS CGw/S - Code Generator with Search
   ○ ATLAS Model - Modified Atlas (No search)
   ○ Atlas Unleashed - Hand written code may be used
     along with predefined architecture defaults for the
     parameter values to produce the library.
Model-Based Optimization

● Requires more machine parameters than
  original ATLAS
  ○ No Search!!
● Empirical optimizers:
  ○ Approximate values of machine params are okay
  ○ Only used to bound the search space
● Model-based Optimizers:
  ○ Need accurate values
  ○ Developed a tool called X-RAY to accurately
    measure them
Hardware Parameters
● C1,B1: the capacity and the line size of the
  L1 data cache
● CI : The capacity of the L1 instruction cache
● Lx: hardware latency of the floating-point
  multiply instruction
● |ALUFP |: number of floating-point functional
  units
● NR: the number of floating-point registers
● FMA: the availability of a fused multiply-add
  instruction
Estimating NB
● Consider an ideal L1 cache: fully associative,
  optimal replacement, unit line size

● The working set of the mini-MMM loop has 3 blocks
  of NB x NB:
                3·NB² <= C1
● In the innermost loop, an element of C, once
  computed, is not used again; similarly only 1
  column of B is needed in cache:
              NB² + NB + 1 <= C1
Refined Estimate of NB


● Correcting for non-unit line size

        ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
Further Refinement
● Estimated NB may not be multiple of MU and
  NU
● This might cause fractional register tiles and
  extra clean up
● Avoid this by choosing proper NB
● ATLAS needs NB to be an even integer
● So, choose the largest NB that satisfies the
  cache inequality and is even and a multiple of
  MU and NU
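One way to sketch the refined choice (the exact trimming rule below is a plausible reading of the model, not ATLAS source code): find the largest NB meeting the cache inequality, then round down until it is even and divisible by MU and NU.

```c
/* Largest NB with ceil(NB^2/B1) + ceil(NB/B1) + 1 <= C1/B1 that is also
   even and a multiple of MU and NU.  c1 = L1 capacity in elements,
   b1 = line size in elements. */
int trim_nb(int c1, int b1, int mu, int nu)
{
    int nb;
    for (nb = 16; ; nb++) {   /* grow until the inequality first fails */
        long lhs = (nb * (long)nb + b1 - 1) / b1   /* ceil(NB^2/B1) */
                 + (nb + b1 - 1) / b1              /* ceil(NB/B1)   */
                 + 1;
        if (lhs > c1 / b1)
            break;
    }
    nb--;                      /* largest NB meeting the inequality */
    while (nb > 0 && (nb % 2 || nb % mu || nb % nu))
        nb--;                  /* trim to avoid fractional register tiles */
    return nb;
}
```

For example, with C1 = 4096 elements, B1 = 4, MU = 4, NU = 2, the cache bound gives NB = 63, which trims down to 60.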
Estimating MU and NU

● View register file as a software cache
  ○ that is fully associative
  ○ unit line size
  ○ capacity = # registers, NR


● ATLAS performs outer products of (MU x 1)
  and (1 x NU) vectors for register tiling
Contd.
● ATLAS allocates MU elements for A, NU
  elements for B, and MU*NU elements for C
● Also need LS registers to store temp values
  of multiplications to make use of pipelining
● So we have:
      (MU x NU) + NU + MU + LS <= NR
LS calculation will be shown later, NR is known.
Only unknowns are MU and NU.
Estimation Scheme
● Let MU = NU = u. Solve prev inequality for u

● Let MU = max (u, 1). Solve for NU

● Let NU = max (NU, 1)

● <MU,NU> = <max (MU,NU) ,min (MU,NU)>
Estimating KU

● Not limited by the size of the register file
● Limited by the size of I-Cache
● Unroll the innermost loop within the size
  constraints of instruction cache
● Avoid micro-MMM code cleanup
   ○ Trim KU so that it divides NB

   ○ Usually, KU = NB in most machines
Estimating LS

● Skew factor that ATLAS code generator
  uses to schedule dependent multiplication
  and addition operations for CPU Pipeline
● LS independent multiplications and LS-1
  independent additions between muli and
  corresponding addi should at least hide the
  latency of multiplication.
Estimating LS

● LX = latency of multiplication
● 2·LS - 1 independent instructions hide this
  latency
● So, 2·LS - 1 >= LX
● There may be multiple floating point units:
        (2·LS - 1) / |ALUFP| >= LX
● Solving for LS:
        LS = ⌈(LX·|ALUFP| + 1) / 2⌉
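The resulting closed form can be computed directly (hypothetical helper; lx is the multiply latency, n_alu the number of FP units):

```c
/* LS from the pipeline model (2*LS - 1)/|ALU_FP| >= Lx, i.e. the
   smallest LS with LS >= (Lx * nALU + 1) / 2:
   LS = ceil((lx * n_alu + 1) / 2). */
int estimate_ls(int lx, int n_alu)
{
    /* integer ceil of (lx*n_alu + 1)/2 */
    return (lx * n_alu + 2) / 2;
}
```

E.g. a 4-cycle multiply on one FP unit gives LS = 3: the 2·3 − 1 = 5 independent operations in flight cover the 4-cycle latency.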
Summary
1. Estimate FMA
2. Estimate LS:
     ○   LS = ⌈(LX·|ALUFP| + 1) / 2⌉
3. Estimate MU and NU
     ○   MU·NU + NU + MU + LS <= NR
     ○   Set MU = NU = u; solve for u
     ○   MU = max(1, u); solve for NU
     ○   NU = max(NU, 1); if MU < NU, swap MU and NU
4. Estimate NB
     ○   ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
     ○   Trim NB to be a multiple of 2, MU and NU
5. Estimate KU
     ○   Constrained by I-cache
     ○   Make KU divide NB
6. Estimate IF, NF
     ○   IF = 2, NF = 2
Experimental Results
Conclusions
● In all machines (other than Itanium), the
  codes performed almost as well as global
  search based codes
● Models to find parameters are much faster
● Might be difficult to implement analytical
  methods in compilers
  ○ This model is focused on only 1 application
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueShapeBlue
163 views54 slides
Cencora Executive Symposium by
Cencora Executive SymposiumCencora Executive Symposium
Cencora Executive Symposiummarketingcommunicati21
139 views14 slides

Recently uploaded(20)

Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or... by ShapeBlue
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
Zero to Cloud Hero: Crafting a Private Cloud from Scratch with XCP-ng, Xen Or...
ShapeBlue158 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue94 views
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ by ShapeBlue
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericConfidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
ShapeBlue88 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu365 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue163 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10126 views
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... by ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue117 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue253 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue123 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker50 views
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue103 views
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue by ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlueElevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
ShapeBlue179 views
DRBD Deep Dive - Philipp Reisner - LINBIT by ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue140 views
Digital Personal Data Protection (DPDP) Practical Approach For CISOs by Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash153 views
Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty62 views
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
ShapeBlue93 views
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue154 views
The Role of Patterns in the Era of Large Language Models by Yunyao Li
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language Models
Yunyao Li80 views

Auto Tuning

Context: High Performance Libraries
● Linear Algebra
   ○ BLAS, LAPACK, ScaLAPACK
● Signal/Image Processing
   ○ Vector Signal Image Processing Library (VSIPL)
● Distributed/Parallel Systems
   ○ Message Passing Interface (MPI)
● Can we implement libraries:
   ○ automatically and portably
   ○ incorporating platform-specific features
   ○ matching the performance of hand-tuned implementations by leveraging compiler technology
   ○ using domain-specific knowledge
AutoTuning
● Two-phase scheme for producing automatically tuned code
● Given: a program, its inputs and a machine
● Step 1: identify and generate a space of candidate implementations
● Step 2: select the fastest one using empirical modeling and/or automated experiments
Why not let the compiler worry?
● General purpose
   ○ whereas library generators can focus on specific problems
● Engineering
   ○ hard to modify a production compiler, and its effects are global
● Analysis
   ○ limited access to relevant run-time information
   ○ over-specified dependencies
   ○ correctness criteria

Compiler vs AutoTuner
● Input
   ○ Compiler: general-purpose source code
   ○ AutoTuner: a specification, including problem size, machine parameters and problem-specific transformations
● Output
   ○ Compiler: low-level machine code
   ○ AutoTuner: mostly high-level source (e.g. C code)
● Time to generate
   ○ Compiler: short (unless feedback/profiling is enabled)
   ○ AutoTuner: usually long (depends on the search space)
● Selecting an implementation
   ○ Compiler: mostly static analysis (rarely feedback tuning)
   ○ AutoTuner: automated empirical models and experiments

Some AutoTuning Projects
● Linear Algebra
   ○ PHiPAC: Portable High-Performance ANSI C
   ○ ATLAS: Automatically Tuned Linear Algebra Software
● Signal and Image Processing
   ○ FFTW: the Fastest Fourier Transform in the West
   ○ SPIRAL

PHiPAC (1997)
● Develops portable high-performance matrix-vector libraries in ANSI C
● Parametrized C-code generator
   ○ produces code according to certain guidelines
● Auto-tunes the code
● Exhaustive search over all parameters
● Claim: achieves over 90% of peak performance
PHiPAC Approach
● Parameters are architecture specific

Efficient Code Generation
● Studied several ANSI C compilers and determined the following division of labor
● Rely on the compiler for:
   ○ register allocation
   ○ instruction selection and scheduling
● Manually perform:
   ○ register/cache blocking
   ○ loop unrolling
   ○ software pipelining, etc.
Use Local Variables to Explicitly Remove False Dependencies

Before:
    a[i] = b[i] + c;
    a[i+1] = b[i+1] * d;

After:
    float f1, f2;
    f1 = b[i];
    f2 = b[i+1];
    a[i] = f1 + c;
    a[i+1] = f2 * d;

The compiler cannot assume &a[i] != &b[i+1], so it is forced to store a[i] before loading b[i+1] (pointer aliasing).
False Dependencies
(Figure: generated code before and after removing the dependency.)
Exploit Multiple Registers
● Explicitly keep values in local variables
   ○ reduces memory bandwidth
   ○ otherwise the compiler would reload the fil values on every iteration (potential aliasing with res)

Before:
    while(...) {
        *res++ = fil[0] * sig[0]
               + fil[1] * sig[1];
        signal++;
    }

After:
    float f0 = fil[0];
    float f1 = fil[1];
    while(...) {
        *res++ = f0 * sig[0]
               + f1 * sig[1];
        signal++;
    }
Minimize Pointer Updates by Striding with Constant Offsets

Before:
    f0 = *r8; r8 += 4;
    f1 = *r8; r8 += 4;
    f2 = *r8; r8 += 4;

After:
    f0 = r8[0];
    f1 = r8[4];
    f2 = r8[8];
    r8 += 12;

Compilers can fold the constant index into a (register + offset) addressing mode.

Minimize Branches, Avoid Magnitude Compares
● Branches are costly
   ○ unroll loops
   ○ use do {} while(); loops to avoid loop-head branches
● Using == and != is cheaper

Before:
    for (i = 0, a = start_ptr;
         i < ARRAY_SIZE;
         i++, a++) {
        ...
    }

After:
    end_ptr = &a[ARRAY_SIZE];
    do {
        ...
        a++;
    } while (a != end_ptr);
Explicitly Unroll Loops
● Instruction-level parallelism

Before:
    while(...) {
        *res++ = fil[0] * sig[0]
               + fil[1] * sig[1];
        signal++;
    }

After:
    float f0, f1, s0, s1, s2;
    f0 = fil[0]; f1 = fil[1];
    s0 = sig[0]; s1 = sig[1];
    *res++ = (f0 * s0) + (f1 * s1);
    do {
        signal++;
        s0 = sig[0];
        res[0] = f0*s1 + f1*s2;
        s1 = sig[1];
        res[1] = f0*s2 + f1*s0;
        res += 2;
    } while (...);
Other Guidelines
● Balance the instruction mix
   ○ interleave 1 FP multiply, 1 FP add and 1-2 FP loads or stores
● Increase locality
   ○ arrange code to have unit-stride memory accesses and try to reuse data in cache
● Convert integer multiplies to adds
   ○ * and / are slower than +
Matrix Multiply Generators
● Produce C code following the PHiPAC guidelines
● C = α·op(A)·op(B) + β·C
   ○ M×K, K×N and M×N matrices
   ○ op(X) is either X or transpose(X)
● mm_cgen and mm_lgen
   ○ core (register blocking)
   ○ level (higher-level cache blocking)
● mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...

Blocked MMM

    for (i = 0; i < M; i += M0)
      for (j = 0; j < N; j += N0)
        for (l = 0; l < K; l += K0)
          for (r = i; r < i + M0; r++)
            for (s = j; s < j + N0; s++)
              for (t = l; t < l + K0; t++)
                c[r][s] += a[r][t] * b[t][s];
Code Generator

    $ mm_cgen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ]

(The blocking parameters M0, K0, N0 and optionally M1, K1, N1 are fed to mm_cgen, which emits optimized C.)

Usage and Options
Usage: mm_cgen [OPTIONS]
● Semantics options:
   ○ -op[ABC] [N|T] : [ABC] matrix op, Normal | Transpose
   ○ -no_fringes : don't generate M, K, or N register-block fringes
● Optimization options:
   ○ -l0/-l1 M0/M1 K0/K1 N0/N1 : register (L0) / cache (L1) blocking parameters
   ○ -sp [1|2lm|2ma|3] : software pipelining options

Contd.
● Precision options:
   ○ -prec/-sprec/-aprec/-dprec [single|double|ldouble] : precision (source, accumulator, destination)
● Misc. options:
   ○ -file name : write to file 'name'
   ○ -routine_name name : name of the routines

Optimal Block Sizes
● Use the search.pl script

Optimal Block Sizes
● Naive brute-force search
● For the register parameters:
   ○ NR/4 <= M0·N0 <= NR, where NR is the maximum number of registers
   ○ 1 <= K0 <= K0max; K0max = 20 (tunable)
● Benchmark all squares M = K = N = D
   ○ D runs over 2x, 3x, 10x and all primes
   ○ such that 3·D² fits in the L1 cache
Contd.
● For the L1 blocking parameters
● The square case (D × D)
● Search the neighborhood centered at 3·D² = L1
● Set the values of M1, K1, N1 to φ·D/M0 (and analogously for K0, N0)
   ○ where φ ∈ { 0.25, 0.5, 1.0, 1.5, 2.0 }
   ○ D = sqrt(L1/3)
   ○ 125 combinations

Naive Brute Force?
● The search takes too long
● Generates very lengthy code
● Very slow under full optimization
● Need a better search strategy

Smarter Search
● The majority of the computation is performed in register-blocked code
● Benchmark only in multiples of the register block size
● The search space of M0, N0, K0 is not reduced
   ○ prioritize the neighborhood of the best ones found
   ○ {M0-1, M0, M0+1}, etc.
● Terminate after reaching acceptable efficiency

Single Precision MMM (100 MHz SGI Indigo R4k)
Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

Double Precision MMM (HP 712/80i)
Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology

There is no Golden Hammer
Strengths:
● Automatic search for optimal parameters
● Produces portable ANSI C code
Weaknesses:
● Focus on uniprocessor machines
● No support for vector-based CPUs
● No control over instruction scheduling

Further Information
● http://www.icsi.berkeley.edu/~bilmes/phipac/
● http://www.inf.ethz.ch/personal/markusp/teaching/252-2600-ETH-fall11/slides/01-Dietiker.pdf
ATLAS
● Automatically Tuned Linear Algebra Software
● Generates an optimized BLAS library
● C and Fortran77
● Provides implementations of BLAS levels 1, 2 and 3
● We will focus on Matrix-Matrix Multiply (MMM)

Naive MMM
● C = A * B using 3 for-loops
● The dimensions of A, B and C are N×K, K×M and N×M respectively

Optimization for the L1 Cache
● The matrix is divided into NB × NB blocks
● Each block is called a mini-MMM
● The optimization parameter NB is chosen so that each mini-MMM fits in cache

Optimization for the Register File
● Mini-MMMs are further decomposed into micro-MMMs
● A micro-MMM multiplies an MU × 1 sub-matrix of A by a 1 × NU sub-matrix of B and accumulates the result into an MU × NU sub-matrix of C
● MU and NU are the optimization parameters
● Necessary condition: MU + NU + MU·NU <= NR
   ○ where NR = number of floating-point registers

Code (figure)
Pipeline Scheduling
● The 2 innermost loops (i'' and j'') are unrolled to create interleaved multiply and add statements
   ○ exploits instruction-level parallelism
● If there is a fused multiply-add, these 2 operations can be executed together
● The optimization parameter FMA indicates to the code generator whether this facility is available
Pipeline Scheduling
● MU + NU loads and stores
● MU · NU additions and multiplications
● The latency of these operations might stall the pipeline
● Solution: interleave the operations so that dependent operations are separated by a particular distance (what should that distance be?)
● This is governed by another optimization parameter, LS

Pipeline Scheduling
● Inject the MU + NU loads of A and B
● The loads are divided into:
   ○ an initial fetch (IF)
   ○ blocks of further load operations (NF)

Loop Unrolling
● KU is the optimization parameter that controls loop unrolling
● Constrained by the capacity of the instruction cache
● Should be neither too small (wasting the cache) nor too big (overflowing the instruction cache)

Other Optimizations
● Tiles of A are copied at the beginning of the outermost loop; these tiles are fully reused in each iteration of the j loop
● The jth vertical panel of B is copied before the beginning of the i loop
● Tile (i,j) of C is copied just before the k loop starts

Other Optimizations
● Choosing the loop order:
   ○ if N < M, use the JIK loop order (so that A completely fits into the L2 cache)
   ○ else if M < N, use the IJK loop order

Other Optimizations
● Copying A, B and C can be an overhead for smaller matrices
● Non-copying versions are generated, governed by the optimization parameter NCNB
● The non-copying version is used if:
   ○ M · N · K is less than a threshold
   ○ at least 1 dimension of 1 of the matrices is smaller than 3 · NCNB

Estimating Parameters
● Orthogonal search is used to optimize the parameters
● It is a heuristic and finds approximate solutions
   ○ no guarantee of an optimal solution
● It needs these details:
   ○ in what order are the parameters optimized?
   ○ the range of possible values for each parameter
   ○ the reference value used for parameter k during the optimization of parameters 1 to k-1
Estimating Machine Parameters
● The measured machine parameters are:
   ○ C1 : size of the L1 data cache
   ○ NR : number of floating-point registers
   ○ FMA : availability of a fused multiply-add
   ○ LS : amount of separation between dependent multiply and add instructions

Estimating Parameters
● Optimization sequence:
   ○ NB
   ○ MU and NU
   ○ KU
   ○ LS
   ○ IF, NF
   ○ NCNB

Finding NB
● Generate values in the range 16 <= NB <= min(80, √C1)
   ○ where C1 = size of the L1 data cache

Finding MU and NU
● Try all combinations that satisfy:
   ○ MU · NU + MU + NU + LS <= NR
● NB was obtained earlier

Finding LS, IF and NF
● LS
   ○ try values in the interval [1, 6]
   ○ the boundary value was fixed based on experiments
   ○ LS must divide MU · NU · KU (instruction scheduling)
● IF: search in the interval [2, MU + NU]
● NF: search in the interval [1, MU + NU - IF]

Finding NCNB
● Search in the range [NB : -4 : 4]
● Terminate the search when performance drops by 20% from the best solution found
Is Search Really Necessary?

Finding KU
● Constrained by the instruction cache
● Values between 4 and NB/2 are tried
● The special values 1 and NB are also considered

Empirical Optimization
● Estimating the optimal values is the key
   ○ compilers use analytical models
   ○ library generators (e.g. ATLAS) use search
● Empirical search:
   ○ generate a version of the program for each combination of parameters
   ○ execute each on the target machine and measure its performance
   ○ select the one that performs best
   ○ increased installation time!
● How is the search space bounded?
   ○ by the hardware parameters

Yotov et al.
● Realized that most optimizations used in the ATLAS code generator are already known to compilers
   ○ cache tiling, register tiling, etc.
● Replaced the search module with a parameter estimator based on standard analytical models
● The code generator is not modified
   ○ any performance change is due solely to differently chosen parameters

Analysis
● The results indicated that a simple, intuitive model can estimate near-optimal values for the parameters
● Focus on the ATLAS-generated code
● Notation:
   ○ ATLAS CGw/S : code generator with search
   ○ ATLAS Model : modified ATLAS (no search)
   ○ ATLAS Unleashed : hand-written code may be used, along with predefined architecture defaults for the parameter values, to produce the library

Model-Based Optimization
● Requires more machine parameters than the original ATLAS
   ○ no search!
● Empirical optimizers:
   ○ approximate values of the machine parameters are okay
   ○ they are only used to bound the search space
● Model-based optimizers:
   ○ need accurate values
   ○ the authors developed a tool called X-RAY to measure them accurately

Hardware Parameters
● C1, B1 : the capacity and the line size of the L1 data cache
● CI : the capacity of the L1 instruction cache
● Lx : the hardware latency of the floating-point multiply instruction
● |ALUFP| : the number of floating-point functional units
● NR : the number of floating-point registers
● FMA : the availability of a fused multiply-add instruction

Estimating NB
● Consider an L1 cache that is fully associative, with optimal replacement and unit line size
● The working set of the mini-MMM loop has 3 blocks of size NB × NB:
   3·NB² <= C1
● In the innermost loop, an element of C, once computed, is not used again; similarly, only 1 column of B is needed in cache:
   NB² + NB + 1 <= C1

Refined Estimate of NB
● Correcting for a non-unit line size B1:
   ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
Further Refinement
● The estimated NB may not be a multiple of MU and NU
● This causes fractional register tiles and extra clean-up code
● Avoid this by choosing a suitable NB
● ATLAS needs NB to be an even integer
● So NB is trimmed down to the largest even multiple of MU and NU that still satisfies the cache constraint
Estimating MU and NU
● View the register file as a software cache
   ○ that is fully associative
   ○ with unit line size
   ○ with capacity = the number of registers, NR
● ATLAS performs outer products of (MU × 1) and (1 × NU) vectors for register tiling

Contd.
● ATLAS allocates MU elements for A, NU elements for B and MU·NU elements for C
● LS registers are also needed to hold temporary results of the multiplications, to make use of pipelining
● So we have:
   (MU × NU) + NU + MU + LS <= NR
● The LS calculation is shown later; NR is known; the only unknowns are MU and NU

Estimation Scheme
● Let MU = NU = u; solve the previous inequality for u
● Let MU = max(u, 1); solve for NU
● Let NU = max(NU, 1)
● ⟨MU, NU⟩ = ⟨max(MU, NU), min(MU, NU)⟩
Estimating KU
● Not limited by the size of the register file
● Limited by the size of the instruction cache
● Unroll the innermost loop within the size constraints of the instruction cache
● Avoid micro-MMM clean-up code
   ○ trim KU so that it divides NB
   ○ usually KU = NB on most machines

Estimating LS
● The skew factor that the ATLAS code generator uses to schedule dependent multiply and add operations for the CPU pipeline
● The LS independent multiplications and LS - 1 independent additions between mult_i and the corresponding add_i should at least hide the latency of the multiplication

Estimating LS
● Lx = latency of multiplication
● 2·LS - 1 independent instructions hide this latency
● So, 2·LS - 1 >= Lx
● There may be multiple floating-point units:
   (2·LS - 1) / |ALUFP| >= Lx
● Solving for LS:
   LS = ⌈(Lx · |ALUFP| + 1) / 2⌉
Summary
1. Estimate FMA
2. Estimate LS
3. Estimate MU and NU
   ○ MU·NU + NU + MU + LS <= NR
   ○ set MU = NU = u and solve for u
   ○ MU = max(1, u); solve for NU
   ○ NU = max(NU, 1); if MU < NU, swap MU and NU
4. Estimate NB
   ○ ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
   ○ trim NB to be a multiple of 2, MU and NU
5. Estimate KU
   ○ constrained by the I-cache
   ○ make KU divide NB
6. Estimate IF, NF
   ○ IF = 2, NF = 2
Conclusions
● On all machines other than Itanium, the model-based codes performed almost as well as the global-search-based codes
● Models find the parameters much faster
● It might be difficult to implement these analytical methods in compilers
   ○ this model is focused on only 1 application