
Auto Tuning


Presentation on Auto Tuning delivered as part of our "Software for Multicore Processors" course at UT Austin. It covers the basics of autotuning and the details of two library generators, PHiPAC and ATLAS.

Published in: Technology


  1. Auto Tuning (Hemanth and Siddharth, UT Austin)
  2. Basics
  3. What is Auto Tuning?
     ● Several definitions
       ○ First result on Wikipedia: "Auto-Tune is an audio processor created by Antares Audio Technologies"
     ● A working definition
       ○ Autotuning is an automatic process for selecting one out of several possible solutions to a computational problem
     ● Techniques used by: library generators, compilers, and runtime systems
  4. Possible Versions of a Solution
     ● The solutions may differ in the
       ○ algorithm (quicksort vs. selection sort)
       ○ implementation (loop unrolling)
     ● The versions may result from
       ○ transformations (unroll, tile, interchange)
     ● The versions could be generated
       ○ manually by the programmer (coding or directives)
       ○ automatically by the compiler
  5. Motivation
     ● Increasing diversity of computing platforms
     ● New influences on the execution of parallel applications
       ○ hierarchical structure
       ○ heterogeneity of the processors
     ● Design efficient software that takes full advantage of such systems
     ● Solving a target problem with a single algorithm is not always efficient everywhere
  6. First Ideas
     ● Poly-algorithms
       ○ (1969) John Rice (Purdue): "A polyalgorithm for the automatic solution of nonlinear equations"
     ● Profiling and feedback-assisted compilation
       ○ (1982) S. Graham: gprof
       ○ (1991) P. Chang et al.: "Using profile information to assist classic code optimizations"
     ● Code generation
       ○ (1989) J. Johnson: "A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures"
       ○ (1992) M. Covell: "Computer-aided algorithm design and arrangement"
  7. Context: High-Performance Libraries
     ● Linear algebra
       ○ BLAS, LAPACK, ScaLAPACK
     ● Signal/image processing
       ○ Vector Signal Image Processing Library (VSIPL)
     ● Distributed/parallel systems
       ○ Message Passing Interface (MPI)
     ● Can we implement libraries
       ○ automatically and portably,
       ○ incorporating platform-specific features,
       ○ matching the performance of hand-tuned implementations,
       ○ leveraging compiler technology,
       ○ using domain-specific knowledge?
  8. AutoTuning
     ● A 2-phase scheme for producing automatically tuned code
     ● Given: a program, its inputs, and a machine
     ● Step 1: identify and generate a space of candidate implementations
     ● Step 2: select the fastest one using empirical modeling and/or automated experiments
  9. Why not let the compiler worry?
     ● General purpose
       ○ whereas library generators can focus on specific problems
     ● Engineering
       ○ hard to modify a production compiler, and its effects are global
     ● Analysis
       ○ limited access to relevant run-time information
       ○ over-specified dependencies
       ○ correctness criteria
  10. Compiler vs. AutoTuner
      ● Input: the compiler takes general-purpose source code; the autotuner takes a specification including problem size, machine parameters, and problem-specific transformations
      ● Output: the compiler emits low-level machine code; the autotuner emits mostly high-level source (e.g. C code)
      ● Time to generate: the compiler is short (unless feedback/profiling is enabled); the autotuner is usually long (depends on the search space)
      ● Selecting an implementation: the compiler uses mostly static analysis (rarely feedback tuning); the autotuner uses automated empirical models and experiments
  11. Some AutoTuning Projects
      ● Linear algebra
        ○ PHiPAC: Portable High-Performance ANSI C
        ○ ATLAS: Automatically Tuned Linear Algebra Software
      ● Signal and image processing
        ○ FFTW: the Fastest Fourier Transform in the West
        ○ SPIRAL
  12. PHiPAC
  13. Traditional Approach: Hand-Tuned Libraries
  14. PHiPAC (1997)
      ● Develops portable high-performance matrix-vector libraries in ANSI C
      ● Parameterized C-code generator
        ○ produces code according to certain guidelines
      ● Auto-tunes the code
      ● Exhaustive search over all parameters
      ● Claim: achieves over 90% of peak performance
  15. PHiPAC Approach: Generate Optimized C Code
  16. PHiPAC Approach: Parameters are Architecture-Specific
  17. Efficient Code Generation
      ● Studied several ANSI C compilers and determined that it is best to:
      ● Rely on compilers for:
        ○ register allocation
        ○ instruction selection and scheduling
      ● Manually perform:
        ○ register/cache blocking
        ○ loop unrolling
        ○ software pipelining, etc.
  18. Use Local Variables to Explicitly Remove False Dependencies
      Before:
          a[i] = b[i] + c;
          a[i+1] = b[i+1] * d;
      After:
          float f1, f2;
          f1 = b[i]; f2 = b[i+1];
          a[i] = f1 + c; a[i+1] = f2 * d;
      The compiler may not assume &a[i] != &b[i+1], and so is forced to store a[i] before loading b[i+1] (pointer aliasing).
  19. False Dependencies vs. After Removing the Dependency (figure)
  20. Exploit Multiple Registers
      ● Explicitly keep values in local variables
        ○ reduces memory bandwidth
        ○ otherwise the compiler would reload the fil values on every iteration (potential aliasing with res)
      Before:
          while (...) {
              *res++ = fil[0] * sig[0]
                     + fil[1] * sig[1];
              signal++;
          }
      After:
          float f0 = fil[0];
          float f1 = fil[1];
          while (...) {
              *res++ = f0 * sig[0]
                     + f1 * sig[1];
              signal++;
          }
  21. Minimize Pointer Updates by Striding with Constant Offsets
      Before:
          f0 = *r8; r8 += 4;
          f1 = *r8; r8 += 4;
          f2 = *r8; r8 += 4;
      After:
          f0 = r8[0];
          f1 = r8[4];
          f2 = r8[8];
          r8 += 12;
      Compilers can fold the constant index into a (register + offset) addressing mode.
  22. Minimize Branches, Avoid Magnitude Compares
      ● Branches are costly
        ○ unroll loops
        ○ use do {} while (); loops to avoid loop-head branches
      ● Using == and != is cheaper
      Before:
          for (i = 0, a = start_ptr;
               i < ARRAY_SIZE;
               i++, a++) {
              ...
          }
      After:
          end_ptr = &a[ARRAY_SIZE];
          do {
              ...
              a++;
          } while (a != end_ptr);
  23. Explicitly Unroll Loops
      ● Exposes instruction-level parallelism
      Before:
          while (...) {
              *res++ = fil[0] * sig[0]
                     + fil[1] * sig[1];
              signal++;
          }
      After:
          float f0, f1, s0, s1, s2;
          f0 = fil[0]; f1 = fil[1];
          s0 = sig[0]; s1 = sig[1];
          *res++ = (f0*s0) + (f1*s1);
          do {
              signal++;
              s0 = sig[0];
              res[0] = f0*s1 + f1*s2;
              s1 = sig[1];
              res[1] = f0*s2 + f1*s0;
              res += 2;
          } while (...);
  24. Other Guidelines
      ● Balance the instruction mix
        ○ interleave 1 FP multiply, 1 FP add, and 1-2 FP loads or stores
      ● Increase locality
        ○ arrange code to have unit-stride memory accesses and try to reuse data in cache
      ● Convert integer multiplies to adds
        ○ * and / are slower than +
  25. Matrix Multiply Generators
      ● Produce C code following the PHiPAC guidelines
      ● C = α·op(A)·op(B) + β·C
        ○ A, B, and C are MxK, KxN, and MxN matrices
        ○ op(X) is either X or transpose(X)
      ● mm_cgen and mm_lgen
        ○ core (register blocking)
        ○ level (higher-level cache blocking)
      ● mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...
  26. Blocked MMM
      for (i = 0; i < M; i += M0)
        for (j = 0; j < N; j += N0)
          for (l = 0; l < K; l += K0)
            for (r = i; r < i + M0; r++)
              for (s = j; s < j + N0; s++)
                for (t = l; t < l + K0; t++)
                  c[r][s] += a[r][t] * b[t][s];
  27. Code Generator
      $ mm_cgen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ]
      (figure: the blocking parameters M0 K0 N0 and M1 K1 N1 feed mm_cgen, which emits optimized C)
  28. Usage and Options
      Usage: mm_cgen [OPTIONS]
      ● Semantics options:
        ○ -op[ABC] [N|T] : [ABC] matrix op, Normal | Transpose
        ○ -no_fringes : don't generate M, K, or N register-block fringes
      ● Optimization options:
        ○ -l0 M0 K0 N0 / -l1 M1 K1 N1 : register (L0) / cache (L1) blocking parameters
        ○ -sp [1|2lm|2ma|3] : software pipelining options
  29. Usage and Options (contd.)
      ● Precision options:
        ○ prec/sprec/aprec/dprec [single|double|ldouble] : precision (source, accumulator, destination)
      ● Misc. options:
        ○ file name : write to file 'name'
        ○ routine_name name : name of the generated routines
  30. Optimal Block Sizes: Use the Script
  31. Optimal Block Sizes
      ● Naive brute-force search
      ● For the register parameters:
        ○ NR/4 <= M0·N0 <= NR, where NR is the maximum number of registers
        ○ 1 <= K0 <= K0max, with K0max = 20 (tunable)
      ● Benchmark all squares M = K = N = D
        ○ D runs over 2x, 3x, 10x and all primes
        ○ such that 3D² fits in the L1 cache
  32. Optimal Block Sizes (contd.)
      ● For the L1 blocking parameters, consider the square case (D x D)
      ● Search the neighborhood centered at 3D² = L1
      ● Set the values of M1, K1, N1 to ϕD/M0
        ○ where ϕ ∈ {0.25, 0.5, 1.0, 1.5, 2.0}
        ○ D = sqrt(L1/3)
        ○ 125 combinations
  33. Naive Brute Force?
      ● The search takes too long
      ● Generates very lengthy code
      ● Very slow under full optimization
      ● Need a better search strategy
  34. Smarter Search
      ● The majority of the computation is performed in the register-blocked code
      ● Benchmark only in multiples of the register block size
      ● The search space of M0, N0, K0 is not reduced
        ○ prioritize the neighborhood of the best ones found
        ○ {M0-1, M0, M0+1}, etc.
      ● Terminate after reaching acceptable efficiency
  35. Evaluation
  36. Single-Precision MMM (100 MHz SGI Indigo R4k)
      Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
  37. Double-Precision MMM (HP 712/80i)
      Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
  38. There Is No Golden Hammer
      Strengths:
      ● Automatic search for optimal parameters
      ● Produces portable ANSI C code
      Weaknesses:
      ● Focus on uniprocessor machines
      ● No support for vector-based CPUs
      ● No control over instruction scheduling
  39. Further Information
      ● http://www.inf.ethz.ch/personal/markusp/teaching/252-2600-ETH-fall11/slides/01-Dietiker.pdf
  40. ATLAS (Siddharth Subramanian)
  41. ATLAS
      ● Automatically Tuned Linear Algebra Software
      ● Generates an optimized BLAS library
      ● C and Fortran77 interfaces
      ● Provides implementations of BLAS levels 1, 2, and 3
      ● We will focus on matrix-matrix multiply (MMM)
  42. Naive MMM
      ● C = A * B using 3 for-loops
      ● The dimensions of A, B, and C are NxK, KxM, and NxM respectively
  43. Optimization for the L1 Cache
      ● The matrix is divided into NB x NB blocks
      ● Each block is called a mini-MMM
      ● The optimization parameter NB is chosen such that each mini-MMM fits in cache
  44. Optimization for the L1 Cache (figure)
  45. Optimization for the Register File
      ● Mini-MMMs are further decomposed into micro-MMMs
      ● A micro-MMM multiplies an MU x 1 sub-matrix of A by a 1 x NU sub-matrix of B and accumulates the result into an MU x NU sub-matrix of C
      ● MU and NU are the optimization parameters
      ● Necessary condition: MU + NU + MU*NU <= NR
        ○ where NR is the number of floating-point registers
  46. Mini- and Micro-MMM (figure)
  47. Code
  48. Pipeline Scheduling
      ● The 2 innermost loops (i and j) are unrolled to create interleaved multiply and add statements
      ● Exploits instruction-level parallelism
      ● If there is a fused multiply-add, then these 2 operations can be executed together
      ● The optimization parameter FMA indicates to the code generator whether this facility is available
  49. Pipeline Scheduling (contd.)
      ● MU + NU loads and stores
      ● MU * NU additions and multiplications
      ● The latency of operations might stall the pipeline
      ● Solution: interleave the operations such that dependent operations are separated by a particular distance (what would that be?)
      ● This is governed by another optimization parameter, LS
  50. Pipeline Scheduling (contd.)
      ● Inject the MU + NU loads of A and B
      ● The loads are divided into:
        ○ an initial fetch (IF)
        ○ blocks of other load operations (NF)
  51. Loop Unrolling
      ● KU is the optimization parameter that controls loop unrolling
      ● Constrained by the capacity of the instruction cache
      ● Should be neither too small (wasting the cache) nor too big (overflowing the instruction cache)
  52. Other Optimizations
      ● Copying tiles of A is done at the beginning of the outermost loop; these tiles are fully reused in each iteration of the j loop
      ● Copying the jth vertical panel of B is done before the beginning of the i loop
      ● Copying tile (i, j) of C is done just before the k loop starts
  53. Other Optimizations (contd.)
      ● Choosing the loop order:
        ○ if N < M, use the JIK loop order (so that A completely fits into the L2 cache)
        ○ else if M < N, use the IJK loop order
  54. Other Optimizations (contd.)
      ● Copying A, B, and C might be an overhead for smaller matrices
      ● Non-copying versions are generated with the optimization parameter NCNB
      ● This version is used if:
        ○ M * N * K is less than a threshold, or
        ○ at least one dimension of one of the matrices is smaller than 3 * NCNB
  55. Estimating Parameters
      ● Orthogonal search is used to optimize the parameters
      ● It is a heuristic and finds approximate solutions
      ● There is no guarantee of an optimal solution
      ● It needs these details:
        ○ in what order are the parameters optimized?
        ○ the possible range of values for each parameter
        ○ the reference value used for parameter k while parameters 1 to k-1 are being optimized
  56. Summary of Parameters
  57. Estimating Machine Parameters
      The machine parameters measured are:
      ● C1: size of the L1 data cache
      ● NR: number of floating-point registers
      ● FMA: availability of a fused multiply-add
      ● LS: amount of separation between dependent multiply and add instructions
  58. Estimating Parameters: Optimization Sequence
      ● NB
      ● MU and NU
      ● KU
      ● LS
      ● IF, NF
      ● NCNB
  59. Finding NB
      ● Generate values in the range 16 <= NB <= min(80, √C1), where C1 is the size of the L1 data cache
  60. Finding MU and NU
      ● Try all combinations that satisfy MU * NU + MU + NU + LS <= NR
      ● NB was obtained earlier
  61. Finding LS, IF, and NF
      ● LS:
        ○ tries values in the interval [1, 6]
        ○ the boundary value was fixed based on experiments
        ○ must divide MU * NU * KU (instruction scheduling)
      ● IF: searched in the interval [2, MU + NU]
      ● NF: searched in the interval [1, MU + NU - IF]
  62. Finding NCNB
      ● Searches in the range [NB : -4 : 4]
      ● Terminates the search when performance drops by 20% from the best solution found
  63. Is Search Really Necessary?
  64. Finding KU
      ● Constrained by the instruction cache
      ● Values between 4 and NB/2 are tried
      ● The special values 1 and NB are also considered
  65. Empirical Optimization
      ● Estimating optimal parameter values is the key
        ○ compilers use analytical models
        ○ library generators (e.g. ATLAS) use search
      ● Empirical search:
        ○ get a version of the program for each combination of parameters
        ○ execute it on the target machine and measure performance
        ○ select the one that performs best
        ○ increased installation time!
      ● How is the search space bounded?
        ○ by the hardware parameters
  66. Yotov et al.
      ● Realised that most optimizations used by the ATLAS code generator are already known to compilers
        ○ cache tiling, register tiling, etc.
      ● Replaced the search module with a parameter estimator based on standard analytical models
      ● The code generator is not modified
        ○ any performance change is solely due to differently chosen parameters
  67. ATLAS Architecture
  68. Analysis
      ● Results indicated that a simple and intuitive model is able to estimate near-optimal values for the parameters
      ● Focus on the ATLAS-generated code
      ● Notation:
        ○ ATLAS CGw/S: code generator with search
        ○ ATLAS Model: modified ATLAS (no search)
        ○ ATLAS Unleashed: hand-written code may be used, along with predefined architectural defaults for the parameter values, to produce the library
  69. Model-Based Optimization
      ● Requires more machine parameters than original ATLAS
        ○ no search!
      ● Empirical optimizers:
        ○ approximate values of the machine parameters are okay
        ○ they are only used to bound the search space
      ● Model-based optimizers:
        ○ need accurate values
        ○ the authors developed a tool called X-RAY to measure them accurately
  70. Hardware Parameters
      ● C1, B1: the capacity and line size of the L1 data cache
      ● CI: the capacity of the L1 instruction cache
      ● LX: the hardware latency of the floating-point multiply instruction
      ● |ALUFP|: the number of floating-point functional units
      ● NR: the number of floating-point registers
      ● FMA: the availability of a fused multiply-add instruction
  71. Estimating NB
      ● Consider an idealized L1 cache: fully associative, optimal replacement, unit line size
      ● The working set of the mini-MMM loop has 3 blocks of NB x NB:
            3NB² <= C1
      ● In the innermost loop, an element of C, once computed, is not used again; similarly, only one column of B is needed in cache:
            NB² + NB + 1 <= C1
  72. Refined Estimate of NB
      ● Correcting for a non-unit line size B1:
            ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
  73. Further Refinement
      ● The estimated NB may not be a multiple of MU and NU
      ● This would cause fractional register tiles and extra clean-up code
      ● Avoid this by choosing NB properly
      ● ATLAS needs NB to be an even integer
      ● So: pick the largest NB satisfying the cache inequality that is also a multiple of 2, MU, and NU
  74. Estimating MU and NU
      ● View the register file as a software cache that is
        ○ fully associative
        ○ unit line size
        ○ capacity = number of registers, NR
      ● ATLAS performs outer products of (MU x 1) and (1 x NU) vectors for register tiling
  75. Estimating MU and NU (contd.)
      ● ATLAS allocates MU elements for A, NU elements for B, and MU*NU elements for C
      ● It also needs LS registers to hold temporary values of the multiplications, to make use of pipelining
      ● So we have: (MU x NU) + NU + MU + LS <= NR
      ● The LS calculation is shown later; NR is known; the only unknowns are MU and NU
  76. Estimation Scheme
      ● Let MU = NU = u; solve the previous inequality for u
      ● Let MU = max(u, 1); solve for NU
      ● Let NU = max(NU, 1)
      ● <MU, NU> = <max(MU, NU), min(MU, NU)>
  77. Estimating KU
      ● Not limited by the size of the register file
      ● Limited by the size of the instruction cache
      ● Unroll the innermost loop within the size constraints of the instruction cache
      ● Avoid micro-MMM clean-up code
        ○ trim KU so that it divides NB
        ○ usually KU = NB on most machines
  78. Estimating LS
      ● The skew factor that the ATLAS code generator uses to schedule dependent multiplication and addition operations for the CPU pipeline
      ● LS independent multiplications and LS - 1 independent additions between a multiply and its corresponding add should at least hide the latency of the multiplication
  79. Estimating LS (contd.)
      ● LX = latency of multiplication
      ● 2·LS - 1 independent instructions hide this latency
      ● So: 2·LS - 1 >= LX
      ● There may be multiple floating-point units: (2·LS - 1) / |ALUFP| >= LX
      ● Solving for LS: LS = ⌈(LX · |ALUFP| + 1) / 2⌉
  80. Summary
      1. Estimate FMA
      2. Estimate LS: LS = ⌈(LX · |ALUFP| + 1) / 2⌉
      3. Estimate MU and NU:
         ○ MU*NU + NU + MU + LS <= NR
         ○ set MU = NU = u and solve for u
         ○ MU = max(1, u); solve for NU
         ○ NU = max(NU, 1); if MU < NU, swap MU and NU
      4. Estimate NB:
         ○ ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
         ○ trim NB to be a multiple of 2, MU, and NU
      5. Estimate KU:
         ○ constrained by the I-cache
         ○ make KU divide NB
      6. Estimate NF, IF:
         ○ IF = 2, NF = 2
  81. Experimental Results
  82. Conclusions
      ● On all machines (other than Itanium), the model-generated codes performed almost as well as the global-search-based codes
      ● Models find the parameters much faster
      ● It might still be difficult to implement these analytical methods in general compilers
        ○ this model is focused on only one application