Presentation on Auto Tuning delivered as part of our "Software for Multicore Processors" course at UT Austin. It covers the basics of autotuning and the details of two library generators, PHiPAC and ATLAS.


- 1. Auto Tuning Hemanth and Siddharth UT Austin
- 2. Basics
- 3. What is Auto Tuning? ● Several Definitions ○ First result on Wikipedia - "Auto-Tune is an audio processor created by Antares Audio Technologies " ● A Definition ○ Autotuning is an automatic process for selecting one out of several possible solutions to a computational problem. ● Techniques used by: ○ Library generators, Compilers and Runtime systems
- 4. Possible Versions of a Solution ● The solutions may differ in the ○ algorithm (quicksort vs selection sort) ○ implementation (loop unroll). ● The versions may result from ○ transformations (unroll, tile, interchange) ● The versions could be generated by ○ programmer manually (coding or directives) ○ compiler automatically
- 5. Motivation ■ Increasing diversity of computing platforms ■ New influences on the execution of parallel applications ○ Hierarchical structure ○ Heterogeneity of the processors ■ Goal: design efficient software that takes full advantage of such systems ■ Solving a target problem with a single algorithm is not efficient everywhere
- 6. First Ideas ● Poly-Algorithms ○ (1969) John Rice (Purdue): "A polyalgorithm for the automatic solution of nonlinear equations" ● Profiling and feedback-assisted compilation ○ (1982) S. Graham et al.: gprof ○ (1991) P. Chang et al.: "Using profile information to assist classic code optimizations" ● Code generation ○ (1989) J. Johnson et al.: "A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures" ○ (1992) M. Covell et al.: "Computer-aided algorithm design and arrangement"
- 7. Context: High-Performance Libraries ● Linear Algebra ○ BLAS, LAPACK, ScaLAPACK ● Signal/Image Processing ○ Vector Signal Image Processing Library (VSIPL) ● Distributed/Parallel Systems ○ Message Passing Interface (MPI) ● Can we implement libraries: ○ automatically and portably ○ incorporating platform-specific features ○ matching the performance of hand-tuned implementations ○ leveraging compiler technology ○ using domain-specific knowledge
- 8. AutoTuning ● Two-phase scheme for producing automatically tuned code ● Given: program, inputs, machine ● Step 1: identify and generate a space of candidate implementations ● Step 2: select the fastest one using empirical modeling and/or automated experiments (a minimal sketch follows)
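To make the two-phase idea concrete, here is a minimal sketch in C: two hypothetical candidate kernels (stand-ins for generator output) are timed on the target machine and the fastest is kept. All names are illustrative, not part of any real autotuner.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical candidate implementations of the same kernel. */
typedef void (*kernel_fn)(int n, const double *a, double *b);

static void candidate_plain(int n, const double *a, double *b) {
    for (int i = 0; i < n; i++) b[i] = 2.0 * a[i];
}

static void candidate_unroll2(int n, const double *a, double *b) {
    int i;
    for (i = 0; i + 1 < n; i += 2) { b[i] = 2.0*a[i]; b[i+1] = 2.0*a[i+1]; }
    for (; i < n; i++) b[i] = 2.0 * a[i];
}

/* Step 2: run automated experiments on this machine, keep the fastest. */
static kernel_fn select_fastest(kernel_fn cand[], int ncand,
                                int n, const double *a, double *b) {
    kernel_fn best = cand[0];
    double best_t = 1e30;
    for (int c = 0; c < ncand; c++) {
        clock_t t0 = clock();
        for (int rep = 0; rep < 10000; rep++) cand[c](n, a, b);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best = cand[c]; }
    }
    return best;
}

int main(void) {
    static double a[1024], b[1024];
    /* Step 1: the space of candidate implementations. */
    kernel_fn cands[] = { candidate_plain, candidate_unroll2 };
    kernel_fn best = select_fastest(cands, 2, 1024, a, b);
    printf("picked %s\n", best == candidate_plain ? "plain" : "unroll2");
    return 0;
}
```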
- 9. Why not let the compiler worry? ● General purpose ○ whereas library generators can focus on specific problems ● Engineering ○ hard to modify a production compiler, and its effects are global ● Analysis ○ limited access to relevant run-time information ○ over-specified dependencies ○ correctness criteria
- 10. Compiler vs AutoTuner

| | Compiler | AutoTuner |
|---|---|---|
| Input | General-purpose source code | Specification including problem size, machine parameters and problem-specific transformations |
| Output | Low-level machine code | Mostly high-level source (e.g. C code) |
| Time to generate | Short (unless feedback/profiling enabled) | Usually long (depends on search space) |
| Select implementation | Mostly static analysis (rarely feedback tuning) | Automated empirical models and experiments |
- 11. Some AutoTuning Projects ● Linear Algebra ○ Portable High-Performance ANSI C ■ PHiPAC ○ Automatically Tuned Linear Algebra Software ■ ATLAS ● Signal and Image Processing ○ Fast Fourier Transformations of the West ■ FFTW ○ SPIRAL
- 12. PHiPAC
- 13. Traditional Approach Hand Tuned Libraries
- 14. PHiPAC (1997) ● Developing Portable High-Performance matrix/vector libraries in ANSI C ● Parametrized C-code generator ○ produces code according to certain guidelines ● Auto-tune the code ● Exhaustive search over all parameters ● Claim: achieves over 90% of peak performance, in some cases beating vendor-supplied libraries
- 15. PHiPAC Approach Generate Optimized C Code
- 16. PHiPAC Approach Parameters are Architecture Specific
- 17. Efficient Code Generation ● Studied several ANSI C compilers and determined it is best to: ● Rely on the compiler for: ○ register allocation ○ instruction selection and scheduling ● Manually perform: ○ register/cache blocking ○ loop unrolling ○ software pipelining, etc.
- 18. Local variables to explicitly remove false dependencies

Before:
```c
a[i]   = b[i]   + c;
a[i+1] = b[i+1] * d;
```

After:
```c
float f1, f2;
f1 = b[i];
f2 = b[i+1];
a[i]   = f1 + c;
a[i+1] = f2 * d;
```

The compiler may not assume `&a[i] != &b[i+1]`, so it is forced to store `a[i]` before loading `b[i+1]` (pointer aliasing).
- 19. False Dependencies [Figure: performance before vs. after removing the false dependency]
- 20. Exploit Multiple Registers ● Explicitly keep values in local variables ○ Reduces memory bandwidth ○ Otherwise the compiler reloads the `fil` values on every iteration (potential aliasing with `res`)

Before:
```c
while (...) {
    *res++ = fil[0] * sig[0]
           + fil[1] * sig[1];
    sig++;
}
```

After:
```c
float f0 = fil[0];
float f1 = fil[1];
while (...) {
    *res++ = f0 * sig[0]
           + f1 * sig[1];
    sig++;
}
```
- 21. Minimize pointer updates by striding with constant offsets

Before:
```c
f0 = *r8; r8 += 4;
f1 = *r8; r8 += 4;
f2 = *r8; r8 += 4;
```

After:
```c
f0 = r8[0];
f1 = r8[4];
f2 = r8[8];
r8 += 12;
```

Compilers can fold the constant index into a (register + offset) addressing mode.
- 22. Minimize branches, avoid magnitude compares ● Branches are costly ○ Unroll loops ○ Use do {} while(); loops to avoid the loop-head branch ● Using == and != is cheaper than magnitude compares

Before:
```c
for (i = 0, a = start_ptr;
     i < ARRAY_SIZE;
     i++, a++) {
    ...
}
```

After:
```c
end_ptr = &a[ARRAY_SIZE];
do {
    ...
    a++;
} while (a != end_ptr);
```
- 23. Explicitly unroll loops ● Exposes instruction-level parallelism

Before:
```c
while (...) {
    *res++ = fil[0] * sig[0]
           + fil[1] * sig[1];
    sig++;
}
```

After (two-fold unrolled and software pipelined):
```c
float f0, f1, s0, s1, s2;
f0 = fil[0]; f1 = fil[1];
s0 = sig[0]; s1 = sig[1];
*res++ = (f0 * s0) + (f1 * s1);
do {
    sig += 2;
    s2 = sig[0];
    res[0] = f0 * s1 + f1 * s2;
    s0 = sig[1];
    res[1] = f0 * s2 + f1 * s0;
    s1 = s0;
    res += 2;
} while (...);
```
- 24. Other Guidelines ● Balance Instruction Mix ○ Interleave 1 FPM, 1 FPA and 1-2 FP loads or stores ● Increase Locality ○ Arrange code to have unit-stride memory accesses and try to reuse data in cache ● Convert Integer multiplies to adds ○ * and / are slower than +
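As an illustration of the last guideline, the sketch below converts the integer multiply in a row-major index computation into a running add (classic strength reduction). It is an illustrative example, not PHiPAC-generated code.

```c
/* Strength reduction on the diagonal walk of a row-major N x N matrix:
   the Before version multiplies on every iteration, the After version
   maintains the offset with a single add. */
double sum_diag_mul(const double *a, int N)   /* Before */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i * N + i];        /* integer multiply every iteration */
    return s;
}

double sum_diag_add(const double *a, int N)   /* After */
{
    double s = 0.0;
    int off = 0;
    for (int i = 0; i < N; i++) {
        s += a[off];
        off += N + 1;             /* the multiply becomes an add */
    }
    return s;
}
```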
- 25. Matrix Multiply Generators ● Produce C code with PHiPAC guidelines ● C = αop(A)op(B) + βC ○ MxK, KxN and MxN matrices ○ op(X) is either X or transpose(X) ● mm_cgen and mm_lgen ○ Core (register blocking) ○ Level (higher level cache blocking) ● mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...
- 26. Blocked MMM
```c
for (i = 0; i < M; i += M0)
  for (j = 0; j < N; j += N0)
    for (l = 0; l < K; l += K0)
      /* M0 x N0 x K0 register-blocked core */
      for (r = i; r < i + M0; r++)
        for (s = j; s < j + N0; s++)
          for (t = l; t < l + K0; t++)
            c[r][s] += a[r][t] * b[t][s];
```
- 27. Code Generator
```
$ mm_cgen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ]
```
[Figure: the blocking parameters M0, K0, N0 (and optionally M1, K1, N1) are fed to mm_cgen, which emits optimized C.]
- 28. Usage and Options Usage: mm_cgen [OPTIONS] ● Semantics options: ○ -op[ABC] [N|T] : [ABC] matrix op, Normal|Transpose ○ -no_fringes : don't generate M, K, or N register-block fringes ● Optimization options: ○ -l0/-l1 M0/M1 K0/K1 N0/N1 : register (L0) / cache (L1) blocking parameters ○ -sp [1|2lm|2ma|3] : software pipelining options
- 29. Contd. ● Precision options: ○ prec/sprec/aprec/dprec [single|double|ldouble] : Precision (source, accumulator, destination) ● Misc. options: ○ file name : Write to file ’name’ ○ routine_name name : Name of routines
- 30. Optimal Block Sizes Use the search.pl script
- 31. Optimal Block Sizes ● Naive brute-force search ● For register parameters: ○ NR/4 <= M0·N0 <= NR, where NR is the maximum number of registers ○ 1 <= K0 <= K0max, with K0max = 20 (tunable) ● Benchmark all squares M = K = N = D ○ D runs over 2x, 3x, 10x and all primes ○ such that 3D² fits in the L1 cache
- 32. Contd. ● For the L1 blocking parameters ● The square case (D x D) ● Search the neighborhood centered at 3D² = L1 ● Set the values of M1, K1, N1 to ϕ·D/M0 ○ where ϕ ∈ { 0.25, 0.5, 1.0, 1.5, 2.0 } ○ D = √(L1/3) ○ 125 combinations (5³; see the sketch below)
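A small sketch of where the 125 comes from, assuming a hypothetical benchmark_l1_block() timing routine: each of M1, K1, N1 independently takes one of the five ϕ-scaled values.

```c
#include <math.h>

void benchmark_l1_block(int M1, int K1, int N1);  /* hypothetical */

/* Enumerate the L1 blocking candidates: M1, K1, N1 each take one of five
   phi-scaled values, 5 * 5 * 5 = 125 combinations in total. */
void search_l1(int M0, double L1_elems)
{
    static const double phi[5] = { 0.25, 0.5, 1.0, 1.5, 2.0 };
    double D = sqrt(L1_elems / 3.0);      /* square case: 3*D*D fits in L1 */
    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 5; j++)
            for (int k = 0; k < 5; k++)
                benchmark_l1_block((int)(phi[i] * D / M0),
                                   (int)(phi[j] * D / M0),
                                   (int)(phi[k] * D / M0));
}
```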
- 33. Naive Brute Force? ● The search takes too long ● Generates very lengthy code ● Very slow under full optimization ● Need a better search strategy
- 34. Smarter Search ● Majority of the computation is performed in register blocked code ● Benchmark only in multiples of register block size ● Search space of M0, N0, K0 is not reduced ○ Prioritize neighborhood of the best ones found ○ {M0-1, M0, M0+1} etc. ● Terminate after reaching acceptable efficiency
- 35. Evaluation
- 36. Single Precision MMM (100 MHz SGI Indigo R4k) Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
- 37. Double Precision MMM (HP 712/80i) Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
- 38. There is no Golden Hammer
Strengths:
● Automatic search for optimal parameters
● Produces portable ANSI C code
Weaknesses:
● Focus on uniprocessor machines
● No support for vector-based CPUs
● No control over instruction scheduling
- 39. Further Information ● http://www.icsi.berkeley.edu/~bilmes/phipac/ ● http://www.inf.ethz.ch/personal/markusp/teaching/252-2600-ETH-fall11/slides/01-Dietiker.pdf
- 41. ATLAS ● Automatically Tuned Linear Algebra Software ● Generates optimized BLAS library ● C and Fortran77 ● Provides implementation for BLAS levels 1,2 and 3. ● We will focus on Matrix-Matrix-Multiply (MMM)
- 42. Naive MMM ● C = A * B using 3 for-loops ● Dimensions of A, B and C are NxK, KxM and NxM respectively.
- 43. Optimization for L1 cache ● Matrix divided into NB x NB blocks ● Each block is called mini-MMM ● Optimization parameter NB is chosen such that each mini-MMM fits in cache
- 44. Optimization for L1 cache
- 45. Optimization for register file ● Mini-MMMs are further decomposed into micro-MMMs ● A micro-MMM multiplies an MU x 1 sub-matrix of A by a 1 x NU sub-matrix of B and accumulates the result into an MU x NU sub-matrix of C ● Here MU and NU are the optimization parameters ● Necessary condition: MU + NU + MU·NU <= NR ● where NR = number of floating-point registers
- 46. Mini and Micro- MMM
- 47. Code
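As a stand-in for the code shown on this slide, here is a minimal sketch of a register-blocked micro-MMM with MU = NU = 2. It assumes A and B have been packed so each k-step is contiguous; it illustrates the shape of the kernel, not actual ATLAS output.

```c
/* Register-blocked micro-MMM sketch, MU = NU = 2: a 2x1 column of A times
   a 1x2 row of B accumulates into a 2x2 tile of C kept in scalar variables
   (which the compiler can hold in registers). Assumes A is packed as 2-wide
   column slices and B as 2-wide row slices. */
void micro_mmm_2x2(int K, const double *A, const double *B,
                   double *C, int ldc)
{
    double c00 = C[0],   c01 = C[1];
    double c10 = C[ldc], c11 = C[ldc + 1];
    for (int k = 0; k < K; k++) {
        double a0 = A[0], a1 = A[1];      /* MU = 2 loads of A */
        double b0 = B[0], b1 = B[1];      /* NU = 2 loads of B */
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;  /* MU*NU = 4 multiplies and adds */
        A += 2;  B += 2;                  /* advance one k-step in packed data */
    }
    C[0]   = c00;  C[1]       = c01;
    C[ldc] = c10;  C[ldc + 1] = c11;
}
```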
- 48. Pipeline scheduling ● The two innermost loops (i'' and j'') are unrolled to create interleaved multiply and add statements ● Exploits instruction-level parallelism ● If there is a fused multiply-add, the two operations can be executed together ● The optimization parameter FMA tells the code generator whether this facility is available
- 49. Pipeline scheduling ● MU + NU loads and stores ● MU * NU additions and multiplications ● Latency of operations might stall the pipeline ● Solution : Interleave the operations such that dependent operations are separated by a particular distance (What would that be?) ● This is governed by another optimization parameter - LS
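A hedged illustration of the skewing idea on a simple multiply-accumulate loop: with LS = 2, two products are kept in flight so each add consumes a product issued two slots earlier. This shows only the scheduling principle; ATLAS's actual scheduler operates on the unrolled micro-MMM body.

```c
/* Skewing with LS = 2: each add is paired with a multiply issued two
   "slots" earlier, so independent work separates a multiply from the
   add that depends on it. Assumes n >= 2. */
void skew_demo(const double *a, const double *b, double *c, int n)
{
    double m0 = a[0] * b[0];       /* two products in flight */
    double m1 = a[1] * b[1];
    for (int i = 2; i < n; i++) {
        c[i - 2] += m0;            /* add consumes a 2-slot-old product */
        m0 = m1;
        m1 = a[i] * b[i];          /* fresh multiply fills the pipeline */
    }
    c[n - 2] += m0;                /* drain the pipeline */
    c[n - 1] += m1;
}
```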
- 50. Pipeline scheduling ● Inject MU + NU loads of A and B ● Loads divided into: ○ Initial fetch (IF) ○ Blocks of other load operations (NF)
- 51. Loop Unrolling ● KU is the optimization parameter that controls unrolling of the innermost loop ● Constrained by the capacity of the instruction cache ● Should be neither too small (leaving the I-cache underused) nor too big (overflowing it)
- 52. Other Optimizations ● Copying tiles of A is done in the beginning of outermost loop. These tiles are fully reused in each iteration of j loop ● Copying jth vertical panel of B -- done before beginning of i loop. ● Copying tile (i,j) of C just before the "k" loop starts
- 53. Other optimizations ● Choosing loop order: ○ if N < M then JIK loop order (so that A completely fits into L2 cache) ○ else if M < N then IJK loop order
- 54. Other optimizations ● Copying A, B, C might be an overhead for smaller matrices ● Non-copying versions are generated, with optimization parameter NCNB ● This version is used if: ○ M * N * K is less than a threshold ○ at least one dimension of one of the matrices is smaller than 3 * NCNB
- 55. Estimating parameters ● Orthogonal search is used to optimize the parameters ● It is a heuristic and finds approximate solutions ● No guarantee of an optimal solution ● It needs these details: ○ the order in which parameters are optimized ○ a feasible range for each parameter ○ the reference value used for parameter k while parameters 1 to k−1 are being optimized
- 57. Estimating Machine Parameters Machine parameters are measured: ● C1 - Size of L1 data cache ● NR - Number of floating point registers ● FMA - Availability of fused multiply-add ● LS - Amount of separation between dependent multiply and add instructions
- 58. Estimating parameters Optimization sequence: ● NB ● MU and NU ● KU ● LS ● IF, NF ● NCNB
- 59. Finding NB ● Generates values in the range 16 <= NB <= min(80, √C1), where C1 = size of the L1 data cache (see the sketch below)
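A sketch of this line search, assuming a hypothetical time_mini_mmm() benchmarking routine; the step of 4 is an assumption, as the slide only gives the range.

```c
#include <math.h>

double time_mini_mmm(int nb);   /* hypothetical benchmark, returns MFLOPS */

int find_nb(double c1_elems)    /* c1_elems: L1 capacity in elements */
{
    int hi = (int)sqrt(c1_elems);
    if (hi > 80) hi = 80;
    int best_nb = 16;
    double best = 0.0;
    for (int nb = 16; nb <= hi; nb += 4) {   /* step of 4 is an assumption */
        double mflops = time_mini_mmm(nb);
        if (mflops > best) { best = mflops; best_nb = nb; }
    }
    return best_nb;
}
```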
- 60. Finding MU and NU ● All combinations that satisfy: ○ MU * NU + MU + NU + LS <= NR ● NB was obtained earlier
- 61. Finding LS and IF, NF ● LS: ○ tries values in the interval [1, 6] ○ the boundary value was fixed based on experiments ○ must divide MU * NU * KU (for instruction scheduling) ● IF: searches in the interval [2, MU + NU] ● NF: searches in the interval [1, MU + NU − IF]
- 62. Finding NCNB ● Searches in the range [NB : -4 : 4] ● Terminates search when performance drops by 20% of the best found solution
- 63. Is Search Really Necessary?
- 64. Finding KU ● Constrained by instruction cache ● Values between 4 and NB/2 are tried ● Special values 1 and NB are also considered
- 65. Empirical Optimization ● Estimation of optimal values is the key ○ Compilers use Analytical models ○ Library Generators (eg: ATLAS) use search ● Empirical Search: ○ Get a version of program for each combination of parameters ○ Execute it on the target machine and measure performance ○ Select the one that performs best ○ Increased installation time!! ● How is the search space bounded? ○ The hardware parameters
- 66. Yotov et al. ● Realised that most optimizations used in the ATLAS code generator are already known to compilers ○ cache tiling, register tiling, etc. ● Replaced the search module with a parameter estimator based on standard analytical models ● The code generator is not modified ○ Any performance change is solely due to differently chosen parameters
- 68. Analysis ● Results indicated that a simple and intuitive model is able to estimate near-optimal values for the parameters ● Focus on the ATLAS-generated code ● Notation: ○ ATLAS CGw/S - code generator with search ○ ATLAS Model - modified ATLAS (no search) ○ ATLAS Unleashed - hand-written code may be used, along with predefined architecture defaults for the parameter values, to produce the library
- 69. Model-Based Optimization ● Requires more machine parameters than original ATLAS ○ No Search!! ● Empirical optimizers: ○ Approximate values of machine params are okay ○ Only used to bound the search space ● Model-based Optimizers: ○ Need accurate values ○ Developed a tool called X-RAY to accurately measure them
- 70. Hardware Parameters ● C1,B1: the capacity and the line size of the L1 data cache ● CI : The capacity of the L1 instruction cache ● Lx: hardware latency of the floating-point multiply instruction ● |ALUFP |: number of floating-point functional units ● NR: the number of floating-point registers ● FMA: the availability of a fused multiply-add instruction
- 71. Estimating NB ● Consider an L1 cache that is fully associative, with optimal replacement and unit line size ● The working set of the mini-MMM loop has 3 blocks of NB x NB: 3·NB² <= C1 ● In the innermost loop, an element of C, once computed, is not used again; similarly, only one column of B is needed in the cache: NB² + NB + 1 <= C1
- 72. Refined Estimate of NB ● Correcting for non-unit line size: ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
- 73. Further Refinement ● The estimated NB may not be a multiple of MU and NU ● This would cause fractional register tiles and extra clean-up code ● Avoid this by choosing a proper NB ● ATLAS also needs NB to be an even integer ● So: trim NB down to the largest value that is a multiple of 2, MU and NU (a sketch of this computation follows)
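Putting the model together, a small sketch that computes the refined NB from C1, B1, MU and NU and then trims it; the helper name and the linear downward scan are illustrative choices, not ATLAS code.

```c
/* Model-based NB: largest NB with ceil(NB^2/B1) + ceil(NB/B1) + 1 <= C1/B1,
   then trimmed down to a multiple of 2, MU and NU.
   C1 and B1 are both measured in elements. */
int estimate_nb(int C1, int B1, int MU, int NU)
{
    int nb = 0;
    for (int t = 1; ; t++) {
        long lhs = ((long)t * t + B1 - 1) / B1   /* ceil(t^2 / B1) */
                 + (t + B1 - 1) / B1             /* ceil(t / B1)   */
                 + 1;
        if (lhs > C1 / B1) break;
        nb = t;                                  /* largest t that fits */
    }
    while (nb > 0 && (nb % 2 || nb % MU || nb % NU))
        nb--;                                    /* trim to a valid multiple */
    return nb;
}
```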
- 74. Estimating MU and NU ● View register file as a software cache ○ that is fully associative ○ unit line size ○ capacity = # registers, NR ● ATLAS performs outer products of (MU x 1) and (1 x NU) vectors for register tiling
- 75. Contd. ● ATLAS allocates MU elements for A, NU elements for B, and MU*NU elements for C ● Also need LS registers to store temp values of multiplications to make use of pipelining ● So we have: (MU x NU) + NU + MU + LS <= NR LS calculation will be shown later, NR is known. Only unknowns are MU and NU.
- 76. Estimation Scheme ● Let MU = NU = u; solve the previous inequality for u ● Set MU = max(u, 1); solve for NU ● Set NU = max(NU, 1) ● <MU, NU> = <max(MU, NU), min(MU, NU)> (a sketch of these steps follows)
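A minimal sketch of these steps in C, assuming LS and NR are already known; the closed-form seed for u is just one way to solve u² + 2u + LS <= NR. For example, NR = 32 and LS = 1 give u = 4 (25 <= 32), then NU = 5, swapped to <MU, NU> = <5, 4>.

```c
#include <math.h>

/* Solves u^2 + 2u + LS <= NR for u, then re-solves
   MU*NU + MU + NU + LS <= NR for NU with MU fixed. */
void estimate_mu_nu(int NR, int LS, int *MU, int *NU)
{
    if (NR - LS < 3) { *MU = *NU = 1; return; }        /* too few registers */
    int u = (int)(sqrt((double)(NR - LS + 1)) - 1.0);  /* seed from quadratic */
    while ((u + 1) * (u + 1) + 2 * (u + 1) + LS <= NR) u++;
    while (u > 0 && u * u + 2 * u + LS > NR) u--;
    *MU = (u > 1) ? u : 1;
    int nu = (NR - LS - *MU) / (*MU + 1);   /* NU*(MU+1) <= NR - LS - MU */
    *NU = (nu > 1) ? nu : 1;
    if (*MU < *NU) { int t = *MU; *MU = *NU; *NU = t; }
}
```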
- 77. Estimating KU ● Not limited by the size of the register file ● Limited by the size of I-Cache ● Unroll the innermost loop within the size constraints of instruction cache ● Avoid micro-MMM code cleanup ○ Trim KU so that it divides NB ○ Usually, KU = NB in most machines
- 78. Estimating LS ● The skew factor the ATLAS code generator uses to schedule dependent multiplication and addition operations for the CPU pipeline ● The LS independent multiplications and LS − 1 independent additions between mul_i and the corresponding add_i should at least hide the latency of the multiplication
- 79. Estimating LS ● LX = latency of multiplication ● The 2·LS − 1 independent instructions in between hide this latency ● So, 2·LS − 1 >= LX ● With multiple floating-point units: (2·LS − 1) / |ALUFP| >= LX ● Solving for LS: LS = ⌈(LX · |ALUFP| + 1) / 2⌉
- 80. Summary
1. Estimate FMA
2. Estimate LS: LS = ⌈(LX · |ALUFP| + 1) / 2⌉
3. Estimate MU and NU: MU·NU + MU + NU + LS <= NR
○ Set MU = NU = u and solve for u
○ MU = max(1, u); solve for NU
○ NU = max(NU, 1); if MU < NU, swap MU and NU
4. Estimate NB: ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
○ Trim NB to a multiple of 2, MU and NU
5. Estimate KU
○ Constrained by the I-cache
○ Make KU divide NB
6. Estimate NF, IF
○ IF = 2, NF = 2
- 82. Conclusions ● On all machines (other than Itanium), the model-tuned codes performed almost as well as the global-search-based codes ● Models find the parameters much faster ● It might be difficult to implement such analytical methods in compilers ○ this model is focused on only one application