Auto Tuning

Presentation on Auto Tuning delivered as part of our "Software for Multicore Processors" course at UT Austin. It covers the basics of AutoTuning and details of two library generators called PhiPAC and ATLAS.

Transcript

  • 1. Auto Tuning (Hemanth and Siddharth, UT Austin)
  • 2. Basics
  • 3. What is Auto Tuning?
    ● Several definitions
      ○ First result on Wikipedia: "Auto-Tune is an audio processor created by Antares Audio Technologies"
    ● A working definition
      ○ Autotuning is an automatic process for selecting one out of several possible solutions to a computational problem.
    ● Techniques used by:
      ○ library generators, compilers, and runtime systems
  • 4. Possible Versions of a Solution
    ● The solutions may differ in the
      ○ algorithm (quicksort vs. selection sort)
      ○ implementation (loop unrolling)
    ● The versions may result from
      ○ transformations (unroll, tile, interchange)
    ● The versions could be generated
      ○ manually by the programmer (coding or directives)
      ○ automatically by the compiler
  • 5. Motivation
    ■ Increasing diversity of computing platforms
    ■ New influences on the execution of parallel applications
      ○ hierarchical structure
      ○ heterogeneity of the processors
    ■ Goal: design efficient software that takes full advantage of such systems
    ■ Solving a target problem with a single algorithm is not always efficient everywhere
  • 6. First Ideas
    ● Poly-algorithms
      ○ (1969) John Rice (Purdue): "A polyalgorithm for the automatic solution of nonlinear equations"
    ● Profiling and feedback-assisted compilation
      ○ (1982) S. Graham et al.: gprof
      ○ (1991) P. Chang et al.: "Using profile information to assist classic code optimizations"
    ● Code generation
      ○ (1989) J. Johnson et al.: "A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures"
      ○ (1992) M. Covell et al.: "Computer-aided algorithm design and arrangement"
  • 7. Context: High-Performance Libraries
    ● Linear algebra
      ○ BLAS, LAPACK, ScaLAPACK
    ● Signal/image processing
      ○ Vector Signal Image Processing Library (VSIPL)
    ● Distributed/parallel systems
      ○ Message Passing Interface (MPI)
    ● Can we implement libraries:
      ○ automatically and portably,
      ○ incorporating platform-specific features,
      ○ matching the performance of hand-tuned implementations,
      ○ leveraging compiler technology,
      ○ using domain-specific knowledge?
  • 8. AutoTuning
    ● A two-phase scheme for producing automatically tuned code
    ● Given: a program, its inputs, and a machine
    ● Step 1: identify and generate a space of candidate implementations
    ● Step 2: select the fastest one using empirical modeling and/or automated experiments (see the sketch below)
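    To make step 2 concrete, here is a minimal sketch in C (mine, not from the slides) of an empirical selector that times two candidate MMM kernels and keeps the fastest; a real autotuner would generate the candidates automatically and use a more careful timing harness:

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        typedef void (*kernel_t)(int, const double *, const double *, double *);

        /* Two candidate implementations of the same computation
           (here: naive MMM with different loop orders). */
        static void mmm_ijk(int n, const double *a, const double *b, double *c) {
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        c[i*n + j] += a[i*n + k] * b[k*n + j];
        }
        static void mmm_ikj(int n, const double *a, const double *b, double *c) {
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        c[i*n + j] += a[i*n + k] * b[k*n + j];
        }

        int main(void) {
            enum { N = 256 };
            double *a = calloc(N * N, sizeof *a);
            double *b = calloc(N * N, sizeof *b);
            double *c = calloc(N * N, sizeof *c);
            kernel_t cands[] = { mmm_ijk, mmm_ikj };
            int best = 0;
            double best_time = 1e30;
            for (int i = 0; i < 2; i++) {          /* automated experiment */
                clock_t t0 = clock();
                cands[i](N, a, b, c);
                double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
                if (secs < best_time) { best_time = secs; best = i; }
            }
            printf("fastest candidate: %d (%.3f s)\n", best, best_time);
            free(a); free(b); free(c);
            return 0;
        }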
  • 9. Why not let the compiler worry?
    ● Compilers are general-purpose
      ○ whereas library generators can focus on specific problems
    ● Engineering
      ○ A production compiler is hard to modify, and its effects are global
    ● Analysis
      ○ Limited access to relevant run-time information
      ○ Over-specified dependencies
      ○ Strict correctness criteria
  • 10. Compiler vs. AutoTuner
    ● Input
      ○ Compiler: general-purpose source code
      ○ AutoTuner: a specification including problem size, machine parameters, and problem-specific transformations
    ● Output
      ○ Compiler: low-level machine code
      ○ AutoTuner: mostly high-level source (e.g., C code)
    ● Time to generate
      ○ Compiler: short (unless feedback/profiling is enabled)
      ○ AutoTuner: usually long (depends on the search space)
    ● Selecting an implementation
      ○ Compiler: mostly static analysis (rarely feedback tuning)
      ○ AutoTuner: automated empirical models and experiments
  • 11. Some AutoTuning Projects
    ● Linear algebra
      ○ PHiPAC: Portable High-Performance ANSI C
      ○ ATLAS: Automatically Tuned Linear Algebra Software
    ● Signal and image processing
      ○ FFTW: Fastest Fourier Transform in the West
      ○ SPIRAL
  • 12. PHiPAC
  • 13. Traditional Approach: Hand-Tuned Libraries
  • 14. PHiPAC (1997)
    ● Develops portable high-performance matrix-vector libraries in ANSI C
    ● Parameterized C-code generator
      ○ produces code according to certain guidelines
    ● Auto-tunes the generated code
    ● Exhaustive search over all parameters
    ● Claim: achieves over 90% of peak performance
  • 15. PHiPAC Approach: Generate Optimized C Code
  • 16. PHiPAC Approach: Parameters Are Architecture-Specific
  • 17. Efficient Code Generation
    ● Studied several ANSI C compilers and determined that it is best to:
    ● Rely on compilers for:
      ○ register allocation
      ○ instruction selection and scheduling
    ● Manually perform:
      ○ register/cache blocking
      ○ loop unrolling
      ○ software pipelining, etc.
  • 18. Use Local Variables to Explicitly Remove False Dependencies

    Before:
        a[i]   = b[i] + c;
        a[i+1] = b[i+1] * d;

    After:
        float f1, f2;
        f1 = b[i];
        f2 = b[i+1];
        a[i]   = f1 + c;
        a[i+1] = f2 * d;

    The compiler may not assume &a[i] != &b[i+1], and so is forced to store a[i] before loading b[i+1] (pointer aliasing).
  • 19. [Charts: performance with false dependencies, and after removing the dependency]
  • 20. Exploit Multiple Registers
    ● Explicitly keep values in local variables
      ○ Reduces memory bandwidth
      ○ The compiler would otherwise reload the fil values on every iteration (potential aliasing with res)

    Before:
        while (...) {
            *res++ = fil[0] * sig[0]
                   + fil[1] * sig[1];
            sig++;
        }

    After:
        float f0 = fil[0];
        float f1 = fil[1];
        while (...) {
            *res++ = f0 * sig[0]
                   + f1 * sig[1];
            sig++;
        }
  • 21. Minimize Pointer Updates by Striding with Constant Offsets

    Before:
        f0 = *r8; r8 += 4;
        f1 = *r8; r8 += 4;
        f2 = *r8; r8 += 4;

    After:
        f0 = r8[0];
        f1 = r8[4];
        f2 = r8[8];
        r8 += 12;

    Compilers can fold a constant index into a (register + offset) addressing mode.
  • 22. Minimize Branches, Avoid Magnitude Compares
    ● Branches are costly
      ○ Unroll loops
      ○ Use do {} while(); loops to avoid loop-head branches
    ● Using == and != is cheaper than magnitude compares (<, <=)

    Before:
        for (i = 0, a = start_ptr;
             i < ARRAY_SIZE;
             i++, a++) {
            ...
        }

    After:
        a = start_ptr;
        end_ptr = &a[ARRAY_SIZE];
        do {
            ...
            a++;
        } while (a != end_ptr);
  • 23. Explicitly Unroll Loops
    ● Exposes instruction-level parallelism

    Before:
        while (...) {
            *res++ = fil[0] * sig[0]
                   + fil[1] * sig[1];
            sig++;
        }

    After (unrolled by two, rotating signal values through locals):
        float f0, f1, s0, s1, s2;
        f0 = fil[0]; f1 = fil[1];
        s0 = sig[0]; s1 = sig[1];
        *res++ = (f0 * s0) + (f1 * s1);
        do {
            sig += 2;
            s2 = sig[0];
            res[0] = f0 * s1 + f1 * s2;
            s1 = sig[1];
            res[1] = f0 * s2 + f1 * s1;
            res += 2;
        } while (...);
  • 24. Other Guidelines
    ● Balance the instruction mix
      ○ Interleave 1 FP multiply, 1 FP add, and 1-2 FP loads or stores
    ● Increase locality
      ○ Arrange code to have unit-stride memory accesses and try to reuse data in cache
    ● Convert integer multiplies to adds (see the sketch below)
      ○ * and / are slower than +
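    As a small illustration of the last guideline (mine, not from the slides), the integer multiply in an index computation can be strength-reduced to an add carried across iterations; a, n, stride, and sum are assumed to be in scope:

        /* Before: the index is recomputed with an integer multiply
           on every iteration. */
        for (int i = 0; i < n; i++)
            sum += a[i * stride];

        /* After: the multiply becomes an add carried across iterations. */
        for (int i = 0, idx = 0; i < n; i++, idx += stride)
            sum += a[idx];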
  • 25. Matrix-Multiply Generators
    ● Produce C code following the PHiPAC guidelines
    ● C = α·op(A)·op(B) + β·C
      ○ A, B, and C are MxK, KxN, and MxN matrices
      ○ op(X) is either X or transpose(X)
    ● mm_cgen and mm_lgen
      ○ core (register blocking)
      ○ level (higher-level cache blocking)
    ● mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...
  • 26. Blocked MMM

        for (i = 0; i < M; i += M0)
          for (j = 0; j < N; j += N0)
            for (l = 0; l < K; l += K0)
              for (r = i; r < i + M0; r++)
                for (s = j; s < j + N0; s++)
                  for (t = l; t < l + K0; t++)
                    c[r][s] += a[r][t] * b[t][s];
  • 27. Code Generator

        $ mm_cgen -l0 <M0> <K0> <N0> [ -l1 <M1> <K1> <N1> ]

    [Diagram: the blocking parameters M0, K0, N0 (and optionally M1, K1, N1) flow into the generator, which emits optimized C.]
  • 28. Usage and Options

    Usage: mm_cgen [OPTIONS]

    ● Semantics options:
      ○ -op[ABC] [N|T] : per-matrix op for A, B, C; Normal or Transpose
      ○ -no_fringes : don't generate M, K, or N register-block fringes
    ● Optimization options:
      ○ -l0 M0 K0 N0 / -l1 M1 K1 N1 : register (L0) / cache (L1) blocking parameters
      ○ -sp [1|2lm|2ma|3] : software-pipelining options
  • 29. Usage and Options (contd.)
    ● Precision options:
      ○ -prec/-sprec/-aprec/-dprec [single|double|ldouble] : precision of source, accumulator, and destination
    ● Misc. options:
      ○ -file name : write to file 'name'
      ○ -routine_name name : name of the generated routines
  • 30. Optimal Block Sizes: Use the search.pl script
  • 31. Optimal Block Sizes
    ● Naive brute-force search
    ● For the register parameters:
      ○ NR/4 <= M0*N0 <= NR, where NR is the maximum number of registers
      ○ 1 <= K0 <= K0max, with K0max = 20 (tunable)
    ● Benchmark all squares M = K = N = D
      ○ D runs over 2x, 3x, 10x and all primes
      ○ such that 3·D² fits in the L1 cache
  • 32. Optimal Block Sizes (contd.)
    ● For the L1 blocking parameters
    ● Consider the square case (D x D)
    ● Search the neighborhood centered at 3·D² = L1
    ● Set the values of M1, K1, N1 to ϕ·D/M0 (and analogously for K0 and N0)
      ○ where ϕ ∈ {0.25, 0.5, 1.0, 1.5, 2.0}
      ○ D = sqrt(L1/3)
      ○ 125 combinations
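    A minimal sketch (mine, not PHiPAC's search.pl) of enumerating the register-parameter candidates described on slide 31; NR and K0max are assumed example values, and each printed candidate would then be generated, compiled, and benchmarked:

        #include <stdio.h>

        int main(void)
        {
            const int NR = 32;       /* assumed number of FP registers */
            const int K0max = 20;    /* tunable bound from slide 31    */

            /* Enumerate register blockings with NR/4 <= M0*N0 <= NR
               and 1 <= K0 <= K0max. */
            for (int m0 = 1; m0 <= NR; m0++)
                for (int n0 = 1; n0 <= NR; n0++) {
                    if (m0 * n0 < NR / 4 || m0 * n0 > NR)
                        continue;
                    for (int k0 = 1; k0 <= K0max; k0++)
                        printf("candidate: M0=%d K0=%d N0=%d\n", m0, k0, n0);
                }
            return 0;
        }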
  • 33. Naive Brute Force?
    ● The search takes too long
    ● It generates very lengthy code
    ● Compilation is very slow under full optimization
    ● Need a better search strategy
  • 34. Smarter Search
    ● The majority of the computation is performed in the register-blocked code
    ● Benchmark only in multiples of the register block size
    ● The search space of M0, N0, K0 is not reduced
      ○ Prioritize the neighborhood of the best ones found
      ○ {M0-1, M0, M0+1}, etc.
    ● Terminate after reaching acceptable efficiency
  • 35. Evaluation
  • 36. [Chart] Single-Precision MMM (100 MHz SGI Indigo R4K). Source: "PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology"
  • 37. [Chart] Double-Precision MMM (HP 712/80i). Source: "PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology"
  • 38. There Is No Golden Hammer
    Strengths:
    ● Automatic search for optimal parameters
    ● Produces portable ANSI C code
    Weaknesses:
    ● Focuses on uniprocessor machines
    ● No support for vector-based CPUs
    ● No control over instruction scheduling
  • 39. Further Information
    ● http://www.icsi.berkeley.edu/~bilmes/phipac/
    ● http://www.inf.ethz.ch/personal/markusp/teaching/252-2600-ETH-fall11/slides/01-Dietiker.pdf
  • 40. ATLAS (Siddharth Subramanian)
  • 41. ATLAS
    ● Automatically Tuned Linear Algebra Software
    ● Generates an optimized BLAS library
    ● C and Fortran77
    ● Provides implementations of BLAS levels 1, 2, and 3
    ● We will focus on matrix-matrix multiply (MMM)
  • 42. Naive MMM
    ● C = A * B using 3 nested for-loops
    ● Dimensions of A, B, and C are NxK, KxM, and NxM respectively
  • 43. Optimization for the L1 Cache
    ● The matrix is divided into NB x NB blocks
    ● Each block multiplication is called a mini-MMM
    ● The optimization parameter NB is chosen such that each mini-MMM fits in cache
  • 44. Optimization for L1 cache
  • 45. Optimization for the Register File
    ● Mini-MMMs are further decomposed into micro-MMMs
    ● A micro-MMM multiplies an MU x 1 sub-matrix of A by a 1 x NU sub-matrix of B and accumulates the result into an MU x NU sub-matrix of C
    ● Here MU and NU are the optimization parameters
    ● Necessary condition: MU + NU + MU*NU <= NR
      ○ where NR = number of floating-point registers
  • 46. Mini- and Micro-MMM [figure]
  • 47. Code (see the sketch below)
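    Slide 47's code is an image in the original deck; below is a minimal reconstruction (mine, not ATLAS's actual generated code) of a mini-MMM built from MU x NU micro-MMMs, with MU = NU = 2 and an assumed NB. The MU + NU loads and MU * NU multiply-adds of slide 49 are visible in the k loop:

        #define NB 48   /* assumed L1 blocking factor (must divide by MU, NU) */
        #define MU 2    /* register tile height */
        #define NU 2    /* register tile width  */

        /* One mini-MMM: C += A * B on NB x NB blocks, computed as a
           sequence of MU x NU micro-MMMs held in locals ("registers"). */
        void mini_mmm(const double A[NB][NB], const double B[NB][NB],
                      double C[NB][NB])
        {
            for (int i = 0; i < NB; i += MU)
                for (int j = 0; j < NB; j += NU) {
                    /* Load the MU x NU tile of C into locals. */
                    double c00 = C[i][j],   c01 = C[i][j+1];
                    double c10 = C[i+1][j], c11 = C[i+1][j+1];
                    for (int k = 0; k < NB; k++) {
                        /* Outer product of an MU x 1 column of A
                           and a 1 x NU row of B (MU + NU loads). */
                        double a0 = A[i][k], a1 = A[i+1][k];
                        double b0 = B[k][j], b1 = B[k][j+1];
                        c00 += a0 * b0;  c01 += a0 * b1;
                        c10 += a1 * b0;  c11 += a1 * b1;
                    }
                    C[i][j]     = c00;  C[i][j+1]   = c01;
                    C[i+1][j]   = c10;  C[i+1][j+1] = c11;
                }
        }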
  • 48. Pipeline Scheduling
    ● The 2 innermost loops (i and j) are unrolled to create interleaved multiply and add statements
    ● This exploits instruction-level parallelism
    ● If there is a fused multiply-add, these 2 operations can be executed together
    ● The optimization parameter FMA indicates to the code generator whether this facility is available
  • 49. Pipeline Scheduling
    ● MU + NU loads and stores
    ● MU * NU additions and multiplications
    ● The latency of operations might stall the pipeline
    ● Solution: interleave the operations such that dependent operations are separated by a particular distance (what should that be?)
    ● This distance is governed by another optimization parameter, LS
  • 50. Pipeline Scheduling
    ● Inject the MU + NU loads of A and B
    ● The loads are divided into:
      ○ an initial fetch (IF)
      ○ blocks of other load operations (NF)
  • 51. Loop Unrolling
    ● KU is the optimization parameter that controls loop unrolling
    ● Constrained by the capacity of the instruction cache
    ● Should be neither too small (wasted cache) nor too big (overflow of the instruction cache)
  • 52. Other Optimizations
    ● Copying tiles of A is done at the beginning of the outermost loop; these tiles are fully reused in each iteration of the j loop
    ● Copying the j-th vertical panel of B is done before the beginning of the i loop
    ● Copying tile (i, j) of C is done just before the k loop starts
  • 53. Other Optimizations
    ● Choosing the loop order:
      ○ if N < M, use the JIK loop order (so that A completely fits into the L2 cache)
      ○ else if M < N, use the IJK loop order
  • 54. Other Optimizations
    ● Copying A, B, and C might be an overhead for smaller matrices
    ● Non-copying versions are generated, governed by the optimization parameter NCNB
    ● This version is used if:
      ○ M * N * K is less than a threshold, or
      ○ at least 1 dimension of 1 of the matrices is smaller than 3 * NCNB
  • 55. Estimating Parameters
    ● Orthogonal search is used to optimize the parameters
    ● It is a heuristic and finds approximate solutions
    ● No guarantee of an optimal solution
    ● It needs these details:
      ○ In what order are the parameters optimized?
      ○ The possible range of values for each parameter
      ○ The reference value used for parameter k during the optimization of parameters 1 to k-1
  • 56. Summary of Parameters
  • 57. Estimating Machine Parameters
    ● The machine parameters measured are:
      ○ C1: size of the L1 data cache
      ○ NR: number of floating-point registers
      ○ FMA: availability of a fused multiply-add
      ○ LS: amount of separation between dependent multiply and add instructions
  • 58. Estimating Parameters
    ● Optimization sequence:
      1. NB
      2. MU and NU
      3. KU
      4. LS
      5. IF, NF
      6. NCNB
  • 59. Finding NB
    ● Generate values in the range 16 <= NB <= min(80, √C1), where C1 = size of the L1 data cache
  • 60. Finding MU and NU
    ● Try all combinations that satisfy:
      ○ MU * NU + MU + NU + LS <= NR
    ● NB was obtained earlier
  • 61. Finding LS, IF, and NF
    ● LS:
      ○ tries values in the interval [1, 6]
      ○ the boundary value was fixed based on experiments
      ○ LS must divide MU * NU * KU (for instruction scheduling)
    ● IF: searched in the interval [2, MU + NU]
    ● NF: searched in the interval [1, MU + NU - IF]
  • 62. Finding NCNB
    ● Searches in the range [NB : -4 : 4] (from NB down to 4 in steps of 4)
    ● Terminates the search when performance drops by 20% from the best solution found
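    A minimal sketch (mine, not ATLAS source) of this early-terminating search; measure_noncopy() is a hypothetical benchmark of the non-copying code for a given block size:

        /* Hypothetical benchmark: returns MFLOPS of the non-copying
           MMM for block size ncnb. */
        double measure_noncopy(int ncnb);

        /* Search [NB : -4 : 4], stopping once performance falls
           below 80% of the best seen so far. */
        int find_ncnb(int nb)
        {
            int best_ncnb = nb;
            double best_perf = 0.0;
            for (int ncnb = nb; ncnb >= 4; ncnb -= 4) {
                double perf = measure_noncopy(ncnb);
                if (perf > best_perf) { best_perf = perf; best_ncnb = ncnb; }
                else if (perf < 0.8 * best_perf) break;  /* 20% drop: stop */
            }
            return best_ncnb;
        }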
  • 63. Is Search Really Necessary?
  • 64. Finding KU
    ● Constrained by the instruction cache
    ● Values between 4 and NB/2 are tried
    ● The special values 1 and NB are also considered
  • 65. Empirical Optimization
    ● Estimating optimal parameter values is the key
      ○ Compilers use analytical models
      ○ Library generators (e.g., ATLAS) use search
    ● Empirical search:
      ○ Generate a version of the program for each combination of parameters
      ○ Execute each on the target machine and measure performance
      ○ Select the one that performs best
      ○ Increased installation time!
    ● How is the search space bounded?
      ○ By the hardware parameters
  • 66. Yotov et al.
    ● Realized that most optimizations used in the ATLAS code generator are already known to compilers
      ○ cache tiling, register tiling, etc.
    ● Replaced the search module with a parameter estimator based on standard analytical models
    ● The code generator is not modified
      ○ Any performance change is due solely to differently chosen parameters
  • 67. ATLAS Architecture
  • 68. Analysis
    ● Results indicated that a simple and intuitive model is able to estimate near-optimal values for the parameters
    ● Focus is on the ATLAS-generated code
    ● Notation:
      ○ ATLAS CGw/S: code generator with search
      ○ ATLAS Model: modified ATLAS (no search)
      ○ ATLAS Unleashed: hand-written code may be used, along with predefined architectural defaults for the parameter values, to produce the library
  • 69. Model-Based Optimization
    ● Requires more machine parameters than original ATLAS
      ○ No search!
    ● Empirical optimizers:
      ○ Approximate values of machine parameters are acceptable
      ○ They are only used to bound the search space
    ● Model-based optimizers:
      ○ Need accurate values
      ○ The authors developed a tool called X-Ray to measure them accurately
  • 70. Hardware Parameters
    ● C1, B1: the capacity and the line size of the L1 data cache
    ● CI: the capacity of the L1 instruction cache
    ● LX: hardware latency of the floating-point multiply instruction
    ● |ALU_FP|: number of floating-point functional units
    ● NR: the number of floating-point registers
    ● FMA: the availability of a fused multiply-add instruction
  • 71. Estimating NB
    ● Consider an idealized L1 cache: fully associative, optimal replacement, unit line size
    ● The working set of the mini-MMM loop is 3 blocks of NB x NB:
        3·NB² <= C1
    ● In the innermost loop, an element of C, once computed, is not used again; similarly, only 1 column of B is needed in cache, which refines the estimate to:
        NB² + NB + 1 <= C1
  • 72. Refined Estimate of NB
    ● Correcting for a non-unit line size B1:
        ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
  • 73. Further Refinement
    ● The estimated NB may not be a multiple of MU and NU
    ● This would cause fractional register tiles and extra clean-up code
    ● Avoid this by choosing NB appropriately
    ● ATLAS needs NB to be an even integer
    ● So NB is trimmed down to the largest even multiple of MU and NU that still satisfies the cache inequality
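    A minimal sketch (mine, not X-Ray or ATLAS code) that computes this estimate; C1 and B1 are assumed to be measured in floating-point elements:

        /* Largest NB with ceil(NB^2/B1) + ceil(NB/B1) + 1 <= C1/B1,
           trimmed to an even multiple of MU and NU. */
        int estimate_nb(int C1, int B1, int MU, int NU)
        {
            int nb = 0;
            for (int n = 1; n * n <= C1; n++) {
                int lines = (n * n + B1 - 1) / B1   /* lines for the A block */
                          + (n + B1 - 1) / B1       /* lines for a B column  */
                          + 1;                      /* one line for C        */
                if (lines <= C1 / B1)
                    nb = n;
            }
            while (nb > 0 && (nb % 2 || nb % MU || nb % NU))
                nb--;                               /* trim: even multiple   */
            return nb;
        }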
  • 74. Estimating MU and NU
    ● View the register file as a software cache that is
      ○ fully associative
      ○ of unit line size
      ○ of capacity NR (the number of registers)
    ● ATLAS performs outer products of (MU x 1) and (1 x NU) vectors for register tiling
  • 75. Estimating MU and NU (contd.)
    ● ATLAS allocates MU elements for A, NU elements for B, and MU*NU elements for C
    ● It also needs LS registers to hold temporary values of multiplications, to make use of pipelining
    ● So we have: (MU * NU) + MU + NU + LS <= NR
    ● The LS calculation is shown later; NR is known, so the only unknowns are MU and NU
  • 76. Estimation Scheme
    ● Let MU = NU = u; solve the previous inequality for u
    ● Let MU = max(u, 1); solve for NU
    ● Let NU = max(NU, 1)
    ● <MU, NU> = <max(MU, NU), min(MU, NU)>
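    A minimal sketch (mine, not from the paper's tooling) of this scheme in C; substituting MU = NU = u into the inequality gives u² + 2u + LS <= NR, which is solved in closed form:

        #include <math.h>

        /* Estimate the register tile <MU, NU> from NR and LS using
           MU*NU + MU + NU + LS <= NR. */
        void estimate_mu_nu(int NR, int LS, int *MU, int *NU)
        {
            /* u^2 + 2u + LS <= NR  =>  u <= sqrt(NR - LS + 1) - 1 */
            int u = (int)(sqrt((double)(NR - LS + 1)) - 1.0);
            int mu = u > 1 ? u : 1;
            /* mu*nu + mu + nu + LS <= NR  =>  nu <= (NR - LS - mu)/(mu + 1) */
            int nu = (NR - LS - mu) / (mu + 1);
            if (nu < 1) nu = 1;
            /* Assign the larger value to MU. */
            *MU = mu > nu ? mu : nu;
            *NU = mu > nu ? nu : mu;
        }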
  • 77. Estimating KU
    ● Not limited by the size of the register file
    ● Limited by the size of the instruction cache
    ● Unroll the innermost loop within the size constraints of the instruction cache
    ● Avoid micro-MMM code clean-up:
      ○ trim KU so that it divides NB
      ○ usually, KU = NB on most machines
  • 78. Estimating LS
    ● LS is the skew factor the ATLAS code generator uses to schedule dependent multiplication and addition operations for the CPU pipeline
    ● The LS independent multiplications and LS - 1 independent additions between mul_i and the corresponding add_i should at least hide the latency of the multiplication
  • 79. Estimating LS (contd.)
    ● LX = latency of multiplication
    ● The 2*LS - 1 independent instructions in between hide this latency:
        2*LS - 1 >= LX
    ● There may be multiple floating-point units:
        ⌈(2*LS - 1) / |ALU_FP|⌉ >= LX
    ● Solving for LS:
        LS = ⌈(LX * |ALU_FP| + 1) / 2⌉
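    In C, with LX and |ALU_FP| assumed to have been measured already (e.g., by a micro-benchmark tool such as X-Ray), the estimate is a single ceiling division; a sketch:

        /* LS = ceil((LX * nALU + 1) / 2), from 2*LS - 1 >= LX * nALU. */
        int estimate_ls(int LX, int nALU)
        {
            return (LX * nALU + 2) / 2;   /* integer ceiling of (LX*nALU+1)/2 */
        }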
  • 80. Summary
    1. Estimate FMA
    2. Estimate LS
    3. Estimate MU and NU:
       ○ MU*NU + MU + NU + LS <= NR
       ○ Set MU = NU = u; solve for u
       ○ MU = max(1, u); solve for NU
       ○ NU = max(NU, 1); if MU < NU, swap MU and NU
    4. Estimate NB:
       ○ ⌈NB²/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
       ○ Trim NB to be a multiple of 2, MU, and NU
    5. Estimate KU:
       ○ Constrained by the I-cache
       ○ Make KU divide NB
    6. Estimate NF, IF:
       ○ IF = 2, NF = 2
  • 81. Experimental Results
  • 82. Conclusions
    ● On all machines (other than Itanium), the model-based codes performed almost as well as the global-search-based codes
    ● Using models to find the parameters is much faster
    ● It might still be difficult to implement these analytical methods in general-purpose compilers
      ○ This model is focused on only 1 application