E: Loop Nest Optimization (LNO)


  1. Section E: Loop Nest Optimization
  2. Loop Nest Optimizer (LNO) Overview
     • Performs transformations on a loop nest
     • Scope of work: each top-level loop nest
     • Does not build a control flow graph
     • Driven by data dependence analysis
     • Uses alias and use-def information provided by the scalar optimizer
     • Annotates data dependence information for use by the code generator
       – innermost loop only
       – requires modeling of hardware resources
  3. Dependence Testing
     • Dependence: given two references R1 and R2, R2 depends on R1 if they may
       access the same memory location and there is a path from R1 to R2
       – true dependence, anti dependence, output dependence
     • Access array: each array subscript expressed in terms of loop index variables
     • Access vector: a vector of all the subscripts' access arrays
         DO I = 1, N
           DO J = 1, N
             ...a(2I + J, 3J)...
       access arrays: (2, 1), (0, 3); access vector: [(2, 1), (0, 3)]
     • Dependence testing (input: access arrays)
       – refer to "Efficient and Exact Data Dependence Analysis", Dror Maydan et al., PLDI '91
       – output: a dependence vector, each dimension representing a loop level
  4. Three Classes of Optimizations by LNO
     1. Transformations for the data cache
     2. Transformations that help other optimizations
     3. Vectorization and parallelization
  5. LNO Transformations for Data Cache
     • Cache blocking: transform loops to work on sub-matrices that fit in cache
     • Loop interchange
     • Array padding: reduce cache conflicts
     • Prefetch generation: hide the long latency of cache-miss references
     • Loop fusion
     • Loop fission
  6. Cache Blocking (A * B = C)
         for (i=0; i<n; i++)
           for (j=0; j<n; j++)
             for (k=0; k<n; k++)
               c[i][j] = c[i][j] + a[i][k]*b[k][j];
     • Matrix B misses all the time
     • n^3 + 2n^2 cache misses (ignoring line size)
  7. Cache Blocking
     • Use sub-matrices that fit entirely in cache:
         [A11 A12]   [B11 B12]   [C11 C12]
         [A21 A22] * [B21 B22] = [C21 C22]

         C11 = A11 * B11 + A12 * B21
     • For sub-matrices of size b: (n/b)*n^2 + 2n^2 cache misses instead of n^3 + 2n^2
  8. Loop Interchange (Permutation)
     • Improves data locality. With column-major Fortran arrays:
         DO J = 1, M
           DO I = 1, N
             A(I, J) = B(I, J) + C
       misses once every 16 iterations (element size: 4 bytes, cache line size: 64 bytes)
         DO I = 1, N
           DO J = 1, M
             A(I, J) = B(I, J) + C
       no spatial reuse: a cache miss for every reference of A and B
     • Unimodular transformation: combined with cache blocking, loop reversal,
       loop skewing, etc. to improve the overall data locality of the loop nests
     • Enables vectorization and parallelization
  9. Software Prefetch
     • Major considerations for software prefetch
       – what to prefetch: references that most likely cause cache misses
       – when to prefetch: neither too early nor too late
       – avoid useless prefetches: register pressure, cache pollution, memory bandwidth
     • Major phases in our prefetch engine
       – Process_Loop: build internal structures, etc.
       – Build_Base_LGs: build locality groups (references that likely access the same cache line)
       – Volume: compute the data volume for each loop, from innermost to outermost
       – Find_Loc_Loops: determine which locality groups need prefetch
       – Gen_Prefetch: generate prefetch instructions
  10. Software Prefetch
     • Prefetch N cache lines ahead
       – for a(I), prefetch a(I + N*line_size)
       – -LNO:prefetch_ahead=N (default 2)
     • One prefetch for each cache line
       – versioning in the loop:
           DO I = 1, N
             if I % 16 == 0 then
               loop body with prefetch code
             else
               loop body without prefetch code
       – combine with unrolling:
           DO I = 1, N, 16
             prefetch b(I + 2*16)
             a = a + b(I)
             a = a + b(I+1)
             ...
             a = a + b(I+15)
  11. Loop Fission and Fusion
     • Loop fission
       – enables loop interchange
       – enables vectorization and parallelization
       – reduces conflict misses
     • Loop fusion
       – reduces loop overhead
       – improves data reuse
       – larger loop body
     • Fissioned vs. fused:
         DO I = 1, N
           a(I) = b(I) + C
         ENDDO
         DO I = 1, N
           d(I) = a(I) + E
         ENDDO

         DO I = 1, N
           a(I) = b(I) + C
           d(I) = a(I) + E
         ENDDO
  12. LNO Transformations that Help Other Optimizations
     • Scalar expansion / array expansion
       – reduce inter-loop dependences
       – enable parallelization
     • Scalar variable renaming
       – loop nests can be optimized separately
       – fewer constraints for register allocation
     • Array scalarization
       – improves register allocation
     • Hoist messy loop bounds
     • Outer loop unrolling
     • Array substitution (forward and backward)
     • Loop unswitching
     • Hoist IF
     • Gather-scatter
     • Move invariant array references out of loops
     • Inter-iteration CSE
  13. Outer Loop Unrolling
     • Forms larger loop bodies
     • Reduces loop overhead
         for (i=0; i<n; i++)                      /* per iteration: 1 add,  */
           for (j=0; j<m; j++)                    /* 2 mults, 3 loads,      */
             a[i][j] = a[i][j] + x*b[j] * c[j];   /* 1 store                */

         for (i=0; i<n-1; i+=2)                   /* per iteration: 2 adds, */
           for (j=0; j<m; j++) {                  /* 4 mults, 4 loads,      */
             a[i][j] = a[i][j] + x*b[j] * c[j];   /* 2 stores               */
             a[i+1][j] = a[i+1][j] + x*b[j] * c[j];
           }
         for (j=0; j<m; j++)                      /* remainder row (n odd)  */
           a[i][j] = a[i][j] + x*b[j] * c[j];
  14. Gather-Scatter
     • Original:
         do i = 1, n
           if (c(i) .gt. 0.0) then
             a(i) = c(i) / b(i)
             c(i) = c(i) * b(i)
           end if
         end do
     • Transformed (gather the matching indices, then run a dense loop):
         do i = 1, n
           deref_gs(inc_0+1) = i
           if (c(i) .gt. 0.0) then
             inc_0 = inc_0 + 1
           end if
         end do
         do ind_0 = 0, inc_0-1
           i_gs = deref_gs(ind_0+1)
           a(i_gs) = c(i_gs)/b(i_gs)
           c(i_gs) = c(i_gs)*b(i_gs)
         end do
  15. Forward and Backward Array Substitution
     • Original:
         DO i = 1, N
           DO j = 1, N
             DO k = 1, N
               C(i,j) = C(i,j) + A(i, k) * B(k, j)
             ENDDO
           ENDDO
         ENDDO
     • Transformed (the array reference is replaced by scalar s inside the k loop):
         DO i = 1, N
           DO j = 1, N
             s = C(i,j)
             DO k = 1, N
               s = s + A(i, k) * B(k, j)
             ENDDO
             C(i, j) = s
           ENDDO
         ENDDO
  16. Hoist IF
     • Remove the loop by replicating the matching iteration (assuming
       1 <= winner <= N):
         DO i = 1, N
           if (i == winner && F(i))
             G(i)
         ENDDO
       becomes
         if (F(winner))
           G(winner)
  17. Loop Unswitching
     • Move IFs with invariant conditions out of the loop:
         DO i = 1, N
           if (cond)
             G(i)
         ENDDO
       becomes
         if (cond)
           DO i = 1, N
             G(i)
           ENDDO
  18. Inter-Iteration CSE
     • Original:
         DO I = 1, N
           c(I) = a(I) + b(I)
           d(I) = a(I+1) + b(I+1)
         ENDDO
     • Transformed (each sum is computed once and carried into the next iteration):
         temp1 = a(1) + b(1)
         DO I = 1, N
           temp = temp1
           temp1 = a(I+1) + b(I+1)
           c(I) = temp
           d(I) = temp1
         ENDDO
  19. LNO Parallelization
     • SIMD code generation
       – highly dependent on the SIMD instructions in the target
     • Generate vector intrinsics
       – based on the library functions available
     • Automatic parallelization
       – leverages the OpenMP support in the rest of the backend
  20. Vectorization
     • Applied to the innermost loop
       – any statement not involved in a dependence cycle may be vectorized:
           DO I = 1, N
             a(I) = a(I) + C      ! can be vectorized
           ENDDO

           DO I = 1, N
             a(I+1) = a(I) + C    ! dependence cycle: cannot be vectorized
           ENDDO
     • General vectorization implementation
       – constraints checking
       – dependence analysis for the innermost loop
         · build the statement dependence graph
         · detect dependence cycles (strongly connected components, SCCs)
       – techniques to enable vectorization: applied when a dependence cycle
         exists (see next slide)
       – rewrite the loop to its vectorized version
  21. Techniques to Enable Vectorization
     • Scalar expansion: expand scalar T to array t
         DO I = 1, N
           T = a(I)
           a(I) = b(I)
           b(I) = T
         ENDDO
       becomes
         DO I = 1, N
           t(I) = a(I)
           a(I) = b(I)
           b(I) = t(I)
         ENDDO
     • Loop fission
         DO I = 1, N
           a(I+1) = a(I) + C   ! cycle
           b(I) = b(I) + D     ! no cycle
         ENDDO
       becomes
         DO I = 1, N
           a(I+1) = a(I) + C   ! cycle, loop not vectorizable
         ENDDO
         DO I = 1, N
           b(I) = b(I) + D     ! no cycle, vectorizable loop
         ENDDO
     • Other approaches: loop interchange, array renaming, etc.
  22. Extra Considerations for SIMD
     • A special type of vectorization
     • Array references must be contiguous
       – e.g., A(2*I) and A(2*(I+1)) are not contiguous
       – loop versioning may be required for F90 arrays to guarantee contiguity
     • Alignment
       – sometimes no benefit if not aligned to a 128-bit boundary
       – peeling may be required
     • May need a remainder loop:
         DO I = 1, N
           a(I) = a(I) + C
         ENDDO
       becomes
         DO I = 1, N - mod(N, 4), 4
           a(I:I+3) = a(I:I+3) + C
         ENDDO
         DO I = N - mod(N, 4) + 1, N   ! remainder
           a(I) = a(I) + C
         ENDDO
  23. SIMD with Reduction
     • Replicate the accumulator for each computation lane (the slide assumes N
       is a multiple of 4):
         sum = 0
         DO i = 1, N
           sum = sum + A(i)
         ENDDO
       becomes
         sum0 = 0
         sum1 = 0
         sum2 = 0
         sum3 = 0
         DO i = 1, N, 4
           sum0 = sum0 + A(i)
           sum1 = sum1 + A(i+1)
           sum2 = sum2 + A(i+2)
           sum3 = sum3 + A(i+3)
         ENDDO
         sum = sum0 + sum1 + sum2 + sum3
  24. Generating Vector Intrinsics
     • Fission is usually needed to isolate the intrinsics:
         for (i=0; i<N; i++) {
           a[i] = a[i] + 3.0;
           u[i] = cos(ct[i]);
         }
       becomes
         vcos(&ct[0], &u[0], N, 1, 1);
         for (i=0; i<N; i++)
           a[i] = a[i] + 3.0;
  25. LNO Phase Structure
     1. Pre-optimization
     2. Fully unroll short loops [lnopt_main.cxx]
     3. Build array dependence graph [be/com/dep_graph.cxx]
     4. Miscellaneous optimizations
        – hoist varying lower bounds [access_main.cxx]
        – form min/max [ifminmax.cxx]
        – dead store elimination for arrays [dead.cxx]
        – array substitutions (forward and backward) [forward.cxx]
        – loop reversal [reverse.cxx]
     5. Loop unswitching [cond.cxx]
     6. Cache blocking [tile.cxx]
     7. Loop interchange [permute.cxx]
     8. Loop fusion [fusion.cxx]
     9. Hoist messy loop bounds [array_bounds.cxx]
  26. LNO Phase Structure (continued)
     10. Array padding [pad.cxx]
     11. Parallelization [parallel.cxx]
     12. Shackling [shackle.cxx]
     13. Gather-scatter [fis_gthr.cxx]
     14. Loop fission [fission.cxx]
     15. SIMD [simd.cxx]
     16. Hoist IF [lnopt_hoistif.cxx]
     17. Generate vector intrinsics [vintr_fis.cxx]
     18. Generate prefetches [prefetch.cxx]
     19. Array scalarization [sclrze.cxx]
     20. Move invariants outside the loop [minvariant.cxx]
     21. Inter-iteration common sub-expression elimination [cse.cxx]
