- 1. Section E Loop Nest Optimization
- 2. Loop Nest Optimizer (LNO) Overview
  - Performs transformations on a loop nest
  - Scope of work: each top-level loop nest
  - Does not build a control flow graph
  - Driven by data dependence analysis
  - Uses alias and use-def information provided by the scalar optimizer
  - Annotates data dependence information for use by the code generator
    - Innermost loop only
    - Requires modeling of hardware resources
- 3. Dependence Testing
  - Dependence: given two references R1 and R2, R2 depends on R1 if they may access the same memory location and there is a path from R1 to R2 (true dependence, anti dependence, output dependence)
  - Access array: each array subscript expressed in terms of the loop index variables
  - Access vector: a vector of all the subscripts' access arrays
  - Example:
        DO I = 1, N
          DO J = 1, N
            ...a(2I + J, 3J)...
    access arrays: (2, 1), (0, 3); access vector: [(2, 1), (0, 3)]
  - Dependence testing takes access arrays as input and outputs a dependence vector, each dimension representing a loop level; refer to "Efficient and Exact Data Dependence Analysis", Dror Maydan et al., PLDI '91
- 4. Three Classes of Optimizations by LNO
  1. Transformations for the data cache
  2. Transformations that help other optimizations
  3. Vectorization and parallelization
- 5. LNO Transformations for Data Cache
  - Cache blocking: transform the loop to work on sub-matrices that fit in cache
  - Loop interchange
  - Array padding: reduce cache conflicts
  - Prefetch generation: hide the long latency of cache-miss references
  - Loop fusion
  - Loop fission
- 6. Cache Blocking
  Matrix multiply C = C + A * B:
      for (i=0; i<n; i++)
        for (j=0; j<n; j++)
          for (k=0; k<n; k++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
  Matrix B misses all the time: n^3 + 2n^2 cache misses (ignoring line size)
- 7. Cache Blocking (continued)
  Use sub-matrices that fit entirely in cache: partition A, B, and C into sub-matrices (A11, A12, A21, A22, and so on); then, for example,
      C11 = A11 * B11 + A12 * B21
  For sub-matrices of size b: (n/b)*n^2 + 2n^2 cache misses instead of n^3 + 2n^2
- 8. Loop Interchange (Permutation)
  - Improves data locality:
        DO J = 1, M                      DO I = 1, N
          DO I = 1, N                      DO J = 1, M
            A(I, J) = B(I, J) + C            A(I, J) = B(I, J) + C
    Left: misses once every 16 iterations (element size 4 bytes, cache line size 64 bytes). Right: no spatial reuse; a cache miss for every reference of A and B (Fortran arrays are column-major).
  - Unimodular transformation: combined with cache blocking, loop reversal, loop skewing, etc., to improve the overall data locality of the loop nests
  - Enables vectorization and parallelization
- 9. Software Prefetch
  - Major considerations for software prefetch:
    - What to prefetch: references that most likely cause cache misses
    - When to prefetch: neither too early nor too late
    - Avoid useless prefetches: register pressure, cache pollution, memory bandwidth
  - Major phases in our prefetch engine:
    - Process_Loop: build internal structures, etc.
    - Build_Base_LGs: build locality groups (references that likely access the same cache line)
    - Volume: compute the data volume for each loop, from innermost to outermost
    - Find_Loc_Loops: determine which locality groups need prefetch
    - Gen_Prefetch: generate prefetch instructions
- 10. Software Prefetch (continued)
  - Prefetch N cache lines ahead: for a(i), prefetch a(i + N*line_size); -LNO:prefetch_ahead=N (default 2)
  - One prefetch for each cache line:
    - Versioning in the loop:
          DO I = 1, N
            if I % 16 == 0 then
              loop body with prefetch code
            else
              loop body without prefetch code
    - Combine with unrolling:
          DO I = 1, N, 16
            prefetch b(I + 2*16)
            a = a + b(I)
            a = a + b(I+1)
            ...
            a = a + b(I+15)
- 11. Loop Fission and Fusion
  - Loop fission: enables loop interchange; enables vectorization and parallelization; reduces conflict misses
  - Loop fusion: reduces loop overhead; improves data reuse; larger loop body
  - Example, fissioned form:
        DO I = 1, N
          a(I) = b(I) + C
        ENDDO
        DO I = 1, N
          d(I) = a(I) + E
        ENDDO
    fused form:
        DO I = 1, N
          a(I) = b(I) + C
          d(I) = a(I) + E
        ENDDO
- 12. LNO Transformations that Help Other Optimizations
  - Scalar expansion / array expansion: reduce inter-loop dependences; enable parallelization
  - Scalar variable renaming: loop nests can be optimized separately; fewer constraints for register allocation
  - Array scalarization: improves register allocation
  - Hoist messy loop bounds
  - Outer loop unrolling
  - Array substitution (forward and backward)
  - Loop unswitching
  - Hoist IF
  - Gather-scatter
  - Move invariant array references out of loops
  - Inter-iteration CSE
- 13. Outer Loop Unrolling
  - Forms larger loop bodies; reduces loop overhead
  - Original (per iteration: 1 add, 2 mults, 3 loads, 1 store):
        for (i=0; i<n; i++)
          for (j=0; j<m; j++)
            a[i][j] = a[i][j] + x*b[j] * c[j];
  - Unrolled by 2 (per iteration: 2 adds, 4 mults, 4 loads, 2 stores; b[j] and c[j] are loaded once for two rows):
        for (i=0; i<n-1; i+=2)
          for (j=0; j<m; j++) {
            a[i][j] = a[i][j] + x*b[j] * c[j];
            a[i+1][j] = a[i+1][j] + x*b[j] * c[j];
          }
        for (j=0; j<m; j++)      /* remainder row */
          a[i][j] = a[i][j] + x*b[j] * c[j];
- 14. Gather-Scatter
  Original:
      do i = 1, n
        if (c(i) .gt. 0.0) then
          a(i) = c(i) / b(i)
          c(i) = c(i) * b(i)
        end if
      end do
  Transformed (gather the matching indices, then run a condition-free loop over them):
      do i = 1, n
        deref_gs(inc_0+1) = i
        if (c(i) .gt. 0.0) then
          inc_0 = inc_0 + 1
        end if
      end do
      do ind_0 = 0, inc_0-1
        i_gs = deref_gs(ind_0+1)
        a(i_gs) = c(i_gs)/b(i_gs)
        c(i_gs) = c(i_gs)*b(i_gs)
      end do
- 15. Forward and Backward Array Substitution
  Original:
      DO i = 1, N
        DO j = 1, N
          DO k = 1, N
            C(i,j) = C(i,j) + A(i, k) * B(k, j)
          ENDDO
        ENDDO
      ENDDO
  Transformed (the repeated array reference C(i,j) is replaced by the scalar s inside the k loop):
      DO i = 1, N
        DO j = 1, N
          s = C(i,j)
          DO k = 1, N
            s = s + A(i, k) * B(k, j)
          ENDDO
          C(i, j) = s
        ENDDO
      ENDDO
- 16. Hoist IF
  Remove the loop by replicating the matching iterations:
      DO i = 1, N
        if (i == winner && F(i))
          G(i)
      ENDDO
  becomes
      if (F(winner))
        G(winner)
- 17. Loop Unswitching
  Move IFs with invariant conditions out of the loop:
      DO i = 1, N
        if (cond)
          G(i)
      ENDDO
  becomes
      if (cond)
        DO i = 1, N
          G(i)
        ENDDO
- 18. Inter-Iteration CSE
  Original:
      DO I = 1, N
        c(I) = a(I) + b(I)
        d(I) = a(I+1) + b(I+1)
      ENDDO
  Transformed (a(I+1) + b(I+1) computed in iteration I is reused as a(I) + b(I) in iteration I+1):
      temp1 = a(1) + b(1)
      DO I = 1, N
        temp = temp1
        temp1 = a(I+1) + b(I+1)
        c(I) = temp
        d(I) = temp1
      ENDDO
- 19. LNO Parallelization
  - SIMD code generation: highly dependent on the SIMD instructions in the target
  - Generate vector intrinsics: based on the library functions available
  - Automatic parallelization: leverages the OpenMP support in the rest of the backend
- 20. Vectorization
  - Applied to the innermost loop; any statement not involved in a dependence cycle may be vectorized:
        DO I = 1, N                ! can be vectorized
          a(I) = a(I) + C
        ENDDO
        DO I = 1, N                ! dependence cycle: cannot be vectorized
          a(I+1) = a(I) + C
        ENDDO
  - General vectorization:
    - Implementation constraints checking
    - Dependence analysis for the innermost loop: build the statement dependence graph; detect dependence cycles (strongly connected components, SCCs)
    - Techniques to enable vectorization: apply when a dependence cycle exists (see next slide)
    - Rewrite the loop to its vectorized version
- 21. Techniques to Enable Vectorization
  - Scalar expansion: expand scalar T to array t
        DO I = 1, N
          T = a(I)
          a(I) = b(I)
          b(I) = T
        ENDDO
    becomes
        DO I = 1, N
          t(I) = a(I)
          a(I) = b(I)
          b(I) = t(I)
        ENDDO
  - Loop fission:
        DO I = 1, N
          a(I+1) = a(I) + C  ! cycle: loop not vectorizable
          b(I) = b(I) + D    ! no cycle
        ENDDO
    becomes
        DO I = 1, N
          a(I+1) = a(I) + C  ! cycle: not vectorizable
        ENDDO
        DO I = 1, N
          b(I) = b(I) + D    ! no cycle: vectorizable loop
        ENDDO
  - Other approaches: loop interchange, array renaming, etc.
- 22. Extra Considerations for SIMD
  - Special type of vectorization: array references must be contiguous, e.g., A[2*I] and A[2*(I+1)] are not contiguous
  - Loop versioning may be required for F90 arrays to guarantee contiguity
  - Alignment: sometimes no benefit if not aligned to a 128-bit boundary; peeling may be required
  - May need a remainder loop:
        DO I = 1, N
          a(I) = a(I) + C
        ENDDO
    becomes
        DO I = 1, N - N%4, 4
          a(I:I+3) = a(I:I+3) + C
        ENDDO
        DO I = N - N%4 + 1, N   ! remainder
          a(I) = a(I) + C
        ENDDO
- 23. SIMD with Reduction
  Replicate the accumulator for each computation thread:
      sum = 0
      DO i = 1, N
        sum = sum + A(i)
      ENDDO
  becomes
      sum0 = 0
      sum1 = 0
      sum2 = 0
      sum3 = 0
      DO i = 1, N, 4
        sum0 = sum0 + A(i)
        sum1 = sum1 + A(i+1)
        sum2 = sum2 + A(i+2)
        sum3 = sum3 + A(i+3)
      ENDDO
      sum = sum0 + sum1 + sum2 + sum3
- 24. Generating Vector Intrinsics
  Fission is usually needed to isolate the intrinsics:
      for(i=0; i<N; i++){
        u[i] = cos(ct[i]);
        a[i] = a[i] + 3.0;
      }
  becomes
      vcos(&ct[0], &u[0], N, 1, 1);
      for(i=0; i<N; i++)
        a[i] = a[i] + 3.0;
- 25. LNO Phase Structure
  1. Pre-optimization
  2. Fully Unroll Short Loops [lnopt_main.cxx]
  3. Build Array Dependence Graph [be/com/dep_graph.cxx]
  4. Miscellaneous Optimizations:
     - hoist varying lower bounds [access_main.cxx]
     - form min/max [ifminmax.cxx]
     - dead store eliminate arrays [dead.cxx]
     - array substitutions (forward and backward) [forward.cxx]
     - loop reversal [reverse.cxx]
  5. Loop Unswitching [cond.cxx]
  6. Cache Blocking [tile.cxx]
  7. Loop Interchange [permute.cxx]
  8. Loop Fusion [fusion.cxx]
  9. Hoist Messy Loop Bounds [array_bounds.cxx]
- 26. LNO Phase Structure (continued)
  10. Array Padding [pad.cxx]
  11. Parallelization [parallel.cxx]
  12. Shackling [shackle.cxx]
  13. Gather-Scatter [fis_gthr.cxx]
  14. Loop Fission [fission.cxx]
  15. SIMD [simd.cxx]
  16. Hoist IF [lnopt_hoistif.cxx]
  17. Generate Vector Intrinsics [vintr_fis.cxx]
  18. Generate Prefetches [prefetch.cxx]
  19. Array Scalarization [sclrze.cxx]
  20. Move Invariant Outside the Loop [minvariant.cxx]
  21. Inter-Iteration Common Sub-Expression Elimination [cse.cxx]
