
Instruction Level Parallelism Compiler optimization Techniques Anna University,K.Thirunadana Sikamani

Prepared from Compilers by Aho, Ullman.

  1. Compiler Optimization Techniques, CP 7031, Dr. K. Thirunadana Sikamani
  2. Principal Sources of Optimization. Elimination of unnecessary instructions in object code, or the replacement of one sequence of instructions by a faster sequence that does the same thing, is usually called "code improvement" or "code optimization". Topics: redundancy; semantics-preserving transformations; global common subexpressions; copy propagation; dead-code elimination; code motion. (8/25/2014, Compiler Optimization Techniques - Unit II)
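The transformations listed above can be seen on a tiny C fragment (an illustrative example, not taken from the slides): `before` recomputes a common subexpression and keeps a dead value; `after` is the code an optimizer would produce once the redundancy and dead code are removed.

```c
#include <assert.h>

/* Unoptimized: recomputes b + c and keeps a value that is never used. */
int before(int b, int c) {
    int t1 = b + c;      /* common subexpression */
    int t2 = b + c;      /* redundant recomputation of the same value */
    int dead = t1 * 2;   /* dead code: the value is never used */
    (void)dead;
    return t1 + t2;
}

/* After common-subexpression and dead-code elimination. */
int after(int b, int c) {
    int t = b + c;       /* computed once, reused */
    return t + t;
}
```

Both functions compute the same result; the optimized version simply does less work.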
  3. The speed of a program run on a processor with instruction-level parallelism depends on: 1. the potential parallelism in the program; 2. the available parallelism on the processor; 3. our ability to extract parallelism from the original sequential program; 4. our ability to find the best parallel schedule given scheduling constraints.
  4. Processor Architecture
  5. 1. Instruction pipelines and branch delays. 2. Pipelined execution. 3. Multiple instruction issue - VLIW (Very Long Instruction Word).
  6. Code Scheduling Constraints
  7. 1. Control-dependence constraints. 2. Data-dependence constraints. 3. Resource constraints.
  8. Control-dependence constraints: all the operations executed in the original program must be executed in the optimized one.
  9. Data-dependence constraints: the operations in the optimized program must produce the same results as the corresponding ones in the original program.
  10. Resource constraints: the schedule must not oversubscribe the resources of the machine.
  11. Data Dependence. True dependence - read after write (RAW). Antidependence - write after read (WAR). Output dependence - write after write (WAW).
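For simple copy statements of the form dst = src, like those in the exercise on the next slide, the three dependence kinds can be classified mechanically. A minimal sketch (the `Stmt` representation is an assumption for illustration, not from the slides):

```c
#include <assert.h>
#include <string.h>

/* Each statement is dst = src, as in the slide's examples (a = b). */
typedef struct { char dst, src; } Stmt;

/* Classifies the dependence of s2 on an earlier statement s1. */
const char *classify(Stmt s1, Stmt s2) {
    if (s2.src == s1.dst) return "RAW";   /* true dependence: read after write   */
    if (s2.dst == s1.src) return "WAR";   /* antidependence: write after read    */
    if (s2.dst == s1.dst) return "WAW";   /* output dependence: write after write */
    return "none";
}
```

On the exercise's pairs this gives: statements 1 and 4 (a = b, then d = a) form a RAW dependence; 3 and 5 (b = c, then c = d) a WAR; 1 and 6 (a = b twice) a WAW.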
  12. Classify the dependences for the following statements: 1. a = b; 2. c = d; 3. b = c; 4. d = a; 5. c = d; 6. a = b. Check the dependences between statements 1 and 4, 3 and 5, and 1 and 6.
  13. Give the register-level machine code that provides maximum parallelism for the expression ((u+v) + (w+x)) + (y+z), and also the version that minimizes register usage. Minimal-register sequence: LD r1,u; LD r2,v; ADD r1,r1,r2; LD r2,w; LD r3,x; ADD r2,r2,r3; ADD r1,r1,r2; LD r2,y; LD r3,z; ADD r2,r2,r3; ADD r1,r1,r2. Maximum-parallelism schedule (four clocks): clock 1: LD r1,u; LD r2,v; LD r3,w; LD r4,x; LD r5,y; LD r6,z. Clock 2: ADD r1,r1,r2; ADD r3,r3,r4; ADD r5,r5,r6. Clock 3: ADD r1,r1,r3. Clock 4: ADD r1,r1,r5.
  14. Finding dependences among memory accesses. 1. Array data-dependence analysis, e.g. for (i = 0; i < n; i++) A[2*i] = A[2*i+1]; 2. Pointer-alias analysis: two pointers are aliased if they refer to the same object. 3. Interprocedural analysis: determines whether the same variable is passed as two or more different arguments in a language that passes parameters by reference.
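One standard array data-dependence test that applies to the loop above is the GCD test (it is consistent with the analysis the slide names, but the test itself is not spelled out there): accesses A[a*i + b] and A[c*j + d] can touch the same element only if gcd(a, c) divides d - b.

```c
#include <assert.h>

static int gcd(int x, int y) {
    if (x < 0) x = -x;
    if (y < 0) y = -y;
    while (y) { int t = x % y; x = y; y = t; }
    return x;
}

/* GCD test: may A[a*i + b] and A[c*j + d] refer to the same element? */
int may_depend(int a, int b, int c, int d) {
    int g = gcd(a, c);
    if (g == 0) return b == d;       /* both subscripts are constants */
    return (d - b) % g == 0;         /* dependence possible iff g | (d - b) */
}
```

For the slide's loop, A[2*i] versus A[2*i+1] gives gcd(2, 2) = 2, which does not divide 1, so the writes and reads never overlap and the iterations are independent.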
  15. Tradeoff between register usage and parallelism. E.g., machine-independent intermediate-representation code: LD t1,a; ST b,t1; LD t2,c; ST d,t2. This code copies the values of a and c to b and d. If all memory locations are distinct, the two copies can proceed in parallel; alternatively, t1 and t2 can be assigned the same register to minimize register usage, which serializes the copies.
  16. Tradeoff between register usage and parallelism. Syntax tree for (a+b) + c + (d+e). Sequential machine code: LD r1,a; LD r2,b; ADD r1,r1,r2; LD r2,c; ADD r1,r1,r2; LD r2,d; LD r3,e; ADD r2,r2,r3; ADD r1,r1,r2. Parallel evaluation of the expression: r1 = a; r2 = b; r3 = c; r4 = d; r5 = e; r6 = r1 + r2; r7 = r4 + r5; r8 = r6 + r3; r9 = r8 + r7.
  17. Phase ordering between register allocation and code scheduling. If registers are allocated before scheduling, the resulting code tends to have many storage dependences that limit code scheduling. The other way around, the schedule created may require so many registers that register spilling becomes necessary. Spilling: storing the contents of a register in a memory location, so the register can be used for some other purpose. The choice depends on the characteristics of the program, e.g. numeric vs. non-numeric.
  18. Control Dependence. if (c) s1; else s2; /* s1 and s2 are control dependent on c */ while (c) s; /* s is control dependent on c */ if (a > t) b = a * a; d = a + c; /* d = a + c has no control dependence on the test */
  19. Speculative Execution Support. Prefetching: bringing data from memory to cache before it is used. Poison bits: support speculative loads of data from memory into the register file. Each register is augmented with a poison bit; the bit is set when an illegal memory location is accessed, so that an exception is raised only if the value is later used.
  20. Predicated Execution. Predicated instructions were invented to reduce the number of branches in a program. A predicated instruction is like a normal instruction but has an extra predicate operand to guard its execution. E.g., CMOVZ R2,R3,R1 moves the contents of R3 to R2 if R1 is zero. Thus if (a == 0) b = c + d; can be implemented as ADD R3,R4,R5; CMOVZ R2,R3,R1 /* a, b, c, d are allotted R1, R2, R4, R5 */
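The CMOVZ semantics can be mimicked at the source level with a branch-free mask select (an illustrative sketch, not slide material): the result is chosen by bitwise operations rather than a jump, which is exactly what lets the compiler replace the branch.

```c
#include <assert.h>

/* Sketch of CMOVZ R2, R3, R1: the result is r3 when r1 is zero,
   otherwise the old r2, selected by a mask instead of a branch. */
int cmovz(int r2, int r3, int r1) {
    int mask = (r1 == 0) ? -1 : 0;     /* all ones iff the predicate holds */
    return (r3 & mask) | (r2 & ~mask); /* pick r3 or keep r2, branch-free */
}
```

With a = 0 the guarded move takes effect (the result is r3); with any nonzero a the destination keeps its old value.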
  21. Basic Machine Model. Many machines can be represented as M = <R, T>. T is a set of operation types, such as loads, stores, and arithmetic operations. R = [r1, r2, ...] is a vector of hardware resources, where ri is the number of units available of the ith kind of resource. Resources include memory-access units, ALUs, and floating-point functional units.
  22. Basic Machine Model. Each operation has a set of input operands, a set of output operands, and a resource requirement. RTt is the resource-reservation table for operation type t: RTt[i, j] is the number of units of the jth resource used by an operation of type t, i clocks after it is issued.
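A resource-reservation table is easy to represent directly. The sketch below (the sizes, resource names, and `fits` helper are illustrative assumptions, not from the slides) checks whether issuing an operation of type t at clock s would oversubscribe the machine's resource vector R:

```c
#include <assert.h>

#define NRES 2   /* e.g. resource 0 = memory unit, resource 1 = ALU */
#define NCLK 2   /* clocks during which an operation can hold resources */

typedef struct {
    int rt[NCLK][NRES]; /* rt[i][j]: units of resource j used i clocks after issue */
} OpType;

/* Does issuing an op of type t at clock s fit under capacities R,
   given the units already committed in usage[clock][resource]? */
int fits(const OpType *t, int s, int usage[][NRES], const int R[NRES]) {
    for (int i = 0; i < NCLK; i++)
        for (int j = 0; j < NRES; j++)
            if (usage[s + i][j] + t->rt[i][j] > R[j]) return 0;
    return 1;
}

/* Demo: a load that holds the single memory unit for 2 clocks. */
int demo_fits(void) {
    OpType load = {{{1, 0}, {1, 0}}};
    int usage[4][NRES] = {{0}};
    int R[NRES] = {1, 1};
    int ok_first = fits(&load, 0, usage, R);  /* empty table: the load fits */
    usage[0][0] = 1;                          /* memory unit now busy at clock 0 */
    int ok_second = fits(&load, 0, usage, R); /* second load at clock 0 conflicts */
    return ok_first * 10 + ok_second;
}
```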
  23. Basic-Block Scheduling
  24. Data-Dependence Graphs. A graph G = (N, E), where N is a set of nodes representing the operations in the machine instructions, and E is a set of directed edges representing the data-dependence constraints among operations. 1. Each operation n in N has a resource-reservation table RTn, whose value is simply the resource-reservation table associated with the operation type of n. 2. Each edge e in E is labeled with a delay de, indicating that the destination node must be issued no earlier than de clocks after the source node is issued.
  25. Data-dependence graph example. Instructions i1-i7: i1: LD R2, 0(R1); i2: ST 4(R1), R2; i3: ADD R3, R3, R2; i4: ADD R3, R3, R4; i5: LD R3, 8(R1); i6: ST 0(R7), R7; i7: ST 12(R1), R3. The edges carry delays of 1 or 2 clocks. Notes: 1. a load operation takes 2 clock cycles; 2. R1 is a stack pointer with offsets from 0 to 12.
  26. List Scheduling of Basic Blocks. This involves visiting each node of the data-dependence graph in "prioritized topological order". Machine-resource vector R = [r1, r2, ...], where ri is the number of units available of the ith kind of resource. G = (N, E) is the data-dependence graph; RTn is the resource-reservation table of node n; an edge e = n1 -> n2 with delay de indicates that n2 must be issued at least de clocks after n1.
  27. List Scheduling Algorithm.
      RT = an empty reservation table;
      for (each n in N in prioritized topological order) {
          s = max over edges e = p -> n in E of (S(p) + de);
              /* the earliest time this instruction could begin,
                 given when its predecessors started */
          while (there exists i such that RT[s + i] + RTn[i] > R)
              s = s + 1;   /* delay the instruction further until the
                              needed resources are available */
          S(n) = s;
          for (all i) RT[s + i] = RT[s + i] + RTn[i];
      }
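The loop above can be sketched as runnable code. This is a deliberately simplified instance (one resource kind with `cap` units, each operation holding the resource only in its issue clock, priority taken as the given topological order), not the full algorithm with per-type reservation tables:

```c
#include <assert.h>

#define MAXN 8
#define MAXT 64

/* Simplified list scheduling: nodes arrive in prioritized topological order;
   pred[n][k] / delay[n][k] describe the incoming edges of node n.
   On return, S[n] is the clock at which node n is issued. */
void list_schedule(int n, int cap,
                   int npred[], int pred[][MAXN], int delay[][MAXN],
                   int S[]) {
    int RT[MAXT] = {0};                      /* resource-reservation table */
    for (int v = 0; v < n; v++) {
        int s = 0;
        for (int k = 0; k < npred[v]; k++) { /* earliest start: max over preds */
            int t = S[pred[v][k]] + delay[v][k];
            if (t > s) s = t;
        }
        while (RT[s] + 1 > cap) s++;         /* delay until a unit is free */
        S[v] = s;
        RT[s] += 1;                          /* commit the reservation */
    }
}

/* Demo: op1 depends on op0 with delay 2; op2 is independent; one unit. */
int demo_start(int which) {
    int npred[3] = {0, 1, 0};
    int pred[3][MAXN]  = {{0}, {0}, {0}};
    int delay[3][MAXN] = {{0}, {2}, {0}};
    int S[3];
    list_schedule(3, 1, npred, pred, delay, S);
    return S[which];
}
```

In the demo, op0 issues at clock 0; op1 must wait for the 2-clock delay, so it issues at clock 2; op2 cannot share clock 0 with op0 and slots into clock 1.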
  28. Prioritized topological order. Possible priority functions: 1. Critical path: the longest path through the data-dependence graph; the height of a node is the length of the longest path in the graph originating from that node. 2. Resource usage: the length of the schedule is constrained by the resources available; the critical resource is the one with the largest ratio of uses to the number of units available, and operations using more critical resources may be given higher priority. 3. Source ordering: the operation that shows up earlier in the source program is scheduled first.
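The critical-path priority above needs each node's height. A minimal sketch of that computation (the array-based graph encoding is an assumption for illustration): since the nodes are in topological order, a single reverse scan sees every successor before its predecessors.

```c
#include <assert.h>

/* h[v] = length of the longest path leaving v, using edge delays d.
   succ[v][k] is the k-th successor of v; nodes are topologically ordered. */
void heights(int n, int nsucc[], int succ[][4], int d[][4], int h[]) {
    for (int v = n - 1; v >= 0; v--) {       /* reverse topological scan */
        h[v] = 0;                            /* sinks have height 0 */
        for (int k = 0; k < nsucc[v]; k++) {
            int t = d[v][k] + h[succ[v][k]];
            if (t > h[v]) h[v] = t;          /* keep the longest outgoing path */
        }
    }
}

/* Demo chain: 0 -(2)-> 1 -(1)-> 2. */
int demo_height(int which) {
    int nsucc[3] = {1, 1, 0};
    int succ[3][4] = {{1}, {2}, {0}};
    int d[3][4]    = {{2}, {1}, {0}};
    int h[3];
    heights(3, nsucc, succ, d, h);
    return h[which];
}
```

Scheduling nodes in decreasing height greedily favors the critical path.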
  29. Result of applying list scheduling to the data-dependence graph example above, using height as the priority function. Memory unit: LD R3, 8(R1); LD R2, 0(R1); then, after the 2-clock load delay, ST 4(R1), R2; ST 12(R1), R3; ST 0(R7), R7. ALU: ADD R3, R3, R4; ADD R3, R3, R2.
  30. Global Code Scheduling. Strategies that consider more than one basic block at a time are referred to as global scheduling. Conditions (control and data dependences must be obeyed): 1. all instructions in the original program are executed in the optimized one, and 2. while the optimized program may execute extra instructions speculatively, these instructions must not have any unwanted side effects.
  31. Basic Block. A basic block is a sequence of instructions in which control enters through the first instruction and leaves via the last instruction, with no halt or jump/branch in between (the flow is linear).
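Basic blocks are conventionally found by marking leaders: the first instruction, every branch target, and every instruction following a branch (these are the standard rules from Aho and Ullman; the encoding below, with one optional branch target per instruction, is an illustrative assumption).

```c
#include <assert.h>

#define NOBR (-1)

/* target[i] = index the branch at instruction i jumps to, or NOBR if
   instruction i is not a branch. leader[i] is set to 1 for block leaders. */
void find_leaders(int n, const int target[], int leader[]) {
    for (int i = 0; i < n; i++) leader[i] = 0;
    if (n > 0) leader[0] = 1;                  /* rule 1: first instruction */
    for (int i = 0; i < n; i++) {
        if (target[i] != NOBR) {
            leader[target[i]] = 1;             /* rule 2: branch target */
            if (i + 1 < n) leader[i + 1] = 1;  /* rule 3: after a branch */
        }
    }
}

/* Demo: 6 instructions; instruction 2 branches to instruction 5. */
int demo_leader(int i) {
    int target[6] = {NOBR, NOBR, 5, NOBR, NOBR, NOBR};
    int leader[6];
    find_leaders(6, target, leader);
    return leader[i];
}
```

Each basic block then runs from a leader up to (but not including) the next leader.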
  32. Primitive code motion. Source program: if (a == 0) goto L; c = b; L: e = d + d;
  33. Locally scheduled machine code:
      B1:    LD R6, 0(R1); nop; BEQZ R6, L
      B2:    LD R7, 0(R2); nop; ST 0(R3), R7
      B3: L: LD R8, 0(R4); nop; ADD R8, R8, R8; ST 0(R5), R8
  34. Globally scheduled machine code:
      B1:    LD R6, 0(R1); LD R8, 0(R4); LD R7, 0(R2); ADD R8, R8, R8; BEQZ R6, L; ST 0(R5), R8
      B3':   ST 0(R5), R8
      B3: L: ST 0(R3), R7
  35. Upward code motion. Moves an operation from block src up a control-flow path to block dst; such a move must not violate any data dependences, and it makes the paths through dst and src run faster. Case 1: src does not postdominate dst. In this case there exists a path that passes through dst but does not reach src; this code motion is illegal unless the operation moved has no unwanted side effects.
  36. (Contd.) Case 2: dst does not dominate src. In this case there exists a path that reaches src without first going through dst; we need to move copies of the moved operation along such paths. Constraints: 1. the operands of the operation must hold the same values as in the original; 2. the result must not overwrite a value that is still needed; and 3. it must not itself be subsequently overwritten before reaching src.
  37. Downward code motion. Moves an operation from block src down a control-flow path to block dst. Case 1: src does not dominate dst; there exists a path to dst that does not pass through src. Case 2: dst does not postdominate src; there exists a path through src that does not pass through dst.
  38. E.g., if (x == 0) a = b; else a = c; d = a; with x, b, c, a, d stored at 0(R5), 0(R6), 0(R7), 0(R8), 0(R9). Flow graph:
      B1 (x == 0): LD R1, x; nop; BEQZ R1, L
      B2 (a = b):  LD R2, b; nop; ST a, R2
      B3 (a = c):  LD R3, c; nop; ST a, R3
      B4 (d = a):  L: LD R4, a; nop; ST d, R4
  39. The same example after global scheduling:
      B1: LD R1, 0(R5); LD R3, 0(R7); LD R2, 0(R6); ST 0(R8), R3; BEQZ R1, L /* or CMOVZ 0(R8), R2, R1 */
      B2: ST 0(R8), R2
      B4: L: LD R4, 0(R8); nop; ST 0(R9), R4
  40. Updating data dependences. Code motion can change the data-dependence relations between operations, so data dependences must be updated after each code motion. Example: with two assignments X = 1 and X = 2 on different paths, if one assignment is moved up, the other cannot be, even though X is not live before the code motion.
  41. Global Scheduling Algorithms. Region-based scheduling covers the two easiest forms of code motion: 1. moving operations up to control-equivalent basic blocks, and 2. moving operations speculatively up one branch to a dominating predecessor. Assignment: the region-based scheduling algorithm.
  42. Loop Unrolling. Unrolling creates more instructions in the loop body, permitting global scheduling algorithms to find more parallelism. for (i = 0; i < N; i++) { S(i); } can be unrolled to for (i = 0; i + 4 < N; i += 4) { S(i); S(i+1); S(i+2); S(i+3); } (with a cleanup loop for the remaining iterations). Similarly, repeat S; until C; can be unrolled to repeat { S; if (C) break; S; if (C) break; S; } until C;
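A complete four-way unrolling needs the remainder loop that the slide's condition implies, since N is not always a multiple of 4. A sketch on a simple array update (the `saxpy_unrolled` name and kernel are illustrative, not from the slides):

```c
#include <assert.h>

/* Four-way unrolled loop body plus the cleanup loop for the tail. */
void saxpy_unrolled(int n, float a, const float *x, float *y) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* unrolled body: more ILP to schedule */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)             /* cleanup: the remaining 0-3 iterations */
        y[i] += a * x[i];
}

/* Demo: sum of y after updating the first n of 7 elements. */
float demo_sum(int n) {
    float x[7] = {1, 1, 1, 1, 1, 1, 1};
    float y[7] = {0, 0, 0, 0, 0, 0, 0};
    saxpy_unrolled(n, 2.0f, x, y);
    float s = 0;
    for (int i = 0; i < n; i++) s += y[i];
    return s;
}
```

The unrolled body gives the scheduler four independent multiply-adds per iteration to interleave.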
  43. Neighborhood Compaction. Examine each pair of basic blocks that are executed one after the other, and check whether any operation can be moved up or down between them to improve the execution time of those blocks. If such a pair is found, check whether the instruction to be moved needs to be duplicated along other paths.
  44. Advanced Code Motion Techniques. Adding new basic blocks along control-flow edges originating from blocks with more than one predecessor; moving instructions out of basic blocks so that a block can be eliminated completely. The code to be executed in each basic block is scheduled once and for all as each block is visited, because the algorithm only moves operations up to dominating blocks. Implementing downward code motion is harder in an algorithm that visits basic blocks in topological order; we move all operations that (i) can be moved and (ii) cannot be executed in their native block.
  45. Interaction with Dynamic Schedulers. A dynamic scheduler can create new schedules according to run-time conditions. High-latency instructions are issued early. Data-prefetch instructions help the dynamic scheduler by making data available in advance. Data-dependent operations are put in the correct order to ensure program correctness. For best performance the compiler should assign long delays to dependences that are likely to occur and short ones to those that are not. Branch misprediction must be avoided.
  46. Software Pipelining
  47. Software Pipelining. Numerical applications often have loops whose iterations are completely independent of one another. Such loops with many iterations have enough parallelism to saturate all the resources in a processor; it is up to the scheduler to take full advantage of the available parallelism. Software pipelining schedules an entire loop at a time to exploit the parallelism across iterations.
  48. Machine Model. The machine can issue in a single clock: one load, one store, one arithmetic operation, and one branch operation. The machine has a loop-back operation BL R, L, which decrements register R and, unless the result is 0, branches to location L.
  49. Machine Model. Memory operations have an auto-increment addressing mode, denoted by ++ after the register; the register is automatically incremented to point to the next consecutive address after each access. The arithmetic operations are fully pipelined: they can be initiated every clock, but their results are not available until 2 clocks later. All other instructions have a single-clock latency.
  50. Typical do-all loop: for (i = 0; i < n; i++) D[i] = A[i] * B[i] + c; Locally scheduled code (R1, R2, R3 = &A, &B, &D; R4 = c; R10 = n - 1): L: LD R5, 0(R1++); LD R6, 0(R2++); MUL R7, R5, R6; nop; ADD R8, R7, R4; nop; ST 0(R3++), R8; BL R10, L.
  51. Five unrolled iterations of for (i = 0; i < n; i++) D[i] = A[i] * B[i] + c;
      Clock | j=1 | j=2 | j=3 | j=4 | j=5
        1   | LD  |     |     |     |
        2   | LD  |     |     |     |
        3   | MUL | LD  |     |     |
        4   |     | LD  |     |     |
        5   |     | MUL | LD  |     |
        6   | ADD |     | LD  |     |
        7   |     |     | MUL | LD  |
        8   | ST  | ADD |     | LD  |
        9   |     |     |     | MUL | LD
       10   |     | ST  | ADD |     | LD
       11   |     |     |     |     | MUL
       12   |     |     | ST  | ADD |
       13   |     |     |     |     |
       14   |     |     |     | ST  | ADD
       15   |     |     |     |     |
       16   |     |     |     |     | ST
  52. Software-pipelined code:
      Clock | j=1 | j=2 | j=3 | j=4
        1   | LD  |     |     |
        2   | LD  |     |     |
        3   | MUL | LD  |     |
        4   |     | LD  |     |
        5   |     | MUL | LD  |
        6   | ADD |     | LD  |
        7 L:|     |     | MUL | LD
        8   | ST  | ADD |     | LD   BL (L)
        9   |     |     |     | MUL
       10   |     | ST  | ADD |
       11   |     |     |     |
       12   |     |     | ST  | ADD
       13   |     |     |     |
       14   |     |     |     | ST
  53. A new iteration can be started on the pipeline every 2 clocks. When the first iteration proceeds to stage three, the second iteration starts to execute. By clock 7 the pipeline is fully filled with the first four iterations. In the steady state, four consecutive iterations are executing at the same time. The sequence of instructions in clocks 1 through 6 is called the prolog; clocks 7 and 8 are the steady state; and clocks 9 through 14 are the epilog.
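The prolog / steady-state / epilog structure can be shown at the source level by manually pipelining the do-all loop in two stages (a simplified two-stage sketch of the idea, not the four-deep schedule on the slide): stage 1 computes the product for iteration i while stage 2 finishes iteration i-1.

```c
#include <assert.h>

/* Two-stage software-pipelined form of: for (i) D[i] = A[i] * B[i] + c; */
void pipelined(int n, const int *A, const int *B, int c, int *D) {
    if (n <= 0) return;
    int m = A[0] * B[0];           /* prolog: start iteration 0 */
    for (int i = 1; i < n; i++) {  /* kernel: two iterations in flight */
        int m_next = A[i] * B[i];  /* stage 1 of iteration i */
        D[i - 1] = m + c;          /* stage 2 of iteration i - 1 */
        m = m_next;
    }
    D[n - 1] = m + c;              /* epilog: finish the last iteration */
}

/* Demo with small constant inputs. */
int demo_pipe(int i) {
    int A[4] = {1, 2, 3, 4};
    int B[4] = {5, 6, 7, 8};
    int D[4];
    pipelined(4, A, B, 10, D);
    return D[i];
}
```

The kernel body mirrors the steady state of the schedule: every trip through it retires one iteration and launches the next.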
