Global Load Instruction Aggregation Based on Code Motion

Published at The 2012 International Symposium on Parallel Architectures, Algorithms and Programming.
If you download this file, you can see the explanation for each slide.

  1. The 2012 International Symposium on Parallel Architectures, Algorithms and Programming, December 18, 2012. Global Load Instruction Aggregation Based on Code Motion
  2. Outline: Background (previous works, motivation); Partial Redundancy Elimination (PRE); Lazy Code Motion (LCM); Global Load Instruction Aggregation (GLIA); experimental results; conclusion.
  3. Background: [figure contrasting processor speed with main-memory speed].
  4. Background: [figure] the cache memory placed between the processor and main memory is important.
  5. Previous works: 1. Prefetch instructions. 2. Transform loop structures (loop interchange). Before: for(j=0;j<10;j++) for(i=0;i<10;i++) ... = a[i][j]; After: for(i=0;i<10;i++) for(j=0;j<10;j++) ... = a[i][j].
  6-10. Previous works: [animated figure stepping through the a[i][j] accesses for j = 0, 1, ... and i = 0, 1, ... to show the access order of the original loop nest].
  11. Previous works (recap): 1. Prefetch instructions. 2. Transform loop structures; the before/after loops are the same as on slide 5. A C sketch of this interchange follows.
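
The loop interchange on slides 5 and 11 can be written out as a small, self-contained C sketch. The array size 10 and the access a[i][j] come from the slide; the sum accumulator and the printf are only there to make the example compile and run.

```c
#include <stdio.h>

#define N 10

int main(void) {
    static int a[N][N];          /* zero-initialized row-major array */
    int sum = 0;

    /* Before: j is the outer loop, so consecutive accesses a[0][j], a[1][j], ...
     * are N ints apart and tend to touch a different cache line each time. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    /* After: i is the outer loop, so consecutive accesses are adjacent in
     * memory and reuse the cache line that is already loaded. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    printf("%d\n", sum);
    return 0;
}
```

In the "before" order consecutive iterations are a whole row apart in memory, while in the "after" order they touch adjacent elements, so each cache line is reused before it is evicted.
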
  12. Problems: 1. These are local techniques (e.g., they target only the initial load instruction, or loops). 2. They require changing the program structure.
  13-18. How can we apply cache optimization to any program globally? [Animated figure: main memory and the cache; the program loads a[i], then b[i], then a[i+1]. Loading b[i] evicts the cache line holding a[i+1], so the access to a[i+1] misses.] We can remove this cache miss by changing the order of the accesses.
  19-26. Code motion: the running example is x = a[i]; y = x+1; z = b[i]; w = a[i+j], where the line holding a[i+j] is expelled from the cache before w is loaded. Simply hoisting w = a[i+j] next to x = a[i] groups the a[] accesses, but it lengthens the live range of w, which then overlaps with that of x and can cause a register spill. The load is therefore delayed: the access order becomes x = a[i]; y = x+1; w = a[i+j]; z = b[i], keeping the a[] accesses together while keeping w's live range short. A C sketch of these three versions follows.
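
A minimal C sketch of the three versions discussed on slides 19-26. The arrays, the use() helper, and the concrete sizes are assumptions added so that the fragment compiles; only the statement orders come from the slides.

```c
#include <stdio.h>

static int a[128], b[64];

static void use(int v) { printf("%d\n", v); }   /* placeholder consumer */

/* Original order (slide 19): the access to b[i] sits between the two accesses
 * to a[], so the cache line holding a[i+j] may be evicted before w is loaded. */
static void original(int i, int j) {
    int x = a[i];
    int y = x + 1;
    int z = b[i];
    int w = a[i + j];
    use(x); use(y); use(z); use(w);
}

/* Naive hoisting (slides 20-23): w = a[i+j] is moved next to x = a[i]; the a[]
 * accesses become adjacent, but w now stays live across y and z, which raises
 * register pressure and can force a spill. */
static void hoisted(int i, int j) {
    int x = a[i];
    int w = a[i + j];
    int y = x + 1;
    int z = b[i];
    use(x); use(y); use(z); use(w);
}

/* Delayed placement (slide 26): the load of a[i+j] is moved only above z = b[i],
 * so the a[] accesses are still grouped before the b[] access while w's live
 * range stays short. */
static void delayed(int i, int j) {
    int x = a[i];
    int y = x + 1;
    int w = a[i + j];
    int z = b[i];
    use(x); use(y); use(z); use(w);
}

int main(void) {
    original(1, 2);
    hoisted(1, 2);
    delayed(1, 2);
    return 0;
}
```
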
  27. Implementation: we use Partial Redundancy Elimination (PRE), a code optimization that eliminates redundant expressions.
  28. PRE example: [figure] before: x = a[i] ... y = a[i]; after: t = a[i]; x = t ... y = t, with t = a[i] inserted on the path that lacked the load. A C sketch follows.
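
A hedged C sketch of the PRE example on slide 28. The slide figure is only partially recoverable, so the diamond shape (the load appearing on one branch and again at the join) is an assumption based on the usual PRE illustration; cond and the surrounding helper code are placeholders.

```c
#include <stdio.h>

static int a[16];

/* Before PRE: on the path where cond is true, a[i] is loaded twice. */
static int before_pre(int i, int cond) {
    int x = 0;
    if (cond)
        x = a[i];
    int y = a[i];              /* partially redundant load */
    return x + y;
}

/* After PRE: a temporary t carries the value. The load is inserted on the
 * branch that lacked it, and the join reuses t instead of reloading a[i]. */
static int after_pre(int i, int cond) {
    int x = 0, t;
    if (cond) {
        t = a[i];
        x = t;
    } else {
        t = a[i];              /* inserted load */
    }
    int y = t;                 /* redundant load eliminated */
    return x + y;
}

int main(void) {
    a[3] = 7;
    printf("%d %d\n", before_pre(3, 1), after_pre(3, 1));
    return 0;
}
```
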
  29. LCM determines two insertion points, Earliest and Latest. Earliest(n) means that node n is the insertion candidate closest to the start node; Latest(n) means that node n is the insertion candidate closest to the nodes that contain the same load instruction. Knoop, J., Rüthing, O., Steffen, B.: Lazy Code Motion, Proc. Programming Language Design and Implementation (PLDI), ACM, pp. 224-234, 1992.
  30-35. LCM example: [animated figure] two loads x = a[i] and y = a[i]; a temporary t = a[i] is first placed at the Earliest insertion point, then delayed toward the Latest one, and the original loads are replaced by x = t and y = t. A C sketch follows.
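
A sketch of what the Earliest/Latest placement of slides 29-35 looks like in straight-line code. The work() calls are placeholders standing in for the intervening program points; they are not from the slides.

```c
#include <stdio.h>

static int a[16];

static void work(void) { /* stands in for unrelated computation */ }

/* Before LCM: a[i] is loaded twice. */
static int before_lcm(int i) {
    work();                    /* the Earliest legal insertion point is up here */
    work();
    int x = a[i];
    int y = a[i];              /* fully redundant reload */
    return x + y;
}

/* After LCM: one load t = a[i] is placed at the Latest point (as close as
 * possible to the first use) rather than at the Earliest point, which keeps
 * t's live range short. Both uses then read t. */
static int after_lcm(int i) {
    work();
    work();
    int t = a[i];              /* Latest placement */
    int x = t;
    int y = t;
    return x + y;
}

int main(void) {
    a[5] = 9;
    printf("%d %d\n", before_lcm(5), after_lcm(5));
    return 0;
}
```
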
  36. Global Load Instruction Aggregation (GLIA). Purpose: 1. Decrease cache misses. 2. Suppress register spills. Extensions over LCM: 1. Move load instructions even when they are not redundant. 2. Delay them while considering the order of memory accesses.
  37-41. GLIA example: [animated figure] the code x = a[i]; y = b[i]; w = a[i+1]. The non-redundant load of a[i+1] is moved next to x = a[i] as t = a[i+1], and the original site becomes w = t, so the two accesses to the same cache line are adjacent. A C sketch follows.
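
The GLIA example on slides 37-41, written as a compilable C sketch. The statement order is taken from the slides; the array sizes and the surrounding functions are assumptions needed to make it self-contained.

```c
#include <stdio.h>

static int a[16], b[16];

/* Before GLIA: the access to b[i] is interleaved between the two a[] accesses. */
static int before_glia(int i) {
    int x = a[i];
    int y = b[i];
    int w = a[i + 1];
    return x + y + w;
}

/* After GLIA: the non-redundant load of a[i+1] is aggregated next to x = a[i]
 * as the temporary t; the original site just reads t. */
static int after_glia(int i) {
    int x = a[i];
    int t = a[i + 1];          /* moved load (it was not redundant) */
    int y = b[i];
    int w = t;
    return x + y + w;
}

int main(void) {
    a[2] = 1; a[3] = 2; b[2] = 3;
    printf("%d %d\n", before_glia(2), after_glia(2));
    return 0;
}
```
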
  42-47. Application to the entire program: [animated figure] a control-flow graph containing the loads = a[i], = b[i], and = a[i+1] (twice); after GLIA the accesses are reordered so that = a[i] and = a[i+1] are grouped together ahead of = b[i]. A hedged C sketch follows.
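
The control-flow graph on slides 42-47 cannot be reconstructed exactly from this transcript, so the following sketch only assumes a plausible shape: a branch whose two targets both use a[i+1]. Under that assumption, GLIA places a single load of a[i+1] right after the load of a[i], ahead of the branch and of b[i].

```c
#include <stdio.h>

static int a[16], b[16];

/* Assumed "before" shape: on either branch target, b[i] is read between the
 * access to a[i] and the access to a[i+1]. */
static int before_glia(int i, int cond) {
    int x = a[i];
    int y = b[i];
    int w = cond ? a[i + 1] : a[i + 1] * 2;
    return x + y + w;
}

/* After GLIA: a single aggregated load of a[i+1] sits right after a[i], before
 * the branch; both paths reuse the temporary, and b[i] no longer splits the
 * accesses to the same cache line. */
static int after_glia(int i, int cond) {
    int x = a[i];
    int t = a[i + 1];
    int y = b[i];
    int w = cond ? t : t * 2;
    return x + y + w;
}

int main(void) {
    a[0] = 1; a[1] = 2; b[0] = 3;
    printf("%d %d\n", before_glia(0, 1), after_glia(0, 1));
    return 0;
}
```
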
  48. Experiment. Implementation: our technique is implemented in the COINS compiler as an LIR converter. Benchmark: SPEC2000. Measurements: 1. Execution efficiency. 2. The number of cache misses.
  49. Experiment (1/2) | Execution efficiency. Environment: SPARC64-V 2 GHz, Solaris 10. Optimization settings: BASE applies Dead Code Elimination (DCE); GLIADCE applies GLIA and DCE.
  50. Experiment (1/2) | Execution efficiency: [chart] the improvement for art is about 10.5%.
  51-52. Reason for the decrease 1: speculative code motion. [Figure] before: = a[i], = b[i], = a[j]; after motion: = a[i], = a[j], = b[i], so the hoisted load of a[j] can execute on paths that never needed it. A hedged C sketch follows.
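
A hedged C sketch of "decrease reason 1" on slides 51-52. The branch condition and the use() helper are assumptions; the point illustrated is that a load hoisted speculatively above a branch executes even on paths that never needed its value.

```c
#include <stdio.h>

static int a[16], b[16];

static void use(int v) { printf("%d\n", v); }   /* placeholder consumer */

/* Before: a[j] is loaded only on the path that actually uses it. */
static void before_motion(int i, int j, int cond) {
    use(a[i]);
    if (cond) {
        use(b[i]);
        use(a[j]);
    }
}

/* After speculative motion: a[j] is loaded next to a[i] on every path, so when
 * cond is false the load is wasted work and may evict data the rest of the
 * program still needs. */
static void after_motion(int i, int j, int cond) {
    use(a[i]);
    int t = a[j];              /* speculative: runs even when cond is false */
    if (cond) {
        use(b[i]);
        use(t);
    }
}

int main(void) {
    a[1] = 4; a[2] = 5; b[1] = 6;
    before_motion(1, 2, 0);
    after_motion(1, 2, 0);
    return 0;
}
```
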
  53. Reason for the decrease 2: register spills. [Chart: the number of spills.]
  54. Experiment (2/2) | Cache misses. System parameters of the x86 machine: Intel Core i5-2320, 3.00 GHz; 8 floating-point registers; 8 integer registers; L1D cache 32 KB; L2 cache 256 KB; L3 cache 6144 KB.
  55. Experiment (2/2) | Level 2 cache misses: [chart] the improvement for twolf is about 10.6%.
  56. Experiment (2/2) | Level 3 cache misses: [chart] the improvement for art is about 93.7%.
  57. Conclusion: we proposed a new cache optimization, GLIA. 1. GLIA can be applied to any program. 2. GLIA improves cache efficiency. 3. GLIA takes register spills into account. Thank you for your attention.
