Apresentacao

Published in: Technology
Transcript

  • 1. Performance Optimization with PerfExpert and MACPO Jim Browne, Ashay Rane and Leo Fialho ICS 2013 Victo Ap
  • 2. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure
      Agenda
      1 Introduction
      2 PerfExpert
      3 MACPO
      4 GPU/Accelerators
      5 Closure
  • 4. Agenda, in the morning:
      - 09:00 Introduction and motivation [Jim]
      - 09:20 What can PerfExpert provide to you? [Leo]
      - 09:30 Demo [Leo]
      - 09:45 How does PerfExpert do that? (opening Pandora’s box) [Leo]
      - 10:15 Extending PerfExpert [Leo]
      - 10:30 (Coffee?) break [everyone, including you]
      - 10:45 Hands-on tutorial [all the team]
      - 11:45 Morning closure [all the team]
  • 6. Agenda, in the afternoon:
      - 01:30 What can MACPO provide to you? [Ashay]
      - 02:00 Demo [Ashay, Jim]
      - 02:30 How does MACPO do that? [Ashay]
      - 03:15 (Coffee?) break [everyone, including you]
      - 03:30 Hands-on tutorial [Ashay]
      - 04:00 Selecting code segments to run on GPUs/accelerators [Jim]
      - 04:30 Enhancing PerfExpert with MACPO analysis [all the team]
      - 04:45 Afternoon closure and future work [all the team]
  • 14. Overview: why PerfExpert?
      - Problem: HPC systems operate far below peak
          - Chip/node architectural complexity is growing rapidly
          - Performance optimization for these chips requires deep knowledge of architectures, code patterns, compilers, etc.
      - Performance optimization tools
          - Powerful in the hands of experts
          - Require detailed performance and system expertise
          - HPC application developers are domain experts, not computer gurus
      - Result: Many HPC programmers do not use these tools (seriously)
  • 25. Goal for PerfExpert: democratize optimization!
      - Subgoals:
          - Make use of the tool as simple as possible
          - Start with only chip/node level optimization
          - Make it adaptable across multiple architectures
          - Design for extension to communication and I/O performance
      - How to accomplish this?
          - Formulate the performance optimization task as a workflow of subtasks
          - Leverage the state of the art: build on the best available tools for the subtasks to minimize the effort and cost of development
          - Automate the entire workflow
  • 35. Introduction. The four stages of automatic performance optimization:
      1. Measurement and attribution
      2. Analysis, diagnosis and identification of bottlenecks
      3. Selection of effective optimizations
      4. Implementation of optimizations
      Use of the state of the art, by stage:
      - Stage 1: HPCToolkit, and MACPO (based on ROSE)
      - Stages 2 and 3: PerfExpert team
      - Stage 4: PerfExpert team, based on ROSE, PIPS, Bison and Flex
  • 41. Introduction. Uniqueness of PerfExpert:
      - Nearly complete coverage of the first three stages of optimization at the chip/node level
      - The framework for implementing optimizations is complete, and several optimizations are already implemented
      - Integrates code-segment-focused and data-structure-based measurements (MACPO)
      - The workflow will apply to communication and I/O optimization as well
  • 50. Introduction. Unique properties of MACPO:
      - Multicore-resolved traces
      - Code-segment-local measurement
      - Data-structure-specific traces
      - Order of magnitude lower measurement overhead
      - More accurate (associative) cache models
      - Strides by data structure and code segment
      - Architecture “independent” metrics
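The slides do not show how MACPO builds its cache models or stride statistics, so the following is only a minimal sketch of the general technique, not MACPO's implementation. It replays an address trace through an LRU set-associative cache model and attributes misses and strides to named data structures. The cache geometry (32 KiB, 64-byte lines, 8 ways), the trace format, and the names `SetAssociativeCache` and `analyze` are all assumptions made for illustration.

```python
from collections import Counter, OrderedDict


class SetAssociativeCache:
    """LRU set-associative cache model; all sizes are illustrative assumptions."""

    def __init__(self, size_bytes=32 * 1024, line_bytes=64, ways=8):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One ordered dict per set: keys are line numbers, oldest entry first.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, address):
        """Touch one byte address; return True on a hit, False on a miss."""
        line = address // self.line_bytes
        lines = self.sets[line % self.num_sets]
        if line in lines:
            lines.move_to_end(line)  # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[line] = True
        return False


def analyze(trace, cache):
    """Attribute misses and strides to each named data structure.

    trace is an iterable of (data_structure_name, byte_address) pairs.
    """
    misses = Counter()
    strides = {}
    last_address = {}
    for name, address in trace:
        if not cache.access(address):
            misses[name] += 1
        if name in last_address:
            stride = address - last_address[name]
            strides.setdefault(name, Counter())[stride] += 1
        last_address[name] = address
    return misses, strides


# Hypothetical trace: "a" is read contiguously (8-byte elements), while "b"
# is read with a 512-byte stride, e.g. down a column of a row-major matrix.
N = 64
B_BASE = 1 << 20
trace = []
for k in range(N):
    trace.append(("a", 8 * k))
    trace.append(("b", B_BASE + 512 * k))

misses, strides = analyze(trace, SetAssociativeCache())
for name in ("a", "b"):
    print(name, "misses:", misses[name], "strides:", dict(strides[name]))
```

In this toy trace the contiguous array misses once per cache line, while the strided array misses on every access, which is the kind of per-data-structure contrast the bullets above describe.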
  • 51. Agenda: 1 Introduction, 2 PerfExpert, 3 MACPO, 4 GPU/Accelerators, 5 Closure
  • 60. What can PerfExpert provide to you?
      - Performance report:
          - Identification of bottlenecks by relevance
          - Performance analysis based on performance metrics
          - Recommendations for optimization
      - There are three possible outputs:
          - Performance report only
          - List of recommendations
          - Fully automated code transformation
  • 62. What can PerfExpert provide to you? Performance report:

      Loop in function compute() at mm.c:8 (99.8% of the total runtime)
      ===============================================================================
      ratio to total instrns        %   0.........25...........50.........75........100
         - floating point        : 100  ***********************************************
         - data accesses         :  25  ************
      * GFLOPS (% max)           :  12  ******
         - packed                :   0  *
         - scalar                :  12  ******
      -------------------------------------------------------------------------------
      performance assessment      LCPI  good......okay......fair......poor......bad....
      * overall                  : 3.0  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
      upper bound estimates
      * data accesses            : 9.6  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
         - L1d hits              : 0.9  >>>>>>>>>>>>>>>>>
         - L2d hits              : 1.8  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
         - L2d misses            : 6.9  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
      * instruction accesses     : 0.1  >
         - L1i hits              : 0.0  >
         - L2i hits              : 0.0  >
         - L2i misses            : 0.1  >
      * data TLB                 : 4.6  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
      * instruction TLB          : 0.0  >
      * branch instructions      : 0.1  >>
         - correctly predicted   : 0.1  >>
         - mispredicted          : 0.0  >
      * floating-point instr     : 5.1  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
         - fast FP instr         : 5.1  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
         - slow FP instr         : 0.0  >
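One way to read the report for mm.c:8 is to rank the LCPI categories and check that each category equals the sum of its sub-entries. The sketch below does this with the numbers copied from the slide. It is not part of PerfExpert: the dictionary layout and the helper name `rank_categories` are assumptions made for illustration.

```python
# LCPI values copied from the report for the loop at mm.c:8.
# Each entry maps a category to (reported total, sub-entries).
report = {
    "data accesses": (9.6, {"L1d hits": 0.9, "L2d hits": 1.8, "L2d misses": 6.9}),
    "instruction accesses": (0.1, {"L1i hits": 0.0, "L2i hits": 0.0, "L2i misses": 0.1}),
    "data TLB": (4.6, {}),
    "instruction TLB": (0.0, {}),
    "branch instructions": (0.1, {"correctly predicted": 0.1, "mispredicted": 0.0}),
    "floating-point instr": (5.1, {"fast FP instr": 5.1, "slow FP instr": 0.0}),
}
OVERALL_LCPI = 3.0


def rank_categories(categories):
    """Return (name, total) pairs, largest LCPI first.

    Also checks that the sub-entries add up to the reported category total,
    allowing for the one-decimal rounding used in the report.
    """
    totals = []
    for name, (total, parts) in categories.items():
        if parts:
            assert abs(sum(parts.values()) - total) < 0.05, name
        totals.append((name, total))
    return sorted(totals, key=lambda item: item[1], reverse=True)


ranking = rank_categories(report)
category_sum = round(sum(total for _, total in ranking), 1)

for name, total in ranking:
    print(f"{name:22s} {total:4.1f}")
print("sum of categories:", category_sum, "overall:", OVERALL_LCPI)
```

The category values add up to far more than the overall LCPI of 3.0. That is consistent with the report labelling them "upper bound estimates". For this loop the ranking puts data accesses first, followed by floating-point instructions and the data TLB.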
  • 64. What can PerfExpert provide to you? List of recommendations:

      #--------------------------------------------------
      # Recommendations for mm.c:8
      #--------------------------------------------------
      #
      # This is a possible recommendation for this code segment
      #
      Recommendation ID: 31
      Recommendation Description: change the order of loops
      Recommendation Reason: this optimization may improve the memory access pattern and make it more cache and TLB friendly
      Pattern Recognizers: c loop2 f loop2
      Code example:
      loop i { loop j {...} }
       =====>
      loop j { loop i {...} }
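The effect of recommendation 31 on the memory access pattern can be sketched by listing the element offsets each loop order touches. The source of mm.c is not shown in the slides, so the example assumes a hypothetical row-major `n x n` array `a` whose element `a[j][i]` lives at offset `j * n + i`. The function names are illustrative only. Whether the interchange is legal in real code depends on the loop's data dependences, which this sketch does not check.

```python
from collections import Counter


def trace_loop_i_outer(n):
    """loop i { loop j { touch a[j][i] } }: the inner loop walks down a column."""
    return [j * n + i for i in range(n) for j in range(n)]


def trace_loop_j_outer(n):
    """loop j { loop i { touch a[j][i] } }: the inner loop walks along a row."""
    return [j * n + i for j in range(n) for i in range(n)]


def stride_histogram(offsets):
    """Count the differences between consecutive element offsets."""
    return Counter(b - a for a, b in zip(offsets, offsets[1:]))


n = 4
before = trace_loop_i_outer(n)
after = trace_loop_j_outer(n)

# Both orders touch exactly the same elements, only in a different order.
same_elements = sorted(before) == sorted(after)

print("loop i outer:", dict(stride_histogram(before)))
print("loop j outer:", dict(stride_histogram(after)))
```

With `loop i` outermost, most consecutive accesses are `n` elements apart. After the interchange every access is to the next element, so each cache line and page is used fully before moving on.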
  • 65. Short demo
  • 66. How does PerfExpert do that? The big picture
      [Figure: PerfExpert workflow diagram, driven by a work flow script and a user interface.]
      - Components: Compiler, Analyzer (HPCToolkit), MACPO, Optimization Formulator (ROSE), Pattern Recognizer (Bison/Flex), Transformer (PIPS/ROSE), Integrator (ROSE), Support Database
      - Data exchanged: original source code; binary object; code bottlenecks and general performance metrics; data access performance metrics added to the previous output; code fragments to optimize and list of recommendations; code fragments to optimize and list of code transformers; optimized code fragments; optimized source code
      - Phases: Compilation Phase; Measurement and Analysis Phases; Diagnose and Recommendation Phases; Code Transformation Phase; Code Integration Phase
      - Legend: input/output data; developed by the authors; standard compiler
  • 71. How does PerfExpert do that? Work flow script
      - This is a shell script
      - Accepts parameters
      - Invokes all tools (including the compiler)
      - Backward compatible
  • 74. How does PerfExpert do that? Analyzer
      - This is the old PerfExpert, minus the “recommender”
      - Based on HPCToolkit
  • 77. How PerfExpert does that: MACPO
      - Enhances the set of metrics with data access performance metrics
      - Based on ROSE
  • 85. How PerfExpert does that: Optimization Formulator
      - Loads performance metrics into the Support Database
      - Runs all “recommendation selection functions”
      - Concatenates and ranks the list of recommendations
      - Extracts code fragments identified as bottlenecks
      - Based on ROSE
      - Extensible: accepts user-defined performance metrics
      - Extensible: new “recommendation selection functions” can be written as SQL queries
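The load / run / concatenate / rank steps listed for the Optimization Formulator can be sketched in Python with the standard `sqlite3` module. This is only an illustration under stated assumptions: the table names, column names, the two toy selection functions and the choice to sum scores when merging are all hypothetical and are not PerfExpert's actual schema or code.

```python
import sqlite3


def rank_recommendations(conn, run_id):
    """Run every stored selection function and merge their scored results."""
    totals = {}
    functions = conn.execute(
        "SELECT statement FROM selection_function ORDER BY name"
    ).fetchall()
    for (statement,) in functions:
        # Each function is an SQL query returning (recommendation id, score).
        for rec_id, score in conn.execute(statement, {"rid": run_id}):
            # Assumption: scores for the same recommendation are summed.
            totals[rec_id] = totals.get(rec_id, 0.0) + score
    # Rank: highest score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def build_demo_database():
    """Create an in-memory database with hypothetical tables and toy data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE metric (id INTEGER, l1d_hits REAL, l2d_hits REAL);
        CREATE TABLE recommendation (id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE selection_function (name TEXT, statement TEXT);
        """
    )
    conn.execute("INSERT INTO metric VALUES (1, 0.5, 2.0)")
    conn.executemany(
        "INSERT INTO recommendation VALUES (?, ?)",
        [(10, "d-l1"), (20, "d-l2")],
    )
    conn.executemany(
        "INSERT INTO selection_function VALUES (?, ?)",
        [
            (
                "by_l1d",
                "SELECT r.id, m.l1d_hits FROM recommendation AS r "
                "JOIN metric AS m "
                "WHERE r.category = 'd-l1' AND m.id = :rid",
            ),
            (
                "by_l2d",
                "SELECT r.id, m.l2d_hits FROM recommendation AS r "
                "JOIN metric AS m "
                "WHERE r.category = 'd-l2' AND m.id = :rid",
            ),
        ],
    )
    return conn


if __name__ == "__main__":
    for rec_id, score in rank_recommendations(build_demo_database(), run_id=1):
        print(rec_id, score)
```

Keeping the selection functions as rows in the database, rather than as code, mirrors the point that a new function can be added by writing one more SQL query.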
  • 89. How PerfExpert does that: Support Database
      - This is a SQLite database
      - Stores the list of “recommendation selection functions”, “pattern recognizers” and “code transformers”
      - Serves as the engine that runs the “recommendation selection functions”
  • 95. How PerfExpert does that: Pattern Recognizer
      - Acts as a “filter” that tries to match a source code fragment (identified as a bottleneck) with the right code transformer
      - Language-sensitive
      - Based on Bison and Flex
      - One recommendation may have multiple pattern recognizers
      - Extensible: new grammars can be written to recognize/match/filter code fragments (to work with new “transformers”)
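The “filter” role of the Pattern Recognizer can be illustrated with a small Python sketch. Note the substitution: the real recognizers are Bison/Flex grammars, whereas this sketch uses regular expressions as a much cruder stand-in. Every name below (recognizers, transformers, the recommendation) is hypothetical and chosen only for illustration.

```python
import re

# Hypothetical recognizers: a name mapped to (pattern, transformers it enables).
RECOGNIZERS = {
    "nested_for_loop": (
        re.compile(r"for\s*\([^)]*\)\s*\{?\s*for\s*\("),
        ["loop_interchange", "loop_tiling"],
    ),
    "while_loop": (
        re.compile(r"\bwhile\s*\("),
        ["loop_unrolling"],
    ),
}

# Hypothetical mapping: one recommendation may list several recognizers.
RECOMMENDATION_RECOGNIZERS = {
    "improve_data_locality": ["nested_for_loop", "while_loop"],
}


def candidate_transformers(recommendation, fragment):
    """Return the transformers whose recognizer matches the code fragment."""
    matches = []
    for name in RECOMMENDATION_RECOGNIZERS.get(recommendation, []):
        pattern, transformers = RECOGNIZERS[name]
        if pattern.search(fragment):
            matches.extend(transformers)
    return matches


fragment = """
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        c[i][j] += a[i][j] * b[j][i];
"""

if __name__ == "__main__":
    print(candidate_transformers("improve_data_locality", fragment))
```

A fragment that matches no recognizer yields an empty list, which corresponds to a recommendation that cannot be applied automatically to that fragment.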
  • 101. How PerfExpert does that: Transformer
      - Implements the recommendation by applying a source code transformation
      - May or may not be language-sensitive
      - Based on ROSE, PIPS or anything you want
      - One code pattern may lead to multiple code transformers
      - Extensible: code transformers can be written in any language you want
  • 104. How PerfExpert does that: Integrator
      - Generates new source code by integrating the transformed code fragments
      - Based on ROSE
  • 111. How PerfExpert does that: Key Points
      Why is this performance optimization “architecture” strong?
      - Each piece of the tool chain can be updated/upgraded individually
      - It is flexible: you can add new metrics as well as plug in new tools to measure application performance
      - It is extensible: new recommendations, transformations and strategies to select recommendations (we are counting on you!)
      - Multi-language, multi-architecture, open-source and built on top of well-established tools (HPCToolkit, ROSE, PIPS, etc.)
      - Easy to use and lightweight!
  • 118. Extending PerfExpert
      • Adding performance metrics
      • Optimization recommendations [entries in the SQL database]
      • “Recommendation selection functions”
      • Pattern recognizers
      • Code transformers
  • 121. Extending PerfExpert: Adding Performance Metrics
        code.section_info=Loop in function compute() at mm.c:8
        code.filename=mm.c
        code.line_number=8
        code.type=loop
        code.function_name=compute
        code.extra_info=3
        code.representativeness=99.8
        perfexpert.ratio.data_accesses=0.25
        perfexpert.instruction_accesses.L2i_hits=0.002
        perfexpert.branch_instructions.mispredicted=0.0
        perfexpert.floating-point_instr.fast_FP_instr=5.073
        perfexpert.data_accesses.L2d_hits=1.846
        ...
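A metrics listing like the one on this slide is a flat list of key=value pairs, so a new metric is just one more line. The sketch below shows one way such a listing could be read into a dictionary. It is an illustration only, not PerfExpert's own parser: the function name parse_metrics is hypothetical, and the key spellings with underscores are an assumption about the original slide.

```python
def parse_metrics(text):
    """Parse 'key=value' lines into a dict; numeric values become floats."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and elisions such as "..."
        key, value = line.split("=", 1)
        try:
            metrics[key] = float(value)
        except ValueError:
            metrics[key] = value  # keep non-numeric values as strings
    return metrics


# Sample lines taken from the slide (underscored key names are assumed).
sample = """\
code.filename=mm.c
code.line_number=8
code.type=loop
code.representativeness=99.8
perfexpert.data_accesses.L2d_hits=1.846
...
"""

metrics = parse_metrics(sample)
print(metrics["code.type"], metrics["perfexpert.data_accesses.L2d_hits"])
```

Splitting on the first "=" only keeps values that themselves contain "=" intact.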
  • 124. Extending PerfExpert: Recommendation Selection Functions
        SELECT r.id AS recommendation_id,
               SUM((CASE c.short
                        WHEN 'd-l1' THEN (m.data_accesses_L1d_hits - (max * 0.1))
                        ELSE 0
                    END) + ...) AS score
        FROM recommendation AS r
        JOIN metric AS m
        JOIN (SELECT MAX(m.data_accesses_L1d_hits,
                         m.data_accesses_L2d_hits, ...) AS max
              FROM metric AS m
              WHERE m.overall * 100 /
                    (0.5 * (100 - m.ratio_floating_point) + m.ratio_floating_point) > 1
                AND m.id = @RID)
        WHERE (r.loop <= @LPD AND m.code_type = 'loop')
           OR (r.loop IS NULL AND m.code_type = 'function')
          AND m.id = @RID
        GROUP BY r.id
        ORDER BY score DESC;
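The idea behind a selection function is that each recommendation gets a score computed from the measured metrics, and the highest-scoring recommendations are offered first. The sketch below runs a much smaller query of the same shape against an in-memory SQLite database. The two-table schema, the column names l1d_hits and l2d_hits, and the L1d value 0.5 are assumptions made for illustration; only the 'd-l1' category label, the 0.1 factor, and the L2d value 1.846 come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE recommendation (id INTEGER PRIMARY KEY, short TEXT);
CREATE TABLE metric (id INTEGER PRIMARY KEY, l1d_hits REAL, l2d_hits REAL);
INSERT INTO recommendation VALUES (1, 'd-l1'), (2, 'd-l2');
INSERT INTO metric VALUES (1, 0.5, 1.846);  -- 0.5 is a made-up value
""")

# Score = metric for the recommendation's category minus 10% of the largest
# metric, so recommendations tied to the dominant cost rank first.
query = """
SELECT r.id AS recommendation_id,
       CASE r.short
           WHEN 'd-l1' THEN m.l1d_hits - 0.1 * MAX(m.l1d_hits, m.l2d_hits)
           WHEN 'd-l2' THEN m.l2d_hits - 0.1 * MAX(m.l1d_hits, m.l2d_hits)
           ELSE 0
       END AS score
FROM recommendation AS r JOIN metric AS m
WHERE m.id = ?
ORDER BY score DESC;
"""

rows = conn.execute(query, (1,)).fetchall()
for recommendation_id, score in rows:
    print(recommendation_id, round(score, 4))
```

With these inputs the L2d-related recommendation sorts ahead of the L1d-related one, because L2d hits dominate the measured cost.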
  • 127. Extending PerfExpert: Pattern Recognizers
        nested_iteration_statement
            : WHILE '(' exp ')' WHILE '(' exp ')' stmnt
            | WHILE '(' exp ')' '{' WHILE '(' exp ')' stmnt '}'
            | DO DO stmnt WHILE '(' exp ')' ';' stmnt WHILE '(' exp ')' ';'
            | DO '{' DO stmnt WHILE '(' exp ')' ';' '}' WHILE '(' exp ')' ';'
            | FOR '(' exp_stmnt exp_stmnt ')' FOR '(' exp_stmnt exp_stmnt ')' stmnt
            | FOR '(' exp_stmnt exp_stmnt ')' '{' FOR '(' exp_stmnt exp_stmnt ')' stmnt '}'
            | FOR '(' exp_stmnt exp_stmnt exp ')' FOR '(' exp_stmnt exp_stmnt exp ')' stmnt
            | FOR '(' exp_stmnt exp_stmnt exp ')' '{' FOR '(' exp_stmnt exp_stmnt exp ')' stmnt '}'
            ;
  • 130. Extending PerfExpert: Code Transformers
        create c_loop2 ../source/mm.c
        activate INTERPROCEDURAL_SUMMARY_PRECONDITION
        activate TRANSFORMERS_INTER_FULL
        activate PRECONDITIONS_INTER_FULL
        setproperty SEMANTICS_FIX_POINT_OPERATOR "derivative"
        module compute
        apply LOOP_INTERCHANGE
        loop_8
        apply UNSPLIT[%PROGRAM]
        close
        quit
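The transformation script on this slide applies a loop interchange to the compute function of mm.c. The slides do not show which loops are swapped, so the sketch below only illustrates the general technique on a matrix multiply: swapping the two inner loops changes the memory access order but not the result. Both function names are hypothetical.

```python
def matmul_ijk(a, b, n):
    """Original loop order: the innermost loop walks down a column of b."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_ikj(a, b, n):
    """After interchanging the j and k loops: the innermost loop walks
    along rows of b and c, which gives unit-stride accesses for row-major
    arrays in a language such as C."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
assert matmul_ijk(a, b, 2) == matmul_ikj(a, b, 2)
```

The interchange is legal here because each c[i][j] still accumulates its terms in the same k order, so no dependence is violated.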
  • 131. Hands-on Tutorial
    Accessing Stampede (use the password that has been provided to you):
        ssh login@stampede.tacc.utexas.edu
    Request a compute node:
        ./reserve
    Now we are ready to go...
  • 132. Hands-on Tutorial
    Running the exercises:
        cd 1
        perfexpert
        perfexpert -s mm.c mm
        grep -R "running time" *
        more mm.c
        more perfexpert-temp-zUKfkx7/1/fragments/new/mm.c
        perfexpert mm
        perfexpert -r 5 mm
        cd ../2
        perfexpert -m -s backprop.c backprop
  • 135. Agenda
    What we saw in the morning:
      • Introduction and motivation
      • What can PerfExpert provide to you?
      • Demo
      • How does PerfExpert do that? (opening Pandora’s box)
      • Extending PerfExpert
      • Hands-on tutorial
      • Morning closure
    What we will see in the afternoon:
      • How to enhance application performance using memory access metrics (MACPO)
  • 136. Agenda: 1. Introduction; 2. PerfExpert; 3. MACPO; 4. GPU/Accelerators; 5. Closure
  • 137. Short Demo
  • 142. Performance Optimization by Mapping to Accelerators
      • Mapping code segments to accelerators is becoming one of the main methods for optimizing application performance
      • Problem: how do we select the parts of an application that will benefit from execution on an accelerator?
  • 153. Code Segments for SIMT/SIMD Execution
      • Optimize for multicore chip execution (PerfExpert; why?)
      • Identify time-consuming kernels in code (PerfExpert)
      • Eliminate kernels not easily mappable to SIMT/SIMD execution (how?)
      • Characterize the kernels suitable for SIMT/SIMD execution (what properties?)
      • Rank appropriate kernels using the characteristics identified in the previous step
      • Estimate the cost of data movement
      • Look for refactorings that allow data to stay on the accelerator
      • Generate compiler annotations for translating C/C++/Fortran to CUDA/OpenCL
      • Suggest kernels that need new algorithms
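The data-movement step can be made concrete with a back-of-the-envelope model: offloading pays off only if the time saved on the accelerator exceeds the time spent copying data to and from it. The sketch below is an assumed, simplified model (a single transfer at constant bandwidth, no overlap with computation), and every number in it is invented for illustration; the function offload_gain is hypothetical and not part of PerfExpert.

```python
def offload_gain(cpu_seconds, accel_seconds, bytes_moved, bandwidth_bytes_per_s):
    """Seconds saved by offloading a kernel; negative means offloading loses.

    Assumes one transfer of bytes_moved at a constant bandwidth, with no
    overlap between the transfer and the computation.
    """
    transfer_seconds = bytes_moved / bandwidth_bytes_per_s
    return cpu_seconds - (accel_seconds + transfer_seconds)


# Illustrative numbers only: a 2.0 s kernel that runs in 0.5 s on the
# accelerator and moves 4 GB over an 8 GB/s link.
with_copy = offload_gain(2.0, 0.5, 4e9, 8e9)

# Same kernel if a refactoring lets the data stay on the accelerator.
data_resident = offload_gain(2.0, 0.5, 0.0, 8e9)

print(with_copy, data_resident)
```

Comparing the two results shows why refactorings that leave data on the accelerator matter: in this example the copy costs as much as the accelerated kernel itself.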
  • 165. Code Segments for SIMT/SIMD Execution
    Unsuitable kernels:
      • Frequent TLB misses
      • High fraction of branches
      • Cache conflicts across cores
      • Irregular access strides for kernel data structures
    Characterizing “good” kernels:
      • Computational intensity
      • Pure “local” SPMD parallelism
      • Streaming parallelism or vectorization
      • Regular access strides for data structures
      • Data reuse factor and data transfer volume
      • “Limited” recursion
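The "unsuitable kernel" criteria above can be read as a filter applied before any ranking. The sketch below shows one way such a filter might look. The Kernel record, its field names, and both threshold values are assumptions chosen for illustration; the slides name the criteria but give no numeric cut-offs.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    tlb_miss_rate: float     # TLB misses per memory access (hypothetical unit)
    branch_fraction: float   # branches as a fraction of all instructions
    regular_strides: bool    # True if data structures are accessed regularly


def is_candidate(k, max_tlb_miss_rate=0.01, max_branch_fraction=0.2):
    """Reject kernels with frequent TLB misses, many branches, or
    irregular access strides. Thresholds are illustrative only."""
    return (k.tlb_miss_rate <= max_tlb_miss_rate
            and k.branch_fraction <= max_branch_fraction
            and k.regular_strides)


# Made-up kernels for demonstration.
kernels = [
    Kernel("stencil", 0.001, 0.05, True),
    Kernel("tree_walk", 0.05, 0.4, False),
]
candidates = [k.name for k in kernels if is_candidate(k)]
print(candidates)
```

Cache conflicts across cores are left out of this sketch because they need measurements from more than one core.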
  • 170. Code Segments for SIMT/SIMD Execution
    Ranking “good” kernels:
      • Curve-fit characteristics to speed-up measurements of kernels that have already been mapped
      • Sort by values of characteristics in some chosen order
      • Hold up your thumb?
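The first ranking option, fitting kernel characteristics to measured speed-ups, can be sketched with an ordinary least-squares line over a single characteristic. This is a deliberately minimal illustration: it uses one feature (computational intensity), a linear model, and invented measurements, none of which are prescribed by the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: computational intensity and measured speed-up of kernels
# that have already been mapped to an accelerator.
intensity = [1.0, 2.0, 4.0]
speedup = [2.0, 4.0, 8.0]
slope, intercept = fit_line(intensity, speedup)

# Rank unmapped candidate kernels by their predicted speed-up.
candidates = {"a": 3.0, "b": 0.5, "c": 1.5}
ranked = sorted(candidates,
                key=lambda name: slope * candidates[name] + intercept,
                reverse=True)
print(ranked)
```

A real fit would combine several characteristics and would need enough already-mapped kernels to make the coefficients meaningful.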
  • 172. Example
  • 174. Thank You Victor Apr
