Performance Optimization
with PerfExpert and MACPO
Jim Browne, Ashay Rane and Leo Fialho
ICS 2013
Victo
Ap
Introduction
Introduction PerfExpert MACPO GPU/Accelerators Closure
Agenda
1 Introduction
2 PerfExpert
3 MACPO
4 GPU/Accel...
In the morning:
09:00 Introduction and motivati...
In the afternoon:
01:30 What MACPO can provide ...
Overview: why PerfExpert?
Problem: HPC systems operate...
Goal for PerfExpert: democratize optimization!
Subgoal...
The four stages of automatic performance ...
Uniqueness of PerfExpert:
Nearly complete...
Unique properties of MACPO:
Multicore res...
What PerfExpert can provide to you?
Performance report...
List of Recommenda...
Short demo
How PerfExpert does that: The Big Picture
User Interfa...
How PerfExpert does that: Work Flow Script
User Interf...
How PerfExpert does that: Analyzer
User Interface!
ori...
How PerfExpert does that: MACPO
User Interface!
origin...
How PerfExpert does that: Optimization Formulator
User...
How PerfExpert does that: Support Database
User Interf...
How PerfExpert does that: Pattern Recognizer
User Inte...
How PerfExpert does that: Transformer
User Interface!
...
How PerfExpert does that: Integrator
User Interface!
o...
How PerfExpert does that: Key Points
Why is this perfo...
Extending PerfExpert
Adding performance metrics
Optimi...
Adding Performance Metrics
code.s...
Recommendation Selection Function...
Pattern Recognizers
nested iterat...
Code Transformers
create c loop2 ...
Hands on Tutorial
Accessing Stampede:
ssh login@stampe...
cd 1
perfexpert
...
What we saw in the morning:
Introduction and mo...
Short demo
Performance Optimization by Mapping to Accelerators
Ma...
Code Segments for SIMT/SIMD Execution
Optimize for mul...
Unsuitable Kerne...
Ranking “Good” K...
Example
Thank You
Victor
Apr
  1. Performance Optimization with PerfExpert and MACPO
     Jim Browne, Ashay Rane and Leo Fialho
     ICS 2013
     Victo
     Ap
  2. Agenda
     1 Introduction
     2 PerfExpert
     3 MACPO
     4 GPU/Accelerators
     5 Closure
  4. Agenda: in the morning
     09:00 Introduction and motivation [Jim]
     09:20 What PerfExpert can provide to you? [Leo]
     09:30 Demo [Leo]
     09:45 How PerfExpert does that? (opening Pandora’s box) [Leo]
     10:15 Extending PerfExpert [Leo]
     10:30 (Coffee?) break [everyone, including you]
     10:45 Hands on tutorial [all the team]
     11:45 Morning closure [all the team]
  6. Agenda: in the afternoon
     01:30 What MACPO can provide to you? [Ashay]
     02:00 Demo [Ashay, Jim]
     02:30 How MACPO does that? [Ashay]
     03:15 (Coffee?) break [everyone, including you]
     03:30 Hands on tutorial [Ashay]
     04:00 Selecting code segments to run on GPUs/accelerators [Jim]
     04:30 Enhancing PerfExpert with MACPO analysis [all the team]
     04:45 Afternoon closure and future work [all the team]
  14. Overview: why PerfExpert?
      Problem: HPC systems operate far below peak
      Chip/node architectural complexity is growing rapidly
      Performance optimization for these chips requires deep knowledge of architectures, code patterns, compilers, etc.
      Performance optimization tools:
        Powerful in the hands of experts
        Require detailed performance and system expertise
      HPC application developers are domain experts, not computer gurus
      Result: Many HPC programmers do not use these tools (seriously)
  25. Goal for PerfExpert: democratize optimization!
      Subgoals:
        Make use of the tool as simple as possible
        Start with only chip/node level optimization
        Make it adaptable across multiple architectures
        Design for extension to communication and I/O performance
      How to accomplish?
        Formulate the performance optimization task as a workflow of subtasks
        Leverage the state-of-the-art: Build on the best available tools for the subtasks to minimize the effort and cost of development
        Automate the entire workflow
  26. 26. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: 7 / 42
  27. 27. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) 7 / 42
  28. 28. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) 7 / 42
  29. 29. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) 7 / 42
  30. 30. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) 7 / 42
  31. 31. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) Use of State-of-the-Art: 7 / 42
  32. 32. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) Use of State-of-the-Art: HPCToolkit, MACPO based on ROSE (1) 7 / 42
  33. 33. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) Use of State-of-the-Art: HPCToolkit, MACPO based on ROSE (1) PerfExpert Team (2 and 3) 7 / 42
  34. 34. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Introduction The four stages of automatic performance optimization: Measurement and attribution (1) Analysis, diagnosis and identification of bottlenecks (2) Selection of effective optimizations (3) Implementation of optimizations (4) Use of State-of-the-Art: HPCToolkit, MACPO based on ROSE (1) PerfExpert Team (2 and 3) PerfExpert Team based on ROSE, PIPS, Bison and Flex (4) 7 / 42
  35. Introduction
      The four stages of automatic performance optimization:
        (1) Measurement and attribution
        (2) Analysis, diagnosis and identification of bottlenecks
        (3) Selection of effective optimizations
        (4) Implementation of optimizations
      Use of State-of-the-Art:
        (1) HPCToolkit; MACPO, based on ROSE
        (2 and 3) PerfExpert Team
        (4) PerfExpert Team, based on ROSE, PIPS, Bison and Flex
  41. Introduction
      Uniqueness of PerfExpert:
        - Nearly complete coverage of the first three stages of optimization at the chip/node level
        - The framework for implementing optimizations is complete, and several optimizations have been implemented
        - Integrates code-segment-focused and data-structure-based measurements (MACPO)
        - The workflow will apply to communication and I/O optimization as well
  50. Introduction
      Unique properties of MACPO:
        - Multicore-resolved traces
        - Code-segment-local measurement
        - Data-structure-specific traces
        - Order of magnitude lower measurement overhead
        - More accurate (associative) cache models
        - Strides by data structure and code segment
        - Architecture-“independent” metrics
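      To make "strides by data structure" concrete, here is a minimal sketch in C. It assumes a hypothetical trace record (`access_t`, holding a data-structure index and a byte address) and a hypothetical helper `count_strides`; neither is MACPO's actual trace format or API. For each data structure it counts how many consecutive accesses were exactly one element apart and how many were not.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trace record, for illustration only (not MACPO's format). */
typedef struct {
    int       var_id;   /* index of the data structure that was accessed */
    uintptr_t address;  /* byte address of the access */
} access_t;

#define MAX_VARS 8

/* For every data structure, count consecutive accesses that were exactly
   `unit` bytes apart (unit stride) and those at any other distance. */
void count_strides(const access_t *trace, size_t n, size_t unit,
                   size_t unit_stride[MAX_VARS], size_t other_stride[MAX_VARS])
{
    uintptr_t last[MAX_VARS] = {0};  /* previous address per data structure */
    int seen[MAX_VARS] = {0};        /* has this data structure been accessed yet? */

    for (int v = 0; v < MAX_VARS; v++) {
        unit_stride[v] = 0;
        other_stride[v] = 0;
    }
    for (size_t i = 0; i < n; i++) {
        int v = trace[i].var_id;
        if (v < 0 || v >= MAX_VARS)
            continue;                /* ignore ids outside the table */
        if (seen[v]) {
            uintptr_t a = trace[i].address, b = last[v];
            uintptr_t dist = a > b ? a - b : b - a;  /* absolute stride in bytes */
            if (dist == unit)
                unit_stride[v]++;
            else
                other_stride[v]++;
        }
        last[v] = trace[i].address;
        seen[v] = 1;
    }
}
```

      Keeping one "previous address" per data structure, rather than one for the whole trace, is what separates interleaved accesses to different arrays: a loop that alternates between two arrays can still show a clean unit stride for each of them.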
  60. What can PerfExpert provide to you?
      Performance report:
        - Identification of bottlenecks by relevance
        - Performance analysis based on performance metrics
        - Recommendations for optimization
      There are three possible outputs:
        - Performance report only
        - List of recommendations
        - Fully automated code transformation
  62. What can PerfExpert provide to you?
      Performance report:

      Loop in function compute() at mm.c:8 (99.8% of the total runtime)
      ===============================================================================
      ratio to total instrns         %  0.........25...........50.........75........100
         - floating point      :   100  ***********************************************
         - data accesses       :    25  ************
      * GFLOPS (% max)         :    12  ******
         - packed              :     0  *
         - scalar              :    12  ******
      -------------------------------------------------------------------------------
      performance assessment      LCPI  good......okay......fair......poor......bad....
      * overall                :   3.0  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
      upper bound estimates
      * data accesses          :   9.6  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
         - L1d hits            :   0.9  >>>>>>>>>>>>>>>>>
         - L2d hits            :   1.8  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
         - L2d misses          :   6.9  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
      * instruction accesses   :   0.1  >
         - L1i hits            :   0.0  >
         - L2i hits            :   0.0  >
         - L2i misses          :   0.1  >
      * data TLB               :   4.6  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
      * instruction TLB        :   0.0  >
      * branch instructions    :   0.1  >>
         - correctly predicted :   0.1  >>
         - mispredicted        :   0.0  >
      * floating-point instr   :   5.1  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
         - fast FP instr       :   5.1  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+
         - slow FP instr       :   0.0  >
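      In the report above, the sub-entries of a category add up to that category's upper bound: for data accesses, 0.9 + 1.8 + 6.9 = 9.6. The sketch below reproduces that arithmetic in C. The helper names are hypothetical, and `lcpi_contribution` assumes that each entry is modeled as event count times latency divided by instruction count; the slides do not state the formula, so treat that part as an assumption rather than PerfExpert's implementation.

```c
#include <stddef.h>

/* Assumed model (not stated on the slides): cycles attributed to one event
   class, normalized by the number of instructions executed in the segment. */
double lcpi_contribution(double events, double latency_cycles, double instructions)
{
    if (instructions <= 0.0)
        return 0.0;                  /* avoid dividing by zero on empty segments */
    return events * latency_cycles / instructions;
}

/* A category's upper bound is the sum of its per-level contributions,
   e.g. L1d hits + L2d hits + L2d misses for "data accesses". */
double lcpi_upper_bound(const double *parts, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += parts[i];
    return total;
}
```

      Calling `lcpi_upper_bound` on the three data-access values from the report (0.9, 1.8, 6.9) yields approximately 9.6. The overall LCPI in the same report is only 3.0, which matches the "upper bound estimates" label: each category assumes none of its latency is hidden behind other work.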
  64. What can PerfExpert provide to you?
      List of recommendations:

      #--------------------------------------------------
      # Recommendations for mm.c:8
      #--------------------------------------------------
      #
      # This is a possible recommendation for this code segment
      #
      Recommendation ID: 31
      Recommendation Description: change the order of loops
      Recommendation Reason: this optimization may improve the memory access
        pattern and make it more cache and TLB friendly
      Pattern Recognizers: c loop2 f loop2
      Code example:
        loop i { loop j {...} }
          =====>
        loop j { loop i {...} }
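      As an illustration of the "change the order of loops" recommendation, here is a minimal C sketch on a matrix multiplication. The slides do not show the contents of mm.c, so the array names, the size `N` and both functions are assumptions made for this example.

```c
#include <stddef.h>

#define N 64  /* illustrative matrix size */

/* Order (i, j, k): the innermost loop reads b[k][j] down a column, so
   consecutive iterations touch addresses N doubles apart. */
void mm_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Order (i, k, j): the j and k loops are interchanged, so the innermost loop
   walks both b[k][j] and c[i][j] along a row with unit stride. */
void mm_ikj(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++)
            for (size_t j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```

      Both functions accumulate the same products into each c[i][j] in the same order of k, so they produce the same result; only the memory access pattern differs. Whether the interchanged version is actually faster depends on the matrix size, the cache hierarchy and what the compiler already does, so it should be measured rather than assumed.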
  65. Short demo
  66. How PerfExpert does that: The Big Picture
      [Figure: PerfExpert workflow diagram, driven by the Work Flow Script from the User Interface]
      Components and their labeled outputs:
        - Compiler: original source code in, binary object out
        - Analyzer (HPCToolkit): code bottlenecks and general performance metrics
        - MACPO: adds data access performance metrics to the previous output
        - Optimization Formulator (ROSE): code fragments to optimize and list of recommendations
        - Pattern Recognizer (Bison/Flex): code fragments to optimize and list of code transformers
        - Transformer (PIPS/ROSE): optimized code fragments
        - Integrator (ROSE): optimized source code
        - Support Database
      Phases: Compilation; Measurement and Analysis; Diagnose and Recommendation; Code Transformation; Code Integration
      Legend: input/output data; developed by the authors; standard compiler
  71. How PerfExpert does that: Work Flow Script
        - This is a shell script
        - Accepts parameters
        - Invokes all tools (including the compiler)
        - Backward compatible
  74. How PerfExpert does that: Analyzer
        - This is the old PerfExpert, minus the “recommender”
        - Based on HPCToolkit
  77. How PerfExpert does that: MACPO
        - Enhances the set of metrics with data access performance metrics
        - Based on ROSE
  85. How PerfExpert does that: Optimization Formulator
        - Loads performance metrics into the Support Database
        - Runs all “recommendation selection functions”
        - Concatenates and ranks the list of recommendations
        - Extracts code fragments identified as bottlenecks
        - Based on ROSE
        - Extensible: accepts user-defined performance metrics
        - Extensible: new “recommendation selection functions” can be written as SQL queries
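      The "concatenates and ranks" step can be pictured with the following C sketch. Everything in it is hypothetical: the `recommendation_t` record, the numeric `score` field and the `merge_and_rank` helper are illustration only, since the slides describe neither the ranking criterion nor the real data layout (the selection functions themselves are SQL queries run against the Support Database).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: a recommendation id plus a score assigned by the
   selection function that produced it (higher means more relevant). */
typedef struct {
    int    id;
    double score;
} recommendation_t;

/* qsort comparator: descending by score, ascending by id on ties so that
   the resulting order is deterministic. */
static int by_score_desc(const void *pa, const void *pb)
{
    const recommendation_t *a = pa;
    const recommendation_t *b = pb;
    if (a->score < b->score) return 1;
    if (a->score > b->score) return -1;
    return (a->id > b->id) - (a->id < b->id);
}

/* Concatenate the outputs of two selection functions into `out`, which must
   have room for nx + ny entries, then rank them. Returns the entry count. */
size_t merge_and_rank(const recommendation_t *x, size_t nx,
                      const recommendation_t *y, size_t ny,
                      recommendation_t *out)
{
    if (nx > 0) memcpy(out, x, nx * sizeof *x);
    if (ny > 0) memcpy(out + nx, y, ny * sizeof *y);
    qsort(out, nx + ny, sizeof *out, by_score_desc);
    return nx + ny;
}
```

      A real implementation would also need to decide what to do when two selection functions return the same recommendation; this sketch simply keeps both entries.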
  89. How PerfExpert does that: Support Database
        - This is a SQLite database
        - Stores the list of “recommendation selection functions”, “pattern recognizers” and “code transformers”
        - Serves as the engine that runs the “recommendation selection functions”
  95. How PerfExpert does that: Pattern Recognizer
      [Figure: PerfExpert workflow diagram, with the Pattern Recognizer (Bison/Flex) highlighted.]
      - Acts as a “filter” that tries to match the right code transformer to a source code fragment identified as a bottleneck
      - Language sensitive
      - Based on Bison and Flex
      - One recommendation may have multiple pattern recognizers
      - Extendable: new grammars can be written to recognize, match and filter code fragments (to work with new “transformers”)
  101. How PerfExpert does that: Transformer
      [Figure: PerfExpert workflow diagram, with the Transformer (PIPS/ROSE) highlighted.]
      - Implements the recommendation by applying a source code transformation
      - May or may not be language sensitive
      - Based on ROSE, PIPS or anything you want
      - One code pattern may lead to multiple code transformers
      - Extendable: code transformers can be written in any language you want
  104. How PerfExpert does that: Integrator
      [Figure: PerfExpert workflow diagram, with the Integrator (ROSE) highlighted.]
      - Generates new source code by integrating the transformed code fragments
      - Based on ROSE
  111. How PerfExpert does that: Key Points
      Why is this performance optimization “architecture” strong?
      - Each piece of the tool chain can be updated/upgraded individually
      - It is flexible: you can add new metrics as well as plug in new tools to measure application performance
      - It is extendable: new recommendations, transformations and strategies to select recommendations (we are counting on you!)
      - Multi-language, multi-architecture, open-source and built on top of well-established tools (HPCToolkit, ROSE, PIPS, etc.)
      - Easy to use and lightweight!
  118. Extending PerfExpert
      - Adding performance metrics
      - Optimization recommendations [entries in the SQL database]
      - “Recommendation selection functions”
      - Pattern recognizers
      - Code transformers
  121. Extending PerfExpert: Adding Performance Metrics
      code.section_info=Loop in function compute() at mm.c:8
      code.filename=mm.c
      code.line_number=8
      code.type=loop
      code.function_name=compute
      code.extra_info=3
      code.representativeness=99.8
      perfexpert.ratio.data_accesses=0.25
      perfexpert.instruction_accesses.L2i_hits=0.002
      perfexpert.branch_instructions.mispredicted=0.0
      perfexpert.floating-point_instr.fast_FP_instr=5.073
      perfexpert.data_accesses.L2d_hits=1.846
      ...
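The metrics listing above is a flat list of `key=value` pairs, so a new metric is just one more line. The sketch below shows one way a downstream tool could read such a listing. The helper `parse_metrics` is hypothetical, and the underscored key spelling (for example `code.line_number`) is an assumption about the real file format.

```python
def parse_metrics(text):
    """Parse 'key=value' lines into a dict, converting numeric values to float."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blanks and anything that is not a key=value pair
        key, value = line.split("=", 1)
        try:
            metrics[key] = float(value)
        except ValueError:
            metrics[key] = value  # keep non-numeric values as strings
    return metrics


# Sample taken from the slide; the underscored key names are an assumption.
SAMPLE = """\
code.section_info=Loop in function compute() at mm.c:8
code.filename=mm.c
code.line_number=8
code.type=loop
code.function_name=compute
code.representativeness=99.8
perfexpert.ratio.data_accesses=0.25
perfexpert.data_accesses.L2d_hits=1.846
"""

metrics = parse_metrics(SAMPLE)
print(metrics["code.type"], metrics["perfexpert.data_accesses.L2d_hits"])
```

Because unknown keys are simply stored, a parser of this shape does not need to change when a new metric is added to the output.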
  124. Extending PerfExpert: Recommendation Selection Functions
      SELECT r.id AS recommendation_id,
             SUM(
                 (CASE c.short WHEN 'd-l1' THEN
                     (m.data_accesses_L1d_hits - (max * 0.1)) ELSE 0 END) +
                 ...
             ) AS score
      FROM recommendation AS r
      JOIN metric AS m
      JOIN (SELECT MAX(
                m.data_accesses_L1d_hits,
                m.data_accesses_L2d_hits,
                ...
            ) AS max
            FROM metric AS m
            WHERE m.overall * 100 / (0.5 * (100 - m.ratio_floating_point) + m.ratio_floating_point) > 1
              AND m.id = @RID)
      WHERE (r.loop <= @LPD AND m.code_type = 'loop')
         OR (r.loop IS NULL AND m.code_type = 'function')
        AND m.id = @RID
      GROUP BY r.id
      ORDER BY score DESC;
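The query on this slide scores each recommendation by how far the matching metric exceeds 10% of the largest metric value. The following is a much-reduced sketch of that idea that can be run with Python's `sqlite3` module. The two-table schema, the column names (`l1d_hits`, `l2d_hits`, `category`, `loop_depth`) and the sample rows are assumptions made for the example, not PerfExpert's real tables.

```python
import sqlite3

# Hypothetical, simplified tables; the real PerfExpert schema is richer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metric (id INTEGER, code_type TEXT, l1d_hits REAL, l2d_hits REAL);
CREATE TABLE recommendation (id INTEGER, category TEXT, loop_depth INTEGER);
INSERT INTO metric VALUES (1, 'loop', 0.5, 1.846);
INSERT INTO recommendation VALUES (10, 'd-l1', 1), (11, 'd-l2', 2), (12, 'd-l2', 3);
""")

# Score = metric for the recommendation's category minus 10% of the largest
# metric, mirroring the "(value - (max * 0.1))" term on the slide.
QUERY = """
SELECT r.id AS recommendation_id,
       SUM(CASE r.category
               WHEN 'd-l1' THEN m.l1d_hits - 0.1 * mx.max_hits
               WHEN 'd-l2' THEN m.l2d_hits - 0.1 * mx.max_hits
               ELSE 0 END) AS score
FROM recommendation AS r
JOIN metric AS m
JOIN (SELECT MAX(l1d_hits, l2d_hits) AS max_hits
      FROM metric WHERE id = :rid) AS mx
WHERE m.id = :rid AND m.code_type = 'loop' AND r.loop_depth <= :depth
GROUP BY r.id
ORDER BY score DESC
"""

rows = conn.execute(QUERY, {"rid": 1, "depth": 2}).fetchall()
for recommendation_id, score in rows:
    print(recommendation_id, round(score, 4))
```

Since the selection logic is plain SQL stored in the database, a new selection strategy can be added as a new query rather than as new tool code.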
  127. Extending PerfExpert: Pattern Recognizers
      nested_iteration_statement
          : WHILE '(' exp ')' WHILE '(' exp ')' stmnt
          | WHILE '(' exp ')' '{' WHILE '(' exp ')' stmnt '}'
          | DO DO stmnt WHILE '(' exp ')' ';' stmnt WHILE '(' exp ')' ';'
          | DO '{' DO stmnt WHILE '(' exp ')' ';' '}' WHILE '(' exp ')' ';'
          | FOR '(' exp stmnt exp stmnt ')' FOR '(' exp stmnt exp stmnt ')' stmnt
          | FOR '(' exp stmnt exp stmnt ')' '{' FOR '(' exp stmnt exp stmnt ')' stmnt '}'
          | FOR '(' exp stmnt exp stmnt exp ')' FOR '(' exp stmnt exp stmnt exp ')' stmnt
          | FOR '(' exp stmnt exp stmnt exp ')' '{' FOR '(' exp stmnt exp stmnt exp ')' stmnt '}'
          ;
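The grammar rule above accepts a loop whose body starts directly with another loop of the same kind. To make that filtering idea concrete without Bison and Flex, here is a small hand-written sketch in Python. It is not the PerfExpert recognizer: it is more permissive (it accepts mixed `for`/`while` nests), it ignores `do` loops, and it does not handle comments or string literals. All function names are hypothetical.

```python
import re

# Identifiers, integers, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(src):
    """Split C-like source into a flat list of tokens."""
    return TOKEN.findall(src)


def skip_parens(tokens, i):
    """Given tokens[i] == '(', return the index just past the matching ')'."""
    depth = 0
    while i < len(tokens):
        if tokens[i] == "(":
            depth += 1
        elif tokens[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced parentheses")


def is_nested_loop(src):
    """True if some for/while body starts directly with another for/while."""
    tokens = tokenize(src)
    for i, tok in enumerate(tokens):
        if tok in ("for", "while") and i + 1 < len(tokens) and tokens[i + 1] == "(":
            j = skip_parens(tokens, i + 1)
            if j < len(tokens) and tokens[j] == "{":
                j += 1  # optional opening brace before the inner loop
            if (j + 1 < len(tokens) and tokens[j] in ("for", "while")
                    and tokens[j + 1] == "("):
                return True
    return False


print(is_nested_loop("for (i = 0; i < n; i++) { for (j = 0; j < n; j++) c[i][j] = 0; }"))
```

A fragment that passes such a filter would then be handed to the transformers registered for that pattern; one that fails is left alone.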
  130. Extending PerfExpert: Code Transformers
      create c_loop2 ../source/mm.c
      activate INTERPROCEDURAL_SUMMARY_PRECONDITION
      activate TRANSFORMERS_INTER_FULL
      activate PRECONDITIONS_INTER_FULL
      setproperty SEMANTICS_FIX_POINT_OPERATOR "derivative"
      module compute
      apply LOOP_INTERCHANGE
      loop_8
      apply UNSPLIT[%PROGRAM]
      close
      quit
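The script above asks PIPS to apply a loop interchange to a loop in `compute()`. The slides do not show the body of that loop, so the sketch below assumes a textbook matrix multiplication and shows only what the transformation means: swapping two loop levels while keeping the result unchanged. Both function names are hypothetical.

```python
def matmul_ijk(a, b, n):
    """Original loop order: i, j, k."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_ikj(a, b, n):
    """Same computation with the j and k loops interchanged."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):  # innermost loop now walks along a row of b and c
                c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_ijk(a, b, 2) == matmul_ikj(a, b, 2))
```

In a row-major language such as C, the interchanged order makes the innermost loop access `b` and `c` with unit stride, which is the usual motivation for this transformation. The Python version illustrates only that the two orders compute the same values, not the speed-up.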
  131. Hands on Tutorial
      Accessing Stampede:
          ssh login@stampede.tacc.utexas.edu
          (use the password that has been provided to you)
      Request a Compute Node:
          ./reserve
      Now we are ready to go...
  132. Hands on Tutorial
      Accessing Stampede:
          cd 1
          perfexpert
          perfexpert -s mm.c mm
          grep -R "running time" *
          more mm.c
          more perfexpert-temp-zUKfkx7/1/fragments/new/mm.c
          perfexpert mm
          perfexpert -r 5 mm
          cd ../2
          perfexpert -m -s backprop.c backprop
  135. Agenda
      What we saw in the morning:
      - Introduction and motivation
      - What can PerfExpert provide to you?
      - Demo
      - How does PerfExpert do that? (opening Pandora’s box)
      - Extending PerfExpert
      - Hands on tutorial
      - Morning closure
      What we will see in the afternoon:
      - How to enhance application performance using memory access metrics (MACPO)
  136. Agenda: 1 Introduction, 2 PerfExpert, 3 MACPO, 4 GPU/Accelerators, 5 Closure
  137. Short Demo
  138. Agenda: 1 Introduction, 2 PerfExpert, 3 MACPO, 4 GPU/Accelerators, 5 Closure
  142. Performance Optimization by Mapping to Accelerators
      - Mapping code segments to accelerators is becoming one of the main methods for optimizing the performance of an application
      - Problem: how to select the parts of an application that will benefit from execution on an accelerator?
  153. Code Segments for SIMT/SIMD Execution
      - Optimize for multicore chip execution: PerfExpert (why?)
      - Identify time-consuming kernels in code: PerfExpert
      - Eliminate kernels not easily mappable to SIMT/SIMD execution (how?)
      - Characterize the kernels suitable for SIMT/SIMD execution (what properties?)
      - Rank appropriate kernels using the characteristics identified in the previous step
      - Estimate the cost of data movement
      - Look for refactorings that allow data to stay on the accelerator
      - Generate compiler annotations for translating C/C++/Fortran to CUDA/OpenCL
      - Suggest kernels that need new algorithms
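The "estimate the cost of data movement" step can be pictured with a first-order model: offloading pays off only if the accelerator time plus the transfer time is below the CPU time. The sketch below is such a model, with a hypothetical function name and made-up numbers; it is not a method taken from the slides and it ignores overlap between transfers and computation.

```python
def offload_estimate(cpu_seconds, accel_seconds, bytes_moved,
                     bandwidth_bytes_per_s, latency_s=0.0):
    """Return (pays_off, transfer_seconds) under a simple additive cost model."""
    transfer_seconds = latency_s + bytes_moved / bandwidth_bytes_per_s
    pays_off = accel_seconds + transfer_seconds < cpu_seconds
    return pays_off, transfer_seconds


# Illustrative numbers only: 8 GB moved over an 8 GB/s link takes 1 s.
print(offload_estimate(2.0, 0.2, 8e9, 8e9))
# Doubling the data volume erases the benefit in this model.
print(offload_estimate(2.0, 0.2, 16e9, 8e9))
```

The same model shows why refactorings that leave data on the accelerator matter: they reduce `bytes_moved` without touching the kernel itself.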
  154. 154. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution 38 / 42
  155. 155. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels 38 / 42
  156. 156. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses 38 / 42
  157. 157. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches 38 / 42
  158. 158. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores 38 / 42
  159. 159. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels 38 / 42
  160. 160. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity 38 / 42
  161. 161. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity Pure “local” SPMD parallelism 38 / 42
  162. 162. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity Pure “local” SPMD parallelism Streaming parallelism or vectorization 38 / 42
  163. 163. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity Pure “local” SPMD parallelism Streaming parallelism or vectorization Regular access strides for data structures 38 / 42
  164. 164. Introduction Introduction PerfExpert MACPO GPU/Accelerators Closure Code Segments for SIMT/SIMD Execution Unsuitable Kernels Frequent TLB misses High fraction of branches Cache conflicts across cores Irregular access strides for kernel data structures Characterizing “Good” Kernels Computational intensity Pure “local” SPMD parallelism Streaming parallelism or vectorization Regular access strides for data structures Data reuse factor and data transfer volume 38 / 42
  165. Code Segments for SIMT/SIMD Execution
       Unsuitable Kernels:
         - Frequent TLB misses
         - High fraction of branches
         - Cache conflicts across cores
         - Irregular access strides for kernel data structures
       Characterizing “Good” Kernels:
         - Computational intensity
         - Pure “local” SPMD parallelism
         - Streaming parallelism or vectorization
         - Regular access strides for data structures
         - Data reuse factor and data transfer volume
         - “Limited” recursion
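Two of the characteristics listed above, computational intensity and regularity of access strides, can be made concrete with a small sketch. This is an illustration only, not how PerfExpert or MACPO compute these values: the function names, the flop and byte counts, and the address traces are all hypothetical.

```python
from collections import Counter


def computational_intensity(flops, bytes_moved):
    """Floating-point operations per byte of memory traffic (hypothetical helper)."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def dominant_stride_fraction(addresses):
    """Fraction of consecutive accesses that share the most common stride.

    A value near 1.0 suggests a regular access pattern; a value near 0
    suggests an irregular one. Assumed metric, for illustration only.
    """
    if len(addresses) < 2:
        return 1.0
    strides = [b - a for a, b in zip(addresses, addresses[1:])]
    _, count = Counter(strides).most_common(1)[0]
    return count / len(strides)


# Hypothetical address traces: a unit-stride walk over 8-byte elements,
# and a scattered pattern such as an indirect (gather) access might produce.
regular_trace = [8 * i for i in range(16)]
irregular_trace = [0, 64, 8, 512, 16, 40, 1024, 24]

regular_score = dominant_stride_fraction(regular_trace)
irregular_score = dominant_stride_fraction(irregular_trace)
intensity = computational_intensity(flops=200, bytes_moved=100)
```

Under these assumptions, a kernel with a high stride score and a high intensity would sit on the “good” side of the lists above, and one with a low stride score on the “unsuitable” side.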
  170. Code Segments for SIMT/SIMD Execution
       Ranking “Good” Kernels:
         - Curve fit characteristics to speed-up measurements of kernels that have already been mapped
         - Sort by values of characteristics in some chosen order
         - Hold up your thumb?
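The first two ranking options above can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenters' implementation: it fits a straight line to a single characteristic (computational intensity), whereas the slide does not say which curve or which characteristics are used, and every kernel name and number below is invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical measurements from kernels that were already mapped to an
# accelerator: (computational intensity, observed speed-up).
mapped_intensity = [1.0, 2.0, 3.0, 4.0]
mapped_speedup = [1.5, 2.5, 3.5, 4.5]
slope, intercept = fit_line(mapped_intensity, mapped_speedup)

# Hypothetical candidate kernels that have not been mapped yet.
candidates = [
    {"name": "kernel_a", "intensity": 2.0, "stride_regularity": 0.9},
    {"name": "kernel_b", "intensity": 6.0, "stride_regularity": 0.4},
    {"name": "kernel_c", "intensity": 0.5, "stride_regularity": 1.0},
    {"name": "kernel_d", "intensity": 2.0, "stride_regularity": 1.0},
]

# Option 1: rank by the speed-up predicted from the fitted line.
by_prediction = sorted(
    candidates,
    key=lambda k: slope * k["intensity"] + intercept,
    reverse=True,
)
ranked_by_fit = [k["name"] for k in by_prediction]

# Option 2: sort by characteristic values in a chosen order
# (here intensity first, then stride regularity, both descending).
by_order = sorted(
    candidates,
    key=lambda k: (-k["intensity"], -k["stride_regularity"]),
)
ranked_by_order = [k["name"] for k in by_order]
```

With this made-up data the two options disagree only on kernels with equal intensity: the fitted line cannot separate `kernel_a` from `kernel_d`, while the chosen sort order breaks the tie on stride regularity.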
  171-172. Example
  173. Agenda
         1 Introduction
         2 PerfExpert
         3 MACPO
         4 GPU/Accelerators
         5 Closure
  174. Thank You Victor Apr
