SlideShare a Scribd company logo
1 of 52
Download to read offline
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
How to use TAU for Performance Analysis
George S. Markomanolis
7 August 2019
22 Open slide master to edit
Outline
• Introduction to TAU
• How to compile
• Explaining functionalities of TAU/ParaProf
• Presenting basic steps of PerfExplorer
33 Open slide master to edit
TAU
• Tuning and Analysis Utilities, developed at University of Oregon
• Scalable and flexible performance analysis toolkit
• Automatic instrumentation through Program Database Toolkit (PDT)
for routines, loops, I/O, memory, phases, etc.
• Installed version on Summit: v2.28.1
• Module: tau
• Web site: https://www.cs.uoregon.edu/research/tau/home.php
• Email: tau-bugs@cs.uoregon.edu
44 Open slide master to edit
Capability Matrix - TAU
Capability Profiling Tracing Notes/Limitations
MPI, MPI-IO Yes Yes
OpenMP CPU Yes Yes
OpenMP GPU Yes Yes Some restrictions apply regarding the
CUPTI metrics
OpenACC Yes Yes Some functionalities are not ready for
production, no metrics available
CUDA Yes Yes Some functionalities are not ready for
production
POSIX I/O Yes Yes
POSIX threads Yes Yes
Memory – app-level Yes Yes
Memory – func-level Yes Yes
Hotspot Detection Yes Yes
Variance Detection Yes Yes
Hardware Counters Yes Yes
55 Open slide master to edit
Compilation
• There are mainly three approaches to use an application with TAU
– Use TAU Wrappers
• For C: replace the compiler with tau_cc.sh
• For C++: replace the compiler with tau_cxx.sh
• For Fortran: replace the compiler with tau_f90.sh/tau_f77.sh
– Dynamic instrumentation, for example:
• jsrun -n 4 –r 4 –a 1 –c1 tau_exec -T mpi ./test
– Rewrite the binary (support for x86_64):
• tau_rewrite –T papi,pdf a.out –o a.inst
66 Open slide master to edit
Compilation (cont.)
Interposition: tau_exec
Compiler:
tau_cc.sh –tau_options=-optCompInst
Set the TAU_MAKEFILE
Source:
tau_cc.sh
The TAU_MAKEFILE should include the PDT
77 Open slide master to edit
tau_exec
tau_exec –help
Options:
-v Verbose mode
-s Show what will be done but don't actually do anything (dryrun)
-io Track I/O
-memory Track memory allocation/deallocation
-memory_debug Enable memory debugger
-cuda Track GPU events via CUDA
-cupti Track GPU events via CUPTI (Also see env. variable TAU_CUPTI_API)
-opencl Track GPU events via OpenCL
-openacc Track GPU events via OpenACC (currently PGI only)
-rocm Track ROCm events via rocprofiler
-ompt Track OpenMP events via OMPT interface
-ebs Enable event-based sampling
-ebs_period=<count> Sampling period (default 1000)
-ebs_source=<counter> Counter (default itimer)
-ebs_resolution=<file|function|line> Choose sampling granularity.
-um Enable Unified Memory events via CUPTI
-sass=<level> Track GPU events via CUDA with Source Code Locator activity (kernel level or source level)
-csv Outputs sass profile in CSV
-env Track GPU environment activity (power utilization, SM, memory frequency, temperature)
-T <CUPTI,DISABLE,GNU,GNU_MEM,MPI,OPENMP,PAPI,PDT,PGI,PGI_MEM,PROFILE,SERIAL> : Specify TAU tags
88 Open slide master to edit
TAU Environment Variables
99 Open slide master to edit
TAU Compile-Time Environment Variables
For using free format in .f files, use:
% export TAU_OPTIONS=`-optPdtF95Opts=``-R free’’’
1010 Open slide master to edit
How TAU works?
• Instrumentation:
– Adds probes to perform measurements
– Source code instrumentation
– Wrapping external libraries (I/O, CUDA, OpenACC, OpenCL)
– Rewriting the binary executable
• Measurement:
– Profiling or Tracing
– Direct instrumentation
– Sampling
– Throttling
• Analysis:
– Visualization of profiles and traces
– 3D visualization
– Trace conversion tools
1111 Open slide master to edit
TAU Instrumentation/Measurement
1212 Open slide master to edit
Tau_exec
Usage: tau_exec [options] [--] <exe> <exe options>
Options:
-v Verbose mode
-vv Very Verbose mode (enables TAU_VERBOSE=1)
-s Show what will be done but don't actually do anything (dryrun)
-io Track I/O
-memory Track memory allocation/deallocation
-memory_debug Enable memory debugger
-cuda Track GPU events via CUDA
-cupti Track GPU events via CUPTI (Also see env. variable TAU_CUPTI_API)
-opencl Track GPU events via OpenCL
-openacc Track GPU events via OpenACC (currently PGI only)
-rocm Track ROCm events via rocprofiler
-ompt Track OpenMP events via OMPT interface
-power Track power events via PAPI's perf RAPL interface|
-numa Track remote DRAM, total DRAM events (needs papi with recent perf support for x86_64)
-ebs Enable event-based sampling
-ebs_period=<count> Sampling period (default 1000)
-um Enable Unified Memory events via CUPTI
-sass=<level> Track GPU events via CUDA with Source Code Locator activity (kernel level or source level)
-csv Outputs sass profile in CSV
1313 Open slide master to edit
MiniWeather MPI compilation
• module load pgi
• module load tau
• export
TAU_MAKEFILE/sw/summit/tau/2.28.1_patched/ibm64linux/lib/Makef
ile.tau-pgi-papi-mpi-pdt-pgi
• Replace mpicxx with tau_cxx.sh in the Makefile
• export TAU_OPTIONS='-optLinking=-lpnetcdf -optVerbose'
• make mpi
1414 Open slide master to edit
MiniWeather MPI – Execution - Profiling
export TAU_METRICS=TIME:PAPI_TOT_INS:PAPI_TOT_CYC:PAPI_FP_OPS
#export TAU_CALLPATH=1
#export TAU_CALLPATH_DEPTH=10
export TAU_PROFILE=1
export TAU_TRACK_MESSAGE=1
export TAU_COMM_MATRIX=1
jsrun -n 64 -r 8 -a 1 -c 1 ./miniWeather_mpi
Or if compiled with mpicxx
jsrun -n 64 -r 8 -a 1 -c 1 tau_exec ./miniWeather_mpi
1515 Open slide master to edit
MiniWeather MPI - Execution
• When the execution finished, there is one folder for each TAU_METRICS
declaration with the format MULTI__
• If there is no TAU_METRICS declared, then by default is used the metric
TIME and the profiling files are not in a folder, in this case you need to
pack them and execute paraprof:
summit> paraprof –pack name.ppk
summit> paraprof name.ppk
• To visualize the results execute paraprof (check also pprof for text mode)
1616 Open slide master to edit
MiniWeather MPI - Paraprof
• The default metric is TIME
• Each color is a different call
• Each horizontal line is a process or Std.Dev./mean/max/min
1717 Open slide master to edit
Exploring Paraprof
• Options -> Uncheck Stack Bars Together
• It is easier to check the load imbalance
• We will call this window as the main one
1818 Open slide master to edit
Exploring Paraprof
• Click on any color, values per process, name of routine with callpath (if activated), units in seconds, value exclusive, max, min,
mean, std, values.
1919 Open slide master to edit
Exploring Paraprof
• Scroll down
2020 Open slide master to edit
Exploring Paraprof
• Click on any label on the left (node 0, mean, etc.). You can see immediately which calls take more time
2121 Open slide master to edit
Paraprof – Thread Statistics Text Window
• Right click on any label of the main window, select “Show Thread Statistics Text Window”
2222 Open slide master to edit
Paraprof – Thread Statistics Table
• Right click on any label of the main window, select “Show Thread Statistics Table”
2323 Open slide master to edit
Paraprof – User Bar Chart
• Right click on any label of the main window, select “Show User Bar Chart”
2424 Open slide master to edit
Paraprof – User Event
• Options -> Select Value Type -> Max. Value
2525 Open slide master to edit
Paraprof – User Event Statistics Window
• Right click on any label of the main window, select “Show User Event Statistics Window”
2626 Open slide master to edit
Paraprof – Context Event Window
• Right click on any label of the main window, select “Show Context Event Window” (with callpath)
2727 Open slide master to edit
Paraprof – Add Thread to Comparison Window
• Right click on node 0 and select “Add Thread to Comparison Window”, similar for node 12. You could use any number of processes that you prefer.
2828 Open slide master to edit
Derived Metrics
Options -> Show Derived Metric Panel, select the metrics and then operator and then
click Apply. Then uncheck the Show Derived Metric
2929 Open slide master to edit
Paraprof - IPC
• Click on the new metric, PAPI_TOT_INS/PAPI_TOT_CYC
3030 Open slide master to edit
Paraprof – Mean IPC
• Click on the label mean
3131 Open slide master to edit
Paraprof – IPC for thread 0
• From the main window with the PAPI_TOT_INS/PAPI_TOT_CYC metric, right click on node 0 and select Show Thread Statistics
Table
3232 Open slide master to edit
Paraprof
• From the main window select Options -> Select Metric… -> Exclusive -> PAPI_FP_OPS
3333 Open slide master to edit
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Exclusive Time and Exclusive Floating operations
3434 Open slide master to edit
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Specific routine and thread
3535 Open slide master to edit
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Exclusive time and total instructions
3636 Open slide master to edit
Paraprof – 3D Visualization
Menu Windows -> 3D Visualization (3D demands OpenGL)
3737 Open slide master to edit
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Exclusive time and instructions per cycle
3838 Open slide master to edit
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Bar Plot
3939 Open slide master to edit
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Scatter Plot
4040 Open slide master to edit
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Topology Plot
4141 Open slide master to edit
Paraprof – 3D Communication Matrix
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Max message size vs Number of calls
4242 Open slide master to edit
Paraprof
Menu Windows -> Communication Matrix
4343 Open slide master to edit
Which loops require the most time?
• File select.tau:
BEGIN_INSTRUMENT_SECTION
loops routine=“#”
END_INSTRUMENT_SECTION
• Declare TAU options:
export TAU_OPTIONS=“-optTauSelectFile=select.tau -optLinking=-lpnetcdf -
optVerbose”
• Do not forget to unset TAU_OPTIONS when not required
• Execute as before
4444 Open slide master to edit
Paraprof - Loops
4545 Open slide master to edit
Paraprof - Loops
Select Options -> Select Metric… -> Exclusive… -> PAPI_TOT_INS
4646 Open slide master to edit
Paraprof - Loops
Select Options -> Select Metric… -> Exclusive… ->
PAPI_TOT_INS/PAPI_TOT_CYC
4747 Open slide master to edit
Paraprof - Loops
Select Options -> Select Metric… -> Exclusive… -> PAPI_FP_OPS
4848 Open slide master to edit
Paraprof
From the main window select a node
Click on node 0
4949 Open slide master to edit
Paraprof – Function Histogram
From the main window select a node
Right click on
MPI_File_write_at_all() ->
Show Function
Histogram
5050 Open slide master to edit
Callpath
export TAU_METRICS=TIME:PAPI_TOT_INS:PAPI_TOT_CYC:PAPI_FP_OPS
export TAU_CALLPATH=1
export TAU_CALLPATH_DEPTH=10
export TAU_PROFILE=1
export TAU_TRACK_MESSAGE=1
export TAU_COMM_MATRIX=1
jsrun -n 64 -r 8 -a 1 -c 1 ./miniWeather_mpi
5151 Open slide master to edit
Paraprof - Callpath
From the main Window right click on any label (node 0, mean etc.) and select “Show
Thread Call Graph”
5252 Open slide master to edit
Paraprof - Callpath
From the main Window right click on any label (node 0, mean etc.) and select “Show
Thread Statistics Table”

More Related Content

Similar to How to use TAU for Performance Analysis on Summit Supercomputer

Performance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4SPerformance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4SGanesan Narayanasamy
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationTatsuhiro Chiba
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformGanesan Narayanasamy
 
Inside the JVM - Performance & Garbage Collector Tuning in JAVA
Inside the JVM - Performance & Garbage Collector Tuning in JAVA Inside the JVM - Performance & Garbage Collector Tuning in JAVA
Inside the JVM - Performance & Garbage Collector Tuning in JAVA Rafael Monteiro e Pereira
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016Brendan Gregg
 
Bgoug 2019.11 test your pl sql - not your patience
Bgoug 2019.11   test your pl sql - not your patienceBgoug 2019.11   test your pl sql - not your patience
Bgoug 2019.11 test your pl sql - not your patienceJacek Gebal
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSTulipp. Eu
 
Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1상욱 송
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaSridhar Kumar N
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...AMD Developer Central
 
Containerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor toolContainerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor toolGanesan Narayanasamy
 
Improving the performance of Odoo deployments
Improving the performance of Odoo deploymentsImproving the performance of Odoo deployments
Improving the performance of Odoo deploymentsOdoo
 
Deep learning - the conf br 2018
Deep learning - the conf br 2018Deep learning - the conf br 2018
Deep learning - the conf br 2018Fabio Janiszevski
 
Using Compuware Strobe to Save CPU: 4 Real-life Cases from the Files of CPT G...
Using Compuware Strobe to Save CPU: 4 Real-life Cases from the Files of CPT G...Using Compuware Strobe to Save CPU: 4 Real-life Cases from the Files of CPT G...
Using Compuware Strobe to Save CPU: 4 Real-life Cases from the Files of CPT G...Compuware
 
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...Puppet
 
LAS16-TR04: Using tracing to tune and optimize EAS (English)
LAS16-TR04: Using tracing to tune and optimize EAS (English)LAS16-TR04: Using tracing to tune and optimize EAS (English)
LAS16-TR04: Using tracing to tune and optimize EAS (English)Linaro
 
POUG2019 - Test your PL/SQL - your database will love you
POUG2019 - Test your PL/SQL - your database will love youPOUG2019 - Test your PL/SQL - your database will love you
POUG2019 - Test your PL/SQL - your database will love youJacek Gebal
 

Similar to How to use TAU for Performance Analysis on Summit Supercomputer (20)

Performance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4SPerformance Evaluation using TAU Performance System and E4S
Performance Evaluation using TAU Performance System and E4S
 
JVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark applicationJVM and OS Tuning for accelerating Spark application
JVM and OS Tuning for accelerating Spark application
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Inside the JVM - Performance & Garbage Collector Tuning in JAVA
Inside the JVM - Performance & Garbage Collector Tuning in JAVA Inside the JVM - Performance & Garbage Collector Tuning in JAVA
Inside the JVM - Performance & Garbage Collector Tuning in JAVA
 
Performance_Programming
Performance_ProgrammingPerformance_Programming
Performance_Programming
 
Linux Systems Performance 2016
Linux Systems Performance 2016Linux Systems Performance 2016
Linux Systems Performance 2016
 
Bgoug 2019.11 test your pl sql - not your patience
Bgoug 2019.11   test your pl sql - not your patienceBgoug 2019.11   test your pl sql - not your patience
Bgoug 2019.11 test your pl sql - not your patience
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOS
 
Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1Java 어플리케이션 성능튜닝 Part1
Java 어플리케이션 성능튜닝 Part1
 
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,GrafanaPrometheus - Intro, CNCF, TSDB,PromQL,Grafana
Prometheus - Intro, CNCF, TSDB,PromQL,Grafana
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
 
Containerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor toolContainerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor tool
 
Improving the performance of Odoo deployments
Improving the performance of Odoo deploymentsImproving the performance of Odoo deployments
Improving the performance of Odoo deployments
 
Deep learning - the conf br 2018
Deep learning - the conf br 2018Deep learning - the conf br 2018
Deep learning - the conf br 2018
 
Using Compuware Strobe to Save CPU: 4 Real-life Cases from the Files of CPT G...
Using Compuware Strobe to Save CPU: 4 Real-life Cases from the Files of CPT G...Using Compuware Strobe to Save CPU: 4 Real-life Cases from the Files of CPT G...
Using Compuware Strobe to Save CPU: 4 Real-life Cases from the Files of CPT G...
 
Nexmark with beam
Nexmark with beamNexmark with beam
Nexmark with beam
 
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
 
LAS16-TR04: Using tracing to tune and optimize EAS (English)
LAS16-TR04: Using tracing to tune and optimize EAS (English)LAS16-TR04: Using tracing to tune and optimize EAS (English)
LAS16-TR04: Using tracing to tune and optimize EAS (English)
 
POUG2019 - Test your PL/SQL - your database will love you
POUG2019 - Test your PL/SQL - your database will love youPOUG2019 - Test your PL/SQL - your database will love you
POUG2019 - Test your PL/SQL - your database will love you
 

More from George Markomanolis

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerGeorge Markomanolis
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer George Markomanolis
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIGeorge Markomanolis
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaGeorge Markomanolis
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...George Markomanolis
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIGeorge Markomanolis
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 

More from George Markomanolis (14)

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part II
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Lustre Best Practices
Lustre Best Practices Lustre Best Practices
Lustre Best Practices
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

Recently uploaded

New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 

Recently uploaded (20)

New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

How to use TAU for Performance Analysis on Summit Supercomputer

  • 1. ORNL is managed by UT-Battelle, LLC for the US Department of Energy How to use TAU for Performance Analysis George S. Markomanolis 7 August 2019
  • 2. 22 Open slide master to edit Outline • Introduction to TAU • How to compile • Explaining functionalities of TAU/ParaProf • Presenting basic steps of PerfExplorer
  • 3. 33 Open slide master to edit TAU • Tuning and Analysis Utilities, developed at University of Oregon • Scalable and flexible performance analysis toolkit • Automatic instrumentation through Program Database Toolkit (PDT) for routines, loops, I/O, memory, phases, etc. • Installed version on Summit: v2.28.1 • Module: tau • Web site: https://www.cs.uoregon.edu/research/tau/home.php • Email: tau-bugs@cs.uoregon.edu
  • 4. 44 Open slide master to edit Capability Matrix - TAU Capability Profiling Tracing Notes/Limitations MPI, MPI-IO Yes Yes OpenMP CPU Yes Yes OpenMP GPU Yes Yes Some restrictions apply regarding the CUPTI metrics OpenACC Yes Yes Some functionalities are not ready for production, no metrics available CUDA Yes Yes Some functionalities are not ready for production POSIX I/O Yes Yes POSIX threads Yes Yes Memory – app-level Yes Yes Memory – func-level Yes Yes Hotspot Detection Yes Yes Variance Detection Yes Yes Hardware Counters Yes Yes
  • 5. 55 Open slide master to edit Compilation • There are mainly three approaches to use an application with TAU – Use TAU Wrappers • For C: replace the compiler with tau_cc.sh • For C++: replace the compiler with tau_cxx.sh • For Fortran: replace the compiler with tau_f90.sh/tau_f77.sh – Dynamic instrumentation, for example: • jsrun -n 4 –r 4 –a 1 –c1 tau_exec -T mpi ./test – Rewrite the binary (support for x86_64): • tau_rewrite –T papi,pdf a.out –o a.inst
  • 6. 66 Open slide master to edit Compilation (cont.) Interposition: tau_exec Compiler: tau_cc.sh –tau_options=-optCompInst Set the TAU_MAKEFILE Source: tau_cc.sh The TAU_MAKEFILE should include the PDT
  • 7. 77 Open slide master to edit tau_exec tau_exec –help Options: -v Verbose mode -s Show what will be done but don't actually do anything (dryrun) -io Track I/O -memory Track memory allocation/deallocation -memory_debug Enable memory debugger -cuda Track GPU events via CUDA -cupti Track GPU events via CUPTI (Also see env. variable TAU_CUPTI_API) -opencl Track GPU events via OpenCL -openacc Track GPU events via OpenACC (currently PGI only) -rocm Track ROCm events via rocprofiler -ompt Track OpenMP events via OMPT interface -ebs Enable event-based sampling -ebs_period=<count> Sampling period (default 1000) -ebs_source=<counter> Counter (default itimer) -ebs_resolution=<file|function|line> Choose sampling granularity. -um Enable Unified Memory events via CUPTI -sass=<level> Track GPU events via CUDA with Source Code Locator activity (kernel level or source level) -csv Outputs sass profile in CSV -env Track GPU environment activity (power utilization, SM, memory frequency, temperature) -T <CUPTI,DISABLE,GNU,GNU_MEM,MPI,OPENMP,PAPI,PDT,PGI,PGI_MEM,PROFILE,SERIAL> : Specify TAU tags
  • 8. 88 Open slide master to edit TAU Environment Variables
  • 9. 99 Open slide master to edit TAU Compile-Time Environment Variables For using free format in .f files, use: % export TAU_OPTIONS=`-optPdtF95Opts=``-R free’’’
  • 10. 1010 Open slide master to edit How TAU works? • Instrumentation: – Adds probes to perform measurements – Source code instrumentation – Wrapping external libraries (I/O, CUDA, OpenACC, OpenCL) – Rewriting the binary executable • Measurement: – Profiling or Tracing – Direct instrumentation – Sampling – Throttling • Analysis: – Visualization of profiles and traces – 3D visualization – Trace conversion tools
  • 11. 1111 Open slide master to edit TAU Instrumentation/Measurement
  • 12. 1212 Open slide master to edit Tau_exec Usage: tau_exec [options] [--] <exe> <exe options> Options: -v Verbose mode -vv Very Verbose mode (enables TAU_VERBOSE=1) -s Show what will be done but don't actually do anything (dryrun) -io Track I/O -memory Track memory allocation/deallocation -memory_debug Enable memory debugger -cuda Track GPU events via CUDA -cupti Track GPU events via CUPTI (Also see env. variable TAU_CUPTI_API) -opencl Track GPU events via OpenCL -openacc Track GPU events via OpenACC (currently PGI only) -rocm Track ROCm events via rocprofiler -ompt Track OpenMP events via OMPT interface -power Track power events via PAPI's perf RAPL interface| -numa Track remote DRAM, total DRAM events (needs papi with recent perf support for x86_64) -ebs Enable event-based sampling -ebs_period=<count> Sampling period (default 1000) -um Enable Unified Memory events via CUPTI -sass=<level> Track GPU events via CUDA with Source Code Locator activity (kernel level or source level) -csv Outputs sass profile in CSV
  • 13. 1313 Open slide master to edit MiniWeather MPI compilation • module load pgi • module load tau • export TAU_MAKEFILE/sw/summit/tau/2.28.1_patched/ibm64linux/lib/Makef ile.tau-pgi-papi-mpi-pdt-pgi • Replace mpicxx with tau_cxx.sh in the Makefile • export TAU_OPTIONS='-optLinking=-lpnetcdf -optVerbose' • make mpi
  • 14. 1414 Open slide master to edit MiniWeather MPI – Execution - Profiling export TAU_METRICS=TIME:PAPI_TOT_INS:PAPI_TOT_CYC:PAPI_FP_OPS #export TAU_CALLPATH=1 #export TAU_CALLPATH_DEPTH=10 export TAU_PROFILE=1 export TAU_TRACK_MESSAGE=1 export TAU_COMM_MATRIX=1 jsrun -n 64 -r 8 -a 1 -c 1 ./miniWeather_mpi Or if compiled with mpicxx jsrun -n 64 -r 8 -a 1 -c 1 tau_exec ./miniWeather_mpi
  • 15. 1515 Open slide master to edit MiniWeather MPI - Execution • When the execution finished, there is one folder for each TAU_METRICS declaration with the format MULTI__ • If there is no TAU_METRICS declared, then by default is used the metric TIME and the profiling files are not in a folder, in this case you need to pack them and execute paraprof: summit> paraprof –pack name.ppk summit> paraprof name.ppk • To visualize the results execute paraprof (check also pprof for text mode)
  • 16. 1616 Open slide master to edit MiniWeather MPI - Paraprof • The default metric is TIME • Each color is a different call • Each horizontal line is a process or Std.Dev./mean/max/min
  • 17. 1717 Open slide master to edit Exploring Paraprof • Options -> Uncheck Stack Bars Together • It is easier to check the load imbalance • We will call this window as the main one
  • 18. 1818 Open slide master to edit Exploring Paraprof • Click on any color, values per process, name of routine with callpath (if activated), units in seconds, value exclusive, max, min, mean, std, values.
  • 19. 1919 Open slide master to edit Exploring Paraprof • Scroll down
  • 20. 2020 Open slide master to edit Exploring Paraprof • Click on any label on the left (node 0, mean, etc.). You can see immediately which calls take more time
  • 21. 2121 Open slide master to edit Paraprof – Thread Statistics Text Window • Right click on any label of the main window, select “Show Thread Statistics Text Window”
  • 22. 2222 Open slide master to edit Paraprof – Thread Statistics Table • Right click on any label of the main window, select “Show Thread Statistics Table”
  • 23. 2323 Open slide master to edit Paraprof – User Bar Chart • Right click on any label of the main window, select “Show User Bar Chart”
  • 24. 2424 Open slide master to edit Paraprof – User Event • Options -> Select Value Type -> Max. Value
  • 25. 2525 Open slide master to edit Paraprof – User Event Statistics Window • Right click on any label of the main window, select “Show User Event Statistics Window”
  • 26. 2626 Open slide master to edit Paraprof – Context Event Window • Right click on any label of the main window, select “Show Context Event Window” (with callpath)
  • 27. 2727 Open slide master to edit Paraprof – Add Thread to Comparison Window • Right click on node 0 and select “Add Thread to Comparison Window”, similar for node 12. You could use any number of processes that you prefer.
  • 28. 2828 Open slide master to edit Derived Metrics Options -> Show Derived Metric Panel, select the metrics and then operator and then click Apply. Then uncheck the Show Derived Metric
  • 29. 2929 Open slide master to edit Paraprof - IPC • Click on the new metric, PAPI_TOT_INS/PAPI_TOT_CYC
  • 30. 3030 Open slide master to edit Paraprof – Mean IPC • Click on the label mean
  • 31. 3131 Open slide master to edit Paraprof – IPC for thread 0 • From the main window with the PAPI_TOT_INS/PAPI_TOT_CYC metric, right click on node 0 and select Show Thread Statistics Table
  • 32. 3232 Open slide master to edit Paraprof • From the main window select Options -> Select Metric… -> Exclusive -> PAPI_FP_OPS
  • 33. 3333 Open slide master to edit Paraprof – 3D Visualization • Menu Windows -> 3D Visualization (3D demands OpenGL) • Exclusive Time and Exclusive Floating operations
  • 34. 3434 Open slide master to edit Paraprof – 3D Visualization • Menu Windows -> 3D Visualization (3D demands OpenGL) • Specific routine and thread
  • 35. 3535 Open slide master to edit Paraprof – 3D Visualization • Menu Windows -> 3D Visualization (3D demands OpenGL) • Exclusive time and total instructions
  • 36. 3636 Open slide master to edit Paraprof – 3D Visualization Menu Windows -> 3D Visualization (3D demands OpenGL)
  • 37. 3737 Open slide master to edit Paraprof – 3D Visualization • Menu Windows -> 3D Visualization (3D demands OpenGL) • Exclusive time and instructions per cycle
  • 38. 3838 Open slide master to edit Paraprof – 3D Visualization • Menu Windows -> 3D Visualization (3D demands OpenGL) • Bar Plot
  • 39. 3939 Open slide master to edit Paraprof – 3D Visualization • Menu Windows -> 3D Visualization (3D demands OpenGL) • Scatter Plot
  • 40. 4040 Open slide master to edit Paraprof – 3D Visualization • Menu Windows -> 3D Visualization (3D demands OpenGL) • Topology Plot
  • 41. 4141 Open slide master to edit Paraprof – 3D Communication Matrix • Menu Windows -> 3D Visualization (3D demands OpenGL) • Max message size vs Number of calls
  • 42. 4242 Open slide master to edit Paraprof Menu Windows -> Communication Matrix
  • 43. 4343 Open slide master to edit Which loops require the most time? • File select.tau: BEGIN_INSTRUMENT_SECTION loops routine=“#” END_INSTRUMENT_SECTION • Declare TAU options: export TAU_OPTIONS=“-optTauSelectFile=select.tau -optLinking=-lpnetcdf - optVerbose” • Do not forget to unset TAU_OPTIONS when not required • Execute as before
  • 44. 4444 Open slide master to edit Paraprof - Loops
  • 45. 4545 Open slide master to edit Paraprof - Loops Select Options -> Select Metric… -> Exclusive… -> PAPI_TOT_INS
  • 46. 4646 Open slide master to edit Paraprof - Loops Select Options -> Select Metric… -> Exclusive… -> PAPI_TOT_INS/PAPI_TOT_CYC
  • 47. 4747 Open slide master to edit Paraprof - Loops Select Options -> Select Metric… -> Exclusive… -> PAPI_FP_OPS
  • 48. 4848 Open slide master to edit Paraprof From the main window select a node Click on node 0
  • 49. 4949 Open slide master to edit Paraprof – Function Histogram From the main window select a node Right click on MPI_File_write_at_all() -> Show Function Histogram
  • 50. 5050 Open slide master to edit Callpath export TAU_METRICS=TIME:PAPI_TOT_INS:PAPI_TOT_CYC:PAPI_FP_OPS export TAU_CALLPATH=1 export TAU_CALLPATH_DEPTH=10 export TAU_PROFILE=1 export TAU_TRACK_MESSAGE=1 export TAU_COMM_MATRIX=1 jsrun -n 64 -r 8 -a 1 -c 1 ./miniWeather_mpi
  • 51. 5151 Open slide master to edit Paraprof - Callpath From the main Window right click on any label (node 0, mean etc.) and select “Show Thread Call Graph”
  • 52. 5252 Open slide master to edit Paraprof - Callpath From the main Window right click on any label (node 0, mean etc.) and select “Show Thread Statistics Table”