SlideShare a Scribd company logo
1 of 39
Download to read offline
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Performance Analysis with Scalasca part II
George S. Markomanolis
8 August 2019
22 Open slide master to edit
MiniWeather MPI+OpenACC
• Compilations
• MPI+OpenACC: make PREP="scorep --mpp=mpi --openacc –
pdt" openacc
• Cube4 does not show OpenACC information
33 Open slide master to edit
LSMS compilation - Issues
• scorep --cuda nvcc…
/gpfs/alpine/gen110/scratch/gmarkoma/scorep/LSMS3/Source
/LSMS_3/src/Accelerator/buildKKRMatrix_kernels.cu(242): error:
identifier "omp_get_thread_num" is undefined
• Adding the option --keep-files, does not delete the temporary
files, so we can observe what is wrong
44 Open slide master to edit
LSMS compilation – Issues (cont.)
Original .cu file Temporary opari.cu file during
instrumentation
#ifdef _OPENMP
#include <omp.h>
#else
inline int omp_get_max_threads()
{return 1;}
inline int omp_get_num_threads()
{return 1;}
inline int omp_get_thread_num()
{return 0;}
#endif
#ifdef _OPENMP
#else
inline int omp_get_max_threads()
{return 1;}
inline int omp_get_num_threads()
{return 1;}
inline int omp_get_thread_num()
{return 0;}
#endif
Adding manually the #include <omp.h> to the temporary file, it solves the issue
55 Open slide master to edit
LSMS – Profiling one compute node
66 Open slide master to edit
LSMS on 128 compute nodes – Compute variation
77 Open slide master to edit
LSMS on 128 compute nodes – MPI_Allreduce variation
88 Open slide master to edit
LSMS on 128 compute nodes – OpenMP Barrier
99 Open slide master to edit
LSMS on 128 compute nodes – OpenMP Barrier
1010 Open slide master to edit
LSMS on 128 compute nodes – OpenMP Barrier - Sunburst
1111 Open slide master to edit
LSMS – Buffer for tracing
• scalasca -examine -s /gpfs/alpine/…/scorep_lsms_Ox8_sum/
INFO: Score report written to /gpfs/…/scorep_lsms_Ox8_sum/scorep.score
vim /gpfs/…/scorep_lsms_Ox8_sum/scorep.score
Estimated aggregate size of event trace: 24TB
Estimated requirements for largest trace buffer (max_buf): 36GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 36GB
(warning: The memory requirements cannot be satisfied by Score-P to avoid intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the maximum
supported memory or reduce requirements using USR regions filters.)
flt type max_buf[B] visits time[s] time[%] time/visit[us] region
ALL 37,758,243,091 2,906,934,475 3738.13 100.0 1.29 ALL
USR 37,757,438,730 2,906,899,463 3714.45 99.4 1.28 USR
OMP 757,272 32,688 18.59 0.5 568.66 OMP
MPI 29,168 932 2.58 0.1 2772.73 MPI
COM 17,880 1,390 2.50 0.1 1798.55 COM
SCOREP 41 2 0.00 0.0 242.20 SCOREP
USR 22,486,328,384 1,729,717,541 260.77 7.0 0.15 dfv_m
USR 9,921,071,888 763,159,376 126.99 3.4 0.17 fit
USR 1,064,960,000 81,920,000 343.03 9.2 4.19 void bulirsch_stoer_integrator
USR 846,102,582 65,083,854 19.65 0.5 0.30 void plm_normalized_(int*, double*, double*)
1212 Open slide master to edit
Selective instrumentation
• File exclude.filt, we select the USR calls with high volume of max_buf and the ones with many calls:
SCOREP_REGION_NAMES_BEGIN EXCLUDE
dfv_m
fit
*bulirsch_stoer_integrator_*
*plm_normalized_*
...
SCOREP_REGION_NAMES_END
• Apply on the profiling data:
scalasca -examine -s -f exclude.filt scorep_lsms_Ox7_sum/
1313 Open slide master to edit
Selective instrumentation (cont.)
• vim scorep_lsms_Ox7_sum/scorep.score_exclude.filt
Estimated aggregate size of event trace: 3100MB
Estimated requirements for largest trace buffer (max_buf): 6MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 20MB
(hint: When tracing set SCOREP_TOTAL_MEMORY=20MB to avoid intermediate flushes or reduce requirements using USR regions filters.)
flt type max_buf[B] visits time[s] time[%] time/visit[us] region
- ALL 37,768,557,978 997,234,022,247 1508955.54 100.0 1.51 ALL
- USR 37,767,383,966 997,216,546,439 1237070.04 82.0 1.24 USR
- MPI 1,160,452 5,948,480 139997.90 9.3 23535.07 MPI
- OMP 662,613 10,983,168 129188.73 8.6 11762.43 OMP
- COM 302,424 543,392 2698.66 0.2 4966.33 COM
- SCOREP 41 768 0.21 0.0 270.75 SCOREP
* ALL 5,851,181 108,920,629 1061962.47 70.4 9749.87 ALL-FLT
+ FLT 37,763,922,579 997,125,101,618 446993.07 29.6 0.45 FLT
* USR 3,725,651 91,444,821 790076.96 52.4 8639.93 USR-FLT
- MPI 1,160,452 5,948,480 139997.90 9.3 23535.07 MPI-FLT
- OMP 662,613 10,983,168 129188.73 8.6 11762.43 OMP-FLT
* COM 302,424 543,392 2698.66 0.2 4966.33 COM-FLT
- SCOREP 41 768 0.21 0.0 270.75 SCOREP-FLT
Execute: scalasca -analyze -q -t -f /full_path/exclude.filt jsrun …
1414 Open slide master to edit
LSMS on 128 compute nodes – Late Receivers (occ)
1515 Open slide master to edit
LSMS on 128 compute nodes – Late Sender (sec.)
1616 Open slide master to edit
LSMS on 128 compute nodes – Short-term Idleness delay
1717 Open slide master to edit
LSMS on 128 compute nodes – Short-term Idleness delay
1818 Open slide master to edit
Vampir with OTF2 trace
OTF2 trace is created inside the scorep folder after the scalasca –analyze,
example of 640 MPI processes
1919 Open slide master to edit
Vampir with OTF2 trace
2020 Open slide master to edit
Vampir with OTF2 trace
2121 Open slide master to edit
Vampir with OTF2 trace
2222 Open slide master to edit
Vampir with OTF2 trace
2323 Open slide master to edit
Vampir with OTF2 trace
2424 Open slide master to edit
Vampir with OTF2 trace
2525 Open slide master to edit
Vampir with OTF2 trace
2626 Open slide master to edit
Vampir with OTF2 trace
2727 Open slide master to edit
Vampir with OTF2 trace
2828 Open slide master to edit
Vampir with OTF2 trace
2929 Open slide master to edit
Vampir with OTF2 trace
3030 Open slide master to edit
Vampir with OTF2 trace
3131 Open slide master to edit
Vampir with OTF2 trace
3232 Open slide master to edit
Vampir with OTF2 trace
3333 Open slide master to edit
Vampir with OTF2 trace
3434 Open slide master to edit
Vampir with OTF2 trace
3535 Open slide master to edit
Vampir with OTF2 trace
3636 Open slide master to edit
Vampir with OTF2 trace II
Zoom in with the mouse
3737 Open slide master to edit
Vampir with OTF2 trace III
Zoom in with the mouse
3838 Open slide master to edit
Vampir with OTF2 trace IV
Add functionalities from the menu
3939 Open slide master to edit
Conclusions
• Scalasca can identify bottlenecks based on some patterns
• Useful for MPI and OpenMP
• It can not provide useful information about GPUs but the
information exists inside the traces
• It can create OTF2 trace files and use Vampir to visualize

More Related Content

Similar to ORNL is managed by UT-Battelle, LLC for the US Department of Energy

Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Anne Nicolas
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFBrendan Gregg
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFBrendan Gregg
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PGeorge Markomanolis
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby SystemsEngine Yard
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging RubyAman Gupta
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsBrendan Gregg
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and moreBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFBrendan Gregg
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsSysdig
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018Brendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance AnalysisBrendan Gregg
 
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root44CON
 
Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Ontico
 

Similar to ORNL is managed by UT-Battelle, LLC for the US Department of Energy (20)

Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPF
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby Systems
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging Ruby
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and more
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
200.1,2-Capacity Planning
200.1,2-Capacity Planning200.1,2-Capacity Planning
200.1,2-Capacity Planning
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance Analysis
 
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root
 
Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)
 
SOFA Tutorial
SOFA TutorialSOFA Tutorial
SOFA Tutorial
 

More from George Markomanolis

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerGeorge Markomanolis
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer George Markomanolis
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IGeorge Markomanolis
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IGeorge Markomanolis
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIGeorge Markomanolis
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerGeorge Markomanolis
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaGeorge Markomanolis
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...George Markomanolis
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIGeorge Markomanolis
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 

More from George Markomanolis (17)

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part I
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit Supercomputer
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Lustre Best Practices
Lustre Best Practices Lustre Best Practices
Lustre Best Practices
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

Recently uploaded

Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 

Recently uploaded (20)

Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

  • 1. ORNL is managed by UT-Battelle, LLC for the US Department of Energy Performance Analysis with Scalasca part II George S. Markomanolis 8 August 2019
  • 2. 22 Open slide master to edit MiniWeather MPI+OpenACC • Compilations • MPI+OpenACC: make PREP="scorep --mpp=mpi --openacc – pdt" openacc • Cube4 does not show OpenACC information
  • 3. 33 Open slide master to edit LSMS compilation - Issues • scorep --cuda nvcc… /gpfs/alpine/gen110/scratch/gmarkoma/scorep/LSMS3/Source /LSMS_3/src/Accelerator/buildKKRMatrix_kernels.cu(242): error: identifier "omp_get_thread_num" is undefined • Adding the option --keep-files, does not delete the temporary files, so we can observe what is wrong
  • 4. 44 Open slide master to edit LSMS compilation – Issues (cont.) Original .cu file Temporary opari.cu file during instrumentation #ifdef _OPENMP #include <omp.h> #else inline int omp_get_max_threads() {return 1;} inline int omp_get_num_threads() {return 1;} inline int omp_get_thread_num() {return 0;} #endif #ifdef _OPENMP #else inline int omp_get_max_threads() {return 1;} inline int omp_get_num_threads() {return 1;} inline int omp_get_thread_num() {return 0;} #endif Adding manually the #include <omp.h> to the temporary file, it solves the issue
  • 5. 55 Open slide master to edit LSMS – Profiling one compute node
  • 6. 66 Open slide master to edit LSMS on 128 compute nodes – Compute variation
  • 7. 77 Open slide master to edit LSMS on 128 compute nodes – MPI_Allreduce variation
  • 8. 88 Open slide master to edit LSMS on 128 compute nodes – OpenMP Barrier
  • 9. 99 Open slide master to edit LSMS on 128 compute nodes – OpenMP Barrier
  • 10. 1010 Open slide master to edit LSMS on 128 compute nodes – OpenMP Barrier - Sunburst
  • 11. 1111 Open slide master to edit LSMS – Buffer for tracing • scalasca -examine -s /gpfs/alpine/…/scorep_lsms_Ox8_sum/ INFO: Score report written to /gpfs/…/scorep_lsms_Ox8_sum/scorep.score vim /gpfs/…/scorep_lsms_Ox8_sum/scorep.score Estimated aggregate size of event trace: 24TB Estimated requirements for largest trace buffer (max_buf): 36GB Estimated memory requirements (SCOREP_TOTAL_MEMORY): 36GB (warning: The memory requirements cannot be satisfied by Score-P to avoid intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the maximum supported memory or reduce requirements using USR regions filters.) flt type max_buf[B] visits time[s] time[%] time/visit[us] region ALL 37,758,243,091 2,906,934,475 3738.13 100.0 1.29 ALL USR 37,757,438,730 2,906,899,463 3714.45 99.4 1.28 USR OMP 757,272 32,688 18.59 0.5 568.66 OMP MPI 29,168 932 2.58 0.1 2772.73 MPI COM 17,880 1,390 2.50 0.1 1798.55 COM SCOREP 41 2 0.00 0.0 242.20 SCOREP USR 22,486,328,384 1,729,717,541 260.77 7.0 0.15 dfv_m USR 9,921,071,888 763,159,376 126.99 3.4 0.17 fit USR 1,064,960,000 81,920,000 343.03 9.2 4.19 void bulirsch_stoer_integrator USR 846,102,582 65,083,854 19.65 0.5 0.30 void plm_normalized_(int*, double*, double*)
  • 12. 1212 Open slide master to edit Selective instrumentation • File exclude.filt, we select the USR calls with high volume of max_buf and the ones with many calls: SCOREP_REGION_NAMES_BEGIN EXCLUDE dfv_m fit *bulirsch_stoer_integrator_* *plm_normalized_* ... SCOREP_REGION_NAMES_END • Apply on the profiling data: scalasca -examine -s -f exclude.filt scorep_lsms_Ox7_sum/
  • 13. 1313 Open slide master to edit Selective instrumentation (cont.) • vim scorep_lsms_Ox7_sum/scorep.score_exclude.filt Estimated aggregate size of event trace: 3100MB Estimated requirements for largest trace buffer (max_buf): 6MB Estimated memory requirements (SCOREP_TOTAL_MEMORY): 20MB (hint: When tracing set SCOREP_TOTAL_MEMORY=20MB to avoid intermediate flushes or reduce requirements using USR regions filters.) flt type max_buf[B] visits time[s] time[%] time/visit[us] region - ALL 37,768,557,978 997,234,022,247 1508955.54 100.0 1.51 ALL - USR 37,767,383,966 997,216,546,439 1237070.04 82.0 1.24 USR - MPI 1,160,452 5,948,480 139997.90 9.3 23535.07 MPI - OMP 662,613 10,983,168 129188.73 8.6 11762.43 OMP - COM 302,424 543,392 2698.66 0.2 4966.33 COM - SCOREP 41 768 0.21 0.0 270.75 SCOREP * ALL 5,851,181 108,920,629 1061962.47 70.4 9749.87 ALL-FLT + FLT 37,763,922,579 997,125,101,618 446993.07 29.6 0.45 FLT * USR 3,725,651 91,444,821 790076.96 52.4 8639.93 USR-FLT - MPI 1,160,452 5,948,480 139997.90 9.3 23535.07 MPI-FLT - OMP 662,613 10,983,168 129188.73 8.6 11762.43 OMP-FLT * COM 302,424 543,392 2698.66 0.2 4966.33 COM-FLT - SCOREP 41 768 0.21 0.0 270.75 SCOREP-FLT Execute: scalasca -analyze -q -t -f /full_path/exclude.filt jsrun …
  • 14. 1414 Open slide master to edit LSMS on 128 compute nodes – Late Receivers (occ)
  • 15. 1515 Open slide master to edit LSMS on 128 compute nodes – Late Sender (sec.)
  • 16. 1616 Open slide master to edit LSMS on 128 compute nodes – Short-term Idleness delay
  • 17. 1717 Open slide master to edit LSMS on 128 compute nodes – Short-term Idleness delay
  • 18. 1818 Open slide master to edit Vampir with OTF2 trace OTF2 trace is created inside the scorep folder after the scalasca –analyze, example of 640 MPI processes
  • 19. 1919 Open slide master to edit Vampir with OTF2 trace
  • 20. 2020 Open slide master to edit Vampir with OTF2 trace
  • 21. 2121 Open slide master to edit Vampir with OTF2 trace
  • 22. 2222 Open slide master to edit Vampir with OTF2 trace
  • 23. 2323 Open slide master to edit Vampir with OTF2 trace
  • 24. 2424 Open slide master to edit Vampir with OTF2 trace
  • 25. 2525 Open slide master to edit Vampir with OTF2 trace
  • 26. 2626 Open slide master to edit Vampir with OTF2 trace
  • 27. 2727 Open slide master to edit Vampir with OTF2 trace
  • 28. 2828 Open slide master to edit Vampir with OTF2 trace
  • 29. 2929 Open slide master to edit Vampir with OTF2 trace
  • 30. 3030 Open slide master to edit Vampir with OTF2 trace
  • 31. 3131 Open slide master to edit Vampir with OTF2 trace
  • 32. 3232 Open slide master to edit Vampir with OTF2 trace
  • 33. 3333 Open slide master to edit Vampir with OTF2 trace
  • 34. 3434 Open slide master to edit Vampir with OTF2 trace
  • 35. 3535 Open slide master to edit Vampir with OTF2 trace
  • 36. 3636 Open slide master to edit Vampir with OTF2 trace II Zoom in with the mouse
  • 37. 3737 Open slide master to edit Vampir with OTF2 trace III Zoom in with the mouse
  • 38. 3838 Open slide master to edit Vampir with OTF2 trace IV Add functionalities from the menu
  • 39. 3939 Open slide master to edit Conclusions • Scalasca can identify bottlenecks based on some patterns • Useful for MPI and OpenMP • It can not provide useful information about GPUs but the information exists inside the traces • It can create OTF2 trace files and use Vampir to visualize