Online performance modeling and analysis of message-passing parallel applications

Although hardware is evolving at an incredible rate, advances in parallel software have been hampered for many reasons. Developing an efficient parallel application is still not an easy task. Our thesis is that many performance problems and their reasons can be quickly located and explained with automated techniques that work on unmodified parallel applications. This work identifies the main obstacles to such diagnosis and presents a two-step approach for addressing them. In this approach, the application is automatically modeled and diagnosed during its execution.

First, we introduce an online performance modeling technique that enables the automated discovery of causal execution flows through communication and computational activities in message-passing parallel programs. Second, we present a systematic approach to online performance analysis. The automated analysis uses the online model to quickly identify the most important performance problems and correlate them with the application source code. Our technique is able to discover causal dependences between the problems, infer their root causes in some scenarios, and explain them to developers. In this work, we focus on diagnosing scientific MPI parallel applications and their communication and computational problems, although the approach can be extended to support other classes of activities and programming models.

We have evaluated our approach on a variety of scientific parallel applications. In all scenarios, our online performance modeling technique proved effective for low-overhead capture of a program's behavior and facilitated performance understanding. With our automated, model-based performance analysis approach, we were able to easily identify the most severe performance problems during application execution and locate their root causes without prior knowledge of the application's internals.

Online performance modeling and analysis of message-passing parallel applications

1. Online performance modeling and analysis of message-passing parallel applications
   PhD Thesis, Oleg Morajko
   Universitat Autònoma de Barcelona, Barcelona, 2008
   [Title-slide figure: execution timeline annotated with "Delayed receive" and "Long local calculations"]

2. Motivation
   • Parallel system hardware is evolving at an incredible rate
   • Contemporary HPC systems
     – Top500 ranging from 1,000 to 200,000+ processors (June 2008)
     – Take BSC MareNostrum: 10K processors
   • The whole industry is shifting to parallel computing

3. Motivation
   • Challenges of developing large-scale scientific software
     – Evolution of programming models is much slower
     – Hard to achieve good efficiency
     – Hard to achieve scalability
   • Parallel applications rarely achieve good performance immediately

4. Motivation
   • Challenges of developing large-scale scientific software
     – Evolution of programming models is much slower
     – Hard to achieve good efficiency
     – Hard to achieve scalability
   • Parallel applications rarely achieve good performance immediately
   • Careful performance analysis and optimization tasks are crucial

5. Motivation
   • Quickly finding performance problems and their reasons is hard
   • Requires a thorough understanding of the program's behavior
     – Parallel algorithm, domain decomposition, communication, synchronization
   • Large scale brings additional complexities
     – Large data volume, excessive analysis cost
   • Existing tools support finding what happens, where, and when
     – Locating root causes of problems is still manual
     – Tools expose scalability limitations (e.g. tracing)
   • Problem diagnosis still requires substantial time and effort from highly skilled professionals

6. Our goals
   • Analyze the performance of parallel applications
   • Detect bottlenecks and explain their causes
     – Focus on communication and synchronization in message-passing programs
   • Automate the approach to the extent possible
   • Scale to thousands of nodes
   • Online approach, without trace files

7. Contributions
   • A systematic approach for automated diagnosis of application performance
     – The application is monitored, modeled, and diagnosed during its execution
   • A scalable modeling technique that generates performance knowledge about application behavior
   • An analysis technique that diagnoses MPI applications running in large-scale parallel systems
     – Detects performance bottlenecks on-the-fly
     – Finds root causes
   • A prototype tool to demonstrate the ideas

8. Outline
   1. Overview of approaches
   2. Online performance modeling
   3. Online performance analysis
   4. Experimental evaluation
   5. Conclusions and future work

9. Overview of approaches

10. Classical performance analysis
    [Workflow diagram: Develop → Compile → Instrument → Execute → Trace files → Analyze trace (visualization tool) → Performance problems → Find solutions → Code changes → back to Develop]

11. Classical performance analysis
    Drawbacks
    • Manual task of an experimental nature
    • Time consuming
    • High degree of expertise required
    • Full traces: an excessive volume of information
    • Poor scalability

12. Automated offline analysis
    [Workflow diagram: the same Develop → Compile → Instrument → Execute → Trace files cycle, with the trace analysis performed by automated tools (KappaPI, EXPERT)]

13. Automated offline analysis
    Drawbacks
    • Post-mortem
    • Addresses only well-known problems
    • Capabilities to find root causes not fully explored

14. Automated online analysis
    [Workflow diagram: Develop → Code changes → Compile → Instrument → Execute, with online monitoring (what, where, when) and diagnosis (Paradyn) reporting performance problems that feed Find solutions]

15. Automated online analysis
    Paradyn advantages
    • Locates problems while the app runs
    • Automated problem-space search
      – Functional decomposition
      – Refinable measurements
    • Scalable
    Paradyn drawbacks
    • Addresses lower-level problems (profiler)
    • No search for root causes of problems

16. Automated online analysis
    Our approach
    [Workflow diagram: Develop → Compile → Execute; Monitoring consumes events from the running application and feeds Modeling; Analysis observes the model, refines the monitoring, and reports problems and causes, which drive Find solutions → Code changes → Develop]

17. Automated online analysis
    Key characteristics
    • Discovers the application model on-the-fly
      – Models execution flows, not modules/functions
      – Lossy trace compression
    • Runtime analysis based on continuous model observation
    • Automatically locates problems while the app runs
    • Searches for root causes of problems

18. Online performance modeling
    [Section divider: Monitoring → Modeling → Analysis pipeline graphic]

19. Modeling objectives
    • Enable a high-level understanding of application performance
    • Reflect parallel application structure and runtime behavior
    • Maintain a tradeoff between the volume of collected data and the level of preserved detail
      – Communication and computational patterns
      – Causality of events
    • Serve as the base for online performance analysis

20. Online performance modeling
    • A novel application performance modeling approach
    • Combines static code analysis with runtime monitoring to extract performance knowledge
    • Three-step approach:
      – Modeling individual tasks
      – Modeling inter-task communication
      – Modeling the entire application

21. Modeling individual tasks
    • We decompose execution into units that correspond to different activities:
      – Communication activities (e.g. MPI_Send, MPI_Gather)
      – Computation activities (e.g. calc_gauss)
      – Control activities (e.g. program start/termination)
      – Others (e.g. I/O)
    • We capture the execution flow through these activities using a directed graph called the Task Activity Graph (TAG), sketched in code below:
      – Nodes model communication activities and loops
      – Edges represent the sequential flow of execution (computation activities)
      – Nodes and edges maintain the happens-before relationship

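To make the TAG concrete, here is a minimal sketch of how its node and edge records might be laid out in C++ (the prototype's implementation language). All names and fields are illustrative assumptions, not the thesis' actual data structures.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative TAG layout: nodes are communication activities or loop
// heads; an edge is the computation that flows between two nodes.
struct TagNode {
    int         id;
    std::string activity;   // e.g. "MPI_Send", "loop@wave.c:120"
    uint64_t    callPath;   // handle for the source-code location
    uint64_t    count = 0;  // times the activity executed
    double      time  = 0;  // accumulated time inside the activity
};

struct TagEdge {
    int      from, to;      // sequential flow: 'from' happens before 'to'
    uint64_t count = 0;     // traversals of this computation region
    double   time  = 0;     // accumulated computation time
};

struct TaskActivityGraph {
    std::vector<TagNode> nodes;
    std::vector<TagEdge> edges;
};
```
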
22. Modeling individual tasks
    • The Task Activity Graph (TAG) reflects program structure by modeling the executed flow of activities

23. Modeling individual tasks
    • Each activity corresponds to a particular location in the source code

24. Modeling individual tasks
    • The runtime behavior of activities is described by adding performance metrics to nodes and edges
    • Data is aggregated into statistical execution profiles
      – Edge: counter and accumulative timer {min, max, stddev}
      – Node: accumulative timer {min, max, stddev}

25. Modeling communication
    • Message edges capture matching send-receive links
      – P2P, collective
    • Completion edges capture non-blocking semantics
    • Performance metrics describe runtime behavior

26. Modeling parallel application
    • Individual TAG models connected by message edges form a Parallel-TAG model (PTAG)

27. Modeling techniques
    We developed a set of techniques to automatically construct and exploit the PTAG model at runtime
    • Targeted at parallel scientific applications
    • Focused on modeling MPI applications, but extendible to other programming paradigms
    • Low overhead
    • Scalable to 1000+ nodes

28. Online PTAG construction
    [Architecture diagram: each MPI Task (1..N) is paired with a Modeler; Modelers feed TBON nodes that aggregate up to the Front-end. Steps: (1) instrument task, (2) build TAG, (3) sample, (4) update TBON node, (5) merge, (6) update front-end, (7) analyze]

29. Building individual TAG
    [Diagram: the Modeler and the MPI Task communicate through shared memory; the RT Library runs inside the task. Steps: (1) analyze executable, (2) instrument, (3) capture events, (4) update TAG in shared memory, (5) sample, (6) update upstream]

30. Building individual TAG
    Offline program analysis
    • Parse the binary executable
    • Find target functions
    • Detect relevant loops

31. Building individual TAG
    Dynamic instrumentation
    • Instrument all target functions:
      – Record events
      – Collect performance metrics
      – Invoke TAG update
    • Refinable at runtime (see the sketch below)

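The prototype builds on DynInst 5.1 (see the prototype-tool slide). As a rough sketch, and only under the assumption that an event handler named tag_record_event exists in a preloaded runtime library, inserting an event-recording call at a target function's entry could look roughly like this:

```cpp
#include "BPatch.h"
#include "BPatch_process.h"
#include "BPatch_function.h"
#include "BPatch_point.h"
#include "BPatch_snippet.h"

// Sketch: insert a call to tag_record_event(nodeId) at the entry of a
// target function inside an already-attached MPI task. Error handling
// is omitted; the handler name and node id are illustrative.
void instrumentEntry(BPatch_process *proc, const char *target, int nodeId) {
    BPatch_image *image = proc->getImage();

    BPatch_Vector<BPatch_function *> funcs;
    image->findFunction(target, funcs);                // e.g. "MPI_Send"
    const BPatch_Vector<BPatch_point *> *entries =
        funcs[0]->findPoint(BPatch_entry);

    BPatch_Vector<BPatch_function *> handlers;
    image->findFunction("tag_record_event", handlers); // assumed RT-library hook

    BPatch_Vector<BPatch_snippet *> args;
    BPatch_constExpr idArg(nodeId);                    // TAG node for this call site
    args.push_back(&idArg);
    BPatch_funcCallExpr call(*handlers[0], args);

    proc->insertSnippet(call, *entries);               // removable later: refinable
}
```
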
32. Building individual TAG
    Performance metrics
    • Counters
    • Timers {sum, sum2, min, max}
    • Histograms
    • Compound metrics
    [Figure: counters cnt1-cnt5 and timers t1-t4 attached to instrumented TAG locations]

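Storing {sum, sum2, min, max} is enough to recover the mean and standard deviation at sampling time without keeping individual samples. A small sketch of such an accumulative timer (illustrative, not the thesis' code):

```cpp
#include <algorithm>
#include <cmath>

// Accumulative timer: constant space per TAG node/edge, yet mean and
// stddev are derivable whenever the Modeler samples the model.
struct Timer {
    unsigned long n = 0;
    double sum = 0, sum2 = 0;            // sum of t and sum of t*t
    double min = 1e300, max = 0;

    void record(double t) {
        ++n; sum += t; sum2 += t * t;
        min = std::min(min, t);
        max = std::max(max, t);
    }
    double mean() const { return n ? sum / n : 0.0; }
    double stddev() const {              // sqrt(E[t^2] - (E[t])^2)
        if (n == 0) return 0.0;
        double m = mean();
        return std::sqrt(std::max(0.0, sum2 / n - m * m));
    }
};
```
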
33. Building individual TAG
    Runtime modeling
    • Process generated events
    • Walk the stack to capture the program location (call path)
    • Update the TAG incrementally

34. Building individual TAG
    Model sampling
    • Goal: examine the model at runtime
    • Read the model from shared memory
    • Sampling is periodic
    • Lock-free synchronization (one possible scheme is sketched below)

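The slide does not name the lock-free protocol; one scheme that fits a single in-task writer and a periodic out-of-process reader is a sequence lock, sketched here purely as an assumption. A production seqlock over shared memory needs more care with the C++ memory model than this sketch shows.

```cpp
#include <atomic>
#include <cstring>

// Assumed seqlock-style snapshotting: the RT library bumps a sequence
// counter around each TAG update; the Modeler discards a snapshot if the
// counter was odd (update in progress) or changed while copying.
struct SharedModel {
    std::atomic<unsigned> seq{0};
    char data[256 * 1024];                         // serialized TAG snapshot
};

void writerUpdate(SharedModel &m, const char *tag, size_t len) {
    m.seq.fetch_add(1, std::memory_order_release); // odd: update in progress
    std::memcpy(m.data, tag, len);
    m.seq.fetch_add(1, std::memory_order_release); // even: stable again
}

bool readerSample(SharedModel &m, char *out, size_t len) {
    unsigned before = m.seq.load(std::memory_order_acquire);
    if (before & 1) return false;                  // writer active, skip this period
    std::memcpy(out, m.data, len);
    unsigned after = m.seq.load(std::memory_order_acquire);
    return before == after;                        // torn read: retry next period
}
```
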
35. Online communication modeling
    How do we model inter-task communication?
    • Intercept MPI communication calls (nodes)
    • Match sender nodes with receiver nodes
    • Add message edges to the TAG models

36. Online communication modeling
    • Requires tracking individual messages transmitted from sender to receiver(s) at runtime
    • Achieved by propagating piggyback data over every transmitted MPI message
      – Transmit the node id from sender to receiver(s)
      – P2P / blocking / non-blocking / collective
      – Optimized hybrid strategy to minimize intrusion
    • Store references to the sender's nodes in the receiver's TAG (a minimal interception sketch follows)

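The thesis uses an optimized hybrid piggyback strategy; the sketch below shows only the simplest conceivable variant, in which PMPI profiling-interface wrappers ship the sender's TAG node id (plus its send-entry timestamp, used later for synchronization-cost accounting) as a small companion message. current_node_id and record_message_edge are hypothetical RT-library pieces, and the signatures match the MPI-2-era headers the prototype targets.

```cpp
#include <mpi.h>

// Illustrative piggyback record transmitted alongside every message.
struct Piggyback { int nodeId; double sendEntryTs; };

extern int current_node_id;  // assumed: TAG node of the active send site
void record_message_edge(int senderNode, double sendEntryTs);  // assumed

extern "C" int MPI_Send(void *buf, int count, MPI_Datatype dt,
                        int dest, int tag, MPI_Comm comm) {
    Piggyback pb = { current_node_id, MPI_Wtime() };
    // Companion message first; MPI's pairwise ordering ensures the
    // matching receiver sees it before the real payload.
    PMPI_Send(&pb, sizeof pb, MPI_BYTE, dest, tag, comm);
    return PMPI_Send(buf, count, dt, dest, tag, comm);
}

extern "C" int MPI_Recv(void *buf, int count, MPI_Datatype dt,
                        int src, int tag, MPI_Comm comm, MPI_Status *st) {
    Piggyback pb;
    MPI_Status pst;
    PMPI_Recv(&pb, sizeof pb, MPI_BYTE, src, tag, comm, &pst);
    int rc = PMPI_Recv(buf, count, dt, pst.MPI_SOURCE, tag, comm, st);
    record_message_edge(pb.nodeId, pb.sendEntryTs);  // add message edge to TAG
    return rc;
}
```

Doubling the message count like this is exactly the intrusion the thesis' hybrid strategy is designed to avoid; the sketch is for exposition only.
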
37. Online parallel application modeling
    Building and maintaining the PTAG
    • Individual TAGs are distributed
    • TAG snapshots are collected over a hierarchical reduction network (TBON)
    • Distributed merge (see the reduction sketch below)
    • Periodic process
    [Figure: individual TAGs → merged groups of TAGs → PTAG]

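Because every metric field combines associatively, per-task profiles can be reduced pairwise at any level of the TBON tree. A sketch reusing the Timer from the metrics slide; this is an assumption about how such a merge could look, not the thesis' algorithm:

```cpp
#include <algorithm>

// Reduce two accumulative timers of structurally equivalent TAG elements;
// the result is as if both tasks' samples had been recorded into one timer,
// so the reduction can run at any TBON node in the tree.
Timer mergeTimers(const Timer &a, const Timer &b) {
    Timer r;
    r.n    = a.n + b.n;
    r.sum  = a.sum + b.sum;
    r.sum2 = a.sum2 + b.sum2;
    r.min  = std::min(a.min, b.min);
    r.max  = std::max(a.max, b.max);
    return r;
}
```
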
38. Online parallel application modeling
    Scalable modeling
    [Figure: PTAG data volume grows with scale: 8 nodes ≈ 250 KB, 1024 nodes ≈ 62 MB, 10240 nodes ≈ 625 MB]
    • Increasing data volume
    • Increasing analysis cost
    • Non-scalable visualization

39. Online parallel application modeling
    Resolving scalability issues
    • Classes of similar tasks
      – E.g. stencil codes, master/worker
    • TAG clustering (one possible fingerprinting scheme is sketched below)
      – Structural equivalence
      – Behavioral equivalence
    • Distributed and scalable TAG merging algorithm

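One simple way to detect structural equivalence, sketched here purely as an assumption (the thesis' clustering algorithm may differ): fingerprint each TAG by hashing its node activities and edge topology, then group tasks with equal fingerprints into a class.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical structural fingerprint over the TaskActivityGraph sketch:
// TAGs with the same activities and the same edge topology collide, so
// e.g. all interior tasks of a 1D stencil fall into one cluster.
uint64_t structuralHash(const TaskActivityGraph &g) {
    uint64_t h = 1469598103934665603ull;                 // FNV-1a offset basis
    auto mix = [&h](uint64_t v) { h ^= v; h *= 1099511628211ull; };
    for (const TagNode &n : g.nodes)
        mix(std::hash<std::string>{}(n.activity));
    for (const TagEdge &e : g.edges) {
        mix(static_cast<uint64_t>(e.from));
        mix(static_cast<uint64_t>(e.to));
    }
    return h;
}
```
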
40. Online parallel application modeling
    Scalable PTAG visualization
    • Example: 1D stencil, 8 nodes

41. Benefits of modeling
    • Facilitates performance understanding
    • Reveals communication and computational patterns and their causal relationships
    • Enables an assortment of online analysis techniques
      – Quick identification of performance bottlenecks and their location
      – Behavioral task clustering
      – Causal relationships permit root-cause analysis
      – Feedback-guided analysis (refinements)

42. Online performance analysis
    [Section divider: Monitoring → Modeling → Analysis pipeline graphic]

43. Online analysis objectives
    • Diagnose the performance on-the-fly
    • Detect relevant performance bottlenecks and their reasons
    • Distinguish problem symptoms from root causes
    • Explain what, where, when, and why
    • Focus on communication and synchronization problems in MPI applications

44. Online performance analysis
    Time-continuous root-cause analysis process
    [Diagram: Monitoring → Modeling → Analysis, with the analysis running in three phases: Phase 1 Problem identification → Phase 2 Problem analysis → Phase 3 Cause-effect analysis]

45. Root-cause analysis
    Phase 1: Problem identification
    • Focus attention on code regions with the biggest potential optimization benefits
    • A potential bottleneck: an individual task activity with a significant amount of execution time
      – A TAG node might correspond to a communication or synchronization problem
      – A TAG edge might be a computation-bound problem

46. Problem identification
    • Rainbow-spectrum TAG coloring
    • Color scale: activity time / max activity time
    [Figure: colored TAG; a CPU-bound activity (~45% of time) and a blocked receive (~42% of time) show up as hot activities, the latter indicating a communication or synchronization problem; cold activities fade out]

47. Problem identification
    TAG ranking process
    • Identify potential bottlenecks for further analysis
    • Periodic ranking in a moving time-window
    • Select top problems by ranking (transcribed into code below):
      – Rank = activity time / task time
      – > 20% for computation activities
      – > 3% for communication activities
    [Figure: TAG snapshot → potential bottlenecks]

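A direct transcription of the ranking rule above; the thresholds are the slide's, while the types and names are made up for illustration:

```cpp
#include <vector>

struct RankedActivity {
    bool   isComputation;  // TAG edge (computation) vs node (communication)
    double activityTime;   // accumulated time inside the moving window
};

// Keep activities whose share of total task time exceeds the thresholds:
// 20% for computation edges, 3% for communication/synchronization nodes.
std::vector<RankedActivity>
selectBottlenecks(const std::vector<RankedActivity> &all, double taskTime) {
    std::vector<RankedActivity> top;
    for (const RankedActivity &a : all) {
        double rank = a.activityTime / taskTime;
        if (rank > (a.isComputation ? 0.20 : 0.03))
            top.push_back(a);
    }
    return top;
}
```
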
48. Root-cause analysis
    Phase 2: In-depth problem analysis
    • For each potential bottleneck, investigate its causes
    • Explore a knowledge-based cause space
    • Focus on the causes that contribute most to the problem time
    • Distinguish task-local problems from inter-task problems
      – Find root causes of task-local problems
        • E.g. CPU-bound computation, local I/O
      – Find symptoms of inter-task problems
        • E.g. blocked receive, barrier

49. In-depth problem analysis
    Performance models for activities
    • Classification of activities
    • Each class has a performance model that divides the activity cost into separate components
    • Each component is a non-exclusive potential cause of the problem

50. In-depth problem analysis
    Model for computational activities
    • A sequential code region modeled by a TAG edge
    • No external knowledge about the computation
    • Determine where the edge-constrained code spends its time
    • Divide the TAG edge into components
      – Functional or basic-block decomposition
    • Apply statistical profiling constrained to an edge
      – Dynamic instrumentation
    • Other metrics
      – Idle time, I/O time, hardware counters

51. In-depth problem analysis
    Model for communication activities
    • Communication cost = synchronization cost + transmission cost
    [Figure: timeline of a send-receive pair (edges e1-e4 around the Send and Receive nodes); the receive's overall communication cost splits into synchronization cost (waiting) and transmission cost]
    • Captures the semantics of well-known synchronization inefficiencies
      – Late sender, wait at barrier, early reduce, etc.

52. In-depth problem analysis
    Model for communication activities
    • Communication cost = synchronization cost + transmission cost
    • Piggyback the send entry timestamp (e1)
    • Accumulate the synchronization cost per message edge (see the sketch below)
    • Captures the semantics of well-known synchronization inefficiencies
      – Late sender, wait at barrier, early reduce, etc.

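With the send-entry timestamp piggybacked as in the earlier interception sketch, the receiver can split each receive's cost into the two components named above. A hedged sketch reusing the Timer and Piggyback types from the previous sketches; it assumes clocks across tasks are comparable (e.g. offset-corrected), which the code does not show:

```cpp
#include <algorithm>

// Split one receive's cost: if the sender entered its send after we
// posted the receive (late sender), that gap is synchronization cost;
// the remainder of the time spent inside the receive is transmission cost.
void accountReceiveCosts(Timer &syncCost, Timer &transCost,
                         const Piggyback &pb,
                         double recvEntryTs, double recvExitTs) {
    double sync  = std::max(0.0, pb.sendEntryTs - recvEntryTs);
    double total = recvExitTs - recvEntryTs;   // overall communication cost
    syncCost.record(sync);
    transCost.record(std::max(0.0, total - sync));
}
```
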
53. In-depth problem analysis
    Example receive activity break-down
    • Requires inter-task cause-effect analysis

54. Root-cause analysis
    Phase 3: Cause-effect analysis
    • Explain the causes of synchronization inefficiencies
      – Why is the sender late?
    • Correlate problems into cause-effect chains
    • Distinguish root causes of inefficiencies from their causal propagation (symptoms)
    • Pinpoint problems in non-dominant code regions
    • Improve the feedback provided to application developers

55. Cause-effect analysis
    Causal propagation
    [Timeline figure, tasks A, B, C: ComputationA delays Send1 in task A, so task B waits at Receive1 (WT1 = Inefficiency 1, message m0); ComputationB then delays Send2 in task B, so task C waits at Receive2 (WT2 = Inefficiency 2, message m1)]
    [Causal chains: ComputationA (Task A) causes Late Sender (Task A), which causes Inefficiency 1 (Task B); ComputationB (Task B) causes Late Sender (Task B), which causes Inefficiency 2 (Task C)]

56. Cause-effect analysis
    Explaining problem causes
    • Derive the causes of waiting time between two nodes from the differences between their execution paths
      – Online adaptation of the wait-time analysis approach by Meira et al.
      – Based on the PTAG model, not a full trace
    • Explain synchronization inefficiencies by means of other activities
      – Identify the corresponding execution paths in the PTAG model
      – Compare the paths
      – Build a causal tree with explanations
      – Merge the trees of individual problems

57. Cause-effect analysis
    Execution path comparison
    [Figure: path q (Task 1) and path p (Task 2) through the PTAG are compared edge by edge (e1, e2, ...). Resulting causal tree: the inefficiency at MPI_Recv (Task 1, waiting time 138.4 s) is caused by a Late Sender (Task 2), explained 91.9% by computation edge e3 and 7.7% by computation edge e2 (Task 2): the root causes]

58. Benefits of RCA
    • Systematic approach to online performance analysis
    • Quick identification of problems as they manifest at runtime (without a trace)
    • Causal correlation of different problems
    • Discovery of the root causes of synchronization inefficiencies

59. Experimental evaluation

60. Prototype tool
    • Implemented in C++
    • DynInst 5.1
    • MRNet 1.2
    • OpenMPI 1.2.x
    • Linux platforms
      – x86
      – IA-64 (Itanium)
      – PowerPC 32/64
    [Architecture figure: a global analyzer at the root, a tree of MRNet comm nodes, and one dmad daemon attached to each MPI task at the leaves]

61. Experimental environment
    UAB cluster:
    • x86/Linux
    • 32 nodes
    • Intel Pentium IV 3 GHz
    • Linux FC4
    • Gigabit Ethernet
    BSC MareNostrum:
    • PowerPC-64/Linux
    • 512 nodes (restricted)
    • PowerPC 2.3 GHz dual core
    • SUSE Linux Enterprise Server 9
    • Myrinet

62. Modeling MPI applications
    • Experiences with different classes of MPI codes
      – SPMD codes
        • WaveSend: 1D stencil, concurrent wave equation
        • NAS Parallel Benchmarks: 2D stencils
        • SMG2000: 3D stencil, multigrid solver
      – Master/Worker
        • XFire: forest fire propagation simulator
    + Demonstrated the ability to model arbitrary MPI code with low overhead
    + Best with regular codes
    – Limitations with recursive codes

63. Case study #1: Modeling SPMD
    Integer Sort (IS) NAS Parallel Benchmark
    • Large integer sort used in "particle method" codes
    • Tests both integer computation speed and communication performance
    • Mostly collective communication
    • We extract the PTAG to understand the application's communication patterns and behavior

64. Case study #2: Master/Worker
    Forest Fire Propagation Simulator (XFire)
    • Calculates the expansion of the fireline
    • Computationally intensive code that exploits data parallelism
    • We extract and cluster the PTAG

65. Evaluation of overheads
    Sources of overheads
    • Offline startup
      – Less than 20 seconds per 1 MB of executable
      – A function of program size
    • Online TAG construction
      – 4-20 μs per instrumented call (*)
      – Depends on the number of instrumented calls and loops
    • Online TAG sampling
      – 40-50 μs per snapshot (256 KB)
      – Depends on the program structure size and the number of communication links
    (*) Experiments conducted on the UAB cluster

66. Evaluation of overheads
    [Chart: NAS LU overhead in seconds and in percent for 16 to 512 CPUs; the relative overhead stays between roughly 1.26% and 1.91% across all scales]

67. Case study #3: SPMD analysis
    WaveSend application
    • Parallel calculation of a vibrating string over time
    • Wave equation, block decomposition
    • P2P communication to exchange boundary points with nearest neighbors
    • Synthetic performance problems

68. Case study #3: SPMD analysis
    WaveSend PTAG after execution
    [Figure: PTAG of the complete WaveSend run]

69. Case study #3: SPMD analysis
    CPU-bound problem at task 7
    [Figure: PTAG after 30 seconds of execution]

70. Case study #3: SPMD analysis
    Potential bottlenecks
    • Task 0 findings: 35.4% CPU-bound in edge 8→6
    • Task 1 findings: 33% CPU-bound in edge 11→6
    • Task 6 findings: 32.1% CPU-bound in edge 11→6
    • Task 7 findings: 50.5% CPU-bound in edge 8→6

71. Case study #3: SPMD analysis
    Potential bottlenecks
    • Task 0 findings: 21.4% blocked receive caused by a late sender from task 1
    • Task 1 findings: 19.1% blocked receive caused by a late sender from task 2
    • Task 6 findings: 19.2% blocked receive caused by a late sender from task 7

72. Case study #3: SPMD analysis
    Cause-effect analysis
    [Figure: causal tree correlating the blocked-receive inefficiencies across tasks]

73. Case study #3: SPMD analysis
    Analysis results
    • Load imbalance found
    • Multiple instances of the late-sender problem
    • Causal propagation of inefficiencies
    • Root cause found in task 7: an imbalanced computational edge

74. Conclusions and future work

75. Conclusions
    • A novel approach to online performance modeling
      – Discovers high-level application structure and runtime behavior
      – A hybrid technique that combines static code analysis with runtime monitoring to extract performance knowledge
      – Scalable to 1000+ processors
    • An automated online performance analysis approach
      – Enables quick detection of performance bottlenecks
      – Focuses on explaining the sources of communication and synchronization inefficiencies
      – Correlates different problems and identifies their root causes
    • A prototype tool that models and analyzes MPI applications at runtime

76. Future work
    • Modeling
      – Support for other classes of activities (I/O, MPI RMA)
      – OpenMP applications
      – Support for recursive codes
      – Multi-experiment support
    • Analysis
      – More accurate cause-effect analysis with causal paths
      – Evaluation of the scalability of the analysis on large-scale HPC systems
      – Actionable recommendations
      – Integration with the MATE automatic tuning framework

77. Online performance modeling and analysis of message-passing parallel applications
    Thank You
    PhD Thesis, Oleg Morajko
    Universitat Autònoma de Barcelona

