Online performance modeling and analysis of message-passing parallel applications


Although hardware is evolving at an incredible rate, advances in parallel software have been hampered for many reasons, and developing an efficient parallel
application is still not an easy task. Our thesis is that many performance problems and their causes can be quickly located and explained with automated techniques that work on unmodified parallel applications. This work identifies the main obstacles to such diagnosis and presents a two-step approach for addressing them: the application is automatically modeled and then diagnosed during its execution.

First, we introduce an online performance modeling technique that enables automated discovery of causal execution flows through the communication and computational activities of message-passing parallel programs. Second, we present a systematic approach to online performance analysis. The automated
analysis uses the online model to quickly identify the most important performance problems
and correlate them with application source code. Our technique is able to discover causal
dependencies between the problems, infer their root causes in some scenarios, and explain
them to developers. In this work we focus on diagnosing scientific MPI parallel applications and their communication and computational problems, although the approach can be extended to support other classes of activities and programming models.

We have evaluated our approach on a variety of scientific parallel applications. In all scenarios, our online performance modeling technique proved effective for capturing program behavior with low overhead and facilitated performance understanding. With our automated, model-based performance analysis approach, we were able to identify the most severe performance problems during application execution and locate their root causes without prior knowledge of application internals.


Transcript

  • 1. Online performance modeling and analysis of message-passing parallel applications – PhD Thesis, Oleg Morajko, Universitat Autònoma de Barcelona, 2008 [title slide; figure callouts: “Delayed receive”, “Long local calculations”]
  • 2. Motivation • Parallel system hardware is evolving at an incredible rate • Contemporary HPC systems – Top500 ranging from 1,000 to 200,000+ processors (June 2008) – Take BSC MareNostrum: 10K processors • Whole industry is shifting to parallel computing 2
  • 3. Motivation • Challenges of developing large-scale scientific software – Evolution of programming models is much slower – Hard to achieve good efficiency – Hard to achieve scalability • The parallel applications rarely achieve good performance immediately MPI 3
  • 4. Motivation • Challenges of developing large-scale scientific software – Evolution of programming models is much slower – Hard to achieve good efficiency – Hard to achieve scalability • The parallel applications rarely achieve good performance immediately Careful performance analysis and optimization tasks are crucial 4
  • 5. Motivation • Quickly finding performance problems and their reasons is hard • Requires thorough understanding of the program’s behavior – Parallel algorithm, domain decomposition, communication, synchronization • Large scale brings additional complexities – Large data volume, excessive analysis cost • Existing tools support finding what happens, where, and when – Locating root causes of problems still manual – Tools expose scalability limitations (E.g. tracing) • Problem diagnosis still requires substantial time and effort of highly-skilled professionals 5
  • 6. Our goals • Analyze the performance of parallel applications • Detect bottlenecks and explain their causes – Focus on communication and synchronization in message-passing programs • Automate the approach to the extent possible • Scalable to thousands of nodes • Online approach without trace files 6
  • 7. Contributions • A systematic approach for automated diagnosis of application performance – Application is monitored, modeled and diagnosed during its execution • Scalable modeling technique that generates performance knowledge about application behavior • Analysis technique that diagnoses MPI applications running in large-scale parallel systems – Detects performance bottlenecks on-the-fly – Finds root causes • Prototype tool to demonstrate the ideas 7
  • 8. Outline 1. Overview of approaches 2. Online performance modeling 3. Online performance analysis 4. Experimental evaluation 5. Conclusions and future work 8
  • 9. Overview of approaches 9
  • 10. Classical performance analysis [workflow diagram: Develop → Compile → Instrument → Execute → Trace files → Analyze trace (visualization tool) → Performance problems → Find solutions → Code changes → Develop] 10
  • 11. Classical performance analysis Drawbacks • Manual task of experimental nature • Time consuming • High degree of expertise required • Full trace – excessive volume of information • Poor scalability 11
  • 12. Automated offline analysis [workflow diagram: Develop → Compile → Instrument → Execute → Trace files → Analyze trace with automated tools (KappaPI, EXPERT) → Performance problems → Find solutions → Code changes → Develop] 12
  • 13. Automated offline analysis Drawbacks • Post-mortem • Addresses only well-known problems • Not fully explored capabilities to find root causes 13
  • 14. Automated online analysis [workflow diagram: Develop → Compile → Instrument → Execute with online monitoring (what, where, when) and diagnosis (Paradyn) → Performance problems → Find solutions → Code changes → Develop] 14
  • 15. Automated online analysis Paradyn advantages: • Locates problems while the app runs • Automated problem-space search – Functional decomposition – Refinable measurements • Scalable. Paradyn drawbacks: • Addresses lower-level problems (profiler) • No search for root causes of problems 15
  • 16. Automated online analysis Our approach [workflow diagram: Develop → Compile → Execute; Monitoring consumes events, Modeling builds the model, Analysis observes the model and refines the monitoring → Problems and causes → Find solutions → Code changes → Develop] 16
  • 17. Automated online analysis Key characteristics • Discovers application model on-the-fly – Model execution flows, not modules/functions – Lossy trace compression • Runtime analysis based on continuous model observation • Automatically locates problems while app runs • Search for root-causes of problems 17
  • 18. Monitoring Modeling Analysis Online performance modeling 18
  • 19. Modeling objectives • Enable high-level understanding of application performance • Reflect parallel application structure and runtime behavior • Maintain tradeoff between volume of collected data and level of preserved details – Communication and computational patterns – Causality of events • Base for online performance analysis 19
  • 20. Online performance modeling • Novel application performance modeling approach • Combines static code analysis with runtime monitoring to extract performance knowledge • Three step approach: – Modeling individual tasks – Modeling inter-task communication – Modeling entire application 20
  • 21. Modeling individual tasks • We decompose execution into units that correspond to different activities: – Communication activities (E.g. MPI_Send, MPI_Gather) – Computation activities (E.g. calc_gauss) – Control activities (E.g. program start/termination) – Others (E.g. I/O) • We capture execution flow through these activities using a directed graph called Task Activity Graph (TAG): – Nodes model communication activities and loops – Edges represent sequential flow of execution (computation activities) – Nodes and edges maintain happens-before relationship 21
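The TAG described on this slide can be illustrated with a small data structure. The sketch below is purely illustrative (the class names, the deduplication key, and the `record` API are invented, not the thesis's C++ implementation); it shows how repeated executions of the same activity map onto a single node, which is what makes the model a lossy compression of the event trace.

```python
# Illustrative Task Activity Graph (TAG) for one MPI task.
# All names and the (activity, location) dedup key are assumptions.

class TagNode:
    def __init__(self, node_id, activity, location):
        self.node_id = node_id        # unique id within the task
        self.activity = activity      # e.g. "MPI_Send", "MPI_Recv", "loop"
        self.location = location      # source-code location, e.g. "wave.c:120"
        self.out_edges = {}           # successor node_id -> TagEdge

class TagEdge:
    """Sequential execution flow (computation) between two activities."""
    def __init__(self, src, dst):
        self.src, self.dst = src, dst

class Tag:
    def __init__(self):
        self.nodes = {}               # (activity, location) -> TagNode
        self.last = None              # last executed node: happens-before order

    def record(self, activity, location):
        """Called from instrumentation whenever an activity executes."""
        key = (activity, location)
        node = self.nodes.get(key)
        if node is None:
            node = self.nodes[key] = TagNode(len(self.nodes), activity, location)
        # Edge from the previously executed activity preserves flow order.
        if self.last is not None and node.node_id not in self.last.out_edges:
            self.last.out_edges[node.node_id] = TagEdge(self.last, node)
        self.last = node
        return node
```

Because each (activity, source location) pair maps to a single node, a loop executing millions of send/receive iterations still produces a fixed-size graph; only the profiles attached to nodes and edges keep accumulating.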
  • 22. Modeling individual tasks Task Activity Graph (TAG) reflects program structure by modeling executed flow of activities 22
  • 23. Modeling individual tasks • Each activity corresponds to a particular location in the source code 23
  • 24. Modeling individual tasks • Runtime behavior of activities is described by adding performance metrics to nodes and edges • Data aggregated into statistical execution profiles Edge counter & accumulative timer {min, max, stddev} Node accumulative timer {min, max, stddev} 24
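The statistical execution profiles on this slide ({min, max, stddev} plus accumulative timers) need only a handful of scalars per node or edge: a count, a sum, a sum of squares, and the extremes. A hedged sketch (field names are mine) of how mean and standard deviation are recoverable without storing individual samples:

```python
import math

class Metric:
    """Accumulative timer/counter profile as kept on TAG nodes and edges.
    Stores {count, sum, sum2, min, max}; sum2 enables stddev on demand."""
    def __init__(self):
        self.count = 0
        self.sum = 0.0
        self.sum2 = 0.0               # sum of squared samples
        self.min = math.inf
        self.max = -math.inf

    def add(self, value):
        self.count += 1
        self.sum += value
        self.sum2 += value * value
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    def mean(self):
        return self.sum / self.count

    def stddev(self):
        m = self.mean()
        # Population stddev from the running moments; clamp against
        # tiny negative values caused by floating-point rounding.
        return math.sqrt(max(self.sum2 / self.count - m * m, 0.0))
```

This keeps the per-activity storage constant regardless of how many times the activity executes, which is what allows the model to stay small at runtime.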
  • 25. Modeling communication • Message edges capture matching send-receive links – P2P, Collective • Completion edges capture non-blocking semantics • Performance metrics describe runtime behavior 25
  • 26. Modeling parallel application • Individual TAG models connected by message edges form a Parallel-TAG model (PTAG) 26
  • 27. Modeling techniques We developed a set of techniques to automatically construct and exploit the PTAG model at runtime • Targeted to parallel scientific applications • Focus on modeling MPI applications • But extensible to other programming paradigms • Low-overhead • Scalable to 1000+ nodes 27
  • 28. Online PTAG construction [diagram: MPI Tasks 1…N → Modelers 1…N → TBON Nodes 1, 2, … → Front-end; steps: 1 instrument, 2 build, 3 sample at the tasks and modelers, 4 update into the TBON, 5 merge, 6 update toward the front-end, 7 analyze at the front-end] 28
  • 29. Building individual TAG [diagram: the Modeler 1 analyzes the executable and 2 instruments the MPI task; the RT Library 3 captures events and 4 updates the TAG in shared memory; the Modeler 5 samples the model and 6 sends updates] 29
  • 30. Building individual TAG Offline program analysis • Parse binary executable • Find target functions • Detect relevant loops Modeler 1 analyze executable shared memory MPI Task RT Library 30
  • 31. Building individual TAG Dynamic instrumentation • Instrument all target functions: – Record events – Collect performance metrics – Invoke TAG update • Refinable at runtime Modeler 2 instrument shared memory MPI Task RT Library 31
  • 32. Building individual TAG Performance metrics • Counters • Timers {sum, sum2, min, max} • Histograms • Compound metrics [figure: counters cnt1…cnt5 and timers t1…t4 attached to instrumentation points in the MPI task; the Modeler reads them via shared memory] 32
  • 33. Building individual TAG Runtime modeling • Process generated events • Walk the stack to capture program location (call path) • Update TAG incrementally Modeler shared memory capture MPI Task events RT Library 3 4 update 33
  • 34. Building individual TAG Model sampling • Goal: examine model at runtime • Read model from shared memory • Sampling is periodic • Lock-free synchronization Modeler 5 sample shared memory MPI Task RT Library 34
  • 35. Online communication modeling How to model inter-task communication? • Intercept MPI communication calls (nodes) • Match sender nodes with receiver nodes • Add message edges to the TAG models 35
  • 36. Online communication modeling • Requires tracking of individual messages transmitted from sender to receiver(s) at runtime • Achieved by propagating piggyback data over every transmitted MPI message • Transmit node id from sender to receiver(s) • P2P/Blocking/Non-blocking/Collective • Optimized hybrid strategy to minimize intrusion • Store references to sender’s nodes at receiver’s TAG 36
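The piggyback mechanism on this slide can be mimicked with a toy in-process channel: the sender attaches its TAG node id (and, for the later cost analysis, a timestamp) to each message, and the receiver stores a message edge referencing the sender's node. Everything below — the channel class, the function and parameter names — is an invented stand-in for the real MPI-level piggybacking, which rides on actual MPI messages.

```python
from collections import deque, defaultdict

class PiggybackChannel:
    """Toy stand-in for attaching piggyback data to every MPI message."""
    def __init__(self):
        self.queues = defaultdict(deque)   # (src, dst, tag) -> payload queue

    def send(self, src, dst, tag, payload, sender_node_id, send_ts):
        # The real system transmits (sender_node_id, send_ts) alongside
        # the application payload; here we just enqueue them together.
        self.queues[(src, dst, tag)].append((payload, sender_node_id, send_ts))

    def recv(self, src, dst, tag):
        return self.queues[(src, dst, tag)].popleft()

message_edges = []                         # (sender_node_id, receiver_node_id)

def on_receive(chan, src, dst, tag, receiver_node_id):
    """Receiver-side interception: unpack piggyback data and record the
    matching message edge in the receiver's TAG."""
    payload, sender_node, send_ts = chan.recv(src, dst, tag)
    message_edges.append((sender_node, receiver_node_id))
    return payload
```

The key property shown: matching happens online, message by message, so no global trace of sends and receives has to be kept and matched post-mortem.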
  • 37. Online parallel application modeling Building and maintaining PTAG • Individual TAGs are distributed • Collect TAG snapshots via a hierarchical reduction network (TBON) • Distributed merge • Periodic process [figure: individual TAGs → merged groups of TAGs → PTAG] 37
  • 38. Online parallel application modeling Scalable modeling [figure: model data volume grows with scale – 8 nodes: 250 KB, 1,024 nodes: 62 MB, 10,240 nodes: 625 MB] • Increasing data volume • Increasing analysis cost • Non-scalable visualization 38
  • 39. Online parallel application modeling Resolving scalability issues • Classes of similar tasks – E.g. stencil codes, M/W • TAG clustering – Structural equivalence – Behavioral equivalence • Distributed and scalable TAG merging algorithm 39
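The structural-equivalence clustering mentioned on this slide can be sketched by reducing each task's TAG to a canonical signature; tasks with identical signatures (e.g. all workers of a master/worker code, or all interior tasks of a stencil) collapse into one cluster. The dictionary-based TAG representation and function names below are illustrative assumptions, not the thesis's merging algorithm.

```python
def tag_signature(tag_nodes):
    """Canonical key for structural equivalence: two tasks whose TAGs have
    the same activities at the same source locations, wired the same way,
    get the same signature. Each node is a dict with 'activity',
    'location', and 'successors' (list of successor indices)."""
    return tuple(sorted(
        (n["activity"], n["location"], tuple(sorted(n["successors"])))
        for n in tag_nodes))

def cluster_tags(tags):
    """Group task ids by TAG signature; tags maps task id -> node list."""
    clusters = {}
    for task_id, nodes in tags.items():
        clusters.setdefault(tag_signature(nodes), []).append(task_id)
    return list(clusters.values())
```

Behavioral equivalence would additionally compare the performance profiles attached to the nodes (e.g. similar time distributions), splitting a structural cluster when some members behave differently.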
  • 40. Online parallel application modeling Scalable PTAG visualization • Example: 1D stencil, 8 nodes 40
  • 41. Benefits of modeling • Facilitates performance understanding • Reveals communication and computational patterns and their causal relationships • Enables an assortment of online analysis techniques – Quick identification of performance bottlenecks and their location – Behavioral task clustering – Causal relationships permit root-cause analysis – Feedback-guided analysis (refinements) 41
  • 42. Monitoring Modeling Analysis Online performance analysis 42
  • 43. Online analysis objectives • Diagnose the performance on-the-fly • Detect relevant performance bottlenecks and their reasons • Distinguish problem symptoms from root causes • Explain what, where, when and why • Focus on communication and synchronization problems in MPI applications 43
  • 44. Online performance analysis Time-continuous root-cause analysis process: Monitoring → Modeling → Analysis; Phase 1 problem identification → Phase 2 problem analysis → Phase 3 cause-effect analysis 44
  • 45. Root-cause analysis [phases: problem identification → problem analysis → cause-effect analysis] Phase 1: Problem identification • Focus attention on code regions with the biggest potential optimization benefits • A potential bottleneck – an individual task activity with a significant amount of execution time • A TAG node might correspond to a communication or synchronization problem • A TAG edge might be a computation-bound problem 45
  • 46. Problem identification [phases: problem identification → problem analysis → cause-effect analysis] [figure: rainbow-spectrum TAG coloring by activity time / max activity time – hot vs. cold activities; e.g. a CPU-bound activity at ~45% time, and a blocked receive at ~42% time indicating a communication or synchronization problem] 46
  • 47. Problem identification [phases: problem identification → problem analysis → cause-effect analysis] TAG ranking process • Identify potential bottlenecks for further analysis • Periodic ranking in a moving time-window • Select top problems by ranking: Rank = activity time / task time; > 20% for computation activities, > 3% for communication activities [figure: TAG snapshot → potential bottlenecks] 47
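The ranking rule on this slide — rank = activity time / task time, with a 20% threshold for computation edges and a 3% threshold for communication nodes — can be written directly. A minimal sketch; the tuple-based activity representation is an assumption made for clarity:

```python
def rank_bottlenecks(activities, task_time):
    """Select potential bottlenecks from a TAG snapshot.

    `activities` is a list of (name, kind, time) tuples where kind is
    "computation" (TAG edge) or "communication" (TAG node). Thresholds
    follow the slide: > 20% of task time for computation activities,
    > 3% for communication activities.
    """
    thresholds = {"computation": 0.20, "communication": 0.03}
    ranked = []
    for name, kind, t in activities:
        rank = t / task_time
        if rank > thresholds[kind]:
            ranked.append((name, kind, rank))
    # Highest-ranked problems first.
    ranked.sort(key=lambda x: x[2], reverse=True)
    return ranked
```

The asymmetric thresholds reflect that even a few percent of total time spent blocked in communication usually signals an inefficiency, while computation legitimately dominates task time.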
  • 48. Root-cause analysis [phases: problem identification → problem analysis → cause-effect analysis] Phase 2: In-depth problem analysis • For each potential bottleneck, investigate its causes • Explore the knowledge-based cause space • Focus on causes that contribute most to the problem time • Distinguish task-local problems from inter-task problems – Find root-causes of task-local problems • E.g. CPU-bound computation, local I/O – Find symptoms of inter-task problems • E.g. blocked receive, barrier 48
  • 49. In-depth problem analysis [phases: problem identification → problem analysis → cause-effect analysis] Performance models for activities • Classification of activities • Each class has a performance model that divides the activity cost into separate components • Each component is a non-exclusive potential cause of the problem 49
  • 50. In-depth problem analysis [phases: problem identification → problem analysis → cause-effect analysis] Model for computational activities • Sequential code region modeled by a TAG edge • No external knowledge about the computation • Determine where edge-constrained code spends time • Divide the TAG edge into components – Functional or basic-block decomposition • Apply statistical profiling constrained to an edge – Dynamic instrumentation • Other metrics – Idle time, I/O time, hardware counters 50
  • 51. In-depth problem analysis [phases: problem identification → problem analysis → cause-effect analysis] Model for communication activities: Communication cost = Synchronization cost + Transmission cost [figure: timeline of sender (Send, events e1–e3) and receiver (Receive, events e2–e4) splitting the overall communication cost into synchronization cost and transmission cost] • Captures semantics of well-known synchronization inefficiencies – Late sender, wait at barrier, early reduce, etc. 51
  • 52. In-depth problem analysis [phases: problem identification → problem analysis → cause-effect analysis] Model for communication activities: Communication cost = Synchronization cost + Transmission cost • Piggyback the send entry timestamp (e1) • Accumulate synchronization cost per message edge [figure: same sender/receiver timeline as slide 51] • Captures semantics of well-known synchronization inefficiencies – Late sender, wait at barrier, early reduce, etc. 52
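With the sender's entry timestamp piggybacked on the message, the receiver can split its blocking cost into a synchronization component (late sender) and a transmission component. A minimal sketch, assuming synchronized clocks; the function and parameter names are invented for illustration:

```python
def split_communication_cost(recv_enter, recv_exit, send_enter):
    """Split a blocking receive's cost into (synchronization, transmission).

    Late-sender synchronization cost is the part of the receiver's
    blocking time spent waiting for the sender to even enter its send
    call; the remainder is attributed to message transmission. The sync
    component is clamped to [0, total] so an early sender yields zero.
    """
    total = recv_exit - recv_enter
    sync = min(max(send_enter - recv_enter, 0.0), total)
    transmission = total - sync
    return sync, transmission
```

For example, a receiver blocked from t=10 to t=20 whose matching sender only entered MPI_Send at t=18 spent 8 of those 10 seconds on synchronization (late sender) and 2 on transmission.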
  • 53. In-depth problem analysis [phases: problem identification → problem analysis → cause-effect analysis] Example receive activity break-down [figure] – requires inter-task cause-effect analysis 53
  • 54. Root-cause analysis [phases: problem identification → problem analysis → cause-effect analysis] Phase 3: Cause-effect analysis • Explain causes of synchronization inefficiencies – Why is the sender late? • Correlate problems into cause-effect chains • Distinguish root-causes of inefficiencies from their causal propagation (symptoms) • Pinpoint problems in non-dominant code regions • Improve the feedback provided to application developers 54
  • 55. Cause-effect analysis [phases: problem identification → problem analysis → cause-effect analysis] Causal propagation [figure: timeline of tasks A, B, C – ComputationA (task A) causes Late Sender (task A), which causes Inefficiency 1 (task B); ComputationB (task B) causes Late Sender (task B), which causes Inefficiency 2 (task C); messages m0 (A→B) and m1 (B→C), waiting times WT1 and WT2] 55
  • 56. Cause-effect analysis [phases: problem identification → problem analysis → cause-effect analysis] Explaining problem causes • Causes of waiting time between two nodes are the differences between their execution paths – Online adaptation of the Wait-Time Analysis approach by Meira et al. – Based on the PTAG model, not a full trace • Explain synchronization inefficiencies by means of other activities – Identify corresponding execution paths in the PTAG model – Compare the paths – Build a causal tree with explanations – Merge the trees of individual problems 56
  • 57. Cause-effect analysis [phases: problem identification → problem analysis → cause-effect analysis] Execution path comparison [figure: an inefficiency at MPI_Recv in task 1 with 138.4 sec waiting time, caused by a Late Sender in task 2; comparing path q (task 1) with path p (task 2) over edges e1…e7 attributes the root cause 91.9% to computation edge e3 (task 2) and 7.7% to computation edge e2 (task 2)] 57
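The path-comparison step can be sketched as follows: time the sender spent on activities that the receiver did not spend (since the last common synchronization point) explains the waiting time, attributed proportionally across those excess activities. This is a loose, illustrative rendering of the Meira-style wait-time idea over PTAG paths, not the thesis's exact algorithm; the dict-based path representation is an assumption.

```python
def explain_waiting_time(waiting, receiver_path, sender_path):
    """Attribute a late-sender waiting time to sender-path activities.

    Paths map activity name -> time spent since the last common
    synchronization point. Activities where the sender spent more time
    than the receiver are candidate causes; the waiting time is shared
    among them proportionally to their excess. Returns (activity,
    attributed-time) pairs, largest contribution first.
    """
    excess = {a: t - receiver_path.get(a, 0.0)
              for a, t in sender_path.items()
              if t > receiver_path.get(a, 0.0)}
    total = sum(excess.values())
    if total == 0:
        return []
    return sorted(((a, waiting * t / total) for a, t in excess.items()),
                  key=lambda x: x[1], reverse=True)
```

Activities common to both paths cancel out, so the explanation automatically points at the imbalanced computation rather than at code both tasks execute equally.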
  • 58. Benefits of RCA • Systematic approach to online performance analysis • Quick identification of problems as they manifest at runtime (without trace) • Causal correlation of different problems • Discovery of root-causes of synchronization inefficiencies 58
  • 59. Experimental evaluation 59
  • 60. Prototype tool • Implemented in C++ • DynInst 5.1 • MRNet 1.2 • OpenMPI 1.2.x • Linux platforms – x86 – IA-64 (Itanium) – PowerPC 32/64 [architecture diagram: a global analyzer at the front-end, a tree of MRNet comm nodes, and a dmad daemon attached to each MPI task] 60
  • 61. Experimental environment UAB cluster: x86/Linux, 32 nodes, Intel Pentium IV 3 GHz, Linux FC4, Gigabit Ethernet. BSC MareNostrum: PowerPC-64/Linux, 512 nodes (restricted), PowerPC 2.3 GHz dual-core, SUSE Linux Enterprise Server 9, Myrinet. 61
  • 62. Modeling MPI applications • Experiences with different classes of MPI codes – SPMD codes • WaveSend – 1D stencil, concurrent wave equation • NAS Parallel Benchmarks – 2D stencils • SMG2000 – 3D stencil, multigrid solver – Master/Worker • XFire – forest fire propagation simulator + Demonstrated ability to model arbitrary MPI code with low-overhead + Best with regular codes – Limitations with recursive codes 62
  • 63. Case study #1: Modeling SPMD Integer sort (IS) NAS Parallel Benchmark • Large integer sort used in “particle method” codes • Tests both integer computation speed and communication performance • Mostly collective communication • We extract PTAG to understand application communication patterns and behavior 63
  • 64. Case study #2: Master/Worker Forest Fire Propagation Simulator (XFire) • Calculates the expansion of the fireline • Computationally intensive code, exploits data parallelism • We extract and cluster PTAG 64
  • 65. Evaluation of overheads Sources of overheads • Offline startup – Less than 20 seconds per 1 MB of executable – Scales with program size • Online TAG construction – 4-20 μs per instrumented call (*) – Depends on the number of instrumented calls and loops • Online TAG sampling – 40-50 μs per snapshot (256 KB) – Depends on program structure size and the number of communication links (*) Experiments conducted on the UAB cluster 65
  • 66. Evaluation of overheads [chart: NAS LU overheads for 16, 32, 64, 128, 256, and 512 CPUs, showing absolute overhead in seconds and relative overhead in percent; relative overhead stays under 2% (roughly 1.3%–1.9%) across all node counts] 66
  • 67. Case study #3: SPMD analysis WaveSend application • Parallel calculations of vibrating string over time • Wave equation, block-decomposition • P2P communication to exchange boundary points with nearest neighbors • Synthetic performance problems 67
  • 68. Case study #3: SPMD analysis WaveSend PTAG After execution 68
  • 69. Case study #3: SPMD analysis CPU-bound problem at task 7 PTAG after 30 seconds of execution 69
  • 70. Case study #3: SPMD analysis Potential bottlenecks Task 0 findings: 35.4% CPU-bound in edge 8→6 Task 1 findings: 33% CPU-bound in edge 11→6 Task 6 findings: 32.1% CPU-bound in edge 11→6 Task 7 findings: 50.5% CPU-bound in edge 8→6 70
  • 71. Case study #3: SPMD analysis Potential bottlenecks Task 0 findings: 21.4% blocked receive caused by late sender from task 1 Task 1 findings: 19.1% blocked receive caused by late sender from task 2 Task 6 findings: 19.2% blocked receive caused by late sender from task 7 71
  • 72. Case study #3: SPMD analysis Cause-effect analysis 72
  • 73. Case study #3: SPMD analysis Analysis results • Load imbalance found • Multiple instances of late-sender problem • Causal propagation of inefficiencies • Root-cause found in task 7 as an imbalanced computational edge 73
  • 74. Conclusions and future work 74
  • 75. Conclusions • A novel approach for online performance modeling – Discovers high-level application structure and runtime behavior – A hybrid technique that combines static code analysis with runtime monitoring to extract performance knowledge – Scalable to 1000+ processors • An automated online performance analysis approach – Enables quick detection of performance bottlenecks – Focuses on explaining sources of communication and synchronization inefficiencies – Correlates different problems and identifies their root causes • A prototype tool that models and analyzes MPI applications at runtime 75
  • 76. Future work • Modeling – Support for other classes of activities (I/O, MPI RMA) – OpenMP applications – Support for recursive codes – Multi-experiment support • Analysis – More accurate cause-effect analysis with causal paths – Evaluation of scalability of analysis in large-scale HPC – Actionable recommendations – Integration with automatic tuning framework (MATE) 76
  • 77. Online performance modeling and analysis of message-passing parallel applications Thank You PhD Thesis, Oleg Morajko Universitat Autònoma de Barcelona 77