Overview: Event Based Program Analysis

1. Performance Analysis using the Vampir Toolchain
   Robert Henschel (HPA-IU), David Cronk (CS-UTK), Thomas William (PSW-ZIH)
2. Overview
   Morning Session (Innovation Center, Room 105)
   • 09:00 – 10:15 Overview: Event Based Program Analysis
   • 10:15 – 10:45 Break
   • 10:45 – 11:45 Instrumentation and Runtime Measurement
   • 11:45 – 13:00 Lunch break
   Afternoon Session
   • 13:00 – 13:45 Using PAPI Performance Counters
   • 13:45 – 14:00 Break
   • 14:00 – 15:00 Trace Visualization
   • 15:00 – 15:30 Break
   • 15:30 – 18:00 Hands On (Wrubel Computing Center, Building WCC, Room 107)
3. TU DRESDEN, ZIH, AND HPC
   We do have computers in Germany too (although quite old ones).
4. Dresden University of Technology
   • Founded in 1828
   • One of the oldest technical universities in Germany
   • 14 faculties and a number of specialized institutes
   • More than 35,000 students, about 4,000 employees, 438 professors
   • International courses of study, bachelor's and master's degrees
   • One of the largest faculties for computer science in Germany
   • 110 million Euro annual third-party funding
   • http://tu-dresden.de
5. Center for Information Services and HPC (ZIH)
   • Central Scientific Unit at TU Dresden
   • Competence Center for "Parallel Computing and Software Tools"
   • Strong commitment to supporting real users
   • Development of algorithms and methods: cooperation with users from all departments
   • Providing infrastructure and qualified services for TU Dresden and Saxony
6. Structure of ZIH
   • Management
     – Director: Prof. Dr. Wolfgang E. Nagel
     – Assistant directors: Dr. Peter Fischer (COO), Dr. Matthias S. Müller (CTO)
   • Administration (7 employees)
   • Departments (ca. 100 employees, incl. trainees)
     – Department of interdisciplinary function support and coordination (IAK)
     – Department of networking and communication services (NK)
     – Department of central systems and services (ZSD)
     – Department of innovative methods of computing (IMC)
     – Department of programming and software tool-kits (PSW)
     – Department of distributed and data intensive computing (VDR)
7. Today's Main HPC Infrastructure
   [Diagram: HPC component with 6.5 TB main memory connected at 8 GB/s to an
   HPC-SAN (68 TB hard-disk capacity); a PC farm connected at 4 GB/s to a
   PC-SAN (68 TB hard-disk capacity); both attached at 1.8 GB/s to a
   petabyte tape storage system (1 PB capacity, installed in 2006).]
8. Areas of Expertise
   • Research topics
     – Architecture and performance analysis of High Performance Computers
     – Programming methods and techniques for HPC systems
     – Grid Computing
     – Software tools to support programming and optimization
     – Modeling algorithms of biological processes
     – Mathematical models, algorithms, and efficient implementations
   • Role of mediator between vendors, developers, and users
   • Pick-up and preparation of new concepts, methods, and techniques
   • Teaching and education
9. Performance Analysis Tools
   • The Vampir performance analysis toolkit
     – Vampir: scalable event trace visualization
     – VampirTrace: instrumentation and run-time data collection
     – Open Trace Format (OTF): event trace data format
10. Performance Analysis Tools: the Vampir Team
    Ronny Brendel, Jens Doleschal, Ronald Geisler, Daniel Hackenberg,
    Robert Henschel, Matthias Jurenz, Dr. Andreas Knüpfer, Matthias Lieber,
    Holger Mickler, Dr. Hartmut Mix, Dr. Matthias Müller, Prof. Wolfgang E. Nagel,
    Michael Peter, Heide Rohling, Matthias Weber, Thomas William
    http://www.tu-dresden.de/zih/ptools
    http://www.vampir.eu
11. EVENT BASED PROGRAM ANALYSIS
12. Why performance analysis?
    • Moore's Law is still in effect, so is there no need to tune performance?
    • It is increasingly difficult to get close to peak performance
      – for sequential computation: memory wall, optimum pipelining, ...
      – for parallel interaction: Amdahl's law, synchronization with a single late-comer, ...
    • Efficiency is important because of limited resources
    • Scalability is important to cope with the next bigger simulation
13. OVERVIEW
    • Basics about Parallelization
    • Performance Analysis with Profiling
    • Instrumentation and Tracing
14. Motivation
    • Reasons for parallel programming:
      – Higher performance
        • Solve the same problem in a shorter time
        • Solve larger problems in the same time
      – Higher capability
        • Solve problems that cannot be solved on a single processor
        • Larger memory on parallel computers
        • Time constraints limit the possible problem size (e.g. weather forecast: turn-around within a working day)
    • In both cases performance is one of the major concerns:
      – Also consider sequential performance within the parallel sections
15. Parallelization Strategies
    • General strategy for parallelization:
      – Distribute the work to many workers
    • Limitations:
      – Not all tasks can be split into smaller sub-tasks
      – Dependencies between sub-tasks
      – Coordination overhead
      – (same as for human teams)
    • Algorithms:
      – Different algorithms for the same problem differ in terms of parallelization
      – Different "best" algorithms for serial vs. parallel execution, or for different parallelization schemes
16. BASICS ABOUT PARALLELIZATION
17. Speed-up
    • Definition of speed-up S:

        S = Ts / Tp

      Ts: serial execution time
      Tp: parallel execution time with P CPUs
    [Chart: ideal vs. real speed-up versus number of processors, 1–8 CPUs.]
    • The actual speed-up is often lower than the ideal one due to the aforementioned limitations.
18. Parallel Efficiency
    • Alternative definition: parallel efficiency E:

        E = S / P = Ts / (P * Tp)

      Ts: serial execution time
      Tp: parallel execution time with P CPUs
    [Chart: ideal vs. real parallel efficiency versus number of processors, 1–8 CPUs.]
19. Amdahl's law
    • Fundamental limit of parallelization: if only a fraction F of the algorithm is parallel with speed-up Sp, and a fraction (1 - F) is serial, then the maximum resulting speed-up is:

        S(P) = 1 / ((1 - F) + F / Sp)  <=  1 / (1 - F)

    [Chart: maximum speed-up versus number of CPUs, 1–16, for F = 99%, 95%, 90%, 80%.]
20. Amdahl's law
    • If you know your desired speed-up S, you can calculate F:

        F = 1 - 1/S

      – F gives you the percentage of your program that has to be executed in parallel in order to achieve a speed-up of S (asymptotically).
      – In order to estimate the resulting effort, you need to know in which parts of your program the remaining (1 - F) of the time is spent.
      – This is even before considering the actual parallelization method, which
        • might add new serial sections,
        • brings coordination overhead,
        • will not scale arbitrarily high, i.e. the time of the parallel section will stay > 0.
21. Amdahl's law, example
    • Example program with some sub-routines calling one another:

        Name       # calls   Time (%)   Accumulated time (%)
        Calc       155648    31.22      31.22
        Multiply   603648    22.24      53.46
        Matmul     155648    10.05      63.51
        Copy       214528     9.33      72.84
        Find       603648     7.87      80.71

      – For a maximum speed-up of 2, one needs to parallelize Calc and Multiply.
      – For a maximum speed-up of 5, all of them need to be parallelized!
22. General Parallelization Strategy
    • Therefore, successful parallelization requires:
      – Finding the actual hot-spots of work
      – Sufficient potential for parallelization
      – A parallelization strategy that introduces minimum coordination overhead
    • There are no general rules! Things that help to achieve high performance:
      – Know your application
      – Know your compiler
      – Understand the performance tool
      – Know the characteristics of the hardware
23. PERFORMANCE ANALYSIS WITH PROFILING
24. Profiling
    • Profiling gives an overview of the distribution of run time
    • Usually at the level of subroutines, but also at line-by-line level
    • Rather low overhead
    • Usually good enough to find computational hot spots
    • Too little detail to detect performance problems and their causes
    • More sophisticated ways of profiling:
      – Based on hardware performance counters
      – Phase-based profiles
      – Call-path profiles
25. Profiling
    • Profile recording
      – Of aggregated information (time, counts, ...)
      – About program and system entities
        • Functions, loops, basic blocks
        • Application, processes, threads, ...
    • Methods of profile creation
      – PC sampling (statistical approach)
      – Direct measurement (deterministic approach)
26. Profiling with gprof
    • Compile with profiling support
      – Using -pg for GNU, -p -g for Intel
      – Optimization (-O3) might obscure the output somewhat
        %> mpicc -p -g -O2 heat-mpi-slow-big.c -o heat-mpi-slow-big
    • Execute normally
      – Used to be available only for sequential programs
      – Works in parallel only with the GMON_OUT_PREFIX trick
        %> export GMON_OUT_PREFIX=ggg
        %> mpirun -np 4 heat-mpi-slow-big
        %> ls
        ggg.11762 ggg.11763 ggg.11764 ggg.11765
27. Profiling with gprof
    • Post-process the profiling output with gprof:
      – Text output
      – There are also GUI front-ends like pgprof (PGI) and kprof (KDE)
    • For a single rank:
        %> gprof [-b] heat-mpi-slow-big ggg.11765 | less
    • Combine results for all ranks:
        %> gprof -s heat-mpi-slow-big ggg.*
        %> gprof [-b] heat-mpi-slow-big gmon.sum | less
28. Profiling with gprof
    • Flat profile for one of four ranks:

        Flat profile:
        Each sample counts as 0.01 seconds.
          %     cumulative   self              self    total
         time    seconds    seconds   calls  s/call  s/call  name
        100.00     2.08       2.08        1    2.08    2.08  Algorithm
          0.00     2.08       0.00        1    0.00    0.00  CalcBoundaries
          0.00     2.08       0.00        1    0.00    0.00  DistributeNodes

    • Flat profile for all four ranks combined:

        Flat profile:
        Each sample counts as 0.01 seconds.
          %     cumulative   self              self    total
         time    seconds    seconds   calls  s/call  s/call  name
        100.00     8.59       8.59        4    2.15    2.15  Algorithm
          0.00     8.59       0.00        4    0.00    0.00  CalcBoundaries
          0.00     8.59       0.00        4    0.00    0.00  DistributeNodes
29. Profiling with gprof
    • Annotated call graph for one of four ranks:

        Call graph
        granularity: each sample hit covers 4 byte(s) for 0.48% of 2.08 seconds

        index  % time   self  children  called  name
                        2.08      0.00     1/1      main [2]
        [1]     100.0   2.08      0.00     1    Algorithm [1]
        -----------------------------------------------
                                                 <spontaneous>
        [2]     100.0   0.00      2.08          main [2]
                        2.08      0.00     1/1      Algorithm [1]
                        0.00      0.00     1/1      DistributeNodes [4]
                        0.00      0.00     1/1      CalcBoundaries [3]
        -----------------------------------------------
                        0.00      0.00     1/1      main [2]
        [3]       0.0   0.00      0.00     1    CalcBoundaries [3]
        -----------------------------------------------
                        0.00      0.00     1/1      main [2]
        [4]       0.0   0.00      0.00     1    DistributeNodes [4]
        -----------------------------------------------
30. Profiling
    • Simple profiling is a good starting point
      – Reveals computational hot spots
      – Hides outlier values in the average
    • More detail is needed for
      – Parallel analysis and identification of performance problems
      – Finding optimization opportunities
    • Advanced profiling tools:
      – TAU: http://www.cs.uoregon.edu/research/tau/
      – HPCToolkit: http://hpctoolkit.org/
31. INSTRUMENTATION AND TRACING
32. Event Tracing
    • Collect more detailed information for more insight
    • Do not summarize run-time information
    • Collect individual events with their properties during run time
    • Event tracing can be used for:
      – Visualization (VampirSuite)
      – Automatic analysis (Scalasca)
      – Debugging or re-play (VampirSuite + Scalasca)
33. Tracing
    • Recording of run-time events (points of interest)
      – During program execution
      – Enter/leave of functions/subroutines
      – Send/receive of messages, synchronization
      – More ...
    • Saved as event records
      – Timestamp, process, thread, event type
      – Event-specific information
      – Sorted by time stamp
    • Collected via instrumentation and a trace library
34. Profiling vs. Tracing
    • Tracing advantages
      – Preserves temporal and spatial relationships (context)
      – Allows reconstruction of dynamic behavior on any required abstraction level
      – Profiles can be calculated from a trace
    • Tracing disadvantages
      – Traces can become very large
      – May cause perturbation
      – Instrumentation and tracing are complicated
        • Event buffering, clock synchronization, ...
35. Common Event Types
    • Enter/leave of a function/routine/region
      – Time stamp, process/thread, function ID
    • Send/receive of a P2P message (MPI)
      – Time stamp, sender, receiver, length, tag, communicator
    • Collective communication (MPI)
      – Time stamp, process, root, communicator, # bytes
    • Hardware performance counter values
      – Time stamp, process, counter ID, value
    • Etc.
36. Parallel Trace
    • Definition records:

        DEF TIMERRES 1000000000
        DEF PROCESS 1 `Master`
        DEF PROCESS 2 `Slave`
        DEF FUNCTION 5 `main`
        DEF FUNCTION 6 `foo`
        DEF FUNCTION 9 `bar`
        DEF FUNCTION 12 `MPI_Send`
        DEF FUNCTION 13 `MPI_Recv`

    • Event records, sorted by time stamp per process:

        10010 P 1 ENTER 5               10020 P 2 ENTER 5
        10090 P 1 ENTER 6               10095 P 2 ENTER 6
        10110 P 1 ENTER 12              10120 P 2 ENTER 13
        10110 P 1 SEND TO 3 LEN 1024    10300 P 2 RECV FROM 3 LEN 1024
        10330 P 1 LEAVE 12              10350 P 2 LEAVE 13
        10400 P 1 LEAVE 6               10450 P 2 LEAVE 6
        10520 P 1 ENTER 9               10620 P 2 ENTER 9
        10550 P 1 LEAVE 9               10650 P 2 LEAVE 9
        ...                             ...
37. Instrumentation
    • Instrumentation: the process of modifying programs to detect and report events by calling instrumentation functions.
      – Instrumentation functions are provided by the trace library
      – A call == notification about a run-time event
      – There are various ways of instrumentation
38. Source Code Instrumentation
    • Manually or automatically, e.g.:

        /* original */                 /* instrumented */
        int foo(void* arg){            int foo(void* arg){
                                         enter(6);
          if (cond){                     if (cond){
                                           leave(6);
            return 1;                      return 1;
          }                              }
                                         leave(6);
          return 0;                      return 0;
        }                              }
39. Source Code Instrumentation
    • Manually
      – Large effort, error prone
      – Difficult to manage
    • Automatically
      – Via source-to-source translation
      – Program Database Toolkit (PDT): http://www.cs.uoregon.edu/research/pdt/
      – OpenMP Pragma And Region Instrumentor (Opari): http://www.fz-juelich.de/zam/kojak/opari/
40. Wrapper Function Instrumentation
    • Provide wrapper functions that
      – call an instrumentation function for notification,
      – call the original target for functionality
    • Via preprocessor directives:
        #define MPI_Init WRAPPER_MPI_Init
        #define MPI_Send WRAPPER_MPI_Send
    • Via library preload:
      – Preload an instrumented dynamic library
    • Suitable for standard libraries (e.g. MPI, glibc)
41. The MPI Profiling Interface
    • Each MPI function has two names:
      – MPI_xxx and PMPI_xxx
    • Selective replacement of MPI routines at link time
    [Diagram: the user program calls MPI_Send; the wrapper library provides its own MPI_Send and forwards to PMPI_Send, the second entry point for the real MPI_Send in the MPI library.]
42. Compiler Instrumentation
    • gcc -finstrument-functions -c foo.c

        void __cyg_profile_func_enter( <args> );
        void __cyg_profile_func_exit( <args> );

    • Many compilers support instrumentation (GCC, Intel, IBM, PGI, NEC, Hitachi, Sun Fortran, ...)
    • No source modification
43. Dynamic Instrumentation
    • Modify the binary executable in memory
    • Insert instrumentation calls
    • Very platform/machine dependent, expensive
    • DynInst project (http://www.dyninst.org)
      – Common interface
      – Alpha/Tru64, MIPS/IRIX, PowerPC/AIX, Sparc/Solaris, x86/Linux+Windows, ia64/Linux
44. Instrumentation & Trace Overhead
    • Overhead for an empty function call, in ticks (uninstrumented baseline: 15 ticks):

                     manual   PDT   GCC   DynInst
        dummy            59    60    52       568
        f.addr.         117   117   115       638
        f.symbol        120   121   278       637
        f.id            119   120   219       633
        id+timer        299   300   451       937
45. Trace Libraries
    • Provide instrumentation functions
    • Receive events of various types
    • Collect event properties
      – Time stamp
      – Location (thread, process, cluster node, MPI rank)
      – Event-specific properties
      – Perhaps hardware performance counter values
    • Record to a memory buffer, flush eventually
    • Try to be fast, minimize overhead
46. Trace Files & Formats
    • TAU Trace Format (Univ. of Oregon)
    • Epilog (ZAM, FZ Jülich)
    • STF (Pallas, now Intel)
    • Open Trace Format (OTF)
      – ZIH, TU Dresden in cooperation with Oregon and Jülich
      – Single or multiple files per trace
      – Fast sequential and random access
      – Includes an API for writing/reading
      – Supports auxiliary information
      – See http://www.tu-dresden.de/zih/otf/
47. Interoperability
48. Other Tools
    • TAU profiling (University of Oregon, USA)
      – Extensive profiling and tracing for parallel applications, plus visualization, comparison, etc.
      – http://www.cs.uoregon.edu/research/tau/
    • Paraver (CEPBA, Barcelona, Spain)
      – Trace-based parallel performance analysis and visualization
      – http://www.cepba.upc.edu/paraver/
    • Scalasca (FZ Jülich)
      – Tracing and automatic detection of performance problems
      – http://www.scalasca.org
    • Intel Trace Collector & Analyzer
      – Very similar to Vampir
