2 Vampir Trace Visualization

  1. Center for Information Services and High Performance Computing (ZIH)
     Performance Analysis on IU’s HPC Systems using Vampir Trace Visualization
     Holger Brunst, e-mail: holger.brunst@tu-dresden.de
  2. Outline
     Introduction
     Prominent Display Types
     Performance Analysis Examples
     New Vampir GUI
  3. Post-mortem Event-based Performance Analysis
     Performance optimization remains one of the key issues in parallel programming
     Strong need for performance analysis, analysis process still not easy
     Profilers do not give detailed insight into timing behavior of an application
     Detailed online analysis pretty much impossible because of intrusion and data amount
     Tracing is an option to capture the dynamic behavior of parallel applications
     Performance analysis done on a post-mortem basis
  4. Background
     Performance visualization and analysis tool
     Targets the visualization of dynamic processes on massively parallel (compute-) systems
     Available for major Unix based OS and for Windows
     Development started more than 15 years ago at Research Centre Jülich, ZAM
     Since 1997, developed at TU Dresden (first: collaboration with Pallas GmbH, from 2003-2005: Intel Software & Solutions Group, since January 2006: TU Dresden, ZIH / GWT-TUD)
     Visualization components (Vampir) are commercial – Motivation – Advantages/Disadvantages
     Monitor components (VampirTrace) are Open Source
  5. Components
     [Architecture diagram: the application runs on many CPUs (1 ... 10,000) with VampirTrace attached; each part writes OTF trace data; VampirServer reads the trace with n analysis tasks (n << m trace parts) and feeds the Vampir visualization over time.]
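     To make this data flow concrete, below is a minimal MPI program of the kind that sits in the "Application" box. It is ordinary MPI code; assuming a standard VampirTrace installation, it would simply be built with a VampirTrace compiler wrapper (e.g. vtcc instead of mpicc) and run as usual, producing an OTF trace that Vampir/VampirServer can open. The program and its names are illustrative, not taken from the slides.

         /* trace_demo.c - ordinary MPI code; built with the VampirTrace
          * compiler wrapper (assumed setup), every MPI call is recorded
          * into an OTF trace for post-mortem analysis in Vampir. */
         #include <mpi.h>
         #include <stdio.h>

         static double compute(int rank) {
             double s = 0.0;
             for (long i = 0; i < 1000000L; i++)   /* some per-rank work */
                 s += (rank + 1) * 1e-6 * (double)i;
             return s;
         }

         int main(int argc, char **argv) {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             double local = compute(rank);
             double sum = 0.0;
             MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

             if (rank == 0)
                 printf("sum = %f\n", sum);
             MPI_Finalize();
             return 0;
         }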
  6. Flavors
     Vampir
       stabilized sequential version
       similar set of displays and options as presented here
       less scalable
       no ongoing active development
     VampirServer
       distributed analysis engine
       allows server and client on the same workstation as well
       new features
       windows port in progress
  7. VampirServer
     Parallel/distributed server
       runs in (part of) production environment
       no need to transfer huge traces
       parallel I/O
     Lightweight client on local workstation
       receive visual content only, already adapted to display resolution
       moderate network load
     Scalability
       data volumes > 100 GB
       number of processes > 10,000
  8. Outline
     Introduction
     Prominent Display Types
     Performance Analysis Examples
     Vampir for Windows
  9. Main Displays
     Global Timeline
     Process Timeline + Counter
     Counter Timeline
     Summary Timeline
     Summary Chart (aka. Profile)
     Message Statistics
     Collective Communication Statistics
     I/O Statistics
     Call Tree, ...
  10. Most Prominent Displays: Global Timeline
      [Screenshot, annotated: time axis, MPI processes, function groups, thumbnail; black lines: MPI messages; red: MPI routines; other colors: application routines]
  11. Most Prominent Displays: Single Process Timeline
      [Screenshot, annotated: time axis, call stack level, application routines, MPI, I/O, performance counter]
  12. Other Tools
      TAU profiling (University of Oregon, USA)
        extensive profiling and tracing for parallel applications, visualization, comparison, etc.
        http://www.cs.uoregon.edu/research/tau/
      KOJAK (JSC, FZ Jülich)
        very scalable performance tracing
        automatic performance analysis and classification
        http://www.fz-juelich.de/jsc/kojak/
      Paraver (CEPBA, Barcelona, Spain)
        trace based parallel performance analysis and visualization
        http://www.cepba.upc.edu/paraver/
  13. Outline
      Introduction
      Prominent Display Types
      Performance Analysis Examples
      New Vampir GUI
  14. Approaching Performance Problems
      Trace Visualization
        Vampir provides a number of display types
        each provides many customization options
      Advice
        make a hypothesis about performance problems
        consider application's internal workings if known
        select the appropriate display
        use statistic displays in conjunction with timelines
  15. Finding Performance Bottlenecks
      Four Categories
      1. Computation
      2. Communication
      3. Memory, I/O, …
      4. Tracing itself
  16. Finding Performance Bottlenecks
      Computation
        unbalanced workload distribution: single late comer(s)
        strictly serial parts of program: idle processes/threads
        very frequent tiny function calls: call overhead
        sparse loops
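      To illustrate the first item, here is a minimal sketch (illustrative only, not taken from the slides) of the rank-dependent workload pattern that shows up in the Vampir global timeline as a single late comer, with all other ranks blocked in red MPI regions:

         /* imbalance_demo.c - rank-dependent work followed by a barrier.
          * In a Vampir global timeline the highest rank keeps computing
          * while every other rank sits in MPI_Barrier (a late comer). */
         #include <mpi.h>
         #include <math.h>
         #include <stdio.h>

         int main(int argc, char **argv) {
             MPI_Init(&argc, &argv);
             int rank, size;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             MPI_Comm_size(MPI_COMM_WORLD, &size);

             /* Work grows with the rank number: rank size-1 finishes last. */
             long n = 1000000L * (rank + 1);
             double s = 0.0;
             for (long i = 0; i < n; i++)
                 s += sin((double)i);

             MPI_Barrier(MPI_COMM_WORLD);   /* everyone waits for the late comer */
             if (rank == 0)
                 printf("done, s(rank 0) = %f\n", s);
             MPI_Finalize();
             return 0;
         }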
  17. LM-MUSCAT Air Quality Model: Load Imbalance
      Many processes waiting for individual processes to finish chemistry computation
  18. LM-MUSCAT Air Quality Model: Bad CPU Partitioning
      Meteorology processes are waiting most of the time
  19. LM-MUSCAT Air Quality Model: Good CPU Partitioning
      More processes for Chemistry-Transport, less for Meteorology: better balance
  20. LM-MUSCAT Air Quality Model: Load Imbalance
      High load imbalance only during (simulated) sunrise
      Examine the runtime behavior of the application
  21. SPEC OMP Benchmark fma3d: Unbalanced Threads
      Not well balanced OpenMP threads
  22. WRF Weather Model: MPI/OpenMP - Idle Threads
      Idle threads
  23. Finding Performance Bottlenecks
      Communication
        communication as such (domination over computation)
        late sender, late receiver
        point-to-point messages instead of collective communication
        unmatched messages
        overcharge of MPI buffers
        bursts of large messages (bandwidth)
        frequent short messages (latency)
        unnecessary synchronization (barrier)
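      A small sketch of the third item (illustrative only, names are made up): the same broadcast done with a loop of point-to-point sends, which appears in Vampir as a fan of individual messages leaving rank 0, versus a single collective, which is usually the better pattern:

         /* p2p_vs_collective.c - two ways to distribute one value from rank 0. */
         #include <mpi.h>

         int main(int argc, char **argv) {
             MPI_Init(&argc, &argv);
             int rank, size;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             MPI_Comm_size(MPI_COMM_WORLD, &size);

             double value = (rank == 0) ? 42.0 : 0.0;

             /* Pattern 1: point-to-point fan-out, size-1 serialized sends on rank 0. */
             if (rank == 0) {
                 for (int dst = 1; dst < size; dst++)
                     MPI_Send(&value, 1, MPI_DOUBLE, dst, 0, MPI_COMM_WORLD);
             } else {
                 MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
             }

             /* Pattern 2: the collective equivalent, typically tree-based inside MPI. */
             MPI_Bcast(&value, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

             MPI_Finalize();
             return 0;
         }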
  24. High Performance Linpack Using Open MPI
      Everything looks OK here
  25. HPL with Alternative MPI Implementation
      Several slow messages. MPI problem?
  26. HPL with Alternative MPI Implementation
      Transfer rate only 1.63 MB/s!
      Tracking down performance problems to individual events
  27. Finding Performance Bottlenecks
      Memory bound computation
        inefficient L1/L2/L3 cache usage
        TLB misses
        detectable via HW performance counters
      I/O bound computation
        slow input/output
        sequential I/O on single process
        I/O load imbalance
      Exception handling
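      For the "sequential I/O on single process" item (compare the Semtex example a few slides below), here is a sketch contrasting the two patterns; file names and sizes are made up for illustration:

         /* io_patterns.c - serial I/O via rank 0 vs. collective MPI-IO.
          * The first pattern appears in Vampir as rank 0 busy in I/O while
          * all other ranks wait; the second spreads the writing across ranks. */
         #include <mpi.h>
         #include <stdio.h>
         #include <stdlib.h>

         #define N 1000000   /* doubles per rank, illustrative */

         int main(int argc, char **argv) {
             MPI_Init(&argc, &argv);
             int rank, size;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             MPI_Comm_size(MPI_COMM_WORLD, &size);

             double *local = malloc(N * sizeof(double));
             for (int i = 0; i < N; i++) local[i] = rank + i * 1e-9;

             /* Pattern 1: gather everything to rank 0 and let it write alone. */
             double *all = (rank == 0) ? malloc((size_t)size * N * sizeof(double)) : NULL;
             MPI_Gather(local, N, MPI_DOUBLE, all, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
             if (rank == 0) {
                 FILE *f = fopen("serial_out.bin", "wb");
                 fwrite(all, sizeof(double), (size_t)size * N, f);
                 fclose(f);
                 free(all);
             }

             /* Pattern 2: every rank writes its own block collectively via MPI-IO. */
             MPI_File fh;
             MPI_File_open(MPI_COMM_WORLD, "parallel_out.bin",
                           MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
             MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
             MPI_File_write_at_all(fh, offset, local, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
             MPI_File_close(&fh);

             free(local);
             MPI_Finalize();
             return 0;
         }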
  28. Performance Counters: Floating Point Exceptions
  29. WRF Weather Model: Floating Point Exceptions
      FPU exceptions lead to long runtime of routine ADVECT. Timeline interval: 77.7 ms
      Other optimization, no FPU exceptions: only 10 ms for the same program section
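      One way to confirm such a suspicion directly in the code, alongside the counter display, is to query the FPU exception flags around the suspect section. This is a generic C99 sketch, not something the slides show; the kernel name is hypothetical, and it should be compiled without aggressive FP optimizations (e.g. no -ffast-math) so the flags remain meaningful:

         /* fpe_check.c - query floating point exception flags around a suspect section. */
         #include <fenv.h>
         #include <stdio.h>

         /* hypothetical stand-in for the expensive routine */
         static double advect_like_kernel(void) {
             volatile double x = 1e-300;
             double s = 0.0;
             for (int i = 0; i < 1000000; i++)
                 s += x * 1e-30;          /* underflows on every iteration */
             return s;
         }

         int main(void) {
             feclearexcept(FE_ALL_EXCEPT);
             double r = advect_like_kernel();
             if (fetestexcept(FE_UNDERFLOW))
                 printf("underflow exceptions were raised (r = %g)\n", r);
             if (fetestexcept(FE_INEXACT))
                 printf("inexact results occurred\n");
             return 0;
         }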
  30. WRF Weather Model: Low I/O Performance
      Transfer rate only 389 kB/s!
  31. WRF Weather Model: Slow Metadata Operations
      128 processes call open – takes more than 4 seconds
  32. Semtex CFD Application: Serial I/O
      Process 0 is performing I/O … while 127 processes are waiting
  33. Complex Cell Application: RAxML (1)
      RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase
  34. Complex Cell Application: RAxML (2)
      RAxML with 8 SPEs, 4000 ns window, enlargement of a small loop
      Shifted start of loop, constant runtime
  35. Complex Cell Application: RAxML (3)
      RAxML with 8 SPEs, 4000 ns window, enlargement of a small loop (modified)
      Synchronous start, memory contention
  36. Complex Cell Application: RAxML (4)
      RAxML with 16 SPEs, load imbalance
  37. Finding Performance Bottlenecks
      Tracing
        measurement overhead
          – esp. grave for tiny function calls
          – solve with selective instrumentation
        long, asynchronous trace buffer flushes
        too many concurrent counters – more data
        heisenbugs
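      One common form of selective instrumentation is to skip automatic instrumentation of tiny functions and mark only coarse regions by hand. The sketch below assumes the VampirTrace manual-instrumentation macros from vt_user.h, enabled with -DVTRACE; all function names are made up, and the exact build flags depend on the local VampirTrace installation:

         /* region_markers.c - record one event pair per coarse region instead
          * of millions of pairs for a tiny helper function. Without -DVTRACE
          * the macros compile away to nothing. */
         #ifdef VTRACE
         #include "vt_user.h"
         #else
         #define VT_USER_START(n)   /* tracing disabled: no-op */
         #define VT_USER_END(n)
         #endif

         static double tiny(double x) { return x * 1.0000001; }  /* called very often */

         static double solver_step(double x) {
             VT_USER_START("solver_step");          /* one event pair per step ... */
             for (long i = 0; i < 1000000L; i++)
                 x = tiny(x);                       /* ... instead of 10^6 pairs for tiny() */
             VT_USER_END("solver_step");
             return x;
         }

         int main(void) {
             double x = 1.0;
             for (int step = 0; step < 100; step++)
                 x = solver_step(x);
             return (x > 0.0) ? 0 : 1;
         }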
  38. Trace Buffer Flush
  39. Outline
      Introduction
      Prominent Display Types
      Performance Analysis Examples
      New Vampir GUI
  40. Product Family
      Vampir for UNIX:
        VampirClassic – all-in-one, single-threaded, Unix/OpenMotif based application
        VampirServer – parallelized client/server approach: MPI-parallelized service engine, socket connection, Motif visualization client
      Vampir for Windows HPC Server:
        based on VampirServer's parallel service engine, harnessed as a threaded Windows service DLL
        new Windows GUI on top of the service API
  41. New GUI Layout
      [Screenshot, annotated: chart selection, global time selection with summary, shared chart area]
  42. Chart Overview
  43. Chart Arrangement
  44. Windows Event Tracing
      Windows HPC Server 2008
        Microsoft MPI (MS-MPI) is integrated with the Event Tracing for Windows (ETW) infrastructure
        allows MPI tracing: no “special” builds needed, just run the application with an extra mpiexec flag (-trace)
        high-precision CPU clock correction for MS-MPI (mpicsync)
      Tracing prerequisites:
        user must be in the Administrator or Performance Log group
        jobs should be executed exclusively in the Windows HPC Server 2008 Scheduler to avoid confusion/conflict of the trace data
  45. Creation of OTF Traces
      [Workflow diagram: on each rank's node, run myApp.exe with tracing enabled (MS-MPI/ETW writes a local trace.etl); time-sync the ETL logs (mpicsync); convert the ETL logs to OTF (etl2otf); copy the OTF files (trace.etl_otf.otf, *.def, *.events) to the shared user directory on the head node]
  46. Creation of OTF Traces
      The four steps are created as individual tasks in a cluster job.
      The task options allow choosing the number of cores for the job and other parameters.
      The “Dependency” setting ensures the right order of execution of the tasks.
  47. Creation of OTF Traces
      File system prerequisites:
        “\\share\userHome” is the shared user directory throughout the cluster
        the MPI executable “myApp.exe” is available in this shared directory
        “\\share\userHome\Trace” is the directory where the OTF files are collected
      Launch the program with the -tracefile option:
        mpiexec -wdir \\share\userHome -tracefile %USERPROFILE%\trace.etl myApp.exe
        -wdir sets the working directory; myApp.exe has to be there
        %USERPROFILE% translates to the local home directory on each node, e.g. “C:\Users\userHome”; the event log file (.etl) is stored locally in this directory
  48. Creation of OTF Traces
      Time-sync the log files throughout all nodes:
        mpiexec -cores 1 -wdir %USERPROFILE% mpicsync trace.etl
        “-cores 1”: run only one instance of mpicsync on each node
      Format the log files to OTF files:
        mpiexec -cores 1 -wdir %USERPROFILE% etl2otf trace.etl
      Copy all OTF files from the nodes to the trace directory on the share:
        mpiexec -cores 1 -wdir %USERPROFILE% cmd /c copy /y “*_otf*” “\\share\userHome\Trace”
  49. Summary
      Hybrid MPI/OpenMP trace file with 1024 cores
        256 MPI ranks
        4 OpenMP threads per rank
      Some feedback from users:
        “This was very impressive to see live as they had never seen their application profiled at this scale, and vampir pointed us at the problem straight away”
        “I was impressed by how detailed MPI functions are visualized in a zoomable fashion into micro seconds scale. We have some parallel C# programs currently run on Windows cluster of up to 128 core. I will use Vampir to test on other applications I have.”
      Work in progress with regular updates
        completion of charts
        additional information sources from ETW
  50. Thank You
      Team: Ronny Brendel, Dr. Holger Brunst, Jens Doleschal, Matthias Jurenz, Dr. Andreas Knüpfer, Matthias Lieber, Christian Mach, Holger Mickler, Dr. Hartmut Mix, Dr. Matthias Müller, Prof. Wolfgang E. Nagel, Michael Peter, Matthias Weber, Thomas William
      http://www.vampir.eu
      http://www.tu-dresden.de/zih/vampirtrace
