2 Vampir Trace Visualization
  • 1. Performance Analysis on IU’s HPC Systems using Vampir Trace Visualization. Holger Brunst, Center for Information Services and High Performance Computing (ZIH)
  • 2. Outline: Introduction; Prominent Display Types; Performance Analysis Examples; New Vampir GUI
  • 3. Post-mortem Event-based Performance Analysis. Performance optimization remains one of the key issues in parallel programming, so there is a strong need for performance analysis, yet the analysis process is still not easy. Profilers do not give detailed insight into the timing behavior of an application, and detailed online analysis is practically impossible because of intrusion and data volume. Tracing is an option to capture the dynamic behavior of parallel applications; performance analysis is then done on a post-mortem basis.
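A typical VampirTrace workflow recompiles the application with the VampirTrace compiler wrappers so that an OTF trace is written at program exit; the sketch below follows the VampirTrace manual, but the source file name and process count are hypothetical:

```shell
# Build with automatic instrumentation: vtcc/vtcxx/vtf90 wrap the
# native compilers and insert VampirTrace probes.
vtcc -c jacobi.c -o jacobi.o   # jacobi.c is a placeholder source file
vtcc jacobi.o -o jacobi

# Optional runtime settings (VampirTrace environment variables):
export VT_FILE_PREFIX=jacobi   # name prefix of the resulting OTF trace
export VT_BUFFER_SIZE=64M      # per-process trace buffer size

# Run as usual; the OTF trace is written at the end of the run and
# can then be loaded into Vampir for post-mortem analysis.
mpirun -np 4 ./jacobi
```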
  • 4. Background. Vampir is a performance visualization and analysis tool that targets the visualization of dynamic processes on massively parallel (compute) systems. It is available for major Unix-based operating systems and for Windows. Development started more than 15 years ago at Research Centre Jülich, ZAM; since 1997 it has been developed at TU Dresden (first in collaboration with Pallas GmbH, from 2003-2005 with the Intel Software & Solutions Group, and since January 2006 at TU Dresden, ZIH / GWT-TUD). The visualization components (Vampir) are commercial; the monitor components (VampirTrace) are Open Source.
  • 5. Components. [Architecture diagram: the application, instrumented by VampirTrace, writes an OTF trace split into per-CPU parts (Part 1 … Part m, up to 10,000 CPUs); VampirServer analyzes the trace data over time with tasks 1 … n, where n << m.]
  • 6. Flavors. Vampir: stabilized sequential version; similar set of displays and options as presented here; less scalable; no ongoing active development. VampirServer: distributed analysis engine; allows server and client on the same workstation as well; new features; Windows port in progress.
  • 7. VampirServer. Parallel/distributed server: runs in (part of) the production environment; no need to transfer huge traces; parallel I/O. Lightweight client on the local workstation: receives visual content only, already adapted to the display resolution; moderate network load. Scalability: data volumes > 100 GB; number of processes > 10,000.
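In current Vampir releases the distributed engine is started through a launcher script roughly as below; command names and output vary between versions, so treat this as an illustrative sketch rather than the exact interface of the version shown in these slides:

```shell
# Launch the analysis engine with 16 worker processes on the cluster;
# the launcher reports the host/port for the GUI client to connect to.
vampirserver start -n 16
# (illustrative output) Server listens on: node042:30000

# Shut the engine down again when the analysis session is finished.
vampirserver stop
```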
  • 8. Outline: Introduction; Prominent Display Types; Performance Analysis Examples; Vampir for Windows
  • 9. Main Displays: Global Timeline; Process Timeline + Counter; Counter Timeline; Summary Timeline; Summary Chart (aka Profile); Message Statistics; Collective Communication Statistics; I/O Statistics; Call Tree; ...
  • 10. Most Prominent Displays: Global Timeline. [Screenshot annotations: time axis; MPI processes; function groups; thumbnail; black lines: MPI messages; red: MPI routines; other colors: application routines.]
  • 11. Most Prominent Displays: Single Process Timeline. [Screenshot annotations: time axis; call stack level; application routines; MPI; I/O; performance counter.]
  • 12. Other Tools. TAU (University of Oregon, USA): extensive profiling and tracing for parallel applications, visualization, comparison, etc. KOJAK (JSC, FZ Jülich): very scalable performance tracing; automatic performance analysis and classification. Paraver (CEPBA, Barcelona, Spain): trace-based parallel performance analysis and visualization.
  • 13. Outline: Introduction; Prominent Display Types; Performance Analysis Examples; New Vampir GUI
  • 14. Approaching Performance Problems. Trace visualization: Vampir provides a number of display types, each with many customization options. Advice: make a hypothesis about performance problems; consider the application's internal workings if known; select the appropriate display; use statistic displays in conjunction with timelines.
  • 15. Finding Performance Bottlenecks Four Categories 1. Computation 2. Communication 3. Memory, I/O, … 4. Tracing itself
  • 16. Finding Performance Bottlenecks. Computation: unbalanced workload distribution (single late comer(s)); strictly serial parts of the program (idle processes/threads); very frequent tiny function calls (call overhead); sparse loops.
  • 17. LM-MUSCAT Air Quality Model: Load Imbalance. Many processes waiting for individual processes to finish chemistry computation.
  • 18. LM-MUSCAT Air Quality Model: Bad CPU Partitioning Meteorology Processes are waiting most of the time
  • 19. LM-MUSCAT Air Quality Model: Good CPU Partitioning. More processes for Chemistry-Transport, fewer for Meteorology: better balance.
  • 20. LM-MUSCAT Air Quality Model: Load Imbalance. High load imbalance only during (simulated) sunrise. Examine the runtime behavior of the application.
  • 21. SPEC OMP Benchmark fma3d: Unbalanced Threads Not well balanced OpenMP threads
  • 22. WRF Weather Model: MPI/OpenMP - Idle Threads Idle Threads
  • 23. Finding Performance Bottlenecks. Communication: communication as such (domination over computation); late sender, late receiver; point-to-point messages instead of collective communication; unmatched messages; overcharge of MPI buffers; bursts of large messages (bandwidth); frequent short messages (latency); unnecessary synchronization (barrier).
  • 24. High Performance Linpack Using Open MPI Everything looks ok here
  • 25. HPL with Alternative MPI Implementation Several slow Messages. MPI Problem?
  • 26. HPL with Alternative MPI Implementation Transfer Rate only 1.63 MB/s! Tracking down Performance Problems to individual Events
  • 27. Finding Performance Bottlenecks. Memory-bound computation: inefficient L1/L2/L3 cache usage; TLB misses; detectable via HW performance counters. I/O-bound computation: slow input/output; sequential I/O on a single process; I/O load imbalance. Exception handling.
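Hardware counters are recorded by VampirTrace through PAPI when the VT_METRICS environment variable is set before the run; the counter names below are standard PAPI preset events, but availability depends on the CPU, and ./myapp is a placeholder binary:

```shell
# Record L2 total cache misses and floating-point operations along
# the timeline; multiple PAPI preset events are separated by ":".
export VT_METRICS=PAPI_L2_TCM:PAPI_FP_OPS
mpirun -np 4 ./myapp   # placeholder for the instrumented application
```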
  • 28. Performance Counters: Floating Point Exceptions
  • 29. WRF Weather Model: Floating Point Exceptions. FPU exceptions lead to a long runtime of routine ADVECT (timeline interval: 77.7 ms). With different optimization and no FPU exceptions: only 10 ms for the same program section.
  • 30. WRF Weather Model: Low I/O Performance Transfer Rate only 389 kB/s!
  • 31. WRF Weather Model: Slow Metadata Operations 128 Processes call open – takes more than 4 seconds
  • 32. Semtex CFD Application: Serial I/O Process 0 is performing I/O … … while 127 processes are waiting
  • 33. Complex Cell Application: RAxML (1) RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase
  • 34. Complex Cell Application: RAxML (2) RAxML with 8 SPEs, 4000 ns window enlargement of a small loop shifted start of loop, constant runtime
  • 35. Complex Cell Application: RAxML (3) RAxML with 8 SPEs, 4000 ns window enlargement of a small loop (modified) synchronous start, memory contention
  • 36. Complex Cell Application: RAxML (4) RAxML with 16 SPEs, load imbalance
  • 37. Finding Performance Bottlenecks. Tracing: measurement overhead (especially grave for tiny function calls; solve with selective instrumentation); long, asynchronous trace buffer flushes; too many concurrent counters (more data); heisenbugs.
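Selective instrumentation can be applied at runtime with a VampirTrace filter file, which caps how many calls of a function are recorded (a limit of 0 drops the function entirely); the function names below are hypothetical:

```shell
# Create a filter file: "name(s) -- limit" per line, wildcards allowed.
cat > filter.spec <<'EOF'
tiny_helper;get_element -- 0
interpolate_* -- 100000
EOF

# Point VampirTrace at the filter and run the instrumented binary.
export VT_FILTER_SPEC=$PWD/filter.spec
mpirun -np 4 ./myapp   # placeholder for the instrumented application
```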
  • 38. Trace Buffer Flush
  • 39. Outline: Introduction; Prominent Display Types; Performance Analysis Examples; New Vampir GUI
  • 40. Product Family. Vampir for UNIX: VampirClassic, an all-in-one, single-threaded OpenMotif application; VampirServer, a parallelized (MPI) client/server approach in which a Motif GUI connects via sockets to a parallelized visualization service engine. Vampir for Windows HPC Server: based on VampirServer's parallel service engine, with a new threaded Windows GUI harnessing the VampirServer services through a DLL API.
  • 41. New GUI Layout. [Screenshot annotations: chart selection; global time selection with summary; shared chart area.]
  • 42. Chart Overview
  • 43. Chart Arrangement
  • 44. Windows Event Tracing. Windows HPC Server 2008: Microsoft MPI (MS-MPI) is integrated with the Event Tracing for Windows (ETW) infrastructure, which allows MPI tracing. No “special” builds are needed; just run the application with an extra mpiexec flag (-trace). High-precision CPU clock correction for MS-MPI is provided by mpicsync. Tracing prerequisites: the user must be in the Administrators or Performance Log group, and jobs should be executed exclusively through the Windows HPC Server 2008 scheduler to avoid confusion/conflict of the trace data.
  • 45. Creation of OTF Traces. [Workflow diagram: on each rank node (rank 0 … rank N), run myApp.exe with MS-MPI/ETW tracing enabled, producing a local event trace (trace.etl); time-sync the ETL logs with mpicsync; convert the ETL logs to OTF with etl2otf; copy the formatted OTF files (trace.etl_otf.otf, trace.etl_otf.0.def, …) to the shared user home on the head node.]
  • 46. Creation of OTF Traces. The four steps are created as individual tasks in a cluster job. The task options allow choosing the number of cores for the job and other parameters. The “Dependency” setting ensures the right order of execution of the tasks.
  • 47. Creation of OTF Traces. File system prerequisites: “\\share\userHome” is the shared user directory throughout the cluster; the MPI executable “myApp.exe” is available in this shared directory; “\\share\userHome\Trace” is the directory where the OTF files are collected. Launch the program with the -tracefile option: mpiexec -wdir \\share\userHome -tracefile %USERPROFILE%\trace.etl myApp.exe. Here -wdir sets the working directory (myApp.exe has to be there); %USERPROFILE% translates to the local home directory, e.g. “C:\Users\userHome”, on each node; the event log file (.etl) is stored locally in this directory.
  • 48. Creation of OTF Traces. Time-sync the log files throughout all nodes: mpiexec -cores 1 -wdir %USERPROFILE% mpicsync trace.etl (“-cores 1” runs only one instance of mpicsync on each node). Convert the log files to OTF files: mpiexec -cores 1 -wdir %USERPROFILE% etl2otf trace.etl. Copy all OTF files from the nodes to the trace directory on the share: mpiexec -cores 1 -wdir %USERPROFILE% cmd /c copy /y “*_otf*” “\\share\userHome\Trace”.
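Put together, the four steps from slides 45-48 form one command sequence (Windows HPC Server 2008 syntax, with the slides' example paths restored to their backslashed form):

```shell
:: 1. Run the application with ETW-based MPI tracing enabled
mpiexec -wdir \\share\userHome -tracefile %USERPROFILE%\trace.etl myApp.exe

:: 2. Time-sync the local event logs (one mpicsync instance per node)
mpiexec -cores 1 -wdir %USERPROFILE% mpicsync trace.etl

:: 3. Convert the event logs to OTF
mpiexec -cores 1 -wdir %USERPROFILE% etl2otf trace.etl

:: 4. Collect the OTF files in the shared trace directory
mpiexec -cores 1 -wdir %USERPROFILE% cmd /c copy /y "*_otf*" "\\share\userHome\Trace"
```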
  • 49. Summary. Hybrid MPI/OpenMP trace file with 1024 cores: 256 MPI ranks, 4 OpenMP threads per rank. Some feedback from users: “This was very impressive to see live as they had never seen their application profiled at this scale, and vampir pointed us at the problem straight away”; “I was impressed by how detailed MPI functions are visualized in a zoomable fashion into micro seconds scale. We have some parallel C# programs currently run on Windows cluster of up to 128 core. I will use Vampir to test on other applications I have.” Work in progress with regular updates: completion of charts; additional information sources from ETW.
  • 50. Thank You. Team: Ronny Brendel, Dr. Holger Brunst, Jens Doleschal, Matthias Jurenz, Dr. Andreas Knüpfer, Matthias Lieber, Christian Mach, Holger Mickler, Dr. Hartmut Mix, Dr. Matthias Müller, Prof. Wolfgang E. Nagel, Michael Peter, Matthias Weber, Thomas William.