2 Vampir Trace Visualization

Center for Information Services and High Performance Computing (ZIH)

Performance Analysis on IU’s HPC
Systems using Vampir
Trace Visualization

Holger Brunst
e-mail: holger.brunst@tu-dresden.de

Outline

Introduction
Prominent Display Types
Performance Analysis Examples
New Vampir GUI

Post-mortem Event-based Performance Analysis

Performance optimization remains one of the key issues in parallel programming
Strong need for performance analysis, analysis process still not easy
Profilers do not give detailed insight into timing behavior of an application
Detailed online analysis pretty much impossible because of intrusion and data amount
Tracing is an option to capture the dynamic behavior of parallel applications
Performance analysis done on a post-mortem basis
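
To make the difference between a profile and a trace concrete, here is a minimal Python sketch of post-mortem analysis. The event tuple layout, the function names, and the timestamps are hypothetical; this is neither the OTF format nor the VampirTrace API. The sketch only assumes that a trace is a list of timestamped enter/leave events per process that can be replayed after the run has finished.

```python
from collections import defaultdict

# Hypothetical trace records: (timestamp_seconds, process, event_kind, function)
TRACE = [
    (0.0, 0, "enter", "main"),
    (1.0, 0, "enter", "solve"),
    (4.0, 0, "leave", "solve"),
    (5.0, 0, "leave", "main"),
    (0.0, 1, "enter", "main"),
    (1.0, 1, "enter", "solve"),
    (2.0, 1, "leave", "solve"),
    (5.0, 1, "leave", "main"),
]


def inclusive_times(trace):
    """Replay events in time order and sum inclusive time per (process, function)."""
    stacks = defaultdict(list)   # per-process call stack of (function, enter_time)
    totals = defaultdict(float)  # (process, function) -> inclusive seconds
    # Stable sort by timestamp only, so same-time events keep their recorded order.
    for timestamp, process, kind, function in sorted(trace, key=lambda e: e[0]):
        if kind == "enter":
            stacks[process].append((function, timestamp))
        else:
            entered_function, enter_time = stacks[process].pop()
            if entered_function != function:
                raise ValueError("unbalanced enter/leave events")
            totals[(process, function)] += timestamp - enter_time
    return dict(totals)


PROFILE = inclusive_times(TRACE)
for key in sorted(PROFILE):
    print(key, PROFILE[key])
```

The summed times are what a profiler would report. The trace additionally retains when each call happened, which is the information the timeline displays rely on.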

Background
Performance visualization and analysis tool
Targets the visualization of dynamic processes on massively parallel (compute-) systems
Available for major Unix based OS and for Windows
Development started more than 15 years ago at Research Centre Jülich, ZAM
Since 1997, developed at TU Dresden
(first: collaboration with Pallas GmbH, from 2003-2005: Intel Software & Solutions Group, since January 2006: TU Dresden, ZIH / GWT-TUD)
Visualization components (Vampir) are commercial
– Motivation
– Advantages/Disadvantages
Monitor components (VampirTrace) are Open Source

Components

[Diagram: the application runs on CPUs 1 to 10,000. VampirTrace monitors each CPU and writes the application's OTF trace data in parts 1 to m. VampirServer analyzes the trace parts with tasks 1 to n (n << m) and serves Vampir, which displays the trace over time.]

Flavors

Vampir
stabilized sequential version
similar set of displays and options as presented here
less scalable
no ongoing active development

VampirServer
distributed analysis engine
allows server and client on the same workstation as well
new features
windows port in progress

VampirServer

Parallel/distributed server
runs in (part of) production environment
no need to transfer huge traces
parallel I/O
Lightweight client on local workstation
receive visual content only
already adapted to display resolution
moderate network load
Scalability
data volumes > 100 GB
number of processes > 10,000

Outline

Introduction
Prominent Display Types
Performance Analysis Examples
Vampir for Windows

Main Displays

Global Timeline
Process Timeline + Counter
Counter Timeline
Summary Timeline

Summary Chart (aka. Profile)
Message Statistics
Collective Communication Statistics
I/O Statistics
Call Tree, ...

Most Prominent Displays: Global Timeline
Time Axis
MPI Processes
Function Groups
Thumbnail
Black Lines: MPI Messages
Red: MPI Routines
Other Colors: Application Routines

Most Prominent Displays: Single Process Timeline

Time Axis
Call Stack Level
Application Routines
MPI
I/O
Performance Counter

Other Tools

TAU profiling (University of Oregon, USA)
extensive profiling and tracing for parallel applications, plus visualization, comparison, etc.
http://www.cs.uoregon.edu/research/tau/
KOJAK (JSC, FZ Jülich)
very scalable performance tracing
automatic performance analysis and classification
http://www.fz-juelich.de/jsc/kojak/
Paraver (CEPBA, Barcelona, Spain)
trace based parallel performance analysis and visualization
http://www.cepba.upc.edu/paraver/

Approaching Performance Problems

Trace Visualization
Vampir provides a number of display types
each provides many customization options

Advice
make a hypothesis about performance problems
consider application's internal workings if known
select the appropriate display
use statistic displays in conjunction with timelines

Finding Performance Bottlenecks

Four Categories
1. Computation
2. Communication
3. Memory, I/O, …
4. Tracing itself


Computation
unbalanced workload distribution: single late comer(s)
strictly serial parts of program: idle processes/threads
very frequent tiny function calls: call overhead
sparse loops
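
As a rough illustration of how an unbalanced workload can be put into a number, the following Python sketch compares the slowest process with the average. The metric (maximum over mean, minus one), the 25% latecomer threshold, and the per-process times are assumptions chosen for illustration; they are not values or names taken from Vampir.

```python
def load_imbalance(times):
    """Return max/mean - 1: 0.0 means perfectly balanced, 0.5 means the
    slowest process needs 50% more time than the average process."""
    mean = sum(times) / len(times)
    return max(times) / mean - 1.0


def latecomers(times, tolerance=0.25):
    """Return indices of processes whose time exceeds the mean by more
    than the given tolerance (hypothetical threshold)."""
    mean = sum(times) / len(times)
    return [i for i, t in enumerate(times) if t > (1.0 + tolerance) * mean]


# Hypothetical compute time per process in seconds
COMPUTE_SECONDS = [10.0, 10.0, 10.0, 18.0]

IMBALANCE = load_imbalance(COMPUTE_SECONDS)
LATE = latecomers(COMPUTE_SECONDS)
print(IMBALANCE, LATE)
```

A single large value with only one latecomer index matches the "single late comer" pattern: all other processes finish early and wait.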

LM-MUSCAT Air Quality Model: Load Imbalance

Many processes waiting for individual processes to finish chemistry computation.

LM-MUSCAT Air Quality Model: Bad CPU Partitioning

Meteorology processes are waiting most of the time

LM-MUSCAT Air Quality Model: Good CPU Partitioning

More processes for Chemistry-Transport, fewer for Meteorology: better balance

LM-MUSCAT Air Quality Model: Load Imbalance

High load imbalance only during (simulated) sunrise

Examine the runtime behavior of the application

SPEC OMP Benchmark fma3d: Unbalanced Threads

Not well balanced
OpenMP threads

WRF Weather Model: MPI/OpenMP - Idle Threads

Idle Threads


Communication
communication as such (domination over computation)
late sender, late receiver
point-to-point messages instead of
collective communication
unmatched messages
overcharge of MPI buffers
bursts of large messages (bandwidth)
frequent short messages (latency)
unnecessary synchronization (barrier)
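
The late sender and late receiver patterns can be read directly from the enter timestamps of a matched send/receive pair. The Python sketch below is a simplified illustration: the `Message` record and its field names are hypothetical, and it assumes blocking operations whose enter times are already known from the trace.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical record for one matched point-to-point message."""
    sender: int
    receiver: int
    send_enter: float  # time the sender entered the send call
    recv_enter: float  # time the receiver entered the receive call


def classify(msg):
    """Return (pattern, waiting_seconds) for a matched message."""
    if msg.send_enter > msg.recv_enter:
        # The receiver was already blocked in its receive call.
        return "late sender", msg.send_enter - msg.recv_enter
    if msg.recv_enter > msg.send_enter:
        # Only costs time if the send blocks until the receive is posted.
        return "late receiver", msg.recv_enter - msg.send_enter
    return "on time", 0.0


FIRST = classify(Message(sender=0, receiver=1, send_enter=2.5, recv_enter=1.0))
SECOND = classify(Message(sender=1, receiver=0, send_enter=1.0, recv_enter=3.0))
print(FIRST, SECOND)
```

Summing the waiting times per process gives a quick indication of which ranks lose the most time to either pattern.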

High Performance Linpack Using Open MPI

Everything looks OK here

HPL with Alternative MPI Implementation

Several slow messages. MPI problem?

HPL with Alternative MPI Implementation

Transfer rate only 1.63 MB/s!

Tracking down performance problems to individual events
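
A transfer rate for a single event is simply the number of bytes moved divided by the elapsed time of that event. The following Python sketch shows the arithmetic; the function name and the example message size and timestamps are hypothetical and are not the values behind the rate shown on the slide.

```python
def transfer_rate_mb_per_s(num_bytes, start_s, end_s):
    """Return the transfer rate in MB/s (1 MB = 10**6 bytes) for one event."""
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("end time must be later than start time")
    return num_bytes / duration / 1e6


# Hypothetical message: 2,000,000 bytes sent between t = 0.0 s and t = 0.5 s
RATE = transfer_rate_mb_per_s(2_000_000, 0.0, 0.5)
print(RATE)  # 4.0
```

The same calculation applies to I/O events, with the bytes read or written in place of the message size.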


Memory bound computation
inefficient L1/L2/L3 cache usage
TLB misses
detectable via HW performance counters
I/O bound computation
slow input/output
sequential I/O on single process
I/O load imbalance
Exception handling

Performance Counters: Floating Point Exceptions

WRF Weather Model: Floating Point Exceptions
FPU exceptions lead to long runtime of routine ADVECT.
Timeline interval: 77.7ms

Other optimization, no FPU exceptions: only 10ms for the same program section

WRF Weather Model: Low I/O Performance

Transfer Rate
only 389 kB/s!

WRF Weather Model: Slow Metadata Operations

128 processes call open, which takes more than 4 seconds

Semtex CFD Application: Serial I/O

Process 0 is performing I/O …

… while 127 processes are waiting

Complex Cell Application: RAxML (1)

RAxML (Randomized Accelerated Maximum Likelihood)
with 8 SPEs, ramp-up phase


RAxML with 8 SPEs, 4000 ns window
enlargement of a small loop
shifted start of loop, constant runtime


RAxML with 8 SPEs, 4000 ns window
enlargement of a small loop (modified)
synchronous start, memory contention


RAxML with 16 SPEs, load imbalance


Tracing
measurement overhead
– esp. grave for tiny function calls
– solve with selective instrumentation
long, asynchronous trace buffer flushes
too many concurrent counters
– more data
heisenbugs
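
The idea behind selective instrumentation can be sketched in Python with a decorator that records enter/leave events only for functions that are not on a filter list. Everything here is a hypothetical illustration (the `traced` decorator, the `FILTERED` set, and the event layout) and does not reproduce VampirTrace's own filter mechanism.

```python
import functools
import time

EVENTS = []                 # recorded (timestamp, kind, function) tuples
FILTERED = {"tiny_helper"}  # hypothetical filter list of functions to skip


def traced(func):
    """Record enter/leave events unless the function is filtered out."""
    if func.__name__ in FILTERED:
        return func  # no wrapper at all, so no measurement overhead

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        EVENTS.append((time.perf_counter(), "enter", func.__name__))
        try:
            return func(*args, **kwargs)
        finally:
            EVENTS.append((time.perf_counter(), "leave", func.__name__))

    return wrapper


@traced
def tiny_helper(x):
    return x + 1


@traced
def solve(n):
    return sum(tiny_helper(i) for i in range(n))


RESULT = solve(1000)
print(RESULT, len(EVENTS))
```

Without the filter, the 1000 calls to `tiny_helper` would add 2000 events around almost no work, which is the overhead case the list above warns about.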

Product Family

Vampir for UNIX:
VampirClassic (all-in-one, single threaded, Unix OpenMotif based)
VampirServer (parallelized client/server program approach)
Vampir for Windows:
Based on VampirServer’s parallel service engine
New Windows GUI to the harnessed VampirServer services

[Diagram: Vampir Classic is an all-in-one, single-threaded Motif application. Vampir Server connects a parallelized service engine (MPI) via sockets to a visualization client (Motif). Vampir for Windows HPC Server connects a threaded service DLL via an API to a Windows GUI.]

New GUI Layout

Chart Selection
Global Time Selection with Summary

Shared Chart Area

Windows Event Tracing

Windows HPC Server 2008
Microsoft MPI (MS-MPI) integrated with Event Tracing for Windows (ETW) infrastructure
Allows MPI tracing
No “special” builds needed. Just run application with extra mpiexec flag (-trace)
High-precision CPU clock correction for MS-MPI (mpicsync)
Tracing prerequisites:
User must be in Administrator or Performance Log group
Jobs should be executed exclusively in the Windows HPC Server 2008 Scheduler to avoid confusion/conflict of the trace data

Creation of OTF Traces
Run myApp with tracing enabled
Time-Sync the ETL logs
Convert the ETL logs to OTF
Copy OTF files to head node

[Diagram: on each rank node (Rank 0, Rank 1, … Rank N), myApp.exe runs under MS-MPI with ETW and writes an MS-MPI trace (.etl). mpicsync time-syncs the log, etl2otf converts it to an MS-MPI formatted trace (.otf), and copy transfers the result to the head node. On the head node, the shared directory userHome holds myApp.exe and the Trace directory with trace.etl_otf.otf, trace.etl_otf.0.def, …, trace.etl_otf.1.events, trace.etl_otf.2.events, …]



The four steps are created as individual tasks in a cluster job.
The task options allow choosing the number of cores for the job and other parameters.
The “Dependency” setting ensures that the tasks are executed in the right order.
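
The effect of the task dependencies can be illustrated with a small Python sketch that derives a valid execution order from a dependency table. The task names are hypothetical labels for the four steps, and the sketch uses Python's standard `graphlib` module (Python 3.9 or later); it does not call the Windows HPC Server scheduler.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# The names are hypothetical labels for the four steps on the slides.
DEPENDS_ON = {
    "run": set(),
    "time-sync": {"run"},
    "convert": {"time-sync"},
    "copy": {"convert"},
}

ORDER = list(TopologicalSorter(DEPENDS_ON).static_order())
print(ORDER)  # ['run', 'time-sync', 'convert', 'copy']
```

Because each task depends only on the previous one, the chain has exactly one valid order, which is what the dependency setting enforces.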

File system prerequisites:
“\\share\userHome” is the shared user directory throughout the cluster
MPI executable “myApp.exe” is available in the shared directory
“\\share\userHome\Trace” is the directory where the OTF files are collected

Launch program with -tracefile option
mpiexec -wdir \\share\userHome -tracefile %USERPROFILE%\trace.etl myApp.exe
-wdir sets the working directory; myApp.exe has to be here
%USERPROFILE% translates to the local home directory, e.g. “C:\Users\userHome”, on each node
the event log file (.etl) is stored locally in this directory


Time-sync the log files across all nodes
mpiexec -cores 1 -wdir %USERPROFILE% mpicsync trace.etl
“-cores 1”: run only one instance of mpicsync on each node
Format log files to OTF files
mpiexec -cores 1 -wdir %USERPROFILE% etl2otf trace.etl
Copy all OTF files from nodes to trace directory on share
mpiexec -cores 1 -wdir %USERPROFILE% cmd /c copy /y "*_otf*" "\\share\userHome\Trace"

Summary

Hybrid MPI/OpenMP trace file with 1024 cores
256 MPI ranks
4 OpenMP threads per rank
Some feedback from users:
“This was very impressive to see live as they had never seen
their application profiled at this scale, and vampir pointed us at
the problem straight away”
“I was impressed by how detailed MPI functions are visualized in
a zoomable fashion into micro seconds scale. We have some
parallel C# programs currently run on Windows cluster of up to
128 core. I will use Vampir to test on other applications I have.”
Work in progress with regular updates
Completion of charts
Additional information sources from ETW

Thank You

Team
Ronny Brendel
Dr. Holger Brunst
Jens Doleschal
Matthias Jurenz
Dr. Andreas Knüpfer
Matthias Lieber
Christian Mach
Holger Mickler
Dr. Hartmut Mix
Dr. Matthias Müller
Prof. Wolfgang E. Nagel
Michael Peter
Matthias Weber
Thomas William

http://www.vampir.eu
http://www.tu-dresden.de/zih/vampirtrace
