2 Vampir Trace Visualization
  • 1. Performance Analysis on IU’s HPC Systems using Vampir Trace Visualization. Holger Brunst, Center for Information Services and High Performance Computing (ZIH)
  • 2. Outline: Introduction; Prominent Display Types; Performance Analysis Examples; New Vampir GUI
  • 3. Post-mortem Event-based Performance Analysis. Performance optimization remains one of the key issues in parallel programming, so there is a strong need for performance analysis, yet the analysis process is still not easy. Profilers do not give detailed insight into the timing behavior of an application, and detailed online analysis is practically impossible because of intrusion and data volume. Tracing is an option to capture the dynamic behavior of parallel applications; performance analysis is then done on a post-mortem basis.
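A typical VampirTrace workflow recompiles the application with the VampirTrace compiler wrappers so that an OTF trace is written at program exit; the sketch below follows the VampirTrace manual, but the source file name and process count are hypothetical:

```shell
# Build with automatic instrumentation: vtcc/vtcxx/vtf90 wrap the
# native compilers and insert VampirTrace probes.
vtcc -c jacobi.c -o jacobi.o   # jacobi.c is a placeholder source file
vtcc jacobi.o -o jacobi

# Optional runtime settings (VampirTrace environment variables):
export VT_FILE_PREFIX=jacobi   # name prefix of the resulting OTF trace
export VT_BUFFER_SIZE=64M      # per-process trace buffer size

# Run as usual; the OTF trace is written at the end of the run and
# can then be loaded into Vampir for post-mortem analysis.
mpirun -np 4 ./jacobi
```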
  • 4. Background. Vampir is a performance visualization and analysis tool that targets the visualization of dynamic processes on massively parallel (compute) systems. It is available for major Unix-based operating systems and for Windows. Development started more than 15 years ago at Research Centre Jülich, ZAM; since 1997 it has been developed at TU Dresden (first in collaboration with Pallas GmbH, from 2003-2005 with the Intel Software & Solutions Group, and since January 2006 at TU Dresden, ZIH / GWT-TUD). The visualization components (Vampir) are commercial; the monitor components (VampirTrace) are Open Source.
  • 5. Components. [Architecture diagram: the application, instrumented by VampirTrace, writes an OTF trace split into per-CPU parts (Part 1 … Part m, up to 10,000 CPUs); VampirServer analyzes the trace data over time with tasks 1 … n, where n << m.]
  • 6. Flavors. Vampir: stabilized sequential version; similar set of displays and options as presented here; less scalable; no ongoing active development. VampirServer: distributed analysis engine; allows server and client on the same workstation as well; new features; Windows port in progress.
  • 7. VampirServer. Parallel/distributed server: runs in (part of) the production environment; no need to transfer huge traces; parallel I/O. Lightweight client on the local workstation: receives visual content only, already adapted to the display resolution; moderate network load. Scalability: data volumes > 100 GB; number of processes > 10,000.
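In current Vampir releases the distributed engine is started through a launcher script roughly as below; command names and output vary between versions, so treat this as an illustrative sketch rather than the exact interface of the version shown in these slides:

```shell
# Launch the analysis engine with 16 worker processes on the cluster;
# the launcher reports the host/port for the GUI client to connect to.
vampirserver start -n 16
# (illustrative output) Server listens on: node042:30000

# Shut the engine down again when the analysis session is finished.
vampirserver stop
```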
  • 8. Outline: Introduction; Prominent Display Types; Performance Analysis Examples; Vampir for Windows
  • 9. Main Displays: Global Timeline; Process Timeline + Counter; Counter Timeline; Summary Timeline; Summary Chart (aka Profile); Message Statistics; Collective Communication Statistics; I/O Statistics; Call Tree; ...
  • 10. Most Prominent Displays: Global Timeline. [Screenshot annotations: time axis; MPI processes; function groups; thumbnail; black lines: MPI messages; red: MPI routines; other colors: application routines.]
  • 11. Most Prominent Displays: Single Process Timeline. [Screenshot annotations: time axis; call stack level; application routines; MPI; I/O; performance counter.]
  • 12. Other Tools. TAU (University of Oregon, USA): extensive profiling and tracing for parallel applications, visualization, comparison, etc. KOJAK (JSC, FZ Jülich): very scalable performance tracing; automatic performance analysis and classification. Paraver (CEPBA, Barcelona, Spain): trace-based parallel performance analysis and visualization.
  • 13. Outline: Introduction; Prominent Display Types; Performance Analysis Examples; New Vampir GUI
  • 14. Approaching Performance Problems. Trace visualization: Vampir provides a number of display types, each with many customization options. Advice: make a hypothesis about performance problems; consider the application's internal workings if known; select the appropriate display; use statistic displays in conjunction with timelines.
  • 15. Finding Performance Bottlenecks Four Categories 1. Computation 2. Communication 3. Memory, I/O, … 4. Tracing itself
  • 16. Finding Performance Bottlenecks. Computation: unbalanced workload distribution (single late comer(s)); strictly serial parts of the program (idle processes/threads); very frequent tiny function calls (call overhead); sparse loops.
  • 17. LM-MUSCAT Air Quality Model: Load Imbalance. Many processes waiting for individual processes to finish chemistry computation.
  • 18. LM-MUSCAT Air Quality Model: Bad CPU Partitioning Meteorology Processes are waiting most of the time
  • 19. LM-MUSCAT Air Quality Model: Good CPU Partitioning. More processes for Chemistry-Transport, fewer for Meteorology: better balance.
  • 20. LM-MUSCAT Air Quality Model: Load Imbalance. High load imbalance only during (simulated) sunrise. Examine the runtime behavior of the application.
  • 21. SPEC OMP Benchmark fma3d: Unbalanced Threads Not well balanced OpenMP threads
  • 22. WRF Weather Model: MPI/OpenMP - Idle Threads Idle Threads
  • 23. Finding Performance Bottlenecks. Communication: communication as such (domination over computation); late sender, late receiver; point-to-point messages instead of collective communication; unmatched messages; overcharge of MPI buffers; bursts of large messages (bandwidth); frequent short messages (latency); unnecessary synchronization (barrier).
  • 24. High Performance Linpack Using Open MPI Everything looks ok here
  • 25. HPL with Alternative MPI Implementation Several slow Messages. MPI Problem?
  • 26. HPL with Alternative MPI Implementation Transfer Rate only 1.63 MB/s! Tracking down Performance Problems to individual Events
  • 27. Finding Performance Bottlenecks. Memory-bound computation: inefficient L1/L2/L3 cache usage; TLB misses; detectable via HW performance counters. I/O-bound computation: slow input/output; sequential I/O on a single process; I/O load imbalance. Exception handling.
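Hardware counters are recorded by VampirTrace through PAPI when the VT_METRICS environment variable is set before the run; the counter names below are standard PAPI preset events, but availability depends on the CPU, and ./myapp is a placeholder binary:

```shell
# Record L2 total cache misses and floating-point operations along
# the timeline; multiple PAPI preset events are separated by ":".
export VT_METRICS=PAPI_L2_TCM:PAPI_FP_OPS
mpirun -np 4 ./myapp   # placeholder for the instrumented application
```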
  • 28. Performance Counters: Floating Point Exceptions
  • 29. WRF Weather Model: Floating Point Exceptions. FPU exceptions lead to a long runtime of routine ADVECT (timeline interval: 77.7 ms). With different optimization and no FPU exceptions: only 10 ms for the same program section.
  • 30. WRF Weather Model: Low I/O Performance Transfer Rate only 389 kB/s!
  • 31. WRF Weather Model: Slow Metadata Operations 128 Processes call open – takes more than 4 seconds
  • 32. Semtex CFD Application: Serial I/O Process 0 is performing I/O … … while 127 processes are waiting
  • 33. Complex Cell Application: RAxML (1) RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase
  • 34. Complex Cell Application: RAxML (2) RAxML with 8 SPEs, 4000 ns window enlargement of a small loop shifted start of loop, constant runtime
  • 35. Complex Cell Application: RAxML (3) RAxML with 8 SPEs, 4000 ns window enlargement of a small loop (modified) synchronous start, memory contention
  • 36. Complex Cell Application: RAxML (4) RAxML with 16 SPEs, load imbalance
  • 37. Finding Performance Bottlenecks. Tracing: measurement overhead (especially grave for tiny function calls; solve with selective instrumentation); long, asynchronous trace buffer flushes; too many concurrent counters (more data); heisenbugs.
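Selective instrumentation can be applied at runtime with a VampirTrace filter file, which caps how many calls of a function are recorded (a limit of 0 drops the function entirely); the function names below are hypothetical:

```shell
# Create a filter file: "name(s) -- limit" per line, wildcards allowed.
cat > filter.spec <<'EOF'
tiny_helper;get_element -- 0
interpolate_* -- 100000
EOF

# Point VampirTrace at the filter and run the instrumented binary.
export VT_FILTER_SPEC=$PWD/filter.spec
mpirun -np 4 ./myapp   # placeholder for the instrumented application
```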
  • 38. Trace Buffer Flush
  • 39. Outline: Introduction; Prominent Display Types; Performance Analysis Examples; New Vampir GUI
  • 40. Product Family. Vampir for UNIX: VampirClassic, an all-in-one, single-threaded OpenMotif application; VampirServer, a parallelized (MPI) client/server approach in which a Motif GUI connects via sockets to a parallelized visualization service engine. Vampir for Windows HPC Server: based on VampirServer's parallel service engine, with a new threaded Windows GUI harnessing the VampirServer services through a DLL API.
  • 41. New GUI Layout. [Screenshot annotations: chart selection; global time selection with summary; shared chart area.]
  • 42. Chart Overview
  • 43. Chart Arrangement
  • 44. Windows Event Tracing. Windows HPC Server 2008: Microsoft MPI (MS-MPI) is integrated with the Event Tracing for Windows (ETW) infrastructure, which allows MPI tracing. No “special” builds are needed; just run the application with an extra mpiexec flag (-trace). High-precision CPU clock correction for MS-MPI is provided by mpicsync. Tracing prerequisites: the user must be in the Administrators or Performance Log group, and jobs should be executed exclusively through the Windows HPC Server 2008 scheduler to avoid confusion/conflict of the trace data.
  • 45. Creation of OTF Traces. [Workflow diagram: on each rank node (rank 0 … rank N), run myApp.exe with MS-MPI/ETW tracing enabled, producing a local event trace (trace.etl); time-sync the ETL logs with mpicsync; convert the ETL logs to OTF with etl2otf; copy the formatted OTF files (trace.etl_otf.otf, trace.etl_otf.0.def, …) to the shared user home on the head node.]
  • 46. Creation of OTF Traces. The four steps are created as individual tasks in a cluster job. The task options allow choosing the number of cores for the job and other parameters. The “Dependency” setting ensures the right order of execution of the tasks.
  • 47. Creation of OTF Traces. File system prerequisites: “\\share\userHome” is the shared user directory throughout the cluster; the MPI executable “myApp.exe” is available in this shared directory; “\\share\userHome\Trace” is the directory where the OTF files are collected. Launch the program with the -tracefile option: mpiexec -wdir \\share\userHome -tracefile %USERPROFILE%\trace.etl myApp.exe. Here -wdir sets the working directory (myApp.exe has to be there); %USERPROFILE% translates to the local home directory, e.g. “C:\Users\userHome”, on each node; the event log file (.etl) is stored locally in this directory.
  • 48. Creation of OTF Traces. Time-sync the log files throughout all nodes: mpiexec -cores 1 -wdir %USERPROFILE% mpicsync trace.etl (“-cores 1” runs only one instance of mpicsync on each node). Convert the log files to OTF files: mpiexec -cores 1 -wdir %USERPROFILE% etl2otf trace.etl. Copy all OTF files from the nodes to the trace directory on the share: mpiexec -cores 1 -wdir %USERPROFILE% cmd /c copy /y “*_otf*” “\\share\userHome\Trace”.
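Put together, the four steps from slides 45-48 form one command sequence (Windows HPC Server 2008 syntax, with the slides' example paths restored to their backslashed form):

```shell
:: 1. Run the application with ETW-based MPI tracing enabled
mpiexec -wdir \\share\userHome -tracefile %USERPROFILE%\trace.etl myApp.exe

:: 2. Time-sync the local event logs (one mpicsync instance per node)
mpiexec -cores 1 -wdir %USERPROFILE% mpicsync trace.etl

:: 3. Convert the event logs to OTF
mpiexec -cores 1 -wdir %USERPROFILE% etl2otf trace.etl

:: 4. Collect the OTF files in the shared trace directory
mpiexec -cores 1 -wdir %USERPROFILE% cmd /c copy /y "*_otf*" "\\share\userHome\Trace"
```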
  • 49. Summary. Hybrid MPI/OpenMP trace file with 1024 cores: 256 MPI ranks, 4 OpenMP threads per rank. Some feedback from users: “This was very impressive to see live as they had never seen their application profiled at this scale, and vampir pointed us at the problem straight away”; “I was impressed by how detailed MPI functions are visualized in a zoomable fashion into micro seconds scale. We have some parallel C# programs currently run on Windows cluster of up to 128 core. I will use Vampir to test on other applications I have.” Work in progress with regular updates: completion of charts; additional information sources from ETW.
  • 50. Thank You. Team: Ronny Brendel, Dr. Holger Brunst, Jens Doleschal, Matthias Jurenz, Dr. Andreas Knüpfer, Matthias Lieber, Christian Mach, Holger Mickler, Dr. Hartmut Mix, Dr. Matthias Müller, Prof. Wolfgang E. Nagel, Michael Peter, Matthias Weber, Thomas William.