Center for Information Services and High Performance Computing (ZIH)




Performance Analysis on IU’s HPC Systems using Vampir
Trace Visualization

        Holger Brunst
        e-mail: holger.brunst@tu-dresden.de
Outline

Introduction
Prominent Display Types
Performance Analysis Examples
New Vampir GUI
Post-mortem Event-based Performance Analysis

Performance optimization remains one of the key issues in parallel programming
Strong need for performance analysis; the analysis process is still not easy
Profilers do not give detailed insight into the timing behavior of an application
Detailed online analysis is practically impossible because of intrusion and data volume
Tracing is an option to capture the dynamic behavior of parallel applications
Performance analysis is done on a post-mortem basis
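The post-mortem approach can be sketched in a few lines: the monitor only appends timestamped events at run time, and all timing analysis is a separate pass over the finished log. A minimal plain-Python illustration (not VampirTrace's actual OTF machinery):

```python
from collections import defaultdict

def inclusive_times(events):
    """Post-mortem pass over a finished event log: inclusive time per function."""
    totals = defaultdict(float)
    stack = []
    for ts, kind, name in events:
        if kind == "enter":
            stack.append((name, ts))
        else:  # "leave"
            top, start = stack.pop()
            assert top == name, "unbalanced enter/leave events"
            totals[name] += ts - start
    return dict(totals)

# Synthetic log a monitor might have recorded: main() calls solve() once
log = [
    (0.0, "enter", "main"),
    (1.0, "enter", "solve"),
    (3.0, "leave", "solve"),
    (4.0, "leave", "main"),
]
times = inclusive_times(log)  # solve: 2.0 s, main: 4.0 s (inclusive)
```

The key property this illustrates: the run only pays for appending events, while arbitrarily expensive analysis happens afterwards, on a different machine if desired.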
Background
Performance visualization and analysis tool
Targets the visualization of dynamic processes on massively parallel (compute) systems
Available for major Unix-based OSes and for Windows
Development started more than 15 years ago at Research Centre Jülich, ZAM
Since 1997, developed at TU Dresden
(first: collaboration with Pallas GmbH,
2003-2005: Intel Software & Solutions Group,
since January 2006: TU Dresden, ZIH / GWT-TUD)
Visualization components (Vampir) are commercial
 – Motivation
 – Advantages/Disadvantages
Monitor components (VampirTrace) are Open Source
Components

[Diagram] Each application process is linked against VampirTrace, which records its events as one part of an OTF trace: CPU 1 writes part 1, CPU 2 writes part 2, and so on, up to 10,000 CPUs writing m parts in total. Vampir/VampirServer then reads the OTF trace data and analyzes it with n analysis tasks, where n << m.
Flavors

Vampir
 stabilized sequential version
 similar set of displays and options as presented here
 less scalable
 no ongoing active development


VampirServer
 distributed analysis engine
 allows server and client on the same workstation as well
 new features
Windows port in progress
VampirServer

Parallel/distributed server
  runs in (part of) production environment
  no need to transfer huge traces
  parallel I/O
Lightweight client on local workstation
  receive visual content only
  already adapted to display resolution
  moderate network load
Scalability
  data volumes > 100 GB
  number of processes > 10,000
Outline

Introduction
Prominent Display Types
Performance Analysis Examples
Vampir for Windows
Main Displays

Global Timeline
Process Timeline + Counter
Counter Timeline
Summary Timeline


Summary Chart (aka. Profile)
Message Statistics
Collective Communication Statistics
I/O Statistics
Call Tree, ...
Most Prominent Displays: Global Timeline

[Screenshot] The horizontal axis is the time axis; each row is an MPI process. Black lines between rows are MPI messages. Red blocks are MPI routines, other colors are application routines, organized into function groups; a thumbnail shows the current position within the full trace.
Most Prominent Displays: Single Process Timeline

[Screenshot] The horizontal axis is the time axis; the vertical axis shows the call stack level. Application routines, MPI, and I/O appear as stacked blocks, with a performance counter plotted underneath.
Other Tools

TAU profiling (University of Oregon, USA)
  extensive profiling and tracing for parallel applications, with visualization, comparison, etc.
  http://www.cs.uoregon.edu/research/tau/
KOJAK (JSC, FZ Jülich)
  very scalable performance tracing
  automatic performance analysis and classification
  http://www.fz-juelich.de/jsc/kojak/
Paraver (CEPBA, Barcelona, Spain)
  trace-based parallel performance analysis and visualization
  http://www.cepba.upc.edu/paraver/
Outline

Introduction
Prominent Display Types
Performance Analysis Examples
New Vampir GUI
Approaching Performance Problems

Trace Visualization
  Vampir provides a number of display types
  each provides many customization options


Advice
  make a hypothesis about the performance problem
  consider the application's internal workings, if known
  select the appropriate display
  use statistics displays in conjunction with timelines
Finding Performance Bottlenecks

Four Categories
1. Computation
2. Communication
3. Memory, I/O, …
4. Tracing itself
Finding Performance Bottlenecks

Computation
 unbalanced workload distribution: single late comer(s)
 strictly serial parts of program: idle processes/threads
 very frequent tiny function calls: call overhead
 sparse loops
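The "very frequent tiny function calls" point is easy to demonstrate outside any tracer. A small Python sketch comparing a loop of one-line function calls against the same arithmetic inlined (the exact ratio varies by machine and interpreter):

```python
import timeit

def tiny(x):
    # the useful work is a single addition; call/return machinery dominates
    return x + 1

n = 200_000
t_calls = timeit.timeit("for i in range(n): acc = tiny(i)",
                        globals=globals(), number=1)
t_inline = timeit.timeit("for i in range(n): acc = i + 1",
                         globals=globals(), number=1)
ratio = t_calls / t_inline  # typically well above 1 in CPython
```

The same effect hits tracers doubly: tiny functions are both slow to call and, when instrumented, generate enter/leave events that outweigh the work they measure.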
LM-MUSCAT Air Quality Model: Load Imbalance

[Screenshot] Many processes waiting for individual processes to finish the chemistry computation.
LM-MUSCAT Air Quality Model: Bad CPU Partitioning

[Screenshot] Meteorology processes are waiting most of the time.
LM-MUSCAT Air Quality Model: Good CPU Partitioning

[Screenshot] More processes for chemistry-transport, fewer for meteorology: better balance.
LM-MUSCAT Air Quality Model: Load Imbalance

[Screenshot] High load imbalance only during (simulated) sunrise. Examine the runtime behavior of the application.
SPEC OMP Benchmark fma3d: Unbalanced Threads

[Screenshot] OpenMP threads are not well balanced.
WRF Weather Model: MPI/OpenMP - Idle Threads

[Screenshot] Idle threads.
Finding Performance Bottlenecks

Communication
 communication as such (dominating over computation)
 late sender, late receiver
 point-to-point messages instead of collective communication
 unmatched messages
 overloaded MPI buffers
 bursts of large messages (bandwidth)
 frequent short messages (latency)
 unnecessary synchronization (barrier)
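The bandwidth vs. latency distinction in the last two points can be made concrete with a first-order cost model. The latency and bandwidth constants below are assumptions for illustration, not measurements:

```python
# First-order message cost model: t = latency + size / bandwidth.
LATENCY = 2e-6       # 2 µs per message (assumed)
BANDWIDTH = 1e9      # 1 GB/s (assumed)

def transfer_time(total_bytes, messages=1):
    # total time to move the payload, split into `messages` equal messages
    per_msg = total_bytes / messages
    return messages * (LATENCY + per_msg / BANDWIDTH)

total = 8_000_000  # 8 MB payload
t_burst = transfer_time(total, messages=1)        # bandwidth-bound: ~8.0 ms
t_chatty = transfer_time(total, messages=10_000)  # latency-bound: ~28 ms
```

Same payload, 3.5x the time: the 10,000-message variant pays 20 ms of pure latency, which is why aggregating frequent short messages is a standard fix.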
High Performance Linpack Using Open MPI

[Screenshot] Everything looks OK here.
HPL with Alternative MPI Implementation

[Screenshot] Several slow messages. An MPI problem?
HPL with Alternative MPI Implementation

[Screenshot] Transfer rate only 1.63 MB/s! Tracking down performance problems to individual events.
Finding Performance Bottlenecks

Memory-bound computation
  inefficient L1/L2/L3 cache usage
  TLB misses
  detectable via HW performance counters
I/O-bound computation
  slow input/output
  sequential I/O on a single process
  I/O load imbalance
Exception handling
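The cost of sequential I/O on a single process can be estimated with a back-of-the-envelope model. All rates below are assumed values for illustration, not measurements:

```python
# Rank 0 gathers and writes everything while everyone else waits, vs.
# all ranks writing their share concurrently to a parallel file system.
P = 128              # processes
per_rank = 50e6      # bytes produced by each rank (assumed)
single = 200e6       # B/s a single writer achieves (assumed)
aggregate = 2e9      # aggregate B/s of the parallel file system (assumed)

t_serial = P * per_rank / single        # 32.0 s, with 127 ranks idle
t_parallel = P * per_rank / aggregate   # 3.2 s
speedup = t_serial / t_parallel         # 10x
```

This is exactly the pattern the Semtex example later in the talk shows in a timeline: one busy I/O row and 127 waiting rows.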
Performance Counters: Floating Point Exceptions
WRF Weather Model: Floating Point Exceptions

[Screenshot] FPU exceptions lead to a long runtime of routine ADVECT; timeline interval: 77.7 ms. With a different optimization and no FPU exceptions: only 10 ms for the same program section.
WRF Weather Model: Low I/O Performance

[Screenshot] Transfer rate only 389 kB/s!
WRF Weather Model: Slow Metadata Operations

[Screenshot] 128 processes call open, which takes more than 4 seconds.
Semtex CFD Application: Serial I/O

[Screenshot] Process 0 is performing I/O … while 127 processes are waiting.
Complex Cell Application: RAxML (1)

[Screenshot] RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase.
Complex Cell Application: RAxML (2)

[Screenshot] RAxML with 8 SPEs, 4000 ns window: enlargement of a small loop; shifted start of the loop, constant runtime.
Complex Cell Application: RAxML (3)

[Screenshot] RAxML with 8 SPEs, 4000 ns window: enlargement of a small loop (modified); synchronous start, memory contention.
Complex Cell Application: RAxML (4)

[Screenshot] RAxML with 16 SPEs, load imbalance.
Finding Performance Bottlenecks

Tracing
  measurement overhead
   – especially severe for tiny function calls
   – solve with selective instrumentation
  long, asynchronous trace buffer flushes
  too many concurrent counters
   – more data
  Heisenbugs
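Selective instrumentation in VampirTrace is typically configured through a runtime filter file (referenced, as far as I recall, via the VT_FILTER_SPEC environment variable). The sketch below uses hypothetical function names, and the exact syntax should be checked against the manual of your installed version:

```
# exclude tiny, frequently called helpers from the trace entirely
tiny_helper;small_*  -- 0
# stop recording any other function after 3000000 calls
*  -- 3000000
```

Filtering the hottest tiny functions is usually the first remedy for both measurement overhead and oversized traces.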
Trace Buffer Flush
Outline

Introduction
Prominent Display Types
Performance Analysis Examples
New Vampir GUI
Product Family

Vampir for UNIX:
    VampirClassic: all-in-one, single-threaded, OpenMotif-based Unix application
    VampirServer: parallelized (MPI) client/server approach: a parallelized service engine connected via sockets to a Motif visualization client

Vampir for Windows (HPC Server):
    based on VampirServer's parallel service engine, running as a threaded service DLL
    a new Windows GUI attached via an API to the harnessed VampirServer services
New GUI Layout

[Screenshot] Chart selection on the left, global time selection with summary at the top, and a shared chart area below.
Chart Overview
Chart Arrangement
Windows Event Tracing

Windows HPC Server 2008
      Microsoft MPI (MS-MPI) is integrated with the
      Event Tracing for Windows (ETW) infrastructure
      Allows MPI tracing
No “special” builds needed; just run the application with an extra mpiexec flag (-trace)
High-precision CPU clock correction for MS-MPI (mpicsync)
Tracing prerequisites:
      The user must be in the Administrators or Performance Log group
      Jobs should be executed exclusively through the Windows HPC
      Server 2008 scheduler to avoid confusion/conflicts in the trace data
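The clock correction step exists because each node has its own clock; without it, messages can appear to arrive before they were sent. A generic NTP-style sketch of offset estimation, not mpicsync's actual algorithm, with made-up timestamps:

```python
# A node sends a ping at t_send (local clock), the remote stamps the echo
# at t_remote (remote clock), and the reply arrives at t_recv (local clock).
def estimate_offset(t_send, t_remote, t_recv):
    # Assuming symmetric network delay, the remote clock leads the local
    # clock by the difference between the remote stamp and the midpoint
    # of the local round trip.
    return t_remote - (t_send + t_recv) / 2.0

# local clock: send at 10.000 s, reply at 10.004 s;
# remote clock stamped the echo at 12.503 s
offset = estimate_offset(10.000, 12.503, 10.004)  # remote leads by ~2.501 s
```

Applying such per-node offsets (and drift corrections) post-mortem is what makes timestamps from different nodes comparable in one timeline.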




Creation of OTF Traces

[Diagram] The workflow, running on the rank 0 through rank N nodes:
Run myApp with tracing enabled: on each node, MS-MPI and ETW write a local event trace (.etl)
Time-sync the ETL logs: mpicsync
Convert the ETL logs to OTF: etl2otf
Copy the formatted OTF files to the head node

Resulting layout in the shared user directory on the head node:
share
  userHome
    myApp.exe
    Trace
      trace.etl_otf.otf
      trace.etl_otf.0.def
      trace.etl_otf.1.events
      trace.etl_otf.2.events
      …
Creation of OTF Traces

The four steps are created as individual tasks in a cluster job. The task options allow choosing the number of cores for the job and other parameters. Setting “Dependency” ensures the right order of execution of the tasks.
Creation of OTF Traces
File system prerequisites:
   “\\share\userHome” is the shared user directory throughout the cluster
   the MPI executable “myApp.exe” is available in this shared directory
   “\\share\userHome\Trace” is the directory where the OTF files are collected


Launch the program with the -tracefile option:
  mpiexec -wdir \\share\userHome -tracefile %USERPROFILE%\trace.etl myApp.exe
  -wdir sets the working directory; myApp.exe has to be there
  %USERPROFILE% translates to the local home directory, e.g. “C:\Users\userHome”, on each node
  the event log file (.etl) is stored locally in this directory
Creation of OTF Traces

Time-sync the log files across all nodes
       mpiexec -cores 1 -wdir %USERPROFILE% mpicsync trace.etl
       “-cores 1”: run only one instance of mpicsync on each node
Convert the log files to OTF files
       mpiexec -cores 1 -wdir %USERPROFILE% etl2otf trace.etl
Copy all OTF files from the nodes to the trace directory on the share
       mpiexec -cores 1 -wdir %USERPROFILE% cmd /c copy /y “*_otf*” “\\share\userHome\Trace”




Summary

Hybrid MPI/OpenMP trace file with 1024 cores
      256 MPI ranks
      4 OpenMP threads per rank
Some feedback from users:
      “This was very impressive to see live as they had never seen
      their application profiled at this scale, and vampir pointed us at
      the problem straight away”
      “I was impressed by how detailed MPI functions are visualized in
      a zoomable fashion into micro seconds scale. We have some
      parallel C# programs currently run on Windows cluster of up to
      128 core. I will use Vampir to test on other applications I have.”
Work in progress with regular updates
      Completion of charts
      Additional information sources from ETW
Thank You

                     Team
                  Ronny Brendel
               Dr. Holger Brunst
                  Jens Doleschal
                 Matthias Jurenz
             Dr. Andreas Knüpfer
                  Matthias Lieber
                  Christian Mach
                   Holger Mickler
                 Dr. Hartmut Mix
              Dr. Matthias Müller
          Prof. Wolfgang E. Nagel
                    Michael Peter
                 Matthias Weber
                 Thomas William


http://www.vampir.eu
http://www.tu-dresden.de/zih/vampirtrace

More Related Content

What's hot

Improving Passive Packet Capture : Beyond Device Polling
Improving Passive Packet Capture : Beyond Device PollingImproving Passive Packet Capture : Beyond Device Polling
Improving Passive Packet Capture : Beyond Device Polling
Hargyo T. Nugroho
 
Pipeline Mechanism
Pipeline MechanismPipeline Mechanism
Pipeline Mechanism
Ashik Iqbal
 
Coverage Solutions on Emulators
Coverage Solutions on EmulatorsCoverage Solutions on Emulators
Coverage Solutions on Emulators
DVClub
 
Uvm presentation dac2011_final
Uvm presentation dac2011_finalUvm presentation dac2011_final
Uvm presentation dac2011_final
sean chen
 

What's hot (20)

OpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersOpenMP Tutorial for Beginners
OpenMP Tutorial for Beginners
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMP
 
Presentation on Shared Memory Parallel Programming
Presentation on Shared Memory Parallel ProgrammingPresentation on Shared Memory Parallel Programming
Presentation on Shared Memory Parallel Programming
 
Open mp
Open mpOpen mp
Open mp
 
MPI n OpenMP
MPI n OpenMPMPI n OpenMP
MPI n OpenMP
 
Co emulation of scan-chain based designs
Co emulation of scan-chain based designsCo emulation of scan-chain based designs
Co emulation of scan-chain based designs
 
openmp
openmpopenmp
openmp
 
Open mp directives
Open mp directivesOpen mp directives
Open mp directives
 
XMPP - Introduction And LAS Implementation (Presentation)
XMPP - Introduction And LAS  Implementation (Presentation)XMPP - Introduction And LAS  Implementation (Presentation)
XMPP - Introduction And LAS Implementation (Presentation)
 
Lect15
Lect15Lect15
Lect15
 
Improving Passive Packet Capture : Beyond Device Polling
Improving Passive Packet Capture : Beyond Device PollingImproving Passive Packet Capture : Beyond Device Polling
Improving Passive Packet Capture : Beyond Device Polling
 
Multicore
MulticoreMulticore
Multicore
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Pipeline hazard
Pipeline hazardPipeline hazard
Pipeline hazard
 
Pipeline Mechanism
Pipeline MechanismPipeline Mechanism
Pipeline Mechanism
 
Coverage Solutions on Emulators
Coverage Solutions on EmulatorsCoverage Solutions on Emulators
Coverage Solutions on Emulators
 
Uvm presentation dac2011_final
Uvm presentation dac2011_finalUvm presentation dac2011_final
Uvm presentation dac2011_final
 
Chris brown ti
Chris brown tiChris brown ti
Chris brown ti
 
python-csp: bringing OCCAM to Python
python-csp: bringing OCCAM to Pythonpython-csp: bringing OCCAM to Python
python-csp: bringing OCCAM to Python
 

Similar to 2 Vampir Trace Visualization

Trace Visualization
Trace VisualizationTrace Visualization
Trace Visualization
PTIHPA
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
Stephan Ewen
 
1 Vampir Overview
1 Vampir Overview1 Vampir Overview
1 Vampir Overview
PTIHPA
 
2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement
PTIHPA
 
Know More About Rational Performance - Snehamoy K
Know More About Rational Performance - Snehamoy KKnow More About Rational Performance - Snehamoy K
Know More About Rational Performance - Snehamoy K
Roopa Nadkarni
 
3 know more_about_rational_performance_tester_8-1-snehamoy_k
3 know more_about_rational_performance_tester_8-1-snehamoy_k3 know more_about_rational_performance_tester_8-1-snehamoy_k
3 know more_about_rational_performance_tester_8-1-snehamoy_k
IBM
 
Containerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor toolContainerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor tool
Ganesan Narayanasamy
 

Similar to 2 Vampir Trace Visualization (20)

Trace Visualization
Trace VisualizationTrace Visualization
Trace Visualization
 
Apache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World LondonApache Flink@ Strata & Hadoop World London
Apache Flink@ Strata & Hadoop World London
 
TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Introduction to NBL
Introduction to NBLIntroduction to NBL
Introduction to NBL
 
1 Vampir Overview
1 Vampir Overview1 Vampir Overview
1 Vampir Overview
 
2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement2010 02 instrumentation_and_runtime_measurement
2010 02 instrumentation_and_runtime_measurement
 
Summit 16: The Hitchhiker/Hacker's Guide to NFV Benchmarking
Summit 16: The Hitchhiker/Hacker's Guide to NFV BenchmarkingSummit 16: The Hitchhiker/Hacker's Guide to NFV Benchmarking
Summit 16: The Hitchhiker/Hacker's Guide to NFV Benchmarking
 
Know More About Rational Performance - Snehamoy K
Know More About Rational Performance - Snehamoy KKnow More About Rational Performance - Snehamoy K
Know More About Rational Performance - Snehamoy K
 
3 know more_about_rational_performance_tester_8-1-snehamoy_k
3 know more_about_rational_performance_tester_8-1-snehamoy_k3 know more_about_rational_performance_tester_8-1-snehamoy_k
3 know more_about_rational_performance_tester_8-1-snehamoy_k
 
Ch1
Ch1Ch1
Ch1
 
Ch1
Ch1Ch1
Ch1
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
 
Java multi thread programming on cmp system
Java multi thread programming on cmp systemJava multi thread programming on cmp system
Java multi thread programming on cmp system
 
Containerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor toolContainerizing HPC and AI applications using E4S and Performance Monitor tool
Containerizing HPC and AI applications using E4S and Performance Monitor tool
 
Comparison of Open Source Virtualization Technology
Comparison of Open Source Virtualization TechnologyComparison of Open Source Virtualization Technology
Comparison of Open Source Virtualization Technology
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
OpenMP
OpenMPOpenMP
OpenMP
 
LPAR2RRD on CZ/SK common 2014
LPAR2RRD on CZ/SK common 2014LPAR2RRD on CZ/SK common 2014
LPAR2RRD on CZ/SK common 2014
 
News In The Net40
News In The Net40News In The Net40
News In The Net40
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in Streams
 

More from PTIHPA

Github:fi Presentation
Github:fi PresentationGithub:fi Presentation
Github:fi Presentation
PTIHPA
 
2010 05 hands_on
2010 05 hands_on2010 05 hands_on
2010 05 hands_on
PTIHPA
 
2010 vampir workshop_iu_configuration
2010 vampir workshop_iu_configuration2010 vampir workshop_iu_configuration
2010 vampir workshop_iu_configuration
PTIHPA
 
2010 03 papi_indiana
2010 03 papi_indiana2010 03 papi_indiana
2010 03 papi_indiana
PTIHPA
 
Overview: Event Based Program Analysis
Overview: Event Based Program AnalysisOverview: Event Based Program Analysis
Overview: Event Based Program Analysis
PTIHPA
 
Switc Hpa
Switc HpaSwitc Hpa
Switc Hpa
PTIHPA
 
Statewide It Robert Henschel
Statewide It Robert HenschelStatewide It Robert Henschel
Statewide It Robert Henschel
PTIHPA
 
3 Vampir Trace In Detail
3 Vampir Trace In Detail3 Vampir Trace In Detail
3 Vampir Trace In Detail
PTIHPA
 
5 Vampir Configuration At IU
5 Vampir Configuration At IU5 Vampir Configuration At IU
5 Vampir Configuration At IU
PTIHPA
 
4 HPA Examples Of Vampir Usage
4 HPA Examples Of Vampir Usage4 HPA Examples Of Vampir Usage
4 HPA Examples Of Vampir Usage
PTIHPA
 
GeneIndex: an open source parallel program for enumerating and locating words...
GeneIndex: an open source parallel program for enumerating and locating words...GeneIndex: an open source parallel program for enumerating and locating words...
GeneIndex: an open source parallel program for enumerating and locating words...
PTIHPA
 
Implementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
Implementing 3D SPHARM Surfaces Registration on Cell B.E. ProcessorImplementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
Implementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
PTIHPA
 
Big Iron and Parallel Processing, USArray Data Processing Workshop
Big Iron and Parallel Processing, USArray Data Processing WorkshopBig Iron and Parallel Processing, USArray Data Processing Workshop
Big Iron and Parallel Processing, USArray Data Processing Workshop
PTIHPA
 

More from PTIHPA (13)

Github:fi Presentation
Github:fi PresentationGithub:fi Presentation
Github:fi Presentation
 
2010 05 hands_on
2010 05 hands_on2010 05 hands_on
2010 05 hands_on
 
2010 vampir workshop_iu_configuration
2010 vampir workshop_iu_configuration2010 vampir workshop_iu_configuration
2010 vampir workshop_iu_configuration
 
2010 03 papi_indiana
2010 03 papi_indiana2010 03 papi_indiana
2010 03 papi_indiana
 
Overview: Event Based Program Analysis
Overview: Event Based Program AnalysisOverview: Event Based Program Analysis
Overview: Event Based Program Analysis
 
Switc Hpa
Switc HpaSwitc Hpa
Switc Hpa
 
Statewide It Robert Henschel
Statewide It Robert HenschelStatewide It Robert Henschel
Statewide It Robert Henschel
 
3 Vampir Trace In Detail
3 Vampir Trace In Detail3 Vampir Trace In Detail
3 Vampir Trace In Detail
 
5 Vampir Configuration At IU
5 Vampir Configuration At IU5 Vampir Configuration At IU
5 Vampir Configuration At IU
 
4 HPA Examples Of Vampir Usage
4 HPA Examples Of Vampir Usage4 HPA Examples Of Vampir Usage
4 HPA Examples Of Vampir Usage
 
GeneIndex: an open source parallel program for enumerating and locating words...
GeneIndex: an open source parallel program for enumerating and locating words...GeneIndex: an open source parallel program for enumerating and locating words...
GeneIndex: an open source parallel program for enumerating and locating words...
 
Implementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
Implementing 3D SPHARM Surfaces Registration on Cell B.E. ProcessorImplementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
Implementing 3D SPHARM Surfaces Registration on Cell B.E. Processor
 
Big Iron and Parallel Processing, USArray Data Processing Workshop
Big Iron and Parallel Processing, USArray Data Processing WorkshopBig Iron and Parallel Processing, USArray Data Processing Workshop
Big Iron and Parallel Processing, USArray Data Processing Workshop
 

Recently uploaded

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Recently uploaded (20)

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 

2 Vampir Trace Visualization

  • 1. Center for Information Services and High Performance Computing (ZIH) Performance Analysis on IU’s HPC Systems using Vampir Trace Visualization Holger Brunst e-mail: holger.brunst@tu-dresden.de
  • 3. Post-mortem Event-based Performance Analysis Performance optimization remains one of the key issues in parallel programming Strong need for performance analysis, analysis process still not easy Profilers do not give detailed insight into timing behavior of an application Detailed online analysis pretty much impossible because of intrusion and data amount Tracing is an option to capture the dynamic behavior of parallel applications Performance analysis done on a post-mortem basis
  • 4. Background
    – Performance visualization and analysis tool
    – Targets the visualization of dynamic processes on massively parallel (compute) systems
    – Available for major Unix-based OSes and for Windows
    – Development started more than 15 years ago at Research Centre Jülich, ZAM
    – Since 1997 developed at TU Dresden (first in collaboration with Pallas GmbH; 2003–2005: Intel Software & Solutions Group; since January 2006: TU Dresden, ZIH / GWT-TUD)
    – Visualization components (Vampir) are commercial
      – Motivation
      – Advantages/Disadvantages
    – Monitor components (VampirTrace) are Open Source
  • 5. Components (diagram): each CPU runs the application instrumented with VampirTrace, writing trace data in OTF parts (Part 1 … Part m, scaling up to 10,000 CPUs); VampirServer reads the OTF trace with n analysis tasks (Task 1 … Task n, n << m) and serves the Vampir client
  • 6. Flavors
    – Vampir: stabilized sequential version; similar set of displays and options as presented here; less scalable; no ongoing active development
    – VampirServer: distributed analysis engine; allows server and client on the same workstation as well; new features; Windows port in progress
  • 7. VampirServer
    – Parallel/distributed server: runs in (part of) the production environment; no need to transfer huge traces; parallel I/O
    – Lightweight client on the local workstation: receives visual content only, already adapted to the display resolution; moderate network load
    – Scalability: data volumes > 100 GB; number of processes > 10,000
  • 8. Outline
    – Introduction
    – Prominent Display Types
    – Performance Analysis Examples
    – Vampir for Windows
  • 9. Main Displays
    – Global Timeline
    – Process Timeline + Counter
    – Counter Timeline
    – Summary Timeline
    – Summary Chart (aka Profile)
    – Message Statistics
    – Collective Communication Statistics
    – I/O Statistics
    – Call Tree, ...
  • 10. Most Prominent Displays: Global Timeline (annotated screenshot) – time axis, MPI processes, function groups, thumbnail; black lines: MPI messages; red: MPI routines; other colors: application routines
  • 11. Most Prominent Displays: Single Process Timeline (annotated screenshot) – time axis, call stack level, application routines, MPI, I/O, performance counter
  • 12. Other Tools
    – TAU (University of Oregon, USA): extensive profiling and tracing for parallel applications, with visualization, comparison, etc. http://www.cs.uoregon.edu/research/tau/
    – KOJAK (JSC, FZ Jülich): very scalable performance tracing; automatic performance analysis and classification http://www.fz-juelich.de/jsc/kojak/
    – Paraver (CEPBA, Barcelona, Spain): trace-based parallel performance analysis and visualization http://www.cepba.upc.edu/paraver/
  • 14. Approaching Performance Problems with Trace Visualization
    – Vampir provides a number of display types; each provides many customization options
    – Advice: make a hypothesis about performance problems; consider the application's internal workings if known; select the appropriate display; use statistic displays in conjunction with timelines
  • 15. Finding Performance Bottlenecks – Four Categories
    1. Computation
    2. Communication
    3. Memory, I/O, …
    4. Tracing itself
  • 16. Finding Performance Bottlenecks – Computation
    – unbalanced workload distribution: single late comer(s)
    – strictly serial parts of the program: idle processes/threads
    – very frequent tiny function calls: call overhead
    – sparse loops
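The "very frequent tiny function calls" item deserves a concrete illustration. The toy below (plain Python, not tied to Vampir) does the same arithmetic two ways: one call per element, which pays call overhead on every iteration, and an inlined formulation that performs the identical work without per-element calls; timing both makes the overhead visible.

```python
import timeit

def tiny(x):
    # tiny function: a single multiply per call -> call overhead dominates
    return x * 1.0001

def per_call(data):
    # one function call per element: n call overheads
    return [tiny(x) for x in data]

def inlined(data):
    # identical arithmetic with the tiny call manually inlined
    return [x * 1.0001 for x in data]

data = list(range(10000))
assert per_call(data) == inlined(data)  # same results, different cost
t_call = timeit.timeit(lambda: per_call(data), number=20)
t_inl = timeit.timeit(lambda: inlined(data), number=20)
```

In a trace, this problem shows up as a huge number of very short function events; the usual fixes are inlining, loop fusion, or excluding such functions from instrumentation.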
  • 17. LM-MUSCAT Air Quality Model: Load Imbalance – many processes waiting for individual processes to finish the chemistry computation
  • 18. LM-MUSCAT Air Quality Model: Bad CPU Partitioning – meteorology processes are waiting most of the time
  • 19. LM-MUSCAT Air Quality Model: Good CPU Partitioning – more processes for chemistry transport, fewer for meteorology: better balance
  • 20. LM-MUSCAT Air Quality Model: Load Imbalance – high load imbalance only during (simulated) sunrise; examine the runtime behavior of the application
  • 21. SPEC OMP Benchmark fma3d: Unbalanced Threads – OpenMP threads are not well balanced
  • 22. WRF Weather Model: MPI/OpenMP – Idle Threads
  • 23. Finding Performance Bottlenecks – Communication
    – communication as such (domination over computation)
    – late sender, late receiver
    – point-to-point messages instead of collective communication
    – unmatched messages
    – overcharge of MPI buffers
    – bursts of large messages (bandwidth)
    – frequent short messages (latency)
    – unnecessary synchronization (barrier)
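The bandwidth/latency items above can be quantified with the classic alpha-beta cost model: each message costs a fixed latency plus its size divided by bandwidth. The sketch below uses illustrative (assumed, not measured) network parameters to show why many frequent short messages lose to one aggregated transfer of the same payload.

```python
# Assumed illustrative network parameters (not from the slides):
LATENCY = 2e-6      # alpha: 2 microseconds fixed cost per message
BANDWIDTH = 1e9     # beta-inverse: 1 GB/s sustained bandwidth

def transfer_time(n_messages, bytes_per_message):
    """Alpha-beta model: each message costs latency + size/bandwidth."""
    return n_messages * (LATENCY + bytes_per_message / BANDWIDTH)

total = 8 * 1024 * 1024                    # 8 MB payload in total
t_many_small = transfer_time(8192, 1024)   # 8192 messages of 1 KB each
t_one_large = transfer_time(1, total)      # one aggregated 8 MB message
# With these parameters the per-message latency term dominates the
# many-small variant, which is exactly the "frequent short messages
# (latency)" pattern a timeline full of thin message lines reveals.
```

The same model also explains the converse case: once messages are large, the bandwidth term dominates and bursts of large messages saturate the network instead.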
  • 24. High Performance Linpack Using Open MPI Everything looks ok here
  • 25. HPL with Alternative MPI Implementation – several slow messages; an MPI problem?
  • 26. HPL with Alternative MPI Implementation – transfer rate only 1.63 MB/s! Tracking down performance problems to individual events
  • 27. Finding Performance Bottlenecks – Memory, I/O, Exceptions
    – memory-bound computation: inefficient L1/L2/L3 cache usage; TLB misses; detectable via HW performance counters
    – I/O-bound computation: slow input/output; sequential I/O on a single process; I/O load imbalance
    – exception handling
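The cache-usage item is easiest to see with a concrete access pattern. The toy below computes the same reduction over a 2D array twice: once with the innermost loop walking consecutive elements (cache friendly in row-major layouts such as C arrays or Fortran's transposed convention), and once jumping a whole row per step, so each step touches a different cache line. The results are identical; only the memory access pattern differs, and that difference is exactly what hardware counters (cache/TLB misses) expose in a trace.

```python
N = 256
# Row-major toy matrix holding the values 0 .. N*N-1
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(m):
    # innermost index j varies fastest: consecutive elements of a row
    return sum(m[i][j] for i in range(N) for j in range(N))

def sum_col_major(m):
    # innermost index i varies fastest: stride of one whole row per step
    return sum(m[i][j] for j in range(N) for i in range(N))

# Identical result either way; in a compiled, memory-bound kernel the
# strided variant would show far more cache misses at the same FLOP count.
```

Python itself hides most cache effects behind interpreter overhead, so treat this strictly as a sketch of the pattern, not a benchmark.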
  • 28. Performance Counters: Floating Point Exceptions
  • 29. WRF Weather Model: Floating Point Exceptions – FPU exceptions lead to a long runtime of routine ADVECT (timeline interval: 77.7 ms); with other optimization and no FPU exceptions: only 10 ms for the same program section
  • 30. WRF Weather Model: Low I/O Performance Transfer Rate only 389 kB/s!
  • 31. WRF Weather Model: Slow Metadata Operations – 128 processes call open, which takes more than 4 seconds
  • 32. Semtex CFD Application: Serial I/O – process 0 is performing I/O while 127 processes are waiting
  • 33. Complex Cell Application: RAxML (1) RAxML (Randomized Accelerated Maximum Likelihood) with 8 SPEs, ramp-up phase
  • 34. Complex Cell Application: RAxML (2) – RAxML with 8 SPEs, 4000 ns window; enlargement of a small loop: shifted start of loop, constant runtime
  • 35. Complex Cell Application: RAxML (3) – RAxML with 8 SPEs, 4000 ns window; enlargement of a small loop (modified): synchronous start, memory contention
  • 36. Complex Cell Application: RAxML (4) RAxML with 16 SPEs, load imbalance
  • 37. Finding Performance Bottlenecks – Tracing
    – measurement overhead: especially grave for tiny function calls; solve with selective instrumentation
    – long, asynchronous trace buffer flushes
    – too many concurrent counters – more data
    – heisenbugs
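Selective instrumentation in VampirTrace is typically done with a filter specification file that limits or suppresses recording of named functions. A sketch of such a file follows; the function names are hypothetical, and the exact syntax (patterns separated by `;`, followed by `--` and a call limit, with `0` meaning "never record") should be checked against the manual of the installed VampirTrace version.

```
# vt-filter.spec – reduce measurement overhead of tiny, hot functions
# (hypothetical function names; adjust patterns to your application)

# record at most 3000 calls of these helpers, then stop tracing them
my_tiny_helper*;vec_kernel_* -- 3000

# never record these at all
log_debug* -- 0
```

The filter file is activated before the run, e.g. via `export VT_FILTER_SPEC=$PWD/vt-filter.spec`, so the excluded calls never enter the trace buffer and neither the overhead nor the data volume problem arises.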
  • 40. Product Family (diagram)
    – For UNIX: Vampir Classic – all-in-one, single-threaded OpenMotif application; VampirServer – parallelized (MPI) client/server approach: a parallelized service engine connected via sockets to a Motif visualization client
    – For Windows: Vampir for Windows HPC Server – based on VampirServer's parallel service engine, harnessed as a Windows service DLL behind a new threaded Windows GUI (GUI talks to the services through a DLL API)
  • 41. New GUI Layout (screenshot) – chart selection; global time selection with summary; shared chart area
  • 44. Windows Event Tracing
    – Windows HPC Server 2008: Microsoft MPI (MS-MPI) is integrated with the Event Tracing for Windows (ETW) infrastructure
    – Allows MPI tracing: no “special” builds needed; just run the application with an extra mpiexec flag (-trace)
    – High-precision CPU clock correction for MS-MPI (mpicsync)
    – Tracing prerequisites: the user must be in the Administrator or Performance Log group; jobs should be executed exclusively in the Windows HPC Server 2008 Scheduler to avoid confusion/conflict of the trace data
  • 45. Creation of OTF Traces (diagram) – on each rank node: run myApp.exe with MS-MPI tracing enabled (ETW writes a local .etl trace); time-sync the ETL logs (mpicsync); convert the ETL logs to OTF (etl2otf); copy the OTF files (trace.etl_otf.otf, trace.etl_otf.0.def, trace.etl_otf.*.events, …) to the Trace directory in the shared user home on the head node
  • 46. Creation of OTF Traces – the four steps are created as individual tasks in a cluster job; the task options allow choosing the number of cores for the job and other parameters; under “Dependency” the right order of execution of the tasks can be ensured
  • 47. Creation of OTF Traces
    – File system prerequisites: “\\share\userHome” is the shared user directory throughout the cluster; the MPI executable “myApp.exe” is available in this shared directory; “\\share\userHome\Trace” is the directory where the OTF files are collected
    – Launch the program with the -tracefile option:
      mpiexec -wdir \\share\userHome -tracefile %USERPROFILE%\trace.etl myApp.exe
    – -wdir sets the working directory; myApp.exe has to be there
    – %USERPROFILE% translates to the local home directory, e.g. “C:\Users\userHome”, on each node; the event log file (.etl) is stored locally in this directory
  • 48. Creation of OTF Traces
    – Time-sync the log files throughout all nodes:
      mpiexec -cores 1 -wdir %USERPROFILE% mpicsync trace.etl
      (“-cores 1”: run only one instance of mpicsync on each node)
    – Format the log files to OTF files:
      mpiexec -cores 1 -wdir %USERPROFILE% etl2otf trace.etl
    – Copy all OTF files from the nodes to the trace directory on the share:
      mpiexec -cores 1 -wdir %USERPROFILE% cmd /c copy /y “*_otf*” “\\share\userHome\Trace”
  • 49. Summary
    – Hybrid MPI/OpenMP trace file with 1024 cores: 256 MPI ranks, 4 OpenMP threads per rank
    – Some feedback from users:
      “This was very impressive to see live as they had never seen their application profiled at this scale, and vampir pointed us at the problem straight away”
      “I was impressed by how detailed MPI functions are visualized in a zoomable fashion into micro seconds scale. We have some parallel C# programs currently run on Windows cluster of up to 128 core. I will use Vampir to test on other applications I have.”
    – Work in progress with regular updates: completion of charts; additional information sources from ETW
  • 50. Thank You
    – Team: Ronny Brendel, Dr. Holger Brunst, Jens Doleschal, Matthias Jurenz, Dr. Andreas Knüpfer, Matthias Lieber, Christian Mach, Holger Mickler, Dr. Hartmut Mix, Dr. Matthias Müller, Prof. Wolfgang E. Nagel, Michael Peter, Matthias Weber, Thomas William
    – http://www.vampir.eu
    – http://www.tu-dresden.de/zih/vampirtrace