This document summarizes a collaboration between the Technische Universität Dresden and Indiana University to analyze the performance of a molecular dynamics simulation code. The code was tested using different parallelization methods including OpenMP, MPI, and hybrid implementations. Initial tests on a Cray XT5m supercomputer identified the best performing serial configuration. Further parallel analysis aimed to measure scalability and detect bottlenecks limiting efficiency.
This document introduces supervised topic models, which are extensions of latent Dirichlet allocation (LDA) that allow topic models to be fit explicitly for prediction tasks. Supervised LDA models documents and their associated response variables (like ratings or categories) jointly, with the goal of discovering topics predictive of the responses. The model assumes the response depends on the topic proportions of the document, allowing it to blend generative and discriminative modeling by conditioning the response on the words through the topic assignments.
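The response model described above can be sketched in a few lines: sLDA assumes the expected response is a linear function of the document's empirical topic proportions (the mean of its per-word topic assignments). The topic assignments and regression coefficients below are illustrative, not from the paper.

```python
# Sketch of the sLDA response step: the document's response is modeled as
# a linear function of its empirical topic proportions z_bar (the mean of
# per-word topic assignments). Assignments and coefficients are illustrative.

def topic_proportions(z, num_topics):
    """Empirical topic proportions z_bar from per-word topic assignments."""
    counts = [0] * num_topics
    for topic in z:
        counts[topic] += 1
    return [c / len(z) for c in counts]

def predict_response(z, eta, num_topics):
    """E[y | z] = eta . z_bar, the linear response assumed by sLDA."""
    z_bar = topic_proportions(z, num_topics)
    return sum(e * p for e, p in zip(eta, z_bar))

# A 6-word document whose words were assigned to 3 topics:
z = [0, 0, 1, 1, 1, 2]
eta = [1.0, 3.0, -2.0]   # illustrative regression coefficients
print(predict_response(z, eta, 3))   # → 1.5
```

In the full model the coefficients η and the topics are fit jointly, so the discovered topics are those most predictive of the responses.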
Continuous Markov random fields are a general formalism for modeling joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficiency. Continuous MRFs are a general-purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence.
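As a toy illustration of sampling-based marginal computation (not the paper's algorithm), the sketch below runs Gibbs sampling on a two-variable continuous MRF with Gaussian potentials and reports the marginal mean of one variable together with its standard deviation, used as a confidence measure in the spirit of the abstract:

```python
import random
import statistics

# Toy continuous MRF over (x, y) with potentials exp(-x^2/2), exp(-y^2/2),
# and pairwise exp(-(x-y)^2/2). Each conditional is Gaussian
# (mean = other/2, variance = 1/2), so Gibbs steps are exact draws.
# Potentials are illustrative, chosen so the true marginal is known.

def gibbs_marginal(num_samples=20000, burn_in=1000, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    samples = []
    for step in range(burn_in + num_samples):
        x = rng.gauss(y / 2.0, 0.5 ** 0.5)   # draw x | y
        y = rng.gauss(x / 2.0, 0.5 ** 0.5)   # draw y | x
        if step >= burn_in:
            samples.append(x)
    return statistics.mean(samples), statistics.stdev(samples)

mean, std = gibbs_marginal()
# True marginal of x here: mean 0, std sqrt(2/3) ≈ 0.816.
print(f"marginal of x: mean={mean:.3f}, std={std:.3f} (std as confidence)")
```

A wide marginal standard deviation signals low confidence in the inferred value, which is the behaviour exploited in the collective-classification experiments.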
This document discusses using a relevance vector machine (RVM) for classifying remotely sensed images. It proposes a methodology that involves extracting features from remote sensing images using wavelet transforms, then classifying the features using an RVM. The RVM classification results in fewer "relevance vectors" than other methods, allowing for faster classification, which is important for applications requiring low complexity or real-time classification. The document provides background on RVMs and describes the key steps of the proposed classification methodology.
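A minimal sketch of the feature-extraction stage, assuming a one-level Haar wavelet transform on a single row of pixels; the RVM classifier itself is omitted here, and the sub-band energies would serve as its input features:

```python
# Sketch of the wavelet feature-extraction stage only: a one-level Haar
# transform of a 1-D pixel row, summarized by sub-band energies. The RVM
# classifier is omitted; the energies would be its inputs. The input row
# and the normalisation (averaging pairs) are illustrative choices.

def haar_level1(signal):
    """One-level Haar transform: approximation and detail coefficients."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def band_energies(signal):
    """Energy of each sub-band, a common texture feature vector."""
    approx, detail = haar_level1(signal)
    return (sum(a * a for a in approx), sum(d * d for d in detail))

row = [10, 12, 8, 8, 30, 2, 14, 18]   # one row of illustrative pixel values
print(band_energies(row))   # → (697.0, 201.0)
```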
The document summarizes the M-tree, a new access method for organizing and searching large datasets in metric spaces. The M-tree is a balanced tree that partitions objects based on their relative distances as measured by a distance function, with objects stored in fixed-size nodes. It can index objects using arbitrary distance functions as long as they satisfy the metric properties. The M-tree aims to reduce both the number of accessed nodes and distance computations needed for similarity queries, improving performance for CPU-intensive distance functions. Algorithms for range and k-nearest neighbor queries are described that leverage distance information stored in the M-tree to prune search spaces.
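The pruning idea can be sketched for a single leaf node: each entry carries its precomputed distance to the node's routing object, and the triangle inequality eliminates candidates without any new distance computations. The metric and data below are illustrative; this is not the paper's full algorithm, which also handles internal nodes and covering radii.

```python
# Sketch of M-tree pruning for a range query on one leaf node: each entry
# stores its precomputed distance to the node's routing object (pivot),
# and the triangle inequality lets us skip any entry with
# |d(q, pivot) - d(pivot, o)| > r without computing d(q, o).
# The metric (absolute difference on numbers) and data are illustrative.

def metric(a, b):
    return abs(a - b)

def range_query_leaf(query, radius, pivot, entries):
    """entries: list of (object, precomputed_distance_to_pivot)."""
    d_q_pivot = metric(query, pivot)        # one distance computation
    results, computed = [], 0
    for obj, d_obj_pivot in entries:
        if abs(d_q_pivot - d_obj_pivot) > radius:
            continue                        # pruned: d(q, obj) must exceed radius
        computed += 1
        if metric(query, obj) <= radius:
            results.append(obj)
    return results, computed

pivot = 50
objects = [10, 48, 52, 55, 90]
entries = [(o, metric(o, pivot)) for o in objects]
print(range_query_leaf(47, 5, pivot, entries))   # → ([48, 52], 3)
```

Of five candidates, only three distance computations are needed; the saving grows with a CPU-intensive metric such as edit distance.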
Quantifying the Effect of Content-based Transport Strategies for Online Role ... – Academia Sinica
The QoS requirements of game messages may differ because of their intrinsic characteristics. In this paper, we propose three content-based strategies for quantifying the effects of different QoS levels. The strategies assign appropriate QoS requirements for game messages based on our analysis of Angel’s Love action logs. We evaluate several transport protocols, including TCP, UDP, SCTP, DCCP, and our content-based transport protocol using the action logs of Angel’s Love. Through simulations, we quantify the performance of our content-based strategies. The results show that the strategies incur much lower end-to-end delay and end-to-end jitter than existing transport protocols.
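The general idea of content-based QoS assignment can be sketched as a mapping from message type to transport requirements. The message types and QoS levels below are assumptions for illustration, not the strategies actually derived from the Angel’s Love logs:

```python
# Illustration of content-based QoS assignment: game messages are mapped
# to transport requirements based on their type. These types and levels
# are assumptions for illustration, not the paper's strategies.

QOS_BY_TYPE = {
    "trade":    {"reliable": True,  "ordered": True},   # must not be lost
    "chat":     {"reliable": True,  "ordered": False},
    "movement": {"reliable": False, "ordered": False},  # superseded quickly
}

def qos_for(message):
    """Pick QoS requirements for a message; default to fully reliable."""
    return QOS_BY_TYPE.get(message["type"], {"reliable": True, "ordered": True})

msgs = [{"type": "movement", "x": 3}, {"type": "trade", "item": "sword"}]
for m in msgs:
    print(m["type"], qos_for(m))
```

Movement updates tolerate loss because newer updates supersede them, while trade messages need reliable, ordered delivery; a content-aware transport can then map each class onto the cheapest protocol mechanism that satisfies it.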
1) SystemCV is a tool that combines SystemC hardware/software simulation with visualization capabilities. It allows for visualization of modules and the overall system as well as incremental development of communication and computation models.
2) SystemCV uses the SystemC architecture standard and adds a Visualization Tool Kit (VTK) for visualization. It supports hardware/software co-simulation, concurrent implementation, architecture exploration, and hierarchical modular design.
3) An example case study called Jigsaw is presented which uses SystemCV for designing an imaging system with 2D visualization of data flow and communications between modules.
ADAPTING METRICS FOR MUSIC SIMILARITY USING COMPARATIVE RATINGS – Daniel Wolff
This poster presents a machine learning approach for analysing user data that specifies song similarity. Understanding how we relate and compare music has been a topic of great interest in musicology as well as for business applications, such as music recommender systems. The way music is compared seems to vary between different cultures. Adapting a generic model to user ratings is useful for personalisation and can help to better understand such differences.
• Evaluation: 5-fold cross-validation with test sets of ~106 binary rankings, evaluating fulfilled rankings. Test-set results: MLR 82%, mlrDiag 71%.
• Data: TagATune gamers have to agree on the “outlier” clip out of 3. Data cover 533 clip triplets (avg. 14 votes per triplet), with 1019 clips included.
IPsoft was founded in 1998 to optimize IT through automated expert systems that emulate engineers' skills. Their technology proactively resolves over half of issues without human intervention. Embracing automation provides a marketplace advantage according to Deloitte. Clients realize 20-45% cost reductions through IPsoft's outsourced management. IPsoft compares favorably to other MSPs by resolving issues autonomously at scale through their consolidated IT management platform IPcenter.
The document appears to be a project portfolio from an individual named Richard Berkheimer. It contains his contact information including address, phone number, email and links to his Coroflot and LinkedIn profiles. The portfolio contains repeated text indicating it contains multiple projects, but no details are provided about any specific projects. The document ends with a thank you message signed by Richard Berkheimer.
1. MediaMind is committed to the long-term success of the data-driven agency model.
2. Its platform is easy to use, open to integrations and customization.
3. MediaMind is able to support and scale campaigns with local expertise.
Antioch College has a history of over 160 years as one of the first nonsectarian and coeducational colleges in the United States. Throughout the 20th century, Antioch College pioneered liberal arts education by introducing self-designed majors, interdisciplinary studies, and study abroad programs. Today, Antioch College continues this tradition of innovation by engaging students in social causes and applying classroom lessons to real-world situations.
The document describes three temporary interactive exhibit concepts for a Child & Young Adult Cancer Institute. Concept A features a rooftop garden theme with motion tracking technology allowing children to interact with virtual butterflies. It includes five interactive areas at different heights and an iPad station to control a rooftop camera. Concept B replaces motion tracking with traditional activities like a telescope to discover creatures and augmented reality through iPad stations. Concept C features an aquatic theme where motion tracking immerses children in a virtual underwater world interacting with fish. It includes circular portals displaying animated characters that respond to children's movements.
The Pennsylvania State University (Penn State) – Mervin Smucker
Penn State is a public university located in University Park, Pennsylvania with over 100,000 total students across multiple campuses. The primary campus has over 40,000 students across 13 schools and houses over 60% of first-year students, though many transfer later. Penn State is highly ranked, spending over $750 million annually on research programs and participating in international academic exchanges. Athletic teams compete in the Big Ten conference, with strong football and women's rugby programs.
The document provides information on job openings at the DOEACC Society in New Delhi, India. It lists openings for the positions of Director, Chief Controller of Examinations, Chief Finance Officer, Hindi Typist, and Consultant (Civil). It outlines the educational qualifications and experience required for each role, as well as the pay scale and method of recruitment. Candidates are instructed to submit applications and supporting documents within 45 days.
The HPC will support open source and open access infrastructures for a variety of sectors engaged with ‘hybrid publishing’ (combining web, print, multi-platform distribution and social media). Foremost among these are the worlds of academic and independent publishing and the HPC will develop technical, financial and workflow models for both, as well as work towards the launch of its own university press. The consortium has made a general commitment to open access publishing as a means to remove the artificial barriers that readers and authors encounter in their engagement with critical and scholarly work.
The lab’s dedication to ‘open source infrastructure’ groups together the many technical and social processes that can benefit academic/independent publishers, granting them the sustained attention and resourcing they demand. These include multi-platform delivery, collaborative writing, the ability to circumvent sales monopolies, open IPR and distribution into ‘open education’ environments.
The consortium will be a meeting point for the many stakeholders in open access academic and independent publishing – the authors, the readers, the publishers and the technologists. The HPC looks to be a connection point for these communities and will use rapid prototyping and agile development models to support, improve and network existing open source projects. The ultimate objective is to provide easy and inexpensive tool sets to allow publishers to make the switch to open IPR and multi-platform publishing.
Single source – this is a key architectural principle of digital multi-platform publishing: a single master document exists separately from platform-dependent design. The master document is stored with universal metadata and a granular schema, making it available, in part or as a whole, on any new platform or distribution channel.
Partnerships – working with partners in the publishing and technology sectors, HPC will support spin-offs and start-ups to service our user community. These range from development partners such as LShift and Pandora to publishing networks such as Mute and Eurozine.
Projects – the HPC has two initial projects as well as other ongoing strands of research.
Indy portal – a multi-platform system and open IPR business model for independent publishers aimed at bypassing online digital book distribution monopolies.
A multi-platform plugin for Open Journal Systems (OJS) – the project would add multi-platform publication conversion to the main open access publishing workflow tool, OJS, and its sister software package Open Monograph Press (OMP).
This document provides an overview of conducting an industry and market analysis. It outlines 7 steps for a general product-market assessment including defining the market/industry, clarifying the industry structure, identifying key players, listing competitive products, identifying new product trends, and defining technology barriers. It also includes charts on industry structure, assessing product placement options, innovation structures, and the marketing mix. The document is intended as a primer on research steps and concepts for analyzing an industry and market.
The document discusses several topics related to complexity and architecture in military operations. It presents different types of complexity and how standardization or adaptation can address them. It also contrasts traditional centralized enterprise architectures with more distributed "extraprise" models using multiple interoperable applications. Finally, it proposes an adaptive spiral approach to transition planning that evaluates initiatives based on factors like potential impact, familiarity, and timing to optimize resources across the portfolio.
This document summarizes an upcoming conference on automotive cockpit human-machine interfaces (HMI). The 2nd International Conference on Automotive Cockpit HMI will be held from September 28-30, 2011 in Darmstadt, Germany. The conference will discuss the latest trends in HMI design, electromobility, automated driving, and integrating mobile devices. Speakers will present on HMI concepts from automakers like BMW, Renault, and Ford. Participants can choose from interactive workshop sessions on topics like design thinking, future HMI outlook, and designing for elderly drivers. Over 80% of last year's attendees were satisfied with the conference content and organization.
This document outlines 14 potential "game changers" or emerging technologies for 2011-2012:
1. Consumerized IT and knowledge workers who are always connected via mobile devices.
2. Changes in storage and bandwidth brought by new networks, technologies, and price reductions enabling new applications.
3. Curated computing that simplifies support through standardized solutions.
4. Hybrid cloud computing models using private, public, and service-based clouds.
5. A new operating system for the internet.
6. New generations of appliances and mobile apps.
7. Increased role of gamification, social networks, and crowd sourcing.
8. Emergence of sensors and location-
Parametric studies for the AEC domain using InteliGrid platform – Matevz Dolenc
The document discusses the InteliGrid project, which aims to build an information infrastructure for sharing resources across European engineering industries. It describes InteliGrid's architecture, which includes a portal, authorization services, and broker services. As a use case, it examines parametric studies for earthquake engineering using Incremental Dynamic Analysis (IDA) on the InteliGrid platform. Performance analysis showed near-linear speedup when using 5-25 computers versus a single computer for 280 IDA analyses.
IBM Research-Tokyo conducts research across a wide range of areas including core computer sciences, multi-core architectures, machine learning, mobile devices, displays, internet technologies, and deep computing with the goal of creating innovations that provide real-world value. They are part of a global network of over 3000 researchers working on fundamental advances and patent filings across these technical domains.
Java on the real-time hardware implementation feasibility of joint radio res... – ecwayerode
This document discusses the feasibility of implementing joint radio resource management policies in heterogeneous wireless networks in real-time. It proposes novel JRRM techniques based on linear programming that coordinate the use of total available radio resources across different radio access technologies to maximize satisfaction for all users. The paper investigates the computational cost of implementing the proposed JRRM algorithms on DSP platforms commonly used in mobile stations and demonstrates their feasibility for future heterogeneous wireless systems.
Edinburgh Data-Intensive Research. "Data-intensive" refers to huge volumes of data, complex patterns of data integration and analysis, and intricate interactions between data and users. Current methods and tools are failing to address data-intensive challenges effectively. They fail for several reasons, all of which are aspects of scalability: the deluge of computational methods and the plethora of computational systems prevent effective and efficient use of resources; user interfaces are not adopted at a sufficient rate to satisfy demand for scientific computing; and data and knowledge are created outside contexts suitable for effective collaborative research. The Edinburgh Data-Intensive Research group addresses these scalability issues by providing mappings from abstract formulations to concrete, optimised executions of research challenges, by developing intuitive interfaces to access and steer these executions, and by developing systems that aid in creating new research challenges. In this talk I will present several exemplars where we have dealt with scalability issues in scientific scenarios.
... two decades of correlation, hierarchies, networks and clustering in financial markets
Summary of some of my past research work at Complex Networks 2022.
The study of correlations, hierarchies, networks and communities (or clustering) has more than 20 years of history in econophysics.
However, for the practitioner, it seems that these tools are not fully ready yet:
Many questions around their proper use for trading or risk monitoring are left unanswered.
Deep Learning might help solve some hard problems such as finding more reliably communities (or clusters) and their number.
Running large simulations (based on GANs, VAEs or realistic market simulators) could also help understand when complex networks methods can give wrong insights (e.g. not enough data, or not stationary enough; too low correlations).
Conference: Complex Networks 2022 in Palermo, Sicily, Italy.
The document summarizes a workshop on the Solar Atlas for the Mediterranean. The atlas aims to improve knowledge of solar resource availability through high-resolution satellite-based mapping of solar radiation across Southern Europe and Northern Africa. Open access to the data is intended to support solar energy development by providing information to policymakers and investors. The project is led by a consortium of organizations working to develop strategies to bring more renewable energy into energy systems through resource assessments, technical potential studies, and market development initiatives.
Monesh Patel provides a value proposition for himself as an engineering leader with over 20 years of experience. He highlights key strengths such as balancing big picture thinking with attention to detail. He also emphasizes business orientation, people skills, communication skills, and a bias toward continuous improvement. Examples are given of past work leading development of technically challenging imaging systems and serving as chief system architect. As a design for six sigma change agent, he influenced culture change through training others and demonstrating cost savings from various projects.
The document discusses Simplify, a framework for enabling fast functional simulation of multiprocessor system-on-chips (MPSoCs). Simplify uses an abstract MPSoC platform model to allow for easy modeling of MPSoC architectures and fast behavioral simulation. It integrates an operating system and supports task migration and communication between processors. Experimental results show that Simplify achieves scalable simulation performance and allows for online design, simulation, and debugging of MPSoCs.
Residual balanced attention network for real-time traffic scene semantic segm... – IJECEIAES
Intelligent transportation systems (ITS) are among the most active research areas of this century. Autonomous driving in particular involves advanced road-safety monitoring tasks, including identifying dangers on the road and protecting pedestrians. In the last few years, deep learning (DL) approaches, especially convolutional neural networks (CNNs), have been extensively used to solve ITS problems such as traffic scene semantic segmentation and traffic sign classification. Semantic segmentation is an important task long addressed in computer vision (CV). Traffic scene semantic segmentation using CNNs requires high precision with few computational resources to perceive and segment the scene in real time; however, related work often focuses on only one aspect, either precision or the number of computational parameters. In this regard, we propose RBANet, a robust and lightweight CNN that uses a newly proposed balanced attention module and a newly proposed residual module. We then trained the proposed RBANet with three loss functions to find the best combination, using only 0.74M parameters. RBANet has been evaluated on CamVid, the most widely used dataset in semantic segmentation, and performs well in terms of parameter requirements and precision compared to related work.
The document appears to be a project portfolio from an individual named Richard Berkheimer. It contains his contact information including address, phone number, email and links to his Coroflot and LinkedIn profiles. The portfolio contains repeated text indicating it contains multiple projects, but no details are provided about any specific projects. The document ends with a thank you message signed by Richard Berkheimer.
1. MediaMind is committed to the long-term success of the data-driven agency model.
2. Its platform is easy to use, open to integrations and customization.
3. MediaMind is able to support and scale campaigns with local expertise.
Antioch College has a history of over 160 years as one of the first nonsectarian and coeducational colleges in the United States. Throughout the 20th century, Antioch College pioneered liberal arts education by introducing self-designed majors, interdisciplinary studies, and study abroad programs. Today, Antioch College continues this tradition of innovation by engaging students in social causes and applying classroom lessons to real-world situations.
The document describes three temporary interactive exhibit concepts for a Child & Young Adult Cancer Institute. Concept A features a rooftop garden theme with motion tracking technology allowing children to interact with virtual butterflies. It includes five interactive areas at different heights and an iPad station to control a rooftop camera. Concept B replaces motion tracking with traditional activities like a telescope to discover creatures and augmented reality through iPad stations. Concept C features an aquatic theme where motion tracking immerses children in a virtual underwater world interacting with fish. It includes circular portals displaying animated characters that respond to children's movements.
The Pennsylvania State University (Penn State)Mervin Smucker
Penn State is a public university located in University Park, Pennsylvania with over 100,000 total students across multiple campuses. The primary campus has over 40,000 students across 13 schools and houses over 60% of first-year students, though many transfer later. Penn State is highly ranked, spending over $750 million annually on research programs and participating in international academic exchanges. Athletic teams compete in the Big 10 conference, with strong football and women's rugby programs.
The document provides information on job openings at the DOEACC Society in New Delhi, India. It lists openings for the positions of Director, Chief Controller of Examinations, Chief Finance Officer, Hindi Typist, and Consultant (Civil). It outlines the educational qualifications and experience required for each role, as well as the pay scale and method of recruitment. Candidates are instructed to submit applications and supporting documents within 45 days.
The HPC will support open source and open access infrastructures for a variety of sectors engaged with ‘hybrid publishing’ (combining web, print, multi-platform distribution and social media). Foremost among these are the worlds of academic and independent publishing and the HPC will develop technical, financial and workflow models for both, as well as work towards the launch of its own university press. The consortium has made a general commitment to open access publishing as a means to remove the artificial barriers that readers and authors encounter in their engagement with critical and scholarly work.
The lab’s dedication to ‘open source infrastructure’ groups together the many technical and social processes that can benefit academic/independent publishers, granting them the sustained attention and resourcing they demand. These include multi-platform delivery, collaborative writing, the ability to circumvent sales monopolies, open IPR and distribution into ‘open education’ environments.
The consortium will be a meeting point for the many stakeholders in open access academic and independent publishing – the authors, the readers, the publishers and the technologists. The HPC looks to be a connection point for these communities and will using rapid prototyping and agile development models to support, improve and network existing open source projects. The ultimate objective is to provide easy and inexpensive tool sets to allow publishers to make the switch to open IPR and multi-platform publishing.
Single source – this is a key architectural principle to digital multi-platform publishing, where a single master document exists separate from platform dependent design. With the master document being stored with universal metadata and a granular schema making it available on any new platform or distribution channel, in part or as a whole.
Partnerships – working with partners in the publishing and technology sectors HPC will support spin-offs and start-ups to service our user community. From development partners such as LShift, Pandora to publishing networks such as Mute and Eurozine.
Projects – the HPC has two initial projects as well as other ongoing strands of research.
Indy portal – a multi-platform system and open IPR business model for independent publishers aimed at bypassing online digital book distribution monopolies.
A multi-platform plugin for Open Journals System (OJS) – the project would be to add multi-platform publication conversion for the main workflow OA publishing tools OJS and its sister software package Open Monograph Systems (OMS).
This document provides an overview of conducting an industry and market analysis. It outlines 7 steps for a general product-market assessment including defining the market/industry, clarifying the industry structure, identifying key players, listing competitive products, identifying new product trends, and defining technology barriers. It also includes charts on industry structure, assessing product placement options, innovation structures, and the marketing mix. The document is intended as a primer on research steps and concepts for analyzing an industry and market.
The document discusses several topics related to complexity and architecture in military operations. It presents different types of complexity and how standardization or adaptation can address them. It also contrasts traditional centralized enterprise architectures with more distributed "extraprise" models using multiple interoperable applications. Finally, it proposes an adaptive spiral approach to transition planning that evaluates initiatives based on factors like potential impact, familiarity, and timing to optimize resources across the portfolio.
This document summarizes an upcoming conference on automotive cockpit human-machine interfaces (HMI). The 2nd International Conference on Automotive Cockpit HMI will be held from September 28-30, 2011 in Darmstadt, Germany. The conference will discuss the latest trends in HMI design, electromobility, automated driving, and integrating mobile devices. Speakers will present on HMI concepts from automakers like BMW, Renault, and Ford. Participants can choose from interactive workshop sessions on topics like design thinking, future HMI outlook, and designing for elderly drivers. Over 80% of last year's attendees were satisfied with the conference content and organization.
This document outlines 14 potential "game changers" or emerging technologies for 2011-2012:
1. Consumerized IT and knowledge workers who are always connected via mobile devices.
2. Changes in storage and bandwidth brought by new networks, technologies, and price reductions enabling new applications.
3. Curated computing that simplifies support through standardized solutions.
4. Hybrid cloud computing models using private, public, and service-based clouds.
5. A new operating system for the internet.
6. New generations of appliances and mobile apps.
7. Increased role of gamification, social networks, and crowd sourcing.
8. Emergence of sensors and location-
Parametric studies for the AEC domain using InteliGrid platform - Matevz Dolenc
The document discusses the InteliGrid project, which aims to build an information infrastructure for sharing resources across European engineering industries. It describes InteliGrid's architecture, which includes a portal, authorization services, and broker services. As a use case, it examines parametric studies for earthquake engineering using Incremental Dynamic Analysis (IDA) on the InteliGrid platform. Performance analysis showed near-linear speedup when using 5-25 computers versus a single computer for 280 IDA analyses.
IBM Research-Tokyo conducts research across a wide range of areas including core computer sciences, multi-core architectures, machine learning, mobile devices, displays, internet technologies, and deep computing with the goal of creating innovations that provide real-world value. They are part of a global network of over 3000 researchers working on fundamental advances and patent filings across these technical domains.
Java on the real-time hardware implementation feasibility of joint radio res... - ecwayerode
This document discusses the feasibility of implementing joint radio resource management policies in heterogeneous wireless networks in real-time. It proposes novel JRRM techniques based on linear programming that coordinate the use of total available radio resources across different radio access technologies to maximize satisfaction for all users. The paper investigates the computational cost of implementing the proposed JRRM algorithms on DSP platforms commonly used in mobile stations and demonstrates their feasibility for future heterogeneous wireless systems.
Edinburgh Data-Intensive Research - Data-intensive refers to huge volumes of data, complex patterns of data integration and analysis, and intricate interactions between data and users. Current methods and tools are failing to address data-intensive challenges effectively. They fail for several reasons, all of which are aspects of scalability: the deluge of computational methods and plethora of computational systems prevent effective and efficient use of resources; user interfaces are not adopted at a sufficient rate to satisfy demand for scientific computing; and data and knowledge are created outside suitable contexts for collaborative research to be effective. The Edinburgh Data-Intensive Research group addresses these scalability issues by providing mappings from abstract formulations to concrete and optimised executions of research challenges, by developing intuitive interfaces that enable access to steer these executions, and by developing systems to aid in creating new research challenges. In this talk I will present several exemplars where we have dealt with scalability issues in scientific scenarios.
... two decades of correlation, hierarchies, networks and clustering in financial markets
Summary of some of my past research work at Complex Networks 2022.
The study of correlations, hierarchies, networks and communities (or clustering) has more than 20 years of history in econophysics.
However, for the practitioner, it seems that these tools are not fully ready yet:
Many questions around their proper use for trading or risk monitoring are left unanswered.
Deep Learning might help solve some hard problems such as finding more reliably communities (or clusters) and their number.
Running large simulations (based on GANs, VAEs or realistic market simulators) could also help understand when complex networks methods can give wrong insights (e.g. not enough data, or not stationary enough; too low correlations).
Conference: Complex Networks 2022 in Palermo, Sicily, Italy.
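As a concrete illustration of the correlation-hierarchy-clustering pipeline the talk surveys, the following sketch builds a correlation matrix from synthetic return series, converts it to the standard distance d = sqrt(2(1 - rho)), and applies hierarchical clustering. All data, factor structure, and parameter choices here are invented for the example:

```python
# Correlation matrix -> distance matrix -> hierarchical clustering, in the
# spirit of the econophysics toolkit summarized above. The return series
# and all parameters are synthetic examples, not market data.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(42)
n_obs = 500
# Two synthetic "sectors": assets 0-2 load on one factor, assets 3-5 on another.
factor_a = rng.normal(size=n_obs)
factor_b = rng.normal(size=n_obs)
returns = np.stack(
    [factor_a + 0.3 * rng.normal(size=n_obs) for _ in range(3)]
    + [factor_b + 0.3 * rng.normal(size=n_obs) for _ in range(3)]
)

rho = np.corrcoef(returns)                          # 6x6 asset correlation matrix
dist = np.sqrt(np.clip(2.0 * (1.0 - rho), 0.0, None))  # correlation-to-distance map
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # the two synthetic sectors fall into two separate clusters
```

With strong common factors the dendrogram separates the two sectors cleanly; with real, noisy, non-stationary data the cluster count is exactly the open question the talk raises.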
The document summarizes a workshop on the Solar Atlas for the Mediterranean. The atlas aims to improve knowledge of solar resource availability through high resolution satellite-based mapping of solar radiation across Southern Europe and Northern Africa. Open access to the data is intended to support solar energy development by providing information to policymakers and investors. The project is led by a consortium of organizations working to develop strategies to bring more renewable energy into energy systems through resource assessments, technical potential studies, and market development initiatives.
Monesh Patel provides a value proposition for himself as an engineering leader with over 20 years of experience. He highlights key strengths such as balancing big picture thinking with attention to detail. He also emphasizes business orientation, people skills, communication skills, and a bias toward continuous improvement. Examples are given of past work leading development of technically challenging imaging systems and serving as chief system architect. As a design for six sigma change agent, he influenced culture change through training others and demonstrating cost savings from various projects.
The document discusses Simplify, a framework for enabling fast functional simulation of multiprocessor system-on-chips (MPSoCs). Simplify uses an abstract MPSoC platform model to allow for easy modeling of MPSoC architectures and fast behavioral simulation. It integrates an operating system and supports task migration and communication between processors. Experimental results show that Simplify achieves scalable simulation performance and allows for online design, simulation, and debugging of MPSoCs.
Residual balanced attention network for real-time traffic scene semantic segm... - IJECEIAES
Intelligent transportation systems (ITS) are among the most active research areas of this century. Autonomous driving involves advanced road-safety tasks, including identifying dangers on the road and protecting pedestrians. In the last few years, deep learning (DL) approaches, and especially convolutional neural networks (CNNs), have been extensively used to solve ITS problems such as traffic scene semantic segmentation and traffic sign classification. Semantic segmentation is an important task that has been addressed in computer vision (CV). Indeed, traffic scene semantic segmentation using CNNs requires high precision with few computational resources to perceive and segment the scene in real time. However, related work often focuses on only one aspect: the precision or the number of computational parameters. In this regard, we propose RBANet, a robust and lightweight CNN which uses a newly proposed balanced attention module and a newly proposed residual module. We then simulated the proposed RBANet with three loss functions to find the best combination, using only 0.74M parameters. RBANet has been evaluated on CamVid, the most widely used dataset in semantic segmentation, and it performs well in terms of parameter requirements and precision compared to related work.
This document describes TInA, a tool for exploring pathways in biochemical systems at steady state using T-invariant analysis of Petri nets. TInA can automatically compute T-invariants, maximal common transition sets, and cluster T-invariants from biological networks represented as Petri nets. It provides a graphical user interface and supports various network formats. The document demonstrates TInA's analysis of a Petri net model of gene regulation in Duchenne muscular dystrophy, identifying 107 T-invariants and 25 maximal common transition sets.
Patchwork3D is a software suite from Lumiscaphe that allows for 3D real-time realistic rendering of digital mockups. It includes tools for unfolding fabrics onto complex shapes, animating models, and ray tracing for realistic lighting and reflections. The software also includes configurators for customizing products and presenting configurations to customers. Additional complementary products provide standalone interactive presentations and virtual showroom capabilities.
This document discusses an FPGA implementation of a four phase code design using a modified genetic algorithm. It summarizes the key aspects of the implementation as follows:
1) The proposed architecture efficiently implements a modified genetic algorithm on an FPGA to identify good pulse compression sequences based on discrimination factor.
2) Pulse compression techniques in radar allow long pulses to achieve high energy while maintaining the range resolution of short pulses. The receiver compresses the long signal into a narrow signal.
3) The criteria for good pulse compression sequences include high merit factor and discrimination factor. Merit factor measures quality by comparing main lobe energy to side lobe energy. Discrimination factor compares the main peak to maximum side lobes.
Indoor multi operator solutions - Network sharing and Outsourcing - Amirhossein Ghanbari
Indoor solutions, as a part of cellular mobile network planning, have been used for years to make up for the lack of acceptable coverage that subscribers experienced when using cellular phones indoors. Network sharing, on the other hand, is a commonly used way for mobile operators to lower their network capital and operational expenditures, and has also commonly been applied to Distributed Antenna System (DAS) solutions in indoor deployments. Besides sharing, outsourcing network operation and maintenance has been widely accepted by wireless carriers around the world ever since the IT outsourcing wave that started in the late 90s proved promising for lowering operational costs.
The rise of new technologies in this domain, always promising subscribers higher speeds and better service, gradually became worrisome as operators experienced lower revenues from voice services over the last couple of years alongside higher demand for capacity. As a result, operators started considering indoor networks as part of their planned network, particularly as femtocell technology became the hot topic for smallcell deployments in recent years. This way, MNOs could efficiently cover customers indoors while offloading mobile data traffic from macro cellular networks. But a question arose: why have sharing and outsourcing not taken off in smallcell networks, when they are commonly used in macro cellular networks and DAS solutions?
In this MSc thesis, cooperation between the different actors of the shared indoor mobile network ecosystem is studied by investigating both possible sharing models and the concept of outsourcing network operation and management for smallcell networks. The investigation is based on femtocells as the most suitable technology for both better coverage and higher capacity. The different roles of actors in the ecosystem, the business relations between them, and the main drivers of sharing were studied, along with the main beneficiary of sharing, in order to identify different types of cooperation and correlation in the ecosystem.
The main research questions of the thesis revolve around the absence of sharing, whether active or passive, in indoor mobile networks, as well as of outsourced network operation and management. Eventually, a series of possible deployment models for shared and outsourced indoor mobile networks is presented and verified against a number of use cases. As a result, this study proposes a set of recommendations for the different possible operators in the ecosystem to help them formulate a profitable business model. These recommendations are believed to enable sharing and outsourcing to take off in smallcell networks.
PERFORMANCE-STUDIES OF A MOLECULAR DYNAMICS CODE Evaluating Serial, Thread and Process-Parallel Performance
1. PERFORMANCE-STUDIES OF A MOLECULAR DYNAMICS CODE
Evaluating Serial, Thread and Process-Parallel Performance
Molecular Dynamics

A molecular dynamics code simulating the diffusion in dense nuclear matter in white dwarf stars is analyzed in this collaboration. The code is highly configurable, allowing MPI, OpenMP, or hybrid runs and additional fine tuning with a range of parameters.

[Diagram: Implementation with VampirTrace → Evaluation with Vampir]

Serial Analysis

The first step in the code analysis is to identify the best performing parameter set. This configuration represents the most promising candidate for further parallel analysis.

Parallel Analysis

The aim of the parallel analysis is to measure the scalability limits of the different parallel code implementations and to detect bottlenecks possibly preventing further parallel efficiency. This work has been done with the parallel analysis framework Vampir.

Collaboration between PTI and ZIH

Planning

In 2008 a collaboration between the Technische Universität Dresden, Center for Information Services and High Performance Computing (ZIH), in Germany and the Indiana University, Pervasive Technology Institute (PTI), was founded to mutually benefit from common research areas. These research topics include:
‣ Data-centric computing
‣ Computing for biological and life sciences
‣ Wide area distributed file systems
‣ Parallel computing performance

Testing

This collaboration involves ZIH maintaining a stack of performance evaluation tools inside the NSF FutureGrid project. For the analysis an HPC system called Xray, a Cray XT5m which is part of the FutureGrid hardware, was used. Xray consists of Quad-core 64-bit AMD Opteron series 2000 processors. PGI compilers version 9.0.4 and the optimized xtpe-barcelona module provided by Cray have been used.

Authors

T. William, M. Weber, D. Röhrig - ZIH, TU Dresden
D. K. Berry, R. Henschel - UITS, IU Bloomington
J. Hughto, A. S. Schneider, C. J. Horowitz - Physics, IU Bloomington
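The scalability measurement described above reduces to comparing the runtime of the best serial configuration against parallel runtimes. A minimal sketch of that arithmetic, with purely illustrative timing values (not measurements from the MD code or Xray):

```python
# Speedup and parallel efficiency from wall-clock timings.
# The timing values below are illustrative placeholders, not
# measurements from the MD code or the Cray XT5m system.

def speedup(t_serial, t_parallel):
    """S(p) = T(1) / T(p)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(p) = S(p) / p; 1.0 means perfect linear scaling."""
    return speedup(t_serial, t_parallel) / p

# Hypothetical runtimes in seconds for 1, 4, 16 and 64 processes.
timings = {1: 1024.0, 4: 270.0, 16: 75.0, 64: 24.0}
t1 = timings[1]
for p, tp in timings.items():
    print(f"p={p:3d}  speedup={speedup(t1, tp):6.2f}  "
          f"efficiency={efficiency(t1, tp, p):5.2f}")
```

Falling efficiency at higher process counts is exactly the kind of bottleneck signature a trace-based tool like Vampir is then used to explain.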
This document was developed with support from the National Science Foundation (NSF) under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
01062 Dresden, Germany
E-mail: contact@vampir.eu
Web: http://www.vampir.eu/

IU Bloomington
Pervasive Technology Institute
2719 East 10th Street
Bloomington, Indiana 47408
E-mail: pti@indiana.edu
Web: http://pti.iu.edu/
2. MOLECULAR DYNAMICS CODE
Overview
The analyzed molecular dynamics code (MD) has been developed at Indiana University for simulating dense nuclear matter in white dwarf and neutron stars.

Matter in these compact stellar objects consists of either completely ionized atoms, or else free neutrons and protons.

The MD code models these systems as classical particles interacting via a screened Coulomb interaction. The electrons are not modeled explicitly, but are treated as a uniform background charge that serves to screen the Coulomb interaction.

The movie shows a 5125 (5k) particle simulation of a carbon-oxygen-neon ion mixture forming microcrystals:
‣ 1280 Carbon ions
‣ 3740 Oxygen ions
‣ 100 Neon ions

By studying the motion of ions over a long simulation time, one can calculate the self-diffusion coefficient of ions, an important quantity for understanding the behavior of white dwarfs.

MD-Code Ion Mix Movements
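The screened Coulomb interaction can be sketched as a Yukawa-type pair potential, V(r) = Z_i Z_j e² exp(−r/λ)/r, where λ is the screening length set by the electron background. The exact functional form and units of the MD code are not shown here, so the following snippet is only an illustrative sketch:

```python
import math

def screened_coulomb(zi, zj, r, lam, e2=1.0):
    """Yukawa-type pair potential between charges zi and zj at distance r,
    screened over length lam (units folded into e2). Illustrative only."""
    return zi * zj * e2 * math.exp(-r / lam) / r

# Screening suppresses the bare 1/r interaction at large separations,
# which is what justifies the cut-off sphere used by the MD code.
bare = screened_coulomb(8, 8, 10.0, lam=float("inf"))
screened = screened_coulomb(8, 8, 10.0, lam=2.0)
assert screened < bare
```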
3. MOLECULAR DYNAMICS CODE
Implementation Details
MD is a hybrid MPI/OpenMP parallel code written in Fortran. Its main features are:

‣ Simple particle-particle algorithm
‣ Long range interaction with large cut-off sphere
‣ Cut-off radius is half the box edge length
‣ Each ion interacts with about half the others
‣ Interactions are calculated in a pair of nested do loops
‣ Outer loop goes over "target" particles
‣ Inner loop goes over "source" particles
‣ Targets are assigned in a round-robin fashion to MPI processes
‣ Within each MPI process, the work is shared among OpenMP threads

The MD code is highly modular, consisting of several different implementations of the same simulation logic. The particle-particle interaction semantics have been implemented multiple times in different ways. Each implementation is stored in an individual source file, named PP01 to PP03.

During compilation the input parameter PP0x decides which file is used for the simulation. Inside the PP0x files the parameters Ax and Bx denote different code blocks that implement equal semantics in different ways. Additionally, the file PP03 provides the define NBS to influence the vector access length. These parameters have been used throughout the years to re-implement code utilizing newer constructs of Fortran 95 that had not been available in Fortran 77.

PP01:
‣ Original implementation
‣ No division into Ax, Bx, or NBS
‣ Baseline for other measurements

PP02:
‣ A0, A1, A2
‣ B0, B1, B2
‣ No NBS

PP03:
‣ A0, A1, A2
‣ B0, B1, B2, B3, B4, B5, B6, B7
‣ NBS

[Figure: code block in the PP03 file showing different implementations of the same loop, selectable by Ax preprocessor macros]
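The decomposition described above (targets assigned round-robin to MPI processes, inner loop over all sources) can be sketched as follows. The function names and the pair counting are hypothetical stand-ins for the Fortran force loops:

```python
def targets_for_rank(rank, nranks, nparticles):
    """Round-robin assignment of target particles to MPI ranks."""
    return list(range(rank, nparticles, nranks))

def pair_interactions(rank, nranks, nparticles):
    """Count the (target, source) pairs handled by one rank. The real code
    accumulates forces here, with OpenMP threads sharing the outer loop."""
    pairs = 0
    for t in targets_for_rank(rank, nranks, nparticles):   # outer: targets
        for s in range(nparticles):                        # inner: sources
            if s != t:
                pairs += 1
    return pairs

# Every particle is assigned as a target to exactly one rank:
all_targets = sorted(sum((targets_for_rank(r, 4, 10) for r in range(4)), []))
assert all_targets == list(range(10))
```

Round-robin assignment keeps the per-rank target counts balanced to within one particle, which matters because each target carries a full inner loop of work.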
4. SCOPE OF THE CODE ANALYSIS
Hardware Details & Parameter Sweep
All measurements done on FutureGrid machine Xray:
‣ Cray XT5m
‣ 672 cores
‣ 84 nodes
‣ 2 CPUs per node
‣ Quad-Core AMD Opteron(tm) Processor 23 (C2)
‣ 2.4 GHz
‣ ...

Time estimate per measurement and particle count:
‣ 5k: ~300 seconds (~5 minutes)
‣ 27k: ~4000 seconds (~1 h)
‣ 55k: ~35000 seconds (~10 h)

Serial analysis:
‣ 55k particles
‣ Comparing optimization flags: O2, O3, fastsse
‣ fastsse == -fast
‣ fast - common optimizations:
  -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz

Parameter sweep dimensions:
‣ Particle type: nucleon-nucleon, ion pure, ion mix
‣ Loop combination: Block A (A0-A2), Block B (B1-B7), code-blocking (NBS)
‣ Parallelism: serial (md), OpenMP (md_omp), MPI (md_mpi), hybrid (md_mpi_omp)
‣ Input parameters: simulation type, number of particles, number of time steps, ...
‣ Compiler flags: O2, O3, SSE
‣ Code version: original (PP01), production (PP02), research (PP03)

>8100 possible combinations
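A sweep of this kind is typically enumerated as a cross product of parameter axes. The sketch below uses hypothetical axis values reconstructed from the lists above; the real sweep has additional dimensions (NBS values, particle counts, time steps), which is how it exceeds 8100 combinations:

```python
from itertools import product

# Hypothetical sweep axes; the actual scripts and value sets differ.
axes = {
    "particle_type": ["nucleon-nucleon", "ion pure", "ion mix"],
    "code_version": ["PP01", "PP02", "PP03"],
    "block_a": ["A0", "A1", "A2"],
    "block_b": ["B1", "B2", "B3", "B4", "B5", "B6", "B7"],
    "parallelism": ["md", "md_omp", "md_mpi", "md_mpi_omp"],
    "flags": ["O2", "O3", "SSE"],
}

configs = list(product(*axes.values()))
# 3 * 3 * 3 * 7 * 4 * 3 = 2268 raw tuples from these six axes alone,
# before invalid combinations (e.g. PP01 has no Ax/Bx blocks) are pruned.
print(len(configs))
```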
5. SERIAL ANALYSIS
Identifying The Best Configuration For Parallel Runs
Runtime for all code combinations
6. SERIAL ANALYSIS
Identifying The Best Configuration For Parallel Runs
Runtime in seconds
7. SERIAL ANALYSIS
PAPI Aided Analysis Of PP02 Code Versions
PAPI_FP_INS
8. SERIAL ANALYSIS
PAPI Aided Analysis Of PP02 Code Versions
FPU Idle
9. SERIAL ANALYSIS
PAPI Aided Analysis Of PP02 Code Versions
Branches mispredicted
10. SERIAL ANALYSIS
PAPI Aided Analysis Of PP02 Code Versions
PAPI_VEC_INS
11. SERIAL ANALYSIS
PAPI Aided Analysis Of PP02 Code Versions
L1 hit ratio
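The L1 hit ratio is a derived metric rather than a raw counter. Assuming it was computed from data-cache access and miss counts (e.g. the PAPI presets PAPI_L1_DCA and PAPI_L1_DCM; the exact counters used here are not stated), the calculation is:

```python
def l1_hit_ratio(accesses, misses):
    """Fraction of L1 data-cache accesses that hit, derived from raw
    counters such as PAPI_L1_DCA (accesses) and PAPI_L1_DCM (misses)."""
    return (accesses - misses) / accesses

# A high ratio indicates the main calculation loop is cache-friendly.
assert abs(l1_hit_ratio(1000, 50) - 0.95) < 1e-12
```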
12. SERIAL ANALYSIS
PAPI Aided Analysis Of PP02 Code Versions
PAPI_TOT_CYC
13. SOURCE CODE ANALYSIS
Codeblocks Ax & Bx For Ion-Mix (PP02)
A0: Branching
A1: Arithmetic
A2: Array syntax

Block A
14. SOURCE CODE ANALYSIS
Codeblocks Ax & Bx For Ion-Mix (PP02)
B0: Loop
B1: No cut-off sphere
B2: Cut-off sphere

Block B
15. SOURCE CODE ANALYSIS
Codeblocks Ax & Bx For Ion-Mix (PP02)
B0: Loop
B1: No cut-off sphere
B2: Cut-off sphere

Result: "Calculation does not beat branching"

Block B
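The A0 (branching) versus A1 (arithmetic) trade-off behind this result can be sketched as follows. Both variants compute the same sum; the arithmetic variant replaces the per-pair branch with a 0/1 mask, which the measurements found to be slower here ("calculation does not beat branching"). This Python sketch only mirrors the structure of the Fortran loops:

```python
def energy_branching(pairs, rcut2):
    """A0-style: skip pairs outside the cut-off sphere with a branch."""
    total = 0.0
    for r2, v in pairs:
        if r2 < rcut2:          # branch per pair; skips the work entirely
            total += v
    return total

def energy_arithmetic(pairs, rcut2):
    """A1-style: always compute, weight by a 0/1 mask instead of skipping."""
    total = 0.0
    for r2, v in pairs:
        mask = 1.0 if r2 < rcut2 else 0.0  # the Fortran block forms this arithmetically
        total += mask * v
    return total

# Both variants implement equal semantics, as the poster puts it:
pairs = [(1.0, 2.0), (9.0, 5.0), (4.0, 1.5)]
assert energy_branching(pairs, rcut2=5.0) == energy_arithmetic(pairs, rcut2=5.0)
```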
16. PARALLEL ANALYSIS
First Scalability Evaluation
The MD code is used in production with particle counts between 27k and
55k. Production runs currently use the PP02 implementation of particle-
particle interaction semantics. The code blocks selected are A0 and B2, and
the vector length (NBS) is set to 32. Measurements show that this
configuration is very efficient and yields good performance.
So far, runtimes of larger runs have only been analyzed using the hybrid
version. The chart shows speedups on a large Cray XT5 (Kraken) and the
resulting efficiency for 27k and 55k particle runs. Up to 1152 cores, the
efficiency is higher than 80%. At 4608 cores the efficiency drops below 50%
for 27k particles. Using more ions exploits weak scaling: with the larger
particle count, an efficiency of 58% is still achieved even on 4608 cores.
Although these results are very promising, Kraken has 112,896 compute
cores that could be used for computation. The mid-term goal is therefore
to optimize the MD code to achieve 50% parallel efficiency on 16000
cores. To this end we will take a more detailed look at the four different
versions of the code (serial, OpenMP, MPI, and hybrid).
The code is very efficient for multi-core processors such as the 6-core processors on Kraken, the
Cray XT5 at The National Institute for Computational Sciences. Each processor is assigned one
MPI process. The work of each process is then shared among 6 OpenMP threads.
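Parallel efficiency as quoted here is speedup divided by the increase in core count, measured against a baseline run. A minimal sketch with illustrative numbers (not the measured Kraken values):

```python
def parallel_efficiency(t_base, cores_base, t_n, cores_n):
    """Speedup relative to the baseline run, divided by the core-count ratio."""
    speedup = t_base / t_n
    return speedup / (cores_n / cores_base)

# Illustrative only: a 24x speedup on 48x the cores gives 50% efficiency.
assert parallel_efficiency(t_base=4800.0, cores_base=96, t_n=200.0, cores_n=4608) == 0.5
```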
17. PARALLEL ANALYSIS WITH VAMPIR
ZIH has more than a decade of expertise in analyzing
sequential and parallel codes. As part of the
collaboration with IU, ZIH applied its analysis software
Vampir to the different versions of MD.

Vampir consists of a tracing framework called
VampirTrace that is able to monitor all levels of
parallelism of the MD code. Additionally, PAPI
counters are recorded in order to analyze the cache
behavior of the main calculation loop.

In a first step, traces of the four versions were
generated using only 5k particles to get a broad
overview. The next step will be to refine the tracing
mechanism using manually instrumented code to avoid
the rather large overhead introduced by OpenMP
tracing. This will enable monitoring of at least 4096
cores to gain an understanding of why the efficiency
drops below 50% at this core count.
18. OPENMP CODE OPTIMIZATION
Comparison of OpenMP Versions using 1 Node and 1-8 Cores
19. OPENMP CODE OPTIMIZATION
Comparison of OpenMP Versions using 1 Node and 1-8 Cores
20. PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 8 nodes, 1 process per node
21. PARALLEL ANALYSIS WITH VAMPIR
20+ loops of the A0_B2 block
22. PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 1 node, 8 processes per node
23. PARALLEL ANALYSIS WITH VAMPIR
20+ loops of the A0_B2 block
24. PARALLEL ANALYSIS WITH VAMPIR
55k particles, MPI only, 84 nodes, 8 processes per node (672 cores = full-machine run)
25. PARALLEL ANALYSIS WITH VAMPIR
Switching of groups and flushing of VampirTrace (VT) buffers
26. PARALLEL ANALYSIS WITH VAMPIR
55k particles, OpenMP only, 1 node, 8 threads
27. PARALLEL ANALYSIS WITH VAMPIR
55k particles, hybrid, 84 MPI processes (1 per node), 8 threads per process (672 cores = full-machine run)
28. PARALLEL ANALYSIS WITH VAMPIR
Second group, 100 runs
29. SCALABILITY STUDIES ON XRAY
Parallel Efficiency (MPI-only runs): 1 node and 8 nodes
30. SCALABILITY STUDIES ON XRAY
Parallel Efficiency (Hybrid runs)
31. CONCLUSION AND FUTURE WORK
To find the optimal serial version of MD, different combinations of the code
blocks and compiler flags were tested. The results from the serial analysis
formed the starting point for the parallel analysis.

Vampir was then successfully applied to the code, making it possible to
visualize MPI and OpenMP parallelism.

An important step for future parallel analysis is to embed PAPI counters
directly in the MD code to avoid the overhead caused by VampirTrace. This
enables the comparison of the actually achieved performance with the
theoretical peak performance at larger core counts.

Additional tests with larger particle counts are planned to exploit the weak
scaling capabilities of the code. The hybrid implementation is to be run in
different process/thread configurations to research whether fitting MPI/OpenMP
usage to the hardware characteristics can further increase performance.

Analysis tests will be re-run on Kraken with up to 16386 cores. The goal is to
detect scalability issues and to significantly improve parallel efficiency.

The MD code is currently being ported to GPGPUs. Vampir already supports the
analysis of CUDA-parallelized code. This support will enable efficient porting,
allowing the maximum potential provided by GPGPUs to be exploited.

[Figure: development cycle: Planning → Implementation with VampirTrace → Testing → Evaluation with Vampir]