OPTIMIZATION AND IMPLEMENTATION OF
PARADIS CODE IN PARALLEL CLUSTER
Thesis
Submitted to
The Department of Computer Science and College of Engineering of
Southern University and A&M College
In Partial Fulfillment of the Requirements for
The Degree of
Master of Science in Computer Science
By
Cheng Guo
Baton Rouge, Louisiana
December, 2015
© Copyright by
Cheng Guo
All rights reserved
2015
ABSTRACT
In this work, we implemented and optimized ParaDiS (Parallel Dislocation Simulator)
code as a dislocation dynamics simulation tool for an S-doped Ni crystalline ductility
study. After a brief analysis of the original ParaDiS source code, we focused on the most
computationally expensive modules of ParaDiS and implemented different optimization
approaches, including loop unrolling, SIMD (single instruction, multiple data) intrinsics
for vector calculations with the Intel AVX instruction set, a write buffer, and
OpenMP-level optimization. The computational improvements and run-time reductions of the
different optimization methods were measured on the QB2 parallel cluster of the LONI
system. These optimization methods can be extended to similar computer configurations
and parallel simulation codes. Our simulation results also support our experimental data
and models.
Keywords: High performance computing, parallel computation, optimization
ACKNOWLEDGEMENTS
Special thanks to Dr. Wei Cai (co-developer of ParaDiS) and Dr. Amin Arbabian
from Stanford University for their patient explanation of the ParaDiS code to the author,
which helped the author to better understand the code.
The author would like to express his gratitude to Dr. Shuju Bai and Dr. Ebrahim
Khosravi for their advice on this research and thesis.
The author also wishes to express his sincere appreciation and gratitude to his
research advisor, Dr. Shizhong Yang, for his guidance, advice, availability, and support
from the beginning of the research to the completion of this thesis.
The author would like to show his deepest gratitude to his parents. Lastly, the author
would like to express his warm regards and blessings to all of those who supported him in
any respect during the completion of this project.
TABLE OF CONTENTS
APPROVAL PAGE
COPYRIGHT PAGE
ABSTRACT
ACKNOWLEDGEMENTS
CHAPTER I INTRODUCTION
    Significance
    Statement of Problem
    Objectives
    Delimitations
CHAPTER II REVIEW OF RELATED LITERATURE
    Dislocation Dynamics
    Dislocation Computational Algorithms
    Domain Decomposition and Paralleled Implementation
    Common DD Simulation Software
CHAPTER III METHODOLOGY
    Design
    Resources
    Software Debugging
    Preliminary Wall Time Analysis
CHAPTER IV OPTIMIZATION
    Loop Unrolling
    SIMD Implementations
    Write Buffer
    Further Implementation of OpenMP
CHAPTER V RESULTS
    Multi-core Performances
    Effectiveness of Optimizations
    Association with Experimental Results
    Summary
    Future Work
APPENDIX A
APPENDIX B
CHAPTER I
INTRODUCTION
Background
In the past decades, computation technologies have evolved rapidly: the top
supercomputers nowadays are capable of performing over 30 petaflops [1].
These enhanced calculation capabilities of supercomputers, or high performance
computers (HPC), are to a great extent achieved both by doing every operation in a
shorter time and by having many computing units performing operations simultaneously.
The latter approach is usually referred to as parallelism, and it is essential for high
performance computing. The central idea behind most powerful computers today is the
use of multiple cores, by which a given computational task is divided into several sub-
tasks, which are then simultaneously executed on different cores. These cores, therefore,
solve the computational problem in a cooperative way, and the number of cores used by a
supercomputer continues to increase over time. Since data transfer is typically the
dominant factor that limits the performance of scientific codes, the network connections
of parallel computer systems play an important role in the parallel performance of
applications. To generate efficient parallel
code, three important network characteristics need to be taken into account: topology,
bandwidth, and latency. These features all have an important influence on the
performance of a parallel computer.
Usually, parallel computers are available in two types: shared-memory and distributed
memory (also known as clusters). The major difference between them is that a distributed-memory
parallel computer has a separate main memory for each node, whereas, in shared-memory
systems, CPUs operate in a common shared memory space. The dominant HPC
architectures at present and for the foreseeable future are comprised of nodes that are
shared-memory Non-uniform Memory Access (NUMA) machines connected with the
rest of the nodes following a distributed-memory pattern. Naturally, the efficiency of data
transfer throughout the different nodes of a computer has drawn a lot of attention [2]. The
Message Passing Interface (MPI) protocol, in particular, has been developed as a solution
for writing codes that run in parallel in both distributed-memory machines and shared-
memory machines [3]. Similarly, another protocol called OpenMP interface was
developed to achieve higher efficiency in shared-memory machines [4]. Since most
supercomputers are configured as hybrid shared/distributed-memory architectures,
the mixed use of different programming models, including both MPI and OpenMP, is
preferable.
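As a minimal sketch of this hybrid pattern (illustrative only, not taken from any particular application), MPI distributes the work across nodes while OpenMP threads share memory within each node:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Minimal hybrid MPI + OpenMP sketch (illustrative only). */
int main(int argc, char **argv)
{
    int provided, rank, nranks;
    double local_sum = 0.0, global_sum = 0.0;

    /* Request thread support so MPI and OpenMP threads can coexist safely. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Shared-memory level: OpenMP threads on one node split the local work. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank; i < 1000000; i += nranks)
        local_sum += (double)i;

    /* Distributed-memory level: MPI combines the per-node partial results. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}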
As the essential components of HPC technologies, high-speed advanced data
networks are not only revolutionizing the ways that educators, researchers, and
businesses work, but they are also dramatically changing the scale and character of the
problems they can solve. The Louisiana Optical Network Initiative (LONI), for instance,
is a state-of-the-art fiber optics network that runs throughout Louisiana and connects
Louisiana and Mississippi research universities to one another as well as to the National
Lambda Rail and Internet 2. LONI provides the most powerful distributed supercomputer
resources available to any academic community, with ~ 2 Petaflops of computational
capacity and 10 Gbps Internet connections. In 2008, LONI joined the TeraGrid program
of the National Science Foundation (NSF) as a new resource provider; since then, LONI
has also contributed its resources to the national research community.
The increase of existing computational power has made simulations an important
field in scientific and engineering disciplines, bridging experimental and purely
theoretical studies. As a powerful research tool, simulation allows for exploration when
experimental procedures are expensive or difficult to carry out, or when an experimental
procedure requires better results under specific conditions. Dislocation dynamics (DD)
simulations are broadly used in the study of materials science, physics, and mechanical
engineering [5-7]. With the explosive growth of computational power in the past decade,
dislocation dynamics simulation can now handle larger and more complex systems that
consist of hundreds of thousands of atoms or multi-million atoms thanks to powerful
processor performance, larger HPC clusters, and improved modeling of theory and
methods.
Significance
The parallel implementation and optimization of complex simulations on high
performance computers (HPCs) has become a frontier field in scientific research. The
analysis and comparison of different algorithms and function implementations, message
passing protocols on HPC systems with different configurations, and the performance and
efficiency of HPC systems are also of importance since one of the major reasons for
developing faster computers is the demand to solve increasingly complicated scientific
and engineering problems. Up to this time, 3-dimensional dislocation dynamics
simulations have become a significant tool in the study of the plasticity of sub-micron
size metallic components due to the increasing implementation of small-scale devices.
Statement of Problem
In this work, we aim to evaluate how much the efficiency of running ParaDiS can be
improved with different node configurations and optimization approaches, including loop
unrolling, SIMD instructions, a write buffer, and further implementation of OpenMP in
the QB2 cluster. We will also use the computational results to explain and support our
experimental data.
Hypothesis
The high performance computing (HPC) efficiency of running ParaDiS can be
improved and optimized by increasing the number of computational nodes and by
implementing SIMD and OpenMP techniques.
Research Questions
1. How can the performance of these computing systems be improved by increasing
the number of computational nodes in the simulation processes?
2. What are the primary factors that affect the computation time for the ParaDiS
code?
3. How can the ParaDiS package be used as the benchmark for testing jobs that are
running on the LONI clusters?
4. What techniques can be used to reduce overall program run time?
5. How well can dislocation dynamics simulation results be used to explain current
experimental results and models?
Objectives
Different optimization techniques such as loop unrolling, SIMD instructions, and
further implementation of OpenMP have been analyzed and tested in QB2. An evaluation
of the effectiveness of scientific computational performance is also included in this
research work. To ensure the validity and accuracy of the simulation, our results are
compared to existing experimental data.
Delimitations
This research was tested only on the QB2 cluster. The implementation of ParaDiS on
other computer architectures, such as IBM Blue Gene, may require modifications to the
software configuration. The performance improvement due to an increase in
computational nodes applies only to the dislocation system studied in this research; the
optimal number of nodes may differ if the system, or the scale of the system, changes.
CHAPTER II
REVIEW OF RELATED LITERATURE
Dislocation Dynamics
The plastic deformation of single crystals is carried out by a large number of
dislocations. To translate the fundamental understanding of dislocation mechanisms into
a quantitative physical theory for crystal plasticity, a new means of tracking dislocation
motion and interactions over long time spans and large spatial scales is needed. Three-
dimensional dislocation-dynamics (DD) simulation is aimed at developing a numerical
tool for simulating the dynamic behavior of large numbers of dislocations of arbitrary
shapes, interaction among groups of 3D dislocations, and the behavior of prescribed cell
walls [8]. It produces stress/strain curves and other mechanical properties and allows a
detailed analysis of dislocation microstructure evolution. In a numerical implementation,
dislocation lines are represented by connected discrete line segments that move according
to driving forces, including dislocation line tension, dislocation interaction forces, and
external loading. The dislocation segments respond to these forces by making discrete
movements according to a mobility function that is characteristic of the dislocation type
and specific material being simulated. This dislocation mobility can be extracted from
experimental data or calculated with atomistic simulations. Further, mobility is one of the
key inputs to a DD simulation. Another important consideration for DD simulations is
dealing with close dislocation-dislocation interactions such as annihilation and junction
formation and breaking. These close interactions can be quite complex and usually
require special treatment. An efficient way to deal with them is to use prescribed “rules.”
A bottleneck for DD simulation is the calculation of the elastic interactions between
dislocations, which are long range in nature. In order to perform DD simulations of
realistic material plastic behavior, efficient algorithms must be developed to enable
simulations over a reasonable time and space range with a large number of dislocations.
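As a hedged sketch of the force-driven motion just described (the linear mobility law and the forward-Euler update are illustrative assumptions, not the actual ParaDiS mobility modules or integrator):

/* One explicit time-integration step for a dislocation node:
   velocity from a simple linear mobility law, then a forward-Euler move.
   (Illustrative sketch only.) */
void advance_node(double pos[3], const double force[3], double mobility, double dt)
{
    for (int k = 0; k < 3; k++) {
        double v = mobility * force[k];  /* v = M * F */
        pos[k] += v * dt;                /* forward-Euler position update */
    }
}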
Dislocations are curvilinear defects in materials and are considered the primary
carriers of the plastic deformation. Dislocation microstructure evolution and dislocation
pattern formation in deformed materials are crucial to determining the mechanical
properties of these materials. A full understanding of the physical origin of plasticity,
therefore, requires fundamental knowledge of these phenomena, such as the occurrence
of slip bands and cell structures.
Dislocation theory has been extensively studied and used to explain plasticity since
the 1930s. In the past decades, with the development of fast computers, a new methodology,
called dislocation dynamics (DD), has been developed and successfully applied to
examine many of the fundamental aspects of the plasticity of materials at the micro-scale
[9]. In DD simulations, the equations of motion for dislocations are numerically solved to
determine the evolution and interaction of dislocations. Microstructure details are directly
observed in the simulation and can be used to validate experimental results. Many DD
models exist and most of them use straight segments to connect discrete dislocation
nodes on a dislocation loop [10, 11]. This leads to singularities when the self-force is
calculated at the intersection of two segments, because the self-force is a function of the
curvature of the dislocation line. Fine meshing of the dislocation line with many
segments is specifically required for strong interactions, which demands additional,
expensive computations. Such limits have been overcome since the development of the
parametric dislocation dynamics (PDD) model [12-15]. In the PDD, dislocations are
represented as spatial curves connected through dislocation nodes.
PDD and other DD simulations provide a promising way to link the microstructural
and the macroscopic properties of materials. One interesting application is to use DD to
explain the strain hardening of materials. Although progress has been made by PDD and
general DD methods to explain the nature of plastic deformation [16-18], there are
several challenges that limit their extensive utilization. First, in order to obtain representative
macro-scale plastic behaviors, the collective behaviors of large dislocation systems must
be considered. The typical size scales are microns, which may contain tens of thousands
of dislocations. This presents a huge demand for computational power that cannot be
fulfilled by a single processor. Parallel simulation, in which a large problem is divided
into small pieces and solved by different individual processors, is naturally selected as an
alternative approach.
Dislocation Computational Algorithms
Dislocations have long-range elastic interactions such that all the dislocation
neighbors have to be taken into account for the interaction calculation. This makes the
computational burden scale as O(N²), which becomes prohibitive for large N. To reduce
this computational complexity, a “cut-off distance” in the simulation with long-range
force fields is employed; however, it is known to produce spurious results. Much effort
has been put into developing algorithms that diminish that drawback, and many of the
essential methods, developed independently by a number of groups, are based on the
concept of a hierarchical tree [9]. While different methods of constructing tree structures
and handling interaction potentials have been used, all of them share two important
common characteristics. First, they utilize a hierarchical-tree data structure, and second,
they directly compute the force on an individual particle from nearby particles while the
force from remote particles is calculated by an approximate method. Among these
methods are the Barnes–Hut (BH) method [19] and fast multipole method (FMM) [20].
The BH technique builds its data structure via a hierarchical subdivision of space into
cubic cells (for 3-D problems), and an oct tree (quad tree for 2-D) is built. The
subdivision of volume is repeated until the cells at the lowest level of the tree contain at
most one or no particles. Each node of the tree in the BH method represents a physical
volume of space, and the total mass and the center-of-mass within the volume is stored at
the node.
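A hedged C sketch of the hierarchical-tree node just described (field names are illustrative assumptions, not taken from any specific BH implementation):

/* Barnes-Hut octree node for 3-D problems (illustrative sketch). */
typedef struct BHNode {
    double center[3];           /* geometric center of this cubic cell        */
    double half_width;          /* half the edge length of the cell           */
    double total_mass;          /* total mass (or charge) contained in cell   */
    double center_of_mass[3];   /* mass-weighted average position             */
    int    particle_index;      /* index of the single particle, or -1        */
    struct BHNode *child[8];    /* eight octants; NULL where empty            */
} BHNode;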
The BH method has a computational burden of O(N ln N). Far-field interactions are
usually calculated with a low-order multipole expansion, though often just the
zeroth-order "charge-charge" term is employed. The FMM is based on the same octree
structure as the BH method, though with no restriction on having just one (or no)
particle in the leaf nodes. With efficient management of the interactions, however,
FMM reduces the computational effort to O(N). In addition to this computational efficiency,
the errors in the FMM can be reduced to machine accuracy by keeping enough terms in the
multipole expansions. A multipole acceptance criterion is employed to determine what
kind of interaction (direct or approximate) should be included for each particle. Parallel
formulations for both the BH and FMM methods have been developed [21-24], and it is
possible to apply a similar hierarchical strategy to DD simulations. However, another
important challenge for DD simulations is that dislocations are curvilinear defects that
run through a lattice. This makes DD simulation more complicated than particle problems.
While we can divide space into subdivisions, connectivity between dislocation segments
has to be maintained if the dislocation occurs across the boundaries of these subdivisions.
In the meantime, dislocation systems are heterogeneous—i.e., there are regions of high
dislocation density and regions that are essentially empty of dislocations. This makes it
difficult to divide the system into sub-systems with similar problem sizes for parallel
simulations in order to maintain the load balance for each working processor.
Due to these challenges, very few parallel DD simulations have been implemented.
One well-known implementation, called DD3D, in which dislocations are treated as
nodes connected through straight line segments, was developed by Lawrence Livermore
National Laboratory (LLNL) [25]. In the original implementation, one node may have
more than two arms due to the annihilation of two or more segments. This may create
complex microstructures that can be artificial according to their topological rules. In later
modeling implementations, improvements were made such that each dislocation retains
its identity even in strong interactions and its topological configuration is fully controlled
by the force acting on the dislocation without defined rules, which are more physically
based.
Domain Decomposition and Paralleled Implementation
In order to implement the complex calculations required for various dislocation
dynamics systems, it is crucial for the software package to utilize a large number of
processors efficiently in parallel. To date, efficient usage of 1,500 processors has been
demonstrated [26]. In such an implementation, all processors are treated equally during
the simulation. In other words, there is no distinction—such as “master” versus
“slaves”—between the processors. The primary objective of DD parallelization is to
divide the simulation box (also referred to as the domain), representing the physical
volume that contains the dislocation system, into different sub-domains, and to solve the
equation of motions of dislocation particles (DPs) in each domain independently on a
single processor, as shown in Figure 1. This way, communications are mostly local. That
is, each processor can obtain most of the information it needs by communicating with its
nearest neighbors. As a result, the amount of work for each processor is reduced and
the speed of calculation is improved.
Figure 1. Decomposition of total simulation space into 3 ×3 ×2 domains along x, y, z
axes.
Dislocation microstructures can be highly heterogeneous and some processors may
contain a lot more nodes than others [26]; hence, dividing the total domain into equally
sized and/or shaped sub-domains may lead to severe load imbalances. As a matter of fact,
the only two requirements are that the DPs in each sub-domain should be as close as
possible and that each sub-domain has a similar number of DPs [9]. The first criterion
ensures that the closest neighbors of most dislocation DPs reside on the same processor
so that they will not be transferred from a different processor when needed. This way,
communication is minimized. The second criterion ensures that all processors have a
similar amount of work so that load balancing is achieved for optimal performance. To
reach a good load balance, it is important to perform data decomposition, as follows. The
total simulation box is first divided into Nx domains along the x direction such that each
domain contains an equal number of nodes. Each domain is then further divided along the
y direction by Ny times, and the resulting number of domains is again divided along the z
direction by Nz times. At the end, we obtain Nx ×Ny ×Nz domains, all containing the same
number of nodes, as shown in Figure 1. However, because the dislocation structure
evolves during the simulation, one needs to re-partition the problem among processors
from time to time in order to maintain a good load balance. The optimal number of nodes
per domain is in the range of 200 to 1000. In this case, the computational load on each
processor is relatively light, while most of the computing time is still spent on
computation instead of communication. If and when the total number of dislocation
segments increases significantly (e.g., due to dislocation multiplication), it is usually
helpful to stop and restart the simulation with more processors in order to maintain a
reasonable simulation speed.
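The recursive Nx × Ny × Nz partitioning described above can be sketched along one axis as follows (an illustrative, assumption-level example; the actual ParaDiS decomposition code is more elaborate). The same routine is applied along x, then within each x-slab along y, then along z.

#include <stdlib.h>

/* Comparison function for qsort on doubles. */
static int cmp_double(const void *a, const void *b)
{
    double da = *(const double *)a, db = *(const double *)b;
    return (da > db) - (da < db);
}

/* Choose ndomains-1 cut positions along one axis so that each slab holds a
   similar number of dislocation nodes (illustrative sketch only). */
void balanced_cuts(double *positions, int nnodes, int ndomains, double *cuts)
{
    qsort(positions, nnodes, sizeof(double), cmp_double);
    for (int i = 1; i < ndomains; i++) {
        int idx = (int)((long)i * nnodes / ndomains);
        cuts[i - 1] = positions[idx];
    }
}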
Common DD Simulation Software
To date, different dislocation dynamics simulation software packages have been
developed. Besides the ParaDiS code that we used as the primary simulation tool, there are
also other open-source software packages. MicroMegas, for example, also known as
'mM', is a 3-D discrete dislocation dynamics program developed by the Laboratoire
d'Etude des Microstructures, CNRS-ONERA, France. The software was released as free
software under the terms of the GNU General Public License (GPL), which is published
by the Free Software Foundation [27]. The MicroMegas code is mainly
used for the study of mono-crystalline metal plasticity and is based on the elastic theory
of dislocations. Its source code is written in a mix of FORTRAN 90 and FORTRAN 95
and includes 18 source modules. The code can be used for DD simulation for HCP, BCC,
and FCC systems [28].
Although simulation codes differ in their detailed structures and their work-flow,
there are some basic features that the codes have in common. All of the simulation codes
treat dislocations as discrete finite sets of degrees of freedom attached to line segments.
Then, forces on the dislocation lines are estimated from the elastic theory of dislocations
and the positions of the dislocation segments are updated according to a system-
dependent equation of motion [29]. The dislocation configurations are usually
represented by a set of curved dislocation lines that are typically discretized into a
succession of straight segments, marked by the start and end positions of the segments on
a discrete simulation lattice. The degrees of freedom are defined by the position, length,
and velocity of the segments. The entire dislocation dynamics simulation is based on the
time integration for the motion of dislocation segments in well-defined crystallographic
directions at a steady velocity. The simulation lattices are usually chosen with periodic
boundary conditions and dimensions comparable to the volume attached to Gauss points in
finite element (FE) methods.
As a result, DD simulations function like a numerical coarse-graining method through the
connection between discrete and continuous descriptions of plastic flow.
CHAPTER III
METHODOLOGY
Design
To prepare the input parameters for the ParaDiS simulations, we carried out a
series of first-principles simulations based on density functional theory (DFT) for a bulk
Ni system with various S substitution sites under pressures of 0, 15, and 30 GPa. We
chose 1 S atom substituted at the center for the 12.5% S-doped Ni system, and 2 S atoms
at the center and corner for the 25% S-doped Ni system. The shear moduli and Poisson's
ratios obtained from these simulations (Table 1) were used as ParaDiS input parameters
and are discussed below.
Table 1. Shear moduli and Poisson's ratios from the DFT simulations

Pressure   Model   Shear modulus (GPa)   Poisson's ratio
0 GPa      Pure    102.9                 0.3226
0 GPa      12.5%   75.24                 0.3533
0 GPa      25%     22.76                 0.4486
15 GPa     Pure    124.45                0.3331
15 GPa     12.5%   94.36                 0.3624
15 GPa     25%     34.72                 0.4427
30 GPa     Pure    142.35                0.3404
30 GPa     12.5%   112.6                 0.3654
30 GPa     25%     48.42                 0.4350
Figure 2. Initial dislocation setup for a 1 × 1 × 1 µm³ cubic FCC nickel lattice system in ParaDiS.
We then employed the Parallel Dislocation Simulator (ParaDiS) package as the
primary tool to carry out our dislocation dynamics simulation study. The ParaDiS code
was first compiled on the QB2 cluster of the LONI system using the Intel, GNU, and PGI
C compilers. The initial dislocation configuration was then generated with the
ParaDiSGen utility tool. A cubic system with a 1 × 1 × 1 µm³ lattice was chosen as the
initial simulation scale. We also applied the FCC_0 mobility module as the dynamics
function for the dislocation development. The FCC_0 mobility module attempts to
simulate easy glide, with its glide planes limited to the {111} planes of FCC materials.
Since no crystallographic information was used in the dislocation core reactions, junction
formation could take place even slightly off the zone axis. By default, this module
automatically enables the <enableCrossSlip> control parameter, which allows dislocations
to cross-slip to new glide planes.
Moreover, we analyzed the original ParaDiS source code and explored different code
optimization approaches to reduce program run time. The results of these different
optimization techniques were compared against the un-optimized code.
Resources
Previously known as Dislocation Dynamics in 3 Dimensions (DD3d), the ParaDiS
code is a free software package for large-scale dislocation dynamics simulations. It was
developed by Lawrence Livermore National Laboratory as a key part of the multi-scale
ASC Physics and Engineering Models effort to model material strength. The ParaDiS
code is written primarily in C with some C++ interface for real-time 3D plotting display
capability. The implementation of MPI libraries makes it a powerful tool for parallel
computing [26]. In addition, the ParaDiS package is based on a non-singular continuum
theory of dislocations that allows for accurate numerical calculations of dislocations in
terms of multiple dislocation nodes [30]. It is a massively parallel and specialized
material physics code that enables study of the fundamental mechanisms of plasticity at
the micro-structure dislocation level. In ParaDiS, dislocations are represented by nodes
that are interconnected by straight segments [19], as shown in Figure 3. These nodes can
be discretization nodes for representing a smooth line (e.g., nodes 1, 2, and 3) or physical
nodes where three dislocations meet (e.g., node 0).
Figure 3. Dislocation networks represented as a set of “nodes” (empty circles)
interconnected by straight segments.
Both discretization and physical nodes are treated on equal footing in ParaDiS—they
have a common data structure and, essentially, the same equations of motion. Two nodes
connected by an arm are referred to as neighbors, and an arbitrary number of neighbor
nodes are allowed for a node. The arms between nodes represent straight dislocation
segments and, hence, are associated with a Burgers vector. The Burgers vector on each
arm is fixed until it is either destroyed or altered due to a dislocation reaction.
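A hedged sketch of the node-and-arm representation just described (the field names and the fixed arm limit are illustrative assumptions, not the actual ParaDiS data structures):

#define MAX_ARMS 8   /* an arbitrary illustrative limit */

/* Dislocation node with several arms (illustrative sketch). */
typedef struct {
    double pos[3];                  /* node position                        */
    double vel[3];                  /* node velocity                        */
    int    numArms;                 /* number of neighbor connections       */
    int    neighborId[MAX_ARMS];    /* indices of the neighbor nodes        */
    double burgers[MAX_ARMS][3];    /* Burgers vector carried by each arm   */
    double glidePlane[MAX_ARMS][3]; /* glide plane normal of each arm       */
} NodeSketch;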
Most of our simulation work was run on the QB2 cluster of the LONI network. The
Louisiana Optical Network Initiative (LONI) network is an advanced fiber optics
network that runs throughout Louisiana. It connects major research universities in
Louisiana, including Louisiana State University (LSU), Louisiana Tech University, LSU
Health Sciences Center in New Orleans, LSU Health Sciences Center in Shreveport,
Southern University, Tulane University, University of Louisiana at Lafayette and
University of New Orleans, allowing greater collaboration on research that produces
results quickly and with great accuracy. LONI provides the most powerful distributed
supercomputer resources available to any academic community, with 1~2 Petaflops of
computational capacity [31].
As the core of LONI, the Queen Bee supercomputer system is located in the state
Information Systems Building in downtown Baton Rouge. The original QB
supercomputer was launched in 2007, with 50 Teraflops of computational capacity. It
was later upgraded to 1500 Teraflop peak performance at the end of June 2014 and
ranked as one of the Top 50 supercomputer systems in the world. The upgraded QB is
known as QB2. It has 504 compute nodes with 960 NVIDIA Tesla K20x GPUs and over
10,000 Intel Xeon processing cores. Each node has two 10-core 2.8 GHz Xeon 64-bit
processors and either 64 or 128 GB of memory. The QB2 cluster system also features a
1 Gb/sec Ethernet management network, 10 Gb/sec and 40 Gb/sec external connectivity,
and a 2.8 PB Lustre file system [31].
Software Debugging
During the compilation of the original ParaDiS code, there were a huge number of
warnings. To prevent potential problems, it is recommended to fix these warnings rather
than ignoring them. Some warnings are reported due to deprecated library calls, such as
XKeycodeToKeysym(). The reason for this is that the original source code was written
more than 5 years ago. Understandably, the code has fallen behind current compiler
standards, which recommend newer library calls that carry enhancements and bug fixes. A
simple solution is to substitute the deprecated calls with their newer replacements and
include the corresponding header files.
values of certain functions—such as fscanf()—are ignored. Such warnings are fixed by
checking the return values of these calls in if statements so that failures are caught.
Other warnings report a deprecated conversion from string constants to 'char *'. The
reason for this is that in standard C a string literal should be treated as a pointer to a
constant character array, and any attempt to modify it risks a program crash. A simple
solution is to replace 'char *' with 'const char *' for the reason discussed above.
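A small hedged sketch of these two fixes (the function and variable names are hypothetical, not from the ParaDiS source):

#include <stdio.h>

/* Check the return value of fscanf() instead of ignoring it, and use
   const char * for string constants (illustrative sketch). */
void read_param(FILE *fp)
{
    double value;

    if (fscanf(fp, "%lf", &value) != 1) {
        fprintf(stderr, "read_param: failed to read value\n");
        return;
    }

    const char *label = "shear modulus";   /* const avoids the warning */
    printf("%s = %f\n", label, value);
}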
Besides these warnings, ParaDiS failed to start and encountered a segmentation
fault. The most likely reason was a memory allocation issue, and the error was eliminated
after setting all dynamically allocated pointers to NULL following the free() function calls.
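A minimal sketch of this free-then-NULL pattern (the function and pointer names are hypothetical, not from ParaDiS):

#include <stdlib.h>

/* Free a buffer and clear the caller's pointer so a stale pointer cannot be
   dereferenced or freed twice (illustrative sketch). */
void free_and_clear(double **buf)
{
    free(*buf);
    *buf = NULL;
}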
Initial attempts to run ParaDiS through PBS script job submission also failed; however, a
direct mpirun command worked. The reason is that the default makefile setup enables the
X-Window system for plotting a real-time 3D dislocation dynamics box, and this feature
is unavailable for jobs that run through PBS submission. A work-around was achieved by
disabling the X-Window plotting capability in the makefile configuration.
Preliminary Wall Time Analysis
Prior to optimization, preliminary wall time analysis was needed to determine how
much time each module/operation takes. Detailed timing information was obtained from
the timer files of the ParaDiS output data files. Noticeably, the nodal force and cell
charge computations occupied an extremely high portion of the program execution wall
time. The reason for this is that both modules were computationally intensive. In the
nodal force module, there was a large number of vector and matrix calculations, such as
inner products and cross products of vectors, matrix multiplications, matrix transposes,
inverse matrix calculations, etc. Due to accuracy concerns, most of the variables used in
these calculations were double-precision floating-point values. Arithmetic on these
floating-point values, especially multiplication and division, takes substantially longer
than on integer values. The cell charge module implemented a six-level nested for loop
due to the nature of the Cartesian coordinate notation, which results in O(N⁶) complexity
for a problem of size N.
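As a schematic illustration of why the complexity grows as O(N⁶) (this is not the actual ParaDiS cell charge code), a loop over all pairs of cells indexed by Cartesian coordinates has six nested levels:

/* Six nested loops over all pairs of cells (i,j,k) and (l,m,n) in an
   N x N x N cell grid: N^6 iterations in total (schematic only). */
void cell_charge_schematic(int N)
{
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
          for (int l = 0; l < N; l++)
            for (int m = 0; m < N; m++)
              for (int n = 0; n < N; n++) {
                  /* accumulate the interaction between cell (i,j,k)
                     and cell (l,m,n) here */
              }
}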
Table 2. Time occupancy of different modules in the ParaDiS code

Module Name                Time/s
TOTAL TIME                 2245.238
NODAL FORCE                1028.745
CELL CHARGE                473.881
SEND COL TOPO CHANGES      284.737
SEGMENT FORCE COMM         158.863
SPLIT MULTI-NODES          138.982
COMM SEND GHOSTS           71.835
COMM SEND VELOCITY         34.603
HANDLE COLLISIONS          25.802
GENERATE ALL OUTPUT        8.032
NODAL VELOCITY             3.871
CHAPTER IV
OPTIMIZATION
Loop Unrolling
The original ParaDiS code contains a large number of for loops. One approach that
aims to optimize a program's execution speed is loop unrolling, which reduces the
number of loop iterations by replacing the original statements with a repeated sequence
of similar independent statements [32]. As a result, the number of jump and branch
instructions is reduced, which makes the loops faster. If the statements inside the loop
are independent, they can also benefit from parallel execution with compiler and
processor support.
The disadvantages of loop unrolling were also taken into account. The additional
sequential lines of code lead to an increase in source code size and reduced readability.
Allocation of extra temporary variables also increases register usage, and if the unrolled
loop body grows too large for the instruction cache, instruction cache misses become
more likely.
In practice, this approach is helpful for very simple loops with considerably large
iteration counts (N > 100,000). In the original ParaDiS source code, single (un-nested)
loops with very large iteration counts are not widely used, so the benefit of loop
unrolling is marginal.
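A minimal sketch of manual loop unrolling by a factor of four (illustrative only, not taken from the ParaDiS source):

/* Sum an array with the loop unrolled by a factor of 4 (illustrative sketch). */
double sum_unrolled(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;

    /* Unrolled loop: four independent accumulators reduce the number of
       branch instructions and expose instruction-level parallelism. */
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}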
SIMD Implementations
Single instruction, multiple data (SIMD) instructions allow a computer to perform the
same operation on multiple data elements simultaneously. Two improvements are achieved
with the implementation of SIMD instructions. The first is the capability to load data as
small blocks instead of individual variables, which significantly reduces data loading and
retrieval time compared with traditional scalar loads. Another advantage is that a single
instruction can be applied to all of the data elements in a loaded block at the same
time. These advantages make SIMD instructions extremely helpful for vector
calculations and vectorized processing.
SIMD instructions require both hardware support from the processor and software
support from the compiler. Traditional ways to implement SIMD instructions are to write
the program in assembly language or to insert assembly code into standard C/C++ source
codes. Later, intrinsics were developed as functions pre-defined by the compiler that map
directly to sequences of assembly language instructions [33]. No calling linkage is
required for intrinsic functions since they are built into the compiler. Unlike assembly
language, intrinsics allow the user to write SIMD code with a C/C++ interface without
worrying about register names and allocation, because the compiler handles this process
automatically. Most modern C/C++ compilers, such as the Intel, GNU, and Visual Studio
compilers, support intrinsics and have intrinsic functions built in.
The capability of SIMD instructions advances with the development of processor
architecture. This is because the maximum amounts of data that can be loaded to registers
depend on the number of registers and the register size. Early SIMD instruction sets (SSE,
Streaming SIMD Extensions) only handled blocks of 4 single-precision floating-point
values due to the 128-bit size limit of their XMM registers. The current AVX (Advanced
Vector Extensions) instruction set is much more powerful because the register size was
increased from 128 to 256 bits in the processor architecture, which allows the processor
to load or compute on a block of 4 double-precision or 8 single-precision floating-point
values at one time [34].
As ParaDiS is designed to solve simulation problems in a three-dimensional cubic
space, it performs a large number of double-precision floating-point vector calculations
in Cartesian coordinates, such as inner products, cross products, matrix multiplications,
and normalizations. Vector calculations like these can be accelerated using SIMD
instructions [35].
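A minimal AVX intrinsics sketch of this idea, adding two 3-component double-precision vectors with a single vector instruction (illustrative only; the functions actually substituted into ParaDiS are listed in Appendix B):

#include <immintrin.h>

/* Add two 3-component double-precision vectors with AVX (illustrative sketch). */
void vec3_add_avx(const double a[3], const double b[3], double c[3])
{
    double tmp[4];
    /* Pack the three components (plus a zero pad) into 256-bit registers. */
    __m256d va = _mm256_set_pd(0.0, a[2], a[1], a[0]);
    __m256d vb = _mm256_set_pd(0.0, b[2], b[1], b[0]);
    __m256d vc = _mm256_add_pd(va, vb);   /* all four lanes added at once */
    _mm256_storeu_pd(tmp, vc);
    c[0] = tmp[0]; c[1] = tmp[1]; c[2] = tmp[2];
}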
Write Buffer
A buffer is a temporary region of memory used for data transfer. Since memory access
is much faster than hard disk access, saving a block of data into a buffer and then writing
it out in one operation takes substantially less time than writing each piece of data
directly to the hard disk. The number of I/O requests is also substantially reduced.
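A hedged sketch of the write-buffer idea using the C standard library (illustrative only; the buffer size and names are assumptions, and ParaDiS's actual output code differs): log lines are accumulated in memory and flushed to disk in a single call. A similar effect can also be obtained by enlarging stdio's own buffer with setvbuf().

#include <stdio.h>
#include <string.h>

/* Accumulate output lines in a memory buffer and write them to disk in one
   fwrite() call (illustrative sketch). */
#define WBUF_SIZE 65536

static char   write_buf[WBUF_SIZE];
static size_t buf_used = 0;

void buffered_write(FILE *fp, const char *line)
{
    size_t len = strlen(line);

    /* Very long lines go straight to disk. */
    if (len >= WBUF_SIZE) {
        fwrite(line, 1, len, fp);
        return;
    }
    /* Flush when the next line would overflow the buffer. */
    if (buf_used + len > WBUF_SIZE) {
        fwrite(write_buf, 1, buf_used, fp);
        buf_used = 0;
    }
    memcpy(write_buf + buf_used, line, len);
    buf_used += len;
}

void flush_write_buffer(FILE *fp)
{
    if (buf_used > 0) {
        fwrite(write_buf, 1, buf_used, fp);
        buf_used = 0;
    }
}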
Further Implementation of OpenMP
OpenMP (Open Multi-Processing) is a widely used compiler directive tool for
multiprocessing programming on a variety of shared-memory supercomputers. It is
supported in the C/C++ and FORTRAN programming languages and has been implemented
in many compilers, including Visual C++, the Intel compiler, the GNU compiler, and the
Portland Group compiler. It allows parallelization through multithreading, whereby a
master thread forks a specified number of slave threads and each slave thread executes
its portion of the divided task independently [4]. The OpenMP mode is a new feature in
ParaDiS 2.5.1,
but it was only preliminarily implemented and not fully supported. Extended
implementation of OpenMP to more time-consuming modules of ParaDiS may result in
performance improvements due to multithreading.
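A minimal OpenMP sketch of the loop-level multithreading described above (illustrative only; the loop body and names are not from ParaDiS):

#include <omp.h>

/* Parallelize an independent per-node loop across OpenMP threads
   (illustrative sketch). */
void scale_velocities(double *vel, int n_nodes, double factor)
{
    /* The master thread forks a team; each thread handles its own
       chunk of iterations independently. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_nodes; i++) {
        vel[i] *= factor;
    }
}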
CHAPTER V
RESULTS
Multi-core Performances
Since ParaDiS implements MPI for parallelism, several test runs on the un-optimized
ParaDiS codes were created as a control test and a benchmark reference. On a single
computation QB2 node, there were 2 Intel Ivy Bridge Xeon 10-core processors; this
allowed a maximum of 20 cores for computation on each node. We tested its
performance by utilizing 1, 2, 4, 8, 16, and 32 nodes, with all 20 processor cores per node
enabled.

Figure 4. Speedup of the original ParaDiS code on different numbers of nodes (total execution time in seconds for 1, 2, 4, 8, 16, and 32 nodes).

As shown in Figure 4, the total execution time of ParaDiS gradually decreases
when the number of computational nodes increases from 1 to 8 (20 cores to 160 cores),
and the total execution time is reduced from 2480 to 2194 seconds, for a 13%
improvement. However, when the number of nodes exceeds 8, the total execution takes
much longer because the communication time between nodes grows significantly. When
the number of nodes increases to 32, there is no benefit at all: the high communication
cost makes the run even slower than single-node execution. The performance gain, as the
number of nodes increases, is highly
dependent on the scale of the problem’s input complexity. Regarding our simulation
setup, 8 nodes was the optimal choice for solving the dislocation problem in this research,
as this configuration took the shortest computation time to finish compared with other
configurations. Besides accumulation of communication time and non-parallel sections of
the ParaDiS code, there was another critical factor that kept ParaDiS from near-linear
performance enhancement.

Figure 5. Efficiency of the different optimization approaches (total execution time in seconds for the original code and the loop unrolling, write buffer, AVX, and OpenMP versions).

This is because the QB2 parallel cluster is implemented in a
loosely coupled scheme. Each QB2 node runs an independent, autonomous OS, and only
part of computational resources—such as RAM, cache and buses—are shared.
Effectiveness of Optimizations
ParaDiS run times with the different optimization approaches are directly compared with
the un-optimized version running on a single node. As shown in Figure 5, the use of
different optimization methods yields remarkably distinct performance results.
Theoretically, the loop unrolling approach can contribute to noticeable performance
gains for single loops with a high number of iterations. However, in the case of ParaDiS
code, the number of such loops is quite limited. Another reason that makes loop unrolling
less effective is that in the actual implementation of single loops in ParaDiS, there are
many dependency chains, which means that one function call relies on the results of
previous statements. These two factors made loop unrolling counter-productive for our
code optimization.
A write buffer design only brings about very marginal performance improvement
because the writing frequency of ParaDiS is not very intensive, considering that the
default setting for data logging writes files every 200 cycles, at ~ 5 second intervals. The
typical sizes of output files are also considerably small by today’s standards, with the
largest file in the range of 10 to 20 KB.
Compared with the previous optimization approaches mentioned above, optimization
using SIMD intrinsic functions resulted in solid performance gains of 16.9%. The SIMD
intrinsic function call allows the parallelism of vector calculations at the instruction level,
which is important for ParaDiS since its most time-consuming module relies on
floating-point vector calculations.
The further implementation of OpenMP allows more parallelism through
multithreading mixed with the existing MPI implementation. This optimization brings about a
noticeable reduction in execution time by 7% compared with the original ParaDiS code.
The improvement is limited for a couple of reasons. The most important one is that there
exists data dependency that prevents certain threads from executing in parallel. Load-
balancing and synchronization overhead can also affect the final speedup in parallel
computing when using OpenMP.
Association with Experimental Results
Our primary purpose in using the ParaDiS code was to provide a theoretical dislocation
dynamics explanation for our ductility research on S-doped polycrystalline Ni under high
pressure. We prepared S-doped Ni samples with different S
concentration ratios of 7%, 11%, 14%, and 20% by ball mill. These samples were then
characterized in the synchrotron x-ray facility at Lawrence Berkeley National Laboratory
under different pressures up to ~30 GPa at ambient temperatures. As shown in Figure 6,
from textural analysis of our X-ray diffraction experiment, we concluded that the 14%
S-doped Ni specimen exhibited the most ductility under high pressure (~30 GPa),
compared with the 7%, 11%, and 20% S specimens. Although ParaDiS simulations
assume an isotropic bulk system at the micron level, the previous literature suggested that
textural patterns at high pressures are similar in the 20 nm - 500 nm particle range for
FCC metals.

Figure 6. Inverse pole figures of S-doped nickel samples with 7%, 11%, 14%, and 20% S concentrations along the normal direction (ND) under compression. Equal area projection and a linear scale are used.

It is reasonable, therefore, to use ParaDiS dislocation dynamics results to
explain our experimental specimens at the ~ 40 nm scale. At a high pressure of ~ 30 GPa,
the 12.5% S doped Ni system showed the highest dislocation densities and dislocation
velocities along different directions, which is in agreement with our earlier textural
pattern, in which the 14% S sample showed the strongest texture at a high pressure of 26.3
GPa (Figures 7 and 8). According to Ashby [36], under compression, plastic deformation
induces an increase in dislocation density, resulting in strain hardening, which makes
ductile materials stronger.
Figure 7. Dislocation density of S-doped Ni under different pressures in the ParaDiS simulation.

Figure 8. Dislocation density along different directions for S-doped Ni under different pressures in the ParaDiS simulation.
Summary
The original ParaDiS code gained substantial performance increases when the
number of nodes increased from 1 to 8. A maximum speed-up of 13% was achieved for
solving the dislocation systems in this work when the number of nodes increased to 8,
with 160 cores in total. Among the different optimization approaches we have attempted,
the SIMD implementation has the most noticeable speedup, by 16.9 %, using the AVX
intrinsic function libraries, which allow accelerated vector computation. The loop-
unrolling method led to a counter-productive effect due to the limitations of single loop
implementations in the ParaDiS code. The write buffer optimization was less helpful in
practice due to the comparatively low writing frequency and the small size of output
logging files. The extended implementation of OpenMP to more ParaDiS modules brings
about a 7% performance increase by multithreading during program execution, indicating
that current parallel programs can be improved using a hybrid implementation of MPI
and OpenMP.
The results of our ParaDiS dislocation dynamics simulations gave strong theoretical
support to our earlier experiments. We conclude that high dislocation density and
dislocation velocity are the major reasons for the high ductility observed in our
experimental data.
Future Work
Since the original code was written in C with a small portion in C++, there was no
object-oriented design. As a result, the structures and functions were kept separate, and
several functions had to be defined repeatedly in the source code, making it less readable
and less maintainable. The standard C language also lacks sufficient advanced libraries
and interfaces for handling strings, vectors, unions, etc. Rewriting the entire source code
in pure C++ with an object-oriented design, replacing the separate structures and
functions with classes, would be preferable for the reasons mentioned above.
The emerging GPGPU (general-purpose graphics processing unit) is also a good
candidate for accelerating the performance of ParaDiS, because a GPU has a large
number of cores optimized for SIMD-style algorithms. To get the full benefit of GPU
power, one needs to write the code in NVIDIA CUDA C/C++, an extension of C/C++
designed to run specifically on a GPU. However, this approach is challenging because
the thread management schemes of a GPU differ from those of a CPU. A simpler, yet
less powerful, alternative is to use OpenACC directives to accelerate parallel computation
through a hybrid use of CPU and GPU.
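As a hedged sketch of the OpenACC alternative (illustrative only; the loop and names are hypothetical), a compute-intensive loop can be offloaded with a single directive:

/* Offload an independent loop to the GPU with an OpenACC directive
   (illustrative sketch). */
void scale_array_acc(double *a, int n, double factor)
{
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; i++) {
        a[i] *= factor;
    }
}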
REFERENCES
[1] Wikipedia. Tianhe-2 Supercomputer. 2014: http://en.wikipedia.org/wiki/Tianhe-2.
[2] Hager, G. and G. Wellein, Introduction to High Performance Computing for
Scientists and Engineers 1ed. 2010: CRC Press. p. 95-103.
[3] Gropp, W., et al., A high-performance, portable implementation of the MPI
message passing interface standard. Parallel computing, 1996. 22(6): p. 789-828.
[4] Dagum, L. and R. Menon, OpenMP: an industry standard API for shared-memory
programming. Computational Science & Engineering, IEEE, 1998. 5(1): p. 46-55.
[5] Motz, C. and D. Dunstan, Observation of the critical thickness phenomenon in
dislocation dynamics simulation of microbeam bending. Acta Materialia, 2012.
60(4): p. 1603-1609.
[6] Senger, J., et al., Dislocation microstructure evolution in cyclically twisted
microsamples: a discrete dislocation dynamics simulation. Modelling and
Simulation in Materials Science and Engineering, 2011. 19(7): p. 74-104.
[7] Zhou, C. and R. LeSar, Dislocation dynamics simulations of plasticity in
polycrystalline thin films. International Journal of Plasticity, 2012. 30: p. 185-
201.
[8] Vattré, A., et al., Modelling crystal plasticity by 3D dislocation dynamics and the
finite element method: the discrete-continuous model revisited. Journal of the
Mechanics and Physics of Solids, 2014. 63: p. 491-505.
[9] Wang, Z., et al., A parallel algorithm for 3D dislocation dynamics. Journal of
computational physics, 2006. 219(2): p. 608-621.
[10] Rhee, M., et al., Dislocation stress fields for dynamic codes using anisotropic
elasticity: methodology and analysis. Materials Science and Engineering: A,
2001. 309: p. 288-293.
[11] Schwarz, K., Simulation of dislocations on the mesoscopic scale. I. Methods and
examples. Journal of Applied Physics, 1999. 85(1): p. 108-119.
[12] Ghoniem, N.M., J. Huang, and Z. Wang, Affine covariant-contravariant vector
forms for the elastic field of parametric dislocations in isotropic crystals.
Philosophical Magazine Letters, 2002. 82(2): p. 55-63.
[13] Ghoniem, N., S.-H. Tong, and L. Sun, Parametric dislocation dynamics: a
thermodynamics-based approach to investigations of mesoscopic plastic
deformation. Physical Review B, 2000. 61(2): p. 913-927.
[14] Beneš, M., et al., A parametric simulation method for discrete dislocation
dynamics. The European Physical Journal-Special Topics, 2009. 177(1): p. 177-
191.
[15] El-Awady, J.A., S.B. Biner, and N.M. Ghoniem, A self-consistent boundary
element, parametric dislocation dynamics formulation of plastic flow in finite
volumes. Journal of the Mechanics and Physics of Solids, 2008. 56(5): p. 2019-
2035.
[16] Bulatov, V.V., Crystal Plasticity from Dislocation Dynamics, in Materials Issues
for Generation IV Systems. 2008, Springer. p. 275-284.
[17] Ghoniem, N. and J. Huang, Computer simulations of mesoscopic plastic
deformation with differential geometric forms for the elastic field of parametric
dislocations: Review of recent progress. Le Journal de Physique IV, 2001.
11(PR5): p. 53-60.
[18] Wang, Z., et al., Dislocation motion in thin Cu foils: a comparison between
computer simulations and experiment. Acta materialia, 2004. 52(6): p. 1535-1542.
[19] Dehnen, W., A hierarchical O (N) force calculation algorithm. Journal of
Computational Physics, 2002. 179(1): p. 27-42.
[20] Greengard, L. and V. Rokhlin, A fast algorithm for particle simulations. Journal
of Computational Physics, 1997. 135(2): p. 280-292.
[21] Amor, M., et al., A data parallel formulation of the barnes-hut method for n-body
simulations, in Applied Parallel Computing. New Paradigms for HPC in Industry
and Academia. 2001, Springer. p. 342-349.
[22] Singh, J.P., et al., Load balancing and data locality in adaptive hierarchical N-
body methods: Barnes-Hut, fast multipole, and radiosity. Journal of Parallel and
Distributed Computing, 1995. 27(2): p. 118-141.
[23] Grama, A., V. Kumar, and A.H. Sameh. N-Body Simulations Using Message
Passing Parallel Computers. in PPSC. 1995.
[24] Grama, A., V. Kumar, and A. Sameh, Scalable parallel formulations of the
Barnes–Hut method for n-body simulations. Parallel Computing, 1998. 24(5): p.
797-822.
[25] Bulatov, V., et al. Scalable line dynamics in ParaDiS. in Proceedings of the 2004
ACM/IEEE conference on Supercomputing. 2004. IEEE Computer Society.
[26] Cai, W., et al., Massively-parallel dislocation dynamics simulations. in IUTAM
Symposium on Mesoscopic Dynamics of Fracture Process and Materials Strength.
2004.
[27] MicroMegas Introduction: http://zig.onera.fr/mm_home_page/.
[28] Code: microMegas, a 3-D DDD (Discrete Dislocation Dynamics) simulations :
https://icme.hpc.msstate.edu/mediawiki/index.php/Code:_microMegas.
[29] Devincre, B., et al., Modeling crystal plasticity with dislocation dynamics
simulations: The ‘microMegas’ code. Mechanics of Nano-objects. Presses de
l'Ecole des Mines de Paris, Paris, 2011: p. 81-100.
[30] Cai, W., et al., A Non-singular Continuum Theory of Dislocations. J. Mech. Phys.
Solids, 2006. 54(3): p. 561-587.
[31] LONI OVERVIEW. 2014: http://www.loni.org/.
[32] Huang, J.-C. and T. Leng. Generalized loop-unrolling: a method for program
speedup. in Application-Specific Systems and Software Engineering and
Technology, 1999. ASSET'99. Proceedings. 1999 IEEE Symposium on. 1999.
IEEE.
[33] MSDN. Introduction to SIMD Intrinsics: https://msdn.microsoft.com/en-
us/library/26td21ds(v=vs.90).aspx.
[34] Firasta, N., et al., Intel avx: New frontiers in performance improvements and
energy efficiency. Intel white paper, 2008.
[35] Eichenberger, A.E., P. Wu, and K. O'brien. Vectorization for SIMD architectures
with alignment constraints. in ACM SIGPLAN Notices. 2004. ACM.
APPENDIX A
PARADIS JOB SUBMISSION PBS SCRIPT
#!/bin/bash
#PBS -q workq
#PBS -A loni_mat_bio6
#PBS -l nodes=8:ppn=20
#PBS -l walltime=03:00:00
#PBS -o /home/cguo1987/worka/output
#PBS -j oe
#PBS -N Cheng_Paradis
#PBS -X
#
cd /home/cguo1987/worka/pub-ParaDiS.v2.5.1
#
mpirun -np 20 -machinefile $PBS_NODEFILE bin/paradis tests/Nickel.ctrl
#
# Mark the time processing ends.
#
date
#
# And we're out'a here!
#
exit 0
APPENDIX B
SIMD VECTOR AND MATRIX COMPUTATION FUNCTIONS
Calculation of the cross product of two vectors

void cross(double va[3], double vb[3], double vc[3])
{
    __m256d a, b, c, ea, eb;
    double tmp[4] = {0.0, 0.0, 0.0, 0.0};
    _mm256_zeroall();
    /* Permuted copies: ea = (va[1], va[2], va[0]), eb = (vb[2], vb[0], vb[1]). */
    ea = _mm256_set_pd(0, va[0], va[2], va[1]);
    eb = _mm256_set_pd(0, vb[1], vb[0], vb[2]);
    /* Complementary permutation: a = (va[2], va[0], va[1]), b = (vb[1], vb[2], vb[0]). */
    a = _mm256_set_pd(0, va[1], va[0], va[2]);
    b = _mm256_set_pd(0, vb[0], vb[2], vb[1]);
    /* vc = va x vb, computed lane-wise. */
    c = _mm256_sub_pd(_mm256_mul_pd(ea, eb), _mm256_mul_pd(a, b));
    _mm256_storeu_pd(tmp, c);
    vc[0] = tmp[0];
    vc[1] = tmp[1];
    vc[2] = tmp[2];
}
Calculation of the inner product of two vectors

double inner(double a[3], double b[3])
{
    double fres = 0.0;
    double ftmp[4] = {0.0, 0.0, 0.0, 0.0};
    __m256d mres;
    /* pack the 3-element vectors into 256-bit registers (fourth lane zero)
       so that no element past the end of the arrays is read */
    mres = _mm256_mul_pd(_mm256_set_pd(0.0, a[2], a[1], a[0]),
                         _mm256_set_pd(0.0, b[2], b[1], b[0]));
    _mm256_storeu_pd(ftmp, mres);
    fres = ftmp[0] + ftmp[1] + ftmp[2];
    return fres;
}
Multiplication of a 3×3 matrix by a 3-element vector

void Matrix33Vector3Multiply(double A[3][3], double x[3], double y[3])
{
    double tmp[4];
    /* columns of A */
    __m256d A0 = _mm256_set_pd(0, A[2][0], A[1][0], A[0][0]);
    __m256d A1 = _mm256_set_pd(0, A[2][1], A[1][1], A[0][1]);
    __m256d A2 = _mm256_set_pd(0, A[2][2], A[1][2], A[0][2]);
    /* each element of x replicated across the first three lanes */
    __m256d x0 = _mm256_set_pd(0, x[0], x[0], x[0]);
    __m256d x1 = _mm256_set_pd(0, x[1], x[1], x[1]);
    __m256d x2 = _mm256_set_pd(0, x[2], x[2], x[2]);
    /* y = A0*x[0] + A1*x[1] + A2*x[2] */
    __m256d res = _mm256_add_pd(_mm256_mul_pd(A0, x0), _mm256_mul_pd(A1, x1));
    res = _mm256_add_pd(res, _mm256_mul_pd(A2, x2));
    _mm256_storeu_pd(tmp, res);
    y[0] = tmp[0]; y[1] = tmp[1]; y[2] = tmp[2];
    return;
}
Multiplication of a 3×1 matrix by a 3-element vector (outer product, producing a 3×3 matrix)

void Matrix31Vector3Mult(double mat[3], double vec[3], double result[3][3])
{
    double tmp[3][4];
    /* each element of mat replicated across all lanes */
    __m256d v0 = _mm256_set_pd(mat[0], mat[0], mat[0], mat[0]);
    __m256d v1 = _mm256_set_pd(mat[1], mat[1], mat[1], mat[1]);
    __m256d v2 = _mm256_set_pd(mat[2], mat[2], mat[2], mat[2]);
    /* each row of the result is vec scaled by one element of mat */
    __m256d rows0 = _mm256_set_pd(0, vec[2], vec[1], vec[0]);
    __m256d rows1 = _mm256_set_pd(0, vec[2], vec[1], vec[0]);
    __m256d rows2 = _mm256_set_pd(0, vec[2], vec[1], vec[0]);
    __m256d prod0 = _mm256_mul_pd(v0, rows0);
    __m256d prod1 = _mm256_mul_pd(v1, rows1);
    __m256d prod2 = _mm256_mul_pd(v2, rows2);
    _mm256_storeu_pd(tmp[0], prod0);
    _mm256_storeu_pd(tmp[1], prod1);
    _mm256_storeu_pd(tmp[2], prod2);
    result[0][0] = tmp[0][0]; result[0][1] = tmp[0][1]; result[0][2] = tmp[0][2];
    result[1][0] = tmp[1][0]; result[1][1] = tmp[1][1]; result[1][2] = tmp[1][2];
    result[2][0] = tmp[2][0]; result[2][1] = tmp[2][1]; result[2][2] = tmp[2][2];
    return;
}
Multiplication of a 3×3 matrix by a 3-element vector (AVX_Matrix33Vector3Multiply)

void AVX_Matrix33Vector3Multiply(double A[3][3], double x[3], double y[3])
{
    /* Same computation as Matrix33Vector3Multiply above: y = A * x */
    double tmp[4];
    __m256d A0 = _mm256_set_pd(0, A[2][0], A[1][0], A[0][0]);
    __m256d A1 = _mm256_set_pd(0, A[2][1], A[1][1], A[0][1]);
    __m256d A2 = _mm256_set_pd(0, A[2][2], A[1][2], A[0][2]);
    __m256d x0 = _mm256_set_pd(0, x[0], x[0], x[0]);
    __m256d x1 = _mm256_set_pd(0, x[1], x[1], x[1]);
    __m256d x2 = _mm256_set_pd(0, x[2], x[2], x[2]);
    __m256d res = _mm256_add_pd(_mm256_mul_pd(A0, x0), _mm256_mul_pd(A1, x1));
    res = _mm256_add_pd(res, _mm256_mul_pd(A2, x2));
    _mm256_storeu_pd(tmp, res);
    y[0] = tmp[0]; y[1] = tmp[1]; y[2] = tmp[2];
    return;
}
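As a sanity check, the AVX routines above can be compared against plain scalar implementations. The driver below is a minimal sketch added for illustration only; it is not part of the original ParaDiS appendix. It assumes the functions above appear earlier in the same source file (so that cross() and inner() are defined before main()) and that the file is compiled with AVX support, for example gcc -mavx -std=c99 appendix_b.c -lm.

#include <stdio.h>
#include <math.h>

/* Scalar reference implementations used only for verification */
static void cross_ref(double a[3], double b[3], double c[3])
{
    c[0] = a[1]*b[2] - a[2]*b[1];
    c[1] = a[2]*b[0] - a[0]*b[2];
    c[2] = a[0]*b[1] - a[1]*b[0];
}

static double inner_ref(double a[3], double b[3])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

int main(void)
{
    double a[3] = {1.0, 2.0, 3.0};
    double b[3] = {-4.0, 0.5, 2.5};
    double c_avx[3], c_ref[3];
    double err = 0.0;

    cross(a, b, c_avx);
    cross_ref(a, b, c_ref);

    printf("cross (AVX): %g %g %g\n", c_avx[0], c_avx[1], c_avx[2]);
    printf("cross (ref): %g %g %g\n", c_ref[0], c_ref[1], c_ref[2]);
    printf("inner (AVX): %g  inner (ref): %g\n", inner(a, b), inner_ref(a, b));

    /* report the largest absolute difference between the two cross products */
    for (int i = 0; i < 3; i++) {
        double d = fabs(c_avx[i] - c_ref[i]);
        if (d > err) err = d;
    }
    printf("max |AVX - ref| for cross: %g\n", err);
    return 0;
}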
  • 1. OPTIMIZATION AND IMPLEMENTATION OF PARADIS CODE IN PARALLEL CLUSTER Thesis Submitted to The Department of Computer Science and College of Engineering of Southern University and A&M College In Partial Fulfillment of the Requirements for The Degree of Master of Science in Computer Science By Cheng Guo Baton Rouge, Louisiana December, 2015
  • 2.
  • 3. iii © Copyright by Cheng Guo All rights reserved 2015
  • 4. iv ABSTRACT In this work, we implemented and optimized ParaDiS (Parallel Dislocation Simulator) code as a dislocation dynamics simulation tool for an S-doped Ni crystalline ductility study. After a brief analysis of the original ParaDiS source code, we focused on the most computational expensive modules of ParaDiS and implemented different optimization approaches, including loop unrolling, SIMD (single instruction multiple data) intrinsics implementation for vector calculations with Intel AVX instruction sets, write buffer, and OpenMP level optimization. The computational improvement and time reduction of different optimization methods were also done at the parallel cluster QB2 of the LONI system. This optimization method can be extended to similar computer configurations and parallel simulation codes. Our simulation results also support our experiment data and models. Keywords: High performance computing, parallel computation, optimization
  • 5. v ACKNOWLEDGEMENTS Special thanks to Dr. Wei Cai (co-developer of ParaDiS) and Dr. Amin Arbabian from Stanford University for their patient explanation of the ParaDiS code to the author, which helped the author to better understand the code. The author would like to express his gratitude to Dr. Shuju Bai and Dr. Ebrahim Khosravi for their advice on this research and thesis. The author also wishes to express his sincere appreciation and gratitude to his research advisor, Dr. Shizhong Yang, for his guidance, advice, availability, and support from the beginning of the research to the completion of this thesis. The author would like to show his deepest gratitude to his parents. Lastly, the author would like to express his warm regards and blessings to all of those who supported him in any respect during the completion of this project.
  • 6. vi TABLE OF CONTENTS APROVAL PAGE ............................................................................................................... ii COPYRIGHT PAGE.......................................................................................................... iii ABSTRACT....................................................................................................................... iv ACKNOWLEDGEMENTS................................................................................................ v CHAPTER I INTRODUCTION......................................................................................... 1 Significance.................................................................................................................... 3 Statement of Problem..................................................................................................... 4 Objectives....................................................................................................................... 5 Delimitations.................................................................................................................. 5 CHAPTER II REVIEW OF RELATED LITERATURE.................................................... 6 Dislocation Dynamics.................................................................................................... 6 Dislocation Computational Algorithms ......................................................................... 8 Domain Decomposition and Paralleled Implementation ............................................. 10 Common DD Simulation Software.............................................................................. 12 CHAPTER III METHODOLOGY ................................................................................... 14 Design .......................................................................................................................... 14 Resources ..................................................................................................................... 16 Software Debugging..................................................................................................... 18 Preliminary Wall Time Analysis.................................................................................. 19
  • 7. vii CHAPTER IV OPTIMIZATION...................................................................................... 21 Loop Unrolling............................................................................................................. 21 SIMD Implementations................................................................................................ 22 Write Buffer ................................................................................................................. 23 Further Implementation of OpenMP............................................................................ 23 CHAPTER V RESULTS .................................................................................................. 25 Multi-core Performances.............................................................................................. 25 Effectiveness of Optimizations .................................................................................... 27 Association with Experimental Results ....................................................................... 28 Summary ...................................................................................................................... 31 Future Work ................................................................................................................. 31 APPENDIX A................................................................................................................... 37 APPENDIX B................................................................................................................... 38
  • 8. 1 CHAPTER I INTRODUCTION Background In the past decades, computation technologies have evolved rapidly: the top supercomputers nowadays are capable of performing over 30 Petaflops per second [1]. These enhanced calculation capabilities of supercomputers, or high performance computers (HPC), are to a great extent achieved both by doing every operation in a shorter time and by having many computing units performing operations simultaneously. The latter approach is usually referred to as parallelism, and it is essential for high performance computing. The central idea behind most powerful computers today is the use of multiple cores, by which a given computational task is divided into several sub- tasks, which are then simultaneously executed on different cores. These cores, therefore, solve the computational problem in a cooperative way, and the number of cores used by a supercomputer is continuously increasing in time. The cores of an operating parallel computer work simultaneously. Since data transfer is typically the dominant factor that limits scientific code, network connections of paralleled computer systems play an important role in the parallel performance of applications. To generate efficient parallel code, three important network characteristics need to be taken into account: topology, bandwidth, and latency. These features all have an important influence on the performance of a parallel computer.
  • 9. 2 Usually, parallel computers are available in two types: shared-memory and distributed memory (also known as clusters). The major difference between them is that a distributed memory parallel computer includes one main memory for each code, whereas, in shared- memory systems, CPUs operate in a common shared memory space. The dominant HPC architectures at present and for the foreseeable future are comprised of nodes that are shared-memory Non-uniform Memory Access (NUMA) machines connected with the rest of the nodes following a distributed-memory pattern. Naturally, the efficiency of data transfer throughout the different nodes of a computer has drawn a lot of attention [2]. The Message Passing Interface (MPI) protocol, in particular, has been developed as a solution for writing codes that run in parallel in both distributed-memory machines and shared- memory machines [3]. Similarly, another protocol called OpenMP interface was developed to achieve higher efficiency in shared-memory machines [4]. Since most supercomputers are configured to implement hybrid share-distributed-memory architectures, the mixed use of different programming codes, including both MPI and OpenMP, is preferable. As the essential components of HPC technologies, high-speed advanced data networks are not only revolutionizing the ways that educators, researchers, and businesses work, but they are also dramatically changing the scale and character of the problems they can solve. The Louisiana Optical Network Initiative (LONI), for instance, is a state-of-the-art fiber optics network that runs throughout Louisiana and connects Louisiana and Mississippi research universities to one another as well as to the National Lambda Rail and Internet 2. LONI provides the most powerful distributed supercomputer resources available to any academic community, with ~ 2 Petaflops of computational
  • 10. 3 capacity and 10 Gbps Internet connections. In 2008, LONI joined the TeraGrid program of the National Science Foundation (NSF) as a new resource provider; since, then LONI has also contributed its resources to the national research community. The increase of existing computational power has made simulations an important field in scientific and engineering disciplines, bridging experimental and purely theoretical studies. As a mighty research tool, it allows for exploration when experimental procedures are expensive/difficult to carry out or when an experimental procedure needs better results under specific conditions. Dislocation Dynamics (DD) simulation are broadly used in the study of material science, physics, and mechanical engineering [5-7]. With the explosive growth of computational power in the past decade, dislocation dynamics simulation can now handle larger and more complex systems that consist of hundreds of thousands of atoms or multi-million atoms thanks to powerful processor performance, larger HPC clusters, and improved modeling of theory and methods. Significance The parallel implementation and optimization of complex simulations on high performance computers (HPCs) has become a frontier field in scientific researches. The analysis and comparison of different algorithms and function implementations, message passing protocols on HPC systems with different configurations, and the performance and efficiency of HPC systems are also of importance since one of the major reasons for developing faster computers is the demand to solve increasingly complicated scientific and engineering problems. Up to this time, 3-dimensional dislocation dynamics
  • 11. 4 simulations have become a significant tool in the study of the plasticity of sub-micron size metallic components due to the increasing implementation of small-scale devices. Statement of Problem In this work, we aim to evaluate how much the efficiency of running ParaDiS can be improved with different node configurations and optimization approaches, including loop unrolling, SIMD instructions, writing cache, and further implementation of OpenMP in the QB2 cluster. We will also use the computational results to explain and support our experimental data. Hypothesis The high performance computing (HPC) efficiency of running ParaDiS can be improved and optimized by increasing the number of computational nodes and by implementing SIMD and OpenMP techniques. Research Questions 1. How can the performance of these computing systems be improved by increasing the number of computational nodes in the simulation processes? 2. What are the primary factors that affect the computation time for the ParaDiS code? 3. How can the ParaDiS package be used as the benchmark for testing jobs that are running on the LONI clusters? 4. What techniques can be used to reduce overall program run time?
  • 12. 5 5. How well can dislocation dynamics simulation results be used to explain current experimental results and models? Objectives Different optimization techniques such as loop reduction, SIMD instructions, and further implementation of OpenMP have been analyzed and tested in QB2. An evaluation of the effectiveness of scientific computational performance is also included in this research work. To ensure the validity and accuracy of the simulation, our results are compared to existing experimental data. Delimitations This research was only tested in the QB2 cluster. The Implementation of ParaDiS on other computer architectures, such as IBM Blue Gene, may require modification in the software configurations. The performance improvement due to an increase of computational nodes only represented the dislocation system in this research; the optimal number of nodes may differ if the system, or the scale of the system, changes.
  • 13. 6 CHAPTER II REVIEW OF RELATED LITERATURE Dislocation Dynamics The plastic deformation of single crystals is carried out by a large number of dislocations. To translate the fundamental understanding of dislocation mechanisms into a quantitative physical theory for crystal plasticity, a new means of tracking dislocation motion and interactions over long time spans and large space evolution is needed. Three dimensional dislocation-dynamics (DD) simulation is aimed at developing a numerical tool for simulating the dynamic behavior of large numbers of dislocations of arbitrary shapes, interaction among groups of 3D dislocations, and the behavior of prescribed cell walls [8]. It produces stress/strain curves and other mechanical properties and allows a detailed analysis of dislocation microstructure evolution. In a numerical implementation, dislocation lines are represented by connected discrete line segments that move according to driving forces, including dislocation line tension, dislocation interaction forces, and external loading. The dislocation segments respond to these forces by making discrete movements according to a mobility function that is characteristic of the dislocation type and specific material being simulated. This dislocation mobility can be extracted from experimental data or calculated with atomistic simulations. Further, mobility is one of the key inputs to a DD simulation. Another important consideration for DD simulations is dealing with close dislocation-dislocation interactions such as annihilation and junction
  • 14. 7 formation and breaking. These close interactions can be quite complex and usually require special treatment. An efficient way to deal with them is to use prescribed “rules.” A bottleneck for DD simulation, which is long range in nature, is the calculation of the elastic interactions between dislocations. In order to perform DD simulations for realistic material plastic behavior, efficient algorithms must be developed to enable the simulation over a reasonable time and space range with a large number of dislocations Dislocations are curvilinear defects in materials and are considered the primary carriers of the plastic deformation. Dislocation microstructure evolution and dislocation pattern formation in deformed materials are crucial to determining the mechanical properties of these materials. A full understanding of the physical origin of plasticity, therefore, requires fundamental knowledge of these phenomena, such as the occurrence of slip bands and cell structures. Dislocation theory has been extensively studied and used to explain plasticity since 1930’s. In the past decades, with the development of fast computers, a new methodology, called dislocation dynamics (DD) has been developed and successfully applied to examine many of the fundamental aspects of the plasticity of materials at the micro-scale [9]. In DD simulations, the equations of motion for dislocations are numerically solved to determine the evolution and interaction of dislocations. Microstructure details are directly observed in the simulation and can be used to validate experimental results. Many DD models exist and most of them use straight segments to connect discrete dislocation nodes on a dislocation loop [10, 11]. This leads to singularities in which the self-force is calculated at the intersection of two segments because the self-force is a function of the curvature of the dislocation line. The fine meshing of the dislocation line with many
  • 15. 8 segments is specifically required for strong interactions, which demands additional, expensive computations. Such limits have been overcome since the development of the parametric dislocation dynamics (PDD) model [12-15]. In the PDD, dislocations are represented as spatial curves connected through dislocation nodes. PDD and other DD simulations provide a promising way to link the microstructural and the macroscopic properties of materials. One interesting application is to use DD to explain the strain hardening of materials. Although progress has been made by PDD and general DD methods to explain the nature of plastic deformation [16-18], there are several challenges that limit their extensive utilization. First, in order to get preventative macro-scale plastic behaviors, the collective behaviors of large dislocation systems must be considered. The typical size scales are microns, which may contain tens of thousands of dislocations. This presents a huge demand for computational power that cannot be fulfilled by a single processor. Parallel simulation, in which a large problem is divided into small pieces and solved by different individual processors, is naturally selected as an alternative approach. Dislocation Computational Algorithms Dislocations have long-range elastic interactions such that all the dislocation neighbors have to be taken into account for the interaction calculation. This makes the computational burden scale O (N2 ), which becomes prohibitive for large N. To reduce this computational complexity, a “cut-off distance” in the simulation with long-range force fields is employed; however, it is known to produce spurious results. Much effort has been put into developing algorithms that diminish that drawback, and many of the
  • 16. 9 essential methods, developed independently by a number of groups, are based on the concept of a hierarchical tree [9]. While different methods of constructing tree structures and handling interaction potentials have been used, all of them share two important common characteristics. First, they utilize a hierarchical-tree data structure, and second, they directly compute the force on an individual particle from nearby particles while the force from remote particles is calculated by an approximate method. Among these methods are the Barnes–Hut (BH) method [19] and fast multipole method (FMM) [20]. The BH technique builds its data structure via a hierarchical subdivision of space into cubic cells (for 3-D problems), and an oct tree (quad tree for 2-D) is built. The subdivision of volume is repeated until the cells at the lowest level of the tree contain at most one or no particles. Each node of the tree in the BH method represents a physical volume of space, and the total mass and the center-of-mass within the volume is stored at the node. The BH method has a computational burden of O (NlnN). Far-field interactions are usually calculated with a low-order multipole expansion, though, often, just the zeroth- order ‘‘charge–charge’’ term is employed. The FMM method is based on the same oct- tree structure as the BH method, though with no restriction on having just one (or no) particles in the leaf nodes. With an efficient management of the interactions, however, FMM reduces computational effort to O (N). In addition to this computational efficiency, the errors in the FMM can be reduced to machine accuracy by keeping enough terms in multipole expansions. A multipole acceptance criteria is employed to determine what kind of interaction (direct or approximate) should be included for each particle. Parallel formulations for both the BH and FMM methods have been developed [21-24], and it is
  • 17. 10 possible to apply a similar hierarchical strategy to DD simulations. However, another important challenge for DD simulations is that dislocations are curvilinear defects that run through a lattice. This makes DD simulation more complicated than particle problems. While we can divide space into subdivisions, connectivity between dislocation segments has to be maintained if the dislocation occurs across the boundaries of these subdivisions. In the meantime, dislocation systems are heterogeneous—i.e., there are regions of high dislocation density and regions that are essentially empty of dislocations. This makes it difficult to divide the system into sub-systems with similar problem sizes for parallel simulations in order to maintain the load balance for each working processor. Due to these challenges, very few parallel DD simulations have been implemented. One well-known implementation, called DD3D, in which dislocations are treated as nodes connected through straight line segments, was developed by Lawrence Livermore National Laboratory (LLNL) [25]. In the original implementation, one node may have more than two arms due to the annihilation of two or more segments. This may create complex microstructures that can be artificial according to their topological rules. In later modeling implementations, improvements were made such that each dislocation retains its identity even in strong interactions and its topological configuration is fully controlled by the force acting on the dislocation without defined rules, which are more physically based. Domain Decomposition and Paralleled Implementation In order to implement the complex calculations required for various dislocation dynamics systems, it is crucial for the software package to utilize a large number of
  • 18. 11 processors efficiently in parallel. To date, an efficient usage of 1500 has been demonstrated [26]. In such an implementation, all processors are treated equally during the simulation. In other words, there is no distinction—such as “master” versus “slaves”—between the processors. The primary objective of DD parallelization is to divide the simulation box (also referred to as the domain), representing the physical volume that contains the dislocation system, into different sub-domains, and to solve the equation of motions of dislocation particles (DPs) in each domain independently on a single processor, as shown in Figure 1. This way, communications are mostly local. That is, each processor can obtain most of the information it needs by communicating with its nearest neighbors. As the result, the amount of work for each processor is reduced and the speed of calculation improved. Figure 1. Decomposition of total simulation space into 3 ×3 ×2 domains along x, y, z axes. Dislocation microstructures can be highly heterogeneous and some processors may contain a lot more nodes than others [26]; hence, dividing the total domain into equally sized and/or shaped sub-domains may lead to severe load imbalances. As a matter of fact, the only two requirements are that the DPs in each sub-domain should be as close as
  • 19. 12 possible and that each sub-domain has a similar number of DPs [9]. The first criterion ensures that the closest neighbors of most dislocation DPs reside on the same processor so that they will not be transferred from a different processor when needed. This way, communication is minimized. The second criterion ensures that all processors have a similar amount of work so that load balancing is achieved for optimal performance. To reach a good load balance, it is important to perform data decomposition, as follows. The total simulation box is first divided into Nx domains along the x direction such that each domain contains an equal number of nodes. Each domain is then further divided along the y direction by Ny times, and the resulting number of domains is again divided along the z direction by Nz times. At the end, we obtain Nx ×Ny ×Nz domains, all containing the same number of nodes, as shown in Figure 1. However, because the dislocation structure evolves during the simulation, one needs to re-partition the problem among processors from time to time in order to maintain a good load balance. The optimal number of nodes per domain is in the range of 200 to 1000. In this case, the computational load on each processor is relatively light, while most of the computing time is still spent on computation instead of communication. If and when the total number of dislocation segments increases significantly (e.g., due to dislocation multiplication), it is usually helpful to stop and restart the simulation with more processors in order to maintain a reasonable simulation speed. Common DD Simulation Software To date, different dislocation dynamics simulation software packages have been developed. Beside the ParaDiS code that we used as the primary simulation tool, there are also other open-source software packages. MicroMegas, for example, also known as
  • 20. 13 ‘mM’, is a 3-D discrete dislocation dynamics program developed by the 'Laboratoire d'Etude des Microstructures', CNRS-ONERA, France. The software was released as freeware under the terms of the GNU General Public License (GPL) and published by the Free Software Foundation [27]. The MicroMegas code, another such package, is mainly used for the study of mono-crystalline metal plasticity and is based on the elastic theory of dislocations. Its source code is written in a mix of FORTRAN 90 and FORTRAN 95 and includes 18 source modules. The code can be used for DD simulation for HCP, BCC, and FCC systems [28]. Although simulation codes differ in their detailed structures and their work-flow, there are some basic features that the codes have in common. All of the simulation codes treat dislocations as discrete finite sets of degrees of freedom attached to line segments. Then, forces on the dislocation lines are estimated from the elastic theory of dislocations and the positions of the dislocation segments are updated according to a system- dependent equation of motion [29]. The dislocation configurations are usually represented by a set of curved dislocation lines that are typically discretized into a succession of straight segments, marked by the start and end positions of the segments on a discrete simulation lattice. The degrees of freedom are defined by the position, length, and velocity of the segments. The entire dislocation dynamics simulation is based on the time integration for the motion of dislocation segments in well-defined crystallographic directions at a steady velocity. The simulation lattices are usually chosen with periodic boundary and dimensions comparable to volume attached to Gauss points in FE methods. As a result, DD simulations function like a numerical coarse-graining method through the connection between discrete and continuous descriptions of plastic flow.
  • 21. 14 CHAPTER III METHODOLOGY Design In preparation for the input parameters for the ParaDiS simulations, we carried out a series of first-principle simulations based on the density function theory (DFT) for bulk Ni system with various S substitution sites under pressures of 0, 15, and 30 GPa. We chose 1 S atom substitution at the center for the 12.5% S doped Ni system, and 2 S atoms at the center and corner for the 25% S doped Ni System. The simulation results of the shear moduli and Poisson’s Ratios (Table 1) used as ParaDiS input parameters are discussed below. Table 1. Shear Moduli and Poisson’s Ratios of Simulation Results Pressure Model Shear modulus (GPa) Poisson's Ratio 0 GPa Pure 102.9 0.3226 12.5% 75.24 0.3533 25% 22.76 0.4486 15 GPa Pure 124.45 0.3331 12.5% 94.36 0.3624 25% 34.72 0.4427 30 GPa Pure 142.35 0.3404 12.5% 112.6 0.3654 25% 48.42 0.4350
  • 22. 15 Figure 2. Initial dislocation setup for 1×1×1 µm3 lattice Cubic FCC Nickel System in ParaDiS. We then employed the Parallel Dislocation Simulator (ParaDiS) package as the primary tool to carry out our dislocation dynamics simulation study. The ParaDiS code is, first, compiled on the QB machine LONI clusters using Intel, GNU, and PGI C compilers. The initial dislocation configuration was then generated through the ParaDiSGen utility tool. A cubic system of 1×1×1 µm3 lattice was chosen as the initial simulation scale setup. We also applied the FCC0 mobility module as the dynamics functions for the dislocation development. The FCC_0 mobility module attempts to simulate easy glide, with its glide plane limited to one of the [111] planes in the FCC materials. Since no crystallographic information was used in the dislocation core reactions, junction formation could take place even slightly off the zone axis. By default, this module automatically enables the <enableCrossSlip> control parameter, which allows dislocations to cross-slip to new glide planes.
  • 23. 16 Moreover, we analyzed the original ParaDiS source code and explored different code optimization approaches to reduce program run time. The results of these different optimization techniques were compared using un-optimized code. Resources Previously known as Dislocation Dynamics in 3 Dimensions (DD3d), the ParaDiS code is a free software package for large-scale dislocation dynamics simulations. It was developed by Lawrence Livermore National Laboratory as a key part of the multi-scale ASC Physics and Engineering Models effort to model material strength. The ParaDiS code is written primarily in C with some C++ interface for real-time 3D plotting display capability. The implementation of MPI libraries makes it a mighty tool for parallel computing [26]. In addition, the ParaDiS package is based on a non-singular continuum theory of dislocations that allows for accurate numerical calculations of dislocations in terms of multiple dislocation nodes [30]. It is a massively parallel and specialized material physics code that enables study of the fundamental mechanisms of plasticity at the micro-structure dislocation level. In ParaDiS, dislocations are represented by nodes that are interconnected by straight segments [19], as shown in Figure 3. The junctions can be discretization nodes for representing a smooth line (e.g. node 1, 2, 3) or physical nodes where three dislocations meet (e.g. node 0).
  • 24. 17 Figure 3. Dislocation networks represented as a set of “nodes” (empty circles) interconnected by straight segments. Both discretization and physical nodes are treated on equal footing in ParaDiS—they have a common data structure and, essentially, the same equations of motion. Two nodes connected by an arm are referred to as neighbors, and an arbitrary number of neighbor nodes are allowed for a node. The arms between nodes represent straight dislocation segments and, hence, are associated with a Burgers vector. The Burgers vector on each arm is fixed until it is either destroyed or altered due to a dislocation reaction. Most of our simulation works run on the QB2 cluster of the LONI network. The Louisiana Optical Network Initiative (LONI) network is an advanced fiber optics network that runs throughout Louisiana. It connects major research universities in Louisiana, including Louisiana State University (LSU), Louisiana Tech University, LSU Health Sciences Center in New Orleans, LSU Health Sciences Center in Shreveport, Southern University, Tulane University, University of Louisiana at Lafayette and University of New Orleans, allowing greater collaboration on research that produces
  • 25. 18 results quickly and with great accuracy. LONI provides the most powerful distributed supercomputer resources available to any academic community, with 1~2 Petaflops of computational capacity [31]. As the core of LONI, the Queen Bee supercomputer system is located in the state Information Systems Building in downtown Baton Rouge. The original QB supercomputer was launched in 2007, with 50 Teraflops of computational capacity. It was later upgraded to 1500 Teraflop peak performance at the end of June 2014 and ranked as one of the Top 50 supercomputer systems in the world. The upgraded QB is known as QB2. It has 504 compute nodes with 960 NVIDIA Tesla K20x GPUs and over 10,000 Intel Xeon processing cores. Each node has two 10-core 2.8 GHz Xeon 64-bit processors and either 64 or 128 GB of memory. The QB2 cluster system also features 1/Gb sec. Ethernet management network and 10 Gb/ sec and 40 Gb/sec external connectivity, with a huge 2.8 PB Lustre file system [31]. Software Debugging During the compilation of the original ParaDiS code, there were a huge number of warnings. To prevent potential problems, it is recommended to fix these warnings rather than ignoring them. Some warnings are reported due to deprecated library calls, such as XKeycodeToKeysym(). The reason for this is that the original source code was written more than 5 years ago. Understandably, the code has fallen behind current compiler standards that suggest the use of new library calls for enhancement and bug fixes. A simple solution is to use new library calls to substitute the deprecated ones, with the inclusion of corresponding library files. Noticeable warnings also occur when return
  • 26. 19 values of certain functions—such as fscanf()—are ignored. Such warnings are fixed by placing these functions into if statements, which catch the return value of function calls in the event of failure. There are other warnings reporting a deprecated conversion from string constants to 'char*'. The reason for this is that in standard C, a string liberal is usually treated as a pointer to a constant char array, and any attempt to modify it risks a program crash. A simple solution is to replace ‘char *’ with ‘const char *’ for the reason discussed above. Besides these warnings, the ParaDiS failed to start and encountered a segmentation fault. The most likely reason was memory allocation issues, and this error was eliminated after setting all dynamically allocated pointers to NULL after the free() function calls. Initial attempts to run ParaDiS through PBS script job submission also failed, however, a direct mpirun command works. The reason is that the default makefile setup enables X- Window for plotting a real-time 3D dislocation dynamics box. Nevertheless, this feature is currently unavailable for jobs that run through PBS submission. A work-around was achieved by disabling the X-window plot capacity in the makefile configurations. Preliminary Wall Time Analysis Prior to optimization, preliminary wall time analysis was needed to determine how much time each module/operation takes. Detailed timing information was obtained from the timer files of the ParaDiS output data files. Noticeably, the nodal force and cell charge computations occupied an extremely high portion of the program execution wall time. The reason for this is that both modules were computationally intensive. In the Nodal force module, there was a huge number of vector and matrix calculations, such as
  • 27. 20 inner product of two matrices, cross product of two matrices, matrices multiplication, matrices transpose, inverse matrices calculations, etc. Due to accuracy concerns, most of the variables used in these matrices calculations were double-precision floating-point values. The arithmetic calculations of these floating point values, especially multiplication and division, took a substantially longer time than the integer values. The cell charge module implemented a nested 6 layer for the loop due to the nature of the Cartesian coordinate point notation, which resulted in the complexity of problem to N6 for the variable size of N. Table 2. Time occupancy of different modules in ParaDiS code Module Name Time/s TOTAL TIME NODAL FORCE CELL CHARGE SEND COL TOPO CHANGES SEGMENT FORCE COMM SPLIT MULTI-NODES COMM SEND GHOSTS COMM SEND VELOCITY HANDLE COLLISIONS GENERATE ALL OUTPUT NODAL VELOCITY 2245.238 1028.745 473.881 284.737 158.863 138.982 71.835 34.603 25.802 8.032 3.871
  • 28. 21 CHAPTER IV OPTIMIZATION Loop Unrolling Original ParaDiS code implements a huge amount of for loops. One approach that aims to optimize such program’s execution speed is loop unrolling, which reduces the number of loops by replacing original statements with a repeated sequence of similar independent statements [32], as a result of which the number of jump and branch instructions was reduced, which made the loops faster. If the statements inside the loop were independent, they could also benefit from the parallel execution with compiler and processor support. The disadvantages of loop unrolling were also taken into account. The additional sequential lines of code leads to an increase in source code size and reduced readability. Allocation of extra temporary variables also causes increased register usage and, if unrolled loop value is not within the scope of program memory allocation, instruction cache misses will be more likely to occur. In practice, this approach is helpful for loops for very simple functions with variable iterations that are considerably large (N>100000). Regarding the original ParaDiS source code, since single loops (un-nested loops) with huge iterations are not widely used, the benefit of loop unrolling is marginal.
  • 29. 22 SIMD Implementations Single instruction, multiple data (SIMD) input allows a computer to perform the same operation on multiple data simultaneously. Two improvements are achieved with the implementation of SIMD instructions. First is the capability to load data as small blocks instead of individual variables. This reduces data loading and retrieving time significantly compared with traditional CPU. Another advantage is that a single instruction can be applied to all of the data elements in a loaded block of data at the same time. These advantages make SIMD Instructions extremely helpful for vector calculations or vectorization processing. SIMD instructions require both hardware support from the processor and software support from the compiler. Traditional ways to implement SIMD instructions are to write the program in assembly language or to insert assembly code into standard C/C++ source codes. Later, intrinsics were developed as pre-defined functions by the compiler that directly mapped to a sequence of assembly language instructions [33]. No calling linkage is required for intrinsic functions since they were built in the compiler. Unlike assembly language, intrinsics allows the user to write SIMD code with a C/C++ interface without concerns about register names and allocations because the compiler handles this process automatically. Most modern C/C++ compilers, such as Intel, GNU, or Visual Studio support intrinsics and have intrinsic functions built-in. The capability of SIMD instructions advances with the development of processor architecture. This is because the maximum amounts of data that can be loaded to registers depend on the number of registers and register size. Early SIMD instruction (SSE
  • 30. 23 streaming SIMD extensions) only handled a block of 4 floating-point values due to the size limits of their 128 bit XMM registers. Current AVX (advanced vector extensions) SIMD library is much more powerful due to the increased register size from 128 to 256 bit in the processor architectural design, which allows the processor to load or compute a block of 4 double-precision floating point values or 8 single-precision floating point values at one time [34]. As ParaDiS is designed to solve simulation problems in a three-dimensional cubic space, there are large number of double precision floating point vector calculations with notation of Cartesian coordinates, such as matrix inner products, matrix cross products, matrix multiplications, and normalization of matrices. Vector calculations like these can be accelerated using SIMD instructions [35]. Write Buffer Buffer is a temporary storage region of memory space for data transfer. Since access speed is much faster in memory compared to hard disks, saving a block of data into the buffer then writing it back takes substantially shorter time than writing directly to a hard disk. The number of I/O request will also be substantially reduced. Further Implementation of OpenMP OpenMP (Open Multi-Processing) is a widely used compiler directive tool for multiprocessing programming on a variety of shared-memory super computers. It was supported by C/C++ and FORTRAN programming languages and has been implemented in many compilers, including Visual C++, Intel compiler, GNU compiler, and the Portland Group compiler. It allows parallel implementation through multithreading,
Further Implementation of OpenMP

OpenMP (Open Multi-Processing) is a widely used compiler-directive interface for multiprocessing programming on a variety of shared-memory supercomputers. It is supported by the C/C++ and Fortran programming languages and has been implemented in many compilers, including Visual C++, the Intel compiler, the GNU compiler, and the Portland Group compiler. It enables parallel execution through multithreading, whereby a master thread forks a specified number of slave threads and each slave thread executes its share of the divided work independently [4]. The OpenMP mode is a new feature in ParaDiS 2.5.1, but it is only preliminarily implemented and not fully supported. Extending the OpenMP implementation to more of the time-consuming modules of ParaDiS may therefore yield further performance improvements through multithreading.
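The following minimal C sketch, which uses a hypothetical per-segment force array rather than the actual ParaDiS data structures, shows the kind of loop-level parallelism OpenMP provides: the master thread forks a team, the iterations are divided among the threads, and a reduction clause safely combines the per-thread partial sums. It is compiled with -fopenmp (GNU) or -qopenmp (Intel).

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double force[N];
        double total = 0.0;
        int i;

        /* Each thread works on an independent chunk of iterations; the
           reduction clause combines the per-thread partial sums. */
        #pragma omp parallel for reduction(+:total)
        for (i = 0; i < N; i++) {
            force[i] = 0.5 * i;      /* stand-in for a per-segment force term */
            total += force[i];
        }

        printf("threads available: %d, total = %f\n",
               omp_get_max_threads(), total);
        return 0;
    }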
CHAPTER V RESULTS

Multi-core Performances

Since ParaDiS implements MPI for parallelism, several test runs of the un-optimized ParaDiS code were carried out as a control test and benchmark reference. Each QB2 computation node contains two Intel Ivy Bridge Xeon 10-core processors, which provides a maximum of 20 cores for computation per node. We tested performance on 1, 2, 4, 8, 16, and 32 nodes, with all 20 processor cores enabled on each node.

Figure 4. Speedups of the original ParaDiS code on different numbers of nodes (total execution time in seconds versus number of nodes).

As shown in Figure 4, the total execution time of ParaDiS gradually decreases
when the number of computational nodes increases from 1 to 8 (20 to 160 cores); the total execution time falls from 2480 to 2194 seconds, a 13% improvement. When the number of nodes exceeds 8, however, the total execution takes much longer because the communication time between the nodes grows significantly. At 32 nodes there is no benefit at all: the cost of communication makes the run even slower than single-node execution. The performance gain from adding nodes depends strongly on the scale and complexity of the problem input. For the simulation setup in this research, 8 nodes was the optimal choice for solving the dislocation problem, as this configuration finished in the shortest computation time. Besides the accumulated communication time and the non-parallel sections of the ParaDiS code, another critical factor kept ParaDiS from near-linear performance scaling: the QB2 parallel cluster is implemented in a loosely coupled scheme.

Figure 5. Efficiency of the different optimization approaches (total execution time in seconds for the original code and for the loop unrolling, write buffer, AVX, and OpenMP versions).
Each QB2 node runs an independent, autonomous operating system, and computational resources such as RAM, cache, and buses are shared only within a node.

Effectiveness of Optimizations

ParaDiS runtimes with the different optimization approaches are compared directly with the un-optimized version running on a single node. As shown in Figure 5, the different optimization methods yield remarkably different performance results. In theory, loop unrolling can produce noticeable gains for single loops with a large number of iterations, but in the ParaDiS code the number of such loops is quite limited. Loop unrolling is made even less effective by the dependency chains in the actual single loops of ParaDiS, where one function call relies on the results of previous statements. These two factors made loop unrolling counter-productive for our code optimization. The write buffer design brings only a marginal performance improvement because the writing frequency of ParaDiS is not intensive: with the default settings, data logging writes files every 200 cycles, at roughly 5-second intervals. The output files are also quite small by today's standards, the largest being in the range of 10 to 20 KB. Compared with the approaches above, optimization with SIMD intrinsic functions produced a solid performance gain of 16.9%. The SIMD intrinsic function calls parallelize the vector calculations at the instruction level,
which is important for ParaDiS because its most time-consuming module relies on floating-point vector calculations. The extended implementation of OpenMP adds further parallelism through multithreading on top of the existing MPI implementation. This optimization brings a noticeable 7% reduction in execution time compared with the original ParaDiS code. The improvement is limited for a couple of reasons. The most important is data dependency, which prevents certain threads from executing in parallel. Load-balancing and synchronization overhead can also reduce the final speedup when using OpenMP.

Association with Experimental Results

Our primary purpose in using the ParaDiS code was to provide a theoretical dislocation dynamics explanation for our ductility study of sulfur-doped multicrystalline Ni under high pressure. We prepared S-doped Ni samples with S concentrations of 7%, 11%, 14%, and 20% by ball milling. These samples were then characterized at the synchrotron X-ray facility at Lawrence Berkeley National Laboratory under pressures up to ~30 GPa at ambient temperature. As shown in Figure 6, the texture analysis of our X-ray diffraction experiments indicates that the 14% S-doped Ni specimen exhibited the most ductility under high pressure (~30 GPa) compared with the 7%, 11%, and 20% specimens. Although the ParaDiS simulations assume an isotropic bulk system at the micron level, the previous literature suggests that texture patterns at high pressure are similar over the 20 nm to 500 nm particle range for FCC metals. It is therefore reasonable to use the ParaDiS dislocation dynamics results to
explain our experimental specimens at the ~40 nm scale.

Figure 6. Inverse pole figures of S-doped nickel samples with 7%, 11%, 14%, and 20% S concentrations along the normal direction (ND) under compression. Equal-area projection and a linear scale are used.

At a high pressure of ~30 GPa, the 12.5% S-doped Ni system showed the highest dislocation densities and dislocation velocities along the different directions, in agreement with our earlier texture results, in which the 14% S sample showed the strongest texture at a high pressure of 26.3 GPa (Figures 7 and 8). According to Ashby [36], plastic deformation under compression induces an increase in dislocation density, resulting in strain hardening, which makes ductile materials stronger.
Figure 7. Dislocation density of S-doped Ni under different pressures in the ParaDiS simulations.

Figure 8. Dislocation density along different directions for S-doped Ni under different pressures in the ParaDiS simulations.
Summary

The original ParaDiS code gained substantial performance when the number of nodes was increased from 1 to 8. A maximum speedup of 13% was achieved for the dislocation systems solved in this work when the number of nodes reached 8, for a total of 160 cores. Among the optimization approaches we attempted, the SIMD implementation gave the most noticeable speedup, 16.9%, by using the AVX intrinsic function libraries to accelerate the vector computations. The loop-unrolling method had a counter-productive effect because of the limited use of suitable single loops in the ParaDiS code. The write buffer optimization was of little help in practice because of the comparatively low writing frequency and the small size of the output logging files. The extended implementation of OpenMP in more ParaDiS modules brought a 7% performance increase through multithreading, indicating that current parallel programs can be improved with a hybrid implementation of MPI and OpenMP. The results of our ParaDiS dislocation dynamics simulations give strong theoretical support to our earlier experiments. We conclude that high dislocation density and high dislocation velocity are the major reasons for the high ductility observed in our experiments.

Future Work

Since the original code was written in C with a small portion in C++, there is no object-oriented design; the data structures and the functions that operate on them are kept separate, and several functions are defined repeatedly throughout the source code, making the code less
readable and less maintainable. The standard C language also lacks advanced libraries and interfaces for handling strings, vectors, unions, and similar constructs. For these reasons, rewriting the entire source code in C++ with an object-oriented design, replacing the separate structures and functions with classes, would be preferable. The emerging general-purpose GPU (GPGPU) approach is also a good candidate for accelerating ParaDiS, because a GPU has a large number of cores optimized for SIMD-style algorithms. To gain the full benefit of the GPU, one needs to write the code in NVIDIA CUDA C/C++, an extension of C/C++ designed to run on the GPU; this approach is challenging because the thread management scheme on a GPU differs from that on a CPU. A simpler, though less powerful, alternative is to use OpenACC directives to accelerate parallel computation through a hybrid use of the CPU and GPU.
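As a rough sketch of the directive-based alternative mentioned above (hypothetical arrays, not existing ParaDiS code), a single OpenACC pragma asks the compiler to offload a loop to an accelerator when one is available; an OpenACC compiler such as the PGI/NVIDIA HPC compiler with the -acc flag is assumed.

    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* The directive asks the compiler to offload the loop to an
           accelerator if one is available; otherwise it runs on the CPU. */
        #pragma acc parallel loop copyin(a, b) copyout(c)
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[10] = %f\n", c[10]);
        return 0;
    }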
REFERENCES

[1] Wikipedia. Tianhe-2 Supercomputer. 2014: http://en.wikipedia.org/wiki/Tianhe-2.
[2] Hager, G. and G. Wellein, Introduction to High Performance Computing for Scientists and Engineers. 1st ed. 2010: CRC Press. p. 95-103.
[3] Gropp, W., et al., A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 1996. 22(6): p. 789-828.
[4] Dagum, L. and R. Menon, OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 1998. 5(1): p. 46-55.
[5] Motz, C. and D. Dunstan, Observation of the critical thickness phenomenon in dislocation dynamics simulation of microbeam bending. Acta Materialia, 2012. 60(4): p. 1603-1609.
[6] Senger, J., et al., Dislocation microstructure evolution in cyclically twisted microsamples: a discrete dislocation dynamics simulation. Modelling and Simulation in Materials Science and Engineering, 2011. 19(7): p. 74-104.
[7] Zhou, C. and R. LeSar, Dislocation dynamics simulations of plasticity in polycrystalline thin films. International Journal of Plasticity, 2012. 30: p. 185-201.
[8] Vattré, A., et al., Modelling crystal plasticity by 3D dislocation dynamics and the finite element method: the discrete-continuous model revisited. Journal of the Mechanics and Physics of Solids, 2014. 63: p. 491-505.
[9] Wang, Z., et al., A parallel algorithm for 3D dislocation dynamics. Journal of Computational Physics, 2006. 219(2): p. 608-621.
[10] Rhee, M., et al., Dislocation stress fields for dynamic codes using anisotropic elasticity: methodology and analysis. Materials Science and Engineering: A, 2001. 309: p. 288-293.
[11] Schwarz, K., Simulation of dislocations on the mesoscopic scale. I. Methods and examples. Journal of Applied Physics, 1999. 85(1): p. 108-119.
[12] Ghoniem, N.M., J. Huang, and Z. Wang, Affine covariant-contravariant vector forms for the elastic field of parametric dislocations in isotropic crystals. Philosophical Magazine Letters, 2002. 82(2): p. 55-63.
[13] Ghoniem, N., S.-H. Tong, and L. Sun, Parametric dislocation dynamics: a thermodynamics-based approach to investigations of mesoscopic plastic deformation. Physical Review B, 2000. 61(2): p. 913-927.
[14] Beneš, M., et al., A parametric simulation method for discrete dislocation dynamics. The European Physical Journal-Special Topics, 2009. 177(1): p. 177-191.
[15] El-Awady, J.A., S.B. Biner, and N.M. Ghoniem, A self-consistent boundary element, parametric dislocation dynamics formulation of plastic flow in finite volumes. Journal of the Mechanics and Physics of Solids, 2008. 56(5): p. 2019-2035.
[16] Bulatov, V.V., Crystal plasticity from dislocation dynamics, in Materials Issues for Generation IV Systems. 2008, Springer. p. 275-284.
[17] Ghoniem, N. and J. Huang, Computer simulations of mesoscopic plastic deformation with differential geometric forms for the elastic field of parametric dislocations: Review of recent progress. Le Journal de Physique IV, 2001. 11(PR5): p. 53-60.
[18] Wang, Z., et al., Dislocation motion in thin Cu foils: a comparison between computer simulations and experiment. Acta Materialia, 2004. 52(6): p. 1535-1542.
[19] Dehnen, W., A hierarchical O(N) force calculation algorithm. Journal of Computational Physics, 2002. 179(1): p. 27-42.
[20] Greengard, L. and V. Rokhlin, A fast algorithm for particle simulations. Journal of Computational Physics, 1997. 135(2): p. 280-292.
[21] Amor, M., et al., A data parallel formulation of the Barnes-Hut method for n-body simulations, in Applied Parallel Computing. New Paradigms for HPC in Industry and Academia. 2001, Springer. p. 342-349.
[22] Singh, J.P., et al., Load balancing and data locality in adaptive hierarchical N-body methods: Barnes-Hut, fast multipole, and radiosity. Journal of Parallel and Distributed Computing, 1995. 27(2): p. 118-141.
[23] Grama, A., V. Kumar, and A.H. Sameh. N-body simulations using message passing parallel computers. in PPSC. 1995.
[24] Grama, A., V. Kumar, and A. Sameh, Scalable parallel formulations of the Barnes-Hut method for n-body simulations. Parallel Computing, 1998. 24(5): p. 797-822.
  • 43. 36 [26] Cai, W., et al., Massively-parallel dislocation dynamics simulations. in IUTAM Symposium on Mesoscopic Dynamics of Fracture Process and Materials Strength. 2004. [27] MicroMegas Introduction: http://zig.onera.fr/mm_home_page/. [28] Code: microMegas, a 3-D DDD (Discrete Dislocation Dynamics) simulations : https://icme.hpc.msstate.edu/mediawiki/index.php/Code:_microMegas. [29] Devincre, B., et al., Modeling crystal plasticity with dislocation dynamics simulations: The ‘microMegas’ code. Mechanics of Nano-objects. Presses de l'Ecole des Mines de Paris, Paris, 2011: p. 81-100. [30] Cai, W., et.al., A Non-singular Continuum Theory of Dislocations. J. Mech. Phys. Solids, 2006. 54(3): p. 561-587. [31] LONI OVERVIEW. 2014: http://www.loni.org/. [32] Huang, J.-C. and T. Leng. Generalized loop-unrolling: a method for program speedup. in Application-Specific Systems and Software Engineering and Technology, 1999. ASSET'99. Proceedings. 1999 IEEE Symposium on. 1999. IEEE. [33] MSDN. Introduction to SIMD Intrinsics: https://msdn.microsoft.com/en- us/library/26td21ds(v=vs.90).aspx. [34] Firasta, N., et al., Intel avx: New frontiers in performance improvements and energy efficiency. Intel white paper, 2008. [35] Eichenberger, A.E., P. Wu, and K. O'brien. Vectorization for SIMD architectures with alignment constraints. in ACM SIGPLAN Notices. 2004. ACM.
APPENDIX A

PARADIS JOB SUBMISSION PBS SCRIPT

    #!/bin/bash
    #PBS -q workq
    #PBS -A loni_mat_bio6
    #PBS -l nodes=8:ppn=20
    #PBS -l walltime=03:00:00
    #PBS -o /home/cguo1987/worka/output
    #PBS -j oe
    #PBS -N Cheng_Paradis
    #PBS -X
    #
    cd /home/cguo1987/worka/pub-ParaDiS.v2.5.1
    #
    mpirun -np 20 -machinefile $PBS_NODEFILE bin/paradis tests/Nickel.ctrl
    #
    # Mark the time processing ends.
    #
    date
    #
    # And we're out'a here!
    #
    exit 0
APPENDIX B

SIMD VECTOR AND MATRICES COMPUTATION FUNCTIONS

The functions below use the AVX intrinsics declared in <immintrin.h>.

    #include <immintrin.h>

Calculation of the cross product of two vectors (vc = va x vb)

    void cross(double va[3], double vb[3], double vc[3])
    {
        __m256d a, b, c, ea, eb;
        double tmp[4] = {0.0, 0.0, 0.0, 0.0};

        _mm256_zeroall();
        /* lane order (low to high): va[1], va[2], va[0], 0 and vb[2], vb[0], vb[1], 0 */
        ea = _mm256_set_pd(0, va[0], va[2], va[1]);
        eb = _mm256_set_pd(0, vb[1], vb[0], vb[2]);
        /* lane order (low to high): va[2], va[0], va[1], 0 and vb[1], vb[2], vb[0], 0 */
        a = _mm256_set_pd(0, va[1], va[0], va[2]);
        b = _mm256_set_pd(0, vb[0], vb[2], vb[1]);
        c = _mm256_sub_pd(_mm256_mul_pd(ea, eb), _mm256_mul_pd(a, b));
        _mm256_storeu_pd(tmp, c);
        vc[0] = tmp[0];
        vc[1] = tmp[1];
        vc[2] = tmp[2];
    }

Calculation of the inner product of two vectors

    double inner(double a[3], double b[3])
    {
        double fres = 0.0;
        double ftmp[4] = {0.0, 0.0, 0.0, 0.0};
        __m256d mres;

        /* Only the three valid elements are loaded (the fourth lane is padded
           with zero) so that no memory beyond the 3-element arrays is read. */
        mres = _mm256_mul_pd(_mm256_set_pd(0.0, a[2], a[1], a[0]),
                             _mm256_set_pd(0.0, b[2], b[1], b[0]));
        _mm256_storeu_pd(ftmp, mres);
        fres = ftmp[0] + ftmp[1] + ftmp[2];
        return fres;
    }
Multiplication of a 3 x 3 matrix by a 3-element vector (y = A x)

    void Matrix33Vector3Multiply(double A[3][3], double x[3], double y[3])
    {
        double tmp[4];
        __m256d A0 = _mm256_set_pd(0, A[2][0], A[1][0], A[0][0]);   /* column 0 */
        __m256d A1 = _mm256_set_pd(0, A[2][1], A[1][1], A[0][1]);   /* column 1 */
        __m256d A2 = _mm256_set_pd(0, A[2][2], A[1][2], A[0][2]);   /* column 2 */
        __m256d x0 = _mm256_set_pd(0, x[0], x[0], x[0]);            /* broadcast x[0] */
        __m256d x1 = _mm256_set_pd(0, x[1], x[1], x[1]);
        __m256d x2 = _mm256_set_pd(0, x[2], x[2], x[2]);
        __m256d res = _mm256_add_pd(_mm256_mul_pd(A0, x0), _mm256_mul_pd(A1, x1));

        res = _mm256_add_pd(res, _mm256_mul_pd(A2, x2));
        _mm256_storeu_pd(tmp, res);
        y[0] = tmp[0];
        y[1] = tmp[1];
        y[2] = tmp[2];
        return;
    }

Multiplication of a 3 x 1 matrix by a 3-element vector (outer product, producing a 3 x 3 matrix)

    void Matrix31Vector3Mult(double mat[3], double vec[3], double result[3][3])
    {
        double tmp[3][4];
        __m256d v0 = _mm256_set_pd(mat[0], mat[0], mat[0], mat[0]);  /* broadcast mat[0] */
        __m256d v1 = _mm256_set_pd(mat[1], mat[1], mat[1], mat[1]);
        __m256d v2 = _mm256_set_pd(mat[2], mat[2], mat[2], mat[2]);
        __m256d rows0 = _mm256_set_pd(0, vec[2], vec[1], vec[0]);
        __m256d rows1 = _mm256_set_pd(0, vec[2], vec[1], vec[0]);
        __m256d rows2 = _mm256_set_pd(0, vec[2], vec[1], vec[0]);
        __m256d prod0 = _mm256_mul_pd(v0, rows0);
        __m256d prod1 = _mm256_mul_pd(v1, rows1);
        __m256d prod2 = _mm256_mul_pd(v2, rows2);

        _mm256_storeu_pd(tmp[0], prod0);
        _mm256_storeu_pd(tmp[1], prod1);
        _mm256_storeu_pd(tmp[2], prod2);
        result[0][0] = tmp[0][0]; result[0][1] = tmp[0][1]; result[0][2] = tmp[0][2];
        result[1][0] = tmp[1][0]; result[1][1] = tmp[1][1]; result[1][2] = tmp[1][2];
        result[2][0] = tmp[2][0]; result[2][1] = tmp[2][1]; result[2][2] = tmp[2][2];
        return;
    }
Multiply a 3 x 3 matrix by a 3-element vector (AVX version)

    void AVX_Matrix33Vector3Multiply(double A[3][3], double x[3], double y[3])
    {
        double tmp[4];
        __m256d A0 = _mm256_set_pd(0, A[2][0], A[1][0], A[0][0]);   /* column 0 */
        __m256d A1 = _mm256_set_pd(0, A[2][1], A[1][1], A[0][1]);   /* column 1 */
        __m256d A2 = _mm256_set_pd(0, A[2][2], A[1][2], A[0][2]);   /* column 2 */
        __m256d x0 = _mm256_set_pd(0, x[0], x[0], x[0]);            /* broadcast x[0] */
        __m256d x1 = _mm256_set_pd(0, x[1], x[1], x[1]);
        __m256d x2 = _mm256_set_pd(0, x[2], x[2], x[2]);
        __m256d res = _mm256_add_pd(_mm256_mul_pd(A0, x0), _mm256_mul_pd(A1, x1));

        res = _mm256_add_pd(res, _mm256_mul_pd(A2, x2));
        _mm256_storeu_pd(tmp, res);
        y[0] = tmp[0];
        y[1] = tmp[1];
        y[2] = tmp[2];
        return;
    }
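A small illustrative driver, which is not part of the original appendix, can be used to check the AVX routines above against their scalar definitions; it assumes this file is compiled together with the functions above and with AVX enabled (e.g., gcc -mavx).

    #include <stdio.h>

    void   cross(double va[3], double vb[3], double vc[3]);
    double inner(double a[3], double b[3]);

    int main(void)
    {
        double a[3] = {1.0, 2.0, 3.0};
        double b[3] = {4.0, 5.0, 6.0};
        double c[3];

        cross(a, b, c);                              /* expected: (-3, 6, -3) */
        printf("cross = %g %g %g\n", c[0], c[1], c[2]);
        printf("inner = %g (expected 32)\n", inner(a, b));
        return 0;
    }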