Parallelization and Optimization of Time-Domain Electronic Scattering
Simulation Software
Re-Design of the Wright-Patterson Max3d Software
Michael T. Patterson
CJ Suchyta
Silicon Graphics, Inc
03/10/99
1. Overview
The Wright-Patterson Air Force Base Max3d software program has been re-designed from the ground up.
The re-design has been performed to meet four objectives: modernizing the code, improving key aspects of
the code, improving the single CPU performance and scalability of the code, and incorporating debugging
and visualization features into the code.
The target platform for execution of the Max3d code is the Silicon Graphics Origin 2000 and its
successors. On this platform, the re-designed Max3d code displays a 5x single CPU performance
improvement, and presently a 2x to 3x scalability improvement as compared to the original version of the
Max3d code. These figures are problem dependent and system-load dependent, but we fully anticipate
that with completion of this effort, the new version of the Max3d code will consistently run 10x to 20x
faster in elapsed time than its original version.
This re-design incorporates the Fortran 90, MPI, IEEE Floating-Point, SHMEM, and OpenMP (quasi-)
standards into a single piece of software. All key aspects of the re-designed code are improved.
Performance of the re-designed software is greatly enhanced, debugging efforts are greatly reduced, and
adaptability of the code to emerging technologies is improved. Memory requirements have decreased by
10% with this new version of the Max3d code. Results are unchanged with respect to the original version
of the code.
2. Objectives
We have effectively re-designed the Max3d program from the ground up. The intent of the re-design was
to meet four objectives, as noted below. These four objectives are not presented in any order of
importance; rather, they are of equal value and generally complement one another.
Our four objectives in re-designing the Max3d code were:
1. Modernize the code:
a. Upgrade the original Fortran 77 code to Fortran 90
b. Enable use of modern compiler technology
c. Prepare for the MPI 2.0 standard
d. Incorporate the IEEE Floating-Point Standard
e. Allow for continual updates of each of the above
2. Improve Key Aspects of the Code
a. Reliability: reduce or eliminate unexplained crashes
b. Maintainability: make the code easier to understand & modify
c. Upgradability: allow for new algorithms and NUMA road-map
3. Improve the Single-CPU Performance and MPP Scalability:
a. Convert from vector-based code to microprocessor-based code
b. Convert from synchronous to asynchronous message passing
c. Convert from handshake to one-sided message passing
4. Incorporate Debugging and Visualization Features as Appropriate
a. Use debugging features to ease error tracking
b. Use visualization to monitor code’s execution and convergence
c. Use visualization for further physical insights derived from the code
In the following text, each of the above objectives will be addressed in order.
2.1 Objective 1: Modernize the Code
The Max3d software was originally written in the Fortran 77 language for the vector architecture. Later,
analysts added an MPI domain decomposition to the code. Never before has the code been fully optimized
for the microprocessor-based architecture.
The HPC market and technology have evolved rapidly during the life of the Max3d software. We have
sought in this effort to modernize the software with the following goals:
a. Upgrade the original Fortran 77 code to Fortran 90
b. Enable use of modern compiler technology
c. Prepare for the MPI 2.0 standard
d. Incorporate the IEEE Floating-Point Standard
e. Allow for continual updates of each of the above
We have achieved these goals, and have attempted to do so in such a manner that they might be addressed
continually. That is, as software technology evolves, we anticipate that the Max3d code will follow.
2.2 Objective 2: Improve Key Aspects of the Code
In re-designing the Max3d code, three key aspects of the code have been selected as the most important
code features needing improvement. Again, the self-complementary aspects are listed in no particular
order:
a. Reliability: reduce or eliminate unexplained crashes
b. Maintainability: make the code easier to understand & modify
c. Upgradability: allow for new algorithms and NUMA road-map
If these three aspects were to be condensed into a single objective, that objective would read, “Make the
code as easy as possible to understand, modify, and use.” Such a code has fewer bugs and more easily
allows for the incorporation of new ideas and computing techniques.
Surprisingly, perhaps, the code’s performance is not listed above as a key aspect of the code. Performance
has been addressed; it is listed as objective number three. Even so, performance has thus far received
little attention compared to the listed key aspects. As those key aspects have improved over time,
performance has improved along with them.
The design strategy used here recognizes that, with MPP codes, the largest performance gains are
obtained at the algorithm level, not at the loop level. Thus, large performance gains can only be obtained
once the algorithm is clearly defined, and therefore easily recognized, within the code. With definition
and recognition, the algorithm becomes much easier to modify.
Let us consider the three key aspects in order.
2.2.1 Reliability: reduce or eliminate unexplained crashes
This aspect simply indicates that a concerted effort has been made to eliminate bugs within the code that,
depending on the compiler, system load, or input data set, could provoke the code to crash. Many of these
bugs are subtle, particularly within message passing sections of the code.
With just a few exceptions, the message passing sections of the Max3d code have been entirely rewritten.
The few exceptions are slated for rewrites too, as time allows. The rewrite of these sections has
incorporated numerous changes for clarity and correctness. A number of subtle bugs existed within the
original version of the Max3d code and these have (hopefully) all been removed. Further, an extensive
error-checking module (errors_mod) has been added to the Max3d code. Every call to the MPI library
returns a status value; suspect values are examined by the error-checking module.
Two types of tests are performed during the initiation stages of the program. In the first test, the domain
decomposition grid indices are checked for various errors. In the second test, the validity of the “update”
module is examined. That is, a prescribed data set is sent to the update module for transfer of all ghost
point data. The returned data is compared against the analytic solution. These tests provide early
assurance that the code will perform properly for the given input data set.
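The first of these tests can be illustrated with a short sketch. This report does not list the specific checks that Max3d performs, so the checks below are an assumption: that every sub-domain lies inside the global grid, that no two sub-domains overlap, and that together they account for every grid point. Python is used only for illustration (Max3d itself is Fortran 90), the function name check_decomposition is hypothetical, and the index ranges are copied from the sample decomposition listed with Figure 2.

```python
from math import prod

GLOBAL_EXTENT = (73, 61, 97)  # i_max_global, j_max_global, k_max_global

# (begin, end) index triplets for ranks 0..7, copied from the Figure 2 listing.
SUBDOMAINS = [
    ((1, 1, 1), (73, 31, 24)),
    ((1, 1, 25), (73, 31, 49)),
    ((1, 1, 50), (73, 31, 73)),
    ((1, 1, 74), (73, 31, 97)),
    ((1, 32, 1), (73, 61, 24)),
    ((1, 32, 25), (73, 61, 49)),
    ((1, 32, 50), (73, 61, 73)),
    ((1, 32, 74), (73, 61, 97)),
]


def check_decomposition(global_extent, subdomains):
    """Return a list of error messages; an empty list means the check passed."""
    errors = []
    total_points = 0
    for rank, (beg, end) in enumerate(subdomains):
        # Every index range must be ordered and lie inside the global grid.
        for axis in range(3):
            if not 1 <= beg[axis] <= end[axis] <= global_extent[axis]:
                errors.append(f"rank {rank}: bad index range on axis {axis}")
        total_points += prod(e - b + 1 for b, e in zip(beg, end))
    # Two boxes overlap only if their index ranges intersect on all three axes.
    for a in range(len(subdomains)):
        for b in range(a + 1, len(subdomains)):
            (a_beg, a_end), (b_beg, b_end) = subdomains[a], subdomains[b]
            if all(a_beg[x] <= b_end[x] and b_beg[x] <= a_end[x] for x in range(3)):
                errors.append(f"ranks {a} and {b} overlap")
    # In-range, non-overlapping boxes that hold every point tile the grid exactly.
    if total_points != prod(global_extent):
        errors.append("sub-domains do not cover the global grid exactly")
    return errors


if __name__ == "__main__":
    print(check_decomposition(GLOBAL_EXTENT, SUBDOMAINS))
```

Returning a list of messages rather than stopping at the first failure lets one run report every problem found in the decomposition.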
Other potential errors have been addressed, including the use of uninitialized data, and the over-indexing
of arrays. Calls to IEEE floating-point intrinsics test for underflows, overflows, and divide-by-zero. The
Max3d code is able to change the processor behavior, again by using IEEE floating-point intrinsics, upon
encountering any of these instances.
The potential error sources are not yet eliminated in full, but constant progress has been made. As
evidence of this, the MIPSpro f90 compiler can now compile the code at the -O3 optimization level and
provide correct results.
2.2.2 Maintainability: make the code easier to understand & modify
This aspect indicates that, as we have investigated and re-designed the Max3d code, we have added
documentation to the code, and relevant information to its output. The original code was already
modular, but we have increased the modularity (see aspect “d” below).
The goal in improving this aspect of the Max3d code is to allow other analysts, developers, or users to
quickly learn the code. While the new code contains more lines of code and is broken into numerous files
(modules), the structure and purpose of the individual units are more readily apparent.
Figure 1 presents the structure of the code. The code now consists of the main routine, program Max3d,
and fifteen modules. No individual subroutines, functions, or common blocks remain in the code; they
have all been replaced with modules. In Figure 1, the main program and its modules are grouped into
three layers: the main routine at the top of the figure, ten primary modules grouped together in the center
of the figure, and five secondary modules, which appear in the lower part of the figure.
The main routine is the top level of the organization. The primary modules perform all of the steps of the
solution algorithm, and the secondary modules support the primary modules. A quick overview of each
module, its name and purpose, follows.
main (max3d): The main routine performs only a few functions. It USEs all data modules to ensure that
their contents are treated as static data, and it invokes the primary modules.
In alphabetical order, the primary modules are:
commun_mod: This module creates a few MPI communicators which are used during the
solution procedure.
decomp_mod: This module initializes the MPI library, decomposes the solution domain,
discovers each process’s neighbors, and runs some sanity checks on the domain decomposition.
dtypes_mod: This module creates the derived data types used in the Max3d code.
errors_mod: This module monitors the MPI library for run-time errors.
indice_mod: This module sets up the MPI indices required by Max3d.
metric_mod: This module constructs and stores the grid coordinates and the metrics of the grid
transformation.
rcs_io_mod: This module performs the RCS calculation and i/o.
solver_mod: This module contains the solution kernel routines.
swatch_mod: This module initializes the stopwatch module.
update_mod: This module performs the ghost point update, passing the necessary data from one
process to another.
In alphabetical order, the secondary modules are:
ieeefp_mod: This module tests and implements IEEE floating-point behavior under a variety of
circumstances.
inputs_mod: This module holds the general inputs for the Max3d code.
params_mod: This module contains all parameters required for the Max3d code that can be set by
the user.
s_open_mod: This module performs i/o related functions, such as naming and opening files.
stopwatch: This module is a publicly available module which performs execution timings much
like a stopwatch.
Each of the modules can be thought of as an independent program. Each defines its own data
environment, to as high a degree as is possible, and calls its own supporting subroutines. This
modularization of the Max3d code attempts, within a Fortran 90 framework, to support object-oriented
program construction, and data encapsulation.
2.3 Objective 3: Improve the Single-CPU Performance and MPP Scalability
Performance improvements to the Max3d code have been made only secondarily to date; changes to the
code have instead been made primarily in support of the other three objectives. Performance of the code
has improved regardless, because the other three objectives lend themselves to improving the code
performance.
The sub-objectives listed earlier for Objective 3 are:
a. Convert from vector-based code to microprocessor-based code
b. Convert from synchronous to asynchronous message passing
c. Convert from handshake to one-sided message passing
Let us examine each. The original code was a vector algorithm ported to a cache-based microprocessor.
Single CPU performance was therefore known to be low, primarily because of a low data cache hit rate.
The single CPU performance of either the original or revised code is known only approximately. The
code is not currently capable of running with only a single domain; single CPU runs are thus not possible.
Our standard test case consists of the following grid:
i_max_global = 73
j_max_global = 61
k_max_global = 97
which is divided into a (1,2,4) triplet of sub-domains. Presently, performance numbers for the original
and revised versions of the Max3d code are as follows:
           Original     New
cpu           11607    2154   seconds
elapsed        1469     281   seconds
memory           40      36   Mbytes
We have thus experienced an approximate 5x improvement in the overall (single CPU and scalability)
performance, and again a 5x improvement in the elapsed time. Memory usage has been reduced by
approximately 10%. These numbers were obtained on an internally owned Origin 2000, which housed
195 MHz R10k processors. Please note that none of our performance numbers have been obtained on a
dedicated machine; they are thus subject to variations.
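The ratios quoted above follow directly from the measured values. The short Python sketch below repeats the arithmetic; the function names are illustrative and the input numbers are the ones reported in this section.

```python
def speedup(original, new):
    """Ratio of the original time to the new time."""
    return original / new


def reduction_percent(original, new):
    """Percentage by which a quantity has decreased."""
    return 100.0 * (original - new) / original


if __name__ == "__main__":
    print(f"cpu speedup:      {speedup(11607, 2154):.2f}x")
    print(f"elapsed speedup:  {speedup(1469, 281):.2f}x")
    print(f"memory reduction: {reduction_percent(40, 36):.1f}%")
```

Both timing ratios come out slightly above 5, and the memory reduction is 10%.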
To improve the single CPU performance, we applied typical, well-documented techniques.
In particular, we have removed scratch arrays or at least reduced their rank. We have exchanged indices
on the dependent variable arrays such that the number of dependent variables (six) is cycled as the first
index of the arrays, thus greatly improving locality in data cache. A number of unnecessary operations
have been removed from the code, and many operations have been reordered to improve the cache hit rate.
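The benefit of the index exchange can be seen by computing storage offsets. Fortran stores arrays in column-major order, so the first index varies fastest in memory. The Python sketch below is an illustration under stated assumptions: it assumes the variable index was last in the original layout (the report does not say where it was), it uses the sample problem's padded extents (76, 34, 28), and the helper names are hypothetical.

```python
def column_major_offset(index, shape):
    """Element offset of a 0-based index in Fortran (column-major) storage."""
    offset, stride = 0, 1
    for i, n in zip(index, shape):
        offset += i * stride
        stride *= n
    return offset


NVAR = 6                 # number of dependent variables
NI, NJ, NK = 76, 34, 28  # i_box, j_box, k_box of the sample problem


def component_offsets(var_first, point):
    """Offsets of the six dependent variables stored at one grid point."""
    i, j, k = point
    if var_first:
        shape = (NVAR, NI, NJ, NK)
        return [column_major_offset((v, i, j, k), shape) for v in range(NVAR)]
    shape = (NI, NJ, NK, NVAR)
    return [column_major_offset((i, j, k, v), shape) for v in range(NVAR)]


if __name__ == "__main__":
    print(component_offsets(True, (10, 5, 3)))   # six consecutive offsets
    print(component_offsets(False, (10, 5, 3)))  # offsets NI*NJ*NK apart
```

With the variable index first, the six values at a grid point are adjacent in memory and can share a cache line. With the variable index last, consecutive variables are a full grid volume (76 x 34 x 28 = 72352 elements) apart.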
The scalability of the code is improved, although this effort remains a work in progress. The largest
change made to the code in this effort has been a complete re-write of the technique used to update the
ghost point data. The original Max3d code performed this task in the “updt” routine; the new version of
the code performs the task in the “update” module.
The original update technique used asynchronous MPI calls to pass all ghost point data, but each of these
calls was immediately followed by an MPI_Wait call. Many of the MPI_Wait calls were then immediately
followed by an MPI_Barrier call. In the kernel code that precedes the two-stage Runge-Kutta update step,
the old version of the Max3d code required 24 MPI_Wait calls and three MPI_Barrier calls. Thus, the
original code was effectively performing its data passing in a synchronous manner: this practice greatly
restricts scalability.
The new version of the Max3d code has removed most of the aforementioned waits and barriers. Our goal
is for each stage of the two-stage Runge-Kutta algorithm to require only two MPI_Wait calls: one before
the Runge-Kutta update, and one after. We believe this to be possible and have nearly achieved that goal.
The kernel routines fxi, geta, and hzeta that precede the Runge-Kutta update step now require,
collectively, only one MPI_Wait call and no MPI_Barrier calls. Routines supporting these three routines
still contain additional MPI_Wait calls but we expect that we can remove each of those waits also.
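An idealized cost model shows why removing the intermediate waits matters. The Python sketch below is not a measurement of Max3d and makes no MPI calls. It assumes hypothetical transfer times of one time unit each, and it assumes that transfers posted together overlap completely, which real transfers sharing an interconnect will not do. The second figure is therefore a best case.

```python
def wait_after_each(transfer_times):
    """Each transfer completes before the next is posted, so the costs add up."""
    return sum(transfer_times)


def post_all_then_wait(transfer_times):
    """All transfers are posted first and completed by a single wait.

    Under the full-overlap assumption, the cost is that of the slowest transfer.
    """
    return max(transfer_times)


if __name__ == "__main__":
    transfers = [1.0] * 24  # 24 transfers, matching the 24 waits of the old kernel
    print("wait after each transfer:", wait_after_each(transfers))
    print("post all, then wait once:", post_all_then_wait(transfers))
```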
We continue to work on the code that follows the update step. This code and the message passing code
that immediately precedes the update step will likely be replaced with one-sided message passing code.
The update step itself can be considered the focal point of the algorithm: all processes must synchronize
before that point, and they must synchronize again afterward before the next stage of the algorithm can
begin. Thus, message passing in the immediate temporal locality of the update step must be highly
optimized to prevent wait time. We anticipate replacing the existing MPI calls in this section of the code
with Silicon Graphics’ SHMEM library calls.
We have also greatly reduced the amount of data that is passed between processes. Figures 2 through 6
present the domain decomposition model and sample code output regarding the decomposition. Figure 6,
in particular, presents a sub-domain and its lower z-axis ghost point sub-domains. The lower z-axis
ghost points are housed in four edge sub-domains, four corner sub-domains, and one mid-plane
sub-domain.
This partitioning of ghost point data was employed in the original MPI port of the Max3d code, and we
have retained its use. While not a particularly efficient methodology, the partitioning method allows for
maximum flexibility of the algorithm. This partitioning strategy also allowed us to make an important
change to the code that reduced the amount of data passed between processes by 5/6.
In the original version of the Max3d code, the fxi, geta, and hzeta routines each called the “updt” routine
to update *all* of their ghost point data to neighboring processes. This was largely unnecessary. The fxi
routine, for example, needs to update only the ghost point data at the lower x-axis of each sub-domain.
Likewise, the geta routine needs to update only the ghost point data at the lower y-axis of each sub-
domain, and the hzeta routine the lower z-axis data.
We have incorporated this optimization into the Max3d code, thereby reducing by 5/6 the amount of data
which must be passed between processes.
Lastly, we have incorporated into the Max3d code a second model, or level, of parallelism. This new
model is in fact very new: few researchers have employed a two-layer model of parallelism, but we feel
that the Max3d code, particularly when executed on Origin 2000 systems, is a perfect candidate for this
new model.
In this model, MPI is used in its usual sense, providing a domain decomposition and message passing
between the sub-domains. Now, a second layer of parallelism is added to the code: we use the OpenMP
library to distribute the work of the kernel loops to two processors per sub-domain. Here is how this two-
layer model of parallelism works, and why it works so well on the Origin 2000.
First, one needs to recall that the Origin 2000 hardware consists of nodes. Each node houses two MIPS
processors: each processor owns a separate data cache but the two processors share one memory. Thus, if
we instruct MPI to create eight sub-domains for our solution, we will employ eight processors running on
four nodes to obtain the solution.
For the two-layer model of parallelism, we first instruct the IRIX operating system to distribute the eight
sub-domains to distinct nodes, so that we are using only one CPU per node. Next, we employ OpenMP
directives to distribute the work of the important kernel loops to two processors per sub-domain. Thus,
each sub-domain is operated upon by two processors rather than just one. The sub-domain fits entirely
within each node’s memory, so the two processors on the node act as a two CPU SMP to compute the
solution for that sub-domain.
In effect, this two-layer model of parallelism allows us to double the scalability of our solution beyond that
which MPI alone can provide. Consider a problem, which, using only MPI, scales to 8, 16, and then 32
CPUs, but bottoms out in scalability at 32 CPUs. The two-layer model of parallelism allows us to scale
this solution immediately to 64 CPUs. Our tests with this two-layer model of parallelism indicate that the
second layer will provide approximately a 1.6x elapsed time speedup; further testing is certainly
warranted, and our performance numbers are suspect since they have not been obtained on a dedicated
machine.
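The figures in this paragraph can be put in context with a little arithmetic. The Python sketch below computes the CPU count of the two-layer model, the parallel efficiency implied by a 1.6x speedup on two processors, and the parallel fraction that Amdahl's law would associate with that speedup. Reading the measurement through Amdahl's law is an interpretation added here, not an analysis from the original study, and the function names are illustrative.

```python
def total_cpus(mpi_processes, threads_per_process):
    """CPUs used when each MPI sub-domain is worked on by several threads."""
    return mpi_processes * threads_per_process


def parallel_efficiency(speedup, threads):
    """Fraction of the ideal speedup that was actually obtained."""
    return speedup / threads


def amdahl_parallel_fraction(speedup, threads):
    """Solve Amdahl's law, speedup = 1 / ((1 - p) + p / threads), for p."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / threads)


if __name__ == "__main__":
    print("CPUs:", total_cpus(32, 2))
    print("efficiency:", parallel_efficiency(1.6, 2))
    print("parallel fraction:", amdahl_parallel_fraction(1.6, 2))
```

A 1.6x speedup on two processors corresponds to an efficiency of about 0.8 and, under Amdahl's law, to roughly three quarters of the elapsed time lying in the loops that OpenMP distributes.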
Many, many source code changes have been applied to the Max3d code. The “ChangeLog” file that
accompanies the source code describes in detail all of these changes, and their impact on the code’s
performance.
2.4 Objective 4: Incorporate Debugging and Visualization Features as
Appropriate
The debugging and run-time error testing of MPP codes remains troublesome. The MPI standard itself
offers few methods by which errors can be detected or trapped. One reason for the lack of error detection
can be readily explained.
The Max3d code, for example, now uses asynchronous MPI calls to perform all of its data passing. If an
error should occur within the MPI library, it will likely occur long after the call itself to the MPI library
was made. The Max3d code will have progressed far from the asynchronous MPI call before the error
occurs. Thus, even if the MPI library can detect that an error has occurred, it has no reasonable means
with which to report it back to the calling code.
Regardless, we test the returned error status of every MPI call. We also instruct the MIPSpro f90 compiler
to examine each call to ensure that its arguments are correct in type, kind and rank.
As noted earlier, we also perform two types of tests during the initialization stage of the program. In the
first test, the domain decomposition grid indices are examined for various errors. In the second test, the
validity of the “update” module for the prescribed domain decomposition is verified. These tests provide
early assurance that the code will perform properly for the given input data set.
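The second test can be sketched in one dimension. The Python code below is a simplified stand-in, not the Max3d update module: plain list copies take the place of message passing, the prescribed field is an arbitrary function of the global index, and all names are hypothetical. The ghost widths (1 low, 2 high) and the point counts (24, 25, 24, 24) are those of the sample problem's k-direction.

```python
GHOST_LO, GHOST_HI = 1, 2  # ghost points at the low and high ends


def analytic(g):
    """Prescribed test field; any function of the global index would do."""
    return 3.0 * g + 1.0


def build_subdomains(counts):
    """One padded array per sub-domain: interior set analytically, ghosts unset."""
    subs, begin = [], 1
    for n in counts:
        interior = [analytic(g) for g in range(begin, begin + n)]
        data = [None] * GHOST_LO + interior + [None] * GHOST_HI
        subs.append({"begin": begin, "n": n, "data": data})
        begin += n
    return subs


def update_ghosts(subs):
    """Fill ghost cells from the neighbors' interior points."""
    for r, sub in enumerate(subs):
        if r > 0:  # the last interior points of the lower neighbor
            lo = subs[r - 1]
            sub["data"][:GHOST_LO] = lo["data"][lo["n"]:lo["n"] + GHOST_LO]
        if r < len(subs) - 1:  # the first interior points of the upper neighbor
            hi = subs[r + 1]
            sub["data"][GHOST_LO + sub["n"]:] = hi["data"][GHOST_LO:GHOST_LO + GHOST_HI]


def count_mismatches(subs, n_global):
    """Count cells inside the global grid that are unset or differ from the field."""
    bad = 0
    for sub in subs:
        for local, value in enumerate(sub["data"]):
            g = sub["begin"] + local - GHOST_LO  # global index of this cell
            if 1 <= g <= n_global and (value is None or value != analytic(g)):
                bad += 1
    return bad


if __name__ == "__main__":
    counts = [24, 25, 24, 24]
    subs = build_subdomains(counts)
    update_ghosts(subs)
    print("mismatches after update:", count_mismatches(subs, sum(counts)))
```

Because the expected value of every ghost cell is known from its global index, any error in the send and receive indices shows up as a mismatch before the solver starts.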
3. Data Encapsulation
Much thought was put into this aspect of the code, as this aspect strongly affects our ability to achieve the
stated objectives. Concerns with data encapsulation prompted a large percentage of the many changes to
the Max3d code.
Data encapsulation, and the related topics of data association, data inheritance, and scope, limit the
accessibility of data to the various parts of a program. Limiting the accessibility of data is generally a
good thing: it facilitates the understanding, debugging, and maintenance of a code.
With MPP codes, data encapsulation is even more critical.
1. Each process of the MPP job executes along a separate path through the code. The presence of these
multiple paths in the code complicates development, debugging, and comprehension of the code.
2. Many additional variables are required to govern execution along the separate threads. These
variables are required to identify the processes, to govern their execution paths, and to limit their
scope of operations.
The data encapsulation methodology employed in this code attempts to alleviate much of the confusion
related to all of the additional variables. How bad is this confusion? Consider the i-indices that govern
loops through the grid topology. Rather than having just an i-index that runs from one to IMAX, an MPP
code will have indices similar to:
i_max_global    Extent of the global domain
i_max_local     Extent of the local domain
i_skin_lo       Number of ghost points at the lower index range
i_skin_hi       Number of ghost points at the upper index range
i_box           The local domain plus the ghost points
i_num_procs     Number of processes along the i-direction
Other i-index variables may be associated with various mappings, derived data types, and boundary
conditions. Associated variables include neighbors’ ranks, and send & receive request statuses.
In the original version of the Max3d code, all of the code’s arrays, indices, status variables, etc. were
placed into common. Thus, most of the code’s variables had global scope. To reduce the confusion
associated with these many variables, several conventions are adopted in the Max3d code:
1. Variable naming conventions.
2. Use of parameters, rather than variables, whenever possible.
3. No use of common blocks; modules are used instead.
4. Stack-based (local to subroutines) variable storage as allowable.
5. USEd variables are read-only. USEing variables is very similar to associating by common and has all
the usual dangers. Here, we take the stand that anything USEd is to be considered read-only:
parameters and unchanged variables are safely USE-associated.
These conventions serve several purposes:
1. We want the reader to easily grasp the flow of the program. The main program simply invokes the
primary modules, which in turn perform simple tasks.
2. We want the reader to be able to easily learn where a given variable is set.
3. We want to ensure that a given array is set in its entirety.
4. We want to ensure that a given array, once set, is not reset by mistake.
Our method of data encapsulation is quite straightforward. Variables within a module are classified as
USEd, defined, or local. All USEd variables appear in a USE statement within the module. These
variables are presumed to have already been set to valid values, and they are not allowed to be reset by the
module.
Defined variables appear in the declaration section of the module. Their declaration will contain the
PUBLIC attribute. These variables will be set, in entirety if arrays, within the module, and will later be
USEd by other modules.
All other variables are local to the module. They will have, by default, a PRIVATE attribute within their
declaration.
At present, compilers cannot check that our convention is adhered to. However, other developers have
determined that this convention is a powerful one and have submitted requests to the Fortran 2000
standards committee to support the practice. Their requests have asked that a new declaration attribute be
added to Fortran 2000 to support this convention. We have submitted our request also, and have asked
that Silicon Graphics support the request.
Our data encapsulation method makes the code much easier to read and maintain. Variables are set only
in the module within which they are declared, and other modules that USE those variables must use them
in a read-only mode. Tracing the use of data in the Max3d code is thus much easier now.
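The convention of setting a variable once in its owning module and treating it as read-only everywhere else can be illustrated outside Fortran. The Python sketch below is an analogy only, not a rendering of the Max3d modules. The class DefinedData and its define method are hypothetical, and Python checks the rule at run time, whereas the text above concerns a compile-time check.

```python
class DefinedData:
    """Holds one module's 'defined' variables: set once, read-only afterwards."""

    def __init__(self, owner):
        # Bypass our own __setattr__, which rejects all ordinary assignments.
        object.__setattr__(self, "_owner", owner)
        object.__setattr__(self, "_values", {})

    def define(self, name, value):
        """Called by the owning module to set a variable exactly once."""
        if name in self._values:
            raise AttributeError(f"{name} is already set in {self._owner}")
        # Lists are stored as tuples so users cannot change elements in place.
        self._values[name] = tuple(value) if isinstance(value, list) else value

    def __getattr__(self, name):
        # Reached only for names not found normally, i.e. the defined variables.
        try:
            return self._values[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        raise AttributeError(f"{name} is read-only outside {self._owner}")


if __name__ == "__main__":
    indices = DefinedData("indice_mod")
    indices.define("i_max_global", 73)
    print(indices.i_max_global)
    try:
        indices.i_max_global = 80  # a "USEing" module attempts a reset
    except AttributeError as err:
        print("rejected:", err)
```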
4. Features to be Added to the Code
In spite of the many advances already made with the Max3d code, many more can be made. Various
features of the MPI-2 standard are becoming available on the Origin 2000 platform and can be
incorporated into the Max3d code. Some Fortran 95 and Fortran 2000 features will be relevant to the
code also.
Error checking of the MPI library calls remains troublesome. Several logic additions to the Max3d code
can be made to better facilitate error detection and, possibly, correction. A number of additional
optimization and nomenclature standardization additions are noted within the Max3d source code and
ChangeLog file.
A few restrictions to the domain decomposition would provide for performance improvements and code
clarity. Several parts of the kernel routines remain poorly understood (on our part) and have thus not been
documented or optimized.
A run-time visualization system could be incorporated into the code to provide one-, two- or three-
dimensional graphical output of the evolving solution. Several third-party, freely available software
packages provide this capability. Some of these even allow solution steering, whereby a user can change
an input to a running solution.
Lastly, the Max3d program simply needs to be tested for a wide variety of test cases to examine its results,
stability, and performance. Certainly, such testing would expose additional or newly introduced bugs, and
would provide us with the performance envelope within which the code operates.
5. Summary
Our efforts with the Max3d code are ongoing. We have made great progress in optimizing this code for
the Silicon Graphics Origin 2000 platform, and we see opportunities with the follow-on products that will
succeed the Origin 2000.
Our hope in delivering this new product is that its users will find the new code useful to their efforts. The
code, while not fully optimized at this time, has reached a stage of maturity, and inspection of it is now
appropriate. We solicit user feedback regarding the direction we have taken with the code, and the
direction we foresee for it.
Already the new version of the code incorporates technologies that few other research groups have
mastered. Few, if any, research groups have combined the Fortran 90, MPI, IEEE Floating-Point,
SHMEM, and OpenMP (quasi-) standards into a single piece of software. The code, furthermore, is well
positioned to evolve with the Silicon Graphics NUMA architecture.
Main Program (max3d)
Primary modules: commun, decomp, dtypes, errors, indice, metric, rcs_io, solver, swatch, update
Secondary modules: ieeefp, inputs, params, s_open, stopwatch
Figure 1: Organization of the Max3d Code
Rank   Coordinates   Begin          End             Resident        Points
0/7    (0, 0, 0)     (1, 1, 1)      (73, 31, 24)    (73, 31, 24)    54312
1/7    (0, 0, 1)     (1, 1, 25)     (73, 31, 49)    (73, 31, 25)    56575
2/7    (0, 0, 2)     (1, 1, 50)     (73, 31, 73)    (73, 31, 24)    54312
3/7    (0, 0, 3)     (1, 1, 74)     (73, 31, 97)    (73, 31, 24)    54312
4/7    (0, 1, 0)     (1, 32, 1)     (73, 61, 24)    (73, 30, 24)    52560
5/7    (0, 1, 1)     (1, 32, 25)    (73, 61, 49)    (73, 30, 25)    54750
6/7    (0, 1, 2)     (1, 32, 50)    (73, 61, 73)    (73, 30, 24)    52560
7/7    (0, 1, 3)     (1, 32, 74)    (73, 61, 97)    (73, 30, 24)    52560
i_max_global = 73 i_max_local = 73
j_max_global = 61 j_max_local = 31
k_max_global = 97 k_max_local = 25
i_ghost_lo = 1 i_ghost_hi = 2
j_ghost_lo = 1 j_ghost_hi = 2
k_ghost_lo = 1 k_ghost_hi = 2
i_box = 76 i_num_procs = 1
j_box = 34 j_num_procs = 2
k_box = 28 k_num_procs = 4
[Diagram: the global grid, spanning i = 1 to 73, j = 1 to 61, and k = 1 to 97, is cut at j = 31/32 and at
k = 24/25, 49/50, and 73/74 into the eight sub-domains numbered 0 through 7.]
Figure 2: Sample Domain Decomposition
I−indices
Consider an x−y plane for the sample problem. The
x−direction is not decomposed and thus consists of a
single domain. Because of its simplicity, the x−axis
provides a good starting point to describe the
various indices.
i_max_global = 73 = set by user = Global extent of domain
i_num_procs = 1 = set by user = Number of subdomains
i_ghost_lo = 1 = set by user = Ghost pts at low end
i_ghost_hi = 2 = set by user = Ghost pts at high end
i_max_local = 73 = 1 + (i_max_global − 1) / i_num_procs = Max locally owned points
i_box = 76 = i_ghost_lo + i_max_local + i_ghost_hi = Local + ghost points
my_indices(beg) = 1
my_indices(end) = 73
pts_on_proc = 73
at_ilo_face = true
at_ihi_face = true
i_local_beg = 2 = i_ghost_lo + 1
i_local_end = 74 = i_ghost_lo + pts_on_proc
ilo2 = 3 = i_local_beg + 1
ihim = 73 = i_local_end − 1
i_offset = −1 = mpi_my_indices(beg) − 1 − i_ghost_lo
i_rcv_ind(src) = 1
i_snd_ind(src) = 2 = 1 + i_ghost_lo
i_snd_ind(dst) = 74 = 1 + pts_on_proc ! this seems to be in error.
i_rcv_ind(dst) = 75 = 1 + i_ghost_lo + pts_on_proc
Figure 3: Example of i−indices
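The index formulas listed in Figure 3 can be checked against the sample problem. The Python sketch below assumes that the division in the i_max_local formula is Fortran integer division, and its function names are illustrative. The formulas themselves and the expected values come from Figures 2 and 3.

```python
def axis_indices(max_global, num_procs, ghost_lo=1, ghost_hi=2):
    """Per-axis extents derived from the user-set inputs."""
    # Integer division is assumed, as for Fortran integer operands.
    max_local = 1 + (max_global - 1) // num_procs
    box = ghost_lo + max_local + ghost_hi
    return max_local, box


def local_range(my_beg, pts_on_proc, ghost_lo=1):
    """Loop bounds and offset for the locally owned points of one axis."""
    local_beg = ghost_lo + 1
    local_end = ghost_lo + pts_on_proc
    offset = my_beg - 1 - ghost_lo
    return local_beg, local_end, offset


if __name__ == "__main__":
    axes = (("i", 73, 1), ("j", 61, 2), ("k", 97, 4))
    for name, max_global, num_procs in axes:
        max_local, box = axis_indices(max_global, num_procs)
        print(f"{name}_max_local = {max_local}  {name}_box = {box}")
    print(local_range(my_beg=1, pts_on_proc=73))
```

For the sample problem this gives local extents of 73, 31, and 25 and box sizes of 76, 34, and 28, in agreement with the listing that accompanies Figure 2.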
Figure 4: The Coordinate Planes of a Sub−Domain
Figure 5: A Y−Z Plane & Its Ghost Points
Figure 6: The Sub−Domains Which Comprise One Section of Ghost Points
(Legend: interior domain; ghost cells at lower end of z−axis)

Parallelization and Optimization of Time-Domain Electronic Scattering Simulation Software

  • 1. Parallelization and Optimization of Time-Domain Electronic Scattering Simulation Software Re-Design of the Wright-Patterson Max3d Software Michael T. Patterson CJ Suchyta Silicon Graphics, Inc 03/10/99 1. Overview The Wright-Patterson Air Force Base Max3d software program has been re-designed from the ground up. The re-design has been performed to meet four objectives: modernizing the code, improving key aspects of the code, improving the single CPU performance and scalability of the code, and incorporating debugging and visualization features into the code. The target platform for execution of the Max3d code is the Silicon Graphics’ Origin2000 and its successors. On this platform, the re-designed Max3d code displays a 5x single CPU performance improvement, and presently a 2x to 3x scalability improvement as compared to the original version of the Max3d code. These figures are problem dependent and system-load dependent, but we fully anticipate that with completion of this effort, the new version of the Max3d code will consistently run 10x to 20x faster in elapsed time than its original version. This re-design incorporates the Fortran 90, MPI, IEEE Floating-Point, SHMEM, and OpenMP (quasi-) standards into a single piece of software All key aspects of the re-designed code are improved. Performance of the re-designed software is greatly enhanced, debugging efforts are greatly reduced, and adaptability of the code to emerging technologies is improved. Memory requirements have decreased by 10% with this new version of the Max3d code. Results are unchanged with respect to the original version of the code. 2. Objectives We have effectively re-designed the Max3d program from the ground up. The intent of the re-design was to meet four objectives, as noted below. These four objectives are not presented in any order of importance; rather, they are of equal value and generally complement one another. Our four objectives in re-designing the Max3d code were: 1. 
Modernize the code: a. Upgrade the original Fortran 77 code to Fortran 90 b. Enable use of modern compiler technology c. Prepare for the MPI 2.0 standard d. Incorporate the IEEE Floating-Point Standard
  • 2. e. Allow for continual updates of each of the above 2. Improve Key Aspects of the Code a. Reliability: reduce or eliminate unexplained crashes b. Maintainability: make the code easier to understand & modify c. Upgradability: allow for new algorithms and NUMA road-map 3. Improve the Single-CPU Performance and MPP Scalability: a. Convert from vector-based code to microprocessor-based code b. Convert from synchronous to asynchronous message passing c. Convert from handshake to one-sided message passing 4. Incorporate Debugging and Visualization Features as Appropriate a. Use debugging features to ease error tracking b. Use visualization to monitor code’s execution and convergence c. Use visualization for further physical insights derived from the code In the following text, each of the above objectives will be addressed in order. 2.1 Objective 1: Modernize the Code The Max3d software was originally written in the Fortran 77 language for the vector architecture. Later, analysts added an MPI domain decomposition to the code. Never before has the code been fully optimized for the microprocessor-based architecture. The HPC market and technology have evolved rapidly during the life of the Max3d software. We have sought in this effort to modernize the software with the following goals: a. Upgrade the original Fortran 77 code to Fortran 90 b. Enable use of modern compiler technology c. Prepare for the MPI 2.0 standard d. Incorporate the IEEE Floating-Point Standard e. Allow for continual updates of each of the above We have achieved these goals, and have attempted to do so in such a manner that they might be addressed continually. That is, as software technology evolves, we anticipate that the Max3d code will follow. 2.2 Objective 2: Improve Key Aspects of the Code In re-designing the Max3d code, three key aspects of the code have been selected as the most important code features needing improvement. 
Again, the self-complementary aspects are listed in no particular order: a. Reliability: reduce or eliminate unexplained crashes b. Maintainability: make the code easier to understand & modify c. Upgradability: allow for new algorithms and NUMA road-map
  • 3. If these three aspects were to be condensed into a single objective, that objective would read, “Make the code as easy as possible to understand, modify, and use.” Such a code has fewer bugs and more easily allows for the incorporation of new ideas and computing techniques. Surprisingly, perhaps, the code’s performance is not listed above as a key aspect of the code. Performance has been addressed; it is listed as objective number three. Regardless, performance has received thus far little attention compared to the listed key aspects. But as the listed key aspects of the code have improved over time, performance has followed with corresponding improvements. The design strategy used here recognizes that, with MPP codes, the largest performance gains are obtained at the algorithm level, not at the loop level. Thus, large performance gains can only be obtained once the algorithm is clearly defined, and therefore easily recognized, within the code. With definition and recognition, the algorithm becomes much easier to modify. Let us consider the three key aspects in order. 2.2.1 Reliability: reduce or eliminate unexplained crashes. This aspect simply indicates that a concerted effort has been made to eliminate bugs within the code that, depending on the compiler, system load, or input data set, could provoke the code to crash. Many of these bugs are subtle, particularly within message passing sections of the code. With just a few exceptions, the message passing sections of the Max3d code have been entirely rewritten. The few exceptions are slated for rewrites too, as time allows. The rewrite of these sections has incorporated numerous changes for clarity and correctness. A number of subtle bugs existed within the original version of the Max3d code and these have (hopefully) all been removed. Further, an extensive error-checking module (errors_mod) has been added to the Max3d code. 
Every call to the MPI library returns a status value; suspect values are examined by the error-checking module. Two types of tests are performed during the initiation stages of the program. In the first test, the domain decomposition grid indices are checked for various errors. In the second test, the validity of the “update” module is examined. That is, a prescribed data set is sent to the update module for transfer of all ghost point data. The returned data is compared against the analytic solution. These tests provide early assurance that the code will perform properly for the given input data set. Other potential errors have been addressed, including the use of uninitialized data, and the over-indexing of arrays. Calls to IEEE floating-point intrinsics test for underflows, overflows, and divide-by-zero. The Max3d code is able to change the processor behavior, again by using IEEE floating-point intrinsics, upon encountering any of these instances. The potential error sources are not yet eliminated in full, but constant progress has been made. As evidence of this, the MIPSpro f90 compiler can now compile the code at the -O3 optimization level and provide correct results. 2.2.2 Maintainability: make the code easier to understand & modify This aspect indicates that, as we have investigated and re-designed the Max3d code, we have added documentation to the code, and relevant information to its output. The original code was already modular, but we have increased the modularity (see aspect “d” below). The goal in improving this aspect of the Max3d code is to allow other analysts, developers, or users to quickly learn the code. While the new code contains more lines of code and is broken into numerous files (modules), the structure and purpose of the individual units are more readily apparent. Figure 1 presents the structure of the code. The code now consists of the main routine, program Max3d, and fifteen modules. 
No individual subroutines, functions, or common blocks remain in the code; they have all been replaced with modules. In Figure 1, the main program and its modules are grouped into
  • 4. three layers: the main routine at the top of the figure, ten primary modules grouped together in the center of the figure, and five secondary modules, which appear in the lower part of the figure. The main routine is the top level of the organization. The primary modules perform all of the steps of the solution algorithm, and the secondary modules support the primary modules. A quick overview of each module, its name and purpose, follows. main (max3d): The main routine performs only a few functions. It USEs all data modules to ensure that their contents are treated as static data, and it invokes the primary modules. In alphabetical order, the primary modules are: commun_mod: This module creates a few MPI communicators which are used during the solution procedure. decomp_mod: This module initializes the MPI library, decomposes the solution domain, discovers each process’s neighbors, and runs some sanity checks on the domain decomposition. dtypes_mod: This module creates the derived data types used in the Max3d code. errors_mod: This module monitors the MPI library for run-time errors. indice_mod: This module sets up the MPI indices required by Max3d. metric_mod: This module constructs and stores the grid coordinates and the metrics of the grid transformation. rcs_io_mod: This module performs the RCS calculation and i/o. solver_mod: This module contains the solution kernel routines. swatch_mod: This module initializes the stopwatch module. update_mod: This module performs the ghost point update, passing the necessary data from one process to another. In alphabetical order, the secondary modules are: ieeefp_mod: This module tests and implements IEEE floating-point behavior under a variety of circumstances. inputs_mod: This module holds the general inputs for the Max3d code. params_mod: This module contains all parameters required for the Max3d code that can be set by the user. 
s_open_mod: This module performs i/o related functions, such as naming and opening files. stopwatch: This module is a publicly available module which performs execution timings much like a stopwatch. Each of the modules can be thought of as an independent program. Each defines its own data environment, to as high a degree as is possible, and calls its own supporting subroutines. This modularization of the Max3d code attempts, within a Fortran 90 framework, to support object-oriented program construction, and data encapsulation. 2.3 Objective 3: Improve the Single-CPU Performance and MPP Scalability Performance improvements to the Max3d code have been made only secondarily to date; changes to the code have instead been made primarily in support of the other three objectives. Performance of the code
  • 5. has improved regardless, because the other three objectives lend themselves to improving the code performance. The sub-objectives listed earlier for Objective 3 are: a. Convert from vector-based code to microprocessor-based code b. Convert from synchronous to asynchronous message passing c. Convert from handshake to one-sided message passing Let us examine each. The original code was a vector algorithm ported to a cache-based microprocessor. Thus single CPU performance was known to be low, this due primarily to a low data cache hit rate. The single CPU performance of either the original or revised code is known only approximately. The code is not currently capable of running with only a single domain; single CPU runs are thus not possible. Our standard test case consists of the following grid: i_max_global = 73 j_max_global = 61 k_max_global = 97 which is divided into a (1,2,4) triplet of sub-domains. Presently, performance numbers for the original and revised versions of the Max3d code are as follows: Original New cpu 11607 2154 seconds elapsed 1469 281 seconds memory 40 36 Mbytes We have thus experienced an approximate 5x improvement in the overall (single CPU and scalability) performance, and again a 5x improvement in the elapsed time. Memory usage has been reduced by approximately 10%. These numbers were obtained on an internally owned Origin 2000, which housed 195 MHz R10k processors. Please note that none of our performance numbers have been obtained on a dedicated machine; they are thus subject to variations. Techniques for improving the single CPU performance consist of the typical, well-documented techniques. In particular, we have removed scratch arrays or at least reduced their rank. We have exchanged indices on the dependent variable arrays such that the number of dependent variables (six) is cycled as the first index of the arrays, thus greatly improving locality in data cache. 
A number of unnecessary operations have been removed from the code, and many operations have been reordered to improve the cache hit rate. The scalability of the code is improved, although this effort remains a work in progress. The largest change made to the code in this effort has been a complete re-write of the technique used to update the ghost point data. The original Max3d code performed this task in the “updt” routine; the new version of the code performs the task in the “update” module. The original update technique used asynchronous MPI calls to pass all ghost point data, but each of these calls was immediately followed by an MPI_Wait call. Many of the MPI_Wait calls ware then immediately followed by an MPI_Barrier call. In the kernel code that precedes the two-stage Runge-Kutta update step, the old version of the Max3d code required 24 MPI_Wait calls and three MPI_Barrier calls. Thus, the original code was effectively performing its data passing in a synchronous manner: this practice greatly restricts scalability. The new version of the Max3d code has removed most of the aforementioned waits and barriers. Our goal is for each stage of the two-stage Runge-Kutta algorithm to require only two MPI_Wait calls: one before the Runge-Kutta update, and one after. We believe this to be possible and have nearly achieved that goal. The kernel routines fxi, geta, and hzeta that precede the Runge-Kutta update step now require,
  • 6. collectively, only one MPI_Wait call and no MPI_Barrier calls. Routines supporting these three routines still contain additional MPI_Wait calls but we expect that we can remove each of those waits also. We continue to work on the code that follows the update step. This code and the message passing code that immediately precedes the update step will likely be replaced with one-sided message passing code. The update step itself can be considered the focal point of the algorithm: all processes must synchronize before that point, and they must synchronize again afterward before the next stage of the algorithm can begin. Thus, message passing in the immediate temporal locality of the update step must be highly optimized to prevent wait time. We anticipate replacing the existing MPI calls in this section of the code with Silicon Graphics’ SHMEM library calls. We have also greatly reduced the amount of data that is passed between processes. Figures (2) through (6) present the domain decomposition model and sample code output regarding the decomposition. Figure six, in particular, presents a sub-domain and its lower z-axis ghost point sub-domains. The lower z-axis ghost points are housed in four edge sub-domains, four corner sub-domains, and one mid-plane sub- domain. This partitioning of ghost point data was employed in the original MPI port of the Max3d code, and we have retained its use. While not a particularly efficient methodology, the partitioning method allows for maximum flexibility of the algorithm. This partitioning strategy also allowed us to make an important change to the code that reduced the amount of data passed between processes by a factor of 5/6. In the original version of the Max3d code, the fxi, geta, and hzeta routines each called the “updt” routine to update *all* of their ghost point data to neighboring processes. This was largely unnecessary. 
The fxi routine, for example, needs to update only the ghost point data at the lower x-axis of each sub-domain. Likewise, the geta routine needs to update only the ghost point data at the lower y-axis of each sub- domain, and the hzeta routine the lower z-axis data. We have incorporated this optimization into the Max3d code, thereby reducing by 5/6 the amount of data which must be passed between processes. Lastly, we have incorporated into the Max3d code a second model, or level, of parallelism. This new model is in fact very new: few researchers have employed a two-layer model of parallelism, but we feel that the Max3d code, particularly when executed on Origin 2000 systems, is a perfect candidate for this new model. In this model, MPI is used in its usual sense, providing a domain decomposition and message passing between the sub-domains. Now, a second layer of parallelism is added to the code: we use the OpenMP library to distribute the work of the kernel loops to two processors per sub-domain. Here is how this two- layer model of parallelism works, and why it works so well on the Origin 2000. First, one needs to recall that the Origin 2000 hardware consists of nodes. Each node houses two MIPS processors: each processor owns a separate data cache but the two processors share one memory. Thus, if we instruct MPI to create eight sub-domains for our solution, we will employ eight processors running on four nodes to obtain the solution. For the two-layer model of parallelism, we first instruct the IRIX operating system to distribute the eight sub-domains to distinct nodes, so that we are using only one CPU per node. Next, we employ OpenMP directives to distribute the work of the important kernel loops to two processors per sub-domain. Thus, each sub-domain is operated upon by two processors rather than just one. The sub-domain fits entirely within each node’s memory, so the two processors on the node act as a two CPU SMP to compute the solution for that sub-domain. 
In effect, this two-layer model of parallelism allows us to double the scalability of our solution beyond that which MPI alone can provide. Consider a problem, which, using only MPI, scales to 8, 16, and then 32 CPUs, but bottoms out in scalability at 32 CPUs. The two-layer model of parallelism allows us to scale this solution immediately to 64 CPUs. Our tests with this two-layer model of parallelism indicate that the second layer will provide approximately a 1.6x elapsed time speedup; further testing is certainly
  • 7. warranted, and our performance numbers are suspect since they have not been obtained on a dedicated machine. Many, many source code changes have been applied to the Max3d code. The “ChangeLog” file that accompanies the source code describes in detail all of these changes, and their impact on the code’s performance. 2.4 Objective 4: Incorporate Debugging and Visualization Features as Appropriate The debugging and run-time error testing of MPP codes remains troublesome. The MPI standard itself offers few methods by which errors can be detected or trapped. One reason for the lack of error detection can be readily explained. The Max3d code, for example, now uses asynchronous MPI calls to perform all of its data passing. If an error should occur within the MPI library, it will likely occur long after the call itself to the MPI library was made. The Max3d code will have progressed far from the asynchronous MPI call before the error occurs. Thus, even if the MPI library can detect that an error has occurred, it has no reasonable means with which to report it back to the calling code. Regardless, we test the returned error status of every MPI call. We also instruct the MIPSpro f90 compiler to examine each call to ensure that its arguments are correct in type, kind and rank. As noted earlier, we also perform two types of tests during the initialization stage of the program. In the first test, the domain decomposition grid indices are examined for various errors. In the second test, the validity of the “update” module for the prescribed domain decomposition is verified. These tests provide early assurance that the code will perform properly for the given input data set. 3. Data Encapsulation Much thought was put into this aspect of the code, as this aspect strongly affects our ability to achieve the stated objectives. Concerns with data encapsulation prompted a large percentage of the many changes to the Max3d code. 
Data encapsulation, together with the related topics of data association, data inheritance, and scope, limits the accessibility of data to the various parts of a program. Limiting the accessibility of data is generally a good thing: it facilitates the understanding, debugging, and maintenance of a code. With MPP codes, data encapsulation is even more critical, for two reasons:

1. Each process of the MPP job executes along a separate path through the code. The presence of these multiple paths complicates development, debugging, and comprehension of the code.
2. Many additional variables are required to govern execution along the separate paths. These variables are required to identify the processes, to govern their execution paths, and to limit their scope of operations.

The data encapsulation methodology employed in this code attempts to alleviate much of the confusion related to all of the additional variables. How bad is this confusion? Consider the i-indices that govern loops through the grid topology. Rather than having just an i-index that runs from one to IMAX, an MPP code will have indices similar to:

i_max_global   Extent of the global domain
i_max_local    Extent of the local domain
i_skin_lo      Number of ghost points at the lower index range
i_skin_hi      Number of ghost points at the upper index range
i_box          The local domain plus the ghost points
i_num_procs    Number of processes along the i-direction

Other i-index variables may be associated with various mappings, derived data types, and boundary conditions. Associated variables include neighbors' ranks and send and receive request statuses.

In the original version of the Max3d code, all of the code's arrays, indices, status variables, etc. were placed into common blocks. Thus, most of the code's variables had global scope. To reduce the confusion associated with these many variables, several conventions are adopted in the Max3d code:

1. Variable naming conventions.
2. Use of parameters, rather than variables, whenever possible.
3. No use of common blocks; modules are used instead.
4. Stack-based (local to subroutines) variable storage wherever allowable.
5. USEd variables are read-only. USEing variables is very similar to associating by common and has all the usual dangers. Here, we take the stand that anything USEd is to be considered read-only: parameters and unchanged variables are safely USE-associated.

These conventions serve several purposes:

1. We want the reader to easily grasp the flow of the program. The main program simply invokes the primary modules, which in turn perform simple tasks.
2. We want the reader to be able to easily learn where a given variable is set.
3. We want to ensure that a given array is set in its entirety.
4. We want to ensure that a given array, once set, is not reset by mistake.

Our method of data encapsulation is quite straightforward. Variables within a module are classified as USEd, defined, or local. All USEd variables appear in a USE statement within the module. These variables are presumed to have already been set to valid values, and they are not allowed to be reset by the module. Defined variables appear in the declaration section of the module, and their declarations contain the PUBLIC attribute. These variables will be set within the module (in their entirety, if they are arrays) and will later be USEd by other modules. All other variables are local to the module. They have, by default, the PRIVATE attribute within their declarations.

At present, compilers cannot check that our convention is adhered to. However, other developers have determined that this convention is a powerful one and have submitted requests to the Fortran 2000 standards committee to support the practice. Their requests have asked that a new declaration attribute be added to Fortran 2000 to support this convention. We have also submitted our request, and have asked that Silicon Graphics support it.

Our data encapsulation method makes the code much easier to read and maintain. Variables are set only in the module within which they are declared, and other modules that USE those variables must use them in a read-only mode. Tracing the use of data in the Max3d code is thus much easier now.

4. Features to be Added to the Code

In spite of the many advances already made with the Max3d code, many more can be made.

Various features of the MPI-2 standard are becoming available on the Origin 2000 platform and can be incorporated into the Max3d code. Some Fortran 95 and Fortran 2000 features will be relevant to the code also.
Error checking of the MPI library calls remains troublesome. Several logic additions to the Max3d code can be made to better facilitate error detection and, possibly, correction.

A number of additional optimization and nomenclature standardization additions are noted within the Max3d source code and ChangeLog file.

A few restrictions to the domain decomposition would provide for performance improvements and code clarity.

Several parts of the kernel routines remain poorly understood (on our part) and have thus not been documented or optimized.

A run-time visualization system could be incorporated into the code to provide one-, two-, or three-dimensional graphical output of the evolving solution. Several third-party, freely available software packages provide this capability. Some of these even allow solution steering, whereby a user can change an input to a running solution.

Lastly, the Max3d program simply needs to be tested for a wide variety of test cases to examine its results, stability, and performance. Certainly, such testing would expose additional or newly introduced bugs, and would provide us with the performance envelope within which the code operates.

5. Summary

Our efforts with the Max3d code are ongoing. We have made great progress in optimizing this code for the Silicon Graphics Origin 2000 platform, and we see opportunities with the follow-on products that will succeed the Origin 2000.

Our hope in delivering this new product is that its users will find the new code useful to their efforts. The code, while not fully optimized at this time, has reached a stage of maturity, and inspection of it is now appropriate. We solicit user feedback regarding the direction we have taken with the code, and the direction we foresee for it.

Already the new version of the code incorporates technologies that few other research groups have mastered. Few, if any, research groups have combined the Fortran 90, MPI, IEEE Floating-Point, SHMEM, and OpenMP (quasi-) standards into a single piece of software. The code, furthermore, is well positioned to evolve with the Silicon Graphics NUMA architecture.
[Diagram labels, as extracted: paramsinputs, s_openieeefp, stopwatch, Main Program (max3d), decomp, dtypes, errors, indice, metric, rcs_io, solver, swatch, update, commun]

Figure 1: Organization of the Max3d Code
Rank   Coordinates   Begin         End            Resident       Points
0/7    (0, 0, 0)     (1, 1, 1)     (73, 31, 24)   (73, 31, 24)   54312
1/7    (0, 0, 1)     (1, 1, 25)    (73, 31, 49)   (73, 31, 25)   56575
2/7    (0, 0, 2)     (1, 1, 50)    (73, 31, 73)   (73, 31, 24)   54312
3/7    (0, 0, 3)     (1, 1, 74)    (73, 31, 97)   (73, 31, 24)   54312
4/7    (0, 1, 0)     (1, 32, 1)    (73, 61, 24)   (73, 30, 24)   52560
5/7    (0, 1, 1)     (1, 32, 25)   (73, 61, 49)   (73, 30, 25)   54750
6/7    (0, 1, 2)     (1, 32, 50)   (73, 61, 73)   (73, 30, 24)   52560
7/7    (0, 1, 3)     (1, 32, 74)   (73, 61, 97)   (73, 30, 24)   52560

i_max_global = 73   i_max_local = 73   i_ghost_lo = 1   i_ghost_hi = 2   i_box = 76   i_num_procs = 1
j_max_global = 61   j_max_local = 31   j_ghost_lo = 1   j_ghost_hi = 2   j_box = 34   j_num_procs = 2
k_max_global = 97   k_max_local = 25   k_ghost_lo = 1   k_ghost_hi = 2   k_box = 28   k_num_procs = 4

[Diagram: the global domain, i = 1 to 73, j = 1 to 61, k = 1 to 97, divided into eight sub-domains numbered 0 to 7. The j-axis is split between j = 31 and j = 32; the k-axis is split between k = 24 and 25, k = 49 and 50, and k = 73 and 74.]

Figure 2: Sample Domain Decomposition
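The entries of Figure 2 can be cross-checked with a little arithmetic. The following is a minimal sketch in Python, not part of the Max3d Fortran source; the names `BLOCKS`, `resident`, `points`, and `check_decomposition` are hypothetical, and only the Begin and End corners and the global extents are taken from the figure. It recomputes the Resident and Points columns and confirms that the per-rank point counts add up to the size of the global grid.

```python
# Begin and End corners (i, j, k) for ranks 0..7, copied from Figure 2.
BLOCKS = [
    ((1, 1, 1), (73, 31, 24)),
    ((1, 1, 25), (73, 31, 49)),
    ((1, 1, 50), (73, 31, 73)),
    ((1, 1, 74), (73, 31, 97)),
    ((1, 32, 1), (73, 61, 24)),
    ((1, 32, 25), (73, 61, 49)),
    ((1, 32, 50), (73, 61, 73)),
    ((1, 32, 74), (73, 61, 97)),
]
GLOBAL_EXTENT = (73, 61, 97)  # i_max_global, j_max_global, k_max_global


def resident(begin, end):
    """Points owned along each axis, counting both end points."""
    return tuple(e - b + 1 for b, e in zip(begin, end))


def points(begin, end):
    """Total number of grid points owned by one sub-domain."""
    ni, nj, nk = resident(begin, end)
    return ni * nj * nk


def check_decomposition(blocks, extent):
    """True when the blocks' point counts add up to the global grid size.

    This is a count check only; it does not test for overlapping blocks.
    """
    total = sum(points(b, e) for b, e in blocks)
    return total == extent[0] * extent[1] * extent[2]


if __name__ == "__main__":
    for rank, (begin, end) in enumerate(BLOCKS):
        print(rank, resident(begin, end), points(begin, end))
    print("counts match global grid:", check_decomposition(BLOCKS, GLOBAL_EXTENT))
```

The sketch deliberately takes the Begin and End corners as input rather than deriving them, because the figure does not state the rule by which the extra point along each decomposed axis is assigned to a rank.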
I-indices

Consider an x-y plane for the sample problem. The x-direction is not decomposed and thus consists of a single domain. Because of its simplicity, the x-axis provides a good starting point to describe the various indices.

Variable          Value   Definition                              Meaning
i_max_global      73      set by user                             Global extent of domain
i_num_procs       1       set by user                             Number of subdomains
i_ghost_lo        1       set by user                             Ghost points at low end
i_ghost_hi        2       set by user                             Ghost points at high end
i_max_local       73      1 + (i_max_global - 1) / i_num_procs    Maximum locally owned points
i_box             76      i_ghost_lo + i_max_local + i_ghost_hi   Local plus ghost points
my_indices(beg)   1
my_indices(end)   73
pts_on_proc       73
at_ilo_face       true
at_ihi_face       true
i_local_beg       2       i_ghost_lo + 1
i_local_end       74      i_ghost_lo + pts_on_proc
ilo2              3       i_local_beg + 1
ihim              73      i_local_end - 1
i_offset          -1      mpi_my_indices(beg) - 1 - i_ghost_lo
i_rcv_ind(src)    1
i_snd_ind(src)    2       1 + i_ghost_lo
i_snd_ind(dst)    74      1 + pts_on_proc   ! this seems to be in error.
i_rcv_ind(dst)    75      1 + i_ghost_lo + pts_on_proc

Figure 3: Example of i-indices
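The definitions in Figure 3 can be exercised as ordinary integer arithmetic. The sketch below is in Python rather than the Fortran 90 of Max3d, and the function name `axis_indices` and its dictionary keys are hypothetical. It rests on two assumptions: that the division in the `i_max_local` definition is integer division, which matches the j and k values of Figure 2, and that `pts_on_proc` equals end - beg + 1, which matches the Resident column of Figure 2 but is not stated in the figure. The send and receive indices are left out because the figure itself flags one of them as possibly in error.

```python
def axis_indices(max_global, num_procs, ghost_lo, ghost_hi, beg, end):
    """Derive the per-axis index values listed in Figure 3.

    beg and end are the first and last global indices owned by this
    process along the axis (my_indices(beg) and my_indices(end)).
    """
    # Assumption: points owned locally, counting both end points.
    pts_on_proc = end - beg + 1
    # Assumption: "/" in the figure is integer division.
    max_local = 1 + (max_global - 1) // num_procs
    local_beg = ghost_lo + 1
    local_end = ghost_lo + pts_on_proc
    return {
        "max_local": max_local,
        "box": ghost_lo + max_local + ghost_hi,
        "pts_on_proc": pts_on_proc,
        "local_beg": local_beg,
        "local_end": local_end,
        "lo2": local_beg + 1,
        "him": local_end - 1,
        "offset": beg - 1 - ghost_lo,
    }


if __name__ == "__main__":
    # i-axis of the sample problem: one process owns all 73 points.
    print(axis_indices(73, 1, 1, 2, beg=1, end=73))
    # k-axis as seen by rank 1 of Figure 2, which owns k = 25..49.
    print(axis_indices(97, 4, 1, 2, beg=25, end=49))
```

For the i-axis the sketch reproduces the values 73, 76, 2, 74, 3, 73, and -1 listed in the figure. Applying the same function to the j- and k-axes gives the `j_box = 34` and `k_box = 28` values shown in Figure 2.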
Figure 4: The Coordinate Planes of a Sub-Domain
Figure 5: A Y-Z Plane & Its Ghost Points
[Diagram labels: Interior Domain; Ghost cells at lower end of z-axis]

Figure 6: The Sub-Domains Which Comprise One Section of Ghost Points