Introduction into the problems of
developing parallel programs
Author: Andrey Karpov

Date: 21.01.2008


Abstract
Since developing parallel software remains a difficult task, the theoretical training of specialists
and the study of methodologies for designing such systems have become very important. In this article
we provide historical and technical background that prepares a programmer for studying the development
of parallel computer systems.


For the reader
This document is part of a series of articles devoted to creating high-quality and efficient software
for modern 64-bit multi-core systems. You can read the other articles on the site
http://www.viva64.com.


Introduction
It is very difficult for programmers who only begin to use multi-processor computers to master all the
peculiarities of their usage while developing programs for applied tasks. As practice shows difficulties
begin when effectiveness and mobility are required of parallel software being developed. It is explained
by that universal means simplifying a programmer's labor and providing full access to debugging
information are only being developed. The problem is that there are no standards in the sphere of
creating and debugging programs for parallel systems because the field of computer science is very
young. Correspondingly, there are no logically complete training courses on parallel programming for
beginners at present.

The development of multi-processor computers is inseparably linked with the development of parallel
programming technologies, both universal ones and those created for particular computer architectures.
By a programming technology, that is, by the organization of work with memory, we mean the use of the
facilities for controlling a particular computer.

It should be noted that when developing software for supercomputers (both system software and software
for solving applied tasks), special attention should be paid to programming technique, i.e. to building
a logical program architecture. By this we mean the development and introduction of parallelization
algorithms that increase the efficiency of program execution on multi-processor computers.


1. History of development of multi-processor complexes and parallel
computations
About 50 years have passed since the appearance of the first computing machines. During this time their
use has spread to almost every field of human activity. Nowadays it is impossible to imagine effective
work without computers in such areas as production scheduling and control, design and development of
complex devices, publishing and education, in other words, in all the fields where large volumes of
information must be processed. Such tasks arose in the middle of the previous century with the
development of nuclear power engineering, aircraft building, rocket and space technologies and some
other fields of science and technology [1].

Nowadays the range of tasks that demand powerful computing resources has grown even wider. This is
related to fundamental changes in the very organization of scientific research. Thanks to the wide
introduction of computers, computational modeling and numerical experiment have developed greatly [2].
Filling the gap between physical experiments and analytical approaches, computational modeling has made
it possible to investigate phenomena that are either too complicated for analytical study or too
expensive or dangerous to study experimentally. At the same time, numerical experiment has made
scientific and technical research much cheaper. It has become possible to model in real time the
processes of intensive physico-chemical and nuclear reactions, global atmospheric processes, and
processes of the economic and industrial development of regions. Obviously, solving tasks of such scale
requires considerable computational resources.

Using computers for computational purposes has always remained the main driving force of progress in
computer technologies. That is why it is no wonder that the main characteristic of a computer is its
performance, i.e. the number of arithmetic operations it can perform per unit of time. It is this index
that shows the scale of the progress achieved in computer technologies. For example, the performance of
EDSAC, one of the first computers, was only about 100 operations per second, while the peak performance
of Earth Simulator, one of the most powerful supercomputers of our day, is 40 trillion operations per
second. Thus, performance has increased 400 billion times! There is no other sphere of human activity
where progress is so evident and so great. Of course, anyone would immediately ask: how did this become
possible? Strangely enough, the answer is rather simple: thanks to a roughly 1000-fold increase in the
performance of electronic circuits and to the utmost parallelization of data processing.

The idea of parallel data processing as a powerful means of increasing computer performance was
expressed by Charles Babbage about a hundred years before the first electronic computer appeared. But
the level of technology in the middle of the 19th century did not allow him to realize this idea. With
the appearance of the first electronic computers, these ideas repeatedly became the starting point for
developing the most advanced and high-performance computer systems [3]. Without exaggeration we can say
that the whole history of high-performance computer systems is the history of implementing the ideas of
parallel processing at each particular stage of the development of computer technologies, naturally
combined with the increasing speed and reliability of electronic circuits.

Fundamentally new solutions for increasing the performance of computer systems were the introduction of
pipelined instruction execution; the inclusion of vector operations into the instruction set, allowing
whole data arrays to be processed by a single instruction; and the distribution of calculations among
many processors. The combination of these three mechanisms in the architecture of the Earth Simulator
supercomputer, consisting of 5120 vector-pipeline processors, allowed it to reach record performance,
which exceeds the performance of modern personal computers by 20000 times.

It is obvious that such systems are extremely expensive and are produced in single copies [4]. What
about commercially produced machines? The wide variety of computers produced in the world today can be
roughly divided into four classes: personal computers (PC), workstations (WS), supercomputers (SC) and
cluster systems [5].

This division is very approximate because of the rapid progress of microelectronic technologies. The
performance of computers of every class currently doubles roughly every 18 months (in accordance with
the so-called Moore's law). Because of this, supercomputers of the early 90s are often inferior in
performance to modern workstations, and personal computers have become serious rivals to workstations.
Still, let's try to classify them.

Personal computers. Strangely enough, in this case we mean single-processor systems on Intel or AMD
platforms running single-user operating systems (Microsoft Windows and others). They are used mostly as
personal workplaces.

Workstations. Most often these are computers with RISC processors running multi-user operating systems
of the UNIX family. They contain from one to four processors, support remote control [6] and can serve
the needs of a small group of users.

Supercomputers. Their distinctive feature is that they are usually large and, consequently, very
expensive multi-processor systems. In most cases supercomputers use the same commodity processors as
workstations, so the difference between them is often quantitative rather than qualitative. For example,
we can speak of a 4-processor workstation from SUN and a 64-processor supercomputer from the same
company; most likely, both use the same microprocessors.

Cluster systems. In recent years they have been used all over the world as a cheap alternative to
supercomputers. A system of the required performance is assembled from ready-made, commercially
available computers connected in their turn by some commodity communication environment. Thus,
multi-processor systems, which earlier were associated mostly with supercomputers, are nowadays becoming
popular across the whole range of produced computer systems, from personal computers to supercomputers
based on vector-pipeline processors. On the one hand, this circumstance makes supercomputer technologies
more available; on the other hand, it makes mastering them urgent, because for all types of
multi-processor systems special programming technologies are needed to allow a program to fully use the
resources of a high-performance computer system [7, 8]. Usually this is implemented by dividing a
program, with the help of some tool, into parallel branches, each of which is executed on a separate
processor.


2. Using multi-processor systems
Supercomputers are developed first of all to solve complex tasks demanding large amounts of computation.
This implies that a single program can be created that requires all the resources of a supercomputer for
its execution. But creating such a program may be impossible or unreasonable. In fact, when developing a
parallel program for a multi-processor system, it is not enough simply to divide it into parallel
branches. To use the resources effectively you need to provide a balanced load on all the processors,
which in turn means that all the program branches should perform approximately the same amount of
computational work. But sometimes this is impossible. For example, when solving a parametric task for
different parameter values, the time spent searching for a solution can vary greatly. In such cases it
seems more reasonable to perform the calculations for each parameter value with the help of a simple
single-processor program [9]. But even in this simple case we may need the resources of a supercomputer,
because performing the full amount of computation on a single-processor system may take too much time.
Parallel execution of many copies of the program for different parameter values allows us to speed up
the solution significantly. Finally, we should mention that using a supercomputer to serve the needs of
a large group of users is always more effective than using the corresponding number of single-processor
workstations, as it is easier in this case to provide a balanced and more effective load on the
computational resources with the help of the job management system.

Unlike common multi-user systems, the operating systems of supercomputers, as a rule, do not permit
sharing the resources of one processor between different simultaneously executed programs, in order to
reach the maximum rate of program execution. Therefore an n-processor system can be used in the
following two opposite modes:

    •   all the resources are allocated to the execution of one program, and in this case we expect an
        n-fold speed-up of program execution in comparison to a single-processor system;
    •   n common single-processor programs are executed simultaneously, and each user expects that the
        other programs will not influence the speed of execution of his program.


3. Parallelism in computational modeling tasks
3.1. Static and dynamic balancing
When solving various tasks of mathematical physics on multi-processor systems with the help of mesh
methods [10], two approaches to building parallel programs are widely used. The first approach is called
the geometrical parallelism method, and the second one the group decision method [11]. The ideas on
which these methods are based are simple and elegant. It would not be an exaggeration to say that most
tasks of gas dynamics, microelectronics, ecology and many other fields, which are now solved by the
finite difference method or the finite element method, are solved effectively by the geometrical
parallelism method. The group decision method is reasonable to use when building parallel algorithms for
solving tasks by Monte Carlo methods, when a series of calculations of the same type is performed, and
in some other cases.

We should note that the geometrical parallelism method is a method of static load balancing: it defines
beforehand the section of the mesh processed by each processor. Static balancing is effective when there
is enough a priori information to distribute the common computational load equally among the processor
nodes in advance. The group decision method is a method of dynamic load balancing. When using this
method it is not known beforehand which particular mesh nodes will be processed by which processor. The
processors receive tasks dynamically as they finish the ones already received, which provides a balanced
load on the processor nodes when there are many independent tasks.
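
As an illustration of static balancing, here is a minimal C++ sketch (the node and processor counts are
arbitrary values chosen for the example, not taken from the article) of how a mesh of N nodes might be
pre-assigned to P processors in contiguous blocks, in the spirit of the geometrical parallelism method:

#include <cstdio>

// Static (geometrical) partitioning: every node is assigned to a processor
// before the computation starts and never migrates.
int main() {
    const int N = 10;  // total number of mesh nodes (illustrative)
    const int P = 4;   // number of processors (illustrative)

    for (int p = 0; p < P; ++p) {
        // Distribute N nodes as evenly as possible: the first N % P
        // processors receive one extra node.
        int base = N / P, extra = N % P;
        int begin = p * base + (p < extra ? p : extra);
        int count = base + (p < extra ? 1 : 0);
        std::printf("processor %d handles nodes [%d, %d)\n",
                    p, begin, begin + count);
    }
    return 0;
}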

3.2. Parallelism of "group decision" type
Parallelism of "group decision" type is convenient for performing calculations dividing into more single-
type tasks each of which is solved independently from the others. No data transfer occurs between such
tasks and, consequently, there is no need of their mutual synchronization.

Let's consider an example of a computational mesh as a set of independent nodes, in each of which we
should determine some parameters on each temporal layer by solving a system of ODEs with the
corresponding initial data [12]. The solution of the system in each node depends only on the local
values of the variables in that node, while the computational load differs greatly from node to node.
When building a parallel program with the help of the classical "group decision" method, the following
strategy of distributing the computational load is used.
One control processor is defined, while all the other processors are used as computing nodes. Each
computing processor performs primary tasks, namely solving the ODE system for the next mesh point with
the corresponding local parameters. The control processor distributes the primary tasks among the
computing processors and collects the results.

At the beginning of the next step each processor waits for a new data chunk, processes it, returns the
result and starts waiting for the next task, until instead of the next task it receives a message that
all the mesh points have been processed.

As there is no need to synchronize the primary tasks, different processors can receive different numbers
of computational nodes as they finish processing their data. Thereby the problem of balancing the load
on the processors is solved even if the time for solving the equation system at different mesh points,
or the processors' performance, varies greatly.
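
The article describes this scheme in terms of a control processor exchanging messages with computing
processors; the following minimal C++ sketch models only the load-balancing idea, using shared-memory
threads and a shared task counter instead of messages, and all the names and numbers in it are
illustrative:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Dynamic ("group decision") load balancing modeled with threads:
// each worker repeatedly grabs the next unprocessed mesh node,
// so fast workers automatically take more nodes than slow ones.
int main() {
    const int kNodes = 100;          // number of independent mesh nodes
    const int kWorkers = 4;          // number of computing "processors"
    std::atomic<int> next_node{0};   // shared task counter (plays the "control" role)
    std::vector<int> processed(kWorkers, 0);

    auto worker = [&](int id) {
        for (;;) {
            int node = next_node.fetch_add(1);
            if (node >= kNodes) break;   // the "all points processed" message
            // Imitate an ODE solve whose cost varies from node to node.
            std::this_thread::sleep_for(std::chrono::microseconds(100 * (node % 5 + 1)));
            ++processed[id];
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < kWorkers; ++i) pool.emplace_back(worker, i);
    for (auto& t : pool) t.join();

    for (int i = 0; i < kWorkers; ++i)
        std::printf("worker %d processed %d nodes\n", i, processed[i]);
    return 0;
}

Because a fast worker returns to the counter sooner, it automatically receives more nodes than a slow
one, which is exactly the balancing effect described above.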

In the case of a heterogeneous computational load at different points of the spatial mesh, the "group
decision" method potentially allows you to significantly reduce downtime and increase the effectiveness
of parallelization in comparison with the geometrical parallelism method considered below. The
advantages of this method can be fully realized if the data to be processed are concentrated from the
beginning on one of the processors, which in this case can serve as the control processor. When the
source data are initially distributed among the processors at random, a preliminary collection of the
data corresponding to all the computational points on one of the processors is required in order to use
this method. The need to copy the data from all the processors to one of them beforehand, and then to
return the results from this processor to the processors holding the points, significantly reduces the
effectiveness of this method and makes it of little use for most tasks of computational modeling.

3.3. Geometrical parallelism
The source task can be split into a group of fields that are independent of each other at each
computational step and interact only at the division boundaries. That is, we compute the (n+1)-th
temporal layer in each field and after that coordinate the boundaries and pass on to computing the next
layer.

With this approach, however, we have problems recalculating the values at the boundaries between the
fields when we divide the computational field into non-overlapping subfields; that is why the next
logical step is to divide the source field into mutually overlapping subfields.

Two "dummy" points then appear, to the left of the first field and to the right of the last field. Thus
we get four processes independent of each other at each step. To pass on to the next iteration we need
to coordinate the boundaries: the first field should give the second one its left boundary for the next
step, the second field in its turn should give the first one its right boundary, and so on.

This method can be generalized to most computational methods based on equations modeling physical
processes.
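
To make the boundary coordination concrete, here is a minimal single-machine C++ sketch (two subfields
instead of four, and an arbitrary smoothing formula standing in for a real difference scheme, so the
numbers are purely illustrative): each subfield carries one ghost ("dummy") point per side, the ghost
points are refreshed from the neighbor before every step, and the subfields are then advanced
independently:

#include <cstdio>
#include <vector>

int main() {
    const int kHalf = 5;                          // points per subfield (illustrative)
    std::vector<double> left(kHalf + 2, 0.0);     // cells 0 and kHalf+1 are ghost ("dummy") points
    std::vector<double> right(kHalf + 2, 0.0);
    left[kHalf] = 1.0;                            // an initial disturbance near the internal boundary

    auto relax = [](std::vector<double>& f) {
        // Compute the next temporal layer from the current one inside one subfield.
        std::vector<double> next = f;
        for (int i = 1; i + 1 < (int)f.size(); ++i)
            next[i] = 0.5 * f[i] + 0.25 * (f[i - 1] + f[i + 1]);
        f = next;
    };

    for (int step = 0; step < 3; ++step) {
        // Coordinate the boundaries: each subfield passes its edge point to the
        // neighbor's ghost cell (in a real program this would be a message).
        right[0] = left[kHalf];
        left[kHalf + 1] = right[1];

        // The subfields can now be advanced independently (in parallel).
        relax(left);
        relax(right);
    }

    for (int i = 1; i <= kHalf; ++i)
        std::printf("left[%d] = %.4f   right[%d] = %.4f\n", i, left[i], i, right[i]);
    return 0;
}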


4. Effectiveness of a parallel program
4.1. Notion of an effective parallel program
The use of supercomputers imposes certain requirements on newly developed software, which must provide a
reliable and economical implementation of the algorithm when solving applied tasks. The effectiveness of
using supercomputers becomes especially apparent when creating complex research complexes and expert
systems. It is much more difficult to write a parallel program than a sequential one; creating software
for parallel computers is the central problem of supercomputing [13].

The problem of choosing the optimal number of parallel branches according to the criterion of minimum
total time costs can be partially solved with the help of automated parallel program generation. A
particular case of solving this problem for computer systems with MIMD architecture is considered in the
article by V.A. Kostenko "To the question of evaluating the optimal parallelism level" [14].

The effectiveness of using multi-processor computer systems is to a large degree determined by the
quality of the applied parallel programs. A program is considered effective when all the processors
allocated to its processes are loaded throughout its execution. In practice, however, this is
impossible.

4.2. Properties of an ideal parallel program
Let's note that an ideal parallel program possesses the following properties:

    1. The lengths of simultaneously executed branches are equal.
    2. Downtime related to waiting for data, control transfers and conflicts over shared resources is
       fully excluded.
    3. Data transfer is fully overlapped with calculations.

An increase in the effectiveness of parallelism (a decrease in overhead costs) is achieved by the
following means:

    •   enlargement of the units of parallelization;
    •   reducing the complexity of the algorithms that generate parallel procedures (subprograms);
    •   preliminary preparation of a package of different variants of the source data;
    •   parallelization of the algorithms that generate parallel procedures (subprograms).

4.3. Adaptation of programs to the parallel computers' architecture
The main stages of the process of adapting programs to the architecture of parallel computers and
description of the tasks occurring at each of these stages are given in the article by A.S. Antonov
"Effective adaptation of sequential programs to the modern vector-pipeline and array-parallel
supercomputers" [15]. We would like to pay special attention to some of the tasks which the authors of
this analysis faced. Among these tasks are:

    •   investigation of the common program structure;
    •   definition of the main computational core, input-output localization;
    •   definition of the potential parallelism of a fragment;
    •   definition of the sequential fragments of calculations and attempt to use alternative algorithms
        for such fragments;
    •   definition and minimization of data redistribution points;
    •   conversion of the traditional loop for parallel algorithms;
    •   minimization of the number and size of temporary arrays for optimizing cache-memory
        handling;
    •   passing on from the source program working with full arrays to the program processing only a
        local chunk distributed for a processor: change of arrays' sizes and the corresponding
        transformation of the program text.
We should note that solution of these tasks allows us to perform an effective port of a sequential
program on a parallel architecture.

The process of developing a parallel program is long and laborious, even though, as a rule, an
implementation of its "sequential" counterpart already exists. A program is usually developed on a
computer with one architecture, while its practical application takes place on another, more powerful
computer whose topology differs from that of the former machine. This approach allows you to save
computer time on the more powerful supercomputers, of which there are far fewer than cheaper machines.

When porting a parallel program to computers with a different architecture, a programmer faces the
problem that the parallel procedures developed earlier become unusable.

At present there are no universal means of adapting programs to a particular supercomputer architecture,
so this problem has to be solved mostly by hand, which makes the process very labor-intensive [15]. To
save the programmer's labor, mathematical institutions of the Russian Academy of Sciences (the RAS Ural
Branch, the Research Computing Center of Lomonosov Moscow State University) are developing libraries of
effective procedures and algorithms for particular supercomputer architectures. Using these libraries
can partially save the labor of an applied programmer not only at the stage of adapting a program to
more powerful supercomputers, but also at the stage of the initial development of a parallel program.


5. Debugging and monitoring issues
The problem of debugging and monitoring is very urgent, as there are no managers that could provide an
applied software developer with intermediate information, which is especially needed at the initial
stage of design [16]. In the general case the task of debugging and monitoring such systems is stated as
follows [17, 18]. There is a mesh of nodes, heterogeneous in their hardware and/or software platforms,
on each of which many processes (threads) are executed simultaneously [19]. There is also a number of
users, each of whom would like to monitor and/or control his own subset of program and/or hardware
components.

Understanding debugging/monitoring as controlled execution changes the place of debugging in the
systems' life cycle and allows you to use architectural and protocol solutions characteristic of control
facilities. This makes the control facilities scalable and capable of serving distributed heterogeneous
systems.

For the further development of debugging/monitoring tools it is important to create a set of
specifications defining the functionality of the manager programs being developed [20].

Programs are complex dynamic systems. This is especially true of parallel and interactive (dialog-mode)
programs, which involve complex interactions both between the program's processes and between the
processes and the outer world. The analysis of such programs cannot be performed in terms of relations
between the input and output values of the program, as is usually done for sequential programs. This
means that checking and proving the correct operation of such programs requires the development of
adequate means of formal specification. In particular, it is necessary to be able to express relations
between the system's states at those instants of time when certain events accompanying the operation of
the program system occur. The article "Applying temporal logic to program specification" by M.K. Valiev
[21] discusses an approach to the analysis of a parallel program based on mathematical logic.

Process control is one of the most important tasks of an operating system. To perform this function on
supercomputers, semaphore technology [22] can be used, which consists in locking and unlocking
processes.

Semaphores have traditionally been used for synchronizing processes that access shared data. Each
process should exclude, for all the other processes, the possibility of simultaneous access to its data.
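
As a minimal illustration of this locking/unlocking discipline, the following C++ sketch uses the
standard C++20 binary semaphore; the shared counter, the number of processes and the iteration count are
arbitrary example values:

#include <cstdio>
#include <semaphore>
#include <thread>
#include <vector>

// The classical semaphore discipline: a process "locks" (acquire) before
// touching the shared data and "unlocks" (release) afterwards, excluding
// simultaneous access by the other processes.
std::binary_semaphore guard{1};   // 1 = the resource is initially free
long shared_sum = 0;              // data shared by all processes

void process(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        guard.acquire();          // lock: wait until the resource is free
        ++shared_sum;             // critical section
        guard.release();          // unlock: let another process proceed
    }
}

int main() {
    std::vector<std::thread> processes;
    for (int i = 0; i < 4; ++i) processes.emplace_back(process, 100000);
    for (auto& t : processes) t.join();
    std::printf("shared_sum = %ld\n", shared_sum);   // always 400000 thanks to the semaphore
    return 0;
}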

When solving applied tasks, the volume of the data produced is in most cases so large that verification,
i.e. a detailed analysis of the data received directly from a computational program, is impossible. As
there are no universal graphical packages visualizing different isometric projections and color palettes
for such situations, applied software developers are advised to start developing such packages
themselves.


6. Modeling of parallelization objects
The established approaches to using computers are based on the following thesis: a computer is a
cognition tool with the help of which people obtain new information about the object or phenomenon being
investigated [23]. Consequently, a qualified user should know the modern methodology of cognition, i.e.
modeling. Modeling is not only the design of an object of cognition but a method of cognition as well.
Modeling is a work methodology whose effectiveness becomes apparent only when specialists are highly
qualified and know well the modern means of formalization: logic and mathematics.

Having defined the problem and stated the goal, a researcher starts searching for a solution. The way he
passes becomes a method.

The process of modeling presupposes both the way "from the object to the model" (reflection of reality
in a paradigm) and the way "from the model to the object" (testing the model's truth against its
capabilities). The computer is the natural means of performing such "research" cycles.

Software-development theorists rarely pay attention to modeling when describing the process of software
creation. On the other hand, modeling specialists argue for the urgent necessity of the wide use of
their methods when designing any complex system [24]. As a software complex is a complex system with
many levels and components and a complex structure of relations between them, it is necessary to use
modeling when developing such systems.

Considering that parallel software development (the development of a parallelization object) is very
difficult nowadays, the problem of creating a theoretical basis for its design is all the more urgent.
Besides the analysis of the structure and properties of the developed programs at all design stages,
modeling can help describe all the peculiarities of interaction between parallel processes at the level
of a simulation model. In his work "Modeling of parallel software using PS-networks" [25] N.G. Markov
suggests using a graph-analytical approach to simulation modeling of a program project on the basis of
the requirements imposed on parallel software. The aim of that work was to work out the requirements for
a parallel-software simulation modeling mechanism and also to create a mechanism keeping a balance
between mathematical simplicity and rigor on the one hand and practical applicability on the other.

Thus, we can state that the most convenient means of analyzing the computational algorithms of parallel
computations is graphs [26].

The problem of creating modern packages of applied programs intended for a wide range of mechanics tasks
goes beyond merely assembling these tasks from separate program modules; it is related to the global
optimization of the whole computational sequence of tasks [27]. That is why a package of programs, as a
product used for scientific and applied purposes not only by its creators but by end users as well,
should be developed at an entirely new programming level. When developing modern software it is
necessary not only to take into account the non-linear (feedback) relations between all the links of a
calculation chain, but also to implement the possibility of segmenting a program at high, middle and low
levels of parallelization of the computational process. Segmentation is necessary for a more effective
use of multi-processor systems. Besides, when developing a numerical algorithm we should balance the
issues of accuracy and reliability of the final software against the issues of its efficiency and
portability to a particular supercomputer architecture. Such parallel programming differs greatly from
traditional, i.e. sequential, programming.

6.1. Levels of decomposition of parallelization objects
To provide supercomputer users with the possibility of simultaneously performing many scientific
calculations, or of multi-threaded processing of requests to a database on multi-processor computers,
the corresponding software should be installed. In this case, parallelization functions are performed
not only by the applied software but by the OS as well.

In the general case, two main interrelated problems arise when creating an OS for parallel data
processing: the first one is minimizing the time needed to perform a given amount of computation, and
the second one is synchronizing many simultaneously interacting parallel processes [28]. Different
approaches are being developed to solve each of them. The work mentioned above points out that the
implementation of complex synchronization mechanisms increases overhead costs, and this adversely
affects the efficiency of solving tasks. In systems whose parallelism is limited by the number of
processors, the stated problem is solved by minimizing the total time of performing the given amount of
computation.

The results of implementing this approach relate, first of all, to "operational parallelism". A method
based on building a schedule of launching and finishing each of the competing processes can be useful in
such systems. It gives you an opportunity not only to solve the process synchronization problem more
effectively, but also to significantly reduce system costs and wasteful processor downtime. The method
of managing the interaction between parallel processes is implemented with the help of "semaphore"
technology [29].

When researchers create applied software, the practical value of the numerical methods they develop is
determined not only by the results obtained with their help when investigating complex phenomena, but by
their applicability on particular supercomputers as well. It has turned out that while the performance
of personal computers grows, stimulating the development of computational methods, qualitative changes
are also occurring in supercomputer architectures, focused on developing parallelism and specialization
of processors. And this, in its turn, stimulates the search for new representations of physical
phenomena that would map more directly onto the computers' architecture. Thus, for example, the
cellular-automaton approach appeared in gas dynamics and hydrodynamics [30]. The article presents a new
model of parallel calculations, the cellular-neural network (CNN). It describes the essence of the
cellular-automaton model and also shows the rich opportunities CNN offers for representing the
spatio-temporal dynamics of active media. This model can serve as the basis for creating parallel
programs intended for solving partial differential equations and also for simulating nonlinear dynamics
phenomena. It is noted that the use of CNN calculation methods together with parallel processors will
allow us to greatly increase the quality of solving such tasks.

The aim of any work connected with parallel programming is to examine the interrelations between the
structure of the mathematical algorithm and the architecture of a multi-processor computational system.
Depending on the complexity of the stated task, different types of interrelations can be implemented.
These interrelations are called the levels of decomposition of the source task. They can be defined as
follows [31]:

The first level - division of a task into subtasks.

The second level - division of each separate subtask into a set of quasi-uniform procedures executed
simultaneously on different source data. In mathematical physics this type of parallelism is called
geometrical parallelism or data parallelism, as parallelization is performed in this case by
distributing the calculations at different points of the computational field among different processors.

The third level - parallelization of separate procedures.

The fourth and deepest parallelization level - division of arithmetical processes according to the
number of processors.

It is recommended not to use the last level on supercomputers with distributed memory, where local
memory is allocated to each processor. For most applied tasks, researchers are advised to stop the
decomposition process at the second level.

6.2. Possibility of parallelizing objects in computational modeling algorithms
Now let's consider which objects in task-solving algorithms can be parallelized.

The main numerical methods (the finite element method, the finite difference method and others) reduce
the source task to forming a system of linear algebraic equations (SLAE) and then solving it [32, 33].
For example, in a sequential program implementing the finite element method, most of the time is spent
on forming the SLAE itself (calculating the coefficients) rather than on solving it. It is also
important to mention that the elements of the SLAE matrix depend only on their positions in it and do
not depend on each other. In this case parallel algorithms of SLAE formation can be used effectively (a
minimal sketch follows the list below). Here you should perform the following operations:

    1. split the computational task into parallel branches;
    2. perform the calculations in these branches;
    3. form and solve the SLAE (by any method).
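
A minimal sketch of such parallel SLAE formation is given below (C++ with threads; the coefficient
formula, the system size and the thread count are placeholders, not taken from the article): since each
matrix element depends only on its position, different threads can fill different rows without any
synchronization:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder for whatever the finite-element or finite-difference method
// prescribes: the coefficient depends only on the indices (i, j).
double coefficient(int i, int j) {
    return 1.0 / (1.0 + std::fabs(i - j));
}

int main() {
    const int n = 1000;                      // system size (illustrative)
    const int kThreads = 4;
    std::vector<std::vector<double>> a(n, std::vector<double>(n));

    // Each thread assembles its own block of rows; no locking is needed
    // because the coefficients are mutually independent.
    auto fill_rows = [&](int first, int last) {
        for (int i = first; i < last; ++i)
            for (int j = 0; j < n; ++j)
                a[i][j] = coefficient(i, j);
    };

    std::vector<std::thread> workers;
    int rows_per_thread = (n + kThreads - 1) / kThreads;
    for (int t = 0; t < kThreads; ++t) {
        int first = t * rows_per_thread;
        int last  = std::min(n, first + rows_per_thread);
        workers.emplace_back(fill_rows, first, last);
    }
    for (auto& w : workers) w.join();

    std::printf("a[0][0] = %.3f, a[%d][%d] = %.3f\n", a[0][0], n - 1, n - 1, a[n - 1][n - 1]);
    return 0;
}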

The article [34] gives an example description of a parallel algorithm of SLAE formation and also
discusses the peculiarities of using MPI technology.

The article [35] considers the implementation of the Gauss method for solving sparse systems of linear
algebraic equations on computers with parallel processes and shared memory. It is pointed out that the
division into several command threads can be performed either by function or directly by data. When the
task is stated like this, only division by data can be implemented. Meanwhile, you should pay attention
to whether it is possible to single out unlinked fields in the task.

The same article points out that the proposed parallel algorithm is bound to a particular computer
architecture, but it also states that the effectiveness of the parallelization algorithm depends only on
the ratio between the number of processes and processors and on the amount of information processed in
one loop iteration.

We can propose the thesis that loops are one of the most important program constructions with accessible
parallelism. The problem of extracting fine-grained parallelism (parallelism inside loops) from these
constructions is of great importance in view of the increasing popularity of superscalar computers [36].
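
As one possible illustration of fine-grained, intra-loop parallelism (our example, not the approach of
[36]), here is a sketch using the standard C++17 parallel algorithms; it needs a standard library with
parallel-algorithm support (e.g. linking against TBB when building with GCC):

#include <algorithm>
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

// Iterations of an independent loop are handed to the runtime, which may
// spread them over the available cores.
int main() {
    std::vector<double> v(1000000);
    std::iota(v.begin(), v.end(), 0.0);            // v[i] = i

    // The loop body touches only its own element, so the iterations are
    // independent and the parallel execution policy is legal.
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [](double& x) { x = x * x + 1.0; });

    std::printf("v[10] = %.1f\n", v[10]);          // 10*10 + 1 = 101
    return 0;
}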

The article [37] presents algorithms for computational procedures, and results obtained with their help,
based on the methodology of high-accuracy parallel arithmetic. It is suggested that this methodology be
used for solving applied tasks of linear algebra and mathematical physics. The mentioned work is devoted
to creating algorithmic and program means of supporting accurate array computations based on the
combined use of the parallelism of MIMD systems [38] and multibit arithmetic with dynamic operand
length. Special attention is paid to the influence of rounding in basic array operations on the accuracy
of matrix calculations. The work includes a library of programs and test examples demonstrating the
effectiveness of the developed approach. The given results show the possibility of performing accurate
array computations with simultaneous message transfer on parallel computers. The developed package of
applied programs can be adapted for execution on parallel computers of different types.

It is obvious that using multibit arithmetic is not typical of supercomputers, and its use will
inevitably slow down application execution. But the time lost in this case is compensated not only by
performing calculations with maximum use of the standard data types, but also by adapting highly
effective parallel algorithms, initially suited for execution on single-processor computers, to the
means of high-accuracy processing. Multibit arithmetic should be used only in the most "heavy" algorithm
sections. But even in this case the dynamic operand length makes it possible to process only a limited
number of bits. It is in this way that a balance between the speed and the accuracy of computations is
supposed to be reached.

The article [39] analyzes in detail the vector-pipeline architecture of the CRAY family of
supercomputers. As a result of this research, programming factors reducing supercomputer performance
were discovered. They include:

    •   sectioning of long vector operations (increases overhead costs);
    •   overflow of instruction buffers (increases overhead costs);
    •   memory access conflicts (in case of using shared resources);
    •   limited capacity of data transfer channels (depends on the supercomputer's architecture);
    •   other factors.

It also gives examples (in program code) showing a way out.

Among the works mentioned above, one development stands out in which the algorithm structure does not
adapt to the computer's structure but itself defines that structure [40]. The work is aimed at creating
new computer technologies and methods of parallel programming meant to increase the effectiveness of
solving fundamental scientific and applied problems in the computational modeling of aerodynamics and
gas dynamics tasks. Special attention is paid to the theoretical issues of parallelization. The work
considers different methods of decomposing a full task into simultaneously executed subtasks. Of high
strategic importance is the theoretical stage of investigating the problem of parallelizing a program
complex, that is, developing principles (and concrete methods based on them) of optimal decomposition of
the whole set of algorithms composing the processor system and its operational environment.

The article describes three main types of decomposition (segmentation) for the program complex
"Thread-3", planned for development in the course of preparing for parallelization the algorithms which
make it up:

    •   physico-mathematical;
    •   geometrical;
    •   technological.

One of the global types of high-level structuring of the task being solved is the decomposition of the
investigated physical process into its constituent subprocesses and, consequently, the segmentation of
the common algorithm of solving the full task into several algorithms of solving the subtasks composing
it.

A segmental algorithm of parallel calculation of physical processes is suggested, in which all the
module-segments of the computational core of the program are launched simultaneously. Besides, inside
each segment, subsegments can start working simultaneously.

When parallelizing computational procedures, the synchronization and routing of data is of extreme
importance: its improper organization leads either to incorrect calculations or to large overhead costs
of computer and wall-clock time due to various calculation delays caused by waiting for data and,
consequently, to processor downtime in some segments. It is supposed that the latter leads to a
non-optimal use, or even to the impossibility of using, the configurable processor space.

Geometrical decomposition (segmentation) of the full task and the subsequent parallelization of
calculations allows you to significantly reduce the wall-clock time needed for calculations. Geometrical
decomposition consists in dividing the whole integration domain into a map of subdomains (subsegments)
and in a single-step calculation of the state of the physical process in each subdomain, followed by
joining the solutions. The article lists the requirements to the mathematical formulation of the task
that permit geometrical decomposition.

Technological decomposition implies the segmentation of mathematically independent tasks. There can be
several levels of technological decomposition. The most typical example of decomposition is
parallelizing a program into certain physico-mathematical tasks, each of which can similarly consist of
algorithmically independent tasks. The process of technological decomposition depends greatly on the
program's structure and on the numerical methods used in it.

Using the latter decomposition type requires special attention to the parallel program's effectiveness.


Conclusion
Despite the obvious success in using multi-processor systems, debates about their low effectiveness
continue. The performance increase of multi-processor systems is generally determined by the balance
between computational operations and the data exchanges overlapping them. Failure to fulfill this
condition is one of the causes of performance loss during parallelization as the number of computational
program modules grows.

The effectiveness of programs has been evaluated since the first multi-processor systems, the
transputers. Even then the first attempts were made to solve the problem of maximum use of calculation
time. When solving a concrete task, it is first of all necessary to search for parallelism variants by
dividing the task into several subtasks. After that, data parallelism (or geometrical parallelism) can
be applied, that is, division of the computational field. This type of parallelism means that the
computational field is divided into subfields, each of which is assigned to a separate processor of the
system.

When developing real parallel programs, achieving high effectiveness, as a rule, demands many
modifications of the program in order to find the best scheme of its parallelization. The success of
this search is determined by how easy the program is to modify.


References
    1. V.N. Datsuk, A.A. Bukatov, A.I. Zhegulo. Electronic user's guide on the course "Multi-processor
        systems and parallel programming" Part I. Introduction into programming organization and
        methods of multi-processor computer systems. Rostov-on-Don, 2000.
    2. E.V. Neupokoev, G.A. Tarnavskiy, V.A. Vshivkov. Paralleling marching algorithms: target
        computational experiments. // Autometrics, N 4, volume 38, 2002, pp. 74-87.
    3. V.V. Korneev. Parallel computer systems. Moscow: "Knowledge", 1999. - 320 pp.
    4. G.I. Shpakovskiy. Parallel computers' architecture. - Minsk, 1989. - 136 pp.
    5. A.O. Latsis. How to build and use a supercomputer. Moscow: Bestseller Publishing house, 2003.
        274 pp.
    6. D.U. Labutin. System of remote access to the computational cluster (access manager): high-
        performance parallel computations on cluster systems. Material of the second international
        scientific-practical seminar, Nizhny Novgorod: Nizhny Novgorod University Publishing house,
        2002. pp.184-187.
    7. K.E. Afanasyev. Multi-processor computer systems and parallel programming: tutorial/ K.E.
        Afanasyev, S.V. Stukolov, A.V. Demidov, V.V. Malishenko; Kemerovo State University. -
        Kemerovo: Kuzbassvuzizdat, 2003. - 182 pp.
    8. S.A. Nemnyugin, O.L. Stesik. Parallel programming for multi-processor computer systems. - St.
        Petersburg: BHV-Petersburg, 2002. - 400 pp.
    9. A.V. Demidov, K.V. Sidelnikov. Emulation of parallel data processing on a personal computer //
        XLI International scientific student conference "Students and Scientific-and-Technological
        Advance". Collection of works. Novosibirsk, 2003. pp. 110-111.
    10. A.A. Samarskiy, E.S. Nikolaev. Methods of solving mesh equations. Moscow: Science, 1978. 561
        pp.
    11. V.V. Samofalov, A.V. Konovalov, S.V. Sharf. Dynamics and statics: searching for compromise //
        Works of All-Russian scientific conference "High-performance computations and their
        applications". Moscow, 2000. pp. 165-167.
    12. S.K. Godunov, V.S. Ryabenkiy. Difference schemes. - Moscow: Science, 1973. - 400 pp.
    13. V.V. Voevodin. Supercomputers: yesterday, today, tomorrow. // Collection of popular science
        articles "Russian science at the dawn of the new century". Under the editorship of academician
        V.P. Skulachov. Moscow: Scientific world, 2001. pp. 475-483.
    14. V.A. Kostenko. To the question of evaluating the optimal parallelism level. // Programming.
        1995, 4, pp. 24-28.
    15. A.S. Antonov, V.V. Voevodin. Effective adaptation of sequential programs to the modern vector-
        pipeline and array-parallel supercomputers. // Programming. 1996, 4, pp. 37-51.
    16. E. Sallivan. Time is money. Creating a team of software developers/Translated from English. -
        Moscow: Publishing house "Russkaya Redaktsiya", 2002. - 368 pp.: illustrations.
17. V.A. Galatenko, K.A. Kostuhin. Debugging and monitoring of distributed heterogeneous systems.
    // Programming, 2002, 1. pp. 27-37.
18. V.A. Krukov, R.V. Udovichenko. "Debugging of DVM programs". Programming. - 2001. N. 3.-
    pp.19-29.
19. A.P. Sapozhnikov, T.F. Sapozhnikova. Reengineering technology of distributed computations in
    the local network. Works of international conference "Distributed computations and Grid-
    technologies in science and education" (Dubna, June 29 - July 2 2004). 11-2004-205, Dubna,
    JINR, 2004. pp.183-190.
20. V.V. Samofalov, A.V. Konovalov. Technology of debugging programs for computers with mass
    parallelism // "Issues of atomic science and technique". Series Mathemetical modeling of
    physical processes. 1996. Issue 4. pp. 52-56.
21. M.K. Valiev. Applying temporal logic to program specification. // Programming. 1998, 2, pp. 3-9.
22. V.A. Krukov. OS of distributed computer systems. (tutorial).
23. N.L. Zaharyeva, V.B. Hoziev, P.D. Shirkov. Modeling and education.// Mathematical modeling,
    1999, volume 11, N 5, pp. 101-116.
24. A.A. Shalito. Automatic program projecting. Algorithmization and programming of logical control
    tasks. "Izvestiya akademii nauk. Teoriya i sistemi upravleniya" magazine, issue 6. November-
    December 2000. pp.63-81.
25. N.G. Markov, E.A. Miroshnichenko, A.V. Saraykin. Modeling of parallel software using PS-
    networks. // Programming. 1995, N 5, pp. 24-32.
26. N.M. Ershov. Building graphs of computational algorithms by autotracing method. //
    Programming. 2000, N 6, pp. 58-64.
27. V.P. Gergel, R.G. Strongin. Fundamentals of parallel computations for multi-processor computer
    systems. Tutorial. Nizhniy Novgorod: Publishing house of Nizhniy Novgorod State University
    named after N.I. Lobachevskiy, 2000. - 176 pp.
28. V.P. Ivannikov, N.R. Kovalevskiy, V.M. Metelskiy. About minimal time of implementing
    distributed competing processes in synchronous modes. // Programming. 2000, N 5, pp. 44-52.
29. M.K. Valiev. Applying temporal logic to program specification. // Programming. 1998, 2, pp. 3-9.
30. O.L. Bandman. Cellular-neural models of spatio-temporal dynamics. // Programming. 1999, N 1,
    pp. 4-17.
31. A.E. Duysekulov, T.G. Elizarova. Using multi-processor computer systems for implementing
    kinetically coordinated difference schemes of gas-dynamics. // Mathematical modeling, 1990,
    volume. 2, N 7, pp. 139-147.
32. O.A. Dmitrieva. Parallel algorithms of numerical solution of simple differential equations.
    //Mathematical modeling N 5, 2000, pp. 81-86.
33. N.R. Bahvalov. Numerical methods // Main editorship of physico-mathematical literature of
    "Science" publishing house, Moscow, 1975. - 632 pp.
34. D.B. Moskvin, V.A. Pavlov. Experience of using MPI technology for solving a system of second-
    order Fredholm integral equations. //Mathematical modeling, 2000, N 8, pp. 3-8.
35. A.N. Zavorin. Parallel solution of linear systems when modeling electric circuits. // Mathematical
    modeling, 1991, volume 3, N 3, pp. 91-96.
36. Ki-Chang Kim. Fine-grained paralleling of incomplete loop nests. // Programming. 1997, N 2, pp.
    52-66.
37. V.A. Morozov, A.P. Vazhenin. Matrix arithmetic of multiple accuracy for parallel systems with
    message transfer. // Programming. 1999, N1, pp. 66-77.
38. V.V. Voevodin. Mathematical models and methods in parallel processes. - Moscow: Science,
    1986. - 296 pp.
39. V.V. Voevodin. Is it easy to get the promised gigaflop? // Programming. 1995, N 4, pp. 13-23.
    40. G.A. Tarnavskiy, R.I. Shpak. Decomposition of methods and paralleling of algorithms of solving
        aerodynamics and physical gas-dynamics tasks: computer system "Thread-3". // Programming.
        2000, N 6, pp. 45-57.


Additional references.
    1. A.R. Antonov. Parallel programming using MPI technology: tutorial. - Moscow: MSU publishing
        house, 2004. - 71 pp.
    2. R.B. Berozin, V.M. Paskonov. Component system of visualizing the results of computations on
        multi-processor computer systems // Materials of All-Russian scientific conference "High-
        performance computations and their applications", 2000, pp. 202-203.
    3. N.V. Bocharov. Parallel programming technologies and technique. Review.// "Programming",
        2003, N 1. PP. 5-23. UDC 681.3.06
    4. V.V. Voevodin, Vl.V. Voevodin. Parallel computations. - St.Petersburg, 2002. - 600 pp.
    5. V.P. Gergel, A.N. Svistunov. Development of an integrated environment of high-performance
        computations on cluster systems. Materials of the second international scientific-practical
        seminar, Nizhny Novgorod: Nizhny Novgorod publishing house, 2002. PP.78-82.
    6. V.P. Ilyin. About paralleling strategies in mathematical modeling. // Programming. 1999, N 1, pp.
        41-46.
    7. A.N. Karpov. Data visualization on parallel computer complexes // 15th International conference
        GRAPHICON-2005. Novosibirsk, Russia, June 20-24 2005.
    8. N.A. Konovalov, V.A. Krukov, A.A. Pogrebtsov, U.L. Sazanov. C-DVM - the language for developing
        mobile parallel programs. // Programming. - 1999. N1. PP. 20-28.
    9. N.A. Konovalov, V.A. Krukov, R.N. Mikhailov, L.A. Pogrebtsov. Fortran-DVM - the language for
        developing mobile parallel programs. // Programming. 1995, N 1. PP. 49-54.
    10. R.V. Popova, R.V. Sharf. Organization of saving temporary data on MBC // Theses of the report
        from All-Russian conference "Urgent problems of applied mathematics and mechanics"
        (Yekaterinburg, February 3-7 2003), pp. 62.
    11. I.V. Prangishvili, R.Ya. Vilenkin, I.L. Medvedev. Parallel computer systems with shared control. -
        Moscow: Energoatomizdat, 1983. - 312 pp.
    12. L.B. Sokolinskiy. Parallel machines of databases. // Collection of popular science articles "Russian
        science at the dawn of the new century". Under the editorship of V.P. Skulachov. -Moscow:
        scientific world, 2001. PP. 484-494.
    13. E. Tanenbaum. Distributed systems. Principles and paradigms. - St.Petersburg: Piter, 2003. - 877
        pp.
    14. M.V. Yakobovskiy, R.A. Sukov. Dynamic load balancing // Materials of "High-performance
        computations and their applications" conference, Chernogolovka, 2000, PP. 34-39.
    15. A.V. Komolkin, R.A. Nemnugin. Electronic tutorial "Programming for high-performance
        computers".


About the author
Andrey Karpov, http://www.viva64.com

Develops software solutions in the field of improving the quality and performance of resource-intensive
applications. One of the developers of the Viva64 static analyzer for verifying 64-bit software.
Participates in developing the VivaCore open library for working with C/C++ code.

Linux-Based Data Acquisition and Processing On Palmtop Computer
 
Linux-Based Data Acquisition and Processing On Palmtop Computer
Linux-Based Data Acquisition and Processing On Palmtop ComputerLinux-Based Data Acquisition and Processing On Palmtop Computer
Linux-Based Data Acquisition and Processing On Palmtop Computer
 
Chap 1(one) general introduction
Chap 1(one)  general introductionChap 1(one)  general introduction
Chap 1(one) general introduction
 
Revant Rastogi
Revant Rastogi Revant Rastogi
Revant Rastogi
 
Chap10.ppt
Chap10.pptChap10.ppt
Chap10.ppt
 
Chap10.ppt Chemistry applications in computer science
Chap10.ppt Chemistry applications in computer scienceChap10.ppt Chemistry applications in computer science
Chap10.ppt Chemistry applications in computer science
 

Recently uploaded

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 

Recently uploaded (20)

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 

Introduction into the problems of developing parallel programs

  • 1. Introduction into the problems of developing parallel programs Author: Andrey Karpov Date: 21.01.2008 Abstract As developing parallel software is rather a difficult task at present, the questions of theoretical training of specialists and investigation of methodology of projecting such systems become very urgent. Within the framework of this article we provide historical and technical information preparing a programmer for gaining knowledge in the sphere of developing parallel computer systems. For reader This document is part of a series of articles devoted to issues of creating quality and effective program solutions for modern 64-bit multi-core systems. You can read other articles on the site http://www.viva64.com. Introduction It is very difficult for programmers who only begin to use multi-processor computers to master all the peculiarities of their usage while developing programs for applied tasks. As practice shows difficulties begin when effectiveness and mobility are required of parallel software being developed. It is explained by that universal means simplifying a programmer's labor and providing full access to debugging information are only being developed. The problem is that there are no standards in the sphere of creating and debugging programs for parallel systems because the field of computer science is very young. Correspondingly, there are no logically complete training courses on parallel programming for beginners at present. Development of multi-processor computers is inseparably linked with development of parallel programming technologies, both universal and for concrete computer architectures. By a programming technology, that is by organization of work with memory, we mean usage of means of controlling a concrete computer. It should be noted that while developing software (both controlling means and means for solving applied tasks) for super-computers special attention should be paid to programming technique, i.e. building of a logical program architecture. By this we mean development and addition of paralleling algorithms increasing effectiveness of their execution on multi-processor computers. 1. History of development of multi-processor complexes and parallel computations 50 years have passed since appearance of the first computing machines - computers. During this time the sphere of their usage has covered almost every field of human activity. Nowadays, it is impossible to
technical search much cheaper. It became possible to model in real time the processes of intensive physico-chemical and nuclear reactions, global atmospheric processes, and processes of economic and industrial development of regions. Obviously, the solution of such large tasks requires considerable computational resources.

Using computers for computational purposes has always remained the main driving force of progress in computer technologies. That is why it is no wonder that the main characteristic of a computer is its performance, i.e. the number of arithmetic operations it can perform per unit of time. It is this index that shows the scale of progress achieved in computer technologies. For example, the performance of EDSAC, one of the first computers, was only about 100 operations per second, while the peak performance of Earth Simulator, one of the most powerful supercomputers of today, is 40 trillion operations per second. Thus, performance has increased 400 billion times! There is no other sphere of human activity where progress is so evident and so great.

Of course, anyone would immediately ask: how did this become possible? Strangely enough, the answer is rather simple: thanks to a roughly 1000-fold increase in the performance of electronic circuits and the maximum possible parallelization of data processing.

The idea of parallel data processing as a powerful means of increasing computer performance was expressed by Charles Babbage about a hundred years before the first electronic computer appeared. But the level of technological development in the middle of the 19th century did not allow him to implement this idea. With the appearance of the first electronic computers these ideas repeatedly became the starting point for developing the most advanced and high-performance computer systems [3]. Without exaggeration we can say that the whole history of developing high-performance computer systems is the history of implementing the ideas of parallel processing at each particular stage of development of computer technologies, naturally, combined with increasing the speed and reliability of electronic circuits.
Fundamentally new solutions for increasing the performance of computer systems were the introduction of pipelined instruction execution, the inclusion of vector operations into the instruction set, allowing whole data arrays to be processed by a single instruction, and the distribution of calculations among many processors. The combination of these three mechanisms in the architecture of the Earth Simulator supercomputer, consisting of 5120 vector-pipeline processors, allowed it to achieve record performance, which exceeds the performance of modern personal computers by 20,000 times. It is obvious that such systems are extremely expensive and are produced in single copies [4].

And what about computers in commercial production today?
The wide variety of computers produced in the world today can be roughly divided into four classes: personal computers (PC), workstations (WS), supercomputers (SC) and cluster systems [5]. This division is very approximate because of the rapid progress of microelectronic technologies. At present the performance of computers of every class doubles nearly every 18 months (in accordance with the so-called Moore's Law). Because of this, supercomputers of the beginning of the 90s are often inferior in performance to modern workstations, and personal computers successfully rival workstations. Nevertheless, let's try to classify them somehow.

Personal computers. Strangely enough, in this case we mean single-processor systems on Intel or AMD platforms controlled by a single-user OS (Microsoft Windows and others). They are used mostly as a personal workplace.

Workstations. Most often these are computers with RISC processors running a multi-user OS of the UNIX family. They contain from one to four processors, support remote control [6] and can serve the needs of a small group of users.

Supercomputers. Their distinctive feature is that they are usually large and, consequently, very expensive multi-processor systems. In most cases supercomputers use the same commercial processors as workstations. That is why the difference between them is often quantitative rather than qualitative. For example, we can speak of a 4-processor workstation by SUN and a 64-processor supercomputer by the same company; most likely, both use the same microprocessors.

Cluster systems. In recent years they have been used throughout the world as a cheap alternative to supercomputers. A system of the required performance is assembled from ready-made commercial computers united in their turn by some commercial data-communication equipment (DCE).

Thus, multi-processor systems, which were earlier associated mostly with supercomputers, are nowadays popular in the whole range of produced computer systems, from personal computers to supercomputers based on vector-pipeline processors. On the one hand, this circumstance makes supercomputer technologies more available; on the other hand, it makes mastering them urgent, as special programming technologies have to be used for all types of multi-processor systems in order to allow a program to fully use the resources of a high-performance computer system [7, 8]. Usually this is implemented by dividing a program, with the help of some tool, into parallel branches each of which is executed on a separate processor.

2. Using multi-processor systems
Supercomputers are developed first of all to solve complex tasks demanding a large amount of calculation. This implies that a single program can be created which requires all the supercomputer's resources for its execution. But creating such a program can be impossible or unreasonable. In fact, when you develop a parallel program for a multi-processor system, it is not enough to divide it into parallel branches. For effective usage of the resources you need to provide a balanced load of all the processors, which in its turn means that all the program branches should perform approximately the same amount of computational work. But sometimes this is impossible. For example, when solving some parametric task for different parameter values, the time of searching for a solution can vary greatly. It seems more reasonable in such cases to perform the calculations for each parameter with the help of a simple single-processor program [9].
But even in this simple case we may need the resources of a supercomputer, because performing the full amount of computational work on a single-processor system may take too much time. Parallel execution of many such programs for different parameter values allows us to speed up solving the task significantly.
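To make this idea concrete, here is a minimal C++ sketch (not taken from the cited works) of launching the same single-processor computation for several parameter values at once; solve_for() is a hypothetical stand-in for the applied program discussed above.

#include <cmath>
#include <cstdio>
#include <future>
#include <vector>

double solve_for(double parameter)            // hypothetical single-processor computation
{
    double result = 0.0;
    for (int i = 1; i < 1000000; ++i)
        result += std::sin(parameter * i) / i;
    return result;
}

int main()
{
    const std::vector<double> parameters = { 0.1, 0.5, 1.0, 2.0, 5.0 };
    std::vector<std::future<double>> tasks;

    // Launch one fully independent task per parameter value.
    for (double p : parameters)
        tasks.push_back(std::async(std::launch::async, solve_for, p));

    // Collect the results; no data exchange took place between the tasks.
    for (std::size_t i = 0; i < tasks.size(); ++i)
        std::printf("parameter %.1f -> %f\n", parameters[i], tasks[i].get());
}

Each task here is completely independent, which is exactly what makes this scheme attractive when the solution time varies strongly from one parameter value to another.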
And finally, we should mention that using supercomputers to serve the needs of a large group of users is always more effective than using the corresponding number of single-processor workstations, as it is easier in this case to provide a balanced and more effective load of computational resources with the help of the task management system.

Unlike common multi-user systems, the operating systems of supercomputers, as a rule, do not permit sharing the resources of one processor between different simultaneously executed programs, in order to get the maximum rate of program execution. That is why an n-processor system can be used in the following two opposite modes:
• all the resources are allocated for executing one program, and in this case we expect an n-fold speed-up of program execution in comparison with a single-processor system;
• n common single-processor programs are executed simultaneously, and each user expects that other programs will not influence the speed of execution of his program.

3. Parallelism in computational modeling tasks
3.1. Static and dynamic balancing
When solving various tasks of mathematical physics on multi-processor systems with the help of mesh methods [10], two approaches to building parallel programs are widely used. The first approach is called the geometrical parallelism method, and the second one the group decision method [11]. The ideas on which these methods are based are simple and elegant. It would not be an exaggeration to say that most tasks of gas dynamics, microelectronics, ecology and many other fields, which are now solved using the finite difference method or the finite element method, are solved effectively by the geometrical parallelism method. The group decision method is reasonable to use when building parallel algorithms for solving tasks by Monte Carlo methods, when a series of calculations of the same type is performed, and in some other cases.

We should note that the geometrical parallelism method is a method of static load balancing: the section of the mesh processed by each processor is defined beforehand. Static balancing is effective when there is enough a priori information to distribute the common computational load equally among the processor nodes in advance. The group decision method is a method of dynamic load balancing. When using this method it is not known beforehand which particular mesh nodes will be processed by this or that processor. The processors receive tasks dynamically as they finish the ones already received, which provides a balanced load of the processor nodes when there are many independent tasks.
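The difference between the two kinds of balancing can be illustrated by a small OpenMP sketch; it is only an analogy under the assumption of a shared-memory system, with work() standing in for processing one mesh node whose cost varies from node to node.

#include <omp.h>
#include <cmath>
#include <cstdio>

// work() imitates a mesh node whose processing time depends on its index,
// i.e. a heterogeneous computational load.
double work(int node)
{
    double s = 0.0;
    for (int i = 0; i < (node % 64 + 1) * 10000; ++i)
        s += std::cos(i * 0.001);
    return s;
}

int main()
{
    const int nodes = 100000;
    double total_static = 0.0, total_dynamic = 0.0;

    // Static balancing: the mesh section handled by each thread is fixed
    // in advance (an analogue of the geometrical parallelism method).
    #pragma omp parallel for schedule(static) reduction(+:total_static)
    for (int i = 0; i < nodes; ++i)
        total_static += work(i);

    // Dynamic balancing: a thread takes the next chunk of nodes as soon as
    // it finishes the previous one (an analogue of the group decision method).
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total_dynamic)
    for (int i = 0; i < nodes; ++i)
        total_dynamic += work(i);

    std::printf("%f %f\n", total_static, total_dynamic);
}

With a strongly uneven load the dynamic variant usually keeps all the threads busy longer, which is exactly the effect the group decision method relies on.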
3.2. Parallelism of the "group decision" type
Parallelism of the "group decision" type is convenient for calculations that split into many tasks of the same type, each of which is solved independently of the others. No data transfer occurs between such tasks and, consequently, there is no need for their mutual synchronization.

Let's consider as an example a computational mesh viewed as a set of independent nodes, in each of which we should determine some parameters on each temporal layer by solving a system of ODEs with the corresponding initial data [12]. The solution of the system in each node depends only on the local values of the variables in this node, while the computational load differs greatly from node to node. When building a parallel program with the classical "group decision" method, the following strategy of computational load distribution is used.

One control processor is defined, while all the other processors are used as processing, i.e. computing, nodes. Each computing processor performs primary tasks - solving the ODE system for the next mesh point with the corresponding local parameters. The control processor distributes the primary tasks among the computing processors and collects the results. At the beginning of the next step each processor waits for a new data chunk, processes it, returns the result and starts waiting for the next task, until instead of the next task it gets a message that all the mesh points have been processed. As there is no need to synchronize the primary tasks, different processors can receive different numbers of computational nodes as they finish processing the data. Thereby the problem of balancing the processors' load is solved even if the time for solving the equation system in different mesh points, or the processors' performance, varies greatly.

In the case of a heterogeneous computational load when computing different points of the spatial mesh, usage of the "group decision" method potentially allows you to significantly reduce downtime and increase the effectiveness of paralleling in comparison with the geometrical parallelism method considered further. The advantages of this method can be fully realized if the data for processing are from the very beginning concentrated on one of the processors, which in this case can serve as the control processor. When the source data are initially distributed among the processors at random, preliminary collection of the data corresponding to all the computational points on one of the processors is required to use this method. The necessity of copying the data from all the processors to one beforehand, and then returning the results from this processor to the processors "holding" the points, significantly reduces the effectiveness of this method and makes it of little use for solving most tasks of computational modeling.
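A minimal MPI sketch of this control-processor scheme might look as follows. It is an illustration, not code from the cited works: solve_node() is a hypothetical stand-in for solving the ODE system in one mesh point, and the number of mesh points is assumed to be much larger than the number of processors.

#include <mpi.h>
#include <cmath>

double solve_node(int node) { return std::sin(node * 0.001); }   // placeholder computation

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nodes = 10000;
    const int TAG_WORK = 1, TAG_STOP = 2;

    if (rank == 0) {                               // control processor
        int next = 0, active = size - 1;
        for (int w = 1; w < size; ++w) {           // give every worker its first task
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            ++next;
        }
        while (active > 0) {
            double result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < nodes) {                    // hand out the next mesh point
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {                               // no work left: stop this worker
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                --active;
            }
        }
    } else {                                       // computing processor
        for (;;) {
            int node;
            MPI_Status st;
            MPI_Recv(&node, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double result = solve_node(node);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
}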
3.3. Geometrical parallelism
The source task can be split into a group of fields that are independent of each other at each computational step and touch only at the division boundaries. That is, we compute the (n+1)-th temporal layer in each field, after that coordinate the boundaries, and pass on to computing the next layer. When the computational field is divided into non-overlapping subfields, this approach causes problems with recalculating the values at the boundaries between the fields, so the next logical step is to divide the source field into mutually overlapping subfields. Two "dummy" points appear to the left of the first field and to the right of the last one. Thus we get, for example, four processes independent of each other at each step. To pass on to the next iteration we need to coordinate the boundaries: the first field should give the second one its right boundary for the next step, the second field in its turn should give the first one its left boundary, and so on. This method can be generalized to most computational methods based on equations modeling physical processes.
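For a one-dimensional field this scheme of overlapping subfields and "dummy" boundary points can be sketched in MPI as shown below; the three-point update is only a placeholder for a real difference scheme, and the sketch is an illustration rather than code from the cited works.

#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int local = 1000;                       // points owned by one process
    std::vector<double> u(local + 2, 0.0);        // +2 "dummy" boundary points
    if (rank == 0) u[1] = 1.0;                    // some initial condition

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < 100; ++step) {
        // Coordinate the boundaries: send our first point to the left
        // neighbour and receive the right neighbour's first point, and vice versa.
        MPI_Sendrecv(&u[1],         1, MPI_DOUBLE, left,  0,
                     &u[local + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[local],     1, MPI_DOUBLE, right, 1,
                     &u[0],         1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Each subfield is now advanced by one temporal layer independently.
        std::vector<double> next(u);
        for (int i = 1; i <= local; ++i)
            next[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];
        u.swap(next);
    }
    MPI_Finalize();
}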
4. Effectiveness of a parallel program
4.1. Notion of an effective parallel program
Using supercomputers imposes certain requirements on newly developed software, which must provide a safe and economical implementation of the algorithm when solving applied tasks. The effectiveness of using supercomputers becomes especially apparent when creating complex research complexes and expert systems.

It is much more difficult to write a parallel program than a sequential one. Creation of software for parallel computers is the central problem of supercomputer calculations [13]. The problem of choosing the optimal number of parallel branches according to the criterion of minimum total time costs can be partially solved with the help of automated generation of parallel programs. A particular case of solving this problem for computer systems with MIMD architecture is considered in the article by V.A. Kostenko "To the question of evaluating the optimal parallelism level" [14].

The effectiveness of using multi-processor computer systems is to a large degree determined by the quality of the applied parallel programs. A program is considered effective when all the processors assigned to its processes are loaded during its execution; in practice, however, this cannot be achieved completely.

4.2. Properties of an ideal parallel program
Let's note that an ideal parallel program possesses the following properties:
1. The lengths of the simultaneously executed branches are equal.
2. Downtimes related to waiting for data, control transfer and conflicts over shared resources are fully excluded.
3. Data transfer is fully overlapped with calculations.

An increase of parallel effectiveness (a decrease of overhead costs) is achieved by the following means:
• enlargement of the units of paralleling;
• reducing the complexity of the algorithms generating parallel procedures (subprograms);
• preliminary preparation of a package of different source data variants;
• paralleling of the algorithms generating parallel procedures (subprograms).

4.3. Adaptation of programs to the architecture of parallel computers
The main stages of the process of adapting programs to the architecture of parallel computers, and a description of the tasks occurring at each of these stages, are given in the article by A.S. Antonov "Effective adaptation of sequential programs to the modern vector-pipeline and array-parallel supercomputers" [15]. We would like to pay special attention to some of the tasks which the authors of this analysis faced. Among these tasks are:
• investigation of the overall program structure;
• identification of the main computational core and localization of input-output;
• determining the potential parallelism of a fragment;
• determining the sequential fragments of the calculations and attempting to use alternative algorithms for such fragments;
• determining and minimizing the data redistribution points;
• conversion of traditional loops for parallel algorithms;
• minimizing the number and size of temporary arrays to optimize cache-memory handling;
• passing on from the source program working with full arrays to a program processing only the local chunk distributed to a processor: changing the arrays' sizes and correspondingly transforming the program text.

We should note that solving these tasks allows us to port a sequential program effectively to a parallel architecture. The process of developing a parallel program is very long and laborious, even though, as a rule, an implementation of its "sequential" counterpart already exists. A program is usually developed on a computer with a certain architecture, while its practical application takes place on another, more powerful computer whose topology differs from that of the former machine. This approach allows you to economize computer time on the more powerful supercomputers, which are much fewer in number than cheaper ones. When porting a parallel program to computers with a different architecture, a programmer faces the problem that the once-developed parallel procedures become invalid. At present there are no universal means of adapting programs to a concrete supercomputer architecture, so this problem mostly has to be solved manually, which makes the process very labor-intensive [15]. To save the programmer's labor, RAS mathematical institutions (the RAS Ural Department, the research computer center of MSU named after M.V. Lomonosov) are developing libraries of effective procedures and algorithms for concrete supercomputer architectures. Using these libraries can partially save the labor of an applied programmer not only at the stage of modifying a program for more powerful supercomputers, but also at the stage of the initial development of a parallel program.

5. Debugging and monitoring issues
The problem of debugging and monitoring is very urgent, as there are no managers that could provide an applied software developer with intermediate information, which is especially needed at the initial stage of designing [16]. In the general case the task of debugging and monitoring such systems is stated in the following way [17, 18]. There is a mesh of nodes heterogeneous in their hardware and/or software platforms, on each of which many processes (threads) are executed simultaneously [19]. There is also a number of users, each of whom would like to monitor and/or operate his own subset of program and/or hardware components. Understanding debugging/monitoring as controlled execution changes the position of debugging in the system's life cycle and allows you to use architectural and protocol solutions characteristic of controlling means. This makes the controlling means scalable and capable of handling distributed heterogeneous systems. For further development of debugging/monitoring means it is important to create a set of specifications defining the functionality of the manager programs being developed [20].

Programs are complex dynamic systems, especially parallel and interactive (operating in dialog mode) ones, which include complex interactions both between the program processes themselves and between the processes and the outer world. Analysis of such programs cannot be performed in terms of relations between the input and output values of the program, as is usually done for sequential programs. This shows that checking and proving the correct work of such programs demands developing adequate means of formal specification. In particular, it is necessary to be able to express relations between the system's states at those instants of time when events accompanying the program system's operation occur. The article "Applying temporal logic to program specification" by M.K. Valiev [21] discusses an approach to the analysis of a parallel program based on applying mathematical logic.

Process control is one of the most important tasks of an OS. To perform this function on supercomputers, semaphore technology [22] can be used, which consists in locking and unlocking processes. Semaphores have traditionally been used for synchronizing processes that address shared data: each process should exclude, for all the other processes, the possibility of simultaneously accessing its data.
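As a small illustration of this technique, the following C++ sketch (requiring C++20 for std::binary_semaphore; it uses threads rather than separate processes, as an analogy) locks the shared data before every update and unlocks it afterwards, so simultaneous access by other threads is excluded.

#include <semaphore>
#include <thread>
#include <vector>
#include <cstdio>

std::binary_semaphore data_lock(1);   // 1 = the shared data is free
long long shared_sum = 0;             // data shared by all threads

void worker(int id)
{
    for (int i = 0; i < 100000; ++i) {
        data_lock.acquire();          // lock: other threads must wait
        shared_sum += id;             // exclusive access to the shared data
        data_lock.release();          // unlock: let the next thread in
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int id = 1; id <= 4; ++id)
        threads.emplace_back(worker, id);
    for (auto &t : threads)
        t.join();
    std::printf("%lld\n", shared_sum);
}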
When solving applied tasks, the amount of the received information is in most cases so large that verification - a detailed analysis of the data produced directly by the computational program - is impossible. As there are no universal graphical packages providing visualization of different isometric projections and color palettes for such situations, applied software developers are advised to start developing such packages themselves.

6. Modeling of paralleling objects
The developed approaches to using computers are based on the thesis that a computer is a cognition tool with the help of which people get new information about the object or phenomenon being investigated [23]. Consequently, a qualified user should know the modern methodology of cognition, i.e. modeling. Modeling is not only the designing of a cognition object, but a cognition method as well. Modeling is a work methodology whose effectiveness becomes apparent only when specialists are highly qualified and know the modern formalization means - logic and mathematics - well. Having defined the problem and stated the goal, a researcher starts searching for a solution; the way he passes becomes a method. The process of modeling presupposes both the way "from the object to the model" (reflecting reality in a paradigm) and the way "from the model to the object" (testing the model's truth by its predictions). The computer is the natural means of performing such "research" cycles.

Specialists in software-development theory rarely pay attention to modeling when describing the process of software creation. On the other hand, modeling specialists prove the urgent necessity of wide usage of their methods when designing any complex system [24]. As a software complex is a complex system with many levels and components and a complex structure of relations between them, it is necessary to use modeling when developing such systems. Taking into consideration that parallel software development (development of a paralleling object) is very difficult nowadays, the problem of creating a theoretical basis for its design is even more urgent. Besides analysis of the structure and properties of the developed programs at all design stages, modeling can help describe all the peculiarities of interaction between parallel processes at the level of a simulation model. In his work "Modeling of parallel software using PS-networks" [25], N.G. Markov suggests using a graph-analytical approach to simulation modeling of a program project on the basis of the demands put before parallel software. The aim of that work was to work out the demands on the parallel software simulation modeling mechanism and also to create a mechanism keeping a balance between mathematical simplicity and rigor on the one hand and practical applicability on the other.

Thus, we can state that the most convenient means of analyzing the computational algorithms of parallel computations is graphs [26].
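As a simple illustration of this statement (not taken from [26]), the following sketch writes a tiny computational algorithm down as a dependency graph and splits its operations into levels; all operations of one level are independent and may be executed in parallel.

#include <cstdio>
#include <vector>

int main()
{
    // depends[i] lists the operations that must finish before operation i.
    std::vector<std::vector<int>> depends = {
        {},      // 0: load a
        {},      // 1: load b
        {0, 1},  // 2: a + b
        {0},     // 3: a * 2
        {2, 3}   // 4: (a + b) - (a * 2)
    };

    const int n = static_cast<int>(depends.size());
    std::vector<int> level(n, 0);

    // The level of a node is 1 + the maximum level of its prerequisites;
    // computing it in index order works here because every dependency
    // points only to an earlier operation.
    for (int i = 0; i < n; ++i)
        for (int d : depends[i])
            if (level[d] + 1 > level[i])
                level[i] = level[d] + 1;

    for (int i = 0; i < n; ++i)
        std::printf("operation %d can run at parallel step %d\n", i, level[i]);
}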
The problem of creating modern packages of applied programs intended for a wide range of mechanics tasks goes beyond synthesizing these tasks from separate program modules; it is related to the global optimization of the whole computational sequence of tasks [27]. That is why a package of programs, as a product used for scientific and applied purposes not only by its creators but by end users as well, should be developed at an entirely new programming level. When developing modern software it is necessary not only to take into consideration the non-linear (feedback) relations between all the links of a calculation chain but also to implement the possibility of segmenting a program at the high, middle and low paralleling levels of the computational process. Segmentation is necessary for more effective usage of multi-processor systems. Besides, when developing a numerical algorithm we should reconcile the issues of accuracy and reliability of the end software with the issues of its effectiveness and portability to a concrete supercomputer architecture. Such parallel programming differs greatly from the traditional, i.e. sequential, programming.

6.1. Levels of decomposition of paralleling objects
To provide supercomputer users with the possibility of performing many scientific calculations simultaneously, or of multi-thread processing of database requests on multi-processor computers, the corresponding software should be installed. In this case the paralleling functions are performed not only by the applied software but by the OS as well. In the general case, two main interrelated problems occur when creating an OS for parallel data processing: the first is minimizing the time of performing a given amount of calculations, and the second is synchronizing many simultaneously interacting parallel processes [28]. Different approaches are being developed to solve each of them. The mentioned work suggests taking into consideration that implementing complex synchronization mechanisms increases overhead costs, and this badly influences the efficiency of solving the tasks. In systems with parallelism limited by the number of processors, the stated problem is solved by minimizing the total time of performing the given amount of calculations. The results of implementing this approach relate, first of all, to "operational parallelism". The method based on building a schedule for launching and finishing each of the competing processes can be useful in such systems. It gives you an opportunity not only to solve the process synchronization problem more effectively, but also to significantly reduce system costs and wasteful downtime of the processors. The method of managing interaction between parallel processes is implemented with the help of "semaphore" technology [29].

When researchers create applied software, the practical value of the numerical methods they develop is determined not only by the results received with their help when investigating complex phenomena, but by their applicability on concrete supercomputers as well. It was found that, as the performance of personal computers grows, stimulating the development of computational methods, qualitative changes also occur in supercomputer architecture, focused on the development of parallelism and the specialization of processors. This, in its turn, stimulates the search for new representations of physical phenomena that would permit a more direct mapping onto the computers' architecture. Thus, for example, the cellular-automaton approach appeared in gas dynamics and hydrodynamics [30]. That article presents a new model of parallel calculations - the cellular neural network (CNN).
The article describes the essence of the cellular-automaton model and also shows the rich opportunities CNN provides for representing the spatio-temporal dynamics of active media. This model can serve as the basis for creating parallel programs intended for solving partial differential equations and also for simulating nonlinear dynamics phenomena. It is noted that using CNN calculation methods together with parallel processors will allow us to greatly increase the quality of the solution of such tasks.
The aim of any work connected with parallel programming is a review of the interrelations between the structure of the mathematical algorithm and the architecture of the multi-processor computational system. Depending on the complexity of the stated task, different types of interrelations can be implemented. These interrelations are called the levels of decomposition of the source task. They can be defined as follows [31]:

The first level - division of a task into subtasks.

The second level - division of each separate subtask into a subset of quasi-uniform procedures executed simultaneously on different source data. In mathematical physics this type of parallelism is called geometrical parallelism or data parallelism, as paralleling is performed in this case by distributing the calculations for different points of the computational field among different processors.

The third level - paralleling of separate procedures.

The fourth and deepest paralleling level - division of arithmetical processes according to the number of processors. It is recommended not to use this level on supercomputers with distributed memory, in which local memory is allocated to each processor.

Researchers working on most applied tasks are advised to stop the process of decomposition at the second level.

6.2. Possibilities of paralleling objects in computational modeling algorithms
Now let's consider which objects in the algorithms of task solution can be paralleled. The main numerical methods (the finite element method, the finite difference method and others) reduce the source task to forming a system of linear algebraic equations (SLAE) and then solving it [32, 33]. For example, in a sequential program implementing the finite element method, most of the time is spent on forming the SLAE itself (calculating the coefficients) rather than on solving it. It is also important to mention that the elements of the SLAE matrix depend only on their locations in it and do not depend on each other. In this case parallel algorithms of SLAE formation can be used effectively, and the following operations should be performed:
1. split the computational task into parallel branches;
2. perform the calculations in these branches;
3. form and solve the SLAE (by any method).

The article [34] gives an example of a description of a parallel algorithm of SLAE formation and also the peculiarities of using MPI technology. The article [35] considers an implementation of the Gauss method for solving sparse systems of linear algebraic equations on computers with parallel processes and shared memory. It is pointed out that the division into several threads of commands can be performed either by function or directly by data; when the task is stated like this, only data-related division can be implemented. Meanwhile, you should pay attention to whether it is possible to single out unlinked fields in the task. The same article points out that the proposed parallel algorithm is bound to a concrete computer architecture, but it also states that the effectiveness of the paralleling algorithm depends only on the ratio between the number of processes and processors and on the amount of information processed in one loop.

We can propose the thesis that loops are one of the most important program constructions with accessible parallelism. The problem of extracting fine-grained parallelism (parallelism inside loops) from these constructions is of great importance in view of the increasing popularity of superscalar computers [36].
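The following OpenMP sketch (an illustration, not code from [34] or [35]) shows how such a loop can be parallelized for SLAE formation under the assumption stated above that every coefficient depends only on its position; coefficient() and rhs() are hypothetical stand-ins for the real finite-element integration.

#include <omp.h>
#include <cmath>
#include <cstdio>
#include <vector>

double coefficient(int i, int j) { return std::exp(-std::fabs(double(i - j))); }
double rhs(int i)                { return std::sin(0.01 * i); }

int main()
{
    const int n = 2000;
    std::vector<double> A(static_cast<std::size_t>(n) * n), b(n);

    // Forming the SLAE is the most expensive stage, and every (i, j) entry
    // can be computed independently, so the loop nest is parallelized by rows.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j)
            A[static_cast<std::size_t>(i) * n + j] = coefficient(i, j);
        b[i] = rhs(i);
    }

    // The assembled system A*x = b is then passed to any (possibly also
    // parallel) solver; here we only touch the data to show it exists.
    std::printf("%f %f\n", A[0], b[0]);
}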
The article [37] presents algorithms for computational procedures, and results received with their help, based on the methodology of high-accuracy parallel arithmetic. It is suggested that this methodology be used for solving applied tasks of linear algebra and mathematical physics. The mentioned work is devoted to creating algorithmic and program means of supporting accurate array computations based on the combined usage of the parallelism of MIMD systems [38] and multibit arithmetic with dynamic operand length. Special attention is paid to the influence of rounding in basic array operations on the accuracy of calculating matrix tasks. The work includes a library of programs and test examples demonstrating the effectiveness of the developed approach. The given results show the possibility of performing accurate array computations with simultaneous message transfer on parallel computers. The developed package of applied programs can be adapted for execution on parallel computers of different types.

It is obvious that using multibit arithmetic is not typical of supercomputers, and its usage will inevitably slow down application execution. But the time loss in this case is compensated not only by performing calculations with maximum usage of standard data types, but also by adapting highly effective parallel algorithms, initially suited for execution on single-processor computers, to the means of high-accuracy processing. Multibit arithmetic should be used only in the "heaviest" sections of an algorithm, and even in this case the dynamic operand length helps process only a limited number of bits. This is the supposed way of reaching a balance between the speed and accuracy of computations.

The article [39] analyzes in detail the vector-pipeline architecture of the supercomputers of the CRAY family. As a result of the performed research, programming factors reducing supercomputer performance were discovered. Among them are:
• sectioning of long vector operations (increases overhead costs);
• overload of the instruction buffers (increases overhead costs);
• memory access conflicts (in case of using shared resources);
• limited capacity of the data transfer channels (depends on the supercomputer's architecture);
• other factors.
The article also gives examples (in program code) showing the way out.

Among the works mentioned above, one development stands out in which the algorithm structure does not adapt to the computer's structure but defines that structure itself [40]. The work is intended to create new computer technologies and methods of parallel programming meant to increase the effectiveness of solving fundamental scientific and applied problems in the computational modeling of aerodynamics and gas dynamics tasks. Special attention is paid to the theoretical issues of paralleling. The work considers different methods of decomposing a full task into simultaneously executed subtasks. Of high strategic importance is the theoretical stage of investigating the problem of paralleling a program complex, that is, developing principles (and concrete methods based on them) for optimal decomposition of the whole totality of algorithms composing the processor system and its operational environment.
The article describes three main decomposition (segmentation) types for the program complex "Thread-3", planned in the process of preparing its constituent algorithms for paralleling:
• physico-mathematical;
• geometrical;
• technological.

One of the global types of high-level structuring of the task being solved is decomposition of the investigated physical process into its constituent subprocesses and, consequently, segmentation of the common algorithm of solving the full task into several algorithms solving the subtasks composing it. A segmented algorithm of parallel calculation of physical processes is suggested, in which all the module-segments of the computational core of the program are launched simultaneously; moreover, inside each segment subsegments can start working simultaneously. When paralleling computational procedures, synchronization and routing of data are of extreme importance: their improper organization leads either to incorrect calculation or to large overhead costs of computer and wall-clock time due to delays caused by waiting for data and, consequently, downtime of the processors in some segments. It is supposed that the latter leads to non-optimal usage, or even the impossibility of using, the configurable processor space.

Geometrical decomposition (segmentation) of the full task and the subsequent parallelization of the calculations allows you to significantly reduce the wall-clock time needed for the calculations. Geometrical decomposition consists in dividing the whole integration domain into a map of subdomains (subsegments) and in a single-step calculation of the physical process' state in each subdomain, followed by joining the solutions. The article lists the requirements to the mathematical definition of the task that permit geometrical decomposition.

Technological decomposition implies segmentation into mathematically independent tasks. There can be several levels of technological decomposition. The most typical example is dividing a program into certain physico-mathematical tasks, each of which can similarly consist of algorithmically independent tasks. The process of technological decomposition depends greatly on the program's structure and the numerical methods used in it. Using the latter decomposition type requires special attention to the parallel program's effectiveness.

Conclusion
Despite the obvious success in using multi-processor systems, debates arise about their low effectiveness. The increase of a multi-processor system's performance is generally determined by the balance between the computational operations and the data exchanges performed in their background. Non-fulfillment of this condition is one of the causes of performance loss during paralleling as the number of computational program modules grows. Evaluation of programs' effectiveness has been carried out since the first multi-processor systems - transputers. Even then the first attempts were made to successfully solve the problem of maximum usage of calculation time.
When solving a concrete task, it is first of all necessary to search for parallelism variants by dividing the task into several subtasks. After that, data parallelism (or geometrical parallelism) can be applied, that is, division of the computational field. This type of parallelism means that the computational field is divided into subfields, each of which is assigned to a separate processor of the system. When developing real parallel programs, as a rule, high effectiveness demands many changes of the program in order to find the best scheme of its paralleling. The success of this search is determined by how easily the program can be modified.

References
1. V.N. Datsuk, A.A. Bukatov, A.I. Zhegulo. Electronic user's guide on the course "Multi-processor systems and parallel programming". Part I. Introduction into programming organization and methods of multi-processor computer systems. Rostov-on-Don, 2000.
2. E.V. Neupokoev, G.A. Tarnavskiy, V.A. Vshivkov. Paralleling marching algorithms: target computational experiments. // Autometria, N 4, volume 38, 2002, pp. 74-87.
3. V.V. Korneev. Parallel computer systems. Moscow: "Knowledge", 1999. - 320 pp.
4. G.I. Shpakovskiy. Parallel computers' architecture. - Minsk, 1989. - 136 pp.
5. A.O. Latsis. How to build and use a supercomputer. Moscow: Bestseller Publishing house, 2003. - 274 pp.
6. D.U. Labutin. System of remote access to a computational cluster (access manager): high-performance parallel computations on cluster systems. Materials of the second international scientific-practical seminar, Nizhny Novgorod: Nizhny Novgorod University Publishing house, 2002. pp. 184-187.
7. K.E. Afanasyev. Multi-processor computer systems and parallel programming: tutorial / K.E. Afanasyev, S.V. Stukolov, A.V. Demidov, V.V. Malishenko; Kemerovo State University. - Kemerovo: Kuzbassvuzizdat, 2003. - 182 pp.
8. S.A. Nemnyugin, O.L. Stesik. Parallel programming for multi-processor computer systems. - St. Petersburg: BHV-Petersburg, 2002. - 400 pp.
9. A.V. Demidov, K.V. Sidelnikov. Emulation of parallel data processing on a personal computer // XLI International scientific student conference "Students and Scientific-and-Technological Advance". Collection of works. Novosibirsk, 2003. pp. 110-111.
10. A.A. Samarskiy, E.S. Nikolaev. Methods of solving mesh equations. Moscow: Science, 1978. - 561 pp.
11. V.V. Samofalov, A.V. Konovalov, S.V. Sharf. Dynamics and statics: searching for a compromise // Works of the All-Russian scientific conference "High-performance computations and their applications". Moscow, 2000. pp. 165-167.
12. S.K. Godunov, V.S. Ryabenkiy. Difference schemes. - Moscow: Science, 1973. - 400 pp.
13. V.V. Voevodin. Supercomputers: yesterday, today, tomorrow. // Collection of popular science articles "Russian science at the dawn of the new century". Under the editorship of academician V.P. Skulachov. Moscow: Scientific world, 2001. pp. 475-483.
14. V.A. Kostenko. To the question of evaluating the optimal parallelism level. // Programming. 1995, N 4, pp. 24-28.
15. A.S. Antonov, V.V. Voevodin. Effective adaptation of sequential programs to the modern vector-pipeline and array-parallel supercomputers. // Programming. 1996, N 4, pp. 37-51.
16. E. Sullivan. Time is money. Creating a team of software developers / Translated from English. - Moscow: Publishing house "Russkaya Redaktsiya", 2002. - 368 pp.: illustrations.
17. V.A. Galatenko, K.A. Kostuhin. Debugging and monitoring of distributed heterogeneous systems. // Programming, 2002, N 1, pp. 27-37.
18. V.A. Krukov, R.V. Udovichenko. Debugging of DVM programs. // Programming. 2001, N 3, pp. 19-29.
19. A.P. Sapozhnikov, T.F. Sapozhnikova. Reengineering technology of distributed computations in the local network. Works of the international conference "Distributed computations and Grid-technologies in science and education" (Dubna, June 29 - July 2, 2004). 11-2004-205, Dubna, JINR, 2004. pp. 183-190.
20. V.V. Samofalov, A.V. Konovalov. Technology of debugging programs for computers with mass parallelism // "Issues of atomic science and technique". Series: Mathematical modeling of physical processes. 1996. Issue 4. pp. 52-56.
21. M.K. Valiev. Applying temporal logic to program specification. // Programming. 1998, N 2, pp. 3-9.
22. V.A. Krukov. OS of distributed computer systems (tutorial).
23. N.L. Zaharyeva, V.B. Hoziev, P.D. Shirkov. Modeling and education. // Mathematical modeling, 1999, volume 11, N 5, pp. 101-116.
24. A.A. Shalito. Automatic program projecting. Algorithmization and programming of logical control tasks. // "Izvestiya akademii nauk. Teoriya i sistemi upravleniya" magazine, issue 6, November-December 2000. pp. 63-81.
25. N.G. Markov, E.A. Miroshnichenko, A.V. Saraykin. Modeling of parallel software using PS-networks. // Programming. 1995, N 5, pp. 24-32.
26. N.M. Ershov. Building graphs of computational algorithms by the autotracing method. // Programming. 2000, N 6, pp. 58-64.
27. V.P. Gergel, R.G. Strongin. Fundamentals of parallel computations for multi-processor computer systems. Tutorial. Nizhniy Novgorod: Publishing house of Nizhniy Novgorod State University named after N.I. Lobachevskiy, 2000. - 176 pp.
28. V.P. Ivannikov, N.R. Kovalevskiy, V.M. Metelskiy. About the minimal time of implementing distributed competing processes in synchronous modes. // Programming. 2000, N 5, pp. 44-52.
29. M.K. Valiev. Applying temporal logic to program specification. // Programming. 1998, N 2, pp. 3-9.
30. O.L. Bandman. Cellular-neural models of spatio-temporal dynamics. // Programming. 1999, N 1, pp. 4-17.
31. A.E. Duysekulov, T.G. Elizarova. Using multi-processor computer systems for implementing kinetically coordinated difference schemes of gas dynamics. // Mathematical modeling, 1990, volume 2, N 7, pp. 139-147.
32. O.A. Dmitrieva. Parallel algorithms of numerical solution of ordinary differential equations. // Mathematical modeling, 2000, N 5, pp. 81-86.
33. N.R. Bahvalov. Numerical methods. // Main editorship of physico-mathematical literature of "Science" publishing house, Moscow, 1975. - 632 pp.
34. D.B. Moskvin, V.A. Pavlov. Experience of using MPI technology for solving a system of second-order Fredholm integral equations. // Mathematical modeling, 2000, N 8, pp. 3-8.
35. A.N. Zavorin. Parallel solution of linear systems when modeling electric circuits. // Mathematical modeling, 1991, volume 3, N 3, pp. 91-96.
36. Ki-Chang Kim. Fine-grained paralleling of incomplete loop nests. // Programming. 1997, N 2, pp. 52-66.
37. V.A. Morozov, A.P. Vazhenin. Matrix arithmetic of multiple accuracy for parallel systems with message transfer. // Programming. 1999, N 1, pp. 66-77.
38. V.V. Voevodin. Mathematical models and methods in parallel processes. - Moscow: Science, 1986. - 296 pp.
39. V.V. Voevodin. Is it easy to get the promised gigaflop? // Programming. 1995, N 4, pp. 13-23.
40. G.A. Tarnavskiy, R.I. Shpak. Decomposition of methods and paralleling of algorithms of solving aerodynamics and physical gas-dynamics tasks: computer system "Thread-3". // Programming. 2000, N 6, pp. 45-57.

Additional references

1. A.R. Antonov. Parallel programming using MPI technology: tutorial. - Moscow: MSU publishing house, 2004. - 71 pp.
2. R.B. Berozin, V.M. Paskonov. Component system of visualizing the results of computations on multi-processor computer systems // Materials of the All-Russian scientific conference "High-performance computations and their applications", 2000, pp. 202-203.
3. N.V. Bocharov. Parallel programming technologies and technique. Review. // Programming, 2003, N 1, pp. 5-23. UDC 681.3.06.
4. V.V. Voevodin, Vl.V. Voevodin. Parallel computations. - St. Petersburg, 2002. - 600 pp.
5. V.P. Gergel, A.N. Svistunov. Development of an integrated environment of high-performance computations on cluster systems. Materials of the second international scientific-practical seminar. Nizhniy Novgorod: Nizhniy Novgorod publishing house, 2002. pp. 78-82.
6. V.P. Ilyin. About paralleling strategies in mathematical modeling. // Programming. 1999, N 1, pp. 41-46.
7. A.N. Karpov. Data visualization on parallel computer complexes // 15th International conference GRAPHICON-2005. Novosibirsk, Russia, June 20-24, 2005.
8. N.A. Konovalov, V.A. Krukov, A.A. Pogrebtsov, U.L. Sazanov. C-DVM - a language for developing mobile parallel programs. // Programming. 1999, N 1, pp. 20-28.
9. N.A. Konovalov, V.A. Krukov, R.N. Mikhailov, L.A. Pogrebtsov. Fortran-DVM - a language for developing mobile parallel programs. // Programming. 1995, N 1, pp. 49-54.
10. R.V. Popova, R.V. Sharf. Organization of saving temporary data on MBC // Theses of the report from the All-Russian conference "Urgent problems of applied mathematics and mechanics" (Yekaterinburg, February 3-7, 2003), p. 62.
11. I.V. Prangishvili, R.Ya. Vilenkin, I.L. Medvedev. Parallel computer systems with shared control. - Moscow: Energoatomizdat, 1983. - 312 pp.
12. L.B. Sokolinskiy. Parallel machines of databases. // Collection of popular science articles "Russian science at the dawn of the new century". Under the editorship of V.P. Skulachov. - Moscow: Scientific world, 2001. pp. 484-494.
13. E. Tanenbaum. Distributed systems. Principles and paradigms. - St. Petersburg: Piter, 2003. - 877 pp.
14. M.V. Yakobovskiy, R.A. Sukov. Dynamic load balancing // Materials of the "High-performance computations and their applications" conference, Chernogolovka, 2000, pp. 34-39.
15. A.V. Komolkin, R.A. Nemnugin. Electronic tutorial "Programming for high-performance computers".

About the author

Andrey Karpov, http://www.viva64.com
Develops program solutions for improving the quality and performance of resource-intensive applications. One of the developers of the Viva64 static analyzer for verifying 64-bit software. Participates in developing the VivaCore open library for working with C/C++ code.