Parallel programs for multi-processor computers!



For instance, when a man cannot manage everything, he creates a double for himself: brainless, irresponsible, able only to solder pins, or carry heavy loads, or write from dictation, but doing this very well.
Arkady and Boris Strugatsky
This article is an introduction to parallel programming for beginners. It was published in "PC World" magazine.

Author: Igor Oreschenkov
Date: 04.02.2009

Parallel program: a particular case of labor division

Hardly would we see the world around us as we see it if mankind had not, in its time, invented the principle of labor division. Thanks to it there are professionals who specialize in a particular sphere and are able to perform their work perfectly. Applying this approach in industry increased product output tenfold. So it is no wonder that when computers appeared, the question arose of applying the principle of labor division to the operation of computer programs.

Research in this area revealed a pleasing fact: complicated mathematical calculations related, for example, to matrix operations can be split into independent subtasks. Each subtask can be executed separately, and the solution of the source task can then be obtained by combining the results of the subtasks. Since the subtasks are independent, they can be solved simultaneously by several executors. If computers take the role of the executors, this approach lets us attack computationally demanding tasks: realistic modeling of real-world objects, cryptanalysis, real-time control of technological processes. It is the principle on which all supercomputers, the pride of scientific laboratories and computational centers, operate.

Having reached a limit of processor performance imposed by fundamental physical constraints, computer manufacturers began to increase the number of computational nodes in their products.
And now inside the chassis of a common PC you may find what was earlier called a supercomputer: a multi-core computing device. Operating systems have been ready for this technology for a long time. Unix and Linux, OS/2, Mac OS and modern versions of Microsoft Windows all show a performance increase when several programs are executed simultaneously. For example, while you watch a film, an antivirus scan of your hard disk can run without your noticing it. But there are situations when the solution of a single task demands a lot of computational resources. Leaving aside such problems as modeling nuclear fusion or cracking the password to an archive, think of common things like music and video encoding or computer games. Programs implementing these tasks use the resources of the CPU to the full, yet increasing the number of processors in the system will give no marked performance gain in these examples unless special methods were used while developing the programs. Indeed, how would an operating system know the peculiarities of a codec's implementation well enough to distribute its calculations between processors? Can it know how to parallelize the calculation of a game scene?

Processes and threads - twins

So, we have a multi-processor computer. We have analyzed the task and singled out the subtasks that can be solved simultaneously. How can we implement the solution of these subtasks in a program? Modern operating systems offer two forms of code execution: processes and threads.

A process is a program loaded into main memory and ready for execution. When we perform two actions at once, surfing the news on the Internet and burning files to a CD, two processes are executing simultaneously: an Internet browser and a CD-writing program. A process consists of program code, the data it operates on, and various resources belonging to the program, for example files or system queues. Each process is executed in its own address space, i.e. it has access only to its own area of main memory. On the one hand this is good, because one process cannot interfere with another (until both address a non-shareable resource, such as the disk drive, but that is a different story), and errors in one program will in no way influence the operation of another. On the other hand, if the processes are meant to solve one common task together, a question arises: how will they exchange information and interact at all?

Launch of processes - Unix and Windows

Interestingly, the developers of Microsoft Windows and of Unix (Linux) operating systems took different approaches to launching a child process. From the viewpoint of a programmer using the Win32 API everything is quite natural.
At the point where a supporting process must be launched, the CreateProcess() function is called; one of its arguments is the path to the EXE file with the program to be executed. In Unix-like operating systems this is done differently. At the point in the code where a new process is needed, the fork() system call is used. As a result the running program "forks": the state of the program at the moment of the fork() call (the values of the CPU registers, the stack, the data area and the list of open files) is copied, and a new process is launched from this copy, possessing the same information as the parent. Both the parent and the child continue executing the same program, starting with the command that follows the fork() call. Since it is senseless for two processes to perform one and the same work, a question arises: how do we make each of the resulting processes solve its own independent subtask? The point is that in the parent process fork() returns the identifier of the child process, while in the child process it simply returns zero. Each process can therefore identify itself by this code, and the rest is a technical matter: the child process prepares an environment for a new program, loads its code with the help of the exec() family of system calls and transfers control to it.

A thread is a means of forking a program inside a process. A process can contain several threads, each executed independently of the others with its own register values and its own stack. Threads have access to all the global resources of their process, for example open files or memory segments. Each process has at least one thread, the main thread. When the solution of the task reaches a section that allows parallelism, the program spawns the necessary number of additional threads, which execute simultaneously, solve their subtasks and return the results to the main thread.
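The fork()/wait pattern just described fits in a few lines of Python (a minimal, Unix-only sketch; the helper name run_in_child is invented for the example):

```python
import os

def run_in_child(task):
    """Run task() in a forked child and return its exit status (Unix-only)."""
    pid = os.fork()                 # one process becomes two identical ones
    if pid == 0:                    # fork() returned zero: this is the child
        os._exit(task() & 0xFF)     # solve the subtask and leave at once
    # in the parent, fork() returned the child's process identifier
    _, status = os.waitpid(pid, 0)  # wait for the child to terminate
    return os.WEXITSTATUS(status)

# Instead of calling task(), a real child would call os.execvp() here to
# load and run a brand-new program, as the exec() step in the text does.
print(run_in_child(lambda: 42))     # → 42
```

Note how the child identifies itself purely by fork()'s return value, exactly as the text explains.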
To solve complex tasks it is not enough to execute different code sections in parallel. You also need to arrange the exchange of their results, that is, to organize interaction between these sections. One would think that in the case of threads this could easily be done by giving them access to a common area of main memory. But this simplicity is deceptive, because writes to main memory by different threads can interleave so unpredictably that the integrity of the information kept in this memory area is broken.

Example of threads interaction

Imagine the billing system of a cellular operator running on a powerful multi-processor server. Naturally, charging a client's account for the services provided is performed by one program thread, while crediting the client's payments is performed by another. The first operation can be written as S := S - A, where S is the balance of the account at the moment of the operation and A is the cost of the phone calls made. The processor executes this operation in three steps: first the current value of the variable S is read, then A is subtracted from it, and finally the result is written back into the cell S. The second operation, S := S + B, where B is the sum paid by the client, is performed similarly. Suppose the client makes a call some time after paying through a bank office. Since, for objective reasons, payments are credited to the account with some delay, it is possible that both operations, the charge-off and the credit, are executed in the billing system at nearly the same time. Suppose the charge-off operation starts first and the credit operation follows it with a one-beat lag (figure 1). After the three beats of the charge-off operation the current account balance will equal S - A, but the fourth beat will overwrite this value with S + B.
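This unlucky interleaving can be forced deterministically in code. In the sketch below, a barrier and an event (invented for the demo) stand in for the timing of figure 1, so the lost update happens on every run:

```python
import threading

balance = 100                       # S, the balance at the start
A, B = 30, 50                       # cost of the calls and the sum paid in

both_read = threading.Barrier(2)    # both threads read the same, stale S
off_wrote = threading.Event()       # delays the credit's write by "one beat"

def charge_off():                   # S := S - A, as three separate steps
    global balance
    s = balance                     # beat 1: read S (reads 100)
    both_read.wait()                # (deliberate interleaving for the demo)
    balance = s - A                 # beat 3: write S - A (writes 70)
    off_wrote.set()

def charge_on():                    # S := S + B, the same three steps
    global balance
    s = balance                     # also reads the original S (100)
    both_read.wait()
    off_wrote.wait()                # beat 4: write only after the charge-off
    balance = s + B                 # the stale s + B (150) wipes out the -A

t1 = threading.Thread(target=charge_off)
t2 = threading.Thread(target=charge_on)
t1.start(); t2.start(); t1.join(); t2.join()
print(balance)                      # → 150: S + B instead of S - A + B = 120
```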
Figure 1 - An example of a billing system's operation

Thus the final balance of the account will equal S + B instead of the correct S - A + B: the credit thread read the old value of S before the charge-off was written, so the subtraction was lost. In another situation such a collision could turn out just as badly for the client. To avoid such problems, special measures must be taken to protect certain program sections from simultaneous execution. In the following sections on synchronization and interaction, the terms "process" and "thread" will be used together, as the described mechanisms in most cases apply to both notions.

Synchronization...

Modern operating systems offer a wide range of means for synchronizing parallel processes. The simplest way to synchronize simultaneously executed code sections is for one thread to wait for another to finish. The same mechanism supports asynchronous operations performed by the operating system, such as file input/output or data exchange over the network. Besides, wait functions can be used for mutual notification about events that have taken place (figure 2).
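Waiting for a notification of this kind looks as follows with a threading event (a minimal sketch; the thread and event names are invented):

```python
import threading

funds_arrived = threading.Event()     # starts in the "nonsignaled" state
log = []

def pay_directions():
    funds_arrived.wait()              # block until the event is signaled
    log.append("payment order executed")

teller = threading.Thread(target=pay_directions)
teller.start()
log.append("funds credited to the account")
funds_arrived.set()                   # signal the event; wakes every waiter
teller.join()
print(log)   # → ['funds credited to the account', 'payment order executed']
```

The waiting thread can do nothing until set() switches the event, which is exactly the ordering guarantee the text describes.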
Figure 2 - An event is an operating-system object which can switch its state from "nonsignaled" to "signaled"

For example, when developing a bank's trading-day system, an event can be registered to announce the receipt of funds into a client's account. The thread that executes the client's payment orders can wait for this event.

If different program sections may simultaneously modify some variable, that variable must be protected. For example, if one thread calculates a game scene and another repaints it, the repainting should be performed only after all the necessary calculations are done. For this purpose operating systems introduce a special object, the mutex (from "mutual exclusion").

Figure 3 - A mutex is a system object which can be in one of two internal states, "free" or "busy"
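In Python the threading.Lock object plays the mutex role just pictured. Guarded by it, the billing update from the earlier example becomes safe (a sketch, with the same invented S, A, B values):

```python
import threading

balance = 100
A, B = 30, 50
balance_lock = threading.Lock()       # the mutex guarding the account balance

def charge_off():
    global balance
    with balance_lock:                # capture the mutex (wait if busy)
        balance = balance - A         # the read-modify-write is now indivisible

def charge_on():
    global balance
    with balance_lock:
        balance = balance + B

threads = [threading.Thread(target=f) for f in (charge_off, charge_on)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)                        # → 120, the correct S - A + B in any order
```

threading.RLock is the re-entrant variant: the same thread may acquire it repeatedly without blocking itself.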
Before a thread begins to change the variable, it should capture the mutex associated with that variable. If it succeeds, the mutex changes its internal state and the thread continues execution. If another thread tries to capture the same mutex while the first one is working with the variable, it is refused and has to wait until the mutex is released by the first thread (figure 3). Thus it is guaranteed that at any concrete moment only one code section works with the variable.

Sometimes several global variables must be modified in one code section. If this code can be executed by several threads simultaneously, it must be protected as well. Especially for such cases, operating systems offer to enclose the corresponding code in a critical section.

Figure 4 - A critical section is a program section which can be executed by only one thread at a time

Another thread can inquire whether the critical section is available and, on a negative answer, either wait until it is free or do something else (figure 4). Critical sections also differ from mutexes in that one and the same thread may re-enter a critical section it has already captured many times, whereas a thread that tries to re-capture a non-recursive mutex it already holds will stop, and a so-called "deadlock" occurs.

Suppose we have a set of identical resources, for example several windows for displaying information or a pool of network printers. To keep track of the distribution of such resources between threads, semaphores are used.
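The resource counting just introduced maps directly onto threading.Semaphore; here is a sketch of the printer-pool idea with a pool of two (the pool size is invented):

```python
import threading

printers = threading.Semaphore(2)            # a pool of two identical printers

assert printers.acquire(blocking=False)      # counter 2 -> 1: printer granted
assert printers.acquire(blocking=False)      # counter 1 -> 0: last one granted
# the whole pool is busy; a blocking acquire would now enqueue the thread
assert not printers.acquire(blocking=False)  # refused: the counter is zero

printers.release()                           # a printer is returned: 0 -> 1
assert printers.acquire(blocking=False)      # and can be handed out again
```

The non-blocking acquire makes the counter's behavior visible; real worker threads would simply call acquire() and queue up when the pool is empty.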
Figure 5 - A semaphore keeps track of the distribution of identical resources between threads

A semaphore is a system object that works as a counter. Before a thread starts working with a resource from the set, it should address the semaphore associated with that set. If the value of the counter is greater than zero, the thread is allowed to take the resource and the counter is decremented by one (figure 5). But if all the resources of the set are in use, the thread is enqueued. When a resource becomes free, the thread that held it notifies the semaphore, the counter rises, and the resource can be used again.

Sometimes the execution of a process must be interrupted so that some urgent actions can be performed. Some operating systems use signals for this purpose. A process that needs another process to execute a special operation sends it a signal. When writing that other process, a signal-handling procedure should be implemented: on receiving a signal the process may suspend or terminate its work, execute a special subprogram created for such a case, or simply ignore the signal.

... and interaction

As said above, to solve a common task together, processes may need not only to coordinate their work but also to exchange information, for example intermediate results. For this purpose one can use shared main memory, information exchange through a file, transfer of data through unidirectional pipes, or an imitation of network operation (figure 6).
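The signal mechanism described a paragraph above can be sketched with Python's signal module (Unix-only; here the process signals itself, whereas normally another process would name our pid in os.kill):

```python
import os
import signal

received = []

def on_usr1(signum, frame):              # the signal-handling procedure
    received.append(signum)

signal.signal(signal.SIGUSR1, on_usr1)   # register the handler
os.kill(os.getpid(), signal.SIGUSR1)     # deliver the signal
print(received == [signal.SIGUSR1])      # → True: the handler has run
```

Replacing the handler with signal.SIG_IGN would implement the "just ignore the signal" option from the text.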
Figure 6 - Ways of exchanging data between processes

The quickest and most obvious way to exchange information between simultaneously executed code sections is a shared area of main memory. One process writes data into the memory, the other reads it, and vice versa. But in this case the processes must be synchronized by one of the means described above. And while threads can use shared memory directly, processes must first request a shared memory segment from the operating system and agree on the procedure of its usage.

The next method of information exchange is the pipe. A pipe is a system object that transfers information in one direction, as a rule. The best-known examples of pipes are the standard input/output streams (stdin, stdout). The stdout of one process can be directed into the stdin of another; after that, whatever the first process writes to its stdout can be read by the second process. Operating systems also allow us to create additional pipes. Pipes are handled with functions whose syntax is similar to that of file operations.

Data can be exchanged through files as well. Modern operating systems have built-in mechanisms for buffering the information involved in file operations, so this method is rather effective. But from the security viewpoint it is inferior to those described above, since an unauthorized application can gain access to the file used for inter-process exchange. Before using this method, study the implementation of file operations in the concrete operating system and take measures to protect the data.

Duplex data exchange between processes can be implemented by the operating system's network means. With the help of sockets, two processes can establish a channel between themselves and transfer data in the same way as a browser and an HTTP server do. The format of the transferred data can be of absolutely any kind; the point is that the processes follow the same agreed exchange procedure, the protocol. Moreover, nothing prevents these processes from being executed on different computers. And with this last remark we pass on to the next section.

Supercomputers - everywhere!

Speaking about parallel execution of programs, we have so far assumed a computer with several processors. But such a computer is not the only platform for executing parallel programs. Consider the simplest computational network: ten common single-processor computers united by Fast Ethernet over twisted pair. Looks familiar, doesn't it? What is each computer, in fact? Generally speaking, it is a processor with main and disk memory available to it. Each computer is a perfect place to execute one process. What if on each of the ten computers we simultaneously launch programs that solve subtasks of cracking the password to an archive?
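The ten-computer idea works in miniature on a single machine: the standard library can already divide such a search across local processors (a toy sketch; the "password" and the alphabet split are invented for the illustration):

```python
from itertools import product
from multiprocessing import Pool
import string

TARGET = "dog"                       # the "password" we pretend to crack

def search(first_letter):
    """Each worker scans only the candidates starting with its letter."""
    for tail in product(string.ascii_lowercase, repeat=2):
        candidate = first_letter + "".join(tail)
        if candidate == TARGET:
            return candidate
    return None

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # four executors share 26 subtasks
        hits = pool.map(search, string.ascii_lowercase)
    print(next(h for h in hits if h))   # → dog
```

The subtasks are independent, so the speedup comes for free; distributing them over networked machines instead of local processes is the cluster idea discussed next.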
Obviously, the solution will take much less time than on a single computer. And what if the network consists not of 10 but of 100 computers? And is connected to the Internet?

This example shows that even small local networks hold great potential computational power. And if this power can hardly be used for computer games, for CAD systems it is quite sufficient. A group of computers united by a data-transfer network into a single computational system is called a cluster system, or simply a cluster. You only need to write a special program that provides for the exchange of information between its simultaneously executed copies (processes) over the local network. Generally speaking, such a program may have the traditional client-server architecture, i.e. be able both to send messages to its neighbors and to receive messages from them. But there are also ready-made solutions that simplify the development of programs for cluster systems, for example MPI and PVM. They offer the programmer means for both point-to-point and collective interaction between processes executed on the nodes of the cluster system, as well as methods of synchronizing them.

The MPI (Message Passing Interface) specification offers a programming model in which a program spawns several processes that interact by calling message-sending and message-receiving subprograms. Its implementations are libraries of subprograms usable from the C/C++ and Fortran languages. When an MPI program is launched, a fixed number of processes is created. The MPI specification was created as an industry standard, so all its facilities are focused on achieving high performance on symmetric multi-processor systems and homogeneous cluster systems (supercomputers). There is a free implementation of MPI, MPICH (MPI Chameleon), whose Windows and Linux versions are available for download. With MPICH installed on a PC, programs can be developed and debugged for later use, without any modifications, on clusters and supercomputers.

Unlike MPI, the PVM (Parallel Virtual Machine) system for developing and executing parallel programs was created within the framework of a research project and is meant for heterogeneous computational complexes. It allows you to quickly unite a heterogeneous set of network-connected computers into a single computational resource, the "parallel virtual machine". The computers may have different architectures and run different operating systems. The PVM system is a set of libraries and utilities meant for developing and debugging parallel programs and for managing the configuration of the virtual machine. The C/C++ and Fortran languages are supported. The configuration of the computational complex can be changed dynamically by excluding some nodes and adding new ones. Such universality comes at the price of some loss of performance compared with MPI.

Thus both MPI and PVM allow you, without much effort, to turn the local network of your organization into a powerful computing machine able to solve complex tasks.

We should admit that parallel computational systems are very common nowadays. Operating systems provide developers with the necessary low-level services, while for applied tasks there are ready, proven tools in the form of specialized utilities and libraries for popular high-level programming languages. A programmer should master the methods of parallel program development to keep up with modern trends.
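As a parting sketch, the MPI-style model of a fixed set of ranked processes exchanging messages can be imitated with the standard library alone. This is not MPI: multiprocessing pipes stand in for its send/receive calls, and the scatter/gather pattern is assumed for the illustration:

```python
from multiprocessing import Pipe, Process

def worker(rank, conn):
    """A cluster 'node': receive a subtask, solve it, send the result back."""
    chunk = conn.recv()              # point-to-point receive from the master
    conn.send((rank, sum(chunk)))    # point-to-point send of the partial result
    conn.close()

if __name__ == "__main__":
    data = list(range(100))          # the source task: sum the numbers 0..99
    links, nodes = [], []
    for rank in range(4):            # a fixed number of processes, as in MPI
        master_end, node_end = Pipe()
        p = Process(target=worker, args=(rank, node_end))
        p.start()
        master_end.send(data[rank::4])   # scatter: one independent slice each
        links.append(master_end)
        nodes.append(p)
    total = sum(link.recv()[1] for link in links)  # gather and combine
    for p in nodes:
        p.join()
    print(total)                     # → 4950, matching the sequential answer
```

In real MPI the scatter and gather steps would be single collective calls, and the processes could live on different nodes of the cluster.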