
Debugging and optimization of multi-threaded OpenMP programs

The task of familiarizing programmers with parallel application development is becoming more and more urgent. This article is a brief introduction to creating multi-threaded applications based on OpenMP technology. Approaches to debugging and optimizing parallel applications are described.



Authors: Andrey Karpov, Evgeniy Romanovsky
Date: 20.11.2009

In this article we discuss OpenMP technology, whose main value is the possibility of improving and optimizing already existing code. The OpenMP standard is a set of specifications for parallelizing code in a shared-memory environment. OpenMP can be used if the standard is supported by the compiler. In addition, you need entirely new tools for the debugging stage, when errors are detected, localized and corrected and optimization is performed.

A debugger for sequential code is a tool which a programmer knows very well and uses very often. It allows you to trace changes of variable values while stepping through a program with the help of an advanced user interface. But debugging and testing of multi-threaded applications is a different matter, and creating multi-threaded applications is becoming the main route to building effective applications.

Debugging of a sequential program relies on the fact that the initial and current states of the program are determined by the input data. When a programmer debugs multi-threaded code, he usually faces entirely new problems: different operating systems use different scheduling strategies, the load of the computer system changes dynamically, process priorities can differ, and so on. Precise reconstruction of the program's state at a certain moment of its execution (a trivial thing in sequential debugging) is very difficult for a parallel program because of its non-deterministic behavior. In other words, the behavior of the threads launched in the system (their execution and waiting, deadlocks and the like) depends on random events occurring in the system. What can be done? Obviously, we need entirely different means for debugging parallel code.

As parallel computer systems have become a common thing on the consumer market, the demand for tools for debugging multi-threaded applications has increased greatly. We will consider debugging and performance improvement of a multi-threaded application based on OpenMP technology. The full program text, separate fragments of which we will consider below, is given in Appendix 1 at the end of the article.
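Before moving to the example, it is worth checking that the compiler actually honors the OpenMP directives rather than silently ignoring them. The following is a minimal sketch, not part of the authors' program: it tests the standard _OPENMP macro and, if OpenMP is enabled (for example, with the /openmp switch in Visual C++ 2005), prints a line from every thread of a parallel region.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main()
{
#ifdef _OPENMP
  // _OPENMP is defined by the compiler when OpenMP support is switched on
  printf("OpenMP enabled, _OPENMP = %d\n", _OPENMP);
  // Each thread of the parallel region reports its number
  #pragma omp parallel
  printf("Hello from thread %d of %d\n",
         omp_get_thread_num(), omp_get_num_threads());
#else
  printf("OpenMP is not enabled; #pragma omp directives are ignored\n");
#endif
  return 0;
}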
Let's consider the sequential code of the Function function given in Listing 1 as an example. This simple subprogram calculates the values of some mathematical function with one argument.

double Function(int N)
{
  double x, y, s = 0;
  for (int i = 1; i <= N; i++)
  {
    x = (double)i / N;
    y = x;
    for (int j = 1; j <= N; j++)
    {
      s += j * y;
      y = y * x;
    }
  }
  return s;
}

When calling this function with the N argument equal to 15000, we get 287305025.528.

This function can easily be parallelized by means of the OpenMP standard. Let's add just one line before the first for operator (Listing 2).

double FunctionOpenMP(int N)
{
  double x, y, s = 0;
  #pragma omp parallel for num_threads(2)
  for (int i = 1; i <= N; i++)
  {
    x = (double)i / N;
    y = x;
    for (int j = 1; j <= N; j++)
    {
      s += j * y;
      y = y * x;
    }
  }
  return s;
}

Unfortunately, the code we have created is incorrect, and the result of the function is, generally speaking, undefined. For example, it can be 298441282.231. Let's consider the possible causes.
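The non-determinism is easy to observe even before any analysis tool is involved. Here is a small sketch (assuming Function and FunctionOpenMP from Listings 1 and 2 are linked in) that simply calls the parallel version several times; on a multi-core machine the printed values typically differ from call to call, while the sequential version always returns the same result.

#include <stdio.h>

double Function(int N);        // sequential reference version (Listing 1)
double FunctionOpenMP(int N);  // incorrect parallel version (Listing 2)

int main()
{
  const int N = 15000;
  printf("Reference: %.3f\n", Function(N));
  // Repeated calls of the racy version usually print different values
  for (int run = 0; run < 5; run++)
    printf("Run %d:     %.3f\n", run, FunctionOpenMP(N));
  return 0;
}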
The main cause of errors in parallel programs is incorrect work with shared resources, that is, resources common to all launched threads, and in particular with shared variables.

This program compiles successfully in the Microsoft Visual Studio 2005 environment, and the compiler does not even show any warning messages. However, it is incorrect. To understand this, you should recall that variables in OpenMP programs are divided into shared ones, which exist as single copies and are available to all the threads, and private ones, which are local to a particular thread. In addition, there is a rule saying that by default all the variables in OpenMP parallel regions are shared, except for the indexes of parallel loops and variables defined inside these parallel regions.

From the example above it is obvious that the x, y and s variables are shared, which is absolutely incorrect. The s variable should be shared, as it is the accumulator in the given algorithm. But when working with x or y, each thread calculates its next value and writes it into the corresponding variable (x or y). In this case the result depends on the order in which the parallel threads are executed. In other words, if the first thread calculates a value of x, writes it into the x variable and, after the second thread has performed the same operations, tries to read the x variable, it will get the value written by the last thread, that is, the second one. Such errors, where program behavior depends on the order in which different code sections are executed, are called race conditions, or data races: unsynchronized memory accesses occur and the computational threads "race" one another.
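A useful defensive idiom here (not used in the original code, just a sketch) is the default(none) clause: it cancels the implicit "everything is shared" rule, so the compiler refuses to build the parallel region until the programmer states explicitly which variables are shared and which are private. Note that this only makes the decisions visible; the race on the shared variable s is still present and is dealt with below.

// With default(none), every variable used in the region must be listed
// explicitly; forgetting x, y, s or N becomes a compile-time error.
#pragma omp parallel for num_threads(2) default(none) shared(N, s) private(x, y)
for (int i = 1; i <= N; i++)
{
  x = (double)i / N;
  y = x;
  for (int j = 1; j <= N; j++)
  {
    s += j * y;   // still an unsynchronized update of the shared variable s
    y = y * x;
  }
}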
To find such errors we need special software tools. One of them is Intel Thread Checker (the product's site: http://www.viva64.com/go.php?url=526). This program is provided as a module for the Intel VTune Performance Analyzer profiler, extending its existing facilities for working with multi-threaded code. Intel Thread Checker allows you to detect both the errors described above and many other defects, for example deadlocks (points of mutual locking of computational threads) and memory leaks.

When Intel Thread Checker is installed, a new project category appears in the New Project dialog of the Intel VTune Performance Analyzer application: Threading Wizards, among which there is the Intel Thread Checker Wizard. To launch the example, select it and specify the path to the executable program in the next wizard window. The program will be executed and the profiler will collect all the data about the application's operation. An example of the information provided by Intel Thread Checker is shown in Figure 1.

Figure 1 - Many critical errors were detected during Thread Checker's operation

As we can see, the number of errors is large even for such a small program. Thread Checker groups the detected errors, estimates how critical they are for program operation and provides a description of each, which increases the programmer's productivity. Besides, the Source View tab shows the application's source code with the error sections marked (Figure 2).
Figure 2 - Analysis of multi-threaded code by Intel Thread Checker

We should mention that Intel Thread Checker cannot detect errors in some cases. This occurs when code rarely gets control or is executed on a system with a different architecture. Errors can also be missed when the set of input test data differs greatly from the data the program processes when used by end users. All this does not allow us to be sure that a multi-threaded program contains no errors when it is tested only with dynamic analysis tools, whose results depend on the environment and the moment of execution.

But there is another tool, and it is good news for OpenMP developers. This tool is VivaMP, which offers an alternative approach to verifying parallel programs. VivaMP is built on the principle of a static code analyzer and allows you to check the application code without launching it.
To learn more about the VivaMP tool, visit the developer's site: http://www.viva64.com/vivamp-tool/.

VivaMP's areas of use:
  • Checking the correctness of the code of applications developed on the basis of OpenMP technology.
  • Helping to master OpenMP technology and to integrate it into existing projects.
  • Creating parallel applications that use resources more effectively.
  • Finding errors in existing OpenMP applications.

The VivaMP analyzer integrates into the Visual Studio 2005/2008 environment and provides a simple interface for checking applications (Figure 3).

Figure 3 - Launch of the VivaMP tool integrated into Visual Studio 2005
If we launch VivaMP on our example, we will get error messages for four different lines where incorrect modification of variables occurs (Figure 4).

Figure 4 - The result of the VivaMP static analyzer's operation

Of course, static analysis has its disadvantages, just as dynamic analysis does. But used together, these methodologies (the Intel Thread Checker and VivaMP tools) complement each other very well and make a fairly reliable way of detecting errors in multi-threaded applications.

The error of writing into the x and y variables described above, detected by both the Intel Thread Checker and VivaMP tools, can easily be corrected: you only need to add one more clause to the #pragma omp parallel for construction: private(x, y). Thus these two variables are defined as private, and each computational thread gets its own separate copies of x and y.
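For clarity, this is how the loop looks after that single change (a sketch of the intermediate version; as discussed next, the race on the shared variable s is still present):

// private(x, y): each thread works on its own copies of x and y.
// s is still shared by default, so the summation below still races.
#pragma omp parallel for private(x, y) num_threads(2)
for (int i = 1; i <= N; i++)
{
  x = (double)i / N;
  y = x;
  for (int j = 1; j <= N; j++)
  {
    s += j * y;   // unsynchronized update of the shared variable s
    y = y * x;
  }
}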
But you should also keep in mind that all the threads save their calculated results by adding them to the s variable. Errors of this kind occur when one computational thread writes a value into shared memory while another thread is reading it at the same time. In the given example this can lead to an incorrect result.

Let's consider the s += j*y instruction. It is assumed that each thread adds its calculated result to the current value of the s variable and then the same operation is executed by the other threads. But in some cases two threads begin to execute the s += j*y instruction simultaneously: each of them first reads the current value of the s variable, then adds the result of the j*y multiplication to this value and writes the final result back into the s variable.

Unlike the reading operation, which can be performed in parallel and is quite fast, the writing operation is always sequential. Consequently, if the first thread writes its new value first, the second thread, performing its write afterwards, will erase the result of the first thread's calculations, because both threads read one and the same value of the s variable and then each writes its own result into it. In other words, the value of the s variable that the second thread writes into shared memory does not take into account the calculations performed by the first thread at all. You can avoid such a situation by making sure that at any moment only one thread is allowed to execute the s += j*y operation. Such operations are called indivisible, or atomic. To tell the compiler that an instruction is atomic, we use the #pragma omp atomic construction. The program code in which the described problems are corrected is shown in Listing 3.

double FixedFunctionOpenMP(int N)
{
  double x, y, s = 0;
  #pragma omp parallel for private(x, y) num_threads(2)
  for (int i = 1; i <= N; i++)
  {
    x = (double)i / N;
    y = x;
    for (int j = 1; j <= N; j++)
    {
      #pragma omp atomic
      s += j * y;
      y = y * x;
    }
  }
  return s;
}

Having recompiled the program and analyzed it once again in Thread Checker, we see that it contains no critical errors. Only two messages are shown, telling that the parallel threads end when reaching the return operator of the FixedFunctionOpenMP function. In the given example this is fine, because only the code inside this function is parallelized. The VivaMP static analyzer shows no messages on this code at all, as it is absolutely correct from its viewpoint.

But some work still remains. Let's see whether our code has really become more efficient after parallelization. Let's measure the execution time of three functions: 1 - sequential, 2 - parallel incorrect, 3 - parallel correct. The results of these measurements for N = 15000 are given in Table 1.

Function                                                      Result          Execution time
Sequential variant of the function                            287305025.528   0.5781 seconds
Incorrect variant of the parallel function                    298441282.231   2.9531 seconds
Correct variant of the parallel function (atomic directive)   287305025.528   36.8281 seconds

Table 1 - Results of the functions' operation

What do we see in the table? The incorrect parallel variant of the function works several times slower. But we are not interested in that function. The problem is that the correct variant works more than 60 times slower than the sequential one. Do we need such parallelism? Of course not.
The reason is that we have chosen a very inefficient way of summing the result into the s variable: the atomic directive. With this approach the threads wait for each other very often. To avoid the threads constantly blocking each other on the atomic summation, we can use the special reduction clause. The reduction option specifies that the variable receives the combined value at the exit from the parallel block. The following operations are permitted: +, *, -, &, |, ^, &&, ||. The modified variant of the function is shown in Listing 4.

double OptimizedFunction(int N)
{
  double x, y, s = 0;
  #pragma omp parallel for private(x, y) num_threads(2) reduction(+: s)
  for (int i = 1; i <= N; i++)
  {
    x = (double)i / N;
    y = x;
    for (int j = 1; j <= N; j++)
    {
      s += j * y;
      y = y * x;
    }
  }
  return s;
}

In this case we get a variant of the function that is not only correct but also faster (Table 2). The speed of calculation has almost doubled (more precisely, it has increased by a factor of 1.85), which is a good result for such functions.

Function                                                          Result          Execution time
Sequential variant of the function                                287305025.528   0.5781 seconds
Incorrect variant of the parallel function                        298441282.231   2.9531 seconds
Correct variant of the parallel function (atomic directive)       287305025.528   36.8281 seconds
Correct variant of the parallel function (reduction directive)    287305025.528   0.3125 seconds

Table 2 - Results of the functions' operation
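To see why reduction is so much cheaper than atomic, it helps to spell out what the clause does conceptually: each thread accumulates into its own private copy of s, and the copies are combined once when the parallel region ends, instead of synchronizing on every iteration. The following hand-written equivalent is only an illustrative sketch (the function name is arbitrary; the reduction clause itself remains the idiomatic way to write this):

double ManualReductionFunction(int N)
{
  double s = 0;
  #pragma omp parallel num_threads(2)
  {
    double local_s = 0;           // private partial sum, no contention in the loop
    #pragma omp for
    for (int i = 1; i <= N; i++)
    {
      double x = (double)i / N;
      double y = x;
      for (int j = 1; j <= N; j++)
      {
        local_s += j * y;
        y = y * x;
      }
    }
    #pragma omp atomic            // one synchronized addition per thread, not per iteration
    s += local_s;
  }
  return s;
}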
In conclusion, I would like to point out once again that a working parallel program is not always an efficient one. And although parallel programming provides many ways to improve code efficiency, it demands attention and good knowledge of the technologies used from the programmer. Fortunately, there are tools such as Intel Thread Checker and VivaMP which greatly simplify the creation and testing of multi-threaded applications. Dear readers, I wish you good luck in mastering this new field of knowledge!

Appendix 1. Text of the demo program

#include "stdafx.h"
#include <omp.h>
#include <stdlib.h>
#include <windows.h>

class VivaMeteringTimeStruct {
public:
  VivaMeteringTimeStruct()
    { m_userTime = GetCurrentUserTime(); }
  ~VivaMeteringTimeStruct()
    { printf("Time = %.4f seconds\n", GetUserSeconds()); }
  double GetUserSeconds();
private:
  __int64 GetCurrentUserTime() const;
  __int64 m_userTime;
};

__int64 VivaMeteringTimeStruct::GetCurrentUserTime() const
{
  FILETIME creationTime, exitTime, kernelTime, userTime;
  GetThreadTimes(GetCurrentThread(), &creationTime, &exitTime,
                 &kernelTime, &userTime);
  __int64 curTime;
  curTime = userTime.dwHighDateTime;
  curTime <<= 32;
  curTime += userTime.dwLowDateTime;
  return curTime;
}

double VivaMeteringTimeStruct::GetUserSeconds()
{
  __int64 delta = GetCurrentUserTime() - m_userTime;
  return double(delta) / 10000000.0;
}

double Function(int N)
{
  double x, y, s = 0;
  for (int i = 1; i <= N; i++)
  {
    x = (double)i / N;
    y = x;
    for (int j = 1; j <= N; j++)
    {
      s += j * y;
      y = y * x;
    }
  }
  return s;
}

double FunctionOpenMP(int N)
{
  double x, y, s = 0;
  #pragma omp parallel for num_threads(2)
  for (int i = 1; i <= N; i++)
  {
    x = (double)i / N;
    y = x;
    for (int j = 1; j <= N; j++)
    {
      s += j * y;
      y = y * x;
    }
  }
  return s;
}

double FixedFunctionOpenMP(int N)
{
  double x, y, s = 0;
  #pragma omp parallel for private(x, y) num_threads(2)
  for (int i = 1; i <= N; i++)
  {
    x = (double)i / N;
    y = x;
    for (int j = 1; j <= N; j++)
    {
      #pragma omp atomic
      s += j * y;
      y = y * x;
    }
  }
  return s;
}

double OptimizedFunction(int N)
{
  double x, y, s = 0;
  #pragma omp parallel for private(x, y) num_threads(2) reduction(+: s)
  for (int i = 1; i <= N; i++)
  {
    x = (double)i / N;
    y = x;
    for (int j = 1; j <= N; j++)
    {
      s += j * y;
      y = y * x;
    }
  }
  return s;
}

int _tmain(int, _TCHAR* [])
{
  int N = 15000;
  {
    VivaMeteringTimeStruct Timer;
    printf("Result = %.3f ", Function(N));
  }
  {
    VivaMeteringTimeStruct Timer;
    printf("Result = %.3f ", FunctionOpenMP(N));
  }
  {
    VivaMeteringTimeStruct Timer;
    printf("Result = %.3f ", FixedFunctionOpenMP(N));
  }
  {
    VivaMeteringTimeStruct Timer;
    printf("Result = %.3f ", OptimizedFunction(N));
  }
  return 0;
}
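A note on building the demo, stated here as an assumption rather than something given in the article: the #pragma omp directives take effect only when the compiler's OpenMP support is switched on. In Visual C++ 2005 this is the /openmp option (in the IDE: Configuration Properties > C/C++ > Language > OpenMP Support). A command-line build might look like this, assuming the project's stdafx.h is present and the source file name (here arbitrary) is OpenMPDemo.cpp:

cl /EHsc /O2 /openmp OpenMPDemo.cpp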
