Development of a static code analyzer for detecting errors of porting programs on 64-bit systems

Author: Evgeniy Ryzhkov
Date: 26.03.2009

Abstract

The article concerns the task of developing a program tool called a static analyzer. The tool being developed is used for diagnosing syntactic constructions of C++ which are potentially unsafe from the viewpoint of porting program code to 64-bit systems. Here we focus not on the porting problems occurring in programs, but on the peculiarities of creating a specialized code analyzer. The analyzer is intended for working with the code of C/C++ programs.

Introduction

One of the modern tendencies in IT is the port of software to 64-bit processors. Obsolete 32-bit processors (and consequently 32-bit programs) have limitations which impede software developers and hold back progress. First of all, such a limitation is the maximum amount of main memory available to a program (2 GB). Although there are methods which allow you to evade this limitation in some cases, in general we can say with certainty that the move to 64-bit program solutions is inevitable.

For most programs, porting to a new architecture means at least the necessity of recompilation. Of course, there can be other ways, but within the framework of this article we speak about the C and C++ languages, so recompilation is inevitable. Unfortunately, this very recompilation often has unexpected and unpleasant consequences.

A change of the architecture's capacity (for example, from 32 bits to 64 bits) means, first of all, a change of the sizes of the basic data types and of the correlations between them. As a result, the behavior of a program can change after its recompilation for a new architecture. Practice shows that the behavior is not only able to change but really does. And the compiler very often shows no warning messages on those constructions which are potentially unsafe from the viewpoint of the new 64-bit architecture. Of course, the least correct code sections will be detected by the compiler. However, by no means all the potentially unsafe syntactic constructions can be detected with the help of traditional program tools. It is for this reason that a new code analyzer is needed. But before we speak about the new tool, we still need to describe in detail the errors which our analyzer will detect.

1 Some errors of porting programs on 64-bit systems

A detailed analysis of all the potentially unsafe C and C++ syntactic constructions is beyond the scope of this article. We refer readers interested in this problem to the encyclopedic article [1], which gives a rather full investigation of the issue. For the purposes of designing a code analyzer we should still list the main types of errors here.
Before we speak about concrete errors, let's revise some data types used in the C and C++ languages. They are listed in Table 1.

Type name                  Size in bits      Size in bits      Description
                           (32-bit system)   (64-bit system)
ptrdiff_t                  32                64                Signed integer type resulting from the
                                                               subtraction of two pointers. Basically used
                                                               for storing sizes and indexes of arrays.
                                                               Sometimes used as the result of a function
                                                               returning a size, or -1 when an error occurs.
size_t                     32                64                Unsigned integer type. The result of the
                                                               sizeof() operator. Often used for storing
                                                               the size or the number of objects.
intptr_t, uintptr_t,       32                64                Integer types capable of storing a
SIZE_T, SSIZE_T,                                               pointer's value.
INT_PTR, DWORD_PTR, etc.

Table 1. Description of some integer types.

What is peculiar about these data types is that their size alters depending on the architecture: on 64-bit systems it is 64 bits, and on 32-bit systems, 32 bits.

Let's introduce the notion of "memsize-type".

DEFINITION: By memsize-type we understand any simple integer type able to store a pointer and changing its size when the platform's capacity changes from 32 bits to 64 bits. All the types listed in Table 1 are memsize-types.

Most problems occurring in program code (if we speak about support of 64 bits) relate to the disuse or incorrect use of memsize-types.

So, let's describe the possible errors.

1.1 Using "magic" constants

The presence of "magic" constants (i.e. values calculated in an unknown way) in programs is in itself undesirable. But in the case of porting programs to 64-bit systems, "magic" constants acquire one more very significant disadvantage: they can lead to incorrect operation of programs. We speak about those "magic" constants which rely on some concrete peculiarity of the architecture, for example, on the assumption that the size of a pointer is 32 bits (4 bytes).

Let's consider a simple example:

size_t values[ARRAY_SIZE];
memset(values, 0, ARRAY_SIZE * 4);

On a 32-bit system this code is correct, but on a 64-bit system the size of the size_t type increases to 8 bytes. Unfortunately, a fixed element size (4 bytes) was used in the code. As a result, the array will be filled with zeros only partially.

There are other variants of incorrect use of such constants.
1.2 Address arithmetic

Let's consider an example of a typical error in address arithmetic:

unsigned short a16, b16, c16;
char *pointer;
...
pointer += a16 * b16 * c16;

This example works correctly with pointers if the value of the "a16 * b16 * c16" expression does not exceed UINT_MAX (4 GB). This code could always work correctly on a 32-bit platform, as the program never allocated larger arrays. On a 64-bit architecture the array's size exceeded UINT_MAX items. Suppose we want to shift the pointer's value by 6,000,000,000 bytes, so the variables a16, b16 and c16 have the values 3000, 2000 and 1000 correspondingly. When the "a16 * b16 * c16" expression is calculated, all the variables are first converted into the int type according to the C++ rules, and only after that are they multiplied. During the multiplication an overflow occurs. The incorrect result of the expression is then extended to the ptrdiff_t type, and the pointer is calculated incorrectly.

But such errors occur not only with large data but with common arrays as well. Let's consider an interesting piece of code for working with an array containing only 5 items. This example works in 32-bit mode but won't work in 64-bit mode:

int A = -2;
unsigned B = 1;
int array[5] = { 1, 2, 3, 4, 5 };
int *ptr = array + 3;
ptr = ptr + (A + B); //Invalid pointer value on 64-bit platform
printf("%i\n", *ptr); //Access violation on 64-bit platform

Let's follow the process of calculating "ptr + (A + B)":

  • According to the C++ rules, the variable A of int type is converted into the unsigned type.
  • A and B are summed. As the result we get the value 0xFFFFFFFF of unsigned type.

After that the "ptr + 0xFFFFFFFFu" expression is calculated, but the result depends on the size of the pointer on the given architecture. If the addition is performed in a 32-bit program, this expression is equivalent to "ptr - 1" and we successfully print the number 3. In a 64-bit program the 0xFFFFFFFFu value is fairly added to the pointer, and consequently the pointer ends up far outside the array's limits.
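One way to correct the five-item example is to perform the addition in the signed memsize-type ptrdiff_t before it reaches the pointer. This is a sketch for illustration only; the helper name shift is invented here:

```cpp
#include <cstddef>

// Convert both operands to ptrdiff_t first, so -2 + 1 evaluates to -1
// instead of to the unsigned value 0xFFFFFFFF; the resulting pointer is
// then correct on both 32-bit and 64-bit platforms.
int *shift(int *ptr, int A, unsigned B) {
    return ptr + (static_cast<std::ptrdiff_t>(A) +
                  static_cast<std::ptrdiff_t>(B));
}
```

With A = -2 and B = 1, the call shift(array + 3, A, B) points at array[2] on any platform, and dereferencing it yields 3.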
And when we try to get access to the item by this pointer, we will run into trouble.

1.3 Using integer types and types of variable size together

Using memsize- and non-memsize-types in expressions can lead to incorrect results on 64-bit systems, related to the change of the range of the input values. Here are some examples:

size_t Count = BigValue;
for (unsigned Index = 0; Index != Count; ++Index)
{ ... }

This is an example of an eternal loop occurring if Count > UINT_MAX. Suppose that on 32-bit systems this code worked with a range of fewer than UINT_MAX iterations. But the 64-bit version of the program can process more data, and it may need more iterations. As the values of the Index variable lie within the range [0..UINT_MAX], the condition "Index != Count" will never be fulfilled, and that leads to the eternal loop.

1.4 Virtual and overloaded functions

If you have large class inheritance hierarchies with virtual functions in your program, you can inadvertently use arguments of different types which actually coincide on a 32-bit system. For example, in the base class you use the size_t type as the argument of a virtual function, while in the derived class you use the unsigned type. Consequently, this code will be incorrect on a 64-bit system.

This error doesn't necessarily lie in complex inheritance hierarchies; here is an example:

class CWinApp {
  ...
  virtual void WinHelp(DWORD_PTR dwData, UINT nCmd);
};
class CSampleApp : public CWinApp {
  ...
  virtual void WinHelp(DWORD dwData, UINT nCmd);
};

You'll face troubles when compiling this code for a 64-bit platform. You'll get two functions with the same name but with different parameters, and as a result the user's code will not be called.

Similar problems can occur when using overloaded functions.

As we've already said, this is far from the complete list of possible errors (see [1]); however, it allows us to formulate the requirements to the code analyzer.

2 Requirements to the code analyzer

On the basis of the list of potentially unsafe constructions which need to be diagnosed, we can formulate the following requirements:

  1. The analyzer must allow performing lexical analysis of the program code. This is necessary for analyzing potentially unsafe numerical literals.
  2. The analyzer must allow performing parsing of the program code. It is impossible to carry out all the necessary tests only at the level of lexical analysis. We should also note the complexity of the syntax of C, and especially of C++. Because of this, it is full parsing that we should provide, and not, for example, search on the basis of regular expressions.
  3. Type analysis is also an important part of the analyzer. The complexity of the types in the target languages is such that the subsystem of calculating types is rather labor-intensive. However, we cannot avoid it.

We should mention that the concrete architecture of implementing the listed functionality doesn't matter, but the implementation must be complete.

In the literature on developing compilers [2] it is said that a traditional compiler has the following operation stages:

Figure 1 - A traditional compiler's operation stages

Pay attention that these are "logical" operation phases. In a real compiler some stages are united and some are executed in parallel with others. Thus, for example, the phases of syntactic and semantic analysis are often combined.

Neither code generation nor optimization is needed for the code analyzer. That is, we must develop the part of the compiler which performs lexical, syntactic and semantic analysis.

3 The code analyzer's architecture

On the basis of the listed requirements to the system being developed, we can offer the following structure of the code analyzer:

  1. Lexical analysis unit. Finite-state machines serve as the mathematical apparatus for this unit. The result of lexical analysis is a set of lexemes.
  2. Syntactic analysis unit. Grammars serve as the mathematical apparatus here; the result of syntactic analysis is the parse tree.
  3. Semantic (contextual) analysis unit. The mathematical apparatus here is also grammars, but of a special kind: either specially "extended" grammars, or the so-called attribute grammars [3]. The result of this analysis is the parse tree marked with additional information about types (i.e. the attributed parse tree).
  4. Error diagnosing system. This is the part of the code analyzer which is directly responsible for detecting constructions potentially unsafe from the viewpoint of porting code to 64-bit systems.

The listed units are standard [4] for traditional compilers (figure 2), or more exactly for that part of the compiler which is called the front-end compiler.
Figure 2 - Scheme of the front-end compiler

The other part of a traditional compiler, the back-end compiler, is responsible for optimization and code generation and is of no interest to us here.
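To make the division of labor between the four units concrete, their hand-off can be sketched with toy stand-ins. Every name below is invented for the illustration, and the whitespace "lexer" and flat "parse tree" are deliberate oversimplifications, not parts of a real compiler:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, heavily simplified data types.
struct Lexeme { std::string text; };
struct ParseNode {
    std::string symbol;
    std::vector<ParseNode> children;
    std::string type;   // filled in by the semantic unit
};

// 1. Lexical analysis: whitespace splitting stands in for the real
//    finite-state machine.
std::vector<Lexeme> lex(const std::string &code) {
    std::vector<Lexeme> out;
    std::istringstream in(code);
    std::string word;
    while (in >> word) out.push_back(Lexeme{word, });
    return out;
}

// 2. Syntactic analysis: a flat one-level "tree" stands in for real parsing.
ParseNode parse(const std::vector<Lexeme> &lexemes) {
    ParseNode root{"translation-unit", {}, ""};
    for (const Lexeme &l : lexemes)
        root.children.push_back(ParseNode{l.text, {}, ""});
    return root;
}

// 3. Semantic analysis: mark every leaf spelled "size_t" as a memsize-type.
void annotateTypes(ParseNode &tree) {
    for (ParseNode &child : tree.children)
        child.type = (child.symbol == "size_t") ? "memsize" : "other";
}

// 4. Diagnosing: count the nodes of interest to the 64-bit checks.
int diagnose(const ParseNode &tree) {
    int hits = 0;
    for (const ParseNode &child : tree.children)
        if (child.type == "memsize") ++hits;
    return hits;
}

int analyze(const std::string &code) {
    std::vector<Lexeme> lexemes = lex(code);  // lexical unit
    ParseNode tree = parse(lexemes);          // syntactic unit
    annotateTypes(tree);                      // semantic unit
    return diagnose(tree);                    // diagnosing unit
}
```

The point of the sketch is only the data flow: lexemes feed the parser, the parse tree is annotated with types, and the diagnosing unit works on the attributed tree.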
Thus, the code analyzer being developed must contain a front-end compiler to provide the necessary level of code analysis.

3.1 Lexical analysis unit

The lexical analyzer is a finite-state machine describing the rules of lexical analysis of a concrete programming language.

The description of the lexical analyzer can be presented not only as a finite-state machine but as a regular expression as well. Both variants of description are equivalent, as they are easily converted into each other. Figure 3 shows the part of the finite-state machine describing the C analyzer.
Figure 3 - The finite-state machine describing a part of the lexical analyzer (taken from [3])

As we have already said, at this stage only analysis of one type of potentially unsafe constructions is possible: the use of "magic" constants. All the other types of analysis will be performed at the next stages.

3.2 Syntactic analysis unit

The syntactic analysis unit works with the grammar apparatus to build an abstract syntax tree according to the set of lexemes obtained at the previous stage. The task of syntactic analysis can be formulated more exactly: can the program code be derived from the grammar of the given language? As the result of the derivability check we get the abstract syntax tree, but the point is in the very check of the code's belonging to a concrete programming language.

The code parsing results in building a parse tree. An example of such a tree for the code section in figure 4 is shown in figure 5:

main()
{
  int a = 2;
  int b = a + 3;
  printf("%d", b);
}

Figure 4 - Example of code (for a parse tree)
Figure 5 - Example of a parse tree

We should mention that in the case of some simple programming languages the structure of a program becomes absolutely clear when the parse tree is built. But for such a complex language as C++ we need an additional stage in which the built tree is supplemented, for example, with information about data types.

3.3 Semantic analysis unit

In the semantic analysis unit, of most interest is the subsystem of calculating types. The point is that the data types in C++ are a rather complex and very extensible set of entities. Besides the basic types characteristic of any programming language (integer, character, etc.), C++ has pointers to functions, templates, classes, etc.

Such a complex subsystem of types doesn't allow us to perform the full analysis of a program at the stage of syntactic analysis. That's why the parse tree is input into the semantic analysis unit and then supplemented with information about all the data types.

At this stage the operation of calculating types also takes place. C++ allows coding rather complex expressions, and very often it is not easy to identify their type. Figure 6 shows an example of code for which calculation of types is needed when passing arguments to a function:

void call_func(double x);
int main()
{
  int a = 2;
  float b = 3.0;
  call_func(a+b);
}

Figure 6 - Example of code (calculation of the type)

In this case we need to calculate the type of the result of the (a+b) expression and add the information about the type to the parse tree (figure 7).

Figure 7 - Example of a parse tree supplemented with information about types
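The core of the type-calculation step for such expressions is applying the usual arithmetic conversions to the operand types. As a toy sketch covering only three types (the function name commonType and the string representation of types are invented for the example):

```cpp
#include <string>

// Pick the common type of a binary arithmetic expression, mirroring C++'s
// usual arithmetic conversions for just double, float and int: the operand
// of lower rank is promoted to the type of higher rank.
std::string commonType(const std::string &lhs, const std::string &rhs) {
    auto rank = [](const std::string &t) {
        if (t == "double") return 3;
        if (t == "float")  return 2;
        return 1;  // int
    };
    return rank(lhs) >= rank(rhs) ? lhs : rhs;
}
```

For the code in figure 6, commonType("int", "float") yields "float"; the real subsystem must then also record the implicit conversion to double required by the call_func parameter.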
When the semantic analysis unit finishes operating, all possible information about the program is available for further processing.

3.4 Error diagnosing system

When speaking about processing errors, developers of compilers mean the peculiarities of the compiler's behavior when detecting incorrect program code. In this meaning, errors can be classified into several types [2]:

  • lexical errors - incorrectly written identifiers, keywords or operators;
  • syntactic errors - for example, arithmetic expressions with unbalanced brackets;
  • semantic errors - operators used together with operands incompatible with them.

All these types of errors mean that instead of a program which is correct from the programming language's viewpoint, the compiler gets an incorrect one. Its task is firstly to diagnose the error, and secondly to continue work if possible, or else stop.

A different approach to errors is used when we speak about static analysis of source code with the purpose of detecting potentially unsafe syntactic constructions. The basic difference lies in that the static code analyzer gets program code which is absolutely correct lexically, syntactically and semantically. That's why it is unfortunately impossible to implement the system of diagnosing unsafe constructions in the static analyzer in the same way as the error diagnosing system in a traditional compiler.

4 Implementation of the code analyzer

Implementation of the code analyzer consists of the implementation of two parts:

  • the front-end compiler;
  • the subsystem of diagnosing potentially unsafe constructions.

To implement the front-end compiler we will use the existing open C++ code analysis library OpenC++ [6], or more exactly its modification VivaCore [7]. This is a syntactic code analyzer written by ourselves, in which analysis is performed by the method of recursive descent with backtracking.
The choice of a self-written analyzer is explained by the complexity of C++ and by the absence of ready described grammars of this language suitable for tools of automatic construction of code analyzers of the YACC and Bison type.

As was said in section 3.4, it is impossible to implement the subsystem of searching for potentially unsafe constructions by using an error diagnosing system traditional for compilers. Instead, let's use several methods of modifying the basic C++ grammar.

First of all, we need to correct the description of the basic C++ types. In section 1 we introduced the notion of memsize-types, i.e. types of variable size (table 1). We will process all these types in programs as one special type (memsize). In other words, all the real data types important from the viewpoint of porting code to 64-bit systems (for example, ptrdiff_t, size_t, void*, etc.) will be processed as one type in the program code.

Further, we need to extend the grammar by adding symbols-actions into its production rules [5]. Then the recursive descent procedure which performs syntactic analysis will also perform additional actions to check the semantics. It is these additional actions that comprise the essence of the static code analyzer.
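In implementation terms, a symbol-action is an ordinary function the parser calls once it has matched a production. As a heavily simplified sketch in the spirit of the CheckVirtual action described below (the names and the string-based representation of parameter types are invented for the example):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Symbol-action: compares the parameter types of an override against the
// base-class declaration. A pair such as DWORD_PTR vs DWORD coincides on
// 32 bits but diverges on 64 bits, so any mismatch is reported.
bool CheckVirtual(const std::vector<std::string> &baseParams,
                  const std::vector<std::string> &derivedParams) {
    if (baseParams.size() != derivedParams.size())
        return false;                       // signatures differ: report a defect
    for (std::size_t i = 0; i != baseParams.size(); ++i)
        if (baseParams[i] != derivedParams[i])
            return false;                   // potential 64-bit defect
    return true;                            // the override is safe
}
```

The parser would invoke such a function immediately after matching the heading of a virtual function, passing it the type information gathered by the semantic analysis unit.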
For example, a fragment of the grammar for testing the correctness of virtual functions (from section 1.4) can look as follows:

<HEADING_OF_VIRTUAL_FUNCTION> -> <virtual> <HEADING_OF_FUNCTION> CheckVirtual()

Here CheckVirtual() is that very symbol-action. The action CheckVirtual() will be called immediately after the recursive descent procedure has detected the definition of a virtual function in the code being analyzed. And only inside the procedure CheckVirtual() will the correctness of the arguments in the definition of the virtual function be checked.

Checks of all the potentially unsafe constructions in C and C++ which are described in [1] are presented as similar symbols-actions. These symbols-actions are added into the language's grammar, or more exactly, into the syntactic analyzer, which calls them when parsing the program code.

5 Results

The architecture and structure of the code analyzer discussed in the article became the basis of the commercial product Viva64 [8]. Viva64 is a static analyzer of the code of programs written in C and C++. It is intended for detecting, in the source code of programs, syntactic constructions which are potentially unsafe from the viewpoint of porting code to 64-bit systems.

6 Conclusion

A static analyzer is a program consisting of two parts:

  • the front-end compiler;
  • the subsystem of diagnosing potentially unsafe syntactic constructions.

The front-end compiler is a traditional component of a common compiler, so the principles of its design and development have been examined rather thoroughly.

The subsystem of diagnosing potentially unsafe syntactic constructions is the component that makes static code analyzers unique, each differing in the set of tasks solved. Thus, within the framework of this article the task of porting program code to 64-bit systems is discussed.
It is the knowledge about 64-bit software that became the basis of the diagnosing subsystem.

Integration of the front-end compiler from the VivaCore project [7] and the knowledge about 64-bit software [1] allowed us to develop the program product Viva64 [8].

References

  1. A. Karpov. 20 issues of porting C++ code on the 64-bit platform // RSDN Magazine #1-2007.
  2. A. Aho, R. Sethi, J. Ullman. Compilers: Principles, Techniques and Tools. Translated from English. Moscow: Publishing house "Williams", 2003. 768 pp.
  3. V.A. Serebryakov, M.P. Galochkin. Foundations of Designing Compilers. Moscow: Editorial URSS, 2001. 224 pp.
  4. E.A. Zuev. Principles and Methods of Creating the Front-End Compiler of Standard C++. Candidate's thesis in physico-mathematical sciences. Moscow, 1999.
  5. I.A. Volkova, T.V. Rudenko. Formal Grammars and Languages. Elements of Translation Theory. MSU named after M.V. Lomonosov, Faculty of Computational Mathematics and Cybernetics. 2nd edition, revised and corrected. Moscow: Dialog-MSU, 1999. 62 pp.
  6. OpenC++ (C++ frontend library).
  7. VivaCore library.
  8. Viva64 tool.