Using Static Analysis in Program Development



Published in: Technology
Authors: Alexey Kolosov
Date: 31.01.2008

Abstract

Static analysis allows checking program code before the tested program is executed. The static analysis process consists of three steps. First, the analyzed program code is split into tokens, i.e. constants, identifiers, reserved symbols, etc. This operation is performed by a lexer. Second, the tokens are passed to a parser, which builds an abstract syntax tree (AST) based on the tokens. Finally, the static analysis is performed over the AST. This article describes three static analysis techniques: AST walker analysis, data flow analysis and path-sensitive data flow analysis.

Introduction

Application testing is an important part of the software development process. There are many different types of software testing. Two of them involve the application's code: static analysis and dynamic analysis.

Dynamic analysis is performed on the executable code of a compiled program. It checks only the behavior exercised by a particular run; that is, only the code executed during a test is checked. A dynamic analyzer can provide the developer with information on memory leaks, the program's performance, the call stack, etc.

Static analysis allows checking program code before the tested program is executed. A compiler always performs static analysis during the compilation process. However, in large, real-life projects it is often necessary to make the entire source code fulfill some additional requirements. These additional requirements may vary from variable naming to portability (for example, the code should execute successfully on both x86 and x64 platforms). The most common requirements are:

• Reliability - fewer bugs in the tested program.
• Maintainability - better understanding of the source code by others, so that it is easier to upgrade or change the source code.
• Testability - shorter testing time due to a more effective testing process.
• Portability - flexibility when the tested program is launched on different hardware platforms (for example, x86 and x64, as mentioned above).
• Readability - better understanding of the code by others and therefore shorter review times and clearer code reading[1].

All requirements can be divided into two categories: rules and guidelines. Rules describe what is mandatory, while guidelines describe what is recommended (by analogy with errors and warnings produced by the built-in code analyzers of standard compilers).

Rules and guidelines form a coding standard. A coding standard defines the way a developer must and should write program code.

A static analyzer finds source code lines that presumably do not fulfill the specified coding standard and displays diagnostic messages so that the developer can understand what is wrong with these lines. The static analysis process is similar to compilation except that no executable or object code is generated. This article describes the static analysis process step by step.

The Analysis Process

The static analysis process consists of two stages: abstract syntax tree creation and abstract syntax tree analysis.

In order to analyze source code, a static analysis tool should "understand" the code, that is, parse it and create a structure that describes the code in a convenient form. This form is called an abstract syntax tree (often referred to as AST). To check whether source code fulfils a coding standard, this tree must be built.

In the general case, an abstract syntax tree is built only for an analyzed fragment of source code (for example, for a specific function). Before the tree can be built, the source code is first processed by a lexer and then by a parser.

The lexer is responsible for dividing the input stream into individual tokens, identifying the type of each token, and passing the tokens one at a time to the next stage of the analysis. The lexer reads text data line by line and splits each line into reserved words, identifiers and constants, which are called tokens. After a token is retrieved, the lexer identifies the type of the token.

If the first character of the token is a digit, the token is a number; if the first character is a minus sign, the token is a negative number. A number may be a real or an integer. If it contains a decimal point or the letter E (which indicates scientific notation), it is a real; otherwise it is an integer. Note that this could mask a lexical error. If the analyzed source code contains the token "4xyz", the lexer will turn it into the integer 4.
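
The number rules above can be sketched in C roughly as follows. This is a minimal illustration, not the lexer of any particular tool: the names TokenKind, classify_number and has_trailing_garbage are hypothetical, and the sketch assumes the token text is a NUL-terminated string. The second function shows one way to notice the "4xyz" case instead of silently returning 4.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch. */
typedef enum { TOK_INTEGER, TOK_REAL, TOK_OTHER } TokenKind;

/* Classify the numeric prefix of `text` by the rules described above.
   The length of the consumed prefix is stored in *len (0 if not a number). */
TokenKind classify_number(const char *text, size_t *len)
{
    size_t i = 0;
    TokenKind kind = TOK_INTEGER;

    if (text[i] == '-')                     /* leading minus: negative number */
        i++;
    if (!isdigit((unsigned char)text[i])) {
        *len = 0;
        return TOK_OTHER;                   /* not a number at all */
    }
    while (isdigit((unsigned char)text[i]))
        i++;
    if (text[i] == '.') {                   /* a decimal point makes it a real */
        kind = TOK_REAL;
        i++;
        while (isdigit((unsigned char)text[i]))
            i++;
    }
    if (text[i] == 'e' || text[i] == 'E') { /* scientific notation */
        size_t j = i + 1;
        if (text[j] == '+' || text[j] == '-')
            j++;
        if (isdigit((unsigned char)text[j])) {
            kind = TOK_REAL;
            while (isdigit((unsigned char)text[j]))
                j++;
            i = j;
        }
    }
    *len = i;
    return kind;
}

/* Returns 1 if the number is directly followed by a letter or underscore,
   as in "4xyz", where a naive lexer would just return the integer 4. */
int has_trailing_garbage(const char *text, size_t len)
{
    unsigned char c = (unsigned char)text[len];
    return len > 0 && (isalpha(c) || c == '_');
}
```

For the input "4xyz", classify_number reports an integer of length 1, and has_trailing_garbage then returns 1, so a lexer built this way could report the problem itself rather than leave it to the parser.
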
It is likely that any such error will cause a syntax error, which the parser can catch. However, such errors can also be handled by the lexer.

If the token is not a number, it could be a string. String constants can be identified by double quotes, single quotes or other symbols, depending on the syntax of the analyzed language.

Finally, if the token is not a string, it must be an identifier, a reserved word, or a reserved symbol. If the token is not identified as one of them, it is a lexical error. In this design the lexer does not handle errors itself, so it simply notifies the parser that an unidentified token type has been found. The parser will handle the error[2].

The parser has an understanding of the language's grammar. It is responsible for identifying syntax errors and for translating an error-free program into internal data structures, abstract syntax trees, that can be processed by the static analyzer.

While the lexer knows only how the language's tokens are formed, the parser also recognizes the context in which they appear. For example, let us declare a C function:

    int Func(){return 0;}

The lexer will process this line in the following way (see table 1):

    Token     Type
    int       reserved word
    Func      identifier
    (         reserved symbol
    )         reserved symbol
    {         reserved symbol
    return    reserved word
    0         integer constant
    ;         reserved symbol
    }         reserved symbol

Table 1. Tokens of the "int Func(){return 0;}" string.

The line will be identified as 9 correct tokens, and these tokens will be passed to the parser. The parser will check the context and find out that this is a declaration of a function that takes no parameters, returns an integer, and always returns 0.

The parser finds this out by creating an abstract syntax tree from the tokens provided by the lexer and analyzing the tree. If the tokens and the tree built from them are considered correct, the tree will be used for the static analysis. Otherwise, the parser will report an error.

However, building an abstract syntax tree is not just a matter of organizing a set of tokens in tree form.

Abstract Syntax Trees

An abstract syntax tree captures the essential structure of the input in tree form, while omitting unnecessary syntactic details. ASTs can be distinguished from concrete syntax trees by their omission of tree nodes that represent punctuation marks, such as the semicolons that terminate statements or the commas that separate function arguments. ASTs also omit tree nodes that represent unary productions in the grammar. Such information is directly represented in ASTs by the structure of the tree.

ASTs can be created with hand-written parsers or by code produced by parser generators. ASTs are generally created bottom-up.

When designing the nodes of the tree, a common design choice is the granularity of the representation of the AST. That is, whether every construct of the source language is represented by a different type of AST node, or whether some constructs of the source language are represented by a common type of AST node and differentiated using a value. One example of choosing the granularity of representation is determining how to represent binary arithmetic operations.
One choice is to have a single binary operation tree node, which has the operation, e.g. "+", as one of its attributes. The other choice is to have a tree node for every binary operation. In an object-oriented language, this would result in classes like AddBinary, SubtractBinary, MultiplyBinary, etc. with an abstract base class Binary[3].

For example, let us parse two expressions: 1 + 2 * 3 + 4 * 5 and 1 + 2 * (3 + 4) * 5 (see figure 1).
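
The first design choice, a single node type with the operation stored as an attribute, can be sketched in C as below. This is an illustrative recursive-descent parser written for this example only; the names Node, parse_expression, evaluate and print_tree are hypothetical. It handles only non-negative integers, "+", "*" and parentheses, and it assumes well-formed input. Note that parentheses leave no node in the tree: the grouping is expressed by the tree structure alone.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* One node type for every binary operation; `op` tells them apart.
   op == 0 marks a leaf that holds an integer constant. */
typedef struct Node {
    char op;                      /* '+', '*', or 0 for a constant */
    int value;                    /* used only when op == 0 */
    struct Node *left, *right;
} Node;

static Node *new_node(char op, int value, Node *l, Node *r)
{
    Node *n = malloc(sizeof *n);
    if (!n) abort();
    n->op = op; n->value = value; n->left = l; n->right = r;
    return n;
}

static void skip_spaces(const char **s)
{
    while (isspace((unsigned char)**s)) (*s)++;
}

static Node *parse_sum(const char **s);

/* primary := integer | '(' sum ')' */
static Node *parse_primary(const char **s)
{
    skip_spaces(s);
    if (**s == '(') {
        (*s)++;
        Node *inner = parse_sum(s);
        skip_spaces(s);
        if (**s == ')') (*s)++;   /* the parentheses produce no node */
        return inner;
    }
    int v = 0;
    while (isdigit((unsigned char)**s)) v = v * 10 + (*(*s)++ - '0');
    return new_node(0, v, NULL, NULL);
}

/* product := primary ('*' primary)*   (binds tighter than '+') */
static Node *parse_product(const char **s)
{
    Node *left = parse_primary(s);
    for (skip_spaces(s); **s == '*'; skip_spaces(s)) {
        (*s)++;
        left = new_node('*', 0, left, parse_primary(s));
    }
    return left;
}

/* sum := product ('+' product)* */
static Node *parse_sum(const char **s)
{
    Node *left = parse_product(s);
    for (skip_spaces(s); **s == '+'; skip_spaces(s)) {
        (*s)++;
        left = new_node('+', 0, left, parse_product(s));
    }
    return left;
}

Node *parse_expression(const char *text) { return parse_sum(&text); }

/* A tiny tree walk: compute the value of the expression. */
int evaluate(const Node *n)
{
    if (n->op == 0) return n->value;
    int l = evaluate(n->left), r = evaluate(n->right);
    return n->op == '+' ? l + r : l * r;
}

/* Print the tree with two spaces of indentation per level. */
void print_tree(const Node *n, int depth)
{
    if (n->op == 0) { printf("%*s%d\n", depth * 2, "", n->value); return; }
    printf("%*s%c\n", depth * 2, "", n->op);
    print_tree(n->left, depth + 1);
    print_tree(n->right, depth + 1);
}

void free_tree(Node *n)
{
    if (!n) return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}
```

Calling print_tree(parse_expression("1 + 2 * (3 + 4) * 5"), 0) prints an indented view of the second tree. With the other design choice, the op field would be replaced by one node type per operation.
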

Figure 1. Parsed expressions: 1 + 2 * 3 + 4 * 5 (left) and 1 + 2 * (3 + 4) * 5 (right).

As one can see from the figure, the expression can be restored to its original form by walking the tree from left to right.

After the abstract syntax tree is created and verified, the static analyzer is able to check whether the source code fulfils the rules and guidelines specified by the coding standard.

Static Analysis Techniques

There are many different analysis techniques, such as AST walker analysis, data flow analysis, path-sensitive data flow analysis, etc. Concrete implementations of these techniques vary from tool to tool. Static analysis tools for different programming languages can be based on various analysis frameworks. These frameworks contain core sets of common techniques, which can be used in static analysis tools so that these tools reuse the same infrastructure. The supported analysis techniques and the way these techniques are implemented vary from framework to framework. For example, a framework may provide an easy way to create an AST walker analyzer, but have no support for data flow analysis[4].

Although all three of the above-mentioned analysis techniques use the AST created by the parser, the techniques differ in their algorithms and purposes.

AST walker analysis, as the term suggests, is performed by walking the AST and checking whether it fulfils the coding standard, specified as a set of rules and guidelines. This is the kind of analysis performed by compilers.

Data flow analysis can be described as a process that collects information about the use, definition, and dependencies of data in programs. The data flow analysis algorithm operates on a control flow graph (CFG), generated from the source code AST. The CFG represents all possible execution paths of a given computer program: the nodes represent pieces of code and the edges represent possible control transfers between these code pieces.
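
To make the CFG idea concrete, here is a minimal C sketch of one classic data flow problem, live-variable analysis. That problem is chosen here only for illustration, as the article does not name a specific one. All names (CfgNode, live_variables, example_cfg) and the hand-built example graph are hypothetical. A set of variables is stored as a bit mask. The analysis treats every edge as possibly taken and repeats the computation until the sets stop changing.

```c
#include <stddef.h>

#define MAX_SUCC 2

/* Each variable is one bit, so a set of variables is one unsigned value. */
enum { VAR_X = 1, VAR_N = 2, VAR_Y = 4 };

typedef struct {
    unsigned def;                 /* variables written by this piece of code */
    unsigned use;                 /* variables read before being written */
    int succ[MAX_SUCC];           /* indices of successor nodes, -1 if unused */
} CfgNode;

/* Computes live_in[i]: the variables whose current value may still be read
   on some path starting at node i. Returns the number of passes made. */
int live_variables(const CfgNode *cfg, size_t count, unsigned *live_in)
{
    int passes = 0, changed = 1;

    for (size_t i = 0; i < count; i++)
        live_in[i] = 0;
    while (changed) {             /* repeat until no set changes */
        changed = 0;
        passes++;
        for (size_t i = count; i-- > 0; ) {
            unsigned out = 0;     /* union over all successors: every edge
                                     is assumed to be possibly taken */
            for (int k = 0; k < MAX_SUCC; k++)
                if (cfg[i].succ[k] >= 0)
                    out |= live_in[cfg[i].succ[k]];
            unsigned in = cfg[i].use | (out & ~cfg[i].def);
            if (in != live_in[i]) {
                live_in[i] = in;
                changed = 1;
            }
        }
    }
    return passes;
}

/* Hand-built CFG for:  x = 1; while (x < n) x = x + y; return x; */
const CfgNode example_cfg[4] = {
    { VAR_X, 0,             {  1, -1 } },   /* 0: x = 1          */
    { 0,     VAR_X | VAR_N, {  2,  3 } },   /* 1: while (x < n)  */
    { VAR_X, VAR_X | VAR_Y, {  1, -1 } },   /* 2: x = x + y      */
    { 0,     VAR_X,         { -1, -1 } },   /* 3: return x       */
};
```

For this graph the set computed for node 0 is {n, y}: both variables may be read before anything in the fragment assigns them. An analyzer could use a fact of this kind to report a possibly uninitialized variable. The loop body and the loop exit are both followed, regardless of which one a real run would take.
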
Since the analysis is performed without executing the tested program, it is impossible to determine the exact output of the program, i.e. to find out which execution path in the control flow graph is actually taken. That is why data flow analysis algorithms make approximations of this behavior, for example, by considering both branches of an if-then-else statement and by performing a fixed-point computation for the body of a while statement. Such a fixed point always exists because the data flow equations compute sets of variables, and there is only a finite number of variables available, since we only consider programs with a finite number of statements. Therefore, there is a finite upper limit to the number of elements of the computed sets, and because these sets only grow from one iteration to the next, the computation must eventually stop changing. In terms of control flow graphs, static analysis means that all possible execution paths are considered to be actual execution paths. The result of this assumption is that one can only obtain approximate solutions for certain data flow analysis problems[5].

The data flow analysis algorithm described above is path-insensitive, because it contributes all execution paths - whether feasible or infeasible, heavily or rarely executed - to a solution. However, programs execute only a small fraction of their potential paths and, moreover, execution time and cost are usually concentrated in a far smaller subset of hot paths. Therefore, it is natural to reduce the analyzed CFG, and with it the amount of calculation, so that only a subset of the CFG paths is analyzed. Path-sensitive analysis operates on a reduced CFG, which excludes infeasible paths and paths that contain no "dangerous" code. The path selection criteria differ from tool to tool. For example, a tool may analyze only the paths containing dynamic array declarations, which are considered "dangerous" according to the tool's settings.

Conclusion

The number of static analysis tools and techniques grows from year to year, which reflects the growing interest in static analyzers. The cause of this interest is that the software under development becomes more and more complex and, therefore, it becomes impossible for developers to check the source code manually.

This article gave a brief description of the static analysis process and analysis techniques.

References

1. Dirk Giesen. Philosophy and practical implementation of static analyzer tools [Electronic resource]. - Electronic data. - Dirk Giesen, cop. 1998. - Access mode:
2. James Alan Farrell. Compiler Basics [Electronic resource]. - Electronic data. - James Alan Farrell, cop. 1995. - Access mode:
3. Joel Jones. Abstract syntax tree implementation idioms [Electronic resource]. - Proceedings of the 10th Conference on Pattern Languages of Programs 2003, cop. 2003.
4. Ciera Nicole Christopher. Evaluating Static Analysis Frameworks [Electronic resource]. - Ciera Nicole, cop. 2006. - Access mode:
5. Leon Moonen. A Generic Architecture for Data Flow Analysis to Support Reverse Engineering [Electronic resource]. - Proceedings of the 2nd International Workshop on the Theory and Practice of Algebraic Specifications, cop. 1997.