Static Analysis of Computer Programs

1 Introduction
The problem of slicing requires us to solve data dependencies and control dependencies. Finding data dependencies requires solving the reaching definitions problem, computing control dependencies requires finding dominance frontiers, and the problem of aliasing has to be solved for computing correct data dependencies. Each of these individual problems can be viewed as an instance of the more general problem of program analysis.
2 Program Analysis 
Program analysis is the process of statically computing properties of a program, and is useful for performing optimizations. Many interesting properties can be queried about programs [?]. For slicing in particular, we are interested in the following:
1. What definitions of a variable can reach a particular program point 
2. What objects can a reference variable point to 
3. What types can a receiver variable point to 
4. Can the pointer p point to null 
5. Which nodes dominate a given node (dominator information is needed for computing control dependences)
6. What variables can be modified or referenced by a procedure 
The actual paths taken during execution are difficult to determine, so it is difficult to give precise answers to many of these questions; instead, approximate solutions are computed. These solutions are conservative in the sense that they err on the safe side: they overestimate the number of definitions reaching a program point, or the set of objects a reference variable can point to.
Early data flow analyses were performed by executing the program symbolically, tracing through all control flow paths and collecting information about dataflow values. This is computationally expensive, and there can be termination problems if dataflow values keep alternating. Monotone data flow frameworks overcome these problems by:
1. Putting a partial order on the abstract data values such that they change in the same direction during abstract interpretation, reducing termination problems.
2. Assuming, when two control flow paths merge, conservative information that holds about the abstract values along both paths. A semilattice L is used to represent the abstract values because the meet operation gives us exactly this information.
3. Assigning to every node v in the CFG a variable [[v]] ranging over the elements of L.
4. Giving, for every point in the program, an equation that relates the value of the variable of the corresponding node to those of other nodes (typically the neighbors).
To guarantee termination, two useful restrictions are placed on the general framework: all transfer functions must be monotone, and the lattice must have finite height. Intuitively, this means that the values can only climb up the lattice, and since the height is finite they can at most reach the top and thus have to stabilize. This framework is called the monotone data flow analysis framework; formally, it is a triple (L, ∧, F), where L is a semilattice of finite height with meet operation ∧ and F is a set of monotone transfer functions over L.
With each node a transfer function is associated which maps each value in the semilattice to another value in the semilattice. The set of such functions, one at every node in the program, forms a system of dataflow equations. A fixed point solution to these equations gives the conservative information that can be assumed to be valid at every node in the program.
For dataflow frameworks, the set of functions F has to satisfy the following properties:
1. Closedness under composition. This is important because it lets us summarize the effect of a sequence of statements without leaving the function space.
2. Presence of the identity function, to account for empty basic blocks.
3. Closedness under pointwise meet: if h(x) = f(x) ∧ g(x) then h ∈ F. This is necessary because h can represent the effect of the convergence of two control flow paths.
4. Monotonicity of the functions in F, which guarantees the existence of a fixed point solution. Distributivity of the functions is a stronger property; as discussed in Section 2.3.2, it makes the computable fixed point solution coincide with the ideal meet-over-all-paths solution.
2.1 Characterization of data flow frameworks 
In the presence of loops, the transfer function f has to be applied multiple times to summarize the effect of the loop. If f describes the dataflow effect of going once around a cyclic path, then its closure f* represents the effect of a loop whose number of iterations is a priori indeterminate.
f is said to be k-bounded if the solution reaches a fixed point before the k-th pass around the loop. For the classical bitvector problems, the transfer function has the form f(x) = GEN ∪ (X − KILL); thus f ∘ f = f, making them 2-bounded. Frameworks for which f ∘ f = f holds are called fast frameworks.
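As a concrete check, the following C sketch (with invented GEN and KILL bit patterns) represents gen/kill transfer functions as bitvector pairs, shows that composing two of them yields another gen/kill function, and verifies f ∘ f = f on all inputs:

#include <assert.h>
#include <stdio.h>

/* A gen/kill transfer function over bitvector dataflow values:
 * f(x) = GEN | (x & ~KILL) */
typedef struct { unsigned gen, kill; } Fn;

static unsigned apply(Fn f, unsigned x) { return f.gen | (x & ~f.kill); }

/* Composition (g after f) stays inside the gen/kill function space. */
static Fn compose(Fn g, Fn f) {
    Fn h = { g.gen | (f.gen & ~g.kill), g.kill | f.kill };
    return h;
}

int main(void) {
    Fn f = { 0x5, 0xA };               /* arbitrary illustrative values */
    Fn ff = compose(f, f);
    for (unsigned x = 0; x < 16; x++)  /* check f(f(x)) == f(x) everywhere */
        assert(apply(ff, x) == apply(f, x));
    puts("f o f == f: the framework is fast (2-bounded)");
    return 0;
}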
In summary, a dataflow framework is characterized by:
1. Algebraic properties of the functions in F (monotonicity, distributivity)
2. Finiteness properties of the functions in F (boundedness, fastness, rapidity)
3. Finiteness properties of the lattice L (height)
4. Partitionability of L and F
There is an important class of k-bounded partitionable problems called bitvector problems.
2.2 Dataflow Analysis Examples 
Dataflow analysis techniques are used for the discovery of compile time code improvements.
Reaching Definitions. The reaching definitions problem asks, for each node (program point) n, which assignments might have determined the values of the program variables at n. It uses (℘(Asgn), ⊆, ∪, Asgn, ∅) as the property lattice.
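Written out, the equations for reaching definitions at a node n have the standard gen/kill form (using flattened subscripts, as in the liveness example later in these notes):

RDin(n) = ∪ { RDout(p) | p ∈ pred(n) }
RDout(n) = GEN(n) ∪ (RDin(n) − KILL(n))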
2.3 Solution Procedures 
The solution to a dataflow problem can be characterized in more than one way; the principal notions, and procedures for computing them, are described below.
2.3.1 Meet over all paths solution 
A transfer function is associated with every basic block. We can define a transfer function for a path P given by x0 → x1 → x2 → ... → xn as the composition of the transfer functions of the individual basic blocks, fP = fxn ∘ ... ∘ fx1. The meet over all paths solution at a node n is then given by

MOP(n) = ∧ { fP(init) | P a path from the start node to n }
If all paths are executable, the MOP solution is the best statically determinable solution. However, it is infeasible to determine statically which paths are actually executable, so additional infeasible paths may be included; this is where the inaccuracy comes from. Thus the MOP solution is a conservative approximation of the real solution.
2.3.2 Maximal Fixed Point solution 
The MFP solution is the maximal fixed point of the system of dataflow equations: iteration starts with every node variable at the top element and repeatedly applies the equations [[v]] = fv(∧ { [[w]] : w ∈ pred(v) }) until the values stabilize.
Though MOP is the best possible solution, it has been proved that a general algorithm to compute the MOP solution does not exist [?]. Intuitively, this is due to the presence of loops, which leads to an infinite number of paths. The MFP solution sacrifices precision for computability: it is less precise because it considers only the effects of adjacent neighbors on the flow values. For distributive frameworks the two solutions are identical.
2.3.3 Iterative Methods 
The iterative method solves the system of equations by initializing the node variables to some conservative values and successively recomputing them until a fixed point is reached. The naive implementation of this is called chaotic iteration. Since this is not efficient, cleverer iterative schemes such as worklist, round-robin, and node-listing algorithms are used.
Kam and Ullman showed that there is a class of problems that a round-robin algorithm visiting nodes in reverse postorder can solve in d(G) + 3 passes over the graph G, where the depth d(G) is the maximum number of back edges that can occur on any acyclic path in G.
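The following minimal C sketch makes this concrete for reaching definitions: it runs round-robin iteration over a small hard-coded CFG, with unsigned ints as bitvector sets. The CFG shape and the GEN/KILL values are invented for illustration.

#include <stdio.h>

#define N 4   /* nodes: 0 -> 1, 1 -> {2,3}, 2 -> 1 (a loop), 3 = exit */

/* predecessor lists, -1 terminated; the CFG is invented for illustration */
static const int pred[N][3] = { {-1,-1,-1}, {0,2,-1}, {1,-1,-1}, {1,-1,-1} };

/* one bit per definition d1..d4; GEN/KILL values are also invented */
static const unsigned gen [N] = { 0x3, 0x4, 0x8, 0x0 };
static const unsigned kill[N] = { 0x8, 0x0, 0x1, 0x0 };

static unsigned in[N], out[N];   /* start at the empty set (bottom for RD) */

int main(void) {
    int changed = 1;
    while (changed) {                        /* round-robin passes */
        changed = 0;
        for (int n = 0; n < N; n++) {
            unsigned i = 0;
            for (int k = 0; k < 3 && pred[n][k] >= 0; k++)
                i |= out[pred[n][k]];        /* meet is union for RD */
            unsigned o = gen[n] | (i & ~kill[n]);
            if (i != in[n] || o != out[n]) changed = 1;
            in[n] = i;
            out[n] = o;
        }
    }
    for (int n = 0; n < N; n++)
        printf("node %d: in=%#x out=%#x\n", n, in[n], out[n]);
    return 0;
}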
2.3.4 Elimination Methods 
When the CFG is reducible, which means that there are no multiple-entry loops, elimination algorithms are preferred because they are usually more efficient than iterative algorithms. The central idea of elimination methods is to represent all paths from the start node to a particular node as a regular expression. We can get a high-level summary function fre corresponding to the regular expression by replacing each node with its transfer function, concatenation with function composition, the * operator with function closure, and union with the meet operation. The data value at a node is then fre(init). The reducibility of the CFG ensures that the summary function can be obtained from the above operations. Since a regular expression is used to represent a set of paths, elimination methods are also called path algebra based approaches.
Elimination methods exploit the structural properties of the graph: the flow graph is reduced to a single node using a series of graph transformations, and the dataflow properties of a node in a region are determined from the dataflow properties of the region's header node.
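For a fast (2-bounded) gen/kill framework with meet = union, the * operator has a particularly simple form: f*(x) = x ∧ f(x) = x ∪ GEN, so the loop summary keeps the body's GEN set and kills nothing. A minimal C sketch (with invented bit patterns) computes this closure:

#include <stdio.h>

typedef struct { unsigned gen, kill; } Fn;   /* f(x) = gen | (x & ~kill) */

/* Closure f* summarizing "zero or more iterations" of a loop body f.
 * With meet = union and f o f = f, f*(x) = x | f(x) = x | f.gen,
 * i.e. the loop may add its GEN set but cannot be assumed to kill. */
static Fn closure(Fn f) {
    Fn s = { f.gen, 0u };
    return s;
}

int main(void) {
    Fn body = { 0x5, 0xA };                  /* invented loop-body function */
    Fn loop = closure(body);
    printf("loop summary: gen=%#x kill=%#x\n", loop.gen, loop.kill);
    return 0;
}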
3 Other Methods of Program Analysis 
1. Abstract Interpretation 
2. Constraint based analysis 
Many data flow analysis problems, like reaching definitions, can be efficiently solved using abstract interpretation. For other problems, like pointer analysis and type inference, a constraint based approach is more suitable.
Some interesting pieces of information cannot be gathered using data flow analysis, because data flow analysis does not make use of the semantics of the programming language's operators.
Abstract Interpretation 
Abstract interpretation amounts to executing the program on abstract values instead of actual values. For example, the abstract values for integers can be their signs, or the property of being even or odd. Since the concrete computation is not performed, the results inferred by abstract interpretation can only be approximate. Another reason for approximation is that abstract interpretation computes conservative information that is valid on all control flow paths.
Abstract interpretation is a theory of semantics approximation. The idea is to create a new semantics of the programming language such that evaluation under the new semantics always terminates, and the store at every program point contains a superset of the values that are possible under the actual semantics, for every possible input. Since in this new semantics a store does not contain a single value for a variable but a set (or interval) of possible values, the evaluation of boolean and arithmetic expressions must be redefined.
Abstract interpretation is a very powerful program analysis method. It uses information about the programming language's semantics and can detect possible runtime errors, like division by zero or variable overflow. Since abstract interpretation can be computationally very expensive, care must be taken to choose an appropriate value domain and an appropriate heuristic for loop termination to ensure feasibility.
For example, in the following program, determining the actual values x can take is not feasible. However, if we are interested in abstract values of x, say odd and even, we can determine that x can only have even values. Abstract interpretation determines this information.
void f() {
    int x = 2;
    while (...) {
        // what values can x have here?
        x = x + 2;
    }
}
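A minimal sketch of this analysis in C follows: it abstracts x to the parity domain {BOT, EVEN, ODD, TOP} and iterates the loop's abstract effect to a fixed point. The domain encoding and the driver are illustrative, not a general framework.

#include <stdio.h>

typedef enum { BOT, EVEN, ODD, TOP } Parity;   /* the abstract domain */

/* join (merge of two control flow paths) in the parity lattice */
static Parity join(Parity a, Parity b) {
    if (a == BOT) return b;
    if (b == BOT) return a;
    return (a == b) ? a : TOP;
}

/* abstract effect of "x = x + 2": adding an even constant keeps parity */
static Parity add_two(Parity x) { return x; }

int main(void) {
    Parity x = EVEN;                 /* abstracts "int x = 2;" */
    for (;;) {                       /* execute the loop abstractly */
        Parity at_head = join(x, add_two(x));  /* 0 or more iterations */
        if (at_head == x) break;     /* fixed point reached */
        x = at_head;
    }
    printf("x in the loop: %s\n", x == EVEN ? "even" : "not provably even");
    return 0;
}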
The most common application of abstract interpretation is the solution of dataflow problems.
Set based analysis 
In set based analysis, sets of values are associated with each variable. The major difference between set based analysis and abstract interpretation is that set based analysis does not employ an abstract domain to approximate the underlying domain of computation. Instead, the approximation occurs because all dependencies between variables are ignored. For example, if at some point in the program the environments [x → 1, y → 2] and [x → 3, y → 4] can be encountered, then set based analysis will conclude that x can have values {1, 3} and y can have values {2, 4}. The dependency "x is 1 when y is 2" is lost.
In constraint based analysis, the properties to be computed are expressed as 
set constraints. In the following example, the property to be computed is the 
set of values a variable can have at runtime. 
if (...) {
    x = 3;
} else {
    x = 6;
}
// x can have {3,6}
print(x)
These sets need not represent only integers, as in the above example; they can represent the points-to sets of a variable, or the types of a receiver variable.
This "set based" approach to program analysis consists of three phases. The first step is to label the program points of interest, which may be terms, expressions or program variables. The second step is to associate with each label a variable which denotes the abstract values at that point; one can then derive a set of constraints on these variables. In the final step, these constraints are solved to find their minimum solution (this solving process is the main computational part of the analysis).
To solve the constraints, the usual procedure is to represent each set expression as a node and each constraint as a directed edge. A transitive closure of this graph gives all the constraints that can be inferred.
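The sketch below shows this in miniature: a hypothetical three-node constraint graph where an edge a → b stands for set(a) ⊆ set(b), solved by propagating bitvector sets along edges until nothing changes.

#include <stdio.h>

#define N 3
/* edge[a][b] encodes the constraint set(a) ⊆ set(b); graph is invented */
static const int edge[N][N] = { {0,1,0}, {0,0,1}, {0,0,0} };

int main(void) {
    unsigned set[N] = { 0x1, 0x2, 0x0 };  /* initial constant constraints */
    int changed = 1;
    while (changed) {                     /* propagate until fixed point */
        changed = 0;
        for (int a = 0; a < N; a++)
            for (int b = 0; b < N; b++)
                if (edge[a][b] && (set[b] | set[a]) != set[b]) {
                    set[b] |= set[a];
                    changed = 1;
                }
    }
    for (int n = 0; n < N; n++)
        printf("x%d can contain: %#x\n", n, set[n]);
    return 0;
}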
Data flow analysis can be viewed as set based analysis. For example, consider live variable analysis for C programs:
1. Domain: sets of program variables
2. Variables: Sin, Sout
3. Constants: Sdef, Suse
4. Constraints: Sin = Suse ∪ (Sout − Sdef)
5. Sout = ∪ { Xin | X ∈ succ(S) }
3.1 Interprocedural Analysis - Context Sensitivity 
3.1.1 Functional Approach 
The idea behind the functional approach is the same as that of elimination algorithms, where we used a large transfer function to summarize the effect of a region. The set of valid paths in the intraprocedural case can be represented by a regular expression; in the interprocedural case, calls and returns have to be matched, so the set of valid paths is given by a context free grammar in which a nonterminal represents a procedure. It is possible to compute a large function corresponding to each nonterminal which summarizes the effects of the paths produced by that nonterminal, as sketched below.
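The following toy C sketch suggests the idea under strong simplifying assumptions: a callee's body is summarized once as a gen/kill function (the statement values are invented), and the resulting summary is then reused at every call site instead of reanalyzing the callee per calling context.

#include <stdio.h>

typedef struct { unsigned gen, kill; } Fn;   /* f(x) = gen | (x & ~kill) */

static unsigned apply(Fn f, unsigned x) { return f.gen | (x & ~f.kill); }

static Fn compose(Fn g, Fn f) {              /* (g o f)(x) = g(f(x)) */
    Fn h = { g.gen | (f.gen & ~g.kill), g.kill | f.kill };
    return h;
}

int main(void) {
    /* summarize a callee made of two statements s1; s2 once ... */
    Fn s1 = { 0x1, 0x2 }, s2 = { 0x4, 0x1 };
    Fn summary = compose(s2, s1);
    /* ... then reuse the summary at every call site */
    printf("call site A: %#x\n", apply(summary, 0x8));
    printf("call site B: %#x\n", apply(summary, 0x3));
    return 0;
}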
3.1.2 Call String Approach 
The idea behind the call string approach is that the contents of the call stack are an important element for distinguishing dataflow information: the context is usually the current call stack. The use of call strings guarantees that dataflow values from invalid paths are not considered. Since the length of a call string may be unbounded, k-bounded call strings are used, keeping only the most recent k call sites.
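To make the truncation concrete, here is a toy C sketch (the site numbering and struct layout are invented) that keeps only the most recent k = 2 call sites of a context:

#include <stdio.h>

#define K 2   /* bound on the call-string length */

/* a context: the most recent (up to K) call sites on the stack */
typedef struct { int site[K]; int len; } CallString;

/* entering a procedure pushes the call site, dropping the oldest
 * entry once the string exceeds K */
static CallString push(CallString cs, int call_site) {
    if (cs.len < K) {
        cs.site[cs.len++] = call_site;
    } else {
        for (int i = 0; i + 1 < K; i++) cs.site[i] = cs.site[i + 1];
        cs.site[K - 1] = call_site;
    }
    return cs;
}

int main(void) {
    CallString cs = { {0}, 0 };
    cs = push(cs, 1);   /* call from site 1 */
    cs = push(cs, 2);   /* then from site 2 */
    cs = push(cs, 3);   /* site 1 falls off: context is "2 3" */
    for (int i = 0; i < cs.len; i++) printf("%d ", cs.site[i]);
    putchar('\n');
    return 0;
}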