Static Analysis of Computer Programs

1 Introduction
The problem of slicing requires us to solve data dependencies and control dependencies. Finding data dependencies requires solving the reaching definitions problem, computing control dependencies requires finding dominance frontiers, and the problem of aliasing has to be solved for computing correct data dependencies. Each of these individual problems can be viewed as an instance of the more general problem of program analysis.
2 Program Analysis 
Program analysis is the process of statically computing properties of a program, and is useful for performing optimizations. Many interesting properties can be queried about programs [?]. For slicing in particular, we are interested in the following:
1. What definitions of a variable can reach a particular program point 
2. What objects can a reference variable point to 
3. What types can a receiver variable point to 
4. Can the pointer p point to null 
5. Which nodes dominate a given node (dominator information is needed for computing control dependences)
6. What variables can be modified or referenced by a procedure 
The actual paths taken during execution are difficult to determine, so it is difficult to give precise answers to many of these questions; instead, approximate solutions are computed. These solutions are conservative in the sense that they err on the safe side: they overestimate the number of definitions reaching a program point, or the set of objects a reference variable can point to.
Early data flow analyses were performed by executing the program symbolically, tracing through all control flow paths and collecting information about dataflow values. This is computationally expensive, and there can be termination problems if dataflow values keep alternating. Monotone data flow frameworks overcome these problems by:
1. Putting a partial order on the abstract data values such that they change in the same direction during abstract interpretation, reducing termination problems.
2. Assuming, when two control flow paths merge, conservative information that holds about the abstract values along both paths. A semilattice L is used to represent the abstract values because the meet operation gives us exactly this information.
3. Assigning to every node v in the CFG a variable [[v]] ranging over the elements of L.
4. Giving, for every point in the program, an equation that relates the value of the variable of the corresponding node to those of other nodes (typically the neighbors).
To guarantee termination, two useful restrictions are placed on the general framework: all transfer functions must be monotone, and the lattice must have finite height. Intuitively, this means that the values can only climb up the lattice, and since the height is finite they can at most reach the top and thus have to stabilize. This framework is called the monotone data flow analysis framework; formally, it is a triple (L, ∧, F), where L is a semilattice of finite height with meet operation ∧ and F is a set of monotone transfer functions over L.
With each node a transfer function is associated which maps each value in the semilattice to another value in the semilattice. The set of such functions, one at every node in the program, forms a system of dataflow equations. A fixed point solution to these equations gives the conservative information that can be assumed to be valid at every node in the program.
For dataflow frameworks, the set of functions F has to satisfy the following properties:
1. Closedness under composition. This is important because it lets us summarize the effect of a sequence of statements without leaving the function space.
2. Presence of the identity function, to account for empty basic blocks.
3. Closedness under pointwise meet: if h(x) = f(x) ∧ g(x) then h ∈ F. This is necessary because h can represent the effect of the convergence of two control flow paths.
4. Monotonicity of the functions in F, which guarantees the existence of a fixed point solution. Distributivity of the functions is a stronger property; as discussed in Section 2.3.2, it makes the computable fixed point solution coincide with the ideal meet-over-all-paths solution.
2.1 Characterization of data flow frameworks 
In the presence of loops, the transfer function f has to be applied multiple times to summarize the effect of the loop. If f describes the dataflow effect of going once around a cyclic path, then its closure f* represents the effect of a loop whose number of iterations is a priori indeterminate.
f is said to be k-bounded if the solution reaches a fixed point before the k-th pass around the loop. For the classical bitvector problems, the transfer function has the form f(x) = GEN ∪ (X − KILL); thus f ∘ f = f, making them 2-bounded. Frameworks for which f ∘ f = f holds are called fast frameworks.
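As a concrete check, the following C sketch (with invented GEN and KILL bit patterns) represents gen/kill transfer functions as bitvector pairs, shows that composing two of them yields another gen/kill function, and verifies f ∘ f = f on all inputs:

#include <assert.h>
#include <stdio.h>

/* A gen/kill transfer function over bitvector dataflow values:
 * f(x) = GEN | (x & ~KILL) */
typedef struct { unsigned gen, kill; } Fn;

static unsigned apply(Fn f, unsigned x) { return f.gen | (x & ~f.kill); }

/* Composition (g after f) stays inside the gen/kill function space. */
static Fn compose(Fn g, Fn f) {
    Fn h = { g.gen | (f.gen & ~g.kill), g.kill | f.kill };
    return h;
}

int main(void) {
    Fn f = { 0x5, 0xA };               /* arbitrary illustrative values */
    Fn ff = compose(f, f);
    for (unsigned x = 0; x < 16; x++)  /* check f(f(x)) == f(x) everywhere */
        assert(apply(ff, x) == apply(f, x));
    puts("f o f == f: the framework is fast (2-bounded)");
    return 0;
}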
In summary, a dataflow framework is characterized by:
1. Algebraic properties of the functions in F (monotonicity, distributivity)
2. Finiteness properties of the functions in F (boundedness, fastness, rapidity)
3. Finiteness properties of the lattice L (height)
4. Partitionability of L and F
There is an important class of k-bounded partitionable problems called bitvector problems.
2.2 Dataflow Analysis Examples 
Dataflow analysis techniques are used for the discovery of compile time code improvements.
Reaching Definitions. The reaching definitions problem asks, for each node (program point) n, which assignments might have determined the values of the program variables at n. It uses (℘(Asgn), ⊆, ∪, Asgn, ∅) as the property lattice.
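Written out, the equations for reaching definitions at a node n have the standard gen/kill form (using flattened subscripts, as in the liveness example later in these notes):

RDin(n) = ∪ { RDout(p) | p ∈ pred(n) }
RDout(n) = GEN(n) ∪ (RDin(n) − KILL(n))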
2.3 Solution Procedures 
The solution to a dataflow problem can be characterized in more than one way; the principal notions, and procedures for computing them, are described below.
2.3.1 Meet over all paths solution 
A transfer function is associated with every basic block. We can define a transfer function for a path P given by x0 → x1 → x2 → ... → xn as the composition of the transfer functions of the individual basic blocks, fP = fxn ∘ ... ∘ fx1. The meet over all paths solution at a node n is then given by

MOP(n) = ∧ { fP(init) | P a path from the start node to n }
If all paths are executable, the MOP solution is the best statically determinable solution. However, it is infeasible to determine statically which paths are actually executable, so additional infeasible paths may be included; this is where the inaccuracy comes from. Thus the MOP solution is a conservative approximation of the real solution.
2.3.2 Maximal Fixed Point solution 
The MFP solution is the maximal fixed point of the system of dataflow equations: iteration starts with every node variable at the top element and repeatedly applies the equations [[v]] = fv(∧ { [[w]] : w ∈ pred(v) }) until the values stabilize.
Though MOP is the best possible solution, it has been proved that a general algorithm to compute the MOP solution does not exist [?]. Intuitively, this is due to the presence of loops, which leads to an infinite number of paths. The MFP solution sacrifices precision for computability: it is less precise because it considers only the effects of adjacent neighbors on the flow values. For distributive frameworks the two solutions are identical.
2.3.3 Iterative Methods 
The iterative method solves the system of equations by initializing the node variables to some conservative values and successively recomputing them until a fixed point is reached. The naive implementation of this is called chaotic iteration. Since this is not efficient, cleverer iterative schemes such as worklist, round-robin, and node-listing algorithms are used.
Kam and Ullman showed that there is a class of problems that a round-robin algorithm visiting nodes in reverse postorder can solve in d(G) + 3 passes over the graph G, where the depth d(G) is the maximum number of back edges that can occur on any acyclic path in G.
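The following minimal C sketch makes this concrete for reaching definitions: it runs round-robin iteration over a small hard-coded CFG, with unsigned ints as bitvector sets. The CFG shape and the GEN/KILL values are invented for illustration.

#include <stdio.h>

#define N 4   /* nodes: 0 -> 1, 1 -> {2,3}, 2 -> 1 (a loop), 3 = exit */

/* predecessor lists, -1 terminated; the CFG is invented for illustration */
static const int pred[N][3] = { {-1,-1,-1}, {0,2,-1}, {1,-1,-1}, {1,-1,-1} };

/* one bit per definition d1..d4; GEN/KILL values are also invented */
static const unsigned gen [N] = { 0x3, 0x4, 0x8, 0x0 };
static const unsigned kill[N] = { 0x8, 0x0, 0x1, 0x0 };

static unsigned in[N], out[N];   /* start at the empty set (bottom for RD) */

int main(void) {
    int changed = 1;
    while (changed) {                        /* round-robin passes */
        changed = 0;
        for (int n = 0; n < N; n++) {
            unsigned i = 0;
            for (int k = 0; k < 3 && pred[n][k] >= 0; k++)
                i |= out[pred[n][k]];        /* meet is union for RD */
            unsigned o = gen[n] | (i & ~kill[n]);
            if (i != in[n] || o != out[n]) changed = 1;
            in[n] = i;
            out[n] = o;
        }
    }
    for (int n = 0; n < N; n++)
        printf("node %d: in=%#x out=%#x\n", n, in[n], out[n]);
    return 0;
}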
2.3.4 Elimination Methods 
When the CFG is reducible, which means that there are no multiple-entry loops, elimination algorithms are preferred because they are usually more efficient than iterative algorithms. The central idea of elimination methods is to represent all paths from the start node to a particular node as a regular expression. We can get a high-level summary function fre corresponding to the regular expression by replacing each node with its transfer function, concatenation with function composition, the * operator with function closure, and union with the meet operation. The data value at a node is then fre(init). The reducibility of the CFG ensures that the summary function can be obtained from the above operations. Since a regular expression is used to represent a set of paths, elimination methods are also called path algebra based approaches.
Elimination methods exploit the structural properties of the graph: the flow graph is reduced to a single node using a series of graph transformations, and the dataflow properties of a node in a region are determined from the dataflow properties of the region's header node.
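For a fast (2-bounded) gen/kill framework with meet = union, the * operator has a particularly simple form: f*(x) = x ∧ f(x) = x ∪ GEN, so the loop summary keeps the body's GEN set and kills nothing. A minimal C sketch (with invented bit patterns) computes this closure:

#include <stdio.h>

typedef struct { unsigned gen, kill; } Fn;   /* f(x) = gen | (x & ~kill) */

/* Closure f* summarizing "zero or more iterations" of a loop body f.
 * With meet = union and f o f = f, f*(x) = x | f(x) = x | f.gen,
 * i.e. the loop may add its GEN set but cannot be assumed to kill. */
static Fn closure(Fn f) {
    Fn s = { f.gen, 0u };
    return s;
}

int main(void) {
    Fn body = { 0x5, 0xA };                  /* invented loop-body function */
    Fn loop = closure(body);
    printf("loop summary: gen=%#x kill=%#x\n", loop.gen, loop.kill);
    return 0;
}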
3 Other Methods of Program Analysis 
1. Abstract Interpretation 
2. Constraint based analysis 
Many data flow analysis problems, like reaching definitions, can be efficiently solved using abstract interpretation. For other problems, like pointer analysis and type inference, a constraint based approach is more suitable.
Some interesting pieces of information cannot be gathered using data flow analysis, because data flow analysis does not make use of the semantics of the programming language's operators.
Abstract Interpretation 
Abstract interpretation amounts to executing the program on abstract values instead of actual values. For example, the abstract values for integers can be their signs, or the property of being even or odd. Since the concrete computation is not performed, the results inferred by abstract interpretation can only be approximate. Another reason for approximation is that abstract interpretation computes conservative information that is valid on all control flow paths.
Abstract interpretation is a theory of semantics approximation. The idea is to create a new semantics of the programming language such that evaluation under the new semantics always terminates, and the store at every program point contains a superset of the values that are possible under the actual semantics, for every possible input. Since in this new semantics a store does not contain a single value for a variable but a set (or interval) of possible values, the evaluation of boolean and arithmetic expressions must be redefined.
Abstract interpretation is a very powerful program analysis method. It uses information about the programming language's semantics and can detect possible runtime errors, like division by zero or variable overflow. Since abstract interpretation can be computationally very expensive, care must be taken to choose an appropriate value domain and an appropriate heuristic for loop termination to ensure feasibility.
For example, in the following program, determining the actual values x can take is not feasible. However, if we are interested in abstract values of x, say odd and even, we can determine that x can only have even values. Abstract interpretation determines this information.
void f() {
    int x = 2;
    while (...) {
        // what values can x have here?
        x = x + 2;
    }
}
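A minimal sketch of this analysis in C follows: it abstracts x to the parity domain {BOT, EVEN, ODD, TOP} and iterates the loop's abstract effect to a fixed point. The domain encoding and the driver are illustrative, not a general framework.

#include <stdio.h>

typedef enum { BOT, EVEN, ODD, TOP } Parity;   /* the abstract domain */

/* join (merge of two control flow paths) in the parity lattice */
static Parity join(Parity a, Parity b) {
    if (a == BOT) return b;
    if (b == BOT) return a;
    return (a == b) ? a : TOP;
}

/* abstract effect of "x = x + 2": adding an even constant keeps parity */
static Parity add_two(Parity x) { return x; }

int main(void) {
    Parity x = EVEN;                 /* abstracts "int x = 2;" */
    for (;;) {                       /* execute the loop abstractly */
        Parity at_head = join(x, add_two(x));  /* 0 or more iterations */
        if (at_head == x) break;     /* fixed point reached */
        x = at_head;
    }
    printf("x in the loop: %s\n", x == EVEN ? "even" : "not provably even");
    return 0;
}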
The most common application of abstract interpretation is the solution of dataflow problems.
Set based analysis 
In set based analysis, sets of values are associated with each variable. The major difference between set based analysis and abstract interpretation is that set based analysis does not employ an abstract domain to approximate the underlying domain of computation. Instead, the approximation occurs because all dependencies between variables are ignored. For example, if at some point in the program the environments [x → 1, y → 2] and [x → 3, y → 4] can be encountered, then set based analysis will conclude that x can have values {1, 3} and y can have values {2, 4}. The dependency "x is 1 when y is 2" is lost.
In constraint based analysis, the properties to be computed are expressed as 
set constraints. In the following example, the property to be computed is the 
set of values a variable can have at runtime. 
if (...) {
    x = 3;
} else {
    x = 6;
}
// x can have {3,6}
print(x)
These sets need not represent only integers, as in the above example; they can represent the points-to sets of a variable, or the types of a receiver variable.
This "set based" approach to program analysis consists of three phases. The first step is to label the program points of interest, which may be terms, expressions or program variables. The second step is to associate with each label a variable which denotes the abstract values at that point; one can then derive a set of constraints on these variables. In the final step, these constraints are solved to find their minimum solution (this solving process is the main computational part of the analysis).
To solve the constraints, the usual procedure is to represent each set expression as a node and each constraint as a directed edge. A transitive closure of this graph gives all the constraints that can be inferred.
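The sketch below shows this in miniature: a hypothetical three-node constraint graph where an edge a → b stands for set(a) ⊆ set(b), solved by propagating bitvector sets along edges until nothing changes.

#include <stdio.h>

#define N 3
/* edge[a][b] encodes the constraint set(a) ⊆ set(b); graph is invented */
static const int edge[N][N] = { {0,1,0}, {0,0,1}, {0,0,0} };

int main(void) {
    unsigned set[N] = { 0x1, 0x2, 0x0 };  /* initial constant constraints */
    int changed = 1;
    while (changed) {                     /* propagate until fixed point */
        changed = 0;
        for (int a = 0; a < N; a++)
            for (int b = 0; b < N; b++)
                if (edge[a][b] && (set[b] | set[a]) != set[b]) {
                    set[b] |= set[a];
                    changed = 1;
                }
    }
    for (int n = 0; n < N; n++)
        printf("x%d can contain: %#x\n", n, set[n]);
    return 0;
}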
Data flow analysis can be viewed as set based analysis. For example, consider live variable analysis for C programs:
1. Domain: sets of program variables
2. Variables: Sin, Sout
3. Constants: Sdef, Suse
4. Constraints: Sin = Suse ∪ (Sout − Sdef)
5. Sout = ∪ { Xin | X ∈ succ(S) }
3.1 Interprocedural Analysis - Context Sensitivity 
3.1.1 Functional Approach 
The idea behind the functional approach is the same as that of elimination algorithms, where we used a large transfer function to summarize the effect of a region. The set of valid paths in the intraprocedural case can be represented by a regular expression; in the interprocedural case, calls and returns have to be matched, so the set of valid paths is given by a context free grammar in which a nonterminal represents a procedure. It is possible to compute a large function corresponding to each nonterminal which summarizes the effects of the paths produced by that nonterminal, as sketched below.
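The following toy C sketch suggests the idea under strong simplifying assumptions: a callee's body is summarized once as a gen/kill function (the statement values are invented), and the resulting summary is then reused at every call site instead of reanalyzing the callee per calling context.

#include <stdio.h>

typedef struct { unsigned gen, kill; } Fn;   /* f(x) = gen | (x & ~kill) */

static unsigned apply(Fn f, unsigned x) { return f.gen | (x & ~f.kill); }

static Fn compose(Fn g, Fn f) {              /* (g o f)(x) = g(f(x)) */
    Fn h = { g.gen | (f.gen & ~g.kill), g.kill | f.kill };
    return h;
}

int main(void) {
    /* summarize a callee made of two statements s1; s2 once ... */
    Fn s1 = { 0x1, 0x2 }, s2 = { 0x4, 0x1 };
    Fn summary = compose(s2, s1);
    /* ... then reuse the summary at every call site */
    printf("call site A: %#x\n", apply(summary, 0x8));
    printf("call site B: %#x\n", apply(summary, 0x3));
    return 0;
}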
3.1.2 Call String Approach 
The idea behind the call string approach is that the contents of the call stack are an important element for distinguishing dataflow information: the context is usually the current call stack. The use of call strings guarantees that dataflow values from invalid paths are not considered. Since the length of a call string may be unbounded, k-bounded call strings are used, keeping only the most recent k call sites.
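To make the truncation concrete, here is a toy C sketch (the site numbering and struct layout are invented) that keeps only the most recent k = 2 call sites of a context:

#include <stdio.h>

#define K 2   /* bound on the call-string length */

/* a context: the most recent (up to K) call sites on the stack */
typedef struct { int site[K]; int len; } CallString;

/* entering a procedure pushes the call site, dropping the oldest
 * entry once the string exceeds K */
static CallString push(CallString cs, int call_site) {
    if (cs.len < K) {
        cs.site[cs.len++] = call_site;
    } else {
        for (int i = 0; i + 1 < K; i++) cs.site[i] = cs.site[i + 1];
        cs.site[K - 1] = call_site;
    }
    return cs;
}

int main(void) {
    CallString cs = { {0}, 0 };
    cs = push(cs, 1);   /* call from site 1 */
    cs = push(cs, 2);   /* then from site 2 */
    cs = push(cs, 3);   /* site 1 falls off: context is "2 3" */
    for (int i = 0; i < cs.len; i++) printf("%d ", cs.site[i]);
    putchar('\n');
    return 0;
}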