A Static Slicing Tool for Sequential Java Programs



                         A Thesis
              Submitted For the Degree of
             Master of Science (Engineering)
             in the Faculty of Engineering




                             by

                    Arvind Devaraj




              Computer Science and Automation
                 Indian Institute of Science
                  BANGALORE – 560 012

                        March 2007
Abstract

A program slice consists of a subset of the statements of a program that can potentially
affect values computed at some point of interest. Such a point of interest along with a set
of variables is called a slicing criterion. Slicing tools are useful for several applications,
such as program understanding, testing, program integration, and so forth. Slicing object
oriented programs poses some special problems that need to be addressed due to features
like inheritance, polymorphism and dynamic binding. Alias analysis is important for the
precision of slices. In this thesis we implement a slicing tool for sequential Java programs
in the SOOT framework. SOOT is a front-end for Java developed at McGill University
and it provides several forms of intermediate code. We have integrated the slicer into
the framework. We also propose an improved technique for intraprocedural points-to
analysis. We have implemented this technique and compare the results of the analysis
with those for a flow-insensitive scheme in SOOT. Performance results of the slicer are
reported for several benchmarks.




Contents

Abstract                                                          ii

1 Introduction                                                     1
  1.1 Slicing                                                      1
  1.2 The SOOT Framework                                           5
  1.3 Contributions of the thesis                                  5

2 Slicing                                                          7
  2.1 Intraprocedural Slicing using PDG                            7
      2.1.1 Program Dependence Graph                               8
      2.1.2 Slicing using the Program Dependence Graph             8
      2.1.3 Construction of the Data Dependence Graph              9
      2.1.4 Control Dependence Graph                              11
      2.1.5 Slicing in presence of unstructured control flow      14
      2.1.6 Reconstructing CFG from the sliced PDG                17
  2.2 Interprocedural Slicing using SDG                           18
      2.2.1 System Dependence Graph                               18
      2.2.2 Calling context problem                               20
      2.2.3 Computing Summary Edges                               21
      2.2.4 The Two Phase Slicing Algorithm                       21
      2.2.5 Handling Shared Variables                             23
  2.3 Slicing Object Oriented Programs                            26
      2.3.1 Dependence Graph for Object Oriented Programs         26
      2.3.2 Handling Inheritance                                  31
      2.3.3 Handling Polymorphism                                 34
      2.3.4 Case Study - Elevator Class and its Dependence Graph  35

3 Points to Analysis                                              38
  3.1 Need for Points to Analysis                                 38
  3.2 Pointer Analysis using Constraints                          39
  3.3 Dimensions of Precision                                     41
  3.4 Andersen's Algorithm for C                                  44
  3.5 Andersen's Algorithm for Java                               45
      3.5.1 Model for references and heap objects                 45
      3.5.2 Computation of points to sets in SPARK                47
  3.6 CallGraph Construction                                      48
      3.6.1 Handling Virtual Methods                              49
  3.7 Improvements to Points to Analysis                          50
  3.8 Improving Flow Sensitivity                                  51
      3.8.1 Computing Valid Subgraph at each Program Point        53
      3.8.2 Computation of Access Expressions                     55
      3.8.3 Checking for Satisfiability                           60

4 Implementation and Experimental Results                         62
  4.1 Soot - A bytecode analysis framework                        62
  4.2 Steps in performing slicing in Soot                         65
  4.3 Points to Analysis and Call Graph                           65
  4.4 Computing Required Classes                                  67
  4.5 Side effect computation                                     68
  4.6 Preprocessing                                               69
  4.7 Computing the Class Dependence Graph                        70
  4.8 Experimental Results                                        71

5 Conclusion and Future Work                                      75

Bibliography                                                      77

List of Tables

 3.1 Constraints for C                                            44
 3.2 Constraints for Java                                         47
 3.3 Data flow equations for computing valid edges                53
 3.4 Computation of Valid edges                                   54

 4.1 Benchmarks Description                                       72
 4.2 Number of Edges in the Class Dependence Graph                72
 4.3 Timing Requirements                                          72
 4.4 Program Statistics - Partial Flow Sensitive                  73
 4.5 Precision Comparison                                         73
List of Figures

 1.1  A program and its slice                                      2

 2.1  A Control Flow Graph                                        12
 2.2  Post Dominator Tree for the CFG in Figure 2.1               12
 2.3  Dominance Frontiers                                         13
 2.4  A program and its PDG (taken from [39])                     15
 2.5  Augmented CFG and PDG for the program in Figure 2.4 (taken from [39])  16
 2.6  A program with function calls                               18
 2.7  System Dependence Graph for an interprocedural program      19
 2.8  Slicing the System Dependence Graph                         24
 2.9  Program                                                     28
 2.10 The Dependence Graph for the main function (from [67])      29
 2.11 The Dependence Graphs for functions C() and D() (from [67]) 29
 2.12 Interface Dependence Graph (from [58])                      33
 2.13 The Elevator program                                        36
 2.14 Dependence Graph for Elevator program                       37

 3.1  Need for Points to Analysis                                 39
 3.2  Points to Graphs                                            40
 3.3  Imprecision due to context insensitive analysis             43
 3.4  Object Flow Graph                                           53
 3.5  An example program                                          54
 3.6  Access Expressions                                          54
 3.7  OFG Subgraph                                                56
 3.8  Access Expressions (for a DAG)                              58
 3.9  Access Expressions (for general graph)                      60
 3.10 Simplified Access Expressions                               60
 3.11 Dominator Tree                                              60

 4.1  Soot Framework Overview                                     64
 4.2  Computation of the class dependence graph                   66
 4.3  Jimple code and its slice                                   74
Chapter 1

Introduction

1.1     Slicing
A program slice consists of the parts of a program that can potentially affect the value of
variables computed at some point of interest. Such a point is called the slicing criterion
and is specified by a pair (program point, set of variables). The original concept of a
program slice was proposed by Mark Weiser [61]. According to his definition:

      A slice s of program p is a subset of the statements of p that retains some
      specified behavior of p. The desired behavior is detailed by means of a slicing
      criterion c. Generally, a slicing criterion c is a set of variables V and a
      program point l. When the slice s is executed, it must always have the same
      values as program p for the variables in V at point l.

   Weiser claimed that a program slice was the abstraction that users had in mind as
they debugged programs. There have been variations in the definitions of program slices
depending on the application in mind. Weiser’s original definition required a slice S of
a program to be an executable subset of the program, whereas another common defini-
tion defines a slice as a subset of statements that directly or indirectly affect the values
computed at the point of interest but are not necessarily an executable segment. Fig-
ure 1.1 shows a program sliced with respect to the slicing criterion (print(product),
product).


 read(n);                                read(n);
 i = 1;                                  i = 1;
 sum = 0;
 product = 1;                            product = 1;
 while (i<=n) {                          while (i<=n) {
    sum = sum + i;
    product = product * i;                  product = product * i;
    i = i + 1;                              i = i + 1;
 }                                       }
 print(sum);
 print(product);                         print(product);


                          Figure 1.1: A program and its slice

Since the transformed program is expected to be much smaller than the
original, it is hoped that dependencies between statements in the program will be more
explicit. Surveys on program slicing are presented in [45], [73]. Slicing tools have been
used for several applications, such as program understanding [82], testing [74] [75], pro-
gram integration [78], model checking [79] and so forth.

  1. Program Understanding: Software engineers are assigned to understand a mas-
     sive piece of code and modify parts of it. When modifying a program, we need
     to comprehend a section of the program rather than the whole program. Backward
     and forward slicing can be used to browse the code and understand the interde-
     pendence between various parts of the program.

  2. Testing: In the context of testing, a problem that is often encountered is that of
     finding the set of program statements that are affected by a change in the program.
     This analysis is termed impact analysis. To determine what tests need to be re-run
     to test a modified statement S, a backward slice on S will get the statements
     that actually influence the behavior of the program.

  3. Debugging: Quite often the statement that is actually responsible for a bug that
      shows up at some program point P is statically far away from P. To reduce the
      search space of possible causes for the error, the programmer can use a backward
      slice to eliminate parts of the code that could not have been the cause of the
      problem.

  4. Model Checking: Model checking is a verification technique that performs an
      exhaustive exploration of a program’s state space. Typically the execution of a
      program is simulated, and the paths and states encountered in the simulation are checked
      against correctness specifications phrased as temporal logic formulas. The use of
      slicing here is to reduce the size of a program P being checked for a property
      by eliminating statements and variables that are irrelevant to the formula.

   There is an essential difference between static and dynamic slices. A static slice
disregards the actual inputs to a program, whereas a dynamic slice relies on a specific test case
and is therefore, in general, more precise.
   When slicing a program P we are concerned with both correctness and precision.
For correctness we demand that the slice S produced by the tool is a superset of the
actual slice S(p) for the slicing criterion p. Precision has to do with the size of the slice.
For two correct slices S1 and S2, S1 is more precise than S2 if the statements of S1
are a subset of the statements of S2. Obtaining the most precise slice is in general not
computable; hence our aim is to compute a correct slice that is as precise as possible.
   The slicing problem can be addressed by viewing it as a reachability problem in a
Program Dependence Graph (PDG) [54]. A PDG is a directed graph with vertices cor-
responding to statements and predicates and edges corresponding to data and control
dependences. For the sequential intraprocedural case, the backward slice with respect
to a node in the PDG is the set of all nodes in the PDG on which this node is tran-
sitively dependent. Thus given the PDG, a simple reachability algorithm on the PDG
will construct the slice. However when considering interprocedural slices, the process
is more complicated as mere reachability will produce imprecise slices. One needs to
track only interprocedural realizable paths, where a realizable path corresponds to legal
call/return pairs where a procedure always returns to the call site where it was invoked.
The structure on which interprocedural slicing is generally implemented is the System
Dependence Graph [63] (SDG). This graph is a collection of graphs corresponding to
PDGs for individual procedures, augmented with some extra edges that capture the
interaction between them. Slicing of interprocedural programs is described by Horwitz
et al. [63]. They use the SDG to track dependencies in a program and use a two phase
algorithm to ensure that only feasible paths are tracked, that is, those in which procedure
calls are matched with the correct return statements.
   Slicing object oriented programs adds yet another dimension of complexity to the
slicing problem. Object-oriented concepts such as classes, objects, inheritance, poly-
morphism and dynamic binding make representation and analysis techniques used for
imperative programming languages inadequate for object-oriented programs. Larsen and
Harrold [66] introduced the Class Dependence Graph, which can represent
class hierarchy, data members and polymorphism. Some more features were added by
Liang and Harrold [67].
   The resolution of aliases is required for the correct computation of data dependencies.
To compute the dependence graph, it is necessary to build a call graph. The computation
of the call graph becomes complicated in the presence of dynamic binding, i.e. when the target
of a method call depends on the runtime type of a variable. Algorithms like Rapid Type
Analysis (RTA) [26] compute call graphs using type information.
   A key analysis for object oriented languages is alias analysis. The objective here is
to follow an object O from its point of allocation to find out which objects reference
O and which other objects are referenced by the fields of O. Resolving aliases becomes
important for the correct computation of data dependencies in the dependence graph.
The precision of the analysis depends on various factors like flow sensitivity, context
sensitivity and handling of field references. Andersen [64] gives a flow insensitive method
for finding aliases using subset constraints. Lhotak [70] describes the method adapted
for Java programs.
   In this thesis we implement a slicing tool for sequential Java programs and integrate
it into the SOOT framework. We briefly describe the framework and the contributions
of the thesis.



1.2     The SOOT Framework
The SOOT analysis and transformation framework [69] is a Java optimization framework
developed by the Sable Research Group at McGill University and it is intended to be a
robust, easy-to-use research framework. It has been used extensively for program analy-
sis, instrumentation, and optimization. It provides several forms of intermediate code for
analyzing and optimizing Java bytecode. Jimple is a typed three address representation,
which we have used in our implementation.
   Our objective is to implement a slicing tool within the Soot framework [69] and make
it publicly available. At the time this work was begun there was no publicly available
slicing infrastructure for Java. The Indus [81] project addresses the slicing problem for
Java programs, and its source code was made available in February 2007.



1.3     Contributions of the thesis
The following are the contributions of this thesis:

  1. We have implemented the routines for creating the program dependence graphs
      and the class dependence graph for an input Java program that is represented in
      the form of Jimple intermediate code.

  2. We have integrated a slicer into the framework. For inter-procedural slicing we
      have implemented the two-phase slicing algorithm of [63].

  3. We propose an improved technique for intraprocedural points-to analysis. This uses
      path expressions to track paths that encode valid points-to information. A simple
      data-flow analysis formulation collects valid edges, i.e. those that are added to
      the object flow graph. Reachability queries are handled in a reasonable amount of
      time. We have implemented this technique and compare the results of the analysis
      with those for a flow-insensitive scheme in SOOT.

  4. The slicing tool has been run on several benchmarks and we report on times taken
    to build the class dependence graph, its size, slice sizes for some given slicing criteria
    and slicing times.
Chapter 2

Slicing

In this chapter, we discuss techniques for slicing a program and in particular issues that
arise when slicing object oriented programs. The first part of the chapter describes the
Program Dependence Graph (PDG), its construction and the algorithm for intraproce-
dural slicing. For slicing programs with function calls, the System Dependence Graph
(SDG) is used. The SDG is a collection of PDGs for individual procedures with additional
edges for modeling procedure calls and parameter bindings. The second part of the
chapter describes the construction of SDG and the algorithm for interprocedural slicing.
The third part of the chapter describes dependence graph computation of object ori-
ented programs, which is complicated because objects can be passed as parameters and
methods can be invoked upon objects. Also we need the results of points to analysis to
determine what objects are pointed to by each reference variable. Then we describe the ex-
tension of the algorithm for computing the dependence graph in presence of inheritance
and polymorphic function calls.



2.1     Intraprocedural Slicing using PDG
Weiser’s approach [61] to program slicing is based on dataflow equations. In his approach,
the set of relevant variables is iteratively computed till a fixed point is reached. Slicing
via graph reachability was introduced by Ottenstein [54]. In this approach a dependence


graph of the program is constructed and the problem of slicing reduces to computing
reachability on the dependence graph. We adopt this in our implementation.


2.1.1    Program Dependence Graph

A program dependence graph (PDG) represents the data and control dependencies in
the program. Nodes of PDG represent statements and predicates in a source program,
and its edges denote dependence relations. The PDG can be constructed as follows.

  1. Build the program’s CFG, and use it to compute data and control dependencies:
     Node N is data dependent on node M iff M defines a variable x, N uses x, and
     there is an x-definition-free path in the CFG from M to N . Node N is control
     dependent on node M iff M is a predicate node whose evaluation to true or false
     determines whether N will be executed.

  2. Build the PDG. The nodes of the PDG are almost the same as the nodes of the
     CFG. However, in addition, there is a special enter node, and a node for each
     predicate. The PDG does not include the CFG’s exit node. The edges of the PDG
     represent the data and control dependencies computed using the CFG.


2.1.2    Slicing using the Program Dependence Graph

To compute the slice from statement (or predicate) S, start from the PDG node that
represents S and follow the data- and control-dependence edges backwards in the PDG.
The components of the slice are all of the nodes reached in this manner.
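
As an illustration, the following sketch shows this backward reachability as a simple worklist traversal in Java. It is only a sketch and not the thesis implementation (which operates on Jimple within SOOT); node identifiers are plain strings and the PDG is assumed to be given as a map from each node to the sources of its incoming dependence edges.

import java.util.*;

// Minimal sketch of intraprocedural slicing as backward reachability on a PDG.
// "deps" maps each PDG node to the sources of its incoming data- and
// control-dependence edges; node identifiers are plain strings for simplicity.
public class PdgSlicer {
    public static Set<String> backwardSlice(Map<String, Set<String>> deps, String criterion) {
        Set<String> slice = new HashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        slice.add(criterion);
        worklist.push(criterion);
        while (!worklist.isEmpty()) {
            String n = worklist.pop();
            for (String m : deps.getOrDefault(n, Collections.emptySet())) {
                if (slice.add(m)) {          // each node is visited at most once
                    worklist.push(m);
                }
            }
        }
        return slice;                        // nodes on which the criterion transitively depends
    }
}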
   The computation of the data dependence graph is described in Section 2.1.3. Com-
puting the control dependence graph is described in Section 2.1.4. Figure 2.4 shows an
example program and its corresponding PDG. Solid lines represent control dependencies
while dashed lines represent data dependencies.



2.1.3     Construction of the Data Dependence Graph

A data dependence graph represents the association between definitions and uses of a
variable. There is an association (d, u) between a definition of variable v at d and a use
of variable v at u iff there is at least one control flow path from d to u with no intervening
definition of v.
   Each node represents a statement. An edge represents a flow dependency between
statements. Though there are many kinds of data dependencies between statements,
only flow dependencies are necessary for the purpose of slicing as only flow dependence
needs to be traced back in order to compute the PDG nodes comprising the slice. Output
and anti dependence edges do not represent true data dependence. Instead they encode
a partial order on program statements, which is necessary to preserve since there is no
explicit control flow relation between PDG nodes. However, PDG slices are normally
mapped back to high-level source code, where control flow is explicitly represented. Thus
there is no need for any such control flow information to be present in the computed
PDG slice.
   Computation of flow dependencies is done by solving the reaching definitions problem.
Reaching definitions is a classical bitvector problem solvable in a monotone dataflow
framework. The analysis associates a program point with the set of
definitions reaching that point. The definitions reaching a program point along with the
use of a variable form flow dependencies.
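
For illustration, a standard round-robin formulation of this analysis is sketched below. This is not the SOOT or thesis code; the per-statement GEN and KILL bit sets and the CFG predecessor map are assumed to be precomputed.

import java.util.*;

// Classical reaching-definitions analysis: OUT[n] = GEN[n] ∪ (IN[n] − KILL[n]),
// IN[n] = union of OUT[p] over all CFG predecessors p, iterated to a fixed point.
// Definitions are numbered 0..numDefs-1; GEN/KILL per node and the predecessor
// map are assumed to be precomputed.
public class ReachingDefinitions {
    public static Map<String, BitSet> solve(List<String> nodes,
                                            Map<String, List<String>> preds,
                                            Map<String, BitSet> gen,
                                            Map<String, BitSet> kill,
                                            int numDefs) {
        Map<String, BitSet> in = new HashMap<>();
        Map<String, BitSet> out = new HashMap<>();
        for (String n : nodes) {
            in.put(n, new BitSet(numDefs));
            out.put(n, new BitSet(numDefs));
        }
        boolean changed = true;
        while (changed) {                               // iterate to a fixed point
            changed = false;
            for (String n : nodes) {
                BitSet newIn = new BitSet(numDefs);
                for (String p : preds.getOrDefault(n, List.of())) newIn.or(out.get(p));
                BitSet newOut = (BitSet) newIn.clone();
                newOut.andNot(kill.get(n));
                newOut.or(gen.get(n));
                if (!newIn.equals(in.get(n)) || !newOut.equals(out.get(n))) {
                    in.put(n, newIn);
                    out.put(n, newOut);
                    changed = true;
                }
            }
        }
        // IN[n] gives the definitions reaching n; a pair (d, n) where n uses the
        // variable defined at d is a flow dependence edge of the PDG.
        return in;
    }
}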


Dependence in presence of arrays and records

In the presence of composite data types like arrays, records and pointers, the most
conservative method is to assume a definition of a variable to be the definition of the
entire composite object [83]. A definition (or use) of an element of an array can be
considered as definition (or use) of the entire array. For example, consider the statement


                                         a[i] = x



Here the variable a is defined and variables i, x are used. Thus DEF = {a} and
REF = {i, x}. The value of a is used in computing the address of a[i] and thus a must
also be included in the REF set. The correct value for REF is {a, i, x} [45]. This
approach is conservative, leading to large slices due to spurious dependencies.
Our current implementation handles composite data types in this manner, though more
refined methods have been proposed in the literature. Agrawal et al. [53] propose a
modified algorithm for computing reaching definitions that determines the memory loca-
tions defined and used in statements and computes whether the intersection among those
locations is complete, partial or statically indeterminable. Another method to avoid
spurious dependencies is to use array index tests, like the GCD test, which can determine
that there is no dependence between two array access expressions.
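
As an illustration of such a test (a sketch, not part of the thesis implementation, and assuming two references to the same array with affine subscripts c1*i + d1 and c2*i + d2), the GCD test reports independence when gcd(c1, c2) does not divide d2 - d1.

// GCD-based independence test for two references to the same array with
// affine subscripts a[c1*i + d1] and a[c2*i + d2]: the subscripts can only be
// equal for some iterations if gcd(c1, c2) divides (d2 - d1).
public final class GcdTest {
    private static int gcd(int a, int b) {
        a = Math.abs(a);
        b = Math.abs(b);
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    // Returns true when the two accesses are provably independent.
    public static boolean provablyIndependent(int c1, int d1, int c2, int d2) {
        int g = gcd(c1, c2);
        if (g == 0) {               // both subscripts are constants d1 and d2
            return d1 != d2;
        }
        return (d2 - d1) % g != 0;  // no integer solution to c1*i1 + d1 = c2*i2 + d2
    }
}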


Data dependencies in presence of aliasing

When computing data dependencies, the major problem occurs due to the presence of aliasing.
Consider the following example. Here there is a data dependency between x.a = ... and ...
= y.a since both x and y point to the object o1. Without alias analysis this dependency
is missed because the syntactic expressions x.a and y.a are different. Thus resolving
aliases is necessary for the correct computation of data dependencies. Also if worst case
assumptions are made for field loads and stores, many spurious dependencies are created.

void fun() {
    obj x, y;
    x = new obj();   // o1 is the object created
    y = x;
    x.a = ...;
    ... = y.a;
}




P:   if (x > y)
S1:      max = x;
     else
S2:      max = y;



2.1.4    Control Dependence Graph

Another kind of dependence between statements arises due to the presence of control
structure.
   For example, in the above code, the execution of S1 is dependent on the predicate
x > y. Thus S1 is said to be control dependent on P. A slice with respect to S1 has to
include P, because the execution of S1 depends on the outcome of the predicate node P.
   Two nodes Y and Z should be identified as having identical control conditions if, in
every run of the program, node Y is executed if and only if Z is executed. In Figure
2.1, nodes 2 and 5 are said to be control dependent on the true branch of node 1,
since their execution depends conditionally on the outcome of node 1. The original
method for computing control dependence information using postdominators was presented
by Ferrante et al. [47]. Cytron et al. [46] give an improved method for constructing
control dependence information using dominance frontiers.


Finding control dependence using postdominators relationship

A node X is said to be a postdominator of node Y if all possible paths from Y to the exit
node must pass through X. A node N is said to be control dependent on an edge a → b if

  1. N postdominates b

  2. N does not postdominate a

   In Figure 2.1, to find the nodes that are control dependent on edge 1 → 2, we find
nodes that postdominate node 2 but not node 1. Nodes 2 and 5 are such nodes. So
nodes 2 and 5 are control dependent on the edge 1 → 2.



     This observation suggests that to find the nodes that are control dependent on the
edge X → Y, we can walk up the postdominator tree from Y, marking every node visited as
control dependent on the edge; we stop when we reach the postdominator of X.
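
The following sketch (an illustration only, not the thesis code) performs this walk for a single CFG edge X → Y, assuming the immediate postdominator of every node is available as a map; applied to the edge 1 → 2 of Figure 2.1 it marks nodes 2 and 5.

import java.util.*;

// Given the immediate postdominator of every CFG node, mark the nodes that are
// control dependent on a branch edge X -> Y by walking up the postdominator
// tree from Y and stopping at the postdominator of X.
public class ControlDependence {
    // node -> predicate nodes it is control dependent on
    private final Map<String, Set<String>> controlDeps = new HashMap<>();

    public void markEdge(String x, String y, Map<String, String> ipdom) {
        String stop = ipdom.get(x);                         // where the walk ends
        for (String n = y; n != null && !n.equals(stop); n = ipdom.get(n)) {
            controlDeps.computeIfAbsent(n, k -> new HashSet<>()).add(x);
        }
    }

    public Map<String, Set<String>> result() {
        return controlDeps;
    }
}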




              Figure 2.1: A Control Flow Graph (diagram omitted)

              Figure 2.2: Post Dominator Tree for the CFG in Figure 2.1 (diagram omitted)



Using Dominance Frontiers to compute Control Dependence

Control dependencies between statements can be computed in an efficient manner us-
ing the dominance frontier information. Cytron et al. [46] describe the method for
computing dominance frontiers.
     A dominance frontier for vertex vi contains all vertices vj such that vi dominates an
immediate predecessor of vj, but vi does not strictly dominate vj [62]:
     DF(vi) = { vj ∈ V | (∃ vk ∈ Pred(vj)) ((vi dom vk) ∧ ¬(vi sdom vj)) }
     Informally, the set of nodes lying just outside the dominated region of Y is said to
be in the dominance frontier of Y.

                              Figure 2.3: Dominance Frontiers (diagram omitted)

In the example in Figure 2.3, Y dominates nodes
Y', Y'' and Y''', and X lies just outside the dominated region. So X is said to be in the
dominance frontier of Y.
   Note that if X is in the dominance frontier of Y, then there are at least two
incoming paths to X, of which one contains Y and another does not. If the CFG is
reversed, then we have two outgoing paths from X, one containing Y and another not
containing Y. This is the same as the condition for Y to be control dependent on X. Thus
to find control dependence it is enough to find the dominance frontiers on the reverse
control flow graph. Algorithm 1 computes the control dependence information.



Algorithm 1 Algorithm to compute the Control Dependence Graph
 compute dominance frontiers of the reversed CFG G
 for all N in G do
   let RDF (N ) be reverse dominator frontiers of N
   if RDF (N ) is empty then
      N is made control dependent on method entry node
   end if
   for all node P in RDF (N ) do
      for all node S in CFG successor of P do
        if S = N or N postdominates S then
          N is made control dependent on P
        end if
      end for
   end for
 end for


2.1.5     Slicing in presence of unstructured control flow

In the presence of unstructured control flow caused by jump statements like goto,
break, continue and return, the algorithm for slicing can produce an incorrect slice. While
Java does not have goto statements, break and continue statements cause unstructured
control flow. Consider computing a slice with respect to the statement print(prod) in
Figure 2.4. When the slicing algorithm discussed in Section 2.1.2 is applied, the state-
ment break is not included, which is incorrect.
   This was discovered by Choi and Ferrante [38] and by Ball and Horwitz [37] who
present a method to compute a correct slice in the presence of unstructured control flow
statements. Their method to correct for such statements is based on the observation
that jumps are similar to predicate nodes in that both affect the flow of control. Thus
jumps are also made to be sources of control dependence edges. A jump vertex has an
outgoing true edge to the target of the jump, and an outgoing false edge to the statement
that would execute if the jump were a no-op. A jump vertex is considered as a pseudo
predicate since the outgoing false edge is non-executable. The original CFG augmented
with these non-executable edges is called the Augmented Control Flow Graph (ACFG).
   Kumar and Horwitz [39] describe the following algorithm for slicing in presence of
jump statements.
            prod = 1;
            k = 1;
            while (k <= 10) {
              if (MAXINT/k < prod) break;
              prod = prod * k;
              k++;
            }
            print(k);
            print(prod);

            (a) Example Program; panels (b) and (c), the CFG and the PDG, are omitted

               Figure 2.4: A program and its PDG (taken from [39])
Figure 2.5: Augmented CFG and PDG for the program in Figure 2.4 (taken from [39]) (diagrams omitted)



  1. Build the program’s augmented control flow graph described previously. Labels
     are treated as separate statements; i.e., each label is represented in the ACFG by
     a node with one outgoing edge to the statement that it labels.

  2. Build the program’s augmented PDG. Ignore the non-executable ACFG edges when
     computing data-dependence edges; do not ignore them when computing control-
     dependence edges. (This way, the nodes that are executed only because a jump
     is present, as well as those that are not executed but would be if the jump were
     removed, are control dependent on the jump node, and therefore the jump will be
     included in their slices.)

  3. To compute the slice from node S, follow data- and control-dependence edges back-
     wards from S. A label L is included in a slice iff a statement “goto L” is in the
     slice.


2.1.6      Reconstructing CFG from the sliced PDG

Reconstructing the CFG from the PDG is described in [71]. From the CFG and the
PDG slice, a sliced CFG is constructed by walking through all nodes. For each node n,
we execute the following.

  1. If n is a goto statement or return statement, leave it in the slice

  2. If n is a conditional statement, there are three cases:

        (a) If n is not in the PDG slice, it can be removed

        (b) If n is in the PDG slice, but one of the branches is not, replace the jump to
             that branch with a jump to the convergence node of the branch (the node
              where two branches reconnect). If that node doesn’t exist, replace the jump
              with a jump to the return statement of the program.

        (c) If n is present in the PDG slice and both branches are present leave n in the
             CFG
  3. Otherwise, check whether n is present in the PDG slice; if not, remove it.

   We next describe the interprocedural slicing algorithm implemented in this thesis.


main() {
    sum=0;
    i=1;
    while(i<11)
    {
        sum=add(sum,i);
        i=add(i,1);
    }
    print(sum);
    print(i);
}

int add(int a,int b) {
    result=a+b;
    return result;
}


                      Figure 2.6: A program with function calls



2.2     Interprocedural Slicing using SDG

2.2.1    System Dependence Graph

For interprocedural slicing, Horwitz et al. [63] introduce the System Dependence Graph
(SDG). A system-dependence graph is a collection of program-dependence graphs, one
for each procedure, with additional edges for modeling parameter passing. Figure 2.6
shows a program with function calls. Figure 2.7 displays its SDG.
   Each PDG contains an entry node that represents entry to the procedure. To model
procedure calls and parameter passing, an SDG introduces additional nodes and edges.
Accesses to global variables are modeled via additional parameters of the procedure.
They assume parameters are passed by value-result, and introduce additional nodes in
       Figure 2.7: System Dependence Graph for an interprocedural program (diagram omitted)



the interprocedural case. The following additional nodes are introduced.

  1. Call-site nodes representing the call sites.

  2. Actual-in and actual-out nodes representing the input and output parameters at
     the call sites. They are control dependent on the call-site node.

  3. Formal-in and formal-out nodes representing the input and output parameters at
     the called procedure. They are control dependent on the procedure’s entry node.

   They also introduce additional edges to link the program dependence graphs together:

  1. Call edges link the call-site nodes with the procedure entry nodes.

  2. Parameter-in edges link the actual-in nodes with the formal-in nodes.

  3. Parameter-out edges link the formal-out nodes with the actual-out nodes.
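
For illustration only, the vertex and edge kinds introduced above could be modeled with a small data structure such as the following sketch; the class and method names here are assumptions for exposition and do not correspond to the thesis or SOOT classes.

import java.util.*;

// Illustrative data model for SDG vertices and edges.
public class Sdg {
    enum VertexKind { ENTRY, STATEMENT, PREDICATE, CALL_SITE,
                      ACTUAL_IN, ACTUAL_OUT, FORMAL_IN, FORMAL_OUT }
    enum EdgeKind { CONTROL, DATA, CALL, PARAM_IN, PARAM_OUT, SUMMARY }

    record Vertex(String id, VertexKind kind) {}
    record Edge(Vertex source, Vertex target, EdgeKind kind) {}

    final Set<Vertex> vertices = new HashSet<>();
    final Set<Edge> edges = new HashSet<>();

    // call edge: call-site vertex to the entry vertex of the called procedure
    void addCallEdge(Vertex callSite, Vertex entry) {
        edges.add(new Edge(callSite, entry, EdgeKind.CALL));
    }
    // parameter-in edge: actual-in vertex to the matching formal-in vertex
    void addParameterIn(Vertex actualIn, Vertex formalIn) {
        edges.add(new Edge(actualIn, formalIn, EdgeKind.PARAM_IN));
    }
    // parameter-out edge: formal-out vertex back to the matching actual-out vertex
    void addParameterOut(Vertex formalOut, Vertex actualOut) {
        edges.add(new Edge(formalOut, actualOut, EdgeKind.PARAM_OUT));
    }
}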


2.2.2    Calling context problem

For computing an intraprocedural slice, a simple reachability algorithm on the PDG is
sufficient. However, in the interprocedural case, a simple reachability over the SDG doesn’t
work since not all the paths are valid. For example, in Figure 2.7, the path a_in = sum →
a = a_in → result = a + b → r_out = result → i = r_out is not valid interprocedurally.
In an interprocedurally valid path, a call edge must be matched with its corresponding
return edge.
   To address this problem, Horwitz et al. [63] introduce the concept of summary edges.
These edges summarize the effect of a procedure call. There is a summary edge between
an actual in and an actual out node of a call site, if there is a dependency between the
corresponding formal in and formal out node of the called procedure. Thus a summary
edge summarizes the effect of a procedure call.



2.2.3    Computing Summary Edges

We describe computation of summary edges in Algorithm 2. The algorithm takes the
given SDG and adds summary edges. P is the set of path edges. Each edge in P of
the form (n, m) encodes the information that there is a realizable path in the SDG from
n to m. The worklist contains path edges that need to be processed. The algorithm
begins by asserting that there is a realizable path from each formal out node to itself.
The set of realizable paths P is extended by traversing backwards through dependence
edges. If, during the traversal, a formal in node is encountered, then we have a realizable
path from formal-in to formal-out node. Therefore a summary edge is added between
the actual in and actual out nodes of the corresponding call sites. Because the insertion
of summary edges makes more paths feasible, this process is continued iteratively, till no
more summary edges can be added. The algorithm for computing summary information
is displayed in Algorithm 2.
   Computing the summary edges is equivalent to the functional approach suggested by
Sharir and Pnueli [41].




2.2.4    The Two Phase Slicing Algorithm

Horwitz et al. [63] describe the two phase algorithm. The interprocedural backward slicing
algorithm consists of two phases. The first phase traverses backwards from the node in
the SDG that represents the slicing criterion along all edges except parameter-out edges,
and marks those nodes that are reached. The second phase traverses backwards from all
nodes marked during the first phase along all edges except call and parameter-in edges,
and marks reached nodes. The slice is the union of the marked nodes. Let s be the
slicing criterion in procedure P.

  1. Phase 1 identifies vertices that can reach s, and are either in P itself or in a
     procedure that calls P (either directly or transitively). Because parameter out
      edges are not followed, the traversal in Phase 1 does not descend into procedures




Algorithm 2 Computing Summary Information
 W = ∅, W is the worklist
 P = ∅, P is the set of path edges
 for all n ∈ N which is a formal out node do
   W = W ∪ {(n, n)}
   P = P ∪ {(n, n)}
 end for

 while W ≠ ∅, worklist is not empty do
   remove one element (n, m) from the worklist
   if n is a formal in node then
     for all n′ → n which is a parameter in edge do
       for all m → m′ which is a parameter out edge do
         if n′ and m′ belong to the same call site then
           E = E ∪ {n′ → m′}, add a new summary edge
           for all (m′, x) ∈ P do
             P = P ∪ {(n′, x)}
             W = W ∪ {(n′, x)}
           end for
         end if
       end for
     end for
   else
     for all edges n′ → n do
       if (n′, m) ∉ P then
         P = P ∪ {(n′, m)}
         W = W ∪ {(n′, m)}
       end if
     end for
   end if
 end while



      called by P. Though the algorithm does not descend into the called procedures, the
      effects of such procedures are not ignored, thanks to the presence of summary edges.

  2. Phase 2 identifies vertices that reach s from procedures (transitively) called by P
     or from procedures called by procedures that (transitively) call P. Because call
     edges and parameter in edges are not followed, the traversal in phase 2 doesn’t
     ascend into calling procedures; the transitive flow dependence edges from actual in
     to actual out vertices make such ascents unnecessary.

   We implemented a variation of the two phase slicing algorithm as described by Krinke
[49]. Figure 2.8 shows the vertices in the SDG marked during phase 1 and phase 2, when
the statement print(i) is given as the slicing criterion. The first phase traverses backwards
along all edges except the parameter out edge r_out = result → i = r_out. Thus the
first phase does not descend into the procedure add. The second phase traverses backwards
along all edges except the parameter in edges and call edges. Thus in the second phase neither
the edge a_in = sum → a = a_in nor the edge call add → a = a_in is traversed.
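
The two phases can also be summarized by the following compact sketch (an illustration with assumed types, not the thesis code); incoming edges are stored per target node and are tagged with their kind, so that parameter-out edges can be deferred during the first phase and call and parameter-in edges skipped during the second.

import java.util.*;

// Sketch of the two-phase interprocedural slicer over an SDG whose incoming
// edges are stored per target node and tagged with their kind.
public class TwoPhaseSlicer {
    enum Kind { CONTROL, DATA, CALL, PARAM_IN, PARAM_OUT, SUMMARY }
    record Edge(String source, Kind kind) {}

    public static Set<String> slice(Map<String, List<Edge>> incoming, String criterion) {
        Set<String> slice = new HashSet<>();
        slice.add(criterion);
        Deque<String> phase2 = new ArrayDeque<>();

        // Phase 1: traverse backwards, but do not follow parameter-out edges;
        // their sources are remembered for the second phase instead.
        Deque<String> work = new ArrayDeque<>();
        work.push(criterion);
        while (!work.isEmpty()) {
            String n = work.pop();
            for (Edge e : incoming.getOrDefault(n, List.of())) {
                if (slice.add(e.source())) {
                    if (e.kind() == Kind.PARAM_OUT) phase2.push(e.source());
                    else work.push(e.source());
                }
            }
        }
        // Phase 2: traverse backwards from the deferred nodes, skipping
        // call and parameter-in edges so the walk never re-ascends to callers.
        while (!phase2.isEmpty()) {
            String n = phase2.pop();
            for (Edge e : incoming.getOrDefault(n, List.of())) {
                if (e.kind() == Kind.CALL || e.kind() == Kind.PARAM_IN) continue;
                if (slice.add(e.source())) phase2.push(e.source());
            }
        }
        return slice;
    }
}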


2.2.5    Handling Shared Variables

This section deals with handling variables that are shared across procedures. Shared
variables include global variables in imperative languages. Though Java does not have
global variables, instance members of a class can be treated as global variables that are
accessible by the member functions.
   Shared variables are handled by passing them as additional parameters to every
function. Considering every shared variable as a parameter is correct but inefficient, as
it increases the number of nodes. We can reduce the number of parameters passed by
doing interprocedural analysis and using the GMOD and GREF information [42].

  1. GMOD(P) : The set of variables that might be modified by P itself or by a proce-
     dure (transitively) called from P

  2. GREF(P) : The set of variables that might be referenced by P itself or by a pro-
     cedure (transitively) called from P
                  Figure 2.8: Slicing the System Dependence Graph (diagram omitted)




Algorithm 3 Two phase slicing algorithm (Krinke’s version)
  input G = (N, E), the given SDG; s ∈ N , the slicing criterion
  output S ⊆ N , the slice
  S = {s}
  W_up = {s}
  W_down = ∅
  First phase
  while W_up ≠ ∅ (the worklist is not empty) do
    remove one element n from W_up
    for all m → n ∈ E do
      if m ∉ S then
        if m → n is a parameter-out edge then
          W_down = W_down ∪ {m}
          S = S ∪ {m}
        else
          W_up = W_up ∪ {m}
          S = S ∪ {m}
        end if
      end if
    end for
  end while

  Second phase
  while W_down ≠ ∅ (the worklist is not empty) do
    remove one element n from W_down
    for all m → n ∈ E do
      if m ∉ S then
        if m → n is not a parameter-in edge or a call edge then
          W_down = W_down ∪ {m}
          S = S ∪ {m}
        end if
      end if
    end for
  end while
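
   A minimal Java sketch of this two-phase traversal is given below. The SdgNode, SdgEdge
and EdgeKind types are hypothetical stand-ins for whatever graph representation the slicer
uses; they are not part of the SOOT framework or of the implementation described later.

    import java.util.*;

    enum EdgeKind { CONTROL, DATA, CALL, PARAM_IN, PARAM_OUT, SUMMARY }

    class SdgEdge { SdgNode src, dst; EdgeKind kind; }
    class SdgNode { List<SdgEdge> incoming = new ArrayList<>(); }

    class TwoPhaseSlicer {
        // Backward slice of the SDG with respect to the given criterion vertex.
        static Set<SdgNode> slice(SdgNode criterion) {
            Set<SdgNode> slice = new HashSet<>();
            Deque<SdgNode> up = new ArrayDeque<>(), down = new ArrayDeque<>();
            slice.add(criterion);
            up.add(criterion);
            // Phase 1: follow all incoming edges except parameter-out edges,
            // so the traversal never descends into called procedures.
            while (!up.isEmpty()) {
                SdgNode n = up.remove();
                for (SdgEdge e : n.incoming) {
                    if (slice.add(e.src)) {
                        if (e.kind == EdgeKind.PARAM_OUT) down.add(e.src);
                        else up.add(e.src);
                    }
                }
            }
            // Phase 2: follow all incoming edges except parameter-in and call edges,
            // so the traversal never ascends into calling procedures.
            while (!down.isEmpty()) {
                SdgNode n = down.remove();
                for (SdgEdge e : n.incoming) {
                    if (e.kind == EdgeKind.PARAM_IN || e.kind == EdgeKind.CALL) continue;
                    if (slice.add(e.src)) down.add(e.src);
                }
            }
            return slice;
        }
    }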



   GMOD and GREF sets are used to determine which parameter vertices are included
in procedure dependence graphs. At procedure entry, the following vertices are inserted:

  1. Formal in for each variable in GM OD(P ) ∪ GREF (P )

  2. Formal out for each variable in GM OD(P )

   Similarly, at a call site, the following vertices are inserted:

  1. Actual in for each variable in GM OD(P ) ∪ GREF (P )

  2. Actual out for each variable in GM OD(P )



2.3     Slicing Object Oriented Programs
The System Dependence Graph (SDG) is not sufficient to represent all dependencies
in object oriented programs. An efficient graph representation of an object oriented
program should employ a class representation that can be reused in the construction of
other classes and of applications that use the class. Section 2.3.1 discusses the dependence
graph representation for object oriented programs. Sections 2.3.2 and 2.3.3 discuss
inheritance and polymorphism respectively.


2.3.1    Dependence Graph for Object Oriented Programs

The dependencies within a single method are represented using a Method Dependence
Graph (MDG), which is composed of a data dependence subgraph and a control dependence
subgraph. The MDG has a method entry node which represents the start of a method.
The method entry vertex has a formal in vertex for every formal parameter and a formal
out vertex for each formal parameter that may be modified. Each call site has a call vertex
and a set of actual parameter vertices: an actual-in vertex for each actual parameter at
the call site and an actual-out vertex for each actual parameter that may be modified
by the called procedure. Parameter out edges are added from each formal-out node to
the corresponding actual-out node. The effects of return statements are modeled by



connecting the return statement to its corresponding call vertex using a parameter-out
edge. Summary edges are added from actual in to actual out nodes as described in
Section 2.2.3.
   Larsen and Harrold [66] represent the dependencies in a class using the class de-
pendence graph (ClDG). A ClDG is a collection of MDGs constructed for individual
methods in the program. In addition it contains a class entry vertex that is connected to
the method entry vertex for each method in the class by a class member edge. Class entry
vertices and class member edges let us track dependencies that arise due to interaction
among classes.
   In the presence of multiple classes, additional dependence edges are required to record
the interaction between classes. For example, when a class C1 creates an object of class
C2, there is an implicit call to C2's constructor. When there is a call site in method m1
of class C1 to method m2 of class C2, there is a call dependence edge from the call site
in m1 to the method start vertex of m2. Parameter-in edges are added from each actual-in
vertex to the corresponding formal-in vertex, and parameter-out edges are added from each
formal-out vertex to the corresponding actual-out vertex.
   In object oriented programs, data dependence computation is complicated by the
fact that statements can read from and write to fields of objects, i.e. a statement can
have side effects. Computation of side effect information requires points to analysis and is
discussed further in Chapter 3. Also, methods can be invoked on objects and objects can
be passed as parameters. An algorithm for computing data dependence must take this
into account.


Handling objects at callsites

In the presence of a function call invoked on an object, such as o.m1(), the call can
modify the data members of o. Larsen and Harrold observe that the data member variables
of a class are accessible to all methods in the class and hence can be treated as global
variables. They use additional parameters to represent the data members referenced by a
method. Thus the data dependence introduced by two consecutive method calls via data




class Base {
    int a, b;
    protected void vm()   { a = a + b; }
    public Base()         { a = 0; b = 0; }
    public void m2(int i) { b = b + i; }
    public void m1() {
        if (b > 0) vm();
        b = b + 1;
    }
    public void main1() {
        Base o = new Base();
        Base ba = new Base();
        ba.m1();
        ba.m2(1);
        o.m2(1);
    }
    public void C(Base ba) {
        ba.m1();
        ba.m2(1);
    }
    public void D() {
        Base o = new Base();
        C(o);
        o.m1();
    }
}

class Derived extends Base {
    long d;
    public void vm()      { d = d + b; }
    public Derived()      { super(); d = 0; }
    public void m3()      { d = d + 1; m2(1); }
    public void m4()      { m1(); }
    public void main2() {
        int i = read();
        Base p;
        if (i > 0)
            p = new Base();
        else
            p = new Derived();
        C(p);
        p.m1();
    }
}




                                 Figure 2.9: Program




      Figure 2.10: The Dependence Graph for the main function (from [67])




    Figure 2.11: The Dependence Graphs for functions C() and D() (from [67])



member variables can be represented as data dependence between the actual parameters
at the method callsites. Figure 2.10 shows the dependence graph constructed for the
main program of Figure 2.9. Variables a and b are considered as global variables shared
across methods m1(), m2() and Base(). The data member variables are considered as
additional parameters that are passed to the function. This method of slicing includes
only those statements that are necessary for the data members at the slicing criterion to
receive correct values. For example, slicing with respect to the node b = b_out associated
with the statement o.m2() will exclude statements that assign to data member a.
   One source of imprecision in this method is that it does not distinguish data members
belonging to different objects, and it therefore creates spurious dependencies between
data members of different objects. In the above example, the slice wrongly includes the
statements ba.m1() and ba.m2(). Liang and Harrold [67] give an improved algorithm for
object sensitive slicing.
   In the dependence graph representation of [67], the constructor has no formal-in
vertices for the instance variables, since these variables cannot be referenced before they
are allocated by the class constructor. Thus the algorithm omits formal-in vertices
for instance variables in the class constructor. In the approaches of [67] and [66], the data
members of the class are treated as additional parameters to be passed to the function.
This increases the number of parameter nodes. The number of additional nodes can
be reduced using GMOD/GREF information: actual-out and formal-out vertices are
needed only for those data members that are modified by the member function, and actual-in
and formal-in vertices are needed only for those data members accessed by the function.


Handling Parameter Objects

Tonella [59] represents an object as a single vertex when the object is used as a parameter.
This representation can lead to imprecise slices because it considers modification (or
access) of an individual field in an object to be a modification (or access) of the entire
object. For example, if the slicing criterion is o.b at the end of D() (in Figure 2.9), then
C(o) must be included. This in turn causes the slicer to include the parameter ba,



which causes ba.a and ba.b to be included, though ba.a does not affect o.b. To overcome
this limitation, Liang and Harrold [67] expand the parameter object as a tree. Figure
2.11 shows the parameter ba being expanded into a tree. At the first level, the node
representing ba is expanded into two nodes, Base and Derived each representing the type
ba can possibly have. At the next level, each node is expanded into its constituent data
members. Since data members can themselves be objects, the expansion is done recursively
until we reach primitive data types. In the presence of recursive data types, where the tree
height can be infinite, k-limiting is used to limit the height of the tree to k. At the call
statement C(o) in Figure 2.9, the parameter object o is expanded into its data members.
At the function call, actual in and actual out vertices are created for the data members
of o. Summary edges are added between the actual in and actual out vertices if there is
a dependence possible through the called procedure.
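   For instance, the expansion of the parameter ba of Figure 2.9 to two levels can be
pictured as follows (a sketch; the actual graphs appear in Figure 2.11):

    ba
    +-- Base:     ba.a, ba.b
    +-- Derived:  ba.a, ba.b, ba.d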


2.3.2     Handling Inheritance

Java provides a single inheritance model, which means that a new Java class can be
designed that inherits state variables and functionality from an existing class. The
functionality of base class methods can be overridden by simply redefining the methods
in the derived class. Larsen and Harrold [66] construct dependence graph representations
for the methods defined by the derived class. The representations of all methods that
are inherited from superclasses are simply reused. To construct the dependence graph
representation of class Derived (Figure 2.9), new representations are constructed for
methods such as m3() and m4(); the representation of m1() is reused from class Base.
   Liang and Harrold [67] illustrate that in the presence of virtual methods, it is not pos-
sible to directly reuse the representations of the methods of the superclass. For example,
we cannot directly reuse the representation of m1() in class Base when we construct
the representation for class Derived. In the Base class, the call statement vm() in
m1() resolves to Base::vm(). If a class derived from Base redefines vm(), then the call
statement vm() no longer resolves to Base::vm(), but to the newly defined vm() of the
derived class. The callsites in the representation of m1() for class Derived have to be



changed. A method needs a new representation if

   1. the method is declared in the new class

   2. the method is declared in a lower class in the hierarchy and calls a newly redefined
      virtual method directly or indirectly.

For example, methods declared in Derived need a new representation because they
satisfy (1). Base.m1() also needs a new representation because it satisfies (2):
Base.m1() calls Derived.vm(), which is redefined in class Derived.


Handling Interfaces

In Java, interfaces declare methods but leave the responsibility of defining the methods to
the concrete classes implementing the interface. Interfaces allow the programmer to work
with objects through the interface behavior that they implement, rather than through their
class definition.
   Single Interfaces
   We use the interface representation graph [58] to represent a Java interface and its
corresponding classes that implement it. There is a unique vertex called interface start
vertex for the entry of the interface. Each method declaration in the interface can be
regarded as a call to its corresponding method in a class that implements it and therefore
a call vertex is created for each method declaration in the interface. The interface start
vertex is connected to each call vertex of the method declaration by interface membership
dependence arcs. If more than one class implements the interface, we connect each
method call in the interface to every corresponding method in the implementing classes.
   Extending Interfaces
   Similar to extending classes, the representation of an extended interface is constructed
by reusing the representations of all methods that are inherited from superinterfaces. For
newly defined methods in the extended interface, new representations are created.




ie1 interface A {                                                                  ie1
c1     void method1(int h);              (a)
c2     void method2(int v);                                                                                                  f1_in:   this.h=this.h_in
    }                                                                                                                        f2_in:   this.v=this.v_in
ie3 interface B extends A {                                         c1                                   c2                  f3_in:   this.u=this.u_in
c4     void method3(int u);                                                                                                  f4_in:   h1=h1_in
    }                                                                                                                        f5_in:   v1=v1_in
                                                  a1_in                                  a2_in                               f6_in:   u1=u1_in
ce5 class C1 implements A {
                                                                                                                             f7_in:   h2=h2_in
s6     int h, v;                                                                                       e9       e16          f8_in:   v2=v2_in
e7     public void method1(int   h1) {                       e7             e13
                                                                                                                             a1_in:   h1_in=h
s8          this.h = h1;                                                                                                     a2_in:   v1_in=v
       }                                                                                                                     a3_in:   u1_in=u
e9     public void method2(int   v1) {
s10         this.v = v1;                 f4_in               f7_in                  f5_in                      f8_in
       }                                                                                         s10                   s17
                                                  s8                  s14
    }
ce11 class C2 implements A {
s12    int h, v;
e13    public void method1(int   h2) {                                             ie3
s14         this.h = h2+1;
                                         (b)
       }
e16    public void method2(int   v2) {
s17         this.v = v2+1;
       }
                                                        c1                         c2                             c4
    }
ce18 class C3 implements B {                                                                                                 interface-membership
s19    int h, v, u;                       a1_in                    a2_in                               a3_in                 dependence arc
e20    public void method1(int   h1) {
s21         this.h = h1+2;                                                                                                   control dependence arc
                                                       e20                         e22                          e24
       }
e22    public void method2(int   v1) {
s23         this.v = v1+2;                                                                                                   call dependence arc

       }
                                         f4_in                    f5_in                          f6_in
e24    public void method3(int   u1) {                                                                                       parameter dependence arc
s25         this.u = u1+2;                        s21                        s23                                 s25
       }
    }




                       Figure 2.12: Interface Dependence Graph (from [58])



2.3.3     Handling Polymorphism

In Java, method calls are bound to the implementation at runtime. Method invocation
expressions such as o.m(args) are executed as follows:

  1. The runtime type T of o is determined.

  2. Load T.class

  3. Check T for an implementation of method m. If T does not define an implementation,
      the lookup proceeds to T's superclass, and then to that class's superclass, until an
      implementation is found.

  4. Invoke method m with the argument list, args, and also pass o to the method,
      which will become the this value for method m.
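
   As a small sketch using the classes of Figure 2.9:

    Base p = new Derived();   // static type Base, runtime type Derived
    p.m1();                   // steps 1-4 use the runtime type Derived, so the call
                              // to vm() inside m1() resolves to Derived.vm()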

   A polymorphic reference can refer to instances of more than one class. A class
dependence graph represents such a polymorphic method call by using a polymorphic
choice vertex [66]. A polymorphic choice vertex represents the selection of a particular
call from a set of possible destinations. In this method, a message sent to a polymorphic
object is represented as a set of callsites, one for each candidate message handling method,
connected to a polymorphic choice vertex with polymorphic choice edges. This approach
may give incorrect results: in function main2() of Figure 2.9, Larsen's approach uses only
one callsite to represent the statement p.m1() because m1() is declared only in Base.
However, when m1() is called on an object of class Derived, it invokes Derived.vm() to
modify d, and when m1() is called on an object of class Base, it invokes Base.vm() to
modify a. One callsite cannot precisely represent both cases. This approach also computes
spurious dependences: it is equivalent to using several objects, each belonging to a different
type, to represent a polymorphic object. The data dependence construction algorithm cannot
distinguish data members with the same names in these different objects.
   Liang and Harrold [67] give an improved method of representing polymorphism to
overcome this limitation. A polymorphic object is represented as a tree: the root of the
tree represents the polymorphic object and the children of the root represent objects of



the possible types. When the polymorphic object is used as a parameter, the children
are further expanded into trees; when the polymorphic object receives a message, the
children are further expanded into callsites. In Figure 2.11 the callsite ba.m1() can have
receiver types Base and Derived. Thus the call site is expanded into two callsites, one for
each type of receiver.


2.3.4        Case Study - Elevator Class and its Dependence Graph

Figure 2.13 shows the Elevator program and the slice with respect to line 59. Figure
2.14 shows the class dependence graph constructed for the program. The C++ Elevator
class discussed in [72] has been modified for Java.




 1 class Elevator {
 2     static int UP = 1, DOWN = -1;
 3     public Elevator(int t) {
 4         current_floor = 1;
 5         current_direction = UP;
 6         top_floor = t;
 7     }
 8     public void up() {
 9         current_direction = UP;
10     }
11     public void down() {
12         current_direction = DOWN;
13     }
14     int which_floor() {
15         return current_floor;
16     }
17     public int direction() {
18         return current_direction;
19     }
20     public void go(int floor) {
21         if (current_direction == UP) {
22             while (current_floor != floor
23                    && current_floor <= top_floor)
24                 current_floor = current_floor + 1;
25         }
26         else {
27             while (current_floor != floor
28                    && current_floor > 0)
29                 current_floor = current_floor - 1;
30         }   }
31     int current_floor;
32     int current_direction;
33     int top_floor;
34 }
35 class AlarmElevator extends Elevator {
36     public AlarmElevator(int top_floor) {
37         super(top_floor);
38         alarm_on = 0;
39     }
40     public void set_alarm() {
41         alarm_on = 1;
42     }
43     public void reset_alarm() {
44         alarm_on = 0;    }
45     public void go(int floor) {
46         if (!alarm_on)
47             super.go(floor);
48     }
49     protected int alarm_on;
50 }
51 class Test {
52     public static void main(String args[]) {
53         Elevator e;
54         if (condition)
55             e = new Elevator(10);
56         else
57             e = new AlarmElevator(10);
58         e.go(5);
59         System.out.print(e.which_floor());
60     }
61 }

                             Figure 2.13: The Elevator program




[Figure: the class dependence graph for the Elevator program. Vertices represent the numbered
statements of Figure 2.13 together with their actual-in/actual-out and formal-in/formal-out
parameter vertices for current_floor, current_direction, top_floor, floor and alarm_on; they
are connected by control dependence, data dependence, summary, call and parameter edges, and
the vertex for statement 59 is marked as the slice point. A key maps each parameter vertex to
the copy assignment it represents, for example F1_in: current_floor = current_floor_in,
F1_out: current_floor_out = current_floor, A4_in: current_floor_in = current_floor,
A4_out: current_floor = current_floor_out, and A9_in: floor_in = 5.]




                                               Figure 2.14: Dependence Graph for Elevator program
Chapter 3

Points to Analysis

In this chapter we first discuss the need for points to analysis. In the context of slicing,
points to analysis is essential for the correct computation of data dependencies and the
construction of the call graph. We summarize some issues related to computing points to
sets, including the methods for their computation and the various factors that affect precision.
We next describe Andersen's algorithm for pointer analysis for C and its adaptation
for Java. We then describe a new method for intra-procedural alias analysis which is an
improvement over flow insensitive analysis but not as precise as a flow sensitive analysis.



3.1      Need for Points to Analysis
The goal of pointer analysis is to statically determine the set of memory locations that
can be pointed to by a pointer variable. If two variables can access the same memory
location, the variables are said to be aliased. Alias analysis is necessary for program
analysis, for optimizations, and for the correct computation of data dependence, which in
turn is necessary for slicing. Consider the computation of data dependence in Figure 3.1.
Here the statement print(y.a) is dependent on x.a = ..., since x and y are aliased due to
the execution of the statement y = x. Without alias analysis, it is not possible to infer
that statement 7 is dependent on statement 4.
   A points to graph gives information about the set of memory locations pointed at by





1 void fun() {
2     obj x,y;
3     x=new obj();     // O1 represents the object allocated here
4     x.a = ....;
5     ... = y.a;
6     y = x;
7     print(y.a);
8 }


                           Figure 3.1: Need for Points to Analysis

  each variable. Figure 3.1 shows a program and its associated points to graph.
     In C a variable can point to another stack variable or to dynamically allocated memory
  on the heap, whereas in Java a reference variable can point only to objects allocated on
  the heap, as stack variables cannot be pointed to due to the lack of an address-of operator (&).
  Dynamically allocated memory locations on the heap are not named. One convention is to
  refer to objects (memory locations) by the statement at which they are created. A statement
  can be executed many times and can therefore create a new object each time. Thus
  approximations are introduced in the points to graph if the above convention is used.
  Another cause for approximation is the presence of recursion and dynamic allocation of
  memory, which leads to a statically unbounded number of memory locations.



  3.2      Pointer Analysis using Constraints
  Our aim is to derive the points to graph from the program text. One method to derive
  the points to graph is using constraints [64]. If pts(q) denotes the set of objects initially
  pointed to by q, then after an assignment such as p = q, p can additionally point to those
  objects which are initially pointed to by q. Thus we have the constraint pts(p) ⊇ pts(q).
  Every statement in the program has an associated constraint. A solution to the constraints
  gives the points to set associated with every variable.
     Constraints such as pts(p) ⊇ pts(q) are also called subset constraints or inclusion
  based constraints. Andersen uses subset constraints for analyzing C programs; his
  algorithm is described in Section 3.4.
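
     For instance, a minimal sketch of the constraints generated for a few Java-style
  assignments (Obj is the class of Figure 3.2; o1 and o2 name the objects allocated at the
  respective statements):

      p = new Obj();    // o1 ∈ pts(p)
      q = new Obj();    // o2 ∈ pts(q)
      p = q;            // pts(p) ⊇ pts(q)
      r = p;            // pts(r) ⊇ pts(p)

  The least solution of these constraints is pts(q) = {o2}, pts(p) = {o1, o2} and
  pts(r) = {o1, o2}.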




[Figure: a C program and a Java program with their points to graphs.]

  C program:
        int a = 1, b = 2;
        int *p, *q;
        void *r, *s;
        p = &a;
        q = &b;
    h1: r = malloc(...);
    h2: s = malloc(...);
    Points to graph: p → a, q → b, r → heap1, s → heap2.

  Java program:
        class Obj { int f; }
        Obj r, s, t;
    h1: r = new Obj();
    h2: s = new Obj();
    h3: r.f = new Obj();
        t = s;
    Points to graph: r → heap1, s → heap2, t → heap2, heap1.f → heap3.

                                     Figure 3.2: Points to Graphs

Subset vs Unification Constraints

The constraints generated can be either subset based or equality based. A subset con-
straint such as pts(p) ⊇ pts(q) says that the points-to set of p contains the points-to set of
q. Instead of having subset constraints, Steensgaard [13] uses equality based constraints,
where after each assignment such as p = q, the points to sets of p and q are unified, i.e. the
points to sets of both variables are made identical.
   Steensgaard's approach is based on a non-standard type system, where a type does not
refer to a declared type in the program source. Instead, the type of a variable describes
the set of locations possibly pointed to by the variable at runtime. At initialization each
variable is described by a different type. When two variables can point to the same mem-
ory location, the types of the variables are merged. However, the stronger constraints
make the analysis less precise. The equality based approach is also called unification
because it treats assignments as bidirectional. This unification merges the



points to sets of both sides of the assignment and essentially computes an equivalence
relation defined by assignments, which can be maintained with the fast union-find algorithm [22].
   If all the variables can be assigned types, subject to the constraints, then the sys-
tem of constraints is said to be satisfiable or well typed. Points-to analysis reduces to
the problem of assigning types to all locations (variables) in a program, such that the
variables in the program are well-typed. At the end of the analysis, two locations are
assigned different types, unless they have to be described by the same type in order for
the system of constraints to be well-typed.
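
   A small sketch of the difference between the two kinds of constraints:

    a = b;    // subset (Andersen):  pts(a) ⊇ pts(b)     unification (Steensgaard):  pts(a) = pts(b)
    c = b;    // subset (Andersen):  pts(c) ⊇ pts(b)     unification (Steensgaard):  pts(c) = pts(b)

With subset constraints, objects pointed to only by a never enter pts(c); with unification,
a, b and c all end up with the same merged points-to set.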



3.3     Dimensions of Precision
The various factors that contribute to the precision of the computed analysis are flow
sensitivity, field sensitivity, context sensitivity and heap modelling. Ryder [17] discusses
these and other parameters that affect the precision of the analysis.




Flow Sensitive vs Flow Insensitive approach

A flow sensitive analysis takes into account the control flow structure of the program.
Thus the points-to set associated with a variable depends on the program point; the
analysis computes a mapping variable × program point → memory locations. This is precise
but requires a large amount of memory, since the points to sets of the same variable at
two different program points may differ and have to be recorded separately. Flow sensitive
analysis also allows us to take advantage of strong updates: after a statement x = ...,
the points to information about x prior to that statement can be removed.
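
   For example (o1 and o2 name the two allocation sites):

    x = new Obj();    // o1
    x = new Obj();    // o2
    // flow sensitive with a strong update: after the second statement, pts(x) = {o2}
    // flow insensitive:                    pts(x) = {o1, o2} at every program point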
   A flow insensitive approach computes conservative information that is valid at all
program points. It considers the program as a set of statements and computes points-to
information ignoring control flow. Flow insensitive analysis computes a single points to
relation that holds regardless of the order in which assignment statements are actually



executed.
   A flow insensitive analysis produces imprecise results. Consider the computation of
data dependence for the program in Figure 3.1. If we apply flow insensitive alias analysis,
then the analysis will conclude that x and y can both point to O1, and thus the
statement ... = y.a (line 5) is made dependent on x.a = ... (line 4). But y can point to O1
only after the statement y = x. Thus flow insensitive analysis leads to spurious data
dependence.




Field Sensitivity

Aggregate objects such as structures can be handled by one of three approaches: field-
insensitive, where field information is discarded by modeling each aggregate with a single
constraint variable; field-based, where one constraint variable models all instances of a
field; and finally, field-sensitive, where a unique variable models each field instance of an
object. The following table describes these approaches for the code segment

 x.a = new object();
 y.b = x.a ;


   field based         pts(b) ⊇ pts(a)
 field insensitive     pts(y) ⊇ pts(x)
  field sensitive    pts(y.b) ⊇ pts(x.a)



Heap Abstraction

Two variables are aliased if they can refer to the same object in memory. Thus we need
to keep track of objects that can be present at runtime. The objects created at runtime
cannot be determined statically and have to be conservatively approximated. The least
precise manner is to consider the entire heap as a single object. The most common
abstraction is to have one abstract object per program point. This abstract object
is a representative of all the objects that can be created at runtime due to that program



main() {
    object a,b,c,d;
    a=new object();              pts(a)   ⊇   {o1}
    b=new object();              pts(b)   ⊇   {o2}
    c=id(a);                     pts(r)   ⊇   pts(a), pts(c) ⊇ pts(r)
    d=id(b);                     pts(r)   ⊇   pts(b), pts(d) ⊇ pts(r)
}

object id(object r) {

    return r;
}




                Figure 3.3: Imprecision due to context insensitive analysis

point. A more precise abstraction is to take context sensitivity into account using the
calling context to distinguish between various objects created at the same program point.
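
   For example, in the sketch below (the names list, n and Obj are illustrative), all objects
created by the allocation in the loop body are represented by the single abstract object of
that program point:

    for (int i = 0; i < n; i++)
        list[i] = new Obj();    // allocation site h1: every iteration creates a new concrete
                                // object, but the analysis sees only one abstract object o_h1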




Context Sensitivity

A context sensitive analysis distinguishes between different calling contexts and does not
merge data flow information from different contexts. In Figure 3.3, a and b point to o1
and o2 respectively. Due to the function calls, c is made to point to o1 and d is made
to point to o2. So the actual points to sets are a → o1, b → o2, c → o1 and d → o2. A
context insensitive analysis models parameter bindings as explicit assignments. Thus r
points to both the objects o1 and o2. This leads to smearing of information, making c
and d point to both o1 and o2.
    One method to incorporate context sensitivity is to summarize each procedure and
embed that information at the call sites. A method can change the points to sets of
all data reachable through static variables, incoming parameters and all objects created
by the method and its callees. A method's summary must include the effect of all the
updates that the method and all its callees can make, expressed in terms of the incoming
parameters. Thus summaries can be huge. A further difficulty arises from the callback mechanism.



In the presence of dynamic binding, we do not know which method will be called, making
it difficult to summarize the method [1].
   Another method to incorporate context sensitivity is the cloning based approach.
Cloning based approaches expand the call graph for each calling context, so that there
is a separate path for each calling context. A context insensitive algorithm can then be
run on the expanded graph. This leads to an exponential blowup. Whaley and Lam
[18] use Binary Decision Diagrams (BDDs) to handle the exponential increase in
complexity caused by cloning. BDDs were first used for pointer analysis by Berndl
et al. [31]. Milanova et al. [20] introduce object sensitivity, which is a form of context
sensitivity: instead of using the call stack to distinguish different contexts, they use the
receiver object.
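
   A minimal sketch of the distinction (Box and Item are hypothetical classes with a
set() method):

    Box b1 = new Box();      // o1
    Box b2 = new Box();      // o2
    b1.set(new Item());      // analyzed in the context of receiver o1
    b2.set(new Item());      // analyzed in the context of receiver o2

Because the two calls to set() are analyzed under different receiver contexts, the item
stored in b1 is not confused with the item stored in b2, whereas a context insensitive
analysis would merge them.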



3.4     Andersen’s Algorithm for C
Andersen proposed a flow insensitive, context insensitive version of points to analysis
for C. His analysis modeled the heap using a separate concrete location to represent all
memory allocated at a given dynamic allocation site. The implementation expressed the
analysis using subset constraints and then solved the constraints.
   Andersen's algorithm [64] models the points to relations as subset constraints. After a
statement such as p = q, p additionally points to those objects which are initially pointed
to by q. Thus we have the constraint pts(p) ⊇ pts(q). The list of constraints for C is given
in Table 3.1.


                        p = &x       x ∈ pts(p)
                        p = q        pts(p) ⊇ pts(q)
                        p = ∗q       ∀x ∈ pts(q), pts(p) ⊇ pts(x)
                        ∗p = q       ∀x ∈ pts(p), pts(x) ⊇ pts(q)

                              Table 3.1: Constraints for C
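
   For Java references, where indirection occurs through field accesses rather than pointer
dereferences, the analogous constraints take the following form (a sketch of a field sensitive
formulation; o_i names the abstract object of allocation site i):

                        p = new Obj()     o_i ∈ pts(p)
                        p = q             pts(p) ⊇ pts(q)
                        p = q.f           ∀o ∈ pts(q), pts(p) ⊇ pts(o.f)
                        p.f = q           ∀o ∈ pts(p), pts(o.f) ⊇ pts(q)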

   Constraints are represented using a constraint graph. Each node N in the constraint
graph represents a variable and is annotated with pts(N), the set of objects the variable
Broadcast Receiver
 
AIDL - Android Interface Definition Language
AIDL  - Android Interface Definition LanguageAIDL  - Android Interface Definition Language
AIDL - Android Interface Definition Language
 
NDK Programming in Android
NDK Programming in AndroidNDK Programming in Android
NDK Programming in Android
 
Google Cloud Messaging
Google Cloud MessagingGoogle Cloud Messaging
Google Cloud Messaging
 
OpenGLES - Graphics Programming in Android
OpenGLES - Graphics Programming in Android OpenGLES - Graphics Programming in Android
OpenGLES - Graphics Programming in Android
 
Operating system
Operating systemOperating system
Operating system
 
Sorting (introduction)
 Sorting (introduction) Sorting (introduction)
Sorting (introduction)
 
Data structures (introduction)
 Data structures (introduction) Data structures (introduction)
Data structures (introduction)
 
Graphics programming in open gl
Graphics programming in open glGraphics programming in open gl
Graphics programming in open gl
 
Computer Systems
Computer SystemsComputer Systems
Computer Systems
 

Recently uploaded

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Karmanjay Verma
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneUiPathCommunity
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 

Recently uploaded (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#Microservices, Docker deploy and Microservices source code in C#
Microservices, Docker deploy and Microservices source code in C#
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
WomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyoneWomenInAutomation2024: AI and Automation for eveyone
WomenInAutomation2024: AI and Automation for eveyone
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 

Chapter 1

Introduction

1.1 Slicing

A program slice consists of the parts of a program that can potentially affect the values of variables computed at some point of interest. Such a point is called the slicing criterion and is specified by a pair (program point, set of variables). The original concept of a program slice was proposed by Mark Weiser [61]. According to his definition:

    A slice s of program p is a subset of the statements of p that retains some specified
    behavior of p. The desired behavior is detailed by means of a slicing criterion c.
    Generally, a slicing criterion c is a set of variables V and a program point l. When
    the slice s is executed, it must always have the same values as program p for the
    variables in V at point l.

Weiser claimed that a program slice was the abstraction that users had in mind as they debugged programs. There have been variations in the definition of a program slice, depending on the application in mind. Weiser's original definition required a slice S of a program to be an executable subset of the program, whereas another common definition regards a slice as a subset of statements that directly or indirectly affect the values computed at the point of interest but that need not form an executable program. Figure 1.1 shows a program sliced with respect to the slicing criterion (print(product), product).

The original program:

    read(n);
    i = 1;
    sum = 0;
    product = 1;
    while (i <= n) {
        sum = sum + i;
        product = product * i;
        i = i + 1;
    }
    print(sum);
    print(product);

Its slice with respect to (print(product), product):

    read(n);
    i = 1;
    product = 1;
    while (i <= n) {
        product = product * i;
        i = i + 1;
    }
    print(product);

Figure 1.1: A program and its slice

Since the sliced program is expected to be much smaller than the original, it is hoped that the dependences between its statements will be more apparent. Surveys on program slicing are presented in [45], [73]. Slicing tools have been used for several applications, such as program understanding [82], testing [74] [75], program integration [78], model checking [79], and so forth.

1. Program Understanding: Software engineers are often asked to understand and modify large bodies of code. When modifying a program, it is usually enough to comprehend the relevant section of the program rather than the whole program. Backward and forward slicing can be used to browse the code and understand the interdependence between various parts of the program.

2. Testing: In the context of testing, a problem that is often encountered is that of finding the set of program statements that are affected by a change in the program. This analysis is termed impact analysis. To determine what tests need to be re-run for a modified statement S, a backward slice on S yields the statements that actually influence the behavior of the program at S.

3. Debugging: Quite often the statement that is actually responsible for a bug that shows up at some program point P is statically far away from P. To reduce the search space of possible causes of the error, the programmer can use a backward slice to eliminate parts of the code that could not have been the cause of the problem.

4. Model Checking: Model checking is a verification technique that performs an exhaustive exploration of a program's state space. Typically the execution of a program is simulated, and the paths and states encountered in the simulation are checked against correctness specifications phrased as temporal logic formulas. The use of slicing here is to reduce the size of a program P being checked for a property by eliminating statements and variables that are irrelevant to the formula.

There is an essential difference between static and dynamic slices. A static slice disregards the actual inputs to a program, whereas a dynamic slice relies on a specific test case and is therefore, in general, more precise. When slicing a program P we are concerned with both correctness and precision. For correctness we demand that the slice S produced by the tool is a superset of the actual slice S(p) for the slicing criterion p. Precision has to do with the size of the slice: for two correct slices S1 and S2, S1 is more precise than S2 if the statements of S1 are a subset of the statements of S2. Obtaining the most precise slice is, in general, not computable; hence our aim is to compute a correct slice that is as precise as possible.

The slicing problem can be addressed by viewing it as a reachability problem on a Program Dependence Graph (PDG) [54]. A PDG is a directed graph with vertices corresponding to statements and predicates and edges corresponding to data and control dependences. For the sequential intraprocedural case, the backward slice with respect to a node in the PDG is the set of all nodes in the PDG on which this node is transitively dependent. Thus, given the PDG, a simple reachability algorithm will construct the slice. However, when considering interprocedural slices, the process is more complicated, as mere reachability produces imprecise slices. One needs to track only interprocedurally realizable paths, where a realizable path corresponds to legal call/return pairs in which a procedure always returns to the call site where it was invoked. The structure on which interprocedural slicing is generally implemented is the System Dependence Graph (SDG) [63]. This graph is a collection of the PDGs of the individual procedures, augmented with extra edges that capture the interaction between them. Slicing of interprocedural programs is described by Horwitz et al. [63]. They use the SDG to track dependences in a program and use a two-phase algorithm to ensure that only feasible paths are tracked, that is, those in which procedure calls are matched with the correct return statements.

Slicing object oriented programs adds yet another dimension of complexity to the slicing problem. Object-oriented concepts such as classes, objects, inheritance, polymorphism and dynamic binding make the representation and analysis techniques used for imperative programming languages inadequate for object-oriented programs. The Class Dependence Graph, introduced by Larsen and Harrold [66], can represent the class hierarchy, data members and polymorphism. Further features were added by Liang and Harrold [67]. The resolution of aliases is required for the correct computation of data dependences. To compute the dependence graph, it is also necessary to build a call graph. The computation of the call graph becomes complicated in the presence of dynamic binding, i.e. when the target of a method call depends on the runtime type of a variable. Algorithms like Rapid Type Analysis (RTA) [26] compute call graphs using type information.

A key analysis for object oriented languages is alias analysis. The objective here is to follow an object O from its point of allocation to find out which objects reference O and which other objects are referenced by the fields of O. Resolving aliasing is important for the correct computation of data dependences in the dependence graph. The precision of the analysis depends on various factors such as flow sensitivity, context sensitivity and the handling of field references. Andersen [64] gives a flow insensitive method for finding aliases using subset constraints. Lhotak [70] describes the method adapted for Java programs.

In this thesis we implement a slicing tool for sequential Java programs and integrate it into the SOOT framework. We briefly describe the framework and the contributions of the thesis.
1.2 The SOOT Framework

The SOOT analysis and transformation framework [69] is a Java optimization framework developed by the Sable Research Group at McGill University. It is intended to be a robust, easy-to-use research infrastructure and has been used extensively for program analysis, instrumentation, and optimization. It provides several forms of intermediate code for analyzing and optimizing Java bytecode. Jimple, a typed three-address representation, is the form we use in our implementation. Our objective is to implement a slicing tool within the SOOT framework [69] and make it publicly available. At the time this work was begun there was no publicly available slicing infrastructure for Java. The Indus [81] project addresses the slicing problem for Java programs, and its source code was made available in February 2007.

1.3 Contributions of the thesis

The following are the contributions of this thesis:

1. We have implemented the routines for creating the program dependence graphs and the class dependence graph for an input Java program that is represented in the form of Jimple intermediate code.

2. We have integrated a slicer into the framework. For interprocedural slicing we have implemented the two-phase slicing algorithm of [63].

3. We propose an improved technique for intraprocedural points-to analysis. This uses path expressions to track paths that encode valid points-to information. A simple data-flow formulation collects valid edges, i.e. those that are added to the object flow graph. Reachability queries are handled in a reasonable amount of time. We have implemented this technique and compare the results of the analysis with those for a flow-insensitive scheme in SOOT.

4. The slicing tool has been run on several benchmarks, and we report the time taken to build the class dependence graph, its size, slice sizes for some given slicing criteria, and slicing times.
Chapter 2

Slicing

In this chapter, we discuss techniques for slicing a program and, in particular, issues that arise when slicing object oriented programs. The first part of the chapter describes the Program Dependence Graph (PDG), its construction and the algorithm for intraprocedural slicing. For slicing programs with function calls, the System Dependence Graph (SDG) is used. The SDG is a collection of the PDGs of individual procedures with additional edges for modeling procedure calls and parameter bindings. The second part of the chapter describes the construction of the SDG and the algorithm for interprocedural slicing. The third part of the chapter describes dependence graph computation for object oriented programs, which is complicated because objects can be passed as parameters and methods can be invoked upon objects. We also need the results of points-to analysis to determine which objects may be pointed to by each reference variable. We then describe the extension of the algorithm for computing the dependence graph in the presence of inheritance and polymorphic function calls.

2.1 Intraprocedural Slicing using PDG

Weiser's approach [61] to program slicing is based on dataflow equations. In his approach, the set of relevant variables is iteratively computed until a fixed point is reached. Slicing via graph reachability was introduced by Ottenstein [54]. In this approach a dependence graph of the program is constructed and the problem of slicing reduces to computing reachability on the dependence graph. We adopt this approach in our implementation.

2.1.1 Program Dependence Graph

A program dependence graph (PDG) represents the data and control dependences in the program. Nodes of the PDG represent statements and predicates in a source program, and its edges denote dependence relations. The PDG can be constructed as follows.

1. Build the program's CFG, and use it to compute data and control dependences: Node N is data dependent on node M iff M defines a variable x, N uses x, and there is an x-definition-free path in the CFG from M to N. Node N is control dependent on node M iff M is a predicate node whose evaluation to true or false determines whether N will be executed.

2. Build the PDG. The nodes of the PDG are almost the same as the nodes of the CFG. In addition, there is a special enter node, and a node for each predicate. The PDG does not include the CFG's exit node. The edges of the PDG represent the data and control dependences computed using the CFG.

2.1.2 Slicing using the Program Dependence Graph

To compute the slice from statement (or predicate) S, start from the PDG node that represents S and follow the data- and control-dependence edges backwards in the PDG. The components of the slice are all of the nodes reached in this manner, as sketched below. The computation of the data dependence graph is described in Section 2.1.3; the computation of the control dependence graph is described in Section 2.1.4. Figure 2.4 shows an example program and its corresponding PDG. Solid lines represent control dependences while dashed lines represent data dependences.
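The backward traversal itself is plain graph reachability. The following is a minimal Java sketch, assuming the PDG has been flattened into a map from each node to the set of nodes it is data- or control-dependent on; the class and method names are illustrative and do not correspond to the representation used in our implementation.

    import java.util.*;

    /**
     * Backward slicing as graph reachability over dependence edges.
     * deps.get(n) holds the nodes that n is data- or control-dependent on.
     */
    final class PdgSlicer<N> {
        private final Map<N, Set<N>> deps;

        PdgSlicer(Map<N, Set<N>> deps) { this.deps = deps; }

        Set<N> backwardSlice(N criterion) {
            Set<N> slice = new HashSet<>();
            Deque<N> worklist = new ArrayDeque<>();
            slice.add(criterion);
            worklist.push(criterion);
            while (!worklist.isEmpty()) {
                N n = worklist.pop();
                // every node that n depends on belongs to the slice
                for (N m : deps.getOrDefault(n, Collections.emptySet())) {
                    if (slice.add(m)) {      // true only the first time m is seen
                        worklist.push(m);
                    }
                }
            }
            return slice;
        }
    }

A worklist is used so that each node is pushed at most once, which makes the traversal linear in the number of dependence edges.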
2.1.3 Construction of the Data Dependence Graph

A data dependence graph represents the association between definitions and uses of a variable. There is an association (d, u) between a definition of variable v at d and a use of variable v at u iff there is at least one control flow path from d to u with no intervening definition of v. Each node represents a statement, and an edge represents a flow dependence between statements. Though there are many kinds of data dependences between statements, only flow dependences are necessary for the purpose of slicing, as only flow dependence needs to be traced back in order to compute the PDG nodes comprising the slice. Output and anti dependence edges do not represent true data dependence; instead they encode a partial order on program statements, which needs to be preserved only when there is no explicit control flow relation between PDG nodes. However, PDG slices are normally mapped back to high-level source code, where control flow is explicitly represented, so there is no need for any such control flow information to be present in the computed PDG slice.

Flow dependences are computed by solving the reaching definitions problem, a classical bitvector problem solvable in a monotone dataflow framework. This associates each program point with the set of definitions reaching that point. The definitions reaching a program point, together with the uses of variables at that point, yield the flow dependences.
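For illustration, the following is a minimal sketch of the standard iterative reaching-definitions solver, written with explicit sets rather than bitvectors; the CFG, GEN and KILL sets are assumed to be given, and the names are illustrative rather than those of our implementation.

    import java.util.*;

    /**
     * Iterative reaching-definitions analysis over a CFG.
     * N is the node type, D identifies individual definitions.
     */
    final class ReachingDefinitions<N, D> {
        final Map<N, Set<D>> in = new HashMap<>();
        final Map<N, Set<D>> out = new HashMap<>();

        void solve(List<N> nodes, Map<N, Set<N>> preds,
                   Map<N, Set<D>> gen, Map<N, Set<D>> kill) {
            for (N n : nodes) {
                in.put(n, new HashSet<>());
                out.put(n, new HashSet<>());
            }
            boolean changed = true;
            while (changed) {                        // iterate until a fixed point
                changed = false;
                for (N n : nodes) {
                    Set<D> inN = new HashSet<>();
                    for (N p : preds.getOrDefault(n, Collections.emptySet())) {
                        inN.addAll(out.get(p));      // IN[n] = union of OUT[p] over predecessors p
                    }
                    Set<D> outN = new HashSet<>(inN);
                    outN.removeAll(kill.getOrDefault(n, Collections.emptySet()));
                    outN.addAll(gen.getOrDefault(n, Collections.emptySet()));
                    if (!outN.equals(out.get(n))) {  // OUT[n] = GEN[n] ∪ (IN[n] − KILL[n])
                        out.put(n, outN);
                        changed = true;
                    }
                    in.put(n, inN);
                }
            }
        }
    }

Flow dependence edges are then obtained by pairing each definition d in the IN set of a node with the uses, at that node, of the variable defined by d.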
Dependence in the presence of arrays and records

In the presence of composite data types like arrays, records and pointers, the most conservative method is to treat a definition of any part of a composite object as a definition of the entire object [83]. A definition (or use) of an element of an array can be considered a definition (or use) of the entire array. For example, consider the statement

    a[i] = x

Here the variable a is defined and the variables i and x are used, so DEF = {a} and REF = {i, x}. However, the value of a is also used in computing the address of a[i], so a must be included in the REF set as well; the correct value of REF is {a, i, x} [45]. This approach is conservative and leads to large slices because of spurious dependences. Our current implementation handles composite data types in this manner, though more refined methods have been proposed in the literature. Agrawal et al. [53] propose a modified algorithm for computing reaching definitions that determines the memory locations defined and used by each statement and computes whether the intersection among those locations is complete, partial, or statically indeterminable. Another way to avoid spurious dependences is to use array index tests, such as GCD tests, which can determine that there is no dependence between two array access expressions.

Data dependences in the presence of aliasing

When computing data dependences, the major problem arises from aliasing. Consider the following example. Here there is a data dependence between x.a = ... and ... = y.a, since both x and y point to the object o1. Without alias analysis this dependence is missed, because the syntactic expressions x.a and y.a are different. Thus resolving aliases is necessary for the correct computation of data dependences. Also, if worst-case assumptions are made for field loads and stores, many spurious dependences are created.

    void fun() {
        obj x, y;
        x = new obj();   // o1 is the object created
        y = x;
        x.a = ....;
        ... = y.a;
    }
2.1.4 Control Dependence Graph

Another kind of dependence between statements arises due to the presence of control structure. Consider the following code:

    P:  if (x > y)
    S1:     max = x;
        else
    S2:     max = y;

The execution of S1 depends on the predicate x > y, so S1 is said to be control dependent on P. A slice with respect to S1 has to include P, because the execution of S1 depends on the outcome of the predicate node P. Two nodes Y and Z should be identified as having identical control conditions if, in every run of the program, Y is executed if and only if Z is executed. In Figure 2.1, nodes 2 and 5 are said to be control dependent on the true branch of node 1, since their execution depends on the outcome of node 1. The original method for computing control dependence information using postdominators is presented by Ferrante et al. [47]. Cytron et al. [46] give an improved method for constructing control dependence information using dominance frontiers.

Finding control dependence using the postdominator relationship

A node X is said to be a postdominator of node Y if all paths from Y to the exit node pass through X. A node N is said to be control dependent on an edge a → b if

1. N postdominates b, and
2. N does not postdominate a.

In Figure 2.1, to find the nodes that are control dependent on the edge 1 → 2, we find the nodes that postdominate node 2 but not node 1. Nodes 2 and 5 are such nodes, so nodes 2 and 5 are control dependent on the edge 1 → 2. This observation suggests that, to find the nodes that are control dependent on an edge X → Y, we can walk up the postdominator tree starting from Y, marking every node visited as control dependent on the edge, and stop when we reach the postdominator of X.

[Figure 2.1: A Control Flow Graph. Figure 2.2: Post Dominator Tree for the CFG in Figure 2.1. The drawings are not reproduced here.]

Using dominance frontiers to compute control dependence

Control dependences between statements can be computed efficiently using dominance frontier information. Cytron et al. [46] describe the method for computing dominance frontiers. The dominance frontier of a vertex vi contains all vertices vj such that vi dominates an immediate predecessor of vj, but vi does not strictly dominate vj [62]:

    DF(vi) = { vj ∈ V | ∃ vk ∈ Pred(vj) : (vi dom vk) ∧ ¬(vi sdom vj) }

Informally, the set of nodes lying just outside the region dominated by Y is the dominance frontier of Y. In the example in Figure 2.3, Y dominates the nodes Y', Y'' and Y''', and X lies just outside the dominated region, so X is in the dominance frontier of Y.

[Figure 2.3: Dominance Frontiers. The drawing is not reproduced here.]

Note that if X is in the dominance frontier of Y, then there are at least two incoming paths to X, one of which contains Y and one of which does not. If the CFG is reversed, then there are two outgoing paths from X, one containing Y and another not containing Y. This is the same as the condition for Y to be control dependent on X. Thus, to find control dependences it is enough to compute dominance frontiers on the reverse control flow graph.
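The dominance frontiers themselves can be computed in a single bottom-up pass over the (post)dominator tree, following Cytron et al. [46]. The sketch below assumes the immediate dominator relation, the dominator-tree children and the successor relation of the (reversed) CFG are already available; the names are illustrative and are not part of the SOOT API.

    import java.util.*;

    /**
     * Dominance frontiers via one bottom-up pass over the dominator tree.
     * Run on the reverse CFG (with the postdominator tree) to obtain the
     * information needed for control dependence.
     */
    final class DominanceFrontiers<N> {
        /**
         * bottomUp : dominator-tree nodes ordered children-before-parents
         * idom     : immediate dominator of each node
         * succ     : successors in the (possibly reversed) CFG
         * children : children in the dominator tree
         */
        Map<N, Set<N>> compute(List<N> bottomUp, Map<N, N> idom,
                               Map<N, Set<N>> succ, Map<N, Set<N>> children) {
            Map<N, Set<N>> df = new HashMap<>();
            for (N x : bottomUp) {
                Set<N> dfx = new HashSet<>();
                // DF_local: successors of x that x does not immediately dominate
                for (N y : succ.getOrDefault(x, Collections.emptySet())) {
                    if (!x.equals(idom.get(y))) dfx.add(y);
                }
                // DF_up: frontier nodes of x's children that x does not immediately dominate
                for (N z : children.getOrDefault(x, Collections.emptySet())) {
                    for (N y : df.getOrDefault(z, Collections.emptySet())) {
                        if (!x.equals(idom.get(y))) dfx.add(y);
                    }
                }
                df.put(x, dfx);
            }
            return df;
        }
    }

When this is run on the reverse CFG with the postdominator tree, the computed frontier of a node N is exactly the set RDF(N) used in Algorithm 1, i.e. the nodes on which N is control dependent.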
Algorithm 1 computes the control dependence information using the reverse dominance frontiers.

Algorithm 1: Computing the Control Dependence Graph

    compute the dominance frontiers of the reversed CFG G
    for all N in G do
        let RDF(N) be the reverse dominance frontier of N
        if RDF(N) is empty then
            make N control dependent on the method entry node
        end if
        for all nodes P in RDF(N) do
            for all nodes S that are CFG successors of P do
                if S = N or N postdominates S then
                    make N control dependent on P
                end if
            end for
        end for
    end for

2.1.5 Slicing in the presence of unstructured control flow

In the presence of unstructured control flow caused by jump statements like goto, break, continue and return, the slicing algorithm described so far can produce an incorrect slice. While Java does not have goto statements, break and continue statements do cause unstructured control flow. Consider computing the slice with respect to the statement print(prod) in Figure 2.4. When the slicing algorithm discussed in Section 2.1.2 is applied, the statement break is not included, which is incorrect. This was discovered by Choi and Ferrante [38] and by Ball and Horwitz [37], who present a method to compute a correct slice in the presence of unstructured control flow statements. Their correction is based on the observation that jumps are similar to predicate nodes in that both affect the flow of control. Thus jumps are also made sources of control dependence edges. A jump vertex has an outgoing true edge to the target of the jump, and an outgoing false edge to the statement that would execute if the jump were a no-op. A jump vertex is considered a pseudo predicate, since the outgoing false edge is non-executable. The original CFG augmented with these non-executable edges is called the Augmented Control Flow Graph (ACFG). Kumar and Horwitz [39] describe the algorithm below for slicing in the presence of jump statements.
Figure 2.4(a) shows the example program; its CFG and PDG (Figures 2.4(b) and (c)) are not reproduced here.

    prod = 1;
    k = 1;
    while (k <= 10) {
        if (MAXINT/k < prod)
            break;
        prod = prod * k;
        k++;
    }
    print(k);
    print(prod);

Figure 2.4: A program and its PDG (taken from [39])

[Figure 2.5: Augmented CFG and PDG for the program in Figure 2.4 (taken from [39]). Panels: (a) ACFG, (b) corresponding APDG; the drawings are not reproduced here.]
1. Build the program's augmented control flow graph, as described previously. Labels are treated as separate statements; i.e., each label is represented in the ACFG by a node with one outgoing edge to the statement that it labels.

2. Build the program's augmented PDG. Ignore the non-executable ACFG edges when computing data-dependence edges; do not ignore them when computing control-dependence edges. (This way, the nodes that are executed only because a jump is present, as well as those that are not executed but would be if the jump were removed, are control dependent on the jump node, and therefore the jump will be included in their slices.)

3. To compute the slice from node S, follow data- and control-dependence edges backwards from S. A label L is included in a slice iff a statement "goto L" is in the slice.

2.1.6 Reconstructing the CFG from the sliced PDG

Reconstructing the CFG from the PDG is described in [71]. From the CFG and the PDG slice, a sliced CFG is constructed by walking through all nodes. For each node n, we do the following (a sketch of this walk is given after the list):

1. If n is a goto statement or return statement, leave it in the slice.

2. If n is a conditional statement, there are three cases:
   (a) If n is not in the PDG slice, it can be removed.
   (b) If n is in the PDG slice, but one of its branches is not, replace the jump to that branch with a jump to the convergence node of the branch (the node where the two branches reconnect). If that node does not exist, replace the jump with a jump to the return statement of the program.
   (c) If n is in the PDG slice and both branches are present, leave n in the CFG.

3. Otherwise, check whether n is present in the PDG slice; if not, remove it.
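The following is a minimal Java sketch of this reconstruction walk. The Cfg interface, the convergence-node lookup and the redirect operation are hypothetical helpers introduced only for this illustration (they are not SOOT API), and membership of a branch in the slice is approximated here by membership of the first node of that branch.

    import java.util.*;

    /** Sketch of the CFG reconstruction walk over a sliced PDG. */
    final class CfgSlicer<N> {

        interface Cfg<N> {
            List<N> nodes();
            boolean isJump(N n);              // goto or return statement
            boolean isConditional(N n);
            List<N> branchTargets(N n);       // first node of each branch of a conditional
            N convergenceNode(N branchStart); // node where the branches reconnect, or null
            N returnNode();                   // the procedure's return statement
            void remove(N n);
            void redirect(N cond, N oldTarget, N newTarget);
        }

        void reconstruct(Cfg<N> cfg, Set<N> pdgSlice) {
            for (N n : new ArrayList<>(cfg.nodes())) {
                if (cfg.isJump(n)) {
                    continue;                                   // rule 1: jumps are kept
                }
                if (cfg.isConditional(n)) {                     // rule 2
                    if (!pdgSlice.contains(n)) {
                        cfg.remove(n);                          // case (a)
                        continue;
                    }
                    for (N target : cfg.branchTargets(n)) {
                        if (!pdgSlice.contains(target)) {       // case (b): branch not in the slice
                            N conv = cfg.convergenceNode(target);
                            cfg.redirect(n, target, conv != null ? conv : cfg.returnNode());
                        }
                    }
                    // case (c): both branches in the slice -- nothing to do
                } else if (!pdgSlice.contains(n)) {
                    cfg.remove(n);                              // rule 3
                }
            }
        }
    }

Rule 2(b) is the only case that rewrites edges; every other case either keeps or deletes the node.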
We next describe the interprocedural slicing algorithm implemented in this thesis.

2.2 Interprocedural Slicing using SDG

2.2.1 System Dependence Graph

For interprocedural slicing, Horwitz et al. [63] introduce the System Dependence Graph (SDG). A system dependence graph is a collection of program dependence graphs, one for each procedure, with additional edges for modeling parameter passing. Figure 2.6 shows a program with function calls; Figure 2.7 displays its SDG.

    main() {
        sum = 0;
        i = 1;
        while (i < 11) {
            sum = add(sum, i);
            i = add(i, 1);
        }
        print(sum);
        print(i);
    }

    int add(int a, int b) {
        result = a + b;
        return result;
    }

Figure 2.6: A program with function calls

[Figure 2.7: System Dependence Graph for an interprocedural program. The graph drawing, showing control, data, call, parameter and summary edges, is not reproduced here.]

Each PDG contains an entry node that represents entry to the procedure. To model procedure calls and parameter passing, an SDG introduces additional nodes and edges. Accesses to global variables are modeled via additional parameters of the procedure. Parameters are assumed to be passed by value-result, and the following additional nodes are introduced in the interprocedural case:

1. Call-site nodes representing the call sites.
2. Actual-in and actual-out nodes representing the input and output parameters at the call sites. They are control dependent on the call-site node.
3. Formal-in and formal-out nodes representing the input and output parameters at the called procedure. They are control dependent on the procedure's entry node.

Additional edges link the program dependence graphs together:

1. Call edges link the call-site nodes with the procedure entry nodes.
2. Parameter-in edges link the actual-in nodes with the formal-in nodes.
3. Parameter-out edges link the formal-out nodes with the actual-out nodes.

2.2.2 Calling context problem

For computing an intraprocedural slice, a simple reachability algorithm on the PDG is sufficient. In the interprocedural case, however, simple reachability over the SDG does not work, since not all paths are valid. For example, in Figure 2.7, the path a_in = sum → a = a_in → result = a + b → r_out = result → i = r_out is not valid interprocedurally. In an interprocedurally valid path, a call edge must be matched with its corresponding return edge. To address this problem, Horwitz et al. [63] introduce the concept of summary edges, which summarize the effect of a procedure call: there is a summary edge between an actual-in node and an actual-out node of a call site if there is a dependence between the corresponding formal-in and formal-out nodes of the called procedure.
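For the discussion that follows, it is convenient to think of every SDG edge as carrying a kind tag. The small sketch below shows one possible representation; the names are illustrative and do not correspond to the data structures of our implementation.

    /** Kinds of edges that can occur in an SDG (illustrative names). */
    enum EdgeKind { CONTROL, DATA, CALL, PARAM_IN, PARAM_OUT, SUMMARY }

    /** A directed SDG edge from source to target, tagged with its kind. */
    final class SdgEdge<N> {
        final N source;
        final N target;
        final EdgeKind kind;

        SdgEdge(N source, N target, EdgeKind kind) {
            this.source = source;
            this.target = target;
            this.kind = kind;
        }
    }

A summary edge is then simply an edge of kind SUMMARY from an actual-in node to an actual-out node of the same call site. Algorithm 2 below shows how such edges are computed, and the two-phase slicer sketched after Algorithm 3 consults the kind tag to decide which edges to follow in each phase.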
2.2.3 Computing Summary Edges

The computation of summary edges is described in Algorithm 2. The algorithm takes the given SDG and adds summary edges. P is the set of path edges; each element (n, m) of P encodes the information that there is a realizable path in the SDG from n to m. The worklist W contains path edges that still need to be processed. The algorithm begins by asserting that there is a realizable path from each formal-out node to itself. The set of realizable paths P is then extended by traversing dependence edges backwards. If a formal-in node is encountered during the traversal, then there is a realizable path from a formal-in node to a formal-out node, so a summary edge is added between the actual-in and actual-out nodes at the corresponding call sites. Because the insertion of summary edges makes more paths realizable, this process is continued iteratively until no more summary edges can be added. Computing the summary edges in this way is equivalent to the functional approach suggested by Sharir and Pnueli [41].

Algorithm 2: Computing Summary Information

    W = ∅        (W is the worklist)
    P = ∅        (P is the set of path edges)
    for all formal-out nodes n do
        W = W ∪ {(n, n)}
        P = P ∪ {(n, n)}
    end for
    while W ≠ ∅ do
        remove one element (n, m) from W
        if n is a formal-in node then
            for all parameter-in edges n' → n do
                for all parameter-out edges m → m' do
                    if n' and m' belong to the same call site then
                        E = E ∪ {n' → m'}          (add a new summary edge)
                        for all (m', x) ∈ P do
                            P = P ∪ {(n', x)}
                            W = W ∪ {(n', x)}
                        end for
                    end if
                end for
            end for
        else
            for all edges n' → n do
                if (n', m) ∉ P then
                    P = P ∪ {(n', m)}
                    W = W ∪ {(n', m)}
                end if
            end for
        end if
    end while

2.2.4 The Two Phase Slicing Algorithm

Horwitz et al. [63] describe the two-phase interprocedural backward slicing algorithm. The first phase traverses backwards from the node in the SDG that represents the slicing criterion along all edges except parameter-out edges, and marks the nodes that are reached. The second phase traverses backwards from all nodes marked during the first phase along all edges except call and parameter-in edges, and marks the reached nodes. The slice is the union of the marked nodes. Let s be the slicing criterion in procedure P.

1. Phase 1 identifies vertices that can reach s and are either in P itself or in a procedure that calls P (directly or transitively). Because parameter-out edges are not followed, the traversal in Phase 1 does not descend into procedures called by P. Though the traversal does not descend into the called procedures, the effects of such procedures are not ignored, thanks to the presence of summary edges.

2. Phase 2 identifies vertices that reach s from procedures (transitively) called by P or from procedures called by procedures that (transitively) call P. Because call edges and parameter-in edges are not followed, the traversal in Phase 2 does not ascend into calling procedures; the transitive dependence (summary) edges from actual-in to actual-out vertices make such ascents unnecessary.

We implemented a variation of the two-phase slicing algorithm as described by Krinke [49] (Algorithm 3). Figure 2.8 shows the vertices of the SDG marked during Phase 1 and Phase 2 when the statement print(i) is given as the slicing criterion. The first phase traverses backwards along all edges except the parameter-out edge r_out = result → i = r_out, so it does not descend into the procedure add. The second phase traverses backwards along all edges except the parameter-in and call edges, so neither the edge a_in = sum → a = a_in nor the edge call add → a = a_in is traversed.

2.2.5 Handling Shared Variables

This section deals with variables that are shared across procedures. Shared variables include global variables in imperative languages. Though Java does not have global variables, the instance members of a class can be treated as global variables that are accessible to the member functions. Shared variables are handled by passing them as additional parameters to every function. Treating every shared variable as a parameter is correct but inefficient, as it increases the number of nodes. We can reduce the number of parameters passed by doing interprocedural analysis and using the GMOD and GREF information [42]:

1. GMOD(P): the set of variables that might be modified by P itself or by a procedure (transitively) called from P.
2. GREF(P): the set of variables that might be referenced by P itself or by a procedure (transitively) called from P.
[Figure 2.8: Slicing the System Dependence Graph. The SDG of Figure 2.7 with the vertices marked in Phase 1 and Phase 2 highlighted; the drawing is not reproduced here.]
Algorithm 3: Two-phase slicing algorithm (Krinke's version)

    input:  G = (N, E), the given SDG; s ∈ N, the slicing criterion
    output: S ⊆ N, the slice

    S = {s}
    W_up = {s}
    W_down = ∅

    (First phase)
    while W_up ≠ ∅ do
        remove one element n from W_up
        for all m → n ∈ E do
            if m ∉ S then
                if m → n is a parameter-out edge then
                    W_down = W_down ∪ {m}
                    S = S ∪ {m}
                else
                    W_up = W_up ∪ {m}
                    S = S ∪ {m}
                end if
            end if
        end for
    end while

    (Second phase)
    while W_down ≠ ∅ do
        remove one element n from W_down
        for all m → n ∈ E do
            if m ∉ S then
                if m → n is not a parameter-in edge or a call edge then
                    W_down = W_down ∪ {m}
                    S = S ∪ {m}
                end if
            end if
        end for
    end while
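The same two-phase traversal can be written compactly in Java. The sketch below mirrors Algorithm 3 and reuses the illustrative SdgEdge/EdgeKind representation from Section 2.2.2, with the SDG given as a map from each node to its incoming edges; it is an illustration, not the code of our implementation.

    import java.util.*;

    /** Two-phase backward slicing over an SDG, mirroring Algorithm 3. */
    final class TwoPhaseSlicer<N> {
        private final Map<N, List<SdgEdge<N>>> incoming;

        TwoPhaseSlicer(Map<N, List<SdgEdge<N>>> incoming) { this.incoming = incoming; }

        Set<N> slice(N criterion) {
            Set<N> slice = new HashSet<>();
            Deque<N> up = new ArrayDeque<>();     // worklist W_up
            Deque<N> down = new ArrayDeque<>();   // worklist W_down
            slice.add(criterion);
            up.push(criterion);

            // Phase 1: follow every incoming edge except parameter-out edges,
            // so the traversal never descends into called procedures.
            while (!up.isEmpty()) {
                N n = up.pop();
                for (SdgEdge<N> e : incoming.getOrDefault(n, Collections.emptyList())) {
                    if (slice.contains(e.source)) continue;
                    slice.add(e.source);
                    if (e.kind == EdgeKind.PARAM_OUT) {
                        down.push(e.source);      // processed only in phase 2
                    } else {
                        up.push(e.source);
                    }
                }
            }

            // Phase 2: follow every incoming edge except call and parameter-in edges,
            // so the traversal never ascends into calling procedures.
            while (!down.isEmpty()) {
                N n = down.pop();
                for (SdgEdge<N> e : incoming.getOrDefault(n, Collections.emptyList())) {
                    if (slice.contains(e.source)) continue;
                    if (e.kind != EdgeKind.CALL && e.kind != EdgeKind.PARAM_IN) {
                        slice.add(e.source);
                        down.push(e.source);
                    }
                }
            }
            return slice;
        }
    }

Both phases push each node at most once, so once the summary edges are available the slice is computed in time linear in the number of SDG edges.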
  • 33. Chapter 2. Slicing 26 GMOD and GREF sets are used to determine which parameter vertices are included in procedure dependence graphs . At procedure entry, these nodes are inserted 1. Formal in for each variable in GM OD(P ) ∪ GREF (P ) 2. Formal out for each variable in GM OD(P ) Similarly at a call site, the following nodes are inserted 1. Actual in for each variable in GM OD(P ) ∪ GREF (P ) 2. Actual out for each variable in GM OD(P ) 2.3 Slicing Object Oriented Programs The System Dependence Graph (SDG) is not sufficient to represent all dependencies for object oriented programs. An efficient graph representation of an object oriented program should employ a class representation that can be reused in the construction of other classes and applications that use the class. Section 2.3.1 discuss about dependence graph representation for object oriented programs. Sections 2.3.2 and 2.3.3 discuss about inheritance and polymorphism respectively. 2.3.1 Dependence Graph for Object Oriented Programs The dependencies within a single method are represented using a Method Dependence Graph (MDG), which is composed of data dependence subgraph and control dependence subgraph. The MDG has a method entry node which represents the start of a method. The method entry vertex has a formal in vertex for every formal parameter and a formal out vertex for each formal parameter that may be modified. Each call site has a call vertex and a set of actual parameter vertices: an actual-in vertex for each actual parameter at the call site and an actual-out vertex for each actual parameter that may be modified by the called procedure. Parameter out edges are added from each formal-out node to the corresponding actual-out node. The effects of return statements are modeled by
• 34. Chapter 2. Slicing 27 connecting the return statement to its corresponding call vertex using a parameter-out edge. Summary edges are added from actual-in to actual-out nodes as described in Section 2.2.3. Larsen and Harrold [66] represent the dependencies in a class using the class dependence graph (ClDG). A ClDG is a collection of MDGs constructed for the individual methods in the program. In addition, it contains a class entry vertex that is connected to the method entry vertex of each method in the class by a class member edge. Class entry vertices and class member edges let us track dependencies that arise due to interaction among classes. In the presence of multiple classes, additional dependence edges are required to record the interaction between classes. For example, when a class C1 creates an object of class C2, there is an implicit call to C2's constructor. When there is a call site in method m1 of class C1 to method m2 of class C2, there is a call dependence edge from the call site in m1 to the method entry vertex of m2. Parameter-in edges are added from each actual-in node to the corresponding formal-in node, and parameter-out edges are added from each formal-out node to the corresponding actual-out node. In object oriented programs, data dependence computation is complicated by the fact that statements can read from and write to fields of objects, i.e. a statement can have side effects. Computation of side effect information requires points to analysis and is discussed further in Chapter 3. Also, methods can be invoked on objects and objects can be passed as parameters. An algorithm for computing data dependence must take this into account. Handling objects at callsites In the presence of a call invoked on an object, such as o.m1(), the call can modify the data members of o. Larsen and Harrold observe that the data member variables of a class are accessible to all methods in the class and hence can be treated as global variables. They use additional parameters to represent the data members referenced by a method. Thus the data dependence introduced by two consecutive method calls via data
• 35. Chapter 2. Slicing 28

    class Base {
        int a, b;
        protected void vm() {
            a = a + b;
        }
        public Base() {
            a = 0;
            b = 0;
        }
        public void m2(int i) {
            b = b + i;
        }
        public void m1() {
            if (b > 0) vm();
            b = b + 1;
        }
        public void main1() {
            Base o = new Base();
            Base ba = new Base();
            ba.m1();
            ba.m2(1);
            o.m2(1);
        }
        public void C(Base ba) {
            ba.m1();
            ba.m2(1);
        }
        public void D() {
            Base o = new Base();
            C(o);
            o.m1();
        }
    }

    class Derived extends Base {
        long d;
        public void vm() {
            d = d + b;
        }
        public Derived() {
            super();
            d = 0;
        }
        public void m3() {
            d = d + 1;
            m2(1);
        }
        public void m4() {
            m1();
        }
        public void main2() {
            int i = read();
            Base p;
            if (i > 0)
                p = new Base();
            else
                p = new Derived();
            C(p);
            p.m1();
        }
    }

Figure 2.9: Program
  • 36. Chapter 2. Slicing 29 Figure 2.10: The Dependence Graph for the main function (from [67]) Figure 2.11: The Dependence Graphs for functions C() and D() (from [67])
• 37. Chapter 2. Slicing 30 member variables can be represented as data dependence between the actual parameters at the method callsites. Figure 2.10 shows the dependence graph constructed for the main program of Figure 2.9. Variables a and b are considered as global variables shared across methods m1(), m2() and Base(). The data member variables are considered as additional parameters that are passed to the function. This method of slicing includes only those statements that are necessary for the data members at the slicing criterion to receive correct values. For example, slicing with respect to the node b = b_out associated with the statement o.m2() will exclude statements that assign to data member a. One source of imprecision of this method is that it does not consider the fact that data members may belong to different objects, and it therefore creates spurious dependencies between data members of different objects. In the above example, the slice wrongly includes the statements ba.m1() and ba.m2(). Liang and Harrold [67] give an improved algorithm for object sensitive slicing. In the dependence graph representation of [67], the constructor has no formal-in vertices for the instance variables, since these variables cannot be referenced before they are allocated by the class constructor. Thus the algorithm omits formal-in vertices for instance variables in the class constructor. In the approaches of [67] and [66], the data members of the class are treated as additional parameters to be passed to the function. This increases the number of parameter nodes. The number of additional nodes can be reduced using GMOD/GREF information: actual-out and formal-out vertices are needed only for those data members that are modified by the member function, and actual-in and formal-in vertices are needed only for those data members accessed by the function. Handling Parameter Objects Tonella [59] represents an object as a single vertex when the object is used as a parameter. This representation can lead to imprecise slices because it considers modification (or access) of an individual field in an object to be a modification (or access) of the entire object. For example, if the slicing criterion is o.b at the end of D() (in Figure 2.9), then C(o) must be included. This in turn causes the slicer to include the parameter ba,
• 38. Chapter 2. Slicing 31 which causes ba.a and ba.b to be included, though ba.a does not affect o.b. To overcome this limitation, Liang and Harrold [67] expand the parameter object as a tree. Figure 2.11 shows the parameter ba being expanded into a tree. At the first level, the node representing ba is expanded into two nodes, Base and Derived, each representing a type that ba can possibly have. At the next level, each node is expanded into its constituent data members. Since data members can themselves be objects, the expansion is done recursively until primitive data types are reached. In the presence of recursive data types, where the tree height can be infinite, k-limiting is used to limit the height of the tree to k. At the call statement C(o) in Figure 2.9, the parameter object o is expanded into its data members. At the function call, actual-in and actual-out vertices are created for the data members of o. Summary edges are added between the actual-in and actual-out vertices if a dependence is possible through the called procedure. 2.3.2 Handling Inheritance Java provides a single inheritance model, which means that a new Java class can be designed that inherits state variables and functionality from an existing class. The functionality of base class methods can be overridden by simply redefining the methods in the derived class. Larsen and Harrold [66] construct dependence graph representations for the methods defined by the derived class. The representations of all methods that are inherited from superclasses are simply reused. To construct the dependence graph representation of class Derived (Figure 2.9), new representations are constructed for methods such as m3() and m4(). The representation of m1() is reused from class Base. Liang and Harrold [67] illustrate that in the presence of virtual methods, it is not possible to directly reuse the representations of the methods of the superclass. For example, we cannot directly reuse the representation for m1() in class Base when we construct the representation for class Derived. In the Base class, the call statement vm() in m1() resolves to Base::vm(). If a class derived from Base redefines vm(), then the call statement vm() no longer resolves to Base::vm(), but to the newly defined vm() of the derived class. The callsites in the representation of m1() for class Derived have to be
• 39. Chapter 2. Slicing 32 changed. A method needs a new representation if: 1. the method is declared in the new class, or 2. the method is inherited from a class higher up in the hierarchy and calls a newly redefined virtual method, directly or indirectly. For example, the methods declared in Derived need a new representation because they satisfy (1); Base.m1() also needs a new representation because it satisfies (2): Base.m1() calls vm(), which is redefined in class Derived. Handling Interfaces In Java, interfaces declare methods but leave the responsibility of defining them to the concrete classes that implement the interface. Interfaces allow the programmer to work with objects through the interface behavior that they implement, rather than through their class definition. Single Interfaces We use the interface representation graph [58] to represent a Java interface and the classes that implement it. There is a unique vertex, called the interface start vertex, for the entry of the interface. Each method declaration in the interface can be regarded as a call to its corresponding method in a class that implements it, and therefore a call vertex is created for each method declaration in the interface. The interface start vertex is connected to each of these call vertices by interface-membership dependence arcs. If more than one class implements the interface, we connect each method call in the interface to every corresponding method in those classes. Interface Extending Similar to extending classes, the representation of an extended interface is constructed by reusing the representations of all methods that are inherited from superinterfaces. For newly defined methods in the extended interface, new representations are created.
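Returning to the two conditions above, the decision of whether a method needs a fresh representation can be sketched as a simple reachability check over resolved call targets. The Method type and the set of methods redefined by the new class are illustrative stand-ins, not the representation the tool actually uses.

    import java.util.*;

    // Sketch of the reuse rule: a method m needs a new representation for
    // class c if it is declared in c itself, or if it directly or transitively
    // calls a virtual method that c redefines.
    class ReuseCheck {
        static class Method {
            String name;
            String declaringClass;
            List<Method> callees = new ArrayList<>();   // resolved call targets
        }

        static boolean needsNewRepresentation(Method m, String c, Set<String> redefinedInC) {
            if (m.declaringClass.equals(c)) return true;              // condition (1)
            return callsRedefined(m, redefinedInC, new HashSet<>());  // condition (2)
        }

        private static boolean callsRedefined(Method m, Set<String> redefined, Set<Method> seen) {
            if (!seen.add(m)) return false;             // avoid cycles in the call graph
            for (Method callee : m.callees) {
                if (redefined.contains(callee.name)) return true;
                if (callsRedefined(callee, redefined, seen)) return true;
            }
            return false;
        }
    }

For the example of Figure 2.9, checking Base.m1() against class Derived with redefinedInC = {vm} returns true, so m1() gets a fresh representation, while which-ever methods never reach vm() can be reused unchanged.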
• 40. Chapter 2. Slicing 33

    interface A {
        void method1(int h);
        void method2(int v);
    }
    interface B extends A {
        void method3(int u);
    }
    class C1 implements A {
        int h, v;
        public void method1(int h1) { this.h = h1; }
        public void method2(int v1) { this.v = v1; }
    }
    class C2 implements A {
        int h, v;
        public void method1(int h2) { this.h = h2 + 1; }
        public void method2(int v2) { this.v = v2 + 1; }
    }
    class C3 implements B {
        int h, v, u;
        public void method1(int h1) { this.h = h1 + 2; }
        public void method2(int v1) { this.v = v1 + 2; }
        public void method3(int u1) { this.u = u1 + 2; }
    }

[Panels (a) and (b) show the interface dependence graphs for A and B: each interface start vertex is connected to one call vertex per declared method by interface-membership dependence arcs, and call, control and parameter dependence arcs link these call vertices to the implementing methods in C1, C2 and C3 together with their formal-in and actual-in vertices.]

Figure 2.12: Interface Dependence Graph (from [58])
• 41. Chapter 2. Slicing 34 2.3.3 Handling Polymorphism In Java, method calls are bound to an implementation at runtime. A method invocation expression such as o.m(args) is executed as follows: 1. The runtime type T of o is determined. 2. The class T is loaded. 3. T is checked for an implementation of method m. If T does not define an implementation, its superclass is checked, and then that class's superclass, until an implementation is found. 4. Method m is invoked with the argument list args, and o is also passed to the method, where it becomes the this value for m. A polymorphic reference can refer to instances of more than one class. A class dependence graph represents such a polymorphic method call by using a polymorphic choice vertex [66]. A polymorphic choice vertex represents the selection of a particular call from a set of possible destinations. In this method, a message sent to a polymorphic object is represented as a set of callsites, one for each candidate message-handling method, connected to a polymorphic choice vertex with polymorphic choice edges. This approach may give incorrect results: in function main2() of Figure 2.9, Larsen's approach uses only one callsite to represent the statement p.m1(), because m1() is declared only in Base. However, when m1() is called on objects of class Derived, it invokes Derived.vm() to modify d, and when m1() is called on objects of class Base, it invokes Base.vm() to modify a. One callsite cannot precisely represent both cases. This approach also computes spurious dependences: it is equivalent to using several objects, each of a different type, to represent a polymorphic object, and the data dependence construction algorithm cannot distinguish data members with the same names in these different objects. Liang and Harrold [67] give an improved representation of polymorphism to overcome this limitation. A polymorphic object is represented as a tree: the root of the tree represents the polymorphic object and the children of the root represent objects of
• 42. Chapter 2. Slicing 35 the possible types. When the polymorphic object is used as a parameter, the children are further expanded into trees; when the polymorphic object receives a message, the children are further expanded into callsites. In Figure 2.11 the callsite ba.m1() can have receiver types Base and Derived. Thus the callsite is expanded into one callsite for each possible receiver type. 2.3.4 Case Study - Elevator Class and its Dependence Graph Figure 2.13 shows the Elevator program and the slice with respect to line 59. Figure 2.14 shows the class dependence graph constructed for the program. The C++ Elevator class discussed in [72] has been modified for Java.
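To connect the polymorphism handling above with the case study that follows, the sketch below expands a polymorphic call such as e.go(5) into one callsite per possible receiver type. The types and the dispatchTable map are hypothetical; in the tool the possible receiver types would come from class hierarchy or points-to information.

    import java.util.*;

    // Sketch: expanding a polymorphic call site into one callsite per possible
    // receiver type, following the tree representation of [67].
    class PolymorphicExpansion {
        static class CallSite {
            String receiverType;    // the assumed runtime type of the receiver
            String resolvedTarget;  // the method that type would actually invoke
            CallSite(String t, String m) { receiverType = t; resolvedTarget = m; }
        }

        // dispatchTable maps a class name to (method name -> implementing method)
        // and is assumed to contain an entry for every possible receiver type.
        static List<CallSite> expand(String methodName, Set<String> possibleTypes,
                                     Map<String, Map<String, String>> dispatchTable) {
            List<CallSite> sites = new ArrayList<>();
            for (String type : possibleTypes) {
                String target = dispatchTable.get(type).get(methodName);
                sites.add(new CallSite(type, target));
            }
            return sites;
        }
    }

For e.go(5) in the Elevator program the possible types are Elevator and AlarmElevator, yielding separate callsites for Elevator.go and AlarmElevator.go, each with its own actual-in and actual-out vertices in the dependence graph.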
• 43. Chapter 2. Slicing 36

     1  class Elevator {
     2      static int UP=1, DOWN=-1;
     3      public Elevator(int t) {
     4          current_floor = 1;
     5          current_direction = UP;
     6          top_floor = t;
     7      }
     8      public void up() {
     9          current_direction = UP;
    10      }
    11      public void down() {
    12          current_direction = DOWN;
    13      }
    14      int which_floor() {
    15          return current_floor;
    16      }
    17      public int direction() {
    18          return current_direction;
    19      }
    20      public void go(int floor) {
    21          if (current_direction == UP) {
    22              while (current_floor != floor &&
    23                     current_floor <= top_floor)
    24                  current_floor = current_floor + 1;
    25          }
    26          else {
    27              while (current_floor != floor &&
    28                     current_floor > 0)
    29                  current_floor = current_floor - 1;
    30      }
    31      int current_floor;
    32      int current_direction;
    33      int top_floor;
    34  }
    35  class AlarmElevator extends Elevator {
    36      public AlarmElevator(int top_floor) {
    37          super(top_floor);
    38          alarm_on = 0;
    39      }
    40      public void set_alarm() {
    41          alarm_on = 1;
    42      }
    43      public void reset_alarm() {
    44          alarm_on = 0; }
    45      public void go(int floor) {
    46          if (!alarm_on)
    47              super.go(floor);
    48      }
    49      protected int alarm_on;
    50  }
    51  class Test {
    52      public static void main(String args[]) {
    53          Elevator e;
    54          if (condition)
    55              e = new Elevator(10);
    56          else
    57              e = new AlarmElevator(10);
    58          e.go(5);
    59          System.out.print(e.which_floor());
    60      }
    61  }

Figure 2.13: The Elevator program
• 44. Chapter 2. Slicing 37 [Class dependence graph for the Elevator program, with the slice point at vertex 59. The graph contains control dependence, data dependence, summary, call and parameter edges, and a key mapping the actual-in/actual-out and formal-in/formal-out vertices for current_floor, current_direction, top_floor, alarm_on and floor to their parameter-copy assignments.] Figure 2.14: Dependence Graph for Elevator program
• 45. Chapter 3 Points to Analysis In this chapter we first discuss the need for points to analysis. In the context of slicing, points to analysis is essential for the correct computation of data dependencies and for the construction of the call graph. We summarize some issues related to computing points to sets, including the methods for their computation and the various factors that affect precision. We next describe Andersen's algorithm for pointer analysis for C and its adaptation for Java. We then describe a new method for intra-procedural alias analysis which is an improvement over flow insensitive analysis but not as precise as a flow sensitive analysis. 3.1 Need for Points to Analysis The goal of pointer analysis is to statically determine the set of memory locations that can be pointed to by a pointer variable. If two variables can access the same memory location, the variables are said to be aliased. Alias analysis is necessary for program analysis, for optimizations, and for the correct computation of data dependence, which in turn is necessary for slicing. Consider the computation of data dependence in Figure 3.1. Here the statement print(y.a) is dependent on x.a = ..., since x and y are aliased due to the execution of the statement y = x. Without alias analysis, it is not possible to infer that statement 7 is dependent on statement 4. A points to graph gives information about the set of memory locations pointed at by
• 46. Chapter 3. Points to Analysis 39

    1  void fun() {
    2      obj x, y;
    3      x = new obj();   // O1 represents the allocated object
    4      x.a = ....;
    5      ... = y.a;
    6      y = x;
    7      print(y.a);
    8  }

Figure 3.1: Need for Points to Analysis

each variable. Figure 3.1 shows a program; Figure 3.2 shows points to graphs for example C and Java programs. In C a variable can point to another stack variable or to dynamically allocated memory on the heap, whereas in Java a reference variable can point only to objects allocated on the heap, as stack variables cannot be pointed to due to the lack of an address-of operator (&). Dynamically allocated memory locations on the heap are not named. One convention is to refer to objects (memory locations) by the statement at which they are created. A statement can be executed many times and can therefore create a new object each time. Thus approximations are introduced in the points to graph if this convention is used. Another cause for approximation is the presence of recursion and dynamic allocation of memory, which lead to a statically unbounded number of memory locations. 3.2 Pointer Analysis using Constraints Our aim is to derive the points to graph from the program text. One method to derive the points to graph is to use constraints [64]. If pts(q) denotes the set of objects initially pointed to by q, then after an assignment such as p = q, p can additionally point to the objects initially pointed to by q. Thus we have the constraint pts(p) ⊇ pts(q). Every statement in the program has an associated constraint, and a solution to the constraints gives the points to set associated with every variable. Constraints such as pts(p) ⊇ pts(q) are also called subset constraints or inclusion based constraints. Andersen uses subset constraints for analyzing C programs, and his algorithm is described in Section 3.4.
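As an illustration of constraint-based analysis, the sketch below generates and solves subset constraints for the two statement forms that occur in Figure 3.1: allocations and copy assignments. Field accesses, calls and dereferences are omitted, and the class is an illustrative sketch rather than the analysis implemented in this thesis.

    import java.util.*;

    // Sketch: subset-constraint generation and a naive fixed-point solver for
    // simple reference assignments (p = new ... and p = q).
    class SubsetConstraints {
        Map<String, Set<String>> pts = new HashMap<>();        // variable -> abstract objects
        Map<String, Set<String>> subsetOf = new HashMap<>();   // q -> {p | pts(p) ⊇ pts(q)}

        void allocation(String p, String site) {               // p = new T(), object named by site
            pts.computeIfAbsent(p, k -> new HashSet<>()).add(site);
        }
        void copy(String p, String q) {                        // p = q  gives  pts(p) ⊇ pts(q)
            subsetOf.computeIfAbsent(q, k -> new HashSet<>()).add(p);
        }
        void solve() {                                         // propagate until nothing changes
            boolean changed = true;
            while (changed) {
                changed = false;
                for (Map.Entry<String, Set<String>> e : subsetOf.entrySet()) {
                    Set<String> from = pts.getOrDefault(e.getKey(), Collections.emptySet());
                    for (String p : e.getValue()) {
                        changed |= pts.computeIfAbsent(p, k -> new HashSet<>()).addAll(from);
                    }
                }
            }
        }
    }

For Figure 3.1, calling allocation("x", "O1") and copy("y", "x") and then solve() yields pts(y) = {O1}, which is exactly the alias fact needed to make statement 7 dependent on statement 4.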
• 47. Chapter 3. Points to Analysis 40

C program:
    int a = 1, b = 2;
    int *p, *q;
    void *r, *s;
    p = &a;
    q = &b;
    h1: r = malloc
    h2: s = malloc
(points to graph: p → a, q → b, r → heap1, s → heap2)

Java program:
    class Obj { int f; }
    Obj r, s, t;
    h1: r = new Obj();
    h2: s = new Obj();
    h3: r.f = new Obj();
    t = s;
(points to graph: r → heap1, s → heap2, t → heap2, heap1.f → heap3)

Figure 3.2: Points to Graphs

Subset vs Unification Constraints The constraints generated can be either subset based or equality based. A subset constraint such as pts(p) ⊇ pts(q) says that the points-to set of p contains the points-to set of q. Instead of using subset constraints, Steensgaard [13] uses equality based constraints, where after each assignment like p = q the points to sets of p and q are unified, i.e. the points to sets of both variables are made identical. Steensgaard's approach is based on a non-standard type system, where a type does not refer to a declared type in the program source. Instead, the type of a variable describes the set of locations possibly pointed to by the variable at runtime. At initialization, each variable is described by a different type. When two variables can point to the same memory location, the types represented by the variables are merged. However, these stronger constraints make the analysis less precise. The equality based approach is also called unification because it treats assignments as bidirectional. This unification merges the
• 48. Chapter 3. Points to Analysis 41 points to sets of both sides of the assignment and essentially computes an equivalence relation defined by the assignments, which is done by the fast union-find algorithm [22]. If all the variables can be assigned types, subject to the constraints, then the system of constraints is said to be satisfiable or well-typed. Points-to analysis reduces to the problem of assigning types to all locations (variables) in a program, such that the variables in the program are well-typed. At the end of the analysis, two locations are assigned different types, unless they have to be described by the same type in order for the system of constraints to be well-typed. 3.3 Dimensions of Precision The factors that contribute to the precision of the computed analysis are flow sensitivity, field sensitivity, context sensitivity and heap modelling. Ryder [17] discusses various parameters that contribute to the precision of the analysis. Flow Sensitive vs Flow Insensitive approach A flow sensitive analysis takes into account the control flow structure of the program. Thus the points-to set associated with a variable depends on the program point; the analysis computes a mapping variable × program point → set of memory locations. This is precise but requires a large amount of memory, since the points to sets of the same variable at two different program points may differ and have to be recorded separately. Flow sensitive analysis allows us to take advantage of strong updates, where after a statement x = ..., the points to information about x prior to that statement can be removed. A flow insensitive approach computes conservative information that is valid at all program points. It considers the program as a set of statements and computes points-to information ignoring control flow. Flow insensitive analysis computes a single points to relation that holds regardless of the order in which assignment statements are actually
• 49. Chapter 3. Points to Analysis 42 executed. A flow insensitive analysis produces imprecise results. Consider the computation of data dependence for the program in Figure 3.1. If we apply flow insensitive alias analysis, then the analysis will conclude that x and y can both point to O1, and thus the statement ... = y.a (line 5) is made dependent on x.a = .... But y can point to O1 only after the statement y = x. Thus flow insensitive analysis leads to spurious data dependence. Field Sensitivity Aggregate objects such as structures can be handled by one of three approaches: field-insensitive, where field information is discarded by modeling each aggregate with a single constraint variable; field-based, where one constraint variable models all instances of a field; and finally, field-sensitive, where a unique variable models each field instance of an object. The following table describes these approaches for the code segment x.a = new object(); y.b = x.a;

    field based          pts(b) ⊇ pts(a)
    field insensitive    pts(y) ⊇ pts(x)
    field sensitive      pts(y.b) ⊇ pts(x.a)

Heap Abstraction Two variables are aliased if they can refer to the same object in memory. Thus we need to keep track of objects that can be present at runtime. The objects created at runtime cannot be determined statically and have to be conservatively approximated. The least precise manner is to consider the entire heap as a single object. The most common manner of abstraction is to have one abstract object per program point. This abstract object is a representative of all the objects that can be created at runtime due to that program
• 50. Chapter 3. Points to Analysis 43

    main() {
        object a, b, c, d;
        a = new object();     pts(a) ⊇ {o1}
        b = new object();     pts(b) ⊇ {o2}
        c = id(a);            pts(r) ⊇ pts(a), pts(c) ⊇ pts(r)
        d = id(b);            pts(r) ⊇ pts(b), pts(d) ⊇ pts(r)
    }
    object id(object r) { return r; }

Figure 3.3: Imprecision due to context insensitive analysis

point. A more precise abstraction is to take context sensitivity into account, using the calling context to distinguish between objects created at the same program point. Context Sensitivity A context sensitive analysis distinguishes between different calling contexts and does not merge data flow information from multiple contexts. In Figure 3.3, a and b point to o1 and o2 respectively. Due to the function calls, c is made to point to o1 and d is made to point to o2. So the actual points to sets are a → o1, b → o2, c → o1 and d → o2. A context insensitive analysis models parameter bindings as explicit assignments. Thus r points to both the objects o1 and o2. This leads to smearing of information, making c and d point to both o1 and o2. One method to incorporate context sensitivity is to summarize each procedure and embed that information at the call sites. A method can change the points to sets of all data reachable through static variables, incoming parameters and all objects created by the method and its callees. A method's summary must include the effect of all the updates that the method and all its callees can make, in terms of the incoming parameters. Thus summaries are huge. There is also another difficulty, due to the callback mechanism.
• 51. Chapter 3. Points to Analysis 44 In the presence of dynamic binding, we do not know which method will be called, making it difficult to summarize the method [1]. Another method to incorporate context sensitivity is the cloning based approach. Cloning based approaches expand the call graph for each calling context, so that there is a separate path for each calling context. A context insensitive algorithm can then be run on the expanded graph. This leads to an exponential blowup. Whaley and Lam [18] use Binary Decision Diagrams (BDDs) to handle the exponential increase in complexity caused by cloning. BDDs were first used for pointer analysis by Berndl et al. [31]. Milanova et al. [20] introduce object sensitivity, which is a form of context sensitivity. Instead of using the call stack to distinguish different contexts, they use the receiver object to distinguish between different contexts. 3.4 Andersen's Algorithm for C Andersen proposed a flow insensitive, context insensitive version of points to analysis for C. His analysis modeled the heap using a separate concrete location to represent all memory allocated at a given dynamic allocation site. The implementation expressed the analysis using subset constraints and then solved the constraints. Andersen's algorithm [64] models the points to relations as subset constraints. After a statement such as p = q, p additionally points to those objects which are initially pointed to by q. Thus we have the constraint pts(p) ⊇ pts(q). The list of constraints for C is given in Table 3.1.

    p = &x      x ∈ pts(p)
    p = q       pts(p) ⊇ pts(q)
    p = *q      ∀x ∈ pts(q): pts(p) ⊇ pts(x)
    *p = q      ∀x ∈ pts(p): pts(x) ⊇ pts(q)

    Table 3.1: Constraints for C

Constraints are represented using a constraint graph. Each node N in the constraint graph represents a variable and is annotated with pts(N), the set of objects the variable