Kenyon: A Software Stratigraphy Platform (ESEC/FSE 2005)

Jennifer Bevan, Sunghun Kim, E. James Whitehead Jr.
University of California, Santa Cruz
{jbevan, hunkim, ejw}@cs.ucsc.edu

Lijie Zou, Mike Godfrey
University of Waterloo
{lzou, migod}@uwaterloo.edu

Motivation

• Static analysis-based software evolution research has several common technical issues to manage:
  - Extracting meaningful configurations from an SCM repository.
  - Calculating static relations and metrics, which augment the data from commit log messages.
  - Saving the extracted facts for later time-based analysis, data mining, and incremental data addition.

Ongoing Static Evolution Research

• Instability Analysis (J. Bevan)
  - Refines Zimmermann/Ying/Murphy using static dependence to remove temporal dependencies.
• Entity Mapping/Origin Analysis (L. Zou, M. Godfrey)
  - Uses static metrics to identify moved, split, or merged procedures and files.
• Code clone evolution (M. Kim)
  - Identifies clones and follows their evolution.

More Static Evolution Research

• Association rule mining
  - For predicting changes [Ying et al., IEEE TSE, v30 n9, Sept. 2004]
  - For architectural justification [Zimmermann, Diehl, and Zeller, Proc. IWPSE 2003]
• Identifying code “chunks” for future modularization [Mockus and Weiss, IEEE Software, v18 n2, 2001]
• “Feature” identification [Fischer, Pinzger, and Gall, Proc. WCRE 2003]
• …and the ongoing research related to these.

Problem

• Despite the similarity of their approaches, systems make several choices that limit the sharing of technology and results:
  - Usually choosing a single SCM system (CVS) for data.
  - Usually creating a proprietary database schema.
  - Usually not being easy to integrate with other research projects for result sharing.
• The cost of computationally expensive analysis techniques is not amortized across multiple research directions.

Solution: Kenyon

• Kenyon is designed to facilitate static software evolution research by providing common solutions to these common problems:
  - Phase 1: Automatic configuration extraction from SCM.
  - Phase 2: Invoking static analysis tool(s).
  - Phase 3: Storing the results from these preprocessing steps.
• Kenyon also provides asynchronous access to previously processed and stored data.

Kenyon Processing

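[Figure: Kenyon processing pipeline. Phase 1 (Configuration Extraction) reads from the SCM Repository and places configurations on the Filesystem. Phases 2 & 3 (Fact Extraction by Static Analysis, then Persist Gathered Facts) store results in the Kenyon Repository (RDBMS/Hibernate). Client Software (e.g., IVA) uses Client Tools to perform queries and add new facts.]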

Phase 1: Extract Configurations

• Kenyon provides transaction recovery and logical configuration extraction for multiple SCM systems.
  - Configurations are specified by time + branch identifier.
  - A sliding window algorithm is used for transaction recovery.
  - Only changes from completed transactions are extracted for a “logical configuration”.
  - Only changes from transactions that completed between two specifications are considered for a “configuration delta”.
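
The sliding window idea can be illustrated with a minimal Python sketch. This is not Kenyon's implementation: the FileRevision type, the function names, and the 200-second window are assumptions made for illustration. It groups per-file revisions (as recorded by a file-based SCM such as CVS) into transactions when the author and log message match and each revision falls within the window of the previous one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRevision:
    """One per-file commit record (hypothetical type, not a Kenyon class)."""
    path: str
    author: str
    log: str
    timestamp: int  # seconds since some epoch


def recover_transactions(revisions, window=200):
    """Group revisions into transactions with a sliding time window.

    A revision joins the latest transaction that has the same author and
    log message if it is at most `window` seconds after that transaction's
    most recent revision; the window therefore slides forward as the
    transaction grows. The 200-second default is an assumed value.
    """
    transactions = []
    latest_by_key = {}  # (author, log) -> most recent transaction for that key
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        key = (rev.author, rev.log)
        txn = latest_by_key.get(key)
        if txn is not None and rev.timestamp - txn[-1].timestamp <= window:
            txn.append(rev)
        else:
            txn = [rev]
            transactions.append(txn)
            latest_by_key[key] = txn
    return transactions


def completion_time(txn):
    """A transaction is treated as complete at its last revision's timestamp."""
    return txn[-1].timestamp
```

A production version would likely add further checks, for example refusing to place two revisions of the same file in one transaction.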

Configuration Specification

• Kenyon’s logical configuration extraction and delta calculations allow researchers to consider software “as it existed at time T on branch B”.
  - Most SCM systems archive data along a timeline with varying support for parallel development.
  - Kenyon uses this commonality as the basis for its SCM interface and configuration specification.
  - So far, nothing indicates that change-set-based SCM systems cannot be supported by Kenyon.

Logical Configuration

• At any given point in time, one or more transactions may have just completed, and one or more may be ongoing.
• Ongoing transactions are shown in red.
• Completed transactions are shown in green.

[Figure: timeline of transactions on files F1 to F4 relative to time T1.]

Configuration Deltas

• Configuration deltas are calculated as C(T2) - C(T1).
• Only changes from transactions completing between T1 (exclusive) and T2 (inclusive) are considered.

[Figure: timeline of transactions on files F1 to F4 relative to times T1 and T2.]
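
The two definitions above can be sketched in Python as follows. The Transaction type and both function names are hypothetical; the sketch only illustrates the rule that a logical configuration at T contains changes from transactions completed at or before T, and that a delta covers the half-open interval (T1, T2].

```python
from typing import NamedTuple


class Transaction(NamedTuple):
    """A recovered transaction (hypothetical type for illustration)."""
    completed_at: int  # completion timestamp
    changes: dict      # file path -> revision id written by the transaction


def logical_configuration(transactions, t):
    """Files as they existed at time t, using only completed transactions.

    A transaction still ongoing at t has completed_at > t and is ignored.
    """
    config = {}
    for txn in sorted(transactions, key=lambda x: x.completed_at):
        if txn.completed_at <= t:
            config.update(txn.changes)
    return config


def configuration_delta(transactions, t1, t2):
    """Changes from transactions completing in (t1, t2]: t1 exclusive, t2 inclusive."""
    delta = {}
    for txn in sorted(transactions, key=lambda x: x.completed_at):
        if t1 < txn.completed_at <= t2:
            delta.update(txn.changes)
    return delta
```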

Data from Phase 1

• Valid configuration specifications for extraction are created by Kenyon, one per timestamp at which a transaction completed.
• For each configuration extracted:
  - The author and log message of each transaction completing at that specification are recorded.
  - The configuration is placed on the filesystem.
• A configuration delta for each consecutive pair of configurations processed can also be stored.

Phase 2: Invoke Fact Extractors

• Kenyon provides an abstract class that is used to invoke third-party fact extractors on the configuration extracted to the filesystem.
  - Kenyon users subclass this class to invoke their own fact extractor.
  - Support for CodeSurfer (line-level analysis) and SWAGKIT (procedure-level analysis) is provided with Kenyon. [www.grammatech.com, swag.uwaterloo.ca]
  - FactExtractor subclasses have a tri-modal return status: “failure”, “new data to store”, or “no new data to store”.
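
The subclassing pattern and the tri-modal status can be sketched as follows. Kenyon itself is written in Java; this Python sketch only mirrors the idea. The method name extract, the ExtractionStatus enum, and the toy FileCountExtractor are assumptions, not Kenyon's API.

```python
import os
from abc import ABC, abstractmethod
from enum import Enum


class ExtractionStatus(Enum):
    """The three outcomes named on the slide."""
    FAILURE = "failure"
    NEW_DATA = "new data to store"
    NO_NEW_DATA = "no new data to store"


class FactExtractor(ABC):
    """Abstract base; subclasses wrap a concrete fact extraction tool."""

    @abstractmethod
    def extract(self, config_dir: str) -> ExtractionStatus:
        """Run the tool on a configuration checked out at config_dir."""


class FileCountExtractor(FactExtractor):
    """Toy extractor (hypothetical): records the number of files found."""

    def __init__(self):
        self.facts = {}  # config_dir -> file count from the last run

    def extract(self, config_dir):
        if not os.path.isdir(config_dir):
            return ExtractionStatus.FAILURE
        count = sum(len(files) for _, _, files in os.walk(config_dir))
        if self.facts.get(config_dir) == count:
            return ExtractionStatus.NO_NEW_DATA
        self.facts[config_dir] = count
        return ExtractionStatus.NEW_DATA
```

Separating "no new data" from "failure" lets the caller skip the storage phase without treating the run as an error.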

Data from Phase 2

• FactExtractor subclasses provide:
  - A ConfigGraph that maps software elements to nodes and static relationships to edges.
  - The graph, any node, and any edge may be attributed with static metrics.
• Multiple fact extractors may be invoked on a single configuration: each ConfigGraph created is saved with a reference to the fact extractor that created it.
• If a configuration has already been processed by a given fact extractor, it will not be processed again unless new metrics are to be calculated.
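
A minimal sketch of an attributed graph with this reprocessing rule might look like the following. The class layout, the needs_processing helper, and the extractor label "swagkit" are illustrative assumptions rather than Kenyon's actual ConfigGraph.

```python
class ConfigGraph:
    """Attributed graph for one configuration (illustrative sketch only)."""

    def __init__(self, extractor_name):
        self.extractor_name = extractor_name  # which fact extractor built it
        self.metrics = {}  # graph-level metrics
        self.nodes = {}    # node id -> metrics dict
        self.edges = {}    # (source, relation, target) -> metrics dict

    def add_node(self, node_id, **metrics):
        self.nodes.setdefault(node_id, {}).update(metrics)

    def add_edge(self, source, relation, target, **metrics):
        # Endpoints are created on demand so every edge has both nodes.
        self.add_node(source)
        self.add_node(target)
        self.edges.setdefault((source, relation, target), {}).update(metrics)


def needs_processing(stored_graphs, extractor_name, wanted_metrics):
    """True if this extractor has no stored graph or lacks a wanted metric.

    `stored_graphs` maps extractor name -> ConfigGraph for one configuration.
    Only graph-level metrics are checked in this sketch.
    """
    graph = stored_graphs.get(extractor_name)
    if graph is None:
        return True
    return not set(wanted_metrics) <= set(graph.metrics)
```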

Phase 3: Data Storage

• Kenyon uses Hibernate to persist data classes.
  - Hibernate is an “object/relational persistence and query service for Java” [www.hibernate.org].
  - Allows reuse of Kenyon classes by research tools implemented in Java.
• Each configuration processed by Kenyon is assigned to a Project, the top-level data class persisted by Kenyon.

Persisted Kenyon Data

• Projects contain one set of data for each configuration specification processed.
• Each such data set contains one or more ConfigGraphs, each produced by a different FactExtractor.
• FactExtractors specify what GraphSchema subclass they use (not restrictive).

[Figure: class diagram relating Project, ConfigData, ConfigGraph, ConfigSpec, FactExtractor, GraphSchema, and ConfigDelta, with multiplicities.]

Data Access

• Hibernate allows access to preprocessed data using SQL or the Hibernate query methods (HQL, QBE/QBC), which support class/field-based queries.
  - A Hibernate query returns a List of Objects, each of which is of the type originally persisted.
  - Data fields in the returned class are populated unless specified as lazily loaded.
• Kenyon provides several convenience methods for commonly anticipated queries, such as “what configurations are available for this project”.

Kenyon Usage

• Kenyon processes data based on specifications in a configuration file:
  - Start time, stop time, and how often to process.
  - Fact extractors and their assigned metric calculators.
  - SCM parameters, filesystem parameters, and some control over what Hibernate persists.
• A “processing run” will reuse any previously processed data if available.
  - For example, if a ConfigGraph already exists and new metrics are needed, they are calculated and added to the existing ConfigGraph.

Iterative Refinement Example

• When looking for “interesting” timeframes of evolution, a multiple-pass process is recommended.
  - A user can configure Kenyon to process the changes in a system once per day.
  - Days with high activity or other metrics exceeding a threshold can be flagged as “interesting”.
  - The user can then configure Kenyon to process those days (via multiple processing runs) at a frequency of “every 20 minutes”.
  - This process can repeat down to the “every second” level.
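
The multiple-pass process can be sketched in Python as below. The function names, the change-count metric, and the threshold are assumptions for illustration; in practice each refined schedule would be fed to a separate Kenyon processing run through its configuration file.

```python
from datetime import date, datetime, timedelta


def sample_times(start, stop, step):
    """Timestamps from start to stop (inclusive), spaced `step` apart."""
    times = []
    t = start
    while t <= stop:
        times.append(t)
        t += step
    return times


def interesting_days(changes_per_day, threshold):
    """Days whose activity metric exceeds the threshold (first, coarse pass)."""
    return sorted(day for day, n in changes_per_day.items() if n > threshold)


def refine(day, step=timedelta(minutes=20)):
    """Finer sampling schedule covering one flagged day (second pass)."""
    start = datetime.combine(day, datetime.min.time())
    return sample_times(start, start + timedelta(days=1), step)
```

Calling refine again with a smaller step on a flagged sub-interval gives the next level of refinement.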

Parallel Preprocessing

• Kenyon is a single-threaded process, but Hibernate supports multiple connections to a single Kenyon database.
• A 10-year history can be processed in chunks by any number of computers, even if the processing configurations have overlapping times or different intervals.
• Kenyon does not integrate the deltas between different processing runs, so a small overlap in processing chunks is suggested.
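
One way to plan such overlapping chunks is sketched below. The function name, the even split, and the one-day overlap are assumptions; each resulting (start, stop) pair would become the time range of one Kenyon processing run on a separate machine.

```python
from datetime import datetime, timedelta


def overlapping_chunks(start, stop, workers, overlap):
    """Split [start, stop] into `workers` ranges that overlap their predecessor.

    Every chunk after the first begins `overlap` earlier than an even split
    would, so the boundary between adjacent runs is covered by both.
    """
    span = (stop - start) / workers
    chunks = []
    for i in range(workers):
        chunk_start = start + i * span
        chunk_stop = start + (i + 1) * span
        if i > 0:
            chunk_start -= overlap
        chunks.append((chunk_start, chunk_stop))
    return chunks
```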

Kenyon Architecture

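[Figure: Kenyon architecture. Components shown: DataManager, MetricLoader, Fact Extractor, SCMInterface, and the data classes Project, ConfigData, and ConfigGraph, linked by <<calls>> relationships, together with Hibernate/DBMS, the SCM Repository, and the Filesystem.]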

Current Status

• Kenyon 1.2 is available at http://kenyon.dforge.cse.ucsc.edu
• Supports CVS, Subversion, and ClearCase.
• Students in 290G are performing projects using Kenyon this quarter.
• Actively working with Samsung to analyze some of their source code.

Future Work (1/3)

• Continue working with M. Kim
  - Evaluate usefulness of SCM-only module.
  - If she decides to use Kenyon, assist with full integration.
• Finish integration of Beagle/Kenyon and IVA/Kenyon.
• Work with G. Murphy on using Kenyon at UBC.
• Evaluate Kenyon’s ability to reduce the time-to-results for static software evolution research by analyzing the seminar class projects.

Future Work (2/3)

• Support branch path traversal
  - Allow users to see the branch points in a system and specify a path for processing instead of a single branch.
  - Will reuse existing visualizations; must add a specification mechanism.
• Incorporate full language-specific containment models for better inter-language graph traversal and mapping.
  - Use M. Godfrey’s Java fact extractor and containment model.

Future Work (3/3)

• Support more of the Standard Exchange Formats for ConfigGraph export.
  - TA is already supported, but only the Fact sections. Schema sections should be improved to use the language-specific containment models.
• Encourage other researchers to use Kenyon, and improve results-sharing, capabilities, etc. based on their feedback.

Open Issues (1/3)

• The exact mechanism for allowing data sharing between researchers is not entirely controllable by Kenyon.
  - Database setup and administration can effectively override many of Kenyon’s preferences.
  - By default, Kenyon-created tables are not mutable by processes other than Kenyon.

Open Issues (2/3)

• Kenyon provides a public class, EvolutionPath, that links a subgraph in one ConfigGraph to one in another ConfigGraph.
  - Directed and attributable.
  - Basic building block for evolution data.
  - Is currently persisted by Kenyon, but will likely not be after 1.1, due to database mutability issues.
  - Other research projects can subclass it and, if they want to share their results easily, persist their subclasses to a Hibernate database using the provided Hibernate mapping examples.
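
The idea of a directed, attributable link between subgraphs, extended by a research-specific subclass, can be sketched as follows. The field names and the OriginLink subclass are hypothetical and do not reflect the actual Java class; persistence is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class EvolutionPath:
    """Directed link from a subgraph of one ConfigGraph to a subgraph of another.

    Graphs are referred to by an opaque identifier and subgraphs by their
    node ids (both are assumptions made for this sketch).
    """
    source_graph: str
    source_nodes: frozenset
    target_graph: str
    target_nodes: frozenset
    attributes: dict = field(default_factory=dict)  # arbitrary attributes


@dataclass
class OriginLink(EvolutionPath):
    """Example subclass a research tool might add (hypothetical)."""
    kind: str = "moved"  # e.g. "moved", "split", or "merged"
```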

Open Issues (3/3)

• Kenyon can be invoked automatically via a post-commit script or a cron job.
• Should Kenyon also be invocable automatically from an IDE?
• What sort of support should Kenyon provide for better integration with, for example, Eclipse?

Conclusions (1/2)

• Kenyon is an engineering solution, designed to amortize the cost of the computationally expensive preprocessing steps that can benefit static software evolution research.
• Research projects using Kenyon will not have to independently create solutions for these common problems.
  - 18% code reduction in Beagle without really trying.
  - Is expected to reduce the lag between beginning system implementation and producing research results.

Conclusions (2/2)

• Kenyon is not intended to be a lightweight data mining system for software evolution research.
  - The tradeoff of speed vs. precision is still controllable via the choice of fact extractors.
  - The configuration extraction time and associated network lag already put the per-configuration time at O(seconds).
• Instead, it allows the cost of time-consuming, computationally expensive preprocessing to be amortized among researchers.

Questions?

• Kenyon was created primarily from code that existed in IVA, which is being funded by NSF grant CCR-01234603. Kenyon also contains code from Beagle, the origin analysis project overseen by Mike Godfrey.

• Email jbevan@cs.ucsc.edu with future questions.

http://www.cse.ucsc.edu/research/labs/grase/kenyon/
