Kenyon: A Software Stratigraphy Platform

Jennifer Bevan, Sunghun Kim, E. James Whitehead Jr.
University of California, Santa Cruz
{jbevan, hunkim, ejw}@cs.ucsc.edu

Lijie Zou, Mike Godfrey
University of Waterloo
{lzou, migod}@uwaterloo.edu

Motivation

• Static analysis-based software evolution
  research has several common technical
  issues to manage.
    - Extracting meaningful configurations from an
      SCM repository.
    - Calculating static relations and metrics.
        - These augment the data from commit log messages.
    - Saving the extracted facts.
        - For later time-based analysis, data mining,
          and incremental data addition.
Ongoing Static Evolution Research

• Instability Analysis (J. Bevan)
    - Refines Zimmermann/Ying/Murphy using static
      dependence to remove temporal dependencies.
• Entity Mapping/Origin Analysis (L. Zou, M.
  Godfrey)
    - Uses static metrics to identify moved/split/merged
      procedures and files.
• Code clone evolution (M. Kim)
    - Identifies clones and follows their evolution.
More Static Evolution Research

• Association rule mining
    - For predicting changes [Ying et al., IEEE TSE, v30 n9, Sept. 2004]
    - For architectural justification [Zimmermann, Diehl, and Zeller,
      Proc. IWPSE 2003]
• Identifying code “chunks” for future
  modularization [Mockus and Weiss, IEEE Software, v18 n2, 2001]
• “Feature” identification [Fischer, Pinzger, and Gall, Proc. WCRE
  2003]

• …and the ongoing research related to these.
Problem

• Despite the similarity of approach, systems make
  several choices that limit sharing of technology and
  results:
    - Usually choosing a single SCM system (CVS) for data.
    - Usually creating a proprietary database schema.
    - Usually not easily integrated with other research
      projects for result sharing.
• The cost of computationally expensive analysis
  techniques is not amortized across multiple
  research directions.
Solution: Kenyon

• Kenyon is designed to facilitate static software
  evolution research by providing common solutions
  to these common problems:
    - Phase 1: Automatic configuration extraction from SCM.
    - Phase 2: Invoking static analysis tool(s).
    - Phase 3: Storing the results from these preprocessing
      steps.
    - Access to previously processed and stored data is
      provided asynchronously.
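Kenyon Processing

[Figure: Kenyon processing pipeline. Phase 1 (Configuration
Extraction) takes a configuration from the SCM Repository to the
Filesystem. Phases 2 & 3 (Fact Extraction by static analysis, then
persisting the gathered facts) take it from the Filesystem to the
Kenyon Repository (RDBMS/Hibernate). Client software (e.g., IVA) uses
client tools to perform queries and add new facts.]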
Phase 1: Extract Configurations

• Kenyon provides transaction recovery and logical
  configuration extraction for multiple SCM systems.
    - Configurations are specified by time + branch identifier.
    - A sliding window algorithm is used for transaction recovery.
    - Only changes from completed transactions are extracted
      for a “logical configuration”.
    - Only changes from transactions that completed between
      two specifications are considered for a “configuration
      delta”.
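
The slides name a sliding window algorithm but give no details, so the following is only a minimal Python sketch of the general technique, not Kenyon's (Java) implementation. It assumes that per-file commits belong to one transaction when they share an author and log message and each follows the previous one within a fixed window. The `FileCommit` type, the `recover_transactions` function, and the 200-second default are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileCommit:
    """One per-file revision record, as a CVS-style log would report it."""
    path: str
    author: str
    message: str
    timestamp: int  # seconds since the epoch


def recover_transactions(commits, window=200):
    """Group per-file commits into transactions.

    Assumption: commits belong together when they share an author and log
    message and each follows the previous one within `window` seconds.
    """
    open_txns = {}      # (author, message) -> most recent transaction
    transactions = []   # ordered by the time of each transaction's first commit
    for commit in sorted(commits, key=lambda c: c.timestamp):
        key = (commit.author, commit.message)
        txn = open_txns.get(key)
        # The window slides: it is measured from the latest commit already
        # in the transaction, not from the transaction's first commit.
        if txn is not None and commit.timestamp - txn[-1].timestamp <= window:
            txn.append(commit)
        else:
            txn = [commit]
            open_txns[key] = txn
            transactions.append(txn)
    return transactions
```

Because the window is measured from the latest commit, a long check-in that touches many files stays in one transaction as long as no gap between consecutive files exceeds the window.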
Configuration Specification

• Kenyon’s logical configuration extraction and delta
  calculations allow researchers to consider software
  “as it existed at time T on branch B”.
    - Most SCM systems archive data along a timeline, with
      varying support for parallel development.
    - Kenyon uses this commonality as the basis for its SCM
      interface and configuration specification.
    - Nothing so far indicates that Kenyon could not support
      change-set-based SCM systems.
Logical Configuration

• At any given point in time,
  one or more transactions may
  have just completed, and one
  or more may be ongoing.
• Ongoing transactions are
  shown in red.
• Completed transactions are
  shown in green.

[Figure: timeline of transactions touching files F1 to F4, with
time T1 marked.]
Configuration Deltas

• Configuration deltas are
  calculated as C(T2) - C(T1).
• Only changes from
  transactions completing
  between T1 (exclusive) and
  T2 (inclusive) are
  considered.

[Figure: timeline of transactions touching files F1 to F4, with
times T1 and T2 marked.]
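
The two rules above (completed transactions only, and the half-open interval from T1 exclusive to T2 inclusive) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Kenyon code: the `Transaction` type and both function names are hypothetical, and times are plain integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """A recovered SCM transaction with its start and completion times."""
    name: str
    start: int
    end: int


def logical_configuration(transactions, t):
    """Transactions whose changes are visible at time t.

    Only transactions that have completed by t contribute; a transaction
    that started before t but is still ongoing is ignored.
    """
    return [txn for txn in transactions if txn.end <= t]


def configuration_delta(transactions, t1, t2):
    """Transactions that make up C(t2) - C(t1).

    The interval is open at t1 and closed at t2, so a transaction that
    completes exactly at t1 belongs to C(t1) and not to the delta.
    """
    return [txn for txn in transactions if t1 < txn.end <= t2]
```

With this convention, consecutive deltas never count a transaction twice: one that completes exactly at a boundary falls into the earlier interval only.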
Data from Phase 1

• Valid configuration specifications for extraction are
  created by Kenyon, one per timestamp where a
  transaction completed.
• For each configuration extracted:
    - Author and log message of each transaction completing
      at that specification.
    - The configuration is placed on the filesystem.
• A configuration delta for each consecutive pair of
  configurations processed can also be stored.
Phase 2: Invoke Fact Extractors

• Kenyon provides an abstract class that is used to
  invoke third-party fact extractors on the
  configuration extracted to the filesystem.
    - Kenyon users subclass this class to invoke their
      own fact extractor.
    - Support for CodeSurfer (line-level analysis) and
      SWAGKIT (procedure-level analysis) is provided with
      Kenyon. [www.grammatech.com, swag.uwaterloo.ca]
    - FactExtractor subclasses have a tri-modal return status:
      “failure”, “new data to store”, or “no new data to store”.
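
The subclassing pattern and the tri-modal status can be sketched as follows. Kenyon's real FactExtractor is a Java class whose method names are not given in these slides, so the `extract` method, the `ExtractionStatus` enum, and the toy `LineCountExtractor` below are hypothetical Python stand-ins that only show the shape of the contract.

```python
from abc import ABC, abstractmethod
from enum import Enum
from pathlib import Path


class ExtractionStatus(Enum):
    """The three outcomes named on the slide."""
    FAILURE = "failure"
    NEW_DATA = "new data to store"
    NO_NEW_DATA = "no new data to store"


class FactExtractor(ABC):
    """Base class: subclasses wrap one third-party analysis tool."""

    @abstractmethod
    def extract(self, config_dir: Path) -> ExtractionStatus:
        """Analyze the configuration checked out under config_dir."""


class LineCountExtractor(FactExtractor):
    """Toy extractor: records the line count of each top-level file."""

    def __init__(self):
        self.facts = {}

    def extract(self, config_dir: Path) -> ExtractionStatus:
        if not config_dir.is_dir():
            return ExtractionStatus.FAILURE
        counts = {
            p.name: len(p.read_text().splitlines())
            for p in sorted(config_dir.iterdir())
            if p.is_file()
        }
        if counts == self.facts:
            # Nothing changed since the last run, so nothing to persist.
            return ExtractionStatus.NO_NEW_DATA
        self.facts = counts
        return ExtractionStatus.NEW_DATA
```

Separating “failure” from “no new data to store” lets the caller skip the storage phase without treating an unchanged configuration as an error.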
Data from Phase 2

• FactExtractor subclasses provide:
    - A ConfigGraph that maps software elements to nodes
      and static relationships to edges.
    - The graph, any node, and any edge may be attributed
      with static metrics.
• Multiple fact extractors may be invoked on a single
  configuration: each created ConfigGraph is saved
  with a reference to the fact extractor that created it.
• If a configuration has already been processed by a
  given fact extractor, it will not be processed again
  unless new metrics are to be calculated.
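
A minimal Python sketch of an attributed graph and of the “skip unless new metrics are requested” check is given below. The field names, the `computed` set, and the `needs_processing` helper are assumptions made for illustration; they are not taken from Kenyon's ConfigGraph class.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigGraph:
    """Software elements as nodes, static relationships as edges."""
    extractor: str                                # which fact extractor built it
    nodes: dict = field(default_factory=dict)     # node id -> metrics
    edges: dict = field(default_factory=dict)     # (src, relation, dst) -> metrics
    metrics: dict = field(default_factory=dict)   # graph-level metrics
    computed: set = field(default_factory=set)    # names of node/edge metrics present

    def add_node(self, node_id, **metrics):
        self.nodes.setdefault(node_id, {}).update(metrics)
        self.computed.update(metrics)

    def add_edge(self, source, relation, target, **metrics):
        self.edges.setdefault((source, relation, target), {}).update(metrics)
        self.computed.update(metrics)


def needs_processing(stored, extractor, requested_metrics):
    """Decide whether an extractor must run on a configuration.

    `stored` maps extractor names to the ConfigGraph each one already
    produced for this configuration.
    """
    graph = stored.get(extractor)
    if graph is None:
        return True  # this extractor never processed the configuration
    # Re-run only if some requested metric has not been calculated yet.
    return not set(requested_metrics) <= graph.computed
```

Keying stored graphs by extractor mirrors the slide's point that each ConfigGraph is saved with a reference to the extractor that created it.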
Phase 3: Data Storage

• Kenyon uses Hibernate to persist data
  classes.
    - Hibernate is an “object/relational persistence and
      query service for Java” [www.hibernate.org].
    - Allows reuse of Kenyon classes by research
      tools implemented in Java.
    - Each configuration processed by Kenyon is
      assigned to a Project, the top-level data class
      persisted by Kenyon.
Persisted Kenyon Data

• Projects contain one set of
  data for each configuration
  specification processed.
• Each such data set
  contains one or more
  ConfigGraphs, each
  produced by a different
  FactExtractor.
• FactExtractors specify
  what GraphSchema
  subclass they use (not
  restrictive).

[Figure: class diagram with multiplicities as labeled on the slide.
Project 1 to N ConfigData; ConfigData 1 to N ConfigGraph;
ConfigGraph 1 to 1 FactExtractor; FactExtractor 1 to 1 GraphSchema;
ConfigData 1 to N ConfigSpec; ConfigSpec 2 to 1 ConfigDelta.]
Data Access

• Hibernate allows access to preprocessed data using
  SQL or the Hibernate query methods (HQL, QBE/QBC),
  which support class/field-based queries.
    - A Hibernate query returns a List of Objects, each of
      which is of the type originally persisted.
    - Data fields in the returned class are populated unless
      specified as lazily loaded.
• Kenyon provides several convenience queries for
  commonly anticipated questions, such as “what
  configurations are available for this project”.
Kenyon Usage

• Kenyon processes data based on specifications in a
  configuration file:
    - Start time, stop time, how often to process.
    - Fact extractors and their assigned metric calculators.
    - SCM parameters, filesystem parameters, some control
      over what Hibernate persists.
• A “processing run” will reuse any previously
  processed data if available.
    - For example, if a ConfigGraph already exists and new
      metrics are needed, they are calculated and added to
      the existing ConfigGraph.
Iterative Refinement Example

• When looking for “interesting” timeframes of
  evolution, a multiple-pass process is recommended.
    - A user can configure Kenyon to process the changes in a
      system once per day.
    - Days with high activity or other metrics exceeding a
      threshold can be flagged as “interesting”.
    - The user can then configure Kenyon to process those
      days (via multiple processing runs) at the frequency of
      “every 20 minutes”.
    - This process can repeat down to the “every second”
      level.
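
The coarse-to-fine passes above can be sketched as a small Python planning helper. It does not drive Kenyon; it only computes which timestamps a second, finer pass would cover. The function names, the per-day activity counts, and the threshold are hypothetical.

```python
from datetime import datetime, timedelta


def sample_times(start, stop, step):
    """Timestamps from start (inclusive) to stop (exclusive), step apart."""
    times = []
    t = start
    while t < stop:
        times.append(t)
        t += step
    return times


def flag_interesting(activity_by_day, threshold):
    """Days from the coarse pass whose activity exceeds the threshold."""
    return [day for day, count in sorted(activity_by_day.items()) if count > threshold]


def refine(days, step=timedelta(minutes=20)):
    """For each flagged day, the finer-grained timestamps to process next."""
    return {day: sample_times(day, day + timedelta(days=1), step) for day in days}
```

Each flagged day would then become its own processing run at the finer interval, and the same two steps can be repeated on the results with a smaller `step`.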
Parallel Preprocessing

• Kenyon is a single-threaded process, but Hibernate
  supports multiple connections to a single Kenyon
  database.
• A 10-year history can be processed in chunks by
  any number of computers, even if the processing
  configurations have overlapping times or different
  intervals.
• Kenyon does not integrate the deltas between
  different processing runs, so a small overlap in
  processing chunks is suggested.
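
One way to plan such chunks is sketched below in Python. The `split_history` helper is hypothetical and the one-day overlap in the usage note is an arbitrary choice; the slides only say that a small overlap is suggested.

```python
from datetime import datetime, timedelta


def split_history(start, stop, chunks, overlap):
    """Split [start, stop] into `chunks` ranges for separate machines.

    Every range after the first begins `overlap` before the previous
    range ends, so the delta across each boundary is computed by at
    least one processing run.
    """
    span = (stop - start) / chunks
    ranges = []
    for i in range(chunks):
        lo = start + span * i
        # Pin the last range to `stop` to avoid rounding drift.
        hi = stop if i == chunks - 1 else start + span * (i + 1)
        if i > 0:
            lo = max(start, lo - overlap)
        ranges.append((lo, hi))
    return ranges
```

For example, `split_history(datetime(1995, 1, 1), datetime(2005, 1, 1), 4, timedelta(days=1))` yields four start/stop pairs, one per machine, each of which would go into that machine's processing configuration.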
Kenyon Architecture

[Figure: architecture diagram. Components: Project, ConfigData,
ConfigGraph, DataManager, MetricLoader, Fact Extractor, SCMInterface,
and Hibernate/DBMS, plus the external SCM Repository and Filesystem.
Components are connected by <<calls>> relationships.]
Current Status

• Kenyon 1.2 available at
  http://kenyon.dforge.cse.ucsc.edu
• Supports CVS, Subversion, and ClearCase.
• Students in 290G are performing projects
  using Kenyon this quarter.
• Actively working with Samsung to analyze
  some of their source code.
Future Work (1/3)

• Continue working with M. Kim.
    - Evaluate usefulness of SCM-only module.
    - If she decides to use Kenyon, assist with full integration.
• Finish integration of Beagle/Kenyon and
  IVA/Kenyon.
• Work with G. Murphy on using Kenyon at UBC.
• Evaluate Kenyon’s ability to reduce the
  time-to-results for static software evolution research
  by analyzing the seminar class projects.
Future Work (2/3)

• Support branch path traversal.
    - Allow users to see the branch points in a system and
      specify a path for processing instead of a single branch.
    - Will reuse existing visualizations, but must add a
      specification mechanism.
• Incorporate full language-specific containment
  models for better inter-language graph traversal and
  mapping.
    - Use M. Godfrey’s Java fact extractor and containment
      model.
Future Work (3/3)

• Support more of the Standard Exchange
  Formats for ConfigGraph export.
    - TA is already supported, but only the Fact
      sections. Schema sections should be improved
      to use the language-specific containment models.
• Encourage other researchers to use Kenyon,
  and improve results-sharing, capabilities, etc.
  based on their feedback.
Open Issues (1/3)

• The exact mechanism for allowing data
  sharing between researchers is not entirely
  controllable by Kenyon.
    - Database setup and administration can
      effectively override many of Kenyon’s
      preferences.
    - By default, Kenyon-created tables are not
      mutable by processes other than Kenyon.
Open Issues (2/3)

• Kenyon provides a public class, EvolutionPath, that
  links a subgraph in one ConfigGraph to one in
  another ConfigGraph.
    - Directed and attributable.
    - Basic building block for evolution data.
• EvolutionPath is currently persisted by Kenyon, but
  likely will not be after 1.1, due to database
  mutability issues.
    - Other research projects can subclass it and, if they
      want to share their results easily, persist them to a
      Hibernate database using the provided Hibernate mapping
      examples.
Open Issues (3/3)

• Kenyon can be invoked automatically
  via a post-commit script or a cron job.
• Should Kenyon also be invocable
  automatically from an IDE?
• What sort of support should Kenyon provide
  for better integration with, for example,
  Eclipse?
Conclusions (1/2)

• Kenyon is an engineering solution, designed to
  amortize the cost of the computationally expensive
  preprocessing steps that can benefit static software
  evolution research.
• Research projects using Kenyon will not have to
  independently create solutions for these common
  problems.
    - 18% code reduction in Beagle without really trying.
    - Kenyon is expected to reduce the lag between beginning
      system implementation and producing research results.
Conclusions (2/2)

• Kenyon is not intended to be a lightweight data
  mining system for software evolution research.
    - Tradeoff of speed vs. precision is still controllable via
      the choice of fact extractors.
    - The configuration extraction time and associated
      network lag already put the per-configuration time at
      O(seconds).
• Instead, it allows the cost of time-consuming,
  computationally expensive preprocessing to be
  amortized among researchers.
Questions?

• Kenyon was created primarily from code that existed in
  IVA, which is being funded by NSF grant CCR-01234603.
  Kenyon also contains code from Beagle, the origin analysis
  project overseen by Mike Godfrey.

• Email jbevan@cs.ucsc.edu with future questions.

  http://www.cse.ucsc.edu/research/labs/grase/kenyon/

Kenyon: A Software Stratigraphy Platform (ESEC/FSE 2005)
