
Kenyon: A Software Stratigraphy Platform (ESEC/FSE 2005)


Sung's First ESEC/FSE paper thanks to Jen, Kim, and Mike!


  1. Kenyon: A Software Stratigraphy Platform
     Jennifer Bevan, Sunghun Kim, E. James Whitehead Jr. (University of California, Santa Cruz) {jbevan, hunkim, ejw}
     Lijie Zou, Mike Godfrey (University of Waterloo) {lzou, migod}
  2. Motivation
     Static analysis-based software evolution research has several common technical issues to manage:
     • Extracting meaningful configurations from an SCM repository.
     • Calculating static relations and metrics.
     • Augmenting these facts with data from commit log messages.
     • Saving the extracted facts for later time-based analysis, data mining, and incremental data addition.
  3. Ongoing Static Evolution Research
     • Instability Analysis (J. Bevan): refines Zimmermann/Ying/Murphy, using static dependence to remove temporal dependencies.
     • Entity Mapping/Origin Analysis (L. Zou, M. Godfrey): uses static metrics to identify moved, split, or merged procedures and files.
     • Code clone evolution (M. Kim): identifies clones and follows their evolution.
  4. More Static Evolution Research
     • Association rule mining
        • For predicting changes [Ying et al., IEEE TSE, v30 n9, Sept. 2004]
        • For architectural justification [Zimmermann, Diehl, and Zeller, Proc. IWPSE 2003]
     • Identifying code “chunks” for future modularization [Mockus and Weiss, IEEE Software, v18 n2, 2001]
     • “Feature” identification [Fischer, Pinzger, and Gall, Proc. WCRE 2003]
     • …and the ongoing research related to these.
  5. Problem
     Despite similarity of approach, systems make several choices that limit sharing of technology and results:
     • Usually choosing a single SCM system (CVS) for data.
     • Usually creating a proprietary database schema.
     • Usually not being easy to integrate with other research projects for result sharing.
     The cost of computationally expensive analysis techniques is not amortized across multiple research directions.
  6. Solution: Kenyon
     Kenyon is designed to facilitate static software evolution research by providing common solutions to these common problems:
     • Phase 1: Automatic configuration extraction from SCM.
     • Phase 2: Invoking static analysis tool(s).
     • Phase 3: Storing the results from these preprocessing steps.
     • Asynchronously provides access to previously processed and stored data.
  7. Kenyon Processing
     [Diagram: Phase 1 (Configuration Extraction) reads from the SCM Repository and places configurations on the Filesystem. Phases 2 & 3 (Fact Extraction by Static Analysis, and Persist Gathered Facts) store results in the Kenyon Repository (RDBMS/Hibernate). Client Software (e.g., IVA): client tools perform queries and add new facts.]
  8. Phase 1: Extract Configurations
     Kenyon provides transaction recovery and logical configuration extraction for multiple SCM systems.
     • Configurations are specified by time + branch identifier.
     • A sliding window algorithm is used for transaction recovery.
     • Only changes from completed transactions are extracted for a “logical configuration”.
     • Only changes from transactions that completed between two specifications are considered for a “configuration delta”.
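The sliding window idea can be sketched as follows. This is the commonly described heuristic for SCM systems that record per-file commits (group commits that share an author and log message while each falls within a time window of the previous one), not necessarily Kenyon's own implementation. The names `FileCommit` and `TransactionRecovery` and the 200-second window are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical record of one per-file commit, as a CVS-style log would report it.
final class FileCommit {
    final String file;
    final String author;
    final String message;
    final long timestamp; // seconds since epoch

    FileCommit(String file, String author, String message, long timestamp) {
        this.file = file;
        this.author = author;
        this.message = message;
        this.timestamp = timestamp;
    }
}

final class TransactionRecovery {
    /**
     * Groups per-file commits into transactions. A commit joins the current
     * transaction when author and message match and it lies within
     * windowSeconds of the previous commit in that transaction, so the
     * window slides forward with every commit that is added.
     */
    static List<List<FileCommit>> recover(List<FileCommit> commits, long windowSeconds) {
        List<FileCommit> sorted = new ArrayList<>(commits);
        sorted.sort(Comparator.comparing((FileCommit c) -> c.author)
                .thenComparing(c -> c.message)
                .thenComparingLong(c -> c.timestamp));

        List<List<FileCommit>> transactions = new ArrayList<>();
        List<FileCommit> current = null;
        FileCommit previous = null;
        for (FileCommit c : sorted) {
            boolean sameTransaction = previous != null
                    && previous.author.equals(c.author)
                    && previous.message.equals(c.message)
                    && c.timestamp - previous.timestamp <= windowSeconds;
            if (!sameTransaction) {
                current = new ArrayList<>();
                transactions.add(current);
            }
            current.add(c);
            previous = c;
        }
        // Order transactions by completion time (timestamp of their last commit).
        transactions.sort(Comparator.comparingLong(
                (List<FileCommit> t) -> t.get(t.size() - 1).timestamp));
        return transactions;
    }

    public static void main(String[] args) {
        List<FileCommit> commits = Arrays.asList(
                new FileCommit("A.java", "alice", "fix", 100),
                new FileCommit("B.java", "alice", "fix", 250),
                new FileCommit("C.java", "bob", "feat", 120));
        for (List<FileCommit> tx : recover(commits, 200)) {
            System.out.println(tx.get(0).author + ": " + tx.size() + " file(s)");
        }
    }
}
```

Because the window is measured from the previous commit rather than from the first one, a long-running transaction stays in one piece as long as no gap between consecutive commits exceeds the window.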
  9. Configuration Specification
     Kenyon’s logical configuration extraction and delta calculations allow researchers to consider software “as it existed at time T on branch B”.
     • Most SCM systems archive data along a timeline, with varying support for parallel development.
     • Kenyon uses this commonality as the basis for its SCM interface and configuration specification.
     • Nothing so far indicates that change-set based SCM systems could not be supported by Kenyon.
  10. Logical Configuration
     • At any given point in time, one or more transactions may have just completed, and one or more may be ongoing.
     • Ongoing transactions are shown in red.
     • Completed transactions are shown in green.
     [Diagram: transactions on files F1 to F4 along a timeline, with time T1 marked.]
  11. Configuration Deltas
     • Configuration deltas are calculated as C(T2) – C(T1).
     • Only changes from transactions completing between T1 (exclusive) and T2 (inclusive) are considered.
     [Diagram: transactions on files F1 to F4 along a timeline, with times T1 and T2 marked.]
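The interval rule for deltas can be illustrated with a short sketch: a transaction contributes to the delta only if its completion time lies in (T1, T2]. The class names `CompletedTransaction` and `DeltaSketch` are hypothetical and do not come from Kenyon.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical summary of one recovered transaction.
final class CompletedTransaction {
    final String id;
    final long completedAt; // completion timestamp, seconds since epoch

    CompletedTransaction(String id, long completedAt) {
        this.id = id;
        this.completedAt = completedAt;
    }
}

final class DeltaSketch {
    /** Returns the transactions that completed in the half-open interval (t1, t2]. */
    static List<CompletedTransaction> between(List<CompletedTransaction> all, long t1, long t2) {
        if (t2 < t1) {
            throw new IllegalArgumentException("t2 must not precede t1");
        }
        List<CompletedTransaction> delta = new ArrayList<>();
        for (CompletedTransaction tx : all) {
            // Exclusive at t1: that transaction already belongs to C(T1).
            if (tx.completedAt > t1 && tx.completedAt <= t2) {
                delta.add(tx);
            }
        }
        return delta;
    }

    public static void main(String[] args) {
        List<CompletedTransaction> history = Arrays.asList(
                new CompletedTransaction("tx1", 10),
                new CompletedTransaction("tx2", 20),
                new CompletedTransaction("tx3", 30));
        for (CompletedTransaction tx : between(history, 10, 30)) {
            System.out.println(tx.id);
        }
    }
}
```

Making the lower bound exclusive means consecutive deltas never count the same transaction twice.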
  12. Data from Phase 1
     Valid configuration specifications for extraction are created by Kenyon, one per timestamp where a transaction completed.
     For each configuration extracted:
     • The author and log message of each transaction completing at that specification are recorded.
     • The configuration is placed on the filesystem.
     A configuration delta for each consecutive pair of configurations processed can also be stored.
  13. Phase 2: Invoke Fact Extractors
     Kenyon provides an abstract class that is used to invoke third-party fact extractors on the configuration extracted to the filesystem.
     • Kenyon users subclass this class to invoke their own fact extractor.
     • Support for CodeSurfer (line-level analysis) and SWAGKIT (procedure-level analysis) is provided with Kenyon.
     • FactExtractor subclasses have a tri-modal return status: “failure”, “new data to store”, or “no new data to store”.
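A minimal sketch of the subclassing pattern with a tri-modal status is shown here. Kenyon's actual class and method signatures are not given on the slide, so `FactExtractorSketch`, `ExtractionStatus`, `extract`, and the toy `FileCountExtractor` are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// The three outcomes named on the slide; the enum itself is an assumption.
enum ExtractionStatus { FAILURE, NEW_DATA, NO_NEW_DATA }

// Hypothetical stand-in for Kenyon's abstract fact-extractor class.
abstract class FactExtractorSketch {
    /** Runs a fact extractor on a configuration checked out under configRoot. */
    abstract ExtractionStatus extract(Path configRoot);
}

// Toy subclass: its only "fact" is the number of regular files in the configuration.
final class FileCountExtractor extends FactExtractorSketch {
    private long lastCount = -1; // stands in for previously stored facts

    @Override
    ExtractionStatus extract(Path configRoot) {
        if (!Files.isDirectory(configRoot)) {
            return ExtractionStatus.FAILURE;
        }
        long count;
        try (Stream<Path> paths = Files.walk(configRoot)) {
            count = paths.filter(Files::isRegularFile).count();
        } catch (IOException | UncheckedIOException e) {
            return ExtractionStatus.FAILURE;
        }
        if (count == lastCount) {
            return ExtractionStatus.NO_NEW_DATA; // nothing to persist
        }
        lastCount = count;
        return ExtractionStatus.NEW_DATA;
    }

    public static void main(String[] args) {
        FactExtractorSketch extractor = new FileCountExtractor();
        System.out.println(extractor.extract(Paths.get("no-such-configuration")));
    }
}
```

A real subclass would instead launch the third-party tool on `configRoot` and translate its output, but the three-way status would be decided in the same place.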
  14. Data from Phase 2
     FactExtractor subclasses provide:
     • A ConfigGraph that maps software elements to nodes and static relationships to edges.
     • The graph, any node, and any edge may be attributed with static metrics.
     Multiple fact extractors may be invoked on a single configuration: each created ConfigGraph is saved with a reference to the fact extractor that created it.
     If a configuration has already been processed by a given fact extractor, it will not be processed again unless new metrics are to be calculated.
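The shape of such an attributed graph can be sketched as below. This is not Kenyon's ConfigGraph API; `ConfigGraphSketch`, `GraphNode`, `GraphEdge`, `Attributed`, and all method names are hypothetical, chosen only to show metrics attached at graph, node, and edge level.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Anything that can carry named static metrics: the graph, a node, or an edge.
class Attributed {
    private final Map<String, Double> metrics = new HashMap<>();

    void setMetric(String name, double value) { metrics.put(name, value); }

    /** Returns the metric value, or null if it has not been calculated. */
    Double getMetric(String name) { return metrics.get(name); }
}

// A software element (file, procedure, ...) identified by name.
final class GraphNode extends Attributed {
    final String element;
    GraphNode(String element) { this.element = element; }
}

// A static relationship between two elements, such as "calls".
final class GraphEdge extends Attributed {
    final GraphNode from;
    final String relation;
    final GraphNode to;

    GraphEdge(GraphNode from, String relation, GraphNode to) {
        this.from = from;
        this.relation = relation;
        this.to = to;
    }
}

final class ConfigGraphSketch extends Attributed {
    final String producedBy; // reference to the fact extractor that created the graph
    private final Map<String, GraphNode> nodes = new HashMap<>();
    private final List<GraphEdge> edges = new ArrayList<>();

    ConfigGraphSketch(String producedBy) { this.producedBy = producedBy; }

    /** Returns the node for an element, creating it on first use. */
    GraphNode node(String element) {
        return nodes.computeIfAbsent(element, GraphNode::new);
    }

    GraphEdge relate(String from, String relation, String to) {
        GraphEdge edge = new GraphEdge(node(from), relation, node(to));
        edges.add(edge);
        return edge;
    }

    int nodeCount() { return nodes.size(); }
    int edgeCount() { return edges.size(); }

    public static void main(String[] args) {
        ConfigGraphSketch graph = new ConfigGraphSketch("ExampleExtractor");
        graph.relate("main", "calls", "parse").setMetric("callSites", 2);
        graph.node("parse").setMetric("linesOfCode", 42);
        System.out.println(graph.nodeCount() + " nodes, " + graph.edgeCount() + " edges");
    }
}
```

Keeping metrics in a name-to-value map lets a later run add newly requested metrics to an existing graph without rebuilding it.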
  15. Phase 3: Data Storage
     Kenyon uses Hibernate to persist data classes.
     • Hibernate is an “object/relational persistence and query service for Java”.
     • Allows reuse of Kenyon classes by research tools implemented in Java.
     • Each configuration processed by Kenyon is assigned to a Project, the top-level data class persisted by Kenyon.
  16. Persisted Kenyon Data
     • Projects contain one set of data for each configuration specification processed.
     • Each such data set contains one or more ConfigGraphs, each produced by a different FactExtractor.
     • FactExtractors specify what GraphSchema subclass they use (not restrictive).
     [Diagram: class relationships, with multiplicities, among Project, ConfigData, ConfigGraph, ConfigSpec, FactExtractor, GraphSchema, and ConfigDelta.]
  17. Data Access
     Hibernate allows access to preprocessed data using SQL or the Hibernate query methods (HQL, QBE/QBC), which support class/field-based queries.
     • A Hibernate query returns a List of Objects, each of which is of the type originally persisted.
     • Data fields in the returned class are populated unless specified as lazily loaded.
     Kenyon provides several convenience queries for common anticipated queries, such as “what configurations are available for this project”.
  18. Kenyon Usage
     Kenyon processes data based on specifications in a configuration file:
     • Start time, stop time, and how often to process.
     • Fact extractors and their assigned metric calculators.
     • SCM parameters, filesystem parameters, and some control over what Hibernate persists.
     A “processing run” will reuse any previously processed data if available.
     • For example, if a ConfigGraph has already been created and new metrics are necessary, they are calculated and added to the existing ConfigGraph.
  19. Iterative Refinement Example
     When looking for “interesting” timeframes of evolution, a multiple-pass process is recommended.
     • A user can configure Kenyon to process the changes in a system once per day.
     • Days with high activity or other metrics exceeding a threshold can be flagged as “interesting”.
     • The user can then configure Kenyon to process those days (via multiple processing runs) at the frequency of “every 20 minutes”.
     • This process can repeat down to the “every second” level.
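The flagging step of this multi-pass process can be sketched as a simple threshold over per-day activity. The class `InterestingDays`, the use of transaction completion times as the activity measure, and the threshold value are assumptions for illustration. Kenyon itself is driven by its configuration file, not by this code.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InterestingDays {
    /**
     * Returns, in chronological order, the UTC days on which more than
     * threshold transactions completed. These are the days to reprocess
     * at a finer interval in the next pass.
     */
    static List<LocalDate> flag(List<Instant> completions, int threshold) {
        Map<LocalDate, Integer> perDay = new TreeMap<>();
        for (Instant t : completions) {
            LocalDate day = t.atZone(ZoneOffset.UTC).toLocalDate();
            perDay.merge(day, 1, Integer::sum);
        }
        List<LocalDate> flagged = new ArrayList<>();
        for (Map.Entry<LocalDate, Integer> entry : perDay.entrySet()) {
            if (entry.getValue() > threshold) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<Instant> completions = Arrays.asList(
                Instant.parse("2005-09-05T09:00:00Z"),
                Instant.parse("2005-09-05T10:30:00Z"),
                Instant.parse("2005-09-05T16:45:00Z"),
                Instant.parse("2005-09-06T11:00:00Z"));
        System.out.println(flag(completions, 2));
    }
}
```

Each flagged day would then become the start and stop time of a further processing run at the finer interval.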
  20. Parallel Preprocessing
     • Kenyon is a single-threaded process, but Hibernate supports multiple connections to a single Kenyon database.
     • A 10-year history can be processed in chunks by any number of computers, even if the processing configurations have overlapping times or different intervals.
     • Kenyon does not integrate the deltas between different processing runs, so a small overlap in processing chunks is suggested.
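One way to plan such overlapping chunks is sketched below: the history is split into equal spans, and every chunk except the last is extended forward by a small overlap. `ChunkPlanner`, `TimeChunk`, and the choice to extend chunks forward are assumptions for illustration, not part of Kenyon.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical start/stop pair for one processing run.
final class TimeChunk {
    final Instant start;
    final Instant end;

    TimeChunk(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }
}

final class ChunkPlanner {
    /**
     * Splits [start, end] into the given number of chunks of equal length.
     * Each chunk except the last runs past its boundary by overlap, so the
     * run for the next chunk starts inside a period that was already processed.
     */
    static List<TimeChunk> plan(Instant start, Instant end, int chunks, Duration overlap) {
        if (chunks < 1 || !end.isAfter(start) || overlap.isNegative()) {
            throw new IllegalArgumentException("need chunks >= 1, end after start, overlap >= 0");
        }
        long step = Duration.between(start, end).getSeconds() / chunks;
        List<TimeChunk> result = new ArrayList<>();
        for (int i = 0; i < chunks; i++) {
            Instant chunkStart = start.plusSeconds(step * i);
            Instant chunkEnd = (i == chunks - 1)
                    ? end
                    : start.plusSeconds(step * (i + 1)).plus(overlap);
            if (chunkEnd.isAfter(end)) {
                chunkEnd = end; // never run past the end of the history
            }
            result.add(new TimeChunk(chunkStart, chunkEnd));
        }
        return result;
    }

    public static void main(String[] args) {
        List<TimeChunk> plan = plan(
                Instant.parse("1995-01-01T00:00:00Z"),
                Instant.parse("2005-01-01T00:00:00Z"),
                4, Duration.ofDays(1));
        for (TimeChunk chunk : plan) {
            System.out.println(chunk.start + " to " + chunk.end);
        }
    }
}
```

Each resulting start/stop pair could be handed to a separate machine as the time range of its own processing run against the shared database.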
  21. Kenyon Architecture
     [Diagram: components Project, ConfigData, ConfigGraph, DataManager, MetricLoader, Fact Extractor, SCMInterface, Hibernate/DBMS, SCM Repository, and Filesystem, connected by <<calls>> relationships.]
  22. Current Status
     • Kenyon 1.2 is available.
     • Supports CVS, Subversion, and ClearCase.
     • Students in 290G are performing projects using Kenyon this quarter.
     • Actively working with Samsung to analyze some of their source code.
  23. Future Work (1/3)
     • Continue working with M. Kim.
        • Evaluate usefulness of SCM-only module.
        • If she decides to use Kenyon, assist with full integration.
     • Finish integration of Beagle/Kenyon and IVA/Kenyon.
     • Work with G. Murphy on using Kenyon at UBC.
     • Evaluate Kenyon’s ability to reduce the time-to-results for static software evolution research by analyzing the seminar class projects.
  24. Future Work (2/3)
     • Support branch path traversal.
        • Allow users to see the branch points in a system and specify a path for processing instead of a single branch.
        • Will reuse existing visualizations; must add a specification mechanism.
     • Incorporate full language-specific containment models for better inter-language graph traversal and mapping.
        • Use M. Godfrey’s Java fact extractor and containment model.
  25. Future Work (3/3)
     • Support more of the Standard Exchange Formats for ConfigGraph export.
        • TA is already supported, but only the Fact sections. Schema sections should be improved to use the language-specific containment models.
     • Encourage other researchers to use Kenyon, and improve results sharing, capabilities, etc. based on their feedback.
  26. Open Issues (1/3)
     The exact mechanism for allowing data sharing between researchers is not entirely controllable by Kenyon.
     • Database setup and administration can effectively override much of Kenyon’s preferences.
     • By default, Kenyon-created tables are not mutable by processes other than Kenyon.
  27. Open Issues (2/3)
     Kenyon provides a public class, EvolutionPath, that links a subgraph in one ConfigGraph to one in another ConfigGraph.
     • Directed and attributable.
     • Basic building block for evolution data.
     It is currently persisted by Kenyon, but will likely not be after 1.1, due to database mutability issues.
     • Other research projects can subclass it and, if they want to share their results easily, persist them to a Hibernate database using the provided Hibernate mapping examples.
  28. Open Issues (3/3)
     • Kenyon can be invoked automatically via a post-commit script or a cron job.
     • Should Kenyon also be invocable automatically from an IDE?
     • What sort of support should Kenyon provide for better integration with, for example, Eclipse?
  29. Conclusions (1/2)
     Kenyon is an engineering solution, designed to amortize the cost of the computationally expensive preprocessing steps that can benefit static software evolution research.
     Research projects using Kenyon will not have to independently create solutions for these common problems.
     • 18% code reduction in Beagle without really trying.
     • Is expected to reduce the lag between beginning system implementation and producing research results.
  30. Conclusions (2/2)
     Kenyon is not intended to be a lightweight data mining system for software evolution research.
     • The tradeoff of speed vs. precision is still controllable via the choice of fact extractors.
     • The configuration extraction time and associated network lag already put the per-configuration time at O(seconds).
     Instead, it allows the cost of time-consuming, computationally expensive preprocessing to be amortized among researchers.
  31. Questions?
     Kenyon was created primarily from code that existed in IVA, which is being funded by NSF grant CCR-01234603. Kenyon also contains code from Beagle, the origin analysis project overseen by Mike Godfrey. Email with future questions.