Source code comprehension on evolving software

Yida's PQE

Usage Rights

CC Attribution-NonCommercial License

  • We know that software is continuously evolving, since developers change source code practically all the time. One consequence is that developers also have to understand these code changes, which I refer to as CCC (code change comprehension) throughout this talk. Last year, we conducted an exploratory study at Microsoft, where we sent surveys to and conducted interviews with Microsoft developers about their CCC practices. This work was published in FSE. We found, first, that CCC is frequently required: the majority of developers need to understand code changes several times each day.
    In this year’s ICSE, Bacchelli and Bird, in their empirical study on modern code review, reported similar findings: CCC is more common than understanding the entire program, but it is also the most challenging part.
    These findings motivate our work, since CCC is a challenging activity that is also fundamental to developers’ daily practice.
  • In the literature survey, I identify three major categories of work related to CCC.
    The first is program differencing. This line of work tries to help developers by describing code changes.
    The second is …. Studies in this category go one step further and try to reason about and explain code changes.
    The third is a sort of “customized” CCC.
  • Unix diff is the most well-known example in this category, but it is also well recognized for two major limitations.
    Ldiff:
    1. diff: longest common subsequence
    2. All possible hunk pairs -> similarity (vector space cosine similarity) -> pick the topmost pairs
    3. Line matching -> Levenshtein edit distance -> above threshold is marked as changed
    4. Unmatched lines are new hunks -> iterate step 2
    Since these techniques treat a program as normal text, they report program differences as changes to characters. From a developer’s point of view, however, the syntactic, or structural, information about the source code is lost. This motivates another line of work, which we call “syntax differencing”.
  • This line of work uses a structured representation of a program.
    One example is ChangeDistiller, which represents a program as an abstract syntax tree (AST) and applies a tree differencing algorithm.
    In addition to ASTs, studies also represent code in XML, which can also embed … Then we can apply XML differencing algorithms, like diffX, to compute program differences.
    The limitation is that when developers make behavior-preserving modifications, such as switching the order of an if-else, these techniques still report the differences, although from the developer’s perspective the change may not be important.
  • Therefore, the next line of work focuses on semantic differencing of two program versions. Semantic diff operates at the method level and compares variable dependencies to derive behavioral changes.
    In the old version of the method add, if x is not equal to HI, x is added to TOT; otherwise, DEF is added to TOT. From this code, we can derive a list of dependencies, for example, …
    In the new version, the developer simply wants to switch the order of the if-else but mistakenly uses assignment instead of equality. Therefore, when the technique computes the variable dependencies and compares them to the previous ones, it will report that …
    These behavioral differences are certainly not expected, because when x is assigned HI, the initial value of x is always lost. In such cases, semantic diff is better than syntactic diff, since it can draw developers’ attention to a program’s unexpected behavioral change.
  • Another work, JDiff, is about semantic differencing for object-oriented programs.
    By simply applying syntactic differencing, we only learn that m1 is added, and …. But developers may be more interested in how the behavior of the program has changed.
    If the dynamic type of a is B, the call a.m1 in the new version actually invokes m1 in B.
    The exception thrown will be caught by a different catch block after the change.
    JDiff extends the CFG to combine … The ECFG accounts for dynamic binding and exception handling in the previous example, and a graph differencing algorithm can be applied to reveal the difference.
  • Some studies also use symbolic execution to characterize a program’s behavior. This technique…instead of actual values. For example, a symbolic execution of this code fragment reads: if this condition is satisfied, return; otherwise, if…, return…
    Person et al. proposed differential symbolic execution, which compares the symbolic executions of two program versions. The output looks like this: the conditions under which the two versions produce different results.
  • Now I’ve covered the three categories of program differencing. These works basically try to help CCC by describing what the code change is. The next line of work, which I call “CCS” (code change summarization), goes a step further and tries to explain code changes.
  • A program is represented as a set of predicates that describe code elements, containment relationships, and structural dependencies; these are called “facts”. LSdiff then computes the changed facts between two program versions.
    It infers rules from the list of changed facts.
    It also infers exceptions to the rules. Example: the start methods of all of Car’s subtypes added calls to the Key.chk method, except for the subtype Kia.
  • Finally, DeltaDoc uses transformation heuristics to summarize these statement differences into human-readable documentation.
  • The studies we’ve seen so far all extract information from the source code itself. However, other software artifacts, such as commit logs, can also be helpful for understanding code changes, since in these artifacts we might find useful natural-language sentences related to the code changes. Motivated by this observation, …proposed…
    Each sentence has some features. To locate the most informative or relevant sentences, the sentences are ranked by their feature values.
    Here is an example of their output. For this change, the summary contains a list of relevant sentences extracted from its evolutionary documents.
  • The major challenges of using evolutionary documents are, first, that links between these documents might not exist, so we may not even be able to find documents relevant to a code change. This problem is known as the “missing link” and has been studied recently.
    In addition, document may not… In such cases, we cannot rely on them to extract informative change summaries.
    As for the automatically generated documentation I introduced before, the biggest problem is verbosity. This shows the rules and exceptions generated by LSdiff to describe a code change. This shows the number of lines in the change documentation. Compared to the human-written commit log, which is the black bar, the documentation generated by DeltaDoc is still very long.
    Another challenge is that some uninteresting changes are identified automatically. For example, a rule reported by LSdiff says…; in the user study, participants complained that such a rule is not useful.
  • Therefore, there are studies that customize CCC so that developers can query for the changes they are interested in and filter out irrelevant changes.
  • Non-essential changes include …, which are less likely to be of interest to developers.
    They use ChangeDistiller to detect changes and apply PPA to resolve type bindings for partial programs (i.e., code changes).
    However, the goal of this work is to…
  • In general, studies in this category focus on querying meaningful changes and filtering out non-essential changes.
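
The line-matching step in the Ldiff notes above (step 3) can be sketched as follows. This is a minimal illustration, not Ldiff itself: the greedy pairing, the normalization by the longer line, and the 0.6 threshold are assumptions made for the example, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical lines."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def match_lines(deleted, added, threshold=0.6):
    """Greedily pair each deleted line with its most similar unused added line.

    Pairs at or above the (assumed) threshold are reported as changed lines;
    everything else stays a plain deletion or addition.
    """
    changed, removed, unused = [], [], list(added)
    for old in deleted:
        best = max(unused, key=lambda new: similarity(old, new), default=None)
        if best is not None and similarity(old, best) >= threshold:
            changed.append((old, best))
            unused.remove(best)
        else:
            removed.append(old)
    return changed, removed, unused


if __name__ == "__main__":
    changed, removed, added = match_lines(
        ["total = total + x", "print(y)"],
        ["total = total + x2", "return None"],
    )
    print("changed:", changed)
    print("removed:", removed)
    print("added:  ", added)
```

Unlike plain Unix diff, which would report one deleted and one added line for the edited statement, the pairing reports it as a single modified line.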
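
The dependency comparison in the semantic diff notes above can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the original tool: the fragments are written in Python rather than the language of the original example, the assignment-in-condition mistake is modeled with Python's `:=` operator, and only plain assignments and if-else are handled. The function names are hypothetical.

```python
import ast


def input_dependencies(source):
    """Map each assigned variable to the set of input variables it depends on."""

    def lookup(env, name):
        # A variable that was never assigned still holds its own input value.
        return env.get(name, frozenset({name}))

    def eval_expr(node, env):
        """Return the inputs an expression depends on; apply := side effects."""
        if isinstance(node, ast.NamedExpr):
            value_deps = eval_expr(node.value, env)
            env[node.target.id] = frozenset(value_deps)
            return value_deps
        if isinstance(node, ast.Name):
            return set(lookup(env, node.id))
        deps = set()
        for child in ast.iter_child_nodes(node):
            deps |= eval_expr(child, env)
        return deps

    def run(stmts, env, control):
        for stmt in stmts:
            if isinstance(stmt, ast.Assign):
                deps = eval_expr(stmt.value, env) | control
                for target in stmt.targets:
                    if isinstance(target, ast.Name):
                        env[target.id] = frozenset(deps)
            elif isinstance(stmt, ast.If):
                # Everything assigned in a branch also depends on the condition.
                inner = control | eval_expr(stmt.test, env)
                then_env, else_env = dict(env), dict(env)
                run(stmt.body, then_env, inner)
                run(stmt.orelse, else_env, inner)
                for name in then_env.keys() | else_env.keys():
                    env[name] = lookup(then_env, name) | lookup(else_env, name)
            else:
                raise NotImplementedError(type(stmt).__name__)
        return env

    return run(ast.parse(source).body, {}, set())


def dependency_diff(old_src, new_src):
    """Report, per variable, the (lost, gained) input dependencies."""
    old, new = input_dependencies(old_src), input_dependencies(new_src)
    report = {}
    for var in sorted(old.keys() | new.keys()):
        before = old.get(var, frozenset({var}))
        after = new.get(var, frozenset({var}))
        if before != after:
            report[var] = (sorted(before - after), sorted(after - before))
    return report


OLD = "if x != HI:\n    tot = tot + x\nelse:\n    tot = tot + DEF\n"
# The intended edit was `if x == HI:` with the branches swapped.
NEW = "if (x := HI):\n    tot = tot + DEF\nelse:\n    tot = tot + x\n"

if __name__ == "__main__":
    for var, (lost, gained) in dependency_diff(OLD, NEW).items():
        print(var, "lost:", lost, "gained:", gained)
```

On this pair the report shows that `tot` no longer depends on the initial value of `x`, and that `x` now depends on `HI`, which is the unexpected behavioral change the notes describe.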
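
The rule-with-exception idea in the LSdiff notes above can be sketched as a simple count over changed facts. This is only an illustration: LSdiff infers logic rules automatically, whereas the sketch checks one hand-written rule candidate. The subtype names other than Kia, the 0.75 accuracy threshold, and the function name are assumptions made for the example.

```python
def summarize_rule(subtypes, added_calls, method, callee, min_accuracy=0.75):
    """Check the candidate rule "every subtype's `method` added a call to `callee`".

    `added_calls` is a set of (caller, callee) facts for calls added by the change.
    Returns the rule with its exceptions, or None if too few subtypes follow it.
    """
    expected = {(f"{t}.{method}", callee) for t in subtypes}
    matches = expected & added_calls
    exceptions = sorted(caller.split(".")[0] for caller, _ in expected - matches)
    accuracy = len(matches) / len(expected) if expected else 0.0
    if accuracy < min_accuracy:
        return None
    return {
        "rule": f"all subtypes' {method} methods added calls to {callee}",
        "matches": len(matches),
        "exceptions": exceptions,
    }


if __name__ == "__main__":
    # Hypothetical subtypes of Car; only Kia comes from the example above.
    car_subtypes = ["BMW", "GM", "Kia", "Volvo"]
    added = {
        ("BMW.start", "Key.chk"),
        ("GM.start", "Key.chk"),
        ("Volvo.start", "Key.chk"),
    }
    print(summarize_rule(car_subtypes, added, "start", "Key.chk"))
```

Reporting the exception alongside the rule is what lets a reviewer spot a potentially inconsistent change, such as a subtype that was missed.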

Source code comprehension on evolving software Presentation Transcript

  • 1. Source Code Comprehension on Evolving Software: A Literature Survey. Yida Tao. Supervisor: Sunghun Kim
  • 2. Motivation: Code Change Comprehension
      • Code change comprehension is frequently required, and is part of major development activities, in particular the code-review process (Tao et al., FSE’12)
      • “…review and understand code they have not seen before may be more common than a developer working on new code” (Bacchelli & Bird, ICSE’13)
      • “From interviews, no other code review challenge emerged as clearly as understanding the submitted change” (Bacchelli & Bird, ICSE’13)
      • How do software engineers understand code changes? An exploratory study in industry. Tao et al., FSE’12
      • Expectations, outcomes, and challenges of modern code review. Bacchelli and Bird, ICSE’13
  • 3. Outline: Code Change Comprehension
      • Program Differencing: describing code changes
      • Code Change Summarization: explaining code changes
      • Querying and Filtering: customization
  • 4. Program Differencing
      • Text Differencing
      • Syntactic Differencing
      • Semantic Differencing
  • 5. Text Differencing
      • Flat representation of a program
      • Sequence of strings
      • Unix diff
      • Only outputs added/deleted lines, cannot detect modified lines
      • Hard to determine when a code fragment is moved upward or downward
      • Ldiff (Canfora et al., ICSE’09)
      • An enhanced line differencing tool
      • Limitations
      • Changes to *characters*
      • No syntactic-structure information
  • 6. Syntactic Differencing
      • Structured representation of a program
      • Abstract syntax tree; XML
      • ChangeDistiller (Fluri et al., TSE’07)
      • Tree differencing
      • Node: bigram string similarity
      • Control structure: subtree similarity
      • Output: tree edit script (insert, delete, move, update)
      • XML differencing
      • srcML (Maletic & Collard, ICSM’04): embeds abstract syntax and structure within the source code
      • diffX (Al-Ekram et al., CASCON '05)
      • Limitation
      • Cannot describe how the behavior of a program is changed
      • Still reports differences for behavior-preserving changes
  • 7. Semantic Differencing
      • Semantic diff (Jackson and Ladd, ICSM’94)
      • Method-level
      • Variable dependencies comparison
  • 8. Semantic Differencing (cont.)
      • JDiff (Apiwattanapong et al. ASE’04, 06)
      • Extended control-flow graph (ECFG)
      • Dynamic binding, class hierarchy, exception handling, etc.
  • 9. Semantic Differencing (cont.)
      • Differential symbolic execution (Person et al., FSE’08)
      • “Executing” a program using symbolic values
  • 10. Outline: Code Change Comprehension
      • Program Differencing: text differencing, syntactic differencing, semantic differencing
      • Code Change Summarization: explaining code changes
      • Querying and Filtering: customization
  • 11. Code Change Summarization
      • LSdiff (Kim and Notkin, ICSE’09)
      • Group related changes
      • Detect potential inconsistencies in a code change
  • 12. Code Change Summarization (cont.)
      • DeltaDoc (Buse and Weimer, ASE’10)
      • Symbolic execution: obtain path predicates for each statement in both versions
      • Identify statements that are added, deleted, or have changed predicates
      • Summarization
  • 13. Code Change Summarization (cont.)
      • Multi-document summarization (Rastkar and Murphy, ICSE’13)
      • Linking evolutionary documents (commit log, issue tracking entries)
      • Finding the most informative sentences to extract to form a summary
      • Similarity between a sentence and the title of the enclosing document
      • Overlap between a sentence and the adjacent document
  • 14. Code Change Summarization (cont.)
      • Challenges
      • Evolutionary documents
      • Linkage might not be found (Bachmann et al., FSE’10, Wu et al., FSE’11)
      • Human-written documents may be unavailable or uninformative (Buse and Weimer, ASE’10, Tao et al., FSE’12)
      • Automatically generated documents
      • Verbosity
      • Uninteresting changes are identified, e.g., “all types that declared toString() added constructors” (Kim and Notkin, ICSE’09)
      • [Figure labels: LSdiff, DeltaDoc]
  • 15. Outline: Code Change Comprehension
      • Program Differencing: text differencing, syntactic differencing, semantic differencing
      • Code Change Summarization: rules and exceptions, control-flow changes, evolutionary documentation
      • Querying and Filtering: customization
  • 16. Querying and Filtering
      • Specifying and detecting meaningful changes (Yu et al., ASE’11)
      • Normalize the program (user-specified) before differencing
      • Non-trivial to construct the query
  • 17. Querying and Filtering (cont.)
      • Filtering non-essential changes (Kawrykow and Robillard, ICSE’11)
      • Non-essential changes: rename-induced modifications, local variable extraction, trivial keyword modification, whitespace and documentation updates
      • ChangeDistiller (Fluri et al., TSE’07) + Partial program analysis (Dagenais and Robillard, ICSE’08)
      • Goal: improving mining and recommendation accuracy instead of developers’ comprehension
  • 18. Outline: Code Change Comprehension
      • Program Differencing: text differencing, syntactic differencing, semantic differencing
      • Code Change Summarization: rules and exceptions, control-flow changes, evolutionary documentation
      • Querying and Filtering: meaningful changes, non-essential changes
  • 19. Research Directions
      • Program Differencing: text differencing, syntactic differencing, semantic differencing
      • Code Change Summarization: rules and exceptions, control-flow changes, evolutionary documentation
      • Querying and Filtering: meaningful changes, non-essential changes
      • Source Code Changes
      • Work-item-based changes?
  • 20. Work-item-based Changes
      • Multiple work-items in a single code change (e.g., a bug fix + code cleanup + a new feature)
      • Very difficult to understand (Tao et al., FSE’12)
      • [Figure: JFreeChart revision 1083, annotated with: trivial keyword removal, bug fix, formatting]
  • 21. Work-item-based Change Detection
      • Multiple work-items in a single code change (e.g., a bug fix + code cleanup + a new feature)
      • Very difficult to understand (Tao et al., FSE’12)
      • Change decomposition
      • Program slicing (entity dependencies)
      • Pattern matching (similarities)
      • A single work-item spreads across multiple code changes (e.g., 5 changes to finally fix a bug completely)
      • Change aggregation
      • Linkage to the same issue
      • Heuristics like time duration, commit authors, program dependencies, etc.
  • 22. Research Directions: Code Change Comprehension
      • Program Differencing: text differencing, syntax differencing, semantic differencing
      • Code Change Summarization: rules and exceptions, control-flow changes, evolutionary documentation
      • Querying and Filtering: meaningful changes, non-essential changes
      • Work-item change detection: change decomposition, change aggregation
  • 23. Research Directions: Code Change Comprehension
      • Program Differencing: text differencing, syntax differencing, semantic differencing
      • Code Change Summarization: rules and exceptions, control-flow changes, evolutionary documentation
      • Querying and Filtering: meaningful changes, non-essential changes, work-item-specific changes
      • Work-item change detection: change decomposition, change aggregation
  • 24. Research Directions: Code Change Comprehension
      • Program Differencing: text differencing, syntax differencing, semantic differencing
      • Code Change Summarization: rules and exceptions, control-flow changes, evolutionary documentation
      • Querying and Filtering: meaningful changes, non-essential changes, work-item-specific changes
      • Concrete Execution
      • Work-item change detection: change decomposition, change aggregation
  • 25. Explaining code changes with executions of co-changed test cases
      • Test cases: best documentation for source code
      • Test cases co-changed with source code: documentation for code changes?
      • Mostly synchronous co-evolution of production and test code (Zaidman et al., Empirical Software Engineering’11)
      • Differential test executions
      • Co-changed test cases T
      • Executing T on the old version P and new version P’
      • Comparing executions to explain changed behaviors
      • From StackExchange: http://programmers.stackexchange.com/questions/154439/quality-of-code-in-unit-tests?newsletter=1&nlcode=67628%7c1a35
      • “Unit tests are one of the best sources of documentation for your system, and arguably the most reliable form”
      • “Unit tests are often the first thing you look at when trying to grasp what some piece of code does”
      • “They can also serve as a starting point for people new to the code base”
  • 26. Research Directions: Code Change Comprehension
      • Program Differencing: text differencing, syntax differencing, semantic differencing
      • Code Change Summarization: rules and exceptions, control-flow changes, evolutionary documentation
      • Querying and Filtering: meaningful changes, non-essential changes, work-item-specific changes
      • Concrete Execution: co-changed test cases, differential test execution
      • Work-item change detection: change decomposition, change aggregation
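
The differential test execution proposed on slide 25 (run the co-changed tests T on the old version P and the new version P’, then compare the executions) can be sketched as follows. This is a minimal illustration, assuming both versions are available as Python callables and that an execution is summarized only by its return value or exception type; the function names and the divide example are hypothetical.

```python
def observe(func, args):
    """Summarize one execution as its return value or the raised exception type."""
    try:
        return ("returned", func(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def differential_run(old_impl, new_impl, test_inputs):
    """Run the same test inputs on both versions and keep only the differences."""
    diffs = []
    for args in test_inputs:
        old_result = observe(old_impl, args)
        new_result = observe(new_impl, args)
        if old_result != new_result:
            diffs.append((args, old_result, new_result))
    return diffs


def old_divide(a, b):
    # Old version P: fails on a zero divisor.
    return a / b


def new_divide(a, b):
    # New version P': the change guards against a zero divisor.
    return 0 if b == 0 else a / b


if __name__ == "__main__":
    # Inputs taken from the (hypothetical) co-changed test cases T.
    for args, before, after in differential_run(old_divide, new_divide, [(6, 3), (1, 0)]):
        print(f"input {args}: {before} -> {after}")
```

Only the input whose behavior changed is reported, which is the kind of concrete evidence a reviewer could read alongside the textual diff.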