The content was created from "The Art and Science of Analyzing Software Data"
O Baysal, Kononenko, O. (Oleksii), Holmes, R. (Reid), and Godfrey, M.W. (Michael W.), “Synthesizing Knowledge from Software Development Artifacts”, 2015.
2. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 2
Outline
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
3. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 3
Problem Statement
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
The data is out there. The problem is making practical use of it.
to process
the myriad development artifacts
challenging
a graph-based representation of specific properties within a development
artifact and how they change over time
Lifecycle Model
• Lifecycle models can be applied to any data that changes its state, or
is annotated with new data, over time.
4. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 4
Problem Statement - Examples
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• For example, “issues” in issue-tracking systems often start their lives as OPEN,
but are eventually RESOLVED or marked as WONTFIX.
• A lifecycle model for issues aggregates data to provide a graphical representation
of how bugs flow through these three states.
• For example, the lifecycle model would capture the proportion of bugs that are
reopened from a WONTFIX state that might be interesting for a manager
considering adjustments to their issue triage process.
• Such lifecycle models often exist in process models (if defined). Therefore,
extracting one from the history enables comparing the defined lifecycle with the
actual one.
In this chapter, we apply lifecycle models to capture how the patch review
process works within the Mozilla project.
5. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 5
Example: Patch Lifecycle
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• By examining how each artifact evolves, we can build a summary that captures
common dynamic evolutionary patterns.
• Each node in a lifecycle model represents a state that can be derived by
examining the evolution of an artifact.
• change status of issues
• patches under review
• lines of code
metadata history
extract
lifecycle models
6. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 6
Example: Patch Lifecycle
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• To model the patch lifecycle, e.g., for the Mozilla project, we first examine
Mozilla’s code review policy and processes.
• Mozilla employs a two-tier code review process for validating submitted
patches — review and super-review.
• performed by a module owner or peers of the module
• a reviewer is someone who has domain expertise in a problem area
Review
• required if the patch involves integration or modifies core codebase
infrastructure
Super-review
• We then extract the states a patch can go through and define the final states it
can be assigned to.
7. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 7
Example: Patch Lifecycle
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• Figure 4.1 illustrates a lifecycle model of a patch as a simple graph:
• the nodes show the various states a patch can go through during the review
process,
• while the edges capture the transitions between lifecycle states.
• A transition represents an event that is labeled as a flag and its status reported
during the review process.
• For simplicity, we show only the key code review patches that have “review” (r)
and “super-review” (sr) flags attached to them.
8. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 8
Example: Patch Lifecycle
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
The code review process begins when
a patch is submitted and a review is requested:
a review or super-review has been requested.
• A patch can be assigned to one of three states:
• Submitted, Accepted, or Rejected.
9. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 9
Example: Patch Lifecycle
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
The code review process begins when
a patch is submitted and a review is requested:
a review or super-review has been requested.
• A patch can be assigned to one of three states:
• Submitted, Accepted, or Rejected.
Once the review is requested (a flag
contains a question mark “?” at the end),
the patch enters the Submitted state.
10. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 10
Example: Patch Lifecycle
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
The code review process begins when
a patch is submitted and a review is requested:
a review or super-review has been requested.
• A patch can be assigned to one of three states:
• Submitted, Accepted, or Rejected.
Note that both the Accepted and Rejected
states permit self-transitions.
double review process
Once the review is requested (a flag
contains a question mark “?” at the end),
the patch enters the Submitted state.
11. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 11
Example: Patch Lifecycle
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
The code review process begins when
a patch is submitted and a review is requested:
a review or super-review has been requested.
• A patch can be assigned to one of three states:
• Submitted, Accepted, or Rejected.
Note that both the Accepted and Rejected
states permit self-transitions.
double review process
Once the review is requested (a flag
contains a question mark “?” at the end),
the patch enters the Submitted state.
There are four possible terminal states for a patch:
12. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 12
Example: Patch Lifecycle
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• Landed: a patch meets the code review criteria and is incorporated into the codebase.
• Resubmitted: a patch is superseded by additional refinements after being accepted or
rejected.
• Abandoned: a patch is not improved after being rejected.
• Timeout: a patch with review requests that are never answered.
There are four possible terminal states for a patch:
#patches = #Submitted patches
13. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 13
Model Extraction
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• Lifecycle models can be generated as follows:
1. Determine the key states of the system or process(e.g., a number of events
that could occur) and their attributes.
2. Define the necessary transitions between the states and specify an
attribute for each transition.
3. Define the terminal outcomes(optional).
4. Define and gather qualitative or quantitative measurements for each state
transition and, if present, for the final outcomes.
• The pattern provides means to model, organize, and reason about the data or
underlying processes otherwise hidden in individual artifacts.
• While lifecycle models can be modified or extended depending on the needs, they
work best when the state space is well defined.
• Applying this pattern to complex data may require some abstraction.
14. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 14
Code Review
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• Code review is a key element of any mature software development process.
• Indeed, community contributions are often the life blood of a successful open
source project; however, the core developers must also be able to assess the
quality of the incoming contributions, lest they negatively impact the overall quality
of the system.
• The code review process evaluates the quality of source code modifications
(submitted as patches) before they are committed to a project’s version control
repository.
• Here, we want to explore whether code review in a project is “democratic,” i.e.,
are contributions reviewed “equally,” regardless of the developers’ previous
involvement on a project.
• For example, do patches from core developers have a higher chance of being
accepted?
• Do patches from casual contributors take longer to get feedback?
15. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 15
Code Review: Mozilla Project
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• A two-tiered code review process
• A review is typically performed by the owner of the module or a “peer”; the
reviewer has domain expertise in the problem area.
• A super-review is required if the patch involves integration or modifies core
Mozilla infrastructure.
• Bugzilla users flag patches with metadata to capture code review requests and
evaluations. A typical patch review process consists of the following steps:
1. Once a patch is ready, its owner requests are view from a module owner or
a peer. The review flag is set to “r?”.
• If the patch owner decides to request a super-review, he may also do so
and the flag is set to “sr?”.
2. When the patch passes are review or a super-review, the flag is set to “r+,”
or “sr+,” respectively. If it fails, the reviewer sets the flag to “r−” or “sr−” and
provides explanation on a review by adding comments to a bug in Bugzilla.
3. If the patch is rejected, the patch owner may resubmit a new version of the
patch that will undergo a review process from the beginning. If the patch is
approved, it will be checked into the project’s official codebase.
16. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 16
Mozilla Firefox
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• To generate lifecycle models all events have been extracted from Mozilla’s public
Bugzilla instance.
• All code committed to Mozilla Firefox undergoes code review.
• Developers submit a patch containing their change to Bugzilla and request a
review.
• Reviewers annotate the patch either positively or negatively to reflect their
opinion of the code under review.
• For highly-impactful patches super-reviews may be requested and performed.
• Once the reviewers approve a patch it can be applied to the Mozilla source code
repository.
Lifecycle Models code review processes
apply
The goal is:
to identify the ways in which patches typically flow through the code review process
17. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 17
Mozilla Firefox - Model Extraction
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
Patches exist in one of three key states: Submitted, Accepted, Rejected
1. State Identification
18. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 18
Mozilla Firefox - Model Extraction
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
Code reviews transition patches between the three primary states.
2. Transition Extraction
A review request is annotated with “r?”
Positive reviews are denoted with a “r+”
Negative reviews are annotated “r−”
Super-reviews are prepended with an s (e.g., “sr+”/“sr−”).
19. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 19
Mozilla Firefox - Model Extraction
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
Patches can also exist in four terminal states, which are defined considering
entire patch history to resolve an issue.
3. Terminal State Determination
20. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 20
Mozilla Firefox - Model Extraction
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
Study we measured both the number of times each transition happened along
with the median time taken to transition between states.
4. Measurement
21. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 21
Mozilla Firefox - Lifecycle Analysis
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• A large proportion of Accepted patches are still Resubmitted by authors for
revision.
• Rejected patches are usually Resubmitted, easing concerns that rejecting a
borderline patch could cause it to be abandoned.
Very few patches Timeout in practice.
22. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 22
Mozilla Firefox - Lifecycle Analysis
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• From the timing data for core contributors (refer to Table 4.1) we see that it takes
an average of 8.9 hours to get an initial “r+” review, while getting a negative “r–”
review takes 11.8 hours.
• The transition “r? → r+” is a lot faster than “r? → r–” showing that reviewers
provide faster responses if a patch is of good quality.
average of 11.8 hours
average of 8.9 hours
23. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 23
Mozilla Firefox - Lifecycle Analysis
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• To our surprise, the fastest “r? → r+” is detected for casual developers.
• Super-reviews, in general, are approved very quickly, within 8−10 hours.
• This finding conforms to the Mozilla’s code review policy: super- reviewers are
supposed to respond within 24 hours of a super-review request.
• However, it takes much longer — 4 to 6 days — for a super-reviewer to reject a
patch that requires an integration, as these patches often require extensive
discussions with others.
the fastest “r? -> r+”
24. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 24
Mozilla Firefox - Lifecycle Analysis
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
Comparing the lifecycles for core and casual contributors
6% more patches
increased by factor of 3.12x
Mozilla Firefox’s patch lifecycle for core developers
7% fewer patches
increased by factor of 3.5x
Mozilla Firefox’s patch lifecycle for casual developers
25. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 25
Mozilla Firefox - Lifecycle Analysis
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• Our findings suggest that patches from casual contributors are more likely to be
abandoned by both reviewers and contributors themselves.
• Thus, it is likely that these patches should receive extra care to both ensure quality
and encourage future contributions from the community members who prefer to
participate in the collaborative development on a less regular basis.
generated discussion and raised some concerns among the developers on the Mozilla Development Planning team
26. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 26
Other Applications
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• Lifecycle models provide practitioners with fact-based views about their projects.
• Supporting practitioners with better insights into their processes and systems,
these models help them make better data-driven development decisions.
• The lifecycle model is a flexible approach that can be used in a variety of
software investigation tasks. For example:
• Issues: As developers work on issues their state changes. Common states here
would include NEW, ASSIGNED, WONTFIX, CLOSED, although these may vary
from project to project.
• Component/Priority assignments: Issues are often assigned to specific code
components (and are given priority assignments). As an issue is triaged and
worked upon these assignments can change.
• Source code: The evolutionary history of any line or block of source code can be
considered capturing Addition, Deletion, and Modification. This data can be
aggregated at the line, block, method, or class level.
• Discussions: Online discussions, e.g., those on StackOverflow, can have status
of CLOSED, UNANSWERED, REOPENED, DUPLICATE, and PROTECTED.
27. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 27
Conclusion
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
• Lifecycle models can be easy to generate and interpret, making them practical
for use in a wide variety of data modeling applications.
• Software developers and managers make decisions based on the understanding
they have of their software systems.
• While artifacts can be investigated individually, being able to summarize
characteristics about a set of development artifacts can be useful.
• Lifecycle models capture the dynamic nature of how various development
artifacts change over time in a graphical form that can be easily understood and
communicated.
• Lifecycle models enables reasoning of the underlying processes and dynamics of
the artifacts being analyzed.
28. SYNTHESIZING KNOWLEDGE FROM SOFTWARE DEVELOPMENT ARTIFACTS
JEONBUK NATIONAL
UNIVERSITY
Jeongwhan Choi 28
1.Problem Statement 2.Artifact Lifecycle Models 3.Code Review
4.Lifecycle Analysis 5. Other Applications 6.Conclusion
Thank you