Mining and Untangling Change Genealogies
PhD Defense Talk
Kim Herzig
kherzig@acm.org
Mining Software Repositories
Software repositories contain events and artifacts collected during software development.
‣ Prediction models.
‣ Recommender systems.
‣ Development process measures.
‣ Models allowing to replay history.
Collect & combine ➡ Filter ➡ Interpret
Which are the most defect-prone source files?
Code History
Bugs
Which change fixed which bug?
Which files were changed to do so?
Consider only closed and resolved bug reports.
Distinguish between pre- and post-release bugs.
Count the number of distinct bugs per file as a quality measure.
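The filter and interpret steps above can be sketched in a few lines. This is a minimal sketch, not the tooling used in the talk: the record types `BugReport` and `Commit`, their fields, and the function `bugs_per_file` are hypothetical, and the link between a commit and the bug it fixes is assumed to be already recovered.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BugReport:          # hypothetical record shape
    bug_id: int
    status: str           # e.g. "CLOSED", "RESOLVED", "OPEN"
    post_release: bool    # reported after the release?


@dataclass(frozen=True)
class Commit:             # hypothetical record shape
    fixed_bug_id: int     # bug fixed by this commit (assumed pre-linked)
    files: tuple          # files changed by this commit


def bugs_per_file(reports, commits, post_release):
    """Count distinct closed/resolved bugs per file for one release phase."""
    eligible = {
        r.bug_id
        for r in reports
        if r.status in {"CLOSED", "RESOLVED"} and r.post_release == post_release
    }
    bugs = defaultdict(set)
    for commit in commits:
        if commit.fixed_bug_id in eligible:
            for name in commit.files:
                bugs[name].add(commit.fixed_bug_id)
    return {name: len(ids) for name, ids in bugs.items()}


reports = [
    BugReport(1, "CLOSED", True),
    BugReport(2, "RESOLVED", True),
    BugReport(3, "OPEN", True),      # filtered out: neither closed nor resolved
    BugReport(4, "CLOSED", False),   # pre-release bug
]
commits = [
    Commit(1, ("A.java", "B.java")),
    Commit(1, ("A.java",)),          # same bug again: counted only once
    Commit(2, ("A.java",)),
    Commit(3, ("B.java",)),
    Commit(4, ("C.java",)),
]
post_release_counts = bugs_per_file(reports, commits, post_release=True)
pre_release_counts = bugs_per_file(reports, commits, post_release=False)
print(post_release_counts, pre_release_counts)
```

Using a set per file is what makes the count "distinct": two commits fixing the same bug in the same file contribute one bug, not two.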
Mining Version Archives
Developer A changes File A and submits the change to the code base.
‣ Model: analyze historic code changes.
‣ Detect changes to be submitted.
‣ Suggest further code changes that other developers made when changing File A.
‣ The developer might react to the suggestion.
Help developers to prevent incomplete changes; prevent bugs.
Mining Change Couplings
Code History
Filter: select only frequently occurring patterns.
Interpret: each pattern is a rule: when one item is changed, then someone also changes the other.
Prediction: if someone changes the first item, we suggest the second as a likely further change.
The assumption was that one of the co-changed items depends on the other.
But what if the two are frequently co-occurring but independent changes?
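The filter, interpret, and prediction steps of change coupling mining can be illustrated with plain co-change counting. This is a minimal sketch under assumptions: the function `mine_couplings`, its thresholds, and the toy history are hypothetical, and the rules are ordinary support/confidence association rules rather than the exact miner used in the talk.

```python
from collections import Counter
from itertools import permutations


def mine_couplings(transactions, min_support=2, min_confidence=0.5):
    """Return rules (a, b, support, confidence): 'when a changes, b changes too'."""
    single = Counter()   # number of change sets touching each file
    pair = Counter()     # number of change sets touching both files of a pair
    for files in transactions:
        files = set(files)
        single.update(files)
        pair.update(permutations(sorted(files), 2))
    rules = []
    for (a, b), support in pair.items():
        confidence = support / single[a]
        # Filter: keep only frequently occurring patterns.
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    # Rank by confidence, then support (both descending).
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


history = [
    {"A", "B"},
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"C"},
]
rules = mine_couplings(history)
for a, b, support, confidence in rules:
    print(f"{a} => {b}  support={support}  confidence={confidence:.2f}")
```

Note that a rule such as `B => A` with confidence 1.0 only says that the two files changed together; nothing in the counting establishes that one change depends on the other, which is exactly the concern raised on the slide.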
Change Genealogies
Combine spatial and temporal dimension of archives.
‣ Reasoning over multiple components at multiple points in time
Model dependencies between code changes.
‣ Reasoning over the dependencies and impact of changes
Change Genealogies
When does a change depend on another change?
‣ If we cannot apply the change without applying the other change first (e.g. without breaking compilation).
Which dependencies should be modeled?
‣ We cannot model all dependencies.
‣ We need a change abstraction that covers most situations and allows efficient dependency rules.
Change Operations
Source code change
    public class C {
        public C() {
            List<String> list = new ArrayList<String>();
            int cache = 10;
            ...
    -       if (list.size() < 1) {
    +       if (list.isEmpty()) {
            }
        }
    }
Analyze source code
‣ many revisions cannot be compiled
‣ but must be considered in the genealogy.
Track changes applied to
‣ method definitions
‣ method calls
Reduce applied changes to
AD added method definition
DD deleted method definition
MD modified method definition
AC added method call
DC deleted method call
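Dependency Rules
Each rule names the earlier change operation that a change operation depends on; the letter in parentheses is the edge label shown on the slide.
‣ AD depends on DD (D): You cannot define the same method twice.
‣ MD depends on AD (D), MD depends on MD (D): You cannot modify a method that does not exist.
‣ DD depends on AD (D), DD depends on MD (D): You cannot delete a method that does not exist.
‣ AC depends on AD (D), AC depends on MD (D): You can only call methods that exist.
‣ DC depends on AC (C): You cannot delete a method call that does not exist.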
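The dependency rules can be applied mechanically to a chronological list of change operations. The following is a minimal sketch, not the implementation behind the talk: `ChangeOp`, `DEPENDS_ON`, and `build_genealogy` are hypothetical names, operations are matched by method name only, and each operation is linked to the most recent matching earlier operation (both simplifying assumptions).

```python
from dataclasses import dataclass

# Dependency rules from the slide: operation kind -> kinds it may depend on.
DEPENDS_ON = {
    "AD": ("DD",),
    "MD": ("AD", "MD"),
    "DD": ("AD", "MD"),
    "AC": ("AD", "MD"),
    "DC": ("AC",),
}


@dataclass(frozen=True)
class ChangeOp:          # hypothetical record shape
    kind: str            # AD, DD, MD, AC or DC
    method: str          # method the operation defines, modifies or calls
    change_set: str      # change set that applied the operation


def build_genealogy(ops):
    """Return edges (dependent index, required index, rule) for ops in time order."""
    last_seen = {}       # (kind, method) -> index of the latest such operation
    edges = []
    for i, op in enumerate(ops):
        candidates = [
            last_seen[(kind, op.method)]
            for kind in DEPENDS_ON[op.kind]
            if (kind, op.method) in last_seen
        ]
        if candidates:
            j = max(candidates)  # assumption: depend on the most recent match
            edges.append((i, j, f"{op.kind}->{ops[j].kind}"))
        last_seen[(op.kind, op.method)] = i
    return edges


# Hypothetical history, not the example from the slides.
ops = [
    ChangeOp("AD", "A.foo", "CS1"),
    ChangeOp("AD", "B.bar", "CS2"),
    ChangeOp("AC", "B.bar", "CS2"),
    ChangeOp("DD", "A.foo", "CS3"),
    ChangeOp("AD", "A.foo", "CS3"),
    ChangeOp("DC", "B.bar", "CS4"),
    ChangeOp("AC", "A.foo", "CS4"),
]
edges = build_genealogy(ops)
for dependent, required, rule in edges:
    print(ops[dependent].change_set, "depends on", ops[required].change_set, rule)
```

Each edge carries the rule that caused it, so the resulting graph can be filtered by rule type later on.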
Genealogy Example
[2010] Herzig, “Capturing the Long-Term Impact of Changes”, ICSE
File 1
File 2
File 3
File 4
+ int A.foo(int)
+ int B.bar(int)
+ B.bar(5)
- int A.foo(int)
+ float A.foo(float)
+ d = A.foo(5d)
- x = B.bar(5)
+ x = A.foo(5f)
+ d = A.foo(d)
+ e = A.foo(-1f)
CS1 CS2 CS3 CS4 CS5
Vertex Annotation
‣ Author
‣ Timestamp
‣ Change set ID
‣ Method name
‣ Fully qualified class name
‣ File name
Each edge is labeled with the dependency rule that caused it.
Change Set Layer
Two layers: the change operation layer and the change set layer.
Change set layer ➡ used as the default layer.
[Figure: change sets CS1 to CS5 across File 1 to File 4]
The change set layer is directed and acyclic.
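Collapsing the change operation layer into the change set layer, and checking that the result is acyclic, can be sketched as follows. This is a minimal sketch under assumptions: the function names and the input format (pairs of dependent and required change set IDs) are hypothetical; `graphlib` is part of the Python standard library from version 3.9.

```python
from graphlib import CycleError, TopologicalSorter


def change_set_layer(op_edges):
    """Collapse operation-level dependencies into change-set-level ones.

    op_edges: iterable of (dependent change set, required change set) pairs.
    Returns a mapping: change set -> set of change sets it depends on.
    """
    layer = {}
    for dependent, required in op_edges:
        layer.setdefault(dependent, set())
        layer.setdefault(required, set())
        if dependent != required:      # drop dependencies inside one change set
            layer[dependent].add(required)
    return layer


def is_acyclic(layer):
    """True if the dependency graph admits a topological order."""
    try:
        tuple(TopologicalSorter(layer).static_order())
        return True
    except CycleError:
        return False


# Hypothetical operation-level dependencies, given as change set IDs.
op_edges = [
    ("CS2", "CS2"),
    ("CS3", "CS1"),
    ("CS3", "CS3"),
    ("CS4", "CS2"),
    ("CS4", "CS3"),
]
layer = change_set_layer(op_edges)
print(layer, is_acyclic(layer))
```

Because an operation can only depend on an operation applied earlier, dropping the edges inside a single change set is enough to leave a graph whose edges all point back in time.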
Applications & Assumptions
Cause effect ▸ Prediction ▸ Classification ▸ Untangling
[2011] Herzig and Zeller, “Mining Cause-Effect-Chains from Version Archives”, ISSRE
Funded by
Faculty Grant
Expressing Cause Effect Chains
[Figure: Developers A, B, C, …, D changing one software system]
What is the long-term impact of the initial change?
‣ Related to change couplings but requires dependencies.
The original problem stems from large repositories combining multiple projects that use each other's code.
‣ Example: Developer A might deprecate and replace an authentication method in system A, forcing the authentication in system D to be updated.
Expressing Cause Effect Chains
[Figure: Developers A, B, C, …, D changing one software system]
Did developer A trigger the green changes?
Does one change depend on the other? Is there a path from one to the other in the genealogy?
We can use CTL to formulate the property:
a ⇒ EF b (a and b stand for the two changes)
Change genealogies allow model checking version archives.
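On a finite graph, the CTL operator EF ("there exists a path on which the property eventually holds") reduces to a reachability search. The following is a minimal sketch of that idea, not the model checker used in the talk: the function `ef`, the `impact` mapping, and the change set IDs are hypothetical.

```python
from collections import deque


def ef(successors, start, goal):
    """CTL-style EF check: does some path from `start` reach a node satisfying `goal`?

    successors: mapping node -> iterable of nodes reachable in one step.
    EF also holds if `start` itself satisfies `goal`.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if goal(node):
            return True
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical genealogy: each change set maps to the change sets that depend on it.
impact = {
    "CS1": ["CS2", "CS3"],
    "CS2": ["CS4"],
    "CS3": [],
    "CS4": [],
}
triggered = ef(impact, "CS1", lambda cs: cs == "CS4")
not_triggered = ef(impact, "CS3", lambda cs: cs == "CS4")
print(triggered, not_triggered)
```

The `goal` argument is a predicate rather than a node, so the same search answers questions such as "does any later change by developer D depend on this change?" once vertices carry author annotations.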
Model Checking Genealogies
create change genealogy from version archive
➡ extract valid CTL rules using model checking
➡ report frequently occurring rules as recommendations
Recommendations (a, b, c stand for changes):
a ⇒ EF b
a ⇒ AG(b ⇒ EF c)
a ⇒ EF(b ∧ c)
...
Rules must be valid within a time window and are ranked by confidence & support.
Recommendations are ensured to be based on structural dependencies.
Predicting Future Events
Project history: the first 10% forms the initial training set for the prediction model.
Use the top 3 ranked recommendations (ranked by confidence, support) to predict files that will change in the time window (having a structural dependency on the current change).
precision = #correct predictions / (#correct predictions + #false predictions)
The training set (initially 10% of the project history) is then extended incrementally (+1).
How well can we predict cause-effect chains?
[Figure: precision (0 to 1.0) for ArgoUML, Jaxen, JRuby, and XStream; series: model checking vs. most frequently changed files; markers: [eRose], [Canfora]]
More than 60% of all predictions are valid.
‣ This is in line with other approaches.
‣ Structural dependencies are ensured.
For over 47% of commits, all recommendations are valid.
The average rank of the highest hit is 1.8.
[Canfora] Canfora et al., “Using Multivariate Time Series and Association Rules to Detect Logical Change Couplings: an Empirical Study”, ICSM 2010
[eRose] Zimmermann et al., “Mining Version Histories to Guide Software Changes”, ICSE 2004
Predicting Defects
[2012] Herzig et al., “Classifying Changes and Predicting Defects Using Change Genealogies”, under submission
Using network metrics on change genealogies.
‣ Metrics express the dependency structures.
‣ Motivated by studies using call graph network metrics.
‣ Assumption: Central changes tend to be more crucial and thus more likely to add defects.
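A few simple network metrics on a change genealogy can be computed without any graph library. This is a minimal sketch: the three metrics shown (out-degree, in-degree, number of transitive dependents) are illustrative choices and not the metric set used in the talk, and the function name and toy genealogy are hypothetical.

```python
def genealogy_metrics(dependents):
    """Compute simple centrality-style metrics per change set.

    dependents: mapping change set -> set of change sets that depend on it.
    """
    nodes = set(dependents) | {d for ds in dependents.values() for d in ds}
    in_degree = {n: 0 for n in nodes}
    for ds in dependents.values():
        for d in ds:
            in_degree[d] += 1

    def descendants(node):
        # Number of change sets that depend on `node` directly or transitively.
        seen, stack = set(), [node]
        while stack:
            for d in dependents.get(stack.pop(), ()):
                if d not in seen:
                    seen.add(d)
                    stack.append(d)
        return len(seen)

    return {
        n: {
            "out_degree": len(dependents.get(n, ())),
            "in_degree": in_degree[n],
            "descendants": descendants(n),
        }
        for n in nodes
    }


# Hypothetical genealogy on the change set layer.
genealogy = {
    "CS1": {"CS2", "CS3"},
    "CS2": {"CS4"},
    "CS3": {"CS4"},
}
metrics = genealogy_metrics(genealogy)
for name in sorted(metrics):
    print(name, metrics[name])
```

Under the stated assumption, a change set such as `CS1`, on which many later change sets depend, would be a candidate for a more "central" and therefore riskier change; the resulting numbers would serve as features for a defect prediction model.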
▸ four open-source projects ▸ stratified 100 cross-fold setup ▸ using four different machine learners
[Figure: precision and recall for HTTPClient, Jackrabbit, Lucene-Java, and Rhino; metric sets: complexity metrics, network metrics, genealogy metrics, network + genealogy metrics]
Genealogy metrics outperform code and network metrics when predicting defects.
Prediction
Untangling Changes
[2013] Herzig and Zeller, “The Impact of Tangled Code Changes”, MSR
[Figure: results per blob size (2, 3, 4) for ArgoUML, Google Web Toolkit, JRuby, and XStream; axis from 0 to 0.9]
[Figure: natural blob size occurrence; blob sizes 2, 3, 4, >4; shares 4%, 5%, 18%, 73%]
The genealogy change set layer requires atomic code changes.
‣ Change sets applying changes targeting multiple issues cause false dependencies.
‣ Manual classification of 7,000 change sets shows that ~10% of all bug fixes are tangled.
Can we untangle tangled code changes?
‣ Yes, for most tangled changes (blob size 2) with a precision between 0.7 and 0.9.
Tangled change set ➡ untangling algorithm ➡ change set partition (A, B)
Confidence voters: distance measures, data dependencies, change couplings, call graph
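The untangling pipeline (confidence voters feeding a partitioning step) can be illustrated with a greedy merge. This is a minimal sketch under assumptions, not the algorithm from the talk: voter scores are aggregated by averaging, groups are merged by their most related pair of operations, and the two toy voters, the tuple layout of an operation, and all names are hypothetical.

```python
from itertools import combinations


def untangle(ops, voters, n_partitions):
    """Greedily merge the most related groups until n_partitions remain."""

    def score(a, b):
        # Assumption: aggregate the voters' confidence values by averaging.
        return sum(voter(a, b) for voter in voters) / len(voters)

    def linkage(g1, g2):
        # Two groups are as related as their most related pair of operations.
        return max(score(a, b) for a in g1 for b in g2)

    groups = [[op] for op in ops]
    while len(groups) > n_partitions:
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda pair: linkage(groups[pair[0]], groups[pair[1]]),
        )
        groups[i] += groups.pop(j)   # j > i, so index i stays valid
    return groups


# Hypothetical voters over operations given as (file, line, called method).
def same_file(a, b):
    return 1.0 if a[0] == b[0] else 0.0


def shared_call(a, b):
    return 1.0 if a[2] == b[2] else 0.0


tangled = [
    ("A.java", 10, "parse"),
    ("A.java", 12, "parse"),
    ("B.java", 40, "log"),
    ("B.java", 44, "log"),
]
partitions = untangle(tangled, [same_file, shared_call], n_partitions=2)
for part in partitions:
    print(part)
```

The number of partitions corresponds to the blob size and is passed in explicitly here.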
Untangling
Misclassified Issue Reports
[2013] Herzig et al., “It’s not a Bug, It’s a Feature: How Misclassification Impacts Bug Prediction”, ICSE
After untangling:
‣ Which partition is a bug fix, which one a new feature?
But are bug reports reporting bugs? No.
‣ Manually classified >7,000 issue reports:
‣ Every 3rd bug report does not require a code fix.
[Figure: per-project pie charts with the legend “wrongly classified” and “correctly classified”: HTTPClient 63% and 37%, Jackrabbit 75% and 25%, Lucene-Java 65% and 35%, Rhino 59% and 41%, Tomcat5 61% and 39%; tracker labels: …-Tracker, Bugzilla-Tracker]
[Figure: 0% to 40% for HTTPClient, Jackrabbit, Lucene-Java, Rhino, and Tomcat5; series: TOP 5%, TOP 10%, TOP 15%, TOP 20%]
With severe impact on bug count models.
‣ Up to 40% of the files ranked as most defect-prone are false positives.
‣ Automatic classification of issue reports is possible.
[2008] Antoniol et al., “Is it a bug or an enhancement? A text-based approach to classify change requests”, CASCON
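One way to read the false positive figure is to compare the top-ranked files before and after removing misclassified reports. This is a minimal sketch of that reading, not the evaluation procedure of the cited study: the function names, the tie-breaking by file name, and the toy bug counts are hypothetical.

```python
def top_share(counts, share):
    """Files in the top `share` fraction when ranked by bug count (descending)."""
    k = max(1, round(len(counts) * share))
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    return set(ranked[:k])


def false_positive_rate(reported, corrected, share):
    """Fraction of top files (by reported counts) that drop out after correction."""
    top_reported = top_share(reported, share)
    top_corrected = top_share(corrected, share)
    return len(top_reported - top_corrected) / len(top_reported)


# Hypothetical bug counts per file, before and after manual reclassification.
reported = {"f0": 9, "f1": 8, "f2": 3, "f3": 2, "f4": 1,
            "f5": 1, "f6": 0, "f7": 0, "f8": 0, "f9": 0}
corrected = dict(reported, f1=2)   # most "bugs" of f1 were not bugs
rate = false_positive_rate(reported, corrected, share=0.2)
print(rate)
```

In this toy data, `f1` is ranked among the top 20% only because of misclassified reports, so half of the reported top files are false positives.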
Classification
Change Genealogies

Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Mining and Untangling Change Genealogies (PhD Defense Talk)

  • 1. kherzig@acm.org Kim Herzig PhD Defense Talk Mining and Untangling Change Genealogies
  • 2. Mining Software Repositories Software repositories contain events and artifacts collected during software development.
  • 3. Mining Software Repositories Software repositories contain events and artifacts collected during software development. Prediction models. Recommender systems. Development process measures. Models that allow replaying history.
  • 4. Mining Software Repositories Collect & combine, Filter, Interpret. Software repositories contain events and artifacts collected during software development. Prediction models. Recommender systems. Development process measures. Models that allow replaying history.
  • 5. Mining Software Repositories Collect & combine, Filter, Interpret. Software repositories contain events and artifacts collected during software development. Which are the most defect-prone source files?
  • 6. Mining Software Repositories Collect & combine Filter Interpret Software repositories contain events and artifacts collected during software development.
  • 7. Mining Software Repositories Collect & combine Filter Interpret Software repositories contain events and artifacts collected during software development. Code History Bugs
  • 8. Mining Software Repositories Collect & combine: Which change fixed which bug? Which files were changed to do so? Filter: Consider only closed and resolved bug reports. Distinguish between pre- and post-release bugs. Interpret: Count the number of distinct bugs per file as quality measure. Software repositories contain events and artifacts collected during software development. Code History. Bugs.
  • 9. Mining Version Archives Developer A changes File A and submits to the Code Base. The developer might react to the suggestion. Help developers to prevent incomplete changes; prevent bugs.
  • 10. Mining Version Archives Developer A changes File A and submits to the Code Base. Model: Detect changes to be submitted. Analyze historic code changes. Suggest further code changes that other developers made when changing File A. The developer might react to the suggestion. Help developers to prevent incomplete changes; prevent bugs.
  • 13. Mining Change Couplings Code History (over time). Interpret: a rule of the form "when one item is changed, then someone also changes another item".
  • 14. Mining Change Couplings Code History (over time). Filter: select only frequently occurring patterns. Interpret: a rule of the form "when one item is changed, then someone also changes another item". Prediction: if someone changes the first item, we suggest the second as a likely further change.
  • 15. Mining Change Couplings Code History (over time). Filter: select only frequently occurring patterns. Interpret: a rule of the form "when one item is changed, then someone also changes another item". Prediction: if someone changes the first item, we suggest the second as a likely further change. The assumption was that the co-changed item depends on the changed one. But what if both are frequently occurring but independent changes?
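The filtering and ranking idea behind change couplings can be sketched as follows. This is a minimal illustration, not the tooling used in the talk: the function name `mine_couplings`, the thresholds, and the file names are hypothetical, and transactions are assumed to be sets of files committed together.

```python
from collections import Counter
from itertools import permutations

def mine_couplings(transactions, min_support=2, min_confidence=0.5):
    """Return rules (a, b, support, confidence) meaning 'when a changes, b changes too'."""
    single = Counter()   # how often each file was changed
    pair = Counter()     # how often an ordered pair of files was changed together
    for files in transactions:
        unique = set(files)
        single.update(unique)
        pair.update(permutations(sorted(unique), 2))
    rules = []
    for (a, b), support in pair.items():
        confidence = support / single[a]
        # Filter step: keep only frequently occurring patterns.
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    # Rank by confidence, then support; names break ties deterministically.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0], r[1]))
    return rules

# Hypothetical commit history.
history = [
    {"A.java", "B.java"},
    {"A.java", "B.java", "C.java"},
    {"A.java"},
    {"C.java"},
]
rules = mine_couplings(history)
```

Note that such rules only capture co-occurrence. Two files that often change together but are structurally independent would still be reported, which is the weakness raised on slide 15.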
  • 16. Change Genealogies time Combine spatial and temporal dimension of archives. ‣ Reasoning over multiple components at multiple points in time Model dependencies between code changes. ‣ Reasoning over the dependencies and impact of changes
  • 18. Change Genealogies When does a change depend on another change? ‣ If we cannot apply the former without applying the latter first (e.g. without breaking compilation).
  • 19. Change Genealogies When does a change depend on another change? ‣ If we cannot apply the former without applying the latter first (e.g. without breaking compilation). Which dependencies should be modeled? ‣ We cannot model all dependencies. ‣ We need a change abstraction that covers most situations and allows efficient dependency rules.
  • 20. Change Operations Analyze source code: ‣ many revisions cannot be compiled ‣ but must be considered in the genealogy. Track changes applied to: ‣ method definitions ‣ method calls. Reduce applied changes to: AD = added method definition, DD = deleted method definition, MD = modified method definition, AC = added method call, DC = deleted method call. Source code change (excerpt): int cache = 10; ... public class C { public C() { List<String> list = new List<String>(); if(list.size() < 1){ if(list.isEmpty()){ } }
  • 21. Change Operations Analyze source code: ‣ many revisions cannot be compiled ‣ but must be considered in the genealogy. Track changes applied to: ‣ method definitions ‣ method calls. Reduce applied changes to: AD = added method definition, DD = deleted method definition, MD = modified method definition, AC = added method call, DC = deleted method call. Source code change (excerpt): int cache = 10; ... public class C { public C() { List<String> list = new List<String>(); if(list.size() < 1){ } }
  • 22. Dependency Rules (operation pairs with edge label) ‣ DD/AD (D): You cannot define the same method twice. ‣ MD/AD (D), MD/MD (D): You cannot modify a method that does not exist. ‣ DD/AD (D), DD/MD (D): You cannot delete a method that does not exist. ‣ AC/AD (D), AC/MD (D): You can only call methods that exist. ‣ DC/AC (C): You cannot delete a method call that does not exist.
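One way to read these rules is as a lookup table from an operation type to the earlier operation types it may depend on. The sketch below rests on several assumptions that the slide does not state: the direction of each pair (the first rule is read as "an added definition depends on an earlier deletion of the same method"), the choice of the most recent matching operation, and the simplification that a call is identified by the called method only, ignoring the call site. All names are hypothetical.

```python
# Assumed reading: dependent operation type -> [(required earlier type, edge label)].
RULES = {
    "AD": [("DD", "D")],               # re-adding requires an earlier deletion
    "MD": [("AD", "D"), ("MD", "D")],  # modifying requires an existing definition
    "DD": [("AD", "D"), ("MD", "D")],  # deleting requires an existing definition
    "AC": [("AD", "D"), ("MD", "D")],  # calling requires an existing definition
    "DC": [("AC", "C")],               # deleting a call requires the added call
}

def find_dependency(op, earlier_ops):
    """Find the operation that `op` depends on.

    op: (op_type, target) where target is a method identifier.
    earlier_ops: list of (op_id, op_type, target), oldest first.
    Returns (op_id, edge_label) of the most recent matching operation, or None.
    """
    op_type, target = op
    allowed = dict(RULES[op_type])
    for op_id, earlier_type, earlier_target in reversed(earlier_ops):
        if earlier_target == target and earlier_type in allowed:
            return op_id, allowed[earlier_type]
    return None

# Hypothetical history of operations on method A.foo.
history = [(1, "AD", "A.foo"), (2, "AC", "A.foo"), (3, "MD", "A.foo")]
deleted_call = find_dependency(("DC", "A.foo"), history)
deleted_definition = find_dependency(("DD", "A.foo"), history)
```

Because the lookup only needs operation types and method identifiers, it works on revisions that do not compile, which matches the motivation given on slide 20.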
  • 23. Genealogy Example [2010] Herzig, “Capturing the Long-Term Impact of Changes”, ICSE. [Diagram: change operations on File 1 to File 4, grouped into change sets CS1 to CS5. Operations shown: + int A.foo(int); + int B.bar(int); + B.bar(5); - int A.foo(int); + float A.foo(float); + d = A.foo(5d); - x = B.bar(5); + x = A.foo(5f); + d = A.foo(d); + e = A.foo(-1f)]
  • 26. Genealogy Example [2010] Herzig, “Capturing the Long-Term Impact of Changes”, ICSE. [Diagram: change operations on File 1 to File 4, grouped into change sets CS1 to CS5. Operations shown: + int A.foo(int); + int B.bar(int); + B.bar(5); - int A.foo(int); + float A.foo(float); + d = A.foo(5d); - x = B.bar(5); + x = A.foo(5f); + d = A.foo(d); + e = A.foo(-1f)] Vertex annotations: author, timestamp, change set ID, method name, class name, fully qualified file name. Each edge is labeled with the causing dependency rule.
  • 27. Change Set Layer (as default layer). [Diagram: change operation layer with File 1 to File 4 and change sets CS1 to CS5 ➡ change set layer]
  • 28. Change Set Layer (as default layer). [Diagram: change operation layer with File 1 to File 4 and change sets CS1 to CS5 ➡ change set layer with vertices CS1 to CS5]
  • 30. Change Set Layer (as default layer). The change set layer is directed and acyclic. [Diagram: change operation layer with File 1 to File 4 and change sets CS1 to CS5 ➡ change set layer with vertices CS1 to CS5]
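The step from the change operation layer to the change set layer can be sketched as collapsing all operations of one change set into a single vertex. This is an illustrative sketch under assumptions: edges are pairs (dependent, dependency), the mapping `op_to_cs` and all identifiers are hypothetical, and the acyclicity check uses Kahn's algorithm, which the talk does not mention.

```python
def to_change_set_layer(op_edges, op_to_cs):
    """Collapse operation-level edges into change-set-level edges."""
    edges = set()
    for src, dst in op_edges:
        a, b = op_to_cs[src], op_to_cs[dst]
        if a != b:  # dependencies inside one change set vanish
            edges.add((a, b))
    return edges

def is_acyclic(edges):
    """Kahn's algorithm: a directed graph is acyclic iff all nodes can be peeled off."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed == len(nodes)

# Hypothetical operations o1..o4 spread over three change sets.
op_to_cs = {"o1": "CS1", "o2": "CS1", "o3": "CS2", "o4": "CS3"}
op_edges = [("o2", "o1"), ("o3", "o1"), ("o4", "o2"), ("o4", "o3")]
cs_edges = to_change_set_layer(op_edges, op_to_cs)
```

Since a change can only depend on changes applied earlier, the collapsed graph is expected to be acyclic, and `is_acyclic(cs_edges)` can serve as a sanity check on the extraction.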
  • 31. Applications & Assumptions: Cause effect, Prediction, Classification, Untangling
  • 32. Applications & Assumptions: Cause effect, Prediction, Classification, Untangling. Cause effect: [2011] Herzig and Zeller, “Mining Cause-Effect-Chains from Version Archives”, ISSRE. Funded by Faculty Grant.
  • 33. Expressing Cause Effect Chains [Diagram: Developers A, B, C, …, D changing a software system] What is the long-term impact of the initial change? ‣ Related to change couplings but requires dependencies. The original problem stems from large repositories combining multiple projects that use each other's code. ‣ Example: Developer A might deprecate and replace an authentication method in system A, forcing the authentication in system D to be updated.
  • 34. Expressing Cause Effect Chains [Diagram: Developers A, B, C, …, D changing a software system]
  • 35. Expressing Cause Effect Chains [Diagram: Developers A, B, C, …, D changing a software system] Did developer A trigger the green changes? Does one change depend on the other? Is there a path from one change to the other?
  • 36. Expressing Cause Effect Chains [Diagram: Developers A, B, C, …, D changing a software system] Did developer A trigger the green changes? Does one change depend on the other? Is there a path from one change to the other? We can use CTL to formulate the property: x ⇒ EF y.
  • 37. Expressing Cause Effect Chains [Diagram: Developers A, B, C, …, D changing a software system] Did developer A trigger the green changes? Does one change depend on the other? Is there a path from one change to the other? We can use CTL to formulate the property: x ⇒ EF y. Change genealogies allow model checking version archives.
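On a finite graph, the CTL operator EF ("there exists a path on which the property eventually holds") reduces to reachability. The sketch below checks "is there a path from one change to another" with a breadth-first search. It assumes that edges point from a change to the later changes depending on it; the graph and the function name `ef` are hypothetical, and a real model checker would handle full CTL rather than this single operator.

```python
from collections import deque

def ef(graph, start, target):
    """Return True if `target` is reachable from `start` (EF target holds at start)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:  # EF also holds if the start itself satisfies the property
            return True
        for successor in graph.get(node, ()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False

# Hypothetical change set layer: change set -> change sets that depend on it.
genealogy = {
    "CS1": ["CS2", "CS3"],
    "CS2": ["CS4"],
    "CS3": [],
    "CS4": ["CS5"],
}
triggered = ef(genealogy, "CS1", "CS5")
not_triggered = ef(genealogy, "CS3", "CS5")
```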
  • 38. Model Checking Genealogies ‣ Create the change genealogy from the version archive. ‣ Extract valid CTL rules using model checking, e.g. x ⇒ EF y, x ⇒ AG(y ⇒ EF z), x ⇒ EF(y ∧ z), ... Rules must be valid in a time window. ‣ Report frequently occurring rules as recommendations, ranked by confidence & support.
  • 39. Model Checking Genealogies ‣ Create the change genealogy from the version archive. ‣ Extract valid CTL rules using model checking, e.g. x ⇒ EF y, x ⇒ AG(y ⇒ EF z), x ⇒ EF(y ∧ z), ... Rules must be valid in a time window. ‣ Report frequently occurring rules as recommendations, ranked by confidence & support. Recommendations are ensured to be based on structural dependencies.
  • 40. Predicting Future Events (having a structural dependency on the current change). Project history.
  • 41. Predicting Future Events (having a structural dependency on the current change). Project history: 10% initial training set.
  • 42. Predicting Future Events (having a structural dependency on the current change). Project history: 10% initial training set. Prediction model.
  • 43. Predicting Future Events (having a structural dependency on the current change). Project history: 10% initial training set. Prediction model: use the top 3 ranked recommendations (ranked by confidence, support) to predict files that will change in the time window.
  • 44. Predicting Future Events (having a structural dependency on the current change). Project history: 10% initial training set. Prediction model: use the top 3 ranked recommendations (ranked by confidence, support) to predict files that will change in the time window. precision = #correct predictions / (#correct predictions + #false predictions)
  • 45. Predicting Future Events (having a structural dependency on the current change). Project history: 10% training set. Use the top 3 ranked recommendations (ranked by confidence, support) to predict files that will change in the time window.
  • 46. Predicting Future Events (having a structural dependency on the current change). Project history: 10% training set +1. Use the top 3 ranked recommendations (ranked by confidence, support) to predict files that will change in the time window.
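The precision measure from slide 44, applied to the top 3 ranked recommendations, can be sketched as below. The function name `top_k_precision` and the file names are hypothetical; the sketch assumes the recommendations are already ranked and that the files actually changed in the time window are known.

```python
def top_k_precision(ranked, actually_changed, k=3):
    """precision = #correct / (#correct + #false) over the top-k recommendations."""
    top = ranked[:k]
    if not top:
        return None  # no recommendation made, precision undefined
    correct = sum(1 for name in top if name in actually_changed)
    return correct / len(top)

# Hypothetical ranked recommendations and observed changes in the time window.
recommended = ["A.java", "B.java", "C.java", "D.java"]
changed = {"A.java", "C.java", "D.java"}
precision = top_k_precision(recommended, changed)
```

Here D.java is changed but ranked fourth, so it does not count: two of the three evaluated recommendations are correct.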
  • 47. How well can we predict cause-effect chains? [Bar chart: precision (0 to 1.0) for ArgoUML, Jaxen, JRuby, XStream; series: model checking, most frequently changed files]
  • 48. How well can we predict cause-effect chains? More than 60% of all predictions are valid. ‣ Which is in sync with other approaches [eRose] [Canfora]. [Bar chart: precision (0 to 1.0) for ArgoUML, Jaxen, JRuby, XStream; series: model checking, most frequently changed files] [Canfora] Canfora et al., “Using Multivariate Time Series and Association Rules to Detect Logical Change Couplings: an Empirical Study”, ICSM 2010 [eRose] Zimmermann et al., “Mining Version Histories to Guide Software Changes”, ICSE 2004
  • 49. How well can we predict cause-effect chains? More than 60% of all predictions are valid. ‣ Which is in sync with other approaches [eRose] [Canfora]. Structural dependencies ensured.
  • 50. How well can we predict cause-effect chains? More than 60% of all predictions are valid. ‣ Which is in sync with other approaches [eRose] [Canfora]. Structural dependencies ensured. For over 47% of commits all recommendations are valid.
  • 51. How well can we predict cause-effect chains? More than 60% of all predictions are valid. ‣ Which is in sync with other approaches [eRose] [Canfora]. Structural dependencies ensured. For over 47% of commits all recommendations are valid. Average rank of highest hit is 1.8.
  • 53. Prediction: Predicting Defects [2012] Herzig et al., “Classifying Changes and Predicting Defects Using Change Genealogies”, under submission. Using network metrics on change genealogies. ‣ Metrics express the dependency structures. ‣ Motivated by studies using call graph network metrics. ‣ Assumption: Central changes tend to be more crucial and thus more likely to add defects. Setup: ▸ four open-source projects ▸ stratified 100 cross-fold setup ▸ using four different machine learners. [Chart: precision and recall for HTTPClient, Jackrabbit, Lucene-Java, Rhino; series: complexity metrics, network metrics, genealogy metrics, network + genealogy metrics] Genealogy metrics outperform code and network metrics when predicting defects.
  • 54. Untangling: Untangling Changes [2013] Herzig and Zeller, “The Impact of Tangled Code Changes”, MSR. The genealogy change set layer requires atomic code changes. ‣ Change sets applying changes targeting multiple issues cause false dependencies. ‣ Manual classification of 7,000 change sets shows that ~10% of all bug fixes are tangled. Can we untangle tangled code changes? ‣ Yes, for most tangled changes (blob size 2) with a precision between 0.7 and 0.9. [Diagram: Tangled Change Set → Untangling Algorithm → Change Set Partition A, B. Confidence voters: distance measures, data dependencies, change couplings, call graph] [Chart: precision (0 to 0.9) for blob sizes 2, 3, 4 on ArgoUML, Google Web Toolkit, JRuby, XStream] [Pie chart: natural blob size occurrence; legend: blob sizes 2, 3, 4, >4; segment values: 4%, 5%, 18%, 73%]
  • 55. Classification: Misclassified Issue Reports [2013] Herzig et al., “It’s not a Bug, It’s a Feature: How Misclassification Impacts Bug Prediction”, ICSE. After untangling: ‣ Which partition is a bug fix, which one a new feature? But are bug reports reporting bugs? No. ‣ Manually classified >7,000 issue reports. ‣ Every 3rd bug report does not require a code fix. [Pie charts, wrongly vs. correctly classified: HTTPClient 63% / 37%, Jackrabbit 75% / 25%, Lucene-Java 65% / 35%, Rhino 59% / 41%, Tomcat5 61% / 39%; legend: -Tracker, Bugzilla-Tracker] With severe impact on bug count models. ‣ Up to 40% of the files reported as most defect-prone are false positives. [Chart: 0% to 40% for HTTPClient, Jackrabbit, Lucene-Java, Rhino, Tomcat5; series: TOP 5%, TOP 10%, TOP 15%, TOP 20%] ‣ Automatic classification of issue reports is possible. [2008] Antoniol et al., “Is it a bug or an enhancement? A text based approach to classify change requests.”, CASCON