A Tale of Experiments on
Bug Prediction
Martin Pinzger
Professor of Software Engineering
University of Klagenfurt, Austria


Follow me: @pinzger
Software repositories
2
Hmm, wait a minute
3
Can’t we learn “something” from that data?
Goal of software repository mining
Software Analytics
To obtain insightful and actionable information for completing various tasks
around developing and maintaining software systems
Examples
Quality analysis and defect prediction
Detecting “hot-spots”
Preventing defects
Recommender (advisory) systems
Code completion
Suggesting good code examples
Helping in using an API
...
4
Examples from my mining research
My mining research
The relationship between developer contributions and failure-prone Microsoft Vista
binaries (FSE 2008)
Predicting failure-prone methods (ESEM 2012)
Predicting Build Co-Changes with Source Code Change and Commit Categories
(to appear at SANER 2016)
For more see: http://serg.aau.at/bin/view/MartinPinzger/Publications
Surveys on software repository mining
A survey and taxonomy of approaches for mining software repositories in the
context of software evolution, Kagdi et al. 2007
Evaluating defect prediction approaches: a benchmark and an extensive
comparison, D’Ambros et al. 2012
Conference: MSR 2016 http://msrconf.org/
5
Method-Level Bug Prediction
with Emanuel Giger, Marco D’Ambros*, Harald Gall
University of Zurich
*University of Lugano
Many existing studies to predict bug-prone files
A comparative analysis of the efficiency of change metrics and static
code attributes for defect prediction, Moser et al. 2008
Use of relative code churn measures to predict system defect
density, Nagappan et al. 2005
Cross-project defect prediction: a large scale experiment on data vs.
domain vs. process, Zimmermann et al. 2009
Predicting faults using the complexity of code changes, Hassan et al.
2009
7
Prediction granularity
11 methods on average
[Figure: class 1, class 2, class 3, ..., class n]
4 methods are bug prone (ca. 36%)
Retrieving bug-prone methods saves manual inspection effort and
testing effort
8
Large files are typically the most bug-prone files
Research questions
How accurately can we predict buggy methods?
Which characteristics (i.e., metrics) indicate bug-prone methods?
How does the accuracy vary if the number of buggy methods
decreases?
9
Research questions
10
RQ1 What is the accuracy of bug prediction at the method level?
RQ2 Which characteristics (i.e., metrics) indicate bug-prone methods?
RQ3 How does the accuracy vary if the number of
buggy methods decreases?
Experiment with 21 Java open source projects
11
Project        #Classes  #Methods  #M-Histories  #Bugs
JDT Core          1,140    17,703        43,134  4,888
Jena2               897     8,340         7,764    704
Lucene              477     3,870         1,754    377
Xerces              693     8,189         6,866  1,017
Derby Engine      1,394    18,693         9,507  1,663
Ant Core            827     8,698        17,993  1,900
Approach overview
[Figure: stepwise overview of the data extraction process - (1) Versioning data from CVS, SVN, or GIT is imported with Evolizer into the RHDB; (2) Bug data is linked via bug references (e.g., #bug123) in commit messages; (3) Source code changes (SCC) are extracted by ChangeDistiller through AST comparison of subsequent versions (e.g., 1.1 vs. 1.2); (4) Experiment with a Support Vector Machine]
12
Investigated metrics
13
Source code metrics (from the last release)
fanIn, fanOut, localVar, parameters, commentToCodeRatio, countPath, McCabe
Complexity, statements, maxNesting
Change metrics
methodHistories, authors,
stmtAdded, maxStmtAdded, avgStmtAdded,
stmtDeleted, maxStmtDeleted, avgStmtDeleted,
churn, maxChurn, avgChurn,
decl, cond, elseAdded, elseDeleted
Bugs
Count bug references in commit logs for changed methods
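As a rough illustration of this counting step (the message format, method mapping, and all names below are assumptions for the sketch, not the paper's tooling):

```python
import re
from collections import Counter

# Hypothetical commit records: (commit message, methods changed by that commit).
commits = [
    ("fix NPE in resolver, see #bug123", ["LocalDeclaration.resolve"]),
    ("refactoring, no functional change", ["Main.configure"]),
    ("closes #bug456", ["Main.configure", "LocalDeclaration.resolve"]),
]

BUG_REF = re.compile(r"#bug\d+", re.IGNORECASE)  # reference pattern assumed from the "#bug123" example

bugs_per_method = Counter()
for message, changed_methods in commits:
    refs = BUG_REF.findall(message)
    for method in changed_methods:
        bugs_per_method[method] += len(refs)

print(bugs_per_method)
# Counter({'LocalDeclaration.resolve': 2, 'Main.configure': 1})
```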
Predicting bug-prone methods
Bug-prone vs. not bug-prone
14
3.1 Experimental Setup (paper excerpt): Prior to model building and classification, we labeled each method in our dataset either as bug-prone or not bug-prone as follows:

bugClass = { not bug-prone : #bugs = 0
             bug-prone     : #bugs >= 1 }

These two classes represent the binary target classes for training and validating the prediction models. Using 0 (respectively 1) as cut-point is a common approach applied by many studies covering bug prediction models, e.g., [30, 7, 4, 27, 37]. Other cut-points are applied in literature, for instance a statistical lower confidence bound [33] or the median [16].
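A minimal sketch of this labeling step, assuming the per-method bug counts have already been collected (the column names and data are illustrative):

```python
import pandas as pd

# One row per method: change metrics and the number of associated bug references (illustrative data).
methods = pd.DataFrame({
    "method": ["A.foo", "A.bar", "B.baz"],
    "methodHistories": [12, 2, 7],
    "authors": [4, 1, 3],
    "bugs": [3, 0, 1],
})

# GT0 cut-point: a method is bug-prone if it is linked to at least one bug.
methods["bugClass"] = (methods["bugs"] >= 1).astype(int)
print(methods[["method", "bugClass"]])
```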
Models computed with change metrics (CM) perform best
authors and methodHistories are the most important measures
Accuracy of prediction models
15
Table 4: Median classification results over all projects per classifier and per model

            CM               SCM              CM&SCM
         AUC   P    R     AUC   P    R     AUC   P    R
RndFor   .95  .84  .88    .72  .50  .64    .95  .85  .95
SVM      .96  .83  .86    .70  .48  .63    .95  .80  .96
BN       .96  .82  .86    .73  .46  .73    .96  .81  .96
J48      .95  .84  .82    .69  .56  .58    .91  .83  .89

The AUC values of the code metrics model are approximately 0.7 for each classifier, which Lessmann et al. define as "promising" [26]. However, the source code metrics suffer from considerably low precision values.
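A sketch of how such a classification experiment could be run for one project and one classifier (Random Forest on change metrics); the data here is synthetic and the setup is simplified compared to the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Placeholder data: 14 change metrics per method and a bug-prone label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 14))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0.5).astype(int)

scores = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=10,
    scoring=["roc_auc", "precision", "recall"],
)
print({k: round(np.median(v), 2) for k, v in scores.items() if k.startswith("test_")})
```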
Predicting bug-prone methods with diff. cut-points
Bug-prone vs. not bug-prone
p = 75%, 90%, 95% percentiles of #bugs in methods per project
-> predict the top 25%, 10%, and 5% bug-prone methods
16
(Paper excerpt) ...how the classification performance varies (RQ3) as the number of samples in the target class shrinks, and whether we observe similar findings as in Section 3.2 regarding the results of the change and code metrics (RQ2). For that we applied three additional cut-point values as follows:

bugClass = { not bug-prone : #bugs <= p
             bug-prone     : #bugs > p }

where p represents either the value of the 75%, 90%, or 95% percentile of the distribution of the number of bugs in methods per project. For example, using the 95% percentile as cut-point for prior binning would mean to predict the "top five percent" methods in terms of the number of bugs. To conduct this study we applied the same experimental setup as in Section 3.1, except for the differently chosen cut-points.
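The percentile-based cut-point itself is straightforward to compute, for example (illustrative bug counts):

```python
import numpy as np

bugs = np.array([0, 0, 1, 0, 2, 5, 0, 1, 9, 3])  # bug count per method (illustrative)

p = np.percentile(bugs, 95)        # 95% percentile of the bug distribution
bug_class = (bugs > p).astype(int) # bug-prone = roughly the top 5% of methods by bug count
print(p, bug_class)
```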
Decreasing the number of bug-prone methods
Models trained with Random Forest (RndFor)
Change metrics (CM) perform best
Precision decreases (as expected)
17
Table 5: Median classification results for RndFor over all projects per cut-point and per model

          CM               SCM              CM&SCM
       AUC   P    R     AUC   P    R     AUC   P    R
GT0    .95  .84  .88    .72  .50  .64    .95  .85  .95
75%    .97  .72  .95    .75  .39  .63    .97  .74  .95
90%    .97  .58  .94    .77  .20  .69    .98  .64  .94
95%    .97  .62  .92    .79  .13  .72    .98  .68  .92

The source code metrics model reaches its lowest precision in the case of the 95% percentile (median precision of .13). Looking at the change metrics and the combined model, the median precision is significantly higher.
Application: file-level vs. method-level prediction
JDT Core 3.0 - LocalDeclaration.class
Contains 6 methods / 1 affected by post-release bugs
LocalDeclaration.resolve(...) was predicted bug-prone with p=0.97
File-level: p=0.17 chance to guess the bug-prone method (1/6)
Need to manually rule out 5 methods to reach >0.82 precision (1 / (6-5) = 1.0)
JDT Core 3.0 - Main.class
Contains 26 methods / 11 affected by post-release bugs
Main.configure(...) was predicted bug-prone with p=1.0
File-level: p=0.42 chance to guess a bug-prone method (11/26)
Need to rule out 13 methods to reach >0.82 precision (11 / (26-13) ≈ 0.85); see the worked computation below
18
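A worked version of the Main.class arithmetic above, under the reading that the 13 ruled-out methods are all bug-free:

```latex
P(\text{buggy method, file-level guess}) = \frac{11}{26} \approx 0.42
\qquad
\text{precision after ruling out 13 methods} = \frac{11}{26 - 13} = \frac{11}{13} \approx 0.85 > 0.82
```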
What can we learn from that?
Large files are more likely to change and have bugs
Test large files more thoroughly - YES
Bugs are fixed through changes that again lead to bugs
Stop changing our systems - NO, of course not!
Test changing entities more thoroughly - YES
Are we not already doing that?
Do we really need (complex) prediction models for that?
Not sure - this might be the reason why these models are not really used yet
Microsoft started to add prediction models to their quality assurance tools - current status?
But at least use a metrics tool and keep track of your code quality
-> Continuous integration environments, SONAR
19
Can developer-module networks
predict failures?
with Nachi Nagappan, Brendan Murphy
Microsoft Research
Team structure and post-release failures
Results of an initial study with MS Vista
#Authors and #Commits of binaries are correlated with the #post-release failures
We wanted to find out
Are binaries with fragmented contributions from many developers more likely to
have post-release failures?
Should developers focus on one thing?


21
Study with MS Vista project
Data
Released in January, 2007
> 4 years of development
Several thousand developers
Several thousand binaries (*.exe, *.dll)
Several million commits
22
Approach in a nutshell
23
Change logs + bugs -> contribution network -> regression analysis -> validation with data splitting
[Figure: example contribution network - developers Alice, Bob, Dan, Eric, Fu, Go, and Hin contribute to binaries a, b, and c; edge weights give the number of commits]

Binary  #bugs  #centrality
a       12     0.9
b        7     0.5
c        3     0.2
Contribution network
24
[Figure: contribution network - developers (Alice, Bob, Dan, Eric, Fu, Go, Hin) are connected to the Windows binaries (*.dll) they contribute to (a, b, c)]
Which binary is failure-prone?
Measuring fragmentation
25
[Figure: the example contribution network annotated with different fragmentation measures - Freeman degree, Closeness, and Bonacich's power]
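A sketch of how such measures can be computed with networkx; the edge weights are illustrative, and Katz centrality is used only as a stand-in because Bonacich's power centrality is not built into networkx:

```python
import networkx as nx

# Bipartite contribution network: developers connected to the binaries they commit to.
G = nx.Graph()
contributions = {  # (developer, binary): number of commits (illustrative values)
    ("Alice", "a"): 5, ("Dan", "a"): 4, ("Eric", "a"): 6, ("Bob", "a"): 4,
    ("Bob", "b"): 2, ("Fu", "b"): 6, ("Go", "b"): 2,
    ("Go", "c"): 5, ("Hin", "c"): 7,
}
for (dev, binary), n_commits in contributions.items():
    G.add_edge(dev, binary, weight=n_commits)

degree = nx.degree_centrality(G)        # Freeman degree centrality
closeness = nx.closeness_centrality(G)  # closeness centrality
katz = nx.katz_centrality_numpy(G)      # stand-in for Bonacich's power centrality

for b in ["a", "b", "c"]:
    print(b, round(degree[b], 2), round(closeness[b], 2), round(katz[b], 2))
```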
Research questions
Are binaries with fragmented contributions more failure-prone?
Does more fragmentation also mean a higher number of post-release
failures?
Which measures of fragmentation are useful for failure estimation?
26
Correlation analysis
27
            nrCommits  nrAuthors  Power  dPower  Closeness  Reach  Betweenness
Failures        0.7      0.699    0.692   0.74     0.747    0.746     0.503
nrCommits                0.704    0.996   0.773    0.748    0.732     0.466
nrAuthors                         0.683   0.981    0.914    0.944     0.83
Power                                     0.756    0.732    0.714     0.439
dPower                                              0.943   0.964     0.772
Closeness                                                   0.99      0.738
Reach                                                                 0.773
Spearman rank correlation
All correlations are significant at the 0.01 level (2-tailed)
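Such a matrix is a one-liner once the per-binary measures are in a table (the file name and column names below are hypothetical):

```python
import pandas as pd

df = pd.read_csv("binary_measures.csv")  # hypothetical per-binary measurements
cols = ["Failures", "nrCommits", "nrAuthors", "Power", "dPower",
        "Closeness", "Reach", "Betweenness"]
print(df[cols].corr(method="spearman").round(3))
```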
How to predict failure-prone binaries?
Binary logistic regression of 50 random splits
4 principal components from 7 centrality measures
28
[Box plots over the 50 random splits: Precision, Recall, and AUC; values range roughly 0.50-1.00]
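A sketch of this evaluation scheme (PCA down to 4 components, binary logistic regression, 50 random splits); the split ratio and the data are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: 7 centrality measures per binary and a failure-prone label.
rng = np.random.default_rng(1)
X = rng.normal(size=(800, 7))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=800) > 0).astype(int)

precisions, recalls, aucs = [], [], []
for split in range(50):  # 50 random splits; 2/3 train vs. 1/3 test is an assumption
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=split)
    model = make_pipeline(StandardScaler(), PCA(n_components=4), LogisticRegression())
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    precisions.append(precision_score(y_te, pred))
    recalls.append(recall_score(y_te, pred))
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

print(round(np.median(precisions), 2), round(np.median(recalls), 2), round(np.median(aucs), 2))
```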
How to predict the number of failures?
Linear regression of 50 random splits
#Failures = b0 + b1*nCloseness + b2*nrAuthors + b3*nrCommits
All correlations are significant at the 0.01 level (2-tailed)
[Box plots over the 50 random splits: R-Square, Pearson, and Spearman correlation; values range roughly 0.50-1.00]
29
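A corresponding sketch for the count model above (feature names follow the formula; data and evaluation are simplified placeholders):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import LinearRegression

# Placeholder features per binary: [nCloseness, nrAuthors, nrCommits] and observed failure counts.
rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 3))
failures = X @ np.array([8.0, 5.0, 3.0]) + rng.normal(scale=1.0, size=300)

reg = LinearRegression().fit(X, failures)  # #Failures = b0 + b1*nCloseness + b2*nrAuthors + b3*nrCommits
pred = reg.predict(X)
print(round(reg.score(X, failures), 2),        # R-Square
      round(pearsonr(failures, pred)[0], 2),   # Pearson correlation
      round(spearmanr(failures, pred)[0], 2))  # Spearman rank correlation
```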
Which fragmentation measures to use?
30
[Box plots over the 50 random splits: R-Square and Spearman correlation for two models - one with nrAuthors and nrCommits only, and one with nCloseness, nrAuthors, and nrCommits; values range roughly 0.30-1.00]
Summary of results
Centrality measures can predict more than 83% of failure-prone Vista
binaries
Closeness, nrAuthors, and nrCommits can predict the number of post-
release failures
Closeness or Reach can improve prediction of the number of post-
release failures by 32%
More information
Can Developer-Module Networks Predict Failures?, FSE 2008
31
What can we learn from that?
32
[Figure: the example contribution network of developers Alice, Bob, Dan, Eric, Fu, Go, and Hin contributing to binaries a, b, and c]
What can we learn from that?
Re-organize/restrict developer contributions
Simply: fire Bob!
Find out the reasons why Bob is contributing to both binaries
At MS, few key developers helped in many places to get Vista running
33
What can we learn from that?
Re-factor central binaries
Check the contract between “a” and “b” - decouple them
E.g., analyze if “a” contains functionality that should be moved to “b” or a new
binary
34
What can we learn from that?
Increase testing of binaries “a” and “b”
Yes, since these binaries are failure prone
35
What did Microsoft do with the results?
Results were kept within MS Research (at least in 2007 and 2008)
Be careful - our findings were purely based on the data
Such findings often do not show the full picture, since not everything is recorded
My results triggered a lot more research on developer contributions
at MS Research
36
Why researchers want/need/must
collaborate with industry?
My experiences with open source and industrial
software projects
Open source projects: pros
+ Provide tons of data
E.g., Eclipse project, Apache projects, etc.
+ Easy to access
Almost “all” data is publicly available, e.g., GitHub
+ No organizational obstacles to get access
+ You get ALL the data, not just parts of it
38
Open source projects: cons
- Research is (sometimes) difficult to motivate
Sometimes researchers analyze open source projects to try out some new
technique/algorithms but without knowing what actual problem they want to
solve
Why are you doing this?
What is the value for research and industry?
- Difficult to get in touch with the developers to validate the results
10 years back, I showed our results to Mozilla - they only said “interesting” but
that was it!
Some open source communities are more responsive: e.g., Eclipse community
Still, you typically do not get the chance to meet them face-2-face
39
Industrial projects: pros
+ Usually provide real problems
But sometimes these problems are not “research” problems - and researchers
want to do research
+ Provide contact to developers to obtain feedback on the findings
Industry as a laboratory
Helps to evaluate what is useful and what is not
+ Potential to see our research results used at least by developers
Processes, tools, algorithms, models, best practices, etc.
40
Industrial projects: cons
- Often expectations between researchers and developers differ a lot
Developers want to get things done - researchers want to publish
Solutions are too complex and/or only applicable to a very specific case study,
therefore not useful
- Note, researchers often provide know-how, not ready-made tools
Is that a good idea? - Not always, but we are typically cheaper than most
consultants
- Industry provides only partial case studies
Often not really useful to perform research on -> we need data and a lot of it
41
How I got in touch with Microsoft (Research)
Met them at the Microsoft developers conference
Got invited to Redmond to show them what we could do FOR them
Agreed on sending a researcher (me) to Redmond for three months
Fully paid by Microsoft
Once within Microsoft, I got access to the data and to some developers
My main contact was with MS Research


Microsoft invests in internships and visiting researchers
They use it to find and hire talented people
42
What is next on my research agenda?
Study defect prediction in industrial software projects
I am still looking for a good industrial partner (and a student)
Ease understanding changes and their effects
What is the effect on the design?
What is the effect on the quality?
Recommender techniques
Identify the sources of problems
Recommend and perform refactorings to solve the problem
Provide advice on the effects of changes to prevent problems
For this I want and need to collaborate with industry!
43
Conclusions
44
Questions?
Martin Pinzger
martin.pinzger@aau.at
Academia wants/needs/must
collaborate with industry
Industry should invest in such
a collaboration
45
