This document summarizes Martin Pinzger's research on predicting buggy methods using software repository mining. The key points are:
1. Pinzger and colleagues conducted experiments on 21 Java projects to predict buggy methods using source code and change metrics. Change metrics like authors and method histories performed best with up to 96% accuracy.
2. Predicting buggy methods at a finer granularity than files can save manual inspection and testing effort. Accuracy decreases as fewer methods are predicted but change metrics maintain higher precision.
3. Case studies on two classes show that method-level prediction achieves over 82% precision compared to only 17-42% at the file level. This demonstrates the benefit of finer-grained prediction.
3. Hmm, wait a minute
Can’t we learn “something” from that data?
4. Goal of software repository mining
Software Analytics
To obtain insightful and actionable information for completing various tasks
around developing and maintaining software systems
Examples
Quality analysis and defect prediction
Detecting “hot-spots”
Preventing defects
Recommender (advisory) systems
Code completion
Suggesting good code examples
Helping in using an API
...
5. Examples from my mining research
My mining research
The relationship between developer contributions and failure-prone Microsoft Vista
binaries (FSE 2008)
Predicting failure-prone methods (ESEM 2012)
Predicting Build Co-Changes with Source Code Change and Commit Categories
(to appear at SANER 2016)
For more see: http://serg.aau.at/bin/view/MartinPinzger/Publications
Surveys on software repository mining
A survey and taxonomy of approaches for mining software repositories in the
context of software evolution, Kagdi et al. 2007
Evaluating defect prediction approaches: a benchmark and an extensive
comparison, D’Ambros et al. 2012
Conference: MSR 2016 http://msrconf.org/
7. Many existing studies to predict bug-prone files
A comparative analysis of the efficiency of change metrics and static
code attributes for defect prediction, Moser et al. 2008
Use of relative code churn measures to predict system defect
density, Nagappan et al. 2005
Cross-project defect prediction: a large scale experiment on data vs.
domain vs. process, Zimmermann et al. 2009
Predicting faults using the complexity of code changes, Hassan et al.
2009
8. Prediction granularity
[Figure: classes 1 to n, each with 11 methods on average; 4 methods per class are bug-prone (ca. 36%)]
Retrieving bug-prone methods saves manual inspection effort and testing effort
Large files are typically the most bug-prone files
9. Research questions
How accurately can we predict buggy methods?
Which characteristics (i.e., metrics) indicate bug-prone methods?
How does the accuracy vary as the number of buggy methods decreases?
10. Research questions
RQ1 What is the accuracy of bug prediction at the method level?
RQ2 Which characteristics (i.e., metrics) indicate bug-prone methods?
RQ3 How does the accuracy vary as the number of buggy methods decreases?
14. Predicting bug-prone methods
Bug-prone vs. not bug-prone
3.1 Experimental Setup
Prior to model building and classification we labeled each method in our dataset either as bug-prone or not bug-prone as follows:

bugClass = not bug-prone : #bugs = 0
           bug-prone     : #bugs >= 1

These two classes represent the binary target classes for training and validating the prediction models. Using 0 (respectively 1) as cut-point is a common approach applied in many studies covering bug prediction models, e.g., [30, 7, 4, 27, 37]. Other cut-points are applied in literature, for instance, a statistical lower confidence bound [33] or the median [16]. Those varying cut-points as well as the div…
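The labeling rule above can be sketched in a few lines. `bug_counts` is a hypothetical example mapping from method name to number of bugs, not data from the study.

```python
# A minimal sketch of the binary labeling rule: a method is bug-prone if it
# has at least one bug, otherwise not bug-prone.

def label_methods(bug_counts):
    """Map each method name to its binary target class."""
    return {
        method: "bug-prone" if n_bugs >= 1 else "not bug-prone"
        for method, n_bugs in bug_counts.items()
    }

# Hypothetical example data, not from the paper's dataset.
bug_counts = {"resolve": 3, "toString": 0, "configure": 1}
labels = label_methods(bug_counts)
```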
15. Models computed with change metrics (CM) perform best
authors and methodHistories are the most important measures
Accuracy of prediction models

Table 4: Median classification results over all projects per classifier and per model

         |      CM       |      SCM      |    CM&SCM
         | AUC   P    R  | AUC   P    R  | AUC   P    R
RndFor   | .95  .84  .88 | .72  .50  .64 | .95  .85  .95
SVM      | .96  .83  .86 | .70  .48  .63 | .95  .80  .96
BN       | .96  .82  .86 | .73  .46  .73 | .96  .81  .96
J48      | .95  .84  .82 | .69  .56  .58 | .91  .83  .89

The AUC values of the code metrics model are approximately 0.7 for each classifier, which Lessmann et al. define as "promising" [26]. However, the source code metrics suffer from considerably low precision values. The highest median precision …
16. Predicting bug-prone methods with diff. cut-points
Bug-prone vs. not bug-prone
p = 75%, 90%, 95% percentiles of #bugs in methods per project
-> predict the top 25%, 10%, and 5% bug-prone methods
… how the classification performance varies (RQ3) as the number of samples in the target class shrinks, and whether we observe similar findings as in Section 3.2 regarding the results of the change and code metrics (RQ2). For that we applied three additional cut-point values as follows:

bugClass = not bug-prone : #bugs <= p
           bug-prone     : #bugs > p

where p represents either the value of the 75%, 90%, or 95% percentile of the distribution of the number of bugs in methods per project. For example, using the 95% percentile as cut-point for prior binning would mean to predict the "top five percent" methods in terms of the number of bugs. To conduct this study we applied the same experimental setup as in Section 3.1, except for the differently chosen …
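The percentile-based cut-point can be sketched as follows. The nearest-rank percentile used here is a simplification (the paper does not spell out the exact percentile method), and `counts` is hypothetical example data.

```python
# Percentile-based cut-points: label only the "top" methods as bug-prone.

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers (p in 0..100)."""
    ordered = sorted(values)
    k = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[k]

def label_with_cutpoint(bug_counts, p):
    """bug-prone iff #bugs > the p-th percentile of #bugs per method."""
    cut = percentile(list(bug_counts.values()), p)
    return {m: ("bug-prone" if n > cut else "not bug-prone")
            for m, n in bug_counts.items()}

# Hypothetical example data, not from the study.
counts = {"m1": 0, "m2": 0, "m3": 1, "m4": 2, "m5": 9}
labels_75 = label_with_cutpoint(counts, 75)  # only the top methods remain
```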
17. Decreasing the number of bug-prone methods
Models trained with Random Forest (RndFor)
Change metrics (CM) perform best
Precision decreases (as expected)

Table 5: Median classification results for RndFor over all projects per cut-point and per model

         |      CM       |      SCM      |    CM&SCM
         | AUC   P    R  | AUC   P    R  | AUC   P    R
GT0      | .95  .84  .88 | .72  .50  .64 | .95  .85  .95
75%      | .97  .72  .95 | .75  .39  .63 | .97  .74  .95
90%      | .97  .58  .94 | .77  .20  .69 | .98  .64  .94
95%      | .97  .62  .92 | .79  .13  .72 | .98  .68  .92

… precision in the case of the 95% percentile (median precision of .13). Looking at the change metrics and the combined model, the median precision is significantly higher for the …
18. Application: file level vs. method level prediction
JDT Core 3.0 - LocalDeclaration.class
Contains 6 methods / 1 affected by post-release bugs
LocalDeclaration.resolve(...) was predicted bug-prone with p=0.97
File-level: p=0.17 to guess the bug-prone method
Need to manually rule out 5 methods to reach >0.82 precision: 1 / (6-5) = 1.0
JDT Core 3.0 - Main.class
Contains 26 methods / 11 affected by post-release bugs
Main.configure(...) was predicted bug-prone with p=1.0
File-level: p=0.42 to guess a bug-prone method
Need to rule out 13 methods to reach >0.82 precision: 11 / (26-13) ≈ 0.85
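The file-level figures above are simple counting; a short sketch of the arithmetic, using the numbers from the two JDT Core cases on this slide:

```python
# Precision of guessing a bug-prone method at the file level: the chance of
# picking a buggy method among those not yet ruled out by manual inspection.

def file_level_precision(buggy, total, ruled_out=0):
    """Buggy methods divided by methods remaining after ruling some out."""
    remaining = total - ruled_out
    return buggy / remaining

# LocalDeclaration.class: 6 methods, 1 buggy
p1 = file_level_precision(1, 6)           # ~0.17
p1_after = file_level_precision(1, 6, 5)  # 1.0 after ruling out 5 methods

# Main.class: 26 methods, 11 buggy
p2 = file_level_precision(11, 26)            # ~0.42
p2_after = file_level_precision(11, 26, 13)  # ~0.85, above the 0.82 threshold
```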
19. What can we learn from that?
Large files are more likely to change and have bugs
Test large files more thoroughly - YES
Bugs are fixed through changes that again lead to bugs
Stop changing our systems - NO, of course not!
Test changing entities more thoroughly - YES
Are we not already doing that?
Do we really need (complex) prediction models for that?
Not sure - might be the reason why these models are not really used, yet
Microsoft started to add prediction models to their quality assurance tools - current status?
But at least use metric tools and keep track of your code quality
-> Continuous integration environments, SONAR
21. Team structure and post-release failures
Results of an initial study with MS Vista
#Authors and #Commits of binaries is correlated with the #post-release failures
We wanted to find out
Are binaries with fragmented contributions from many developers more likely to
have post-release failures?
Should developers focus on one thing?
22. Study with MS Vista project
Data
Released in January, 2007
> 4 years of development
Several thousand developers
Several thousand binaries (*.exe, *.dll)
Several millions of commits
23. Approach in a nutshell
[Figure: contribution network linking developers (Alice, Bob, Dan, Eric, Fu, Go, Hin) to binaries a, b, and c via weighted commit edges]
Change Logs + Bugs -> Regression Analysis -> Validation with data splitting

Binary  #bugs  centrality
a       12     0.9
b       7      0.5
c       3      0.2
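The centrality column above comes from network measures on the contribution graph. A minimal closeness-centrality sketch on a toy network (hypothetical data, not the Vista graph; the study's other measures such as power, reach, and betweenness are not reproduced here):

```python
# Closeness centrality: (n-1) divided by the sum of shortest-path distances
# from a node to all reachable nodes. Nodes are developers and binaries;
# an edge means "developer committed to binary".
from collections import deque

def closeness(graph, node):
    """Closeness centrality of `node` via breadth-first search."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for nb in graph[cur]:
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical toy contribution network.
graph = {
    "Alice": ["a"], "Bob": ["a", "b"], "Dan": ["b"],
    "a": ["Alice", "Bob"], "b": ["Bob", "Dan"],
}
c_a = closeness(graph, "a")  # binary "a" sits fairly central
```

Bob, who bridges both binaries, scores higher than a peripheral developer such as Dan, which is exactly the kind of fragmentation signal the study exploits.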
26. Research questions
Are binaries with fragmented contributions more failure-prone?
Does more fragmentation also mean a higher number of post-release
failures?
Which measures of fragmentation are useful for failure estimation?
27. Correlation analysis

            nrCommits  nrAuthors  Power  dPower  Closeness  Reach  Betweenness
Failures    0.700      0.699      0.692  0.740   0.747      0.746  0.503
nrCommits              0.704      0.996  0.773   0.748      0.732  0.466
nrAuthors                         0.683  0.981   0.914      0.944  0.830
Power                                    0.756   0.732      0.714  0.439
dPower                                           0.943      0.964  0.772
Closeness                                                   0.990  0.738
Reach                                                              0.773

Spearman rank correlation
All correlations are significant at the 0.01 level (2-tailed)
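Spearman rank correlation, as used in the table above, is the Pearson correlation computed on ranks. A minimal sketch (average ranks for ties are omitted for brevity; the example data is hypothetical):

```python
# Spearman rank correlation = Pearson correlation of the rank vectors.

def ranks(xs):
    """1-based ranks of the values in xs (no tie handling)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

# Perfectly monotone (hypothetical) data gives rho = 1.0.
rho = spearman([1, 4, 9, 16], [2, 3, 5, 7])
```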
28. How to predict failure-prone binaries?
Binary logistic regression of 50 random splits
4 principal components from 7 centrality measures
[Figure: boxplots of Precision, Recall, and AUC over the 50 random splits, each mostly in the 0.5-1.0 range]
29. How to predict the number of failures?
Linear regression of 50 random splits
#Failures = b0 + b1*nCloseness + b2*nrAuthors + b3*nrCommits
All correlations are significant at the 0.01 level (2-tailed)
[Figure: boxplots of R-Square, Pearson, and Spearman over the 50 random splits, each mostly in the 0.5-1.0 range]
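The "50 random splits" protocol behind these plots can be sketched as follows. For brevity this uses a single predictor instead of the paper's three-predictor model, and synthetic data; it only illustrates the evaluation scheme, not the actual study.

```python
# Repeatedly split the data 2:1 into training and test sets, fit a linear
# model on the training part, and score it on the held-out test part.
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def r_square(xs, ys, b0, b1):
    """Coefficient of determination of the predictions b0 + b1*x."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

random.seed(0)
# Hypothetical data: failures grow with commits, plus noise.
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(30)]

scores = []
for _ in range(50):                     # 50 random splits
    random.shuffle(data)
    train, test = data[:20], data[20:]  # 2/3 training, 1/3 test
    b0, b1 = fit_line(*zip(*train))
    xs, ys = zip(*test)
    scores.append(r_square(xs, ys, b0, b1))

median_r2 = sorted(scores)[len(scores) // 2]
```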
30. Which fragmentation measures to use?
[Figure: boxplots of R-Square and Spearman for a model with nrAuthors and nrCommits vs. a model with nCloseness, nrAuthors, and nrCommits]
31. Summary of results
Centrality measures can predict more than 83% of failure-prone Vista binaries
Closeness, nrAuthors, and nrCommits can predict the number of post-release failures
Closeness or Reach can improve prediction of the number of post-release failures by 32%
More information
Can Developer-Module Networks Predict Failures?, FSE 2008
33. What can we learn from that?
[Figure: contribution network of developers (Alice, Bob, Dan, Eric, Fu, Go, Hin) and binaries a, b, and c; Bob contributes to both a and b]
Re-organize/restrict developer contributions
Simply: fire Bob!
Find out the reasons why Bob is contributing to both binaries
At MS, few key developers helped in many places to get Vista running
34. What can we learn from that?
[Figure: contribution network of developers and binaries a, b, and c]
Re-factor central binaries
Check the contract between "a" and "b" - decouple them
E.g., analyze if "a" contains functionality that should be moved to "b" or a new binary
36. What did Microsoft do with the results?
Results were kept within MS Research (at least in 2007 and 2008)
Be careful - our findings were purely based on the data
Such findings often do not show the full picture, since not everything is recorded
My results triggered a lot more research on developer contributions
at MS Research
38. Open source projects: pros
+ Provide tons of data
E.g., Eclipse project, Apache projects, etc.
+ Easy to access
Almost “all” data is publicly available, e.g., Github
+ No organizational obstacles to get access
+ You get ALL the data, not just parts of it
39. Open source projects: cons
- Research is (sometimes) difficult to motivate
Sometimes researchers analyze open source projects to try out new techniques/algorithms without knowing what actual problem they want to solve
Why are you doing this?
What is the value for research and industry?
- Difficult to get in touch with the developers to validate the results
10 years back, I showed our results to Mozilla - they only said “interesting” but
that was it!
Some open source communities are more responsive: e.g., Eclipse community
Still, you typically do not get the chance to meet them face-2-face
40. Industrial projects: pros
+ Usually provide real problems
But sometimes these problems are not “research” problems - and researchers
want to do research
+ Provide contact to developers to obtain feedback on the findings
Industry as a laboratory
Helps to evaluate what is useful and what is not
+ Potential to see our research results used at least by developers
Processes, tools, algorithms, models, best practices, etc.
41. Industrial projects: cons
- Often expectations between researchers and developers differ a lot
Developers want to get things done - researchers want to publish
Solutions are too complex and/or only applicable to a very specific case study,
therefore not useful
- Note, researchers often provide know-how, not ready-made tools
Is that a good idea? - Not always, but we are typically cheaper than most
consultants
- Industry provides only partial case studies
Often not really useful to perform research on -> we need data and a lot of it
42. How I got in touch with Microsoft (Research)
Met them at the Microsoft developers conference
Got invited to Redmond to show them what we could do FOR them
Agreed on sending a researcher (me) to Redmond for three months
Fully paid by Microsoft
Once within Microsoft, I got access to the data and to some developers
My main contact was with MS Research
Microsoft invests in internships and visiting researchers
They use it to find and hire talented people
43. What is next on my research agenda?
Study defect prediction in industrial software projects
I am still looking for a good industrial partner (and a student)
Ease understanding changes and their effects
What is the effect on the design?
What is the effect on the quality?
Recommender techniques
Identify the sources of problems
Recommend and perform refactorings to solve the problem
Provide advice on the effects of changes to prevent problems
For this I want and need to collaborate with industry!
44. Conclusions
Questions?
Martin Pinzger
martin.pinzger@aau.at
… the history of a software system to assemble the dataset for our experiments: (1) versioning data including lines modified (LM), (2) bug data, i.e., which files contained bugs and how many of them (Bugs), and (3) fine-grained source code changes (SCC).
Figure 1: Stepwise overview of the data extraction process (1. Versioning Data: log entries from CVS, SVN, or GIT imported by Evolizer into the RHDB; 2. Bug Data: bug references such as "#bug123" in commit messages; 3. Source Code Changes (SCC): ChangeDistiller compares the ASTs of subsequent versions; 4. Experiment: classification with a Support Vector Machine)
1. Versioning Data. We use EVOLIZER [14] to access the versioning repositories, e.g., CVS, SVN, or GIT. They provide log entries that contain information about revisions of files that belong to a system. From the log entries we extract the revision number (to identify the revisions of a file in correct temporal order), the revision timestamp, the name of the developer who checked-in the new revision, and the commit message. We then compute LM for a source file as the sum of lines added, lines deleted, and lines changed per file revision.
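The LM computation described above can be sketched as follows. The input mimics `git log --numstat` output lines (added, deleted, path, tab-separated); since git does not report "changed" lines separately, this simplification sums additions and deletions, and the log excerpt is hypothetical.

```python
# LM (lines modified) per file: sum of lines added and deleted across all
# revisions of that file, aggregated from numstat-style log lines.
from collections import defaultdict

def lines_modified(numstat_lines):
    """Aggregate added+deleted line counts per file path."""
    lm = defaultdict(int)
    for line in numstat_lines:
        added, deleted, path = line.split("\t")
        lm[path] += int(added) + int(deleted)
    return dict(lm)

log = [  # hypothetical log excerpt, format: added<TAB>deleted<TAB>path
    "10\t2\tsrc/Main.java",
    "0\t5\tsrc/Main.java",
    "3\t3\tsrc/Util.java",
]
lm = lines_modified(log)
```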
2. Bug Data. Bug reports are stored in bug repositories such as Bugzilla. Traditional bug tracking and versioning repositories …
[Table 1 (excerpt): per-plugin counts; column headers and dates cropped in extraction]
Update Core    595     8'496    251'434    36'151     532     Oct0…
Debug UI       1'954   18'862   444'061    81'836     3'120   May…
JDT Debug UI   775     8'663    168'598    45'645     2'002   Nov…
Help           598     3'658    66'743     12'170     243     May…
JDT Core       1'705   63'038   2'814K     451'483    6'033   Jun0…
OSGI           748     9'866    335'253    56'238     1'411   Nov…
… single source code statements, e.g., method invocation statements, between two versions of a program by comparing their respective abstract syntax trees (AST). Each change then represents a tree edit operation that is required to transform one version of the AST into the other. The algorithm is implemented in CHANGEDISTILLER [14] that pairwise compares the ASTs between all direct subsequent revisions of each file. Based on this information, we then count the number of different source code changes (SCC) per file revision.
The preprocessed data from steps 1-3 is stored into the Release History Database (RHDB) [10]. From that data, we compute LM, SCC, and Bugs for each source file by aggregating the values over the given observation period.
3. EMPIRICAL STUDY
In this section, we present the empirical study that we performed to investigate the hypotheses stated in Section …. We discuss the dataset, the statistical methods and machine learning algorithms we used, and report on the results and findings of the experiments.
3.1 Dataset and Data Preparation
We performed our experiments on 15 plugins of the Eclipse platform. Eclipse is a popular open source system that has been studied extensively before [4, 27, 38, 39]. Table 1 gives an overview of the Eclipse dataset used in this study with the number of unique *.java files (Fi…
[Figure: contribution network of developers (Alice, Bob, Dan, Eric, Fu, Go, Hin) and binaries a, b, and c]
Academia wants/needs/must collaborate with industry
Industry should invest in such a collaboration