This document summarizes Martin Pinzger's research on predicting buggy methods using software repository mining. The key points are:
1. Pinzger and colleagues conducted experiments on 21 Java projects to predict buggy methods using source code and change metrics. Change metrics like authors and method histories performed best with up to 96% accuracy.
2. Predicting buggy methods at a finer granularity than files can save manual inspection and testing effort. Accuracy decreases as fewer methods are predicted but change metrics maintain higher precision.
3. Case studies on two classes show that method-level prediction achieves over 82% precision compared to only 17-42% at the file level. This demonstrates the benefit of finer-grained prediction.
3. Hmm, wait a minute
Can’t we learn “something” from that data?
4. Goal of software repository mining
Software Analytics
To obtain insightful and actionable information for completing various tasks
around developing and maintaining software systems
Examples
Quality analysis and defect prediction
Detecting “hot-spots”
Preventing defects
Recommender (advisory) systems
Code completion
Suggesting good code examples
Helping in using an API
...
5. Examples from my mining research
My mining research
The relationship between developer contributions and failure-prone Microsoft Vista
binaries (FSE 2008)
Predicting failure-prone methods (ESEM 2012)
Predicting Build Co-Changes with Source Code Change and Commit Categories
(to appear at SANER 2016)
For more see: http://serg.aau.at/bin/view/MartinPinzger/Publications
Surveys on software repository mining
A survey and taxonomy of approaches for mining software repositories in the
context of software evolution, Kagdi et al. 2007
Evaluating defect prediction approaches: a benchmark and an extensive
comparison, D’Ambros et al. 2012
Conference: MSR 2016 http://msrconf.org/
7. Many existing studies to predict bug-prone files
A comparative analysis of the efficiency of change metrics and static
code attributes for defect prediction, Moser et al. 2008
Use of relative code churn measures to predict system defect
density, Nagappan et al. 2005
Cross-project defect prediction: a large scale experiment on data vs.
domain vs. process, Zimmermann et al. 2009
Predicting faults using the complexity of code changes, Hassan et al.
2009
8. Prediction granularity
[Figure: classes 1 to n, each with 11 methods on average; 4 methods per class are bug-prone (ca. 36%)]
Retrieving bug-prone methods saves manual inspection effort and testing effort
Large files are typically the most bug-prone files
9. Research questions
How accurately can we predict buggy methods?
Which characteristics (i.e., metrics) indicate bug-prone methods?
How does the accuracy vary as the number of buggy methods decreases?
10. Research questions
RQ1 What is the accuracy of bug prediction at the method level?
RQ2 Which characteristics (i.e., metrics) indicate bug-prone methods?
RQ3 How does the accuracy vary as the number of buggy methods decreases?
14. Predicting bug-prone methods
Bug-prone vs. not bug-prone
3.1 Experimental Setup
Prior to model building and classification we labeled each method in our dataset either as bug-prone or not bug-prone as follows:

bugClass = not bug-prone : #bugs = 0
           bug-prone     : #bugs >= 1

These two classes represent the binary target classes for training and validating the prediction models. Using 0 (respectively 1) as cut-point is a common approach applied in many studies covering bug prediction models, e.g., [30, 7, 4, 27, 37]. Other cut-points are applied in literature, for instance, a statistical lower confidence bound [33] or the median [16]. Those varying cut-points as well as the div…
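The labeling rule above can be sketched in a few lines. `bug_counts` is a hypothetical example mapping from method name to number of bugs, not data from the study.

```python
# A minimal sketch of the binary labeling rule: a method is bug-prone if it
# has at least one bug, otherwise not bug-prone.

def label_methods(bug_counts):
    """Map each method name to its binary target class."""
    return {
        method: "bug-prone" if n_bugs >= 1 else "not bug-prone"
        for method, n_bugs in bug_counts.items()
    }

# Hypothetical example data, not from the paper's dataset.
bug_counts = {"resolve": 3, "toString": 0, "configure": 1}
labels = label_methods(bug_counts)
```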
15. Models computed with change metrics (CM) perform best
authors and methodHistories are the most important measures
Accuracy of prediction models

Table 4: Median classification results over all projects per classifier and per model

         |      CM       |      SCM      |    CM&SCM
         | AUC   P    R  | AUC   P    R  | AUC   P    R
RndFor   | .95  .84  .88 | .72  .50  .64 | .95  .85  .95
SVM      | .96  .83  .86 | .70  .48  .63 | .95  .80  .96
BN       | .96  .82  .86 | .73  .46  .73 | .96  .81  .96
J48      | .95  .84  .82 | .69  .56  .58 | .91  .83  .89

The AUC values of the code metrics model are approximately 0.7 for each classifier, which Lessmann et al. define as "promising" [26]. However, the source code metrics suffer from considerably low precision values. The highest median precision …
16. Predicting bug-prone methods with diff. cut-points
Bug-prone vs. not bug-prone
p = 75%, 90%, 95% percentiles of #bugs in methods per project
-> predict the top 25%, 10%, and 5% bug-prone methods
… how the classification performance varies (RQ3) as the number of samples in the target class shrinks, and whether we observe similar findings as in Section 3.2 regarding the results of the change and code metrics (RQ2). For that we applied three additional cut-point values as follows:

bugClass = not bug-prone : #bugs <= p
           bug-prone     : #bugs > p

where p represents either the value of the 75%, 90%, or 95% percentile of the distribution of the number of bugs in methods per project. For example, using the 95% percentile as cut-point for prior binning would mean to predict the "top five percent" methods in terms of the number of bugs. To conduct this study we applied the same experimental setup as in Section 3.1, except for the differently chosen …
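The percentile-based cut-point can be sketched as follows. The nearest-rank percentile used here is a simplification (the paper does not spell out the exact percentile method), and `counts` is hypothetical example data.

```python
# Percentile-based cut-points: label only the "top" methods as bug-prone.

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers (p in 0..100)."""
    ordered = sorted(values)
    k = max(0, int(round(p / 100.0 * len(ordered))) - 1)
    return ordered[k]

def label_with_cutpoint(bug_counts, p):
    """bug-prone iff #bugs > the p-th percentile of #bugs per method."""
    cut = percentile(list(bug_counts.values()), p)
    return {m: ("bug-prone" if n > cut else "not bug-prone")
            for m, n in bug_counts.items()}

# Hypothetical example data, not from the study.
counts = {"m1": 0, "m2": 0, "m3": 1, "m4": 2, "m5": 9}
labels_75 = label_with_cutpoint(counts, 75)  # only the top methods remain
```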
17. Decreasing the number of bug-prone methods
Models trained with Random Forest (RndFor)
Change metrics (CM) perform best
Precision decreases (as expected)

Table 5: Median classification results for RndFor over all projects per cut-point and per model

         |      CM       |      SCM      |    CM&SCM
         | AUC   P    R  | AUC   P    R  | AUC   P    R
GT0      | .95  .84  .88 | .72  .50  .64 | .95  .85  .95
75%      | .97  .72  .95 | .75  .39  .63 | .97  .74  .95
90%      | .97  .58  .94 | .77  .20  .69 | .98  .64  .94
95%      | .97  .62  .92 | .79  .13  .72 | .98  .68  .92

… precision in the case of the 95% percentile (median precision of .13). Looking at the change metrics and the combined model, the median precision is significantly higher for the …
18. Application: file level vs. method level prediction
JDT Core 3.0 - LocalDeclaration.class
Contains 6 methods / 1 affected by post-release bugs
LocalDeclaration.resolve(...) was predicted bug-prone with p=0.97
File-level: p=0.17 to guess the bug-prone method
Need to manually rule out 5 methods to reach >0.82 precision: 1 / (6-5) = 1.0
JDT Core 3.0 - Main.class
Contains 26 methods / 11 affected by post-release bugs
Main.configure(...) was predicted bug-prone with p=1.0
File-level: p=0.42 to guess a bug-prone method
Need to rule out 13 methods to reach >0.82 precision: 11 / (26-13) ≈ 0.85
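The file-level figures above are simple counting; a short sketch of the arithmetic, using the numbers from the two JDT Core cases on this slide:

```python
# Precision of guessing a bug-prone method at the file level: the chance of
# picking a buggy method among those not yet ruled out by manual inspection.

def file_level_precision(buggy, total, ruled_out=0):
    """Buggy methods divided by methods remaining after ruling some out."""
    remaining = total - ruled_out
    return buggy / remaining

# LocalDeclaration.class: 6 methods, 1 buggy
p1 = file_level_precision(1, 6)           # ~0.17
p1_after = file_level_precision(1, 6, 5)  # 1.0 after ruling out 5 methods

# Main.class: 26 methods, 11 buggy
p2 = file_level_precision(11, 26)            # ~0.42
p2_after = file_level_precision(11, 26, 13)  # ~0.85, above the 0.82 threshold
```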
19. What can we learn from that?
Large files are more likely to change and have bugs
Test large files more thoroughly - YES
Bugs are fixed through changes that again lead to bugs
Stop changing our systems - NO, of course not!
Test changing entities more thoroughly - YES
Are we not already doing that?
Do we really need (complex) prediction models for that?
Not sure - might be the reason why these models are not really used, yet
Microsoft started to add prediction models to their quality assurance tools - current status?
But at least use metric tools and keep track of your code quality
-> Continuous integration environments, SONAR
21. Team structure and post-release failures
Results of an initial study with MS Vista
#Authors and #Commits of binaries is correlated with the #post-release failures
We wanted to find out
Are binaries with fragmented contributions from many developers more likely to
have post-release failures?
Should developers focus on one thing?
22. Study with MS Vista project
Data
Released in January, 2007
> 4 years of development
Several thousand developers
Several thousand binaries (*.exe, *.dll)
Several millions of commits
23. Approach in a nutshell
[Figure: contribution network linking developers (Alice, Bob, Dan, Eric, Fu, Go, Hin) to binaries a, b, and c via weighted commit edges]
Change Logs + Bugs -> Regression Analysis -> Validation with data splitting

Binary  #bugs  centrality
a       12     0.9
b       7      0.5
c       3      0.2
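The centrality column above comes from network measures on the contribution graph. A minimal closeness-centrality sketch on a toy network (hypothetical data, not the Vista graph; the study's other measures such as power, reach, and betweenness are not reproduced here):

```python
# Closeness centrality: (n-1) divided by the sum of shortest-path distances
# from a node to all reachable nodes. Nodes are developers and binaries;
# an edge means "developer committed to binary".
from collections import deque

def closeness(graph, node):
    """Closeness centrality of `node` via breadth-first search."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for nb in graph[cur]:
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Hypothetical toy contribution network.
graph = {
    "Alice": ["a"], "Bob": ["a", "b"], "Dan": ["b"],
    "a": ["Alice", "Bob"], "b": ["Bob", "Dan"],
}
c_a = closeness(graph, "a")  # binary "a" sits fairly central
```

Bob, who bridges both binaries, scores higher than a peripheral developer such as Dan, which is exactly the kind of fragmentation signal the study exploits.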
26. Research questions
Are binaries with fragmented contributions more failure-prone?
Does more fragmentation also mean a higher number of post-release
failures?
Which measures of fragmentation are useful for failure estimation?
27. Correlation analysis

            nrCommits  nrAuthors  Power  dPower  Closeness  Reach  Betweenness
Failures    0.700      0.699      0.692  0.740   0.747      0.746  0.503
nrCommits              0.704      0.996  0.773   0.748      0.732  0.466
nrAuthors                         0.683  0.981   0.914      0.944  0.830
Power                                    0.756   0.732      0.714  0.439
dPower                                           0.943      0.964  0.772
Closeness                                                   0.990  0.738
Reach                                                              0.773

Spearman rank correlation
All correlations are significant at the 0.01 level (2-tailed)
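Spearman rank correlation, as used in the table above, is the Pearson correlation computed on ranks. A minimal sketch (average ranks for ties are omitted for brevity; the example data is hypothetical):

```python
# Spearman rank correlation = Pearson correlation of the rank vectors.

def ranks(xs):
    """1-based ranks of the values in xs (no tie handling)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

# Perfectly monotone (hypothetical) data gives rho = 1.0.
rho = spearman([1, 4, 9, 16], [2, 3, 5, 7])
```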
28. How to predict failure-prone binaries?
Binary logistic regression of 50 random splits
4 principal components from 7 centrality measures
[Figure: boxplots of Precision, Recall, and AUC over the 50 random splits, each mostly in the 0.5-1.0 range]
29. How to predict the number of failures?
Linear regression of 50 random splits
#Failures = b0 + b1*nCloseness + b2*nrAuthors + b3*nrCommits
All correlations are significant at the 0.01 level (2-tailed)
[Figure: boxplots of R-Square, Pearson, and Spearman over the 50 random splits, each mostly in the 0.5-1.0 range]
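The "50 random splits" protocol behind these plots can be sketched as follows. For brevity this uses a single predictor instead of the paper's three-predictor model, and synthetic data; it only illustrates the evaluation scheme, not the actual study.

```python
# Repeatedly split the data 2:1 into training and test sets, fit a linear
# model on the training part, and score it on the held-out test part.
import random

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def r_square(xs, ys, b0, b1):
    """Coefficient of determination of the predictions b0 + b1*x."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

random.seed(0)
# Hypothetical data: failures grow with commits, plus noise.
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(30)]

scores = []
for _ in range(50):                     # 50 random splits
    random.shuffle(data)
    train, test = data[:20], data[20:]  # 2/3 training, 1/3 test
    b0, b1 = fit_line(*zip(*train))
    xs, ys = zip(*test)
    scores.append(r_square(xs, ys, b0, b1))

median_r2 = sorted(scores)[len(scores) // 2]
```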
30. Which fragmentation measures to use?
[Figure: boxplots of R-Square and Spearman for a model with nrAuthors and nrCommits vs. a model with nCloseness, nrAuthors, and nrCommits]
31. Summary of results
Centrality measures can predict more than 83% of failure-prone Vista binaries
Closeness, nrAuthors, and nrCommits can predict the number of post-release failures
Closeness or Reach can improve prediction of the number of post-release failures by 32%
More information
Can Developer-Module Networks Predict Failures?, FSE 2008
33. What can we learn from that?
[Figure: contribution network of developers (Alice, Bob, Dan, Eric, Fu, Go, Hin) and binaries a, b, and c; Bob contributes to both a and b]
Re-organize/restrict developer contributions
Simply: fire Bob!
Find out the reasons why Bob is contributing to both binaries
At MS, few key developers helped in many places to get Vista running
34. What can we learn from that?
[Figure: contribution network of developers and binaries a, b, and c]
Re-factor central binaries
Check the contract between "a" and "b" - decouple them
E.g., analyze if "a" contains functionality that should be moved to "b" or a new binary
36. What did Microsoft do with the results?
Results were kept within MS Research (at least in 2007 and 2008)
Be careful - our findings were purely based on the data
Such findings often do not show the full picture, since not everything is recorded
My results triggered a lot more research on developer contributions
at MS Research
38. Open source projects: pros
+ Provide tons of data
E.g., Eclipse project, Apache projects, etc.
+ Easy to access
Almost “all” data is publicly available, e.g., Github
+ No organizational obstacles to get access
+ You get ALL the data, not just parts of it
39. Open source projects: cons
- Research is (sometimes) difficult to motivate
Sometimes researchers analyze open source projects to try out new techniques/algorithms without knowing what actual problem they want to solve
Why are you doing this?
What is the value for research and industry?
- Difficult to get in touch with the developers to validate the results
10 years back, I showed our results to Mozilla - they only said “interesting” but
that was it!
Some open source communities are more responsive: e.g., Eclipse community
Still, you typically do not get the chance to meet them face-2-face
40. Industrial projects: pros
+ Usually provide real problems
But sometimes these problems are not “research” problems - and researchers
want to do research
+ Provide contact to developers to obtain feedback on the findings
Industry as a laboratory
Helps to evaluate what is useful and what is not
+ Potential to see our research results used at least by developers
Processes, tools, algorithms, models, best practices, etc.
41. Industrial projects: cons
- Often expectations between researchers and developers differ a lot
Developers want to get things done - researchers want to publish
Solutions are too complex and/or only applicable to a very specific case study,
therefore not useful
- Note, researchers often provide know-how, not ready-made tools
Is that a good idea? - Not always, but we are typically cheaper than most
consultants
- Industry provides only partial case studies
Often not really useful to perform research on -> we need data and a lot of it
42. How I got in touch with Microsoft (Research)
Met them at the Microsoft developers conference
Got invited to Redmond to show them what we could do FOR them
Agreed on sending a researcher (me) to Redmond for three months
Fully paid by Microsoft
Once within Microsoft, I got access to the data and to some developers
My main contact was with MS Research
Microsoft invests in internships and visiting researchers
They use it to find and hire talented people
43. What is next on my research agenda?
Study defect prediction in industrial software projects
I am still looking for a good industrial partner (and a student)
Ease understanding changes and their effects
What is the effect on the design?
What is the effect on the quality?
Recommender techniques
Identify the sources of problems
Recommend and perform refactorings to solve the problem
Provide advice on the effects of changes to prevent problems
For this I want and need to collaborate with industry!
44. Conclusions
Questions?
Martin Pinzger
martin.pinzger@aau.at
… the history of a software system to assemble the dataset for our experiments: (1) versioning data including lines modified (LM), (2) bug data, i.e., which files contained bugs and how many of them (Bugs), and (3) fine-grained source code changes (SCC).
Figure 1: Stepwise overview of the data extraction process (1. Versioning Data: log entries from CVS, SVN, or GIT imported by Evolizer into the RHDB; 2. Bug Data: bug references such as "#bug123" in commit messages; 3. Source Code Changes (SCC): ChangeDistiller compares the ASTs of subsequent versions; 4. Experiment: classification with a Support Vector Machine)
1. Versioning Data. We use EVOLIZER [14] to access the versioning repositories, e.g., CVS, SVN, or GIT. They provide log entries that contain information about revisions of files that belong to a system. From the log entries we extract the revision number (to identify the revisions of a file in correct temporal order), the revision timestamp, the name of the developer who checked-in the new revision, and the commit message. We then compute LM for a source file as the sum of lines added, lines deleted, and lines changed per file revision.
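The LM computation described above can be sketched as follows. The input mimics `git log --numstat` output lines (added, deleted, path, tab-separated); since git does not report "changed" lines separately, this simplification sums additions and deletions, and the log excerpt is hypothetical.

```python
# LM (lines modified) per file: sum of lines added and deleted across all
# revisions of that file, aggregated from numstat-style log lines.
from collections import defaultdict

def lines_modified(numstat_lines):
    """Aggregate added+deleted line counts per file path."""
    lm = defaultdict(int)
    for line in numstat_lines:
        added, deleted, path = line.split("\t")
        lm[path] += int(added) + int(deleted)
    return dict(lm)

log = [  # hypothetical log excerpt, format: added<TAB>deleted<TAB>path
    "10\t2\tsrc/Main.java",
    "0\t5\tsrc/Main.java",
    "3\t3\tsrc/Util.java",
]
lm = lines_modified(log)
```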
2. Bug Data. Bug reports are stored in bug repositories such as Bugzilla. Traditional bug tracking and versioning repositories …
[Table 1 (excerpt): per-plugin counts; column headers and dates cropped in extraction]
Update Core    595     8'496    251'434    36'151     532     Oct0…
Debug UI       1'954   18'862   444'061    81'836     3'120   May…
JDT Debug UI   775     8'663    168'598    45'645     2'002   Nov…
Help           598     3'658    66'743     12'170     243     May…
JDT Core       1'705   63'038   2'814K     451'483    6'033   Jun0…
OSGI           748     9'866    335'253    56'238     1'411   Nov…
… single source code statements, e.g., method invocation statements, between two versions of a program by comparing their respective abstract syntax trees (AST). Each change then represents a tree edit operation that is required to transform one version of the AST into the other. The algorithm is implemented in CHANGEDISTILLER [14] that pairwise compares the ASTs between all direct subsequent revisions of each file. Based on this information, we then count the number of different source code changes (SCC) per file revision.
The preprocessed data from steps 1-3 is stored into the Release History Database (RHDB) [10]. From that data, we compute LM, SCC, and Bugs for each source file by aggregating the values over the given observation period.
3. EMPIRICAL STUDY
In this section, we present the empirical study that we performed to investigate the hypotheses stated in Section …. We discuss the dataset, the statistical methods and machine learning algorithms we used, and report on the results and findings of the experiments.
3.1 Dataset and Data Preparation
We performed our experiments on 15 plugins of the Eclipse platform. Eclipse is a popular open source system that has been studied extensively before [4, 27, 38, 39]. Table 1 gives an overview of the Eclipse dataset used in this study with the number of unique *.java files (Fi…
[Figure: contribution network of developers (Alice, Bob, Dan, Eric, Fu, Go, Hin) and binaries a, b, and c]
Academia wants/needs/must collaborate with industry
Industry should invest in such a collaboration