Bug Prediction Based on Fine-Grained Module Histories

Hideaki Hata
Osamu Mizuno
Tohru Kikuno

Overview
Background
Historical metrics are useful for bug prediction
Problem
For method-level prediction, it is difficult to collect historical metrics
Solution & Results
Historage: fine-grained version control system
First study of method-level bug prediction with well-known historical metrics

Bug Prediction Papers
[Figure: number of bug prediction papers per year, 2000 to 2010, in TSE, EMSE, ICSE, ESEC/FSE, FSE, ICSM, and MSR; y-axis from 0 to 15]

Historical Metrics
Code
•Code churn
•Past bugs
Process
•Changes
•Process complexity
Organization
•Developers
•Org structure
•Network
•Ownership
Geography
•Locations
•Distribution
Bug Prediction Survey: http://bpsurvey-hidehata.dotcloud.com/

Mining Version Control Repository
[Figure: a sequence of commits (..., n-1, n, n+1, ...); each commit carries a commit message (e.g., "Fix bug #32528"), a date (shown on a July 2007 calendar), and a code delta]

What We Have Learned
Prediction accuracy

Historical metrics ≥ Static code metrics
[Moser et al. ’08, Kamei et al. ’10]

Required effort

File-level ≤ Package-level
[Kamei et al. ’10, Nguyen et al. ’10, Posnett et al. ’11]

State of the Art
[Figure: number of papers (TSE, EMSE, ICSE, ESEC/FSE, FSE, ICSM, MSR) by prediction level: package-level, file-level, and method-level; axis from 0 to 15. Callouts: Cache model [Kim et al. ’07], Spam filtering model [Mizuno et al. ’07]]

No method-level prediction with well-known historical metrics

Method-Level Prediction

Requirement

Method-level historical metrics

Problem

Analysis of method histories is difficult

Difficulties
1. Tracking methods is troublesome

Matching methods should be found between sequential snapshots

2. Method-level metadata are not easily available

Metadata (who, when, how, etc.) are associated with files

[Figure: sequential snapshots n-2, n-1, n]

Historage
Fine-grained version control system [1]

is created on top of a Git repository

stores methods as files

detects rename/move using Git's mechanism

[Figure: commits com1 and com2, each holding a tree of Method nodes]

[1] Hata et al., “Historage: Fine-Grained Version Control System for Java,” IWPSE-EVOL ’11.
Tool: git2historage (https://github.com/hdrky/git2historage)
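Once each method is stored as a file, a rename or move of a method becomes a file rename that Git can detect. The sketch below only illustrates the general idea of similarity-based matching between two snapshots. It uses Python's difflib rather than Git's actual algorithm, and the method paths, the `match_methods` helper, and the 0.5 threshold are illustrative assumptions, not part of Historage.

```python
from difflib import SequenceMatcher


def match_methods(old, new, threshold=0.5):
    """Pair methods that vanished from `old` with methods that appeared in `new`.

    `old` and `new` map a (hypothetical) method path to its source text.
    A pair counts as a rename/move when the text similarity is at least
    `threshold`.
    """
    removed = {p: src for p, src in old.items() if p not in new}
    added = {p: src for p, src in new.items() if p not in old}

    # Score every removed/added pair and keep those above the threshold.
    candidates = []
    for old_path, old_src in removed.items():
        for new_path, new_src in added.items():
            score = SequenceMatcher(None, old_src, new_src).ratio()
            if score >= threshold:
                candidates.append((score, old_path, new_path))
    candidates.sort(reverse=True)  # best matches first

    # Greedily accept the best pairs so each method is matched at most once.
    renames = {}
    used_new = set()
    for score, old_path, new_path in candidates:
        if old_path in renames or new_path in used_new:
            continue
        renames[old_path] = new_path
        used_new.add(new_path)
    return renames


# Toy snapshots: parse() moves from Foo.java to Bar.java, size() is new.
snapshot_n = {
    "Foo.java/parse(String)": "int parse(String s) { return Integer.parseInt(s.trim()); }",
    "Foo.java/log(String)": "void log(String m) { System.out.println(m); }",
}
snapshot_n1 = {
    "Bar.java/parse(String)": "int parse(String s) { return Integer.parseInt(s.trim()); }",
    "Foo.java/log(String)": "void log(String m) { System.out.println(m); }",
    "Foo.java/size()": "int size() { return 0; }",
}
renames = match_methods(snapshot_n, snapshot_n1)
```

With the toy snapshots, the moved parse method is paired with its new path, the unchanged log method is ignored, and the new size method stays unmatched.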

Visualization of repository history
•tree: directory
•white node: method

[Figure: two visualizations side by side: Git (file histories) and Historage (method histories)]

Mining Historage
[Figure: a sequence of commits (..., n-1, n, n+1, ...), each holding Method nodes; each commit carries a commit message (e.g., "Fix bug #32528"), a date (shown on a July 2007 calendar), and a code delta]

Study
Comparison
Prediction level: package, file, and method
The same metrics and the same prediction algorithm (random forest)
Buggy modules: identified with the SZZ algorithm [2]

Evaluation
10-fold cross validation
Effort-based evaluation
[2] Sliwerski et al., “When do changes induce fixes?” MSR ’05.
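
The core idea of SZZ-style labeling can be sketched as follows: find bug-fix commits from their messages, then trace the lines they changed back to the commits that last touched them. This is a simplified illustration on in-memory toy data, not the procedure used in the study; the `FIX_PATTERN` regex, the record layout, and the precomputed `blame` map are assumptions made for the example.

```python
import re

# Hypothetical pattern: a "fix" keyword followed by a bug ID such as "#32528".
FIX_PATTERN = re.compile(r"\bfix(?:es|ed)?\b.*#(\d+)", re.IGNORECASE)


def find_bug_introducing(commits, blame):
    """Return {fix_commit_id: set of commit ids blamed for the fixed lines}.

    `commits` is a list of dicts with "id", "message", and "deleted_lines"
    (a list of (module, line_no) pairs removed or modified by the commit).
    `blame` maps (fix_commit_id, module, line_no) to the id of the commit
    that last touched that line before the fix.
    """
    result = {}
    for commit in commits:
        if not FIX_PATTERN.search(commit["message"]):
            continue  # not recognized as a bug-fix commit
        origins = set()
        for module, line_no in commit["deleted_lines"]:
            origin = blame.get((commit["id"], module, line_no))
            if origin is not None:
                origins.add(origin)
        result[commit["id"]] = origins
    return result


# Toy history: c3 fixes a bug in lines last touched by c1 and c2.
commits = [
    {"id": "c1", "message": "Add parser", "deleted_lines": []},
    {"id": "c2", "message": "Refactor logging", "deleted_lines": [("Foo.log", 3)]},
    {"id": "c3", "message": "Fix bug #32528",
     "deleted_lines": [("Foo.parse", 10), ("Foo.parse", 11)]},
]
blame = {
    ("c3", "Foo.parse", 10): "c1",
    ("c3", "Foo.parse", 11): "c2",
}
introducing = find_bug_introducing(commits, blame)
```

In a real repository the `blame` map would come from the version control system's annotate facility; with Historage, the modules in such a map could be individual methods.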

Target
Project          Period    # of commits
Xpand            2y6m       1,038
WTP Incubator    2y8m       1,133
Ant              11y7m      2,590
Lucene/Solr      1y6m       3,485
OpenJPA          5y4m       4,180
Cassandra        2y6m       4,423
ECF              6y6m       9,748
Wicket           7y        15,033

Collected Metrics
LOC                     Lines of code
Add/DelLOC              Added / deleted LOC

Chg/FixChgNum           # of changes / bug-fix changes
PastBugNum              # of fixed bug IDs
Period                  Existing days
BugIntroNum             # of bug-introducing changes
LogCoupNum              # of logical coupling changes
Avg/Max/MinInterval     Avg / max / min change interval
HCM                     Process complexity metric

DevTotal/Major/Minor    # of total / major / minor developers
Ownership               Highest proportion of ownership
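
A few of these metrics can be derived directly from a module's change records. The sketch below is a minimal illustration: the record layout, the `historical_metrics` helper, and the exact definitions (for example, Period as the days from the first change to the snapshot date, and Ownership as the top developer's share of changes) are assumptions and may differ from the definitions used in the study.

```python
from collections import Counter
from datetime import date


def historical_metrics(changes, snapshot_day):
    """Compute a few per-module historical metrics from its change records.

    `changes` is a non-empty list of (day, developer, is_bug_fix) tuples
    sorted by day.
    """
    days = [day for day, _, _ in changes]
    # Days elapsed between consecutive changes.
    intervals = [(b - a).days for a, b in zip(days, days[1:])]
    per_dev = Counter(dev for _, dev, _ in changes)
    return {
        "ChgNum": len(changes),
        "FixChgNum": sum(1 for _, _, is_fix in changes if is_fix),
        "Period": (snapshot_day - days[0]).days,
        "AvgInterval": sum(intervals) / len(intervals) if intervals else 0.0,
        "MaxInterval": max(intervals, default=0),
        "MinInterval": min(intervals, default=0),
        "DevTotal": len(per_dev),
        "Ownership": max(per_dev.values()) / len(changes),
    }


# Toy change history of one method (invented data).
changes = [
    (date(2007, 7, 2), "alice", False),
    (date(2007, 7, 9), "bob", True),
    (date(2007, 7, 12), "alice", False),
    (date(2007, 7, 30), "alice", True),
]
metrics = historical_metrics(changes, snapshot_day=date(2007, 8, 1))
```

The same function applies to a package, a file, or a method; only the granularity of the change records differs.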

Effort-Based Evaluation
[Figure: sample curve of percent of bugs found (y-axis, 0 to 100) against percent of LOC (x-axis, 0 to 100)]
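
One point on such a curve can be computed as sketched below: rank modules, inspect them in order until a LOC budget is spent, and report the share of bugs covered. This is a minimal sketch; ranking by predicted score, stopping at the first module that does not fit in the budget, and the `bugs_found_at` helper are assumptions, and the toy numbers are invented rather than results from the study.

```python
def bugs_found_at(modules, loc_budget=0.2):
    """Percent of bugs found when inspecting modules in ranked order
    until `loc_budget` (a fraction of total LOC) is used up.

    `modules` is a list of (score, loc, bugs) tuples; a higher score means
    the model considers the module more likely to be buggy.
    """
    total_loc = sum(loc for _, loc, _ in modules)
    total_bugs = sum(bugs for _, _, bugs in modules)
    budget = loc_budget * total_loc

    inspected = 0
    found = 0
    for score, loc, bugs in sorted(modules, key=lambda m: m[0], reverse=True):
        if inspected + loc > budget:
            break  # the next module does not fit in the remaining budget
        inspected += loc
        found += bugs
    return 100.0 * found / total_bugs if total_bugs else 0.0


# Toy data, 1,000 LOC and 3 bugs in total at both granularities.
files = [(0.9, 500, 2), (0.4, 300, 1), (0.1, 200, 0)]
methods = [(0.9, 50, 1), (0.8, 100, 1), (0.5, 40, 0), (0.3, 300, 1), (0.1, 510, 0)]

file_level = bugs_found_at(files)
method_level = bugs_found_at(methods)
```

In this toy case the top-ranked file alone exceeds the 20% budget, while the three top-ranked methods fit within it.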

Result (ECF)
[Figure: percent of bugs found (y-axis, 0 to 100) against percent of lines (x-axis, 0 to 100) for package-, file-, and method-level prediction]

1,000 Runs (ECF)
[Figure: percent of bugs found (y-axis, 0 to 80) for package-, file-, and method-level prediction]

Percentage of bugs found in 20% of LOC over 1,000 runs

1,000 Runs (All)
[Figure: percent of bugs found (y-axis, 0 to 100) for package-, file-, and method-level prediction on Xpand, WTP Incubator, Ant, Lucene/Solr, OpenJPA, Cassandra, ECF, and Wicket]

Median percentage of bugs found in 20% of LOC

Why Is Method-Level Prediction Effective?
[Figure, left: Size. LOC (y-axis, 0 to 800) for package, file, and method]
[Figure, right: # of methods in a file. Number of methods (y-axis, up to 60) for all methods and buggy methods]

Even when models predict buggy packages or files correctly, most of the code in those modules is non-buggy.

Observations from Correlation Analysis
Are there differences between method-level and package/file-level prediction models?

Same
Large changes tend to be buggy

Frequent changes tend to be buggy

Different
Bugs do not occur repeatedly
Organizational metrics may not contribute to method-level prediction

Threats to Validity

Targets are limited to open-source projects written in Java

Buggy modules were identified without manual inspection

Effort-based evaluation may not reflect actual effort

Fine-Grained Study Is Big Data Analysis
Need scalable techniques for

preparing fine-grained data (making Historage)

analyzing histories (collecting metrics)

building prediction models

[Figure: # of modules in one snapshot (y-axis, 0 to 30,000), files versus methods, for Xpand, Ant, ECF, and Wicket]

Conclusions
Summary

Method-level bug prediction with well-known historical metrics

Future work

Empirical studies of actual effort using method-level prediction

More metrics and more projects (including industrial projects)
