The document discusses the impact of noise on the performance of bug prediction models. It introduces three research questions: how resistant prediction models are to noise, how much noise can be detected and removed, and whether removing noise can improve performance. The study trains prediction models on software data with artificially inserted label noise and evaluates the models on clean test data. The results show that the models are reasonably resistant to noise for some projects, while performance decreases for others, indicating that noise can degrade prediction models if it is not addressed.
Defect prediction models help software quality assurance teams to effectively allocate their limited resources to the most defect-prone software modules. Model validation techniques, such as k-fold cross-validation, use historical data to estimate how well a model will perform in the future. However, little is known about how accurate the performance estimates of these model validation techniques tend to be. In this paper, we set out to investigate the bias and variance of model validation techniques in the domain of defect prediction. A preliminary analysis of 101 publicly available defect prediction datasets suggests that 77% of them are highly susceptible to producing unstable results. Hence, selecting an appropriate model validation technique is a critical experimental design choice. Based on an analysis of 256 studies in the defect prediction literature, we select the 12 most commonly adopted model validation techniques for evaluation. Through a case study of data from 18 systems that span both open-source and proprietary domains, we derive the following practical guidelines for future defect prediction studies: (1) the single holdout validation techniques should be avoided; and (2) researchers should use the out-of-sample bootstrap validation technique instead of holdout or the commonly-used cross-validation techniques.
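The abstract above recommends the out-of-sample bootstrap for estimating model performance. A minimal sketch of that estimator follows, assuming a feature matrix X and binary defect labels y; the choice of logistic regression and AUC is purely illustrative, not part of the original study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def out_of_sample_bootstrap(X, y, n_rounds=100, seed=0):
    """Train on a bootstrap sample, evaluate on the rows that were not drawn."""
    rng = np.random.default_rng(seed)
    n, scores = len(y), []
    for _ in range(n_rounds):
        boot = rng.integers(0, n, size=n)            # sample n rows with replacement
        oob = np.setdiff1d(np.arange(n), boot)       # out-of-sample rows (~36.8% of the data)
        if len(np.unique(y[boot])) < 2 or len(np.unique(y[oob])) < 2:
            continue                                 # skip degenerate resamples
        model = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
        scores.append(roc_auc_score(y[oob], model.predict_proba(X[oob])[:, 1]))
    return float(np.mean(scores)), float(np.std(scores))

# Toy data: two noisy features loosely related to the defect label.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)
print(out_of_sample_bootstrap(X, y))  # (mean AUC, std of AUC across bootstrap rounds)
```

The mean of the per-round scores estimates performance; their spread indicates the variance of the estimate, which is what the paper compares across validation techniques.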
Defect, defect, defect: PROMISE 2012 Keynote - Sung Kim
Software prediction leveraging repositories has received a tremendous amount of attention within the software engineering community, including PROMISE. In this talk, I will first present great achievements in defect prediction research, including new defect prediction features, promising algorithms, and interesting analysis results. However, there are still many challenges in defect prediction. I will talk about them and discuss potential solutions that leverage Prediction 2.0.
Software analytics (for software quality purposes) is a statistical or machine-learning classifier that is trained to identify defect-prone software modules. The goal of software analytics is to help software engineers prioritize their software testing effort on the most risky modules and understand past pitfalls that lead to defective code. While the adoption of software analytics enables software organizations to distil actionable insights, there are still many barriers to the broad and successful adoption of such analytics systems. Indeed, even if software organizations can access such invaluable software artifacts and toolkits for data analytics, researchers and practitioners often have little knowledge of how to properly develop analytics systems. Thus, the accuracy of the predictions and insights that are derived from analytics systems is one of the most important challenges of data science in software engineering.
In this work, we conduct a series of empirical investigations to better understand the impact of experimental components (i.e., class mislabelling, parameter optimization of classification techniques, and model validation techniques) on the performance and interpretation of software analytics. To accelerate the large number of compute-intensive experiments, we leverage the High-Performance Computing (HPC) resources of the Centre for Advanced Computing (CAC) at Queen's University, Canada. Through case studies of systems that span both proprietary and open-source domains, we demonstrate that (1) realistic noise does not impact the precision of software analytics; (2) automated parameter optimization of classification techniques substantially improves the performance and stability of software analytics; and (3) the out-of-sample bootstrap validation technique produces a good balance between the bias and variance of performance estimates. Our results lead us to conclude that the experimental components of analytics modelling impact the predictions and associated insights that are derived from software analytics. Empirical investigations of the impact of overlooked experimental components are needed to derive practical guidelines for analytics modelling.
Software Quality Assurance (SQA) teams play a critical role in the software development process by ensuring the absence of software defects. It is not feasible to perform exhaustive SQA tasks (i.e., software testing and code review) on a large software product given the limited SQA resources that are available. Thus, prioritization is an essential step in all SQA efforts. Defect prediction models are used to prioritize risky software modules and to understand the impact of software metrics on the defect-proneness of software modules. The predictions and insights that are derived from defect prediction models can help software teams allocate their limited SQA resources to the modules that are most likely to be defective and avoid the common pitfalls associated with the defective modules of the past. However, these predictions and insights may be inaccurate and unreliable if practitioners do not control for the impact of experimental components (e.g., datasets, metrics, and classifiers) on defect prediction models, which could lead to erroneous decision-making in practice. In this thesis, we investigate the impact of experimental components on the performance and interpretation of defect prediction models. More specifically, we investigate the impact that three often-overlooked experimental components (i.e., issue report mislabelling, parameter optimization of classification techniques, and model validation techniques) have on defect prediction models. Through case studies of systems that span both proprietary and open-source domains, we demonstrate that (1) issue report mislabelling does not impact the precision of defect prediction models, suggesting that researchers can rely on the predictions of defect prediction models that were trained using noisy defect datasets; (2) automated parameter optimization of classification techniques substantially improves the performance and stability of defect prediction models, and also changes their interpretation, suggesting that researchers should no longer shy away from applying parameter optimization to their models; and (3) the out-of-sample bootstrap validation technique produces a good balance between the bias and variance of performance estimates, suggesting that the single-holdout and cross-validation families that are commonly used nowadays should be avoided.
Developers often wonder how to implement a certain functionality (e.g., how to parse XML files) using APIs. Obtaining an API usage sequence based on an API-related natural language query is very helpful in this regard. Given a query, existing approaches utilize information retrieval models to search for matching API sequences. These approaches treat queries and APIs as bags-of-words and lack a deep understanding of the semantics of the query.
We propose DeepAPI, a deep-learning-based approach to generate API usage sequences for a given natural language query. Instead of a bag-of-words assumption, it learns the sequence of words in a query and the sequence of associated APIs. DeepAPI adapts a neural language model named RNN Encoder-Decoder. It encodes a word sequence (user query) into a fixed-length context vector, and generates an API sequence based on the context vector. We also augment the RNN Encoder-Decoder by considering the importance of individual APIs. We empirically evaluate our approach with more than 7 million annotated code snippets collected from GitHub. The results show that our approach generates largely accurate API sequences and outperforms the related approaches.
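The abstract describes an RNN Encoder-Decoder that compresses the query into a fixed-length context vector and decodes an API sequence from it. Below is a minimal sketch of that encoder-decoder idea; the GRU cells, vocabulary sizes, and teacher-forcing setup are my own assumptions for illustration, not DeepAPI's actual architecture:

```python
import torch
import torch.nn as nn

class QueryToApiSeq(nn.Module):
    """Toy RNN Encoder-Decoder: encode a query into one context vector,
    then decode an API token sequence conditioned on that vector."""
    def __init__(self, query_vocab=1000, api_vocab=500, emb=64, hid=128):
        super().__init__()
        self.q_emb = nn.Embedding(query_vocab, emb)
        self.a_emb = nn.Embedding(api_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.proj = nn.Linear(hid, api_vocab)

    def forward(self, query_ids, api_ids):
        _, context = self.encoder(self.q_emb(query_ids))   # fixed-length context vector
        dec_out, _ = self.decoder(self.a_emb(api_ids), context)  # teacher forcing
        return self.proj(dec_out)                           # logits over the API vocabulary

# Shape check with random token ids (batch of 2, query length 7, API length 5).
model = QueryToApiSeq()
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 500, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 500])
```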
Partitioning Composite Code Changes to Facilitate Code Review (MSR 2015) - Sung Kim
Yida's presentation at MSR 2015!
Abstract: Developers expend significant effort on reviewing source code changes, hence the comprehensibility of code changes directly affects development productivity. Our prior study has suggested that composite code changes, which mix multiple development issues together, are typically difficult to review. Unfortunately, our manual inspection of 453 open source code changes reveals a non-trivial occurrence (up to 29%) of such composite changes.
In this paper, we propose a heuristic-based approach to automatically partition composite changes, such that each sub-change in the partition is more cohesive and self-contained. Our quantitative and qualitative evaluation results are promising in demonstrating the potential benefits of our approach for facilitating code review of composite code changes.
Dealing with Noise in Defect Prediction
1. Dealing with Noise in bug prediction
Sunghun Kim, Hongyu Zhang, Rongxin Wu and Liang Gong
The Hong Kong University of Science & Technology / Tsinghua University
3.-6. Where are the bugs?
• Complex files! [Menzies et al.]
• Modified files! [Nagappan et al.]
• Nearby other bugs! [Zimmermann et al.]
• Previously fixed files [Hassan et al.]
9.-12. Prediction model
Training instances (features + labels) are fed to a Learner, which then produces a Prediction for a new, unlabelled instance (?).
13. Training on software evolution is key
• Software features can be used to predict bugs
• Defect labels are obtained from software evolution
• Supervised learning algorithms
(Data sources: Version Archive and Bug Database)
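As a concrete illustration of the pipeline on these slides (features plus labels fed to a supervised learner), here is a minimal sketch; the synthetic per-file metrics and the labelling rule stand in for data mined from a version archive and a bug database and are invented for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
n = 500
# Hypothetical per-file metrics (stand-ins for features mined from the version archive).
loc_changed = rng.poisson(20, n)
past_fixes = rng.poisson(3, n)
complexity = rng.uniform(1, 30, n)
X = np.column_stack([loc_changed, past_fixes, complexity])
# Synthetic defect labels (stand-ins for labels derived from the bug database; 1 = buggy).
y = (past_fixes + rng.normal(0, 1, n) > 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("F-measure on the held-out set:", round(f1_score(y_test, clf.predict(X_test)), 2))
```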
14.-18. Change classification
Past changes are labelled as bug-introducing ("bad", marked X) or clean; a learner is built on them (BUILD A LEARNER) and then used to predict the quality of a new change (PREDICT QUALITY).
Kim, Whitehead Jr., Zhang: Classifying Software Changes: Clean or Buggy? (TSE 2008)
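A hedged sketch of change classification in the spirit of these slides: hypothetical change descriptions are vectorized as word/bigram counts and a classifier learns to label a new change as bug-introducing or clean. The example changes and labels are invented, and the TSE 2008 paper uses a much richer feature set than plain text:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training changes, labelled 1 = bug-introducing ("bad"), 0 = clean.
changes = [
    "add null check before dereferencing parser result",
    "copy paste loop body without updating index bound",
    "rename variable and update comments",
    "introduce cache without invalidation on write",
    "fix typo in documentation",
    "change array length off by one in iteration",
]
labels = [0, 1, 0, 1, 0, 1]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(changes, labels)

new_change = "extend loop bound handling for edge case"
print("predicted bug-introducing?", bool(clf.predict([new_change])[0]))
```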
20.-26. Source Repository and Bug Database
• Source Repository: all commits C, of which bug fixes Cf, of which linked fixes Cfl.
• Bug Database: all bugs B, of which fixed bugs Bf, of which linked fixed bugs Bfl.
• Fixes and bugs are linked via log messages; bug fixes that are related to a fixed bug but not linked to it are Noise!
Bird et al., "Fair and Balanced? Bias in Bug-Fix Datasets," FSE 2009
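The slides above hinge on linking fix commits to bug reports via log messages; fixes that never mention a bug ID stay unlinked and bias the resulting dataset. A small illustrative sketch follows; the commit messages, bug IDs, and regular expression are hypothetical, not taken from the study:

```python
import re

# Hypothetical commit log and fixed-bug IDs from a bug database, for illustration only.
commits = [
    {"id": "a1b2c3", "msg": "Fix NPE in parser; closes bug #1234"},
    {"id": "d4e5f6", "msg": "Refactor build scripts"},
    {"id": "0a9b8c", "msg": "patch for crash reported last week"},  # a fix with no bug ID
]
fixed_bug_ids = {"1234", "5678"}

BUG_REF = re.compile(r"(?:bug|issue|fix(?:es|ed)?)\s*#?(\d+)", re.IGNORECASE)

linked, unlinked_fix_like = [], []
for c in commits:
    ids = {m for m in BUG_REF.findall(c["msg"]) if m in fixed_bug_ids}
    if ids:
        linked.append((c["id"], ids))          # linked fixes (the Cfl of the slides)
    elif re.search(r"fix|patch|crash", c["msg"], re.IGNORECASE):
        unlinked_fix_like.append(c["id"])      # likely fixes with no link: a source of bias

print("linked fixes:", linked)
print("fix-like but unlinked:", unlinked_fix_like)
```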
27.-30. Effect of training on superbiased data (Severity)
[Bar chart: Bug Recall (0%-100%) for models trained on all bugs vs. trained on biased data1 vs. trained on biased data2.]
Bias in bug severity affects BugCache.
Bird et al., "Fair and Balanced? Bias in Bug-Fix Datasets," FSE 2009
32. Study questions
• Q1: How resistant is a defect prediction model to noise?
• Q2: How much noise can be detected and removed?
• Q3: Can we remove noise to improve defect prediction performance?
53.-55. Q1: How resistant is a defect prediction model to noise?
[Line chart: buggy F-measure vs. false negative (FN) and false positive (FP) rate in the training set, for several projects; the F-measure stays largely stable until the noise rate reaches roughly 20~30%.]
56. Study questions
• Q1: How resistant is a defect prediction model to noise?
• Q2: How much noise can be detected and removed?
• Q3: Can we remove noise to improve defect prediction performance?
57.-58. Detecting noise
Starting from the original training set:
1. Removing buggy labels creates false negative noise.
2. Adding buggy labels creates false positive noise.
59. Figure 4: Creating biased training sets
It is very hard to obtain a golden (noise-free) set. In our approach, we carefully select high-quality datasets and assume they are the golden sets. We then add FPs and FNs intentionally to create a noisy set: we randomly select instances in a golden set and artificially change their labels from buggy to clean or from clean to buggy, inspired by the experiments in [4]. To make FN datasets (for RQ1), we randomly select n% of the buggy instances and relabel them as clean.
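A minimal sketch of the noise-injection step described above: flip the labels of a randomly chosen fraction of instances in a golden set to create false negative or false positive noise. The function name and the encoding 1 = buggy / 0 = clean are my own choices:

```python
import numpy as np

def inject_label_noise(y, fn_rate=0.0, fp_rate=0.0, seed=0):
    """Flip labels in a 'golden' label vector y (1 = buggy, 0 = clean).
    fn_rate: fraction of buggy instances relabelled clean (false negatives).
    fp_rate: fraction of clean instances relabelled buggy (false positives)."""
    rng = np.random.default_rng(seed)
    noisy = y.copy()
    buggy = np.flatnonzero(y == 1)
    clean = np.flatnonzero(y == 0)
    flip_fn = rng.choice(buggy, size=int(fn_rate * len(buggy)), replace=False)
    flip_fp = rng.choice(clean, size=int(fp_rate * len(clean)), replace=False)
    noisy[flip_fn] = 0
    noisy[flip_fp] = 1
    return noisy

# Example: create a training set with 20% FN noise from a golden set.
y_gold = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_noisy = inject_label_noise(y_gold, fn_rate=0.2, seed=1)
print(y_gold, y_noisy)
```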
60. [Figure 9: pseudo-code of the CLNI (closest-list noise identification) algorithm.]
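The CLNI pseudo-code itself is not recoverable from this transcript, so here is an illustrative nearest-neighbour reconstruction of the general idea only: flag an instance as likely mislabelled when most of its closest neighbours carry the opposite label, and repeat until the flagged set stabilizes. The parameters and exact rules are assumptions, not the authors' algorithm:

```python
import numpy as np

def clni_like_noise_detection(X, y, k=5, threshold=0.6, max_iter=10):
    """Flag probably-mislabelled instances: an instance is suspicious when at least
    `threshold` of its k nearest neighbours (Euclidean distance) disagree with its label."""
    n = len(y)
    flagged = np.zeros(n, dtype=bool)
    for _ in range(max_iter):
        new_flagged = flagged.copy()
        active = np.flatnonzero(~flagged)              # ignore already-flagged instances
        for i in active:
            others = active[active != i]
            if len(others) == 0:
                continue
            d = np.linalg.norm(X[others] - X[i], axis=1)
            nearest = others[np.argsort(d)[:k]]
            disagree = np.mean(y[nearest] != y[i])     # fraction of neighbours disagreeing
            if disagree >= threshold:
                new_flagged[i] = True
        if np.array_equal(new_flagged, flagged):       # stop once the flagged set is stable
            break
        flagged = new_flagged
    return flagged  # True = likely noisy label

# Toy demo: ten clean-looking instances plus one with a suspicious "buggy" label.
X = np.array([[i, 0.0] for i in range(10)] + [[4.5, 0.1]])
y = np.array([0] * 10 + [1])
print(clni_like_noise_detection(X, y))  # the last instance should be flagged
```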
63. Study questions
• Q1: How resistant is a defect prediction model to noise?
• Q2: How much noise can be detected and removed?
• Q3: Can we remove noise to improve defect prediction performance?
66. Bug prediction using cleaned data
[Bar chart: SWT F-measure (0-100) vs. noise level (0%, 15%, 30%, 45%), comparing noisy and cleaned training data; with cleaning, the F-measure is 76% even with 45% noise.]
67. Study limitations
• All datasets are collected from open source projects
• The golden set used in this paper may not be perfect
• The noisy data simulations may not reflect the actual noise patterns in practice
68. Summary
• Prediction models (used in our experiments) are resistant to up to 20~30% noise
• Noise detection is promising
• Future work:
  - Building oracle defect sets
  - Improving noise detection algorithms
  - Applying to more defect prediction models (regression, BugCache)