The reliability of a prediction model depends on the quality of the data on which it is trained. Therefore, defect prediction models may be unreliable if they are trained using noisy data. Recent research suggests that randomly-injected noise that changes the classification (label) of software modules from defective to clean (and vice versa) can impact the performance of defect models. Yet, in reality, incorrectly labelled (i.e., mislabelled) issue reports are likely non-random. In this paper, we study whether mislabelling is random, and the impact that realistic mislabelling has on the performance and interpretation of defect models. Through a case study of 3,931 manually-curated issue reports from the Apache Jackrabbit and Lucene systems, we find that: (1) issue report mislabelling is not random; (2) precision is rarely impacted by mislabelled issue reports, suggesting that practitioners can rely on the accuracy of modules labelled as defective by models that are trained using noisy data; (3) however, models trained on noisy data typically achieve 56%-68% of the recall of models trained on clean data; and (4) only the metrics in the top influence rank of our defect models are robust to the noise introduced by mislabelling, suggesting that the less influential metrics of models that are trained on noisy data should not be interpreted or used to make decisions.
The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models
1. The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models
Chakkrit (Kla) Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Akinori Ihara, Kenichi Matsumoto
@klainfo kla@chakkrit.com
4. Software defects are costly
Reputation: The Obama administration will always be connected to healthcare.gov
Monetary: NIST estimates that software defects cost the US economy $59.5 billion per year!
5. SQA teams try to find defects before they escape to the field
6. SQA teams have limited resources
QA resources are limited, while software continues to grow in size and complexity
9. Defect prediction models help SQA teams to:
Predict which modules are risky
Understand what makes software fail
10. Modules that are fixed during post-release development are set as defective
[Diagram: a snapshot of Modules 1-4 is taken at the release date, and post-release changes and issues are tracked. Bug Report#1 fixed Module1 and Module2, so those modules are labelled as defective; the remaining modules are labelled as clean, producing the defect dataset.]
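To make the labelling step concrete, here is a minimal sketch (not the authors' tooling) of deriving a defect dataset from post-release bug reports; the report structure and module names are illustrative.

```python
# Minimal sketch: label modules as defective if a post-release bug report
# fixed them. The report format here is a hypothetical simplification.
snapshot_modules = {"Module1", "Module2", "Module3", "Module4"}

post_release_reports = [
    {"id": "Bug Report#1", "type": "BUG", "fixed": {"Module1", "Module2"}},
]

defective = set()
for report in post_release_reports:
    if report["type"] == "BUG":  # only reports classified as bugs count
        defective |= report["fixed"] & snapshot_modules

labels = {m: "defective" if m in defective else "clean"
          for m in sorted(snapshot_modules)}
print(labels)  # Module1/Module2 -> defective, Module3/Module4 -> clean
```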
17. Defect models are trained using Machine Learning
[Diagram: the defect dataset (Modules 1-4) is fed into a machine learning or statistical learning technique to train a defect model.]
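As a concrete illustration, here is a minimal sketch of training a defect model with scikit-learn; the module metrics and the random forest learner are placeholders, not necessarily those used in the study.

```python
# Minimal sketch of training a defect model. The metrics (size, churn,
# complexity) and the learner are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per module: [lines of code, churn, cyclomatic complexity].
X = np.array([[1200, 45, 30],   # Module 1
              [ 300, 10,  8],   # Module 2
              [ 800, 60, 22],   # Module 3
              [ 150,  5,  4]])  # Module 4
y = np.array([1, 1, 0, 0])      # 1 = defective, 0 = clean

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict_proba(X)[:, 1])  # predicted defect risk per module
```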
19. Defect data are noisy
The reliability of the models depends on the quality of the training data.
[Diagram: a noisy defect dataset fed into machine learning produces an unreliable defect model.]
20. Issue reports are mislabelled
Fields in issue tracking systems are often missing or incorrect. [Aranda et al., ICSE 2009]
43% of issue reports are mislabelled. [Herzig et al., ICSE 2013] [Antoniol et al., CASCON 2008]
Defect mislabelling: a new feature could be incorrectly labelled as a bug.
Non-defect mislabelling: a bug could be mislabelled as a new feature.
23. Then, modules are mislabelled
[Diagram: Bug Report#1 fixed M1 and M2. Issue report #2 fixed M3, but #2 is mislabelled (it is actually a new feature), so M3 should be a clean module. The resulting noisy data wrongly labels M3 as defective.]
27. Mislabelling may impact the performance
Prior works assumed that mislabelling is random [Kim et al., ICSE 2011] [Seiffert et al., Information Science 2014]: random mislabelling has a negative impact on the performance.
28. Mislabelling is likely non-random
We suspect that novice developers are more likely to mislabel than experienced developers, since novice developers are known to overlook bookkeeping issues [Bachmann et al., FSE 2010].
29. The impact of realistic mislabelling on the performance and interpretation of defect models
(RQ1) The Nature of Mislabelling
(RQ2) Its Impact on the Performance
(RQ3) Its Impact on the Interpretation
31. Using prediction models to classify whether issue reports are mislabelled
If the prediction model performs well, mislabelling is predictable (i.e., non-random); if it performs poorly, mislabelling is random.
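A minimal sketch of this setup, using synthetic data and illustrative issue-report features (the actual features in the study differ):

```python
# Minimal sketch: predict whether an issue report is mislabelled and
# compare against random guessing. Data and features are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((500, 3))                 # e.g., reporter experience, ...
y = (X[:, 0] < 0.3).astype(int)          # 1 = mislabelled (synthetic rule)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
guess = rng.integers(0, 2, size=len(y_te))  # random-guessing baseline

print("model  F1:", f1_score(y_te, model.predict(X_te)))
print("random F1:", f1_score(y_te, guess))
# A model that clearly outperforms random guessing suggests that
# mislabelling is predictable, i.e., non-random.
```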
34. Selecting our studied systems
Manually-curated issue reports from the Apache Jackrabbit and Lucene systems [Herzig et al., ICSE 2013]
39. Mislabelling is non-random
[Chart: precision, recall, and F-measure of our mislabelling models vs. random guessing, per system.]

System     | Measure   | Our Model | Random Guessing
Jackrabbit | Precision | 0.78      | 0.12
Jackrabbit | Recall    | 0.64      | 0.50
Jackrabbit | F-measure | 0.70      | 0.19
Lucene     | Precision | 0.75      | 0.12
Lucene     | Recall    | 0.71      | 0.50
Lucene     | F-measure | 0.73      | 0.19

Our models achieve a mean F-measure of up to 0.73, which is 4-34 times better than random guessing.
40. (RQ1) The Nature of Mislabelling: Mislabelling is non-random
(RQ2) The Impact on the Performance
The impact of realistic mislabelling on the performance and interpretation of defect models
43. Compare the performance between clean models and noisy models
Clean Performance vs. Realistic Noisy Performance vs. Random Noisy Performance
46. Generating three samples
Clean Sample (Oracle): modules M1-M4 with correct labels; issue report #2 is known to be mislabelled.
Realistic Noisy Sample: flip the labels of the modules that are addressed by the mislabelled issue reports.
Random Noisy Sample: flip module labels at random.
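A minimal sketch of the sampling step, assuming the oracle of mislabelled reports from the running example (module names and label directions are illustrative):

```python
# Minimal sketch of generating the three samples. 1 = defective, 0 = clean.
import random

clean = {"M1": 1, "M2": 1, "M3": 0, "M4": 0}  # oracle (correct) labels
mislabelled = {"#2": {"M3"}}  # report #2 is mislabelled and addresses M3

# Realistic noisy sample: flip only the labels of modules addressed
# by mislabelled issue reports.
realistic = dict(clean)
for modules in mislabelled.values():
    for m in modules:
        realistic[m] = 1 - realistic[m]

# Random noisy sample: flip the same number of labels, chosen at random.
random.seed(0)
rand_noisy = dict(clean)
for m in random.sample(sorted(clean), sum(map(len, mislabelled.values()))):
    rand_noisy[m] = 1 - rand_noisy[m]

print(clean)      # {'M1': 1, 'M2': 1, 'M3': 0, 'M4': 0}
print(realistic)  # M3 flipped to defective
print(rand_noisy) # one randomly chosen module flipped
```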
49. Generate the performance of clean models and noisy models
Train a defect model on each of the Clean, Realistic Noisy, and Random Noisy samples, then compare Clean Performance vs. Realistic Noisy Performance vs. Random Noisy Performance.

Performance Ratio = Performance of Realistic Noisy Model / Performance of Clean Model
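A minimal sketch of the ratio computation; note that both models are evaluated against the clean (oracle) test labels:

```python
# Minimal sketch: performance ratio of a noisy model to a clean model,
# both evaluated against clean (oracle) test labels.
from sklearn.metrics import precision_score, recall_score

def performance_ratio(noisy_model, clean_model, X_test, y_test_clean):
    """Ratio of noisy to clean performance; a ratio of 1 means no impact."""
    pred_noisy = noisy_model.predict(X_test)
    pred_clean = clean_model.predict(X_test)
    precision_ratio = (precision_score(y_test_clean, pred_noisy)
                       / precision_score(y_test_clean, pred_clean))
    recall_ratio = (recall_score(y_test_clean, pred_noisy)
                    / recall_score(y_test_clean, pred_clean))
    return precision_ratio, recall_ratio
```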
51. While the recall is often impacted, the precision is rarely impacted
[Boxplots: Ratio = Realistic Noisy / Clean, for precision and recall. Interpretation: a ratio of 1 means there is no impact.]
Precision is rarely impacted by realistic mislabelling.
Models trained on noisy data achieve 56% of the recall of models trained on clean data.
54. (RQ1) The Nature of Mislabelling: Mislabelling is non-random
(RQ2) The Impact on the Performance: While the recall is often impacted, the precision is rarely impacted
(RQ3) The Impact on the Interpretation
The impact of realistic mislabelling on the performance and interpretation of defect models
57. Generate the rank of metrics of clean models and noisy models
For each of the clean, realistic noisy, and random noisy models, compute the variable importance scores, then rank the metrics by those scores.
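A minimal sketch of this step, using scikit-learn feature importances as a stand-in for the variable-importance procedure (the paper's exact procedure is not reproduced here):

```python
# Minimal sketch: rank metrics by variable importance in a clean vs. a
# noisy model. Data, metrics, and the importance measure are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
metric_names = ["loc", "churn", "complexity"]     # illustrative metrics
X = rng.random((200, 3))
y_clean = (X[:, 0] > 0.5).astype(int)             # synthetic clean labels
y_noisy = y_clean ^ (rng.random(200) < 0.2).astype(int)  # 20% flipped

def rank_metrics(model):
    # Sort metrics from most to least important (1st rank first).
    order = np.argsort(model.feature_importances_)[::-1]
    return [metric_names[i] for i in order]

clean_model = RandomForestClassifier(random_state=0).fit(X, y_clean)
noisy_model = RandomForestClassifier(random_state=0).fit(X, y_noisy)
print("clean:", rank_metrics(clean_model))
print("noisy:", rank_metrics(noisy_model))
# Agreement at the top rank but disagreement at lower ranks mirrors the
# finding that only top-rank metrics are robust to mislabelling.
```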
59. Does a metric of the clean model appear at the same rank in the noisy models?
[Diagram: compare the rank of each metric in the clean model against its rank in the noisy model.]
61. Only the metrics in the 1st rank are robust to the mislabelling
85% of the metrics in the 1st rank of the clean model also appear in the 1st rank of the noisy model.
62. Conversely, the metrics in the 2nd and 3rd ranks are less stable
As little as 18% of the metrics in the 2nd and 3rd ranks of the clean models appear in the same rank in the noisy models.
63. (RQ1) The Nature of Mislabelling: Mislabelling is non-random
(RQ2) The Impact on the Performance: While the recall is often impacted, the precision is rarely impacted
(RQ3) The Impact on the Interpretation: Only top-rank metrics are robust to the mislabelling
The impact of realistic mislabelling on the performance and interpretation of defect models
65. Suggestions
(RQ1) Researchers can use our noise models to clean mislabelled issue reports.
(RQ2) Cleaning data will improve the ability to identify defective modules.
(RQ3) Quality improvement plans should be made based on the top-rank metrics.
70. Summary
Issue reports are mislabelled: 43% of issue reports are mislabelled [Herzig et al., ICSE 2013] [Antoniol et al., CASCON 2008], and fields in issue tracking systems are often missing or incorrect [Aranda et al., ICSE 2009]. Prior works assumed that mislabelling is random [Kim et al., ICSE 2011] [Seiffert et al., Information Science 2014].
Findings: (RQ1) Mislabelling is non-random. (RQ2) While the recall is often impacted, the precision is rarely impacted. (RQ3) Only top-rank metrics are robust to the mislabelling.
Suggestions: Researchers can use our noise models to clean mislabelled issue reports; cleaning data will improve the ability to identify defective modules; quality improvement plans should be made based on the top-rank metrics.
@klainfo kla@chakkrit.com