The reliability of a prediction model depends on the quality of the data on which it is trained. Therefore, defect prediction models may be unreliable if they are trained using noisy data. Recent research suggests that randomly-injected noise that changes the classification (label) of software modules from defective to clean (and vice versa) can impact the performance of defect models. Yet, in reality, incorrectly labelled (i.e., mislabelled) issue reports are likely non-random. In this paper, we study whether mislabelling is random, and the impact that realistic mislabelling has on the performance and interpretation of defect models. Through a case study of 3,931 manually-curated issue reports from the Apache Jackrabbit and Lucene systems, we find that: (1) issue report mislabelling is not random; (2) precision is rarely impacted by mislabelled issue reports, suggesting that practitioners can rely on the accuracy of modules labelled as defective by models that are trained using noisy data; (3) however, models trained on noisy data typically achieve 56%-68% of the recall of models trained on clean data; and (4) only the metrics in the top influence rank of our defect models are robust to the noise introduced by mislabelling, suggesting that the less influential metrics of models that are trained on noisy data should not be interpreted or used to make decisions.
The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models
1. The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models
Chakkrit (Kla) Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Akinori Ihara, Kenichi Matsumoto
@klainfo kla@chakkrit.com
4. Software defects are costly
Reputation: The Obama administration will always be connected to healthcare.gov
Monetary: NIST estimates that software defects cost the US economy $59.5 billion per year!
5. SQA teams try to find defects before they escape to the field
6. SQA teams have limited resources
QA resources are limited, while software continues to grow in size and complexity
9. Defect prediction models help SQA teams to:
Predict which modules are risky
Understand what makes software fail
10. Modules that are fixed during post-release development are set as defective
[Diagram: a snapshot of Modules 1-4 is taken at the release date, and post-release changes and issues are tracked. Bug Report#1 fixed Module1 and Module2, so those modules are labelled as defective; the remaining modules are labelled as clean, producing the defect dataset.]
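To make the labelling step concrete, here is a minimal sketch (not the authors' tooling) of deriving a defect dataset from post-release bug reports; the report structure and module names are illustrative.

```python
# Minimal sketch: label modules as defective if a post-release bug report
# fixed them. The report format here is a hypothetical simplification.
snapshot_modules = {"Module1", "Module2", "Module3", "Module4"}

post_release_reports = [
    {"id": "Bug Report#1", "type": "BUG", "fixed": {"Module1", "Module2"}},
]

defective = set()
for report in post_release_reports:
    if report["type"] == "BUG":  # only reports classified as bugs count
        defective |= report["fixed"] & snapshot_modules

labels = {m: "defective" if m in defective else "clean"
          for m in sorted(snapshot_modules)}
print(labels)  # Module1/Module2 -> defective, Module3/Module4 -> clean
```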
17. Defect models are trained using Machine Learning
[Diagram: the defect dataset (Modules 1-4) is fed into a machine learning or statistical learning technique to train a defect model.]
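As a concrete illustration, here is a minimal sketch of training a defect model with scikit-learn; the module metrics and the random forest learner are placeholders, not necessarily those used in the study.

```python
# Minimal sketch of training a defect model. The metrics (size, churn,
# complexity) and the learner are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per module: [lines of code, churn, cyclomatic complexity].
X = np.array([[1200, 45, 30],   # Module 1
              [ 300, 10,  8],   # Module 2
              [ 800, 60, 22],   # Module 3
              [ 150,  5,  4]])  # Module 4
y = np.array([1, 1, 0, 0])      # 1 = defective, 0 = clean

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict_proba(X)[:, 1])  # predicted defect risk per module
```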
19. Defect data are noisy
The reliability of the models depends on the quality of the training data.
[Diagram: a noisy defect dataset fed into machine learning produces an unreliable defect model.]
20. Issue reports are mislabelled
Fields in issue tracking systems are often missing or incorrect. [Aranda et al., ICSE 2009]
43% of issue reports are mislabelled. [Herzig et al., ICSE 2013] [Antoniol et al., CASCON 2008]
Defect mislabelling: a new feature could be incorrectly labelled as a bug.
Non-defect mislabelling: a bug could be mislabelled as a new feature.
23. Then, modules are mislabelled
[Diagram: Bug Report#1 fixed M1 and M2. Issue report #2 fixed M3, but #2 is mislabelled (it is actually a new feature), so M3 should be a clean module. The resulting noisy data wrongly labels M3 as defective.]
27. Mislabelling may impact the performance
Prior works assumed that mislabelling is random [Kim et al., ICSE 2011] [Seiffert et al., Information Science 2014]: random mislabelling has a negative impact on the performance.
28. Mislabelling is likely non-random
We suspect that novice developers are more likely to mislabel than experienced developers, since novice developers are known to overlook bookkeeping issues [Bachmann et al., FSE 2010].
29. The impact of realistic mislabelling on the performance and interpretation of defect models
(RQ1) The Nature of Mislabelling
(RQ2) Its Impact on the Performance
(RQ3) Its Impact on the Interpretation
31. Using prediction models to classify whether issue reports are mislabelled
If the prediction model performs well, mislabelling is predictable (i.e., non-random); if it performs poorly, mislabelling is random.
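A minimal sketch of this setup, using synthetic data and illustrative issue-report features (the actual features in the study differ):

```python
# Minimal sketch: predict whether an issue report is mislabelled and
# compare against random guessing. Data and features are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((500, 3))                 # e.g., reporter experience, ...
y = (X[:, 0] < 0.3).astype(int)          # 1 = mislabelled (synthetic rule)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
guess = rng.integers(0, 2, size=len(y_te))  # random-guessing baseline

print("model  F1:", f1_score(y_te, model.predict(X_te)))
print("random F1:", f1_score(y_te, guess))
# A model that clearly outperforms random guessing suggests that
# mislabelling is predictable, i.e., non-random.
```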
34. Selecting our studied systems
Manually-curated issue reports from the Apache Jackrabbit and Lucene systems [Herzig et al., ICSE 2013]
39. Mislabelling is non-random
[Chart: precision, recall, and F-measure of our mislabelling models vs. random guessing, per system.]

System     | Measure   | Our Model | Random Guessing
Jackrabbit | Precision | 0.78      | 0.12
Jackrabbit | Recall    | 0.64      | 0.50
Jackrabbit | F-measure | 0.70      | 0.19
Lucene     | Precision | 0.75      | 0.12
Lucene     | Recall    | 0.71      | 0.50
Lucene     | F-measure | 0.73      | 0.19

Our models achieve a mean F-measure of up to 0.73, which is 4-34 times better than random guessing.
40. (RQ1) The Nature of Mislabelling: Mislabelling is non-random
(RQ2) The Impact on the Performance
The impact of realistic mislabelling on the performance and interpretation of defect models
43. Compare the performance between clean models and noisy models
Clean Performance vs. Realistic Noisy Performance vs. Random Noisy Performance
46. Generating three samples
Clean Sample (Oracle): modules M1-M4 with correct labels; issue report #2 is known to be mislabelled.
Realistic Noisy Sample: flip the labels of the modules that are addressed by the mislabelled issue reports.
Random Noisy Sample: flip module labels at random.
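A minimal sketch of the sampling step, assuming the oracle of mislabelled reports from the running example (module names and label directions are illustrative):

```python
# Minimal sketch of generating the three samples. 1 = defective, 0 = clean.
import random

clean = {"M1": 1, "M2": 1, "M3": 0, "M4": 0}  # oracle (correct) labels
mislabelled = {"#2": {"M3"}}  # report #2 is mislabelled and addresses M3

# Realistic noisy sample: flip only the labels of modules addressed
# by mislabelled issue reports.
realistic = dict(clean)
for modules in mislabelled.values():
    for m in modules:
        realistic[m] = 1 - realistic[m]

# Random noisy sample: flip the same number of labels, chosen at random.
random.seed(0)
rand_noisy = dict(clean)
for m in random.sample(sorted(clean), sum(map(len, mislabelled.values()))):
    rand_noisy[m] = 1 - rand_noisy[m]

print(clean)      # {'M1': 1, 'M2': 1, 'M3': 0, 'M4': 0}
print(realistic)  # M3 flipped to defective
print(rand_noisy) # one randomly chosen module flipped
```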
49. Generate the performance of clean models and noisy models
Train a defect model on each of the Clean, Realistic Noisy, and Random Noisy samples, then compare Clean Performance vs. Realistic Noisy Performance vs. Random Noisy Performance.

Performance Ratio = Performance of Realistic Noisy Model / Performance of Clean Model
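A minimal sketch of the ratio computation; note that both models are evaluated against the clean (oracle) test labels:

```python
# Minimal sketch: performance ratio of a noisy model to a clean model,
# both evaluated against clean (oracle) test labels.
from sklearn.metrics import precision_score, recall_score

def performance_ratio(noisy_model, clean_model, X_test, y_test_clean):
    """Ratio of noisy to clean performance; a ratio of 1 means no impact."""
    pred_noisy = noisy_model.predict(X_test)
    pred_clean = clean_model.predict(X_test)
    precision_ratio = (precision_score(y_test_clean, pred_noisy)
                       / precision_score(y_test_clean, pred_clean))
    recall_ratio = (recall_score(y_test_clean, pred_noisy)
                    / recall_score(y_test_clean, pred_clean))
    return precision_ratio, recall_ratio
```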
51. While the recall is often impacted, the precision is rarely impacted
[Boxplots: Ratio = Realistic Noisy / Clean, for precision and recall. Interpretation: a ratio of 1 means there is no impact.]
Precision is rarely impacted by realistic mislabelling.
Models trained on noisy data achieve 56% of the recall of models trained on clean data.
54. (RQ1) The Nature of Mislabelling: Mislabelling is non-random
(RQ2) The Impact on the Performance: While the recall is often impacted, the precision is rarely impacted
(RQ3) The Impact on the Interpretation
The impact of realistic mislabelling on the performance and interpretation of defect models
57. Generate the rank of metrics of clean models and noisy models
For each of the clean, realistic noisy, and random noisy models, compute the variable importance scores, then rank the metrics by those scores.
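A minimal sketch of this step, using scikit-learn feature importances as a stand-in for the variable-importance procedure (the paper's exact procedure is not reproduced here):

```python
# Minimal sketch: rank metrics by variable importance in a clean vs. a
# noisy model. Data, metrics, and the importance measure are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
metric_names = ["loc", "churn", "complexity"]     # illustrative metrics
X = rng.random((200, 3))
y_clean = (X[:, 0] > 0.5).astype(int)             # synthetic clean labels
y_noisy = y_clean ^ (rng.random(200) < 0.2).astype(int)  # 20% flipped

def rank_metrics(model):
    # Sort metrics from most to least important (1st rank first).
    order = np.argsort(model.feature_importances_)[::-1]
    return [metric_names[i] for i in order]

clean_model = RandomForestClassifier(random_state=0).fit(X, y_clean)
noisy_model = RandomForestClassifier(random_state=0).fit(X, y_noisy)
print("clean:", rank_metrics(clean_model))
print("noisy:", rank_metrics(noisy_model))
# Agreement at the top rank but disagreement at lower ranks mirrors the
# finding that only top-rank metrics are robust to mislabelling.
```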
59. Does a metric of the clean model appear at the same rank in the noisy models?
[Diagram: compare the rank of each metric in the clean model against its rank in the noisy model.]
61. Only the metrics in the 1st rank are robust to the mislabelling
85% of the metrics in the 1st rank of the clean model also appear in the 1st rank of the noisy model.
62. Conversely, the metrics in the 2nd and 3rd ranks are less stable
As little as 18% of the metrics in the 2nd and 3rd ranks of the clean models appear in the same rank in the noisy models.
63. (RQ1) The Nature of Mislabelling: Mislabelling is non-random
(RQ2) The Impact on the Performance: While the recall is often impacted, the precision is rarely impacted
(RQ3) The Impact on the Interpretation: Only top-rank metrics are robust to the mislabelling
The impact of realistic mislabelling on the performance and interpretation of defect models
65. Suggestions
(RQ1) Researchers can use our noise models to clean mislabelled issue reports.
(RQ2) Cleaning data will improve the ability to identify defective modules.
(RQ3) Quality improvement plans should be made based on the top-rank metrics.
70. Summary
Issue reports are mislabelled: 43% of issue reports are mislabelled [Herzig et al., ICSE 2013] [Antoniol et al., CASCON 2008], and fields in issue tracking systems are often missing or incorrect [Aranda et al., ICSE 2009]. Prior works assumed that mislabelling is random [Kim et al., ICSE 2011] [Seiffert et al., Information Science 2014].
Findings: (RQ1) Mislabelling is non-random. (RQ2) While the recall is often impacted, the precision is rarely impacted. (RQ3) Only top-rank metrics are robust to the mislabelling.
Suggestions: Researchers can use our noise models to clean mislabelled issue reports; cleaning data will improve the ability to identify defective modules; quality improvement plans should be made based on the top-rank metrics.
@klainfo kla@chakkrit.com