The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models
Chakkrit (Kla) Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Akinori Ihara, Kenichi Matsumoto
@klainfo kla@chakkrit.com
Software defects are costly
- Monetary: NIST estimates that software defects cost the US economy $59.5 billion per year.
- Reputation: the Obama administration will always be connected to healthcare.gov.
SQA teams try to find defects before they escape to the field.
SQA teams have limited QA resources.
Software continues to grow in size and complexity.
Defect prediction models help SQA teams to:
- Predict which modules are risky.
- Understand what makes software fail.
Modules that are fixed during post-release development are labelled as defective
A snapshot of the modules (Module 1 to Module 4) is taken at the release date to build the defect dataset. Post-release changes and issue reports are then examined: Bug Report #1 fixed Module1 and Module2, so those modules are labelled as defective, while the remaining modules are labelled as clean.
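The labelling step above can be sketched as follows; the module and report names are illustrative placeholders, not the paper's data pipeline.

```python
# Label modules as defective if they were fixed by a post-release bug report.
# Module and report names here are illustrative placeholders.

def label_modules(modules, post_release_fixes):
    """Return {module: 'defective' or 'clean'} from post-release fix records."""
    fixed = set()
    for report in post_release_fixes:
        fixed.update(report["fixed_modules"])
    return {m: ("defective" if m in fixed else "clean") for m in modules}

modules = ["Module1", "Module2", "Module3", "Module4"]
bug_reports = [{"id": "#1", "fixed_modules": ["Module1", "Module2"]}]

labels = label_modules(modules, bug_reports)
# Module1 and Module2 are labelled defective; Module3 and Module4 clean.
```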
Defect models are trained using Machine Learning
The defect dataset (Module 1 to Module 4) is fed to a machine learning or statistical learning technique, which produces a defect model.
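To make the training step concrete, here is a minimal sketch that fits a one-metric logistic regression as a defect model. The metric values and labels are made up for illustration; real studies use many metrics and richer learners.

```python
import math

# Minimal sketch: fit a one-metric logistic regression (a defect model) by
# stochastic gradient ascent. The data below are illustrative, not the
# paper's datasets.

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted risk
            w += lr * (y - p) * x                      # gradient step
            b += lr * (y - p)
    return w, b

def predict(w, b, x):
    return 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5

# A size metric per module (scaled) and its defect label.
size = [0.2, 0.4, 1.5, 2.0]
defective = [0, 0, 1, 1]
w, b = fit_logistic(size, defective)
# Larger modules are predicted as risky.
```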
Defect data are noisy
The reliability of the models depends on the quality of the training data: a noisy defect dataset produces an unreliable defect model.
Issue reports are mislabelled
Fields in issue tracking systems are often missing or incorrect [Aranda et al., ICSE 2009], and 43% of issue reports are mislabelled [Herzig et al., ICSE 2013; Antoniol et al., CASCON 2008]. Two kinds of mislabelling arise:
- Defect mislabelling: a new feature could be incorrectly labelled as a bug.
- Non-defect mislabelling: a bug could be mislabelled as a new feature.
Then, modules are mislabelled
Suppose issue report #2, which fixed M3, is mislabelled. M3 is then incorrectly labelled as defective even though it should be a clean module, and the resulting defect dataset is noisy.
Mislabelling may impact the performance
Prior work assumed that mislabelling is random [Kim et al., ICSE 2011; Seiffert et al., Information Science 2014] and found that random mislabelling has a negative impact on the performance.
Mislabelling is likely non-random
We suspect that novice developers are likely to mislabel more often than experienced developers, since novice developers are known to overlook bookkeeping issues [Bachmann et al., FSE 2010].
We study the impact of realistic mislabelling on the performance and interpretation of defect models:
(RQ1) The nature of mislabelling.
(RQ2) Its impact on the performance.
(RQ3) Its impact on the interpretation.
Using prediction models to classify whether issue reports are mislabelled
If the prediction model performs well, mislabelling is predictable (i.e., non-random); if it performs poorly, mislabelling is random.
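This test can be sketched by comparing a model's F-measure against a random-guessing baseline; the predictions and labels below are illustrative placeholders, not the study's data.

```python
# Sketch: judge whether mislabelling is predictable by comparing a model's
# F-measure against a random-guessing baseline. All values are illustrative.

def f_measure(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

actual    = [1, 1, 0, 0, 1, 0, 0, 0]  # which issue reports are mislabelled
model     = [1, 1, 0, 0, 0, 0, 0, 0]  # a prediction model's output
coin_flip = [1, 0, 1, 0, 1, 0, 1, 0]  # a random-guessing baseline

predictable = f_measure(actual, model) > f_measure(actual, coin_flip)
# If the model clearly beats random guessing, mislabelling is non-random.
```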
Selecting our studied systems
We study Jackrabbit and Lucene, whose issue reports were manually curated [Herzig et al., ICSE 2013].
Mislabelling is non-random
[Figure: precision, recall, and F-measure of our models vs. random guessing. Jackrabbit: precision 0.78 vs. 0.12, recall 0.64 vs. 0.50, F-measure 0.70 vs. 0.19. Lucene: precision 0.75 vs. 0.12, recall 0.71 vs. 0.50, F-measure 0.73 vs. 0.19.]
Our models achieve a mean F-measure of up to 0.73, which is 4-34 times better than random guessing.
(RQ1) The nature of mislabelling: mislabelling is non-random.
Compare the performance between clean models and noisy models
Clean performance vs. realistic noisy performance vs. random noisy performance.
Generating three samples
- Clean sample: the original labels (the oracle tells us that issue report #2 is mislabelled).
- Realistic noisy sample: realistically flip the labels of the modules that are addressed by the mislabelled issue reports.
- Random noisy sample: randomly flip the modules' labels.
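The two noise-injection schemes above can be sketched as follows; the module labels, report IDs, and helper names are illustrative placeholders.

```python
import random

# Sketch of the two noise-injection schemes: realistic noise flips the labels
# of modules fixed by mislabelled issue reports; random noise flips the same
# number of labels chosen uniformly at random. Names are illustrative.

def realistic_noisy(labels, mislabelled_reports):
    noisy = dict(labels)
    for report in mislabelled_reports:
        for module in report["fixed_modules"]:
            noisy[module] = "clean" if noisy[module] == "defective" else "defective"
    return noisy

def random_noisy(labels, n_flips, seed=0):
    noisy = dict(labels)
    rng = random.Random(seed)
    for module in rng.sample(sorted(noisy), n_flips):
        noisy[module] = "clean" if noisy[module] == "defective" else "defective"
    return noisy

clean = {"M1": "defective", "M2": "defective", "M3": "defective", "M4": "clean"}
mislabelled = [{"id": "#2", "fixed_modules": ["M3"]}]  # oracle: #2 is mislabelled

realistic = realistic_noisy(clean, mislabelled)   # M3 becomes clean
randomized = random_noisy(clean, n_flips=1)       # one arbitrary label flipped
```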
Generate the performance of clean models and noisy models
A defect model is trained on each sample, and the resulting performances are compared:
Performance ratio = performance of realistic noisy model / performance of clean model.
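The ratio is a plain quotient, as in this small sketch; the numbers below are illustrative, not the paper's measurements.

```python
# Sketch: the performance ratio compares a model trained on noisy data with
# one trained on clean data; a ratio of 1 means mislabelling had no impact.
# The numbers below are illustrative, not the paper's measurements.

def performance_ratio(noisy_performance, clean_performance):
    return noisy_performance / clean_performance

recall_ratio = performance_ratio(noisy_performance=0.45, clean_performance=0.80)
precision_ratio = performance_ratio(noisy_performance=0.78, clean_performance=0.80)
# recall_ratio well below 1 -> recall is impacted;
# precision_ratio near 1   -> precision is largely unaffected.
```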
While the recall is often impacted, the precision is rarely impacted
Ratio = realistic noisy / clean; a ratio of 1 means there is no impact.
[Figure: distributions of precision and recall ratios, on a 0.0 to 2.0 scale.]
Precision is rarely impacted by realistic mislabelling, whereas models trained on noisy data achieve only 56% of the recall of models trained on clean data.
(RQ2) The impact on the performance: while the recall is often impacted, the precision is rarely impacted.
Generate the rank of metrics of clean models and noisy models
The clean model, the realistic noisy model, and the random noisy model each yield variable importance scores, which are then turned into a rank of metrics.
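Turning importance scores into ranks can be sketched as below. The paper derives ranks from a statistical grouping of the scores, so this near-tie grouping and the scores themselves are simplified stand-ins.

```python
# Sketch: turn variable-importance scores into a rank of metrics. Ranks are
# assigned by sorting scores and grouping near-ties; the paper uses a
# statistical grouping, so this is a simplified stand-in with made-up scores.

def rank_metrics(scores, tol=0.05):
    """Return {metric: rank}; metrics within `tol` of the group leader share a rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks, rank, leader = {}, 0, None
    for metric in ordered:
        if leader is None or leader - scores[metric] > tol:
            rank += 1
            leader = scores[metric]
        ranks[metric] = rank
    return ranks

importance = {"size": 0.40, "complexity": 0.38, "churn": 0.15, "age": 0.07}
ranks = rank_metrics(importance)
# size and complexity share the 1st rank; churn is 2nd; age is 3rd.
```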
Does a metric of the clean model appear at the same rank in the noisy models?
We compare the rank of metrics of the clean model against the rank of metrics of the noisy models.
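This comparison amounts to a per-rank agreement measure, sketched below with illustrative ranks (not the study's metrics).

```python
# Sketch: measure how often a metric in a given rank of the clean model
# appears at the same rank in a noisy model. Ranks below are illustrative.

def same_rank_agreement(clean_ranks, noisy_ranks, rank):
    clean_at = {m for m, r in clean_ranks.items() if r == rank}
    if not clean_at:
        return 0.0
    kept = {m for m in clean_at if noisy_ranks.get(m) == rank}
    return len(kept) / len(clean_at)

clean_ranks = {"size": 1, "complexity": 1, "churn": 2, "age": 3}
noisy_ranks = {"size": 1, "complexity": 1, "churn": 3, "age": 2}

top = same_rank_agreement(clean_ranks, noisy_ranks, rank=1)    # 1st rank is stable
lower = same_rank_agreement(clean_ranks, noisy_ranks, rank=2)  # lower ranks shift
```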
Only the metrics in the 1st rank are robust to the mislabelling
85% of the metrics in the 1st rank of the clean model also appear in the 1st rank of the noisy model.
Conversely, the metrics in the 2nd and 3rd ranks are less stable
As little as 18% of the metrics in the 2nd and 3rd ranks of the clean models appear in the same rank in the noisy models.
(RQ3) The impact on the interpretation: only the top-rank metrics are robust to the mislabelling.
Suggestions
- Researchers can use our noise models to clean mislabelled issue reports.
- Cleaning the data will improve the ability to identify defective modules.
- Quality improvement plans should be based on the top-rank metrics.
Conclusions
- Issue reports are mislabelled: fields in issue tracking systems are often missing or incorrect [Aranda et al., ICSE 2009], and 43% of issue reports are mislabelled [Herzig et al., ICSE 2013; Antoniol et al., CASCON 2008], yet prior work assumed that mislabelling is random [Kim et al., ICSE 2011; Seiffert et al., Information Science 2014].
- Findings: mislabelling is non-random; while the recall is often impacted, the precision is rarely impacted; only the top-rank metrics are robust to the mislabelling.
- Suggestions: researchers can use our noise models to clean mislabelled issue reports; cleaning the data will improve the ability to identify defective modules; quality improvement plans should be based on the top-rank metrics.
@klainfo kla@chakkrit.com

Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
 
Revisiting the Experimental Design Choices for Approaches for the Automated R...
Revisiting the Experimental Design Choices for Approaches for the Automated R...Revisiting the Experimental Design Choices for Approaches for the Automated R...
Revisiting the Experimental Design Choices for Approaches for the Automated R...
 
Measuring Program Comprehension: A Large-Scale Field Study with Professionals
Measuring Program Comprehension: A Large-Scale Field Study with ProfessionalsMeasuring Program Comprehension: A Large-Scale Field Study with Professionals
Measuring Program Comprehension: A Large-Scale Field Study with Professionals
 

Presentation

  • 1. The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models Chakkrit (Kla)
 Tantithamthavorn Shane McIntosh Ahmed E. Hassan Akinori Ihara Kenichi Matsumoto @klainfo kla@chakkrit.com
  • 3. Software defects are costly Monetary NIST estimates that software defects cost the US economy $59.5 billion per year! 2
  • 4. Software defects are costly Reputation The Obama administration will always be connected to healthcare.gov Monetary NIST estimates that software defects cost the US economy $59.5 billion per year! 2
  • 5. SQA teams try to find defects 
 before they escape to the field 3
  • 6. SQA teams have limited resources 4 Limited
 QA Resources Software continues to grow 
 in size and complexity
  • 7. 5 Defect prediction models help
 SQA teams to
  • 8. 5 Defect prediction models help
 SQA teams to Predict
 what are risky modules
  • 9. 5 Defect prediction models help
 SQA teams to Predict
 what are risky modules Understand 
 what makes software fail
  • 10. Modules that are fixed during post-release development are set as defective 6 Changes Release Date Issues Post-Release Snapshot at the release date Defect Dataset
  • 11. Modules that are fixed during post-release development are set as defective 6 Changes Release Date Issues Post-Release Module 1 Module 2 Module 3 Module 4 Snapshot at the release date Defect Dataset
  • 13. Modules that are fixed during post-release development are set as defective 6 Changes Release Date Issues Post-Release Module 1 Module 2 Module 3 Module 4 Bug Report#1 Snapshot at the release date Defect Dataset
  • 14. Modules that are fixed during post-release development are set as defective 6 Changes Release Date Issues Post-Release Module 1 Module 2 Module 3 Module 4 Fixed
 Module1, Module2 Bug Report#1 Snapshot at the release date Defect Dataset
  • 15. Modules that are fixed during post-release development are set as defective 6 Changes Release Date Issues Post-Release Module 1 Module 2 Module 3 Module 4 Fixed
 Module1, Module2 Bug Report#1 Snapshot at the release date Label as Defective Defect Dataset
  • 16. Modules that are fixed during post-release development are set as defective 6 Changes Release Date Issues Post-Release Module 1 Module 2 Module 3 Module 4 Fixed
 Module1, Module2 Bug Report#1 Snapshot at the release date Label as Clean Label as Defective Defect Dataset
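The labelling step on these slides can be sketched in a few lines. This is a minimal illustration using the hypothetical module and bug-report names from the figure, not the authors' actual tooling:

```python
# Hypothetical snapshot of modules at the release date.
modules = ["Module1", "Module2", "Module3", "Module4"]

# Hypothetical post-release bug reports, each listing the modules its fix touched.
bug_reports = {"Bug Report#1": ["Module1", "Module2"]}

# A module is labelled defective if any post-release fix touched it, clean otherwise.
fixed = {m for touched in bug_reports.values() for m in touched}
labels = {m: ("defective" if m in fixed else "clean") for m in modules}
```

On this toy data, Module1 and Module2 are labelled defective and Module3 and Module4 clean, mirroring the figure.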
  • 17. Defect models are trained 
 using Machine Learning 7 Module 1 Module 2 Module 3 Module 4 Defect Dataset
  • 18. Defect models are trained 
 using Machine Learning 7 Module 1 Module 2 Module 3 Module 4 Defect Dataset Machine Learning or
 Statistical Learning Defect 
 model
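A defect model can be any classifier trained on module metrics. As a stand-in for the machine or statistical learners named on this slide, here is a minimal one-metric threshold classifier (a "decision stump") trained on hypothetical (lines_of_code, defective) rows; the data and the choice of learner are illustrative assumptions:

```python
# Hypothetical training data: (lines_of_code, defective?) per module.
data = [(30, False), (80, False), (120, True), (450, True), (500, True)]

def train_stump(rows):
    # Pick the LOC threshold that maximises training accuracy.
    best_t, best_acc = None, -1.0
    for t, _ in rows:
        acc = sum((loc >= t) == label for loc, label in rows) / len(rows)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

threshold = train_stump(data)

def predict(loc):
    # Classify a module as risky when it is at least as large as the threshold.
    return loc >= threshold
```

On this toy data the learned threshold is 120 lines, so larger modules are predicted risky.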
  • 19. Defect data are noisy The reliability of the models depends on the quality of the training data 8 Module 1 Module 2 Module 3 Module 4 Defect Dataset Machine Learning or
 Statistical Learning Defect 
 model NOISY Unreliable
  • 20. Issue reports are mislabelled 9 Fixed
 Module1, Module2 Bug Report#1 Fields in issue tracking systems are often missing 
 or incorrect. [Aranda et al., ICSE 2009] Actual Classify Meaning Defect
 Mislabelling A new feature could be incorrectly labeled as a bug Non-Defect
 Mislabelling A bug could be mislabelled as a new feature
  • 21. Issue reports are mislabelled 10 Fixed
 Module1, Module2 Bug Report#1 43% of issue reports are mislabelled. [Herzig et al., ICSE 2013] [Antoniol et al., CASCON 2008] Fields in issue tracking systems are often missing 
 or incorrect. [Aranda et al., ICSE 2009] Actual Classify Meaning Defect
 Mislabelling A new feature could be incorrectly labeled as a bug Non-Defect
 Mislabelling A bug could be mislabelled as a new feature
  • 23. Then, modules are mislabelled 12 #1 Actual Classify Meaning Defect
 Mislabelling A new feature could be incorrectly labeled as a bug M1 M2 M4 NOISY DATA M1,M2 M3 #1
  • 24. Then, modules are mislabelled 12 #1 Actual Classify Meaning Defect
 Mislabelling A new feature could be incorrectly labeled as a bug M1 M2 M4 NOISY DATA M1,M2 #2 M3 M3 #1
  • 25. Then, modules are mislabelled 12 #1 Actual Classify Meaning Defect
 Mislabelling A new feature could be incorrectly labeled as a bug M1 M2 M4 NOISY DATA M1,M2 #2 M3 #2 is mislabelled. M3 #1
  • 26. Then, modules are mislabelled 12 #1 Actual Classify Meaning Defect
 Mislabelling A new feature could be incorrectly labeled as a bug M1 M2 M4 NOISY DATA M1,M2 #2 M3 #2 is mislabelled. M3M3 M3 should be 
 a clean module #2 #1
  • 27. 13 Mislabelling may impact the performance Prior works assumed that mislabelling is random [Kim et al., ICSE 2011] and 
 [Seiffert et al., Information Science 2014] Random mislabelling has a negative impact 
 on the performance.
  • 28. 14 Mislabelling is likely non-random We suspect that novice developers are likely to mislabel more than experienced developers. Novice developers are known to overlook bookkeeping issues [Bachmann et al., FSE 2010]
  • 29. (RQ1) The Nature of Mislabelling The impact of realistic mislabelling on the performance and interpretation of defect models 15
  • 30. (RQ1) The Nature of Mislabelling The impact of realistic mislabelling on the performance and interpretation of defect models (RQ3) Its Impact 
 on the Interpretation Defect 
 model (RQ2) Its Impact 
 on the Performance 15
  • 31. Using prediction models to classify
 whether issue reports are mislabelled 16 Prediction 
 Model
  • 32. Using prediction models to classify
 whether issue reports are mislabelled 16 Prediction 
 Model Mislabelling is predictable Performs 
 Well
  • 33. Using prediction models to classify
 whether issue reports are mislabelled 16 Prediction 
 Model Mislabelling is random Performs 
 Poorly Mislabelling is predictable Performs 
 Well
  • 34. Selecting our studied systems 17 Manually-curated issue reports [Herzig et al., ICSE 2013]
  • 35. Mislabelling is non-random [Bar chart of precision, recall, and F-measure for our model vs. random guessing. Jackrabbit: 0.78/0.64/0.70 vs. 0.12/0.50/0.19; Lucene: 0.75/0.71/0.73 vs. 0.12/0.50/0.19.] 18
  • 39. Our models achieve a mean F-measure of up to 0.73, which is 4 to 34 times better than random guessing. 20
  • 40. (RQ1) The Nature of Mislabelling The impact of realistic mislabelling on the performance and interpretation of defect models 21
  • 41. (RQ1) The Nature of Mislabelling The impact of realistic mislabelling on the performance and interpretation of defect models 21 Mislabelling is non-random
  • 42. (RQ1) The Nature of Mislabelling The impact of realistic mislabelling on the performance and interpretation of defect models (RQ2) The Impact 
 on the Performance 21 Mislabelling is non-random
  • 43. 22 Compare the performance between clean models and noisy models Clean 
 Performance Realistic Noisy
 Performance Random Noisy
 Performance VS VS
  • 44. Generating three samples 23 Clean M1 M2 M3 M4 (Oracle)
 #2 is mislabelled Clean
 Sample #2 #1
  • 45. Generating three samples 24 Add Noise M1 M2 M3 M4 Clean M1 M2 M3 M4 (Oracle)
 #2 is mislabelled #2 #1 Clean
 Sample Realistic 
 Noisy
 Sample
  • 46. Realistically flip the labels of the modules that are addressed by the mislabelled issue reports. Generating three samples 24 Add Noise M1 M2 M3 M4 Clean M1 M2 M3 M4 (Oracle)
 #2 is mislabelled #2 #1 Clean
 Sample Realistic 
 Noisy
 Sample
  • 47. Realistically flip the labels of the modules that are addressed by the mislabelled issue reports. Generating three samples 25 Add Noise M1 M2 M3 M4 Random 
 Noisy
 Sample M1 M2 M3 M4 Add Noise Clean M1 M2 M3 M4 (Oracle)
 #2 is mislabelled #2 #1 Clean
 Sample Realistic 
 Noisy
 Sample
  • 48. Randomly flip the modules’ labels. Realistically flip the labels of the modules that are addressed by the mislabelled issue reports. Generating three samples 25 Add Noise M1 M2 M3 M4 Random 
 Noisy
 Sample M1 M2 M3 M4 Add Noise Clean M1 M2 M3 M4 (Oracle)
 #2 is mislabelled #2 #1 Clean
 Sample Realistic 
 Noisy
 Sample
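The two noise-injection schemes described above can be sketched as follows. The labels and the assumption that mislabelled report #2 touched M3 are hypothetical:

```python
import random

# Oracle (clean) labels: True means defective.
clean = {"M1": True, "M2": True, "M3": True, "M4": False}

def realistic_noise(labels, modules_of_mislabelled_reports):
    # Flip only the labels of modules addressed by mislabelled issue reports.
    noisy = dict(labels)
    for m in modules_of_mislabelled_reports:
        noisy[m] = not noisy[m]
    return noisy

def random_noise(labels, n_flips, seed=0):
    # Flip the same number of labels, but on modules chosen at random.
    noisy = dict(labels)
    for m in random.Random(seed).sample(sorted(noisy), n_flips):
        noisy[m] = not noisy[m]
    return noisy

realistic = realistic_noise(clean, ["M3"])   # mislabelled report #2 touched M3
randomised = random_noise(clean, n_flips=1)
```

Both samples contain the same amount of noise; they differ only in which modules get flipped.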
  • 49. 26 Clean 
 Performance Realistic Noisy
 Performance Random Noisy
 Performance Clean 
 Sample Realistic Noisy
 Sample Random Noisy
 Sample VS VS Defect 
 model Defect 
 model Defect 
 model Generate the performance of
 clean models and noisy models
  • 50. 26 Clean 
 Performance Realistic Noisy
 Performance Random Noisy
 Performance Clean 
 Sample Realistic Noisy
 Sample Random Noisy
 Sample VS VS Defect 
 model Defect 
 model Defect 
 model Performance 
 Ratio = (Performance of Realistic Noisy Model) / (Performance of Clean Model) Generate the performance of
 clean models and noisy models
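The performance ratio above is simply the noisy model's score over the clean model's score. A minimal sketch with hypothetical confusion-matrix counts, chosen so the outcome mirrors the deck's finding (precision unaffected, recall degraded):

```python
def precision_recall(tp, fp, fn):
    # Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN).
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts for a clean model and a realistic noisy model.
clean_precision, clean_recall = precision_recall(tp=40, fp=10, fn=10)
noisy_precision, noisy_recall = precision_recall(tp=28, fp=7, fn=22)

# A ratio of 1 means mislabelling had no impact on that measure.
precision_ratio = noisy_precision / clean_precision
recall_ratio = noisy_recall / clean_recall
```

Here both models have precision 0.8 (ratio 1.0), while the noisy model recovers only 70% of the clean model's recall.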
  • 51. While the recall is often impacted, the precision is rarely impacted. [Boxplots of the ratio (Realistic Noisy / Clean) for precision and recall; a ratio of 1 means there is no impact.] Precision is rarely impacted by realistic mislabelling. Models trained on noisy data achieve 56% of the recall of models trained on clean data. 27
  • 54. (RQ1) The Nature of Mislabelling The impact of realistic mislabelling on the performance and interpretation of defect models (RQ2) The Impact 
 on the Performance 28 Mislabelling is non-random
  • 55. (RQ1) The Nature of Mislabelling The impact of realistic mislabelling on the performance and interpretation of defect models (RQ2) The Impact 
 on the Performance 28 Mislabelling is non-random While the recall is often impacted, the precision is rarely impacted
  • 56. (RQ1) The Nature of Mislabelling The impact of realistic mislabelling on the performance and interpretation of defect models (RQ3) The Impact 
 on the Interpretation Defect 
 model (RQ2) The Impact 
 on the Performance 28 Mislabelling is non-random While the recall is often impacted, the precision is rarely impacted
  • 57. 29 Generate the rank of metrics of
 clean models and noisy models Clean 
 model Realistic noisy model Variable Importance
 Scores Variable Importance
 Scores Variable Importance
 Scores Random noisy model
  • 58. 30 Generate the rank of metrics of
 clean models and noisy models Clean 
 model Realistic noisy model Variable Importance
 Scores Variable Importance
 Scores Variable Importance
 Scores Rank of metrics Ranking Ranking Ranking Rank of metrics Rank of metrics Random noisy model
  • 59. 31 2 1 3 Clean Model Rank of metrics 
 of the clean model Does a metric of the clean model appear at the same rank in the noisy models?
  • 60. 31 2 1 3 Clean Model Rank of metrics 
 of the clean model Noisy Model 2 1 3 ? Rank of metrics 
 of the noisy model Does a metric of the clean model appear at the same rank in the noisy models?
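Comparing ranks across models can be sketched as a per-rank set overlap. The metric names and rank assignments below are hypothetical, chosen so that the top rank agrees while the lower ranks do not:

```python
# rank -> set of metrics at that rank (e.g., derived from variable-importance scores).
clean_ranks = {1: {"size"}, 2: {"complexity", "churn"}, 3: {"coupling"}}
noisy_ranks = {1: {"size"}, 2: {"coupling"}, 3: {"complexity", "churn"}}

def same_rank_fraction(clean, noisy, rank):
    # Fraction of the clean model's metrics at `rank` that keep that rank
    # in the noisy model.
    return len(clean[rank] & noisy[rank]) / len(clean[rank])

top_rank_agreement = same_rank_fraction(clean_ranks, noisy_ranks, 1)
second_rank_agreement = same_rank_fraction(clean_ranks, noisy_ranks, 2)
```

On this toy ranking, the rank-1 metric is stable while the rank-2 metrics all shift, echoing the deck's 85% vs. 18% observation.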
  • 61. 32 Only the metrics in the 1st rank are robust to the mislabelling 2 1 3 Clean Model Noisy Model 2 1 3 85% of the metrics in the 1st rank of the clean model 
 also appear in the 1st rank of the noisy model.
  • 62. 33 Conversely, the metrics in the 
 2nd and 3rd ranks are less stable 2 1 3 Clean Model Noisy Model 2 1 3 As little as 18% of the metrics in the 2nd and 3rd rank of the clean models appear in the same rank in the noisy models
  • 63. (RQ1) The Nature of Mislabelling The impact of realistic mislabelling on the performance and interpretation of defect models (RQ3) The Impact 
 on the Interpretation Defect 
 model (RQ2) The Impact 
 on the Performance 34 Mislabelling is non-random While the recall is often impacted, the precision is rarely impacted
  • 64. (RQ1) The Nature of Mislabelling The impact of realistic mislabelling on the performance and interpretation of defect models (RQ3) The Impact 
 on the Interpretation Defect 
 model (RQ2) The Impact 
 on the Performance 34 Mislabelling is non-random While the recall is often impacted, the precision is rarely impacted Only top-rank metrics are robust to the mislabelling
  • 65. (RQ1) The Nature of Mislabelling Suggestions (RQ3) The Impact 
 on the Interpretation Defect 
 model (RQ2) The Impact 
 on the Performance 35
  • 66. (RQ1) The Nature of Mislabelling Suggestions (RQ3) The Impact 
 on the Interpretation Defect 
 model (RQ2) The Impact 
 on the Performance 35 Researchers can use our noise models to clean mislabelled issue reports
  • 67. (RQ1) The Nature of Mislabelling Suggestions (RQ3) The Impact 
 on the Interpretation Defect 
 model (RQ2) The Impact 
 on the Performance 35 Researchers can use our noise models to clean mislabelled issue reports Cleaning data will improve the ability to identify defective modules
  • 68. (RQ1) The Nature of Mislabelling Suggestions (RQ3) The Impact 
 on the Interpretation Defect 
 model (RQ2) The Impact 
 on the Performance 35 Researchers can use our noise models to clean mislabelled issue reports Cleaning data will improve the ability to identify defective modules Quality improvement plans should be based on the top-rank metrics
  • 69. 36
  • 70. Issue reports are mislabelled 37 Fixed
 Module1, Module2 Bug Report#1 43% of issue reports are mislabelled. [Herzig et al., ICSE 2013] [Antoniol et al., CASCON 2008] Fields in issue tracking systems are often missing 
 or incorrect. [Aranda et al., ICSE 2009] Actual Classify Meaning Defect
 Mislabelling A new feature could be incorrectly labeled as a bug Non-Defect
 Mislabelling A bug could be mislabelled as a new feature
  • 71. 38 Mislabelling may impact the performance Prior works assumed that mislabelling is random [Kim et al., ICSE 2011] and 
 [Seiffert et al., Information Science 2014] Random mislabelling has a negative impact 
 on the performance.
  • 72. (RQ1) The Nature of Mislabelling Findings (RQ3) The Impact 
 on the Interpretation Defect 
 model (RQ2) The Impact 
 on the Performance 39 Mislabelling is non-random While the recall is often impacted, the precision is rarely impacted. Only top-rank metrics are robust to the mislabelling
  • 73. (RQ1) The Nature of Mislabelling Suggestions (RQ3) The Impact 
 on the Interpretation Defect 
 model (RQ2) The Impact 
 on the Performance 40 Researchers can use our noise models to clean mislabelled issue reports Cleaning data will improve the ability to identify defective modules Quality improvement plans should be based on the top-rank metrics
  • 74. Summary: Issue reports are mislabelled (fields in issue tracking systems are often missing or incorrect [Aranda et al., ICSE 2009]; 43% of issue reports are mislabelled [Herzig et al., ICSE 2013] [Antoniol et al., CASCON 2008]). Prior work assumed that mislabelling is random [Kim et al., ICSE 2011] [Seiffert et al., Information Science 2014]. Findings: mislabelling is non-random; while the recall is often impacted, the precision is rarely impacted; only top-rank metrics are robust to the mislabelling. Suggestions: researchers can use our noise models to clean mislabelled issue reports; cleaning data will improve the ability to identify defective modules; quality improvement plans should be based on the top-rank metrics. @klainfo kla@chakkrit.com