SMU Classification: Restricted
On the Unreliability of Bug Severity Data
Yuan TIAN
Data Scientist at Living Analytics Research Centre,
Singapore Management University
ytian@smu.edu.sg
April 18th, 2018 @ Queen’s University, Canada
2
Supervised Machine Learning models rely heavily on labels
[Diagram: labelled training data (e.g., Cat / Not Cat) passes through feature extraction and learning to produce a predictive model; the model assigns an expected label to new data]
3
Traditional machine learning models suffer from noisy labels in the training set
Noisy labels due to:
• Human mistakes.
• Non-expert generated labels.
• Machine generated labels.
• Communication or encoding problems.
4
Traditional machine learning models suffer from noisy labels in the training set (cont.)
Consequence: decreased performance.
[Figure residue: Noise Level = 3%; F1 = 0.83; F1 = 0.4]
5
Traditional machine learning models suffer from noisy labels in the training set (cont.)
Consequences of noisy labels:
• Decreased performance
• More complex model
• Unreliable performance measures
• Incorrect influential features
6
Traditional machine learning models suffer from noisy labels in the training set (cont.)
Traditional approaches assume that an absolute ground truth exists for every label, even though the observed labels may be noisy for some reason.
7
Classification is in some cases subjective, which results in inter-labeller variability
Examples: medical diagnosis, malware detection, image tagging.
This variability produces “inconsistent labels” noise.
8
Classification is in some cases subjective, which results in inter-labeller variability (cont.)
“Inconsistent Labels” Noise
• How interesting is this book?
• How informative is this tweet?
• Is it a high-quality product?
• How severe is the problem?
• User-created tags for images, content, etc.
• Job titles created by different companies
[Label types shown: numeric label, categorical label]
9
How people cope with label noise (“inconsistent labels”) caused by subjective labelling, etc.
Two questions: How to measure inconsistency? How to cope with inconsistency?
Before collecting labels:
• Shared labelling criteria
• Multiple labellers
• Repeated labelling
• Labels are agreed on by all/majority
Readily available data, multiple labellers:
• Pairwise comparison
• Averaging
• Majority voting
• Consensus voting
• Remove outliers
• Learning with uncertain labels
Readily available data, single labeller: ?
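Three of the coping strategies listed on this slide (majority voting, consensus voting, and averaging) can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the function names and the sample ratings are hypothetical, and returning `None` when no strict majority exists is an assumption about how ties should be handled.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the label chosen by more than half of the labellers, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def consensus_vote(labels):
    """Return the label only if every labeller agrees, else None."""
    return labels[0] if len(set(labels)) == 1 else None


def average_label(labels):
    """Average numeric labels (e.g., ratings) into a single score."""
    return mean(labels)


# Hypothetical severity ratings from three labellers for two bug reports.
report_a = [3, 3, 4]
report_b = [2, 3, 5]

print(majority_vote(report_a))   # 3
print(majority_vote(report_b))   # None (no strict majority)
print(consensus_vote(report_a))  # None (labellers disagree)
print(average_label(report_b))   # about 3.33
```

All three need several labels per item, which is why the single-labeller case on this slide is left as an open question.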
10
How people cope with label noise (“inconsistent labels”) caused by subjective labelling, etc. (cont.)
Community intelligence
• How to measure inconsistency?
• How to cope with inconsistency?
11
Talk Outline
Part 1:
• Noisy labels negatively impact learning
• Overview of approaches for coping with inconsistent labels
• Sample: bug severity levels
Part 2:
• Future research direction: Big Data to Thick (high-quality) Data
12
A bug report filed by a tester or user contains:
• Summary
• Description
• Product, component, version
• Severity: 1: Blocker, 2: Critical, 3: Major, 4: Minor, 5: Trivial
Severity is assigned to reflect the level of impact that a bug has on the system.
13
Prior work in automated bug severity labelling
14
Prior work in automated bug severity labelling (cont.)
Existing approaches assume that all assigned severity levels are consistent.
Classifiers are evaluated using accuracy, F-measure, and AUC.
15
Challenges and our solutions
• How to measure label inconsistency?
• How to evaluate machine learning models with inconsistent labels? Compare human-machine agreement with human-human agreement (human-machine / human-human).
16
How to measure the inconsistency?
Bug handling process: Bug Detection & Reporting, then Bug Triaging, then Debugging & Bug Fixing.
Bug triaging covers:
• Validity check
• Duplicate bug detection
• Bug prioritization
• Bug assignment
Duplicate bugs should have the same severity level if “severity level” labels are consistent.
17
Up to 51% of human-assigned bug severity labels are inconsistent!
[Figure: duplicate buckets split into inconsistent duplicate buckets (reports with differing severity levels, e.g., Blocker, Critical, Major, Minor) and clean duplicate buckets]
Manual verification of 1,394 bug reports (a statistically representative sample): 95% of the inconsistent buckets report the same bug.
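The duplicate-bucket idea can be sketched as follows: group duplicate reports into buckets, flag a bucket as inconsistent when its reports carry more than one severity level, and count the share of reports that sit in such buckets. The bucket ids and labels below are made up, and counting every report in an inconsistent bucket as inconsistent is an assumption; the transcript does not preserve the exact definition behind the 51% figure.

```python
def inconsistency_stats(buckets):
    """buckets: dict mapping a bucket id to the severity labels of its duplicate reports."""
    inconsistent = {
        bucket_id: labels
        for bucket_id, labels in buckets.items()
        if len(set(labels)) > 1  # duplicates disagree on severity
    }
    total_reports = sum(len(labels) for labels in buckets.values())
    inconsistent_reports = sum(len(labels) for labels in inconsistent.values())
    return {
        "inconsistent_buckets": sorted(inconsistent),
        "inconsistent_report_ratio": inconsistent_reports / total_reports,
    }


# Hypothetical buckets; ids and labels are for illustration only.
buckets = {
    "bucket-1": ["major", "major"],
    "bucket-2": ["blocker", "critical", "minor"],
    "bucket-3": ["minor", "minor", "minor"],
}
stats = inconsistency_stats(buckets)
print(stats["inconsistent_buckets"])       # ['bucket-2']
print(stats["inconsistent_report_ratio"])  # 0.375
```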
18
A new evaluation measure for machine learning tasks with inconsistent labels: Krippendorff's alpha
Compare human-machine agreement with human-human agreement (human-machine / human-human).
$$\alpha = 1 - \frac{D_o}{D_e}$$
where $D_o$ is the observed disagreement and $D_e$ is the disagreement expected when the bug severity levels are randomly assigned.
$\alpha = 1$ is regarded as perfect agreement.
Benefits of alpha:
• Allows multiple labellers
• Good for ordinal labels
• Factors in class distributions
• Less biased by the number of labels and the number of coders
19
Krippendorff’s Alpha vs. Accuracy
Bug Report   1  2  3  4  5  6  7  8  9  10
Human        1  2  3  3  4  3  3  4  3  5
Machine A    2  3  3  3  3  3  3  5  3  2
Machine B    3  4  3  3  3  3  3  5  3  2
Machine C    3  3  3  3  3  3  3  3  3  3
Accuracy: Machine A = Machine B = Machine C
Alpha: Machine A > Machine B > Machine C
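The accuracy tie in this table is easy to check by hand or in code. The sketch below copies the four rating rows from the slide and computes plain exact-match accuracy; the `accuracy` helper is an illustrative name, not something from the talk.

```python
human = [1, 2, 3, 3, 4, 3, 3, 4, 3, 5]
machines = {
    "A": [2, 3, 3, 3, 3, 3, 3, 5, 3, 2],
    "B": [3, 4, 3, 3, 3, 3, 3, 5, 3, 2],
    "C": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
}


def accuracy(truth, predicted):
    """Fraction of items where the predicted label equals the reference label."""
    matches = sum(t == p for t, p in zip(truth, predicted))
    return matches / len(truth)


scores = {name: accuracy(human, labels) for name, labels in machines.items()}
print(scores)  # {'A': 0.5, 'B': 0.5, 'C': 0.5}
```

Accuracy only counts exact matches, so it cannot tell a prediction that is one severity level off from one that is three levels off. A distance-aware measure such as alpha can.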
20
Low agreement between machine learning systems and humans might be due to data inconsistency!
Krippendorff's alpha, human-human vs. human-machine (REP+kNN); agreement ratio = human-machine / human-human:
Dataset      Human-Human   Human-Machine (REP+kNN)   Agreement Ratio
OpenOffice   0.538         0.415                     0.77
Mozilla      0.675         0.556                     0.82
Eclipse      0.595         0.510                     0.86
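The agreement ratio column follows directly from the two alpha columns. The short sketch below recomputes it from the values on this slide; the dictionary layout and function name are illustrative choices.

```python
# Alpha values copied from the table on this slide.
alpha = {
    "OpenOffice": {"human_human": 0.538, "human_machine": 0.415},
    "Mozilla": {"human_human": 0.675, "human_machine": 0.556},
    "Eclipse": {"human_human": 0.595, "human_machine": 0.510},
}


def agreement_ratio(human_machine, human_human):
    """Human-machine agreement expressed as a fraction of human-human agreement."""
    return human_machine / human_human


ratios = {
    dataset: round(agreement_ratio(v["human_machine"], v["human_human"]), 2)
    for dataset, v in alpha.items()
}
print(ratios)  # {'OpenOffice': 0.77, 'Mozilla': 0.82, 'Eclipse': 0.86}
```

Read this way, the classifier reaches roughly 77% to 86% of the agreement that humans reach with each other on the same data.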
21
Take-away messages
• Community intelligence can be used to identify and quantify the inconsistency of subjective labels.
• Performance of machine learning models should be measured within context, e.g., relative to human inter-labeller agreement (human-machine / human-human).
22
Big Data is everywhere, however…
During 2017, every minute of the day:
• 4M tweets
• 36M Google searches
• 600 page edits
• 0.4M trips
• 120 new professionals
• $258,751.90 in sales
Bad data is costing organizations some $3.1 trillion a year in the US alone.
83% of companies said their revenue is affected by inaccurate and incomplete customer or prospect data.
23
Big Data vs. Thick Data (high-quality data, but expensive to collect and hard to scale)
[Chart: Big Data and Thick Data plotted by size of data and quality]
24
Make big data “thick”
[Chart: Big Data and Thick Data plotted by size of data and quality]
25
Make big data “thick” (cont.)
Challenges:
1. Impossible to specify all the data semantics beforehand.
2. Manual labelling of noise is expensive, time-consuming, and impossible to scale.
3. Lack of quality metrics for big data.
4. Lack of performance measures for noisy data.
26
How to make big data thick in a lightweight, cost-effective way?
• Integrate existing internal/external resources collectively created by relevant communities.
• Utilize knowledge in unstructured data.
27
Future Roadmap: Big Data to Thick Data
• Assessing quality of big data
• Lightweight, scalable noise reduction/correction techniques
• New noise-tolerant learning algorithms for big data
• New performance measures for noisy data
28
Conclusion
• Classification is in some cases subjective, which results in inter-labeller variability.
• New measure for machine learning performance with inconsistent labels (human-machine / human-human).
Challenges: Big Data to Thick Data
• Need metrics for quality assessment and model evaluation
• Need lightweight noise reduction methods
29
Backup Slides
30
Computation of Krippendorff's alpha
Bucket/Bug Report   1  2  3  4  5  6  7  8  9  10
Ratings (rater 1)   1  2  3  3  4  3  3  4  3  5
Ratings (rater 2)   2  3  3  3  3  3  3  5  3  2
#Raters             2  2  2  2  2  2  2  2  2  2
Steps: derive the count matrix from the ratings, where (c,k) represents a pair of ratings, then weight it with the predefined distance matrix. The distance matrix is why alpha is good for ordinal labels!
Result: α = 0.27
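The computation on this slide can be sketched in plain Python: build the coincidence (count) matrix of rating pairs (c,k), weight it with a distance function, and compare observed with expected disagreement. The slide's predefined distance matrix is not preserved in this transcript, so the squared difference used below is an assumption; with that assumption, the two rating rows on this slide give an alpha of about 0.27.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, distance=lambda c, k: (c - k) ** 2):
    """units: one list of ratings per item. distance: assumed squared difference."""
    # Coincidence counts o[(c, k)]: every ordered pair of ratings within a
    # unit contributes 1 / (m - 1), where m is the number of ratings in it.
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating cannot be paired with anything
        for i, j in permutations(range(m), 2):
            coincidences[(ratings[i], ratings[j])] += 1.0 / (m - 1)

    # Marginal totals n_c per label and the grand total n.
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w * distance(c, k) for (c, k), w in coincidences.items()) / n
    expected = sum(
        totals[c] * totals[k] * distance(c, k)
        for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected


rater_1 = [1, 2, 3, 3, 4, 3, 3, 4, 3, 5]
rater_2 = [2, 3, 3, 3, 3, 3, 3, 5, 3, 2]
alpha = krippendorff_alpha([list(pair) for pair in zip(rater_1, rater_2)])
print(round(alpha, 2))  # 0.27
```

Swapping in a different `distance` function changes how strongly large severity gaps are penalized, which is the property the slide highlights for ordinal labels.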
31
How do inconsistent labels affect various machine learning models?
[Diagram: duplicate bug reports are split into clean bug reports and inconsistent bug reports; test bug reports are held out; training sets are built at a chosen inconsistent data ratio, e.g., 0% and 20%]
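One way to build training sets at a chosen inconsistent data ratio is to sample from the clean and inconsistent pools separately and then mix them. This is only a sketch of the idea on this slide: the function name, the fixed training-set size, and the report ids are hypothetical, and the transcript does not say how the talk's experiment actually sampled.

```python
import random


def build_training_set(clean, inconsistent, ratio, size, seed=0):
    """Sample `size` reports so that a `ratio` fraction comes from `inconsistent`."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    n_inconsistent = round(size * ratio)
    n_clean = size - n_inconsistent
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(clean, n_clean) + rng.sample(inconsistent, n_inconsistent)
    rng.shuffle(sample)
    return sample


# Hypothetical report ids standing in for real bug reports.
clean_reports = [f"clean-{i}" for i in range(100)]
inconsistent_reports = [f"noisy-{i}" for i in range(100)]

for ratio in (0.0, 0.2):
    train = build_training_set(clean_reports, inconsistent_reports, ratio, size=50)
    noisy = sum(report.startswith("noisy") for report in train)
    print(ratio, noisy / len(train))
```

Keeping the test reports fixed while only the training ratio varies makes any change in the evaluation score attributable to the injected inconsistency.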
32
Noise injection leads to a drop in alpha for all datasets
[Figure: one panel per dataset (Mozilla, OpenOffice, Eclipse); x-axis: ratio of inconsistent bug reports in training set; series: Decision Tree, Naïve Bayes Multinomial, Support Vector Machine, REP + k-Nearest Neighbor]

8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
kalichargn70th171
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
ervikas4
 
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabhQuarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
aisafed42
 
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESINTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
anfaltahir1010
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
rodomar2
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Julian Hyde
 

Recently uploaded (20)

All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
 
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
如何办理(hull学位证书)英国赫尔大学毕业证硕士文凭原版一模一样
 
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
一比一原版(UMN毕业证)明尼苏达大学毕业证如何办理
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
 
The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024The Rising Future of CPaaS in the Middle East 2024
The Rising Future of CPaaS in the Middle East 2024
 
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
The Key to Digital Success_ A Comprehensive Guide to Continuous Testing Integ...
 
Kubernetes at Scale: Going Multi-Cluster with Istio
Kubernetes at Scale:  Going Multi-Cluster  with IstioKubernetes at Scale:  Going Multi-Cluster  with Istio
Kubernetes at Scale: Going Multi-Cluster with Istio
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
 
14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
 
What’s New in Odoo 17 – A Complete Roadmap
What’s New in Odoo 17 – A Complete RoadmapWhat’s New in Odoo 17 – A Complete Roadmap
What’s New in Odoo 17 – A Complete Roadmap
 
Oracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptxOracle 23c New Features For DBAs and Developers.pptx
Oracle 23c New Features For DBAs and Developers.pptx
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
 
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptxMigration From CH 1.0 to CH 2.0 and  Mule 4.6 & Java 17 Upgrade.pptx
Migration From CH 1.0 to CH 2.0 and Mule 4.6 & Java 17 Upgrade.pptx
 
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabhQuarter 3 SLRP grade 9.. gshajsbhhaheabh
Quarter 3 SLRP grade 9.. gshajsbhhaheabh
 
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLESINTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
INTRODUCTION TO AI CLASSICAL THEORY TARGETED EXAMPLES
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
 

On the Unreliability of Bug Severity Data

  • 1. SMU Classification: Restricted Strategic Partner: On the Unreliability of Bug Severity Data Yuan TIAN Data Scientist at Living Analytics Research Centre, Singapore Management University ytian@smu.edu.sg April 18th, 2018 @ Queen’s University, Canada
  • 2. Supervised Machine Learning models rely heavily on labels. Diagram: Training Data, Labels, Feature Extraction, Learning, Predictive Model, New Data, Expected Label (Cat / Not Cat).
  • 3. Traditional machine learning models suffer from noisy labels in training set. Noisy labels due to: • Human mistakes. • Non-expert generated labels. • Machine generated labels. • Communication or encoding problems.
  • 4. Traditional machine learning models suffer from noisy labels in training set. Noisy labels due to: • Human mistakes. • Non-expert generated labels. • Machine generated labels. • Communication or encoding problems. Consequence: Decreased Performance. Figure: Noise Level=3%; F1=0.83; F1=0.4.
  • 5. Traditional machine learning models suffer from noisy labels in training set. Noisy labels due to: • Human mistakes. • Non-expert generated labels. • Machine generated labels. • Communication or encoding problems. Consequences: Decreased Performance; More Complex Model; Unreliable Performance Measures; Incorrect Influential Features.
  • 6. Traditional machine learning models suffer from noisy labels in training set. Consequences: Decreased Performance; More Complex Model; Unreliable Performance Measures; Incorrect Influential Features. Assume that absolute ground truth for labels exists although the labels may be noisy for some reasons.
  • 7. Classification is in some cases subjective, which results in inter-labeller variability (“inconsistent labels” noise). Examples: Medical Diagnosis; Malware Detection; Image Tagging.
  • 8. “Inconsistent Labels” Noise: Classification is in some cases subjective, which results in inter-labeller variability (cont.). Numeric Label: How interesting is this book? How informative is this tweet? Is it a high quality product? How severe is the problem? Categorical Label: User created tags for images, content, etc. Job titles created by different companies.
  • 9. How people cope with label noise (“inconsistent labels”) caused by the subjective labelling, etc. Before collecting labels: • Shared labelling criteria • Multiple labelers • Repeated labelling • Labels are agreed by all/majority • Pairwise comparison. Readily available data, multiple labelers: • Averaging • Majority voting • Consensus voting • Remove outliers • Learning with uncertain labels. Readily available data, single labeler: ? How to measure inconsistency? How to cope with inconsistency?
  • 10. How people cope with label noise (“inconsistent labels”) caused by the subjective labelling, etc. Community Intelligence. How to measure inconsistency? How to cope with inconsistency?
  • 11. Talk Outline. Part 1: │Noisy labels negatively impact learning │Overview of approaches for coping with inconsistent labels │Sample: Bug severity levels. Part 2: │Future research direction: Big Data to Thick (high quality) Data.
  • 12. Severity is assigned to reflect the level of impact that a bug has on the system. Bug report fields (filled in by Tester/User): Summary; Description; Product, component, version; Severity. Severity levels: 1: Blocker; 2: Critical; 3: Major; 4: Minor; 5: Trivial.
  • 13. Prior work in automated bug severity labelling.
  • 14. Prior work in automated bug severity labelling. Existing approaches assume that all assigned severity levels are consistent. Classifiers are evaluated using Accuracy & F-measure, AUC.
  • 15. Challenges and our solutions. How to measure label inconsistency? How to evaluate machine learning models with inconsistent labels? (human-machine / human-human)
  • 16. How to measure the inconsistency? Bug handling process: Bug Detection & Reporting; Bug Triaging (#Validity Check, #Duplicate Bug Detection, #Bug Prioritization, #Bug Assignment); Debugging & Bug Fixing. Duplicate bugs should have the same severity levels, if “severity level” labels are consistent.
  • 17. Figure: duplicate buckets 1, 2, 3 with severity labels Blocker, Critical, Major, Minor, split into Inconsistent Duplicate Buckets and Clean Duplicate Buckets. Manually verified 1,394 bug reports (statistically representative sample): 95% of the inconsistent buckets are reporting the same bug. Up to 51% of human-assigned bug severity labels are inconsistent!
  • 18. A new evaluation measure for machine learning tasks with inconsistent labels (human-machine / human-human): Krippendorff's alpha, $\alpha = 1 - \frac{D_o}{D_e}$, where $D_o$ is the observed disagreement and $D_e$ is the disagreement expected when the bug severity levels are randomly assigned. $\alpha = 1$ is regarded as perfect agreement. Benefit of Alpha: • Allow multiple labellers • Good for ordinal labels • Factoring class distributions • Less biased to number of labels and the number of coders
  • 19. Krippendorff’s Alpha vs. Accuracy
        Bug Report   1  2  3  4  5  6  7  8  9  10
        Human        1  2  3  3  4  3  3  4  3  5
        Machine A    2  3  3  3  3  3  3  5  3  2
        Machine B    3  4  3  3  3  3  3  5  3  2
        Machine C    3  3  3  3  3  3  3  3  3  3
        Accuracy: Machine A = Machine B = Machine C
        Alpha: Machine A > Machine B > Machine C
  • 20. Krippendorff's alpha: Human-Human vs. Human-Machine (human-machine / human-human)
        DataSet      Human-Human   Human-Machine (REP+kNN)   Agreement Ratio
        OpenOffice   0.538         0.415                     0.77
        Mozilla      0.675         0.556                     0.82
        Eclipse      0.595         0.510                     0.86
        Low agreement between machine learning systems and human might be due to data inconsistency!
  • 21. Take away messages. Community intelligence can be used to identify and quantify the inconsistency of subjective labels. Performance of machine learning models should be measured within context, e.g., relative to human inter-agreement (human-machine / human-human).
  • 22. Big Data is everywhere, however… During 2017, every minute of the day: 4M tweets; 36M Google searches; 600 page edits; 0.4M trips; 120 new professionals; $258,751.90 in sales. Bad data is costing organizations some $3.1 trillion a year in the US alone. 83% of companies said their revenue is affected by inaccurate and incomplete customer or prospect data.
  • 23. Big Data vs. Thick Data (high quality data, but expensive to collect, hard to scale). Figure axes: Size of Data, Quality.
  • 24. Make big data “thick”. Figure: Big Data and Thick Data plotted by Size of Data and Quality.
  • 25. Make big data “thick”. Figure: Big Data and Thick Data plotted by Size of Data and Quality. Challenges: 1. Impossible to specify all the data semantics beforehand. 2. Manual labelling of noise is expensive and time consuming, impossible to scale. 3. Lack of quality metrics for big data. 4. Lack of performance measures for noisy data.
  • 26. How to make big data thick in a lightweight cost-effective way? • Integrate existing internal/external resources collectively created by relevant communities. • Utilize knowledge in unstructured data.
  • 27. Future Roadmap: Big Data to Thick Data. Assessing Quality of Big Data; Lightweight Scalable Noise Reduction/Correction Techniques; New Noise Tolerant Learning Algorithm for Big Data; New Performance Measures for Noisy Data.
  • 28. Conclusion. • Classification is in some cases subjective, which results in inter-labeler variability. • New measure for machine learning performance with inconsistent labels (human-machine / human-human). Challenges: Big Data to Thick Data • Need metrics for quality assessment and model evaluation • Need lightweight noise reduction methods. ytian@smu.edu.sg
  • 30. Computation of Krippendorff's alpha
        Bucket/Bug Report   1  2  3  4  5  6  7  8  9  10
        Ratings             1  2  3  3  4  3  3  4  3  5
                            2  3  3  3  3  3  3  5  3  2
        #Raters             2  2  2  2  2  2  2  2  2  2
        Count Matrix Derived from Ratings; Predefined Distance Matrix (this is why alpha is good for ordinal labels!); (c,k) represents a pair of ratings.
        $\alpha = 0.27$
  • 31. How do inconsistent labels affect various machine learning models? Duplicate Bug Reports are split into Clean Bug Reports and Inconsistent Bug Reports. Test: Bug Reports. Train (Inconsistent Data Ratio: 0%); Inconsistent Data Ratio: 20%.
  • 32. Noise injection leads to drop in alpha for all datasets. Figure panels: (Mozilla), (OpenOffice), (Eclipse). Models: Decision Tree; Naïve Bayes Multinomial; Support Vector Machine; REP + k-Nearest Neighbor. X-axis: Ratio of Inconsistent Bug Reports in Training Set.
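
The measures on slides 18 to 20 and 30 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names are hypothetical, and the distance between two severity levels is assumed to be their squared difference (the slides mention a predefined distance matrix without giving its entries). Under that assumption, the Human and Machine A ratings from slide 19 give an alpha of about 0.27, the value shown on slide 30, while plain accuracy is 0.5 for all three machines.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, distance=lambda c, k: (c - k) ** 2):
    """Krippendorff's alpha, 1 - D_o / D_e, for units rated by several raters.

    `units` is a list of rating tuples, one tuple per bug report or bucket.
    The default squared-difference distance is an assumption of this sketch.
    """
    coincidences = Counter()  # o_ck: weighted count of ordered rating pairs (c, k)
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with one rating cannot show (dis)agreement
        for i, j in permutations(range(m), 2):
            coincidences[(ratings[i], ratings[j])] += 1 / (m - 1)

    totals = Counter()  # n_c: how often each severity level occurs in pairs
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w * distance(c, k) for (c, k), w in coincidences.items()) / n
    expected = sum(
        totals[c] * totals[k] * distance(c, k) for c in totals for k in totals
    ) / (n * (n - 1))
    if expected == 0:
        return 1.0  # only one level ever used: no disagreement is possible
    return 1 - observed / expected


def accuracy(truth, predicted):
    """Fraction of reports where the predicted level equals the human level."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


# Ratings copied from slide 19.
human = [1, 2, 3, 3, 4, 3, 3, 4, 3, 5]
machines = {
    "A": [2, 3, 3, 3, 3, 3, 3, 5, 3, 2],
    "B": [3, 4, 3, 3, 3, 3, 3, 5, 3, 2],
    "C": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
}

accuracies = {name: accuracy(human, pred) for name, pred in machines.items()}
alpha_a = krippendorff_alpha(list(zip(human, machines["A"])))

# Agreement ratio from slide 20: human-machine alpha over human-human alpha.
slide_20 = {
    "OpenOffice": (0.538, 0.415),
    "Mozilla": (0.675, 0.556),
    "Eclipse": (0.595, 0.510),
}
ratios = {name: round(hm / hh, 2) for name, (hh, hm) in slide_20.items()}

print(accuracies)         # every machine matches the human on 5 of 10 reports
print(round(alpha_a, 2))  # 0.27 under the squared-difference assumption
print(ratios)
```

Dividing the human-machine alpha by the human-human alpha puts a classifier's score in context: a ratio close to 1 means the model agrees with human labels almost as well as humans agree with each other on duplicate reports.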

Editor's Notes

  1. Thanks, Prof. Z, for the introduction, and thanks to everyone for taking the time to attend this talk; I am honored to be invited here. As you can see from the title, I’d like to share some of my experience in dealing with inconsistent labels, which is important for making your later analysis of the data reliable and effective. Please feel free to interrupt if anything on the slides is unclear.
  2. To begin with, I would like to introduce supervised machine learning models, which are probably the most common machine learning models used nowadays. They are also the ones that are affected the most by the quality of labels. In the slide, we see the general flow of supervised machine learning: it takes labelled training data as input, gleans information from it, and eventually learns a model that can label new data. Let me give you a simple example. At the top of the slide, you can see 7 images, each of them associated with a label 1/0 indicating whether there is a cat in the image. Supervised machine learning takes these data-label pairs as training input and then extracts features from the images. In the traditional machine learning flow, features are defined manually, while in the latest deep learning techniques, features are learned from the data. After the feature extraction process, a model is learned that maps feature values to labels, so that given the new image shown at the bottom right of the slide, we hope that the learned model is able to identify that there is a cat in the image.
  3. Since supervised machine learning models are popular and rely heavily on labels, we machine learning practitioners intuitively want clean labels in our training data. However, things don’t always go as we wish. Have you noticed that among the 7 images we saw in the previous slide, the image circled in red contains a dog instead of a cat? In this case, we say we have encountered an incorrect label. In fact, real-world data often contain noisy labels for various reasons. First, we humans make mistakes, including experts. Second, as collecting reliable labels is an expensive and time-consuming task, many studies leverage crowdsourcing platforms to collect non-expert labels in a cheap and fast way, and some regard machine-generated labels as the ground truth. Last, noisy labels might simply be due to communication or encoding problems.
  4. In fact, people have studied the impact of noisy labels for a long time in the machine learning area, demonstrating theoretically or empirically that noisy labels can bring negative consequences for learning. The two most important ones are decreased performance and unreliable performance measures. The image on the right side shows a study of the performance of traditional classifiers on a particular task; we can see that the performance of all the classifiers drops dramatically once the noise level is greater than 3%. Noisy labels often lead to more complex models and incorrect influential features. http://www.stat.purdue.edu/~jianzhan/papers/sigir03zhang.pdf
  5. In fact, people have studied the impact of noisy labels for a long time in the machine learning area, demonstrating theoretically or empirically that noisy labels can bring negative consequences for learning. The two most important ones are decreased performance and unreliable performance measures. The image on the right side shows a study of the performance of traditional classifiers on a particular task; we can see that the performance of all the classifiers drops dramatically once the noise level is greater than 3%. Noisy labels often lead to more complex models and incorrect influential features. So the message I want to deliver here is that we should take care of noisy labels when we design machine learning models.
  6. In most of the research on analysing noisy labels, we make an important assumption that an absolute ground truth for labels exists, just as we could easily tell that the circled image is wrongly labelled. However, classification can be subjective, in which case the ground truth is hard to tell.
  7. For instance, two doctors may give different diagnoses for the same patient based on their experience, especially when the information is not fully collected. In the field of computer security, different companies have their own standards for determining whether a piece of software is malware or not. Thus, when people combine different malware benchmarks, there might be conflicting labels for the same software. In the image tagging process, users are allowed to create tags by themselves, so we may find different tags for the same object. Other data, such as movie ratings and application ratings, also suffer from inconsistency caused by the subjective classification process.
  8. To summarize the inconsistent-label noise we have seen so far, we can divide it into two groups depending on whether the label can be represented using a numeric variable or a categorical variable. For questions regarding people's opinions, like the interestingness of a book, a score between 0-10, or 1-5, is usually assigned to measure the level of agreement with a statement, so we can use a numeric value to represent each category. For scenarios like image tagging, each tag is a categorical label. Similarly, job titles created by different companies may be different for the same person or the same job; there is no standard terminology for the same data. Subjective classification tasks happen often, especially when we want to model user behaviours and preferences, and if we just ignore the inconsistent-label noise introduced by the process, learning will suffer from a performance drop and the many other negative consequences we talked about at the beginning of this talk.
  9. So here comes the question: how do people cope with inconsistent-label noise? To answer this question, we first need to figure out when the inconsistent labels are encountered. Sometimes we are the ones who control the label collection process; sometimes we start with already labelled data. In the label collection process, some people are not aware of the potential inconsistency introduced by the subjective nature of the classification task, while others do care about the labelling process, especially when they are creating a benchmark dataset. Several strategies are adopted, which I believe all researchers should consider before collecting human-annotated labels. For example, we can involve multiple labelers and make sure that each instance has been labelled by more than one labeler. If disagreement appears regarding an instance, we should consider either throwing the data away or resolving the disagreement among labelers. Recently, there is also a trend of adopting pairwise comparison rather than assigning an absolute score, but this method requires many pairs of comparisons. If we do not have control over the process but have labels provided by multiple labelers, many studies use different voting methods to merge the multiple labelers’ labels into one label, but this method … Other studies examine the reliability of each labeler and filter outliers, or treat the label as a distribution over all possible labels, which is called an uncertain label. But what if we only have one label per instance, especially when we know that nothing has been controlled in the label collection process? In the literature, little work considers this case, but we keep seeing the danger of ignoring such inconsistency noise, which motivates my work in this area. The key challenges for the single-labeler case are: How to measure the inconsistency of labelers? How to cope with such inconsistency?
  10. The key point of our solution is to leverage labels collectively provided on other tasks, which I call community intelligence, to transform the single-labeler setting into a multi-labeler setting.
  11. This work is part of my long-term research program, which focuses on coping with data quality issues in big data settings.
  12. The bug severity level reflects the impact of a bug on the system. It is assigned during the bug reporting process, which is shown in the slide.