On the Unreliability of Bug Severity Data
Yuan TIAN
Data Scientist at Living Analytics Research Centre,
Singapore Management University
ytian@smu.edu.sg
April 18th, 2018 @ Queen’s University, Canada
2
Supervised Machine Learning models rely heavily on labels
[Diagram: training data with labels → feature extraction → learning → predictive model; given new data, the model outputs the expected label (Cat / Not Cat).]
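To make the pipeline concrete, here is a minimal, illustrative sketch (not from the talk): it fits a scikit-learn classifier on labelled feature vectors and predicts the label of a new instance. The feature values and the choice of logistic regression are assumptions for illustration only.

```python
# Minimal supervised-learning sketch (illustrative; feature values are made up).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training data: one feature vector per image, with a label (1 = cat, 0 = not cat).
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.9],
                    [0.7, 0.3], [0.1, 0.8], [0.9, 0.2], [0.3, 0.7]])
y_train = np.array([1, 1, 0, 1, 0, 1, 0])

# Learning: fit a predictive model on the (features, label) pairs.
model = LogisticRegression().fit(X_train, y_train)

# New data: the model assigns the expected label ("Cat" / "Not Cat").
x_new = np.array([[0.85, 0.15]])
print("Cat" if model.predict(x_new)[0] == 1 else "Not Cat")
```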
3
Traditional machine learning models suffer from noisy labels in the training set
Noisy labels due to:
• Human mistakes.
• Non-expert generated labels.
• Machine generated labels.
• Communication or encoding
problems.
4
Noisy labels due to:
• Human mistakes.
• Non-expert generated labels.
• Machine generated labels.
• Communication or encoding
problems.
• Decreased Performance
Traditional machine learning models suffer from noisy labels in the training set
[Figure: F1 = 0.83 at a noise level of 3%, dropping to 0.4 as the noise level increases further.]
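As a hedged sketch of this kind of noise-injection experiment (not the cited study's setup), one can flip a fraction of the training labels and watch F1 fall; the synthetic dataset, classifier, and noise levels below are assumptions.

```python
# Illustrative label-noise injection (not the exact experiment from the slide).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for noise_level in [0.0, 0.03, 0.10, 0.30]:
    y_noisy = y_tr.copy()
    flip = rng.random(len(y_noisy)) < noise_level   # choose labels to corrupt
    y_noisy[flip] = 1 - y_noisy[flip]               # flip binary labels
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)
    print(f"noise={noise_level:.0%}  F1={f1_score(y_te, model.predict(X_te)):.2f}")
```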
5
Noisy labels due to:
• Human mistakes.
• Non-expert generated labels.
• Machine generated labels.
• Communication or encoding
problems.
• Decreased Performance
• More Complex Model
• Unreliable Performance Measures
• Incorrect Influential Features
Traditional machine learning models suffer from noisy labels in the training set
6
• Decreased Performance
• More Complex Model
• Unreliable Performance Measures
• Incorrect Influential Features
Traditional machine learning models suffer from noisy labels in the training set
Assumption so far: an absolute ground truth for the labels exists, although the labels may be noisy for various reasons.
7
Classification is in some cases subjective, which results in inter-labeller variability (“inconsistent labels” noise).
Examples: medical diagnosis, malware detection, image tagging.
8
“Inconsistent Labels” Noise
Classification is in some cases subjective, which results in inter-labeller variability (cont.)
Numeric labels: How interesting is this book? How informative is this tweet? Is it a high-quality product? How severe is the problem?
Categorical labels: user-created tags for images, content, etc.; job titles created by different companies.
9
How people cope with label noise (“inconsistent labels”) caused by the subjective labelling, etc.
Before collecting labels:
• Shared labelling criteria
• Multiple labelers
• Repeated labelling
• Keep labels agreed by all/majority
• Pairwise comparison
Readily available data, multiple labelers:
• Averaging
• Majority voting
• Consensus voting
• Remove outliers
• Learning with uncertain labels
(A small voting/averaging sketch follows below.)
Readily available data, single labeler: ?
How to measure inconsistency? How to cope with inconsistency?
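Below is a small, illustrative sketch of the multiple-labeler strategies listed above (majority voting, averaging, consensus); the ratings are hypothetical.

```python
# Illustrative aggregation of multiple labelers' labels (hypothetical data).
from collections import Counter
from statistics import mean

# Three labelers rate the same five items (e.g., severity 1-5).
ratings = {
    "item1": [3, 3, 4],
    "item2": [1, 2, 1],
    "item3": [5, 4, 4],
    "item4": [2, 2, 2],
    "item5": [3, 5, 4],
}

for item, labels in ratings.items():
    majority, count = Counter(labels).most_common(1)[0]      # majority voting
    consensus = majority if count == len(labels) else None   # consensus only if all agree
    print(item, "majority:", majority, "average:", round(mean(labels), 2),
          "consensus:", consensus)
```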
10
Community Intelligence
How people cope with label noise (“inconsistent labels”)
caused by the subjective labelling, etc.
How to measure
inconsistency?
How to cope with
inconsistency?
11
Part 1:
│Noisy labels negatively impact learning
│Overview of approaches for coping with inconsistent labels
│Sample: Bug severity levels
Part 2:
│Future research direction: Big Data to Thick (high quality) Data
Talk Outline
12
[Bug report form: the tester/user fills in a summary, a description, the product/component/version, and a severity level — 1: Blocker, 2: Critical, 3: Major, 4: Minor, 5: Trivial.]
Severity is assigned to reflect the level of impact that a bug has on the system.
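For illustration only (the field names are assumptions, not a real tracker's schema), a bug report with the severity scale above could be modelled as:

```python
# Illustrative bug-report structure mirroring the fields on the slide.
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    BLOCKER = 1
    CRITICAL = 2
    MAJOR = 3
    MINOR = 4
    TRIVIAL = 5

@dataclass
class BugReport:
    summary: str
    description: str
    product: str
    component: str
    version: str
    severity: Severity   # assigned by the reporting tester/user

report = BugReport("Crash on save", "Editor crashes when saving large files.",
                   "Editor", "IO", "2.1", Severity.MAJOR)
print(report.severity.name, int(report.severity))
```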
13
Prior work in automated bug severity labelling
14
Prior work in automated bug severity labelling
Existing approaches assume that all assigned severity levels are consistent.
Classifiers are evaluated using accuracy, F-measure, and AUC.
15
Challenges and our solutions:
• How to measure label inconsistency?
• How to evaluate machine learning models with inconsistent labels? (Proposed measure: human-machine agreement / human-human agreement.)
16
[Bug lifecycle: Bug Detection & Reporting → Bug Triaging (validity check, duplicate bug detection, bug prioritization, bug assignment) → Debugging & Bug Fixing.]
Duplicate bugs should have the same severity levels, if “severity level” labels are consistent.
How to measure the inconsistency?
17
[Diagram: buckets of duplicate bug reports. A bucket whose reports carry different severity levels (e.g., Blocker, Critical, Major, Minor) is an inconsistent duplicate bucket; a bucket whose reports all share one level is a clean duplicate bucket.]
We manually verified 1,394 bug reports (a statistically representative sample): 95% of the inconsistent buckets are reporting the same bug.
Up to 51% of human-assigned bug severity labels are inconsistent!
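A minimal sketch of the bucket-level consistency check described above, on hypothetical buckets (the real study used duplicate-bug buckets mined from the issue trackers):

```python
# Illustrative check: a duplicate bucket is "inconsistent" if its reports
# carry different severity levels (hypothetical data).
buckets = {
    "bucket1": ["Blocker", "Critical", "Major"],   # duplicates of the same bug
    "bucket2": ["Minor", "Minor"],
    "bucket3": ["Major", "Major", "Major"],
    "bucket4": ["Critical", "Minor"],
}

inconsistent = {b: s for b, s in buckets.items() if len(set(s)) > 1}
print("inconsistent buckets:", sorted(inconsistent))

# Fraction of duplicate reports that sit in an inconsistent bucket.
n_reports = sum(len(s) for s in buckets.values())
n_in_inconsistent = sum(len(s) for s in inconsistent.values())
print(f"{n_in_inconsistent / n_reports:.0%} of duplicate reports are in inconsistent buckets")
```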
18
human-machine agreement / human-human agreement
Krippendorff's alpha: a new evaluation measure for machine learning tasks with inconsistent labels.
𝛼 = 1 − Do / De, where Do is the observed disagreement and De is the expected disagreement when the bug severity levels are randomly assigned. 𝛼 = 1 is regarded as perfect agreement.
Benefits of alpha:
• Allows multiple labellers
• Good for ordinal labels
• Factors in class distributions
• Less biased by the number of labels and the number of coders
19
Bug Report 1 2 3 4 5 6 7 8 9 10
Human 1 2 3 3 4 3 3 4 3 5
Machine A 2 3 3 3 3 3 3 5 3 2
Machine B 3 4 3 3 3 3 3 5 3 2
Machine C 3 3 3 3 3 3 3 3 3 3
Krippendorff’s Alpha vs. Accuracy
Accuracy: Machine A = Machine B = Machine C
Alpha: Machine A > Machine B > Machine C
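A hedged sketch reproducing this comparison on the table's ratings; it assumes the third-party Python package `krippendorff` (and scikit-learn for accuracy), and the exact alpha values depend on the chosen level of measurement.

```python
# Accuracy treats Machines A-C alike; Krippendorff's alpha does not.
# Assumes the third-party `krippendorff` package (pip install krippendorff).
import krippendorff
from sklearn.metrics import accuracy_score

human     = [1, 2, 3, 3, 4, 3, 3, 4, 3, 5]
machine_a = [2, 3, 3, 3, 3, 3, 3, 5, 3, 2]
machine_b = [3, 4, 3, 3, 3, 3, 3, 5, 3, 2]
machine_c = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

for name, machine in [("A", machine_a), ("B", machine_b), ("C", machine_c)]:
    acc = accuracy_score(human, machine)
    alpha = krippendorff.alpha(reliability_data=[human, machine],
                               level_of_measurement="ordinal")
    print(f"Machine {name}: accuracy={acc:.2f}  alpha={alpha:.2f}")
```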
20
human-machine agreement / human-human agreement
Krippendorff's alpha
Low agreement between machine learning systems and humans might be due to data inconsistency!
Dataset      Human-Human   Human-Machine (REP+kNN)   Agreement Ratio
OpenOffice   0.538         0.415                     0.77
Mozilla      0.675         0.556                     0.82
Eclipse      0.595         0.510                     0.86
(The agreement ratio is human-machine alpha divided by human-human alpha, e.g., 0.415 / 0.538 ≈ 0.77 for OpenOffice.)
21
Take-away messages:
• Community intelligence can be used to identify and quantify the inconsistency of subjective labels.
• Performance of machine learning models should be measured within context, e.g., relative to human inter-agreement (human-machine agreement / human-human agreement).
22
During 2017, every minute of the day: 4M tweets, 36M Google searches, 600 page edits, 0.4M trips, 120 new professionals, and $258,751.90 in sales.
Big Data is everywhere, however…
Bad data is costing organizations some $3.1 trillion a year in the US alone.
83% of companies said their revenue is affected by inaccurate and incomplete customer or prospect data.
23
Big Data vs. Thick Data (high-quality data, but expensive to collect, hard to scale)
[Chart: data size vs. quality — Big Data is large but lower in quality; Thick Data is high in quality but small.]
24
Make big data “thick”
[Chart: starting from Big Data vs. Thick Data on the size/quality axes, the goal is to move Big Data up the quality axis.]
25
Make big data “thick”
Challenges:
1. Impossible to specify all the data semantics beforehand.
2. Manual labelling of noise is expensive and time-consuming, impossible to scale.
3. Lack of quality metrics for big data.
4. Lack of performance measures for noisy data.
26
How to make big data thick in a lightweight, cost-effective way?
• Integrate existing internal/external resources collectively created by relevant communities.
• Utilize knowledge in unstructured data.
27
Future Roadmap: Big Data to Thick Data
• Assessing quality of big data
• Lightweight, scalable noise reduction/correction techniques
• New noise-tolerant learning algorithms for big data
• New performance measures for noisy data
28
Conclusion (ytian@smu.edu.sg)
• Classification is in some cases subjective, which results in inter-labeler variability.
• New measure for machine learning performance with inconsistent labels: human-machine agreement / human-human agreement.
Challenges: Big Data to Thick Data
• Need metrics for quality assessment and model evaluation
• Need lightweight noise reduction methods
Backup Slides
29
30
Computation of Krippendorff's alpha
Bucket / Bug Report   1  2  3  4  5  6  7  8  9  10
Rating 1              1  2  3  3  4  3  3  4  3  5
Rating 2              2  3  3  3  3  3  3  5  3  2
#Raters               2  2  2  2  2  2  2  2  2  2
A count (coincidence) matrix is derived from the ratings and combined with a predefined distance matrix, where (c,k) represents a pair of ratings; this distance matrix is why alpha is good for ordinal labels.
𝛼 = 0.27
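For reference, the usual coincidence-matrix formulation (standard Krippendorff notation, not copied from the slide): with o_{ck} the coincidence counts for rating pairs, n_c its marginals, and n the number of pairable values,

```latex
\alpha = 1 - \frac{D_o}{D_e},\qquad
D_o = \frac{1}{n}\sum_{c}\sum_{k} o_{ck}\,\delta_{ck}^{2},\qquad
D_e = \frac{1}{n(n-1)}\sum_{c}\sum_{k} n_c\,n_k\,\delta_{ck}^{2}
```

and, for ordinal labels, the squared distance between categories c and k is taken as

```latex
\delta_{ck}^{2} = \Bigl(\sum_{g=c}^{k} n_g - \frac{n_c + n_k}{2}\Bigr)^{2}
```

which is what makes alpha sensitive to how far apart two severity levels are, not merely whether they differ.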
31
How do inconsistent labels affect various machine learning models?
[Experiment setup: duplicate bug reports are split into clean and inconsistent bug reports; training sets are built with an inconsistent-data ratio ranging from 0% up to 20%, and models are evaluated on held-out test bug reports.]
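A hedged sketch of this protocol: the data loading, TF-IDF features, and the linear SVM below stand in for the paper's actual classifiers and pipeline, and `krippendorff` is an assumed third-party package.

```python
# Illustrative sweep: vary the fraction of inconsistent bug reports in the
# training set and measure agreement (alpha) between model output and human labels.
import numpy as np
import krippendorff
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def alpha_for_ratio(clean, noisy, test, ratio, seed=0):
    """clean/noisy/test are (texts, labels) pairs with numeric severity labels (1-5);
    `ratio` is the target fraction of inconsistent reports in the training set."""
    rng = np.random.default_rng(seed)
    n_clean = len(clean[0])
    n_noisy = int(round(ratio / (1.0 - ratio) * n_clean))   # so noisy/(clean+noisy) == ratio
    pick = rng.choice(len(noisy[0]), size=min(n_noisy, len(noisy[0])), replace=False)
    texts = list(clean[0]) + [noisy[0][i] for i in pick]
    labels = list(clean[1]) + [noisy[1][i] for i in pick]
    vec = TfidfVectorizer()
    model = LinearSVC().fit(vec.fit_transform(texts), labels)
    pred = model.predict(vec.transform(test[0]))
    # Agreement between machine predictions and human labels on the test set.
    return krippendorff.alpha(reliability_data=[list(test[1]), list(pred)],
                              level_of_measurement="ordinal")

# for r in [0.0, 0.05, 0.10, 0.15, 0.20]:
#     print(r, alpha_for_ratio(clean_set, inconsistent_set, test_set, r))
```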
32
Noise injection leads to a drop in alpha for all datasets
[Plots for Eclipse, Mozilla, and OpenOffice: alpha vs. the ratio of inconsistent bug reports in the training set, for Decision Tree, Naïve Bayes Multinomial, Support Vector Machine, and REP + k-Nearest Neighbor.]

Editor's Notes

  • #2 Thanks, Prof. Z, for the introduction, and thanks to all of you for taking the time to attend this talk; I am honored to be invited here. As you can see from the title, I'd like to share some of my experience in dealing with inconsistent labels, which is important for making later analysis of the data reliable and effective. Please feel free to interrupt if anything on the slides is unclear.
  • #3 To begin with, I would like to introduce supervised machine learning models, which are probably the most common machine learning models in use today. They are also the ones affected most by the quality of labels. The slide shows the general flow of supervised machine learning: it takes labelled training data as input, gleans information from it, and eventually learns a model that can label new data. Let me give you a simple example. At the top of the slide you can see 7 images, each associated with a 1/0 label indicating whether there is a cat in the image. Supervised machine learning takes these data-label pairs as training input and then extracts features from the images. In the traditional machine learning flow, features are defined manually, while in the latest deep learning techniques, features are learned from the data. After the feature extraction process, a model is learned as a mapping between feature values and labels, so that, given the new image shown at the bottom right of the slide, we hope the learned model is able to identify that there is a cat in the image.
  • #4 Since supervised machine learning models are popular and rely heavily on labels, we machine learning practitioners naturally want clean labels in our training data. However, things don't always go as we wish. Have you noticed that, among the 7 images we saw on the previous slide, the image circled in red contains a dog instead of a cat? In this case we say we encounter an incorrect label. In fact, real-world data often contain noisy labels for various reasons. For example, we humans make mistakes, including experts. Second, because collecting reliable labels is an expensive and time-consuming task, many studies leverage crowdsourcing platforms to collect non-expert labels cheaply and quickly, and some regard machine-generated labels as the ground truth. Finally, noisy labels might simply be due to communication or encoding problems.
  • #5 People have studied the impact of noisy labels in machine learning for a long time, demonstrating theoretically and empirically that noisy labels bring negative consequences for learning. The two most important are decreased performance and unreliable performance measures. The image on the right shows a study of the performance of traditional classifiers on a particular task; the performance of all the classifiers drops dramatically once the noise level exceeds 3%. Noisy labels also often lead to more complex models and incorrect influential features. http://www.stat.purdue.edu/~jianzhan/papers/sigir03zhang.pdf
  • #6 (Same points as on the previous slide.) The message I want to deliver here is that we should take care of noisy labels when we design machine learning models.
  • #7 Most research on analysing noisy labels makes an important assumption: that an absolute ground truth for the labels exists, so that we could easily tell, for example, that the circled image is wrongly labelled. However, classification can be subjective, and then the ground truth is hard to tell.
  • #8 For instance, two doctors may give different diagnoses for the same patient based on their experience, especially when the information is not fully collected. In computer security, different companies have their own standards for determining whether a piece of software is malware, so when people combine different malware benchmarks there may be conflicting labels for the same software. In image tagging, users are allowed to create tags themselves, so we may find different tags for the same object. Other data, such as movie ratings and application ratings, also suffer from inconsistency caused by the subjective classification process.
  • #9 To summarize the inconsistent-label noise we have seen so far, we can divide it into two groups depending on whether the label is a numeric or a categorical variable. For questions about people's opinions, such as how interesting a book is, a score between 0-10 or 1-5 is usually assigned to measure the level of agreement with a statement, so each category can be represented by a numeric value. For scenarios like image tagging, each tag is a categorical label. Similarly, job titles created by different companies may differ for the same person or the same job; there is no standard terminology for the same data. Subjective classification tasks happen often, especially when we want to model user behaviour and preference, and if we simply ignore the inconsistent-label noise introduced by this process, learning will suffer a performance drop and many other negative consequences, as we discussed at the beginning of this talk.
  • #10 So here comes the question: how do people cope with inconsistent-label noise? To answer it, we first need to figure out when the inconsistent labels are encountered. Sometimes we are the ones who control the label collection process; sometimes we start with data that is already labelled. During label collection, some people are not aware of the potential inconsistency introduced by the subjective nature of the classification task, while others do care about the labelling process, especially when creating benchmark datasets. Several strategies are adopted, which I believe all researchers should consider before collecting human-annotated labels: for example, involving multiple labelers and making sure each instance is labelled by more than one labeler. If disagreement appears on an instance, we should consider either discarding the data or resolving the disagreement among labelers. Recently there is also a trend of adopting pairwise comparison rather than assigning an absolute score, but this method requires many pairs of comparisons. If we do not control the process but have labels from multiple labelers, many studies use voting methods to merge the labelers' labels into one, but this method … Other studies estimate the reliability of each labeler and filter outliers, or treat the label as a distribution over all possible labels, which is called an uncertain label. But what if we have only one label per instance, especially when we know that nothing was controlled in the label collection process? In the literature, little work considers this case, yet we keep seeing the danger of ignoring such inconsistency noise, which motivates my work in this area. The key challenges in the single-labeler case are: How to measure the inconsistency of labelers? How to cope with such inconsistency?
  • #11 The key point of our solution is to leverage collectively provided labels on other tasks, which I call community intelligence, to transform the single-labeler setting into a multi-labeler setting.
  • #12 This work is part of my long-term research program which focuses on coping with data quality issues in big data settings.
  • #13 The bug severity level reflects the impact of a bug on the system; it is assigned during the bug-reporting process, as shown on the slide.