Overview and Importance of Data Quality
for Machine Learning Tasks
KDD Tutorial / © 2020 IBM Corporation
Nitin Gupta, Shashank Mujumdar, Shazia Afzal, Hima Patel
IBM Research India
Acknowledgements
Abhinav Jain, Ruhi Sharma, Sameep Mehta, Lokesh N, Shanmukh Guttula, Vitobha Munigala
Agenda
Why is data quality measurement important for Machine Learning Applications?
Part I : Discussion on quality metrics for structured/tabular data
Part II : Discussion on quality metrics for unstructured text data
Part III : Discussion on Human-in-the-Loop for Data Quality Measurements
Conclusion
Need for Data Quality for Machine
Learning
Data Preparation in Machine Learning
“Data collection and preparation are typically
the most time-consuming activities in developing
an AI-based application, much more so than
selecting and tuning a model.” – MIT Sloan Survey
https://sloanreview.mit.edu/projects/reshaping-business-with-artificial-
intelligence/
“Data preparation accounts for about 80% of the work of data
scientists” – Forbes
https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-
most-time-consuming-least-enjoyable-data-science-task-survey-
says/#70d9599b6f63
Challenges with Data Preparation
Iterative debugging, which is cumbersome and time-consuming
Data Quality Analysis can help..
▪ Know the issues in the data beforehand, e.g., noise in labels, overlap between classes
▪ Make informed choices for data preprocessing and model selection
▪ Reduce turnaround time for data science projects
Different personas in an enterprise setting
To put it all together
Data Assessment and Readiness Module
▪ Need for algorithms that can assess training datasets
▪ Across modalities: structured, unstructured, time series, etc.
▪ Allow for complex interactions between the different personas and human-in-the-loop techniques
▪ Need for automation
To summarize:
“Garbage in, garbage out” – George Fuechsel, IBM 305 RAMAC technician
There has been a lot of progress in the last several years on improving ML
algorithms, including building automated machine learning (AutoML) toolkits.
However, the quality of an ML model is directly proportional to the quality
of its data.
Hence, there is a need for a systematic study of measuring the quality of
data with respect to machine learning tasks.
Data Quality Metrics for
Structured Data
Data Quality Metrics
▪ We will cover the following topics:
▪ Data Cleaning
▪ Class Imbalance
▪ Label Noise
▪ Data Valuation
▪ Data Homogeneity
▪ Data Transformations
Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
Data Cleaning
▪ What are some common data cleaning techniques used for machine learning?
▪ Do data cleaning techniques always help in building a machine learning pipeline?
▪ Joint cleaning and model building techniques
Common Data Cleaning Techniques
Data Cleaning Issues
Quantitative:
- Missing Values
- Outliers
- Noise in the data
- Noise in labels
- Data Linters
- …
Qualitative: Integrity Constraints based errors (not covered)
- Functional Dependency Constraints
- Denial Constraints
- Others…
Is data cleaning always helpful for ML pipeline?
▪ Question: what is the impact of different cleaning methods on
ML models?
▪ Experiment setup:
▪ Datasets: 13 real world datasets
▪ Five error types: outliers, duplicates, inconsistencies,
mislabels and missing values
▪ Seven classification algorithms: Logistic Regression,
Decision Tree, Random Forest, Adaboost, XGBoost, k-
Nearest Neighbors, and Naïve Bayes
[LXJ19]
Cleaning methods used for error types
Image Source: Li et al, 2019. CleanML: A Benchmark for Joint Data
Cleaning and Machine Learning [Experiments and Analysis]
Insights: Impact of different cleaning techniques
[LXJ19]
Error type | Impact, if cleaned* | Negative impacts*
Inconsistencies | Insignificant | No
Outliers | Insignificant / Positive | Possible
Duplicates | Insignificant | Yes
Missing Values | Positive | If imputation is far from ground truth
Mislabels | Positive | If dataset accuracy < 50%
*based on general patterns across the datasets and different train/test settings
In conclusion
▪ Data cleaning does not necessarily improve the quality of downstream ML models
▪ Impact depends on different factors:
▪ Cleaning algorithm and its set of parameters
▪ ML model dependent
▪ Order of cleaning operators
▪ Combining model selection with the cleaning algorithm can increase robustness of the impact – there is no one-size-fits-all solution!
[LXJ19]
[LAU19]
Class Imbalance
Unequal distribution of classes within a dataset
Source: https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59
Why does it happen?
Expected in domains where data with one pattern is more common than others.
Source: Krawczyk et al, 2016. Learning from imbalanced data: open challenges and future direction
[Kra16]
Why is Imbalanced Classification Hard?
▪ Classifiers assume the data to be balanced
▪ Unequal cost of misclassification errors: false negatives are more important than false positives
▪ The minority class is more important for data mining but is given less priority by the learning algorithm
▪ Minority class instances can be detected as noise
Evaluation Metrics for Imbalanced Datasets
Accuracy Paradox
With a 90% majority class and a 10% minority class, a learning algorithm that always predicts the majority class achieves 90% accuracy. Reasonable?
F1 Score, Recall, AUC PR, AUC ROC, G-Mean….
Source: https://en.wikipedia.org/wiki/Accuracy_paradox
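The accuracy paradox is easy to reproduce. A minimal plain-Python sketch (helper names are my own) comparing accuracy with minority-class F1 for an "always predict the majority" classifier:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_minority(y_true, y_pred, minority=1):
    """F1 score of the minority class (harmonic mean of precision and recall)."""
    tp = sum(t == minority and p == minority for t, p in zip(y_true, y_pred))
    fp = sum(t != minority and p == minority for t, p in zip(y_true, y_pred))
    fn = sum(t == minority and p != minority for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 90% majority (class 0), 10% minority (class 1)
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100            # classifier that always predicts the majority

acc = accuracy(y_true, y_pred)    # 0.9: looks reasonable
f1 = f1_minority(y_true, y_pred)  # 0.0: exposes the paradox
```

The same contrast is why F1, recall, AUC-PR, AUC-ROC, or G-mean are preferred over plain accuracy on imbalanced data.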
Factors affecting class imbalance
▪ Imbalance Ratio
▪ Overlap between classes
▪ Smaller sub-concepts
▪ Dataset Size
▪ Label Noise
▪ Combination
▪ …
Affecting Factor: Imbalance Ratio
As the imbalance ratio increases:
▪ The learning algorithm becomes biased towards the majority class
▪ Minority samples are more likely to be considered noise
▪ In the confusion region, priority is given to the majority class
Is the imbalance ratio the only cause of performance degradation when learning from imbalanced data? NO
Source: https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-281cc01da0a8
[WP03]
[GBSW04]
Affecting Factor: Overlap
Priority is given to the majority class in the overlapping region.
Minority data points are considered noise due to the smaller number of
minority-class points in the overlapping region.
Source: Ali et al, 2015. Classification with class imbalance problem: A Review
[ALI15]
Affecting Factor: Smaller sub-concepts
It is difficult to discern separate groupings
with so few examples in minority class
Source: Ali et al, 2015. Classification with class imbalance problem: A Review
[ALI15]
Affecting Factor: Dataset Size
As the dataset size decreases:
▪ Evaluation measures deteriorate
▪ The underlying structure of the minority class distribution becomes difficult to learn
Source: Ali et al, 2015. Classification with class imbalance problem: A Review
[ALI15]
Affecting Factor: Combined Effect
Combined effect of varying dataset
size, overlap, and imbalance ratio
Source: Denil et al, 2010. Overlap versus Imbalance
[DT10]
Modelling Strategies: Types
Our Focus
Source: Branco et al, 2015. A Survey of Predictive Modelling under Imbalanced Distributions
[BTR15]
Resampling Techniques
• Oversampling
• Undersampling
• Ensemble Based Techniques
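As a concrete illustration of the simplest of these strategies, here is a minimal random-oversampling sketch in plain Python (the function name is my own, not from any library):

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate random minority samples until every class matches the majority size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_class.values())
    Xr, yr = [], []
    for label, samples in by_class.items():
        # keep the originals, then pad with random duplicates
        Xr.extend(samples)
        yr.extend([label] * len(samples))
        extra = target - len(samples)
        Xr.extend(rng.choice(samples) for _ in range(extra))
        yr.extend([label] * extra)
    return Xr, yr

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2          # 8:2 imbalance
Xr, yr = random_oversample(X, y)
```

Undersampling is the mirror image (drop majority samples instead); production code would typically use a library such as imbalanced-learn rather than this sketch.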
Do imbalance recovery methods always help?
OR
Is the impact of an imbalance recovery method the same on all datasets?
Bayes Impact index
▪ Separable: No impact
▪ Totally overlapped: No impact
▪ Partially overlapped: High impact
Source: Lu et al, 2019. Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem
[LCT19]
Label Noise
Given label – Iris-setosa
Correct label – Iris-virginica (based on attribute analysis)
▪ Most large generated or annotated datasets have some noisy labels.
▪ In this metric we discuss how one can identify these label errors and
correct them to model the data better.
There are at least 100,000 label issues in ImageNet!
Source: https://l7.curtisnorthcutt.com/confident-learning
Effects of Label Noise
Possible Sources of Label Noise:
▪ Insufficient information provided to the labeler
▪ Errors in the labelling itself
▪ Subjectivity of the labelling task
▪ Communication/encoding problems
Label noise can have several effects:
▪ Decrease in classification performance
▪ Pose a threat to tasks like feature selection
▪ In online settings, new labelled data may
contradict the original labelled data
Label Noise Techniques
Algorithm Level Approaches:
▪ Designing robust algorithms that are insensitive to noise
▪ Not directly extensible to other learning algorithms
▪ Requires changing an existing method, which is neither always possible nor easy to develop
Data Level Approaches:
▪ Filtering out noise before passing data to the underlying ML task
▪ Independent of the classification algorithm
▪ Helps improve classification accuracy and reduce model complexity
Label Noise Techniques
Algorithm Level Approaches:
▪ Learning with Noisy Labels (NIPS 2014)
▪ Robust Loss Functions under Label Noise for Deep Neural Networks (AAAI 2017)
▪ Probabilistic End-To-End Noise Correction for Learning With Noisy Labels (CVPR 2019)
▪ Can Gradient Clipping Mitigate Label Noise? (ICLR 2020)
Data Level Approaches:
▪ Identifying mislabelled training data (Journal of Artificial Intelligence Research, 1999)
▪ On the labeling correctness in computer vision datasets (IAL 2018)
▪ Finding label noise examples in large scale datasets (SMC 2017)
▪ Confident Learning: Estimating Uncertainty in Dataset Labels (arXiv 2019)
Filter Based Approaches
▪ On the labeling correctness in
computer vision datasets
[ARK18]
▪ Published in Interactive
Adaptive Learning 2018
▪ Train CNN ensembles to
detect incorrect labels
▪ Voting schemes used:
▪ Majority Voting
▪ Identifying mislabelled training
data [BF99]
▪ Published in Journal of
artificial intelligence research,
1999
▪ Train filters on parts of the
training data in order to
identify mislabeled examples
in the remaining training data.
▪ Voting schemes used:
▪ Majority Voting
▪ Consensus Voting
▪ Finding label noise examples in large
scale datasets [EGH17]
▪ Published in International
Conference on Systems, Man, and
Cybernetics, 2017
▪ Two-level approach
▪ Level 1 – Train SVM on the
unfiltered data. All support
vectors can be potential
candidates for noisy points.
▪ Level 2 – Train classifier on the
original data without the support
vectors. Samples for which the
original and the predicted label
does not match, are marked as
label noise.
Source: Brodley et al, 1999. Identifying mislabeled training data
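The cross-validated majority-vote filter idea can be sketched in a few lines. This is a toy illustration on 1-D data with hand-rolled weak learners (k-NN and nearest-centroid); all names and data are hypothetical, not from the papers:

```python
def nn_predict(train, x, k):
    """k-NN on 1-D points: majority label among the k nearest training points."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [lab for _, lab in neighbors]
    return max(set(labels), key=labels.count)

def centroid_predict(train, x):
    """Nearest class-centroid classifier."""
    groups = {}
    for v, lab in train:
        groups.setdefault(lab, []).append(v)
    cents = {lab: sum(vs) / len(vs) for lab, vs in groups.items()}
    return min(cents, key=lambda lab: abs(cents[lab] - x))

def majority_vote_filter(data, n_folds=3):
    """Flag examples whose label disagrees with the majority of filters
    trained on the other folds (Brodley & Friedl style)."""
    flagged = []
    for i, (x, y) in enumerate(data):
        fold = i % n_folds
        train = [d for j, d in enumerate(data) if j % n_folds != fold]
        votes = [nn_predict(train, x, 1), nn_predict(train, x, 3),
                 centroid_predict(train, x)]
        if sum(v != y for v in votes) > len(votes) / 2:
            flagged.append(i)
    return flagged

# class 0 clustered near 0, class 1 near 10; index 3 is mislabeled
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1),
        (9.8, 1), (9.9, 1), (10.0, 1), (10.1, 1)]
noisy = majority_vote_filter(data)   # → [3]
```

Switching the flagging condition to "all filters disagree" gives the consensus-voting variant, which is more conservative.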
Eliminating Class Noise in
Large Datasets – ICML 2003
Criterion for a rule to be a good rule (GR):
▪ It should have classification precision on its own partition above a threshold
▪ It should have minimum coverage on its own partition
(coverage: the number of examples the rule classifies confidently, whether noise or not)
Source: Zhu et al, 2003. Eliminating Class Noise In Large Datasets
[ZWC03]
Confident Learning: Estimating Uncertainty in Dataset Labels (2019)
The Principles of Confident Learning
▪ Prune to search for label errors
▪ Count to train on clean data
▪ Rank which examples to use during training
In case of a label collision
Source: Northcutt et al, 2019. Confident Learning: Estimating Uncertainty in Dataset Labels
[NJC19]
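The counting step behind these principles can be sketched directly: per-class thresholds are the mean self-confidence of that class, and each example is counted under its most probable confident class (a label collision is resolved by the highest probability, as in the paper). This is a simplified sketch, not the cleanlab implementation:

```python
def confident_joint(labels, probs):
    """Estimate the confident joint C[given_label][suspected_true_label]
    (Northcutt et al. style), using per-class mean self-confidence thresholds."""
    n_classes = len(probs[0])
    thresholds = []
    for j in range(n_classes):
        ps = [p[j] for p, y in zip(probs, labels) if y == j]
        thresholds.append(sum(ps) / len(ps))
    C = [[0] * n_classes for _ in range(n_classes)]
    for p, y in zip(probs, labels):
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if confident:
            # label collision: several classes pass their threshold ->
            # keep the one with the highest predicted probability
            j_star = max(confident, key=lambda j: p[j])
            C[y][j_star] += 1
    return C

labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9],   # third example looks mislabeled
         [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]]
C = confident_joint(labels, probs)
# off-diagonal entry C[0][1] counts the suspected label error
```

Off-diagonal counts are then pruned (e.g., by lowest self-confidence) to produce the set of likely label errors.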
Data Valuation
The value of a training datum is how much impact it has on the predictor's performance.
A valuation model scores each training datum as high- or low-valued, based on the prediction model's performance on validation data.
Is the impact of all the cat images from the training set the same?
A: Learning Algorithm
V: Validation Set
x: Training Datum
P(x, A, V): Performance of A on V using x
Val(x, A, V): Impact of x on the performance of A on V
If P(x₁, A, V) == P(x₂, A, V) for two training images x₁ and x₂,
is then Val(x₁, A, V) == Val(x₂, A, V)?
Scenarios in which a datum has low value:
▪ Incorrect label
▪ Input is noisy or of low quality
▪ Input is not useful for the target task
Application
▪ Identifying Data Quality
▪ High value data ➔ Significant contribution
▪ Low value data ➔ Noise or outliers or not useful for task
▪ Domain Adaptation
▪ Data Gathering
Leave One Out
A: Learning Algorithm
V: Validation Set
T: Training Set
x: Training Datum
P(x, A, V): Performance of A on V using x
Val(x, A, V): Impact of x on the performance of A on V
Val(x, A, V) = P(T, A, V) − P(T ∖ {x}, A, V)
Value of training datum x = performance of learning algorithm A on validation set V when trained on the full data T including x, minus the performance when trained on T excluding x
[Coo77]
Source: Cook et al, 1977. Detection of influential observation in linear regression
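The leave-one-out definition can be made concrete with a toy "model" that predicts the mean of its training targets (all names and numbers here are hypothetical):

```python
def fit_mean(train):
    """Toy 'model': predict the mean of the training targets."""
    return sum(train) / len(train)

def perf(model, val_target):
    """Performance = negative squared error on one validation target."""
    return -(model - val_target) ** 2

def loo_value(train, i, val_target):
    """Leave-one-out value of point i: P(T, A, V) - P(T without i, A, V)."""
    full = perf(fit_mean(train), val_target)
    without = perf(fit_mean(train[:i] + train[i + 1:]), val_target)
    return full - without

train = [1.0, 1.0, 1.0, 9.0]          # 9.0 is an outlier w.r.t. the target
v_outlier = loo_value(train, 3, 1.0)  # removing it helps: large negative value
v_normal = loo_value(train, 0, 1.0)   # removing a typical point hurts: positive
```

Harmful points get negative leave-one-out values and useful points get positive ones, which is exactly the signal data valuation looks for.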
Example
Val(x, A, V) = P(T, A, V) − P(T ∖ {x}, A, V)
0.032 = 0.80 − 0.768
[Coo77]
Implications
➢ Doesn’t capture complex interactions between subsets of data
When the training set contains several identical copies of a datum, removing any single copy leaves performance unchanged, so leave-one-out assigns every copy a value of zero:
Val(xᵢ, A, V) = 0 for each duplicate xᵢ
Reasonable ?
[Coo77]
Data Shapley
The valuation of a datum should come from the way it coalesces with
other data in the dataset: it is derived from its marginal contribution
to all subsets of the data.
[GZ19]
Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning
Shapley Values – Formulation
φᵢ = C Σ_{S ⊆ D∖{i}} [v(S ∪ {i}) − v(S)] / (n−1 choose |S|)

▪ φᵢ : Shapley value of datum i
▪ C : constant
▪ v(S ∪ {i}) − v(S) : value of i together with S, minus the value of S alone
▪ The sum iterates over all subsets S of the data that exclude i; the denominator is the binomial coefficient (n−1 choose |S|)
Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning
[GZ19]
Practical Implications of Data Shapley
▪ Computing the Shapley value of a datum requires 2^(|D|−1) summations
▪ In practice, Shapley values are approximated using Monte Carlo simulations
▪ The number of simulations required doesn't scale with the amount of data
Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning
[GZ19]
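The Monte Carlo approximation can be sketched as averaging marginal contributions over random permutations. This is a simplified version of the paper's TMC-Shapley (no truncation), and the toy utility function is my own:

```python
import random

def monte_carlo_shapley(data, val_fn, n_perms=200, seed=0):
    """Average each point's marginal contribution over random permutations."""
    rng = random.Random(seed)
    n = len(data)
    values = [0.0] * n
    for _ in range(n_perms):
        perm = list(range(n))
        rng.shuffle(perm)
        prev = val_fn([])
        subset = []
        for i in perm:
            subset.append(data[i])
            cur = val_fn(subset)
            values[i] += cur - prev   # marginal contribution of point i
            prev = cur
    return [v / n_perms for v in values]

# toy utility: v(S) = 1 if S contains at least one "good" point, else 0
def val_fn(S):
    return 1.0 if any(good for _, good in S) else 0.0

data = [(0, True), (1, True), (2, False)]
phi = monte_carlo_shapley(data, val_fn)
# the two "good" points share the credit; the useless point gets exactly 0
```

By symmetry the two useful points should each approach 0.5, and the efficiency property makes the values sum to v(D) − v(∅) = 1.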
Results using Data Shapley
Data points with higher noise receive lower Shapley values
Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning
[GZ19]
DVRL
▪ Data valuation is a function from a datum to its value
▪ Sample data based on the valuation
▪ Build the target model using the selected data
▪ Reward = Performance(selected data) − Performance(complete data)
[YAP19]
Source: Yoon et al, 2019. Data Valuation using Reinforcement Learning
DVRL Results
▪ Removing high-valued points affects model performance the most
▪ Removing low-valued points affects model performance the least
Source: Yoon et al, 2019. Data Valuation using Reinforcement Learning
[YAP19]
Data Homogeneity
We call data homogeneous if all the entries follow a unique pattern.
Homogeneity:
▪ Syntactic: distance matching, pattern matching
▪ Semantic: embedding matching, context matching
Data Homogeneity
▪ "12th Jan, 2020" everywhere: homogeneous. "12th Jan, 2020" mixed with "12/01/2020": in-homogeneous (two different date formats)
▪ "Account Manager" everywhere: homogeneous. "Account Manager" mixed with "Acc. Manager": in-homogeneous (two different values for the same category)
In-homogeneity affects the ML pipeline
▪ Adult Census dataset downloaded from Kaggle
▪ Task is to predict Income level (>50k/<=50k) given several attributes of a person
Income level (Train): <=50K, <=50K, >50K, <=50K, <=50K, <=50K, >50K, >50K
Income level (Test): <=50K., <=50K., >50K., >50K., <=50K., <=50K., <=50K., >50K.
The test labels carry a trailing period, so a naive encoder sees four distinct classes. 4 output neurons?
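A small normalization pass fixes this particular inhomogeneity. The trailing-period quirk is real in the Adult test split; the helper name below is my own:

```python
def normalize_labels(labels):
    """Canonicalize label strings so train/test variants map to one class."""
    return [lab.strip().rstrip('.').upper() for lab in labels]

train_labels = ['<=50K', '>50K', '<=50K']
test_labels = ['<=50K.', '>50K.', '<=50K.']   # trailing '.' in the test split

classes = sorted(set(normalize_labels(train_labels + test_labels)))
# two classes instead of four
```

The general lesson is that label spaces should be canonicalized before encoding, or train and test will silently disagree.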
Causes of inhomogeneity
▪ When data is gathered by different people
▪ In the absence (or weak presence) of a data collection protocol
▪ When data is merged from different sources
▪ When data is stored in different formats (e.g. .csv, .xlsx), etc.
Homogeneity score
Desiderata for measures that characterize homogeneity:
▪ The fewer the clusters, the better the homogeneity
▪ The more skewed the distribution of clusters, the better the homogeneity
[DKO+07]
Source: Dai et al, 2006. Column heterogeneity as a measure of data quality
Homogeneity score
Entropy
Entropy_P = − Σ_x p(x) ln p(x)
where P is the probability distribution of the entries among clusters
▪ Satisfies both the desiderata
▪ Not bounded (ranges over ℝ₊)
Simpson's biodiversity index
P(C(x_i) ≠ C(x_j))
where x_i and x_j are two random entries and C(·) denotes the cluster index
▪ Satisfies both the desiderata
▪ Bounded in [0, 1]
[sim]
Source : http://www.countrysideinfo.co.uk/simpsons.htm
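Both measures are easy to compute from cluster assignments. A minimal sketch (note the minus sign in the entropy, and the 1 − Σp² form of Simpson's index):

```python
import math
from collections import Counter

def entropy(cluster_ids):
    """Shannon entropy of the cluster distribution (0 = perfectly homogeneous)."""
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in Counter(cluster_ids).values())

def simpson_index(cluster_ids):
    """P(two random entries fall in different clusters), bounded in [0, 1]."""
    n = len(cluster_ids)
    return 1.0 - sum((c / n) ** 2 for c in Counter(cluster_ids).values())

homogeneous = [0, 0, 0, 0, 0, 0]   # every entry in one cluster
mixed = [0, 0, 1, 1, 2, 2]         # entries spread over three clusters
```

Both scores are 0 for the homogeneous column and grow as entries spread over more, and more evenly sized, clusters.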
Syntactic Homogeneity
Distance based similarity
▪ Edit distance: sequence based distance measure
▪ Example: transform x = "dva" into y = "dave" by inserting "a" (after "d") and substituting "a" with "e", so the edit distance (computed with the standard dynamic-programming table) is 2
Jaccard Similarity
▪ Set based distance measure over boundary-padded bigrams:
J(x, y) = |B_x ∩ B_y| / |B_x ∪ B_y|
▪ E.g. x = dave, y = dav
▪ B_x = {#d, da, av, ve, e#}, B_y = {#d, da, av, v#}
▪ J(x, y) = 3/6
Source: Doan et al, 2012. Principles of Data Integration
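Both similarities fit in a few lines of plain Python; the dynamic-programming edit distance and boundary-padded bigram Jaccard reproduce the dave/dav example:

```python
def edit_distance(x, y):
    """Levenshtein distance via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        dp[i][0] = i
    for j in range(len(y) + 1):
        dp[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,      # delete
                           dp[i][j - 1] + 1,      # insert
                           dp[i - 1][j - 1] + (x[i - 1] != y[j - 1]))  # substitute
    return dp[-1][-1]

def bigrams(s):
    """Boundary-padded character bigrams: '#d', 'da', ..., 'e#'."""
    s = '#' + s + '#'
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(x, y):
    bx, by = bigrams(x), bigrams(y)
    return len(bx & by) / len(bx | by)

d = edit_distance('dva', 'dave')   # insert 'a', substitute 'a' -> 'e'
j = jaccard('dave', 'dav')         # 3 shared of 6 total bigrams
```

Either measure can feed a clustering step; entries whose pairwise similarity exceeds a threshold land in the same cluster.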
Syntactic Homogeneity
Pattern based similarity (Regex based clustering)
Ram Kumar.
u{1}l{3} s{1} u{1}l{4} .{1}
w s{1} w .{1}
ROOT
Lakshman Kumar.
u{1}l{7} s{1} u{1}l{4} .{1}
w s{1} w .{1}
ROOT
The larger the alphabet set of the regex language, the more clusters are detected
Source: Doan et al, 2012. Principles of Data Integration
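Pattern-based clustering can be sketched by mapping each value to a run-length signature over character classes (one level of the regex hierarchy; the exact run lengths below follow this code's own counting, not the slide's rendering):

```python
import itertools

def signature(s):
    """Map a string to a regex-like pattern: runs of uppercase (u),
    lowercase (l), digits (d), spaces (s), other (.) with run lengths."""
    def cls(ch):
        if ch.isupper(): return 'u'
        if ch.islower(): return 'l'
        if ch.isdigit(): return 'd'
        if ch.isspace(): return 's'
        return '.'
    return ''.join(f'{k}{{{len(list(g))}}}'
                   for k, g in itertools.groupby(s, key=cls))

def pattern_clusters(values):
    """Group values by their pattern signature."""
    clusters = {}
    for v in values:
        clusters.setdefault(signature(v), []).append(v)
    return clusters

sigs = pattern_clusters(['Ram Kumar.', 'Lakshman Kumar.', '12/01/2020'])
```

Dropping the run lengths (keeping only the class sequence, like the "w s{1} w .{1}" level) gives a coarser alphabet that merges the two name patterns into one cluster.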
Semantic Homogeneity
Embeddings based similarity
▪ Applicable when the entries in the data are meaningful English words
▪ Can use off-the-shelf embeddings like word2vec, GloVe, etc.
▪ similarity(x, y) = cosine(φ(x), φ(y)), where φ(·) is the word embedding
▪ Embeddings capture semantic information as well
[MCCD13]
Source: Mikolov et al, 2013. Efficient estimation of word representations in vector space
Semantic Homogeneity
Context based similarity
▪ Distributional hypothesis: words that are used and occur in the same context tend to purport similar meanings
▪ In a structured dataset, word ≈ an entry in a row; context ≈ the remaining entries in the row
▪ We may not be able to use off-the-shelf embeddings; we need to train dataset-specific embeddings
▪ Entries that have similar neighborhoods in their rows show higher similarities
▪ Useful in cases where a high sim(employee, manager) is not desirable in a dataset
Source: Mikolov et al, 2013. Efficient estimation of word representations in vector space
[MCCD13]
Data Transformation
▪ The goal is to allow the business user to transform provided data (heterogeneous) into user intended
format (homogeneous), by showing a few examples of expected output for the given input samples.
Data Transformation Examples
Data Transformation Using
PbE Systems
Programming-by-Example systems capture user intent using sample input-output example pairs and learn
transformation programs that convert inputs into their corresponding outputs.
User-provided input-output example pairs → Program space expressed using a Domain-specific Language (DSL) → Decode/search for transformation programs → Transformation program
Fig 2. Generic process across all Programming-by-Example frameworks
Domain-specific Language
DSL of a PbE system consists of a finite set of atomic functions/operators that is used to solve domain-specific
tasks and formally represents the transformation program for the user to interpret.
Source: Bunel et al, 2018. Karel DSL to guide an agent inside a grid world
Source: Gulwani et al, 2011. Flash-fill DSL for string to string transformation
What Are The Different Types Of
Programming-by-example Frameworks?
Symbolic
❑ Designed to produce correct programs by construction using logical reasoning and a domain-specific language
❑ Produces the intended program with few input-output examples
❑ Requires significant engineering effort and an underlying program search method
❑ Examples: PROSE
Statistical
❑ Use of neural networks to decode transformations directly from encoded input-output string pairs
❑ The generated program may not satisfy the provided specification
❑ The programs are generated using beam search
❑ Requires a lot of training data, generated randomly, which might result in poor generalizability on real-world tasks
❑ Examples: Robust-Fill, NPBE, Deep-Coder
Hybrid
❑ Uses statistical methods to smartly select sub-programs for further exploration; requires additional effort to incorporate branching decisions
❑ No way to retract a wrong decision made during exploration: we might eliminate the correct program in the process
❑ Examples: NGDS, PL meets ML
FlashMeta – A Framework for Inductive Program Synthesis (2015)
▪ Consider the transformation “(425) 706-7709” → “425”.
▪ The output string “425” can be represented in 4 possible ways as
concatenations of the substrings ‘4’, ‘2’, ‘5’, ‘42’, ‘25’.
▪ Leaf nodes e1, e2, e3, e12, e23 are VSAs (version-space algebras) for the
output substrings ‘4’, ‘2’, ‘5’, ‘42’, ‘25’ respectively.
Key aspects for symbolic systems:
▪ Independence of problems:
▪ Number of possible sub-problems is quadratic:
▪ Enumerative Search:
Drawback: Takes a lot of time to look for correct
programs in the transformation space
Source: Polozov et al, 2015. Flash Meta – A Framework for Inductive Program Synthesis
[PG15]
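The core idea of keeping only the programs consistent with every example can be sketched with the simplest possible DSL: programs that extract a fixed substring span. Real FlashMeta/PROSE uses a far richer DSL plus version-space algebras; this toy version only shows the intersection step:

```python
def substring_programs(inp, out):
    """All programs 'take inp[s:e]' (encoded as (s, e) pairs) producing out."""
    return {(s, s + len(out))
            for s in range(len(inp) - len(out) + 1)
            if inp[s:s + len(out)] == out}

def synthesize(examples):
    """Keep only the programs consistent with every input-output pair."""
    per_example = [substring_programs(i, o) for i, o in examples]
    return sorted(set.intersection(*per_example))

progs = synthesize([('(425) 706-7709', '425'),
                    ('(617) 555-0100', '617')])
# the surviving program extracts the area code from both inputs
```

With one example many spurious programs survive; each additional example shrinks the set, which is why PbE systems also need a ranking function to pick among the remaining candidates.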
Hybrid Systems
To overcome the issue of symbolic models, hybrid systems have been proposed which
▪ use statistical models to generate probability distribution over DSL operators at each step of the
partially-constructed AST
▪ use most-likely ones to guide deductive search for generating transformation programs.
Program Search: The system uses deductive search to look for programs that satisfy the input-output transformation for all user-provided examples.
Program Ranking: A ranking function is used to pick the top programs from the set of valid programs, based on properties like generalizability, simplicity, etc.
Caveat: There is no way to go back when wrong decisions are made. We might end up with sub-optimal programs or eliminate valid programs completely.
Statistical System: Robust Fill
▪ A deep learning approach based on sequence-to-sequence encoder-decoder networks which performs as well as conventional non-DL PbE approaches like Flash-Fill, with the added motivation of making the learned model more robust to noise in the input data.
Domain Specific Language
Source: Devlin et al, 2017. Robust-Fill
[DUB+17]
Statistical System: Robust Fill
Key aspects:
▪ Use of neural networks with no symbolic guidance unlike hybrid systems.
▪ Use of Beam-Search to decode/predict top-k most likely transformation programs based on joint
probability of decoded DSL operators
▪ Use of synthetic datasets for model training which might lead to poor generalizability in real-world
tasks.
▪ Caveat: No guarantee whether the synthesized program will compile or satisfy user-provided
specifications
[DUB+17]
Statistical System: Robust Fill
Solution: Execution-guided Synthesis
▪ During Beam-Search, execute the partially constructed transformation program on the go and eliminate
ones which produce syntax errors or empty outputs
▪ Restrict DSL vocabulary at each step to enforce grammar correctness
▪ Incorporate execution guidance during training and testing so that program correctness is guaranteed by
construction.
Source: Wang et al, 2018. Execution-Guided Neural Program Synthesis
[KMP+18]
Data Quality Metrics for
Unstructured Text Data
Outline
▪ Motivation
▪ What is Unstructured Text?
▪ How to Assess Text Quality?
▪ Prior-Art Approaches
▪ Generic Text Quality Approaches
▪ Text Quality Metrics for Outlier Detection
▪ Text Quality Metrics for Corpus Filtering
▪ Quality Metrics for Text Classification
▪ Future Directions
▪ Next Steps
Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
What is Unstructured Text?
HTML/XML documents, tweets, ticket data, emails, product reviews
How to Assess Quality of Text?
Different properties of text can be measured to quantify its quality!
▪ Lexical Properties
- Vocabulary, Misspellings etc.
▪ Syntactic Properties
- Grammatical Relations, Syntax Violations etc.
▪ Semantic Properties
- Text Meaningfulness, Text Complexity etc.
▪ Discourse Properties
- Readability, Text Coherence etc.
▪ Distributional Properties
- Outliers, Topics, Embeddings
How to Assess Quality of Text?
Different Approaches can be further broadly classified as
▪ Generic Text quality
▪ Task specific
Generic Text Quality Approaches:
▪ Text Readability
▪ Text Coherence
▪ Text Formality
▪ Text Ambiguity
▪ Text Outliers
▪ Lexical Diversity
▪ Text Cleanliness
▪ Text Appropriateness
▪ Text Complexity
▪ Text Bias
Task Specific Approaches:
▪ Text Classification
▪ Text Generation
▪ Semantic Similarity
▪ Dialog Systems
▪ Label Noise
Automatic Text Scoring Using Neural Networks
Objective:
Automatically assign a score to the text given a marking scale.
Approach:
1. Train a Model to learn word embeddings that capture both context and word-specific scores
2. Train an LSTM-based network using the score-specific word embeddings to output the text score
Task:
Predict scores for essays written by students ranging from Grade 7 to Grade 10
[AYR16]
Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
Automatic Text Scoring Using Neural Networks
Score-Specific Word Embeddings:
▪ Learn both context and word specific scores during training.
▪ Context:
▪ For a sequence of words S = (w1,…, wt,…, wn), construct S´ = (w1,…, wc,…, wn | wc ∈ 𝕍, wt ≠ wc)
▪ The model learns word embeddings by ranking the activation of the true sequence S higher than the activation of its
‘noisy’ counterpart S´ through a hinge margin loss.
▪ Word Specific Scores:
▪ Apply linear regression on the sequence embedding to predict the essay score
▪ Minimize the mean squared error between the predicted ŷ and the actual essay score y
[AYR16]
Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
Automatic Text Scoring Using Neural Networks
Comparison:
▪ Standard Embeddings only consider context and hence place correct and incorrect spellings close to each other.
▪ Score-Specific Embeddings discriminate between spelling mistakes based on low and high essay scores, thus
separating them in the embedding space while preserving the context
[AYR16]
Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
Automatic Text Scoring Using Neural Networks
Explainability:
▪ Which words in the input text correspond to a high score vs a low score?
▪ Perform a single pass of an essay from left to right and let the LSTM make its score prediction
▪ Provide a pseudo-high score and a pseudo-low score as ground truth and measure the gradient
magnitudes corresponding to each word
[AYR16]
Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
Automatic Text Scoring Using Neural Networks
Visualizations:
▪ Positive feedback for correctly placed punctuation or long-distance dependencies (Sentence 6 “are … researching”)
▪ Negative feedback for POS mistakes (Sentence 2, “Being patience”)
▪ Incorrect negative feedback (Sentence 3, “reader satisfied”)
[AYR16]
Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
Transparent text quality assessment with
convolutional neural networks
Objective:
Automatically assign a score to the input Swedish text
Approach:
1. Train a CNN with residual connections on character embeddings
2. Utilize a pseudo-probabilistic framework to incorporate axioms on varying quality text vs high
quality text to train the model
Task:
1. Predict the Essay scores and compare against human graders
2. Qualitatively analyze text to identify low vs high scoring regions
Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
[ÖG17]
Transparent text quality assessment with
convolutional neural networks
Training:
▪ For a sequence of characters S = (s1, s2,…, sn), construct the character embedding matrix of size |𝕍| × d
▪ Pass the character embedding matrix through the CNN
▪ Final quality score is the dot product of the output weight vector and the mean value of the outputs at the
final residual layer.
▪ Utilize the pseudo-probabilistic framework P(a > b) = σ(q(a) − q(b)), where σ(x) is the logistic function and q(x)
is the network's score for text x
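The pairwise framework P(a > b) = σ(q(a) − q(b)) can be sketched directly; the function names below are assumptions of this sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prob_better(q_a, q_b):
    # P(a > b) = sigma(q(a) - q(b)): probability that text a is of higher quality than text b
    return sigmoid(q_a - q_b)

def pairwise_loss(q_a, q_b):
    # Negative log-likelihood of an axiom asserting "a is better than b",
    # e.g. a professional-news snippet (a) vs. a blog snippet (b)
    return -math.log(prob_better(q_a, q_b))
```

Minimizing this loss over pairs sampled according to the axioms pushes the network to assign higher scores q(x) to higher-quality text.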
[ÖG17]
Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
Transparent text quality assessment with
convolutional neural networks
Axioms:
1. All authors (professional or not) are consistent
2. Professional authors are better than blog authors
3. All professional authors are equal
4. Blog authors are not equal
[ÖG17]
Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
Transparent text quality assessment with
convolutional neural networks
Explainability:
▪ Which characters in the input text correspond to a high score vs a low score?
▪ Red encodes low scores, blue encodes high scores. Faded colors are used for scores near zero.
▪ News examples are scored higher than the blog examples
▪ Informal spellings, use of smileys leads to low scores in the blog text
[ÖG17]
Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
Outline
Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
▪ Motivation
▪ What is Unstructured Text?
▪ How to Assess Text Quality?
▪ Prior-Art Approaches
▪ Generic Text Quality Approaches
▪ Text Quality Metrics for Outlier Detection
▪ Text Quality Metrics for Corpus Filtering
▪ Quality Metrics for Text Classification
▪ Future Directions
▪ Next Steps
Outlier Detection
An outlier in tabular or time-series data can be interpreted as:
▪ Deviating point
▪ Noise
▪ Rarely occurring point
What can be an outlier in text data?
▪ Topically diverse sample?
▪ Gibberish text?
▪ Meaningless sentences?
▪ Samples from other language?
Image credits: Google Images
Outlier Examples in Text Data
▪ Repetitive: greatest show ever mad full stop greatest show ever mad full stop greatest show ever mad
full stop greatest show ever mad full stop
▪ Incomprehensible: lived let idea heck bear walk never heard whole years really funny beginning went
hill quickly
▪ Incomplete: Suspenseful subtle much much disturbing
Outlier Detection in Text Data
▪ Classical techniques: matrix factorization techniques
▪ DL techniques: self-attention techniques for text representation
Outlier Detection for Text Data
▪ Feature Generation – Simple Bag of Words
Approach
▪ Apply Matrix Factorization – Decompose the
given term matrix into a low rank matrix L and
an outlier matrix Z
▪ Further, L can be expressed as the product of two low-rank factors, L = WH
▪ The l2 norm of a column zx of Z serves as the outlier score for the corresponding document
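The decomposition can be sketched with a truncated SVD standing in for the low-rank factorization. This is an illustrative sketch, not the paper's exact solver: the residual after the low-rank fit plays the role of the outlier matrix Z.

```python
import numpy as np

def outlier_scores(term_doc, rank=1):
    """Illustrative low-rank + residual outlier scoring (sketch, not the paper's NMF solver).

    Approximate the term-document matrix D with a rank-`rank` matrix L via truncated SVD,
    treat the residual Z = D - L as the outlier matrix, and score each document by the
    l2 norm of its residual column."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    Z = term_doc - L
    return np.linalg.norm(Z, axis=0)
```

A document whose term profile is poorly explained by the dominant low-rank structure of the corpus receives a large residual norm, i.e. a high outlier score.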
Source : Kannan et al, 2017. Outlier detection for text data
[KWAP17]
Outlier Detection for Text Data
▪ For experiments, outliers are sampled from a unique class from standard datasets
▪ Receiver operating characteristic (ROC) curves are studied to assess the performance of the
proposed approach.
▪ The approach is useful for identifying outliers even from regular classes.
▪ Patterns such as unusually short/long documents, unique words, repetitive vocabulary
etc. were observed in detected outliers.
[KWAP17]
Source : Kannan et al, 2017. Outlier detection for text data
Unsupervised Anomaly Detection on Text
Multi-Head Self-Attention:
(The slide shows the attention matrix and resulting sentence embedding as equations.)
▪ Proposes a novel one-class classification method which leverages pretrained word
embeddings to perform anomaly detection on text
▪ Given a word embedding matrix H, multi-head self-attention is used to map sentences of
variable length to a collection of fixed size representations, representing a sentence with
respect to multiple contexts
Source : Ruff et al 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text
[RZV+19]
Unsupervised Anomaly Detection on Text
(The slide shows the CVDD objective, the context vectors, and the orthogonality constraint as equations.)
▪ These sentence representations are trained along with a collection of context vectors such that
the context vectors and representations are similar while keeping the context vectors diverse
▪ Outlier scoring: the greater the distance of mk(H) to ck, the more anomalous the sample w.r.t. context k
[RZV+19]
Source : Ruff et al 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text
More Generic Text Quality Approaches
Local Text Coherence:
▪ [MS18] Mesgar et al. "A neural local coherence model for text quality assessment." EMNLP, 2018.
Text Formality:
▪ [JMAS19] Jain et al. "Unsupervised controllable text formalization." AAAI, 2019.
Text Bias:
▪ [RDNMJ13] Recasens et al. "Linguistic models for analyzing and detecting biased language." ACL, 2013.
Outline
▪ Motivation
▪ What is Unstructured Text?
▪ How to Assess Text Quality?
▪ Prior-Art Approaches
▪ Generic Text Quality Approaches
▪ Text Quality Metrics for Outlier Detection
▪ Text Quality Metrics for Corpus Filtering
▪ Quality Metrics for Text Classification
▪ Future Directions
▪ Next Steps
Identifying Non-Obvious Cases in Semantic
Similarity Datasets
Objective:
▪ Identify obvious and non-obvious text pairs from semantic similarity datasets
Approach:
1. Compare the Jensen-Shannon divergence (JSD) of the text snippets against the median JSD of
the entire distribution
2. Classify the text snippets based on divergence from the median
Task:
▪ Analyze model performance numbers on SemEval Community Question Answering for both
obvious and non-obvious cases
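The divergence between a text pair's word distributions can be computed as below. This is a minimal sketch using whitespace tokenization; the paper's exact preprocessing is not reproduced here.

```python
import math
from collections import Counter

def jsd(text_a, text_b):
    """Jensen-Shannon divergence (base-2 logs) between the word
    distributions of two text snippets; ranges from 0 to 1."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    na, nb = sum(ca.values()), sum(cb.values())
    total = 0.0
    for w in set(ca) | set(cb):
        p, q = ca[w] / na, cb[w] / nb
        m = (p + q) / 2  # mixture distribution
        if p:
            total += 0.5 * p * math.log2(p / m)
        if q:
            total += 0.5 * q * math.log2(q / m)
    return total
```

Pairs whose JSD sits below the corpus median with a positive label (or above it with a negative label) are the lexically "obvious" cases; the remainder are non-obvious.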
Source : Peinelt et al 2019. Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets
[PLN19]
Identifying Non-Obvious Cases in Semantic
Similarity Datasets
Lexical divergence distribution by labels across datasets:
▪ The datasets differ with respect to the degree of lexical divergence
they contain
▪ Pairs with negative labels tend to have higher JSD scores than pairs
with positive labels
▪ Model performance numbers (TPR/TNR/F1) should be reported separately for obvious and
non-obvious cases, since the non-obvious cases are substantially harder.
Examples of Obvious and Non-
Obvious Semantic Text Pairs:
[PLN19]
Source : Peinelt et al 2019. Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets
Improving Neural Conversational Models
with Entropy-Based Data Filtering
Objective:
▪ Filter generic utterances from dialog datasets to improve model performance
Approach:
▪ Devise a data filtering method that identifies open-ended inputs by calculating the entropy of the
distribution over target utterances
Task:
▪ Measure custom metric scores such as response length, KL divergence, Coherence, Bleu score
etc. for standard dialog datasets based on source/target/both filtering mechanisms
Source: Csaky et al, 2019. Improving neural conversational models with entropy-based data filtering
[CPR19]
Improving Neural Conversational Models
with Entropy-Based Data Filtering
Examples:
1. Source = “What did you do today?”
• has high entropy, since it is paired with many different responses in the data
2. Source = “What color is the sky?”
• has low entropy since it’s observed with few responses in the data
3. Target = “I don’t know”
• has high entropy, since it is paired with many different inputs in the data
4. Target = “Try not to take on more than you can handle”
• has low entropy, since it is paired with few inputs in the data
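The entropy computation behind these examples can be sketched as follows. This is an illustrative sketch using exact string matching of source utterances; the paper also clusters semantically similar utterances, which is omitted here.

```python
import math
from collections import Counter, defaultdict

def source_entropy(pairs):
    """Entropy of the response distribution for each source utterance in a
    (source, target) dialog corpus."""
    by_source = defaultdict(list)
    for src, tgt in pairs:
        by_source[src].append(tgt)
    result = {}
    for src, tgts in by_source.items():
        n = len(tgts)
        result[src] = -sum(c / n * math.log2(c / n) for c in Counter(tgts).values())
    return result

def filter_open_ended(pairs, max_entropy=1.0):
    # Drop pairs whose source is open-ended, i.e. paired with too diverse a set of targets
    ent = source_entropy(pairs)
    return [(s, t) for s, t in pairs if ent[s] <= max_entropy]
```

The same computation applied to targets (entropy over the sources each target appears with) gives target-side filtering; applying both gives the "both" strategy.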
[CPR19]
Source: Csaky et al, 2019. Improving neural conversational models with entropy-based data filtering
Outline
▪ Motivation
▪ What is Unstructured Text?
▪ How to Assess Text Quality?
▪ Prior-Art Approaches
▪ Generic Text Quality Approaches
▪ Text Quality Metrics for Outlier Detection
▪ Text Quality Metrics for Corpus Filtering
▪ Quality Metrics for Text Classification
▪ Future Directions
▪ Next Steps
Understanding the Difficulty of Text Classification Tasks
Proposes an approach to design data quality metrics that explain the complexity of a given dataset
for the classification task
▪ Considers various data characteristics to generate a 48-dim
feature vector for each dataset.
▪ Data characteristics include
▪ Class Diversity: Count based probability distribution of classes in
the dataset
▪ Class Imbalance: Σ_{c=1..C} | 1/C − n_c / T_data |, where n_c is the number of samples in class c
and T_data the total number of samples
▪ Class Interference: Similarities among samples belonging to
different classes
▪ Data Complexity: Linguistic properties of data samples
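The class imbalance characteristic above can be computed directly from the label column; a minimal sketch:

```python
from collections import Counter

def class_imbalance(labels):
    """Class imbalance measure: sum over classes of |1/C - n_c/N|.

    Equals 0 for a perfectly balanced dataset and grows toward
    2(C-1)/C as a single class dominates."""
    counts = Counter(labels)
    C, N = len(counts), len(labels)
    return sum(abs(1 / C - n / N) for n in counts.values())
```

For example, a two-class dataset split 3:1 scores 0.5, while an even split scores 0.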
Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
[CRZ18]
The feature vector for a given dataset covers
quality properties such as
▪ Class Diversity (2-Dim)
▪ Shannon Class Diversity
▪ Shannon Class Equitability
▪ Class Imbalance (1-Dim)
▪ Class Interference (24-Dim)
▪ Hellinger Similarity
▪ Top N-Gram Interference
▪ Mutual Information
▪ Data Complexity (21-Dim)
▪ Distinct n-gram : Total n-gram
▪ Inverse Flesch Reading Ease
▪ N-Gram and Character diversity
Understanding the Difficulty of Text Classification Tasks
▪ The authors propose using genetic algorithms to intelligently explore the 2^48 possible feature combinations
▪ The fitness function for the genetic algorithm was Pearson correlation between difficulty score and
model performance on test set
▪ 89 datasets were considered for evaluation and 12 different types of models on each dataset
▪ The effectiveness of a given combination of metrics is measured using its correlation with the
performance of various models on various datasets
▪ The stronger the negative correlation of a metric with model performance, the better the metric explains data
complexity
[CRZ18]
Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
Understanding the Difficulty of Text Classification Tasks
Difficulty Measure D2 =
Distinct Unigrams : Total Unigrams + Class Imbalance +
Class Diversity + Maximum Unigram Hellinger Similarity
+ Unigram Mutual Info.
Correlation = −0.8814
[CRZ18]
Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
Outline
▪ Motivation
▪ What is Unstructured Text?
▪ How to Assess Text Quality?
▪ Prior-Art Approaches
▪ Generic Text Quality Approaches
▪ Text Quality Metrics for Outlier Detection
▪ Text Quality Metrics for Corpus Filtering
▪ Quality Metrics for Text Classification
▪ Future Directions
▪ Next Steps
Next Steps
How to assess overall text quality?
▪ Framework that allows user to assess text quality across various dimensions
▪ Standardized set of quality metrics, each outputting a score that indicates low/high quality
▪ Provide insights into the specific samples in the dataset which contribute to a low/high
score
▪ Specific recommendations for addressing poor text quality and evidence of model
performance improvement
Next Steps
How to assess quality of NLP models independent of task?
▪ Held-out datasets are often not comprehensive
▪ Contain the same biases as the training data
▪ Real-world performance may be overestimated
▪ Difficult to figure out where the model is failing and how to fix it
[RWGS20]
Source : Ribeiro et al, 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Human In Loop Techniques for Data
Quality Assessments
Humans in the Loop
▪ Human intervention in data quality analysis
▪ Validation and feedback
▪ Possible enhancements
▪ Data readiness assessment
▪ Metrics for structured data
▪ Metrics for unstructured data
Human Input in the AI Lifecycle
Supporting Data Operations
Keywords:
▪ Preparation
▪ Integration
▪ Cleaning
▪ Curation
▪ Transformation
▪ Manipulation
Notable tools:
▪ Trifacta ~ Potter’s Wheel
▪ OpenRefine
▪ Wrangler
➢ How to validate that data is clean enough
➢ How to delegate sub-tasks between
respective role players / data workers
➢ How to enable reuse of human input and
codifying tacit knowledge
➢ How to provide cognitive and not just
statistical support
[Ham13]
[KPHH11]
[RH01]
Example: Label Noise Remediation
1. Identify noisy samples
▪ Check target columns / attributes to detect suspect noisy or anomalous labels
▪ E.g. a target column with the following entries:
A | B | Label
Abc | def | 1
Abc | def | 2
Abc | def | 1
Abc | def | 1
2. Visual preview of noisy data
▪ Sampling to show a presentable number of records in batches
▪ Show suspected noisy class labels and predicted labels in the preview:
A | B | Label | Predicted
Abc | def | 1 | 1
Abc | def | 2 | 1
Abc | def | 1 | 1
Abc | def | 1 | 1
3. Remediation input from HIL: review, correct, approve
▪ The HIL reviews whether a row is noisy, verifies whether the predicted label is correct, and
additionally marks the pertinent columns
4. Document and preserve
▪ Learn operations for future recommendations to reduce HIL effort and enable reuse/transfer of knowledge
▪ In future, better prediction using pertinent columns and user actions
Example: Data Homogeneity and Transformation
1. Identify heterogeneous samples
▪ Check target columns / attributes to detect variation in sample data
▪ E.g. a target column with the entries: <=50K, <=50K., >50K, >50K.
▪ E.g. a gender column with the entries: male, M, female, F
2. Visual preview of problematic data
▪ Sampling to show a presentable number of records; give an explanation; viewing options to reduce human effort
▪ Show unique class labels and ask for correction/verification
▪ Show distinct formats and ask for correction/verification
3. Transformation input from HIL: review, correct, approve
▪ The HIL sees the anomalous '.' appearing at the end of some labels and provides the corresponding transformation
▪ Rectify heterogeneous data formats and specify a rule for the transformation
4. Document and preserve
▪ Learn operations for future recommendations to reduce HIL effort and enable reuse/transfer of knowledge
▪ In future, suggest remedial actions and possible remediation rules
Next Steps…
Shared collaborative workspace to streamline, regulate and document human input/expertise
Focus on reproducibility, traceability, and verifiability
▪ Easy-to-use specification interface: e.g. graphical, by example, or in natural language instead of complex grammars or regular expressions
▪ Visual preview: appropriate visualisation of relevant data chunks to see the changes
▪ Minimum latency: the effect of a transformation should be quickly visible
▪ Clear lineage: ability to view what changes have been done previously and by whom; documentation of operations and observed effects
▪ Enable assessment of data operations: basic analysis using default ML models to show the impact of changes
▪ Data versioning: to enable experiments for trial and testing, e.g. the ability to undo changes
▪ Streamlined collaboration: task lists based on persona for a shared data cleaning pipeline
[MLW+19]
[KHFW16]
[ROE+19]
Summary
▪ Addressing data quality issues before data enters an ML pipeline allows
taking remedial actions that reduce model-building effort and turnaround
times.
▪ Measuring the data quality in a systematic and objective manner,
through standardised metrics for example, can improve its reliability and
permit informed reuse.
▪ There are distinct metrics to quantify the data issues and anomalies
based on whether the data is structured or unstructured.
▪ Various data workers or personas are involved in the data quality
assessment pipeline depending on the level and type of human
intervention required.
▪ Streamlining the collaboration between standardized metrics and
human-in-the-loop inputs can serve as a powerful framework to
integrate and optimize the data quality assessment process.
We invite you to join us on this agenda towards a data-centric approach to analyzing data quality.
Contact Hima Patel with your ideas and enquiries.
References
[BF99] Carla E Brodley and Mark A Friedl. Identifying mislabeled training data. Journal of artificial intelligence research, 11:131–167,
1999.
[ZWC03] Xingquan Zhu, Xindong Wu, and Qijun Chen. Eliminating class noise in large datasets. In Proceedings of the 20th
International Conference on Machine Learning (ICML-03), pages 920–927, 2003.
[DLG17] Junhua Ding, XinChuan Li, and Venkat N Gudivada. Augmentation and evaluation of training data for deep learning. In 2017
IEEE International Conference on Big Data (Big Data), pages 2603–2611. IEEE, 2017.
[ARK18] Mohammed Al-Rawi and Dimosthenis Karatzas. On the labeling correctness in computer vision datasets. In
IAL@PKDD/ECML, 2018.
[NJC19] Curtis G Northcutt, Lu Jiang, and Isaac L Chuang. Confident learning: Estimating uncertainty in dataset labels. arXiv preprint
arXiv:1911.00068, 2019.
[Coo77] R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 1977.
[EGH17] Rajmadhan Ekambaram, Dmitry B Goldgof, and Lawrence O Hall. Finding label noise examples in large scale datasets.
In 2017 IEEE International Conference on Systems, Man, and Cybernetics, pages 2420–2424, 2017.
References
[GZ19] Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. arXiv preprint
arXiv:1904.02868, 2019
[YAP19] Jinsung Yoon, Sercan O Arik, and Tomas Pfister. Data valuation using reinforcement learning. arXiv preprint
arXiv:1909.11671, 2019.
[DKO+07] Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, and Suresh Venkatasubramanian. Column heterogeneity
as a measure of data quality. 2007
[sim] http://www.countrysideinfo.co.uk/simpsons.htm.
[DHI12] AnHai Doan, Alon Halevy, and Zachary Ives. Principles of data integration. Elsevier, 2012.
[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space.
arXiv preprint arXiv:1301.3781, 2013
[ALI15] Aida Ali, Siti Mariyam Hj. Shamsuddin, and Anca L. Ralescu. Classification with class imbalance problem: A review. SOCO,
2015.
References
[PG15] Oleksandr Polozov and Sumit Gulwani. Flashmeta: a framework for inductive program synthesis. In Proceedings of the 2015
ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 107–126,
2015.
[DUB+17] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli.
Robustfill: Neural program learning under noisy i/o. In Proceedings of the 34th International Conference on Machine Learning-Volume
70, pages 990–998, 2017.
[KMP+18] Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, and Sumit Gulwani. Neural-guided
deductive search for real-time program synthesis from examples. arXiv preprint arXiv:1804.01186, 2018.
[AYR16] Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. Automatic text scoring using neural networks. In Proceedings of
the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 715–725, Berlin, Germany,
August 2016. Association for Computational Linguistics.
[CPR19] Richard Csaky, Patrik Purgai, and Gabor Recski. Improving neural conversational models with entropy-based data filtering.
arXiv preprint arXiv:1905.05471, 2019.
References
[CRZ18] Edward Collins, Nikolai Rozanov, and Bingbing Zhang. Evolutionary data measures: Understanding the difficulty of text
classification tasks. arXiv preprint arXiv:1811.01910, 2018.
[JMAS19] Parag Jain, Abhijit Mishra, Amar Prakash Azad, and Karthik Sankaranarayanan. Unsupervised controllable text
formalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6554–6561, 2019.
[KWAP17] Ramakrishnan Kannan, Hyenkyun Woo, Charu C Aggarwal, and Haesun Park. Outlier detection for text data. In
Proceedings of the 2017 siam international conference on data mining, pages 489–497. SIAM, 2017.
[MS18] Mohsen Mesgar and Michael Strube. A neural local coherence model for text quality assessment. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, pages 4328–4339, 2018.
[ÖG17] Robert Östling and Gintarė Grigonytė. Transparent text quality assessment with convolutional neural networks. In
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 282–286, 2017.
References
[PLN19] Nicole Peinelt, Maria Liakata, and Dong Nguyen. Aiming beyond the obvious: Identifying non-obvious cases in semantic
similarity datasets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2792–2798,
2019.
[RDNMJ13] Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. Linguistic models for analyzing and detecting
biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 1650–1659, 2013.
[RWGS20] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp
models with checklist. arXiv preprint arXiv:2005.04118, 2020.
[RZV+19] Lukas Ruff, Yury Zemlyanskiy, Robert Vandermeulen, Thomas Schnake, and Marius Kloft. Self-attentive, multi-context
one-class classification for unsupervised anomaly detection on text. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, pages 4061–4071, 2019.
[RH01] Vijayshankar Raman and Joseph M Hellerstein. Potter’s wheel: An interactive data cleaning system. In VLDB, volume 1,
pages 381–390, 2001.
[ROE+19] El Kindi Rezig, Mourad Ouzzani, Ahmed K Elmagarmid, Walid G Aref, and Michael Stonebraker. Towards an end-to-end
human-centric data cleaning framework. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–7, 2019.
References
[Ham13] Kelli Ham. OpenRefine (version 2.5). http://openrefine.org. Free, open-source tool for cleaning and transforming data.
Journal of the Medical Library Association: JMLA, 101(3):233, 2013.
[KHFW16] Sanjay Krishnan, Daniel Haas, Michael J Franklin, and Eugene Wu. Towards reliable interactive data cleaning: A user
survey and recommendations. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–5, 2016.
[KPHH11] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. Wrangler: Interactive visual specification of data
transformation scripts. In ACM Human Factors in Computing Systems (CHI), 2011.
[MLW+19] Michael Muller, Ingrid Lange, Dakuo Wang, David Piorkowski, Jason Tsay, Q Vera Liao, Casey Dugan, and Thomas
Erickson. How data science workers work with data: Discovery, capture, curation, design, creation. In Proceedings of the 2019 CHI
Conference on Human Factors in Computing Systems, pages 1–15, 2019.
[BTR15] Paula Branco, Luis Torgo, and Rita Ribeiro. A survey of predictive modelling under imbalanced distributions. arXiv preprint
arXiv:1505.01658, 2015.
References
[DT10] Misha Denil and Thomas Trappenberg. Overlap versus imbalance. In Canadian conference on artificial intelligence, pages
220–231. Springer, 2010.
[GBSW04] Jerzy W Grzymala-Busse, Jerzy Stefanowski, and Szymon Wilk. A comparison of two approaches to data mining from
imbalanced data. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 757–
763. Springer, 2004
[Jap03] Nathalie Japkowicz. Class imbalances: are we focusing on the right issue. 2003.
[JJ04] Taeho Jo and Nathalie Japkowicz. Class imbalances versus small disjuncts. ACM Sigkdd Explorations Newsletter, 6(1):40–49,
2004.
[Kra16] Bartosz Krawczyk. Learning from imbalanced data: open challenges and future directions. In Progress in Artificial
Intelligence, 2016.
[LCT19] Yang Lu, Yiu-ming Cheung, and Yuan Yan Tang. Bayes imbalance impact index: A measure of class imbalanced data set for
classification problem. IEEE Transactions on Neural Networks and Learning Systems, 2019
[WP03] Gary M Weiss and Foster Provost. Learning when training data are costly: The effect of class distribution on tree induction.
Journal of Artificial Intelligence Research, 19:315–354, 2003.
 
Data mining knowing the unknown
Data mining knowing the unknownData mining knowing the unknown
Data mining knowing the unknown
 
CMDB in Cloud Times: Myths, Mistakes, and Mastery
CMDB in Cloud Times: Myths, Mistakes, and Mastery CMDB in Cloud Times: Myths, Mistakes, and Mastery
CMDB in Cloud Times: Myths, Mistakes, and Mastery
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
How Does Data Create Economic Value_ Foundations For Valuation Models.pdf
How Does Data Create Economic Value_ Foundations For Valuation Models.pdfHow Does Data Create Economic Value_ Foundations For Valuation Models.pdf
How Does Data Create Economic Value_ Foundations For Valuation Models.pdf
 
vodQA Pune (2019) - Testing AI,ML applications
vodQA Pune (2019) - Testing AI,ML applicationsvodQA Pune (2019) - Testing AI,ML applications
vodQA Pune (2019) - Testing AI,ML applications
 
Robotic Process Automation
Robotic Process AutomationRobotic Process Automation
Robotic Process Automation
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
 

Recently uploaded

一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
Machine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxMachine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptx
benishzehra469
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 

Recently uploaded (20)

Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxMALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
 
Machine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxMachine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptx
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 

Overview and Importance of Data Quality for Machine Learning Tasks

  • 1. Overview and Importance of Data Quality for Machine Learning Tasks KDD Tutorial / © 2020 IBM Corporation Nitin Gupta Shashank Mujumdar Shazia Afzal Hima Patel IBM Research India
  • 2. Acknowledgements KDD Tutorial / © 2020 IBM Corporation Abhinav Jain Ruhi Sharma Sameep Mehta Lokesh N Shanmukh Guttula Vitobha Munigala
  • 3. Agenda Why is data quality measurement important for Machine Learning Applications? Part I : Discussion on quality metrics for structured/tabular data Part II : Discussion on quality metrics for unstructured text data Part III : Discussion on Human In Loop for Data Quality Measurements Conclusion KDD Tutorial / © 2020 IBM Corporation
  • 4. Need for Data Quality for Machine Learning KDD Tutorial / © 2020 IBM Corporation
  • 5. Data Preparation in Machine Learning KDD Tutorial / © 2020 IBM Corporation “Data collection and preparation are typically the most time-consuming activities in developing an AI-based application, much more so than selecting and tuning a model.” – MIT Sloan Survey https://sloanreview.mit.edu/projects/reshaping-business-with-artificial-intelligence/ “Data preparation accounts for about 80% of the work of data scientists” – Forbes https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#70d9599b6f63
  • 6. Challenges with Data Preparation KDD Tutorial / © 2020 IBM Corporation
  • 7. Challenges with Data Preparation KDD Tutorial / © 2020 IBM Corporation Iterative debugging which is cumbersome and time consuming
  • 8. Data Quality Analysis can help.. KDD Tutorial / © 2020 IBM Corporation ▪ Know the issues in the data beforehand, e.g., noise in labels, overlap between classes. ▪ Make informed choices for data preprocessing and model selection. ▪ Reduce turnaround time for data science projects.
  • 9. Different personas in enterprise setting.. KDD Tutorial / © 2020 IBM Corporation Source : Google Images
  • 10. To put it all together KDD Tutorial / © 2020 IBM Corporation Data Assessment and Readiness Module ▪ Need for algorithms that can assess training datasets ▪ Across modalities: structured, unstructured, time series, etc. ▪ Allow for complex interaction between the different personas and human-in-the-loop techniques ▪ Need for automation
  • 11. To summarize: KDD Tutorial / © 2020 IBM Corporation George Fuechsel, IBM 305 RAMAC technician A lot of progress has been made in the last several years on improving ML algorithms, including building automated machine learning toolkits (AutoML). However, the quality of an ML model is directly proportional to the quality of its data. Hence, there is a need for a systematic study of measuring the quality of data with respect to machine learning tasks.
  • 12. Data Quality Metrics for Structured Data KDD Tutorial / © 2020 IBM Corporation
  • 13. Data Quality Metrics ▪ We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 14. Data Cleaning ▪ What are some common data cleaning techniques used for machine learning? ▪ Do data cleaning techniques always help in building a machine learning pipeline? ▪ Joint cleaning and model building techniques KDD Tutorial / © 2020 IBM Corporation
  • 15. Common Data Cleaning Techniques Data Cleaning Issues KDD Tutorial / © 2020 IBM Corporation Quantitative Qualitative Integrity Constraints based errors (not covered) - Functional Dependency Constraints - Denial Constraints - Others .. - Missing Values - Outliers - Noise in the data - Noise in labels - Data Linters - …
  • 16. Is data cleaning always helpful for ML pipeline? ▪ Question: what is the impact of different cleaning methods on ML models? ▪ Experiment setup: ▪ Datasets: 13 real world datasets ▪ Five error types: outliers, duplicates, inconsistencies, mislabels and missing values ▪ Seven classification algorithms: Logistic Regression, Decision Tree, Random Forest, Adaboost, XGBoost, k- Nearest Neighbors, and Naïve Bayes KDD Tutorial / © 2020 IBM Corporation [LXJ19] Cleaning methods used for error types Image Source: Li et al, 2019. CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]
  • 17. Insights: Impact of different cleaning techniques KDD Tutorial / © 2020 IBM Corporation [LXJ19] Impact, if cleaned* Negative Impacts* Inconsistencies Insignificant No Outliers Insignificant/ Positive Possible Duplicates Insignificant Yes Missing Values Positive If imputation is far away from ground truth Mislabels Positive If dataset accuracy < 50% *based on general patterns across the datasets and different train test settings
  • 18. In conclusion ▪ Data cleaning does not necessarily improve the quality of downstream ML models ▪ Impact depends on different factors: ▪ Cleaning algorithm and its set of parameters ▪ ML model dependent ▪ Order of cleaning operators ▪ Model selection and cleaning algorithm can increase robustness of impacts – no one solution! KDD Tutorial / © 2020 IBM Corporation [LXJ19] [LAU19]
  • 19. Data Quality Metrics We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 20. Class Imbalance Unequal distribution of classes within a dataset KDD Tutorial / © 2020 IBM Corporation Source: https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59
  • 21. Why it happens? Expected in domains where data with one pattern is more common than data with other patterns. KDD Tutorial / © 2020 IBM Corporation Source: Krawczyk et al, 2016. Learning from imbalanced data: open challenges and future directions [Kra16]
  • 22. Why Imbalanced Classification is Hard? KDD Tutorial / © 2020 IBM Corporation Classifiers assume the data to be balanced Unequal cost of misclassification errors: false negatives are more important than false positives The minority class is more important for data mining but is given less priority by the learning algorithm Minority class instances can be detected as noise
  • 23. Evaluation Metrics for Imbalanced Datasets Accuracy Paradox KDD Tutorial / © 2020 IBM Corporation With a 90% majority / 10% minority split, a learning algorithm that always predicts the majority class achieves 90% accuracy. Reasonable? Prefer metrics such as F1 score, recall, AUC-PR, AUC-ROC, G-mean, etc. Source: https://en.wikipedia.org/wiki/Accuracy_paradox
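To make the accuracy paradox concrete, here is a minimal pure-Python sketch (the 90/10 split and the "always predict majority" classifier are hypothetical illustrations, not from the tutorial's experiments):

```python
# Accuracy paradox sketch: a degenerate classifier that always predicts
# the majority class scores 90% accuracy on a 90/10 dataset, yet has
# zero recall and zero F1 on the minority class.
y_true = [0] * 90 + [1] * 10          # 90% majority (0), 10% minority (1)
y_pred = [0] * 100                    # "always predict majority"

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
recall = tp / (tp + fn) if tp + fn else 0.0
precision = tp / (tp + fp) if tp + fp else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy, recall, f1)  # accuracy is 0.9 while recall and F1 are 0.0
```

The minority-class-aware metrics expose what accuracy hides, which is why F1, recall, and AUC-style scores are preferred on imbalanced data.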
  • 24. Factors affecting class imbalance ▪ Imbalance Ratio ▪ Overlap between classes ▪ Smaller sub-concepts ▪ Dataset Size ▪ Label Noise ▪ Combination ▪ … KDD Tutorial / © 2020 IBM Corporation 24
  • 25. Affecting Factor: Imbalance Ratio KDD Tutorial / © 2020 IBM Corporation As the imbalance ratio increases: the learning algorithm becomes more biased towards the majority class; minority samples are more likely to be considered noise; and in the confusion region, priority is given to the majority class. Is the imbalance ratio the only cause of performance degradation in learning from imbalanced data? NO Source: https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-281cc01da0a8 [WP03] [GBSW04]
  • 26. Affecting Factor: Overlap KDD Tutorial / © 2020 IBM Corporation Priority given to majority class in the overlapping region Minority data points are considered as noise due to a smaller number of data points from minority class in the overlapping region Source: Ali et al, 2015. Classification with class imbalance problem: A Review [ALI15]
  • 27. Affecting Factor: Smaller sub-concepts It is difficult to discern separate groupings with so few examples in minority class KDD Tutorial / © 2020 IBM Corporation Source: Ali et al, 2015. Classification with class imbalance problem: A Review [ALI15]
  • 28. Affecting Factor: Dataset Size KDD Tutorial / © 2020 IBM Corporation As the dataset size decreases, the evaluation measures deteriorate and the underlying structure of the minority class distribution becomes difficult to understand Source: Ali et al, 2015. Classification with class imbalance problem: A Review [ALI15]
  • 29. Affecting Factor: Combined Effect KDD Tutorial / © 2020 IBM Corporation Combined effect of varying dataset size, overlap, and imbalance ratio Source: Denil et al, 2010. Overlap versus Imbalance [DT10]
  • 30. Modelling Strategies: Types KDD Tutorial / © 2020 IBM Corporation Our Focus Source: Paula et al, 2015. A Survey of Predictive Modelling under Imbalanced Distributions [BTR15]
  • 31. Resampling Techniques • Oversampling • Undersampling • Ensemble Based Techniques KDD Tutorial / © 2020 IBM Corporation Does an imbalance recovery method always help? Or is the impact of an imbalance recovery method the same on all datasets?
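As a concrete example of the first family above, here is a minimal sketch of naive random oversampling in pure Python (the function name and toy data are illustrative; production work would typically use a library such as imbalanced-learn, which also offers SMOTE and other variants):

```python
import random

def random_oversample(X, y):
    """Naive random oversampling: duplicate minority-class samples
    (drawn with replacement) until every class matches the majority
    class count. A sketch of the idea only, not a library API."""
    random.seed(0)  # deterministic for the demo
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [random.choice(rows) for _ in range(target - len(rows))]
        for xi in rows + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2                  # imbalance ratio 4:1
X_bal, y_bal = random_oversample(X, y)
print(y_bal.count(0), y_bal.count(1))  # classes are now balanced: 8 8
```

Note that duplicating minority points can encourage overfitting, which is one reason the slide asks whether recovery methods always help.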
  • 32. Bayes Impact index KDD Tutorial / © 2020 IBM Corporation Totally Overlapped: No Impact Separable: No Impact Partially Overlapped: High Impact Source: Yang et al, 2019. Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem [LCT19]
  • 33. Data Quality Metrics We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 34. Label Noise KDD Tutorial / © 2020 IBM Corporation Given Label – Iris-setosa Correct Label – Iris-virginica (based on attribute analysis) ▪ Most large datasets, whether generated or annotated, contain some noisy labels. ▪ In this metric we discuss how one can identify these label errors and correct them to model the data better. There are at least 100,000 label issues in ImageNet! Source: https://l7.curtisnorthcutt.com/confident-learning
  • 35. Effects of Label Noise Possible Sources of Label Noise: ▪ Insufficient information provided to the labeler ▪ Errors in the labelling itself ▪ Subjectivity of the labelling task ▪ Communication/encoding problems Label noise can have several effects: ▪ Decrease in classification performance ▪ Pose a threat to tasks like feature selection ▪ In online settings, new labelled data may contradict the original labelled data KDD Tutorial / © 2020 IBM Corporation
  • 36. Label Noise Techniques Label Noise Algorithm Level Approaches Data Level Approaches ▪ Designing robust algorithms that are insensitive to noise ▪ Not directly extensible to other learning algorithms ▪ Requires changing an existing method, which is neither always possible nor easy to develop ▪ Filtering out noise before passing data to the underlying ML task ▪ Independent of the classification algorithm ▪ Helps in improving classification accuracy and reducing model complexity KDD Tutorial / © 2020 IBM Corporation
  • 37. Label Noise Techniques ‘ Label Noise Algorithm Level Approaches Data Level Approaches ▪ Learning with Noisy Labels (NIPS-2014) ▪ Robust Loss Functions under Label Noise for Deep Neural Networks (AAAI-2017) ▪ Probabilistic End-To-End Noise Correction for Learning With Noisy Labels (CVPR-2019) ▪ Can Gradient Clipping Mitigate Label Noise? (ICLR-2020) ▪ Identifying mislabelled training data (Journal of artificial intelligence research 1999) ▪ On the labeling correctness in computer vision datasets (IAL 2018) ▪ Finding label noise examples in large scale datasets (SMC 2017) ▪ Confident Learning: Estimating Uncertainty in Dataset Label (Arxiv -2019) KDD Tutorial / © 2020 IBM Corporation
  • 38. Filter Based Approaches ▪ On the labeling correctness in computer vision datasets [ARK18] ▪ Published in Interactive Adaptive Learning 2018 ▪ Train CNN ensembles to detect incorrect labels ▪ Voting schemes used: ▪ Majority Voting KDD Tutorial / © 2020 IBM Corporation ▪ Identifying mislabelled training data [BF99] ▪ Published in Journal of Artificial Intelligence Research, 1999 ▪ Train filters on parts of the training data in order to identify mislabeled examples in the remaining training data ▪ Voting schemes used: ▪ Majority Voting ▪ Consensus Voting ▪ Finding label noise examples in large scale datasets [EGH17] ▪ Published in International Conference on Systems, Man, and Cybernetics, 2017 ▪ Two-level approach ▪ Level 1 – Train an SVM on the unfiltered data. All support vectors are potential candidates for noisy points. ▪ Level 2 – Train a classifier on the original data without the support vectors. Samples for which the original and the predicted labels do not match are marked as label noise. Source: Brodley et al, 1999. Identifying mislabeled training data
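The voting idea behind these filters can be sketched in a few lines. This is a minimal illustration in the spirit of the ensemble filters above, not any paper's actual pipeline; the three prediction vectors stand in for hypothetical cross-validated ensemble members:

```python
def majority_vote_filter(labels, ensemble_preds):
    """Flag indices whose given label is contradicted by a majority of
    ensemble members (majority voting). Under consensus voting, a
    sample would be flagged only when *all* members disagree."""
    flagged = []
    n_models = len(ensemble_preds)
    for i, given in enumerate(labels):
        disagree = sum(1 for preds in ensemble_preds if preds[i] != given)
        if disagree > n_models / 2:
            flagged.append(i)
    return flagged

labels = [0, 1, 0, 1]
ensemble_preds = [        # rows: predictions of 3 hypothetical models
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 0, 1],
]
print(majority_vote_filter(labels, ensemble_preds))  # [2]: sample 2 looks mislabeled
```

Majority voting flags more samples (higher recall of noise, more false alarms); consensus voting is the conservative end of the same spectrum.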
  • 39. Eliminating Class Noise in Large Datasets – ICML 2003 Criteria for a rule to be a GR (good rule): ▪ It should have classification precision on its own partition above a threshold ▪ It should have minimum coverage on its own partition ▪ Coverage – how many examples (noise or not) the rule classifies confidently KDD Tutorial / © 2020 IBM Corporation Source: Zhu et al, 2003. Eliminating Class Noise in Large Datasets [ZWC03]
  • 40. Confident Learning: Estimating Uncertainty in Dataset Label -2019 The Principles of Confident Learning ▪ Prune to search for label errors ▪ Count to train on clean data ▪ Rank which examples to use during training KDD Tutorial / © 2020 IBM Corporation In case of label collision Source: Northcutt et al, 2019. Confident Learning: Estimating Uncertainty in Dataset Label [NJC19]
  • 41. KDD Tutorial / © 2020 IBM Corporation Confident Learning: Estimating Uncertainty in Dataset Label -2019 Source: Northcutt et al, 2019. Confident Learning: Estimating Uncertainty in Dataset Label [NJC19]
  • 42. KDD Tutorial / © 2020 IBM Corporation Confident Learning: Estimating Uncertainty in Dataset Label -2019 Source: Northcutt et al, 2019. Confident Learning: Estimating Uncertainty in Dataset Label [NJC19]
  • 43. Data Quality Metrics We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 44. Data Valuation The value of a training datum is how much impact it has on the predictor's performance. KDD Tutorial / © 2020 IBM Corporation Prediction Model Valuation Model Training Data Validation Data Data Valuation High Valued Low Valued Source: Google Images
  • 45. KDD Tutorial / © 2020 IBM Corporation Is the impact of all the cat images from the training set the same? A: learning algorithm; V: validation set; x: training datum; P(x, A, V): performance of A on V using x; Val(x, A, V): impact of x on the performance of A on V. If P(image₁, A, V) == P(image₂, A, V), is Val(image₁, A, V) == Val(image₂, A, V)? Source: Google Images
  • 46. Scenarios in which datum has low value: ▪ Incorrect label ▪ Input is noisy or low quality ▪ Usefulness for target task KDD Tutorial / © 2020 IBM Corporation Label: Dog Label: Dog Label: Dog Label: Dog Validation Set Source: Google Images
  • 47. Application ▪ Identifying Data Quality ▪ High value data ➔ Significant contribution ▪ Low value data ➔ Noise or outliers or not useful for task ▪ Domain Adaptation ▪ Data Gathering KDD Tutorial / © 2020 IBM Corporation
  • 48. Leave One Out KDD Tutorial / © 2020 IBM Corporation A: learning algorithm; V: validation set; T: training set; x: training datum; P(T, A, V): performance of A on V when trained on T; Val(x, A, V): impact of x on the performance of A on V. Val(x, A, V) = P(T, A, V) − P(T − {x}, A, V): the value of training datum x is the performance of learning algorithm A on validation set V when trained on the full data T including x, minus the performance when trained on T excluding x. [Coo77] Source: Cook et al, 1977. Detection of influential observation in linear regression
  • 49. Example KDD Tutorial / © 2020 IBM Corporation 𝑉𝑎𝑙( , 𝐴, 𝑉) = 𝑃( , 𝐴, 𝑉) − 𝑃( , 𝐴, 𝑉) 0.032 = 0.80 – 0.768 Source: Google Images [Coo77]
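The leave-one-out definition above translates directly into code. A minimal sketch, where a toy 1-nearest-neighbour accuracy stands in for "train algorithm A and score it on V" (the data and helper names are illustrative, not from the tutorial):

```python
def loo_value(x_idx, train, valid, performance):
    """Leave-one-out value of training datum x:
    Val(x, A, V) = P(T, A, V) - P(T - {x}, A, V)."""
    full = performance(train, valid)
    without = performance(train[:x_idx] + train[x_idx + 1:], valid)
    return full - without

def knn1_accuracy(train, valid):
    # Toy stand-in for a learning algorithm: 1-NN on a 1-D feature.
    correct = 0
    for vx, vy in valid:
        nearest = min(train, key=lambda t: abs(t[0] - vx))
        correct += nearest[1] == vy
    return correct / len(valid)

train = [(0.0, 0), (1.0, 0), (9.0, 1), (5.0, 1)]   # (feature, label)
valid = [(0.5, 0), (8.0, 1), (4.5, 0)]
values = [loo_value(i, train, valid, knn1_accuracy) for i in range(len(train))]
print(values)  # the point (5.0, 1), which misleads the validation set, gets a negative value
```

A negative LOO value means the model performs better without the point, which is exactly the signal used to surface noisy or harmful training data.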
  • 50. Implications ➢ Doesn’t capture complex interactions between subsets of data KDD Tutorial / © 2020 IBM Corporation 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 Reasonable ? Source: Google Images [Coo77]
  • 51. Data Shapley KDD Tutorial / © 2020 IBM Corporation Valuation of a datum should come from the way it coalesces with other data in the dataset Valuation of is derived from its marginal contribution to all other subsets in the data [GZ19] Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning
  • 52. Shapley Values – Formulation KDD Tutorial / © 2020 IBM Corporation φ_i = C · Σ_{S ⊆ D∖{i}} [v(S ∪ {i}) − v(S)] / C(n−1, |S|), where φ_i is the Shapley value of datum i, C is a constant, v(S) is the value of subset S alone, v(S ∪ {i}) is the value of i together with S, C(n−1, |S|) is the binomial coefficient, and the sum iterates over all subsets of the data that exclude i. Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning [GZ19]
  • 53. Practical Implications of Data Shapley KDD Tutorial / © 2020 IBM Corporation ▪ Computing the Shapley value of a datum requires 2^(|D|−1) summations ▪ In practice, we approximate Shapley values using Monte Carlo simulations ▪ The number of simulations required doesn’t scale with the amount of data Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning [GZ19]
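The Monte Carlo approximation mentioned above can be sketched as follows: sample random permutations of the data and average each datum's marginal utility when it is added in permutation order. The toy `coverage_utility` below is a hypothetical stand-in for "performance of A trained on a subset", chosen so the result is easy to check; this is not the paper's truncated estimator:

```python
import random

def monte_carlo_shapley(data, utility, n_permutations=200):
    """Approximate Shapley values by averaging marginal contributions
    over random permutations (a sketch of permutation sampling)."""
    random.seed(0)  # deterministic for the demo
    values = [0.0] * len(data)
    for _ in range(n_permutations):
        order = list(range(len(data)))
        random.shuffle(order)
        subset, prev = [], utility([])
        for i in order:
            subset.append(data[i])
            cur = utility(subset)
            values[i] += cur - prev   # marginal contribution of datum i
            prev = cur
    return [v / n_permutations for v in values]

# Toy utility: fraction of the 3 distinct "useful" items covered, so the
# duplicated datum must share credit and earn a lower value.
def coverage_utility(subset):
    return len(set(subset)) / 3.0

data = ["a", "b", "c", "a"]           # "a" appears twice
vals = monte_carlo_shapley(data, coverage_utility)
print(vals)  # the two copies of "a" split credit; "b" and "c" score higher
```

The duplicate receiving lower value mirrors the slide's observation that noisier (or redundant) points receive smaller Shapley values.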
  • 54. Results using Data Shapley KDD Tutorial / © 2020 IBM Corporation The data points that have higher noise receive lesser Shapley values Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning [GZ19]
  • 55. DVRL KDD Tutorial / © 2020 IBM Corporation Reward = Performance(selected data) – performance(complete data) Data valuation is a function from datum to its value Sample data based on valuation Build target model using selected data [YAP19] Source: Yoon et al, 2019. Data Valuation using Reinforcement Learning
  • 56. KDD Tutorial / © 2020 IBM Corporation DVRL Results ▪ Removing high-valued points affects the model performance the most ▪ Removing low-valued points affects the model performance the least Source: Yoon et al, 2019. Data Valuation using Reinforcement Learning [YAP19]
  • 57. Data Quality Metrics We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 58. Data Homogeneity KDD Tutorial / © 2020 IBM Corporation Homogeneity Syntactic Distance matching Pattern matching Semantic Embedding matching Context matching We call data homogeneous if all the entries follow a single pattern
  • 59. Data Homogeneity KDD Tutorial / © 2020 IBM Corporation 12th Jan, 2020 Account Manager 12/01/2020 Acc. Manager 12th Jan, 2020 Account Manager 12/01/2020 Acc. Manager Homogeneous Homogeneous Inhomogeneous ➢ Two different date formats ➢ Two different values for the same category
  • 60. In-homogeneity affects ML pipeline KDD Tutorial / © 2020 IBM Corporation ▪ Adult Census dataset downloaded from Kaggle ▪ Task is to predict Income level (>50k/<=50k) given several attributes of a person Income level (Train) Income level (Test) <=50K <=50K. <=50K <=50K. >50K >50K. <=50K >50K. <=50K <=50K. <=50K <=50K. >50K <=50K. >50K >50K. 4 output neurons? Source: Image from Google Images
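The train/test label mismatch above ("<=50K" vs "<=50K.") is often fixable with a tiny canonicalisation step. A minimal sketch for this specific quirk (stripping whitespace and trailing periods is an assumption about the noise pattern, not a general-purpose cleaner):

```python
def normalize_label(label):
    """Canonicalise category strings so '<=50K' and '<=50K.' map to one
    class instead of inflating the label space to 4 output neurons."""
    return label.strip().rstrip(".")

raw = ["<=50K", "<=50K.", ">50K", ">50K."]
print(sorted({normalize_label(v) for v in raw}))  # ['<=50K', '>50K']: 2 classes, not 4
```

Applying the same normalisation to both train and test splits restores the intended binary task.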
  • 61. Inhomogeneity causes KDD Tutorial / © 2020 IBM Corporation When data is gathered by different people In the absence (or weak presence) of a data collection protocol When data is merged from different sources When data is stored in different formats (e.g. .csv, .xlsx) etc. Inhomogeneity
  • 62. Homogeneity score KDD Tutorial / © 2020 IBM Corporation Desiderata for measures that characterize homogeneity ▪ The fewer the clusters, the better the homogeneity ▪ The more skewed the distribution of clusters, the better the homogeneity Homogeneity( ) < Homogeneity( ) Homogeneity( ) < Homogeneity( ) [DKO+07] Source: Dai et al, 2006. Column heterogeneity as a measure of data quality
  • 63. Homogeneity score KDD Tutorial / © 2020 IBM Corporation Entropy: Entropy(p) = −Σ_x p(x) ln p(x) ▪ Satisfies both the desiderata ▪ Not bounded (takes values in ℝ₊) Simpson’s biodiversity index: P(C(x_i) ≠ C(x_j)) ▪ Satisfies both the desiderata ▪ Bounded in [0,1] Here p is the probability distribution of the entries among clusters, x_i and x_j are two entries, and C(·) denotes the cluster index [sim] Source: http://www.countrysideinfo.co.uk/simpsons.htm
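Both scores are a few lines of stdlib Python given the cluster sizes of a column. A minimal sketch (the date-string clusters are illustrative):

```python
import math
from collections import Counter

def entropy(cluster_sizes):
    """Shannon entropy of the cluster distribution: lower means more
    homogeneous; 0 when all entries fall into a single cluster."""
    n = sum(cluster_sizes)
    return -sum((c / n) * math.log(c / n) for c in cluster_sizes)

def simpson_index(cluster_sizes):
    """Probability that two randomly drawn entries fall in *different*
    clusters: bounded in [0, 1], 0 for a perfectly homogeneous column."""
    n = sum(cluster_sizes)
    return 1.0 - sum((c / n) ** 2 for c in cluster_sizes)

homogeneous = Counter(["12/01/2020"] * 10)
mixed = Counter(["12/01/2020"] * 5 + ["12th Jan, 2020"] * 5)
print(entropy(list(homogeneous.values())))   # 0.0: one cluster
print(entropy(list(mixed.values())))         # ln 2: two equal clusters
print(simpson_index(list(mixed.values())))   # 0.5
```

Note both measures satisfy the desiderata: merging clusters or skewing the distribution drives each score toward 0.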
  • 64. Syntactic Homogeneity KDD Tutorial / © 2020 IBM Corporation Distance based similarity. Edit distance (sequence-based distance measure): the minimum number of character insertions, deletions, and substitutions needed to transform x into y, computed with a dynamic-programming table; e.g., x = "dva", y = "dave": insert a (after d), then substitute the final a with e, so the distance is 2. Jaccard similarity (set-based distance measure) over character bigrams: J(x, y) = |B_x ∩ B_y| / |B_x ∪ B_y| ▪ E.g. x = dave, y = dav ▪ B_x = {#d, da, av, ve, e#}, B_y = {#d, da, av, v#} ▪ J(x, y) = 3/6 Source: Doan et al, 2012. Principles of Data Integration
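Both syntactic measures are short to implement. A minimal sketch reproducing the slide's examples (Levenshtein distance via dynamic programming, and Jaccard over '#'-padded character bigrams):

```python
def edit_distance(x, y):
    """Levenshtein distance via the classic dynamic-programming table,
    keeping only the previous row."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cx != cy)))   # substitution
        prev = cur
    return prev[-1]

def bigram_jaccard(x, y):
    """Jaccard similarity over padded character bigrams, as on the slide."""
    bx = {("#" + x + "#")[i:i + 2] for i in range(len(x) + 1)}
    by = {("#" + y + "#")[i:i + 2] for i in range(len(y) + 1)}
    return len(bx & by) / len(bx | by)

print(edit_distance("dave", "dav"))   # 1: delete the trailing 'e'
print(bigram_jaccard("dave", "dav"))  # 0.5: |{#d, da, av}| / 6
```

Edit distance respects character order while the bigram set does not, which is why the two measures disagree on transposition-heavy strings.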
  • 65. KDD Tutorial / © 2020 IBM Corporation Syntactic Homogeneity Pattern based similarity (regex-based clustering) Ram Kumar. u{1}l{2} s{1} u{1}l{4} .{1} w s{1} w .{1} ROOT Lakshman Kumar. u{1}l{7} s{1} u{1}l{4} .{1} w s{1} w .{1} ROOT The larger the alphabet of the regex language, the more clusters are detected Source: Doan et al, 2012. Principles of Data Integration
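The signature idea above can be sketched by mapping each string to a run-length pattern over a small character-class alphabet (uppercase, lowercase, digit, space, punctuation). The helper names are hypothetical; real systems additionally build a hierarchy of coarser alphabets (e.g., a `w` word class, up to a ROOT that merges everything into one cluster):

```python
import itertools

def char_class(c):
    # Fine-grained alphabet: u = uppercase, l = lowercase, d = digit,
    # s = whitespace, . = everything else (punctuation).
    if c.isupper():
        return "u"
    if c.islower():
        return "l"
    if c.isdigit():
        return "d"
    if c.isspace():
        return "s"
    return "."

def pattern_signature(s):
    """Regex-style run-length signature, e.g. 'Ram Kumar.' ->
    'u{1}l{2}s{1}u{1}l{4}.{1}'. Entries sharing a signature cluster together."""
    return "".join("%s{%d}" % (k, len(list(g)))
                   for k, g in itertools.groupby(s, key=char_class))

names = ["Ram Kumar.", "Lakshman Kumar.", "Sita Devi."]
print({n: pattern_signature(n) for n in names})
```

With this fine alphabet the three names land in three clusters; collapsing `u`/`l` runs into a single word class would merge them, illustrating the slide's point that a larger alphabet yields more clusters.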
  • 66. Semantic Homogeneity KDD Tutorial / © 2020 IBM Corporation Embeddings based similarity ▪ Applicable when the entries in the data are meaningful English words ▪ Can use off-the-shelf embeddings like word2vec, GloVe etc. ▪ similarity(x, y) = cosine(φ(x), φ(y)), where φ(·) is the word embedding ▪ Embeddings capture semantic information as well [MCCD13] Source: Mikolov et al, 2013. Efficient estimation of word representations in vector space
  • 67. KDD Tutorial / © 2020 IBM Corporation Semantic Homogeneity Context based similarity ▪ Distributional hypothesis: words that are used and occur in the same context tend to purport similar meanings ▪ In a structured dataset, word ≈ an entry in a row, context ≈ the remaining entries in the row ▪ We may not be able to use off-the-shelf embeddings ▪ Need to train dataset-specific embeddings ▪ Entries that have similar neighborhoods in their rows tend to show higher similarities ▪ Useful in cases where a high sim(employee, manager) is not desirable in a dataset Source: Mikolov et al, 2013. Efficient estimation of word representations in vector space [MCCD13]
  • 68. Data Quality Metrics We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 69. Data Transformation ▪ The goal is to allow the business user to transform provided data (heterogeneous) into user intended format (homogeneous), by showing a few examples of expected output for the given input samples. KDD Tutorial / © 2020 IBM Corporation
  • 70. Data Transformation Examples KDD Tutorial / © 2020 IBM Corporation
  • 71. Data Transformation Using PbE Systems Programming-by-Example systems capture user intent using sample input-output example pairs and learn transformation programs that convert inputs into their corresponding outputs. KDD Tutorial / © 2020 IBM Corporation User provided input- output example pairs Program space expressed using a Domain-specific Language (DSL) Decode/Search for Transformation Programs Transformation Program Fig 2. Generic process across all Programming-by-Example frameworks
  • 72. Domain-specific Language DSL of a PbE system consists of a finite set of atomic functions/operators that is used to solve domain-specific tasks and formally represents the transformation program for the user to interpret. KDD Tutorial / © 2020 IBM Corporation Source: Bunel et al, 2018. Karel DSL to guide an agent inside a grid world Source: Gulwani et al, 2011. Flash-fill DSL for string to string transformation
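A toy illustration of the PbE loop described above: a four-operator DSL and a brute-force enumerative search over operator compositions, returning the first program consistent with every input-output example. The DSL operators here are hypothetical stand-ins, not FlashFill's actual operator set:

```python
import itertools
import re

# A toy DSL of atomic string operators (illustrative, not FlashFill's).
DSL = {
    "lower":  str.lower,
    "upper":  str.upper,
    "digits": lambda s: "".join(re.findall(r"\d", s)),
    "first":  lambda s: s.split()[0] if s.split() else "",
}

def synthesize(examples, max_depth=2):
    """Enumerate compositions of DSL operators, shortest first, and
    return the first program consistent with all examples."""
    names = list(DSL)
    for depth in range(1, max_depth + 1):
        for prog in itertools.product(names, repeat=depth):
            def run(s, prog=prog):
                for op in prog:
                    s = DSL[op](s)
                return s
            if all(run(i) == o for i, o in examples):
                return prog
    return None

examples = [("Jacob Devlin", "JACOB"), ("Sumit Gulwani", "SUMIT")]
program = synthesize(examples)   # -> ("upper", "first")
```

Even in this toy setting the search space grows exponentially with program depth, which is the scalability problem the symbolic/statistical/hybrid systems on the following slides attack in different ways.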
  • 73. What Are The Different Types Of Programming-by-Example Frameworks? KDD Tutorial / © 2020 IBM Corporation Symbolic ❑ Designed to produce correct programs by construction using logical reasoning and a domain-specific language ❑ Produces the intended program with few input-output examples ❑ Requires significant engineering effort and an underlying program search method ❑ Examples: PROSE Statistical ❑ Use of neural networks to decode transformations directly from encoded input-output string pairs ❑ The generated program may not satisfy the provided specification ❑ The programs are generated using beam search ❑ Requires a lot of training data, generated randomly, which might result in poor generalizability on real-world tasks ❑ Examples: Robust-Fill, NPBE, Deep-Coder Hybrid ❑ Uses statistical methods to smartly select sub-programs for further exploration; requires additional effort to incorporate branching decisions ❑ No way to retract a wrong decision made during exploration; we might eliminate the correct program in the process ❑ Examples: NGDS, PL meets ML
  • 74. FlashMeta – A Framework for Inductive Program Synthesis (2015) KDD Tutorial / © 2020 IBM Corporation ▪ Consider the transformation “(425) 706-7709” --> “425”. ▪ The output string “425” can be represented in 4 possible ways as concatenations of the substrings ‘4’, ‘2’, ‘5’, ‘42’, ‘25’. ▪ Leaf nodes e1, e2, e3, e12, e23 are VSAs for the output substrings ‘4’, ‘2’, ‘5’, ‘42’, ‘25’ respectively. Key aspects of symbolic systems: ▪ Independence of sub-problems ▪ Number of possible sub-problems is quadratic ▪ Enumerative search Drawback: takes a lot of time to look for correct programs in the transformation space Source: Polozov et al, 2015. FlashMeta – A Framework for Inductive Program Synthesis [PG15]
  • 75. Hybrid Systems KDD Tutorial / © 2020 IBM Corporation To overcome the issues of symbolic models, hybrid systems have been proposed which ▪ use statistical models to generate a probability distribution over DSL operators at each step of the partially-constructed AST ▪ use the most likely ones to guide deductive search for generating transformation programs. Program Search: the system uses deductive search to look for programs that satisfy the input-output transformation for all user-provided examples. Program Ranking: a ranking function is used to pick the top programs from the set of valid programs based on properties like generalizability, simplicity etc. Caveat: there is no way to go back when wrong decisions are made; we might end up with a sub-optimal program or eliminate valid programs completely
  • 76. Statistical System: RobustFill KDD Tutorial / © 2020 IBM Corporation ▪ A DL approach based on seq-to-seq encoder-decoder networks which performs as well as conventional non-DL PbE approaches like FlashFill, with the added motivation of making the learned model more robust to noise in the input data. Domain-Specific Language Source: Devlin et al, 2017. RobustFill [DUB+17]
  • 77. Statistical System: RobustFill KDD Tutorial / © 2020 IBM Corporation Key aspects: ▪ Use of neural networks with no symbolic guidance, unlike hybrid systems ▪ Use of beam search to decode/predict the top-k most likely transformation programs based on the joint probability of the decoded DSL operators ▪ Use of synthetic datasets for model training, which might lead to poor generalizability on real-world tasks ▪ Caveat: no guarantee that the synthesized program will compile or satisfy the user-provided specification [DUB+17]
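The beam-search decoding step can be sketched with hand-written next-token log-probabilities standing in for the decoder network's output distribution (the DSL tokens and probabilities below are made up for illustration):

```python
from math import log

# Toy next-token log-probabilities conditioned on the previous token,
# a stand-in for the decoder network's predicted distribution.
LOGPROB = {
    "<s>":    {"Concat": log(0.6), "Const": log(0.4)},
    "Concat": {"SubStr": log(0.7), "Const": log(0.3)},
    "Const":  {"<e>": log(1.0)},
    "SubStr": {"<e>": log(0.9), "Const": log(0.1)},
}

def beam_search(beam_width=2, max_len=4):
    beams = [(["<s>"], 0.0)]  # (token sequence, joint log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<e>":
                candidates.append((seq, score))  # finished hypothesis
                continue
            for tok, lp in LOGPROB[seq[-1]].items():
                candidates.append((seq + [tok], score + lp))
        # keep only the k highest-scoring (partial) programs
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```

Note that the top hypothesis here is not the greedy one: greedy decoding would commit to "Concat" at the first step, while beam search keeps the "Const" hypothesis alive and ranks it higher on joint probability.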
  • 78. Statistical System: KDD Tutorial / © 2020 IBM Corporation Solution: Execution-guided synthesis ▪ During beam search, execute the partially constructed transformation program on the go and eliminate candidates which produce syntax errors or empty outputs ▪ Restrict the DSL vocabulary at each step to enforce grammatical correctness ▪ Incorporate execution guidance during training and testing so that program correctness is guaranteed by construction. Source: Wang et al, 2018. Execution-Guided Neural Program Synthesis
  • 79. Data Quality Metrics for Unstructured Text Data KDD Tutorial / © 2020 IBM Corporation
  • 80. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 81. What is Unstructured Text? KDD Tutorial / © 2020 IBM Corporation Html/Xml Documents Tweets Ticket Data Email Product Reviews
  • 82. How to Assess Quality of Text? Different properties of text can be measured to quantify its quality! ▪ Lexical Properties - Vocabulary, Misspellings etc. ▪ Syntactic Properties - Grammatical Relations, Syntax Violations etc. ▪ Semantic Properties - Text Meaningfulness, Text Complexity etc. ▪ Discourse Properties - Readability, Text Coherence etc. ▪ Distributional Properties - Outliers, Topics, Embeddings KDD Tutorial / © 2020 IBM Corporation
  • 83. How to Assess Quality of Text? The different approaches can be broadly classified as ▪ Generic text quality ▪ Task specific KDD Tutorial / © 2020 IBM Corporation Generic Text Quality Approaches: ▪ Text Readability ▪ Text Coherence ▪ Text Formality ▪ Text Ambiguity ▪ Text Outliers ▪ Lexical Diversity ▪ Text Cleanliness ▪ Text Appropriateness ▪ Text Complexity ▪ Text Bias Task Specific Approaches: ▪ Text Classification ▪ Text Generation ▪ Semantic Similarity ▪ Dialog Systems ▪ Label Noise
  • 84. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 85. Automatic Text Scoring Using Neural Networks Objective: Automatically assign a score to a text given a marking scale. Approach: 1. Train a model to learn word embeddings that capture both context and word-specific scores 2. Train an LSTM-based network on the score-specific word embeddings to output the text score Task: Predict the scores of essays written by students from Grade 7 to Grade 10 KDD Tutorial / © 2020 IBM Corporation [AYR16] Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
  • 86. Automatic Text Scoring Using Neural Networks Score-Specific Word Embeddings: ▪ Learn both context and word specific scores during training. ▪ Context: ▪ For a sequence of words S = (w1,…, wt,…, wn), construct S´ = (w1,…, wc,…, wn | wc ∈ 𝕍, wt ≠ wc) ▪ The model learns word embeddings by ranking the activation of the true sequence S higher than the activation of its ‘noisy’ counterpart S´ through a hinge margin loss. ▪ Word Specific Scores: ▪ Apply linear regression on the sequence embedding to predict the essay score ▪ Minimize the mean squared error between the predicted ŷ and the actual essay score y KDD Tutorial / © 2020 IBM Corporation [AYR16] Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
  • 87. Automatic Text Scoring Using Neural Networks KDD Tutorial / © 2020 IBM Corporation Comparison: ▪ Standard Embeddings only consider context and hence place correct and incorrect spellings close to each other. ▪ Score-Specific Embeddings discriminate between spelling mistakes based on low and high essay scores, thus separating them in the embedding space while preserving the context [AYR16] Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
  • 88. Automatic Text Scoring Using Neural Networks Explainability: ▪ Which words in the input text correspond to a high score vs a low score? ▪ Perform a single pass of an essay from left to right and let the LSTM make its score prediction ▪ Provide a pseudo-high score and a pseudo-low score as ground truth and measure the gradient magnitudes corresponding to each word KDD Tutorial / © 2020 IBM Corporation [AYR16] Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
  • 89. Automatic Text Scoring Using Neural Networks KDD Tutorial / © 2020 IBM Corporation Visualizations: ▪ Positive feedback for correctly placed punctuation or long-distance dependencies (Sentence 6 “are … researching”) ▪ Negative feedback for POS mistakes (Sentence 2, “Being patience”) ▪ Incorrect negative feedback (Sentence 3, “reader satisfied”) [AYR16] Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
  • 90. Transparent text quality assessment with convolutional neural networks Objective: Automatically assign a score to the input Swedish text Approach: 1. Train a CNN with residual connections on character embeddings 2. Utilize a pseudo-probabilistic framework to incorporate axioms on varying quality text vs high quality text to train the model Task: 1. Predict the Essay scores and compare against human graders 2. Qualitatively Analyze text to identify low vs high scoring regions KDD Tutorial / © 2020 IBM Corporation Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks [ÖG17]
  • 91. Transparent text quality assessment with convolutional neural networks Training: ▪ For a sequence of characters S = (s1, s2,…, sn), construct the character embedding matrix of size 𝕍 × d ▪ Pass the character embedding matrix through the CNN ▪ The final quality score is the dot product of the output weight vector and the mean of the outputs at the final residual layer ▪ Utilize the pseudo-probabilistic framework P(a > b) = σ(q(a) − q(b)), where σ(x) is the logistic function and q(x) is the network score KDD Tutorial / © 2020 IBM Corporation [ÖG17] Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
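The pairwise objective is tiny to write down; a sketch, with q(a) and q(b) as placeholder network scores rather than outputs of an actual CNN:

```python
from math import exp

def sigma(x):
    # logistic function
    return 1.0 / (1.0 + exp(-x))

def p_better(q_a, q_b):
    """P(a > b) = sigma(q(a) - q(b)) for network quality scores q."""
    return sigma(q_a - q_b)
```

Equal scores give P(a > b) = 0.5, and P(a > b) + P(b > a) = 1, which is what lets the axioms on the next slide (professional text preferred over blog text, etc.) be expressed as pairwise training targets.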
  • 92. Transparent text quality assessment with convolutional neural networks KDD Tutorial / © 2020 IBM Corporation Axioms: 1. All authors (professional or not) are consistent 2. Professional authors are better than blog authors 3. All professional authors are equal 4. Blog authors are not equal [ÖG17] Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
  • 93. Transparent text quality assessment with convolutional neural networks Explainability: ▪ Which characters in the input text correspond to a high score vs a low score? ▪ Red encodes low scores, blue encodes high scores. Faded colors are used for scores near zero. ▪ News examples are scored higher than the blog examples ▪ Informal spellings, use of smileys leads to low scores in the blog text KDD Tutorial / © 2020 IBM Corporation [ÖG17] Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
  • 94. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 95. Outlier Detection An outlier in tabular or timeseries data can be interpreted as ▪ A deviating point ▪ Noise ▪ A rarely occurring point What can be an outlier in text data? ▪ A topically diverse sample? ▪ Gibberish text? ▪ Meaningless sentences? ▪ Samples from another language? KDD Tutorial / © 2020 IBM Corporation Image credits: Google Images
  • 96. Outlier Examples in Text Data ▪ Repetitive: greatest show ever mad full stop greatest show ever mad full stop greatest show ever mad full stop greatest show ever mad full stop ▪ Incomprehensible: lived let idea heck bear walk never heard whole years really funny beginning went hill quickly ▪ Incomplete: Suspenseful subtle much much disturbing KDD Tutorial / © 2020 IBM Corporation Outlier detection in text data: classical techniques, matrix factorization techniques, and DL techniques (self-attention techniques for text representation)
  • 97. Outlier Detection for Text Data ▪ Feature generation – simple bag-of-words approach ▪ Apply matrix factorization – decompose the given term-document matrix into a low-rank matrix L and an outlier matrix Z ▪ L is further expressed as a product of low-rank factors ▪ The ℓ2 norm of a particular column z_x of Z serves as the outlier score for document x KDD Tutorial / © 2020 IBM Corporation Source : Kannan et al, 2017. Outlier detection for text data [KWAP17]
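A drastically simplified, pure-Python stand-in for this idea: instead of the paper's low-rank-plus-outlier factorization, we take a rank-1 approximation of the bag-of-words matrix via power iteration and score each document by the norm of its residual row (all data below is toy):

```python
from collections import Counter
from math import sqrt

docs = [
    "good movie great acting",
    "great movie good story",
    "good story great movie",
    "qwk zzv qwk zzv",        # gibberish outlier
]

# Bag-of-words document-term matrix
vocab = sorted({w for d in docs for w in d.split()})
rows = [[Counter(d.split())[t] for t in vocab] for d in docs]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Dominant term direction v via power iteration on A^T A,
# giving the rank-1 approximation L = (A v) v^T.
v = [1.0] * len(vocab)
for _ in range(50):
    w = [sum(r[j] * dot(r, v) for r in rows) for j in range(len(vocab))]
    n = sqrt(dot(w, w))
    v = [x / n for x in w]

# Outlier score of each document = norm of its residual row of (A - L)
scores = []
for r in rows:
    proj = dot(r, v)
    resid = [a - proj * b for a, b in zip(r, v)]
    scores.append(sqrt(dot(resid, resid)))
```

The gibberish document lies outside the dominant topic direction, so almost none of it is captured by the low-rank part and its residual norm, i.e. its outlier score, is by far the largest.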
  • 98. Outlier Detection for Text Data ▪ For the experiments, outliers are sampled from a unique class of standard datasets ▪ Receiver operating characteristic curves are studied to assess the performance of the proposed approach ▪ The approach is useful for identifying outliers even from regular classes ▪ Patterns such as unusually short/long documents, unique words, repetitive vocabulary etc. were observed in the detected outliers KDD Tutorial / © 2020 IBM Corporation [KWAP17] Source : Kannan et al, 2017. Outlier detection for text data
  • 99. Unsupervised Anomaly Detection in Text Multi-Head Self-Attention: KDD Tutorial / © 2020 IBM Corporation Sentence Embedding Attention Matrix ▪ Proposes a novel one-class classification method which leverages pretrained word embeddings to perform anomaly detection on text ▪ Given a word embedding matrix H, multi-head self-attention is used to map sentences of variable length to a collection of fixed-size representations, representing a sentence with respect to multiple contexts Source : Ruff et al, 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text [RZV+19]
  • 100. Unsupervised Anomaly Detection on Text KDD Tutorial / © 2020 IBM Corporation CVDD Objective Context Vector Orthogonality Constraint ▪ These sentence representations are trained along with a collection of context vectors such that the context vectors and representations are similar, while keeping the context vectors diverse ▪ The greater the distance of m_k(H) from c_k, the more anomalous the sample w.r.t. context k Outlier Scoring [RZV+19] Source : Ruff et al, 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text
  • 101. More Generic Text Quality Approaches Local Text Coherence: ▪ [MS18] Mesgar et al. "A neural local coherence model for text quality assessment." EMNLP, 2018. Text Formality: ▪ [JMAS19] Jain et al. "Unsupervised controllable text formalization." AAAI, 2019. Text Bias: ▪ [RDNMJ13] Recasens et al. "Linguistic models for analyzing and detecting biased language." ACL, 2013. KDD Tutorial / © 2020 IBM Corporation
  • 102. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 103. Identifying Non-Obvious Cases in Semantic Similarity Datasets Objective: ▪ Identify obvious and non-obvious text pairs in semantic similarity datasets Approach: 1. Compare the Jensen-Shannon divergence (JSD) of each text pair against the median JSD of the entire distribution 2. Classify the text pairs based on their divergence from the median Task: ▪ Analyze model performance on SemEval Community Question Answering for both obvious and non-obvious cases KDD Tutorial / © 2020 IBM Corporation Source : Peinelt et al 2019. Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets [PLN19]
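A sketch of the JSD computation over unigram distributions, with a hypothetical `is_obvious` helper implementing the median split (the helper's name and threshold handling are ours, not the paper's):

```python
from collections import Counter
from math import log2

def distribution(text):
    # unigram probability distribution over the tokens of a text
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two unigram distributions."""
    m = {w: 0.5 * (p.get(w, 0) + q.get(w, 0)) for w in set(p) | set(q)}
    def kl(a, b):
        return sum(pa * log2(pa / b[w]) for w, pa in a.items() if pa > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_obvious(text1, text2, label, median_jsd):
    """Heuristic split: lexically similar positives and lexically
    divergent negatives are the 'obvious' cases."""
    d = jsd(distribution(text1), distribution(text2))
    return (label == 1 and d < median_jsd) or (label == 0 and d >= median_jsd)
```

Identical texts give a JSD of 0 and fully disjoint texts give 1 (in base 2), so the median of the pairwise JSDs is a natural dataset-specific split point.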
  • 104. Identifying Non-Obvious Cases in Semantic Similarity Datasets KDD Tutorial / © 2020 IBM Corporation Lexical divergence distribution by labels across datasets: ▪ The datasets differ with respect to the degree of lexical divergence they contain ▪ Pairs with negative labels tend to have higher JSD scores than pairs with positive labels ▪ Model performance (TPR/TNR/F1) should be reported separately for obvious and non-obvious cases, since models typically score lower on the non-obvious ones Examples of Obvious and Non-Obvious Semantic Text Pairs: [PLN19] Source : Peinelt et al 2019. Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets
  • 105. Improving Neural Conversational Models with Entropy-Based Data Filtering Objective: ▪ Filter generic utterances from dialog datasets to improve model performance Approach: ▪ Devise a data filtering method that identifies open-ended inputs by calculating the entropy of the distribution over target utterances Task: ▪ Measure custom metric scores such as response length, KL divergence, coherence, BLEU score etc. on standard dialog datasets under source/target/both filtering mechanisms KDD Tutorial / © 2020 IBM Corporation Source: Csaky et al, 2019. Improving neural conversational models with entropy-based data filtering [CPR19]
  • 106. Improving Neural Conversational Models with Entropy-Based Data Filtering KDD Tutorial / © 2020 IBM Corporation Examples: 1. Source = “What did you do today?” • has high entropy, since it is paired with many different responses in the data 2. Source = “What color is the sky?” • has low entropy since it’s observed with few responses in the data 3. Target = “I don’t know” • has high entropy, since it is paired with many different inputs in the data 4. Target = “Try not to take on more than you can handle” • has low entropy, since it is paired with few inputs in the data [CPR19] Source: Csaky et al, 2019. Improving neural conversational models with entropy-based data filtering
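The entropy computation behind these examples is straightforward; a sketch over a toy set of source-target pairs mirroring the slide (the threshold value is ours, for illustration):

```python
from collections import Counter
from math import log2

pairs = [
    ("what did you do today", "went hiking"),
    ("what did you do today", "nothing much"),
    ("what did you do today", "watched a movie"),
    ("what color is the sky", "blue"),
    ("what color is the sky", "blue"),
    ("what color is the sky", "blue"),
]

def source_entropy(pairs):
    """Entropy of the target distribution observed for each source utterance."""
    by_source = {}
    for src, tgt in pairs:
        by_source.setdefault(src, []).append(tgt)
    ent = {}
    for src, targets in by_source.items():
        counts = Counter(targets)
        n = len(targets)
        ent[src] = -sum(c / n * log2(c / n) for c in counts.values())
    return ent

def filter_open_ended(pairs, threshold=1.0):
    # drop pairs whose source is open-ended (high target entropy)
    ent = source_entropy(pairs)
    return [(s, t) for s, t in pairs if ent[s] < threshold]
```

"what did you do today" has three distinct responses (entropy log2 3 ≈ 1.58) and gets filtered, while "what color is the sky" always maps to "blue" (entropy 0) and is kept.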
  • 107. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 108. Understanding the Difficulty of Text Classification Tasks Proposes an approach to design a data quality metric which explains the complexity of the given data for a classification task ▪ Considers various data characteristics to generate a 48-dim feature vector for each dataset ▪ Data characteristics include ▪ Class Diversity: count-based probability distribution of the classes in the dataset ▪ Class Imbalance: Σ_{c=1}^{C} | 1/C − n_c/T_data | ▪ Class Interference: similarities among samples belonging to different classes ▪ Data Complexity: linguistic properties of the data samples KDD Tutorial / © 2020 IBM Corporation Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks [CRZ18] The feature vector for a given dataset covers quality properties such as ▪ Class Diversity (2-dim) ▪ Shannon Class Diversity ▪ Shannon Class Equitability ▪ Class Imbalance (1-dim) ▪ Class Interference (24-dim) ▪ Hellinger Similarity ▪ Top N-Gram Interference ▪ Mutual Information ▪ Data Complexity (21-dim) ▪ Distinct n-gram : Total n-gram ▪ Inverse Flesch Reading Ease ▪ N-Gram and Character Diversity
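The class imbalance and class diversity features are easy to reproduce; a sketch using the formula above, with C classes, per-class counts n_c, and T as the dataset size:

```python
from collections import Counter
from math import log2

def class_imbalance(labels):
    """Sum over classes of |1/C - n_c/T|; 0 for perfectly balanced data."""
    counts = Counter(labels)
    C, T = len(counts), len(labels)
    return sum(abs(1 / C - n / T) for n in counts.values())

def shannon_class_diversity(labels):
    """Shannon entropy of the class distribution (in bits)."""
    counts = Counter(labels)
    T = len(labels)
    return -sum(n / T * log2(n / T) for n in counts.values())
```

A perfectly balanced binary dataset scores 0 imbalance and 1 bit of class diversity; a 9:1 split scores 0.8 imbalance, matching the intuition that skewed datasets are harder.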
  • 109. Understanding the Difficulty of Text Classification Tasks ▪ The authors propose using genetic algorithms to intelligently explore the 2^48 possible feature combinations ▪ The fitness function for the genetic algorithm was the Pearson correlation between the difficulty score and model performance on the test set ▪ 89 datasets were considered for evaluation, with 12 different types of models on each dataset ▪ The effectiveness of a given combination of metrics is measured by its correlation with the performance of various models on various datasets ▪ The stronger the negative correlation of a metric with model performance, the better the metric explains data complexity KDD Tutorial / © 2020 IBM Corporation [CRZ18] Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
  • 110. Understanding the Difficulty of Text Classification Tasks KDD Tutorial / © 2020 IBM Corporation Difficulty Measure D2 = Distinct Unigrams : Total Unigrams + Class Imbalance + Class Diversity + Maximum Unigram Hellinger Similarity + Unigram Mutual Info. Correlation = −0.8814 [CRZ18] Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
  • 111. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 112. Next Steps KDD Tutorial / © 2020 IBM Corporation How to assess overall text quality? ▪ A framework that allows the user to assess text quality across various dimensions ▪ A standardized set of quality metrics, each outputting a score indicating low/high quality ▪ Insights into the specific samples in the dataset which contribute to a low/high score ▪ Specific recommendations for addressing poor text quality, with evidence of model performance improvement
  • 113. Next Steps KDD Tutorial / © 2020 IBM Corporation How to assess quality of NLP models independent of task? ▪ Held-out datasets are often not comprehensive ▪ Contain the same biases as the training data ▪ Real-world performance may be overestimated ▪ Difficult to figure out where model is failing and how to fix it [RWGS20] Source : Ribeiro et al, 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
  • 114. Human In Loop Techniques for Data Quality Assessments KDD Tutorial / © 2020 IBM Corporation
  • 115. Humans in the Loop ▪ Human intervention in data quality analysis ▪ Validation and feedback ▪ Possible enhancements KDD Tutorial / © 2020 IBM Corporation ▪ Data readiness assessment ▪ Metrics for structured data ▪ Metrics for unstructured data
  • 116. Human Input in the AI Lifecycle KDD Tutorial / © 2020 IBM Corporation
  • 117. Supporting Data Operations Keywords: ▪ Preparation ▪ Integration ▪ Cleaning ▪ Curation ▪ Transformation ▪ Manipulation Notable tools: ▪ Trifacta ~ Potter’s Wheel ▪ OpenRefine ▪ Wrangler ➢ How to validate that data is clean enough ➢ How to delegate sub-tasks between respective role players / data workers ➢ How to enable reuse of human input and codifying tacit knowledge ➢ How to provide cognitive and not just statistical support KDD Tutorial / © 2020 IBM Corporation [Ham13] [KPHH11] [RH01]
  • 118. Example: Label Noise Remediation Identify noisy samples • Check target columns/attributes to detect suspected noisy or anomalous labels • E.g. a target column with entries (A, B, Label): (Abc, def, 1), (Abc, def, 2), (Abc, def, 1), (Abc, def, 1) Visual preview of noisy data • Sampling to show a presentable number of records in batches > Show suspected noisy class labels and predicted labels in the preview, e.g. (A, B, Label, Predicted): (Abc, def, 1, 1), (Abc, def, 2, 1), (Abc, def, 1, 1), (Abc, def, 1, 1) Remediation input from HIL • Review • Correct • Approve > HIL reviews whether a row is noisy, verifies whether the predicted label is correct, and additionally marks the corresponding pertinent columns Document and preserve • Learn operations for future recommendations to reduce HIL effort and enable reuse/transfer of knowledge > In future, better prediction using pertinent columns and user actions
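The first step, flagging suspect labels among identical feature rows, can be sketched as a majority-vote check mirroring the (Abc, def) example above (the helper name and the use of majority vote as the "predicted" label are ours, for illustration):

```python
from collections import Counter

rows = [
    ("Abc", "def", 1),
    ("Abc", "def", 2),   # suspect: disagrees with its duplicates
    ("Abc", "def", 1),
    ("Abc", "def", 1),
]

def suspect_noisy(rows):
    """Flag rows whose label disagrees with the majority label of
    identical feature rows; propose the majority label as predicted."""
    groups = {}
    for *features, label in rows:
        groups.setdefault(tuple(features), Counter())[label] += 1
    flagged = []
    for i, (*features, label) in enumerate(rows):
        majority = groups[tuple(features)].most_common(1)[0][0]
        if label != majority:
            flagged.append((i, label, majority))
    return flagged
```

The flagged (row index, given label, predicted label) triples are exactly what the visual preview step would surface to the HIL reviewer for approval or correction.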
  • 119. Example: Data Homogeneity and Transformation Identify heterogeneous samples • Check target columns/attributes to detect variation in sample data • E.g. a target column with the entries: <=50K, <=50K., >50K, >50K. • E.g. a gender column with the entries: male, M, female, F Visual preview of problematic data • Sampling to show a presentable number • Give explanation • Viewing options to reduce human effort > Show unique class labels and ask for correction/verification > Show distinct formats and ask for correction/verification Transformation input from HIL • Review • Correct • Approve > HIL sees the anomaly in the ‘.’ appearing at the end and provides the corresponding transformation > Rectify heterogeneous data formats and specify a rule for the transformation Document and preserve • Learn operations for future recommendations to reduce HIL effort and enable reuse/transfer of knowledge > In future, suggest remedial actions > Suggest possible remediations or rules
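The remediations in this example boil down to small, HIL-confirmed normalization rules; a sketch (the rule table and function names are illustrative, of the kind a reviewer might approve and the system might preserve for reuse):

```python
# Hand-written normalization rules a HIL reviewer might confirm:
# strip the trailing '.' and map surface variants to one canonical form.
RULES = {
    "gender": {"male": "M", "m": "M", "female": "F", "f": "F"},
}

def normalize_label(value):
    # '<=50K.' -> '<=50K', '>50K.' -> '>50K'
    return value.rstrip(".")

def normalize_gender(value):
    key = value.strip().lower()
    return RULES["gender"].get(key, value)  # leave unknown values untouched
```

Once documented, such rules are exactly the "remediations or rules" the system can suggest automatically the next time the same heterogeneity pattern appears.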
  • 120. KDD Tutorial / © 2020 IBM Corporation Next Steps… Shared collaborative workspace to streamline, regulate and document human input/expertise; focus on reproducibility, traceability, and verifiability Easy-to-use specification interface: e.g. graphical, by example, or in natural language instead of complex grammars or regular expressions Visual preview: appropriate visualization of relevant data chunks to see the changes Minimum latency: the effect of a transformation should be quickly visible Clear lineage: ability to view what changes have been done previously and by whom; documentation of operations and observed effects Enable assessment of data operations: basic analysis using default ML models to show the impact of changes Data versioning: to enable experiments for trial and testing, e.g. the ability to undo changes Streamlined collaboration: task lists based on persona for a shared data-cleaning pipeline [MLW+19] [KHFW16] [ROE+19]
  • 121. Summary ▪ Addressing data quality issues before data enters an ML pipeline allows taking remedial actions that reduce model building effort and turn-around time. ▪ Measuring data quality in a systematic and objective manner, for example through standardized metrics, can improve its reliability and permit informed reuse. ▪ There are distinct metrics to quantify data issues and anomalies depending on whether the data is structured or unstructured. ▪ Various data workers or personas are involved in the data quality assessment pipeline depending on the level and type of human intervention required. ▪ Streamlining the collaboration between standardized metrics and human-in-the-loop inputs can serve as a powerful framework to integrate and optimize the data quality assessment process. KDD Tutorial / © 2020 IBM Corporation
  • 122. We invite you to join us on this agenda towards a data-centric approach to analyzing data quality. Contact Hima Patel with your ideas and enquiries. KDD Tutorial / © 2020 IBM Corporation
  • 123. References KDD Tutorial / © 2020 IBM Corporation [BF99] Carla E Brodley and Mark A Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999. [ZWC03] Xingquan Zhu, Xindong Wu, and Qijun Chen. Eliminating class noise in large datasets. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 920–927, 2003. [DLG17] Junhua Ding, XinChuan Li, and Venkat N Gudivada. Augmentation and evaluation of training data for deep learning. In 2017 IEEE International Conference on Big Data (Big Data), pages 2603–2611. IEEE, 2017. [ARK18] Mohammed Al-Rawi and Dimosthenis Karatzas. On the labeling correctness in computer vision datasets. In IAL@PKDD/ECML, 2018. [NJC19] Curtis G Northcutt, Lu Jiang, and Isaac L Chuang. Confident learning: Estimating uncertainty in dataset labels. arXiv preprint arXiv:1911.00068, 2019. [Coo77] R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 1977. [EGH17] Rajmadhan Ekambaram, Dmitry B Goldgof, and Lawrence O Hall. Finding label noise examples in large scale datasets. In 2017 IEEE International Conference on Systems, Man, and Cybernetics, pages 2420–2424, 2017.
  • 124. References KDD Tutorial / © 2020 IBM Corporation [GZ19] Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868, 2019. [YAP19] Jinsung Yoon, Sercan O Arik, and Tomas Pfister. Data valuation using reinforcement learning. arXiv preprint arXiv:1909.11671, 2019. [DKO+07] Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, and Suresh Venkatasubramanian. Column heterogeneity as a measure of data quality. 2007. [sim] http://www.countrysideinfo.co.uk/simpsons.htm. [DHI12] AnHai Doan, Alon Halevy, and Zachary Ives. Principles of data integration. Elsevier, 2012. [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. [ALI15] Aida Ali, Siti Mariyam Hj. Shamsuddin, and Anca L. Ralescu. Classification with class imbalance problem: A review. SOCO, 2015.
  • 125. References KDD Tutorial / © 2020 IBM Corporation [PG15] Oleksandr Polozov and Sumit Gulwani. FlashMeta: a framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 107–126, 2015. [DUB+17] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 990–998, 2017. [KMP+18] Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, and Sumit Gulwani. Neural-guided deductive search for real-time program synthesis from examples. arXiv preprint arXiv:1804.01186, 2018. [AYR16] Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. Automatic text scoring using neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 715–725, Berlin, Germany, August 2016. Association for Computational Linguistics. [CPR19] Richard Csaky, Patrik Purgai, and Gabor Recski. Improving neural conversational models with entropy-based data filtering. arXiv preprint arXiv:1905.05471, 2019.
  • 126. References KDD Tutorial / © 2020 IBM Corporation [CRZ18] Edward Collins, Nikolai Rozanov, and Bingbing Zhang. Evolutionary data measures: Understanding the difficulty of text classification tasks. arXiv preprint arXiv:1811.01910, 2018. [JMAS19] Parag Jain, Abhijit Mishra, Amar Prakash Azad, and Karthik Sankaranarayanan. Unsupervised controllable text formalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6554–6561, 2019. [KWAP17] Ramakrishnan Kannan, Hyenkyun Woo, Charu C Aggarwal, and Haesun Park. Outlier detection for text data. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 489–497. SIAM, 2017. [MS18] Mohsen Mesgar and Michael Strube. A neural local coherence model for text quality assessment. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4328–4339, 2018. [ÖG17] Robert Östling and Gintarė Grigonytė. Transparent text quality assessment with convolutional neural networks. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 282–286, 2017.
  • 127. References
[PLN19] Nicole Peinelt, Maria Liakata, and Dong Nguyen. Aiming beyond the obvious: Identifying non-obvious cases in semantic similarity datasets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2792–2798, 2019.
[RDNMJ13] Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1650–1659, 2013.
[RWGS20] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118, 2020.
[RZV+19] Lukas Ruff, Yury Zemlyanskiy, Robert Vandermeulen, Thomas Schnake, and Marius Kloft. Self-attentive, multi-context one-class classification for unsupervised anomaly detection on text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4061–4071, 2019.
[RH01] Vijayshankar Raman and Joseph M Hellerstein. Potter's Wheel: an interactive data cleaning system. In VLDB, volume 1, pages 381–390, 2001.
[ROE+19] El Kindi Rezig, Mourad Ouzzani, Ahmed K Elmagarmid, Walid G Aref, and Michael Stonebraker. Towards an end-to-end human-centric data cleaning framework. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–7, 2019.
  • 128. References
[Ham13] Kelli Ham. OpenRefine (version 2.5). http://openrefine.org. Free, open-source tool for cleaning and transforming data. Journal of the Medical Library Association: JMLA, 101(3):233, 2013.
[KHFW16] Sanjay Krishnan, Daniel Haas, Michael J Franklin, and Eugene Wu. Towards reliable interactive data cleaning: A user survey and recommendations. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–5, 2016.
[KPHH11] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. Wrangler: Interactive visual specification of data transformation scripts. In ACM Human Factors in Computing Systems (CHI), 2011.
[MLW+19] Michael Muller, Ingrid Lange, Dakuo Wang, David Piorkowski, Jason Tsay, Q Vera Liao, Casey Dugan, and Thomas Erickson. How data science workers work with data: Discovery, capture, curation, design, creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2019.
[BTR15] Paula Branco, Luis Torgo, and Rita Ribeiro. A survey of predictive modelling under imbalanced distributions. arXiv preprint arXiv:1505.01658, 2015.
  • 129. References
[DT10] Misha Denil and Thomas Trappenberg. Overlap versus imbalance. In Canadian Conference on Artificial Intelligence, pages 220–231. Springer, 2010.
[GBSW04] Jerzy W Grzymala-Busse, Jerzy Stefanowski, and Szymon Wilk. A comparison of two approaches to data mining from imbalanced data. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 757–763. Springer, 2004.
[Jap03] Nathalie Japkowicz. Class imbalances: are we focusing on the right issue? 2003.
[JJ04] Taeho Jo and Nathalie Japkowicz. Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1):40–49, 2004.
[Kra16] Bartosz Krawczyk. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 2016.
[LCT19] Yang Lu, Yiu-ming Cheung, and Yuan Yan Tang. Bayes imbalance impact index: A measure of class imbalanced data set for classification problem. IEEE Transactions on Neural Networks and Learning Systems, 2019.
  • 130. References
[WP03] Gary M Weiss and Foster Provost. Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315–354, 2003.